ktx ingest

ktx ingest builds or refreshes ktx context from configured connections, and can also capture free-form text into ktx memory. Database connections build enriched context — schema plus AI-generated descriptions, embeddings, and relationship evidence — and require a configured model and embeddings. Context-source connections ingest metadata from tools such as dbt, Looker, Metabase, MetricFlow, LookML, Notion, and Sigma. Pass --text or --file to capture inline text or text files into memory instead.

Command signature

ktx ingest [options] [connectionId]

Bare ktx ingest (no positional connectionId, no --all) ingests every configured connection.
ktx ingest <connectionId> ingests one configured connection.
ktx ingest --text "..." (or --file <path>) captures notes into ktx memory instead of ingesting a connection.

Database connections run before context-source connections when more than one connection is selected.

Options

Flag	Description	Default
`--all`	Ingest all configured connections (same as bare invocation)	`false`
`--query-history`	Include database query-history usage patterns	Stored connection default
`--no-query-history`	Skip database query-history usage patterns for this run	Stored connection default
`--query-history-window-days <days>`	BigQuery/Snowflake query-history lookback window for this run	Stored connection default
`--stages <list>`	Comma-separated enrichment stages to (re)run: `descriptions`, `embeddings`, `relationships`	All three
`--text <content>`	Capture inline text into ktx memory; repeatable	`[]`
`--file <path>`	Capture a text file into ktx memory; use `-` for stdin; repeatable	`[]`
`--verbatim`	Store each `--text`/`--file` document body unchanged as a `GLOBAL` wiki page; the LLM derives metadata only	`false`
`--connection-id <connectionId>`	ktx connection id to tag captured text/file notes	-
`--user-id <id>`	Memory user id for text/file capture attribution	`local-cli`
`--fail-fast`	Stop after the first failed text/file item	`false`
`--plain`	Print plain text output	`true`
`--json`	Print JSON output	`false`
`--yes`	Install required managed runtime features without prompting	`false`
`--no-input`	Disable interactive terminal input	-

Database ingest always builds enriched context and requires a configured model and embeddings (run ktx setup); connections without that configuration fail before any work starts. Query-history flags apply only to database connections that support query history. The window flag applies to BigQuery and Snowflake; Postgres reads the current pg_stat_statements aggregate data instead of a time-windowed history table. Query-history ingest runs after the schema scan.

When more than one connection is selected, database ingest runs first, then context-source ingest and memory updates run for context-source connections.

Some ingest paths use the managed ktx Python runtime. Query-history ingest uses it for SQL analysis, and Looker context-source ingest uses it for Looker identifier parsing. In an interactive terminal, ktx ingest prompts before installing the required runtime features. Use --yes to install them without prompting, or use --no-input to fail fast with install guidance.

--text and --file cannot be combined with a positional connectionId or --all; pass --connection-id <id> instead to tag captured notes.

Verbatim ingest

By default, captured text is routed through the memory agent, which decides what to persist and may rewrite, condense, split, or re-title it. For authoritative documents — metric definitions, formula specs, runbooks, compliance text — that paraphrasing is a defect. Add --verbatim to store each --text/--file document body unchanged as a GLOBAL wiki page:

The stored body is the input document, written by code; the LLM never edits it. The LLM is used only to derive page metadata (summary, tags, sl_refs), and even that is skipped for fields the document's own frontmatter already sets.
The page key is deterministic: --file derives it from the filename, and inline --text derives it from the document's leading Markdown heading (inline text without a heading is rejected; pass it as --file instead).
Ingest is idempotent. Re-running the same document is a safe no-op; a different body at the same key fails loudly rather than overwriting.
--verbatim works with llm.provider.backend: none — the only ingest path that does. With no backend the summary is derived from the heading or first sentence and tags/sl_refs are left empty; the full body is still stored.
Existing frontmatter passes through untouched (including fields ktx does not model, such as effective_date or version); generated metadata only fills absent fields. --connection-id <id> scopes the page to that connection by setting its connections frontmatter.
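
Because inline verbatim text without a leading heading is rejected, a script can check for one before choosing between --text and --file. The following is a minimal sketch, not part of ktx: has_leading_heading is a hypothetical helper, and it assumes that a "leading Markdown heading" means an ATX heading (one to six # characters followed by a space) on the first non-blank line. The rule ktx actually applies may differ.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; not part of ktx.
# Assumption: a leading heading is an ATX heading ('#' x1-6, then a space)
# on the first non-blank line of the text.
has_leading_heading() {
  local first_line
  local re='^#{1,6}[[:space:]]'
  # Take the first line that contains a non-whitespace character.
  first_line=$(printf '%s\n' "$1" | grep -m1 '[^[:space:]]' || true)
  [[ "$first_line" =~ $re ]]
}

note=$'# Net revenue\n\nRefunds are excluded from net revenue.'
if has_leading_heading "$note"; then
  echo "ok: pass as --text --verbatim"
else
  echo "no heading: save to a file and pass --file instead"
fi
```

The check is deliberately strict, so it may reject text that ktx would accept; ktx's own validation remains the source of truth.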

Selecting enrichment stages

Database enrichment runs three stages: descriptions (one LLM call per table), embeddings (vectors over the schema and descriptions), and relationships (join detection, optionally LLM-proposed). Each stage is cached on a per-stage hash of only its own inputs, so changing one stage's inputs invalidates only that stage. Switching the description LLM re-runs only descriptions; upgrading the embeddings model re-runs only embeddings; turning on scan.relationships.llmProposals re-runs only relationships. The expensive per-table descriptions are never thrown away because an unrelated setting moved.

--stages <list> re-runs a chosen subset on an already-ingested connection. A named stage is force-recomputed (it bypasses the completed-stage cache), while unselected stages are left exactly as they are on disk:

ktx ingest warehouse --stages embeddings — re-embed on a new model, keeping descriptions and joins.
ktx ingest --all --stages relationships --no-query-history — backfill joins across every database after enabling llmProposals, without re-paying for descriptions.
ktx ingest warehouse --stages descriptions — re-run thin descriptions (for example after raising KTX_ENRICH_LLM_TIMEOUT_MS). When nothing the descriptions depend on changed, the per-table resume record means only the tables that previously failed are re-sent to the LLM.

Stage names are validated: an unknown or empty name (--stages foo, --stages descriptions,foo, --stages "") is a hard parse error. Naming all three (--stages descriptions,embeddings,relationships) forces a full enrichment recompute, which is not the same as omitting the flag (omitting resumes whatever is already done). After a selective run, ktx warns (enrichment_stage_stale) when an unselected stage's inputs no longer match what it was last built from — for example, re-running descriptions flags embeddings as stale until you re-run --stages embeddings. The warning is informational; ktx never silently cascades the extra work.
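
A wrapper script can mirror the stage-name rules above and fail before ktx is invoked. This is a sketch under stated assumptions: validate_stages is a hypothetical helper, and it treats names strictly (no surrounding whitespace, no trailing comma), which may be stricter than ktx's own parser.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check that mirrors the documented --stages rules.
# ktx still performs its own validation; this only fails earlier in a script.
validate_stages() {
  local list=$1 stage
  local -a stages
  # Reject an empty list and a trailing comma (which would hide an empty name).
  [[ -n "$list" && "$list" != *, ]] || return 1
  IFS=',' read -ra stages <<< "$list"
  for stage in "${stages[@]}"; do
    case "$stage" in
      descriptions|embeddings|relationships) ;;
      *) return 1 ;;  # unknown or empty name
    esac
  done
  return 0
}

stages="embeddings,relationships"
if validate_stages "$stages"; then
  echo "would run: ktx ingest warehouse --stages $stages"
else
  echo "invalid --stages value: $stages" >&2
fi
```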

Examples

# Build every configured connection (bare = --all)
ktx ingest

# Build one database or context-source connection
ktx ingest warehouse

# Include query-history usage patterns
ktx ingest warehouse --query-history
# Set the lookback window for BigQuery or Snowflake query history
ktx ingest warehouse --query-history-window-days 30

# Re-embed one connection on a new embeddings model (descriptions/joins untouched)
ktx ingest warehouse --stages embeddings
# Backfill LLM-proposed joins across every database without re-describing
ktx ingest --all --stages relationships --no-query-history

# Build a context-source connection
ktx ingest notion

# Capture inline text into memory
ktx ingest --text "Refunds are excluded from net revenue."

# Capture multiple text snippets in one call
ktx ingest --text "Revenue is gross receipts." --text "Orders are completed purchases."

# Capture a local Markdown file into memory and tag it to a connection
ktx ingest --file docs/revenue-notes.md --connection-id warehouse

# Capture one stdin item
printf '%s' "Refunds are excluded from net revenue." | ktx ingest --file -

# Store an authoritative document verbatim (body preserved exactly)
ktx ingest --file docs/rfm-bucket-definitions.md --verbatim

# Store it verbatim and scope it to one connection
ktx ingest --file docs/haversine-formula.md --verbatim --connection-id warehouse

Output

Plain output summarizes each target and the operations that ran.

Ingest finished

Source         Database schema  Query history  Source ingest  Memory update
warehouse      done             done           skipped        skipped
notion         skipped          skipped        done           done

Use --json when a script or agent needs the selected plan and per-target results.

Final validation pruning

At the end of a context-source ingest, ktx validates the composed semantic layer and wiki before saving it. If the final validation finds dangling references, ktx removes them instead of failing work that was already accepted. Pruning can remove joins that point at missing semantic sources, wiki refs, wiki sl_refs, and inline wiki body references. If a generated semantic source is invalid, ktx drops that source from the final save.

The stored ingest report records these changes as finalGatePrunedReferences and finalGateDroppedSources. The trace emits final_gate_reference_pruned, final_gate_source_dropped, final_gate_prune_committed, and final_gate_prune_finished events when pruning runs. If validation still fails after pruning, the ingest fails and the report keeps the final validation error.
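
To see what the final gate pruned in a given run, the trace can be filtered down to these events. The sketch below assumes jq is installed and that each trace line carries the event name in an event field; final_gate_events is a hypothetical helper name, not a ktx command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only final-gate pruning events from a trace file.
# Assumes each JSONL line stores the event name in an "event" field.
final_gate_events() {
  jq -c 'select((.event // "") | startswith("final_gate_")) | {phase, event, data}' "$1"
}

# Usage (not run here; substitute a real job id):
#   final_gate_events .ktx/ingest-traces/JOB_ID/trace.jsonl
```

An empty result means no pruning events were recorded in that trace.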

Inspect context-source ingest traces

Context-source ingest writes persistent JSONL traces for postmortem debugging. Plain ingest output prints the trace path near the report, run, and job identifiers when a trace is available:

Report: report-abc123
Run: run-abc123
Job: job-abc123
Trace: .ktx/ingest-traces/job-abc123/trace.jsonl

The trace file lives under the project directory at .ktx/ingest-traces/<jobId>/trace.jsonl. Each line is a JSON event with the job id, run id, sync id, connection id, source key, phase, event name, timing, state snapshot, decision context, and error details. Failed runs also write a stored ingest report with status: "failed", failure.phase, failure.message, and the same trace path.

Use jq or line-oriented tools to inspect a trace:

jq -c '{at, level, phase, event, durationMs, data, error}' \
  .ktx/ingest-traces/<jobId>/trace.jsonl

ktx writes trace events at the debug level by default. To change the verbosity, set KTX_INGEST_TRACE_LEVEL to error, info, debug, or trace before running ingest:

KTX_INGEST_TRACE_LEVEL=trace ktx ingest metabase
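
A saved trace can also be summarized offline. The sketch below sums durationMs per phase with jq; phase_totals is a hypothetical helper, and it assumes that phase and durationMs are top-level fields on each event. Totals may double-count if a phase-level event and the work units nested inside it both record a duration, so treat the numbers as a rough guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total recorded durationMs per phase, largest first.
# Events without a durationMs field are ignored.
phase_totals() {
  jq -s -c '
    map(select(.durationMs != null))
    | group_by(.phase)
    | map({phase: .[0].phase, totalMs: (map(.durationMs) | add)})
    | sort_by(-.totalMs)
    | .[]
  ' "$1"
}

# Usage (not run here; substitute a real job id):
#   phase_totals .ktx/ingest-traces/JOB_ID/trace.jsonl
```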

Profiling a slow ingest

Each timed phase and work unit records a durationMs in the trace, and each agent loop records its step count and token usage. To see where wall-clock time went, enable profiling: ktx then prints a rolled-up breakdown to stderr at the end of the run. Profiling can be turned on in two ways and has two output formats.

Turn it on per run with the KTX_PROFILE_INGEST environment variable, or persistently with ingest.profile in ktx.yaml (useful for CI or while iterating on a slow source):

KTX_PROFILE_INGEST=1 ktx ingest metabase       # human-readable table
KTX_PROFILE_INGEST=json ktx ingest metabase    # raw JSON for coding agents

ingest:
  profile: true   # human table; use "json" for the machine-readable form

Both formats report total wall time, time per phase, and the slowest work units, splitting each work unit's agent-loop time into model time versus tool-execution time. The json form emits the full structured profile (raw milliseconds and token counts, stable keys) plus a summary.headline one-line diagnosis, so a coding agent can parse it directly instead of scraping the table. If both the env var and the config request profiling, json wins. Example headline:

Slowest phase: reconciliation (2m 05s, 48% of wall time). 2 work units (1 failed), ~88% model generation vs ~12% tools.
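
For scripts, the headline can be pulled out of a saved JSON profile with jq. This is a minimal sketch: profile_headline is a hypothetical helper, and it assumes the captured stderr contains only the JSON profile. If ktx writes other messages to stderr in the same run, the file will need filtering first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the one-line diagnosis from a saved JSON profile.
# Assumes the file holds exactly one JSON document (the profile) and nothing else.
profile_headline() {
  jq -r '.summary.headline' "$1"
}

# Usage (not run here):
#   KTX_PROFILE_INGEST=json ktx ingest metabase 2> profile.json
#   profile_headline profile.json
```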

Work units run serially by default (ingest.workUnits.maxConcurrency is 1); raise it in ktx.yaml if the profile shows the run is bound by serialized work-unit agent loops. If the provider reports an LLM rate limit, ktx shows a transient wait message and temporarily reduces effective work-unit concurrency according to ingest.rateLimit.

Common errors

Error	Cause	Recovery
Connection not configured	The connection id is not present in `ktx.yaml`	Add the connection with `ktx setup` or update `ktx.yaml`
Enrichment is not configured	Database ingest needs a model, embeddings, and scan-enrichment configuration	Run `ktx setup` to configure a model and embeddings
Query history is unsupported	The selected database driver does not support query history	Run ingest without query-history flags
Python runtime is missing	The selected ingest target needs runtime-backed SQL analysis or source parsing	Accept the interactive prompt, rerun with `--yes`, or run the suggested `ktx admin runtime install` command
Context-source options were ignored	Query-history flags were supplied for a context-source connection	Omit database-only flags when ingesting context-source connections
Text ingest stops early	`--fail-fast` was used and one item failed	Fix the failed item or rerun without `--fail-fast` to collect all failures
`--verbatim requires --text or --file`	`--verbatim` was passed without a document to store	Add `--text` or `--file`, or drop `--verbatim`
Inline verbatim text needs a leading heading	`--text --verbatim` content has no `# Heading` to derive a stable key	Add a leading Markdown heading, or pass the content as `--file <path>`
A different page already exists at key	A verbatim re-run targeted an existing key with a different body	Use a distinct document name/key, or remove the existing page first
Connection scope conflict	Frontmatter `connections` disagrees with `--connection-id`	Remove one so the intended scope is unambiguous
