ktx ingest
Build or refresh ktx context, or capture text into ktx memory.
ktx ingest builds or refreshes ktx context from configured connections, and
can also capture free-form text into ktx memory. Database connections build
enriched context — schema plus AI-generated descriptions, embeddings, and
relationship evidence — and require a configured model and embeddings.
Context-source connections ingest metadata from tools such as dbt, Looker,
Metabase, MetricFlow, LookML, Notion, and Sigma. Pass --text or --file to capture
inline text or text files into memory instead.
Command signature
ktx ingest [options] [connectionId]
- Bare ktx ingest (no positional, no --all) ingests every configured connection.
- ktx ingest <connectionId> ingests one configured connection.
- ktx ingest --text "..." (or --file <path>) captures notes into ktx memory instead of ingesting a connection.
Database connections run before context-source connections when more than one connection is selected.
Options
| Flag | Description | Default |
|---|---|---|
| --all | Ingest all configured connections (same as bare invocation) | false |
| --query-history | Include database query-history usage patterns | Stored connection default |
| --no-query-history | Skip database query-history usage patterns for this run | Stored connection default |
| --query-history-window-days <days> | BigQuery/Snowflake query-history lookback window for this run | Stored connection default |
| --stages <list> | Comma-separated enrichment stages to (re)run: descriptions, embeddings, relationships | All three |
| --text <content> | Capture inline text into ktx memory; repeatable | [] |
| --file <path> | Capture a text file into ktx memory; use - for stdin; repeatable | [] |
| --verbatim | Store each --text/--file document body unchanged as a GLOBAL wiki page; the LLM derives metadata only | false |
| --connection-id <connectionId> | ktx connection id to tag captured text/file notes | - |
| --user-id <id> | Memory user id for text/file capture attribution | local-cli |
| --fail-fast | Stop after the first failed text/file item | false |
| --plain | Print plain text output | true |
| --json | Print JSON output | false |
| --yes | Install required managed runtime features without prompting | false |
| --no-input | Disable interactive terminal input | - |
Database ingest always builds enriched context and requires a configured model
and embeddings (run ktx setup); connections without that configuration fail
before any work starts. Query-history flags apply only to database connections
that support query history. The window flag applies to BigQuery and Snowflake;
Postgres reads the current pg_stat_statements aggregate data instead of a
time-windowed history table. Query-history ingest runs after the schema scan.
When more than one connection is selected, database ingest runs first, then context-source ingest and memory updates run for context-source connections.
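The ordering rule above can be pictured as a stable sort over the selected connections. This is a minimal sketch, not ktx's implementation: the Connection type, its kind labels, and order_for_ingest are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Connection:
    id: str
    kind: str  # hypothetical labels: "database" or "context-source"


def order_for_ingest(selected: List[Connection]) -> List[Connection]:
    """Return the selection with database connections first."""
    # sorted() is stable, so connections of the same kind keep their
    # original relative order; False (database) sorts before True.
    return sorted(selected, key=lambda conn: conn.kind != "database")


selected = [
    Connection("notion", "context-source"),
    Connection("warehouse", "database"),
]
ordered_ids = [conn.id for conn in order_for_ingest(selected)]
print(ordered_ids)
```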
Some ingest paths use the managed ktx Python runtime. Query-history ingest uses
it for SQL analysis, and Looker context-source ingest uses it for Looker identifier
parsing. In an interactive terminal, ktx ingest prompts before installing the
required runtime features. Use --yes to install them without prompting, or
use --no-input to fail fast with install guidance.
--text and --file cannot be combined with a positional connectionId or
--all; pass --connection-id <id> instead to tag captured notes.
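The invocation rules described so far (bare run, one connection, or text/file capture, plus the combinations that are rejected) can be summarized in a small decision function. This is an illustrative sketch only: resolve_mode, UsageError, the mode names, and the error wording are assumptions, not ktx's real parser.

```python
from typing import List, Optional


class UsageError(ValueError):
    """Raised for flag combinations the text describes as invalid."""


def resolve_mode(
    connection_id: Optional[str] = None,
    all_flag: bool = False,
    texts: Optional[List[str]] = None,
    files: Optional[List[str]] = None,
    verbatim: bool = False,
) -> str:
    """Map parsed arguments to 'capture', 'ingest-one', or 'ingest-all'."""
    capturing = bool(texts) or bool(files)
    if capturing:
        # --text/--file cannot be combined with a positional id or --all.
        if connection_id is not None or all_flag:
            raise UsageError(
                "--text/--file cannot be combined with a positional "
                "connectionId or --all; use --connection-id <id>"
            )
        return "capture"
    if verbatim:
        raise UsageError("--verbatim requires --text or --file")
    # How ktx treats a positional id together with --all is not stated
    # in the text; this sketch simply lets the id win.
    if connection_id is not None:
        return "ingest-one"
    return "ingest-all"  # bare invocation behaves like --all
```

For example, resolve_mode(texts=["note"]) returns "capture", while resolve_mode(connection_id="warehouse", files=["a.md"]) raises UsageError.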
Verbatim ingest
By default, captured text is routed through the memory agent, which decides what
to persist and may rewrite, condense, split, or re-title it. For authoritative
documents — metric definitions, formula specs, runbooks, compliance text — that
paraphrasing is a defect. Add --verbatim to store each --text/--file
document body unchanged as a GLOBAL wiki page:
- The stored body is the input document, written by code; the LLM never edits it. The LLM is used only to derive page metadata (summary, tags, sl_refs), and even that is skipped for fields the document's own frontmatter already sets.
- The page key is deterministic: a --file derives it from the filename, inline --text from the document's leading Markdown heading (inline text without a heading is rejected; pass it as --file instead).
- Ingest is idempotent. Re-running the same document is a safe no-op; a different body at the same key fails loudly rather than overwriting.
- --verbatim works with llm.provider.backend: none, and it is the only ingest path that does. With no backend the summary is derived from the heading or first sentence and tags/sl_refs are left empty; the full body is still stored.
- Existing frontmatter passes through untouched (including fields ktx does not model, such as effective_date or version); generated metadata only fills absent fields.
- --connection-id <id> scopes the page to that connection by setting its connections frontmatter.
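The deterministic-key and idempotency rules above can be sketched as follows. This is a minimal illustration under stated assumptions: page_key, store_verbatim, the in-memory pages dictionary, and the slug format are hypothetical; the text only says the key comes from the filename or the leading heading, not how ktx normalizes it.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, Optional


def page_key(body: str, filename: Optional[str] = None) -> str:
    """Derive a stable key from the filename, else the leading heading."""
    if filename is not None:
        raw = PurePosixPath(filename).stem
    else:
        match = re.match(r"\s*#{1,6}\s+(.+)", body)
        if match is None:
            # Inline text without a heading has no stable key.
            raise ValueError("inline verbatim text needs a leading heading")
        raw = match.group(1)
    # Assumed slug rule: lowercase, non-alphanumerics collapsed to "-".
    return re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")


def store_verbatim(pages: Dict[str, str], key: str, body: str) -> str:
    """Store the body unchanged; repeat runs are no-ops, conflicts fail."""
    existing = pages.get(key)
    if existing is None:
        pages[key] = body
        return "created"
    if existing == body:
        return "unchanged"
    raise ValueError(f"a different page already exists at key {key!r}")
```

Calling store_verbatim twice with the same key and body returns "created" and then "unchanged"; a third call with a different body raises instead of overwriting.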
Selecting enrichment stages
Database enrichment runs three stages: descriptions (one LLM call per table),
embeddings (vectors over the schema and descriptions), and relationships
(join detection, optionally LLM-proposed). Each stage is cached on a per-stage
hash of only its own inputs, so changing one stage's inputs invalidates only
that stage. Switching the description LLM re-runs only descriptions; upgrading
the embeddings model re-runs only embeddings; turning on
scan.relationships.llmProposals re-runs only relationships. The expensive
per-table descriptions are never thrown away because an unrelated setting moved.
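The per-stage cache described above can be illustrated with one hash per stage, computed only from that stage's own inputs. This is a sketch under assumptions: the input dictionaries, stage_hash, and stages_to_run are hypothetical, and the real set of inputs ktx hashes for each stage is not listed here.

```python
import hashlib
import json
from typing import Dict, List


def stage_hash(inputs: dict) -> str:
    """Hash a stage's inputs in a key-order-independent way."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stages_to_run(stage_inputs: Dict[str, dict],
                  cached: Dict[str, str]) -> List[str]:
    """Return the stages whose input hash no longer matches the cache."""
    return [
        name
        for name, inputs in stage_inputs.items()
        if cached.get(name) != stage_hash(inputs)
    ]


# Hypothetical inputs for each stage.
before = {
    "descriptions": {"llm": "model-a"},
    "embeddings": {"model": "embed-v1"},
    "relationships": {"llmProposals": False},
}
cached = {name: stage_hash(inputs) for name, inputs in before.items()}

# Upgrading only the embeddings model invalidates only that stage.
after = {**before, "embeddings": {"model": "embed-v2"}}
print(stages_to_run(after, cached))  # ['embeddings']
```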
--stages <list> re-runs a chosen subset on an already-ingested connection. A
named stage is force-recomputed (it bypasses the completed-stage cache),
while unselected stages are left exactly as they are on disk:
- ktx ingest warehouse --stages embeddings: re-embed on a new model, keeping descriptions and joins.
- ktx ingest --all --stages relationships --no-query-history: backfill joins across every database after enabling llmProposals, without re-paying for descriptions.
- ktx ingest warehouse --stages descriptions: re-run thin descriptions (for example after raising KTX_ENRICH_LLM_TIMEOUT_MS). When nothing the descriptions depend on changed, the per-table resume record means only the tables that previously failed are re-sent to the LLM.
Stage names are validated: an unknown or empty name (--stages foo, --stages descriptions,foo, --stages "") is a hard parse error. Naming all three
(--stages descriptions,embeddings,relationships) forces a full enrichment
recompute, which is not the same as omitting the flag (omitting resumes
whatever is already done). After a selective run, ktx warns
(enrichment_stage_stale) when an unselected stage's inputs no longer match what
it was last built from — for example, re-running descriptions flags
embeddings as stale until you re-run --stages embeddings. The warning is
informational; ktx never silently cascades the extra work.
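The validation rules for --stages can be sketched as a small parser. This is an illustration, not ktx's code: parse_stages and the error wording are assumptions, and whitespace handling is not specified in the text, so this sketch does not trim names.

```python
from typing import List, Optional

KNOWN_STAGES = ("descriptions", "embeddings", "relationships")


def parse_stages(value: Optional[str]) -> Optional[List[str]]:
    """Parse a --stages value; None means the flag was omitted."""
    if value is None:
        # Omitting the flag resumes whatever is already done, which is
        # different from naming all three stages (a forced recompute).
        return None
    names = value.split(",")
    for name in names:
        # "".split(",") yields [""], so an empty value is rejected too.
        if name not in KNOWN_STAGES:
            raise ValueError(f"unknown or empty stage name: {name!r}")
    # Return the selection in canonical order without duplicates.
    return [stage for stage in KNOWN_STAGES if stage in names]
```

With this sketch, parse_stages("embeddings") returns ["embeddings"], while "foo", "descriptions,foo", and "" all raise ValueError.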
Examples
```bash
# Build every configured connection (bare = --all)
ktx ingest

# Build one database or context-source connection
ktx ingest warehouse

# Include query-history usage patterns
ktx ingest warehouse --query-history

# Set the lookback window for BigQuery or Snowflake query history
ktx ingest warehouse --query-history-window-days 30

# Re-embed one connection on a new embeddings model (descriptions/joins untouched)
ktx ingest warehouse --stages embeddings

# Backfill LLM-proposed joins across every database without re-describing
ktx ingest --all --stages relationships --no-query-history

# Build a context-source connection
ktx ingest notion

# Capture inline text into memory
ktx ingest --text "Refunds are excluded from net revenue."

# Capture multiple text snippets in one call
ktx ingest --text "Revenue is gross receipts." --text "Orders are completed purchases."

# Capture a local Markdown file into memory and tag it to a connection
ktx ingest --file docs/revenue-notes.md --connection-id warehouse

# Capture one stdin item
printf "Refunds are excluded from net revenue." | ktx ingest --file -

# Store an authoritative document verbatim (body preserved exactly)
ktx ingest --file docs/rfm-bucket-definitions.md --verbatim

# Store it verbatim and scope it to one connection
ktx ingest --file docs/haversine-formula.md --verbatim --connection-id warehouse
```
Output
Plain output summarizes each target and the operations that ran.
```text
Ingest finished
Source      Database schema   Query history   Source ingest   Memory update
warehouse   done              done            skipped         skipped
notion      skipped           skipped         done            done
```
Use --json when a script or agent needs the selected plan and per-target
results.
Final validation pruning
At the end of a context-source ingest, ktx validates the composed semantic
layer and wiki before saving it. If the final validation finds dangling
references, ktx removes the reference instead of failing accepted work. This
can remove joins that point at missing semantic sources, wiki refs, wiki
sl_refs, and inline wiki body references. If a generated semantic source is
invalid, ktx drops that source from the final save.
The stored ingest report records these changes as finalGatePrunedReferences
and finalGateDroppedSources. The trace emits final_gate_reference_pruned,
final_gate_source_dropped, final_gate_prune_committed, and
final_gate_prune_finished events when pruning runs. If validation still fails
after pruning, the ingest fails and the report keeps the final validation error.
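The pruning step for joins can be pictured as a filter that keeps valid references and records the ones it removes. This is a rough sketch under assumptions: the (left, right) join tuples, the source-name set, and prune_dangling_joins are hypothetical shapes, and the real structure of finalGatePrunedReferences is not described here.

```python
from typing import List, Set, Tuple

Join = Tuple[str, str]  # hypothetical shape: (left source, right source)


def prune_dangling_joins(
    sources: Set[str], joins: List[Join]
) -> Tuple[List[Join], List[Join]]:
    """Split joins into those whose sources exist and those pruned."""
    kept: List[Join] = []
    pruned: List[Join] = []
    for left, right in joins:
        if left in sources and right in sources:
            kept.append((left, right))
        else:
            # Dangling reference: remove it instead of failing the run.
            pruned.append((left, right))
    return kept, pruned


sources = {"orders", "customers"}
joins = [("orders", "customers"), ("orders", "refunds")]
kept, pruned = prune_dangling_joins(sources, joins)

# A report could then record what was removed (illustrative shape only).
report = {"finalGatePrunedReferences": pruned}
```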
Inspect context-source ingest traces
Context-source ingest writes persistent JSONL traces for postmortem debugging. Plain ingest output prints the trace path near the report, run, and job identifiers when a trace is available:
```text
Report: report-abc123
Run: run-abc123
Job: job-abc123
Trace: .ktx/ingest-traces/job-abc123/trace.jsonl
```
The trace file lives under the project directory at
.ktx/ingest-traces/<jobId>/trace.jsonl. Each line is a JSON event with the
job id, run id, sync id, connection id, source key, phase, event name, timing,
state snapshot, decision context, and error details. Failed runs also write a
stored ingest report with status: "failed", failure.phase,
failure.message, and the same trace path.
Use jq or line-oriented tools to inspect a trace:
```bash
jq -c '. | {at, level, phase, event, durationMs, data, error}' \
  .ktx/ingest-traces/<jobId>/trace.jsonl
```
ktx writes debug trace events by default. Set KTX_INGEST_TRACE_LEVEL to
error, info, debug, or trace before running ingest to change the trace
verbosity:
KTX_INGEST_TRACE_LEVEL=trace ktx ingest metabase
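When jq is not available, the same trace can be read with a short script. The sketch below sums durationMs per phase from JSONL lines; it relies only on the phase and durationMs fields named above, and duration_by_phase is a hypothetical helper. Phase and nested work-unit events may overlap in time, so treat the totals as a rough guide rather than an exact breakdown.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def duration_by_phase(lines: Iterable[str]) -> Dict[str, float]:
    """Sum durationMs per phase across JSONL trace events."""
    totals: Dict[str, float] = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        duration = event.get("durationMs")
        # Only timed events carry a numeric durationMs.
        if isinstance(duration, (int, float)):
            totals[event.get("phase") or "unknown"] += duration
    return dict(totals)


# Usage with a real trace (path pattern from the text above):
#   with open(".ktx/ingest-traces/<jobId>/trace.jsonl") as handle:
#       print(duration_by_phase(handle))
```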
Profiling a slow ingest
Each timed phase and work unit records a durationMs in the trace, and each
agent loop records its step count and token usage. To see where wall-clock time
went, enable profiling and ktx prints a rolled-up breakdown to stderr at the
end of the run. There are two ways to turn it on, and two output formats.
Turn it on per run with the KTX_PROFILE_INGEST environment variable, or
persistently with ingest.profile in ktx.yaml (useful for CI or while
iterating on a slow source):
```bash
KTX_PROFILE_INGEST=1 ktx ingest metabase     # human-readable table
KTX_PROFILE_INGEST=json ktx ingest metabase  # raw JSON for coding agents
```
```yaml
ingest:
  profile: true  # human table; use "json" for the machine-readable form
```
Both formats report total wall time, time per phase, and the slowest work units,
splitting each work unit's agent-loop time into model time versus tool-execution
time. The json form emits the full structured profile (raw milliseconds and
token counts, stable keys) plus a summary.headline one-line diagnosis, so a
coding agent can parse it directly instead of scraping the table. If both the env
var and the config request profiling, json wins. Example headline:
```text
Slowest phase: reconciliation (2m 05s, 48% of wall time). 2 work units (1 failed), ~88% model generation vs ~12% tools.
```
Work units run serially by default (ingest.workUnits.maxConcurrency is 1);
raise it in ktx.yaml if the profile shows the run is bound by serialized
work-unit agent loops. If the provider reports an LLM rate limit, ktx shows
a transient wait message and temporarily reduces effective work-unit concurrency
according to ingest.rateLimit.
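The precedence between KTX_PROFILE_INGEST and ingest.profile can be sketched as a small resolver. This is one reading of "json wins", namely that json is chosen whenever either source asks for it; profile_format is a hypothetical helper, and how ktx treats values other than the documented ones (1, json, true) is an assumption here (they are ignored).

```python
from typing import List, Optional, Union


def profile_format(
    env_value: Optional[str],
    config_value: Union[bool, str, None],
) -> Optional[str]:
    """Return 'json', 'table', or None when profiling is off."""
    requested: List[str] = []
    # Environment variable: "1" asks for the table, "json" for JSON.
    if env_value == "json":
        requested.append("json")
    elif env_value == "1":
        requested.append("table")
    # ktx.yaml ingest.profile: true asks for the table, "json" for JSON.
    if config_value == "json":
        requested.append("json")
    elif config_value is True:
        requested.append("table")
    if "json" in requested:
        return "json"  # json wins when both sources request profiling
    return "table" if requested else None
```

For example, profile_format("1", "json") and profile_format("json", True) both return "json", while profile_format(None, None) returns None.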
Common errors
| Error | Cause | Recovery |
|---|---|---|
| Connection not configured | The connection id is not present in ktx.yaml | Add the connection with ktx setup or update ktx.yaml |
| Enrichment is not configured | Database ingest needs a model, embeddings, and scan-enrichment configuration | Run ktx setup to configure a model and embeddings |
| Query history is unsupported | The selected database driver does not support query history | Run ingest without query-history flags |
| Python runtime is missing | The selected ingest target needs runtime-backed SQL analysis or source parsing | Accept the interactive prompt, rerun with --yes, or run the suggested ktx admin runtime install command |
| Context-source options were ignored | Query-history flags were supplied for a context-source connection | Omit database-only flags when ingesting context-source connections |
| Text ingest stops early | --fail-fast was used and one item failed | Fix the failed item or rerun without --fail-fast to collect all failures |
| --verbatim requires --text or --file | --verbatim was passed without a document to store | Add --text or --file, or drop --verbatim |
| Inline verbatim text needs a leading heading | --text --verbatim content has no # Heading to derive a stable key | Add a leading Markdown heading, or pass the content as --file <path> |
| A different page already exists at key | A verbatim re-run targeted an existing key with a different body | Use a distinct document name/key, or remove the existing page first |
| Connection scope conflict | Frontmatter connections disagrees with --connection-id | Remove one so the intended scope is unambiguous |