ktx.yaml reference

Every top-level block of the ktx.yaml project file, what it controls, accepted values, and defaults.

ktx.yaml is the single source of truth for a ktx project. The file lives at the project root and tells ktx which databases to read, which context sources to ingest, which LLM and embedding providers to use, how to store state, and how the scan and agent layers behave. Every block other than connections is optional and falls back to a documented default, so a minimal ktx.yaml is just one connection.

This page is the canonical reference for the file. For the guided flow that writes it, see ktx setup.

Where blocks fit

ktx.yaml has eight top-level keys. They group into three layers: what to read, how to think, and where to put the results.

ktx.yaml at a glance

Inputs flow left to right. Storage and memory persist the result.

Inputs

connections - warehouses, BI tools, dbt, Notion
setup - which connections are primary databases

Compute

llm - provider, models, prompt cache
ingest - connectors, embeddings, work units
scan - enrichment, relationships
agent - research-agent feature flags

Persistence

storage - state and search backends, git policy
memory - agent memory commit policy

Minimal config

A working ktx.yaml needs one entry in connections. Everything else accepts defaults. The example below registers a local Postgres connection; building context with ktx ingest warehouse also needs a model and embeddings, which ktx setup configures.

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL

Secret references

Several fields accept either a literal value or a reference. References keep secrets out of ktx.yaml so the file can stay in git.

Form	Resolved to	Used for
`env:VAR_NAME`	The value of the environment variable `VAR_NAME` at runtime	API keys, connection URLs, OAuth secrets
`file:/abs/path` or `file:~/path`	The first line of the referenced file, with `~` expanded to your home directory	Long-lived credentials kept under `.ktx/secrets/`
Literal string	Used as-is	Non-secret values such as `base_url`

References work in: warehouse url, Metabase api_key / api_key_ref, Looker client_secret / client_secret_ref, Notion / dbt / LookML / MetricFlow auth_token / auth_token_ref, and any api_key under the llm and ingest.embeddings blocks.
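
As an illustration, the three forms can sit side by side in one file. This is a sketch rather than a required layout: the file path is a placeholder, and only fields named on this page are used.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL                  # env: reference, read from the environment at runtime
  metabase:
    driver: metabase
    api_url: https://metabase.example.com  # literal string, used as-is (not a secret)
    # file: reference; the path is hypothetical, and the first line of the file is the token
    api_key_ref: file:~/secrets/metabase_api_key
```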

`connections`

The connections block is a map from a connection ID you choose to the configuration for that connector. The connection ID is what every other part of ktx uses to address a connector - ktx ingest warehouse, ktx sql --connection warehouse, the semantic-layer path semantic-layer/warehouse/, and so on.

Each entry is discriminated by the driver field. Warehouse drivers and context-source drivers share the map.

Driver	Kind	Required fields	Common optional fields
`postgres`	Warehouse	`driver`	`url`, `enabled_tables`, `historicSql`, `context.queryHistory`
`mysql`	Warehouse	`driver`	`url`, `enabled_tables`
`sqlite`	Warehouse	`driver`	`url` or `path`, `enabled_tables`
`duckdb`	Warehouse	`driver`	`url` or `path`, `enabled_tables`
`sqlserver`	Warehouse	`driver`	`url`, `enabled_tables`
`bigquery`	Warehouse	`driver`	`credentials_json`, `dataset_ids`, `enabled_tables`, `historicSql`
`snowflake`	Warehouse	`driver`	`schema_names`, `enabled_tables`, `historicSql`
`clickhouse`	Warehouse	`driver`	`url`, `database`, `databases`, `enabled_tables`
`metabase`	Context source	`driver`, `api_url`	`api_key_ref`, `mappings`
`looker`	Context source	`driver`, `base_url`, `client_id`	`client_secret_ref`, `mappings`
`lookml`	Context source	`driver`, `repoUrl`	`branch`, `path`, `auth_token_ref`, `mappings`
`dbt`	Context source	`driver`, one of `source_dir` or `repo_url`	`branch`, `path`, `profiles_path`, `target`, `project_name`
`metricflow`	Context source	`driver`, `metricflow.repoUrl`	`metricflow.branch`, `metricflow.path`, `metricflow.auth_token_ref`
`notion`	Context source	`driver`, `auth_token_ref`	`crawl_mode`, `root_*_ids`, `max_*_per_run`
`sigma`	Context source	`driver`, `client_id`, `client_secret_ref`	`api_url`

Warehouse drivers

Warehouse connections are open objects: the listed fields are validated, and any other field is preserved and passed through to the connector. Use enabled_tables to scope ingest to a specific list of objects - useful for smoke tests. Each entry accepts a catalog.db.name, db.name, or bare name qualifier. ktx restricts the scan to the listed objects and fails with a clear error (naming the available objects) if none match.

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    enabled_tables:
      - public.orders
      - public.customers

For SQLite, which exposes a single main schema, the qualified main.<name> and the bare <name> forms select the same object:

connections:
  local-db:
    driver: sqlite
    path: ./warehouse.db
    enabled_tables:
      - customers # equivalent to main.customers

Connector-specific scope fields let setup and scan use the same warehouse boundary:

connections:
  mysql-warehouse:
    driver: mysql
    url: env:MYSQL_URL
    schemas: [analytics, mart]
  clickhouse-warehouse:
    driver: clickhouse
    url: env:CLICKHOUSE_URL
    database: analytics
    databases: [analytics, mart]
  bigquery-warehouse:
    driver: bigquery
    credentials_json: file:./service-account.json
    location: US
    dataset_ids: [analytics, mart]

A BigQuery dataset_ids / dataset_id entry may be written project.dataset to introspect a dataset hosted in another project (for example bigquery-public-data.austin_311); jobs still bill to the project_id in credentials_json. A bare dataset keeps using your own project. See Primary sources → BigQuery.
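
A short sketch of the two dataset forms, reusing the public dataset named above. The bare `analytics` dataset is a placeholder for one in your own project.

```yaml
connections:
  bigquery-warehouse:
    driver: bigquery
    credentials_json: file:./service-account.json
    dataset_ids:
      - analytics                        # bare name: resolved in your own project
      - bigquery-public-data.austin_311  # project.dataset: hosted in another project
```

Jobs for both entries still bill to the project_id in credentials_json.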

For Postgres, MySQL, SQL Server, and Snowflake connections, set maxConnections when scan or ingest work needs to stay below the target's connection cap. Postgres, MySQL, and SQL Server default to 10; Snowflake defaults to 4. This caps all concurrent SQL work for that connector instance, including schema introspection, table sampling, relationship profiling, relationship validation, and read-only SQL execution. BigQuery and ClickHouse do not expose maxConnections because their connectors don't use client-side connection pools.
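
For example, a Postgres target with a tight connection cap might lower the pool below the default of 10. The sketch assumes maxConnections sits directly on the connection entry, like the other connection-level fields on this page; the value 4 is arbitrary.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    maxConnections: 4   # assumed placement; caps all concurrent SQL work for this connector
```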

For Postgres, BigQuery, and Snowflake, historicSql and context.queryHistory toggle query-history ingest. The shape is connector-specific; the setup wizard writes these fields when you pass --enable-query-history.

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    context:
      queryHistory:
        enabled: true
        enabledSchemas:
          - orbit_raw
          - orbit_analytics
        minExecutions: 5

enabledSchemas: Optional list of schema or dataset names that query-history ingest may mine. Omit it to let ktx derive the modeled schema floor from the connection and semantic-layer sources. Use ["*"] to disable the floor for discovery runs.
filters.serviceAccounts: Optional service-account filter block. During setup, when query history is enabled and no service-account block already exists, ktx can propose exact role patterns such as ^svc_loader$ from observed in-scope query history. The block uses mode: exclude and remains hand-editable.
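
A discovery run that should mine every schema could use the wildcard form described above. This is a minimal sketch built only from the fields shown in this section.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    context:
      queryHistory:
        enabled: true
        enabledSchemas: ["*"]   # disables the modeled-schema floor for discovery runs
        minExecutions: 5
```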

Query policy

Set query_policy: semantic-layer-only on a warehouse connection to stop agents from authoring SQL against it. The default, read-only-sql, allows parser-validated read-only SQL through ktx sql and the sql_execution MCP tool alongside semantic-layer queries.

connections:
  warehouse:
    driver: snowflake
    query_policy: semantic-layer-only

With semantic-layer-only:

ktx sql and the sql_execution MCP tool reject the connection with a clear error. When every SQL connection in the project is restricted, the sql_execution tool is not registered at all.
Raw SQL against the federated connection (_ktx_federated) is rejected when any member connection is restricted.
Semantic-layer queries (ktx sl query, the sl_query tool) accept only measures predefined in the semantic-layer sources. Composed aggregate expressions such as sum(orders.amount) are rejected wherever they appear, including inside filters (a HAVING-style clause may only compare a predefined measure by name, e.g. orders.revenue > 100). Grouping by declared dimensions, filtering on columns, and segments remain available.
connection_list marks the connection as restricted so agents route to sl_query instead of burning a failed call.

The policy governs agent-facing query authorship, not data access: ktx's own scan, ingest, and semantic-layer-generated SQL still run, and context tools such as entity_details and dictionary_search still expose schema metadata and sampled values.
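
To make the mixed case concrete, the sketch below restricts one connection and leaves another on the default. The connection IDs and the SANDBOX_DATABASE_URL variable are placeholders.

```yaml
connections:
  finance:
    driver: snowflake
    query_policy: semantic-layer-only   # agents must go through sl_query
  sandbox:
    driver: postgres
    url: env:SANDBOX_DATABASE_URL
    query_policy: read-only-sql         # the default, written out for clarity
```

Because sandbox is unrestricted, the sql_execution tool stays registered, but raw SQL against _ktx_federated is rejected since finance is a restricted member.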

Metabase

connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse        # Metabase DB id "1" -> ktx connection "warehouse"
      syncMode: ALL           # ALL | ONLY | EXCEPT

Field	Purpose
`api_url`	Metabase instance URL. Required.
`api_key`	Literal token. Prefer `api_key_ref`.
`api_key_ref`	Reference to the token (`env:` or `file:`).
`mappings.databaseMappings`	Map of Metabase database ID (positive-integer string) to a `ktx` warehouse connection ID. `null` explicitly unmaps.
`mappings.syncEnabled`	Per-database boolean toggle, keyed by Metabase DB ID.
`mappings.syncMode`	`ALL` (all mapped DBs), `ONLY` (those with `syncEnabled: true`), or `EXCEPT` (skip those with `syncEnabled: true`). Default `ALL`.
`mappings.selections.collections` / `items`	Optional Metabase collection or item IDs to scope ingest.
`mappings.defaultTagNames`	Default tag names attached to ingested artifacts.
`network_proxy` / `networkProxy`	Optional proxy configuration.
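
The interaction between syncMode and syncEnabled is easier to see in a sketch. The Metabase database IDs here are made up, and `analytics` is a hypothetical second connection that would have to exist under connections.

```yaml
connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
        "2": analytics   # hypothetical second ktx connection ID
        "3": null        # explicitly unmapped
      syncEnabled:
        "1": true
        "2": false
      syncMode: ONLY     # only databases with syncEnabled: true are synced, here "1"
```

Switching syncMode to EXCEPT with the same toggles would skip database "1" instead.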

Looker

connections:
  looker:
    driver: looker
    base_url: https://looker.example.com
    client_id: ktx-integration
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        prod_warehouse: warehouse

Field	Purpose
`base_url`	Looker instance URL. Required.
`client_id`	Looker OAuth client ID. Required.
`client_secret` / `client_secret_ref`	Literal secret or reference. Prefer the `_ref`.
`mappings.connectionMappings`	Map of Looker connection name to `ktx` warehouse connection ID.

LookML

connections:
  lookml:
    driver: lookml
    repoUrl: git@github.com:org/lookml.git
    branch: main
    path: lookml/
    auth_token_ref: env:GITHUB_TOKEN
    mappings:
      expectedLookerConnectionName: prod_warehouse

Field	Purpose
`repoUrl`	Git URL of the LookML project (`https`, `ssh`, or `file:`). Required. Camel-case by convention.
`branch`	Branch to fetch. Defaults to `main`.
`path`	Subdirectory inside the repo when LookML lives in a monorepo.
`auth_token_ref`	Reference to a Git auth token for private repos.
`mappings.expectedLookerConnectionName`	Looker connection name LookML models must declare. Mismatches block semantic-layer writes during ingest.

dbt

connections:
  dbt_main:
    driver: dbt
    source_dir: ../dbt-project
    target: prod

Field	Purpose
`source_dir`	Absolute or project-relative path to a local dbt project.
`repo_url`	Git URL of the dbt project. Use this instead of `source_dir` when fetching remotely.
`branch`	Branch to fetch when using `repo_url`.
`path`	Subdirectory inside the repo.
`auth_token_ref`	Git auth reference for private repos.
`profiles_path`	Override path to `profiles.yml`.
`target`	dbt target name (for example `dev`, `prod`).
`project_name`	Override the auto-detected dbt project name.
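
When the dbt project lives in a remote repository, the same connection can be written with repo_url in place of source_dir. The repository URL and subdirectory below are placeholders.

```yaml
connections:
  dbt_main:
    driver: dbt
    repo_url: https://github.com/org/dbt-project.git  # hypothetical repository
    branch: main
    path: transform/                                  # hypothetical subdirectory
    auth_token_ref: env:GITHUB_TOKEN
    target: prod
```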

MetricFlow

connections:
  metricflow:
    driver: metricflow
    metricflow:
      repoUrl: git@github.com:org/sl-config.git
      branch: main
      path: semantic_models/
      auth_token_ref: env:GITHUB_TOKEN

The MetricFlow connector wraps its fields in a nested metricflow block. repoUrl is required; the rest mirrors the LookML / dbt git fields.

Notion

connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_database_ids:
      - 9f30c2c4d4f24a8d9a8d8e2c1b2a3d4e
    max_pages_per_run: 500
    max_knowledge_creates_per_run: 5
    max_knowledge_updates_per_run: 25

Field	Purpose
`auth_token` / `auth_token_ref`	Notion integration token. Prefer the `_ref`.
`crawl_mode`	`selected_roots` (requires at least one `root_*_ids`) or `all_accessible`.
`root_page_ids`, `root_database_ids`, `root_data_source_ids`	Notion IDs to crawl when `crawl_mode` is `selected_roots`.
`max_pages_per_run`	Max pages fetched per ingest run (1-10000).
`max_knowledge_creates_per_run`	Max new wiki pages created per run (0-25).
`max_knowledge_updates_per_run`	Max existing wiki pages updated per run (0-100).
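
For comparison, a sketch of the all_accessible mode with tighter per-run limits. The specific numbers are arbitrary but stay inside the ranges listed above.

```yaml
connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: all_accessible           # no root_*_ids needed in this mode
    max_pages_per_run: 200               # allowed range 1-10000
    max_knowledge_creates_per_run: 0     # allowed range 0-25; 0 creates no new wiki pages
    max_knowledge_updates_per_run: 10    # allowed range 0-100
```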

Sigma

connections:
  sigma-main:
    driver: sigma
    api_url: https://api.sigmacomputing.com
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    workbookFilter:
      includeArchived: false
      includeExplorations: false
      updatedSince: "2026-01-01T00:00:00Z"

Field	Purpose
`api_url`	Sigma API base URL. Defaults to `https://api.sigmacomputing.com` (GCP US). Override for AWS US (`https://aws-api.sigmacomputing.com`) or other regions.
`client_id`	Sigma OAuth client ID. Required.
`client_secret` / `client_secret_ref`	Literal secret or reference. Prefer the `_ref`.
`connectionMappings`	Maps Sigma internal connection UUIDs to ktx warehouse connection IDs. Enables `sl_validate` for projected semantic-layer sources.
`workbookFilter.includeArchived`	Include archived workbooks during ingest. Default: `false`.
`workbookFilter.includeExplorations`	Include exploration workbooks during ingest. Default: `false`.
`workbookFilter.updatedSince`	ISO 8601 date string. Only workbooks updated on or after this date are fetched. Useful for limiting ingest scope at large scale.
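
A sketch of connectionMappings for an AWS US tenant follows. It assumes the field sits at the top level of the connection, alongside workbookFilter, as the table lists it; the UUID is a placeholder.

```yaml
connections:
  sigma-main:
    driver: sigma
    api_url: https://aws-api.sigmacomputing.com
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    connectionMappings:
      # placeholder Sigma connection UUID -> ktx warehouse connection ID
      "00000000-0000-0000-0000-000000000000": warehouse
```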

`setup`

setup is captured by the setup wizard. The only field ktx still reads is database_connection_ids, which tells the ingest layer which entries in connections are primary warehouses. When omitted, every warehouse-typed connection is treated as primary.

setup:
  database_connection_ids:
    - warehouse

Field	Type	Default	Purpose
`database_connection_ids`	`string[]`	`[]`	IDs in `connections` treated as primary warehouses by ingest and scan. When omitted, every warehouse-typed connection is primary.

`storage`

storage controls where ktx keeps its own state and search index. Defaults work for a single-user local project.

storage:
  state: sqlite          # sqlite | postgres
  search: sqlite-fts5    # sqlite-fts5 | postgres-hybrid
  git:
    author: "ktx <ktx@example.com>"

Field	Type	Default	Purpose
`state`	`sqlite` | `postgres`	`sqlite`	Backend for ktx state. `sqlite` uses `.ktx/db.sqlite`; `postgres` expects a configured Postgres connection.
`search`	`sqlite-fts5` | `postgres-hybrid`	`sqlite-fts5`	Backend for search indexes. `postgres-hybrid` combines lexical and vector search in Postgres.
`git.author`	`string`	`ktx <ktx@example.com>`	Git author identity for commits. Standard `Name <email>` form.
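
A shared deployment might move both backends to Postgres. The sketch below uses only the values listed in the table; how ktx locates the Postgres connection is not covered on this page, so that part is left out, and the author identity is a placeholder.

```yaml
storage:
  state: postgres            # expects a configured Postgres connection
  search: postgres-hybrid    # lexical plus vector search in Postgres
  git:
    author: "Data Platform <data-platform@example.com>"  # hypothetical identity
```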

`llm`

The llm block selects the LLM provider, lets you override the model used for specific roles, and tunes prompt caching.

llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
  models:
    default: claude-sonnet-4-6
    triage: claude-haiku-4-5
    candidateExtraction: claude-sonnet-4-6
    curator: claude-opus-4-7
    reconcile: claude-opus-4-7
    repair: claude-haiku-4-5
  promptCaching:
    enabled: true
    systemTtl: 1h
    toolsTtl: 1h
    historyTtl: 5m
    vertexFallbackTo5m: true

Provider

Field	Type	Default	Purpose
`provider.backend`	`none` | `anthropic` | `vertex` | `gateway` | `claude-code` | `codex`	`none`	Selected backend. `none` disables LLM features. `claude-code` uses the local Claude Code session and needs no API key. `codex` uses local Codex authentication and needs no API key.
`provider.anthropic.api_key`	`string`	-	Anthropic API key. Required when `backend: anthropic`. Accepts `env:` or `file:` references.
`provider.anthropic.base_url`	`string`	-	Override the Anthropic API base URL (proxy, self-hosted gateway).
`provider.gateway.api_key` / `base_url`	`string`	-	Credentials for an AI Gateway provider. Required when `backend: gateway`.
`provider.vertex.project`	`string`	-	Google Cloud project ID hosting the Vertex AI endpoint.
`provider.vertex.location`	`string`	-	Vertex AI region (for example `us-east5`). Required when the `vertex` block is present.

Use codex when local Codex authentication should power ktx LLM work:

llm:
  provider:
    backend: codex
  models:
    default: gpt-5.5
    triage: gpt-5.5
    candidateExtraction: gpt-5.5
    curator: gpt-5.5
    reconcile: gpt-5.5
    repair: gpt-5.5
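
By contrast, the gateway backend needs both credentials from the provider table. In this sketch the environment variable name and the base URL are placeholders.

```yaml
llm:
  provider:
    backend: gateway
    gateway:
      api_key: env:AI_GATEWAY_API_KEY            # hypothetical variable name
      base_url: https://gateway.example.com/v1   # hypothetical endpoint
```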

Model roles

models overrides the per-role model. Keys are fixed; values are provider-specific model identifiers.

Role	Used for
`default`	Catch-all when no role-specific override exists.
`triage`	Cheap routing decisions during ingest and scan.
`candidateExtraction`	Extracting relationship and entity candidates from data.
`curator`	Reconciling proposed context against accepted files.
`reconcile`	Resolving conflicts between incoming and existing context.
`repair`	Fixing invalid generated YAML before write.
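
Because default is the catch-all, a config does not have to list every role. A minimal sketch that overrides only triage:

```yaml
llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
  models:
    default: claude-sonnet-4-6   # used by every role without its own entry
    triage: claude-haiku-4-5     # cheaper model for routing decisions only
```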

Prompt caching

Field	Type	Default	Purpose
`promptCaching.enabled`	`boolean`	backend default	Master switch for Anthropic-style prompt caching.
`promptCaching.systemTtl`	`5m` | `1h`	backend default	Cache TTL for the system prompt segment.
`promptCaching.toolsTtl`	`5m` | `1h`	backend default	Cache TTL for the tools/schema segment.
`promptCaching.historyTtl`	`5m` | `1h`	backend default	Cache TTL for conversation-history breakpoints.
`promptCaching.vertexFallbackTo5m`	`boolean`	`false`	When `true`, downgrade `1h` TTLs to `5m` on Vertex, which does not support `1h` caching.
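
A sketch of how the fallback flag pairs with the Vertex backend. The project ID is a placeholder; the region is the example given in the provider table.

```yaml
llm:
  provider:
    backend: vertex
    vertex:
      project: my-gcp-project   # hypothetical project ID
      location: us-east5
  promptCaching:
    enabled: true
    systemTtl: 1h               # requested TTL
    toolsTtl: 1h
    historyTtl: 5m
    vertexFallbackTo5m: true    # on Vertex, the 1h TTLs above are downgraded to 5m
```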

`ingest`

ingest controls how ktx builds context from your stack. It lists the connectors to run, the embedding provider used when connectors embed documents, and the concurrency and failure policy for work units.

ingest:
  adapters:
    - live-database
    - dbt
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    stepBudget: 40
    maxConcurrency: 2
    failureMode: continue
  rateLimit:
    enabled: true
    throttleThreshold: 0.8
    minConcurrencyUnderPressure: 1
    maxWaitMs: 600000
    retry:
      maxAttempts: 6
      baseDelayMs: 1000
      maxDelayMs: 60000
      jitter: true

Connectors

adapters is a list of connector IDs that should run. Each ID matches a connector that ktx ships locally:

Connector ID	What it ingests
`live-database`	Live warehouse introspection (schemas, tables, columns, samples).
`historic-sql`	Query history from Postgres `pg_stat_statements`, BigQuery `INFORMATION_SCHEMA.JOBS`, or Snowflake query history.
`dbt`	dbt manifest models, sources, tests, and exposures.
`metricflow`	MetricFlow / Semantic Layer models and metrics.
`lookml`	LookML projects (models, explores, views, joins).
`looker`	Looker dashboards and looks via the API.
`metabase`	Metabase cards, dashboards, and database mappings.
`notion`	Notion pages and databases for wiki context.
`fake`	Test/demo connector. Useful in fixtures.

Embeddings

The embeddings block can also appear inside scan.enrichment; that override wins when present.

Field	Type	Default	Purpose
`backend`	`none` | `openai` | `sentence-transformers`	`none`	Embedding provider. `none` disables embeddings.
`model`	`string`	-	Provider model ID, for example `text-embedding-3-small` or `all-MiniLM-L6-v2`.
`dimensions`	`int > 0`	`8`	Vector size. Default `8` is a placeholder that's only valid with `backend: none`. Set explicitly to match your model (1536 for `text-embedding-3-small`, 384 for `all-MiniLM-L6-v2`).
`openai.api_key` / `base_url`	`string`	-	OpenAI credentials. Required when `backend: openai`.
`sentenceTransformers.base_url`	`string`	`""`	URL of the sentence-transformers server. Empty when ktx manages the local daemon for you.
`sentenceTransformers.pathPrefix`	`string`	-	Optional URL path prefix prepended to embedding requests.
`batchSize`	`int > 0`	provider default	Texts per embedding API call.
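
For a local setup without an OpenAI key, the sentence-transformers backend could be configured as below. This sketch relies on the empty base_url default, under which ktx manages the local daemon.

```yaml
ingest:
  embeddings:
    backend: sentence-transformers
    model: all-MiniLM-L6-v2
    dimensions: 384        # must match the model; 384 for all-MiniLM-L6-v2
    sentenceTransformers:
      base_url: ""         # empty: ktx manages the local daemon for you
```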

Work units

A work unit is one unit of agent-driven ingest work (for example one table or one Metabase question). These knobs bound how long it runs and how the run handles failures.

Field	Type	Default	Purpose
`workUnits.stepBudget`	`int > 0`	`40`	Maximum agent steps allowed per work unit before it's force-terminated.
`workUnits.maxConcurrency`	`int > 0`	`1`	How many work units run in parallel.
`workUnits.failureMode`	`abort` | `continue`	`continue`	`abort` stops the whole ingest run on the first failure; `continue` records it and keeps going.
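
A strict configuration, for instance in CI where a partial ingest is not acceptable, might look like the sketch below. The step budget of 20 is an arbitrary illustration.

```yaml
ingest:
  workUnits:
    stepBudget: 20       # force-terminate a work unit after 20 agent steps
    maxConcurrency: 1    # run work units one at a time (the default)
    failureMode: abort   # stop the whole run on the first failed work unit
```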

Rate limits

rateLimit controls provider-neutral pacing for LLM calls during ingest. When a provider reports a subscription window, retry-after delay, or HTTP 429, ktx pauses new work-unit model calls, shows a transient wait in the CLI, and reduces work-unit concurrency while the provider is under pressure.

Field	Type	Default	Purpose
`rateLimit.enabled`	`boolean`	`true`	Master switch for ingest LLM rate-limit pacing and visible waits.
`rateLimit.throttleThreshold`	`number between 0 and 1`	`0.8`	Fraction of a known provider window at which ktx starts reducing concurrency.
`rateLimit.minConcurrencyUnderPressure`	`int > 0`	`1`	Effective work-unit concurrency while a provider is under rate-limit pressure.
`rateLimit.maxWaitMs`	`int > 0`	unset	Caps how long a single provider-reset wait can last. This bounds each wait, not the whole run: after a capped wait elapses ktx retries and may pause again. Omit to wait until the provider's reset time.
`rateLimit.retry.maxAttempts`	`int > 0`	`6`	Maximum attempts for a single rate-limited LLM call before the failure surfaces (counts the first try). Also bounds how far opaque backoff grows for responses without a reset time or retry-after value.
`rateLimit.retry.baseDelayMs`	`int > 0`	`1000`	Initial opaque retry delay in milliseconds.
`rateLimit.retry.maxDelayMs`	`int > 0`	`60000`	Maximum opaque retry delay in milliseconds.
`rateLimit.retry.jitter`	`boolean`	`true`	Add jitter to opaque retry delays.
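
As a sketch, a more conservative profile for a heavily shared provider account could throttle earlier and cap each wait. All numbers are illustrative, not recommendations.

```yaml
ingest:
  rateLimit:
    enabled: true
    throttleThreshold: 0.6            # start reducing concurrency at 60% of a known window
    minConcurrencyUnderPressure: 1
    maxWaitMs: 120000                 # each provider-reset wait is capped at 2 minutes
    retry:
      maxAttempts: 4                  # counts the first try
      baseDelayMs: 2000
      maxDelayMs: 30000
      jitter: true
```

Note that maxWaitMs bounds each wait rather than the whole run, so a long provider window can still produce several capped waits in a row.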

`scan`

scan configures how schema-level inputs become structured context: column-level enrichment and inferred relationships between tables.

scan:
  enrichment:
    mode: llm           # none | deterministic | llm
  relationships:
    enabled: true
    llmProposals: true
    validationRequiredForManifest: true
    acceptThreshold: 0.85
    reviewThreshold: 0.55
    maxLlmTablesPerBatch: 40
    maxCandidatesPerColumn: 25
    profileSampleRows: 10000
    profileConcurrency: 4
    validationConcurrency: 4
    validationBudget: all
    detectionBudgetMs: 600000

Enrichment

Field	Type	Default	Purpose
`enrichment.mode`	`none` | `deterministic` | `llm`	`none`	How columns and tables get described. `deterministic` uses local heuristics; `llm` calls the configured provider.
`enrichment.embeddings`	embedding block	-	Optional override for enrichment-time vectorization. Falls back to `ingest.embeddings`.
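
The override can be sketched as follows: enrichment uses a local embedding model while ingest.embeddings, not shown, keeps whatever provider it already has. The model and dimensions are the sentence-transformers example values from the Embeddings table.

```yaml
scan:
  enrichment:
    mode: deterministic
    embeddings:                        # wins over ingest.embeddings for enrichment
      backend: sentence-transformers
      model: all-MiniLM-L6-v2
      dimensions: 384
```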

Relationships

The relationship discovery step proposes joins between tables, scores them, and optionally validates each one against the database before writing it to the manifest.

Field	Type	Default	Purpose
`relationships.enabled`	`boolean`	`true`	Master switch for relationship discovery.
`relationships.llmProposals`	`boolean`	`true`	When `true`, propose relationships using the LLM in addition to deterministic candidates.
`relationships.validationRequiredForManifest`	`boolean`	`true`	When `true`, only proposals that pass database-side validation reach the manifest.
`relationships.acceptThreshold`	`number 0-1`	`0.85`	Confidence at or above which a proposal is auto-accepted.
`relationships.reviewThreshold`	`number 0-1`	`0.55`	Confidence at or above which a proposal is surfaced for human review (but not auto-accepted).
`relationships.maxLlmTablesPerBatch`	`int > 0`	`40`	Max tables included in a single LLM relationship-proposal batch.
`relationships.maxCandidatesPerColumn`	`int > 0`	`25`	Max join partners considered per column.
`relationships.profileSampleRows`	`int > 0`	`10000`	Rows sampled per table when profiling values for relationship inference.
`relationships.profileConcurrency`	`int > 0`	`4`	Parallel relationship-profile queries against the database. For pooled connectors, effective database concurrency is also bounded by the connection's `maxConnections`.
`relationships.validationConcurrency`	`int > 0`	`4`	Parallel relationship validation queries against the database.
`relationships.validationBudget`	`all` | `int ≥ 0`	runtime default	Cap on validation queries per scan. `all` means unlimited.
`relationships.detectionBudgetMs`	`int > 0`	`600000`	Wall-clock budget (ms) for the whole relationship-detection stage, checked at table-profile, candidate-validation, and composite-probe boundaries. On exhaustion the stage stops scheduling new work and writes the joins found so far, marked partial; descriptions and embeddings are already durable. Sits above the per-query deadline. Raise it and rerun the scan to get a fuller result.
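
A sketch of a cautious profile for a large or busy warehouse, using only the fields in the table above. The values are illustrative.

```yaml
scan:
  relationships:
    enabled: true
    llmProposals: false          # deterministic candidates only
    acceptThreshold: 0.9         # a proposal scored 0.93 is auto-accepted
    reviewThreshold: 0.6         # a proposal scored 0.70 is surfaced for review
    profileConcurrency: 2
    validationConcurrency: 2
    validationBudget: 200        # at most 200 validation queries per scan
    detectionBudgetMs: 300000    # stop scheduling new work after 5 minutes
```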

`agent`

agent carries feature flags for ktx-side agent behavior. Today the only block is run_research, which gates the research agent invoked by ktx mcp and CLI research tools.

agent:
  run_research:
    enabled: true
    max_iterations: 20
    default_toolset:
      - sl_query
      - wiki_search
      - sl_read_source

Field	Type	Default	Purpose
`run_research.enabled`	`boolean`	`false`	Master switch for the research agent.
`run_research.max_iterations`	`int ≥ 0`	`20`	Maximum tool-call iterations per research run.
`run_research.default_toolset`	`string[]`	`[sl_query, wiki_search, sl_read_source]`	Tool identifiers exposed to the research agent.
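
A narrower research agent can be sketched by trimming the toolset and the iteration cap. The tool identifiers are taken from the default list above; the cap of 8 is arbitrary.

```yaml
agent:
  run_research:
    enabled: true
    max_iterations: 8      # default is 20
    default_toolset:       # semantic-layer tools only, no wiki search
      - sl_query
      - sl_read_source
```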

A full example

Combining the blocks above:

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
      syncMode: ALL
setup:
  database_connection_ids:
    - warehouse
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    author: "ktx <ktx@example.com>"
llm:
  provider:
    backend: claude-code
  models:
    default: sonnet
    triage: haiku
    candidateExtraction: sonnet
    curator: opus
    reconcile: opus
    repair: haiku
ingest:
  adapters:
    - live-database
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    maxConcurrency: 2
scan:
  enrichment:
    mode: llm
  relationships:
    acceptThreshold: 0.85
    reviewThreshold: 0.55
agent:
  run_research:
    enabled: true

Validating your config

ktx validates ktx.yaml when it loads, and treats two kinds of problems differently:

An invalid value on a field ktx recognizes (for example llm.provider.backend: nope) is a hard error. Setup and CLI commands stop and report the exact path so you can fix it.
An unrecognized key (one left over from a different ktx version, or a typo such as scan.relationships.acceptThreshhold) is tolerated, not fatal. ktx ignores the key and keeps running, so a misspelled field quietly falls back to its default instead of taking effect. ktx status lists each ignored key as a warning (and exits 0) so you can remove or correct it when convenient.
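
The two cases, side by side in a deliberately broken fragment built from the examples above:

```yaml
llm:
  provider:
    backend: nope             # invalid value on a recognized field: hard error
scan:
  relationships:
    acceptThreshhold: 0.9     # misspelled key: ignored, so acceptThreshold stays at 0.85
```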

Warehouse connections accept extra driver-specific fields, so passthrough values like historicSql and context.queryHistory are allowed.

To re-validate without running anything else:

ktx status

ktx status parses ktx.yaml, surfaces validation issues, and reports which inputs are ready.

ktx setup - the guided flow that writes most of these fields for you.
ktx status - readiness check for the current ktx.yaml.
LLM configuration - provider-specific setup notes.
Primary sources and Context sources - connector-specific details and credentials.
