ktx by Kaelio

ktx.yaml reference

Every top-level block of the ktx.yaml project file, what it controls, accepted values, and defaults.

ktx.yaml is the single source of truth for a ktx project. The file lives at the project root and tells ktx which databases to read, which context sources to ingest, which LLM and embedding providers to use, how to store state, and how the scan and agent layers behave. Apart from at least one entry in connections, every block below is optional and falls back to a documented default, so a minimal ktx.yaml is just one connection.

This page is the canonical reference for the file. For the guided flow that writes it, see ktx setup.

Where blocks fit

ktx.yaml has eight top-level keys. They group into three layers: what to read, how to think, and where to put the results.

ktx.yaml at a glance

Inputs flow left to right. Storage and memory persist the result.

Inputs

  • connections - warehouses, BI tools, dbt, Notion
  • setup - which connections are primary databases

Compute

  • llm - provider, models, prompt cache
  • ingest - adapters, embeddings, work units
  • scan - enrichment, relationships
  • agent - research-agent feature flags

Persistence

  • storage - state and search backends, git policy
  • memory - agent memory commit policy

Minimal config

A working ktx.yaml needs one entry in connections. Everything else accepts defaults. The example below is enough for ktx ingest warehouse to run a fast schema scan against a local Postgres.

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL

Secret references

Several fields accept either a literal value or a reference. References keep secrets out of ktx.yaml so the file can stay in git.

| Form | Resolved to | Used for |
| --- | --- | --- |
| env:VAR_NAME | The value of the environment variable VAR_NAME at runtime | API keys, connection URLs, OAuth secrets |
| file:/abs/path or file:~/path | The first line of the referenced file, with ~ expanded to your home directory | Long-lived credentials kept under .ktx/secrets/ |
| Literal string | Used as-is | Non-secret values such as base_url |

References work in: warehouse url, Metabase api_key / api_key_ref, Looker client_secret / client_secret_ref, Notion / dbt / LookML / MetricFlow auth_token / auth_token_ref, and any api_key under the llm and ingest.embeddings blocks.
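
As a sketch of how the three forms might sit side by side, the fragment below uses one of each. The field names come from this page; the file path and the proxy URL are hypothetical values chosen for illustration.

```yaml
connections:
  warehouse:
    driver: postgres
    # env: reference, resolved from the environment at runtime
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    # file: reference; only the first line of the file is read.
    # The path below is a hypothetical example.
    api_key_ref: file:~/.ktx/secrets/metabase_api_key
llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
      # Literal string, used as-is (hypothetical proxy URL)
      base_url: https://llm-proxy.example.com
```

Because no secret value appears in the file itself, a config shaped like this can be committed to git.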

connections

The connections block is a map from a connection ID you choose to the configuration for that connector. The connection ID is what every other part of ktx uses to address a connector - ktx ingest warehouse, ktx sql --connection warehouse, the semantic-layer path semantic-layer/warehouse/, and so on.

Each entry is discriminated by the driver field. Warehouse drivers and context-source drivers share the map.

| Driver | Kind | Required fields | Common optional fields |
| --- | --- | --- | --- |
| postgres | Warehouse | driver | url, enabled_tables, historicSql, context.queryHistory |
| mysql | Warehouse | driver | url, enabled_tables |
| sqlite | Warehouse | driver | url or path, enabled_tables |
| sqlserver | Warehouse | driver | url, enabled_tables |
| bigquery | Warehouse | driver | credentials_json, dataset_ids, enabled_tables, historicSql |
| snowflake | Warehouse | driver | schema_names, enabled_tables, historicSql |
| clickhouse | Warehouse | driver | url, database, databases, enabled_tables |
| metabase | Context source | driver, api_url | api_key_ref, mappings |
| looker | Context source | driver, base_url, client_id | client_secret_ref, mappings |
| lookml | Context source | driver, repoUrl | branch, path, auth_token_ref, mappings |
| dbt | Context source | driver, one of source_dir or repo_url | branch, path, profiles_path, target, project_name |
| metricflow | Context source | driver, metricflow.repoUrl | metricflow.branch, metricflow.path, metricflow.auth_token_ref |
| notion | Context source | driver, auth_token_ref | crawl_mode, root_*_ids, max_*_per_run |

Warehouse drivers

Warehouse connections are open objects: the listed fields are validated, and any other field is preserved and passed through to the connector. Use enabled_tables to scope deep ingest to a specific list of schema.table names - useful for smoke tests.

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    enabled_tables:
      - public.orders
      - public.customers

Connector-specific scope fields let setup and scan use the same warehouse boundary:

connections:
  mysql-warehouse:
    driver: mysql
    url: env:MYSQL_URL
    schemas: [analytics, mart]
  clickhouse-warehouse:
    driver: clickhouse
    url: env:CLICKHOUSE_URL
    database: analytics
    databases: [analytics, mart]
  bigquery-warehouse:
    driver: bigquery
    credentials_json: file:./service-account.json
    location: US
    dataset_ids: [analytics, mart]

For Postgres, MySQL, SQL Server, and Snowflake connections, set maxConnections when scan or ingest work needs to stay below the target's connection cap. Postgres, MySQL, and SQL Server default to 10; Snowflake defaults to 4. This caps all concurrent SQL work for that connector instance, including schema introspection, table sampling, relationship profiling, relationship validation, and read-only SQL execution. BigQuery and ClickHouse do not expose maxConnections because their connectors don't use client-side connection pools.
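
For instance, a Postgres target with a tight connection cap might be configured as in the sketch below. It assumes maxConnections is set directly on the connection entry, and the value 4 is illustrative rather than a recommendation.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    # Assumed placement: a field on the connection entry itself.
    # Caps all concurrent SQL work for this connector (Postgres default: 10).
    maxConnections: 4
```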

For Postgres, BigQuery, and Snowflake, historicSql and context.queryHistory toggle query-history ingest. The shape is connector-specific; the setup wizard writes these fields when you pass --enable-query-history.

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
    context:
      queryHistory:
        enabled: true
        minExecutions: 5

Metabase

connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse        # Metabase DB id "1" -> ktx connection "warehouse"
      syncMode: ALL           # ALL | ONLY | EXCEPT

| Field | Purpose |
| --- | --- |
| api_url | Metabase instance URL. Required. |
| api_key | Literal token. Prefer api_key_ref. |
| api_key_ref | Reference to the token (env: or file:). |
| mappings.databaseMappings | Map of Metabase database ID (positive-integer string) to a ktx warehouse connection ID. null explicitly unmaps. |
| mappings.syncEnabled | Per-database boolean toggle, keyed by Metabase DB ID. |
| mappings.syncMode | ALL (all mapped DBs), ONLY (those with syncEnabled: true), or EXCEPT (skip those with syncEnabled: true). Default ALL. |
| mappings.selections.collections / items | Optional Metabase collection or item IDs to scope ingest. |
| mappings.defaultTagNames | Default tag names attached to ingested artifacts. |
| network_proxy / networkProxy | Optional proxy configuration. |
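
To make the interaction between databaseMappings, syncEnabled, and syncMode concrete, here is a hedged sketch. It assumes syncEnabled is a map from Metabase DB ID to a boolean, as its description suggests; the database IDs are made up.

```yaml
connections:
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
        "2": warehouse
        "3": null          # explicitly unmapped
      syncEnabled:
        "1": true          # assumed shape: Metabase DB id -> boolean
      syncMode: ONLY       # sync only DBs with syncEnabled: true, here "1"
```

Switching syncMode to EXCEPT with the same syncEnabled map would instead skip database "1" and sync the remaining mapped databases.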

Looker

connections:
  looker:
    driver: looker
    base_url: https://looker.example.com
    client_id: ktx-integration
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        prod_warehouse: warehouse

| Field | Purpose |
| --- | --- |
| base_url | Looker instance URL. Required. |
| client_id | Looker OAuth client ID. Required. |
| client_secret / client_secret_ref | Literal secret or reference. Prefer the _ref. |
| mappings.connectionMappings | Map of Looker connection name to ktx warehouse connection ID. |

LookML

connections:
  lookml:
    driver: lookml
    repoUrl: git@github.com:org/lookml.git
    branch: main
    path: lookml/
    auth_token_ref: env:GITHUB_TOKEN
    mappings:
      expectedLookerConnectionName: prod_warehouse

| Field | Purpose |
| --- | --- |
| repoUrl | Git URL of the LookML project (https, ssh, or file:). Required. Camel-case by convention. |
| branch | Branch to fetch. Defaults to main. |
| path | Subdirectory inside the repo when LookML lives in a monorepo. |
| auth_token_ref | Reference to a Git auth token for private repos. |
| mappings.expectedLookerConnectionName | Looker connection name LookML models must declare. Mismatches block semantic-layer writes during ingest. |

dbt

connections:
  dbt_main:
    driver: dbt
    source_dir: ../dbt-project
    target: prod

| Field | Purpose |
| --- | --- |
| source_dir | Absolute or project-relative path to a local dbt project. |
| repo_url | Git URL of the dbt project. Use this instead of source_dir when fetching remotely. |
| branch | Branch to fetch when using repo_url. |
| path | Subdirectory inside the repo. |
| auth_token_ref | Git auth reference for private repos. |
| profiles_path | Override path to profiles.yml. |
| target | dbt target name (for example dev, prod). |
| project_name | Override the auto-detected dbt project name. |
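
The example above reads a local checkout. A remote variant might look like the sketch below, which swaps source_dir for repo_url and the git fields from the table; the repository URL and subdirectory are hypothetical.

```yaml
connections:
  dbt_main:
    driver: dbt
    # Hypothetical repository; replaces source_dir when fetching remotely
    repo_url: https://github.com/org/dbt-project.git
    branch: main
    path: analytics/                 # hypothetical subdirectory in the repo
    auth_token_ref: env:GITHUB_TOKEN
    target: prod
```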

MetricFlow

connections:
  metricflow:
    driver: metricflow
    metricflow:
      repoUrl: git@github.com:org/sl-config.git
      branch: main
      path: semantic_models/
      auth_token_ref: env:GITHUB_TOKEN

The MetricFlow connector wraps its fields in a nested metricflow block. repoUrl is required; the rest mirrors the LookML / dbt git fields.

Notion

connections:
  notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_database_ids:
      - 9f30c2c4d4f24a8d9a8d8e2c1b2a3d4e
    max_pages_per_run: 500
    max_knowledge_creates_per_run: 5
    max_knowledge_updates_per_run: 25

| Field | Purpose |
| --- | --- |
| auth_token / auth_token_ref | Notion integration token. Prefer the _ref. |
| crawl_mode | selected_roots (requires at least one root_*_ids) or all_accessible. |
| root_page_ids, root_database_ids, root_data_source_ids | Notion IDs to crawl when crawl_mode is selected_roots. |
| max_pages_per_run | Max pages fetched per ingest run (1-10000). |
| max_knowledge_creates_per_run | Max new wiki pages created per run (0-25). |
| max_knowledge_updates_per_run | Max existing wiki pages updated per run (0-100). |

setup

The setup block is written by the setup wizard. The only field ktx still reads from it is database_connection_ids, which tells the ingest layer which entries in connections are primary warehouses. When it is omitted, every warehouse-typed connection is treated as primary.

setup:
  database_connection_ids:
    - warehouse

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| database_connection_ids | string[] | [] | IDs in connections treated as primary warehouses by ingest and scan. |
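
The field matters most when a project has more than one warehouse. In the sketch below, the second connection ID (staging) is a hypothetical name; only warehouse is listed, so only it is treated as primary.

```yaml
connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
  staging:                      # hypothetical second warehouse
    driver: mysql
    url: env:MYSQL_URL
setup:
  database_connection_ids:
    - warehouse                 # staging is not treated as primary
```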

storage

storage controls where ktx keeps its own state and search index, and how state changes are committed. Defaults work for a single-user local project.

storage:
  state: sqlite          # sqlite | postgres
  search: sqlite-fts5    # sqlite-fts5 | postgres-hybrid
  git:
    auto_commit: true
    author: "ktx <ktx@example.com>"

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| state | sqlite or postgres | sqlite | Backend for ktx state. sqlite uses .ktx/db.sqlite; postgres expects a configured Postgres connection. |
| search | sqlite-fts5 or postgres-hybrid | sqlite-fts5 | Backend for search indexes. postgres-hybrid combines lexical and vector search in Postgres. |
| git.auto_commit | boolean | true | When true, ktx auto-commits changes to the git-backed state store. |
| git.author | string | ktx <ktx@example.com> | Git author identity for auto-commits. Standard Name <email> form. |
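
A shared deployment might move both backends to Postgres and turn off auto-commits. The sketch below uses only values listed in the table. How ktx locates the Postgres instance for state is not covered in this section, so that part is left out rather than guessed, and the author identity is a hypothetical example.

```yaml
storage:
  state: postgres
  search: postgres-hybrid       # lexical + vector search in Postgres
  git:
    auto_commit: false          # state changes are not committed automatically
    author: "Data Platform Bot <data-platform@example.com>"   # hypothetical
```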

llm

The llm block selects the LLM provider, lets you override the model used for specific roles, and tunes prompt caching.

llm:
  provider:
    backend: anthropic
    anthropic:
      api_key: env:ANTHROPIC_API_KEY
  models:
    default: claude-sonnet-4-6
    triage: claude-haiku-4-5
  promptCaching:
    enabled: true
    systemTtl: 1h
    toolsTtl: 1h
    historyTtl: 5m
    vertexFallbackTo5m: true

Provider

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| provider.backend | none, anthropic, vertex, gateway, or claude-code | none | Selected backend. none disables LLM features. claude-code uses the local Claude Code session and needs no API key. |
| provider.anthropic.api_key | string | - | Anthropic API key. Required when backend: anthropic. Accepts env: or file: references. |
| provider.anthropic.base_url | string | - | Override the Anthropic API base URL (proxy, self-hosted gateway). |
| provider.gateway.api_key / base_url | string | - | Credentials for an AI Gateway provider. Required when backend: gateway. |
| provider.vertex.project | string | - | Google Cloud project ID hosting the Vertex AI endpoint. |
| provider.vertex.location | string | - | Vertex AI region (for example us-east5). Required when the vertex block is present. |
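
The block at the top of this section shows the anthropic backend. As a sketch of the gateway variant, assuming the gateway block mirrors the anthropic one in shape, a config might look like this; the environment variable name and URL are hypothetical.

```yaml
llm:
  provider:
    backend: gateway
    gateway:
      api_key: env:AI_GATEWAY_API_KEY          # hypothetical variable name
      base_url: https://gateway.example.com    # hypothetical gateway URL
```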

Model roles

models overrides the per-role model. Keys are fixed; values are provider-specific model identifiers.

| Role | Used for |
| --- | --- |
| default | Catch-all when no role-specific override exists. |
| triage | Cheap routing decisions during ingest and scan. |
| candidateExtraction | Extracting relationship and entity candidates from data. |
| curator | Reconciling proposed context against accepted files. |
| reconcile | Resolving conflicts between incoming and existing context. |
| repair | Fixing invalid generated YAML before write. |
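
One plausible use of role overrides is to route the cheaper roles to a smaller model and let everything else fall through to default. The pairing of models to roles below is an illustrative assumption, not a recommendation from this reference.

```yaml
llm:
  models:
    default: claude-sonnet-4-6        # used by curator and reconcile here
    triage: claude-haiku-4-5
    candidateExtraction: claude-haiku-4-5
    repair: claude-haiku-4-5
```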

Prompt caching

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| promptCaching.enabled | boolean | backend default | Master switch for Anthropic-style prompt caching. |
| promptCaching.systemTtl | 5m or 1h | backend default | Cache TTL for the system prompt segment. |
| promptCaching.toolsTtl | 5m or 1h | backend default | Cache TTL for the tools/schema segment. |
| promptCaching.historyTtl | 5m or 1h | backend default | Cache TTL for conversation-history breakpoints. |
| promptCaching.vertexFallbackTo5m | boolean | false | When true, downgrade 1h TTLs to 5m on Vertex, which does not support 1h caching. |
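
The fallback flag is easiest to read next to a Vertex provider. In this sketch the project ID is hypothetical; with vertexFallbackTo5m set, the two 1h TTLs would be downgraded to 5m on Vertex, as the table describes.

```yaml
llm:
  provider:
    backend: vertex
    vertex:
      project: my-gcp-project   # hypothetical project ID
      location: us-east5
  promptCaching:
    enabled: true
    systemTtl: 1h               # downgraded to 5m on Vertex
    toolsTtl: 1h                # downgraded to 5m on Vertex
    historyTtl: 5m
    vertexFallbackTo5m: true
```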

ingest

ingest controls how ktx builds context from your stack. It lists the adapters to run, the embedding provider used when adapters embed documents, and the concurrency and failure policy for work units.

ingest:
  adapters:
    - live-database
    - dbt
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    stepBudget: 40
    maxConcurrency: 2
    failureMode: continue

Adapters

adapters is a list of adapter IDs that should run. Each ID matches a connector that ktx ships locally:

| Adapter ID | What it ingests |
| --- | --- |
| live-database | Live warehouse introspection (schemas, tables, columns, samples). |
| historic-sql | Query history from Postgres pg_stat_statements, BigQuery INFORMATION_SCHEMA.JOBS, or Snowflake query history. |
| dbt | dbt manifest models, sources, tests, and exposures. |
| metricflow | MetricFlow / Semantic Layer models and metrics. |
| lookml | LookML projects (models, explores, views, joins). |
| looker | Looker dashboards and looks via the API. |
| metabase | Metabase cards, dashboards, and database mappings. |
| notion | Notion pages and databases for wiki context. |
| fake | Test/demo adapter. Useful in fixtures. |

Embeddings

The embeddings block can also appear inside scan.enrichment; that override wins when present.

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| backend | none, openai, or sentence-transformers | none | Embedding provider. none disables embeddings. |
| model | string | - | Provider model ID, for example text-embedding-3-small or all-MiniLM-L6-v2. |
| dimensions | int > 0 | 8 | Vector size. Default 8 is a placeholder that's only valid with backend: none. Set explicitly to match your model (1536 for text-embedding-3-small, 384 for all-MiniLM-L6-v2). |
| openai.api_key / base_url | string | - | OpenAI credentials. Required when backend: openai. |
| sentenceTransformers.base_url | string | "" | URL of the sentence-transformers server. Empty when ktx manages the local daemon for you. |
| sentenceTransformers.pathPrefix | string | - | Optional URL path prefix prepended to embedding requests. |
| batchSize | int > 0 | provider default | Texts per embedding API call. |
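
For a setup without an external embedding API, a sentence-transformers configuration might look like the sketch below. It assumes batchSize sits at the same level as backend and model, as the table's flat field names suggest; the batch size itself is illustrative.

```yaml
ingest:
  embeddings:
    backend: sentence-transformers
    model: all-MiniLM-L6-v2
    dimensions: 384             # must match the model's vector size
    sentenceTransformers:
      base_url: ""              # empty: ktx manages the local daemon
    batchSize: 32               # illustrative value
```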

Work units

A work unit is one unit of agent-driven ingest work (for example one table or one Metabase question). These knobs bound how long it runs and how the run handles failures.

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| workUnits.stepBudget | int > 0 | 40 | Maximum agent steps allowed per work unit before it's force-terminated. |
| workUnits.maxConcurrency | int > 0 | 1 | How many work units run in parallel. |
| workUnits.failureMode | abort or continue | continue | abort stops the whole ingest run on the first failure; continue records it and keeps going. |
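
A conservative profile, for example for a CI smoke test, could tighten all three knobs. The values below are illustrative assumptions rather than tuned recommendations.

```yaml
ingest:
  workUnits:
    stepBudget: 20        # force-terminate a work unit after 20 agent steps
    maxConcurrency: 1     # run work units one at a time
    failureMode: abort    # stop the whole run on the first failure
```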

scan

scan configures how schema-level inputs become structured context: column-level enrichment and inferred relationships between tables.

scan:
  enrichment:
    mode: llm           # none | deterministic | llm
  relationships:
    enabled: true
    llmProposals: true
    validationRequiredForManifest: true
    acceptThreshold: 0.85
    reviewThreshold: 0.55
    maxLlmTablesPerBatch: 40
    maxCandidatesPerColumn: 25
    profileSampleRows: 10000
    profileConcurrency: 4
    validationConcurrency: 4
    validationBudget: all

Enrichment

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| enrichment.mode | none, deterministic, or llm | none | How columns and tables get described. deterministic uses local heuristics; llm calls the configured provider. |
| enrichment.embeddings | embedding block | - | Optional override for enrichment-time vectorization. Falls back to ingest.embeddings. |
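
The override can be sketched as follows: ingest keeps an OpenAI embedding block, while enrichment declares its own. This assumes the nested block accepts the same fields as ingest.embeddings, which the "embedding block" type implies.

```yaml
ingest:
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
scan:
  enrichment:
    mode: deterministic
    embeddings:                 # wins over ingest.embeddings for enrichment
      backend: sentence-transformers
      model: all-MiniLM-L6-v2
      dimensions: 384
```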

Relationships

The relationship discovery step proposes joins between tables, scores them, and optionally validates each one against the database before writing it to the manifest.

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| relationships.enabled | boolean | true | Master switch for relationship discovery. |
| relationships.llmProposals | boolean | true | When true, propose relationships using the LLM in addition to deterministic candidates. |
| relationships.validationRequiredForManifest | boolean | true | When true, only proposals that pass database-side validation reach the manifest. |
| relationships.acceptThreshold | number 0-1 | 0.85 | Confidence at or above which a proposal is auto-accepted. |
| relationships.reviewThreshold | number 0-1 | 0.55 | Confidence at or above which a proposal is surfaced for human review (but not auto-accepted). |
| relationships.maxLlmTablesPerBatch | int > 0 | 40 | Max tables included in a single LLM relationship-proposal batch. |
| relationships.maxCandidatesPerColumn | int > 0 | 25 | Max join partners considered per column. |
| relationships.profileSampleRows | int > 0 | 10000 | Rows sampled per table when profiling values for relationship inference. |
| relationships.profileConcurrency | int > 0 | 4 | Parallel relationship-profile queries against the database. For pooled connectors, effective database concurrency is also bounded by the connection's maxConnections. |
| relationships.validationConcurrency | int > 0 | 4 | Parallel relationship validation queries against the database. |
| relationships.validationBudget | all or int ≥ 0 | runtime default | Cap on validation queries per scan. all means unlimited. |
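
To illustrate how the two thresholds partition proposals, the sketch below raises both and annotates three made-up confidence scores. The scores and the budget are illustrative. What happens to a proposal that falls below reviewThreshold is not specified in this reference, so the comment only notes where it lands.

```yaml
scan:
  relationships:
    enabled: true
    llmProposals: false         # deterministic candidates only
    acceptThreshold: 0.9        # a score of 0.93 is auto-accepted
    reviewThreshold: 0.6        # 0.72 goes to human review; 0.40 is below both
    validationBudget: 200       # at most 200 validation queries per scan
    validationConcurrency: 2
```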

agent

agent carries feature flags for ktx-side agent behavior. Today the only block is run_research, which gates the research agent invoked by ktx mcp and CLI research tools.

agent:
  run_research:
    enabled: true
    max_iterations: 20
    default_toolset:
      - sl_query
      - wiki_search
      - sl_read_source

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| run_research.enabled | boolean | false | Master switch for the research agent. |
| run_research.max_iterations | int ≥ 0 | 20 | Maximum tool-call iterations per research run. |
| run_research.default_toolset | string[] | [sl_query, wiki_search, sl_read_source] | Tool identifiers exposed to the research agent. |
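
A more restricted research agent could drop a tool from the default set and lower the iteration cap. This sketch uses only the tool identifiers listed above; the iteration count is illustrative.

```yaml
agent:
  run_research:
    enabled: true
    max_iterations: 10          # half the default of 20
    default_toolset:            # sl_query deliberately left out
      - wiki_search
      - sl_read_source
```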

memory

memory sets the commit policy for the agent memory subsystem.

memory:
  auto_commit: true

| Field | Type | Default | Purpose |
| --- | --- | --- | --- |
| auto_commit | boolean | true | When true, ktx auto-commits memory updates to the git-backed store. |

A full example

Combining the blocks above:

connections:
  warehouse:
    driver: postgres
    url: env:DATABASE_URL
  metabase:
    driver: metabase
    api_url: https://metabase.example.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "1": warehouse
      syncMode: ALL
setup:
  database_connection_ids:
    - warehouse
storage:
  state: sqlite
  search: sqlite-fts5
  git:
    auto_commit: true
    author: "ktx <ktx@example.com>"
llm:
  provider:
    backend: claude-code
  models:
    default: sonnet
ingest:
  adapters:
    - live-database
    - metabase
  embeddings:
    backend: openai
    model: text-embedding-3-small
    dimensions: 1536
    openai:
      api_key: env:OPENAI_API_KEY
  workUnits:
    maxConcurrency: 2
scan:
  enrichment:
    mode: llm
  relationships:
    acceptThreshold: 0.85
    reviewThreshold: 0.55
agent:
  run_research:
    enabled: true
memory:
  auto_commit: true

Validating your config

ktx validates ktx.yaml strictly: unknown keys at the top level or inside strict blocks cause setup and CLI commands to fail with a precise path (scan.relationships.acceptThreshhold: Unrecognized key). Warehouse connections accept extra driver-specific fields, so passthrough values like historicSql and context.queryHistory are allowed.
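
A fragment like the following would be expected to trigger the error quoted above, because scan.relationships is a strict block and the first key is misspelled. This is a sketch for illustration; the exact wording of the message beyond the quoted path is not shown here.

```yaml
scan:
  relationships:
    acceptThreshhold: 0.85   # misspelled (extra "h"): reported as an unrecognized key
    reviewThreshold: 0.55    # spelled correctly, accepted
```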

To re-validate without running anything else:

ktx status

ktx status parses ktx.yaml, surfaces validation issues, and reports which inputs are ready.