Primary Sources

Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, MongoDB, or Amazon Athena.

ktx connects to your data warehouse or database to build schema context, discover relationships, and execute semantic layer queries. Each connection is defined in ktx.yaml under the connections key.

For analytics tools and knowledge systems such as dbt, MetricFlow, LookML, Metabase, Looker, and Notion, use Context Sources. For Claude Code, Codex, Cursor, OpenCode, and other agent clients, use Agent Clients.

All connectors share these conventions:

Sensitive values support env:VAR_NAME (read from environment) and file:/path/to/secret (read from file) references
Connections are read-only; ktx never writes to your database
Database ingest discovers tables, columns, types, and constraints automatically

Connection field reference

Agents should prefer environment or file references over literal secrets.

Field	Required	Applies to	Description
`driver`	Yes	all connections	Connector driver such as `postgres`, `snowflake`, `bigquery`, `mysql`, `clickhouse`, `sqlserver`, `sqlite`, `duckdb`, `mongodb`, or `athena`
`url`	One of the connection methods	URL-style connectors	Database URL, `env:NAME`, or `file:/path/to/secret`
`host`, `port`, `database`, `username`, `password`	One of the connection methods	PostgreSQL, MySQL, SQL Server	Field-by-field connection values
`schema` or `schemas`	No	schema-aware warehouses	Single schema or list of schemas to scan
`databases`	No	ClickHouse, MongoDB, Athena	List of databases to scan
`sample_size`, `order_by`	No	MongoDB	Schema-inference sampling controls (recent documents, sort field)
`context.queryHistory`	No	PostgreSQL, Snowflake, BigQuery	Enables query-history ingestion when the warehouse supports it
`path`	Yes for path-style SQLite/DuckDB	SQLite, DuckDB	Local SQLite or DuckDB database path or `env:NAME` reference
`max_bytes_billed`	No	BigQuery	Maximum bytes billed per query job
`query_timeout_ms`	No	all warehouses	Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a `query exceeded Ns` error so the agent can revise.
`project_id`	No	BigQuery	Optional local descriptor and mapping metadata; not used for BigQuery authentication
`region`	Yes	Athena	AWS region where the Athena workgroup and Glue catalog reside (e.g. `us-east-1`)
`s3_staging_dir`	Yes	Athena	S3 URI for Athena query result storage (e.g. `s3://my-bucket/athena-results/`)
`workgroup`	No	Athena	Athena workgroup name (default `primary`)
`catalog`	No	Athena	Glue Data Catalog name (default `AwsDataCatalog`)
`database`	No	Athena	Default Glue database name passed as the query execution context
`databases`	No	Athena	Glue databases to include in schema scans; written by `ktx setup` and read by `ktx ingest`

PostgreSQL

The most full-featured connector. Supports schema introspection, foreign key detection, column statistics, and query history via pg_stat_statements.

Connection config

yamlktx.yaml

connections:
  my-postgres:
    driver: postgres
    url: env:DATABASE_URL
    schema: public

Or with individual fields:

yamlktx.yaml

connections:
  my-postgres:
    driver: postgres
    host: localhost
    port: 5432
    database: analytics
    username: ktx_reader
    password: env:PG_PASSWORD
    schemas:
      - public
      - analytics
    ssl: true

Authentication

Method	Config
Password	`password: env:PG_PASSWORD` or `password: file:/path/to/secret`
Connection URL	`url: env:DATABASE_URL`
SSL	`ssl: true`, optionally `rejectUnauthorized: false` for self-signed certs

Features

Feature	Supported	Notes
Tables & views	Yes	Via `pg_catalog`
Primary keys	Yes	Via `information_schema.table_constraints`
Foreign keys	Yes	Full constraint detection
Row count estimates	Yes	Via `pg_class.reltuples`
Column statistics	Yes	Requires `pg_read_all_stats` role
Query history	Yes	Via `pg_stat_statements` extension
Table sampling	Yes	`TABLESAMPLE SYSTEM`

Query history

PostgreSQL query history mines real query patterns from pg_stat_statements. This helps ktx understand how your team actually queries the data.

Requirements:

pg_stat_statements extension enabled
pg_read_all_stats role granted to the ktx user

Config options:

    context:
      queryHistory:
        enabled: true
        minExecutions: 5
        filters:
          dropTrivialProbes: true

Dialect notes

SQL compilation uses LIMIT/OFFSET pagination
Named parameters converted to positional ($1, $2, ...)
Supports COUNT(*) FILTER (WHERE ...) for null analysis
Full support for PostgreSQL types: uuid, jsonb, timestamptz, numeric, text[], etc.

Snowflake

Connects via the Snowflake SDK. Supports multi-schema scanning, RSA key authentication, and query-history configuration for Snowflake query history.

Connection config

yamlktx.yaml

connections:
  my-snowflake:
    driver: snowflake
    account: xy12345
    warehouse: ANALYTICS_WH
    database: PROD
    schema_names:
      - PUBLIC
      - SALES
      - MARKETING
    username: KTX_SERVICE
    password: env:SNOWFLAKE_PASSWORD
    role: ANALYST

ktx setup discovers schemas after the connection is verified and writes the selected list to schema_names. You can also set this field manually. For a single schema, schema_name: PUBLIC is accepted as an equivalent shorthand.

Authentication

Method	Config
Password	`password: env:SNOWFLAKE_PASSWORD`
RSA key pair	`authMethod: rsa`, `privateKey: file:~/.ssh/snowflake_key.pem`, optional `passphrase`

Features

Feature	Supported	Notes
Tables & views	Yes	Via `INFORMATION_SCHEMA.TABLES`
Primary keys	Yes	Via table constraints
Foreign keys	No	Not available in Snowflake
Row count estimates	Yes	From `INFORMATION_SCHEMA.TABLES.ROW_COUNT`
Column statistics	No	-
Query history	Yes	Via `SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY` when enabled
Table sampling	Yes	-

Query history

Snowflake query history reads aggregated query-history templates from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY and feeds the same unified staged artifact shape as Postgres and BigQuery.

    context:
      queryHistory:
        enabled: true
        windowDays: 90
        minExecutions: 5
        filters:
          dropTrivialProbes: true
          serviceAccounts:
            patterns: ['^svc_']
            mode: exclude
        redactionPatterns: []

Dialect notes

All identifiers are uppercase by default (case-insensitive matching)
Connection context set per query (USE ROLE, USE WAREHOUSE, USE DATABASE, USE SCHEMA)
Parameter binding uses positional ? placeholders
Date values normalized to ISO 8601 strings

BigQuery

Authenticates via GCP service account credentials. Supports multi-dataset scanning and query-history configuration for INFORMATION_SCHEMA.JOBS_BY_PROJECT.

Connection config

yamlktx.yaml

connections:
  my-bigquery:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    dataset_id: analytics
    location: US

For multiple datasets:

    dataset_ids:
      - analytics
      - marketing
      - finance

BigQuery dataset scope is stored in connections.<id>.dataset_ids. Interactive setup discovers datasets from credentials plus location, then writes the chosen dataset ids as the scan scope.

Cross-project datasets

To introspect a dataset hosted in a different project than the one your credentials bill to — for example Google's bigquery-public-data, a partner's shared project, or an organization's central data project — qualify the entry as project.dataset:

yamlktx.yaml

connections:
  public-bq:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    location: US
    dataset_ids:
      - bigquery-public-data.austin_311
      - bigquery-public-data.census_bureau_usa
      - analytics

ktx introspects each dataset in its host project while every query job still bills to the project_id inside your credentials_json. A bare dataset entry (no prefix) is scanned in your own project, exactly as before. A single connection may mix datasets from several projects, and two projects may host datasets with the same name without colliding.

Interactive setup does not enumerate datasets in projects your credentials don't own, so hand-write project.dataset entries for foreign datasets. The wizard's table picker also only lists datasets in your connection's location region; this affects table selection only — ingest and discover_data introspect a cross-project dataset regardless of region.

Authentication

Method	Config
Service account JSON	`credentials_json: file:/path/to/key.json`
Environment variable	`credentials_json: env:BIGQUERY_CREDENTIALS_JSON`

The project ID is extracted automatically from the service account JSON file. If you set project_id in ktx.yaml, ktx treats it as local descriptor and mapping metadata. The BigQuery connector still authenticates with the project_id inside credentials_json.

Features

Feature	Supported	Notes
Tables & views	Yes	Including materialized views and external tables
Primary keys	Yes	Via `INFORMATION_SCHEMA` table constraints when declared
Foreign keys	No	Not available in BigQuery
Row count estimates	Yes	From table metadata
Column statistics	No	-
Query history	Yes	Via region-scoped `INFORMATION_SCHEMA.JOBS_BY_PROJECT` when enabled
Table sampling	Yes	-

Query history

BigQuery query history reads aggregated query-history templates from region-scoped INFORMATION_SCHEMA.JOBS_BY_PROJECT and feeds the same unified staged artifact shape as Postgres and Snowflake.

    context:
      queryHistory:
        enabled: true
        windowDays: 90
        minExecutions: 5
        filters:
          dropTrivialProbes: true
          serviceAccounts:
            patterns: ['@bot\\.']
            mode: exclude
        redactionPatterns: []

Dialect notes

Parameter binding uses named @param syntax
Arrays flattened to comma-separated strings in results
Location specified at query execution time
Supports the max_bytes_billed limit from ktx.yaml; the shared query_timeout_ms field maps to the query job's jobTimeoutMs

Amazon Athena

Connects to Amazon Athena using the AWS Glue Data Catalog for schema introspection and the Athena query API for read-only SQL execution. Authentication uses the standard AWS credential chain — no credentials are embedded in ktx.yaml.

Connection config

yamlktx.yaml

connections:
  my-athena:
    driver: athena
    region: us-east-1
    s3_staging_dir: s3://my-bucket/athena-results/

With optional fields:

yamlktx.yaml

connections:
  my-athena:
    driver: athena
    region: us-east-1
    s3_staging_dir: env:ATHENA_S3_STAGING_DIR
    workgroup: analytics
    catalog: AwsDataCatalog
    database: my_default_database
    databases:
      - analytics
      - raw

ktx setup writes the databases array when you select Glue databases during setup. ktx scan reads it to limit introspection to those databases.

Authentication

ktx uses the AWS SDK default credential chain — no credentials appear in ktx.yaml. The chain resolves credentials in this order:

Method	How to configure
Environment variables	Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN`
Shared credentials file	Configure `~/.aws/credentials` with a `[default]` or named profile; set `AWS_PROFILE` to select a non-default profile
IAM instance profile	Attach an IAM role to the EC2 instance or ECS task — no local configuration needed
IAM roles for service accounts (EKS)	Annotate the pod's service account with the IAM role ARN

The IAM principal must have athena:StartQueryExecution, athena:GetQueryExecution, athena:GetQueryResults, glue:GetDatabases, and glue:GetTables permissions, plus read access to the S3 results bucket.

Features

Feature	Supported	Notes
Tables & views	Yes	Via AWS Glue Data Catalog
Primary keys	No	Glue does not expose constraint metadata
Foreign keys	No	Not available in Glue/Athena
Row count estimates	No	Glue table statistics are often stale
Column statistics	No	-
Query history	No	-
Table sampling	Yes	`SELECT ... LIMIT n`

Dialect notes

SQL dialect is Presto/Trino; identifiers are quoted with double-quotes
Table names use three-part format: "catalog"."database"."table" (e.g. "AwsDataCatalog"."analytics"."orders")
Partition columns (PartitionKeys in Glue) are included after regular columns in the schema and are fully queryable
Athena does not support TABLESAMPLE; random sampling uses ORDER BY rand()
Query execution is asynchronous: ktx starts the query, polls until completion, then fetches results from S3
Results are stored in s3_staging_dir; the IAM principal needs write access to that bucket
Use workgroup to apply per-workgroup cost controls and result configuration
The connector always uses your account's default Glue Data Catalog; cross-account catalog access (CatalogId pointing to another account) is not supported

MySQL

Standard MySQL/MariaDB connector with full foreign key support and schema introspection.

Connection config

yamlktx.yaml

connections:
  my-mysql:
    driver: mysql
    url: env:MYSQL_DATABASE_URL

MySQL supports selecting one or more databases during ktx setup. The selected database scope is stored in connections.<id>.schemas, and ktx scan reads exactly those databases.

Or with individual fields:

yamlktx.yaml

connections:
  my-mysql:
    driver: mysql
    host: mysql.internal
    port: 3306
    database: analytics
    username: ktx_reader
    password: env:MYSQL_PASSWORD
    ssl: true

Authentication

Method	Config
Password	`password: env:MYSQL_PASSWORD` or `password: file:/path/to/secret`
SSL	`ssl: true` or `ssl: { rejectUnauthorized: false }`
URL parameters	`?ssl=true` or `?sslmode=required` in connection URL

Features

Feature	Supported	Notes
Tables & views	Yes	Via `INFORMATION_SCHEMA.TABLES`
Primary keys	Yes	Via `KEY_COLUMN_USAGE`
Foreign keys	Yes	Via `REFERENTIAL_CONSTRAINTS`
Row count estimates	Yes	From `TABLE_ROWS` (InnoDB estimate)
Column statistics	No	-
Query history	No	-
Table sampling	Yes	Uses `RAND()` filter

Dialect notes

Parameter binding uses positional ? placeholders
Uses LIMIT X OFFSET Y for pagination
Multi-database scanning uses schemas as the selected database list
Supports 20+ MySQL types including enum, json, datetime, decimal
Table comments extracted with InnoDB metadata prefix stripping

ClickHouse

Connects to ClickHouse over HTTP. Supports table and column introspection across one or more selected databases.

Connection config

yamlktx.yaml

connections:
  my-clickhouse:
    driver: clickhouse
    url: env:CLICKHOUSE_DATABASE_URL
    database: analytics

For multiple databases:

    databases:
      - analytics
      - mart

ClickHouse supports selecting one or more databases during ktx setup. The selected scan scope is stored in connections.<id>.databases. The single database field remains the connection default for raw SQL and ktx sql.

Authentication

Method	Config
URL	`url: env:CLICKHOUSE_DATABASE_URL`
Password	`password: env:CLICKHOUSE_PASSWORD` or `password: file:/path/to/secret`

Features

Feature	Supported	Notes
Tables & views	Yes	Via `system.tables`
Primary keys	No	Not exposed as relational constraints
Foreign keys	No	Not available in ClickHouse
Row count estimates	Yes	From ClickHouse metadata where available
Column statistics	No	-
Query history	No	-
Table sampling	Yes	Uses ClickHouse sampling syntax when supported

Dialect notes

Parameter binding uses named placeholders
The database field sets the default database for SQL execution
The databases array controls the scan scope

SQL Server

Connects to Microsoft SQL Server and Azure SQL. Supports multi-schema scanning with dbo as the default schema.

Connection config

yamlktx.yaml

connections:
  my-sqlserver:
    driver: sqlserver
    url: env:SQLSERVER_DATABASE_URL

Or with individual fields:

yamlktx.yaml

connections:
  my-sqlserver:
    driver: sqlserver
    host: sql.internal
    port: 1433
    database: Analytics
    username: ktx_reader
    password: env:MSSQL_PASSWORD
    schema: dbo
    trustServerCertificate: true

For multiple schemas:

    schemas:
      - dbo
      - analytics
      - staging

Authentication

Method	Config
SQL Server auth	`username` + `password`
Encrypted connection	Always enabled, `trustServerCertificate: true` for self-signed

Features

Feature	Supported	Notes
Tables & views	Yes	Via `INFORMATION_SCHEMA.TABLES`
Primary keys	Yes	Via `TABLE_CONSTRAINTS` and `KEY_COLUMN_USAGE`
Foreign keys	Yes	Via `REFERENTIAL_CONSTRAINTS`
Row count estimates	Yes	Via `sys.dm_db_partition_stats`
Column statistics	No	-
Query history	No	-
Table sampling	Yes	-
Nested analysis	No	-

Dialect notes

Parameter binding uses @paramName syntax
Row limiting uses SELECT TOP N * FROM (query) AS ktx_query_result
Encryption is always required; certificate validation is optional
Multi-schema support with per-schema isolation

SQLite

File-based connector using better-sqlite3. Ideal for local development, embedded analytics, or testing.

Connection config

yamlktx.yaml

connections:
  my-sqlite:
    driver: sqlite
    path: ./data/warehouse.sqlite

Path supports multiple formats:

# Relative path (resolved against project directory)
path: ./warehouse.sqlite

# Absolute path
path: /var/data/analytics.db

# Home directory expansion
path: ~/data/warehouse.sqlite

# Environment variable
path: env:SQLITE_DB_PATH

# URL format
url: sqlite:///path/to/db.sqlite

Authentication

No authentication required - SQLite is file-based. The file must be readable by the process running ktx.

Features

Feature	Supported	Notes
Tables & views	Yes	Via `sqlite_master`
Primary keys	Yes	Via `PRAGMA table_info()`
Foreign keys	Yes	Via `PRAGMA foreign_key_list()` (requires `PRAGMA foreign_keys = ON`)
Row count estimates	Yes	Exact count via `SELECT COUNT(*)`
Column statistics	No	-
Query history	No	-
Table sampling	Yes	-
Nested analysis	No	-

Dialect notes

Synchronous query execution (no connection pooling)
Parameter binding uses :paramName syntax
Uses LIMIT X OFFSET Y for pagination
SQLite type affinity system: TEXT, NUMERIC, INTEGER, REAL, BLOB
Foreign key enforcement requires explicit PRAGMA foreign_keys = ON
Database file must exist before ktx connection test or ingest runs

DuckDB

File-based connector using the DuckDB Node.js API. Ideal for local analytics, embedded warehouses, and cross-database federation.

Connection config

yamlktx.yaml

connections:
  warehouse:
    driver: duckdb
    path: data/warehouse.duckdb

path is resolved relative to the project directory. The .duckdb file must already exist — ktx never creates a missing database file.

Authentication

No authentication required — DuckDB is file-based. The .duckdb file must be readable by the process running ktx.

Features

Feature	Supported	Notes
Tables & views	Yes	Via `information_schema` on the `main` schema
Primary keys	Yes	Via `information_schema.table_constraints`
Foreign keys	Yes	Via DuckDB's `duckdb_constraints()` catalog function
Row count estimates	Yes	Exact count via `SELECT COUNT(*)`
Column statistics	No	-
Query history	No	-
Table sampling	Yes	-
Nested analysis	No	-

Dialect notes

Introspection scans the main schema only
Execution is read-only; ktx opens the file without write access
Parameter binding uses positional ? placeholders
Uses LIMIT X OFFSET Y for pagination
Database file must exist before ktx connection test or ingest runs

Cross-database federation

When a project declares two or more attach-compatible connections — any combination of postgres, mysql, sqlite, and duckdb — ktx derives a cross-database federation connection. That connection can ATTACH a native .duckdb file, allowing semantic queries to join across sources without manually copying data.

MongoDB

Connects to MongoDB as a primary context source. ktx treats each collection as a table and each inferred top-level field as a column. MongoDB is a non-SQL source: ktx sql and semantic-layer metric compilation do not apply to a MongoDB connection, but its collections still flow through ktx ingest, descriptions, and relationship discovery.

Connection config

yamlktx.yaml

connections:
  mongo-prod:
    driver: mongodb
    url: env:MONGO_URL
    databases: [app]
    enabled_tables: [app.users, app.orders] # optional collection allowlist
    sample_size: 1000
    # order_by: createdAt                    # only when _id is not an ObjectId

Standard mongodb:// and mongodb+srv:// connection strings are supported, including TLS and MongoDB Atlas — pass the full connection string (with its query parameters) as url. The databases list selects which databases to introspect; if omitted, ktx uses the database in the URL path. ktx setup also offers MongoDB and stores the selected databases under connections.<id>.databases.

Authentication

Method	Config
Connection string	`url: env:MONGO_URL` or `url: file:/path/to/secret`
Atlas / TLS	Use a `mongodb+srv://` URL with the credentials and TLS options Atlas provides

Schema inference

MongoDB has no fixed schema, so ktx infers one by sampling the most recent sample_size documents per collection (default 1000), sorted by _id descending. Because an ObjectId embeds its creation time, this captures the collection's current shape with zero configuration. When _id is not an ObjectId (custom string or UUID keys), set order_by to a timestamp field such as createdAt so "most recent" is well-defined. A custom order_by field should be indexed — an unindexed sort hits MongoDB's in-memory sort limit and fails on large collections (_id, the default, is always indexed).

For each top-level field, ktx unions the BSON types seen and derives nullability from how often the field is present:

Scalar BSON types map to string, number, time, or boolean
A field seen with more than one type is recorded as mixed and treated as a string
Sub-documents and arrays become a single opaque json column (no dotted-path columns); their sampled values are stringified, not faithfully serialized
_id is the primary key

Features

Feature	Supported	Notes
Collections (as tables)	Yes	Via `listCollections`; `system.*` collections are excluded
Primary keys	Yes	`_id`
Foreign keys	No	MongoDB has no formal foreign keys
Row count estimates	Yes	Via `estimatedDocumentCount`
Column statistics	No	-
Query history	No	-
Table sampling	Yes	Reads the most recent documents
Nested analysis	Yes	Sub-documents and arrays modeled as opaque `json`
Read-only SQL (`ktx sql`)	No	MongoDB is not a SQL source

Dialect notes

Strictly read-only: the connector only issues find, listCollections, estimatedDocumentCount, and read aggregations
Sampling rides the _id index and uses a server-side time limit so large collections do not stall a run; a custom order_by must be indexed for the same guarantee
sample_size trades inference coverage for speed; raise it for collections with highly variable documents

Common errors

Error or symptom	Likely cause	Recovery
Connection URL appears in git diff	A literal credential URL was written to `ktx.yaml`	Replace it with `env:NAME` or `file:/path/to/secret` and rotate exposed credentials
Database ingest returns no tables	Schema, database, or project filter is wrong, or the user lacks metadata permissions	Verify the schema list and grant metadata read permissions. For Athena, confirm the IAM principal has `glue:GetDatabases` and `glue:GetTables` permissions
Query history is empty	Query history extension or warehouse history view is unavailable	Enable the warehouse-specific history feature, then rerun `ktx ingest <connectionId> --query-history` or `ktx setup`
Column statistics are missing	Connector cannot access stats tables or the warehouse does not expose them	Grant stats permissions where supported; otherwise rely on schema-level context without column statistics
Semantic query execution fails	Connection is missing, unreachable, or query execution is disabled	Run `ktx connection test <id>` and check the `ktx sl query` flags
Athena query fails with `ACCESS_DENIED`	IAM principal lacks `athena:StartQueryExecution` or S3 write access to `s3_staging_dir`	Attach a policy granting Athena query permissions and `s3:PutObject` on the staging bucket
Athena ingest finds databases but no tables	IAM principal has `glue:GetDatabases` but not `glue:GetTables`	Grant `glue:GetTables` on the relevant Glue catalog resources

On this page