Primary Sources
Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, MongoDB, or Amazon Athena.
ktx connects to your data warehouse or database to build schema context,
discover relationships, and execute semantic layer queries. Each connection is
defined in ktx.yaml under the connections key.
For analytics tools and knowledge systems such as dbt, MetricFlow, LookML, Metabase, Looker, and Notion, use Context Sources. For Claude Code, Codex, Cursor, OpenCode, and other agent clients, use Agent Clients.
All connectors share these conventions:
- Sensitive values support
env:VAR_NAME(read from environment) andfile:/path/to/secret(read from file) references - Connections are read-only; ktx never writes to your database
- Database ingest discovers tables, columns, types, and constraints automatically
Connection field reference
Agents should prefer environment or file references over literal secrets.
| Field | Required | Applies to | Description |
|---|---|---|---|
driver | Yes | all connections | Connector driver such as postgres, snowflake, bigquery, mysql, clickhouse, sqlserver, sqlite, duckdb, mongodb, or athena |
url | One of the connection methods | URL-style connectors | Database URL, env:NAME, or file:/path/to/secret |
host, port, database, username, password | One of the connection methods | PostgreSQL, MySQL, SQL Server | Field-by-field connection values |
schema or schemas | No | schema-aware warehouses | Single schema or list of schemas to scan |
databases | No | ClickHouse, MongoDB, Athena | List of databases to scan |
sample_size, order_by | No | MongoDB | Schema-inference sampling controls (recent documents, sort field) |
context.queryHistory | No | PostgreSQL, Snowflake, BigQuery | Enables query-history ingestion when the warehouse supports it |
path | Yes for path-style SQLite/DuckDB | SQLite, DuckDB | Local SQLite or DuckDB database path or env:NAME reference |
max_bytes_billed | No | BigQuery | Maximum bytes billed per query job |
query_timeout_ms | No | all warehouses | Maximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a query exceeded Ns error so the agent can revise. |
project_id | No | BigQuery | Optional local descriptor and mapping metadata; not used for BigQuery authentication |
region | Yes | Athena | AWS region where the Athena workgroup and Glue catalog reside (e.g. us-east-1) |
s3_staging_dir | Yes | Athena | S3 URI for Athena query result storage (e.g. s3://my-bucket/athena-results/) |
workgroup | No | Athena | Athena workgroup name (default primary) |
catalog | No | Athena | Glue Data Catalog name (default AwsDataCatalog) |
database | No | Athena | Default Glue database name passed as the query execution context |
databases | No | Athena | Glue databases to include in schema scans; written by ktx setup and read by ktx ingest |
PostgreSQL
The most full-featured connector. Supports schema introspection, foreign key detection, column statistics, and query history via pg_stat_statements.
Connection config
connections: my-postgres: driver: postgres url: env:DATABASE_URL schema: public
Or with individual fields:
connections: my-postgres: driver: postgres host: localhost port: 5432 database: analytics username: ktx_reader password: env:PG_PASSWORD schemas: - public - analytics ssl: true
Authentication
| Method | Config |
|---|---|
| Password | password: env:PG_PASSWORD or password: file:/path/to/secret |
| Connection URL | url: env:DATABASE_URL |
| SSL | ssl: true, optionally rejectUnauthorized: false for self-signed certs |
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via pg_catalog |
| Primary keys | Yes | Via information_schema.table_constraints |
| Foreign keys | Yes | Full constraint detection |
| Row count estimates | Yes | Via pg_class.reltuples |
| Column statistics | Yes | Requires pg_read_all_stats role |
| Query history | Yes | Via pg_stat_statements extension |
| Table sampling | Yes | TABLESAMPLE SYSTEM |
Query history
PostgreSQL query history mines real query patterns from pg_stat_statements.
This helps ktx understand how your team actually queries the data.
Requirements:
pg_stat_statementsextension enabledpg_read_all_statsrole granted to the ktx user
Config options:
context: queryHistory: enabled: true minExecutions: 5 filters: dropTrivialProbes: true
Dialect notes
- SQL compilation uses
LIMIT/OFFSETpagination - Named parameters converted to positional (
$1,$2, ...) - Supports
COUNT(*) FILTER (WHERE ...)for null analysis - Full support for PostgreSQL types:
uuid,jsonb,timestamptz,numeric,text[], etc.
Snowflake
Connects via the Snowflake SDK. Supports multi-schema scanning, RSA key authentication, and query-history configuration for Snowflake query history.
Connection config
connections: my-snowflake: driver: snowflake account: xy12345 warehouse: ANALYTICS_WH database: PROD schema_names: - PUBLIC - SALES - MARKETING username: KTX_SERVICE password: env:SNOWFLAKE_PASSWORD role: ANALYST
ktx setup discovers schemas after the connection is verified and writes the
selected list to schema_names. You can also set this field manually. For a
single schema, schema_name: PUBLIC is accepted as an equivalent shorthand.
Authentication
| Method | Config |
|---|---|
| Password | password: env:SNOWFLAKE_PASSWORD |
| RSA key pair | authMethod: rsa, privateKey: file:~/.ssh/snowflake_key.pem, optional passphrase |
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via INFORMATION_SCHEMA.TABLES |
| Primary keys | Yes | Via table constraints |
| Foreign keys | No | Not available in Snowflake |
| Row count estimates | Yes | From INFORMATION_SCHEMA.TABLES.ROW_COUNT |
| Column statistics | No | - |
| Query history | Yes | Via SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY when enabled |
| Table sampling | Yes | - |
Query history
Snowflake query history reads aggregated query-history templates from
SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY and feeds the same unified staged
artifact shape as Postgres and BigQuery.
context: queryHistory: enabled: true windowDays: 90 minExecutions: 5 filters: dropTrivialProbes: true serviceAccounts: patterns: ['^svc_'] mode: exclude redactionPatterns: []
Dialect notes
- All identifiers are uppercase by default (case-insensitive matching)
- Connection context set per query (
USE ROLE,USE WAREHOUSE,USE DATABASE,USE SCHEMA) - Parameter binding uses positional
?placeholders - Date values normalized to ISO 8601 strings
BigQuery
Authenticates via GCP service account credentials. Supports multi-dataset scanning and query-history configuration for INFORMATION_SCHEMA.JOBS_BY_PROJECT.
Connection config
connections: my-bigquery: driver: bigquery credentials_json: file:~/.config/gcloud/bq-service-account.json dataset_id: analytics location: US
For multiple datasets:
dataset_ids: - analytics - marketing - finance
BigQuery dataset scope is stored in connections.<id>.dataset_ids. Interactive
setup discovers datasets from credentials plus location, then writes the chosen
dataset ids as the scan scope.
Cross-project datasets
To introspect a dataset hosted in a different project than the one your
credentials bill to — for example Google's bigquery-public-data, a partner's
shared project, or an organization's central data project — qualify the entry
as project.dataset:
connections: public-bq: driver: bigquery credentials_json: file:~/.config/gcloud/bq-service-account.json location: US dataset_ids: - bigquery-public-data.austin_311 - bigquery-public-data.census_bureau_usa - analytics
ktx introspects each dataset in its host project while every query job still
bills to the project_id inside your credentials_json. A bare dataset entry
(no prefix) is scanned in your own project, exactly as before. A single
connection may mix datasets from several projects, and two projects may host
datasets with the same name without colliding.
Interactive setup does not enumerate datasets in projects your credentials don't
own, so hand-write project.dataset entries for foreign datasets. The wizard's
table picker also only lists datasets in your connection's location region;
this affects table selection only — ingest and discover_data introspect a
cross-project dataset regardless of region.
Authentication
| Method | Config |
|---|---|
| Service account JSON | credentials_json: file:/path/to/key.json |
| Environment variable | credentials_json: env:BIGQUERY_CREDENTIALS_JSON |
The project ID is extracted automatically from the service account JSON file.
If you set project_id in ktx.yaml, ktx treats it as local descriptor and
mapping metadata. The BigQuery connector still authenticates with the
project_id inside credentials_json.
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Including materialized views and external tables |
| Primary keys | Yes | Via INFORMATION_SCHEMA table constraints when declared |
| Foreign keys | No | Not available in BigQuery |
| Row count estimates | Yes | From table metadata |
| Column statistics | No | - |
| Query history | Yes | Via region-scoped INFORMATION_SCHEMA.JOBS_BY_PROJECT when enabled |
| Table sampling | Yes | - |
Query history
BigQuery query history reads aggregated query-history templates from
region-scoped INFORMATION_SCHEMA.JOBS_BY_PROJECT and feeds the same unified
staged artifact shape as Postgres and Snowflake.
context: queryHistory: enabled: true windowDays: 90 minExecutions: 5 filters: dropTrivialProbes: true serviceAccounts: patterns: ['@bot\\.'] mode: exclude redactionPatterns: []
Dialect notes
- Parameter binding uses named
@paramsyntax - Arrays flattened to comma-separated strings in results
- Location specified at query execution time
- Supports the
max_bytes_billedlimit fromktx.yaml; the sharedquery_timeout_msfield maps to the query job'sjobTimeoutMs
Amazon Athena
Connects to Amazon Athena using the AWS Glue Data Catalog for schema introspection and the Athena query API for read-only SQL execution. Authentication uses the standard AWS credential chain — no credentials are embedded in ktx.yaml.
Connection config
connections: my-athena: driver: athena region: us-east-1 s3_staging_dir: s3://my-bucket/athena-results/
With optional fields:
connections: my-athena: driver: athena region: us-east-1 s3_staging_dir: env:ATHENA_S3_STAGING_DIR workgroup: analytics catalog: AwsDataCatalog database: my_default_database databases: - analytics - raw
ktx setup writes the databases array when you select Glue databases during setup. ktx scan reads it to limit introspection to those databases.
Authentication
ktx uses the AWS SDK default credential chain — no credentials appear in ktx.yaml. The chain resolves credentials in this order:
| Method | How to configure |
|---|---|
| Environment variables | Set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN |
| Shared credentials file | Configure ~/.aws/credentials with a [default] or named profile; set AWS_PROFILE to select a non-default profile |
| IAM instance profile | Attach an IAM role to the EC2 instance or ECS task — no local configuration needed |
| IAM roles for service accounts (EKS) | Annotate the pod's service account with the IAM role ARN |
The IAM principal must have athena:StartQueryExecution, athena:GetQueryExecution, athena:GetQueryResults, glue:GetDatabases, and glue:GetTables permissions, plus read access to the S3 results bucket.
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via AWS Glue Data Catalog |
| Primary keys | No | Glue does not expose constraint metadata |
| Foreign keys | No | Not available in Glue/Athena |
| Row count estimates | No | Glue table statistics are often stale |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | SELECT ... LIMIT n |
Dialect notes
- SQL dialect is Presto/Trino; identifiers are quoted with double-quotes
- Table names use three-part format:
"catalog"."database"."table"(e.g."AwsDataCatalog"."analytics"."orders") - Partition columns (
PartitionKeysin Glue) are included after regular columns in the schema and are fully queryable - Athena does not support
TABLESAMPLE; random sampling usesORDER BY rand() - Query execution is asynchronous: ktx starts the query, polls until completion, then fetches results from S3
- Results are stored in
s3_staging_dir; the IAM principal needs write access to that bucket - Use
workgroupto apply per-workgroup cost controls and result configuration - The connector always uses your account's default Glue Data Catalog; cross-account catalog access (
CatalogIdpointing to another account) is not supported
MySQL
Standard MySQL/MariaDB connector with full foreign key support and schema introspection.
Connection config
connections: my-mysql: driver: mysql url: env:MYSQL_DATABASE_URL
MySQL supports selecting one or more databases during ktx setup. The selected
database scope is stored in connections.<id>.schemas, and ktx scan reads
exactly those databases.
Or with individual fields:
connections: my-mysql: driver: mysql host: mysql.internal port: 3306 database: analytics username: ktx_reader password: env:MYSQL_PASSWORD ssl: true
Authentication
| Method | Config |
|---|---|
| Password | password: env:MYSQL_PASSWORD or password: file:/path/to/secret |
| SSL | ssl: true or ssl: { rejectUnauthorized: false } |
| URL parameters | ?ssl=true or ?sslmode=required in connection URL |
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via INFORMATION_SCHEMA.TABLES |
| Primary keys | Yes | Via KEY_COLUMN_USAGE |
| Foreign keys | Yes | Via REFERENTIAL_CONSTRAINTS |
| Row count estimates | Yes | From TABLE_ROWS (InnoDB estimate) |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Uses RAND() filter |
Dialect notes
- Parameter binding uses positional
?placeholders - Uses
LIMIT X OFFSET Yfor pagination - Multi-database scanning uses
schemasas the selected database list - Supports 20+ MySQL types including
enum,json,datetime,decimal - Table comments extracted with InnoDB metadata prefix stripping
ClickHouse
Connects to ClickHouse over HTTP. Supports table and column introspection across one or more selected databases.
Connection config
connections: my-clickhouse: driver: clickhouse url: env:CLICKHOUSE_DATABASE_URL database: analytics
For multiple databases:
databases: - analytics - mart
ClickHouse supports selecting one or more databases during ktx setup. The
selected scan scope is stored in connections.<id>.databases. The single
database field remains the connection default for raw SQL and ktx sql.
Authentication
| Method | Config |
|---|---|
| URL | url: env:CLICKHOUSE_DATABASE_URL |
| Password | password: env:CLICKHOUSE_PASSWORD or password: file:/path/to/secret |
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via system.tables |
| Primary keys | No | Not exposed as relational constraints |
| Foreign keys | No | Not available in ClickHouse |
| Row count estimates | Yes | From ClickHouse metadata where available |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Uses ClickHouse sampling syntax when supported |
Dialect notes
- Parameter binding uses named placeholders
- The
databasefield sets the default database for SQL execution - The
databasesarray controls the scan scope
SQL Server
Connects to Microsoft SQL Server and Azure SQL. Supports multi-schema scanning with dbo as the default schema.
Connection config
connections: my-sqlserver: driver: sqlserver url: env:SQLSERVER_DATABASE_URL
Or with individual fields:
connections: my-sqlserver: driver: sqlserver host: sql.internal port: 1433 database: Analytics username: ktx_reader password: env:MSSQL_PASSWORD schema: dbo trustServerCertificate: true
For multiple schemas:
schemas: - dbo - analytics - staging
Authentication
| Method | Config |
|---|---|
| SQL Server auth | username + password |
| Encrypted connection | Always enabled, trustServerCertificate: true for self-signed |
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via INFORMATION_SCHEMA.TABLES |
| Primary keys | Yes | Via TABLE_CONSTRAINTS and KEY_COLUMN_USAGE |
| Foreign keys | Yes | Via REFERENTIAL_CONSTRAINTS |
| Row count estimates | Yes | Via sys.dm_db_partition_stats |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
Dialect notes
- Parameter binding uses
@paramNamesyntax - Row limiting uses
SELECT TOP N * FROM (query) AS ktx_query_result - Encryption is always required; certificate validation is optional
- Multi-schema support with per-schema isolation
SQLite
File-based connector using better-sqlite3. Ideal for local development, embedded analytics, or testing.
Connection config
connections: my-sqlite: driver: sqlite path: ./data/warehouse.sqlite
Path supports multiple formats:
# Relative path (resolved against project directory) path: ./warehouse.sqlite # Absolute path path: /var/data/analytics.db # Home directory expansion path: ~/data/warehouse.sqlite # Environment variable path: env:SQLITE_DB_PATH # URL format url: sqlite:///path/to/db.sqlite
Authentication
No authentication required - SQLite is file-based. The file must be readable by the process running ktx.
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via sqlite_master |
| Primary keys | Yes | Via PRAGMA table_info() |
| Foreign keys | Yes | Via PRAGMA foreign_key_list() (requires PRAGMA foreign_keys = ON) |
| Row count estimates | Yes | Exact count via SELECT COUNT(*) |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
Dialect notes
- Synchronous query execution (no connection pooling)
- Parameter binding uses
:paramNamesyntax - Uses
LIMIT X OFFSET Yfor pagination - SQLite type affinity system:
TEXT,NUMERIC,INTEGER,REAL,BLOB - Foreign key enforcement requires explicit
PRAGMA foreign_keys = ON - Database file must exist before
ktx connection testor ingest runs
DuckDB
File-based connector using the DuckDB Node.js API. Ideal for local analytics, embedded warehouses, and cross-database federation.
Connection config
connections: warehouse: driver: duckdb path: data/warehouse.duckdb
path is resolved relative to the project directory. The .duckdb file must already exist — ktx never creates a missing database file.
Authentication
No authentication required — DuckDB is file-based. The .duckdb file must be readable by the process running ktx.
Features
| Feature | Supported | Notes |
|---|---|---|
| Tables & views | Yes | Via information_schema on the main schema |
| Primary keys | Yes | Via information_schema.table_constraints |
| Foreign keys | Yes | Via DuckDB's duckdb_constraints() catalog function |
| Row count estimates | Yes | Exact count via SELECT COUNT(*) |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | - |
| Nested analysis | No | - |
Dialect notes
- Introspection scans the
mainschema only - Execution is read-only; ktx opens the file without write access
- Parameter binding uses positional
?placeholders - Uses
LIMIT X OFFSET Yfor pagination - Database file must exist before
ktx connection testor ingest runs
Cross-database federation
When a project declares two or more attach-compatible connections — any combination of postgres, mysql, sqlite, and duckdb — ktx derives a cross-database federation connection. That connection can ATTACH a native .duckdb file, allowing semantic queries to join across sources without manually copying data.
MongoDB
Connects to MongoDB as a primary context source. ktx treats each collection
as a table and each inferred top-level field as a column. MongoDB is a non-SQL
source: ktx sql and semantic-layer metric compilation do not apply to a MongoDB
connection, but its collections still flow through ktx ingest, descriptions, and
relationship discovery.
Connection config
connections: mongo-prod: driver: mongodb url: env:MONGO_URL databases: [app] enabled_tables: [app.users, app.orders] # optional collection allowlist sample_size: 1000 # order_by: createdAt # only when _id is not an ObjectId
Standard mongodb:// and mongodb+srv:// connection strings are supported,
including TLS and MongoDB Atlas — pass the full connection string (with its
query parameters) as url. The databases list selects which databases to
introspect; if omitted, ktx uses the database in the URL path. ktx setup
also offers MongoDB and stores the selected databases under
connections.<id>.databases.
Authentication
| Method | Config |
|---|---|
| Connection string | url: env:MONGO_URL or url: file:/path/to/secret |
| Atlas / TLS | Use a mongodb+srv:// URL with the credentials and TLS options Atlas provides |
Schema inference
MongoDB has no fixed schema, so ktx infers one by sampling the most recent
sample_size documents per collection (default 1000), sorted by _id
descending. Because an ObjectId embeds its creation time, this captures the
collection's current shape with zero configuration. When _id is not an
ObjectId (custom string or UUID keys), set order_by to a timestamp field such
as createdAt so "most recent" is well-defined. A custom order_by field
should be indexed — an unindexed sort hits MongoDB's in-memory sort limit and
fails on large collections (_id, the default, is always indexed).
For each top-level field, ktx unions the BSON types seen and derives nullability from how often the field is present:
- Scalar BSON types map to
string,number,time, orboolean - A field seen with more than one type is recorded as
mixedand treated as a string - Sub-documents and arrays become a single opaque
jsoncolumn (no dotted-path columns); their sampled values are stringified, not faithfully serialized _idis the primary key
Features
| Feature | Supported | Notes |
|---|---|---|
| Collections (as tables) | Yes | Via listCollections; system.* collections are excluded |
| Primary keys | Yes | _id |
| Foreign keys | No | MongoDB has no formal foreign keys |
| Row count estimates | Yes | Via estimatedDocumentCount |
| Column statistics | No | - |
| Query history | No | - |
| Table sampling | Yes | Reads the most recent documents |
| Nested analysis | Yes | Sub-documents and arrays modeled as opaque json |
Read-only SQL (ktx sql) | No | MongoDB is not a SQL source |
Dialect notes
- Strictly read-only: the connector only issues
find,listCollections,estimatedDocumentCount, and read aggregations - Sampling rides the
_idindex and uses a server-side time limit so large collections do not stall a run; a customorder_bymust be indexed for the same guarantee sample_sizetrades inference coverage for speed; raise it for collections with highly variable documents
Common errors
| Error or symptom | Likely cause | Recovery |
|---|---|---|
| Connection URL appears in git diff | A literal credential URL was written to ktx.yaml | Replace it with env:NAME or file:/path/to/secret and rotate exposed credentials |
| Database ingest returns no tables | Schema, database, or project filter is wrong, or the user lacks metadata permissions | Verify the schema list and grant metadata read permissions. For Athena, confirm the IAM principal has glue:GetDatabases and glue:GetTables permissions |
| Query history is empty | Query history extension or warehouse history view is unavailable | Enable the warehouse-specific history feature, then rerun ktx ingest <connectionId> --query-history or ktx setup |
| Column statistics are missing | Connector cannot access stats tables or the warehouse does not expose them | Grant stats permissions where supported; otherwise rely on schema-level context without column statistics |
| Semantic query execution fails | Connection is missing, unreachable, or query execution is disabled | Run ktx connection test <id> and check the ktx sl query flags |
Athena query fails with ACCESS_DENIED | IAM principal lacks athena:StartQueryExecution or S3 write access to s3_staging_dir | Attach a policy granting Athena query permissions and s3:PutObject on the staging bucket |
| Athena ingest finds databases but no tables | IAM principal has glue:GetDatabases but not glue:GetTables | Grant glue:GetTables on the relevant Glue catalog resources |