Integrations

Primary Sources

Connect ktx to PostgreSQL, Snowflake, BigQuery, MySQL, ClickHouse, SQL Server, SQLite, DuckDB, MongoDB, or Amazon Athena.

ktx connects to your data warehouse or database to build schema context, discover relationships, and execute semantic layer queries. Each connection is defined in ktx.yaml under the connections key.

For analytics tools and knowledge systems such as dbt, MetricFlow, LookML, Metabase, Looker, and Notion, use Context Sources. For Claude Code, Codex, Cursor, OpenCode, and other agent clients, use Agent Clients.

All connectors share these conventions:

  • Sensitive values support env:VAR_NAME (read from environment) and file:/path/to/secret (read from file) references
  • Connections are read-only; ktx never writes to your database
  • Database ingest discovers tables, columns, types, and constraints automatically

Connection field reference

Agents should prefer environment or file references over literal secrets.

FieldRequiredApplies toDescription
driverYesall connectionsConnector driver such as postgres, snowflake, bigquery, mysql, clickhouse, sqlserver, sqlite, duckdb, mongodb, or athena
urlOne of the connection methodsURL-style connectorsDatabase URL, env:NAME, or file:/path/to/secret
host, port, database, username, passwordOne of the connection methodsPostgreSQL, MySQL, SQL ServerField-by-field connection values
schema or schemasNoschema-aware warehousesSingle schema or list of schemas to scan
databasesNoClickHouse, MongoDB, AthenaList of databases to scan
sample_size, order_byNoMongoDBSchema-inference sampling controls (recent documents, sort field)
context.queryHistoryNoPostgreSQL, Snowflake, BigQueryEnables query-history ingestion when the warehouse supports it
pathYes for path-style SQLite/DuckDBSQLite, DuckDBLocal SQLite or DuckDB database path or env:NAME reference
max_bytes_billedNoBigQueryMaximum bytes billed per query job
query_timeout_msNoall warehousesMaximum execution time for a single read-only query, in milliseconds (default 30000). A query exceeding it is cancelled server-side (or, for SQLite, by terminating the off-process executor) and returns a query exceeded Ns error so the agent can revise.
project_idNoBigQueryOptional local descriptor and mapping metadata; not used for BigQuery authentication
regionYesAthenaAWS region where the Athena workgroup and Glue catalog reside (e.g. us-east-1)
s3_staging_dirYesAthenaS3 URI for Athena query result storage (e.g. s3://my-bucket/athena-results/)
workgroupNoAthenaAthena workgroup name (default primary)
catalogNoAthenaGlue Data Catalog name (default AwsDataCatalog)
databaseNoAthenaDefault Glue database name passed as the query execution context
databasesNoAthenaGlue databases to include in schema scans; written by ktx setup and read by ktx ingest

PostgreSQL

The most full-featured connector. Supports schema introspection, foreign key detection, column statistics, and query history via pg_stat_statements.

Connection config

yamlktx.yaml
connections:
  my-postgres:
    driver: postgres
    url: env:DATABASE_URL
    schema: public

Or with individual fields:

yamlktx.yaml
connections:
  my-postgres:
    driver: postgres
    host: localhost
    port: 5432
    database: analytics
    username: ktx_reader
    password: env:PG_PASSWORD
    schemas:
      - public
      - analytics
    ssl: true

Authentication

MethodConfig
Passwordpassword: env:PG_PASSWORD or password: file:/path/to/secret
Connection URLurl: env:DATABASE_URL
SSLssl: true, optionally rejectUnauthorized: false for self-signed certs

Features

FeatureSupportedNotes
Tables & viewsYesVia pg_catalog
Primary keysYesVia information_schema.table_constraints
Foreign keysYesFull constraint detection
Row count estimatesYesVia pg_class.reltuples
Column statisticsYesRequires pg_read_all_stats role
Query historyYesVia pg_stat_statements extension
Table samplingYesTABLESAMPLE SYSTEM

Query history

PostgreSQL query history mines real query patterns from pg_stat_statements. This helps ktx understand how your team actually queries the data.

Requirements:

  • pg_stat_statements extension enabled
  • pg_read_all_stats role granted to the ktx user

Config options:

    context:
      queryHistory:
        enabled: true
        minExecutions: 5
        filters:
          dropTrivialProbes: true

Dialect notes

  • SQL compilation uses LIMIT/OFFSET pagination
  • Named parameters converted to positional ($1, $2, ...)
  • Supports COUNT(*) FILTER (WHERE ...) for null analysis
  • Full support for PostgreSQL types: uuid, jsonb, timestamptz, numeric, text[], etc.

Snowflake

Connects via the Snowflake SDK. Supports multi-schema scanning, RSA key authentication, and query-history configuration for Snowflake query history.

Connection config

yamlktx.yaml
connections:
  my-snowflake:
    driver: snowflake
    account: xy12345
    warehouse: ANALYTICS_WH
    database: PROD
    schema_names:
      - PUBLIC
      - SALES
      - MARKETING
    username: KTX_SERVICE
    password: env:SNOWFLAKE_PASSWORD
    role: ANALYST

ktx setup discovers schemas after the connection is verified and writes the selected list to schema_names. You can also set this field manually. For a single schema, schema_name: PUBLIC is accepted as an equivalent shorthand.

Authentication

MethodConfig
Passwordpassword: env:SNOWFLAKE_PASSWORD
RSA key pairauthMethod: rsa, privateKey: file:~/.ssh/snowflake_key.pem, optional passphrase

Features

FeatureSupportedNotes
Tables & viewsYesVia INFORMATION_SCHEMA.TABLES
Primary keysYesVia table constraints
Foreign keysNoNot available in Snowflake
Row count estimatesYesFrom INFORMATION_SCHEMA.TABLES.ROW_COUNT
Column statisticsNo-
Query historyYesVia SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY when enabled
Table samplingYes-

Query history

Snowflake query history reads aggregated query-history templates from SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY and feeds the same unified staged artifact shape as Postgres and BigQuery.

    context:
      queryHistory:
        enabled: true
        windowDays: 90
        minExecutions: 5
        filters:
          dropTrivialProbes: true
          serviceAccounts:
            patterns: ['^svc_']
            mode: exclude
        redactionPatterns: []

Dialect notes

  • All identifiers are uppercase by default (case-insensitive matching)
  • Connection context set per query (USE ROLE, USE WAREHOUSE, USE DATABASE, USE SCHEMA)
  • Parameter binding uses positional ? placeholders
  • Date values normalized to ISO 8601 strings

BigQuery

Authenticates via GCP service account credentials. Supports multi-dataset scanning and query-history configuration for INFORMATION_SCHEMA.JOBS_BY_PROJECT.

Connection config

yamlktx.yaml
connections:
  my-bigquery:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    dataset_id: analytics
    location: US

For multiple datasets:

    dataset_ids:
      - analytics
      - marketing
      - finance

BigQuery dataset scope is stored in connections.<id>.dataset_ids. Interactive setup discovers datasets from credentials plus location, then writes the chosen dataset ids as the scan scope.

Cross-project datasets

To introspect a dataset hosted in a different project than the one your credentials bill to — for example Google's bigquery-public-data, a partner's shared project, or an organization's central data project — qualify the entry as project.dataset:

yamlktx.yaml
connections:
  public-bq:
    driver: bigquery
    credentials_json: file:~/.config/gcloud/bq-service-account.json
    location: US
    dataset_ids:
      - bigquery-public-data.austin_311
      - bigquery-public-data.census_bureau_usa
      - analytics

ktx introspects each dataset in its host project while every query job still bills to the project_id inside your credentials_json. A bare dataset entry (no prefix) is scanned in your own project, exactly as before. A single connection may mix datasets from several projects, and two projects may host datasets with the same name without colliding.

Interactive setup does not enumerate datasets in projects your credentials don't own, so hand-write project.dataset entries for foreign datasets. The wizard's table picker also only lists datasets in your connection's location region; this affects table selection only — ingest and discover_data introspect a cross-project dataset regardless of region.

Authentication

MethodConfig
Service account JSONcredentials_json: file:/path/to/key.json
Environment variablecredentials_json: env:BIGQUERY_CREDENTIALS_JSON

The project ID is extracted automatically from the service account JSON file. If you set project_id in ktx.yaml, ktx treats it as local descriptor and mapping metadata. The BigQuery connector still authenticates with the project_id inside credentials_json.

Features

FeatureSupportedNotes
Tables & viewsYesIncluding materialized views and external tables
Primary keysYesVia INFORMATION_SCHEMA table constraints when declared
Foreign keysNoNot available in BigQuery
Row count estimatesYesFrom table metadata
Column statisticsNo-
Query historyYesVia region-scoped INFORMATION_SCHEMA.JOBS_BY_PROJECT when enabled
Table samplingYes-

Query history

BigQuery query history reads aggregated query-history templates from region-scoped INFORMATION_SCHEMA.JOBS_BY_PROJECT and feeds the same unified staged artifact shape as Postgres and Snowflake.

    context:
      queryHistory:
        enabled: true
        windowDays: 90
        minExecutions: 5
        filters:
          dropTrivialProbes: true
          serviceAccounts:
            patterns: ['@bot\\.']
            mode: exclude
        redactionPatterns: []

Dialect notes

  • Parameter binding uses named @param syntax
  • Arrays flattened to comma-separated strings in results
  • Location specified at query execution time
  • Supports the max_bytes_billed limit from ktx.yaml; the shared query_timeout_ms field maps to the query job's jobTimeoutMs

Amazon Athena

Connects to Amazon Athena using the AWS Glue Data Catalog for schema introspection and the Athena query API for read-only SQL execution. Authentication uses the standard AWS credential chain — no credentials are embedded in ktx.yaml.

Connection config

yamlktx.yaml
connections:
  my-athena:
    driver: athena
    region: us-east-1
    s3_staging_dir: s3://my-bucket/athena-results/

With optional fields:

yamlktx.yaml
connections:
  my-athena:
    driver: athena
    region: us-east-1
    s3_staging_dir: env:ATHENA_S3_STAGING_DIR
    workgroup: analytics
    catalog: AwsDataCatalog
    database: my_default_database
    databases:
      - analytics
      - raw

ktx setup writes the databases array when you select Glue databases during setup. ktx scan reads it to limit introspection to those databases.

Authentication

ktx uses the AWS SDK default credential chain — no credentials appear in ktx.yaml. The chain resolves credentials in this order:

MethodHow to configure
Environment variablesSet AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN
Shared credentials fileConfigure ~/.aws/credentials with a [default] or named profile; set AWS_PROFILE to select a non-default profile
IAM instance profileAttach an IAM role to the EC2 instance or ECS task — no local configuration needed
IAM roles for service accounts (EKS)Annotate the pod's service account with the IAM role ARN

The IAM principal must have athena:StartQueryExecution, athena:GetQueryExecution, athena:GetQueryResults, glue:GetDatabases, and glue:GetTables permissions, plus read access to the S3 results bucket.

Features

FeatureSupportedNotes
Tables & viewsYesVia AWS Glue Data Catalog
Primary keysNoGlue does not expose constraint metadata
Foreign keysNoNot available in Glue/Athena
Row count estimatesNoGlue table statistics are often stale
Column statisticsNo-
Query historyNo-
Table samplingYesSELECT ... LIMIT n

Dialect notes

  • SQL dialect is Presto/Trino; identifiers are quoted with double-quotes
  • Table names use three-part format: "catalog"."database"."table" (e.g. "AwsDataCatalog"."analytics"."orders")
  • Partition columns (PartitionKeys in Glue) are included after regular columns in the schema and are fully queryable
  • Athena does not support TABLESAMPLE; random sampling uses ORDER BY rand()
  • Query execution is asynchronous: ktx starts the query, polls until completion, then fetches results from S3
  • Results are stored in s3_staging_dir; the IAM principal needs write access to that bucket
  • Use workgroup to apply per-workgroup cost controls and result configuration
  • The connector always uses your account's default Glue Data Catalog; cross-account catalog access (CatalogId pointing to another account) is not supported

MySQL

Standard MySQL/MariaDB connector with full foreign key support and schema introspection.

Connection config

yamlktx.yaml
connections:
  my-mysql:
    driver: mysql
    url: env:MYSQL_DATABASE_URL

MySQL supports selecting one or more databases during ktx setup. The selected database scope is stored in connections.<id>.schemas, and ktx scan reads exactly those databases.

Or with individual fields:

yamlktx.yaml
connections:
  my-mysql:
    driver: mysql
    host: mysql.internal
    port: 3306
    database: analytics
    username: ktx_reader
    password: env:MYSQL_PASSWORD
    ssl: true

Authentication

MethodConfig
Passwordpassword: env:MYSQL_PASSWORD or password: file:/path/to/secret
SSLssl: true or ssl: { rejectUnauthorized: false }
URL parameters?ssl=true or ?sslmode=required in connection URL

Features

FeatureSupportedNotes
Tables & viewsYesVia INFORMATION_SCHEMA.TABLES
Primary keysYesVia KEY_COLUMN_USAGE
Foreign keysYesVia REFERENTIAL_CONSTRAINTS
Row count estimatesYesFrom TABLE_ROWS (InnoDB estimate)
Column statisticsNo-
Query historyNo-
Table samplingYesUses RAND() filter

Dialect notes

  • Parameter binding uses positional ? placeholders
  • Uses LIMIT X OFFSET Y for pagination
  • Multi-database scanning uses schemas as the selected database list
  • Supports 20+ MySQL types including enum, json, datetime, decimal
  • Table comments extracted with InnoDB metadata prefix stripping

ClickHouse

Connects to ClickHouse over HTTP. Supports table and column introspection across one or more selected databases.

Connection config

yamlktx.yaml
connections:
  my-clickhouse:
    driver: clickhouse
    url: env:CLICKHOUSE_DATABASE_URL
    database: analytics

For multiple databases:

    databases:
      - analytics
      - mart

ClickHouse supports selecting one or more databases during ktx setup. The selected scan scope is stored in connections.<id>.databases. The single database field remains the connection default for raw SQL and ktx sql.

Authentication

MethodConfig
URLurl: env:CLICKHOUSE_DATABASE_URL
Passwordpassword: env:CLICKHOUSE_PASSWORD or password: file:/path/to/secret

Features

FeatureSupportedNotes
Tables & viewsYesVia system.tables
Primary keysNoNot exposed as relational constraints
Foreign keysNoNot available in ClickHouse
Row count estimatesYesFrom ClickHouse metadata where available
Column statisticsNo-
Query historyNo-
Table samplingYesUses ClickHouse sampling syntax when supported

Dialect notes

  • Parameter binding uses named placeholders
  • The database field sets the default database for SQL execution
  • The databases array controls the scan scope

SQL Server

Connects to Microsoft SQL Server and Azure SQL. Supports multi-schema scanning with dbo as the default schema.

Connection config

yamlktx.yaml
connections:
  my-sqlserver:
    driver: sqlserver
    url: env:SQLSERVER_DATABASE_URL

Or with individual fields:

yamlktx.yaml
connections:
  my-sqlserver:
    driver: sqlserver
    host: sql.internal
    port: 1433
    database: Analytics
    username: ktx_reader
    password: env:MSSQL_PASSWORD
    schema: dbo
    trustServerCertificate: true

For multiple schemas:

    schemas:
      - dbo
      - analytics
      - staging

Authentication

MethodConfig
SQL Server authusername + password
Encrypted connectionAlways enabled, trustServerCertificate: true for self-signed

Features

FeatureSupportedNotes
Tables & viewsYesVia INFORMATION_SCHEMA.TABLES
Primary keysYesVia TABLE_CONSTRAINTS and KEY_COLUMN_USAGE
Foreign keysYesVia REFERENTIAL_CONSTRAINTS
Row count estimatesYesVia sys.dm_db_partition_stats
Column statisticsNo-
Query historyNo-
Table samplingYes-
Nested analysisNo-

Dialect notes

  • Parameter binding uses @paramName syntax
  • Row limiting uses SELECT TOP N * FROM (query) AS ktx_query_result
  • Encryption is always required; certificate validation is optional
  • Multi-schema support with per-schema isolation

SQLite

File-based connector using better-sqlite3. Ideal for local development, embedded analytics, or testing.

Connection config

yamlktx.yaml
connections:
  my-sqlite:
    driver: sqlite
    path: ./data/warehouse.sqlite

Path supports multiple formats:

# Relative path (resolved against project directory)
path: ./warehouse.sqlite

# Absolute path
path: /var/data/analytics.db

# Home directory expansion
path: ~/data/warehouse.sqlite

# Environment variable
path: env:SQLITE_DB_PATH

# URL format
url: sqlite:///path/to/db.sqlite

Authentication

No authentication required - SQLite is file-based. The file must be readable by the process running ktx.

Features

FeatureSupportedNotes
Tables & viewsYesVia sqlite_master
Primary keysYesVia PRAGMA table_info()
Foreign keysYesVia PRAGMA foreign_key_list() (requires PRAGMA foreign_keys = ON)
Row count estimatesYesExact count via SELECT COUNT(*)
Column statisticsNo-
Query historyNo-
Table samplingYes-
Nested analysisNo-

Dialect notes

  • Synchronous query execution (no connection pooling)
  • Parameter binding uses :paramName syntax
  • Uses LIMIT X OFFSET Y for pagination
  • SQLite type affinity system: TEXT, NUMERIC, INTEGER, REAL, BLOB
  • Foreign key enforcement requires explicit PRAGMA foreign_keys = ON
  • Database file must exist before ktx connection test or ingest runs

DuckDB

File-based connector using the DuckDB Node.js API. Ideal for local analytics, embedded warehouses, and cross-database federation.

Connection config

yamlktx.yaml
connections:
  warehouse:
    driver: duckdb
    path: data/warehouse.duckdb

path is resolved relative to the project directory. The .duckdb file must already exist — ktx never creates a missing database file.

Authentication

No authentication required — DuckDB is file-based. The .duckdb file must be readable by the process running ktx.

Features

FeatureSupportedNotes
Tables & viewsYesVia information_schema on the main schema
Primary keysYesVia information_schema.table_constraints
Foreign keysYesVia DuckDB's duckdb_constraints() catalog function
Row count estimatesYesExact count via SELECT COUNT(*)
Column statisticsNo-
Query historyNo-
Table samplingYes-
Nested analysisNo-

Dialect notes

  • Introspection scans the main schema only
  • Execution is read-only; ktx opens the file without write access
  • Parameter binding uses positional ? placeholders
  • Uses LIMIT X OFFSET Y for pagination
  • Database file must exist before ktx connection test or ingest runs

Cross-database federation

When a project declares two or more attach-compatible connections — any combination of postgres, mysql, sqlite, and duckdbktx derives a cross-database federation connection. That connection can ATTACH a native .duckdb file, allowing semantic queries to join across sources without manually copying data.


MongoDB

Connects to MongoDB as a primary context source. ktx treats each collection as a table and each inferred top-level field as a column. MongoDB is a non-SQL source: ktx sql and semantic-layer metric compilation do not apply to a MongoDB connection, but its collections still flow through ktx ingest, descriptions, and relationship discovery.

Connection config

yamlktx.yaml
connections:
  mongo-prod:
    driver: mongodb
    url: env:MONGO_URL
    databases: [app]
    enabled_tables: [app.users, app.orders] # optional collection allowlist
    sample_size: 1000
    # order_by: createdAt                    # only when _id is not an ObjectId

Standard mongodb:// and mongodb+srv:// connection strings are supported, including TLS and MongoDB Atlas — pass the full connection string (with its query parameters) as url. The databases list selects which databases to introspect; if omitted, ktx uses the database in the URL path. ktx setup also offers MongoDB and stores the selected databases under connections.<id>.databases.

Authentication

MethodConfig
Connection stringurl: env:MONGO_URL or url: file:/path/to/secret
Atlas / TLSUse a mongodb+srv:// URL with the credentials and TLS options Atlas provides

Schema inference

MongoDB has no fixed schema, so ktx infers one by sampling the most recent sample_size documents per collection (default 1000), sorted by _id descending. Because an ObjectId embeds its creation time, this captures the collection's current shape with zero configuration. When _id is not an ObjectId (custom string or UUID keys), set order_by to a timestamp field such as createdAt so "most recent" is well-defined. A custom order_by field should be indexed — an unindexed sort hits MongoDB's in-memory sort limit and fails on large collections (_id, the default, is always indexed).

For each top-level field, ktx unions the BSON types seen and derives nullability from how often the field is present:

  • Scalar BSON types map to string, number, time, or boolean
  • A field seen with more than one type is recorded as mixed and treated as a string
  • Sub-documents and arrays become a single opaque json column (no dotted-path columns); their sampled values are stringified, not faithfully serialized
  • _id is the primary key

Features

FeatureSupportedNotes
Collections (as tables)YesVia listCollections; system.* collections are excluded
Primary keysYes_id
Foreign keysNoMongoDB has no formal foreign keys
Row count estimatesYesVia estimatedDocumentCount
Column statisticsNo-
Query historyNo-
Table samplingYesReads the most recent documents
Nested analysisYesSub-documents and arrays modeled as opaque json
Read-only SQL (ktx sql)NoMongoDB is not a SQL source

Dialect notes

  • Strictly read-only: the connector only issues find, listCollections, estimatedDocumentCount, and read aggregations
  • Sampling rides the _id index and uses a server-side time limit so large collections do not stall a run; a custom order_by must be indexed for the same guarantee
  • sample_size trades inference coverage for speed; raise it for collections with highly variable documents

Common errors

Error or symptomLikely causeRecovery
Connection URL appears in git diffA literal credential URL was written to ktx.yamlReplace it with env:NAME or file:/path/to/secret and rotate exposed credentials
Database ingest returns no tablesSchema, database, or project filter is wrong, or the user lacks metadata permissionsVerify the schema list and grant metadata read permissions. For Athena, confirm the IAM principal has glue:GetDatabases and glue:GetTables permissions
Query history is emptyQuery history extension or warehouse history view is unavailableEnable the warehouse-specific history feature, then rerun ktx ingest <connectionId> --query-history or ktx setup
Column statistics are missingConnector cannot access stats tables or the warehouse does not expose themGrant stats permissions where supported; otherwise rely on schema-level context without column statistics
Semantic query execution failsConnection is missing, unreachable, or query execution is disabledRun ktx connection test <id> and check the ktx sl query flags
Athena query fails with ACCESS_DENIEDIAM principal lacks athena:StartQueryExecution or S3 write access to s3_staging_dirAttach a policy granting Athena query permissions and s3:PutObject on the staging bucket
Athena ingest finds databases but no tablesIAM principal has glue:GetDatabases but not glue:GetTablesGrant glue:GetTables on the relevant Glue catalog resources