Integrations

Context Sources

Ingest semantic context from dbt, MetricFlow, LookML, Metabase, Looker, Notion, Sigma, and Google Drive.

Context sources feed your existing analytics tooling into ktx. During ingestion, ktx extracts metadata from each source, and a reconciliation agent merges it with your existing semantic layer and knowledge base, preserving accepted edits rather than overwriting them.

Every context source is configured in ktx.yaml under connections, and its driver value selects the connector.

Ingestion workflow

Agents must configure and ingest context sources in this order:

  1. Add the context source connection in ktx.yaml or with ktx setup.
  2. Store tokens as env:NAME or file:/path/to/secret.
  3. Run ktx ingest <connectionId> for one source or ktx ingest --all for every configured source.
  4. Review the foreground ingest output.
  5. Review generated semantic-layer/ YAML and wiki/ Markdown files in git.
  6. Validate changed semantic sources with ktx sl validate.
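
The command-line parts of this workflow (steps 3 and 6) can be scripted. The following is a minimal sketch that only builds the argument lists for the commands named above and does not execute them; the function name plan_ingest_commands is hypothetical and not part of ktx.

```python
from typing import List, Optional


def plan_ingest_commands(connection_ids: Optional[List[str]] = None) -> List[List[str]]:
    """Build argv lists for the ingest and validate steps; nothing is executed."""
    if connection_ids:
        # One `ktx ingest <connectionId>` call per named source.
        commands = [["ktx", "ingest", cid] for cid in connection_ids]
    else:
        # No IDs given: ingest every configured source.
        commands = [["ktx", "ingest", "--all"]]
    # Validation comes last, after the generated files have been reviewed in git.
    commands.append(["ktx", "sl", "validate"])
    return commands
```

Returning argument lists rather than running the commands leaves room for the manual review steps (4 and 5) between ingest and validation.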

Common source fields

Git repository fields are source-specific. dbt uses top-level repo_url, LookML uses top-level repoUrl, and MetricFlow uses nested metricflow.repoUrl.

| Field | Required | Description |
| --- | --- | --- |
| driver | Yes | Source connector: dbt, metricflow, lookml, metabase, looker, notion, sigma, or gdrive |
| source_dir | For local file sources | Absolute or project-relative source directory |
| repo_url | For Git-hosted dbt sources | Git repository URL |
| repoUrl | For Git-hosted LookML sources | Git repository URL |
| metricflow.repoUrl | For Git-hosted MetricFlow sources | Git repository URL |
| branch | No | Git branch to read |
| path | No | Subdirectory inside a monorepo |
| auth_token_ref | For private APIs/repos | env:NAME or file:/path/to/secret token reference |
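
To make the env:NAME and file:/path/to/secret reference forms concrete, here is a minimal sketch of how such a reference could be resolved. It illustrates the two forms and is not ktx's actual resolver: the function name resolve_secret_ref is hypothetical, and stripping surrounding whitespace from file contents is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_secret_ref(ref: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Resolve an env:NAME or file:/path/to/secret reference to the secret value."""
    env = os.environ if environ is None else environ
    scheme, sep, target = ref.partition(":")
    if not sep or not target:
        raise ValueError(f"expected env:NAME or file:/path, got {ref!r}")
    if scheme == "env":
        if target not in env:
            raise KeyError(f"environment variable {target} is not set")
        return env[target]
    if scheme == "file":
        # Assumption: a trailing newline in the secret file is not part of the secret.
        return Path(target).read_text(encoding="utf-8").strip()
    raise ValueError(f"unsupported reference scheme {scheme!r}")
```

A check like this can help when debugging authentication failures, since a missing environment variable or unreadable file surfaces before any ingest runs.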

dbt

Ingests schema definitions, model descriptions, column metadata, and column test definitions from a dbt project.

What it provides

  • Model and source definitions from schema.yml files
  • Column names, descriptions, and data types
  • Column tests, mapped to semantic facts — not_null / unique become column constraints, accepted_values becomes enum value lists, and relationships becomes join / foreign-key edges
  • Model and source tags, and source freshness settings

MetricFlow semantic_models: and metrics: are ingested through the separate MetricFlow source, not the dbt driver.

Connection config

# ktx.yaml
connections:
  my-dbt:
    driver: dbt
    source_dir: /path/to/dbt/project

For a Git-hosted project:

# ktx.yaml
connections:
  my-dbt:
    driver: dbt
    repo_url: https://github.com/org/dbt-repo
    branch: main
    path: analytics/dbt          # For monorepos
    auth_token_ref: env:GITHUB_TOKEN

Authentication

| Method | Config |
| --- | --- |
| Local path | source_dir: /absolute/path/to/dbt/project |
| Public repo | repo_url: https://github.com/org/repo |
| Private repo | repo_url + auth_token_ref: env:GITHUB_TOKEN |

Optional fields:

| Field | Description |
| --- | --- |
| profiles_path | Path to profiles.yml (if non-standard location) |
| target | dbt target name (e.g., dev, prod) |
| project_name | Override auto-detected project name |

What gets ingested

  • Semantic-layer overlays (semantic-layer/*.yaml): descriptions, constraints, enum values, and joins from the dbt YAML are written onto the semantic source for the matching warehouse table. Overlays land on the warehouse connection that owns the table, which is usually a different connection than the dbt source itself.
  • Wiki pages (wiki/): for definitions or relationships that don't map to a confirmed physical table.
  • Work units for parallel processing: one per schema file under models/ when the project has more than 25 YAML files, otherwise a single combined unit.
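
The work-unit rule in the last bullet can be sketched as follows. This is an illustration of the stated 25-file threshold, not ktx's implementation: the function name plan_dbt_work_units is hypothetical, and leaving YAML files outside models/ out of the per-file units in large projects is an assumption the text does not confirm.

```python
from pathlib import PurePosixPath
from typing import List

MAX_FILES_FOR_COMBINED_UNIT = 25  # threshold stated in the text


def plan_dbt_work_units(yaml_paths: List[str]) -> List[List[str]]:
    """Group project-relative YAML paths into work units."""
    if not yaml_paths:
        return []
    if len(yaml_paths) <= MAX_FILES_FOR_COMBINED_UNIT:
        # Small project: a single combined unit.
        return [sorted(yaml_paths)]
    # Large project: one unit per schema file under models/.
    # Assumption: YAML files outside models/ do not get their own unit.
    schema_files = [p for p in yaml_paths if PurePosixPath(p).parts[:1] == ("models",)]
    return [[p] for p in sorted(schema_files)]
```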

MetricFlow

Ingests MetricFlow semantic models and metric definitions. Useful when your team defines metrics in MetricFlow's YAML format.

What it provides

  • Semantic model definitions (entities, dimensions, measures)
  • Cross-model metric definitions
  • Entity relationships between models, inferred from matching foreign and primary entities

Connection config

# ktx.yaml
connections:
  my-metricflow:
    driver: metricflow
    metricflow:
      repoUrl: https://github.com/org/metricflow-repo
      branch: main
      path: dbt_metrics           # Subdirectory for monorepos
      auth_token_ref: env:GITHUB_TOKEN

For a local path:

    metricflow:
      repoUrl: file:///absolute/path/to/project

Authentication

| Method | Config |
| --- | --- |
| Public repo | repoUrl: https://github.com/org/repo |
| Private repo | repoUrl + auth_token_ref: env:GITHUB_TOKEN |
| Local path | repoUrl: file:///path/to/project |

What gets ingested

  • Semantic models with their entities, dimensions, measures, and the join edges inferred from entity relationships
  • Metric definitions with their expressions and filters
  • Work units organized by connected component (metrics + related semantic models grouped together)
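
Grouping by connected component means that any metrics and semantic models linked to each other, directly or transitively, end up in the same work unit. A minimal union-find sketch of that idea follows; the input shapes (a metric-to-models mapping and a list of model-to-model joins) and the function name group_work_units are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def group_work_units(
    metric_models: Dict[str, List[str]],
    model_joins: Iterable[Tuple[str, str]] = (),
) -> List[Set[str]]:
    """Group metrics and semantic models into connected components."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for metric, models in metric_models.items():
        find(f"metric:{metric}")  # register metrics that reference no model
        for model in models:
            union(f"metric:{metric}", f"model:{model}")
    for left, right in model_joins:
        union(f"model:{left}", f"model:{right}")

    groups: Dict[str, Set[str]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort for a deterministic order of work units.
    return sorted(groups.values(), key=lambda group: sorted(group))
```

For example, two metrics that both reference an orders model land in one unit, while a metric on an unrelated users model forms its own.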

LookML

Ingests LookML view and model definitions from a Git repository. Extracts field definitions, SQL table references, and join relationships.

What it provides

  • View definitions (dimensions, measures, derived tables)
  • Model explore definitions and joins
  • SQL table name references
  • Field-level descriptions and labels

Connection config

# ktx.yaml
connections:
  my-lookml:
    driver: lookml
    repoUrl: https://github.com/org/lookml-repo
    branch: main
    path: analytics                # Subdirectory for monorepos
    auth_token_ref: env:GITHUB_TOKEN

For a local path:

    repoUrl: file:///absolute/path/to/lookml

Authentication

| Method | Config |
| --- | --- |
| Public repo | repoUrl: https://github.com/org/repo |
| Private repo | repoUrl + auth_token_ref: env:GITHUB_TOKEN |
| Local path | repoUrl: file:///path/to/project |

What gets ingested

  • One work unit per model, plus a unit for orphan views and one per dashboard
  • Semantic-layer sources per view — overlays for thin sql_table_name wrappers, standalone sources for derived_table views
  • Measures, joins (with their Looker relationship:), and field types mapped to column types (yesno → boolean, date/timestamp → time)
  • Wiki pages for relationships and descriptions, with warehouse identifiers verified before writing

Warehouse mapping

Optionally validate that LookML references match your expected Looker connection:

    mappings:
      expectedLookerConnectionName: postgres_connection

This compares each model's connection: declaration against the expected name. Mismatched models are flagged, and semantic-layer writes are disabled for them during that ingest while wiki extraction still proceeds.
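
The effect of this check can be sketched per model. The snippet below is an illustration only: the function name plan_lookml_writes is hypothetical, and the exact, case-sensitive string comparison is an assumption.

```python
from typing import Dict


def plan_lookml_writes(
    model_connections: Dict[str, str], expected: str
) -> Dict[str, Dict[str, bool]]:
    """Decide, per LookML model, which outputs stay enabled for this ingest."""
    plan: Dict[str, Dict[str, bool]] = {}
    for model, connection in model_connections.items():
        matches = connection == expected  # assumption: exact, case-sensitive match
        plan[model] = {
            "flagged": not matches,
            "semantic_layer": matches,  # disabled for mismatched models
            "wiki": True,               # wiki extraction still proceeds
        }
    return plan
```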


Metabase

Ingests collections, questions, models, and metrics — with their underlying SQL — from a Metabase instance. Maps Metabase databases to your ktx warehouse connections.

What it provides

  • Collections and their hierarchy, used to organize ingested context
  • Questions, models, and metrics — resolved SQL for both native and structured (MBQL) queries
  • Each card's output schema: column types and primary/foreign-key hints
  • Database-to-warehouse relationship mapping

Connection config

# ktx.yaml
connections:
  my-metabase:
    driver: metabase
    api_url: https://metabase.company.com
    api_key_ref: env:METABASE_API_KEY
    mappings:
      databaseMappings:
        "3": postgres-main         # Metabase DB ID → ktx connection
      syncEnabled:
        "3": true
      syncMode: ONLY               # Only ingest mapped databases

Authentication

| Method | Config |
| --- | --- |
| API key | api_key_ref: env:METABASE_API_KEY |

Generate an API key in Metabase: Admin > Settings > Authentication > API Keys.

What gets ingested

  • Semantic-layer sources generated from each card's resolved SQL and column metadata, written to the mapped warehouse connection
  • Fallback wiki notes only when a referenced table can't be mapped or an identifier can't be verified
  • One work unit per Metabase collection; re-syncs reprocess only collections with changed cards

Warehouse mapping

Metabase databases must be mapped to ktx connections so ingested context links to the correct warehouse:

mappings:
  databaseMappings:
    "<metabase_db_id>": "<ktx_connection_id>"
  syncEnabled:
    "<metabase_db_id>": true
  syncMode: ONLY    # ONLY = restrict to mapped DBs

Find Metabase database IDs in Admin > Databases; the ID appears in the URL when you edit a database.
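
As a rough sketch of how the mappings block might be read, the function below selects the Metabase databases to ingest under syncMode: ONLY. The function name is hypothetical, and two points are assumptions: that a mapped database is ingested only when its syncEnabled entry is true, and that other syncMode values (not described here) are out of scope.

```python
from typing import Any, Dict


def mapped_databases_to_sync(mappings: Dict[str, Any]) -> Dict[str, str]:
    """Return {metabase_db_id: ktx_connection_id} for the databases to ingest."""
    if mappings.get("syncMode") != "ONLY":
        # Other syncMode values are not described in this guide, so do not guess.
        raise NotImplementedError("this sketch only models syncMode: ONLY")
    db_map = mappings.get("databaseMappings", {})
    enabled = mappings.get("syncEnabled", {})
    # Assumption: a mapped database is ingested only when explicitly enabled.
    return {db_id: conn for db_id, conn in db_map.items() if enabled.get(db_id) is True}
```

Database IDs are kept as strings because they are quoted keys in ktx.yaml.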


Looker

Ingests explores, Looks, and dashboards from a Looker instance via the Looker API. Maps Looker connections to your ktx warehouse connections.

What it provides

  • Explore definitions and field metadata
  • Dashboard and Look configurations
  • Query patterns and usage signals
  • Looker folder structure for organization context

Connection config

# ktx.yaml
connections:
  my-looker:
    driver: looker
    base_url: https://looker.company.com
    client_id: your-looker-client-id
    client_secret_ref: env:LOOKER_CLIENT_SECRET
    mappings:
      connectionMappings:
        postgres_connection: postgres-main   # Looker conn → ktx conn

Authentication

| Method | Config |
| --- | --- |
| OAuth client credentials | client_id + client_secret_ref: env:LOOKER_CLIENT_SECRET |

Generate API credentials in Looker: Admin > Users > Edit > API Keys.

What gets ingested

  • Semantic-layer sources from explore fields, written to the mapped warehouse connection (mapped explores only)
  • Wiki pages capturing reusable metric, segment, and domain knowledge from dashboards and Looks
  • Usage and recency signals that drive a triage gate, focusing processing on high-value content
  • Work units per explore, per dashboard, and per Look

Warehouse mapping

Map Looker connection names to ktx connections so explores link to the correct warehouse:

mappings:
  connectionMappings:
    "<looker_connection_name>": "<ktx_connection_id>"

Find Looker connection names in Admin > Database > Connections.


Notion

Ingests pages and databases from a Notion workspace as wiki pages. Useful for capturing business definitions, data dictionaries, and team documentation that agents need for context.

What it provides

  • Notion pages crawled from selected roots or all accessible content
  • Page bodies and blocks normalized to Markdown
  • Page hierarchy and cross-page links (child pages, mentions, relations)
  • Notion databases and their data-source rows as individual pages

Connection config

# ktx.yaml
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: selected_roots
    root_page_ids:
      - "abc123def456..."

For crawling all accessible pages:

# ktx.yaml
connections:
  my-notion:
    driver: notion
    auth_token_ref: env:NOTION_TOKEN
    crawl_mode: all_accessible

Authentication

| Method | Config |
| --- | --- |
| Internal integration token | auth_token_ref: env:NOTION_TOKEN |

Create an integration at notion.so/my-integrations, then share target pages with the integration.

Configuration options

| Field | Description | Default |
| --- | --- | --- |
| crawl_mode | all_accessible or selected_roots | - |
| root_page_ids | Page IDs to crawl from (for selected_roots) | [] |
| root_database_ids | Database IDs to include | [] |
| root_data_source_ids | Data-source IDs to include (for selected_roots) | [] |
| max_pages_per_run | Pages processed per sync | 1000 |
| max_knowledge_creates_per_run | New pages created per sync | 25 |
| max_knowledge_updates_per_run | Pages updated per sync | 20 |
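
A pre-flight check for these options might look like the sketch below, which fills in the listed defaults and rejects a selected_roots configuration with nothing to crawl from. It is not ktx's own validation: the function name is hypothetical, and requiring at least one root ID for selected_roots is an assumption.

```python
from typing import Any, Dict

CRAWL_MODES = ("all_accessible", "selected_roots")
ROOT_FIELDS = ("root_page_ids", "root_database_ids", "root_data_source_ids")


def normalize_notion_connection(conn: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the documented defaults and reject configs that cannot crawl anything."""
    cfg: Dict[str, Any] = {
        "root_page_ids": [],
        "root_database_ids": [],
        "root_data_source_ids": [],
        "max_pages_per_run": 1000,
        "max_knowledge_creates_per_run": 25,
        "max_knowledge_updates_per_run": 20,
    }
    cfg.update(conn)
    if cfg.get("crawl_mode") not in CRAWL_MODES:
        raise ValueError(f"crawl_mode must be one of {CRAWL_MODES}")
    if cfg["crawl_mode"] == "selected_roots" and not any(cfg[f] for f in ROOT_FIELDS):
        # Assumption: selected_roots needs at least one root ID to start from.
        raise ValueError("selected_roots requires at least one root ID")
    return cfg
```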

What gets ingested

  • Wiki pages synthesized from Notion content (not raw copies)
  • Semantic-layer sources when a page defines a reusable dataset or metric mapped to a confirmed non-Notion target; otherwise the fact stays wiki-only
  • Page-relevance triage that skips transient content (task lists, status updates, date-titled snapshots)
  • Work units clustered by embedding similarity for efficient synthesis

Notes

  • Notion is wiki-first: it writes durable wiki pages by default and only emits semantic-layer sources for content mapped to a confirmed non-Notion target; unmapped facts stay wiki-only
  • Rate limits apply; large workspaces may require multiple ingestion runs
  • Incremental sync cursors are stored in .ktx/db.sqlite; don't add last_successful_cursor to ktx.yaml

Sigma

Ingests data model definitions and workbook metadata from a Sigma workspace as semantic context. Uses the Sigma REST API to fetch data model specs and workbook summaries.

What it provides

  • Data model names, folder paths, and ownership metadata
  • Page and element definitions within each data model
  • Column identifiers and data types where available
  • Workbook names, paths, descriptions, and version metadata

Connection config

# ktx.yaml
connections:
  sigma-main:
    driver: sigma
    api_url: https://api.sigmacomputing.com   # Omit for GCP US (default)
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET

For the AWS US region, override api_url:

# ktx.yaml
connections:
  sigma-main:
    driver: sigma
    api_url: https://aws-api.sigmacomputing.com
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET

Authentication

| Method | Config |
| --- | --- |
| OAuth client credentials | client_id + client_secret_ref: env:SIGMA_CLIENT_SECRET |

Generate a client in Sigma: Administration → Developer Access → Add New Client.

What gets ingested

  • Active data model specs, organized by folder into work units
  • Workbook metadata (name, path, description, version) — archived and exploration workbooks excluded by default
  • Models backed by CSV uploads or unsupported connector subtypes are listed in the manifest but skipped during spec fetch (a Sigma API limitation)

Warehouse connection mapping

connectionMappings is optional. Without it, ktx produces wiki knowledge only — no semantic-layer sources are written and warehouse validation is skipped. To get semantic-layer output and enable sl_validate, map each Sigma internal connection UUID to a ktx warehouse connection ID:

# ktx.yaml
connections:
  sigma-main:
    driver: sigma
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    connectionMappings:
      "<sigma-internal-uuid>": snowflake-prod   # data models using this connection get SL sources

Find the Sigma connection UUID in Administration → Connections or from the source.connectionId field in a fetched data model spec. Data model elements whose connectionId has no mapping are ingested as wiki-only.

Workbook filter

At large scale, you can limit which workbooks are fetched during ingest using workbookFilter:

# ktx.yaml
connections:
  sigma-main:
    driver: sigma
    client_id: "<your-client-id>"
    client_secret_ref: env:SIGMA_CLIENT_SECRET
    workbookFilter:
      includeArchived: false       # default
      includeExplorations: false   # default
      updatedSince: "2026-01-01T00:00:00Z"   # only recently updated workbooks

| Field | Default | Description |
| --- | --- | --- |
| includeArchived | false | Include archived workbooks |
| includeExplorations | false | Include exploration workbooks |
| updatedSince | - | ISO 8601 date; only workbooks updated on or after this date are fetched |
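
The three filter options combine as independent exclusions. The sketch below shows that logic over a list of workbook records; the record keys isArchived and isExploration are hypothetical names chosen for illustration, and the function is not part of ktx.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List, Optional


def parse_iso8601(value: str) -> datetime:
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_workbooks(
    workbooks: Iterable[Dict[str, Any]],
    include_archived: bool = False,
    include_explorations: bool = False,
    updated_since: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Keep only the workbooks that pass all three workbookFilter options."""
    cutoff = parse_iso8601(updated_since) if updated_since else None
    kept = []
    for wb in workbooks:
        if wb.get("isArchived") and not include_archived:
            continue
        if wb.get("isExploration") and not include_explorations:
            continue
        # "On or after" the cutoff: only strictly older workbooks are dropped.
        if cutoff is not None and parse_iso8601(wb["updatedAt"]) < cutoff:
            continue
        kept.append(wb)
    return kept
```

Timestamps are assumed to carry a timezone (for example a trailing Z) so that they compare cleanly against the cutoff.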

Notes

  • connectionMappings is optional for wiki-only ingest; it is required to generate semantic-layer sources and run warehouse validation
  • Context ingest (ktx ingest sigma-main) fetches from the Sigma API directly
  • Ingest is incremental: items whose updatedAt timestamp is unchanged since the last run are skipped
  • Models backed by CSV uploads or unsupported connector subtypes cannot have their spec exported; these are skipped with a warning (a Sigma API limitation)
  • Joins are not projected from Sigma data models in this release; joins: [] is always written by the projection step. Lookup relationships visible in data model specs are captured as wiki knowledge instead.
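
The incremental behavior noted above amounts to comparing each item's updatedAt against the value recorded on the previous run. A minimal sketch, assuming items carry an id key and that the previous run's timestamps are available as a plain dictionary (both hypothetical):

```python
from typing import Dict, Iterable, List


def items_to_fetch(items: Iterable[dict], last_seen: Dict[str, str]) -> List[dict]:
    """Keep items that are new or whose updatedAt differs from the last recorded value."""
    return [item for item in items if last_seen.get(item["id"]) != item["updatedAt"]]
```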

Google Drive

Ingests Google Docs from a shared Google Drive folder as wiki-ready knowledge content. This v1 implementation is knowledge-only and ingests only files with the Google Docs MIME type.

What it provides

  • Wiki pages synthesized from Google Docs content
  • Folder-scoped knowledge ingestion from a specific Drive folder
  • Markdown normalization for headings, lists, paragraphs, links, common inline formatting, and Google Docs tables

Connection config

# ktx.yaml
connections:
  company-docs:
    driver: gdrive
    service_account_key_ref: file:/absolute/path/to/google-service-account.json
    folder_id: your-google-drive-folder-id
    recursive: false

Authentication

| Method | Config |
| --- | --- |
| Service account JSON key file | service_account_key_ref: file:/absolute/path/to/key.json |

Google Cloud setup

  1. Create a Google Cloud project.
  2. Enable the Google Drive API.
  3. Enable the Google Docs API.
  4. Create a service account.
  5. Download the service account JSON key.
  6. Share the target Drive folder with the service account email.
  7. Reference the key in ktx.yaml with service_account_key_ref.

Required scopes

  • https://www.googleapis.com/auth/drive.readonly
  • https://www.googleapis.com/auth/documents.readonly

Configuration options

| Field | Description | Default |
| --- | --- | --- |
| service_account_key_ref | File reference to the service account JSON key | - |
| folder_id | Google Drive folder ID to ingest | - |
| recursive | Traverse subfolders under folder_id | false |

What gets ingested

  • Google Docs documents only
  • Wiki-oriented knowledge content
  • One work unit per staged Google Doc

Notes

  • gdrive is knowledge-only in v1; it does not produce semantic layer sources
  • ktx setup supports Google Drive configuration, including the service-account key ref, folder id, and recursive crawl flag
  • ktx connection test <connectionId> supports gdrive: it verifies that folder_id resolves to a folder the service account can read, then reports the number of Google Docs visible in it. A wrong or unshared folder_id fails the test instead of reporting zero docs
  • Only Google Docs are ingested in v1; other file types (Sheets, Slides, PDFs) in the folder are skipped and recorded in the staged manifest
  • The service account must be granted access to the target folder explicitly
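
The Docs-only rule can be pictured as a MIME-type partition over the folder listing. The sketch below assumes file records shaped like Drive API file resources (id, name, mimeType); the function name is hypothetical and subfolder traversal is not modeled.

```python
from typing import Dict, Iterable, List, Tuple

# MIME type that Google Drive assigns to native Google Docs documents.
GOOGLE_DOC_MIME = "application/vnd.google-apps.document"


def partition_drive_files(
    files: Iterable[Dict[str, str]]
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split a folder listing into Google Docs to ingest and files to skip."""
    docs: List[Dict[str, str]] = []
    skipped: List[Dict[str, str]] = []
    for entry in files:
        if entry.get("mimeType") == GOOGLE_DOC_MIME:
            docs.append(entry)
        else:
            skipped.append(entry)  # e.g. Sheets, Slides, PDFs
    return docs, skipped
```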

Common errors

| Error or symptom | Likely cause | Recovery |
| --- | --- | --- |
| Connector cannot read source files | source_dir, repo_url, repoUrl, metricflow.repoUrl, branch, or path is wrong | Verify the path locally or clone the repo manually with the same credentials |
| Private repo/API authentication fails | Token env var or secret file is missing | Export the env var or update auth_token_ref to a readable file |
| Ingest creates duplicate context | Existing source names or wiki pages do not match imported terminology | Review the diff, rename duplicates, and add wiki pages with canonical names |
| Notion ingest skips pages | Integration lacks access or root ids are missing | Share pages with the Notion integration and set root_page_ids or use all_accessible carefully |
| Generated semantic sources fail validation | Tool metadata does not match the live warehouse schema | Map BI/source databases to primary warehouse connections and rerun validation |