Semantic querying

How ktx compiles a short semantic query into safe, dialect-correct SQL using a reviewed join graph.

ktx's semantic layer is a compiler that turns intent into SQL. The agent declares what it wants - measures, dimensions, filters - in a small semantic query. ktx figures out the how: which tables to join, what grain to aggregate at, how to keep fanout from inflating measures, and what dialect the warehouse speaks.

This page covers four mechanics:

The semantic query contract agents send to the compiler.
The planner steps that turn a semantic query into SQL.
The join graph that backs those steps, and how it's built.
The fanout failure mode the compiler is designed to prevent.

Imperative SQL vs declarative semantic querying

Writing analytics SQL is imperative work. Every question forces the agent to hold two things in mind at once: what it wants - a measure, a slice, a filter - and how to compute it: which tables to join, which key links them, what grain to aggregate at, how to keep one fact from inflating another, and what dialect the warehouse speaks. Plumbing on top of intent, every query.

ktx's semantic layer separates those concerns:

You and ktx maintain the how. Sources, joins, grain, measures, and segments live in reviewable YAML - the analytical contract the team agrees on, version-controlled.
The agent declares the what. It sends a semantic query and trusts the compiler to produce safe SQL.

The agent stops reasoning about plumbing. It states intent. ktx turns that into SQL the warehouse can run.

Imperative vs declarative

Same answer, two contracts

On the left, the agent works imperatively: chooses tables, writes joins, picks the grain, and remembers each warehouse's dialect. On the right, the agent only declares what it wants. ktx handles every how.

Drag to pan · ⌘/Ctrl + scroll to zoom

Data agent

Asks: monthly net revenue and open tickets per segment, high-value orders only, no test customers

Without ktx

Agent writes the SQL

Stitches four tables, mixes grains, and ships numbers that won't match the dashboard.

sqlagent-authored

-- agent stitches four tables, mixes facts,
-- and ships numbers that won't match the dashboard
 
SELECT
  c.segment,
  DATE_TRUNC('month', o.created_at)  AS month,
  SUM(o.amount) - SUM(r.amount)      AS net_revenue,
  COUNT(t.id)                        AS open_tickets
FROM customers c
LEFT JOIN orders  o
  ON o.customer_id = c.id
LEFT JOIN refunds r
  ON r.order_id = o.id
LEFT JOIN tickets t
  ON t.customer_id = c.id
WHERE
  c.is_test = false
  AND o.amount >= 100
  AND t.status = 'open'     -- turns LEFT JOIN into INNER
GROUP BY
  c.segment,
  DATE_TRUNC('month', o.created_at)
ORDER BY
  month,
  c.segment
LIMIT 1000;
 
-- chasm trap: orders rows multiply by tickets and refunds
-- net_revenue and open_tickets are both inflated
-- DATE_TRUNC syntax breaks on BigQuery

Re-stitches a 4-way join on every question
Reinvents net_revenue and the high-value rule
Hides a chasm trap across three facts
Filters a LEFT JOIN target in WHERE
Hardcodes one warehouse's date functions

With ktx

Agent sends a semantic query

Names the measures, dimensions, segments, and filters it wants. No SQL, no joins.

jsonagent-authored

{
  "measures": [
    "orders.revenue",
    "refunds.amount",
    "tickets.open_count",
    {
      "name": "net_revenue",
      "expr": "orders.revenue - refunds.amount"
    }
  ],
  "dimensions": [
    "customers.segment",
    { "field": "orders.created_at", "granularity": "month" }
  ],
  "segments": ["orders.high_value"],
  "filters": ["customers.is_test = false"],
  "limit": 1000
}

Semantic-layer engine

Plans the query against the reviewed graph

1
Resolve refs
qualify columns, look up measure formulas
2
Build join tree
Dijkstra over typed edges from an anchor source
3
Detect fanout
group measures by source, flag chasm traps
4
Localize aggregation
pre-aggregate each fact as its own CTE
5
Transpile dialect
emit Postgres-shaped SQL, then target dialect

Generated SQL

ktx returns dialect-correct SQL

Pre-aggregates each fact at its own grain, then joins back on the shared dimension.

sqlktx-compiled

WITH orders_agg AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', created_at) AS month,
    SUM(amount) AS revenue
  FROM public.orders
  WHERE amount >= 100
  GROUP BY
    customer_id,
    DATE_TRUNC('month', created_at)
),
refunds_agg AS (
  SELECT
    o.customer_id,
    DATE_TRUNC('month', o.created_at) AS month,
    SUM(r.amount) AS refund_amount
  FROM public.refunds r
  JOIN public.orders o
    ON o.id = r.order_id
  WHERE o.amount >= 100
  GROUP BY
    o.customer_id,
    DATE_TRUNC('month', o.created_at)
),
tickets_agg AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', opened_at) AS month,
    COUNT(*) AS open_count
  FROM public.tickets
  WHERE status = 'open'
  GROUP BY
    customer_id,
    DATE_TRUNC('month', opened_at)
)
SELECT
  c.segment,
  o.month,
  SUM(o.revenue - COALESCE(r.refund_amount, 0)) AS net_revenue,
  SUM(o.revenue)                               AS revenue,
  SUM(r.refund_amount)                         AS refund_amount,
  SUM(COALESCE(t.open_count, 0))               AS open_tickets
FROM public.customers c
JOIN orders_agg o
  ON o.customer_id = c.id
LEFT JOIN refunds_agg r
  ON r.customer_id = c.id
 AND r.month = o.month
LEFT JOIN tickets_agg t
  ON t.customer_id = c.id
 AND t.month = o.month
WHERE c.is_test = false
GROUP BY
  c.segment,
  o.month
ORDER BY
  o.month,
  c.segment
LIMIT 1000;

Walks the reviewed join graph automatically
Uses the canonical net_revenue formula
Pre-aggregates each fact to avoid the chasm trap
Keeps LEFT JOIN filters on the dimension source
Transpiles DATE_TRUNC to the target dialect

Warehouse

PostgreSQL • Snowflake • BigQuery

The semantic query contract

A semantic query is the JSON payload the agent sends. Every field is optional except measures, and column references are fully qualified (source.column) so the compiler never has to guess where a name came from.

Notice what's not in the payload: no FROM, no JOIN, no GROUP BY, no WITH. The agent states what it wants. ktx picks the join path, the grain, the SQL shape, and the dialect.

Field	Purpose
`measures`	Names of pre-defined measures, or inline expressions like `sum(orders.amount)`
`dimensions`	Columns to group by, optionally with a `granularity` for time fields
`filters`	Row-level predicates, classified into `WHERE` or `HAVING` at planning time
`segments`	Named filter sets defined on a source, applied as additional predicates
`order_by`	Sort fields with optional direction
`limit`	Row cap on the result

A typical agent call looks like this:

{
  "measures": ["orders.revenue", "tickets.ticket_count"],
  "dimensions": ["customers.segment"],
  "filters": ["orders.created_at >= '2025-01-01'"],
  "limit": 1000
}

That payload is enough for ktx to plan and compile. The agent never authors a join, a CTE, or a dialect-specific cast.

What the planner does

The planner is a deterministic pipeline. Each semantic query runs through the same ordered steps before any SQL is emitted.

Resolve refs. Qualify bare column names, look up pre-defined measure expressions, and classify each measure as raw or derived.
Pick an anchor and build the join tree. Choose the largest measure source as the root, then run a shortest-path search across the typed join graph to reach every required source.
Detect fanout. Group measures by their owning source. If more than one group exists, the planner marks the query as a chasm trap and switches to aggregate-locality compilation.
Classify filters. Split predicates into row-level (WHERE) and aggregate-level (HAVING) based on whether they reference a measure.
Generate SQL. Emit Postgres-shaped SQL with the right shape: single-source aggregation when the query is safe, per-source CTEs when fanout is present.
Transpile to the target dialect. Run the result through sqlglot so the warehouse receives syntax it understands.

The output is the SQL string, the resolved plan, and any warnings surfaced during planning.

The join graph

A semantic source is a node. A declared join is a typed edge. The graph is bidirectional: every forward edge has a reverse with the relationship inverted, so the planner can traverse from any anchor.

Relationship	Planning impact
`many_to_one`	Safe direction for adding dimensions
`one_to_many`	Multiplies measures and triggers fanout handling
`one_to_one`	Safe in either direction when keys match
Equal-cost paths	Treated as ambiguous; aliases or explicit joins resolve them

customers

grain: customer_id

orders

grain: order_id

order_items

grain: order_id, line_id

orders -> customers: many_to_one

orders -> order_items: one_to_many

Example: refunds joins to orders. Used carefully, it explains net revenue. Joined naively, it duplicates order-level measures.

Edges and grain come from your YAML. The compiler treats them as fact, not a guess.

yamlsemantic-layer/warehouse/orders.yaml

name: orders
table: public.orders
grain: [order_id]
joins:
  - to: customers
    on: customer_id = customers.id
    relationship: many_to_one
  - to: order_items
    on: id = order_items.order_id
    relationship: one_to_many
measures:
  - name: revenue
    expr: sum(case when status != 'refunded' then amount end)

Building and maintaining the graph

ktx builds the graph from evidence and accepted edits, not from runtime inference. Each input contributes a different kind of authority.

Evidence	What it contributes
Declared primary keys	Initial row grain
Declared foreign keys	Formal join candidates
Inferred relationships	Edges when the warehouse lacks constraints
dbt, MetricFlow, and LookML imports	Existing metrics, dimensions, explores, and joins
Query history	Real join and filter patterns from analyst SQL
Analyst review	Final authority before context is merged

Semantic maintenance loop

Every accepted correction becomes input to the next graph build.

reviewed context

The accepted graph becomes the starting point for the next build.

Step 1

ingest evidence

scan schemas, imports, and accepted files

Step 2

YAML diff

draft source, join, grain, and measure changes

Step 3

validation

check relationships, syntax, and unsafe query shapes

Step 4

analyst review

accept, edit, or reject generated context

Step 5

agent use

serve context to search, explain, and query

Step 6

corrections

agent and analyst fixes become new evidence

Fanout and aggregate locality

Fanout is the classic analytics failure mode. Two fact tables join to a shared dimension. A naive query joins them all together first, so each row from one fact is multiplied by the matching rows from the other. Measures duplicate, numbers go wrong, and the agent doesn't notice.

ktx's planner detects the shape by grouping measures by their owning source. If more than one source contributes raw measures, the generator switches to aggregate locality: each fact is pre-aggregated at its own grain inside a CTE, and the CTEs are joined back to the dimension at the end.

Naive SQL shape	Semantic-layer SQL shape
Join facts and dimensions first, then aggregate	Aggregate each fact at its own grain, then join
Put every filter in one outer `WHERE` clause	Keep measure filters with the measure source
Trust the shortest textual join path	Prefer typed safe paths, reject disconnected sources
Let dimension grain differ across facts	Raise when an asymmetric dimension would fan out another measure

The result is the same analyst answer, computed with the join shape an analyst would have written by hand.

Where the context comes from

The planner is only as good as the YAML it reads. ktx builds and maintains that YAML for you.

raw-sources/<connection>/ holds scan evidence from your warehouse: schemas, columns, keys, samples, and observed usage patterns.
wiki/ holds business language, definitions, and caveats. The planner doesn't read wiki at compile time, but the agent does, so measure names and dimensions stay anchored to terms the team uses.
semantic-layer/<connection>/ holds the structured sources, joins, grain, measures, and segments the planner actually compiles against.

Every accepted edit flows back into the next ingest, so the graph stays current as the warehouse changes.

Agent usage notes

Point an agent at this page when it needs to explain why ktx asks for grain, why a query was rejected as unsafe, or why the compiled SQL looks different from what the agent first proposed.

Agent task	Relevant section	Next page
Explain the semantic query shape	The semantic query contract	ktx sl
Describe what the planner does between query and SQL	What the planner does	ktx sl
Explain why ktx asks for grain and relationship types	The join graph	Writing context
Diagnose duplicated measures after a join	Fanout and aggregate locality	ktx sl
Describe how semantic context stays current	Building and maintaining the graph	Reviewing Context

Semantic querying

Same answer, two contracts

On this page