Wiki retrieval
How ktx ranks wiki pages with hybrid search, links them into a graph, and keeps both sides anchored to evidence.
The wiki is the prose half of the context layer. Agents reach it two ways: they search for a page, or they follow references inside pages they have already opened. This page covers how both work.
- The wiki page contract that retrieval and validation depend on.
- The hybrid search pipeline that turns a question into ranked pages.
- The reference graph agents traverse without rerunning search.
- How pages get authored from evidence, and how broken edges get pruned.
The wiki page contract
A wiki page is a Markdown file with a YAML frontmatter block. Frontmatter
carries metadata; the prose below it is free-form. Keys are flat tokens
(revenue, mart_account_segments), not paths, so every page is
addressable as [[key]] from any other page.

    ---
    summary: Paid order value after refunds
    tags: [finance, orders]
    sl_refs: [warehouse.orders]
    refs: [segment-classification]
    usage_mode: auto
    ---

    Revenue is paid order amount after refund adjustments. Use
    `orders.total_revenue` for recognized order value and
    `orders.order_count` for paid order volume.

| Field | Purpose |
|---|---|
| summary | One-line description shown in search results and the agent's knowledge index |
| tags | Topic labels mixed into the search text and used for filtering |
| refs | Outgoing edges to other wiki pages by key |
| sl_refs | Outgoing edges to semantic-layer sources by connection.source name |
| usage_mode | always, auto, or never: whether the agent must, may, or must not surface this page |
| source | Where the page came from when authored by ingest (e.g. historic-sql, dbt) |
| usage | Stats attached to historic-SQL pattern pages: executions, distinct users, runtime percentiles, error rate |
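
To make the contract concrete, here is a minimal sketch of splitting a page into frontmatter and body. It is not ktx's parser: `split_page` is a hypothetical helper, and it handles only the flat `key: value` and `key: [a, b]` forms used in the example page, where a real implementation would use a full YAML library.

```python
def split_page(text: str) -> tuple[dict, str]:
    """Split a wiki page into (frontmatter, body).

    Assumption: frontmatter uses only flat `key: value` and
    `key: [a, b]` entries, as in the example page.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return {}, text  # unterminated block: treat as plain prose
    meta: dict = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Inline list, e.g. tags: [finance, orders]
            meta[key.strip()] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
        else:
            meta[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


PAGE = """---
summary: Paid order value after refunds
tags: [finance, orders]
sl_refs: [warehouse.orders]
refs: [segment-classification]
usage_mode: auto
---
Revenue is paid order amount after refund adjustments.
"""

meta, body = split_page(PAGE)
print(meta["tags"])        # ['finance', 'orders']
print(meta["usage_mode"])  # auto
```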
Pages live under two scopes. wiki/global/*.md is the team's shared
context; wiki/user/<user>/*.md is per-agent scratch space that shadows
global pages with the same key.
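
Shadowing can be pictured as a lookup that checks the user scope before the global one. The sketch below assumes a page with key `revenue` is stored as `revenue.md` in its scope directory; `resolve_page` and its parameters are hypothetical names, not part of ktx.

```python
from pathlib import Path
from typing import Optional


def resolve_page(root: Path, key: str, user: Optional[str] = None) -> Optional[Path]:
    """Return the file that answers `key`, preferring the user's scope.

    Assumption: each page lives at <scope dir>/<key>.md.
    """
    candidates = []
    if user is not None:
        # A user page with the same key shadows the global one.
        candidates.append(root / "wiki" / "user" / user / f"{key}.md")
    candidates.append(root / "wiki" / "global" / f"{key}.md")
    for path in candidates:
        if path.is_file():
            return path
    return None
```

With both `wiki/global/revenue.md` and `wiki/user/ana/revenue.md` on disk, `resolve_page(root, "revenue", "ana")` returns the user copy, while `resolve_page(root, "revenue")` returns the global one.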
What retrieval does
A wiki search runs the same ordered steps every time.
- Normalize the query. Lowercase, tokenize, deduplicate terms.
- Score in three lanes. Lexical (SQLite FTS5 bm25), semantic (cosine similarity over embeddings), and token (term-overlap fallback) each rank every page independently.
- Fuse with Reciprocal Rank Fusion. Each lane contributes `weight / (60 + rank)` to a candidate's score. Lanes that fail or skip are dropped, not zeroed.
- Order and trim. Sort by fused score, then by how many lanes matched, then by id for stable tie-breaks. Return the top `limit` results with their summaries.
- Hydrate on demand. The agent calls `wiki_read` to load full bodies for the few pages that look relevant.
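
The first step and the token lane are simple enough to sketch. This is an illustration under assumptions, not ktx's code: the tokenizer regex, the function names, and the overlap count used as the token score are all hypothetical choices.

```python
import re

TOKEN = re.compile(r"[a-z0-9_]+")  # assumed tokenizer: lowercase word characters


def normalize_query(query: str) -> list[str]:
    """Lowercase, tokenize, and deduplicate terms, keeping first-seen order."""
    return list(dict.fromkeys(TOKEN.findall(query.lower())))


def token_lane(terms: list[str], pages: dict[str, str]) -> list[str]:
    """Rank page ids by how many query terms appear in their search text."""
    scored = []
    for page_id, text in pages.items():
        words = set(TOKEN.findall(text.lower()))
        overlap = sum(1 for term in terms if term in words)
        if overlap:
            scored.append((-overlap, page_id))  # more overlap first, then id
    return [page_id for _, page_id in sorted(scored)]


terms = normalize_query("Revenue by Segment, revenue after refunds")
print(terms)  # ['revenue', 'by', 'segment', 'after', 'refunds']

pages = {
    "revenue": "revenue Paid order value after refunds",
    "segment-classification": "segment classification rules for accounts",
}
print(token_lane(terms, pages))  # ['revenue', 'segment-classification']
```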
Hybrid retrieval: three lanes, one ranking

| Lane | Method | Strength | Weight |
|---|---|---|---|
| lexical | SQLite FTS5 / bm25 | Matches stems and phrases. Strong on the exact terms the team already uses. | 1.5 |
| semantic | Cosine over embeddings | Catches synonyms and paraphrases the lexical lane misses. | 2 |
| token | Term-overlap fallback | Always available, so short queries still produce candidates. | 0.75 |

Reciprocal Rank Fusion: score = Σ weight / (60 + rank)
Pages that rank well in multiple lanes outscore pages that rank well in only one.
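
The fusion step can be sketched with the lane weights and the constant 60 stated on this page. The function name `fuse`, the use of `None` to mark a failed or skipped lane, and the 1-based rank are assumptions made for illustration.

```python
from typing import Optional

RRF_K = 60
LANE_WEIGHTS = {"lexical": 1.5, "semantic": 2.0, "token": 0.75}


def fuse(lanes: dict[str, Optional[list[str]]], limit: int = 10) -> list[tuple[str, float]]:
    """Fuse per-lane rankings with weighted Reciprocal Rank Fusion.

    Assumptions: rank is 1-based, and a lane that failed or was
    skipped is passed as None.
    """
    scores: dict[str, float] = {}
    matched: dict[str, int] = {}
    for lane, ranking in lanes.items():
        if ranking is None:
            continue  # dropped, not zeroed
        weight = LANE_WEIGHTS[lane]
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + weight / (RRF_K + rank)
            matched[page_id] = matched.get(page_id, 0) + 1
    # Fused score, then number of lanes matched, then id for stable ties.
    ordered = sorted(scores, key=lambda p: (-scores[p], -matched[p], p))
    return [(p, scores[p]) for p in ordered[:limit]]


# "a" tops the semantic lane only; "b" appears in both working lanes.
results = fuse({"lexical": ["b"], "semantic": ["a", "b"], "token": None})
print([page_id for page_id, _ in results])  # ['b', 'a']
```

Here `b` scores 1.5/61 + 2/62 while `a` scores only 2/61, so the page found by two lanes ranks first even though `a` won the heaviest lane.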
The text each lane scores is built deterministically: page key, summary, body, and tags concatenated in that order. A precise summary and the right tags make a page reachable before its body matches anything.
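
As a rough sketch of that concatenation: the field order below follows the sentence above, while the newline separator, the space-joined tags, and the name `build_search_text` are assumptions.

```python
def build_search_text(key: str, summary: str, body: str, tags: list[str]) -> str:
    """Concatenate key, summary, body, and tags in that fixed order."""
    parts = [key, summary, body, " ".join(tags)]
    # Skip empty fields so a page with no body or tags yields no blank lines.
    return "\n".join(part for part in parts if part)


text = build_search_text(
    key="revenue",
    summary="Paid order value after refunds",
    body="Revenue is paid order amount after refund adjustments.",
    tags=["finance", "orders"],
)
print(text.splitlines()[0])   # revenue
print(text.splitlines()[-1])  # finance orders
```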
The page graph
Two frontmatter fields and one inline syntax turn the wiki into a graph the agent traverses without re-running search.
| Edge | Source | Target |
|---|---|---|
| sl_refs: [warehouse.orders] | Frontmatter | Semantic source by name |
| refs: [segment-classification] | Frontmatter | Another wiki page by key |
| [[segment-classification]] | Inline in body | Another wiki page by key |
refs stays in the prose layer; sl_refs crosses into the executable
half of the context layer. Inline [[wikilinks]] are extracted from
page bodies at validation time and treated as declared refs.
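
A minimal sketch of that extraction, assuming a wikilink is any `[[key]]` with no nested brackets. The regex and the name `declared_refs` are illustrative, not ktx's implementation.

```python
import re

WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")  # assumed syntax: [[key]]


def declared_refs(frontmatter_refs: list[str], body: str) -> list[str]:
    """Merge frontmatter refs with inline [[wikilinks]], deduplicated in order."""
    inline = [match.strip() for match in WIKILINK.findall(body)]
    return list(dict.fromkeys([*frontmatter_refs, *inline]))


body = "See [[segment-classification]] and [[mart_account_segments]]."
print(declared_refs(["segment-classification"], body))
# ['segment-classification', 'mart_account_segments']
```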
Anatomy of a traversal: edges to prose, edges to SQL

| Node | File | Declares or defines |
|---|---|---|
| revenue | wiki/global/revenue.md | sl_refs: warehouse.orders; refs: segment-classification |
| segment-classification | wiki/global/segment-classification.md | sl_refs: warehouse.customers |
| warehouse.orders | semantic-layer/warehouse/orders.yaml | grain: order_id · measure: total_revenue |
| warehouse.customers | semantic-layer/warehouse/customers.yaml | grain: customer_id · dim: segment |
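
The traversal above can be sketched as a breadth-first walk over `refs` that collects every `sl_refs` target it meets. The in-memory `PAGES` dict mirrors the example pages; `traverse` is a hypothetical helper, and a real agent would load each page's frontmatter on demand.

```python
from collections import deque

# The two example pages, reduced to their declared edges.
PAGES = {
    "revenue": {"refs": ["segment-classification"], "sl_refs": ["warehouse.orders"]},
    "segment-classification": {"refs": [], "sl_refs": ["warehouse.customers"]},
}


def traverse(start: str, pages: dict) -> tuple[list[str], list[str]]:
    """Walk refs breadth-first; return (pages visited, semantic sources reached)."""
    seen_pages, sources = [start], []
    queue = deque([start])
    while queue:
        page = pages[queue.popleft()]
        for source in page["sl_refs"]:
            if source not in sources:
                sources.append(source)
        for ref in page["refs"]:
            if ref in pages and ref not in seen_pages:
                seen_pages.append(ref)
                queue.append(ref)
    return seen_pages, sources


print(traverse("revenue", PAGES))
# (['revenue', 'segment-classification'], ['warehouse.orders', 'warehouse.customers'])
```

Starting from `revenue`, one hop reaches `segment-classification`, and both semantic sources are found without another search.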
Keeping the graph live
A page that references a deleted source is worse than no reference at all: it sends the agent confidently to a definition that no longer exists. ktx prevents that with three layered checks:
- At write time. Every `refs` entry and `[[wikilink]]` is validated against the pages visible in the current scope. A write that targets a missing page is rejected before any file changes.
- At ingest time. Adapters prune `sl_refs` when the target source is deleted, mark stale pattern pages with `stale_since`, and set `archived_since` on retired pages instead of removing them silently.
- At session end. Every page touched by an ingest run is re-scanned for references that resolved at write time but no longer point at a live target. Dangling pairs are reported so the next iteration can fix them.
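
The write-time and session-end checks can be sketched as two small functions. Both names, the `ValueError`, and the in-memory sets of live keys are assumptions for illustration; the sketch skips the ingest-time pruning step.

```python
def validate_write(refs: list[str], visible_keys: set[str]) -> None:
    """Write-time check: reject refs that target pages missing from the scope."""
    missing = [ref for ref in refs if ref not in visible_keys]
    if missing:
        # Raised before any file would be written.
        raise ValueError(f"unknown wiki refs: {', '.join(missing)}")


def dangling_pairs(
    touched: dict[str, dict], live_pages: set[str], live_sources: set[str]
) -> list[tuple[str, str]]:
    """Session-end scan: (page, target) pairs that no longer resolve."""
    pairs = []
    for key, meta in sorted(touched.items()):
        for ref in meta.get("refs", []):
            if ref not in live_pages:
                pairs.append((key, ref))
        for source in meta.get("sl_refs", []):
            if source not in live_sources:
                pairs.append((key, source))
    return pairs


touched = {"revenue": {"refs": ["segment-classification"], "sl_refs": ["warehouse.orders"]}}
# warehouse.orders was deleted after the page was written.
print(dangling_pairs(touched, {"revenue", "segment-classification"}, {"warehouse.customers"}))
# [('revenue', 'warehouse.orders')]
```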
Where the pages come from
ktx writes wiki pages from evidence, not free invention. Each input contributes a different kind of page, and accepted edits feed the next ingest as input.
| Evidence | What it produces |
|---|---|
| Schema scans | One page per material table, with grain, columns, and known constraints |
| Query history | Pattern pages with usage frontmatter for executions, distinct users, runtime percentiles, and error rate |
| dbt manifests | Pages per model, exposure, and test, with sl_refs to the matching semantic source |
| MetricFlow, Looker, Metabase | Pages per metric, explore, or saved question, linked back to the source artifact |
| Notion, docs, analyst notes | Pages preserving business definitions, policies, and incident write-ups |
| Agent and analyst edits | First-class input to the next ingest, not a fork |
Provenance stays with the page. Ingested pages keep HTML comments like
<!-- from: raw-sources/.../cards/69.json --> inline, so a reviewer can
walk from the prose back to the artifact that produced it.
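
A reviewer's tooling could pull those comments out with a regular expression. This sketch assumes the comment format shown above; `provenance_paths` is a hypothetical helper, and the elided path in the example is kept exactly as it appears on this page.

```python
import re

# Assumed comment shape: <!-- from: <path> -->
PROVENANCE = re.compile(r"<!--\s*from:\s*(.+?)\s*-->")


def provenance_paths(body: str) -> list[str]:
    """Return the source paths named in provenance comments, in order."""
    return PROVENANCE.findall(body)


body = (
    "Revenue is paid order amount after refund adjustments.\n"
    "<!-- from: raw-sources/.../cards/69.json -->\n"
)
print(provenance_paths(body))  # ['raw-sources/.../cards/69.json']
```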
Agent usage notes
Point an agent at this page when it needs to explain why a wiki search returned the pages it did, why a write was rejected, or how the wiki stays in step with the semantic layer.
| Agent task | Relevant section | Next page |
|---|---|---|
| Explain why two searches return different pages for the same query | What retrieval does | ktx wiki |
| Decide whether to add a refs or sl_refs entry | The page graph | Writing Context |
| Repair a wiki write rejected for missing references | Keeping the graph live | Writing Context |
| Describe how historic SQL becomes a wiki page | Where the pages come from | Building Context |
| Explain raw-source provenance comments | Where the pages come from | Reviewing Context |