Wiki retrieval

How ktx ranks wiki pages with hybrid search, links them into a graph, and keeps both sides anchored to evidence.

The wiki is the prose half of the context layer. Agents reach it two ways: they search for a page, then follow references inside the pages they already opened. This page covers how both work.

- The wiki page contract that retrieval and validation depend on.
- The hybrid search pipeline that turns a question into ranked pages.
- The reference graph agents traverse without rerunning search.
- How pages get authored from evidence, and how broken edges get pruned.

The wiki page contract

A wiki page is a Markdown file with a YAML frontmatter block. Frontmatter carries metadata; the prose below it is free-form. Page keys are flat tokens (`revenue`, `mart_account_segments`), not paths, so every page is addressable as `[[key]]` from any other page.

wiki/global/revenue.md

---
summary: Paid order value after refunds
tags: [finance, orders]
sl_refs: [warehouse.orders]
refs: [segment-classification]
usage_mode: auto
---

Revenue is paid order amount after refund adjustments.

Use `orders.total_revenue` for recognized order value and
`orders.order_count` for paid order volume.

| Field | Purpose |
| --- | --- |
| `summary` | One-line description shown in search results and the agent's knowledge index |
| `tags` | Topic labels mixed into the search text and used for filtering |
| `refs` | Outgoing edges to other wiki pages by key |
| `sl_refs` | Outgoing edges to semantic-layer sources by `connection.source` name |
| `usage_mode` | `always`, `auto`, or `never`: whether the agent must, may, or must not surface this page |
| `source` | Where the page came from when authored by ingest (e.g. `historic-sql`, `dbt`) |
| `usage` | Stats attached to historic-SQL pattern pages: executions, distinct users, runtime percentiles, error rate |
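The contract above can be read with very little code. The following is a minimal sketch, assuming frontmatter stays as flat as the example page (scalar values and inline `[a, b]` lists). `parse_page` is a hypothetical helper for illustration, not a ktx function, and a real reader would likely use a full YAML parser instead.

```python
def parse_page(text):
    """Split a wiki page into (frontmatter dict, body string).

    Handles only flat `key: value` pairs and inline `[a, b]` lists,
    which is all the example page uses. Not a general YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at all
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is free-form prose.
            return meta, "\n".join(lines[i + 1:]).strip()
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not key/value pairs
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            items = value[1:-1].split(",")
            meta[key.strip()] = [item.strip() for item in items if item.strip()]
        else:
            meta[key.strip()] = value
    return {}, text  # unterminated frontmatter: treat the file as plain body
```

Run against the `revenue` page above, this would yield `tags` and `refs` as lists and `usage_mode` as the string `auto`, with the prose returned separately.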

Pages live under two scopes. `wiki/global/*.md` is the team's shared context; `wiki/user/<user>/*.md` is per-agent scratch space that shadows global pages with the same key.
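Shadowing can be pictured as a two-layer dictionary merge. This sketch assumes a page's key is its file name without the `.md` extension, which matches the `revenue` example but is not stated as a rule here. `visible_pages` is a hypothetical name.

```python
from pathlib import PurePosixPath


def visible_pages(paths, user):
    """Map page key -> path for one user, with user pages shadowing global ones."""
    global_pages, user_pages = {}, {}
    for path in paths:
        p = PurePosixPath(path)
        if p.suffix != ".md":
            continue
        # Assumption: the key is the file stem, e.g. revenue.md -> "revenue".
        if p.parts[:-1] == ("wiki", "global"):
            global_pages[p.stem] = path
        elif p.parts[:-1] == ("wiki", "user", user):
            user_pages[p.stem] = path
    # Later entries win, so a user page replaces the global page with the same key.
    return {**global_pages, **user_pages}
```

Pages under another user's directory never appear in the result, so one user's scratch space cannot shadow pages for anyone else.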

What retrieval does

A wiki search runs the same ordered steps every time.

1. Normalize the query. Lowercase, tokenize, deduplicate terms.
2. Score in three lanes. Lexical (SQLite FTS5 bm25), semantic (cosine similarity over embeddings), and token (term-overlap fallback) each rank every page independently.
3. Fuse with Reciprocal Rank Fusion. Each lane contributes `weight / (60 + rank)` to a candidate's score. Lanes that fail or skip are dropped, not zeroed.
4. Order and trim. Sort by fused score, then by how many lanes matched, then by id for stable tie-breaks. Return the top `limit` results with their summaries.
5. Hydrate on demand. The agent calls `wiki_read` to load full bodies for the few pages that look relevant.
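The first and fourth steps are simple enough to sketch directly. The tokenization rule (runs of letters, digits, and underscores) and the candidate shape (`id`, `score`, `lanes`) are assumptions made for illustration; ktx's actual tokenizer and result type may differ.

```python
import re


def normalize_query(query):
    """Lowercase, tokenize, and deduplicate terms, keeping first-seen order."""
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    return list(dict.fromkeys(tokens))


def order_and_trim(candidates, limit):
    """Sort by fused score (desc), lanes matched (desc), then id (asc)."""
    ranked = sorted(
        candidates,
        key=lambda c: (-c["score"], -c["lanes"], c["id"]),
    )
    return ranked[:limit]
```

Ending the sort key with the id makes the ordering total, so two runs over the same candidates return the same list even when scores and lane counts tie.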

Hybrid retrieval: three lanes, one ranking

| Lane | Method | Strength | Default weight |
| --- | --- | --- | --- |
| lexical | SQLite FTS5 / bm25 | Matches stems and phrases. Strong on the exact terms the team already uses. | 1.5 |
| semantic | Cosine over embeddings | Catches synonyms and paraphrases the lexical lane misses. | 2 |
| token | Term-overlap fallback | Always available, so short queries still produce candidates. | 0.75 |

Reciprocal Rank Fusion: score = Σ weight / (60 + rank), summed over the lanes that ranked the page. Pages that rank well in multiple lanes outscore pages that rank well in only one.

Defaults are tunable. Lane weights and the RRF constant K are configuration, not assumptions.
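A short sketch of the fusion step, using the default weights and constant given above. It assumes ranks are 1-based and that a failed or skipped lane is represented as `None`. Both are illustrative choices, and `rrf_fuse` is a hypothetical name.

```python
RRF_K = 60
LANE_WEIGHTS = {"lexical": 1.5, "semantic": 2.0, "token": 0.75}


def rrf_fuse(lane_rankings, weights=LANE_WEIGHTS, k=RRF_K):
    """Fuse per-lane rankings into one score per page.

    lane_rankings maps lane name -> list of page ids, best first,
    or None when the lane failed or was skipped.
    Returns (scores, lanes_matched), both keyed by page id.
    """
    scores, lanes_matched = {}, {}
    for lane, ranking in lane_rankings.items():
        if ranking is None:
            continue  # dropped lane: contributes nothing, penalizes nothing
        for rank, page_id in enumerate(ranking, start=1):  # assumed 1-based
            scores[page_id] = scores.get(page_id, 0.0) + weights[lane] / (k + rank)
            lanes_matched[page_id] = lanes_matched.get(page_id, 0) + 1
    return scores, lanes_matched


scores, lanes = rrf_fuse({
    "lexical": ["revenue", "orders"],
    "semantic": ["orders", "revenue"],
    "token": None,
})
# "orders" edges out "revenue": the heavier semantic lane ranks it first.
```

With K at 60, the gap between rank 1 and rank 2 inside a lane is small, so lane weight and agreement across lanes tend to matter more than a single first-place finish.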

The text each lane scores is built deterministically: page key, summary, body, and tags concatenated in that order. A precise summary and the right tags make a page reachable before its body matches anything.
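As an illustration of that ordering, here is a minimal sketch. The newline separator and the space-joined tags are assumptions; only the field order comes from the description above.

```python
def build_search_text(key, summary, body, tags):
    """Concatenate key, summary, body, and tags in that fixed order."""
    parts = [key, summary, body, " ".join(tags)]
    # Skip empty fields so a page with no body still yields clean text.
    return "\n".join(part for part in parts if part)
```

A page whose body is still empty is already searchable by its key, summary, and tags, which is why those fields deserve care.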

The page graph

Two frontmatter fields and one inline syntax turn the wiki into a graph the agent traverses without re-running search.

| Edge | Source | Target |
| --- | --- | --- |
| `sl_refs: [warehouse.orders]` | Frontmatter | Semantic source by name |
| `refs: [segment-classification]` | Frontmatter | Another wiki page by key |
| `[[segment-classification]]` | Inline in body | Another wiki page by key |

`refs` stays in the prose layer; `sl_refs` crosses into the executable half of the context layer. Inline `[[wikilinks]]` are extracted from page bodies at validation time and treated as declared refs.
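Extracting inline links and merging them with frontmatter `refs` could look like the sketch below. The regular expression assumes the plain `[[key]]` form only. Whether ktx supports aliases or anchors inside wikilinks is not covered here, and `declared_refs` is a hypothetical name.

```python
import re

# Matches [[key]] where key contains no square brackets.
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def declared_refs(frontmatter_refs, body):
    """Frontmatter refs plus inline [[wikilinks]], deduplicated in first-seen order."""
    inline = [match.strip() for match in WIKILINK.findall(body)]
    return list(dict.fromkeys([*frontmatter_refs, *inline]))
```

A key that appears both in `refs` and inline is counted once, so the graph carries one edge per target regardless of how the page declares it.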

Anatomy of a traversal: edges to prose, edges to SQL

Wiki pages:

- `revenue` (wiki/global/revenue.md) declares `sl_refs: warehouse.orders` and `refs: segment-classification`.
- `segment-classification` (wiki/global/segment-classification.md) declares `sl_refs: warehouse.customers`.

Semantic sources:

- `warehouse.orders` (semantic-layer/warehouse/orders.yaml): grain `order_id`, measure `total_revenue`.
- `warehouse.customers` (semantic-layer/warehouse/customers.yaml): grain `customer_id`, dimension `segment`.

Edges:

- revenue → warehouse.orders · sl_refs
- revenue → segment-classification · refs
- segment-classification → warehouse.customers · sl_refs
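The traversal in this example can be sketched as a breadth-first walk over `refs`, collecting `sl_refs` along the way. The in-memory `PAGES` dictionary and the `max_depth` parameter are assumptions for illustration; they do not describe how ktx stores or bounds the graph.

```python
from collections import deque

# The two example pages, reduced to their outgoing edges.
PAGES = {
    "revenue": {"refs": ["segment-classification"], "sl_refs": ["warehouse.orders"]},
    "segment-classification": {"refs": [], "sl_refs": ["warehouse.customers"]},
}


def traverse(start, pages, max_depth=2):
    """Follow refs breadth-first from one page; return (pages visited, sources reached)."""
    visited = [start]
    sources = []
    queue = deque([(start, 0)])
    while queue:
        key, depth = queue.popleft()
        page = pages[key]
        for source in page["sl_refs"]:
            if source not in sources:
                sources.append(source)
        if depth == max_depth:
            continue  # do not expand refs beyond the depth limit
        for ref in page["refs"]:
            if ref in pages and ref not in visited:
                visited.append(ref)
                queue.append((ref, depth + 1))
    return visited, sources
```

Starting from `revenue`, the walk reaches `segment-classification` through `refs` and both semantic sources through `sl_refs`, without issuing a second search.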

Keeping the graph live

A page that references a deleted source is worse than no reference at all: it sends the agent confidently to a definition that no longer exists. ktx prevents that with three layered checks:

- At write time. Every `refs` entry and `[[wikilink]]` is validated against the pages visible in the current scope. A write that targets a missing page is rejected before any file changes.
- At ingest time. Adapters prune `sl_refs` when the target source is deleted, mark stale pattern pages with `stale_since`, and set `archived_since` on retired pages instead of removing them silently.
- At session end. Every page touched by an ingest run is re-scanned for references that resolved at write time but no longer point at a live target. Dangling pairs are reported so the next iteration can fix them.
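The write-time and session-end checks reduce to set membership tests. The sketch below assumes pages are held as dictionaries with `refs` and `sl_refs` lists and that a dangling pair is a `(page key, missing target)` tuple. Both function names are hypothetical.

```python
def validate_write(refs, visible_keys):
    """Return the referenced page keys that do not exist in the current scope.

    A non-empty result means the write should be rejected before any file changes.
    """
    return [ref for ref in dict.fromkeys(refs) if ref not in visible_keys]


def dangling_pairs(touched_pages, live_pages, live_sources):
    """Re-scan touched pages and report (page, target) pairs with no live target."""
    pairs = []
    for key, page in touched_pages.items():
        for ref in page.get("refs", []):
            if ref not in live_pages:
                pairs.append((key, ref))
        for source in page.get("sl_refs", []):
            if source not in live_sources:
                pairs.append((key, source))
    return pairs
```

For example, if `warehouse.orders` were deleted after the `revenue` page was written, the session-end scan would report `("revenue", "warehouse.orders")` while leaving the still-valid `refs` edge alone.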

Where the pages come from

ktx writes wiki pages from evidence, not free invention. Each input contributes a different kind of page, and accepted edits feed the next ingest as input.

| Evidence | What it produces |
| --- | --- |
| Schema scans | One page per material table, with grain, columns, and known constraints |
| Query history | Pattern pages with `usage` frontmatter for executions, distinct users, runtime percentiles, and error rate |
| dbt manifests | Pages per model, exposure, and test, with `sl_refs` to the matching semantic source |
| MetricFlow, Looker, Metabase | Pages per metric, explore, or saved question, linked back to the source artifact |
| Notion, docs, analyst notes | Pages preserving business definitions, policies, and incident write-ups |
| Agent and analyst edits | First-class input to the next ingest, not a fork |

Provenance stays with the page. Ingested pages keep their raw-source provenance inline as HTML comments, so a reviewer can walk from the prose back to the artifact that produced it.

Agent usage notes

Point an agent at this page when it needs to explain why a wiki search returned the pages it did, why a write was rejected, or how the wiki stays in step with the semantic layer.

| Agent task | Relevant section | Next page |
| --- | --- | --- |
| Explain why two searches return different pages for the same query | What retrieval does | ktx wiki |
| Decide whether to add a `refs` or `sl_refs` entry | The page graph | Writing Context |
| Repair a wiki write rejected for missing references | Keeping the graph live | Writing Context |
| Describe how historic SQL becomes a wiki page | Where the pages come from | Building Context |
| Explain raw-source provenance comments | Where the pages come from | Reviewing Context |