Notes on retrieval quality in the real world

6/2/2025

Most retrieval issues aren’t model problems; they’re index hygiene and query strategy problems.

Index hygiene

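Chunking: make chunks self-contained (300–800 tokens) and prefix each with its title or breadcrumb path.
Metadata: add source, section, date, entities, tags; filtering on metadata buys precision at almost no cost.
Freshness: re-index only what changed; keep old versions but mark them as superseded.
Duplicates: remove exact duplicates; canonicalize URLs/IDs so one document is not indexed under several identifiers.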
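
A minimal sketch of the duplicate and breadcrumb steps above, using only the standard library. The chunk layout (a dict with "text" and "breadcrumb" keys) and all function names are assumptions made for illustration, and the canonicalization rules (lowercase scheme and host, drop the fragment and any trailing slash) are one plausible choice rather than a universal standard.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def content_key(text: str) -> str:
    """Hash whitespace-normalized text so exact duplicates share a key."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_exact(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk seen for each distinct normalized text."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        key = content_key(chunk["text"])
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


def with_breadcrumb(chunk: dict) -> str:
    """Prefix the chunk text with its title path so it reads standalone."""
    return " > ".join(chunk["breadcrumb"]) + "\n" + chunk["text"]
```

This only catches exact duplicates after whitespace normalization; near-duplicates would need a different technique.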

Query strategy

Use hybrid (keyword + dense) retrieval; reciprocal rank fusion (RRF) is a strong default for merging the two ranked lists.
Expand queries with synonyms and key entities; reuse queries that have retrieved good results before.
Apply metadata filters first (product line, geography), then rank by similarity.
Distill long questions to query intents (who/what/when) before retrieval.
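
The filter-then-fuse flow above can be sketched as follows. RRF scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The value k = 60 is a commonly used default, not something these notes prescribe, and the document layout (a dict with a "metadata" key) and function names are hypothetical.

```python
from collections import defaultdict


def filter_candidates(docs: list[dict], **required: str) -> list[dict]:
    """Keep only documents whose metadata matches every required field."""
    return [
        doc
        for doc in docs
        if all(doc.get("metadata", {}).get(f) == v for f, v in required.items())
    ]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # e.g. from a keyword (BM25-style) search
dense_hits = ["b", "d", "a"]    # e.g. from an embedding search
fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
```

RRF uses only rank positions, so the keyword and dense scores do not need to be on a comparable scale.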

Ranking & answerability

Re-rank the top-k results with a cross-encoder or a small LLM judge (coverage + trust).
Compute an answerability score; if it is low, ask a clarifying question instead of answering.
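
One possible shape for the answerability check, as a sketch. These notes do not define the score, so everything here is an assumption: re-ranker scores are taken to lie in [0, 1], the score is the mean of the top few, and the 0.5 threshold is a placeholder to be tuned on real data.

```python
def answerability(rerank_scores: list[float], top_n: int = 3) -> float:
    """Mean of the top_n re-ranker scores; 0.0 when nothing was retrieved."""
    if not rerank_scores:
        return 0.0
    best = sorted(rerank_scores, reverse=True)[:top_n]
    return sum(best) / len(best)


def next_action(rerank_scores: list[float], threshold: float = 0.5) -> str:
    """Return "answer" when evidence looks sufficient, else "clarify"."""
    if answerability(rerank_scores) >= threshold:
        return "answer"
    return "clarify"
```

Averaging the top few scores rather than taking the maximum makes the check less sensitive to a single spuriously high-scoring passage.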

Measuring quality

Maintain a small evaluation set of questions, each mapped to the passages a good result must contain.
Track coverage, noise in top-k, and time to answer (latency + steps).
Watch real-time signals: copy/paste, follow-ups, manual overrides.
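
A sketch of how coverage and noise in top-k could be computed against such an evaluation set. The definitions are assumptions: coverage is the fraction of must-contain passages that appear in the top k, and noise is the fraction of the top k that is not a must-contain passage. Passage IDs and function names are hypothetical.

```python
def coverage_at_k(retrieved: list[str], must_contain: set[str], k: int) -> float:
    """Fraction of must-contain passages found in the top k results."""
    if not must_contain:
        return 1.0
    top = set(retrieved[:k])
    return len(top & must_contain) / len(must_contain)


def noise_at_k(retrieved: list[str], must_contain: set[str], k: int) -> float:
    """Fraction of the top k results that are not must-contain passages."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for pid in top if pid not in must_contain) / len(top)


retrieved = ["p1", "p9", "p2", "p7"]  # ranked passage IDs for one question
must_contain = {"p1", "p2", "p3"}     # labeled passages for that question
coverage = coverage_at_k(retrieved, must_contain, k=3)
noise = noise_at_k(retrieved, must_contain, k=3)
```

Averaging both numbers over the whole evaluation set after each index or query change gives a simple regression check.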

Get the basics right before chasing exotic tricks. Retrieval quality is mostly process.