Notes on retrieval quality in the real world
6/2/2025
Most retrieval issues aren't model problems; they're index hygiene and query strategy problems.
Index hygiene
- Chunking: self-contained chunks (300–800 tokens) with titles/breadcrumbs.
- Metadata: add source, section, date, entities, tags; filtering on these fields improves precision at almost no cost.
- Freshness: re-index deltas; keep old versions but mark them as superseded.
- Duplicates: remove exact duplicates; canonicalize URLs/IDs.
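A minimal sketch of the duplicates step, assuming each chunk is a dict with hypothetical `text` and `source` keys. The normalization rules (lowercasing, dropping the query string and fragment) are illustrative assumptions; some sites encode document identity in the query string, so adjust before using anything like this.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonicalize_url(url: str) -> str:
    """Lowercase scheme and host, drop query string and fragment, trim trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Assumption: query strings carry only tracking parameters and can be dropped.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def content_key(text: str) -> str:
    """Hash of whitespace-normalized, lowercased text; equal keys are treated as duplicates."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list[dict]) -> list[dict]:
    """Keep the first chunk seen for each content key and canonicalize its source URL."""
    seen: set[str] = set()
    kept: list[dict] = []
    for chunk in chunks:
        key = content_key(chunk["text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append({**chunk, "source": canonicalize_url(chunk["source"])})
    return kept
```

This only catches exact matches after normalization; near-duplicates would need something like shingling or embedding similarity, which is out of scope here.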
Query strategy
- Use hybrid (keyword + dense) retrieval; reciprocal rank fusion (RRF) is a strong default for merging the two result lists.
- Expand queries with synonyms and key entities; reuse “successful queries.”
- Apply filters first (product line, geography), then similarity.
- Distill long questions to query intents (who/what/when) before retrieval.
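For reference, RRF itself is only a few lines. The sketch below assumes each retriever returns an ordered list of document IDs (best first) and uses the commonly cited constant k = 60; the function name and the tie-breaking by ID are illustrative choices, not part of any particular library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]  # e.g. from a BM25 index
dense_hits = ["b", "d", "a"]    # e.g. from a vector index
fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
```

Because RRF works on ranks rather than raw scores, the keyword and dense scores do not need to be calibrated against each other.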
Ranking & answerability
- Re-rank top-k with a cross-encoder or a small LLM judge (coverage + trust).
- Compute an answerability score; if low, ask a clarifying question.
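One possible shape for the answerability gate, as a sketch only. It assumes the re-ranker emits a relevance score in [0, 1] per passage and defines answerability as the mean of the top few scores; both that definition and the 0.5 threshold are placeholders to be tuned against real traffic.

```python
from dataclasses import dataclass


@dataclass
class ScoredPassage:
    passage_id: str
    score: float  # assumed: re-ranker relevance in [0, 1]


def answerability(passages: list[ScoredPassage], top_n: int = 3) -> float:
    """Mean of the top_n re-ranker scores; 0.0 when nothing was retrieved."""
    best = sorted((p.score for p in passages), reverse=True)[:top_n]
    return sum(best) / len(best) if best else 0.0


def next_action(passages: list[ScoredPassage], threshold: float = 0.5) -> str:
    """Answer when the evidence looks strong enough; otherwise ask the user to clarify."""
    if answerability(passages) >= threshold:
        return "answer"
    return "ask_clarifying_question"
```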
Measuring quality
- Maintain a small evaluation set that maps each question to the passages a good result must contain.
- Track coverage, noise in top-k, and time to answer (latency + steps).
- Watch real-time signals: copy/paste, follow-ups, manual overrides.
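The offline part of this can be sketched as a tiny harness. It assumes the evaluation set is a dict from question to a set of required passage IDs, and that `search` is whatever callable wraps the retriever (hypothetical here). "Coverage" is taken to mean the share of required passages found in the top-k, and "noise" the share of top-k results that are not required; these are working definitions, not standard ones.

```python
from typing import Callable


def coverage_at_k(retrieved: list[str], required: set[str], k: int) -> float:
    """Fraction of required passages that appear in the top-k results."""
    if not required:
        return 1.0
    return len(set(retrieved[:k]) & required) / len(required)


def noise_at_k(retrieved: list[str], required: set[str], k: int) -> float:
    """Fraction of the top-k results that are not required passages."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for pid in top if pid not in required) / len(top)


def evaluate(
    eval_set: dict[str, set[str]],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average coverage and noise over every question in the evaluation set."""
    if not eval_set:
        return {"coverage": 0.0, "noise": 0.0}
    coverages, noises = [], []
    for question, required in eval_set.items():
        retrieved = search(question)
        coverages.append(coverage_at_k(retrieved, required, k))
        noises.append(noise_at_k(retrieved, required, k))
    n = len(eval_set)
    return {"coverage": sum(coverages) / n, "noise": sum(noises) / n}
```

Running this on every index or query-strategy change gives a quick regression check before the real-time signals come in.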
Get the basics right before chasing exotic tricks. Retrieval quality is mostly process.