The retrieval problem nobody talks about

Every RAG tutorial follows the same script: chunk your documents, embed them, stuff them in a vector database, and boom—you've built ChatGPT for your data. Then you try it on real documents and realize you've built an expensive random fact generator.

The problem isn't the tutorial. The problem is that retrieval is fundamentally broken for how humans actually ask questions.

Your documents are lying to you

Real documents aren't Wikipedia articles. They're:

  • PowerPoints where slide 23 contradicts slide 4
  • PDFs where the actual answer is in a footnote that got embedded separately from its context
  • Excel sheets where the column header is "Q3" but you need to know it means "Q3 2019 excluding the Johnson account"
  • Meeting notes where "the project" could mean any of seventeen different initiatives

I once built a RAG system for a law firm. The partners were excited until it confidently cited page 247 of a contract. The contract was 50 pages long. Turns out it was concatenating multiple documents and hallucinating page numbers. But here's the thing: the retrieved chunks were "semantically similar" to the query. The system was working exactly as designed.

Semantic similarity is a terrible proxy for usefulness

Vector search finds things that are similar. Users want things that are relevant. These are not the same.

Ask: "What was our Q3 revenue?" Get: "In Q3, we launched seven new products" (high similarity, zero relevance)

Ask: "Who approved the budget?" Get: "The budget was $2.3M" (similar vectors, wrong answer)

Ask: "What are the contract termination terms?" Get: Five different termination clauses from five different contracts, all equally "similar"

The bitter truth: cosine similarity doesn't understand causation, temporality, or specificity. It just knows that "budget" and "approved" often hang out together in vector space.
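
For the skeptical, here's roughly all the retriever is doing under the hood. This is a minimal sketch, with embed() standing in for whatever embedding model you happen to use; the score is just a normalized dot product and carries no notion of which chunk answers the question.

    # Minimal sketch: cosine similarity as the retriever sees it.
    # embed is a placeholder for your embedding model of choice.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query: str, chunks: list[str], embed) -> list[tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, embed(c)), c) for c in chunks]
        return sorted(scored, key=lambda t: t[0], reverse=True)  # "most similar" first

    # "What was our Q3 revenue?" and "In Q3, we launched seven new products"
    # share most of their vocabulary, so their vectors land close together;
    # nothing in this function knows which one answers the question.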

Why precision beats recall, always

Every RAG tutorial optimizes for recall—did you find all the relevant documents? But in production, precision is what matters—is what you found actually correct?

Users would rather hear "I don't know" than get a confident wrong answer. But vector databases don't do "I don't know." They always return something, and that something always has a similarity score that looks reasonable.
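
If you want an "I don't know," you have to bolt it on yourself. Here's a rough sketch of the usual abstention hack; it takes the scored (score, chunk) pairs from whatever retriever you use, and the threshold is an illustrative number you'd have to tune against your own queries, not a recommendation.

    # Rough sketch: abstain when the best match isn't good enough.
    # THRESHOLD is illustrative; tune it on real queries before trusting it.
    THRESHOLD = 0.75

    def answer_or_abstain(results: list[tuple[float, str]]) -> str:
        # results are (score, chunk) pairs, best first, from whatever retriever you use.
        if not results or results[0][0] < THRESHOLD:
            return "I don't know: no source in the corpus answers this confidently."
        score, chunk = results[0]
        return f"(score={score:.2f}) {chunk}"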

I've seen teams add increasingly complex reranking layers, chain multiple retrievers, implement hybrid search with BM25—all to fix the fundamental problem that semantic search returns semantic neighbors, not answers.
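
For reference, the standard patch looks something like this: run lexical (BM25) and vector retrieval side by side, then fuse the two rankings, most often with reciprocal rank fusion. A sketch of the fusion step; it helps at the margins, but it's still ranking neighbors rather than verifying answers.

    # Sketch of reciprocal rank fusion (RRF): merge several rankings into one.
    # Each ranking is a list of chunk ids, best first; k=60 is the conventional constant.
    from collections import defaultdict

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        scores: dict[str, float] = defaultdict(float)
        for ranking in rankings:
            for rank, chunk_id in enumerate(ranking, start=1):
                scores[chunk_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # fused = rrf([bm25_ranking, vector_ranking])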

The questions users actually ask

Your beautiful RAG system assumes users will ask well-formed questions about topics clearly covered in your documents. Instead, they ask:

  • "What's that thing John mentioned in the meeting about the customer thing?"
  • "How much did we spend on that versus last time?"
  • "Can we do the same thing we did for Acme but for GlobalCorp?"
  • "Why is this different from what you told me yesterday?"

Half these questions require context you don't have. The other half require reasoning across multiple documents in ways that embedding models can't capture. Your retriever is playing checkers while users are playing Calvinball.

What actually works

The RAG systems that survive production all do the same things:

They cite everything. Not because it looks professional, but because users need to verify. When your system says "According to document X, page Y..." users can check. Trust is built on verification, not accuracy scores.
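
In practice that means a chunk never travels without its provenance. A sketch, assuming you kept each chunk's source document and page number at indexing time; the field names here are mine, not a standard.

    # Sketch: keep provenance attached to every chunk so the model can only
    # quote text that already carries its own citation.
    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        doc: str   # e.g. the source filename
        page: int  # the page as it appears in the source, never an invented one

    def build_context(chunks: list[Chunk]) -> str:
        return "\n\n".join(f"[{c.doc}, p.{c.page}]\n{c.text}" for c in chunks)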

They embrace metadata. Dates, authors, document types, version numbers—all the boring stuff that doesn't embed well but determines whether an answer is right. The best RAG systems are 50% vector search and 50% SQL filters.
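
Concretely, that means filtering on the boring fields before (or alongside) the vector search. A sketch with SQLite as the metadata store; the table and column names are illustrative assumptions, not a schema you already have.

    # Sketch: metadata filter first, vector search second.
    # Table and column names (chunks, doc_type, fiscal_quarter) are illustrative.
    import sqlite3

    def candidate_ids(db_path: str, doc_type: str, quarter: str) -> list[str]:
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT chunk_id FROM chunks WHERE doc_type = ? AND fiscal_quarter = ?",
            (doc_type, quarter),
        ).fetchall()
        conn.close()
        return [r[0] for r in rows]

    # Run the vector search only over these ids, so "Q3" from five different
    # years never gets to compete on similarity alone.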

They chunk defensively. Not sliding windows or semantic chunking or any fancy algorithm. They chunk based on how documents are actually structured—keeping headings with their content, preserving list items together, maintaining table integrity. Boring, manual, effective.
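
A structural chunker really is this boring. A sketch for markdown-ish documents; the heading pattern is the assumption, and you'd swap it for whatever structure your documents actually have (slide boundaries, numbered clauses, sheet names).

    # Sketch: split on headings, keeping each heading with the text under it.
    import re

    HEADING = re.compile(r"^#{1,6}\s+\S")  # assumes markdown-style headings

    def chunk_by_heading(text: str) -> list[str]:
        chunks, current = [], []
        for line in text.splitlines():
            if HEADING.match(line) and any(s.strip() for s in current):
                chunks.append("\n".join(current).strip())
                current = []
            current.append(line)
        chunks.append("\n".join(current).strip())
        return [c for c in chunks if c]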

They admit defeat. When confidence is low or multiple conflicting sources exist, they show the user everything and let them decide. It's not elegant, but it's honest.
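
One cheap way to be honest about it: when the top results are too close to call, return all of them instead of pretending the first one won. The margin below is illustrative, not tuned.

    # Sketch: surface every near-tied result rather than a single "winner".
    MARGIN = 0.05  # illustrative; tune against your own queries and scores

    def best_or_everything(results: list[tuple[float, str]]) -> list[tuple[float, str]]:
        # results are (score, chunk) pairs, best first.
        if len(results) < 2 or results[0][0] - results[1][0] > MARGIN:
            return results[:1]
        top = results[0][0]
        return [r for r in results if top - r[0] <= MARGIN]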

The uncomfortable conclusion

RAG isn't broken, but the way we think about it is. We've been trying to build Google for private documents, but what users need is more like a research assistant who's really good at citing sources.

The next time you build a RAG system, forget about embedding models and vector databases for a second. Start with this: if a human had to answer this question using these documents, what would they need to know beyond just "semantic similarity"?

Then build that. Even if it means your cutting-edge AI system is mostly regular expressions and timestamp comparisons.

Especially if it means that.