Why RAG Fails in Production
After shipping RAG-powered features for a dozen clients over the past year, we have seen the same failure modes emerge repeatedly. The good news: every one of them is preventable with the right architecture decisions upfront.
1. Chunking Strategy Matters More Than the Embedding Model
Most engineers spend weeks benchmarking embedding models (ada-002 vs. text-embedding-3-large vs. BGE-M3) while using the default fixed-size 512-token chunking. This is backwards. In our benchmarks, switching from fixed-size to semantic/proposition chunking improved retrieval precision by 31% across all embedding models tested.
Use LangChain's SemanticChunker or the chonkie library for semantic or proposition-based chunking. For structured documents (PDFs with tables), combine Unstructured.io with a layout-preserving chunker.
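As a rough sketch, semantic chunking with LangChain's experimental SemanticChunker looks like the following; the embedding model, the percentile breakpoint setting, and the handbook.txt file are illustrative choices, not requirements.

```python
# Sketch: split on semantic breakpoints instead of a fixed token count.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Splits where the embedding distance between adjacent sentences spikes.
chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    breakpoint_threshold_type="percentile",  # alternatives: "standard_deviation", "interquartile"
)

with open("handbook.txt") as f:  # placeholder document
    text = f.read()

docs = chunker.create_documents([text])
for d in docs[:3]:
    print(len(d.page_content), d.page_content[:80])
```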
2. Hybrid Search Is Non-Negotiable
Pure vector search (cosine similarity) misses exact keyword matches, which are critical for product names, version numbers, and technical identifiers. Implement hybrid search: BM25 sparse retrieval plus dense vector search, fused with Reciprocal Rank Fusion (RRF). In Postgres you can combine pgvector with the built-in full-text search to get hybrid retrieval; Qdrant and Weaviate have supported it natively for years.
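RRF itself is only a few lines. The sketch below fuses two ranked lists of document IDs (one from BM25, one from dense retrieval); the example IDs are made up and k=60 is the commonly used constant, not a magic number.

```python
# Minimal Reciprocal Rank Fusion: merge several best-first rankings into one.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score(doc) = sum over rankings of 1 / (k + rank). Higher is better."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc7", "doc2", "doc9"]     # sparse keyword results (hypothetical IDs)
vector_hits = ["doc2", "doc5", "doc7"]   # dense embedding results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc2 and doc7 rise to the top
```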
3. Metadata Filtering Before Vector Search
Pre-filtering by metadata (date range, document type, product category) shrinks the search space dramatically, improving both precision and latency. Store rich metadata at index time: the extra storage is cheap compared to the latency of compensating with post-retrieval filtering and re-ranking.
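Here is a minimal sketch of pre-filtering with Qdrant payload filters; the collection name, field names, and placeholder query vector are assumptions, and the exact call varies a bit between client versions (newer clients expose query_points alongside search).

```python
# Sketch: apply metadata filters *inside* the vector search, not after it.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="docs",          # placeholder collection
    query_vector=[0.1] * 1024,       # your query embedding goes here
    query_filter=Filter(
        must=[
            FieldCondition(key="doc_type", match=MatchValue(value="release_notes")),
            FieldCondition(key="year", range=Range(gte=2023)),
        ]
    ),
    limit=10,
)
```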
4. Context Window Budgeting
GPT-4o supports a 128K-token context window, but retrieving 20 chunks of 512 tokens each already puts roughly 10K tokens of context into every request, and those tokens cost real money. Implement a context budget manager: score retrieved chunks with a cross-encoder re-ranker (e.g. ms-marco-MiniLM-L-6-v2), keep only the top-k that fit your budget, and put the most relevant chunk first to counter the lost-in-the-middle phenomenon.
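A minimal budget manager might look like the sketch below; the 3,000-token budget and the chars/4 token estimate are placeholder assumptions (swap in a real tokenizer such as tiktoken for accurate counts).

```python
# Sketch: re-rank with a cross-encoder, then greedily keep the best chunks
# that fit a fixed token budget, most relevant first.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def budget_context(query: str, chunks: list[str], max_tokens: int = 3000) -> list[str]:
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    kept, used = [], 0
    for chunk in ranked:
        est = len(chunk) // 4  # crude ~4 chars/token estimate
        if used + est > max_tokens:
            continue
        kept.append(chunk)
        used += est
    return kept  # already ordered most-relevant-first for the prompt
```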
5. Observability Is the Only Way to Improve
Log every query, retrieved context, and generated answer to a tracing system (Langfuse, Arize Phoenix, or LangSmith). Without traces you cannot diagnose whether a bad answer is a retrieval failure or a generation failure — and they require completely different fixes.
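Whatever tracing backend you pick, the record you log needs at least the fields below. This sketch is deliberately vendor-neutral (plain JSONL with a hypothetical log_trace helper), not any particular SDK's API; the point is what to capture so you can tell retrieval failures from generation failures later.

```python
# Sketch: one trace record per request, appended to a JSONL file.
import json
import time
import uuid

def log_trace(query: str, retrieved: list[dict], answer: str, path: str = "traces.jsonl") -> None:
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        # doc id + score + text lets you later judge whether the right chunk was even retrieved
        "retrieved": retrieved,
        "answer": answer,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_trace(
    query="What changed in v2.3?",
    retrieved=[{"doc_id": "doc7", "score": 0.83, "text": "v2.3 release notes ..."}],
    answer="Version 2.3 added hybrid search.",
)
```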
Recommended Production Stack
- Embedding: text-embedding-3-large (OpenAI) or BGE-M3 (open-source)
- Vector DB: Qdrant (self-hosted) or Pinecone (managed)
- Orchestration: LangChain or LlamaIndex
- Re-ranking: Cohere Rerank v3 or a cross-encoder
- Tracing: Langfuse (open-source, self-hostable)
- Evaluation: RAGAS framework for automated RAG quality metrics
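As an illustration of the evaluation piece, a minimal Ragas run looks roughly like this; the sample data is invented, the metric names follow the classic Ragas API and may differ in newer releases, and the LLM judge requires model credentials to be configured.

```python
# Sketch: automated RAG quality metrics over a tiny hand-built sample.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = {
    "question": ["What changed in v2.3?"],
    "answer": ["v2.3 added hybrid search."],
    "contexts": [["Release notes: v2.3 introduces hybrid BM25 + vector search."]],
    "ground_truth": ["Version 2.3 introduced hybrid search."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```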

