
Building Production RAG Systems: Lessons from 12 Months in the Field

RAG (Retrieval-Augmented Generation) works brilliantly in demos. Here is what breaks in production and how to fix it before your customers notice.

By Ravi Tripathi, CTO & Co-Founder · April 21, 2026 · 12 min read

Why RAG Fails in Production

After shipping RAG-powered features for a dozen clients over the past year, we have seen the same failure modes emerge repeatedly. The good news: every one of them is preventable with the right architecture decisions upfront.

1. Chunking Strategy Matters More Than the Embedding Model

Most engineers spend weeks benchmarking embedding models (ada-002 vs. text-embedding-3-large vs. BGE-M3) while using the default fixed-size 512-token chunking. This is backwards. In our benchmarks, switching from fixed-size to semantic/proposition chunking improved retrieval precision by 31% across all embedding models tested.

Use LangChain's SemanticChunker or the chonkie library for proposition-based chunking. For structured documents (PDFs with tables), combine Unstructured.io with a layout-preserving chunker.
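As a rough illustration, here is a minimal semantic-chunking sketch using LangChain's SemanticChunker with OpenAI embeddings; the breakpoint setting and the input file are placeholders you would tune and supply yourself.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split on semantic breakpoints instead of a fixed token count.
# "percentile" is the library's default breakpoint strategy; tune it for your corpus.
chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    breakpoint_threshold_type="percentile",
)

raw_text = open("product_manual.txt").read()  # placeholder document
chunks = chunker.create_documents([raw_text])
print(f"{len(chunks)} semantic chunks produced")
```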

2. Hybrid Search Is Non-Negotiable

Pure vector search (cosine similarity) misses exact keyword matches, which are critical for product names, version numbers, and technical identifiers. Implement hybrid search: BM25 sparse retrieval plus dense vector search, fused with Reciprocal Rank Fusion (RRF). Postgres can do this by pairing pgvector with its built-in full-text search; Qdrant and Weaviate have offered hybrid search for years.
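The fusion step itself is only a few lines. Here is a self-contained RRF sketch that merges a BM25 ranking with a dense-vector ranking; the document IDs are invented for illustration, and k=60 is the conventional RRF constant.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (best first) into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # higher rank -> larger contribution
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 keyword ranking with a dense-vector ranking
bm25_hits   = ["doc7", "doc2", "doc9"]
vector_hits = ["doc2", "doc4", "doc7"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc2 and doc7 float to the top
```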

3. Metadata Filtering Before Vector Search

Pre-filtering by metadata (date range, document type, product category) reduces the search space dramatically, improving both precision and latency. Store rich metadata at index time — it is cheap storage compared to the latency cost of post-retrieval re-ranking.
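As an example of pre-filtering, here is a sketch against the Qdrant Python client; the collection name, payload fields (doc_type, published_ts), and the dummy query vector are all hypothetical stand-ins, and newer client versions also expose query_points as an alternative entry point.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

client = QdrantClient(url="http://localhost:6333")

# Placeholder: use the real embedding of the user query here.
query_embedding = [0.0] * 3072

# Narrow the candidate set by metadata *before* the vector comparison runs.
hits = client.search(
    collection_name="docs",            # hypothetical collection
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="doc_type", match=MatchValue(value="release_notes")),
            FieldCondition(key="published_ts", range=Range(gte=1_700_000_000)),
        ]
    ),
    limit=20,
)
```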

4. Context Window Budgeting

GPT-4o supports 128K tokens of context, but retrieving 20 chunks × 512 tokens ≈ 10K tokens that cost real money and add latency on every request. Implement a context budget manager: score retrieved chunks with a cross-encoder re-ranker (ms-marco-MiniLM-L-6-v2), keep only the top-k that fit your budget, and always put the most relevant chunk first to counter the lost-in-the-middle phenomenon.
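A budget manager can be as simple as the following sketch using the sentence-transformers CrossEncoder; the token budget and per-chunk token estimate are placeholder numbers, and a production version would count tokens with the model's actual tokenizer.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def budget_context(query, chunks, token_budget=4_000, est_tokens_per_chunk=512):
    """Re-rank retrieved chunks and keep only what fits the context budget."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = [chunk for _, chunk in sorted(zip(scores, chunks),
                                           key=lambda pair: pair[0],
                                           reverse=True)]
    # Most relevant chunk stays first to counter lost-in-the-middle.
    max_chunks = max(1, token_budget // est_tokens_per_chunk)
    return ranked[:max_chunks]
```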

5. Observability Is the Only Way to Improve

Log every query, retrieved context, and generated answer to a tracing system (Langfuse, Arize Phoenix, or LangSmith). Without traces you cannot diagnose whether a bad answer is a retrieval failure or a generation failure — and they require completely different fixes.
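As one example, here is a sketch against the Langfuse v2 Python SDK (the SDK has since evolved, so treat the exact calls as indicative rather than definitive); the function and field names are placeholders.

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()

def log_rag_interaction(query, retrieved_chunks, answer, model="gpt-4o"):
    """Record one query end-to-end so retrieval and generation failures can be told apart."""
    trace = langfuse.trace(name="rag-query", input=query)
    trace.span(
        name="retrieval",
        input=query,
        output=retrieved_chunks,
        metadata={"num_chunks": len(retrieved_chunks)},
    )
    trace.generation(
        name="answer",
        model=model,
        input=query,
        output=answer,
    )
```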

Recommended Production Stack

  • Embedding: text-embedding-3-large (OpenAI) or BGE-M3 (open-source)
  • Vector DB: Qdrant (self-hosted) or Pinecone (managed)
  • Orchestration: LangChain or LlamaIndex
  • Re-ranking: Cohere Rerank v3 or a cross-encoder
  • Tracing: Langfuse (open-source, self-hostable)
  • Evaluation: RAGAS framework for automated RAG quality metrics

Frequently Asked Questions

What is the best embedding model for RAG?

For English-only use cases, OpenAI text-embedding-3-large gives the best out-of-the-box precision. For multilingual or cost-sensitive workloads, BGE-M3 (open-source) is a strong alternative. However, chunking strategy has a larger impact than embedding model choice.

Should I use LangChain or LlamaIndex?

LlamaIndex has better built-in abstractions for document ingestion and RAG evaluation. LangChain has a wider ecosystem and better agent tooling. Many production systems use LlamaIndex for the RAG pipeline and LangChain for agentic orchestration.

How do I evaluate my RAG system?

Use the RAGAS framework, which measures context precision, context recall, faithfulness, and answer relevancy automatically using an LLM-as-judge approach. Run it on a golden evaluation dataset of 50–200 question-answer pairs representative of real user queries.
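For a concrete starting point, here is a minimal sketch using the classic RAGAS evaluate() API (newer RAGAS releases restructure the dataset schema, so check the current docs); the single golden row is invented purely for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision, context_recall, faithfulness, answer_relevancy,
)

# One invented golden row; a real evaluation set would hold 50-200 of these.
golden = Dataset.from_dict({
    "question":     ["Which port does the sync agent listen on?"],
    "answer":       ["The sync agent listens on port 8443 by default."],
    "contexts":     [["The sync agent exposes an HTTPS endpoint on port 8443."]],
    "ground_truth": ["Port 8443."],
})

scores = evaluate(
    golden,
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(scores)
```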