RAG Architecture: Building Retrieval-Augmented Generation Systems
A comprehensive guide to building production-ready RAG pipelines, covering chunking strategies, embedding models, vector stores, reranking, and evaluation metrics.
Large language models are powerful, but they hallucinate, grow stale, and lack access to your proprietary data. Retrieval-Augmented Generation (RAG) addresses all three problems by grounding LLM responses in relevant documents fetched at query time. Yet the gap between a demo RAG app and a production system is enormous. This guide walks through each stage of the RAG pipeline, from document ingestion to deployment, with the architectural decisions that separate reliable systems from fragile prototypes.
Understanding the RAG Pipeline
A RAG system has two core phases: an offline indexing phase and an online retrieval-generation phase.
During indexing, source documents are loaded, split into chunks, converted to vector embeddings, and stored in a vector database. During retrieval-generation, a user query is embedded, the most relevant chunks are retrieved, optionally reranked, and then passed as context to the LLM alongside the query.
A simplified pipeline: load documents → chunk → embed → store (offline), then embed query → retrieve → rerank → generate (online).
Each stage introduces design choices that compound in their effect on output quality. A weak chunking strategy will undermine even the best embedding model, and a poor prompt template will waste perfectly retrieved context.
Chunking Strategies That Actually Work
Chunking is where most RAG implementations go wrong first. The goal is to produce self-contained units of meaning that are small enough to be precise but large enough to retain context.
Fixed-size chunking splits text into segments of a set token count (typically 256 to 512 tokens) with overlap (50 to 100 tokens). It is simple and predictable but ignores document structure entirely.
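A minimal sketch of fixed-size chunking with overlap. Whitespace-separated words stand in for real tokens here; a production system would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    """Split text into chunks of `chunk_size` tokens, with `overlap`
    tokens shared between consecutive chunks. Whitespace splitting
    stands in for a real tokenizer."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk.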
Recursive character splitting attempts to split on paragraph boundaries first, then sentences, then words. Libraries like LangChain provide this out of the box. It produces more coherent chunks than fixed-size splitting with minimal extra complexity.
Semantic chunking uses an embedding model to detect topic shifts within a document. When the cosine similarity between successive sentences drops below a threshold, a new chunk boundary is created. This yields high-quality boundaries but adds computational cost during indexing.
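The boundary-detection logic can be sketched in a few lines. `embed` here is any callable mapping a sentence to a vector (a real system would use the embedding model from the indexing pipeline); the threshold value is illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences into chunks; start a new chunk when
    the cosine similarity between successive sentence embeddings
    drops below `threshold` (a topic shift)."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_vec = vec
    chunks.append(" ".join(current))
    return chunks
```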
Document-aware chunking leverages the native structure of the source format. Markdown headers, HTML sections, PDF layout analysis, or code function boundaries become natural split points. For structured content, this consistently outperforms generic strategies.
Practical recommendations:
- Start with recursive splitting at 512 tokens with 50-token overlap
- Add document-aware splitting for structured sources (API docs, legal contracts, codebases)
- Prepend metadata (document title, section header) to each chunk so the LLM has context about where the information originates
- Store the parent document ID with each chunk to enable parent-document retrieval when a single chunk is insufficient
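The last two recommendations can be captured in the chunk record itself. This dataclass is an illustrative shape, not any particular library's schema: the parent ID enables parent-document retrieval, and the metadata-prefixed text is what actually gets embedded and shown to the LLM.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str         # raw chunk text from the splitter
    parent_id: str    # source document ID, for parent-document retrieval
    doc_title: str
    section: str

    @property
    def embedding_text(self) -> str:
        """Text to embed and to place in the LLM context: metadata is
        prepended so the model knows where the information originates."""
        return f"{self.doc_title} > {self.section}\n\n{self.text}"
```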
Choosing Embedding Models and Vector Stores
The embedding model converts text into dense vectors that capture semantic meaning. Your choice here directly determines retrieval quality.
Proprietary models like OpenAI's text-embedding-3-large (3072 dimensions) and Cohere's embed-v4 offer strong out-of-the-box performance with minimal setup. They are ideal when latency and operational simplicity matter more than cost at scale.
Open-source models like bge-large-en-v1.5, GTE-large, and E5-mistral-7b-instruct run on your own infrastructure. They eliminate per-token API costs and keep data on-premises, which matters for regulated industries.
When selecting a model, evaluate on the MTEB benchmark for your target domain and language. A model that excels on general English retrieval may underperform on technical or multilingual content.
For vector stores, the landscape has matured significantly:
| Store | Best For | Key Feature |
|---|---|---|
| Pinecone | Managed simplicity | Serverless tier, automatic scaling |
| Weaviate | Hybrid search | Native BM25 + vector fusion |
| Qdrant | Performance-critical | Rust-based, quantization support |
| pgvector | Existing Postgres users | No new infrastructure needed |
| Chroma | Prototyping | Embedded mode, zero config |
For most production systems, the decision comes down to whether you want a managed service (Pinecone, Weaviate Cloud) or self-hosted control (Qdrant, pgvector). If you are already running Postgres, pgvector with HNSW indexing is a pragmatic starting point that avoids adding new infrastructure.
Reranking: The Retrieval Quality Multiplier
Initial vector search retrieves candidates based on embedding similarity, but embedding models compress an entire passage into a single vector. A cross-encoder reranker reads the query and each candidate passage together, producing a much more accurate relevance score.
The typical pattern retrieves a broad set (top 20 to 50 results) from the vector store, then reranks to select the final top 3 to 5 passages for the LLM context window.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

# query = the user's question (string)
# candidates = list of (chunk_text, metadata) tuples from vector search
pairs = [(query, chunk_text) for chunk_text, _metadata in candidates]
scores = reranker.predict(pairs)

# Sort by reranker score and keep the top 5 for the LLM context
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```

Cohere Rerank and Jina Reranker offer hosted API alternatives that avoid running GPU infrastructure for the cross-encoder. In published benchmarks, adding a reranker typically improves answer accuracy by 10 to 25 percent over vector search alone, making it one of the highest-leverage improvements available.
Evaluating RAG Systems Rigorously
You cannot improve what you cannot measure. RAG evaluation requires metrics at both the retrieval and generation stages.
Retrieval metrics assess whether the right documents were found:
- Recall@k: What fraction of relevant documents appear in the top k results?
- Mean Reciprocal Rank (MRR): How high does the first relevant document rank?
- Normalized Discounted Cumulative Gain (nDCG): Do higher-ranked results have higher relevance?
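The first two retrieval metrics are simple enough to implement directly against your annotated evaluation set. A minimal sketch, where retrieved results and relevant documents are identified by ID:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved IDs."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(queries):
    """Average of 1/rank of the first relevant hit, over
    (retrieved_ids, relevant_ids) pairs; 0 if nothing relevant is found."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```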
Generation metrics assess whether the LLM answer is correct and grounded:
- Faithfulness: Does the answer only contain claims supported by the retrieved context? Tools like RAGAS and DeepEval automate this check using an LLM-as-judge approach.
- Answer Relevance: Does the answer actually address the user's question?
- Context Precision: What proportion of the retrieved context is actually relevant to answering the question?
Build an evaluation dataset of at least 50 to 100 question-answer pairs with annotated relevant documents. Run this suite on every pipeline change. Without this discipline, you are tuning parameters in the dark.
Production Deployment Patterns
Moving from notebook to production introduces challenges around latency, cost, freshness, and reliability.
Caching is the simplest optimization. Cache embeddings for repeated queries and cache LLM responses for identical query-context pairs. A semantic cache that matches queries above a similarity threshold (rather than exact match) further increases hit rates.
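A semantic cache is a small amount of code on top of an embedding function. This sketch does a linear scan over cached entries (fine at small scale; a real deployment would use the vector store itself for the lookup), and the threshold value is illustrative:

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache responses keyed by query embedding; a lookup hits when any
    cached query embedding is within `threshold` cosine similarity."""
    def __init__(self, embed, threshold=0.95):
        self.embed = embed        # callable: str -> vector
        self.threshold = threshold
        self.entries = []         # list of (embedding, response)

    def get(self, query):
        q = self.embed(query)
        for vec, response in self.entries:
            if _cosine(q, vec) >= self.threshold:
                return response
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```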
Hybrid search combines dense vector search with sparse keyword search (BM25). This is critical for queries containing proper nouns, product codes, or technical terms that embedding models may not handle well. Weaviate and Elasticsearch support this natively; for other stores, run both searches and fuse results with Reciprocal Rank Fusion.
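Reciprocal Rank Fusion itself is a few lines: each document scores the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 being the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs (e.g. dense and BM25 results)
    into one list, ordered by summed reciprocal-rank score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists dominate, while a document found by only one retriever can still surface if it ranks well there.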
Streaming responses significantly improve perceived latency. Stream the LLM output token by token while displaying source citations alongside the response.
Document freshness requires an incremental indexing pipeline. Rather than rebuilding the entire index on every update, track document change timestamps and re-embed only modified or new chunks. Tools like LlamaIndex and Unstructured provide connectors for common data sources (Confluence, Google Drive, S3) with change detection.
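The change-detection step reduces to comparing source timestamps against a record of when each document was last embedded. A sketch, assuming both sides are available as doc-ID-to-timestamp mappings (the shape is illustrative; real connectors expose change metadata differently):

```python
def docs_to_reindex(source_docs, indexed_at):
    """Return IDs of documents that are new or modified since last indexing.
    `source_docs` maps doc_id -> last-modified timestamp in the source;
    `indexed_at` maps doc_id -> timestamp when it was last embedded."""
    stale = []
    for doc_id, modified in source_docs.items():
        if doc_id not in indexed_at or modified > indexed_at[doc_id]:
            stale.append(doc_id)
    return stale
```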
Guardrails prevent the system from generating harmful or off-topic responses. Implement input classification to detect prompt injection attempts, output validation to ensure responses cite retrieved sources, and fallback behavior when retrieval confidence is low.
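The low-confidence fallback is the easiest guardrail to wire in. A minimal sketch, where `retrieve` and `generate` are stand-ins for your retrieval and LLM calls and the score threshold is illustrative:

```python
def answer_with_fallback(query, retrieve, generate, min_score=0.3):
    """Only call the LLM when retrieval looks confident; otherwise return
    a safe fallback rather than risk an ungrounded answer.
    `retrieve` returns (chunks, scores); `generate` takes (query, chunks)."""
    chunks, scores = retrieve(query)
    if not chunks or max(scores) < min_score:
        return "I couldn't find relevant information to answer that question."
    return generate(query, chunks)
```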