RAG Architecture: Retrieval-Augmented Generation
The Reader's Dilemma
Dear Marilyn, I've built an LLM application, but it keeps fabricating facts. My users are losing trust. How do I make my AI give accurate, up-to-date information without constantly retraining the model?
Marilyn's Reply
You've discovered the fundamental limitation of LLMs: they only know what they learned during training. The solution isn't to retrain—it's to give your model access to external knowledge at query time. This is called Retrieval-Augmented Generation, and it's transforming how we build AI applications.
The Spark: Understanding RAG
Why RAG?
LLMs have a knowledge cutoff date and can hallucinate facts. RAG solves both problems by retrieving relevant documents before generating responses.
The RAG Pipeline
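The pipeline has four steps: embed the query, retrieve the most relevant chunks, assemble them into a prompt, and generate an answer grounded in that context. A minimal sketch, using a toy word-overlap retriever in place of a real vector store (the function names here are illustrative, not any particular library's API):

```python
# Minimal RAG pipeline sketch. `retrieve` and `build_prompt` are
# placeholders standing in for a vector store and a prompt template.

def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(query, context_docs):
    """Assemble retrieved chunks into a grounded prompt for the LLM."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "RAG retrieves documents before generating an answer.",
    "Embeddings map text to vectors.",
    "Paris is the capital of France.",
]
query = "What does RAG retrieve?"
prompt = build_prompt(query, retrieve(query, docs))
# The prompt (not the model's weights) now carries the knowledge.
```

In production, `retrieve` would embed the query and run a nearest-neighbor search, and the prompt would be sent to the LLM for the final generation step.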
Quick Check
What is the primary purpose of RAG (Retrieval-Augmented Generation)?
Vector Embeddings: The Foundation
RAG relies on vector embeddings—numerical representations of text that capture semantic meaning. Similar concepts have similar vectors.
# Conceptual example
"king" - "man" + "woman" ≈ "queen"
# Vectors capture relationships!
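The analogy above can be checked numerically with cosine similarity. These 3-dimensional vectors are invented for illustration; real embedding models produce hundreds to thousands of dimensions:

```python
# Toy vector arithmetic over hand-made 3-d "embeddings".
# Dimensions (invented): 0 ≈ royalty, 1 ≈ masculinity, 2 ≈ concreteness.
import math

def cosine(a, b):
    """Cosine similarity: how aligned two vectors are, ignoring length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vecs = {
    "king":  [0.9,  0.7, 0.1],
    "man":   [0.1,  0.8, 0.1],
    "woman": [0.1, -0.8, 0.1],
    "queen": [0.9, -0.7, 0.1],
}

# king - man + woman, computed component-wise
analogy = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max(vecs, key=lambda word: cosine(analogy, vecs[word]))
# → "queen": the arithmetic lands closest to the queen vector
```

Retrieval works the same way: the query is embedded, and the chunks whose vectors have the highest cosine similarity to it are returned.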
| Embedding Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | High accuracy retrieval |
| Cohere embed-v3 | 1024 | Multilingual support |
| BGE-large-en | 1024 | Open source, self-hosted |
Quick Check
What do vector embeddings capture about text?
Chunking Strategies
Before embedding, documents must be split into chunks. The chunking strategy significantly impacts retrieval quality.
Fixed-Size Chunking
Split by character/token count. Simple but may break mid-sentence.
chunk_size=512, overlap=50
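A sketch of fixed-size chunking with those parameters, counting characters for simplicity (production systems usually count tokens):

```python
# Fixed-size chunking with overlap. Each chunk starts (chunk_size - overlap)
# characters after the previous one, so adjacent chunks share `overlap`
# characters and context isn't lost at the boundary.

def fixed_size_chunks(text, chunk_size=512, overlap=50):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("0123456789" * 100, chunk_size=512, overlap=50)
# 1000 chars with step 462 → chunks of length 512, 512, 76
```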
Semantic Chunking
Split by meaning boundaries (paragraphs, sections). Preserves context.
Recursive Chunking
Try multiple separators in order: sections → paragraphs → sentences → characters.
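The separator cascade above can be sketched as a recursive splitter: try the coarsest separator first, and only fall back to finer ones (ultimately a hard character split) when a piece is still too large. The separator list here is a simplified stand-in for the full sections → paragraphs → sentences → characters hierarchy:

```python
# Recursive chunking sketch: coarse separators first, finer ones as fallback.

def recursive_chunks(text, max_len=200, separators=("\n\n", "\n", ". ")):
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            out = []
            for piece in text.split(sep):
                out.extend(recursive_chunks(piece, max_len, separators))
            return out
    # No separator left: hard split by characters
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

paras = recursive_chunks("x" * 150 + "\n\n" + "y" * 150, max_len=200)
hard = recursive_chunks("z" * 450, max_len=200)
```

Splitting on paragraph boundaries first means most chunks keep their natural context; the character-level fallback only fires for pathological runs with no structure at all.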
Quick Check
Why is chunk overlap important in RAG systems?
Advanced RAG Patterns
Hybrid Search
Combine dense vector similarity with sparse keyword search (BM25): vectors capture semantic meaning, while keyword matching catches exact terms like product names, IDs, and acronyms that embeddings can miss.
final_score = α × vector_score + (1-α) × keyword_score
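The weighted combination above, assuming both scores are already normalized to [0, 1] (in practice the raw vector and BM25 scores live on different scales and must be normalized first):

```python
# Hybrid scoring per the formula above: alpha weights semantic similarity
# against keyword relevance. alpha=1.0 is pure vector search, 0.0 pure BM25.

def hybrid_score(vector_score, keyword_score, alpha=0.7):
    return alpha * vector_score + (1 - alpha) * keyword_score

# A document with a strong semantic match but a weak keyword match:
score = hybrid_score(vector_score=0.9, keyword_score=0.2, alpha=0.7)  # ≈ 0.69
```

Tuning α per corpus matters: keyword-heavy domains (legal citations, error codes) favor a lower α, while conversational queries favor a higher one.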
Re-ranking
Use a cross-encoder to re-score retrieved documents for better precision.
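A re-ranking sketch. A real cross-encoder scores each (query, document) pair jointly with a transformer; here `cross_encoder_score` is a word-overlap stand-in so the example runs without a model:

```python
# Re-ranking sketch: retrieve broadly, then re-score each candidate
# against the query and keep only the best. `cross_encoder_score` is a
# toy stand-in for a real cross-encoder model.

def cross_encoder_score(query, doc):
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, docs, top_k=2):
    return sorted(docs, key=lambda d: cross_encoder_score(query, d), reverse=True)[:top_k]

candidates = [
    "Embeddings map text to vectors.",
    "RAG retrieves relevant documents before generation.",
    "Chunk overlap preserves context across boundaries.",
]
top = rerank("which documents are relevant", candidates)
```

The usual pattern is a wide first pass (retrieve 50-100 candidates cheaply with vectors) followed by this slower but more precise second pass over just those candidates.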
Query Expansion
Generate multiple query variations to improve recall.
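Production systems typically ask an LLM to paraphrase the query; as a self-contained stand-in, this sketch uses a hand-written synonym table (both the table and the strategy are illustrative assumptions):

```python
# Query expansion sketch: produce variations of the query so retrieval
# can match documents phrased differently. SYNONYMS is a toy stand-in
# for LLM-generated paraphrases.

SYNONYMS = {
    "error": ["exception", "failure"],
    "fix": ["resolve", "repair"],
}

def expand_query(query):
    variations = [query]  # always keep the original
    for word, alternatives in SYNONYMS.items():
        if word in query:
            variations.extend(query.replace(word, alt) for alt in alternatives)
    return variations

queries = expand_query("how to fix a timeout error")
# Each variation is retrieved separately; results are merged and deduplicated.
```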
Quick Check
What is the benefit of hybrid search in RAG?