RAG Architecture: Retrieval-Augmented Generation

~18 min read · 4 quizzes

The Reader's Dilemma

Dear Marilyn,

I've built an LLM application, but it keeps inventing facts. My users are losing trust. How do I make my AI give accurate, up-to-date information without constantly retraining the model?

Marilyn's Reply

You've discovered the fundamental limitation of LLMs: they only know what they learned during training. The solution isn't to retrain—it's to give your model access to external knowledge at query time. This is called Retrieval-Augmented Generation, and it's transforming how we build AI applications.

The Spark: Understanding RAG

Why RAG?

LLMs have a knowledge cutoff date and can hallucinate facts. RAG solves both problems by retrieving relevant documents before generating responses.

The RAG Pipeline

1. Query: User asks a question
2. Retrieve: Find relevant documents from the knowledge base
3. Augment: Add the retrieved context to the prompt
4. Generate: LLM produces a grounded response
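The four steps above can be sketched end to end. This is a toy illustration: `retrieve` ranks documents by naive word overlap where a real system would query a vector store, and the final LLM call is omitted; all names here are made up for the sketch.

```python
# Toy RAG pipeline: word-overlap retrieval stands in for a vector
# store, and the final LLM call is left as a comment.

def retrieve(query, knowledge_base, k=2):
    """Rank documents by how many query words they share (a toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(query, docs):
    """Build a grounded prompt by prepending the retrieved context."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [
    "RAG retrieves documents before generation.",
    "Embeddings map text to vectors.",
    "Paris is the capital of France.",
]
query = "What does RAG retrieve?"   # 1. Query
docs = retrieve(query, kb)          # 2. Retrieve
prompt = augment(query, docs)       # 3. Augment
# 4. Generate: the prompt would now be sent to the LLM.
print(prompt)
```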

Quick Check

What is the primary purpose of RAG (Retrieval-Augmented Generation)?

Vector Embeddings: The Foundation

RAG relies on vector embeddings—numerical representations of text that capture semantic meaning. Similar concepts have similar vectors.

# Conceptual example
"king" - "man" + "woman" ≈ "queen"
# Vectors capture relationships!
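Similarity between embeddings is typically measured with cosine similarity. A minimal sketch using hand-made 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, as the table below shows):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.82, 0.15]
car = [0.1, 0.2, 0.95]

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # low: unrelated concepts
```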

Embedding Model               | Dimensions | Best For
OpenAI text-embedding-3-large | 3072       | High-accuracy retrieval
Cohere embed-v3               | 1024       | Multilingual support
BGE-large-en                  | 1024       | Open source, self-hosted

Quick Check

What do vector embeddings capture about text?

Chunking Strategies

Before embedding, documents must be split into chunks. The chunking strategy significantly impacts retrieval quality.

Fixed-Size Chunking

Split by character/token count. Simple but may break mid-sentence.

chunk_size=512, overlap=50
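A fixed-size chunker with overlap is only a few lines. This sketch splits by characters; token-based splitting works the same way, counting tokens instead. The small numbers in the demo are chosen so the overlap is easy to see.

```python
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    """Slide a window of chunk_size characters, stepping chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 100 characters of distinct digits so the shared overlap is visible.
doc = "".join(str(i % 10) for i in range(100))
chunks = fixed_size_chunks(doc, chunk_size=40, overlap=10)
print(len(chunks))
```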

Semantic Chunking

Split by meaning boundaries (paragraphs, sections). Preserves context.

Recursive Chunking

Try multiple separators in order: sections → paragraphs → sentences → characters.
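The recursive strategy can be sketched as follows; the separator order and the maximum length are illustrative defaults, not fixed values.

```python
def recursive_split(text, max_len=100, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer ones
    on any piece that is still longer than max_len."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    out = []
    for piece in text.split(separators[0]):
        out.extend(recursive_split(piece, max_len, separators[1:]))
    return [c for c in out if c.strip()]  # drop empty leftovers

doc = "Short first paragraph.\n\n" + "A long sentence repeats here. " * 10
chunks = recursive_split(doc)
print(len(chunks))
```

Short paragraphs survive intact, while the long paragraph falls through to sentence-level splits.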

Quick Check

Why is chunk overlap important in RAG systems?

Advanced RAG Patterns

Hybrid Search

Combine vector similarity with keyword search (BM25) for better results.

final_score = α × vector_score + (1-α) × keyword_score
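Because vector and BM25 scores live on different scales, they are usually normalized before blending. A sketch of the weighted combination above, with min-max normalization and α = 0.7 as an example weight:

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the two score types are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_scores(vector_scores, keyword_scores, alpha=0.7):
    """final = alpha * vector + (1 - alpha) * keyword, after normalization."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    return [alpha * vs + (1 - alpha) * ks for vs, ks in zip(v, k)]

# Three candidate documents with raw scores from each retriever.
scores = hybrid_scores([0.9, 0.2, 0.5], [1.0, 8.0, 3.0])
print(scores)
```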

Re-ranking

Use a cross-encoder to re-score retrieved documents for better precision.
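A sketch of the two-stage pattern; the scoring function below is a toy stand-in for a real cross-encoder, which would jointly encode the query and document with a transformer.

```python
def cross_encoder_score(query, doc):
    """Toy relevance score: fraction of query words found in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def rerank(query, candidates, top_n=2):
    """Re-order first-pass candidates by the (slower) scorer, keep top_n."""
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_n]

candidates = [
    "embeddings map text to vectors",
    "rag retrieves relevant documents",
    "paris is in france",
]
best = rerank("which documents are relevant", candidates)
print(best)
```

The pattern matters more than the scorer: a cheap retriever narrows millions of documents to dozens, then an expensive model ranks only those dozens.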

Query Expansion

Generate multiple query variations to improve recall.
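In practice the variations are generated by an LLM ("rephrase this question three ways"); the hand-written synonym table below is a stand-in for that step.

```python
# Stand-in for LLM-generated paraphrases.
SYNONYMS = {"car": ["automobile", "vehicle"], "fix": ["repair"]}

def expand_query(query):
    """Return the original query plus one variant per known synonym."""
    variants = [query]
    for word, alternatives in SYNONYMS.items():
        if word in query.lower().split():
            variants += [query.lower().replace(word, alt) for alt in alternatives]
    return variants

variants = expand_query("How do I fix my car")
print(variants)
```

Each variant is then sent to the retriever and the result sets are merged, so documents phrased differently from the original query can still be found.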

Quick Check

What is the benefit of hybrid search in RAG?