RAG Architecture: Retrieval-Augmented Generation

~18 min read · 4 quizzes

The Reader's Dilemma

Dear Marilyn,

I've built an LLM application, but it keeps inventing facts. My users are losing trust. How do I make my AI give accurate, up-to-date information without constantly retraining the model?

Marilyn's Reply

You've discovered the fundamental limitation of LLMs: they only know what they learned during training. The solution isn't to retrain—it's to give your model access to external knowledge at query time. This is called Retrieval-Augmented Generation, and it's transforming how we build AI applications.

The Spark: Understanding RAG

Why RAG?

LLMs have a knowledge cutoff date and can hallucinate facts. RAG solves both problems by retrieving relevant documents before generating responses.

The RAG Pipeline

1. Query: User asks a question
2. Retrieve: Find relevant documents from the knowledge base
3. Augment: Add the retrieved context to the prompt
4. Generate: LLM produces a grounded response
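The four steps above can be sketched end to end. This is a toy illustration: `retrieve` ranks documents by naive word overlap where a real system would query a vector store, and the final LLM call is omitted; all names here are made up for the sketch.

```python
# Toy RAG pipeline: word-overlap retrieval stands in for a vector
# store, and the final LLM call is left as a comment.

def retrieve(query, knowledge_base, k=2):
    """Rank documents by how many query words they share (a toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(query, docs):
    """Build a grounded prompt by prepending the retrieved context."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [
    "RAG retrieves documents before generation.",
    "Embeddings map text to vectors.",
    "Paris is the capital of France.",
]
query = "What does RAG retrieve?"   # 1. Query
docs = retrieve(query, kb)          # 2. Retrieve
prompt = augment(query, docs)       # 3. Augment
# 4. Generate: the prompt would now be sent to the LLM.
print(prompt)
```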

Quick Check

What is the primary purpose of RAG (Retrieval-Augmented Generation)?

Vector Embeddings: The Foundation

RAG relies on vector embeddings—numerical representations of text that capture semantic meaning. Similar concepts have similar vectors.

# Conceptual example
"king" - "man" + "woman" ≈ "queen"
# Vectors capture relationships!
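Similarity between embeddings is typically measured with cosine similarity. A minimal sketch using hand-made 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, as the table below shows):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.82, 0.15]
car = [0.1, 0.2, 0.95]

print(cosine_similarity(cat, kitten))  # high: related concepts
print(cosine_similarity(cat, car))     # low: unrelated concepts
```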

Embedding Model               | Dimensions | Best For
OpenAI text-embedding-3-large | 3072       | High-accuracy retrieval
Cohere embed-v3               | 1024       | Multilingual support
BGE-large-en                  | 1024       | Open source, self-hosted

Quick Check

What do vector embeddings capture about text?

Chunking Strategies

Before embedding, documents must be split into chunks. The chunking strategy significantly impacts retrieval quality.

Fixed-Size Chunking

Split by character/token count. Simple but may break mid-sentence.

chunk_size=512, overlap=50
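A fixed-size chunker with overlap is only a few lines. This sketch splits by characters; token-based splitting works the same way, counting tokens instead. The small numbers in the demo are chosen so the overlap is easy to see.

```python
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    """Slide a window of chunk_size characters, stepping chunk_size - overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 100 characters of distinct digits so the shared overlap is visible.
doc = "".join(str(i % 10) for i in range(100))
chunks = fixed_size_chunks(doc, chunk_size=40, overlap=10)
print(len(chunks))
```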

Semantic Chunking

Split by meaning boundaries (paragraphs, sections). Preserves context.

Recursive Chunking

Try multiple separators in order: sections → paragraphs → sentences → characters.
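The recursive strategy can be sketched as follows; the separator order and the maximum length are illustrative defaults, not fixed values.

```python
def recursive_split(text, max_len=100, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing with finer ones
    on any piece that is still longer than max_len."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    out = []
    for piece in text.split(separators[0]):
        out.extend(recursive_split(piece, max_len, separators[1:]))
    return [c for c in out if c.strip()]  # drop empty leftovers

doc = "Short first paragraph.\n\n" + "A long sentence repeats here. " * 10
chunks = recursive_split(doc)
print(len(chunks))
```

Short paragraphs survive intact, while the long paragraph falls through to sentence-level splits.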

Quick Check

Why is chunk overlap important in RAG systems?

Advanced RAG Patterns

Hybrid Search

Combine vector similarity with keyword search (BM25) for better results.

final_score = α × vector_score + (1-α) × keyword_score
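Because vector and BM25 scores live on different scales, they are usually normalized before blending. A sketch of the weighted combination above, with min-max normalization and α = 0.7 as an example weight:

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the two score types are comparable."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def hybrid_scores(vector_scores, keyword_scores, alpha=0.7):
    """final = alpha * vector + (1 - alpha) * keyword, after normalization."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    return [alpha * vs + (1 - alpha) * ks for vs, ks in zip(v, k)]

# Three candidate documents with raw scores from each retriever.
scores = hybrid_scores([0.9, 0.2, 0.5], [1.0, 8.0, 3.0])
print(scores)
```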

Re-ranking

Use a cross-encoder to re-score retrieved documents for better precision.
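A sketch of the two-stage pattern; the scoring function below is a toy stand-in for a real cross-encoder, which would jointly encode the query and document with a transformer.

```python
def cross_encoder_score(query, doc):
    """Toy relevance score: fraction of query words found in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

def rerank(query, candidates, top_n=2):
    """Re-order first-pass candidates by the (slower) scorer, keep top_n."""
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_n]

candidates = [
    "embeddings map text to vectors",
    "rag retrieves relevant documents",
    "paris is in france",
]
best = rerank("which documents are relevant", candidates)
print(best)
```

The pattern matters more than the scorer: a cheap retriever narrows millions of documents to dozens, then an expensive model ranks only those dozens.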

Query Expansion

Generate multiple query variations to improve recall.
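In practice the variations are generated by an LLM ("rephrase this question three ways"); the hand-written synonym table below is a stand-in for that step.

```python
# Stand-in for LLM-generated paraphrases.
SYNONYMS = {"car": ["automobile", "vehicle"], "fix": ["repair"]}

def expand_query(query):
    """Return the original query plus one variant per known synonym."""
    variants = [query]
    for word, alternatives in SYNONYMS.items():
        if word in query.lower().split():
            variants += [query.lower().replace(word, alt) for alt in alternatives]
    return variants

variants = expand_query("How do I fix my car")
print(variants)
```

Each variant is then sent to the retriever and the result sets are merged, so documents phrased differently from the original query can still be found.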

Quick Check

What is the benefit of hybrid search in RAG?