How RAG Works: Give Your AI a Memory
RAG stands for Retrieval-Augmented Generation. Fancy name, simple idea:
Give your AI Google for its brain.
Here's why that matters and how it works.
The Problem: AI Can't Remember Everything
Language models like ChatGPT are trained on internet data with a fixed cutoff date. Ask about your company's internal docs? It has no clue.
Options:
- Retrain the whole model (roughly $1M+, takes months)
- Fine-tune it (roughly $10K+, risks forgetting things it already knew)
- RAG (roughly $10, takes an hour) ← We're doing this
What RAG Does (Simple Explanation)
Instead of cramming everything into the AI's brain during training:
1. Store your documents in a searchable database
2. When a user asks a question, search for relevant docs
3. Give the AI the search results + the question
4. The AI answers using the docs you just handed it
Analogy: Open book test vs. closed book test.
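The whole loop fits in a few lines. Here's a toy sketch: the retriever below just counts shared words (real systems use embeddings, which we'll get to), but the store → search → augment flow is the same as a real pipeline. Everything here is made up for illustration:

```python
import re

# Toy RAG loop: keyword-overlap retrieval instead of embeddings.
# The store -> search -> augment flow is the same in a real pipeline.
DOCS = [
    "Returns accepted within 30 days in original packaging.",
    "Shipping takes 5-7 business days.",
    "Customer support available 9-5 EST.",
]

def retrieve(question, docs, k=1):
    """Score each doc by how many words it shares with the question, keep top k."""
    q_words = set(re.findall(r"\w+", question.lower()))
    return sorted(docs,
                  key=lambda d: len(q_words & set(re.findall(r"\w+", d.lower()))),
                  reverse=True)[:k]

def augment(question, context_docs):
    """Paste the retrieved docs above the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above."

question = "Are returns accepted within 30 days?"
prompt = augment(question, retrieve(question, DOCS))
print(prompt)
```

That `prompt` string is what actually gets sent to the LLM: retrieved facts on top, question below.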
The 3 Steps of RAG
Step 1: Retrieval (Find Relevant Stuff)
User asks: "What's our return policy?"
System searches your docs and finds:
- Returns accepted within 30 days
- Original packaging required
- Refund processed in 5-7 business days
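Under the hood, "relevant" means "closest vector." Here's a toy version of that search with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, but the math is identical):

```python
import numpy as np

# Toy embeddings: 3 docs and a query as small hand-made vectors.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # "returns accepted within 30 days"
    [0.1, 0.9, 0.0],   # "shipping takes 5-7 business days"
    [0.0, 0.1, 0.9],   # "support available 9-5 EST"
])
query_vec = np.array([0.8, 0.2, 0.1])  # "what's the return policy?"

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector lengths."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))  # -> 0, the return-policy doc
```

The query vector points in nearly the same direction as the return-policy vector, so that doc wins.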
Step 2: Augmentation (Add Context to Prompt)
Original prompt:
User: What's our return policy?
Augmented prompt:
Context: [Return policy doc text here]
User: What's our return policy?
Answer using only the context above.
Step 3: Generation (AI Answers with Facts)
AI responds using the context, not making stuff up.
Result: Accurate answer with citations.
Why RAG Beats Fine-Tuning
| Feature | RAG | Fine-Tuning |
|---|---|---|
| Cost | ~$10 | ~$10,000 |
| Time | ~1 hour | ~1 week |
| Update docs | Just add files | Fine-tune again |
| Hallucinations | Reduced (answers cite sources) | Still possible |
| Private data | Stays private | Goes into model |
The Tech Stack (What You Actually Need)
1. Embedding Model
Turns text into numbers (vectors).
Free option: all-MiniLM-L6-v2 (runs fine on CPU)
Better option: bge-large-en-v1.5 (much faster with a GPU)
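What does "text into numbers" actually look like? Here's a toy stand-in that hashes words into 8 buckets, so you can see the shape of the output without downloading a model. Real models like all-MiniLM-L6-v2 output dense 384-dimensional vectors that capture meaning, not just word counts; this only illustrates the idea:

```python
import re, math

DIM = 8  # toy dimension; real embedding models use 384+

def embed(text):
    """Hash each word into one of DIM buckets, count, then normalize to unit length."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, ready for dot-product search

v = embed("returns accepted within 30 days")
```

Every text, long or short, becomes a fixed-length vector. That fixed length is what makes similarity search possible.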
2. Vector Database
Stores embeddings, finds similar ones fast.
Options:
- ChromaDB (easiest, runs locally)
- Pinecone (cloud, free tier: 1GB)
- Weaviate (open source, self-hosted)
- FAISS (Facebook's library, fastest)
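If you're curious what these databases actually do, here's a minimal in-memory sketch: store vectors alongside docs, and return the docs whose vectors score highest on cosine similarity. The real libraries add indexing structures (HNSW, IVF) so this stays fast at millions of vectors; the class name and API here are made up:

```python
import numpy as np

class TinyVectorDB:
    """Brute-force vector store: the core of what ChromaDB/FAISS do, minus the speed tricks."""

    def __init__(self):
        self.vecs, self.docs = [], []

    def add(self, vec, doc):
        self.vecs.append(np.asarray(vec, dtype=float))
        self.docs.append(doc)

    def query(self, vec, k=1):
        """Return the k docs whose vectors have the highest cosine similarity."""
        q = np.asarray(vec, dtype=float)
        sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in self.vecs]
        top = np.argsort(sims)[::-1][:k]
        return [self.docs[i] for i in top]

db = TinyVectorDB()
db.add([1.0, 0.0], "return policy doc")
db.add([0.0, 1.0], "shipping doc")
print(db.query([0.9, 0.1]))  # -> ['return policy doc']
```

Brute force works fine up to a few hundred thousand vectors; past that, you want a real library.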
3. LLM
The actual AI that generates answers.
Options:
- OpenAI API (pay per token; pricing varies by model)
- Local models (free but needs GPU)
- Groq API (fastest, free tier available)
Simple RAG in 50 Lines of Python
```python
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

# 1. Initialize the embedder, vector store, and LLM client
embedder = SentenceTransformer('all-MiniLM-L6-v2')
db = chromadb.Client()
collection = db.create_collection("docs")
client = OpenAI()

# 2. Add documents: embed each one and store it with an id
docs = [
    "Our return policy allows 30-day returns...",
    "Shipping takes 5-7 business days...",
    "Customer support available 9-5 EST..."
]
for i, doc in enumerate(docs):
    embedding = embedder.encode(doc)
    collection.add(
        embeddings=[embedding.tolist()],
        documents=[doc],
        ids=[f"doc_{i}"]
    )

# 3. Query: embed the question and find the 2 closest documents
question = "What's the return policy?"
q_embedding = embedder.encode(question)
results = collection.query(
    query_embeddings=[q_embedding.tolist()],
    n_results=2
)

# 4. Augment the prompt with the retrieved context and generate
context = "\n".join(results['documents'][0])
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
```
That's it. RAG in under 50 lines.
Advanced RAG Techniques (For Later)
Once you master basics:
- Hybrid search (semantic + keyword)
- Reranking (improve search quality)
- Query decomposition (break complex questions into simpler sub-questions)
- Agentic RAG (AI decides when to search)
- Multi-hop (combine info from multiple docs)
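For a taste of the first one, here's a hedged sketch of hybrid search: blend a toy keyword score with a toy semantic score. Production systems use BM25 plus real embeddings, and often fuse rankings with reciprocal rank fusion instead of a weighted blend; the docs and vectors below are made up:

```python
import re
import numpy as np

docs = ["returns accepted within 30 days", "shipping takes 5 business days"]
doc_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])  # pretend embeddings

def keyword_score(q, d):
    """Fraction of the question's words that appear in the doc."""
    qs = set(re.findall(r"\w+", q.lower()))
    ds = set(re.findall(r"\w+", d.lower()))
    return len(qs & ds) / max(len(qs), 1)

def hybrid_rank(q, q_vec, alpha=0.5):
    """Rank docs by a weighted blend of semantic and keyword signals."""
    q_vec = np.asarray(q_vec, dtype=float)
    sem = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    kw = np.array([keyword_score(q, d) for d in docs])
    blended = alpha * sem + (1 - alpha) * kw
    return [docs[i] for i in np.argsort(blended)[::-1]]

print(hybrid_rank("are returns accepted", [0.8, 0.2]))
```

The keyword signal catches exact terms (product codes, names) that embeddings sometimes miss; the semantic signal catches paraphrases the keywords miss.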
Common Mistakes
1. Chunks too big - AI gets confused with 2000-word chunks. Keep it 200-500 words.
2. Bad search - Returns irrelevant docs. Use better embeddings or hybrid search.
3. No citations - Always return source documents. Users want to verify.
4. Ignoring cost - Embedding 1M documents costs $$. Cache embeddings!
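For mistake #1, here's a minimal word-count chunker with overlap, as a sketch: real pipelines often split on sentences or headings instead of raw word counts, and the size/overlap numbers are just starting points:

```python
def chunk(text, size=200, overlap=40):
    """Split text into ~size-word chunks, overlapping so context isn't cut mid-thought."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap  # step forward, keeping `overlap` words of context
    return chunks

parts = chunk("word " * 500, size=200, overlap=40)
```

Embed each chunk separately, and the retriever can hand the LLM just the paragraph it needs instead of a 2000-word wall.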
Why You Should Care
Every company needs RAG for:
- Customer support (search knowledge base)
- Internal docs (find that one Slack message)
- Code search (find similar bugs)
- Legal/compliance (cite regulations accurately)
Market size: analysts project the RAG market growing from roughly $1.2B (2024) to $30B+ (2030).
This skill pays.
Next Week
Part 2: We'll scrape LinkedIn, Twitter, and Medium to collect YOUR writing. Bring your login credentials!
Homework:
- Run the 50-line example above
- Join our Discord
- Star the GitHub repo
Welcome to Science Church. Class is in session.
Series Progress: Part 1 of 6 Complete ✓
Next: Part 2 - Scraping Your Digital Self (Aug 31)
GitHub: Full code + notebooks
Discord: Live Q&A every Sunday