How RAG Works: Give Your AI a Memory
RAG stands for Retrieval-Augmented Generation. Fancy name, simple idea:
Give your AI Google for its brain.
Here's why that matters and how it works.
The Problem: AI Can't Remember Everything
Language models like ChatGPT are trained on internet data with a fixed cutoff date. Ask about your company's internal docs? It has no clue.
Options:
- Retrain the whole model (roughly $1M+, takes months)
- Fine-tune it (roughly $10K+, risks forgetting things it already knew)
- RAG (roughly $10, takes an hour) ← We're doing this
What RAG Does (Simple Explanation)
Instead of cramming everything into the AI's brain during training:
1. Store your documents in a searchable database
2. When a user asks a question, search for relevant docs
3. Give the AI the search results + the question
4. The AI answers using the docs you just handed it
Analogy: Open book test vs. closed book test.
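The whole loop fits in a few lines. Here's a toy sketch: the retriever below just counts shared words (real systems use embeddings, which we'll get to), but the store → search → augment flow is the same as a real pipeline. Everything here is made up for illustration:

```python
import re

# Toy RAG loop: keyword-overlap retrieval instead of embeddings.
# The store -> search -> augment flow is the same in a real pipeline.
DOCS = [
    "Returns accepted within 30 days in original packaging.",
    "Shipping takes 5-7 business days.",
    "Customer support available 9-5 EST.",
]

def retrieve(question, docs, k=1):
    """Score each doc by how many words it shares with the question, keep top k."""
    q_words = set(re.findall(r"\w+", question.lower()))
    return sorted(docs,
                  key=lambda d: len(q_words & set(re.findall(r"\w+", d.lower()))),
                  reverse=True)[:k]

def augment(question, context_docs):
    """Paste the retrieved docs above the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above."

question = "Are returns accepted within 30 days?"
prompt = augment(question, retrieve(question, DOCS))
print(prompt)
```

That `prompt` string is what actually gets sent to the LLM: retrieved facts on top, question below.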
The 3 Steps of RAG
Step 1: Retrieval (Find Relevant Stuff)
User asks: "What's our return policy?"
System searches your docs and finds:
- Returns accepted within 30 days
- Original packaging required
- Refund processed in 5-7 business days
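Under the hood, "relevant" means "closest vector." Here's a toy version of that search with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions, but the math is identical):

```python
import numpy as np

# Toy embeddings: 3 docs and a query as small hand-made vectors.
doc_vecs = np.array([
    [0.9, 0.1, 0.0],   # "returns accepted within 30 days"
    [0.1, 0.9, 0.0],   # "shipping takes 5-7 business days"
    [0.0, 0.1, 0.9],   # "support available 9-5 EST"
])
query_vec = np.array([0.8, 0.2, 0.1])  # "what's the return policy?"

def cosine(a, b):
    """Cosine similarity: dot product divided by the vector lengths."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))  # -> 0, the return-policy doc
```

The query vector points in nearly the same direction as the return-policy vector, so that doc wins.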
Step 2: Augmentation (Add Context to Prompt)
Original prompt:
User: What's our return policy?
Augmented prompt:
Context: [Return policy doc text here]
User: What's our return policy?
Answer using only the context above.
Step 3: Generation (AI Answers with Facts)
AI responds using the context, not making stuff up.
Result: Accurate answer with citations.
Why RAG Beats Fine-Tuning
| Feature | RAG | Fine-Tuning |
|---|---|---|
| Cost | ~$10 | ~$10,000 |
| Time | ~1 hour | ~1 week |
| Update docs | Just add files | Fine-tune again |
| Hallucinations | Reduced (answers cite sources) | Still possible |
| Private data | Stays private | Goes into model |
The Tech Stack (What You Actually Need)
1. Embedding Model
Turns text into numbers (vectors).
Free option: all-MiniLM-L6-v2 (runs fine on CPU)
Better option: bge-large-en-v1.5 (much faster with a GPU)
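What does "text into numbers" actually look like? Here's a toy stand-in that hashes words into 8 buckets, so you can see the shape of the output without downloading a model. Real models like all-MiniLM-L6-v2 output dense 384-dimensional vectors that capture meaning, not just word counts; this only illustrates the idea:

```python
import re, math

DIM = 8  # toy dimension; real embedding models use 384+

def embed(text):
    """Hash each word into one of DIM buckets, count, then normalize to unit length."""
    vec = [0.0] * DIM
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit length, ready for dot-product search

v = embed("returns accepted within 30 days")
```

Every text, long or short, becomes a fixed-length vector. That fixed length is what makes similarity search possible.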
2. Vector Database
Stores embeddings, finds similar ones fast.
Options:
- ChromaDB (easiest, runs locally)
- Pinecone (cloud, free tier: 1GB)
- Weaviate (open source, self-hosted)
- FAISS (Facebook's library, fastest)
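If you're curious what these databases actually do, here's a minimal in-memory sketch: store vectors alongside docs, and return the docs whose vectors score highest on cosine similarity. The real libraries add indexing structures (HNSW, IVF) so this stays fast at millions of vectors; the class name and API here are made up:

```python
import numpy as np

class TinyVectorDB:
    """Brute-force vector store: the core of what ChromaDB/FAISS do, minus the speed tricks."""

    def __init__(self):
        self.vecs, self.docs = [], []

    def add(self, vec, doc):
        self.vecs.append(np.asarray(vec, dtype=float))
        self.docs.append(doc)

    def query(self, vec, k=1):
        """Return the k docs whose vectors have the highest cosine similarity."""
        q = np.asarray(vec, dtype=float)
        sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in self.vecs]
        top = np.argsort(sims)[::-1][:k]
        return [self.docs[i] for i in top]

db = TinyVectorDB()
db.add([1.0, 0.0], "return policy doc")
db.add([0.0, 1.0], "shipping doc")
print(db.query([0.9, 0.1]))  # -> ['return policy doc']
```

Brute force works fine up to a few hundred thousand vectors; past that, you want a real library.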
3. LLM
The actual AI that generates answers.
Options:
- OpenAI API (pay per token; pricing varies by model)
- Local models (free but needs GPU)
- Groq API (fastest, free tier available)
Simple RAG in 50 Lines of Python
```python
from sentence_transformers import SentenceTransformer
import chromadb
from openai import OpenAI

# 1. Initialize the embedder, vector store, and LLM client
embedder = SentenceTransformer('all-MiniLM-L6-v2')
db = chromadb.Client()
collection = db.create_collection("docs")
client = OpenAI()

# 2. Add documents: embed each one and store it with an id
docs = [
    "Our return policy allows 30-day returns...",
    "Shipping takes 5-7 business days...",
    "Customer support available 9-5 EST..."
]
for i, doc in enumerate(docs):
    embedding = embedder.encode(doc)
    collection.add(
        embeddings=[embedding.tolist()],
        documents=[doc],
        ids=[f"doc_{i}"]
    )

# 3. Query: embed the question and find the 2 closest documents
question = "What's the return policy?"
q_embedding = embedder.encode(question)
results = collection.query(
    query_embeddings=[q_embedding.tolist()],
    n_results=2
)

# 4. Augment the prompt with the retrieved context and generate
context = "\n".join(results['documents'][0])
prompt = f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)
```
That's it. RAG in under 50 lines.
Advanced RAG Techniques (For Later)
Once you master basics:
- Hybrid search (semantic + keyword)
- Reranking (improve search quality)
- Query decomposition (break complex questions into simpler sub-questions)
- Agentic RAG (AI decides when to search)
- Multi-hop (combine info from multiple docs)
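For a taste of the first one, here's a hedged sketch of hybrid search: blend a toy keyword score with a toy semantic score. Production systems use BM25 plus real embeddings, and often fuse rankings with reciprocal rank fusion instead of a weighted blend; the docs and vectors below are made up:

```python
import re
import numpy as np

docs = ["returns accepted within 30 days", "shipping takes 5 business days"]
doc_vecs = np.array([[0.9, 0.1], [0.1, 0.9]])  # pretend embeddings

def keyword_score(q, d):
    """Fraction of the question's words that appear in the doc."""
    qs = set(re.findall(r"\w+", q.lower()))
    ds = set(re.findall(r"\w+", d.lower()))
    return len(qs & ds) / max(len(qs), 1)

def hybrid_rank(q, q_vec, alpha=0.5):
    """Rank docs by a weighted blend of semantic and keyword signals."""
    q_vec = np.asarray(q_vec, dtype=float)
    sem = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    kw = np.array([keyword_score(q, d) for d in docs])
    blended = alpha * sem + (1 - alpha) * kw
    return [docs[i] for i in np.argsort(blended)[::-1]]

print(hybrid_rank("are returns accepted", [0.8, 0.2]))
```

The keyword signal catches exact terms (product codes, names) that embeddings sometimes miss; the semantic signal catches paraphrases the keywords miss.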
Common Mistakes
1. Chunks too big - AI gets confused with 2000-word chunks. Keep it 200-500 words.
2. Bad search - Returns irrelevant docs. Use better embeddings or hybrid search.
3. No citations - Always return source documents. Users want to verify.
4. Ignoring cost - Embedding 1M documents costs $$. Cache embeddings!
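For mistake #1, here's a minimal word-count chunker with overlap, as a sketch: real pipelines often split on sentences or headings instead of raw word counts, and the size/overlap numbers are just starting points:

```python
def chunk(text, size=200, overlap=40):
    """Split text into ~size-word chunks, overlapping so context isn't cut mid-thought."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap  # step forward, keeping `overlap` words of context
    return chunks

parts = chunk("word " * 500, size=200, overlap=40)
```

Embed each chunk separately, and the retriever can hand the LLM just the paragraph it needs instead of a 2000-word wall.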
Why You Should Care
Every company needs RAG for:
- Customer support (search knowledge base)
- Internal docs (find that one Slack message)
- Code search (find similar bugs)
- Legal/compliance (cite regulations accurately)
Market size: analysts project the RAG market growing from roughly $1.2B (2024) to $30B+ (2030).
This skill pays.
Next Week
Part 2: We'll scrape LinkedIn, Twitter, and Medium to collect YOUR writing. Bring your login credentials!
Homework:
- Run the 50-line example above
- Join our Discord
- Star the GitHub repo
Welcome to Science Church. Class is in session.
Series Progress: Part 1 of 6 Complete ✓
Next: Part 2 - Scraping Your Digital Self (Aug 31)
GitHub: Full code + notebooks
Discord: Live Q&A every Sunday