Build an LLM Twin: Part 3 - Vector Embeddings Made Simple
You have your writing collected (Part 2). Now we turn it into numbers AI can understand.
What Are Embeddings?
Simple version: Math that captures meaning.
"King" and "Queen" are similar → their embedding vectors are close. "King" and "Banana" are different → vectors are far apart.
Technically: Fixed-length arrays of floating-point numbers (usually 384-1536 dimensions, depending on the model).
"Hello world" → [0.23, -0.45, 0.67, ..., 0.12] # 384 numbers
Similar meanings = similar numbers.
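"Close" is usually measured with cosine similarity, which compares the direction of two vectors. Here is a minimal sketch using made-up 3-dimensional vectors. The numbers are invented for illustration and are not real model output; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only (not produced by a model)
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.9, 0.15]
banana = [0.1, 0.05, 0.95]

print(cosine_similarity(king, queen))   # close to 1: the vectors point the same way
print(cosine_similarity(king, banana))  # much lower: the vectors point in different directions
```

A score near 1 means the two texts are treated as similar; a score near 0 means they are unrelated.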
Why This Matters
Your AI twin needs to know:
- "How do I..." and "What's the process for..." mean similar things
- Your formal LinkedIn posts vs casual tweets are both "you"
- Technical writing vs storytelling have different embeddings
Embeddings capture meaning first, and pick up some signal about style and tone along the way.
Create Embeddings (15 Lines)
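import json
import os

from sentence_transformers import SentenceTransformer

# Load model (downloads ~100MB first time)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load your writing
with open('data/raw/combined.json', encoding='utf-8') as f:
    writings = json.load(f)

# Create embeddings in one batched call (faster than encoding items one by one)
texts = [item['text'] for item in writings]
embeddings = model.encode(texts)
for item, embedding in zip(writings, embeddings):
    item['embedding'] = embedding.tolist()  # NumPy array -> plain list for JSON

# Save (create the output directory if it does not exist yet)
os.makedirs('data/processed', exist_ok=True)
with open('data/processed/embeddings.json', 'w', encoding='utf-8') as f:
    json.dump(writings, f)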
That's it. Every item now carries a vector, which is what makes your writing searchable by meaning.
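One caveat: all-MiniLM-L6-v2 truncates input longer than 256 word pieces, so a long post is embedded from its opening only. One possible workaround is to split long texts into overlapping chunks before encoding. The sketch below is an assumption about how you might do that, not part of the original pipeline: the helper name `chunk_words` and the sizes (150 words per chunk, 30 words of overlap) are illustrative choices.

```python
def chunk_words(text, size=150, overlap=30):
    """Split text into overlapping chunks of at most `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

# Stand-in text of 400 "words" so the sketch runs without any data files
sample = " ".join(f"w{i}" for i in range(400))
pieces = chunk_words(sample)
print(len(pieces), [len(p.split()) for p in pieces])
```

Each chunk would then be passed to `model.encode` and stored as its own record, so a search can land on the relevant part of a long post.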
Store in ChromaDB (Vector Database)
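import chromadb

# In-memory client: the collection is gone when the process exits
client = chromadb.Client()
# get_or_create_collection avoids an error if this code is run twice in one session
collection = client.get_or_create_collection("my_writing")

for i, item in enumerate(writings):
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[item['embedding']],
        documents=[item['text']],
        # Fall back to 'unknown' if an item has no platform field
        metadatas=[{'platform': item.get('platform', 'unknown')}]
    )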
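Adding documents one at a time works, but it gets slow with a few thousand posts. `collection.add` accepts lists, so records can be sent in batches. The helper below is a hedged sketch: `batched_records` and the batch size of 100 are my own illustrative choices, and the demo data is a stand-in so the code runs without a model or a database.

```python
def batched_records(writings, batch_size=100):
    """Yield keyword-argument dicts for collection.add(), one per batch."""
    for start in range(0, len(writings), batch_size):
        batch = writings[start:start + batch_size]
        yield {
            "ids": [f"doc_{start + offset}" for offset in range(len(batch))],
            "embeddings": [item["embedding"] for item in batch],
            "documents": [item["text"] for item in batch],
            "metadatas": [{"platform": item.get("platform", "unknown")} for item in batch],
        }

# Stand-in data (hypothetical), shaped like the items in `writings`
demo = [
    {"text": f"post {i}", "embedding": [float(i), 0.0], "platform": "blog"}
    for i in range(5)
]
batches = list(batched_records(demo, batch_size=2))
print([b["ids"] for b in batches])

# With a real collection, each batch would be sent like this:
#   for kwargs in batched_records(writings):
#       collection.add(**kwargs)
```

The ids follow the same `doc_{i}` pattern as the loop above, so both approaches produce the same records.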
Test Semantic Search
# Query
query = "how to build web applications"
query_embedding = model.encode(query)

# Search
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5
)

# Results are your posts most similar to the query!
# results['documents'][0] holds the matches for the first (and only) query
for doc in results['documents'][0]:
    print(doc[:200])
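To check how good each match is, it helps to look at the distances alongside the text. By default a Chroma query result is a dict of parallel lists of lists (one inner list per query) under keys such as "ids", "documents", and "distances", with a lower distance meaning a closer match. The sketch below flattens one query's results into rows. `top_hits` is a hypothetical helper, and `fake_results` is hand-written stand-in data, not real output.

```python
def top_hits(results, query_index=0):
    """Flatten one query's results into (id, distance, document) tuples."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    distances = results["distances"][query_index]
    return list(zip(ids, distances, docs))

# Hand-written stand-in shaped like a query result (values are invented)
fake_results = {
    "ids": [["doc_3", "doc_7"]],
    "documents": [["Building a Flask app from scratch", "Notes on deploying React"]],
    "distances": [[0.42, 0.77]],
}

for doc_id, distance, doc in top_hits(fake_results):
    print(f"{doc_id}  distance={distance:.2f}  {doc[:60]}")
```

If the top results have distances close to each other and the text looks unrelated to the query, that is a hint the collection needs more (or better chunked) writing.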
Next Week
Part 4: Fine-tune a model on your writing. Make it write like you!
Homework: Run the code above and verify that semantic search works.
Series Progress: Part 3 of 6 ✓ Next: Part 4 - Fine-Tuning (Aug 10)