Part 3: Vector Embeddings Made Simple

You have your writing collected (Part 2). Now we turn it into numbers AI can understand.

What Are Embeddings?

Simple version: Math that captures meaning.

"King" and "Queen" are similar → their embedding vectors are close. "King" and "Banana" are different → vectors are far apart.

Technically: Fixed-length arrays of floating-point numbers (typically 384 to 1536 dimensions, depending on the model).

"Hello world" → [0.23, -0.45, 0.67, ..., 0.12]  # 384 numbers

Similar meanings = similar numbers.
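"Close" and "far apart" are usually measured with cosine similarity: the cosine of the angle between two vectors. Here is a minimal sketch using tiny made-up vectors (the numbers below are invented for illustration and do not come from any real model):

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional vectors, for illustration only.
# Real embeddings have hundreds of numbers.
king = [0.9, 0.8, 0.1, 0.0]
queen = [0.85, 0.75, 0.2, 0.05]
banana = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(king, queen))   # close to 1: similar direction
print(cosine_similarity(king, banana))  # much lower: different direction
```

A score near 1 means the vectors point the same way; a score near 0 means they have little in common.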

Why This Matters

Your AI twin needs to know:

  • "How do I..." and "What's the process for..." mean similar things
  • Your formal LinkedIn posts and your casual tweets are both "you"
  • Technical writing and storytelling produce different embeddings

Embeddings mainly capture meaning, and they pick up some style and tone along the way.

Create Embeddings (15 Lines)

import json
import os
from sentence_transformers import SentenceTransformer

# Load model (downloads ~100MB first time)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Load your writing (the combined file from Part 2)
with open('data/raw/combined.json', encoding='utf-8') as f:
    writings = json.load(f)

# Create embeddings (.tolist() turns the NumPy array into a JSON-friendly list)
for item in writings:
    item['embedding'] = model.encode(item['text']).tolist()

# Save (create the output folder first if it doesn't exist)
os.makedirs('data/processed', exist_ok=True)
with open('data/processed/embeddings.json', 'w', encoding='utf-8') as f:
    json.dump(writings, f)

That's it. Every piece of your writing now has a vector you can search by meaning.
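To see what "search by meaning" boils down to before bringing in a database, here is a rough sketch that ranks items by cosine similarity to a query vector. The helper name top_matches and the toy data are my own illustration; in practice the query vector would come from model.encode(query) and the items from your embeddings file.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_embedding, items, k=3):
    """Return (score, item) pairs for the k items most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, item['embedding']), item)
        for item in items
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy stand-ins for entries of `writings` (made-up 3-number embeddings)
toy_writings = [
    {'text': 'Deploying a Flask app', 'embedding': [0.9, 0.1, 0.0]},
    {'text': 'My sourdough recipe', 'embedding': [0.0, 0.2, 0.9]},
    {'text': 'Intro to REST APIs', 'embedding': [0.8, 0.3, 0.1]},
]
query_embedding = [1.0, 0.2, 0.0]  # pretend this is an encoded query

for score, item in top_matches(query_embedding, toy_writings, k=2):
    print(f"{score:.2f}  {item['text']}")
```

This brute-force loop is fine for a few thousand posts. A vector database does the same job with indexing, metadata filters, and storage built in.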

Store in ChromaDB (Vector Database)

import chromadb

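# In-memory client: the collection is gone when the script exits
client = chromadb.Client()
# get_or_create_collection reuses the collection if it already exists
collection = client.get_or_create_collection("my_writing")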

for i, item in enumerate(writings):
    collection.add(
        ids=[f"doc_{i}"],
        embeddings=[item['embedding']],
        documents=[item['text']],
        metadatas=[{'platform': item['platform']}]
    )

# Query
query = "how to build web applications"
query_embedding = model.encode(query)

# Search
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5
)

# Results are your posts most similar to the query!
for doc in results['documents'][0]:
    print(doc[:200])
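The text alone doesn't tell you how good each match is. Assuming the query response also carries 'metadatas' and 'distances' lists alongside 'documents' (one inner list per query), a small helper like the one below can line them up. The function name summarize_results and the sample data are hypothetical; the dictionary is a hand-written stand-in, not real output.

```python
def summarize_results(results, preview_chars=200):
    """Pair each returned document with its distance and platform."""
    docs = results['documents'][0]
    metas = results['metadatas'][0]
    dists = results['distances'][0]
    rows = []
    for doc, meta, dist in zip(docs, metas, dists):
        platform = meta.get('platform', 'unknown')
        rows.append((round(dist, 3), platform, doc[:preview_chars]))
    return rows

# Hand-written stand-in shaped like a response for a single query
fake_results = {
    'documents': [['Building a web app with Flask...', 'Why I journal daily']],
    'metadatas': [[{'platform': 'blog'}, {'platform': 'twitter'}]],
    'distances': [[0.412, 1.273]],
}

for dist, platform, preview in summarize_results(fake_results):
    print(f"{dist:<6} {platform:<8} {preview}")
```

Lower distance means a closer match. If the top results all have similar, large distances, the query probably has no good match in your writing.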

Next Week

Part 4: Fine-tune a model on your writing. Make it write like you!

Homework: Run the code above and verify that semantic search returns posts related to your query.


Series Progress: Part 3 of 6 ✓ Next: Part 4 - Fine-Tuning (Aug 10)