Build an LLM Twin: Part 2 - Scraping Your Digital Self
Welcome back to Science Church! Today we're collecting the data that becomes your AI twin.
What We're Building
A Python scraper that collects:
- ✅ Your LinkedIn posts
- ✅ Your tweets/X posts
- ✅ Your Medium articles
- ✅ Your blog posts
Output: data/raw/combined.json (ready for Part 3)
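Every scraper below writes records with the same basic shape so Part 3 can treat them uniformly. Here's a rough sketch of one record, with field names taken from the scripts in this post; title, url, and date only appear where the platform provides them, and the values here are purely illustrative:

# Illustrative record from combined.json (not every field appears on every platform)
example_record = {
    'platform': 'medium',                  # 'linkedin', 'twitter', or 'medium'
    'text': 'Full text of the post or article...',
    'type': 'post',                        # LinkedIn only in the scripts below
    'title': 'Why I Built an LLM Twin',    # Medium only
    'url': 'https://medium.com/@you/...',  # Medium only
    'date': '2025-06-01T12:00:00+00:00'    # Twitter (ISO) / Medium (RSS date string)
}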
The Game Plan
Step 1: LinkedIn Posts (Selenium)
LinkedIn blocks simple scrapers. We need a real browser.
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
import os
import time
os.makedirs('data/raw', exist_ok=True)  # make sure the output folder exists
driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/login')
# Manual login (handle 2FA if needed)
input("Log in, then press Enter...")
# Navigate to your posts
driver.get('https://www.linkedin.com/in/YOUR-PROFILE/recent-activity/all/')
time.sleep(3)
# Scroll to load more posts
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
# Extract post text (LinkedIn changes these CSS class names occasionally)
posts = driver.find_elements(By.CSS_SELECTOR, '.feed-shared-update-v2__description')
data = []
for post in posts:
    text = post.text
    if len(text) > 50:  # Filter out short posts
        data.append({
            'platform': 'linkedin',
            'text': text,
            'type': 'post'
        })
# Save
with open('data/raw/linkedin.json', 'w') as f:
    json.dump(data, f, indent=2)
print(f"Collected {len(data)} LinkedIn posts!")
driver.quit()
Run time: 2-3 minutes for 50+ posts
Step 2: Twitter/X Posts (API or Scraper)
Option A: Official API (now requires paid access)
Option B: snscrape (free, no API key needed; note that X's login-wall changes have broken snscrape's Twitter modules at times, so you may need a recent version or have to fall back to Option A)
pip install snscrape
import json
import snscrape.modules.twitter as sntwitter
YOUR_USERNAME = 'your_handle'  # replace with your X/Twitter handle
tweets = []
for tweet in sntwitter.TwitterSearchScraper(f'from:{YOUR_USERNAME}').get_items():
    if len(tweets) >= 1000:  # Stop after the last 1000 tweets
        break
    tweets.append({
        'platform': 'twitter',
        'text': tweet.rawContent,
        'date': tweet.date.isoformat()
    })
with open('data/raw/twitter.json', 'w') as f:
    json.dump(tweets, f, indent=2)
Run time: 30 seconds for 1000 tweets
Step 3: Medium Articles (RSS Feed)
Medium has RSS! No browser automation needed; we just fetch each article and strip the HTML (BeautifulSoup handles that below).
import json
import feedparser
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4
YOUR_USERNAME = 'your_handle'  # your Medium username, without the @
# Your Medium RSS feed
feed_url = f'https://medium.com/feed/@{YOUR_USERNAME}'
feed = feedparser.parse(feed_url)
articles = []
for entry in feed.entries:
    # Fetch the full article page
    response = requests.get(entry.link)
    # Strip HTML and keep the plain text (simplified; a stricter parser would
    # target the article body and drop nav/footer text)
    text = BeautifulSoup(response.text, 'html.parser').get_text(separator='\n')
    articles.append({
        'platform': 'medium',
        'title': entry.title,
        'text': text,
        'url': entry.link,
        'date': entry.published
    })
with open('data/raw/medium.json', 'w') as f:
    json.dump(articles, f, indent=2)
Step 4: Combine Everything
import json
import glob
all_data = []
# Load all JSON files (skip the combined output from a previous run)
for file in glob.glob('data/raw/*.json'):
    if file.endswith('combined.json'):
        continue
    with open(file) as f:
        data = json.load(f)
    all_data.extend(data)
# Deduplicate
seen = set()
unique_data = []
for item in all_data:
    text_hash = hash(item['text'][:100])  # Hash the first 100 characters
    if text_hash not in seen:
        seen.add(text_hash)
        unique_data.append(item)
# Save the combined dataset
with open('data/raw/combined.json', 'w') as f:
    json.dump(unique_data, f, indent=2)
print(f"Total unique writings: {len(unique_data)}")
Data Quality Tips
1. Minimum length: Filter out posts under 50 characters. "Thanks!" isn't useful training data.
2. Remove replies: You want YOUR voice, not conversations.
3. Check for duplicates: Same content cross-posted? Keep one copy (the hash check in Step 4 already handles this).
4. Date range: Your writing style evolves, so consider using only the last 2 years. A filter sketch covering these tips follows below.
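Here's a minimal sketch applying tips 1, 2, and 4 to combined.json, assuming the record fields produced by the scripts above. The reply check is a crude heuristic (tweets starting with '@'), and the 50-character and 2-year thresholds are starting points, not rules:

import json
from datetime import datetime, timedelta, timezone
with open('data/raw/combined.json') as f:
    items = json.load(f)
cutoff = datetime.now(timezone.utc) - timedelta(days=2 * 365)  # roughly the last 2 years
def keep(item):
    text = item['text'].strip()
    if len(text) < 50:            # Tip 1: too short to teach the model anything
        return False
    if text.startswith('@'):      # Tip 2: crude reply filter (Twitter-style replies)
        return False
    if 'date' in item:            # Tip 4: drop old items when the date is parseable
        try:
            posted = datetime.fromisoformat(item['date'])
        except ValueError:
            return True           # Medium's RSS date format isn't ISO; keep it
        if posted.tzinfo is None:
            posted = posted.replace(tzinfo=timezone.utc)
        if posted < cutoff:
            return False
    return True
filtered = [item for item in items if keep(item)]
print(f"Kept {len(filtered)} of {len(items)} items")
with open('data/raw/combined.json', 'w') as f:
    json.dump(filtered, f, indent=2)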
What You Should Have Now
data/
└── raw/
├── linkedin.json (50-200 posts)
├── twitter.json (500-1000 tweets)
├── medium.json (10-50 articles)
└── combined.json (all of the above)
Total words: 50,000-200,000 (enough to train on!)
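To sanity-check those numbers, here's a quick count over combined.json (whitespace-split words, so treat it as a rough estimate):

import json
from collections import Counter
with open('data/raw/combined.json') as f:
    items = json.load(f)
per_platform = Counter(item['platform'] for item in items)
total_words = sum(len(item['text'].split()) for item in items)
print(per_platform)                      # items collected per platform
print(f"Total words: {total_words:,}")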
Troubleshooting
LinkedIn blocked you?
- Wait 10 minutes between scrapes
- Use slower, randomized scrolling (see the sketch below)
- Don't scrape more than once/day
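One way to slow things down is to randomize the pause between scrolls in the Step 1 script. A sketch, where `driver` is the Selenium driver from Step 1 and the 3-7 second range is an arbitrary choice, not a guarantee against blocking:

import random
import time
def slow_scroll(driver, rounds=10):
    # Scroll to the bottom repeatedly with a randomized 3-7 second pause
    for _ in range(rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(3, 7))
# In Step 1, replace the scroll loop with: slow_scroll(driver)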
Twitter rate limited?
- snscrape respects limits automatically
- If it fails, try again in 15 minutes
Medium paywall?
- Your own articles are always accessible
- RSS feed doesn't hit paywalls
Next Week
Part 3: We turn this text into vector embeddings. That's the math that lets AI understand meaning.
Homework:
- Run the scrapers above
- Verify you have combined.json
- Check the data looks good (open it in a text editor)
Questions? Discord link in footer!
Series Progress: Part 2 of 6 Complete ✓
Next: Part 3 - Vector Embeddings Made Simple (Aug 24)
Code: GitHub repo updated weekly