Build an LLM Twin: Part 2 - Scraping Your Digital Self
Welcome back to Science Church! Today we're collecting the data that becomes your AI twin.
What We're Building
A Python scraper that collects:
- ✅ Your LinkedIn posts
- ✅ Your tweets/X posts
- ✅ Your Medium articles
- ✅ Your blog posts
Output: data/raw/combined.json (ready for Part 3)
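Every scraper below writes records with the same basic shape so Part 3 can treat them uniformly. Here's a rough sketch of one record, with field names taken from the scripts in this post; title, url, and date only appear where the platform provides them, and the values here are purely illustrative:

# Illustrative record from combined.json (not every field appears on every platform)
example_record = {
    'platform': 'medium',                  # 'linkedin', 'twitter', or 'medium'
    'text': 'Full text of the post or article...',
    'type': 'post',                        # LinkedIn only in the scripts below
    'title': 'Why I Built an LLM Twin',    # Medium only
    'url': 'https://medium.com/@you/...',  # Medium only
    'date': '2025-06-01T12:00:00+00:00'    # Twitter (ISO) / Medium (RSS date string)
}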
The Game Plan
Step 1: LinkedIn Posts (Selenium)
LinkedIn blocks simple scrapers. We need a real browser.
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
import os
import time
os.makedirs('data/raw', exist_ok=True)  # make sure the output folder exists
driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/login')
# Manual login (handle 2FA if needed)
input("Log in, then press Enter...")
# Navigate to your posts
driver.get('https://www.linkedin.com/in/YOUR-PROFILE/recent-activity/all/')
time.sleep(3)
# Scroll to load more posts
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
# Extract post text (LinkedIn changes these CSS class names occasionally)
posts = driver.find_elements(By.CSS_SELECTOR, '.feed-shared-update-v2__description')
data = []
for post in posts:
    text = post.text
    if len(text) > 50:  # Filter out short posts
        data.append({
            'platform': 'linkedin',
            'text': text,
            'type': 'post'
        })
# Save
with open('data/raw/linkedin.json', 'w') as f:
    json.dump(data, f, indent=2)
print(f"Collected {len(data)} LinkedIn posts!")
driver.quit()
Run time: 2-3 minutes for 50+ posts
Step 2: Twitter/X Posts (API or Scraper)
Option A: Official API (now requires paid access)
Option B: snscrape (free, no API key needed; note that X's login-wall changes have broken snscrape's Twitter modules at times, so you may need a recent version or have to fall back to Option A)
pip install snscrape
import json
import snscrape.modules.twitter as sntwitter
YOUR_USERNAME = 'your_handle'  # replace with your X/Twitter handle
tweets = []
for tweet in sntwitter.TwitterSearchScraper(f'from:{YOUR_USERNAME}').get_items():
    if len(tweets) >= 1000:  # Stop after the last 1000 tweets
        break
    tweets.append({
        'platform': 'twitter',
        'text': tweet.rawContent,
        'date': tweet.date.isoformat()
    })
with open('data/raw/twitter.json', 'w') as f:
    json.dump(tweets, f, indent=2)
Run time: 30 seconds for 1000 tweets
Step 3: Medium Articles (RSS Feed)
Medium has RSS! No browser automation needed; we just fetch each article and strip the HTML (BeautifulSoup handles that below).
import json
import feedparser
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4
YOUR_USERNAME = 'your_handle'  # your Medium username, without the @
# Your Medium RSS feed
feed_url = f'https://medium.com/feed/@{YOUR_USERNAME}'
feed = feedparser.parse(feed_url)
articles = []
for entry in feed.entries:
    # Fetch the full article page
    response = requests.get(entry.link)
    # Strip HTML and keep the plain text (simplified; a stricter parser would
    # target the article body and drop nav/footer text)
    text = BeautifulSoup(response.text, 'html.parser').get_text(separator='\n')
    articles.append({
        'platform': 'medium',
        'title': entry.title,
        'text': text,
        'url': entry.link,
        'date': entry.published
    })
with open('data/raw/medium.json', 'w') as f:
    json.dump(articles, f, indent=2)
Step 4: Combine Everything
import json
import glob
all_data = []
# Load all JSON files (skip the combined output from a previous run)
for file in glob.glob('data/raw/*.json'):
    if file.endswith('combined.json'):
        continue
    with open(file) as f:
        data = json.load(f)
    all_data.extend(data)
# Deduplicate
seen = set()
unique_data = []
for item in all_data:
    text_hash = hash(item['text'][:100])  # Hash the first 100 characters
    if text_hash not in seen:
        seen.add(text_hash)
        unique_data.append(item)
# Save the combined dataset
with open('data/raw/combined.json', 'w') as f:
    json.dump(unique_data, f, indent=2)
print(f"Total unique writings: {len(unique_data)}")
Data Quality Tips
1. Minimum length: Filter out posts under 50 characters. "Thanks!" isn't useful training data.
2. Remove replies: You want YOUR voice, not conversations.
3. Check for duplicates: Same content cross-posted? Keep one copy (the hash check in Step 4 already handles this).
4. Date range: Your writing style evolves, so consider using only the last 2 years. A filter sketch covering these tips follows below.
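Here's a minimal sketch applying tips 1, 2, and 4 to combined.json, assuming the record fields produced by the scripts above. The reply check is a crude heuristic (tweets starting with '@'), and the 50-character and 2-year thresholds are starting points, not rules:

import json
from datetime import datetime, timedelta, timezone
with open('data/raw/combined.json') as f:
    items = json.load(f)
cutoff = datetime.now(timezone.utc) - timedelta(days=2 * 365)  # roughly the last 2 years
def keep(item):
    text = item['text'].strip()
    if len(text) < 50:            # Tip 1: too short to teach the model anything
        return False
    if text.startswith('@'):      # Tip 2: crude reply filter (Twitter-style replies)
        return False
    if 'date' in item:            # Tip 4: drop old items when the date is parseable
        try:
            posted = datetime.fromisoformat(item['date'])
        except ValueError:
            return True           # Medium's RSS date format isn't ISO; keep it
        if posted.tzinfo is None:
            posted = posted.replace(tzinfo=timezone.utc)
        if posted < cutoff:
            return False
    return True
filtered = [item for item in items if keep(item)]
print(f"Kept {len(filtered)} of {len(items)} items")
with open('data/raw/combined.json', 'w') as f:
    json.dump(filtered, f, indent=2)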
What You Should Have Now
data/
└── raw/
├── linkedin.json (50-200 posts)
├── twitter.json (500-1000 tweets)
├── medium.json (10-50 articles)
└── combined.json (all of the above)
Total words: 50,000-200,000 (enough to train on!)
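To sanity-check those numbers, here's a quick count over combined.json (whitespace-split words, so treat it as a rough estimate):

import json
from collections import Counter
with open('data/raw/combined.json') as f:
    items = json.load(f)
per_platform = Counter(item['platform'] for item in items)
total_words = sum(len(item['text'].split()) for item in items)
print(per_platform)                      # items collected per platform
print(f"Total words: {total_words:,}")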
Troubleshooting
LinkedIn blocked you?
- Wait 10 minutes between scrapes
- Use slower, randomized scrolling (see the sketch below)
- Don't scrape more than once/day
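One way to slow things down is to randomize the pause between scrolls in the Step 1 script. A sketch, where `driver` is the Selenium driver from Step 1 and the 3-7 second range is an arbitrary choice, not a guarantee against blocking:

import random
import time
def slow_scroll(driver, rounds=10):
    # Scroll to the bottom repeatedly with a randomized 3-7 second pause
    for _ in range(rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(random.uniform(3, 7))
# In Step 1, replace the scroll loop with: slow_scroll(driver)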
Twitter rate limited?
- snscrape respects limits automatically
- If it fails, try again in 15 minutes
Medium paywall?
- Your own articles are always accessible
- RSS feed doesn't hit paywalls
Next Week
Part 3: We turn this text into vector embeddings. That's the math that lets AI understand meaning.
Homework:
- Run the scrapers above
- Verify you have combined.json
- Check the data looks good (open it in a text editor)
Questions? Discord link in footer!
Series Progress: Part 2 of 6 Complete ✓
Next: Part 3 - Vector Embeddings Made Simple (Aug 24)
Code: GitHub repo updated weekly