Part 2: Scraping Your Digital Self

Welcome back to Science Church! Today we're collecting the data that becomes your AI twin.

What We're Building

A Python scraper that collects:

  • ✅ Your LinkedIn posts
  • ✅ Your tweets/X posts
  • ✅ Your Medium articles
  • ✅ Your blog posts

Output: data/raw/combined.json (ready for Part 3)

The Game Plan

Step 1: LinkedIn Posts (Selenium)

LinkedIn blocks simple scrapers. We need a real browser.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import json

driver = webdriver.Chrome()
driver.get('https://www.linkedin.com/login')

# Manual login (handle 2FA if needed)
input("Log in, then press Enter...")

# Navigate to your posts
driver.get('https://www.linkedin.com/in/YOUR-PROFILE/recent-activity/all/')
time.sleep(3)

# Scroll to load more posts
for _ in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)

# Extract posts (LinkedIn changes these class names from time to time --
# if this returns nothing, inspect a post in DevTools and update the selector)
posts = driver.find_elements(By.CSS_SELECTOR, '.feed-shared-update-v2__description')

data = []
for post in posts:
    text = post.text
    if len(text) > 50:  # Filter out short posts
        data.append({
            'platform': 'linkedin',
            'text': text,
            'type': 'post'
        })

# Save
with open('data/raw/linkedin.json', 'w') as f:
    json.dump(data, f, indent=2)

print(f"Collected {len(data)} LinkedIn posts!")

Run time: 2-3 minutes for 50+ posts

Step 2: Twitter/X Posts (API or Scraper)

Option A: Official API (requires paid access now)

Option B: Snscrape (free, no API needed). Heads up: X's API changes have broken snscrape's Twitter module at times, so if it errors out, check the project's issue tracker before assuming your code is at fault.

# pip install snscrape
import snscrape.modules.twitter as sntwitter
import json

YOUR_USERNAME = 'your_handle'  # your X handle, without the @

tweets = []
for tweet in sntwitter.TwitterSearchScraper(f'from:{YOUR_USERNAME}').get_items():
    if len(tweets) >= 1000:  # Stop after the most recent 1000 tweets
        break

    tweets.append({
        'platform': 'twitter',
        'text': tweet.rawContent,
        'date': tweet.date.isoformat()
    })

with open('data/raw/twitter.json', 'w') as f:
    json.dump(tweets, f, indent=2)

print(f"Collected {len(tweets)} tweets!")

Run time: 30 seconds for 1000 tweets

Step 3: Medium Articles (RSS Feed)

Medium has RSS! No scraping needed.

# pip install feedparser requests beautifulsoup4
import feedparser
import requests
import json
from bs4 import BeautifulSoup

YOUR_USERNAME = 'your-handle'  # your Medium username, without the @

# Your Medium RSS feed
feed_url = f'https://medium.com/feed/@{YOUR_USERNAME}'
feed = feedparser.parse(feed_url)

articles = []
for entry in feed.entries:
    # Get full article content
    response = requests.get(entry.link)

    # Medium wraps the story body in an <article> tag; fall back to the whole page
    soup = BeautifulSoup(response.text, 'html.parser')
    body = soup.find('article') or soup
    text = body.get_text(separator='\n')

    articles.append({
        'platform': 'medium',
        'title': entry.title,
        'text': text,
        'url': entry.link,
        'date': entry.published
    })

with open('data/raw/medium.json', 'w') as f:
    json.dump(articles, f, indent=2)

print(f"Collected {len(articles)} Medium articles!")

Step 4: Combine Everything

import json
import glob

all_data = []

# Load all JSON files (skip combined.json itself if this script has run before)
for file in glob.glob('data/raw/*.json'):
    if file.endswith('combined.json'):
        continue
    with open(file) as f:
        data = json.load(f)
        all_data.extend(data)

# Deduplicate
seen = set()
unique_data = []
for item in all_data:
    text_hash = hash(item['text'][:100])  # Hash first 100 chars
    if text_hash not in seen:
        seen.add(text_hash)
        unique_data.append(item)

# Save combined dataset
with open('data/raw/combined.json', 'w') as f:
    json.dump(unique_data, f, indent=2)

print(f"Total unique writings: {len(unique_data)}")

Data Quality Tips

1. Minimum length: Filter out posts shorter than 50 characters. "Thanks!" isn't useful training data.

2. Remove replies: You want YOUR voice, not conversations.

3. Check for duplicates: Same content cross-posted? Keep one copy.

4. Date range: Your writing style evolves, so maybe only use the last 2 years.
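
If you want these tips in code, here's a minimal cleaning pass over combined.json (tip 3 is already handled by the dedup in Step 4). The reply heuristic, the two-year cutoff, and the combined_clean.json filename are assumptions to tune, not rules:

import json
from datetime import datetime, timedelta, timezone

with open('data/raw/combined.json') as f:
    data = json.load(f)

cutoff = datetime.now(timezone.utc) - timedelta(days=2 * 365)

cleaned = []
for item in data:
    text = item['text'].strip()

    # Tip 1: drop very short posts
    if len(text) < 50:
        continue

    # Tip 2: crude reply filter -- skip tweets that open by tagging someone
    if item['platform'] == 'twitter' and text.startswith('@'):
        continue

    # Tip 4: date cutoff, applied only when the date parses cleanly
    if 'date' in item:
        try:
            if datetime.fromisoformat(item['date']) < cutoff:
                continue
        except (ValueError, TypeError):
            pass  # missing or non-ISO date: keep the record

    cleaned.append(item)

with open('data/raw/combined_clean.json', 'w') as f:
    json.dump(cleaned, f, indent=2)

print(f"Kept {len(cleaned)} of {len(data)} items")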

What You Should Have Now

data/
└── raw/
    ├── linkedin.json      (50-200 posts)
    ├── twitter.json       (500-1000 tweets)
    ├── medium.json        (10-50 articles)
    └── combined.json      (all of the above)

Total words: 50,000-200,000 (enough to train on!)
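
Not sure where you land? Here's a quick count over combined.json, assuming the text field used by all the scrapers above:

import json

with open('data/raw/combined.json') as f:
    data = json.load(f)

total_words = sum(len(item['text'].split()) for item in data)
print(f"{len(data)} items, ~{total_words:,} words")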

Troubleshooting

LinkedIn blocked you?

  • Wait 10 minutes between scrapes
  • Use slower scrolling (see the sketch below)
  • Don't scrape more than once/day
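
If LinkedIn keeps flagging you, here's a gentler drop-in for Step 1's scroll loop. The scroll step and the 3-6 second random pauses are guesses, not documented thresholds, so adjust as needed:

import random
import time

def slow_scroll(driver, rounds=10):
    # Scroll in small steps with randomized pauses instead of jumping
    # straight to the bottom of the page every time
    for _ in range(rounds):
        driver.execute_script("window.scrollBy(0, 1500);")  # part of a page per step
        time.sleep(random.uniform(3, 6))                    # random 3-6 s pause

Call slow_scroll(driver) where Step 1 does its ten full-page scrolls.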

Twitter rate limited?

  • snscrape respects limits automatically
  • If it fails, try again in 15 minutes

Medium paywall?

  • Your own articles are always accessible
  • RSS feed doesn't hit paywalls

Next Week

Part 3: We turn this text into vector embeddings. That's the math that lets AI understand meaning.

Homework:

  • Run the scrapers above
  • Verify you have combined.json
  • Check the data looks good (open in text editor)

Questions? Discord link in footer!


Series Progress: Part 2 of 6 Complete ✓
Next: Part 3 - Vector Embeddings Made Simple (Aug 24)
Code: GitHub repo updated weekly