Week 7 — Web Scraping

Extract structured data from web pages ethically. Use requests for HTTP and BeautifulSoup for parsing HTML. Implement rate limiting, respect robots.txt, and cache results.

requests, BeautifulSoup, scraping, HTML, rate limit, robots.txt

Duration: ⏱ 2.5 hours

Level: 📊 Advanced

Prerequisite: 🎯 Intermediate Weeks 3–6

OUTCOME

Build a news headline scraper with rate limiting and JSON export

What you'll learn

1. Send HTTP requests and inspect responses with requests
2. Parse HTML trees with BeautifulSoup4
3. Handle pagination and dynamic content
4. Respect robots.txt and implement polite rate limiting
5. Cache responses to avoid redundant requests

1. requests

python

import requests

resp = requests.get(
    "https://httpbin.org/get",
    params={"key": "value"},
    headers={"User-Agent": "my-scraper/1.0"},
    timeout=10,
)
print(resp.status_code)   # 200
print(resp.headers["Content-Type"])
data = resp.json()
print(data["args"])       # {'key': 'value'}

2. BeautifulSoup

python

from bs4 import BeautifulSoup
import requests

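html = requests.get(
    "https://news.ycombinator.com",
    headers={"User-Agent": "my-scraper/1.0"},
    timeout=10,  # without a timeout, requests can wait indefinitely
).text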
soup = BeautifulSoup(html, "html.parser")

# Select all story titles
for item in soup.select(".titleline > a")[:5]:
    print(item.text)
    print(" ", item.get("href"))

3. Polite Scraping

python

import time, random, requests

class PoliteScraper:
    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = "MyBot/1.0 (+https://example.com)"
        self._min = min_delay
        self._max = max_delay
        self._last = 0

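    def get(self, url):
        # Pick a random delay so requests do not arrive at a fixed rhythm
        wait = self._min + random.random() * (self._max - self._min)
        # monotonic() is unaffected by system clock adjustments
        elapsed = time.monotonic() - self._last
        if elapsed < wait:
            time.sleep(wait - elapsed)
        self._last = time.monotonic()
        return self.session.get(url, timeout=10)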

⚠️ Always check robots.txt before scraping. Never hammer a server with rapid requests. Identify your bot with a descriptive User-Agent.
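
One way to check robots.txt from Python is the standard library's urllib.robotparser. The sketch below is a minimal illustration, not a complete crawler policy: the robots.txt content in SAMPLE is made up for the demo, and the helper name build_parser is hypothetical. It parses the rules from a string so it runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text (hypothetical helper for this demo)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules for illustration; a real site publishes its own at /robots.txt.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(SAMPLE)

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed_news = parser.can_fetch("MyBot/1.0", "https://example.com/news")
allowed_private = parser.can_fetch("MyBot/1.0", "https://example.com/private/x")

# crawl_delay() returns the Crawl-delay value for this agent, or None if absent.
delay = parser.crawl_delay("MyBot/1.0")

print(allowed_news)     # True
print(allowed_private)  # False
print(delay)            # 2
```

Against a live site you would typically call parser.set_url("https://example.com/robots.txt") followed by parser.read() instead of parse(), then skip any URL for which can_fetch() returns False. If crawl_delay() returns a number, it is reasonable to use it as the minimum delay in a scraper such as PoliteScraper.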

💻 Examples

Run these examples and check the output yourself.

01_hacker_news.py: Scrape top stories from Hacker News

CODE

import requests, json
from bs4 import BeautifulSoup
import time

BASE = "https://news.ycombinator.com"

def scrape_page(page=1):
    url = BASE if page == 1 else f"{BASE}/?p={page}"
    time.sleep(1)  # polite delay
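    resp = requests.get(
        url,
        headers={"User-Agent": "MyBot/1.0 (+https://example.com)"},
        timeout=10,
    )
    resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")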
    stories = []
    for el in soup.select(".athing"):
        title_el = el.select_one(".titleline > a")
        if title_el:
            stories.append({"title": title_el.text, "url": title_el.get("href")})
    return stories

stories = scrape_page(1)
print(json.dumps(stories[:3], indent=2))
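
Re-running a scraper while you adjust selectors downloads the same pages again each time, which is what the caching outcome for this week is meant to avoid. The following is a minimal sketch of a disk cache, under several assumptions: the class name CachedFetcher, the cache file layout, and the stand-in fake_fetch function are all hypothetical, and the cache never expires entries. The fetch function is passed in so the demo runs offline.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class CachedFetcher:
    """Store each page body on disk, keyed by a hash of its URL."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], str]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._fetch = fetch  # any function that maps a URL to page text

    def _path_for(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url: str) -> str:
        path = self._path_for(url)
        if path.exists():                # cache hit: no network call
            return path.read_text(encoding="utf-8")
        text = self._fetch(url)          # cache miss: fetch once, then store
        path.write_text(text, encoding="utf-8")
        return text


# Offline demo with a stand-in fetcher that records each call it receives.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html><body>page for {url}</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    fetcher = CachedFetcher(tmp, fake_fetch)
    first = fetcher.get("https://example.com/a")
    second = fetcher.get("https://example.com/a")  # served from disk

print(len(calls))       # 1
print(first == second)  # True
```

To use it with real pages, you could pass a function built on requests, for example lambda u: requests.get(u, timeout=10).text. Because this sketch never invalidates entries, delete the cache directory when you need fresh data.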

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Quote scraper

Goal: Scrape all quotes and authors from quotes.toscrape.com (all pages).

Requirements

Follow 'Next' pagination links until none remain
Store as list of {quote, author, tags} dicts
Save to quotes.json with json.dump

SOLUTION

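import json
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "http://quotes.toscrape.com"
HEADERS = {"User-Agent": "MyBot/1.0 (+https://example.com)"}

all_quotes = []
url = BASE
while url:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    for q in soup.select(".quote"):
        all_quotes.append({
            # Strip the curly quotation marks that wrap each quote
            "quote":  q.select_one(".text").text.strip('\u201c\u201d'),
            "author": q.select_one(".author").text,
            "tags":   [t.text for t in q.select(".tag")],
        })
    # Follow the "Next" link; urljoin resolves the relative href against the current URL
    nxt = soup.select_one("li.next a")
    url = urljoin(url, nxt["href"]) if nxt else None
    time.sleep(0.5)  # polite delay between pages
print(f"Scraped {len(all_quotes)} quotes")
with open("quotes.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps accented characters readable in the file
    json.dump(all_quotes, f, indent=2, ensure_ascii=False)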

Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
