Advanced
requests · BeautifulSoup · Rate limiting · robots.txt · Caching

Week 7 — Web Scraping

Extract structured data from web pages ethically. Use requests for HTTP and BeautifulSoup for parsing HTML. Implement rate limiting, respect robots.txt, and cache results.

Duration: 2.5 hours
Level: Advanced
Prerequisite: Intermediate Weeks 3–6
Outcome: Build a news headline scraper with rate limiting and JSON export

What you'll learn

  1. Send HTTP requests and inspect responses with requests
  2. Parse HTML trees with BeautifulSoup4
  3. Handle pagination and dynamic content
  4. Respect robots.txt and implement polite rate limiting
  5. Cache responses to avoid redundant requests

1. requests

python
import requests

resp = requests.get(
    "https://httpbin.org/get",
    params={"key": "value"},
    headers={"User-Agent": "my-scraper/1.0"},
    timeout=10,
)
print(resp.status_code)   # 200
print(resp.headers["Content-Type"])
data = resp.json()
print(data["args"])       # {'key': 'value'}
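
The call above assumes the request succeeds. A minimal sketch of the same kind of request with basic error handling follows. `fetch_html` is a hypothetical helper name (not part of requests), and returning `None` on failure is only one possible convention.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as text, or None if the request fails."""
    try:
        resp = requests.get(
            url,
            headers={"User-Agent": "my-scraper/1.0"},
            timeout=timeout,
        )
        # Raises requests.HTTPError for 4xx and 5xx status codes
        resp.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for timeouts, connection
        # errors, HTTP errors, and invalid URLs raised by requests
        print(f"Request failed: {exc}")
        return None
    return resp.text
```

Calling `fetch_html("https://httpbin.org/get")` should return the body as a string when the server is reachable. A scraper can then skip a page that returned `None` instead of crashing halfway through a run.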

2. BeautifulSoup

python
from bs4 import BeautifulSoup
import requests

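resp = requests.get("https://news.ycombinator.com", timeout=10)
resp.raise_for_status()  # stop early on 4xx/5xx instead of parsing an error page
html = resp.text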
soup = BeautifulSoup(html, "html.parser")

# Select all story titles
for item in soup.select(".titleline > a")[:5]:
    print(item.text)
    print(" ", item.get("href"))
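
Selectors are easier to experiment with on a small inline document than on a live page. The sketch below assumes markup loosely modeled on the Hacker News structure used above. `SAMPLE_HTML` and `extract_stories` are illustrative names, and the real page layout may differ or change.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modeled on the structure used above
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr class="athing"><td><span class="titleline"><a href="item?id=2">Second story</a></span></td></tr>
  <tr class="athing"><td>No link here</td></tr>
</table>
"""


def extract_stories(html):
    """Return a list of {title, url} dicts, skipping rows without a link."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for row in soup.select(".athing"):
        # select_one returns None when nothing matches, so guard before use
        link = row.select_one(".titleline > a")
        if link is None:
            continue
        stories.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
        })
    return stories


for story in extract_stories(SAMPLE_HTML):
    print(story["title"], "->", story["url"])
```

Note that the second `href` is relative. It would need to be joined with the site's base URL before it could be requested.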

3. Polite Scraping

python
import time, random, requests

class PoliteScraper:
    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = "MyBot/1.0 (+https://example.com)"
        self._min = min_delay
        self._max = max_delay
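        self._last = None  # monotonic timestamp of the previous request

    def get(self, url):
        # Pick a random delay so requests are not evenly spaced
        wait = random.uniform(self._min, self._max)
        if self._last is not None:
            # time.monotonic() is unaffected by system clock changes
            elapsed = time.monotonic() - self._last
            if elapsed < wait:
                time.sleep(wait - elapsed)
        self._last = time.monotonic()
        return self.session.get(url, timeout=10)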
⚠️ Always check robots.txt before scraping. Never hammer a server with rapid requests. Identify your bot with a descriptive User-Agent.
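
One way to check robots.txt is the standard library's `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so it runs without network access. `ROBOTS_TXT` and `build_parser` are illustrative names, and the rules shown are assumptions rather than any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a RobotFileParser from robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp


rp = build_parser(ROBOTS_TXT)
agent = "MyBot/1.0"

print(rp.can_fetch(agent, "https://example.com/news"))          # True
print(rp.can_fetch(agent, "https://example.com/private/data"))  # False
print(rp.crawl_delay(agent))                                    # 2

# Against a live site, the file would be downloaded instead:
#   rp = RobotFileParser()
#   rp.set_url("https://example.com/robots.txt")
#   rp.read()
```

A scraper could call `can_fetch` before each request and use `crawl_delay`, when the site sets one, as the minimum delay between requests.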

💻 Examples

Run these examples and check the output yourself.

01_hacker_news.py: Scrape top stories from Hacker News
CODE
import requests, json
from bs4 import BeautifulSoup
import time

BASE = "https://news.ycombinator.com"

def scrape_page(page=1):
    url = BASE if page == 1 else f"{BASE}/?p={page}"
    time.sleep(1)  # polite delay
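    # Identify the scraper and fail loudly on HTTP errors
    resp = requests.get(url, headers={"User-Agent": "my-scraper/1.0"}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")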
    stories = []
    for el in soup.select(".athing"):
        title_el = el.select_one(".titleline > a")
        if title_el:
            stories.append({"title": title_el.text, "url": title_el.get("href")})
    return stories

stories = scrape_page(1)
print(json.dumps(stories[:3], indent=2))
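
The script above downloads the page again on every run. A simple on-disk cache avoids redundant requests while you develop your parsing code. This is a minimal sketch under stated assumptions: `DiskCache`, `cached_fetch`, the `.scrape_cache` directory, and the one-hour `max_age` are all hypothetical choices, and `fetch` stands for any function that takes a URL and returns page text.

```python
import hashlib
import json
import time
from pathlib import Path


class DiskCache:
    """Store fetched page text on disk, keyed by a hash of the URL."""

    def __init__(self, cache_dir=".scrape_cache", max_age=3600):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.max_age = max_age  # seconds before an entry counts as stale

    def _path(self, url):
        # Hash the URL so any URL maps to a safe file name
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.dir / f"{digest}.json"

    def get(self, url):
        """Return cached text, or None if missing or stale."""
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        if time.time() - entry["fetched_at"] > self.max_age:
            return None  # stale: the caller should refetch
        return entry["text"]

    def set(self, url, text):
        entry = {"url": url, "fetched_at": time.time(), "text": text}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")


def cached_fetch(url, fetch, cache):
    """Return cached text for url, calling fetch(url) only on a miss."""
    text = cache.get(url)
    if text is None:
        text = fetch(url)
        cache.set(url, text)
    return text
```

With requests, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. Passing the fetch function in as an argument keeps the cache independent of the HTTP library and lets the rate-limiting delay apply only on cache misses.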

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Quote scraper

Goal: Scrape all quotes and authors from quotes.toscrape.com (all pages).

Requirements
  • Follow 'Next' pagination links until none remain
  • Store as list of {quote, author, tags} dicts
  • Save to quotes.json with json.dump
SOLUTION
import requests, json, time
from bs4 import BeautifulSoup

BASE = "http://quotes.toscrape.com"
all_quotes = []
url = BASE
while url:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for q in soup.select(".quote"):
        all_quotes.append({
            "quote":  q.select_one(".text").text.strip('\u201c\u201d'),
            "author": q.select_one(".author").text,
            "tags":   [t.text for t in q.select(".tag")],
        })
    nxt = soup.select_one("li.next a")
    url = BASE + nxt["href"] if nxt else None
    time.sleep(0.5)
print(f"Scraped {len(all_quotes)} quotes")
with open("quotes.json", "w") as f:
    json.dump(all_quotes, f, indent=2)
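
The pagination loop in the solution can also be written as a reusable generator. The sketch below is one possible variant, not the course's reference implementation: `iter_pages`, its `fetch` parameter, and the `max_pages` safety limit are hypothetical, and the `li.next a` selector is taken from the solution above. It uses `urljoin` so that relative and absolute `href` values are both resolved against the current page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, soup) for each page, following 'Next' links.

    fetch is any callable that takes a URL and returns HTML text.
    """
    url = start_url
    seen = set()
    # Stop on a missing link, a repeated URL (loop), or the page limit
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        nxt = soup.select_one("li.next a")
        # urljoin resolves the href against the page it was found on
        url = urljoin(url, nxt["href"]) if nxt else None
```

The `fetch` callable is the natural place to put the polite delay, for example a function that sleeps and then calls `requests.get(url, timeout=10).text`.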
Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
