requests · BeautifulSoup · Rate limiting · robots.txt · Caching
Week 7 — Web Scraping
Extract structured data from web pages ethically. Use requests for HTTP and BeautifulSoup for parsing HTML. Implement rate limiting, respect robots.txt, and cache results.
Duration: ⏱ 2.5 hours
Level: 📊 Advanced
Prerequisite: 🎯 Intermediate Weeks 3–6
OUTCOME
Build a news headline scraper with rate limiting and JSON export
What you'll learn
- Send HTTP requests and inspect responses with requests
- Parse HTML trees with BeautifulSoup4
- Handle pagination and dynamic content
- Respect robots.txt and implement polite rate limiting
- Cache responses to avoid redundant requests
1. requests
python
import requests

# Send a GET request with query parameters, a custom User-Agent, and a timeout
resp = requests.get(
    "https://httpbin.org/get",
    params={"key": "value"},
    headers={"User-Agent": "my-scraper/1.0"},
    timeout=10,
)
print(resp.status_code)  # 200
print(resp.headers["Content-Type"])
data = resp.json()  # parse the JSON body into a dict
print(data["args"])  # {'key': 'value'}
2. BeautifulSoup
python
from bs4 import BeautifulSoup
import requests

html = requests.get("https://news.ycombinator.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Print the first five story titles and their links
for item in soup.select(".titleline > a")[:5]:
    print(item.text)
    print(" ", item.get("href"))
3. Polite Scraping
python
import time, random, requests

class PoliteScraper:
    def __init__(self, min_delay=1.0, max_delay=3.0):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = "MyBot/1.0 (+https://example.com)"
        self._min = min_delay
        self._max = max_delay
        self._last = 0  # timestamp of the previous request

    def get(self, url):
        # Pick a random delay between min_delay and max_delay, then sleep off
        # whatever part of it has not already elapsed since the last request
        wait = self._min + random.random() * (self._max - self._min)
        elapsed = time.time() - self._last
        if elapsed < wait:
            time.sleep(wait - elapsed)
        self._last = time.time()
        return self.session.get(url, timeout=10)
⚠️ Always check robots.txt before scraping. Never hammer a server with rapid requests. Identify your bot with a descriptive User-Agent.
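One way to act on the robots.txt advice is the standard library's urllib.robotparser. The sketch below is a minimal illustration, not part of the course code: the robots.txt rules are a made-up sample inlined as a string so that it runs without a network call, and the helper name build_parser is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS)
agent = "MyBot/1.0"

print(parser.can_fetch(agent, "https://example.com/news"))       # True
print(parser.can_fetch(agent, "https://example.com/private/x"))  # False
print(parser.crawl_delay(agent))                                 # 2
```

Against a live site, calling set_url() with the site's robots.txt address and then read() loads the rules over the network instead of parse(). A scraper such as PoliteScraper could call can_fetch() before each request and use crawl_delay(), when the site sets one, as a lower bound for its delay.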
💻 Examples
Run these examples and check the output yourself.
01_hacker_news.py: Scrape top stories from Hacker News
CODE
import requests, json
from bs4 import BeautifulSoup
import time

BASE = "https://news.ycombinator.com"

def scrape_page(page=1):
    url = BASE if page == 1 else f"{BASE}/?p={page}"
    time.sleep(1)  # polite delay
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    stories = []
    for el in soup.select(".athing"):
        title_el = el.select_one(".titleline > a")
        if title_el:
            stories.append({"title": title_el.text, "url": title_el.get("href")})
    return stories

stories = scrape_page(1)
print(json.dumps(stories[:3], indent=2))
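The scrape_page function above requests the page again on every run. A minimal sketch of response caching, one of this week's goals, is to store each page on disk under a hash of its URL and reuse the stored copy on later calls. The PageCache class and the fake_fetch stand-in are hypothetical names introduced for illustration; the fetcher is passed in as a callable so the sketch runs offline.

```python
import hashlib
import tempfile
from pathlib import Path

class PageCache:
    """Store fetched page text on disk, keyed by a hash of the URL."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable: url -> page text

    def _path(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url):
        path = self._path(url)
        if path.exists():
            # Cache hit: return the stored copy without calling fetch
            return path.read_text(encoding="utf-8")
        # Cache miss: fetch once and store the result for next time
        text = self.fetch(url)
        path.write_text(text, encoding="utf-8")
        return text

# Stand-in fetcher (an assumption for this sketch) that records each call
calls = []

def fake_fetch(url):
    calls.append(url)
    return f"<html><title>{url}</title></html>"

cache = PageCache(tempfile.mkdtemp(), fake_fetch)
first = cache.get("https://news.ycombinator.com")
second = cache.get("https://news.ycombinator.com")
print(len(calls))  # 1
```

In a real scraper the fetch argument could be something like `lambda url: requests.get(url, timeout=10).text`. This sketch never expires entries, so a frequently changing page such as a news front page would also need a maximum age check on the cached file.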
📝 Exercises
Try them yourself first, then open the solution to compare.
Exercise 1
Quote scraper
Goal: Scrape all quotes and authors from quotes.toscrape.com (all pages).
Requirements
- Follow 'Next' pagination links until none remain
- Store as list of {quote, author, tags} dicts
- Save to quotes.json with json.dump
SOLUTION
import requests, json, time
from bs4 import BeautifulSoup

BASE = "http://quotes.toscrape.com"
all_quotes = []
url = BASE
while url:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    for q in soup.select(".quote"):
        all_quotes.append({
            # Strip the curly quotation marks that wrap each quote
            "quote": q.select_one(".text").text.strip('\u201c\u201d'),
            "author": q.select_one(".author").text,
            "tags": [t.text for t in q.select(".tag")],
        })
    # Follow the 'Next' link; the loop ends on the last page, which has none
    nxt = soup.select_one("li.next a")
    url = BASE + nxt["href"] if nxt else None
    time.sleep(0.5)

print(f"Scraped {len(all_quotes)} quotes")
with open("quotes.json", "w") as f:
    json.dump(all_quotes, f, indent=2)
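The solution builds the next URL as BASE + href, which assumes the 'Next' link is root-relative (starts with "/"). A slightly more general sketch resolves the href against the page it was found on with urllib.parse.urljoin. The helper name next_url is hypothetical, and the "../3/" href is an invented case to show relative resolution.

```python
from urllib.parse import urljoin

def next_url(current_url, href):
    """Resolve a 'Next' link href against the current page URL."""
    return urljoin(current_url, href) if href else None

# Root-relative href, the form the solution above assumes
print(next_url("http://quotes.toscrape.com/", "/page/2/"))
# http://quotes.toscrape.com/page/2/

# Page-relative href (invented for illustration)
print(next_url("http://quotes.toscrape.com/page/2/", "../3/"))
# http://quotes.toscrape.com/page/3/

# No 'Next' link: pagination stops
print(next_url("http://quotes.toscrape.com/page/10/", None))
# None
```

In the loop above this would replace the BASE + nxt["href"] expression with next_url(url, nxt["href"] if nxt else None).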
Example code / lecture materials
All lecture materials and example code are openly available on GitHub.