Intermediate
re module · Patterns · Groups · Flags · findall · sub
Week 9 — Regular Expressions
Unlock powerful text-matching with regular expressions. Learn pattern syntax, groups, lookahead/lookbehind, and use re.findall, re.sub, and re.fullmatch for text extraction and cleaning.
Duration: ⏱ 2.5 hours
Level: 📊 Intermediate
Prerequisite: 🎯 Basic Week 8
OUTCOME
Build a log parser that extracts IPs, timestamps, and error messages using regex
What you'll learn
1. Write regex patterns for common scenarios (email, phone, date)
2. Use groups () and named groups (?P<name>...)
3. Apply re.match, re.search, re.findall, re.sub
4. Use regex flags: re.IGNORECASE, re.MULTILINE
5. Understand greedy vs non-greedy quantifiers
1. Pattern Basics
| Pattern | Matches |
|---|---|
| . | Any character (except newline) |
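| \d | Digit ([0-9]; also other Unicode digits unless re.ASCII is set) |
| \w | Word char ([a-zA-Z0-9_]; also Unicode letters and digits unless re.ASCII is set) |
| \s | Whitespace |
| ^ | Start of string (and of each line with re.MULTILINE) |
| $ | End of string (and of each line with re.MULTILINE) |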
| * | 0 or more |
| + | 1 or more |
| ? | 0 or 1 |
| {n,m} | n to m times |
| [abc] | Any of a, b, c |
| (abc) | Group |
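The rows above can be tried out directly. The following is a minimal sketch, not part of the original lecture code: the `sample` string is made up for illustration, and the last lines also touch the two flags named in the learning goals.

```python
import re

# Hypothetical sample text, invented for this illustration
sample = "Order 42: 3 apples, 15 pears\nship to: Seoul"

digits = re.findall(r"\d+", sample)            # \d+   one or more digits
two_or_three = re.findall(r"\d{2,3}", sample)  # {n,m} two to three digits
vowels = re.findall(r"[aeiou]", "pears")       # [..]  any one listed character
optional_s = re.findall(r"apples?", "1 apple, 2 apples")  # ?  zero or one "s"
pairs = re.findall(r"(\d+) (\w+)", sample)     # (..)  groups come back as tuples

# By default ^ anchors only at the very start of the string;
# re.MULTILINE also lets it match right after each newline.
first_word = re.findall(r"^\w+", sample)
line_starts = re.findall(r"^\w+", sample, flags=re.MULTILINE)

# re.IGNORECASE matches letters regardless of case.
any_case = re.findall(r"order", sample, flags=re.IGNORECASE)

print(digits)        # ['42', '3', '15']
print(two_or_three)  # ['42', '15']
print(vowels)        # ['e', 'a']
print(optional_s)    # ['apple', 'apples']
print(pairs)         # [('3', 'apples'), ('15', 'pears')]
print(first_word)    # ['Order']
print(line_starts)   # ['Order', 'ship']
print(any_case)      # ['Order']
```

Note that when a pattern contains groups, re.findall returns the group contents rather than the whole match, which is why `pairs` is a list of tuples.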
2. re Functions
python
import re
text = "Call us at 010-1234-5678 or 02-987-6543"
# findall: return all matches
phones = re.findall(r"\d{2,3}-\d{3,4}-\d{4}", text)
print(phones) # ['010-1234-5678', '02-987-6543']
# sub: replace
masked = re.sub(r"\d{4}$", "****", phones[0])
print(masked) # 010-1234-****
# Named groups
m = re.search(r"(?P<area>\d{2,3})-(?P<number>\d+-\d+)", text)
print(m.group("area"), m.group("number"))
3. Common Patterns
python
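import re

# Simplified patterns for teaching purposes; for example, IP_ADDR does not
# check that each octet is at most 255.
EMAIL = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
URL = r"https?://[\w./%-]+"
DATE = r"\d{4}-\d{2}-\d{2}"
IP_ADDR = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

# re.fullmatch succeeds only if the entire sample matches the pattern
for pattern, sample in [
    (EMAIL, "user@example.com"),
    (URL, "https://python.org/docs"),
    (DATE, "2024-05-15"),
    (IP_ADDR, "192.168.0.1"),
]:
    m = re.fullmatch(pattern, sample)
    print(f"{'OK' if m else 'FAIL'}: {sample}")
4. Common Mistakes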
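- Forgetting to use raw strings (r'...'). Without the r prefix, Python processes backslash escapes first, so "\b" becomes a backspace character instead of a word boundary.
- Greedy quantifiers (*, +) match as much as possible. Use *? or +? for non-greedy matching.
- re.match() only matches at the start of the string; re.search() scans the whole string and returns the first match.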
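Each of the three pitfalls can be reproduced in a few lines. This is a small sketch with made-up input strings (`html`, `text`, and the "cat" sentence are illustrative, not from the lecture materials):

```python
import re

# Hypothetical inputs, invented for this illustration
html = "<b>bold</b> and <i>italic</i>"
text = "error code 404"

# Greedy vs non-greedy: .+ runs to the last ">", .+? stops at the first one
greedy = re.findall(r"<.+>", html)
lazy = re.findall(r"<.+?>", html)

# match vs search: match only tries position 0, search scans forward
at_start = re.match(r"\d+", text)
anywhere = re.search(r"\d+", text)

# Raw strings: in a plain string "\b" is the backspace character (\x08),
# so the pattern no longer contains a word boundary and finds nothing here
raw_hits = re.findall(r"\bcat\b", "cat concat cat")
plain_hits = re.findall("\bcat\b", "cat concat cat")

print(greedy)            # ['<b>bold</b> and <i>italic</i>']
print(lazy)              # ['<b>', '</b>', '<i>', '</i>']
print(at_start)          # None
print(anywhere.group())  # 404
print(raw_hits)          # ['cat', 'cat']
print(plain_hits)        # []
```

The raw-string case is the easiest to miss because nothing fails loudly: the pattern is still valid, it just matches something different.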
💻 Examples
Run these examples and check the output yourself.
01_log_parser.py: Extract structured data from an Apache log line
CODE
import re

LOG = '127.0.0.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'

# Each named group captures one field of the log line
pattern = re.compile(
    r'(?P<ip>[\d.]+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+"'
    r' (?P<status>\d{3}) (?P<size>\d+)'
)

m = pattern.match(LOG)
if m:
    # groupdict() maps each group name to the text it captured
    for key, val in m.groupdict().items():
        print(f" {key:<8}: {val}")
▶ Output
 ip      : 127.0.0.1
 user    : frank
 time    : 10/Oct/2024:13:55:36 -0700
 method  : GET
 path    : /index.html
 status  : 200
 size    : 2326
📝 Exercises
Try them yourself first, then open the solution to compare.
Exercise 1
Email Extractor & Validator
Goal: Extract all email addresses from a block of text and validate each one.
Requirements
- Use re.findall with EMAIL pattern
- Validate: local part ≤ 64 chars, domain ≤ 255 chars
- Print valid/invalid for each found address
SOLUTION
import re

EMAIL = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "Contact admin@example.com or info@co.uk. Invalid: @bad, x@"

emails = re.findall(EMAIL, text)
for e in emails:
    local, domain = e.split("@", 1)
    # Length limits from the exercise requirements
    if len(local) <= 64 and len(domain) <= 255:
        print(f" VALID: {e}")
    else:
        print(f" INVALID: {e}")
▶ Output
VALID: admin@example.com
VALID: info@co.uk
Example code / lecture materials
All lecture materials and example code are openly available on GitHub.