re module · Patterns · Groups · Flags · findall · sub

Week 9 — Regular Expressions

Unlock powerful text-matching with regular expressions. Learn pattern syntax, groups, lookahead/lookbehind, and use re.findall, re.sub, and re.fullmatch for text extraction and cleaning.

regex · re · pattern · groups · findall · sub
Duration: 2.5 hours
Level: 📊 Intermediate
Prerequisite: 🎯 Basic Week 8
Outcome: Build a log parser that extracts IPs, timestamps, and error messages using regex

What you'll learn

  1. Write regex patterns for common scenarios (email, phone, date)
  2. Use groups () and named groups (?P<name>...)
  3. Apply re.match, re.search, re.findall, re.sub
  4. Use regex flags: re.IGNORECASE, re.MULTILINE
  5. Understand greedy vs non-greedy quantifiers

1. Pattern Basics

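Pattern   Matches
.         Any character except newline
\d        Digit ([0-9]; also other Unicode digits unless re.ASCII is set)
\w        Word character ([a-zA-Z0-9_]; also Unicode letters and digits unless re.ASCII is set)
\s        Whitespace (space, tab, newline, ...)
^         Start of string (start of each line with re.MULTILINE)
$         End of string (end of each line with re.MULTILINE)
*         0 or more of the preceding item
+         1 or more of the preceding item
?         0 or 1 of the preceding item
{n,m}     n to m repetitions of the preceding item
[abc]     Any one of a, b, c
(abc)     Capturing group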
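
To make the table concrete, the short sketch below applies a few of these patterns to one string. The sample text and variable names are invented for illustration.

```python
import re

sample = "Order 66 shipped on 2024-05-15 to box_7"

# \d+ : runs of one or more digits
digits = re.findall(r"\d+", sample)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)

# {n} : exactly n repetitions, here used to describe a date shape
date = re.search(r"\d{4}-\d{2}-\d{2}", sample)

# ^ : anchors the match at the start of the string
starts_with_order = re.search(r"^Order", sample)

print(digits)        # ['66', '2024', '05', '15', '7']
print(words)
print(date.group())  # 2024-05-15
```

Note that \d+ splits the date into three separate numbers because the hyphen is not a digit; the more specific date pattern keeps it together.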

2. re Functions

python
import re

text = "Call us at 010-1234-5678 or 02-987-6543"

# findall: return all matches
phones = re.findall(r"\d{2,3}-\d{3,4}-\d{4}", text)
print(phones)   # ['010-1234-5678', '02-987-6543']

# sub: replace
masked = re.sub(r"\d{4}$", "****", phones[0])
print(masked)   # 010-1234-****

# Named groups
m = re.search(r"(?P<area>\d{2,3})-(?P<number>\d+-\d+)", text)
print(m.group("area"), m.group("number"))   # 010 1234-5678
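
The learning goals also list the flags re.IGNORECASE and re.MULTILINE. As a rough sketch of how they change matching, the example below uses an invented three-line string; the variable names are illustrative only.

```python
import re

log = "error: disk full\nWARNING: low memory\nError: timeout"

# re.IGNORECASE matches regardless of letter case.
any_case = re.findall(r"error", log, flags=re.IGNORECASE)

# Without re.MULTILINE, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", log)

# With re.MULTILINE, ^ also matches right after each newline.
each_line = re.findall(r"^\w+", log, flags=re.MULTILINE)

# Flags can be combined with the | operator.
# With one capturing group, findall returns just the group contents.
error_messages = re.findall(
    r"^error: (.*)$", log, flags=re.IGNORECASE | re.MULTILINE
)

print(any_case)        # ['error', 'Error']
print(first_only)      # ['error']
print(each_line)       # ['error', 'WARNING', 'Error']
print(error_messages)  # ['disk full', 'timeout']
```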

3. Common Patterns

python
import re

EMAIL   = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
URL     = r"https?://[\w./%-]+"
DATE    = r"\d{4}-\d{2}-\d{2}"
IP_ADDR = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

for pattern, sample in [
    (EMAIL, "user@example.com"),
    (URL, "https://python.org/docs"),
    (DATE, "2024-05-15"),
    (IP_ADDR, "192.168.0.1"),
]:
    m = re.fullmatch(pattern, sample)
    print(f"{'OK' if m else 'FAIL'} — {sample}")
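
These patterns check shape, not meaning. IP_ADDR, for instance, accepts any one-to-three-digit octet, so 999.1.1.1 would also match. One possible way to tighten this is to follow the regex with a numeric range check; is_valid_ipv4 below is a hypothetical helper name, not part of the re module.

```python
import re

IP_ADDR = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

def is_valid_ipv4(candidate: str) -> bool:
    """Return True if candidate has IPv4 shape and every octet is 0-255."""
    # Step 1: the regex confirms the overall shape (four dotted numbers).
    if not re.fullmatch(IP_ADDR, candidate):
        return False
    # Step 2: plain Python checks the numeric range of each octet.
    return all(0 <= int(octet) <= 255 for octet in candidate.split("."))

results = {ip: is_valid_ipv4(ip) for ip in ["192.168.0.1", "999.1.1.1", "10.0.0"]}
print(results)  # {'192.168.0.1': True, '999.1.1.1': False, '10.0.0': False}
```

Splitting the work this way keeps the pattern readable; encoding the 0-255 range in the regex itself is possible but much harder to read.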

4. Common Mistakes

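  1. Forgetting raw strings (r'...'). In a normal string Python processes backslash escapes first, so "\b" becomes a backspace character instead of a word boundary.
  2. Greedy quantifiers (*, +) match as much as possible. Use *? or +? to match as little as possible, for example when extracting one tag at a time.
  3. re.match() only matches at the start of the string; re.search() scans the whole string for the first match, and re.fullmatch() requires the entire string to match.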
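
The three mistakes above can be reproduced in a few lines. This is a minimal sketch with invented sample strings.

```python
import re

# 1. Raw strings: without the r prefix, "\b" is a backspace character,
#    so the pattern looks for literal backspaces and finds nothing.
text = "cat concat"
without_raw = re.findall("\bcat\b", text)
with_raw = re.findall(r"\bcat\b", text)
print(without_raw)  # []
print(with_raw)     # ['cat']

# 2. Greedy vs non-greedy: .* runs to the last </b>, .*? stops at the first.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)
print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']

# 3. match vs search: match only tries position 0.
line = "id: 42"
at_start = re.match(r"\d+", line)
anywhere = re.search(r"\d+", line)
print(at_start)          # None
print(anywhere.group())  # 42
```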

💻 Examples

Run these examples and check the output yourself.

01_log_parser.py: Extract structured data from an Apache log line
CODE
import re

LOG = '127.0.0.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>[\d.]+) \S+ (?P<user>\S+) '  
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+"'
    r' (?P<status>\d{3}) (?P<size>\d+)'
)

m = pattern.match(LOG)
if m:
    for key, val in m.groupdict().items():
        print(f"  {key:<8}: {val}")
▶ Output
  ip      : 127.0.0.1
  user    : frank
  time    : 10/Oct/2024:13:55:36 -0700
  method  : GET
  path    : /index.html
  status  : 200
  size    : 2326
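
A real log file has many lines, and some of them may not match the pattern. As one possible extension, the sketch below reuses the same pattern on several lines, skips the ones that do not match, and collects the error responses. The extra log lines and the names LOG_PATTERN, LINES, records, and errors are invented for illustration.

```python
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<ip>[\d.]+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+"'
    r' (?P<status>\d{3}) (?P<size>\d+)'
)

# Invented sample lines, including one that is deliberately malformed.
LINES = [
    '127.0.0.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.5 - - [10/Oct/2024:13:56:01 -0700] "GET /missing HTTP/1.1" 404 512',
    'this line is malformed',
    '10.0.0.5 - - [10/Oct/2024:13:57:12 -0700] "POST /login HTTP/1.1" 500 128',
]

records = []
skipped = 0
for line in LINES:
    m = LOG_PATTERN.match(line)
    if m is None:
        skipped += 1  # count lines the pattern cannot parse
        continue
    records.append(m.groupdict())

# Treat HTTP status codes of 400 and above as errors.
errors = [r for r in records if int(r["status"]) >= 400]
status_counts = Counter(r["status"] for r in records)

for r in errors:
    print(f'{r["ip"]} {r["time"]} {r["status"]} {r["path"]}')
print(f"parsed={len(records)} skipped={skipped}")  # parsed=3 skipped=1
```

Checking for None before calling groupdict() matters here, because match() returns None for any line that does not fit the pattern.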

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Email Extractor & Validator

Goal: Extract all email addresses from a block of text and validate each one.

Requirements
  • Use re.findall with the EMAIL pattern from section 3
  • Validate: local part ≤ 64 chars, domain ≤ 255 chars
  • Print valid/invalid for each found address
SOLUTION
import re

EMAIL = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

text = "Contact admin@example.com or info@co.uk. Invalid: @bad, x@"
emails = re.findall(EMAIL, text)
for e in emails:
    local, domain = e.split("@", 1)
    if len(local) <= 64 and len(domain) <= 255:
        print(f"  VALID:   {e}")
    else:
        print(f"  INVALID: {e}")
▶ Output
  VALID:   admin@example.com
  VALID:   info@co.uk
Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
