Week 9 — Regular Expressions

Unlock powerful text-matching with regular expressions. Learn pattern syntax, groups, lookahead/lookbehind, and use re.findall, re.sub, and re.fullmatch for text extraction and cleaning.

regex · re · pattern · groups · findall · sub

Duration

⏱ 2.5 hours

Level

📊 Intermediate

Prerequisite

🎯 Basic Week 8

OUTCOME

Build a log parser that extracts IPs, timestamps, and error messages using regex

What you'll learn

1. Write regex patterns for common scenarios (email, phone, date)
2. Use groups () and named groups (?P<name>...)
3. Apply re.match, re.search, re.findall, re.sub
4. Use regex flags: re.IGNORECASE, re.MULTILINE
5. Understand greedy vs non-greedy quantifiers

1. Pattern Basics

Pattern	Matches
.	Any character (except newline)
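\d	Digit ([0-9]; on str patterns also other Unicode digits unless re.ASCII is set)
\w	Word char ([a-zA-Z0-9_]; on str patterns also Unicode letters and digits unless re.ASCII is set)
\s	Whitespace (space, tab, newline, ...)
^	Start of string (or of each line with re.MULTILINE)
$	End of string (or of each line with re.MULTILINE)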
*	0 or more
+	1 or more
?	0 or 1
{n,m}	n to m times
[abc]	Any of a, b, c
(abc)	Group
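
The tokens in the table can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration, and the comments show what each call returns for these inputs.

```python
import re

sample = "Order 66: 3 apples, 12 pears_ok\nline two"

# \d+ : runs of one or more digits
digits = re.findall(r"\d+", sample)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)

# . does not match a newline by default, so .+ yields one match per line
lines = re.findall(r".+", sample)

# ^ anchors the match at the start of the string
starts_with_order = re.search(r"^Order", sample) is not None

# {n,m} : between n and m repetitions
two_to_three = re.findall(r"\d{2,3}", "7 42 123 4567")

# ? : the preceding item is optional
optional_s = re.findall(r"apples?", "apple apples")

# [abc] : any single character from the set
vowels = re.findall(r"[aeiou]", "regex")

# (...) : a group captures part of the match
apple_count = re.search(r"(\d+) apples", sample).group(1)

print(digits)             # ['66', '3', '12']
print(lines)              # ['Order 66: 3 apples, 12 pears_ok', 'line two']
print(starts_with_order)  # True
print(two_to_three)       # ['42', '123', '456']
print(optional_s)         # ['apple', 'apples']
print(vowels)             # ['e', 'e']
print(apple_count)        # 3
```

Note that `\d{2,3}` picks `456` out of `4567`: the pattern is not anchored, so it matches inside a longer run of digits.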

2. re Functions

python

import re

text = "Call us at 010-1234-5678 or 02-987-6543"

# findall: return all matches
phones = re.findall(r"\d{2,3}-\d{3,4}-\d{4}", text)
print(phones)   # ['010-1234-5678', '02-987-6543']

# sub: replace
masked = re.sub(r"\d{4}$", "****", phones[0])
print(masked)   # 010-1234-****

# Named groups
m = re.search(r"(?P<area>\d{2,3})-(?P<number>\d+-\d+)", text)
print(m.group("area"), m.group("number"))   # 010 1234-5678
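
The learning goals also list re.match, re.search, and the flags re.IGNORECASE and re.MULTILINE, which the block above does not exercise. The sketch below compares them side by side, together with re.fullmatch; the log-like text is invented for illustration.

```python
import re

text = "ERROR disk full\nwarning low memory\nerror timeout"

# re.match only tries at the very start of the string
at_start = re.match(r"ERROR", text)
not_at_start = re.match(r"warning", text)

# re.search scans forward and returns the first match anywhere
found = re.search(r"warning", text)

# re.fullmatch succeeds only if the entire string matches
whole = re.fullmatch(r"\d{4}", "2024")
partial = re.fullmatch(r"\d{4}", "2024-05")

# re.IGNORECASE: letters match regardless of case
any_case = re.findall(r"error", text, flags=re.IGNORECASE)

# re.MULTILINE: ^ and $ also match at the start and end of each line
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)
first_word_only = re.findall(r"^\w+", text)

# Flags combine with |
messages = re.findall(r"^error (.+)$", text, flags=re.IGNORECASE | re.MULTILINE)

print(at_start is not None, not_at_start)  # True None
print(found.group())                       # warning
print(whole is not None, partial)          # True None
print(any_case)                            # ['ERROR', 'error']
print(first_words)                         # ['ERROR', 'warning', 'error']
print(first_word_only)                     # ['ERROR']
print(messages)                            # ['disk full', 'timeout']
```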

3. Common Patterns

python

import re

EMAIL   = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
URL     = r"https?://[\w./%-]+"
DATE    = r"\d{4}-\d{2}-\d{2}"
IP_ADDR = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"

for pattern, sample in [
    (EMAIL, "user@example.com"),
    (URL, "https://python.org/docs"),
    (DATE, "2024-05-15"),
    (IP_ADDR, "192.168.0.1"),
]:
    m = re.fullmatch(pattern, sample)
    print(f"{'OK' if m else 'FAIL'} — {sample}")
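
These patterns check shape, not meaning: IP_ADDR accepts any group of one to three digits, so 999.1.1.1 passes. One possible way to tighten it is to pair the regex with a numeric range check. This is a sketch only; the helper name is_valid_ipv4 and the sample line are hypothetical and not part of the lecture code.

```python
import re

IP_ADDR = r"\b(?:\d{1,3}\.){3}\d{1,3}\b"


def is_valid_ipv4(candidate: str) -> bool:
    """Hypothetical helper: shape check via regex, then range check per octet."""
    if re.fullmatch(IP_ADDR, candidate) is None:
        return False
    return all(0 <= int(octet) <= 255 for octet in candidate.split("."))


# Made-up sample line for illustration
line = "Hosts: 192.168.0.1, 10.0.0.256, 999.1.1.1"

candidates = re.findall(IP_ADDR, line)
valid = [ip for ip in candidates if is_valid_ipv4(ip)]

print(candidates)  # ['192.168.0.1', '10.0.0.256', '999.1.1.1']
print(valid)       # ['192.168.0.1']
```

Keeping the regex simple and doing the range check in Python tends to be easier to read than encoding 0 to 255 in the pattern itself.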

4. Common Mistakes

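Forgetting raw strings (r'...'): without the r prefix, Python interprets backslash sequences such as \b (backspace) before the regex engine ever sees them.
Assuming quantifiers stop at the first possible match: * and + are greedy and match as much as possible. Use *? or +? for non-greedy matching.
Confusing re.match() with re.search(): re.match() only matches at the start of the string; re.search() scans the whole string.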
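
Each of the three mistakes above can be reproduced in a few lines. The following sketch uses made-up sample strings; the comments show the results for these inputs.

```python
import re

# 1. Raw strings: in a normal string "\b" is a backspace character,
#    so the pattern no longer means "word boundary".
with_raw = re.findall(r"\bcat\b", "cat concat cat")
without_raw = re.findall("\bcat\b", "cat concat cat")

# 2. Greedy vs non-greedy: .+ grabs as much as it can, .+? as little.
html = "<b>bold</b> and <i>italic</i>"
greedy = re.findall(r"<.+>", html)
non_greedy = re.findall(r"<.+?>", html)

# 3. match vs search: match only tries at position 0.
matched = re.match(r"\d+", "id 42")
searched = re.search(r"\d+", "id 42")

print(with_raw)          # ['cat', 'cat']
print(without_raw)       # []
print(greedy)            # ['<b>bold</b> and <i>italic</i>']
print(non_greedy)        # ['<b>', '</b>', '<i>', '</i>']
print(matched)           # None
print(searched.group())  # 42
```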

💻 Examples

Run these examples and check the output yourself.

01_log_parser.py: Extract structured data from an Apache log line

CODE

import re

LOG = '127.0.0.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326'

pattern = re.compile(
    r'(?P<ip>[\d.]+) \S+ (?P<user>\S+) '  
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+"'
    r' (?P<status>\d{3}) (?P<size>\d+)'
)

m = pattern.match(LOG)
if m:
    for key, val in m.groupdict().items():
        print(f"  {key:<8}: {val}")

▶ Output

  ip      : 127.0.0.1
  user    : frank
  time    : 10/Oct/2024:13:55:36 -0700
  method  : GET
  path    : /index.html
  status  : 200
  size    : 2326
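
A real log has many lines, and some of them will not fit the pattern. As a possible next step toward the week's outcome, the sketch below applies the same pattern line by line, skips lines that do not match, and collects the requests that ended in a 4xx or 5xx status. The sample lines and the names LOG_LINE and SAMPLE_LOG are invented for illustration.

```python
import re
from collections import Counter

# Same pattern as in 01_log_parser.py
LOG_LINE = re.compile(
    r'(?P<ip>[\d.]+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+"'
    r' (?P<status>\d{3}) (?P<size>\d+)'
)

# Made-up sample lines, including one malformed line
SAMPLE_LOG = """\
127.0.0.1 - frank [10/Oct/2024:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
10.0.0.5 - - [10/Oct/2024:13:55:40 -0700] "GET /missing HTTP/1.1" 404 512
this line is malformed
10.0.0.5 - alice [10/Oct/2024:13:56:02 -0700] "POST /login HTTP/1.1" 500 128
"""

records = []
skipped = 0
for line in SAMPLE_LOG.splitlines():
    m = LOG_LINE.match(line)
    if m is None:
        skipped += 1          # count lines the pattern cannot parse
        continue
    records.append(m.groupdict())

# Tally status codes and pick out client/server errors
status_counts = Counter(r["status"] for r in records)
errors = [
    (r["ip"], r["time"], r["path"])
    for r in records
    if r["status"].startswith(("4", "5"))
]

print(len(records), skipped)  # 3 1
print(dict(status_counts))    # {'200': 1, '404': 1, '500': 1}
for ip, time, path in errors:
    print(f"  {ip:<10} {time}  {path}")
```

Checking for `None` before calling `groupdict()` matters here: calling it on a failed match would raise `AttributeError`.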

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Email Extractor & Validator

Goal: Extract all email addresses from a block of text and validate each one.

Requirements

Use re.findall with the EMAIL pattern from section 3
Validate each address: local part ≤ 64 chars, domain ≤ 255 chars
Print VALID or INVALID for each address found


SOLUTION

import re

EMAIL = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

text = "Contact admin@example.com or info@co.uk. Invalid: @bad, x@"
emails = re.findall(EMAIL, text)
for e in emails:
    local, domain = e.split("@", 1)
    if len(local) <= 64 and len(domain) <= 255:
        print(f"  VALID:   {e}")
    else:
        print(f"  INVALID: {e}")

▶ Output

  VALID:   admin@example.com
  VALID:   info@co.uk

Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
