Intermediate
CLI · File I/O · OOP · Regex · Statistics

Week 10 — Intermediate Capstone: Text Analysis Tool

Combine the intermediate track skills to build a command-line text analysis tool that reads files, tokenizes text, computes statistics, and produces a formatted report.

capstone · project · CLI · text analysis · OOP · regex
Duration: 3 hours
Level: Intermediate
Prerequisite: Intermediate Weeks 1–9
Outcome: A CLI tool that analyses any text file and prints a statistics report

What you'll learn

  1. Integrate file I/O, regex, OOP, and collections in one project
  2. Design a class hierarchy for the analysis pipeline
  3. Handle command-line arguments with argparse
  4. Produce formatted tabular output
  5. Write unit tests for the tokenizer and statistics functions

1. Project Specification

bash
python analyze.py [OPTIONS] FILE

Options:
  --top N       Show top N words (default 10)
  --format FMT  Output format: text|csv|json (default text)
  --no-stop     Exclude common stop words
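
The usage line and options above map fairly directly onto argparse. The sketch below is one possible parser, not the reference solution: the function name `build_parser` and the attribute name `exclude_stop_words` are assumptions made for illustration, and the argument list is passed explicitly so the snippet runs without a real command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the usage line and options above."""
    parser = argparse.ArgumentParser(
        prog="analyze.py",
        description="Analyse a text file and print a statistics report.",
    )
    parser.add_argument("file", metavar="FILE", help="path to the text file")
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="show top N words (default: 10)")
    parser.add_argument("--format", choices=["text", "csv", "json"],
                        default="text", metavar="FMT",
                        help="output format: text|csv|json (default: text)")
    # Assumption: store the flag under a descriptive name instead of "no_stop".
    parser.add_argument("--no-stop", dest="exclude_stop_words",
                        action="store_true", help="exclude common stop words")
    return parser


# Parse an explicit argument list; in main() you would call parse_args() with no arguments.
args = build_parser().parse_args(["hamlet.txt", "--top", "5"])
print(args.file, args.top, args.format, args.exclude_stop_words)
```

Using `choices` lets argparse reject an unknown format with a usage error before any analysis starts.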

The report covers these metrics:
  • Total words, unique words, average word length
  • Top-N word frequency (with ASCII bar chart)
  • Sentence count, average sentence length
  • Most common bigrams
  • Readability score (Flesch Reading Ease approximation)
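
Two of these metrics are less obvious than the rest, so here is a minimal sketch of each. Flesch Reading Ease is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). Counting syllables exactly is hard, so the helper below uses a crude vowel-run heuristic, which is an assumption of this sketch and one reason the result is only an approximation. The function names are hypothetical.

```python
import re
from collections import Counter


def count_syllables(word: str) -> int:
    """Crude estimate (assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(words: list[str], sentence_count: int) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    if not words or sentence_count == 0:
        return 0.0  # assumption: report 0.0 for empty input instead of raising
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentence_count
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def top_bigrams(words: list[str], n: int = 5) -> list[tuple[tuple[str, str], int]]:
    """Count adjacent word pairs and return the n most common."""
    return Counter(zip(words, words[1:])).most_common(n)


words = "the cat sat on the mat the cat ran".split()
print(top_bigrams(words, 1))  # [(('the', 'cat'), 2)]
print(round(flesch_reading_ease(words, 2), 1))
```

Pairing `words` with `words[1:]` through `zip` yields each adjacent pair once, so bigram counting needs no index arithmetic.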

2. Architecture

  • Tokenizer class — regex-based word/sentence splitter
  • TextStats dataclass — holds all computed metrics
  • Analyzer class — orchestrates tokenization and metric computation
  • Reporter class — formats and prints TextStats (text/csv/json)
  • main() — argparse + Analyzer + Reporter pipeline
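
To make the division of responsibilities concrete, here is a skeleton of the four components. Treat it as a starting point under stated assumptions rather than the expected solution: the regex patterns, method names, field names, and the CSV layout are all illustrative choices, and only a subset of the metrics is included.

```python
import csv
import io
import json
import re
from collections import Counter
from dataclasses import asdict, dataclass, field


class Tokenizer:
    """Regex-based word and sentence splitter."""

    # Assumption: words are ASCII letters, optionally joined by apostrophes.
    WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
    # Assumption: a run of '.', '!' or '?' ends a sentence.
    SENTENCE_RE = re.compile(r"[.!?]+")

    def words(self, text: str) -> list[str]:
        return [w.lower() for w in self.WORD_RE.findall(text)]

    def sentences(self, text: str) -> list[str]:
        return [s.strip() for s in self.SENTENCE_RE.split(text) if s.strip()]


@dataclass
class TextStats:
    """Holds computed metrics (subset shown; add bigrams and readability)."""
    total_words: int = 0
    unique_words: int = 0
    avg_word_length: float = 0.0
    sentence_count: int = 0
    avg_sentence_length: float = 0.0
    top_words: list[tuple[str, int]] = field(default_factory=list)


class Analyzer:
    """Runs the tokenizer and fills in a TextStats instance."""

    def __init__(self, tokenizer: Tokenizer, top_n: int = 10,
                 stop_words: frozenset[str] = frozenset()):
        self.tokenizer = tokenizer
        self.top_n = top_n
        self.stop_words = stop_words

    def analyze(self, text: str) -> TextStats:
        words = [w for w in self.tokenizer.words(text) if w not in self.stop_words]
        sentences = self.tokenizer.sentences(text)
        counts = Counter(words)
        return TextStats(
            total_words=len(words),
            unique_words=len(counts),
            avg_word_length=sum(map(len, words)) / len(words) if words else 0.0,
            sentence_count=len(sentences),
            avg_sentence_length=len(words) / len(sentences) if sentences else 0.0,
            top_words=counts.most_common(self.top_n),
        )


class Reporter:
    """Formats TextStats as text, csv, or json."""

    def render(self, stats: TextStats, fmt: str = "text") -> str:
        data = asdict(stats)
        if fmt == "json":
            return json.dumps(data, indent=2)
        if fmt == "csv":
            # Assumption: one "metric,value" row per scalar metric.
            buffer = io.StringIO()
            writer = csv.writer(buffer)
            writer.writerow(["metric", "value"])
            for key, value in data.items():
                if key != "top_words":
                    writer.writerow([key, value])
            return buffer.getvalue()
        return "\n".join(f"{key}: {value}" for key, value in data.items())


stats = Analyzer(Tokenizer(), top_n=2).analyze("The cat sat. The cat ran!")
print(Reporter().render(stats, "text"))
```

Keeping `Reporter` ignorant of how metrics are computed means a new output format only touches one class, and `Analyzer` can be unit tested without capturing printed output.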

3. Sample Output

text
$ python analyze.py hamlet.txt --top 5

File: hamlet.txt  (5447 words, 1234 unique)

Top 5 words:
  the     ████████████████  250
  and     ██████████████    215
  to      ████████████      180
  of      ██████████        150
  a       ████████          130

Sentences: 423  (avg 12.9 words)
Readability: 68.4 (Standard)
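
The bars in the sample are consistent with scaling each count against the largest one, so that the top word fills 16 blocks. The helper below is one way to produce that layout; the name `render_bars`, the default widths, and the use of `round` are assumptions inferred from the sample, not a confirmed specification.

```python
def render_bars(top_words: list[tuple[str, int]], width: int = 16,
                label_width: int = 6) -> str:
    """Draw one bar per word, scaled so the largest count fills `width` blocks."""
    if not top_words:
        return ""
    max_count = max(count for _, count in top_words)
    lines = []
    for word, count in top_words:
        bar = "█" * round(count / max_count * width)
        # Pad label and bar to fixed widths so the counts line up in a column.
        lines.append(f"  {word:<{label_width}}  {bar:<{width}}  {count}")
    return "\n".join(lines)


sample = [("the", 250), ("and", 215), ("to", 180), ("of", 150), ("a", 130)]
print(render_bars(sample))
```

Words longer than `label_width` will push their row out of alignment; computing the label width from the longest word is a simple refinement.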

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Build the analyzer

Goal: Implement the text analysis tool with all features above.

Requirements
  • All metrics computed correctly
  • argparse with all flags
  • text/csv/json output formats
  • Unit tests for Tokenizer
Grading
  • Metrics correct: 40%
  • argparse flags work: 20%
  • Output formats: 20%
  • Tests pass: 20%
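
For the Tokenizer tests, a handful of small unittest cases is usually enough to pin down the behaviour. The sketch below includes a stand-in `Tokenizer` so that it runs on its own; its `words` and `sentences` methods are an assumed interface, and in your project you would import your own class instead.

```python
import re
import unittest


class Tokenizer:
    """Stand-in with an assumed interface; replace with an import of your own class."""

    def words(self, text: str) -> list[str]:
        return [w.lower() for w in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)]

    def sentences(self, text: str) -> list[str]:
        return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


class TokenizerTests(unittest.TestCase):
    def setUp(self):
        self.tokenizer = Tokenizer()

    def test_words_are_lowercased_and_punctuation_dropped(self):
        self.assertEqual(self.tokenizer.words("Hello, World!"), ["hello", "world"])

    def test_apostrophes_stay_inside_words(self):
        self.assertEqual(self.tokenizer.words("Don't stop"), ["don't", "stop"])

    def test_empty_text_yields_no_tokens(self):
        self.assertEqual(self.tokenizer.words(""), [])
        self.assertEqual(self.tokenizer.sentences(""), [])

    def test_sentences_split_on_terminal_punctuation(self):
        self.assertEqual(
            self.tokenizer.sentences("To be. Or not? Yes!"),
            ["To be", "Or not", "Yes"],
        )
```

Saved as, say, `test_tokenizer.py` (a hypothetical filename), the cases can be run with `python -m unittest`.
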
Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
