Intermediate
CLI · File I/O · OOP · Regex · Statistics
Week 10 — Intermediate Capstone: Text Analysis Tool
Combine the intermediate track skills to build a command-line text analysis tool that reads files, tokenizes text, computes statistics, and produces a formatted report.
capstone · project · CLI · text analysis · OOP · regex
Duration: ⏱ 3 hours
Level: 📊 Intermediate
Prerequisite: 🎯 Intermediate Weeks 1–9
OUTCOME
A CLI tool that analyses any text file and prints a statistics report
What you'll learn
- Integrate file I/O, regex, OOP, and collections in one project
- Design a class hierarchy for the analysis pipeline
- Handle command-line arguments with argparse
- Produce formatted tabular output
- Write unit tests for the tokenizer and statistics functions
1. Project Specification
bash
python analyze.py [OPTIONS] FILE
Options:
  --top N       Show top N words (default 10)
  --format FMT  Output format: text|csv|json (default text)
  --no-stop     Exclude common stop words
The report includes:
- Total words, unique words, average word length
- Top-N word frequency (with ASCII bar chart)
- Sentence count, average sentence length
- Most common bigrams
- Readability score (Flesch Reading Ease approximation)
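The last two metrics can be sketched as follows. The Flesch Reading Ease formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter below is only a vowel-group heuristic, which is why the result is an approximation. The function names `count_syllables`, `flesch_reading_ease`, and `top_bigrams` are hypothetical and not part of the course materials.

```python
import re
from collections import Counter


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of vowels (a heuristic, not a dictionary lookup)."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Assume a trailing 'e' is silent ("make" -> 1), except in "-le" endings ("table" -> 2).
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(words: list[str], sentence_count: int) -> float:
    """Flesch Reading Ease from a word list and a sentence count."""
    if not words or sentence_count == 0:
        return 0.0  # assumption: report 0.0 for empty input instead of raising
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / sentence_count)
        - 84.6 * (syllables / len(words))
    )


def top_bigrams(words: list[str], n: int = 5):
    """Most common adjacent word pairs, as ((first, second), count) tuples."""
    return Counter(zip(words, words[1:])).most_common(n)
```

Pairing `words` with `words[1:]` via `zip` yields every adjacent pair without index arithmetic, and `Counter.most_common` handles the ranking.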
2. Architecture
- Tokenizer class — regex-based word/sentence splitter
- TextStats dataclass — holds all computed metrics
- Analyzer class — orchestrates tokenization and metric computation
- Reporter class — formats and prints TextStats (text/csv/json)
- main() — argparse + Analyzer + Reporter pipeline
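One possible skeleton for these five pieces is sketched below. The regexes, the tiny `STOP_WORDS` set, the field names on `TextStats`, and the `build_parser` helper are all assumptions made for illustration, and the Reporter only handles the text format here. Treat it as a starting point, not the reference solution.

```python
import argparse
import re
from collections import Counter
from dataclasses import dataclass, field

# Tiny illustrative stop-word set (assumption); a real list would be much longer.
STOP_WORDS = {"the", "and", "to", "of", "a", "in", "is", "it"}


class Tokenizer:
    """Regex-based word and sentence splitter."""
    WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")   # letters, with inner apostrophes
    SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")    # text up to terminal punctuation

    def words(self, text):
        return self.WORD_RE.findall(text.lower())

    def sentences(self, text):
        return [s.strip() for s in self.SENTENCE_RE.findall(text) if s.strip()]


@dataclass
class TextStats:
    """Holds the computed metrics (only a subset is shown here)."""
    total_words: int = 0
    unique_words: int = 0
    avg_word_length: float = 0.0
    sentence_count: int = 0
    avg_sentence_length: float = 0.0
    top_words: list = field(default_factory=list)


class Analyzer:
    """Orchestrates tokenization and metric computation."""

    def __init__(self, tokenizer=None, exclude_stop_words=False):
        self.tokenizer = tokenizer or Tokenizer()
        self.exclude_stop_words = exclude_stop_words

    def analyze(self, text, top_n=10):
        all_words = self.tokenizer.words(text)
        # Assumption: the stop-word filter affects word statistics only,
        # while sentence length is measured on the unfiltered words.
        words = [w for w in all_words if w not in STOP_WORDS] if self.exclude_stop_words else all_words
        sentences = self.tokenizer.sentences(text)
        counts = Counter(words)
        return TextStats(
            total_words=len(words),
            unique_words=len(counts),
            avg_word_length=sum(map(len, words)) / len(words) if words else 0.0,
            sentence_count=len(sentences),
            avg_sentence_length=len(all_words) / len(sentences) if sentences else 0.0,
            top_words=counts.most_common(top_n),
        )


class Reporter:
    """Formats TextStats; only the plain-text format is sketched here."""

    def render(self, name, stats):
        lines = [f"File: {name} ({stats.total_words} words, {stats.unique_words} unique)"]
        lines += [f"{word} {count}" for word, count in stats.top_words]
        lines.append(f"Sentences: {stats.sentence_count} (avg {stats.avg_sentence_length:.1f} words)")
        return "\n".join(lines)


def build_parser():
    parser = argparse.ArgumentParser(prog="analyze.py", description="Analyse a text file.")
    parser.add_argument("file", metavar="FILE")
    parser.add_argument("--top", type=int, default=10, metavar="N", help="show top N words")
    parser.add_argument("--format", choices=["text", "csv", "json"], default="text", metavar="FMT")
    parser.add_argument("--no-stop", action="store_true", help="exclude common stop words")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.file, encoding="utf-8") as fh:
        text = fh.read()
    stats = Analyzer(exclude_stop_words=args.no_stop).analyze(text, top_n=args.top)
    print(Reporter().render(args.file, stats))  # args.format is parsed but unused in this sketch
```

Calling `main(["hamlet.txt", "--top", "5"])` runs the pipeline on a file. In a script you would typically call `main()` under an `if __name__ == "__main__":` guard. Keeping `argv` as a parameter makes the argument handling testable without touching `sys.argv`.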
3. Sample Output
text
$ python analyze.py hamlet.txt --top 5
File: hamlet.txt (5447 words, 1234 unique)
Top 5 words:
  the ████████████████ 250
  and ██████████████   215
  to  ████████████     180
  of  ██████████       150
  a   ████████         130
Sentences: 423 (avg 12.9 words)
Readability: 68.4 (Standard)
📝 Exercises
Try them yourself first, then open the solution to compare.
Exercise 1
Build the analyzer
Goal: Implement the text analysis tool with all features above.
Requirements
- All metrics computed correctly
- argparse with all flags
- text/csv/json output formats
- Unit tests for Tokenizer
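For the unit-test requirement, a minimal `unittest` sketch might look like the following. The `Tokenizer` defined here is a hypothetical stand-in so the tests run on their own. In your project you would import your real class instead, and the expected tokens would depend on the tokenization rules you choose.

```python
import re
import unittest


class Tokenizer:
    """Stand-in tokenizer (assumption); replace with an import of your own class."""

    def words(self, text):
        return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

    def sentences(self, text):
        return [s.strip() for s in re.findall(r"[^.!?]+[.!?]*", text) if s.strip()]


class TokenizerTests(unittest.TestCase):
    def setUp(self):
        self.tok = Tokenizer()

    def test_words_are_lowercased_and_punctuation_is_dropped(self):
        self.assertEqual(self.tok.words("Hello, World!"), ["hello", "world"])

    def test_apostrophes_stay_inside_words(self):
        self.assertEqual(self.tok.words("Don't stop"), ["don't", "stop"])

    def test_empty_text_yields_no_tokens(self):
        self.assertEqual(self.tok.words(""), [])
        self.assertEqual(self.tok.sentences(""), [])

    def test_sentences_split_on_terminal_punctuation(self):
        self.assertEqual(self.tok.sentences("One. Two? Three!"), ["One.", "Two?", "Three!"])
```

Saved as a file such as `test_tokenizer.py` (a hypothetical name), the tests can be run with `python -m unittest`. Edge cases like empty input and apostrophes are worth covering early because they are where regex tokenizers most often surprise you.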
Grading
- Metrics correct: 40%
- argparse flags work: 20%
- Output formats: 20%
- Tests pass: 20%
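For the output-formats portion, one way to keep the three formats separate is a small dispatch table of render functions. The sketch below covers only the top-words table. The function names, the CSV and JSON field names (`word`, `count`), and the bar width of 16 are assumptions chosen to resemble the sample output, not a prescribed layout.

```python
import csv
import io
import json

BAR_WIDTH = 16  # assumption: widest bar is 16 block characters


def bar(count, max_count, width=BAR_WIDTH):
    """Scale count to a bar of at most `width` block characters."""
    if max_count <= 0:
        return ""
    return "█" * round(count / max_count * width)


def render_text(top_words):
    """Aligned rows: word, ASCII bar, count."""
    if not top_words:
        return ""
    max_count = max(count for _, count in top_words)
    pad = max(len(word) for word, _ in top_words)
    return "\n".join(
        f"{word:<{pad}} {bar(count, max_count):<{BAR_WIDTH}} {count}"
        for word, count in top_words
    )


def render_csv(top_words):
    """Header row plus one row per word, built in memory."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(top_words)
    return buf.getvalue()


def render_json(top_words):
    """List of {"word": ..., "count": ...} objects."""
    return json.dumps([{"word": w, "count": c} for w, c in top_words], indent=2)


# Maps each --format value to its renderer.
RENDERERS = {"text": render_text, "csv": render_csv, "json": render_json}


def render(top_words, fmt="text"):
    return RENDERERS[fmt](top_words)
```

For example, `print(render([("the", 250), ("and", 215)], "csv"))` prints a header line followed by `the,250` and `and,215`. Using the `csv` and `json` modules rather than string concatenation means quoting and escaping are handled for you.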
Example code / lecture materials
All lecture materials and example code are openly available on GitHub.