Week 10 — Intermediate Capstone: Text Analysis Tool

Combine the intermediate track skills to build a command-line text analysis tool that reads files, tokenizes text, computes statistics, and produces a formatted report.

capstone · project · CLI · text analysis · OOP · regex

Duration

⏱ 3 hours

Level

📊 Intermediate

Prerequisite

🎯 Intermediate Weeks 1–9

OUTCOME

A CLI tool that analyses any text file and prints a statistics report

What you'll learn

1. Integrate file I/O, regex, OOP, and collections in one project
2. Design a class hierarchy for the analysis pipeline
3. Handle command-line arguments with argparse
4. Produce formatted tabular output
5. Write unit tests for the tokenizer and statistics functions

1. Project Specification

bash

python analyze.py [OPTIONS] FILE

Options:
  --top N       Show top N words (default 10)
  --format FMT  Output format: text|csv|json (default text)
  --no-stop     Exclude common stop words
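
The interface above maps fairly directly onto argparse. The following is a minimal sketch rather than a reference solution; the function name `build_parser` and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the usage line: analyze.py [OPTIONS] FILE."""
    parser = argparse.ArgumentParser(
        prog="analyze.py",
        description="Analyse a text file and print a statistics report.",
    )
    # Positional argument: the file to analyse.
    parser.add_argument("file", metavar="FILE", help="path to the text file")
    # --top N: how many of the most frequent words to show.
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="show top N words (default 10)")
    # --format FMT: restricted to the three formats named in the spec.
    parser.add_argument("--format", choices=["text", "csv", "json"],
                        default="text", metavar="FMT",
                        help="output format: text|csv|json (default text)")
    # --no-stop: a boolean switch, stored as args.no_stop.
    parser.add_argument("--no-stop", action="store_true",
                        help="exclude common stop words")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.top, args.format, args.no_stop)
```

Using `choices` lets argparse reject an unsupported format with a usage error, so the rest of the program does not have to validate it again.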

The report includes:
Total words, unique words, average word length
Top-N word frequency (with ASCII bar chart)
Sentence count, average sentence length
Most common bigrams
Readability score (Flesch Reading Ease approximation)
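
Two of these metrics are less obvious than the rest, so here is one possible sketch of them. The Flesch Reading Ease formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter below is a rough vowel-group heuristic chosen for illustration (which is why the score is only an approximation), and the function names are hypothetical.

```python
import re
from collections import Counter


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per run of vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(words: list, sentence_count: int) -> float:
    """Approximate Flesch Reading Ease from a word list and a sentence count."""
    if not words or sentence_count == 0:
        return 0.0  # avoid division by zero on empty input
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / sentence_count
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def top_bigrams(words: list, n: int = 3) -> list:
    """Count adjacent word pairs and return the n most common."""
    return Counter(zip(words, words[1:])).most_common(n)
```

Pairing `words` with `words[1:]` through `zip` yields each adjacent pair exactly once, which keeps the bigram count to a single line.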

2. Architecture

Tokenizer class — regex-based word/sentence splitter
TextStats dataclass — holds all computed metrics
Analyzer class — orchestrates tokenization and metric computation
Reporter class — formats and prints TextStats (text/csv/json)
main() — argparse + Analyzer + Reporter pipeline
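
To make the division of responsibilities concrete, here is a pared-down skeleton of the four components. It is a sketch under assumptions, not the course solution: the regex patterns, field names, and method names are illustrative, only a subset of the metrics is computed, and the Reporter handles just the text and json formats.

```python
import json
import re
from collections import Counter
from dataclasses import asdict, dataclass


class Tokenizer:
    """Regex-based splitter (patterns are illustrative assumptions)."""
    WORD_RE = re.compile(r"[A-Za-z']+")
    SENTENCE_END_RE = re.compile(r"[.!?]+")

    def words(self, text: str) -> list:
        return [w.lower() for w in self.WORD_RE.findall(text)]

    def sentences(self, text: str) -> list:
        parts = self.SENTENCE_END_RE.split(text)
        return [p.strip() for p in parts if p.strip()]


@dataclass
class TextStats:
    """Holds computed metrics (a subset of the full specification)."""
    total_words: int
    unique_words: int
    avg_word_length: float
    sentence_count: int
    top_words: list


class Analyzer:
    """Runs the tokenizer and fills in a TextStats instance."""

    def __init__(self, tokenizer=None, top_n: int = 10):
        self.tokenizer = tokenizer or Tokenizer()
        self.top_n = top_n

    def analyze(self, text: str) -> TextStats:
        words = self.tokenizer.words(text)
        counts = Counter(words)
        avg_len = sum(map(len, words)) / len(words) if words else 0.0
        return TextStats(
            total_words=len(words),
            unique_words=len(counts),
            avg_word_length=avg_len,
            sentence_count=len(self.tokenizer.sentences(text)),
            top_words=counts.most_common(self.top_n),
        )


class Reporter:
    """Formats TextStats; csv is left out of this sketch."""

    def render(self, stats: TextStats, fmt: str = "text") -> str:
        if fmt == "json":
            return json.dumps(asdict(stats))
        if fmt == "text":
            lines = [
                f"Words: {stats.total_words} ({stats.unique_words} unique)",
                f"Sentences: {stats.sentence_count}",
            ]
            lines += [f"  {word:<8}{count}" for word, count in stats.top_words]
            return "\n".join(lines)
        raise ValueError(f"unsupported format: {fmt}")
```

Because the Analyzer receives its Tokenizer as a constructor argument, a test can substitute a simpler tokenizer without touching the metric code.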

3. Sample Output

text

$ python analyze.py hamlet.txt --top 5

File: hamlet.txt  (5447 words, 1234 unique)

Top 5 words:
  the     ████████████████  250
  and     ██████████████    215
  to      ████████████      180
  of      ██████████        150
  a       ████████          130

Sentences: 423  (avg 12.9 words)
Readability: 68.4 (Standard)
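
The bar lengths in the sample are consistent with scaling each count against the largest one over a 16-character width. A minimal sketch of that idea follows; the function name `render_bars` and the column widths are assumptions chosen to match the layout shown above.

```python
def render_bars(top_words: list, width: int = 16) -> list:
    """Return one formatted line per (word, count) pair.

    Each bar is scaled so the most frequent word fills `width` characters.
    """
    if not top_words:
        return []
    max_count = max(count for _, count in top_words)
    lines = []
    for word, count in top_words:
        # Scale relative to the largest count; guard against max_count == 0.
        bar_len = round(count / max_count * width) if max_count else 0
        bar = "█" * bar_len
        # Word padded to 8 columns, bar padded to the full width, then the count.
        lines.append(f"  {word:<8}{bar:<{width}}  {count}")
    return lines


if __name__ == "__main__":
    sample = [("the", 250), ("and", 215), ("to", 180), ("of", 150), ("a", 130)]
    print("\n".join(render_bars(sample)))
```

Padding the bar to a fixed width keeps the counts aligned in a single column regardless of bar length.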

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Build the analyzer

Goal: Implement the text analysis tool with all features above.

Requirements

All metrics computed correctly
argparse with all flags
text/csv/json output formats
Unit tests for Tokenizer
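
As a starting point for the test requirement, the sketch below shows the general shape of a unittest test case. The `tokenize_words` function is a hypothetical stand-in so that the example runs on its own; in your project the tests would import your own Tokenizer instead.

```python
import re
import unittest


def tokenize_words(text: str) -> list:
    """Hypothetical stand-in for the Tokenizer: lowercase words, apostrophes kept."""
    return re.findall(r"[a-z']+", text.lower())


class TokenizerTests(unittest.TestCase):
    def test_lowercases_and_strips_punctuation(self):
        self.assertEqual(tokenize_words("Hello, World!"), ["hello", "world"])

    def test_keeps_apostrophes(self):
        self.assertEqual(tokenize_words("Don't stop"), ["don't", "stop"])

    def test_empty_input(self):
        self.assertEqual(tokenize_words(""), [])
```

Saved as a file such as `test_tokenizer.py` (the filename is an assumption), these tests can be run with `python -m unittest`.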

Grading

· Metrics correct — 40%
· argparse flags work — 20%
· Output formats — 20%
· Tests pass — 20%

Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
