Week 8 — NumPy & Pandas

Accelerate numerical computation with NumPy arrays and explore tabular data with Pandas. Combine them with Matplotlib to visualize results.

numpy · pandas · matplotlib · DataFrame · vectorized

Duration

⏱ 3 hours

Level

📊 Advanced

Prerequisite

🎯 Basic Weeks 6–7

OUTCOME

Analyze a CSV dataset: clean, group, aggregate, and plot the results

What you'll learn

1. Create and manipulate NumPy arrays with vectorized operations
2. Understand broadcasting rules
3. Load, clean, and filter data with Pandas DataFrames
4. Aggregate with groupby and pivot_table
5. Create line, bar, and scatter plots with Matplotlib

1. NumPy Arrays

python

import numpy as np

a = np.array([1, 2, 3, 4, 5])
print(a * 2)          # [ 2  4  6  8 10]  (vectorized: applied to every element)
print(a[a > 3])       # [4 5]             (boolean indexing: keep elements where the mask is True)

# 2D array
m = np.arange(9).reshape(3, 3)
print(m)
print(m.T)            # transpose
print(m @ m)          # matrix multiply
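Broadcasting is listed in the learning goals but not shown above, so here is a minimal sketch. NumPy compares shapes from the right; two dimensions are compatible when they are equal or when one of them is 1, and a size-1 dimension is stretched to match. The names `col`, `row`, `grid`, and `centered` are illustrative only.

```python
import numpy as np

# A (3, 1) column plus a (3,) row: the row is treated as (1, 3),
# and both size-1 dimensions are stretched, giving a (3, 3) result.
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)
grid = col + row
print(grid.shape)                   # (3, 3)
print(grid)

# Centering each column: (3, 3) minus (3,) subtracts the column means
# from every row without an explicit loop.
m = np.arange(9).reshape(3, 3)
centered = m - m.mean(axis=0)
print(centered)

# Shapes that do not line up raise ValueError instead of guessing.
try:
    np.ones((3, 2)) + np.ones(3)    # trailing dimensions 2 and 3 differ
except ValueError as exc:
    print("broadcast error:", exc)
```

If a 1D array should act as a column rather than a row, reshaping it with `row.reshape(-1, 1)` or `row[:, np.newaxis]` makes the intent explicit.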

2. Pandas DataFrame

python

import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice","Bob","Carol","Dave"],
    "dept":  ["Eng","Eng","HR","HR"],
    "score": [92, 85, 78, 88]
})

print(df.describe())
print(df[df["score"] >= 85])
print(df.groupby("dept")["score"].mean())
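The learning goals also mention pivot_table, which the example above does not cover. The sketch below rebuilds the same frame and adds a `level` column, which is a hypothetical addition made only so there is a second dimension to pivot on.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Carol", "Dave"],
    "dept":  ["Eng", "Eng", "HR", "HR"],
    "level": ["senior", "junior", "senior", "junior"],  # hypothetical column for illustration
    "score": [92, 85, 78, 88],
})

# Several aggregates per group in one call
summary = df.groupby("dept")["score"].agg(["mean", "max", "count"])
print(summary)

# pivot_table: one row per dept, one column per level, mean score in each cell
table = df.pivot_table(index="dept", columns="level", values="score", aggfunc="mean")
print(table)
```

groupby returns one row per group, while pivot_table spreads a second key across the columns, which is often easier to read when comparing two categorical variables.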

3. Data Cleaning

python

df = pd.read_csv("data.csv")

# Inspect
print(df.shape, df.dtypes)
print(df.isnull().sum())

# Clean
df["score"] = pd.to_numeric(df["score"], errors="coerce")  # non-numeric values become NaN
df = df.dropna(subset=["score"])      # drop rows with a missing or unparseable score
df = df[df["score"].between(0, 100)]  # keep scores in the valid range (inclusive)
df = df.drop_duplicates()
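To see what each cleaning step removes, the sketch below applies them to a small in-memory CSV instead of data.csv. The file contents are made up for illustration. The score column is converted to numeric before missing values are dropped, so that unparseable entries are removed as well.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for data.csv
raw = io.StringIO(
    "name,score\n"
    "Alice,92\n"
    "Bob,unknown\n"   # not a number
    "Carol,\n"        # missing
    "Dave,105\n"      # out of range
    "Eve,78\n"
    "Eve,78\n"        # exact duplicate row
)
df = pd.read_csv(raw)
print("rows loaded:", len(df))

df["score"] = pd.to_numeric(df["score"], errors="coerce")  # "unknown" becomes NaN
df = df.dropna(subset=["score"])                           # removes Bob and Carol
df = df[df["score"].between(0, 100)]                       # removes Dave
df = df.drop_duplicates()                                  # removes the second Eve
print(df)
```

Printing `len(df)` after each step is a simple way to check that no step drops more rows than expected.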

💻 Examples

Run these examples and check the output yourself.

01_analysis.py: Full mini data analysis pipeline

CODE

import numpy as np
import pandas as pd

# Create sample data
np.random.seed(42)
df = pd.DataFrame({
    "name": [f"Student{i}" for i in range(50)],
    "math": np.random.randint(40, 100, 50),
    "eng":  np.random.randint(40, 100, 50),
    "sci":  np.random.randint(40, 100, 50),
})
df["avg"] = df[["math", "eng", "sci"]].mean(axis=1)
df["grade"] = pd.cut(df["avg"], bins=[0,60,70,80,90,100],
                     labels=["F","D","C","B","A"])

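# observed=False keeps empty grade bands in the summary and avoids the
# FutureWarning that recent pandas versions raise for categorical groupers
print(df.groupby("grade", observed=False).agg(count=("name","count"), mean_avg=("avg","mean")))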
print("\nTop 5 students:")
print(df.nlargest(5, "avg")[["name","avg","grade"]])
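The pipeline above stops at printed tables. As a rough sketch of the line, bar, and scatter plots named in the learning goals, the following draws all three from similar random score data. The Agg backend, the output file name `scores_overview.png`, and the three score bands are assumptions chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "math": rng.integers(40, 100, 50),
    "sci":  rng.integers(40, 100, 50),
})
df["avg"] = df[["math", "sci"]].mean(axis=1)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Line plot: averages sorted from lowest to highest
axes[0].plot(np.sort(df["avg"].to_numpy()))
axes[0].set_title("Sorted averages (line)")
axes[0].set_xlabel("rank")
axes[0].set_ylabel("avg")

# Bar plot: number of students per score band (bands are illustrative)
bands = pd.cut(df["avg"], bins=[0, 60, 80, 100], labels=["<=60", "60-80", ">80"])
counts = bands.value_counts().sort_index()
axes[1].bar(list(counts.index.astype(str)), counts.to_numpy())
axes[1].set_title("Students per band (bar)")

# Scatter plot: one point per student
axes[2].scatter(df["math"], df["sci"], s=12)
axes[2].set_title("math vs sci (scatter)")
axes[2].set_xlabel("math")
axes[2].set_ylabel("sci")

fig.tight_layout()
fig.savefig("scores_overview.png")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.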

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Sales Analysis

Goal: Load sales.csv and produce a monthly revenue report.

Requirements

Load CSV with pd.read_csv
Parse date column, extract year-month
Group by month, sum revenue
Plot bar chart with matplotlib
Print top 3 months


SOLUTION

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv", parse_dates=["date"])
df["month"] = df["date"].dt.to_period("M")
monthly = df.groupby("month")["revenue"].sum().sort_index()
print("Top 3 months:")
print(monthly.nlargest(3))
monthly.plot(kind="bar", title="Monthly Revenue")
plt.tight_layout()
plt.savefig("monthly_revenue.png")
print("Chart saved.")
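The solution assumes a sales.csv with `date` and `revenue` columns. If no such file is at hand, one way to try the solution is to generate a synthetic one first. Everything in this sketch (date range, revenue range, random seed) is made up for practice and does not reflect any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic practice data: one row per day of 2024 with a random revenue value.
# Note: this overwrites any existing sales.csv in the working directory.
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
sales = pd.DataFrame({
    "date": dates,
    "revenue": rng.uniform(100, 1000, len(dates)).round(2),
})
sales.to_csv("sales.csv", index=False)

# Sanity check: the file round-trips and covers twelve distinct months
check = pd.read_csv("sales.csv", parse_dates=["date"])
n_months = check["date"].dt.to_period("M").nunique()
print(len(check), "rows,", n_months, "months")
```

After running this once, the solution script above should produce a twelve-bar chart.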

Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
