Advanced
ndarray · Broadcasting · DataFrame · GroupBy · Matplotlib

Week 8 — NumPy & Pandas

Accelerate numerical computation with NumPy arrays and explore tabular data with Pandas. Combine them with Matplotlib to visualize results.

numpy · pandas · matplotlib · DataFrame · vectorized
Duration: 3 hours
Level: 📊 Advanced
Prerequisite: 🎯 Basic Weeks 6–7
Outcome: Analyze a CSV dataset (clean, group, aggregate, and plot the results)

What you'll learn

  1. Create and manipulate NumPy arrays with vectorized operations
  2. Understand broadcasting rules
  3. Load, clean, and filter data with Pandas DataFrames
  4. Aggregate with groupby and pivot_table
  5. Create line, bar, and scatter plots with Matplotlib

1. NumPy Arrays

python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
print(a * 2)          # [ 2  4  6  8 10]  (vectorized)
print(a[a > 3])       # [4 5]             (boolean indexing)

# 2D array
m = np.arange(9).reshape(3, 3)
print(m)
print(m.T)            # transpose
print(m @ m)          # matrix multiply
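
Broadcasting is listed in the learning goals, so here is a minimal sketch of how it behaves. NumPy compares shapes from the trailing dimension backwards. Two dimensions are compatible when they are equal or when one of them is 1. The names `col`, `row`, `grid`, and `centered` are illustrative only.

```python
import numpy as np

# A (3, 1) column and a (3,) row broadcast to a (3, 3) grid
col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)
grid = col + row
print(grid.shape)                   # (3, 3)
print(grid)

# Subtract each column's mean from a 2D array: (3, 3) minus (3,)
m = np.arange(9).reshape(3, 3)
centered = m - m.mean(axis=0)
print(centered)

# Incompatible trailing dimensions (2 vs 3) raise ValueError
try:
    np.ones((3, 2)) + np.ones(3)
except ValueError as exc:
    print("cannot broadcast:", exc)
```

The `m.mean(axis=0)` result has shape (3,), so it is stretched across the rows of `m` without an explicit loop or copy.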

2. Pandas DataFrame

python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice","Bob","Carol","Dave"],
    "dept":  ["Eng","Eng","HR","HR"],
    "score": [92, 85, 78, 88]
})

print(df.describe())
print(df[df["score"] >= 85])
print(df.groupby("dept")["score"].mean())
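
The learning goals also mention `pivot_table`, which the snippet above does not show. The sketch below reuses the same small DataFrame and adds a hypothetical `level` column (not part of the original data) so there is a second key to pivot on.

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Carol", "Dave"],
    "dept":  ["Eng", "Eng", "HR", "HR"],
    "level": ["senior", "junior", "senior", "junior"],  # hypothetical column for illustration
    "score": [92, 85, 78, 88],
})

# Rows are departments, columns are levels, cells hold the mean score
table = df.pivot_table(index="dept", columns="level", values="score", aggfunc="mean")
print(table)

# Several aggregates per group in a single groupby call
summary = df.groupby("dept")["score"].agg(["mean", "max", "count"])
print(summary)
```

A pivot table is a groupby over two keys with one key spread across the columns, which often makes comparisons easier to read than a long grouped Series.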

3. Data Cleaning

python
df = pd.read_csv("data.csv")

# Inspect
print(df.shape, df.dtypes)
print(df.isnull().sum())

# Clean
df["score"] = pd.to_numeric(df["score"], errors="coerce")  # non-numeric values become NaN
df = df.dropna(subset=["score"])      # drop rows with a missing or non-numeric score
df = df[df["score"].between(0, 100)]  # keep the valid range
df = df.drop_duplicates()
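
To try the cleaning steps without a `data.csv` on disk, the sketch below reads a small in-memory CSV instead. The rows are made up for illustration. They include a non-numeric score, an out-of-range score, a missing score, and a duplicate row.

```python
import io
import pandas as pd

# Hypothetical stand-in for data.csv
raw = io.StringIO(
    "name,score\n"
    "Alice,92\n"
    "Bob,absent\n"    # non-numeric
    "Carol,105\n"     # out of range
    "Dave,\n"         # missing
    "Alice,92\n"      # duplicate
    "Eve,78\n"
)
df = pd.read_csv(raw)

# Convert first so that non-numeric strings become NaN
df["score"] = pd.to_numeric(df["score"], errors="coerce")
print(df["score"].isna().sum())   # count of missing or unparseable scores

clean = df.dropna(subset=["score"])
clean = clean[clean["score"].between(0, 100)]
clean = clean.drop_duplicates().reset_index(drop=True)
print(clean)
```

Converting with `errors="coerce"` before `dropna` means that unparseable values and truly missing values are removed in the same step.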

💻 Examples

Run these examples and check the output yourself.

01_analysis.py: Full mini data analysis pipeline
CODE
import numpy as np
import pandas as pd

# Create sample data
np.random.seed(42)
df = pd.DataFrame({
    "name": [f"Student{i}" for i in range(50)],
    "math": np.random.randint(40, 100, 50),
    "eng":  np.random.randint(40, 100, 50),
    "sci":  np.random.randint(40, 100, 50),
})
df["avg"] = df[["math", "eng", "sci"]].mean(axis=1)
df["grade"] = pd.cut(df["avg"], bins=[0,60,70,80,90,100],
                     labels=["F","D","C","B","A"])

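# observed=False keeps empty grade categories and avoids the pandas 2.x FutureWarning
print(df.groupby("grade", observed=False).agg(count=("name","count"), mean_avg=("avg","mean")))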
print("\nTop 5 students:")
print(df.nlargest(5, "avg")[["name","avg","grade"]])
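
The pipeline stops at printed tables. As a possible next step, the sketch below draws one line, one bar, and one scatter plot from similar random data. It rebuilds a smaller DataFrame so that it runs on its own. The output filename `scores_overview.png` and the choice of panels are assumptions, not part of the lecture materials.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    "math": np.random.randint(40, 100, 50),
    "sci":  np.random.randint(40, 100, 50),
})
df["avg"] = df[["math", "sci"]].mean(axis=1)

# Count students per grade band; unobserved grades appear with a count of 0
grades = pd.cut(df["avg"], bins=[0, 60, 70, 80, 90, 100],
                labels=["F", "D", "C", "B", "A"])
counts = grades.value_counts().sort_index()

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Line: averages sorted from lowest to highest
axes[0].plot(df["avg"].sort_values().to_numpy())
axes[0].set_title("Sorted averages")

# Bar: number of students per grade
axes[1].bar(counts.index.astype(str), counts.to_numpy())
axes[1].set_title("Students per grade")

# Scatter: math score against science score
axes[2].scatter(df["math"], df["sci"])
axes[2].set_xlabel("math")
axes[2].set_ylabel("sci")
axes[2].set_title("Math vs. science")

fig.tight_layout()
fig.savefig("scores_overview.png")
```

Working with explicit `fig` and `axes` objects, rather than the implicit `plt` state, keeps each panel's settings separate when a figure has more than one plot.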

📝 Exercises

Try them yourself first, then open the solution to compare.

Exercise 1

Sales Analysis

Goal: Load sales.csv and produce a monthly revenue report.

Requirements
  • Load CSV with pd.read_csv
  • Parse date column, extract year-month
  • Group by month, sum revenue
  • Plot bar chart with matplotlib
  • Print top 3 months
SOLUTION
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv", parse_dates=["date"])
df["month"] = df["date"].dt.to_period("M")
monthly = df.groupby("month")["revenue"].sum().sort_index()
print("Top 3 months:")
print(monthly.nlargest(3))
monthly.plot(kind="bar", title="Monthly Revenue")
plt.tight_layout()
plt.savefig("monthly_revenue.png")
print("Chart saved.")
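
The solution expects a `sales.csv` with `date` and `revenue` columns. If no such file is at hand, a synthetic one can be generated first. The sketch below is only one way to do that. The date range, the value range, and the random seed are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for the first half of 2024 (made-up data)
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", "2024-06-30", freq="D")
sales = pd.DataFrame({
    "date": dates,
    "revenue": rng.integers(100, 1000, size=len(dates)),
})
sales.to_csv("sales.csv", index=False)

# Quick check: reload the file and total the revenue per month
check = pd.read_csv("sales.csv", parse_dates=["date"])
monthly = check.groupby(check["date"].dt.to_period("M"))["revenue"].sum()
print(monthly)
```

With this file in the working directory, the solution above should run unchanged and produce six bars, one per month.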
Example code / lecture materials

All lecture materials and example code are openly available on GitHub.
