# Stage 02 — ML Fundamentals: Solutions

Worked solutions for Stage 2 exercises.

Dependencies: numpy, scikit-learn, matplotlib, pandas, lightgbm.
## Confusion matrix and classification metrics by hand

For predictions `[0, 1, 1, 0, 1, 1, 0, 0]` and labels `[0, 1, 0, 0, 1, 1, 1, 0]`, compute accuracy, precision, recall, and F1.
```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_pred = [0, 1, 1, 0, 1, 1, 0, 0]
y_true = [0, 1, 0, 0, 1, 1, 1, 0]

# By hand, position-by-position:
# i=0 pred=0 true=0 TN
# i=1 pred=1 true=1 TP
# i=2 pred=1 true=0 FP
# i=3 pred=0 true=0 TN
# i=4 pred=1 true=1 TP
# i=5 pred=1 true=1 TP
# i=6 pred=0 true=1 FN
# i=7 pred=0 true=0 TN
# Counts: TP = 3, FP = 1, TN = 3, FN = 1
TP, FP, TN, FN = 3, 1, 3, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 6/8 = 0.75
precision = TP / (TP + FP)                          # 3/4 = 0.75
recall = TP / (TP + FN)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
print(f"acc {accuracy} prec {precision} rec {recall} f1 {f1}")

# Verify with scikit-learn
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```
Confusion matrix from sklearn:

```
[[3 1]   <- actual 0: (TN, FP)
 [1 3]]  <- actual 1: (FN, TP)
```
All four metrics happen to be 0.75 here because TP = TN and FP = FN. In real datasets, they’ll differ — and which one you optimize is a product decision (Stage 02’s evaluation article).
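To see the metrics diverge, take a hypothetical imbalanced case (the counts below are invented for illustration, not from the exercise): 10 positives, 90 negatives, and a model tuned to catch nearly every positive.

```python
# Hypothetical counts: 10 positives, 90 negatives, recall-heavy model.
TP, FP, TN, FN = 9, 30, 60, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 69/100 = 0.69
precision = TP / (TP + FP)                          # 9/39  ≈ 0.23
recall = TP / (TP + FN)                             # 9/10  = 0.90
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.37
print(f"acc {accuracy:.2f} prec {precision:.2f} rec {recall:.2f} f1 {f1:.2f}")
```

High recall, terrible precision, and an accuracy number that tells you almost nothing. Which trade-off is acceptable depends entirely on the cost of each error type.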
## AUC by counting pairs

Scores `[0.9, 0.4, 0.7, 0.2, 0.5]` and labels `[1, 0, 1, 0, 1]`. Compute AUC by counting correctly ordered pairs.
```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.9, 0.4, 0.7, 0.2, 0.5])
labels = np.array([1, 0, 1, 0, 1])

# AUC = P(score(positive) > score(negative)), with ties counted as 0.5.
# Count over all (positive, negative) pairs.
pos = scores[labels == 1]
neg = scores[labels == 0]

# All pairs: 3 positives × 2 negatives = 6 pairs
correct = 0
ties = 0
for p in pos:
    for n in neg:
        if p > n:
            correct += 1
        elif p == n:
            ties += 1

auc_manual = (correct + 0.5 * ties) / (len(pos) * len(neg))
print(f"manual AUC = {auc_manual}")                      # 1.0
print(f"sklearn AUC = {roc_auc_score(labels, scores)}")  # 1.0
```
All three positive scores (0.9, 0.7, 0.5) are higher than both negative scores (0.4, 0.2). Perfect separation → AUC = 1.0.
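The counting becomes more interesting when the ordering isn't perfect. A quick variant with one tie and some inversions (the scores below are made up to exercise both branches):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two mis-ordered pairs and one tied pair among the 6 (pos, neg) pairs.
scores = np.array([0.9, 0.7, 0.4, 0.4, 0.2])
labels = np.array([1, 0, 1, 0, 1])

pos = scores[labels == 1]
neg = scores[labels == 0]
correct = sum(p > n for p in pos for n in neg)   # 2 correctly ordered pairs
ties = sum(p == n for p in pos for n in neg)     # 1 tied pair
auc = (correct + 0.5 * ties) / (len(pos) * len(neg))
print(auc)                          # (2 + 0.5) / 6 ≈ 0.4167
print(roc_auc_score(labels, scores))  # same value
```

Below 0.5 means the scores rank negatives above positives more often than not; flipping the score sign would give AUC ≈ 0.58.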
## Calibration plot

Train a model on a small dataset; plot predicted probability vs actual frequency in 10 bins.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Bin predictions into 10 buckets; compare predicted prob vs actual rate in each
prob_true, prob_pred = calibration_curve(y_te, probs, n_bins=10, strategy="uniform")

plt.plot([0, 1], [0, 1], "k--", label="perfect")
plt.plot(prob_pred, prob_true, "o-", label="logistic")
plt.xlabel("predicted probability")
plt.ylabel("observed frequency")
plt.legend(); plt.title("Calibration plot")
plt.savefig("calibration.png")
```
A well-calibrated classifier traces the diagonal. Logistic regression usually does. Modern neural networks usually don’t — they’re overconfident. The fix: temperature scaling on a held-out set (regularization article).
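Temperature scaling itself fits in a few lines: learn a single scalar T on held-out data and divide the logits by it. A sketch (uses scipy; the logits here are synthetic stand-ins for a real model's validation outputs, generated to be systematically overconfident):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic validation set: logits with more magnitude than their
# noise level justifies, i.e. an overconfident model.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 2000)
logits = (2 * y_val - 1) * 3.0 + rng.normal(0, 3.0, 2000)

def nll(T):
    """Negative log-likelihood of validation labels at temperature T."""
    p = 1 / (1 + np.exp(-logits / T))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

T = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
print(f"learned temperature T = {T:.2f}")  # T > 1 means: soften the probabilities
```

At inference you apply the same division: `softmax(logits / T)`. Ranking (and therefore accuracy and AUC) is unchanged; only the confidence changes.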
## Bootstrap confidence interval

A model gets 92% accuracy on 500 examples. Compute a 95% CI on the accuracy via 1000 bootstrap resamples.
```python
import numpy as np

n = 500
acc = 0.92
correct = int(round(acc * n))
# Reconstruct a per-example correctness vector: 1 = right, 0 = wrong
predictions = np.array([1] * correct + [0] * (n - correct))

rng = np.random.default_rng(0)
boot_accs = []
for _ in range(1000):
    sample = rng.choice(predictions, size=n, replace=True)
    boot_accs.append(sample.mean())

lo, hi = np.percentile(boot_accs, [2.5, 97.5])
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
# Roughly [0.896, 0.946] — about ±2.5pp at this sample size.
```
Read this as: if you ran this experiment many times, 95% of computed CIs would contain the true accuracy. A 92% point estimate with a ±2.5pp CI tells you a “93% accuracy” model probably isn’t actually better.
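To actually compare two models, bootstrap the *gap* rather than two separate intervals: resample the same indices for both so the shared test-set noise cancels. A sketch with invented correctness vectors (model A right on 460/500, model B on 465/500, with B's extra wins on examples A got wrong):

```python
import numpy as np

# Hypothetical per-example correctness on the SAME 500 test examples.
n = 500
correct_a = np.array([1] * 460 + [0] * 40)            # ~92% accuracy
correct_b = np.array([1] * 460 + [1] * 5 + [0] * 35)  # ~93% accuracy

rng = np.random.default_rng(0)
diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)  # resample indices, apply to BOTH models
    diffs.append(correct_b[idx].mean() - correct_a[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"accuracy gap 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

If the gap's CI excludes zero, the difference is real at that confidence level; comparing two independently computed CIs for overlap is a much weaker (and overly conservative) test.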
## Train logistic regression on Iris
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(classification_report(y_te, y_pred, target_names=["setosa", "versicolor", "virginica"]))
print(confusion_matrix(y_te, y_pred))
```
Iris is easy; expect ~95–100% accuracy. The interesting part is the per-class metrics, which tell you which classes the model confuses (typically versicolor ↔ virginica).
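It's worth a quick look at the errors themselves. This sketch refits the same model and prints each misclassified test example with the model's class probabilities, to show how confident the mistakes are:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probs = model.predict_proba(X_te)        # shape (n_test, 3), rows sum to 1
y_pred = probs.argmax(axis=1)
for i in np.flatnonzero(y_pred != y_te): # indices of misclassified examples
    print(f"true={y_te[i]} pred={y_pred[i]} probs={np.round(probs[i], 2)}")
```

On Iris, errors (if any) tend to sit near 50/50 between classes 1 and 2, i.e. the model is appropriately unsure rather than confidently wrong.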
## Tabular: when boosted trees beat deep learning

Train a model on UCI Adult; compare logistic regression and gradient-boosted trees.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

# UCI Adult dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols = ["age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
        "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
        "hours_per_week", "native_country", "income"]
df = pd.read_csv(url, header=None, names=cols, na_values=" ?", skipinitialspace=True).dropna()
y = (df["income"].str.strip() == ">50K").astype(int)
X = df.drop(columns=["income"])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()

# Logistic regression (with one-hot encoding)
pre = make_column_transformer((OneHotEncoder(handle_unknown="ignore"), cat_cols), remainder="passthrough")
lr = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))]).fit(X_tr, y_tr)
auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])

# LightGBM (handles categoricals natively)
X_tr_lgb = X_tr.copy(); X_te_lgb = X_te.copy()
for c in cat_cols:
    X_tr_lgb[c] = X_tr_lgb[c].astype("category")
    X_te_lgb[c] = X_te_lgb[c].astype("category")
gbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, verbose=-1).fit(X_tr_lgb, y_tr)
auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te_lgb)[:, 1])

print(f"Logistic regression AUC: {auc_lr:.4f}")  # ~0.90
print(f"LightGBM AUC: {auc_gbm:.4f}")            # ~0.93
```
A 3-percentage-point AUC win for about ten extra lines of code. On real tabular data this gap often widens. Try LightGBM first.