# Stage 02 — ML Fundamentals: Solutions

Worked solutions for Stage 2 exercises.

Dependencies: numpy, scikit-learn, matplotlib, pandas, lightgbm.
## Confusion matrix and classification metrics by hand

For predictions `[0, 1, 1, 0, 1, 1, 0, 0]` and labels `[0, 1, 0, 0, 1, 1, 1, 0]`, compute accuracy, precision, recall, and F1.
```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_pred = [0, 1, 1, 0, 1, 1, 0, 0]
y_true = [0, 1, 0, 0, 1, 1, 1, 0]

# By hand, position-by-position:
# i=0 pred=0 true=0 TN
# i=1 pred=1 true=1 TP
# i=2 pred=1 true=0 FP
# i=3 pred=0 true=0 TN
# i=4 pred=1 true=1 TP
# i=5 pred=1 true=1 TP
# i=6 pred=0 true=1 FN
# i=7 pred=0 true=0 TN
# Counts: TP = 3, FP = 1, TN = 3, FN = 1
TP, FP, TN, FN = 3, 1, 3, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 6/8 = 0.75
precision = TP / (TP + FP)                          # 3/4 = 0.75
recall = TP / (TP + FN)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75
print(f"acc {accuracy} prec {precision} rec {recall} f1 {f1}")

# Verify with scikit-learn
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```
Confusion matrix from sklearn:

```
[[3 1]   <- actual 0: (TN, FP)
 [1 3]]  <- actual 1: (FN, TP)
```
All four metrics happen to be 0.75 here because TP = TN and FP = FN. In real datasets, they’ll differ — and which one you optimize is a product decision (Stage 02’s evaluation article).
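To see the metrics diverge, take a hypothetical imbalanced case (the counts below are invented for illustration, not from the exercise): 10 positives, 90 negatives, and a model tuned to catch nearly every positive.

```python
# Hypothetical counts: 10 positives, 90 negatives, recall-heavy model.
TP, FP, TN, FN = 9, 30, 60, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 69/100 = 0.69
precision = TP / (TP + FP)                          # 9/39  ≈ 0.23
recall = TP / (TP + FN)                             # 9/10  = 0.90
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.37
print(f"acc {accuracy:.2f} prec {precision:.2f} rec {recall:.2f} f1 {f1:.2f}")
```

High recall, terrible precision, and an accuracy number that tells you almost nothing. Which trade-off is acceptable depends entirely on the cost of each error type.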
## AUC by counting pairs

Scores `[0.9, 0.4, 0.7, 0.2, 0.5]` and labels `[1, 0, 1, 0, 1]`. Compute AUC by counting correctly ordered pairs.
```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.9, 0.4, 0.7, 0.2, 0.5])
labels = np.array([1, 0, 1, 0, 1])

# AUC = P(score(positive) > score(negative)), with ties counted as 0.5.
# Count over all (positive, negative) pairs.
pos = scores[labels == 1]
neg = scores[labels == 0]

# All pairs: 3 positives × 2 negatives = 6 pairs
correct = 0
ties = 0
for p in pos:
    for n in neg:
        if p > n:
            correct += 1
        elif p == n:
            ties += 1

auc_manual = (correct + 0.5 * ties) / (len(pos) * len(neg))
print(f"manual AUC = {auc_manual}")                      # 1.0
print(f"sklearn AUC = {roc_auc_score(labels, scores)}")  # 1.0
```
All three positive scores (0.9, 0.7, 0.5) are higher than both negative scores (0.4, 0.2). Perfect separation → AUC = 1.0.
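The counting becomes more interesting when the ordering isn't perfect. A quick variant with one tie and some inversions (the scores below are made up to exercise both branches):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two mis-ordered pairs and one tied pair among the 6 (pos, neg) pairs.
scores = np.array([0.9, 0.7, 0.4, 0.4, 0.2])
labels = np.array([1, 0, 1, 0, 1])

pos = scores[labels == 1]
neg = scores[labels == 0]
correct = sum(p > n for p in pos for n in neg)   # 2 correctly ordered pairs
ties = sum(p == n for p in pos for n in neg)     # 1 tied pair
auc = (correct + 0.5 * ties) / (len(pos) * len(neg))
print(auc)                          # (2 + 0.5) / 6 ≈ 0.4167
print(roc_auc_score(labels, scores))  # same value
```

Below 0.5 means the scores rank negatives above positives more often than not; flipping the score sign would give AUC ≈ 0.58.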
## Calibration plot

Train a model on a small dataset; plot predicted probability vs actual frequency in 10 bins.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]

# Bin predictions into 10 buckets; compare predicted prob vs actual rate in each
prob_true, prob_pred = calibration_curve(y_te, probs, n_bins=10, strategy="uniform")

plt.plot([0, 1], [0, 1], "k--", label="perfect")
plt.plot(prob_pred, prob_true, "o-", label="logistic")
plt.xlabel("predicted probability")
plt.ylabel("observed frequency")
plt.legend(); plt.title("Calibration plot")
plt.savefig("calibration.png")
```
A well-calibrated classifier traces the diagonal. Logistic regression usually does. Modern neural networks usually don’t — they’re overconfident. The fix: temperature scaling on a held-out set (regularization article).
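Temperature scaling itself fits in a few lines: learn a single scalar T on held-out data and divide the logits by it. A sketch (uses scipy; the logits here are synthetic stand-ins for a real model's validation outputs, generated to be systematically overconfident):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Synthetic validation set: logits with more magnitude than their
# noise level justifies, i.e. an overconfident model.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, 2000)
logits = (2 * y_val - 1) * 3.0 + rng.normal(0, 3.0, 2000)

def nll(T):
    """Negative log-likelihood of validation labels at temperature T."""
    p = 1 / (1 + np.exp(-logits / T))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

T = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
print(f"learned temperature T = {T:.2f}")  # T > 1 means: soften the probabilities
```

At inference you apply the same division: `softmax(logits / T)`. Ranking (and therefore accuracy and AUC) is unchanged; only the confidence changes.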
## Bootstrap confidence interval

A model gets 92% accuracy on 500 examples. Compute a 95% CI on the accuracy via 1000 bootstrap resamples.
```python
import numpy as np

n = 500
acc = 0.92
correct = int(round(acc * n))
# Reconstruct a per-example correctness vector: 1 = right, 0 = wrong
predictions = np.array([1] * correct + [0] * (n - correct))

rng = np.random.default_rng(0)
boot_accs = []
for _ in range(1000):
    sample = rng.choice(predictions, size=n, replace=True)
    boot_accs.append(sample.mean())

lo, hi = np.percentile(boot_accs, [2.5, 97.5])
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")
# Roughly [0.896, 0.946] — about ±2.5pp at this sample size.
```
Read this as: if you ran this experiment many times, 95% of computed CIs would contain the true accuracy. A 92% point estimate with a ±2.5pp CI tells you a “93% accuracy” model probably isn’t actually better.
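To actually compare two models, bootstrap the *gap* rather than two separate intervals: resample the same indices for both so the shared test-set noise cancels. A sketch with invented correctness vectors (model A right on 460/500, model B on 465/500, with B's extra wins on examples A got wrong):

```python
import numpy as np

# Hypothetical per-example correctness on the SAME 500 test examples.
n = 500
correct_a = np.array([1] * 460 + [0] * 40)            # ~92% accuracy
correct_b = np.array([1] * 460 + [1] * 5 + [0] * 35)  # ~93% accuracy

rng = np.random.default_rng(0)
diffs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)  # resample indices, apply to BOTH models
    diffs.append(correct_b[idx].mean() - correct_a[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"accuracy gap 95% CI: [{lo:+.3f}, {hi:+.3f}]")
```

If the gap's CI excludes zero, the difference is real at that confidence level; comparing two independently computed CIs for overlap is a much weaker (and overly conservative) test.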
## Train logistic regression on Iris
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
print(classification_report(y_te, y_pred, target_names=["setosa", "versicolor", "virginica"]))
print(confusion_matrix(y_te, y_pred))
```
Iris is easy; expect ~95–100% accuracy. The interesting part is the per-class metrics, which tell you which classes the model confuses (typically versicolor ↔ virginica).
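It's worth a quick look at the errors themselves. This sketch refits the same model and prints each misclassified test example with the model's class probabilities, to show how confident the mistakes are:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

probs = model.predict_proba(X_te)        # shape (n_test, 3), rows sum to 1
y_pred = probs.argmax(axis=1)
for i in np.flatnonzero(y_pred != y_te): # indices of misclassified examples
    print(f"true={y_te[i]} pred={y_pred[i]} probs={np.round(probs[i], 2)}")
```

On Iris, errors (if any) tend to sit near 50/50 between classes 1 and 2, i.e. the model is appropriately unsure rather than confidently wrong.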
## Tabular: when boosted trees beat deep learning

Train a model on UCI Adult; compare logistic regression and gradient-boosted trees.
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

# UCI Adult dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
cols = ["age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
        "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
        "hours_per_week", "native_country", "income"]
df = pd.read_csv(url, header=None, names=cols, na_values=" ?", skipinitialspace=True).dropna()
y = (df["income"].str.strip() == ">50K").astype(int)
X = df.drop(columns=["income"])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
cat_cols = X.select_dtypes(include=["object"]).columns.tolist()

# Logistic regression (with one-hot encoding)
pre = make_column_transformer((OneHotEncoder(handle_unknown="ignore"), cat_cols), remainder="passthrough")
lr = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))]).fit(X_tr, y_tr)
auc_lr = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])

# LightGBM (handles categoricals natively)
X_tr_lgb = X_tr.copy(); X_te_lgb = X_te.copy()
for c in cat_cols:
    X_tr_lgb[c] = X_tr_lgb[c].astype("category")
    X_te_lgb[c] = X_te_lgb[c].astype("category")
gbm = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, verbose=-1).fit(X_tr_lgb, y_tr)
auc_gbm = roc_auc_score(y_te, gbm.predict_proba(X_te_lgb)[:, 1])

print(f"Logistic regression AUC: {auc_lr:.4f}")  # ~0.90
print(f"LightGBM AUC: {auc_gbm:.4f}")            # ~0.93
```
A 3-percentage-point AUC win for about ten extra lines of code. On real tabular data this gap often widens. Try LightGBM first.