
Session 12 — Evaluate Like a Pro

Duration: 75 min · Format: live online

What you'll learn: by the end, you can split data properly, spot overfitting, read a confusion matrix, compute precision/recall/F1 and know when accuracy is misleading, run cross-validation, and summarise it all in a model card.

Soft skill focus — Critical thinking

Today you'll also practise critical thinking. Evaluation is where you turn your critical eye on your own model, the one you want to succeed. The pro move is to distrust a good score until you've checked how it was measured.

Try this: every time you see a high number today, ask "high on what data, and could this metric be hiding something?" A 99% accuracy on imbalanced data can be worse than useless — train that reflex now.
Think about: "Am I measuring what actually matters here, or just the number that's easiest to make look good?"

What you'll need

Google Colab open in a tab, ready for a new notebook.
The diagram below open — it maps the whole evaluation flow you'll build today.
Your reproduction skills from Session 10 — you'll reuse train_test_split and sklearn.

Hook

A cancer-detection model announces 99% accuracy. Impressive, until you learn that only 1% of the scans actually have cancer. A model that says "no cancer" to everyone also scores 99%, and it catches exactly zero real cases. It would be 99% accurate and perfectly useless.

This is the trap that catches beginners and embarrasses professionals: a single number that sounds great while hiding a disaster. Today you learn the tools that see through it — the split that stops you fooling yourself, the confusion matrix that shows the real mistakes, and the metrics that tell you whether your model is genuinely any good.

Teach — Train, validation, test

[Diagram: split data into train, validation and test; then inspect the mistakes with a confusion matrix]

You never judge a model on the data it learned from — that's like marking your own homework. So you split your data three ways:

Train (~60–70%) — the model learns from this.
Validation (~15–20%) — you tune choices here (which model, which settings). You peek at this a lot, so it can't be your final judge.
Test (~15–20%) — locked in a vault, touched once, at the very end. This is your honest estimate of real-world performance.

The golden rule: the test set is sacred. Every time you make a decision based on the test score, you contaminate it — and your reported number becomes a lie.
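The three-way split above can be sketched with two calls to train_test_split. This is a minimal sketch, not the only way to do it: the 60/20/20 proportions, the random_state values and the use of stratify (which keeps the class balance similar in each part) are choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: lock away 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: split what is left into train and validation.
# 0.25 of the remaining 80% is 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

n = len(X)
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(f"{name:<10} {len(part):>4} rows  ({len(part) / n:.0%})")
```

Tune on X_val as often as you like; score on X_test once, at the end.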

Teach — Overfitting

Overfitting is when a model memorises the training data instead of learning the general pattern. The tell-tale sign: high training accuracy, low test accuracy. It aced the practice questions and flunked the real exam.

The train/validation gap is your early-warning system. If training accuracy is 99% and validation accuracy is 72%, your model isn't smart, it's memorising. The usual cures are a simpler model, more training data, or regularisation.
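One way to see the gap for yourself is to compare a model that is free to memorise with a simpler one. The sketch below assumes a decision tree as the example model (an illustrative choice, not the model used later in this session); train_val_gap is a hypothetical helper name. The exact validation scores depend on the split, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

def train_val_gap(model):
    """Fit on the training split; return (train accuracy, validation accuracy, gap)."""
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    return train_acc, val_acc, train_acc - val_acc

# An unrestricted tree can keep splitting until it fits every training row.
deep = train_val_gap(DecisionTreeClassifier(random_state=0))
# Limiting the depth is one simple way to reduce complexity.
shallow = train_val_gap(DecisionTreeClassifier(max_depth=3, random_state=0))

for name, (train_acc, val_acc, gap) in [("deep tree", deep), ("depth-3 tree", shallow)]:
    print(f"{name:<13} train {train_acc:.3f}  validation {val_acc:.3f}  gap {gap:.3f}")
```

A training score at or near 1.0 with a visibly lower validation score is the pattern to watch for.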

Teach — The confusion matrix and its metrics

Accuracy collapses everything into one number. The confusion matrix unfolds it, showing the four outcomes of a yes/no classifier:

	Predicted No	Predicted Yes
Actually No	True Negative	False Positive
Actually Yes	False Negative	True Positive

From those four boxes come the metrics that matter:

Precision = of everything the model flagged "Yes", how much was really Yes? (Punishes false alarms.)
Recall = of everything that was really Yes, how much did the model catch? (Punishes misses.)
F1 = the harmonic mean of precision and recall: 2 × precision × recall / (precision + recall). It is high only when both are high.
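The three definitions above can be computed directly from the confusion-matrix counts. This is a small sketch: precision_recall_f1 is a hypothetical helper, the counts are made up for illustration, and returning 0.0 when a metric is undefined is one common convention rather than the only option. F1 is computed as the harmonic mean of precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives, false negatives."""
    # Precision: of everything flagged "Yes", how much was really Yes?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that was really Yes, how much was caught?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of the two; low if either one is low.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up counts: 8 correct alarms, 2 false alarms, 4 misses.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision {p:.3f}  recall {r:.3f}  F1 {f1:.3f}")
# precision 0.800  recall 0.667  F1 0.727
```

True negatives appear in none of the three formulas, which is why these metrics stay informative when the "No" class is huge.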

The cancer model above had 99% accuracy but zero recall: it caught none of the real cases. That's why accuracy alone lies, especially on imbalanced data.
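The arithmetic behind that claim can be checked in a few lines. The sketch uses 1,000 made-up labels (10 with cancer, 990 without) and a "model" that always answers No; the data is invented purely to mirror the 1% scenario from the hook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 made-up scans: 10 with cancer (1), 990 without (0).
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that says "no cancer" to everyone.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)   # recall for the cancer class (label 1)

print(f"accuracy: {acc:.2f}")   # 0.99
print(f"recall  : {rec:.2f}")   # 0.00
```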

⚠ Watch out: precision and recall usually trade off — pushing one up drags the other down. Which matters more depends on the cost of the mistake: for cancer screening, a miss (low recall) is deadly, so you favour recall; for a spam filter, a false alarm that bins a real email is worse, so you favour precision. Never optimise one without naming the cost of the other.
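The trade-off is easiest to see by moving the decision threshold on a model's scores. The sketch below uses ten made-up labels and scores (not output from a real model) and three illustrative thresholds: a lower threshold flags more cases, which helps recall and tends to hurt precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and model scores, sorted by score for readability.
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.80, 0.90])

results = {}
for threshold in (0.25, 0.50, 0.75):
    # Flag "Yes" whenever the score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    results[threshold] = (precision, recall)
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
# threshold 0.25: precision 0.57, recall 1.00
# threshold 0.50: precision 0.75, recall 0.75
# threshold 0.75: precision 1.00, recall 0.50
```

For screening you would lean towards the low threshold; for a spam filter, towards the high one.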

Activity — Evaluate a model honestly

Let's evaluate a real classifier the professional way. We'll use the breast-cancer dataset, build the confusion matrix, and read the metrics behind the accuracy.

First, split and train:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)

print("train accuracy:", round(model.score(X_train, y_train), 3))
print("test accuracy :", round(model.score(X_test, y_test), 3))  # gap = overfitting check

Now go beyond accuracy — the confusion matrix and the full report:

from sklearn.metrics import confusion_matrix, classification_report

preds = model.predict(X_test)

print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds, target_names=data.target_names))

Now validate properly with cross-validation:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)   # 5 different train/test splits
print("per-fold:", scores.round(3))
print(f"mean {scores.mean():.3f}  ±  {scores.std():.3f}")

Now read it like a pro:

Is there a big gap between train and test accuracy? (Small gap = not overfitting.)
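In the confusion matrix, find the cancers the model missed. In this dataset malignant is label 0 and benign is label 1, so the missed cancers sit in the top-right cell (actually malignant, predicted benign). For a medical tool, which is worse: a false alarm or a missed cancer? Look at the recall for the malignant class.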
Cross-validation gave five scores, not one. Report the mean and the spread — a single split could have been lucky. Why is "0.95 ± 0.02" a more honest claim than "0.96"?
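When reading the matrix for this dataset, it helps to check which label means malignant before counting misses. The sketch below repeats the activity's split and model so it runs on its own, then pulls out the missed cancers explicitly. In load_breast_cancer, label 0 is malignant and label 1 is benign, so scikit-learn's default "positive" class (1) is benign; pos_label=0 asks for malignant recall instead. The variable names caught, missed and malignant_recall are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
for label, name in enumerate(data.target_names):
    print(label, "=", name)          # check the label coding first

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000, random_state=42).fit(X_train, y_train)
preds = model.predict(X_test)

# Rows are the true class, columns the predicted class, in the order given by labels.
cm = confusion_matrix(y_test, preds, labels=[0, 1])
caught = cm[0, 0]    # actually malignant, predicted malignant
missed = cm[0, 1]    # actually malignant, predicted benign: the dangerous mistakes

malignant_recall = recall_score(y_test, preds, pos_label=0)
print("cancers caught:", caught, " cancers missed:", missed)
print(f"recall for malignant: {malignant_recall:.3f}")
```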

You just evaluated a model the way a real ML researcher would.

Check yourself

Why keep a separate test set you touch only once? → Because any decision you make from it contaminates it — only truly-unseen data gives an honest estimate of real-world performance.
When does accuracy lie, and what do you use instead? → On imbalanced data — a model can score high by ignoring the rare class. Use the confusion matrix with precision, recall and F1.
What does cross-validation give you that one split doesn't? → Several scores from several splits, so you can report a mean and spread instead of one possibly-lucky number.

Wrap-up

You now evaluate like a professional: split honestly, watch the train/validation gap for overfitting, read the confusion matrix, choose precision or recall by the cost of the mistake, cross-validate for a stable estimate, and write it all in a model card. This is the difference between a number and the truth.

Try this at home: write a model card for the breast-cancer model — a short note listing the data used, the metrics (accuracy, precision, recall, F1), the per-group or per-class scores, and one honest limitation. Model cards are how professionals ship models responsibly; you now have all the numbers to fill one in.
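If you'd like a starting point, the sketch below assembles a plain-text draft from the numbers scikit-learn already gives you. The layout and field names are a hypothetical template made up for this exercise, not an official model card format, and the limitation line is left as a placeholder for you to fill in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
names = [str(n) for n in data.target_names]
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000, random_state=42).fit(X_train, y_train)

# output_dict=True returns the report as nested dictionaries instead of text.
report = classification_report(
    y_test, model.predict(X_test), target_names=names, output_dict=True
)

lines = [
    "MODEL CARD (draft)",
    "Model: LogisticRegression (scikit-learn)",
    f"Data: load_breast_cancer, {len(X_train)} train rows / {len(X_test)} test rows",
    f"Accuracy (test): {report['accuracy']:.3f}",
    "Per-class scores (test):",
]
for name in names:
    s = report[name]
    lines.append(
        f"  {name}: precision {s['precision']:.3f}, "
        f"recall {s['recall']:.3f}, F1 {s['f1-score']:.3f}"
    )
lines.append("Limitation: <write one honest limitation here>")

card = "\n".join(lines)
print(card)
```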

Tips & extra challenges

Watch out: never tune your model against the test set. If you keep changing settings until the test score looks good, you've turned your test set into a training set and your final number is fiction.
Want more? Try this: create an imbalanced dataset (e.g. keep only 5% of one class) and show that accuracy stays high while recall for the rare class collapses. Seeing it happen once makes the lesson permanent.
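A possible starting point for that challenge is sketched below. It keeps every benign case and a random 5% of the malignant ones, then prints accuracy next to malignant recall. The seed, the 70/30 split and the StandardScaler step are assumptions made here for illustration. Whether recall actually collapses depends on the model and the split, so no expected numbers are given; run it and read the two figures side by side.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # 0 = malignant, 1 = benign

# Keep all benign rows and a random 5% of the malignant rows.
rng = np.random.default_rng(0)
malignant_idx = np.flatnonzero(y == 0)
benign_idx = np.flatnonzero(y == 1)
n_keep = max(1, int(0.05 * len(malignant_idx)))
kept_malignant = rng.choice(malignant_idx, size=n_keep, replace=False)

keep = np.concatenate([benign_idx, kept_malignant])
X_imb, y_imb = X[keep], y[keep]
print(f"malignant share after subsampling: {(y_imb == 0).mean():.1%}")

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=0, stratify=y_imb
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(f"accuracy           : {accuracy_score(y_test, preds):.3f}")
print(f"recall (malignant) : {recall_score(y_test, preds, pos_label=0):.3f}")
```

With so few malignant rows in the test split, one extra miss moves recall a long way, which is itself part of the lesson.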

Vocabulary

Term	Meaning
Validation set	Held-out data used to tune choices before the final test
Overfitting	Memorising training data — high train accuracy, low test accuracy
Confusion matrix	A table of true/false positives and negatives for a classifier
Precision / Recall	Correctness of "Yes" predictions / share of real "Yes" cases caught
Cross-validation	Averaging performance over several train/test splits for a stable estimate

Resources

Google Colab — run today's evaluation here.
scikit-learn — Model evaluation — the reference for every metric you used today.
Google — Model Cards — how professionals document a model's performance, limits and fairness.

Practice set

Practise on your own — work these easy → hard. Answers follow each arrow.

1. Which set is sacred? Which split do you touch only once, at the very end? → The test set — every peek that changes a decision contaminates it.

2. Diagnose it. Train accuracy 99%, test accuracy 70%. What's happening? → Overfitting — the model memorised the training data instead of learning the general pattern.

3. When accuracy lies. A fraud detector is 99% accurate but 99% of transactions are legitimate. Why is that number worthless? → It can score 99% by calling everything legitimate, catching zero fraud — accuracy hides the failure on the rare class. Check recall.

4. Precision vs recall. For cancer screening, which do you favour and why? → Recall — missing a real cancer (a false negative) is far more costly than a false alarm, so you want to catch as many true cases as possible.

5. Read the matrix. In a confusion matrix, which cell holds the real "Yes" cases the model missed? → The false negatives — actually Yes, predicted No.

6. Cross-validate (harder, code). Write the line that runs 5-fold cross-validation on model, X, y and prints the mean score. → print(cross_val_score(model, X, y, cv=5).mean()) (any correct cross_val_score call with cv=5 and a mean earns it).

Going deeper (optional)

Optional — for when you want to know why a validation set isn't enough on its own.

Why cross-validation beats a single validation split. One validation set is itself a random slice — tune against it enough and you start overfitting to that particular slice, quietly memorising its quirks. k-fold cross-validation fixes this by rotating the validation role across k different slices: each data point gets to be validation exactly once, and you average the k scores. The payoff is two-fold — a more stable estimate (less dependent on one lucky split) and a spread (the standard deviation) that tells you how much the result wobbles. When your capstone asks "how good is this model, really?", the honest answer is almost never one number — it's a mean plus a spread, earned by cross-validation.
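The rotation can be made visible on a tiny toy array. The sketch below uses KFold on ten made-up rows and counts how often each row lands in the validation fold; the shuffle and random_state settings are illustrative. Note that when you pass cv=5 to cross_val_score with a classifier, scikit-learn uses the stratified variant (StratifiedKFold), which follows the same rotation while keeping class proportions similar in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)                  # 10 toy rows, 2 features each
times_in_validation = np.zeros(len(X), dtype=int)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Each fold trains on train_idx and validates on val_idx.
    times_in_validation[val_idx] += 1
    print(f"fold {fold}: validation rows {val_idx.tolist()}")

print("times each row was validation:", times_in_validation.tolist())
# times each row was validation: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```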

Common mistakes & fixes

Mistake: Reporting training accuracy as if it were real performance. → Fix: always report the test score — training accuracy is marking your own homework.
Mistake: Trusting accuracy on imbalanced data. → Fix: read the confusion matrix and use precision, recall and F1 for the rare class.
Mistake: Tuning settings against the test set. → Fix: tune on the validation set; keep the test set locked until the single final check.
Mistake: Reporting one accuracy from one split as the truth. → Fix: cross-validate and report a mean ± spread — one split can be lucky.
Mistake: Optimising precision or recall without saying why. → Fix: name the cost of each mistake for your use case, then choose which metric to favour.

What's next

Session 13 — An End-to-End ML Project (the start of Unit 4 — Build, Deploy & Showcase): you've now got the full research toolkit — question, reproduce, audit for fairness, evaluate honestly. Next you put it all together on one real project, start to finish: from raw data to a trained, evaluated, documented model you're proud to show.