Session 12 — Evaluate Like a Pro
Duration: 75 min · Format: live online
What you'll learn: by the end, you can split data properly, spot overfitting, read a confusion matrix, compute precision/recall/F1 and know when accuracy is misleading, run cross-validation, and summarise it all in a model card.
Soft skill focus — Critical thinking
Today you'll also practise critical thinking. Evaluation is where you turn your critical eye on your own model, the one you want to succeed. The pro move is to distrust a good score until you've checked how it was measured.
- Try this: every time you see a high number today, ask "high on what data, and could this metric be hiding something?" A 99% accuracy on imbalanced data can be worse than useless — train that reflex now.
- Think about: "Am I measuring what actually matters here, or just the number that's easiest to make look good?"
What you'll need
- Google Colab open in a tab, ready for a new notebook.
- The diagram below open — it maps the whole evaluation flow you'll build today.
- Your reproduction skills from Session 10: you'll reuse train_test_split and sklearn.
Hook
A cancer-detection model announces 99% accuracy. Impressive, until you learn that only 1% of the scans actually have cancer. A model that says "no cancer" to everyone also scores 99%, and it catches exactly zero real cases. It would be 99% accurate and perfectly useless.
This is the trap that catches beginners and embarrasses professionals: a single number that sounds great while hiding a disaster. Today you learn the tools that see through it — the split that stops you fooling yourself, the confusion matrix that shows the real mistakes, and the metrics that tell you whether your model is genuinely any good.
Teach — Train, validation, test
You never judge a model on the data it learned from — that's like marking your own homework. So you split your data three ways:
- Train (~60–70%) — the model learns from this.
- Validation (~15–20%) — you tune choices here (which model, which settings). You peek at this a lot, so it can't be your final judge.
- Test (~15–20%) — locked in a vault, touched once, at the very end. This is your honest estimate of real-world performance.
The golden rule: the test set is sacred. Every time you make a decision based on the test score, you contaminate it — and your reported number becomes a lie.
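The three-way split above can be sketched with two calls to train_test_split. This is a minimal sketch, not the only way to do it: the 60/20/20 proportions, the stratify argument and the random_state values are assumptions chosen for illustration, and the breast-cancer dataset simply stands in for your own data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: carve off the test set (20%) and lock it away.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Step 2: split what is left into train and validation.
# 0.25 of the remaining 80% is 20% of the full data, so the result is roughly 60/20/20.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

total = len(X)
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(f"{name:<10} {len(part):>4} rows ({len(part) / total:.0%})")
```

From here on you would fit on X_train, compare options on X_val, and leave X_test untouched until the final check.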
Teach — Overfitting
Overfitting is when a model memorises the training data instead of learning the general pattern. The tell-tale sign: high training accuracy, low test accuracy. It aced the practice questions and flunked the real exam.
The train/validation gap is your early-warning system. If training accuracy is 99% and validation is 72%, your model isn't smart; it's memorising. The usual cures are a simpler model, more data, or regularisation.
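One way to watch the gap appear is to let a model grow more complex and print both scores side by side. The sketch below is an illustration under assumptions: a decision tree and its max_depth setting are used as the "complexity" dial (the session's own activity uses logistic regression), and the exact numbers depend on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

gaps = {}
for depth in (1, 3, None):  # None lets the tree keep growing until it fits the training rows
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    gaps[depth] = train_acc - val_acc
    print(f"max_depth={depth}: train={train_acc:.3f}  val={val_acc:.3f}  gap={gaps[depth]:.3f}")
```

A deeper tree typically pushes training accuracy towards 1.0; if validation accuracy does not follow, the growing gap is the memorising you are looking for.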
Teach — The confusion matrix and its metrics
Accuracy collapses everything into one number. The confusion matrix unfolds it, showing the four outcomes of a yes/no classifier:
| | Predicted No | Predicted Yes |
|---|---|---|
| Actually No | True Negative | False Positive |
| Actually Yes | False Negative | True Positive |
From those four boxes come the metrics that matter:
- Precision = of everything the model flagged "Yes", how much was really Yes? (Punishes false alarms.)
- Recall = of everything that was really Yes, how much did the model catch? (Punishes misses.)
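- F1 = the harmonic mean of precision and recall, 2 × precision × recall / (precision + recall). It is high only when both are high, so it balances the two in one number.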
The cancer model from the hook had 99% accuracy but zero recall: it caught none of the real cases. That's why accuracy alone lies, especially on imbalanced data.
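To make the arithmetic concrete, the four cells can be turned into the metrics by hand. The helper name metrics_from_counts and the counts below are hypothetical (1,000 scans with 1% cancer, matching the hook); returning 0.0 when a score is undefined is a convention assumed here, not a rule.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # If a denominator is zero the score is undefined; this sketch reports 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical: 1,000 scans, 10 with cancer. The "always say no" model flags nothing.
always_no = metrics_from_counts(tn=990, fp=0, fn=10, tp=0)
print(always_no)
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}

# Hypothetical: a model that catches 8 of the 10 cancers at the price of 20 false alarms.
useful = metrics_from_counts(tn=970, fp=20, fn=2, tp=8)
print({name: round(value, 3) for name, value in useful.items()})
```

The second model has the lower accuracy of the two, yet it is the only one that finds any cancers.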
⚠ Watch out: precision and recall usually trade off — pushing one up drags the other down. Which matters more depends on the cost of the mistake: for cancer screening, a miss (low recall) is deadly, so you favour recall; for a spam filter, a false alarm that bins a real email is worse, so you favour precision. Never optimise one without naming the cost of the other.
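The trade-off can be seen by moving the decision threshold on a model's predicted probabilities. This is a sketch under assumptions: the labels are flipped so that 1 means malignant (the "Yes" class), a StandardScaler is added in front of the logistic regression, and the three thresholds are arbitrary illustrative values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # relabel so that 1 = malignant, the "Yes" class

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Predicted probability of the malignant class for each validation row.
proba = clf.predict_proba(X_val)[:, 1]

rows = []
for threshold in (0.1, 0.5, 0.9):
    preds = (proba >= threshold).astype(int)
    p = precision_score(y_val, preds, zero_division=0)
    r = recall_score(y_val, preds, zero_division=0)
    rows.append((threshold, p, r))
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can only shrink the set of rows flagged "Yes", so recall never goes up as the threshold rises; precision usually, though not always, moves the other way. The threshold is chosen on the validation set, not the test set.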
Activity — Evaluate a model honestly
Let's evaluate a real classifier the professional way. We'll use the breast-cancer dataset, build the confusion matrix, and read the metrics behind the accuracy.
First, split and train:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)
print("train accuracy:", round(model.score(X_train, y_train), 3))
print("test accuracy :", round(model.score(X_test, y_test), 3)) # gap = overfitting check
Now go beyond accuracy — the confusion matrix and the full report:
from sklearn.metrics import confusion_matrix, classification_report
preds = model.predict(X_test)
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds, target_names=data.target_names))
Now validate properly with cross-validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5) # 5 different train/test splits
print("per-fold:", scores.round(3))
print(f"mean {scores.mean():.3f} ± {scores.std():.3f}")
Now read it like a pro:
- Is there a big gap between train and test accuracy? (Small gap = not overfitting.)
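- In the confusion matrix, find the cancers the model missed. In this dataset malignant is label 0 and benign is label 1, so scikit-learn puts the missed cancers (actually malignant, predicted benign) in the top-right cell. For a medical tool, which is worse: a false alarm or a missed cancer? Look at the recall for the malignant class.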
- Cross-validation gave five scores, not one. Report the mean and the spread — a single split could have been lucky. Why is "0.95 ± 0.02" a more honest claim than "0.96"?
You just evaluated a model the way a real ML researcher would.
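If you want the missed cancers as a single number rather than reading them off the printed matrix, a sketch like the one below can help. It rebuilds the same split and model so that it runs on its own; the variable names missed_cancers and false_alarms are chosen here for illustration. In this dataset label 0 is malignant, which is why pos_label=0 is passed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()  # target 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Rows are the actual class, columns the predicted class, both in the order [0, 1].
cm = confusion_matrix(y_test, preds, labels=[0, 1])
missed_cancers = cm[0, 1]  # actually malignant, predicted benign
false_alarms = cm[1, 0]    # actually benign, predicted malignant

# Recall for the malignant class: share of real cancers the model caught.
malignant_recall = recall_score(y_test, preds, pos_label=0)

print("missed cancers  :", missed_cancers)
print("false alarms    :", false_alarms)
print("malignant recall:", round(malignant_recall, 3))
```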
Check yourself
- Why keep a separate test set you touch only once? → Because any decision you make from it contaminates it — only truly-unseen data gives an honest estimate of real-world performance.
- When does accuracy lie, and what do you use instead? → On imbalanced data — a model can score high by ignoring the rare class. Use the confusion matrix with precision, recall and F1.
- What does cross-validation give you that one split doesn't? → Several scores from several splits, so you can report a mean and spread instead of one possibly-lucky number.
Wrap-up
You now evaluate like a professional: split honestly, watch the train/validation gap for overfitting, read the confusion matrix, choose precision or recall by the cost of the mistake, cross-validate for a stable estimate, and write it all in a model card. This is the difference between a number and the truth.
- Try this at home: write a model card for the breast-cancer model — a short note listing the data used, the metrics (accuracy, precision, recall, F1), the per-group or per-class scores, and one honest limitation. Model cards are how professionals ship models responsibly; you now have all the numbers to fill one in.
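As a starting point for that model card, the numbers can be pulled straight out of classification_report. This is only a sketch: the layout of the card is an assumption made here (real model cards follow their own templates), and the limitation line is deliberately left as a placeholder for you to write.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)

class_names = [str(name) for name in data.target_names]
# output_dict=True returns the report as nested dictionaries instead of text.
report = classification_report(
    y_test, model.predict(X_test), target_names=class_names, output_dict=True
)

lines = [
    "MODEL CARD (sketch)",
    "Model: LogisticRegression(max_iter=5000)",
    f"Data: scikit-learn breast-cancer dataset, {len(X_train)} train / {len(X_test)} test rows",
    f"Accuracy (test): {report['accuracy']:.3f}",
]
for name in class_names:
    scores = report[name]
    lines.append(
        f"{name}: precision={scores['precision']:.3f} "
        f"recall={scores['recall']:.3f} f1={scores['f1-score']:.3f} "
        f"n={int(scores['support'])}"
    )
lines.append("Limitation: <write one honest limitation here>")

model_card = "\n".join(lines)
print(model_card)
```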
Tips & extra challenges
- Watch out: never tune your model against the test set. If you keep changing settings until the test score looks good, you've turned your test set into a training set and your final number is fiction.
- Want more? Try this: create an imbalanced dataset (e.g. keep only 5% of one class) and show that accuracy stays high while recall for the rare class collapses. Seeing it happen once makes the lesson permanent.
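A possible starting point for the imbalanced-data challenge is sketched below. It keeps 5% of the malignant rows and scores a baseline that always predicts the majority class; DummyClassifier stands in for "a model that ignores the rare class" and is an assumption of this sketch. Swapping in your own classifier and comparing is the actual exercise.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign

rng = np.random.default_rng(0)
malignant_idx = np.flatnonzero(y == 0)
benign_idx = np.flatnonzero(y == 1)

# Keep only 5% of the malignant rows so that class becomes rare.
n_keep = max(2, int(0.05 * len(malignant_idx)))
keep = rng.choice(malignant_idx, size=n_keep, replace=False)
idx = np.concatenate([keep, benign_idx])
X_imb, y_imb = X[idx], y[idx]

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=0, stratify=y_imb
)

# Baseline that always predicts the most frequent training class (benign).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
preds = baseline.predict(X_test)

acc = accuracy_score(y_test, preds)
rare_recall = recall_score(y_test, preds, pos_label=0, zero_division=0)
print(f"accuracy: {acc:.3f}   recall on malignant: {rare_recall:.3f}")
```

The baseline never predicts malignant, so its recall on the rare class is zero while its accuracy equals the benign share of the test set.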
Vocabulary
| Term | Meaning |
|---|---|
| Validation set | Held-out data used to tune choices before the final test |
| Overfitting | Memorising training data — high train accuracy, low test accuracy |
| Confusion matrix | A table of true/false positives and negatives for a classifier |
| Precision / Recall | Correctness of "Yes" predictions / share of real "Yes" cases caught |
| Cross-validation | Averaging performance over several train/test splits for a stable estimate |
Resources
- Google Colab — run today's evaluation here.
- scikit-learn — Model evaluation — the reference for every metric you used today.
- Google — Model Cards — how professionals document a model's performance, limits and fairness.
Practice set
Practise on your own — work these easy → hard. Answers follow each arrow.
1. Which set is sacred? Which split do you touch only once, at the very end? → The test set — every peek that changes a decision contaminates it.
2. Diagnose it. Train accuracy 99%, test accuracy 70%. What's happening? → Overfitting — the model memorised the training data instead of learning the general pattern.
3. When accuracy lies. A fraud detector is 99% accurate but 99% of transactions are legitimate. Why is that number worthless? → It can score 99% by calling everything legitimate, catching zero fraud — accuracy hides the failure on the rare class. Check recall.
4. Precision vs recall. For cancer screening, which do you favour and why? → Recall — missing a real cancer (a false negative) is far more costly than a false alarm, so you want to catch as many true cases as possible.
5. Read the matrix. In a confusion matrix, which cell holds the real "Yes" cases the model missed? → The false negatives — actually Yes, predicted No.
6. Cross-validate (harder, code). Write the line that runs 5-fold cross-validation on model, X, y and prints the mean score. → print(cross_val_score(model, X, y, cv=5).mean()). (Any correct cross_val_score with cv=5 and a mean earns it.)
Going deeper (optional)
Optional — for when you want to know why a validation set isn't enough on its own.
Why cross-validation beats a single validation split. One validation set is itself a random slice — tune against it enough and you start overfitting to that particular slice, quietly memorising its quirks. k-fold cross-validation fixes this by rotating the validation role across k different slices: each data point gets to be validation exactly once, and you average the k scores. The payoff is two-fold — a more stable estimate (less dependent on one lucky split) and a spread (the standard deviation) that tells you how much the result wobbles. When your capstone asks "how good is this model, really?", the honest answer is almost never one number — it's a mean plus a spread, earned by cross-validation.
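The rotation described above can be written out by hand to check that each row really is used for validation exactly once. This is a sketch under assumptions: plain KFold with shuffling is used for clarity (for a classifier, cross_val_score with cv=5 uses a stratified variant by default), and a StandardScaler is added in front of the logistic regression.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

times_validated = np.zeros(len(X), dtype=int)  # how often each row served as validation
scores = []
for train_idx, val_idx in kfold.split(X):
    # A fresh model per fold, so nothing leaks from one fold to the next.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))
    times_validated[val_idx] += 1

scores = np.array(scores)
print("every row validated exactly once:", bool((times_validated == 1).all()))
print("per-fold:", scores.round(3))
print(f"mean {scores.mean():.3f} ± {scores.std():.3f}")
```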
Common mistakes & fixes
- Mistake: Reporting training accuracy as if it were real performance. → Fix: always report the test score — training accuracy is marking your own homework.
- Mistake: Trusting accuracy on imbalanced data. → Fix: read the confusion matrix and use precision, recall and F1 for the rare class.
- Mistake: Tuning settings against the test set. → Fix: tune on the validation set; keep the test set locked until the single final check.
- Mistake: Reporting one accuracy from one split as the truth. → Fix: cross-validate and report a mean ± spread — one split can be lucky.
- Mistake: Optimising precision or recall without saying why. → Fix: name the cost of each mistake for your use case, then choose which metric to favour.
What's next
Session 13 — An End-to-End ML Project (the start of Unit 4 — Build, Deploy & Showcase): you've now got the full research toolkit — question, reproduce, audit for fairness, evaluate honestly. Next you put it all together on one real project, start to finish: from raw data to a trained, evaluated, documented model you're proud to show.