Ibnovate Course 3 · The Future Builders
⏱ 75 min · Live session

Session 12 — Evaluate Like a Pro

Duration: 75 min · Format: live online

What you'll learn: by the end, you can split data properly, spot overfitting, read a confusion matrix, compute precision/recall/F1 and know when accuracy is misleading, run cross-validation, and summarise it all in a model card.

Soft skill focus — Critical thinking

Today you'll also build your critical thinking. Evaluation is where you turn a critical eye on your own model, the one you want to succeed. The pro move is to distrust a good score until you've checked how it was measured.

What you'll need

Hook

A cancer-detection model announces 99% accuracy. Impressive, until you learn that only 1% of the scans actually have cancer. A model that says "no cancer" to everyone also scores 99%, and it catches exactly zero real cases. It would be 99% accurate and completely useless.

This is the trap that catches beginners and embarrasses professionals: a single number that sounds great while hiding a disaster. Today you learn the tools that see through it — the split that stops you fooling yourself, the confusion matrix that shows the real mistakes, and the metrics that tell you whether your model is genuinely any good.

Teach — Train, validation, test

Split data into train, validation and test; then inspect the mistakes with a confusion matrix

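You never judge a model on the data it learned from; that's like marking your own homework. So you split your data three ways: a training set the model learns from, a validation set you use to tune your choices, and a test set you touch only once, at the very end.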

The golden rule: the test set is sacred. Every time you make a decision based on the test score, you contaminate it — and your reported number becomes a lie.
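A three-way split can be sketched by calling scikit-learn's train_test_split twice. The 60/20/20 proportions, the variable names and the use of the breast-cancer dataset here are illustrative assumptions, not a rule from this session.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: carve off the test set (20%) and lock it away until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: split what is left into train and validation.
# 0.25 of the remaining 80% is 20% of the whole, giving roughly 60/20/20.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Tune against X_val as often as you like; score on X_test once, when every decision has been made.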

Teach — Overfitting

Overfitting is when a model memorises the training data instead of learning the general pattern. The tell-tale sign: high training accuracy, low test accuracy. It aced the practice questions and flunked the real exam.

The train/validation gap is your early-warning system. If training accuracy is 99% and validation is 72%, your model isn't smart — it's memorising. Less complexity, more data, or regularisation is the cure.
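To see the gap in practice, here is a minimal sketch that trains the same kind of model at two complexity levels and compares training and validation accuracy. A decision tree is swapped in as an assumption, because its max_depth setting is an easy complexity dial; the depths and split sizes are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

gaps = {}
for depth in (None, 3):  # None = keep growing until the training data is fitted
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    gaps[depth] = train_acc - val_acc
    print(f"max_depth={depth}: train={train_acc:.3f} val={val_acc:.3f} gap={gaps[depth]:.3f}")
```

Compare the two printed gaps: the unrestricted tree can fit the training data almost perfectly, so whatever it loses on validation shows up directly as the gap.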

Teach — The confusion matrix and its metrics

Accuracy collapses everything into one number. The confusion matrix unfolds it, showing the four outcomes of a yes/no classifier:

| | Predicted No | Predicted Yes |
| Actually No | True Negative | False Positive |
| Actually Yes | False Negative | True Positive |

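From those four boxes come the metrics that matter: precision (of the cases predicted "Yes", the share that really are "Yes"), recall (of the real "Yes" cases, the share the model caught), and F1 (the harmonic mean of precision and recall).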

The cancer model above had 99% accuracy but zero recall: it caught none of the real cases. That's why accuracy alone lies, especially on imbalanced data.
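The arithmetic behind that claim can be checked by hand. The sketch below computes the standard metrics from the four confusion-matrix cells; the helper name metrics_from_counts and the counts (1,000 scans, 10 with cancer) are hypothetical, chosen only to match the 1% rate in the hook.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision is undefined when the model never predicts Yes; report 0.0 here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall is undefined when there are no real Yes cases; report 0.0 here.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical numbers: 1,000 scans, 10 with cancer,
# and a model that answers "no cancer" every time.
always_no = metrics_from_counts(tp=0, fp=0, fn=10, tn=990)
print(always_no)
```

Accuracy comes out at 0.99 while recall and F1 are both 0.0, which is the whole trap in three numbers.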

⚠ Watch out: precision and recall usually trade off — pushing one up drags the other down. Which matters more depends on the cost of the mistake: for cancer screening, a miss (low recall) is deadly, so you favour recall; for a spam filter, a false alarm that bins a real email is worse, so you favour precision. Never optimise one without naming the cost of the other.
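One common way to move along that trade-off is to change the probability threshold at which the model says "Yes". This is a sketch under assumptions: the data is synthetic (make_classification with about 10% "Yes" cases) and the three thresholds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: roughly 10% of samples are the "Yes" class (label 1).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_yes = clf.predict_proba(X_val)[:, 1]  # estimated probability of "Yes"

results = {}
for threshold in (0.8, 0.5, 0.2):
    preds = (proba_yes >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_val, preds, zero_division=0),
        recall_score(y_val, preds, zero_division=0),
    )
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold can only keep recall the same or raise it, because more cases get flagged. Precision usually falls as a result, though that is not guaranteed on every dataset.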

Activity — Evaluate a model honestly

Let's evaluate a real classifier the professional way. We'll use the breast-cancer dataset, build the confusion matrix, and read the metrics behind the accuracy.

First, split and train:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000, random_state=42)
model.fit(X_train, y_train)

print("train accuracy:", round(model.score(X_train, y_train), 3))
print("test accuracy :", round(model.score(X_test, y_test), 3))  # gap = overfitting check

Now go beyond accuracy — the confusion matrix and the full report:

from sklearn.metrics import confusion_matrix, classification_report

preds = model.predict(X_test)

print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds, target_names=data.target_names))

Now validate properly with cross-validation:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)   # 5 different train/test splits
print("per-fold:", scores.round(3))
print(f"mean {scores.mean():.3f}  ±  {scores.std():.3f}")

Now read it like a pro:

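  1. Is there a big gap between train and test accuracy? (A small gap suggests the model is not overfitting badly.)
  2. In the confusion matrix, find the false negatives, the cancers the model missed (treating malignant as the "Yes" class). In this dataset label 0 is malignant and label 1 is benign, so the missed cancers sit in the first row, second column: actually malignant, predicted benign. For a medical tool, which is worse: a false positive or a false negative? Look at the recall for the malignant class.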
  3. Cross-validation gave five scores, not one. Report the mean and the spread — a single split could have been lucky. Why is "0.95 ± 0.02" a more honest claim than "0.96"?

You just evaluated a model the way a real ML researcher would.
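If you would rather read the missed cancers off in code than by eye, the following sketch repeats the activity's split and model and then pulls out the relevant cells. The variable names missed_cancers, false_alarms and malignant_recall are illustrative; the label order (0 = malignant, 1 = benign) is how scikit-learn ships this dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000, random_state=42).fit(X_train, y_train)
preds = model.predict(X_test)

# In this dataset label 0 is malignant and label 1 is benign.
# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_test, preds, labels=[0, 1])
missed_cancers = cm[0, 1]  # actually malignant, predicted benign
false_alarms = cm[1, 0]    # actually benign, predicted malignant

# Recall for the malignant class: share of real cancers the model caught.
malignant_recall = recall_score(y_test, preds, pos_label=0)

print("missed cancers  :", missed_cancers)
print("false alarms    :", false_alarms)
print("malignant recall:", round(malignant_recall, 3))
```

Setting pos_label=0 matters: without it, recall_score treats label 1 (benign) as the "Yes" class and reports a different number.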

Check yourself

  1. Why keep a separate test set you touch only once? → Because any decision you make from it contaminates it. Only truly unseen data gives an honest estimate of real-world performance.
  2. When does accuracy lie, and what do you use instead? → On imbalanced data — a model can score high by ignoring the rare class. Use the confusion matrix with precision, recall and F1.
  3. What does cross-validation give you that one split doesn't? → Several scores from several splits, so you can report a mean and spread instead of one possibly-lucky number.

Wrap-up

You now evaluate like a professional: split honestly, watch the train/validation gap for overfitting, read the confusion matrix, choose precision or recall by the cost of the mistake, cross-validate for a stable estimate, and write it all in a model card. This is the difference between a number and the truth.

Tips & extra challenges

Vocabulary

| Term | Meaning |
| Validation set | Held-out data used to tune choices before the final test |
| Overfitting | Memorising training data: high train accuracy, low test accuracy |
| Confusion matrix | A table of true/false positives and negatives for a classifier |
| Precision / Recall | Correctness of "Yes" predictions / share of real "Yes" cases caught |
| Cross-validation | Averaging performance over several train/test splits for a stable estimate |

Resources

Practice set

Practise on your own — work these easy → hard. Answers follow each arrow.

1. Which set is sacred? Which split do you touch only once, at the very end? → The test set — every peek that changes a decision contaminates it.

2. Diagnose it. Train accuracy 99%, test accuracy 70%. What's happening? → Overfitting — the model memorised the training data instead of learning the general pattern.

3. When accuracy lies. A fraud detector is 99% accurate but 99% of transactions are legitimate. Why is that number worthless? → It can score 99% by calling everything legitimate, catching zero fraud — accuracy hides the failure on the rare class. Check recall.

4. Precision vs recall. For cancer screening, which do you favour and why? → Recall — missing a real cancer (a false negative) is far more costly than a false alarm, so you want to catch as many true cases as possible.

5. Read the matrix. In a confusion matrix, which cell holds the real "Yes" cases the model missed? → The false negatives — actually Yes, predicted No.

6. Cross-validate (harder, code). Write the line that runs 5-fold cross-validation on model, X, y and prints the mean score. → print(cross_val_score(model, X, y, cv=5).mean()). (Any correct cross_val_score with cv=5 and a mean earns it.)

Going deeper (optional)

Optional — for when you want to know why a validation set isn't enough on its own.

Why cross-validation beats a single validation split. One validation set is itself a random slice — tune against it enough and you start overfitting to that particular slice, quietly memorising its quirks. k-fold cross-validation fixes this by rotating the validation role across k different slices: each data point gets to be validation exactly once, and you average the k scores. The payoff is two-fold — a more stable estimate (less dependent on one lucky split) and a spread (the standard deviation) that tells you how much the result wobbles. When your capstone asks "how good is this model, really?", the honest answer is almost never one number — it's a mean plus a spread, earned by cross-validation.
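The rotation described above can be written out by hand to check that each data point really is used for validation exactly once. This is a sketch, not the only way to do it: StratifiedKFold with shuffle=True and the counter times_validated are choices made here for illustration, and cross_val_score performs the same kind of rotation for you.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

times_validated = np.zeros(len(X), dtype=int)  # how often each point was validation
fold_scores = []
for train_idx, val_idx in folds.split(X, y):
    model = LogisticRegression(max_iter=5000)  # fresh model for every fold
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))
    times_validated[val_idx] += 1

fold_scores = np.array(fold_scores)
print("each point validated once:", bool((times_validated == 1).all()))
print(f"mean {fold_scores.mean():.3f}  ±  {fold_scores.std():.3f}")
```

The last line is the claim you report: a mean together with its spread across the five folds.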

Common mistakes & fixes

What's next

Session 13 — An End-to-End ML Project (the start of Unit 4 — Build, Deploy & Showcase): you've now got the full research toolkit — question, reproduce, audit for fairness, evaluate honestly. Next you put it all together on one real project, start to finish: from raw data to a trained, evaluated, documented model you're proud to show.

Ibnovate · Build · Innovate