⏱ 75 min · Live session

Session 10 — Reproduce a Result

Duration: 75 min · Format: live online

What you'll learn: by the end, you can take a claimed result, load the data, run the stated method yourself in Colab, compare your numbers to theirs, and report honestly what you found — including when it doesn't match.

Soft skill focus — Resilience

Today you'll also grow Resilience. Reproduction is where research gets humbling: your first run will probably not match, the code will throw errors, and the temptation to fudge the numbers is real. Resilience is staying honest and steady through all of that.

Try this: when your number comes out "wrong", don't panic or hide it — treat it as a clue. Write down what you expected, what you got, and one reason they might differ. A mismatch you investigate is worth more than a match you got by luck.
Think about: "When a result doesn't go my way, do I bury it or dig into it? Which one makes me a real scientist?"

What you'll need

Google Colab open in a tab, signed in, ready for a new notebook.
The diagram below open — you'll come back to "change one thing" every time you run a comparison.
A willingness to be wrong on the first try. That's normal and it's the point.

Hook

In 2015, researchers published their attempt to reproduce 100 published psychology experiments. Fewer than half came out the same. It shook the field, and it's why "reproducibility" became one of the most important words in science, AI included.

A result that only works once, in one lab, on one lucky run, isn't knowledge — it's an anecdote. The way you tell the difference is simple and brutal: do it again yourself. Today you reproduce a real machine-learning result from scratch, and you'll feel exactly why this is the truest test there is.

Teach — What reproducibility actually means

A result is reproducible if someone else, following the same method on the same data, gets the same answer. It sounds obvious, but it's the thing most claims quietly fail.

Three habits make your own work reproducible:

Fix the randomness. Models shuffle data and start from random numbers. Set a random seed (random_state=42) so every run is identical — otherwise your "result" changes each time you press play.
Write down every setting. Data version, model, split ratio, parameters. If you can't list them, you can't repeat them — and neither can anyone else.
Report the exact numbers. Not "about 95%" — the number you actually got, to the decimal you measured.
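The three habits can be sketched in a few lines. This is a minimal illustration rather than part of the session notebook: the toy `samples` list, the `settings` dictionary and the made-up counts are hypothetical names and values chosen for the example, and it assumes scikit-learn is available (as it is in Colab).

```python
from sklearn.model_selection import train_test_split

samples = list(range(10))  # toy stand-in for a dataset (hypothetical)

# Habit 1: fix the randomness. Same seed -> same split, every time.
_, test_a = train_test_split(samples, test_size=0.2, random_state=42)
_, test_b = train_test_split(samples, test_size=0.2, random_state=42)
same_split = (test_a == test_b)
print("same seed gives the same test set:", same_split)

# Habit 2: write down every setting in one place, next to the result.
settings = {
    "data": "toy list 0..9",
    "test_size": 0.2,
    "random_state": 42,
}
print("settings:", settings)

# Habit 3: report the exact number, not "about 93%".
correct, total = 28, 30  # made-up counts for illustration, not a real result
print("accuracy:", round(correct / total, 4))
```

Keeping the settings in one dictionary means the record you report and the values you ran with cannot drift apart.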

Teach — Control your variables

A fair test: change one thing, keep everything else the same

Reproducing isn't just re-running — it's re-running fairly. When you compare "their method" to "another method", the only thing allowed to differ is the method. Same data, same split, same seed, same test set. Change one thing; hold everything else constant.

If you change the method and the train/test split at the same time and the score moves, you've learned nothing about the method. This is the single most common way people fool themselves — and the fair-test diagram is your guard against it.
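A fair test can be sketched like this, assuming the Iris data used in today's activity. The second method (`KNeighborsClassifier`) and the names `SEED`, `methods` and `scores` are choices made for this example only; the point is the shape of the code: split once, then change only the model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

SEED = 42  # one seed, written in one place

X, y = load_iris(return_X_y=True)

# Split ONCE. Both methods get exactly the same training and test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# The only thing that differs between the two runs is the model.
methods = {
    "logistic regression": LogisticRegression(max_iter=200),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in methods.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")
```

If the split were created inside the loop with a different seed per model, a score gap could come from the split, and the comparison would tell you nothing about the methods.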

Teach — Honest reporting when it doesn't match

Here's the rule that separates scientists from salespeople: if your number doesn't match the claim, you report your number — not theirs.

A mismatch isn't failure. It's information. Maybe they used a different data version, a preprocessing step they didn't mention, or a lucky seed. Your job is to state clearly: "The paper claims X. Following their stated method, I got Y. Here is one likely reason for the gap." That single honest paragraph is worth more than any number you could have faked.
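One way to make that paragraph a habit is to generate it from your own numbers. The helper below is a hypothetical sketch (`reproduction_report` is not from any library), and the example values are placeholders, not measured results.

```python
def reproduction_report(claimed, reproduced, n_test, likely_reason):
    """Build the honest-report paragraph from the claim and your own number."""
    gap_points = (reproduced - claimed) * 100          # gap in percentage points
    gap_samples = abs(reproduced - claimed) * n_test   # roughly how many test items
    return (
        f"The paper claims {claimed:.2%}. "
        f"Following their stated method, I got {reproduced:.2%} "
        f"({gap_points:+.2f} percentage points, "
        f"about {gap_samples:.1f} of {n_test} test samples). "
        f"One likely reason for the gap: {likely_reason}"
    )


# Placeholder values for illustration only.
report = reproduction_report(
    claimed=0.95,
    reproduced=0.91,
    n_test=100,
    likely_reason="a different data version.",
)
print(report)
```

Expressing the gap as a number of test samples helps you judge whether a mismatch is large or just one or two examples.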

⚠ Watch out: never tweak your experiment after seeing the answer just to hit the number you wanted (people call this "p-hacking" or "shopping for a result"). Decide your method first, run it once, and report what came out — even if it's disappointing.

Activity — Reproduce a classic claim

The claim you'll test: "On the classic Iris flower dataset, a simple logistic regression classifier reaches around 95% accuracy." Let's find out if that holds up when you run it.

First, load the data and check what you've got:

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print("shape:", X.shape)          # how many samples, how many features?
print("classes:", iris.target_names)
print(X.head())

Now run the stated method — with the variables controlled:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fix the split AND the seed so this is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# random_state only affects some solvers; with the default solver it is harmless
model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print("my reproduced accuracy:", round(acc, 4))

Now compare and report honestly:

What accuracy did you get? Write it down to 4 decimals.
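Does it match the claim of "around 95%"? The test set holds only 30 flowers, so each one is worth about 3.3 percentage points and your score can only move in steps of that size. Note your exact number, not "about right".
Change only the seed to random_state=0 in both the split and the model, and re-run. Did the number move? By how much? A different seed puts different flowers in the test set, so the "result" can shift even though the method is identical. That is why you fix the seed and report which one you used.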
Write one sentence: "The claim was ~95%. I reproduced X% with seed 42. It held / didn't hold because ___."

You just did real reproduction — the same act that keeps the whole field honest.

Check yourself

What does it mean for a result to be reproducible? → Someone else following the same method on the same data gets the same answer — so it's real knowledge, not a lucky one-off.
Why set a random seed? → It fixes the randomness (shuffling, initial weights) so every run is identical and your reported number is stable and repeatable.
Your number doesn't match the claim — what do you do? → Report your number honestly, then investigate why it differs (data version, preprocessing, seed). A mismatch is information, not something to hide.

Wrap-up

You reproduced a real result end to end: loaded the data, ran the stated method with controlled variables, compared your number, and reported it honestly. That loop — and the discipline to report what actually happened — is the backbone of trustworthy AI.

Try this at home: reproduce the same experiment with a different model (swap LogisticRegression for sklearn.tree.DecisionTreeClassifier), keeping the data, split and seed identical. Report both numbers side by side. You've just run a fair method comparison — one variable changed, everything else held.

Tips & extra challenges

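Watch out: if your accuracy is a perfect 100%, check for leakage before you celebrate: confirm you're testing on X_test, not accidentally on the training data. On a small, easy dataset like Iris a perfect score on 30 test flowers can be genuine, so verify rather than assume.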
Want more? Try this: wrap the experiment in a loop over five seeds (for seed in [0, 1, 2, 3, 4]:), collect the five accuracies, and report the mean and range. A single number hides how much a result wobbles; five reveal it.
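The five-seed challenge could look roughly like this. It reuses the activity's settings; the helper name `run_once` is made up for the example, and wrapping the experiment in a function is just one convenient way to make sure the seed changes in both places at once.

```python
from statistics import mean

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def run_once(seed):
    """Run the whole experiment with one seed and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=200, random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


seeds = [0, 1, 2, 3, 4]
accuracies = [run_once(seed) for seed in seeds]

for seed, acc in zip(seeds, accuracies):
    print(f"seed {seed}: {acc:.4f}")
print(f"mean:  {mean(accuracies):.4f}")
print(f"range: {min(accuracies):.4f} to {max(accuracies):.4f}")
```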

Vocabulary

Term	Meaning
Reproducibility	Getting the same result by repeating the same method on the same data
Random seed	A fixed number that makes randomness identical on every run
Controlled variable	Something you deliberately hold constant so a comparison stays fair
Baseline	The reference method or score a new result is compared against
p-hacking	Tweaking an analysis after seeing the results until it gives the number you wanted

Resources

Google Colab — run today's reproduction here, no install needed.
scikit-learn — Getting Started — the library you used; clear docs and examples.
Papers with Code — find a paper with its code and try to reproduce a real one.

Practice set

Practise on your own — work these easy → hard. Answers follow each arrow.

1. Define it. In one sentence, what is reproducibility? → Getting the same result when you repeat the same method on the same data.

2. Why the seed? Your accuracy changes every time you run the cell. What's the one-line fix? → Set a random_state (a random seed) in the split and the model so runs are identical.

3. Spot the unfair test. You compare two models but each got a different train/test split. What's wrong? → The split isn't controlled: with different splits, any score gap could come from the split rather than the model. Use the same split and seed.

4. Honest reporting. You expected 95% and got 91%. What do you report? → You report 91% (your real number) and investigate the gap — never the number you wished for.

5. Load and check (code). After from sklearn.datasets import load_iris, write the two lines that load Iris into X, y and print the shape of X. → X, y = load_iris(return_X_y=True) then print(X.shape). (Any correct load + .shape earns it.)

6. Fair comparison (harder, code). You want to test whether a decision tree beats logistic regression on Iris. Describe the setup that keeps it fair. → Use the same X_train/X_test, same random_state, same test set; change only the model class. Then compare the two accuracy_score values.

Going deeper (optional)

Optional — for when you want to know why one accuracy number is never enough.

The danger of a single run. Accuracy from one train/test split is itself a random draw — a slightly lucky or unlucky slice of the data. That's why a serious reproduction reports accuracy across many splits (you'll meet cross-validation in Session 12), giving a mean and a spread. If a paper's claim is "95%" but across ten seeds you see anywhere from 88% to 96%, the honest reproduction isn't "it matched" — it's "their number sits at the top of a wide range." Learning to report the range, not just the friendliest point in it, is what turns a re-run into real evidence.
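As a preview only, here is a hedged sketch of what "many splits" can look like with scikit-learn's cross-validation helpers. The choice of ten folds and seed 42 is an assumption made for this example, not a setting from any paper.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Ten train/test splits; every flower is used for testing exactly once.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=folds)

print("per-split accuracy:", scores.round(4))
print(f"mean:   {scores.mean():.4f}")
print(f"spread: {scores.min():.4f} to {scores.max():.4f} (std {scores.std():.4f})")
```

Reporting the mean together with the lowest and highest split shows where a claimed number sits inside the range instead of hiding the range.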

Common mistakes & fixes

Mistake: Forgetting random_state, so the accuracy changes every run. → Fix: set a seed in both the split and the model — reproducibility needs fixed randomness.
Mistake: Reporting "about 95%" instead of your actual number. → Fix: report the exact figure you measured, to the decimal — vagueness hides problems.
Mistake: Changing the model and the split, then comparing. → Fix: hold the data and split constant; change only the one thing you're testing.
Mistake: Getting 100% and celebrating. → Fix: suspect data leakage — confirm you're scoring on unseen test data, not the training set.
Mistake: Editing the experiment until it hits the number you wanted. → Fix: decide the method up front, run it, and report the outcome even if it disappoints.
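For the leakage check in particular, a small sanity test can help. This is one possible sketch under the activity's settings; the `row_ids` trick and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
row_ids = np.arange(len(X))  # label every row so we can see where it ends up

X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, row_ids, test_size=0.2, random_state=42
)

# Check 1: no row may appear in both the training set and the test set.
shared_rows = set(ids_train) & set(ids_test)
print("rows in both train and test:", len(shared_rows))

# Check 2: score both sets and label them clearly so they can't be mixed up.
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy (seen data):   {train_acc:.4f}")
print(f"test accuracy (unseen data):  {test_acc:.4f}  <- report this one")
```

If the two checks pass and the test score is still perfect, the result may simply be genuine for that split; running a few more seeds is a reasonable next step.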

What's next

Session 11 — AI Ethics, Bias & Safety: you can now test whether a result is true. Next you'll ask a harder question — whether it's fair. Where bias comes from, how to audit a model across different groups of people, and why a technically-accurate model can still do real harm.