⏱ 75 min · Live session

Session 2 — How Networks Learn

Duration: 75 min · Format: live online

What you'll learn: by the end, you can explain how a network measures its own mistakes (the loss function), how it improves by rolling downhill toward less error (gradient descent), and how these fit into the training loop that powers all of deep learning — and you'll run a tiny loop in Python that fixes a weight by itself.

Soft skill focus — Problem-solving

Today you'll also build your problem-solving skills. Training a network is problem-solving made mechanical: measure how wrong you are, take one small step in the direction that helps, then measure again. That "measure → adjust → repeat" loop is exactly how you crack hard problems in life, not just in code.

Try this: when your demo doesn't behave, resist changing five things at once. Change one number, re-run, and watch what moves. That's the same disciplined loop the network itself is using — one small, measured step at a time.
Think about: "When I'm stuck, do I take one careful step and check, or do I panic and thrash? What would change if I trained myself to move like gradient descent?"

What you'll need

Google Colab open in a tab, signed in, ready for a new notebook.
Your neuron from Session 1 fresh in mind — weights are just numbers, and until now you picked them.
Paper and a pen — you'll predict which way a weight should move before the code proves it.

Hook

In Session 1 you built a neuron and chose its weights by hand. That works for two inputs. But a real network has millions of weights — nobody could ever set those by hand. So here's the question that built the entire field:

How can a network find its own weights?

The answer is beautifully simple. First, give the network a way to measure how wrong it is — one number. Then keep changing the weights in whatever direction makes that number smaller. Do it enough times and the network teaches itself. Today you'll see exactly how, and you'll watch a weight fix itself in front of you.

Teach — Loss: one number for "how wrong"

Before a network can improve, it needs to know how bad its current answer is. That measurement is the loss — a single number where big = very wrong and 0 = perfect.

Figure: gradient descent, where the model rolls down an error curve to find the weights with the lowest error.

A common loss for numbers is squared error: take the prediction, subtract the true answer, and square it.

Prediction 10, truth 10 → error 0 → loss 0. Perfect.
Prediction 13, truth 10 → error 3 → loss 9. Wrong, and the squaring punishes big misses hard.
Prediction 7, truth 10 → error -3 → loss 9. Squaring makes it positive, so "too low" and "too high" both count as wrong.
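If you'd like to check those three examples yourself, here is a minimal sketch. The function name squared_error is just a label chosen for this illustration, not something from a library.

```python
def squared_error(prediction, truth):
    """Return the squared-error loss for a single prediction."""
    error = prediction - truth   # negative when the prediction is too low
    return error ** 2            # squaring makes every miss count as positive

# The three examples above, all with a true answer of 10.
examples = [(10, 10), (13, 10), (7, 10)]
losses = [squared_error(p, t) for p, t in examples]

for (p, t), loss in zip(examples, losses):
    print(f"prediction={p:2d}  truth={t}  error={p - t:+d}  loss={loss}")
```

Notice that 13 and 7 get the same loss: the loss only cares how far off you are, not in which direction.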

Picture the loss as a curve: on the bottom axis is the weight you could pick, and going up is how much error that weight causes. Somewhere on that curve is a lowest point — the weight with the least error. Learning is the search for the bottom of that valley.
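To make the valley concrete, this sketch tries a handful of weights and prints the loss for each as a rough text plot. It assumes the tiny machine from the activity later in this session (prediction = weight × input, with input 1.5 and truth 3.0); the list of candidate weights and the name loss_at are made up for the illustration.

```python
x = 1.5        # assumed input (borrowed from the activity)
y_true = 3.0   # assumed truth, so the best weight should be 2.0

def loss_at(weight):
    """Squared-error loss this weight would cause on our one example."""
    pred = weight * x
    return (pred - y_true) ** 2

# Hypothetical candidates spread along the bottom axis of the curve.
candidate_weights = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
curve = [(w, loss_at(w)) for w in candidate_weights]

for w, loss in curve:
    bar = "#" * round(loss)   # crude bar: longer means more error
    print(f"weight={w:.1f}  loss={loss:5.2f}  {bar}")

# The lowest point among the weights we tried.
best_weight, best_loss = min(curve, key=lambda pair: pair[1])
print("bottom of the valley (among these candidates):", best_weight)
```

Trying every weight works for one weight, but it would be hopeless for millions of them, which is why the next section matters.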

Teach — Gradient descent: roll downhill

So you have an error curve and you want its lowest point. How do you get there without seeing the whole curve at once?

Imagine standing on a foggy hillside, wanting the valley floor, able to see only the ground at your feet. The trick: feel which way is downhill, take a small step that way, and repeat. You'll reach the bottom without ever seeing the whole hill. That is gradient descent.

The "which way is downhill" part comes from the gradient: the slope of the loss curve at your current weight. It tells you two things at once. Its sign tells you which direction is uphill (so the opposite direction reduces the loss), and its size tells you how steep the curve is right there.
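One way to "feel the ground at your feet" in code is to nudge the weight a tiny bit each way and see how the loss changes. The sketch below does that. It assumes the same one-example machine used in the activity (input 1.5, truth 3.0), and the helper names loss_at and slope_at are invented for this illustration. Real libraries compute slopes differently, so treat this only as a way to build intuition.

```python
x = 1.5        # assumed input
y_true = 3.0   # assumed truth

def loss_at(weight):
    """Squared-error loss for this weight on our one example."""
    return (weight * x - y_true) ** 2

def slope_at(weight, nudge=1e-6):
    """Estimate the slope by nudging the weight slightly down and up."""
    return (loss_at(weight + nudge) - loss_at(weight - nudge)) / (2 * nudge)

for w in [0.0, 1.0, 2.0, 3.0]:
    print(f"weight={w:.1f}  slope={slope_at(w):+.3f}")
```

A negative slope means the loss falls as the weight grows, so downhill is "up" for the weight. A positive slope means the opposite. A slope near zero means you are at the valley floor.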

The update rule is one line, and it's the heart of all training:

new weight = old weight − (learning rate × gradient)

The gradient points uphill, so we subtract it to go downhill.
The learning rate is the size of your step — a small number like 0.1. Too big and you leap past the valley; too small and you crawl.
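Here is the update rule as a tiny function, as a sketch. The name update and the example numbers are made up for illustration.

```python
def update(weight, gradient, learning_rate):
    """Apply one gradient-descent step and return the new weight."""
    return weight - learning_rate * gradient

# Negative gradient: uphill is toward smaller weights, so we step toward bigger ones.
stepped_up = update(weight=0.0, gradient=-9.0, learning_rate=0.1)

# Positive gradient: uphill is toward bigger weights, so we step toward smaller ones.
stepped_down = update(weight=3.0, gradient=4.5, learning_rate=0.1)

print("from 0.0 with gradient -9.0:", stepped_up)
print("from 3.0 with gradient +4.5:", stepped_down)
```

In both cases the subtraction does the work: the weight always moves against the gradient.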

⚠ Watch out: the learning rate is the setting people get wrong most. Too large and the loss bounces around or blows up to NaN (you overshot the valley and shot up the far side). Too small and training barely moves and seems "stuck." When training misbehaves, the learning rate is the first dial to check.
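To see all three behaviours side by side, here is a small sketch on a made-up bowl-shaped loss, (weight − 5)². The target of 5, the starting weight of 0, the three learning rates, and the name final_loss are all assumptions chosen for the illustration.

```python
def final_loss(lr, steps=20, target=5.0):
    """Run gradient descent on loss = (weight - target)**2; return the last loss."""
    weight = 0.0
    for _ in range(steps):
        grad = 2 * (weight - target)   # slope of the bowl at this weight
        weight = weight - lr * grad    # the update rule
    return (weight - target) ** 2

for lr in [0.001, 0.1, 1.1]:
    print(f"lr={lr:<5}  loss after 20 steps = {final_loss(lr):.3f}")
```

The loss starts at 25. With the tiny rate it has barely moved after 20 steps, with the middle rate it is close to 0, and with the large rate it ends far above where it started because every step overshoots further than the last.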

Teach — The training loop

Loss and gradient descent come together in a cycle the network repeats thousands of times. This loop is training.

Figure: the training loop (data in, prediction, loss, adjust the weights, repeat).

1. Data in: feed the network an example (or a batch of them).
2. Predict: it runs the forward pass you built in Session 1 and produces an output.
3. Loss: compare the prediction to the true answer; get one number for how wrong it is.
4. Adjust the weights: compute the gradient and nudge every weight a little downhill (gradient descent).
5. Repeat: go back to step 1 with slightly better weights.

Each full pass over your data is called an epoch. After enough epochs the loss shrinks toward the bottom of the valley, the weights settle, and the network has learned. That's it — no magic, just this loop, run at enormous scale.
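The sketch below shows the loop with epochs on a made-up dataset of three examples that all follow the rule y = 2 × x. The dataset, the learning rate of 0.05, and the choice of 10 epochs are assumptions for illustration. It updates the weight after every single example, which is only one of several common ways to organise the loop.

```python
# Hypothetical dataset: every pair follows y = 2 * x.
data = [(1.0, 2.0), (1.5, 3.0), (2.0, 4.0)]

weight = 0.0   # starts wrong
lr = 0.05      # learning rate

history = []   # average loss recorded once per epoch
for epoch in range(10):                    # one epoch = one full pass over data
    total_loss = 0.0
    for x, y_true in data:                 # 1. data in
        pred = weight * x                  # 2. predict
        loss = (pred - y_true) ** 2        # 3. loss
        grad = 2 * (pred - y_true) * x     # 4. adjust the weights
        weight = weight - lr * grad
        total_loss += loss
    history.append(total_loss / len(data)) # 5. repeat with better weights
    print(f"epoch {epoch:2d}  weight={weight:.4f}  avg loss={history[-1]:.5f}")
```

Here each epoch contains three weight updates (one per example), so 10 epochs means 30 updates in total.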

Activity — Train one weight in Python

Let's make a weight fix itself. You have a machine that computes prediction = weight × input. The true rule is prediction = 2 × input, so the correct weight is 2, but your weight starts wrong at 0.0. You will not set it to 2; the loop will find it.

First, by hand (30 seconds): input 1.5, truth 3.0, weight 0.0. The prediction is 0.0, so it's way too low. Which way must the weight move — up or down? Write it down, then let the loop prove you right.

Type and run this:

x = 1.5          # one input
y_true = 3.0     # the correct answer (because 2 * 1.5 = 3)
weight = 0.0     # our weight starts wrong
lr = 0.1         # learning rate: the size of each step

for step in range(20):
    pred = weight * x               # steps 1-2: data in, predict
    loss = (pred - y_true) ** 2     # step 3: squared-error loss
    grad = 2 * (pred - y_true) * x  # step 4: slope of loss w.r.t. weight
    weight = weight - lr * grad     # step 4: step downhill
    # The printed loss was measured before this update; the weight is the value after it.
    print(f"step {step:2d}  weight={weight:.3f}  loss={loss:.3f}")

Now watch what happened:

Did the loss shrink toward 0 as the steps went on? That's the network getting less wrong.
Did the weight climb toward 2.0 on its own? Nobody told it 2 — it found it by rolling downhill.
Change lr to 0.9 and re-run. Does the loss bounce around or explode? You just overshot the valley — that's a learning rate too big.
Change lr to 0.01. Does it crawl and never quite arrive in 20 steps? That's a learning rate too small.

You just ran the exact process — loss, gradient, update, repeat — that trains every deep-learning model, from a two-line demo to a model with billions of weights.

Check yourself

What does the loss function measure? → How wrong the model's prediction is, as a single number — big means very wrong, 0 means perfect.
In gradient descent, why do we subtract the gradient? → The gradient points uphill (toward more loss), so subtracting it moves us downhill (toward less loss).
What are the five steps of the training loop? → Data in → predict → loss → adjust the weights → repeat.

Wrap-up

You now know how a network learns without anyone setting its weights: it measures its error with a loss function, uses gradient descent to step the weights downhill, and repeats that in a training loop until the loss is small. Every model you'll ever train runs this exact cycle.

Try this at home: change y_true in your demo to 4.5 (so the correct weight is now 3). Don't touch weight — leave it at 0.0. Run the loop again and confirm it climbs to a different target on its own. The loop doesn't know the answer; it only knows how to get less wrong.

Tips & extra challenges

Watch out: the loss going down on your training data does not guarantee the model is actually smart — it could be memorising. Measuring learning honestly (on data it hasn't seen) is a whole skill you'll build in Session 12.
Want more? Try this: add a bias to the demo. Make pred = weight * x + bias, start bias = 0.0, and add a second update line bias = bias - lr * (2 * (pred - y_true)). Now the loop tunes two numbers downhill at once — a baby network learning two parameters.
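If you want to compare against your own attempt at the bias challenge, here is one possible version, as a sketch. The variable name error_signal is just a label for the shared part of both gradients. Both updates use the same pred, computed before either number changes.

```python
x = 1.5          # one input
y_true = 3.0     # the correct answer
weight = 0.0     # starts wrong
bias = 0.0       # new: a second number for the loop to tune
lr = 0.1         # learning rate

for step in range(20):
    pred = weight * x + bias               # predict with both parameters
    loss = (pred - y_true) ** 2            # squared-error loss
    error_signal = 2 * (pred - y_true)     # shared part of both gradients
    weight = weight - lr * error_signal * x  # gradient for weight includes x
    bias = bias - lr * error_signal          # gradient for bias does not
    print(f"step {step:2d}  weight={weight:.3f}  bias={bias:.3f}  loss={loss:.6f}")
```

The loss still heads toward 0, but the weight will not necessarily land on 2 this time. With only one example, many weight and bias pairs give the right prediction, and the loop simply settles on one of them.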

Vocabulary

Term	Meaning
Loss function	A number measuring how wrong a prediction is (0 = perfect)
Gradient	The slope of the loss — which way, and how steeply, error changes
Gradient descent	Repeatedly stepping the weights downhill to reduce the loss
Learning rate	The size of each downhill step (e.g. 0.1)
Epoch	One full pass of the training loop over all the data

Resources

Google Colab: where you'll write and run everything in this course.
3Blue1Brown, "Gradient descent, how neural networks learn": the clearest visual walkthrough of today's idea.
TensorFlow Playground: nudge the learning rate live and watch the loss curve respond.

Practice set

Practise on your own — work these easy → hard. Answers follow each arrow.

1. Read the loss. Model A has loss 0.2; model B has loss 8.0. Which one is more wrong? → Model B — a bigger loss means a worse prediction.

2. Compute a loss. Prediction 5, truth 8, using squared error. What is the loss? → (5 − 8)² = (−3)² = 9.

3. Pick the direction. Your prediction is too high and you want less loss. Should the weight (with a positive input) go up or down? → Down — a smaller weight lowers the prediction toward the truth.

4. Diagnose the run. You train and the loss reads 2.1 → 40 → 900 → NaN. What single setting is almost certainly wrong? → The learning rate is too large — the steps are overshooting and the loss is exploding.

5. Trace one step (harder). Weight 1.0, input 2.0, truth 6.0, learning rate 0.1. Compute the prediction, the gradient 2 * (pred − truth) * input, and the new weight. → pred = 1.0 × 2.0 = 2.0; gradient = 2 × (2.0 − 6.0) × 2.0 = −16; new weight = 1.0 − 0.1 × (−16) = 2.6 (it moved up toward the true weight of 3).
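Once you've worked questions 2 and 5 on paper, you can check your arithmetic with a few lines of Python. This is only a sketch that mirrors the numbers in those questions; the variable names are chosen for the illustration.

```python
# Question 2: squared error for prediction 5, truth 8.
loss_q2 = (5 - 8) ** 2
print("question 2 loss:", loss_q2)

# Question 5: trace one gradient-descent step.
weight, x, y_true, lr = 1.0, 2.0, 6.0, 0.1
pred = weight * x                    # forward pass
grad = 2 * (pred - y_true) * x       # slope of the loss w.r.t. the weight
new_weight = weight - lr * grad      # one step downhill
print(f"question 5: pred={pred}  gradient={grad}  new weight={new_weight:.1f}")
```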

Going deeper (optional)

Optional — for when you want to know where that gradient formula comes from.

Why is the gradient 2 * (pred − truth) * input? The loss is (pred − truth)² and pred = weight × input. Calculus asks: if I wiggle the weight a tiny bit, how much does the loss change? The chain rule answers it in two links — the loss changes by 2 × (pred − truth) for each unit of prediction, and the prediction changes by input for each unit of weight. Multiply the links and you get 2 × (pred − truth) × input. You don't need to derive this by hand in real projects — libraries like TensorFlow compute every gradient for you automatically (it's called backpropagation, and you'll rely on it next session). But seeing it once, on one weight, means the automatic version will never feel like magic.
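If you prefer to see the two links written as equations, here is the same argument in symbols. The letters are shorthand introduced only for this sketch: L for the loss, w for the weight, and x for the input.

```latex
\begin{align*}
L &= (\text{pred} - \text{truth})^2, \qquad \text{pred} = w \cdot x \\[4pt]
\frac{\partial L}{\partial \text{pred}} &= 2\,(\text{pred} - \text{truth})
  && \text{(first link: loss per unit of prediction)} \\[4pt]
\frac{\partial \text{pred}}{\partial w} &= x
  && \text{(second link: prediction per unit of weight)} \\[4pt]
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial \text{pred}} \cdot \frac{\partial \text{pred}}{\partial w}
   = 2\,(\text{pred} - \text{truth}) \cdot x
  && \text{(multiply the links)}
\end{align*}
```

The last line is the same expression as grad = 2 * (pred - y_true) * x in the activity code.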

Common mistakes & fixes

Mistake: Thinking the loss going up means the code is broken. → Fix: a rising loss usually means the learning rate is too high — lower it (try 0.1 → 0.01) before you touch anything else.
Mistake: Setting the weight to the right answer yourself. → Fix: the whole point is that the loop finds it — start the weight wrong and let gradient descent do the work.
Mistake: Confusing "step" and "epoch." → Fix: a step is one weight update; an epoch is one full pass over all your data (which may be many steps).
Mistake: Adding the gradient instead of subtracting it. → Fix: the gradient points uphill, so you subtract it to descend — add it and the loss will climb.
Mistake: Expecting loss to reach exactly 0. → Fix: on real data loss usually settles at a small number, not 0 — perfect is rare and often a sign of memorising.
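To see the add-versus-subtract mistake in action, this sketch runs the activity's setup (input 1.5, truth 3.0, weight starting at 0.0, learning rate 0.1) twice: once subtracting the gradient and once adding it. The train function and its sign parameter are invented for this demonstration and are not part of any library.

```python
def train(sign, steps=5, lr=0.1):
    """Run a few updates; sign=-1 subtracts the gradient, sign=+1 adds it."""
    x, y_true, weight = 1.5, 3.0, 0.0
    losses = []
    for _ in range(steps):
        pred = weight * x
        losses.append((pred - y_true) ** 2)   # loss before this update
        grad = 2 * (pred - y_true) * x
        weight = weight + sign * lr * grad    # the only line that differs
    return losses

descending = train(sign=-1)   # correct: step downhill
ascending = train(sign=+1)    # the mistake: step uphill

print("subtracting:", [round(loss, 2) for loss in descending])
print("adding:     ", [round(loss, 2) for loss in ascending])
```

Both runs start from the same loss. Subtracting shrinks it at every step, while adding makes it grow at every step.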

What's next

Session 3 — Build a Neural Network: you've trained one weight by hand. Next you'll hand the whole loop — loss, gradients, and all — to Keras/TensorFlow and train a real multi-layer network on a real image dataset with just a few lines: Sequential, Dense, compile, fit, evaluate.