Session 6 — The Transformer Revolution
Duration: 75 min · Format: live online
What you'll learn: by the end, you can explain the attention mechanism (how a model decides which other words to focus on for each word), why that idea let transformers overtake older RNNs, and how this one architecture became the engine behind almost every large language model today.
Soft skill focus — Critical thinking
Today you'll also practise critical thinking. "Attention is all you need" is a real paper title and a big claim. Critical thinking is the habit of not swallowing big claims whole: asking what problem this actually solves, what it costs, and where it might break. Bring that lens to everything today.
- Try this: for each benefit you hear about transformers, deliberately hunt for the trade-off. Faster? At what cost. Smarter? Measured how. A claim you can't find a downside to usually means you haven't looked hard enough yet.
- Think about: "When someone says a new technology 'changes everything', how do I tell real breakthroughs from hype — before the crowd decides?"
What you'll need
- Google Colab open — today is mostly concepts, but you'll run one short demo.
- The diagram below open so you can follow how attention connects words.
- Paper and a pen — you'll draw arrows between words in a sentence by hand.
Hook
Read this sentence: "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? You knew instantly: the trophy. Now change one word: "…because it was too small." Suddenly "it" means the suitcase.
To understand "it", you had to look back and pay attention to the right earlier word, and which word mattered depended on the rest of the sentence. For decades, machines were terrible at this. Then in 2017 researchers built a whole architecture around a mechanism called attention, and the transformer was born. Almost everything you think of as "AI" today (ChatGPT, Gemini, translation, coding assistants) runs on the idea you're about to meet.
Teach — The problem transformers solved
Before transformers, most language models ran on RNNs (recurrent neural networks). An RNN reads a sentence one word at a time, in order, carrying a memory forward. Two problems crippled it:
- It forgot. By the time it reached word 40, the memory of word 2 had faded. Long-range links ("it" → a noun far back) got lost.
- It couldn't be parallelised. Because word 10 depended on word 9 depending on word 8… you had to process the sentence strictly in sequence. On modern hardware that's painfully slow, so you couldn't train on truly enormous text.
The transformer (from the 2017 paper "Attention Is All You Need") threw out the step-by-step reading. Instead, it looks at all the words at once and lets every word directly attend to every other word. No fading memory, and — because there's no forced order to the computation — it trains massively in parallel. That combination is exactly what made today's giant models possible.
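To make the contrast concrete, here is a toy sketch in plain Python. It is not a real RNN or transformer: the `DECAY` factor of 0.5 per step is an assumed number chosen only to illustrate fading memory, and the sentence is the one from the Hook with "didn't" split into two words. It counts how many steps an RNN-style chain must run one after another, and how many word pairs attention can score independently.

```python
# Toy illustration only: counts steps and pairs, does no real language modelling.
words = "The trophy did not fit in the suitcase because it was too big".split()
n = len(words)

# RNN-style: a chain. Each step has to wait for the step before it.
DECAY = 0.5  # assumed fading factor per step, chosen for illustration
memory_of_first_word = 1.0
sequential_steps = 0
for _ in words[1:]:
    memory_of_first_word *= DECAY  # the first word's signal fades a little more
    sequential_steps += 1

# Attention-style: every (query word, key word) pair is scored on its own,
# so all of these scores could be computed at the same time.
pairs = [(q, k) for q in range(n) for k in range(n)]

print(f"words in the sentence:                 {n}")
print(f"steps an RNN must run one by one:      {sequential_steps}")
print(f"share of word 1 left in the toy memory: {memory_of_first_word:.6f}")
print(f"independent pairs attention can score: {len(pairs)}")
```

Notice the trade-off hiding in the last number: the pair count grows with the square of the sentence length, so reading everything at once is parallel but not free.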
Teach — What attention actually does
Attention is a way for each word to ask the rest of the sentence: "which of you should I pay attention to, to understand myself here?" — and to build a better, context-aware version of itself from the answer.
Here's the mechanism in plain terms. For every word, the model produces three vectors:
- a Query — what this word is looking for,
- a Key — what this word offers to others,
- a Value — the information this word will pass on if attended to.
To update a word, the model compares its Query against every word's Key, its own included. A higher match means a higher attention weight, and the weights for one word are scaled so they add up to 1. Then it builds the word's new representation as a weighted blend of the Values, mostly from the words it scored highest on. Do this for every word, and each one becomes context-aware: "bank" beside "river" ends up different from "bank" beside "money".
The magic isn't one clever rule — it's that these Query/Key/Value transforms are learned weights (just like Session 1's neurons), trained on oceans of text until the attention lands on the words that actually matter.
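The steps above can be sketched in plain Python so every operation is visible. Treat this as a minimal illustration under stated assumptions, not how a production library implements it: the three weight matrices are random rather than trained, the embeddings are random 4-number vectors, and there is only one attention head. Dividing the scores by the square root of the vector size and normalising them with softmax follows the 2017 paper; the helper names (`matvec`, `softmax`, `self_attention`) are made up for this sketch.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into positive weights that add up to 1."""
    biggest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, w_q, w_k, w_v):
    """One attention layer, one head. Returns (new_vectors, attention_weights)."""
    queries = [matvec(w_q, e) for e in embeddings]  # what each word looks for
    keys = [matvec(w_k, e) for e in embeddings]     # what each word offers
    values = [matvec(w_v, e) for e in embeddings]   # what each word passes on
    scale = math.sqrt(len(keys[0]))

    new_vectors, all_weights = [], []
    for q in queries:
        # Compare this word's Query with every word's Key (its own included).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # New representation = weighted blend of all the Values.
        blended = [sum(w * v[d] for w, v in zip(weights, values))
                   for d in range(len(values[0]))]
        new_vectors.append(blended)
        all_weights.append(weights)
    return new_vectors, all_weights

random.seed(0)
DIM = 4  # assumed toy vector size; real models use hundreds or thousands

def random_matrix():
    """Stand-in for a learned weight matrix (untrained, so purely random)."""
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

words = ["the", "river", "bank"]
embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in words]
w_q, w_k, w_v = random_matrix(), random_matrix(), random_matrix()

new_vectors, weights = self_attention(embeddings, w_q, w_k, w_v)
for word, row in zip(words, weights):
    print(f"{word:6s}", [round(w, 2) for w in row])
```

Because the weights here are random, the printed attention rows mean nothing; the point is only the shape of the computation. Training is what moves those rows towards the words that matter.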
Teach — A worked example
Take: "The cat sat because it was tired."
We want a context-aware vector for "it". Its Query essentially asks "what noun do I stand for?" The model scores "it" against every word's Key:
| Word | Attention weight for "it" |
|---|---|
| The | 0.02 |
| cat | 0.71 |
| sat | 0.08 |
| because | 0.03 |
| it | 0.10 |
| was | 0.03 |
| tired | 0.03 |
"cat" wins by a mile, so the new vector for "it" is built mostly from cat's Value — the model has effectively decided "it = cat". Change the sentence to "…because it was sunny," and the weights shift toward a different reading. Same mechanism, different context, different focus. That flexibility is the whole point.
⚠ Watch out: attention tells you which words the model weighted, not why in any human sense — and high attention is not proof of correct reasoning. Models can attend to the "right" word and still get the answer wrong, or attend oddly and get it right. Treat attention maps as a useful hint about the machinery, not as an explanation you can fully trust.
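The blend for "it" in the worked example can be checked by hand or in a few lines of Python. The weights below are the ones from the table; the two-number Values are invented purely for illustration (first number: how "animal" the word is, second: how "action" it is). Real Values are learned vectors with far more dimensions.

```python
# Attention weights for "it", copied from the table in the worked example.
weights = {
    "The": 0.02, "cat": 0.71, "sat": 0.08, "because": 0.03,
    "it": 0.10, "was": 0.03, "tired": 0.03,
}

# Hypothetical Values: [animal-ness, action-ness]. Invented for this sketch.
values = {
    "The": [0.0, 0.0],
    "cat": [1.0, 0.0],
    "sat": [0.0, 1.0],
    "because": [0.0, 0.0],
    "it": [0.5, 0.0],
    "was": [0.0, 0.5],
    "tired": [0.0, 0.0],
}

# The weights for one word should add up to 1 (allowing for rounding).
total_weight = sum(weights.values())

# New vector for "it" = each word's Value, scaled by its weight, then summed.
new_it = [sum(weights[w] * values[w][d] for w in weights) for d in range(2)]

top_word = max(weights, key=weights.get)
print(f"weights add up to: {total_weight:.2f}")
print(f"most attended word: {top_word}")
print(f"new vector for 'it': {[round(x, 3) for x in new_it]}")
```

With these made-up Values the result leans heavily towards the "animal" direction, which is the numeric version of "it = cat". It says nothing about whether the model's final answer is right.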
Activity — Watch a transformer fill the blank
You don't need to build a transformer to feel attention at work — you can watch a pre-trained one resolve context live. Open a new Colab notebook.
First, predict (30 seconds): in "The river __ was covered in mud", what word fills the blank? Now in "I deposited my money at the __"? Write both down.
Type and run this (it downloads a small model — give it a minute):
from transformers import pipeline
# Load a small pre-trained masked-language model (downloaded on first run).
fill = pipeline("fill-mask", model="distilbert-base-uncased")
# Print the top three guesses for the blank, each with its score.
for r in fill("The river [MASK] was covered in mud.")[:3]:
    print(f" {r['token_str']:12s} {r['score']:.2f}")
Now flip the context and run again:
for r in fill("I deposited my money at the [MASK].")[:3]:
    print(f" {r['token_str']:12s} {r['score']:.2f}")
Now experiment:
- Did the same model give "bank"-like answers in both — but for different reasons (river vs money)? That's context steering attention.
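- Write a sentence where one changed word flips the top answer. Which single word seems to have steered the prediction? (The demo shows predictions, not attention weights, so treat your answer as an informed guess.)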
- Try a sentence about a topic you love. Where does it focus? Where does it get it wrong?
Check yourself
- In one line, what does attention do? → It lets each word weigh which other words matter to it, and rebuild itself as a blend of the most relevant ones.
- Why did transformers beat RNNs? → They look at all words at once (no fading memory of far-back words) and train in parallel (so they scale to huge data).
- What are Query, Key and Value? → Query = what a word is looking for; Key = what a word offers; Value = the information passed on when a word is attended to.
Wrap-up
You've met the single idea behind the whole modern AI boom: let every word directly attend to every other, weight them by learned relevance, and blend. Stack that mechanism into many layers and train it on the internet, and you get a large language model. Everything you'll use next — Hugging Face pipelines, chatbots, RAG — is transformers under the hood.
- Try this at home: find the original attention paper ("Attention Is All You Need") and read just its abstract. List two claims it makes, then note, for each, what evidence you'd want before believing it. That's critical thinking on a real research paper.
Tips & extra challenges
- Watch out: "attention" here is a maths operation, not human focus or consciousness. The word is a metaphor — don't let it trick you into thinking the model "concentrates" or "cares".
- Want more? Try this: look up multi-head attention — transformers run several attention operations in parallel, each free to focus on a different kind of relationship (grammar, meaning, position). Write two sentences on why several "heads" beat one.
Vocabulary
| Term | Meaning |
|---|---|
| Attention | A mechanism where each word weighs how much every other word matters to it |
| Transformer | An architecture built from attention that reads all words at once |
| RNN | An older model that reads word-by-word in order; forgets long-range links |
| Query / Key / Value | What a word seeks / offers / passes on, used to compute attention |
| Attention weight | How strongly one word focuses on another (higher = more relevant) |
Resources
- Google Colab — where you'll run the fill-mask demo.
- Hugging Face — the transformer models you just used, and thousands more.
- The Illustrated Transformer (Jay Alammar) — the clearest visual walkthrough of attention there is.
Practice set
Practise on your own — work these easy → hard. Answers follow each arrow.
1. Name the win. Give one thing a transformer does that an RNN struggles with. → It connects far-apart words directly (no fading memory) and/or trains in parallel instead of one word at a time.
2. Read the weights. For the word "it", attention weights are: cat 0.7, sat 0.1, tired 0.2. Which word does the model think "it" refers to? → cat — it has the highest attention weight, so "it" is built mostly from cat.
3. Match the terms. Which is which: "what I'm looking for", "what I offer", "the info I pass on"? → Query, Key, Value, in that order.
4. Reason about context. Why does "bank" get a different vector in "river bank" than in "money bank"? → Attention blends in the neighbouring words' values, so "river" vs "money" pulls "bank" toward different meanings — the vector becomes context-aware.
5. Think critically (harder). A demo shows a model attending strongly to the "correct" word but giving a wrong final answer. What does this tell you? → That attention weights aren't proof of correct reasoning — focusing on the right word doesn't guarantee the right output; you must judge models by results, not attention maps.
Going deeper (optional)
Optional — for when you want to know how the model knows word order at all.
If a transformer reads every word at once, how does it know "dog bites man" ≠ "man bites dog"? Great catch — pure attention is order-blind. Transformers fix this by adding a positional encoding: a little pattern of numbers added to each word's vector that encodes where it sits in the sentence. So each word carries both its meaning (the embedding from Session 5) and its position, and attention can use both. It's a neat reminder that a transformer is really a stack of ideas working together — embeddings, positions, attention, and ordinary neuron layers — not one single trick.
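A small sketch can show both halves of that idea. The text above does not say which positional encoding is meant, so this assumes the sine-and-cosine scheme from the 2017 paper; many later models use learned or other position schemes instead. The four-number word embeddings are invented for illustration.

```python
import math

def positional_encoding(position, dim):
    """Sine/cosine pattern for one position (scheme from the 2017 paper)."""
    encoding = []
    for i in range(dim):
        # Even and odd slots share a wavelength; even uses sin, odd uses cos.
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Made-up embeddings, purely for illustration.
embedding = {
    "dog": [0.9, 0.1, 0.0, 0.3],
    "bites": [0.1, 0.8, 0.5, 0.0],
    "man": [0.7, 0.2, 0.1, 0.6],
}

def encode(sentence, use_positions):
    """Turn a list of words into vectors, optionally adding position patterns."""
    vectors = []
    for position, word in enumerate(sentence):
        vector = embedding[word]
        if use_positions:
            pattern = positional_encoding(position, len(vector))
            vector = [e + p for e, p in zip(vector, pattern)]
        vectors.append(vector)
    return vectors

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions, both sentences are the same bag of vectors.
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
# With positions, "dog" at the start differs from "dog" at the end.
same_with = sorted(encode(a, True)) == sorted(encode(b, True))

print("identical without positions:", same_without)
print("identical with positions:   ", same_with)
```

Sorting the vectors throws away their order on purpose: it mimics what order-blind attention has to work with when no position information is added.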
Common mistakes & fixes
- Mistake: Thinking attention means the model "understands" or "concentrates". → Fix: it's a weighted average driven by learned weights — powerful maths, not human focus.
- Mistake: Believing high attention = correct reasoning. → Fix: attention is a hint about the machinery, not an explanation; judge the model by its outputs.
- Mistake: Assuming transformers process words in order like reading. → Fix: they process all words together and use positional encodings to recover order.
- Mistake: Thinking RNNs are useless now. → Fix: they're still handy for small or streaming problems; transformers won for scale, not every case.
- Mistake: Forgetting attention needs training. → Fix: Query/Key/Value come from learned weights — an untrained transformer attends to nothing useful.
What's next
Session 7 — Build with Pre-trained Models: you now understand what's happening inside a transformer — next you get to use one. With Hugging Face you'll load models that already learned from billions of words and, in a few lines, run sentiment analysis, summarisation and question-answering — standing on the shoulders of giants.