Ibnovate Course 3 · The Future Builders

Session 6 — The Transformer Revolution

Duration: 75 min · Format: live online

What you'll learn: by the end, you can explain the attention mechanism (how a model decides which other words to focus on for each word), why that idea let transformers overtake older RNNs, and how this one architecture became the engine behind almost every large language model today.

Soft skill focus — Critical thinking

Today you'll also grow your critical thinking. "Attention Is All You Need" is a real paper title and a big claim. Critical thinking is the habit of not swallowing big claims whole: asking what problem this actually solves, what it costs, and where it might break. Bring that lens to everything today.

What you'll need

Hook

Read this sentence: "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to — the trophy or the suitcase? You knew instantly: the trophy. Now change one word: "…because it was too small." Suddenly "it" means the suitcase.

To understand "it", you had to look back and pay attention to the right earlier word, and which word mattered depended on the rest of the sentence. For decades, machines were terrible at this. Attention, a mechanism researchers had begun bolting onto RNNs a few years earlier, changed that, and in 2017 a model built entirely around it arrived: the transformer. Almost everything you think of as "AI" today (ChatGPT, Gemini, translation, coding assistants) runs on the idea you're about to meet.

Teach — The problem transformers solved

Before transformers, most language models were RNNs (recurrent neural networks). An RNN reads a sentence one word at a time, in order, carrying a memory forward. Two problems crippled it:

  1. It forgot. By the time it reached word 40, the memory of word 2 had faded. Long-range links ("it" → a noun far back) got lost.
  2. It couldn't be parallelised. Because word 10 depended on word 9 depending on word 8… you had to process the sentence strictly in sequence. On modern hardware that's painfully slow, so you couldn't train on truly enormous text.

The transformer (from the 2017 paper "Attention Is All You Need") threw out the step-by-step reading. Instead, it looks at all the words at once and lets every word directly attend to every other word. No fading memory, and — because there's no forced order to the computation — it trains massively in parallel. That combination is exactly what made today's giant models possible.
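The contrast between the two styles of computation can be sketched in plain Python. This is a toy illustration, not a real RNN or transformer: the `rnn_step` update rule, the `decay` factor of 0.5, and the one-number "embeddings" are all made up for the example. It only aims to show why a chain of dependent steps must run in order and lets early words fade, while pairwise scores have no such chain.

```python
# Toy illustration only: one number per word stands in for a real embedding.
words = ["the", "trophy", "did", "not", "fit"]
embeddings = [0.1, 0.9, 0.2, 0.3, 0.5]  # hypothetical values

DECAY = 0.5  # hypothetical: how much of the old memory survives each step


def rnn_step(memory, x, decay=DECAY):
    """Hypothetical recurrent update: old memory fades, the new word is mixed in."""
    return decay * memory + (1 - decay) * x


# RNN style: step t needs the result of step t-1, so the loop must run in order.
memory = 0.0
for x in embeddings:
    memory = rnn_step(memory, x)

# The first word was multiplied by DECAY once for every later word,
# so only a small share of it is left in the final memory.
share_of_first_word = (1 - DECAY) * DECAY ** (len(embeddings) - 1)

# Transformer style: each pair (i, j) is scored without waiting for any
# other pair, so all of these could be computed at the same time.
pair_scores = [[xi * xj for xj in embeddings] for xi in embeddings]

print(f"final RNN memory: {memory:.3f}")
print(f"share of first word left in memory: {share_of_first_word:.3f}")
print(f"independent pair scores: {len(pair_scores) * len(pair_scores[0])}")
```

Try lengthening the `embeddings` list: the share of the first word shrinks with every extra step, while the pair scores still have no ordering between them.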

Teach — What attention actually does

Attention is a way for each word to ask the rest of the sentence: "which of you should I pay attention to, to understand myself here?" — and to build a better, context-aware version of itself from the answer.

[Figure: attention lets a transformer focus on the most relevant words]

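Here's the mechanism in plain terms. For every word, the model produces three vectors:

  1. Query: what this word is looking for.
  2. Key: what this word offers to other words.
  3. Value: the information this word passes on when it is attended to.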

To update a word, the model compares its Query against every word's Key, its own included. A high match gives a high attention weight, and the weights for each word are normalised so they sum to 1. The model then builds the word's new representation as a weighted blend of the Values, drawn mostly from the words it scored highest on. Do this for every word and each one becomes context-aware: "bank" beside "river" ends up different from "bank" beside "money".
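The compare-then-blend step can be written out as a minimal sketch in plain Python. It follows the "scaled dot-product" form used in the 2017 paper: each score is a dot product of a Query and a Key, divided by the square root of the vector length, and a softmax turns the scores into weights that sum to 1. The tiny two-number vectors are hypothetical, chosen only so the arithmetic is easy to follow; real models use hundreds of numbers per vector.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product: large when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]  # this Query vs every Key
        weights = softmax(scores)                   # attention weights
        blended = [sum(w * v[d] for w, v in zip(weights, values))
                   for d in range(len(values[0]))]  # weighted blend of Values
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for a three-word sentence.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = attend(queries, keys, values)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row is one word's attention weights over the whole sentence, itself included.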

The magic isn't one clever rule — it's that these Query/Key/Value transforms are learned weights (just like Session 1's neurons), trained on oceans of text until the attention lands on the words that actually matter.
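As a rough picture of what "learned weights" means here: in a standard transformer, each word's embedding is multiplied by three weight matrices to produce its Query, Key and Value. The sketch below uses random numbers where a trained model would have learned ones. The names `W_q`, `W_k` and `W_v`, the sizes, and the embedding for "bank" are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def random_matrix(rows, cols):
    """Untrained weights: small random numbers, as before any training."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def project(vector, matrix):
    """Multiply a vector (length = rows) by a matrix (rows x cols)."""
    return [sum(vector[r] * matrix[r][c] for r in range(len(vector)))
            for c in range(len(matrix[0]))]


embedding_size, head_size = 4, 3  # hypothetical sizes

# Three separate weight matrices; training would adjust every number in them.
W_q = random_matrix(embedding_size, head_size)
W_k = random_matrix(embedding_size, head_size)
W_v = random_matrix(embedding_size, head_size)

bank = [0.2, -0.5, 0.9, 0.1]  # hypothetical embedding for "bank"

# The same embedding yields three different vectors, one per role.
query = project(bank, W_q)
key = project(bank, W_k)
value = project(bank, W_v)

print("query:", [round(x, 2) for x in query])
print("key:  ", [round(x, 2) for x in key])
print("value:", [round(x, 2) for x in value])
```

Because the three matrices differ, one word can look for one thing (its Query) while offering something else (its Key).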

Teach — A worked example

Take: "The cat sat because it was tired."

We want a context-aware vector for "it". Its Query essentially asks "what noun do I stand for?" The model scores "it" against every word's Key:

Word       Attention weight for "it"
The        0.02
cat        0.71
sat        0.08
because    0.03
it         0.10
was        0.03
tired      0.03

"cat" wins by a mile, so the new vector for "it" is built mostly from cat's Value — the model has effectively decided "it = cat". Change the sentence to "…because it was sunny," and the weights shift toward a different reading. Same mechanism, different context, different focus. That flexibility is the whole point.

⚠ Watch out: attention tells you which words the model weighted, not why in any human sense — and high attention is not proof of correct reasoning. Models can attend to the "right" word and still get the answer wrong, or attend oddly and get it right. Treat attention maps as a useful hint about the machinery, not as an explanation you can fully trust.
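Going back to the worked example, the blend for "it" can be computed by hand. The attention weights below are the ones from the table; the Value vectors are hypothetical two-number stand-ins (roughly "how animal-like" and "how action-like"), invented so the result is easy to read.

```python
# Attention weights for "it", copied from the worked example.
weights = {
    "The": 0.02, "cat": 0.71, "sat": 0.08, "because": 0.03,
    "it": 0.10, "was": 0.03, "tired": 0.03,
}

# Hypothetical Value vectors: [animal-like, action-like].
values = {
    "The": [0.0, 0.0], "cat": [1.0, 0.0], "sat": [0.0, 1.0],
    "because": [0.0, 0.0], "it": [0.5, 0.0], "was": [0.0, 0.2],
    "tired": [0.3, 0.1],
}

# New vector for "it": every word's Value, scaled by its attention weight.
new_it = [
    sum(weights[word] * values[word][d] for word in weights)
    for d in range(2)
]

# The word that contributes most is the one with the largest weight.
top_word = max(weights, key=weights.get)

print("weights sum to:", round(sum(weights.values()), 2))
print("most attended word:", top_word)
print("new vector for 'it':", [round(x, 3) for x in new_it])
```

With these made-up Values the new vector lands close to cat's `[1.0, 0.0]`, which is what "built mostly from cat's Value" means in practice.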

Activity — Watch a transformer fill the blank

You don't need to build a transformer to feel attention at work — you can watch a pre-trained one resolve context live. Open a new Colab notebook.

First, predict (30 seconds): in "The river __ was covered in mud", what word fills the blank? Now in "I deposited my money at the __"? Write both down.

Type and run this (it downloads a small model — give it a minute):

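# Load a ready-made fill-in-the-blank pipeline (downloads the model on first run).
from transformers import pipeline

fill = pipeline("fill-mask", model="distilbert-base-uncased")

# Print the model's top three guesses for the masked word, with their scores.
for r in fill("The river [MASK] was covered in mud.")[:3]:
    print(f"  {r['token_str']:12s}  {r['score']:.2f}")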

Now flip the context and run again:

for r in fill("I deposited my money at the [MASK].")[:3]:
    print(f"  {r['token_str']:12s}  {r['score']:.2f}")

Now experiment:

  1. Did the same model give "bank"-like answers in both — but for different reasons (river vs money)? That's context steering attention.
  2. Write a sentence where one changed word flips the top answer. Which single word steered the model?
  3. Try a sentence about a topic you love. Where does it focus? Where does it get it wrong?
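For the experiments above, a small helper can make "did the top answer flip?" a one-line check. This is a sketch under one assumption: that `fill` behaves like the pipeline in the activity, returning a list of dictionaries with `token_str` and `score` keys, best score first. The `fake_fill` function is a made-up stand-in so the helper can be tried without downloading a model; its answers are invented and say nothing about what the real model predicts.

```python
def top_answer(fill, sentence):
    """Return the highest-scoring word for the [MASK] in `sentence`."""
    results = fill(sentence)  # assumed: list of dicts, best score first
    return results[0]["token_str"].strip()


def flips_top_answer(fill, sentence_a, sentence_b):
    """True if the two sentences get different top words."""
    return top_answer(fill, sentence_a) != top_answer(fill, sentence_b)


def fake_fill(sentence):
    """Hypothetical stand-in for the real pipeline, with invented answers."""
    word = "bank" if "money" in sentence else "bed"
    return [
        {"token_str": word, "score": 0.9},
        {"token_str": "thing", "score": 0.1},
    ]


flipped = flips_top_answer(
    fake_fill,
    "The river [MASK] was dry.",
    "I keep my money in the [MASK].",
)
print("top answer flipped:", flipped)
```

In the Colab notebook, pass the real `fill` object in place of `fake_fill` and try your own sentence pairs.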

Check yourself

  1. In one line, what does attention do? → It lets each word weigh which other words matter to it, and rebuild itself as a blend of the most relevant ones.
  2. Why did transformers beat RNNs? → They look at all words at once (no fading memory of far-back words) and train in parallel (so they scale to huge data).
  3. What are Query, Key and Value? → Query = what a word is looking for; Key = what a word offers; Value = the information passed on when a word is attended to.

Wrap-up

You've met the single idea behind the whole modern AI boom: let every word directly attend to every other, weight them by learned relevance, and blend. Stack that mechanism into many layers and train it on the internet, and you get a large language model. Everything you'll use next — Hugging Face pipelines, chatbots, RAG — is transformers under the hood.

Tips & extra challenges

Vocabulary

Term                   Meaning
Attention              A mechanism where each word weighs how much every other word matters to it
Transformer            An architecture built from attention that reads all words at once
RNN                    An older model that reads word-by-word in order; forgets long-range links
Query / Key / Value    What a word seeks / offers / passes on, used to compute attention
Attention weight       How strongly one word focuses on another (higher = more relevant)

Resources

Practice set

Practise on your own — work these easy → hard. Answers follow each arrow.

1. Name the win. Give one thing a transformer does that an RNN struggles with. → It connects far-apart words directly (no fading memory) and/or trains in parallel instead of one word at a time.

2. Read the weights. For the word "it", attention weights are: cat 0.7, sat 0.1, tired 0.2. Which word does the model think "it" refers to? → cat — it has the highest attention weight, so "it" is built mostly from cat.

3. Match the terms. Which is which: "what I'm looking for", "what I offer", "the info I pass on"? → Query, Key, Value, in that order.

4. Reason about context. Why does "bank" get a different vector in "river bank" than in "money bank"? → Attention blends in the neighbouring words' values, so "river" vs "money" pulls "bank" toward different meanings — the vector becomes context-aware.

5. Think critically (harder). A demo shows a model attending strongly to the "correct" word but giving a wrong final answer. What does this tell you? → That attention weights aren't proof of correct reasoning — focusing on the right word doesn't guarantee the right output; you must judge models by results, not attention maps.

Going deeper (optional)

Optional — for when you want to know how the model knows word order at all.

If a transformer reads every word at once, how does it know "dog bites man" ≠ "man bites dog"? Great catch — pure attention is order-blind. Transformers fix this by adding a positional encoding: a little pattern of numbers added to each word's vector that encodes where it sits in the sentence. So each word carries both its meaning (the embedding from Session 5) and its position, and attention can use both. It's a neat reminder that a transformer is really a stack of ideas working together — embeddings, positions, attention, and ordinary neuron layers — not one single trick.
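One way to see the order problem, and one fix for it, is the sketch below. It uses the sine-and-cosine positional encoding from the 2017 paper, which is one common choice (other models learn their position vectors instead). The four-number word embeddings are hypothetical.

```python
import math


def positional_encoding(position, size):
    """Sine/cosine pattern for one position (the 2017 paper's scheme)."""
    encoding = []
    for i in range(size):
        # Each pair of dimensions (0-1, 2-3, ...) shares one wavelength.
        angle = position / (10000 ** ((2 * (i // 2)) / size))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


# Hypothetical embeddings: one distinct vector per word.
embedding = {
    "dog": [1.0, 0.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0, 0.0],
    "man": [0.0, 0.0, 1.0, 0.0],
}


def encode(sentence, use_positions):
    """Return one vector per word, optionally with its position added in."""
    vectors = []
    for position, word in enumerate(sentence):
        vec = embedding[word]
        if use_positions:
            pe = positional_encoding(position, len(vec))
            vec = [e + p for e, p in zip(vec, pe)]
        vectors.append(vec)
    return vectors


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Without positions, both sentences hold the same vectors, just shuffled.
same_without = sorted(encode(a, False)) == sorted(encode(b, False))
# With positions, "dog" at position 0 differs from "dog" at position 2.
same_with = sorted(encode(a, True)) == sorted(encode(b, True))

print("identical without positions:", same_without)
print("identical with positions:   ", same_with)
```

Sorting the vectors deliberately throws away their order, which mimics what order-blind attention "sees": once positions are added, the two sentences no longer look alike.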

Common mistakes & fixes

What's next

Session 7 — Build with Pre-trained Models: you now understand what's happening inside a transformer — next you get to use one. With Hugging Face you'll load models that already learned from billions of words and, in a few lines, run sentiment analysis, summarisation and question-answering — standing on the shoulders of giants.

Ibnovate · Build · Innovate