Session 5 — Teaching Machines Language
Duration: 75 min · Format: live online
What you'll learn: by the end, you can explain how a computer turns words into numbers, why embeddings (words as vectors) are so much smarter than one-hot codes, and how "closeness" between vectors captures meaning — and you'll load a real embedding and ask it for a word's nearest neighbours.
Soft skill focus — Curiosity
Today you'll also grow Curiosity. A computer has never seen a word — yet it can tell you that "king" is to "man" as "queen" is to "woman". That should make you deeply curious: how does pure arithmetic end up holding meaning? Chase that question today.
- Try this: every time you meet a new word-pair in the code, stop and predict — "will these two be close or far?" — before you run it. When the machine surprises you, that surprise is a clue worth chasing, not an error to skip past.
- Think about: "If meaning can live in a list of numbers, what else that feels 'human' might turn out to be maths in disguise?"
What you'll need
- Google Colab open in a tab, signed in, ready for a new notebook.
- The diagram below open so you can picture words as points in space.
- Paper and a pen — you'll sketch two words as arrows before the computer does it.
Hook
A neural network only understands numbers. But language is words. So before any AI can read, translate, or chat, someone has to answer one deceptively hard question: how do you turn "cat" into a number the maths can use?
The lazy answer — give every word its own ID number — turns out to be nearly useless, because it says nothing about meaning. The clever answer, embeddings, is one of the most beautiful ideas in modern AI: represent every word as a list of numbers positioned so that words with similar meanings sit close together. Do that well, and arithmetic on words starts to work. Today you build that intuition from the ground up.
Teach — From text to tokens
Computers don't process a sentence whole. First they tokenise it: chop it into small pieces called tokens (roughly words or word-parts).
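The chopping step can be sketched in a few lines of plain Python. This is a toy, not what a real language model uses: the function name simple_tokenise and the ID numbering are made up for illustration, and real tokenisers also split rare words into word-parts.

```python
import re

def simple_tokenise(text):
    """Split text into word tokens and single punctuation marks (toy version)."""
    # \w+ grabs a run of letters or digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenise("I love robots")
print(tokens)

# Give each distinct token an ID number, in order of first appearance
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)
print(vocab)
```

Notice that the IDs are arbitrary: nothing about the number says what the word means.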
"I love robots" → ["I", "love", "robots"] → each token then needs a number.
The first idea people try is one-hot encoding: give every word in the vocabulary its own slot, and mark a word by putting a 1 in its slot and 0 everywhere else.
cat → [1, 0, 0, 0, …]
dog → [0, 1, 0, 0, …]
king → [0, 0, 1, 0, …]
It works, but it has two crippling problems:
- It's huge and wasteful. A real vocabulary is 50,000+ words, so every word is a 50,000-long list of mostly zeros.
- It knows nothing about meaning. cat and dog are just as "far apart" as cat and Tuesday: every word is exactly the same distance from every other. The numbers carry no information about how words relate.
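You can check the "same distance" claim yourself. Here is a minimal sketch with NumPy, assuming a made-up four-word vocabulary (a real one would have tens of thousands of slots). The helper names one_hot and distance are just for this example.

```python
import numpy as np

vocab = ["cat", "dog", "king", "tuesday"]  # toy vocabulary, invented for this example

def one_hot(word):
    """Return a vector of zeros with a single 1 in the word's slot."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

def distance(a, b):
    """Straight-line (Euclidean) distance between two vectors."""
    return float(np.linalg.norm(a - b))

print(one_hot("dog"))
print("cat-dog:    ", distance(one_hot("cat"), one_hot("dog")))
print("cat-tuesday:", distance(one_hot("cat"), one_hot("tuesday")))
```

Both distances come out identical, so the code cannot tell a pet from a weekday.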
Teach — Embeddings: words as vectors
An embedding fixes both problems. Instead of one giant slot-list, we represent each word as a short list of real numbers — a vector — say 50 or 300 numbers long. Crucially, these numbers are learned (during training on huge amounts of text) so that words used in similar ways end up with similar vectors.
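Under the hood, an embedding is often pictured as a lookup table: one row of numbers per word. The sketch below assumes a toy four-word vocabulary and fills the table with random stand-in numbers, because no training happens here. In a real model those numbers would be learned from text.

```python
import numpy as np

vocab = {"cat": 0, "dog": 1, "king": 2, "tuesday": 3}  # word -> row number (toy example)
embedding_dim = 5  # real embeddings use something like 50 or 300

rng = np.random.default_rng(seed=0)
# One row per word. Random stand-ins only: training would adjust these values.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    """Look up the vector for a word by picking out its row."""
    return embedding_table[vocab[word]]

print("shape of one word vector:", embed("cat").shape)

# Multiplying a one-hot vector by the table picks out the same row
one_hot_king = np.zeros(len(vocab))
one_hot_king[vocab["king"]] = 1.0
print(np.allclose(one_hot_king @ embedding_table, embed("king")))
```

So the one-hot code has not vanished: it has become the index that selects a short, dense vector.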
Think of each vector as a position in space. cat and dog land near each other (both are pets, appear in similar sentences). king and queen land near each other too. Tuesday lands far away in a different neighbourhood. Meaning becomes geometry: similar meaning → close together; different meaning → far apart.
We measure "closeness" with cosine similarity — a number from -1 (opposite) through 0 (unrelated) to 1 (identical direction). Near 1 means "very similar".
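Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal sketch using tiny hand-picked vectors (not real word vectors) so you can see the three landmark values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2], [2, 4])        # same direction, different length
unrelated = cosine_similarity([1, 0], [0, 1])   # at right angles
opposite = cosine_similarity([1, 2], [-1, -2])  # pointing the other way

print(round(same, 2), round(unrelated, 2), round(opposite, 2))
```

Only direction matters: doubling a vector's length leaves its cosine similarity to everything unchanged.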
The famous party trick: the directions between words carry meaning too. king − man + woman lands right next to queen. The embedding has quietly learned the concept of "royalty" and the concept of "gender" as directions you can add and subtract.
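To see why the arithmetic can work, here is a toy sketch with hand-made two-number vectors, where the first number stands for "royalty" and the second for "gender". These values are invented for illustration. A real embedding has no such neatly labelled axes, but the idea of consistent directions is the same.

```python
import numpy as np

# Hand-made vectors: [royalty, gender]. Invented for illustration only.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Remove the "man" direction from king, then add the "woman" direction
target = vectors["king"] - vectors["man"] + vectors["woman"]
print("target vector:", target)

# Search the remaining words for the one pointing most nearly the same way
inputs = {"king", "man", "woman"}
candidates = [w for w in vectors if w not in inputs]
best = max(candidates, key=lambda w: cosine(vectors[w], target))
print("king - man + woman =", best)
```

The input words are left out of the search on purpose; otherwise one of them could come back as its own answer.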
⚠ Watch out: an embedding only knows what it read. If the training text was biased — say it saw "doctor" mostly with "he" and "nurse" mostly with "she" — those biases get baked straight into the geometry. Embeddings aren't neutral; they're a mirror of their data. You'll dig into exactly this in Session 11.
Activity — Find a word's nearest neighbours
Let's load a real, pre-trained embedding and ask it real questions. Open a new Colab notebook.
First, by hand (30 seconds): you're about to ask for the words nearest to "king". Write down your top three guesses now. Then let the machine show you what it actually learned.
Type and run this (the api.load line downloads a small pre-trained model, so give it a minute; if the import fails, run !pip install gensim in its own cell first):
import gensim.downloader as api
# a small, real embedding trained on billions of words
model = api.load("glove-wiki-gigaword-50") # each word → 50 numbers
print("vector length:", len(model["king"]))
print("first 5 numbers of 'king':", model["king"][:5])
Now ask it for nearest neighbours:
print("Nearest to 'king':")
for word, score in model.most_similar("king", topn=5):
    print(f" {word:12s} similarity {score:.2f}")
print("\nSimilarity(cat, dog): ", round(model.similarity("cat", "dog"), 2))
print("Similarity(cat, tuesday):", round(model.similarity("cat", "tuesday"), 2))
Now the party trick — arithmetic on meaning:
result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print("king - man + woman =", result[0][0])
Now experiment:
- Were your three guesses for "king" in the list? What surprised you?
- Is cat–dog more similar than cat–tuesday? By how much?
- Try your own analogy: paris - france + japan = ? What did it answer?
You just did arithmetic on meaning — and it worked, because meaning is stored as position in space.
Check yourself
- Why is one-hot encoding a poor way to represent words? → It's huge (one slot per word) and it captures no meaning — every word is equally far from every other.
- What is an embedding? → A short vector of learned numbers for each word, arranged so similar words sit close together.
- How do we measure how similar two word-vectors are? → With cosine similarity: near 1 means very similar, near 0 means unrelated.
Wrap-up
You've crossed the bridge from words to numbers — the bridge every language model has to cross. The key idea: don't give a word a random ID, give it a position, and learn those positions from real text so that geometry carries meaning. Once meaning is geometry, a network can compute with it.
- Try this at home: pick five words from a topic you love (music, football, space…) and use model.most_similar on each. Do the neighbourhoods make sense? Find one neighbour that's clearly wrong and write one sentence guessing why the training text led the model astray.
Tips & extra challenges
- Watch out: if a word isn't in the model's vocabulary you'll get a KeyError. These small models are lowercase-only and have no rare or made-up words, so try model.key_to_index to peek at what exists, and lowercase your input.
- Want more? Try this: write a function odd_one_out(words) that uses model.doesnt_match(words) to spot the word that doesn't belong, then test it on ["breakfast", "lunch", "dinner", "football"]. Can you fool it?
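One way to dodge the KeyError is to check words before you look them up. The sketch below is a hypothetical helper (known_words is not part of gensim) that only relies on the key_to_index attribute mentioned above. It uses a tiny stand-in object so the cell runs without downloading anything; in your notebook you would pass the real model instead.

```python
from types import SimpleNamespace

def known_words(model, words):
    """Lowercase each word and sort it into (found, missing)."""
    found, missing = [], []
    for word in words:
        word = word.lower()
        if word in model.key_to_index:
            found.append(word)
        else:
            missing.append(word)
    return found, missing

# Stand-in with the same key_to_index attribute as the loaded model (assumption:
# a plain dict of word -> position is enough for this membership check)
fake_model = SimpleNamespace(key_to_index={"king": 0, "queen": 1, "cat": 2})

found, missing = known_words(fake_model, ["King", "cat", "flibbertigibbet"])
print("found:  ", found)
print("missing:", missing)
```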
Vocabulary
| Term | Meaning |
|---|---|
| Token | A small piece of text (a word or word-part) the model processes |
| One-hot encoding | A word as one 1 in a slot and 0 everywhere else — no meaning |
| Embedding | A learned vector of numbers for a word; similar words sit close |
| Vector | An ordered list of numbers; here, a point/direction in space |
| Cosine similarity | A score from -1 to 1 for how alike two vectors are |
Resources
- Google Colab — where you'll write and run everything this course.
- Hugging Face — home of pre-trained models and embeddings you'll use all unit.
- TensorFlow Embedding Projector — fly through a real embedding in 3D and watch neighbourhoods form.
Practice set
Practise on your own — work these easy → hard. Answers follow each arrow.
1. Spot the weakness. In one-hot encoding, what is the cosine similarity between cat and dog? → 0 — every one-hot word is orthogonal to every other, so the code says they're totally unrelated (which is wrong).
2. Read the geometry. Two words have cosine similarity 0.91. Are they closer in meaning than a pair scoring 0.12? → Yes — higher cosine similarity means more alike; 0.91 is nearly the same direction, 0.12 is almost unrelated.
3. Size it up. A vocabulary has 40,000 words. How long is each one-hot vector, and how long might its embedding be? → One-hot: 40,000 numbers. Embedding: something small like 50–300 — far shorter and meaningful.
4. Reason about analogies. Why can king − man + woman land near queen? → Because "gender" and "royalty" are stored as consistent directions in the space, so you can add and subtract them like moves on a map.
5. Predict the code (harder). What will this print — a number near 1 or near 0? model.similarity("happy", "joyful") → A number near 1 — the words mean almost the same thing, so their vectors point in nearly the same direction.
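If you want to check questions 1 and 3 with actual numbers, plain Python is enough. This sketch assumes an embedding length of 50 (the size used in this session) and a three-slot one-hot example invented for brevity.

```python
# Question 1: two different one-hot vectors never share a slot
cat = [1, 0, 0]
dog = [0, 1, 0]
dot = sum(c * d for c, d in zip(cat, dog))
print("dot product of cat and dog:", dot)  # 0, so the cosine similarity is 0 too

# Question 3: how many numbers does it take to store the whole vocabulary?
vocab_size = 40_000
embedding_dim = 50  # assumed size, as in today's model

one_hot_numbers = vocab_size * vocab_size        # 1,600,000,000
embedding_numbers = vocab_size * embedding_dim   # 2,000,000
ratio = one_hot_numbers // embedding_numbers     # 800

print("one-hot table:  ", one_hot_numbers)
print("embedding table:", embedding_numbers)
print("one-hot is", ratio, "times bigger")
```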
Going deeper (optional)
Optional — for when you want to know where the numbers actually come from.
How does an embedding learn without anyone labelling meaning? The trick is a clever self-made task: hide a word and make the model predict it from its neighbours (or the reverse: predict the neighbours from the word). This is the idea behind word2vec. GloVe, the model you loaded today, takes a related route: it counts how often words appear near each other across the whole corpus and fits vectors to match those counts. Either way, nobody ever tells the model "cat and dog are similar"; but because cat and dog appear in the same kinds of sentences ("my ___ is hungry"), the only way to fit the data well is to give them similar vectors. Meaning falls out of context, for free, from raw text, a principle summed up as "you shall know a word by the company it keeps." That same principle, scaled up massively, is what transformers exploit next.
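The prediction game needs training examples, and they come straight from raw sentences. Here is a minimal sketch of how (centre word, neighbour) pairs could be generated, assuming a window of one word on each side. The function name training_pairs is made up, and real training code adds many refinements on top of this.

```python
def training_pairs(tokens, window=1):
    """List every (centre, neighbour) pair within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs

cat_pairs = training_pairs("my cat is hungry".split())
dog_pairs = training_pairs("my dog is hungry".split())
print(cat_pairs)

# Collect the neighbours each animal word was seen with
cat_company = {n for c, n in cat_pairs if c == "cat"}
dog_company = {n for c, n in dog_pairs if c == "dog"}
print(cat_company == dog_company)
```

cat and dog keep exactly the same company in these two sentences, which is the pressure that pushes their vectors together during training.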
Common mistakes & fixes
- Mistake: Thinking the embedding "understands" words like a person. → Fix: it only captures statistical patterns of usage — powerful, but it has no real-world experience behind the numbers.
- Mistake: KeyError on a word. → Fix: the word isn't in the vocabulary; lowercase it and try a more common synonym, or check model.key_to_index.
- Mistake: Expecting cosine similarity above 1 or below -1. → Fix: it's always in [-1, 1]. That's the whole scale; ~1 is "same", ~0 is "unrelated".
- Mistake: Assuming the model's associations are facts. → Fix: they're echoes of the training text, including its biases and errors. Always sanity-check.
- Mistake: Confusing tokens with words. → Fix: a token can be a whole word or a word-part (like play + ing); big models often split rare words into pieces.
What's next
Session 6 — The Transformer Revolution: you can now turn words into meaningful vectors — but a sentence is more than a bag of words; order and context change everything ("river bank" vs "money bank"). Next you'll meet attention, the mechanism that lets a model weigh which words matter for each word — the breakthrough that powers every large language model.