Ibnovate Course 3 · The Future Builders
⏱ 75 min · Live session

Session 5 — Teaching Machines Language

Duration: 75 min · Format: live online

What you'll learn: by the end, you can explain how a computer turns words into numbers, why embeddings (words as vectors) are so much smarter than one-hot codes, and how "closeness" between vectors captures meaning — and you'll load a real embedding and ask it for a word's nearest neighbours.

Soft skill focus — Curiosity

Today you'll also grow Curiosity. A computer has never seen a word — yet it can tell you that "king" is to "man" as "queen" is to "woman". That should make you deeply curious: how does pure arithmetic end up holding meaning? Chase that question today.

What you'll need

Hook

A neural network only understands numbers. But language is words. So before any AI can read, translate, or chat, someone has to answer one deceptively hard question: how do you turn "cat" into a number the maths can use?

The lazy answer — give every word its own ID number — turns out to be nearly useless, because it says nothing about meaning. The clever answer, embeddings, is one of the most beautiful ideas in modern AI: represent every word as a list of numbers positioned so that words with similar meanings sit close together. Do that well, and arithmetic on words starts to work. Today you build that intuition from the ground up.

Teach — From text to tokens

Computers don't process a sentence whole. First they tokenise it: chop it into small pieces called tokens (roughly words or word-parts).

"I love robots" → ["I", "love", "robots"] → each token then needs a number.

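Here is a minimal sketch of that chopping step, assuming the simplest possible rule: split on words and punctuation. The function name tokenise and the regular expression are illustrative choices; real tokenisers usually split into word-parts using a learned vocabulary.

```python
import re

def tokenise(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenise("I love robots")
print(tokens)  # ['I', 'love', 'robots']

# Give each distinct token an ID number, in order of first appearance
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))
print(vocab)  # {'I': 0, 'love': 1, 'robots': 2}
```

These ID numbers are only labels: they say which token is which, not what any token means.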
The first idea people try is one-hot encoding: give every word in the vocabulary its own slot, and mark a word by putting a 1 in its slot and 0 everywhere else.

It works, but it has two crippling problems:

  1. It's huge and wasteful. A real vocabulary is 50,000+ words, so every word is a 50,000-long list of mostly zeros.
  2. It knows nothing about meaning. cat and dog are just as "far apart" as cat and Tuesday — every word is exactly the same distance from every other. The numbers carry no information about how words relate.
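Both problems show up even in a tiny example. The sketch below assumes a made-up three-word vocabulary and uses plain Python lists; the helper names one_hot and distance are illustrative.

```python
import math

vocab = ["cat", "dog", "tuesday"]  # toy vocabulary, assumed for illustration

def one_hot(word):
    """Return a list with a 1 in the word's slot and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def distance(a, b):
    """Straight-line (Euclidean) distance between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(one_hot("cat"))  # [1, 0, 0]

# Every pair of different words is exactly the same distance apart
print(distance(one_hot("cat"), one_hot("dog")))
print(distance(one_hot("cat"), one_hot("tuesday")))
```

The two distances printed are identical, and each vector grows by one slot for every word added to the vocabulary.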

Teach — Embeddings: words as vectors

An embedding fixes both problems. Instead of one giant slot-list, we represent each word as a short list of real numbers — a vector — say 50 or 300 numbers long. Crucially, these numbers are learned (during training on huge amounts of text) so that words used in similar ways end up with similar vectors.

[Figure: word embeddings turn each word into numbers so that similar words sit close together.]

Think of each vector as a position in space. cat and dog land near each other (both are pets, appear in similar sentences). king and queen land near each other too. Tuesday lands far away in a different neighbourhood. Meaning becomes geometry: similar meaning → close together; different meaning → far apart.
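To see "meaning becomes geometry" in miniature, here is a sketch with three made-up vectors of just two numbers each. The values are hand-picked assumptions for illustration; a real embedding learns 50 or more numbers per word from text.

```python
import math

# Hypothetical 2-number vectors, hand-picked for illustration (not learned)
embedding = {
    "cat":     [0.9, 0.8],
    "dog":     [0.8, 0.9],
    "tuesday": [-0.7, 0.1],
}

def nearest(word):
    """Return the other word whose position is closest to `word`."""
    others = [w for w in embedding if w != word]
    return min(others, key=lambda w: math.dist(embedding[word], embedding[w]))

print(nearest("cat"))  # dog
```

Looking up a word is just reading its row from a table; all the "knowledge" lives in where the rows sit relative to each other.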

We measure "closeness" with cosine similarity — a number from -1 (opposite) through 0 (unrelated) to 1 (identical direction). Near 1 means "very similar".
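Cosine similarity is short enough to write by hand: the dot product of the two vectors divided by the product of their lengths. The sketch below uses plain Python lists and tiny made-up vectors; the function name cosine_similarity is an illustrative choice.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b, divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

print(cosine_similarity([1, 2], [2, 4]))    # same direction: very close to 1
print(cosine_similarity([1, 0], [0, 1]))    # right angle: 0.0
print(cosine_similarity([1, 2], [-1, -2]))  # opposite direction: very close to -1
```

Notice that [1, 2] and [2, 4] score as alike even though one is twice as long: cosine similarity compares direction only, not length.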

The famous party trick: the directions between words carry meaning too. king − man + woman lands right next to queen. The embedding has quietly learned the concept of "royalty" and the concept of "gender" as directions you can add and subtract.
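The arithmetic can be traced on a toy example. The sketch below assumes hand-made vectors whose two numbers stand for "royalty" and "gender"; real embeddings are learned and their dimensions are not neatly labelled like this. The input words are left out of the candidates, otherwise king itself could win.

```python
import math

# Hypothetical hand-made vectors: [royalty, gender]. Not learned from data.
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.5, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length lists."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word whose vector points most nearly along a - b + c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting man removes the "male" component, adding woman puts the "female" component in its place, and the "royalty" component is untouched throughout.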

⚠ Watch out: an embedding only knows what it read. If the training text was biased — say it saw "doctor" mostly with "he" and "nurse" mostly with "she" — those biases get baked straight into the geometry. Embeddings aren't neutral; they're a mirror of their data. You'll dig into exactly this in Session 11.

Activity — Find a word's nearest neighbours

Let's load a real, pre-trained embedding and ask it real questions. Open a new Colab notebook.

First, by hand (30 seconds): you're about to ask for the words nearest to "king". Write down your top three guesses now. Then let the machine show you what it actually learned.

Type and run this (the api.load line downloads a small pre-trained model, so give it a minute; if the import fails, run !pip install gensim in a cell first):

import gensim.downloader as api

# a small, real embedding trained on billions of words
model = api.load("glove-wiki-gigaword-50")   # each word → 50 numbers

print("vector length:", len(model["king"]))
print("first 5 numbers of 'king':", model["king"][:5])

Now ask it for nearest neighbours:

print("Nearest to 'king':")
for word, score in model.most_similar("king", topn=5):
    print(f"  {word:12s}  similarity {score:.2f}")

print("\nSimilarity(cat, dog):    ", round(model.similarity("cat", "dog"), 2))
print("Similarity(cat, tuesday):", round(model.similarity("cat", "tuesday"), 2))

Now the party trick — arithmetic on meaning:

result = model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print("king - man + woman =", result[0][0])

Now experiment:

  1. Were your three guesses for "king" in the list? What surprised you?
  2. Is cat/dog more similar than cat/tuesday? By how much?
  3. Try your own analogy: paris - france + japan = ? What did it answer?

You just did arithmetic on meaning — and it worked, because meaning is stored as position in space.

Check yourself

  1. Why is one-hot encoding a poor way to represent words? → It's huge (one slot per word) and it captures no meaning — every word is equally far from every other.
  2. What is an embedding? → A short vector of learned numbers for each word, arranged so similar words sit close together.
  3. How do we measure how similar two word-vectors are? → With cosine similarity — near 1 means very similar, near 0 means unrelated.

Wrap-up

You've crossed the bridge from words to numbers — the bridge every language model has to cross. The key idea: don't give a word a random ID, give it a position, and learn those positions from real text so that geometry carries meaning. Once meaning is geometry, a network can compute with it.

Tips & extra challenges

Vocabulary

| Term | Meaning |
| --- | --- |
| Token | A small piece of text (a word or word-part) the model processes |
| One-hot encoding | A word as one 1 in a slot and 0 everywhere else; carries no meaning |
| Embedding | A learned vector of numbers for a word; similar words sit close |
| Vector | An ordered list of numbers; here, a point/direction in space |
| Cosine similarity | A score from -1 to 1 for how alike two vectors are |

Resources

Practice set

Practise on your own — work these easy → hard. Answers follow each arrow.

1. Spot the weakness. In one-hot encoding, what is the cosine similarity between cat and dog? → 0 — every one-hot word is orthogonal to every other, so the code says they're totally unrelated (which is wrong).

2. Read the geometry. Two words have cosine similarity 0.91. Are they closer in meaning than a pair scoring 0.12? → Yes — higher cosine similarity means more alike; 0.91 is nearly the same direction, 0.12 is almost unrelated.

3. Size it up. A vocabulary has 40,000 words. How long is each one-hot vector, and how long might its embedding be? → One-hot: 40,000 numbers. Embedding: something small like 50–300 — far shorter and meaningful.

4. Reason about analogies. Why can king − man + woman land near queen? → Because "gender" and "royalty" are stored as consistent directions in the space, so you can add and subtract them like moves on a map.

5. Predict the code (harder). What will this print: a number nearer 1 or nearer 0? model.similarity("happy", "joyful") → A number much closer to 1 than an unrelated pair would score. The words mean almost the same thing, so their vectors point in a similar direction. The exact value depends on the model, so run it and check.
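Questions 1 and 3 can be checked with a few lines of arithmetic. This sketch assumes a 3-word toy vocabulary for question 1 and an embedding length of 50 for question 3; any length in the 50 to 300 range makes the same point.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Question 1: one-hot vectors for two different words never share a slot
cat, dog = [1, 0, 0], [0, 1, 0]
print(dot(cat, dog))  # 0, so the cosine similarity is 0 as well

# Question 3: numbers needed to store every word of a 40,000-word vocabulary
vocab_size = 40_000
embedding_dim = 50  # assumed length
one_hot_total = vocab_size * vocab_size        # one 40,000-long vector per word
embedding_total = vocab_size * embedding_dim   # one 50-long vector per word
print(one_hot_total)                     # 1600000000
print(embedding_total)                   # 2000000
print(one_hot_total // embedding_total)  # 800
```

Under these assumptions the embedding table is 800 times smaller, and unlike the one-hot table its numbers say something about each word.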

Going deeper (optional)

Optional — for when you want to know where the numbers actually come from.

How does an embedding learn without anyone labelling meaning? The trick is a clever self-made task: hide a word and make the model predict it from its neighbours (or the reverse: predict the neighbours from the word). This is the idea behind word2vec. GloVe reaches a similar result by a different route, fitting vectors to counts of how often words appear near each other. Nobody ever tells the model "cat and dog are similar"; but because cat and dog appear in the same kinds of sentences ("my ___ is hungry"), the only way to do well at either task is to give them similar vectors. Meaning falls out of context, for free, from raw text. The principle is summed up as "you shall know a word by the company it keeps." That same principle, scaled up massively, is what transformers exploit next.
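The raw signal these methods learn from can be seen by simply counting neighbours. The sketch below is not a training algorithm; it only counts which words appear next to a target word in a made-up five-sentence corpus. The corpus, the function name context_counts, and the window size are all assumptions for illustration.

```python
from collections import Counter

# Tiny made-up corpus; real training text has billions of words
corpus = [
    "my cat is hungry",
    "my dog is hungry",
    "the cat sleeps",
    "the dog sleeps",
    "tuesday is busy",
]

def context_counts(target, window=1):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                lo = max(0, i - window)
                hi = min(len(words), i + window + 1)
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

print(context_counts("cat"))
print(context_counts("dog"))
print(context_counts("tuesday"))
```

In this corpus cat and dog keep exactly the same company, while tuesday shares only one neighbour with them. A method trained on such counts has every reason to place cat and dog close together.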

Common mistakes & fixes

What's next

Session 6 — The Transformer Revolution: you can now turn words into meaningful vectors — but a sentence is more than a bag of words; order and context change everything ("river bank" vs "money bank"). Next you'll meet attention, the mechanism that lets a model weigh which words matter for each word — the breakthrough that powers every large language model.
