Ibnovate Course 2 · The Rising Builders
Session 21 — How Machines Read

Duration: 75 min · Format: live online · Ages: 12–15

Session goal: by the end, students can explain how a sentence is split into tokens and counted into numbers, build and test a small sentiment classifier in Python, and name real limits like sarcasm and unseen words.

Before class — prep (5 min)

Agenda

Time | Segment
0:00 | Hook: how does a phone know a review is happy? (5 min)
0:05 | Teach: text becomes tokens, then numbers (14 min)
0:19 | Teach: a bag of words can be classified (13 min)
0:32 | Activity: build a sentiment classifier in Colab (26 min)
0:58 | Check for understanding (10 min)
1:08 | Wrap-up + homework (7 min)

0:00 · Hook (5 min)

Ask the class "How does a phone know a review is happy?" and take a few answers (chat or unmute).

Land it: computers can't read words, but they can count them. Today students will turn sentences into numbers, train a model to tell happy text from unhappy text, and then find exactly where it gets fooled.


0:05 · Teach — Text becomes tokens, then numbers (14 min)

Explain: the first step in every language model is tokenizing — chopping text into pieces called tokens (here, simply words). Then each token becomes a number the computer can count.

Share this diagram so students can follow how text is split into tokens, counted into numbers, and read by a model that predicts the mood:

[Diagram: a sentence flows left to right. It is split into separate word tokens, counted into a bag-of-words table, and fed to a model that outputs a positive or negative sentiment label.]

Type/run this together in Colab:

text = "I really love this movie"

tokens = text.lower().split()   # lowercase, then split on spaces
print(tokens)                   # ['i', 'really', 'love', 'this', 'movie']
print("Number of tokens:", len(tokens))

Explain each move: lower() so Love and love count as the same word, split() to break on spaces. Now show how a computer turns a whole set of sentences into a table of word counts — the "bag of words":

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love this", "I hate this"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

print("Words it found:", vectorizer.get_feature_names_out())
print(counts.toarray())         # one row per sentence, one column per word

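Walk through the grid: each column is a word, each row is a sentence, each number is how many times that word appeared. Point out that "I" gets no column: by default CountVectorizer lowercases the text and ignores one-letter words. The sentence is now just numbers.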
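To make the grid concrete, the same counting can be sketched by hand in plain Python. This is only an illustration, not scikit-learn's actual implementation: the helper names `tokenize` and `count_row` are made up for this example, and the regular expression is an assumption that copies CountVectorizer's default rule of keeping words with two or more characters.

```python
import re

def tokenize(text):
    # Lowercase, then keep words of 2+ characters
    # (assumed to mirror CountVectorizer's default behaviour)
    return re.findall(r"\b\w\w+\b", text.lower())

texts = ["I love this", "I hate this"]

# One column per distinct word, in alphabetical order
vocabulary = sorted({word for text in texts for word in tokenize(text)})

def count_row(text):
    # One number per column: how often that word appears in this sentence
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

table = [count_row(text) for text in texts]

print(vocabulary)   # ['hate', 'love', 'this']
for row in table:
    print(row)      # [0, 1, 1] then [1, 0, 1]
```

Reading the first row against the column names: no "hate", one "love", one "this".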

Ask: "Why lowercase everything first?" (Answer: so Love, love, and LOVE are treated as the same word instead of three different ones.)
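If students want proof, a quick count with and without `lower()` settles it. This is a small sketch using Python's built-in `Counter`; the sentence is invented for the demo.

```python
from collections import Counter

shout = "Love love LOVE"

as_typed = Counter(shout.split())           # case kept: three different "words"
lowered = Counter(shout.lower().split())    # case removed: one word, counted three times

print(as_typed)
print(lowered)    # Counter({'love': 3})
```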

⚠ Watch for the #1 misconception: students think the model understands the words. It doesn't — it only counts them. It has no idea what "love" means; it just learns that the count of certain words goes with "positive."


0:19 · Teach — A bag of words can be classified (13 min)

Explain: once every sentence is a row of word-counts, text classification is the same train/test/.fit() recipe from Unit 1 — the features are just word counts instead of pixels. We give the model labelled examples (positive / negative) and it learns which words lean which way.

Type/run this together in Colab:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["I love this", "This is great", "Absolutely wonderful", "Best day ever",
         "I hate this", "This is terrible", "So boring", "Worst day ever"]
labels = ["positive", "positive", "positive", "positive",
          "negative", "negative", "negative", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # words -> numbers (the features)

model = LogisticRegression()
model.fit(X, labels)                  # learn which words lean positive/negative
print("Trained on", len(texts), "examples.")

Explain that X is the bag-of-words table and labels is the target — identical shape to every model they've built.

Ask: "This model saw only 8 tiny sentences. Do you trust it yet?" (Answer: no — far too little data; a great honesty setup for the activity.)

⚠ Watch for: students assume more clever wording helps the model. What helps is more, varied, labelled examples — the same data lesson from Unit 1, now for text.
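To show what "learning which words lean which way" can look like, here is a deliberately simplified sketch. It is not logistic regression and not what scikit-learn does internally: it is a hand-made word-vote scorer (the names `word_scores` and `vote` are invented for this example) that adds one point when a word appears in a positive example and subtracts one when it appears in a negative example.

```python
import re
from collections import defaultdict

texts = ["I love this", "This is great", "Absolutely wonderful", "Best day ever",
         "I hate this", "This is terrible", "So boring", "Worst day ever"]
labels = ["positive", "positive", "positive", "positive",
          "negative", "negative", "negative", "negative"]

def tokenize(text):
    # Lowercase and keep only runs of letters, so "great," becomes "great"
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical stand-in for training: +1 per positive example, -1 per negative
word_scores = defaultdict(int)
for text, label in zip(texts, labels):
    for word in set(tokenize(text)):
        word_scores[word] += 1 if label == "positive" else -1

def vote(text):
    # Add up the scores of known words; unknown words contribute nothing
    total = sum(word_scores.get(word, 0) for word in tokenize(text))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "unsure"

print(vote("I really love this movie"))    # positive
print(vote("This was so boring"))          # negative
print(vote("Oh great, another Monday"))    # positive (fooled by "great")
```

Words such as "this" and "day" end up with a score of zero because they appear equally often on both sides, which is a useful talking point about why eight examples is not enough.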


0:32 · Activity — Build a sentiment classifier (26 min)

Have students open their own new notebook in Google Colab, build the classifier above, then test it and try to break it. Screen-share and build line by line.

Type/run this together in Colab — predict on brand-new sentences:

new_texts = ["I really love this movie", "This was so boring"]
new_X = vectorizer.transform(new_texts)     # SAME vectorizer, don't refit
print(model.predict(new_X))

Point out the crucial detail: use vectorizer.transform (not fit_transform) on new text, so it uses the same word columns it learned. Then turn students loose to stress-test it with sarcasm, words it never saw, and scrambled word order, and have them report what they find in the chat.

Then measure honestly. Have them see that unknown words simply vanish:

mystery = vectorizer.transform(["This is fantastic and superb"])
print(mystery.toarray())     # only "this" and "is" get a 1; "fantastic", "and", "superb" were never learned, so they vanish
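One way to "measure honestly" is to check how much of a new sentence the vocabulary covers at all. The sketch below is a stand-in, not a scikit-learn feature: `known_fraction` is a hypothetical helper, and the vocabulary is rebuilt by hand from the eight training sentences using the assumed two-or-more-character word rule that CountVectorizer applies by default.

```python
import re

train_texts = ["I love this", "This is great", "Absolutely wonderful", "Best day ever",
               "I hate this", "This is terrible", "So boring", "Worst day ever"]

def tokenize(text):
    # Lowercase, keep words of 2+ characters (assumed CountVectorizer default)
    return re.findall(r"\b\w\w+\b", text.lower())

# Every word the model has ever seen
vocabulary = {word for text in train_texts for word in tokenize(text)}

def known_fraction(text):
    # Return the known words and the share of tokens they make up
    tokens = tokenize(text)
    known = [word for word in tokens if word in vocabulary]
    fraction = len(known) / len(tokens) if tokens else 0.0
    return known, fraction

known, fraction = known_fraction("This is fantastic and superb")
print(len(vocabulary))    # 14 words learned in total
print(known, fraction)    # ['this', 'is'] 0.4
```

Only two of the five words reach the model, and neither of them carries any mood. The words that actually express the opinion are invisible to it.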

Circulate for the classic mistakes: calling fit_transform on new text (which re-learns the vocabulary and breaks alignment) and expecting the model to handle words it never trained on.
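The alignment problem can be shown without any library. This is a minimal sketch with made-up helper names (`learn_vocabulary`, `to_counts`) and plain `split()` tokens rather than CountVectorizer's own rules: the same sentence lands in different columns depending on whether the vocabulary is reused or re-learned.

```python
def learn_vocabulary(texts):
    # Alphabetical list of every distinct lowercase word
    return sorted({word for text in texts for word in text.lower().split()})

def to_counts(text, vocabulary):
    # One count per vocabulary column
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

train_vocab = learn_vocabulary(["I love this", "I hate this"])
new_text = "I love this movie"

reused = to_counts(new_text, train_vocab)     # plays the role of transform
refit_vocab = learn_vocabulary([new_text])    # plays the role of fit_transform
refit = to_counts(new_text, refit_vocab)

print(train_vocab, reused)    # ['hate', 'i', 'love', 'this'] [0, 1, 1, 1]
print(refit_vocab, refit)     # ['i', 'love', 'movie', 'this'] [1, 1, 1, 1]
```

After re-learning, column 0 no longer means "hate" and column 2 no longer means "love", so a model trained on the first layout would be reading the wrong words.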


0:58 · Check for understanding (10 min)

Ask these aloud or drop them in the chat. Answer key (for you):

  1. What is a token, and what's the first step to "read" text? → A token is a piece of text (here, a word); the first step is tokenizing — splitting text into tokens.
  2. How does "I love this" become numbers? → Bag of words: count how many times each known word appears; each count is a feature.
  3. Name one honest limit of this model. → e.g. sarcasm, unknown words it never saw, tiny/biased training data, or it ignores word order.

1:08 · Wrap-up + homework (7 min)


Teaching notes

import numpy as np

words = vectorizer.get_feature_names_out()
weights = model.coef_[0]                       # how each word pushes the label
order = np.argsort(weights)
print("Most negative words:", words[order[:3]])
print("Most positive words:", words[order[-3:]])

Ask whether the learned "positive" and "negative" words make sense, and what a weird one reveals about small, biased data (a word looks positive only because it happened to sit in positive examples). This ties straight back to Unit 1's bias lesson.

- Low-tech fallback: if devices can't run Colab, do bag-of-words on the shared screen. Tally word counts for two happy and two angry sentences by hand, then have students "predict" a new sentence by which word-counts it matches. Reveal that scikit-learn does exactly this counting.

Vocabulary

Term | Meaning
Token | A piece of text, usually a word
Tokenize | Split text into tokens
Bag of words | Counting how often each word appears, ignoring order
Sentiment | Whether text is positive or negative
Vectorizer | The tool that turns text into number counts

Resources

Practice set

A mix of concept questions and short coding tasks on tokens, bag of words, and honest limits — easy to hard. Use for lab time or homework.

1. Define it: what does it mean to tokenize a sentence? → Split it into pieces (tokens) — here, individual words.

2. Predict the output: what does this print? → ['i', 'love', 'pizza'] — lowercased and split on spaces.

print("I Love pizza".lower().split())

3. Reasoning: why do we lower() text before counting words? → So Love, love, and LOVE count as the same word, not three different ones.

4. Read the bag: for the sentences ["good good movie", "bad movie"], the word movie appears in both. In the counts table, what number sits in the movie column for each row? → 1 and 1 — it appears once in each sentence.

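5. Fix the bug: why does predicting on new text with fit_transform misbehave? → fit_transform re-learns the vocabulary from the new text, breaking alignment with the trained model. With the eight-sentence model from class, the new table has 2 columns instead of the 14 the model was trained on, so predict raises an error. Use vectorizer.transform(...) instead.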

new_X = vectorizer.fit_transform(["I love this"])   # wrong on new text
print(model.predict(new_X))

6. Reasoning (harder): the model gets "Oh great, another Monday" wrong and calls it positive. Why? → It counts the positive word great and can't detect sarcasm — it has no sense of tone or context.

7. Reasoning (hardest): "dog bites man" and "man bites dog" get the exact same bag of words. What limitation does this reveal, and why does it matter? → Bag of words ignores order, so it can't tell who did what — meaning that depends on order is lost.
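For question 7, students can check the claim themselves. This is a short sketch using Python's built-in `Counter` as the bag, not the class vectorizer.

```python
from collections import Counter

first = Counter("dog bites man".split())
second = Counter("man bites dog".split())

print(first == second)    # True: same words, same counts, order forgotten
```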

Going deeper (optional)

For a class that's flying, show a modern model that does handle unseen words and some context — a pretrained sentiment model in one line with Hugging Face. It's free but downloads a model the first time, so run it once yourself before class:

!pip install -q transformers
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I really love this movie"))
print(classifier("Oh great, another rainy day"))    # try to fool it too

Contrast it honestly with their own model: this one was trained on millions of examples, so it knows far more words and some tone — but it's still not perfect (test the sarcasm line and see). Land the lesson: bigger training data buys more coverage, but no text model truly understands — they all have limits worth naming. This is exactly the honesty mindset for their Session 22 project.

Common mistakes & fixes

Next session

Session 22 — Your AI Mini-Project & Showcase: students pick vision or text, build their own small classifier, evaluate it honestly, and present it — the build project for this unit.

Ibnovate · Build · Innovate