What are N-grams?
Imagine you’re trying to predict the next word in the sentence “The coffee is…”. As humans, we naturally think “hot” or “delicious” would be likely candidates. But how can we teach a computer to make similar predictions? This is where N-gram models come in.
An N-gram is simply a sequence of N consecutive items from a given text. These items could be characters, words, or even whole sentences, but we’ll focus on words in this guide. Let’s break it down (there’s a short code sketch after the list):
Types of N-grams:
- Unigrams (n=1): Single words: [“the”, “coffee”, “is”, “hot”]
- Bigrams (n=2): Pairs of words: [“the coffee”, “coffee is”, “is hot”]
- Trigrams (n=3): Three words: [“the coffee is”, “coffee is hot”]
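To make this concrete, here’s a minimal sketch that slices a sentence into the lists above. The function name extractNGrams is just an illustrative helper, not part of any library:

function extractNGrams(text, n) {
  const words = text.toLowerCase().split(' ');
  const ngrams = [];
  for (let i = 0; i <= words.length - n; i++) {
    // Join n consecutive words into one n-gram
    ngrams.push(words.slice(i, i + n).join(' '));
  }
  return ngrams;
}

extractNGrams('The coffee is hot', 1); // ["the", "coffee", "is", "hot"]
extractNGrams('The coffee is hot', 2); // ["the coffee", "coffee is", "is hot"]
extractNGrams('The coffee is hot', 3); // ["the coffee is", "coffee is hot"]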
How N-gram Models Work: A Simple Example
Let’s say we have this tiny training dataset:
The coffee is hot
The coffee is delicious
The tea is hot
Our bigram model would count pairs of words:
{
  "the coffee": 2,
  "coffee is": 2,
  "is hot": 2,
  "is delicious": 1,
  "the tea": 1,
  "tea is": 1
}
When asked to predict what comes after “coffee is”, a bigram model only conditions on the last word, “is”. In the counts above:
- “is hot” occurred twice
- “is delicious” occurred once
So it would predict “hot” about two-thirds of the time and “delicious” about one-third of the time.
The Math Behind N-grams
N-gram models use conditional probability. For a bigram model, we’re calculating:
P(word₂|word₁) = Count(word₁ word₂) / Count(word₁)
For example:
P(“hot”|“is”) = Count(“is hot”) / Count(“is”)
= 2 / 3
≈ 0.67
This means that, given the word “is” (which appears three times in our training data, twice followed by “hot”), there’s a 67% chance the next word will be “hot”.
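As a quick sanity check, here’s a small script that recomputes this probability from the three training sentences (the variable names are just illustrative):

const sentences = ['the coffee is hot', 'the coffee is delicious', 'the tea is hot'];

const unigramCounts = new Map();
const bigramCounts = new Map();
for (const sentence of sentences) {
  const words = sentence.split(' ');
  for (let i = 0; i < words.length; i++) {
    unigramCounts.set(words[i], (unigramCounts.get(words[i]) || 0) + 1);
    if (i < words.length - 1) {
      const bigram = words[i] + ' ' + words[i + 1];
      bigramCounts.set(bigram, (bigramCounts.get(bigram) || 0) + 1);
    }
  }
}

// P("hot" | "is") = Count("is hot") / Count("is")
console.log(bigramCounts.get('is hot') / unigramCounts.get('is')); // 2 / 3 ≈ 0.67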
Real-World Implementation
Here’s the basic structure of an N-gram implementation:
class NGramModel {
  constructor(n) {
    this.n = n; // Order of the model
    this.contexts = new Map(); // Store context -> next word mappings
  }

  train(text) {
    // Split text into words
    const words = text.split(' ');

    // Create n-grams
    for (let i = 0; i < words.length - this.n + 1; i++) {
      const context = words.slice(i, i + this.n - 1).join(' ');
      const nextWord = words[i + this.n - 1];

      // Store in our model
      if (!this.contexts.has(context)) {
        this.contexts.set(context, new Map());
      }
      const nextWords = this.contexts.get(context);
      nextWords.set(nextWord, (nextWords.get(nextWord) || 0) + 1);
    }
  }

  predict(context) {
    // Find most likely next word
    const nextWords = this.contexts.get(context);
    if (!nextWords) return null;

    return Array.from(nextWords.entries())
      .reduce((a, b) => a[1] > b[1] ? a : b)[0];
  }
}
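Trained on the tiny coffee/tea dataset from earlier, a bigram model (n = 2) could be used like this:

const model = new NGramModel(2);
model.train('the coffee is hot');
model.train('the coffee is delicious');
model.train('the tea is hot');

console.log(model.predict('is'));     // "hot" (seen twice, vs. once for "delicious")
console.log(model.predict('coffee')); // "is"
console.log(model.predict('pizza'));  // null (unseen context)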
Advanced Concepts
1. Temperature Sampling
Instead of always choosing the most probable word, we can introduce randomness:
predict(context, temperature = 1.0) {
  const nextWords = this.contexts.get(context);
  if (!nextWords) return null;

  // Apply temperature to probabilities
  const candidates = Array.from(nextWords.entries()).map(([word, count]) => ({
    word,
    weight: Math.pow(count, 1 / temperature)
  }));

  // Weighted random selection
  const total = candidates.reduce((sum, c) => sum + c.weight, 0);
  let random = Math.random() * total;
  for (const candidate of candidates) {
    random -= candidate.weight;
    if (random <= 0) return candidate.word;
  }
  return candidates[candidates.length - 1].word; // guard against floating-point rounding
}
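With temperature = 1.0 the model samples words in proportion to their raw counts. Temperatures below 1.0 sharpen the distribution toward the most frequent word, while temperatures above 1.0 flatten it toward a uniform pick, producing more varied (and more random) output.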
2. Backoff Strategy
When we encounter an unknown context, we can “back off” to a smaller context:
predict(context) {
  // Here context is an array of words, e.g. ['coffee', 'is']
  while (context.length > 0) {
    const prediction = this.getPrediction(context.join(' '));
    if (prediction) return prediction;

    // Back off to a smaller context by dropping the oldest word
    context = context.slice(1);
  }
  return this.getRandomWord(); // fallback when even single words are unseen
}
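For example, if the context “coffee is” has never been seen, the model tries just “is”; only if that also fails does it fall back to a random word. (getPrediction and getRandomWord are assumed helpers over the stored counts, not shown above.)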
Limitations and Challenges
- Memory Usage: Storing all possible n-grams can be memory-intensive. For a vocabulary of V words, there are V^n possible n-grams; with a 10,000-word vocabulary, a trigram model already has 10,000³ = 10¹² possible trigrams.
- Sparsity: As n increases, most possible n-grams never appear in the training data, leading to the “sparsity problem”.
- Limited Context: N-gram models only look at the previous n-1 words, missing longer-range dependencies.
- Repetition: Simple N-gram models can get stuck in repetitive patterns.
Improvements and Solutions
- Smoothing: Techniques like Laplace smoothing help handle unseen n-grams (a fuller probability sketch follows this list):
getCount(ngram) {
  return (this.counts.get(ngram) || 0) + 1; // Add-1 (Laplace) smoothing
}
- Interpolation: Combine predictions from different order n-grams:
predict(context) {
  const trigramScore = this.trigramModel.score(context);
  const bigramScore = this.bigramModel.score(context);
  return 0.7 * trigramScore + 0.3 * bigramScore;
}
- Pruning: Remove rare n-grams to save memory:
prune(minCount) {
  for (const [ngram, count] of this.counts) {
    if (count < minCount) this.counts.delete(ngram);
  }
}
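To turn the add-one counts above into actual probabilities, Laplace smoothing also adds the vocabulary size V to the denominator. Here is a minimal sketch, assuming the model keeps a vocabulary Set of every word seen during training (not part of the class shown earlier):

smoothedProbability(context, word) {
  const nextWords = this.contexts.get(context) || new Map();
  const wordCount = nextWords.get(word) || 0;
  const contextTotal = Array.from(nextWords.values()).reduce((sum, c) => sum + c, 0);
  const V = this.vocabulary.size; // assumed: Set of all distinct words seen in training

  // P(word | context) = (Count(context word) + 1) / (Count(context) + V)
  return (wordCount + 1) / (contextTotal + V);
}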
Modern Applications
While deep learning models like BERT and GPT have largely superseded traditional N-gram models for many tasks, N-grams are still useful for:
- Lightweight Applications: When you need a simple, fast model
- Feature Engineering: As input features for larger models
- Text Analysis: Detecting plagiarism or text similarity
- Language Identification: Determining what language a text is in
- Spelling Correction: Suggesting corrections for misspelled words
Conclusion
N-gram models provide an elegant, interpretable way to model sequences of text. While they may seem simple compared to modern deep learning approaches, understanding N-grams provides valuable insights into:
- Probability in natural language
- The importance of context in prediction
- The trade-offs between model complexity and effectiveness
- Fundamental concepts that even modern language models build upon
Whether you’re building a simple autocomplete feature or diving into more complex NLP tasks, the principles behind N-gram models remain relevant and instructive.