What are N-grams in NLP?

Introduction

An n-gram is a sequence of n consecutive words in a text. N-grams show which words tend to appear together, which helps you find patterns or predict the next word. Typical uses:

To predict the next word when typing on a phone keyboard.

To find common phrases in customer reviews.

To improve search engines by understanding word pairs or triples.

To detect spam messages by spotting unusual word combinations.

To analyze writing style by looking at frequent word sequences.
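To make the idea of "sequences of words" concrete, here is a minimal sketch in plain Python that slides a window of size n over a sentence. The helper name `ngrams` and the simple whitespace tokenization are assumptions made for illustration; real text usually needs more careful tokenization.

```python
def ngrams(text, n):
    """Return the n-grams in `text` as a list of word tuples."""
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    words = text.lower().split()
    # Zip n staggered copies of the word list: words[0:], words[1:], ...
    return list(zip(*(words[i:] for i in range(n))))

sentence = "I love machine learning"
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```

A sentence with 4 words gives 3 bigrams and 2 trigrams, because the window has fewer places to start as n grows.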

Syntax

NLP

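from sklearn.feature_extraction.text import CountVectorizer

n = 2  # size of the n-grams to extract (2 = bigrams)
corpus = ['I love machine learning']  # a list of text documents

vectorizer = CountVectorizer(ngram_range=(n, n))
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document, one column per n-gram
print(vectorizer.get_feature_names_out())  # the n-grams that were found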

ngram_range=(n, n) means you get only n-grams of size n (for example, bigrams if n=2).

In general, ngram_range=(min_n, max_n) is an inclusive range. For example, ngram_range=(1, 2) gives both single words and pairs.

Examples

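This gets single words (unigrams) from the sentence. With the default settings, one-character tokens such as "I" are ignored, so only "learning", "love", and "machine" are returned.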

NLP

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
vectorizer = CountVectorizer(ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

This gets pairs of words (bigrams) from the sentence.

NLP

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

This gets both single words and pairs of words.

NLP

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love machine learning']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

Sample Model

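This code extracts the bigrams from three sentences. It prints the list of bigrams and a count matrix with one row per sentence and one column per bigram. Because the default tokenizer ignores one-character words, "I love" is not among the bigrams.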

NLP

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

# Create bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)

# Show the bigrams found
bigrams = vectorizer.get_feature_names_out()
print('Bigrams:', bigrams)

# Show the count matrix as array
counts = X.toarray()
print('Counts matrix:\n', counts)


Important Notes

N-grams help capture context by looking at word groups, not just single words.

A higher n (such as 3 or 4) captures longer word sequences, but each sequence occurs less often, so you need more data to get reliable counts.

By default, CountVectorizer lowercases the text, treats punctuation as a separator, and ignores words shorter than two characters (so "I" and "a" are dropped).
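To see why longer n-grams need more data, this sketch counts, for n = 1, 2, and 3, how many distinct n-grams the small three-sentence corpus contains and how many of them occur more than once. The `summary` dictionary is only a helper for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

summary = {}
for n in (1, 2, 3):
    vectorizer = CountVectorizer(ngram_range=(n, n))
    # Total count of each n-gram across all sentences.
    totals = vectorizer.fit_transform(corpus).toarray().sum(axis=0)
    distinct = len(totals)
    repeated = int((totals > 1).sum())
    summary[n] = (distinct, repeated)
    print(f"n={n}: {distinct} distinct n-grams, {repeated} seen more than once")
```

On this corpus the number of repeated n-grams falls from 3 for single words to 1 for bigrams and 0 for trigrams, so there is less and less evidence to learn from.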

Summary

N-grams are groups of n words appearing together in text.

They help find patterns and improve text predictions.

Use CountVectorizer with ngram_range to extract n-grams easily.
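As a rough sketch of how bigram counts can support next-word prediction, the code below counts which word follows which and suggests the most frequent follower. The names `followers` and `predict_next` are made up for this illustration. It uses plain Python with whitespace splitting instead of CountVectorizer, and real keyboards use far larger corpora and more advanced models.

```python
from collections import Counter, defaultdict

corpus = [
    'I love machine learning',
    'Machine learning is fun',
    'I love coding in Python'
]

# followers[word] counts the words seen directly after `word`.
followers = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if none was seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_next('machine'))
print(predict_next('Python'))
```

"machine" is followed by "learning" in two sentences, so that is the suggestion. "Python" only appears at the end of a sentence, so there is nothing to suggest.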

Practice


1. What is an n-gram in natural language processing?

easy

A. A random selection of n words from a text

B. A single word repeated n times

C. A sentence with n words

D. A group of n consecutive words in a text

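Solution

Step 1: Understand the definition of n-gram

An n-gram is a sequence of n consecutive words taken from a text.

Step 2: Compare options with definition

Options A, B, and C describe a random selection, a repeated word, and a sentence length. Only option D describes consecutive words.

Final Answer: D. A group of n consecutive words in a text

Quick Check: "machine learning" is a 2-gram (bigram) of "I love machine learning".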
