NlpConceptBeginner · 3 min read

What is Self Attention in NLP: Simple Explanation and Example

In NLP, self-attention is a technique that helps a model focus on different parts of a sentence when understanding each word. It allows the model to weigh the importance of all words relative to each other, improving context understanding.

⚙️

How It Works

Imagine reading a sentence and trying to understand the meaning of a word by looking at the other words around it. Self-attention works similarly: it lets the model look at every word in the sentence and decide how much each word should influence the understanding of the current word.

It does this by creating scores that measure the connection between words. For example, in the sentence "The cat sat on the mat," when focusing on "sat," the model might pay more attention to "cat" and "mat" because they relate closely to the action. This helps the model understand context better than just reading words one by one.

By doing this for every word, self-attention builds a rich understanding of the sentence, which is very useful for tasks like translation, summarization, or answering questions.

💻

Example

This example shows a simple self-attention calculation using Python and NumPy. It computes attention scores for a small set of word vectors and outputs the weighted sum representing the attended information.

python

import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Example word vectors (3 words, 4 features each)
words = np.array([
    [1, 0, 1, 0],  # word 1
    [0, 2, 0, 2],  # word 2
    [1, 1, 1, 1]   # word 3
])

# Compute attention scores (dot product of words with each other)
scores = words @ words.T

# Apply softmax to get attention weights
attention_weights = softmax(scores)

# Compute weighted sum of words for each word
self_attention_output = attention_weights @ words

print("Attention Weights:\n", attention_weights)
print("\nSelf-Attention Output:\n", self_attention_output)

Output

Attention Weights: [[0.57611688 0.21194156 0.21194156] [0.21194156 0.57611688 0.21194156] [0.21194156 0.21194156 0.57611688]] Self-Attention Output: [[0.78811688 0.42388312 0.78811688 0.42388312] [0.42388312 1.42388312 0.42388312 1.42388312] [0.78811688 0.78811688 0.78811688 0.78811688]]

🎯

When to Use

Use self-attention when you want a model to understand the relationships between words in a sentence or document, especially when context matters a lot. It is key in modern NLP tasks like machine translation, text summarization, and question answering.

For example, in translating a sentence, self-attention helps the model know which words to focus on to produce accurate translations. In chatbots, it helps understand user queries better by considering the whole sentence context.

✅

Key Points

Self-attention lets models weigh the importance of all words relative to each other.
It improves understanding of context in sentences.
It is a core part of Transformer models used in NLP today.
Self-attention computes scores and uses them to create weighted word representations.

✅

Key Takeaways

Self-attention helps models focus on important words in a sentence by comparing all words to each other.

It is essential for understanding context in NLP tasks like translation and summarization.

Self-attention uses scores and weights to create better word representations.

Transformer models rely heavily on self-attention for their success.

You can implement a simple self-attention mechanism using dot products and softmax.