We’ll be implementing the attention mechanism, a powerful technique that allows models to focus on relevant parts of the input sequence at each step.
bare-bones-ml
code
Author
Devansh Lodha
Published
May 27, 2025
So far, we have built Recurrent Neural Networks and LSTMs. These models process sequences step-by-step, maintaining a “memory” or hidden state that summarizes all the information seen so far. While powerful, this approach has a fundamental weakness: the hidden state becomes an information bottleneck. The model must compress the entire meaning of a long sentence like “The fluffy cat, which had been sleeping all day on the warm, sunny windowsill, finally woke up and…” into a single, fixed-size vector. By the time the model processes “woke up,” the specific details about the “fluffy cat” might be diluted or lost.
What if, instead of relying on a single summary vector, the model could “look back” at the entire input sequence at every step and decide which parts are most relevant for the current task?
This is the core intuition behind the Attention Mechanism. It’s a technique that allows a model to dynamically focus on the most relevant parts of the input sequence when producing a part of the output sequence. It was the key ingredient that unlocked the power of the Transformer architecture and redefined modern AI.
The Attention Formula: Queries, Keys, and Values
Attention can be described beautifully through an analogy to a library retrieval system. You have a question, and you want to find the most relevant books.
Query (Q): This is your question. In a model, it represents the current context or the word you are trying to produce (e.g., “I need information about an animal”).
Key (K): These are the titles on the spines of all the books in the library. Each input word has a Key vector, like a label that says, “I am about animals” or “I am about places.”
Value (V): These are the actual contents of the books. Each input word also has a Value vector, which is its rich, meaningful representation.
The process is intuitive:

1. You compare your Query to every Key in the library to see how well they match. A common way to do this is with a dot product. A high score means a strong match.
2. You take all the scores and run them through a softmax function. This converts the scores into a probability distribution. These are your attention weights. A key with a high score will get a high weight.
3. You create a weighted sum of all the Values (the books’ contents) using your attention weights. Books with higher weights contribute more to the final result.
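To see these three steps with concrete numbers, here is a tiny walk-through in plain NumPy (the query, keys, and values are made-up illustrative vectors, not tied to the library code):

import numpy as np

query = np.array([1.0, 0.0])                     # "I'm looking for an animal"
keys = np.array([[0.9, 0.1],                     # key advertising "I'm about animals"
                 [0.1, 0.9]])                    # key advertising "I'm about places"
values = np.array([[5.0, 5.0],                   # content of the animal word
                   [-5.0, -5.0]])                # content of the place word

scores = keys @ query                            # step 1: dot-product scores -> [0.9, 0.1]
weights = np.exp(scores) / np.exp(scores).sum()  # step 2: softmax -> roughly [0.69, 0.31]
output = weights @ values                        # step 3: weighted sum of the Values
print(weights, output)                           # ~[0.69 0.31] and ~[1.9 1.9]

The animal-related word gets about twice the weight of the place-related word, so the output leans toward its Value without ignoring the other one.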
This is captured in the famous Scaled Dot-Product Attention formula from the “Attention Is All You Need” paper: \[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
The division by \(\sqrt{d_k}\) (the square root of the key dimension) is a scaling factor. Without it, the dot-product scores grow in magnitude with the key dimension, pushing the softmax into saturated regions where gradients become vanishingly small; scaling keeps the scores in a range where training stays stable.
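Here is a small NumPy-only illustration of this effect (independent of the from_scratch library): with unit-variance random vectors, the unscaled scores have a standard deviation of roughly \(\sqrt{d_k}\) and the softmax collapses to a near one-hot distribution, while the scaled scores stay near unit scale.

import numpy as np

rng = np.random.default_rng(0)
d_k = 512

q = rng.standard_normal(d_k)              # one query with unit-variance components
keys = rng.standard_normal((8, d_k))      # eight keys

raw_scores = keys @ q                      # std grows like sqrt(d_k) (~22 here)
scaled_scores = raw_scores / np.sqrt(d_k)  # std stays around 1

def softmax(x):
    x = x - x.max()                        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

print("raw score std:   ", round(raw_scores.std(), 1))
print("scaled score std:", round(scaled_scores.std(), 1))
print("softmax(raw):    ", np.round(softmax(raw_scores), 3))    # nearly one-hot
print("softmax(scaled): ", np.round(softmax(scaled_scores), 3)) # smoother weights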
Let’s look at our from-scratch implementation.
# from_scratch/nn.py
class ScaledDotProductAttention(Module):
    """Computes Scaled Dot-Product Attention."""
    def forward(self, q: Tensor, k: Tensor, v: Tensor, mask=None) -> Tensor:
        # Get the dimension of the key vectors
        key_dim = Tensor(k.shape[-1])
        # Transpose the last two dimensions of the key tensor for matrix multiplication
        key_transposed = k.transpose(-2, -1)
        # 1. Calculate scores: Query @ Key_transposed
        scores = q @ key_transposed
        # 2. Scale the scores
        scaled_scores = scores / key_dim.sqrt()
        # 3. Apply mask if provided (e.g., for padding or causal attention)
        if mask is not None:
            scaled_scores = scaled_scores + mask
        # 4. Apply softmax to get attention weights
        weights = softmax(scaled_scores)
        # 5. Multiply weights by Values to get the final output
        return weights @ v
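A quick way to exercise this module is on random tensors. The sketch below assumes the constructor takes no arguments and that calling the module dispatches to forward; that is consistent with the code above, but not shown explicitly, so treat it as an assumption:

import numpy as np
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import ScaledDotProductAttention

# Hypothetical usage sketch; shapes follow the (batch, sequence, dimension) convention.
batch_size, seq_len, d_k = 2, 5, 16
q = Tensor(np.random.randn(batch_size, seq_len, d_k))
k = Tensor(np.random.randn(batch_size, seq_len, d_k))
v = Tensor(np.random.randn(batch_size, seq_len, d_k))

attention = ScaledDotProductAttention()
output = attention(q, k, v)   # assumed to call forward(q, k, v)
print(output.shape)           # expected: (2, 5, 16), the same shape as the values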
A Simple Demonstration
Let’s see this in action. We’ll create a simple scenario where our “Values” represent three distinct concepts, and we’ll watch how the attention weights shift based on our “Query.” To make the effect obvious, our query vector will have a large value in the dimension it’s “interested” in.
import sys
sys.path.append('../')

import numpy as np
from from_scratch.autograd.tensor import Tensor
from from_scratch.functional import softmax

# Imagine our "Values" represent three concepts: "cat", "dog", "bird"
V = Tensor(np.array([
    [1, 0, 0],  # Vector for 'cat'
    [0, 1, 0],  # Vector for 'dog'
    [0, 0, 1]   # Vector for 'bird'
]))

# The "Keys" are labels for our values. We'll make them match the values for simplicity.
K = Tensor(np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
]))

def calculate_attention(Q, K, V):
    # For this simple demo, we'll omit the scaling factor for a clearer result.
    scores = Q @ K.T
    weights = softmax(scores)
    return weights

# Scenario 1: We are looking for "dog"
# Use a more "opinionated" query vector with a large magnitude in the 'dog' dimension.
query_dog = Tensor([[1.0, 10.0, 1.0]])
attention_weights_dog = calculate_attention(query_dog, K, V)
print("--- Query: 'dog' ---")
print("Attention Weights:\n", np.round(attention_weights_dog.data, 2))
print("The weights show a very clear focus on the second item (index 1), which is 'dog'.")

# Scenario 2: Now we are looking for "bird"
query_bird = Tensor([[1.0, 1.0, 10.0]])
attention_weights_bird = calculate_attention(query_bird, K, V)
print("\n--- Query: 'bird' ---")
print("Attention Weights:\n", np.round(attention_weights_bird.data, 2))
print("The weights have decisively shifted to the third item (index 2), which is 'bird'.")
--- Query: 'dog' ---
Attention Weights:
[[0. 1. 0.]]
The weights show a very clear focus on the second item (index 1), which is 'dog'.
--- Query: 'bird' ---
Attention Weights:
[[0. 0. 1.]]
The weights have decisively shifted to the third item (index 2), which is 'bird'.
Multi-Head Attention: Focusing on Many Things at Once
The Transformer paper took this one step further with Multi-Head Attention. The intuition is simple: instead of having one attention mechanism, let’s have several of them (“heads”) working in parallel. Each head can learn to focus on different aspects of the input. For example, when translating a sentence, one head might learn to track subject-verb agreement, while another tracks adjective-noun pairings.
This is achieved by:

1. Creating separate Linear projection layers for the Queries, Keys, and Values for each head.
2. Splitting the input into multiple “heads” and applying Scaled Dot-Product Attention to each head in parallel.
3. Concatenating the results from all heads.
4. Passing the concatenated output through a final Linear layer.
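The real MultiHeadAttention module lives in from_scratch/nn.py; as a rough, NumPy-only sketch of the project-split-attend-concatenate pattern (illustrative random weights and helper names, not the library's actual implementation):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_sketch(x, num_heads, rng):
    batch, seq_len, hidden = x.shape
    head_dim = hidden // num_heads

    # 1. Project the input to queries, keys, and values (random matrices stand in
    #    for learned Linear layers in this sketch).
    w_q, w_k, w_v, w_o = (rng.standard_normal((hidden, hidden)) * 0.02 for _ in range(4))
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # 2. Split into heads: (batch, seq, hidden) -> (batch, heads, seq, head_dim)
    def split_heads(t):
        return t.reshape(batch, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)
    q, k, v = split_heads(q), split_heads(k), split_heads(v)

    # 3. Scaled dot-product attention inside each head, computed in parallel.
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    per_head = softmax(scores) @ v

    # 4. Concatenate the heads back together and apply the final output projection.
    concat = per_head.transpose(0, 2, 1, 3).reshape(batch, seq_len, hidden)
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10, 32))
print(multi_head_attention_sketch(x, num_heads=4, rng=rng).shape)  # (4, 10, 32)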
A Structural Test
At this stage, the output of an untrained MultiHeadAttention module is just a matrix of meaningless numbers. The most important thing to verify is its structure. A key property of these blocks is that the output shape is identical to the input shape. This is what allows us to stack them to create a deep network. Let’s test that.
from from_scratch.nn import MultiHeadAttention

# Define parameters
batch_size = 4
seq_len = 10
hidden_size = 32
num_heads = 4  # hidden_size must be divisible by num_heads

# Create some dummy input tensors
query = Tensor(np.random.randn(batch_size, seq_len, hidden_size))
key = Tensor(np.random.randn(batch_size, seq_len, hidden_size))
value = Tensor(np.random.randn(batch_size, seq_len, hidden_size))

# Instantiate and run our module
multi_head_attention = MultiHeadAttention(hidden_size=hidden_size, num_heads=num_heads)
output = multi_head_attention(q=query, k=key, v=value)

print(f"Input shape: ({batch_size}, {seq_len}, {hidden_size})")
print(f"Output shape: {output.shape}")
assert output.shape == (batch_size, seq_len, hidden_size)
print("\nSuccess! The MultiHeadAttention module processed the input and produced an output of the correct shape.")
Input shape: (4, 10, 32)
Output shape: (4, 10, 32)
Success! The MultiHeadAttention module processed the input and produced an output of the correct shape.
Conclusion
The Attention mechanism is arguably the most important concept in modern deep learning after backpropagation itself. It broke the sequential bottleneck of RNNs and paved the way for parallelizable, highly effective models.
Now that we have built this core component, we are finally ready to assemble the full Transformer architecture.