Assembling the Full Transformer

In this post we combine all the components built in the previous posts into a complete Transformer model, ready for training and inference.
Categories: bare-bones-ml, code

Author: Devansh Lodha
Published: May 28, 2025

We have arrived at the summit. After building our autograd engine, foundational layers, recurrent networks, and the revolutionary attention mechanism, we now have all the necessary components to construct a full Transformer model, as detailed in the seminal paper “Attention Is All You Need”.

This is the moment where all our previous work—Tensor, Function, Module, Linear, MultiHeadAttention—comes together.

Figure: The Transformer architecture.

Our goals for this post are:

1. Implement the final missing piece: Positional Encoding.
2. Explain the crucial sub-layer components: LayerNorm, FeedForward, and the residual “Add & Norm” connection.
3. Combine these components into EncoderLayer and DecoderLayer modules.
4. Assemble the final Transformer model and verify its structural integrity.

The Final Missing Piece: Positional Encoding

The self-attention mechanism is “permutation-invariant”—it treats an input sentence as a “bag of words” with no inherent order. To fix this, we must explicitly inject information about the position of each token into its embedding.

The authors of the Transformer paper proposed a clever trick using sine and cosine functions of different frequencies:

\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

\[ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

This creates a unique positional signature for each token, and the wave-like nature of the functions allows the model to easily learn relative positions. We pre-calculate these values and simply add them to our token embeddings.

# from_scratch/nn.py
class PositionalEncoding(Module):
    """Injects positional information into the input embeddings."""
    def __init__(self, hidden_size: int, max_len: int = 5000):
        super().__init__()
        pe = np.zeros((max_len, hidden_size), dtype=np.float32)
        position = np.arange(0, max_len, dtype=np.float32).reshape(-1, 1)
        div_term = np.exp(np.arange(0, hidden_size, 2, dtype=np.float32) * -(np.log(10000.0) / hidden_size))
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)
        self.pe = Tensor(pe, requires_grad=False)

    def forward(self, x: Tensor) -> Tensor:
        # x has shape (batch, seq_len, hidden_size); add the first seq_len rows of the table.
        return x + self.pe[:x.shape[1], :]
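To see what this table actually produces, here is a small NumPy-only sanity check (with illustrative sizes, independent of our Tensor class) that re-runs the same math and confirms every position gets a distinct, bounded signature:

import numpy as np

# Re-run the PositionalEncoding math with small, illustrative sizes.
hidden_size, max_len = 8, 16
pe = np.zeros((max_len, hidden_size), dtype=np.float32)
position = np.arange(0, max_len, dtype=np.float32).reshape(-1, 1)
div_term = np.exp(np.arange(0, hidden_size, 2, dtype=np.float32) * -(np.log(10000.0) / hidden_size))
pe[:, 0::2] = np.sin(position * div_term)  # even feature indices get sine
pe[:, 1::2] = np.cos(position * div_term)  # odd feature indices get cosine

# Every position gets a distinct row, and all values stay within [-1, 1].
assert len(np.unique(pe.round(decimals=6), axis=0)) == max_len
print(pe.shape, pe.min(), pe.max())  # (16, 8), values bounded in [-1, 1]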

The Sub-layer Components

Before we assemble the full EncoderLayer and DecoderLayer, let’s look at the smaller utility modules that make them work.

Layer Normalization (LayerNorm)

LayerNorm normalizes the features for each token independently across the hidden dimension. This helps stabilize the training of deep networks by keeping the activations in a consistent range.

# from_scratch/nn.py
class LayerNorm(Module):
    def __init__(self, normalized_shape: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = Tensor(np.ones(normalized_shape), requires_grad=True)
        self.beta = Tensor(np.zeros(normalized_shape), requires_grad=True)
    def forward(self, x: Tensor) -> Tensor:
        # Normalize each token's features over the last (hidden) dimension.
        mean = x.sum(axis=-1, keepdims=True) / Tensor(x.shape[-1])
        var = ((x - mean)**2).sum(axis=-1, keepdims=True) / Tensor(x.shape[-1])
        x_norm = (x - mean) / (var + self.eps).sqrt()
        return self.gamma * x_norm + self.beta
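As a quick numeric check of the normalization itself (plain NumPy, before gamma and beta come into play): every token's feature vector should end up with mean ≈ 0 and variance ≈ 1.

import numpy as np

x = np.random.randn(2, 5, 8)  # (batch, seq_len, hidden_size)
eps = 1e-5

# Same computation as LayerNorm.forward, written with NumPy's mean/var helpers.
mean = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
x_norm = (x - mean) / np.sqrt(var + eps)

print(np.allclose(x_norm.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(x_norm.var(axis=-1), 1.0, atol=1e-3))   # True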

Feed-Forward Network (FeedForward)

Each encoder and decoder layer contains a simple, fully connected feed-forward network. This is applied independently to each position. It consists of two linear transformations with a ReLU activation in between.

# from_scratch/nn.py
class FeedForward(Module):
    def __init__(self, hidden_size: int, ff_size: int):
        super().__init__()
        self.linear1 = Linear(hidden_size, ff_size)
        self.linear2 = Linear(ff_size, hidden_size)
    def forward(self, x: Tensor) -> Tensor:
        return self.linear2(relu(self.linear1(x)))

Residual Connections (ResidualAddAndNorm)

This is perhaps the most critical component. Training very deep networks is difficult because gradients can vanish as they propagate backward. Residual connections (or “skip connections”) solve this by adding the input of a layer to its output (x + sublayer(x)). This creates a direct path for the gradient to flow, making it possible to train networks with dozens or even hundreds of layers.

Our module combines this addition with a LayerNorm step, the “Add & Norm” (post-norm) arrangement used in the original Transformer.

# from_scratch/nn.py
class ResidualAddAndNorm(Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = LayerNorm(hidden_size)
    def forward(self, x: Tensor, sublayer_output: Tensor) -> Tensor:
        return self.norm(x + sublayer_output)

Assembling the Encoder and Decoder Layers

An EncoderLayer contains two main sub-layers: a MultiHeadAttention module and a FeedForward network, each wrapped in our ResidualAddAndNorm module.

# from_scratch/nn.py
class EncoderLayer(Module):
    def __init__(self, hidden_size, num_heads, ff_size):
        super().__init__()
        self.self_attn = MultiHeadAttention(hidden_size, num_heads)
        self.add_norm1 = ResidualAddAndNorm(hidden_size)
        self.ff = FeedForward(hidden_size, ff_size)
        self.add_norm2 = ResidualAddAndNorm(hidden_size)
    def forward(self, x, mask=None):
        attn_output = self.self_attn(q=x, k=x, v=x, mask=mask)
        x = self.add_norm1(x, attn_output)
        ff_output = self.ff(x)
        x = self.add_norm2(x, ff_output)
        return x
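Before stacking these layers into a full model, it can be reassuring to push a dummy batch through a single one. The following is just a sketch of such a check, assuming the Tensor constructor and Module call semantics we use throughout this series:

import numpy as np
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import EncoderLayer

layer = EncoderLayer(hidden_size=64, num_heads=4, ff_size=128)

# A dummy batch of already-embedded tokens: (batch, seq_len, hidden_size)
x = Tensor(np.random.randn(2, 10, 64).astype(np.float32))

out = layer(x)    # self-attention + feed-forward, each wrapped in Add & Norm
print(out.shape)  # (2, 10, 64) -- the shape is preserved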

A DecoderLayer is similar but has three sub-layers, including the crucial cross-attention module that looks at the encoder’s output.

# from_scratch/nn.py
class DecoderLayer(Module):
    def __init__(self, hidden_size, num_heads, ff_size):
        super().__init__()
        self.masked_self_attn = MultiHeadAttention(hidden_size, num_heads)
        self.add_norm1 = ResidualAddAndNorm(hidden_size)
        self.enc_dec_attn = MultiHeadAttention(hidden_size, num_heads)
        self.add_norm2 = ResidualAddAndNorm(hidden_size)
        self.ff = FeedForward(hidden_size, ff_size)
        self.add_norm3 = ResidualAddAndNorm(hidden_size)
    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        attn_output = self.masked_self_attn(q=x, k=x, v=x, mask=tgt_mask)
        x = self.add_norm1(x, attn_output)
        enc_dec_attn_output = self.enc_dec_attn(q=x, k=encoder_output, v=encoder_output, mask=src_mask)
        x = self.add_norm2(x, enc_dec_attn_output)
        ff_output = self.ff(x)
        x = self.add_norm3(x, ff_output)
        return x
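We will implement the real attention masks in the next post, but to make tgt_mask less abstract, here is one common way to build a causal (look-ahead) mask in NumPy. The shape and convention shown here (1 = may attend, 0 = blocked) are an assumption for illustration and must match how our MultiHeadAttention applies its mask:

import numpy as np

seq_len = 5
# Lower-triangular matrix: position i may only attend to positions 0..i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=np.float32))
print(causal_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]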

The Final Assembly

We are now ready to build the full Transformer class. It’s simply a container for the token embedding, the positional encoding, a stack of EncoderLayer modules, a stack of DecoderLayer modules, and a final Linear layer that produces the output logits.

The most important test we can run right now is a structural test. We’re not training the model yet; we are simply verifying that a tensor can flow through the entire complex architecture without any shape mismatches or errors. This will prove that our implementation is correctly assembled.

import sys
import numpy as np
sys.path.append('../')

from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Transformer

# 1. Define Model Hyperparameters
vocab_size = 1000   # Size of our vocabulary
hidden_size = 64    # Dimension of embeddings and model
num_layers = 2      # Number of Encoder/Decoder layers to stack
num_heads = 4       # Number of attention heads
ff_size = 128       # Hidden size of the FeedForward networks
max_len = 50        # Max sequence length for positional encoding
batch_size = 8
seq_len = 20        # Length of our dummy sentences

# 2. Instantiate the Full Transformer Model
model = Transformer(
    vocab_size=vocab_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    num_heads=num_heads,
    ff_size=ff_size,
    max_len=max_len
)

print("Transformer model instantiated successfully!")

# 3. Create Dummy Data
# Source sentence
src_tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))
# Target sentence
tgt_tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))

print(f"\nInput source shape: {src_tokens.shape}")
print(f"Input target shape: {tgt_tokens.shape}")

# 4. Perform a Single Forward Pass
# In a real scenario, we would also pass attention masks here.
logits = model(src_tokens, tgt_tokens)

print(f"\nOutput logits shape: {logits.shape}")
print(f"Expected output shape: ({batch_size}, {seq_len}, {vocab_size})")

# 5. Verification
assert logits.shape == (batch_size, seq_len, vocab_size)
print("\nSuccess! A tensor flowed through the entire Transformer architecture and produced an output of the correct shape.")
Transformer model instantiated successfully!

Input source shape: (8, 20)
Input target shape: (8, 20)

Output logits shape: (8, 20, 1000)
Expected output shape: (8, 20, 1000)

Success! A tensor flowed through the entire Transformer architecture and produced an output of the correct shape.

Conclusion

We just assembled one of the most influential deep learning architectures from the ground up, using only the components we’ve built in our bare-bones-ml library.

We have proven that the complex interplay of embeddings, positional encodings, multi-head attention, residual connections, and feed-forward layers is structurally sound in our implementation.

The final step in this journey is to put our model to the test: in the next and final post of this from-scratch series, we will train our Transformer on a real task, implementing the necessary masking and a full training pipeline.