We’ll be putting together all the components we’ve built in the previous posts to assemble a complete Transformer model, ready for training and inference.
bare-bones-ml
code
Author
Devansh Lodha
Published
May 28, 2025
We have arrived at the summit. After building our autograd engine, foundational layers, recurrent networks, and the revolutionary attention mechanism, we now have all the necessary components to construct a full Transformer model, as detailed in the seminal paper “Attention Is All You Need”.
This is the moment where all our previous work—Tensor, Function, Module, Linear, MultiHeadAttention—comes together.
Transformer Architecture
Our goals for this post are:

1. Implement the final missing piece: Positional Encoding.
2. Explain the crucial sub-layer components: LayerNorm, FeedForward, and the residual “Add & Norm” connection.
3. Combine these components into EncoderLayer and DecoderLayer modules.
4. Assemble the final Transformer model and verify its structural integrity.
The Final Missing Piece: Positional Encoding
The self-attention mechanism is “permutation-invariant”—it treats an input sentence as a “bag of words” with no inherent order. To fix this, we must explicitly inject information about the position of each token into its embedding.
The authors of the Transformer paper proposed a clever trick using sine and cosine functions of different frequencies: \[
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]\[
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
\]
This creates a unique positional signature for each token, and the wave-like nature of the functions allows the model to easily learn relative positions. We pre-calculate these values and simply add them to our token embeddings.
```python
# from_scratch/nn.py
class PositionalEncoding(Module):
    """Injects positional information into the input embeddings."""
    def __init__(self, hidden_size: int, max_len: int = 5000):
        super().__init__()
        # Pre-compute the sinusoidal table once: shape (max_len, hidden_size).
        pe = np.zeros((max_len, hidden_size), dtype=np.float32)
        position = np.arange(0, max_len, dtype=np.float32).reshape(-1, 1)
        div_term = np.exp(np.arange(0, hidden_size, 2, dtype=np.float32) * -(np.log(10000.0) / hidden_size))
        pe[:, 0::2] = np.sin(position * div_term)  # even indices get sine
        pe[:, 1::2] = np.cos(position * div_term)  # odd indices get cosine
        self.pe = Tensor(pe, requires_grad=False)

    def forward(self, x: Tensor) -> Tensor:
        # x: (batch, seq_len, hidden_size); add the first seq_len positional rows (broadcast over the batch).
        return x + self.pe[:x.shape[1], :]
```
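As a quick sanity check, here is a hedged usage sketch. It assumes `PositionalEncoding` is exported from `from_scratch.nn` and that, like our other modules, instances are callable:

```python
import numpy as np
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import PositionalEncoding

# A dummy batch of embeddings: (batch_size=8, seq_len=20, hidden_size=64)
x = Tensor(np.random.randn(8, 20, 64).astype(np.float32))

pos_enc = PositionalEncoding(hidden_size=64, max_len=50)
out = pos_enc(x)  # the first 20 rows of the PE table are added to every sequence in the batch

print(out.shape)  # expected: (8, 20, 64)
```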
The Sub-layer Components
Before we assemble the full EncoderLayer and DecoderLayer, let’s look at the smaller utility modules that make them work.
Layer Normalization (LayerNorm)
LayerNorm normalizes the features for each token independently across the hidden dimension. This helps stabilize the training of deep networks by keeping the activations in a consistent range.
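The computation itself is small enough to show in plain NumPy. The sketch below is purely illustrative: our actual `LayerNorm` subclasses `Module` and operates on `Tensor`s so it participates in autograd, and the learnable scale and shift (which most LayerNorm implementations include) are shown here as plain arrays:

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each token's feature vector across the last (hidden) dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per token
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(8, 20, 64).astype(np.float32)  # (batch, seq_len, hidden_size)
out = layer_norm(x, gamma=np.ones(64, dtype=np.float32), beta=np.zeros(64, dtype=np.float32))
print(out.mean(axis=-1)[0, 0], out.std(axis=-1)[0, 0])  # ~0 and ~1 for every token
```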
The Position-wise Feed-Forward Network (FeedForward)
Each encoder and decoder layer also contains a simple, fully connected feed-forward network, applied independently to each position. It consists of two linear transformations with a ReLU activation in between.
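In other words, FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, computed for every position independently. Here is a NumPy-only sketch of that computation; the real module is built from two of our Linear layers and operates on `Tensor`s, and the weight shapes below are illustrative:

```python
import numpy as np

def feed_forward(x: np.ndarray, W1, b1, W2, b2) -> np.ndarray:
    """Position-wise FFN: two linear maps with a ReLU in between, applied to each token independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # (batch, seq_len, ff_size)
    return hidden @ W2 + b2                # back to (batch, seq_len, hidden_size)

hidden_size, ff_size = 64, 128
x = np.random.randn(8, 20, hidden_size).astype(np.float32)
W1 = np.random.randn(hidden_size, ff_size).astype(np.float32) * 0.02
b1 = np.zeros(ff_size, dtype=np.float32)
W2 = np.random.randn(ff_size, hidden_size).astype(np.float32) * 0.02
b2 = np.zeros(hidden_size, dtype=np.float32)

print(feed_forward(x, W1, b1, W2, b2).shape)  # (8, 20, 64)
```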
Residual Connections (Add & Norm)
This is perhaps the most critical component. Training very deep networks is difficult because gradients can vanish as they propagate backward. Residual connections (or “skip connections”) solve this by adding the input of a layer to its output (x + sublayer(x)). This creates a direct path for the gradient to flow, making it possible to train networks with dozens or even hundreds of layers.
Our module combines this addition with a LayerNorm step, which is the standard pattern in the Transformer.
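Here is a NumPy-only sketch of that “Add & Norm” pattern, with a stand-in sub-layer in place of attention or the feed-forward network. It is illustrative only: the real module operates on `Tensor`s and uses our LayerNorm, and the scale and shift parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-token normalization across the hidden dimension (scale/shift omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x: np.ndarray, sublayer) -> np.ndarray:
    """Residual connection followed by LayerNorm: norm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

x = np.random.randn(8, 20, 64).astype(np.float32)
# Stand-in sub-layer; in the real model this is MultiHeadAttention or the feed-forward block.
dummy_sublayer = lambda t: 0.1 * t
print(add_and_norm(x, dummy_sublayer).shape)  # (8, 20, 64)
```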
Assembling the Full Transformer
We are now ready to build the full Transformer class. It’s simply a container for the embedding layer, the positional encoding, a stack of EncoderLayer modules, a stack of DecoderLayer modules, and a final Linear layer that produces the output logits.
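To make that wiring concrete, here is an illustrative skeleton of how such a container can compose its parts. The class name, attribute names, and layer interfaces below are assumptions for the sake of the sketch, not necessarily the API in from_scratch/nn.py:

```python
class SketchTransformer:
    """Illustrative skeleton: how an encoder-decoder container can wire its parts together."""

    def __init__(self, embed, pos_enc, encoder_layers, decoder_layers, output_proj):
        self.embed = embed                    # token ids -> embedding vectors
        self.pos_enc = pos_enc                # adds the positional signal
        self.encoder_layers = encoder_layers  # list of stacked encoder blocks
        self.decoder_layers = decoder_layers  # list of stacked decoder blocks
        self.output_proj = output_proj        # final projection: hidden_size -> vocab_size logits

    def forward(self, src_tokens, tgt_tokens):
        # Encoder: embed the source, add positions, pass through the encoder stack.
        memory = self.pos_enc(self.embed(src_tokens))
        for layer in self.encoder_layers:
            memory = layer(memory)
        # Decoder: embed the target, add positions; each block attends to itself and to `memory`.
        output = self.pos_enc(self.embed(tgt_tokens))
        for layer in self.decoder_layers:
            output = layer(output, memory)
        # Project the decoder output to vocabulary logits.
        return self.output_proj(output)
```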
The most important test we can run right now is a structural test. We’re not training the model yet; we are simply verifying that a tensor can flow through the entire complex architecture without any shape mismatches or errors. This will prove that our implementation is correctly assembled.
```python
import sys
import numpy as np

sys.path.append('../')
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Transformer

# 1. Define Model Hyperparameters
vocab_size = 1000   # Size of our vocabulary
hidden_size = 64    # Dimension of embeddings and model
num_layers = 2      # Number of Encoder/Decoder layers to stack
num_heads = 4       # Number of attention heads
ff_size = 128       # Hidden size of the FeedForward networks
max_len = 50        # Max sequence length for positional encoding
batch_size = 8
seq_len = 20        # Length of our dummy sentences

# 2. Instantiate the Full Transformer Model
model = Transformer(
    vocab_size=vocab_size,
    hidden_size=hidden_size,
    num_layers=num_layers,
    num_heads=num_heads,
    ff_size=ff_size,
    max_len=max_len
)
print("Transformer model instantiated successfully!")

# 3. Create Dummy Data
# Source sentence
src_tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))
# Target sentence
tgt_tokens = Tensor(np.random.randint(0, vocab_size, (batch_size, seq_len)))

print(f"\nInput source shape: {src_tokens.shape}")
print(f"Input target shape: {tgt_tokens.shape}")

# 4. Perform a Single Forward Pass
# In a real scenario, we would also pass attention masks here.
logits = model(src_tokens, tgt_tokens)

print(f"\nOutput logits shape: {logits.shape}")
print(f"Expected output shape: ({batch_size}, {seq_len}, {vocab_size})")

# 5. Verification
assert logits.shape == (batch_size, seq_len, vocab_size)
print("\nSuccess! A tensor flowed through the entire Transformer architecture and produced an output of the correct shape.")
```
Transformer model instantiated successfully!
Input source shape: (8, 20)
Input target shape: (8, 20)
Output logits shape: (8, 20, 1000)
Expected output shape: (8, 20, 1000)
Success! A tensor flowed through the entire Transformer architecture and produced an output of the correct shape.
Conclusion
We just assembled one of the most influential deep learning architectures from the ground up, using only the components we’ve built in our bare-bones-ml library.
We have proven that the complex interplay of embeddings, positional encodings, multi-head attention, residual connections, and feed-forward layers is structurally sound in our implementation.
The final step in this journey is to put our model to the test: in the next and final post of this from-scratch series, we will train our Transformer on a real task, implementing the necessary masking and a full training pipeline.