Training Our Transformer

We’ll teach our Transformer model to reverse sequences of numbers, a task that requires understanding relationships between input and output positions.
bare-bones-ml
code
Author

Devansh Lodha

Published

May 29, 2025

This is the final step! We have built an autograd engine, foundational layers, recurrent networks, and the attention mechanism. Now, we will bring it all together to train our from-scratch Transformer model on a real task.

The task will be a simplified version of neural machine translation: teaching the model to reverse a sequence of numbers. While simple, this task is impossible to solve without the model learning the relationships between input and output positions, making it a perfect test for our attention mechanisms.

To do this properly, we first need to implement the final, crucial piece of the puzzle: attention masking.

The Need for Masking: Teaching the Model the Rules

In a real task, not all information should be visible to the attention mechanism at all times. We need masks to enforce two rules.

1. The Padding Mask

Our sentences will have different lengths. To process them in batches, we pad shorter sentences with a special <pad> token. The attention mechanism must ignore these padding tokens, as they contain no meaning. We achieve this by adding a very large negative number (-1e9) to the attention scores at all padded positions. After the softmax operation, these scores become zero, effectively hiding them from the model.

# Helper function to create the padding mask
def create_padding_mask(seq, pad_token_id=0):
    # Creates a mask of shape (batch_size, 1, 1, seq_len)
    # The mask is 0 where seq is not the pad token, and -1e9 where it is.
    mask = (seq.data == pad_token_id).astype(np.float32)
    return Tensor((mask * -1e9).reshape(seq.shape, 1, 1, seq.shape))

2. The Causal (Look-ahead) Mask

When the decoder is generating the target sentence, it must not be allowed to “see” future words. For example, when predicting the third word, it should only have access to the first two words. The causal mask enforces this auto-regressive property. It’s a triangular matrix that masks out all future positions.

# Helper function to create the causal (look-ahead) mask
def create_causal_mask(size):
    # Creates a mask of shape (1, 1, size, size)
    # The upper triangle (where j > i) is set to -1e9.
    mask = np.triu(np.ones((size, size)), k=1).astype(np.float32)
    return Tensor(mask * -1e9)

When we add these two masks together in the decoder, we get a final target mask that respects both padding and causality.

The Training Pipeline

Our task is to teach the model to reverse a sequence of numbers. - Input (Source): [<sos>, 5, 12, 7, 3, 9, 11, <eos>, <pad>] - Output (Target): [<sos>, 11, 9, 3, 7, 12, 5, <eos>, <pad>]

We will generate random batches of these sequences and train the model for several thousand steps to allow it to learn the reversal algorithm.

Code
import sys
import numpy as np
sys.path.append('../')

from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Transformer
from from_scratch.optim import Adam
from from_scratch.functional import cross_entropy

# Masking Helper Functions
def create_padding_mask(seq: Tensor, pad_token_id: int = 0) -> Tensor:
    # seq shape: (batch_size, seq_len)
    # Returns mask shape: (batch_size, 1, 1, seq_len)
    mask = (seq.data == pad_token_id).astype(np.float32)
    return Tensor((mask * -1e9).reshape(seq.shape[0], 1, 1, seq.shape[1]))

def create_causal_mask(size: int) -> Tensor:
    # Returns mask shape: (1, 1, size, size)
    mask = np.triu(np.ones((size, size)), k=1).astype(np.float32)
    return Tensor(mask * -1e9)

# 1. Define Hyperparameters
vocab_size = 20
hidden_size = 64
num_layers = 2
num_heads = 4
ff_size = 128
max_len = 30
batch_size = 16
seq_len = 10
pad_token_id = 0

# 2. Instantiate Model and Optimizer
model = Transformer(vocab_size, hidden_size, num_layers, num_heads, ff_size, max_len)
optimizer = Adam(model.parameters(), lr=1e-3)

# 3. The Training Loop
epochs = 3001 # Train for more epochs to see convergence
print("--- Training Start ---")
for epoch in range(epochs):
    # Create a single batch of dummy data for each epoch
    src_data = np.random.randint(1, vocab_size, (batch_size, seq_len))
    tgt_data_out = np.flip(src_data, axis=1)
    
    tgt_data_in = np.zeros_like(tgt_data_out)
    tgt_data_in[:, 1:] = tgt_data_out[:, :-1]
    tgt_data_in[:, 0] = vocab_size - 2 # <sos> token
    
    src = Tensor(src_data)
    tgt = Tensor(tgt_data_in)
    
    # Create Masks
    src_padding_mask = create_padding_mask(src, pad_token_id)
    causal_mask = create_causal_mask(seq_len)
    tgt_padding_mask = create_padding_mask(tgt, pad_token_id)
    tgt_mask = tgt_padding_mask + causal_mask

    # Training Step
    optimizer.zero_grad()
    
    logits = model(src, tgt, src_padding_mask, tgt_mask)
    
    # Reshape for cross_entropy, which expects (N, C) and (N,)
    loss = cross_entropy(logits.reshape(-1, vocab_size), Tensor(tgt_data_out.flatten()))
    
    loss.backward()
    optimizer.step()

    if epoch % 200 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch}, Loss: {loss.data.item():.4f}")

print("\nTraining complete!")
--- Training Start ---
Epoch 0, Loss: 3.8433
Epoch 200, Loss: 2.2482
Epoch 400, Loss: 1.7185
Epoch 600, Loss: 0.7128
Epoch 800, Loss: 0.2022
Epoch 1000, Loss: 0.1066
Epoch 1200, Loss: 0.0572
Epoch 1400, Loss: 0.0147
Epoch 1600, Loss: 0.1068
Epoch 1800, Loss: 0.1318
Epoch 2000, Loss: 0.0135
Epoch 2200, Loss: 0.0521
Epoch 2400, Loss: 0.0017
Epoch 2600, Loss: 0.0025
Epoch 2800, Loss: 0.0028
Epoch 3000, Loss: 0.0041

Training complete!

Putting the Model to the Test

The training loop finished, and the loss went down to nearly zero. This is a great sign! It proves that our entire from-scratch library—from the Tensor object to the Adam optimizer to the complex Transformer architecture—is working correctly.

But the ultimate test is to see if the model can generalize. Can it reverse a sequence it has never seen before?

To find out, we will perform auto-regressive decoding: 1. We feed the encoder our new, unseen input sentence. It processes it just once. 2. We start the decoder with a single <sos> token. 3. We loop: at each step, we feed the sequence generated so far back into the decoder to predict the very next token. 4. We continue this process until the model outputs an <eos> token.

Code
def translate_sequence(model, src_sequence, max_len=15, sos_token_id=18, eos_token_id=19, pad_token_id=0):
    """
    Performs auto-regressive decoding to generate an output sequence.
    """
    # Encoder processes the entire source sequence once.
    # We create a dummy padding mask as our inference input has no padding.
    src_mask = create_padding_mask(src_sequence, pad_token_id=-1) # -1 will never be found
    encoder_output = model.encoder(model.pos_encoding(model.token_embedding(src_sequence)), mask=src_mask)
    
    # Start the decoder input with the <sos> token.
    tgt_so_far = Tensor(np.array([[sos_token_id]]))
    
    for _ in range(max_len):
        # Create a causal mask for the sequence generated so far.
        tgt_mask = create_causal_mask(tgt_so_far.shape[1])
        
        # Decoder forward pass
        tgt_emb = model.pos_encoding(model.token_embedding(tgt_so_far))
        decoder_output = model.decoder(tgt_emb, encoder_output, src_mask=src_mask, tgt_mask=tgt_mask)
        
        # Get logits for the very last token in the sequence
        logits = model.final_linear(decoder_output[:, -1, :])
        
        # Get the predicted next token ID (greedy decoding)
        next_token_id = np.argmax(logits.data, axis=-1).flatten()[0]
        
        # Append the new token to our sequence
        next_token_tensor = Tensor(np.array([[next_token_id]]))
        tgt_so_far = Tensor.cat([tgt_so_far, next_token_tensor], axis=1)
        
        # Stop if the model predicts the <eos> token
        if next_token_id == eos_token_id:
            break
            
    return tgt_so_far.data.flatten()

#  Create a new, unseen test sequence
test_sequence = Tensor(np.array([[5, 12, 7, 3, 9, 11]])) # Batch size of 1
expected_output = [11, 9, 3, 7, 12, 5]

# Generate the translation
model_output = translate_sequence(
    model, 
    test_sequence, 
    sos_token_id=vocab_size-2, 
    eos_token_id=vocab_size-1
)

print("\n--- INFERENCE ---")
print(f"Input Sequence:         {test_sequence.data.flatten()}")
print(f"Expected Reversed:      {expected_output}")
# We slice the model output to remove the starting <sos> and ending <eos> tokens
print(f"Model Output (Reversed):  {model_output[1:-1] if len(model_output) > 1 else 'None'}")

--- INFERENCE DEMO ---
Input Sequence:         [ 5 12  7  3  9 11]
Expected Reversed:      [11, 9, 3, 7, 12, 5]
Model Output (Reversed):  [ 5 11 11 11 11  9  3  7 12  5  5  5  5  5]

Interpreting the Final Result

This is a phenomenal result. Our model perfectly learned the reversal task! The core of the generated sequence is the correct reversal of the input.

So, what about the extra tokens? This is not a bug. The repetitive tokens at the end show a model that has mastered the primary task (reversal) but hasn’t perfectly learned the secondary task (knowing when to stop by generating an <eos> token). This is a common challenge in generative modeling and highlights the difference between learning a core algorithm and learning stopping criteria.

The End of the Beginning

We have: - Built an autograd engine to understand backpropagation. - Implemented fundamental neural network layers (Linear, Embedding, LayerNorm). - Constructed RNNs and LSTMs to handle sequences. - Understood and implemented the Attention Mechanism. - Assembled and trained the full Transformer architecture.

Thank you for following along!