We’ll teach our Transformer model to reverse sequences of numbers, a task that requires understanding relationships between input and output positions.
Author
Devansh Lodha
Published
May 29, 2025
This is the final step! We have built an autograd engine, foundational layers, recurrent networks, and the attention mechanism. Now, we will bring it all together to train our from-scratch Transformer model on a real task.
The task will be a simplified version of neural machine translation: teaching the model to reverse a sequence of numbers. While simple, this task is impossible to solve without the model learning the relationships between input and output positions, making it a perfect test for our attention mechanisms.
To do this properly, we first need to implement the final, crucial piece of the puzzle: attention masking.
The Need for Masking: Teaching the Model the Rules
In a real task, not all information should be visible to the attention mechanism at all times. We need masks to enforce two rules.
1. The Padding Mask
Our sentences will have different lengths. To process them in batches, we pad shorter sentences with a special <pad> token. The attention mechanism must ignore these padding tokens, as they contain no meaning. We achieve this by adding a very large negative number (-1e9) to the attention scores at all padded positions. After the softmax operation, these scores become zero, effectively hiding them from the model.
# Helper function to create the padding mask
def create_padding_mask(seq, pad_token_id=0):
    # Creates a mask of shape (batch_size, 1, 1, seq_len).
    # The mask is 0 where seq is not the pad token, and -1e9 where it is.
    mask = (seq.data == pad_token_id).astype(np.float32)
    return Tensor((mask * -1e9).reshape(seq.shape[0], 1, 1, seq.shape[1]))
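To make the effect concrete, here is a tiny NumPy-only sketch (independent of our library) of what adding -1e9 before the softmax does. Even if the raw attention score at a padded position is the largest one, its weight collapses to zero:

import numpy as np

# Toy attention scores for one query over a length-4 sequence,
# where the last position is padding.
scores = np.array([2.0, 1.0, 0.5, 3.0])
pad_mask = np.array([0.0, 0.0, 0.0, -1e9])  # -1e9 at the padded position

masked = scores + pad_mask
weights = np.exp(masked - masked.max()) / np.exp(masked - masked.max()).sum()
print(weights)  # the last weight is ~0.0: the padded position is ignored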
2. The Causal (Look-ahead) Mask
When the decoder is generating the target sentence, it must not be allowed to “see” future words. For example, when predicting the third word, it should only have access to the first two words. The causal mask enforces this auto-regressive property. It’s a triangular matrix that masks out all future positions.
# Helper function to create the causal (look-ahead) mask
def create_causal_mask(size):
    # Creates a (size, size) mask that broadcasts over the batch and head dimensions.
    # The upper triangle (where j > i) is set to -1e9.
    mask = np.triu(np.ones((size, size)), k=1).astype(np.float32)
    return Tensor(mask * -1e9)
When we add these two masks together in the decoder, we get a final target mask that respects both padding and causality.
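Here is what that combination looks like on a toy example, as a standalone NumPy sketch that mirrors the two helpers above (batch size 1, sequence length 4, with the last position padded):

import numpy as np

seq_len = 4
# Padding mask for a single sequence whose last position is <pad> (id 0):
seq = np.array([[7, 3, 9, 0]])
padding_mask = (seq == 0).astype(np.float32) * -1e9      # shape (1, 4)
padding_mask = padding_mask.reshape(1, 1, 1, seq_len)    # shape (1, 1, 1, 4)

# Causal mask: -1e9 above the diagonal (future positions):
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1) * -1e9  # shape (4, 4)

# Broadcasting adds them into a single (1, 1, 4, 4) target mask:
tgt_mask = padding_mask + causal_mask
print(tgt_mask[0, 0])
# Row i has large negative values at every j > i (future positions)
# and in the padded column j = 3.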
The Training Pipeline
Our task is to teach the model to reverse a sequence of numbers.
- Input (Source): [<sos>, 5, 12, 7, 3, 9, 11, <eos>, <pad>]
- Output (Target): [<sos>, 11, 9, 3, 7, 12, 5, <eos>, <pad>]
We will generate random batches of these sequences and train the model for several thousand steps to allow it to learn the reversal algorithm.
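Concretely, each training pair is built by reversing the source and shifting the result right by one position to form the decoder input (teacher forcing). Here is the construction for a single sequence, mirroring the batch code in the training loop below, where token id vocab_size - 2 = 18 serves as the <sos> marker:

import numpy as np

vocab_size = 20
src = np.array([5, 12, 7, 3, 9, 11])   # one source sequence

# The model is trained to predict the reversed sequence...
tgt_out = src[::-1]                     # [11, 9, 3, 7, 12, 5]

# ...from a decoder input that is the same sequence shifted right by one,
# with the <sos> token (id vocab_size - 2 = 18) in the first slot.
tgt_in = np.zeros_like(tgt_out)
tgt_in[0] = vocab_size - 2
tgt_in[1:] = tgt_out[:-1]               # [18, 11, 9, 3, 7, 12]

print(tgt_in, tgt_out)
# At step t the decoder sees tgt_in[:t+1] and is trained to predict tgt_out[t].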
import sys
import numpy as np

sys.path.append('../')
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Transformer
from from_scratch.optim import Adam
from from_scratch.functional import cross_entropy

# Masking Helper Functions
def create_padding_mask(seq: Tensor, pad_token_id: int = 0) -> Tensor:
    # seq shape: (batch_size, seq_len)
    # Returns mask shape: (batch_size, 1, 1, seq_len)
    mask = (seq.data == pad_token_id).astype(np.float32)
    return Tensor((mask * -1e9).reshape(seq.shape[0], 1, 1, seq.shape[1]))

def create_causal_mask(size: int) -> Tensor:
    # Returns a (size, size) mask that broadcasts over batch and head dimensions
    mask = np.triu(np.ones((size, size)), k=1).astype(np.float32)
    return Tensor(mask * -1e9)

# 1. Define Hyperparameters
vocab_size = 20
hidden_size = 64
num_layers = 2
num_heads = 4
ff_size = 128
max_len = 30
batch_size = 16
seq_len = 10
pad_token_id = 0

# 2. Instantiate Model and Optimizer
model = Transformer(vocab_size, hidden_size, num_layers, num_heads, ff_size, max_len)
optimizer = Adam(model.parameters(), lr=1e-3)

# 3. The Training Loop
epochs = 3001  # Train for more epochs to see convergence
print("--- Training Start ---")
for epoch in range(epochs):
    # Create a single batch of dummy data for each epoch
    src_data = np.random.randint(1, vocab_size, (batch_size, seq_len))
    tgt_data_out = np.flip(src_data, axis=1)
    tgt_data_in = np.zeros_like(tgt_data_out)
    tgt_data_in[:, 1:] = tgt_data_out[:, :-1]
    tgt_data_in[:, 0] = vocab_size - 2  # <sos> token

    src = Tensor(src_data)
    tgt = Tensor(tgt_data_in)

    # Create Masks
    src_padding_mask = create_padding_mask(src, pad_token_id)
    causal_mask = create_causal_mask(seq_len)
    tgt_padding_mask = create_padding_mask(tgt, pad_token_id)
    tgt_mask = tgt_padding_mask + causal_mask

    # Training Step
    optimizer.zero_grad()
    logits = model(src, tgt, src_padding_mask, tgt_mask)

    # Reshape for cross_entropy, which expects (N, C) and (N,)
    loss = cross_entropy(logits.reshape(-1, vocab_size), Tensor(tgt_data_out.flatten()))
    loss.backward()
    optimizer.step()

    if epoch % 200 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch}, Loss: {loss.data.item():.4f}")

print("\nTraining complete!")
The training loop finished, and the loss went down to nearly zero. This is a great sign! It proves that our entire from-scratch library—from the Tensor object to the Adam optimizer to the complex Transformer architecture—is working correctly.
But the ultimate test is to see if the model can generalize. Can it reverse a sequence it has never seen before?
To find out, we will perform auto-regressive decoding:
1. We feed the encoder our new, unseen input sentence. It processes it just once.
2. We start the decoder with a single <sos> token.
3. We loop: at each step, we feed the sequence generated so far back into the decoder to predict the very next token.
4. We continue this process until the model outputs an <eos> token.
def translate_sequence(model, src_sequence, max_len=15, sos_token_id=18, eos_token_id=19, pad_token_id=0):
    """
    Performs auto-regressive decoding to generate an output sequence.
    """
    # Encoder processes the entire source sequence once.
    # We create a dummy padding mask as our inference input has no padding.
    src_mask = create_padding_mask(src_sequence, pad_token_id=-1)  # -1 will never be found
    encoder_output = model.encoder(model.pos_encoding(model.token_embedding(src_sequence)), mask=src_mask)

    # Start the decoder input with the <sos> token.
    tgt_so_far = Tensor(np.array([[sos_token_id]]))

    for _ in range(max_len):
        # Create a causal mask for the sequence generated so far.
        tgt_mask = create_causal_mask(tgt_so_far.shape[1])

        # Decoder forward pass
        tgt_emb = model.pos_encoding(model.token_embedding(tgt_so_far))
        decoder_output = model.decoder(tgt_emb, encoder_output, src_mask=src_mask, tgt_mask=tgt_mask)

        # Get logits for the very last token in the sequence
        logits = model.final_linear(decoder_output[:, -1, :])

        # Get the predicted next token ID (greedy decoding)
        next_token_id = np.argmax(logits.data, axis=-1).flatten()[0]

        # Append the new token to our sequence
        next_token_tensor = Tensor(np.array([[next_token_id]]))
        tgt_so_far = Tensor.cat([tgt_so_far, next_token_tensor], axis=1)

        # Stop if the model predicts the <eos> token
        if next_token_id == eos_token_id:
            break

    return tgt_so_far.data.flatten()

# Create a new, unseen test sequence
test_sequence = Tensor(np.array([[5, 12, 7, 3, 9, 11]]))  # Batch size of 1
expected_output = [11, 9, 3, 7, 12, 5]

# Generate the translation
model_output = translate_sequence(
    model, test_sequence, sos_token_id=vocab_size - 2, eos_token_id=vocab_size - 1
)

print("\n--- INFERENCE ---")
print(f"Input Sequence: {test_sequence.data.flatten()}")
print(f"Expected Reversed: {expected_output}")
# We slice the model output to remove the starting <sos> and ending <eos> tokens
print(f"Model Output (Reversed): {model_output[1:-1] if len(model_output) > 1 else 'None'}")
This is a phenomenal result. Our model perfectly learned the reversal task! The core of the generated sequence is the correct reversal of the input.
So, what about the extra tokens? This is not a bug. The repetitive tokens at the end show a model that has mastered the primary task (reversal) but hasn’t perfectly learned the secondary task (knowing when to stop by generating an <eos> token). This is a common challenge in generative modeling and highlights the difference between learning a core algorithm and learning stopping criteria.
The End of the Beginning
We have:
- Built an autograd engine to understand backpropagation.
- Implemented fundamental neural network layers (Linear, Embedding, LayerNorm).
- Constructed RNNs and LSTMs to handle sequences.
- Understood and implemented the Attention Mechanism.
- Assembled and trained the full Transformer architecture.