We’ll build a recurrent neural network from scratch, implementing the forward and backward passes, and training it on a sequence prediction task.
bare-bones-ml
code
Author
Devansh Lodha
Published
May 25, 2025
The feed-forward networks we’ve built so far are powerful, but they are stateless. They process each input independently, with no memory of what came before. They cannot understand the order in a sentence or the trend in a time series.
Enter Recurrent Neural Networks (RNNs).
An RNN is a type of neural network designed to handle sequential data. It does this by introducing a loop. At each step in a sequence, the network processes an input and combines it with information from the previous step. This “memory” of the past is stored in a vector called the hidden state.
At each time step \(t\), the hidden state is updated as:

\[
h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
\]

Where:

- \(h_t\) is the new hidden state (the output for the current step).
- \(x_t\) is the input vector at the current time step.
- \(h_{t-1}\) is the hidden state from the previous step.
- \(W_{xh}\), \(W_{hh}\), and \(b_h\) are the learnable weight and bias parameters that are shared across all time steps. This weight sharing is the key that allows the RNN to generalize across sequences of different lengths.
In this post, we will:

1. Implement a simple RecurrentBlock and the tanh activation function from scratch.
2. Build a character-level RNN model.
3. Train it to generate new, unique names based on a list of thousands of examples.
All code for this post can be found in the from_scratch/ directory of the bare-bones-ml repository.
New Library Components
To build our RNN, we only need to add two new pieces to our library.
1. The tanh Activation Function
The hyperbolic tangent (tanh) function is the traditional non-linearity used in simple RNNs. It squashes its input to a range between -1 and 1. Its derivative is simple: \(\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)\), which is convenient for backpropagation because the local gradient can be computed directly from the cached forward output.
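As a rough illustration of the forward/backward pair this requires (a NumPy sketch with made-up helper names, not the exact `from_scratch` implementation):

```python
import numpy as np

def tanh_forward(x: np.ndarray) -> np.ndarray:
    # Squash inputs into the range (-1, 1)
    return np.tanh(x)

def tanh_backward(upstream_grad: np.ndarray, tanh_output: np.ndarray) -> np.ndarray:
    # d/dx tanh(x) = 1 - tanh(x)^2, computed from the cached forward output
    return upstream_grad * (1.0 - tanh_output ** 2)

x = np.array([-2.0, 0.0, 2.0])
y = tanh_forward(x)
print(y)                                   # values squashed into (-1, 1)
print(tanh_backward(np.ones_like(x), y))   # local gradients at each input
```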
2. The RecurrentBlock

Our RecurrentBlock implements a single step of the RNN forward pass; the looping over the sequence happens explicitly in our training code. This makes the hand-off of the hidden state from one time step to the next visible, rather than hidden inside a framework’s implementation.
We also make a design choice to combine the weights W_xh and W_hh into a single, larger weight matrix w. We then concatenate the input x and the previous hidden state h_prev to perform a single, efficient matrix multiplication.
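In plain NumPy, one forward step with this combined-matrix trick looks roughly like the following. This is a sketch of the idea rather than the exact `from_scratch` code; the function name and shapes are illustrative.

```python
import numpy as np

def rnn_step(x, h_prev, w, b):
    """One recurrent step: h_t = tanh([x, h_prev] @ W + b).

    w has shape (input_size + hidden_size, hidden_size), i.e. W_xh and W_hh
    stacked into one matrix; x is (1, input_size), h_prev is (1, hidden_size).
    """
    combined = np.concatenate([x, h_prev], axis=1)  # (1, input_size + hidden_size)
    return np.tanh(combined @ w + b)                # (1, hidden_size)

# Tiny usage example with made-up sizes
input_size, hidden_size = 27, 4
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(input_size + hidden_size, hidden_size))
b = np.zeros((1, hidden_size))
x = np.zeros((1, input_size)); x[0, 0] = 1.0        # a one-hot input
h = np.zeros((1, hidden_size))                      # initial hidden state
h = rnn_step(x, h, w, b)
print(h.shape)  # (1, 4)
```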
We will train our RNN on a list of thousands of names. The model will learn by reading a name one character at a time and trying to predict the next character in the sequence. For the name “emma”, the training examples would be (the short snippet after this list makes the pairs concrete):

- Given a start token (.), predict “e”.
- Given “e”, predict “m”.
- Given “m”, predict “m”.
- Given “m”, predict “a”.
- Given “a”, predict the end token (.).
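A quick way to see these (input, target) pairs using nothing but the standard library (illustrative only; the real data pipeline comes in the next section):

```python
name = "emma"
sequence = "." + name + "."          # wrap the name in start/end tokens
pairs = list(zip(sequence[:-1], sequence[1:]))
print(pairs)  # [('.', 'e'), ('e', 'm'), ('m', 'm'), ('m', 'a'), ('a', '.')]
```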
By training on this task, the model learns the statistical patterns of names: which letters tend to follow others, common prefixes and suffixes, and typical name lengths.
Data Preparation
First, we need to load the data and create our character-level vocabulary. For this task, the “vocabulary” is simply the set of all unique characters present in the dataset, plus a special token (.) to signify the start and end of a name. We will represent each character as a one-hot vector.
Code
```python
import sys
sys.path.append('../')

import numpy as np
import random

from from_scratch.autograd.tensor import Tensor

# 1. The Dataset
# Download from: https://github.com/karpathy/makemore/blob/master/names.txt
# and place it in a `data/` directory at the root of your project.
try:
    with open('../data/names.txt', 'r') as f:
        names = [name.lower() for name in f.read().splitlines()]
except FileNotFoundError:
    print("Error: `data/names.txt` not found.")
    print("Please download it from https://github.com/karpathy/makemore/blob/master/names.txt")
    names = []

if names:
    print(f"Loaded {len(names)} names. First 5: {names[:5]}")

    # 2. Create the Vocabulary
    chars = sorted(list(set("." + "".join(names))))
    char_to_int = {ch: i for i, ch in enumerate(chars)}
    int_to_char = {i: ch for i, ch in enumerate(chars)}
    vocab_size = len(chars)
    print(f"\nVocabulary size: {vocab_size}")
    print(f"Characters: {''.join(chars)}")

    # 3. Helper to create a training example (input sequence and target sequence)
    def get_random_example():
        name = random.choice(names)
        full_sequence = "." + name + "."
        input_indices = [char_to_int[c] for c in full_sequence[:-1]]
        target_indices = [char_to_int[c] for c in full_sequence[1:]]
        # One-hot encode the input sequence
        input_one_hot = np.zeros((len(input_indices), vocab_size), dtype=np.float32)
        input_one_hot[np.arange(len(input_indices)), input_indices] = 1
        return Tensor(input_one_hot), Tensor(np.array(target_indices))

    # Test the helper
    test_input, test_target = get_random_example()
    print(f"\nExample Input Shape: {test_input.shape}")
    print(f"Example Target Shape: {test_target.shape}")
```
Loaded 32033 names. First 5: ['emma', 'olivia', 'ava', 'isabella', 'sophia']
Vocabulary size: 27
Characters: .abcdefghijklmnopqrstuvwxyz
Example Input Shape: (6, 27)
Example Target Shape: (6,)
The Model and Training Loop
Our full model will consist of our RecurrentBlock and a Linear output layer. At each time step, the RNN processes a character and updates its hidden state. This hidden state is then passed to the Linear layer to produce logits—raw scores for each character in our vocabulary. We use cross_entropy loss to compare these logits to the true next character.
The key to training an RNN is that the total loss for a sequence is the sum of the losses at each time step. We then call backward() on this final accumulated loss to compute gradients for the entire unrolled sequence. This process is called Backpropagation Through Time (BPTT).
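In symbols, if the sequence has \(T\) characters and \(\ell_t\) is the cross-entropy loss at step \(t\), the quantity we backpropagate is simply

\[
\mathcal{L} = \sum_{t=1}^{T} \ell_t,
\]

and because the same weights are reused at every step, each parameter receives gradient contributions from all \(T\) steps.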
Code
```python
from from_scratch.nn import RecurrentBlock, Linear
from from_scratch.optim import Adam
from from_scratch.functional import cross_entropy

if names:
    # Model Definition
    hidden_size = 128
    rnn_layer = RecurrentBlock(input_size=vocab_size, hidden_size=hidden_size)
    output_layer = Linear(input_size=hidden_size, output_size=vocab_size)

    # Group all parameters for the optimizer
    all_params = rnn_layer.parameters() + output_layer.parameters()
    optimizer = Adam(params=all_params, lr=0.005)

    # Training Loop
    epochs = 20000
    print_every = 1000

    print("\n--- Training Start ---")
    for epoch in range(epochs):
        input_tensor, target_tensor = get_random_example()
        optimizer.zero_grad()

        hidden = Tensor(np.zeros((1, hidden_size)))
        total_loss = Tensor(0.0)

        # Explicitly loop through the sequence (BPTT)
        for t in range(input_tensor.shape[0]):
            x_t = input_tensor[t:t+1, :]

            # RNN step: pass input and previous hidden state
            hidden = rnn_layer(x_t, hidden)

            # Output layer to get prediction for this step
            logits = output_layer(hidden)

            # Calculate loss for this step
            target_t = target_tensor[t:t+1]
            loss = cross_entropy(logits, target_t)

            # Accumulate the loss
            total_loss = total_loss + loss

        # Backward pass on the final accumulated loss for the whole sequence
        total_loss.backward()

        # Gradient clipping to prevent exploding gradients
        for p in all_params:
            if p.grad is not None:
                np.clip(p.grad, -5, 5, out=p.grad)

        optimizer.step()

        if epoch % print_every == 0 or epoch == epochs - 1:
            avg_loss = total_loss.data.item() / input_tensor.shape[0]
            print(f"Epoch {epoch}, Avg Loss: {avg_loss:.4f}")
```
The reward for training a generative model is seeing what it creates! To generate a name, we give the model a starting character and an initial hidden state of zeros. We then enter a loop:

1. Feed the current character and hidden state to the model to get logits for the next character.
2. Convert the logits to probabilities using softmax.
3. Sample a character from this probability distribution.
4. Feed this new character as the input for the next time step.
5. Repeat until the model generates the special “end” token (.).
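A minimal sketch of this sampling loop, reusing the trained `rnn_layer` and `output_layer` from the training cell. It assumes the output Tensor exposes its NumPy array via `.data` (as used above for the loss) and computes the softmax by hand; here we seed with the start token "." and let the model pick the first letter, but you could just as well seed it with a chosen starting character.

```python
def generate_name(max_length=20):
    # Start from the "." token with a zero hidden state
    hidden = Tensor(np.zeros((1, hidden_size)))
    current_idx = char_to_int["."]
    out_chars = []
    for _ in range(max_length):
        # One-hot encode the current character
        x = np.zeros((1, vocab_size), dtype=np.float32)
        x[0, current_idx] = 1.0
        # One RNN step, then project the hidden state to vocabulary logits
        hidden = rnn_layer(Tensor(x), hidden)
        logits = output_layer(hidden).data.flatten()
        # Softmax: logits -> probabilities, then sample the next character
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        current_idx = int(np.random.choice(vocab_size, p=probs))
        if int_to_char[current_idx] == ".":  # end token: stop
            break
        out_chars.append(int_to_char[current_idx])
    return "".join(out_chars)

if names:
    print([generate_name() for _ in range(10)])
```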
We have successfully built our first stateful model! It can process inputs of varying lengths and generate new sequences that capture the statistical patterns of the training data.
However, simple RNNs suffer from a major issue: the vanishing gradient problem. As sequences get longer, it becomes very difficult for gradients to flow back to the beginning of the sequence, making it hard for the model to learn long-range dependencies (e.g., how the first letter of a name influences the last).
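A tiny numerical illustration of the effect (purely illustrative numbers): if each backward step scales the gradient by a factor below 1, as saturated tanh derivatives and small recurrent weights tend to do, the signal reaching early time steps shrinks exponentially with sequence length.

```python
# Hypothetical per-step shrink factor applied once per time step of BPTT
per_step_factor = 0.5
for length in (5, 10, 20, 40):
    print(length, per_step_factor ** length)  # gradient scale reaching step 0
```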
In our next post, we will tackle this problem by implementing a more advanced and powerful recurrent architecture: the Long Short-Term Memory (LSTM) network.