We’ll be solving the vanishing gradient problem by building a Long Short-Term Memory (LSTM) network from scratch, implementing the forward and backward passes, and training it on a sequence prediction task.
bare-bones-ml
code
Author
Devansh Lodha
Published
May 26, 2025
In our last post, we built a Recurrent Neural Network (RNN) that successfully learned to generate names. However, we ended with a cliffhanger: the vanishing gradient problem. As an RNN processes long sequences, the gradient signal flowing backward in time can shrink exponentially, making it impossible for the model to learn dependencies between distant elements. A simple RNN might learn that “q” is often followed by “u,” but it would struggle to ensure a name that starts with “Alex” ends with “ander” instead of “andra.”
This is where the Long Short-Term Memory (LSTM) network comes in. Invented by Hochreiter & Schmidhuber in 1997, it was a groundbreaking architecture designed specifically to combat the vanishing gradient problem and remember information over long periods.
The Core Idea: A Separate Memory Lane
The genius of the LSTM is the introduction of the cell state, \(C_t\). You can think of this as a separate, protected memory “conveyor belt” that runs parallel to the main hidden state. The LSTM has the ability to precisely add information to or remove information from this cell state, regulated by a series of structures called gates.
An LSTM has three main gates that control this memory conveyor belt:

1. Forget Gate (\(f_t\)): This gate looks at the previous hidden state (\(h_{t-1}\)) and the current input (\(x_t\)) and decides what information from the previous cell state (\(C_{t-1}\)) is no longer relevant and should be discarded. It outputs a number between 0 (completely forget) and 1 (completely keep) for each number in the cell state vector. \[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

2. Input Gate (\(i_t\)): This gate decides which new information from the current input we're going to store in the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values, \(\tilde{C}_t\). \[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]

3. Updating the Cell State: Now, we update the old cell state \(C_{t-1}\) into the new cell state \(C_t\). We forget the old stuff by multiplying by \(f_t\), and then we add the new candidate values, scaled by how much we want to update them (\(i_t\)). \[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]

4. Output Gate (\(o_t\)): Finally, we decide what our next hidden state (\(h_t\)) will be. This is a filtered version of our cell state. We run a sigmoid gate to decide which parts of the cell state we're going to output, and then put the cell state through tanh and multiply it by the output of the sigmoid gate. \[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] \[h_t = o_t * \tanh(C_t)\]
This gated mechanism allows gradients to flow much more freely through time, solving the vanishing gradient problem.
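To see why, differentiate the cell-state update with respect to the previous cell state. Treating the gate activations as fixed for the moment (a simplification; the full gradient also flows through the gates and the hidden state), the backward signal along the memory lane is scaled elementwise by the forget gate rather than repeatedly multiplied by a recurrent weight matrix:

\[\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)\]

Whenever the forget gate stays close to 1, the error signal can travel backward along the cell state almost unchanged, something the repeated matrix-multiply-and-squash recurrence of a vanilla RNN cannot guarantee.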
Implementing the LSTMBlock
Let’s translate this directly into code in our from_scratch/nn.py file. Each gate will have its own weight matrix and bias vector. The forward method will implement the four equations above.
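The actual `LSTMBlock` lives in `from_scratch/nn.py` and is written against the library's `Tensor` class so that the autograd engine can handle the backward pass. As a shape-and-equation reference only, here is a minimal NumPy sketch of the same forward pass; the class name, `sigmoid` helper, and initialization below are illustrative, not the library's API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTMBlockSketch:
    """NumPy-only illustration of the LSTM forward pass (not the Tensor-based
    LSTMBlock from from_scratch/nn.py)."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        concat_size = input_size + hidden_size  # every gate sees [h_{t-1}, x_t]
        def weight():
            return (rng.standard_normal((concat_size, hidden_size)) * 0.01).astype(np.float32)
        self.W_f, self.b_f = weight(), np.zeros(hidden_size, dtype=np.float32)  # forget gate
        self.W_i, self.b_i = weight(), np.zeros(hidden_size, dtype=np.float32)  # input gate
        self.W_C, self.b_C = weight(), np.zeros(hidden_size, dtype=np.float32)  # candidate values
        self.W_o, self.b_o = weight(), np.zeros(hidden_size, dtype=np.float32)  # output gate

    def forward(self, x_t, h_prev, c_prev):
        z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
        f_t = sigmoid(z @ self.W_f + self.b_f)       # what to erase from C_{t-1}
        i_t = sigmoid(z @ self.W_i + self.b_i)       # what to write
        c_tilde = np.tanh(z @ self.W_C + self.b_C)   # candidate values
        c_t = f_t * c_prev + i_t * c_tilde           # update the conveyor belt
        o_t = sigmoid(z @ self.W_o + self.b_o)       # what to expose
        h_t = o_t * np.tanh(c_t)                     # new hidden state
        return h_t, c_t
```

Calling `forward` returns the pair `(h_t, c_t)`, and it is exactly this pair that the training loop has to thread through time.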
Now, we’ll run the exact same experiment as before, but we’ll swap our RecurrentBlock with our new LSTMBlock. The training loop is nearly identical, but now we must initialize and pass two states at each time step: the hidden state h and the cell state c.
```python
import sys
sys.path.append('../')
import numpy as np
import random
from from_scratch.autograd.tensor import Tensor

# 1. The Dataset
try:
    with open('../data/names.txt', 'r') as f:
        names = [name.lower() for name in f.read().splitlines()]
except FileNotFoundError:
    print("Error: `data/names.txt` not found.")
    names = []

if names:
    print(f"Loaded {len(names)} names. First 5: {names[:5]}")

# 2. Create the Vocabulary
chars = sorted(list(set("." + "".join(names))))
char_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
print(f"\nVocabulary size: {vocab_size}")
print(f"Characters: {''.join(chars)}")

# 3. Helper to create a training example
def get_random_example():
    name = random.choice(names)
    full_sequence = "." + name + "."
    input_indices = [char_to_int[c] for c in full_sequence[:-1]]
    target_indices = [char_to_int[c] for c in full_sequence[1:]]
    input_one_hot = np.zeros((len(input_indices), vocab_size), dtype=np.float32)
    input_one_hot[np.arange(len(input_indices)), input_indices] = 1
    return Tensor(input_one_hot), Tensor(np.array(target_indices))
```
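The rest of the training script follows the RNN post; the only structural change is the pair of states carried across time steps. Here is a schematic of that inner loop, written against the hypothetical `LSTMBlockSketch` above rather than the real Tensor-based code (`inputs` is assumed to be the raw `(sequence_length, vocab_size)` array of one-hot rows):

```python
def run_sequence(lstm, inputs, hidden_size):
    """Schematic inner loop: both the hidden state h and the cell state c
    are initialized once and then passed to every time step."""
    h = np.zeros(hidden_size, dtype=np.float32)  # hidden state
    c = np.zeros(hidden_size, dtype=np.float32)  # cell state (the memory "conveyor belt")
    hidden_states = []
    for t in range(inputs.shape[0]):
        h, c = lstm.forward(inputs[t], h, c)     # update and pass BOTH states
        hidden_states.append(h)
    return hidden_states
```

In the Tensor-based version, each \(h_t\) is then mapped to vocabulary logits and scored against the target character, just as in the RNN experiment this one mirrors.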
You will likely notice that the names generated by the LSTM feel even more structured and coherent than those from the simple RNN. This is a direct result of its superior ability to capture and maintain context over time, thanks to its gated cell state.
Even LSTMs, however, have limitations when dealing with extremely long sequences where context from the very distant past is important. The need to compress all past information into a fixed-size state vector remains a bottleneck.
In our next post, we will explore the revolutionary concept that moved beyond this sequential bottleneck altogether: the Attention Mechanism.