LSTMs - Giving Your Network a Long-Term Memory

We’ll be solving the vanishing gradient problem by building a Long Short-Term Memory (LSTM) network from scratch, implementing the forward and backward passes, and training it on a sequence prediction task.
Categories: bare-bones-ml, code
Author: Devansh Lodha
Published: May 26, 2025

In our last post, we built a Recurrent Neural Network (RNN) that successfully learned to generate names. However, we ended with a cliffhanger: the vanishing gradient problem. As an RNN processes long sequences, the gradient signal flowing backward in time can shrink exponentially, making it impossible for the model to learn dependencies between distant elements. A simple RNN might learn that “q” is often followed by “u,” but it would struggle to ensure a name that starts with “Alex” ends with “ander” instead of “andra.”

This is where the Long Short-Term Memory (LSTM) network comes in. Invented by Hochreiter & Schmidhuber in 1997, it was a groundbreaking architecture designed specifically to combat the vanishing gradient problem and remember information over long periods.

The Core Idea: A Separate Memory Lane

The genius of the LSTM is the introduction of the cell state, \(C_t\). You can think of this as a separate, protected memory “conveyor belt” that runs parallel to the main hidden state. The LSTM has the ability to precisely add information to or remove information from this cell state, regulated by a series of structures called gates.

Figure: LSTM cell diagram (credit: Colah's blog)

An LSTM has three main gates that control this memory conveyor belt, applied over four steps:

1. Forget Gate (\(f_t\)): This gate looks at the previous hidden state (\(h_{t-1}\)) and the current input (\(x_t\)) and decides what information from the previous cell state (\(C_{t-1}\)) is no longer relevant and should be discarded. It outputs a number between 0 (completely forget) and 1 (completely keep) for each entry of the cell state vector. \[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

2. Input Gate (\(i_t\)): This gate decides which new information from the current input we’re going to store in the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values, \(\tilde{C}_t\). \[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]

3. Updating the Cell State: Now we update the old cell state \(C_{t-1}\) into the new cell state \(C_t\). We forget what is no longer needed by multiplying by \(f_t\), and then add the new candidate values, scaled by how much we want to update them (\(i_t\)). \[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]

4. Output Gate (\(o_t\)): Finally, we decide what the next hidden state (\(h_t\)) will be. This is a filtered version of the cell state: a sigmoid gate decides which parts of the cell state to expose, and the cell state is passed through tanh and multiplied by that gate’s output. \[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] \[h_t = o_t * \tanh(C_t)\]
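Before wiring these steps into our autograd Tensor class, it can help to see them once as plain, stand-alone NumPy for a single time step. The sketch below is purely illustrative: the sizes and names (n_in, n_h, and so on) are made up for this example and are not part of our library.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_h = 27, 8                                  # illustrative input and hidden sizes

x_t    = rng.standard_normal((1, n_in))            # current input x_t
h_prev = np.zeros((1, n_h))                        # previous hidden state h_{t-1}
c_prev = np.zeros((1, n_h))                        # previous cell state C_{t-1}
z = np.concatenate([h_prev, x_t], axis=1)          # concatenation [h_{t-1}, x_t]

# One weight matrix and bias per gate, shaped (n_h + n_in, n_h) and (n_h,)
W_f, W_i, W_c, W_o = (rng.standard_normal((n_h + n_in, n_h)) * 0.01 for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(n_h)

f_t     = sigmoid(z @ W_f + b_f)                   # forget gate
i_t     = sigmoid(z @ W_i + b_i)                   # input gate
c_tilde = np.tanh(z @ W_c + b_c)                   # candidate cell state
c_t     = f_t * c_prev + i_t * c_tilde             # new cell state C_t
o_t     = sigmoid(z @ W_o + b_o)                   # output gate
h_t     = o_t * np.tanh(c_t)                       # new hidden state h_t

print(h_t.shape, c_t.shape)                        # (1, 8) (1, 8)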

This gated mechanism lets gradients flow much more freely through time. Because the cell state is updated additively (an element-wise scale and add) rather than being squashed through a fresh matrix multiplication at every step, the error signal along the cell state does not shrink exponentially, which largely defuses the vanishing gradient problem.
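To see why more concretely, compare the Jacobians along the two recurrent paths (writing states as column vectors, using \(W_{hh}\) for a vanilla RNN's hidden-to-hidden matrix, and ignoring the gates' own dependence on the previous state in the LSTM expression):

\[\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\left(1 - h_t^2\right) W_{hh} \qquad \text{vs.} \qquad \frac{\partial C_t}{\partial C_{t-1}} \approx \mathrm{diag}(f_t)\]

Backpropagating through \(k\) steps multiplies \(k\) such factors together. In the vanilla RNN, the repeated \(W_{hh}\) products tend to shrink (or blow up), while along the LSTM's cell state the product is just a chain of forget-gate activations, which stays close to 1 for as long as the network decides to keep its memory.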

Implementing the LSTMBlock

Let’s translate this directly into code in our from_scratch/nn.py file. Each gate gets its own weight matrix and bias vector, and the forward method implements the four steps above, equation by equation.

# from_scratch/nn.py (excerpt — assumes numpy as np, Tensor, Module, sigmoid, tanh,
# and typing's List/Tuple are already imported or defined earlier in this file)

class LSTMBlock(Module):
    """A Long Short-Term Memory (LSTM) block."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        concat_size = input_size + hidden_size
        
        # Forget gate weights
        self.W_f = Tensor(np.random.randn(concat_size, hidden_size) * 0.01, requires_grad=True)
        self.b_f = Tensor(np.zeros(hidden_size), requires_grad=True)
        
        # Input gate weights
        self.W_i = Tensor(np.random.randn(concat_size, hidden_size) * 0.01, requires_grad=True)
        self.b_i = Tensor(np.zeros(hidden_size), requires_grad=True)
        
        # Candidate cell state weights
        self.W_c = Tensor(np.random.randn(concat_size, hidden_size) * 0.01, requires_grad=True)
        self.b_c = Tensor(np.zeros(hidden_size), requires_grad=True)
        
        # Output gate weights
        self.W_o = Tensor(np.random.randn(concat_size, hidden_size) * 0.01, requires_grad=True)
        self.b_o = Tensor(np.zeros(hidden_size), requires_grad=True)

    def forward(self, x: Tensor, h_prev: Tensor, c_prev: Tensor) -> Tuple[Tensor, Tensor]:
        """Performs one step of the LSTM computation."""
        combined = Tensor.cat([x, h_prev], axis=1)
        
        # f_t = forget_gate
        f_t = sigmoid((combined @ self.W_f) + self.b_f)
        
        # i_t = input_gate
        i_t = sigmoid((combined @ self.W_i) + self.b_i)
        
        # c_candidate = candidate cell state
        c_candidate = tanh((combined @ self.W_c) + self.b_c)
        
        # c_next = new cell state
        c_next = f_t * c_prev + i_t * c_candidate
        
        # o_t = output_gate
        o_t = sigmoid((combined @ self.W_o) + self.b_o)

        # h_next = new hidden state
        h_next = o_t * tanh(c_next)
        
        return h_next, c_next

    def parameters(self) -> List[Tensor]:
        return [self.W_f, self.b_f, self.W_i, self.b_i, self.W_c, self.b_c, self.W_o, self.b_o]
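Before plugging the block into a full training loop, a quick shape check is a cheap way to catch wiring mistakes. This snippet only relies on the from_scratch API we have already seen in this series (a Tensor wrapping a NumPy array, and calling a module invoking its forward); the sizes are arbitrary.

import numpy as np
from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import LSTMBlock

lstm = LSTMBlock(input_size=27, hidden_size=16)
x  = Tensor(np.random.randn(1, 27))   # a single input vector (batch of 1)
h0 = Tensor(np.zeros((1, 16)))        # initial hidden state
c0 = Tensor(np.zeros((1, 16)))        # initial cell state

h1, c1 = lstm(x, h0, c0)              # one LSTM step
print(h1.shape, c1.shape)             # expect: (1, 16) (1, 16)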

Training the LSTM for Name Generation

Now, we’ll run the exact same experiment as before, but we’ll swap our RecurrentBlock with our new LSTMBlock. The training loop is nearly identical, but now we must initialize and pass two states at each time step: the hidden state h and the cell state c.

import sys
sys.path.append('../')

import numpy as np
import random
from from_scratch.autograd.tensor import Tensor

# 1. The Dataset
try:
    with open('../data/names.txt', 'r') as f:
        names = [name.lower() for name in f.read().splitlines()]
except FileNotFoundError:
    print("Error: `data/names.txt` not found.")
    names = []

if names:
    print(f"Loaded {len(names)} names. First 5: {names[:5]}")

    # 2. Create the Vocabulary
    chars = sorted(list(set("." + "".join(names))))
    char_to_int = {ch: i for i, ch in enumerate(chars)}
    int_to_char = {i: ch for i, ch in enumerate(chars)}
    vocab_size = len(chars)

    print(f"\nVocabulary size: {vocab_size}")
    print(f"Characters: {''.join(chars)}")

    # 3. Helper to create a training example
    def get_random_example():
        name = random.choice(names)
        full_sequence = "." + name + "."
        input_indices = [char_to_int[c] for c in full_sequence[:-1]]
        target_indices = [char_to_int[c] for c in full_sequence[1:]]
        input_one_hot = np.zeros((len(input_indices), vocab_size), dtype=np.float32)
        input_one_hot[np.arange(len(input_indices)), input_indices] = 1
        return Tensor(input_one_hot), Tensor(np.array(target_indices))
Loaded 32033 names. First 5: ['emma', 'olivia', 'ava', 'isabella', 'sophia']

Vocabulary size: 27
Characters: .abcdefghijklmnopqrstuvwxyz
from from_scratch.nn import LSTMBlock, Linear
from from_scratch.optim import Adam
from from_scratch.functional import cross_entropy

# Model Definition
hidden_size = 128
lstm_layer = LSTMBlock(input_size=vocab_size, hidden_size=hidden_size)
output_layer = Linear(input_size=hidden_size, output_size=vocab_size)

# Group all parameters for the optimizer
all_params = lstm_layer.parameters() + output_layer.parameters()
optimizer = Adam(params=all_params, lr=0.005)

# Training Loop
epochs = 20000
print_every = 1000

print("\n--- Training Start ---")
for epoch in range(epochs):
    input_tensor, target_tensor = get_random_example()
    optimizer.zero_grad()
    
    # Initialize BOTH hidden and cell states
    hidden = Tensor(np.zeros((1, hidden_size)))
    cell = Tensor(np.zeros((1, hidden_size)))
    
    total_loss = Tensor(0.0)
    
    for t in range(input_tensor.shape[0]):
        x_t = input_tensor[t:t+1, :]
        
        # The forward pass now returns two states
        hidden, cell = lstm_layer(x_t, hidden, cell)
        
        logits = output_layer(hidden)
        target_t = target_tensor[t:t+1]
        loss = cross_entropy(logits, target_t)
        total_loss = total_loss + loss

    total_loss.backward()
    
    for p in all_params:
        if p.grad is not None:
            np.clip(p.grad, -5, 5, out=p.grad)
            
    optimizer.step()
    
    if epoch % print_every == 0 or epoch == epochs - 1:
        avg_loss = total_loss.data.item() / input_tensor.shape[0]
        print(f"Epoch {epoch}, Avg Loss: {avg_loss:.4f}")

--- Training Start ---
Epoch 0, Avg Loss: 3.2939
Epoch 1000, Avg Loss: 2.0206
Epoch 2000, Avg Loss: 2.2393
Epoch 3000, Avg Loss: 1.9994
Epoch 4000, Avg Loss: 1.9533
Epoch 5000, Avg Loss: 2.8938
Epoch 6000, Avg Loss: 1.8327
Epoch 7000, Avg Loss: 1.9813
Epoch 8000, Avg Loss: 1.6036
Epoch 9000, Avg Loss: 2.7477
Epoch 10000, Avg Loss: 2.3096
Epoch 11000, Avg Loss: 1.8515
Epoch 12000, Avg Loss: 1.8646
Epoch 13000, Avg Loss: 1.9118
Epoch 14000, Avg Loss: 1.5715
Epoch 15000, Avg Loss: 2.5192
Epoch 16000, Avg Loss: 2.5600
Epoch 17000, Avg Loss: 2.5066
Epoch 18000, Avg Loss: 2.3326
Epoch 19000, Avg Loss: 1.9452
Epoch 19999, Avg Loss: 2.2930

Generating Names with the LSTM

The inference loop is also updated to manage both the hidden and cell states, passing them from one step to the next.

from from_scratch.functional import softmax

def generate_name(start_letter='a', max_len=20):
    def char_to_tensor(char):
        tensor = np.zeros((1, vocab_size))
        tensor[0, char_to_int[char]] = 1
        return Tensor(tensor)

    # Initialize both hidden and cell state
    hidden = Tensor(np.zeros((1, hidden_size)))
    cell = Tensor(np.zeros((1, hidden_size)))
    
    current_input = char_to_tensor(start_letter)
    name = start_letter
    
    for _ in range(max_len):
        # Pass and update both states
        hidden, cell = lstm_layer(current_input, hidden, cell)
        logits = output_layer(hidden)
        
        probs = softmax(logits)
        next_char_idx = np.random.choice(len(chars), p=probs.data.flatten())
        
        if int_to_char[next_char_idx] == '.':
            break
            
        next_char = int_to_char[next_char_idx]
        name += next_char
        current_input = char_to_tensor(next_char)
            
    return name

print("\n--- Generated Names with LSTM ---")
for char in "abcde":
    print(generate_name(char))

--- Generated Names with LSTM ---
adiha
brean
calane
derce
ezorn

Conclusion

You will likely notice that the names generated by the LSTM feel even more structured and coherent than those from the simple RNN. This is a direct result of its superior ability to capture and maintain context over time, thanks to its gated cell state.

Even LSTMs, however, have limitations when dealing with extremely long sequences where context from the very distant past is important. The need to compress all past information into a fixed-size state vector remains a bottleneck.

In our next post, we will explore the revolutionary concept that moved beyond this sequential bottleneck altogether: the Attention Mechanism.