Building an Autograd Engine from Scratch

“What I cannot create, I do not understand.” — Richard Feynman

Welcome to the first post in the BareBonesML series! This quote embodies the spirit of this project. To master powerful libraries like PyTorch, we must understand whats happening under the hood.

The “Why”: Gradient Descent

At its core, training a neural network is an optimization problem. We define a loss function that measures how “wrong” our model’s predictions are. Our goal is to tweak the model’s internal parameters (its weights and biases) to make this loss as small as possible. The most common way to do this is with an algorithm called Gradient Descent.

Imagine your loss function is a giant, hilly landscape, and your goal is to find the lowest valley. Gradient descent tells you which direction is “downhill” from your current position. This “downhill” direction is given by the gradient—a vector of partial derivatives of the loss with respect to each of the model’s parameters. The update rule is simple:

\[ \text{parameter}_{\text{new}} \leftarrow \text{parameter}_{\text{old}} - \text{learning\_rate} \cdot \frac{\partial \text{Loss}}{\partial \text{parameter}_{\text{old}}} \]

The challenge is, how do we calculate \(\frac{\partial \text{Loss}}{\partial \text{parameter}}\) for potentially millions of parameters in a complex model? The answer is to have the machine do it for us, automatically. That’s what our autograd engine will do.

In this post, we will: 1. Explore the concept of a computational graph. 2. Design and implement our core Tensor and Function classes. 3. Walk through the code for fundamental operations like addition, multiplication, and power. 4. Unpack the logic of backpropagation and the chain rule. 5. Build a working engine that can compute gradients for any arbitrary combination of functions.

Design Decision: Separating Data from Operations

Our first major design decision is to separate the data containers from the mathematical operations. This is a powerful and clean design used by major frameworks.

Tensor: A Tensor will be our data container. It will hold a NumPy array of the data itself, but more importantly, it will know its own history—which operation (Function) created it.
Function: A Function will represent a single mathematical operation (e.g., addition). It will be stateless and only know how to compute its output (forward) and its local derivatives (backward).

Let’s look at the actual code from our library in from_scratch/autograd/tensor.py.

The `Function` Base Class

This is the abstract parent for all our operations. Its most important part is the apply classmethod. It takes Tensor objects as input, creates an instance of the specific Function subclass (like Add or Mul), performs the forward calculation on the raw data, and returns a new Tensor whose _creator attribute points back to the function instance. This is how the graph gets built, one link at a time.

# from_scratch/autograd/tensor.py

class Function:
    """
    Base class for differentiable operations. It links Tensors in the computational graph.
    """
    def __init__(self, *tensors: "Tensor"):
        self.parents = tensors
        self.saved_tensors: List[np.ndarray] = []

    def save_for_backward(self, *tensors: np.ndarray):
        """Saves given tensors for use in the backward pass."""
        self.saved_tensors.extend(tensors)

    def forward(self, *args: Any, **kwargs: Any) -> np.ndarray:
        """Performs the forward computation. Must be implemented by subclasses."""
        raise NotImplementedError

    def backward(self, grad: np.ndarray) -> Union[None, np.ndarray, Tuple[Optional[np.ndarray], ...]]:
        """Computes the gradient of the loss with respect to the inputs. Must be implemented by subclasses."""
        raise NotImplementedError

    @classmethod
    def apply(cls, *args: "Tensor", **kwargs: Any) -> "Tensor":
        """Applies the function, creating a new tensor that tracks the operation."""
        ctx = cls(*args)
        output_data = ctx.forward(*(t.data for t in args), **kwargs)
        return Tensor(output_data, requires_grad=any(t.requires_grad for t in args), _creator=ctx)

The `Tensor` Class

Here is the core of our Tensor class. Notice how simple the __init__ method is. It just stores the data and the (optional) creator function.

# from_scratch/autograd/tensor.py

class Tensor:
    """The core data structure for the autograd engine."""
    def __init__(self, data: Union[np.ndarray, float, int, list], requires_grad: bool = False, _creator: Optional[Function] = None):
        if not isinstance(data, np.ndarray):
            data = np.array(data, dtype=np.float32)
        self.data = data
        self.requires_grad = requires_grad
        self._creator = _creator
        self.grad: Optional[np.ndarray] = None
    
    # ... more methods to come ...

A Quick Detour: What is `raise NotImplementedError`?

You’ll notice that the forward and backward methods in our base Function class don’t actually do anything except raise NotImplementedError. This looks strange, but it’s a very important and deliberate design choice.

This technique turns our Function class into an abstract base class—a blueprint or a contract. By doing this, we are saying:

“Any new, concrete operation that inherits from Function must provide its own implementation of forward and backward. If a developer forgets to do this, their code will crash immediately with an error, telling them exactly what they need to fix.”

Implementing Core Operations

Let’s implement a few fundamental operations to see the pattern. Each one is a subclass of Function.

Addition (`Add`)

Forward Pass: The forward pass is simple element-wise addition: \(f(x, y) = x + y\).
Backward Pass: The partial derivatives are both 1 (\(\frac{\partial f}{\partial x} = 1\), \(\frac{\partial f}{\partial y} = 1\)). By the chain rule, this means the Add operation just passes the incoming gradient through to both of its parents.

# from_scratch/autograd/tensor.py

class Add(Function):
    def forward(self, x: np.ndarray, y: np.ndarray) -> np.ndarray:
        return x + y
    def backward(self, grad: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        return grad, grad

Multiplication (`Mul`)

Forward Pass: Element-wise multiplication: \(f(x, y) = x \cdot y\).
Backward Pass: Using the product rule, the partial derivatives are \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\). To do this, we must first save the original inputs x and y during the forward pass.

# from_scratch/autograd/tensor.py

class Mul(Function):
    def forward(self, x: np.ndarray, y: np.ndarray) -> np.ndarray:
        self.save_for_backward(x, y)
        return x * y
    def backward(self, grad: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
        x, y = self.saved_tensors
        return grad * y, grad * x

Power (`Pow`)

Forward Pass: Element-wise power: \(f(x) = x^{\text{power}}\).
Backward Pass: Using the power rule, the derivative is \(\frac{\partial f}{\partial x} = \text{power} \cdot x^{\text{power}-1}\).

# from_scratch/autograd/tensor.py

class Pow(Function):
    def forward(self, x: np.ndarray, *, power: float) -> np.ndarray:
        self.save_for_backward(x)
        self.power = power
        return x ** power
    def backward(self, grad: np.ndarray) -> np.ndarray:
        x, = self.saved_tensors
        return grad * (self.power * (x ** (self.power - 1)))

To make these easy to use, we overload the operators +, *, and ** on the Tensor class.

# In the Tensor class in from_scratch/autograd/tensor.py

    def __add__(self, other):
        if not isinstance(other, Tensor): other = Tensor(other)
        return Add.apply(self, other)

    def __mul__(self, other):
        if not isinstance(other, Tensor): other = Tensor(other)
        return Mul.apply(self, other)

    def __pow__(self, power):
        if not isinstance(power,(int,float)): raise TypeError("Power must be scalar")
        return Pow.apply(self, power=power)

Let’s test this by building a small computational graph.

Code

# --- A small demonstration ---
import sys
sys.path.append('../')
from from_scratch.autograd.tensor import Tensor

a = Tensor(3.0, requires_grad=True)
b = Tensor(4.0, requires_grad=True)

# c = a * b
# d = c + b
c = a * b
d = c + b

print(f"Tensor c: {c}")
print(f"c was created by: {c._creator.__class__.__name__}")
print(f"Tensor d: {d}")
print(f"d was created by: {d._creator.__class__.__name__}")

Tensor c: Tensor(data=12.0, requires_grad=True)
c was created by: Mul
Tensor d: Tensor(data=16.0, requires_grad=True)
d was created by: Add

Backpropagation: The Algorithm

Now for the main event. The backward() method automates the chain rule. It works by first creating a topologically sorted list of all the nodes that lead to the final output. This ensures that when we process a node, we have already computed the gradients for all the nodes that depend on it.

Here is the full implementation of the backward method in our Tensor class. Read the comments carefully to understand the flow.

# from_scratch/autograd/tensor.py

def backward(self, grad: Optional[np.ndarray] = None):
    # 1. Ensure the tensor requires a gradient.
    if not self.requires_grad:
        return
    
    # 2. Bootstrap the gradient. For the final loss, this is 1.0.
    if grad is None:
        if self.data.size != 1:
            raise ValueError("Gradient must be specified for non-scalar Tensors.")
        grad = np.ones_like(self.data, dtype=np.float32)
    
    self.grad = grad
    
    # 3. Build a topologically sorted list of all nodes in the graph.
    visited, topo_nodes = set(), []
    def build_topo(node):
        if id(node) not in visited:
            visited.add(id(node))
            if node._creator:
                for p in node._creator.parents:
                    build_topo(p)
                topo_nodes.append(node)
    build_topo(self)

    # 4. Propagate gradients backward through the graph.
    for node in reversed(topo_nodes):
        if not node._creator or node.grad is None:
            continue

        # Get the gradients for this node's parents by calling its creator's backward method.
        parent_grads = node._creator.backward(node.grad)
        
        if not isinstance(parent_grads, tuple):
            parent_grads = (parent_grads,)

        # 5. Distribute and accumulate gradients to the parents.
        for p, p_g in zip(node._creator.parents, parent_grads):
            if p_g is not None and p.requires_grad:
                # Handle cases where broadcasting occurred during the forward pass.
                if p_g.shape != p.shape:
                    # (Implementation detail for broadcasting omitted for clarity)
                    pass 
                
                # Accumulate gradient.
                if p.grad is None:
                    p.grad = p_g.copy()
                else:
                    p.grad += p_g

Final Demonstration

Let’s test our complete engine on a more complex expression: \(L = (a \cdot b) + a^2\). We need to calculate \(\frac{\partial L}{\partial a}\) and \(\frac{\partial L}{\partial b}\).

Using calculus: - \(\frac{\partial L}{\partial a} = \frac{\partial (ab)}{\partial a} + \frac{\partial (a^2)}{\partial a} = b + 2a\) - \(\frac{\partial L}{\partial b} = \frac{\partial (ab)}{\partial b} + \frac{\partial (a^2)}{\partial b} = a + 0 = a\)

If we set \(a=3\) and \(b=4\): - \(\frac{\partial L}{\partial a} = 4 + 2 \cdot 3 = 10\) - \(\frac{\partial L}{\partial b} = 3\)

Let’s see if our engine gets the same result.

Code

a = Tensor(3.0, requires_grad=True)
b = Tensor(4.0, requires_grad=True)

# L = (a * b) + a**2
term1 = a * b
term2 = a ** 2.0
L = term1 + term2

# Kick off backpropagation from the final node
L.backward()

print(f"The final value L is: {L.data.item():.2f}")
print("--- Gradients ---")
print(f"dL/da: {a.grad.item():.2f} (Expected: 10.0)")
print(f"dL/db: {b.grad.item():.2f} (Expected: 3.0)")

The final value L is: 21.00
--- Gradients ---
dL/da: 10.00 (Expected: 10.0)
dL/db: 3.00 (Expected: 3.0)

It works perfectly! Our engine correctly calculated the gradients, and critically, it correctly accumulated the gradients for a from both the Mul and Pow branches of the computational graph.

Conclusion

We have successfully built the core of a deep learning library. By separating Tensor objects (which hold state) from Function objects (which define operations), we created a system that can dynamically build a computational graph and automatically compute the gradients of any output with respect to its inputs.

This autograd engine is the fundamental tool that enables model training. In the next post, we will use this engine to build our first neural network layers and train a model to solve a real problem.

The “Why”: Gradient Descent

Design Decision: Separating Data from Operations

The Function Base Class

The Tensor Class

A Quick Detour: What is raise NotImplementedError?