In this post we will implement an automatic differentiation engine!
bare-bones-ml
code
Author
Devansh Lodha
Published
May 23, 2025
“What I cannot create, I do not understand.” — Richard Feynman
Welcome to the first post in the BareBonesML series! This quote embodies the spirit of this project. To master powerful libraries like PyTorch, we must understand whats happening under the hood.
The “Why”: Gradient Descent
At its core, training a neural network is an optimization problem. We define a loss function that measures how “wrong” our model’s predictions are. Our goal is to tweak the model’s internal parameters (its weights and biases) to make this loss as small as possible. The most common way to do this is with an algorithm called Gradient Descent.
Imagine your loss function is a giant, hilly landscape, and your goal is to find the lowest valley. Gradient descent tells you which direction is “downhill” from your current position. This “downhill” direction is given by the gradient—a vector of partial derivatives of the loss with respect to each of the model’s parameters. The update rule is simple:
The challenge is, how do we calculate \(\frac{\partial \text{Loss}}{\partial \text{parameter}}\) for potentially millions of parameters in a complex model? The answer is to have the machine do it for us, automatically. That’s what our autograd engine will do.
In this post, we will: 1. Explore the concept of a computational graph. 2. Design and implement our core Tensor and Function classes. 3. Walk through the code for fundamental operations like addition, multiplication, and power. 4. Unpack the logic of backpropagation and the chain rule. 5. Build a working engine that can compute gradients for any arbitrary combination of functions.
Design Decision: Separating Data from Operations
Our first major design decision is to separate the data containers from the mathematical operations. This is a powerful and clean design used by major frameworks.
Tensor: A Tensor will be our data container. It will hold a NumPy array of the data itself, but more importantly, it will know its own history—which operation (Function) created it.
Function: A Function will represent a single mathematical operation (e.g., addition). It will be stateless and only know how to compute its output (forward) and its local derivatives (backward).
Let’s look at the actual code from our library in from_scratch/autograd/tensor.py.
The Function Base Class
This is the abstract parent for all our operations. Its most important part is the apply classmethod. It takes Tensor objects as input, creates an instance of the specific Function subclass (like Add or Mul), performs the forward calculation on the raw data, and returns a newTensor whose _creator attribute points back to the function instance. This is how the graph gets built, one link at a time.
# from_scratch/autograd/tensor.pyclass Function:""" Base class for differentiable operations. It links Tensors in the computational graph. """def__init__(self, *tensors: "Tensor"):self.parents = tensorsself.saved_tensors: List[np.ndarray] = []def save_for_backward(self, *tensors: np.ndarray):"""Saves given tensors for use in the backward pass."""self.saved_tensors.extend(tensors)def forward(self, *args: Any, **kwargs: Any) -> np.ndarray:"""Performs the forward computation. Must be implemented by subclasses."""raiseNotImplementedErrordef backward(self, grad: np.ndarray) -> Union[None, np.ndarray, Tuple[Optional[np.ndarray], ...]]:"""Computes the gradient of the loss with respect to the inputs. Must be implemented by subclasses."""raiseNotImplementedError@classmethoddefapply(cls, *args: "Tensor", **kwargs: Any) ->"Tensor":"""Applies the function, creating a new tensor that tracks the operation.""" ctx = cls(*args) output_data = ctx.forward(*(t.data for t in args), **kwargs)return Tensor(output_data, requires_grad=any(t.requires_grad for t in args), _creator=ctx)
The Tensor Class
Here is the core of our Tensor class. Notice how simple the __init__ method is. It just stores the data and the (optional) creator function.
# from_scratch/autograd/tensor.pyclass Tensor:"""The core data structure for the autograd engine."""def__init__(self, data: Union[np.ndarray, float, int, list], requires_grad: bool=False, _creator: Optional[Function] =None):ifnotisinstance(data, np.ndarray): data = np.array(data, dtype=np.float32)self.data = dataself.requires_grad = requires_gradself._creator = _creatorself.grad: Optional[np.ndarray] =None# ... more methods to come ...
A Quick Detour: What is raise NotImplementedError?
You’ll notice that the forward and backward methods in our base Function class don’t actually do anything except raise NotImplementedError. This looks strange, but it’s a very important and deliberate design choice.
This technique turns our Function class into an abstract base class—a blueprint or a contract. By doing this, we are saying:
“Any new, concrete operation that inherits from Functionmust provide its own implementation of forward and backward. If a developer forgets to do this, their code will crash immediately with an error, telling them exactly what they need to fix.”
Implementing Core Operations
Let’s implement a few fundamental operations to see the pattern. Each one is a subclass of Function.
Addition (Add)
Forward Pass: The forward pass is simple element-wise addition: \(f(x, y) = x + y\).
Backward Pass: The partial derivatives are both 1 (\(\frac{\partial f}{\partial x} = 1\), \(\frac{\partial f}{\partial y} = 1\)). By the chain rule, this means the Add operation just passes the incoming gradient through to both of its parents.
# from_scratch/autograd/tensor.pyclass Add(Function):def forward(self, x: np.ndarray, y: np.ndarray) -> np.ndarray:return x + ydef backward(self, grad: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:return grad, grad
Multiplication (Mul)
Forward Pass: Element-wise multiplication: \(f(x, y) = x \cdot y\).
Backward Pass: Using the product rule, the partial derivatives are \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\). To do this, we must first save the original inputs x and y during the forward pass.
# from_scratch/autograd/tensor.pyclass Mul(Function):def forward(self, x: np.ndarray, y: np.ndarray) -> np.ndarray:self.save_for_backward(x, y)return x * ydef backward(self, grad: np.ndarray) -> Tuple[np.ndarray, np.ndarray]: x, y =self.saved_tensorsreturn grad * y, grad * x
To make these easy to use, we overload the operators +, *, and ** on the Tensor class.
# In the Tensor class in from_scratch/autograd/tensor.pydef__add__(self, other):ifnotisinstance(other, Tensor): other = Tensor(other)return Add.apply(self, other)def__mul__(self, other):ifnotisinstance(other, Tensor): other = Tensor(other)return Mul.apply(self, other)def__pow__(self, power):ifnotisinstance(power,(int,float)): raiseTypeError("Power must be scalar")return Pow.apply(self, power=power)
Let’s test this by building a small computational graph.
Code
# --- A small demonstration ---import syssys.path.append('../')from from_scratch.autograd.tensor import Tensora = Tensor(3.0, requires_grad=True)b = Tensor(4.0, requires_grad=True)# c = a * b# d = c + bc = a * bd = c + bprint(f"Tensor c: {c}")print(f"c was created by: {c._creator.__class__.__name__}")print(f"Tensor d: {d}")print(f"d was created by: {d._creator.__class__.__name__}")
Tensor c: Tensor(data=12.0, requires_grad=True)
c was created by: Mul
Tensor d: Tensor(data=16.0, requires_grad=True)
d was created by: Add
Backpropagation: The Algorithm
Now for the main event. The backward() method automates the chain rule. It works by first creating a topologically sorted list of all the nodes that lead to the final output. This ensures that when we process a node, we have already computed the gradients for all the nodes that depend on it.
Here is the full implementation of the backward method in our Tensor class. Read the comments carefully to understand the flow.
# from_scratch/autograd/tensor.pydef backward(self, grad: Optional[np.ndarray] =None):# 1. Ensure the tensor requires a gradient.ifnotself.requires_grad:return# 2. Bootstrap the gradient. For the final loss, this is 1.0.if grad isNone:ifself.data.size !=1:raiseValueError("Gradient must be specified for non-scalar Tensors.") grad = np.ones_like(self.data, dtype=np.float32)self.grad = grad# 3. Build a topologically sorted list of all nodes in the graph. visited, topo_nodes =set(), []def build_topo(node):ifid(node) notin visited: visited.add(id(node))if node._creator:for p in node._creator.parents: build_topo(p) topo_nodes.append(node) build_topo(self)# 4. Propagate gradients backward through the graph.for node inreversed(topo_nodes):ifnot node._creator or node.grad isNone:continue# Get the gradients for this node's parents by calling its creator's backward method. parent_grads = node._creator.backward(node.grad)ifnotisinstance(parent_grads, tuple): parent_grads = (parent_grads,)# 5. Distribute and accumulate gradients to the parents.for p, p_g inzip(node._creator.parents, parent_grads):if p_g isnotNoneand p.requires_grad:# Handle cases where broadcasting occurred during the forward pass.if p_g.shape != p.shape:# (Implementation detail for broadcasting omitted for clarity)pass# Accumulate gradient.if p.grad isNone: p.grad = p_g.copy()else: p.grad += p_g
Final Demonstration
Let’s test our complete engine on a more complex expression: \(L = (a \cdot b) + a^2\). We need to calculate \(\frac{\partial L}{\partial a}\) and \(\frac{\partial L}{\partial b}\).
If we set \(a=3\) and \(b=4\): - \(\frac{\partial L}{\partial a} = 4 + 2 \cdot 3 = 10\) - \(\frac{\partial L}{\partial b} = 3\)
Let’s see if our engine gets the same result.
Code
a = Tensor(3.0, requires_grad=True)b = Tensor(4.0, requires_grad=True)# L = (a * b) + a**2term1 = a * bterm2 = a **2.0L = term1 + term2# Kick off backpropagation from the final nodeL.backward()print(f"The final value L is: {L.data.item():.2f}")print("--- Gradients ---")print(f"dL/da: {a.grad.item():.2f} (Expected: 10.0)")print(f"dL/db: {b.grad.item():.2f} (Expected: 3.0)")
The final value L is: 21.00
--- Gradients ---
dL/da: 10.00 (Expected: 10.0)
dL/db: 3.00 (Expected: 3.0)
It works perfectly! Our engine correctly calculated the gradients, and critically, it correctly accumulated the gradients for a from both the Mul and Pow branches of the computational graph.
Conclusion
We have successfully built the core of a deep learning library. By separating Tensor objects (which hold state) from Function objects (which define operations), we created a system that can dynamically build a computational graph and automatically compute the gradients of any output with respect to its inputs.
This autograd engine is the fundamental tool that enables model training. In the next post, we will use this engine to build our first neural network layers and train a model to solve a real problem.