We’ll be stacking multiple layers and tackling a real-world binary classification task.
bare-bones-ml
code
Author
Devansh Lodha
Published
May 25, 2025
In the last post, we built the foundational Module, Linear layer, and Sequential container. We successfully trained a single-layer model on a toy regression problem.
Our goals for this post are:

1. Add the Sigmoid activation function and Binary Cross-Entropy (BCE) loss to our library.
2. Implement the Adam optimizer.
3. Use our Sequential container to stack Linear layers and activations into a multi-layer perceptron (MLP).
4. Train our from-scratch MLP on the Wisconsin Breast Cancer dataset from Scikit-learn.
All the code for this post can be found in the from_scratch/ directory of the bare-bones-ml repository.
New Tools for a New Task: Classification
Regression (predicting a continuous value) and classification (predicting a discrete category) are different problems, and classification calls for a different set of tools. For a binary (yes/no) problem, we need two things: a way to output a probability, and a way to measure the error of that probability.
1. The Sigmoid Activation Function
To get a probability, we need to squash the model’s raw output (called a “logit,” which can be any real number) into the range [0, 1]. The Sigmoid function is perfect for this.
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
Its derivative is also simple and elegant, which is useful for backpropagation:

\[ \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \]
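In the library, sigmoid is exposed through from_scratch/functional.py. As a rough sketch (not the exact repository code), it can be written as a Function that reuses the same save_for_backward pattern the BCELoss class uses below; the class name here is just illustrative:

```python
# Sketch only: assumes the same Function / save_for_backward interface as BCELoss below;
# the class name and exact signatures in from_scratch/functional.py may differ.
class Sigmoid(Function):
    def forward(self, x: np.ndarray) -> np.ndarray:
        out = 1.0 / (1.0 + np.exp(-x))  # squash logits into (0, 1)
        self.save_for_backward(out)
        return out

    def backward(self, grad: np.ndarray):
        (out,) = self.saved_tensors
        # sigma'(x) = sigma(x) * (1 - sigma(x)), and `out` already holds sigma(x)
        return grad * out * (1.0 - out)
```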
2. The Binary Cross-Entropy (BCE) Loss

MSELoss is a poor choice for measuring the error of a probability. If the model predicts 0.9 and the true label is 1, the squared error is small. But if it predicts 0.1 (very confident and very wrong), we want to penalize it far more heavily than a squared error would.
Binary Cross-Entropy (BCE) does exactly this. It heavily penalizes confident, incorrect predictions. For a single prediction, the formula is:

\[ L_{\text{BCE}} = -\left[\, y_{\text{true}} \log(\hat{y}_{\text{pred}}) + (1 - y_{\text{true}}) \log(1 - \hat{y}_{\text{pred}}) \,\right] \]
where \(\hat{y}_{\text{pred}}\) is the model’s predicted probability (the output of the sigmoid) and \(y_{\text{true}}\) is the ground-truth label (0 or 1).
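For the backward pass we also need the gradient of this loss with respect to the prediction. Differentiating the expression above gives

\[ \frac{\partial L_{\text{BCE}}}{\partial \hat{y}_{\text{pred}}} = \frac{\hat{y}_{\text{pred}} - y_{\text{true}}}{\hat{y}_{\text{pred}}\,(1 - \hat{y}_{\text{pred}})} \]

which, averaged over the batch, is exactly what the backward method below computes: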
```python
# from_scratch/functional.py
class BCELoss(Function):
    def forward(self, y_pred: np.ndarray, y_true: np.ndarray) -> np.ndarray:
        epsilon = 1e-15
        # Clip predictions to avoid log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        self.save_for_backward(y_pred, y_true)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return np.array(loss)

    def backward(self, grad: np.ndarray):
        y_pred, y_true = self.saved_tensors
        n = y_pred.shape[0] if y_pred.ndim > 0 else 1
        # Derivative of BCE Loss
        grad_y_pred = grad * (1.0 / n) * ((y_pred - y_true) / (y_pred * (1 - y_pred)))
        return grad_y_pred, None  # No gradient for the true labels
```
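As a quick sanity check of the "heavily penalizes confident, incorrect predictions" claim, here is the raw formula evaluated with plain NumPy (independent of the BCELoss class):

```python
import numpy as np

def bce(y_pred, y_true):
    # Same formula as BCELoss.forward, for a single prediction (no clipping, no batching)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(bce(0.90, 1))  # ~0.105 -> confident and correct: small loss
print(bce(0.10, 1))  # ~2.303 -> confident and wrong: large loss
print(bce(0.01, 1))  # ~4.605 -> very confident and wrong: even larger loss
```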
3. The Adam Optimizer
While SGD works, it can be slow to converge. The Adam (Adaptive Moment Estimation) optimizer is a more advanced algorithm that often leads to faster training. It adapts the learning rate for each parameter individually by keeping track of two “moments” of the gradients:

- First moment (the mean): like momentum, this helps the optimizer keep moving in a consistent direction.
- Second moment (the uncentered variance): this scales the update for each parameter, taking smaller steps for parameters whose gradients are consistently large and larger steps for those whose gradients are small or infrequent.
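Concretely, for each parameter \(\theta\) with gradient \(g_t\) at step \(t\), the standard Adam update keeps exponential moving averages of both moments, corrects them for their bias toward zero in the first few steps, and then scales the step size:

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
\]

with typical defaults \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), and a small \(\epsilon\) for numerical stability.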
The full implementation lives in from_scratch/optim.py.
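As a rough sketch of what such an optimizer looks like (this is not the exact repository code; it assumes each parameter exposes its values and gradient as NumPy arrays via .data and .grad):

```python
import numpy as np

# Sketch only: not the exact from_scratch/optim.py implementation.
# Assumes each parameter has .data (np.ndarray) and .grad (np.ndarray) attributes.
class Adam:
    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0  # timestep, used for bias correction
        self.m = [np.zeros_like(p.data) for p in self.params]  # first moments
        self.v = [np.zeros_like(p.data) for p in self.params]  # second moments

    def zero_grad(self):
        for p in self.params:
            p.grad = np.zeros_like(p.data)  # or None, depending on the autograd design

    def step(self):
        self.t += 1
        for i, p in enumerate(self.params):
            g = p.grad
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g ** 2
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)  # bias-corrected first moment
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)  # bias-corrected second moment
            p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```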
Let’s put all our new tools together. We will train a model to classify breast cancer tumors as malignant (0) or benign (1) based on 30 different features from the Scikit-learn dataset.
The training process is almost identical to our regression example, showcasing the power of our modular design.
```python
import sys
sys.path.append('../')

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Sequential, Linear, ReLU
from from_scratch.functional import sigmoid, binary_cross_entropy
from from_scratch.optim import Adam

# 1. Load and Prepare Data
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- A crucial step: Feature Scaling ---
# Neural networks train best when input features are on a similar scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert data to our custom Tensors
X_train_t = Tensor(X_train)
y_train_t = Tensor(y_train.reshape(-1, 1))
X_test_t = Tensor(X_test)

# 2. Define the Model, Optimizer, and Loss
input_features = X_train.shape[1]
model = Sequential(
    Linear(input_features, 16),
    ReLU(),
    Linear(16, 1)
)
optimizer = Adam(params=model.parameters(), lr=0.01)
loss_function = binary_cross_entropy

# 3. The Training Loop
epochs = 100
print("--- Training Start ---")
for epoch in range(epochs):
    optimizer.zero_grad()

    # Forward pass: get the raw model output (logits)
    logits = model(X_train_t)
    # Apply sigmoid to get probabilities
    predictions = sigmoid(logits)
    # Compute loss
    loss = loss_function(predictions, y_train_t)

    # Backward pass to compute gradients
    loss.backward()
    # Update weights using the optimizer
    optimizer.step()

    if epoch % 20 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch}, Loss: {loss.data.item():.4f}")

# 4. Evaluate the trained model on the test set
logits_test = model(X_test_t)
preds_test = sigmoid(logits_test)

# Convert probabilities to binary class predictions (0 or 1)
binary_preds = (preds_test.data > 0.5).astype(int).flatten()
accuracy = accuracy_score(y_test, binary_preds)
print(f"\nFinal Test Accuracy: {accuracy:.4f}")
```
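Once trained, the same forward pass works on individual examples. As a quick illustrative check (assuming the model and sigmoid accept a single-row Tensor just as they accept the full batch), reusing the variables defined above:

```python
# Predict for a single test sample with the trained model
sample = Tensor(X_test[:1])                  # first row of the scaled test set, kept 2D
prob = sigmoid(model(sample)).data.item()    # probability of class 1 (benign)
print(f"P(benign) = {prob:.3f} -> predicted class {int(prob > 0.5)}")
```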
With just a few additions to our library, we were able to build a multi-layer perceptron and achieve very high accuracy on a real-world classification task. We’ve shown that our from-scratch Module, Linear, Sequential, Adam, and BCELoss components all work together seamlessly.
We have built a solid foundation for feed-forward networks. The next logical step is to explore architectures that can handle sequential data, like sentences or time series. In the next post, we will implement our first Recurrent Neural Network (RNN).