Building a Multi-Layer Perceptron for Classification

We’ll be stacking multiple layers and tackling a real-world binary classification task.
Categories: bare-bones-ml, code
Author: Devansh Lodha
Published: May 25, 2025

In the last post, we built the foundational Module, Linear layer, and Sequential container. We successfully trained a single-layer model on a toy regression problem.

Our goals for this post are:

1. Add the Sigmoid activation function and Binary Cross-Entropy (BCE) loss to our library.
2. Implement the Adam optimizer.
3. Use our Sequential container to stack Linear layers and activations into a multi-layer perceptron (MLP).
4. Train our from-scratch MLP on the Wisconsin Breast Cancer dataset from Scikit-learn.

All the code for this post can be found in the from_scratch/ directory of the bare-bones-ml repository.

New Tools for a New Task: Classification

Regression (predicting a continuous value) and classification (predicting a discrete category) are different problems, and classification calls for a different set of tools. For a binary (yes/no) problem, we need two things: a way to output a probability, and a way to measure the error of that probability.

1. The Sigmoid Activation Function

To get a probability, we need to squash the model’s raw output (called a “logit,” which can be any real number) into the range [0, 1]. The Sigmoid function is perfect for this.

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Its derivative is also simple and elegant, which is useful for backpropagation:

\[ \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x)) \]

Here is the implementation in from_scratch/functional.py:

# from_scratch/functional.py
class Sigmoid(Function):
    def forward(self, x: np.ndarray) -> np.ndarray:
        output = 1 / (1 + np.exp(-x))
        self.save_for_backward(output)
        return output

    def backward(self, grad: np.ndarray):
        # Reuse the forward output saved above: d(sigmoid)/dx = output * (1 - output)
        output, = self.saved_tensors
        return grad * output * (1 - output)

def sigmoid(x: Tensor) -> Tensor:
    return Sigmoid.apply(x)
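
As a quick sanity check of the derivative formula (plain NumPy, independent of our library), we can compare the analytic gradient against a central finite-difference estimate; the two agree to many decimal places:

# Sanity check: analytic vs. finite-difference sigmoid gradient (NumPy only)
import numpy as np

def sigmoid_np(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)
analytic = sigmoid_np(x) * (1 - sigmoid_np(x))               # sigma(x) * (1 - sigma(x))
h = 1e-5
numeric = (sigmoid_np(x + h) - sigmoid_np(x - h)) / (2 * h)  # central difference

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-11): the two gradients match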

2. Binary Cross-Entropy (BCE) Loss

MSELoss is a poor choice for measuring the error of a probability. If the true label is 1, squared error charges only 0.01 for a prediction of 0.9 and 0.81 for a prediction of 0.1; the penalty can never exceed 1, no matter how confidently wrong the model is.

Binary Cross-Entropy (BCE) fixes this: it heavily penalizes confident, incorrect predictions, and the loss grows without bound as the predicted probability approaches the wrong extreme. For a single prediction, the formula is:

\[ L = - \left( y_{\text{true}} \log(\hat{y}_{\text{pred}}) + (1 - y_{\text{true}}) \log(1 - \hat{y}_{\text{pred}}) \right) \]

Where \(\hat{y}_{\text{pred}}\) is the model’s predicted probability (the output of the sigmoid).

# from_scratch/functional.py
class BCELoss(Function):
    def forward(self, y_pred: np.ndarray, y_true: np.ndarray) -> np.ndarray:
        epsilon = 1e-15 # Clip predictions to avoid log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        self.save_for_backward(y_pred, y_true)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return np.array(loss)

    def backward(self, grad: np.ndarray):
        y_pred, y_true = self.saved_tensors
        n = y_pred.size if y_pred.ndim > 0 else 1  # number of elements averaged over by np.mean
        # Derivative of BCE Loss
        grad_y_pred = grad * (1.0 / n) * ((y_pred - y_true) / (y_pred * (1 - y_pred)))
        return grad_y_pred, None  # No gradient for the true labels

def binary_cross_entropy(y_pred: Tensor, y_true: Tensor) -> Tensor:
    return BCELoss.apply(y_pred, y_true)
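
To see the difference in penalties concretely, here is a small NumPy-only comparison (not using our library) for a true label of 1: squared error stays below 1, while BCE explodes as the prediction approaches 0:

# Penalty comparison for a true label of 1 (NumPy only)
import numpy as np

y_true = 1.0
for y_pred in [0.9, 0.5, 0.1, 0.01]:
    squared_error = (y_pred - y_true) ** 2
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    print(f"pred={y_pred:<5} squared error={squared_error:.3f}  BCE={bce:.3f}")

# pred=0.9  -> squared error 0.010, BCE 0.105
# pred=0.5  -> squared error 0.250, BCE 0.693
# pred=0.1  -> squared error 0.810, BCE 2.303
# pred=0.01 -> squared error 0.980, BCE 4.605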

3. The Adam Optimizer

While SGD works, it can be slow to converge. The Adam (Adaptive Moment Estimation) optimizer is a more advanced algorithm that often trains faster. It adapts the learning rate for each parameter individually by keeping track of two running "moments" of the gradients:

- First moment (the mean): this acts like momentum, helping the optimizer keep moving in a consistent direction.
- Second moment (the uncentered variance): this scales the learning rate, making larger updates for parameters with small or infrequent gradients and smaller updates for those with large ones.

Because both moments start at zero, they are biased toward zero early in training, so Adam also applies a bias correction before using them (the m_hat and v_hat terms in the code below).
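
Concretely, for a parameter \(\theta\) with gradient \(g_t\) at step \(t\), the update implemented below is:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]

\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \]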

Here is the implementation from from_scratch/optim.py:

# from_scratch/optim.py
class Adam(Optimizer):
    def __init__(self, params: List[Tensor], lr: float, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        super().__init__(params, lr)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.t = 0
        self.m = [np.zeros_like(p.data) for p in self.params]
        self.v = [np.zeros_like(p.data) for p in self.params]

    def step(self):
        self.t += 1
        for i, p in enumerate(self.params):
            if p.grad is not None:
                # Update biased moment estimates
                self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * p.grad
                self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (p.grad ** 2)
                
                # Compute bias-corrected estimates
                m_hat = self.m[i] / (1 - self.beta1 ** self.t)
                v_hat = self.v[i] / (1 - self.beta2 ** self.t)
                
                # Update parameters
                p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
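
As a minimal illustration of the update rule in isolation (plain NumPy, not using our library), here is the same arithmetic minimizing the one-dimensional function \((w - 3)^2\); the parameter walks steadily toward 3:

# Toy example: the Adam update rule minimizing (w - 3)^2, NumPy only
import numpy as np

w = np.array([0.0])                      # start far from the minimum at w = 3
m, v = np.zeros_like(w), np.zeros_like(w)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    grad = 2 * (w - 3.0)                 # gradient of (w - 3)^2
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # prints a value close to 3.0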

Training a Classifier on Real Data

Let’s put all our new tools together. We will train a model to classify breast cancer tumors as benign (1) or malignant (0) based on 30 different features from the Scikit-learn dataset.

The training process is almost identical to our regression example, showcasing the power of our modular design.

import sys
sys.path.append('../')

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from from_scratch.autograd.tensor import Tensor
from from_scratch.nn import Sequential, Linear, ReLU
from from_scratch.functional import sigmoid, binary_cross_entropy
from from_scratch.optim import Adam

# 1. Load and Prepare Data
data = load_breast_cancer()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- A crucial step: Feature Scaling ---
# Neural networks train best when input features are on a similar scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert data to our custom Tensors
X_train_t = Tensor(X_train)
y_train_t = Tensor(y_train.reshape(-1, 1))
X_test_t = Tensor(X_test)

# 2. Define the Model, Optimizer, and Loss
input_features = X_train.shape[1]
model = Sequential(
    Linear(input_features, 16),
    ReLU(),
    Linear(16, 1)
)
optimizer = Adam(params=model.parameters(), lr=0.01)
loss_function = binary_cross_entropy

# 3. The Training Loop
epochs = 100
print("--- Training Start ---")
for epoch in range(epochs):
    optimizer.zero_grad()
    
    # Forward pass: get the raw model output (logits)
    logits = model(X_train_t)
    
    # Apply sigmoid to get probabilities
    predictions = sigmoid(logits)
    
    # Compute loss
    loss = loss_function(predictions, y_train_t)
    
    # Backward pass to compute gradients
    loss.backward()
    
    # Update weights using the optimizer
    optimizer.step()

    if epoch % 20 == 0 or epoch == epochs - 1:
        print(f"Epoch {epoch}, Loss: {loss.data.item():.4f}")

# 4. Evaluate the trained model on the test set
logits_test = model(X_test_t)
preds_test = sigmoid(logits_test)

# Convert probabilities to binary class predictions (0 or 1)
binary_preds = (preds_test.data > 0.5).astype(int).flatten()
accuracy = accuracy_score(y_test, binary_preds)

print(f"\nFinal Test Accuracy: {accuracy:.4f}")
--- Training Start ---
Epoch 0, Loss: 1.0902
Epoch 20, Loss: 0.1138
Epoch 40, Loss: 0.0743
Epoch 60, Loss: 0.0582
Epoch 80, Loss: 0.0475
Epoch 99, Loss: 0.0408

Final Test Accuracy: 0.9825

Conclusion

With just a few additions to our library, we built a multi-layer perceptron and reached over 98% test accuracy on a real-world classification task. We’ve shown that our from-scratch Module, Linear, Sequential, ReLU, Sigmoid, BCELoss, and Adam components all work together seamlessly.

We have built a solid foundation for feed-forward networks. The next logical step is to explore architectures that can handle sequential data, like sentences or time series. In the next post, we will implement our first Recurrent Neural Network (RNN).