Neural Networks Fundamentals

Deep dive into perceptrons, activation functions, backpropagation, gradient descent, and loss functions with Python code examples

The Building Block: Perceptron

A perceptron is the simplest neural network unit. It takes multiple inputs, multiplies each by a weight, sums them with a bias, and passes the result through an activation function. Understanding perceptrons is essential before tackling deep networks.

Neuron Equation:

output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)
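
To make the equation concrete, here is a minimal sketch of a single perceptron in NumPy. The step activation and the hand-picked weights (which happen to implement logical AND) are illustrative choices, not a trained model:

import numpy as np

def step(x):
    """Step activation: 1 if x > 0, else 0."""
    return (x > 0).astype(float)

def perceptron(x, w, b):
    """output = activation(w . x + b)"""
    return step(np.dot(x, w) + b)

# Illustrative weights that realize logical AND on binary inputs
w = np.array([1.0, 1.0])
b = -1.5
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
for xi in X:
    print(f"{xi} -> {perceptron(xi, w, b):.0f}")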

Activation Functions Compared

Activation functions introduce non-linearity, allowing networks to learn complex patterns. Here are the most important ones:

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

Default choice for hidden layers. Fast to compute and avoids the vanishing gradient problem for positive inputs. Can suffer from "dying ReLU", where a neuron's pre-activation stays negative for all inputs, so it outputs 0 and stops learning.

Sigmoid

f(x) = 1 / (1 + e^(-x))

Outputs values between 0 and 1. Used for binary classification output layers. Suffers from vanishing gradients in deep networks.

Softmax

f(x_i) = e^(x_i) / sum_j(e^(x_j))

Converts logits to probabilities summing to 1. Used in multi-class classification output layers.

Tanh

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Outputs between -1 and 1. Zero-centered, useful for hidden layers when data is centered around zero.

Implementing Activation Functions in Python

import numpy as np

# Activation functions and their derivatives
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x) ** 2

# Visualize activation functions
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(x, relu(x), 'b-', linewidth=2)
axes[0, 0].set_title('ReLU'); axes[0, 0].grid(True)

axes[0, 1].plot(x, sigmoid(x), 'r-', linewidth=2)
axes[0, 1].set_title('Sigmoid'); axes[0, 1].grid(True)

axes[1, 0].plot(x, tanh(x), 'g-', linewidth=2)
axes[1, 0].set_title('Tanh'); axes[1, 0].grid(True)

# Softmax needs a vector per point; pairing each x with a zero gives
# softmax([x, 0])[0] = sigmoid(x), which we can plot as a 1-D curve
axes[1, 1].plot(x, softmax(np.column_stack([x, np.zeros_like(x)]))[:, 0],
                'm-', linewidth=2)
axes[1, 1].set_title('Softmax (vs zero)'); axes[1, 1].grid(True)

plt.tight_layout()
plt.savefig('activations.png', dpi=150)
print("Saved activation function plots")

Forward Propagation

Forward propagation passes input data through the network layer by layer to produce a prediction. Each layer applies weights, adds bias, and passes through an activation function.

import numpy as np

class NeuralNetwork:
    """A simple feedforward neural network from scratch."""

    def __init__(self, layer_sizes):
        """
        layer_sizes: list of integers, e.g. [2, 4, 4, 1]
        means 2 inputs, two hidden layers of 4 neurons, 1 output
        """
        self.weights = []
        self.biases = []

        for i in range(len(layer_sizes) - 1):
            # He initialization for ReLU
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward pass through the network."""
        self.activations = [X]
        self.z_values = []

        for i in range(len(self.weights)):
            z = self.activations[-1] @ self.weights[i] + self.biases[i]
            self.z_values.append(z)

            # ReLU for hidden layers, sigmoid for output
            if i < len(self.weights) - 1:
                a = np.maximum(0, z)  # ReLU
            else:
                a = 1 / (1 + np.exp(-z))  # Sigmoid
            self.activations.append(a)

        return self.activations[-1]

# Example: XOR problem
net = NeuralNetwork([2, 4, 1])
X = np.array([[0,0], [0,1], [1,0], [1,1]])
output = net.forward(X)
print("Initial predictions (random weights):")
print(output)

Backpropagation: How Networks Learn

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule backward through the network. These gradients tell us how to adjust weights to reduce error.

class NeuralNetwork:
    # ... (forward method from above) ...

    def backward(self, X, y, learning_rate=0.01):
        """Backpropagation to compute gradients and update weights."""
        m = X.shape[0]
        output = self.activations[-1]

        # Output layer error: for a sigmoid output with binary cross-entropy,
        # the gradient w.r.t. the pre-activation simplifies to (output - y)
        delta = output - y  # shape: (m, 1)

        # Backpropagate through layers
        for i in reversed(range(len(self.weights))):
            # Compute gradients
            dW = self.activations[i].T @ delta / m
            db = np.sum(delta, axis=0, keepdims=True) / m

            # Propagate error to previous layer
            if i > 0:
                delta = (delta @ self.weights[i].T)
                # ReLU derivative
                delta *= (self.z_values[i-1] > 0).astype(float)

            # Update weights and biases (gradient descent)
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db

    def train(self, X, y, epochs=1000, learning_rate=0.01):
        """Train the network using forward + backward passes."""
        for epoch in range(epochs):
            # Forward pass
            predictions = self.forward(X)

            # Compute loss (binary cross-entropy)
            loss = -np.mean(y * np.log(predictions + 1e-8) +
                           (1 - y) * np.log(1 - predictions + 1e-8))

            # Backward pass
            self.backward(X, y, learning_rate)

            if epoch % 200 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}")

        # Run one more forward pass so the returned predictions
        # reflect the final weight update
        return self.forward(X)

# Train on XOR
net = NeuralNetwork([2, 8, 4, 1])
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

predictions = net.train(X, y, epochs=2000, learning_rate=0.5)
print("\nFinal predictions:")
for i in range(len(X)):
    print(f"  {X[i]} -> {predictions[i][0]:.4f} (expected {y[i][0]})")

Loss Functions

MSE (Mean Squared Error)

L = (1/n) * sum((y - y_hat)^2)

Best for regression tasks. Penalizes large errors more heavily. Sensitive to outliers.

Binary Cross-Entropy

L = -[y*log(p) + (1-y)*log(1-p)]

Used for binary classification. Works with sigmoid output. Measures probability calibration.

Categorical Cross-Entropy

L = -sum(y_i * log(p_i))

Multi-class classification. Paired with softmax. Equivalent to negative log-likelihood.

Huber Loss

L = 0.5*e^2 if |e| <= delta, else delta*(|e| - 0.5*delta)

Combines MSE and MAE: quadratic near zero, linear for large errors. Robust to outliers while remaining smooth at the minimum.
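
To make the formulas concrete, here is a sketch of these losses in NumPy; the epsilon clipping and the delta=1.0 default are illustrative choices for numerical stability, not fixed by the definitions above:

import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum((y - y_hat)^2)"""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-8):
    """-[y*log(p) + (1-y)*log(1-p)], clipped for numerical stability."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true, y_pred, eps=1e-8):
    """-sum(y_i * log(p_i)), averaged over samples; expects one-hot y_true."""
    p = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(p), axis=-1))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond."""
    error = y_true - y_pred
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.mean(np.where(np.abs(error) <= delta, quadratic, linear))

# Quick check on toy values
y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.7])
print(f"MSE: {mse(y, p):.4f}, BCE: {binary_cross_entropy(y, p):.4f}")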

Gradient Descent Variants

import numpy as np

# 1. Batch Gradient Descent - uses ALL data each step
def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    w = np.random.randn(X.shape[1], 1) * 0.01
    for epoch in range(epochs):
        predictions = X @ w
        gradient = (2 / len(X)) * X.T @ (predictions - y)
        w -= lr * gradient
    return w

# 2. Stochastic Gradient Descent - uses ONE sample each step
def sgd(X, y, lr=0.01, epochs=100):
    w = np.random.randn(X.shape[1], 1) * 0.01
    for epoch in range(epochs):
        for i in range(len(X)):
            xi = X[i:i+1]
            yi = y[i:i+1]
            gradient = 2 * xi.T @ (xi @ w - yi)
            w -= lr * gradient
    return w

# 3. Mini-Batch Gradient Descent - uses BATCH_SIZE samples
def mini_batch_gd(X, y, lr=0.01, epochs=100, batch_size=32):
    w = np.random.randn(X.shape[1], 1) * 0.01
    n = len(X)
    for epoch in range(epochs):
        indices = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = indices[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            gradient = (2 / len(X_batch)) * X_batch.T @ (X_batch @ w - y_batch)
            w -= lr * gradient
    return w

# Mini-batch is the standard in practice:
# - Batch: stable but slow, memory-intensive
# - SGD: fast but noisy gradients
# - Mini-batch: best of both worlds
print("Mini-batch GD is the default choice for training neural networks")

Learning Rate: The Most Important Hyperparameter

Too High (0.1+)

Overshoots the minimum, loss oscillates wildly, may diverge to infinity. Training becomes unstable.

Just Right (0.001-0.01)

Converges steadily to a good solution. Typical starting values: 0.001 for Adam, 0.01 for SGD.

Too Low (0.00001)

Converges extremely slowly. May get stuck in poor local minima. Wastes compute time.
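
To see these regimes in action, here is an illustrative sketch on the toy loss f(w) = w^2. The specific rates are chosen for this one-dimensional problem and differ from the typical network values above, since the stability threshold depends on the curvature of the loss:

import numpy as np

def run_gd(lr, steps=20, w0=5.0):
    """Gradient descent on f(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

# f(w) = w^2 has its minimum at w = 0
for lr in [1.5, 0.1, 0.0001]:
    print(f"lr={lr}: w after 20 steps = {run_gd(lr):.6f}")
# lr=1.5 diverges (each step multiplies w by |1 - 2*lr| = 2),
# lr=0.1 converges steadily toward 0,
# lr=0.0001 barely moves from the starting point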

Key Takeaways

  • Perceptrons are the fundamental unit; networks are layers of perceptrons connected together
  • ReLU is the default activation for hidden layers; sigmoid/softmax for output layers
  • Backpropagation uses the chain rule to compute gradients layer by layer
  • Mini-batch gradient descent balances speed and stability for training
  • Learning rate is the single most important hyperparameter to tune
