1. Setting & Technical Terms

In supervised learning, we are given a dataset:

  • Features: x_i ∈ R^d for i = 1, ..., n.
  • Labels: y_i, which can be real-valued (regression) or discrete (classification).

We seek a prediction function (hypothesis)

  f : R^d → Y,

where Y = R in regression or Y = {-1, +1} in classification.

Common terms:

  • Training set: the data used to fit the model.
  • Test set: held-out data to evaluate generalization.
  • Overfitting: when the model fits the training data too closely and fails on unseen data.
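
To make the train/test distinction and overfitting concrete, here is a minimal sketch on synthetic 1-D data (the polynomial degrees and noise level are illustrative choices, not from the notes above). The flexible degree-9 fit typically achieves lower training error but higher test error, i.e., it overfits.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship
x = rng.uniform(0, 1, size=30)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=30)

# Hold out the last 10 points as a test set
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on the training set only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")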

2. General Pipeline

  1. Choose a hypothesis class H (e.g., linear models).
  2. Define a loss function ℓ(f(x), y) measuring prediction error.
  3. Minimize empirical risk over H: find f* ∈ argmin_{f ∈ H} (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i).
  4. Evaluate on held-out data.

Graphically:

Data --> Model Choice --> Loss & Optimization --> Trained Model --> Prediction
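
The same steps can be written as a short numpy sketch; the synthetic data and the use of plain gradient descent are illustrative choices (Section 3 below uses a closed-form solution instead).

import numpy as np

rng = np.random.default_rng(1)

# Data (synthetic, for illustration)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.3 * rng.normal(size=100)

# Model choice: linear model f(x) = w^T x
w = np.zeros(3)

# Loss & optimization: gradient descent on the mean squared error
lr = 0.1
for _ in range(500):
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    w -= lr * grad

# Trained model --> prediction, evaluated on held-out data
X_test = rng.normal(size=(20, 3))
y_test = X_test @ true_w + 0.3 * rng.normal(size=20)
print("test MSE:", np.mean((X_test @ w - y_test) ** 2))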

3. Empirical Risk Minimization – Linear Regression

We illustrate ERM using linear regression.

3.1 Step 1: Hypothesis Class

We restrict the hypothesis class to affine functions:

  f(x) = w^T x + b,  with w ∈ R^d and b ∈ R.

We often fold b into w by augmenting x with a constant 1:

  x_aug = (1, x_1, ..., x_d) ∈ R^{d+1},  so that  f(x) = w^T x_aug  with w = (b, w_1, ..., w_d) ∈ R^{d+1}.
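
As a quick sanity check of the augmentation trick (the numbers below are arbitrary), the augmented inner product reproduces w^T x + b:

import numpy as np

x = np.array([8.0, 5.0, 7.0])            # original feature vector
w = np.array([0.2, 0.3, 0.4])            # illustrative weights
b = 0.5                                  # illustrative bias

x_aug = np.concatenate(([1.0], x))       # augment x with a constant 1
w_aug = np.concatenate(([b], w))         # fold the bias into the weight vector

print(w @ x + b)       # 6.4
print(w_aug @ x_aug)   # 6.4 -- the same value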

3.2 Step 2: Loss Function

Choose the squared loss for regression:

  ℓ(f(x), y) = (f(x) - y)^2.

For a dataset of n points, the empirical risk is:

  R(w) = (1/n) Σ_{i=1}^n (w^T x_i - y_i)^2.
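
For instance, the empirical risk is a one-liner in numpy (the data and weights below are made up; rows of X are already augmented with the leading 1):

import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])               # rows are augmented x_i
y = np.array([3.0, 4.0, 6.0])            # here y_i = 1 + x_i exactly

def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

print(empirical_risk(np.array([1.0, 1.0])))   # 0.0 -- perfect fit
print(empirical_risk(np.array([0.5, 1.0])))   # 0.25 -- worse weights, higher risk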

3.3 Step 3: Minimize Empirical Risk

We solve:

  w* = argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)^2.

This is a quadratic problem with a closed-form solution. Let:

  • X ∈ R^{n×(d+1)} be the data matrix (rows are the augmented x_i^T).
  • y ∈ R^n be the vector of labels.

Then

  w* = (X^T X)^{-1} X^T y.

Note: the matrix X^T X must be invertible (or we add regularization).
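
As a quick numerical sanity check (on synthetic, well-conditioned data, so X^T X is invertible), the normal-equations formula agrees with numpy's built-in least-squares solver:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                    # 50 samples, 4 features
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y     # closed-form solution
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least-squares solver

print(np.allclose(w_closed, w_lstsq))           # True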

3.4 Example: Movie Rating Prediction

Suppose you have ratings from 3 friends for a movie:

  Friend    Rating (feature)
  Alice     8
  Bob       5
  Charlie   7

You want to predict your own rating based on theirs. Let x = (x_Alice, x_Bob, x_Charlie) be the feature vector of your friends' ratings. Assume the model f(x) = w^T x + b. Using least squares on past data, we solve for w by the formula above.

import numpy as np
 
# Example data
X = np.array([
    [8, 5, 7],  # Movie 1
    [7, 6, 8],  # Movie 2
    [6, 7, 6]   # Movie 3
])
y = np.array([7, 8, 6])  # Your ratings
 
# Add bias term
X_aug = np.column_stack([np.ones(len(X)), X])
 
# Compute least-squares weights. With only 3 movies and 4 parameters,
# X_aug.T @ X_aug is singular, so use the pseudoinverse rather than inv.
w_star = np.linalg.pinv(X_aug) @ y
print("Optimal weights:", w_star)
 
# Make prediction for new movie
new_movie = np.array([1, 8, 5, 7])  # [bias, Alice, Bob, Charlie]
prediction = new_movie @ w_star
print(f"Predicted rating: {prediction:.2f}")

4. Extension to Linear Classification

Linear models can also classify by thresholding the linear output:

  f(x) = sign(w^T x + b).

This reduces classification to a regression output with a decision boundary at zero.

4.1 Example: Mango Ripeness

Features: sugar content x_1 and firmness x_2. We collect samples labeled y = +1 (ripe) or y = -1 (unripe). Fit w by minimizing squared loss as above, then predict the class as sign(w^T x_aug):

import numpy as np

# Example: Mango classification
X = np.array([
    [0.8, 0.3],  # Ripe mango
    [0.7, 0.4],  # Ripe mango
    [0.3, 0.8],  # Unripe mango
    [0.4, 0.7]   # Unripe mango
])
y = np.array([1, 1, -1, -1])
 
# Add bias term
X_aug = np.column_stack([np.ones(len(X)), X])
 
# Compute least-squares weights. The feature columns are linearly dependent
# here (sugar + firmness = 1.1 for every sample), so X_aug.T @ X_aug is
# singular and we use the pseudoinverse rather than inv.
w_star = np.linalg.pinv(X_aug) @ y
print("Optimal weights:", w_star)
 
# Make prediction for new mango
new_mango = np.array([1, 0.75, 0.35])  # [bias, sugar, firmness]
prediction = np.sign(new_mango @ w_star)
print(f"Predicted class: {'Ripe' if prediction > 0 else 'Unripe'}")

Caution: Using MSE for classification is sensitive to outliers in the data and can distort the decision boundary. For probabilistic outputs, consider the logistic loss and logistic regression.
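
For comparison, here is a minimal sketch (not part of the notes above) of logistic regression on the same mango data, fit by gradient descent on the logistic loss with labels kept as +1/-1; it outputs a probability rather than a raw score:

import numpy as np

X = np.array([[0.8, 0.3], [0.7, 0.4], [0.3, 0.8], [0.4, 0.7]])
y = np.array([1, 1, -1, -1])
X_aug = np.column_stack([np.ones(len(X)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the mean logistic loss log(1 + exp(-y * w^T x))
w = np.zeros(X_aug.shape[1])
lr = 0.5
for _ in range(2000):
    margins = y * (X_aug @ w)
    grad = -(X_aug.T @ (y * sigmoid(-margins))) / len(y)
    w -= lr * grad

new_mango = np.array([1, 0.75, 0.35])            # [bias, sugar, firmness]
print(f"P(ripe) = {sigmoid(new_mango @ w):.2f}")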


Written Notes

  • ERM: Framework for fitting models by risk minimization.
  • Hypothesis Class: Choice balances bias vs. variance.
  • Loss Function: Determines what “error” means; choose according to task.
  • Optimization: Closed-form for linear regression; iterative methods for others.
  • Classification: Linear separators work in feature space; use feature maps for nonlinearity.
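
As a small illustration of the last point (the quadratic feature map below is one arbitrary choice), fitting a linear model in the mapped feature space gives a boundary that is nonlinear in the original inputs:

import numpy as np

def phi(X):
    """Quadratic feature map: [1, x1, x2, x1^2, x2^2, x1*x2] for each row."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

# Mango data from Section 4.1
X = np.array([[0.8, 0.3], [0.7, 0.4], [0.3, 0.8], [0.4, 0.7]])
y = np.array([1, 1, -1, -1])

w = np.linalg.pinv(phi(X)) @ y                   # least squares in feature space
new = phi(np.array([[0.75, 0.35]]))              # a new mango: [sugar, firmness]
print("prediction:", np.sign(new @ w)[0])        # +1.0 means "ripe"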