1. Setting & Technical Terms

In supervised learning, we are given a dataset:

  • Features: x_i ∈ R^d for i = 1, ..., n.
  • Labels: y_i, which can be real-valued (regression) or discrete (classification).

We seek a prediction function (hypothesis)

  f : R^d → Y,

where Y = R in regression or Y = {-1, +1} in classification.

Common terms:

  • Training set: the data used to fit the model.
  • Test set: held-out data to evaluate generalization.
  • Overfitting: when the model fits the training data too closely and fails on unseen data.
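
To make the train/test distinction and overfitting concrete, here is a minimal sketch on synthetic 1-D data (the polynomial degrees and noise level are illustrative choices, not from the notes above). The flexible degree-9 fit typically achieves lower training error but higher test error, i.e., it overfits.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship
x = rng.uniform(0, 1, size=30)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=30)

# Hold out the last 10 points as a test set
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on the training set only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")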

2. General Pipeline

  1. Choose a hypothesis class H (e.g., linear models).
  2. Define a loss function ℓ(f(x), y) measuring prediction error.
  3. Minimize empirical risk over H: find f* ∈ argmin_{f ∈ H} (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i).
  4. Evaluate on held-out data.

Graphically:

Data --> Model Choice --> Loss & Optimization --> Trained Model --> Prediction
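
The same steps can be written as a short numpy sketch; the synthetic data and the use of plain gradient descent are illustrative choices (Section 3 below uses a closed-form solution instead).

import numpy as np

rng = np.random.default_rng(1)

# Data (synthetic, for illustration)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.3 * rng.normal(size=100)

# Model choice: linear model f(x) = w^T x
w = np.zeros(3)

# Loss & optimization: gradient descent on the mean squared error
lr = 0.1
for _ in range(500):
    grad = (2 / len(X)) * X.T @ (X @ w - y)
    w -= lr * grad

# Trained model --> prediction, evaluated on held-out data
X_test = rng.normal(size=(20, 3))
y_test = X_test @ true_w + 0.3 * rng.normal(size=20)
print("test MSE:", np.mean((X_test @ w - y_test) ** 2))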

3. Empirical Risk Minimization – Linear Regression

We illustrate ERM using linear regression.

3.1 Step 1: Hypothesis Class

We restrict the hypothesis class to affine functions:

  f(x) = w^T x + b,  with w ∈ R^d and b ∈ R.

We often fold b into w by augmenting x with a constant 1:

  x_aug = (1, x_1, ..., x_d) ∈ R^{d+1},  so that  f(x) = w^T x_aug  with w = (b, w_1, ..., w_d) ∈ R^{d+1}.
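
As a quick sanity check of the augmentation trick (the numbers below are arbitrary), the augmented inner product reproduces w^T x + b:

import numpy as np

x = np.array([8.0, 5.0, 7.0])            # original feature vector
w = np.array([0.2, 0.3, 0.4])            # illustrative weights
b = 0.5                                  # illustrative bias

x_aug = np.concatenate(([1.0], x))       # augment x with a constant 1
w_aug = np.concatenate(([b], w))         # fold the bias into the weight vector

print(w @ x + b)       # 6.4
print(w_aug @ x_aug)   # 6.4 -- the same value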

3.2 Step 2: Loss Function

Choose the squared loss for regression:

  ℓ(f(x), y) = (f(x) - y)^2.

For a dataset of n points, the empirical risk is:

  R(w) = (1/n) Σ_{i=1}^n (w^T x_i - y_i)^2.
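
For instance, the empirical risk is a one-liner in numpy (the data and weights below are made up; rows of X are already augmented with the leading 1):

import numpy as np

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])               # rows are augmented x_i
y = np.array([3.0, 4.0, 6.0])            # here y_i = 1 + x_i exactly

def empirical_risk(w):
    return np.mean((X @ w - y) ** 2)

print(empirical_risk(np.array([1.0, 1.0])))   # 0.0 -- perfect fit
print(empirical_risk(np.array([0.5, 1.0])))   # 0.25 -- worse weights, higher risk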

3.3 Step 3: Minimize Empirical Risk

We solve:

  w* = argmin_w (1/n) Σ_{i=1}^n (w^T x_i - y_i)^2.

This is a quadratic problem with a closed-form solution. Let:

  • X ∈ R^{n×(d+1)} be the data matrix (rows are the augmented x_i^T).
  • y ∈ R^n be the vector of labels.

Then

  w* = (X^T X)^{-1} X^T y.

Note: the matrix X^T X must be invertible (or we add regularization).
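
As a quick numerical sanity check (on synthetic, well-conditioned data, so X^T X is invertible), the normal-equations formula agrees with numpy's built-in least-squares solver:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                    # 50 samples, 4 features
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y     # closed-form solution
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least-squares solver

print(np.allclose(w_closed, w_lstsq))           # True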

3.4 Example: Movie Rating Prediction

Suppose you have ratings from 3 friends for a movie:

  Friend    Rating (feature)
  Alice     8
  Bob       5
  Charlie   7

You want to predict your own rating based on theirs. Let x = (x_Alice, x_Bob, x_Charlie) be the feature vector of your friends' ratings. Assume the model f(x) = w^T x + b. Using least squares on past data, we solve for w by the formula above.

import numpy as np
 
# Example data
X = np.array([
    [8, 5, 7],  # Movie 1
    [7, 6, 8],  # Movie 2
    [6, 7, 6]   # Movie 3
])
y = np.array([7, 8, 6])  # Your ratings
 
# Add bias term
X_aug = np.column_stack([np.ones(len(X)), X])
 
# Compute least-squares weights. With only 3 movies and 4 parameters,
# X_aug.T @ X_aug is singular, so use the pseudoinverse rather than inv.
w_star = np.linalg.pinv(X_aug) @ y
print("Optimal weights:", w_star)
 
# Make prediction for new movie
new_movie = np.array([1, 8, 5, 7])  # [bias, Alice, Bob, Charlie]
prediction = new_movie @ w_star
print(f"Predicted rating: {prediction:.2f}")

4. Extension to Linear Classification

Linear models can also classify by thresholding the linear output:

  f(x) = sign(w^T x + b).

This reduces classification to a regression output with a decision boundary at zero.

4.1 Example: Mango Ripeness

Features: sugar content x_1 and firmness x_2. We collect samples labeled y = +1 (ripe) or y = -1 (unripe). Fit w by minimizing squared loss as above, then predict the class as sign(w^T x_aug):

import numpy as np

# Example: Mango classification
X = np.array([
    [0.8, 0.3],  # Ripe mango
    [0.7, 0.4],  # Ripe mango
    [0.3, 0.8],  # Unripe mango
    [0.4, 0.7]   # Unripe mango
])
y = np.array([1, 1, -1, -1])
 
# Add bias term
X_aug = np.column_stack([np.ones(len(X)), X])
 
# Compute least-squares weights. The feature columns are linearly dependent
# here (sugar + firmness = 1.1 for every sample), so X_aug.T @ X_aug is
# singular and we use the pseudoinverse rather than inv.
w_star = np.linalg.pinv(X_aug) @ y
print("Optimal weights:", w_star)
 
# Make prediction for new mango
new_mango = np.array([1, 0.75, 0.35])  # [bias, sugar, firmness]
prediction = np.sign(new_mango @ w_star)
print(f"Predicted class: {'Ripe' if prediction > 0 else 'Unripe'}")

Caution: Using MSE for classification is sensitive to outliers in the data and can distort the decision boundary. For probabilistic outputs, consider the logistic loss and logistic regression.
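
For comparison, here is a minimal sketch (not part of the notes above) of logistic regression on the same mango data, fit by gradient descent on the logistic loss with labels kept as +1/-1; it outputs a probability rather than a raw score:

import numpy as np

X = np.array([[0.8, 0.3], [0.7, 0.4], [0.3, 0.8], [0.4, 0.7]])
y = np.array([1, 1, -1, -1])
X_aug = np.column_stack([np.ones(len(X)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the mean logistic loss log(1 + exp(-y * w^T x))
w = np.zeros(X_aug.shape[1])
lr = 0.5
for _ in range(2000):
    margins = y * (X_aug @ w)
    grad = -(X_aug.T @ (y * sigmoid(-margins))) / len(y)
    w -= lr * grad

new_mango = np.array([1, 0.75, 0.35])            # [bias, sugar, firmness]
print(f"P(ripe) = {sigmoid(new_mango @ w):.2f}")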


Written Notes

  • ERM: Framework for fitting models by risk minimization.
  • Hypothesis Class: Choice balances bias vs. variance.
  • Loss Function: Determines what “error” means; choose according to task.
  • Optimization: Closed-form for linear regression; iterative methods for others.
  • Classification: Linear separators work in feature space; use feature maps for nonlinearity.
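
As a small illustration of the last point (the quadratic feature map below is one arbitrary choice), fitting a linear model in the mapped feature space gives a boundary that is nonlinear in the original inputs:

import numpy as np

def phi(X):
    """Quadratic feature map: [1, x1, x2, x1^2, x2^2, x1*x2] for each row."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

# Mango data from Section 4.1
X = np.array([[0.8, 0.3], [0.7, 0.4], [0.3, 0.8], [0.4, 0.7]])
y = np.array([1, 1, -1, -1])

w = np.linalg.pinv(phi(X)) @ y                   # least squares in feature space
new = phi(np.array([[0.75, 0.35]]))              # a new mango: [sugar, firmness]
print("prediction:", np.sign(new @ w)[0])        # +1.0 means "ripe"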