1. Setting & Technical Terms
In supervised learning, we are given a dataset:
- Features: $x_i \in \mathbb{R}^d$ for $i = 1, \dots, n$.
- Labels: $y_i$, which can be real-valued (regression) or discrete (classification).
We seek a prediction function (hypothesis) $h : \mathbb{R}^d \to \mathcal{Y}$,
where $\mathcal{Y} = \mathbb{R}$ in regression or $\mathcal{Y} = \{-1, +1\}$ in classification.
Common terms:
- Training set: the data used to fit the model.
- Test set: held-out data to evaluate generalization.
- Overfitting: when the model fits the training data too closely and fails on unseen data.
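The train/test distinction above can be made concrete in a few lines of NumPy. This is a minimal sketch on synthetic data; the 80/20 split ratio and the toy arrays are illustrative choices, not part of the notes:

import numpy as np

rng = np.random.default_rng(0)
X_all = rng.normal(size=(100, 3))  # toy feature matrix
y_all = rng.normal(size=100)       # toy labels

# Shuffle indices, then hold out 20% of the data as a test set
idx = rng.permutation(len(X_all))
split = int(0.8 * len(X_all))
X_train, y_train = X_all[idx[:split]], y_all[idx[:split]]
X_test, y_test = X_all[idx[split:]], y_all[idx[split:]]
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)

The model is then fit only on the training split, and errors on both splits are compared to detect overfitting.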
2. General Pipeline
- Choose a hypothesis class $\mathcal{H}$ (e.g., linear models).
- Define a loss function $\ell(h(x), y)$ measuring prediction error.
- Minimize empirical risk over $\mathcal{H}$: find $\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$.
- Evaluate on held-out data.
Graphically:
Data --> Model Choice --> Loss & Optimization --> Trained Model --> Prediction
3. Empirical Risk Minimization – Linear Regression
We illustrate ERM using linear regression.
3.1 Step 1: Hypothesis Class
We restrict to affine functions: $h_{w,b}(x) = w^\top x + b$, with $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.
We often fold $b$ into $w$ by augmenting $x$ with a constant 1:
$\tilde{x} = (1, x_1, \dots, x_d)^\top$, so that $h_w(\tilde{x}) = w^\top \tilde{x}$.
3.2 Step 2: Loss Function
Choose the squared loss for regression: $\ell(h(x), y) = (h(x) - y)^2$.
For a dataset of $n$ points, the empirical risk is: $\hat{R}(w) = \frac{1}{n} \sum_{i=1}^{n} (w^\top x_i - y_i)^2$.
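As a quick sanity check, the empirical risk is a one-liner in NumPy. This is a minimal sketch, assuming the bias has already been folded into $X$ and $w$ as above:

import numpy as np

def empirical_risk(w, X, y):
    # Mean squared error of the linear predictor x -> w^T x over the dataset
    residuals = X @ w - y
    return np.mean(residuals ** 2)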
3.3 Step 3: Minimize Empirical Risk
We solve: $w^\star = \arg\min_{w} \hat{R}(w)$.
This is a quadratic problem with a closed-form solution. Let:
- $X \in \mathbb{R}^{n \times d}$ be the data matrix (rows are $x_i^\top$).
- $y \in \mathbb{R}^n$ be the vector of labels.
Then $w^\star = (X^\top X)^{-1} X^\top y$.
Note: the matrix $X^\top X$ must be invertible (or we add regularization).
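When $X^\top X$ is singular or ill-conditioned, a standard fix is ridge regularization, which adds $\lambda \|w\|^2$ to the empirical risk and gives $w^\star = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal sketch follows; the value of lam is an arbitrary placeholder, not taken from these notes:

import numpy as np

def ridge_solution(X, y, lam=0.1):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)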
3.4 Example: Movie Rating Prediction
Suppose you have ratings from 3 friends for a movie:
| Friend  | Rating (feature) |
|---------|------------------|
| Alice   | 8                |
| Bob     | 5                |
| Charlie | 7                |
You want to predict your own rating based on theirs. Let $x = (x_1, x_2, x_3)$ be the feature vector of their ratings. Assume the model $\hat{y} = w^\top x + b$. Using least squares on past data, we solve for $w$ by the formula above.
import numpy as np

# Example data: each row holds the three friends' ratings for one past movie
X = np.array([
    [8, 5, 7],  # Movie 1
    [7, 6, 8],  # Movie 2
    [6, 7, 6]   # Movie 3
])
y = np.array([7, 8, 6])  # Your ratings for those movies

# Add bias term (column of ones)
X_aug = np.column_stack([np.ones(len(X)), X])

# Compute optimal weights; pinv is used instead of inv because X^T X is
# singular here (4 parameters but only 3 data points)
w_star = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y
print("Optimal weights:", w_star)

# Make prediction for a new movie
new_movie = np.array([1, 8, 5, 7])  # [bias, Alice, Bob, Charlie]
prediction = new_movie @ w_star
print(f"Predicted rating: {prediction:.2f}")
4. Extension to Linear Classification
Linear models can also classify by thresholding: $\hat{y} = \mathrm{sign}(w^\top x)$.
This reduces classification to a regression output with a decision boundary at zero.
4.1 Example: Mango Ripeness
Features: sugar content $x_1$ and firmness $x_2$. We collect samples labeled $y = +1$ (ripe) or $y = -1$ (unripe). Fit $w$ by minimizing squared loss as above, then predict $\hat{y} = \mathrm{sign}(w^\top x)$:
# Example: Mango classification (numpy imported above)
X = np.array([
    [0.8, 0.3],  # Ripe mango
    [0.7, 0.4],  # Ripe mango
    [0.3, 0.8],  # Unripe mango
    [0.4, 0.7]   # Unripe mango
])
y = np.array([1, 1, -1, -1])

# Add bias term (column of ones)
X_aug = np.column_stack([np.ones(len(X)), X])

# Compute optimal weights; pinv is used instead of inv because the features
# are collinear here (sugar + firmness is constant), so X^T X is singular
w_star = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y
print("Optimal weights:", w_star)

# Make prediction for a new mango
new_mango = np.array([1, 0.75, 0.35])  # [bias, sugar, firmness]
prediction = np.sign(new_mango @ w_star)
print(f"Predicted class: {'Ripe' if prediction > 0 else 'Unripe'}")
Caution: Using MSE for classification is sensitive to outliers (and even to correctly classified points far from the boundary) and can distort the decision boundary. For better probabilistic outputs, consider the logistic loss and logistic regression.
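For completeness, here is a minimal gradient-descent sketch of logistic regression, continuing from the mango snippet above. The learning rate and iteration count are arbitrary illustrative choices; in practice a library implementation (e.g., scikit-learn's LogisticRegression) would be used:

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    # Minimize the logistic loss (1/n) * sum_i log(1 + exp(-y_i * w^T x_i))
    # by plain gradient descent; assumes labels in {-1, +1} and a bias column in X
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        margins = y * (X @ w)
        grad = -(X.T @ (y / (1 + np.exp(margins)))) / len(y)
        w -= lr * grad
    return w

w_log = fit_logistic(X_aug, y)
print("Logistic weights:", w_log)
print("Predicted class:", 'Ripe' if new_mango @ w_log > 0 else 'Unripe')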
5. Written Notes
- ERM: Framework for fitting models by risk minimization.
- Hypothesis Class: Choice balances bias vs. variance.
- Loss Function: Determines what “error” means; choose according to task.
- Optimization: Closed-form for linear regression; iterative methods for others.
- Classification: Linear separators work in feature space; use feature maps for nonlinearity (see the sketch below).
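As an illustration of the last point, a simple quadratic feature map lets the same linear ERM machinery fit nonlinear decision boundaries. This is a sketch only; the particular map is an illustrative choice, reusing X and y from the mango example:

def quadratic_features(X):
    # Map each row (x1, x2) to (1, x1, x2, x1*x2, x1^2, x2^2);
    # a linear separator in this space is a quadratic boundary in the original space
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

Phi = quadratic_features(X)
w_phi = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y
phi_new = quadratic_features(np.array([[0.75, 0.35]]))
print("Score for new mango:", phi_new @ w_phi)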