Chapter 5 (Notes)

5.2.2. Regression Metrics

  • Regression refers to a predictive modeling problem that involves predicting a continuous (numeric) value rather than a class label.
  • It is fundamentally different from classification tasks, which involve discrete labels or categories.
  • Regression models are common in real-world tasks such as:
    • Estimating prices (houses, cars, electronics)
    • Predicting drug dosages based on patient characteristics
    • Forecasting transportation demand
    • Predicting sales trends using historical and market data
  • Unlike classification, where accuracy can directly evaluate performance, regression requires error-based metrics.
  • These metrics provide an error score summarizing how close the model’s predictions are to the actual values.
  • Understanding and interpreting these scores is crucial for developing robust and interpretable regression models.
  • Four metrics are commonly used for evaluating and reporting the performance of a regression model:
    • Mean Squared Error (MSE)
    • Root Mean Squared Error (RMSE)
    • Mean Absolute Error (MAE)
    • R-squared $(R^2)$

Mean Squared Error (MSE)

  • Mean Squared Error (MSE) is one of the most widely used metrics for evaluating the performance of regression models.
  • It measures the average of the squares of the errors—that is, the average squared difference between the actual and predicted values.
  • MSE serves both as a performance metric and as the loss function minimized in least squares regression, the core of many regression algorithms (e.g., linear regression).
  • It highlights large errors, making it a good choice when penalizing significant mistakes is important.

  • Formula:
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
  • Where:
    • $n$ : Total number of data points
    • $y_i$ : Actual (true) value of the $i^{th}$ data point
    • $\hat{y}_i$ : Predicted value of the $i^{th}$ data point
  • Characteristics:
    • Always non-negative (since errors are squared).
    • Units: Squared units of the target variable, making it less interpretable in its raw form.
    • Sensitive to outliers: Squaring errors disproportionately penalizes large mistakes.
    • Interpretation: Lower MSE values indicate better model performance.
  • When to use MSE (a short computation sketch follows this list):
    • You want to penalize larger errors more heavily.
    • You’re optimizing with algorithms that rely on gradient descent.
    • You care more about overall performance than interpretability.
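
A minimal sketch of computing MSE directly from the formula and with scikit-learn's `mean_squared_error`; the arrays below are illustrative values, not data from the text.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

# MSE by the formula: mean of squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# Same value via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both 0.875 for these values
```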

Root Mean Squared Error (RMSE)

  • RMSE is the square root of the Mean Squared Error (MSE).
  • It provides a measure of the average magnitude of the prediction error, but unlike MSE, it is in the same unit as the original data.
  • RMSE gives a higher weight to large errors due to the squaring step in MSE.
  • Formula:
\[\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }\]
  • Where:
    • $n$ : Total number of data points
    • $y_i$ : Actual (true) value of the $i^{th}$ data point
    • $\hat{y}_i$ : Predicted value of the $i^{th}$ data point
  • Here, squaring handles the magnitude of errors, and the square root brings the unit back to the original scale.
  • It is more interpretable than MSE because it is on the same scale as the target.
  • Like MSE, it penalizes larger errors more heavily.
  • Lower RMSE indicates better model performance (a short computation sketch follows).
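
A minimal sketch using the same illustrative arrays: taking the square root of the MSE yields RMSE in the target's original units, which works regardless of scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

# RMSE: square root of MSE, back in the target's units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # ~0.935 for these values
```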

Mean Absolute Error (MAE)

  • MAE calculates the average absolute difference between predicted and actual values.
  • It is a linear score, meaning all errors are weighted equally in proportion to their size.
  • Formula:
\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|\]
  • Where:
    • $n$ : Total number of data points
    • $y_i$ : Actual (true) value of the $i^{th}$ data point
    • $\hat{y}_i$ : Predicted value of the $i^{th}$ data point
    • $\left| y_i - \hat{y}_i \right|$ : Absolute difference between actual and predicted values (always non-negative)
  • Easy to interpret and in the same unit as the original target.
  • Less sensitive to outliers than RMSE or MSE.
  • Good for general error analysis, especially when large errors aren’t critical (a short computation sketch follows).
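
A minimal sketch of MAE on the same illustrative arrays, both by hand and via scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

# MAE by the formula: mean of absolute residuals
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)  # both 0.75 for these values
```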

R-squared (Coefficient of Determination)

  • R-squared $(R^2)$ measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model.
  • In simpler terms, it tells us how well the model fits the data.
  • Formula:
\[R^2 = 1 - \frac{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{n} (y_i - \bar{y})^2 }\]
  • Where,
    • $y_i$ : Actual (true) value of the $i^{th}$ data point
    • $\hat{y}_i$ : Predicted value of the $i^{th}$ data point
    • $\bar{y}$: Mean of actual values
    • Numerator: Sum of squared errors (residual sum of squares, RSS)
    • Denominator: Total sum of squares (TSS)
  • Interpreting the value:

    | R-squared value | Interpretation |
    | --- | --- |
    | 1 | Perfect prediction (all variance explained) |
    | 0 | Model does not explain any variability |
    | < 0 | Model performs worse than a horizontal line (the mean baseline) |
  • A higher R-squared generally means a better fit, but not always.
  • R-squared doesn’t indicate causality, and a high value may still be misleading if the model is overfitting or improperly specified.
  • It is a relative measure: it compares the model against a baseline that always predicts the mean (a short computation sketch follows).
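
A minimal sketch computing $R^2$ from the RSS/TSS definition above and checking it against scikit-learn's `r2_score`; the arrays are again illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical predictions

# R^2 = 1 - RSS/TSS
rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - rss / tss
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)  # both ~0.724 for these values
```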

5.3. Model Validation Techniques

  • Model validation is the process of assessing how well a machine learning (ML) or artificial intelligence (AI) model performs — especially on unseen (new) data.
  • It ensures that the model:
    • Achieves its design objectives
    • Produces accurate and trustworthy predictions
    • Generalizes well beyond the training data
    • Complies with regulatory and quality standards
  • Purpose:
    • Evaluate Model Performance: Confirms that the model works not just on training data, but on real-world, unseen datasets.
    • Support Model Selection: Helps compare multiple models, choose the most appropriate one, and select optimal hyperparameters.
    • Prevent Overfitting: Assesses the risk of overfitting by observing how model performance changes on validation vs training data.
    • Build Trust: Ensures transparency and reliability, especially if validated by third-party or independent teams.
    • Comply with ML Governance: Contributes to AI governance by enforcing policies, monitoring model activity, and validating data/tooling quality.
  • Benefits:
    • Ensures Generalization
    • Guides Model Selection
    • Improves Accuracy and Robustness
    • Builds Regulatory Confidence
    • Prevents Future Failures
  • Without model validation, we risk deploying models that look good on paper but fail in practice.

Train-Test vs. K-Fold Cross Validation

5.3.1. Train-Test Split

  • The train-test split is the most basic method of model validation.
  • It involves splitting the dataset into two parts:
    • Training set: Used to train the model
    • Test set: Used to evaluate model performance on unseen data
  • Common ratios include:
    • 70% training / 30% testing
    • 80% training / 20% testing
    • 60% training / 40% testing (for very small datasets)
  • Advantages:
    • Simple and fast
    • Good for large datasets
    • Helps quickly estimate model performance
  • Disadvantages:
    • High variance: Model evaluation may vary significantly based on how the data was split
    • Risk of under- or over-estimating model performance due to random splits
    • Not ideal for small datasets
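
A minimal sketch of an 80/20 train-test split with scikit-learn; the synthetic data, the linear model, and the chosen metric are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic regression data: 100 samples, 3 features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 80/20 split; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print(mean_absolute_error(y_test, model.predict(X_test)))
```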

5.3.2. Cross Validation

  • Cross-validation is a more robust model validation technique where the model is trained and tested multiple times on different data subsets.
  • Helps to assess how the model generalizes to an independent dataset.

5.3.2.1. K-Fold Cross Validation

  • The dataset is split into K equally sized “folds”.
  • The model is trained on K−1 folds and tested on the remaining 1 fold.
  • This process is repeated K times, each time using a different fold as the test set.
  • The average performance over the K trials is used as the final evaluation metric.
\[\text{CV}_{\text{score}} = \frac{1}{K} \sum_{i=1}^{K} \text{score}_i\]
  • Where,
    • $K$: Number of folds
    • $\text{score}_i$: Evaluation metric (e.g., RMSE, MAE, R²) on the $i^{th}$ fold
  • Advantages:
    • Reduces variance in performance estimation
    • Uses the entire dataset for both training and testing
    • More reliable for small to medium-sized datasets
  • Disadvantages:
    • Computationally expensive, especially for large datasets or complex models
    • May not work well if data is not independently and identically distributed (i.i.d.)
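
A minimal sketch of 5-fold cross-validation with scikit-learn's `cross_val_score`; the synthetic data and the choice of MAE as the fold-level score are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic regression data, as in the train-test example
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# K = 5 folds; each fold serves as the test set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_absolute_error", cv=cv)
print(-scores.mean())  # average MAE across the 5 folds
```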

5.4. Hyperparameter Tuning

  • Hyperparameter tuning (also called hyperparameter optimization) is the process of searching for the best set of hyperparameters that maximizes a machine learning model’s performance on a given task.
  • Unlike model parameters (which are learned from training data), hyperparameters are set before training and control the learning process itself.
  • It determines how well a model learns and generalizes to new data.
  • Poorly tuned hyperparameters can lead to:
    • Underfitting (high bias, poor performance)
    • Overfitting (high variance, poor generalization)
  • Good tuning results in:
    • Minimized loss
    • Improved accuracy
    • Robust performance
    • Balanced bias-variance tradeoff
  • It’s essential for real-world applications in healthcare, finance, autonomous driving, etc.
  • Common hyperparameters for a neural network:
    • Learning Rate (η): Controls step size in gradient descent.
      • High → fast but unstable
      • Low → stable but slow
    • Epochs: Number of passes over entire training dataset
    • Batch Size: Number of samples per gradient update
    • Hidden Layers / Neurons: Affects capacity and depth of the network
    • Activation Function: ReLU, Sigmoid, Tanh; adds non-linearity
    • Momentum: Helps accelerate training by smoothing gradients
    • Learning Rate Decay: Reduces learning rate over time
  • Objective: Minimize loss or maximize metric score
  • Process (a minimal sketch of this loop follows the formula below):
    1. Choose a set of hyperparameters
    2. Train the model
    3. Evaluate performance (e.g., via cross-validation)
    4. Repeat until best configuration is found
  • Mathematical Representation
\[\theta^* = \arg\min_{\theta \in \Theta} \ \mathcal{L}\big(f_{\theta}(X_{\text{val}}), y_{\text{val}}\big)\]

where,

  • $\theta$ : A candidate hyperparameter configuration from the search space $\Theta$
  • $\mathcal{L}$ : Loss function, evaluated on held-out (validation) data
  • $f_{\theta}$ : Model trained with configuration $\theta$
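
A minimal sketch of the tune-evaluate loop above, assuming an illustrative ridge regression with a handful of candidate `alpha` values scored by cross-validated MSE.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(scale=0.2, size=200)

# theta* = the alpha with the lowest cross-validated loss
candidates = [0.01, 0.1, 1.0, 10.0]
best_alpha, best_mse = None, np.inf
for alpha in candidates:                # 1. choose hyperparameters
    mse = -cross_val_score(             # 2-3. train and evaluate via CV
        Ridge(alpha=alpha), X, y,
        scoring="neg_mean_squared_error", cv=5
    ).mean()
    if mse < best_mse:                  # 4. keep the best configuration
        best_alpha, best_mse = alpha, mse
print(best_alpha, best_mse)
```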

Grid Search vs. Random Search

Grid Search

  • Exhaustively tests all combinations in a hyperparameter grid
  • Best for small search spaces
  • Steps:
    • Choose the model and its hyperparameters to tune.
    • Specify a set of possible values for each hyperparameter.
    • Build a grid containing all combinations.
    • Train and validate the model for each combination.
    • Select the combination that produces the best score (e.g., lowest MSE or highest accuracy).
  • Advantages:
    • Guaranteed to find optimal combo (if within grid)
    • Simple to understand
  • Disadvantages:
    • Computationally expensive
    • Doesn’t scale well with more parameters or wider ranges
  • When to use (a sketch follows this list):
    • For small search spaces
    • When computational resources are not a constraint
    • When accuracy is critical and time is available
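
A minimal sketch with scikit-learn's `GridSearchCV`; the random-forest model, the candidate values, and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(scale=0.2, size=200)

# Grid of candidate values: all 3 x 2 = 6 combinations are evaluated
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```
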
Random Search

  • Randomly samples combinations from the search space
  • Efficient when not all hyperparameters are equally important
  • Sampling continues for a fixed number of iterations or until the computational budget is exhausted
  • Advantages:
    • Much faster than grid search
    • Can explore larger spaces with fewer evaluations
  • Disadvantages:
    • No guarantee of finding the absolute best configuration
    • Results may vary between runs (unless random seed is fixed)
    • Might miss promising combinations if not enough iterations
  • Steps:
    • Select model to tune
    • Define distributions for hyperparameters
    • Set number of iterations (e.g., 10–100)
    • Choose evaluation metric (e.g., accuracy, RMSE)
    • Randomly sample hyperparameter combinations
    • Train and validate model for each combination
    • Compare scores across all runs
    • Pick best combination
    • (Optional) Retrain model on full data
  • When to use (a sketch follows this list):
    • When hyperparameter space is large
    • When computational resources or time are limited
    • When exact optimal values are not critical, but good performance is sufficient
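
A minimal sketch with scikit-learn's `RandomizedSearchCV`; the sampling distributions, the iteration budget, and the synthetic data are illustrative choices.

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(scale=0.2, size=200)

# Distributions instead of fixed grids; n_iter caps the budget
param_distributions = {
    "n_estimators": randint(50, 300),  # integers sampled from [50, 299]
    "max_depth": randint(2, 12),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=20,
    scoring="neg_mean_squared_error",
    cv=5,
    random_state=42,  # fixed seed makes the sampled combinations reproducible
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```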