Understanding Regularization: Ridge and Lasso Regression with Sklearn
Why Consider Alternatives to Linear Regression
Linear Regression, also known as Ordinary Least Squares (OLS), is recognized for its simplicity and widespread application in machine learning. However, it has a significant drawback: a tendency to overfit the training dataset.
For a basic scenario involving two-dimensional data, the line of best fit is the one that minimizes the sum of squared residuals (SSR). The corresponding formula is straightforward:

SSR = Σ(yᵢ − ŷᵢ)²

where yᵢ is an observed value and ŷᵢ is the value the line predicts for it.
However, as the number of predictor variables increases, OLS is free to assign excessively large coefficients in order to fit the training data. Large coefficients produce predictions that may not generalize, leading to overfitting.
Additionally, linear regression does not distinguish between meaningful and spurious features. A linear relationship between a feature and the target does not guarantee a valid model. For instance, when predicting birth rates in a town, the number of fertile women is a sensible predictor while the number of storks is irrelevant, yet both could show a misleading linear relationship with the target.
In practice, datasets often contain numerous features, some of which may be irrelevant, posing challenges for accurate predictions.
> You can experiment with the notebook for this article on Kaggle.
Understanding Bias and Variance
To comprehend how Ridge Regression addresses the aforementioned issues, we must delve into bias and variance.
Variance describes how much a model's performance changes across new (test) datasets; high variance is a sign of overfitting, causing significant discrepancies in results from one dataset to another. In contrast, bias refers to a model's inability to capture the underlying pattern in the training data, leading to poor performance on both training and test datasets.
The ideal model would possess both low bias and low variance, but achieving this balance is difficult: as model complexity grows, bias tends to fall while variance rises, so the two trade off against each other.
Model complexity is largely driven by the number of features fed into the model. Linear regression, being unbiased, fits the training data well but can suffer from excessive variance, placing it at the high-complexity, high-variance end of the spectrum.
Regularization via Ridge Regression
Ridge and Lasso Regression provide elegant solutions to the overfitting problem described above. The formula for the line of best fit remains unchanged; only the cost function is updated.
Ridge Regression introduces a new hyperparameter, lambda (λ), which adds a penalty term to the cost function:

cost = SSR + λ × Σ βⱼ²

The penalty squares each slope of the feature variables and scales the sum by λ, effectively shrinking all coefficients.
The shrinkage has dual benefits:
- It reduces the risk of overfitting by lowering coefficients. For example, a small value for λ, such as 0.1, scales down all coefficients.
- It adds a degree of bias to the otherwise unbiased linear regression, making significant features more prominent while diminishing less important ones.
Although the sum of squared residuals on the training data may increase, resulting in a poorer initial fit compared to OLS, Ridge and Lasso yield more consistent predictions in the long run. By introducing a slight bias, we significantly decrease variance.
Let's implement Ridge using Scikit-learn. The Ridge model adheres to the same API as other sklearn models. We will utilize the Ames Housing Dataset from Kaggle, working with a subset of features to predict housing prices.
Initially, we will train a LinearRegression model and assess its performance against that of Ridge using Mean Absolute Error (MAE). First, we must preprocess the data through feature scaling and handling missing values, which can be efficiently managed with a Pipeline instance.
> For more on using pipelines in sklearn, refer to this article or Kaggle notebook.
Let’s begin by fitting a Linear Regressor:
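Below is a minimal sketch of that setup, assuming the competition's `train.csv` file and a small, hypothetical subset of numeric columns (GrLivArea, OverallQual, GarageArea, TotalBsmtSF, YearBuilt); the exact feature list used here may differ. A Pipeline handles imputation and scaling before the regressor:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed file path and feature subset from the Ames Housing data on Kaggle
ames = pd.read_csv("train.csv")
features = ["GrLivArea", "OverallQual", "GarageArea", "TotalBsmtSF", "YearBuilt"]
X, y = ames[features], ames["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Impute missing values and scale features before fitting plain OLS
lr_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
])
lr_pipeline.fit(X_train, y_train)

print("Train MAE:", mean_absolute_error(y_train, lr_pipeline.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, lr_pipeline.predict(X_test)))
```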
The testing score reveals a significant discrepancy from the training score, indicating overfitting. Now, let’s implement Ridge:
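A sketch of the Ridge version, reusing the preprocessing steps and split from the previous snippet and assuming a small penalty of alpha=0.1:

```python
from sklearn.linear_model import Ridge

# Same preprocessing, but with Ridge and a small penalty instead of OLS
ridge_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=0.1)),
])
ridge_pipeline.fit(X_train, y_train)

print("Train MAE:", mean_absolute_error(y_train, ridge_pipeline.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, ridge_pipeline.predict(X_test)))
```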
The results for Ridge are nearly identical because we selected a small value for λ. When λ equals 0, Ridge reverts to standard OLS.
> Note that in the sklearn API, the hyperparameter λ is referred to as alpha; don't confuse the two.
To optimize alpha without manually testing various values, we can utilize RidgeCV, which evaluates a list of candidate alpha values using cross-validation, akin to GridSearch:
We will test alpha values ranging from 1 to 100 in increments of 5 with 10-fold cross-validation. Upon completion, we can access the optimal alpha via the .alpha_ attribute:
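A possible RidgeCV setup, again reusing the pipeline from the earlier snippets; the alpha grid and fold count follow the description above, while the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Candidate alphas from 1 to 100 in steps of 5, scored with 10-fold CV
ridge_cv_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("ridge", RidgeCV(alphas=np.arange(1, 101, 5), cv=10)),
])
ridge_cv_pipeline.fit(X_train, y_train)

# Grab the fitted RidgeCV estimator to inspect the chosen alpha
ridge = ridge_cv_pipeline.named_steps["ridge"]
```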
```python
>>> ridge.alpha_
86
```
Let’s evaluate Ridge with this hyperparameter and compare it against Linear Regression:
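One way to do that, building on the earlier snippets (the exact numbers will depend on the data and the split):

```python
# Refit Ridge with the alpha chosen by RidgeCV and compare against plain OLS
best_ridge = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=ridge.alpha_)),
])
best_ridge.fit(X_train, y_train)

print("LinearRegression test MAE:", mean_absolute_error(y_test, lr_pipeline.predict(X_test)))
print("Ridge test MAE:", mean_absolute_error(y_test, best_ridge.predict(X_test)))
```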
Even with the optimal alpha, Ridge yields results similar to Linear Regression, suggesting that this dataset may not be ideal for showcasing the capabilities of Ridge and Lasso.
Regularization Using Lasso Regression
Lasso regression shares many similarities with Ridge, with a slight alteration in the cost function: instead of squaring each coefficient, the penalty uses their absolute values:

cost = SSR + λ × Σ |βⱼ|

Let's see how this works on the built-in diamonds dataset from Seaborn:
```python
import seaborn as sns

diamonds = sns.load_dataset('diamonds')
diamonds.head()
```
Using all features, we will predict price with Lasso:
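As a sketch, the categorical columns of the diamonds dataset (cut, color, clarity) can be one-hot encoded so that every feature is numeric; the split parameters here are arbitrary choices rather than prescribed ones:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns and separate the price target
X = pd.get_dummies(diamonds.drop("price", axis=1), drop_first=True)
y = diamonds["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```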
Lasso and LassoCV are imported similarly, and we will determine the optimal alpha using cross-validation:
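A minimal LassoCV sketch, assuming 10-fold cross-validation over LassoCV's default regularization path (the exact settings may differ):

```python
from sklearn.linear_model import Lasso, LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# LassoCV picks alpha from its own regularization path via cross-validation
lasso_cv = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=10, random_state=42)),
])
lasso_cv.fit(X_train, y_train)

print("Best alpha:", lasso_cv.named_steps["lasso"].alpha_)
```

It appears that a very low alpha yields satisfactory results. Let's fit the model with this value and evaluate its performance: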
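For example, refitting with the alpha found above and scoring with MAE, as in the Ridge section:

```python
from sklearn.metrics import mean_absolute_error

best_alpha = lasso_cv.named_steps["lasso"].alpha_

# Refit Lasso with the cross-validated alpha and evaluate on the held-out set
lasso = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", Lasso(alpha=best_alpha)),
])
lasso.fit(X_train, y_train)

print("Train MAE:", mean_absolute_error(y_train, lasso.predict(X_train)))
print("Test MAE:", mean_absolute_error(y_test, lasso.predict(X_test)))
```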
Lasso regression performs quite well. A notable feature of Lasso is its ability to perform feature selection: because the penalty uses absolute values, it can shrink the coefficients of unimportant features all the way to zero. We can visualize this by plotting the fitted Lasso coefficients:
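One way to produce such a plot, using the fitted pipeline from the previous snippet (a horizontal bar chart is an arbitrary choice here; the original figure may look different):

```python
import matplotlib.pyplot as plt

# Map each fitted coefficient back to its feature name; many end up exactly zero
coefs = pd.Series(lasso.named_steps["lasso"].coef_, index=X_train.columns)

coefs.plot(kind="barh", figsize=(8, 10))
plt.title("Lasso coefficients on the diamonds dataset")
plt.xlabel("Coefficient value")
plt.tight_layout()
plt.show()
```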
As illustrated, aside from carat, which significantly influences diamond price, all other coefficients are nearly zeroed out.
This distinction sets Lasso apart from Ridge: Lasso can eliminate coefficients entirely, while Ridge can only shrink them toward zero.
Conclusion
In summary, both Ridge and Lasso are crucial regularization techniques that address the limitations of traditional linear regression by introducing a small amount of bias to lower variance, thereby preventing overfitting. The hyperparameter lambda (exposed as alpha in sklearn) controls how strongly the coefficients are penalized.
Although we touched on many aspects, we did not delve deeply into the complex mathematics behind these elegant algorithms. Below are some resources for further exploration:
- Highly recommended: StatQuest on Ridge and Lasso
- The Mathematics of Linear, Ridge, and Lasso Regression
- Understanding the Mathematics Behind Ridge and Lasso Regularization
- Exploring the Differences: Lasso Regression and Sparsity vs. Ridge Regression