Why do we need regularization?
A machine learning model often encounters overfitting. Overfitting occurs when a model has high accuracy on the training data but low accuracy on the test data; that is, the model does not generalize well. To overcome overfitting, we add a regularization penalty to our loss function.
L1 Regularization (Lasso Regression)
L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), adds the sum of the absolute values of the model coefficients as a penalty term to the loss function:

Loss_L1 = Loss + λ Σ |w_i|

where λ is the regularization parameter that controls the strength of the penalty and w_i represents the model coefficients.
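As a minimal sketch of this objective, assuming a plain linear model (y_hat = X @ w) with mean-squared-error as the base loss; the function name and data values below are made up purely for illustration:

```python
import numpy as np

# Sketch of the L1-penalized loss for a linear model with an MSE base loss.
def l1_regularized_loss(X, y, w, lam):
    mse = np.mean((y - X @ w) ** 2)      # original loss
    penalty = lam * np.sum(np.abs(w))    # λ * Σ |w_i|
    return mse + penalty

# Tiny made-up example values.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2])
print(l1_regularized_loss(X, y, w, lam=0.1))
```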
L2 Regularization (Ridge Regression)
L2 regularization, also known as Ridge regression, adds the sum of the squared magnitudes of the model coefficients as a penalty term to the loss function:

Loss_L2 = Loss + λ Σ w_i²

where λ is the regularization parameter and w_i represents the model coefficients.
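Under the same assumptions as the L1 sketch above (linear model, MSE base loss), the L2-penalized objective looks like this:

```python
import numpy as np

# Sketch of the L2-penalized loss for a linear model with an MSE base loss.
def l2_regularized_loss(X, y, w, lam):
    mse = np.mean((y - X @ w) ** 2)   # original loss
    penalty = lam * np.sum(w ** 2)    # λ * Σ w_i²
    return mse + penalty

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2])
print(l2_regularized_loss(X, y, w, lam=0.1))
```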
Effect of λ
If λ = 0, there is no penalty at all, so the model behaves exactly like the unregularized one and we overfit.
If λ is very large, the penalty dominates and the influence of the original loss decreases, so we underfit the data.
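A rough way to see this is to sweep λ and compare train and test scores. The sketch below uses scikit-learn's Ridge on an assumed synthetic dataset; alpha plays the role of λ, and the exact numbers will vary:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# Small synthetic problem, purely for illustration.
X, y = make_regression(n_samples=60, n_features=30, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# λ ≈ 0 approximates no regularization; very large λ swamps the data loss.
for lam in [1e-6, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    print(f"λ={lam:>8}: train R²={model.score(X_tr, y_tr):.3f}, "
          f"test R²={model.score(X_te, y_te):.3f}")
```

Typically the smallest λ gives a near-perfect train score but a poor test score (overfitting), while the largest λ drags both scores down (underfitting).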
When should you use L1 regularization?
L1 regularization can drive the coefficients of unimportant features to exactly zero, effectively performing feature selection. So if you have high-dimensional data with many irrelevant features, it is better to choose L1 regularization.
If you prefer a model with a smaller number of non-zero coefficients, L1 regularization is ideal. Sparse models are easier to interpret and can be more computationally efficient.
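To make the sparsity concrete, here is a sketch (synthetic high-dimensional data, assumed purely for illustration) comparing how many coefficients Lasso and Ridge leave non-zero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 200 features, only 10 of which actually carry signal.
X, y = make_regression(n_samples=100, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Non-zero coefficients (Lasso):", np.sum(lasso.coef_ != 0))
print("Non-zero coefficients (Ridge):", np.sum(ridge.coef_ != 0))
# Lasso typically keeps only a handful of coefficients;
# Ridge keeps all 200 (merely shrunk toward zero).
```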
When should you use L2 regularization?
If features are collinear or multicollinear, the weight vector can change arbitrarily from one fit to another, so the individual w_i cannot be used to determine feature importance. L2 regularization can help: it distributes the weight across all correlated features, reducing the impact of multicollinearity and providing more stable estimates of the coefficients.
If you think your model might overfit, L2 regularization is usually a good default choice.
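As a rough illustration of the stability point, the sketch below fits ordinary least squares and Ridge on two nearly identical features; the data is made up, and the exact coefficients will vary, but OLS tends to split the weight erratically between the two copies while Ridge keeps both modest:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.001, size=200)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS may assign one large positive and one large negative coefficient;
# Ridge typically gives each copy roughly half of the true weight.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```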
Why does L1 regularization drive some unimportant features to zero?
Consider a gradient-descent update where λ and w_i are both small. The gradient of the L2 penalty λw_i² with respect to w_i is 2λw_i, so when w_i is already small the shrinkage step is tiny and w_i barely changes from one iteration to the next. The gradient of the L1 penalty λ|w_i| is λ·sign(w_i), a constant-size step no matter how small w_i is, so it keeps reducing w_i by the same amount every iteration until it reaches zero. Hence L1 regularization drives the coefficients of unimportant features to zero.
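A toy gradient-descent view makes this visible. The sketch below applies only the penalty gradients (the base-loss gradient is ignored) to one small weight; the learning rate and λ values are arbitrary:

```python
# Toy comparison of the L2 and L1 penalty updates on a small weight.
eta, lam = 0.1, 0.1   # learning rate and regularization strength (arbitrary)
w_l2 = 0.05           # the weight under L2
w_l1 = 0.05           # the same weight under L1

for step in range(1, 6):
    w_l2 -= eta * (2 * lam * w_l2)        # L2 gradient 2λw: step shrinks as w shrinks
    w_l1 = max(w_l1 - eta * lam, 0.0)     # L1 gradient λ·sign(w): constant step, clipped at 0
    print(f"step {step}: L2 w = {w_l2:.4f}, L1 w = {w_l1:.4f}")

# L2 leaves the weight at a small non-zero value;
# L1 shrinks it all the way down to (essentially) zero.
```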