ML Basics for Econometric Analysis in R:

Running a Lasso regression

This note is inspired by a blog entry on Statology on how to perform a Lasso regression in R using the glmnet package, and adds a bit of background information on Lasso regressions in general.

In econometrics, selecting the right variables for your model is essential, especially when working with large datasets where many predictors are available. Lasso Regression (Least Absolute Shrinkage and Selection Operator) is a popular technique in machine learning that helps with variable selection by shrinking the coefficients of less important variables to zero. This makes it ideal for econometric applications where a simpler, interpretable model is valuable.

In this post, I’ll walk you through the basics of running a Lasso regression in R using a real-world dataset. Let’s get started!

But first, a bit of background information: A Lasso regression is a method that can be used to fit a regression model when multicollinearity is present in the data. Recall that the OLS regression tries to find coefficient estimates that minimize the sum of squared residuals (RSS):

RSS = Σᵢ (yᵢ − ŷᵢ)²

A Lasso regression, on the other hand, tries to minimize the following term:

Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|

with i running over the n observations, j ranging from 1 to p predictor variables, and λ ≥ 0. The second term in the equation is known as a shrinkage penalty: for λ = 0 it vanishes and we are back to OLS, while larger values of λ shrink the coefficients further, eventually to exactly zero. In a Lasso regression, we select a value for λ that produces the lowest possible test MSE (mean squared error).
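To make the objective concrete, here is a minimal base-R sketch that evaluates the Lasso objective for a candidate coefficient vector on toy data (the names x_toy, y_toy, rss, and lasso_obj are my own, and the intercept is omitted for simplicity):

# Toy data: 10 observations, 2 predictors, only the first one matters
set.seed(1)
x_toy <- matrix(rnorm(20), ncol = 2)
y_toy <- x_toy %*% c(2, 0) + rnorm(10, sd = 0.1)

# RSS and the penalized Lasso objective
rss <- function(beta) sum((y_toy - x_toy %*% beta)^2)
lasso_obj <- function(beta, lambda) rss(beta) + lambda * sum(abs(beta))

lasso_obj(c(2, 0), lambda = 1)  # penalty term adds 1 * (|2| + |0|) = 2

glmnet minimizes a scaled version of exactly this kind of objective over beta, for a whole sequence of lambda values.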

Step 1: Loading the Data and Necessary Package

For this example, I’ll use the mtcars dataset that comes with R, which contains various specifications of car models from the 1970s (similar to the auto.dta in Stata). To illustrate how to perform a Lasso regression, I will predict miles per gallon (mpg) using other variables in the dataset.

The Lasso regression is available in R's glmnet package, which I'll load first:

install.packages("glmnet")

library(glmnet)

Step 2: Preparing the Data

A Lasso regression in glmnet requires the predictors as a matrix (x) and the response as a vector (y). glmnet standardizes the predictors internally by default, which puts the penalty on a comparable scale across variables and helps interpretability:

# Response variable

y <- mtcars$mpg


# Predictor matrix, excluding the response variable

x <- as.matrix(mtcars[, -which(names(mtcars) == "mpg")])
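If you prefer to standardize explicitly, for instance to report coefficients on the standardized scale, one option is base R's scale() (the name x_std is my addition; if you go this route, you can turn off glmnet's internal scaling):

# Optional: explicit standardization of the predictors
x_std <- scale(x)  # centers each column and divides by its standard deviation

# A fit on pre-scaled data could then skip internal standardization:
# glmnet(x_std, y, alpha = 1, standardize = FALSE)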

Step 3: Running Lasso Regression

As mentioned before, the Lasso regression requires a tuning parameter, lambda, that controls the degree of shrinkage. If no values are supplied, glmnet chooses a sequence of lambda values automatically:

# Fit Lasso model

lasso_model <- glmnet(x, y, alpha = 1)  # alpha = 1 for Lasso
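To see how the coefficients shrink as lambda grows, it can be instructive to plot the coefficient paths of the fitted model (an optional extra, not part of the core workflow):

# Coefficient paths: one curve per predictor, labeled at the margin
plot(lasso_model, xvar = "lambda", label = TRUE)

Each curve shows a coefficient moving toward zero as the penalty increases, which is exactly the shrinkage behavior described above.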

Step 4: Choosing the Optimal Lambda Using Cross-Validation

To find the best lambda, I'll use cross-validation, which selects the value that minimizes the prediction error:

# Cross-validate to find optimal lambda

cv_lasso <- cv.glmnet(x, y, alpha = 1)


# Plot results

plot(cv_lasso)


# Optimal lambda value

best_lambda <- cv_lasso$lambda.min

best_lambda

The resulting plot shows the cross-validated error for each lambda. Two dashed vertical lines mark lambda.min, the value that minimizes the error, and lambda.1se, the largest lambda within one standard error of that minimum.
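If you prefer a sparser, more conservative model, lambda.1se is a common alternative to lambda.min; both are stored in the cv.glmnet object:

# More conservative alternative: largest lambda within 1 SE of the minimum
cv_lasso$lambda.1se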

Step 5: Interpreting the Results

Once I have the optimal lambda, I can extract the coefficients of the final model with the coef command:

# Coefficients with optimal lambda

coef(cv_lasso, s = "lambda.min")

This output shows which variables remain in the model (those with non-zero coefficients). Variables with coefficients set to zero are effectively excluded from the model, which makes the interpretation easier.
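Equivalently, you could refit the model at the selected lambda and read the coefficients off that object; this is just an alternative route to the same result (the name best_model is my own):

# Refit at the selected lambda and inspect the coefficients
best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(best_model)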

Step 6: Making Predictions

With the final model, I can then make predictions. For illustration, I'll predict on the original predictor matrix x; for genuinely new data, newx would hold the new observations:

# Example prediction using the Lasso model

predicted_mpg <- predict(cv_lasso, s = "lambda.min", newx = x)

head(predicted_mpg)
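To get a rough sense of model fit, one option is to compute the R-squared from these in-sample predictions; a minimal base-R sketch (the names sst, sse, and rsq are my own):

# In-sample R-squared of the Lasso fit
sst <- sum((y - mean(y))^2)        # total sum of squares
sse <- sum((predicted_mpg - y)^2)  # sum of squared errors
rsq <- 1 - sse / sst
rsq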

Wrap-up

A Lasso regression offers a powerful approach to variable selection, especially in econometrics, where interpretability is key. By shrinking irrelevant coefficients to exactly zero, the Lasso helps focus attention on the variables that matter most.