Lecture: LASSO Regression  


Assigned Reading

  • Purpose of LASSO regression Slides►
  • Tutorial on LASSO regression Python► R► YouTube►
  • Clusters of COVID-19 symptoms: Application of LASSO regression (use instructor's last name as password) Read►
  • What to do about a negative McFadden R-squared?  ChatGPT►


Question 1: Predict the PCR test results for COVID-19 from age, gender, symptoms, home test results, and pairwise and triplet combinations of these variables.

  • Describe the order of occurrence of the variables. Assume that age and gender occur at birth, that the home test occurs after the onset of symptoms, and that the laboratory PCR test occurs after the home test. Establish the order in which symptoms occur by counting, for each pair of symptoms, the number of times one symptom occurs before the other. For each symptom, sum the pairwise counts of that symptom occurring before each of the others, and use these sums to establish which symptom occurs first.

Table 1: Portion of the table showing the number of times the row variable occurs before the column variable (in parentheses: number of pairs of symptoms occurring)
(Gray cells indicate factors that do not occur before the column variable for the majority of patients)

                    Swelling   Loss of Appetite   Chest Pain   Chills    Cough
Swelling            NaN        0 (3)              1 (3)        0 (4)     0 (7)
Loss of Appetite    0 (3)      NaN                2 (8)        1 (14)    1 (21)
Chest Pain          1 (3)      1 (8)              NaN          2 (8)     2 (10)
Chills              0 (4)      2 (14)             4 (8)        NaN       5 (28)
Cough               2 (7)      5 (21)             3 (10)       3 (28)    NaN
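
The pairwise counting behind Table 1 can be sketched in Python as follows. This is a minimal illustration, not the course's reference solution: it assumes the data hold one onset day per symptom per patient (missing when the symptom never occurred), and the function name `precedence_counts` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def precedence_counts(onset: pd.DataFrame):
    """Return (before, pairs).

    before.loc[a, b]: number of patients in whom symptom a started before b.
    pairs.loc[a, b]:  number of patients who reported both a and b.
    The diagonal stays NaN, as in Table 1.
    """
    cols = list(onset.columns)
    before = pd.DataFrame(np.nan, index=cols, columns=cols)
    pairs = pd.DataFrame(np.nan, index=cols, columns=cols)
    for a in cols:
        for b in cols:
            if a == b:
                continue
            both = onset[a].notna() & onset[b].notna()
            pairs.loc[a, b] = both.sum()
            before.loc[a, b] = (onset.loc[both, a] < onset.loc[both, b]).sum()
    return before, pairs


# Hypothetical toy data: day of symptom onset for four patients (NaN = never occurred).
onset = pd.DataFrame({
    "Cough":  [1, 2, np.nan, 1],
    "Chills": [2, 1, 3, np.nan],
    "Fever":  [3, 3, 1, 2],
})

before, pairs = precedence_counts(onset)

# Row sums: how often each symptom precedes any other symptom.
# A larger sum suggests the symptom tends to occur earlier.
order = before.sum(axis=1).sort_values(ascending=False)
print(before)
print(order)
```

With real data, the same row sums give the ordering of symptoms used to decide which variables precede a given outcome.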


  • Using logistic LASSO, regress the PCR test results on all variables that precede the PCR test (age, gender, symptoms, and home test results), including pairwise and triplet combinations of these variables. List the variables that are direct predictors of PCR test results. The list should give the non-zero logistic regression coefficients, including those for pairs or triplets of variables. Report the percent of variation explained by the LASSO regression of PCR test results on the independent variables. Calculate and report the McFadden pseudo R-squared.
  • Using LASSO, regress Fever on the variables that precede it (include main effects, pairwise combinations, and triplets of variables). Report the independent variables that are significant (non-zero) predictors of Fever. Report the percent of variation explained by the regression.


Question 2: Using LASSO, regress the Circulatory Body System factor on all variables that precede it (defined as variables that occur more often before than after circulatory events). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis first on a 10% sample before running it on the entire data set, which may take several hours.

  1. Plot the distribution of the Circulatory Body System variable and note that it is bimodal.
  2. Recode the Circulatory Body System (the dependent variable) as a binary variable: 0 for values in the left mode of the bimodal distribution and 1 for values in the right mode. Drop from the analysis any row where the Circulatory Body System is missing.
  3. When an independent variable is missing, impute the variable from other variables.
  4. Include pairwise, triplet, and four-way combinations of the independent variables in your analysis. Print the values of the dependent and independent variables for the first 4 rows.
  5. Adjust the hyper-parameter so that about 10 to 15 variables remain in the equation. Report the predictors of progression of diseases in the circulatory body system. List the variables, pairs, triplets, and four-way combinations of variables that are non-zero in the LASSO regression.
  6. Evaluate the McFadden R-squared (percent of variation in circulatory diseases explained by other variables that occur prior to it).


This page is part of the HAP 819 course on Advanced Statistics organized by Farrokh Alemi, PhD Home► Email►