Lecture: LASSO Regression  


Assigned Reading

  • Purpose of LASSO regression Slides►
  • Clusters of COVID-19 symptoms: Application of LASSO regression (use instructor's last name as password) Read►
  • What to do about a negative McFadden R-squared? ChatGPT►
  • Tutorial on LASSO regression Python► R► You Tube►


Question 1: Use LASSO to regress COVID-19 test results on COVID-19 symptoms. Identify the relative weight of each symptom, pair of symptoms, and triplet of symptoms. Clarify whether clusters of symptoms are more accurate than individual symptoms. Report a measure of goodness of fit and the coefficients for the regression.

Resources for Question 1:

  • Prepare the data by coding each symptom as 1 when present and 0 when absent. When the COVID-19 test result is missing, drop the case from the analysis. When a symptom is missing, replace it with its mode, which is almost always 0. Data►
  • 30 subset of IDs Data Subset►
  • Data pre-processing Python►
  • Main effect model Python►
  • How to include interaction terms in Python ChatGPT►
  • Symptom cluster model Python►
  • Dr Vang's code Python►
  • Chris Naso's Teach One Slides► Python►
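The data preparation rules listed in the resources above can be sketched in a few lines of pandas. This is a minimal illustration, not the course's linked code: the column names (`test_result`, `cough`, `fever`) and the toy rows are invented, and the sketch assumes the symptoms are already coded 0/1 with missing values stored as NaN.

```python
import pandas as pd

def prepare_symptom_data(df, outcome="test_result"):
    """Drop cases with no test result and impute missing symptoms with the mode."""
    # Drop cases whose COVID-19 test result is missing.
    clean = df.dropna(subset=[outcome]).copy()
    symptom_cols = [c for c in clean.columns if c != outcome]
    for col in symptom_cols:
        # Mode of the observed values; for most symptoms this is 0 (absent).
        mode = clean[col].mode(dropna=True)
        fill = mode.iloc[0] if not mode.empty else 0
        clean[col] = clean[col].fillna(fill)
    return clean.astype(int)

# Toy data, invented for illustration only (column names are hypothetical).
toy = pd.DataFrame({
    "test_result": [1, 0, None, 1, 0],
    "cough": [1, None, 1, 0, 0],
    "fever": [None, 0, 1, 1, 1],
})
prepared = prepare_symptom_data(toy)
print(prepared)
```

Note that the mode is computed after cases with a missing test result are dropped, so the imputed value reflects only the cases that enter the regression. Computing it before dropping is an equally defensible choice if the assignment data call for it.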

Question 2: Using LASSO logistic regressions, identify the relative weights of each symptom, pair of symptoms, and triplet of symptoms in the diagnosis of COVID-19. Make sure that you use the Polynomial function to create interaction terms for symptoms. Run the LASSO regressions with each of the following three values of the C hyper-parameter: 0.1, 0.01, and 0.001. List the non-zero coefficients for each value of C. Report the McFadden pseudo R-squared for each value of C.

Resources for Question 2:

  • COVID-19 test results and symptoms Data►
  • Count of the number of times symptoms occur together for the same person Data►
  • Percent of times the symptom listed in the row occurs before the symptom listed in the column Data►
  • Python code for repeated LASSO regressions, in order of variables' time of occurrence Python►
  • Jieun Jan's Teach One SQL►
  • Tejaswi Pulusu's Teach One Slides►

Question 3: The following provides the results from a recent LASSO regression of "symptom remission" on patients' "medical history" for patients taking 15 different antidepressants.

  1. For patients taking Bupropion, what are the 5 most important features that increase symptom remission? Ask ChatGPT whether a person with these features should take Bupropion. Report the difference between the regression results and the advice of ChatGPT.
  2. For patients taking Bupropion, what are the 5 least important features that can be used to rule out the use of Bupropion?
  3. In comparing Bupropion and Citalopram, what are the features that affect both medications? If the first 3 digits of the International Classification of Diseases (ICD) codes are the same, consider them the same feature.
  4. Suppose we can ask about the features listed in the two regressions. In what order should the questions be asked if we want to differentiate between the two medications with the fewest queries? List the first 10 questions that are most likely to resolve which of the two medications to take.
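The matching rule in item 3 (treat codes as the same feature when their first 3 characters agree) can be sketched as follows. The codes and coefficients below are invented for illustration and are not taken from the regression results; the helper names are hypothetical, and the sketch assumes each feature is keyed by a code that begins with its 3-character ICD category.

```python
def icd_category(code):
    """Return the first three characters of an ICD code, ignoring the dot."""
    return code.replace(".", "").strip().upper()[:3]

def group_by_category(coefs):
    """Group {code: coefficient} entries under their 3-character category."""
    groups = {}
    for code, beta in coefs.items():
        groups.setdefault(icd_category(code), []).append((code, beta))
    return groups

def shared_features(coefs_a, coefs_b):
    """Categories that appear with non-zero coefficients in both regressions."""
    groups_a = group_by_category({c: b for c, b in coefs_a.items() if b != 0})
    groups_b = group_by_category({c: b for c, b in coefs_b.items() if b != 0})
    common = sorted(groups_a.keys() & groups_b.keys())
    return {cat: (groups_a[cat], groups_b[cat]) for cat in common}

# Invented coefficients, for illustration only.
bupropion = {"F32.1": 0.12, "E11.9": -0.05, "I10": 0.03}
citalopram = {"F32.9": 0.08, "E11.65": 0.02, "J45.909": -0.04}

shared = shared_features(bupropion, citalopram)
for category, (in_a, in_b) in shared.items():
    print(category, in_a, in_b)
```

Keeping the original codes and coefficients under each category makes it easy to see whether a shared feature pushes remission in the same direction for both medications.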

Resources for Question 3:

Question 4 (Optional): In the upcoming project for the course, you are asked to analyze data within All of Us. Please set up the database for analysis of the impact of antidepressants on remission.

Resources for Question 4:


This page is part of the course on Comparative Effectiveness by Farrokh Alemi, PhD Home► Email►