Assigned Reading
- Purpose of LASSO regression
Slides►
- Tutorial on LASSO regression
Python►
R►
YouTube►
- Clusters of COVID-19 symptoms: Application of LASSO regression (use instructor's last name as password)
Read►
- What to do about a negative McFadden R-squared?
ChatGPT►
Assignment
Question 1: Predict the PCR test results for COVID-19 from age, gender, symptoms, home test results,
and pairwise and triplet combinations of these variables.
- Describe the order of occurrence of the variables. Assume that age and gender occur at birth, that home tests occur after the onset of
symptoms, and that the laboratory PCR test occurs after the home test. Establish the order in which symptoms occur by counting, for
each pair of symptoms, the number of times one symptom occurs before the other. Use the
sum of the pairwise counts of one symptom occurring before all others to establish
which symptom occurs first.
Table 1: Portion of the table showing the number of times the row variable occurs before the column variable (in parentheses, the number of times the pair of symptoms occurs)
(Gray cells indicate factors that do not occur before the column variable for the
majority of patients)
| | Swelling | Loss of Appetite | Chest Pain | Chills | Cough | … |
|---|---|---|---|---|---|---|
| Swelling | NaN | 0 (3) | 1 (3) | 0 (4) | 0 (7) | … |
| Loss of Appetite | 0 (3) | NaN | 2 (8) | 1 (14) | 1 (21) | … |
| Chest Pain | 1 (3) | 1 (8) | NaN | 2 (8) | 2 (10) | … |
| Chills | 0 (4) | 2 (14) | 4 (8) | NaN | 5 (28) | … |
| Cough | 2 (7) | 5 (21) | 3 (10) | 3 (28) | NaN | … |
| … | … | … | … | … | … | … |
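The counting rule for ordering symptoms (for each pair, count how often one symptom starts before the other, then sum each symptom's counts across all other symptoms) can be sketched in a few lines of Python. This is only an illustration: the `patients` records, the onset days, and the three symptom names are made-up stand-ins, and the layout of the course data may differ.

```python
from itertools import permutations

# Hypothetical onset days per patient; a symptom the patient never had is simply absent.
patients = [
    {"Cough": 1, "Chills": 2, "Fever": 3},
    {"Chills": 1, "Cough": 2},
    {"Cough": 1, "Fever": 1, "Chills": 4},
    {"Cough": 2, "Fever": 3},
]

symptoms = sorted({s for onset in patients for s in onset})

# before[(a, b)] = number of patients in whom a started strictly before b.
before = {pair: 0 for pair in permutations(symptoms, 2)}
for onset in patients:
    for a, b in permutations(onset, 2):
        if onset[a] < onset[b]:  # ties (same day) count for neither direction
            before[(a, b)] += 1

# Row sums: how often each symptom precedes any other symptom.
score = {a: sum(before[(a, b)] for b in symptoms if b != a) for a in symptoms}

# Symptoms with the largest row sum are taken to occur first.
order = sorted(symptoms, key=lambda s: score[s], reverse=True)
print(order)
```

Treating same-day onsets as ties that count for neither symptom is a design choice made here for the sketch, not something the assignment specifies.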
- Using logistic LASSO, regress the PCR test results on all variables that precede the PCR test
(age, gender, symptoms, and home test results), including pairs and triplets of these variables. List the
variables that are direct predictors of PCR test results. This list should include the non-zero coefficients of the logistic LASSO
regression, including coefficients for pairs or triplets of variables. Report the percent of variation explained by the LASSO
regression of PCR test results on the independent variables. Calculate and report the McFadden pseudo R-squared.
- Using LASSO, regress Fever on the variables that precede it (include main effects, pairwise combinations, and triplets of variables). Report the
independent variables that are significant (non-zero) predictors of Fever. Report the percent of variation explained by the regression.
The following resources may be helpful:
- Data
Download►
Dictionary►
- How to include interaction terms in Python
ChatGPT►
- Count of number of times symptoms occur together for the same person
Data►
- Percent of times symptom listed in the row occurs before the symptom listed in the column
Data►
- Python code for repeated LASSO regressions, in order of variables' time of occurrence
Python►
- Jieun Jan's Teach One
SQL►
- Tejaswi Pulusu's Teach One
Slides►
- Vladimir Cardenas's Answer►
R-Code►
- Dharmi Desai's Teach One YouTube►
Slides►
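One way to read "repeated LASSO regressions, in order of variables' time of occurrence" is to regress each variable on everything that comes earlier in the ordering and keep the predictors with non-zero coefficients. The sketch below follows that reading on simulated data; it is not the linked course code, and the variable names, the simulated chain of effects, and `C=0.05` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical variables listed in their assumed order of occurrence.
ordered = ["female", "cough", "chills", "fever", "home_test", "pcr"]
n = 600
data = {"female": rng.integers(0, 2, n)}
data["cough"] = rng.binomial(1, 0.4, n)
data["chills"] = rng.binomial(1, 0.2 + 0.4 * data["cough"])
data["fever"] = rng.binomial(1, 0.1 + 0.6 * data["chills"])
data["home_test"] = rng.binomial(1, 0.1 + 0.5 * data["fever"])
data["pcr"] = rng.binomial(1, 0.05 + 0.8 * data["home_test"])

# For each variable, fit an L1-penalized logistic regression on earlier variables only.
parents = {}
for i, target in enumerate(ordered):
    earlier = ordered[:i]
    if not earlier:
        parents[target] = []  # nothing precedes the first variable
        continue
    X = np.column_stack([data[v] for v in earlier])
    fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, data[target])
    parents[target] = [v for v, c in zip(earlier, fit.coef_[0]) if c != 0.0]

for target, preds in parents.items():
    print(target, "<-", preds)
```

Interaction terms are left out here to keep the loop short; they could be added to `X` before each fit.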
Question 2: Using LASSO, regress the Circulatory Body System factor on all variables that precede it (defined as variables that occur more
often before than after circulatory events). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis on a 10% sample first, before running it on the entire data set, which may take several hours.
- Plot the distribution of Circulatory Body System and note that it is bimodal.
- Recode Circulatory Body System (the dependent variable) as a binary variable: 0 for values in the left mode of the bimodal distribution
and 1 for values in the right mode. Drop from the analysis any case where Circulatory Body System is missing.
- When an independent variable is missing, impute the variable from other variables.
- Include pairwise, triple, and four-way combinations of independent variables in your analysis. Print the values of the dependent and independent variables for the first 4 rows.
- Adjust the hyper-parameter so that about 10 to 15 variables remain in the equation. Report predictors of progression of diseases in the circulatory body system.
List the variables, pairs of variables, triplets, and four-way combinations of variables that are non-zero in the LASSO regression.
- Evaluate the McFadden R-squared (percent of variation in circulatory diseases explained by other variables that occur before them).
More
- Graphical LASSO
PubMed►
- Time-varying graphical LASSO YouTube►
- Graphical LASSO more accurate than logistic regression
PubMed►
This page is part of the HAP 819 course on Advanced Statistics organized by Farrokh Alemi, PhD
Home►
Email►