## Adjustments to Regression
## Overview
- Transformations
- Adjustments for missing values
- Regressions with rare independent variables PubMed►
- Construction of new features
## Learning Objectives
In this session, we review some of the traditional methods of adjusting regressions and then pose additional questions about improving the accuracy of regression model fitting. For some of these questions we do not have a reasonable answer, so this session is more speculative than other sessions in the class. After completing the activities in this module, you should be able to:
- Adjust for missing values in the dependent and independent data
- Reason through patterns of missing values
- Include redundant variables in regression
## Adjustments for Missing Values
Several approaches are available for replacing missing values so that a complete data matrix (with no variable missing information) can be obtained. None of the existing approaches works when there are extensive non-random missing values.
## Assignments
Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.
- Remove all instances where incidence of diabetes is not known. Report the number of cases that remain.
- For each independent variable, define an indicator that is 1 when the variable is missing and 0 otherwise. Set the independent variable to 0 if it is missing.
- Use LASSO to regress diabetes on all independent variables, pairs of independent variables, and triplets of independent variables. Report the McFadden R-squared and the coefficients for the variables.
- Use LASSO to regress diabetes on all indicators of missing variables, pairs of indicators, and triplets of indicators. Report the McFadden R-squared and the coefficients for the variables.
- Regress diabetes on all independent variables, the indicators of missingness, and pairwise and triple combinations of these variables. Report the McFadden R-squared.
- Data Download► Dictionary►
- Yili Lin's existing approaches to missing values R-code► Results►
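The coding steps in the missing-value questions above can be sketched as follows. This is a minimal illustration in Python of the logic only (the assignment itself asks for R), and it makes several assumptions that are not in the course materials: the column names `diabetes`, `age`, and `bmi` are hypothetical, missing values are represented as `None`, and the helper names `add_missing_indicators`, `add_interactions`, and `mcfadden_r2` are invented for this example. The LASSO fit itself is not shown; the sketch only prepares the predictors and computes the McFadden R-squared, defined as 1 minus the ratio of the model log-likelihood to the log-likelihood of an intercept-only model, from predicted probabilities.

```python
import math
from itertools import combinations


def add_missing_indicators(rows, predictors):
    """Add a <name>_missing flag per predictor and set missing values to 0.

    A value counts as missing when it is None or the key is absent.
    The input rows are not modified.
    """
    coded = []
    for row in rows:
        new_row = dict(row)
        for name in predictors:
            is_missing = row.get(name) is None
            new_row[name + "_missing"] = 1 if is_missing else 0
            if is_missing:
                new_row[name] = 0
        coded.append(new_row)
    return coded


def add_interactions(rows, names, max_order=3):
    """Add product terms for every pair and triplet of the named columns."""
    expanded = []
    for row in rows:
        new_row = dict(row)
        for order in range(2, max_order + 1):
            for combo in combinations(names, order):
                new_row["*".join(combo)] = math.prod(row[c] for c in combo)
        expanded.append(new_row)
    return expanded


def mcfadden_r2(y, p):
    """McFadden R-squared from 0/1 outcomes y and predicted probabilities p.

    Assumes every probability is strictly between 0 and 1 and that y
    contains both outcomes, so that all logarithms are defined.
    """
    def log_lik(probs):
        return sum(math.log(q) if yi == 1 else math.log(1 - q)
                   for yi, q in zip(y, probs))

    base_rate = sum(y) / len(y)  # intercept-only model predicts the mean
    return 1 - log_lik(p) / log_lik([base_rate] * len(y))


# Hypothetical two-patient data set, for illustration only.
rows = [
    {"diabetes": 1, "age": 60, "bmi": None},
    {"diabetes": 0, "age": None, "bmi": 22.0},
]
coded = add_missing_indicators(rows, ["age", "bmi"])
expanded = add_interactions(coded, ["age", "bmi", "age_missing", "bmi_missing"])
```

Note that after zero-filling, the product of a variable with its own missingness indicator is always 0, so those terms carry no information and a LASSO fit would be expected to drop them.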
- Regress Y on the independent variables with and without the rare variables. What is the meaning of the error messages?
- What conclusion can you reach regarding the effectiveness of rare variables?
- What conclusion can you reach about rare variable combinations that correspond exactly to positive Y values?
- Data Download►
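One thing worth checking before interpreting the output of the rare-variable regressions is whether any rare variable occurs only among cases with a single outcome value. Assuming Y is binary and the model is a logistic regression (an assumption, since the questions do not say so), such a variable has no finite maximum-likelihood coefficient, and fitting software commonly reports convergence or fitted-probability warnings. The sketch below is in Python for illustration only; the function name `separating_predictors` and the column names `Y`, `rare_a`, and `common_b` are hypothetical.

```python
def separating_predictors(rows, outcome, predictors):
    """Flag binary predictors whose exposed cases all share one outcome value.

    Returns a dict mapping predictor name to a tuple
    (shared outcome value, number of exposed cases).
    """
    flagged = {}
    for name in predictors:
        exposed = [row[outcome] for row in rows if row[name] == 1]
        if exposed and len(set(exposed)) == 1:
            flagged[name] = (exposed[0], len(exposed))
    return flagged


# Hypothetical data: rare_a appears only in cases where Y is 1.
rows = [
    {"Y": 1, "rare_a": 1, "common_b": 1},
    {"Y": 1, "rare_a": 1, "common_b": 0},
    {"Y": 0, "rare_a": 0, "common_b": 1},
    {"Y": 1, "rare_a": 0, "common_b": 0},
    {"Y": 0, "rare_a": 0, "common_b": 0},
]
flagged = separating_predictors(rows, "Y", ["rare_a", "common_b"])
```

Here `flagged` contains only `rare_a`, with outcome value 1 and two exposed cases. The same check can be applied to a product of several rare variables to examine rare combinations.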
- For the "Breast and Endocrine Structures" body system, create a feature that scores the worst diagnosis within this body system. Here is a list of all body systems within SNOMED CT and their associated codes. Social determinants of diseases See Bard
- Identify diseases that are part of "Breast and Endocrine Structures". You can do this using Bard, but make sure that you insist it provides a complete list. I have done so and posted it below. Create a binary variable for each disease, coded 1 if the patient has the disease and 0 otherwise. Excel►
- Regress your cancer on the members of this body system and estimate the coefficients associated with these diseases. In this regression you do not need any variables besides these diseases, and you do not need to include combinations of diseases. Your cancer is the response variable; the independent variables are binary variables, one for each of these diseases. The relationship between the coefficients of a logistic regression and the odds ratio of the response variable is described in the following document. Read►
- Create a new variable in All of Us database called "Worst of Breast and Endocrine Structures Body System" or "Worst of Endocrine", where the patient is assigned the exponential of the variable with the highest coefficient.
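The feature-construction steps above can be sketched as follows, in Python for illustration only. The sketch rests on assumptions that are not in the course materials: the disease names and coefficient values are made up, the function names `encode_diseases` and `worst_of_body_system` are hypothetical, "the variable with the highest coefficient" is read as the highest coefficient among the diseases the patient actually has, and a patient with none of the diseases is assigned 1.0 (the exponential of 0). For a binary predictor in a logistic regression, the exponential of its coefficient is the odds ratio associated with that predictor, so the feature can be read as the largest disease-specific odds ratio for the patient.

```python
import math


def encode_diseases(patient_diseases, disease_list):
    """Binary coding: 1 if the patient has the disease, 0 otherwise."""
    return {d: 1 if d in patient_diseases else 0 for d in disease_list}


def worst_of_body_system(patient_diseases, coefficients, default=1.0):
    """Exponential of the largest coefficient among the patient's diseases.

    Returns `default` when the patient has none of the listed diseases
    (an assumption made for this sketch).
    """
    present = [coefficients[d] for d in patient_diseases if d in coefficients]
    if not present:
        return default
    return math.exp(max(present))


# Made-up logistic regression coefficients, for illustration only.
coefficients = {"disease_a": 0.8, "disease_b": -0.2, "disease_c": 1.5}

patient = {"disease_a", "disease_b"}
indicators = encode_diseases(patient, list(coefficients))
score = worst_of_body_system(patient, coefficients)  # exp(0.8)
```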
## More
For additional information (not part of the required reading), please see the following links:
- Open introduction to statistics Read►
This page is part of the HAP 819 course on Advance Statistics and was organized by Farrokh Alemi PhD