Question 1: What are the causes of diabetes? Using LASSO regression, construct a causal network that explains variation in the incidence of diabetes. In the attached data, the dependent variable is the incidence of diabetes. The independent variables (body systems) were constructed based on the worst diagnosis of the patient, measured by the likelihood ratio of diabetes. The diabetes variable is calculated after all other independent variables. Keep in mind that the data set is massive and that the analysis may take hours. You can run the analysis on a 10% sample of the data. The complete list of independent variables is the following (if a variable is always missing, drop it from the analysis):
(a) Report the direct predictors of diabetes and the percent of variation in diabetes explained by these variables. Include pairwise combinations of variables in your analysis. Before you run this regression, delete the entire row of data wherever the diabetes variable is missing. Missing independent variables should be set to 1 (a likelihood ratio of 1 carries no information) or imputed from the data. Prepare a logistic LASSO regression. If using R, set the hyperparameter to 1se. If using Python, set the hyperparameter so that about 10 variables remain in the equation. Estimate the non-zero LASSO variables using the validation data set. List the variables that are parents in the Markov blanket of diabetes.
Evaluate the pseudo R-square using the validation set. A negative value indicates that the index does not predict the response variable accurately; in that case, change the C parameter until you find an index that does predict the response variable accurately.
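The Python route in part (a) can be sketched as follows. This is a minimal illustration on synthetic data, not the assignment's data or a required implementation. It assumes scikit-learn is available, uses McFadden's definition of pseudo R-square (the assignment does not say which definition to use), and takes the validation-set base rate as the null model. The helper names `add_pairwise_terms`, `fit_lasso_with_target_size`, and `mcfadden_r2` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures


def add_pairwise_terms(X):
    """Append all pairwise products of columns (no squares, no constant column)."""
    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    return poly.fit_transform(X)


def fit_lasso_with_target_size(X_train, y_train, target=10, c_grid=None):
    """Scan C and keep the L1 model whose non-zero count is closest to `target`."""
    if c_grid is None:
        c_grid = np.logspace(-4, 1, 30)  # assumed search range
    best = None
    for c in c_grid:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=c)
        model.fit(X_train, y_train)
        gap = abs(int(np.count_nonzero(model.coef_)) - target)
        if best is None or gap < best[0]:
            best = (gap, c, model)
    return best[1], best[2]


def mcfadden_r2(model, X_val, y_val):
    """1 - LL(model) / LL(null), both evaluated on the validation set."""
    p_model = model.predict_proba(X_val)[:, 1]
    ll_model = -log_loss(y_val, p_model, normalize=False)
    p_null = np.full(len(y_val), np.mean(y_val))  # assumed null: validation base rate
    ll_null = -log_loss(y_val, p_null, normalize=False)
    return 1.0 - ll_model / ll_null


# Synthetic stand-in for the real data: 6 predictors, 3 of them informative.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(2000, 6))
logit = 1.5 * X_raw[:, 0] - 1.0 * X_raw[:, 1] + 0.8 * X_raw[:, 2]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X = add_pairwise_terms(X_raw)  # 6 main effects + 15 pairwise terms
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

chosen_c, lasso = fit_lasso_with_target_size(X_tr, y_tr, target=3)
n_selected = int(np.count_nonzero(lasso.coef_))
r2_val = mcfadden_r2(lasso, X_val, y_val)
print(f"C={chosen_c:.4g}, non-zero terms={n_selected}, validation pseudo R2={r2_val:.3f}")
```

For the real data, `target` would be about 10, and a negative `r2_val` would be the signal to try a different C.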
(b) Report the order of occurrence of independent variables in predicting direct predictors of diabetes as a modified table. Start with the table reporting the pairwise order of occurrence of variables. Delete from this table any column that is not a direct predictor of diabetes. Gray out any cell whose row variable does not, on average, occur before the direct predictor of diabetes listed in the column. For example, Nervous System (bs6lr) is preceded by Circulatory, Respiratory, Digestive, Genitourinary, Musculoskeletal and v-codes. Nervous System will be listed in the top column heading and all other variables will be listed in rows. The percent of people that have one of the systems preceding Nervous System is given as the cell value; cells for variables that precede Nervous System are white and cells for all other variables are light gray.
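The table modification in part (b) could be sketched as below. This is only an illustration: the percentages are made up, the variable names other than the idea of a Nervous System column are hypothetical, and the rule "more than 50% means the row variable occurs on average before the column variable" is an assumption that the assignment does not state.

```python
def modified_order_table(pairwise, direct_predictors, threshold=50.0):
    """Keep only direct-predictor columns and tag each cell white or gray.

    pairwise[row][col] is taken to be the percent of people in whom `row`
    occurs before `col` (an assumed reading of the pairwise table).
    """
    table = {}
    for row, cells in pairwise.items():
        table[row] = {}
        for col in direct_predictors:
            if col == row or col not in cells:
                continue  # no cell for a variable against itself
            pct = cells[col]
            shade = "white" if pct > threshold else "gray"
            table[row][col] = (pct, shade)
    return table


# Made-up percentages for three hypothetical variables.
pairwise = {
    "circulatory": {"nervous": 62.0, "digestive": 48.0},
    "digestive": {"nervous": 55.0, "circulatory": 52.0},
    "nervous": {"circulatory": 38.0, "digestive": 45.0},
}
shaded = modified_order_table(pairwise, direct_predictors=["nervous"])
for row, cells in shaded.items():
    print(row, cells)
```

The white cells in a column are the candidate predictors for that column's indirect regression.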
(c) Report predictors, and percent of variation explained, for
In indirect regressions, a missing independent variable should be replaced with 1 or imputed. If the response variable is missing, the entire row of data should be deleted.
For each indirect regression (i.e., a regression where the response variable is not diabetes), start from the original data and drop the rows where the response variable is missing. For example, when predicting "Nervous System" or "Circulatory System", make the missing-value adjustments for each regression starting from the original data, so that rows dropped because one response is missing are not also dropped for the other.
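The per-regression handling of missing values can be sketched with pandas as follows. The column names and values are hypothetical, and `prepare_for_response` is an illustrative helper, not part of the assignment's materials. The key point it shows is that each regression starts again from the original data frame.

```python
import numpy as np
import pandas as pd


def prepare_for_response(original, response, predictors):
    """Drop rows missing the response, then set missing predictors to 1.

    Works on a filtered copy, so `original` is left untouched and can be
    reused for the next response variable.
    """
    rows = original.dropna(subset=[response])
    X = rows[predictors].fillna(1.0)
    y = rows[response]
    return X, y


# Hypothetical likelihood-ratio columns with scattered missing values.
data = pd.DataFrame({
    "nervous": [2.0, np.nan, 0.5, 1.5],
    "circulatory": [np.nan, 3.0, 1.2, np.nan],
    "digestive": [0.8, np.nan, np.nan, 1.1],
})

X_n, y_n = prepare_for_response(data, "nervous", ["circulatory", "digestive"])
X_c, y_c = prepare_for_response(data, "circulatory", ["nervous", "digestive"])
print(len(X_n), len(X_c))
```

Had the rows been dropped cumulatively, only the single row where both responses are present would have survived for both regressions.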
(d) Report the structure and parameters of the causal network. Use the regressions to create the structure of the data. Any non-zero variable in an indirect regression is a parent in the Markov blanket of the response variable. There should be an arc between these non-zero variables and the node representing the response variable. Create a visual model of the data using Netica (if you have more than 15 variables and do not have a license for Netica, you can take an image of the structure before saving it). Provide the image as the report of the structure.
Remove cycles. There should not be any cycles, because each regression uses only preceding variables, but there is a remote chance that some exist.
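One way to check for and remove cycles is sketched below in plain Python. The assignment does not say which arc to drop when a cycle is found; removing the arc with the smallest absolute coefficient is an assumption made here for illustration, and the arcs and weights are made up.

```python
def find_cycle(arcs):
    """Return the arcs of one directed cycle, or None if the graph is acyclic."""
    children = {}
    for parent, child in arcs:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
    state = {node: 0 for node in children}  # 0 unvisited, 1 on current path, 2 done
    path = []

    def visit(node):
        state[node] = 1
        path.append(node)
        for nxt in children[node]:
            if state[nxt] == 1:  # back edge closes a cycle
                loop = path[path.index(nxt):] + [nxt]
                return list(zip(loop, loop[1:]))
            if state[nxt] == 0:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in children:
        if state[node] == 0:
            found = visit(node)
            if found:
                return found
    return None


def remove_cycles(weighted_arcs):
    """Repeatedly drop the weakest arc of a cycle until no cycle remains."""
    kept = dict(weighted_arcs)
    removed = []
    while True:
        cycle = find_cycle(list(kept))
        if cycle is None:
            return kept, removed
        weakest = min(cycle, key=lambda arc: abs(kept[arc]))
        removed.append(weakest)
        del kept[weakest]


# Made-up arcs (parent, child) with made-up coefficient magnitudes.
arcs = {
    ("circulatory", "nervous"): 0.9,
    ("nervous", "diabetes"): 1.2,
    ("nervous", "digestive"): 0.4,
    ("digestive", "circulatory"): 0.1,
}
kept, removed = remove_cycles(arcs)
print("removed:", removed)
```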
Use the regressions to generate the joint distribution of the data. The joint distribution for node A is created by taking every factorial combination of the direct predictors of A and evaluating the probability of A for each combination. The factorial combinations are conveniently provided by the Table feature in the Netica software. Report the joint distribution of all nodes in Excel sheets, using a different sheet for each table. You can also report the joint distribution as part of the Netica table structures.
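The factorial table for a single node can be sketched as follows. This assumes binary parents (present or absent) and a logistic model with main effects only; the intercept and coefficients are made up, and `conditional_table` is a hypothetical helper. Pairwise terms, if they survive the LASSO, would need to be added to the linear predictor.

```python
import itertools
import math


def conditional_table(intercept, coefficients):
    """Probability of the node for every 0/1 combination of its parents."""
    parents = list(coefficients)
    rows = []
    for states in itertools.product([0, 1], repeat=len(parents)):
        z = intercept + sum(coefficients[p] * s for p, s in zip(parents, states))
        prob = 1.0 / (1.0 + math.exp(-z))  # logistic link
        rows.append((*states, prob))
    return parents, rows


# Made-up regression output for one node with two parents.
parents, rows = conditional_table(-2.0, {"nervous": 1.0, "circulatory": 0.5})
print(parents + ["P(node=1)"])
for row in rows:
    print(row)
```

Each node's rows can then be written to its own sheet or typed into the corresponding Netica table.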
(e) Describe whether social determinants of illness are direct or indirect causes of diabetes.
Resources for Question 1
Question 2: Please confirm in writing that you have registered for All of Us data, have access to the data, have organized your data into the workspace, and can read the data from the workspace into Python or R code. Provide a summary of your data as part of the response to this question. Code for creating a summary of your data is available in the All of Us workspace.