Lecture: Diabetes

Assigned Reading

There is no assigned reading for this week.

Assignment

Question 1: What are the causes of diabetes? Using LASSO regression, construct a causal network that explains variation in the incidence of diabetes. In the attached data, the dependent variable is incidence of diabetes. The independent variables (body systems) were constructed based on the patient's worst diagnosis, as measured by the likelihood ratio of diabetes. The diabetes variable is calculated after all other independent variables. Keep in mind that the data set is massive and that the analysis may take hours. You can run the analysis on a 10% sample of the data. The complete list of variables is the following (if a variable is always missing, drop it from the analysis):

Variable	Description
id	ID of the patient (drop this variable)
dm	Incidence of Diabetes, 1 if there was diabetes, 0 otherwise
bs1lr	(1) Infectious & parasitic
bs2lr	(2) Neoplasms
bs3lr	(3) Endocrine, metabolic, & immunity
bs4lr	(4) Blood system
bs5lr	(5) Mental disorders
bs6lr	(6) Nervous system
bs7lr	(7) Circulatory system
bs8lr	(8) Respiratory system
bs9lr	(9) Digestive system
bs10lr	(10) Genitourinary system
bs11lr	(11) Pregnancy, childbirth
bs12lr	(12) Skin and subcutaneous tissue
bs13lr	(13) Musculoskeletal system & connective tissue
bs14lr	(14) Congenital anomalies
bs15lr	(15) Perinatal period (no data)
bs16lr	(16) Ill-defined conditions
bs17lr	(17) Injury and poisoning
bs18lr	(18) External causes of injury
bs19lr	(19) Supplemental classification
hf	(20) Health factors in VA EHRs
vcode	(21) Social and supplemental classification codes
Vlr	(21) EHR-based index of social determinants (same as bs19lr and can be dropped)
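The preparation steps described above (drop the id, drop always-missing variables, work on a 10% sample) might be sketched in pandas as follows. This is a minimal sketch, assuming the data are already loaded into a DataFrame that uses the column names in the table; the function name `prepare` and the toy rows are illustrative and are not part of the assignment files.

```python
import numpy as np
import pandas as pd

def prepare(df: pd.DataFrame, frac: float = 0.10, seed: int = 0) -> pd.DataFrame:
    """Drop unusable columns, then draw a random sample of rows."""
    # id is not a predictor; Vlr is described as the same as bs19lr.
    df = df.drop(columns=["id", "Vlr"], errors="ignore")
    # Drop any variable that is missing in every row (for example bs15lr).
    df = df.dropna(axis=1, how="all")
    # Random sample of rows; fix the seed so the sample is reproducible.
    return df.sample(frac=frac, random_state=seed)

# Toy rows for illustration only (not the course data).
toy = pd.DataFrame({
    "id": range(20),
    "dm": [0, 1] * 10,
    "bs7lr": np.linspace(0.5, 2.4, 20),
    "bs15lr": [np.nan] * 20,  # always missing
})
sample = prepare(toy)
print(sample.columns.tolist(), len(sample))
```

Sampling after the column clean-up keeps the check for always-missing variables on the full data rather than on the sample.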

(a) Report the direct predictors of diabetes and the percent of variation in diabetes explained by these variables. Include pairwise combinations of variables in your analysis. Before you run this regression, delete the entire row of data wherever the diabetes variable is missing. Missing independent variables should be set to 1 or imputed from the data. Fit a logistic LASSO regression. If using R, set the hyperparameter to 1se. If using Python, set the hyperparameter so that about 10 variables remain in the equation. Estimate the non-zero LASSO variables using the validation data set. List the variables that are parents in the Markov blanket of diabetes:

Direct Predictor	Regression Coefficient
Intercept
Variable 1
Variable 2
...

Evaluate the pseudo R-square on the validation set. A negative value indicates that the index does not predict the response accurately; in that case, change the C parameter until you find an index that does predict the response variable accurately.

Pseudo R-square

xx%
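The assignment does not say which pseudo R-square to use. The sketch below assumes McFadden's version, 1 - LL(model) / LL(null), where the null model predicts the same prevalence for every patient. It can be negative on validation data when the model predicts worse than the null. The function name `mcfadden_r2` and the toy numbers are illustrative.

```python
import numpy as np

def mcfadden_r2(y_true, p_hat, p_null=None, eps=1e-12):
    """McFadden pseudo R-square from observed 0/1 outcomes and predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Null model: one constant probability (default: prevalence in y_true).
    p0 = np.clip(y.mean() if p_null is None else p_null, eps, 1 - eps)
    ll_null = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))
    return 1.0 - ll_model / ll_null

y_val = [0, 0, 1, 1]
good = mcfadden_r2(y_val, [0.1, 0.2, 0.8, 0.9])
bad = mcfadden_r2(y_val, [0.9, 0.9, 0.1, 0.1])
print(round(good, 3), round(bad, 3))
```

With a fitted scikit-learn classifier, the predicted probabilities would come from `predict_proba(X_val)[:, 1]`.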

(b) Report the order of occurrence of the independent variables that predict the direct predictors of diabetes, as a modified table. Start with the table reporting the pairwise order of occurrence of variables. Delete from this table any column that is not a direct predictor of diabetes. Gray out any cell whose row variable does not, on average, occur before the direct predictor of diabetes listed in the column. For example, Nervous System (bs6lr) is preceded by Circulatory, Respiratory, Digestive, Genitourinary, Musculoskeletal, and v-codes. Nervous System is listed in the top column heading and all other variables are listed in rows. Each cell gives the percent of people in whom the row's system precedes Nervous System; cells for systems that precede Nervous System are white and all other cells are light gray.

Prior Variables	Later Variables
	Nervous System	Other Direct Predictors	…
Nervous System	zz%	…
Circulatory System	xx%	…
Other Variables	yy%	ww%	…
…	…	…	…
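The table modification in part (b) might be sketched as below. The layout of the order-of-occurrence file is not described here, so this assumes a square table in which rows are prior variables, columns are later variables, and each cell is the percent of people in whom the row variable occurs before the column variable. It also assumes that "on average before" means more than 50%. The function name `preceding_table` and the toy percentages are hypothetical.

```python
import pandas as pd

def preceding_table(order: pd.DataFrame, direct: list, threshold: float = 50.0):
    """Keep direct-predictor columns and flag cells that precede them."""
    kept = order[direct]          # delete columns that are not direct predictors
    precedes = kept > threshold   # True = white cell, False = gray cell
    return kept, precedes

nan = float("nan")
# Toy percentages for illustration only.
order = pd.DataFrame(
    {
        "bs6lr": [nan, 62.0, 41.0],
        "bs7lr": [38.0, nan, 55.0],
        "bs8lr": [59.0, 45.0, nan],
    },
    index=["bs6lr", "bs7lr", "bs8lr"],
)

kept, precedes = preceding_table(order, direct=["bs6lr"])
# Candidate predictors for the indirect regression on bs6lr.
candidates = precedes.index[precedes["bs6lr"]].tolist()
print(candidates)
```

The white cells in each column give the candidate independent variables for that column's indirect regression.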

(c) Report the predictors, and the percent of variation explained, for the indirect regressions. In indirect regressions, a missing independent variable should be replaced with 1 or imputed. If the response variable is missing, the entire row of data should be deleted. For each indirect regression (i.e., each regression where the response variable is not diabetes), start from the original data and drop the rows where that regression's response variable is missing. For example, when adjusting for missing values in the regressions predicting "Nervous System" or "Circulatory System", start each from the original data so that rows missing one response are not eliminated as if they were missing both.

For each indirect regression, use the temporal analysis to select the independent variables that precede the response variable. For example, these are the steps for "Nervous System (bs6lr)", a direct predictor of diabetes. First, binarize the variable: values larger than 1 are set to 1, values less than or equal to 1 are set to 0, and rows with missing values are deleted. Second, regress the binary Nervous System variable, using LASSO logistic regression, on Circulatory, Respiratory, Digestive, Genitourinary, Musculoskeletal, and v-codes. Any variable with a non-zero coefficient is drawn in the network as a parent in the Markov blanket of Nervous System. Report the non-zero coefficients and the percent of variation explained in the test data set (not in the training data set).

Independent Variables	Response Variable 1	Response Variable 2	…
Intercept
Variable 1
Variable 2
Variable 3
…
Pseudo R-squared
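The data preparation for one indirect regression could look like the sketch below: start from the original data, drop only the rows where this response is missing, binarize the response at 1, and set missing predictors to 1. The function name `indirect_design` and the toy values are illustrative assumptions. The returned X and y can then be passed to an L1-penalized logistic regression.

```python
import numpy as np
import pandas as pd

def indirect_design(df: pd.DataFrame, response: str, preceding: list):
    """Build X and y for one indirect regression from the original data."""
    # Drop rows only where THIS response is missing.
    rows = df[df[response].notna()]
    # Binarize: values above 1 become 1; values of 1 or less become 0.
    y = (rows[response] > 1).astype(int)
    # Keep only predictors that precede the response; missing values set to 1.
    X = rows[preceding].fillna(1.0)
    return X, y

# Toy values for illustration only.
original = pd.DataFrame({
    "bs6lr": [0.5, 1.0, 2.0, np.nan],
    "bs7lr": [np.nan, 1.5, 0.8, 2.0],
})
X, y = indirect_design(original, response="bs6lr", preceding=["bs7lr"])
print(y.tolist(), X["bs7lr"].tolist())
```

Calling the function once per response variable, always on the original DataFrame, avoids carrying row deletions from one regression into the next.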

(d) Report the structure and parameters of the causal network. Use the regressions to create the structure of the network. Any variable with a non-zero coefficient in a regression is a parent in the Markov blanket of that regression's response variable. There should be an arc between each of these non-zero variables and the node representing the response variable. Create a visual model of the data using Netica (if you have more than 15 variables and do not have a license for Netica, you can take an image of the structure before saving it). Provide the image as the report of the structure.

Remove cycles. There should not be any cycles, because we are regressing only on preceding variables, but there is a remote chance that some could be present.
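A quick check for cycles can be done outside Netica before drawing the network. The sketch below uses a standard topological-sort pass (Kahn's algorithm) over a list of (parent, child) arcs; any node it cannot order lies on a cycle or downstream of one. The function name `unordered_nodes` and the example arcs are illustrative.

```python
from collections import defaultdict, deque

def unordered_nodes(arcs):
    """Return nodes that cannot be topologically ordered (empty set if acyclic)."""
    children = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in arcs:
        nodes.update((parent, child))
        if child not in children[parent]:   # ignore duplicate arcs
            children[parent].add(child)
            indegree[child] += 1
    # Start from nodes with no parents and peel the graph layer by layer.
    queue = deque(n for n in nodes if indegree[n] == 0)
    ordered = set()
    while queue:
        node = queue.popleft()
        ordered.add(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return nodes - ordered

arcs = [("bs7lr", "bs6lr"), ("bs6lr", "dm"), ("bs7lr", "dm")]
print(unordered_nodes(arcs))                       # acyclic example
print(unordered_nodes(arcs + [("dm", "bs7lr")]))   # an arc that closes a cycle
```

If the returned set is not empty, which arc to drop is a judgment call; the order-of-occurrence data is one possible guide.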

Use the regressions to generate the joint distribution of the data. The joint distribution for node A is created by taking every factorial combination of the direct predictors of A and evaluating the probability of A for each combination. The factorial combinations are conveniently provided by the Table feature in the Netica software. Report the joint distribution of all nodes in Excel sheets, using a different sheet for each table. You can also report the joint distribution as part of the Netica table structures.

Predictors of Node A		Prob of Node A	1 - Prob of Node A
Node B	Node C
1	1	xx%	1 - xx%
1	0	yy%	1 - yy%
0	1	zz%	1 - zz%
0	0	ww%	1 - ww%
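A table like the one above could be filled from logistic regression coefficients as sketched below. This assumes the coefficients apply to binary (0/1) parent indicators and that there are no interaction terms; the function name `cpt` and the coefficient values are made up for illustration.

```python
from itertools import product
from math import exp

def cpt(intercept, coefs):
    """Probability of node A for every 0/1 combination of its parents."""
    names = list(coefs)
    rows = []
    # product([1, 0], ...) yields (1, 1), (1, 0), (0, 1), (0, 0) for two parents.
    for states in product([1, 0], repeat=len(names)):
        logit = intercept + sum(coefs[n] * s for n, s in zip(names, states))
        p = 1.0 / (1.0 + exp(-logit))   # inverse logit
        rows.append((states, p, 1.0 - p))
    return names, rows

# Hypothetical coefficients, not estimates from the course data.
names, rows = cpt(0.0, {"Node B": 1.0, "Node C": -1.0})
for states, p, q in rows:
    print(states, f"{p:.1%}", f"{q:.1%}")
```

Each row corresponds to one line of the table, with the two probability columns summing to 1.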

(e) Describe if social determinants of illness are direct, or indirect, causes of diabetes.

Resources for Question 1

Simulated de-identified data for model building
Simulated de-identified data for validation
Data on order of occurrence of pairs of body systems

This page is part of the course on Comparative Effectiveness by Farrokh Alemi, PhD.