| Variable | Description |
| id | ID of the patient (drop this variable) |
| dm | Incidence of Diabetes, 1 if there was diabetes, 0 otherwise |
| bs1lr | (1) Infectious & parasitic |
| bs2lr | (2) Neoplasms |
| bs3lr | (3) Endocrine, metabolic, & immunity |
| bs4lr | (4) Blood system |
| bs5lr | (5) Mental disorders |
| bs6lr | (6) Nervous system |
| bs7lr | (7) Circulatory system |
| bs8lr | (8) Respiratory system |
| bs9lr | (9) Digestive system |
| bs10lr | (10) Genitourinary system |
| bs11lr | (11) Pregnancy, childbirth |
| bs12lr | (12) Skin and subcutaneous tissue |
| bs13lr | (13) Musculoskeletal system & connective tissue |
| bs14lr | (14) Congenital anomalies |
| bs15lr | (15) Perinatal period (no data) |
| bs16lr | (16) Ill-defined conditions |
| bs17lr | (17) Injury and poisoning |
| bs18lr | (18) External causes of injury |
| bs19lr | (19) Supplemental classification |
| hf | (20) Health factors in VA EHRs |
| vcode | (21) Social and supplemental classification codes |
| Vlr | (21) EHR-based index of social determinants (same as bs19lr and can be dropped) |
(a) Report the direct predictors of diabetes and the percent of variation in diabetes explained by these variables. Include pairwise combinations of variables in your analysis. Before running this regression, delete the entire row of data wherever the diabetes variable is missing. Missing independent variables should be set to 1 or imputed from the data. Fit a logistic LASSO regression. If using R, set the hyperparameter to 1se. If using Python, set the hyperparameter so that about 10 variables remain in the equation. Estimate the non-zero LASSO coefficients using the validation data set. List the variables that are parents in the Markov blanket of diabetes:
| Direct Predictor | Regression Coefficient |
| Intercept | |
| Variable 1 | |
| Variable 2 | |
| ... |
Evaluate the Pseudo R-square using the validation set. A negative value indicates that the index does not predict the response value accurately and you need to change the C parameter to find an index that does accurately predict the response variable.
| Pseudo R-square |
| xx% |
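The assignment does not say which pseudo R-square to use. The sketch below assumes McFadden's version, 1 - LL(model) / LL(null), where the null model predicts one constant probability for everyone. The outcomes, predicted probabilities, and base rate are made up for illustration.

```python
import math

def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood of 0/1 outcomes under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def mcfadden_r2(y_true, p_pred, base_rate):
    """1 - LL(model) / LL(null); the null model predicts base_rate for everyone."""
    ll_model = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_model / ll_null

# Hypothetical validation outcomes and predicted probabilities.
y_val = [1, 0, 1, 0, 0, 1]
p_good = [0.9, 0.2, 0.8, 0.1, 0.3, 0.7]   # predictions that track the outcome
p_bad = [0.1, 0.9, 0.2, 0.8, 0.7, 0.3]    # predictions that oppose the outcome
train_rate = 0.5  # diabetes prevalence in the training set (assumed)

print(round(mcfadden_r2(y_val, p_good, train_rate), 3))
print(round(mcfadden_r2(y_val, p_bad, train_rate), 3))
```

The second call returns a negative value, which is the situation the text describes: the model predicts the validation outcomes worse than a constant, so the penalty would need to be changed.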
(b) Report the order of occurrence of independent variables in predicting direct predictors of diabetes as a modified table. Start with the table reporting the pairwise order of occurrence of variables. Delete from this table any column that is not a direct predictor of diabetes. Gray any cell whose row variable does not, on average, occur before the direct predictor of diabetes listed in the column. For example, Nervous System (bs6lr) is preceded by Circulatory, Respiratory, Digestive, Genitourinary, Musculoskeletal and v-codes. Nervous System will be listed in the top column heading and all other variables will be listed in rows. The percent of people who have one of the systems preceding Nervous System is given as the cell value; cells for variables that precede Nervous System are in white and all other cells are in light gray.
| Prior Variables | Later Variables | | |
| | Nervous System | Other Direct Predictors | … |
| Nervous System | zz% | … | |
| Circulatory System | xx% | … | |
| Other Variables | yy% | ww% | … |
| … | … | … | … |
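The table modification in (b) can be sketched as below. The pairwise order table, its percentages, and the list of direct predictors are all made up, and the 50% cut-off is only one assumed reading of "on average before".

```python
# Hypothetical pairwise order table: order[prior][later] is the percent of
# patients in whom `prior` occurred before `later` (values are made up).
order = {
    "bs6lr": {"bs6lr": None, "bs7lr": 40.0, "bs9lr": 35.0},
    "bs7lr": {"bs6lr": 60.0, "bs7lr": None, "bs9lr": 55.0},
    "bs9lr": {"bs6lr": 65.0, "bs7lr": 45.0, "bs9lr": None},
}
direct_predictors = ["bs6lr", "bs7lr"]  # assumed output of the LASSO in (a)
THRESHOLD = 50.0  # assumed meaning of "on average before"

def modified_table(order, direct, threshold=THRESHOLD):
    """Keep only direct-predictor columns; flag each cell as white or gray."""
    table = {}
    for prior, row in order.items():
        table[prior] = {}
        for later in direct:
            pct = row.get(later)
            white = pct is not None and pct > threshold
            table[prior][later] = (pct, "white" if white else "gray")
    return table

def preceding(table, later):
    """Variables whose cell in the `later` column is white."""
    return [prior for prior, row in table.items() if row[later][1] == "white"]

table = modified_table(order, direct_predictors)
print(preceding(table, "bs6lr"))  # ['bs7lr', 'bs9lr']
```

The white cells in each column give the candidate independent variables for the regression on that column's variable.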
(c) Report predictors, and percent of variation explained, for
indirect regressions.
In indirect regressions, the missing independent variable should be
replaced with 1 or imputed. If the response variable is missing then the
entire row of data should be deleted.
For each indirect regression (i.e., a regression where the response variable
is not diabetes), start from the original data and drop the rows where
that response variable is missing. For example, when adjusting for
missing values in the regressions predicting "Nervous System" and
"Circulatory System", start each one from the original data so that a row
missing one response is not also dropped from the regression on the other.
For indirect regressions,
use the temporal analysis to select independent variables that precede the
response variable. For example, these are the steps for "Nervous
System (bs6lr)", a direct predictor of diabetes. First, binarize the
variable: values larger than 1 are set to 1, values less than or equal
to 1 are set to 0, and rows with missing values are deleted. Second, the
binary Nervous System variable is LASSO logistic regressed on Circulatory,
Respiratory, Digestive, Genitourinary, Musculoskeletal and v-codes. Any
variable with a non-zero coefficient will be drawn in the network as a
parent in the Markov blanket of Nervous System. Report the non-zero
coefficients and the percent of variation explained in the test data set
(not in the training data set).
| Independent Variables | Response Variable 1 | Response Variable 2 | … |
| Intercept | |||
| Variable 1 | |||
| Variable 2 | |||
| Variable 3 | |||
| … | |||
| Pseudo R-squared | | | |
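The data preparation for each indirect regression in (c) can be sketched as below. This assumes pandas; the helper name `prepare_indirect` and the tiny data set are hypothetical. Each call starts again from the original data, so a row missing one response is still available for the other regression. The resulting `X` and `y` can then be passed to any L1-penalised logistic regression.

```python
import numpy as np
import pandas as pd

def prepare_indirect(original, response, preceding):
    """Build (X, y) for one indirect regression, starting from the original data."""
    # Row-wise deletion on this response only.
    data = original.dropna(subset=[response])
    # Binarize: values above 1 become 1, values at or below 1 become 0.
    y = (data[response] > 1).astype(int)
    # Missing independent variables are replaced with 1.
    X = data[preceding].fillna(1.0)
    return X, y

# Tiny made-up data set; column names follow the variable table.
original = pd.DataFrame({
    "bs6lr": [2.0, 0.5, np.nan, 1.0, 3.1],
    "bs7lr": [1.2, np.nan, 0.8, 2.0, np.nan],
    "bs8lr": [0.9, 1.1, 1.4, np.nan, 0.7],
})

# Nervous System regressed on variables assumed to precede it.
X6, y6 = prepare_indirect(original, "bs6lr", ["bs7lr", "bs8lr"])
# Circulatory System, again starting from the original data.
X7, y7 = prepare_indirect(original, "bs7lr", ["bs8lr"])
print(len(y6), len(y7))
```

The two regressions end up with different numbers of rows because each drops only the rows missing its own response.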
(d) Report the structure and parameters of the causal network. Use the regressions to create the structure of the data. Any variable with a non-zero coefficient in an indirect regression is a parent in the Markov blanket of the response variable. There should be an arc between each of these non-zero variables and the node representing the response variable. Create a visual model of the data using Netica (if you have more than 15 variables and do not have a Netica license, you can take an image of the structure before saving it). Provide the image as the report of the structure.
Remove cycles. There should not be any cycles, as we are regressing only on preceding variables, but there is a remote chance that some could be there.
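A check for cycles can be sketched with a depth-first search over the parent lists. The function name `find_cycle` and the example network are hypothetical. The assignment does not say which arc to remove when a cycle is found, so the removal below is only one possible choice.

```python
def find_cycle(parents):
    """Return one directed cycle as a list of nodes, or None if there is none.

    `parents` maps each node to the list of its parents (arc: parent -> node).
    In the returned list each node is followed by one of its parents.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    color = {n: WHITE for n in nodes}
    stack = []

    def visit(node):
        color[node] = GRAY
        stack.append(node)
        for parent in parents.get(node, []):
            if color[parent] == GRAY:  # back edge closes a cycle
                return stack[stack.index(parent):] + [parent]
            if color[parent] == WHITE:
                found = visit(parent)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for n in sorted(nodes):
        if color[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None

# Made-up network in which bs6lr and bs7lr are parents of each other.
network = {"dm": ["bs6lr", "bs7lr"], "bs6lr": ["bs7lr"], "bs7lr": ["bs6lr"]}
print(find_cycle(network))

# Removing one of the two arcs (here bs6lr -> bs7lr) breaks the cycle.
network["bs7lr"].remove("bs6lr")
print(find_cycle(network))
```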
Use the regressions to generate the joint distribution of the data. A joint distribution for node A is created by looking at the factorial combinations of the direct predictors of A and evaluating the probability of A for each of these combinations. The factorial combinations are conveniently provided by the Table feature in the Netica software. Report the joint distribution of all nodes in Excel sheets, using a different sheet for each table. You can also report the joint distribution as part of the Netica table structures.
| Predictors of Node A | | Prob of Node A | 1 - Prob of Node A |
| Node B | Node C | | |
| 1 | 1 | xx% | 1 - xx% |
| 1 | 0 | yy% | 1 - yy% |
| 0 | 1 | zz% | 1 - zz% |
| 0 | 0 | ww% | 1 - ww% |
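A table like the one above can be generated from a fitted logistic regression, as in the following sketch. It assumes binary parents and main effects only, and the intercept and coefficients are made up.

```python
import itertools
import math

def conditional_table(intercept, coefs):
    """Probability of a node for every 0/1 combination of its parents.

    `coefs` maps parent name -> logistic regression coefficient.
    """
    parents = list(coefs)
    rows = []
    # product([1, 0], ...) lists combinations in the order 11, 10, 01, 00.
    for values in itertools.product([1, 0], repeat=len(parents)):
        logit = intercept + sum(coefs[p] * v for p, v in zip(parents, values))
        prob = 1.0 / (1.0 + math.exp(-logit))  # inverse logit
        rows.append((values, prob, 1.0 - prob))
    return parents, rows

# Made-up coefficients for a node A with parents B and C.
parents, rows = conditional_table(-1.0, {"Node B": 0.8, "Node C": 0.5})
for values, p, q in rows:
    print(values, f"{p:.1%}", f"{q:.1%}")
```

Each printed row corresponds to one line of the table and could be written to its own Excel sheet or typed into the node's table in Netica.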
(e) Describe whether social determinants of illness are direct or indirect causes of diabetes.
Resources for Question 1
This page is part of the course on Comparative Effectiveness by Farrokh Alemi, PhD