Lecture: Diabetes  


Assigned Reading

  • Impact of neighborhood variables on Type 2 diabetes
    • No impact using statistical mediation analysis Read►
    • Some impact using Causal Networks Read►
    • Calculating time in range from multiple A1c readings Read►


Question 1: The attached data show the percent of diabetes in 2,228 counties within the United States in 2010, 2011, and 2012. We want to understand whether access to food stores affects diabetes. Create the network model using repeated LASSO regressions. The first regression is diabetes in 2012 on all 2011 variables. Each subsequent LASSO regression takes as its response (dependent) variable one of the statistically significant variables from the previous regression and regresses it on all 2010 variables. Draw the network model using Netica. Stratify the parents in the Markov blanket of diabetes in 2012, then calculate the impact of access to quality food stores in 2011 on diabetes using stratified covariate balancing. The following shows one possible model, not necessarily the model you will construct with your data. This model was organized without race and education levels higher than 1.
Network Model of food access and diabetes
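
The chain of regressions in Question 1 can be sketched in Python as follows. This is a minimal illustration and not the course's reference solution: the column names and the synthetic data are hypothetical, and a nonzero cross-validated LASSO coefficient is used as a stand-in for "statistically significant," which is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_select(df, response, predictors, seed=0):
    """Return predictors whose cross-validated LASSO coefficient is nonzero."""
    X = StandardScaler().fit_transform(df[predictors])
    model = LassoCV(cv=5, random_state=seed).fit(X, df[response])
    return [p for p, b in zip(predictors, model.coef_) if abs(b) > 1e-6]

def build_edges(df, outcome, vars_2011, vars_2010):
    """Regress the outcome on 2011 variables, then each retained 2011
    variable on all 2010 variables. Returns (parent, child) pairs."""
    parents = lasso_select(df, outcome, vars_2011)
    edges = [(p, outcome) for p in parents]
    for child in parents:
        edges += [(g, child) for g in lasso_select(df, child, vars_2010)]
    return edges

# Synthetic stand-in for the county file (hypothetical column names).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "food_2010": rng.normal(size=n),
    "income_2010": rng.normal(size=n),
})
df["food_2011"] = 0.9 * df["food_2010"] + rng.normal(scale=0.3, size=n)
df["income_2011"] = 0.9 * df["income_2010"] + rng.normal(scale=0.3, size=n)
df["diabetes_2012"] = -0.8 * df["food_2011"] + rng.normal(scale=0.3, size=n)

edges = build_edges(df, "diabetes_2012",
                    ["food_2011", "income_2011"],
                    ["food_2010", "income_2010"])
print(edges)
```

The resulting (parent, child) pairs are the arcs to draw in Netica. The parents of diabetes_2012 in this list are the variables to stratify in the next step.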

Resources for Question 1:
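
For the stratification step of Question 1, the sketch below compares exposed and unexposed counties within each combination of the parent variables and averages the within-stratum differences, weighting each stratum by its number of exposed counties. It is a simplified stand-in and may not match the exact weighting used in stratified covariate balancing as taught in the course. It assumes the parents have already been made binary, and every column name and data value is hypothetical.

```python
import pandas as pd

def stratified_effect(df, exposure, outcome, covariates):
    """Average within-stratum difference in outcome (exposed minus unexposed),
    weighted by the number of exposed rows in each stratum. Strata lacking
    either group are skipped because no comparison is possible there."""
    total_weight, weighted_sum = 0, 0.0
    for _, stratum in df.groupby(covariates):
        exposed = stratum[stratum[exposure] == 1]
        unexposed = stratum[stratum[exposure] == 0]
        if exposed.empty or unexposed.empty:
            continue
        diff = exposed[outcome].mean() - unexposed[outcome].mean()
        weighted_sum += len(exposed) * diff
        total_weight += len(exposed)
    return weighted_sum / total_weight if total_weight else float("nan")

# Tiny made-up example: one binary covariate defines two strata.
counties = pd.DataFrame({
    "good_food_access_2011": [1, 0, 1, 0, 1, 0, 1, 1],
    "high_obesity_2011":     [0, 0, 0, 0, 1, 1, 1, 0],
    "diabetes_2012":         [8.0, 10.0, 9.0, 11.0, 12.0, 15.0, 13.0, 7.0],
})
effect = stratified_effect(counties, "good_food_access_2011",
                           "diabetes_2012", ["high_obesity_2011"])
print(effect)
```

Skipping strata without both groups keeps the comparison within like-for-like counties, at the cost of discarding some data. Report how many counties are dropped when you apply this to the real file.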

Question 2: What are the causes of diabetes? Using LASSO regression, construct a causal network that explains variation in the incidence of diabetes. In the attached data, the dependent variable is incidence of diabetes. The independent variables (body systems) were constructed based on the worst diagnosis of the patient, measured by the likelihood ratio of diabetes. The diabetes variable is calculated after all the independent variables. Keep in mind that the data are massive and that the analysis may take hours. The complete list of variables follows (if a variable is always missing, drop it from the analysis):

Variable  Description
id        ID of the patient (drop this variable)
dm        Incidence of diabetes: 1 if there was diabetes, 0 otherwise
bs1lr     (1) Infectious & parasitic
bs2lr     (2) Neoplasms
bs3lr     (3) Endocrine, metabolic, & immunity
bs4lr     (4) Blood system
bs5lr     (5) Mental disorders
bs6lr     (6) Nervous system
bs7lr     (7) Circulatory system
bs8lr     (8) Respiratory system
bs9lr     (9) Digestive system
bs10lr    (10) Genitourinary system
bs11lr    (11) Pregnancy, childbirth
bs12lr    (12) Skin and subcutaneous tissue
bs13lr    (13) Musculoskeletal system & connective tissue
bs14lr    (14) Congenital anomalies
bs15lr    (15) Perinatal period (no data and should be deleted)
bs16lr    (16) Ill-defined conditions
bs17lr    (17) Injury and poisoning
bs18lr    (18) External causes of injury
bs19lr    (19) Supplemental classification
hf        (20) Health factors in VA EHRs
vcode     (21) Social and supplemental classification codes
Vlr       (21) EHR-based index of social determinants

Before you run this regression, delete every row in which the diabetes variable is missing. Missing independent variables should be set to 1 or imputed from the data. Prepare a logistic LASSO regression. If using R, set the hyperparameter to 1se. If using Python, set the hyperparameter so that about 10 variables remain in the equation. List the variables that are parents in the Markov blanket of diabetes.
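
The data preparation and the Python rule of "about 10 variables" might look like the sketch below. It is one possible approach under stated assumptions: the data are synthetic, the column names other than dm and id are placeholders, and the grid search over scikit-learn's C parameter is an illustrative choice, not a prescribed one. The example asks for 3 variables because the toy data contain only 3 real predictors; pass k=10 for the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def prepare(df, response="dm", id_col="id"):
    """Apply the missing-data rules: drop the id, drop always-missing columns,
    drop rows with a missing response, set missing predictors to 1."""
    out = df.drop(columns=[id_col], errors="ignore")
    out = out.dropna(axis=1, how="all")
    out = out.dropna(subset=[response])
    predictors = [c for c in out.columns if c != response]
    out = out.fillna({c: 1.0 for c in predictors})
    return out, predictors

def lasso_with_about_k(X, y, k=10, grid=None):
    """Scan penalty strengths and keep the fit whose count of nonzero
    coefficients is closest to k (smaller C means a stronger penalty)."""
    grid = np.logspace(-3, 1, 30) if grid is None else grid
    best = None
    for C in grid:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X, y)
        kept = int(np.sum(np.abs(model.coef_[0]) > 1e-8))
        if best is None or abs(kept - k) < abs(best[1] - k):
            best = (model, kept, C)
    return best

# Synthetic stand-in data; x0..x14 are placeholder predictor names.
rng = np.random.default_rng(1)
n, p = 2000, 15
raw = pd.DataFrame(rng.normal(size=(n, p)), columns=[f"x{i}" for i in range(p)])
logit = 1.5 * raw["x0"] - 1.5 * raw["x1"] + 1.5 * raw["x2"]
raw["dm"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)
raw["id"] = np.arange(n)
raw["empty"] = np.nan            # an always-missing column
raw.loc[:49, "dm"] = np.nan      # 50 rows with a missing response
raw.loc[100:149, "x5"] = np.nan  # missing predictor values

data, predictors = prepare(raw)
model, kept, C = lasso_with_about_k(data[predictors], data["dm"].astype(int), k=3)
parents = [c for c, b in zip(predictors, model.coef_[0]) if abs(b) > 1e-8]
print(kept, parents)
```

Setting a missing likelihood ratio to 1 treats the missing body system as uninformative, since a likelihood ratio of 1 leaves the odds of diabetes unchanged.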

In indirect regressions (i.e., regressions where the response variable is not diabetes), a missing independent variable should be replaced with 1 or imputed. If the response variable is missing, the entire row of data should be deleted. For each indirect regression, start from the original data and drop the rows where the response variable is missing. For example, when adjusting for missing values in the regressions predicting "nervous system" and "circulatory system," start from the original data each time so that rows missing one response are not eliminated as if they were missing both. For indirect regressions, use temporal analysis to select independent variables that precede the response variable. Thus, if we are predicting "external causes of injury," use only those independent variables that precede it.
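
A small sketch of these rules for indirect regressions follows. The temporal_order list is hypothetical: the assignment asks you to derive the order from temporal analysis, and the order and values shown here are made up for illustration.

```python
import pandas as pd

def indirect_dataset(original, response, temporal_order):
    """Build the table for one indirect regression, always starting from the
    original data. Rows with a missing response are dropped, only variables
    that precede the response are used, and missing predictors are set to 1."""
    position = temporal_order.index(response)
    predictors = temporal_order[:position]
    rows = original.dropna(subset=[response])
    X = rows[predictors].fillna(1.0)
    y = rows[response]
    return X, y

# Made-up values; None marks a missing likelihood ratio.
original = pd.DataFrame({
    "bs6lr":  [1.2, None, 0.8, 2.0],
    "bs7lr":  [None, 1.5, 1.1, 0.9],
    "bs18lr": [0.7, 1.3, None, 1.0],
})
order = ["bs6lr", "bs7lr", "bs18lr"]  # hypothetical temporal order

X7, y7 = indirect_dataset(original, "bs7lr", order)
X18, y18 = indirect_dataset(original, "bs18lr", order)
print(X7)
print(X18)
```

Because each call filters the original table, the first row (missing bs7lr) is dropped when predicting bs7lr but kept, with bs7lr set to 1, when predicting bs18lr.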

Use the regressions to create the structure of the data. Remove cycles. Use the regressions to generate the joint distribution of the data. Create a visual model of the data using Netica (if you have more than 15 variables and do not have a Netica license, you can take an image of the structure before saving it). Provide the image as the report of the structure. Generate the joint distribution of the direct predictors of each node using the regression equations and report these in Excel tables. You can also report these as part of Netica table structures.
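
The assignment does not say how to remove cycles. One simple heuristic, sketched below as an assumption and not as the required method, is to add arcs from strongest to weakest regression coefficient and skip any arc that would close a loop. The edge list and weights are invented for illustration.

```python
def remove_cycles(weighted_edges):
    """Greedy cycle removal: consider (parent, child, weight) arcs in order of
    decreasing absolute weight and skip any arc whose child already reaches
    its parent, since adding it would create a directed cycle."""
    graph = {}

    def reaches(start, target):
        # Depth-first search over the arcs kept so far.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph.get(node, []))
        return False

    kept = []
    for parent, child, weight in sorted(weighted_edges, key=lambda e: -abs(e[2])):
        if parent == child or reaches(child, parent):
            continue
        graph.setdefault(parent, []).append(child)
        kept.append((parent, child, weight))
    return kept

# Invented arcs: bs6lr -> bs7lr -> bs18lr -> bs6lr forms a cycle.
edges = [
    ("bs6lr", "bs7lr", 0.9),
    ("bs7lr", "bs18lr", 0.5),
    ("bs18lr", "bs6lr", 0.1),
    ("bs6lr", "dm", 0.7),
]
acyclic = remove_cycles(edges)
print(acyclic)
```

Here the weakest arc in the loop, bs18lr to bs6lr, is the one dropped. If the indirect regressions already use only predictors that precede the response, few or no arcs should need removing.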

Describe whether social determinants of illness are direct or indirect causes of diabetes.

Resources for Question 2

Question 3: Please confirm in writing that you have registered for All of Us data, have access to the data, have organized your data in the workspace, and can read the data from the workspace into Python or R code. Provide a summary of your data as part of your response to this question. Code for creating a summary of your data is available in the All of Us workspace.


For additional information (not part of the required reading), please see the following links:

  1. Pearl's direct and indirect effects Read► Web Appendix►
  2. Saeed's lecture Video►
  3. Mediation analysis allowing for exposure-mediator interactions Read►
  4. Mediation analysis through stable weights Read►
  5. Practical guide to mediation analysis through inverse odds ratio Read► Slides►
  6. Mediation analysis revisited Read►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi, PhD