Question 1: The attached data show the percent of diabetes in 2,228 counties within the
United States in 2010, 2011, and 2012. We want to
understand if access to food stores affects diabetes. Create the network model
using data from repeated LASSO regressions. The
first regression will be diabetes in 2012 on all 2011 variables.
The other LASSO regressions will each have as the response (dependent) variable one of the
statistically significant variables from the previous regression,
regressed on all 2010 variables. Draw the network model using Netica. List
the parents in the Markov blanket of diabetes in 2012, and calculate the impact of access
to quality food stores in 2011 on diabetes using stratified
covariate balancing. The following shows one possible model and not necessarily the model
you will construct with your data. This model was organized without
race and education levels higher than 1.
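As a rough illustration of the stratified covariate balancing step, the sketch below assumes the covariates (the parents in the Markov blanket) have been coded as binary indicators. It compares counties with and without food store access inside each stratum of identical covariate values, then weights the stratum differences by the number of counties with access. The column names and toy numbers are hypothetical, and this weighting rule is an assumption for illustration, not a prescribed method.

```python
import pandas as pd

# Toy data; all column names and values are hypothetical placeholders.
counties = pd.DataFrame({
    "food_access_2011":  [1, 1, 0, 0, 1, 0, 0, 0],
    "high_poverty_2011": [0, 0, 0, 0, 1, 1, 1, 1],
    "diabetes_2012":     [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})


def stratified_impact(df, treatment, outcome, covariates):
    """Average within-stratum difference in the outcome between treated
    and untreated rows, weighted by the number of treated rows."""
    rows = []
    for _, stratum in df.groupby(list(covariates)):
        treated = stratum[stratum[treatment] == 1]
        control = stratum[stratum[treatment] == 0]
        if treated.empty or control.empty:
            continue  # no overlap in this stratum, so it cannot be compared
        rows.append({
            "n_treated": len(treated),
            "diff": treated[outcome].mean() - control[outcome].mean(),
        })
    if not rows:
        raise ValueError("no stratum contains both treated and untreated rows")
    strata = pd.DataFrame(rows)
    return (strata["diff"] * strata["n_treated"]).sum() / strata["n_treated"].sum()


# Unadjusted comparison, which mixes the effect of access with poverty.
with_access = counties[counties["food_access_2011"] == 1]["diabetes_2012"]
without_access = counties[counties["food_access_2011"] == 0]["diabetes_2012"]
naive = with_access.mean() - without_access.mean()

balanced = stratified_impact(
    counties, "food_access_2011", "diabetes_2012", ["high_poverty_2011"]
)
print("unadjusted difference:", round(naive, 3))
print("stratified difference:", balanced)  # -2.0 for this toy data
```

In the toy data the unadjusted difference overstates the impact because counties with food store access are also less often high-poverty; stratifying removes that imbalance.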
Resources for Question 1:
Question 2: What are the causes of diabetes? Using LASSO regression, construct a causal network for explaining variation in the incidence of diabetes. In the attached data, the dependent variable is incidence of diabetes. The independent variables (body systems) were constructed based on the worst diagnosis of the patient, measured by the likelihood ratio of diabetes. The diabetes variable is calculated after all other independent variables. Keep in mind that the data are massive and that the analysis may take hours. The complete list of independent variables is the following (if a variable is always missing, drop it from the analysis):
Before you do this
regression, delete the entire row of data whenever the
diabetes variable is missing. Missing independent variables should be
set to 1 or imputed from the data. Prepare a logistic LASSO regression.
If using R, set the hyperparameter to 1se. If using Python, set the
hyperparameter so that about 10 variables remain in the equation. List the variables that are
parents in the Markov blanket of diabetes.
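For the Python route, one way to keep about 10 variables is to scan a grid of penalty strengths and keep the fit whose count of non-zero coefficients is closest to 10. The sketch below assumes scikit-learn is available and uses synthetic data; the variable names, the grid, and the helper select_about_k are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the course data: 30 predictors, 10 of which matter.
rng = np.random.default_rng(0)
n_rows, n_vars = 2000, 30
X = rng.normal(size=(n_rows, n_vars))
true_coef = np.zeros(n_vars)
true_coef[:10] = 1.0
probability = 1.0 / (1.0 + np.exp(-(X @ true_coef)))
y = (rng.random(n_rows) < probability).astype(int)
names = [f"x{i}" for i in range(n_vars)]  # hypothetical variable names


def select_about_k(X, y, names, k=10, grid=None):
    """Fit L1-penalized logistic regressions over a grid of C values and
    return the C and variable list whose size is closest to k."""
    if grid is None:
        grid = np.logspace(-3, 1, 30)  # arbitrary grid; smaller C = stronger penalty
    best_C, best_kept = None, None
    for C in grid:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X, y)
        kept = [name for name, b in zip(names, model.coef_[0]) if b != 0]
        if best_kept is None or abs(len(kept) - k) < abs(len(best_kept) - k):
            best_C, best_kept = C, kept
    return best_C, best_kept


best_C, parents = select_about_k(X, y, names)
print("chosen C:", best_C)
print("variables retained:", parents)
```

The retained variables are the candidate parents of diabetes. On the real data, each fit may be slow, so a coarser grid may be a reasonable starting point.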
In indirect regressions, a missing independent variable should be replaced with 1 or imputed. If the response variable is missing, then the entire row of data should be deleted. For each indirect regression (i.e., a regression where the response variable is not diabetes), start from the original data and drop the rows where the response variable is missing. For example, when predicting "nervous system" or "circulatory system," adjust for missing values by starting from the original data each time, so that rows missing the response for one regression are not also eliminated from the other. For indirect regressions, use the temporal analysis to select independent variables that precede the response variable. Thus, if we are predicting "external causes of injury," then use only those independent variables that precede it.
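The data preparation for each indirect regression can be sketched as follows. The example assumes pandas, a hypothetical temporal order of four variables, and made-up values; the helper prepare is not part of the assignment materials.

```python
import numpy as np
import pandas as pd

# Hypothetical temporal order, earliest first; the real order comes from
# the temporal analysis of your own data.
order = ["nervous_system", "circulatory_system", "external_injury", "diabetes"]

# Made-up values standing in for the original data.
original = pd.DataFrame({
    "nervous_system":     [1.2, np.nan, 0.8, 1.5, np.nan],
    "circulatory_system": [np.nan, 2.0, 1.1, np.nan, 0.9],
    "external_injury":    [0.7, 1.3, np.nan, 1.0, 1.1],
    "diabetes":           [1, 0, 1, np.nan, 0],
})


def prepare(original, response, order):
    """Build X and y for one regression, always starting from the original data."""
    predictors = order[: order.index(response)]   # only variables that precede the response
    data = original.dropna(subset=[response])     # delete rows with a missing response
    X = data[predictors].fillna(1.0)              # missing predictors are set to 1
    y = data[response]
    return X, y


X_injury, y_injury = prepare(original, "external_injury", order)
X_circ, y_circ = prepare(original, "circulatory_system", order)
print(X_injury)
print(X_circ)
```

Because prepare never modifies original, rows dropped for one response remain available for the next regression.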
Use the regressions to create the structure of the data. Remove cycles. Use the regressions to generate the joint distribution of the data. Create a visual model of the data using Netica (if you have more than 15 variables and do not have a license for Netica, you can take an image of the structure before saving it). Provide the image as the report of the structure. Generate the joint distribution of the direct predictors of each node using the regression equation and report these in Excel tables. You can also report these as part of Netica table structures.
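The assignment does not prescribe a rule for removing cycles. One possible rule, sketched below in plain Python, is to find a directed cycle and drop its weakest edge (smallest absolute coefficient), repeating until no cycle remains. The edge names and weights are hypothetical.

```python
def find_cycle(edges):
    """Return the edges of one directed cycle as a list, or None if acyclic."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, []).append(child)

    state = {}  # node -> "open" while on the current path, "done" afterwards
    path = []

    def visit(node):
        state[node] = "open"
        path.append(node)
        for child in graph.get(node, []):
            if state.get(child) == "open":
                # The child is already on the current path, so we closed a loop.
                loop = path[path.index(child):] + [child]
                return list(zip(loop, loop[1:]))
            if child not in state:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for node in list(graph):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


def remove_cycles(weights):
    """Drop the weakest edge of each cycle until the graph is acyclic.

    weights maps (parent, child) to the absolute regression coefficient.
    """
    kept = dict(weights)
    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept
        weakest = min(cycle, key=kept.__getitem__)
        del kept[weakest]


# Hypothetical edges taken from regression output.
weights = {
    ("obesity", "inactivity"): 0.4,
    ("inactivity", "obesity"): 0.1,
    ("obesity", "diabetes"): 0.6,
    ("inactivity", "diabetes"): 0.3,
}
acyclic = remove_cycles(weights)
print(sorted(acyclic))
```

If the temporal order of the variables is known, dropping every edge that points backward in time is an alternative that never needs a cycle search.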
Describe whether social determinants of illness are direct or indirect causes of diabetes.
Resources for Question 2
Question 3: Please confirm in writing that you have registered for All of Us data, have access to the data, have organized your data into the workspace, and can read the data from the workspace into Python or R code. Provide a summary of your data as part of the response to this question. Code for creating a summary of your data is available in the All of Us workspace.
For additional information (not part of the required reading), please see the following links: