Assigned Reading
Assignment
The group project in this course is to use patients' medical history to
screen for a disease or condition. For this project you are required to
use All of Us data. You may use any statistical software.
Question 1: Register for All of Us. This step was accomplished in a previous assignment.
If you have not done all registration steps, including training, then
you will need to solve this problem quickly. This registration may take
several weeks for students who do not have a State ID. Otherwise, it
should take about 90 minutes. Also make sure that you remember your
password as there are multiple accounts set up in this process.
- Register for an account on @researchallofus.org
- Change from temporary password to a new password and record your password on paper somewhere.
- Turn on Google 2-Step Verification
- Verify your identity with Login.gov. This step requires a state ID or driver's license, a Social Security number, and
a phone that can receive text messages. There are multiple passwords to keep track of: your GMU password, your All of Us
Researcher Workbench password, your computer password, and your Google password. Keep these accounts separate and read each
message carefully to see which password is needed.
- Complete All of Us Registered Tier Training
- You do not need to get additional data access beyond registration data. George Mason University does not allow access to Controlled Tier.
- Sign the Data User Code of Conduct
Question 2: Create your cohort in All of Us. Note that a cohort and data sets are different concepts.
- Create your cohort in All of Us. Define your cohort population broadly
(e.g., all adults). Do not limit the cohort by demographics unless the
cancer clearly does not occur in certain demographic groups. Do not
limit the cohort by conditions or diseases, including the presence of
cancer, because patients both with and without the disease you want to
predict are needed.
- Create the concept for your cancer. Select as your response (outcome)
variable one or more conditions that describe the disease you want to
predict. Review how others have done this by examining PubMed
publications that define the variable of interest using EHR codes.
Alternatively, use conditions defined within All of Us to select the
right definition of your cancer.
- Create the concept of patient survival. Add an observation indicating
that the patient has died.
- Create your data sets for your cohort. Rely on EHR data only; do not
include non-EHR data or survey responses. Note that creating cancer data
sets requires creating concepts that capture the cancer in the data. In
your cohort, select demographics (age, gender, and race) and all
conditions as independent variables of interest, and include the date of
occurrence. You also need the date of occurrence of the cancer. The date
of occurrence of the response variable is the first time the
variable/condition occurred. Include the following data points in your
data sets: demographics (age at event, sex at birth/gender, race, and
ethnicity), survival, and diseases among the conditions.
Here are more detailed steps in getting ready for analysis:
- Get the dataset for patient demographics, including date of birth, race, and ethnicity.
  - Convert race to dummy variables. Drop one dummy variable to avoid multicollinearity (the dummy variable trap).
  - Convert ethnicity to dummy variables. Drop one dummy variable to avoid multicollinearity (the dummy variable trap).
  - Create the base of df_analysis from this.
- Get date_of_death from the dataset containing deceased persons, then left join it to df_analysis.
- Get date_of_first_diagnosis from the dataset containing all of your cancers, then left join it to df_analysis.
- Process the disease dataset.
  - Get the list of all of your cancer SNOMED codes and exclude them from the disease dataset.
  - Remove diseases that occurred after your cancer.
  - Create a new column, disease_group.
    - Use df_disease_grouped.csv to fill the disease_group column based on standard_concept_code.
    - Change missing values of disease_group to 0 (the catch-all disease group).
  - Binarize the disease_group column. There is no need to drop any column: disease groups are not mutually exclusive (a person can have many), so the dummy variable trap does not arise.
  - Aggregate by person_id so that the dataset has only 1 row per person_id and the binarized disease group columns indicate all the disease groups that the person has.
  - Drop all columns except person_id and the binarized columns.
  - Left join the binarized columns to df_analysis.
- Create a new column, sdoh, for social determinants of health. The value is 1 if the person_id is in the df_persons_w_sdoh.csv that I sent, otherwise 0.
- Do further data preparation, such as handling missing values.
- Check whether you should transform the data so that the assumptions of regression are met.
- You are now ready to start analysis of your data.
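The preparation steps above can be sketched in Python with pandas. This is a minimal sketch under stated assumptions, not a required implementation: the toy tables, the column name condition_start_date, and the sets cancer_codes and persons_w_sdoh are hypothetical stand-ins for your All of Us exports and for the CSV files named in the steps.

```python
import pandas as pd

# Toy stand-ins for the exported data sets; every value here is made up.
df_person = pd.DataFrame({
    "person_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1950-01-01", "1960-06-15", "1975-03-20"]),
    "race": ["White", "Black", "Asian"],
    "ethnicity": ["Not Hispanic", "Hispanic", "Not Hispanic"],
})
df_death = pd.DataFrame({"person_id": [2],
                         "date_of_death": pd.to_datetime(["2021-05-01"])})
df_cancer = pd.DataFrame({
    "person_id": [1, 1, 3],
    "condition_start_date": pd.to_datetime(["2015-02-01", "2016-07-01", "2019-09-09"]),
})
df_disease = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "standard_concept_code": ["A", "B", "A", "C", "B"],
    "condition_start_date": pd.to_datetime(
        ["2010-01-01", "2017-01-01", "2012-01-01", "2018-01-01", "2020-01-01"]),
})
df_disease_grouped = pd.DataFrame({"standard_concept_code": ["A", "B"],
                                   "disease_group": [1, 2]})
cancer_codes = {"X"}    # hypothetical SNOMED codes that define the cancer
persons_w_sdoh = {2}    # hypothetical person_ids read from df_persons_w_sdoh.csv

# Demographics: dummy-code race and ethnicity, dropping one level of each.
df_analysis = pd.get_dummies(df_person, columns=["race", "ethnicity"],
                             drop_first=True, dtype=int)

# Left join date of death and date of first diagnosis so every person is kept.
df_analysis = df_analysis.merge(df_death, on="person_id", how="left")
first_dx = (df_cancer.groupby("person_id", as_index=False)["condition_start_date"]
            .min()
            .rename(columns={"condition_start_date": "date_of_first_diagnosis"}))
df_analysis = df_analysis.merge(first_dx, on="person_id", how="left")

# Disease data: exclude the cancer codes and anything after the first diagnosis.
d = df_disease[~df_disease["standard_concept_code"].isin(cancer_codes)]
d = d.merge(first_dx, on="person_id", how="left")
before_cancer = (d["date_of_first_diagnosis"].isna()
                 | (d["condition_start_date"] < d["date_of_first_diagnosis"]))
d = d[before_cancer]

# Fill disease_group from the grouping table; unmatched codes fall into group 0.
d = d.merge(df_disease_grouped, on="standard_concept_code", how="left")
d["disease_group"] = d["disease_group"].fillna(0).astype(int)

# Binarize, then aggregate to one row per person (max = "ever had this group").
flags = pd.get_dummies(d["disease_group"], prefix="disease_group", dtype=int)
flags = flags.groupby(d["person_id"]).max().reset_index()

# Left join the flags; persons with no remaining diseases get 0 in every group.
df_analysis = df_analysis.merge(flags, on="person_id", how="left")
group_cols = [c for c in flags.columns if c != "person_id"]
df_analysis[group_cols] = df_analysis[group_cols].fillna(0).astype(int)

# Social determinants of health indicator.
df_analysis["sdoh"] = df_analysis["person_id"].isin(persons_w_sdoh).astype(int)
```

Taking the maximum of each binarized column within a person is what turns many disease rows into a single "ever had this group" indicator per person_id.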
Question 3: Identify the temporal sequence of events.
- Assume that age, gender, and race occur at birth. Assume that death occurs as the last event.
- Establish the order in which conditions occur:
  - For each pair of conditions, count the number of times one condition occurs before the other in the same person.
  - Use the pairwise counts to establish the sequence in which conditions occur.
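The pairwise counting can be sketched in plain Python as follows. This is one possible approach under stated assumptions: the input layout (a dictionary of first-occurrence dates per person), the function names, and the "net wins" ranking rule are all illustrative choices, and the example conditions and dates are made up.

```python
from collections import Counter
from datetime import date
from itertools import permutations


def pairwise_precedence(first_dates):
    """Count, over all persons, how often condition a first occurs before b.

    first_dates maps person_id -> {condition: date of first occurrence}.
    Returns a Counter keyed by (earlier_condition, later_condition).
    """
    counts = Counter()
    for conditions in first_dates.values():
        for a, b in permutations(conditions, 2):
            if conditions[a] < conditions[b]:  # same-day ties are not counted
                counts[(a, b)] += 1
    return counts


def order_conditions(counts):
    """Order conditions by how many other conditions they usually precede.

    A simple heuristic (an assumption, not a prescribed method): a condition
    scores +1 against each condition it precedes more often than it follows,
    and -1 against each condition it follows more often than it precedes.
    """
    conditions = {c for pair in counts for c in pair}

    def net_wins(c):
        return sum(
            (counts[(c, o)] > counts[(o, c)]) - (counts[(c, o)] < counts[(o, c)])
            for o in conditions
            if o != c
        )

    return sorted(conditions, key=lambda c: (-net_wins(c), c))


# Made-up example: three persons, three conditions.
first_dates = {
    1: {"hypertension": date(2010, 1, 1), "diabetes": date(2012, 1, 1),
        "ckd": date(2015, 1, 1)},
    2: {"hypertension": date(2011, 1, 1), "diabetes": date(2009, 1, 1)},
    3: {"hypertension": date(2008, 1, 1), "diabetes": date(2013, 1, 1),
        "ckd": date(2014, 1, 1)},
}
counts = pairwise_precedence(first_dates)
sequence = order_conditions(counts)
```

Here hypertension precedes diabetes in two persons and follows it in one, so it is placed earlier in the sequence.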
Question 4: Create a Causal Network for clusters of conditions in predicting your response variable. The response to
this question needs to be included in the diabetes assignments.
- Create the structure of the network:
  - Using LASSO, regress the response variable on all independent variables, and on pairwise or triple clusters of independent variables, that precede the response variable.
  - Using LASSO, regress each variable that is a direct predictor of the response/outcome variable on all preceding variables (demographics and conditions). In these regressions, statistically significant variables are parents in the Markov blanket of the regression's response variable.
  - Draw the network using Netica.
- Estimate the parameters of the network:
  - Using the LASSO regression, calculate the predicted value for all combinations of the parents in the Markov blanket of the regression's response variable.
  - Enter the parameters into Netica tables.
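The parameter step, computing a predicted value for every combination of parents, can be sketched in plain Python. This sketch assumes a logistic LASSO with binary (0/1) parents; the function name, the intercept, and the coefficients are hypothetical placeholders for the values your own fitted regression produces. Each coefficient is keyed by the tuple of parents in its term, so pairwise and triple clusters are handled as interaction terms that are active only when all their members are present.

```python
from itertools import product
from math import exp


def conditional_probability_table(intercept, coefs):
    """Predicted probability of the child for every combination of parents.

    coefs maps a tuple of parent names to a coefficient, e.g.
    ("diabetes",) for a main effect or ("diabetes", "hypertension") for a
    pairwise cluster. Assumes a logistic model with 0/1 parents.
    """
    parents = sorted({p for term in coefs for p in term})
    table = {}
    for states in product([0, 1], repeat=len(parents)):
        present = {p for p, s in zip(parents, states) if s}
        # A term contributes only when every parent in it is present.
        z = intercept + sum(b for term, b in coefs.items() if set(term) <= present)
        table[states] = 1.0 / (1.0 + exp(-z))
    return parents, table


# Hypothetical coefficients, for illustration only.
parents, table = conditional_probability_table(
    intercept=-2.0,
    coefs={
        ("diabetes",): 1.0,
        ("hypertension",): 0.5,
        ("diabetes", "hypertension"): 0.5,
    },
)
for states, p in table.items():
    print(dict(zip(parents, states)), round(p, 3))
```

Each row of the resulting table corresponds to one row of the conditional probability table for that node.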
Question 6: The report should include the following sections, each
submitted by email to the instructor at the approximate time indicated:
- The background literature review should not exceed 1 page. It should
assume a reader familiar with the literature and should not exceed
three paragraphs. The first paragraph should address the significance
of the area you are addressing, including the prevalence of the cancer
and the importance of early detection in improving outcomes of care.
The second paragraph should describe whom the US Preventive Services
Task Force recommends for screening and point out that this approach
misses important risk factors. This paragraph should list the risk
factors missed and reference articles that point to these risk factors.
The paragraph should not exceed two or three sentences but can have
numerous references. The last paragraph should discuss how your paper
provides a comprehensive review of non-genetic risk factors by
examining all of the patients' medical history. The background section
should be a brief synthesis of existing research findings related to
the problem being addressed in the study. This section is due in the
3rd week of the course.
- The method section should be a complete description of the methods.
There is no page limit, but brevity is appreciated. It should include a
paragraph or a sentence on the source of data. It should describe the
inclusion and exclusion criteria for the creation of the cohort and
compare these criteria to what has been done in the literature. It
should have a sentence or a paragraph, with citations, on the definition
of the dependent/response variable. It should have a sentence or a
paragraph on the number and definition of independent variables,
including interactions among pairs and triplets of variables. These
statements should clarify how missing values were treated and explain
what steps were taken to ensure that independent variables occur prior
to the response/dependent variable. There should be a paragraph on
feature construction. In particular, it should describe how ontological
adjustments were made to construct body systems or other features.
This work should reference relevant papers on feature construction.
There should be a sentence or two on how data were transformed to meet
the assumptions of regression. There should be a paragraph on the
analytical methods used, e.g., LASSO, and how hyperparameters were set.
The methods paragraph should describe in one sentence how the
performance of the US Preventive Services Task Force recommendations
was simulated given that some key variables were missing. The method
section is due 2 weeks after the lecture on LASSO regression.
- The results section should describe the findings. There is no page
limit. Table 1 should describe the population studied. Figures and
additional tables should summarize the statistical findings. These
should include the parameters of your model and the fit between the
model and the data. Include the fit between the US Preventive Services
Task Force model and the data. Describe the fit to the data with and
without interaction terms. Describe the fit to the data with and
without feature construction. There should not be any discussion of
findings in the results section. This section should be completed 3
weeks before the end of the course.
- The discussion section should include 4 distinct sections. There is
no page limit. The first section should be a summary of the key
findings. The second section should be a review of support for the
findings in the literature. The third section should summarize study
limitations. The last section should conclude with policy
implications. This section is due in the last week of classes.
Example Completed Assignments
All of the following projects use patients' medical history to screen
for the indicated disease. The recommendation for screening that
emerges from these projects differs radically from the recommendation of
the US Preventive Services Task Force. The Task Force has systematically
ignored how data in EHRs can be used to improve screening. The
projects listed here are pilot studies designed to understand the nature
of the data. Some of these projects explain a low percentage of the
variation, which suggests that alternative analysis is necessary.
For more complete published studies, please use PubMed. All of these
pilot projects use data from the All of Us database. All of these
projects construct network models through regression analysis.
This page is part of the course on Comparative Effectiveness by Farrokh Alemi, Ph.D.