- US Task Force on Prevention
- Data mining as an alternative method for screening for diseases
The group project in this course is to use patients' medical histories
to screen for a disease or condition. You are required to use All of Us
data for this project, and you may use any statistical software.
Question 1: Select the disease you are
predicting, organize your project team, and email the instructor the list
of people involved.
- Select a condition that you wish to
model, e.g., lung cancer, breast cancer, diabetes, or depression.
Groups should select different conditions. Check that the condition
you want to work on has not already been selected by another group. Send an
email to the entire class announcing your selection to reserve your
group's right to the selected condition.
- You can team up with up to 2 other people, so the maximum team
size, including you, is 3. You can also work in a group of 2 or do
the project by yourself. The choice is yours. The
project does not change with the group size: you need to do the entire
project whether you work alone or in a group. Choose
your teammates carefully, so that you do not complain about
them later. Many students complain of unequal contributions to team
projects. I advise you not to team up with students who do not have access to
All of Us data, as that will significantly delay you.
Question 2: Register for All of Us. This step was accomplished in a previous assignment.
If you have not completed all registration steps, including training, then
you will need to resolve this quickly. Registration may take
several weeks for students who do not have a state ID. Otherwise, it
should take about 90 minutes. Also make sure that you remember your
passwords, as multiple accounts are set up in this process.
- Register for an account on @researchallofus.org
- Change from temporary password to a new password and record your password on paper somewhere.
- Turn on Google 2-Step Verification
- Verify your identity with Login.gov. This step requires a state ID or driver's license, a Social Security number, and
a phone that can receive text messages. There are multiple passwords to keep track of: your GMU password, your Research Workbench password on All of
Us, your computer password, and your Google password. Keep these accounts separate and read the messages
carefully to see which password is needed.
- Complete All of Us Registered Tier Training
- You do not need data access beyond the Registered Tier. George Mason University does not allow access to the Controlled Tier.
- Sign the Data User Code of Conduct
Question 3: Create your cohort in All of Us.
Email your instructor when this has been done and report the
number of people in the cohort and their demographics.
- Sign into your account on All of Us.
- Define your population broadly (e.g. all adults). Include people
with, and without, the disease you want to predict. Do not
restrict the analysis to individuals with the disease.
- Select as your response (outcome) variable one or more conditions
that describe the disease you want to focus on. Review how
earlier researchers have done this by examining PubMed publications that
define the variable of interest using EHR codes. The date of
occurrence of the response variable is the first time the
variable/condition occurs in the person's record.
- In your cohort, select demographics (age, gender, and race) and
all conditions as independent variables of interest. No survey
responses are needed for independent variables. Rely only on EHR data.
Include date of occurrence.
- Request that All of Us construct your database. This will
take some time.
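The "first time the condition has occurred" rule above can be illustrated with a minimal sketch. It assumes the exported data can be read as rows of (person, condition, date); the function name `first_occurrences` and the sample rows are hypothetical and are not part of All of Us.

```python
from datetime import date

def first_occurrences(records):
    """Return {person_id: {condition: earliest date}} from (person_id, condition, date) rows."""
    first = {}
    for person_id, condition, when in records:
        person = first.setdefault(person_id, {})
        # Keep only the earliest date seen for this person and condition
        if condition not in person or when < person[condition]:
            person[condition] = when
    return first

# Hypothetical rows for illustration only, not All of Us data
records = [
    (1, "diabetes", date(2015, 3, 1)),
    (1, "hypertension", date(2012, 6, 9)),
    (1, "diabetes", date(2013, 1, 20)),
    (2, "hypertension", date(2018, 4, 2)),
]
first = first_occurrences(records)
```

Person 2 has no diabetes entry in the result, which is how people without the disease remain in the cohort.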
Question 4: Identify the temporal sequence of events.
The response to this question needs to be provided when addressing COVID
- Assume that age, gender, and race occur at birth. Assume that
death occurs as the last event.
- Establish the order in which conditions occur
- For each pair of conditions, count the number of times one
condition occurs before the other in the same person.
- Use the pairwise count of one condition occurring before
another to establish a sequence of occurrence of conditions.
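The pairwise counting described above can be sketched as follows. The input is assumed to be a mapping from each person to the first date of each condition. The assignment does not say how to turn the counts into a sequence, so ranking by net "came first" counts in `order_conditions` is only one possible rule, and all names and sample data here are hypothetical.

```python
from collections import Counter
from datetime import date
from itertools import permutations

def pairwise_before_counts(first):
    """Count how many persons have condition a first occurring before condition b."""
    counts = Counter()
    for person in first.values():
        for a, b in permutations(person, 2):
            if person[a] < person[b]:  # same-day occurrences count for neither order
                counts[(a, b)] += 1
    return counts

def order_conditions(counts):
    """Rank conditions by net count: times it came first minus times it came second."""
    score = Counter()
    for (a, b), n in counts.items():
        score[a] += n
        score[b] -= n
    # Highest net score first; ties are broken alphabetically
    return sorted(score, key=lambda c: (-score[c], c))

# Hypothetical first-occurrence dates for three people
first_dates = {
    1: {"obesity": date(2010, 1, 1), "hypertension": date(2012, 1, 1), "diabetes": date(2015, 1, 1)},
    2: {"obesity": date(2011, 1, 1), "diabetes": date(2014, 1, 1)},
    3: {"hypertension": date(2009, 1, 1), "obesity": date(2013, 1, 1), "diabetes": date(2016, 1, 1)},
}
before = pairwise_before_counts(first_dates)
sequence = order_conditions(before)
```

With real data the counts will often conflict across people, so it is worth inspecting the pairwise table itself and not only the final ranking.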
Question 5: Create a Causal Network for clusters of conditions in predicting your response variable. The response to
this question needs to be included in the diabetes assignments.
- Create the structure of the network:
- Using LASSO, regress the response variable on all independent
variables, and on pairwise or triple clusters of independent variables, that precede
the response variable.
- Using LASSO, regress each variable that is a direct
predictor of the response (outcome) variable on all preceding
variables (demographics and conditions). In these regressions, statistically significant variables are parents in the
Markov blanket of the regression's response variable.
- Build the network using Netica.
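To illustrate how LASSO selects candidate parents, here is a minimal sketch that fits a linear LASSO by coordinate descent using only the standard library. It is not the course's prescribed procedure: in practice you would use the LASSO routine in your statistical software. The names `lasso_fit` and `select_parents`, the penalty `lam`, and the toy data are assumptions. The sketch treats a nonzero coefficient as "retained", which is a stand-in for the significance check described above, and it assumes the columns have already been restricted to variables that precede the response.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=100):
    """Minimize (1/2n) * sum((y - b0 - X.b)^2) + lam * sum(|b|) by coordinate descent."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]  # centered predictors
    residual = [yi - y_mean for yi in y]  # residual when every coefficient is 0
    coefs = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                continue  # a constant column carries no information
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * coefs[j]) for i in range(n)) / n
            new_coef = soft_threshold(rho, lam) / z
            delta = new_coef - coefs[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                coefs[j] = new_coef
    intercept = y_mean - sum(c * m for c, m in zip(coefs, col_means))
    return intercept, coefs

def select_parents(names, coefs):
    """Keep the variables whose LASSO coefficient was not shrunk to zero."""
    return [name for name, c in zip(names, coefs) if c != 0.0]

# Toy data: the outcome equals cond_a and is unrelated to cond_b
names = ["cond_a", "cond_b"]
X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
intercept, coefs = lasso_fit(X, y, lam=0.05)
parents = select_parents(names, coefs)
```

Pairwise or triple clusters could be added as extra columns holding the product of the member indicators before calling `lasso_fit`. Repeating the fit with each retained parent as the new response gives the next layer of the network.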
- Estimate the parameters of the network
- Using the LASSO regression, calculate the predicted value
for all combinations of the parents in the Markov blanket of
the regression's response variables.
- Enter the parameters into Netica tables.
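The parameter step above amounts to enumerating every combination of the parents and computing the regression's predicted value for each. The sketch below assumes binary (0/1) parents, main effects only, and a linear model whose prediction is clipped to the range 0 to 1. The function name and the coefficients are hypothetical, and the resulting values would still be typed into the Netica tables by hand.

```python
from itertools import product

def conditional_probability_table(parents, intercept, coefs):
    """Predicted value for every 0/1 combination of the parents.

    coefs maps each parent name to its regression coefficient.
    """
    table = {}
    for combo in product([0, 1], repeat=len(parents)):
        value = intercept + sum(coefs[name] * v for name, v in zip(parents, combo))
        table[combo] = min(1.0, max(0.0, value))  # clip so the entry is a valid probability
    return table

# Hypothetical coefficients for a node with two parents
table = conditional_probability_table(
    ["obesity", "hypertension"],
    intercept=0.1,
    coefs={"obesity": 0.3, "hypertension": 0.2},
)
for combo, prob in sorted(table.items()):
    print(combo, round(prob, 3))
```

A node with k binary parents needs 2^k rows, which is one reason to keep the parent set small.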
Question 6: Present your findings
- Prepare a narrated PowerPoint. Contrast screening based on a
patient's medical history with the screening recommended by the US
Preventive Services Task Force.
Example Completed Assignments
All of the following projects use patients' medical histories to screen
for the indicated disease. The screening recommendations that
emerge from these projects differ radically from the recommendations of the
US Preventive Services Task Force. The Task Force has systematically
ignored how data in EHRs can be used to improve screening. The
projects listed here are pilot studies designed to understand the nature
of the data. Some of these projects explain a low percentage of variation,
which suggests that alternative analyses are necessary.
For more complete, published studies, please use PubMed. All of these
pilot projects use data from the All of Us database and construct
network models through regression analysis.
This page is part of the course on Comparative Effectiveness by Farrokh Alemi, Ph.D.