Lecture: All of Us Project  


Assigned Reading


The group project in this course is to use patients' medical histories to screen for a disease or condition. You are required to use All of Us data, and you may use any statistical software.

Question 1:  Select the disease you are predicting, organize your project team, and email the instructor the list of people involved.

  • Select a condition that you wish to model, e.g. lung cancer, breast cancer, diabetes, or depression. Groups should select different conditions, so check that the condition you want to work on has not already been selected by another group. Send an email to the entire class announcing your selection to reserve your group's right to that condition.
  • You can team up with up to 2 other people, so the maximum team size, including you, is 3. You can also work in a group of 2 or by yourself. The project does not change with group size: you need to do the entire project whether you work alone or in a group. Choose your teammates carefully, so that you do not complain about them later; many students complain of unequal contribution to team projects. I advise you not to team up with students who do not have access to All of Us data, as that will significantly delay you.

Question 2: Register for All of Us.  This step was accomplished in a previous assignment. If you have not completed all registration steps, including training, you will need to do so quickly. Registration may take several weeks for students who do not have a state ID; otherwise, it should take about 90 minutes. Make sure that you remember your passwords, as multiple accounts are set up in this process.

  • Register for an account on @researchallofus.org
  • Change from temporary password to a new password and record your password on paper somewhere.
  • Turn on Google 2-Step Verification
  • Verify your identity with Login.gov. This step requires a state ID or driver's license, a Social Security number, and a phone that can receive text messages. You will have several passwords to keep track of: your GMU password, your All of Us Researcher Workbench password, your computer password, and your Google password. Keep these accounts separate and read each message carefully to see which password is needed.
  • Complete All of Us Registered Tier Training
  • You do not need data access beyond the Registered Tier. George Mason University does not allow access to the Controlled Tier.
  • Sign the Data User Code of Conduct

Question 3: Create your cohort in All of Us.  Submit an email to your instructor when this has been done and report the number of people in the cohort and their demographics.

  •  Sign into your account on All of Us.
  • Define your population broadly (e.g. all adults). Include people with, and without, the disease you want to predict.  Do not restrict the analysis to individuals with the disease.
  • Select as your response (outcome) variable one or more conditions that describe the disease you want to focus on.  Review how people before you have done this by examining PubMed publications that have defined the variable of interest using EHR codes. The date of occurrence of the response variable is the first time the variable/condition has occurred.
  • In your cohort, select demographics (age, gender, and race) and all conditions as independent variables of interest.  No survey responses are needed for independent variables. Rely only on EHR data. Include date of occurrence. 
  • Request that All of Us construct your database. This will take some time.
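
Once the database is available, the cohort report requested in this question can be sketched as follows. This is only an illustration under stated assumptions: the row layout (person ID, condition code, date), the field names, and the codes CODE_X, CODE_Y, and CODE_Z are hypothetical placeholders, not actual All of Us identifiers. The outcome date is taken as the first time any outcome code occurs for a person.

```python
from collections import Counter
from datetime import date

def outcome_dates(condition_rows, outcome_codes):
    """Return {person_id: earliest date on which any outcome code occurs}."""
    first = {}
    for person, code, when in condition_rows:
        if code in outcome_codes and (person not in first or when < first[person]):
            first[person] = when
    return first

def summarize(persons, first):
    """Count cohort size, people with the outcome, and demographic categories."""
    summary = {
        "n": len(persons),
        "n_with_outcome": sum(1 for p in persons if p in first),
    }
    for field in ("gender", "race"):
        summary[field] = Counter(info[field] for info in persons.values())
    return summary

# Hypothetical toy cohort: people with and without the outcome are both kept.
persons = {
    1: {"gender": "F", "race": "R1"},
    2: {"gender": "M", "race": "R1"},
    3: {"gender": "F", "race": "R2"},
}
condition_rows = [
    (1, "CODE_X", date(2019, 5, 1)),
    (1, "CODE_X", date(2017, 3, 2)),   # earlier record of the same outcome code
    (2, "CODE_Z", date(2018, 1, 1)),   # not an outcome code
    (3, "CODE_Y", date(2020, 7, 4)),
]
outcome_codes = {"CODE_X", "CODE_Y"}   # hypothetical code set defining the disease

first = outcome_dates(condition_rows, outcome_codes)
summary = summarize(persons, first)
print(summary)
```

Person 2 stays in the cohort even though the outcome never occurs, which matches the instruction not to restrict the analysis to individuals with the disease.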

Question 4: Identify the temporal sequence of events.  The response to this question needs to be provided when addressing the COVID-related assignments.

  • Assume that age, gender, and race occur at birth.  Assume that death occurs as the last event.  
  • Establish the order in which conditions occur:
    • For each pair of conditions, count the number of times one condition occurs before the other in the same person.
    • Use the pairwise count of one condition occurring before another to establish a sequence of occurrence of conditions.
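
The pairwise counting described above can be sketched in plain Python. The assignment does not say how the pairwise counts should be turned into a single sequence, so the ranking rule here is an assumption: each condition is ranked by how many other conditions it precedes more often than it follows. The condition names A, B, and C and the record layout are hypothetical.

```python
from collections import defaultdict
from datetime import date
from itertools import permutations

def first_occurrence(records):
    """Return {person: {condition: earliest date}} from (person, condition, date) rows."""
    first = defaultdict(dict)
    for person, cond, when in records:
        if cond not in first[person] or when < first[person][cond]:
            first[person][cond] = when
    return first

def precedence_counts(first):
    """counts[(a, b)] = number of people in whom a first occurs before b."""
    counts = defaultdict(int)
    for conds in first.values():
        for a, b in permutations(conds, 2):
            if conds[a] < conds[b]:   # same-day occurrences count for neither order
                counts[(a, b)] += 1
    return counts

def order_conditions(counts):
    """Assumed rule: rank by number of conditions preceded more often than followed."""
    conditions = {c for pair in counts for c in pair}
    def wins(c):
        return sum(
            1 for d in conditions
            if d != c and counts.get((c, d), 0) > counts.get((d, c), 0)
        )
    return sorted(conditions, key=lambda c: (-wins(c), c))

# Hypothetical toy records for three people.
records = [
    (1, "A", date(2015, 1, 1)), (1, "B", date(2016, 1, 1)), (1, "C", date(2018, 1, 1)),
    (1, "A", date(2017, 1, 1)),   # repeat visit; only the first occurrence counts
    (2, "B", date(2013, 1, 1)), (2, "A", date(2014, 1, 1)), (2, "C", date(2019, 1, 1)),
    (3, "A", date(2010, 1, 1)), (3, "B", date(2012, 1, 1)),
]

counts = precedence_counts(first_occurrence(records))
sequence = order_conditions(counts)
print(dict(counts))
print(sequence)
```

In this toy data A precedes B in two people and follows it in one, so A is placed before B. Other rules, such as requiring a minimum margin between the two counts, would be equally defensible.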

Question 5: Create a Causal Network for clusters of conditions in predicting your response variable.  The response to this question needs to be included in the diabetes assignments.

  • Create the structure of the network:
    • Using LASSO, regress the response variable on all independent variables, and on pairwise or triple clusters of independent variables, that precede the response variable.
    • Using LASSO, regress each variable that is a direct predictor of the response (outcome) variable on all variables that precede it (demographics and conditions). In these regressions, statistically significant variables are parents in the Markov blanket of the regression's response variable.
    • Draw the network using Netica.
  • Estimate the parameters of the network
    • Using the LASSO regression, calculate the predicted value for all combinations of the parents in the Markov blanket of the regression's response variables.
  • Enter the parameters into Netica Tables.
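
The two steps above, selecting parents with LASSO and then computing a predicted value for every combination of parents, can be sketched in plain Python. This is a minimal illustration and not the course's prescribed implementation. It makes three assumptions: a linear LASSO fitted by coordinate descent, a nonzero coefficient used as a stand-in for "statistically significant", and predictions clipped to the range 0 to 1 so they can serve as table probabilities. The variable names cond_a, cond_b, and cond_c, the toy data, and the penalty value are hypothetical. Pairwise or triple clusters would enter as additional product columns, which are not shown.

```python
from itertools import product

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_fit(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*sum((y - b0 - X b)^2) + lam*sum(|b|)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]   # centered predictors
    yc = [v - y_mean for v in y]                                 # centered response
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = z = 0.0
            for i in range(n):
                # Residual with predictor j left out of the fit.
                partial = yc[i] - sum(beta[k] * Xc[i][k] for k in range(p) if k != j)
                rho += Xc[i][j] * partial
                z += Xc[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta

def parents_of(names, beta):
    """Assumption: a variable kept by LASSO (nonzero coefficient) is a parent."""
    return [name for name, b in zip(names, beta) if b != 0.0]

def probability_table(names, intercept, beta, parents):
    """Predicted value for every present/absent combination of the parents."""
    coef = dict(zip(names, beta))
    table = {}
    for combo in product([0, 1], repeat=len(parents)):
        value = intercept + sum(coef[name] * v for name, v in zip(parents, combo))
        table[combo] = min(1.0, max(0.0, value))   # clipping is an assumption
    return table

# Hypothetical toy data: the outcome depends on cond_a and cond_b but not cond_c.
names = ["cond_a", "cond_b", "cond_c"]
X = [list(combo) for combo in product([0, 1], repeat=3)]
y = [1 if row[0] or row[1] else 0 for row in X]

intercept, beta = lasso_fit(X, y, lam=0.0625)
parents = parents_of(names, beta)
cpt = probability_table(names, intercept, beta, parents)
print(parents)
print(cpt)
```

The same fit would be repeated with each selected parent as the response, using only the variables that precede it, and each resulting table entered into the corresponding node in Netica.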

Question 6: Present your findings

  • Prepare a narrated PowerPoint. Contrast screening based on a patient's medical history with the screening recommended by the US Preventive Services Task Force (USPSTF).

Example Completed Assignments

All of the following projects use patients' medical histories to screen for the indicated disease. The screening recommendations that emerge from these projects differ radically from those of the US Preventive Services Task Force. The Task Force has systematically ignored how data in EHRs can be used to improve screening. The projects listed here are pilot studies designed to understand the nature of the data. Some of them explain a low percentage of variation, which suggests that alternative analyses are necessary. For more complete published studies, please use PubMed. All of these pilot projects use data from the All of Us database, and all construct network models through regression analysis.

This page is part of the course on Comparative Effectiveness by Farrokh Alemi, Ph.D.