HAP 819: Advanced Statistics II

Lecture: All of Us Project  

 

Assigned Reading

Assignment

The group project in this course is to use patients' medical histories to screen for a cancer. You are required to use All of Us data, and you may use any statistical software. 

Select Cancer: Select the disease you are predicting, organize your project team, and email the instructor the list of people involved.

  1.  Select a condition that you wish to model, e.g. lung cancer or breast cancer. Each student should select a different condition. Check that the condition you want to work on is not already selected by others. Using Blackboard, send an email to the entire class announcing your selection to reserve the condition. 

Register for All of Us.  This step was accomplished in a previous assignment. If you have not completed all registration steps, including training, you will need to do so quickly. Registration may take several weeks for students who do not have a state ID.  Otherwise, it should take about 90 minutes.  Also make sure that you remember your passwords, as multiple accounts are set up in this process.

  1. Register for an account on @researchallofus.org
  2. Change from temporary password to a new password and record your password on paper somewhere.
  3. Turn on Google 2-Step Verification
  4. Verify your identity with Login.gov.  This step requires a state ID or driver's license and a phone that can receive text messages.
  5. There are multiple passwords that you should keep in mind: your GMU password, your Researcher Workbench password on All of Us, your computer password, and your Google password.  Please make sure that you keep these accounts separate and read the messages carefully to see which password is needed.
  6. Complete All of Us Registered Tier Training
  7. You do not need to get additional data access beyond registration data. George Mason University does not allow access to Controlled Tier.
  8. Sign the Data User Code of Conduct

Create Cohort and Related Data Sets.  Note that a cohort and data sets are different concepts. 

  1. Create your cohort in All of Us.  Do not limit the cohort by demographics, unless the cancer clearly does not occur in certain demographics. Do not limit the cohort by the presence of cancer, as both patients with and without cancer are needed. Define your cohort population broadly (e.g. all adults). Include people with, and without, the disease you want to predict.  Do not limit the cohort by conditions or diseases. 
  2. Create the concept for your cancer. Select as your response (outcome) variable one or more conditions that describe the disease you want to predict.  Review how people before you have done this by examining PubMed publications that have defined the variable of interest using EHR codes. Alternatively, use conditions defined within All of Us to select the right definition of your cancer.
  3. Create the concept of patient's survival.  Add in an observation that the patient has died. 
  4. Create your data sets for your cohort.  Do not include non-EHR data or surveys; rely on EHR data only. Note that creating the cancer data sets requires first creating the concepts that capture the cancer in the data. In your cohort, select demographics (age, gender, and race) and all conditions as independent variables of interest.  Include the date of occurrence of each condition and of the cancer. The date of occurrence of the response variable is the first time the variable/condition occurred.  Here are the data points that you need to include in your data sets: demographics (age at event, sex at birth/gender, race, ethnicity), survival, and diseases among the conditions.

Here are more detailed steps in getting ready for analysis:

  1. Get the dataset for patient demographics to include date of birth, race, and ethnicity.
  2. Convert race to dummy variables. Drop one dummy variable to avoid multicollinearity (dummy variable trap).
  3. Convert ethnicity to dummy variables. Drop one dummy variable to avoid multicollinearity (dummy variable trap).
  4. Create the base of df_analysis from this.
  5. Get date_of_death from the dataset containing dead persons then left join to df_analysis.
  6. Get date_of_first_diagnosis from dataset containing all of your cancers then left join to df_analysis.
  7. Process disease dataset
  8. Get list of all of your cancer SNOMED codes and exclude from the disease dataset.
  9. Remove diseases that happened after your cancer
  10. Create a new column disease_group
  11. Use the df_disease_grouped.csv to fill the disease_group column based on standard_concept_code
  12. Change missing values of disease_group to 0 (the catch-all disease group)
  13. Binarize the disease_group column. There is no need to drop any column: disease groups are not mutually exclusive (a person can have many disease groups), so the dummy variable trap does not arise.
  14. Aggregate based on person_id so that only 1 row per person_id is in the dataset and the binarized disease group columns indicate all the disease groups that the person has.
  15. Drop all other columns except person_id and the binarized columns.
  16. Left join the binarized columns to df_analysis
  17. Create a new column, sdoh, for social determinants of health. The column value will be 1 if the person_id is in the df_persons_w_sdoh.csv that I sent, otherwise 0
  18. Do further data prep such as missing value handling, etc.
  19. Check if you should transform the data so that assumptions of regression are met.
  20. You are now ready to start analysis of your data
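
The steps above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the toy tables and the column names date_of_birth, race, ethnicity, and condition_start_date are hypothetical stand-ins for whatever your All of Us export contains, and the in-memory df_disease_grouped and persons_w_sdoh replace the CSV files mentioned in steps 11 and 17. The sketch keeps a disease only if it was recorded strictly before the first cancer date, which is a stricter choice than step 9 requires.

```python
import pandas as pd

# Toy stand-ins for the All of Us exports; every value here is made up.
df_demo = pd.DataFrame({
    "person_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1950-01-01", "1960-06-15", "1970-03-20"]),
    "race": ["White", "Black", "Asian"],
    "ethnicity": ["Not Hispanic", "Hispanic", "Not Hispanic"],
})
df_death = pd.DataFrame({"person_id": [3],
                         "date_of_death": pd.to_datetime(["2021-05-01"])})
df_cancer = pd.DataFrame({
    "person_id": [1, 1],
    "standard_concept_code": ["C1", "C1"],
    "condition_start_date": pd.to_datetime(["2019-02-01", "2020-07-01"]),
})
df_disease = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3],
    "standard_concept_code": ["D1", "D2", "C1", "D1", "D9"],
    "condition_start_date": pd.to_datetime(
        ["2015-01-01", "2020-01-01", "2019-02-01", "2018-01-01", "2017-01-01"]),
})
df_disease_grouped = pd.DataFrame({"standard_concept_code": ["D1", "D2"],
                                   "disease_group": [1, 2]})
persons_w_sdoh = {2}  # hypothetical stand-in for df_persons_w_sdoh.csv

# Steps 1-4: demographics with dummy-coded race and ethnicity (one level dropped each).
df_analysis = pd.get_dummies(df_demo, columns=["race", "ethnicity"],
                             drop_first=True, dtype=int)

# Steps 5-6: left join date of death and date of first cancer diagnosis.
first_dx = (df_cancer.groupby("person_id", as_index=False)["condition_start_date"].min()
            .rename(columns={"condition_start_date": "date_of_first_diagnosis"}))
df_analysis = df_analysis.merge(df_death, on="person_id", how="left")
df_analysis = df_analysis.merge(first_dx, on="person_id", how="left")

# Steps 7-9: exclude the cancer codes, then keep diseases recorded strictly
# before the first cancer date (or any date for persons without the cancer).
dis = df_disease[~df_disease["standard_concept_code"]
                 .isin(df_cancer["standard_concept_code"])]
dis = dis.merge(first_dx, on="person_id", how="left")
keep = (dis["date_of_first_diagnosis"].isna()
        | (dis["condition_start_date"] < dis["date_of_first_diagnosis"]))
dis = dis[keep]

# Steps 10-12: map codes to disease groups; unmapped codes fall into group 0.
dis = dis.merge(df_disease_grouped, on="standard_concept_code", how="left")
dis["disease_group"] = dis["disease_group"].fillna(0).astype(int)

# Steps 13-15: binarize the groups and collapse to one row per person.
group_flags = (pd.get_dummies(dis["disease_group"], prefix="disease_group", dtype=int)
               .groupby(dis["person_id"]).max().reset_index())

# Step 16: left join; persons with no remaining diseases get 0 in every group.
df_analysis = df_analysis.merge(group_flags, on="person_id", how="left")
flag_cols = [c for c in group_flags.columns if c != "person_id"]
df_analysis[flag_cols] = df_analysis[flag_cols].fillna(0).astype(int)

# Step 17: indicator for social determinants of health.
df_analysis["sdoh"] = df_analysis["person_id"].isin(persons_w_sdoh).astype(int)
```

Taking the maximum of each binarized column within person_id is what turns many disease rows into one row per person: a group column is 1 if any of the person's retained diseases falls in that group.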

The following resources may be of use in this task:

Identify the Temporal Sequence of Variables.  All independent variables should occur before the dependent variable.  This section can be skipped as long as you make sure that your cancer occurs after all independent variables.

  1. Assume that age, gender, and race occur at birth.  Assume that death occurs as the last event.  
  2. Establish the order with which diseases (conditions) occur.
    • For each pair of conditions, count the number of times one condition occurs before the other in the same person.  Use these pairwise counts to establish a sequence of occurrence of conditions.
    • Use shifted dates in All of Us to exclude diseases that occur after cancer
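
The pairwise count described above can be sketched as follows. The table layout (person_id, condition, first_date) and the toy dates are hypothetical. Ranking conditions by how often they precede others minus how often they follow is one simple way to turn the counts into a sequence; it is an assumption of this sketch, not a method prescribed by the assignment. Because dates are compared only within the same person, the comparison is unaffected by date shifting as long as all of a person's records are shifted by the same amount.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical table: one row per person and condition, holding the (shifted)
# date on which the condition first occurred.
first_dates = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 3, 3],
    "condition": ["A", "B", "C", "A", "B", "B", "A"],
    "first_date": pd.to_datetime(["2010-01-01", "2012-01-01", "2015-01-01",
                                  "2011-01-01", "2013-01-01",
                                  "2009-01-01", "2014-01-01"]),
})

# before[(x, y)] = number of persons in whom x occurred before y.
before = Counter()
for _, grp in first_dates.groupby("person_id"):
    dates = dict(zip(grp["condition"], grp["first_date"]))
    for x, y in combinations(sorted(dates), 2):
        if dates[x] < dates[y]:
            before[(x, y)] += 1
        elif dates[y] < dates[x]:
            before[(y, x)] += 1
        # Same-day pairs are counted in neither direction.

# Net score: times a condition precedes another minus times it follows one.
conditions = sorted(first_dates["condition"].unique())
net = {
    c: sum(n for (x, _), n in before.items() if x == c)
       - sum(n for (_, y), n in before.items() if y == c)
    for c in conditions
}
sequence = sorted(conditions, key=lambda c: -net[c])
```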

Feature Construction.  Create body system variables across diseases by using the hierarchy in the SNOMED ontology of diseases. 

  1. For each body system, identify the conditions within All of Us that fall within the body system.  Some conditions may fall within more than one body system
  2. For each body system, regress occurrence of cancer on the conditions that occur within it.  A simpler and faster method is to calculate the likelihood ratio of cancer associated with each condition within the body system.
  3. Create a new feature called "worst condition within disorder xxx", where for each patient you select the disease with the largest regression coefficient or the highest likelihood ratio of cancer.
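
A minimal sketch of the likelihood ratio approach in steps 2 and 3, on made-up data. The condition names, the body-system membership list, and the column name worst_respiratory are hypothetical. The likelihood ratio here is the share of cancer patients with the condition divided by the share of non-cancer patients with the condition; the small constant eps added to the counts is an assumption of this sketch to avoid division by zero. Patients with no condition in the body system are left missing rather than set to zero.

```python
import pandas as pd

# Toy data: 1 = condition recorded before the cancer date, 0 = not recorded.
df = pd.DataFrame({
    "cancer": [1, 1, 1, 0, 0, 0, 0, 0],
    "copd":   [1, 1, 0, 1, 0, 0, 0, 0],
    "asthma": [1, 0, 0, 1, 1, 0, 0, 0],
})
respiratory = ["copd", "asthma"]  # hypothetical members of one body system


def likelihood_ratio(flag, outcome, eps=0.5):
    """Return P(condition | cancer) / P(condition | no cancer), smoothed by eps."""
    cases = outcome == 1
    p_cases = (flag[cases].sum() + eps) / (cases.sum() + 2 * eps)
    p_controls = (flag[~cases].sum() + eps) / ((~cases).sum() + 2 * eps)
    return p_cases / p_controls


# One likelihood ratio per condition in the body system.
lr = pd.Series({c: likelihood_ratio(df[c], df["cancer"]) for c in respiratory})

# For each patient, keep the highest likelihood ratio among the conditions the
# patient actually has; the result is NaN when none of them is present.
present = df[respiratory].where(df[respiratory] == 1)
df["worst_respiratory"] = (present * lr).max(axis=1)
```

In practice the likelihood ratios would be estimated on training data only, so that the constructed feature does not carry information about the outcome of the patients on whom the model is evaluated.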

The following resources may be of use in this task:

  • Vladimir Cardenas's list of body systems and members of body system Zip►

Predictive Modeling of Cancer: Create several LASSO regression models.

  1. Use indicators to address missing body systems.  Regress occurrence of cancer on body systems (include pairs and triplets of body systems) and on missing-value indicators (include interactions of the indicators that represent the patterns of missingness revealed by ggmice).  Slides► YouTube►
  2. Assume that missing diseases have not occurred, i.e., replace missing values for disease conditions with 0. Regress occurrence of cancer on diseases (include pairs and triplets of diseases).
  3. Regress occurrence of cancer on all independent variables including body systems, diseases, and missing value indicators.  

Report Your Findings: This report should include the following sections, each submitted by email to the instructor at the approximate time indicated:

  1. Background literature review should not exceed 1 page. Your one-page literature review should assume a reader familiar with the literature and not exceed three paragraphs.  The first paragraph should address the significance of the area you are addressing, including prevalence of the cancer and the importance of early detection in improving outcomes of care.  The second paragraph should describe who the US Preventive Services Task Force recommends for screening and point out that such an approach misses important risk factors.  This paragraph should list the risk factors missed and reference articles that point to these risk factors.  The paragraph should not exceed two or three sentences but can have numerous references.  The last paragraph should discuss how your paper provides a comprehensive review of non-genetic risk factors by examining all of the patients' medical history.  The background section should be a brief synthesis of existing research findings related to the problem being addressed in the study. This section is due in the 3rd week of the course.
  2. Method section should be a complete description of the methods; there is no page limit, but brevity is appreciated. It should include a paragraph or a sentence on the source of data. It should describe the inclusion and exclusion criteria for the creation of the cohort and compare these criteria to what has been done in the literature. It should have a sentence or a paragraph, with citations, on the definition of the dependent/response variable.  It should have a sentence or a paragraph on the number and definition of independent variables, including interactions among pairs and triplets of variables. These statements should clarify how missing values were treated and explain what steps were taken to ensure that independent variables occur prior to the response/dependent variable. There should be a paragraph on feature construction.  In particular, it should describe how ontological adjustments were made to construct body systems or other features. This work should reference relevant papers on feature construction.  There should be a sentence or two on how data were transformed to meet assumptions of regression. There should be a paragraph on the analytical methods used, e.g. LASSO, and how hyperparameters were set. The methods paragraph should describe in one sentence how the performance of the US Preventive Services Task Force recommendations was simulated given that some key variables were missing. The method section is due 2 weeks after the lecture on LASSO regression. 
  3. Results section should describe the findings and there is no page limit.  Table 1 should be a description of the population studied.  Figures and additional tables should summarize the statistical findings. These should include the parameters of your model and the fit between the model and data. Include the fit between the US Preventive Services Task Force model and the data. Describe the fit to the data with and without interaction terms. Describe the fit to the data with and without feature construction. There should not be any discussion of findings in the results section.  This section should be completed 3 weeks before the end of the course.
  4. Discussion section should include 4 distinct sections and there is no page limit.  The first section should be a summary of the key findings.  The second section should be a review of support for the findings in the literature. The third section should summarize study limitations.  The last section should conclude with policy implications. This section is due in the last week of classes.

Example Completed Assignments: All of the following projects use patients' medical history to screen for the indicated disease. The recommendation for screening that emerges from these projects differs radically from the recommendation of the US Preventive Services Task Force. The Task Force has systematically ignored how data in EHRs can be used to improve screening.  The projects listed here are pilot studies designed to understand the nature of the data.  Some of these projects have a low percentage of variation explained, which suggests that alternative analysis is necessary. For more complete, published studies, please use PubMed. All of these pilot projects use data from the All of Us database.  All of these projects construct network models through regression analysis. 

  • Redd's lung cancer paper Read► (Use instructor's last name for password)
  • All of Us breast cancer study YouTube►

This page is part of the HAP 819 course organized by Farrokh Alemi, Ph.D. Home► Email►