Lecture: All of Us Project

Assigned Reading

None

Assignment

The group project in this course is to use patients' medical history to screen for a disease or condition. You are required to use All of Us data, and you may use any statistical software.

Question 1: Register for All of Us. This step was assigned in a previous assignment. If you have not completed every registration step, including training, finish the remaining steps as soon as possible. Registration may take several weeks for students who do not have a state ID; otherwise, it should take about 90 minutes. Keep track of your passwords, as this process sets up multiple accounts.

Register for an account on @researchallofus.org
Change from temporary password to a new password and record your password on paper somewhere.
Turn on Google 2-Step Verification
Verify your identity with Login.gov. This step requires a state ID or driver's license, a Social Security number, and a phone that can receive text messages. You will have several passwords to keep track of: your GMU password, your All of Us Researcher Workbench password, your computer password, and your Google password. Keep these accounts separate and read each message carefully to see which password is needed.
Complete All of Us Registered Tier Training
You do not need any data access beyond the Registered Tier. George Mason University does not allow access to the Controlled Tier.
Sign the Data User Code of Conduct

Question 2: Create your cohort in All of Us. Note that a cohort and data sets are different concepts.

Create your cohort in All of Us. Define the cohort population broadly (e.g. all adults). Do not limit the cohort by demographics unless the cancer clearly does not occur in certain demographics. Do not limit the cohort by conditions or diseases, including the presence of cancer: both patients with and without the disease you want to predict are needed.
Create the concept for your cancer. Select as your response (outcome) variable one or more conditions that describe the disease you want to predict. Review how people before you have done this by examining PubMed publications that have defined the variable of interest using EHR codes. Alternatively, use conditions defined within All of Us to select the right definition of your cancer.
Create the concept of patient's survival. Add in an observation that the patient has died.
Create your data sets for your cohort. Do not include non-EHR data or surveys; rely on EHR data only. Note that creating the cancer data sets requires first creating concepts that capture the cancer in the data. In your cohort, select demographics (age, gender, and race) and all conditions as independent variables of interest. Include the date of occurrence of each condition, as well as the date of occurrence of the cancer. The date of occurrence of the response variable is the first time the variable/condition has occurred. Here are the data points that you need to include in your data sets: demographics (Age at event, Sex at birth / gender, Race, Ethnicity), Survival, and Diseases among the Conditions.
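The rule that the response variable is dated at its first occurrence can be sketched in Python with pandas as follows. This is only an illustration under assumptions: the table df_cancer, its column condition_start_date, and the toy values are hypothetical stand-ins for whatever your exported cancer data set contains.

```python
import pandas as pd

# Hypothetical extract of condition records matching the cancer concept set.
# Column names and values are assumptions for illustration only.
df_cancer = pd.DataFrame({
    "person_id": [1, 1, 2, 3, 3],
    "condition_start_date": pd.to_datetime(
        ["2015-03-01", "2012-07-15", "2018-01-20", "2020-05-05", "2019-11-30"]
    ),
})

# The date of occurrence of the response variable is the earliest
# matching record for each person.
date_of_first_diagnosis = (
    df_cancer.groupby("person_id")["condition_start_date"]
    .min()
    .rename("date_of_first_diagnosis")
    .reset_index()
)
print(date_of_first_diagnosis)
```

Persons who never appear in df_cancer simply have no row here; after a left join to the full cohort they show up with a missing date, which marks them as not having the outcome.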

Here are more detailed steps in getting ready for analysis:

Get the dataset for patient demographics to include date of birth, race, and ethnicity.
Convert race to dummy variables. Drop one dummy variable to avoid multicollinearity (dummy variable trap).
Convert ethnicity to dummy variables. Drop one dummy variable to avoid multicollinearity (dummy variable trap).
Create the base of df_analysis from this.
Get date_of_death from the dataset containing dead persons then left join to df_analysis.
Get date_of_first_diagnosis from dataset containing all of your cancers then left join to df_analysis.
Process disease dataset
Get list of all of your cancer SNOMED codes and exclude from the disease dataset.
Remove diseases that happened after your cancer
Create a new column disease_group
Use the df_disease_grouped.csv to fill the disease_group column based on standard_concept_code
Change missing values of disease_group to zero, 0 (catch all disease grouping)
Binarize the disease_group column. There is no need to drop any column: the disease groups are not mutually exclusive (a person can have many disease groups), so the dummy variable trap does not arise.
Aggregate based on person_id so that only 1 row per person_id is in the dataset and the binarized disease group columns indicate all the disease groups that the person has.
Drop all other columns except person_id and the binarized columns.
Left join the binarized columns to df_analysis
Create a new column, sdoh, for social determinants of health. The column value will be 1 if the person_id is in the df_persons_w_sdoh.csv that I sent, otherwise 0
Do further data prep such as missing value handling, etc.
Check if you should transform the data so that assumptions of regression are met.
You are now ready to start analysis of your data
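The preparation steps above can be sketched in Python with pandas. This is a minimal sketch on toy data, not the course's reference code: the input tables, the column condition_start_date, the code-to-group mapping (standing in for df_disease_grouped.csv), and the set of person IDs (standing in for df_persons_w_sdoh.csv) are all hypothetical. The date_of_death join follows the same left-join pattern and is omitted for brevity.

```python
import pandas as pd

# Toy stand-ins for the All of Us extracts; all names and values are hypothetical.
df_person = pd.DataFrame({
    "person_id": [1, 2, 3],
    "race": ["White", "Black", "Asian"],
    "ethnicity": ["Not Hispanic", "Hispanic", "Not Hispanic"],
})
df_first_dx = pd.DataFrame({
    "person_id": [1],
    "date_of_first_diagnosis": pd.to_datetime(["2019-01-01"]),
})
df_disease = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3],
    "standard_concept_code": ["A", "B", "CANCER", "A", "Z"],
    "condition_start_date": pd.to_datetime(
        ["2018-01-01", "2020-01-01", "2019-01-01", "2017-06-01", "2016-02-02"]
    ),
})
cancer_codes = {"CANCER"}         # codes that define the outcome
code_to_group = {"A": 1, "B": 2}  # stands in for df_disease_grouped.csv
persons_w_sdoh = {2}              # stands in for df_persons_w_sdoh.csv

# Demographics: dummy-code race and ethnicity, dropping one level of each.
df_analysis = pd.get_dummies(
    df_person, columns=["race", "ethnicity"], drop_first=True, dtype=int
)
df_analysis = df_analysis.merge(df_first_dx, on="person_id", how="left")

# Diseases: exclude outcome codes, then keep only records dated before the
# first diagnosis (persons without the outcome keep all of their records).
d = df_disease[~df_disease["standard_concept_code"].isin(cancer_codes)]
d = d.merge(df_first_dx, on="person_id", how="left")
keep = d["date_of_first_diagnosis"].isna() | (
    d["condition_start_date"] < d["date_of_first_diagnosis"]
)
d = d[keep].copy()

# Map codes to groups; unmapped codes fall into the catch-all group 0.
d["disease_group"] = (
    d["standard_concept_code"].map(code_to_group).fillna(0).astype(int)
)

# Binarize the groups and aggregate to one row per person.
group_flags = (
    pd.get_dummies(d["disease_group"], prefix="group", dtype=int)
    .groupby(d["person_id"])
    .max()
    .reset_index()
)

# Join back; persons with no remaining disease records get 0 in every flag.
df_analysis = df_analysis.merge(group_flags, on="person_id", how="left")
flag_cols = [c for c in df_analysis.columns if c.startswith("group_")]
df_analysis[flag_cols] = df_analysis[flag_cols].fillna(0).astype(int)
df_analysis["sdoh"] = df_analysis["person_id"].isin(persons_w_sdoh).astype(int)
print(df_analysis)
```

Taking the maximum of each 0/1 flag within a person is what collapses many disease records into a single row that marks every group the person has ever had.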

The following resources may be of use in this task:

Vladimir Cardenas's YouTube► Slides► Dictionary► R Code►
Creating survival variable More►

Question 3: Identify the temporal sequence of events.

Assume that age, gender, and race occur at birth. Assume that death occurs as the last event.
Establish the order in which conditions occur
- For each pair of conditions, count the number of times one condition occurs before the other in the same person.
- Use the pairwise count of one condition occurring before another to establish a sequence of occurrence of conditions.
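The pairwise counting can be sketched in plain Python as follows. The condition names and dates are invented, and the final ranking by net precedence is only one simple heuristic assumed here for illustration; the assignment does not prescribe a particular way of turning the counts into a sequence.

```python
from collections import Counter
from itertools import permutations

# Hypothetical first-occurrence dates per person
# (ISO date strings sort chronologically).
first_dates = {
    1: {"hypertension": "2010-01-01", "diabetes": "2012-05-01", "ckd": "2015-03-01"},
    2: {"diabetes": "2011-02-01", "hypertension": "2013-07-01"},
    3: {"hypertension": "2009-09-01", "ckd": "2014-04-01"},
}

# before[(a, b)] = number of persons in whom a first occurred strictly before b.
before = Counter()
for conditions in first_dates.values():
    for a, b in permutations(conditions, 2):
        if conditions[a] < conditions[b]:
            before[(a, b)] += 1

# Assumed heuristic: rank each condition by how often it precedes others
# minus how often it follows them.
names = sorted({c for conds in first_dates.values() for c in conds})
net = {
    a: sum(before[(a, b)] - before[(b, a)] for b in names if b != a)
    for a in names
}
sequence = sorted(names, key=lambda c: -net[c])
print(sequence)
```

Pairs that occur on the same date are counted in neither direction, so ties do not influence the ordering.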

Question 4: Create a Causal Network for clusters of conditions in predicting your response variable. The response to this question needs to be included in the diabetes assignments.

Create the structure of the network:
- Using LASSO, regress the response variable on all independent variables, and on pairwise or triple clusters of independent variables, that precede the response variable.
- Using LASSO, regress each variable that is a direct predictor of response/outcome variable on all preceding variables (demographics and conditions). In these regressions, statistically significant variables are parents in the Markov blanket of the regression response variable.
- Draw the network using Netica.
Estimate the parameters of the network:
- Using the LASSO regression, calculate the predicted value for all combinations of the parents in the Markov blanket of the regression's response variables.
- Enter the parameters into Netica Tables.
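The parameter-estimation step, predicting a value for every combination of a node's parents, can be sketched in plain Python. The intercept, the coefficients, and the variable names below are made up and merely stand in for the output of your own LASSO logistic regression; the sketch also assumes binary (0/1) parents and interaction terms written as products.

```python
from itertools import product
from math import exp

# Hypothetical LASSO logistic-regression output for one node: an intercept
# and the non-zero coefficients of its parents. These numbers are invented.
intercept = -2.0
coefficients = {
    "hypertension": 0.8,
    "obesity": 0.5,
    "hypertension*obesity": 0.4,  # pairwise interaction term
}
parents = ["hypertension", "obesity"]


def predicted_probability(values):
    """Logistic prediction for one combination of 0/1 parent values."""
    z = intercept
    for term, beta in coefficients.items():
        # An interaction term is active only when all of its members are present.
        active = all(values[name] for name in term.split("*"))
        z += beta * active
    return 1.0 / (1.0 + exp(-z))


# One row per combination of parent values: the conditional probability table.
cpt = []
for combo in product([0, 1], repeat=len(parents)):
    values = dict(zip(parents, combo))
    cpt.append((combo, round(predicted_probability(values), 4)))

for combo, p in cpt:
    print(dict(zip(parents, combo)), p)
```

Each printed row corresponds to one row of the node's table in Netica; a node with k binary parents needs 2^k such rows.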

Question 6: The report should include the following sections, each provided by email to the instructor at the approximate time indicated:

Background literature review should not exceed 1 page. Your one-page literature review should assume a reader familiar with the literature and should not exceed three paragraphs. The first paragraph should address the significance of the area you are addressing, including the prevalence of the cancer and the importance of early detection in improving outcomes of care. The second paragraph should describe whom the US Preventive Services Task Force recommends for screening and point out that such an approach misses important risk factors. This paragraph should list the risk factors missed and reference articles that point to these risk factors. The paragraph should not exceed two or three sentences but can have numerous references. The last paragraph should discuss how your paper provides a comprehensive review of non-genetic risk factors by examining all of the patients' medical history. The Background section should be a brief synthesis of existing research findings related to the problem being addressed in the study. This section is due in the 3rd week of the course.
Method section should be a complete description of the methods; there is no page limit, but brevity is appreciated. It should include a paragraph or a sentence on the source of data. It should describe the inclusion and exclusion criteria for the creation of the cohort and compare these criteria to what has been done in the literature. It should have a sentence or a paragraph, with citations, on the definition of the dependent/response variable. It should have a sentence or a paragraph on the number of, and definition of, independent variables, including interactions among pairs and triplets of variables. These statements should clarify how missing values were treated and explain what steps were taken to ensure that independent variables occur prior to the response/dependent variable. There should be a paragraph on feature construction. In particular, it should describe how ontological adjustments were made to construct body systems or other features. This work should reference relevant papers on feature construction. There should be a sentence or two on how data were transformed to meet assumptions of regression. There should be a paragraph on the analytical methods used, e.g. LASSO, and how hyperparameters were set. The methods paragraph should describe in one sentence how the performance of the US Preventive Services Task Force recommendations was simulated given that some key variables were missing. The Method section is due 2 weeks after the lecture on LASSO regression.
Results section should describe the findings, and there is no page limit. Table 1 should be a description of the population studied. Figures and additional tables should summarize the statistical findings. These should include the parameters of your model and the fit between the model and the data. Include the fit between the US Preventive Services Task Force model and the data. Describe the fit between the model and the data with and without interaction terms. Describe the fit with and without feature construction. There should not be any discussion of findings in the Results section. This section should be completed 3 weeks before the end of the course.
Discussion section should include 4 distinct sections, and there is no page limit. The first section should be a summary of the key findings. The second section should be a review of support for the findings in the literature. The third section should summarize study limitations. The last section should conclude with policy implications. This section is due in the last week of classes.

Example Completed Assignments

All of the projects listed below use patients' medical history to screen for the indicated disease. The screening recommendations that emerge from these projects differ radically from the recommendations of the US Preventive Services Task Force. The Task Force has systematically ignored how data in EHRs can be used to improve screening. The projects listed here are pilot studies designed to understand the nature of the data. Some of these projects have a low percent of variation explained, which suggests that alternative analysis is necessary. For more complete, published studies, please use PubMed. All of these pilot projects use data from the All of Us database, and all of them construct network models through regression analysis.

Diabetes YouTube►
Leukemia YouTube►
Congestive heart failure YouTube►
Hypertension YouTube►
Hip fracture YouTube►
Breast cancer YouTube►
Depression YouTube►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi, Ph.D.