Assigned Reading
- US Preventive Services Task Force
PubMed►
- Data mining as an alternative method for screening for diseases
Assignment
The group project in this course is to use patients' medical
history to screen for a cancer. You are required to use All of Us
data, and you may use any statistical software.
Select Cancer: Select the disease you are
predicting, organize your project team, and email the instructor the list
of people involved.
- Select a condition that you wish to
model, e.g. lung cancer, breast cancer, etc.
Each student should select a different condition. Check that the condition
you want to work on is not already selected by others. Using
Blackboard send an
email to the entire class about your selection and reserve your
right to the selected condition.
Register for All of Us. This step was accomplished in a previous assignment.
If you have not done all registration steps, including training, then
you will need to solve this problem quickly. This registration may take
several weeks for students who do not have a State ID. Otherwise, it
should take about 90 minutes. Also make sure that you remember your
password as there are multiple accounts set up in this process.
- Register for an account on @researchallofus.org
- Change from temporary password to a new password and record your password on paper somewhere.
- Turn on Google 2-Step Verification
- Verify your identity with Login.gov. This step requires a state ID or driver's license and a phone that can receive text messages.
- There are multiple passwords that you should keep in mind: your GMU password, your Researcher Workbench password on All of
Us, your computer password, and your Google password. Please make sure that you keep these accounts separate and read the messages
carefully to see which password is needed.
- Complete All of Us Registered Tier Training
- You do not need to get additional data access beyond registration data. George Mason University does not allow access to Controlled Tier.
- Sign the Data User Code of Conduct
Create Cohort and Related Data Sets. Note that a
cohort and data sets are different concepts.
- Create your cohort in All of Us. Do not limit the cohort by demographics unless the cancer clearly does not occur in certain demographic groups.
Define your cohort population broadly (e.g. all adults). Do not limit the cohort by conditions or diseases, including the cancer itself:
both patients with and without the disease you want to predict are needed.
- Create the concept for your cancer. Select as your response (outcome) variable one or more conditions that
describe the disease you want to predict. Review how people before you have done this by examining PubMed publications that
have defined the variable of interest using EHR codes. Alternatively, use conditions defined within All of Us to select
the right definition of your cancer.
- Create the concept of patient's survival. Add in an
observation that the patient has died.
- Create your data sets, for your cohort. Do not include non-EHR data or surveys.
Note that creating cancer data sets requires concepts that capture the cancer in the data. In your cohort, select demographics (age, gender, and race)
and all conditions as independent variables of interest. No survey responses are needed for independent variables. Rely
on EHR data only. Include the date of occurrence of each condition. You also need the date of occurrence of the cancer. The date of occurrence of the
response variable is the first time the variable/condition occurred. Here are the data points that you need to include in your data sets: demographics
(age at event, sex at birth / gender, race, ethnicity), survival, and diseases among the conditions.
Here are more detailed steps in getting ready for analysis:
- Get the dataset for patient demographics to include date of birth,
race, and ethnicity.
- Convert race to dummy variables. Drop one dummy variable to avoid
multicollinearity (dummy variable trap).
- Convert ethnicity to dummy variables. Drop one dummy variable
to avoid multicollinearity (dummy variable trap).
- Create the base of df_analysis from this.
- Get date_of_death from the dataset containing dead persons
then left join to df_analysis.
- Get date_of_first_diagnosis from dataset containing all of
your cancers then left join to df_analysis.
- Process disease dataset
- Get the list of all of your cancer SNOMED codes and exclude them from the
disease dataset.
- Remove diseases that happened after your cancer
- Create a new column disease_group
- Use the df_disease_grouped.csv to fill the disease_group
column based on standard_concept_code
- Change missing values of disease_group to 0 (the catch-all
disease group)
- Binarize the disease_group column. There is no need to drop any
column: disease groups are not mutually exclusive (a person can have
many disease groups), so the dummy variable trap does not arise.
- Aggregate based on person_id so that only 1 row per person_id is
in the dataset and the binarized disease group columns indicate all
the disease groups that the person has.
- Drop all other columns except person_id and the binarized
columns.
- Left join the binarized columns to df_analysis
- Create a new column, sdoh, for social determinants of health. The
column value will be 1 if the person_id is in the
df_persons_w_sdoh.csv that I sent, otherwise 0
- Do further data prep such as missing value handling, etc.
- Check if you should transform the data so that assumptions of
regression are met.
- You are now ready to start analysis of your data
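The preparation steps above can be sketched in pandas as follows. This is only an illustration on made-up toy tables, not the required implementation: the column names (date_of_birth, condition_start_date, and so on), the codes C1 and D1 to D9, the layout of df_disease_grouped, and the sdoh_ids set standing in for df_persons_w_sdoh.csv are all assumptions. Treating persons with no remaining disease rows as 0 on every group flag is also an assumption.

```python
import pandas as pd

# Toy stand-ins for the exported tables; every value here is made up.
df_person = pd.DataFrame({
    "person_id": [1, 2, 3],
    "date_of_birth": pd.to_datetime(["1950-01-01", "1960-06-15", "1970-03-20"]),
    "race": ["White", "Black", "Asian"],
    "ethnicity": ["Not Hispanic", "Hispanic", "Not Hispanic"],
})
df_death = pd.DataFrame({"person_id": [1],
                         "date_of_death": pd.to_datetime(["2020-05-01"])})
df_cancer = pd.DataFrame({
    "person_id": [1, 1, 2],
    "standard_concept_code": ["C1", "C1", "C1"],
    "condition_start_date": pd.to_datetime(["2015-01-01", "2016-01-01", "2018-07-01"]),
})
df_disease = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 3, 3],
    "standard_concept_code": ["D1", "D2", "C1", "D1", "D3", "D9"],
    "condition_start_date": pd.to_datetime(
        ["2010-01-01", "2017-01-01", "2015-01-01",
         "2012-01-01", "2011-01-01", "2013-01-01"]),
})
# Hypothetical layout of df_disease_grouped.csv: code -> disease group.
df_disease_grouped = pd.DataFrame({"standard_concept_code": ["D1", "D2", "D3"],
                                   "disease_group": [1, 1, 2]})
sdoh_ids = {2}  # hypothetical person_ids listed in df_persons_w_sdoh.csv

# Demographics: dummy-code race and ethnicity, dropping one level of each.
df_analysis = pd.get_dummies(df_person, columns=["race", "ethnicity"],
                             drop_first=True, dtype=int)

# Left join date of death and the date of first cancer diagnosis.
df_analysis = df_analysis.merge(df_death, on="person_id", how="left")
first_dx = (df_cancer.groupby("person_id")["condition_start_date"].min()
            .rename("date_of_first_diagnosis").reset_index())
df_analysis = df_analysis.merge(first_dx, on="person_id", how="left")

# Disease data: drop the cancer codes, then diseases dated on or after the cancer.
cancer_codes = set(df_cancer["standard_concept_code"])
dis = df_disease[~df_disease["standard_concept_code"].isin(cancer_codes)]
dis = dis.merge(first_dx, on="person_id", how="left")
keep = (dis["date_of_first_diagnosis"].isna()
        | (dis["condition_start_date"] < dis["date_of_first_diagnosis"]))
dis = dis[keep]

# Fill disease_group from the grouping table; unmatched codes go to group 0.
dis = dis.merge(df_disease_grouped, on="standard_concept_code", how="left")
dis["disease_group"] = dis["disease_group"].fillna(0).astype(int)

# Binarize the groups and aggregate to one row per person_id.
group_flags = (pd.get_dummies(dis["disease_group"], prefix="group", dtype=int)
               .groupby(dis["person_id"]).max().reset_index())
df_analysis = df_analysis.merge(group_flags, on="person_id", how="left")
flag_cols = [c for c in group_flags.columns if c != "person_id"]
df_analysis[flag_cols] = df_analysis[flag_cols].fillna(0).astype(int)

# Social determinants of health flag and the response variable.
df_analysis["sdoh"] = df_analysis["person_id"].isin(sdoh_ids).astype(int)
df_analysis["cancer"] = df_analysis["date_of_first_diagnosis"].notna().astype(int)
```

With your real exports, replace the toy tables with the data sets read from your workspace and check the actual column names before running any step.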
Identify the Temporal Sequence of Variables. All independent variables should occur before the dependent variable.
This section can be skipped as long as you make sure that your cancer
occurs after all independent variables.
- Assume that age, gender, and race occur at birth. Assume that death occurs as the last event.
- Establish the order with which diseases (conditions) occur.
- For each pair of conditions, count the number of times one condition occurs before the other in the same person.
Use these pairwise counts to establish a sequence of occurrence of conditions.
- Use shifted dates in All of Us to exclude diseases that occur after cancer.
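One possible way to do the pairwise counting is sketched below in plain Python. The records list, the condition names, and the helper typically_precedes are made up for illustration. The sketch assumes that each person's first recorded (shifted) date for a condition is its date of occurrence, and that dates are ISO strings so that string comparison matches date order.

```python
from collections import defaultdict
from itertools import permutations

# Toy records: (person_id, condition, shifted date as an ISO string).
records = [
    (1, "hypertension", "2010-01-01"),
    (1, "diabetes", "2012-01-01"),
    (1, "copd", "2014-01-01"),
    (2, "diabetes", "2011-01-01"),
    (2, "hypertension", "2009-01-01"),
    (3, "diabetes", "2008-01-01"),
    (3, "hypertension", "2013-01-01"),
]

# Keep the first occurrence of each condition within each person.
first_seen = defaultdict(dict)
for person_id, condition, date in records:
    earlier = first_seen[person_id].get(condition)
    if earlier is None or date < earlier:
        first_seen[person_id][condition] = date

# before_counts[(a, b)] = number of persons in whom a occurred before b.
before_counts = defaultdict(int)
for dates in first_seen.values():
    for a, b in permutations(dates, 2):
        if dates[a] < dates[b]:
            before_counts[(a, b)] += 1


def typically_precedes(a, b):
    """True when a occurs before b in more persons than the reverse."""
    return before_counts.get((a, b), 0) > before_counts.get((b, a), 0)
```

For example, typically_precedes("hypertension", "diabetes") compares the two directional counts for that pair; ties and pairs that never co-occur return False and need a rule of your own.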
Feature Construction. Create body system variables across diseases by using the hierarchy in the SNOMED ontology of diseases.
- For each body system, identify the conditions within All of Us that fall within the body system. Some conditions may fall within more than one body system
- For each body system, regress occurrence of cancer on the conditions that occur within it.
A simpler and faster method is to calculate the likelihood ratio of
cancer associated with each condition within the body system.
- Create a new feature called "worst condition within disorder xxx",
where for each patient you select the disease with largest regression
coefficient or highest likelihood ratio of cancer.
The following resources may be of use in this task:
- Vladimir Cardenas's list of body systems and members of body system Zip►
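The likelihood-ratio shortcut and the "worst condition" feature can be sketched as follows. The patients, conditions, and body-system membership are made up. The likelihood ratio is taken here as the share of cancer patients with the condition divided by the share of non-cancer patients with it, and returning None for a patient with no condition in a body system is an assumption that leaves the value missing for later handling.

```python
# Toy data: conditions observed before cancer, and the cancer outcome.
conditions = {
    1: {"copd", "asthma"},
    2: {"copd"},
    3: {"asthma"},
    4: {"asthma", "gerd"},
    5: {"gerd", "copd"},
    6: set(),
}
cancer = {1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 1}
# Hypothetical body-system membership; a condition may sit in several systems.
body_system = {"respiratory": {"copd", "asthma"}, "digestive": {"gerd"}}


def likelihood_ratio(condition):
    """P(condition | cancer) / P(condition | no cancer) in the toy cohort."""
    n_case = sum(cancer.values())
    n_non = len(cancer) - n_case
    with_case = sum(1 for p, c in conditions.items()
                    if condition in c and cancer[p] == 1)
    with_non = sum(1 for p, c in conditions.items()
                   if condition in c and cancer[p] == 0)
    if with_non == 0:
        # Never seen without cancer; real data may call for smoothing instead.
        return float("inf")
    return (with_case * n_non) / (with_non * n_case)


all_conditions = set().union(*conditions.values())
lr = {c: likelihood_ratio(c) for c in all_conditions}


def worst_condition(person_id, system):
    """Condition with the highest likelihood ratio that the person has in a system."""
    members = conditions[person_id] & body_system[system]
    if not members:
        return None, None  # no condition in this system: value is missing
    worst = max(members, key=lr.get)
    return worst, lr[worst]
```

Calling worst_condition(1, "respiratory") returns the name and likelihood ratio of that patient's highest-risk respiratory condition; the ratio can then serve as the body-system feature.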
Predictive Modeling of Cancer: Create several LASSO regression models.
- Use indicators to address missing body systems. Regress occurrence of cancer on body systems (include pairs and triplets of body
systems) and missing-value indicators (include interactions of missing indicators
that represent the patterns of missingness revealed by ggmice).
Slides►
YouTube►
- Assume that missing diseases have not occurred, i.e., replace
missing values for disease conditions with 0. Regress occurrence of
cancer on diseases (include pairs and triplets of diseases).
- Regress occurrence of cancer on all independent variables including body systems, diseases, and missing value indicators.
Report Your Findings: This report should include the
following sections, each submitted by email to the instructor at the
approximate times indicated:
- Background literature review should not exceed 1 page. Your one-page
literature review should assume a reader familiar with the
literature and should not exceed three paragraphs. The first paragraph should
address the significance of the area you are addressing, including
prevalence of the cancer and the importance of early detection in improving
outcomes of care. The second paragraph should describe who the US
Preventive Services Task Force recommends should be screened and point out
that such an approach misses important risk factors. This paragraph
should list the risk factors missed and reference articles that point
out these risk factors. The paragraph should not exceed two or three
sentences but can have numerous references. The last paragraph should
discuss how your paper provides a comprehensive review of non-genetic
risk factors by examining all of the patients' medical history. The Background
section should be a brief synthesis of existing research findings related to the problem being addressed in the study.
This section is due in the 3rd week of the course.
- Method section should be a complete description of the
methods; and there is no page limit but brevity is appreciated. It should include a paragraph or a sentence on source of data.
It should describe the inclusion and exclusion criteria for the
creation of the cohort and compare these criteria to what has been
done in the literature. It should have a sentence or a paragraph, with
citations, on definition of the dependent/response variable. It
should have a sentence or a paragraph on number of, and definition of,
independent variables; including interaction among pairs, and triplets
of variables. These statements should clarify how missing values were
treated and explain what steps were taken to ensure that independent
variables occur prior to response/dependent variable. There should be
a paragraph on feature construction. In particular, it should
describe how ontological adjustments were made to construct body
systems or other features. This work should reference relevant papers
on feature construction. There should be a sentence or two on
how data were transformed to meet assumptions of regression. There
should be a paragraph on analytical methods used, e.g. LASSO, and how
hyperparameters were set. The methods paragraph should describe in
one sentence how the performance of the US Preventive Services Task Force
recommendations was simulated given that some key variables were
missing. The Method section is due 2 weeks after the lecture on LASSO
regression.
- Results section should describe the findings and there is no page
limit. Table 1 should be description of the
population studied. Figures and additional tables should
summarize the statistical findings. These should include parameters of
your model and the fit between the model and the data. Include the fit
between the US Preventive Services Task Force model and the data. Describe the fit
between the model and the data with and without interaction terms, and
with and without feature construction. There should not be any discussion of
findings in the Results section. This section should be complete
3 weeks before the end of the course.
- Discussion section should include 4 distinct sections and there is
no page limit. The first section should be a summary of the key findings.
The second section should be a review of support for the findings in
the literature. The third section should summarize study limitations.
The last section should conclude with policy implications.
This section is due in the last week of classes.
Example Completed Assignments: All of the following listed projects use patient's medical history to screen for the indicated disease. The recommendation for screening that
emerges from these projects differs radically from the recommendation of US Preventive Services Task Force. The Task Force has systematically
ignored how data in EHRs can be used to improve screening. The projects listed here are pilot studies designed to understand the nature
of the data. Some of these projects explain a low percentage of variation, which suggests that alternative analysis is necessary.
For more complete, published studies, please use PubMed. All of these pilot projects use data from the All of Us database.
All of these projects construct network models, through regression analysis.
- Redd's lung cancer paper Read► (Use instructor's last name for password)
- All of Us breast cancer study
YouTube►
This page is part of the HAP 819 course organized by Farrokh Alemi, Ph.D.