HAP 823: Causal Analysis

Lecture: All of Us Project on Antidepressants  


Assigned Reading

  • Effectiveness of antidepressants PubMed►
  • Proxy measure for remission of depression symptoms PubMed►
  • SAFE procedures for excluding some independent variables from LASSO regressions Read►


The semester-long project in this course is to assess the effectiveness of an existing guide to depression medications in minority populations.

(A) Register for All of Us.  This step was assigned before the start of the course. If you have not completed every registration step, including training, do so immediately. Registration may take several weeks for students who do not have a state ID; otherwise, it should take about 90 minutes.  Several accounts are set up during this process, so write down the password for each sign-in separately on paper; it is easy to confuse which password is needed when.

  1. Register for an account on @researchallofus.org
  2. Change from temporary password to a new password and record your password on paper somewhere.
  3. Turn on Google 2-Step Verification
  4. Verify your identity with Login.gov.  This step requires a state ID or driver's license and a phone that can receive text messages.
  5. Keep track of several passwords: your GMU password, your All of Us Researcher Workbench password, your computer password, and your Google password.  Keep these accounts separate and read each prompt carefully to see which password is needed.
  6. Complete All of Us Registered Tier Training
  7. You do not need any data access beyond the Registered Tier. George Mason University does not allow access to the Controlled Tier.
  8. Sign the Data User Code of Conduct

When you have registered completely, you should see something like this page:

Create your Workspace in All of Us Database


(B) Create Cohort and Related Data Sets.  Note that a cohort and data sets are different concepts. 

  1. Create your cohort in All of Us. 
  2. Limit the cohort to African American participants.
  3. Create the concept for major depression. Review in PubMed how investigators have defined major depression in EHRs. Alternatively, use conditions defined within All of Us to select the right definition of major depression.
  4. Create the concept of the patient's survival.
  5. The unit of analysis is the medication, not the individual.  An individual can have multiple medications.  Define the data set so that there is one entry for each antidepressant.
  6. Create your data sets for your cohort.  Do not include non-EHR data or surveys; rely on EHR data only, and do not use survey responses as independent variables. Creating the antidepressant data set requires first creating concepts that capture the antidepressant in the data. In your cohort, select demographics (age, gender) and all conditions as independent variables of interest. Include the date of occurrence of every event. You also need the date of first use (purchase) of the antidepressant. The date of occurrence of the response variable is the first time the variable/condition occurred.  Here are the data points that you need to include in your data sets:
    1. ID of antidepressant
    2. ID of person
    3. Age at first intake of antidepressant
    4. Sex at birth
    5. Gender
    6. Survival
    7. 590 Diseases among the Conditions.
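
As a minimal sketch of the one-row-per-antidepressant layout described above, the following pandas code derives the date of first use and the age at first intake. The table and column names (`person`, `drug`, `drug_date`, `first_use_date`, `age_at_first_use`) and the sample rows are hypothetical and are not the All of Us schema; age is approximated as days divided by 365.25.

```python
import pandas as pd

# Hypothetical extracts; column names are illustrative assumptions.
person = pd.DataFrame({
    "person_id": [1, 2],
    "date_of_birth": pd.to_datetime(["1980-05-01", "1975-03-15"]),
    "sex_at_birth": ["Female", "Male"],
})
drug = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "antidepressant_id": ["sertraline", "sertraline", "bupropion", "sertraline"],
    "drug_date": pd.to_datetime(
        ["2019-01-10", "2019-02-10", "2020-06-01", "2018-07-04"]),
})

def first_use_rows(drug, person):
    """Return one row per person and antidepressant with first-use date and age."""
    # The earliest record of each antidepressant is taken as its first use.
    first = (drug.groupby(["person_id", "antidepressant_id"], as_index=False)
                 .agg(first_use_date=("drug_date", "min")))
    # Attach demographics; a left join keeps every antidepressant row.
    out = first.merge(person, on="person_id", how="left")
    # Approximate age in whole years at first intake.
    days = (out["first_use_date"] - out["date_of_birth"]).dt.days
    out["age_at_first_use"] = (days / 365.25).astype(int)
    return out

rows = first_use_rows(drug, person)
print(rows[["person_id", "antidepressant_id", "first_use_date", "age_at_first_use"]])
```

Because the grouping key is the pair of person and antidepressant, a person who used two antidepressants contributes two rows, which matches the unit of analysis stated in step 5.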

Here are more detailed steps in getting ready for analysis:

  1. Get the dataset for patient demographics to include date of birth, race, and ethnicity.
  2. Select African Americans.
  3. Create the base of df_analysis from this.
  4. Get date_of_death from the dataset containing dead persons then left join to df_analysis.
  5. Get date_of_first_antidepressant from the dataset containing all of your antidepressants, then left join to df_analysis.
  6. Get date of every antidepressant purchase
  7. Process disease dataset
  8. Get list of all of your antidepressants codes. The data set should not be limited to the antidepressant you selected and should include all antidepressants.
  9. Create a new column for the start date of antidepressant you selected.
  10. Create a new column for the end date of antidepressant you selected.
  11. Create a new column for duration of any antidepressant used prior to the antidepressant you selected.
  12. Create a new column disease_group
  13. Use the df_disease_grouped.csv to fill the disease_group column based on standard_concept_code
  14. Change missing values of disease_group to zero, 0 (catch all disease grouping).  This assumes that unreported diseases are absent.
  15. Select all diseases that occur before the start date of the antidepressant
  16. Calculate the number of days of use of each antidepressant and flag whether the antidepressant was prematurely abandoned.
  17. Binarize the disease_group column. No column needs to be dropped: disease groups are not mutually exclusive (a person can have many), so the dummy variable trap does not arise.
  18. Aggregate based on antidepressant-id so that only 1 row per antidepressant per person_id is in the dataset and the binarized disease group columns indicate all the disease groups that the person has.
  19. Drop all other columns except antidepressant_id, person_id, days of antidepressant use, and the binarized columns.
  20. Left join the binarized columns to df_analysis
  21. You are now ready to start description of the data
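
Steps 12 through 20 above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the required implementation: the column names `start_date` and `condition_date`, the `dg_` prefix, and the sample rows are hypothetical, and `df_groups` stands in for the contents of df_disease_grouped.csv. "Prior to" is read here as strictly before the start date.

```python
import pandas as pd

def add_disease_groups(df_analysis, df_conditions, df_groups):
    """Attach binarized prior-disease columns to df_analysis.

    Assumed (hypothetical) columns:
      df_analysis:   person_id, antidepressant_id, start_date
      df_conditions: person_id, standard_concept_code, condition_date
      df_groups:     standard_concept_code, disease_group
    """
    keys = ["person_id", "antidepressant_id"]
    # Steps 12-14: look up disease_group; unmapped codes fall into group 0.
    cond = df_conditions.merge(df_groups, on="standard_concept_code", how="left")
    cond["disease_group"] = cond["disease_group"].fillna(0).astype(int)
    # Step 15: keep conditions dated strictly before the antidepressant start.
    paired = df_analysis[keys + ["start_date"]].merge(cond, on="person_id")
    prior = paired[paired["condition_date"] < paired["start_date"]]
    # Step 17: one 0/1 column per disease group, none dropped.
    dummies = pd.get_dummies(prior["disease_group"], prefix="dg").astype(int)
    # Steps 18-19: collapse to one row per person and antidepressant.
    wide = (pd.concat([prior[keys], dummies], axis=1)
              .groupby(keys, as_index=False).max())
    # Step 20: left join; rows with no prior conditions get 0 in every group.
    out = df_analysis.merge(wide, on=keys, how="left")
    dg_cols = [c for c in out.columns if c.startswith("dg_")]
    if dg_cols:
        out[dg_cols] = out[dg_cols].fillna(0).astype(int)
    return out

df_analysis = pd.DataFrame({
    "person_id": [1, 2],
    "antidepressant_id": ["sertraline", "sertraline"],
    "start_date": pd.to_datetime(["2019-01-10", "2018-07-04"]),
})
df_conditions = pd.DataFrame({
    "person_id": [1, 1, 1],
    "standard_concept_code": ["A", "B", "Z"],
    "condition_date": pd.to_datetime(["2018-01-01", "2019-06-01", "2017-01-01"]),
})
df_groups = pd.DataFrame({"standard_concept_code": ["A", "B"],
                          "disease_group": [12, 7]})
result = add_disease_groups(df_analysis, df_conditions, df_groups)
print(result)
```

Taking the maximum within each group turns repeated diagnoses into a single indicator, so each binarized column records whether the person ever had a disease in that group before starting the antidepressant.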

The following resources may be of use in this task:

(C) Describe the Population.  In this step, create Table 1 of your eventual report, which describes the population.  For examples of Table 1, see PubMed.  Provide a summary of your data that includes the number of antidepressants examined, number of individuals involved, number of antidepressants discontinued, number of days individuals were followed, number of days antidepressants were continued, number of medical conditions at baseline (start of antidepressant use), number of antidepressants used prior to baseline, and experience with previous antidepressants.
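
Several of the Table 1 quantities listed above can be computed directly from the analysis table. The sketch below assumes hypothetical column names (`discontinued`, `days_of_use`, `n_prior_antidepressants`, and `dg_`-prefixed disease indicators) and made-up rows; adapt the names to your own data set.

```python
import pandas as pd

def table_one(df):
    """Summarize an analysis table with one row per person and antidepressant."""
    dg_cols = [c for c in df.columns if c.startswith("dg_")]
    return pd.Series({
        "Antidepressants examined": len(df),
        "Individuals": df["person_id"].nunique(),
        "Antidepressants discontinued": int(df["discontinued"].sum()),
        "Mean days antidepressant continued": df["days_of_use"].mean(),
        # Row sums of the 0/1 indicators count baseline condition groups.
        "Mean baseline conditions": df[dg_cols].sum(axis=1).mean(),
        "Mean prior antidepressants": df["n_prior_antidepressants"].mean(),
    })

# Made-up example rows, for illustration only.
example = pd.DataFrame({
    "person_id": [1, 1, 2],
    "antidepressant_id": ["sertraline", "bupropion", "sertraline"],
    "discontinued": [1, 0, 1],
    "days_of_use": [30, 200, 45],
    "n_prior_antidepressants": [0, 1, 0],
    "dg_0": [1, 0, 0],
    "dg_12": [1, 1, 0],
})
summary = table_one(example)
print(summary)
```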

(D) Fit a Network Model to the Data:  Use a chain of LASSO regressions to create a network model of direct and indirect predictors of remission after taking your antidepressant.  Include pairwise interactions of conditions.  This may result in too many independent variables.  To reduce the number of independent variables, use the SAFE procedure, where strong rules are used to exclude some variables.
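
The screening idea in step (D) can be illustrated with the basic strong rule for the squared-error LASSO, which drops predictor j when |x_j'y| < 2*lambda - lambda_max, where lambda_max is the largest |x_j'y| across predictors. This sketch is one possible reading of the step and not necessarily the procedure in the assigned reading; the function names, the unit-norm scaling, and the toy data are assumptions. The rule is a heuristic that can occasionally discard a variable that belongs in the model, so the fitted model should be checked afterward.

```python
import itertools
import numpy as np

def add_pairwise_interactions(X, names):
    """Append the product of every pair of columns to X."""
    cols, new_names = [X], list(names)
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        cols.append((X[:, i] * X[:, j])[:, None])
        new_names.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), new_names

def strong_rule_keep(X, y, lam):
    """Boolean mask of predictors that survive the basic strong rule.

    Assumes the objective 0.5*||y - Xb||^2 + lam*||b||_1 with columns
    centered and scaled to unit norm, and y centered.
    """
    Xs = X - X.mean(axis=0)
    norms = np.linalg.norm(Xs, axis=0)
    norms[norms == 0] = 1.0          # constant columns keep a zero score
    Xs = Xs / norms
    yc = y - y.mean()
    score = np.abs(Xs.T @ yc)        # |x_j' y| for each predictor
    lam_max = score.max()            # smallest penalty giving an empty model
    return score >= 2 * lam - lam_max

# Toy data: y equals x1, and x2 is unrelated to y.
X = np.array([[0, 0], [0, 1], [0, 0], [0, 1],
              [1, 0], [1, 1], [1, 0], [1, 1]], dtype=float)
y = X[:, 0].copy()
X_int, int_names = add_pairwise_interactions(X, ["x1", "x2"])
lam_max = np.sqrt(2.0)               # max |x_j' y| for this toy data
keep = strong_rule_keep(X_int, y, lam=0.9 * lam_max)
print([n for n, k in zip(int_names, keep) if k])
```

In a chain of LASSO regressions, the same screening would be repeated for each regression in the chain, with the response replaced by the variable being predicted at that step.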

(E) Report Your Findings: This report should include the following sections and be provided at the approximate times indicated by email to the instructor:

  1. Abstract.  Include a structured abstract using objective of the study, method, results, and main conclusion.  The abstract should be written after you complete other sections.  The abstract must not exceed 500 words and should report the number of words used in the abstract.
  2. Background literature review should not exceed 1 page. Your one-page literature review should assume a reader familiar with the literature and should not exceed three paragraphs.  The first paragraph should address the significance of the area you are addressing, including the prevalence of depression and the importance of antidepressant selection. The second paragraph should describe the failure of clinicians to select the right antidepressant for African Americans, as reported in the literature. This paragraph should not exceed two or three sentences but can have numerous references.  The last paragraph should discuss how your analysis can help the selection of antidepressants for African Americans.  The Background section should be a brief synthesis of existing research findings related to the problem being addressed in the study. Every sentence should have a reference.  We are not interested in unsupported claims.
  3. Method section should be a complete description of the methods; and there is no page limit but brevity is appreciated. It should include a paragraph or a sentence on source of data. It should describe the inclusion and exclusion criteria for the creation of the cohort and compare these criteria to what has been done in the literature. It should have a sentence or a paragraph, with citations, on definition of remission.  It should have a sentence or a paragraph on number of, and definition of, independent variables. These statements should clarify how missing values were treated and explain what steps were taken to ensure that independent variables occur prior to response/dependent variable. There should be a paragraph on analytical methods used.  
  4. Results section should describe the findings; there is no page limit.  Table 1 should describe the population studied.  Figures and additional tables should summarize the statistical findings. These should include the parameters of your model and the fit between the guide and the experience of African Americans. There should not be any discussion of findings in the Results section.
  5. Discussion section should include 4 distinct sections; there is no page limit.  The first section should be a summary of the key findings.  The second section should be a review of support for the findings in the literature. The third section should summarize study limitations.  The last section should conclude with policy implications.

Example Completed Assignments: The following projects use patients' medical histories to screen for the indicated disease.

  • Redd's lung cancer paper Read► (Use instructor's last name for password)
  • All of Us breast cancer study YouTube►


This page is part of the HAP 819 course organized by Farrokh Alemi, Ph.D.