HAP 786: Workshop in Health Informatics

Lecture: Create Database in All of Us  

Overview

Objectives

  1. Select a population of patients to focus on
  2. Select a high dimensional set of variables to focus on
  3. Define baseline measures and independent variables
  4. Define exposure measures
  5. Define outcomes and dependent variables

Assigned Reading & Learning Materials

Different student teams have different assignments, so you should not follow the code of others verbatim.  The videos below are organized around a specific type of regression, and the variables you need in your database may differ.  Think through which variables are dependent and which are independent.  We would like you to use conditions reported in All of Us as your independent variables.  Your dependent variable could be response to an antidepressant or a predictor of response to an antidepressant.  In creating the database, make sure that every independent variable occurs before the dependent variable.

  • Vlad Cardenas Teach One Part 1 YouTube►
  • Vlad Cardenas Teach One Part 2 YouTube► Slides► Code►
  • Rasil Alamri’s Teach One Part 1 YouTube►
  • Mona Mohamed’s Teach One Part 2 YouTube►
  • Organizing response to antidepressants CSV►
  • Organizing conditions (independent variables) within body systems CSV►
  • Creating survival variable More►
  • Creating a database in All of Us YouTube►
  • Lana Hashem's Teach One - Data Preparation Part 1: YouTube►
  • Chathrini Sirisena's Teach One - Data Preparation Part 2: YouTube►
  • HAP 464 Lecture 03 - Data Preparation (Part 2): YouTube►
  • Wafaa Abdelmalak's Teach One - Data Preparation Part 3 YouTube►
  • Jenny Rivera-Rivas' Teach One - Data Preparation Part 4 YouTube►

Assignment

The semester-long project in this course is to assess the effectiveness of an existing guide to depression medications in minority populations. In this session, you are asked to organize the database for your analysis.  Analysis of observational data requires attention to the timing of variables, so record the date of each variable in your database.
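As a minimal sketch of what keeping the timing of variables can look like in practice, the snippet below stores a date for each event and checks that each condition precedes the outcome. The table, the column names (`condition_date`, `outcome_date`), and the rows are hypothetical and are not the All of Us schema.

```python
import pandas as pd

# Toy event table; column names and values are illustrative assumptions,
# not the actual All of Us schema.
events = pd.DataFrame({
    "person_id": [1, 1, 2],
    "condition": ["diabetes", "hypertension", "asthma"],
    "condition_date": pd.to_datetime(["2015-03-01", "2019-06-10", "2017-01-05"]),
    "outcome_date": pd.to_datetime(["2018-01-15", "2018-01-15", "2020-09-30"]),
})

# Flag rows where the independent variable (condition) was recorded
# strictly before the dependent variable (outcome).
events["occurs_before_outcome"] = events["condition_date"] < events["outcome_date"]

# Keep only conditions that precede the outcome.
valid_events = events[events["occurs_before_outcome"]]
print(valid_events[["person_id", "condition"]])
```

Without the date columns this check is impossible, which is why every data set you export should carry them.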

(B) Create Cohort and Related Data Sets.  Note that a cohort and data sets are different concepts. 

  1. Create your cohort in All of Us. 
  2. Limit the cohort by African American race.  
  3. Create the concept for Major depression. Review in PubMed how investigators have defined Major depression in EHRs. Alternatively, use conditions defined within All of Us to select the right definition of Major Depression.
  4. Create the concept of patient's survival. 
  5. The unit of analysis is the medication, not the individual.  An individual can have multiple medications.  Define the database so that there is one entry for each antidepressant.
  6. Create your data sets for your cohort.  Rely on EHR data only; do not include non-EHR data or survey responses. Note that creating the antidepressant data set requires creating concepts that capture the antidepressants in the data. In your cohort, select demographics (age, gender) and all conditions as independent variables of interest. Include the date of occurrence of every event. You also need the date of first use (purchase) of the antidepressant. The date of occurrence of the response variable is the first time the variable/condition occurred.  Here are the data points that you need to include in your data sets:
    1. ID of antidepressant
    2. ID of person
    3. Age at first intake of antidepressant
    4. Sex at birth
    5. Gender
    6. Survival
    7. 590 Diseases among the Conditions.
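The following sketch illustrates one possible layout for such a medication-level table, with one row per person per antidepressant. All table names, column names, and values are hypothetical. Survival is assumed here to mean a death indicator plus days from first use to death; your team may define it differently.

```python
import pandas as pd

# Hypothetical inputs; names and values are assumptions for illustration.
persons = pd.DataFrame({
    "person_id": [1, 2],
    "date_of_birth": pd.to_datetime(["1970-05-20", "1985-11-02"]),
    "sex_at_birth": ["Female", "Male"],
    "gender": ["Woman", "Man"],
})
deaths = pd.DataFrame({
    "person_id": [2],
    "date_of_death": pd.to_datetime(["2021-03-15"]),
})
drug_records = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "antidepressant_id": [101, 101, 202, 101],
    "drug_date": pd.to_datetime(
        ["2016-01-10", "2016-02-10", "2018-07-01", "2019-04-22"]),
})

# One row per person per antidepressant, dated at first recorded use.
first_use = (
    drug_records.groupby(["person_id", "antidepressant_id"], as_index=False)
    .agg(first_use_date=("drug_date", "min"))
)

# Attach demographics and date of death; left joins keep every medication row.
med_table = first_use.merge(persons, on="person_id", how="left")
med_table = med_table.merge(deaths, on="person_id", how="left")

# Approximate age in whole years at first intake of each antidepressant.
med_table["age_at_first_intake"] = (
    (med_table["first_use_date"] - med_table["date_of_birth"]).dt.days // 365.25
).astype(int)

# Assumed survival definition: death indicator and days from first use to death
# (missing when no death is recorded).
med_table["died"] = med_table["date_of_death"].notna().astype(int)
med_table["days_to_death"] = (
    med_table["date_of_death"] - med_table["first_use_date"]
).dt.days

print(med_table[["person_id", "antidepressant_id", "age_at_first_intake", "died"]])
```

Person 1 appears twice because that person used two different antidepressants, which is what it means for the medication to be the unit of analysis.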

Here are more detailed steps in getting ready for analysis:

  1. Get the dataset for patient demographics to include date of birth, race, and ethnicity.
  2. Select African Americans.
  3. Create the base of df_analysis from this.
  4. Get date_of_death from the dataset containing dead persons then left join to df_analysis.
  5. Get date_of_first_antidepressant from the dataset containing all of your antidepressants, then left join to df_analysis.
  6. Get date of every antidepressant purchase
  7. Process disease dataset
  8. Get list of all of your antidepressants codes. The data set should not be limited to the antidepressant you selected and should include all antidepressants.
  9. Create a new column for the start date of antidepressant you selected.
  10. Create a new column for the end date of antidepressant you selected.
  11. Create a new column for duration of any antidepressant used prior to the antidepressant you selected.
  12. Create a new column disease_group
  13. Use the df_disease_grouped.csv to fill the disease_group column based on standard_concept_code
  14. Change missing values of disease_group to 0 (the catch-all disease group).  This assumes that unreported diseases are absent.
  15. Select all diseases that occur prior to date of the antidepressants
  16. Calculate the number of days of use of each antidepressant and flag whether the antidepressant was prematurely abandoned.
  17. Binarize the disease_group column. There is no need to drop a column: the disease groups are not mutually exclusive (a person can have many), so the dummy variable trap does not arise.
  18. Aggregate by antidepressant_id so that there is only 1 row per antidepressant per person_id in the dataset and the binarized disease group columns indicate all the disease groups that the person has.
  19. Drop all other columns except antidepressant_id, person_id, days of antidepressant use, and the binarized columns.
  20. Left join the binarized columns to df_analysis
  21. You are now ready to start description of the data
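The disease-related steps above (12 to 15, 17, 18, and 20) can be sketched in pandas as follows. This is only an illustration under stated assumptions: the toy tables, the concept codes, the group numbers, and the column names `first_use_date` and `condition_date` are hypothetical, and a small in-memory table stands in for df_disease_grouped.csv.

```python
import pandas as pd

# Hypothetical inputs; all names and values are assumptions for illustration.
# One row per person per antidepressant, with the first-use date.
df_analysis = pd.DataFrame({
    "person_id": [1, 1, 2],
    "antidepressant_id": [101, 202, 101],
    "first_use_date": pd.to_datetime(["2016-01-10", "2018-07-01", "2019-04-22"]),
})
# One row per recorded condition.
df_disease = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "standard_concept_code": ["E11", "I10", "Z99", "J45"],
    "condition_date": pd.to_datetime(
        ["2015-03-01", "2017-02-14", "2014-08-09", "2020-01-01"]),
})
# Stand-in for df_disease_grouped.csv (code -> disease group number).
df_disease_grouped = pd.DataFrame({
    "standard_concept_code": ["E11", "I10", "J45"],
    "disease_group": [3, 7, 12],
})

# Steps 12-14: map codes to groups; unmapped codes go to catch-all group 0.
df_disease = df_disease.merge(
    df_disease_grouped, on="standard_concept_code", how="left")
df_disease["disease_group"] = df_disease["disease_group"].fillna(0).astype(int)

# Step 15: pair each condition with each antidepressant of the same person,
# then keep conditions recorded before that antidepressant's first use.
pairs = df_analysis.merge(df_disease, on="person_id", how="inner")
prior = pairs[pairs["condition_date"] < pairs["first_use_date"]]

# Steps 17-18: binarize disease_group and collapse to one row per person
# per antidepressant; max() marks a group as present if any row has it.
dummies = pd.get_dummies(prior["disease_group"], prefix="disease_group", dtype=int)
binarized = (
    pd.concat([prior[["person_id", "antidepressant_id"]], dummies], axis=1)
    .groupby(["person_id", "antidepressant_id"], as_index=False)
    .max()
)

# Step 20: left join; medications with no prior conditions get 0 in every group.
df_final = df_analysis.merge(
    binarized, on=["person_id", "antidepressant_id"], how="left")
group_cols = [c for c in df_final.columns if c.startswith("disease_group_")]
df_final[group_cols] = df_final[group_cols].fillna(0).astype(int)

print(df_final)
```

In this toy data, person 2's only condition is recorded after the antidepressant, so that row ends with zeros in every disease group column. Only groups that actually occur before an antidepressant get a column, so with real data you may want to add all-zero columns for the remaining groups.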

This page is part of the HAP 819 course organized by Farrokh Alemi, Ph.D. Home► Email►