Healthcare Databases & Information Systems Course


Topic: SQL for Predictive Models

Learning Module Objectives:

After completing the activities in this section, you should be able to:

  • Differentiate between association, detection, and prediction
  • Randomly set aside data for training
  • Understand the relationship between naive Bayes and likelihood ratios
  • Use thousands of rare, sometimes obvious, predictors
  • Use SQL to calculate conditional probabilities
  • Use training data set to calculate the likelihood ratio associated with a predictor
  • Adjust likelihood ratios for rare diseases

Learning Material

  1. Introduction to probability Read►
  2. Risk assessment (use instructor's last name as password) Read►

Teach One Assignment

If you are supposed to teach about this section of the course, select one of the assignments, do the assignment and show it to the instructor to make sure you have done it correctly.  Prepare your slides, narrate your slides, remove excess words from narrated slides, convert narrated slides to a file format that can be uploaded,  upload your file, email everyone in the class the URL of your file.  Make sure that all these tasks are done ahead of scheduled class session.  Your peers will appreciate receiving your advice on how to solve a class assignment as soon as possible and well before last day prior to class session.  More

Individual Assignments

No individual assignment should be completed in teams.  Submit your work in Blackboard.  Do not discuss the work with other students. 

Question 1: Calculate the likelihood ratio associated with repeated (first, second, third, fourth, and fifth) Infection of unspecified site and diabetes.  Calculate the likelihood ratio as:

Take your definition for unspecified bacterial infection and diabetes from Agency for Health Care Quality  and Research's Clinical Classification Software (CCS codes).  Since codes have removed some elements of the diagnosis code, add an "I" prior to the diagnosis code and a period prior to the last 2 digits.  The CCS diagnosis code should correspond to the format of ICD9 codes in the data.  Rank order the repeated bacterial diagnoses of each person and calculate a separate likelihood ratio for each occasions of repetition of the unspecified bacterial diagnosis.  Make sure that you count distinct individuals in calculation of the likelihood ratios.  Plot the relationship between likelihood ratio for diabetes and the number of bacterial infections.  CCS Codes►  Data►

Question 2: Redo question 1 but this time exclude patients who died within 6 months of the nth unspecified bacterial infection.  Data►

Team Assignment

General requirements:

  • Work in team of 2 persons.  Do not work with a person that you have previously worked within a team project.
  • Upon submission, indicate the name of your team member.  Both team members must submit the team's work separately.
  • No copying of code from each other but feel free to learn from each other. The data reported by team members must be the same, the SQL code can be different. Come to an agreement on the findings and help each other to arrive to the same findings.
  • If team assignments are completed with individual effort, then the student loses 10% of the grade.   

Team tasks:

  1. Download data  Video► Download► SQL► Slides► Screen Shots►
  2. Clean the data as you or your teammate had done so in the previous week.  
  3. Verify that both team members are working with same set of cleaned data. 
  4. Randomly set aside 80% of data for training and 20% for validation. Use the training data set in the following calculations. 
  5. Estimate the likelihood ratios associated with each diagnosis and its repetitions. 
    • Calculate separate likelihood ratios for first, second, third, fourth, and fifth occurrences of the same diagnosis for the same person. 
    • Adjust for situations where the outcome never or always occurs
  6. Identify the total number of unique diagnoses for which you have an estimated likelihood ratio.
  7. Rank order the estimated likelihood ratios in order of the likelihood ratio
  8. Using the web, identify the name of 4 diagnoses with the largest likelihood ratios. 

To complete this team assignment,  upload your SQL code, the first 4 diagnoses with the largest likelihood ratios (report the names and the likelihood ratios), and the total number of unique diagnoses into a word document.  Then, upload the document into Blackboard.  Each student will upload their document by Sunday, 11:55 PM, EST.


  1. Definition, content and use of electronic health records among United States providers  PubMed►
  2. Creating tables and records using Microsoft (R) Access Read► Video►
  3. Creating forms in Access Video► Slides►
  4. Create a hierarchical relationship inside Microsoft Access.   Video► SWF►

Copyright © Farrokh Alemi, Ph.D. First created on January 9th 2005. Most recent revision 07/20/2018. This page is part of the course on Healthcare Databases.