Healthcare Databases & Information Systems Course
 

 

Topic: Accuracy of Predictions

Learning Module Objectives:

After completing the activities in this section, you should be able to:

  • Set aside data for validation
  • Make one aggregate prediction from thousands of predictors
  • Compare naive Bayes and logistic regression
  • Understand conditions under which naive Bayes works as accurately as other data mining approaches
  • Construct receiver operating characteristic (ROC) curves to check the accuracy of predictions
  • Calculate the area under the receiver operating curve
  • Cross-validate predictive models

Learning Material

  1. Predicting outcomes Slides► Video►
  2. Product of values of data in one column Slides► Video►
  3. Accuracy of Predictive Models (use instructor's last name as password) Read►
  4. Constructing Receiver Operating Curves Slides► Video►
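
Item 2 above concerns multiplying all the values in one column. Standard SQL has no built-in product aggregate, so one common workaround is to sum the logarithms of the values and exponentiate the total. The Python sketch below illustrates that idea only; it is not the course's SQL, and the function name and sample values are made up for illustration. It assumes every value is strictly positive.

```python
import math

def column_product(values):
    """Multiply positive values by summing their logarithms and exponentiating."""
    if any(v <= 0 for v in values):
        raise ValueError("the log-sum approach requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values))

# Made-up values standing in for one column of a table.
example_values = [2.0, 0.5, 4.0, 1.25]
product = column_product(example_values)
print(product)
```

Working on the log scale also avoids numeric underflow or overflow when thousands of small or large values are multiplied together.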

Teach One Assignment

If you are assigned to teach this section of the course, select one of the assignments, complete it, and show it to the instructor to confirm that you have done it correctly.  Then prepare your slides, narrate them, remove excess words from the narrated slides, convert them to a file format that can be uploaded, upload the file, and email the URL of the file to everyone in the class.  Complete all of these tasks ahead of the scheduled class session.  Your peers will appreciate receiving your advice on how to solve a class assignment as early as possible, well before the last day prior to the class session.

Individual Assignments

No individual assignment should be completed in teams.  Submit your work in Blackboard.  Do not discuss the work with other students. 

Question 1: Use the attached data to create a receiver operating curve.  The file contains two columns: the predicted probability and the actual classification.  (a) Generate cutoff values as the average of each pair of consecutive predicted values.  (b) Classify the model predictions at each cutoff.  (c) Calculate the sensitivity and specificity of the model predictions at each cutoff and list them in order of the cutoff values.  (d) Draw the receiver operating curve.  (e) Calculate the area under the receiver operating curve.  Data► Curve► SQL►
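
The steps in Question 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's SQL solution: the function names and the eight sample records are hypothetical, a case is classified as positive when its predicted probability exceeds the cutoff, both classes are assumed to be present, and the area is computed with the trapezoid rule after adding the end points (0, 0) and (1, 1).

```python
def roc_rows(probs, actual):
    """Return (cutoff, sensitivity, specificity) for each midpoint cutoff."""
    distinct = sorted(set(probs))
    # (a) Cutoffs are averages of consecutive sorted predicted values.
    cutoffs = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    positives = sum(actual)               # assumes at least one positive case
    negatives = len(actual) - positives   # assumes at least one negative case
    rows = []
    for c in cutoffs:
        # (b) Predict positive when the probability exceeds the cutoff.
        tp = sum(1 for p, y in zip(probs, actual) if p > c and y == 1)
        fp = sum(1 for p, y in zip(probs, actual) if p > c and y == 0)
        # (c) Sensitivity = TP / positives; specificity = TN / negatives.
        rows.append((c, tp / positives, (negatives - fp) / negatives))
    return rows

def area_under_curve(rows):
    """(e) Trapezoid-rule area under the points (1 - specificity, sensitivity)."""
    points = sorted([(0.0, 0.0), (1.0, 1.0)] +
                    [(1 - spec, sens) for _, sens, spec in rows])
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical data: predicted probabilities and actual classes (1 = event).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
actual = [0, 0, 1, 0, 1, 0, 1, 1]
rows = roc_rows(probs, actual)
auc = area_under_curve(rows)
for cutoff, sens, spec in rows:
    print(f"{cutoff:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
print("AUC =", auc)
```

For step (d), plotting sensitivity against 1 - specificity from these rows gives the curve.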

Team Assignment

General requirements:

  • Work in teams of two.  Do not team up with someone with whom you have previously handed in a team assignment.
  • Upon submission, indicate the name of your teammate.
  • Each member of the team should submit a separate assignment.
  • Do not copy code from each other, but feel free to learn from each other.
  • The data reported by team members must be the same; the SQL code can differ.  Agree on the findings and help each other arrive at the same findings.
  • A student who completes a team assignment alone loses 10% of the grade.

Team tasks:

  1. Download data  Video► Download► SQL► Slides► Screen Shots►
  2. Clean the data as you or your teammate did in previous weeks.
  3. Estimate the likelihood ratios as you or your teammate did in previous weeks.
  4. Verify that both team members are working with the same set of cleaned data and the same set of likelihood ratios.
  5. Randomly set aside 80% of the data for training and 20% for validation.  Use the validation data set in the following calculations.
  6. Use naive Bayes to predict the probability of the outcome.
  7. Report the accuracy of the predictions using a receiver operating curve, and draw the curve.
  8. Report the area under the receiver operating curve.

To complete this team assignment, paste your SQL code, the receiver operating curve, and the area under the receiver operating curve into a Word document.  Then upload the document to Blackboard.  Each student uploads their own document by Sunday, 11:55 PM EST.

More

  1. Tutorials on receiver operating curves  PubMed►
  2. Receiver operating curves are not a reliable measure of accuracy PubMed

Copyright © Farrokh Alemi, Ph.D. First created on January 9th 2005. Most recent revision 07/10/2018. This page is part of the course on Healthcare Databases.