Statistical Process Improvement
Georgetown University

Review of Preparing Data


Assigned Reading

  • Read chapter 2 in Big Data in Health Care: Statistical Analysis of Electronic Health Record Read►



Instruction for Submission of Assignments: Assignments should be submitted directly on Blackboard.  In rare situations assignments can be sent directly by email to the instructor. Submission should follow these rules:

  1. We prefer if you submit your answers in a Jupyter Notebook using HTML.   
  2. All Excel cells, except the cells containing the data, must have formulas.  Do not paste the value into the cell, it must be calculated using a formula.  Even simple steps, such as adding two numbers, should be done using formulas. 
  3. Make sure that any control charts follow the visual rules below:  (1) Control limits must be in red and without markers, (2) Observed lines must have markers, (3) X and Y axis must be labeled, and (4) Charts must be linked to the data. 
  4. Copy and paste SQL or R code into Excel sheet.  Plot data in Excel. 
  5. The first sheet in the file should be a summary page.  In the summary page you should list how your answers to the question differs from answers provided within the assignment (inside Teach One or other answers).  You must indicate for each question if your control chart is exactly the same as seen in Teach One or other formats.  For each question, you must indicate if the answers you have provided is the same as the answers supplied on the web.  If there are no answers provided, you must indicate that there were no answers available on the web to compare your answers to.

To complete these assignments, you need to download Microsoft SQL Server.   Chris's Teach One►

  • MAC users should first download and install virtualization software (VirtualBox, VMWare Player/Fusion, Parallels). Get a copy of an operating system from MSDN (i.e. Windows 10), or Linux (i.e., Ubuntu, or CentOS). Configure a virtual machine and install the guest operating system. Note: In this course we will be mostly using VirtualBox, but you can use any virtualization solution. 
  • Install MS SQL.  DreamSpark (with Microsoft software) library is available free of charge for GMU students. Click on start shopping.  Select SQL Server 2014 Express With Advanced Services With Service Pack 1 32/64-bit (English).
  • SQL download is not necessary for MHA students who plan to respond to questions using Excel.

Question 1: In this problem we ask you to calculate a case mix index for a hospital from classification of its patients into Diagnostic Related Groupings (DRGs). This concept is fully described in pages 83 to 85 of the required book.  In Health Administration programs case mix issues arises in multiple courses where severity of patients receiving care in different hospitals are discussed. The case mix index allows the comparison of two hospitals. It is generally calculated as a weighted length of stay across all DRGs see in the hospital. The concept of weighted average was discussed in this section. In a case mix index, the weights are the probability of observing patients in a particular DRG category. Each DRG category is assumed to be mutually exclusive and exhaustive. The number of patients who are admitted for different DRGs are indicated in the attached data file. From these numbers you calculate the probability of each DRG. By multiplying the probability of the DRG by length of stay you get the contribution of each DRG. The case mix index is the sum of the product of probability of each DRG and length of stay within each DRG. The higher the case mix index, the larger the expected length of stay at the hospital. Which hospital has a higher case mix index?

Question 2: Download the attached file of ICD9 codes and descriptions and find the seven errors in the data, where the same ICD9 code has been assigned different descriptions.  If you do not have access to SQL, you can answer this using Excel 2021 version.  Data► Video► Kavallor's Teach One►

Question 3: Download data in three/four zipped files. You can focus on the entire data or focus on the data for patients who have at least 365 days of encounters. For password contact your instructor.  By opening this file you agree not to share the file with anyone else.  Unzip the files twice, first to get to the directory and then to get to the actual file.  Link to the data into Access or read the data into Microsoft SQL server.  We recommend you use SQL server.  To create the database, open SQL server and right click on the database and start a new database.  Right click on the database, select "Tasks," select "Import Data," select "Flat File Source" as the source of the data, change file type to "CSV Files," browse to where you unzipped the files, indicate that field names are in the first row, select as destination "SQL Server Native Client" file type. Massive Data► >365 Days Data, Access Code► SQL Code to Merge Files► Create Database► Visual Guide to Read SQL►

  • Remove blanks from numeric data, such as DxAtAge.  Convert text data in the AgeAtDeath to float.  This is typically done through if statements such as these:
    • IIF(DxAtAge >0, DxAtAge, Null)
    • IIF(AgeAtDeath="Null", Null, cast(AgeAtDeath AS Float))
  • Calculate the average age of and the standard deviation of the diagnosis.  Which 10 diagnoses occur first, meaning which diagnosis occurs at a younger age.   Access Code► SQL Code & Answer►
  • List the top 20 most frequent diagnoses that co-occur.  To complete this task you would need to join the table to itself.  Then, use ICD9 code in one table as the first and the ICD9 code in the second table as the second of the pair.  Count the number that match any pair of diagnoses.  Access Code► SQL Code►
  • Use STUFF function to concatenate list of unique diagnoses for the same person.  Count numbers of times these lists occur more than 29 times STUFF SQL►
  • Identify individuals whose date of death might be in error and have visits post date of death. Exclude them. Report the top 10 IDs that remain in order of IDs.  Access Code► SQL Code►
  • Rank order diagnosis in order of their reoccurrence for the same person.  Rank or Row Number functions are described on the web.  Search for these commands and look at some examples online.  For how many unique persons does the ICD diagnoses 250.00 repeat twice in the data? Rank Command►

    For example, the following tables shows how the rank order should work for person with ID 1:

ID ICD9 Rank  
1 410 1 First occurrence of diagnosis 410
1 250 1 First occurrence of diagnosis 250
1 410 2 Second occurrence of diagnosis 410
1 250 2 Second occurrence of diagnosis 250
1 250 3  
1 100 1  
1 250 4  

Question 4:  Examine data available online on hospital performance. Hospital performance is generally reported as a ratio of two numbers.  The value placed above the horizontal line in a fraction is called a numerator. In performance data it signifies the number of patients who had an adverse event. The numeric value below the fraction is called the denominator. It represents the total number of patients examined.  Sometimes, these denominator and numerators are risk adjusted, showing the number of people who had the adverse event among the number at risk.  

  • Report 5 measures that take into account severity/risk of the patients' illness.  To understand if a measure is risk adjusted you need to read the definition of numerator and denominator of the measure and see if it references risk. 
  • Report if the measure "PSI-09, Perioperative Hemorrhage or Hematoma Rate" is available and for what years. 
  • Report 5 measures that are appropriate for cardiovascular patients.

The following resources may be of use to you as you search CMS and AHRQ's online descriptions of various measures.

  • Hospital compare list of indicators, types of patients, and other information Dictionary► List►
  • Agency for Healthcare Research and Quality Patient Safety Indicators measures and 30-day death rates (to include PSI-09) Web►

Question 5: Count the comments (online reviews) and list the starting date and ending date of comments available on the web for 5 INOVA hospitals. The response requires both the count and years of reported comments for each of the 5 hospitals.  These are generally available through Google and other online review sites.



Copyright 1996 Farrokh Alemi, Ph.D. Most recent revision 09/11/2022.  This page is part of the course on Statistical Process Improvement, this is the lecture on Preparing Data