Healthcare Databases

Lab4: Clean Data

Team Assignments

Work in teams of two. At submission, indicate the name of your teammate. Do not team up with a person you have previously worked with. Each member of the team should submit a separate assignment. Do not copy code from each other, but feel free to learn from each other. If a team assignment is completed through individual effort, the student loses 10% of the grade.

  1. Download the data in three or four zipped files. You can work with the entire data set or only with the data for patients who have at least 365 days of encounters. Contact your instructor for the password. By opening this file you agree not to share it with anyone else. Unzip the files twice, first to get to the directory and then to get to the actual file. Link the data into Access or read the data into Microsoft SQL Server; we recommend SQL Server. To create the database, open SQL Server, right-click the "Databases" folder, and start a new database. Then right-click the new database, select "Tasks," select "Import Data," select "Flat File Source" as the source of the data, change the file type to "CSV Files," browse to where you unzipped the files, indicate that field names are in the first row, and select "SQL Server Native Client" as the destination.  Massive Data► >365 Days Data, Access Code► SQL Code to Merge Files► Create Database► Visual Guide to Read SQL►
    • Remove blanks from numeric data, such as DxAtAge.  Convert text data in the AgeAtDeath field to float.  This is typically done with IIF expressions such as these (in SQL Server, string literals take single quotes):
      • IIF(DxAtAge > 0, DxAtAge, Null)
      • IIF(AgeAtDeath = 'Null', Null, CAST(AgeAtDeath AS Float))
    • Calculate the average and the standard deviation of the age at each diagnosis.  Which 10 diagnoses occur first, that is, at the youngest age?   Access Code► SQL Code & Answer►
    • List the 20 most frequent pairs of diagnoses that co-occur.  To complete this task, join the table to itself.  Then use the ICD9 code from the first copy of the table as the first member of the pair and the ICD9 code from the second copy as the second member.  Count the number of matches for each pair of diagnoses.  Access Code► SQL Code►
    • Use the STUFF function to concatenate the list of unique diagnoses for the same person.  Count how many of these lists occur more than 29 times.  STUFF SQL►
    • Identify individuals whose date of death might be in error because they have visits after the date of death. Exclude them. Report the first 10 remaining IDs, in order of ID.  Access Code► SQL Code►
    • Rank diagnoses in order of their reoccurrence for the same person.  Descriptions of the RANK and ROW_NUMBER functions can be found through a Google search.  Look up the format of the function and implement it in your SQL code. Identify the 1st, 2nd, and 3rd reoccurrences of every diagnosis.  For example, the following table shows how the rank order should work for the person with ID 1:

    ID  ICD9  Rank
    1   410   1
    1   250   1
    1   410   2
    1   250   2
    1   250   3
    1   100   1
    1   250   4
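
The ranking task above can be sketched in SQL Server with ROW_NUMBER. This is a minimal sketch, not the course's answer: the table variable @Dx and the DxAtAge values are made up for illustration, and it assumes that DxAtAge gives the order in which diagnoses were recorded.

```sql
-- Hypothetical sample rows for the person with ID 1; ages are invented.
DECLARE @Dx TABLE (ID INT, ICD9 VARCHAR(10), DxAtAge FLOAT);
INSERT INTO @Dx (ID, ICD9, DxAtAge) VALUES
    (1, '410', 60.1), (1, '250', 60.2), (1, '410', 61.0),
    (1, '250', 61.5), (1, '250', 62.0), (1, '100', 62.3),
    (1, '250', 63.0);

-- Number each occurrence of a diagnosis within the same person,
-- restarting at 1 for every (ID, ICD9) combination.
SELECT ID,
       ICD9,
       ROW_NUMBER() OVER (PARTITION BY ID, ICD9 ORDER BY DxAtAge) AS Occurrence
FROM @Dx
ORDER BY ID, DxAtAge;
```

On these sample rows the Occurrence column reads 1, 1, 2, 2, 3, 1, 4, the same sequence as the table above. To keep only the first three occurrences, wrap the query in a common table expression and filter on Occurrence <= 3. If two rows for the same diagnosis can share the same age, ROW_NUMBER breaks the tie arbitrarily, while RANK gives both rows the same number.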
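
For the cleaning step in the first bullet, a variation on the IIF expressions uses NULLIF and TRY_CAST (available in SQL Server 2012 and later). This is a sketch under assumptions: the table variable @Raw and its rows are hypothetical, blank DxAtAge values are assumed to have been imported as 0, and AgeAtDeath is assumed to hold the text 'Null', an empty string, or a number stored as text.

```sql
-- Hypothetical raw rows showing the kinds of values that need cleaning.
DECLARE @Raw TABLE (ID INT, DxAtAge FLOAT, AgeAtDeath VARCHAR(20));
INSERT INTO @Raw (ID, DxAtAge, AgeAtDeath) VALUES
    (1, 45.2, '78.4'),
    (2, 0,    'Null'),
    (3, 61.7, ''),
    (4, 30.5, 'unknown');

SELECT ID,
       -- Treat zero or negative ages as missing.
       IIF(DxAtAge > 0, DxAtAge, NULL) AS DxAtAgeClean,
       -- Turn '' and 'Null' into NULL first, then convert;
       -- TRY_CAST returns NULL instead of failing on text such as 'unknown'.
       TRY_CAST(NULLIF(NULLIF(LTRIM(RTRIM(AgeAtDeath)), ''), 'Null') AS FLOAT)
           AS AgeAtDeathClean
FROM @Raw;
```

The inner NULLIF matters because SQL Server can convert an empty string to 0 when casting to float, which would turn a missing age at death into an age of zero.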
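
The self-join described in the co-occurrence task could look like the following sketch. The table variable @Dx and its rows are hypothetical, and the query assumes that a pair should be counted once per person no matter how often either diagnosis repeats; the assignment may intend a different counting rule.

```sql
-- Hypothetical sample data: three persons with a few diagnoses each.
DECLARE @Dx TABLE (ID INT, ICD9 VARCHAR(10));
INSERT INTO @Dx (ID, ICD9) VALUES
    (1, '250'), (1, '410'), (1, '250'),
    (2, '250'), (2, '410'),
    (3, '250'), (3, '100');

-- Collapse repeated diagnoses so each person contributes a pair at most once.
WITH PersonDx AS (
    SELECT DISTINCT ID, ICD9 FROM @Dx
)
SELECT TOP 20
       a.ICD9   AS FirstDx,
       b.ICD9   AS SecondDx,
       COUNT(*) AS Persons
FROM PersonDx AS a
JOIN PersonDx AS b
    ON a.ID = b.ID
   AND a.ICD9 < b.ICD9   -- keep each unordered pair once and drop self-pairs
GROUP BY a.ICD9, b.ICD9
ORDER BY Persons DESC, FirstDx, SecondDx;
```

The condition a.ICD9 < b.ICD9 is a design choice: without it, every pair would appear twice (once in each order) and every diagnosis would also pair with itself.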
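
For the STUFF task, one common SQL Server pattern combines STUFF with FOR XML PATH('') to build a comma-separated list per person. The sketch below is one possible approach, not necessarily the one behind the course's link; the table variable @Dx and its rows are hypothetical.

```sql
-- Hypothetical sample data.
DECLARE @Dx TABLE (ID INT, ICD9 VARCHAR(10));
INSERT INTO @Dx (ID, ICD9) VALUES
    (1, '250'), (1, '410'), (1, '250'),
    (2, '410'), (2, '250'),
    (3, '100');

WITH PersonDx AS (
    -- Unique diagnoses per person.
    SELECT DISTINCT ID, ICD9 FROM @Dx
),
DxList AS (
    SELECT p.ID,
           -- FOR XML PATH('') glues ',code' fragments into one string;
           -- STUFF then removes the leading comma.
           STUFF((SELECT ',' + u.ICD9
                  FROM PersonDx AS u
                  WHERE u.ID = p.ID
                  ORDER BY u.ICD9
                  FOR XML PATH('')), 1, 1, '') AS Diagnoses
    FROM (SELECT DISTINCT ID FROM @Dx) AS p
)
SELECT Diagnoses, COUNT(*) AS Persons
FROM DxList
GROUP BY Diagnoses
HAVING COUNT(*) > 29   -- the sample is too small to reach this; lower it to see rows
ORDER BY Persons DESC;
```

Sorting the codes inside the subquery makes persons 1 and 2 produce the same list ('250,410') even though their diagnoses were recorded in a different order.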

Individual Assignment

No individual assignment should be completed in teams.  Submit your work without discussion with other students.  At submission, indicate "This work was completed without help from other students."

  1. Import data from the following four files into four tables.  Ptid file►   Claims file► ICD file► CPT file► Video►
    • Identify patients who have diabetes in the above database.  Video►
    • Calculate the average cost of each diagnosis, sorted from most expensive to least expensive.  Exclude all bills with negative or 0 values.  Video►
    • Show whether men are more likely to have diabetes than women.  Video►
    • Calculate which month is most likely to have a diagnosis reported.  Video►
  2. Download the attached file of ICD9 codes and descriptions and find the seven errors in the data, where the same ICD9 code has been assigned different descriptions. Data►  Video►
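
For the second task, codes that carry more than one description can be found by grouping on the code. This is a minimal sketch: the table variable @Icd, the column name DxDescription, and the placeholder rows are all hypothetical and are not taken from the course data.

```sql
-- Placeholder rows; the codes and descriptions are invented for illustration.
DECLARE @Icd TABLE (ICD9 VARCHAR(10), DxDescription VARCHAR(200));
INSERT INTO @Icd (ICD9, DxDescription) VALUES
    ('001', 'description A'),
    ('001', 'description B'),
    ('002', 'description C'),
    ('002', 'description C');

-- A code is suspect when it maps to more than one distinct description.
SELECT ICD9,
       COUNT(DISTINCT DxDescription) AS DescriptionCount
FROM @Icd
GROUP BY ICD9
HAVING COUNT(DISTINCT DxDescription) > 1;
```

On the placeholder rows only code '001' is returned. Joining the result back to the table lists the conflicting descriptions side by side.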
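
The average-cost task in the first item can be sketched the same way. The table variable @Claims and its columns are assumptions, since the layout of the claims file is not shown here.

```sql
-- Hypothetical claims; the column names and amounts are invented.
DECLARE @Claims TABLE (ClaimID INT, ICD9 VARCHAR(10), Cost DECIMAL(10, 2));
INSERT INTO @Claims (ClaimID, ICD9, Cost) VALUES
    (1, '250', 120.00), (2, '250', 80.00), (3, '250', -80.00),
    (4, '410', 900.00), (5, '410', 0.00),  (6, '100', 40.00);

SELECT ICD9,
       AVG(Cost) AS AvgCost,
       COUNT(*)  AS Bills
FROM @Claims
WHERE Cost > 0            -- exclude bills with negative or 0 values
GROUP BY ICD9
ORDER BY AvgCost DESC;    -- most expensive diagnosis first
```

The filter sits in the WHERE clause rather than in HAVING, so the excluded bills never enter the average.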

Teach One

In this course we learn using the paradigm of learn one, do one, teach one.  As part of learning this content, one or more of your peers have completed this assignment ahead of time. They will contact you and introduce themselves. You can get answers to your questions from them.  They are there to help. They also have the option to post videos to this site to help resolve the problems you are facing.  If students do not wish to post their videos publicly, they can post them within Blackboard. Here are some videos posted openly to the web:

No student videos are available at this time


This page is part of the course on Healthcare Databases taught by various instructors. It was last edited on Wednesday September 06, 2017 by Farrokh Alemi, Ph.D.