Healthcare Databases & Information Systems Course


Topic: Cleaning Data

Learning Module Objectives:

After completing the activities in this section, you should be able to:

  • Distinguish among types of SQL commands
  • Create tables and insert values using SQL
  • Filter data using SELECT and FROM commands
  • Filter data using WHERE, HAVING and GROUP BY commands
  • Set relationships among tables
  • Access fields in different tables through join statements
  • Calculate product of values in a column of data

Learning Material

  1. Preparing data from Electronic Health Records for analysis (use instructor's last name as password) Read►
  2. Types of SQL commands Slides► Video►
  3. Creating tables in Microsoft Server SQL Read► Slides► Video►
  4. Select command Slides► Video►
  5. Where, Having, and Group by commands Slides► Read► Video►
  6. Introduction to joins Slides► Video►
  7. Product of values of data in one column Slides► Video►
  8. Order data using RANK and row numbers Slides► Video►
  9. Manipulate DATE functions Slides►Video►
  10. Text manipulation Slides► Video►

Teach One Assignment

If you are supposed to teach about this section of the course, select one of the assignments, do the assignment and show it to the instructor to make sure you have done it correctly.  Prepare your slides, narrate your slides, remove excess words from narrated slides, convert narrated slides to a file format that can be uploaded,  upload your file, email everyone in the class the URL of your file.  Make sure that all these tasks are done ahead of scheduled class session.  Your peers will appreciate receiving your advice on how to solve a class assignment as soon as possible and well before last day prior to class session.

Individual Assignments

No individual assignment should be completed in teams.  Submit your work in Blackboard.  Do not discuss the work with other students. 

Question 1: Calculate various queries using the data provided

  • Import data from the following four files into four tables. Ptid►  Claims► ICD► CPT► Video►
  • Identify patients that have diabetes in the above database.  Video►
  • Calculate the average cost of each diagnosis sorted from most expensive to least expensive.  Exclude all bills with negative or 0 values.  Video►
  • Show if men are more likely to have diabetes than women.  Video►
  • Calculate which month is most likely to have a diagnosis reported.  Video►

Question 2:  Download the attached file of ICD9 codes and descriptions and find the seven errors in the data, where the same ICD9 code has been assigned different descriptions. Data►  Video►

Question 3: Create three tables: patient, provider and encounter tables.  Make sure that each table has a unique primary key.  The patient table should include first name, last name, street name, street number, and zip code.  Provider's table should include first name, last name, and whether board certified.  The encounter table should indicate the patient and provider involved, the diagnosis, and the treatment provided.  Enter three rows of data into each table.  Join the tables and identify providers that do not have patients. 

Question 4: Using the data you imported in previous week, count how many patients continue to visit despite having been reported to be dead.  Download Data►

Question 5: Using the data you imported last week, use GROUP BY and COUNT to report how many patients have same diagnosis occurring at same age, essentially having a duplicate entry in the database.  Download Data►

Team Assignment

Work in teams of 2 persons, never working with the same person again. Each person to initially work individually and then share your work with each other. Help each other to get to the same answers.  At submission, indicate the name of your team member.  Each member of the team should submit a separate assignment.  No copying of code from each other but feel free to learn from each other.  If team assignments are completed with individual effort, then the student loses 10% of the grade.   

Download data in zipped files. Merge the files. For password contact your instructor.  By opening this file you agree not to share the file with anyone.  Unzip the files, first to get to the directory and then to get to the actual file. Import data into Microsoft SQL server.  To create the database, open SQL server and right click on the database and start a new database.  Right click on the database, select "Tasks," select "Import Data," select "Flat File Source" as the source of the data, change file type to "CSV Files," browse to where you unzipped the files, indicate that field names are in the first row, select as destination "SQL Server Native Client" file type.  Video► Download► SQL► Slides► Screen Shots►

  • Remove blanks from numeric data, such as DxAtAge.  Convert text data in the AgeAtDeath to float.  This is typically done through if statements such as these:
    •  IIF(DxAtAge >0, DxAtAge, Null)
    • IIF(AgeAtDeath="Null", Null, cast(AgeAtDeath AS Float))
  • Calculate the average age of and the standard deviation of the diagnosis.  Which 10 diagnoses occur first, meaning which diagnosis occurs at a younger age.   Access► SQL Code & Answer►
  • List the top 20 most frequent diagnoses that co-occur.  To complete this task you would need to join the table to itself.  Then, use ICD9 code in one table as the first and the ICD9 code in the second table as the second of the pair.  Count the number that match any pair of diagnoses.  Access► SQL Code►
  • Use STUFF function to concatenate list of unique diagnoses for the same person.  Count numbers of times these lists occur more than 29 times STUFF SQL►
  • Identify individuals whose date of death might be in error and have visits post date of death. Exclude them. Report the top 10 IDs that remain in order of IDs.  Access► SQL Code►
  • Rank order diagnosis in order of their reoccurrence for the same person.  Rank or Row Number functions are described in Google.  Look up the format of the function and implement it in your SQL code. Identify the 1st, 2nd and 3rd re-occurrences of every diagnoses.  For example, the following tables shows how the rank order should work for person with ID 1:

ID ICD9 Rank
1 410 1
1 250 1
1 410 2
1 250 2
1 250 3
1 100 1
1 250 4


  1. Definition, content and use of electronic health records among United States providers  PubMed►
  2. Creating tables and records using Microsoft (R) Access Read► Video►
  3. Creating forms in Access Video► Slides►
  4. Create a hierarchical relationship inside Microsoft Access.   Video► SWF►

Copyright © 1996 Farrokh Alemi, Ph.D. First created on January 9th 2005. Most recent revision 06/18/2018. This page is part of the course on Healthcare Databases, the lecture on Tables and Records.