Topic: Basic SQL & Data Cleaning
Learning Module Objectives:
After completing the activities in this section, you should be able to:
- Distinguish among types of SQL commands
- Create tables and insert values using SQL
- Filter data using SELECT and FROM commands
- Filter data using WHERE, HAVING and GROUP BY commands
- Set relationships among tables
- Access fields in different tables through join statements
- Rank order data
- Use functions to manipulate dates
- Use functions to manipulate text
- Preparing data from Electronic Health Records
(use instructor's last name as password)
- Types of SQL commands
- Creating tables in Microsoft Server SQL
- Select command
- GROUP BY command
- WHERE and HAVING commands
- Introduction to joins
- RANK and row numbers
- Date functions Slides►
- Text manipulation Slides►
- SQL code for cleaning the data
Teach One Assignment
If you are supposed to teach about this section of the course, select one of
the assignments, do the assignment and show it to the instructor to make sure
you have done it correctly. Prepare your slides, narrate your slides,
remove excess words from narrated slides, convert narrated slides to a file
format that can be uploaded, upload your file, email everyone in the class
the URL of your file. Make sure that all these tasks are done ahead of
scheduled class session. Your peers will appreciate receiving your advice
on how to solve a class assignment as soon as possible and well before last day
prior to class session.
No individual assignment should be completed in teams. Submit
your work in Blackboard. Do not discuss the work with other
Question 1: Calculate various queries using the data
- Import data from the following four files into four tables. Ptid►
- Identify patients that have diabetes in the above database.
- Calculate the average cost of each diagnosis sorted from most expensive
to least expensive. Exclude all bills with negative or 0 values.
- Show if men are more likely to have diabetes than women.
- Calculate which month is most likely to have a diagnosis reported.
Question 2: Download the attached file of ICD9 codes and descriptions and find the seven
errors in the data, where the same ICD9 code has been assigned different
Question 3: Using the data you imported in previous week,
count how many patients continue to visit despite having been reported to be
Question 4: Using the data you imported last week, use GROUP
BY and COUNT to report how many patients have same diagnosis occurring at same
age, essentially having a duplicate entry in the database.
Each week you will be complete a team assignment. Each week you are to work with
a classmate that you have not previously worked with, so that by the end of this
course, you will have completed team, assignments with 8 different partners.
- Work in teams of 2 persons.
- Upon submission, indicate the name of your team member.
- Each member of the team should submit a separate assignment.
- No copying of code from each other but feel free to learn from each other. The data reported must be the same. Come to an agreement on the findings
and help each other to arrive to the same findings.
- If team assignments are completed with individual effort, then the student loses 10% of the grade.
- Download data in zipped files.
- Merge the files.
- For password contact your instructor.
- Do not share the data file. By opening this file you agree not to share the file with anyone.
- Unzip the files, first to get to the directory and then to get to the actual file.
- Import data into Microsoft SQL server.
- To create the database, open SQL server and right click on the database and start a new database. Select "Tasks," select "Import Data," select "Flat
File Source" as the source of the data, change file type to "CSV Files,"
browse to where you unzipped the files, indicate that field names are in the
first row, select as destination "SQL Server Native Client" file type.
- Remove blanks from numeric data, such as DxAtAge.
- Convert text data in the AgeAtDeath to float. This is typically done through if statements such as these:
- IIF(DxAtAge >0, DxAtAge, Null)
- IIF(AgeAtDeath="Null", Null, cast(AgeAtDeath AS Float))
- Identify individuals whose date of death or birth might be in error and have visits post date of death or prior to birth. Exclude them.
- Remove diagnoses occurring more than once for the same patient at same age
- Remove patient ids that are null or start with Z.
- Remove diagnoses for patients whose age of diagnosis is null
- Remove diagnoses that occur before patient is 21 years old or when
patient is more than 99 years old
- Rank order diagnosis in order of their reoccurrence for the same person. Assign ranks for diagnoses that occur for the same person once, twice,
three times, four times, or more than 5 times. Remove diagnoses that
re-occur more than 5 times.
- Order the data by patient ID and diagnosis and rank of diagnosis
To complete this team assignment, upload the first 4 lines of the merged clean data into a word
file to Blackboard. Both team members must mention each other in the uploaded
document. The 4 lines of data results for both team members should be the same,
work together until you achieve the same results. The SQL code could be
different. Each student will upload their files by Sunday, 11:55 PM, EST.
- Definition, content and use of electronic health records among United
- Microsoft Access
- Creating tables and records
- Creating forms
- Create hierarchical relationships
- Introduction to queries
Copyright © 1996
Farrokh Alemi, Ph.D.
First created on
January 9th 2005.
Most recent revision
07/24/2018. This page is part of the
Healthcare Databases, the lecture on