Georgetown University

Review of Preparing Data

Course Home Page

Assigned Reading

  • Read chapter 2 in Big Data in Health Care: Statistical Analysis of Electronic Health Record Read►



Instruction for Submission of Assignments: Assignments should be submitted directly on Blackboard.  In rare situations assignments can be sent directly by email to the instructor. Submission should follow these rules:

  1. Submit your answers in a Jupyter Notebook Download► YouTube► Slides►
  2. Submit you answers in Blackboard.

Task 1: Download Microsoft SQL Server.  LinkedIn(r) Learning►  Instruction for SSMS 2017► Video► Patruni's Teach One►

  • MAC users should first download and install virtualization software (VirtualBox, VMWare Player/Fusion, Parallels). Get a copy of an operating system from MSDN (i.e. Windows 10), or Linux (i.e., Ubuntu, or CentOS). Configure a virtual machine and install the guest operating system. Note: In this course we will be mostly using VirtualBox, but you can use any virtualization solution. 
  • Install MS SQL.  DreamSpark (with Microsoft software) library is available free of charge for GMU students. Click on start shopping.  Select SQL Server 2014 Express With Advanced Services With Service Pack 1 32/64-bit (English). Download►  
  • Use your GMU email to register (without Masonlive, i.e. Do not use your GMU password. Only Microsoft software is available on this site. Software is FREE. Contact instructor if your password does not work.

Task 2: Download and clean data.  Download data in three/four zipped files. You can focus on the entire data or focus on the data for patients who have at least 365 days of encounters. For password contact your instructor.  By opening this file you agree not to share the file with anyone else.  Unzip the files twice, first to get to the directory and then to get to the actual file.  Link to the data into Access or read the data into Microsoft SQL server.  We recommend you use SQL server.  To create the database, open SQL server and right click on the database and start a new database.  Right click on the database, select "Tasks," select "Import Data," select "Flat File Source" as the source of the data, change file type to "CSV Files," browse to where you unzipped the files, indicate that field names are in the first row, select as destination "SQL Server Native Client" file type. Massive Data► >365 Days Data, Access Code► SQL Code to Merge Files► Create Database► Visual Guide to Read SQL►

  • Remove blanks from numeric data, such as DxAtAge. 
  • Convert text data in the AgeAtDeath to float.  This is typically done through if statements such as these:
    • IIF(DxAtAge >0, DxAtAge, Null)
    • IIF(AgeAtDeath="Null", Null, cast(AgeAtDeath AS Float))
  • Identify individuals whose date of death or date of birth might be in error and have visits post date of death or visits prior to birth. Exclude them. Keep only the cases whose date of death, or date of birth, is not in error.  SQL Code►
  • Report the total number of cases that are subject to the analysis
  • Dahee Choi's Teach One►
  • Grace Buck's Teach One►


Copyright 2021 Farrokh Alemi, Ph.D. Most recent revision 04/21/2021.  This page is part of the course on Electronic Health Record Configuration and Data Analysis.