Basic data  

HAP 719: Advanced Statistics I

Ordinary Regression Missing Values

Missing values by row

Overview

In this module, you will learn to handle missing values in both dependent and independent data, a crucial skill for ensuring the integrity of your analyses. You will determine if data is missing at random, check the accuracy of mean imputation, and verify if missing values in EHRs indicate the absence of disease. By using a series of regressions (structural equation models), you will predict the value of missing variables, equipping you with advanced techniques to manage incomplete datasets effectively.

Learning Objectives

After completing the activities in this module, you should be able to:

  • Adjust for missing values in the dependent and independent data
  • Determine if data is missing at random
  • Check for accuracy of mean imputation
  • Check whether a missing value in EHRs accurately indicates the absence of disease
  • Use a series of regressions (structural equation models) to predict the value of missing variables

Lecture

AI assisted: indicates AI-assisted content, image, or video.

Assignments

Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.

Question 1 (toy data not allowed): Regress progression in the Infectious and Parasite body system on all other variables (except diabetes). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis on a 10% sample first, before running it on the entire data set, which may take several hours.

  1. Remove from the independent variables any body system that is always missing. Report the number of cases and variables that remain.
  2. Assume that missing independent variables indicate that the patient does not have any disease in the body system (i.e., assign a score of 0 when the data is missing). Print a summary of the data showing that there are no missing values in the data.
  3. Regress progression in the Infectious and Parasite body system on the independent variables. Report the total number of cases and variables in the analysis. Report the R-squared. Report the coefficients of variables that are statistically significant. 
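The three steps of Question 1 can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated stand-ins (the assignment itself does not allow toy data), the column names are hypothetical rather than the names in the attached file, and the 0.05 significance cutoff is an assumed convention.

```r
# Simulated stand-in data; column names are hypothetical, not the assignment's.
set.seed(1)
n <- 200
d <- data.frame(
  infectious  = rnorm(n),             # outcome: progression score
  diabetes    = rbinom(n, 1, 0.2),    # binary variable, excluded from the model
  circulatory = rnorm(n),
  respiratory = rnorm(n),
  digestive   = NA_real_              # a body system that is never recorded
)
d$circulatory[sample(n, 40)] <- NA
d$respiratory[sample(n, 60)] <- NA

# Step 1: drop variables that are missing in every row.
always_missing <- vapply(d, function(x) all(is.na(x)), logical(1))
d <- d[, !always_missing, drop = FALSE]
dim(d)                                # cases and variables that remain

# Step 2: treat a missing predictor as "no disease" and assign 0.
predictors <- setdiff(names(d), c("infectious", "diabetes"))
d[predictors] <- lapply(d[predictors], function(x) ifelse(is.na(x), 0, x))
summary(d)                            # no NA counts should be listed
stopifnot(!anyNA(d))

# Step 3: regress the outcome on the remaining predictors, leaving out diabetes.
fit <- lm(infectious ~ . - diabetes, data = d)
s <- summary(fit)
s$r.squared
coefs <- s$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]   # assumed 0.05 cutoff
```

On the real file, a 10% sample could be drawn before these steps with d[sample(nrow(d), round(0.1 * nrow(d))), ].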

Question 2 (toy data not allowed): Consider the regression of progression in the Infectious and Parasite body system on all other variables (except diabetes). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis on a 10% sample first, before running it on the entire data set, which may take several hours.

  1. Remove variables where a body system is always missing. Report the number of cases and variables that remain.
  2. Regress the indicators for missing independent variables on the other reported independent variables, using MICE software or running the regressions one at a time yourself. Report the coefficients of these regressions.
  3. Regress progression in the Infectious and Parasite body system on the independent variables and the indicator variables for missing values. Report the total number of cases and variables in the data. Report the R-squared for the regression. Report the coefficients of the regression equation and list the variables that are missing not at random.
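A minimal sketch of the missing-indicator idea behind Question 2, written in base R rather than with the mice package. The data are simulated, the column names are hypothetical, and the choice of a logistic regression for the missingness indicator is an assumption made here for illustration, not necessarily the approach taught in the lecture.

```r
# Simulated stand-in data; column names are hypothetical.
set.seed(2)
n <- 500
circulatory <- rnorm(n)
respiratory <- rnorm(n)
infectious  <- 0.5 * circulatory + 0.3 * respiratory + rnorm(n)

# Make respiratory more likely to be missing when circulatory is high.
respiratory[runif(n) < plogis(-1 + circulatory)] <- NA
d <- data.frame(infectious, circulatory, respiratory)

# Indicator: 1 when the value is missing, 0 when it is reported.
d$respiratory_missing <- as.integer(is.na(d$respiratory))

# Step 2: regress the missingness indicator on the other reported predictor.
miss_fit <- glm(respiratory_missing ~ circulatory, family = binomial, data = d)
coef(summary(miss_fit))

# Step 3: fill the missing values with 0 and include the indicator, so the
# indicator coefficient captures how cases with a missing value differ.
d$respiratory[is.na(d$respiratory)] <- 0
fit <- lm(infectious ~ circulatory + respiratory + respiratory_missing, data = d)
summary(fit)$r.squared
coef(summary(fit))
```

In this sketch, the coefficient on circulatory in the first model shows whether missingness depends on a reported variable, and the coefficient on respiratory_missing in the second shows whether cases with a missing value differ in the outcome after adjustment. How those results map onto "missing not at random" should follow the definition given in the lecture.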

More

For additional information (not part of the required reading), please see the following links:

  1. Introduction to regression by others YouTube► Slides►
  2. Regression using R Read►
  3. Statistical learning with R Read►
  4. Open introduction to statistics Read►

This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi, PhD. Home►  Email►