HAP 719: Advanced Statistics I

Ordinary Regression Missing Values

Missing values by row

Overview

In this module, you will learn to handle missing values in both dependent and independent variables. You will determine whether data are missing at random, check the accuracy of mean imputation, and verify whether missing values in electronic health records (EHRs) indicate the absence of disease. You will also use a series of regressions (structural equation models) to predict the values of missing variables.

Learning Objectives

After completing the activities in this module, you should be able to:

Adjust for missing values in the dependent and independent data
Determine if data is missing at random
Check for accuracy of mean imputation
Check the accuracy of assuming that a missing value in an EHR indicates absence of disease
Use a series of regressions (structural equation models) to predict the value of missing variables

Lecture


Missing value and regression Read► Slides► Video►
Yili's lecture on adjustments for missing values using R software Read► Slides► Video►

Assignments

Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.

Question 1: Regress progression in the Infectious and Parasite body system on all other variables (except diabetes). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis first on a 10% sample before running it on the entire data set, which may take several hours.

Remove all instances where progression in Infectious and Parasite body system is missing. Report the number of cases that remain.
Plot the progression in Infectious and Parasite body system. Is the data bimodal? Is the data symmetric around the mean? Transform the data to improve the Q-Q plot.
Include in your analysis all pairwise and three-way interactions of the independent variables. Exclude any variable or interaction term that is always missing. What is the total number of independent variables included in the regression?
Impute missing independent variables from other variables that are present. Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
Assume that missing independent variables indicate the patient does not have any disease in the body system (i.e., 0 score). Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
Assume the value of the missing independent variables can be replaced by the average value of the independent variable. Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
Indicate which method of replacing missing values fits the data best.
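
The three replacement strategies in the steps above can be compared with a short R sketch. This is a minimal illustration on simulated data, not the course solution: the data frame `dat`, the column names `infectious`, `circulatory`, and `respiratory`, and the helper `r2` are hypothetical stand-ins for the downloaded file. The regression-based imputation shown here is a simplified single pass that predicts each missing predictor from the other predictor only, and the model includes only one pairwise interaction.

```r
# Simulated stand-in for the course data (all names are hypothetical).
set.seed(719)
n <- 500
circulatory <- rnorm(n, mean = 2)
respiratory <- 0.6 * circulatory + rnorm(n)
infectious  <- 1 + 0.8 * circulatory + 0.5 * respiratory + rnorm(n)
dat <- data.frame(infectious, circulatory, respiratory)

# Knock out some values to mimic missing data.
dat$infectious[sample(n, 25)]   <- NA
dat$circulatory[sample(n, 100)] <- NA
dat$respiratory[sample(n, 100)] <- NA

# Drop rows where the outcome is missing and count what remains.
dat <- dat[!is.na(dat$infectious), ]
n_remaining <- nrow(dat)

# Helper: fit the regression with a pairwise interaction, return R-squared.
r2 <- function(d) {
  summary(lm(infectious ~ circulatory * respiratory, data = d))$r.squared
}

predictors <- c("circulatory", "respiratory")

# Method A: treat a missing value as no disease (score of 0).
zero_fill <- dat
for (v in predictors) zero_fill[[v]][is.na(zero_fill[[v]])] <- 0

# Method B: replace a missing value with the observed mean of that variable.
mean_fill <- dat
for (v in predictors) {
  mean_fill[[v]][is.na(mean_fill[[v]])] <- mean(dat[[v]], na.rm = TRUE)
}

# Method C: predict a missing predictor from the other, observed predictor.
reg_fill <- dat
for (v in predictors) {
  other <- setdiff(predictors, v)
  fit   <- lm(reformulate(other, response = v), data = dat)
  miss  <- is.na(dat[[v]]) & !is.na(dat[[other]])
  reg_fill[[v]][miss] <- predict(fit, newdata = dat[miss, , drop = FALSE])
  # Rows where both predictors are missing fall back to the observed mean.
  still <- is.na(reg_fill[[v]])
  reg_fill[[v]][still] <- mean(dat[[v]], na.rm = TRUE)
}

# Percent of variation explained under each method.
results <- c(zero = r2(zero_fill), mean = r2(mean_fill), regression = r2(reg_fill))
print(n_remaining)
print(round(100 * results, 1))
```

On the real data, the same comparison would use the actual column names and the full set of pairwise and three-way interaction terms; the method with the highest percent of variation explained is the one that fits the data best.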

Data Download► Dictionary►
Niharika Sarraf's Teach One YouTube►
Yili Lin's Answer► R-code►

Question 2: Consider the regression of progression in the Infectious and Parasite body system on all other variables (except diabetes). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis first on a 10% sample before running it on the entire data set, which may take several hours.

For each independent variable, create a binary variable that is 1 whenever the variable is missing and 0 otherwise. Predict progression in the Infectious and Parasite body system from these binary variables and report whether any of them is statistically significant. List the variables that are not missing at random. Variables that are not missing at random have a statistically significant relationship to the response (outcome) variable.
Replace missing values using MICE (multiple imputation by chained equations). Report the percent of variation explained before and after the MICE adjustments.
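
The two steps above can be sketched in R as follows. This is a minimal sketch on simulated data under stated assumptions, not the posted answer: the data frame `dat` and the column names `infectious`, `circulatory`, and `respiratory` are hypothetical, the 0.05 significance cutoff is an assumed convention, and averaging R-squared across the imputed data sets is a simplification chosen for brevity. The MICE step runs only if the `mice` package is installed.

```r
# Simulated stand-in for the course data (all names are hypothetical).
set.seed(819)
n <- 400
circulatory <- rnorm(n)
respiratory <- 0.5 * circulatory + rnorm(n)
infectious  <- 1 + 0.7 * circulatory + 0.4 * respiratory + rnorm(n)
dat <- data.frame(infectious, circulatory, respiratory)

# circulatory goes missing more often when the outcome is high;
# respiratory goes missing completely at random.
dat$circulatory[runif(n) < plogis(infectious - 2)] <- NA
dat$respiratory[runif(n) < 0.2] <- NA

# Binary indicators: 1 where the variable is missing, 0 otherwise.
ind <- data.frame(
  infectious       = dat$infectious,
  circulatory_miss = as.integer(is.na(dat$circulatory)),
  respiratory_miss = as.integer(is.na(dat$respiratory))
)

# Regress the outcome on the indicators. A significant indicator flags a
# variable that, by the assignment's definition, is not missing at random.
ind_fit    <- summary(lm(infectious ~ ., data = ind))$coefficients
p_values   <- ind_fit[-1, "Pr(>|t|)"]          # drop the intercept row
not_random <- names(p_values)[p_values < 0.05] # assumed cutoff
print(not_random)

# R-squared before imputation: lm() drops incomplete rows by default.
r2_before <- summary(lm(infectious ~ circulatory + respiratory, data = dat))$r.squared

# MICE imputation, run only when the mice package is available.
if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(dat, m = 5, seed = 1, printFlag = FALSE)
  # R-squared in each completed data set, then a simple average.
  r2_each <- sapply(seq_len(imp$m), function(i) {
    completed <- mice::complete(imp, i)
    summary(lm(infectious ~ circulatory + respiratory, data = completed))$r.squared
  })
  r2_after <- mean(r2_each)
  print(round(100 * c(before = r2_before, after = r2_after), 1))
}
```

The regression before imputation uses only complete rows, so the before and after figures are computed on different numbers of cases; it helps to report both sample sizes alongside the percent of variation explained.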

Data Download► Dictionary►
Sowmya Chakravarthy's Answer► R-code►
Ledo Thankachan's Teach One YouTube►

For additional information (not part of the required reading), please see the following links:

Introduction to regression by others YouTube► Slides►
Regression using R Read►
Statistical learning with R Read►
Open introduction to statistics Read►

This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi PhD Home► Email►