George Mason University HAP 719 Advanced Statistics

HAP 719: Advanced Statistics I

Ordinary Regression Missing Values

 

Overview

Missing value and regression Read► Slides►

Learning Objectives

After completing the activities in this module you should be able to:

  • Adjust for missing values in the dependent and independent data
  • Determine if data is missing at random
  • Check for accuracy of mean imputation
  • Check for accuracy of missing value in EHRs indicating absence of disease
  • Use a series of regressions (structural equation models) to predict the value of missing variables

Adjustments for Missing Values

Several approaches are available for replacing missing values so that the data form a complete matrix, with no variable missing information. These approaches are described below.

When the dependent variable is missing information, the best strategy is to drop the row of data. For example, if we are predicting the health status of a patient from social determinants and medical history variables, then when health status is missing, one should ignore the entire case, including the available social determinants and medical history. In EHRs, a missing value typically means that a code is absent from the record.
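
As a minimal sketch of this step, assume a hypothetical data frame whose outcome column is named health_status (all names and values below are made up for illustration). Rows with a missing outcome are removed, while rows missing only a predictor are kept so they can be handled later:

```r
# Toy data; every name and value here is hypothetical
patients <- data.frame(
  health_status = c(3, NA, 5, 2, NA),   # dependent variable
  income        = c(40, 55, NA, 30, 62),
  prior_visits  = c(1, 0, 4, NA, 2)
)

# Keep only the rows where the dependent variable is observed
analysis_data <- patients[!is.na(patients$health_status), ]

# Rows missing only an independent variable are still present
nrow(analysis_data)
colSums(is.na(analysis_data))
```

Filtering on the outcome column alone, rather than calling complete.cases() on the whole data frame, avoids discarding cases whose predictors could still be imputed.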

Drop the Entire Case: When independent variables are missing information, several strategies are available, and the choice should reflect the nature of your data and your research question. The simplest is to drop the entire case (row). In R, you might drop cases because of missing data, outliers, data quality problems, or irrelevance to the research question. Here are some common scenarios in which you might consider dropping entire cases:

When missing data is substantial and imputation is not feasible or appropriate for your analysis, dropping cases with missing values may be a reasonable choice. You can use functions like na.omit() or complete.cases() to remove rows with missing data:

# Remove rows with any missing values
mydata <- mydata[complete.cases(mydata), ]

When you have identified outliers that are not representative of the underlying population or are adversely affecting the assumptions of your analysis (e.g., in regression models), you might choose to remove the entire cases containing those outliers:

# Remove cases with outliers in a specific variable (e.g., "variable_name")
mydata <- mydata[mydata$variable_name < upper_threshold & mydata$variable_name > lower_threshold, ]
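
The thresholds in the snippet above are left undefined. One common convention, used here only as an assumption and not something this page prescribes, is the 1.5 × IQR rule. A sketch on made-up data:

```r
# Toy variable with one extreme value; the data are hypothetical
mydata <- data.frame(variable_name = c(10, 12, 11, 13, 12, 95, NA))

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles
q <- quantile(mydata$variable_name, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
lower_threshold <- q[1] - 1.5 * iqr
upper_threshold <- q[2] + 1.5 * iqr

# which() discards NA comparisons, so rows with a missing value
# are removed here instead of being returned as all-NA rows
keep <- which(mydata$variable_name > lower_threshold &
              mydata$variable_name < upper_threshold)
trimmed <- mydata[keep, , drop = FALSE]
nrow(trimmed)
```

Without which(), a logical index containing NA returns rows filled with NA, which is easy to overlook when a variable has both outliers and missing values.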

In some cases, you might have data quality issues that cannot be resolved easily. For instance, if you suspect data entry errors or inconsistencies that cannot be corrected, you may decide to drop cases with problematic data:

# Remove cases with data quality issues (e.g., non-numeric entries in a variable that should be numeric)
# as.numeric() returns NA for entries it cannot parse; note that this also drops rows that were already NA
parsed <- suppressWarnings(as.numeric(as.character(mydata$variable_name)))
mydata <- mydata[!is.na(parsed), ]

When you are working with a large dataset and only need a specific subset for your analysis, you can choose to drop cases that are not relevant to your research question:

# Select a subset of cases meeting specific criteria
mydata <- mydata[mydata$variable_name == "desired_value", ]

Remember that dropping entire cases can lead to a loss of valuable information, a reduced sample size, and potential biases in your analysis if not done carefully. Before deciding to drop cases, consider the impact on the validity and representativeness of your results. Document your data preprocessing steps, including any case removal, for transparency and reproducibility. Where appropriate, consider alternatives such as imputation or robust statistical methods, which handle missing data or outliers without dropping cases.
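
One way to gauge that impact before dropping anything is to count how much is missing in each variable and how many rows would survive. The sketch below uses a made-up data frame; the column names are hypothetical:

```r
# Toy data; names and values are hypothetical
mydata <- data.frame(
  age     = c(34, NA, 51, 47, 29, NA),
  smoker  = c(0, 1, NA, 0, 1, 0),
  outcome = c(2.1, 3.4, 1.8, 2.9, 3.0, 2.2)
)

# Share of missing values in each variable
missing_share <- colMeans(is.na(mydata))

# Rows kept and lost if every incomplete case were dropped
n_total    <- nrow(mydata)
n_complete <- sum(complete.cases(mydata))
n_dropped  <- n_total - n_complete

print(round(missing_share, 2))
cat("Complete cases:", n_complete, "of", n_total, "\n")
```

If a small share of missing values per variable adds up to a large share of dropped rows, as it does here, imputation is usually worth considering.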

Replace Missing with Mode or Average: A simple strategy is to replace the missing information with the mode of the variable. This is typically done for binary variables. For example, if we are analyzing the medical history of patients in electronic health records, a missing diagnosis is replaced with the mode for that diagnosis in the data, most often 0 (absent). In other words, missing is treated as an indicator that the condition is absent. This approach needs no libraries beyond R's built-in functions. Start by loading the dataset that contains the variable with missing values.

# Load your dataset (e.g., "mydata.csv")
mydata <- read.csv("mydata.csv")

Identify the variable in your dataset that contains missing values. Let's assume the variable is called my_variable.

variable_with_missing <- "my_variable"

You can calculate the mode of the variable using the table() and sort() functions. The as.numeric() call assumes the variable is coded numerically (for example, 0/1). Here's a code snippet to find the mode:

# Calculate the mode
mode_value <- as.numeric(names(sort(table(mydata$my_variable), decreasing = TRUE)[1]))

This code counts the occurrences of each unique value in the variable and selects the one with the highest count as the mode. Replace the missing values in the variable with the calculated mode. You can use the ifelse() function to do this efficiently:

mydata$my_variable <- ifelse(is.na(mydata$my_variable), mode_value, mydata$my_variable)

This code checks if each element in mydata$my_variable is missing (NA) and replaces it with mode_value if it is, leaving the original value unchanged otherwise. Verify that the missing values have been replaced with the mode by examining the variable or checking summary statistics:

summary(mydata$my_variable)

This will display a summary of the variable. Because the missing values have been replaced with the mode, the summary should no longer list any NA's. Similar code can be written to replace missing continuous information with the average or median of the variable's observed values.
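
A sketch of that similar code follows, on made-up data with hypothetical column names. It shows mean imputation, median imputation, and the EHR convention of treating a missing code as 0 (absent):

```r
# Toy data; names and values are hypothetical
mydata <- data.frame(
  lab_value = c(4.0, NA, 6.0, 5.0, NA, 9.0),  # continuous variable
  diagnosis = c(1, NA, 0, 0, NA, 1)           # binary code; NA = no code recorded
)

# Mean imputation: fill gaps with the average of the observed values
mean_value <- mean(mydata$lab_value, na.rm = TRUE)
mydata$lab_mean <- ifelse(is.na(mydata$lab_value), mean_value, mydata$lab_value)

# Median imputation: less sensitive to extreme values than the mean
median_value <- median(mydata$lab_value, na.rm = TRUE)
mydata$lab_median <- ifelse(is.na(mydata$lab_value), median_value, mydata$lab_value)

# Treat a missing diagnosis code as absence of the disease
mydata$diagnosis_zero <- ifelse(is.na(mydata$diagnosis), 0, mydata$diagnosis)

# Compare the spread before and after mean imputation
var(mydata$lab_value, na.rm = TRUE)
var(mydata$lab_mean)
```

The imputed columns are stored under new names so the original values stay available for comparison. Filling every gap with the same value leaves the mean unchanged but shrinks the variance, which is one reason the accuracy of mean imputation should be checked rather than assumed.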

Replace Missing with Multiple Imputations: Multiple imputation involves creating multiple datasets, each with different imputed values for the missing data, and then running the regression analysis on each of these datasets separately. The results from these analyses are then combined to obtain parameter estimates and standard errors that reflect the uncertainty about the imputed values. Start by loading the necessary R library. The mice package handles the imputation; the lm() function used for regression is part of base R (the stats package) and needs no library call.

library(mice)

Read the dataset that contains missing values and the variables you want to include in your regression model.

# Load your dataset (e.g., "mydata.csv")
mydata <- read.csv("mydata.csv")

Specify the Variables for imputation. Identify the variables that have missing values and those you want to include in your regression model. Let's assume you want to perform a linear regression with dependent_variable as the outcome and independent_variable1 and independent_variable2 as predictors.

vars_to_impute <- c("dependent_variable", "independent_variable1", "independent_variable2")

Use the mice package to create multiple imputations for the missing data. Specify the number of imputations you want to generate using the m parameter.

# Set the number of imputations (you can choose a different number)
num_imputations <- 5
# Create multiple imputations for the selected variables.
# predictorMatrix is left at its default, so every variable is used to predict the others.
imp_data <- mice(mydata[, vars_to_impute], m = num_imputations, method = "pmm")
# Summary of imputed datasets
summary(imp_data)
# Note that method = "pmm" uses Predictive Mean Matching for imputation, but you can choose other imputation methods depending on your data.

Run your regression analysis on each of the imputed datasets. In this example, we're using linear regression:

# Create an empty list to store the fitted models
regression_results <- list()
# Loop through each imputed dataset and fit the regression
for (i in 1:num_imputations) {
  # Store the fitted model itself (not its summary) so the fits can be pooled later
  regression_results[[i]] <- lm(dependent_variable ~ independent_variable1 + independent_variable2,
                                data = complete(imp_data, i))
}

Combine the results from the multiple imputations to obtain pooled estimates and standard errors. The pool() function in mice applies Rubin's rules to combine the results:

 pooled_results <- pool(regression_results)
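
For a single coefficient, Rubin's rules amount to the arithmetic below. This is a base R sketch on made-up numbers, meant only to show what the pooling step computes; it is not the output of pool(), and the estimates and standard errors are hypothetical:

```r
# Hypothetical estimates of one regression coefficient and their
# standard errors, one pair per imputed dataset (m = 5)
estimates  <- c(0.52, 0.48, 0.55, 0.50, 0.45)
std_errors <- c(0.10, 0.11, 0.10, 0.12, 0.11)
m <- length(estimates)

# Pooled estimate: the average across imputed datasets
pooled_estimate <- mean(estimates)

# Within-imputation variance: the average squared standard error
within_variance <- mean(std_errors^2)

# Between-imputation variance: how much the estimates disagree
between_variance <- var(estimates)

# Total variance combines both sources of uncertainty
total_variance <- within_variance + (1 + 1 / m) * between_variance
pooled_se <- sqrt(total_variance)

c(estimate = pooled_estimate, se = pooled_se)
```

The between-imputation term is what a single imputation ignores, so the pooled standard error is never smaller than the average within-imputation standard error.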

Finally, interpret and present the pooled regression results, which include estimates, standard errors, confidence intervals, and p-values that account for the uncertainty introduced by imputation. This process allows you to handle missing data in your regression analysis and yields more reliable and valid statistical inferences. Adjust the number of imputations and the imputation method as needed based on the characteristics of your data and research question.

Assignments

Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.

Question 1: Regress progression in Infectious and Parasite body system on all other variables (except diabetes). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can do the analysis first on a 10% sample before you do it on the entire data set, which may take several hours.

  1. Remove all instances where progression in Infectious and Parasite body system is missing.  Report the number of cases that remain.
  2. Plot the progression in Infectious and Parasite body system. Is the data bimodal?  Is the data symmetric around the mean? Transform the data to improve the Q-Q plot.
  3. Include in your analysis all pairwise and three-way interactions of the independent variables.  Exclude any variable or interaction term that is always missing. What is the total number of independent variables included in the regression?
  4. Impute missing independent variables from other variables that are present.  Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
  5. Assume that missing independent variables indicate the patient does not have any disease in the body system (i.e., 0 score).   Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
  6. Assume the value of the missing independent variables can be replaced by the average value of the independent variable. Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
  7. Indicate which method of replacing missing values fits the data best.

Question 2: Consider the regression of progression in Infectious and Parasite body system on all other variables (except diabetes). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can do the analysis first on a 10% sample before you do it on the entire data set, which may take several hours.

  1. For each variable, create a binary variable that is 1 whenever the variable is missing and 0 otherwise. Predict progression in Infectious and Parasite body system from these binary variables and report whether any of them is statistically significant.  List variables that are not missing at random.  Variables that are not missing at random have a statistically significant relationship to the response (outcome) variable.
  2. Replace missing values using MICE. Report the percent of variation explained before and after MICE adjustments.
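
The mechanics of the missingness check in Question 2 can be sketched on simulated data. Everything below is hypothetical (variable names, sample size, and the way values are made missing) and is not the assignment data:

```r
set.seed(1)
n <- 200
toy <- data.frame(outcome = rnorm(n), x1 = rnorm(n), x2 = rnorm(n))

# x1 is made missing mostly when the outcome is high (not at random);
# x2 is made missing for a random 20% of rows
toy$x1[toy$outcome > 0.5 & runif(n) < 0.7] <- NA
toy$x2[runif(n) < 0.2] <- NA

# One 0/1 indicator per predictor: 1 = missing, 0 = observed
predictors <- c("x1", "x2")
indicators <- as.data.frame(
  lapply(toy[predictors], function(v) as.integer(is.na(v)))
)
names(indicators) <- paste0(predictors, "_missing")

# Regress the outcome on the missingness indicators
fit <- lm(outcome ~ ., data = cbind(outcome = toy$outcome, indicators))
coef(summary(fit))
```

An indicator with a statistically significant coefficient suggests that the corresponding variable is not missing at random with respect to the outcome.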

More

For additional information (not part of the required reading), please see the following links:

  1. Introduction to regression by others YouTube► Slides►
  2. Regression using R Read►
  3. Statistical learning with R Read►
  4. Open introduction to statistics Read►

This page is part of the HAP 719 course on Advanced Statistics and was organized by Farrokh Alemi PhD Home►  Email►