## HAP 719: Advanced Statistics I
## Ordinary Regression Missing Values
## Overview
Missing values and regression Read► Slides►
## Learning Objectives
After completing the activities in this module you should be able to:
- Adjust for missing values in the dependent and independent variables
- Determine if data is missing at random
- Check the accuracy of mean imputation
- Check the accuracy of treating missing values in EHRs as absence of disease
- Use a series of regressions (structural equation models) to predict the value of missing variables
## Adjustments for Missing Values
Several approaches are available for replacing missing values so that a complete data matrix (with no variable missing information) can be obtained. These approaches include dropping cases, treating a missing code as absence of disease, replacing missing values with the mode, mean, or median of the observed values, and multiple imputation.

When the dependent variable is missing, the best strategy is to drop the row of data. For example, if we are predicting the health status of a patient from social determinants and medical history variables, then when health status is missing, one should ignore the entire case, including the available social determinants and medical history. In EHRs, missing values typically refer to absent codes, that is, no code was recorded for the patient.
When missing data is substantial and imputation is not feasible or appropriate for your analysis, dropping cases with missing values may be a reasonable choice. You can use functions like `na.omit()` or `complete.cases()` to remove rows with any missing values.

When you have identified outliers that are not representative of the underlying population or that violate the assumptions of your analysis (e.g., in regression models), you might choose to remove the entire cases that contain outliers in a specific variable (e.g., "variable_name").

In some cases, you might have data quality issues that cannot be resolved easily. For instance, if you suspect data entry errors or inconsistencies that cannot be corrected (e.g., non-numeric values in a numeric variable), you may decide to drop the cases with problematic data.

When you are working with a large dataset and only need a specific subset for your analysis, you can select the cases that meet specific criteria and drop those that are not relevant to your research question.

Remember that dropping entire cases can lead to a loss of valuable information, reduced sample size, and potential biases in your analysis if not done carefully. Before deciding to drop cases, consider the impact on the validity and representativeness of your results. You should also document your data preprocessing steps, including any case removal, for transparency and reproducibility in your research. Additionally, consider alternative approaches such as imputation or robust statistical methods, which can handle missing data or outliers without dropping cases.
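The four situations described above can be sketched in base R as follows. This is a minimal illustration, not the course's own code: the data frame, its column names, the 1.5 × IQR outlier rule, and the age cutoff are all assumptions made for the example.

```r
# Toy data frame standing in for the course data; all names are hypothetical.
mydata <- data.frame(
  id     = 1:8,
  age    = c(34, 51, NA, 47, 29, 62, 55, 40),
  cost   = c(120, 135, 110, NA, 125, 5000, 130, 118),  # 5000 is implausible
  status = c("1", "0", "1", "1", "x", "0", "1", "0"),  # "x" is an entry error
  stringsAsFactors = FALSE
)

# 1. Remove rows with any missing values (two equivalent approaches)
complete_a <- na.omit(mydata)
complete_b <- mydata[complete.cases(mydata), ]

# 2. Remove cases with outliers in a specific variable.
#    Assumption: an outlier is a value beyond 1.5 * IQR from the quartiles.
q     <- quantile(mydata$cost, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
is_outlier  <- !is.na(mydata$cost) &
  (mydata$cost < q[1] - fence | mydata$cost > q[2] + fence)
no_outliers <- mydata[!is_outlier, ]

# 3. Remove cases with data quality issues
#    (non-numeric text in a field that should be numeric)
status_num <- suppressWarnings(as.numeric(mydata$status))
clean      <- mydata[!is.na(status_num), ]

# 4. Select a subset of cases meeting specific criteria
older <- subset(mydata, age >= 45)

# Document how many cases each step keeps
sapply(list(all = mydata, complete = complete_a, no_outliers = no_outliers,
            clean = clean, older = older), nrow)
```

Reporting the row counts after each step is one simple way to document case removal, since it shows how much of the sample each rule discards.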
Missing values in a categorical variable can be replaced with the mode, the most frequently observed value.

Load your dataset (e.g., "mydata.csv"). Identify the variable in your dataset that contains missing values. Let's assume the variable is called my_variable: `variable_with_missing <- "my_variable"`

Calculate the mode of the variable using the `table()` and `which.max()` functions: count the occurrences of each unique value in the variable and select the one with the highest count as the mode.

Replace the missing values in the variable with the calculated mode. You can use the `ifelse()` function to do this efficiently: `mydata$my_variable <- ifelse(is.na(mydata$my_variable), mode_value, mydata$my_variable)`. This code checks whether each element in `mydata$my_variable` is missing (NA) and replaces it with `mode_value` if it is, leaving the original value unchanged otherwise.

Verify that the missing values have been replaced with the mode by examining the variable or checking summary statistics: `summary(mydata$my_variable)`. This displays a summary of the variable, which should no longer report any missing values.

Similar code can be written to replace missing values of a continuous variable with the mean or median of its observed values.
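The mode, mean, and median replacements described above might look like the following sketch. The toy data, the `score` column, and the helper `get_mode()` are hypothetical names introduced for illustration; ties in the mode are resolved by taking the first value in `table()` order, which is an arbitrary choice.

```r
# Toy data; column names and values are hypothetical.
mydata <- data.frame(
  my_variable = c("A", "B", NA, "B", "C", NA, "B"),
  score       = c(2, NA, 4, 6, NA, 8, 100),
  stringsAsFactors = FALSE
)

# Mode of a vector, ignoring NA (table() drops NA by default)
get_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Categorical variable: replace NA with the mode
mode_value <- get_mode(mydata$my_variable)
mydata$my_variable[is.na(mydata$my_variable)] <- mode_value

# Continuous variable: replace NA with the mean or the median
mydata$score_mean   <- ifelse(is.na(mydata$score),
                              mean(mydata$score, na.rm = TRUE), mydata$score)
mydata$score_median <- ifelse(is.na(mydata$score),
                              median(mydata$score, na.rm = TRUE), mydata$score)

# Verify that no missing values remain in the filled columns
colSums(is.na(mydata[, c("my_variable", "score_mean", "score_median")]))
```

Two details are worth noting. `names(table(x))` always returns text, so a numeric mode needs `as.numeric()` before it is written back. Direct indexing is used for the categorical column because `ifelse()` does not preserve factors. The example also shows why the median can be the safer choice: the single large value pulls the mean well above most observations.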
Multiple imputation replaces each missing value several times, so that the uncertainty about the imputed values carries through to the regression. In R this can be done with the mice package: `library(mice)`

Read the dataset that contains missing values and the variables you want to include in your regression model (e.g., "mydata.csv").

Specify the variables for imputation. Identify the variables that have missing values and those you want to include in your regression model. Let's assume you want to perform a linear regression with dependent_variable as the outcome and independent_variable1 and independent_variable2 as predictors: `vars_to_impute <- c("dependent_variable", "independent_variable1", "independent_variable2")`

Use the mice package to create multiple imputations for the missing data. Specify the number of imputations you want to generate using the `m` parameter.

Run your regression analysis on each of the imputed datasets, in this example a linear regression, and store the results in a list.

Combine the results from the multiple imputations to obtain pooled estimates and standard errors. You can use Rubin's rules for combining the results: `pooled_results <- pool(regression_results)`

Finally, interpret and present the pooled regression results, which include estimates, standard errors, confidence intervals, and p-values that account for the uncertainty introduced by imputation. This makes the statistical inferences more reliable than those from a single filled-in dataset. Adjust the number of imputations and the imputation methods as needed based on the characteristics of your data and research question.

## Assignments
Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.
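As a reference for the questions that follow, the multiple-imputation steps described above can be sketched end to end. The data here are simulated rather than read from "mydata.csv", the missingness pattern is invented, and predictive mean matching (`method = "pmm"`) with five imputations is an assumed choice. The sketch also uses `with()` to fit the model on every imputed dataset in place of the explicit results list mentioned in the text.

```r
library(mice)  # install.packages("mice") if it is not yet installed

# Simulated data standing in for "mydata.csv"; variable names follow the text.
set.seed(719)
n <- 200
independent_variable1 <- rnorm(n)
independent_variable2 <- rnorm(n)
dependent_variable <- 1 + 2 * independent_variable1 -
  independent_variable2 + rnorm(n)
mydata <- data.frame(dependent_variable,
                     independent_variable1, independent_variable2)
mydata$independent_variable1[sample(n, 40)] <- NA  # 20% missing (assumed)
mydata$independent_variable2[sample(n, 30)] <- NA  # 15% missing (assumed)

vars_to_impute <- c("dependent_variable",
                    "independent_variable1", "independent_variable2")

# Set the number of imputations and create the imputed datasets
num_imputations <- 5
imputed_data <- mice(mydata[, vars_to_impute], m = num_imputations,
                     method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same linear regression on each imputed dataset
regression_results <- with(
  imputed_data,
  lm(dependent_variable ~ independent_variable1 + independent_variable2)
)

# Combine the fits with Rubin's rules
pooled_results <- pool(regression_results)
summary(pooled_results, conf.int = TRUE)

# Pooled R-squared across the imputed datasets
pool.r.squared(regression_results)
```

`pool.r.squared()` reports the percent of variation explained after imputation. It can be compared with the R-squared from `lm()` on the original data, which silently drops incomplete rows.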
- Remove all instances where progression in Infectious and Parasite body system is missing. Report the number of cases that remain.
- Plot the progression in Infectious and Parasite body system. Is the data bimodal? Is the data symmetric around the mean? Transform the data to improve the Q-Q plot.
- Include in your analysis all pairs and triplets of the independent variables (two-way and three-way interaction terms). Exclude any variable or interaction term that is always missing. What is the total number of independent variables included in the regression?
- Impute missing independent variables from other variables that are present. Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
- Assume that missing independent variables indicate the patient does not have any disease in the body system (i.e., 0 score). Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
- Assume the value of the missing independent variables can be replaced by the average value of the independent variable. Regress progression in Infectious and Parasite body system on independent variables and report the percent of variation explained.
- Indicate which method of replacing missing values fits the data best.
- Data Download► Dictionary►
- Niharika Sarraf's Teach One YouTube►
- Yili Lin's Answer► R-code►
- Create a binary variable that is 1 every time a variable is missing and 0 otherwise. Predict progression in Infectious and Parasite body system from these binary variables and report if any of the variables is statistically significant. List variables that are not missing at random. Variables that are not missing at random have a statistically significant relationship to the response (outcome) variable.
- Replace missing values using MICE. Report the percent of variation explained before and after MICE adjustments.
- Data Download► Dictionary►
- Sowmya Chakravarthy's Answer► R-code►
- Ledo Thankachan's Teach One YouTube►
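The missing-at-random check and the comparison of replacement rules asked for in the questions above can be tried first on simulated data. The sketch below is not a solution to the assignment. The variables `x1`, `x2`, and `y`, the way missingness is generated, and the helper `fill()` are all hypothetical.

```r
# Simulated data: x1 is missing whenever the outcome is high (not at random),
# x2 is missing for a random sample of cases.
set.seed(42)
n  <- 500
x1 <- rnorm(n, mean = 3)
x2 <- rnorm(n, mean = 3)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
x1[y > quantile(y, 0.7)] <- NA
x2[sample(n, 100)] <- NA
d <- data.frame(y, x1, x2)

# Binary indicators: 1 when the variable is missing, 0 otherwise
d$x1_missing <- as.integer(is.na(d$x1))
d$x2_missing <- as.integer(is.na(d$x2))

# Regress the outcome on the indicators; a significant coefficient suggests
# the variable is not missing at random (in the sense used in the question)
indicator_fit <- lm(y ~ x1_missing + x2_missing, data = d)
p_values <- summary(indicator_fit)$coefficients[, "Pr(>|t|)"]
p_values[c("x1_missing", "x2_missing")]

# Compare two replacement rules by percent of variation explained
fill <- function(x, value) ifelse(is.na(x), value, x)
zero_fit <- lm(y ~ fill(x1, 0) + fill(x2, 0), data = d)
mean_fit <- lm(y ~ fill(x1, mean(x1, na.rm = TRUE)) +
                 fill(x2, mean(x2, na.rm = TRUE)), data = d)
c(zero = summary(zero_fit)$r.squared, mean = summary(mean_fit)$r.squared)
```

Because missingness in `x1` was built to depend on the outcome, its indicator is strongly related to `y`. With real data the same regression is read the other way around: a significant indicator flags a variable whose missingness carries information about the outcome.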
## More
For additional information (not part of the required reading), please see the following links:
- Introduction to regression by others YouTube► Slides►
- Regression using R Read►
- Statistical learning with R Read►
- Open introduction to statistics Read►
This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi PhD Home► Email►