## Ordinary Regression
## Overview
Read Chapter 11 in Statistical Analysis of Electronic Health Records by Farrokh Alemi, 2020.

## Learning Objectives
After completing the activities in this module you should be able to:
- Distinguish among several methods of building a multiple linear regression model, including models with interaction terms
- Transform data to have residuals that have a Normal distribution
- Adjust for missing values in the dependent and independent data
- Interpret findings from statistical outputs pertaining to a regression model building technique
## Verify Regression Assumptions
Checking the assumptions of regression is a crucial step in ensuring the validity and reliability of your regression analysis results. In R, you can use various diagnostic techniques and visualization tools to assess whether your data meet the key assumptions of linear regression. Here are the primary assumptions to check and how to do so in R.
- Convert STATA code to R ChatGPT►
After fitting the regression model, create a residuals vs. fitted values plot to look for any patterns or curvature. Non-linearity in this plot can indicate a violation of the linearity assumption.
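A minimal base-R sketch of this check, using simulated data (the variable names are illustrative, not from the course dataset):

```r
# Simulated data for illustration only
set.seed(42)
mydata <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
mydata$y <- 2 + 1.5 * mydata$x1 - 0.5 * mydata$x2 + rnorm(100)

# Fit the regression model
model <- lm(y ~ x1 + x2, data = mydata)

# Residuals vs. fitted values: a random scatter around zero supports
# linearity; curvature or a funnel shape suggests a violation
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```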
If your data are a time series, plot the residuals against time to detect patterns or autocorrelation; the Durbin-Watson test provides a formal check for autocorrelation in the residuals.
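As a sketch, the Durbin-Watson statistic can be computed directly from the residuals (the dwtest() function in the lmtest package gives the formal test with a p-value); the data here are simulated:

```r
# Simulated time-ordered data for illustration
set.seed(1)
d <- data.frame(t = 1:50)
d$y <- d$t + rnorm(50)
model <- lm(y ~ t, data = d)
e <- residuals(model)

# Plot residuals in time order to look for runs or cycles
plot(seq_along(e), e, type = "b", xlab = "Time order", ylab = "Residual")

# Durbin-Watson statistic: values near 2 suggest no first-order
# autocorrelation; values near 0 or 4 suggest positive or negative
dw <- sum(diff(e)^2) / sum(e^2)
```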
To formally test for heteroscedasticity, use the Breusch-Pagan test, available as the bptest() function in the lmtest package.
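For illustration, here is the Koenker (studentized) version of the Breusch-Pagan test computed from first principles in base R; in practice, bptest(model) from the lmtest package produces the same test with one call. The data are simulated:

```r
# Simulate data whose error variance grows with x (heteroscedastic)
set.seed(7)
x <- runif(200, 0, 3)
y <- 1 + 2 * x + rnorm(200, sd = 1 + x)
model <- lm(y ~ x)

# Regress squared residuals on the predictors; n * R^2 is chi-squared
# distributed under the null hypothesis of constant variance
aux <- lm(residuals(model)^2 ~ x)
bp_stat <- length(y) * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
```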
Examine a histogram of the residuals, and use the shapiro.test() function to perform a formal Shapiro-Wilk test for normality of the residuals.
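A short sketch with simulated data:

```r
# Simulated data; with normal errors the test should usually not reject
set.seed(3)
d <- data.frame(x = rnorm(100))
d$y <- 1 + d$x + rnorm(100)
model <- lm(y ~ x, data = d)

# Histogram of residuals: look for rough symmetry and a bell shape
hist(residuals(model), main = "Histogram of residuals", xlab = "Residual")

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(residuals(model))
```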
To check for multicollinearity among the independent variables, calculate variance inflation factors (VIF), for example with the vif() function in the car package.
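The vif() function in the car package computes these directly; as a sketch, the same numbers can be obtained from first principles in base R, since VIF for predictor j is 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing predictor j on the remaining predictors. The data below are simulated:

```r
# x2 is deliberately built to be collinear with x1
set.seed(9)
X <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
X$x2 <- X$x1 + rnorm(100, sd = 0.3)

# VIF_j = 1 / (1 - R_j^2): regress each predictor on all the others
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
# x1 and x2 should show large VIFs; a common rule of thumb flags VIF > 5 or 10
```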
A leverage vs. residuals plot helps identify influential observations. You can also identify specific observations using functions like outlierTest() from the car package, or by examining observations with high Cook's distance. Remember that regression assumptions are not always perfectly met in real-world data. The goal is to assess the extent to which they are violated and whether those violations are severe enough to affect the validity of your results. Depending on the severity of any violations, you may need to consider data transformations, alternative models, or robust regression techniques to address issues and obtain valid inferences.
- See visual examples of how linearity affects QQ plots, Normal distribution, and fitted versus residual plots More►
- Read Chapter 11 in Statistical Analysis of Electronic Health Records, pages 282 to 290.
- Slides►
- YouTube►
- Video►
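The influence checks described above can be sketched in base R (cooks.distance() and the built-in diagnostic plots require no extra packages; the 4/n cutoff is one common rule of thumb, not the only one):

```r
# Simulate clean data, then inject one influential outlier
set.seed(5)
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
d[51, ] <- c(10, -20)
model <- lm(y ~ x, data = d)

# Leverage vs. standardized residuals, with Cook's distance contours
plot(model, which = 5)

# Flag observations whose Cook's distance exceeds 4/n
cooks <- cooks.distance(model)
influential <- which(cooks > 4 / nrow(d))
```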
## Adjustments for Missing Values
Several approaches are available for replacing missing values so that a complete data matrix (with no variable missing information) can be achieved. When the dependent variable is missing, the best strategy is to drop the row of data. For example, if we are predicting the health status of a patient from social determinants and medical history variables, then when health status is missing, one should ignore the entire case, including the available social determinants and medical history. In EHRs, missing values typically correspond to absent codes.
When missing data are substantial and imputation is not feasible or appropriate for your analysis, dropping cases with missing values may be a reasonable choice; functions such as na.omit() or complete.cases() remove rows with missing data. When you have identified outliers that are not representative of the underlying population, or that adversely affect the assumptions of your analysis (e.g., in regression models), you might choose to remove the cases containing those outliers. In some situations you may have data quality issues that cannot be resolved easily; for instance, if you suspect data entry errors or inconsistencies that cannot be corrected, you may decide to drop the cases with problematic data. Finally, when you are working with a large dataset and only need a specific subset for your analysis, you can drop cases that are not relevant to your research question. Remember that dropping entire cases can lead to a loss of valuable information, reduced sample size, and potential bias in your analysis if not done carefully. Before deciding to drop cases, consider the impact on the validity and representativeness of your results, and document your data preprocessing steps, including any case removal, for transparency and reproducibility. Additionally, consider alternative approaches, such as imputation or robust statistical methods, when appropriate to handle missing data or outliers without dropping cases.
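The four case-dropping strategies above can be sketched as follows; mydata, variable_name, and the cutoffs are hypothetical placeholders, not names from the course dataset:

```r
# Toy dataset with missing values and one extreme value
mydata <- data.frame(variable_name = c(1, 2, NA, 4, 250),
                     other = c(10, NA, 30, 40, 50))

# 1. Remove rows with any missing values
complete_data <- na.omit(mydata)                  # equivalently:
complete_data <- mydata[complete.cases(mydata), ]

# 2. Remove cases with outliers in a specific variable
no_outliers <- subset(mydata, is.na(variable_name) | variable_name < 100)

# 3. Remove cases with data quality issues, e.g. non-numeric text in a
#    column that should be numeric
txt <- c("1", "2", "bad", "4")
clean <- txt[!is.na(suppressWarnings(as.numeric(txt)))]

# 4. Select only the subset of cases relevant to the analysis
subset_data <- subset(mydata, other > 20)
```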
To replace missing categorical values with the mode: load your dataset (e.g., "mydata.csv") and identify the variable that contains missing values; assume it is called my_variable. Calculate the mode of the variable by counting the occurrences of each unique value with table() and selecting the value with the highest count using which.max(). Then replace the missing values with the calculated mode; the ifelse() function does this efficiently: mydata$my_variable <- ifelse(is.na(mydata$my_variable), mode_value, mydata$my_variable). This checks whether each element of mydata$my_variable is missing (NA) and replaces it with mode_value if so, leaving the original value unchanged otherwise. Verify that the missing values have been replaced by examining the variable or checking summary statistics with summary(mydata$my_variable). Similar code can be written to replace missing continuous values with the mean or median of the response.
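A self-contained sketch of the mode-replacement steps (my_variable and its values are made up for illustration):

```r
# Toy variable with two missing values
mydata <- data.frame(my_variable = c("A", "B", "B", NA, "B", "A", NA))

# Calculate the mode: the most frequent non-missing value
counts <- table(mydata$my_variable)
mode_value <- names(counts)[which.max(counts)]

# Replace missing values with the mode, leaving other values unchanged
mydata$my_variable <- ifelse(is.na(mydata$my_variable),
                             mode_value, mydata$my_variable)

# Verify: no NA values should remain
summary(factor(mydata$my_variable))
```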
Multiple imputation with the mice package proceeds as follows. Load the package with library(mice), then read the dataset that contains the missing values and the variables you want in your regression model (e.g., "mydata.csv"). Identify the variables that have missing values and those you want in the model; assume you want a linear regression with dependent_variable as the outcome and independent_variable1 and independent_variable2 as predictors: vars_to_impute <- c("dependent_variable", "independent_variable1", "independent_variable2"). Use the mice package to create multiple imputations for the missing data, specifying the number of imputations with the m parameter. Run your regression analysis on each of the imputed datasets, then combine the results from the multiple imputations using Rubin's rules to obtain pooled estimates and standard errors: pooled_results <- pool(regression_results). Finally, interpret and present the pooled regression results, which include estimates, standard errors, confidence intervals, and p-values that account for the uncertainty introduced by imputation. This process yields more reliable and valid statistical inferences in the presence of missing data. Adjust the number of imputations and the imputation methods as needed based on the characteristics of your data and research question.

## Coefficient of Determination
The coefficient of determination, or R-squared, measures the goodness of fit between the model and the data.
The R-squared statistic can be obtained directly from a fitted model in R: lm() fits the linear regression model, where dependent_variable is the variable you are trying to predict and independent_variable1 and independent_variable2 are the independent variables; summary(model) generates a summary of the regression model; and $r.squared extracts the R-squared value from that summary. The R-squared value lies between 0 and 1, and a higher value indicates that a larger proportion of the variability in the dependent variable is explained by the independent variables, suggesting a better fit of the model to the data.
- Read Chapter 11 in Statistical Analysis of Electronic Health Records, pages 277 to 280
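The R-squared extraction described above can be sketched on simulated data (the variable names follow the text; the data are made up):

```r
# Simulated data in which the predictors genuinely explain the outcome
set.seed(11)
mydata <- data.frame(independent_variable1 = rnorm(100),
                     independent_variable2 = rnorm(100))
mydata$dependent_variable <- 3 + 2 * mydata$independent_variable1 +
  mydata$independent_variable2 + rnorm(100)

# Fit the linear regression model
model <- lm(dependent_variable ~ independent_variable1 + independent_variable2,
            data = mydata)

# Extract R-squared from the model summary
r_squared <- summary(model)$r.squared
```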
## Model Selection
To do regression, you have to try different mathematical models and see which one fits the data best. A linear model is a weighted sum of independent variables. A non-linear model, in this context, is a weighted sum of independent variables and interaction terms, where interaction terms are constructed as the product of two or more independent variables. In R, you can run multiple regression models that account for interactions among independent variables by including interaction terms in the regression formula. Interaction terms let you examine how the relationship between the dependent variable and one independent variable is influenced by another independent variable. For example, suppose you have a dataset named mydata and want to fit a multiple regression model that includes an interaction between independent_variable1 and independent_variable2 along with other main effects.
In such a model: dependent_variable is the variable you want to predict; independent_variable1 and independent_variable2 are the independent variables whose interaction you want to test; and other_independent_variables represents any other independent variables included as main effects. The * operator between independent_variable1 and independent_variable2 creates the interaction term along with the main effects; you can also define interaction terms explicitly using the : operator or the interaction() function. When you have a large number of variables (such as x1 through x15) and want to include two-way interaction terms for all pairs of them, it is cumbersome to write out each interaction term manually. Fortunately, R provides functions to automate the process, combining combn() with the : operator to build interaction terms for all pairs of variables.
"mydata" with variables x1 through x15 In this code: combn() generates all possible combinations of variable pairs. The interaction() function is applied to each pair of variables to create interaction terms. as.formula() is used to convert the interaction terms into a formula. Finally, the linear regression model is fitted using the formula that includes all two-way interaction terms. Keep in mind that including a large number of interaction terms can lead to a more complex model and may require careful consideration of model selection and potential issues like multicollinearity. You may also want to assess the significance of the interaction terms and consider variable selection techniques if you have many variables and interactions. After fitting the model, you can use summary(model) to obtain detailed information about the regression results, including coefficients, standard errors, p-values, and R-squared values. Interpreting the results of a multiple regression model with interaction terms involves considering the main effects of the independent variables and the interaction effects. For example, if independent_variable1 and independent_variable2 have a significant interaction effect, it means that the relationship between the dependent variable and independent_variable1 depends on the value of independent_variable2, and vice versa. Remember to assess the significance of interaction terms and their practical implications when interpreting the results. Additionally, be cautious about multicollinearity when including interaction terms, as it can affect the stability and interpretability of the coefficients. - Read Chapter 11 in Statistical Analysis of Electronic Health Records page 274 to to 277
- YouTube►
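The two approaches described above, a single interaction via * and all two-way interactions built with combn(), can be sketched on simulated data (x1 through x3 stand in for a longer variable list such as x1 through x15):

```r
# Simulated data with a true x1-by-x2 interaction
set.seed(21)
mydata <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
mydata$y <- 1 + mydata$x1 + mydata$x2 +
  0.8 * mydata$x1 * mydata$x2 + 0.5 * mydata$x3 + rnorm(200)

# x1 * x2 expands to x1 + x2 + x1:x2 (main effects plus interaction)
model <- lm(y ~ x1 * x2 + x3, data = mydata)

# All two-way interactions for a set of variables, built programmatically
vars  <- c("x1", "x2", "x3")
pairs <- combn(vars, 2, FUN = paste, collapse = ":")
f <- as.formula(paste("y ~", paste(c(vars, pairs), collapse = " + ")))
model_all_pairs <- lm(f, data = mydata)
```

Note that y ~ (x1 + x2 + x3)^2 is a built-in shorthand for the same all-pairs model.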
## Assignments
Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.
- Remove all cases in which all values for disabilities in 365 days, age, and gender are missing. These are meaningless records and should be dropped from the analysis.
- Remove any row in which the treatment variable (MFH) is missing. MFH is an intervention for nursing home patients. In this program, nursing home patients are diverted to a community home, and health care services are delivered within the community home. The resident eats with the family and relies on the family members for socialization, food, and comfort. It is called a "foster" home because the family previously living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by MFH=1. A value of NaN or null indicates a missing value.
- Various costs are reported in the file, including cost inside and outside the organization. Rely on cost per day. Exclude patients who have 0 cost per day within the organization; these values do not make sense. Costs are reported for a specific time period after admission, and some patients stay a short time while others stay longer. Use daily cost so you are not caught by issues related to lack of follow-up.
- Select for your independent variables the probability of disability in 365 days. These probabilities are predicted from CCS variables. CCS in these data refer to Clinical Classification System of Agency for Health Care Research and Quality. CCS data indicate the comorbidities of the patient. When null, it is assumed the patient did not have the comorbidity. When data are entered it is assumed that the patient had the comorbidity and the reported value is the first (maximum) or last (minimum) number of days till admission to either the nursing home or the MFH. Thus an entry of 20 under the minimum CCS indicates that from the most recent occurrence of the comorbidity till admission was 20 days. An entry of 400 under the Maximum CCS indicates that from the first time the comorbidity occurred till admission was 400 days. Because of the relationship between disabilities and comorbidities, you can rely exclusively on disabilities and ignore comorbidities.
- Check if cases repeat and should be deleted from the analysis.
- In survival days, null values indicate missing values. Either assume missing values are zero or use imputation to estimate the missing values. Select the approach that fits the data best.
- Convert all categorical variables to binary dummy variables. For example, race has four coded values (W, B, A, Other) plus null. Create 5 binary dummy variables for these categories and use 4 of them in the regression. For example, the binary variable called Black is 1 when race is B, and 0 otherwise. In this binary variable we are comparing all Black residents to non-Black residents, which include W, A, null, and other races.
- In all variables where null value was not deleted row wise, e.g. race being null, the null value should be made into a dummy variable, zero when not null and 1 when null. Treat these null variables as you would any other independent variable.
- Gender is indicated as "M" and "F"; revise by replacing M with 1 and F with 0.
- Make sure that no numbers are entered as text.
- Visually check that cost is normally distributed and see if log of cost is more normal than cost itself. If a variable is not normally distributed, is the average of the variable normal (see page 261 in required textbook)? Visually check that age and cost have a linear relationship.
- Regress cost per day on age (continuous variable), gender (male=1, Female=0), survival, binary dummy variables for race, probabilities of functional disabilities, and any null dummy variable you have created.
- Describe the percent of variation explained and the F statistic.
- Show which variables have a statistically significant effect on cost. Does age affect cost? Does MFH reduce cost of care?
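A hypothetical sketch of the dummy-coding and gender-recoding steps above; the column names and the Asian/OtherRace labels are illustrative assumptions, not names from the course file:

```r
# Toy data: race has codes W, B, A, Other, plus nulls; gender is M/F
mydata <- data.frame(race = c("W", "B", "A", NA, "Other", "B"),
                     gender = c("M", "F", "F", "M", "F", "M"))

# Null indicator: 1 when race is missing, 0 otherwise
mydata$race_null <- as.integer(is.na(mydata$race))

# Binary dummies; e.g., Black = 1 when race is B, 0 otherwise
mydata$White <- as.integer(!is.na(mydata$race) & mydata$race == "W")
mydata$Black <- as.integer(!is.na(mydata$race) & mydata$race == "B")
mydata$Asian <- as.integer(!is.na(mydata$race) & mydata$race == "A")
mydata$OtherRace <- as.integer(!is.na(mydata$race) & mydata$race == "Other")

# Gender: replace M with 1 and F with 0
mydata$gender <- ifelse(mydata$gender == "M", 1, 0)
```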
- Data Download►
- Python's Teach Ones
- Taheeri's YouTube►
- Marla's YouTube►
- Adnan's YouTube► Python Code►
- See Python code for the regression Code►
- Adnan's Teach One on regression portion YouTube► Python Code►
- CCS codes Read►
- See sample R codes in required textbook pages 266 to 274 for doing the regression
- Vladimir Cardenas's Answer► R-code►
- Aaron Jackson Hill's Teach One YouTube►
- Check the normal distribution assumption of the response variable. Drop from the analysis any case where the dependent variable is missing. If the data are not normal, transform the data to meet the normality assumption. For each transformation, show the test of Normal distribution. You should at a minimum consider the following transformations of the data:
- Odds to probability transformation
- Log of odds to probability transformation
- Logarithm transformation
- Third root of the variable
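As a sketch of comparing transformations, the code below simulates a skewed (log-normal) cost variable and applies the Shapiro-Wilk test to the raw, log-transformed, and cube-root-transformed versions; with real data you would run the same tests on each candidate transformation:

```r
# Simulated right-skewed cost variable (illustration only)
set.seed(13)
cost <- rlnorm(200, meanlog = 4, sdlog = 1)

# Shapiro-Wilk p-values: small values are evidence against normality
p_raw  <- shapiro.test(cost)$p.value           # raw cost: strongly skewed
p_log  <- shapiro.test(log(cost))$p.value      # logarithm transformation
p_cube <- shapiro.test(cost^(1/3))$p.value     # third root of the variable
```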
- Decide on a method for addressing missing values. There are several ways to address missing values. You can assign a value of 0 to missing values, you can impute these variables from other variables that have occurred prior to it, or you can use patterns of missing values to assume different values for different missing patterns. You should choose the method that yields the highest R-squared value.
- Check the assumption of linearity. If the data have non-linear elements, transform the data to remove non-linearity.
- Include pairwise and triple combination of variables in your analysis. Report predictors of progression of diseases in the circulatory body system. List the variables, pairs of variables, or triplets of variables that are significant predictors at 0.01.
- Evaluate the R-squared (the percent of variation in circulatory diseases explained by other variables that occur prior to it).
- Data Download► Dictionary►
- Sowmya Chakravarthy's Answer► R-code►(Password protected)
- Answer in More►
## More
For additional information (not part of the required reading), please see the following links:
- Introduction to regression by others YouTube► Slides►
- Regression using R Read►
- Statistical learning with R Read►
- Open introduction to statistics Read►
This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi PhD Home► Email►