George Mason University HAP 719 Advanced Statistics

HAP 719: Advanced Statistics I

Ordinary Regression - Assumptions  



Interactive Visualization

Learning Objectives

After completing the activities in this module, you should be able to:

  • Analyze data using linear regression
  • Transform data to have residuals that have a Normal distribution
  • Interpret findings from statistical outputs pertaining to a regression model building technique

Introduction to Regression

Regression analysis is a statistical method used to examine the relationship between one or more independent variables (predictors or features) and a dependent variable (the outcome or target). It is a fundamental tool in the field of statistics and is widely applied in various fields, including economics, social sciences, finance, and natural sciences, to make predictions, understand relationships, and uncover patterns in data. Here are the key components and concepts associated with regression analysis:

Dependent Variable (Y): This is the variable that you want to predict or explain. It's the outcome or target variable that you are trying to understand in terms of its relationship with the independent variables.

Independent Variables (X1 , ...Xn): These are the variables that you believe may have an influence on the dependent variable. They are also known as predictors, features, or explanatory variables. In simple linear regression, there is one independent variable, while in multiple linear regression, there are two or more.

Regression Equation: The fundamental concept in regression analysis is the regression equation, which represents the relationship between the dependent and independent variables. In its simplest form, in simple linear regression with 1 independent variable, it can be written as:

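$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where Beta zero ($\beta_0$) is the intercept, Beta one ($\beta_1$) is the slope, and $\varepsilon$ is the error term.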

Regression Coefficients: The coefficients (Beta zero and Beta one in the simple linear regression equation above) describe the fitted line. Beta zero is the intercept, the expected value of the dependent variable when the independent variable is zero. Beta one is the slope. It represents the strength and direction of the relationship between the independent and dependent variables and indicates how much change in the dependent variable is associated with a one-unit change in the independent variable, provided there are no interaction terms in the regression equation. When interaction terms are included in a regression model, the interpretation of the coefficients for both the main effects and the interaction terms changes significantly.
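
As a minimal sketch of how the coefficients are read in practice, the following R code fits a simple linear regression to simulated data. The data, the variable names x and y, and the true intercept and slope are assumptions made only for illustration.

```r
# Simulated data; the true intercept (2) and slope (0.5) are arbitrary choices.
set.seed(1)
x <- runif(100, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)   # ordinary least squares fit of y on x
b <- coef(fit)     # named vector with elements "(Intercept)" and "x"
print(b)

# The predicted change in y for a one-unit increase in x equals the slope.
pred <- predict(fit, newdata = data.frame(x = c(3, 4)))
slope_from_predictions <- unname(pred[2] - pred[1])
```

The difference between the two predictions reproduces the estimated slope, which is what "change per one-unit change" means for a model without interaction terms.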

Main Effects: These refer to the coefficients of the independent variables themselves, as opposed to the coefficients of their interaction terms. In a regression model without interactions, the coefficient of an independent variable represents the change in the dependent variable associated with a one-unit change in that independent variable, holding all other variables constant. In the presence of interaction terms, the interpretation of the main effects becomes conditional on the values of the interacting variables. The main effect of an independent variable now represents the change in the dependent variable associated with a one-unit change in that independent variable when the variables it interacts with are equal to zero (which is their mean if they have been centered) or are at their reference levels (if categorical), holding all other variables constant.

Interaction Effects: Interaction terms capture how the effect of one independent variable on the dependent variable changes with the level of another independent variable. The coefficient of an interaction term represents the change in the effect (slope) of one interacting variable for each one-unit increase in the other, assuming that all other variables are held constant. The direction and magnitude of the interaction effect can vary, and it is essential to examine the sign and statistical significance of the interaction coefficient. A positive interaction coefficient implies that the effect of one variable on the dependent variable increases when the other variable increases, while a negative interaction coefficient suggests that the effect decreases as the other variable increases.

Interpretation of Main Effects in the Presence of Interaction: When interaction terms are present in the model, the interpretation of main effects may not be straightforward. The effect of an independent variable depends on the levels of the interacting variables. For example, if you have an interaction between age and income in a regression model predicting health outcomes, the main effect of age would tell you the effect of age on health when income is at a certain level. However, as income changes, the effect of age on health may also change due to the interaction. To interpret the coefficients correctly in the presence of interactions, it's essential to look at the predicted values or graphs of the relationship to understand how the effects change under different conditions. You may use post-estimation techniques like plotting interaction effects or calculating predicted values for specific combinations of variables to make the interpretation more intuitive.
In summary, when interaction terms are included in a regression model, the interpretation of coefficients becomes conditional on the values of the interacting variables, and the relationship between independent variables and the dependent variable can change depending on these conditions. Proper interpretation often involves examining the main effects in conjunction with interaction effects and considering how they affect the predicted outcomes under various scenarios.
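
The age and income example can be sketched in R as follows. The data are simulated, and the variable names (health, age, income) and effect sizes are hypothetical. The sketch shows that the slope of age depends on income, and that after centering, the main effect of age is its slope at the mean income.

```r
# Simulated data; variable names and effect sizes are hypothetical.
set.seed(2)
n <- 500
age    <- rnorm(n, mean = 70, sd = 8)
income <- rnorm(n, mean = 50, sd = 15)
health <- 90 - 0.50 * age + 0.05 * income + 0.01 * age * income +
  rnorm(n, sd = 2)
d <- data.frame(health, age, income)

# age * income expands to age + income + age:income
fit <- lm(health ~ age * income, data = d)
b <- coef(fit)

# Slope of age at a chosen income level: main effect + interaction * income
age_slope_at <- function(income_value) {
  unname(b["age"] + b["age:income"] * income_value)
}
slopes <- sapply(c(low = 30, mid = 50, high = 70), age_slope_at)
print(slopes)

# Centering: the main effect of age becomes its slope at the mean income.
d$age_c    <- d$age - mean(d$age)
d$income_c <- d$income - mean(d$income)
fit_c <- lm(health ~ age_c * income_c, data = d)
age_effect_centered <- unname(coef(fit_c)["age_c"])
```

Printing slopes at several income levels is one simple post-estimation check. Plotting predicted values across the range of both variables serves the same purpose.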

Residuals: Residuals are the differences between the observed values of the dependent variable and the values predicted by the regression equation. Ordinary regression chooses the coefficients that minimize the sum of squared residuals, which represents the unexplained variation in the dependent variable.
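
A short sketch, using simulated data with assumed variable names, illustrates both points: residuals are observed minus fitted values, and moving the coefficients away from the least squares estimates increases the sum of squared residuals.

```r
# Simulated data; names and true coefficients are illustrative assumptions.
set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

# Residual = observed value minus fitted value
res_manual <- y - fitted(fit)

# Sum of squared residuals for any candidate intercept b0 and slope b1
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

b <- unname(coef(fit))
ssr_ols     <- ssr(b[1], b[2])              # at the least squares estimates
ssr_shifted <- ssr(b[1] + 0.1, b[2] - 0.1)  # at slightly different values
print(c(ols = ssr_ols, shifted = ssr_shifted))
```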

Types of Regression: There are various types of regression analysis, including:

  • Simple Linear Regression: When there is one independent variable.
  • Multiple Linear Regression: When there are two or more independent variables.
  • Logistic Regression: Used for binary classification problems.
  • Polynomial Regression: Used when the relationship between variables is not linear.
  • Ridge and LASSO Regression: Used for regularization to prevent overfitting in high-dimensional data.
  • Time Series Regression: Applied to time-dependent data.

Assumptions: Regression analysis assumes certain conditions, such as:

  1. Linearity
  2. Independence of errors
  3. Constant variance of errors (homoscedasticity), and
  4. Normally distributed errors.

Violations of these assumptions can affect the validity of the results. A later section provides details on how to check the assumptions of regression in R.

Model Evaluation: Various statistical measures are used to assess the quality of a regression model, including the coefficient of determination (R-squared), mean squared error (MSE), and hypothesis tests for the significance of coefficients.  More details are provided in later lectures.
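
As an illustrative sketch on simulated data (variable names are assumptions), R-squared, adjusted R-squared, and the mean squared error can be computed directly from the residuals and compared with the values reported by summary().

```r
# Simulated data with two predictors; all names are hypothetical.
set.seed(4)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 3 + 1.5 * x1 - 0.8 * x2 + rnorm(200)
fit <- lm(y ~ x1 + x2)

e   <- residuals(fit)
sse <- sum(e^2)               # unexplained variation
sst <- sum((y - mean(y))^2)   # total variation
n   <- length(y)
p   <- length(coef(fit)) - 1  # number of predictors

r_squared     <- 1 - sse / sst
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
mse           <- mean(e^2)    # in-sample mean squared error (divides by n)

s <- summary(fit)
print(s$coefficients)         # estimates, standard errors, t values, p values
```

Note that the residual standard error printed by summary() divides the sum of squared residuals by n - p - 1 rather than by n, so its square is slightly larger than the in-sample MSE.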

In summary, regression analysis is a powerful statistical technique used to model and understand relationships between variables, make predictions, and inform decision-making based on data. It is a versatile tool with a wide range of applications in research and practical problem-solving.

Verify Regression Assumptions Using R

Checking the assumptions of regression is a crucial step in ensuring the validity and reliability of your regression analysis results. In R, you can use various diagnostic techniques and visualization tools to assess whether your data meets the key assumptions of linear regression. Here are the primary assumptions to check and how to do so in R.

Linearity: Create scatterplots of your independent variables against the dependent variable to visually assess linearity. You can use the plot() function or the ggplot2 package for more advanced plotting.

# Example using base R
plot(mydata$independent_variable, mydata$dependent_variable)
# Example using ggplot2 (the package must be installed and loaded first)
library(ggplot2)
ggplot(data = mydata, aes(x = independent_variable, y = dependent_variable)) + geom_point()

After fitting the regression model, create a residuals vs. fitted values plot to look for any patterns or curvature. Non-linearity in this plot can indicate a violation of the linearity assumption.

 # Fit the regression model
model <- lm(dependent_variable ~ independent_variable, data = mydata)
# Residuals vs. Fitted Values Plot
plot(fitted(model), residuals(model))

Independence of Residuals: Use the Durbin-Watson test to check for autocorrelation in the residuals. A value close to 2 suggests no autocorrelation.

# Durbin-Watson Test (dwtest() is in the lmtest package)
lmtest::dwtest(model)

If your data is time series data, create a plot of residuals against time to detect any patterns or autocorrelation.
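
To make the "close to 2" rule concrete, the sketch below computes the Durbin-Watson statistic by hand in base R and plots residuals against time. The data are simulated, and the helper name durbin_watson is hypothetical. This is a sketch of the statistic only; it does not produce the p-value that a formal test reports.

```r
# Simulated series: one with independent errors, one with AR(1) errors.
set.seed(5)
n <- 200
time_index <- seq_len(n)
x <- rnorm(n)
e_iid <- rnorm(n)
e_ar  <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y_iid <- 1 + 2 * x + e_iid
y_ar  <- 1 + 2 * x + e_ar

# Durbin-Watson statistic: squared successive differences of the residuals
# divided by the sum of squared residuals.
durbin_watson <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

dw_iid <- durbin_watson(lm(y_iid ~ x))
dw_ar  <- durbin_watson(lm(y_ar ~ x))
print(c(independent = dw_iid, autocorrelated = dw_ar))

# Residuals against time: long runs above or below zero suggest autocorrelation.
plot(time_index, residuals(lm(y_ar ~ x)), type = "l",
     xlab = "Time", ylab = "Residual")
abline(h = 0, lty = 2)
```

Values well below 2 point to positive autocorrelation, and values well above 2 point to negative autocorrelation.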

Homoscedasticity (Constant Variance of Residuals): Check the residuals vs. fitted values plot for a consistent spread of residuals across the fitted values. If the spread increases or decreases systematically, it may indicate heteroscedasticity.

 # Residuals vs. Fitted Values Plot
plot(fitted(model), residuals(model))

To test formally for heteroscedasticity, you can use the Breusch-Pagan test, available as the bptest() function in the lmtest package.

# Breusch-Pagan Test
library(lmtest)
bptest(model)
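
The idea behind the test can be sketched in base R: regress the squared residuals on the predictors and check whether they explain any of the variation. The version below is the studentized (Koenker) form, in which the statistic is n times the R-squared of that auxiliary regression. The data are simulated and the variable names are assumptions.

```r
# Simulated data whose error spread grows with x (heteroscedastic by design).
set.seed(6)
n <- 300
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor(s)
e2  <- residuals(fit)^2
aux <- lm(e2 ~ x)

k       <- length(coef(aux)) - 1            # number of predictors in aux model
lm_stat <- n * summary(aux)$r.squared       # test statistic
p_value <- pchisq(lm_stat, df = k, lower.tail = FALSE)
print(c(statistic = lm_stat, p_value = p_value))
```

A small p-value indicates that the residual variance changes with the predictors, which is evidence against homoscedasticity.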

Normality of Residuals: Create a histogram and a quantile-quantile (QQ) plot of the residuals to visually assess their normality.

# Histogram of Residuals
hist(residuals(model))
# QQ Plot
qqnorm(residuals(model))
qqline(residuals(model))

You can use the shapiro.test() function to perform a formal Shapiro-Wilk test for normality on the residuals.

# Shapiro-Wilk Test
shapiro.test(residuals(model))

Multicollinearity: Calculate Variance Inflation Factor (VIF) for each independent variable to check for multicollinearity. High VIF values (usually above 5 or 10) suggest multicollinearity.

# Calculate VIF (vif() is in the car package; the model needs two or more predictors)
car::vif(model)
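
For intuition, the VIF of a predictor is 1 / (1 - R-squared), where R-squared comes from regressing that predictor on all the other predictors. The base R sketch below computes it by hand on simulated data in which x2 is built to be nearly a copy of x1. All names are hypothetical.

```r
# Simulated predictors: x1 and x2 are strongly correlated, x3 is independent.
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the others and use 1 / (1 - R^2).
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_manual, 2))
```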

Outliers and Influential Observations: Use residual plots, leverage plots, and Cook's distance to identify outliers and influential observations. These can be done graphically or by calculating specific statistics.

# Leverage vs. Residuals Plot
plot(hatvalues(model), residuals(model))
# Cook's Distance Plot
plot(cooks.distance(model), type = "h")

You can also identify specific observations using functions like outlierTest() from the car package or examining observations with high Cook's distance.
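
A small base R sketch shows how observations with high Cook's distance might be listed. The data are simulated with one planted influential point, and the 4/n cutoff is a common rule of thumb assumed here rather than a fixed standard.

```r
# Simulated data with one planted influential observation (the last row).
set.seed(8)
n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6     # far from the other x values (high leverage)
y[n] <- -10   # far from the trend (large residual)
fit <- lm(y ~ x)

cd  <- cooks.distance(fit)
lev <- hatvalues(fit)

cutoff  <- 4 / n                 # rule-of-thumb threshold (an assumption)
flagged <- which(cd > cutoff)    # row numbers exceeding the threshold
print(data.frame(obs = flagged, cooks_d = cd[flagged], leverage = lev[flagged]))

most_influential <- unname(which.max(cd))
```

Flagged observations deserve a closer look, not automatic deletion. Refitting the model with and without them shows how much the conclusions depend on them.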

Remember that regression assumptions are not always perfectly met in real-world data. The goal is to assess the extent to which they are violated and whether these violations are severe enough to affect the validity of your results. Depending on the severity of any violations, you may need to consider data transformations, alternative models, or robust regression techniques to address issues and obtain valid inferences.


Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.

Question 1: Clean the Medical Foster Home data. Limit the data to cost per day, patient disabilities in 365 days, survival, age of patients, gender of patients and whether they participated in the medical foster home (MFH) program. Clean the data using the following:

  1. Remove all cases in which all values for disabilities in 365 days, age and gender, are missing.  These are meaningless data and should be dropped from analysis.
  2. Remove any row in which the treatment variable (MFH) is missing.  MFH is an intervention for nursing home patients.  In this program, nursing home patients are diverted to a community home and health care services are delivered within the community home.  The resident eats with the family and relies on the family members for socialization, food and comfort.  It is called "foster" home because the family previously living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by a variable MFH=1.  A value of NaN or null is missing value.
  3. Various costs are reported in the file, including cost inside and outside the organization.  Rely on cost per day. Exclude patients who have 0 cost per day within the organization. These do not make sense. The cost is reported for a specific time period after admission; some patients stay a short time and others longer.  Use daily cost so you do not get caught up in issues related to lack of follow-up.
  4. Select for your independent variables the probability of disability in 365 days.  These probabilities are predicted from CCS variables.  CCS in these data refers to the Clinical Classifications Software of the Agency for Healthcare Research and Quality.  CCS data indicate the comorbidities of the patient.  When null, it is assumed the patient did not have the comorbidity.  When data are entered, it is assumed that the patient had the comorbidity and the reported value is the number of days from the first (maximum) or last (minimum) occurrence until admission to either the nursing home or the MFH. Thus an entry of 20 under the minimum CCS indicates that 20 days passed between the most recent occurrence of the comorbidity and admission.  An entry of 400 under the maximum CCS indicates that 400 days passed between the first occurrence of the comorbidity and admission. Because of the relationship between disabilities and comorbidities, you can rely exclusively on disabilities and ignore comorbidities.
  5. Check if cases repeat and should be deleted from the analysis. 
  6. Convert all categorical variables to binary dummy variables. For example, race has four values W, B, A, Other, and null value.  Create 5 binary dummy variables for these categories and use 4 of them in the regression.  For example, the binary variable called Black is 1 when race is B, and 0 otherwise.  In this binary variable we are comparing all Black residents to non-Black residents that include W, A, null, and other races.   
  7. In all variables where null value was not deleted row wise, e.g. race being null, the null value should be made into a dummy variable, zero when not null and 1 when null.  Treat these null variables as you would any other independent variable. 
  8. Gender is indicated as "M" and "F"; revise by replacing M with 1 and F with 0. 
  9. Make sure that no numbers are entered as text.
  10. Visually check that cost is normally distributed and see if log of cost is more normal than cost itself. If a variable is not normally distributed, is the average of the variable normal (see page 261 in required textbook)?  Visually check that age and cost have a linear relationship. 
  11. Regress cost per day on age (continuous variable), gender (male=1, Female=0), survival, binary dummy variables for race, probabilities of functional disabilities, and any null dummy variable you have created.  
  12. Show which variables have a statistically significant effect on cost.  Does age affect cost? Does MFH reduce cost of care?
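
Several of the cleaning steps above can be sketched in base R. The toy data frame and every column name below (id, mfh, cost_per_day, age, gender, disability, race) are hypothetical stand-ins, not the names in the actual Medical Foster Home file, so treat this as a pattern to adapt rather than a solution.

```r
# Toy data with HYPOTHETICAL column names; the real file will differ.
raw <- data.frame(
  id           = c(1, 2, 2, 3, 4, 5, 6, 7, 8),
  mfh          = c(1, 0, 0, NA, 1, 0, 1, 0, 1),
  cost_per_day = c("120.5", "300", "300", "210", "0", "95", "180", "250", "140"),
  age          = c(70, 82, 82, 65, 77, NA, 90, 68, 74),
  gender       = c("M", "F", "F", "M", "F", NA, "F", "M", "M"),
  disability   = c(0.2, 0.6, 0.6, 0.4, 0.3, NA, 0.8, 0.5, 0.1),
  race         = c("W", "B", "B", "A", "W", "Other", NA, "B", "W"),
  stringsAsFactors = FALSE
)
clean <- raw

# Step 9: numbers stored as text become numeric
clean$cost_per_day <- as.numeric(clean$cost_per_day)

# Step 1: drop rows where disability, age, and gender are all missing
all_missing <- is.na(clean$disability) & is.na(clean$age) & is.na(clean$gender)
clean <- clean[!all_missing, ]

# Step 2: drop rows with a missing treatment indicator
clean <- clean[!is.na(clean$mfh), ]

# Step 3: drop rows with zero cost per day
clean <- clean[which(clean$cost_per_day > 0), ]

# Step 5: drop exact duplicate rows
clean <- clean[!duplicated(clean), ]

# Steps 6 and 7: one dummy per race category, plus a dummy for null race
for (lvl in c("W", "B", "A", "Other")) {
  clean[[paste0("race_", lvl)]] <- as.integer(!is.na(clean$race) & clean$race == lvl)
}
clean$race_null <- as.integer(is.na(clean$race))

# Step 8: gender recoded so that M = 1 and F = 0
clean$male <- ifelse(clean$gender == "M", 1, 0)

# Step 10: log of cost, for comparing its distribution with raw cost
clean$log_cost <- log(clean$cost_per_day)
print(clean)
```

With the full data, hist() and qqnorm() on cost_per_day and log_cost give the visual comparison asked for in step 10, and the regression in step 11 is a call to lm() that leaves one race dummy out as the reference category.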

Question 2: Regress Circulatory Body System factor on all variables that precede it (defined as variables that occur more often before than after circulatory events).  In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can do the analysis first on a 10% sample before you do it on the entire data, which may take several hours.

  1. Check the normal distribution assumption of the response variable. Drop from the analysis any row in which the dependent variable is missing.  If the data are not normal, transform the data to meet the normality assumption. For each transformation, show the test of Normal distribution. You should at a minimum consider the following transformations of the data:
    • Odds to probability transformation:
      $p = \dfrac{\text{odds}}{1 + \text{odds}}$
    • Log of odds to probability transformation:
      $p = \dfrac{e^{x}}{1 + e^{x}}$, where $x$ is the log of odds
    • Logarithm transformation:
      $y' = \log(y)$
    • Third root of odds:
      $y' = \text{odds}^{1/3}$
  2. Check the assumption of linearity.  If the data have non-linear elements, transform the data to remove non-linearity.
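
One way to compare the candidate transformations is to apply each one and run a normality test on the result. The sketch below uses a simulated, right-skewed positive variable named odds as a stand-in for the response; the real variable and its scale are not known here, so both are assumptions. Note that shapiro.test() accepts between 3 and 5000 observations, so larger data sets need to be sampled first.

```r
# Hypothetical positive, right-skewed response standing in for the real one.
set.seed(9)
odds <- rlnorm(500, meanlog = 0, sdlog = 1)

transforms <- list(
  raw          = odds,
  odds_to_prob = odds / (1 + odds),   # odds to probability
  log          = log(odds),           # logarithm
  cube_root    = odds^(1 / 3)         # third root
)

# If the values were log odds instead, plogis() converts them to probabilities.
logodds_to_prob <- plogis(log(odds))

# Shapiro-Wilk W statistic and p-value for each candidate
w_stats  <- sapply(transforms, function(v) unname(shapiro.test(v)$statistic))
p_values <- sapply(transforms, function(v) shapiro.test(v)$p.value)
print(round(cbind(W = w_stats, p = p_values), 4))
```

A W statistic closer to 1 indicates a distribution closer to Normal. Histograms and QQ plots of each transformed variable are a useful complement, because with large samples the test rejects even for small departures.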

Question 3:  If you see the following residual versus fitted, distribution, and QQ plot after fitting a linear regression to the data, which of the following statements are true: (a) there is a linear upward trend, (b) there is a linear downward trend, (c) there is a curved upward trend, (d) there is a curved downward trend, (e) there is a fan-shaped trend:

[Figure: fanned residuals vs fitted]

Question 4: If you see the following residual versus fitted, distribution, and QQ plot after fitting a linear regression to the data, which of the following statements are true: (a) there is a linear upward trend, (b) there is a linear downward trend, (c) there is a curved upward trend, (d) there is a curved downward trend, (e) there is a fan-shaped trend:

[Figure: curved up diagnostics in regression]

Question 5:  The attached data come from Larry Hatcher's book titled "Advanced Statistics in Research."  The response variable is the average of grades in graduate school, and the independent variables, listed in the following table, are admission information:

[Table: Interpret regression coefficients]

  If your instructor has not covered this topic, please learn from ChatGPT. Please answer the following questions:

  1. What does adjusted R2 measure and "according to the criteria recommended by Cohen (1988), does this value of R2 come closest to representing a zero effect, a small effect, a medium effect, or a large effect?" 
  2. "According to Table E9.3.2, the unstandardized multiple regression coefficient for undergraduate GPA overall is b = 0.313. What does this value mean? In other words, what is the correct interpretation of this coefficient (hint: your answer must incorporate the definition for an unstandardized multiple regression coefficient, and it must also incorporate the value “0.313”)."
  3. Which variables have a statistically significant relationship with graduate grades?
  4.  "What was the unstandardized multiple regression coefficient for undergraduate GPA overall?"
  5. "What was the 95% confidence interval for this coefficient?"
  6. "Assume that you had access to this 95% confidence interval but not the p value for the regression coefficient. Based only on this 95% confidence interval, was this regression coefficient statistically significant? How do you know?"
  7. Is the statistical test of the coefficient for undergraduate GPA affected by other variables in the model?  How is the test of a coefficient in a regression model different from hypothesis testing of the same variable by itself, without other variables in the regression model?  Explain your answer.
  8. "If an applicant increased his or her GRE quantitative test score by one standard deviation, that applicant’s score on graduate GPA overall would be expected to increase by how many standard deviations (while statistically controlling the remaining predictor variables)?"
  9. "What percent of variance in graduate GPA overall is accounted for by the GRE verbal test, above and beyond the variance already accounted for by all of the other predictors?"
  10. Is the GRE verbal test more useful in predicting graduate grades than the GRE quantitative test?
  11. Does the model capture interaction among the variables?
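
The quantities these questions refer to can be obtained in R as sketched below. The data are simulated and the variable names (ggpa, ugpa, gre_q, gre_v) are hypothetical; they are not the values in the textbook table, so the sketch shows where each number comes from rather than what the answers are.

```r
# Simulated admissions data; names and effect sizes are hypothetical.
set.seed(10)
n <- 150
ugpa  <- rnorm(n, mean = 3.2, sd = 0.4)
gre_q <- rnorm(n, mean = 155, sd = 7)
gre_v <- rnorm(n, mean = 153, sd = 7)
ggpa  <- 1.5 + 0.3 * ugpa + 0.005 * gre_q + 0.002 * gre_v + rnorm(n, sd = 0.3)
d <- data.frame(ggpa, ugpa, gre_q, gre_v)

fit <- lm(ggpa ~ ugpa + gre_q + gre_v, data = d)
est <- coef(summary(fit))        # unstandardized b, SE, t value, p value
ci  <- confint(fit, level = 0.95)

# A 95% interval that excludes zero corresponds to p < 0.05 for that coefficient.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
significant   <- est[, "Pr(>|t|)"] < 0.05

# Standardized coefficients: refit after converting every variable to z scores.
d_std    <- as.data.frame(scale(d))
fit_std  <- lm(ggpa ~ ugpa + gre_q + gre_v, data = d_std)
beta_std <- coef(fit_std)[-1]

# Variance uniquely explained by gre_v: drop in R-squared when it is removed.
r2_full    <- summary(fit)$r.squared
r2_reduced <- summary(update(fit, . ~ . - gre_v))$r.squared
unique_r2_gre_v <- r2_full - r2_reduced

print(round(cbind(est, ci), 4))
print(round(beta_std, 3))
```

Because this model contains no product terms, it does not capture interactions. Adding a term such as ugpa:gre_q to the formula would be needed for that.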


For additional information (not part of the required reading), please see the following links:

  1. Introduction to regression by others YouTube► Slides►
  2. Regression using R Read►
  3. Statistical learning with R Read►
  4. Open introduction to statistics Read►

This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi, PhD.