HAP 719: Advanced Statistics I

Ordinary Regression: Assumptions

Overview
Learning Objectives

After completing the activities in this module you should be able to:
Introduction to Regression

Regression analysis is a statistical method used to examine the relationship between one or more independent variables (predictors or features) and a dependent variable (the outcome or target). It is a fundamental tool in statistics and is widely applied in fields such as economics, the social sciences, finance, and the natural sciences to make predictions, understand relationships, and uncover patterns in data. Here are the key components and concepts associated with regression analysis:

Dependent Variable (Y): This is the variable that you want to predict or explain. It is the outcome or target variable that you are trying to understand in terms of its relationship with the independent variables.

Independent Variables (X1, ..., Xn): These are the variables that you believe may influence the dependent variable. They are also known as predictors, features, or explanatory variables. In simple linear regression there is one independent variable, while in multiple linear regression there are two or more.

Regression Equation: The fundamental concept in regression analysis is the regression equation, which represents the relationship between the dependent and independent variables. In simple linear regression, with one independent variable, it can be written as:

Y = β0 + β1X1 + ε

where β0 is the intercept, β1 is the slope, and ε is the error term.
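The equation above can be estimated in R with the lm() function. A minimal sketch on simulated data (the variable names and coefficient values here are illustrative assumptions, not from the course data):

```r
# Minimal sketch: fit a simple linear regression on simulated data.
# The names x, y and the coefficients 3 and 0.5 are hypothetical.
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)      # one independent variable
y <- 3 + 0.5 * x + rnorm(100, sd = 2)    # Y = B0 + B1*X1 + error
fit <- lm(y ~ x)                         # estimate B0 (intercept) and B1 (slope)
coef(fit)                                # the two estimated coefficients
summary(fit)                             # coefficients, R-squared, significance tests
```

The estimated slope should be close to the 0.5 used to simulate the data; summary() also reports the hypothesis tests and R-squared discussed later in this module.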
Regression Coefficients: The coefficients (β0 and β1 in the simple linear regression equation above) represent the strength and direction of the relationship between the independent and dependent variables. If there are no interaction terms in the regression equation, they indicate how much change in the dependent variable is associated with a one-unit change in the independent variable. When interaction terms are included in a regression model, the interpretation of the coefficients for both the main effects and the interaction terms changes significantly.

Main Effects: These are the coefficients of the independent variables that are not involved in an interaction term. In a model without interactions, the coefficient of an independent variable represents the change in the dependent variable associated with a one-unit change in that variable, holding all other variables constant. In the presence of interaction terms, the interpretation of the main effects becomes conditional on the values of the interacting variables: the main effect of an independent variable now represents the change in the dependent variable associated with a one-unit change in that variable while the interacting variables are held at zero (if they are centered) or at their reference levels (if categorical).

Interaction Effects: Interaction terms capture how the relationship between one independent variable and the dependent variable changes as a function of another independent variable. The coefficient of an interaction term represents the additional change in the dependent variable associated with the product of the two interacting variables, assuming the main effects and all other variables are held constant.
The direction and magnitude of the interaction effect can vary, and it is essential to examine the sign and statistical significance of the interaction coefficient. A positive interaction coefficient implies that the effect of one variable on the dependent variable increases as the other variable increases, while a negative interaction coefficient suggests that the effect decreases as the other variable increases.

Interpretation of Main Effects in the Presence of Interaction: When interaction terms are present in the model, the interpretation of main effects may not be straightforward, because the effect of an independent variable depends on the levels of the interacting variables. For example, if a regression model predicting health outcomes includes an interaction between age and income, the main effect of age tells you the effect of age on health when income is at a particular level; as income changes, the effect of age on health may also change because of the interaction. To interpret the coefficients correctly in the presence of interactions, it is essential to look at predicted values or graphs of the relationship to understand how the effects change under different conditions. Post-estimation techniques, such as plotting interaction effects or calculating predicted values for specific combinations of variables, can make the interpretation more intuitive.

Residuals: Residuals are the differences between the observed values of the dependent variable and the values predicted by the regression equation. The goal in regression analysis is to minimize the sum of squared residuals, as they represent the unexplained variation in the dependent variable.

Types of Regression: There are various types of regression analysis, including simple linear regression, multiple linear regression, and models for non-continuous outcomes such as logistic regression.
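As a sketch of the interaction interpretation described above, the following R code simulates a health outcome that depends on an age-by-income interaction and compares predicted values at different income levels. All variable names and coefficient values are hypothetical, chosen only to illustrate the idea:

```r
# Sketch: interpreting an interaction between two predictors.
# age, income, health, and all coefficients below are simulated/hypothetical.
set.seed(1)
age    <- rnorm(200, mean = 50, sd = 10)
income <- rnorm(200, mean = 60, sd = 15)
health <- 10 + 0.2 * age + 0.1 * income - 0.005 * age * income + rnorm(200)

fit <- lm(health ~ age * income)   # main effects plus the age:income interaction
coef(fit)                          # four coefficients: intercept, age, income, age:income

# The effect of age is conditional on income:
# slope of age = B_age + B_interaction * income.
# Comparing predictions at low vs. high income makes this visible:
newdat <- expand.grid(age = c(40, 60), income = c(45, 75))
cbind(newdat, predicted = predict(fit, newdat))
```

Because the interaction coefficient is negative in this simulation, the difference in predicted health between ages 40 and 60 shrinks as income rises, which is exactly the conditional interpretation discussed above.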
Assumptions: Regression analysis assumes certain conditions, such as linearity of the relationship, independence of the residuals, homoscedasticity (constant variance of the residuals), and normality of the residuals.
Violations of these assumptions can affect the validity of the results. A later section provides details on how to check the assumptions of regression in R.

Model Evaluation: Various statistical measures are used to assess the quality of a regression model, including the coefficient of determination (R-squared), the mean squared error (MSE), and hypothesis tests for the significance of coefficients. More details are provided in later lectures.

In summary, regression analysis is a powerful statistical technique used to model and understand relationships between variables, make predictions, and inform decision-making based on data. It is a versatile tool with a wide range of applications in research and practical
problem-solving.

Verify Regression Assumptions Using R

Checking the assumptions of regression is a crucial step in ensuring the validity and reliability of your regression analysis results. In R, you can use various diagnostic techniques and visualization tools to assess whether your data meet the key assumptions of linear regression. Here are the primary assumptions to check, and how to check them in R.

Linearity: Create scatterplots of your independent variables against the dependent variable to visually assess linearity. You can use the plot() function, or the ggplot2 package for more advanced plotting. After fitting the regression model, create a residuals vs. fitted values plot to look for patterns or curvature; non-linearity in this plot can indicate a violation of the linearity assumption.

Independence of Residuals: Use the Durbin-Watson test to check for autocorrelation in the residuals; a statistic close to 2 suggests no autocorrelation. If your data are time series data, also plot the residuals against time to detect patterns or autocorrelation.

Homoscedasticity (Constant Variance of Residuals): Check the residuals vs. fitted values plot for a consistent spread of residuals across the fitted values. If the spread increases or decreases systematically, it may indicate heteroscedasticity. You can use the Breusch-Pagan test, via the bptest() function from the lmtest package, to formally test for heteroscedasticity.

Normality of Residuals: Create a histogram and a quantile-quantile (QQ) plot of the residuals to visually assess their normality. You can use the shapiro.test() function to perform a formal test of normality on the residuals.

Multicollinearity: Calculate the Variance Inflation Factor (VIF) for each independent variable to check for multicollinearity.
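The checks above can be sketched in one short R script. The built-in mtcars data set is used here purely as an illustration, and the lmtest and car packages are assumed to be installed:

```r
# Sketch of the regression assumption checks, illustrated on mtcars.
library(lmtest)  # provides dwtest() and bptest()
library(car)     # provides vif()

fit <- lm(mpg ~ wt + hp, data = mtcars)

# Linearity: scatterplot of a predictor vs. the outcome,
# and residuals vs. fitted values (look for curvature or patterns)
plot(mtcars$wt, mtcars$mpg)
plot(fitted(fit), resid(fit)); abline(h = 0)

# Independence of residuals: Durbin-Watson test (statistic near 2 is good)
dwtest(fit)

# Homoscedasticity: Breusch-Pagan test
# (a small p-value suggests heteroscedasticity)
bptest(fit)

# Normality of residuals: histogram, QQ plot, and Shapiro-Wilk test
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))
shapiro.test(resid(fit))

# Multicollinearity: variance inflation factors for each predictor
vif(fit)
```

Each test prints its statistic and p-value to the console; the plots are meant to be inspected visually alongside the formal tests.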
High VIF values (usually above 5 or 10) suggest multicollinearity.

Outliers and Influential Observations: Use residual plots, leverage plots, and Cook's distance to identify outliers and influential observations. These can be examined graphically or by calculating specific statistics, for example with the outlierTest() function from the car package or by flagging observations with high Cook's distance.

Remember that regression assumptions are not always perfectly met in real-world data. The goal is to assess the extent to which they are violated and whether these violations are severe enough to affect the validity of your results. Depending on the severity of any violations, you may need to consider data transformations, alternative models, or robust regression techniques to address the issues and obtain valid inferences.

Assignments

Assignments should be submitted in Blackboard. The submission must include a summary statement, with one statement per question. All assignments should be done in R if possible.

Question 1: Clean the Medical Foster Home data. Limit the data to cost per day, patient disabilities in 365 days, survival, age of patients, gender of patients, and whether they participated in the medical foster home (MFH) program. Clean the data using the following:
Question 2: Regress the Circulatory Body System factor on all variables that precede it (defined as variables that occur more often before than after circulatory events). In the attached data, the variables indicate incidence of diabetes (a binary variable) and progression of diseases in body systems. You can run the analysis first on a 10% sample before running it on the entire data set, which may take several hours.
Question 3: If you see the following residual versus fitted, distribution, and QQ plots after fitting a linear regression to the data, which of the following statements are true: (a) there is a linear upward trend, (b) there is a linear downward trend, (c) there is a curved upward trend, (d) there is a curved downward trend, (e) there is a fan-shaped trend:
Question 4: If you see the following residual versus fitted, distribution, and QQ plots after fitting a linear regression to the data, which of the following statements are true: (a) there is a linear upward trend, (b) there is a linear downward trend, (c) there is a curved upward trend, (d) there is a curved downward trend, (e) there is a fan-shaped trend:
Question 5: The attached data come from Larry Hatcher's book titled "Advanced Statistics in Research." The response variable is the average of grades in graduate school, and the independent variables, listed in the following table, are admission information:
If your instructor has not covered this topic, please learn from ChatGPT. Please answer the following questions:
More

For additional information (not part of the required reading), please see the following links:
This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi, PhD.