## HAP 719: Advanced Statistics I

## Ordinary Regression - Assumptions

## Overview

- Read Chapter 11 in Statistical Analysis of Electronic Health Records by Farrokh Alemi, 2020, pages 282 to 290. Slides► YouTube► Video►
- Visual examples of checking regression assumptions More►
- Introduction to regression with one independent variable Slides►
- Yili's lecture on introduction to regression concept Slides► Interactive►
- Assumptions Slides► YouTube► Video►
- Interactive visualization of regression More►
## Learning Objectives

After completing the activities in this module you should be able to:

- Analyze data using linear regression
- Transform data to have residuals that have a Normal distribution
- Interpret findings from statistical outputs pertaining to a regression model building technique
## Introduction to Regression

Regression analysis is a statistical method used to examine the relationship between one or more independent variables (predictors or features) and a dependent variable (the outcome or target). It is a fundamental tool in the field of statistics and is widely applied in various fields, including economics, social sciences, finance, and natural sciences, to make predictions, understand relationships, and uncover patterns in data. Here are the key components and concepts associated with regression analysis:
- Simple Linear Regression: When there is one independent variable.
- Multiple Linear Regression: When there are two or more independent variables.
- Logistic Regression: Used for binary classification problems.
- Polynomial Regression: Used when the relationship between variables is not linear.
- Ridge and LASSO Regression: Used for regularization to prevent overfitting in high-dimensional data.
- Time Series Regression: Applied to time-dependent data.

Linear regression relies on several assumptions:

- Linearity
- Independence of errors
- Constant variance of errors (homoscedasticity)
- Normally distributed errors

Violations of these assumptions can affect the validity of the results. A later section provides details on how to check the assumptions of regression in R.
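
As a minimal sketch of what simple linear regression computes, the plain-Python example below estimates the intercept and slope by ordinary least squares. The function name `fit_simple_ols` and the toy data are assumptions made for illustration; they do not come from the course materials.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data in which y is exactly 2 + 3x.
x = [1, 2, 3, 4, 5]
y = [5, 8, 11, 14, 17]
intercept, slope = fit_simple_ols(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Because the toy data lie exactly on a line, the fit recovers the intercept 2 and the slope 3. With real data the points scatter around the line, and the differences between observed and fitted values are the residuals examined when checking assumptions.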
In summary, regression analysis is a powerful statistical technique used to model and understand relationships between variables, make predictions, and inform decision-making based on data. It is a versatile tool with a wide range of applications in research and practical problem-solving.

## Verify Regression Assumptions Using R

Checking the assumptions of regression is a crucial step in ensuring the validity and reliability of your regression analysis results. In R, you can use various diagnostic techniques and visualization tools to assess whether your data meets the key assumptions of linear regression. Here are the primary assumptions to check and how to do so in R.

Linearity: Using base R, fit the regression model and then create a residuals vs. fitted values plot to look for any patterns or curvature. Curvature in this plot can indicate a violation of the linearity assumption.
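
To make the idea of a residuals vs. fitted values check concrete, here is a small sketch in plain Python (not R) that fits a straight line to deliberately curved data and prints the fitted values next to the residuals. The function name `linear_fit_residuals` and the data are hypothetical and chosen only for illustration.

```python
def linear_fit_residuals(x, y):
    """Fit y = b0 + b1*x by least squares; return (fitted, residuals)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals


# Hypothetical curved data: y = x squared, which a straight line cannot capture.
x = [0, 1, 2, 3, 4, 5, 6]
y = [xi ** 2 for xi in x]
fitted, residuals = linear_fit_residuals(x, y)
for f, e in zip(fitted, residuals):
    print(f"fitted = {f:6.1f}   residual = {e:6.1f}")
```

The residuals are positive at both ends of the fitted range and negative in the middle. That U-shaped pattern is the kind of curvature that signals a violation of linearity.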

Independence of errors: Use the Durbin-Watson test to check the residuals for autocorrelation. If your data are time series data, also create a plot of residuals against time to detect any patterns or autocorrelation.
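
The Durbin-Watson statistic itself is simple arithmetic on the ordered residuals: the sum of squared successive differences divided by the sum of squared residuals. The plain-Python sketch below shows that calculation only, with no p-value. The function name and the residual vectors are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: one series drifts slowly, the other flips sign each step.
trending = [3, 2, 1, -1, -2, -3]
alternating = [1, -1, 1, -1, 1, -1]
dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(f"trending: {dw_trending:.2f}, alternating: {dw_alternating:.2f}")
```

The statistic ranges from 0 to 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation (as in the trending series), and values toward 4 suggest negative autocorrelation (as in the alternating series).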

Constant variance of errors (homoscedasticity): Inspect the residuals vs. fitted values plot for changes in the spread of the residuals. To formally test for heteroscedasticity, you can use the Breusch-Pagan test through the bptest() function from the lmtest package, loaded with library(lmtest).
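
The logic behind a Breusch-Pagan style test can be sketched in plain Python for the single-predictor case: regress the squared residuals on the predictor and use n times the R-squared of that auxiliary regression as the test statistic. This is the studentized (Koenker) form, written here as an illustration under stated assumptions rather than a replacement for bptest(). All function names and the data are hypothetical.

```python
import math


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared of the least-squares line of y on x (y must not be constant)."""
    b0, b1 = ols_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return max(0.0, 1 - ss_res / ss_tot)  # guard against tiny negative rounding


def breusch_pagan_one_predictor(x, y):
    """Return (LM statistic, p-value) for one predictor, so 1 degree of freedom."""
    b0, b1 = ols_fit(x, y)
    squared_resid = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, squared_resid)
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Hypothetical data whose error size grows with x (heteroscedastic by design).
x = list(range(1, 11))
y = [2 * xi + (0.5 * xi if xi % 2 == 0 else -0.5 * xi) for xi in x]
lm, p_value = breusch_pagan_one_predictor(x, y)
print(f"LM = {lm:.2f}, p = {p_value:.4f}")
```

A small p-value indicates that the spread of the residuals changes with the predictor. With more than one predictor, the auxiliary regression uses all predictors and the degrees of freedom equal the number of predictors, which this sketch does not cover.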

Normality of errors: Plot a histogram of the residuals. You can also use the shapiro.test() function to perform a formal Shapiro-Wilk test for normality on the residuals.
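
As a rough numeric companion to a histogram or Q-Q plot, the sketch below computes the correlation between the sorted values and the Normal quantiles they would be expected to follow. This is a probability-plot correlation, not the Shapiro-Wilk test, and the function name `qq_correlation` and the data are hypothetical. The example also shows how a log transformation can turn right-skewed values into values that look Normal.

```python
import math
from statistics import NormalDist


def qq_correlation(values):
    """Correlation between sorted values and standard Normal quantiles."""
    n = len(values)
    ordered = sorted(values)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_o = sum(ordered) / n
    mean_t = sum(theoretical) / n
    sxy = sum((o - mean_o) * (t - mean_t) for o, t in zip(ordered, theoretical))
    sxx = sum((o - mean_o) ** 2 for o in ordered)
    syy = sum((t - mean_t) ** 2 for t in theoretical)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical right-skewed values (lognormal shape) and their logarithms.
skewed = [math.exp(NormalDist().inv_cdf((i + 0.5) / 20)) for i in range(20)]
logged = [math.log(v) for v in skewed]
r_raw = qq_correlation(skewed)
r_log = qq_correlation(logged)
print(f"raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```

A correlation close to 1 means the points of a Q-Q plot fall near a straight line. Here the raw values fall well short of 1, while their logarithms match the Normal quantiles almost exactly.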

Multicollinearity: Calculate the variance inflation factor (VIF) for each independent variable.
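
For intuition about what a VIF measures, the plain-Python sketch below handles the simplest case of two predictors, where the VIF of each equals 1 / (1 - r²) and r is the correlation between them. With more predictors, r² is replaced by the R² from regressing one predictor on all the others, which this sketch does not implement. The function names and data are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model.

    Raises ZeroDivisionError when the predictors are perfectly correlated.
    """
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Hypothetical predictors: one uncorrelated pair and one nearly collinear pair.
vif_unrelated = vif_two_predictors([1, 2, 3, 1, 2, 3], [1, 1, 1, 2, 2, 2])
vif_collinear = vif_two_predictors([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0])
print(f"unrelated: {vif_unrelated:.2f}, nearly collinear: {vif_collinear:.1f}")
```

A VIF of 1 means a predictor shares no linear information with the other predictors. Large values mean its coefficient is estimated with inflated variance because other predictors carry much of the same information.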

Outliers and influential observations: Examine a leverage vs. residuals plot. You can also identify specific observations using functions like outlierTest() from the car package or by examining observations with high Cook's distance.

Remember that regression assumptions are not always perfectly met in real-world data. The goal is to assess the extent to which they are violated and whether these violations are severe enough to affect the validity of your results. Depending on the severity of any violations, you may need to consider data transformations, alternative models, or robust regression techniques to address issues and obtain valid inferences.

## Assignments

Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.

- Remove all cases in which all values for disabilities in 365 days, age and gender, are missing. These are meaningless data and should be dropped from analysis.
- Remove any row in which the treatment variable (MFH) is missing. MFH is an intervention for nursing home patients. In this program, nursing home patients are diverted to a community home and health care services are delivered within the community home. The resident eats with the family and relies on the family members for socialization, food and comfort. It is called a "foster" home because the family previously living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by MFH=1. A value of NaN or null indicates a missing value.
- Various costs are reported in the file, including cost inside and outside the organization. Rely on cost per day. Exclude patients who have 0 cost per day within the organization, because these values do not make sense. The cost is reported for a specific time period after admission; some patients stay a short time and others stay longer. Use daily cost so that differences in length of follow-up do not distort the comparison.
- Select for your independent variables the probability of disability in 365 days. These probabilities are predicted from CCS variables. CCS in these data refers to the Clinical Classifications Software of the Agency for Healthcare Research and Quality. CCS data indicate the comorbidities of the patient. When the value is null, it is assumed the patient did not have the comorbidity. When a value is entered, it is assumed that the patient had the comorbidity and the reported value is the first (maximum) or last (minimum) number of days till admission to either the nursing home or the MFH. Thus an entry of 20 under the minimum CCS indicates that the most recent occurrence of the comorbidity was 20 days before admission. An entry of 400 under the maximum CCS indicates that the first occurrence of the comorbidity was 400 days before admission. Because of the relationship between disabilities and comorbidities, you can rely exclusively on disabilities and ignore comorbidities.
- Check if cases repeat and should be deleted from the analysis.
- Convert all categorical variables to binary dummy variables. For example, race has four values W, B, A, Other, and null value. Create 5 binary dummy variables for these categories and use 4 of them in the regression. For example, the binary variable called Black is 1 when race is B, and 0 otherwise. In this binary variable we are comparing all Black residents to non-Black residents that include W, A, null, and other races.
- For every variable whose null values were not deleted row-wise (e.g., race), turn the null value into a dummy variable that is 1 when the value is null and 0 otherwise. Treat these null dummy variables as you would any other independent variable.
- Gender is indicated as "M" and "F"; revise by replacing M with 1 and F with 0.
- Make sure that no numbers are stored as text.
- Visually check that cost is normally distributed and see if log of cost is more normal than cost itself. If a variable is not normally distributed, is the average of the variable normal (see page 261 in required textbook)? Visually check that age and cost have a linear relationship.
- Regress cost per day on age (continuous variable), gender (male=1, female=0), survival, binary dummy variables for race, probabilities of functional disabilities, and any null dummy variables you have created.
- Show which variables have a statistically significant effect on cost. Does age affect cost? Does MFH reduce cost of care?
- Data Download►
- Python's Teach Ones
- Taheeri's YouTube►
- Marla's YouTube►
- Adnan's YouTube► Python Code►
- See Python code for the regression Code►
- Adnan's Teach One on regression portion YouTube► Python Code►
- CCS codes Read►
- See sample R codes in required textbook pages 266 to 274 for doing the regression
- Vladimir Cardenas's Answer► R-code►
- Aaron Jackson Hill's Teach One YouTube►
- Chethana Banoth's Teach One for R YouTube►
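
The dummy-variable steps in the question above (race categories, a separate indicator for null, and gender recoded as 1 and 0) can be sketched in plain Python as follows. The column names, the choice of `race_W` as the reference category left out of the regression, and the sample rows are all hypothetical. The sketch also assumes that missing values have already been read in as `None`.

```python
RACE_CATEGORIES = ["W", "B", "A", "Other"]
REFERENCE = "race_W"  # hypothetical choice of the category left out of the model


def encode_race(race):
    """Return five 0/1 dummies: one per race category plus one for null."""
    dummies = {f"race_{c}": int(race == c) for c in RACE_CATEGORIES}
    dummies["race_null"] = int(race is None)
    return dummies


def encode_gender(gender):
    """Map "M" to 1 and "F" to 0; any other value returns None."""
    return {"M": 1, "F": 0}.get(gender)


# Hypothetical rows, not taken from the assignment data.
rows = [
    {"race": "B", "gender": "F"},
    {"race": None, "gender": "M"},
    {"race": "W", "gender": "M"},
]

encoded = []
for row in rows:
    features = encode_race(row["race"])
    features.pop(REFERENCE)  # keep 4 of the 5 dummies for the regression
    features["male"] = encode_gender(row["gender"])
    encoded.append(features)

for features in encoded:
    print(features)
```

Dropping one dummy avoids perfect collinearity with the intercept. A row from the reference category shows up as zeros on all four remaining race dummies.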
- Check the normal distribution assumption of the response variable. Drop from the analysis any row where the dependent variable is missing. If the data are not normal, transform the data to meet the normality assumption. For each transformation, show the test of Normal distribution. You should at a minimum consider the following transformations of the data:
- Odds to probability transformation
- Log of odds to probability transformation
- Logarithm transformation
- Third root of odds
- Check the assumption of linearity. If the data have non-linear elements, transform the data to remove non-linearity.
- Data Download► Dictionary►
- Sowmya Chakravarthy's Answer► R-code►(Password protected)
- Sai Vineela Ramineni's Teach One YouTube►
- Answer in More►
- Answer in More►
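
The transformations named in the question above are standard formulas, sketched here in plain Python. The function names are hypothetical, and which transformation is appropriate depends on whether the response variable is recorded as odds, log odds, or a raw positive value.

```python
import math


def odds_to_probability(odds):
    """Convert odds o to a probability p = o / (1 + o)."""
    return odds / (1 + odds)


def log_odds_to_probability(log_odds):
    """Convert log odds to a probability with the logistic function."""
    return 1 / (1 + math.exp(-log_odds))


def log_transform(value):
    """Natural logarithm; defined only for positive values."""
    return math.log(value)


def cube_root(value):
    """Third root; this simple form assumes a non-negative value."""
    return value ** (1 / 3)


# Hypothetical inputs chosen so the results are easy to check by hand.
print(odds_to_probability(3))      # odds of 3 to 1
print(log_odds_to_probability(0))  # log odds of 0 means even odds
print(log_transform(math.e))
print(cube_root(8))
```

After applying a transformation, the same normality check used on the raw values is repeated on the transformed values so the results can be compared.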
Question 5: The attached data come from Larry Hatcher's book titled "Advanced Statistics in Research." The response variable is the average of grades in graduate school, and the independent variables, listed in the following table, are admission information:
If your instructor has not covered this topic, please learn from ChatGPT. Please answer the following questions:

- What does adjusted R² measure and "according to the criteria recommended by Cohen (1988), does this value of R² come closest to representing a zero effect, a small effect, a medium effect, or a large effect?"
- "According to Table E9.3.2, the unstandardized multiple regression coefficient for undergraduate GPA overall is b = 0.313. What does this value mean? In other words, what is the correct interpretation of this coefficient (hint: your answer must incorporate the definition for an unstandardized multiple regression coefficient, and it must also incorporate the value “0.313”)."
- Which variables have a statistically significant relationship with graduate grades?
- "What was the unstandardized multiple regression coefficient for undergraduate GPA overall?"
- "What was the 95% confidence interval for this coefficient?"
- "Assume that you had access to this 95% confidence interval but not the p value for the regression coefficient. Based only on this 95% confidence interval, was this regression coefficient statistically significant? How do you know?"
- Is the statistical test of the coefficient for undergraduate GPA affected by other variables in the model? How is the test of a coefficient in a regression model different from hypothesis testing of the same variable by itself, without other variables in the regression model? Explain your answer.
- "If an applicant increased his or her GRE quantitative test score by one standard deviation, that applicant’s score on graduate GPA overall would be expected to increase by how many standard deviations (while statistically controlling the remaining predictor variables)?"
- "What percent of variance in graduate GPA overall is accounted for by the GRE verbal test, above and beyond the variance already accounted for by all of the other predictors?"
- Is the GRE verbal test more useful in predicting graduate grades than the GRE quantitative test?
- Does the model capture interaction among the variables?
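
Several of the questions above rest on three small calculations: adjusted R², reading significance from a 95% confidence interval, and converting an unstandardized coefficient into a standardized one. The plain-Python sketch below illustrates each. Every number in it is hypothetical and is not taken from Hatcher's tables, and the function names are assumptions made for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def ci_excludes_zero(lower, upper):
    """True when a confidence interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


def standardized_coefficient(b, sd_x, sd_y):
    """Convert an unstandardized coefficient b to a standardized one."""
    return b * sd_x / sd_y


# Hypothetical values for illustration only.
adj = adjusted_r_squared(r2=0.30, n=100, p=5)
print(f"adjusted R-squared = {adj:.4f}")

# A 95% interval that excludes zero corresponds to p < 0.05 (two-sided).
print(ci_excludes_zero(0.10, 0.50))
print(ci_excludes_zero(-0.05, 0.40))

# Standard deviations of outcome change per one standard deviation of predictor.
print(standardized_coefficient(b=0.5, sd_x=2.0, sd_y=4.0))
```

Adjusted R² is never larger than R² because it penalizes the number of predictors. The standardized coefficient expresses the expected change in the outcome, in standard deviations, for a one standard deviation increase in the predictor while the other predictors are held constant.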
## More

For additional information (not part of the required reading), please see the following links:

- Introduction to regression by others YouTube► Slides►
- Regression using R Read►
- Statistical learning with R Read►
- Open introduction to statistics Read►
This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi PhD Home► Email►