## HAP 719: Advanced Statistics I
## Ordinary Regression Model Building
## Overview
- Read Chapter 11 in Statistical Analysis of Electronic Health Records by Farrokh Alemi, 2020 Slides► YouTube► Video►
- Model building Slides►
## Learning Objectives
After completing the activities in this module, you should be able to:
- Examine whether a combination of variables is a more accurate predictor than each variable by itself.
- Distinguish among several methods of building a multiple linear regression model, including models with interaction terms
- Interpret findings from statistical outputs pertaining to a regression model building technique
## Coefficient of Determination
The coefficient of determination, or R-squared, measures goodness of fit between the model and the data. In R, lm() fits the linear regression model, where dependent_variable is the variable you are trying to predict and independent_variable1 and independent_variable2 are the independent variables in your model. Calling summary(model) generates a summary of the fitted regression, and $r.squared extracts the R-squared value from that summary.

For an ordinary regression model with an intercept, the R-squared value is a number between 0 and 1. A higher R-squared value indicates that a larger proportion of the variability in the dependent variable is explained by the independent variables, which suggests a better fit of the model to the data.
- Read Chapter 11 in Statistical Analysis of Electronic Health Records, pages 277 to 280
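As a minimal sketch of the steps described in this section, the example below fits a model and extracts R-squared. It assumes R's built-in mtcars data rather than any course dataset: mpg stands in for dependent_variable, and wt and hp stand in for independent_variable1 and independent_variable2. The manual calculation is included only to show what the reported statistic measures.

```r
# Illustrative sketch using R's built-in mtcars data (not the course data).
# mpg plays the role of dependent_variable; wt and hp play the roles of
# independent_variable1 and independent_variable2.
model <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared as reported by summary()
r_squared <- summary(model)$r.squared

# The same quantity from its definition: 1 - SS_residual / SS_total
ss_residual <- sum(residuals(model)^2)
ss_total <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared_manual <- 1 - ss_residual / ss_total

print(r_squared)
print(r_squared_manual)
```

The two printed values should agree, since summary() computes R-squared from the same sums of squares.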
## Model Selection
To do regression, you have to try different mathematical models and see which one fits the data best. A linear model is just a weighted sum of independent variables. A non-linear model is a weighted sum of independent variables and interaction terms. Interaction terms are constructed as the product of 2 or more independent variables. Such a model is non-linear in the original variables but still linear in its weights, so it can be fitted with ordinary regression.

In R, you can run multiple regression models that take into account interactions among independent variables by including interaction terms in your regression formula. Interaction terms let you examine how the relationship between the dependent variable and one independent variable is influenced by another independent variable.

Assume you have a dataset named mydata and you want a multiple regression model that includes the interaction between independent_variable1 and independent_variable2 along with other main effects. In the formula passed to lm():
- dependent_variable is the variable you want to predict.
- independent_variable1 and independent_variable2 are the independent variables for which you want to test the interaction effect.
- other_independent_variables represents any other independent variables you want to include in the model as main effects.
- The * operator between independent_variable1 and independent_variable2 includes both main effects and their interaction term.

You can also define the interaction term explicitly with the : operator, which adds only the product term, so the main effects must be listed separately. The interaction() function serves a different purpose: it combines the levels of categorical variables into a single factor, so it is not a substitute for the : or * operators when the variables are numeric.

When you have a large number of variables (such as x1 through x15 in a data frame called mydata) and you want two-way interaction terms for all pairs of these variables, writing out each term manually is cumbersome: 15 variables yield 105 pairs. R can automate the process. One approach uses combn() to generate all possible pairs of variable names, joins each pair with the : operator, and converts the resulting terms into a formula with as.formula(); the linear regression model is then fitted using that formula. Raising the sum of the variables to the power 2 inside the formula expands to the same main effects and two-way interaction terms.

Keep in mind that including a large number of interaction terms leads to a more complex model and requires careful model selection. Assess the significance of the interaction terms and consider variable selection techniques if you have many variables and interactions.

After fitting the model, you can use summary(model) to obtain detailed information about the regression results, including coefficients, standard errors, p-values, and R-squared values. Interpreting the results of a multiple regression model with interaction terms involves considering both the main effects of the independent variables and the interaction effects. For example, if independent_variable1 and independent_variable2 have a significant interaction effect, the relationship between the dependent variable and independent_variable1 depends on the value of independent_variable2, and vice versa. Weigh the practical implications of an interaction as well as its statistical significance, and be cautious about multicollinearity when including interaction terms, as it can affect the stability and interpretability of the coefficients.
- Read Chapter 11 in Statistical Analysis of Electronic Health Records, pages 274 to 277
- YouTube►
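The interaction formulas discussed in this section can be sketched as follows. This is an illustration under stated assumptions, not the course's own code: it uses R's built-in mtcars data, with mpg as the dependent variable and wt, hp, qsec, and drat standing in for the independent variables. It shows the * and : operators side by side, then two ways of requesting all two-way interactions.

```r
# Illustrative sketch using the built-in mtcars data (not the course data).
# mpg is the dependent variable; wt and hp stand in for
# independent_variable1 and independent_variable2; qsec is another main effect.

# The * operator expands to both main effects plus their interaction
model_star <- lm(mpg ~ wt * hp + qsec, data = mtcars)

# The same model written out with the : operator
model_colon <- lm(mpg ~ wt + hp + wt:hp + qsec, data = mtcars)

# Both specifications give the same coefficients
print(all.equal(coef(model_star), coef(model_colon)[names(coef(model_star))]))

# All two-way interactions among several predictors
predictors <- c("wt", "hp", "qsec", "drat")

# Option 1: formula shorthand, (a + b + c + d)^2
formula_all <- as.formula(
  paste("mpg ~ (", paste(predictors, collapse = " + "), ")^2")
)
model_all <- lm(formula_all, data = mtcars)

# Option 2: build each pair explicitly with combn() and the : operator
two_way_terms <- combn(predictors, 2, FUN = paste, collapse = ":")
formula_explicit <- reformulate(c(predictors, two_way_terms), response = "mpg")
model_explicit <- lm(formula_explicit, data = mtcars)

print(two_way_terms)
print(summary(model_all)$r.squared)
print(summary(model_explicit)$r.squared)
```

The shorthand is shorter to write, while the combn() version makes it easy to drop or filter specific pairs before fitting.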
## Assignments
Assignments should be submitted in Blackboard. The submission must have a summary statement, with one statement per question. All assignments should be done in R if possible.
- Initially explain variation in Complete Series Vaccination rates by demographics (including age, race, gender) of the county's residents. Report the percent of variation explained.
- Explain variation in Complete Series Vaccination rates by demographics (age, race, gender), and social determinants (including high school completion rate, percent not proficient in English, percent employed, percent of children in poverty, and median household income). Report the percent of variation explained.
- Explain variation in Complete Series Vaccination rates by demographics (age, race, gender), social determinants (including high school completion rate, percent not proficient in English, percent employed, percent of children in poverty, median household income) and health of residents (including percent population disabled, life expectancy, percent population having premature morbidity). Report the percent of variation explained.
- Explain variation in Complete Series Vaccination rates by demographics (age, race, gender), social determinants (including high school completion rate, percent not proficient in English, percent employed, percent of children in poverty, median household income), health of residents (including percent population disabled, life expectancy, percent population having premature morbidity), and political leaning of the population (including Republican leaning, Democrat leaning). Report the percent of variation explained.
- Does a county's political leaning affect vaccination rates?
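One way to approach a sequence of questions like these is to add predictors in blocks and track R-squared at each step. The sketch below shows only the mechanics. It runs on simulated data, and every column name in it is hypothetical; the real assignment data will have different variables and different results.

```r
set.seed(719)  # arbitrary seed so the simulated example is reproducible
n <- 200

# Simulated county-level data; all column names are hypothetical
counties <- data.frame(
  pct_over_65    = rnorm(n, 17, 4),
  pct_female     = rnorm(n, 50, 2),
  hs_completion  = rnorm(n, 88, 5),
  median_income  = rnorm(n, 60, 12),  # thousands of dollars
  life_expect    = rnorm(n, 78, 3),
  pct_republican = rnorm(n, 50, 15)
)
counties$vax_rate <- with(
  counties,
  20 + 0.5 * pct_over_65 + 0.3 * hs_completion + 0.1 * median_income -
    0.2 * pct_republican + rnorm(n, 0, 5)
)

# Add one block of predictors at a time
m1 <- lm(vax_rate ~ pct_over_65 + pct_female, data = counties)  # demographics
m2 <- update(m1, . ~ . + hs_completion + median_income)         # + social determinants
m3 <- update(m2, . ~ . + life_expect)                           # + health
m4 <- update(m3, . ~ . + pct_republican)                        # + political leaning

# Percent of variation explained at each step
r2 <- sapply(
  list(demographics = m1, plus_social = m2, plus_health = m3, plus_political = m4),
  function(m) summary(m)$r.squared
)
print(round(100 * r2, 1))

# F test: does adding political leaning improve the fit over the previous model?
print(anova(m3, m4))
```

R-squared cannot decrease as blocks are added, so the size of each increase, together with the F test, is what indicates whether a block contributes. A significant coefficient shows association in the data and is not by itself evidence of a causal effect.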
Resources for Question 1:
- Data Download►
- Yili Lin's Answer► R-code►
- S B Yuvaraj Rejeti's Teach One YouTube►
- Nahida Farheen Shaik's Teach One YouTube►
- Using only independent variables measured in 2015 predict incidence of diabetes in the county. Report the percent of variation explained.
- Using only independent variables measured in 2016 predict incidence of diabetes in the county. Report the percent of variation explained.
- Using both independent variables measured in 2015 and independent variables measured in 2016, predict incidence of diabetes in the county. Report the percent of variation explained.
- List variables that have an impact on incidence of diabetes within a year.
- List variables that have an impact on incidence of diabetes within 2 years.
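When predictors are measured in different years, it can help to select them by name pattern and fit one model per set. The sketch below assumes, purely for illustration, that column names end in the year of measurement (for example _2015). The data are simulated and all names are hypothetical, so the actual assignment file may be organized differently.

```r
set.seed(2016)  # arbitrary seed so the simulated example is reproducible
n <- 150

# Simulated county-level data; column names and the year suffixes are hypothetical
county_data <- data.frame(
  obesity_2015    = rnorm(n, 30, 4),
  inactivity_2015 = rnorm(n, 25, 5),
  obesity_2016    = rnorm(n, 30, 4),
  inactivity_2016 = rnorm(n, 25, 5)
)
county_data$diabetes_incidence <- with(
  county_data,
  2 + 0.10 * obesity_2015 + 0.15 * obesity_2016 +
    0.05 * inactivity_2016 + rnorm(n, 0, 1)
)

# Fit a model using every column whose name matches the pattern
fit_by_pattern <- function(pattern) {
  predictors <- grep(pattern, names(county_data), value = TRUE)
  lm(reformulate(predictors, response = "diabetes_incidence"), data = county_data)
}

m2015 <- fit_by_pattern("_2015$")      # only 2015 predictors
m2016 <- fit_by_pattern("_2016$")      # only 2016 predictors
mboth <- fit_by_pattern("_201[56]$")   # predictors from both years

r2 <- c(
  only_2015  = summary(m2015)$r.squared,
  only_2016  = summary(m2016)$r.squared,
  both_years = summary(mboth)$r.squared
)
print(round(100 * r2, 1))

# Coefficient table for the combined model, including p-values
print(coef(summary(mboth)))
```

The coefficient table of the combined model is one place to see which year's variables remain significant once the other year's variables are included.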
Resources for Question 2:
## More
For additional information (not part of the required reading), please see the following links:
- Introduction to regression by others YouTube► Slides►
- Regression using R Read►
- Statistical learning with R Read►
- Open introduction to statistics Read►
This page is part of the HAP 819 course on Advanced Statistics and was organized by Farrokh Alemi PhD Home► Email►