Ordinary Regression  




Verifying Regression Assumptions

ChatGPT: Regression analysis makes several assumptions about the data and the relationship between the variables being studied. These assumptions include:

  1. Linearity: The relationship between the independent and dependent variables is linear.
  2. Independence: The observations are independent of each other.
  3. Homoscedasticity: The variance of the errors is constant across all levels of the independent variable.
  4. Normality: The errors are normally distributed.
  5. No multicollinearity: The independent variables are not highly correlated with each other.

Violations of these assumptions can lead to biased or inefficient estimates and can affect the validity of inferences made from the model.
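
A few of these assumptions can be checked numerically from the residuals. The sketch below is only an illustration on made-up data (the lists `x1`, `x2` and `y` are hypothetical, not course data): it compares residual spread in the lower and upper halves of the predictor (homoscedasticity), computes the Durbin-Watson statistic (independence, meaningful only if the rows are in time order), and computes the variance inflation factor for two predictors (multicollinearity). Normality and linearity are usually checked with plots and are not covered here.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pearson_r(a, b):
    """Pearson correlation between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical data, sorted by x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

intercept, slope = fit_line(x1, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x1, y)]

# Homoscedasticity: mean squared residual in the upper half of x1
# relative to the lower half. A ratio far from 1 is a warning sign.
half = len(residuals) // 2
spread_low = sum(r ** 2 for r in residuals[:half]) / half
spread_high = sum(r ** 2 for r in residuals[half:]) / (len(residuals) - half)
spread_ratio = spread_high / spread_low

# Independence: Durbin-Watson statistic. Values near 2 suggest no
# first-order autocorrelation among consecutive residuals.
durbin_watson = sum(
    (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
) / sum(r ** 2 for r in residuals)

# Multicollinearity: with two predictors, VIF = 1 / (1 - r^2).
vif = 1 / (1 - pearson_r(x1, x2) ** 2)

print(f"spread ratio={spread_ratio:.2f}, DW={durbin_watson:.2f}, VIF={vif:.2f}")
```

What counts as "too far" from 1 or 2, or as a "large" VIF, is a judgment call; these numbers are best read alongside residual plots.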

Coefficient of Determination

The coefficient of determination, or R-squared, measures goodness of fit between the model and the data.  The statistic R2 is the percentage of variation in the outcome (the response variable in the regression) explained by the independent variables.  A low R-squared suggests that relevant variables may have been left out of the analysis, a problem often referred to as model specification bias.

  • Read Chapter 11 in Statistical Analysis of Electronic Health Records, pages 277 to 280
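
As a small illustration of the definition, R2 can be computed as one minus the ratio of the sum of squared residuals to the total sum of squares. The function name `r_squared` and the numbers below are made up for this sketch.

```python
def r_squared(y, y_hat):
    """Share of the variation in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total
    return 1 - ss_res / ss_tot

# Hypothetical outcomes and two sets of predictions.
y = [3.0, 5.0, 7.0, 9.0]
perfect_fit = [3.0, 5.0, 7.0, 9.0]   # predictions equal the data
mean_only = [6.0, 6.0, 6.0, 6.0]     # always predict the mean of y

print(r_squared(y, perfect_fit))  # 1.0
print(r_squared(y, mean_only))    # 0.0
```

A model that does no better than predicting the average outcome has R2 of 0; a model that reproduces the data exactly has R2 of 1.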

Model Selection

To do regression, you have to try different mathematical models and see which one fits the data best.  A linear model is just a weighted sum of independent variables.  A non-linear model is a weighted sum of independent variables and interaction terms.  An interaction term is constructed as the product of 2 or more independent variables.
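
To make the idea of an interaction term concrete, here is a minimal sketch. The variable names (`age`, `male`), the weights and the intercept are all hypothetical and chosen only for illustration.

```python
def add_interaction(rows, a, b):
    """Return copies of the rows with an extra column holding a * b."""
    name = f"{a}_x_{b}"
    return [{**row, name: row[a] * row[b]} for row in rows]

def predict(row, intercept, weights):
    """Weighted sum of the named variables plus an intercept."""
    return intercept + sum(w * row[name] for name, w in weights.items())

# Two hypothetical patients.
rows = [{"age": 70, "male": 1}, {"age": 80, "male": 0}]
rows = add_interaction(rows, "age", "male")

# Hypothetical weights; in practice these are estimated from data.
weights = {"age": 2.0, "male": 10.0, "age_x_male": 0.5}
predictions = [predict(row, 100.0, weights) for row in rows]
print(predictions)  # [285.0, 260.0]
```

With the interaction term included, the effect of age on the prediction differs between men and women: each year of age adds 2.5 for the male patient but 2.0 for the female patient.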

Wikipedia: "'All models are wrong' is a common aphorism in statistics; it is often expanded as 'All models are wrong, but some are useful'. The aphorism is generally attributed to the statistician George Box. The underlying concept, though, predates Box's writings."

  • Read Chapter 11 in Statistical Analysis of Electronic Health Records, pages 274 to 277
  • YouTube►

Markov Blanket & Statistical Significance

ChatGPT: In regression analysis, statistically significant variables are those that have a strong and meaningful relationship with the dependent variable, as determined by a hypothesis test. The coefficients for these variables are estimated and their significance is determined by the p-value. Variables with a low p-value (typically less than 0.05) are considered statistically significant.

In contrast, in a Markov blanket, a variable is considered statistically significant if it is part of the set of variables that shield a target variable from the effects of other variables in a graph. The Markov blanket of a target variable is the set of variables that, when known, renders the target variable independent of all other variables. The variables in the Markov blanket are considered statistically significant because they are the ones that are most directly related to the target variable and can be used to make accurate predictions about the target variable.

In summary, in regression, statistically significant variables are those that have a strong and meaningful relationship with the dependent variable, as determined by a hypothesis test; in a Markov blanket, a variable is considered statistically significant if it is part of the set of variables that shield a target variable from the effects of other variables in a graph.
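
When the graph is a directed acyclic graph (a Bayesian network), the Markov blanket of a target consists of its parents, its children, and the other parents of its children. The sketch below reads the blanket off a small graph; the graph and its variable names are hypothetical and are not estimated from any course data.

```python
def markov_blanket(parents, target):
    """Markov blanket of target in a DAG given as {node: set of parents}."""
    children = {node for node, ps in parents.items() if target in ps}
    co_parents = set()
    for child in children:
        co_parents |= parents[child]  # other parents of each child
    blanket = set(parents.get(target, set())) | children | co_parents
    blanket.discard(target)  # the target is not part of its own blanket
    return blanket

# Hypothetical graph: each node maps to the set of its direct causes.
parents = {
    "age": set(),
    "gender": set(),
    "race": set(),
    "mfh": set(),
    "disability": {"age"},
    "cost": {"disability", "mfh"},
    "survival": {"cost", "gender"},
}

blanket = markov_blanket(parents, "cost")
print(sorted(blanket))  # ['disability', 'gender', 'mfh', 'survival']
```

In this made-up graph, age is not in the blanket of cost because its influence passes entirely through disability, and race is not in the blanket because it is not connected to cost at all.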


Assignments should be submitted in Blackboard. All assignments should be done in Python, if possible.  At the top of your report, include a summary section (or attach a separate summary file) with one statement per question.  In the summary, write brief sentences comparing your work to the answers given or to the videos.  For example, "I got the same answers as the Teach One video for question 1."

  • Question 1: For the following X and Y data, calculate the regression equation.  Plot the points and the line.  Calculate the residuals and sum of squared residuals.   For solution see required textbook pages 262 through 264.
  • Question 2: Using the data in Question 1, change the intercept of the estimated equation by +20% and by -20% and plot all three lines.  Calculate residuals for each of the three lines and report which line has the lowest sum of squared residuals.  Repeat, this time changing the slope coefficient in the equation by +20% and -20%. Calculate the residuals for each of the three lines and report which line has the lowest sum of squared residuals.  See the solution on page 264, Exhibit 11.5, in the required textbook.
  • Question 3: Clean the Medical Foster Home data. Limit the data to cost per day, patient disabilities in 365 days, survival, age of patients, gender of patients and whether they participated in the medical foster home (MFH) program. Clean the data using the following:
    1. Remove all cases in which the values for disabilities in 365 days, age, and gender are all missing.  These data are meaningless and should be dropped from the analysis.
    2. Remove any row in which the treatment variable (MFH) is missing.  MFH is an intervention for nursing home patients.  In this program, nursing home patients are diverted to a community home and health care services are delivered within the community home.  The resident eats with the family and relies on the family members for socialization, food and comfort.  It is called "foster" home because the family previously living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by a variable MFH=1.  A value of NaN or null indicates a missing value.
    3. Various costs are reported in the file, including cost inside and outside the organization.  Rely on cost per day. Exclude patients who have 0 cost per day within the organization. These do not make sense. The cost is reported for a specific time period after admission; some patients stay a short time and others longer.  Use daily cost so that you do not get caught up in issues related to lack of follow-up.
    4. As your independent variables, select the probabilities of disability in 365 days.  These probabilities are predicted from CCS variables.  CCS in these data refers to the Clinical Classifications Software of the Agency for Healthcare Research and Quality.  CCS data indicate the comorbidities of the patient.  When null, it is assumed the patient did not have the comorbidity.  When data are entered, it is assumed that the patient had the comorbidity, and the reported value is the number of days from the first (maximum) or last (minimum) occurrence till admission to either the nursing home or the MFH. Thus an entry of 20 under the minimum CCS indicates that from the most recent occurrence of the comorbidity till admission was 20 days.  An entry of 400 under the maximum CCS indicates that from the first time the comorbidity occurred till admission was 400 days. Because of the relationship between disabilities and comorbidities, you can rely exclusively on disabilities and ignore comorbidities.
    5. Check whether cases repeat; repeated cases should be deleted from the analysis.
    6. In survival days, null values indicate zero.
    7. Convert all categorical variables to binary dummy variables. For example, race has four values (W, B, A, and Other) plus a null value.  Create 5 binary dummy variables for these categories and use 4 of them in the regression.  For example, the binary variable called Black is 1 when race is B, and 0 otherwise.  In this binary variable we are comparing all Black residents to non-Black residents, who include W, A, null, and other races.
    8. For every variable whose null values were not removed by deleting rows (e.g. race), create a dummy variable that is 1 when the value is null and 0 otherwise.  Treat these null dummy variables as you would any other independent variable.
    9. Gender is indicated as "M" and "F"; revise by replacing M with 1 and F with 0.
    10. Make sure that no numbers are entered as text.
  • Taheeri's Teach One►
  • Marla's Teach One►
  • Data►
  • CCS►
  • Adnan's Teach One YouTube► Python Code►
  • Question 4: Visually check whether cost is normally distributed and whether the log of cost is more normal than cost itself. If a variable is not normally distributed, is the average of the variable normal (see page 261 in the textbook)?  Visually check whether age and cost have a linear relationship.
  • Question 5: Regress cost per day on age (continuous variable), gender (male=1, female=0), survival, binary dummy variables for race, probabilities of functional disabilities, and any null dummy variable you have created.   Describe the percent of variation explained and the F statistic.  Show which variables have a statistically significant effect on cost.  Does age affect cost? Does MFH reduce cost of care?
  • Question 6: In the following interactive graph, we have fitted a regression line to data.  Move one, and only one, point so that R2 declines to 0.05. Provide an image of the resulting data and line. Create a concave curve with the data and report the fit between the regression line and the concave data.  Attach a screenshot of your work. Interactive Fit►
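
The calculations asked for in Questions 1 and 2 can be sketched as follows. The X and Y values here are hypothetical placeholders, not the assignment data, and plotting is left out to keep the sketch free of extra libraries.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data standing in for the assignment's X and Y.
x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.3, 7.9, 10.1]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Shift the intercept, then the slope, by -20% and +20% and compare fits.
ssr = {}
for label, factor in [("-20%", 0.8), ("fitted", 1.0), ("+20%", 1.2)]:
    ssr["intercept " + label] = sum_squared_residuals(x, y, intercept * factor, slope)
    ssr["slope " + label] = sum_squared_residuals(x, y, intercept, slope * factor)

for name, value in ssr.items():
    print(f"{name}: {value:.4f}")
```

Because least squares chooses the intercept and slope that minimize the sum of squared residuals, the unmodified line should have the lowest value in both comparisons.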

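Several of the cleaning steps in Question 3 can be sketched in plain Python as below. Every column name (`cost_per_day`, `disability_365`, `survival_days`, `age`, `gender`, `race`, `mfh`) is a hypothetical stand-in for whatever the data file actually uses, and two choices are assumptions of this sketch rather than instructions from the assignment: rows with a missing cost are dropped along with zero-cost rows, and only exact duplicate rows are treated as repeats.

```python
def is_missing(value):
    """True for None, empty text, or a float NaN."""
    return value is None or value == "" or (isinstance(value, float) and value != value)

def to_number(value):
    """Convert numbers stored as text to float (step 10); keep missing as None."""
    return None if is_missing(value) else float(value)

def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        # Step 1: drop rows where disability, age and gender are all missing.
        if all(is_missing(row.get(k)) for k in ("disability_365", "age", "gender")):
            continue
        # Step 2: drop rows with a missing treatment indicator.
        if is_missing(row.get("mfh")):
            continue
        # Step 3: drop zero cost per day (missing cost dropped too: assumption).
        cost = to_number(row.get("cost_per_day"))
        if cost is None or cost == 0:
            continue
        # Step 5: skip exact duplicates of a row already kept (assumption).
        key = tuple(sorted((k, str(v)) for k, v in row.items()))
        if key in seen:
            continue
        seen.add(key)
        race, gender = row.get("race"), row.get("gender")
        cleaned.append({
            "cost_per_day": cost,
            "mfh": int(float(row["mfh"])),
            "disability_365": to_number(row.get("disability_365")),
            "age": to_number(row.get("age")),  # missing age is left as None here
            # Step 6: null survival days mean zero.
            "survival_days": to_number(row.get("survival_days")) or 0.0,
            # Step 9: M -> 1, F -> 0; step 8: separate dummy for null gender.
            "male": int(gender == "M"),
            "gender_null": int(is_missing(gender)),
            # Steps 7 and 8: race dummies, with W as the omitted category.
            "black": int(race == "B"),
            "asian": int(race == "A"),
            "other_race": int(race == "Other"),
            "race_null": int(is_missing(race)),
        })
    return cleaned

# Hypothetical rows exercising each rule.
first = {"cost_per_day": "120.5", "disability_365": 0.4, "survival_days": None,
         "age": "81", "gender": "M", "race": "B", "mfh": 1}
rows = [
    first,
    dict(first),  # exact duplicate
    {"cost_per_day": 0, "disability_365": 0.1, "survival_days": 10,
     "age": 70, "gender": "F", "race": "W", "mfh": 0},      # zero cost
    {"cost_per_day": 95.0, "disability_365": 0.2, "survival_days": 200,
     "age": 77, "gender": "F", "race": "A", "mfh": None},   # missing MFH
    {"cost_per_day": 80.0, "disability_365": None, "survival_days": 5,
     "age": None, "gender": None, "race": "W", "mfh": 0},   # all three missing
    {"cost_per_day": 60.0, "disability_365": 0.7, "survival_days": 30,
     "age": 90, "gender": "F", "race": None, "mfh": 0},     # kept, race is null
]
cleaned = clean(rows)
print(len(cleaned), "rows kept")
```

Only the first and last hypothetical rows survive. How to fill a missing age, and whether repeats should be detected by patient identifier instead of by exact match, are decisions this sketch leaves open.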

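Question 4 asks for a visual check, typically a histogram of cost next to a histogram of log cost. As a numeric complement, the sketch below compares the skewness of the two; a symmetric distribution such as the normal has skewness 0. The cost values are hypothetical and the function name `skewness` is introduced only for this illustration.

```python
import math

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical daily costs with a long right tail.
costs = [40, 55, 60, 75, 90, 120, 180, 400, 950]
log_costs = [math.log(c) for c in costs]

print(f"skewness of cost:     {skewness(costs):.2f}")
print(f"skewness of log cost: {skewness(log_costs):.2f}")
```

For these made-up values the log transform brings the skewness closer to 0, which is the pattern to look for in the histograms. Low skewness alone does not establish normality, so the plots are still needed.
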
For additional information (not part of the required reading), please see the following links:

  1. Introduction to regression by others Video► Slides►
  2. Regression using R Read►
  3. Statistical learning with R Read►
  4. Open introduction to statistics Read►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi PhD Home►  Email►