Lecture: Ordinary Regression  


Assigned Reading


Assignments should be submitted in Blackboard.  All assignments should be done in Python, if possible.  Include a summary section at the top of your report, or include a separate file that summarizes your work.  In the summary section or summary page, write brief sentences comparing your work to the answers given or to the videos.  For example, "I got the same answers as the Teach One video for question 1." 

Question 1: For the following X and Y data, calculate the regression equation using Excel.  Plot the points and the line.  Calculate the residuals and sum of squared residuals.  Recalculate the sum of squared residuals and re-plot the line when the intercept is increased and decreased by 20%.  Recalculate the sum of squared residuals and plot the line when the coefficient for X is increased and decreased by 20%.  Which of these 5 lines minimizes the sum of squared residuals and by how much?  Sample Plots in Reading► Cheema & Shih's Answer► Chelsea's Answer►



[Data table not reproduced here. Surviving column headers: Predicted Y, Squared Residuals.]
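
The comparison asked for in Question 1 can be sketched in Python as follows. This is only an illustration: the X and Y values are made up (the assignment's own data are not reproduced here), the helper names `fit_line` and `ssr` are hypothetical, and plotting is left out.

```python
# Hypothetical data for illustration only; substitute the assignment's X and Y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def ssr(x, y, intercept, slope):
    """Sum of squared residuals for the line intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_line(x, y)

# The fitted line plus the four perturbed lines described in the question.
lines = {
    "fitted": (b0, b1),
    "intercept +20%": (1.2 * b0, b1),
    "intercept -20%": (0.8 * b0, b1),
    "slope +20%": (b0, 1.2 * b1),
    "slope -20%": (b0, 0.8 * b1),
}
results = {name: ssr(x, y, a, b) for name, (a, b) in lines.items()}
best = min(results, key=results.get)

for name, value in results.items():
    print(f"{name:15s} SSR = {value:.4f}")
print("Smallest SSR:", best)
```

Subtracting the fitted line's SSR from each of the other four values gives the "by how much" part of the question.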
















Question 2: Regress the cost of healthcare on the patients' comorbidities, age, and gender, and on whether they participated in the medical foster home (MFH) program. MFH is an intervention for nursing home patients.  In this program, nursing home patients are diverted to a community home, and health care services are delivered within that home.  The resident eats with the family and relies on the family members for socialization, food, and comfort.  It is called a "foster" home because the family already living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by the variable MFH=1. 

Various costs are reported in the file, including cost inside and outside the organization.  Use cost per day, and exclude patients who have zero cost within the organization. Costs are reported for a specific time period after admission, shorter for some patients and longer for others.  Using daily cost keeps differences in length of follow-up from distorting the comparison.

CCS in these data refers to the Clinical Classifications Software of the Agency for Healthcare Research and Quality.  These data indicate the comorbidities of the patient.  When the entry is null, it is assumed that the patient did not have the comorbidity.  When a value is entered, it is assumed that the patient had the comorbidity, and the value is the number of days from the first occurrence (maximum) or the most recent occurrence (minimum) to admission to either the nursing home or the MFH. Thus an entry of 20 under the minimum CCS indicates that 20 days passed between the most recent occurrence of the comorbidity and admission.  An entry of 400 under the maximum CCS indicates that 400 days passed between the first occurrence of the comorbidity and admission. Choose which form of the data (minimum, maximum, or occurrence) is relevant to your analysis. Keep in mind the possibility that for acute illness the most recent event may be predictive of cost, while for chronic illness the first occurrence may be predictive.
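
One way to act on this choice is sketched below in Python. The function name `code_comorbidity` and its arguments are hypothetical, `None` stands in for null, and the acute versus chronic split is only the heuristic suggested above, not a rule taken from the data file.

```python
def code_comorbidity(min_days, max_days, chronic):
    """Turn the CCS day counts for one comorbidity into model inputs.

    min_days: days from the most recent occurrence to admission (None if null)
    max_days: days from the first occurrence to admission (None if null)
    chronic:  True to use the first occurrence, False to use the most recent
    """
    # A null in both fields is read as "patient did not have the comorbidity".
    present = int(min_days is not None or max_days is not None)
    days = max_days if chronic else min_days
    return {"present": present, "days_to_admission": days}

# Hypothetical patient records for illustration only.
print(code_comorbidity(20, 400, chronic=False))    # uses the most recent event
print(code_comorbidity(20, 400, chronic=True))     # uses the first event
print(code_comorbidity(None, None, chronic=True))  # comorbidity absent
```

The `present` indicator corresponds to the "occurrence" option, and `days_to_admission` to the minimum or maximum option.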

The functional disability variables are probabilities that the patient has the disability.  These probabilities are generated from the CCS diagnoses and demographics of the person.  In completing this assignment, follow these steps: Python Code►

  1. Clean the data using SQL. Check whether cases repeat and should be deleted from the analysis.  There are many null values; make sure your solution takes them into account.  Gender is indicated as "M" and "F"; recode M as 1 and F as 0.  In survival days, null values indicate zero.  In other variables, a null value can be imputed from the mode, or the case can be ignored.  Taheeri's Teach One► Marla's Teach One►
  2. Describe the data using univariate analysis. 
  3. Check whether the cost distribution is normal; if it is not, decide on a transformation that would make it more nearly normal.
  4. Check that age and cost have a linear relationship. 
  5. Check the impact of age and gender interaction on cost.
  6. Check the impact of survival on cost. 
  7. Create a regression model to explain the relationship among the variables and cost. 
  8. Use plots of residuals to test regression assumptions.
  9. Explain the percentage of variation in cost explained by the model.
  10. List the top 10 predictors of cost (describe these predictors in plain English, not as coded variable names). 
  11. Describe in English whether MFH contributes to the cost of care.
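
Step 1 can be sketched with Python's built-in `sqlite3` module, which keeps the cleaning in SQL. The table name `patients`, its columns, and the sample rows are hypothetical stand-ins for the course data file, and the SQL dialect used in the course may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (id INTEGER, gender TEXT, age REAL, "
    "survival_days REAL, daily_cost REAL)"
)
# Hypothetical rows: one exact duplicate, null survival days, one zero-cost patient.
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?, ?)",
    [
        (1, "M", 71, None, 120.0),
        (1, "M", 71, None, 120.0),
        (2, "F", 80, 365, 95.5),
        (3, "F", 67, 30, 0.0),
    ],
)

cleaned = conn.execute(
    """
    SELECT DISTINCT                                  -- drop repeated cases
        id,
        CASE gender WHEN 'M' THEN 1 WHEN 'F' THEN 0 END AS male,
        age,
        COALESCE(survival_days, 0) AS survival_days, -- null means zero
        daily_cost
    FROM patients
    WHERE daily_cost > 0                             -- exclude zero-cost patients
    ORDER BY id
    """
).fetchall()
conn.close()

for row in cleaned:
    print(row)
```

Imputing the mode for other variables is not shown; it would need one more query per column.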

Use the instructor's last name as the password for the data.    Data► CCS► Cheema & Shih's Answer► Chelsea's Answer►
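
Steps 3 and 7 through 9 of Question 2 can be sketched with NumPy. Everything here is an assumption made for illustration: the data are simulated rather than taken from the course file, the log transformation is one common choice for skewed costs and not necessarily the right one for these data, and residual plots are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Simulated stand-ins for the cleaned data; not the course data.
age = rng.uniform(60, 95, n)
male = rng.integers(0, 2, n)
mfh = rng.integers(0, 2, n)
log_cost = 4.0 + 0.01 * age + 0.2 * male - 0.3 * mfh + rng.normal(0, 0.5, n)
cost = np.exp(log_cost)  # right-skewed by construction

def skewness(v):
    """Sample skewness; values near 0 suggest a symmetric distribution."""
    z = (v - v.mean()) / v.std()
    return float((z ** 3).mean())

# Step 3: compare skewness before and after a log transformation.
print("skew of cost:     ", round(skewness(cost), 2))
print("skew of log(cost):", round(skewness(np.log(cost)), 2))

# Step 7: design matrix with intercept, main effects, and the
# age-by-gender interaction from step 5.
X = np.column_stack([np.ones(n), age, male, age * male, mfh])
y = np.log(cost)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Steps 8 and 9: residuals and the share of variation explained (R squared).
fitted = X @ coef
residuals = y - fitted
r_squared = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

for name, b in zip(["intercept", "age", "male", "age x male", "MFH"], coef):
    print(f"{name:10s} {b: .3f}")
print("R squared:", round(float(r_squared), 3))
```

Plotting `residuals` against `fitted` is the usual next step for checking the regression assumptions in step 8.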

Question 3: In the following data, examine whether age, gender, and last year's cost predict next year's cost.  If you are using R, make sure that you convert currency values into numbers. Data► R Code► Cheema & Shih's Answer► Chelsea's Answer►
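
The currency cleanup mentioned in Question 3 can also be sketched in Python, the language these assignments prefer. The helper name `currency_to_float` and the sample strings are hypothetical; the actual file may use a different currency format.

```python
def currency_to_float(text):
    """Convert a currency string such as "$1,234.50" to a float.

    Parentheses are treated as an accounting-style negative, and an empty
    or missing value becomes None.
    """
    if text is None:
        return None
    cleaned = text.strip()
    if not cleaned:
        return None
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    # Remove parentheses, the dollar sign, and thousands separators.
    cleaned = cleaned.strip("()").replace("$", "").replace(",", "").strip()
    value = float(cleaned)
    return -value if negative else value

samples = ["$1,234.50", " $87 ", "($15.25)", ""]
print([currency_to_float(s) for s in samples])
```

Once the cost columns are numeric, they can enter a regression in the same way as age and gender.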

Question 4: Throughout this course, we emphasize the concept of the Markov blanket. A Markov blanket is a set of variables that makes all other variables irrelevant in predicting the response variable. You have not been exposed to this concept yet, but you have learned about issues related to multi-collinearity in regression.  The point of this question is to push you to think harder about these two concepts. 

  • If X1 and X2 are significant predictors of Y, X3 and X4 are not, and no interactions are significant, then what is the Markov blanket for Y?  How is the concept of the Markov blanket related to multi-collinearity? 
  • Suppose X2 occurs after Y and X1 occurs prior to Y. What is a Markov blanket that separates the variables that are irrelevant from those that could possibly be causes of Y?  Keep in mind that a cause is something that occurs prior to the effect, has a significant association with the effect, has a mechanism leading from cause to effect, and, if removed, makes the effect less likely to occur, ceteris paribus.

Markov chains explained visually►  Markov blanket's definition►  Answer by Chelsea►


For additional information (not part of the required reading), please see the following links:

  1. Introduction to regression by others Video► Slides►
  2. Regression using R Read►
  3. Statistical learning with R Read►
  4. Open introduction to statistics Read►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi PhD Home►  Email►