Lecture: Ordinary Regression  


Assigned Reading


Submit one file for all questions.  Include all charts, code, and output in the same file.  Make the first page a summary page.  On the summary page, write statements comparing your work to the answers given or to the videos.  For example, "I got the same answers as the Teach One video for question 1."

Question 1: For the following X and Y data, calculate the regression equation using Excel.  Plot the points and the line.  Calculate the residuals and sum of squared residuals.  Recalculate the sum of squared residuals and re-plot the line when the intercept is increased and decreased by 20%.  Recalculate the sum of squared residuals and plot the line when the coefficient for X is increased and decreased by 20%.  Which of these 5 lines minimizes the sum of squared residuals and by how much?  Sample Plots in Reading► Cheema & Shih's Answer► Chelsea's Answer►



Table columns: Predicted Y, Squared Residuals
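The calculation in Question 1 can also be sketched outside Excel. The following is a minimal Python sketch, not the expected solution: the X and Y values are made up for illustration (they are not the assignment's data), and the function names `fit_line` and `sum_squared_residuals` are hypothetical.

```python
# Illustrative data only; these are NOT the assignment's X and Y values.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of (observed Y - predicted Y)^2 for the given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


intercept, slope = fit_line(xs, ys)

# The fitted line plus the four perturbed lines described in the question.
lines = {
    "fitted": (intercept, slope),
    "intercept +20%": (1.2 * intercept, slope),
    "intercept -20%": (0.8 * intercept, slope),
    "slope +20%": (intercept, 1.2 * slope),
    "slope -20%": (intercept, 0.8 * slope),
}

ssr = {name: sum_squared_residuals(xs, ys, a, b) for name, (a, b) in lines.items()}
for name, value in ssr.items():
    print(f"{name:>16}: SSR = {value:.4f}")

best = min(ssr, key=ssr.get)
print("Smallest SSR:", best)
```

Comparing each perturbed SSR with the fitted SSR gives the "by how much" part of the question.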
















Question 2: Regress cost of healthcare on comorbidities of the patients, age of patients, gender of patients, and whether they participated in the Medical Foster Home (MFH) program.  MFH is an intervention for nursing home patients.  In this program, nursing home patients are diverted to a community home, and health care services are delivered within the community home.  The resident eats with the family and relies on the family members for socialization, food, and comfort.  It is called a "foster" home because the family previously living in the community home is supposed to act like the resident's family.  Enrollment in MFH is indicated by the variable MFH=1.

Various costs are reported in the file, including cost inside and outside the organization.  Rely on cost per day, but exclude patients who have 0 cost within the organization.  The cost is reported for a specific time period after admission, some short and some longer.  Use daily cost so that you do not get caught up in issues related to lack of follow-up.

CCS in these data refers to the Clinical Classifications Software of the Agency for Healthcare Research and Quality.  These data indicate the comorbidities of the patient.  When the value is null, it is assumed the patient did not have the comorbidity.  When a value is entered, it is assumed that the patient had the comorbidity, and the reported value is the number of days from the first occurrence (maximum) or the last occurrence (minimum) until admission to either the nursing home or the MFH.  Thus an entry of 20 under the minimum CCS indicates that 20 days passed from the most recent occurrence of the comorbidity until admission.  An entry of 400 under the maximum CCS indicates that 400 days passed from the first occurrence of the comorbidity until admission.  Choose which data (minimum, maximum, or occurrence) you think is relevant for the analysis.  Keep in mind the possibility that for acute illness the most recent event may be predictive of cost, while for chronic illness the first occurrence may be predictive.
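The coding rule for CCS values can be illustrated with a small sketch. This is a hedged Python example on made-up rows; the field names `ccs_min_days` and `ccs_max_days` are hypothetical and are not the column names in the assignment file.

```python
# Made-up rows. None stands for a null entry (comorbidity never recorded).
# Field names are hypothetical, not the real column names in the data file.
patients = [
    {"id": 1, "ccs_min_days": 20, "ccs_max_days": 400},
    {"id": 2, "ccs_min_days": None, "ccs_max_days": None},
    {"id": 3, "ccs_min_days": 5, "ccs_max_days": 5},
]


def recode(row):
    """Turn the days-until-admission entries into analysis variables."""
    has_comorbidity = row["ccs_max_days"] is not None
    return {
        "id": row["id"],
        # Occurrence: 1 if any value was entered, 0 if null.
        "has_comorbidity": 1 if has_comorbidity else 0,
        # Minimum: days from the most recent occurrence until admission.
        "days_since_last": row["ccs_min_days"],
        # Maximum: days from the first occurrence until admission.
        "days_since_first": row["ccs_max_days"],
    }


recoded = [recode(row) for row in patients]
for row in recoded:
    print(row)
```

Whether the occurrence indicator or one of the day counts enters the regression is the analyst's choice, as the paragraph above notes.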

The functional disabilities are probabilities that the patient has the disability.  These probabilities are generated from the CCS diagnoses and demographics of the person by a P2C2E process.  P2C2E stands for a process too complicated to explain.  In completing this assignment, follow these steps:

  1. Clean the data using SQL.  A number of cases repeat and should be deleted from the analysis; in particular, duplicate SCRSSN values should be removed.  There are many null values.  The treatment of null values changes with the type of variable.  In some variables, null values indicate zero.  In others, they can be estimated from the mode.  In still others, they should be treated as a separate variable. Taheeri's Teach One► Marla's Teach One►
  2. Describe the data using univariate analysis. 
  3. Check whether the cost distribution is normal; if it is not, decide on a transformation of the value that would make it more nearly normal.
  4. Check that age and cost have a linear relationship. 
  5. Check the impact of age and gender interaction on cost.
  6. Check the impact of survival on cost. 
  7. Create a regression model to explain the relationship among the variables and cost. 
  8. Use plots of residuals to test regression assumptions.
  9. Explain the percentage of variation in cost explained by the model.
  10. List the top 10 predictors of cost (list these predictors using English language and not coded data). 
  11. Describe in English whether MFH contributes to cost of care.
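The cleaning in step 1 and the transformation in step 3 can be sketched as follows. This is a minimal Python sketch that runs SQL through the standard `sqlite3` module on a made-up table. The table name `raw`, every column name except SCRSSN, and the rows are hypothetical. Treating a null MFH as "not enrolled", a null gender as its own category, and using a log transformation are assumptions made for illustration, not rules stated by the assignment.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE raw (
           scrssn INTEGER, age REAL, gender TEXT,
           mfh INTEGER, daily_cost REAL, ccs_max_days REAL)"""
)
rows = [
    (101, 70, "M", 1, 120.0, 400),
    (101, 70, "M", 1, 120.0, 400),   # repeated SCRSSN, should collapse to one row
    (102, 82, "F", None, 95.5, None),
    (103, 65, "M", 0, 0.0, 30),      # zero cost, excluded from the analysis
    (104, 77, None, 0, 210.0, None),
]
conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?, ?, ?)", rows)

# One row per SCRSSN; nulls are handled differently by variable type.
clean = conn.execute(
    """
    SELECT scrssn,
           MIN(age)                         AS age,
           COALESCE(MIN(gender), 'Unknown') AS gender,     -- null as its own category
           COALESCE(MAX(mfh), 0)            AS mfh,        -- assumption: null means 0
           MAX(daily_cost)                  AS daily_cost,
           CASE WHEN MAX(ccs_max_days) IS NULL THEN 0 ELSE 1 END AS has_ccs
    FROM raw
    WHERE daily_cost > 0
    GROUP BY scrssn
    ORDER BY scrssn
    """
).fetchall()

for row in clean:
    print(row)

# Step 3: a log transformation is one common choice for right-skewed cost.
log_costs = [math.log(row[4]) for row in clean]
print(log_costs)
```

Grouping by SCRSSN is only one way to remove repeats; if repeated rows disagree, a rule for which row to keep has to be chosen first.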

Use the instructor's last name as the password for the data.    Data► CCS► Cheema & Shih's Answer► Chelsea's Answer►

Question 3: In the following data, examine whether age, gender and last year's cost predict next year's cost.  If you are using R code make sure that you reformat currency into a number. Data► R Code► Cheema & Shih's Answer► Chelsea's Answer►
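The currency conversion and the regression in Question 3 can be sketched in Python as well. This is a hedged illustration, not the linked R code: the records are made up, the helper names `parse_currency`, `solve`, and `fit_ols` are hypothetical, and the fit uses the normal equations in plain Python so that no packages are needed.

```python
def parse_currency(text):
    """Convert a string such as '$1,234.50' to a float."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Accounting-style negatives such as '(250.00)' are treated as negative.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -float(cleaned[1:-1])
    return float(cleaned)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Least-squares coefficients via the normal equations, intercept first."""
    design = [[1.0] + list(row) for row in predictors]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up records: (age, gender, last year's cost, next year's cost).
records = [
    (65, "F", "$1,200.00", "$1,350.00"),
    (72, "M", "$3,400.50", "$3,900.00"),
    (58, "M", "$800.00", "$950.25"),
    (80, "F", "$5,100.00", "$5,600.00"),
    (69, "F", "$2,250.75", "$2,300.00"),
    (75, "M", "$4,000.00", "$4,450.50"),
    (61, "M", "$1,500.00", "$1,400.00"),
    (77, "F", "$2,900.00", "$3,500.00"),
]

# Predictors: age, male indicator (1 = M, 0 = F), last year's cost as a number.
predictors = [
    [age, 1.0 if gender == "M" else 0.0, parse_currency(last)]
    for age, gender, last, _ in records
]
y = [parse_currency(nxt) for *_, nxt in records]

coef = fit_ols(predictors, y)
fitted = [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in predictors]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

mean_y = sum(y) / len(y)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print("coefficients (intercept, age, male, last cost):", coef)
print("R squared:", r_squared)
```

With real data, a statistics package would also report standard errors and p-values, which this sketch does not compute.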

Question 4: Throughout this course we emphasize the concept of a Markov blanket.  A Markov blanket is a set of variables that makes all other variables irrelevant in predicting the response variable.  You have not been exposed to this concept yet, but you have learned about issues related to multi-collinearity in regression.  The point of this question is to push you to think harder about these two concepts.

  • If X1 and X2 are significant predictors of Y, X3 and X4 are not, and no interactions are significant, then what is the Markov blanket for Y?  How is the concept of a Markov blanket related to multi-collinearity?
  • Suppose X2 occurs after Y and X1 occurs prior to Y.  What is a Markov blanket that separates the variables that are irrelevant from those that could possibly be causes of Y?  Keep in mind that a cause is something that occurs prior to the effect, has a significant association with the effect, has a mechanism leading from cause to effect, and, if removed, makes the effect less likely to occur, ceteris paribus.
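
The definition of a Markov blanket given in this question can be written as a conditional independence statement. The notation here is an assumption introduced for illustration and is not taken from the course materials: V is the set of all candidate predictors and MB(Y) is the subset of V that forms the blanket.

```latex
Y \perp\!\!\!\perp \bigl(V \setminus \mathrm{MB}(Y)\bigr) \;\big|\; \mathrm{MB}(Y)
\quad\Longleftrightarrow\quad
P\bigl(Y \mid V\bigr) = P\bigl(Y \mid \mathrm{MB}(Y)\bigr)
```

In words, once the variables in MB(Y) are known, the remaining variables in V add nothing to the prediction of Y.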

Markov chains explained visually►  Markov blanket's definition►  Answer by Chelsea►


For additional information (not part of the required reading), please see the following links:

  1. Introduction to regression by others Video► Slides►
  2. Regression using R Read►
  3. Statistical learning with R Read►
  4. Open introduction to statistics Read►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi PhD Home►  Email►