Lecture: Logistic Regression & Propensity to Treat  


Assigned Reading


Question 1: What is the logit of probability 0.75?
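
As a reminder of the definition, the logit of a probability p is the natural log of the odds, ln(p / (1 - p)). The short Python sketch below is only an illustration; the function names `logit` and `inverse_logit` are ours and do not come from any course file.

```python
import math

def logit(p: float) -> float:
    """Return the log odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is defined only for 0 < p < 1")
    return math.log(p / (1.0 - p))

def inverse_logit(x: float) -> float:
    """Map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

print(logit(0.5))                 # even odds, so the log odds are zero
print(inverse_logit(logit(0.9)))  # round trip back to (approximately) 0.9
```

The same function can be applied to any probability in the open interval (0, 1).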

Question 2: Regress survival over the next 6 months on the patients' comorbidities, age, and gender, and on whether they participated in the Medical Foster Home (MFH) program. MFH is an intervention for nursing home patients. In this program, nursing home patients are diverted to a community home, and health care services are delivered within the community home. The resident eats with the family and relies on family members for socialization, food, and comfort. It is called a "foster" home because the family already living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by the variable MFH=1.

Survival is reported in two variables. One variable indicates survival at 6 months. The other reports the number of days the patient is known to have survived if the patient has died, and is null otherwise. Thus a null value in this latter variable indicates that the patient did not die.

CCS in these data refers to the Clinical Classifications Software of the Agency for Healthcare Research and Quality. These data indicate the comorbidities of the patient. When the value is null, it is assumed that the patient did not have the comorbidity. When a value is entered, it is assumed that the patient had the comorbidity, and the reported value is the number of days from the first occurrence (maximum) or the most recent occurrence (minimum) of the comorbidity until admission to either the nursing home or the MFH. Thus an entry of 20 under the minimum CCS indicates that the most recent occurrence of the comorbidity was 20 days before admission. An entry of 400 under the maximum CCS indicates that the first occurrence of the comorbidity was 400 days before admission. You choose which data (minimum, maximum, or occurrence) are relevant for the analysis. Keep in mind that for acute illness the most recent event may be predictive, while for chronic illness the first occurrence may be predictive of the outcome.

The functional disabilities are probabilities that the patient has the disability.  These probabilities are generated from the CCS diagnoses and demographics of the person.

Clean the data using SQL. A number of cases repeat, and the duplicates should be deleted from the analysis. There are many null values. The treatment of a null value depends on the type of variable. In some variables, a null value indicates zero. In others, it can be estimated from the mode. In still others, it should be treated as a separate variable. In completing this assignment, follow these steps:

  1. Describe the data using univariate analysis. 
  2. Check the distribution of the survival variable.
  3. Check the impact of the interaction of age and gender on survival.
  4. Create a regression model to explain the relationship among the variables and survival.  Manalac's Stata►
  5. Use plots of residuals to test regression assumptions.
  6. Explain the fit of the model to the data.
  7. List the top 4 predictors of survival (describe these predictors in plain English, not as coded variable names). 
  8. Describe, in English, if the MFH program contributes to survival.  Provide the evidence for your claim.
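
The cleaning step that precedes these analyses can be sketched as follows. This is a minimal illustration run through Python's built-in sqlite3 module, not the course's own solution. The table and column names (`raw_patients`, `marital_status`, `ccs_min_days`, and so on) and the five rows are hypothetical, and which variable receives which null treatment is our assumption. The sketch removes exact repeats, fills a missing gender with the mode, keeps a missing marital status as its own category (which would enter a regression as a separate indicator variable), and turns a null CCS entry into "comorbidity absent".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE raw_patients (
        id INTEGER, age INTEGER, gender TEXT, marital_status TEXT,
        mfh INTEGER, ccs_min_days INTEGER
    )
""")
# Hypothetical rows, invented for illustration only.
rows = [
    (1, 81, "M", "Married", 1, 20),
    (1, 81, "M", "Married", 1, 20),    # exact repeat of patient 1
    (2, 77, "M", None, 0, None),       # marital status and comorbidity missing
    (3, 90, None, "Widowed", 0, 400),  # gender missing
    (4, 68, "F", "Married", 1, None),
]
conn.executemany("INSERT INTO raw_patients VALUES (?, ?, ?, ?, ?, ?)", rows)

conn.execute("""
    CREATE TABLE clean_patients AS
    WITH deduped AS (
        SELECT DISTINCT * FROM raw_patients          -- drop exact repeats
    ),
    gender_mode AS (
        SELECT gender FROM deduped
        WHERE gender IS NOT NULL
        GROUP BY gender
        ORDER BY COUNT(*) DESC, gender               -- most frequent value first
        LIMIT 1
    )
    SELECT
        id,
        age,
        mfh,
        COALESCE(gender, (SELECT gender FROM gender_mode)) AS gender,  -- null -> mode
        COALESCE(marital_status, 'Unknown') AS marital_status,         -- null -> own category
        CASE WHEN ccs_min_days IS NULL THEN 0 ELSE 1 END AS has_ccs    -- null -> absent
    FROM deduped
""")

cleaned = conn.execute("SELECT * FROM clean_patients ORDER BY id").fetchall()
for row in cleaned:
    print(row)
```

`SELECT DISTINCT *` removes only rows that match on every column; repeats that differ in some field would need a rule for choosing which record to keep.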

Use the instructor's last name as the password for the data.    Data► CCS►

Question 3: The following data provide the length of stay of patients seen by Dr. Smith (variable Dr. Smith = 1) and by his peer group (variable Dr. Smith = 0). Answer the following questions:

  1. Does Dr. Smith see a different set of patients than his peer group? In particular, what is the probability of a patient being seen by Dr. Smith? Regress the choice of provider on the 9 diagnoses provided. 
  2. Balance the data by propensity to seek care from Dr. Smith. Show graphically that the weighting procedure results in the same weighted number of patients treated by Dr. Smith and by his peers. 
  3. Report the unconfounded impact of Dr. Smith on length of stay.  Data► Kanfer's Teach One► Solution►
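
One way to picture the balancing step is inverse probability weighting, sketched below in plain Python. This is an illustration under stated assumptions, not the posted solution: the eight patient records are invented, there is a single diagnosis profile instead of 9 diagnoses, and the propensity is taken as the observed share of each profile seen by Dr. Smith, a simple stand-in for the logistic regression the question asks for. Weights are undefined when a propensity is exactly 0 or 1, so such profiles would need separate handling.

```python
from collections import defaultdict

# Hypothetical records: (diagnosis profile, seen by Dr. Smith, length of stay in days)
patients = (
    [("diabetes", 1, 6)] * 3 + [("diabetes", 0, 5)] * 1
    + [("none", 1, 3)] * 1 + [("none", 0, 2)] * 3
)

# Step 1: propensity = share of each diagnosis profile seen by Dr. Smith.
seen = defaultdict(int)
total = defaultdict(int)
for profile, smith, _ in patients:
    total[profile] += 1
    seen[profile] += smith
propensity = {profile: seen[profile] / total[profile] for profile in total}

# Step 2: inverse probability weights, 1/p for Dr. Smith's patients
# and 1/(1 - p) for the peer group's patients.
def weight(profile, smith):
    p = propensity[profile]
    return 1.0 / p if smith == 1 else 1.0 / (1.0 - p)

# Step 3: weighted patient count and weighted mean length of stay per provider group.
def weighted_summary(group):
    w_sum = sum(weight(pr, s) for pr, s, _ in patients if s == group)
    w_los = sum(weight(pr, s) * los for pr, s, los in patients if s == group)
    return w_sum, w_los / w_sum

smith_n, smith_los = weighted_summary(1)
peer_n, peer_los = weighted_summary(0)

print("weighted counts:", round(smith_n, 2), round(peer_n, 2))
print("weighted difference in length of stay:", round(smith_los - peer_los, 2))
```

In this toy example both groups have the same weighted count after weighting, which is what a bar chart of weighted counts would show, and the weighted difference in length of stay is smaller than the unweighted one.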

Question 4: The following data provide survival among cancer patients. The data include 35 common comorbidities for patients who do or do not have stomach cancer. Use both logistic and ordinary regression to analyze these data and report how the findings differ. In particular:

  1. Using logistic regression, calculate the propensity to have cancer. 
  2. Group the diagnoses using SQL.  Within the naturally occurring groups of diagnoses, calculate probability of cancer.  Calculate the logit of the probability.  Regress the logit function on the diagnoses using ordinary regression. SQL►

Report the coefficients for the comorbidities of stomach cancer. How do these coefficients change across the two methods?   Data►
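
The second approach (group, take the logit, then run ordinary regression) can be sketched as below. This is a toy illustration, not the linked SQL solution: the table `patients`, the two diagnosis columns `dx1` and `dx2`, and all counts are hypothetical, and the real data have 35 comorbidities. Groups whose cancer rate is exactly 0 or 1 are dropped because their logit is undefined, and the regression here is unweighted, which is one of several reasonable choices.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (dx1 INTEGER, dx2 INTEGER, cancer INTEGER)")

# Hypothetical counts per diagnosis pattern: (dx1, dx2, patients, cancer cases)
counts = [(0, 0, 100, 10), (1, 0, 100, 20), (0, 1, 100, 25), (1, 1, 70, 30)]
for dx1, dx2, n, cases in counts:
    rows = [(dx1, dx2, 1)] * cases + [(dx1, dx2, 0)] * (n - cases)
    conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)

# Group patients by diagnosis pattern and compute the probability of cancer in each group.
groups = conn.execute("""
    SELECT dx1, dx2, AVG(cancer) AS p
    FROM patients
    GROUP BY dx1, dx2
    HAVING AVG(cancer) > 0 AND AVG(cancer) < 1
    ORDER BY dx1, dx2
""").fetchall()

# Design matrix (intercept, dx1, dx2) and response (logit of the group probability).
X = [[1.0, dx1, dx2] for dx1, dx2, _ in groups]
y = [math.log(p / (1.0 - p)) for _, _, p in groups]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

# Ordinary least squares via the normal equations (X'X) b = X'y.
k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
intercept, b_dx1, b_dx2 = solve(xtx, xty)

print(round(intercept, 3), round(b_dx1, 3), round(b_dx2, 3))
```

The toy counts were chosen so that the group logits are exactly additive, so each slope equals the log odds ratio for its diagnosis. With real data the grouped-logit coefficients and the logistic regression coefficients will generally differ, which is the comparison the question asks you to report.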

Question 5: For this assignment you can use R or any other statistical software you are familiar with. You can also use MatchIt or other software designed for propensity scoring. The objective is to find the response to citalopram for patients with different types of depression. These data come from the STAR*D experiment conducted by NIMH.

  1. Read about the study protocol. Protocol►
  2. Download the data. Use the instructor's last name as the password. You must enter the password twice. Data 2010► Data 2003►
  3. Summarize the data. Describe different diagnoses that co-occur with depression.
  4. Select the patient's medical history and predict receipt of citalopram.
  5. Balance the data to remove the effects of other types of co-occurring mental health diagnoses on receiving citalopram. Show visually that the propensity scoring has removed the effects of medical history on the selection of citalopram.
  6. Estimate the response to citalopram.
  7. Describe how well the model predicts response to citalopram.
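
A common number to plot when checking balance is the standardized difference of each covariate between the treated and untreated groups, before and after weighting. The sketch below is a minimal illustration in plain Python, not the TWANG output: the records are invented, there is one binary history variable (a hypothetical anxiety flag), and the propensity is the observed share receiving citalopram within each history group.

```python
import math

# Hypothetical records: (co-occurring anxiety diagnosis, received citalopram)
records = [(1, 1)] * 30 + [(0, 1)] * 20 + [(1, 0)] * 10 + [(0, 0)] * 40

def propensity(anxiety):
    """Share of patients with this history who received citalopram."""
    treated = sum(1 for a, t in records if a == anxiety and t == 1)
    total = sum(1 for a, t in records if a == anxiety)
    return treated / total

def weighted_prevalence(treated_flag, use_weights):
    """(Weighted) share of one treatment group that has the anxiety diagnosis."""
    num = den = 0.0
    for anxiety, treated in records:
        if treated != treated_flag:
            continue
        w = 1.0
        if use_weights:
            p = propensity(anxiety)
            w = 1.0 / p if treated == 1 else 1.0 / (1.0 - p)
        num += w * anxiety
        den += w
    return num / den

def standardized_difference(use_weights):
    """Difference in prevalence divided by the pooled standard deviation."""
    p1 = weighted_prevalence(1, use_weights)
    p0 = weighted_prevalence(0, use_weights)
    pooled_sd = math.sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
    return (p1 - p0) / pooled_sd

before = standardized_difference(False)
after = standardized_difference(True)
print(f"standardized difference before weighting: {before:.3f}")
print(f"standardized difference after weighting:  {after:.3f}")
```

Plotting these two values for every history variable, side by side, gives one visual check of balance: the values after weighting should sit close to zero.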

Solutions can be obtained using different software. Enclosed are examples using TWANG and R code. R code►  Lavanya's Tutorial Part 1► Lavanya's Tutorial Part 2►

Question 6: The following problem was first created by Morgan and Harding, and we have adapted it to health care. In this example, the outcome is length of stay in the hospital, the treatment is the clinician versus his peer group, and the strata are a mix of medical history and demographic variables that account for the pattern of self-selection into treatment. This mix has been divided into 3 strata: low, medium, and high risk. What is the impact of the clinician on length of stay, after removing confounding associated with the severity of the patients' illness? 

Probability
Strata   Untreated   Treated   Total
Low      0.36        0.08      0.44
Med      0.12        0.12      0.24
High     0.12        0.20      0.32
Total    0.60        0.40      1

Length of Stay
Strata   Untreated   Treated   Net Impact
Low      2           4         2
Med      6           8         2
High     10          14        4
Solution by Morgan and Harding Read► Answer►
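
The arithmetic behind a stratified answer can be sketched as follows, using only the numbers in the table above. This is our illustration of one standard approach (weight each stratum's net impact by the stratum's share of all patients), not a copy of the linked answer; the variable names are ours.

```python
# Values copied from the table above: joint probability of stratum and
# treatment status, and mean length of stay (LOS) in days.
strata = {
    "Low":  {"p_untreated": 0.36, "p_treated": 0.08, "los_untreated": 2,  "los_treated": 4},
    "Med":  {"p_untreated": 0.12, "p_treated": 0.12, "los_untreated": 6,  "los_treated": 8},
    "High": {"p_untreated": 0.12, "p_treated": 0.20, "los_untreated": 10, "los_treated": 14},
}

p_treated = sum(s["p_treated"] for s in strata.values())
p_untreated = sum(s["p_untreated"] for s in strata.values())

# Naive comparison: mean LOS among treated minus mean LOS among untreated,
# ignoring the fact that treated patients are more often high risk.
mean_treated = sum(s["p_treated"] * s["los_treated"] for s in strata.values()) / p_treated
mean_untreated = sum(s["p_untreated"] * s["los_untreated"] for s in strata.values()) / p_untreated
naive_difference = mean_treated - mean_untreated

# Stratified estimate: within-stratum difference weighted by the stratum's total share.
stratified_effect = sum(
    (s["p_untreated"] + s["p_treated"]) * (s["los_treated"] - s["los_untreated"])
    for s in strata.values()
)

print(f"naive difference:  {naive_difference:.2f} days")
print(f"stratified effect: {stratified_effect:.2f} days")
```

With the table's values the naive difference is 5.8 days, while the stratified estimate is 0.44 × 2 + 0.24 × 2 + 0.32 × 4 = 2.64 days. The gap between the two reflects the confounding by severity that the question asks you to remove.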


For additional information (not part of the required reading), please see the following links:

  1. Regression using R Read►
  2. Statistical learning with R Read►
  3. Open introduction to statistics Read►

This page is part of the course on Comparative Effectiveness by Farrokh Alemi PhD Home►  Email►