# Regression through Search

## Learning Objectives

• Analyze data using search for key cases within the database
• Detect interactions among variables
• Estimate regression parameters for massive data without matrix manipulations

## Assignments

Question 1: Estimate the regression equations from the following plot of the data across strata. On the X axis, each number refers to a unique combination of diseases in the patient's medical history, so the X axis captures the patient's prognosis based on medical history. The Y axis shows the probability of mortality. Three lines are plotted. The line with diamonds shows the probability of mortality for each stratum on the X axis. The line with squares shows the probability of mortality for the combination of each stratum and stomach cancer. The dashed line shows the average probability of mortality for patients with stomach cancer, across all strata.

Answer the following questions:

1. What are the approximate odds of mortality from stomach cancer? This estimate includes the average effect of comorbidities and cancer.
2. For patients without cancer, which set of comorbidities has the highest risk of mortality?
3. Does mortality from stomach cancer depend on the medical history (comorbidities) of the patient?
4. In stratum 7, what is the impact of stomach cancer on mortality risk?
5. In stratum 20, what is the impact of stomach cancer on mortality risk?
6. If we regress 6-month mortality on (a) stomach cancer and (b) combinations of comorbidities of patients, what is the approximate coefficient of stomach cancer in the regression equation? To answer this question, calculate the approximate change in mortality for a change in the occurrence of stomach cancer across all 31 strata.
7. What is the coefficient associated with the combination of cancer and medical history in stratum 31?
8. In which stratum does cancer add the most to the risk of mortality?
9. If we construct a linear regression model consisting of two variables, cancer and the prognosis of patients associated with the comorbidities captured in the strata, in which strata are we likely to have the highest residuals?
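
The arithmetic behind items 4 through 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `strata` values below are invented and are not read from the plot, the function names are hypothetical, and every stratum is given equal weight. The sketch takes the gap between the two lines in each stratum as the effect of cancer, averages those gaps to approximate the cancer coefficient, and flags the strata whose gap departs most from the average, which is where a two-variable additive model would be expected to fit worst.

```python
# Hypothetical stratum-level mortality probabilities (NOT the assignment's data).
# Each tuple: (stratum id, P(death | no cancer), P(death | stomach cancer)).
strata = [
    (1, 0.10, 0.30),
    (2, 0.20, 0.45),
    (3, 0.40, 0.50),
    (4, 0.60, 0.90),
]


def cancer_effects(rows):
    """Gap between the cancer and no-cancer lines in each stratum."""
    return {s: p_cancer - p_none for s, p_none, p_cancer in rows}


def average_effect(rows):
    """Equal-weight average of the stratum gaps: a rough cancer coefficient."""
    effects = cancer_effects(rows)
    return sum(effects.values()) / len(effects)


def effect_deviations(rows):
    """How far each stratum's gap sits from the average gap."""
    avg = average_effect(rows)
    return {s: e - avg for s, e in cancer_effects(rows).items()}


if __name__ == "__main__":
    effects = cancer_effects(strata)
    deviations = effect_deviations(strata)
    print("approximate cancer coefficient:", round(average_effect(strata), 4))
    print("stratum where cancer adds most:", max(effects, key=effects.get))
    print("stratum with largest deviation:",
          max(deviations, key=lambda s: abs(deviations[s])))
```

If strata differ greatly in size, a weighted average of the gaps would be the more natural choice; equal weights are used here only to keep the sketch short.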

Resources for Corner Cases Question 1:

Question 2: In the following, Y indicates the logit of the probability of an outcome. It is regressed on X1 through X4. All variables are binary. All independent variables are monotone, and all are related to Y. Y values are standardized so that when all independent variables are absent the logit of Y is 0, and when all are present the logit of Y is 1. Using the technique of searching the data to construct approximate regression equations, answer the following questions:

1. What is the coefficient of X1 in the regression of Y on X1 through X4?
2. What is the coefficient of X1X2 in the regression of logit of Y on X1 through X4 plus all possible interactions among independent variables?
3. To get an exact specification of the regression coefficient for X1, what do we need to do?
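
One way to read the search technique for binary predictors is to look up a few key cases and difference them. The Python sketch below assumes that reading; the `logit_y` table is invented for illustration and the function names are hypothetical. It approximates the coefficient of a variable by comparing the case where only that variable is present with the case where all variables are absent, and it approximates a pairwise interaction as whatever remains once both main effects are subtracted from the case where only the two variables are present. With 0/1 coding these differences are exact for a model that includes every interaction, and only approximate for a main-effects model.

```python
# Hypothetical logit(Y) values keyed by (X1, X2, X3, X4); invented for illustration.
# Only the cases needed by the two look-ups below are listed.
logit_y = {
    (0, 0, 0, 0): 0.0,   # all variables absent (standardized to 0)
    (1, 0, 0, 0): 0.2,   # only X1 present
    (0, 1, 0, 0): 0.3,   # only X2 present
    (1, 1, 0, 0): 0.6,   # only X1 and X2 present
    (1, 1, 1, 1): 1.0,   # all variables present (standardized to 1)
}


def main_effect(table, index, n_vars=4):
    """Logit when only variable `index` is present minus logit when none are."""
    base = (0,) * n_vars
    only = tuple(1 if k == index else 0 for k in range(n_vars))
    return table[only] - table[base]


def pairwise_interaction(table, i, j, n_vars=4):
    """Joint case minus baseline minus the two single-variable effects."""
    base = (0,) * n_vars
    both = tuple(1 if k in (i, j) else 0 for k in range(n_vars))
    return (table[both] - table[base]
            - main_effect(table, i, n_vars)
            - main_effect(table, j, n_vars))


if __name__ == "__main__":
    print("X1 coefficient (approx.):", round(main_effect(logit_y, 0), 4))
    print("X1X2 coefficient (approx.):",
          round(pairwise_interaction(logit_y, 0, 1), 4))
```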

Resources for Corner Case Question 2:

Question 3: In the following, we regress Y on five binary variables that are positively related to Y.

1. Using the search technique, what is the coefficient of variable A?
2. Using regression with no intercept, what is the coefficient of variable A?
3. Create an interaction plot for variable A.
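
An interaction plot for variable A shows the mean of Y at each level of A, with a separate line for each level of a second variable. The Python sketch below computes the points such a plot would display. It is a minimal illustration: the `records` are invented, only one other variable (called B here) is considered, and the function names are hypothetical. If the effect of A is about the same at both levels of B, the lines are roughly parallel; clearly different effects suggest an interaction.

```python
from collections import defaultdict

# Hypothetical records (A, B, Y); invented for illustration only.
records = [
    (0, 0, 0.10), (0, 0, 0.14),
    (1, 0, 0.30), (1, 0, 0.34),
    (0, 1, 0.20), (0, 1, 0.24),
    (1, 1, 0.70), (1, 1, 0.74),
]


def cell_means(rows):
    """Mean of Y for every observed (A, B) combination."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of Y, count]
    for a, b, y in rows:
        totals[(a, b)][0] += y
        totals[(a, b)][1] += 1
    return {cell: total / n for cell, (total, n) in totals.items()}


def effect_of_a_by_b(rows):
    """Change in mean Y when A goes from 0 to 1, at each level of B."""
    means = cell_means(rows)
    return {b: means[(1, b)] - means[(0, b)] for b in (0, 1)}


if __name__ == "__main__":
    for cell, mean in sorted(cell_means(records).items()):
        print("A=%d, B=%d: mean Y = %.2f" % (cell[0], cell[1], mean))
    print("effect of A by level of B:", effect_of_a_by_b(records))
```

The cell means can be passed to any plotting tool, with A on the horizontal axis and one line per level of B.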

Resources for Question 3