HAP 719: Advanced Statistics I

Regression through Search

Regression replaced with search in the database
Generated by ChatGPT

Overview

In this module, you'll learn to analyze data by searching for key cases within a database, a technique for identifying patterns and trends. You'll learn to detect interactions among variables, which shows where the effect of one variable depends on the values of the others. You'll also learn to estimate regression parameters for massive datasets without relying on matrix manipulations, which matters for handling large-scale data efficiently.

Learning Objectives

Analyze data by searching for key cases within the data
Detect interactions among variables
Estimate regression parameters for massive data without matrix manipulations

Lecture

Interaction plots Slides► Video►
Equation adjustments based on interaction plots Slides► Video►
Estimating regression coefficients through search Slides► Video► YouTube►
Yili Lin on constructing multiplicative models through corner cases Slides► Video►

Assignments

Question 1: In the following, estimate the regression equations from a plot of the data across strata. On the X axis, each number refers to a unique combination of diseases in the patient's medical history, so the X axis captures the patient's prognosis based on medical history. The Y axis shows the probability of mortality. Three lines are plotted. The line with diamonds shows the probability of mortality for each of the strata on the X axis. The line with squares shows the probability of mortality for the combination of each stratum and stomach cancer. The dashed line shows the average probability of mortality for stomach cancer across all strata.

Stomach Cancer Corner Case

Answer the following questions:

What are the approximate odds of mortality from stomach cancer? This estimate includes the average effect of comorbidities and cancer.
For patients without cancer, which set of comorbidities has the highest risk of mortality?
Does mortality from stomach cancer depend on medical history (comorbidities) of the patient?
In stratum 7, what is the impact of stomach cancer on mortality risk?
In stratum 20, what is the impact of stomach cancer on mortality risk?
If we regress 6-month mortality on (a) stomach cancer and (b) combinations of comorbidities of patients, what is the approximate coefficient of stomach cancer in the regression equation? To answer this question, calculate the approximate change in mortality for a change in the occurrence of stomach cancer across all 31 strata.
What is the coefficient associated with the combination of cancer and medical history in stratum 31?
In which stratum does cancer add the most to the risk of mortality?
If we construct a linear regression model consisting of two variables, cancer and the prognosis of patients associated with comorbidities captured in the strata, in which strata are we likely to have the highest residuals?
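
The comparisons these questions ask for can be sketched in a few lines of Python. This is only an illustration: the probabilities are made-up placeholders for four strata, not values read from the course plot or data file, and the names `p_no_cancer`, `p_cancer`, and `odds` are hypothetical.

```python
# Hypothetical mortality probabilities for four strata (placeholders,
# NOT values read from the course plot or data file).
p_no_cancer = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40}   # line with diamonds
p_cancer = {1: 0.30, 2: 0.35, 3: 0.50, 4: 0.90}      # line with squares


def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


# Impact of cancer within each stratum: the gap between the two lines.
impact = {s: p_cancer[s] - p_no_cancer[s] for s in p_no_cancer}

# Rough main-effect coefficient: the average gap across all strata.
avg_impact = sum(impact.values()) / len(impact)

# Stratum where cancer adds the most to mortality risk.
worst = max(impact, key=impact.get)

# How far each stratum's gap departs from the average gap; large
# departures suggest cancer interacts with that medical history.
departure = {s: impact[s] - avg_impact for s in impact}

print(impact, round(avg_impact, 3), worst)
```

In this sketch, strata whose gap is far from the average gap are also the ones where a model with only the two main effects would be expected to leave the largest residuals.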

Resources for Corner Cases Question 1:

Download data Excel►
Divya Bhavanam's Teach One YouTube►

Question 2: In the following data, Y is a binary outcome. A, B, C, D, and E are five binary independent variables that predict Y.

Calculate the probability of Y for different combinations of the independent variables.
Construct an interaction plot
Identify interactions in the data and list them.
Show how much accuracy is gained by including interactions in a logistic regression of Y on the variables A through E.
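
As a rough sketch of the first three steps, the Python below computes the observed probability of Y for each combination of predictors and then builds the two series of an interaction plot for one variable. It assumes a tiny made-up data set with only two predictors rather than the course CSV, and the function names are hypothetical. It does not cover the logistic regression step.

```python
from collections import defaultdict

# Tiny made-up data set (NOT the course CSV). The assignment has five
# predictors, A through E; two are used here to keep the sketch short.
PREDICTORS = ["A", "B"]
rows = [
    {"A": 0, "B": 0, "Y": 0}, {"A": 0, "B": 0, "Y": 0},
    {"A": 1, "B": 0, "Y": 0}, {"A": 1, "B": 0, "Y": 0},
    {"A": 0, "B": 1, "Y": 0}, {"A": 0, "B": 1, "Y": 1},
    {"A": 1, "B": 1, "Y": 1}, {"A": 1, "B": 1, "Y": 1},
]


def probability_by_combination(data, predictors):
    """Return {combination tuple: observed P(Y=1)} for each combination."""
    counts = defaultdict(lambda: [0, 0])  # combination -> [sum of Y, n]
    for r in data:
        key = tuple(r[p] for p in predictors)
        counts[key][0] += r["Y"]
        counts[key][1] += 1
    return {k: s / n for k, (s, n) in counts.items()}


def interaction_series(probs, predictors, var):
    """Pair P(Y) at var=0 and var=1 within each stratum of the other predictors."""
    i = predictors.index(var)
    series = {}
    for key, p in probs.items():
        stratum = key[:i] + key[i + 1:]
        series.setdefault(stratum, {})[key[i]] = p
    # Keep strata observed at both levels, sorted by P(Y) when var is absent.
    paired = [(s, v[0], v[1]) for s, v in series.items() if 0 in v and 1 in v]
    return sorted(paired, key=lambda t: t[1])


probs = probability_by_combination(rows, PREDICTORS)
series_a = interaction_series(probs, PREDICTORS, "A")
for stratum, p0, p1 in series_a:
    # A gap (p1 - p0) that changes across strata points to an interaction.
    print(stratum, p0, p1, p1 - p0)
```

With these placeholder rows, the gap for A is 0 when B is absent and 0.5 when B is present, so the two lines of the plot are not parallel and A appears to interact with B.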

Resources for Question 2:

AI-guided solution Prompt►
Download data CSV►

Question 3: In the following, Y indicates the logit of the probability of an outcome. It is regressed on X1 through X4. All variables are binary. All independent variables are monotone, and all are related to Y. Y values are standardized so that the logit of Y is 0 when all independent variables are absent and 1 when all are present. Using the technique of searching the data to construct approximate regression equations, answer the following questions:

What is the coefficient of X1 in the regression of Y on X1 through X4?
What is the coefficient of X1X2 in the regression of logit of Y on X1 through X4 plus all possible interactions among independent variables?
To get an exact specification of the regression coefficient for X1, what do we need to do?
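
One way to read "searching the data" is to look up corner cases, meaning cases where only the variable of interest (or only a pair of variables) is present, and compare them with the case where nothing is present. The Python below is a minimal sketch of that idea under stated assumptions: the logit values are placeholders rather than the course CSV, the helper names are hypothetical, and the procedure taught in the lecture may differ in detail. In a regression that includes every possible interaction among binary variables, these differences equal the corresponding coefficients exactly.

```python
# Hypothetical logit(Y) values for a few combinations of (X1, X2, X3, X4).
# Placeholders only, NOT the course CSV.
logit_y = {
    (0, 0, 0, 0): 0.0,   # all absent: standardized to 0
    (1, 0, 0, 0): 0.2,   # corner case: only X1 present
    (0, 1, 0, 0): 0.1,   # corner case: only X2 present
    (1, 1, 0, 0): 0.5,   # X1 and X2 present, others absent
    (1, 1, 1, 1): 1.0,   # all present: standardized to 1
}
N_VARS = 4


def case(present):
    """Build the key for a case where only the listed variable indexes are 1."""
    return tuple(1 if i in present else 0 for i in range(N_VARS))


def main_effect(data, i):
    """Y when only variable i is present, minus Y when nothing is present."""
    return data[case({i})] - data[case(set())]


def pair_interaction(data, i, j):
    """What the pair adds beyond the two main effects (other variables absent)."""
    return (data[case({i, j})] - data[case({i})]
            - data[case({j})] + data[case(set())])


b1 = main_effect(logit_y, 0)            # search estimate for X1
b12 = pair_interaction(logit_y, 0, 1)   # search estimate for X1X2
print(round(b1, 3), round(b12, 3))
```

Each estimate needs only a handful of lookups, which is why this style of estimation avoids matrix manipulations even when the data set is very large.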

Resources for Question 3:

Download data CSV►

Question 4: In the following, we regress Y on five binary variables that are positively related to Y.

Using the search technique, what is the coefficient of the variable A?
Using regression with no intercept, what is the coefficient of the variable A?
Create an interaction plot for variable A
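
To see how the first two answers can be compared, the Python sketch below computes a coefficient for A both ways on a made-up data set with two predictors (not the course file). The search estimate is a corner-case lookup; the no-intercept regression solves the normal equations with plain Gauss-Jordan elimination. Function names are hypothetical, and the sketch assumes one row per combination of predictors.

```python
# Made-up data (NOT the course file): columns are A, B and outcome Y.
# Y is exactly 0.3*A + 0.2*B, so both methods should agree here.
data = [
    (0, 0, 0.0),
    (1, 0, 0.3),
    (0, 1, 0.2),
    (1, 1, 0.5),
]


def search_coefficient(rows, var_index):
    """Y with only this variable present, minus Y with none present.

    Assumes one row per combination of predictors.
    """
    n_vars = len(rows[0]) - 1
    lookup = {r[:n_vars]: r[n_vars] for r in rows}
    none = tuple(0 for _ in range(n_vars))
    only = tuple(1 if i == var_index else 0 for i in range(n_vars))
    return lookup[only] - lookup[none]


def no_intercept_ols(rows):
    """Least squares without an intercept, via the normal equations."""
    k = len(rows[0]) - 1
    # Build the augmented matrix [X'X | X'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * r[k] for r in rows)] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda row: abs(m[row][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for row in range(k):
            if row != c:
                f = m[row][c] / m[c][c]
                m[row] = [a - f * b for a, b in zip(m[row], m[c])]
    return [m[i][k] / m[i][i] for i in range(k)]


a_search = search_coefficient(data, 0)
a_ols = no_intercept_ols(data)[0]
print(round(a_search, 3), round(a_ols, 3))
```

The two estimates match here because the placeholder outcome is purely additive. When the data contain interactions, the no-intercept regression on main effects alone and the corner-case lookup would generally give different values for A.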

Resources for Question 4:

Data Download►
How to estimate coefficients of a variable through search Slides►
Shreya Prasanna's answers YouTube►
Interaction plot for variable A before sorting:

Interaction plot

For additional information (not part of the required reading), please see the following links:

Open introduction to statistics Read►

This page is part of the HAP 819 course on Advanced Statistics by Farrokh Alemi PhD Home► Email►