Basic data

HAP 719: Advanced Statistics I

Regression through Search  

Regression replaced with search in the database
Generated by ChatGPT

Overview

In this module, you will learn to analyze data by searching for key cases within a database, a technique for identifying patterns and trends. You will learn to detect interactions among variables, which helps you understand the relationships within your data. You will also learn to estimate regression parameters for massive datasets without relying on matrix manipulations, which makes large-scale data easier to handle. These skills apply directly to real-world data problems.

Learning Objectives

  • Analyze data by searching for key cases within the data
  • Detect interactions among variables
  • Estimate regression parameters for massive data without matrix manipulations

Lecture


Assignments

Question 1: In the following, estimate the regression equations from a plot of the data across strata. On the X axis, each number refers to a unique combination of diseases in the patient's medical history, so the X axis captures the patient's prognosis based on medical history. The Y axis shows the probability of mortality. Three lines are plotted. The line with diamonds shows the probability of mortality for each stratum on the X axis. The line with squares shows the probability of mortality for the combination of each stratum and stomach cancer. The dashed line shows the average probability of mortality with stomach cancer across all strata.

Stomach Cancer Corner Case 

Answer the following questions:

  1. What are the approximate odds of mortality from stomach cancer? This estimate includes the average effect of comorbidities and cancer.
  2. For patients without cancer, which set of comorbidities has the highest risk of mortality?
  3. Does mortality from stomach cancer depend on the medical history (comorbidities) of the patient?
  4. In stratum 7, what is the impact of stomach cancer on mortality risk?
  5. In stratum 20, what is the impact of stomach cancer on mortality risk?
  6. If we regress 6-month mortality on (a) stomach cancer and (b) combinations of comorbidities of patients, what is the approximate coefficient of stomach cancer in the regression equation? To answer this question, calculate the approximate change in mortality for a change in the occurrence of stomach cancer across all 31 strata.
  7. What is the coefficient associated with the combination of cancer and medical history in stratum 31?
  8. In which stratum does cancer add the most to the risk of mortality?
  9. If we construct a linear regression model consisting of two variables, cancer and the prognosis of patients associated with comorbidities captured in the strata, in which strata are we likely to have the highest residuals?
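The arithmetic behind these questions can be sketched in a few lines of Python. This is only an illustration: the four strata and their probabilities below are hypothetical values, not numbers read from the plot, and the simple average assumes all strata are equally sized.

```python
# Hypothetical probabilities of mortality per stratum (NOT the values in the plot).
p_without_cancer = {1: 0.10, 2: 0.20, 3: 0.30, 4: 0.40}  # line with diamonds
p_with_cancer = {1: 0.30, 2: 0.45, 3: 0.50, 4: 0.90}     # line with squares


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Impact of cancer within each stratum: vertical gap between the two lines.
impact = {s: p_with_cancer[s] - p_without_cancer[s] for s in p_without_cancer}

# Approximate coefficient of cancer: average gap across strata
# (assumes equally sized strata).
avg_impact = sum(impact.values()) / len(impact)

# Stratum where cancer adds the most to the risk of mortality.
largest_stratum = max(impact, key=impact.get)

# An additive model assumes the same cancer effect everywhere, so strata whose
# gap is far from the average gap are where large residuals are expected.
deviation = {s: impact[s] - avg_impact for s in impact}

# Odds of mortality with cancer, averaged over strata (the dashed line).
avg_p_with_cancer = sum(p_with_cancer.values()) / len(p_with_cancer)
avg_odds_with_cancer = odds(avg_p_with_cancer)

print(impact, avg_impact, largest_stratum, avg_odds_with_cancer)
```

To use the sketch with the assignment, replace the two dictionaries with the probabilities you read from the plot for all 31 strata.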

Resources for Corner Cases Question 1:

Question 2: In the following data, Y is a binary outcome.  A, B, C, D, and E are five binary independent variables that predict Y. 

  1. Calculate the probability of Y for different combinations of the independent variables.
  2. Construct an interaction plot.
  3. Identify interactions in the data and list them.
  4. Show how much accuracy is gained by including interactions in a logistic regression of Y on the variables A through E.
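One way to approach the first and third steps is to group cases by their combination of predictors and compare the effect of one variable at each level of another. The sketch below is a minimal illustration using only two predictors and a made-up dataset; the helper names `prob_by_combination` and `make_rows` are hypothetical, and the counts are not from the assignment data.

```python
from collections import defaultdict


def prob_by_combination(rows, predictors, outcome="Y"):
    """Return P(outcome = 1) for every observed combination of predictors."""
    totals = defaultdict(int)
    events = defaultdict(int)
    for row in rows:
        key = tuple(row[name] for name in predictors)
        totals[key] += 1
        events[key] += row[outcome]
    return {key: events[key] / totals[key] for key in totals}


def make_rows(a, b, n, n_events):
    """Build n hypothetical cases with the given A and B, n_events of them with Y = 1."""
    return [{"A": a, "B": b, "Y": 1 if i < n_events else 0} for i in range(n)]


# Made-up data: 10 cases per combination of A and B.
rows = (
    make_rows(0, 0, 10, 1)
    + make_rows(1, 0, 10, 2)
    + make_rows(0, 1, 10, 2)
    + make_rows(1, 1, 10, 8)
)

probs = prob_by_combination(rows, ["A", "B"])

# Effect of A when B is absent and when B is present.
effect_a_when_b0 = probs[(1, 0)] - probs[(0, 0)]
effect_a_when_b1 = probs[(1, 1)] - probs[(0, 1)]

# If the effect of A changes with B, the two lines in an interaction plot
# are not parallel, which suggests an A-by-B interaction.
interaction_ab = effect_a_when_b1 - effect_a_when_b0

print(probs)
print(effect_a_when_b0, effect_a_when_b1, interaction_ab)
```

For the assignment, pass all five predictor names to `prob_by_combination` and repeat the comparison for each pair of variables.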

Resources for Question 2:

Question 3: In the following, Y indicates the logit of the probability of an outcome. It is regressed on X1 through X4. All variables are binary. All independent variables are monotone. All are related to Y. Y values are standardized so that when all independent variables are absent the logit of Y is 0, and when all are present the logit of Y is 1. Using the technique of searching the data to construct approximate regression equations, answer the following questions:

  1. What is the coefficient of X1 in the regression of Y on X1 through X4?
  2. What is the coefficient of X1X2 in the regression of logit of Y on X1 through X4 plus all possible interactions among independent variables?
  3. To get an exact specification of the regression coefficient for X1, what do we need to do?
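When all predictors are binary and the model includes every interaction, coefficients can be recovered by looking up a few corner cases rather than fitting a model. The sketch below illustrates that lookup under stated assumptions: the table of logit values is hypothetical, the function names are invented for this example, and it may differ in detail from the search procedure taught in the lecture.

```python
# Hypothetical logit of Y for selected combinations of (X1, X2, X3, X4).
# Standardized so that all-absent is 0 and all-present is 1.
logit_y = {
    (0, 0, 0, 0): 0.0,
    (1, 0, 0, 0): 0.2,
    (0, 1, 0, 0): 0.1,
    (1, 1, 0, 0): 0.5,
    (1, 1, 1, 1): 1.0,
}


def corner(present, n_vars=4):
    """Build the key for the case where only the listed variables are present."""
    return tuple(1 if i in present else 0 for i in range(n_vars))


def main_effect(table, i, n_vars=4):
    """Change in logit when variable i is present and all others are absent."""
    return table[corner({i}, n_vars)] - table[corner(set(), n_vars)]


def pairwise_interaction(table, i, j, n_vars=4):
    """Joint effect of i and j beyond the sum of their separate effects."""
    return (
        table[corner({i, j}, n_vars)]
        - table[corner({i}, n_vars)]
        - table[corner({j}, n_vars)]
        + table[corner(set(), n_vars)]
    )


coef_x1 = main_effect(logit_y, 0)                 # X1 is index 0
coef_x1x2 = pairwise_interaction(logit_y, 0, 1)   # X1 and X2

print(coef_x1, coef_x1x2)
```

In a model without interaction terms, the corner-case difference is only an approximation of the coefficient, because the regression averages the effect of X1 over all combinations of the other variables.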

Resources for Question 3:

Question 4: In the following, we regress Y on five binary variables that are positively related to Y.

  1. Using the search technique, what is the coefficient of variable A?
  2. Using regression with no intercept, what is the coefficient of variable A?
  3. Create an interaction plot for variable A.
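To compare a search-based estimate with a no-intercept regression, it helps to see both computed on the same small dataset. The sketch below is an illustration only: the data are hypothetical, with two binary predictors instead of five, and the least-squares fit solves the normal equations with a hand-written elimination routine (`solve` and `fit_no_intercept` are names invented here). In practice a statistics package would do the fitting.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_no_intercept(x_rows, y):
    """Least-squares coefficients for y = b1*x1 + ... + bk*xk (no intercept)."""
    k = len(x_rows[0])
    xtx = [[sum(row[i] * row[j] for row in x_rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(x_rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: columns are A and B; Y = 2*A + 1*B with no interaction.
x_rows = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0.0, 2.0, 1.0, 3.0]

# Search estimate for A: the case with only A present minus the case with nothing present.
lookup = {tuple(row): yi for row, yi in zip(x_rows, y)}
search_coef_a = lookup[(1, 0)] - lookup[(0, 0)]

# Regression estimate for A from the no-intercept fit.
regression_coefs = fit_no_intercept(x_rows, y)

print(search_coef_a, regression_coefs)
```

The two estimates agree here because the made-up data contain no interaction; when A interacts with other variables, the search estimate and the regression coefficient can differ.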

Resources for Question 4:

  • Data Download►
  • How to estimate coefficients of a variable through search Slides►
  • Shreya Prasanna's answers YouTube►
  • Interaction plot for variable A before sorting:

Interaction plot

More

For additional information (not part of the required reading), please see the following links:

  1. Open introduction to statistics Read►

This page is part of the HAP 819 course on Advanced Statistics by Farrokh Alemi PhD Home► Email►