Generated by ChatGPT
Overview
-
Session overview
YouTube►
-
In this module, you'll master the art of analyzing the
association of variables with binary data, a crucial skill in fields
like healthcare, marketing, and social sciences. You'll learn to
verify the assumptions of logistic regression, ensuring the validity
of your models. You will tackle real-world massive data, with all of
its imperfections.
Learning Objectives
- Analyze association of variables with binary data
- Verify assumptions of logistic regression
Lecture
Indicates
content, image, or video made with assistance from AI systems
Assignments
Assignments should be submitted in Blackboard. Include a summary page. In the summary page,
write statements comparing your work to answers given or videos. For example, "I got the same answers as the Teach One video for question 1."
Or you can write: "There was no answer sheet available for question 2." We prefer that assignments are done in R.
Question 1: Use the following corpus of training data. Classify if the target sentence is a complaint. The corpus is organized
as in the following table. The comment ID shows the comment in the training data. In the following table, 6 comments in the training set are displayed.
The columns on the right of the table show where in the training comment the words from the target comment appears.
For example, in the training comment 57685 the word "patient" in the target comment is the third word in the training comment.
comment Id |
Type Id |
Classification True = Complaint False = Praise |
loves |
patients |
tell |
about |
not |
money |
57685 |
1 |
TRUE |
0 |
3 |
0 |
2 |
0 |
0 |
57688 |
1 |
TRUE |
0 |
0 |
1 |
0 |
0 |
0 |
57703 |
1 |
TRUE |
0 |
0 |
0 |
0 |
0 |
9 |
57704 |
1 |
TRUE |
0 |
0 |
0 |
3 |
0 |
0 |
57711 |
1 |
FALSE |
0 |
8 |
0 |
0 |
0 |
0 |
57712 |
1 |
TRUE |
0 |
0 |
0 |
0 |
2 |
0 |
In the following, calculate predicted value of a logistic regression
using the following formula:
- Regress the classification labels in the training set on the words, pair of consecutive words, and triplets of consecutive words in the target
sentence: "He loves his patients and I can tell it's about us and not the
money." Use the predicted probability of complaint to classify the target sentence. Values above 0.5
should be classified as complaints.
- Regress the classification labels in the training set on the
words, pair of words, triplet of consecutive words in the target
sentence "However, I am not happy with rhinoplasty revision results."
Use the predicted probability of complaint to classify the target
sentence. Values above 0.5 should be classified as complaints.
- Repeat the analysis but this time include all of the
complaints and 50% random sample of praises in the training data set.
How did the sampling procedure affect the McFadden R-square
Resources:
- Labeled training data set for: "He loves his patients and I can
tell it's about us and not the money."
Download►
- Labeled training data set for: "However, I am not happy with rhinoplasty revision results."
Download►
- How to predict response variable in logistic regression
ChatGPT►
- How to drop variables that are perfectly correlated?
R Code►
- Full Corpus (needed for analysis of other target comments)
Download►
Preprocessing
ChatGPT►
- Vladimir Cardenas's
Answer►
R Code►
- Regina Reyes's Teach One
on "However, I am happy with rhinoplasty revision results."
Slides►
YouTube►
- Sravya's Teach One on "He loves his patients and I can tell it's about us and not the money."
Slides►
YouTube►
Question 2: Regress survival in next 6 months on disabilities of the patients, age of patients, gender of patients and
whether they participated in the medical foster home program. MFH is an intervention for nursing home patients. In this program, nursing
home patients are diverted to a community home and health care services are delivered within the community home. The resident eats with the
family and relies on the family members for socialization, food and comfort. It is called "foster" home because the family previously
living in the community home is supposed to act like the resident's family. Enrollment in MFH is indicated by a variable MFH=1.
Survival is reported in two variables. One variable indicates survival in 6 months. Another reports days known to survive,
if the patient has died and otherwise null. Thus a null value in this latter variable indicates the patient did not die.
The functional disabilities are probabilities that the patient has the disability. These probabilities are generated from the CCS diagnoses and demographics of the person.
Use long term disabilities. These are the disabilities with suffix 365. If the disability is higher than 0.5, then assume the person is disabled.
- Clean the data. Convert the disabilities to binary variables. Convert the age to decades
- Create a regression model to explain the relationship among the variables and survival.
- List the top 4 predictors of survival (list these predictors using English language and not coded data).
- Describe, in English, if the MFH program contributes to survival. Provide the evidence for your claim.
Resources:
Question 3: Predict
from age, gender, symptoms, home test results the
PCR test results for COVID-19.
- Build a model that includes only "home test results" as
independent variable. Report the percent of variation explained
- Build a model that includes age and gender, interaction of age and
gender, and home test results as independent variables. Report the percent of variation explained.
- Build a model that includes age and gender, interaction of age and
gender, home test results, and symptoms as independent variables. Report the percent of variation explained
- Build a model that includes includes age, gender, interaction of
age and gender, symptoms, home test, and pairs of symptoms, as independent variables. Report the percent of variation explained
- What is the most accurate way of diagnosing COVID-19 at home prior
to triage to clinics?
- Can a clinician learn to make these diagnoses or is the number of
adjustments needed beyond human capabilities?
The following resources may be helpful:
More
This page is part of the HAP 819 course on Advanced Statistics by Farrokh Alemi PhD
Home►
Email►
|