HAP 719: Advanced Statistics I

Analysis of Variance

Variation between and within groups
Image generated by ChatGPT

Overview

In this module, you will learn to differentiate between observational and randomized experimental designs in the context of ANOVA, and gain hands-on experience using one-way and two-way ANOVA to analyze data. You will also learn to interpret the statistical output these techniques produce, so that you can base decisions in your research and professional work on your analyses.

Objectives

After completing the activities in this module, you should be able to:

Differentiate between observational and randomized experimental designs in the context of ANOVA
Use one-way ANOVA
Use two-way ANOVA
Interpret statistical outputs as they relate to these techniques

Lecture


Sum of Squares Explained YouTube►
One way analysis of variance, with example calculations Slides► Video►
Two way analysis of variance, with example calculations Slides► Video► Excel►
Yili Lin on Analysis of variance using R Slides► Video►
Yili Lin on assumptions in Analysis of Variance using R Slides► Video►
Yili Lin on interpreting R output from Analysis of Variance Slides► Video Part 1► Video Part 2►

Assignments

Assignments are submitted on Blackboard and are graded pass/fail. Include a one-page summary as a Word document, stating whether you obtained the same answers as those provided. Submit your R, STATA, or Python code in separate files. You may help each other with the assignments, but you may not copy and paste the work of others. You may use ChatGPT or other large language models to generate the R code, but you must be transparent about it and report its use. Some assignments require large data sets, which will help you build data preparation skills, an essential part of statistical analysis.

Assignments are due by Sunday midnight. Peer teachers should submit Sunday prior to the lecture. All others should submit Sunday following the lecture. Late assignments will lose 20% of the grade.

Question 1: The data set for this assignment is a subsample of a larger data set, the 2011/2012 National Survey of Children’s Health (NSCH), which was conducted by the Centers for Disease Control and Prevention (CDC) and the National Center for Health Statistics. The survey included around 95,000 children between the ages of 0 and 17 years, and its purpose was to measure children’s health status, insurance coverage, parental health, and several other characteristics.

Construct a subsample that includes only children 3 to 5 years old. Create a data frame in R for the variables needed, and exclude variables that are not needed to answer the following questions. Address missing values. Check the distribution of the variables and whether transformations are necessary:

Summarize the data. The first step in any data analysis project is to examine descriptive statistics (that is, to assess the distribution) for each variable named in the research question you chose. Run descriptive statistics such as means, standard deviations, and counts with percentages for each variable (e.g., average number of preventive health services in the last 12 months, child’s race, etc.). Caution: First identify each variable as continuous or categorical, and then run the appropriate descriptive statistics. For example, do not compute the mean and standard deviation of child’s race, because it is categorical (it has two groups: white vs. non-white). Instead, report the count (n) and percent of children in each category. Graphs may also be used to describe some of the variables, but avoid redundancy: a table and a graph should not present the same information about the same variable.
Examine the effects of child’s race, poverty level (measured as difficulty covering basic food or housing), gender, consistency of health insurance coverage, adverse life experiences, and the number of times the child received preventive services in the last 12 months on the number of chronic health conditions (measured as the sum of binary comorbidities). These data are collected for 3-5 year old children. Are there statistically significant differences in the outcome variable across the levels of each independent variable? For example: Do white children have a significantly higher average number of chronic conditions than non-white children? Caution: If the dependent variable is continuous and the independent variable is categorical with two levels (groups), conduct a t-test, which compares the mean of the dependent variable between the two groups. If the independent variable is categorical with three or more groups and the dependent variable is continuous, use ANOVA. Conducting these bivariate analyses (between each independent variable and the dependent variable) before regression will help you understand how the variables relate to each other, which in turn can inform which variables should enter the regression model. Organize the results of these analyses in a second table and describe your findings in plain English.
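
The preparation and bivariate steps above can be sketched in Python, one of the languages accepted for this assignment. This is a minimal illustration under stated assumptions, not the course solution: the records, the field names (`age`, `race`, `n_conditions`), and the use of `None` to mark missing values are all hypothetical and do not come from the NSCH data dictionary. Dropping incomplete rows is only one of several ways to address missing values.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records; real NSCH variable names and missing-value codes differ.
records = [
    {"age": 2, "race": "white", "n_conditions": 0},
    {"age": 3, "race": "white", "n_conditions": 1},
    {"age": 4, "race": "non-white", "n_conditions": 2},
    {"age": 5, "race": "non-white", "n_conditions": None},  # missing outcome
    {"age": 5, "race": "white", "n_conditions": 3},
    {"age": 9, "race": "non-white", "n_conditions": 1},
]


def build_subsample(rows, min_age=3, max_age=5):
    """Keep complete rows for children aged min_age through max_age."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
        and min_age <= row["age"] <= max_age
    ]


def describe_continuous(values):
    """Mean and sample standard deviation for a continuous variable."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def describe_categorical(values):
    """Count and percent for each level of a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {level: (n, 100 * n / total) for level, n in counts.items()}


def choose_bivariate_test(group_labels):
    """Name the test for a continuous outcome across a categorical variable."""
    n_levels = len(set(group_labels))
    if n_levels < 2:
        raise ValueError("need at least two groups to compare")
    return "t-test" if n_levels == 2 else "one-way ANOVA"


subsample = build_subsample(records)
outcome_summary = describe_continuous([r["n_conditions"] for r in subsample])
race_summary = describe_categorical([r["race"] for r in subsample])
test_for_race = choose_bivariate_test([r["race"] for r in subsample])

print(outcome_summary)
print(race_summary)
print(test_for_race)
```

The helper only names the test; the test itself would be run with a statistics package of your choice.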

Resources for this question:

Background information Read►
2021 Data Download STATA► Import R-Code►
Data dictionary PDF►
The list of variable names and corresponding response options used in the survey are available: Screener Variables► Topical Variables►
Calculation of number of conditions More►
Vladimir Cardenas's Answer► R Code►(Password required)
Rosa Maria Prieto Choy's Teach One YouTube►
Abdul Nadeem's Teach One YouTube►

Question 2: Perform a one-way analysis of variance. A large percentage of people in the United States have high cholesterol. Physicians prescribe drugs to patients with high cholesterol to reduce their levels. A pharmaceutical company has developed three such drugs. To find out whether any statistically significant differences exist among the three drugs, a researcher at the company conducted an experiment. The researcher selected 60 men, each of whom had a cholesterol level over 285. She randomly assigned 20 men to each treatment group. The drugs were administered over a three-month period, and the reduction in cholesterol was recorded for each person.

Summarize the data. Provide correlation/scatter plot matrix.
Are there significant differences in cholesterol reduction due to drug type?
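
As a rough illustration of what a one-way ANOVA computes, the sketch below splits the total variation into between-group and within-group sums of squares and forms the F statistic. The numbers are hypothetical (three values per drug rather than the 20 men per group in the assignment data), and the function name `one_way_anova` is made up for this example.

```python
from statistics import mean

# Hypothetical cholesterol reductions; the assignment data has 20 men per drug.
groups = {
    "drug_a": [10, 12, 14],
    "drug_b": [20, 22, 24],
    "drug_c": [15, 17, 19],
}


def one_way_anova(samples):
    """Return sums of squares, degrees of freedom, and the F statistic."""
    all_values = [x for values in samples.values() for x in values]
    grand_mean = mean(all_values)

    # Between-group variation: group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in samples.values()
    )
    # Within-group variation: observations around their own group mean.
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in samples.values()
        for x in values
    )

    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f": f_stat,
    }


result = one_way_anova(groups)
print(result)
```

The F statistic is then compared with the F distribution on `df_between` and `df_within` degrees of freedom. If SciPy is installed, `scipy.stats.f_oneway(*groups.values())` should return the same F statistic together with its p-value.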

Resources for this question:

Data Download►
AI-guided solution Prompt►
Sowmya Chakravarthy's Answer► R Code►
Rashmi Kondakindi's Teach One YouTube►

Question 3: Universities often run multiple lectures of the same introductory course each semester because of high demand. Consider a statistics department that runs three lectures of an introductory statistics course. We would like to determine whether there are statistically significant differences in first exam scores among these three classes (shown as class A, B, and C). State appropriate hypotheses for determining whether the three classes differ, summarize the available data, and use one-way ANOVA to check whether student performance differs among the three lectures of the same course. This question is from OpenIntro Statistics.
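
One common way to write the hypotheses and the test statistic for three lectures is shown below. The symbols are notation introduced here for illustration, not taken from the assignment: \mu_A, \mu_B, and \mu_C are the mean first exam scores in the three lectures, k = 3 is the number of lectures, and n is the total number of students.

```latex
H_0:\ \mu_A = \mu_B = \mu_C
\qquad
H_A:\ \text{at least one of } \mu_A,\ \mu_B,\ \mu_C \text{ differs from the others}

F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (n - k)},
\qquad k = 3
```

If the null hypothesis and the usual ANOVA assumptions hold, F follows an F distribution with k - 1 and n - k degrees of freedom.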

Data Download►
AI-guided solution Prompt►
Sowmya Chakravarthy's Answer► R code►

Question 4: Does utilization in three rural hospitals differ significantly from each other?

Resources for this question:

Data Download►
STATA solution YouTube►

Question 5: Suppose we had patients with myocardial infarction in the following groups:

Group 1: A music therapy group
Group 2: A relaxation therapy group
Group 3: A control group

Fifteen patients were randomly assigned to the three groups, and their stress levels were then measured to determine whether the interventions were effective in minimizing stress. Identify whether the three groups differ and, if so, identify the differences between each pair of groups.
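
One possible way to compare pairs of groups after a significant overall F test is Fisher's least significant difference (LSD), sketched below. This is not necessarily the method used in the course solution. The stress scores are hypothetical, the function name `pairwise_lsd` is made up for this example, and the critical value 2.179 is the two-sided 5% point of Student's t with 15 - 3 = 12 degrees of freedom, taken from a standard t table.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Hypothetical stress scores, 5 patients per group (15 in total).
stress = {
    "music": [2, 4, 6, 4, 4],
    "relaxation": [3, 5, 7, 5, 5],
    "control": [7, 9, 11, 9, 9],
}

# Two-sided 5% critical value of Student's t with 12 degrees of freedom.
T_CRITICAL = 2.179


def pairwise_lsd(samples, t_critical):
    """Compare every pair of group means using the pooled within-group variance."""
    n_total = sum(len(values) for values in samples.values())
    df_within = n_total - len(samples)
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in samples.values()
        for x in values
    )
    ms_within = ss_within / df_within  # pooled estimate of the error variance

    comparisons = {}
    for (name_1, v1), (name_2, v2) in combinations(samples.items(), 2):
        std_error = sqrt(ms_within * (1 / len(v1) + 1 / len(v2)))
        t_stat = (mean(v1) - mean(v2)) / std_error
        comparisons[(name_1, name_2)] = (t_stat, abs(t_stat) > t_critical)
    return comparisons


results = pairwise_lsd(stress, T_CRITICAL)
for pair, (t_stat, differs) in results.items():
    print(pair, round(t_stat, 3), "differs" if differs else "no evidence of a difference")
```

With more than three groups, an adjusted procedure such as Tukey's HSD or a Bonferroni correction is usually preferred, because unadjusted pairwise tests inflate the overall error rate.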

Resources for this question:

Data Download►
Solution without use of software Slides►

Visual display of one-way ANOVA PubMed►
Chapter 5 ANOVA Fixed Effect Models Open Book►
Yili's lecture ANOVA► Study Design►
Two-way ANOVA Video►
Randomized block design Video►
