George Mason University HAP 719 Advanced Statistics

HAP 719: Advanced Statistics I

Basic data 

Analysis of Variance & Covariate Balancing


After completing the activities in this module, you should be able to:

  • Differentiate between observational and randomized experimental designs in the context of ANOVA
  • Use one-way ANOVA and two-way ANOVA to analyze data
  • Interpret statistical outputs as they relate to these techniques



Assignments are submitted on Blackboard and are graded as pass/fail. Include a one-page summary as a Word document. In the summary, state whether you were able to get the same answers as those provided. Include your R, STATA, or Python code in separate files. It is OK to help each other with the assignments, but it is not OK to copy and paste the work of others. It is OK to use ChatGPT or other large language models to generate the R code, but you must be transparent about it and report its use.

Assignments are due by Sunday midnight. Peer teachers should submit by the Sunday prior to the lecture; all others should submit by the Sunday following the lecture. Late assignments receive 80% of the credit given to on-time assignments.

The data set you will be using for this project is a subsample of a larger data set, the 2011/2012 National Survey of Children’s Health (NSCH), which was conducted by the Centers for Disease Control and Prevention (CDC) and the National Center for Health Statistics. This survey included around 95,000 children between the ages of 0 and 17 years, and its purpose was to measure children’s health status, insurance coverage, parental health, and several other characteristics.

Question 1: Construct the subsample that includes only children between the ages of 3 and 5 years. Create a data frame in R for the variables needed, excluding variables that are not needed to answer the following questions. Address missing values. Check the distribution of the variables and whether transformations are necessary:

  1. Summarize the data. The first step in any data analysis project is to examine the descriptive statistics (that is, assess the distribution) of the variables mentioned in the research question you chose. This means running descriptive statistics such as means, standard deviations, and counts with percents for each variable (e.g., average number of preventive health services in the last 12 months, child’s race, etc.). Caution: First identify each variable as either continuous or categorical, and then run the appropriate descriptive statistics. For example, you should not compute the mean and standard deviation for a variable such as child’s race, because it is categorical (it has two groups: white vs. non-white). Computing the count (the actual number of children, n) and percent for each category of this variable is appropriate. Using graphs to describe some of the variables is fine too. However, avoid redundancy in how you present descriptive findings: findings for a given variable displayed in a table should not repeat the information shown in a graph.
  2. Examine the effects of child’s race, poverty level (measured as finding it hard to cover basic food or housing), gender, consistency of health insurance coverage, adverse life experiences, and the number of times children received preventive services in the last 12 months on the number of chronic health conditions (measured as the sum of binary comorbidities). These data are collected for 3-5 year old children. Are there any statistically significant differences in the outcome variable across the different levels of the independent variables? For example: Do white children have a significantly higher average number of chronic conditions than non-white children? Caution: If the dependent variable is continuous and the independent variable is categorical with two levels or groups, you should conduct a t-test. A t-test will help you determine how the dependent variable compares across the groups of the independent categorical variable. If the independent variable is categorical with 3 or more groups and the dependent variable is continuous, then ANOVA should be used in the analysis. Conducting these bivariate analyses (between each independent variable and the dependent variable) before regression will help you understand how two variables relate to each other, which in turn can help you decide which variables should enter the regression model. Results generated from these analyses should be organized in another table, and your findings should be presented in English.
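
The steps in Question 1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the records, the field names (age, race, chronic), and the helper functions are hypothetical and are not the actual NSCH variable names, and dropping rows with missing values is just one of several ways to address missing data.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records; field names and values are made up for illustration
# and are NOT the actual NSCH variables.
records = [
    {"age": 2, "race": "white", "chronic": 0},
    {"age": 3, "race": "white", "chronic": 1},
    {"age": 4, "race": "non-white", "chronic": 2},
    {"age": 5, "race": "non-white", "chronic": None},  # missing outcome
    {"age": 5, "race": "white", "chronic": 0},
    {"age": 4, "race": "non-white", "chronic": 1},
    {"age": 9, "race": "white", "chronic": 3},
]

# Keep children aged 3-5 and drop rows with any missing value
# (listwise deletion, chosen here only for simplicity).
subsample = [
    r for r in records
    if 3 <= r["age"] <= 5 and all(v is not None for v in r.values())
]


def describe_continuous(values):
    """Mean and sample standard deviation for a continuous variable."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def describe_categorical(values):
    """Count and percent for each level of a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {level: (n, 100 * n / total) for level, n in counts.items()}


def choose_test(n_levels):
    """Bivariate test for a continuous outcome, following the rule above."""
    if n_levels < 2:
        raise ValueError("need at least two groups")
    return "t-test" if n_levels == 2 else "ANOVA"


chronic_summary = describe_continuous([r["chronic"] for r in subsample])
race_summary = describe_categorical([r["race"] for r in subsample])
test_for_race = choose_test(len(race_summary))
print(chronic_summary)
print(race_summary)
print(test_for_race)
```

The same pattern carries over to a data frame in R or pandas: filter on age first, handle missing values, then summarize each variable according to its type before choosing the bivariate test.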

Resources for this question:

Question 2: Perform a one-way analysis of variance. A large percentage of people in the United States suffer from high levels of cholesterol. For patients with high cholesterol levels, physicians prescribe drugs to reduce cholesterol. A pharmaceutical company has developed three such drugs. To find out whether any statistically significant differences exist among the three drugs, a researcher at the company conducted an experiment. The researcher selected 60 men, each of whom had a cholesterol level over 285. She randomly assigned 20 men to each treatment group. The drugs were administered over a three-month period, and the reduction in cholesterol was recorded for each person.

  1. Summarize the data. Provide a correlation/scatter plot matrix.
  2. Are there significant differences in cholesterol reduction due to drug type?
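
As a rough illustration of what a one-way ANOVA computes, the sketch below builds the F statistic from the between-group and within-group sums of squares using only the Python standard library. The function name one_way_anova and the numbers are assumptions made for this example; they are not the assignment data.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up cholesterol reductions for three drugs (NOT the assignment data).
drug_a = [1, 2, 3]
drug_b = [2, 3, 4]
drug_c = [6, 7, 8]

f_stat, df_between, df_within = one_way_anova([drug_a, drug_b, drug_c])
print(f_stat, df_between, df_within)
```

The p-value is the upper-tail probability of the F distribution with these degrees of freedom. If SciPy is installed, scipy.stats.f_oneway(drug_a, drug_b, drug_c) reports both the statistic and the p-value directly.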

Resources for this question:

Question 3: College departments commonly run multiple lectures of the same introductory course each semester because of high demand. Consider a statistics department that runs three lectures of an introductory statistics course. We might like to determine whether there are statistically significant differences in first exam scores among these three classes (A, B, and C). Describe appropriate hypotheses to determine whether there are any differences among the three classes.
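
One common way to write hypotheses for a one-way ANOVA on three groups is shown below. The symbols are assumptions introduced for this sketch: each μ denotes the population mean first exam score in the corresponding class.

```latex
H_0:\ \mu_A = \mu_B = \mu_C
\qquad\text{versus}\qquad
H_A:\ \text{at least one of } \mu_A,\ \mu_B,\ \mu_C \text{ differs from the others}
```

Note that rejecting the null hypothesis indicates only that some difference exists; it does not say which classes differ.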

Question 4: Does utilization of three rural hospitals differ significantly?

Resources for this question:

Question 5: Suppose we had patients with myocardial infarction in the following groups:

  • Group 1: A music therapy group
  • Group 2: A relaxation therapy group
  • Group 3: A control group

Fifteen patients were randomly assigned to the 3 groups, and their stress levels were then measured to determine whether the interventions were effective in minimizing stress. Identify whether the three groups differ and, if so, identify the differences between each pair of groups.
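
Pairwise follow-up comparisons after a one-way ANOVA can be sketched as follows. This is one possible approach, not necessarily the one intended for the assignment: it computes a t statistic for every pair of groups using the pooled within-group variance and a Bonferroni-adjusted significance level. The function name pairwise_t and the stress scores are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pairwise_t(groups):
    """t statistic for every pair of groups, using the pooled within-group
    variance (mean square error) from the one-way ANOVA."""
    n_total = sum(len(v) for v in groups.values())
    df_within = n_total - len(groups)
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    mse = ss_within / df_within
    results = {}
    for (a, xa), (b, xb) in combinations(groups.items(), 2):
        se = sqrt(mse * (1 / len(xa) + 1 / len(xb)))
        results[(a, b)] = (mean(xa) - mean(xb)) / se
    return results, df_within


# Made-up stress scores, 5 patients per group (NOT the assignment data).
stress = {
    "music": [1, 2, 3, 4, 5],
    "relaxation": [2, 3, 4, 5, 6],
    "control": [6, 7, 8, 9, 10],
}

t_stats, df_within = pairwise_t(stress)
# Bonferroni: split the overall 0.05 level across the three comparisons.
bonferroni_alpha = 0.05 / len(t_stats)
for pair, t in t_stats.items():
    print(pair, round(t, 3))
print(df_within, bonferroni_alpha)
```

Each t statistic would be compared against the t distribution with df_within degrees of freedom at the adjusted level. If SciPy is available, 2 * scipy.stats.t.sf(abs(t), df_within) gives the two-sided p-value, and scipy.stats.tukey_hsd offers Tukey's HSD as an alternative to the Bonferroni adjustment.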

Resources for this question: