## Homework 7: Grab-bag testing and tidy data

### Part One: Hypothesis testing (15 points each = 75 points)

Perform a series of hypothesis tests using any test we have learned in class. These include the following:

• $$\chi^2$$ goodness-of-fit test
• $$\chi^2$$ contingency table test
• Fisher’s exact test
• binomial test
• two-sample t-test
• paired t-test
• one-sample t-test
• sign test
• Wilcoxon signed-rank test
• Mann-Whitney U test (aka Wilcoxon rank-sum test)

For each question you must do the following:

• State null and alternative hypotheses
• Explain why you chose the test you are using (1-2 sentences). If another test could also be used, explain why you preferred the one you chose. This should include a brief statement about which assumptions you checked and how you verified them.
• Carry out the test, including any checking of assumptions.
• Report results and conclusions
• Results must include the test statistic and/or parameter estimate, P-value, and 95% confidence interval (if applicable)
• Conclusions must include your P-value comparison with $$\alpha$$ and corresponding conclusions (i.e., reject/fail to reject, whether there is evidence in favor of the alternative) regarding the null and alternative hypotheses. If a directional conclusion is warranted, it must be made.

Most importantly, for this assignment, you should use $$\alpha=0.01$$ for significance assessment.
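As a rough sketch of what the reporting step can look like in R, the pieces of a test result can be pulled out of the object the test function returns. The example below uses the built-in `sleep` dataset purely for illustration (it is not one of the assignment files), and the variable names `alpha`, `result`, and `reject_null` are arbitrary. Assumption checks are omitted here and would still need to be done for a real answer.

```r
# Illustration only: built-in `sleep` data, not an assignment dataset.
alpha <- 0.01

# Two-sample (Welch) t-test comparing `extra` between the two groups
result <- t.test(extra ~ group, data = sleep, conf.level = 0.95)

result$statistic  # test statistic (t)
result$estimate   # parameter estimates (the two group means)
result$p.value    # P-value
result$conf.int   # 95% confidence interval for the difference in means

# Compare the P-value with alpha to decide whether to reject the null
reject_null <- result$p.value < alpha
reject_null
```

The same `$statistic`, `$p.value`, and (where available) `$conf.int` components exist on the results of other base R test functions such as `chisq.test()`, `binom.test()`, and `wilcox.test()`.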

Question 1. Carotenoids are bright red pigments that are also important as antioxidants for humans and other animals. Researchers tested whether carotenoids influence the immune systems of zebra finches, whose red beak color is induced by eating carotenoids. To test this hypothesis, a group of zebra finches was randomly divided into two groups (McGraw and Ardia 2003). Ten finches received supplemental carotenoids, and ten did not. All 20 birds were then measured using an assay that measures cell-mediated immunocompetence (PHA) as well as an assay that measures humoral immunity (SRBC). Higher levels of PHA and SRBC indicate a strengthened immune response. The data are contained in the file “ZebraFinches.csv”.

Do PHA levels tend to differ between the birds that received supplemental carotenoids and those that did not? Based on your results, can you infer anything about immune differences between birds that did and did not receive carotenoids?

### All R code goes here

State your full answer here.

Question 2. The horned lizard Phrynosoma mcallii is named for a fringe of spikes around its eyes. Herpetologists asked whether having longer spikes helps to prevent birds from eating these lizards. The researchers identified the remains of 55 horned lizards that had been attacked by their main predator, the loggerhead shrike, and measured the length of their horns. The researchers also measured the horn lengths of 154 lizards that were alive and well. The data are contained in the file “HornedLizards.csv”.

Does horn length tend to differ between lizards that were killed and those that remained alive? Based on your results, do you find evidence that having longer horns might help lizards avoid being eaten?

### All R code goes here

State your full answer here.

Question 3. Male spiders of the genus Tidarren have an odd behavior: They voluntarily self-amputate one of their two pedipalps (copulatory organs) before sexual maturity. These pedipalps each comprise ~10% of the male spider’s mass, so researchers wondered if removing one pedipalp increased male spider running performance, and thereby its ability to reach a mate. To test their hypothesis, researchers used video to measure the running speed of male spiders over strands of spider silk, before and after self-amputation. The data are contained in the file “MaleSpiderAmputation.csv”.

Does spider speed tend to differ before and after pedipalp amputation? Based on your results, do you find evidence that amputation confers a mating advantage (i.e., by running to the female faster)?

### All R code goes here

State your full answer here.

Question 4. Monoclonal gammopathy is a disease characterized by an extreme spike in the levels of a single immunoglobulin protein in blood and carries a high-risk prognosis of multiple myeloma. Researchers at the Mayo Clinic catalogued patients with monoclonal gammopathy to compare vital signs between patients who eventually developed malignancies and those who did not. The data are contained in the file “MonoclonalGammopathy”.

Is there an association between the strength of the immunoglobulin spike (column “Mspike”) and whether or not a malignancy occurred (column “Malig”)?

### All R code goes here

State your full answer here.

Question 5. This question concerns the same monoclonal gammopathy dataset as in question 4.

For a clinical trial to be robust, different experimental units (“blocks”) should have similar properties; in other words, the units should be “balanced”. Test whether males enrolled in the study show proper balancing of Hemoglobin levels. (Hint: If the study were fully balanced, there would be equal frequencies for all hemoglobin levels.)
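To make the hint concrete, here is a minimal sketch of testing observed category counts against equal expected frequencies. The `group` vector and its four labels are invented for illustration and have nothing to do with the gammopathy data; `observed`, `k`, and `gof` are arbitrary names.

```r
# Invented categorical data: 40 observations across four categories
group <- rep(c("A", "B", "C", "D"), times = c(14, 9, 11, 6))

# Tabulate observed counts per category
observed <- table(group)
k <- length(observed)

# Goodness-of-fit test against equal expected proportions (1/k each)
gof <- chisq.test(observed, p = rep(1 / k, k))

gof$expected   # expected count per category under the null (10 each here)
gof$statistic  # chi-squared test statistic
gof$parameter  # degrees of freedom (k - 1)
gof$p.value
```

Inspecting `gof$expected` is one way to check the usual assumption that expected counts are not too small before trusting the $$\chi^2$$ approximation.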

### All R code goes here

State your full answer here.

### Part Two: Tidy data (25 points)

Question 1 (2.5 points). Examine the built-in R dataset trees (like iris, you can simply call this variable to see the dataset) and explain why it is either tidy or messy, using the principles of tidy data.

### View the dataset here

State your full answer here in 1-3 sentences.

Question 2 (2.5 points). Examine the built-in R dataset HairEyeColor and explain why it is either tidy or messy, using the principles of tidy data.

### View the dataset here

State your full answer here.

Question 3 (10 points). The built-in R dataset pressure contains data on the vapor pressure of mercury at different temperatures in Celsius. Convert this dataset into a wide table, where columns are temperatures (be sure to print this dataset out for your grader to see). Then use whichever of the two data representations (original or wide) is tidy to make a scatterplot of pressure plotted against temperature (meaning temperature is the explanatory variable) in ggplot. Explain why you used the data representation you did for plotting in the context of its “tidy data” advantage.
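For reference, the general shape of a long-to-wide conversion (and back) is sketched below on a small made-up data frame. This assumes the tidyr package is the reshaping tool in use; the data frame `long` and its columns `dose` and `response` are hypothetical and chosen only for illustration.

```r
library(tidyr)

# Made-up long-format data: one row per observation
long <- data.frame(dose = c(1, 2, 3), response = c(0.5, 1.2, 2.9))

# Wide format: one column per dose value, a single row of responses
wide <- pivot_wider(long, names_from = dose, values_from = response)
print(wide)

# Back to long format; column names come back as character strings,
# so convert them to numbers again
back <- pivot_longer(wide, cols = everything(),
                     names_to = "dose", values_to = "response")
back$dose <- as.numeric(back$dose)
print(back)
```

Note that in the wide version the dose values live in the column names rather than in a column of their own, which matters when deciding which representation a plotting function can map to an axis.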

### All R code goes here

Explanation for data plotting choice goes here.

Question 4 (10 points). Convert the dataset “air_passengers.csv” (contains monthly totals of international airline passengers from 1949-1960) into a tidy dataset with three columns: Year, month, number of passengers. Once you have tidied the data, use ggplot to visualize the data as a line plot. The y-axis should show number of passengers, the x-axis should show months, and there should be a single line (denoted with color) per year. Additionally make sure to represent each data point with an actual point in the plot. Once the plot is made, describe any trends you see in the data in 1-2 sentences.

Hint: To ensure proper color-coding, you may need to call the variable Year in your ggplot code as as.factor(Year). This way, ggplot will treat it as a categorical variable in the color scheme.
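The effect of this hint can be seen in a small sketch on invented data (the data frame `toy` and its columns `step`, `batch`, and `value` are hypothetical). When the numeric grouping variable is wrapped in as.factor(), ggplot assigns one discrete color per group and draws a separate line for each; left as a number, it would be mapped to a continuous color gradient instead.

```r
library(ggplot2)

# Invented data: two batches measured at four steps each
toy <- data.frame(
  step  = rep(1:4, times = 2),
  batch = rep(c(1, 2), each = 4),
  value = c(2, 3, 5, 4, 3, 5, 6, 7)
)

# as.factor(batch) makes the color scale discrete and groups the lines
p <- ggplot(toy, aes(x = step, y = value, color = as.factor(batch))) +
  geom_line() +
  geom_point() +
  labs(color = "batch")

p
```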

### All R code goes here

Describe trends you see here.