**Consider $\alpha=0.05$ as significant throughout this assignment.**

## Part One

This section uses the dataset `bodyfat.csv`, which contains various physical measurements from ~250 adult men. Variables in the dataset include the following:

+ `Percent`, body fat percentage
+ `Age`, in years
+ `Weight`, in lbs
+ `Height`, in inches
+ `Neck`, circumference in cm
+ `Chest`, circumference in cm
+ `Abdomen`, circumference in cm
+ `Hip`, circumference in cm
+ `Thigh`, circumference in cm
+ `Knee`, circumference in cm
+ `Ankle`, circumference in cm
+ `Biceps`, circumference in cm
+ `Forearm`, circumference in cm
+ `Wrist`, circumference in cm

**Question 1 (15 points).** Use the `step()` function to construct a linear model that predicts body fat percentage in men (we will call this the "step-wise model"). Be sure to use the argument `trace=0` when running `step()`, i.e.: `step(lm(Y~X, data=data), trace=0)`. This will reduce the amount of unnecessary output. **You will lose points if you don't include this argument.** Then, provide an answer with the following components:

+ State which predictors `step()` removed from the model
+ Report the final AIC and BIC for the step-wise model
+ Fully interpret all coefficients (including the intercept) and the $R^2$, indicating its significance and what it means regarding the response variable
+ Indicate if you observe anything "unexpected" in the final model (Hint: there **are** several "unexpected" things in the output, which relate to your answers under the previous bullet point!)
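
As a rough illustration of the `step()` workflow described above, the sketch below runs it on R's built-in `mtcars` data rather than on `bodyfat.csv`. The response `mpg` and the object names `full_model`, `step_model`, and `removed` are assumptions made only for this example.

```r
# Illustrative only: built-in mtcars stands in for the assignment data.
full_model <- lm(mpg ~ ., data = mtcars)

# trace = 0 suppresses the step-by-step log. With no scope argument,
# step() defaults to backward elimination.
step_model <- step(full_model, trace = 0)

# Predictors present in the full model but dropped by step()
removed <- setdiff(attr(terms(full_model), "term.labels"),
                   attr(terms(step_model), "term.labels"))
print(removed)

# Information criteria and fit summary for the selected model
print(AIC(step_model))
print(BIC(step_model))
print(summary(step_model)$r.squared)
```

Note that `step()` ranks models with `extractAIC()`, whose values differ from `AIC()` by a constant for a given dataset, so the numbers in a traced log would not match `AIC(step_model)` even though the model ordering is the same.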

```{r}
### Code goes here
```

**Answer goes here.**

**Question 2 (15 points).** Perform a **likelihood ratio test** (LRT) between two bodyfat percentage models: the step-wise model determined in question 1, and a second model with all those same predictors *and an added effect* that is an interaction between *the two most significant predictors in the step-wise model*. In other words, imagine an iris model as `lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species, data = iris)`. To add in an interaction between `Sepal.Width` and `Sepal.Length`, we would do this: `lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species + Sepal.Width:Sepal.Length, data = iris)`. Then, provide an answer with the following components:

+ Based on the LRT, which model is preferred?
+ Compare the adjusted $R^2$ values between the two models. Does your LRT support the model with the higher $R^2$?
+ Is your result consistent with what you would expect based on AIC and BIC differences between these two models?
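
One way to sketch the LRT for the iris illustration above is to compute it by hand from the two log-likelihoods: the statistic is $2(\ell_{\text{full}} - \ell_{\text{reduced}})$, compared against a $\chi^2$ distribution whose degrees of freedom equal the number of added parameters. The object names below are assumptions for this example, and packaged alternatives such as `lrtest()` from `lmtest` may be preferred in practice.

```r
# Nested models from the iris illustration in the question text
reduced <- lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species, data = iris)
full <- lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species +
             Sepal.Width:Sepal.Length, data = iris)

# LRT statistic: twice the gain in log-likelihood from the added term
ll_reduced <- logLik(reduced)
ll_full <- logLik(full)
lrt_stat <- as.numeric(2 * (ll_full - ll_reduced))

# Degrees of freedom: number of extra parameters in the larger model
df_diff <- attr(ll_full, "df") - attr(ll_reduced, "df")
p_value <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = lrt_stat, df = df_diff, p = p_value))

# Side-by-side comparison of the other criteria the question asks about
comparison <- data.frame(
  model = c("reduced", "full"),
  adj_r2 = c(summary(reduced)$adj.r.squared, summary(full)$adj.r.squared),
  AIC = c(AIC(reduced), AIC(full)),
  BIC = c(BIC(reduced), BIC(full))
)
print(comparison)
```

A p-value below the chosen $\alpha$ favors the model with the interaction; otherwise the simpler model is retained.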

```{r}
### Code goes here
```

**Question 3 (20 points).** Using only the predictors in the step-wise model (not the model from question 2), perform a *k-fold cross validation* with K=10 to predict bodyfat percentage. Make sure to set your seed first! For each trained model, calculate RMSE for both the respective training and testing data. Visualize the final RMSE distributions (there will be 20 values, 10 from test data and 10 from training data) as boxplots, in a single call to `ggplot()`. Based on your results, explain how robust the model is.
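
The k-fold procedure described above can be sketched in base R as follows. This is a minimal example on the built-in `iris` data with an assumed model formula, an arbitrary seed, and hypothetical helper names (`fold_id`, `rmse`, `results`); it is not the solution for `bodyfat.csv`.

```r
set.seed(1)  # arbitrary seed, chosen only so the fold assignment is reproducible
k <- 10

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(1:k, length.out = nrow(iris)))

# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

results <- do.call(rbind, lapply(1:k, function(i) {
  train <- iris[fold_id != i, ]  # k - 1 folds used for fitting
  test <- iris[fold_id == i, ]   # held-out fold
  fit <- lm(Petal.Length ~ Sepal.Width + Sepal.Length + Species, data = train)
  data.frame(
    fold = i,
    set = c("train", "test"),
    rmse = c(rmse(train$Petal.Length, predict(fit, train)),
             rmse(test$Petal.Length, predict(fit, test)))
  )
}))

# Long format: one row per fold and data split (20 rows in total),
# suitable for e.g. ggplot(results, aes(x = set, y = rmse)) + geom_boxplot()
print(results)
```

Keeping the results in long format, with one column identifying the split, is what allows both distributions to be drawn in a single `ggplot()` call.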

```{r}
### Code goes here
```

**Answer goes here.**

## Part Two

This section uses the dataset `mammogram.csv`, which contains mammogram results and final diagnoses for 831 women. Variables in the dataset include the following:

+ `BIRADS`, the BI-RADS mammogram assessment, ranging from 1--5 here (see [https://en.wikipedia.org/wiki/BI-RADS](https://en.wikipedia.org/wiki/BI-RADS) as desired)
+ `Age`: patient's age in years
+ `Shape`: mass shape
+ `Margin`: mass margin
+ `Density`: mass density
+ `Severity`: final diagnosis, as benign or malignant

**Question 1 (25 points).** Construct a logistic regression, using the full mammogram dataset, to predict breast cancer malignancy. Once the model is made, make two figures to accompany this model: an ROC curve and a plot of the fitted logistic curve (for the latter plot, show colored points). Then, provide an explanation which includes the following components:

+ The AUC
+ The false discovery rate, at a cutoff of 0.5
+ The accuracy, at a cutoff of 0.5
+ The true positive rate that corresponds to a *specificity* of 0.9, as assessed *visually* from the ROC curve
+ For each of these quantities, be sure to explain what the quantity means in the context of the model.
    + **For example:** in defining $R^2$ in a hypothetical bodyfat linear model, I would not define it as "the percent of y explained by x". I *would* define it as "the percent of variance in bodyfat explained by the model."
+ Based on all quantities and curves, indicate whether this model performs strongly.
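
To make the definitions of these quantities concrete, here is a small base-R sketch on the built-in `mtcars` data, with `am` as a stand-in binary outcome and `wt` as the only predictor (both are assumptions for illustration, unrelated to `mammogram.csv`). The AUC is computed with the rank-sum (Mann-Whitney) identity rather than with a dedicated ROC package.

```r
# Illustrative logistic regression: am (0/1) predicted from wt
fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")  # fitted probabilities
truth <- mtcars$am
pred <- as.integer(prob >= 0.5)          # classify at a cutoff of 0.5

# Confusion-matrix cells
tp <- sum(pred == 1 & truth == 1)
fp <- sum(pred == 1 & truth == 0)
tn <- sum(pred == 0 & truth == 0)
fn <- sum(pred == 0 & truth == 1)

accuracy <- (tp + tn) / length(truth)  # share of all calls that are correct
fdr <- fp / (fp + tp)                  # share of positive calls that are wrong
tpr <- tp / (tp + fn)                  # sensitivity
specificity <- tn / (tn + fp)          # equals 1 - false positive rate

# AUC: probability that a random positive outranks a random negative
r <- rank(prob)
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(c(accuracy = accuracy, fdr = fdr, tpr = tpr,
        specificity = specificity, auc = auc))
```

Because an ROC curve plots the true positive rate against the false positive rate, a specificity of 0.9 corresponds to reading the curve at a false positive rate of 0.1.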

```{r}
### Code goes here
```

**Question 2 (25 points).** Perform a *k-fold cross validation* with K=10 to predict breast cancer malignancy, using all predictors. Make sure to set your seed first! For each trained model, calculate AUC for both the respective training and testing data. Visualize the final AUC distributions (there will be 20 values, 10 from test data and 10 from training data) as violin plots, in a single call to `ggplot()`. Based on your results, explain how robust the model is.
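
A possible shape for the AUC version of the cross validation is sketched below on a two-class subset of the built-in `iris` data. The outcome `is_virginica`, the predictors, the seed, and the helper `auc()` are all assumptions for this example. The folds here are assigned within each class (stratified), which goes beyond what the question requires; it is done so that every held-out fold contains both classes and the AUC is always defined.

```r
set.seed(1)  # arbitrary seed for a reproducible fold assignment
k <- 10

# Two-class stand-in data: versicolor (0) vs. virginica (1)
dat <- droplevels(subset(iris, Species != "setosa"))
dat$is_virginica <- as.integer(dat$Species == "virginica")

# Stratified folds: shuffle fold labels separately within each class
fold_id <- integer(nrow(dat))
for (cls in unique(dat$is_virginica)) {
  idx <- which(dat$is_virginica == cls)
  fold_id[idx] <- sample(rep(1:k, length.out = length(idx)))
}

# AUC from the rank-sum (Mann-Whitney) identity
auc <- function(truth, score) {
  r <- rank(score)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

results <- do.call(rbind, lapply(1:k, function(i) {
  train <- dat[fold_id != i, ]
  test <- dat[fold_id == i, ]
  fit <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
             data = train, family = binomial)
  data.frame(
    fold = i,
    set = c("train", "test"),
    auc = c(auc(train$is_virginica, predict(fit, train, type = "response")),
            auc(test$is_virginica, predict(fit, test, type = "response")))
  )
}))

# 20 rows, usable with e.g. ggplot(results, aes(x = set, y = auc)) + geom_violin()
print(results)
```

Test-fold AUCs that sit far below the training AUCs, or that vary widely between folds, would suggest the model does not generalize well.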

```{r}
### Code goes here
```

**Answer goes here.**

**OPTIONAL BONUS QUESTION (20 points).** Make **violin plots** of *precision* and *accuracy* across the 10 folds, using a tidyverse-oriented strategy to obtain these values. An example of how to perform something similar can be found in the K-folds supplement on the course website. Additionally, report the mean and standard deviation for these quantities.

```{r}
### Code goes here
```