Supervised Machine Learning Project

Overview of Data Sets

We are analyzing the quality of wine based on 11 predictors: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol. Quality, in this instance, is defined as an indicator of its craftsmanship and, thus, desirability which can be used, for example, in pricing. Quality does not necessarily indicate if a wine has gone bad. We will use two data sets – one for white wine and the other for red – to create two respective machine learning models. There are 4899 observations in the white wine data set and 1600 observations in the red wine data set.

Overview of Research Questions

We are interested in conducting analysis on both the white and red wine to assess the quality of wine in new wines in production so that they may be accurately priced and marketed. Maya works at a natural wine bar in downtown Santa Barbara and is familiar with the components of wine that have to do with taste, influence of location, and food pairing, but less familiar with chemical composition of a quality wine. She is specifically interested in how certain chemical levels can influence the quality and desireability of a wine, incorporating this knowledge in how to sell wine to her customers.

Loading Data

This project uses data on the white and red wine, which records information of the chemical makeup of the wine.

white_og <- read.csv("wine_dataset/winequality-white.csv", sep = ";")
red_og <- read.csv("wine_dataset/winequality-red.csv", sep = ";")

Data Cleaning

To clean our data, we clean the column names, change quality into a factor so we can analyze it with classification models, and add a type of White or Red to each data set. We removed any rows in the white wine data set with a quality value of 9 because there are too few instances, and thus, it inhibits our models from performing correctly later on. The red wine data set did not have any quality values of 9. We also create a combined data frame with both values from white and red wine to see if there are significant differences between red and white evaluations for quality.

# Make Column Names Clean 
white_og %>% clean_names()
red_og %>% clean_names()

#take out data with quality = 9; red does not have any values of 9
white <- white_og[white_og$quality < 9,]
red <- red_og[red_og$quality < 9,]

# Change quality to factor
white$quality <- factor(white$quality, levels = c(3,4,5,6,7,8))
red$quality <- factor(red$quality, levels = c(3,4,5,6,7,8))

# Adding type
white$type <- "White"
red$type <- "Red"

# data frame of combined wine
combinedWine_og <- rbind(white_og, red_og)
combinedWine <- rbind(white, red)

Data Split

In our data split, we put a proportion of .7 of each original data set into a training data set and a proportion of .3 into the testing data sets, stratifying by quality. In this section, we also folded our data into 5 folds for later cross validation use.

set.seed(1234)
white_split <- white %>% 
  initial_split(prop = 0.7, strata = "quality")

white_train <- training(white_split)
white_test <- testing(white_split)

red_split <- red %>% 
  initial_split(prop = 0.7, strata = "quality")

red_train <- training(red_split)
red_test <- testing(red_split)

white_fold <- vfold_cv(white_train, v = 5)
red_fold <- vfold_cv(red_train, v = 5)

Exploratory Data Analysis

What sort of factors do winemakers and sommeliers look for in a quality wine? Generally, quality is determined by acidity, dryness, flavor profile or taste, alcohol content, and how well the wine is preserved or how it changes as it is stored. In our exploratory data analysis, we will analyze our predictors based on these five categories. First, acidity levels can be summarized through the ph levels, fixed.acidity, volatile.acidity, and citric.acid content. Dryness is determined by the density. Taste can be broken down into sweetness and saltiness, which are caused by residual.sugar and chlorides respectively. We will analyze alcohol content singularly to see its effect on the wine quality. Lastly, sulfurous compounds are what is generally used to preserve wine, so we will analyze free.sulfur.dioxide, total.sulfur.dioxide, and sulphates to see if the way a wine is preserved interacts with wine quality in an interesting way.

Our data can be split into two data sets because experts look for different levels of acidity, sugar, etc. for white wine and red wine. Thus, we will have 3 different representations of the data: one for white wine, one for red wine, and one for both.

All of our predictors are continuous, so we will use box plots, histograms, and scatter plots to visualize our data and determine feature selection.

First, let’s see the distribution of quality between both data sets of wine.

ggplot(combinedWine_og, aes(quality)) + geom_bar(color = "black", fill = "pink") + labs(title = "Histogram of Quality - Total Wine") + xlab("Quality of Wine") + ylab("Count")

We can see that it is normally distributed, meaning that most wine has a quality value of 5 or 6, with few exceptionally good wines having a value of 8 or 9, and low quality wines having a quality value of 3. Based on their low frequency, we can further justify selecting against of quality values of 9 in our initial data cleaning.

Next, we look at the correlation matrices for white and red wine separately to determine which predictors are correlated.

Correlation Matrices

White

white %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color', mar=c(0,0,2,0), main = 'White Wine Correlation Plot')

In the white wine correlation matrix, density and residual sugar; and density and alcohol are the predictors with the highest correlation. Total sulfur dioxide and free sulfur dioxide also have a moderate correlation.

Red

red %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = 'lower', diag = FALSE, 
           method = 'color', mar=c(0,0,2,0), main = 'Red Wine Correlation Plot')

In the red wine correlation matrix, citric acid and fixed acidity; density and fixed acidity; citric acid and volatile acidity; pH and fixed acidity; and free sulfur dioxide and total sulfur dioxide are highly correlated with each other.

Scatter plots

To visualize and validate these correlations, let’s take a look at the scaled scatter plot of each predictor plotted against its correlated counterpart.

White

# scaled data sets 
scaled_white = as.data.frame(scale(select(white, c(-quality,-type))))
scaled_red = as.data.frame(scale(select(red, c(-quality,-type))))

# scaled white residual sugar versus density 
ggplot(scaled_white, aes(x = residual.sugar, y = density)) + geom_point()+scale_x_continuous(name = "Residual Sugar") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle(" Residual Sugar Versus Density") + theme(plot.title = element_text(size = 20))

# scaled white alcohol versus density 
ggplot(scaled_white, aes(x = alcohol, y = density)) + geom_point()+scale_x_continuous(name = "Alcohol") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Alcohol Versus Density") + theme(plot.title = element_text(size = 20))

# scaled white free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_white, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 20))

Red

# scaled red volatile acidity versus citric acid
ggplot(scaled_red, aes(x = volatile.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Volatile Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Volatile Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 20))

# scaled red fixed acidity versus citric acid 
ggplot(scaled_red, aes(x = fixed.acidity, y = citric.acid)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Citric Acid") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Citric Acid") + theme(plot.title = element_text(size = 20))

# scaled red fixed acidity versus pH
ggplot(scaled_red, aes(x = fixed.acidity, y = pH)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "pH") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus pH") + theme(plot.title = element_text(size = 20))

# scaled red fixed acidity versus density
ggplot(scaled_red, aes(x = fixed.acidity, y = density)) + geom_point()+scale_x_continuous(name = "Fixed Acidity") + scale_y_continuous(name = "Density") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Fixed Acidity Versus Density") + theme(plot.title = element_text(size = 20))

# scaled red free sulfur dioxide versus total sulfur dioxide
ggplot(scaled_red, aes(x = free.sulfur.dioxide, y = total.sulfur.dioxide)) + geom_point()+scale_x_continuous(name = "Free Sulfur Dioxide") + scale_y_continuous(name = "Total Sulfur Dioxide") + geom_smooth(method = "lm", se = FALSE)+ ggtitle("Free Sulfur Dioxide Versus Total Sulfur Dioxide") + theme(plot.title = element_text(size = 20))

Based on the scatter plots, we can visualize the correlations between the predictors. For example, for white wine, density has a strong positive correlation with residual sugar and a moderate negative correlation with alcohol. Through these scatter plots, we confirm the existence of correlations predicted by our initial correlation matrix.

Now, we can take a look at the box plots for several of our predictors to see the ways that they interact with wine quality, isolated from the other predictors. First, we will visualize acidity levels which can be measured through fixed acidity, volatile acidity, and citric acid levels. As shown above in the scatter plots, these three predictors are highly correlated with each other in red wine.

#fixed acidity
ggplot(combinedWine, mapping = aes(x = `fixed.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Fixed Acidity Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) +coord_cartesian( xlim = c(0,16), ylim = NULL, default = FALSE )

# volatile acidity 
ggplot(combinedWine, mapping = aes(x = `volatile.acidity`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Volatile Acidity Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )

# citric acid 
ggplot(combinedWine, mapping = aes(x = `citric.acid`, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Citric Acid Levels Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(0,1), ylim = NULL, default = FALSE )

From these box plots, we can see that fixed acidity levels are relatively consistent in both red and white wine. Volatile acidity has a negative correlation with quality in red wine, but relatively consistent averages for each level of wine quality in white wine. Citric acid levels in red wine have a stronger positive correlation than in white wine. In general, we can see that acidity levels fluctuate more in red wine than in white wine.

Next, let’s take a look at dryness which is determined by the predictor density. Based on the correlation matrix and scatter plots, density also is correlated with residual sugar and alcohol in white wine and with fixed acidity in red wine.

# density
ggplot(combinedWine, mapping = aes(x = density, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = density, y = quality)) + labs(title = "Red and White Density Levels versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)+ coord_cartesian( xlim = c(.985,1.01), ylim = NULL, default = FALSE )

Although density is correlated with several predictors according to the correlation matrices and scatter plots, in this box plot, we can see that density stays relatively consistent, around 1, for each level of wine quality.

Next, we will look at the taste of the wine, which is determined by levels of sweetness and saltiness. These are affected by sugar levels and chlorides respectively.

#sugar content
ggplot(combinedWine, mapping = aes(x = residual.sugar, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = residual.sugar, y = quality)) + labs(title = "Red and White Residual Sugar Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,25), ylim = NULL, default = FALSE )

# we removed outliers to ensure that the variation was not due to the outliers 
ggplot(red[red$free.sulfur.dioxide < 50,], aes(x = free.sulfur.dioxide, y = quality)) + 
  geom_boxplot(aes(fill = quality)) +
  labs(title = "Free Sulfur Dioxide for Red Wine", x = "Free Sulfur Dioxide", y = "Quality") +
  geom_point(width = 0.15) +
  scale_fill_brewer(palette = "RdPu")

#chlorides
ggplot(combinedWine, mapping = aes(x = chlorides, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = chlorides, y = quality)) + labs(title = "Red and White Chloride Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,0.3), ylim = NULL, default = FALSE )

White wine has, on average, higher and more variable sugar levels than red wine while red wine has an on average higher chloride content than white wine. There seem to be a higher number of outliers in the values of chloride.

Next, we will analyze alcohol content, which can affect the taste of the wine as well.

ggplot(combinedWine, mapping = aes(x = alcohol, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = alcohol, y = quality)) + labs(title = "Red and White Alcohol Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1)

There is generally higher alcohol content associated with a wine of higher quality but there is not a significant different in averages between red wine and white wine.

Lastly, let’s look at the preservative content, which is determined by free sulfur dioxide, total sulfur dioxide, and sulfates.

#free sulfur dioxide
ggplot(combinedWine, mapping = aes(x = free.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = free.sulfur.dioxide, y = quality)) + labs(title = "Red and White Free Sulfur Dioxide Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,150), ylim = NULL, default = FALSE )

#total sulfur dioxide
ggplot(combinedWine, mapping = aes(x = total.sulfur.dioxide, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = total.sulfur.dioxide, y = quality)) + labs(title = "Red and White Total Sulfur Dioxide Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,300), ylim = NULL, default = FALSE )

#sulfates
ggplot(combinedWine, mapping = aes(x = sulphates, y = quality, fill = quality)) + geom_boxplot() + geom_point(white_train, mapping = aes(x = sulphates, y = quality)) + labs(title = "Red and White Sulfate Content versus Quality", fill = "Quality")  + scale_fill_brewer(palette="PuRd") + facet_wrap(. ~ type, nrow = 1) + coord_cartesian( xlim = c(0,1.5), ylim = NULL, default = FALSE )

From the box plots we can see that red wine generally has a lower sulfur dioxide content than white wine. Also, averages across each stratification of quality have similar values except for sulfates in red wine, which have a slight positive correlation with quality.

In general, data scientists conduct exploratory data analysis to see how predictor variables interact with the response and with each other before we make any assumptions. Based on the EDA that we conducted, we will keep all of our predictor variables in our recipes and will not interact any terms.

Loading in Saved Models

We did all of our modeling in R-Scripts for efficiency purposes, since models generally take a long time to run. We will load all of results here, and intermittently call variables throughout the report to visualize our calculations.

# LDA 
load("rda_objects/WhiteLDA.rda")
load("rda_objects/RedLDA.rda")

# LDA with PCA
load("rda_objects/WhiteLDAPCA.rda")
load("rda_objects/RedLDAPCA.rda")

# Decision Tree
load("rda_objects/WhiteWineDecisionTree.rda")
load("rda_objects/RedWineDecisionTree.rda")

# Random Forest 
load("rda_objects/WhiteWineRandomForest.rda")
load("rda_objects/RedWineRandomForest.rda")

# Boosted Trees
load("rda_objects/WhiteWineBoostedTrees.rda")
load("rda_objects/RedWineBoostedTrees.rda")

Model Fitting for White Wine

We will be fitting linear discriminant analysis, naive Bayes, single decision tree, random forest, and boosted tree models and compare accuracy metrics. Then, we will fit the three models with the best roc_auc to our testing data. First, let’s see how the models perform on the white wine data set.

Recipe

In our recipe for the white wine data set, we selected for all the predictors, converted character or factored data into numeric binary data, and normalized all predictors.

white_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())

K-Fold Cross Validation

Let’s first explore linear discriminant analysis and naive Bayes classification through k-fold cross validation.

Linear Discriminant Analysis

For the linear discriminant analysis model, we use a classification mode and set the engine to MASS. We then add the model and recipe to a workflow and create a fit between the workflow and folded data. We are using roc_auc to evaluate accuracy.

#set up model with mode classification and engine MASS
wlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

#add model and recipe to the workflow
wlda_wkflow<- workflow() %>% 
  add_model(wlda_model) %>% 
  add_recipe(white_recipe)

#create a fit between the workflow and folded data
wlda_fit_cross <- fit_resamples(wlda_wkflow, white_fold)

#determine the roc_auc of the LDA model on the folded training data
collect_metrics(wlda_fit_cross)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.528     5 0.00640 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.735     5 0.0159  Preprocessor1_Model1

Naive Bayes

Now, let’s take a look at how our cross validation method works with a Naive Bayes model. In particular, let’s see if the accuracy increases.

#set up model with mode classification and engine kLaR
#we used set_args(use_kernel = FALSE) based on Lab 3
wnb_mod <- naive_Bayes() %>% 
  set_mode("classification") %>% 
  set_engine("klaR") %>% 
  set_args(usekernel = FALSE) 

#add model and recipe to the workflow
wnb_wkflow <- workflow() %>% 
  add_model(wnb_mod) %>% 
  add_recipe(white_recipe)

#create a fit between the workflow and folded data
wnb_fit_cross <- fit_resamples(wnb_wkflow, white_fold)

#determine the roc_auc of the Naive Bayes model on the folded training data
collect_metrics(wnb_fit_cross)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.451     5 0.00984 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.734     5 0.0200  Preprocessor1_Model1

Through k-fold cross validation, we can see that the linear discriminant analysis model produces better accuracy than the Naive Bayes model comparing the accuracy metrics of .52 to .45 respectively.

To account for the collinearity between some of our predictors, which we saw in the exploratory data analysis, we will conduct principal component analysis on the data. Since our linear discriminant analysis model was better on the white wine data set, we will use the principal components on an LDA model.

Principal Component Analysis

To conduct principal component analysis, we will begin by setting up another recipe specifically for this purpose. With this, we conduct an LDA workflow and model fit. We are tuning the model to find the best number of principal components using k-fold cross validation.

white_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())

# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(white_recipe_pca$template))

# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)

wlda_pca_wkflow <- workflow() %>% 
  add_model(wlda_model) %>% 
  add_recipe(white_recipe_pca)

pca_lda_fit <- wlda_pca_wkflow %>%
  tune_grid(resamples = white_fold,
            grid = tuneGrid,
            metrics = metric_set(accuracy),
            control = trControl)

This visualization represents the number of principal components versus the accuracy of the model. We can observe an obvious spike in accuracy at 9 principal components.

pca_lda_metrics <- pca_lda_fit %>% collect_metrics()

ggplot(pca_lda_metrics, aes(x = num_comp, y = mean)) +
  geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
  scale_x_continuous(breaks = 1:ncol(white_recipe_pca$template)) +
  facet_wrap(~.metric) +
  theme_bw()

This is a visualization of the actual quality and the predicted qualities. We are only displaying about half of the data, so the plot is easier to interpret. The plot displays the clustering of the data very well for each quality level. In addition, we can visually see how well the model predicts the qualities accurately, and around how often/how greatly the model fails.

# plot
ggplot(white.PCALDA[1:1500,], aes(x = LD1, y = LD2)) +
  geom_point(aes(color = quality, shape = .pred_class)) + 
  theme_bw() +
  ggtitle("PCA-LDA (DAPC) on White Wine Training dataset, using 9 PC")

pcalda_acc

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.538

pcalda_rocauc

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                       
##   <chr>   <chr>          <dbl> <chr>                            
## 1 roc_auc hand_till      0.760 "White Wine LDA Model using PCA "

pcalda_roccurve

pcalda_confusionmatrix

These metrics and graphs reveal how accurately the principal component analysis works on the training data with a linear discriminant analysis model. The roc_auc curves are the best for qualities 4 and 8, although overall PCA LDA is not the most effective as the accuracy is merely 53%. Through the confusion matrix, we can see the model predicted 6s well.

Single Decision Tree

Since a value of about 53% is only moderately accurate, we will try several tree methods to see if the produce more accurate results on the training data set of white wine. First, we will look at the model for a single decision tree.

First, we set up a specification with the engine rpart and for classification.

# decision tree specification
wtree_spec <- decision_tree() %>%
  set_engine("rpart")

wtree_spec_class <- wtree_spec %>%
  set_mode("classification")

Next, we fit the specification to the training data.

wclass_tree_fit <- wtree_spec_class %>%
  fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)

Here is a visual of how the decision tree model works with our data.

wclass_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

Now, we augment the model on the training model and evaluate the accuracy and confusion matrix.

# augmented on training 
augment(wclass_tree_fit, new_data = white_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.529

augment(wclass_tree_fit, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class)%>% autoplot(type = "heatmap")

The accuracy is 52% and the confusion matrix shows us that 5 and 6 quality are evaluated the best.

Here, we are tuning our model to determine the best measures for cost_complexity.

# tuning cost complexity 
wclass_tree_wf<- workflow() %>%
  add_model(wtree_spec_class %>% 
              set_args(cost_complexity = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

param_grid <- grid_regular(cost_complexity(range = c(-3,-1)), levels = 10)

tune_res_white <- tune_grid(
  wclass_tree_wf,
  resamples = white_fold,
  grid = param_grid,
  metric = metric_set(accuracy)
)

Now we produce a graph of accuracies and roc_auc levels for various cost complexity parameters.

wAutoPlot

Here, we can see that accuracy and roc_auc are both highest at around .005 value for the cost-complexity parameter.

# extracting the best cost complexity parameter
best_complexity <- select_best(tune_res_white)
wclass_tree_final <- finalize_workflow(wclass_tree_wf, best_complexity)
wclass_tree_final_fit <- fit(wclass_tree_final, data = white_train)

Using the measure of cost complexity which produces the best accuracy and roc_auc levels, this is a visualization of the decision tree that is used in the model. It is very precise and may show signs that it would overfit on testing data.

wclass_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
wdectree_pred <- augment(wclass_tree_final_fit, new_data = white_train) 
wdectree_acc <- wdectree_pred %>% accuracy(truth = quality, estimate = .pred_class)
wdectree_rocauc <- wdectree_pred %>% roc_auc(truth = quality, estimate = .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Decision Tree Model")
wdectree_roccurve <- wdectree_pred %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
wdectree_confusionmatrix <- augment(wclass_tree_final_fit, new_data = white_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

This is the overall accuracy measures, roc_auc measures, ROC curves and confusion matrix that is returned my our decision tree model on the white wine data set. The accuracy is similar to the principal component model with linear discriminant analysis that we previously conducted.

print(wdectree_acc)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.647

print(wdectree_rocauc)

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.818 White Decision Tree Model

wdectree_roccurve

wdectree_confusionmatrix

From the ROC curve, we can see that the decision tree model produces very similar curves for each value of quality, and from the heat map, we can see that the model predicted 5s and 6s with the highest accuracy.

Random Forest

In hopes of improving our results, we will now look at the random forest model. Here, we set up our model with the engine ranger and set the importance to impurity to find a set of predictors that best explains the variance in the response variable, and still with a classification approach.

# setting random forest model up
wrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")

Then, we set up our workflow, tuning range, trees and min_n

wrandfor_wf <- workflow() %>%
  add_model(wrandfor %>%
              set_args(mtry = tune(), trees = tune(), min_n = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

This next visualization shows the accuracy and roc_auc of the various values we are tuning our model with. For the final recipe, we will extract the metrics that return the best accuracy and roc_auc.

autoplot(wtune_res_randfor)

Now we will collect accuracy metrics to find the model with the highest mean.

Here are our final results.

wVIP

print(wrandfor_acc)

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.800 White Random Forest Model

print(wrandfor_rocauc)

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.968 White Random Forest Model

wrandfor_roccurve

wrandfor_confusionmatrix

Overall the random forest for the white wine data set has a drastic improvement in terms of the accuracy and roc_auc. The variance importance plot reveals that alcohol, volatile acidity and density are most important variables in the data. Every single roc_auc curve is almost perfect, although as we have learned through the course, this model may be over fitting, and as a result, may not perform as well on the testing data set. This model did the best for predicting wine qualities of 5, 6, and 7.

Boosted Trees

Finally, let’s look at the boosted tree model. We set up our specifications with engine xgboost and with a classification approach.

wboost_spec <-  boost_tree(tree_depth = 5) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

Then, we set up our workflow, tuning only trees this time.

wboost_wf <- workflow() %>%
  add_model(wboost_spec %>%
              set_args(trees = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

This plot below visualizes the accuracy and roc_auc levels for the number of trees. Since we tune the value for the number of trees, we can choose to use the number of trees which has the highest accuracy and roc_auc levels.

wBoostedAutoPlot

Here, the accuracy and roc_auc graphs both spike right before 250 trees.

Now, we collect accuracy metrics to see which model has the highest mean.

wbest_rocauc2 <- collect_metrics(wtune_res_boosted) %>% arrange(desc(mean))
wbest_metric2 <- select_best(wtune_res_boosted)
print(wbest_metric2)

wboost_final <- boost_tree(tree_depth = 5, trees = 231)%>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

wboost_fit_final <- fit(wboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = white_train)

Here are our final results.

wboosted_acc

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass         1 White Boosted Trees Model

wboosted_rocauc

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till          1 White Boosted Trees Model

wBoostedROCCurve

wBoostedConfusionMatrix

The accuracy and roc_auc for the boosted trees model on the white wine data set is 1, meaning every single observation was correctly classified in the model. This model has the potential to run amazingly on the testing data, although there is undeniable evidence of overfitting.

Determining Best Fit

wbest_roc_table <- rbind(wbest_rocauc[1,c(2,4)], wbest_rocauc1[1,c(4,6)], wbest_rocauc2[1,c(2,4)] ) %>% mutate(model_type = c("Decision Tree", "Random Forest", "Boosted Trees"))
wbest_roc_table

## # A tibble: 3 × 3
##   .metric  mean model_type   
##   <chr>   <dbl> <chr>        
## 1 roc_auc 0.688 Decision Tree
## 2 roc_auc 0.794 Random Forest
## 3 roc_auc 0.805 Boosted Trees

# We will test our model using Boosted Trees and Random Forest

As we can see, between the three tree based methods we ran, the two with the highest roc_auc values were Boosted Trees and Random Forest. From our earlier calculations, linear discriminant analysis using principal component analysis had a higher accuracy than Decision Trees so we will use that as our third model.

To conclude, we will test our data on three models: LDA using PCA, Random Forest and Boosted Trees.

Predicting on the Testing Data

Testing the PCA LDA model on the data:

pcalda_test_acc

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.514

pcalda_test_rocauc

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                    
##   <chr>   <chr>          <dbl> <chr>                         
## 1 roc_auc hand_till      0.685 White Wine LDA Model using PCA

pcalda_test_roccurve

pcalda_test_confusionmatrix

Testing the Random Forest model on the data:

wrandfor_acc_test

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.608 White Random Forest Model

wrandfor_rocauc_test

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.764 White Random Forest Model

wrandfor_roccurve_test

wrandfor_confusionmatrix_test

Testing the Boosted Trees model on the data:

wboosted_acc_test

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.657 White Boosted Trees Model

wboosted_rocauc_test

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.821 White Boosted Trees Model

wBoostedROCCurveTesting

wBoostedConfusionMatrixTesting

Conclusion for White Wine Data Set

In conclusion, the Boosted Trees model did the best on the testing data for the White Wine data set with an accuracy of 65.7% and a roc auc level of 0.82. The Random Forest model is a close second with accuracy 60.8% and a roc auc value of 0.76. LDA did not work very well, and using principal component analysis did not have much of an improvement. We thought that it would due to the fact that many of our predictors had high correlation. Majority of our models do well in predicting values of 5 and 6. Although, in general, we have seen that all of the models thus far do not do as well with 3,4, and 8. This is probably due to the fact that there are not as many observations with this quality level.

Model Fitting for Red Wine

Next, let’s see how the models perform on the red wine data set. We will be using the same type of models that we used in white wine in order to keep things consistent.

Recipe

In our recipe for the red wine data set, we also selected for all the predictors, converted character or factored data into numeric binary data, and normalized all predictors.

red_recipe <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors())

K-Fold Cross Validation

Let’s first explore linear discriminant analysis and naive Bayes classification through k-fold cross validation.

Linear Discriminant Analysis

#lda model using cross validation
rlda_model <- discrim_linear() %>% 
  set_mode("classification") %>% 
  set_engine("MASS") 

rlda_wkflow<- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe)

rlda_fit_cross <- fit_resamples(rlda_wkflow, red_fold)

collect_metrics(rlda_fit_cross)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.601     5 0.0127  Preprocessor1_Model1
## 2 roc_auc  hand_till  0.786     5 0.00468 Preprocessor1_Model1

Naive Bayes

Now, let’s take a look at how our cross validation method works with a Naive Bayes model. In particular, let’s see if the accuracy increases.

#naive bayes model using cross validation
rnb_mod <- naive_Bayes() %>% 
  set_mode("classification") %>% 
  set_engine("klaR") %>% 
  set_args(usekernel = FALSE) 

rnb_wkflow <- workflow() %>% 
  add_model(rnb_mod) %>% 
  add_recipe(red_recipe)

rnb_fit_cross <- fit_resamples(rnb_wkflow, red_fold)

collect_metrics(rnb_fit_cross)

## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.534     5 0.00828 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.743     5 0.00714 Preprocessor1_Model1

Through k-fold cross validation, we can see that the linear discriminant analysis model produces a more accurate model than the Naive Bayes model with 60% versus 53% accuracy.

Principal Component Analysis

red_recipe_pca <- recipe(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train) %>% step_dummy(all_nominal_predictors()) %>% step_normalize(all_predictors()) %>% step_pca(all_numeric_predictors(), num_comp = tune())

# column name(s) must match tune() above
tuneGrid <- expand.grid(num_comp = 1:ncol(red_recipe_pca$template))

# control tune_grid() process below
trControl <- control_grid(verbose = TRUE, allow_par = FALSE)

rlda_pca_wkflow <- workflow() %>% 
  add_model(rlda_model) %>% 
  add_recipe(red_recipe_pca)

rpca_lda_fit <- rlda_pca_wkflow %>%
  tune_grid(resamples = red_fold,
            grid = tuneGrid,
            metrics = metric_set(accuracy),
            control = trControl)

rpca_lda_metrics <- rpca_lda_fit %>% collect_metrics()
rpca_lda_metrics

## # A tibble: 12 × 7
##    num_comp .metric  .estimator  mean     n std_err .config              
##       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
##  1        1 accuracy multiclass 0.401     5 0.0145  Preprocessor01_Model1
##  2        2 accuracy multiclass 0.512     5 0.0153  Preprocessor02_Model1
##  3        3 accuracy multiclass 0.579     5 0.00881 Preprocessor03_Model1
##  4        4 accuracy multiclass 0.579     5 0.00945 Preprocessor04_Model1
##  5        5 accuracy multiclass 0.588     5 0.00749 Preprocessor05_Model1
##  6        6 accuracy multiclass 0.581     5 0.00692 Preprocessor06_Model1
##  7        7 accuracy multiclass 0.592     5 0.0152  Preprocessor07_Model1
##  8        8 accuracy multiclass 0.590     5 0.0134  Preprocessor08_Model1
##  9        9 accuracy multiclass 0.606     5 0.0116  Preprocessor09_Model1
## 10       10 accuracy multiclass 0.604     5 0.0112  Preprocessor10_Model1
## 11       11 accuracy multiclass 0.601     5 0.0127  Preprocessor11_Model1
## 12       12 accuracy multiclass 0.601     5 0.0127  Preprocessor12_Model1

This next visualization represents the number of principal components versus the accuracy of the model. We can observe an obvious spike in accuracy also at 9 principal components.

ggplot(rpca_lda_metrics, aes(x = num_comp, y = mean)) +
  geom_line(color = "#3E4A89FF", linewidth = 2, alpha = 0.6) +
  scale_x_continuous(breaks = 1:ncol(red_recipe_pca$template)) +
  facet_wrap(~.metric) +
  theme_bw()

This next graph is a visualization of the actual quality and the predicted qualities. We are only displaying about half of the data, so the plot is easier to interpret. The plot displays the clustering of the data very well for each quality level. In addition, we can visually see how well the model predicts the qualities accurately, and around how often or how greatly the model fails.

# plot
ggplot(red.PCALDA, aes(x = LD1, y = LD2)) +
  geom_point(aes(color = quality, shape = .pred_class)) + 
  theme_bw() +
  ggtitle("PCA-LDA (DAPC) on Red Wine Training dataset, using 9 PC")

rpcalda_acc

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.618

rpcalda_rocauc

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                       
##   <chr>   <chr>          <dbl> <chr>                            
## 1 roc_auc hand_till      0.826 "White Wine LDA Model using PCA "

rpcalda_roccurve

rpcalda_confusionmatrix

These metrics and graphs reveal how accurately the principal component analysis works on the training data with a linear discriminant analysis model. The roc_auc curves are the best for qualities 3 and 8, although overall PCA LDA is still only moderately effective as the accuracy is 61%. Through the confusion matrix, we can see the model predicted 5s well.

Single Decision Tree

Since a value of about 61% is only moderately accurate, we will try several tree methods to see if the produce more accurate results on the training data set of red wine. First, we will look at the model for a single decision tree.

First, we set up a specification with the engine rpart and for classification.

# decision tree specification
rtree_spec <- decision_tree() %>%
  set_engine("rpart")

# setting mode to classification
rtree_spec_class <- rtree_spec %>%
  set_mode("classification")

Next, we fit the specification to the training data.

rclass_tree_fit <- rtree_spec_class %>%
  fit(quality ~ volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)

Here is a visual of how the decision tree model works with our data.

rclass_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

Now, we augment the model on the training model and evaluate the accuracy and confusion matrix. The accuracy is 61% and the confusion matrix shows us that 5 and 6 quality are evaluated the best, as in the white wine single decision tree model.

# augmented on training 
augment(rclass_tree_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.619

augment(rclass_tree_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

Here, we are tuning our model to determine the best measures for cost_complexity.

# tuning cost complexity 
rclass_tree_wf<- workflow() %>%
  add_model(rtree_spec_class %>% 
              set_args(cost_complexity = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

Now we produce a graph of accuracies and roc_auc levels for various cost complexity parameters.

autoplot(tune_res_red)

Here, we can see that accuracy is highest at around .0075 value and roc_auc is highest around .010 value for the cost-complexity parameter.

rclass_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

# augmented on training 
augment(rclass_tree_final_fit, new_data = red_train) %>%
  accuracy(truth = quality, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.667

augment(rclass_tree_final_fit, new_data = red_train) %>%
  conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

# augmented on testing 
augment(rclass_tree_final_fit, new_data = red_test) %>%
  conf_mat(truth = quality, estimate = .pred_class) 

augment(rclass_tree_final_fit, new_data = red_test) %>%
  accuracy(truth = quality, estimate = .pred_class)

This is the overall accuracy measures, roc_auc measures, ROC curves and confusion matrix that is returned my our decision tree model on the white wine data set. The accuracy is slightly less than the principal component model with linear discriminant analysis that we previously conducted.

Random Forest

rrandfor <- rand_forest() %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")

Then, we set up our workflow, tuning range, trees and min_n

rrandfor_wf <- workflow() %>%
  add_model(rrandfor %>%
              set_args(mtry = tune(), trees = tune(), min_n = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

This visualization shows the accuracy and roc_auc of the various values we are tuning our model with. For the final recipe, we will extract the metrics that return the best accuracy and roc_auc.

rAutoPlotRF

Now we will collect accuracy metrics to find the model with the highest mean.

# collecting metrics to find best mean
rbest_rocauc1 <- collect_metrics(rtune_res_randfor) %>% arrange(desc(mean))
print(rbest_rocauc1)
rbest_metric1 <- select_best(rtune_res_randfor)

rrandfor_final <- rand_forest(mtry = 7, trees = 17, min_n = 32) %>% set_engine("ranger", importance = "impurity") %>% set_mode("classification")
rrandfor_fit_final <- fit(rrandfor_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol,data = red_train)

Here are our final results.

rVIP

print(rrandfor_acc)

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type             
##   <chr>    <chr>          <dbl> <chr>                  
## 1 accuracy multiclass     0.838 Red Random Forest Model

print(rrandfor_rocauc)

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.963 White Random Forest Model

rrandfor_roccurve

rrandfor_confusionmatrix

Overall the random forest for the red wine data set has a drastic improvement in terms of the accuracy and roc_auc. The variance importance plot reveals that alcohol, volatile acidity and sulphates are most important variables in the data for predicting quality. Every single roc_auc curve is almost perfect, although as we have learned through the course, this model may be over fitting, and as a result, may not perform as well on the testing data set. This model did the best for predicting wine qualities of 5 and 6.

Boosted Tree

Finally, let’s look at the boosted tree model. We set up our specifications with engine xgboost and with a classification approach.

rboost_spec <-  boost_tree(tree_depth = 5) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

Then, we set up our workflow, tuning only trees this time.

rboost_wf <- workflow() %>%
  add_model(rboost_spec %>%
              set_args(trees = tune())) %>% 
  add_formula(quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol)

rBoostedAutoPlot

Here we can see that accuracy is highest for 250 to 500 trees, and roc_auc spikes at around 250 trees.

Now, we collect accuracy metrics to see which model has the highest mean.

rbest_rocauc2 <- collect_metrics(rtune_res_boosted) %>% arrange(desc(mean))
print(rbest_rocauc2)

rbest_metric2 <- select_best(rtune_res_boosted)
print(rbest_metric2)

rboost_final <- boost_tree(tree_depth = 5, trees = 231)%>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
rboost_fit_final <- fit(rboost_final, formula = quality ~ volatile.acidity + fixed.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_train)

rpredicted <- augment(rboost_fit_final, new_data = red_train) 
rboosted_acc <- rpredicted %>% accuracy(truth = quality, estimate = .pred_class) %>% mutate(model_type = "White Boosted Trees Model")
rboosted_rocauc <-  rpredicted %>% roc_auc(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% mutate(model_type = "White Boosted Trees Model")
rBoostedROCCurve <- rpredicted %>% roc_curve(quality, .pred_3,.pred_4, .pred_5 , .pred_6 , .pred_7, .pred_8) %>% autoplot()
rBoostedConfusionMatrix <- rpredicted %>% conf_mat(truth = quality, estimate = .pred_class) %>% autoplot(type = "heatmap")

Here are our final results.

rboosted_acc

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass         1 White Boosted Trees Model

rboosted_rocauc

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till          1 White Boosted Trees Model

rBoostedROCCurve

rBoostedConfusionMatrix

Based on these results, we can see that the boosted tree model also overfitted our red wine data set by our accuracy and roc_auc values of 1.

Determining Best Fit

rbest_roc_table

## # A tibble: 3 × 3
##   .metric  mean model_type   
##   <chr>   <dbl> <chr>        
## 1 roc_auc 0.693 Decision Tree
## 2 roc_auc 0.799 Random Forest
## 3 roc_auc 0.796 Boosted Trees

As we can see, between the three tree based methods we ran, the two with the highest roc_auc values were Boosted Trees and Random Forest. LDA using PCA had a higher accuracy than Decision Trees so we will use that as our third model.

To conclude, we will test our data on three models: LDA using PCA, Random Forest and Boosted Trees.

Predicting on the Testing Data

Testing the PCA LDA model on the data:

rpcalda_test_acc

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy multiclass     0.561

rpcalda_test_rocauc

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type                  
##   <chr>   <chr>          <dbl> <chr>                       
## 1 roc_auc hand_till      0.778 Red Wine LDA Model using PCA

rpcalda_test_roccurve

rpcalda_test_confusionmatrix

Testing the Random Forest model on the data:

rrandfor_pred_test

## # A tibble: 481 × 20
##    fixed…¹ volat…² citri…³ resid…⁴ chlor…⁵ free.…⁶ total…⁷ density    pH sulph…⁸
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>
##  1     7.8   0.76     0.04     2.3   0.092      15      54   0.997  3.26    0.65
##  2     5.6   0.615    0        1.6   0.089      16      59   0.994  3.58    0.52
##  3     8.9   0.62     0.19     3.9   0.17       51     148   0.999  3.17    0.93
##  4     8.1   0.56     0.28     1.7   0.368      16      56   0.997  3.11    1.28
##  5     7.9   0.32     0.51     1.8   0.341      17      56   0.997  3.04    1.08
##  6     8.9   0.22     0.48     1.8   0.077      29      60   0.997  3.39    0.53
##  7     7.6   0.39     0.31     2.3   0.082      23      71   0.998  3.52    0.65
##  8     8.5   0.49     0.11     2.3   0.084       9      67   0.997  3.17    0.53
##  9     7.6   0.41     0.24     1.8   0.08        4      11   0.996  3.28    0.59
## 10     6.9   0.685    0        2.5   0.105      22      37   0.997  3.46    0.57
## # … with 471 more rows, 10 more variables: alcohol <dbl>, quality <fct>,
## #   type <chr>, .pred_class <fct>, .pred_3 <dbl>, .pred_4 <dbl>, .pred_5 <dbl>,
## #   .pred_6 <dbl>, .pred_7 <dbl>, .pred_8 <dbl>, and abbreviated variable names
## #   ¹fixed.acidity, ²volatile.acidity, ³citric.acid, ⁴residual.sugar,
## #   ⁵chlorides, ⁶free.sulfur.dioxide, ⁷total.sulfur.dioxide, ⁸sulphates
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

rrandfor_acc_test

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type             
##   <chr>    <chr>          <dbl> <chr>                  
## 1 accuracy multiclass     0.632 Red Random Forest Model

rrandfor_rocauc_test

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.792 White Random Forest Model

rrandfor_roccurve_test

rrandfor_confusionmatrix_test

Testing the Boosted Trees model on the data:

rpredictedtest

## # A tibble: 481 × 20
##    fixed…¹ volat…² citri…³ resid…⁴ chlor…⁵ free.…⁶ total…⁷ density    pH sulph…⁸
##      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl>   <dbl>
##  1     7.8   0.76     0.04     2.3   0.092      15      54   0.997  3.26    0.65
##  2     5.6   0.615    0        1.6   0.089      16      59   0.994  3.58    0.52
##  3     8.9   0.62     0.19     3.9   0.17       51     148   0.999  3.17    0.93
##  4     8.1   0.56     0.28     1.7   0.368      16      56   0.997  3.11    1.28
##  5     7.9   0.32     0.51     1.8   0.341      17      56   0.997  3.04    1.08
##  6     8.9   0.22     0.48     1.8   0.077      29      60   0.997  3.39    0.53
##  7     7.6   0.39     0.31     2.3   0.082      23      71   0.998  3.52    0.65
##  8     8.5   0.49     0.11     2.3   0.084       9      67   0.997  3.17    0.53
##  9     7.6   0.41     0.24     1.8   0.08        4      11   0.996  3.28    0.59
## 10     6.9   0.685    0        2.5   0.105      22      37   0.997  3.46    0.57
## # … with 471 more rows, 10 more variables: alcohol <dbl>, quality <fct>,
## #   type <chr>, .pred_class <fct>, .pred_3 <dbl>, .pred_4 <dbl>, .pred_5 <dbl>,
## #   .pred_6 <dbl>, .pred_7 <dbl>, .pred_8 <dbl>, and abbreviated variable names
## #   ¹fixed.acidity, ²volatile.acidity, ³citric.acid, ⁴residual.sugar,
## #   ⁵chlorides, ⁶free.sulfur.dioxide, ⁷total.sulfur.dioxide, ⁸sulphates
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

rboosted_acc_test

## # A tibble: 1 × 4
##   .metric  .estimator .estimate model_type               
##   <chr>    <chr>          <dbl> <chr>                    
## 1 accuracy multiclass     0.674 White Boosted Trees Model

rboosted_rocauc_test

## # A tibble: 1 × 4
##   .metric .estimator .estimate model_type               
##   <chr>   <chr>          <dbl> <chr>                    
## 1 roc_auc hand_till      0.774 White Boosted Trees Model

rBoostedROCCurveTesting

rBoostedConfusionMatrixTesting

Conclusion for Red Wine Data Set

In conclusion, the Boosted Trees model did the best on the testing data for the Red Wine data set with an accuracy of 67.4% and a roc_auc level of 0.77. The Random Forest model is a close second with accuracy 63.2% and a roc_auc value of 0.79. LDA did not work very well, and using principal component analysis did not have much of an improvement. We thought that it would due to the fact that many of our predictors had high correlation. Majority of our models do well in predicting values of 5 and 6. Although, in general, we have seen that all of the models thus far do not do as well with 3, 4, and 8. This is probably due to the fact that there are not as many observations with this quality level.

Final Thoughts

Overall, we have successfully conducted analysis on both the white wine data set and the red wine data set by constructing 4 models for each wine type and analyzing the results. Generally, tree based models performed the best for predicting the quality of wines given the 11 predictors. For both data sets, the boosted trees model did the best in terms of the accuracy and roc_auc. The naive Bayes and LDA models with and without PCA did not compare to the tree based methods in terms of these metrics. Although it is important to note that tree based methods require much more computing power than the other models we looked into.

As well as a fruitful exercise in machine learning and data modelling, we learned how the quality of white wine generally increases with alcohol content, and red wine is more desirable when it is less acidic. If we were to repeat this supervised machine learning project again, we would want to take into account a greater number of categorical predictors (if available) such as location grown, type of soil, and type of grape. In our code, we would modify our decision tree and boosted tree recipes and models to avoid overfitting the data to the training set.

Supervised Machine Learning Project – Wine Quality

Safiya Alavi and Maya Sinha

2022-12-10

Overview of Data Sets

Overview of Research Questions

Loading Data

Data Cleaning

Data Split

Exploratory Data Analysis

Correlation Matrices

White

Red

Scatter plots

White

Red

Loading in Saved Models

Model Fitting for White Wine

Recipe

K-Fold Cross Validation

Linear Discriminant Analysis

Naive Bayes

Principal Component Analysis

Single Decision Tree

Random Forest

Boosted Trees

Determining Best Fit

Predicting on the Testing Data

Testing the PCA LDA model on the data:

Testing the Random Forest model on the data:

Testing the Boosted Trees model on the data:

Conclusion for White Wine Data Set

Model Fitting for Red Wine

Recipe

K-Fold Cross Validation

Linear Discriminant Analysis

Naive Bayes

Principal Component Analysis

Single Decision Tree

Random Forest

Boosted Tree

Determining Best Fit

Predicting on the Testing Data

Testing the PCA LDA model on the data:

Testing the Random Forest model on the data:

Testing the Boosted Trees model on the data:

Conclusion for Red Wine Data Set

Final Thoughts