
# Wine Quality Prediction

## Problem

Here I’m looking at samples of Vinho Verde red wines from Portugal.

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

The idea is to work out which of the variables is most important in determining the quality of the red wine, the output variable, which is scored on a scale from 0 to 10.

The input variables (based on physicochemical tests) are:

1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol

## Applications

There are a few wineries, gin distilleries and rum distilleries around where I live. They employ lab technicians who take samples of the alcohol and run measurements on them. However, I’d be astonished if they knew that prediction and inference from those measures could be used to determine the quality of the product.

The idea that they could improve the quality of their product through predictive modelling would, I’d suggest, be entirely new to them. It really goes a step beyond simple measures of quality control.

## Method

I decided to do this analysis in R, simply because I have been using Python a bit lately. I always love mucking around with R, I’m a big fan of some of the “apply” functions. I reckon it is neat the way that R encourages (albeit subtly) the user to move to functional programming over time.

### 1) Check the distribution of variables

```r
library(ggplot2)

lapply(dataset, function(var) { ggplot(dataset, aes(x=var)) + geom_histogram() })
```

I mean, that’s a one-liner and it is just elegant!

It gives the histogram of all the variables in the dataset. It’s enough for a quick visual inspection of the variables.
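One quirk of that one-liner: because `lapply` hands each column’s *values* to the anonymous function, every histogram ends up with `var` as its x-axis label. A small variant (just a sketch, assuming ggplot2 ≥ 3.0 for the `.data` pronoun) iterates over the column names instead, so each plot is labelled with its variable:

```r
library(ggplot2)

# iterate over column names so each histogram keeps its variable name
plots <- lapply(names(dataset), function(nm) {
  ggplot(dataset, aes(x = .data[[nm]])) +
    geom_histogram(bins = 30) +
    labs(x = nm)
})
```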

### 2) Split into train and test datasets

The model will be trained on the training dataset and will be applied to an unseen test dataset for evaluation. The caret package is very much a Swiss Army knife for modelling and preprocessing in R. Below I am splitting the data into training and test samples and then centering and scaling each predictor.

```r
library(caret)

# train/test split, stratified on the outcome
set.seed(1234)
trainIndex <- createDataPartition(dataset$quality, p=0.8, list=FALSE)
train <- dataset[trainIndex,]
test <- dataset[-trainIndex,]

# centre and scale the predictors (column 12 is the outcome)
scaledValues <- preProcess(train[,-12], method=c("center", "scale"))
trainScaled <- predict(scaledValues, train[,-12])
testScaled <- predict(scaledValues, test[,-12])

# append outcome variable back to train and test
train <- cbind(trainScaled, quality=train[,12])
test <- cbind(testScaled, quality=test[,12])
```
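As a quick sanity check (my addition, not part of the original pipeline), the scaled training predictors should now have means of roughly 0 and standard deviations of 1; the test set is scaled using the training set’s parameters, which avoids leaking test information into the model:

```r
# means should be ~0 and standard deviations ~1 on the training predictors
round(colMeans(train[,-12]), 3)
round(apply(train[,-12], 2, sd), 3)
```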

### 3) Try different models

I don’t really know what kind of model will work best with this data, so I will compare a number of models using cross-validation to see what the best approach is. I’ve written my own “fitMods” helper function to avoid repetition in the code.

```r
# Compare models
# ==============

set.seed(1234)
fitControl <- trainControl(method="repeatedcv",
                           number=10,
                           repeats=10)

# helper model function
fitMods <- function(method) {
  train(quality ~ ., data=train,
        method=method,
        trControl=fitControl)
}

# linear stepwise regression
linReg <- fitMods(method="glmStepAIC")

# lasso regression
lassoReg <- fitMods(method="lasso")

# support vector regression
svrPoly <- fitMods(method="svmPoly")
svrLinear <- fitMods(method="svmLinear")

# decision tree
dTree <- fitMods(method="rpart")

# random forest
forest <- fitMods(method="rf")
```

### 4) Evaluate the accuracy of each model

This is a pretty trivial thing to do in R; it turns out our random forest model is best.

```r
results <- resamples(list(linReg=linReg, lassoReg=lassoReg,
                          svrPoly=svrPoly, svrLinear=svrLinear,
                          dTree=dTree, forest=forest))

summary(results)
dotplot(results)
```

There’s a fair bit of variability due to our small sample size, but we can see the R-squared of the random forest is superior (we get a range of values here because we fitted each model with repeated 10-fold cross-validation).

A visual representation of the above makes the idea clearer. It’s clear that the relationship between the predictors and quality is not a linear one, as the linear models, the stepwise and lasso regressions and the support vector machine with a linear kernel, have worse results (the single decision tree also lags well behind the random forest):
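Since we held out a test set back in step 2, it’s also worth checking the winning model against data it has never seen. A sketch using caret’s `postResample`, which reports RMSE, R-squared and MAE:

```r
# evaluate the random forest on the held-out test set
preds <- predict(forest, newdata = test)
postResample(pred = preds, obs = test$quality)
```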

### 5) Variable importance plot

From the variable importance plot we can see the alcohol variable stands out from the other variables.
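For reference, the variable importance plot comes straight out of caret; something along these lines (assuming the fitted `forest` object from step 3) produces it:

```r
# extract and plot variable importance from the random forest
importance <- varImp(forest)
plot(importance)
```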

The reason why can be seen in the below correlation plot:
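The correlation plot itself can be generated with the corrplot package (an assumption on my part — any correlation heatmap tool would do the job):

```r
library(corrplot)

# pairwise correlations between all variables, including quality
corrplot(cor(dataset), method = "circle", type = "upper")
```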

What we have is correlated variables in the dataset. Alcohol content is positively correlated with the wine quality score, but is relatively uncorrelated with the other variables. So, all else being equal, to increase the quality of the wine you’d just add more alcohol? Sounds about right, but the story is far more subtle. Winemaking is a tricky business that involves balancing many factors, which explains why the most complex, non-linear model was the most successful in predicting quality. More on the debate about wine quality and alcohol content can be found here (interestingly, alcohol content in wines has been increasing since the 1980s):

https://www.decanter.com/features/alcohol-levels-the-balancing-act-246426/

It is also interesting to read down the last column on the right-hand side of the above correlation plot. Wine quality is negatively correlated with volatile.acidity, which means you’d want to go easy on that: volatile acidity is what gives red wine that sharp “vinegary” flavour. Check out the article at winemakermag.com:

https://winemakermag.com/article/676-the-perils-of-volatile-acidity

## Caveats

The measure of wine quality is a subjective one: how different, really, is a 6 from a 7? Yet I have modelled wine quality as a continuous rather than a categorical variable.
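If you wanted to take the ordinal nature of the score seriously, one alternative (a sketch, not something I ran for this post) is to recode the outcome as a factor and refit the random forest as a classifier, then inspect the cross-validated confusion matrix:

```r
# recode quality as a factor and refit as a classification problem
trainCls <- train
trainCls$quality <- factor(trainCls$quality)

set.seed(1234)
forestCls <- train(quality ~ ., data = trainCls,
                   method = "rf", trControl = fitControl)

# cross-validated confusion matrix
confusionMatrix(forestCls)
```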