While I am reading Elements of Statistical Learning, I figured it would be a good idea to try to use the machine learning methods introduced in the book. I just finished a chapter on linear regression, and learned more about linear regression and the penalized methods (Ridge and Lasso). Since there is an abundant resource available online, it would be redundant to get into the details. I’ll quickly go over Ordinary Least Squares, Ridge, and Lasso regression, and quickly show an application of those methods in R.

## Overview

All three of the models assume a linear model in which the output is a linear combination of the features (dependent variables).

\[y = \beta_0 + \Sigma_{i=1}^p\beta_ix_i\]

The difference comes from how the \(\beta_i\) are estimated.

**Ordinary Least Squares** method picks \(\hat{\beta}\) that minimizes the residual sum of squares:
\[RSS(\beta) = \Sigma_{i=1}^N (y_i - (\hat{\beta}_0 + \Sigma_{i=1}^p\hat{\beta}_ix_i))^2\]
**Ridge Regression** minimizes a function that takes into account RSS as well as the L2 Norm of \(\beta\) values:
\[RSS_{ridge}(\beta) = \Sigma_{i=1}^N (y_i - (\hat{\beta}_0 + \Sigma_{i=1}^p\hat{\beta}_ix_i))^2 + \lambda\Sigma_{j=1}^p\beta_j^2\]

**Lasso Regression** is similar to ridge regression, except that it takes into account the L1 Norm of \(\beta\) values:
\[RSS_{lasso}(\beta) = \Sigma_{i=1}^N (y_i - (\hat{\beta}_0 + \Sigma_{i=1}^p\hat{\beta}_ix_i))^2 + \lambda\Sigma_{j=1}^p|\beta_j|\]

\(\lambda\) is a hyperparameter that determines how much penalty to give to the norms. Higher \(\lambda\) will get the \(\beta\) reaching 0, while \(\lambda\) approaching 0 gets \(\beta\) close to the OLS estimates.

The extra penalty in the norms will cause the estimators to be “shrunk” in absolute values.

I wanted to try all three types of regression using `R`

, so I grabbed a coffee rating dataset from tidy tuesday. I chose this dataset after skimming through a bunch in the list for its simple description and the availability of the numeric variables for a response variable as well as the feature variables.

The categorical variables could very well be coded into 1’s and 0’s for linear regression and the penalized regression, but the question remains: how can they also be standardized for “fair” evaluation? This is another topic in itself and I don’t plan to address this in this post, and only numerical variables will be included for the models.

## Quick Data Clean

`suppressPackageStartupMessages(library(tidyverse))`

`## Warning: package 'ggplot2' was built under R version 4.0.5`

`## Warning: package 'dplyr' was built under R version 4.0.5`

```
suppressPackageStartupMessages(library(naniar))
suppressPackageStartupMessages(library(gplots))
suppressPackageStartupMessages(library(glmnet))
# source of the data
# cofdat_tmp <-
# readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
cofdat_tmp <-
readr::read_csv('coffee_ratings.csv')
```

```
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## total_cup_points = col_double(),
## number_of_bags = col_double(),
## aroma = col_double(),
## flavor = col_double(),
## aftertaste = col_double(),
## acidity = col_double(),
## body = col_double(),
## balance = col_double(),
## uniformity = col_double(),
## clean_cup = col_double(),
## sweetness = col_double(),
## cupper_points = col_double(),
## moisture = col_double(),
## category_one_defects = col_double(),
## quakers = col_double(),
## category_two_defects = col_double(),
## altitude_low_meters = col_double(),
## altitude_high_meters = col_double(),
## altitude_mean_meters = col_double()
## )
## i Use `spec()` for the full column specifications.
```

`print(dim(cofdat_tmp))`

`## [1] 1339 43`

`print(head(cofdat_tmp))`

```
## # A tibble: 6 x 43
## total_cup_points species owner country_of_orig~ farm_name lot_number mill
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 90.6 Arabica metad ~ Ethiopia "metad plc" NA meta~
## 2 89.9 Arabica metad ~ Ethiopia "metad plc" NA meta~
## 3 89.8 Arabica ground~ Guatemala "san marco~ NA NA
## 4 89 Arabica yidnek~ Ethiopia "yidnekach~ NA wole~
## 5 88.8 Arabica metad ~ Ethiopia "metad plc" NA meta~
## 6 88.8 Arabica ji-ae ~ Brazil NA NA NA
## # ... with 36 more variables: ico_number <chr>, company <chr>, altitude <chr>,
## # region <chr>, producer <chr>, number_of_bags <dbl>, bag_weight <chr>,
## # in_country_partner <chr>, harvest_year <chr>, grading_date <chr>,
## # owner_1 <chr>, variety <chr>, processing_method <chr>, aroma <dbl>,
## # flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
## # uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, cupper_points <dbl>,
## # moisture <dbl>, category_one_defects <dbl>, quakers <dbl>, color <chr>,
## # category_two_defects <dbl>, expiration <chr>, certification_body <chr>,
## # certification_address <chr>, certification_contact <chr>,
## # unit_of_measurement <chr>, altitude_low_meters <dbl>,
## # altitude_high_meters <dbl>, altitude_mean_meters <dbl>
```

```
# coffee type of score
ggplot(cofdat_tmp) +
geom_boxplot(aes(x = species, y = total_cup_points))
```

There’s one point where total rating is zero. It seems unusual for something to have a rating of complete 0, especially across all categories (except moisture?). I can’t imagine what it could have tasted like. In the end I’m not sure what led the judge to come to this conclusion, so for the time being, I removed this point as an outlier.

`print(cofdat_tmp[cofdat_tmp$total_cup_points == 0,], width = Inf)`

```
## # A tibble: 1 x 43
## total_cup_points species owner country_of_origin farm_name
## <dbl> <chr> <chr> <chr> <chr>
## 1 0 Arabica bismarck castro Honduras los hicaques
## lot_number mill ico_number company altitude region
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 103 cigrah s.a de c.v. 13-111-053 cigrah s.a de c.v 1400 comayagua
## producer number_of_bags bag_weight in_country_partner
## <chr> <dbl> <chr> <chr>
## 1 Reinerio Zepeda 275 69 kg Instituto Hondureño del Café
## harvest_year grading_date owner_1 variety processing_method aroma
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 2017 April 28th, 2017 Bismarck Castro Caturra NA 0
## flavor aftertaste acidity body balance uniformity clean_cup sweetness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0 0 0
## cupper_points moisture category_one_defects quakers color category_two_defects
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 0 0.12 0 0 Green 2
## expiration certification_body
## <chr> <chr>
## 1 April 28th, 2018 Instituto Hondureño del Café
## certification_address
## <chr>
## 1 b4660a57e9f8cc613ae5b8f02bfce8634c763ab4
## certification_contact unit_of_measurement
## <chr> <chr>
## 1 7f521ca403540f81ec99daec7da19c2788393880 m
## altitude_low_meters altitude_high_meters altitude_mean_meters
## <dbl> <dbl> <dbl>
## 1 1400 1400 1400
```

Since the data isn’t perfect, there are a lot of missing observations in some variables. `naniar`

package has some tools that help with exploring the missingness of the data with clever visualizations. The first plot highlights the missing observations (row) for each variable (column). This can help with getting an idea of how much missingness there is and how the missingness is related across observations. Fortunately, the rating for each category are all present, so I will be able to keep all the observations. It seems like there is a fair amount of missing value in the altitude, so altitude will be omitted.

The plots after show more information about the missing observations.

```
# missing data explore
vis_miss(cofdat_tmp, sort_miss = TRUE)
```

`visdat::vis_dat(cofdat_tmp)`

`gg_miss_var(cofdat_tmp, show_pct = TRUE)`

After applying my previous remarks, the data is prepared for analysis.

```
cofdat <- cofdat_tmp %>%
filter(total_cup_points > 50) %>%
select(
total_cup_points, aroma, flavor, aftertaste, acidity,
body, balance, uniformity, clean_cup, sweetness, cupper_points,
moisture, category_one_defects, category_two_defects, quakers
)
head(cofdat)
```

```
## # A tibble: 6 x 15
## total_cup_points aroma flavor aftertaste acidity body balance uniformity
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 90.6 8.67 8.83 8.67 8.75 8.5 8.42 10
## 2 89.9 8.75 8.67 8.5 8.58 8.42 8.42 10
## 3 89.8 8.42 8.5 8.42 8.42 8.33 8.42 10
## 4 89 8.17 8.58 8.42 8.42 8.5 8.25 10
## 5 88.8 8.25 8.5 8.25 8.5 8.42 8.33 10
## 6 88.8 8.58 8.42 8.42 8.5 8.25 8.33 10
## # ... with 7 more variables: clean_cup <dbl>, sweetness <dbl>,
## # cupper_points <dbl>, moisture <dbl>, category_one_defects <dbl>,
## # category_two_defects <dbl>, quakers <dbl>
```

There is one missing observation in `quaker`

variable, which stores the number of “bad” beans mixed in with the good beans. This is removed as well.

```
# one missing value in quakers variable
cofdat %>% miss_var_summary()
```

```
## # A tibble: 15 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 quakers 1 0.0747
## 2 total_cup_points 0 0
## 3 aroma 0 0
## 4 flavor 0 0
## 5 aftertaste 0 0
## 6 acidity 0 0
## 7 body 0 0
## 8 balance 0 0
## 9 uniformity 0 0
## 10 clean_cup 0 0
## 11 sweetness 0 0
## 12 cupper_points 0 0
## 13 moisture 0 0
## 14 category_one_defects 0 0
## 15 category_two_defects 0 0
```

`cofdat <- na.omit(cofdat)`

I can plot box plots and histogram to skim through the distribution of each variables.

```
cof_long <- cofdat %>% pivot_longer(cols = everything(), names_to = "Var", values_to = "value")
ggplot(cof_long) +
geom_boxplot(aes(x = Var, y = value)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.3, size = 8), axis.title = element_blank())
```

```
ggplot(cof_long) +
geom_histogram(aes(value), bins = 50) +
facet_wrap(facets = ~Var, scales = "free")
```

One notable thing is that `category_one_defects`

, `category_two_defects`

, and `quakers`

comparatively have a lot of 0’s than non-zero values. These variables look like an indicator of defects, so they should negatively affect the total score. I am curious to see how they will affecting the score.

Lastly, I checked the correlation between all the variables included in the model. Ideally, I would not want to have a high correlation among the variables since that could lead to the inflation of the standard error for \(\beta\) estimates, possibly making the \(\beta\) very unstable.

```
cof_cor <- cor(cofdat)
heatmap.2(cof_cor, symm = TRUE, col = bluered(100), trace = "none", main = "Correlation Between Variables\n(with Hierarchical Clustering)")
```

As I suspected, I see some strong correlations among the ratings. It does make sense to see this since a good bean is bound to have multiple positive characteristics. At the top of the heatmap we have flavor and aftertaste. If a food has good flavor, I would expect to have a pleasant aftertaste. Another interesting takeaway from the heatmap is how the three of last four variables, `category_one_defects`

, `category_two_defects`

, and `quakers`

, which are indicators of poor quality, are clustered out of the other ratings. Frankly, I don’t know why moisture joined the party there because I have no knowledge of coffee bean ratings. The association looks very weak, so it could be a random chance. But it could very well be that higher rating for moisture is a bad thing.

In the end, I decided to keep all the variables for regression since all of these variables were likely to have been considered for the final score.

## Running the Models

### Linear Regression with Ordinary Least Squares

This part is simple, as I only need to use `lm()`

and `summary()`

to review the output.

```
linreg <- lm(total_cup_points ~ ., data = cofdat)
summary(linreg)
```

```
##
## Call:
## lm(formula = total_cup_points ~ ., data = cofdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42978 -0.00374 -0.00085 0.00751 0.10215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.289e-02 1.561e-02 0.826 0.409
## aroma 9.992e-01 2.032e-03 491.712 < 2e-16 ***
## flavor 9.993e-01 2.880e-03 346.939 < 2e-16 ***
## aftertaste 1.000e+00 2.633e-03 379.855 < 2e-16 ***
## acidity 9.973e-01 2.086e-03 478.222 < 2e-16 ***
## body 1.002e+00 1.962e-03 510.931 < 2e-16 ***
## balance 1.002e+00 2.006e-03 499.669 < 2e-16 ***
## uniformity 1.002e+00 9.923e-04 1009.913 < 2e-16 ***
## clean_cup 9.997e-01 7.039e-04 1420.342 < 2e-16 ***
## sweetness 9.982e-01 8.372e-04 1192.313 < 2e-16 ***
## cupper_points 9.981e-01 1.557e-03 641.145 < 2e-16 ***
## moisture 8.139e-03 9.066e-03 0.898 0.370
## category_one_defects -1.681e-03 1.746e-04 -9.631 < 2e-16 ***
## category_two_defects -3.473e-04 8.653e-05 -4.013 6.32e-05 ***
## quakers -2.325e-04 5.058e-04 -0.460 0.646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01534 on 1322 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.929e+06 on 14 and 1322 DF, p-value: < 2.2e-16
```

### Ridge Regression

I use the package `glmnet`

to use ridge and lasso methods. The elastic net penalty in `glmnet`

is given by:

\[\frac{(1-\alpha)}{2}||\beta_j||_2^2+\alpha||\beta_j||_2\]

Whether the regression is ridge, lasso, or in between is controlled by \(\alpha\), with \(\alpha = 1\) being lasso and \(\alpha = 0\) being ridge.

To determine the most appropriate \(\lambda\), cross validation can be used. `cv.glmnet`

does the heavy lifting for me and calculates the MSE for each lambda after doing cross validation. I opt for the highest \(\lambda\) whose MSE is within 1 sd of the lowest MSE across all the \(\lambda\)s. I believe this leads to better generalization while maintaining low training error.

This short code will run cross-validation with various \(\lambda\)s, choose the appropriate lambda, and get a final model with that lambda.

```
# Choose the value of lambdas I'd like to test
lambdas <- exp(1)^seq(3, -10, by = -0.05)
# Cross validation with different lambdas
ridge <- cv.glmnet(
x = as.matrix(cofdat[, -1]),
y = cofdat$total_cup_points,
alpha = 0,
standardize = TRUE, # standardizes training data
lambda = lambdas
)
# choose lambda.1se
print(ridge$lambda.1se)
```

`## [1] 0.03877421`

```
ridge_final <- glmnet(
x = as.matrix(cofdat[, -1]),
y = cofdat$total_cup_points,
alpha = 0,
standardize = TRUE, # standardizes training data
lambda = ridge$lambda.1se
)
print(ridge_final$a0)
```

```
## s0
## 0.3755191
```

`print(ridge_final$beta)`

```
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## aroma 0.9907920628
## flavor 1.0214067393
## aftertaste 1.0011881996
## acidity 0.9888159329
## body 0.9939766384
## balance 1.0014300201
## uniformity 0.9975419726
## clean_cup 0.9884834152
## sweetness 0.9890917946
## cupper_points 0.9856624624
## moisture 0.0051675667
## category_one_defects -0.0022493552
## category_two_defects -0.0008581067
## quakers 0.0001926588
```

Just a couple things to note here:

`x`

expects a matrix input where each column is a feature and each row is an observation. The data cleaning step is usually conveniently done with a`data.frame`

but if`data.frame`

is passed into`x`

, then an error will be put out.`standardize`

does the scaling (standardization) of the feature inputs for me and returns the rescaled-version of the coefficients. If you have already done the standardization yourself, then you can set`standardize`

to FALSE.

### LASSO

```
lasso <- cv.glmnet(
x = as.matrix(cofdat[, -1]),
y = cofdat$total_cup_points,
alpha = 1,
standardize = TRUE, # standardizes training data
lambda = lambdas
)
# choose lambda.1se
print(lasso$lambda.1se)
```

`## [1] 0.009095277`

```
lasso_final <- glmnet(
x = as.matrix(cofdat[, -1]),
y = cofdat$total_cup_points,
alpha = 1,
standardize = TRUE, # standardizes training data
lambda = lasso$lambda.1se
)
print(lasso_final$a0)
```

```
## s0
## 0.486362
```

`print(lasso_final$beta)`

```
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## aroma 0.9884542
## flavor 1.0168211
## aftertaste 1.0030134
## acidity 0.9852781
## body 0.9839906
## balance 1.0016683
## uniformity 0.9935998
## clean_cup 0.9951121
## sweetness 0.9875308
## cupper_points 0.9874171
## moisture .
## category_one_defects .
## category_two_defects .
## quakers .
```

## Results

Let’s see the results and see how the coefficients differ.

```
coefs <- data.frame(variable = colnames(cofdat)[-1])
coefs$lm <- linreg$coefficients[coefs$variable]
coefs$ridge <- ridge_final$beta[coefs$variable,]
coefs$lasso <- lasso_final$beta[coefs$variable,]
coefs_long <- coefs %>%
mutate(variable = factor(variable, sort(variable))) %>%
pivot_longer(cols = c("lm", "ridge", "lasso"), names_to = "Method", values_to = "Coefficients") %>%
filter(abs(Coefficients) > 0) %>%
mutate(Method = factor(Method, c("lm", "ridge", "lasso"), labels = c("Linear Regression", "Ridge", "Lasso")))
coefs_long$low <- ifelse(coefs_long$Coefficients < 0.1, TRUE, FALSE)
ggplot(filter(coefs_long)) +
facet_grid(rows = vars(low), scales = "free_y") +
geom_point(aes(x = variable, y = Coefficients, color = Method)) +
ggtitle("Coefficient Comparison across Three Models") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
```

You can see that the coefficients are all around zero, with the exception of the four variables that indicate poor quality. I think it should come clear from this result that the final rating, `total_cup_points`

, is actually a simple sum of those ratings! Honestly, I hadn’t thought this through when I was choosing the dataset and running the regressions. I should have seen this beforehand in hindsight; there are 10 rating features, all ranging from 0 to 10, and the total score is on a range from 0 to 100.

You can still see the shrinkage effect in the penalized regression in the coefficient comparison plot above. Ridge and Lasso’s coefficients are lower in absolute value than the OLS methods for the most part. Furthermore, the coefficients for `category_one_defects`

, `category_two_defects`

, `moisture`

, and `quakers`

are all shrunk to 0 for Lasso. All the methods seem to capture the relationship between the 10 rating features and the final score.

There are some cases where the final score is not exactly a total sum of the 10 variables, and this makes the coefficients to deviate slightly from 1.

## Ending Remark

In this post, I grabbed a coffee dataset to practice penalized regression and see the shrinkage effect in action. We observed some of the variables being shrunk to zero as the penalty increases in Lasso. We were able to see that with the help of the regression models that the final score is simply a linear combination of 10 different ratings for characteristics for coffee beans. Unfortunately, it was difficult to see if the extra generalization from the penalized methods helped with prediction of the response due to the nature of the variables. When the time comes, I will try these methods on a different datasets and see if the generalizations help with making a better model.