Appendix

Bias Variance Demo

set.seed(123)
x = runif(1000)
ytrue = sin(3*pi*x)
basedat = cbind(x,ytrue)[order(x),]
 
gendatfunc = function(noise=.5, n=1000){
  x = runif(n)
  y = sin(3*pi*x) + rnorm(n, sd=noise)  # truth plus noise
  d = cbind(x, y)
  d
}
 
gendat = replicate(100, gendatfunc(n=100))
str(gendat)
 
library(kernlab)
 
rbf1 = apply(gendat, 3, 
             function(d) predict(gausspr(y~x, data=data.frame(d), kpar=list(sigma=.5)), 
                                 newdata = data.frame(x), type='response'))
rbf2 = apply(gendat, 3, 
             function(d) predict(gausspr(y~x, data=data.frame(d)), 
                                 newdata = data.frame(x), type='response') )
 
library(ggplot2); library(tidyverse); library(gridExtra)
 
rbf1_samp = rbf1 %>% 
  data.frame %>% 
  cbind(x, .) %>%
  slice(sample(1:100, 25)) %>% 
  gather(key=sample, value=yhat, -x)
 
rbf2_samp = rbf2 %>% 
  data.frame %>% 
  cbind(x, .) %>%
  slice(sample(1:100, 25)) %>% 
  gather(key=sample, value=yhat, -x)
 
g1 = ggplot(data=data.frame(basedat)) +
  geom_blank() +
  geom_line(aes(x=x, y=yhat, group=sample), color='#ff5500', alpha=.25, data=rbf1_samp) +
  ylim(c(-1.5, 1.5)) +
  labs(y='', title='Low Variance') + 
  lazerhawk::theme_trueMinimal() +
    theme(
    legend.key = ggplot2::element_rect(fill='#fffff8', colour = NA),
    legend.background = ggplot2::element_rect(fill='#fffff8', colour = NA),
    panel.background = ggplot2::element_blank(),
    panel.grid = ggplot2::element_blank(),
    strip.background = ggplot2::element_blank(),
    plot.background = ggplot2::element_rect(fill = "#fffff8", colour = NA)
  )
 
g2 = ggplot(data=data.frame(basedat)) +
  geom_line(aes(x=x, y=ytrue), color='#00aaff') +
  geom_line(aes(x=x, y=yhat), color='#ff5500', data=data.frame(x=x, yhat=rowMeans(rbf1))) +
  ylim(c(-1.5, 1.5)) +
  labs(y='', title='High Bias') + 
  lazerhawk::theme_trueMinimal() +
    theme(
    legend.key = ggplot2::element_rect(fill='#fffff8', colour = NA),
    legend.background = ggplot2::element_rect(fill='#fffff8', colour = NA),
    panel.background = ggplot2::element_blank(),
    panel.grid = ggplot2::element_blank(),
    strip.background = ggplot2::element_blank(),
    plot.background = ggplot2::element_rect(fill = "#fffff8", colour = NA)
  )
 
g3 = ggplot(data=data.frame(basedat)) +
  geom_blank() +
  geom_line(aes(x=x, y=yhat, group=sample), color='#ff5500', alpha=.25, data=rbf2_samp) +
    ylim(c(-1.5, 1.5)) +
  labs(y='', title='High Variance') + 
  lazerhawk::theme_trueMinimal() +
    theme(
    legend.key = ggplot2::element_rect(fill='#fffff8', colour = NA),
    legend.background = ggplot2::element_rect(fill='#fffff8', colour = NA),
    panel.background = ggplot2::element_blank(),
    panel.grid = ggplot2::element_blank(),
    strip.background = ggplot2::element_blank(),
    plot.background = ggplot2::element_rect(fill = "#fffff8", colour = NA)
  )
 
g4 = ggplot(data=data.frame(basedat)) +
  geom_line(aes(x=x, y=ytrue), color='#00aaff') +
  geom_line(aes(x=x, y=yhat), color='#ff5500', data=data.frame(x=x, yhat=rowMeans(rbf2))) +
  ylim(c(-1.5, 1.5)) +
  labs(y='', title='Low Bias') + 
  lazerhawk::theme_trueMinimal() +
    theme(
    legend.key = ggplot2::element_rect(fill='#fffff8', colour = NA),
    legend.background = ggplot2::element_rect(fill='#fffff8', colour = NA),
    panel.background = ggplot2::element_blank(),
    panel.grid = ggplot2::element_blank(),
    strip.background = ggplot2::element_blank(),
    plot.background = ggplot2::element_rect(fill = "#fffff8", colour = NA)
  )
 
grid.arrange(g1, g2, g3, g4, ncol=2)

Programming Languages

R

Demonstrations for this document were done with R, specifically the caret package. I would highly recommend it for your own needs, as it simplifies much of the ML process while still giving you access to whatever technique you want to use, with hundreds of approaches available out of the box.
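
As a rough sketch of the basic caret workflow, the following assumes the wine training data used elsewhere in this document; the random forest method and the amount of tuning are arbitrary choices here, and any of the other available methods could be swapped in via the method argument.

library(caret)

# 10-fold cross-validation for assessing performance while tuning
cv_opts = trainControl(method = 'cv', number = 10)

# caret's unified interface: change `method` to try a different technique
results_rf = train(good ~ .,
                   data = wine_train,
                   method = 'rf',
                   trControl = cv_opts,
                   tuneLength = 3)

results_rf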

Deep learning example

The following is a deep learning example using the same data as the previous examples, via keras. The package is an R wrapper for the keras module in Python, which is itself a wrapper for TensorFlow. For more on using TensorFlow with R, check out RStudio’s documentation.

This is a sequential neural net with three hidden layers. I also add some dropout at each layer, which, given the material presented, you can think of as a means of regularization to avoid overfitting59. Note that keras works a bit differently, in that its objects are mutable, so there is no reassignment of the model object. This R implementation also provides some nice visualizations to keep your interest while it runs.

This is just a starting point, so don’t expect any better performance than those shown previously, but it’ll give you something to play with. Note that there is plenty to learn, and such models require copious amounts of tuning to work well.

library(keras)

# scaled numeric model matrices and 0/1 targets for keras
X_train = model.matrix(good ~ . -1, data=wine_train) %>% scale()
y_train = as.numeric(wine_train$good) - 1
X_test  = model.matrix(good ~ . -1, data=wine_test) %>% scale()
y_test  = as.numeric(wine_test$good) - 1

model = keras_model_sequential() 

model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(9)) %>%
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model)

model %>% compile(
  optimizer = 'nadam',
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

history = model %>% 
  fit(X_train, 
      y_train, 
      epochs = 10000, 
      validation_split = 0.1,
      verbose = 1,
      view_metrics = 1)

plot(history)

# model %>% evaluate(X_test, y_test)

ypred = model %>% predict_classes(X_test)
caret::confusionMatrix(factor(ypred), factor(y_test), positive='1')

Python

If your data fits on your machine and/or your analysis takes less than a couple of hours, R is hands down the easiest way to go from data to document, including when that document is an interactive website. That said, R probably isn’t even the most popular ML tool, because in many situations we have a lot more data, or simply need the predictions, without frills, as fast as possible60. As such, Python is the de facto standard in those situations, and probably the most popular development environment for machine learning.

One can start with the scikit-learn module, using it in much the same way as was demonstrated with caret. It will get you very far, but for some situations you may need heavier-duty options like tensorflow, pytorch, etc.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
# from os import chdir            # if desired
# chdir('your_directory')
# import data
wine = pd.read_csv('data/wine.csv')
# data preprocessing
np.random.seed(1234)
X = wine.drop(['free.sulfur.dioxide', 'density', 'quality', 'color', 'white','good'], axis=1)
X = MinMaxScaler().fit_transform(X)  # by default on 0, 1 scale
y = wine['good']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# train model
rf = RandomForestClassifier(n_estimators=1000)
rf_train = rf.fit(X_train, y_train)
# get test predictions
rf_predict = rf_train.predict(X_test)
# create confusion matrix, and accuracy
cm = confusion_matrix(y_test,rf_predict)
cm_prob = cm / np.sum(cm)  # as probs
print(cm_prob)
# [[ 0.26692308  0.09076923]
#  [ 0.07615385  0.56615385]]
acc = accuracy_score(y_test, rf_predict)
acc = pd.DataFrame(np.array([acc]), columns=['Accuracy'])
print(acc)
#    Accuracy
# 0  0.833077

Deep learning example

Here is a bit of code to get you started with a scikit-style approach to deep learning. This is a ‘deep’ neural net with seven hidden layers, and I also add some ‘dropout’. These settings aren’t a particularly viable model, and it won’t do any better than those shown previously61, but it’ll run fine on your machine so you can at least play with it.

import tensorflow.contrib.learn as skflow
from sklearn import metrics
feats = skflow.infer_real_valued_columns_from_input(X_train)
classifier_tf = skflow.DNNClassifier(feature_columns=feats, 
                                     hidden_units=[50, 50, 50, 40, 30, 20, 10], 
                                     dropout=.2,
                                     n_classes=2)
classifier_tf.fit(X_train, y_train, steps=10000)
predictions = list(classifier_tf.predict(X_test, as_iterable=True))
score = metrics.accuracy_score(y_test, predictions)
print("Accuracy: %f" % score)

Other

I wouldn’t recommend a proprietary tool when better open source tools are available, but I will say that Matlab is also very popular in machine learning, and in specific areas like image processing. Julia has been coming along as well. Of course, for maximum speed people still prefer lower-level languages like C++, Java, and Go, and many of the Python modules are interfaces to such lower-level libraries. And for whatever reason, people are reinventing ML wheels in languages like JavaScript and others. See the awesome list for more.

What I can’t recommend is a traditional statistics package like SPSS, SAS, or Stata. Not only did they miss this boat by well over a decade, but their offerings are also slim and less capable. SAS is probably the only one that has made a serious effort here, and it has some audience in the business world due to its long entrenchment there. And you don’t have to take my word for it; here’s a comparison of trends at indeed.com.

Local Interpretable Model-agnostic Explanations

The general approach lime takes to explaining an individual prediction is as follows (a rough code sketch is given after the list):

  • For each prediction to explain, permute the observation n times62.
  • Let the complex model predict the outcome of all permuted observations.
  • Calculate the distance from all permutations to the original observation.
  • Convert the distance to a similarity score.
  • Select m features best describing the complex model outcome from the permuted data.
  • Fit a simple model to the permuted data, explaining the complex model outcome with the m features from the permuted data weighted by its similarity to the original observation.
  • Extract the feature weights from the simple model and use these as explanations for the complex model’s local behavior.

See Ribeiro, Singh, and Guestrin (2016) for details.
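
Though the lime package handles the details for you, a toy version of the steps above might look something like the following sketch. Here complex_model, train_x, and obs are hypothetical placeholders, the features are assumed to be numeric, and for a classifier you would work with predicted probabilities rather than predicted classes.

# a rough illustration of the steps above, not the lime package's actual implementation
explain_obs = function(complex_model, train_x, obs, n = 5000, m = 3, kernel_width = .75){
  # 1. permute the observation n times by sampling each feature from its training distribution
  permuted = as.data.frame(lapply(train_x, function(col) sample(col, n, replace = TRUE)))
  
  # 2. complex model predictions for all permutations
  yhat = predict(complex_model, newdata = permuted)
  
  # 3. distance from each permutation to the original observation
  d = sqrt(rowSums(scale(permuted, center = as.numeric(obs), scale = apply(train_x, 2, sd))^2))
  
  # 4. convert distance to a similarity score with an exponential kernel
  sim = exp(-d^2 / kernel_width^2)
  
  # 5. select the m features most associated with the complex model outcome
  keep = names(sort(abs(cor(permuted, yhat)[, 1]), decreasing = TRUE)[1:m])
  
  # 6. fit a simple model to the permuted data, weighted by similarity to the original observation
  simple_mod = lm(yhat ~ ., data = data.frame(permuted[keep], yhat = yhat), weights = sim)
  
  # 7. the feature weights serve as the local explanation
  coef(simple_mod)
}

The real implementation is more careful about how observations are permuted and how the m features are selected, but the overall structure is the same.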

Various Variable Importance Measures

This is taken from the randomForestExplainer package vignette.

For a given predictor variable \(X_j\)

  • accuracy_decrease (classification) – mean decrease of prediction accuracy after \(X_j\) is permuted
  • gini_decrease (classification) – mean decrease in the Gini index of node impurity (i.e. increase of node purity) by splits on \(X_j\)
  • mse_increase (regression) – mean increase of mean squared error after \(X_j\) is permuted
  • node_purity_increase (regression) – mean node purity increase by splits on \(X_j\) as measured by the decrease in sum of squares
  • mean_minimal_depth – mean minimal depth calculated in one of three ways specified by the parameter mean_sample
  • no_of_trees – total number of trees in which a split on \(X_j\) occurs
  • no_of_nodes – total number of nodes that use \(X_j\) for splitting (it is usually equal to no_of_trees if trees are shallow)
  • times_a_root – total number of trees in which \(X_j\) is used for splitting the root node (i.e., the whole sample is divided into two based on the value of \(X_j\))
  • p_value – p-value for the one-sided binomial test using the following distribution:

    $$\textrm{Binom}(\texttt{no\_of\_nodes}, P(\textrm{node splits on } X_j))$$
    
    where we calculate the probability of a split on \(X_j\) as if \(X_j\) were uniformly drawn from the \(r\) candidate variables:
    
    $$P(\textrm{node splits on } X_j) = P(X_j \textrm{ is a candidate}) \cdot P(X_j \textrm{ is selected}) = \frac{r}{p} \cdot \frac{1}{r} = \frac{1}{p}$$
    
    This test tells us whether the observed number of successes (the number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random (i.e. following the binomial distribution given above). A small worked example is sketched after this list.
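
As a minimal worked example of that test, the counts below are made up, and treating the total number of candidate nodes as the number of trials reflects one reading of the description above:

# hypothetical counts for illustration only
splits_on_xj = 120    # nodes that split on X_j (the observed successes)
total_nodes  = 1000   # candidate nodes in the forest (the trials)
p = 10                # number of predictors, so P(split on X_j) = 1/p

# one-sided test of whether X_j is used more often than chance alone would suggest
binom.test(splits_on_xj, total_nodes, p = 1/p, alternative = 'greater')$p.value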

Brief Glossary of Common Terms

bias: can refer to the intercept (e.g. in neural nets), but typically refers to the bias in the bias-variance decomposition

classifier: specific model or technique (i.e. function) that maps observations to classes

confusion matrix: a table of predicted class membership vs. true class membership

hypothesis: a specific model \(h(x)\) of all possible in the hypothesis space \(\mathcal{H}\)

input, feature, attribute: independent variable, explanatory variable, covariate, predictor variable, column

instance, example: observation, row

learning: model fitting

machine learning: a form of statistics utilizing various algorithms with a goal to generalize to new data situations

regularization, penalization, shrinkage: the process of adding a penalty to the size of coefficients, shrinking them toward zero, which results in less overfitting (at the cost of some increase in bias)

supervised: has a dependent variable

target, label: dependent variable, response, the outcome of interest

unsupervised: no dependent variable (or rather, only dependent variables); think clustering, PCA etc.

weights: coefficients, parameters

References

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.


  59. You can obtain near perfect accuracy on the training set without dropout, but will do no better at test.

  60. Actually, for this rf.fit was slower than the default randomForest function in R by about a second under similar settings.

  61. Just for giggles I let that particular one go for 100,000 steps and it finally started to get above 81% on the test set.

  62. This could be done in different ways. For example, given a numeric feature, add some random noise based on its place in the distribution of values seen for that feature. In the plots shown, the values, e.g. 9.5 and 10.3, are at quantiles (.25 and .5) of alcohol. The data has been permuted within those quantile bins.