Wrap-up

In this section, I note some other techniques one may come across, and others that will provide additional insight into machine learning applications.

Unsupervised Learning

Generally speaking, unsupervised learning involves techniques that work with unlabeled data. We still have our typical set of features of interest, but no particular response to map them to. In this situation we are more interested in discovering structure within the data. For more examples, see this document.

Clustering

Many of the techniques used in unsupervised learning are commonly taught in various disciplines as simply “cluster” analysis. The gist is that we are seeking an unknown class structure rather than seeing how various inputs relate to a known class structure. Common techniques include k-means, hierarchical clustering, and model-based approaches (e.g. mixture models).
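As a quick illustration, the following is a minimal k-means sketch using a familiar built-in data set; the data and the number of clusters are chosen purely for demonstration.

set.seed(1234)
km = kmeans(scale(iris[, 1:4]), centers = 3)  # cluster the standardized features
table(km$cluster, iris$Species)               # compare discovered clusters to the known classes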

Latent Variable Models

Sometimes the desire is to reduce the dimensionality of the inputs to a more manageable set of information. In this manner we are thinking that much of the data can be seen as having only a few sources of variability, often called latent variables or factors. Again, this takes familiar forms such as principal components and (“exploratory”) factor analysis, but would also include independent components analysis and partial least squares techniques. Note also that these can be part of a supervised technique (e.g. principal components regression) or the main focus of analysis (as with latent variable models in structural equation modeling).
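For example, a minimal principal components sketch with base R might look like the following, again using a built-in data set purely for illustration.

pca = prcomp(iris[, 1:4], scale. = TRUE)  # principal components on standardized features
summary(pca)                              # proportion of variance explained by each component
head(pca$x[, 1:2])                        # scores on the first two components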

Graphical Structure

Other techniques are available to understand structure among observations or features. Among the many approaches is the popular network analysis, where we obtain links among observations and examine the structure of those data points visually, with observations that have closer ties to one another placed closer together. In still other situations, we aren’t so interested in the structure as we are in modeling the relationships and making predictions from the attributes of nodes. One can examine my document that covers more of these and latent variable approaches.

An example network graph of U.S. senators in 2006. Node size is based on the betweenness centrality measure, edge size the percent agreement (graph filtered to edges >= 65%). Color is based on the clustering discovered within the graph (link to data).
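As a rough sketch of the same ideas, the following toy example with the igraph package computes betweenness and discovers communities (clusters) in a small made-up graph; the edge list here is purely illustrative.

library(igraph)

edges = data.frame(from = c('A', 'A', 'B', 'C', 'C', 'D'),
                   to   = c('B', 'C', 'C', 'D', 'E', 'E'))
g = graph_from_data_frame(edges, directed = FALSE)

betweenness(g)            # node-level centrality, as used for node size in the figure
cluster_fast_greedy(g)    # communities discovered within the graph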

Imputation

We can also use ML techniques when we are missing data, as a means to impute the missing values. While many are familiar with this problem and standard techniques for dealing with it, it may not be obvious that ML techniques may also be used. For example, both k-nearest neighbors and random forest techniques have been applied to imputation52.
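As a minimal sketch, k-nearest neighbors imputation is available through caret’s preprocessing; dat_missing below is a placeholder for a numeric data frame containing NAs.

library(caret)

pp = preProcess(dat_missing, method = 'knnImpute')  # note: this also centers and scales the data
dat_imputed = predict(pp, dat_missing)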

Beyond this, we can infer values that are otherwise unavailable in a different sense. Consider Netflix, Amazon and other sites that suggest various products based on what you already like or are interested in. In this case the suggested products have missing values for the user, which are imputed or inferred based on their available data and on similar consumers who have rated the product in question. Such recommender systems are widely used these days.

Ensembles

In many situations we can combine the information of multiple models to enhance prediction. This can take place within a specific technique, e.g. random forests, or between models that utilize different techniques. I will discuss some standard techniques, but there is a great variety of forms that model combination might take.

Bagging

Bagging, or bootstrap aggregation, uses bootstrap sampling to create many data sets on which a procedure is then performed. The final prediction is based on an average of the predictions made for each observation across all of the models. In general, bagging helps reduce variance while leaving bias largely unaffected. A conceptual outline of the procedure is provided, followed by a rough translation into R.

Model Generation

  1. Sample \(N\) observations with replacement \(B\) times to create \(B\) data sets of size \(N\).
  2. Apply the learning technique to each of the \(B\) data sets to create \(t = B\) models.
  3. Store the \(t\) results.

Classification

For each of the \(t\) models:

  1. Predict the class of the \(N\) observations of the original data set.
  2. Return the class predicted most often across the \(t\) models (or alternatively, the proportion of models predicting each class, as a probability).

The approach would be identical for the continuous target domain, where the final prediction would be the average across all models.
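The following is a rough translation of this outline into R, using classification trees as the base procedure; dat is a placeholder data frame with a factor outcome y.

library(rpart)

bag_fit = function(dat, B = 100) {
  # fit a tree to each of B bootstrap samples of the data
  lapply(1:B, function(b) {
    boot_idx = sample(nrow(dat), replace = TRUE)
    rpart(y ~ ., data = dat[boot_idx, ])
  })
}

bag_predict = function(models, newdata) {
  # class predictions from each model, one column per model
  preds = sapply(models, function(m) as.character(predict(m, newdata, type = 'class')))
  # majority vote across models for each observation
  apply(preds, 1, function(p) names(which.max(table(p))))
}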

Boosting

With boosting we take a different approach to refitting models. Consider a classification task in which we start with a base learner and apply it to the data of interest. Next, the learner is refit, but with more weight (importance) given to misclassified observations. This process is repeated until some stopping rule is reached (e.g. reaching \(M\) iterations). An example of the AdaBoost algorithm is provided (in the following, \(\mathbb{I}\) is the indicator function), followed by a rough translation into R.

Set initial weights \(w_i\) to \(1/N\).

for \(m=1:M\) {

  • Fit a classifier \(m\) with given weights to the data resulting in predictions \(f^{(m)}_i\) that minimizes some loss function.

  • Compute the weighted error rate \(\text{err}_m = \frac{\sum_{i=1}^N w^{(m)}_i\,\mathbb{I}(y_i\ne f^{(m)}_i)}{\sum^N_{i=1}w^{(m)}_i}\)

  • Compute \(\alpha_m = \log[(1-\text{err}_m)/\text{err}_m]\)

  • Set \(w_i \leftarrow w_i\exp[\alpha_m \mathbb{I}(y_i\ne f^{(m)}_i)]\)

}

Return \(\textrm{sign} [\sum^M_{m=1}\alpha_m f^{(m)}]\)
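A direct, if simplified, translation of the above might look like the following sketch, which uses rpart stumps as the base classifier and assumes labels y coded as -1/1; the function and object names are placeholders.

library(rpart)

adaboost_sketch = function(X, y, M = 50) {
  N = length(y)
  w = rep(1/N, N)                                      # initial weights 1/N
  dat = data.frame(X, y = factor(y))
  fits = vector('list', M)
  alpha = numeric(M)

  for (m in 1:M) {
    fits[[m]] = rpart(y ~ ., data = dat, weights = w,
                      control = rpart.control(maxdepth = 1, cp = 0))  # a stump
    pred = ifelse(predict(fits[[m]], dat, type = 'class') == '1', 1, -1)
    err  = sum(w * (pred != y)) / sum(w)               # weighted error rate
    alpha[m] = log((1 - err) / err)
    w = w * exp(alpha[m] * (pred != y))                # upweight the misclassified
  }

  list(fits = fits, alpha = alpha)
}

predict_adaboost = function(obj, newdata) {
  # weighted sum of the base classifiers, then take the sign
  contrib = sapply(seq_along(obj$fits), function(m) {
    p = ifelse(predict(obj$fits[[m]], newdata, type = 'class') == '1', 1, -1)
    obj$alpha[m] * p
  })
  sign(rowSums(contrib))
}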

Boosting can be applied to a variety of tasks and loss functions, and in general is highly resistant to overfitting. A very popular implementation is XGBoost and its variants. The following shows an implementation.

library(xgboost)
library(caret)    # train, modelLookup, confusionMatrix (likely already loaded earlier)

modelLookup("xgbLinear")   # which parameters caret tunes for each method
modelLookup("xgbTree")

# tuning grid for the tree-based booster
xgb_opts = expand.grid(eta = c(.3, .4),
                       max_depth = c(9, 12),
                       colsample_bytree = c(.6, .8),
                       subsample = c(.5, .75, 1),
                       nrounds = 1000,
                       min_child_weight = 1,
                       gamma = 0)

set.seed(1234)
results_xgb = train(good ~ .,
                    data = wine_train,
                    method = 'xgbTree',
                    preProcess = c('center', 'scale'),
                    trControl = cv_opts,
                    tuneGrid = xgb_opts)
results_xgb

preds_gb = predict(results_xgb, wine_test)
confusionMatrix(preds_gb, good_observed, positive = 'Good')

Stacking

Stacking is a method that can generalize beyond a single fitting technique, though it can be applied in a fashion similar to boosting for a single technique. Here we will use it broadly to mean any method to combine models of different forms. Consider the four approaches we demonstrated earlier: k-nearest neighbors, neural net, random forest, and the support vector machine. We saw that they do not have the same predictive accuracy, though they weren’t bad in general. Perhaps by combining their respective efforts, we could get even better prediction than using any particular one.

The issue then is how we might combine them. We really don’t have to get too fancy with it, and can even use a simple voting scheme as in bagging. For each observation, note the predicted class on new data across models. The final prediction is the class that receives the most votes. Another approach would be to use a weighted vote, where the votes are weighted by the accuracy of their respective models.

Another approach would use the predictions on the test set to create a data set of just the predicted probabilities from each learning scheme. We can then use this data to train a meta-learner using the test labels as the response. With the final meta-learner chosen, we then retrain the original models on the entire data set (i.e. including the test data). In this manner, the initial models and the meta-learner are trained separately and you get to eventually use the entire data set to train the original models. Now when new data becomes available, you feed them to the base level learners, get the predictions, and then feed the predictions to the meta-learner for the final prediction.
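A rough sketch of the meta-learner idea with caret follows; the base model objects (results_knn, results_nnet, results_rf, results_svm) are placeholders for the models fit earlier, and the sketch assumes they were trained with classProbs = TRUE so that class probabilities are available.

library(caret)

base_models = list(knn  = results_knn,
                   nnet = results_nnet,
                   rf   = results_rf,
                   svm  = results_svm)

# probability of the 'Good' class on the test set from each base model
meta_features = sapply(base_models, function(m)
  predict(m, wine_test, type = 'prob')[, 'Good'])

meta_data = data.frame(meta_features, good = good_observed)

# a regularized logistic regression as a simple meta-learner
meta_fit = train(good ~ ., data = meta_data,
                 method = 'glmnet',
                 trControl = cv_opts)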

Deep Learning

Deep learning is all the rage these days, and for good reason: it keeps working and is highly flexible. Many techniques are largely focused on AI applications, but are not restricted to those. They’ve been employed successfully in a wide range of areas, e.g. facial recognition, computer vision, speech recognition, and natural language processing. Common techniques include deep feedforward neural networks53, convolutional neural networks, and recurrent/recursive neural networks. Armed with a basic knowledge of neural networks as presented earlier, you can see these as the next step.

Such models require massive amounts of data, a lot of tuning, and generally serious hardware54. Python generally has the latest implementations of these tools, through modules such as tensorflow, pytorch, and keras. There are tools in R, but they are largely wrappers for the Python modules, and memory usage alone precludes a standard R implementation, though packages like sparklyr and keras may eventually allow this.

Because of the difficulty of training such models, pre-trained models are often applied in various situations. This can be a very dangerous situation if the data are not comparable. It’s fine to use a sentiment analysis model based on Twitter feeds on other data from a review website. It’s another matter to use a model that was used for pedestrian detection in road data to look for tumors in x-rays. However, don’t be surprised if you see this stuff in drop-down menus for Excel in the not too distant future #whatcouldpossiblygowrong.

You can start your journey into this sort of stuff at http://deeplearning.net/, and I have minimal demos for R and Python in the appendix.

Feature Selection & Importance

We hit on this topic some before, but much like there are a variety of ways to gauge performance, there are different approaches to select features and/or determine their importance. Invariably, feature selection takes place from the outset, when we choose what data to collect in the first place. Hopefully that choice is guided by theory, though in other cases it may be restricted by user input, privacy issues, time constraints and so forth. But once we obtain the initial data set, we may still want to trim the models under consideration.

In standard approaches of the past we might have used forward or other stepwise selection procedures, or perhaps some more explicit model comparison approach. Concerning the content here, take for instance the lasso regularization procedure we spoke of earlier. ‘Less important’ variables may be shrunk entirely to zero, and thus feature selection is an inherent part of the process, which is useful in the face of many, many predictors, sometimes outnumbering our sample points. As another example, consider any particular approach where the importance metric might be something like the drop in accuracy when the variable is excluded.
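As a brief sketch of the lasso case with glmnet, coefficients shrunk exactly to zero amount to de-selected features; X and y below are placeholders for a numeric model matrix and a binary outcome.

library(glmnet)

lasso_fit = cv.glmnet(X, y, family = 'binomial', alpha = 1)  # alpha = 1 gives the lasso
coef(lasso_fit, s = 'lambda.1se')  # coefficients set exactly to zero are effectively dropped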

In the past, variable importance was given almost full weight in the discussion of typical applied research, based on statistical significance results from a one-shot analysis, with prediction on new data virtually ignored. With ML techniques we can still focus on feature performance, while shifting more of the focus toward prediction at the same time. For the uninitiated, though, it might require new ways of thinking about how importance is measured.

Natural Language Processing/Text Analysis

In some situations, the data of interest are not in a typical matrix form but in the form of textual content, e.g. a corpus of documents (loosely defined). In this case, much of the work (as in most analyses, but perhaps even more so) will be in the data preparation, as text is rarely if ever in a ready-to-analyze state. The eventual goals may include the discovery of latent topics, parts-of-speech tagging, sentiment analysis, language identification, the use of word usage to predict an outcome, or examining the structure of term usage graphically, as in a network model. In addition, machine learning processes might be applied to sounds (acoustic data) to discern speech characteristics and other information. Deep learning has been widely applied in this realm. For some applied examples of basic text analysis in R, see this.
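As a glimpse of the data preparation step, the following is a tiny sketch with the tidytext package that tokenizes a couple of made-up documents, removes stop words, and produces the term counts on which further analysis would build.

library(dplyr)
library(tidytext)

docs = tibble(doc = 1:2,
              text = c('An example document about wine.',
                       'Another document, this one about beer.'))

docs %>%
  unnest_tokens(word, text) %>%            # one word per row, lowercased
  anti_join(stop_words, by = 'word') %>%   # drop common stop words
  count(doc, word, sort = TRUE)            # term counts per document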

Bayesian Approaches

It should be noted that the approaches outlined in this document are couched in the frequentist tradition. But one should be aware that many of the concepts and techniques carry over to the Bayesian perspective, and some machine learning techniques may even only be feasible or make more sense within the Bayesian framework (e.g. online learning). However, the computationally intensive nature of Bayesian estimation makes it difficult to implement in ways that scale to even moderately large data situations.

More Stuff

Aside from what has already been noted, there remain a great many other ML topics and applications, such as data set shift55, semi-supervised learning56, online learning57, and more.

Summary

Cautionary Notes

A standard mantra in machine learning and statistics generally is that there is no free lunch. All methods have certain assumptions, and if those don’t hold the results will be problematic at best. Also, even if in truth learner A is better than B, B can often outperform A in the finite situations we actually deal with in practice.

In general, without context, no algorithm can be said to be any better than another on average. Furthermore, being more complicated doesn’t mean a technique is better. As previously noted, simply incorporating regularization and cross-validation goes a long way toward improving standard techniques, which may then perform quite well in many situations. The basic conclusion is that

Machine learning is not magic!

ML does not prove your theories, it does not make your data better, and the days of impressing someone simply because you’re using it have long since passed. Like any statistical technique, the reason to use ML is that it is well-suited to the problem at hand.

Some Guidelines

Here are some thoughts to keep in mind, though these may be applicable to applied statistical practice generally.

  • More data beats a cleverer algorithm, but a lot of data is not enough by itself (Domingos (2012)).

  • Avoid overfitting.

  • Let the data speak for itself.

  • “Nothing is more practical than a good theory.”58

  • While getting used to ML, it might be best to start from simpler approaches and then work towards more black box ones that require more tuning. For example, regularized logistic regression \(\rightarrow\) random forest \(\rightarrow\) your-fancy-technique. Don’t get too excited if you aren’t doing significantly better than a random forest with default settings.

  • Drawing up a visual diagram of your process is a good way to keep your analysis on the path to your goal. Some programs can even make this explicit.

  • Keep the tuning parameter/feature selection process separate from the final test process for assessing error.

  • Learn multiple models, selecting the best or possibly combining them.

Conclusion

It is hoped that this document sheds some light on a few areas that might otherwise be unfamiliar to some in more applied disciplines. The fields of statistics, computer science, engineering, and related disciplines have rapidly evolved over the past couple of decades. The tools available from them are myriad, and expanding all the time. Rather than feeling intimidated or overwhelmed, embrace the choices available and have some fun with your data!

References

Domingos, Pedro. 2012. “A Few Useful Things to Know About Machine Learning.” Commun. ACM 55 (10). https://doi.org/10.1145/2347736.2347755.


  52. Imputation and related techniques may fall under the broad heading of matrix completion.

  53. Such models are no different than the neural nets of the 80s and 90s, save for maybe newer or more efficient optimization implementations.

  54. You can run any model on small data and your own machine; it would just likely waste your time in the case of these models.

  55. Used when fundamental changes occur between the data a learner is trained on and the data coming in for further analysis.

  56. Learning with both labeled and unlabeled data.

  57. Learning from a continuous stream of data.

  58. Kurt Lewin, and iterated by V. Vapnik for the machine learning context.