Preface

The purpose of this document is to provide a conceptual introduction to statistical or machine learning (ML) techniques for those that would not normally be exposed to such approaches during their typical required statistical training. Machine learning1 can be described as a form of statistical analysis, often even utilizing well-known and familiar techniques, that has bit of a different focus than traditional analytical practice in applied disciplines. The key notion is that flexible, automatic approaches are used to detect patterns within the data, with a primary focus on making predictions on future data.

If one surveys the number of techniques available in ML without any context, one can easily be overwhelmed in terms of the sheer number of approaches, as well as the various tweaks and variations of them. However, the specifics of the techniques are not as important as more general concepts that would be applicable in most every ML setting, and indeed, many traditional techniques as well. While there will be examples using the R statistical environment and descriptions of a few specific approaches, the focus here is more on ideas than application2 and kept at the conceptual level as much as possible. Note that Python versions of the model examples are available here. In addition, Marcio Mourao has provided additional Python examples.

As for prerequisite knowledge, I will assume a basic familiarity with regression analyses typically presented in applied disciplines. Regarding programming, none is really required to follow most of the content here, but some knowledge of R would be helpful if one wants to do the examples. Note that I won’t do much explaining of the code, as I will be more concerned with getting to a result than clearly detailing the path to it. Armed with even introductory knowledge, if there are parts of R code that are unclear, there are many resources to investigate and discover the details on one’s own, which results in more learning anyway.

The latest version of this document is dated 2019-03-25. It was originally written in March 2013, when there was much more need for it in academia, at least outside of CS and Engineering departments. For all the buzz of machine learning, deep learning, AI, etc., that need still exists, as most applied disciplines still do not teach such techniques as a core part of their training (yet). However, the number of tools and information, within the realm of R and Python especially, has exploded, so you’ll have plenty to keep you occupied. Indeed, I’m not sure how much of the specific techniques and tools demonstrated in this document will be viable in another five years, so be prepared to keep up.

Color coding in text:

  • emphasis
  • package
  • function
  • object/class
  • link

  1. Also referred to as applied statistical learning, statistical engineering, data science or data mining in various academic, professional, and other contexts (correctly or not).

  2. Indeed, there is evidence that with large enough samples many techniques converge to similar performance.