Appendix D — Using R and Python in ML

D.1 Python

Python is the king of ML. Many other languages can perform ML and maybe even well, but Python is the most popular, has the most packages, and it’s where tools are typically implemented and developed first. Even if it isn’t your primary language, it should be for any implementation of machine learning.

Pros:

  • powerful and widely used tools
  • typically very efficient on memory and fast
  • many modeling packages try to use the sklearn API for consistency
  • easy pipeline/reproducibility setup

Cons:

  • Everything beyond getting a prediction from a model can be difficult: e.g. good model summaries and visualizations, interpretability tools, extracting key estimated model features, etc. For example, getting features names as part of the output is a recent development for scikit-learn and other modeling packages.
  • Data processing can be notably tedious and overly verbose. Pandas, to put it simply, is not tidyverse. Newer tools are at least starting to be more widely adopted, but they still are not as easy to use.
  • The development ecosystem can be fragile in the sense that any package update will often break another package’s functionality. In order to avoid this, people freeze their environment to whenever model exploration first began. Many industry modeling environments are still based on versions of Python that may be many years old to the point of no longer being supported. This also means that model packages will contain all the bugs from the time of release when the Python environment was created.
  • Package documentation is often quite poor, even for some important model aspects of the model, and there is no consistency in documentation from one package to another. Demonstrations, if even available, often may not work, leaving you to dig into the source code to figure out what’s actually going on. This hopefully will be alleviated in the future with AI tools that can write the documentation for you.
  • Interactive model development with Jupyter notebooks has not been close to the level of RMarkdown for years. However, Quarto has already shown great promise for using Python almost as easily as R, and this book was even written with it. So in the end, the R community may bail out the Python community on this issue.

D.2 R

Your authors can say definitively that R is actually great at ML and at production level, as they have used it with data comprising millions of data points for very large and well-known companies. The default tools are not as fast or memory efficient relative to Python, but they are typically more user friendly, and usually have good to even excellent documentation, as package development has been largely standardized for some time.

As far as some drawbacks, some Python modeling packages such as xgboost, lightgbm, and keras have concurrent development in R, but the R development typically lags with feature implementation. And when it comes to ML with deep learning models, R packages merely wrap the Python packages. Furthermore, Python has many more bindings for other production tools that have little to do with the data science part, but are necessary for other production-level tasks. In general though, for many aspects of data science, from feature engineering to visualization to reporting, R often has more, or at the very least easier, tools to offer.

Pros:

  • very user friendly and fast data processing
  • easy to use objects that contain the things you’d need to use for further processing
  • practically every tool you’d use works with data frames
  • saving models does not require any special effort
  • easy post-processing of models with many packages
  • documentation is standardized for any CRAN and most non-CRAN packages, and will only improve with AI tools. Unlike Python, examples are expected for documented functions, and the package will fail to build if any example fails, and even warn if examples are empty. This is a great way to ensure that examples are present and actually work.
  • ML tools can be used on tabular data of millions of instances in memory and in production, and on data that is too large to fit in memory using many viable tools.

Cons:

  • relatively slow
  • memory intensive
  • pipeline/reproducibility has only recently been of focus
    • tidymodels is a great but fairly non-standard way of conducting machine learning
    • mlr3 is much more sklearn-like- fast and memory efficient, but not as widely used
  • Other production tools are available but still recent and not as well-developed.

In summary, Python is the best tool for ML, and it is better for production level work in general. But you can use R for pretty much everything else if you want, including ML if it’s not too computationally expensive. Quarto makes it easy to use both, including simultaneously, so the great thing is that you don’t have to choose. More to the point, AI tools like CoPilot now make it easier to program in either, leaving you to focus on the data science instead of the programming.