Appendix B — Python vs. R
As we mentioned at the beginning (Section 1.3), use the language you like as long as it gets the job done well. The vast majority of data scientists use Python, but R is very common in academia and statistical analysis in general. Both languages have their strengths and weaknesses, and it’s worth getting a sense of what they are.¹
B.1 Our Background
Michael has used both R and Python for many years and now uses Python primarily as a reflection of his own movement from academia to industry. Seth also has used both for many years and teaches with both in his courses every semester. Unlike many practitioners, we have a lot of experience with using both for serious data science, which we hope gives us a pretty good perspective on the strengths and weaknesses of each. In the end though, this is just our opinion, and you would do well to consider your specific needs and preferences when choosing a language.
B.2 Python
Python is the leading language for machine learning, deep learning, and data science in general. Other languages can do ML, some of them well, but Python is the most popular, has the most packages, and is where tools are typically implemented and developed first. Even if it isn’t your primary language, it should be your choice for implementing machine learning or deep learning models.
B.2.1 Pros
- Popular Tools: Very powerful and very widely used tools.
- Efficiency: Typically efficient on memory and fast.
- Modeling API: Consistent API across many modeling packages (e.g., sklearn).
- Pipelines: Easy pipeline and reproducibility setup.
- ML/DL Development: It is the standard for machine/deep learning packages.
- Versatility: Versatile, general programming language, and can be used for many other things easily.
- Origins: Was not developed by statisticians.
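To make the Modeling API and Pipelines points above concrete, here is a minimal sketch using scikit-learn. The synthetic data, the two models, and all parameter values are our own assumptions for illustration only. What it shows is that two very different models are fit and scored through the same `fit`/`score` interface, with preprocessing bundled into a single pipeline object.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only); fixed seeds for reproducibility
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Two unrelated model types that share the same estimator interface
models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    # Scaling and modeling are combined into one object that is fit as a unit
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)  # R^2 on the held-out data

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Swapping in another scikit-learn regressor only requires adding an entry to `models`; the rest of the loop stays unchanged.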
B.2.2 Cons
- Model Exploration: It can be difficult to get model summaries, visualizations, and interpretable results.
- Data Processing: Pandas is not as intuitive/user-friendly as tidyverse.
- Environment fragility: Python or package updates frequently break functionality. Environments are commonly frozen to avoid this, which in turn means running outdated package versions that retain known bugs and can cause miscellaneous issues.
- Documentation: Documentation is inconsistent since there is no standard. This hopefully will be alleviated in the future with AI tools that can write the documentation for you.
- Interactive Development: Jupyter notebooks lag behind RMarkdown, but Quarto shows promise for improving Python’s usability in this area.²
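To give a flavor of the Data Processing comparison above, here is a small grouped summary in pandas, with a rough dplyr equivalent shown in a comment. The toy data frame and column names are hypothetical, and which version reads more naturally is a matter of taste; this is only a sketch of one common task.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Rough dplyr equivalent (R), for comparison:
#   df |> group_by(group) |> summarize(mean_value = mean(value), n = n())
summary = (
    df.groupby("group", as_index=False)
      .agg(mean_value=("value", "mean"), n=("value", "size"))
)
print(summary)
```

The pandas version relies on named aggregation, where each output column is given as a `(column, function)` pair.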
B.3 R
R is the leading language for statistical analysis, visualization, and reporting analytical results. Your authors can also definitively say that R is great for ML and in production, as they have used it with data comprising millions of data points for very large and well-known companies. The default tools are not as fast or memory efficient as Python’s, but they are typically more user friendly and usually have good to excellent documentation, as package development has been largely standardized for some time. When it comes to statistical models like those from Part I, R is the best tool for the job hands down.
B.3.1 Pros
- Interactive analysis: Interactive analysis was always the default approach.
- User-friendly: User-friendly and fast data processing, modeling, and visualization.
- Compatibility: Practically every tool you’d ever use works with data frames.
- Post-processing: Easy post-processing of models with many packages.
- Documentation: Documentation is standardized for any CRAN package, and almost all non-CRAN packages use the same approach. Examples are expected for documented functions, and the package will fail to build if any example fails.
- Package Development: CRAN packages are checked against the latest version of R, all major operating systems, and even other packages that depend on them.
B.3.2 Cons
- Speed: Relatively slow for many modeling tasks.
- Memory: R is very memory intensive. With modern machines this is less of an issue for many tabular data applications, but it can be very limiting for large data sets.
- Pipelines: Pipeline approaches and production-level reproducibility have only recently become a focus.
- Production tools: Production tools are still relatively recent and not as well-developed.
B.4 Conclusion
In summary, R is fantastic for Part I models (statistical modeling) and Python for Part II (machine learning). We can also say that both are great for causal modeling depending on what your approach is. We still feel R is notably easier to use for data processing, visualization, and model exploration, producing data science reports/documents, and obviously for statistical modeling. We would likely use Python for anything that requires a lot of data or is computationally expensive, and for machine learning in general. For deep learning models, Python is the obvious choice.
We like tools like Quarto because they make it easy to use both languages, even within the same document, so you don’t have to choose. More to the point, AI tools like Copilot now make it easier to program in either, leaving you to focus on the data science instead of the programming.
¹ There are other languages that can be used for data science, but they are not nearly as common. For example, Julia is a language that is very fast for some things and generally has a lot of potential for data science. But honestly it came too late to the game, and it’s not as user-friendly as Python or R. Proprietary tools like Matlab, SAS, Stata, and similar had their day, but it has long since passed, and closed-source tools will never develop as rapidly as popular open-source tools. Still other languages whose primary purpose is not data science may be useful for some tasks (e.g. Spark-ML), but they are not as well suited to it as Python or R.
² Quarto is a relatively new markdown tool that attempts to be language agnostic and is designed to be a successor to RMarkdown. In our experience it is already much better than Jupyter for producing an actual document, and it is approaching the same level for interactive use.