Introduction

This document provides tools and demonstrations to make data processing, programming, modeling, visualization, and presentation easier. The goal is to instill a foundation for sound data science, and also to explain some of the why behind the approaches demonstrated, so that you can better implement them. Focus is given to tools and methods that are generalizable, replicable, and efficient.

Intended Audience

If you are a budding data scientist in academia, industry, or elsewhere, the content in this document should provide enough knowledge for you to do better work under a variety of circumstances, from data creation to presentation of results. You have probably already tried to analyze data in some form, but may be struggling to do so efficiently. Hopefully this will allow you to extend what you know, fill in gaps, and otherwise try some new tricks.

Programming Language

The programming language focus is on R, but Python notebooks that include the same exercises are also available where applicable (which is most of the time). Furthermore, much of the content would apply to any programming language used for data science, and the concepts for programming, modeling, visualization, reproducibility and more are relevant to anyone engaged in data science.

To use R effectively enough for the content here, you also need to install RStudio (which is separate from R itself) and know how to install and load packages.
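If installing and loading packages is new to you, the following is a minimal sketch of the usual pattern rather than a required setup. The package names are the key packages listed later in this document, and `missing_packages` is a hypothetical helper written only for this illustration.

```r
# Key packages named in this document.
key_pkgs <- c("tidyverse", "data.table", "tidymodels", "rmarkdown")

# Hypothetical helper (not from any package): which of `pkgs` are not installed?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(key_pkgs)

# Install once per machine; this downloads from CRAN, so it needs internet access.
# Left commented out so that running this sketch changes nothing on your system.
# install.packages(to_install)

# Load a package in every new session before using it. 'stats' ships with R,
# so this line runs anywhere; swap in e.g. library(data.table) once installed.
library(stats)
```

Installing is done once per machine, while `library()` has to be called again in each new R session.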

Additional Practice

Exercises are provided for more practice with the material covered. In addition, references are available for those who want to explore extensions of the things demonstrated here.

Outline

The content is divided into five parts. While the later parts assume at least the information in Part 1, they are otherwise fairly independent and can be explored on their own.

Part 1: Information Processing

  • Understanding Basic R Approaches to Gathering and Processing Data
    • Overview of Data Structures
    • Getting data in and out
    • Indexing
  • Getting Acquainted with Other Approaches to Data Processing
    • Pipes, and how to use them
    • tidyverse
    • data.table
    • Misc.

Part 2: Programming Basics

  • Using R more fully
    • Dealing with objects
    • Iterative programming
    • Writing functions
  • Going further
    • Code style
    • Vectorization
    • Regular expressions

Part 3: Modeling

  • Model Exploration
    • Key concepts
    • Understanding and fitting models
    • Overview of extensions
  • Model Criticism
    • Model Assessment
    • Model Comparison
  • Machine Learning
    • Concepts
    • Demonstration of techniques

Part 4: Visualization

  • Thinking Visually
    • Visualizing Information
    • Color
    • Contrast
    • and more…
  • Using ggplot2
    • Aesthetics
    • Layers
    • Themes
    • and more…
  • Adding Interactivity
    • Package demos
    • Shiny

Part 5: Presentation

  • Building Better Data-Driven Products
    • Reproducibility concepts
  • Starting out with R markdown
    • Standard documents
  • Customization and more
    • Themes, CSS, etc.

Workshops

This document also serves as a basis for several workshops. To follow along with the examples, clone/download the related section repos. Each of them includes an R project and associated data, such that the code from any section should run.

Other

Color coding in text:

  • emphasis
  • package
  • function
  • object/class
  • link

Some key packages used in the following demonstrations and exercises:

tidyverse (several packages), data.table, tidymodels, rmarkdown

Python notebooks

The related Python notebooks may be found here. The primary modules used and demonstrated are numpy and pandas, which feature in the first two parts and throughout the rest. Besides that, statsmodels, scikit-learn, plotnine, plotly, and others are employed.

Other R packages

Many other packages are also used for data or minor demonstration, so feel free to install them as we come across them. Here are a few.

ggplot2movies, nycflights13, DT, highcharter, magrittr, maps, mgcv (a recommended package that already ships with R), plotly, quantmod, readr, visNetwork, emmeans, ggeffects

History

This was initially a document that only contained the information processing and visualization chapters, and it filled a need specifically for workshops. Over time, it has grown to include other things I think are generally missing among those who simply learn how to get to the end results. The content is born out of the gaps I see in those I consult with, and also out of ‘I wish I’d known that’ experiences of my own. In the interim, textbooks have come out, some authored by developers of the packages used here, that could serve as a next step, as they offer more detail. They are listed in the references section.

Current Efforts

At present most of the content is more or less as I want it, though I’m filling it out with some minor additions here and there over time. Currently I’m improving the Python documents as well.