Practical Data Science
Introduction
Intended Audience
Programming Language
Additional Practice
Outline
Part 1: Information Processing
Part 2: Programming Basics
Part 3: Modeling
Part 4: Visualization
Part 5: Presentation
Workshops
Other
Python notebooks
Other R packages
History
Current Efforts
Part I: Information Processing
Data Structures
Vectors
Character strings
Factors
Logicals
Numeric and integer
Dates
Matrices
Creating a matrix
Lists
Data Frames
Creating a data frame
Data Structure Exercises
Exercise 1
Exercise 2
Thinking Exercises
Python Data Structures Notebook
Input/Output
Better & Faster Approaches
R-specific Data
R Datasets
Other Types of Data
On the Horizon
Big Data
I/O Exercises
Exercise 1
Thinking Exercises
Python I/O Notebook
Indexing
Slicing Vectors
Slicing Matrices/data.frames
Label-based Indexing
Position-based Indexing
Mixed Indexing
Non-contiguous
Boolean
List/data.frame Extraction
Indexing Exercises
Exercise 1
Exercise 2
Exercise 3
Python Indexing Notebook
Pipes
Using Variables as They are Created
Pipes for Visualization (more later)
The Dot
Flexibility
Pipes Summary
Tidyverse
What is the Tidyverse?
What is Tidy?
dplyr
An example
Running Example
Selecting Columns
Helper functions
Filtering Rows
Generating New Data
Grouping and Summarizing Data
Renaming Columns
Merging Data
Pivoting axes
More Tidyverse
Personal Opinion
Tidyverse Exercises
Exercise 0
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Python Pandas Notebook
data.table
data.table Basics
Grouped Operations
Faster!
Joins
Group by
String matching
Reading files
More speed
Pipe with data.table
data.table Summary
Faster dplyr Alternatives
data.table Exercises
Exercise 0
Exercise 1
Exercise 2
Part II: Programming
Programming Basics
R Objects
Object Inspection & Exploration
Methods
S4 classes
Others
Inspecting Functions
Documentation
Objects Exercises
Iterative Programming
For Loops
A slight speed gain
While alternative
Loops summary
Implicit Loops
apply family
Apply functions
purrr
Looping with Lists
Iterative Programming Exercises
Exercise 1
Exercise 2
Exercise 3
Writing Functions
A Starting Point
DRY
Conditionals
Anonymous functions
Writing Functions Exercises
Exercise 1
Exercise 1b
Exercise 2
More Programming
Code Style
Why does your code exist?
Assignment
Code length
Spacing
Naming things
Other
Vectorization
Boolean indexing
Vectorized operations
Regular Expressions
Typical uses
Code Style Exercises
Exercise 1
Exercise 2
Vectorization Exercises
Exercise 1
Exercise 2
Regex Exercises
Exercise 1
Part III: Modeling
Model Exploration
Model Taxonomy
Linear models
Estimation
Minimizing and maximizing
Optimization
Fitting Models
Using matrices
Summarizing Models
Variable Transformations
Numeric variables
Categorical variables
Scales, indices, and dimension reduction
Don’t discretize
Variable Importance
Extracting Output
Package support
Visualization
Extensions to the Standard Linear Model
Different types of targets
Correlated data
Other extensions
Model Exploration Summary
Model Exploration Exercises
Exercise 1
Exercise 2
Exercise 3
Python Model Exploration Notebook
Model Criticism
Model Fit
Standard linear model
Beyond OLS
Classification
Model Assumptions
Predictive Performance
Model Comparison
Example: Additional covariates
Example: Interactions
Example: Additive models
Model Averaging
Model Criticism Summary
Model Criticism Exercises
Exercise 0
Exercise 1
Exercise 2
Python Model Criticism Notebook
Machine Learning
Concepts
Loss
Bias-variance tradeoff
Regularization
Cross-validation
Optimization
Tuning parameters
Techniques
Regularized regression
Random forests
Neural networks
Interpreting the Black Box
Machine Learning Summary
Machine Learning Exercises
Exercise 1
Exercise 2
Python Machine Learning Notebook
Part IV: Visualization
ggplot2
Layers
Piping
Aesthetics
Geoms
Examples
Stats
Scales
Facets
Multiple plots
Fine control
Themes
Extensions
ggplot2 Summary
ggplot2 Exercises
Exercise 0
Exercise 1
Exercise 2
Python Plotnine Notebook
Interactive Visualization
Packages
Piping for Visualization
htmlwidgets
Plotly
Modes
ggplotly
Highcharter
Graph networks
visNetwork
sigmajs
Plotly
leaflet
DT
Shiny
Dash
Interactive and Visual Data Exploration
Interactive Visualization Exercises
Exercise 0
Exercise 1
Exercise 2
Exercise 3
Python Interactive Visualization Notebook
Thinking Visually
Information
Your audience isn’t dumb
Clarity is key
Avoid clutter
Color isn’t optional
Think interactively
Color
Viridis
Scientific colors
RColorBrewer
Contrast
Scaling Size
Transparency
Accessibility
File Types
Summary of Thinking Visually
A casual list of things to avoid
Pie
Histograms
Using 3D without adding any communicative value
Using too many colors
Using valenced colors when data isn’t applicable
Showing maps that just display population
Biplots
Thinking Visually Exercises
Exercise 1
Exercise 2
Thinking Exercises
Part V: Presentation
Building Better Data-Driven Products
Rep* Analysis
Example
Repeatable
Reproducible
Replicable
Summary of rep* analysis
Literate Programming
R Markdown
Version Control
Dynamic Data Analysis & Report Generation
Using Modern Tools
Getting Started
What is Markdown?
Documents
Standard HTML
R notebooks
Distill
Bookdown
Presentations
Apps, Sites & Dashboards
Templates
How to Begin
Standard Documents
R Markdown files
Text
Code
Chunks
In-line
Labels
Running code
Multiple Documents
Knitting multiple documents into one
Parameterized reports
Collaboration
Using Python for Documents
Customization & Configuration
Output Options
Themes etc.
YAML
HTML & CSS
HTML
CSS
Custom classes
Personal Templates
The Rabbit Hole Goes Deep
R Markdown Exercises
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Exercise 5
Wrap-up
Summary
Appendix
R Markdown
Footnotes
Citations and references
Multiple documents
Web standards
References
Practical Data Science
Doing more with your data
Michael Clark
https://m-clark.github.io/
2020-10-12