Introduction

We are constantly inundated with data, regardless of background and whether we’re conscious of it or not. It’s inescapable, from our first attempts to understand the world around us to our most recent efforts to explain why we still don’t get it. Even now, our most complicated and successful models are almost uninterpretable, even to those who created them. But that doesn’t mean we can’t understand the essence of how they work, even in those cases. And if you’re reading this, you are probably the type of person who wants to keep trying! So whether you’re a seasoned professional or just data curious, we want to help you learn more about how to use data to answer the questions you have.

What Is This Book?

This book aims to demystify the complex world of data science modeling. It serves as a practical resource that you can refer to for a quick overview of a specific modeling technique, a reminder of something you’ve seen before, or perhaps a sneak peek into some new modeling details.

The text focuses on a few statistical and machine learning concepts that are ubiquitous, and on modeling approaches that are widely employed, especially those that form the basis for most other models in use across a variety of domains. Believe it or not, whether a lowly t-test or a complex neural network, there is a tie that binds, and you don’t have to know every detail to get a solid model that works well enough. We hope to help you understand some of the core modeling principles, and how the simpler models can be extended and applied to a wide variety of data scenarios. We also touch on some topics related to the modeling process, such as common data issues and causal inference.
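As a small taste of that tie, here is a minimal sketch showing that a classic two-sample t-test (assuming equal variances) gives the same t statistic as the slope of a simple linear regression on a 0/1 group indicator. The data values and the function names `pooled_t` and `slope_t` are hypothetical, made up purely for illustration, and only the Python standard library is used.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical toy data, invented only for this illustration
group_a = [4.1, 5.0, 5.6, 4.8, 5.3]
group_b = [5.9, 6.4, 5.7, 6.8, 6.1]


def pooled_t(a, b):
    """Two-sample t statistic, assuming both groups share one variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(b) - mean(a)) / std_error


def slope_t(x, y):
    """t statistic for the slope in a least-squares regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares, used to estimate the error variance
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_error = sqrt(rss / (n - 2) / sxx)
    return slope / std_error


# Stack the two groups and code membership as 0 (group a) or 1 (group b)
x = [0] * len(group_a) + [1] * len(group_b)
y = group_a + group_b

t_test = pooled_t(group_a, group_b)
t_regression = slope_t(x, y)
print(t_test, t_regression)  # the two values agree up to floating-point error
```

The regression slope here is just the difference in group means, which is why the two statistics coincide. This equivalence relies on the equal-variance form of the t-test; the unequal-variance (Welch) version would not match exactly.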

Our approach is first and foremost a practical one: models are just tools to help us reach a goal, and if a model doesn’t work in the world, it’s not very useful. But modeling is often a delicate balance of interpretation and prediction, and each data situation is unique in some way, almost always requiring a bespoke approach. What works well in one setting may be poor in another, and what may be the state of the art may only be marginally better than a simpler approach that is more easily interpreted. In addition, complexities arise even in an otherwise deceptively simple application. However, if you have a core understanding of the techniques that lie at the heart of many models, you’ll automatically have many more tools at your disposal to tackle the problems you face, and be more comfortable choosing the best one for your needs.

What We Hope You Take Away

Here are a few things we hope you’ll take away from this book:

A small set of modeling tools that will be applicable to the vast majority of tabular data problems you’ll encounter
Enough understanding to be able to apply these tools to your own data
A sense of the common thread that runs through the modeling landscape, from simple linear models to complex neural networks

While we recommend working through the chapters in order if you’re starting out, we hope that this book can serve as a “choose your own adventure” reference. Whether you want a surface-level understanding, a deeper dive, or just want to be able to understand what the analysts in your organization are talking about, we think you will find value in this book. While we assume that you have a basic familiarity with coding, that doesn’t mean that you need to work through every line of code to understand the fundamental principles and use cases of every model.

What You Can Expect

For each chapter, you will generally see the same type of content structure. We start with an overview and provide some key ideas to keep in mind as we go through the chapter. We then demonstrate the model with data, code, results, and visualizations. To further demystify the modeling process, at various points we take time to show how a model comes about by estimating it by hand. We also provide some concluding thoughts, connections to other techniques and topics, suggestions on what to explore next, and some exercises to try on your own.

Some topics may be a bit more in the weeds than you want, and that’s okay! We hope that you can take away the big ideas and come back to the details when you’re ready. Awareness of what’s possible is often the first step to understanding how to apply it to your own data. In general though, we’ll touch a little bit on a lot of things, but hopefully not in an overwhelming way.

Who Should Use This Book?

This book is intended for every type of data dabbler, no matter what part of the data world you call home. If you consider yourself a data scientist, a business analyst, or a statistical hobbyist, you already know that the best part of a good dive into data is the modeling. But whatever your data persuasion, models give us the possibility to answer questions, make predictions, and understand what we’re interested in a little bit better. And no matter who you are, it isn’t always easy to understand how the models work. Even when you do get a good grasp of a modeling approach, things can still get complicated, and there are a lot of details to keep track of. In other cases, maybe you just have other things going on in your life and have forgotten a few things. In that case, we find that it’s always good to remind yourself of the basics! So if you’re just interested in data and hoping to understand it a little better, then it’s likely you’ll find something useful.

Your humble authors have struggled mightily themselves throughout the course of their data science history, and still do! We were initially taught by those who weren’t exactly experts, and often found it difficult to get a good grasp of statistical modeling and machine learning. We’ve had to learn how to use the tools, how to interpret the results, and, possibly the most difficult, how to explain what we’re doing to others! We’ve forgotten a lot, confused ourselves, and had some happy accidents in the process. That’s okay! Our goal is to help you avoid some of those pitfalls, help you understand the basics of how models work, and get a sense of how most modeling endeavors have a lot of things in common.

Whether you enthusiastically pore over formulas and code, or prefer to skip over them, we promise that you don’t need to memorize a formula to get a good understanding of modeling and related issues. We are the first to admit that we have long dumped the ability to pull formulas out of our brain folds¹; however, knowing how those individual pieces work together only helps to deepen your understanding of the model. Typically, using code puts the formula into more concrete terms that you can then use in different ways to solidify and expand your knowledge. Sometimes you just need a reminder or want to see what function you’d use. And often, the visualization will reveal even more about what’s going on than the formula or the code. In short, there are a lot of tools at your disposal to help you learn modeling in a way that works for you. We hope that anyone who is interested in the book will find a way to learn things in a manner that suits them best.

There is a bit of a caveat. We aren’t going to teach you basic statistics or how to program in R or Python. Although there is a good chance you will learn some of it here, you’ll have an easier time if you have a very basic understanding of statistics and some familiarity with coding. We will provide some resources for you to learn more about these topics, but we won’t be covering them in detail. However, we really aren’t assuming a lot of background knowledge, and are, if anything, assuming that whatever knowledge you have may be a bit loose or fuzzy. That’s okay!

Which Language?

You’ve probably noticed most data science books, blogs, and courses choose R or Python. While many individuals often have a strong opinion towards teaching and using one over the other, we eschew dogmatic approaches and language flame wars. R and Python are both great languages for modeling, and both flawed in unique ways. Even if you specialize in one, it’s good to have awareness of the other as they are the most popular languages for statistical modeling and machine learning. We use both extensively in our own work for teaching, personal use, and production level code, and have found both are up to whatever task you have in mind.

Throughout this book, we present demonstrations in both R and Python. You can use both or take your pick; we leave that choice up to you. Our goal is to use them as tools to help understand some big model ideas. This book can be a resource for the R user who could use a little help translating their R knowledge to Python; we’d also like it to be a resource for the Python user who sees the value in R’s statistical modeling abilities and more.

Moving Towards An Excellent Adventure

Remember the point we made about “choosing your own adventure”? Modeling and programming in data science is an adventure, even if you never leave your desk! Every situation calls for choices to be made, and every choice you make will lead you down a different path. You will run into errors and dead ends, and you might even find that you’ve spent considerable time only to conclude that nothing interesting is happening in your data. This, no doubt, is part of the fun, and all of those struggles make success that much sweeter. Like every adventure, things might not be immediately clear and you might find yourself in perilous situations! If you find that something isn’t making sense upon your first read, that is okay. Both authors have spent considerable time mulling over models and foggy ideas during our assorted (mis)adventures; nobody should expect to master complex concepts on a single read-through! In any arena where you strive to develop skills, distributed practice and repetition are essential. When concepts get tough, step away from the book, and come back with a fresh mind.

¹ We actually never had this ability.