1 Introduction

We are constantly inundated with data, regardless of our background and whether we’re conscious of it or not. It’s inescapable, from our first attempts to understand the world around us, to our most recent efforts to explain why we still don’t get it. Even now, our most complicated and successful models are almost uninterpretable even to those who created them. But that doesn’t mean that even in difficult circumstances we can’t understand the essence of how models work, and make practical decisions from their results. And if you’re reading this, you are probably the type of person who wants to keep trying anyway! So for seasoned professionals or perhaps just the data curious, we want to help you learn more about how to use data to answer the questions you have.

1.1 What Is This Book?

This book aims to demystify the complex world of data science modeling. It serves as a practical resource and is something you can refer to for a quick overview of a specific modeling technique, a reminder of some modeling-related topic you’ve seen before, or perhaps a sneak peak into some modeling details.

The text is focused on a few statistical and machine learning concepts that are ubiquitous, and modeling approaches that are widely employed, and especially those which form the basis for most other models in use in a variety of domains. Believe it or not, whether a lowly t-test or a complex neural network, there is a tie that binds, and you don’t have to know every detail to get a solid model that works well enough. We hope to help you understand some of the core modeling principles, and how the simpler models can be extended and applied to a wide variety of data scenarios. We also touch on some topics related to the modeling process, such as common data issues and causal inference.

Our approach is first and foremost a practical one - models are just tools to help us reach a goal, and if a model doesn’t work in the world, it’s not very useful. But modeling is often a delicate balance of interpretation and prediction, and each data situation is unique in some way, almost always requiring a bespoke approach. What works well in one setting may be poor in another, and what may be the state of the art may only be marginally better than a simpler approach that is more easily interpreted. In addition, complexities arise even in an otherwise deceptively simple application. However, if you have a core understanding of the techniques that lie at the heart of many models, you’ll automatically have many more tools at your disposal to tackle the problems you face, and be more comfortable with choosing the best for your needs.

This book also strives to find the balance between statistical texts that don’t speak to predictive power or machine learning techniques, and machine learning treatments that consider the job done after calling the predict method. We aim to provide a solid treatment of both, and show how both are necessary perspectives of data modeling. The right modeling tool for your job may come from anywhere, and we hope you’ll get a good sense of what’s out there, and how to use it.

1.1.1 What we hope you take away

Here are a few things we hope you’ll take away from this book:

A sense of the common thread that runs through the modeling landscape, from simple linear models to complex neural networks
A small set of modeling tools that will nonetheless be applicable to many common data problems you’ll encounter
Enough understanding to be able to confidently apply these tools to your own data

While we recommend working through the chapters in order if you’re starting out, we hope that this book can serve as a “choose your own adventure” reference. Whether you want a surface-level understanding or a deeper dive, we think you will find value in this book.

1.1.2 What you can expect

For each topic that we cover in a chapter, you will generally see the same type of content structure. We start with an overview and provide some key ideas to keep in mind as we go through the chapter. You’ll also be given a sense of the context required. This should help you choose any topic you feel comfortable with, and skip over those you don’t.

Models are implemented with code using standard approaches, though results are usually shown in a more digestible format with tables and visualizations. To further demystify the modeling process, at various points we take a DIY approach to show how a model or some aspect of it comes about by estimating the results by hand for comparison. We’ll also provide some concluding thoughts, connections to other techniques and topics, and suggestions on what to explore next. For some chapters, we’ll also provide suggestions for things to try on your own.

Some topics may get a bit more into the weeds than you want, and that’s okay! We hope that you can take away the big ideas and come back to the details when you’re ready. Just having an awareness of what’s possible is often the first step to understanding how to apply it to your own data. In general though, we’ll touch a little bit on a lot of things, but hopefully not in an overwhelming way.

1.1.3 What you can’t expect

This book will not teach you programming, but you really only need a very basic understanding of R or Python. We also won’t be teaching you basic statistics, so we won’t be delving into hypothesis testing or the intricacies of statistical theory. The text is more focused on applied modeling, prediction, and performance than a normal stats book, and it is more focused on interpretation and uncertainty in the modeling process than a typical machine learning book. It’s not an academic treatment of the topics, so when it comes to references, you’ll be more likely to find a nice blog post or YouTube video that clearly demonstrates a concept, rather than a dense academic paper. That said, you should have a great idea of where to go and what to search to go further for deeper content.

1.2 Who Should Use This Book?

This book is intended for every type of data dabbler, no matter what part of the data world you call home. If you consider yourself a data scientist, a machine learning engineer, a business analyst, or a deep learning hobbyist, you already know that the best part of a good dive into data is the modeling. But whatever your data persuasion, models give us the possibility to answer questions, make predictions, and understand what we’re interested in a little bit better. And no matter who you are, it isn’t always easy to understand how the models work. Even when you do get a good grasp of a modeling approach, things can still get complicated, and there are a lot of details to keep track of. In other cases, maybe you just have other things going on in your life and have forgotten a few things. In that case, we find that it’s always good to remind yourself of the basics! So if you’re just interested in data and hoping to understand it a little better, then it’s likely you’ll find something useful.

1.3 Which Language?

You’ve probably noticed most data science books, blogs, and courses choose R or Python. While many individuals often have a strong opinion toward teaching and using one over the other, we eschew dogmatic approaches and language flame wars. R and Python are both great languages for modeling and both flawed in unique ways. Even if you specialize in one, it’s good to have awareness of the other, as they are the most popular languages for statistical modeling and machine learning, and both excel in at least some areas the other does not. We use both extensively in our own work for teaching, personal use, and production level code, and either may be useful to whatever task you have in mind.

Throughout this book, we will be presenting demonstrations in both R and Python, and you can use both or take your pick, but we want to leave that choice up to you. Our goal is to use them as a tool to help understand some big model ideas. We do present the initial code in R for statistical models, and Python for machine learning approaches and beyond, as we feel their relative strengths are in those areas, and for a balanced focus. But either language can be used well for any modeling task in this book.

In the end, this book can be a resource for the R user who could use a little help translating their R knowledge to Python. We’d also like it to be a resource for the Python user who sees the value in R’s statistical modeling abilities and more. You’ll find that our coding style/presentation bends more toward legibility, clarity and consistency, which is not necessarily the same as a standard like PEP8 or the tidyverse style guide¹. We hope that you can take the code we provide and make it your own, and that you can use it to help you understand the models we’re discussing.

1.4 Moving Toward an Excellent Adventure

Remember the point we made about “choosing your own adventure”? Modeling and programming in data science is an adventure, even if you never leave your desk! Every situation calls for choices to be made, and every choice you make will lead you down a different path. You will run into errors, dead-ends, and you might even find that you’ve spent considerable time to conclude that nothing interesting is happening in your data. This, no doubt, is actually part of the fun, and all of those struggles will make your ultimate success that much sweeter. Like every adventure, things might not be immediately clear, and you might find yourself in perilous situations! If you find that something isn’t making sense upon your first read, that’s fine! Your humble authors have spent considerable time mulling over models and foggy ideas during our assorted (mis)adventures, and nobody should expect to master complex concepts on a single read through! In any arena where you strive to develop skills, distributed practice and repetition are essential. When concepts get tough, step away from the book, and come back with a fresh mind. We have great faith you will get where you want to go, and we’re here to help you along the way!

The commonly used coding styles for both R and Python aren’t actually scientifically derived or tested, and only recently has research been conducted in this area (see Ivanova et al. (2020) for an example). The guidelines are generally good but mostly reflect the preferences of the person(s) who wrote them. Our focus here is not on programming though.↩︎