1 Introduction
We are constantly inundated with data, regardless of our background and whether we’re conscious of it or not. It’s inescapable, from our first attempts to understand the world around us, to our most recent efforts to explain why we still don’t get it. Even now, our most complicated and successful models are almost uninterpretable even to those who created them. But that doesn’t mean that we can’t understand the essence of how models work, even in difficult circumstances, and make practical decisions from their results. And if you’re reading this, you are probably the type of person who wants to keep trying anyway! So whether you’re a seasoned professional or just data curious, we want to help you learn more about how to use data to answer the questions you have.
1.1 What Is This Book?
This book aims to demystify the complex world of data science modeling. It serves as a practical resource that you can refer to for a quick overview of a specific modeling technique, a reminder of some modeling-related topic you’ve seen before, or perhaps a sneak peek into some modeling details.
The text is focused on a few statistical and machine learning concepts that are ubiquitous, and modeling approaches that are widely employed, and especially those which form the basis for most other models in use in a variety of domains. Believe it or not, whether a lowly t-test or a complex neural network, there is a tie that binds, and you don’t have to know every detail to get a solid model that works well enough. We hope to help you understand some of the core modeling principles, and how the simpler models can be extended and applied to a wide variety of data scenarios. We also touch on some topics related to the modeling process, such as common data issues and causal inference.
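As a small illustration of that tie, here is a minimal sketch (not a demonstration taken from later chapters) showing that a classic two-sample t-test and a linear regression with a single group indicator produce the same test statistic. The data are simulated, and the names `group_a`, `group_b`, `pooled_t_statistic`, and `regression_t_statistic` are hypothetical, chosen only for this example. It assumes the equal-variance (pooled) form of the t-test.

```python
import numpy as np

# Simulated data for two hypothetical groups (assumption: not from the book)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)
group_b = rng.normal(loc=0.5, scale=1.0, size=50)


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)
    ) / (n_a + n_b - 2)
    std_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (b.mean() - a.mean()) / std_error


def regression_t_statistic(a, b):
    """t statistic for the group coefficient in y = b0 + b1 * indicator."""
    y = np.concatenate([a, b])
    indicator = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
    X = np.column_stack([np.ones_like(indicator), indicator])

    # Ordinary least squares fit
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Residual variance and the standard error of the group coefficient
    residuals = y - X @ coefs
    sigma2 = residuals @ residuals / (len(y) - X.shape[1])
    coef_cov = sigma2 * np.linalg.inv(X.T @ X)
    return coefs[1] / np.sqrt(coef_cov[1, 1])


t_classic = pooled_t_statistic(group_a, group_b)
t_regression = regression_t_statistic(group_a, group_b)
print(f"t-test statistic:     {t_classic:.4f}")
print(f"regression statistic: {t_regression:.4f}")
```

The two printed values should agree up to floating point error, because the slope of the indicator is the difference in group means and its standard error reduces to the pooled standard error.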
Our approach is first and foremost a practical one: models are just tools to help us reach a goal, and if a model doesn’t work in the world, it’s not very useful. But modeling is often a delicate balance of interpretation and prediction, and each data situation is unique in some way, almost always requiring a bespoke approach. What works well in one setting may be poor in another, and what may be the state of the art may only be marginally better than a simpler approach that is more easily interpreted. In addition, complexities arise even in an otherwise deceptively simple application. However, if you have a core understanding of the techniques that lie at the heart of many models, you’ll automatically have many more tools at your disposal to tackle the problems you face, and be more comfortable with choosing the best one for your needs.
1.1.1 What we hope you take away
Here are a few things we hope you’ll take away from this book:
- A sense of the common thread that runs through the modeling landscape, from simple linear models to complex neural networks
- A small set of modeling tools that will be applicable to the vast majority of tabular data problems you’ll encounter
- Enough understanding to be able to confidently apply these tools to your own data
While we recommend working through the chapters in order if you’re starting out, we hope that this book can serve as a “choose your own adventure” reference. Whether you want a surface-level understanding, a deeper dive, or just want to be able to understand what the analysts in your organization are talking about, we think you will find value in this book.
1.1.2 What you can expect
For each topic that we cover in a chapter, you will generally see the same type of content structure. We start with an overview and provide some key ideas to keep in mind as we go through the chapter. We then demonstrate the model with data, code, results, and visualizations. To further demystify the modeling process, at various points we take time to show how a model comes about by estimating it by hand. We’ll also provide some concluding thoughts, connections to other techniques and topics, and suggestions on what to explore next. Occasionally we’ll also provide some exercises to try on your own.
Some topics may get a bit more into the weeds than you want, and that’s okay! We hope that you can take away the big ideas and come back to the details when you’re ready. Just having an awareness of what’s possible is often the first step to understanding how to apply it to your own data. In general though, we’ll touch a little bit on a lot of things, but hopefully not in an overwhelming way.
1.1.3 What you can’t expect
This book will not teach you programming, but you really only need a basic understanding of R or Python. We also won’t be teaching you basic statistics, so we won’t be delving into hypothesis testing or the intricacies of statistical theory. The text is more focused on applied modeling, prediction, and performance than a normal stats book, and more focused on interpretation and uncertainty in the modeling process than a typical machine learning book. It’s not an academic treatment of the topics, so when it comes to references, you’ll be more likely to find a nice blog post or YouTube video that clearly demonstrates a concept, rather than a dense academic paper. That said, you should come away with a good idea of where to go and what to search for when you want deeper content.
1.2 Who Should Use This Book?
This book is intended for every type of data dabbler, no matter what part of the data world you call home. If you consider yourself a data scientist, a machine learning engineer, a business analyst, or a deep learning hobbyist, you already know that the best part of a good dive into data is the modeling. But whatever your data persuasion, models give us the possibility to answer questions, make predictions, and understand what we’re interested in a little bit better. And no matter who you are, it isn’t always easy to understand how the models work. Even when you do get a good grasp of a modeling approach, things can still get complicated, and there are a lot of details to keep track of. In other cases, maybe you just have other things going on in your life and have forgotten a few things. In that case, we find that it’s always good to remind yourself of the basics! So if you’re just interested in data and hoping to understand it a little better, then it’s likely you’ll find something useful.
Your humble authors have struggled mightily themselves throughout the course of their data science history, and still do! We often found it difficult to get a good grasp of statistical modeling and machine learning. It took us a lot of effort to learn how to use the tools, how to interpret the results, and possibly the most difficult, how to explain what we’re doing to others! We’ve forgotten a lot, confused ourselves, and made some happy accidents in the process. That’s okay! Our goal is to help you avoid some of those pitfalls, help you understand the basics of how models work, and get a sense of how most modeling endeavors have a lot of things in common.
Whether you enthusiastically pore over formulas and code, or prefer to skip over them, we promise that you don’t need to memorize a formula to get a good understanding of modeling and related issues. We are the first to admit that we have long dumped the ability to pull formulas out of our brain folds [1]; however, knowing how the individual pieces work together only helps to deepen your understanding of the model. Code examples help put more difficult aspects of models into more concrete terms that you can then use in different ways to solidify and expand your knowledge. And a good visualization will reveal even more about what’s going on than a formula or code. In short, there are many ways to help learn modeling in a way that works for you. We hope that anyone who would be interested in the book will find a way to learn things in a manner that suits them best.
1.3 Which Language?
You’ve probably noticed that most data science books, blogs, and courses choose R or Python. While many people have strong opinions about teaching and using one over the other, we eschew dogmatic approaches and language flame wars. R and Python are both great languages for modeling, and both are flawed in unique ways. Even if you specialize in one, it’s good to have awareness of the other, as they are the most popular languages for statistical modeling and machine learning, and each excels in at least some areas where the other does not. We use both extensively in our own work for teaching, personal use, and production-level code, and have found both are up to whatever task you have in mind.
Throughout this book, we will be presenting demonstrations in both R and Python. You can use both or take your pick; that choice is up to you. Our goal is to use them as tools to help understand some big model ideas. This book can be a resource for the R user who could use a little help translating their R knowledge to Python; we’d also like it to be a resource for the Python user who sees the value in R’s statistical modeling abilities and more. You’ll find that our coding style and presentation bend more toward legibility, clarity, and consistency, which is not necessarily the same as a standard like PEP8 or the tidyverse style guide [2]. We hope that you can take the code we provide and make it your own, and that you can use it to help you understand the models we’re discussing.
1.4 Moving Towards an Excellent Adventure
Remember the point we made about “choosing your own adventure”? Modeling and programming in data science is an adventure, even if you never leave your desk! Every situation calls for choices to be made and every choice you make will lead you down a different path. You will run into errors and dead ends, and you might even find that you’ve spent considerable time to conclude that nothing interesting is happening in your data. This, no doubt, is actually part of the fun, and all of those struggles will make your ultimate success that much sweeter. Like every adventure, things might not be immediately clear and you might find yourself in perilous situations! If you find that something isn’t making sense upon your first read, that is okay. Both authors have spent considerable time mulling over models and foggy ideas during our assorted (mis)adventures, and nobody should expect to master complex concepts on a single read-through! In any arena where you strive to develop skills, distributed practice and repetition are essential. When concepts get tough, step away from the book, and come back with a fresh mind. We have great faith you will get where you want to go, and we’re here to help you along the way!
[1] We actually never had this ability.
[2] The commonly used coding styles for both R and Python aren’t actually scientifically derived or tested, and only recently has research been conducted in this area (see Ivanova et al. (2020) for an example). The guidelines are generally good, but mostly reflect the preferences of the person(s) who wrote them. Our focus here is not on programming though.