3  How Did We Get Here?

In our initial linear model, the coefficients for each feature are the key parameters. But how do we come to those values? When we run a linear model using some package’s function, the coefficients appear as if by magic, but it’s worth knowing a little bit about how they come to be, so let’s dive a little deeper. Here is where we go really far into ‘demystifying’ the modeling process.

Model estimation is the process of finding the parameters associated with a model that allow us to reach a particular modeling goal. Different types of models have different parameters to estimate, and there are different ways to estimate them. In general though, the goal is the same: find the set of parameters that lead to the best predictions under the current data modeling context.

With model estimation, we can break things down into the following steps:

  1. Start with an initial guess for the parameters
  2. Calculate the prediction error based on those parameters, or some function of it, or some other value that represents our model’s objective
  3. Update the guess
  4. Repeat steps 2 & 3 until we find a ‘best’ guess
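To make the loop concrete, here is a minimal toy sketch in R (our own example, not the text’s): we estimate a single parameter, a mean, by repeatedly nudging a guess in the direction that reduces squared error.

set.seed(42)
y_toy = rnorm(100, mean = 5)        # toy data with a true mean of 5

guess = 0                           # 1. initial guess for the parameter

for (i in 1:1000) {
    grad = mean(guess - y_toy)      # 2. how the mean squared error changes with the guess
    guess = guess - 0.1 * grad      # 3. update the guess in the downhill direction
}                                   # 4. repeat until the guess is as good as it gets

guess  # essentially mean(y_toy)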

Pretty straightforward, right? Well, it’s not always so simple, but this is the general idea in most applications, so keep it in mind! In this chapter, we’ll show how to do this ourselves to take away the mystery a bit from when you run standard model functions in typical contexts. Hopefully then you’ll gain more confidence when you do use them!

3.1 Key Ideas

A few concepts we’ll keep using here are fundamental to understanding estimation and optimization. To be clear, we’re restricting our discussion of these topics to typical linear model and similar settings, but they are much broader and more general than presented here. We’ve seen some of this before, but we’ll be getting a bit cozier with the concepts now.

  • Parameters are the values associated with a model that we have to estimate.
  • Estimation is the process of finding the parameters associated with a model.
  • The objective (loss) function takes the model’s predictions and observed target values as input, and produces a value that we want to maximize or minimize.
  • Prediction error is the difference between the observed value of the target and the predicted value, and is often used to calculate the objective function.
  • Optimization is the specific process of finding the parameters that maximize or minimize some objective function.
  • Model Selection is the process of choosing the best model from a set of models.

We can use estimation as a general term for finding parameters, while optimization can be seen as a term for finding parameters that maximize or minimize some objective function, or even a combination of objectives. In some cases we can estimate parameters without optimization, because there is a known way of solving the problem, but in most modeling situations we will use some optimization approach to find a ‘best’ set of parameters.

3.1.1 Why this matters

When it comes to modeling, even knowing just a little bit about what goes on behind the scenes is a great demystifier. And if models are less of a mystery, you’ll feel more confident using them. Parts of what you see here are a part of almost every common model used for statistics and machine learning, providing you even more of a foundation for understanding what’s going on.

3.1.2 Good to know

This chapter is more involved and technical than most of the others, so it might be best suited for those who like to get their hands dirty. It’s all about ‘rolling your own’, and so we’ll be doing a lot of the work ourselves. If you’re not the type of person who gets much out of that, that’s okay; you can skip this chapter and still get a lot out of the rest of the book. But if you’re curious about how things work, or you want to be able to do more than just run a function, then we think you’ll find the following useful. You’ll want to at least have your linear model basics down (Chapter 1).

3.2 Data Setup

For the examples here, we’ll use the world happiness dataset for the year 2018. We’ll use the happiness score as our target. Let’s take an initial look at the data here, but for more information see the appendix Section B.2.

World happiness data summary

Our happiness score has values from around 3 to 7, life expectancy and GDP per capita show some notable variability, and corruption perception is skewed toward lower values. We can also see that the features and target are correlated with each other, which is not surprising.

Correlation matrix for world happiness data

term         happiness   life_exp   log_gdp_pc   corrupt
happiness       NA          0.78       0.82       −0.47
life_exp       0.78         NA         0.86       −0.34
log_gdp_pc     0.82        0.86         NA        −0.34
corrupt       −0.47       −0.34       −0.34         NA

We’ll do some minor cleaning and renaming of columns, and we’ll drop any rows with missing values. We’ll also scale the features so that they have the same variance, which, as noted in the data chapter, can make estimation easier.

df_happiness = read_csv("data/world_happiness_2018.csv") |>
    drop_na() |>
    select(
        country,
        happiness_score,
        healthy_life_expectancy_at_birth,
        log_gdp_per_capita,
        perceptions_of_corruption
    ) |>
    rename(
        happiness  = happiness_score,
        life_exp   = healthy_life_expectancy_at_birth,
        log_gdp_pc = log_gdp_per_capita,
        corrupt    = perceptions_of_corruption
    ) |>
    # put gdp back on original scale before scaling
    mutate(
        gdp_pc = exp(log_gdp_pc), 
        across(life_exp:gdp_pc, \(x) scale(x)[,1])
    ) |>
    select(-log_gdp_pc) |>  # drop the log version
    rename_with(\(x) glue('{x}_sc'), -c(happiness, country))
And the same setup in Python:

df_happiness = (
    pd.read_csv('data/world_happiness_2018.csv')
    .dropna()
    .rename(
        columns = {
            'happiness_score': 'happiness',
            'healthy_life_expectancy_at_birth': 'life_exp',
            'log_gdp_per_capita': 'log_gdp_pc',
            'perceptions_of_corruption': 'corrupt'
        }
    )
    .assign(
        gdp_pc = lambda x: np.exp(x['log_gdp_pc']),
    )    
    [['country', 'happiness','life_exp', 'gdp_pc', 'corrupt']]
)


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df_happiness[['life_exp_sc', 'gdp_pc_sc', 'corrupt_sc']] = scaler.fit_transform(
    df_happiness[['life_exp', 'gdp_pc', 'corrupt']]
)
df_happiness = df_happiness.drop(columns = ['life_exp', 'gdp_pc', 'corrupt'])

3.3 Starting Out by Guessing

So, we’ll start with a model in which we predict a country’s level of happiness by its life expectancy: if you can expect to live longer, you’re probably in a country with better health care, higher incomes, and other important stuff. We’ll stick with our simple linear regression model as well.

As a starting point we can just guess what the parameter should be, but how would we know what to guess? How would we know which guesses are better than others? Let’s try a couple and see what happens. Let’s say that we think all countries start at the bottom of the happiness scale (around 3), but that life expectancy makes a big impact: for every standard deviation of life expectancy we go up a whole point of happiness. We can plug this into the model and see what we get:

\[ \text{prediction} = 3.3 + 1\cdot\text{life\_exp} \]

For a different model, we’ll say countries start at the mean happiness score, and moving up a standard deviation of life expectancy would move us up a half point of happiness. \[ \text{prediction} = \overline{\text{happiness}} + .5\cdot\text{life\_exp} \]

How do we know which is better? Let’s find out!

3.4 Prediction Error

We’ve seen that a key component to model assessment involves comparing the predictions from the model to the actual values of the target. This difference is known as the prediction error, or residuals in more statistical contexts. We can express this as:

\[ \epsilon = y - \hat{y} \] \[ \text{error} = \text{target} - \text{(model-based) guess} \]

This prediction error tells us how far off our model prediction is from the observed target values, and it gives us a way to compare models. With our measure of prediction error, we can calculate a total error for all observations/predictions (Section 2.2), or similarly, the average error. If one model or parameter set has less total or average error, we can say it’s a better model than one that has more (Section 2.2.3). Ideally we’d like to choose a model with the least error, but we’ll see that this is not always possible.

However, if we just take the average of our errors from a linear regression model, you’ll see that it is roughly zero! This is by design for many common models, where we will even explicitly write the formula for the error as coming from a normal distribution with a mean of zero. So, to get a meaningful error metric, we need to use the squared error or the absolute value of the error. These also make errors of equal magnitude above and below the observed value cost the same. We’ll use squared error here, and we’ll calculate the mean of the squared errors for all our predictions.

y = df_happiness$happiness

# Calculate the error for our first guess (intercept at the observed minimum)
prediction = min(df_happiness$happiness) + 1*df_happiness$life_exp_sc
mse_model_A = mean((y - prediction)^2)

# Calculate the error for our other guess
prediction = mean(y) + .5 * df_happiness$life_exp_sc
mse_model_B = mean((y - prediction)^2)
And in Python:

y = df_happiness['happiness']

# Calculate the error for our first guess (intercept at the observed minimum)
prediction = np.min(df_happiness['happiness']) + 1 * df_happiness['life_exp_sc']
mse_model_A   = np.mean((y - prediction)**2)

# Calculate the error for our other guess
prediction = y.mean() + .5 * df_happiness['life_exp_sc']
mse_model_B  = np.mean((y - prediction)**2)

Now let’s look at our Mean Squared Error (MSE), and we’ll also inspect the square root of it, the Root Mean Squared Error (RMSE), which puts things back on the original target scale and tells us the standard deviation of our prediction errors. We also add the Mean Absolute Error (MAE) as another metric with a straightforward interpretation.
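To connect those metrics, here is a quick sketch using the model B predictions from the code above:

# model B predictions from before
prediction = mean(y) + .5 * df_happiness$life_exp_sc

mse  = mean((y - prediction)^2)    # mean squared error
rmse = sqrt(mse)                   # back on the original happiness scale
mae  = mean(abs(y - prediction))   # mean absolute error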

Comparison of error metrics for two models

Model   MSE    RMSE   MAE    RMSE % drop   MAE % drop
A       5.09   2.26   1.52
B       0.64   0.80   0.58   65%           62%

Inspecting the metrics, we can see that we are off by over a point on average (MAE) for model A, and a little over half a point on average for model B. Model B is not only better, it results in a 65% drop in RMSE, and similarly for MAE. We’d definitely prefer it over model A, and now we have a general way to compare models.

Now all of this is useful, and at least we can say one model is better than another. But you’re probably hoping there is an easier way to get a good guess for our model parameters, especially when we have possibly dozens of features and/or parameters to keep track of. And there is!

3.5 Ordinary Least Squares

In a simple linear model, we often use the Ordinary Least Squares (OLS) method to estimate parameters. This method finds the coefficients that minimize the sum of the squared differences between the predicted and actual values, which is what we just did in our previous example. The sum of the squared errors is also called the residual sum of squares (RSS), as opposed to the ‘total’ sum of squares (i.e. the variance of the target) and the part explained by the model (the ‘model’ or ‘explained’ sum of squares). We can express this as follows, where \(y_i\) is the observed value of the target for observation \(i\), and \(\hat{y_i}\) is the predicted value from the model.

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \tag{3.1}\]

It’s called ordinary least squares because there are other least squares methods - generalized least squares, weighted least squares, and others - but we don’t need to worry about those for now. The sum or mean of the squared errors is our objective value. The process that takes the predictions and observed target values as inputs and returns this value as output is our objective function. We can use this value to find the best parameters for a specific model, as well as to compare models with different parameters, such as a model with additional features versus one with fewer. We can also use it to compare different types of models that use the same objective function, such as a linear model and a decision tree model.

Now let’s calculate the OLS estimate for our model. We’ll start by making a bunch of guesses for the parameters, then choose the guess that gives us the lowest objective value.

# for later comparison
model_happy = lm(happiness ~ life_exp_sc, data = df_happiness)

ols = function(X, y, par, sum_sq = FALSE) {
    # add a column of 1s for the intercept
    X = cbind(1, X)

    # Calculate the predicted values
    y_hat = X %*% par # %*% is matrix multiplication

    # Calculate the error
    error = y - y_hat

    # Calculate the value as sum or mean squared error
    value = crossprod(error) # crossprod here gives the sum of squared errors

    if (!sum_sq) {
        value = value / nrow(X)
    }

    # Return the value
    return(value)
}

# create a grid of guesses
guesses = crossing(
    b0 = seq(1, 7, 0.1),
    b1 = seq(-1, 1, 0.1)
)

# Example for one guess
ols(
    X = df_happiness$life_exp_sc,
    y = df_happiness$happiness,
    par = unlist(guesses[1, ])
)
       [,1]
[1,] 23.777
And in Python:

# for later comparison
model_happy = smf.ols('happiness ~ life_exp_sc', data = df_happiness)
model_happy = model_happy.fit()

def ols(par, X, y, sum_sq = False):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]

    # Calculate the predicted values
    y_hat = X @ par
    
    # Calculate the error
    value = np.sum((y - y_hat)**2)
    
    # Calculate the value as sum or average
    if not sum_sq:
        value = value / X.shape[0]
    
    # Return the value
    return(value)

# create a grid of guesses
from itertools import product

guesses = pd.DataFrame(
    product(
        np.arange(1, 7, 0.1),
        np.arange(-1, 1, 0.1)
    ),
    columns = ['b0', 'b1']
)

# Example for one guess
ols(
    par = guesses.iloc[0,:],
    X = df_happiness['life_exp_sc'],
    y = df_happiness['happiness']
)
23.793842044979073

After that, we’ll calculate the loss for each guess and find which one gives us the smallest value. Note that in our custom functions, we could get the total instead of the mean squared error by setting the sum_sq argument to TRUE/True. Either is fine, but the mean is more commonly used.

# Calculate the function value for each guess
guesses = guesses |>
    mutate(objective = map2_dbl(
        guesses$b0, guesses$b1,
        \(b0, b1) ols(
            par = c(b0, b1), 
            X = df_happiness$life_exp_sc, 
            y = df_happiness$happiness
        )
    ))

min_loss = guesses |> filter(objective == min(objective))

min_loss
And in Python:

# Calculate the function value for each guess
guesses['objective'] = guesses.apply(
    lambda x: ols(
        par = x, 
        X = df_happiness['life_exp_sc'], 
        y = df_happiness['happiness']
    ),
    axis = 1
)

min_loss = guesses[guesses['objective'] == guesses['objective'].min()]

min_loss

The following plot shows the objective value for each guess, where darker values indicate we’re getting closer to our smallest objective value.

Results of parameter search

If we inspect our results from the standard functions, we had estimates of 5.44 and 0.89 for our coefficients, versus the best guess from our approach of 5.4 and 0.9. These are very similar but not exactly the same, mostly due to the granularity of our guesses. Even so, in the end we get pretty dang close to what our basic lm or statsmodels functions would give us. Pretty neat!

Estimation and/or optimization can be seen as the process of a model learning which parameters will best allow the predictions to match the observed data, and hopefully, predict as-yet-unseen future data. This is a very common way to think about estimation in machine learning, and it is a useful way to think about our simple linear model also.

One thing to keep in mind is that it is not a magical process. It takes good data, a good idea (model), and an appropriate estimation method to get good results.

3.6 Optimization

Before we get into other objective functions, let’s think about a better way to find good parameters for our model. Rather than just guessing, we can use a more systematic approach, and thankfully, there are tools out there to help us. We just use a function like our OLS function, give it a starting point, and let the algorithms do the rest! These tools eventually arrive at a pretty good set of parameters, and are optimized for speed.

Previously we created a set of guesses and tried each one, an approach called a grid search, which is a bit of a brute-force way to find the best-fitting model. You can probably imagine a couple of unfortunate scenarios for this approach, such as having a very large number of parameters to search. Or our range of guesses might not include the right parameters at all, or we specify a very large range while the best-fitting model lies within a very narrow part of it, so the search takes a long time. In any of these cases we waste a lot of time or may not find an optimal solution.

In general, we can think of optimization as employing a smarter, more efficient way to find what you’re looking for. Here’s how it works:

  • Start with an initial guess for the parameters, then…

  • Calculate the objective function given the parameters

  • Update the parameters to a new guess (that hopefully improves the objective function)

  • Repeat, until the improvement is small enough, or a maximum number of iterations is reached.

The key idea now is how we update the old parameters with a new guess at each iteration. Different optimization algorithms use different approaches to find those new guesses. The process stops when the improvement is within a set tolerance or the maximum number of iterations is reached. If the objective is met, we say that our model has converged. Sometimes, the number of iterations is not enough for us to reach convergence in terms of tolerance, and we have to try again with a different set of parameters, a different algorithm, maybe use some data transformations, or something else.

So, let’s try it out! We start out with several inputs:

  • the objective function
  • the initial guess for the parameters to get things going
  • other related inputs to the objective function
  • options for the optimization process, e.g. algorithm, maximum number of iterations, etc.

With these inputs, we’ll let the optimization functions do the rest of the work. We’ll also compare our results to the standard functions to make sure we’re on the right track.

We’ll use the optim function in R.

our_result = optim(
    par     = c(1, 0),
    fn      = ols,
    X       = df_happiness$life_exp_sc,
    y       = df_happiness$happiness,
    method  = "BFGS", # optimization algorithm
    control = list(   # specify tolerance, max iter, etc. here
        reltol = 1e-6,
        maxit  = 500
    )
)

our_result

We’ll use the minimize function in Python.

from scipy.optimize import minimize

our_result = minimize(
    fun    = ols,
    x0     = np.array([1., 0.]),
    args   = (
        np.array(df_happiness['life_exp_sc']), 
        np.array(df_happiness['happiness'])
    ),
    method  = 'BFGS', # optimization algorithm
    tol     = 1e-6,   # tolerance
    options = {
        'maxiter': 500
    }
)

our_result
Table 3.1: Comparison of our results to standard function

Parameter        Standard   Our Result
Intercept        5.4450     5.4450
Life Exp. Coef.  0.8880     0.8880
Objective/MSE    0.4890     0.4890

So, our little function and the right tool allow us to come up with the same thing as base R and statsmodels! I hope you’re feeling pretty good at this point, because you should! You just proved you could do what seemed like magic before, but really all it took was a little knowledge about some key concepts to demystify the process. So, let’s keep going!

The objective function is often called the loss function, and sometimes the cost function. However, both of these names imply that we are trying to minimize the function, which is not always the case, and it’s arbitrary whether you want to minimize or maximize the function (maximizing a function is the same as minimizing its negative).

The term metric, such as the MSE or AUC we’ve seen elsewhere, refers to a value that you might also want to use to evaluate the model. Some metrics can also be used as an objective function. For instance, we might minimize MSE as our objective, but also calculate other metrics like Adjusted R-squared or Mean Absolute Error to evaluate the model.

This can be very confusing when starting out! We’ll try to stick to the term metric for additional values that we might want to examine, apart from the objective function value.

3.7 Maximum Likelihood

In our example, we’ve been minimizing the mean of the squared errors to find the best parameters for our model. But let’s think about this differently. Consider the data generating process: ignoring the model, imagine that each happiness value is generated by some random process, like drawing from a normal distribution. Something like this would describe it mathematically:

\[ \text{happiness} \sim N(\text{mean}, \text{sd}) \]

where the mean is just the mean of happiness, and sd is its standard deviation. In other words, we can think of happiness as a random variable that is drawn from a normal distribution with mean and standard deviation as the parameters of that distribution.

Let’s apply this idea to our linear model setting. In this case, the mean is a function of life expectancy, and we’re not sure what the standard deviation is, but we can go ahead and write our model as follows.

\[ \text{happiness} \sim N(\text{mean}, \text{sd}) \]

\[ \text{mean} = \beta_0 + \beta_1 * \text{life\_exp} \]

Now, our model is estimating the parameters of the normal distribution. We have an extra parameter to estimate - the standard deviation - which is similar to our RMSE, but it’s just another parameter to estimate, like our coefficients.

Now when we compare our prediction to the observed value, we’re still interested in the discrepancy between the two, but we frame it differently: we think about the likelihood of observing the happiness score given our prediction. The likelihood is based on the estimated parameters, one of which (the mean) is a function of the coefficients and life expectancy, and it is provided by the chosen statistical distribution. We can write this as:

\[ \text{Pr}(\text{happiness} \mid \text{life\_exp}, \beta_0, \beta_1, \text{sd}) \]

\[ \text{Pr}(\text{happiness} \mid \text{mean}, \text{sd}) \]

Thinking more generally, the likelihood gives us the probability of the observed data given the parameter estimates.

\[ \text{Pr}(\text{Data} \mid \text{Parameters}) \]

This demo shows how to calculate a likelihood for our model. The values you see are called probability density values. They’re not exactly probabilities, but they show the relative likelihood of each observation. You can think of them like you do probabilities, but remember that likelihoods are slightly different.

# two example life expectancy scores, at the mean (0) and 1 sd above
life_expectancy = c(0, 1)

# observed happiness scores
happiness = c(4, 5.2)

# predicted happiness with rounded coefs
mu = 5 + 1 * life_expectancy

# just a guess for sigma
sigma = .5

# likelihood for each observation
L = dnorm(happiness, mean = mu, sd = sigma)
L
[1] 0.1079819 0.2218417
And in Python:

from scipy.stats import norm

# two example life expectancy scores, at the mean (0) and 1 sd above
life_expectancy = np.array([0, 1])

# observed happiness scores
happiness = np.array([4, 5.2])

# predicted happiness with rounded coefs
mu = 5 + 1 * life_expectancy

# just a guess for sigma
sigma = .5

# likelihood for each observation
L = norm.pdf(happiness, loc = mu, scale = sigma)
L
array([0.1080, 0.2218])

With a guess for the parameters and an assumption about the data’s distribution, we can calculate the likelihood of each data point. We then get a total likelihood for all observations, similar to how we summed the squared errors previously. But unlike errors, we want more likelihood, not less. In theory we’d multiply the individual likelihoods, but in practice we sum their logs, otherwise the product would get too small for our computers to handle (numerical underflow). We can also turn our problem into a minimization problem by multiplying the log likelihood by -1 and then minimizing that value, which many optimization algorithms are designed to do.
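To see this concretely, here is what it looks like with the L values from the R demo just above:

prod(L)        # joint likelihood; with many observations this underflows toward zero
sum(log(L))    # log likelihood, numerically stable
-sum(log(L))   # negative log likelihood, which we can minimize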

The following shows a function we can use to calculate the likelihood of the data given our parameters. The actual likelihood value isn’t easily interpretable, but higher values are better, and we can use it to compare models with different parameter guesses. Even if our total (log) likelihoods are negative, we prefer the model with the relatively higher likelihood.

likelihood = function(par, X, y) {
    X = cbind(1, X)
    # setup
    beta = par[-1] # coefficients
    sigma = exp(par[1]) # error sd, exp keeps positive

    N = nrow(X)

    LP = X %*% beta # linear predictor
    mu = LP # identity link in the glm sense

    # calculate (log) likelihood
    ll = dnorm(y, mean = mu, sd = sigma, log = TRUE)
    -sum(ll) # for minimization
}

our_result = optim(
    par = c(1, 0, 0),
    fn  = likelihood,
    X   = df_happiness$life_exp_sc,
    y   = df_happiness$happiness
)

our_result
And in Python:

def likelihood(par, X, y):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]

    # setup
    beta   = par[1:]         # coefficients
    sigma  = np.exp(par[0])  # error sd, exp keeps positive

    N = X.shape[0]

    LP = X @ beta          # linear predictor
    mu = LP                # identity link in the glm sense

    # calculate (log) likelihood
    ll = norm.logpdf(y, loc = mu, scale = sigma) 
    return(-np.sum(ll))

our_result = minimize(
    fun  = likelihood,
    x0   = np.array([1, 0, 0]),
    args = (
        np.array(df_happiness['life_exp_sc']), 
        np.array(df_happiness['happiness'])
    )
)

our_result

To use a maximum likelihood approach for linear models, you can use functions like glm in R or GLM in Python. We can also use different likelihoods corresponding to the binomial, Poisson, and other distributions, and still other packages allow even more distributions for consideration. In general, we choose a distribution that we feel best reflects the data generating process. For binary targets, for example, we’d typically feel a Bernoulli or binomial distribution is appropriate. For count data, we might choose a Poisson or negative binomial distribution. For targets that fall between 0 and 1, we might go for a beta distribution. You can see some of these demonstrated in Chapter 4.

There are many distributions to choose from, and the best one depends on your data. Sometimes, even if one distribution seems like a better fit, we might choose another one because it’s easier to use. Some distributions are special cases of others, or they might become more like a normal distribution under certain conditions. For example, the exponential distribution is a special case of the gamma distribution, and a t distribution with many degrees of freedom looks like a normal distribution. Here is a visualization of the relationships among some of the more common distributions (Wikipedia (2023)).

Figure 3.1: Relationships among some common probability distributions

When you realize that many distributions are closely related, it’s easier to understand why we might choose one over another, and also why we might use a simpler option even if it’s not the best fit - you likely won’t come to a different conclusion about your model. Ultimately, you’ll get a better feel for this as you work with different types of data and models.

Here are examples of standard GLM functions, which just require an extra argument for the family of the distribution.

glm(happiness ~ life_exp_sc, data = df_happiness, family = gaussian)
glm(binary_target ~ x1 + x2, data = some_data, family = binomial)
glm(count ~ x1 + x2, data = some_data, family = poisson)
And in Python:

import statsmodels.formula.api as smf

smf.glm(
    'happiness ~ life_exp_sc', 
    data = df_happiness, 
    family = sm.families.Gaussian()
)

smf.glm(
    'binary_target ~ x1 + x2', 
    data = some_data, 
    family = sm.families.Binomial()
)

smf.glm(
    'count ~ x1 + x2', 
    data = some_data, 
    family = sm.families.Poisson()
)

With that in mind, we can compare our result to a built-in function that has capabilities beyond OLS. As before, we’re essentially duplicating the basic glm result. We show more decimal places on the log likelihood estimate for a more precise comparison.

Table 3.2: Comparison of our results to the built-in function

Parameter        Standard   Our Result
Intercept        5.445      5.445
Life Exp. Coef.  0.888      0.888
Sigma            0.705      0.699¹
LogLik (neg)     118.804    118.804

¹ Parameter estimate is exponentiated

3.7.0.1 Diving deeper

Let’s think more about what’s going on here. It turns out that our objective function defines a ‘space’ or ‘surface’. You can imagine the process as searching for the lowest point on a landscape, with each guess a point on that landscape. Let’s start to get a sense of this with the following visualization, based on a single parameter. The data come from a variable with a true average of 5. As our guesses get closer to 5, the likelihood increases, and with more data the peak of the likelihood hones in more tightly on the true value. Model estimation finds the maximum on that curve, and optimization algorithms are the means to find it.
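As a rough sketch of what underlies that kind of curve (toy data, not the figure’s; we fix the standard deviation and vary only the mean):

set.seed(123)
y_toy = rnorm(100, mean = 5, sd = 1)   # toy data with a true mean of 5

mus = seq(4, 6, 0.01)                  # grid of guesses for the mean

# log likelihood of the data at each guess
log_lik = sapply(mus, \(m) sum(dnorm(y_toy, mean = m, sd = 1, log = TRUE)))

mus[which.max(log_lik)]  # the guess with the highest likelihood, near 5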

Figure 3.2: Likelihood function for a single parameter

Now let’s add a parameter. With more than one parameter, we have a surface to deal with. Given a starting point, an optimization procedure then travels along the surface looking for a minimum/maximum point. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. However, even our simple model has three parameters plus the likelihood, so it would be difficult to visualize without additional complexity. Instead, we show the results for an alternate model where happiness is standardized also, which means the intercept is zero, and so it is not shown.

Figure 3.3: Likelihood surface for happiness and life expectancy (interactive)

We can also see the path our estimates take. Starting at a fairly bad estimate, the optimization algorithm quickly updates to estimates that result in a better likelihood value. We also see little exploratory jumps creating a star-like pattern, before things ultimately settle to the best values. In general, these updates and paths depend on the optimization algorithm used.

Figure 3.4: Optimization path for two parameters

For linear regression assuming a normal distribution, the maximum likelihood estimate of the standard deviation is the standard deviation of the residuals (computed with \(N\) rather than \(N - p\) in the denominator). Furthermore, the maximum likelihood coefficient estimates are identical to the OLS estimates, and the standard deviation estimates converge as the sample size increases. For most data sizes in practice these estimates are indistinguishable, and the OLS estimate is the maximum likelihood estimate for the linear regression coefficients.
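As a quick check of that relationship, a sketch using the model_happy lm object fit earlier:

res = residuals(model_happy)

sqrt(mean(res^2))    # ML estimate of sigma (N in the denominator)
sigma(model_happy)   # lm's residual SD uses N - p, so it is slightly larger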

3.8 Penalized Objectives

One thing we may want to take into account with our models is their complexity, especially in the context of overfitting. We talk about this in the machine learning context as well (Chapter 6), but the basic idea is that we can get too familiar with the data we have, and when we try to predict on new data the model hasn’t seen before, performance suffers, possibly to the point of being worse than a simpler model. In other words, we are not generalizing well (Section 6.4).

One way to deal with this is to penalize the objective function value for complexity, or at least favor simpler models that might do as well. In some contexts this is called regularization, and in other contexts shrinkage, since the parameter estimates are typically shrunk toward some specific value (e.g., zero).

As a starting point, in our basic linear model we can add a penalty that is applied to the size of the coefficients. This is called ridge regression, or, more mathily, L2 regularization. We can write this formally as:

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{3.2}\]

The first part is the same as basic OLS (Equation 3.1), but the second part is a penalty for the \(p\) features. The penalty is the sum of the squared coefficients multiplied by some value, which we call \(\lambda\). This is an additional model parameter that we typically want to estimate, e.g. through cross-validation. This kind of parameter is often called a hyperparameter, mostly just to distinguish it from those that may be of actual interest. For example, we probably couldn’t care less what the actual value for \(\lambda\) is, but we would still be interested in the coefficients.

In the end this is just a small change to ordinary least squares (OLS) regression (Equation 3.1), but it can make a big difference. It introduces some bias in the coefficients - recall that OLS is unbiased if assumptions are met - but it can help to reduce variance, which can help the model perform better on new data (Section 6.4.2). In other words, we are willing to accept some bias in order to get a model that generalizes better.

But let’s get to a code example to demystify this a bit! Here is an example of a function that calculates the ridge objective. To make things interesting, let’s add the other features we talked about regarding GDP per capita and perceptions of corruption.

ridge = function(par, X, y, lambda = 0) {
    # add a column of 1s for the intercept
    X = cbind(1, X)

    # Calculate the predicted values
    mu = X %*% par # %*% is matrix multiplication

    # Calculate the value as sum squared error
    error = crossprod(y - mu)

    # Add the penalty; note this penalizes the intercept too,
    # which standard implementations typically do not
    value = error + lambda * crossprod(par)

    return(value)
}

X = df_happiness |>
    select(-happiness, -country) |> 
    as.matrix()

our_result = optim(
    par = c(0, 0, 0, 0),
    fn = ridge,
    X = X,
    y = df_happiness$happiness,
    lambda = 0.1,
    method = "BFGS"
)
And in Python:

# we use lambda_ because lambda is a reserved word in python
def ridge(par, X, y, lambda_ = 0):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]

    # Calculate the predicted values
    mu = X @ par
    
    # Calculate the error
    value = np.sum((y - mu)**2)
    
    # Add the penalty; note the intercept is penalized here too
    value = value + lambda_ * np.sum(par**2)
    
    return(value)

our_result = minimize(
    fun  = ridge,
    x0   = np.array([0, 0, 0, 0]),
    args = (
        np.array(df_happiness.drop(columns=['happiness', 'country'])),
        np.array(df_happiness['happiness']), 
        0.1
    )
)

We can compare this to built-in functions as we have before, and we can see that the results are very similar, but not exactly the same. We would not worry about such differences in practice, but the main point is again that we can use simple functions that do just about as well as what we’d get from package output.

Table 3.3: Comparison of ridge regression results

Parameter    Standard¹   Our Result
Intercept    5.44        5.44
life_exp_sc  0.49        0.52
corrupt_sc   −0.12       −0.11
gdp_pc_sc    0.42        0.44

¹ Showing results from R glmnet package with alpha = 0, lambda = .1

It turns out that, given a fixed \(\lambda\) penalty, the ridge regression estimates can be obtained analytically. Here is a demo.
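Below is a minimal sketch of that closed-form solution, reusing the X matrix from the previous code (and, like our function, leaving the intercept in the penalty):

lambda = 0.1
X_int  = cbind(1, X)   # add the intercept column

# (X'X + lambda*I)^-1 X'y
beta_ridge = solve(
    crossprod(X_int) + lambda * diag(ncol(X_int)),
    crossprod(X_int, df_happiness$happiness)
)

beta_ridge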

Another very common penalized approach is to use the sum of the absolute values of the coefficients, which is called lasso regression or L1 regularization. An interesting property of the lasso is that, in typical implementations, it will potentially zero out coefficients, which is the same as dropping the feature from the model altogether. This is a form of feature selection or variable selection. The true coefficient values are almost certainly never exactly zero, but if we want to use a ‘best subset’ of features, this is one way we could do so. We can write the lasso objective as follows. The chapter exercise asks you to implement this yourself.

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{3.3}\]

3.9 Classification

So far, we’ve been assuming a continuous target, but what if we have a categorical target? Now we have to learn a bunch of new stuff for that situation, right? Actually, no! When we want to model categorical targets, conceptually, nothing changes! We still have an objective function that maximizes or minimizes some goal, and use the same algorithms to estimate parameters, etc. However, we need to think about how we can do this in a way that makes sense for the target.

3.9.1 Misclassification

A straightforward correspondence to MSE is a function that minimizes classification error, or by the same token, maximizes accuracy. In other words, we can think of the objective function as the proportion of incorrect classifications. This is called the misclassification rate.

\[ \text{Loss} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i \neq \hat{y_i}) \]

In the equation, \(y_i\) is the actual value of the target for observation \(i\), arbitrarily coded as 1 or 0, and \(\hat{y_i}\) is the predicted class from the model. The \(\mathbb{1}\) is an indicator function that returns 1 if the condition is true, and 0 otherwise. In other words, we are counting the number of times the predicted value is not equal to the actual value, and dividing by the number of observations. Very straightforward, so let’s do this ourselves!

# misclassification rate
misclassification = function(par, X, y, class_threshold = .5) {
    X = cbind(1, X)
    # Calculate the predicted values
    mu = X %*% par # %*% is matrix multiplication

    # Convert to a probability ('sigmoid' function)
    p = 1 / (1 + exp(-mu))

    # Convert to a class (1 vs. 0)
    predicted_class = ifelse(p > class_threshold, 1, 0)

    # Calculate the misclassification rate
    return(mean(y != predicted_class))
}
And in Python:

def misclassification_rate(par, X, y, class_threshold = .5):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]

    # Calculate the predicted values
    mu = X @ par
    
    # Convert to a probability ('sigmoid' function)
    p = 1 / (1 + np.exp(-mu))
    
    # Convert to a class
    predicted_class = np.where(p > class_threshold, 1, 0)
    
    # Calculate the misclassification rate
    return(np.mean(y != predicted_class))

Note that our function first adds a step to convert the linear predictor (called mu) to a probability. Once we have a probability, we use some threshold to convert it to a ‘class’. In this case, we use 0.5 as the threshold, but this could be different depending on the context, something we talk more about elsewhere (Section 2.2.2.7). We’ll leave it as an exercise for you to play around with this, as the next objective function is more commonly used. But at least you can see how easy it can be to switch to the classification case.

3.9.2 Log loss

Another approach is to use the log loss, sometimes called logistic loss or cross-entropy. If we have just the binary case it is:

\[ \text{Loss} = -\sum_{i=1}^{n} \left[ y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i}) \right] \]

Where \(y_i\) is the actual value of the target for observation \(i\), and \(\hat{y_i}\) is the predicted value from the model (essentially a probability). It turns out that this is the same as the log-likelihood used in a maximum likelihood approach for logistic regression, made negative so we can minimize it.

We typically prefer this objective function to classification error because it results in a smooth optimization surface, like in the visualization we showed before for maximum likelihood, which means it is differentiable in the mathematical sense. This is important because it allows us to use optimization algorithms that rely on derivatives to update the parameter estimates. You don’t need to get into that too much, just know that a smoother objective function is something we prefer. Here’s some code to try out.

objective = function(par, X, y) {
    X = cbind(1, X)

    # Calculate the predicted values on the raw scale
    y_hat = X %*% par

    # Convert to a probability ('sigmoid' function)
    y_hat = 1 / (1 + exp(-y_hat))

    # likelihood
    ll = y * log(y_hat) + (1 - y) * log(1 - y_hat)

    # alternative
    # dbinom(y, size = 1, prob = y_hat, log = TRUE)

    return(sum(-ll))
}
And in Python:

def objective(par, X, y):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]

    # Calculate the predicted values
    y_hat = X @ par
    
    # Convert to a probability ('sigmoid' function)
    y_hat = 1 / (1 + np.exp(-y_hat))
    
    # likelihood
    ll = y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)
    
    return(-np.sum(ll))

Let’s go ahead and demonstrate this. To create a classification problem, we’ll say that a country is ‘happy’ if the happiness score is greater than 5.5, and ‘unhappy’ otherwise. We’ll use the same features as before.

df_happiness_bin = df_happiness |>
    mutate(happiness = ifelse(happiness > 5.5, 1, 0))

mod_logloss = optim(
    par = c(0, 0, 0, 0),
    fn = objective,
    X = df_happiness_bin |>
        select(life_exp_sc:gdp_pc_sc) |>
        as.matrix(),
    y = df_happiness_bin$happiness
)

mod_glm = glm(
    happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc,
    data   = df_happiness_bin,
    family = binomial
)

mod_logloss$par
And in Python:

from scipy.optimize import minimize

df_happiness_bin = df_happiness.copy()
df_happiness_bin['happiness'] = np.where(df_happiness['happiness'] > 5.5, 1, 0)

mod_logloss = minimize(
    objective,
    x0 = np.array([0, 0, 0, 0]),
    args = (
        df_happiness_bin[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],
        df_happiness_bin['happiness']
    )
)

mod_glm = smf.glm(
    'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',
    data   = df_happiness_bin,
    family = sm.families.Binomial()
).fit()

mod_logloss['x']

Once again, we can see that the results are very similar between our classification result and the built-in function.

Table 3.4: Comparison of log loss results

name          Ours      GLM
LogLik (neg)  40.6635   40.6635
intercept     −0.1642   −0.1637
life_exp_sc   1.8168    1.8172
corrupt_sc    −0.4638   −0.4648
gdp_pc_sc     1.1307    1.1311

So, when it comes to classification, you should feel confident in what’s going on under the hood, just like you did with a numeric target. Too much is made of the distinction between ‘regression’ and ‘classification’ and it can be confusing to those starting out. In reality, classification just requires a slightly different way of thinking about the target. Conceptually it really is the same approach.

3.10 Optimization Algorithms

When it comes to optimization, there are a number of algorithms that have been developed over time. We’ll demonstrate one of the most popular ones used in machine learning, though there are many variants of even this one! The main thing to keep in mind is that these are all just ways to find the best-fitting parameters for a model. Some may be better suited to certain data tasks, or provide computational advantages, but often the choice of algorithm is not as important as many other modeling choices.

3.10.1 Gradient Descent

One of the most popular approaches in optimization is called gradient descent. It uses the gradient of the function we’re trying to optimize to find the best parameters. We still use objective functions as before - gradient descent is just a way to find the path along the objective surface. More formally, the gradient is the vector of partial derivatives of the objective function with respect to each parameter. That may not mean much to you, but the basic idea is that the gradient points in the direction of steepest increase in the function. So if we want to maximize the objective function, we can take a step in the direction of the gradient, and if we want to minimize it, we take a step in the opposite direction (the negative gradient).

The size of the step is called the learning rate, and, like the penalty parameter we saw with penalized regression, it is a hyperparameter that we can tune through cross-validation (Section 6.7). If the learning rate is too small, it will take a long time to converge; if it’s too large, we might overshoot the objective and miss the best parameters. There are a number of variations on gradient descent that have been developed over time. Let’s see this in action with the world happiness model.

gradient_descent = function(
    par,
    X,
    y,
    tolerance = 1e-3,
    maxit = 1000,
    learning_rate = 1e-3
) {
    # add a column of 1s for the intercept
    X = cbind(1, X)
    N = nrow(X)

    # initialize
    beta = par
    names(beta) = colnames(X)
    mse = crossprod(X %*% beta - y) / N
    tol = 1
    iter = 1

    while (tol > tolerance && iter < maxit) {
        LP = X %*% beta
        grad = t(X) %*% (LP - y)
        betaCurrent = beta - learning_rate * grad
        tol = max(abs(betaCurrent - beta))
        beta = betaCurrent
        mse = append(mse, crossprod(LP - y) / N)
        iter = iter + 1
    }

    # recompute fitted values with the final estimates
    LP = X %*% beta

    list(
        par    = beta,
        loss   = mse,
        MSE    = crossprod(LP - y) / nrow(X),
        iter   = iter,
        fitted = LP
    )
}

X = df_happiness |>
    select(life_exp_sc:gdp_pc_sc) |>
    as.matrix()

our_result = gradient_descent(
    par = c(0, 0, 0, 0),
    X = X,
    y = df_happiness$happiness,
    learning_rate = 1e-3
)
And in Python:

def gradient_descent(
    par, 
    X, 
    y, 
    tolerance = 1e-3, 
    maxit = 1000, 
    learning_rate = 1e-3
):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]
    
    # initialize
    beta = par
    loss = np.sum((X @ beta - y)**2)
    tol = 1
    iter = 1

    while (tol > tolerance and iter < maxit):
        LP = X @ beta
        grad = X.T @ (LP - y)
        betaCurrent = beta - learning_rate * grad
        tol = np.max(np.abs(betaCurrent - beta))
        beta = betaCurrent
        loss = np.append(loss, np.sum((LP - y)**2))
        iter = iter + 1

    # recompute fitted values with the final estimates
    LP = X @ beta

    return({
        "par": beta,
        "loss": loss,
        "RSE": np.sqrt(np.sum((LP - y)**2) / (X.shape[0] - X.shape[1])),
        "iter": iter,
        "fitted": LP
    })

our_result = gradient_descent(
    par = np.array([0, 0, 0, 0]),
    X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']].to_numpy(),
    y = df_happiness['happiness'].to_numpy(),
    learning_rate = 1e-3
)

Comparing our results, we have the following table. As usual, we see that the results are very similar. Once again, we have demystified a step in the modeling process!

Table 3.5: Comparison of gradient descent results

Value        Standard   Our Result
Intercept    5.445      5.437
life_exp_sc  0.525      0.521
corrupt_sc   −0.105     −0.107
gdp_pc_sc    0.438      0.439
MSE          0.367      0.367

In addition, when we visualize the loss across iterations, we see a smooth decline in the MSE value as we go along. This is a good sign that we are converging to an optimal solution.
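A quick way to see this for yourself, given our_result from the R version above:

plot(our_result$loss, type = 'l', xlab = 'Iteration', ylab = 'MSE')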

Figure 3.5: Loss with gradient descent

3.10.2 Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a version of gradient descent that uses a random sample of the data to estimate the gradient, instead of using all of it. This makes it less accurate in some ways, but it’s faster and can be parallelized. This speed is especially useful in machine learning when there’s a lot of data, which also often makes the discrepancy between standard GD and SGD small. As such, you will see variants of it incorporated in many models in deep learning, but it can be used with much simpler models as well.

Let’s see this in action with the happiness model. The following is a conceptual version of the AdaGrad approach, a variation of SGD that adjusts the learning rate for each parameter. The Python version also includes a flag for averaging the parameter estimates across iterations, a common way to improve SGD performance, though it is off by default, just something you can play with. We are going to use a ‘batch size’ of one, which makes this similar to a ‘streaming’ or ‘online’ version where we update the model with each observation. Since our data are alphabetically ordered, we’ll shuffle the data first. We’ll also use a stepsize_tau parameter, which is a way to rein in the learning rate at early iterations. The values for the learning rate and stepsize_tau are arbitrary, selected after some initial playing around, but you can experiment with them to see how they affect the results.

stochastic_gradient_descent = function(
    par, # parameter estimates
    X,   # model matrix
    y,   # target variable
    learning_rate = 1, # the learning rate
    stepsize_tau = 0   # if > 0, a check on the LR at early iterations
    ) {
    # initialize
    set.seed(123)

    # shuffle the data
    idx = sample(1:nrow(X), nrow(X))
    X = X[idx, ]
    y = y[idx]

    X = cbind(1, X)
    beta = par

    # Collect all estimates
    betamat = matrix(0, nrow(X), ncol = length(beta))

    # Collect fitted values at each point))
    fits = NA

    # Collect loss at each point
    loss = NA

    # adagrad per parameter learning rate adjustment
    s = 0

    # a smoothing term to avoid division by zero
    eps = 1e-8

    for (i in 1:nrow(X)) {
        Xi = X[i, , drop = FALSE]
        yi = y[i]

        # matrix operations not necessary here,
        # but makes consistent with previous gd func
        LP = Xi %*% beta
        grad = t(Xi) %*% (LP - yi)
        s = s + grad^2 # adagrad approach

        # update
        beta = beta - learning_rate / 
            (stepsize_tau + sqrt(s + eps)) * grad
        betamat[i, ] = beta

        fits[i] = LP
        
        loss[i] = crossprod(LP - yi)
        
    }

    LP = X %*% beta
    lastloss = crossprod(LP - y)

    list(
        par = beta,          # final estimates
        par_chain = betamat, # estimates at each iteration
        MSE = sum(lastloss) / nrow(X),
        fitted = LP
    )
}

X_train = df_happiness |>
    select(life_exp_sc, corrupt_sc, gdp_pc_sc) |>
    as.matrix()

y_train = df_happiness$happiness

our_result = stochastic_gradient_descent(
    par = c(mean(df_happiness$happiness), 0, 0, 0),
    X = X_train,
    y = y_train,
    learning_rate = .15,
    stepsize_tau = .1
)

our_result$par
And in Python:

def stochastic_gradient_descent(
    par, # parameter estimates
    X,   # model matrix
    y,   # target variable
    learning_rate = 1, # the learning rate
    stepsize_tau = 0,  # if > 0, a check on the LR at early iterations
    average = False    # a variation of the approach (not used in this sketch)
):
    # initialize
    np.random.seed(1234)

    # shuffle the data
    idx = np.random.choice(
        df_happiness.shape[0], 
        df_happiness.shape[0], 
        replace = False
    )
    X = X[idx, :]
    y = y[idx]
    
    X = np.c_[np.ones(X.shape[0]), X]
    beta = par

    # Collect all estimates
    betamat = np.zeros((X.shape[0], beta.shape[0]))

    # Collect fitted values at each point))
    fits = np.zeros(X.shape[0])

    # Collect loss at each point
    loss = np.zeros(X.shape[0])

    # adagrad per parameter learning rate adjustment
    s = 0

    # a smoothing term to avoid division by zero
    eps = 1e-8

    for i in range(X.shape[0]):
        Xi = X[None, i, :]
        yi = y[i]

        # matrix operations not necessary here,
        # but makes consistent with previous gd func
        LP = Xi @ beta
        grad = Xi.T @ (LP - yi)
        s = s + grad**2 # adagrad approach

        # update
        beta = beta - learning_rate / \
            (stepsize_tau + np.sqrt(s + eps)) * grad

        betamat[i, :] = beta

        fits[i] = LP
        loss[i] = np.sum((LP - yi)**2)

    LP = X @ beta
    lastloss = np.sum((LP - y)**2)

    return({
        "par": beta, # final estimates
        "par_chain": betamat, # estimates at each iteration
        "MSE": lastloss / X.shape[0],
        "fitted": LP
    })
X_train = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']]
y_train = df_happiness['happiness']

our_result = stochastic_gradient_descent(
    par = np.array([np.mean(df_happiness['happiness']), 0, 0, 0]),
    X = X_train.to_numpy(),
    y = y_train.to_numpy(),
    learning_rate = .15,
    stepsize_tau = .1
)

our_result['par']

Next we’ll compare it to the OLS estimates and our previous ‘batch’ gradient descent results. Even though SGD normally would not be used for such a small dataset, we at least get close!

Table 3.6: Comparison of stochastic gradient descent results

Value        Standard   Our Result (SGD)   Batch GD
Intercept    5.445      5.469              5.437
life_exp_sc  0.525      0.514              0.521
corrupt_sc   −0.105     −0.111             −0.107
gdp_pc_sc    0.438      0.390              0.439
MSE          0.367      0.370              0.367

Here’s a plot of the estimates as they moved along the data. For this plot we don’t include the intercept, as it’s on a notably different scale. We can see that the estimates move around a bit, but they appear to be converging to a solution.
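A rough way to make that kind of plot yourself, using the par_chain matrix from the R result above (dropping the intercept column):

matplot(our_result$par_chain[, -1], type = 'l', xlab = 'Iteration', ylab = 'Estimate')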

Figure 3.6: Stochastic gradient descent path

3.10.3 Other Optimization Algorithms

There are lots of other approaches to optimization. For example, here are some of the options available in R’s optim or scipy’s minimize function:

  • Nelder-Mead
  • BFGS
  • L-BFGS-B (provides constraints)
  • Conjugate gradient
  • Simulated annealing
  • Newton’s method
  • Genetic algorithms
  • Other Popular SGD extensions and variants
    • RMSProp
    • Adam/momentum
    • Adagrad

The main reason to choose one method over another is usually based on factors like speed, memory use, or how well they work for certain models. For statistical contexts, many GLM functions use Newton’s method by default, but more complicated models may implement a different default for better convergence. In general, we can always try a few different methods to see which works best, but usually the results are similar. For example, here are the results for the happiness model using different algorithms, with a comparison to the standard linear regression model function. We can see that the results are very similar, and for simpler modeling endeavors they should converge on the same result.
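As a sketch of how you might run such a comparison yourself, assuming the ols function and the feature matrix X defined earlier:

methods = c('Nelder-Mead', 'BFGS', 'CG')

sapply(methods, \(m) {
    optim(
        par    = c(0, 0, 0, 0),
        fn     = ols,
        X      = X,
        y      = df_happiness$happiness,
        method = m
    )$par
})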

Table 3.7: Comparison of optimization results

parameter    NM¹      BFGS²    CG³      GD⁴      Standard⁵
Intercept    5.445    5.445    5.445    5.437    5.445
life_exp_sc  0.525    0.525    0.525    0.521    0.525
corrupt_sc   −0.105   −0.105   −0.105   −0.107   −0.105
gdp_pc_sc    0.437    0.438    0.438    0.439    0.438
MSE          0.367    0.367    0.367    0.367    0.367

¹ NM = Nelder-Mead
² BFGS = Broyden–Fletcher–Goldfarb–Shanno
³ CG = Conjugate gradient
⁴ GD = Gradient descent
⁵ Standard = Standard package function

3.11 Other Estimation Approaches

Before leaving our estimation discussion, we should mention there are other approaches one could use, including variations on least squares, the method of moments, generalized estimating equations, robust estimation, and more. We’ve focused on the most common ones, but it’s good to be aware of others that might be more common in some domains. There are two we want to discuss in a little more detail given their widespread usage: the bootstrap and Bayesian estimation.

3.11.1 Bootstrap

The bootstrap is a method where we create new datasets by randomly sampling from our original data with replacement, allowing the same observation to be picked more than once. We then use these new datasets to estimate our model. We do this many times, collecting parameter estimates, predictions, or anything else we want to calculate along the way. Ultimately, we end up with a distribution of all the things we calculated.

These results give us a range of possible outcomes, which is useful for inference, as we can use the distribution to calculate confidence intervals, prediction intervals, or intervals for any value we calculate. The average estimate is often the same as whatever the underlying model would produce, but the bootstrap provides a way to get at a measure of uncertainty with fewer assumptions about how that distribution should take shape. The approach is very flexible, and it can be used with any model. Let’s see this in action with the happiness model.

bootstrap = function(X, y, nboot = 100, seed = 123) {
    
    N = nrow(X)
    p = ncol(X) + 1 # add one for intercept

    # initialize
    beta = matrix(NA, nrow = nboot, ncol = p)
    colnames(beta) = c('Intercept', colnames(X))
    mse = rep(NA, nboot)

    # set seed
    set.seed(seed)

    for (i in 1:nboot) {
        # sample with replacement
        idx = sample(1:N, N, replace = TRUE)
        Xi = X |> slice(idx)
        yi = y[idx]

        # estimate model
        mod = lm(yi ~., data = Xi)

        # save results
        beta[i, ] = coef(mod)
        mse[i] = sum((mod$fitted - yi)^2) / N
    }

    # given mean estimates, calculate MSE
    y_hat = cbind(1, as.matrix(X)) %*% colMeans(beta)
    final_mse = sum((y - y_hat)^2) / N

    list(
        beta = as_tibble(beta),
        MSE = mse,
        final_mse = final_mse
    )
}

X = df_happiness |>
    select(life_exp_sc:gdp_pc_sc)

our_result = bootstrap(
    X = X,
    y = df_happiness$happiness,
    nboot = 250
)

colMeans(our_result$beta)
  Intercept life_exp_sc  corrupt_sc   gdp_pc_sc 
  5.4424683   0.5083511  -0.1023755   0.4639788 
import numpy as np
from sklearn.linear_model import LinearRegression

def bootstrap(X, y, nboot=100, seed=123):
    # add a column of 1s for the intercept
    X = np.c_[np.ones(X.shape[0]), X]
    N = X.shape[0]

    # initialize
    beta = np.empty((nboot, X.shape[1]))
    
    mse = np.empty(nboot)

    # set seed
    np.random.seed(seed)

    for i in range(nboot):
        # sample with replacement
        idx = np.random.randint(0, N, N)
        Xi = X[idx, :]
        yi = y[idx]

        # estimate model
        model = LinearRegression(fit_intercept=False)
        mod = model.fit(Xi, yi)

        # save results
        beta[i, :] = mod.coef_
        mse[i] = np.sum((mod.predict(Xi) - yi)**2) / N

    # given mean estimates, calculate MSE
    y_hat = X @ beta.mean(axis=0)
    final_mse = np.sum((y - y_hat)**2) / N

    return dict(beta = beta, mse = mse, final_mse = final_mse)

our_result = bootstrap(
    X = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],
    y = df_happiness['happiness'],
    nboot = 250
)

np.mean(our_result['beta'], axis=0)
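
The bounds in the table that follows come from the quantiles of the bootstrap distribution; for example, with the R results above:

# 95% percentile intervals from the bootstrap draws
apply(our_result$beta, 2, quantile, probs = c(0.025, 0.975))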

Here are the interval estimates for the coefficients. Each parameter has the mean estimate, the lower and upper bounds of the 95% interval for both the bootstrap and OLS approaches, and the difference in interval width. The bootstrap intervals are a bit wider than the OLS intervals, possibly better capturing the uncertainty in a model based on relatively few observations.

Table 3.8: Bootstrap parameter estimates
Parameter Mean Lower BS Upper BS Lower OLS Upper OLS Diff Width¹
Intercept 5.44 5.33 5.55 5.33 5.56 −0.02
life_exp_sc 0.51 0.30 0.70 0.35 0.70 0.04
corrupt_sc −0.10 −0.29 0.09 −0.25 0.04 0.09
gdp_pc_sc 0.46 0.18 0.76 0.24 0.64 0.18
¹ Width of bootstrap interval minus width of OLS interval

Let’s look more closely at the distributions for each coefficient. Standard statistical estimates assume a specific shape, like the normal distribution. The bootstrap gives us more flexibility, even if the result often resembles the assumed distribution. These distributions aren’t perfectly symmetric like a normal distribution, but they suit our needs: we can extract the lower and upper quantiles to create an interval estimate.

Figure 3.7: Bootstrap distributions of parameter estimates

The bootstrap is often used for predictions and other metrics. However, it is computationally inefficient, and may not be suitable for large data sets. It also may not estimate the appropriate uncertainty for some types of statistics (e.g., extreme values) or in some data contexts (e.g., correlated observations). Despite these limitations, the bootstrap is a useful tool that can be used together with other methods to understand uncertainty in a model.

3.11.2 Bayesian Estimation

The Bayesian approach to modeling is many things: a philosophical viewpoint, an entirely different way to think about probability, a different way to measure uncertainty, and, on a practical level, just another way to get model parameter estimates. It can be as frustrating as it is fun to use, and one of the really nice things about Bayesian estimation is that it can handle model complexities that other approaches can’t handle well.

The basis of Bayesian estimation is the likelihood, the same as with maximum likelihood, and everything we did there applies here. However, the Bayesian approach is unique because it also lets us use our knowledge about the parameters, through prior distributions. For example, we may think that the coefficients for a linear model come from a normal distribution centered on zero with some variance. That would serve as our prior. The combination of a prior distribution with the likelihood results in the posterior distribution, which is a distribution of possible parameter values.
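
As a sketch of what this looks like in practice, here is how such a model might be set up with the R package brms. The normal(0, 1) prior is an illustrative choice, not necessarily the one used for the results shown.

library(brms)

# happiness model with a normal(0, 1) prior on the coefficients
fit = brm(
    happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc,
    data  = df_happiness,
    prior = set_prior('normal(0, 1)', class = 'b'),
    seed  = 123
)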

Just like the bootstrap, which also provided distributions for the parameters, the Bayesian approach lets us understand how certain we are about our estimates. We can look at any range of values in the posterior distribution to get what is often referred to as a credible interval, the Bayesian equivalent of a confidence interval12. Here is an example of the posterior distribution for the parameters of our happiness model, along with 95% intervals13.

Figure 3.8: Posterior distribution of parameters

With Bayesian estimation, we provide starting values to get the algorithm going, and we typically specify a number of iterations, or times the model will run, as the stopping rule. Each iteration gives us a new guess for each parameter, which amounts to a random draw from the posterior distribution. More iterations mean the model takes longer to run, and the length often reflects the complexity of the model.

We also specify multiple chains, which run the same estimation procedure but, due to the random nature of the Bayesian approach, take different paths. We can then compare the chains to see whether they converge to the same result, which serves as a check on the model. If they do not converge, we may need to run the model longer, or it may indicate a problem with how we set up the model. Here’s an example of the four chains for the life expectancy coefficient in our happiness model. They give similar results, so we know the model is working well. Nowadays, model statistics in the output also provide this information, making it easier to quickly check many parameters.

Figure 3.9: Bayesian chains for life expectancy coefficient
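
Checks like these come almost for free from the software. With brms, for example (assuming the fit object sketched earlier):

# Rhat values near 1 for each parameter suggest the chains converged
summary(fit)   # includes Rhat and effective sample sizes
plot(fit)      # trace and density plots, as in Figure 3.9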

When we are interested in making predictions, we can use the results to generate a distribution of possible predictions for each observation, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as the posterior predictive distribution, which is explored in Section 2.2.4. Here is a plot of several draws of predicted values against the true happiness scores.

Figure 3.10: Posterior predictive distribution of happiness values
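
A plot along these lines can be produced with the posterior predictive check function in recent versions of brms (again assuming the fit object sketched earlier):

# overlay several posterior predictive draws against observed happiness
pp_check(fit, ndraws = 20)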

With the Bayesian approach, every metric we calculate has a range of possible values, not just one. For example, if you have a classification model and want to know its accuracy or true positive rate, instead of a single number, you now have access to a whole distribution of values for that metric. How? For each set of parameter values drawn from the posterior distribution, we apply those values to the data to make predictions, assign them to a class, and compare them to the observed classes. This gives us a range of possible prediction sets, and we can then calculate metrics like accuracy or true positive rate for each. As an example, we did this for our happiness model with a numeric target, and show the interval estimate for R-squared. Pretty neat!
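
With brms, such an interval comes with a single function call (a sketch, using the fit object from before):

# posterior distribution of R-squared, summarized with a 95% interval
bayes_R2(fit)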

Table 3.9: Bayesian R2
Bayes R2 Lower Upper
0.71 0.65 0.75
95% Credible interval for R-squared

As we saw in Section 2.2.4, nothing is keeping you from doing ‘posterior predictive checks’ with other estimation approaches, and it’s a very good idea to do so. For example, with a GLM you have the coefficient estimates and their covariance matrix, and can simulate coefficient draws from a (multivariate) normal distribution with those values. It’s just more straightforward with the Bayesian approach, where packages will do it for you with little effort.
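
Here is a sketch of that simulation idea for a standard linear model, using MASS::mvrnorm; the number of draws is arbitrary.

# fit the standard model and extract estimates and their covariance
mod = lm(happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc, data = df_happiness)

# simulate 1000 coefficient draws from a multivariate normal
coef_draws = MASS::mvrnorm(1000, mu = coef(mod), Sigma = vcov(mod))

# each row is one simulated set of predicted values across observations
sim_preds = coef_draws %*% t(model.matrix(mod))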

3.11.2.1 Additional Thoughts

It turns out that any standard (frequentist) statistical model can be seen as a Bayesian one from a particular point of view. Here are a couple:

  • GLM and related estimated via maximum likelihood: Bayesian estimation with a flat/uniform prior on the parameters.
  • Ridge Regression: Bayesian estimation with a normal prior on the coefficients, penalty parameter is related to the variance of the prior.
  • Lasso Regression: Bayesian estimation with a Laplace prior on the coefficients, penalty parameter is related to the variance of the prior.

So, in many modeling contexts, you’re actually doing a restrictive form of Bayesian estimation already.

We can see that the Bayesian approach is very flexible: it works for many different types of models, and it can get at uncertainty in a model in ways other approaches can’t. It’s not a panacea, and it’s not always the best approach, but it’s a good one to have in your toolbox14. Hopefully we’ve helped to demystify the Bayesian approach a bit, and you’ll feel more comfortable trying it out.

3.12 Wrapping Up

Wow, we covered a lot here! But this is the sort of stuff that can take you from just having some fun with data to understanding how things actually happen. Just having the gist of how modeling is done ‘under the hood’ makes so many other things make sense, and can give you a lot of confidence, even in less familiar modeling domains.

3.12.1 The Thread

Simply put, the content in this chapter ties together any and every model you will ever undertake, from linear regression to reinforcement learning, computer vision, and large language models. Estimation and optimization are the core of any modeling process, and understanding them is key to understanding how models work.

3.12.2 Choose Your Own Adventure

Seriously, after this chapter, you should feel fine with any of the others in this text, so dive in!

3.12.3 Additional Resources

OLS and Maximum Likelihood Estimation:

For OLS and maximum likelihood estimation, there are so many resources out there, we recommend just taking a look and seeing which one suits you best. Practically any more technical statistical book will cover these topics in detail.


3.13 Exercise

Try creating an objective function for a continuous target that uses the mean absolute error, and compare your estimated parameters to the previous results for ordinary least squares. Alternatively, use the ridge regression demonstration and change it to use the lasso approach (this would require altering just one line).


  1. Since life expectancy is scaled, a standard deviation is 1, in this case about 7 years.↩︎

  2. It turns out that our error metric is itself an estimate of the true error. We’ll get more into this later, but for now this means that we can’t ever know the true error, and so we can’t ever really know the best or true model. However, we can still choose a good or better model relative to others based on our estimate.↩︎

  3. We don’t have to do it this way, but it’s the default in most scenarios. As an example, maybe for your situation overshooting is worse than undershooting, and so you might want to use an approach that would weight those errors more heavily.↩︎

  4. Some disciplines seem to confuse models with estimation methods and link functions. It doesn’t really make sense, nor is it informative, to call something an OLS model or a logit model. Many models are estimated using a least squares objective function, even in deep learning, and many different types of models use a logit link, from logistic regression, to beta regression, to activation functions used in deep learning.↩︎

  5. You may find that some packages will only minimize (or maximize) a function, even to the point of reporting nonsensical things like negative squared values, so you’ll need to take care when implementing your own metrics.↩︎

  6. The actual probability of a specific value in this setting is 0, but the probability of a range of values is greater than 0. You can find out more about likelihoods and probabilities at the discussion here, but in general many traditional statistical texts will cover this also.↩︎

  7. Those who have experience here will notice we aren’t putting a lower bound on sigma. You typically want to do this otherwise you may get nonsensical results by not keeping sigma positive. You can do this by setting a specific argument for an algorithm that uses boundaries, or more simply by exponentiating the parameter. In the latter case, you’ll have to exponentiate the estimate to get back to the correct scale. We leave this detail out of the code for now to keep things simple.↩︎

  8. Linear regression will settle on a line that cuts through the means, and when standardizing all variables, the mean of the features and target are both zero, so the line goes through the origin.↩︎

  9. MC does not recall exactly where this function came from, except that Murphy’s PML book was a key reference when he came up with it (Murphy (2012)).↩︎

  10. You’d get better results by also standardizing the target. The initial shuffling that we did can help as well in case the data are ordered, or more typically when we’re dealing with larger data and repeated runs/epochs, it allows the samples/batches to be more representative of the entire data set. Also, we had to ‘hand-tune’ our learning rate and stepsize_tau, which is not ideal, and normally we would use cross-validation to find the best values.↩︎

  11. We’re using inference here in the standard statistical/philosophical sense, not as a synonym for prediction or generalization, which is how it is often used in machine learning. We’re not exactly sure how that terminological muddling arose in ML, but be on the lookout for it.↩︎

  12. Your default interpretation of a standard confidence interval is almost certainly, and incorrectly, the actual interpretation of a Bayesian confidence interval, because the Bayesian interpretation of confidence intervals and p-values is how we tend to naturally think about those statistics. But that’s okay, everyone else is in the same boat. We also don’t care if you want to call the Bayesian version a confidence interval.↩︎

  13. We used the R package brms for these results.↩︎

  14. R has excellent tools here for modeling and post-processing, like brms and tidybayes, and Python has pymc3, numpyro, and arviz, which are also useful. Honestly R has way more going on here, with many packages devoted to Bayesian estimation of specific models even, but if you want to stick with Python it’s gotten a lot better recently.↩︎