term | happiness | life_exp | log_gdp_pc | corrupt |
---|---|---|---|---|

happiness | 1.00 | 0.78 | 0.82 | −0.47 |

life_exp | 0.78 | 1.00 | 0.86 | −0.34 |

log_gdp_pc | 0.82 | 0.86 | 1.00 | −0.34 |

corrupt | −0.47 | −0.34 | −0.34 | 1.00 |

# 6 Model Estimation and Optimization

In our initial linear model, the coefficients for each feature are the key parameters. But how do we know what the coefficients are, and how did we come to those values? When we run a linear model using some program function, they appear magically, but it’s worth knowing a little bit about how they come to be, so let’s try and dive a little deeper. As we do so, we’ll end up going a long way into ‘demystifying’ the modeling process.

**Model estimation** is the process of finding the parameters associated with a model that allow us to reach a particular modeling goal. Different types of models will have different parameters to estimate, and there are different ways to estimate them. In general though, the goal is the same, find the set of parameters that will lead to the best predictions under the current data modeling context.

With model estimation, we can break things down into the following steps:

- Start with an
**initial guess**for the parameters - Calculate the
**prediction error**based on those parameters, or some function of it, or some other value that represents our model’s**objective** **Update**the guess- Repeat steps 2 & 3 until we find a ‘best’ guess

Pretty straightforward, right? Well, it’s not always so simple, but this is the general idea in most applications, so keep it in mind! In this chapter, we’ll show how to do this ourselves to take away the mystery a bit from when you run standard model functions in typical contexts. Hopefully then you’ll gain more confidence when you do use them!

## 6.1 Key Ideas

A few concepts we’ll keep using here are fundamental to understanding estimation and optimization. We should note that we’re qualifying our present discussion of these topics to typical linear models and similar settings, but they are much more broad and general than presented here. We’ve seen some of this before, but we’ll be getting a bit cozier with the concepts now.

**Parameters**are the values associated with a model that we have to estimate.**Estimation**is the process of finding the parameters associated with a model.- The
**objective (loss) function**takes input and produces a value that we want to maximize or minimize. **Prediction error**is the difference between the observed value of the target and the predicted value, and is often used to calculate the objective function.**Optimization**is the specific process of finding the parameters that maximize or minimize some objective function.**Model Selection**is the process of choosing the best model from a set of models.

### 6.1.1 Why this matters

When it comes to modeling, even knowing just a little bit about what goes on behind the scenes is a great demystifier. And if models are less of a mystery, you’ll feel more confident using them. Parts of what you see here are a part of almost every common model used for statistics and machine learning, providing you even more of a foundation for understanding what’s going on.

### 6.1.2 Helpful context

This chapter is more involved and technical than most of the others, so might be more suited for those who like to get their hands dirty. It’s all about ‘rolling your own’, and so we’ll be doing a lot of the work ourselves. If you’re not one of those types of people that gets much out of that, that’s ok, you can skip this chapter and still get a lot out of the rest of the book. But if you’re curious about how things work, or you want to be able to do more than just run a function, then we think you’ll find the following useful. You’d want to at least have your linear model basics down (Chapter 3).

## 6.2 Data Setup

For the examples here, we’ll use the world happiness dataset for the year 2018. We’ll use the happiness score as our target. Let’s take an initial look at the data here, but for more information see the appendix Section C.2.

Our happiness score has values from around 3-7, life expectancy and gdp appear to have some notable variability, and corruption perception is skewed toward lower values. We can also see that the features and target are correlated with each other, which is not surprising.

For our purposes here, we’ll drop any rows with missing values, and we’ll use scaled features so that they have the same variance, which as noted in the data chapter, can help make estimation easier.

```
= read_csv('https://tinyurl.com/worldhappiness2018') |>
df_happiness drop_na() |>
rename(happiness = happiness_score) |>
select(
country,
happiness,contains('_sc')
)
```

```
import pandas as pd
= (
df_happiness 'https://tinyurl.com/worldhappiness2018')
pd.read_csv(
.dropna()= {'happiness_score': 'happiness'})
.rename(columns filter(regex = '_sc|country|happ')
. )
```

### 6.2.1 Other Setup

For the R examples, after the above nothing beyond base R is needed. For Python examples, the following should be enough to get you through the examples:

```
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from scipy.optimize import minimize
from scipy.stats import norm
```

## 6.3 Starting Out by Guessing

So, we’ll start with a model in which we predict a country’s level of happiness by their life expectancy, where if you can expect to live longer, maybe you’re probably in a country with better health care, higher incomes, and other important stuff. We’ll stick with our simple linear regression model as well.

As a starting point we can just guess what the parameter should be, but how would we know what to guess? How would we know which guesses are better than others? Let’s try a couple and see what happens. Let’s say that we think all countries start at the bottom on the happiness scale (around 3), but life expectancy makes a big impact- for every standard deviation of life expectancy we go up a whole point on happiness^{1}. We can plug this into the model and see what we get:

\[ \text{prediction} = 3.3 + 1\cdot\text{life\_exp} \]

For a different model, we’ll say countries start with a mean of happiness score, and moving up a standard deviation of life expectancy would move us up a half point of happiness. \[ \text{prediction} = \overline{\text{happiness}} + .5\cdot\text{life\_exp} \]

How do we know which is better? Let’s find out!

## 6.4 Prediction Error

We’ve seen that a key component to model assessment involves comparing the predictions from the model to the actual values of the target. This difference is known as the **prediction error**, or **residuals** in more statistical contexts. We can express this as:

\[ \epsilon = y - \hat{y} \] \[ \text{error} = \text{target} - \text{model-based guess} \]

This prediction error tells us how far off our model prediction is from the observed target values, and it gives us a way to compare models. With our measure of prediction error, we can calculate a total error for all observations/predictions (Section 4.2), or similarly, the average error. If one model or parameter set has less total or average error, we can say it’s a better model than one that has more (Section 4.3). Ideally we’d like to choose a model with the least error, but we’ll see that this is not always possible^{2}.

However, if we just take the average of our errors from a linear regression model, you’ll see that it is roughly zero! This is by design for many common models, where we even will explicitly write the formula for the error as coming from a normal distribution with mean of zero. So, to get a meaningful error metric, we need to use the squared error value or the absolute value. These also allow errors of similar value above and below the observed value to cost the same^{3}. We’ll use squared error here, and we’ll calculate the mean of the squared errors for all our predictions.

```
= df_happiness$happiness
y
# Calculate the error for the guess of 4
= min(df_happiness$happiness) + 1*df_happiness$life_exp_sc
prediction = mean((y - prediction)^2)
mse_model_A
# Calculate the error for our other guess
= mean(y) + .5 * df_happiness$life_exp_sc
prediction = mean((y - prediction)^2)
mse_model_B
tibble(
Model = c('A', 'B'),
MSE = c(mse_model_A, mse_model_B)
)
```

```
= df_happiness['happiness']
y
# Calculate the error for the guess of four
= np.min(df_happiness['happiness']) + 1 * df_happiness['life_exp_sc']
prediction = np.mean((y - prediction)**2)
mse_model_A
# Calculate the error for our other guess
= y.mean() + .5 * df_happiness['life_exp_sc']
prediction = np.mean((y - prediction)**2)
mse_model_B
pd.DataFrame({'Model': ['A', 'B'],
'MSE': [mse_model_A, mse_model_B]
})
```

Now let’s look at our **Mean Squared Error** (MSE), and we’ll also inspect the square root of it, or the **Root Mean Squared Error**, as that puts things back on the original target scale, and tells us the standard deviation of our prediction errors. We also add the **Mean Absolute Error** (MAE) as another metric with straightforward interpretation.

Model | MSE | RMSE | MAE | RMSE % drop | MAE % drop |
---|---|---|---|---|---|

A | 5.09 | 2.26 | 1.52 | ||

B | 0.64 | 0.80 | 0.58 | 65% | 62% |

Inspecting the metrics, we can see that we are off on average by over a point for model A (MAE), and a little over half a point on average for model B. So we can see that model B is not only better, but results in a 65% drop in RMSE, and similar for MAE. We’d definitely prefer it over model A, and we can see how to compare models in a general fashion.

Now all of this is useful, and at least we can say one model is better than another. But you’re probably hoping there is an easier way to get a good guess for our model parameters, especially when we have possibly dozens of features and/or parameters to keep track of, and there is!

## 6.5 Ordinary Least Squares

In a simple linear model, we often use the **Ordinary Least Squares (OLS)** method to estimate parameters. This method finds the coefficients that minimize the sum of the squared differences between the predicted and actual values^{4}. In other words, it finds the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, which is what we just did in our previous example. The sum of the squared errors is also called the **residual sum of squares** (RSS), as opposed to the ‘total’ sums of squares (i.e. the variance of the target), and the part explained by the model (‘model’ or ‘explained’ sums of squares). We can express this as follows, where \(y_i\) is the observed value of the target for observation \(i\), and \(\hat{y_i}\) is the predicted value from the model.

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \tag{6.1}\]

It’s called *ordinary* least squares because there are other least squares methods - generalized least squares, weighted least squares, and others, but we don’t need to worry about that for now. The sum or mean of the squared errors is our **objective value**. The process of taking the predictions and observed target values as inputs, and returning this value as an output is our **objective function**. We can use this value to find the best parameters for a specific model, as well as compare models with different parameters.

Now let’s calculate the OLS estimate for our model. We need our own function to do this, but it doesn’t take much. We need to map our inputs to our output, which are the model predictions. We then calculate the error, square it, and then average the squared errors to provide the mean squared error.

```
# for later comparison
= lm(happiness ~ life_exp_sc, data = df_happiness)
model_lr_happy
= function(X, y, par) {
ols # add a column of 1s for the intercept
= cbind(1, X)
X
# Calculate the predicted values
= X %*% par # %*% is matrix multiplication
y_hat
# Calculate the error
= y - y_hat
error
# Calculate the value as mean squared error
= sum(error^2) / nrow(X)
value
# Return the objective value
return(value)
}
```

```
# for later comparison
= smf.ols('happiness ~ life_exp_sc', data = df_happiness).fit()
model_lr_happy
def ols(par, X, y):
# add a column of 1s for the intercept
= np.c_[np.ones(X.shape[0]), X]
X
# Calculate the predicted values
= X @ par # @ is matrix multiplication
y_hat
# Calculate the mean of the squared errors
= np.mean((y - y_hat)**2)
value
# Return the objective value
return value
```

We’ll want to make a bunch of guesses for the parameters, so let’s create data for those guesses. Then choose the guess that gives us the lowest objective value. But before getting carried away, try it out for just one guess to see how it works!

```
# create a grid of guesses
= crossing(
guesses b0 = seq(1, 7, 0.1),
b1 = seq(-1, 1, 0.1)
)
# Example for one guess
ols(
X = df_happiness$life_exp_sc,
y = df_happiness$happiness,
par = unlist(guesses[1, ])
)
```

`[1] 23.78`

```
# create a grid of guesses
from itertools import product
= pd.DataFrame(
guesses
product(1, 7, 0.1),
np.arange(-1, 1, 0.1)
np.arange(
),= ['b0', 'b1']
columns
)
# Example for one guess
ols(= guesses.iloc[0,:],
par = df_happiness['life_exp_sc'],
X = df_happiness['happiness']
y )
```

`23.77700449624871`

Now we’ll calculate the loss for each guess and find which one gives us the smallest function value.

```
# Calculate the function value for each guess, mapping over
# each combination of b0 and b1
= guesses |>
guesses mutate(
objective = map2_dbl(
$b0,
guesses$b1,
guessesols(
\(b0, b1) par = c(b0, b1),
X = df_happiness$life_exp_sc,
y = df_happiness$happiness
)
)
)
= guesses |> filter(objective == min(objective))
min_loss
min_loss
```

```
# A tibble: 1 × 3
b0 b1 objective
<dbl> <dbl> <dbl>
1 5.4 0.9 0.491
```

```
# Calculate the function value for each guess, mapping over
# each combination of b0 and b1
'objective'] = guesses.apply(
guesses[lambda x: ols(
= x,
par = df_happiness['life_exp_sc'],
X = df_happiness['happiness']
y
),= 1
axis
)
= guesses[guesses['objective'] == guesses['objective'].min()]
min_loss
min_loss
```

```
b0 b1 objective
899 5.400 0.900 0.491
```

The following plot shows the objective value for each guess in a smooth fashion as if we had a more continuous space. Darker values indicate we’re getting closer to our smallest objective value. The star notes our best guess.

If we inspect our results from the standard functions, we had estimates of 5.44 and 0.89 for our coefficients versus the best guess from our approach of 5.4 and 0.9. These are very similar but not exactly the same, but this is mostly due to the granularity of our guesses. Even so, in the end we can see that we get pretty dang close to what our basic `lm`

or `statsmodels`

functions would get us. Pretty neat!

## 6.6 Optimization

Before we get into other objective functions, let’s think about a better way to find good parameters for our model. Rather than just guessing, we can use a more systematic approach, and thankfully, there are tools out there to help us. We just use a function like our OLS function, give it a starting point, and let the algorithms do the rest! These tools eventually arrive at a pretty good set of parameters, and are optimized for speed.

Previously we created a set of guesses, and tried each one in a manner called a **grid search**, and it is a bit of a brute force approach to finding the best fitting model. You can maybe imagine a couple of unfortunate scenarios for this approach, such as having a very large number of parameters to search. Or it may be that our range of guesses doesn’t allow us to find the right set of parameters, or we specify a very large range, but the best fitting model is within a very narrow part of that, so it takes a long time to find. In any of these cases we waste a lot of time or may not find an optimal solution.

In general, we can think of **optimization** as employing a smarter, more efficient way to find what you’re looking for. Here’s how it works:

**Start with an initial guess**for the parameters, then…**Calculate the objective function**given the parameters**Update the parameters**to a new guess (that hopefully improves the objective function)**Repeat**, until the improvement is small enough, or a maximum number of iterations is reached.

The key idea now is how we *update* the old parameters with a new guess at each iteration. Different **optimization algorithms** use different approaches to find those new guesses. The process stops when the improvement is within a set **tolerance** or the maximum number of iterations is reached. If the objective is met, we say that our model has **converged**. Sometimes, the number of iterations is not enough for us to reach convergence in terms of tolerance, and we have to try again with a different set of parameters, a different algorithm, maybe use some data transformations, or something else.

So, let’s try it out! We start out with several inputs:

- the objective function
- the initial guess for the parameters to get things going
- other related inputs to the objective function
- options for the optimization process, e.g. algorithm, maximum number of iterations, etc.

With these inputs, we’ll let the optimization functions do the rest of the work. We’ll also compare our results to the standard functions to make sure we’re on the right track.

We’ll use the `optim`

function in R.

```
= optim(
our_result par = c(1, 0),
fn = ols,
X = df_happiness$life_exp_sc,
y = df_happiness$happiness,
method = 'BFGS', # optimization algorithm
control = list(
tol = 1e-6, # tolerance
maxit = 500 # max iterations
)
)
our_result
```

We’ll use the `minimize`

function in Python.

```
from scipy.optimize import minimize
= minimize(
our_result = ols,
fun = np.array([1., 0.]),
x0 = (
args 'life_exp_sc']),
np.array(df_happiness['happiness'])
np.array(df_happiness[
),= 'BFGS', # optimization algorithm
method = 1e-6, # tolerance
tol = {
options 'maxiter': 500 # max iterations
}
)
our_result
```

Optimization functions typically return multiple values, including the best parameters found, the value of the objective function at that point, and sometimes other information like the number of iterations it took to reach the returned value and whether or not the process converged. This can be quite a bit of stuff so we don’t show the raw output, but we definitely encourage you to inspect it closely. The following table shows the estimated parameters and the objective value for our model, and we can compare it to the standard functions to see how we did.

Parameter | Standard | Our Result |
---|---|---|

Intercept | 5.4450 | 5.4450 |

Life Exp. Coef. | 0.8880 | 0.8880 |

Objective/MSE | 0.4890 | 0.4890 |

So, our little function and the right tool allows us to come up with the same thing as base R and `statsmodels`

! I hope you’re feeling pretty good at this point because you should! You just proved you could do what seemed before to be like magic, but really all it took is just a little knowledge about some key concepts to demystify the process. So, let’s keep going!

## 6.7 Maximum Likelihood

In our example, we’ve been minimizing the mean of the squared errors to find the best parameters for our model. But let’s think about this differently. Now we’d like you to think about the **data generating process**. Ignoring the model, imagine that each happiness value is generated by some random process, like drawing from a normal distribution. So, something like this would describe it mathematically:

\[ \text{happiness} \sim N(\text{mean}, \text{sd}) \]

where the `mean`

is just the mean of happiness, and `sd`

is its standard deviation. In other words, we can think of happiness as a random variable that is drawn from a normal distribution with mean and standard deviation as the parameters of that distribution.

Let’s apply this idea to our linear model setting. In this case, the mean is a function of life expectancy, and we’re not sure what the standard deviation is, but we can go ahead and write our model as follows.

\[ \text{mean} = \beta_0 + \beta_1 * \text{life\_exp} \] \[ \text{happiness} \sim N(\text{mean}, \text{sd}) \]

Now, our model is estimating the parameters of the normal distribution. We have an extra parameter to estimate - the standard deviation, which is similar to our RMSE.

In our analysis, instead of merely comparing the predicted happiness score to the actual score by looking at their difference, we do something a little different to get a sense of their difference. We consider how likely it is to observe the actual happiness score based on our prediction. This value, known as the **likelihood**, depends on our model’s parameters, which include the coefficients. The likelihood is determined by the statistical distribution we’ve selected for our analysis. We can write this as:

\[ \text{Pr}(\text{happiness} \mid \text{life\_exp}, \beta_0, \beta_1, \text{sd}) \]

\[ \text{Pr}(\text{happiness} \mid \text{mean}, \text{sd}) \]

Thinking more generally, the likelihood gives us the probability of the observed data given the parameter estimates.

\[ \text{Pr}(\text{Data} \mid \text{Parameters}) \]

The following shows how to calculate a likelihood for our model. The values you see are called **probability density** values. They’re not exactly probabilities, but they show the **relative likelihood** of each observation^{6}. You can think of them like you do for probabilities, but remember that likelihoods are slightly different.

```
# two example life expectancy scores, at the mean (0) and 1 sd above
= c(0, 1)
life_expectancy
# observed happiness scores
= c(4, 5.2)
happiness
# predicted happiness with rounded coefs
= 5 + 1 * life_expectancy
mu
# just a guess for sigma
= .5
sigma
# likelihood for each observation
= dnorm(happiness, mean = mu, sd = sigma)
L L
```

`[1] 0.1080 0.2218`

```
from scipy.stats import norm
# two example life expectancy scores, at the mean (0) and 1 sd above
= np.array([0, 1])
life_expectancy
# observed happiness scores
= np.array([4, 5.2])
happiness
# predicted happiness with rounded coefs
= 5 + 1 * life_expectancy
mu
# just a guess for sigma
= .5
sigma
# likelihood for each observation
= norm.pdf(happiness, loc = mu, scale = sigma)
L L
```

`array([0.1080, 0.2218])`

With a guess for the parameters and an assumption about the data’s distribution, we can calculate the likelihood of each data point. We get a total likelihood for all observations, similar to how we added squared errors previously. But unlike errors, we want *more* likelihood, not less. In theory we’d multiply each likelihood, but in practice we sum the *log of the likelihood*, otherwise values would get too small for our computers to handle. We can also turn our problem into a minimization problem by calculating the negative log-likelihood, and then minimizing that value, which many optimization algorithms are designed to do^{7}.

The following is a function we can use to calculate the likelihood of the data given our parameters. The actual likelihood value isn’t easily interpretable, but it reflects the relative likelihood of the data given the parameters, so higher is generally better. Since many default optimization algorithms are designed to minimize, we’ll multiply the likelihood by -1 to turn it into a minimization problem, so lower is better in that case. The value can also be used to compare models with different parameter guesses^{8}. We’ll hold off with our result

```
= function(par, X, y) {
max_likelihood
# setup
= cbind(1, X) # add a column of 1s for the intercept
X = par[-1] # coefficients
beta = exp(par[1]) # error sd, exp keeps positive
sigma = nrow(X)
N
= X %*% beta # linear predictor
LP = LP # identity link in the glm sense
mu
# calculate (log) likelihood
= dnorm(y, mean = mu, sd = sigma, log = TRUE)
ll
= -sum(ll) # negative for minimization
value
return(value)
}
= optim(
our_result par = c(1, 0, 0), # first param is sigma
fn = max_likelihood,
X = df_happiness$life_exp_sc,
y = df_happiness$happiness
)
our_result
```

```
def max_likelihood(par, X, y):
# setup
= np.c_[np.ones(X.shape[0]), X] # add a column of 1s for the intercept
X = par[1:] # coefficients
beta = np.exp(par[0]) # error sd, exp keeps positive
sigma = X.shape[0]
N
= X @ beta # linear predictor
LP = LP # identity link in the glm sense
mu
# calculate (log) likelihood
= norm.logpdf(y, loc = mu, scale = sigma)
ll
= -np.sum(ll) # negative for minimization
value
return value
= minimize(
our_result = max_likelihood,
fun = np.array([1, 0, 0]), # first param is sigma
x0 = (
args 'life_exp_sc']),
np.array(df_happiness['happiness'])
np.array(df_happiness[
)
)
our_result
```

We can compare our result to a built-in function that has capabilities beyond OLS, and the table shows we’re duplicating the basic result. We show more decimal places on the log likelihood estimate to prove we aren’t getting *exactly* the same result.

Parameter | Standard | Our Result |
---|---|---|

Intercept | 5.445 | 5.445 |

Life Exp. Coef. | 0.888 | 0.888 |

Sigma^{1} |
0.705 | 0.699 |

LogLik (neg) | 118.804 | 118.804 |

^{1} Parameter estimate is exponentiated for 'our result' |

To use a maximum likelihood approach for linear models, you can use functions like `glm`

in R or `GLM`

in Python, which is the reference used in the table above. We can also use different likelihoods corresponding to the binomial, poisson and other distributions. Still other packages would allow even more distributions for consideration. In general, we choose a distribution that we feel best reflects the data generating process. For binary targets for example, we typically would feel a bernoulli or binomial distribution is appropriate. For count data, we might choose a poisson or negative binomial distribution. For targets that fall between 0 and 1, we might go for a beta distribution. You can see some of these demonstrated in Chapter 7.

There are many distributions to choose from, and the best one depends on your data. Sometimes, even if one distribution seems like a better fit, we might choose another one because it’s easier to use. Some distributions are special cases of others, or they might become more like a normal distribution under certain conditions. For example, the exponential distribution is a special case of the gamma distribution, and a t distribution with many degrees of freedom looks like a normal distribution. Here is a visualization of the relationships among some of the more common distributions (Wikipedia (2023)).

When you realize that many distributions are closely related, it’s easier to understand why we might choose one over another, but also why we might use a simpler option even if it’s not the best fit - you likely won’t come to a different conclusion about your model. Ultimately, you’ll get a better feel for this as you work with different types of data and models.

Here are examples of standard GLM functions, which just require an extra argument for the family of the distribution.

```
glm(happiness ~ life_exp_sc, data = df_happiness, family = gaussian)
glm(binary_target ~ x1 + x2, data = some_data, family = binomial)
glm(count ~ x1 + x2, data = some_data, family = poisson)
```

```
import statsmodels.formula.api as smf
smf.glm('happiness ~ life_exp_sc',
= df_happiness,
data = sm.families.Gaussian()
family
)
smf.glm('binary_target ~ x1 + x2',
= some_data,
data = sm.families.Binomial()
family
)
smf.glm('count ~ x1 + x2',
= some_data,
data = sm.families.Poisson()
family )
```

### 6.7.1 Diving deeper

Let’s think more about what’s going on here. It turns out that our objective function defines a ‘space’ or ‘surface’. You can imagine the process as searching for the lowest point on a landscape, with each guess a point on this landscape. Let’s start to get a sense of this with the following visualization, based on a single parameter. The following visualization shows this for a single parameter. The data comes from a variable with a true average of 5. As our guesses get closer to 5, the likelihood increases. However, with more and more data, the final guess converges on the true value. Model estimation finds that maximum on the curve, and optimization algorithms are the means to find it.

Now let’s add a parameter. If we have more than one parameter, we now have a surface to deal with. Given a starting point, an optimization procedure then travels along the surface looking for a minimum/maximum point. For simpler settings such as this, we can visualize the likelihood surface and its minimum point. However, even our simple model has three parameters plus the likelihood, so it would be difficult to visualize without additional complexity. Instead, we show the results for an alternate model where happiness is standardized also, which means the intercept is zero^{9}, and so not shown.

We can also see the path our estimates take. Starting at a fairly bad estimate, the optimization algorithm quickly updates to estimates that result in a better likelihood value. We also see little exploratory jumps creating a star like pattern, before things ultimately settle to the best values. In general, these updates and paths are dependent on the optimization algorithm one uses.

What we’ve shown here with maximum likelihood applies in general to searching for the best solution along an objective function surface. The optimization algorithms we’ll use are general purpose, and can be used for many different types of problems. The key is to define an appropriate objective function, and then let the algorithm do the work.

## 6.8 Penalized Objectives

One thing we may want to take into account with our models is their complexity, especially in the context of **overfitting**. We talk about this with machine learning also (Chapter 9), but the basic idea is that we can get too familiar with the data we have, and when we try to predict on new data the model hasn’t seen before, our performance suffers, or even gets worse than a simpler model. In other words, we are not generalizing well (Section 9.4).

One way to deal with this is to penalize the objective function value for complexity, or at least favor simpler models that might do as well. In some contexts this is called **regularization**, and in other contexts **shrinkage**, since the parameter estimates are typically shrunk toward some specific value (e.g., zero).

As a starting point, in our basic linear model we can add a penalty that is applied to the size of coefficients. This is called **ridge regression**, or, more mathily, as **L2 regularization**. We can write this formally as:

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{6.2}\]

The first part is the same as basic OLS (Equation 6.1), but the second part is the penalty for \(p\) features. The penalty is the sum of the squared coefficients multiplied by some value, which we call \(\lambda\). This is an additional model parameter that we typically want to estimate, e.g. through cross-validation. This kind of parameter is often called a **hyperparameter**, mostly just to distinguish it from those that may be of actual interest. For example, we could probably care less what the actual value for \(\lambda\) is, but we would still be interested in the coefficients.

In the end this is just a small change to ordinary least squares (OLS) regression (Equation 6.1), but it can make a big difference. It introduces some bias in the coefficients - recall that OLS is unbiased if assumptions are met - but it can help to reduce variance, which can help the model perform better on new data (Section 9.4.2). In other words, we are willing to accept some bias in order to get a model that generalizes better.

But let’s get to a code example to demystify this a bit! Here is an example of a function that calculates the ridge objective. To make things interesting, let’s add the other features we talked about regarding GDP per capita and perceptions of corruption.

```
= function(par, X, y, lambda = 0) {
ridge # add a column of 1s for the intercept
= cbind(1, X)
X
= X %*% par # linear predictor
mu
# Calculate the value as sum squared error
= sum((y - mu)^2)
error
# Add the penalty
= error + lambda * sum(par^2)
value
return(value)
}
= df_happiness |>
X select(-happiness, -country) |>
as.matrix()
= optim(
our_result par = c(0, 0, 0, 0),
fn = ridge,
X = X,
y = df_happiness$happiness,
lambda = 0.1,
method = 'BFGS'
)
$par our_result
```

```
# we use lambda_ because lambda is a reserved word in python
def ridge(par, X, y, lambda_ = 0):
# add a column of 1s for the intercept
= np.c_[np.ones(X.shape[0]), X]
X
# Calculate the predicted values
= X @ par
mu
# Calculate the error
= np.sum((y - mu)**2)
value
# Add the penalty
= value + lambda_ * np.sum(par**2)
value
return value
= minimize(
our_result = ridge,
fun = np.array([0, 0, 0, 0]),
x0 = (
args =['happiness', 'country'])),
np.array(df_happiness.drop(columns'happiness']),
np.array(df_happiness[0.1
)
)
'x'] our_result[
```

We can compare this to built-in functions as we have before, and can see that the results are very similar, but not exactly the same. We would not worry about such differences in practice, but the main point is again, we can use simple functions that do just about as well as any what we’d get from package output.

Parameter | Standard^{1} |
Our Result |
---|---|---|

Intercept | 5.44 | 5.44 |

life_exp_sc | 0.49 | 0.52 |

corrupt_sc | −0.12 | −0.11 |

gdp_pc_sc | 0.42 | 0.44 |

^{1} Showing results from R glmnet package with alpha = 0, lambda = .1 |

Another very common penalized approach is to use the sum of the absolute value of the coefficients, which is called **lasso regression** or **L1 regularization**. An interesting property of the lasso is that in typical implementations, it will potentially zero out coefficients, which is the same as dropping the feature from the model altogether. This is a form of **feature selection** or **variable selection**. The true values are never zero, but if we want to use a ‘best subset’ of features, this is one way we could do so. We can write the lasso objective as follows. The chapter exercise asks you to implement this yourself.

\[ \text{Value} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{6.3}\]

## 6.9 Classification

So far, we’ve been assuming a continuous target, but what if we have a categorical target? Now we have to learn a bunch of new stuff for that situation, right? Actually, no! *When we want to model categorical targets, conceptually, nothing changes*! We still have an objective function that maximizes or minimizes some goal, and we can use the same algorithms to estimate parameters. However, we need to think about how we can do this in a way that makes sense for the binary target.

### 6.9.1 Misclassification rate

A straightforward correspondence to MSE is a function that minimizes classification error, or by the same token, maximizes accuracy. In other words, we can think of the objective function as the proportion of incorrect classifications. This is called the **misclassification rate**.

\[ \text{Value} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i \neq \hat{y_i}) \tag{6.4}\]

In the equation, \(y_i\) is the actual value of the target for observation \(i\), arbitrarily coded as 1 or 0, and \(\hat{y_i}\) is the predicted class from the model. The \(\mathbb{1}\) is an indicator function that returns 1 if the condition is true, and 0 otherwise. In other words, we are counting the number of times the predicted value is not equal to the actual value, and dividing by the number of observations. Very straightforward, so let’s do this ourselves!

```
= function(par, X, y, class_threshold = .5) {
misclassification = cbind(1, X)
X
# Calculate the 'linear predictor'
= X %*% par
mu
# Convert to a probability ('sigmoid' function)
= 1 / (1 + exp(-mu))
p
# Convert to a class
= as.integer(
predicted_class ifelse(p > class_threshold, 'good', 'bad')
)
# Calculate the mean error
= mean(y - predicted_class)
value
return(value)
}
```

```
def misclassification_rate(par, X, y, class_threshold = .5):
# add a column of 1s for the intercept
= np.c_[np.ones(X.shape[0]), X]
X
# Calculate the 'linear predictor'
= X @ par
mu
# Convert to a probability ('sigmoid' function)
= 1 / (1 + np.exp(-mu))
p
# Convert to a class
= np.where(p > class_threshold, 1, 0)
predicted_class
# Calculate the mean error
= np.mean(y - predicted_class)
value
return value
```

Note that our function first adds a step to convert the linear predictor (called `mu`

) to a probability. Once we have a probability, we use some threshold to convert it to a ‘class’. In this case, we use 0.5 as the threshold, but this could be different depending on the context, something we talk more about elsewhere (Section 4.2.2). We’ll leave it as an exercise for you to play around with this, as the next objective function is more commonly used. But at least you can see how easy it can be to switch to the classification case.

### 6.9.2 Log loss

Another approach is to use the **log loss**, sometimes called logistic loss or **cross-entropy**. If we have just the binary case it is:

\[ \text{Value} = -\sum_{i=1}^{n} y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i}) \tag{6.5}\]

Where \(y_i\) is the actual value of the target for observation \(i\), and \(\hat{y_i}\) is the predicted value from the model (essentially a probability). It turns out that *this is the same as the log-likelihood used in a maximum likelihood approach for logistic regression*, made negative so we can minimize it.

We typically prefer this objective function to classification error because it results in a *smooth* optimization surface, like in the visualization we showed before for maximum likelihood, which means it is *differentiable* in a mathematical sense. This is important because it allows us to use optimization algorithms that rely on derivatives in updating the parameter estimates. You don’t really need to get into that too much, but just know that a smoother objective function is something we prefer. Here’s some code to try out.

```
= function(par, X, y) {
objective = cbind(1, X)
X
# Calculate the predicted values on the raw scale
= X %*% par
y_hat
# Convert to a probability ('sigmoid' function)
= 1 / (1 + exp(-y_hat))
y_hat
# likelihood
= y * log(y_hat) + (1 - y) * log(1 - y_hat)
ll
# alternative
# dbinom(y, size = 1, prob = y_hat, log = TRUE)
= -sum(ll)
value
return(value)
}
```

```
def objective(par, X, y):
# add a column of 1s for the intercept
= np.c_[np.ones(X.shape[0]), X]
X
# Calculate the predicted values
= X @ par
y_hat
# Convert to a probability ('sigmoid' function)
= 1 / (1 + np.exp(-y_hat))
y_hat
# likelihood
= y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)
ll
= -np.sum(ll)
value
return value
```

Let’s go ahead and demonstrate this. To create a classification problem, we’ll say that a country is ‘happy’ if the happiness score is greater than 5.5, and ‘unhappy’ otherwise. We’ll use the same features as before.

```
= df_happiness |>
df_happiness_bin mutate(happiness = ifelse(happiness > 5.5, 1, 0))
= optim(
model_logloss par = c(0, 0, 0, 0),
fn = objective,
X = df_happiness_bin |>
select(life_exp_sc:gdp_pc_sc) |>
as.matrix(),
y = df_happiness_bin$happiness
)
= glm(
model_glm ~ life_exp_sc + corrupt_sc + gdp_pc_sc,
happiness data = df_happiness_bin,
family = binomial
)
$par model_logloss
```

```
= df_happiness.copy()
df_happiness_bin 'happiness'] = np.where(df_happiness['happiness'] > 5.5, 1, 0)
df_happiness_bin[
= minimize(
model_logloss
objective,= np.array([0, 0, 0, 0]),
x0 = (
args 'life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],
df_happiness_bin[['happiness']
df_happiness_bin[
)
)
= smf.glm(
model_glm 'happiness ~ life_exp_sc + corrupt_sc + gdp_pc_sc',
= df_happiness_bin,
data = sm.families.Binomial()
family
).fit()
'x'] model_logloss[
```

Once again, we can see that the results are very similar between our classification result and the built-in function.

parameter | Standard | Our result |
---|---|---|

LogLike | 40.6635 | 40.6635 |

intercept | −0.1637 | −0.1642 |

life_exp_sc | 1.8172 | 1.8168 |

corrupt_sc | −0.4648 | −0.4638 |

gdp_pc_sc | 1.1311 | 1.1307 |

So, when it comes to classification, you should feel confident in what’s going on under the hood, just like you did with a numeric target. Too much is made of the distinction between ‘regression’ and ‘classification’ and it can be confusing to those starting out. In reality, classification just requires a slightly different way of thinking about the target. Conceptually it really is the same approach.

## 6.10 Optimization Algorithms

When it comes to optimization, there are a number of algorithms that have been developed over time. The main thing to keep in mind is that these are all just ways to find the best fitting parameters for a model. Some may be better suited for certain data tasks, or provide computational advantages, but often the choice of algorithm is not as important as many other modeling choices.

Here are some of the options available in R’s `optim`

or scipy’s `minimize`

function:

- Nelder-Mead
- BFGS
- L-BFGS-B (provides constraints)
- Conjugate gradient
- Simulated annealing
- Newton’s method
- Genetic algorithms
- Other Popular SGD extensions and variants
- RMSProp
- Adam/momentum
- Adagrad

The main reason to choose one method over another usually is based on factors like speed, memory use, or how well they work for certain models. For statistical contexts, many functions for generalized linear models use Newton’s method by default, but more complicated models may implement a different approach for better convergence. In machine learning, stochastic gradient descent is popular because it can be efficient in large data settings and relatively easy to implement.

In general, we can always try different methods to see which works best, but usually the results will be similar if the results reach convergence. We’ll now demonstrate one of the most popular optimization methods used in machine learning - gradient descent, but know that there are many variants of this one might use.

### 6.10.1 Gradient descent

One of the most popular approaches in optimization is called **gradient descent**. It uses the gradient of the function we’re trying to optimize to find the best parameters. We still use objective functions as before, and gradient descent is just a way to find that path along the objective surface. More formally, the gradient is the vector of partial derivatives of the objective function with respect to each parameter. That may not mean much to you, but the basic idea is that the gradient provides a direction that points in the direction of steepest increase in the function. So if we want to maximize the objective function, we can take a step in the direction of the gradient, and if we want to minimize it, we can take a step in the opposite direction of the gradient (use the negative gradient). The size of the step is called the **learning rate**, and, like our penalty parameter we saw with penalized regression, it is a *hyperparameter* that we can tune through cross-validation (Section 9.7). If the learning rate is too small, it will take a longer time to converge. If it’s too large, we might overshoot the objective and miss the best parameters. There are a number of variations on gradient descent that have been developed over time. Let’s see this in action with the world happiness model.

```
= function(
gradient_descent
par,
X,
y,tolerance = 1e-3,
maxit = 1000,
learning_rate = 1e-3
) {
= cbind(1, X) # add a column of 1s for the intercept
X = nrow(X)
N
# initialize
= par
beta names(beta) = colnames(X)
= crossprod(X %*% beta - y) / N # crossprod provides sum(x^2)
mse = 1
tol = 1
iter
while (tol > tolerance && iter < maxit) {
= X %*% beta
LP = t(X) %*% (LP - y)
grad = beta - learning_rate * grad
betaCurrent = max(abs(betaCurrent - beta))
tol = betaCurrent
beta = append(mse, crossprod(LP - y) / N)
mse = iter + 1
iter
}
= list(
output par = beta,
loss = mse,
MSE = crossprod(LP - y) / nrow(X),
iter = iter,
predictions = LP
)
return(output)
}
= df_happiness |>
X select(life_exp_sc:gdp_pc_sc) |>
as.matrix()
= gradient_descent(
our_result par = c(0, 0, 0, 0),
X = X,
y = df_happiness$happiness,
learning_rate = 1e-3
)
```

```
def gradient_descent(
par,
X,
y, = 1e-3,
tolerance = 1000,
maxit = 1e-3
learning_rate
):# add a column of 1s for the intercept
= np.c_[np.ones(X.shape[0]), X]
X
# initialize
= par
beta = np.sum((X @ beta - y)**2)
loss = 1
tol iter = 1
while (tol > tolerance and iter < maxit):
= X @ beta
LP = X.T @ (LP - y)
grad = beta - learning_rate * grad
betaCurrent = np.max(np.abs(betaCurrent - beta))
tol = betaCurrent
beta = np.append(loss, np.sum((LP - y)**2))
loss iter = iter + 1
= {
output 'par': beta,
'loss': loss,
'MSE': np.mean((LP - y)**2),
'iter': iter,
'predictions': LP
}
return output
= gradient_descent(
our_result = np.array([0, 0, 0, 0]),
par = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']].to_numpy(),
X = df_happiness['happiness'].to_numpy(),
y = 1e-3
learning_rate )
```

Comparing our results, we have the following table. As usual, we see that the results are very similar. Once again, we have demystified a step in the modeling process!

Value | Standard | Our Result |
---|---|---|

Intercept | 5.445 | 5.437 |

life_exp_sc | 0.525 | 0.521 |

corrupt_sc | −0.105 | −0.107 |

gdp_pc_sc | 0.438 | 0.439 |

MSE | 0.367 | 0.367 |

In addition, when we visualize the loss function across iterations, we see a smooth decline in the MSE value as we go along each iteration. This is a good sign that we are converging to an optimal solution.

### 6.10.2 Stochastic gradient descent

**Stochastic gradient descent** (SGD) is a version of gradient descent that uses a random sample of data to guess the gradient, instead of using all the data. This makes it less accurate in some ways, but it’s faster and can be parallelized. This speed is useful in machine learning when there’s a lot of data, which often makes the discrepancy between standard GD and SGD small. As such you will see variants of it incorporated in many models in deep learning, but it can be with much simpler models as well.

Let’s see this in action with the happiness model. The following is a conceptual version of the *AdaGrad* approach^{10}, which is a variation of SGD that adjusts the learning rate for each parameter. We will also add a variation that averages the parameter estimates across iterations, which is a common approach to improve the performance of SGD, but by default it is not used, just something you can play with. We are going to use a **batch size** of one, which is similar to a ‘streaming’ or ‘online’ version where we update the model with each observation. Since our data are alphabetically ordered, we’ll shuffle the data first. We’ll also use a `stepsize_tau`

parameter, which is a way to adjust the learning rate at early iterations. We’ll set it to zero for now, but you can play with it to see how it affects the results. The values for the learning rate and `stepsize_tau`

are arbitrary, selected after some initial playing around, but you can play with them to see how they affect the results.

```
= function(
stochastic_gradient_descent # parameter estimates
par, # model matrix
X, # target variable
y, learning_rate = 1, # the learning rate
stepsize_tau = 0 # if > 0, a check on the LR at early iterations
) {# initialize
set.seed(123)
# shuffle the data
= sample(1:nrow(X), nrow(X))
idx = X[idx, ]
X = y[idx]
y
= cbind(1, X)
X = par
beta
# Collect all estimates
= matrix(0, nrow(X), ncol = length(beta))
betamat
# Collect fitted values at each point))
= NA
fits
# Collect loss at each point
= NA
loss
# adagrad per parameter learning rate adjustment
= 0
s
# a smoothing term to avoid division by zero
= 1e-8
eps
for (i in 1:nrow(X)) {
= X[i, , drop = FALSE]
Xi = y[i]
yi
# matrix operations not necessary here,
# but makes consistent with previous gd func
= Xi %*% beta
LP = t(Xi) %*% (LP - yi)
grad = s + grad^2 # adagrad approach
s
# update
= beta - learning_rate /
beta + sqrt(s + eps)) * grad
(stepsize_tau = beta
betamat[i, ]
= LP
fits[i]
= crossprod(LP - yi)
loss[i]
}
= X %*% beta
LP = crossprod(LP - y)
lastloss
= list(
output par = beta, # final estimates
par_chain = betamat, # estimates at each iteration
MSE = sum(lastloss) / nrow(X),
predictions = LP
)
return(output)
}
= df_happiness |>
X_train select(life_exp_sc, corrupt_sc, gdp_pc_sc) |>
as.matrix()
= df_happiness$happiness
y_train
= stochastic_gradient_descent(
our_result par = c(mean(df_happiness$happiness), 0, 0, 0),
X = X_train,
y = y_train,
learning_rate = .15,
stepsize_tau = .1
)
$par our_result
```

```
def stochastic_gradient_descent(
# parameter estimates
par, # model matrix
X, # target variable
y, = 1, # the learning rate
learning_rate = 0, # if > 0, a check on the LR at early iterations
stepsize_tau = False # a variation of the approach
average
):# initialize
1234)
np.random.seed(
# shuffle the data
= np.random.choice(
idx 0],
df_happiness.shape[0],
df_happiness.shape[= False
replace
)= X[idx, :]
X = y[idx]
y
= np.c_[np.ones(X.shape[0]), X]
X = par
beta
# Collect all estimates
= np.zeros((X.shape[0], beta.shape[0]))
betamat
# Collect fitted values at each point))
= np.zeros(X.shape[0])
fits
# Collect loss at each point
= np.zeros(X.shape[0])
loss
# adagrad per parameter learning rate adjustment
= 0
s
# a smoothing term to avoid division by zero
= 1e-8
eps
for i in range(X.shape[0]):
= X[None, i, :]
Xi = y[i]
yi
# matrix operations not necessary here,
# but makes consistent with previous gd func
= Xi @ beta
LP = Xi.T @ (LP - yi)
grad = s + grad**2 # adagrad approach
s
# update
= beta - learning_rate / \
beta + np.sqrt(s + eps)) * grad
(stepsize_tau
= beta
betamat[i, :]
= LP
fits[i] = np.sum((LP - yi)**2)
loss[i]
= X @ beta
LP = np.sum((LP - y)**2)
lastloss
= {
output 'par': beta, # final estimates
'par_chain': betamat, # estimates at each iteration
'MSE': lastloss / X.shape[0],
'predictions': LP
}
return output
= df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']]
X_train = df_happiness['happiness']
y_train
= stochastic_gradient_descent(
our_result = np.array([np.mean(df_happiness['happiness']), 0, 0, 0]),
par = X_train.to_numpy(),
X = y_train.to_numpy(),
y = .15,
learning_rate = .1
stepsize_tau
)
'par'] our_result[
```

Next we’ll compare it to OLS estimates and our previous ‘batch’ gradient descent results. Even though SGD normally would not be used for such a small dataset, we at least get close^{11}!

Value | Standard | Our Result | Batch SGD |
---|---|---|---|

Intercept | 5.445 | 5.469 | 5.437 |

life_exp_sc | 0.525 | 0.514 | 0.521 |

corrupt_sc | −0.105 | −0.111 | −0.107 |

gdp_pc_sc | 0.438 | 0.390 | 0.439 |

MSE | 0.367 | 0.370 | 0.367 |

Here’s a plot of the estimates as they moved along the data. For this plot we don’t include the intercept as it’s on a notably different scale. We can see that the estimates are moving around a bit, but they appear to be converging to a solution.

To wrap things up, here are the results for the happiness model using different optimization algorithms, with a comparison to the standard linear regression model function. We can see that the results are very similar, and for simpler modeling endeavors they should converge on the same result.

parameter | NM^{1} |
BFGS^{2} |
CG^{3} |
GD^{4} |
Standard^{5} |
---|---|---|---|---|---|

Intercept | 5.445 | 5.445 | 5.445 | 5.437 | 5.445 |

life_exp_sc | 0.525 | 0.525 | 0.525 | 0.521 | 0.525 |

gdp_pc_sc | −0.105 | −0.105 | −0.105 | −0.107 | −0.105 |

corrupt_sc | 0.437 | 0.438 | 0.438 | 0.439 | 0.438 |

MSE | 0.367 | 0.367 | 0.367 | 0.367 | 0.367 |

^{1} NM = Nelder-Mead |
|||||

^{2} BFGS = Broyden–Fletcher–Goldfarb–Shanno |
|||||

^{3} CG = Conjugate gradient |
|||||

^{4} GD = Gradient descent |
|||||

^{5} Standard = Standard package function |

## 6.11 Other Estimation Approaches

Before leaving our estimation discussion, we should mention there are other approaches one could use, including variations on least squares, the **method of moments**, **generalized estimating equations**, **robust** estimation, and more. We’ve focused on the most common ones, but it’s good to be aware of others that might be more common in some domains. But there are two we want to discuss in a little bit detail given their widespread usage, and that is the **bootstrap** and **Bayesian estimation**.

### 6.11.1 Bootstrap

The **bootstrap** is a method where we create new data sets by randomly sampling the data from our original set, allowing the same data to be picked more than once. We then use these new data sets to estimate our model. We do this many times, collecting parameter estimates, predictions, or anything we want to calculate along the way. Ultimately, we end up with a *distribution* of all the things we calculated.

These results give us a range of possible outcomes, which is useful for **inference**^{12}, as we can use the distribution to calculate confidence intervals, prediction intervals or intervals for any value we calculate. The average estimate is often the same as whatever the underlying model used would produce, but the bootstrap provides a way to get at a measure of **uncertainty** with fewer assumptions about how that distribution should take shape. The approach is very flexible, and it can be used with any model. Let’s see this in action with the happiness model.

```
= function(X, y, nboot = 100, seed = 123) {
bootstrap
= nrow(X)
N = ncol(X) + 1 # add one for intercept
p
# initialize
= matrix(NA, p*nboot, nrow = nboot, ncol = p)
beta colnames(beta) = c('Intercept', colnames(X))
= rep(NA, nboot)
mse
# set seed
set.seed(seed)
for (i in 1:nboot) {
# sample with replacement
= sample(1:N, N, replace = TRUE)
idx = X[idx,]
Xi = y[idx]
yi
# estimate model
= lm(yi ~., data = Xi)
mod
# save results
= coef(mod)
beta[i, ] = sum((mod$fitted - yi)^2) / N
mse[i]
}
# given mean estimates, calculate MSE
= cbind(1, as.matrix(X)) %*% colMeans(beta)
y_hat = sum((y - y_hat)^2) / N
final_mse
= list(
output par = as_tibble(beta),
MSE = mse,
final_mse = final_mse
)
return(output)
}
= df_happiness |>
X select(life_exp_sc:gdp_pc_sc)
= bootstrap(
our_result X = X,
y = df_happiness$happiness,
nboot = 250
)
colMeans(our_result$par)
```

```
Intercept life_exp_sc corrupt_sc gdp_pc_sc
5.4425 0.5084 -0.1024 0.4640
```

```
# from sklearn.linear_model import LinearRegression
def bootstrap(X, y, nboot=100, seed=123):
# add a column of 1s for the intercept
= np.c_[np.ones(X.shape[0]), X]
X = X.shape[0]
N
# initialize
= np.empty((nboot, X.shape[1]))
beta
# beta = pd.DataFrame(beta, columns=['Intercept'] + list(cn))
= np.empty(nboot)
mse
# set seed
np.random.seed(seed)
for i in range(nboot):
# sample with replacement
= np.random.randint(0, N, N)
idx = X[idx, :]
Xi = y[idx]
yi
# estimate model
= LinearRegression(fit_intercept=False)
model = model.fit(Xi, yi)
mod
# save results
= mod.coef_
beta[i, :] = np.sum((mod.predict(Xi) - yi)**2) / N
mse[i]
# given mean estimates, calculate MSE
= X @ beta.mean(axis=0)
y_hat = np.sum((y - y_hat)**2) / N
final_mse
= {
output 'par': beta,
'mse': mse,
'final_mse': final_mse
}
return output
= bootstrap(
our_result = df_happiness[['life_exp_sc', 'corrupt_sc', 'gdp_pc_sc']],
X = df_happiness['happiness'],
y = 250
nboot
)
'par'], axis=0) np.mean(our_result[
```

Here are the results of the interval estimates for the coefficients. Each parameter has the mean estimate, the lower and upper bounds of the 95% confidence interval, and the width of the interval. The bootstrap intervals are a bit wider than the OLS intervals, possibly better capturing the uncertainty in this model based on not too many observations.

Parameter | mean | Lower BS | Upper BS | Lower OLS | Upper OLS | Diff Width^{1} |
---|---|---|---|---|---|---|

Intercept | 5.44 | 5.33 | 5.55 | 5.33 | 5.56 | −0.02 |

life_exp_sc | 0.51 | 0.30 | 0.70 | 0.35 | 0.70 | 0.04 |

corrupt_sc | −0.10 | −0.29 | 0.09 | −0.25 | 0.04 | 0.09 |

gdp_pc_sc | 0.46 | 0.18 | 0.76 | 0.24 | 0.64 | 0.18 |

^{1} Width of bootstrap estimate minus width of OLS estimate |

Let’s look more closely at the distributions for each coefficient. Standard statistical estimates assume a specific shape like the normal distribution. But with the bootstrap method, we have more flexibility, even though it often leans towards the assumed distribution. These distributions aren’t perfectly symmetrical like a normal distribution, but they suit our needs in that we can extract the lower and upper quantiles to create an interval estimate.

The bootstrap is often used for predictions and other metrics. However, it is computationally inefficient, and might not be suitable with large data sizes. It also may not estimate the appropriate uncertainty for some types of statistics (e.g. extreme values) or in some data contexts (e.g. correlated observations). Despite these limitations, the bootstrap method is a useful tool and can be used together with other methods to understand uncertainty in a model.

### 6.11.2 Bayesian estimation

The **Bayesian** approach to modeling is many things - a philosophical viewpoint, an entirely different way to think about probability, a different way to measure uncertainty, and on a practical level, just another way to get model parameter estimates. It can be as frustrating as it is fun to use, and one of the really nice things about using Bayesian estimation is that it can handle model complexities that other approaches can’t do well.

The basis of Bayesian estimation is the **likelihood**, the same as with maximum likelihood, and everything we did there applies here. So you need a good grasp of maximum likelihood to understand the Bayesian approach. However, the Bayesian approach is different because it also lets us use our knowledge about the parameters through **prior distributions**. For example, we may think that the coefficients for a linear model come from a normal distribution centered on zero with some variance. That would serve as our prior. The combination of a prior distribution with the likelihood results in the **posterior distribution**, which is a *distribution* of possible parameter values.

#### Example

Let’s do a simple example to show how this comes about. We’ll use a binomial model where we have penalty kicks taken for a soccer player, and we want to estimate the probability of the player making a goal, which we’ll call \(\theta\). For our prior distribution, we’ll use a beta distribution that has a mean of 0.5, suggesting that we think this person would have about a 50% chance of converting the kick on average. For the likelihood, we’ll use a binomial distribution. We also use this in our GLM chapter (Equation 7.1), which, as we noted earlier, is akin to using the log loss (Section 6.9.2).

We’ll then calculate the posterior distribution for the probability of making a shot, given our prior and the evidence at hand, i.e., the data.

Let’s start with some data, and just like our other estimation approaches, we’ll have some guesses for \(theta\) which represents the probability of making a goal. We’ll use a triangular prior, but you can change it to a uniform or beta prior if you like. We’ll then calculate the likelihood of the data given the parameter, and then the posterior distribution.

```
= c(
pk 'goal','goal','goal','miss','miss',
'goal','goal','miss','goal','goal'
)
# convert to numeric, arbitrarily picking goal=1, miss=0
= length(pk) # sample size
N = sum(pk == 'goal') # number of pk made
n_goal = sum(pk == 'miss') # number of those miss
n_miss
# grid of potential theta values
= seq(
theta from = 1 / (N + 1),
to = N / (N + 1),
length = 10
)
### prior distribution
# beta prior with mean = .5, but pretty flat
= dbeta(theta, 5, 5)
p_theta
# Normalize so that values sum to 1
= p_theta / sum(p_theta)
p_theta
# likelihood (binomial)
= choose(N, n_goal) * theta^n_goal * (1 - theta)^n_miss
p_data_given_theta
# posterior (combination of prior and likelihood)
# marginal probability of the data for normalization
= sum(p_data_given_theta * p_theta)
p_data
= p_data_given_theta*p_theta / p_data # Bayes theorem
p_theta_given_data
# final estimate
= sum(theta * p_theta_given_data)
theta_est theta_est
```

`[1] 0.6`

```
from scipy.stats import beta
= np.array([
pk 'goal','goal','goal','miss','miss',
'goal','goal','miss','goal','goal'
])
# convert to numeric, arbitrarily picking goal=1, miss=0
= len(pk) # sample size
N = np.sum(pk == 'goal') # number of pk made
n_goal = np.sum(pk == 'miss') # number of those miss
n_miss
# grid of potential theta values
= np.linspace(1 / (N + 1), N / (N + 1), 10)
theta
### prior distribution
# beta prior with mean = .5, but pretty flat
= beta.pdf(theta, 5, 5)
p_theta
# Normalize so that values sum to 1
= p_theta / np.sum(p_theta)
p_theta
# likelihood (binomial)
= np.math.comb(N, n_goal) * theta**n_goal * (1 - theta)**n_miss
p_data_given_theta
# posterior (combination of prior and likelihood)
# marginal probability of the data for normalization
= np.sum(p_data_given_theta * p_theta)
p_data
= p_data_given_theta * p_theta / p_data # Bayes theorem
p_theta_given_data
# final estimate
= np.sum(theta * p_theta_given_data)
theta_est theta_est
```

`0.599999996503221`

Here is the table that puts all this together. Our prior distribution is centered around 0.5, and the likelihood is centered closer to 0.7. The posterior distribution is a combination of the two. It gives no weight to smaller values, or the max values. Our final estimate is 0.6, which falls between the two. With more evidence in the form of data, our estimate will shift more and more towards what the likelihood would suggest. This is a simple example, but it shows how the Bayesian approach works, and this holds for more complex parameter estimation as well.

theta | prior | like | post |
---|---|---|---|

0.09 | 0.00 | 0.00 | 0.00 |

0.18 | 0.03 | 0.00 | 0.00 |

0.27 | 0.09 | 0.01 | 0.00 |

0.36 | 0.16 | 0.03 | 0.03 |

0.45 | 0.22 | 0.08 | 0.14 |

0.55 | 0.22 | 0.16 | 0.28 |

0.64 | 0.16 | 0.24 | 0.32 |

0.73 | 0.09 | 0.26 | 0.19 |

0.82 | 0.03 | 0.18 | 0.04 |

0.91 | 0.00 | 0.05 | 0.00 |

#### Application

Just like with the bootstrap which also provided distributions for the parameters, we can use the Bayesian approach to understand how certain we are about our estimates. We can look at any range of values in the posterior distribution to get what is often referred to as a **credible interval**, which is the Bayesian equivalent of a confidence interval^{13}. Here is an example of the posterior distribution for the parameters of our happiness model, along with 95% intervals^{14}.

With Bayesian estimation we also provide starting values for the algorithm to get things going. We also typically specify a number of iterations, or times the model will run, as the **stopping rule**. Each iteration gives us a new guess for each parameter, which amounts to a random draw from the posterior distribution. With more iterations the model takes longer to run, but the length often reflects the complexity of the model.

We also specify multiple **chains**, which do the same estimation procedure, but due to the random nature of the Bayesian approach, take different estimation paths. We can then compare the chains to see if they are converging to the same result, which is a check on the model. If they are not converging, we may need to run the model longer, or it may indicate a problem with how we set up the model. Here’s an example of the four chains for our happiness model for the life expectancy coefficient. They’re giving similar results, so we know the model is working well. Nowadays, we have model statistics in the output that also provide this information, which makes it easier to quickly check many parameters.

When we are interested in making predictions, we can use the results to generate a distribution of possible predictions *for each observation*, which can be very useful when we want to quantify uncertainty for complex models. This is referred to as **posterior predictive distribution**, which is explored in a non-bayesian context in Section 4.4. Here is a plot of several draws of predicted values against the true happiness scores.

With the Bayesian approach, every metric we calculate has a range of possible values, not just one. For example, if you have a classification model and want to know the accuracy or true positive rate of the model, instead of a single number, you now have access to a whole distribution of values for that metric. How? For each possible set of model parameters from the posterior distribution, we apply those values and model to data to make a prediction. We can then assign it to a class, and compare it to the actual class. This gives us a range of possible predictions and classes. We can then calculate metrics like accuracy or true positive rate for each possible prediction set. As an example, we did this for our happiness model with a numeric target to obtain the interval estimate for R-squared. Pretty neat!

^{2}

Bayes R2 | Lower | Upper |
---|---|---|

0.71 | 0.65 | 0.75 |

95% Credible interval for R-squared |

#### 6.11.2.1 Additional Thoughts

It turns out that any standard (frequentist) statistical model can be seen as a Bayesian one from a particular point of view. Here are a couple:

- GLM and related models estimated via maximum likelihood: Bayesian estimation with a flat/uniform prior on the parameters.
- Ridge Regression: Bayesian estimation with a normal prior on the coefficients, penalty parameter is related to the variance of the prior.
- Lasso Regression: Bayesian estimation with a Laplace prior on the coefficients, penalty parameter is related to the variance of the prior.

So, in many modeling contexts, you’re actually doing a restrictive form of Bayesian estimation already.

We can see that the Bayesian approach is very flexible, and can be used for many different types of models, and can be used to get at uncertainty in a model in ways that other approaches can’t. It’s not a panacea, and it’s not always the best approach, but it’s a good one to have in your toolbox^{15}. Hopefully we’ve helped to demystify the Bayesian approach a bit, and you feel more comfortable switching to it.

## 6.12 Wrapping Up

Wow, we covered a lot here! But this is the sort of stuff that can take you from just having some fun with data, to doing that and also understanding how things are actually happening. Just having the gist of how modeling actually is done ‘under the hood’ makes so many other things make sense, and can give you a lot of confidence, even in less familiar modeling domains.

### 6.12.1 The common thread

Simply put, the content in this chapter ties together any and every model you will ever undertake, from linear regression to reinforcement learning, computer vision, and large language models. Estimation and optimization are the core of any modeling process, and understanding them is key to understanding how models work.

### 6.12.2 Choose your own adventure

Seriously, after this chapter, you should feel fine with any of the others in this book, so dive in!

### 6.12.3 Additional resources

**OLS and Maximum Likelihood Estimation**:

For OLS and maximum likelihood estimation, there are so many resources out there, we recommend just taking a look and seeing which one suits you best. Practically any more technical statistical book will cover these topics in detail.

**Gradient Descent**:

- Gradient Descent, Step-by-Step StatQuest with Josh Starmer (2019a)
- Stochastic Gradient Descent, Clearly Explained StatQuest with Josh Starmer (2019b)
- A Visual Explanation of Gradient Descent Methods Jiang (2020)

More demonstration of the simple AdaGrad algorithm used above:

**Bootstrap**:

Classical treatments:

Efron and Tibshirani (1994)

Davison and Hinkley (1997)

Bootstrapping Main Ideas StatQuest with Josh Starmer (2021)

**Bayesian**:

- BDA Gelman et al. (2013)
- Statistical Rethinking McElreath (2020)
- Choosing priors

## 6.13 Exercise

For this exercise you’ll have two tasks:

Try creating an objective function for a continuous target that uses the mean absolute error, and compare your estimated parameters to the previous results for ordinary least squares. Use the OLS function from the chapter as a basis (Section 6.5), but modify it for the new objective.

Use the estimation function we demonstrated for ridge regression (Section 6.8) and change it to use the lasso approach.

Both of these can be done by changing one line in the previous functions used.

Since life expectancy is scaled, a standard deviation is 1, in this case about 7 years.↩︎

It turns out that our error metric is itself an

*estimate*of the true error. We’ll get more into this later, but for now this means that we can’t ever know the true error, and so we can’t ever really know the best or true model. However, we can still choose a good or better model relative to others based on our estimate.↩︎We don’t have to do it this way, but it’s the default in most scenarios. As an example, maybe for your situation overshooting is worse than undershooting, and so you might want to use an approach that would weight those errors more heavily.↩︎

Some disciplines seem to confuse models with estimation methods and link functions. It doesn’t really make sense, nor is it informative, to call something an OLS model or a logit model. Many models are estimated using a least squares objective function, even deep learning, and different types of models use a logit link, from logistic regression, to beta regression, to activation functions used in deep learning.↩︎

You may find that some packages will

*only*minimize (or maximize) a function, even to the point of reporting nonsensical things like negative squared values, so you’ll need to take care when implementing your own metrics.↩︎The actual probability of a

*specific value*in this setting is 0, but the probability of a range of values is greater than 0. You can find out more about likelihoods and probabilities at the discussion here, but in general many traditional statistical texts will cover this also.↩︎The negative log-likelihood is often what is reported in the model output as well.↩︎

Those who have experience here will notice we aren’t putting a lower bound on sigma. You typically want to do this otherwise you may get nonsensical results by not keeping sigma positive. You can do this by setting a specific argument for an algorithm that uses boundaries, or more simply by exponentiating the parameter so that it can only be positive. In the latter case, you’ll have to exponentiate the final parameter estimate to get back to the correct scale. We leave this detail out of the code for now to keep things simple.↩︎

Linear regression will settle on a line that cuts through the means, and when standardizing all variables, the mean of the features and target are both zero, so the line goes through the origin.↩︎

MC does not recall exactly where this origin of his function came from except that Murphy’s PML book was a key reference when he came up with it (Murphy (2012)).↩︎

You’d get better results by also standardizing the target. The initial shuffling that we did can help as well in case the data are ordered. When we’re dealing with larger data and repeated runs/epochs, shuffling allows the samples/batches to be more representative of the entire data set. Also, we had to ‘hand-tune’ our learning rate and step size, which is not ideal, and normally we would use cross-validation to find the best values.↩︎

We’re using inference here in the standard statistical/philosophical sense, not as a synonym for prediction or generalization, which is how it is often used in machine learning. We’re not exactly sure how that terminological muddling arose in ML, but be on the lookout for it.↩︎

Many people’s default interpretation of a standard confidence interval is incorrectly the actual interpretation of a Bayesian confidence interval. This is partly because the Bayesian interpretation of confidence intervals and p-values is how we tend to naturally think about those statistics. But that’s okay, everyone is in the same boat. We also think it’s fine if you want to call the Bayesian version a confidence interval.↩︎

We used the R package for brms for these results.↩︎

R has excellent tools here for modeling and post-processing, like brms and tidybayes, and Python has pymc3, numpyro, and arviz, which are also useful. Honestly R has way more going on here, with many packages devoted to Bayesian estimation of specific models even, but if you want to stick with Python it’s gotten a lot better recently.↩︎