This is a quick reference for the modeling syntax and associated packages. While it covers a lot of ground, it is not meant to be exhaustive; rather, it provides an easy reference for those new to R, someone trying out an unfamiliar (but otherwise common) technique, or those just interested in a comparison to similar approaches in other environments. It is likely that anyone, regardless of R experience, will come across something new, or at least new to them. In time, this may be rendered obsolete by the parsnip package, but until then, this can get you quickly started with many common models.
Along with noting base R functionality where applicable, preference is given to what are perceived as very commonly used packages. ‘Very common’ is based on my own knowledge consulting across dozens of disciplines and several universities, regularly following the R community, and things like the rankings at RDocumentation.org and packages listed in CRAN Task Views. My goal is to not regurgitate a lengthy list of packages that leaves you no more knowledgeable about where to start than when you began. Once you use these packages, you will then come across others that you may like or provide additional functionality of interest, but will hopefully have a clearer sense of how to use them.
In what follows, you will not see any model that I have not run myself at least once, if not many, many times. For those who want to get more into the implementation of many of these models, I have raw R code that demonstrates many (probably most) of the techniques discussed here.
I will ignore the data argument in every example that uses the typical R formula approach. It is implied that you would supply it.
What this won’t provide:
For some content I will provide recommended reading. These texts are mostly those I’ve benefited from personally, or I would think would be useful to others, especially those just starting out with analysis, at an intermediate stage, or even coming to R from another language. They are not meant to be a list of seminal works on the subject, though some might be. I also include the Task View if applicable. The link will take you to Rdocumentation.org, which will list many packages, but the first link is the official Task View, where you will find more details about the packages.
When you see pack::func, the first value is the package name and the second the function. I show this the first time to make clear what package is used, but otherwise it is assumed you’ve loaded the necessary package (i.e. library(pack)).
Most packages follow the same basic modeling syntax, and those that don’t tend to see less usage. Just specify your target variable, say y, and your predictor variables, e.g. x. For some packages you may have to instead supply a model.matrix X and a separate y vector (with no data argument).
model_func(y ~ x, data=mydata)
model_func(X, y)
model_func(X) # unsupervised (e.g. principal components analysis)
Some models may be different enough to warrant their own syntax, e.g. latent variable models/SEM. However, even then, packages try to adhere to the y ~ x formula approach.
Most modeling functions, again, those that actually want to be used, will provide the following commonly used methods.
summary(my_model) # may also be just print()
fitted(my_model) # fitted values
residuals(my_model) # residuals
coef(my_model) # extract coefficients
model.matrix(my_model) # extract the model matrix
predict(my_model, newdata=other_observations, type='response') # specific predictions
Some may also have a plot method, but this will definitely vary by package both in whether it is available and what it will do. For example, plotting an lm object will explore residuals, while plotting a gam will show the component plots of the smoothed terms. Some packages, like ggeffects, visreg, and margins, will help with plotting some types of effects. I’ve found it easier to do it oneself via ggplot2, rather than hope you can finagle a plot made by a package into something you actually want.
One way to prepare results for visualization and publishable tables is the broom package. Its tidy function works with many very commonly used modeling packages, returning a tidy data frame of, for example, the summary output.
broom::tidy(my_model)
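Its companions glance and augment are similarly handy.
broom::glance(my_model) # one-row summary of model-level statistics
broom::augment(my_model) # original data plus fitted values, residuals, etc.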
Many packages will also have a confint method for interval estimates of coefficients. See the boot package as a starting point for bootstrapped intervals.
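For example, a minimal sketch, where my_boot_stat is a hypothetical function of the data and resampled indices that returns the statistic of interest:
confint(my_model) # Wald or profile intervals, depending on the model class
boot_res = boot::boot(mydata, statistic = my_boot_stat, R = 1000)
boot::boot.ci(boot_res, type = 'perc') # percentile bootstrap interval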
Good science will need competing models. For many models, one can use a likelihood ratio test or something like AIC.
anova(mod)
anova(mod1, mod2)
AIC(mod)
AIC(mod1, mod2)
lrtest(mod1, mod2) # only for some packages
Others will be package specific. For example, mgcv will supply a GCV (generalized cross-validation) metric, and there is WAIC and loo in Stan-related packages for Bayesian analysis.
If you want to roll your own model, the base R installation comes with optim, but you’re probably better suited to use one of the many packages that provide more efficient1 or simply better implementations of the same algorithms. See the Task View for starters, and know that many will work in a similar fashion as optim.
model = optim(par = start_values, fn = my_opt_function, method = 'BFGS', control = list(maxit=1000))
model
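Since my_opt_function and start_values above are just placeholders, here is a minimal sketch of what such an objective might look like, assuming a least squares criterion with X as a model matrix and y a response vector.
# an assumed example objective: least squares for a linear model
my_opt_function = function(par, X, y) sum((y - X %*% par)^2)
model = optim(par = rep(0, ncol(X)), fn = my_opt_function, X = X, y = y, method = 'BFGS')
model$par # estimated coefficients; compare to coef(lm(y ~ X - 1))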
meths <- c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "nlm", "nlminb")
result <- optimr::opm(start_values, my_opt_function, method=meths)
summary(result, order=value)
Starting with the basics, we can obtain linear models (OLS) or generalized linear models as follows.
aov(y ~ x + Error(id)) # ANOVA with error stratum (e.g. repeated measures for each id)
lm(y ~ x + z) # standard linear model/OLS
glm(y ~ x + z, family = 'binomial') # logistic regression with binary response
glm(y ~ x + z + offset(log(q)), family = 'poisson') # count/rate model
See also, rms::ols for the standard model with perks, along with lrm for logistic regression.
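For example, with the same formula interface:
rms::ols(y ~ x + z) # OLS with additional output and plotting support
rms::lrm(y ~ x + z) # logistic regression with extra fit statistics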
Unlike some statistical packages that haven’t figured out what year it is, you do not have to make explicit interaction terms. There are two ways you can do it.
lm(y ~ x*z)
lm(y ~ x + z + x:z) # equivalent
There are better approaches to get at nonlinear relationships than with polynomials (variables interacting with themselves), but if interested, one can use poly. There is also nls for nonlinear least squares. However, it is rare that one knows the functional form beforehand. Just because you can fit a quadratic function or logistic growth curve model doesn’t mean it’s actually the right or best functional form.
lm(y ~ poly(x, degree=3))
nls(y ~ SSlogis(log(x), Asym, xmid, scal)) # see ?SSlogis for details
You typically can transform a variable within the formula itself.
lm(y ~ log(x))
This is fine for quick exploration, but generally it’s poor practice to do so (you’ll likely need the transformed variable in other contexts), and it simply won’t work with some modeling functions, often due to variable name issues.
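If you do need the transformation, a simple sketch of the usual alternative, assuming a data frame called mydata:
mydata$log_x = log(mydata$x) # create the transformed variable first
lm(y ~ log_x, data = mydata)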
For categorical predictors, most modeling packages will automatically dummy code any categorical variables. But you can specify other coding schemes as well.
lm(y ~ x) # x is a factor
?contrasts
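As a brief sketch of the latter, assuming x is a factor in a data frame called mydata:
contrasts(mydata$x) = contr.sum(nlevels(mydata$x)) # effects (sum-to-zero) coding
lm(y ~ x, data = mydata)
lm(y ~ x, data = mydata, contrasts = list(x = 'contr.helmert')) # or set the coding per model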
Standard binomial logistic regression was demonstrated above, but options thin out quickly when you have more than two categories for your target variable. But there is still plenty to play with in R for ordinal and multinomial models.
For ordinal models, the rms package is a very good starting point. The ordinal package will add random effects as well, in the familiar lme4 style.
rms::orm(y ~ x) # y is ordinal
ordinal::clmm(y ~ x + (1|g1) + (1|g2), link = "probit", threshold = "equidistant")
For nominal dependent variables, check out the mlogit package. You will almost certainly have some data processing to do beforehand.
# x: a generic covariate
# q: an individual-specific covariate
# z: an alternative-specific covariate
mlogit::mlogit(y ~ x|q|z)
?mlogit::mlogit.data
?mlogit::mFormula
See mnlogit for a faster approach. Technically the base R package nnet will do these models too, but you won’t be able to do much with the result.
The standard GLM will only take you so far in terms of distributions for the target variable. To start your journey beyond the exponential family, you might consider the following. Some packages will provide additional distributions in similar contexts, and some might be geared entirely toward a specific family.
Some might be interested in quantile regression. The quantreg package would be a starting point. In addition, the car package has a few miscellaneous functions for various models, e.g. added-variable plots, variance inflation factors, influence metrics, and more.
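A brief sketch of each, with my_model being any fitted lm/glm object:
quantreg::rq(y ~ x, tau = .5) # median regression
rq(y ~ x, tau = c(.25, .5, .75)) # multiple quantiles at once
car::vif(my_model) # variance inflation factors
car::avPlots(my_model) # added-variable plots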
The following is a list of the packages mentioned in this section and their percentile rank on RDocumentation.
Not enough people use regularization in their models to guard against overfitting, and it should be the default approach in my opinion, especially in small data/complex modeling scenarios. The glmnet package fits a variety of models, with lasso, ridge, or something in between for a penalty. Not a very user friendly package though.
glmnet::glmnet(X, y)
See also, elasticnet. Better yet, go Bayesian. Note also that there are actually quite a few packages that will accept a penalty parameter/matrix, but odds are slim you’d know that in advance.
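As a sketch of a more typical glmnet workflow, with the penalty chosen by cross-validation (X a model matrix, y the response, and X_new hypothetical new data):
cv_model = glmnet::cv.glmnet(X, y, alpha = .5) # alpha = 1 is the lasso, 0 is ridge, in between is elastic net
coef(cv_model, s = 'lambda.min') # coefficients at the best penalty
predict(cv_model, newx = X_new, s = 'lambda.1se') # predictions at a more conservative penalty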
One of the more common generalizations from the GLM setting regards data dependent situations where observations are correlated with one another due to inherent clustering. This is most commonly dealt with via mixed models. You can find detailed examples in my document here.
The nlme package comes with the standard R installation. It does standard linear and nonlinear (in the predictors) mixed models. It can add correlation structure (e.g. temporal via corAR1, spatial via corGaus), as well as heterogeneous variances (e.g. varIdent). It can also do nonlinear models of specific functional forms (e.g. SSlogis). However, it cannot generalize beyond the Gaussian distribution.
nlme::lme(y ~ x, random = ~ 1|group) # basic model
lme(y ~ x, random = ~ 1 + x|group) # add random slopes
lme(y ~ x, random = list(~ 1|g1, ~ 1|g2)) # different random effects
In addition, one can get at additional correlation/variance structure, though it may not be very intuitive. The first line examines heterogeneous variances across some grouping factor q. The other demonstrates an autoregressive correlation structure.
lme(y ~ x, random = ~ 1|group, weights = varIdent(form=~1|q))
lme(y ~ x, random = ~ 1|group, correlation = corAR1(form=~time|q))
The lme4 package is one of the most widely used modeling packages in R. It’s one of the best tools for mixed models, and I’ve used many within R and beyond. It doesn’t do everything, but it does a lot, and fast. It serves most people’s needs just fine. Beyond that, many other packages work or depend on it, and other packages that extend lme4 will use the same modeling syntax.
In contrast to nlme, lme4 is very efficient, but does not have the capability (in a practical way) to do things like further examination of residual structure. It does however extend the models to the GLM family (and negative binomial with glmer.nb).
lme4::lmer(y ~ x + (1|group))
lmer(y ~ x + (1|g1) + (1|g2))
lmer(y ~ x + (1 + x|group))
glmer(y ~ x + (1|group), family = 'binomial')
Note that if you’re labeling your data correctly, it doesn’t matter to lme4 whether the clustering is nested or crossed. However if you give the nested clusters the same labels (e.g. 1,2,3,4 in every A, B, C), you’ll need to note this.
lmer(y ~ x + (1|group/nested_group))
lmer(y ~ x + (1|group) + (1|group:nested_group))
Again though, you shouldn’t label your data such that the same label can refer to multiple things when it’s not appropriate.
The following is a graph of packages that have a connection to lme4 directly (thick lines) or indirectly, and gives a sense that you’ll have a lot to work with should you use this package.
You’ll note in the graph one of the relatively large nodes regards the car package. People use that because lme4 won’t give you p-values. You don’t need them though, because you can just use confint to get the interval estimates, e.g. via bootstrap. The merTools package provides a more efficient approach to interval estimation, and in my experience is more like what you’d get with a Bayesian approach.
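For example, a sketch with my_lmer_model as a fitted lmer model and mynewdata as hypothetical data to predict on:
confint(my_lmer_model, method = 'boot', nsim = 500) # bootstrap interval estimates
merTools::predictInterval(my_lmer_model, newdata = mynewdata) # prediction intervals via simulation
merTools::FEsim(my_lmer_model) # simulate the fixed effects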
Here are some other useful packages:
merTools: extracting useful information from and exploring the implications of merMod objects
lmerTest: likelihood ratio and other tests
flexmix: mixture mixed models (e.g. latent class growth/trajectory)
mediation: mediation models with merMod objects
mgcv: additive mixed models (also via nlme)
brms: Bayesian approach with the same syntax
Several mixed model packages have the same methods to extract certain features.
fixef(model) # fixed effects coefficients
ranef(model) # random effects coefficients
VarCorr(model) # variance components
We’ve mentioned mgcv (more detail later) and ordinal. One of the lme4 developers is extending things to more complex models with glmmTMB.
I have several documents demonstrating mixed models for further exploration.
Generalized additive models incorporate nonlinear, random, and spatial effects all under one roof. One of the most powerful modeling packages in the R universe, the excellently named Mixed-GAM Computational Vehicle (mgcv), focuses on these, and even comes with the standard R installation!
My own opinion is that one would do well to use GAMs as a baseline model, as they allow for more complex modeling, but penalize that complexity to help guard against overfitting. Furthermore, assuming a linear relationship for everything is overly simplistic at best. GAMs also have close connections to mixed models, so can be seen as an extension of those. You can find detailed examples of GAMs in my document here.
mgcv::gam(y ~ s(x))
gam(y ~ s(x), family = 'nb') # different family (negative binomial)
gam(y ~ s(x, k=20)) # allow for more wiggles!
gam(y ~ s(x, by = group)) # different wiggles per group (group being a factor)
gam(y ~ s(x, z)) # interaction
gam(y ~ s(x, bs='cc')) # alternate spline (cyclic cubic)
gam(y ~ s(group, bs='re')) # random effect with factor variable
gam(y ~ s(area, bs='mrf', xt=list(nb=neighborhood_list))) # discrete spatial random effect
Once you run the model, you’ll want to explore it.
plot(model) # Visualize effects
gam.check(model) # Did I let it get wiggly enough?
concurvity(model) # the GAM form of collinearity
The package also comes with multiple approaches to mixed effect models, e.g. gamm, which works via nlme under the hood, and random effect smooths via bs='re'. With the former one can also get at temporal autocorrelation via the correlation argument. There is capability to deal with missing values, multivariate outcomes, and large data sets. So even if you don’t use smooth terms, mgcv can simply be seen as a means to model additional distributional families.
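A brief sketch of both, assuming a grouping factor group (the random and correlation arguments follow nlme conventions):
gam(y ~ s(x) + s(group, bs = 're')) # random intercept as a smooth term
gamm(y ~ s(x), random = list(group = ~1), correlation = corAR1()) # gamm with AR(1) residuals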
It took a while, but people have finally started to realize the power of mgcv.
For example, most of the R community uses ggplot2 to explore their data visually. The package comes with geom_smooth, which can actually use mgcv to display a nonlinear relationship between two variables. It is only triggered automatically for large samples (> 1000), but you can always use it if desired via the method argument. You can even use the various arguments for the smooth term function.
ggplot2::ggplot(aes(x, y)) +
geom_smooth(method = 'gam')
ggplot(aes(x, y)) +
geom_smooth(method = 'gam', formula = y ~ s(x, bs='gp'))
I’ll speak more about the package in the Bayesian section, but you can also use the s function when using brms3. Honestly, between those two packages and their respective capabilities, you’ll likely need little more for your statistical modeling. I personally use them all the time.
At one time both gamlss and VGAM extended GAMs beyond what mgcv could do, and they still have some modeling capabilities not seen in mgcv, especially with regard to other distributional families. However, mgcv has since made up some of the more important differences. Also, I still see some using the R-core spline function and splines package. There is nothing to be gained there however. There is also the gam package, by Trevor Hastie, the person who wrote one of the most widely cited works on the technique. While it certainly does the job, it’s not on the level of mgcv in terms of what all it can do. Unfortunately, some packages that use GAMs use it instead of mgcv, so you won’t have all the nice functionality.
In my document on GAMs, I quote Cosma Shalizi, who I think sums up the reason to use the models nicely. So I’ll do so again here.
With modern computing power, there are very few situations in which it is actually better to do linear regression than to fit an additive model. In fact, there seem to be only two good reasons to prefer linear models.
Our data analysis is guided by a credible scientific theory which asserts linear relationships among the variables we measure (not others, for which our observables serve as imperfect proxies).
Our data set is so massive that either the extra processing time, or the extra computer memory, needed to fit and store an additive rather than a linear model is prohibitive.
Even when the first reason applies, and we have good reasons to believe a linear theory, the truly scientific thing to do would be to check linearity, by fitting a flexible non-linear model and seeing if it looks close to linear. Even when the second reason applies, we would like to know how much bias we’re introducing by using linear predictors, which we could do by randomly selecting a subset of the data which is small enough for us to manage, and fitting an additive model.
In the vast majority of cases when users of statistical software fit linear models, neither of these justifications applies: theory doesn’t tell us to expect linearity, and our machines don’t compel us to use it. Linear regression is then employed for no better reason than that users know how to type lm but not gam. You now know better, and can spread the word.
Survival analysis is very common in biostatistics, epidemiology, public health etc. It is also commonly called event-history analysis and failure analysis (engineering). The basic idea is that you have a time-based target variable and want to model the time it takes until some event of interest occurs.
Given the nature of the data, you need two things to specify a target variable: the time counter and the indicator for whether the event of interest happened or not. The survival package comes with the basic R installation, and brings quite a bit of functionality.
y = survival::Surv(time = t, event = q) # standard right censored
y = Surv(t, t>0, type='left') # left-censored
coxph(y ~ x) # Cox Proportional Hazards Regression
cox.zph(model) # test proportional hazards assumption
anova(model) # anova summary table
coxph(y ~ x + strata(group)) # stratified
coxph(y ~ x + frailty.gaussian(group)) # random effect for group
coxph(Surv(start, stop, event) ~ x) # time-dependent model
survreg(y ~ x, dist="exponential") # parametric model
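You’ll also often want the Kaplan-Meier estimate for a quick look at survival curves, e.g. by group.
km = survfit(y ~ group) # Kaplan-Meier estimate
summary(km)
plot(km) # basic survival curves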
I’ve mentioned the rms package before. It is written by Frank Harrell, a noted biostatistician who also contributed to SAS back in the day. The package name is an acronym for the title of his book Regression Modeling Strategies, and given his discipline, he devotes a lot of content in the text, and functionality in the package, to survival analysis. The book is a very good modeling book in general.
y = survival::Surv(time = t, event = status)
dd = rms::datadist(x, z) # set data up for prediction
options(datadist='dd')
cph(y ~ x + z) # basic cox
psm(y ~ x + z, dist="weibull") # parametric survival model
Mean(model) # create a function for further exploration
Survival(model) # create a function for further exploration
Quantile(model) # create a function for further exploration
ggplot(Predict(model, x, z)) # prediction plot
survplot(model) # survival plot
nomogram(model) # yep, nomogram
Along with the extensions that rms provides, you might also find something of use in packages like Epi, epitools, or epiR. In addition, the biostats world seems high on splines for their survival models (see e.g. ?rms::rcs), so you can incorporate them easily with the models above; likewise, some of the additive modeling packages like mgcv have functionality for survival analysis.
mgcv::gam(time ~ s(x), weights = status, family = cox.ph)
And finally, you can find survival models in the machine learning context, e.g. with a package like randomForestSRC.
Harrell (2015)
Many data sets in the social sciences are the result of a survey design, where certain geographic regions and populations are sampled in precise ways. With the survey weights, one can make more appropriate inferences to the population from which the data was drawn.
I don’t have a lot of experience here, but can at least show a demo and provide some resources. The first issue is setting up the survey design. Your data will presumably have the weights and other necessary information (e.g. the finite population correction).
# no clusters, stratified, with finite population correction
dstrat = survey::svydesign(id=~1, strata=~stype, weights=~pw, fpc=~fpc)
dclus = svydesign(id=~clus, weights=~pw) # clustered design
svymean(~y, dclus, deff=TRUE) # survey weighted mean
svytotal(~z, dclus, deff=TRUE) # survey weighted total
svyratio(numerator = ~y, denominator= ~z, dclus, deff=TRUE) # survey weighted ratio
svyglm(y ~ x, design=dstrat, family='binomial') # survey-weighted glm
There are also some other models like survival models, factor analysis, and more. One thing I can also mention: don’t use the weights argument in the base R lm/glm functions with survey weights. That argument refers to a different type of weight (inverse variance), and will result in different standard errors.
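A quick sketch of the contrast, assuming pw is the survey weight variable:
svyglm(y ~ x, design = dstrat) # correct: standard errors respect the survey design
# glm(y ~ x, weights = pw) # not equivalent: pw would be treated as precision weights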
A lot of techniques fall under the heading of ‘dimension reduction’, ‘factor analysis’, or ‘matrix factorization’ techniques.
Principal components analysis is one of the most widely used dimension reduction techniques, and brings us to the first example of unsupervised methods. In this case, we don’t have a single target variable to predict, but several, and generally we are not modeling them with other covariates, but trying to reduce them to a smaller dimension. The basic idea of PCA is to reduce the data to fewer dimensions that nevertheless retain the most variance seen in the original data.
Base R has princomp and prcomp functions, but they merely get the job done.
X = scale(X) # standardize the data
princomp(X) # via eigen decomposition
prcomp(X) # via singular value decomposition
Beyond that you’ll have some methods like loadings, and these will often work in other packages as well. The most common things you’ll want are scores and loadings.
pc = princomp(X)
pc$scores # component scores
pc$loadings # loadings/weights
loadings(pc) # same, via the extractor function
While this recommendation comes from someone who has little use for PCA (but very often does factor analysis), I can recommend a couple of other packages. The psych package puts PCA more squarely in the framework of factor analysis more generally, and offers rotations, other measures of fit, etc. The pcaMethods package provides probabilistic PCA, Bayesian PCA, cross-validation, selection of the number of components, PCA-based imputation, and more.
model = psych::principal(X, nfactors = 2, rotate='varimax')
model$loadings
model$scores
model = pcaMethods::ppca(X, nPcs = 2) # probabilistic PCA
loadings(model)
scores(model)
There are plenty of supervised extensions to PCA, but generally you’ll have better choices in such a modeling scenario. For example, instead of PC regression, use techniques that won’t require the dimension reduction (many machine learning techniques) and/or would be more flexible (e.g. a neural network), or would better suit the goals of some analyses (e.g. when variables belong to the same scale, factor analysis, and thus SEM, would be more appropriate).
The base R approach to the most common form of factor analysis is pretty much identical to PCA.
model = factanal(X, factors = 2, scores = 'regression', rotation = 'promax')
model$loadings
model$scores
The psych package offers far, far, more in this realm. The output alone will take a bit for you to get through.
model = psych::fa(X, nfactors = 2, rotate = 'oblimin', fm = 'ml') # via maximum likelihood
model
model$loadings
model$scores
fa.diagram(model) # visualize
predict(model, newdata=mynewdata) # estimate scores for new data
nfactors(X, 4, rotate='promax') # compare solutions for 1:4 factors
Once you get into latent variable models like factor analysis, you’ll have plenty of tools at your disposal. Check out lavaan (confirmatory factor analysis/SEM) and ltm (item response theory models) for starters.
I have some more information on many dimension reduction techniques here, which briefly covers/demos a wide variety of techniques like t-sne, non-negative matrix factorization, latent Dirichlet allocation, mixture models, recommender systems and more. It spends more time on PCA and FA.
I also cover PCA/FA here, along with SEM, mixture models, IRT models and more.
For measurement in general, consider psychometrician and psych package author William Revelle’s freely available text. You will also learn a ridiculous amount by simply reading the help files to the various functions in his package.
More to follow in the SEM section.
Often the goal is to find latent structure (or simply dimension reduction) among the observations (rows), as opposed to the variables (columns). As such we have another unsupervised modeling situation that is very commonly used- cluster analysis.
We’ll start with model-based approaches to the problem, typically referred to as mixture models. A good starting point here is mclust. It provides a very easy way to search across a wide variety of cluster types, as well as any number of clusters (including 1 cluster). You may then select the ‘best’ solution based on BIC, which is an advantage of a model-based approach. Furthermore, it provides plots of the search space, the classifications, uncertainty, and density plots.
model = mclust::Mclust(X, G = 1:4)
summary(model)
plot(model)
Often people want to do a cluster analysis and then use those clusters in a subsequent analysis, e.g. as a predictor. This two-stage approach fails to take into account the uncertainty in the class assignment. However, the mixture model approach can do this appropriately, and we have another long-standing R package to help us there- flexmix. This package can actually handle a wide variety of regression models (e.g. different response distributions, mixed models4), allowing the results to vary across the latent clusters. It uses S4 class objects.
model = flexmix::flexmix(yn ~ x, k = 2) # two clusters
summary(model) # basic info
parameters(model) # coefficients (for each cluster/class)
posterior(model) # probability of class membership
# two outcomes, two models, two classes
model = flexmix(yn ~ x, k = 2,
                model = list(FLXMRglm(yn ~ x + z),
                             FLXMRglm(yp ~ x + offset(q), family = 'poisson')))
Another type of model, or at least term, you may hear with regard to mixture models is latent class analysis. The only distinction here is that the data is categorical, thus you can still just think of it as factor analysis. The poLCA package focuses on this. It also will allow for modeling the latent classes as an outcome (in distinction to flexmix), just as you would in standard logistic regression settings.
# assume X is a matrix of binary indicator variables
model = poLCA::poLCA(X ~ 1, values, nclass = 2) # basic latent class/cluster analysis
model = poLCA(X ~ x + z, values, nclass = 2) # add predictors
poLCA.posterior(model) # probability of class membership
While it does about as much as many would want, unfortunately poLCA is pretty bare bones. Look for lavaan to add latent classes in the future.
When it comes to modeling with latent clusters, there are a lot of ways to do so. A very common approach that one sees in other modeling frameworks is hidden Markov models. Consider the HMM package as a starting point, but there’s a lot out there.
For traditional cluster analysis based on distance matrices, one doesn’t really need to go beyond the base R installation, as it comes with dist, kmeans, hclust functions, and the recommended package cluster.
X_dist = dist(X) # euclidean distances
X_dist = dist(X, method = 'manhattan') # manhattan distance
kmeans(X, 3) # k-means 3 cluster solution (uses the raw data, not the distance matrix)
hclust(X_dist, method = 'average') # hierarchical clustering with average linkage
cluster::agnes(X_dist) # agglomerative clustering
cluster::diana(X_dist) # divisive clustering
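A typical follow-up for the hierarchical result is to cut the tree into a chosen number of clusters.
hc = hclust(X_dist, method = 'average')
cutree(hc, k = 3) # cluster membership for a 3 cluster solution
plot(hc) # dendrogram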
Unfortunately such methods come with many arbitrary choices- distance metric, linkage method, number of clusters, clustering method, ‘performance’, etc. I’d suggest something like clusterSim to make it easy to search through the minimally dozens of options you might consider. For example, the following would search over several clustering approaches, evaluating them across different cluster numbers, cluster quality assessments, distance metrics, normalization techniques and so forth.
clusterSim::cluster.Sim(X,
                        p = 1,
                        minClusterNo = 2,
                        maxClusterNo = 5,
                        icq = c('G1', 'S'),
                        distances = c('d2', 'd5'),
                        normalizations = c('n1', 'n3'),
                        methods = c('m5', 'm3', 'm1'))
Structural Equation Modeling (SEM) is a very broad technique that encompasses and extends many others, bringing together regression, factor analysis, mixture models and more. In particular it combines measurement models regarding latent variables and observed variables, and structural models to explore the effects of latent and observed variables on others.
Due to the use of latent variables and other modeling specifics, a special syntax is required to run SEM models, but otherwise it does not require much to get going, at least with the lavaan package. For those familiar with Mplus, it and various extensions can do about 90% of what Mplus does5, and even spits out results in the same format.
We can start with a path analysis using only observed variables. If you know how to run a regression, you can do this easily.
model = "
y ~ x + q + r + z
z ~ x + q
"
result = lavaan::sem(model)
summary(result, standardized=TRUE) # result with standardized effects
Confirmatory factor analysis is just like any other factor analysis, except some loadings are constrained to be zero, usually for theoretical and other not so good reasons.
model = "
lv1 =~ x1 + x2 + x3
lv2 =~ y1 + y2 + y3
y1 ~~ x1 # residual correlation to be estimated
"
result = cfa(model)
lavPredict(result) # get factor scores
SEM allows us to measure multiple outcomes, indirect effects, and combine latent and observed variables, all at once.
model = "
lv1 =~ x1 + x2 + x3
lv2 =~ y1 + y2
lv3 =~ z1 + z2 + z3
# a mediation model with lv2 as mediator
lv3 ~ lv1 + lv2 + q + f # q and f are observed
lv2 ~ lv1 + q
"
# sem with an approach to deal with missing values and robust ML; request additional fit measures in the summary
result = sem(model, missing='fiml', estimator='MLR')
summary(result, fit.measures=TRUE)
Several packages extend the capabilities of lavaan.
In terms of Mplus style extensions, note lavaan.survey (survey design), blavaan (Bayesian), and semTools, which add things like multiple imputation for missing data, measurement invariance testing, additional fit indices, and a lot more.
A couple of others I will note, though I haven’t used them yet. I note them because I will at some point, and their intentions are in the right place. One is regsem. The vast majority of SEM models are grossly overfit, estimating dozens of parameters with only a couple hundred observations. Such a package can surely assist with this problem, though I predict many will not like what they find (even though it will be a better result). In addition, while writing this I came across the lavaanPlot package, which might save me at least 50% of the headache of using DiagrammeR.
Going beyond lavaan, I would suggest looking into the mediation package if you only have observed variables, as it can handle a wide variety of models for the mediator and outcome. In a similar vein check out piecewiseSEM. MplusAutomation will save you from having to use Mplus even if you want to use it. With one template you can write the syntax files, run the models, and bring back the results into R in a usable form, and all automatically, without ever opening Mplus.
Others that might be of note but I can’t speak to include lava and OpenMx. In addition, the aforementioned psych package has some SEM/lavaan capabilities.
My doc on SEM covers a lot of ground and provides more extensive examples.
Kline (2015) is a very popular applied treatment. See the classic Bollen (1989) for more details regarding standard models.
SEM can be seen as a very large subset of what are generally called graphical models, where variables are represented by nodes (sometimes called vertices) and the relations among them are represented by edges (sometimes called arcs). Edges may be directed (i.e. have arrows) or undirected. With directed (acyclic) graphs (DAGs), we get into things like causal modeling, SEM, Bayesian networks, etc. With undirected graphs, often called Markov random fields, applications are in network models, spatial analysis, and more.
In most cases there are two main ways to supply the data for such models. One is to provide two data sets: one specifies the nodes and various attributes associated with them, while the other is an edge list or data frame that specifies the nodes from which a connection begins, the nodes to which they connect, and possibly a column specifying the value/weight for the connection. The other approach uses an adjacency matrix, which is like a correlation or (inverse) distance matrix, where the off-diagonal values specify connections among the nodes, and may be simply binary or comprise continuous weights.
One of the more popular and powerful packages is igraph, and even when you’re using others, they are often using it under the hood. Given the right data, one can create a graph object, and then get any number of statistics, visualize it, or whatever. A lot of it is geared toward the creation of graphs, but normally graphs are too big to do by hand and/or you’d have the data already, and it’d be easier to create an adjacency matrix from your typical data frame and work from there.
g = igraph::graph_from_adjacency_matrix(adjmat) # create graph from adjacency matrix
g = graph_from_data_frame(df) # create graph from a data frame
as_adjacency_matrix(g) # convert graph to adjacency
plot(g) # plot it
degree(g) # degree for each node
hub_score(g) # hub score for each node
count_triangles(g) # N triangles for each node
clusters = cluster_edge_betweenness(g) # detect clusters
membership(clusters)
communities(clusters)
write_graph(g, 'g.dot', format = 'dot') # write graph to e.g. dot format
The basic plot functionality for igraph is… basic. It’s not worth your time to make it pretty. To that end I suggest packages like networkD3, visNetwork, and DiagrammeR. It’s not too big a deal to convert the graph object to something they will use (visNetwork will even work with an igraph object, e.g. with toVisNetworkData).
A package that’s useful for Bayesian networks, or probabilistic DAGs, is bnlearn. In this case we are looking for a graphical structure based on the data we have. The package allows for model fitting such that you could essentially duplicate a path analysis SEM-style, but the strength is in the model search and selection via a variety of algorithms, possibly with cross-validation. It can work with binary, continuous, or a mix of those. With blacklists and whitelists, one can impose theoretical structure on the search such that some connections are always present while others are forbidden.
model = bnlearn::gs(X, blacklist = bl) # grow-shrink algorithm with a blacklist
plot(model)
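As a sketch, a blacklist is just a two-column from/to data frame of forbidden edges, and once a structure is learned you can estimate its parameters (the variable names here are assumed columns of X):
bl = data.frame(from = c('y', 'z'), to = c('x', 'x')) # forbid edges into x from y and z
model = bnlearn::gs(X, blacklist = bl)
fitted = bnlearn::bn.fit(model, X) # estimate parameters given the learned structure (requires a fully directed graph)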
Network analysis employs graphical models to study things like social or other specific types of networks. Two primary packages are network, which like igraph, serves as a foundation package, and sna. The primary object is a network class object instead of a graph object like with igraph, but you more or less proceed similarly.
g = network::network(adj_mat, directed=FALSE) # undirected network based on adjacency matrix
sna::summary(g) # use sna for the rest
plot(g)
betweenness(g) # various network statistics
degree(g)
efficiency(g)
model = ergm::ergm(g ~ edges) # exponential random graph model
With sna you may have some specific statistics not found in igraph, but it’s not really an either or, as you can easily move between the two. As with igraph, you’ll probably want to use a different package for visualization. While sna has some modeling capabilities, you might also consider ergm for exponential random graph models.
Pearl (2009) Also, he has every article available on his website.
There are several Use R! series entries for graphical models and network analysis.
With time series data, there isn’t really a specific model of concern, and in general, many models you might want to run can’t do anything special with a time-series class object6. For example, for a specific situation with observations over time, one could run a mixed model, a GAM, a gaussian process model, an ARIMA model, hidden Markov model, or a recurrent neural network. However, there is concern about how to deal with the data itself, even for basic processing and visualization.
To begin, note that there is a lot in the base R installation for dealing with time-based objects.
acf(Series) # autocorrelation plot
lag(Series, 1) # can be used in a formula
diff(Series, lag = 2) # take differences
fft(Series) # fast fourier transform
Series.ts = as.ts(Series)
plot(Series.ts)
frequency(Series.ts)
A few very useful packages that make dealing with time and dates easier are xts, zoo, and lubridate.
xts::as.xts(Series)
apply.monthly(Series, somefunction) # apply a function at a specific period
window(zoo::as.zoo(Series)) # subset a series by time (window method from zoo)
lubridate::as_date(myDates)
ymd(myDates) # convert to YYYY-MM-DD format
hour(myDates) # extract hours from a date object
Newer packages like tsibble and tibbletime aim to treat whole (tidy) data frames with a time-oriented context, and allow functionality for filling in missing time points, windowing functions, and more.
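A minimal tsibble sketch, assuming a data frame df with a date column and a value y:
library(tsibble)
ts_df = as_tsibble(df, index = date) # declare the time index
fill_gaps(ts_df) # make implicit missing time points explicit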
For modeling, again that depends on many things and may just be one of many models already demonstrated. I’ll mention forecast as a means to get started with traditional models like auto-regressive moving average and state-space models (with ‘tidy’ piping functionality), and prophet, an offering by Facebook that uses Stan (see Bayesian section).
forecast::auto.arima(Series) %>% plot()
# Series.df is a data.frame with two columns of date ('ds') and value ('y')
model = prophet::prophet(Series.df, growth = 'linear', n.changepoints = 1)
model$changepoints # date of estimated changepoint
forecast = predict(model, Series.df)
plot(model, fcst=forecast)
The following demonstrates an additive mixed model with a cyclic spline for the trend, and which models the residuals with an autoregressive structure.
mgcv::gamm(y ~ s(time, bs='cc'), correlation = corAR1())
Again the main thing is just to know that tools are available to help process time series, but models may not require it. The prophet package expects a date object, whereas mgcv did not.
Spatial models incorporate additional structure in a model that has some spatial component, usually, but not limited to, geography. A lot of the work you’ll do is with the data processing rather than modeling part.
A starting point for spatial data is the sp package, which provides basic spatial objects that can then be used for mapping or modeling, such as SpatialPoints, SpatialPolygonsDataFrame and more. Many, many other packages rely on sp.
A lot of sp, and its corresponding Use R! book, is geared toward creating a map from scratch, which is great, but rarely needed, as typically one already has map data, e.g. in the form of shapefiles, and wants to simply import and then model or visualize it. It also uses lattice and base R for visualization, and is inordinately slow to visualize its products. While sp has been around for a long time, the newer sf (simple features) package will eventually supersede it (or at least that’s the plan), so I’ll give a quick example.
map_obj = sf::st_read("Shapefile.shp") # read in a shapefile
plot(map_obj)
head(map_obj) # spatial data.frame
world1 = st_as_sf(maps::map('world', plot = FALSE, fill = TRUE)) # convert maps object to sf
ggplot2::ggplot() + geom_sf(data = world1) # plot with ggplot2
The spdep package can serve as a first step to spatial modeling, providing common spatial descriptive tests and models. However, functionality may be available in other packages that does not require specific spatial data structures.
nb = spdep::nb2listw(nb_object, style="W") # create a weighted neighbor list
moran(y, list=nb, length(nb_object), Szero(nb)) # Moran's I
moran.test(y, nb) # Moran test for spatial autocorrelation
model = spautolm(y ~ x + z, listw=nb, family="CAR") # conditional autoregressive model
summary(model)
model = lme(y ~ x, correlation=corSpatial(~ lat + lon, type='gaussian')) # via mixed model
model = gam(y ~ s(lat, lon, bs='gp')) # via additive model
model = gam(y ~ s(x) + s(geog_units, bs='mrf', xt=list(polys=polygons))) # discrete space (markov random field)
The maptools package allows for easy conversion and combining of sp objects, and provides additional useful functions. For modeling, Bayesian approaches, like those that R-INLA provides, are common in the spatial realm.
For visualization, use maps for a quick peek and subsequent conversion to an sp object (via maptools). For pretty (and often pretty easy) visualizations, consider geom_sf (in ggplot2), ggmap, mapview, leaflet, and plotly.
Note also that there are a lot of packages being used under the hood, e.g. rgeos, rgdal, but which you probably won’t use directly. In general, spatial tools, especially for visualization, have come a long way relatively recently, so you’ll best want to investigate to see what’s out there regularly.
Bivand, Pebesma, and Gómez-Rubio (2013) Not sure how useful this still is as far as how you’d want to go about things in R these days, but you would likely learn a lot that will prove useful in thinking about and exploring spatial problems.
Machine learning is more of an approach than a specific set of analysis techniques, though one will find models they won’t see in traditional statistical modeling. For practical purposes, your approach will involve measuring performance on a test (validation, hold-out) set of data rather than (just) the data you fit the model on.
Two packages are very popular in this realm, caret and mlr, but they work similarly because again, ML is more of a conceptual approach than a technique. We’ll demonstrate both on a classification task with a random forest.
ind = caret::createDataPartition(y, p = .8) # index to create a training and test set
cv_opts = trainControl(method='cv', number=10) # 10-fold cross validation
# random forest with various options
model_train = train(X_train, y_train,
                    method = 'rf',
                    tuneLength = 5,
                    trControl = cv_opts,
                    preProcess = c("center", "scale"))
preds_rf = predict(model_train, X_test) # predictions
confusionMatrix(preds_rf, y_test) # classification results
If you use caret for nothing else, the confusionMatrix function is very handy for any classification model.
Here is a similar approach using mlr.
task = mlr::makeClassifTask(data = X, target = "y") # X is a data.frame with all variables
n = nrow(X)
train.set = sample(n, size = .8*n) # indices for training data
test.set = setdiff(1:n, train.set) # indices for test data
lrn = makeLearner("classif.randomForest")
model = train(lrn, task, subset = train.set)
preds_rf = predict(model, task = task, subset = test.set)
calculateConfusionMatrix(preds_rf, relative = T)
Both packages offer a very wide variety of ways to tune and customize your approach, and offer literally hundreds of models between them. Some other packages that offer some specific models (accessible by caret or mlr if desired) that one might want to be aware of are e1071, glmnet, randomForest, and xgboost. For understanding such models, consider lime.
When it comes to larger data and deep learning models, you may want to investigate tools like sparklyr, h2o, and keras. Development is very recent and rapid in this domain, so you’ll need to keep up with it.
Bayesian modeling is very easy to do in R these days, even for fairly complicated models. Many packages use BUGS or its JAGS dialect, though more recent developments have been made with Stan.
Bayesian analysis has a long history in R, so you’ll find various packages more or less doing their own thing, but revolving around a couple of key packages like coda, MCMCpack, rjags, etc. You can use rjags to run something written in BUGS and get further processing of the results with coda. MCMCpack and MCMCglmm provide the ability to run several types of models in the Bayesian context.
model <- rjags::coda.samples(jags_mod, c("alpha","beta","sigma"), n.iter=1000) # jags_mod created beforehand with jags.model()
summary(model)
coda::gelman.diag(model) # a diagnostic
coda::traceplot(model) # a diagnostic
MCMCpack::MCMClogit(y ~ x + z, burnin = 1000, mcmc = 5000, thin = 10) # logistic regression model
MCMCglmm::MCMCglmm(PO ~ x + z, burnin = 1000, nitt = 5000, thin = 10, random = ~group) # a mixed model
Many of those packages are seeing decreased use because of Stan. Stan is a probabilistic programming language developed by a very smart group of people at Columbia, and its rapid development has enabled many people to jump in, get their Bayesian analysis done quickly, and explore the results with ease. The following shows the rstan ecosystem.
To begin, one can write their own Stan modeling program and use rstan to run it. The rstanarm package, developed by the Stan group, makes it as easy to run glm, ordinal, mixed, and multivariate models as it is to run them with standard tools. The brms package is ridiculously flexible, allowing one to conduct very complex models without ever programming in the Stan language directly. It has many distributional families, mixed modeling capability, and a variety of means to explore the model after the fact (some of which would apply to rstanarm as well).
model = rstan::stan(file='mystandcode.stan', warmup = 1000, iter = 5000, thin = 10, cores=4) # 4 chains in parallel
model = rstanarm::stan_glm(y ~ x, family = 'binomial') # Bayesian GLM
# set priors for random effect and regression coefficients
priors = c(brms::prior(cauchy(0, 1), class = sd),
           prior(normal(0, 10), class = b))
model = brm(y ~ x + (1 + x|group), family = 'zero_one_inflated_beta', prior = priors) # lme4 style
summary(model)
launch_shinystan(model) # interactive exploration (also rstanarm)
pp_check(model) # posterior predictive distribution (also rstanarm)
marginal_effects(model) # plot covariate effects
hypothesis(model, hypothesis = "exp(x) - 2 = 0") # a specific test
Many other packages are using Stan under the hood as well for specific types of models.
Some newer packages to keep an eye on include greta, which looks to take the Hamiltonian Monte Carlo approach of Stan to Tensor Flow and big data, and tidybayes, which will make for some easier post-processing of model results.
Kruschke (2014) Bayesian puppies!
McElreath (2016)
Gelman et al. (2013)
You can learn more about Stan in my document: Bayesian Basics.
Most of text analysis involves serious data munging in order to take the unstructured text and put it into a form that’s useful. What analysis is done once the data is prepared isn’t really specific to text. For example, one might use any classification model for sentiment analysis, and latent Dirichlet allocation can be applied to any matrix of counts. In other instances, there are more text-oriented analyses, e.g. in the realm of deep learning (see keras).
The first thing to know is that there are many tools in the base R installation to deal with text.
grep(word_vec, pattern = 'xyz', value = T) # return matching values
grepl(word_vec, pattern = 'xyz') # return a vector of TRUE, FALSE if match
gsub(word_vec, pattern='xyz', replacement = 'qt') # string replacement
substr(word_vec, start = 5, stop = 10) # substring
adist(word_vec1, word_vec2) # string distance between sets of words; see stringdist package
tolower(word_vec) # convert to lower case
Generally you might want to use a package like stringr, which is actually a wrapper for stringi and some base R functions. This mostly serves to make the functions work consistently, but there is little other advantage.
stringr::str_extract(word_vec, pattern = 'xyz')
str_replace(word_vec, pattern = 'xyz', replacement = 'qt')
str_remove(word_vec, pattern = 'xyz')
str_sub(word_vec, start = 5, end = 10)
When it comes to text analysis, there’s a lot going on in the R world. Traditionally, there are packages like tm, lda, koRpus, and topicmodels. More recently, there are ones that make things easier, faster, or add functionality. These include tidytext, text2vec, and quanteda.
The following assumes a tidy text data frame, where you have a column that notes the document and a column for the words in the document. You can use basic dplyr functionality to remove stop words and get word counts, then other packages to create the objects needed for further processing, and eventually analysis. The following is an example of a topic modeling approach.
lda_model = text2vec::LDA$new(n_topics = 10) # topic model setup

processed_text %>%
  anti_join(stop_words_df) %>%               # remove stopwords, with stop_words_df = data.frame(words=stop_word_vector)
  count(document, words) %>%                 # get word counts
  tidytext::cast_dfm(document, words, n) %>% # create a document term/feature matrix
  quanteda::dfm_wordstem() %>%               # quanteda allows further processing while in dtm form
  lda_model$fit_transform()                  # fit a fast LDA
Examples of the initial processing can be found with tidytext documentation, though folks very familiar with dplyr will see what’s going on clearly. In short, standard text processing in R is just about as easy as it can be. Your biggest problem will be dealing with large amounts of text, for which R’s memory hungry approach is not well suited.
As far as text-specific analysis goes, consider some packages like stm (structured topic models), text2vec (text embeddings and more), quanteda (misc processing and analysis), and spacyr (wrapper for the Python module).
In some modeling situations, you may have data that has enough missing values to warrant doing something about it, else you will lose a good chunk of your data when it comes to analysis. There are a couple packages of note here, including mice, Amelia, and missForest. Packages that are oriented toward SEM, e.g. lavaan, will often provide full information maximum likelihood (semTools adds multiple imputation and other approaches to lavaan). I’ll provide a simple example with mice (Multiple Imputation with Chained Equations). The basic approach is to create several imputed (i.e. now complete) data sets, run the desired model on each, and combine the results.
# 5 imputed data sets via predictive mean matching
imputed_data = mice::mice(mydata, method = "pmm", m = 5)
imputed_results = with(imputed_data, lm(y ~ x)) # regression model on each imputed data set
final_model = pool(imputed_results) # combine results via Rubin's rules
summary(final_model)
The mice package has a ton of functionality, several packages extend it, and it even has an accompanying text, so if you’re going to use multiple imputation, consider it strongly.
Here are some discipline-specific task views that have not already been mentioned where you will find packages that may have just the modeling tools you’re looking for.
The key idea to running models in R is to know that there is a standard approach for a given situation, and many packages build on that. Different techniques will have different concerns, but usually it will be the case that using one of the primary packages within a given domain will get you pretty far, and if you decide to go with further, the approach will likely be little different with related packages. Hopefully this document has provided some good starting points for you, allowing you to dive in and have some fun.
Best of luck with your modeling! \(\qquad\sim \mathbb{M}\)
Bivand, Roger S, Edzer J Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R. 2nd ed. Springer.
Bollen, Kenneth. 1989. Structural Equations with Latent Variables. New York: Wiley.
Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference. Vol. 5. Cambridge University Press.
Everitt, Brian S., Sabine Landau, Leese Morven, and Daniel Stahl. 2011. Cluster Analysis. 5th ed. John Wiley & Sons.
Fahrmeir, Ludwig, Thomas Kneib, Stefan Lang, and Brian Marx. 2013. Regression: Models, Methods and Applications. Springer Science & Business Media.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. CRC Press.
Hardin, James W., and Joseph M. Hilbe. 2012. Generalized Linear Models and Extensions, Third Edition. Stata Press.
Harrell, Frank E. 2015. Regression Modeling Strategies. 2nd ed. Springer.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. 2nd ed. Springer.
Heeringa, Steven G, Brady T West, and Patricia A Berglund. 2017. Applied Survey Data Analysis. Chapman; Hall/CRC.
Højsgaard, Søren, David Edwards, and Steffen Lauritzen. 2012. Graphical Models with R. Springer Science & Business Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer.
Kline, Rex B. 2015. Principles and Practice of Structural Equation Modeling. Guilford publications.
Kolaczyk, Eric D, and Gábor Csárdi. 2014. Statistical Analysis of Network Data with R. Vol. 65. Springer.
Kruschke, John. 2014. Doing Bayesian Data Analysis: A Tutorial Introduction with R. 2nd ed. Academic Press.
Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Luke, Douglas A. 2015. A User’s Guide to Network Analysis in R. Springer.
Lumley, Thomas. 2011. Complex Surveys: A Guide to Analysis Using R. John Wiley & Sons.
McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Vol. 122. CRC Press.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
Nagarajan, Radhakrishnan, Marco Scutari, and Sophie Lèbre. 2013. Bayesian Networks in R. Springer.
Pearl, Judea. 2009. Causality. Cambridge university press.
West, Brady T, Kathleen B Welch, and Andrzej T Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. CRC press.
Even the author of optim doesn’t recommend it.↩
Any number besides zero can serve as the focal point (e.g. a one-inflated beta distribution).↩
There is the jagam function in mgcv to use JAGS, but I can’t think of a reason to use it over brms.↩
Using a mixed model on longitudinal data with flexmix negates the need to use SEM for what are sometimes called Latent Class Growth Curve Models, Growth Mixture Models or Latent Trajectory Models. If you need indirect effects, you’ll have to use Mplus or similar, and good luck with that.↩
Currently lavaan lacks the incorporation of categorical latent variables (i.e. mixture models), but you have more functionality in R for those elsewhere than you would Mplus. The current lavaan also lacks multilevel SEM, but that is a pretty rare model. There are very minor other differences.↩
Typically a time-based variable would be treated like any other numeric variable.↩