Exercises

In all these exercises click to show the code. Hints do not contain working code, and are there to help you out only.

Path Analysis

Part 0 Mediation

Consider the following code regarding a 3 variable mediation problem, and draw the corresponding path model, including the proposed path values. Which is the covariate, mediator, and outcome? Using the ‘product-of-paths’ approach, what value should the indirect effect be in the population?

set.seed(1212)
N = 1000
v = sample(0:1, N, replace = T)
q = .2*v + rnorm(N)
r = .7*q + .4*v + rnorm(N)

Run the above code, then use the mediation package to estimate it. The code below provides a hint. Except for the first line, the code is conceptual only, i.e. you have to fill in the right parts. Specifically, change the formulas in the lm() part, and the names of treat and mediator in the mediate() part.

d = data.frame(v, q, r)
mediation_model = lm(mediator ~ covariate, data=d)
outcome_model   = lm(outcome  ~ covariate + mediator, data=d)

library(mediation)
mediation_result = mediate(mediation_model, outcome_model,  treat = 'covariate', mediator = 'mediator')
summary(mediation_result)

Part 1 Path Analysis

This exercise investigates satisfaction in high school seniors using the 1993 Monitoring the Future dataset ('data/monitorfuture.csv'). Perform a path analysis on the four measured variables: self-esteem (esteem), overall life satisfaction (overall), locus of control (locus), and loneliness (lonely). Estimate the following model using lavaan (note that the dashed line is an ‘unanalyzed’ correlation and doesn’t need to be explicit in the model).

First get the data in, then set up your model code. The following provides a hint. Note that in this model you have two outcomes/endogenous variables, so write the model code pertaining to each.

mtf = read.csv('data/monitorfuture.csv')

satistifaction_ModelCode = 'Y ~ X + z'

After that, run the model using the sem function in lavaan. The following provides a hint.

satistifaction_ModelResults = sem(modelcodename, data=modeldataname)  

summary(satistifaction_ModelResults, rsquare=T)

Your results should be consistent with the following graph.

Questions:

What are the positive effects in this model?
What are the negative effects in this model?
What is your initial impression about model performance?
What specifically are the R² values for the endogenous variables?

Part 2 Indirect Effects

Rerun the model to explicitly test the indirect effects using the product-of-paths approach and labeling the paths in the lavaan model. Click to see how.

satistifaction_ModelCode_Mediation = '
  overall ~ c*esteem + lonely
  esteem ~ a*locus + b*lonely

  # Indirect effects
  locusOverall := a*c
  lonelyOverall := b*c
'

Regarding this model…

A. Is the indirect effect for loneliness on life satisfaction statistically significant?
B. How about locus of control?
C. Do you think loneliness causes self-esteem (more lonely, less self-esteem), or vice versa, or perhaps they are simply correlated?

Part 3 Non-SEM Graphical Models

You will need the bnlearn and igraph packages to run the following.

Bayesian Networks

The following provides some data for you to play with. From the help file

Lauritzen and Spiegelhalter (1988) motivate this example as follows:

“Shortness-of-breath (dyspnoea) may be due to tuberculosis, lung cancer or bronchitis, or none of them, or more than one of them. A recent visit to Asia increases the chances of tuberculosis, while smoking is known to be a risk factor for both lung cancer and bronchitis. The results of a single chest X-ray do not discriminate between lung cancer and tuberculosis, as neither does the presence or absence of dyspnoea.”

The asia data set contains the following yes-no binary variables:

D dyspnoea
T tuberculosis
L lung cancer
B bronchitis
A visit to Asia
S smoking
X chest X-ray
E tuberculosis versus lung cancer/bronchitis

The following will run a model for this data. What variables seem to be related others?

library(bnlearn)
data(asia)
asia = select(asia, -E)
mod = gs(asia)
bnlearn:::plot.bn(mod)

Network Analysis

The following code creates an adjacency matrix (1 signifies a connection, 0 none), the corresponding graph (for 10 nodes), and a couple statistical measures for nodes. Play around with it to see what you come up with. For example, if you reverse the probabilities for the 0s and 1s in that first line, how will the graph change?

library(igraph)
adj = matrix(sample(0:1, 1000, replace=T, prob = c(.75,.25)), 10, 10)
rownames(adj) = colnames(adj) = LETTERS[1:10]
diag(adj) = 0
g = graph_from_adjacency_matrix(adj, mode='directed')
plot(g)
degree(g)
page.rank(g)$vector

ggplot2::qplot(degree(g), geom='bar')

Factor Analysis

Part 0 PCA and FA comparison

An long-running personality theory posits 5 primary latent personality traits: Agreeableness, Conscientiousness, Extroversion, Neuroticism, and Openness. 25 personality self report items taken from the International Personality Item Pool (ipip.ori.org). The data from 2800 subjects are included here.

Judging from the following correlation matrix (blue= positive correlation, red=negative, bold=stronger, faded=weaker; hover for value), can you already see some structure appearing? How many factors do you see? How are they correlated with one another?

Load the data, load('data/bfi.RData'), and run a factor analysis for five factors and interpret the results. you can use fa.diagram for a quick and dirty graphical depiction.

load('data/bfi.RData')
library(psych)
bfi_5 = fa(bfi_trim, 5, rotate='promax')
fa.diagram(bfi_5, cut=.2, digits = 2)

Some think there are six factors, maybe you think there are fewer/more. Run a different factor analysis (i.e. with the nfactors argument set to some other number) and compare BIC to see if one is better than the other (lower BIC is preferable).

bfi_2 = fa(bfi_trim, 2)
bfi_2$BIC
bfi_5$BIC

Compare models for 1-7 factors in one line of code. Run the following.

sapply(1:7, function(nf) fa(bfi_trim, nfactors = nf)$BIC)

Do a PCA using the principal function, also using retaining 5 components. It will assume an independent structure for the factors and focus on extracting variance instead of explaining covariance. Judging by the results do you think it’s more or less interpretable?

Part 1

Data: National Longitudinal Survey of Youth (1997, NLSY97), which investigates the transition from youth to adulthood. For this example, a series of questions asked to the participants in 2006 pertaining to the government’s role in promoting well-being will be investigated. Questions regarded the government’s responsibility for following issues: provide jobs for everyone, keep prices under control, provide health care, provide for elderly, help industry, provide for unemployed, reduce income differences, provide college financial aid, provide decent housing, protect environment. They have four values 1:4, which range from ‘definitely should be’ to ‘definitely should not be’.

Run a factor analysis using the 'data/nlsy97_governmentNumeric.csv' dataset. Your first model will have all items (sans ID) loading on a single factor, with a name of your choosing. With lavaan, use the cfa function. Recall that you give it the model code (as a string) and a data = argument like cfa(mod, data=mydata).

The following provides a hint. Keep an eye out for typos for the variable names, that will be your biggest issue.

govtInvolvement = read.csv('data/nlsy97_governmentNumeric.csv')

library(lavaan)
govtInvolvement_ModelCode = 'LV =~ X1 + X2 + X3'
govtInvolvement_Results = cfa(govtInvolvement_ModelCode, data=govtInvolvement)
summary(govtInvolvement_Results, standardized=T, fit=T)

Note the standardized loadings, are the items ‘hanging together’ well? Do you see any that might be we somewhat weak?

Part 2

Now specify a two factor structure of your choosing. As an example, some of these are maybe more economic in nature, while others might fit in some other category. Whatever makes sense to you, or just randomly split it.
Does this seem any more interpretable? Were the latent variables notably correlated? Which model would you say is better based on internal performance, in terms of comparison (e.g. lower AIC = preferred model), and in terms of interpretability?

Click to show example (after you’ve tried yourself).

library(lavaan)
govtInvolvement = read.csv('data/nlsy97_governmentNumeric.csv')

govtInvolvement_ModelCode1 = "
  moregovt =~ ProvideJobs + Prices + HealthCare + Elderly + Industry + Unemployed + IncInequal + College + Housing + Environment
"
model1 = cfa(govtInvolvement_ModelCode1, data=govtInvolvement) 
summary(model1, fit=T, standardized=T)

govtInvolvement_ModelCode2 = '
  econ =~ ProvideJobs + Industry + Unemployed + IncInequal + Housing + Prices
  services =~ HealthCare + Elderly + College + Environment
'
model2 = cfa(govtInvolvement_ModelCode2, data=govtInvolvement) 
summary(model2, fit=T, standardized=T)

# Compare the two († means the better result)
library(semTools)
summary(compareFit(model1, model2))

# alternative
# anova(model1, model2)

SEM

If time allows do both, otherwise, pick one of the following. You will probably get more out choosing one and following the process steps outlined in the SEM chapter.

Exercise 1

The data for this exercise comes from an article by Marsh and Hocevar in 1985. The data regards information on 385 fourth and fifth grade students who filled out a ‘Self-Description Questionnaire.’ The questionnaire has 16 items, four of which measure self-perception of physical ability, four measure self-perception of physical appearance, four measure quality of relations with peers and the final four measure quality of relations with parents. The data in the file ‘Marsh85.dta’ has summary information with means, standard deviations and correlations. Rather than raw data, we will be using the sample covariance matrix with sample size equal to 385.

We are interested in determining how a student’s physical appearance and physical ability might predict relationships with their peers. The diagram for the model of interest is shown below.

Note below how you can detect the structure visually. Physical appearance hangs together well and is positively correlated with the other factors, which are more strongly correlated with each other. This is extremely important in factor analysis and SEM, as they deal with the correlation matrix, you should often be able to see the structure before modeling even begins with appropriate visual tools.

The following will get you started. All you need to do is fill in the model code. Take your time, and start with each latent variable, and then add the structural/regression component. Recall that you can name the latent variables whatever you want. Refer back to the document examples as needed.

load('data/Marsh85_SEMExercise.RData')
peerRel_ModelCode = ''
peerRel_Results = sem(peerRel_ModelCode, sample.cov=marshCov, sample.nobs=385)
summary(peerRel_Results, standardized=T, fit=T, rsquare=T)

Write a brief summary in terms of an assessment of the measurement components of the model, overall impression of model fit, and specifics of the structural relations (i.e. the paths among the latent variables) and model performance in terms of R².

Exercise 2

In this second example, we will use the classic Political Democracy dataset used by Bollen in his seminal 1989 book on structural equation modeling. This data set includes four measures of democracy at two points in time, 1960 and 1965, and three measures of industrialization in 1960, for 75 developing countries.

FoPress60: freedom of the press, 1960
FoPopp60: freedom of political opposition, 1960
FairElect60: fairness of elections, 1960
EffectiveLeg60: effectiveness of elected legislature, 1960
FoPress65: freedom of the press, 1965
FoPopp65: freedom of political opposition, 1965
FairElect65: fairness of elections, 1965
EffectiveLeg65: effectiveness of elected legislature, 1965
GNP60: GNP per capita, 1960
EnConsump60: energy consumption per capita, 1960
PercLaborForce60: percentage of labor force in industry, 1960

The model we wish to estimate is in according to the following diagram.

Here is some starter code. All you need to do is fill in the model code. Take your time, and start with each latent variable, and then add the structural/regression component. Recall that you can name the latent variables whatever you want. Refer back to the document examples as needed.

poldem = read.csv('data/PoliticalDemocracy.csv')
poldem_ModelCode = ''
poldem_Results = sem(poldem_ModelCode, data=poldem)
summary(poldem_Results, standardized=T, fit=T, rsquare=T)

Write a brief summary in terms of an assessment of the measurement components of the model, overall impression of model fit, and specifics of the structural relations (i.e. the paths among the latent variables) and model performance in terms of R².

SEM Exercises

Michael Clark Consulting for Statistics, Computing and Analytics Research

2018-01-04