Word Embeddings
A key idea in text analysis is representing words as numeric quantities. There are a number of ways to do this, and we've actually already done so: in the sentiment analysis section, words were given a sentiment score, and in topic modeling, words were represented as frequencies across documents. Once we have a numeric representation, we can run statistical models.
Consider topic modeling again. We take the document-term matrix and reduce its dimensionality to just a few topics. Now consider a co-occurrence matrix: if there are \(k\) words, it is a \(k\) x \(k\) matrix whose cells tell us how frequently word \(i\) occurs with word \(j\). Just as in topic modeling, we could apply some matrix factorization technique to reduce the dimensionality of that matrix[^10]. Each word is then represented by a vector of numeric values (one per factor). Indeed, this is how some earlier approaches worked, for example, using principal components analysis on the co-occurrence matrix.
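To make the factorization idea concrete, here is a minimal sketch using a tiny, made-up co-occurrence matrix and singular value decomposition (a close cousin of PCA). The words and counts are invented purely for illustration.

# a toy co-occurrence matrix; the counts are made up for illustration
co_occur = matrix(
  c(0, 4, 1, 0,
    4, 0, 3, 1,
    1, 3, 0, 5,
    0, 1, 5, 0),
  nrow = 4,
  dimnames = list(
    c('king', 'queen', 'man', 'woman'),
    c('king', 'queen', 'man', 'woman')
  )
)

# factorize and keep only 2 dimensions
decomp = svd(co_occur, nu = 2, nv = 0)

# each word is now represented by a 2-dimensional numeric vector
word_vecs = decomp$u
rownames(word_vecs) = rownames(co_occur)
word_vecs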
Newer techniques such as word2vec and GloVe use neural network approaches to construct word vectors, and applied users do not need to know the details to benefit from them. Furthermore, the approach has been extended to create sentence and other vector representations[^11]. In any case, with vector representations of words we can see how similar they are to each other, and perform other tasks based on that information.
A tired example from the literature is as follows: \[\mathrm{king - man + woman = queen}\]
So a woman-king is a queen.
Here is another example:
\[\mathrm{Paris - France + Germany = Berlin}\]
Berlin is the Paris of Germany.
The idea is that with vectors created just based on co-occurrence we can recover things like analogies. Subtracting the man vector from the king vector and adding woman, the most similar word to this would be queen. For more on why this works, take a look here.
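As a rough sketch of what that lookup involves, the following assumes a matrix of word vectors with one row per word (like the ones we create below) and defines a hypothetical helper, nearest_words, that ranks words by cosine similarity to an arbitrary target vector.

# hypothetical helper: rank words by cosine similarity to a target vector
# `wv` is assumed to be a matrix of word vectors, one row per word
nearest_words = function(target, wv, n = 5) {
  sims = (wv %*% t(target)) /
    (sqrt(rowSums(wv^2)) * sqrt(sum(target^2)))
  head(sort(sims[, 1], decreasing = TRUE), n)
}

# analogy = wv['king', , drop = FALSE] - wv['man', , drop = FALSE] + wv['woman', , drop = FALSE]
# nearest_words(analogy, wv)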
Shakespeare example
We start with some already tokenized data from the works of Shakespeare. We’ll treat the words as if they just come from one big Shakespeare document, and only consider the words as tokens, as opposed to using n-grams. We create an iterator object for text2vec functions to use, and with that in hand, create the vocabulary, keeping only those that occur at least 5 times. This example generally follows that of the package vignette, which you’ll definitely want to spend some time with.
library(text2vec)  # provides itoken, create_vocabulary, create_tcm, GlobalVectors, sim2

shakes_words_ls = list(shakes_words$word)
it = itoken(shakes_words_ls, progressbar = FALSE)
shakes_vocab = create_vocabulary(it)
shakes_vocab = prune_vocabulary(shakes_vocab, term_count_min = 5)
Let’s take a look at what we have at this point. We’ve just created word counts; that’s all the vocabulary object is.
Number of docs: 1
0 stopwords: ...
ngram_min = 1; ngram_max = 1
Vocabulary:
term term_count doc_count
1: bounties 5 1
2: rag 5 1
3: merchant's 5 1
4: ungovern'd 5 1
5: cozening 5 1
---
9090: of 17784 1
9091: to 20693 1
9092: i 21097 1
9093: and 26032 1
9094: the 28831 1
The next step is to create the token co-occurrence matrix (TCM). The definition of whether two words occur together is somewhat arbitrary: should we just look at the previous and next word? Five words behind and five ahead? This choice will definitely affect results, so you will want to play around with it.
# maps words to indices
vectorizer = vocab_vectorizer(shakes_vocab)
# use window of 10 for context words
shakes_tcm = create_tcm(it, vectorizer, skip_grams_window = 10)
Note that such a matrix will be extremely sparse. In the grand scheme of things, most words do not co-occur with most other words, so when they do, it usually matters.
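As a quick check on just how sparse it is, you can look at the proportion of non-zero cells (this assumes the Matrix package, which text2vec uses for its sparse matrices).

library(Matrix)

# proportion of non-zero cells in the TCM; typically a very small fraction
nnzero(shakes_tcm) / prod(dim(shakes_tcm))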
Now we are ready to create the word vectors based on the GloVe model. Various options exist, so you’ll want to dive into the associated help files and perhaps the original articles to see how you might play around with it. The following takes roughly a minute or two on my machine. I suggest you start with n_iter = 10 and/or convergence_tol = 0.001 to gauge how long you might have to wait.
In this setting, we can think of our word of interest as the target, and any/all other words (within the window) as the context. Word vectors are learned for both.
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = shakes_vocab, x_max = 10)
shakes_wv_main = glove$fit_transform(shakes_tcm, n_iter = 1000, convergence_tol = 0.00001)
# dim(shakes_wv_main)
shakes_wv_context = glove$components
# dim(shakes_wv_context)
# Either word-vector matrix could work, but the developers of the technique
# suggest the sum/mean may work better
shakes_word_vectors = shakes_wv_main + t(shakes_wv_context)
Now we can start to play. The measure of interest in comparing two vectors will be cosine similarity, which, if you’re not familiar with it, you can think of as roughly analogous to the standard correlation[^12].
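If it helps, here is the cosine similarity computed by hand for two arbitrary little vectors; note that for mean-centered vectors it is identical to the Pearson correlation.

a = c(1, 2, 3, 4)
b = c(2, 1, 4, 3)

# cosine similarity: dot product divided by the product of the vector lengths
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# after mean-centering, cosine similarity is just the correlation
a0 = a - mean(a)
b0 = b - mean(b)
sum(a0 * b0) / (sqrt(sum(a0^2)) * sqrt(sum(b0^2)))
cor(a, b)

With that in mind, let’s see what is similar to Romeo.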
rom = shakes_word_vectors["romeo", , drop = F]
# ham = shakes_word_vectors["hamlet", , drop = F]
cos_sim_rom = sim2(x = shakes_word_vectors, y = rom, method = "cosine", norm = "l2")
# head(sort(cos_sim_rom[,1], decreasing = T), 10)
romeo | juliet | tybalt | benvolio | nurse | iago | friar | mercutio | aaron | roderigo |
---|---|---|---|---|---|---|---|---|---|
1 | 0.78 | 0.72 | 0.65 | 0.64 | 0.63 | 0.61 | 0.6 | 0.6 | 0.59 |
Obviously Romeo is most like Romeo, but after that comes the rest of the crew in the play. As this text is somewhat raw, this likely reflects character names being attached to their lines in the play, so one may want to narrow the window[^13]. Let’s try love.
love = shakes_word_vectors["love", , drop = F]

cos_sim_love = sim2(x = shakes_word_vectors, y = love, method = "cosine", norm = "l2")
# head(sort(cos_sim_love[,1], decreasing = T), 10)
term | similarity |
---|---|
love | 1.00 |
that | 0.80 |
did | 0.72 |
not | 0.72 |
in | 0.72 |
her | 0.72 |
but | 0.71 |
so | 0.71 |
know | 0.71 |
do | 0.70 |
The issue here is that love is used so commonly in Shakespeare that it’s most similar to other very common words. What if we take Romeo, subtract his friend Mercutio, and add Nurse? This is similar to the analogy example we had at the start.
test = shakes_word_vectors["romeo", , drop = F] -
shakes_word_vectors["mercutio", , drop = F] +
shakes_word_vectors["nurse", , drop = F]
cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2")
# head(sort(cos_sim_test[,1], decreasing = T), 10)
term | similarity |
---|---|
nurse | 0.87 |
juliet | 0.72 |
romeo | 0.70 |
It looks like we get Juliet as the most likely word (after the ones we actually used), just as we might have expected. Again, we can think of this as Romeo is to Mercutio as Juliet is to the Nurse. Let’s try another like that.
test = shakes_word_vectors["romeo", , drop = F] -
shakes_word_vectors["juliet", , drop = F] +
shakes_word_vectors["cleopatra", , drop = F]
cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2")
# head(sort(cos_sim_test[,1], decreasing = T), 3)
term | similarity |
---|---|
cleopatra | 0.81 |
romeo | 0.70 |
antony | 0.70 |
One can play with stuff like this all day. For example, you may find that a Romeo without love is a Tybalt!
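For instance, here is a sketch of that last one, using the same sim2 approach as before; the exact rankings will depend on the settings you chose above.

# a Romeo without love
test = shakes_word_vectors["romeo", , drop = F] -
  shakes_word_vectors["love", , drop = F]

cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2")
head(sort(cos_sim_test[, 1], decreasing = T), 3)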
Wikipedia
The following shows the code for analyzing text from Wikipedia, and comes directly from the text2vec vignette. Note that this is a relatively large amount of text (100MB), and so will take notably longer to process.
text8_file = "data/texts_raw/text8"

if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "data/text8.zip")
  unzip("data/text8.zip", files = "text8", exdir = "data/texts_raw/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)
tokens = space_tokenizer(wiki)
it = itoken(tokens, progressbar = FALSE)
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab, term_count_min = 5L)
vectorizer = vocab_vectorizer(vocab)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5L)
glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10)
wv_main = glove$fit_transform(tcm, n_iter = 100, convergence_tol = 0.001)
wv_context = glove$components
word_vectors = wv_main + t(wv_context)
Let’s try our Berlin example.
berlin = word_vectors["paris", , drop = FALSE] -
word_vectors["france", , drop = FALSE] +
word_vectors["germany", , drop = FALSE]
berlin_cos_sim = sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2")
head(sort(berlin_cos_sim[,1], decreasing = TRUE), 5)
paris berlin munich germany at
0.7575511 0.7560328 0.6721202 0.6559778 0.6519383
Success! Now let’s try the queen example.
queen = word_vectors["king", , drop = FALSE] -
word_vectors["man", , drop = FALSE] +
word_vectors["woman", , drop = FALSE]
queen_cos_sim = sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
king son alexander henry queen
0.8831932 0.7575572 0.7042561 0.6769456 0.6755054
Not so much, though queen is still among the top results. Results are of course highly dependent upon the data and settings you choose, so keep the context in mind when trying this out.
Now that words are vectors, we can use them in any model we want, for example, to predict sentiment. Furthermore, extensions have been made to represent sentences and paragraphs as vectors, and even to combine embeddings with topic models, as in lda2vec! In any event, hopefully you have some idea of what word embeddings are and what they can do for you, and have added another tool to your text analysis toolbox.
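As a sketch of that sentiment idea, the following averages each document's word vectors to get fixed-length document features and feeds them to a logistic regression. Here doc_tokens (a list with one character vector of tokens per document) and doc_sentiment (a 0/1 label per document) are hypothetical objects, not anything created above; word_vectors is a matrix of embeddings like the ones from the previous sections.

# average each document's word vectors to get fixed-length document features
# doc_tokens and doc_sentiment are hypothetical; assumes every document has
# at least one in-vocabulary token
doc_features = t(sapply(doc_tokens, function(tokens) {
  tokens = intersect(tokens, rownames(word_vectors))
  colMeans(word_vectors[tokens, , drop = FALSE])
}))

# then use those features in any standard model, e.g. logistic regression
fit = glm(doc_sentiment ~ doc_features, family = binomial)
summary(fit)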
[^10]: You can imagine how it might be difficult to deal with the English language, which might be something on the order of 1 million words.

[^11]: Simply taking the average of the word vector representations within a sentence to represent the sentence as a vector is surprisingly performant.

[^12]: It's also used in the Shakespeare Start to Finish section.

[^13]: With a window of 5, Romeo's top 10 includes others like Troilus and Cressida.