Summary
It should be clear at this point that text is as amenable to analysis as anything else in statistics. Depending on the goals, the exploration of text can take on one of many forms. In most situations at least some preprocessing will be required, and it can be quite an undertaking to make the text ready for analysis. However, this effort is often rewarded with interesting insights and a better understanding of the data at hand, and it makes possible analyses that simply would not be feasible with human-powered reading alone.
For more natural language processing tools in R, one should consult the corresponding CRAN task view. Be aware, however, that it doesn't take much to strain one's computing resources with R's tools and the standard approach. As an example, the Shakespeare corpus is very small by any standard, and even so certain statistics or topic models will take some time to compute. One should therefore be prepared to spend some time learning ways to make computing more efficient. Luckily, many aspects of the process are easily distributed or parallelized, as the sketch below suggests.
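As a minimal illustration of that last point, the following sketch parallelizes per-document preprocessing with base R's parallel package. The `docs` list and the `clean_doc()` helper are made up for the example rather than drawn from the Shakespeare corpus, but the same pattern applies to any per-document step.

```r
# Minimal sketch: parallelize per-document cleaning/tokenization with base R's
# parallel package. The docs list and clean_doc() helper are hypothetical.
library(parallel)

clean_doc <- function(txt) {
  txt <- tolower(txt)                      # lowercase
  txt <- gsub("[[:punct:]]+", " ", txt)    # strip punctuation
  strsplit(trimws(txt), "\\s+")[[1]]       # split into tokens
}

docs <- list(
  hamlet  = "To be, or not to be, that is the question.",
  macbeth = "Out, out, brief candle! Life's but a walking shadow."
)

cl <- makeCluster(max(1L, detectCores() - 1L))  # socket cluster works on any OS
tokens <- parLapply(cl, docs, clean_doc)        # each document cleaned on a worker
stopCluster(cl)

str(tokens)
```

Because each document is processed independently, the work splits cleanly across cores; the same idea scales to packages such as future.apply or furrr if a more flexible backend is desired.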
Much natural language processing is actually done with deep learning techniques, which generally require a lot of data, considerable computing resources, and copious amounts of fine-tuning, and which are often optimized toward a specific task. Most of the cutting-edge work there is done in Python, and as a starting point for more common text-analytic approaches, you can check out the Natural Language Toolkit.
Dealing with text is not always easy, but it's definitely easier than it ever has been. The number of tools at your disposal is vast, and more are being added all the time. One of the main take-home messages is that text analysis can be a lot of fun, so enjoy the process!
Best of luck with your data! \(\qquad\sim\mathbb{M}\)