Or: R is less scary than you thought!
R, the open source package, has become the de facto standard for statistical computing and anything seriously data-related (note I am avoiding the term ‘big data’ here – oops, too late!). From data mining to predictive analytics to data visualisation, it seems like any self-respecting data professional now uses R. Or at least they pretend to. We all know that most people use Excel when nobody’s watching.
But anyway, R is immensely powerful. It is also command-line driven, which makes it quite scary, especially for those of us who don’t get to be hands-on as often as we’d like to. True, used in the wrong way, statistical algorithms can wreak havoc (garbage in – garbage out), but don’t let this intimidate you. I recently gave it a try myself and found myself hooked in a matter of minutes. And if I can do it, so can you!
There are now many free online courses teaching R but some of these represent a significant investment of time. So to get started and get a taste of how R works, I would recommend the following: create a word cloud. If you’ve got 1-2 hours to fiddle around then the steps outlined below should help you create your first output with R. For example, here’s a word cloud of all my tweets over the past 3 years:
Yes, you can do this much more easily online with Wordle, but that is not the point… Besides, R also has a package to read directly from Twitter so you can plug all the power of R into it (but we won’t use that here).
So, here’s an example of how it works. I used R for Windows because the family iMac was already in use… As far as I know, however, the steps for the Mac version should be exactly the same.
Step 1: Install R.
Go to r-project.org and follow the download/installation instructions. Easy.
Step 2: Install RStudio.
Why? Because it makes R much more usable, so it won’t scare the pants off you. RStudio is an open-source user interface that organises everything you need on a single screen. There are handy tabs and windows: command line, workspace, history, files, plots, packages and help. Do yourself a favour and download it from rstudio.com. Easy.
Step 3: Create a text file to turn into a wordle
You can use any text you like. For the sake of this exercise, the most obscure I could find was the transcript of a House of Lords debate on the state of the bee population… Copy & paste the text into a plain text file (e.g. lords.txt) and stick the file into a dedicated directory in your default documents folder (I’ll call mine ‘temp’). Make sure there are no other files in this directory.
Step 4: Open RStudio, install required or missing packages
For this exercise you need the text mining package (‘tm’) and the wordcloud package (‘wordcloud’). Each of those in turn makes use of other packages. Click on the Packages tab (bottom right window in RStudio) and see if they’re listed. If not, go to Tools > Install Packages (top menu bar) and install them from there. Rather than mess around manually with downloaded zip files, simply install the packages straight through the default CRAN mirror option (if you have a firewall, make sure the URL is not blocked). Once installed, tick the required packages in the list under the Packages tab – this will in effect load & activate them in the workspace (it’s the same as using the ‘library’ command in R). As you tick them, you may get some warnings of further missing packages that they rely on – if so, install those packages too.
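If you would rather do this from the command line than through the menus, here is a minimal sketch of the same idea. The helper name missing_packages is made up for this post, and listing SnowballC and RColorBrewer is my assumption about which extra packages tm and wordcloud will ask for; your setup may prompt for others.

```r
# Packages this walkthrough relies on. SnowballC (stemming) and RColorBrewer
# (colour palettes) are assumptions about what tm and wordcloud will want.
needed <- c("tm", "wordcloud", "SnowballC", "RColorBrewer")

# Hypothetical helper: return the names of packages not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(needed)

# Only download when run interactively, so sourcing this file as a script
# never triggers an unexpected install.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}

# Loading a package is the same as ticking it in the Packages tab.
if (length(missing_packages(c("tm", "wordcloud"))) == 0) {
  library(tm)
  library(wordcloud)
}
```

Typing to_install afterwards shows what, if anything, is still missing.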
All done? All packages installed? All packages ticked off in the list? Move on to Step 5.
Step 5: The data process – text mining, clean-up, wordcloud
Now we need to load the text file into RStudio and clean it up so that the word cloud makes sense (for example, you don’t want to highlight common words like ‘the’). For reference see Introduction to the tm (text mining) Package.
First, you need to load the text into a so-called corpus, so the tm package can process it. A corpus is a collection of documents (although in our case we only have one). The following command loads everything (beware!) from the specified directory (remember, I called it ‘temp’) into a corpus called ‘lords’:
lords <- Corpus(DirSource("temp/"))
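Because everything in the directory gets loaded, it can be worth checking what is actually in there first. The following is only a sketch in plain R: check_corpus_dir is a hypothetical helper, and the throwaway demo directory stands in for ‘temp’ so you can try it without touching your own files.

```r
# Hypothetical helper: list what would be picked up from a directory and
# warn if it holds anything other than exactly one file.
check_corpus_dir <- function(dir) {
  if (!dir.exists(dir)) stop("Directory not found: ", dir)
  files <- list.files(dir)
  if (length(files) != 1) {
    warning("Expected one file, found ", length(files))
  }
  files
}

# Demo with a throwaway directory standing in for 'temp'.
demo_dir <- file.path(tempdir(), "corpus_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("My Lords, the bee population is in decline.",
           file.path(demo_dir, "lords.txt"))
check_corpus_dir(demo_dir)

# A relative path such as "temp/" is resolved against the working directory.
getwd()
```

If R complains that it cannot find ‘temp’, compare the output of getwd() with where you saved the folder.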
To see what’s in that corpus, type the command
inspect(lords)
This should print out contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:
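lords <- tm_map(lords, stripWhitespace)
# recent versions of tm need base functions like tolower wrapped in content_transformer
lords <- tm_map(lords, content_transformer(tolower))
lords <- tm_map(lords, removeWords, stopwords("english"))
lords <- tm_map(lords, stemDocument)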
The tm_map function comes with the tm package. The various commands are largely self-explanatory: strip unnecessary white space, convert everything to lower case (otherwise the wordcloud might highlight capitalised words separately), remove common English words like ‘the’ (so-called ‘stopwords’), and carry out text stemming, which reduces words to their root form, for the final tidy-up. Depending on what you want to achieve you could also explicitly remove numbers and punctuation by passing the removeNumbers and removePunctuation functions to tm_map in the same way.
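To get a feel for what this clean-up does to a single sentence, here is a rough imitation in plain R. It is not how tm implements these steps: clean_text is a made-up function, the stopword list is a tiny hand-picked sample rather than tm’s English list, and stemming is left out altogether.

```r
# Tiny hypothetical stopword sample, NOT tm's stopwords("english").
toy_stopwords <- c("the", "of", "is", "a", "and")

# Toy stand-in for the tm clean-up steps, applied to one string.
clean_text <- function(x, stopwords = toy_stopwords) {
  x <- tolower(x)                              # convert to lower case
  x <- gsub("[[:punct:]]+", " ", x)            # drop punctuation
  x <- gsub("[[:digit:]]+", " ", x)            # drop numbers
  words <- strsplit(x, "[[:space:]]+")[[1]]    # split on runs of white space
  words <- words[nzchar(words) & !(words %in% stopwords)]
  paste(words, collapse = " ")                 # rejoin with single spaces
}

clean_text("The  state of the Bee population is a concern, noble Lords!")
```

The sentence comes back as "state bee population concern noble lords", which is roughly the kind of text the word cloud ends up counting.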
You may get error messages whilst executing some of the commands, e.g. about missing packages. If so, install these as outlined above in Step 4, and repeat. Once I also got a message about Java being corrupted (JAVA_HOME not found), so looking this up on Google I found the solution was just to reinstall Java on my machine, reboot, and try again (note you can save your workspace in RStudio, so you never lose any work and always retain the history of what you’ve done). It might all go smoothly the first time, or it might not. Some issues can be specific to your particular hardware, operating system, or software versions. Be prepared for some fiddling – it’s called hacking! And remember, there are loads of R help forums and tutorials online if you get stuck. Just type the relevant R command or error message into Google and you’ll find something relevant.
If all is well then you should now be ready to create your first wordcloud! Try this:
wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
This command does what it says on the tin – try it as is, or fiddle with the settings to change the output. For further explanation of the command arguments see e.g. this page. To highlight a few: scale controls the range of font sizes, from the largest to the smallest; max.words limits the number of words in the cloud (if you omit this, R will try to squeeze in every word that passes the minimum frequency cut-off, min.freq!); rot.per is the proportion of words rotated by 90 degrees; and colors provides a wide choice for symbolising your data, from single colours (e.g. colors="black") to pre-set colour palettes from the RColorBrewer package (e.g. colors=brewer.pal(8, "Dark2")). Here’s the result:
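Font size in the cloud follows word frequency, so it can help to look at the counts that scale and max.words act on. This is a sketch in plain R on a made-up sentence, not a peek at wordcloud’s internals; the final step only runs if the wordcloud package is installed, and it uses the form of the command that takes words and their frequencies directly.

```r
# Made-up sample text standing in for the cleaned-up debate transcript.
text <- "bees pollinate crops and bees make honey so bees matter"

# Split into words and count how often each one occurs.
words <- strsplit(tolower(text), "[[:space:]]+")[[1]]
freq <- sort(table(words), decreasing = TRUE)

# The most frequent words are the ones drawn largest.
top_words <- head(freq, 3)
print(top_words)

# Optional: draw a cloud straight from the counts, if wordcloud is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq),
                       min.freq = 1, max.words = 100, random.order = FALSE)
}
```

Looking at the top of freq is also a quick way to spot words you may not want in the picture at all.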
Now, to go a step further, you may want to manually remove words from the cloud. For example, to get rid of the words “noble” and “lord”, you could use these commands:
lords <- tm_map(lords, removeWords, "noble")
lords <- tm_map(lords, removeWords, "lord")
Or you can make a list of words, c("noble", "lord", etc…), to remove them in one go:
lords <- tm_map(lords, removeWords, c("noble", "lord"))
Just rerun the wordcloud command used above (hint: rather than type it all over again, use the Up arrow to scroll back to previously used commands) and see the result. Done!