How I used R to create a word cloud, step by step

Or: R is less scary than you thought!

R, the open source package, has become the de facto standard for statistical computing and anything seriously data-related (note I am avoiding the term ‘big data’ here – oops, too late!).  From data mining to predictive analytics to data visualisation, it seems like any self-respecting data professional now uses R. Or at least they pretend to. We all know that most people use Excel when nobody’s watching.

But anyway, R is immensely powerful. It is also command-line driven, which makes it quite scary, especially for those of us who don’t get to be hands-on as often as we’d like to. True, used in the wrong way, statistical algorithms can wreak havoc (garbage in – garbage out), but don’t let this intimidate you. I recently gave it a try myself and found myself hooked in a matter of minutes. And if I can do it, so can you!

There are now many free online courses teaching R, but some of these represent a significant investment of time. So to get started and experience a taster of how R works, I would recommend the following: create a word cloud. If you’ve got 1-2 hours to fiddle around then the steps outlined below should help you create your first output with R. For example, here’s a word cloud of all my tweets over the past 3 years:

[Image: R word cloud 2010-2012 thierry_g]

Yes, you can do this much more easily online with Wordle, but that is not the point… Besides, R also has a package to read directly from Twitter so you can plug all the power of R into it (but we won’t use that here).

So, here’s an example of how it works. I used R for Windows because the family iMac was already in use… As far as I know, however, the steps for the Mac version should be exactly the same.

Step 1: Install R.

Go to r-project.org and follow the download/installation instructions. Easy.

Step 2: Install RStudio.

Why? Because it makes R much more usable, so it won’t scare the pants off you. RStudio is an open-source user interface organising everything you need on one single screen. There are handy tabs and windows: command line, workspace, history, files, plots, packages and help. Do yourself a favour and download it from rstudio.com. Easy.

Step 3: Create a text file to turn into a wordle

You can use any text you like. For the sake of this exercise, the most obscure I could find was the transcript of a House of Lords debate on the state of the bee population… Copy & paste the text into a plain text file (e.g. lords.txt) and stick the file into a dedicated directory in your default documents folder (I’ll call mine ‘temp’). Make sure there are no other files in this directory.

Step 4: Open RStudio, install required or missing packages

For this exercise you need the text mining package (‘tm’) and the wordcloud package (‘wordcloud’). In turn, each of those makes use of other packages too. Click on the Packages tab (bottom right window in RStudio) and see if they’re listed. If not, go to Tools > Install Packages (top menu bar) and install them from there. Rather than mess around manually with downloaded zip files, simply install the packages straight through the default CRAN mirror option (if you have a firewall, make sure the URL is not blocked). Once installed, tick the required packages in the list under the Packages tab – this will in effect load & activate them in the workspace (it’s the same as using the ‘library’ command in R). As you tick them, you may get some warnings of further missing packages that they rely on – if so, install those packages too.
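If you prefer typing to ticking boxes, the same check can be done from the command line. This is only a minimal sketch: missing_packages is a helper name I made up for illustration, and the list of names assumes you want RColorBrewer (for the colour palettes) and SnowballC (used for stemming in recent versions of tm) alongside the two main packages.

```r
# Hypothetical helper: returns the packages from 'pkgs' that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages assumed for this exercise
needed <- c("tm", "wordcloud", "RColorBrewer", "SnowballC")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Missing: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are installed")
}
```

Once everything is installed, library(tm) and library(wordcloud) do the same job as ticking the boxes in the Packages tab.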

All done? All packages installed? All packages ticked off in the list? Move on to Step 5.

Step 5: The data process – text mining, clean-up, wordcloud

Now we need to load the text file into RStudio and clean it up so that the word cloud makes sense (for example, you don’t want to highlight common words like ‘the’). For reference see Introduction to the tm (text mining) Package.

First, you need to load the text into a so-called corpus, so the tm package can process it. A corpus is a collection of documents (although in our case we only have one). The following command loads everything (beware!) from the specified directory (remember, I called it ‘temp’) into a corpus called ‘lords’:

lords <- Corpus(DirSource("temp/"))
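If R complains that the directory is empty or cannot be found, the relative path is probably being resolved against a different working directory than you expect. Here is a small sketch of how to check, using only base R. The folder name wordcloud_demo and the sample sentence are made up for illustration, and the folder is created inside R's temporary directory so nothing real is touched.

```r
# Hypothetical demo folder inside R's temporary directory
corpus_dir <- file.path(tempdir(), "wordcloud_demo")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("The noble Lord asked about the bees.",
           file.path(corpus_dir, "lords.txt"))

# Relative paths such as "temp/" are resolved against this directory
print(getwd())

# Check that the folder exists and see which files would be picked up
folder_ok <- dir.exists(corpus_dir)
found <- list.files(corpus_dir)
print(found)
```

With your own files, replace corpus_dir with the path to your folder, or use setwd() to move to its parent directory first.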

To see what’s in that corpus, type the command

inspect(lords)

This should print out contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:

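lords <- tm_map(lords, stripWhitespace)

lords <- tm_map(lords, content_transformer(tolower))

lords <- tm_map(lords, removeWords, stopwords("english"))

lords <- tm_map(lords, stemDocument)

The tm_map function comes with the tm package. The various commands are fairly self-explanatory: strip unnecessary white space, convert everything to lower case (otherwise the wordcloud might highlight capitalised words separately), remove English common words like ‘the’ (so-called ‘stopwords’), and carry out text stemming for the final tidy-up. In recent versions of tm, plain R functions such as tolower need to be wrapped in content_transformer so that the result stays a corpus, and stemDocument relies on the SnowballC package. Note that stemming reduces words to their stems (e.g. ‘charge’ becomes ‘charg’), so skip that line if you want whole words in the cloud. Depending on what you want to achieve you could also explicitly remove numbers and punctuation by passing removeNumbers and removePunctuation to tm_map in the same way.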
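To make the clean-up less abstract, here is a rough base R sketch of what these steps do to a single sentence. It is not how tm implements them, just an illustration: the sentence and the tiny stop_words list are made up, whereas tm's English stopword list is much longer.

```r
text <- "The  Noble Lord asked   about THE bees and the 2 hives."

# Tiny made-up stopword list, for illustration only
stop_words <- c("the", "and", "about")

cleaned <- tolower(text)                       # lower case
cleaned <- gsub("[[:digit:]]+", "", cleaned)   # remove numbers
cleaned <- gsub("[[:punct:]]+", "", cleaned)   # remove punctuation

# Remove whole stopwords only (\\b marks a word boundary)
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", cleaned, perl = TRUE)

# Collapse the leftover runs of white space
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))
print(cleaned)
```

Stemming is left out here because it needs a proper stemmer rather than a one-line substitution.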

It is possible that you may get error messages whilst executing some of the commands, e.g. missing packages. If so, install these as outlined above in Step 4, and repeat. Once I also got a message about Java being corrupted (JAVA_HOME not found), so looking this up on Google I found the solution was just to reinstall Java on my machine, reboot, and try again (note you can save your workspace in RStudio, so you never lose any work and always retain the history of what you’ve done). It might all go smoothly the first time, or it might not. Some issues can be specific to your particular hardware, operating system, or software versions. Be prepared for some fiddling – it’s called hacking! And remember, there are loads of R help forums and tutorials online if you get stuck. Just type the relevant R command or error message into Google and you’ll find something relevant.
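One error that several readers hit is "unexpected input" after copying commands from a web page: the page had turned the straight quotes into curly ones, which R does not accept. Retyping the quotes by hand fixes it. If you have a lot of pasted code, something like the following base R sketch can straighten the quotes for you (the variable names pasted and fixed are just for illustration).

```r
# A command as it might arrive from a web page, with curly double quotes
pasted <- "lords <- Corpus(DirSource(\u201Ctemp/\u201D))"

# Replace curly double quotes with straight ones
fixed <- gsub("\u201C", "\"", pasted, fixed = TRUE)
fixed <- gsub("\u201D", "\"", fixed, fixed = TRUE)

# Replace curly single quotes with straight ones
fixed <- gsub("\u2018", "'", fixed, fixed = TRUE)
fixed <- gsub("\u2019", "'", fixed, fixed = TRUE)

cat(fixed, "\n")
```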

If all is well then you should now be ready to create your first wordcloud! Try this:

wordcloud(lords, scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

This command does what it says on the tin – try it as is, or fiddle with the settings to change the output. For further explanation of the command arguments see e.g. this page. To highlight a few: scale controls the difference between the largest and smallest font, max.words limits the number of words in the cloud (if you omit this R will try to squeeze every unique word into the diagram!), rot.per is the proportion of words drawn vertically (0.35 means roughly a third), and colors provides a wide choice for symbolising your data, from single colours (e.g. colors="black") to pre-set colour palettes from the RColorBrewer package (e.g. colors=brewer.pal(8, "Dark2")). Here’s the result:

[Image: word cloud of the House of Lords debate]

Congratulations!
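If you are curious what drives the picture: the size of each word reflects how often it appears in the cleaned-up text, and max.words keeps only the most frequent ones. Here is a rough base R sketch of that counting, using a made-up string of words. It only illustrates the idea and is not the wordcloud package's own code.

```r
# Made-up sample text, already cleaned
words <- strsplit("bees hives bees lord bees hives queen", " ")[[1]]

# Count how often each word occurs, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# Keep only the top few words, in the spirit of max.words
max_words <- 2
top_words <- names(freq)[seq_len(max_words)]
top_counts <- as.integer(freq)[seq_len(max_words)]
print(data.frame(word = top_words, freq = top_counts))
```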

Now, to go a step further, you may want to manually remove words from the cloud. For example, to get rid of the words “noble” and “lord”, you could use these commands:

lords <- tm_map(lords, removeWords, "noble")

lords <- tm_map(lords, removeWords, "lord")

Or you can make a list of words, c("noble", "lord", etc…), to remove them in one go:

lords <- tm_map(lords, removeWords, c("noble", "lord"))

Just rerun the wordcloud command used above (hint: rather than type it all over again, use the Up arrow to scroll back to previously used commands) and see the result. Done!

Have fun!

62 thoughts on “How I used R to create a word cloud, step by step”

  1. Hi – I’m new to R and stumbled across this post in trying to find some resources on making word clouds. It seems straight forward enough, but when I follow along I can’t get past the first step in the corpus creation. After I adapt your code “lords <- Corpus (DirSource(“temp/”))", I get an error telling me the directory is empty. What am I doing wrong?
    Thanks,
    Aaron

    • Yes, I had the directory is empty problem, but solved it by typing a slightly more mac friendly path: Corpus (DirSource(“Documents/TEMP/”)), it seems to like being informed that the final directory is in “Documents”. Hope this helps.
      As mentioned in this very helpful guide > inspect(filename) soon tells you whether it has been imported.
      My problems came later, trying to get rid of an unexpected symbol � in the text file. Still working on it….

      • you could use gsub to get rid of all punctuations, by the typing in the following command:

        lords <- gsub("[[:punct:]]"," ",lords)

        Hope this works!

  2. Aaron, you probably need to set your working directory to the correct location. Check it with getwd(), and set it to what you want it to be with setwd(). For more help try R helpfiles or by googling the above commands. Good luck

  3. I didn’t quite get the indicated output. Stemming gave me some errors. And when I created the wordcloud, I got warnings and the cloud itself is sparse and with broken words.
    ****************************************************************************************************
    lords1 wordcloud(lords1,scale=c(5,0.5),max.words=100,random.order=FALSE,rot.per=0.35,use.r.layout=FALSE,colors=brewer.pal(8,”Dark2″))
    There were 50 or more warnings (use warnings() to see the first 50)
    Warning messages:
    1: In wordcloud(lords1, scale = c(5, 0.5), max.words = 100, … :
    equality could not be fit on page. It will not be plotted.

    *********************************************************************************************************************
    When I created the wordcloud without stemming I got a better result but warnings still persist.

    Can you see what the problem might be?
    Thanks,

  6. Sorry for the multiple comments. I can’t seem to get my complete comment get posted. Some of the stuff (from the middle) is missing. Let me try one more time (in 2 parts)

    I didn’t quite get the indicated output. Stemming gave me some errors. And when I created the wordcloud, I got warnings and the cloud itself is sparse and with broken words.
    ****************************************************************************************************
    lords1<-tm_map(lords1,stemDocument)
    Refreshing GOE props…
    —Registering Weka Editors—
    Trying to add database driver (JDBC): RmiJdbc.RJDriver – Warning, not in CLASSPATH?
    Trying to add database driver (JDBC): jdbc.idbDriver – Warning, not in CLASSPATH?
    Trying to add database driver (JDBC): org.gjt.mm.mysql.Driver – Warning, not in CLASSPATH?
    Trying to add database driver (JDBC): com.mckoi.JDBCDriver – Warning, not in CLASSPATH?
    Trying to add database driver (JDBC): org.hsqldb.jdbcDriver – Warning, not in CLASSPATH?
    [KnowledgeFlow] Loading properties and plugins…
    [KnowledgeFlow] Initializing KF…

  7. Part-2:

    > wordcloud(lords1,scale=c(5,0.5),max.words=100,random.order=FALSE,rot.per=0.35,use.r.layout=FALSE,
    colors=brewer.pal(8,”Dark2″))
    There were 50 or more warnings (use warnings() to see the first 50)
    Warning messages:
    1: In wordcloud(lords1, scale = c(5, 0.5), max.words = 100, … :
    equality could not be fit on page. It will not be plotted.

    **************************************************************************************
    When I created the wordcloud without stemming I got a better result but warnings still persist.

    Can you see what the problem might be?
    Thanks,

  8. Thanks for a very clearly laid out example. That was a big help. Happy Weekend! (In case you wake up one day wondering, “Does any of this make any difference?”)

  10. Hi, this is a very well laid out blog, but I have a couple of issues. First of all, I feel like my Wordcloud is much more condensed than the example that you have. Secondly, some of my words end up getting cut off mid-word. For example, “police” becomes “polic” and “charge” becomes “charg”, etc. What could be the problem here? I’ve done every command that you have written verbatim and my colleague is having the same problem. Thank you!

    • Not sure about your first observation. However the second — relating to the words getting cut off mid-word. That is related to stemming. charg is the stemmed version of charge. If you want to preserve words as is, comment out/remove the line which attempts to stem the words in your data set. (i.e. comment out this line
      lords <- tm_map(lords, stemDocument)
      )

  11. Hi, thanks for everyone’s comments, glad it helped in some cases. I’m afraid I can’t help out with individual issues as it’s very difficult to reproduce these errors – they can be highly dependent on your particular version or operating system, and besides I’m not an R expert. My advice would be to just google any errors you get, and you may well find relevant answers on a user forum (that is also how I figured out my own example). Good luck!

  12. hey hi,
    thanks for such a useful tutorial…
    can you please explain how to use tm_reduce so that I can combine all the commands into one command

  13. Hi, I have set things up the way you said: in my current working directory
    I created a folder named temp, and it contains only the lords.txt file. After trying this
    lords <- Corpus(DirSource(“temp/”))
    Error: unexpected input in "lords <- Corpus(DirSource(“"
    I’m getting this error, please help me with this

  14. Thanks for this post, it was extremely easy to follow. I did run into some errors with some of your entries that use quotes ( ” ” ), but changing them to a single quote ( ‘ ‘ ) fixed everything. Just thought I’d pass that along in case anyone else is having the same issues.

  19. lords <- Corpus(DirSource(“temp/”))

    I tried the above and I get

    Error: unexpected input in "lords <- Corpus(DirSource(“"

  20. Hi! Great post, very simple to follow.

    Unfortunately, I seem to be having some trouble generating the Word Cloud. I’m getting this error message:

    Error: inherits(doc, “TextDocument”) is not TRUE

    My corpus is in PT_BR, but I’m using the correct stopwords file and, inspecting the text file, all seems to be OK.

    Any ideas?

  21. Awesome post! Thanks so much.

    This has been so helpful!! So to try and pay it forward I’ll mention the problems I ran into in case someone else is having similar problems.

    I’m using RStudio Version 0.98.1014 and I had a few problems that needed troubleshooting but other than that it worked great!

    Problems I ran into – probably unique to me and my environment but:

    1) If I copied and pasted the code into Rstudio the quotation marks were not of the right type. ” ” needed to be retyped.

    2) Someone else suggested running gsub(“[[:punct:]]”,” “, NAME). I had to run that to get the cloud to work.

    3) Once I successfully ran the code, then as suggested tried to remove a few words, but I couldn’t remove them unless I reloaded the data and removed them before removing stopwords

    4) The tm_map( , stemDocument) step requires a library that isn’t mentioned, “SnowballC”, so I ran install.packages("SnowballC") to fix it.

    Thanks for this webpage, it’s been great.

  22. This is great however there are too many white spaces that could be used to fill in more words. Each word is “reserving” its whole “minimum bounding rectangle” on the canvas. For example, if you take the word ‘bees’ you can see that the top part of the ‘b’ is creating unused white space on its right. If you pay attention to this wordcloud http://www.cs.berkeley.edu/~prmohan/images/wordcloud.png you can see exactly what I mean.
    Is there any parameter to somewhat pack the cloud? If not, where should I post my observations in order to improve the package?
    Thanks

  23. This was great. I would have never tried this without your example and guidance. Now I feel empowered to experiment more with text analysis.

    Thank you.

  27. I’m also getting the error:

    Error: inherits(doc, “TextDocument”) is not TRUE

    I tried using lords <- tm_map(lords, content_transformer(tolower)) but I'm getting a new error:

    Error in UseMethod("content", x) :
    no applicable method for 'content' applied to an object of class "character"

    Could this be due to the various emojis in my texts?

  29. This is a fantastic tutorial!

    I have almost completed, although I am getting a strange error: certain words are not displaying in full. For instance, reduction is showing as reduc, even when zoomed. Any ideas as to why it would be doing this?

  30. Hi, Great blog and I love the post. I’m trying to complete on my Mac and I seem to be getting the following error after posting this command:

    chat <- Corpus (DirSource(Documents/temp))

    Error in dir(directory, full.names = TRUE, pattern = pattern, recursive = recursive, :
    object 'Documents' not found

    Any idea? I have a folder called "temp" within my Documents directory on my Mac

  31. I must thank you for the efforts you’ve put in writing this website.

    I am hoping to check out the same high-grade content by you later on as well.
    In truth, your creative writing abilities has inspired me to get my
    own, personal blog now 😉

  32. # I have applied a trick: just save your working directory with t <- getwd()
    # this works fine
    ############################################
    # only uncomment if you haven't installed the wordcloud package
    #install.packages("wordcloud")
    ##########################################
    #install.packages("RColorBrewer")
    library(wordcloud)
    library(tm)
    library(SnowballC)
    library(RColorBrewer)

    t<-getwd()
    class(t)
    # reading the text
    lords <- Corpus (DirSource(t))
    filepath<-"http://textfiles.com/sex/808-lust.txtt"
    text<-readLines(filepath)
    # load the data as corpus
    docs<- Corpus(VectorSource(text))
    #inspecting purpose
    inspect(docs)
    # process of purging starts

    tospace<- content_transformer(function(x,pattern) gsub(pattern," ",x))

    docs<- tm_map(docs,tospace,"/" )
    docs<- tm_map(docs,tospace,"@")
    docs<- tm_map(docs,tospace,"\\|")

    #cleaning the text

    #converting to lowercase
    docs<- tm_map(docs,content_transformer(tolower))

    #remove numbers

    docs<- tm_map(docs,removeNumbers)
    #############################################
    #remove english stopwords
    #############################################
    docs<- tm_map(docs,removeWords,stopwords("english"))
    #################################################
    #remove your own stopword
    docs<- tm_map(docs,removeWords,c("blabla1","blabla2"))
    docs<- tm_map(docs,removePunctuation)
    #eliminate extra white space
    docs<- tm_map(docs,stripWhitespace)
    #text stemming
    ############################################################
    docs<- tm_map(docs,stemDocument)
    #build term document matrix
    #############################################3
    dtm<- TermDocumentMatrix(docs)
    class(dtm)
    m<- as.matrix(dtm)
    v<- sort(rowSums(m),decreasing = TRUE)
    d<- data.frame(word= names(v),freq= v)
    head(d,14)
    ##################################################
    #generate the word cloud
    ###########################################################
    set.seed(100)
    wordcloud(words = d$word,freq = d$freq,min.freq = 20,random.order = TRUE,rot.per = 0.0,colors = brewer.pal(8,"Dark2"))

    happycoding
    github:
    https://github.com/say2sankalp/R-Cook-Book.git

  33. Great post. Thank you for your time in putting this together. Yes… the solution to most of the problems posted as well as my own were in the copy of text from the website into Rstudio… I could not believe how many versions of ” marks there are. Have a great holiday.

  35. I would like to be able to do wordclouds for word pairs and not just single words. e.g. Mountain Lion or Peter Jackson to be treated as 1 word each rather than splitting into e.g. “mountain” and “lion”. I have my text file with the wordlist on separate lines, but the code always splits the terms.

  39. Hi All,

    I unit test the code below, which seems to be working fine but when it comes to run the wordcloud (See the last line of the code below), I always receive an error saying:

    :

    Error in nchar(names(tab), type = “chars”) :
    invalid multibyte string, element 817

    Not sure how to fix this, can anybody please help me on this?

    Thanks!

    :

    Reviewed <- Corpus (DirSource('MyDirectory'))

    inspect(Reviewed)

    #str(Reviews)

    Reviewed <- tm_map(Reviewed, stripWhitespace)

    Reviewed <- tm_map(Reviewed, tolower)

    Reviewed <- tm_map(Reviewed, removeWords, stopwords("english"))

    Reviewed <- tm_map(Reviewed, stemDocument)

    Reviewed <- tm_map(Reviewed, removePunctuation)

    #Reviewed <- tm_map(Reviewed, wordLengths=c(0,Inf))

    #library(RColorBrewer)

    wordcloud(Reviewed, scale=c(7,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(3, "Dark2"))

  41. Hello,
    I have successfully created the wordcloud, but I try to save to jpg and when I try to open the jpg it says I don’t have permission.

    jpeg(“wordcloud.jpg”)
    wordcloud(cloudFrameB$word,cloudFrameB$freq)
    dev.off()

    If anything could help, would be appreciated.
