Analyzing Social Media in R

The internet is one of the richest sources of data available to anyone interested in data analysis and research. Social media websites in particular are a treasure trove of information. The good news is that some of this information is free to use, and it is very easy to get at with R.

First of all, I must mention that there are numerous packages and tutorials on the web for interacting with social media data through R. The Voson Lab have some tutorials and an R package called SocialMediaLab. Pablo Barberá wrote a prize-winning paper on politics and Twitter, and he has a (somewhat dated) workshop online. Many more resources turn up with even a quick Google search.

In this post, I'm going to explain how to access Twitter through R and how to run some analyses on the data we can scrape from it; Facebook access is touched on briefly at the end. First of all, let's look at Twitter. Twitter requires that you have an account and then create an app on their app page. This is quite straightforward to do: all you need to do is follow the instructions on the site.

Once you have set up your app on Twitter, you will be able to generate a set of user keys and tokens. We will use these in R to access the Twitter API. In the example R code below, replace the blanks with your own keys and tokens.

key <- "____________"
key_secret <- "__________"
token <- "_______________"
token_secret <- "_____________"

We will use the following packages; any that you don't already have can be installed with the install.packages() function (or by clicking the "Install" button in RStudio). Then we set up authorization with Twitter, which will ask whether you want to store a credentials file on your computer for future R sessions, which is useful. To make the plot at the end, you'll need the newest version of ggplot2, which is still in development.

# development version of ggplot2, needed for the bar chart at the end
devtools::install_github("hadley/ggplot2")
library(ggplot2)
library(igraph)
library(ggraph)
library(dplyr)
library(twitteR)
library(ROAuth)
library(quanteda)
library(stringi)
library(RColorBrewer)
library(tidytext)
library(widyr)
set.seed(1234)
# authenticate with Twitter; you will be asked whether to cache the credentials locally
setup_twitter_oauth(key, key_secret, token, token_secret)
## [1] "Using direct authentication"

Now that we have authorization, we are free to scrape some data! Two individuals who are in the news at the moment are the US presidential candidates Donald Trump and Hillary Clinton, who are both active on Twitter. Let's scrape up to 1000 tweets mentioning each of them. The twitteR package provides a getText method for the status objects returned by this search.

# tweets mentioning Trump: extract the text, transliterate to ASCII, build a quanteda corpus
trump <- searchTwitter('@realDonaldTrump', n=1000) %>% 
  lapply(function(x) x$getText()) %>% 
  lapply(function(x) stri_trans_general(x, "Latin-ASCII")) %>% 
  unlist() %>% 
  corpus()

# and the same for tweets mentioning Clinton
clinton <- searchTwitter('@HillaryClinton', n=1000) %>% 
  lapply(function(x) x$getText()) %>% 
  lapply(function(x) stri_trans_general(x, "Latin-ASCII")) %>% 
  unlist() %>% 
  corpus()

These corpus objects are then easy to explore and summarize. One especially useful feature is the kwic function, which gives us the keyword in context. For example, we can see in what contexts Trump and Clinton mention each other in their tweets:

k <- kwic(trump, "clinton", 3)
head(k)
##                                          contextPre keyword
##   [text4, 8]           Journalists shower Hillary [ Clinton
## [text10, 11]                        - Emails Show [ Clinton
## [text22, 20]                      them to Hillary [ Clinton
## [text45, 11]                        - Emails Show [ Clinton
## [text55, 11]                        - Emails Show [ Clinton
## [text58, 17]                              Way' Of [ Clinton
##                                 contextPost
##   [text4, 8] ] with campaign cash          
## [text10, 11] ] Campaign Organized Potential
## [text22, 20] ] @realDonaldTrump here is    
## [text45, 11] ] Campaign Organized Potential
## [text55, 11] ] Campaign Organized Potential
## [text58, 17] ] Email Investigation:
h <- kwic(clinton, "trump", 3)
head(h)
##                                     contextPre keyword
##   [text1, 4]              RT@HillaryClinton: [   Trump
##   [text4, 8]               donate every time [   Trump
##   [text9, 4]              RT@HillaryClinton: [   Trump
## [text23, 19]                    ever WE NEED [   Trump
##  [text33, 9]                 case for Donald [   Trump
##  [text52, 4]              RT@HillaryClinton: [   Trump
##                               contextPost
##   [text1, 4] ] reportedly asked this     
##   [text4, 8] ] tweets something offensive
##   [text9, 4] ] reportedly asked this     
## [text23, 19] ] to clean ho               
##  [text33, 9] ] vote!#Never               
##  [text52, 4] ] reportedly asked this
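
Beyond kwic, the corpus objects can also be summarized directly. Here is a minimal sketch using quanteda's ndoc() and summary(); no output is shown because the numbers depend entirely on whatever tweets the search returns at the time:

# how many tweets did each search return?
ndoc(trump)
ndoc(clinton)

# per-tweet counts of types, tokens and sentences for the first few documents
summary(trump, n = 5)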

Perhaps the most visually impressive way of summarizing this type of data is a wordcloud. It's easy to make; however, we first need to clean up the data and transform it into a document-feature matrix. What we're doing here is removing unnecessary elements from the text, including the "@" Twitter account names and other Twitter-specific tokens such as "rt" and "t.co".

# cleaned, stemmed document-feature matrix, dropping stopwords and Twitter noise
trump <- dfm(trump, toLower = TRUE, removeNumbers = TRUE,
             removePunct = TRUE, removeSeparators = TRUE,
             stem = TRUE, language = "english",
             removeTwitter = TRUE, 
             ignoredFeatures = c("rt", "https", "t.co", "h", "s",
                                 "wqqpjxfb", "75ollud4si",
                                "t", "ht", "realdonaldtrump", 
                                 stopwords()))

# and the same for the Clinton corpus
clinton <- dfm(clinton, toLower = TRUE, removeNumbers = TRUE,
             removePunct = TRUE, removeSeparators = TRUE,
             stem = TRUE, language = "english",
             removeTwitter = TRUE, 
             ignoredFeatures = c("rt", "https", "t.co",
                                 "t", "hillaryclinton", stopwords()))

Now we’re ready to plot some wordclouds!

plot(trump, max.words = 100, 
     colors = brewer.pal(6, "Dark2"), scale = c(4, .5), 
     random.order = FALSE, random.color = TRUE)

[Figure: wordcloud of the most frequent terms in tweets mentioning Trump]

plot(clinton, max.words = 100, 
     colors = brewer.pal(6, "Dark2"), scale = c(4, .5), 
     random.order = FALSE, random.color = TRUE)

[Figure: wordcloud of the most frequent terms in tweets mentioning Clinton]

We can also convert these objects to a tidy format with the tidytext package and plot word co-occurrence networks with the igraph and ggraph packages. You may need to install these packages if you do not already have them.

# install.packages('devtools')
# devtools::install_github('thomasp85/ggforce')
# devtools::install_github('thomasp85/ggraph')
# devtools::install_github('dgrtwo/widyr')
library(igraph)
library(ggraph)
library(tidytext)
library(widyr)

tidy_trump <- tidy(trump)
tidy_clinton <- tidy(clinton)

# pairs of terms that appear together in the same tweet at least 10 times
tidy_trump %>%
  pairwise_count(term, document, sort = TRUE) %>% 
  filter(n >= 10) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_node_point(color = "#87CEFA", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  theme_void()

[Figure: word co-occurrence network for tweets mentioning Trump]

# and the same co-occurrence network for the Clinton tweets
tidy_clinton %>%
  pairwise_count(term, document, sort = TRUE) %>% 
  filter(n >= 10) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_node_point(color = "#87CEFA", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  theme_void()

[Figure: word co-occurrence network for tweets mentioning Clinton]

Another useful thing we can do with these data is sentiment analysis. First, you will need to download a suitable dictionary, or lexicon, and place it somewhere R can find it; the one used here is the Hu and Liu opinion lexicon, which comes as positive-words.txt and negative-words.txt files. Then we can load these into R:

# lines beginning with ';' in the lexicon files are comments, so skip them
nice <- scan('../opinion-lexicon-English/positive-words.txt',
            what='character', comment.char=';')
not_nice <- scan('../opinion-lexicon-English/negative-words.txt',
            what='character', comment.char=';')
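
Alternatively, if you would rather not download any files, the tidytext package we loaded earlier bundles the same Bing Liu lexicon; a rough equivalent sketch, assuming a reasonably recent tidytext version:

# positive and negative word lists from tidytext's built-in "bing" lexicon
bing <- tidytext::get_sentiments("bing")
nice <- bing$word[bing$sentiment == "positive"]
not_nice <- bing$word[bing$sentiment == "negative"]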

Next, we need a quick function to calculate a sentiment score based on how often our candidates' words match the "nice" and "not nice" lexicons:

sentiments <- function(words, nice_text, not_nice_text){
  
  # look up each word in the positive and negative lexicons
  positive = match(words, nice_text)
  negative = match(words, not_nice_text)
  
  # TRUE wherever a word was found in the respective lexicon
  positive = !is.na(positive)
  negative = !is.na(negative)
  
  # score = number of positive matches minus number of negative matches
  score = sum(positive) - sum(negative)
  
  return(score)
}

Using this, we can see a radical difference between Trump and Clinton:

sentiments(tidy_trump$term, nice, not_nice)
## [1] -115
sentiments(tidy_clinton$term, nice, not_nice)
## [1] -63

Let's put all of this into a data frame and plot it.

sentiments_df <- data_frame(
  Score = c(sentiments(tidy_trump$term, nice, not_nice),
            sentiments(tidy_clinton$term, nice, not_nice)),
  Candidate = c("Trump", "Hillary")
)

ggplot(sentiments_df) + 
  geom_col(aes(y = Score, x = Candidate, fill = Score)) +
  theme_bw()

[Figure: bar chart of sentiment scores for the two candidates]

Poor old Donald is certainly quite negative.

This is a short example of how you can access social media data using R. Facebook can be accessed in much the same way, and the quanteda package has a larger vignette on how to use it for the analysis of text.
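
To give a flavour of the Facebook side, here is a minimal, illustrative sketch using Pablo Barberá's Rfacebook package; the app ID, app secret, and page name below are placeholders you would replace with your own (Facebook, like Twitter, requires you to register an app first):

# install.packages("Rfacebook")   # if you don't already have it
library(Rfacebook)

# placeholders: replace with the credentials from your own Facebook app
fb_token <- fbOAuth(app_id = "____________", app_secret = "____________")

# recent posts from a public page (the page name here is just an example)
page <- getPage("humansofnewyork", token = fb_token, n = 100)

# the post text can then go through the same quanteda pipeline as the tweets above
fb_corpus <- corpus(na.omit(page$message))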

 

 
