Kate Lyons

Having Fun with Tidy Text!

Julia Silge and David Robinson have a wonderful new book called “Text Mining with R”, which has a companion website with great explanations and examples. Here are some additional applications of those examples on a corpus of geotagged Instagram posts from the Mission District neighborhood in San Francisco.

# Make sure you have the right packages!
packs = c("twitteR","RCurl","RJSONIO","stringr","ggplot2","devtools","DataCombine","ggmap",
          "topicmodels","slam","Rmpfr","tm","stringr","wordcloud","plyr",
          "tidytext","dplyr","tidyr","xlsx","scales","ggrepel","lubridate","purrr","broom")
lapply(packs, library, character.only=T)

The data need a bit more processing before we can analyze them. Let's work through it from the start, following Silge and Robinson.

# Get rid of stuff particular to the data (here encodings of links and such)
# Most of these are characters I don't have encodings for (other scripts, etc.)
tweets$text = gsub("Just posted a photo","", tweets$text)
tweets$text = gsub( "<.*?>", "", tweets$text)

# Get rid of super frequent spam posters
tweets <- tweets[! tweets$screenName %in% c("4AMSOUNDS",
      "BruciusTattoo","LionsHeartSF","hermesalchemist","Mrsourmash"),]


# Now for Silge and Robinson's code. This gets rid of URLs, re-tweets (RT) and
# ampersands. It also drops stop words without losing hashtags and @ signs,
# by using str_detect and filter!
reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"
tidy_tweets <- tweets %>% 
  filter(!str_detect(text, "^RT")) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
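
Before moving on, here is a quick check (a small sketch using the tidy_tweets object created above) that hashtags and @ mentions really did survive the tokenization:

# Peek at the hashtag and mention tokens that made it through
tidy_tweets %>%
  filter(str_detect(word, "^[#@]")) %>%
  count(word, sort = TRUE)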

Awesome! Now our posts are cleaned, with the hashtags and @ mentions still intact. Next we can plot the frequency of some of these terms according to WHERE they occur. Silge and Robinson have an example comparing people; let's try it with coordinates.

freq <- tidy_tweets %>% 
  group_by(latitude,longitude) %>% 
  count(word, sort = TRUE) %>% 
  left_join(tidy_tweets %>% 
              group_by(latitude,longitude) %>% 
              summarise(total = n())) %>%
  mutate(freq = n/total)

freq
## Source: local data frame [61,008 x 6]
## Groups: latitude, longitude [1,411]
## 
##    latitude longitude          word     n total       freq
##       <dbl>     <dbl>         <chr> <int> <int>      <dbl>
## 1  37.76000 -122.4200       mission  1844 21372 0.08628112
## 2  37.76000 -122.4200           san  1592 21372 0.07448999
## 3  37.76000 -122.4200      district  1576 21372 0.07374134
## 4  37.76000 -122.4200     francisco  1464 21372 0.06850084
## 5  37.75833 -122.4275          park   293  2745 0.10673953
## 6  37.75833 -122.4275       mission   285  2745 0.10382514
## 7  37.75833 -122.4275       dolores   278  2745 0.10127505
## 8  37.76000 -122.4200 #sanfrancisco   275 21372 0.01286730
## 9  37.76300 -122.4209         alley   245  2273 0.10778707
## 10 37.76300 -122.4209       clarion   242  2273 0.10646722
## # ... with 60,998 more rows

Here n is the number of times a term appears at a particular coordinate, total is the number of terms present at that coordinate, and freq is the ratio of the two.
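
As a quick arithmetic check, freq is just n divided by total; using the first row of the output above:

# "mission" at (37.76, -122.42): 1844 occurrences out of 21372 terms
1844 / 21372   # = 0.08628112, matching the freq column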

Cool! Now we have a representation of terms, their frequencies, and their positions. One way to plot this would be to map the most frequent terms (n > 50). (Some help on how to do this was taken from here and here.)

freq2 <- subset(freq, n > 50) 

map <- get_map(location = 'Valencia St. and 20th, San Francisco,
               California', zoom = 15)

freq2$longitude<-as.numeric(freq2$longitude)
freq2$latitude<-as.numeric(freq2$latitude)
lon <- freq2$longitude
lat <- freq2$latitude

mapPoints <- ggmap(map) + geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_label_repel(data = freq2, aes(x = lon, y = lat, label = word),size = 2) 

Let’s zoom into that main central area to see what’s going on!

map2 <- get_map(location = 'Lexington St. and 19th, San Francisco,
               California', zoom = 16)
mapPoints2 <- ggmap(map2) + geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_label_repel(data = freq2, aes(x = lon, y = lat, label = word),size = 2) 

This can be manipulated in many different ways: either by playing with which frequency band of terms you want to look at (maybe I want to see terms that occur more than 100 times, between 20 and 50 times, fewer than 20 times, and so on; see the sketch below), or by playing around with the map. At the moment, though, this is pretty illuminating in that it shows the most frequent terms clustering around certain ‘hotspots’ in the area, which in itself is just kind of cool to see.
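
For example, you could rebuild the plotting data with a different frequency band and re-run the same ggmap / geom_label_repel code. The cut-offs and object names below (freq_high, freq_mid, freq_low, map_zoomed) are just illustrative:

# Hypothetical frequency bands to map instead of n > 50
freq_high <- subset(freq, n > 100)            # very frequent terms
freq_mid  <- subset(freq, n >= 20 & n <= 50)  # mid-frequency terms
freq_low  <- subset(freq, n < 20)             # rarer terms

# ...or request a tighter (or wider) basemap by changing the zoom level
map_zoomed <- get_map(location = 'Valencia St. and 20th, San Francisco, California', zoom = 17)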

Now let's look at word frequency changes over time: which words were used more or less over the period of data collection? (Help from here.) (The lubridate package also helps with handling time.)

# The created column may need to be parsed into a date-time first;
# adjust the format string to match how your timestamps are stored
tidy_tweets$created2 <- as.POSIXct(tidy_tweets$created, format="%m/%d/%Y %H:%M")

words_by_time <- tidy_tweets %>%
  mutate(time_floor = floor_date(created2, unit = "1 week")) %>%
  count(time_floor, word) %>%
  ungroup() %>%
  group_by(time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 100)

words_by_time
## # A tibble: 1,979 × 5
##    time_floor             word count time_total word_total
##        <dttm>            <chr> <int>      <int>      <int>
## 1  0016-07-31             #art     7       2729        120
## 2  0016-07-31      #california     9       2729        149
## 3  0016-07-31     #dolorespark     6       2729        168
## 4  0016-07-31         #mission     5       2729        222
## 5  0016-07-31 #missiondistrict     1       2729        158
## 6  0016-07-31    #sanfrancisco    38       2729       1034
## 7  0016-07-31              #sf    23       2729        603
## 8  0016-07-31       #streetart    12       2729        229
## 9  0016-07-31             24th    10       2729        109
## 10 0016-07-31            alamo     3       2729        111
## # ... with 1,969 more rows

Alright, now we want to figure out which words have changed the most in frequency over time, so we can isolate interesting ones to plot. This involves a few steps, though.

nested_data <- words_by_time %>%
  nest(-word) 
nested_data
## # A tibble: 71 × 2
##                word              data
##               <chr>            <list>
## 1              #art <tibble [26 × 4]>
## 2       #california <tibble [28 × 4]>
## 3      #dolorespark <tibble [27 × 4]>
## 4          #mission <tibble [28 × 4]>
## 5  #missiondistrict <tibble [29 × 4]>
## 6     #sanfrancisco <tibble [29 × 4]>
## 7               #sf <tibble [29 × 4]>
## 8        #streetart <tibble [28 × 4]>
## 9              24th <tibble [28 × 4]>
## 10            alamo <tibble [27 × 4]>
## # ... with 61 more rows

Process as described by Silge and Robinson: “This data frame has one row for each person-word combination; the data column is a list column that contains data frames, one for each combination of person and word. Let's use map() from the purrr library to apply our modeling procedure to each of those little data frames inside our big data frame. This is count data so let's use glm() with family = "binomial" for modeling. We can think about this modeling procedure answering a question like, ‘Was a given word mentioned in a given time bin? Yes or no? How does the count of word mentions depend on time?’”

nested_models <- nested_data %>%
  mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, ., 
                                  family = "binomial")))

nested_models
## # A tibble: 71 × 3
##                word              data    models
##               <chr>            <list>    <list>
## 1              #art <tibble [26 × 4]> <S3: glm>
## 2       #california <tibble [28 × 4]> <S3: glm>
## 3      #dolorespark <tibble [27 × 4]> <S3: glm>
## 4          #mission <tibble [28 × 4]> <S3: glm>
## 5  #missiondistrict <tibble [29 × 4]> <S3: glm>
## 6     #sanfrancisco <tibble [29 × 4]> <S3: glm>
## 7               #sf <tibble [29 × 4]> <S3: glm>
## 8        #streetart <tibble [28 × 4]> <S3: glm>
## 9              24th <tibble [28 × 4]> <S3: glm>
## 10            alamo <tibble [27 × 4]> <S3: glm>
## # ... with 61 more rows

Silge and Robinson: “Now notice that we have a new column for the modeling results; it is another list column and contains glm objects. The next step is to use map() and tidy() from the broom package to pull out the slopes for each of these models and find the important ones. We are comparing many slopes here and some of them are not statistically significant, so let’s apply an adjustment to the p-values for multiple comparisons.”

slopes <- nested_models %>%
  unnest(map(models, tidy)) %>%
  filter(term == "time_floor") %>%
  mutate(adjusted.p.value = p.adjust(p.value))
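
One note on that adjustment (a small aside; slopes_bh is just an illustrative name): p.adjust() uses the Holm correction by default, and a different method can be requested explicitly, for example the Benjamini-Hochberg false discovery rate:

# Same slopes, but with a BH (FDR) adjustment instead of the default Holm
slopes_bh <- slopes %>%
  mutate(adjusted.p.value = p.adjust(p.value, method = "BH"))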

“Now let’s find the most important slopes. Which words have changed in frequency at a moderately significant level in our tweets?”

top_slopes <- slopes %>% 
  filter(adjusted.p.value < 0.1)

top_slopes
## # A tibble: 16 × 7
##             word       term      estimate    std.error  statistic
##            <chr>      <chr>         <dbl>        <dbl>      <dbl>
## 1   #dolorespark time_floor -1.239208e-07 1.877375e-08  -6.600749
## 2  #sanfrancisco time_floor -3.211771e-08 6.679608e-09  -4.808322
## 3            #sf time_floor -3.483812e-08 8.745735e-09  -3.983441
## 4          alley time_floor -7.976653e-08 1.277341e-08  -6.244732
## 5        clarion time_floor -9.680816e-08 1.449039e-08  -6.680855
## 6       district time_floor  6.701575e-08 5.346673e-09  12.534102
## 7        dolores time_floor -1.132466e-07 7.718189e-09 -14.672698
## 8      francisco time_floor  4.515439e-08 4.642235e-09   9.726865
## 9    manufactory time_floor -7.442008e-08 1.672651e-08  -4.449228
## 10       mission time_floor  3.625027e-08 3.999962e-09   9.062656
## 11          park time_floor -1.134007e-07 7.699185e-09 -14.728916
## 12           san time_floor  4.009249e-08 4.369834e-09   9.174832
## 13            sf time_floor -9.883069e-08 7.359724e-09 -13.428586
## 14        street time_floor -5.396273e-08 1.121572e-08  -4.811349
## 15       tartine time_floor -6.296002e-08 1.209275e-08  -5.206427
## 16      valencia time_floor -8.125194e-08 2.093658e-08  -3.880859
## # ... with 2 more variables: p.value <dbl>, adjusted.p.value <dbl>

Let’s plot them!

words_by_time %>%
  inner_join(top_slopes, by = c("word")) %>%
  ggplot(aes(time_floor, count/time_total, color = word)) +
  geom_line(size = 1.3) +
  labs(x = NULL, y = "Word frequency")

After looking at some of these features of our data set, it's time to explore TOPIC MODELING, or (paraphrasing David Blei) finding structure in more-or-less unstructured documents. To do this we need a document-term matrix. At the moment the tweets are broken up into one word per row, whereas what we need are counts of words within each post, since the post is what we are treating as our ‘document’. The question at the moment is: do we want to keep the hashtags, and can we?

# Let's try by taking our tweets that have been tidied already. First we need to
# count each word within each document (post), and create a column with the total
# number of words in each post.
# This first one is helpful for seeing encodings that need to be removed
tidy_tweets %>%
  count(document, word, sort=TRUE)
## Source: local data frame [102,888 x 3]
## Groups: document [14,958]
## 
##    document                      word     n
##       <int>                     <chr> <int>
## 1      5932        facewithtearsofjoy    17
## 2      8849                   balloon     8
## 3     12204 blackquestionmarkornament     8
## 4      7697                       nov     7
## 5      7697                       wed     7
## 6     12204 whitequestionmarkornament     7
## 7      2110              sliceofpizza     6
## 8      2452                nailpolish     6
## 9      2741     facewithnogoodgesture     6
## 10     4014                 earofrice     6
## # ... with 102,878 more rows
# Counting words so we can make a dtm with our preserved corpus with hashtags and such
tweet_words <- tidy_tweets %>%  
  count(document, word) %>%
  ungroup()

total_words <- tweet_words %>% 
  group_by(document) %>% 
  summarize(total = sum(n))

post_words <- left_join(tweet_words, total_words)

post_words
## # A tibble: 102,888 × 4
##    document           word     n total
##       <int>          <chr> <int> <int>
## 1         1      #nofilter     1     7
## 2         1  #sanfrancisco     1     7
## 3         1      afternoon     1     7
## 4         1        dolores     1     7
## 5         1           park     2     7
## 6         1             sf     1     7
## 7         2 @publicworkssf     1     5
## 8         2    #dustyrhino     1     5
## 9         2          close     1     5
## 10        2           grin     1     5
## # ... with 102,878 more rows
new_dtm <- post_words %>% 
  cast_dtm(document, word, n)
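
To double-check, we can print the object we just created and look at its dimensions (documents by terms):

# Quick check on the document-term matrix
new_dtm
dim(new_dtm)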

This seems to have worked :O :O :O Let’s see how topic modeling works here now…

Visualization in TIDY form also from Silge and Robinson!

# Set parameters for Gibbs sampling (the parameters are those used in
# Grün and Hornik 2011)
burnin <- 4000
iter <- 2000
thin <- 500
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
k <- 12
test_lda2 <-LDA(new_dtm,k, method="Gibbs", 
             control=list(nstart=nstart, seed = seed, best=best, 
                          burnin = burnin, iter = iter, thin=thin))

# Make that TIDY!!! 
test_lda_td2 <- tidy(test_lda2)

lda_top_terms2 <- test_lda_td2 %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

lda_top_terms2 %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()