• Stars
    star
    1,151
  • Rank 38,901 (Top 0.8 %)
  • Language
    R
  • License
    Other
  • Created about 8 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Text mining using tidy tools โœจ๐Ÿ“„โœจ

tidytext: Text mining using tidy tools

Authors: Julia Silge, David Robinson
License: MIT

R-CMD-check CRAN_Status_Badge Codecov test coverage DOI JOSS Downloads Total Downloads

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr, and ggplot2. In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out our book to learn more about text mining using tidy data principles.

Installation

You can install this package from CRAN:

install.packages("tidytext")

Or you can install the development version from GitHub with remotes:

library(remotes)
install_github("juliasilge/tidytext")

Tidy text mining example: the unnest_tokens function

The novels of Jane Austen can be so tidy! Letโ€™s use the text of Jane Austenโ€™s 6 completed, published novels from the janeaustenr package, and transform them to a tidy format. janeaustenr provides them as a one-row-per-line format:

library(janeaustenr)
library(dplyr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%
  ungroup()

original_books
#> # A tibble: 73,422 ร— 3
#>    text                    book                 line
#>    <chr>                   <fct>               <int>
#>  1 "SENSE AND SENSIBILITY" Sense & Sensibility     1
#>  2 ""                      Sense & Sensibility     2
#>  3 "by Jane Austen"        Sense & Sensibility     3
#>  4 ""                      Sense & Sensibility     4
#>  5 "(1811)"                Sense & Sensibility     5
#>  6 ""                      Sense & Sensibility     6
#>  7 ""                      Sense & Sensibility     7
#>  8 ""                      Sense & Sensibility     8
#>  9 ""                      Sense & Sensibility     9
#> 10 "CHAPTER 1"             Sense & Sensibility    10
#> # โ„น 73,412 more rows

To work with this as a tidy dataset, we need to restructure it as one-token-per-row format. The unnest_tokens() function is a way to convert a dataframe with a text column to be one-token-per-row:

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books
#> # A tibble: 725,055 ร— 3
#>    book                 line word       
#>    <fct>               <int> <chr>      
#>  1 Sense & Sensibility     1 sense      
#>  2 Sense & Sensibility     1 and        
#>  3 Sense & Sensibility     1 sensibility
#>  4 Sense & Sensibility     3 by         
#>  5 Sense & Sensibility     3 jane       
#>  6 Sense & Sensibility     3 austen     
#>  7 Sense & Sensibility     5 1811       
#>  8 Sense & Sensibility    10 chapter    
#>  9 Sense & Sensibility    10 1          
#> 10 Sense & Sensibility    13 the        
#> # โ„น 725,045 more rows

This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.

Now that the data is in a one-word-per-row format, we can manipulate it with tidy tools like dplyr. We can remove stop words (available via the function get_stopwords()) with an anti_join().

tidy_books <- tidy_books %>%
  anti_join(get_stopwords())

We can also use count() to find the most common words in all the books as a whole.

tidy_books %>%
  count(word, sort = TRUE) 
#> # A tibble: 14,375 ร— 2
#>    word      n
#>    <chr> <int>
#>  1 mr     3015
#>  2 mrs    2446
#>  3 must   2071
#>  4 said   2041
#>  5 much   1935
#>  6 miss   1855
#>  7 one    1831
#>  8 well   1523
#>  9 every  1456
#> 10 think  1440
#> # โ„น 14,365 more rows

Sentiment analysis can be implemented as an inner join. Three sentiment lexicons are available via the get_sentiments() function. Letโ€™s examine how sentiment changes across each novel. Letโ€™s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel.

library(tidyr)
get_sentiments("bing")
#> # A tibble: 6,786 ร— 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faces     negative 
#>  2 abnormal    negative 
#>  3 abolish     negative 
#>  4 abominable  negative 
#>  5 abominably  negative 
#>  6 abominate   negative 
#>  7 abomination negative 
#>  8 abort       negative 
#>  9 aborted     negative 
#> 10 aborts      negative 
#> # โ„น 6,776 more rows

janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word", relationship = "many-to-many") %>% 
  count(book, index = line %/% 80, sentiment) %>% 
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% 
  mutate(sentiment = positive - negative)

janeaustensentiment
#> # A tibble: 920 ร— 5
#>    book                index negative positive sentiment
#>    <fct>               <dbl>    <int>    <int>     <int>
#>  1 Sense & Sensibility     0       16       32        16
#>  2 Sense & Sensibility     1       19       53        34
#>  3 Sense & Sensibility     2       12       31        19
#>  4 Sense & Sensibility     3       15       31        16
#>  5 Sense & Sensibility     4       16       34        18
#>  6 Sense & Sensibility     5       16       51        35
#>  7 Sense & Sensibility     6       24       40        16
#>  8 Sense & Sensibility     7       23       51        28
#>  9 Sense & Sensibility     8       30       40        10
#> 10 Sense & Sensibility     9       15       19         4
#> # โ„น 910 more rows

Now we can plot these sentiment scores across the plot trajectory of each novel.

library(ggplot2)

ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(book), ncol = 2, scales = "free_x")

Sentiment scores across the trajectories of Jane Austen's six published novels

For more examples of text mining using tidy data frames, see the tidytext vignette.

Tidying document term matrices

Some existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles from the topicmodels dataset.

library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
#> <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
#> Non-/sparse entries: 302031/23220327
#> Sparsity           : 99%
#> Maximal term length: 18
#> Weighting          : term frequency (tf)

If we want to analyze this with tidy tools, we need to transform it into a one-row-per-term data frame first with a tidy() function. (For more on the tidy verb, see the broom package).

tidy(AssociatedPress)
#> # A tibble: 302,031 ร— 3
#>    document term       count
#>       <int> <chr>      <dbl>
#>  1        1 adding         1
#>  2        1 adult          2
#>  3        1 ago            1
#>  4        1 alcohol        1
#>  5        1 allegedly      1
#>  6        1 allen          1
#>  7        1 apparently     2
#>  8        1 appeared       1
#>  9        1 arrested       1
#> 10        1 assault        1
#> # โ„น 302,021 more rows

We could find the most negative documents:

ap_sentiments <- tidy(AssociatedPress) %>%
  inner_join(get_sentiments("bing"), by = c(term = "word")) %>%
  count(document, sentiment, wt = count) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)

Or we can join the Austen and AP datasets and compare the frequencies of each word:

comparison <- tidy(AssociatedPress) %>%
  count(word = term) %>%
  rename(AP = n) %>%
  inner_join(count(tidy_books, word)) %>%
  rename(Austen = n) %>%
  mutate(AP = AP / sum(AP),
         Austen = Austen / sum(Austen))


comparison
#> # A tibble: 4,730 ร— 3
#>    word             AP     Austen
#>    <chr>         <dbl>      <dbl>
#>  1 abandoned 0.000170  0.00000493
#>  2 abide     0.0000291 0.0000197 
#>  3 abilities 0.0000291 0.000143  
#>  4 ability   0.000238  0.0000148 
#>  5 able      0.000664  0.00151   
#>  6 abroad    0.000194  0.000178  
#>  7 abrupt    0.0000291 0.0000247 
#>  8 absence   0.0000776 0.000547  
#>  9 absent    0.0000436 0.000247  
#> 10 absolute  0.0000533 0.000128  
#> # โ„น 4,720 more rows

library(scales)
ggplot(comparison, aes(AP, Austen)) +
  geom_point(alpha = 0.5) +
  geom_text(aes(label = word), check_overlap = TRUE,
            vjust = 1, hjust = 1) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

Scatterplot for word frequencies in Jane Austen vs. AP news articles. Some words like "cried" are only common in Jane Austen, some words like "national" are only common in AP articles, and some word like "time" are common in both.

For more examples of working with objects from other text mining packages using tidy data principles, see the vignette on converting to and from document term matrices.

Community Guidelines

This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support here.

More Repositories

1

widyr

Widen, process, and re-tidy a dataset
R
324
star
2

supervised-ML-case-studies-course

Supervised machine learning case studies in R! ๐Ÿ’ซ A free interactive tidymodels course
CSS
220
star
3

women-in-film

Code and analysis for the Pudding's text mining project ๐ŸŽฅ๐Ÿ’ƒ๐Ÿป
111
star
4

janeaustenr

An R Package for Jane Austen's Complete Novels ๐Ÿ“™
R
94
star
5

tidylo

Weighted tidy log odds ratio โš–๏ธ
R
91
star
6

modelops-playground

Experiments and use cases for model monitoring and updating ๐Ÿงฎ
R
55
star
7

silgelib

Personal R package
R
47
star
8

learntidytext

Learn about text mining ๐Ÿ“„ with tidy data principles
CSS
46
star
9

tidymodels-tutorial

Introduction to ML with R using tidymodels
R
44
star
10

juliasilge.com

My blog, built with blogdown and Hugo ๐Ÿ”—
HTML
40
star
11

intro_to_shiny

Introduction to Shiny workshop for satRday conference
HTML
25
star
12

old_bloggy_blog

My older Jekyll blog, based on the So Simple theme ๐Ÿ”—
JavaScript
24
star
13

deploytidymodels

Version, share, and deploy tidymodels workflows
R
21
star
14

tidytext-tutorial

Materials for leading tutorials on text mining using tidy data principles
Lua
20
star
15

ibm-ai-day

Presentation for IBM Community Day AI
HTML
14
star
16

caret-ML-course

Supervised machine learning case studies with caret in R! ๐ŸŒŸ A free interactive course
CSS
14
star
17

sdss2019

Presentation for short course at Symposium on Data Science and Statistics in May 2019
HTML
13
star
18

sherlock-holmes

Files for Sherlock Holmes analysis
R
12
star
19

course-ML-tidymodels

Temporary repo to build new version of course
CSS
12
star
20

blog_by_hugo

First Hugo version of my blog, built with blogdown ๐Ÿ”—
HTML
11
star
21

deming2018

Presentation for Deming Conference in December 2018
HTML
11
star
22

plotly-quarto-ghpages

GH pages + Quarto example for Plotly
CSS
9
star
23

neissapp

Shiny app for the NEISS data set
8
star
24

packagesurvey

Survey of R users on package discovery ๐Ÿ“ฆ
R
7
star
25

opioids

Analysis of opioid use in Texas
7
star
26

learning-sql

A repository to use as I learn about relational algebra, databases, and SQL
7
star
27

vetiverdemo

Demos for MLOps with vetiver
R
7
star
28

why-r-webinar

Webinar on word embeddings for Why R?
HTML
7
star
29

vamplyr

๐ŸŽƒ SPOOKY VIBES ๐ŸŽƒ
R
6
star
30

nasanotebooks

Notebooks and other materials for NASA Datanauts projects ๐ŸŒŒโ˜€๏ธ๐Ÿš€
HTML
6
star
31

ml-maintenance-2023

Talk for posit::conf() 2023 on reliable maintenance of machine learning models
CSS
6
star
32

normconf-simulation

Slides for talk at NormConf on efficient simulation for everyday ML decisions
CSS
5
star
33

jsm2018

Presentations for JSM 2018
JavaScript
5
star
34

stacksurveyapp

Shiny app for the Stack Overflow Developer Survey
5
star
35

tada2022

Text as Data (TADA ๐Ÿช„) Conference 2022
CSS
4
star
36

byu-seminar

Seminar on word embeddings at BYU
HTML
4
star
37

ggplot2-tutorial

Just the best plots EVER ๐Ÿ“Š
4
star
38

toy-bookdown

Build bookdown on Linux
TeX
3
star
39

populationapp

Shiny app exploring U.S. population density
2
star
40

fall2016competition

Materials for Utah Geek Events Fall 2016 College Scorecard Competition
Jupyter Notebook
2
star
41

southafricastats

Population and Mortality Statistics for South Africa
R
2
star
42

writing-about-tech

Talk for Code for Philly
JavaScript
2
star
43

choroplethrUTCensusTract

Shapefile, Metadata, and Visualization Functions for US Census Tracts in Utah ๐ŸŒ„
R
2
star
44

mlops-rstudio-meetup

RStudio Enterprise Meetup on MLOps with vetiver ๐Ÿบ in Python and R
Lua
2
star
45

SLCWaterMapping

Code and a Shiny app exploring water use in Salt Lake City
R
1
star
46

so-offsite-2018

Presentation for Stack Overflow offsite in October 2018
HTML
1
star
47

holidaymovies

What movie should we watch? ๐ŸŽ„
R
1
star
48

juliasilge

Profile README
1
star
49

PredictNamesApp

A Shiny App for exploring baby name popularity
R
1
star