List of textual data sources to be used for text mining in R

R Text Data Compilation

The goal of this repository is to act as a collection of textual data sets to be used for training and practice in text mining/NLP in R. This repository is not a guide on how to do text analysis/mining, but rather on how to get a data set so you can get started with minimal hassle.

CRAN packages

janeaustenr

First we have the janeaustenr package popularized by Julia Silge in tidytextmining.

#install.packages("janeaustenr")
library(janeaustenr)

janeaustenr includes six books (emma, mansfieldpark, northangerabbey, persuasion, prideprejudice and sensesensibility), each formatted as a character vector with elements of about 70 characters.

head(emma, n = 15)
#>  [1] "EMMA"                                                               
#>  [2] ""                                                                   
#>  [3] "By Jane Austen"                                                     
#>  [4] ""                                                                   
#>  [5] ""                                                                   
#>  [6] ""                                                                   
#>  [7] ""                                                                   
#>  [8] "VOLUME I"                                                           
#>  [9] ""                                                                   
#> [10] ""                                                                   
#> [11] ""                                                                   
#> [12] "CHAPTER I"                                                          
#> [13] ""                                                                   
#> [14] ""                                                                   
#> [15] "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"
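Because each book is a plain character vector of lines, chapter structure has to be recovered from the text itself. The following is a minimal base R sketch of one way to do that, assuming chapter headings look like "CHAPTER I" as in the output above. The vector emma_head is a hand-copied stand-in for the first lines of emma, so the sketch runs without the package installed.

```r
# A few lines shaped like the start of `emma` (copied from the output above).
emma_head <- c("EMMA", "", "By Jane Austen", "", "", "", "", "VOLUME I",
               "", "", "", "CHAPTER I", "", "",
               "Emma Woodhouse, handsome, clever, and rich, with a comfortable home")

# Assumption: a chapter heading is "CHAPTER" followed by a roman numeral.
is_heading <- grepl("^CHAPTER [IVXLC]+$", emma_head)

# The running count of headings assigns each line to a chapter
# (0 means front matter that comes before the first chapter).
lines_df <- data.frame(
  line = seq_along(emma_head),
  chapter = cumsum(is_heading),
  text = emma_head,
  stringsAsFactors = FALSE
)

# Keep only non-empty lines that belong to a chapter.
chapter_text <- lines_df[lines_df$chapter > 0 & nzchar(lines_df$text), ]
print(chapter_text)
```

The same idea should carry over to the full vector by replacing emma_head with emma, although the heading pattern may need adjusting for other books.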

All the books can also be found combined into one data.frame, returned by the function austen_books().

dplyr::glimpse(austen_books())
#> Observations: 73,422
#> Variables: 2
#> $ text <chr> "SENSE AND SENSIBILITY", "", "by Jane Austen", "", "(1811...
#> $ book <fct> Sense & Sensibility, Sense & Sensibility, Sense & Sensibi...

Examples:

quRan

The quRan package contains the complete text of the Qur’an in Arabic (with and without vowels) and in English (the Yusuf Ali and Saheeh International translations).

#install.packages("quRan")
library(quRan)
dplyr::glimpse(quran_ar)
#> Observations: 6,236
#> Variables: 18
#> $ surah_id             <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2…
#> $ ayah_id              <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
#> $ surah_title_ar       <fct> الفاتحة, الفاتحة, الفاتحة, الفاتحة, الفاتحة…
#> $ surah_title_en       <fct> Al-Faatiha, Al-Faatiha, Al-Faatiha, Al-Faat…
#> $ surah_title_en_trans <fct> The Opening, The Opening, The Opening, The …
#> $ revelation_type      <chr> "Meccan", "Meccan", "Meccan", "Meccan", "Me…
#> $ text                 <chr> "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ", …
#> $ surah                <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2…
#> $ ayah                 <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8…
#> $ ayah_title           <chr> "1:1", "1:2", "1:3", "1:4", "1:5", "1:6", "…
#> $ juz                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ manzil               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ page                 <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3…
#> $ hizb_quarter         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ sajda                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ sajda_id             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ sajda_recommended    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ sajda_obligatory     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Examples:

Twitter thread

scriptuRs

The scriptuRs package contains the full text of the Standard Works of The Church of Jesus Christ of Latter-day Saints: the Old and New Testaments, the Book of Mormon, the Doctrine and Covenants, and the Pearl of Great Price. Each volume is in a data frame with a row for each verse and 19 columns, most of which hold detailed metadata.

#install.packages("scriptuRs")
library(scriptuRs)
dplyr::glimpse(scriptuRs::book_of_mormon)
#> Observations: 6,604
#> Variables: 19
#> $ volume_id          <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
#> $ book_id            <dbl> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 6…
#> $ chapter_id         <dbl> 1190, 1190, 1190, 1190, 1190, 1190, 1190, 119…
#> $ verse_id           <dbl> 31103, 31104, 31105, 31106, 31107, 31108, 311…
#> $ volume_title       <chr> "Book of Mormon", "Book of Mormon", "Book of …
#> $ book_title         <chr> "1 Nephi", "1 Nephi", "1 Nephi", "1 Nephi", "…
#> $ volume_long_title  <chr> "The Book of Mormon", "The Book of Mormon", "…
#> $ book_long_title    <chr> "The First Book of Nephi", "The First Book of…
#> $ volume_subtitle    <chr> "Another Testament of Jesus Christ", "Another…
#> $ book_subtitle      <chr> "His Reign and Ministry", "His Reign and Mini…
#> $ volume_short_title <chr> "BoM", "BoM", "BoM", "BoM", "BoM", "BoM", "Bo…
#> $ book_short_title   <chr> "1 Ne.", "1 Ne.", "1 Ne.", "1 Ne.", "1 Ne.", …
#> $ volume_lds_url     <chr> "bm", "bm", "bm", "bm", "bm", "bm", "bm", "bm…
#> $ book_lds_url       <chr> "1-ne", "1-ne", "1-ne", "1-ne", "1-ne", "1-ne…
#> $ chapter_number     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ verse_number       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
#> $ text               <chr> "I, Nephi, having been born of goodly parents…
#> $ verse_title        <chr> "1 Nephi 1:1", "1 Nephi 1:2", "1 Nephi 1:3", …
#> $ verse_short_title  <chr> "1 Ne. 1:1", "1 Ne. 1:2", "1 Ne. 1:3", "1 Ne.…

Examples:

friends

The goal of friends is to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository, which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of JSON files.

#install.packages("friends")
library(friends)

The data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.

dplyr::glimpse(friends)
#> Rows: 67,373
#> Columns: 6
#> $ text      <chr> "There's nothing to tell! He's just some guy I work with!", …
#> $ speaker   <chr> "Monica Geller", "Joey Tribbiani", "Chandler Bing", "Phoebe …
#> $ season    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ episode   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ scene     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ utterance <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…

Additional data sets with further metadata are also included.

dplyr::glimpse(friends_emotions)
#> Rows: 12,606
#> Columns: 5
#> $ season    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ episode   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ scene     <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, …
#> $ utterance <int> 1, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,…
#> $ emotion   <chr> "Mad", "Neutral", "Joyful", "Neutral", "Neutral", "Neutral",…

dplyr::glimpse(friends_entities)
#> Rows: 10,557
#> Columns: 5
#> $ season    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ episode   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ scene     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, …
#> $ utterance <int> 2, 3, 4, 8, 31, 33, 34, 38, 42, 44, 45, 47, 58, 4, 7, 9, 11,…
#> $ entities  <list> "Paul the Wine Guy", <"Joey Tribbiani", "Paul the Wine Guy"…

dplyr::glimpse(friends_info)
#> Rows: 236
#> Columns: 8
#> $ season            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ episode           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ title             <chr> "The Pilot", "The One with the Sonogram at the End",…
#> $ directed_by       <chr> "James Burrows", "James Burrows", "James Burrows", "…
#> $ written_by        <chr> "David Crane & Marta Kauffman", "David Crane & Marta…
#> $ air_date          <date> 1994-09-22, 1994-09-29, 1994-10-06, 1994-10-13, 199…
#> $ us_views_millions <dbl> 21.5, 20.2, 19.5, 19.7, 18.6, 18.2, 23.5, 21.1, 23.1…
#> $ imdb_rating       <dbl> 8.3, 8.1, 8.2, 8.1, 8.5, 8.1, 9.0, 8.1, 8.2, 8.1, 8.…
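The tables above share the season and episode columns, so episode-level metadata can be attached to individual lines with a join. Here is a small base R sketch of that idea. The two data frames are hand-built stand-ins with the same column names as friends and friends_info; the values come from the output above, except for the second text value, which is a placeholder.

```r
# Two rows shaped like `friends` (the second text value is a placeholder).
utterances <- data.frame(
  text = c("There's nothing to tell! He's just some guy I work with!",
           "<placeholder text>"),
  speaker = c("Monica Geller", "Joey Tribbiani"),
  season = c(1L, 1L),
  episode = c(1L, 1L),
  scene = c(1L, 1L),
  utterance = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Two rows shaped like `friends_info`, with values taken from the output above.
episodes <- data.frame(
  season = c(1L, 1L),
  episode = c(1L, 2L),
  title = c("The Pilot", "The One with the Sonogram at the End"),
  imdb_rating = c(8.3, 8.1),
  stringsAsFactors = FALSE
)

# Left join: keep every utterance and attach the matching episode metadata.
with_info <- merge(utterances, episodes,
                   by = c("season", "episode"), all.x = TRUE)
print(with_info[, c("speaker", "title", "imdb_rating")])
```

With the real data, a call such as dplyr::left_join(friends, friends_info, by = c("season", "episode")) should do the same job while preserving the original row order.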

hcandersenr

The hcandersenr package includes many of H.C. Andersen’s fairy tales in five different languages.

#install.packages("hcandersenr")
library(hcandersenr)

The fairy tales are found in the data frames hcandersen_en, hcandersen_da, hcandersen_de, hcandersen_es and hcandersen_fr for the English, Danish, German, Spanish and French versions respectively. Please be advised that not all fairy tales are available in every language in this package.

dplyr::glimpse(hcandersen_en)
#> Rows: 31,380
#> Columns: 2
#> $ text <chr> "A soldier came marching along the high road: \"Left, right - lef…
#> $ book <chr> "The tinder-box", "The tinder-box", "The tinder-box", "The tinder…

All the fairy tales are collected in the following data.frame:

dplyr::glimpse(hca_fairytales())
#> Rows: 126,102
#> Columns: 3
#> $ text     <chr> "Der kom en soldat marcherende hen ad landevejen: én, to! én,…
#> $ book     <chr> "The tinder-box", "The tinder-box", "The tinder-box", "The ti…
#> $ language <chr> "Danish", "Danish", "Danish", "Danish", "Danish", "Danish", "…

Examples:

Still pending.

proustr

The proustr package gives you access to tools designed to do Natural Language Processing in French.

#install.packages("proustr")
library(proustr)

Furthermore, it includes the following seven books:

  • Du côté de chez Swann (1913): ducotedechezswann.
  • À l’ombre des jeunes filles en fleurs (1919): alombredesjeunesfillesenfleurs.
  • Le Côté de Guermantes (1921): lecotedeguermantes.
  • Sodome et Gomorrhe (1922): sodomeetgomorrhe.
  • La Prisonnière (1923): laprisonniere.
  • Albertine disparue (1925, also known as La Fugitive): albertinedisparue.
  • Le Temps retrouvé (1927): letempretrouve.

All of them are returned by the proust_books() function.

dplyr::glimpse(proust_books())
#> Rows: 4,690
#> Columns: 4
#> $ text   <chr> "Longtemps, je me suis couché de bonne heure. Parfois, à peine …
#> $ book   <chr> "Du côté de chez Swann", "Du côté de chez Swann", "Du côté de c…
#> $ volume <chr> "Première partie : Combray", "Première partie : Combray", "Prem…
#> $ year   <dbl> 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 191…

schrute

The schrute package contains the complete script transcription for The Office (US) television show.

#install.packages("schrute")
library(schrute)

The data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.

dplyr::glimpse(theoffice)
#> Rows: 55,130
#> Columns: 12
#> $ index            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
#> $ season           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ episode          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ episode_name     <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",…
#> $ director         <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapis…
#> $ writer           <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Ricky…
#> $ character        <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Micha…
#> $ text             <chr> "All right Jim. Your quarterlies look very good. How …
#> $ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How …
#> $ imdb_rating      <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6…
#> $ total_votes      <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,…
#> $ air_date         <fct> 2005-03-24, 2005-03-24, 2005-03-24, 2005-03-24, 2005-…

Examples:

textdata

The goal of textdata is to provide easy access to text-related data sets without bundling them inside a package. Some text data sets are too large to store within an R package or are licensed in a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the data sets on disk and load them when needed.

#install.packages("textdata")
library(textdata)

All the functions used in this package will prompt you to download the files. Once they are downloaded and cached they are easily loaded.

dplyr::glimpse(textdata::dataset_imdb())
#> Rows: 25,000
#> Columns: 2
#> $ sentiment <chr> "neg", "neg", "neg", "neg", "neg", "neg", "neg", "neg", "neg…
#> $ text      <chr> "Story of a man who has unnatural feelings for a pig. Starts…

Available data sets:

with(catalogue, split(name, type))
#> $dataset
#> [1] "v1.0 sentence polarity"          "AG News"                        
#> [3] "DBpedia"                         "TREC-6 & TREC-50"               
#> [5] "IMDb Large Movie Review Dataset"
#> 
#> $embeddings
#> [1] "GloVe 6B"                "GloVe Twitter 27B"      
#> [3] "GloVe Common Crawl 42B"  "GloVe Common Crawl 840B"
#> 
#> $lexicon
#> [1] "AFINN-111"                                                   
#> [2] "Loughran-McDonald Sentiment lexicon"                         
#> [3] "Bing Sentiment Lexicon"                                      
#> [4] "NRC Word-Emotion Association Lexicon"                        
#> [5] "NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon)"
#> [6] "The NRC Valence, Arousal, and Dominance Lexicon"

gutenbergr

The gutenbergr package allows for search and download of public domain texts from Project Gutenberg, which currently includes more than 57,000 free eBooks.

#install.packages("gutenbergr")
library(gutenbergr)

To use gutenbergr you must know the Gutenberg id of the work you wish to analyze. A text search of the works can be done using the gutenberg_works() function.

gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title  author  gutenberg_autho… language gutenberg_booksh… rights
#>          <int> <chr>  <chr>              <int> <chr>    <chr>             <chr> 
#> 1          768 Wuthe… Brontë…              405 en       Gothic Fiction/M… Publi…
#> # … with 1 more variable: has_text <lgl>

With that id you can use the gutenberg_download() function to download the text.

gutenberg_download(768)
#> Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
#> Using mirror http://aleph.gutenberg.org
#> # A tibble: 12,314 × 2
#>    gutenberg_id text               
#>           <int> <chr>              
#>  1          768 "Wuthering Heights"
#>  2          768 ""                 
#>  3          768 "by Emily Brontë"  
#>  4          768 ""                 
#>  5          768 ""                 
#>  6          768 ""                 
#>  7          768 ""                 
#>  8          768 "CHAPTER I"        
#>  9          768 ""                 
#> 10          768 ""                 
#> # … with 12,304 more rows

Examples:

Still pending.

text2vec

While the text2vec package isn’t a data package by itself, it does include a textual data set inside.

#install.packages("text2vec")
library(text2vec)

The data frame movie_review contains 5,000 IMDB movie reviews selected for sentiment analysis. The sentiment column is derived from the rating: an IMDB rating < 5 gives a sentiment score of 0, and a rating >= 7 gives a sentiment score of 1.

dplyr::glimpse(movie_review)
#> Rows: 5,000
#> Columns: 3
#> $ id        <chr> "5814_8", "2381_9", "7759_3", "3630_4", "9495_8", "8196_8", …
#> $ sentiment <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
#> $ review    <chr> "With all this stuff going down at the moment with MJ i've s…
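The rating-to-sentiment rule can be checked on the rows shown above. The sketch below assumes that the number after the underscore in id is the IMDB rating; that is an interpretation of the id format, not something the package output states. The ids and sentiment values are copied from the first six rows of the output.

```r
# First six ids and sentiment scores, copied from the output above.
ids <- c("5814_8", "2381_9", "7759_3", "3630_4", "9495_8", "8196_8")
sentiment <- c(1L, 1L, 0L, 0L, 1L, 1L)

# Assumption: the suffix after the underscore is the IMDB rating.
rating <- as.integer(sub("^.*_", "", ids))

# Apply the rule from the text: rating >= 7 -> 1, rating < 5 -> 0.
# Ratings of 5 or 6 are not covered by the rule and are left as NA.
derived <- ifelse(rating >= 7, 1L, ifelse(rating < 5, 0L, NA_integer_))

print(data.frame(id = ids, rating = rating,
                 sentiment = sentiment, derived = derived))
```

For these six rows the derived values agree with the sentiment column, which is consistent with the assumption about the id format.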

epubr

The epubr package allows for extraction of metadata and textual content of epub files.

#install.packages("epubr")
library(epubr)

Further information and examples can be found here.

GitHub packages

appa

The appa package contains the complete script transcription for Avatar: The Last Airbender.

#devtools::install_github("averyrobbins1/appa")
library(appa)

The data set includes the full transcription, line by line with metadata about book number, chapter number, character, and more.

dplyr::glimpse(appa)
#> Rows: 13,385
#> Columns: 12
#> $ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ book              <fct> Water, Water, Water, Water, Water, Water, Water, Wat…
#> $ book_num          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ chapter           <fct> "The Boy in the Iceberg", "The Boy in the Iceberg", …
#> $ chapter_num       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ character         <chr> "Katara", "Scene Description", "Sokka", "Scene Descr…
#> $ full_text         <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
#> $ character_words   <chr> "Water. Earth. Fire. Air. My grandmother used to tel…
#> $ scene_description <list> <>, <>, "[Close-up of the boy as he grins confident…
#> $ writer            <chr> "Michael Dante DiMartino, Bryan Konietzko, Aaron Eha…
#> $ director          <chr> "Dave Filoni", "Dave Filoni", "Dave Filoni", "Dave F…
#> $ imdb_rating       <dbl> 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.…

sacred

The sacred package includes 9 tidy data sets: apocrypha, book_of_mormon, doctrine_and_covenants, greek_new_testament, king_james_version, pearl_of_great_price, tanach, vulgate and septuagint, with columns describing the position within each work.

#devtools::install_github("JohnCoene/sacred")
library(sacred)
dplyr::glimpse(apocrypha)
#> Rows: 5,725
#> Columns: 5
#> $ book.num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ book     <chr> "es1", "es1", "es1", "es1", "es1", "es1", "es1", "es1", "es1"…
#> $ psalm    <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11", "11", "…
#> $ verse    <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"…
#> $ text     <chr> "And Josias held the feast of the passover in Jerusalem unto …

Examples:

Still pending.

harrypotter

The harrypotter package includes the text from all 7 main series books.

#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

The seven books (philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince and deathly_hallows) are formatted as character vectors with one string per chapter.

dplyr::glimpse(harrypotter::chamber_of_secrets)
#>  chr [1:19] "THE WORST BIRTHDAY　　Not for the first time, an argument had broken out over breakfast at number four, Privet "| __truncated__ ...

Examples:

hgwellsr

The hgwellsr package provides access to the full texts of six novels by H. G. Wells.

#devtools::install_github("erikhoward/hgwellsr")
library(hgwellsr)

  • Ann Veronica (1909): annveronica.
  • The History of Mr Polly (1910): mrpolly.
  • The Invisible Man (1897): invisibleman.
  • The Island of Doctor Moreau (1896): doctormoreau.
  • The Time Machine (1895): timemachine.
  • The War of the Worlds (1898): waroftheworlds.

head(annveronica, 10)
#>  [1] "CHAPTER THE FIRST"                                                     
#>  [2] ""                                                                      
#>  [3] "ANN VERONICA TALKS TO HER FATHER"                                      
#>  [4] ""                                                                      
#>  [5] ""                                                                      
#>  [6] "Part 1"                                                                
#>  [7] ""                                                                      
#>  [8] ""                                                                      
#>  [9] "One Wednesday afternoon in late September, Ann Veronica Stanley came"  
#> [10] "down from London in a state of solemn excitement and quite resolved to"

jeeves

The jeeves package provides access to the full texts of 38 works by P.G. Wodehouse.

#devtools::install_github("aniruhil/jeeves")
library(jeeves)
dplyr::glimpse(adamselindistress)
#> Rows: 10,291
#> Columns: 3
#> $ gutenberg_id <int> 2233, 2233, 2233, 2233, 2233, 2233, 2233, 2233, 2233, 223…
#> $ text         <chr> "[Transcriber's Note for edition 11: in para. 4 of Chapte…
#> $ title        <chr> "A Damsel in Distress", "A Damsel in Distress", "A Damsel…

koanr

The koanr package includes text from several of the more important Zen koan texts.

#devtools::install_github("malcolmbarrett/koanr")
library(koanr)

The texts in this package include The Gateless Gate (gateless_gate), The Blue Cliff Record (blue_cliff_record), The Record of the Transmission of the Light (record_of_light), and The Book of Equanimity (book_of_equanimity).

dplyr::glimpse(gateless_gate)
#> Rows: 192
#> Columns: 4
#> $ collection <chr> "The Gateless Gate", "The Gateless Gate", "The Gateless Gat…
#> $ case       <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5,…
#> $ type       <chr> "title", "main_case", "commentary", "capping_verse", "title…
#> $ text       <chr> "Joshu's Dog", "A monk asked Joshu, \"Has the dog the Buddh…

sherlock

The sherlock package includes text from the Sherlock Holmes books.

#devtools::install_github("EmilHvitfeldt/sherlock")
library(sherlock)

The goal of sherlock is to provide access to the full texts of Sherlock Holmes stories that are in the public domain. Text and further information regarding copyright laws can be found here.

dplyr::glimpse(holmes)
#> Rows: 65,958
#> Columns: 2
#> $ text <chr> "A STUDY IN SCARLET", "", "Table of contents", "", "Part I", "Mr.…
#> $ book <chr> "A Study In Scarlet", "A Study In Scarlet", "A Study In Scarlet",…

rperseus

The goal of rperseus is to furnish classicists, textual critics, and R enthusiasts with texts from the Classical World. While English translations of most texts are available through gutenbergr, rperseus returns these works in their original languages: Greek, Latin, and Hebrew.

#devtools::install_github("ropensci/rperseus")
library(rperseus)
library(dplyr) # provides %>%, filter() and pull() used below
aeneid_latin <- perseus_catalog %>% 
  filter(group_name == "Virgil",
         label == "Aeneid",
         language == "lat") %>% 
  pull(urn) %>% 
  get_perseus_text()
head(aeneid_latin)
#> # A tibble: 6 × 7
#>   text           urn        group_name label description        language section
#>   <chr>          <chr>      <chr>      <chr> <chr>              <chr>      <int>
#> 1 Arma virumque… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            1
#> 2 Conticuere om… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            2
#> 3 Postquam res … urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            3
#> 4 At regina gra… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            4
#> 5 Interea mediu… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            5
#> 6 Sic fatur lac… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            6

See the vignette for more examples.

tidygutenbergr

The tidygutenbergr package contains many functions that fetch data from Project Gutenberg using the gutenbergr package and do some light cleaning.

#devtools::install_github("emilHvitfeldt/tidygutenbergr")
library(tidygutenbergr)

tidygutenbergr contains a couple dozen datasets that can all be found here.

Many books will have metadata on the text, such as book number and chapter name/number.

dplyr::glimpse(a_tale_of_two_cities())
#> Rows: 15,830
#> Columns: 4
#> $ text         <chr> "Book the First--Recalled to Life", "", "", "", "", "CHAP…
#> $ book         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ chapter      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ chapter_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

subtools

The subtools package doesn’t include any textual data, but allows you to read subtitle files.

#devtools::install_github("fkeck/subtools")
library(subtools)

How to use the package is shown in the examples.

Examples:

Tidytuesday

The tidytuesday project is an amazing collection of data sets that are well suited for beginners to hone their skills. Below is a list of the data sets that contain enough text data to analyse. The list includes some data sets that already appear on this page; they are kept here for completeness.

Examples will not be shown here since that is taken care of in the respective pages.

Date Topic
2019-01-01 #rstats and #TidyTuesday Tweets from rtweet
2019-03-12 Board Games Database
2019-04-23 Anime Dataset
2019-05-28 Wine ratings
2019-06-25 UFO Sightings around the world
2019-09-10 Amusement Park injuries
2019-10-22 Horror movie metadata
2019-12-17 Adoptable dogs
2019-12-24 Christmas Music Billboards
2020-03-17 The Office - Words and Numbers
2020-04-21 GDPR Fines
2020-04-28 Broadway Weekly Grosses
2020-05-05 Animal Crossing - New Horizons
2020-05-26 Cocktails
2020-06-09 African American Achievements
2020-06-16 American Slavery and Juneteenth
2020-08-11 Avatar: The Last Airbender
2020-09-08 Friends
2020-09-29 Beyoncé and Taylor Swift Lyrics
2020-12-08 Women of 2020
2021-01-12 Art Collections
2021-03-02 Superbowl commercials
2021-03-23 UN Votes
2021-04-20 Netflix Shows
2021-04-27 CEO Departures
2021-06-15 Du Bois and Juneteenth Revisited

Wild data

This section covers public data sets and how to import them into R, ready for analysis. It is generally advisable to save the resulting data so that you don’t download it again unnecessarily.
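One way to follow that advice is to wrap each download in a small caching helper. The sketch below uses only base R. The names cached(), build_demo() and the cache file name are hypothetical and chosen for illustration; build_demo() stands in for any of the download-and-parse chunks in this section.

```r
# Return the cached object if it exists, otherwise build it and save it.
cached <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- build()
  saveRDS(result, path)
  result
}

# Hypothetical cache location; a project folder would persist across sessions.
cache_file <- file.path(tempdir(), "wild_data_demo.rds")
unlink(cache_file) # start from a clean state for this demonstration

# Stand-in for an expensive download; counts how often it actually runs.
calls <- 0
build_demo <- function() {
  calls <<- calls + 1
  data.frame(text = c("a", "b"), polarity = c(TRUE, FALSE))
}

first <- cached(cache_file, build_demo)  # builds the data and writes the file
second <- cached(cache_file, build_demo) # reads the file, no rebuild
print(calls)
```

Note that tempdir() is cleared when the R session ends, so a path inside the project directory is the more useful choice for real work.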

Movie Review Data

This website includes a handful of different movie review data sets. Below are the code chunks necessary to load them.

polarity dataset v2.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names, 
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos"),
                        cv_tag = str_extract(.x, "(?<=cv)\\d{3}"),
                        html_tag = str_extract(.x, "(?<=cv\\d{3}_)\\d*")))

glimpse(data)
#> Rows: 64,720
#> Columns: 4
#> $ text     <chr> "plot : two teen couples go to a church party , drink and the…
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
#> $ cv_tag   <chr> "000", "000", "000", "000", "000", "000", "000", "000", "000"…
#> $ html_tag <chr> "29416", "29416", "29416", "29416", "29416", "29416", "29416"…
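The columns in this chunk are all derived from the file paths inside the archive, using lookbehind patterns. The following base R sketch applies the same patterns to two example paths. The paths are assumptions for illustration (the first is chosen to match the cv_tag and html_tag values in the output above, the second is hypothetical), and extract_first() is a helper defined here, not part of any package.

```r
# Example archive paths; the directory layout is an assumption.
file_names <- c("txt_sentoken/neg/cv000_29416.txt",
                "txt_sentoken/pos/cv001_18431.txt")

# Extract the first match of a Perl-style pattern from each string.
# Strings without a match are dropped, so every path here must match.
extract_first <- function(x, pattern) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

parsed <- data.frame(
  file = basename(file_names),
  polarity = grepl("pos", file_names),                         # TRUE for positive reviews
  cv_tag = extract_first(file_names, "(?<=cv)\\d{3}"),         # three digits after "cv"
  html_tag = extract_first(file_names, "(?<=cv\\d{3}_)\\d*"),  # digits after "cvNNN_"
  stringsAsFactors = FALSE
)
print(parsed)
```

Because polarity is detected by searching for "pos" anywhere in the path, the approach relies on no other part of the path containing that string.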

sentence polarity dataset v1.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names, 
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos")))

glimpse(data)
#> Rows: 10,662
#> Columns: 2
#> $ text     <chr> "simplistic , silly and tedious . ", "it's so laddish and juv…
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…

scale dataset v1.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

subjs <- str_subset(file_names, "subj")
ids <- str_subset(file_names, "id")
ratings <- str_subset(file_names, "rating")
names <- str_extract(ratings, "(?<=rating.).*") %>%
  str_replace("\\+", " ")

data <- map_df(seq_len(length(names)), 
               ~ tibble(text = read_lines(subjs[.x]),
                        id = read_lines(ids[.x]),
                        rating = read_lines(ratings[.x]),
                        name = names[.x]))

glimpse(data)
#> Rows: 5,006
#> Columns: 4
#> $ text   <chr> "in my opinion , a movie reviewer's most important task is to o…
#> $ id     <chr> "29420", "17219", "18406", "18648", "20021", "20454", "20473", …
#> $ rating <chr> "0.1", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", …
#> $ name   <chr> "Dennis Schwartz", "Dennis Schwartz", "Dennis Schwartz", "Denni…

subjectivity dataset v1.0

library(tidyverse)
library(fs)

filepath <- file_temp() %>%
  path_ext_set("tar.gz")

download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz", filepath)

file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]

untar(filepath, files = file_names)

data <- map_df(file_names, 
               ~ tibble(text = read_lines(.x),
                        label = if_else(str_detect(.x, "quote"), 
                                        "subjective", 
                                        "objective")))

glimpse(data)
#> Rows: 10,000
#> Columns: 2
#> $ text  <chr> "smart and alert , thirteen conversations about one thing is a s…
#> $ label <chr> "subjective", "subjective", "subjective", "subjective", "subject…

SouthParkData

The GitHub repository BobAdamsEE/SouthParkData includes the scripts of the first 19 seasons of South Park. The following code snippet lets you download them all at once.

library(tidyverse)

url_base <- "https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season"
urls <- paste0(url_base, "/Season-", 1:19, ".csv")

# Read each season's CSV and row-bind the results into one tibble.
data <- map_df(urls, ~ read_csv(.x))

glimpse(data)
#> Rows: 73,139
#> Columns: 4
#> $ Season    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Episode   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ Character <chr> "Boys", "Kyle", "Ike", "Kyle", "Cartman", "Kyle", "Stan", "K…
#> $ Line      <chr> "School day, school day, teacher's golden ru...\n", "Ah, dam…
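Two details of this chunk are worth spelling out: paste0() is vectorised, so a single call produces all 19 URLs, and the Line values shown above end in a newline that is usually unwanted for text mining. The base R sketch below illustrates both without downloading anything. The small data frame is a hand-built stand-in; its first row is copied from the output above and its second row is a placeholder.

```r
url_base <- "https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season"

# One call builds the URL for every season.
urls <- paste0(url_base, "/Season-", 1:19, ".csv")
print(head(urls, 2))

# Stand-in for the downloaded data (second Line is a placeholder).
script_lines <- data.frame(
  Character = c("Boys", "Kyle"),
  Line = c("School day, school day, teacher's golden ru...\n",
           "<placeholder line>\n"),
  stringsAsFactors = FALSE
)

# Strip leading and trailing whitespace, including the trailing newline.
script_lines$Line <- trimws(script_lines$Line)
print(script_lines)
```

With the downloaded tibble, the equivalent step would be something like dplyr::mutate(data, Line = stringr::str_trim(Line)).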

Examples:

Saudi Newspapers Corpus

The GitHub repository inparallel/SaudiNewsNet includes data and text from 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers.

library(rio)
library(glue)
library(fs)
library(purrr)

dates <- c("2015-07-21", "2015-07-22", "2015-07-23", "2015-07-24", "2015-07-25",
           "2015-07-26", "2015-07-27", "2015-07-31", "2015-08-01", "2015-08-02",
           "2015-08-03", "2015-08-04", "2015-08-06", "2015-08-07", "2015-08-08",
           "2015-08-09", "2015-08-10", "2015-08-11")

tmp_path <- path_temp()

urls <- glue("https://raw.githubusercontent.com/inparallel/SaudiNewsNet/master/dataset/{dates}.zip")
paths <- path(tmp_path, dates, ext = "zip")

data <- map2_dfr(urls, paths, ~ {
  download.file(.x, .y)
  import_list(.y)[[1]]
})

dplyr::glimpse(data)
#> Rows: 31,030
#> Columns: 6
#> $ source         <chr> "aawsat", "aawsat", "aawsat", "aawsat", "aawsat", "aaws…
#> $ url            <chr> "http://aawsat.com/home/article/410826/بريطانيا-أربعة-م…
#> $ date_extracted <chr> "2015-07-21 02:51:32", "2015-07-21 02:51:33", "2015-07-…
#> $ title          <chr> "بريطانيا: أربعة محاور لاستراتيجية جديدة تتصدى للتطرف ع…
#> $ author         <chr> "لندن: رنيم حنوش", "لندن: «الشرق الأوسط أونلاين»", "لند…
#> $ content        <chr> "حدد رئيس الوزراء البريطاني ديفيد كاميرون، اليوم (الاثن…
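Note that `date_extracted` arrives as a character column, so a common first step is to parse it before grouping by day. The following is a minimal sketch on a toy tibble, `toy_articles` (hypothetical, with only the `source` and `date_extracted` columns; `"other_source"` is a made-up value). The time zone is not stated in the data, so UTC is assumed here.

```r
library(dplyr)

# Toy stand-in for the downloaded articles; "other_source" is illustrative only.
toy_articles <- tibble(
  source         = c("aawsat", "aawsat", "other_source"),
  date_extracted = c("2015-07-21 02:51:32", "2015-07-21 02:51:33", "2015-07-22 10:00:00")
)

articles_per_day <- toy_articles %>%
  mutate(
    # Assumption: timestamps are treated as UTC; adjust tz if the source says otherwise.
    date_extracted = as.POSIXct(date_extracted, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    day            = as.Date(date_extracted)
  ) %>%
  count(source, day, name = "n_articles")  # articles per source per day

articles_per_day
```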
