R Text Data Compilation
The goal of this repository is to act as a collection of textual data sets to be used for training and practice in text mining/NLP in R. This repository is not a guide on how to do text analysis/mining, but rather a place to find a data set so you can get started with minimal hassle.
CRAN packages
janeaustenr
First we have the janeaustenr package, popularized by Julia Silge in tidytextmining.
#install.packages("janeaustenr")
library(janeaustenr)
janeaustenr includes 6 books: emma, mansfieldpark, northangerabbey, persuasion, prideprejudice and sensesensibility, all formatted as character vectors with elements of about 70 characters.
head(emma, n = 15)
#> [1] "EMMA"
#> [2] ""
#> [3] "By Jane Austen"
#> [4] ""
#> [5] ""
#> [6] ""
#> [7] ""
#> [8] "VOLUME I"
#> [9] ""
#> [10] ""
#> [11] ""
#> [12] "CHAPTER I"
#> [13] ""
#> [14] ""
#> [15] "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"
All the books can also be found combined into one data.frame via the austen_books() function.
dplyr::glimpse(austen_books())
#> Observations: 73,422
#> Variables: 2
#> $ text <chr> "SENSE AND SENSIBILITY", "", "by Jane Austen", "", "(1811...
#> $ book <fct> Sense & Sensibility, Sense & Sensibility, Sense & Sensibi...
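As a minimal sketch of the kind of analysis this format enables (assuming the tidytext and dplyr packages are installed), the novels can be tokenized and the most frequent words counted:
library(dplyr)
library(tidytext)
# One word per row, stop words removed, most frequent words per book
austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word, sort = TRUE)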
quRan
The quRan package contains the complete text of the Qur'an in Arabic (with and without vowels) and in English (the Yusuf Ali and Saheeh International translations).
#install.packages("quRan")
library(quRan)
dplyr::glimpse(quran_ar)
#> Observations: 6,236
#> Variables: 18
#> $ surah_id <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2β¦
#> $ ayah_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, β¦
#> $ surah_title_ar       <fct> الفاتحة, الفاتحة, الفاتحة, الفاتحة…
#> $ surah_title_en <fct> Al-Faatiha, Al-Faatiha, Al-Faatiha, Al-Faatβ¦
#> $ surah_title_en_trans <fct> The Opening, The Opening, The Opening, The β¦
#> $ revelation_type <chr> "Meccan", "Meccan", "Meccan", "Meccan", "Meβ¦
#> $ text                 <chr> "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ", …
#> $ surah <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2β¦
#> $ ayah <int> 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8β¦
#> $ ayah_title <chr> "1:1", "1:2", "1:3", "1:4", "1:5", "1:6", "β¦
#> $ juz <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ manzil <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ page <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3β¦
#> $ hizb_quarter <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ sajda <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, Fβ¦
#> $ sajda_id <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,β¦
#> $ sajda_recommended <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,β¦
#> $ sajda_obligatory <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,β¦
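A minimal dplyr sketch, using the metadata columns shown above to count the number of ayahs in each surah:
library(dplyr)
quran_ar %>%
  count(surah_title_en, sort = TRUE)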
scriptuRs
The scriptuRs package contains the full text of the Standard Works of The Church of Jesus Christ of Latter-day Saints: the Old and New Testaments, the Book of Mormon, the Doctrine and Covenants, and the Pearl of Great Price. Each volume is in a data frame with a row for each verse, along with 19 columns of detailed metadata.
#install.packages("scriptuRs")
library(scriptuRs)
dplyr::glimpse(scriptuRs::book_of_mormon)
#> Observations: 6,604
#> Variables: 19
#> $ volume_id <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, β¦
#> $ book_id <dbl> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 6β¦
#> $ chapter_id <dbl> 1190, 1190, 1190, 1190, 1190, 1190, 1190, 119β¦
#> $ verse_id <dbl> 31103, 31104, 31105, 31106, 31107, 31108, 311β¦
#> $ volume_title <chr> "Book of Mormon", "Book of Mormon", "Book of β¦
#> $ book_title <chr> "1 Nephi", "1 Nephi", "1 Nephi", "1 Nephi", "β¦
#> $ volume_long_title <chr> "The Book of Mormon", "The Book of Mormon", "β¦
#> $ book_long_title <chr> "The First Book of Nephi", "The First Book ofβ¦
#> $ volume_subtitle <chr> "Another Testament of Jesus Christ", "Anotherβ¦
#> $ book_subtitle <chr> "His Reign and Ministry", "His Reign and Miniβ¦
#> $ volume_short_title <chr> "BoM", "BoM", "BoM", "BoM", "BoM", "BoM", "Boβ¦
#> $ book_short_title <chr> "1 Ne.", "1 Ne.", "1 Ne.", "1 Ne.", "1 Ne.", β¦
#> $ volume_lds_url <chr> "bm", "bm", "bm", "bm", "bm", "bm", "bm", "bmβ¦
#> $ book_lds_url <chr> "1-ne", "1-ne", "1-ne", "1-ne", "1-ne", "1-neβ¦
#> $ chapter_number <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ verse_number <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14β¦
#> $ text <chr> "I, Nephi, having been born of goodly parentsβ¦
#> $ verse_title <chr> "1 Nephi 1:1", "1 Nephi 1:2", "1 Nephi 1:3", β¦
#> $ verse_short_title <chr> "1 Ne. 1:1", "1 Ne. 1:2", "1 Ne. 1:3", "1 Ne.β¦
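A minimal dplyr sketch, counting verses per book using the metadata columns:
library(dplyr)
book_of_mormon %>%
  count(book_title, sort = TRUE)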
friends
The goal of friends is to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository, which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of JSON files.
#install.packages("friends")
library(friends)
The data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.
dplyr::glimpse(friends)
#> Rows: 67,373
#> Columns: 6
#> $ text <chr> "There's nothing to tell! He's just some guy I work with!", β¦
#> $ speaker <chr> "Monica Geller", "Joey Tribbiani", "Chandler Bing", "Phoebe β¦
#> $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ scene <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ utterance <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1β¦
Additional data sets with more metadata are also included.
dplyr::glimpse(friends_emotions)
#> Rows: 12,606
#> Columns: 5
#> $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ scene <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, β¦
#> $ utterance <int> 1, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,β¦
#> $ emotion <chr> "Mad", "Neutral", "Joyful", "Neutral", "Neutral", "Neutral",β¦
dplyr::glimpse(friends_entities)
#> Rows: 10,557
#> Columns: 5
#> $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ scene <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, β¦
#> $ utterance <int> 2, 3, 4, 8, 31, 33, 34, 38, 42, 44, 45, 47, 58, 4, 7, 9, 11,β¦
#> $ entities <list> "Paul the Wine Guy", <"Joey Tribbiani", "Paul the Wine Guy"β¦
dplyr::glimpse(friends_info)
#> Rows: 236
#> Columns: 8
#> $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ episode <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1β¦
#> $ title <chr> "The Pilot", "The One with the Sonogram at the End",β¦
#> $ directed_by <chr> "James Burrows", "James Burrows", "James Burrows", "β¦
#> $ written_by <chr> "David Crane & Marta Kauffman", "David Crane & Martaβ¦
#> $ air_date <date> 1994-09-22, 1994-09-29, 1994-10-06, 1994-10-13, 199β¦
#> $ us_views_millions <dbl> 21.5, 20.2, 19.5, 19.7, 18.6, 18.2, 23.5, 21.1, 23.1β¦
#> $ imdb_rating <dbl> 8.3, 8.1, 8.2, 8.1, 8.5, 8.1, 9.0, 8.1, 8.2, 8.1, 8.β¦
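Since the tables share the season/episode/scene/utterance columns, they can be joined; a minimal dplyr sketch, tallying labeled emotions by speaker:
library(dplyr)
# Match each labeled utterance to its line and speaker
friends %>%
  inner_join(friends_emotions,
             by = c("season", "episode", "scene", "utterance")) %>%
  count(speaker, emotion, sort = TRUE)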
hcandersenr
The hcandersenr package includes many of H.C. Andersen's fairy tales in 5 different languages.
#install.packages("hcandersenr")
library(hcandersenr)
The fairy tales are found in the following data frames: hcandersen_en, hcandersen_da, hcandersen_de, hcandersen_es and hcandersen_fr for the English, Danish, German, Spanish and French versions respectively. Please be advised that not all fairy tales are available in all languages in this package.
dplyr::glimpse(hcandersen_en)
#> Rows: 31,380
#> Columns: 2
#> $ text <chr> "A soldier came marching along the high road: \"Left, right - lefβ¦
#> $ book <chr> "The tinder-box", "The tinder-box", "The tinder-box", "The tinderβ¦
All the fairy tales are collected in the following data.frame:
dplyr::glimpse(hca_fairytales())
#> Rows: 126,102
#> Columns: 3
#> $ text     <chr> "Der kom en soldat marcherende hen ad landevejen: én, to! én,…
#> $ book <chr> "The tinder-box", "The tinder-box", "The tinder-box", "The tiβ¦
#> $ language <chr> "Danish", "Danish", "Danish", "Danish", "Danish", "Danish", "β¦
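A minimal dplyr sketch, counting how many fairy tales are available in each language:
library(dplyr)
hca_fairytales() %>%
  distinct(book, language) %>%
  count(language)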
Examples:
Still pending.
proustr
The proustr package gives you access to tools designed for doing Natural Language Processing in French.
#install.packages("proustr")
library(proustr)
Furthermore it includes the following 7 books:
- Du côté de chez Swann (1913): ducotedechezswann
- À l'ombre des jeunes filles en fleurs (1919): alombredesjeunesfillesenfleurs
- Le Côté de Guermantes (1921): lecotedeguermantes
- Sodome et Gomorrhe (1922): sodomeetgomorrhe
- La Prisonnière (1923): laprisonniere
- Albertine disparue (1925, also known as La Fugitive): albertinedisparue
- Le Temps retrouvé (1927): letempretrouve
These are all found in the proust_books() function.
dplyr::glimpse(proust_books())
#> Rows: 4,690
#> Columns: 4
#> $ text   <chr> "Longtemps, je me suis couché de bonne heure. Parfois, à peine …
#> $ book   <chr> "Du côté de chez Swann", "Du côté de chez Swann", "Du côté de c…
#> $ volume <chr> "Première partie : Combray", "Première partie : Combray", "Prem…
#> $ year <dbl> 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 1913, 191β¦
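A minimal dplyr sketch, listing each book with its publication year:
library(dplyr)
proust_books() %>%
  distinct(book, year)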
schrute
The schrute package contains the complete script transcription for The Office (US) television show.
#install.packages("schrute")
library(schrute)
The data set includes the full transcription, line by line with metadata about episode number, season number, character, and more.
dplyr::glimpse(theoffice)
#> Rows: 55,130
#> Columns: 12
#> $ index <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16β¦
#> $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,β¦
#> $ episode <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,β¦
#> $ episode_name <chr> "Pilot", "Pilot", "Pilot", "Pilot", "Pilot", "Pilot",β¦
#> $ director <chr> "Ken Kwapis", "Ken Kwapis", "Ken Kwapis", "Ken Kwapisβ¦
#> $ writer <chr> "Ricky Gervais;Stephen Merchant;Greg Daniels", "Rickyβ¦
#> $ character <chr> "Michael", "Jim", "Michael", "Jim", "Michael", "Michaβ¦
#> $ text <chr> "All right Jim. Your quarterlies look very good. How β¦
#> $ text_w_direction <chr> "All right Jim. Your quarterlies look very good. How β¦
#> $ imdb_rating <dbl> 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6, 7.6β¦
#> $ total_votes <int> 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706, 3706,β¦
#> $ air_date <fct> 2005-03-24, 2005-03-24, 2005-03-24, 2005-03-24, 2005-β¦
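A minimal dplyr sketch, counting the number of lines per character across the series:
library(dplyr)
theoffice %>%
  count(character, sort = TRUE)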
Examples:
- Tidy Tuesday screencast: analyzing ratings and scripts from The Office
- Lasso regression with tidymodels and The Office
- tidytuesday: Part-of-Speech and textrecipes with The Office
textdata
The goal of textdata is to provide easy access to text-related data sets without bundling them inside a package. Some text data sets are too large to store within an R package or are licensed in such a way that prevents them from being included in an OSS-licensed package. Instead, this package provides a framework to download, parse, and store the data sets on disk and load them when needed.
#install.packages("textdata")
library(textdata)
All the functions in this package will prompt you to download the files. Once they are downloaded and cached, they are easy to load.
dplyr::glimpse(textdata::dataset_imdb())
#> Rows: 25,000
#> Columns: 2
#> $ sentiment <chr> "neg", "neg", "neg", "neg", "neg", "neg", "neg", "neg", "negβ¦
#> $ text <chr> "Story of a man who has unnatural feelings for a pig. Startsβ¦
Available data sets:
with(catalogue, split(name, type))
#> $dataset
#> [1] "v1.0 sentence polarity" "AG News"
#> [3] "DBpedia" "TREC-6 & TREC-50"
#> [5] "IMDb Large Movie Review Dataset"
#>
#> $embeddings
#> [1] "GloVe 6B" "GloVe Twitter 27B"
#> [3] "GloVe Common Crawl 42B" "GloVe Common Crawl 840B"
#>
#> $lexicon
#> [1] "AFINN-111"
#> [2] "Loughran-McDonald Sentiment lexicon"
#> [3] "Bing Sentiment Lexicon"
#> [4] "NRC Word-Emotion Association Lexicon"
#> [5] "NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon)"
#> [6] "The NRC Valence, Arousal, and Dominance Lexicon"
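The lexicons and embeddings are loaded the same way; a minimal sketch using the AFINN lexicon (you will be prompted to download it on first use):
library(textdata)
# Downloads (after prompting) and caches the lexicon, then loads it
afinn <- lexicon_afinn()
head(afinn)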
gutenbergr
The gutenbergr package allows for searching and downloading public domain texts from Project Gutenberg, which currently includes more than 57,000 free eBooks.
#install.packages("gutenbergr")
library(gutenbergr)
To use gutenbergr you must know the Gutenberg id of the work you wish to analyze. A text search of the works can be done using the gutenberg_works() function.
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title  author  gutenberg_autho… language gutenberg_booksh… rights
#>          <int> <chr>  <chr>              <int> <chr>    <chr>             <chr>
#> 1          768 Wuthe… Brontë…              405 en       Gothic Fiction/M… Publi…
#> # … with 1 more variable: has_text <lgl>
With that id you can use the gutenberg_download() function to download the text.
gutenberg_download(768)
#> Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
#> Using mirror http://aleph.gutenberg.org
#> # A tibble: 12,314 × 2
#>    gutenberg_id text
#>           <int> <chr>
#>  1          768 "Wuthering Heights"
#>  2          768 ""
#>  3          768 "by Emily Brontë"
#>  4          768 ""
#>  5          768 ""
#>  6          768 ""
#>  7          768 ""
#>  8          768 "CHAPTER I"
#>  9          768 ""
#> 10          768 ""
#> # … with 12,304 more rows
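gutenberg_download() can also take several ids at once, and the meta_fields argument keeps metadata such as the title as a column; a minimal sketch:
library(gutenbergr)
# 768 = Wuthering Heights, 1260 = Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")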
Examples:
Still pending.
text2vec
While the text2vec package isn't a data package itself, it does include a textual data set.
#install.packages("text2vec")
library(text2vec)
The data frame movie_review contains 5,000 IMDb movie reviews selected for sentiment analysis. The reviews have been preprocessed to include a binary sentiment label: an IMDb rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1.
dplyr::glimpse(movie_review)
#> Rows: 5,000
#> Columns: 3
#> $ id <chr> "5814_8", "2381_9", "7759_3", "3630_4", "9495_8", "8196_8", β¦
#> $ sentiment <int> 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, β¦
#> $ review <chr> "With all this stuff going down at the moment with MJ i've sβ¦
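A minimal base R sketch of a train/test split on this data (the 4,000/1,000 split is an arbitrary choice for illustration):
set.seed(2020)  # arbitrary seed, for reproducibility
train_id <- sample(nrow(movie_review), 4000)
train <- movie_review[train_id, ]
test <- movie_review[-train_id, ]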
epubr
The epubr package allows for extraction of metadata and textual content of epub files.
install.packages("epubr")
library(epubr)
Further information and examples can be found here.
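A minimal sketch of the main function, epub(); the file path here is hypothetical:
library(epubr)
# Returns a data frame of metadata with the text nested in a list-column
book <- epub("path/to/book.epub")  # hypothetical local epub file
book$data[[1]]  # one row per section with its text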
GitHub packages
appa
The appa package contains the complete script transcription for Avatar: The Last Airbender.
#devtools::install_github("averyrobbins1/appa")
library(appa)
The data set includes the full transcription, line by line with metadata about book number, chapter number, character, and more.
dplyr::glimpse(appa)
#> Rows: 13,385
#> Columns: 12
#> $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1β¦
#> $ book <fct> Water, Water, Water, Water, Water, Water, Water, Watβ¦
#> $ book_num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ chapter <fct> "The Boy in the Iceberg", "The Boy in the Iceberg", β¦
#> $ chapter_num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ character <chr> "Katara", "Scene Description", "Sokka", "Scene Descrβ¦
#> $ full_text <chr> "Water. Earth. Fire. Air. My grandmother used to telβ¦
#> $ character_words <chr> "Water. Earth. Fire. Air. My grandmother used to telβ¦
#> $ scene_description <list> <>, <>, "[Close-up of the boy as he grins confidentβ¦
#> $ writer            <chr> "Michael Dante DiMartino, Bryan Konietzko, Aaron Eha…
#> $ director <chr> "Dave Filoni", "Dave Filoni", "Dave Filoni", "Dave Fβ¦
#> $ imdb_rating <dbl> 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.1, 8.β¦
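A minimal dplyr sketch, counting spoken lines per character while leaving out the scene descriptions:
library(dplyr)
appa %>%
  filter(character != "Scene Description") %>%
  count(character, sort = TRUE)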
sacred
The sacred package includes 9 tidy data sets: apocrypha, book_of_mormon, doctrine_and_covenants, greek_new_testament, king_james_version, pearl_of_great_price, tanach, vulgate and septuagint, each with columns describing the position within the work.
#devtools::install_github("JohnCoene/sacred")
library(sacred)
dplyr::glimpse(apocrypha)
#> Rows: 5,725
#> Columns: 5
#> $ book.num <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1β¦
#> $ book <chr> "es1", "es1", "es1", "es1", "es1", "es1", "es1", "es1", "es1"β¦
#> $ psalm <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11", "11", "β¦
#> $ verse <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"β¦
#> $ text <chr> "And Josias held the feast of the passover in Jerusalem unto β¦
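A minimal dplyr sketch, counting verses per book:
library(dplyr)
apocrypha %>%
  count(book, sort = TRUE)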
Examples:
Still pending.
harrypotter
The harrypotter package includes the text from all 7 main series books.
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
The 7 books: philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince and deathly_hallows are formatted as character vectors with one chapter per string.
dplyr::glimpse(harrypotter::chamber_of_secrets)
#> chr [1:19] "THE WORST BIRTHDAY　　Not for the first time, an argument had broken out over breakfast at number four, Privet "| __truncated__ ...
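Since each string is a chapter, a minimal sketch (using the tibble package) for getting a book into a one-row-per-chapter data frame:
library(tibble)
tibble(
  chapter = seq_along(chamber_of_secrets),
  text = chamber_of_secrets
)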
Examples:
- Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R
- Harry Plotter: Part 2 β Hogwarts Houses and their Stereotypes
hgwellsr
The hgwellsr package provides access to the full texts of six novels by H. G. Wells.
#devtools::install_github("erikhoward/hgwellsr")
library(hgwellsr)
- Ann Veronica (1909): annveronica
- The History of Mr Polly (1910): mrpolly
- The Invisible Man (1897): invisibleman
- The Island of Doctor Moreau (1896): doctormoreau
- The Time Machine (1895): timemachine
- The War of the Worlds (1898): waroftheworlds
head(annveronica, 10)
#> [1] "CHAPTER THE FIRST"
#> [2] ""
#> [3] "ANN VERONICA TALKS TO HER FATHER"
#> [4] ""
#> [5] ""
#> [6] "Part 1"
#> [7] ""
#> [8] ""
#> [9] "One Wednesday afternoon in late September, Ann Veronica Stanley came"
#> [10] "down from London in a state of solemn excitement and quite resolved to"
jeeves
The jeeves package provides access to the full texts of 38 works by P.G. Wodehouse.
#devtools::install_github("aniruhil/jeeves")
library(jeeves)
dplyr::glimpse(adamselindistress)
#> Rows: 10,291
#> Columns: 3
#> $ gutenberg_id <int> 2233, 2233, 2233, 2233, 2233, 2233, 2233, 2233, 2233, 223β¦
#> $ text <chr> "[Transcriber's Note for edition 11: in para. 4 of Chapteβ¦
#> $ title <chr> "A Damsel in Distress", "A Damsel in Distress", "A Damselβ¦
koanr
The koanr package includes text from several of the more important Zen koan texts.
#devtools::install_github("malcolmbarrett/koanr")
library(koanr)
The texts in this package include The Gateless Gate (gateless_gate), The Blue Cliff Record (blue_cliff_record), The Record of the Transmission of the Light (record_of_light), and The Book of Equanimity (book_of_equanimity).
dplyr::glimpse(gateless_gate)
#> Rows: 192
#> Columns: 4
#> $ collection <chr> "The Gateless Gate", "The Gateless Gate", "The Gateless Gatβ¦
#> $ case <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5,β¦
#> $ type <chr> "title", "main_case", "commentary", "capping_verse", "titleβ¦
#> $ text <chr> "Joshu's Dog", "A monk asked Joshu, \"Has the dog the Buddhβ¦
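A minimal dplyr sketch, keeping only the main case of each koan:
library(dplyr)
gateless_gate %>%
  filter(type == "main_case")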
sherlock
The sherlock package includes text from the Sherlock Holmes Books.
#devtools::install_github("EmilHvitfeldt/sherlock")
library(sherlock)
The goal of sherlock is to provide access to the full texts of Sherlock Holmes stories that are in the public domain. Text and further information regarding copyright laws can be found here.
dplyr::glimpse(holmes)
#> Rows: 65,958
#> Columns: 2
#> $ text <chr> "A STUDY IN SCARLET", "", "Table of contents", "", "Part I", "Mr.β¦
#> $ book <chr> "A Study In Scarlet", "A Study In Scarlet", "A Study In Scarlet",β¦
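A minimal dplyr sketch, counting the number of lines per book:
library(dplyr)
holmes %>%
  count(book, sort = TRUE)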
rperseus
The goal of rperseus is to furnish classicists, textual critics, and R enthusiasts with texts from the Classical World. While the English translations of most texts are available through gutenbergr, rperseus returns these works in their original language: Greek, Latin, and Hebrew.
#devtools::install_github("ropensci/rperseus")
library(rperseus)
library(dplyr)  # needed for filter() and pull()

aeneid_latin <- perseus_catalog %>%
  filter(group_name == "Virgil",
         label == "Aeneid",
         language == "lat") %>%
  pull(urn) %>%
  get_perseus_text()
head(aeneid_latin)
#> # A tibble: 6 × 7
#>   text           urn        group_name label description        language section
#>   <chr>          <chr>      <chr>      <chr> <chr>              <chr>      <int>
#> 1 Arma virumque… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            1
#> 2 Conticuere om… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            2
#> 3 Postquam res … urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            3
#> 4 At regina gra… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            4
#> 5 Interea mediu… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            5
#> 6 Sic fatur lac… urn:cts:l… Virgil     Aene… "Perseus:bib:oclc… lat            6
See the vignette for more examples.
tidygutenbergr
The tidygutenbergr package contains many functions that fetch data from Project Gutenberg using the gutenbergr package and do some light cleaning.
#devtools::install_github("emilHvitfeldt/tidygutenbergr")
library(tidygutenbergr)
tidygutenbergr contains a couple dozen data sets that can all be found here. Many books have metadata on the text, such as book number and chapter name/number.
dplyr::glimpse(a_tale_of_two_cities())
#> Rows: 15,830
#> Columns: 4
#> $ text <chr> "Book the First--Recalled to Life", "", "", "", "", "CHAPβ¦
#> $ book <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ chapter <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, β¦
#> $ chapter_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, Nβ¦
subtools
The subtools package doesn't include any textual data, but allows you to read subtitle files.
#devtools::install_github("fkeck/subtools")
library(subtools)
Examples of its use can be found in the posts below.
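As a minimal sketch, assuming the current GitHub version of subtools and a hypothetical local subtitle file:
library(subtools)
subs <- read_subtitles("path/to/episode.srt")  # hypothetical .srt file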
Examples:
- Movies and series subtitles in R with subtools
- A tidy text analysis of Rick and Morty
- You beautiful, naïve, sophisticated newborn series
Tidytuesday
The tidytuesday project is an amazing collection of data sets that are well suited for beginners to hone their skills. Below is a list of the data sets that contain enough text data to analyze. This list contains some data sets that are already present on this page, but they are kept here for completeness.
Examples will not be shown here since that is taken care of in the respective pages.
Date | Topic |
---|---|
2019-01-01 | #rstats and #TidyTuesday Tweets from rtweet |
2019-03-12 | Board Games Database |
2019-04-23 | Anime Dataset |
2019-05-28 | Wine ratings |
2019-06-25 | UFO Sightings around the world |
2019-09-10 | Amusement Park injuries |
2019-10-22 | Horror movie metadata |
2019-12-17 | Adoptable dogs |
2019-12-24 | Christmas Music Billboards |
2020-03-17 | The Office - Words and Numbers |
2020-04-21 | GDPR Fines |
2020-04-28 | Broadway Weekly Grosses |
2020-05-05 | Animal Crossing - New Horizons |
2020-05-26 | Cocktails |
2020-06-09 | African American Achievements |
2020-06-16 | American Slavery and Juneteenth |
2020-08-11 | Avatar: The last airbender |
2020-09-08 | Friends |
2020-09-29 | Beyoncé and Taylor Swift Lyrics |
2020-12-08 | Women of 2020 |
2021-01-12 | Art Collections |
2021-03-02 | Superbowl commercials |
2021-03-23 | UN Votes |
2021-04-20 | Netflix Shows |
2021-04-27 | CEO Departures |
2021-06-15 | Du Bois and Juneteenth Revisited |
Wild data
This section includes public data sets and how to import them into R ready for analysis. It is generally advised to save the resulting data so that you don't re-download the data excessively.
This website includes a handful of different movie review data sets. Below are the code chunks necessary to load in each data set.
polarity dataset v2.0
library(tidyverse)
library(fs)
# Download the archive to a temporary file
filepath <- file_temp() %>%
  path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz", filepath)

# List the archive contents, drop the README, and extract the reviews
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)

# Read each review; the polarity and tags are encoded in the file paths
data <- map_df(file_names,
               ~ tibble(text = read_lines(.x),
                        polarity = str_detect(.x, "pos"),
                        cv_tag = str_extract(.x, "(?<=cv)\\d{3}"),
                        html_tag = str_extract(.x, "(?<=cv\\d{3}_)\\d*")))
glimpse(data)
#> Rows: 64,720
#> Columns: 4
#> $ text <chr> "plot : two teen couples go to a church party , drink and theβ¦
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSEβ¦
#> $ cv_tag <chr> "000", "000", "000", "000", "000", "000", "000", "000", "000"β¦
#> $ html_tag <chr> "29416", "29416", "29416", "29416", "29416", "29416", "29416"β¦
sentence polarity dataset v1.0
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
data <- map_df(file_names,
~ tibble(text = read_lines(.x),
polarity = str_detect(.x, "pos")))
glimpse(data)
#> Rows: 10,662
#> Columns: 2
#> $ text <chr> "simplistic , silly and tedious . ", "it's so laddish and juvβ¦
#> $ polarity <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSEβ¦
scale dataset v1.0
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/scale_data.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
# Each review has three parallel files: subjective text, id, and rating
subjs <- str_subset(file_names, "subj")
ids <- str_subset(file_names, "id")
ratings <- str_subset(file_names, "rating")

# Reviewer names are encoded in the rating file names
names <- str_extract(ratings, "(?<=rating.).*") %>%
  str_replace("\\+", " ")

# Line up the parallel files element-wise
data <- map_df(seq_len(length(names)),
               ~ tibble(text = read_lines(subjs[.x]),
                        id = read_lines(ids[.x]),
                        rating = read_lines(ratings[.x]),
                        name = names[.x]))
glimpse(data)
#> Rows: 5,006
#> Columns: 4
#> $ text <chr> "in my opinion , a movie reviewer's most important task is to oβ¦
#> $ id <chr> "29420", "17219", "18406", "18648", "20021", "20454", "20473", β¦
#> $ rating <chr> "0.1", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", "0.2", β¦
#> $ name <chr> "Dennis Schwartz", "Dennis Schwartz", "Dennis Schwartz", "Denniβ¦
subjectivity dataset v1.0
library(tidyverse)
library(fs)
filepath <- file_temp() %>%
path_ext_set("tar.gz")
download.file("http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz", filepath)
file_names <- untar(filepath, list = TRUE)
file_names <- file_names[!str_detect(file_names, "README")]
untar(filepath, files = file_names)
data <- map_df(file_names,
~ tibble(text = read_lines(.x),
label = if_else(str_detect(.x, "quote"),
"subjective",
"objective")))
glimpse(data)
#> Rows: 10,000
#> Columns: 2
#> $ text <chr> "smart and alert , thirteen conversations about one thing is a sβ¦
#> $ label <chr> "subjective", "subjective", "subjective", "subjective", "subjectβ¦
SouthParkData
The GitHub repository BobAdamsEE/SouthParkData includes the scripts of the first 19 seasons of South Park. The following code snippet lets you download them all at once.
library(tidyverse)  # for map_df(), read_csv() and glimpse()

url_base <- "https://raw.githubusercontent.com/BobAdamsEE/SouthParkData/master/by-season"
urls <- paste0(url_base, "/Season-", 1:19, ".csv")
data <- map_df(urls, ~ read_csv(.x))
glimpse(data)
#> Rows: 73,139
#> Columns: 4
#> $ Season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ Episode <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, β¦
#> $ Character <chr> "Boys", "Kyle", "Ike", "Kyle", "Cartman", "Kyle", "Stan", "Kβ¦
#> $ Line <chr> "School day, school day, teacher's golden ru...\n", "Ah, damβ¦
Saudi Newspapers Corpus
The GitHub repository inparallel/SaudiNewsNet includes data and text from 31,030 Arabic newspaper articles, along with metadata, extracted from various online Saudi newspapers.
library(rio)
library(glue)
library(fs)
library(purrr)
library(dplyr)  # for glimpse()
dates <- c("2015-07-21", "2015-07-22", "2015-07-23", "2015-07-24", "2015-07-25",
"2015-07-26", "2015-07-27", "2015-07-31", "2015-08-01", "2015-08-02",
"2015-08-03", "2015-08-04", "2015-08-06", "2015-08-07", "2015-08-08",
"2015-08-09", "2015-08-10", "2015-08-11")
tmp_path <- path_temp()
urls <- glue("https://raw.githubusercontent.com/inparallel/SaudiNewsNet/master/dataset/{dates}.zip")
paths <- path(tmp_path, dates, ext = "zip")
data <- map2_dfr(urls, paths, ~ {
download.file(.x, .y)
import_list(.y)[[1]]
})
glimpse(data)
#> Rows: 31,030
#> Columns: 6
#> $ source         <chr> "aawsat", "aawsat", "aawsat", "aawsat", "aawsat", "aaws…
#> $ url            <chr> "http://aawsat.com/home/article/410826/بريطانيا-أربعة-م…
#> $ date_extracted <chr> "2015-07-21 02:51:32", "2015-07-21 02:51:33", "2015-07-…
#> $ title          <chr> "بريطانيا: أربعة محاور لاستراتيجية جديدة تتصدى للتطرف ع…
#> $ author         <chr> "لندن: …
#> $ content        <chr> "حدد رئيس الوزراء البريطاني ديفيد كاميرون اليوم (الاثن…