Multilingual Stopword Lists in R

stopwords: the R package

R package providing “one-stop shopping” (or should that be “one-shop stopping”?) for stopword lists in R, for multiple languages and sources. No longer should text analysis or NLP packages bake in their own stopword lists or functions, since this package can accommodate them all, and is easily extended.

Created by David Muhr, and extended in cooperation with Kenneth Benoit and Kohei Watanabe.

Installation

# from CRAN
install.packages("stopwords")

# Or get the development version from GitHub:
# install.packages("devtools")
devtools::install_github("quanteda/stopwords")

Usage

head(stopwords::stopwords("de", source = "snowball"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

head(stopwords::stopwords("ja", source = "marimo"), 20)
##  [1] "私"       "僕"       "自分"     "自身"     "我々"     "私達"    
##  [7] "あなた"   "彼"       "彼女"     "彼ら"     "彼女ら"   "あれ"    
## [13] "それ"     "これ"     "あれら"   "あれらの" "それら"   "それらの"
## [19] "これら"   "これらの"

For compatibility with the former quanteda::stopwords():

head(stopwords::stopwords("german"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

Explore sources and languages:

# list all sources
stopwords::stopwords_getsources()
## [1] "snowball"      "stopwords-iso" "misc"          "smart"        
## [5] "marimo"        "ancient"       "nltk"          "perseus"

# list languages for a specific source
stopwords::stopwords_getlanguages("snowball")
##  [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
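
The two helpers can also be combined to find every source that covers a given language. The following is a minimal sketch, assuming the stopwords package is installed; `lang` and `has_lang` are names chosen here for illustration, and no output is shown because it depends on the installed version.

```r
library(stopwords)

# Language code to look for (illustrative choice)
lang <- "de"

# For each source, check whether `lang` is among its languages
has_lang <- vapply(
  stopwords_getsources(),
  function(src) lang %in% stopwords_getlanguages(src),
  logical(1)
)

# Names of the sources that provide a list for `lang`
names(has_lang)[has_lang]
```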

Languages available

The following coverage of languages is currently available, by source. Note that the inclusiveness of the stopword lists will vary by source, and the number of languages covered by a stopword list does not necessarily mean that the source is better than one with more limited coverage. (There may be many reasons to prefer the default “snowball” source over the “stopwords-iso” source, for instance.)

The following languages are currently available:

Language Code snowball marimo nltk stopwords-iso Other
Afrikaans af ✓
Arabic ar ✓ ✓ ✓ misc
Armenian hy ✓
Azerbaijani az ✓
Basque eu ✓
Bengali bn ✓
Breton br ✓
Bulgarian bg ✓
Catalan ca ✓ misc
Chinese zh ✓ ✓ misc
Croatian hr ✓
Czech cs ✓
Danish da ✓ ✓ ✓
Dutch nl ✓ ✓ ✓
English en ✓ ✓ ✓ ✓ smart
Esperanto eo ✓
Estonian et ✓
Finnish fi ✓ ✓ ✓
French fr ✓ ✓ ✓
Galician gl ✓
German de ✓ ✓ ✓ ✓
Greek el ✓ misc
Greek (ancient) grc ancient, perseus
Gujarati gu misc
Hausa ha ✓
Hebrew he ✓ ✓
Hindi hi ✓
Hungarian hu ✓ ✓ ✓
Indonesian id ✓ ✓
Irish ga ✓
Italian it ✓ ✓ ✓
Japanese ja ✓ ✓
Kazakh kk ✓
Korean ko ✓ ✓
Kurdish ku ✓
Latin la ✓ ancient, perseus
Lithuanian lt ✓
Latvian lv ✓
Malay ms ✓
Marathi mr ✓
Nepali ne ✓
Norwegian no ✓ ✓ ✓
Persian fa ✓
Polish pl ✓
Portuguese pt ✓ ✓ ✓
Romanian ro ✓ ✓ ✓
Russian ru ✓ ✓ ✓ ✓
Slovak sk ✓
Slovenian sl ✓ ✓
Somali so ✓
Southern Sotho st ✓
Spanish es ✓ ✓ ✓
Swahili sw ✓
Swedish sv ✓ ✓ ✓
Thai th ✓
Tagalog tl ✓
Tajik tg ✓
Turkish tr ✓ ✓
Ukrainian uk ✓
Urdu ur ✓
Vietnamese vi ✓
Yoruba yo ✓
Zulu zu ✓

Basic usage

head(stopwords::stopwords("de", source = "snowball"), 20)
##  [1] "aber"    "alle"    "allem"   "allen"   "aller"   "alles"   "als"    
##  [8] "also"    "am"      "an"      "ander"   "andere"  "anderem" "anderen"
## [15] "anderer" "anderes" "anderm"  "andern"  "anderr"  "anders"

head(stopwords::stopwords("de", source = "stopwords-iso"), 20)
##  [1] "a"           "ab"          "aber"        "ach"         "acht"       
##  [6] "achte"       "achten"      "achter"      "achtes"      "ag"         
## [11] "alle"        "allein"      "allem"       "allen"       "aller"      
## [16] "allerdings"  "alles"       "allgemeinen" "als"         "also"
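
To see how much the two German lists differ, one can compare their sizes and look at set differences. This is a small sketch, assuming the stopwords package is installed; the variable names are illustrative, and no output is shown because the counts depend on the installed package version.

```r
library(stopwords)

de_snowball <- stopwords("de", source = "snowball")
de_iso      <- stopwords("de", source = "stopwords-iso")

# Number of stopwords in each list
c(snowball = length(de_snowball), `stopwords-iso` = length(de_iso))

# Snowball words that the stopwords-iso list does not contain
setdiff(de_snowball, de_iso)

# First few words found in both lists
head(intersect(de_snowball, de_iso), 10)
```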

Modifying stopword lists

It is now possible to edit your own stopword lists in an interactive editor, using functions from the quanteda package (>= v2.02). For instance, to edit the English stopword list from the Snowball source:

# edit the English stopwords
my_stopwords <- quanteda::char_edit(stopwords::stopwords("en", source = "snowball"))

To edit stopwords whose underlying structure is a list, such as the “marimo” source, we can use the list_edit() function:

# edit the English stopwords from the marimo source, keeping the list structure
my_stopwordlist <- quanteda::list_edit(stopwords::stopwords("en", source = "marimo", simplify = FALSE))

Finally, it’s possible to remove stopwords using pattern matching. The default is the easy-to-use “glob” style matching, which is equivalent to fixed matching when no wildcard characters are used. So to remove possessive pronouns from the English Snowball word list, for instance, this would work:

library("quanteda", warn.conflicts = FALSE)
## Package version: 3.2.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
posspronouns <- stopwords::data_stopwords_marimo$en$pronoun$possessive
posspronouns
## [1] "my"    "our"   "your"  "his"   "her"   "its"   "their"

stopwords("en", source = "snowball") %>%
  head(n = 10)
##  [1] "i"         "me"        "my"        "myself"    "we"        "our"      
##  [7] "ours"      "ourselves" "you"       "your"

After removal, “my”, “our”, and “your” are gone:

stopwords("en", source = "snowball") %>%
  head(n = 10) %>%
  char_remove(pattern = posspronouns)
## [1] "i"         "me"        "myself"    "we"        "ours"      "ourselves"
## [7] "you"

There is no char_add(), since it’s just as easy to use c() for this, but there is a char_keep() for positive selection rather than removal.
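
As a rough illustration of adding words with c(), here is a base-R sketch that needs no packages. The vector `sw` is copied from the first ten Snowball words shown above; the added words and the names `custom` and `kept` are made up for the example. The last step is only a fixed-matching stand-in for positive selection, not char_keep() itself.

```r
# First ten English Snowball stopwords, copied from the output above
sw <- c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your")

# Add custom stopwords with c(); unique() drops the repeated "my"
custom <- unique(c(sw, "rt", "via", "my"))

# Positive selection by exact match: keep only the listed words
kept <- sw[sw %in% c("my", "our", "your")]

custom
kept
```

With quanteda loaded, char_keep() would additionally allow glob patterns in place of the exact matches used here.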

Adding stopwords to your own package

In v2.2, we’ve removed the function use_stopwords() because the dependency on usethis added too many downstream package dependencies, and stopwords is meant to be a lightweight package.

However, it is easy to add a re-export of stopwords() to your package by saving the following as stopwords.R:

#' Stopwords
#'
#' @description
#' Return a character vector of stopwords.
#' See \code{stopwords::\link[stopwords:stopwords]{stopwords()}} for details.
#' @usage stopwords(language = "en", source = "snowball")
#' @name stopwords
#' @importFrom stopwords stopwords
#' @export
NULL

Then add stopwords to the Imports: field of your DESCRIPTION file.

Contributing

Additional sources can be defined and contributed by adding new data objects, as follows:

  1. Data object. Create a named list of character vectors, in UTF-8 format, consisting of the stopwords for each language. The ISO 639-1 language code forms the name of each list element, and the value of each element is the character vector of stopwords for literal matches. The data object should follow the package naming convention and be called data_stopwords_newsource, where newsource is replaced by the name of the new source.

  2. Documentation. The new source should be clearly documented, especially the source from which it was taken.
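
The data object described in step 1 might look like the following sketch. The source name "newsource" and the word lists are placeholders for illustration only, and the checks are a simplified assumption about what a contribution should satisfy, not the package's own validation.

```r
# Hypothetical new source: a named list with one character vector per language
data_stopwords_newsource <- list(
  en = c("the", "and", "of"),
  de = c("der", "und", "von")
)

# Mark every string as UTF-8
data_stopwords_newsource <- lapply(data_stopwords_newsource, enc2utf8)

# Simplified sanity checks before contributing
stopifnot(
  is.list(data_stopwords_newsource),
  # element names look like two-letter ISO 639-1 codes
  all(grepl("^[a-z]{2}$", names(data_stopwords_newsource))),
  # every element is a character vector of valid UTF-8 strings
  all(vapply(data_stopwords_newsource, is.character, logical(1))),
  all(vapply(data_stopwords_newsource, function(x) all(validUTF8(x)), logical(1)))
)
```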

License

This package and the source repositories are licensed under the MIT license.