πŸ‘·β€β™‚οΈ A simple package for extracting useful features from character objects πŸ‘·β€β™€οΈ

👷 textfeatures 👷

Easily extract useful features from character objects.

Install

Install from CRAN.

## download from CRAN
install.packages("textfeatures")

Or install the development version from GitHub.

## install from GitHub
devtools::install_github("mkearney/textfeatures")

Usage

textfeatures()

Input a character vector.

## vector of some text
x <- c(
  "this is A!\t sEntence https://github.com about #rstats @github",
  "and another sentence here", "THe following list:\n- one\n- two\n- three\nOkay!?!"
)

## get text features
textfeatures(x, verbose = FALSE)
#> # A tibble: 3 x 36
#>   n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars n_uq_chars n_commas
#>    <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>      <dbl>    <dbl>
#> 1  1.15      1.15       1.15          1.15       1.15          1.15    0.243      0.330        0
#> 2 -0.577    -0.577     -0.577        -0.577     -0.577        -0.577  -1.10      -1.12         0
#> 3 -0.577    -0.577     -0.577        -0.577     -0.577        -0.577   0.856      0.793        0
#> # … with 27 more variables: n_digits <dbl>, n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>,
#> #   n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
#> #   n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>, sent_afinn <dbl>,
#> #   sent_bing <dbl>, sent_syuzhet <dbl>, sent_vader <dbl>, n_polite <dbl>, n_first_person <dbl>,
#> #   n_first_personp <dbl>, n_second_person <dbl>, n_second_personp <dbl>, n_third_person <dbl>,
#> #   n_tobe <dbl>, n_prepositions <dbl>, w1 <dbl>, w2 <dbl>
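
By default these features are normalized (note the z-score-like values above, where three similar sentences still differ column-wise). For raw counts instead, the `normalize` argument (used in the NASA example later in this README) can be switched off; a minimal sketch:

```r
## raw (non-normalized) feature counts instead of scaled features
textfeatures(x, normalize = FALSE, verbose = FALSE)
```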

Or input a data frame with a column named text.

## data frame with rstats tweets
rt <- rtweet::search_tweets("rstats", n = 2000, verbose = FALSE)

## get text features
tf <- textfeatures(rt, verbose = FALSE)

## preview data
tf
#> # A tibble: 2,000 x 134
#>    n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars n_uq_chars n_commas
#>     <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>      <dbl>    <dbl>
#>  1 -0.351     0.331     -1.07         -1.06       3.32          3.32    0.400      0.826    1.74 
#>  2 -0.351     0.331     -0.347        -0.345      1.11          1.11    0.670     -0.302    1.74 
#>  3 -0.351     0.331     -0.645        -0.643     -0.562        -0.562  -0.201      0.975    2.36 
#>  4 -0.351     0.331     -0.115        -0.114      2.09          2.09    0.554      0.412    2.84 
#>  5  0.832     0.331     -0.115        -0.114      2.09          2.09    0.270      0.129   -0.604
#>  6 -0.351     0.331     -0.347        -0.345      1.11          1.11   -0.173     -0.677   -0.604
#>  7 -0.351     0.331     -0.347        -0.345      1.11          1.11   -0.566     -0.187    1.74 
#>  8 -0.351     0.331     -0.645        -0.643     -0.562        -0.562  -1.68      -1.28    -0.604
#>  9 -0.351     0.331     -0.115        -0.114     -0.562        -0.562  -0.531     -0.421   -0.604
#> 10 -2.37     -2.96      -0.347        -0.345      1.11          1.11   -1.26      -0.815    0.877
#> # … with 1,990 more rows, and 125 more variables: n_digits <dbl>, n_exclaims <dbl>,
#> #   n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>,
#> #   n_uq_words <dbl>, n_caps <dbl>, n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>,
#> #   n_charsperword <dbl>, sent_afinn <dbl>, sent_bing <dbl>, sent_syuzhet <dbl>, sent_vader <dbl>,
#> #   n_polite <dbl>, n_first_person <dbl>, n_first_personp <dbl>, n_second_person <dbl>,
#> #   n_second_personp <dbl>, n_third_person <dbl>, n_tobe <dbl>, n_prepositions <dbl>, w1 <dbl>,
#> #   w2 <dbl>, w3 <dbl>, w4 <dbl>, w5 <dbl>, w6 <dbl>, w7 <dbl>, w8 <dbl>, w9 <dbl>, w10 <dbl>,
#> #   w11 <dbl>, w12 <dbl>, w13 <dbl>, w14 <dbl>, w15 <dbl>, w16 <dbl>, w17 <dbl>, w18 <dbl>,
#> #   w19 <dbl>, w20 <dbl>, w21 <dbl>, w22 <dbl>, w23 <dbl>, w24 <dbl>, w25 <dbl>, w26 <dbl>,
#> #   w27 <dbl>, w28 <dbl>, w29 <dbl>, w30 <dbl>, w31 <dbl>, w32 <dbl>, w33 <dbl>, w34 <dbl>,
#> #   w35 <dbl>, w36 <dbl>, w37 <dbl>, w38 <dbl>, w39 <dbl>, w40 <dbl>, w41 <dbl>, w42 <dbl>,
#> #   w43 <dbl>, w44 <dbl>, w45 <dbl>, w46 <dbl>, w47 <dbl>, w48 <dbl>, w49 <dbl>, w50 <dbl>,
#> #   w51 <dbl>, w52 <dbl>, w53 <dbl>, w54 <dbl>, w55 <dbl>, w56 <dbl>, w57 <dbl>, w58 <dbl>,
#> #   w59 <dbl>, w60 <dbl>, w61 <dbl>, w62 <dbl>, w63 <dbl>, w64 <dbl>, w65 <dbl>, w66 <dbl>,
#> #   w67 <dbl>, w68 <dbl>, w69 <dbl>, w70 <dbl>, w71 <dbl>, w72 <dbl>, w73 <dbl>, w74 <dbl>,
#> #   w75 <dbl>, …

Compare across multiple authors.

## data frame of tweets from multiple news media accounts
news <- rtweet::get_timelines(
  c("cnn", "nytimes", "foxnews", "latimes", "washingtonpost"), 
  n = 2000)

## get text features (including estimates for 20 word dimensions) for all observations
news_features <- textfeatures(news, word_dims = 20, verbose = FALSE)
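
With features in hand, one way to compare accounts is to average each feature by author. This is a hedged sketch (not from the original README), assuming `news` keeps rtweet's `screen_name` column and its row order matches `news_features`:

```r
## hypothetical sketch: mean feature value per news account
library(dplyr)
news_features %>%
  mutate(screen_name = news$screen_name) %>%
  group_by(screen_name) %>%
  summarise_if(is.numeric, mean)
```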

Fast version

If you’re looking for something faster, try setting sentiment = FALSE and word_dims = 0.

## get non-substantive text features
textfeatures(rt, sentiment = FALSE, word_dims = 0, verbose = FALSE)
#> # A tibble: 2,000 x 29
#>    n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars n_uq_chars n_commas
#>     <dbl>     <dbl>      <dbl>         <dbl>      <dbl>         <dbl>   <dbl>      <dbl>    <dbl>
#>  1 -0.351     0.331     -1.07         -1.06       3.32          3.32    0.400      0.826    1.74 
#>  2 -0.351     0.331     -0.347        -0.345      1.11          1.11    0.670     -0.302    1.74 
#>  3 -0.351     0.331     -0.645        -0.643     -0.562        -0.562  -0.201      0.975    2.36 
#>  4 -0.351     0.331     -0.115        -0.114      2.09          2.09    0.554      0.412    2.84 
#>  5  0.832     0.331     -0.115        -0.114      2.09          2.09    0.270      0.129   -0.604
#>  6 -0.351     0.331     -0.347        -0.345      1.11          1.11   -0.173     -0.677   -0.604
#>  7 -0.351     0.331     -0.347        -0.345      1.11          1.11   -0.566     -0.187    1.74 
#>  8 -0.351     0.331     -0.645        -0.643     -0.562        -0.562  -1.68      -1.28    -0.604
#>  9 -0.351     0.331     -0.115        -0.114     -0.562        -0.562  -0.531     -0.421   -0.604
#> 10 -2.37     -2.96      -0.347        -0.345      1.11          1.11   -1.26      -0.815    0.877
#> # … with 1,990 more rows, and 20 more variables: n_digits <dbl>, n_exclaims <dbl>,
#> #   n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>, n_periods <dbl>, n_words <dbl>,
#> #   n_uq_words <dbl>, n_caps <dbl>, n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>,
#> #   n_charsperword <dbl>, n_first_person <dbl>, n_first_personp <dbl>, n_second_person <dbl>,
#> #   n_second_personp <dbl>, n_third_person <dbl>, n_tobe <dbl>, n_prepositions <dbl>

Example: NASA metadata

Extract text features from NASA metadata:

## read NASA metadata
nasa <- jsonlite::fromJSON("https://data.nasa.gov/data.json")

## identify non-public or restricted data sets
nonpub <- grepl("Not publicly available|must register", 
  nasa$dataset$rights, ignore.case = TRUE) | 
  nasa$dataset$accessLevel %in% c("restricted public", "non-public")

## create data frame with the description (named "text") and nonpub
nd <- data.frame(text = nasa$dataset$description, nonpub = nonpub, 
  stringsAsFactors = FALSE)

## drop duplicates (truncate text to ensure more distinct obs)
nd <- nd[!duplicated(tolower(substr(nd$text, 1, 100))), ]

## filter via sampling to create equal number of pub/nonpub
nd <- nd[c(sample(which(!nd$nonpub), sum(nd$nonpub)), which(nd$nonpub)), ]
## get text features
nasa_tf <- textfeatures(nd, word_dims = 20, normalize = FALSE, verbose = FALSE)

## drop columns with little to no variance
min_var <- function(x, min = 1) {
  is_num <- vapply(x, is.numeric, logical(1))
  non_num <- names(x)[!is_num]
  keep <- vapply(x[is_num], function(.x) stats::var(.x, na.rm = TRUE) >= min, logical(1))
  yminvar <- names(x[is_num])[keep]
  x[c(non_num, yminvar)]
}
nasa_tf <- min_var(nasa_tf)
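
As a quick sanity check of what `min_var()` drops, here is a toy example (not from the README) with one zero-variance column:

```r
## toy data: "drop" has zero variance, so min_var() removes it
toy <- data.frame(
  id   = c("a", "b", "c"),
  keep = c(0, 5, 10),
  drop = c(1, 1, 1),
  stringsAsFactors = FALSE
)
names(min_var(toy))
#> [1] "id"   "keep"
```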

## view summary
skimr::skim(nasa_tf)

variable         min    25%     mid      75%       max   hist
n_caps             1    10      28       46        207   ▇▇▂▁▁▁▁▁
n_commas           0     1       6        9.75      32   ▇▅▃▁▁▁▁▁
n_digits           0     0       2        6         57   ▇▁▁▁▁▁▁▁
n_extraspaces      0     0       0        0         29   ▇▁▁▁▁▁▁▁
n_lowers           0     4.25   47      853.5     3123   ▇▁▂▁▁▁▁▁
n_nonasciis        0     0       0        0         20   ▇▁▁▁▁▁▁▁
n_periods          0     0       2        6         28   ▇▂▁▁▁▁▁▁
n_prepositions     0     0       1        8         18   ▇▁▁▃▂▁▁▁
n_puncts           0     0       2       12         59   ▇▂▁▁▁▁▁▁
n_tobe             0     0       0        3          7   ▇▁▁▂▁▁▁▁
n_uq_chars         2    15      28.5     46         68   ▂▇▅▂▅▅▃▁
n_uq_words         1     7      12.5    112.75     341   ▇▂▂▂▁▁▁▁
n_words            1     7      12.5    163.5      598   ▇▂▂▁▁▁▁▁
sent_afinn       -18     0       0        3         30   ▁▁▇▂▁▁▁▁
sent_bing         -9     0       0        1         23   ▁▁▇▁▁▁▁▁
sent_syuzhet      -3.5   0       0        4.16      32.25 ▇▂▂▁▁▁▁▁
sent_vader       -11.5   0       0        2.8       31.4  ▁▁▇▁▁▁▁▁

## add nonpub variable
nasa_tf$nonpub <- nd$nonpub

## run model predicting whether data is restricted
m1 <- glm(nonpub ~ ., data = nasa_tf[-1], family = binomial)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## view model summary
summary(m1)
#> 
#> Call:
#> glm(formula = nonpub ~ ., family = binomial, data = nasa_tf[-1])
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.01381  -0.01885   0.00078   0.04314   2.29757  
#> 
#> Coefficients:
#>                 Estimate Std. Error z value Pr(>|z|)   
#> (Intercept)      8.31318    2.70503   3.073  0.00212 **
#> n_uq_chars      -0.37317    0.14005  -2.665  0.00771 **
#> n_commas         0.14884    0.25324   0.588  0.55671   
#> n_digits        -0.19962    0.13118  -1.522  0.12809   
#> n_extraspaces    0.08942    0.16235   0.551  0.58179   
#> n_lowers        -0.01618    0.03261  -0.496  0.61983   
#> n_periods        1.17591    0.44971   2.615  0.00893 **
#> n_words         -0.02638    0.14660  -0.180  0.85723   
#> n_uq_words       0.04423    0.17763   0.249  0.80337   
#> n_caps           0.17170    0.06327   2.714  0.00666 **
#> n_nonasciis     -1.77660  367.21424  -0.005  0.99614   
#> n_puncts        -0.21932    0.16775  -1.307  0.19107   
#> sent_afinn       0.19473    0.43352   0.449  0.65330   
#> sent_bing       -0.56450    0.56620  -0.997  0.31876   
#> sent_syuzhet     0.06075    0.59648   0.102  0.91888   
#> sent_vader      -0.09451    0.35599  -0.265  0.79064   
#> n_tobe          -0.49601    0.76199  -0.651  0.51509   
#> n_prepositions   0.21984    0.52947   0.415  0.67799   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 396.480  on 285  degrees of freedom
#> Residual deviance:  57.512  on 268  degrees of freedom
#> AIC: 93.512
#> 
#> Number of Fisher Scoring iterations: 19

## how accurate was the model?
table(predict(m1, type = "response") > .5, nasa_tf$nonpub)
#>        
#>         FALSE TRUE
#>   FALSE   138    7
#>   TRUE      5  136
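
Reading off the confusion matrix above, the model classified (138 + 136) of 286 observations correctly, roughly 95.8% accuracy. A quick base-R check:

```r
## overall accuracy implied by the confusion matrix above
conf <- matrix(c(138, 5, 7, 136), nrow = 2)
sum(diag(conf)) / sum(conf)
#> [1] 0.958042
```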
