lib-text

A little text processing library for Scala.

Overview

lib-text is a small text processing library for Scala. It supports language identification, tokenization and stopword filtering, and provides some useful helper functions. The tokenizer is tuned for text conventions common in social media such as Twitter, and handles URLs, emoji, hashtags, email addresses and @-mentions cleanly. Stopword filtering is currently supported for:

  • German
  • English
  • Spanish
  • French
  • Indonesian
  • Japanese
  • Malay
  • Dutch
  • Portuguese
  • Swedish
  • Turkish
  • Arabic

More to come.
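
To make the ideas above concrete, here is a minimal, self-contained sketch of social-media-aware tokenization followed by stopword filtering. It is not lib-text's implementation: the object name `TokenizeSketch`, the regular expression and the tiny English stopword set are all assumptions made for illustration, and the pattern covers only ASCII word characters.

```scala
object TokenizeSketch {
  // Alternatives are tried left to right, so specific patterns come before general ones.
  // Note: \w is ASCII-only by default in Java regexes; a real tokenizer needs more care.
  private val Token = (
    """https?://\S+""" +                   // URLs
    """|[\w.+-]+@[\w-]+(?:\.[\w-]+)+""" +  // email addresses
    """|[@#]\w+""" +                       // @-mentions and hashtags
    """|\w+(?:'\w+)*""" +                  // words, including contractions like "there's"
    """|\S"""                              // any other single non-space character
  ).r

  // Split text into tokens, keeping URLs, emails, mentions and hashtags whole.
  def tokens(text: String): Vector[String] = Token.findAllIn(text).toVector

  // Hypothetical stopword list, far smaller than anything a real library would ship.
  private val Stopwords: Set[String] =
    Set("did", "you", "get", "your", "with", "of", "on", "if", "not", "there's", "still")

  // Lowercase the tokens, keep plain words only, then drop stopwords.
  def terms(text: String): Set[String] =
    tokens(text)
      .map(_.toLowerCase)
      .filter(t => t.forall(c => c.isLetterOrDigit || c == '\''))
      .filterNot(Stopwords)
      .toSet
}
```

For instance, `TokenizeSketch.tokens("mail bob@example.com about #scala")` keeps `bob@example.com` and `#scala` as single tokens, because the email and hashtag alternatives are tried before the plain-word alternative.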

Usage

Add the resolver and the dependency to your sbt build:

resolvers += "peoplepattern" at "https://dl.bintray.com/peoplepattern/maven/"

libraryDependencies += "com.peoplepattern" %% "lib-text" % "0.3"

Example

import com.peoplepattern.text.Implicits._

val txt = "Did you get your personalised print with your copy of #MadeintheAM on Black Friday? If not, there's still time! http://www.myplaydirect.com/one-direction"

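// Language identification; the result is an Option
txt.lang
// Some(en)

// Tokenization; the hashtag and the URL each stay a single token
txt.tokens
// Vector(Did, you, get, your, personalised, print, with, your, copy, of, #MadeintheAM, on, Black, Friday, ?, If, not, ,, there's, still, time, !, http://www.myplaydirect.com/one-direction)

// Lowercased terms, without stopwords, punctuation, the hashtag or the URL
txt.terms
// Set(print, personalised, black, copy, friday, time)

// Like terms, but the hashtag is kept
txt.termsPlus
// Set(print, personalised, black, #madeintheam, copy, friday, time)

// Bigrams; here, pairs of terms that are adjacent in the text
txt.termBigrams
// Set(black friday, personalised print)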
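
Judging only from the output above, a term bigram appears to be a pair of neighbouring tokens that both survive filtering. The following self-contained sketch reproduces that behaviour under this assumption. It does not use lib-text, and `BigramSketch`, `isTerm` and the stopword set are hypothetical names and data chosen for illustration.

```scala
object BigramSketch {
  // Hypothetical stopword list, just large enough for the sample sentence.
  private val Stopwords: Set[String] =
    Set("did", "you", "get", "your", "with", "of", "on", "if", "not", "still")

  // A term here is a non-empty, all-letter, lowercased token that is not a stopword.
  private def isTerm(token: String): Boolean =
    token.nonEmpty && token.forall(_.isLetter) && !Stopwords(token)

  // Slide a window of two over the tokens and keep pairs where both sides are terms.
  def termBigrams(tokens: Seq[String]): Set[String] =
    tokens
      .map(_.toLowerCase)
      .sliding(2)
      .collect { case Seq(a, b) if isTerm(a) && isTerm(b) => s"$a $b" }
      .toSet
}
```

Because punctuation, hashtags and URLs fail the all-letter check, they break up any bigram that would otherwise span them. A real implementation may draw these boundaries differently.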

License

lib-text is open source and licensed under the Apache License 2.0.

Acknowledgements

Developed with ❤️ at People Pattern Corporation
