Ruby Natural Language Processing Resources

A collection of Natural Language Processing (NLP) Ruby libraries, tools and software. Suggestions and contributions are welcome.

Categories

APIs

3rd party NLP services

Client libraries for various third-party NLP API services.

  • alchemy_api - provides a client API library for AlchemyAPI's NLP services
  • aylien_textapi_ruby - AYLIEN's officially supported Ruby client library for accessing Text API
  • biffbot - Ruby gem for Diffbot's APIs that extract Articles, Products, Images, Videos, and Discussions from any web page
  • gengo-ruby - a Ruby library to interface with the Gengo API for translation
  • monkeylearn-ruby - build and consume machine learning models for language processing from your Ruby apps
  • poliqarpr - Ruby client for Poliqarp text corpus server
  • wlapi - Ruby based API for the project Wortschatz Leipzig

Instant Messaging Bots

Client/server libraries for various third-party instant messenger chatbot APIs.

Facebook Messenger

  • botstack - rapid Facebook chatbot development with Ruby on Rails
  • facebook-messenger - Definitely the best Ruby client for Bots on Messenger
  • messenger-ruby - A simple library for supporting implementation of Facebook Messenger Bot in Ruby on Rails

Kik

Microsoft Bot Framework (Skype)

Slack

Telegram Messenger

Wechat

  • wechat - API, command and message handling for WeChat in Rails
  • wechat-api - handles WeChat API calls (not server-side push messages)

Natural Language Understanding Tools

Voice-based devices bots

Client/server libraries for various third-party voice-based device APIs.

Amazon Echo Alexa skills

Books

Bitext Alignment

Bitext alignment is the process of aligning two parallel documents on a segment by segment basis. In other words, if you have one document in English and its translation in Spanish, bitext alignment is the process of matching each segment from document A with its corresponding translation in document B.

  • alignment - alignment functions for corpus linguistics (Gale-Church implementation)
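
As a rough illustration of the matching described above, here is a minimal sketch that pairs segments by position and flags pairs whose lengths look implausible. It is not the Gale-Church algorithm that the alignment gem implements; `align_one_to_one` and `max_ratio` are hypothetical names chosen for this example, and it assumes both documents are already split into segments in the same order.

```ruby
# Hypothetical helper (not part of any listed gem): pairs segments positionally
# and marks a pair as suspicious when one side is missing or when the
# character-length ratio between the two sides exceeds max_ratio.
def align_one_to_one(source_segments, target_segments, max_ratio: 2.0)
  source_segments.zip(target_segments).map do |src, tgt|
    lengths = [src.to_s.length, tgt.to_s.length]
    shorter = lengths.min
    longer = lengths.max
    # An empty side or a large length mismatch suggests a misalignment.
    suspicious = shorter.zero? || longer.to_f / shorter > max_ratio
    { source: src, target: tgt, suspicious: suspicious }
  end
end

english = ["Hello.", "Thank you very much."]
spanish = ["Hola.", "Muchas gracias."]

align_one_to_one(english, spanish).each do |pair|
  flag = pair[:suspicious] ? " (check me)" : ""
  puts "#{pair[:source]} <-> #{pair[:target]}#{flag}"
end
```

Real aligners also handle one-to-many and many-to-one matches, which a positional pairing like this cannot express.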

Case

  • active_support - the Rails active_support gem has various string extensions that can handle case (e.g. .mb_chars.upcase.to_s or #transliterate)
  • string_pl - additional support for Polish encodings in Ruby 1.9
  • twitter-cldr-rb - casefolding
  • u - U extends Ruby’s Unicode support
  • unicode - Unicode normalization library
  • unicode_utils - Unicode algorithms for Ruby 1.9

Chatbot

  • chatterbot - A straightforward ruby-based Twitter Bot Framework, using OAuth to authenticate
  • JeffBot - (Yet another) comical and extensible chat bot
  • Lita - Lita is a chat bot written in Ruby with persistent storage provided by Redis
  • MegaHAL - MegaHAL is a learning chatterbot
  • Markov-chain-bot-module - A chat bot utilizing Markov chains. It speaks Russian and English
  • stealth - An open source Ruby framework for conversational voice and text chatbots

Classification

Classification aims to assign a document or piece of text to one or more classes or categories, making it easier to manage or sort.

  • Classifier - a general module to allow Bayesian and other types of classifications
  • classifier-reborn - (a fork of cardmagic/classifier) a general classifier module to allow Bayesian and other types of classifications
  • fastText Ruby - efficient text classification and representation learning - for Ruby
  • Latent Dirichlet Allocation - used to automatically cluster documents into topics
  • liblinear-ruby-swig - Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification and other large linear classifications)
  • linnaeus - a redis-backed Bayesian classifier
  • maxent_string_classifier - a JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework
  • Naive-Bayes - simple Naive Bayes classifier
  • nbayes - a full-featured, Ruby implementation of Naive Bayes
  • omnicat - a generalized rack framework for text classifications
  • omnicat-bayes - Naive Bayes text classification implementation as an OmniCat classifier strategy
  • stuff-classifier - a library for classifying text into multiple categories
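
To make the idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Ruby with add-one smoothing. It does not reproduce the API of any gem listed above; `TinyBayes`, `train`, and `classify` are hypothetical names for this illustration, and the tokenizer is deliberately crude.

```ruby
# Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
# TinyBayes is a hypothetical class for illustration only.
class TinyBayes
  def initialize
    @word_counts = Hash.new { |hash, label| hash[label] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocabulary = {}
  end

  # Record one labelled training document.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log posterior for the given text.
  def classify(text)
    words = tokenize(text)
    total_docs = @doc_counts.values.sum
    @doc_counts.keys.max_by do |label|
      total_words = @word_counts[label].values.sum
      log_prior = Math.log(@doc_counts[label].to_f / total_docs)
      log_likelihood = words.sum do |word|
        Math.log((@word_counts[label][word] + 1.0) / (total_words + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end

  private

  # Lowercase and keep runs of ASCII letters and apostrophes (an assumption
  # that only suits simple English text).
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

classifier = TinyBayes.new
classifier.train(:spam, "win cash prize now")
classifier.train(:spam, "cheap prize offer")
classifier.train(:ham, "meeting agenda for monday")
classifier.train(:ham, "lunch with the team")

puts classifier.classify("claim your cash prize") # prints "spam"
puts classifier.classify("team meeting monday")   # prints "ham"
```

Working in log space avoids numeric underflow when documents are long, and the smoothing keeps unseen words from zeroing out a class.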

Date and Time

  • Chronic - a pure Ruby natural language date parser
  • Chronic Between - a simple Ruby natural language parser for date and time ranges
  • Chronic Duration - a simple Ruby natural language parser for elapsed time
  • dotiw - Better distance of time in words for Rails http://ryanbigg.com
  • Kronic - a dirt simple library for parsing and formatting human readable dates
  • Nickel - extracts date, time, and message information from naturally worded text
  • Tickle - a natural language parser for recurring events
  • time_ago_in_words - Humanize elapsed time from some Time instance to Time.now
  • time-lord - adds extra functionality to the time class.

Emoji

  • active_emoji - A collection of emoji aliases for core Ruby methods
  • emoji - A gem. For Emoji. For everyone. ❤
  • gemoji - Emoji images and names
  • gemoji-parser - The missing helper methods for GitHub's gemoji gem
  • rumoji - Encode and decode emoji Unicode characters into emoji-cheat-sheet form

Error Correction

  • Chat Correct - shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence
  • gingerice - Ruby wrapper for correcting spelling and grammar mistakes based on the context of complete sentences

Full-Text Search

  • ferret - an information retrieval library in the same vein as Apache Lucene
  • ranguba - a project to provide a full-text search system built on Groonga
  • Thinking Sphinx - Sphinx plugin for ActiveRecord/Rails

Keyword Ranking

  • graph-rank - Ruby implementation of the PageRank and TextRank algorithms
  • highscore - find and rank keywords in text

Language Detection

Language Localization

  • fast_gettext - Ruby GetText, but 3.5x faster + 560x less memory + simple + clean namespace + thread-safe + extendable + multiple backends + Rails 3 ready
  • ruby-gettext - pure Ruby Localization(L10n) library and tool which is modeled after the GNU gettext package

Lexical Databases and Ontologies

Lexical databases, common-sense knowledge bases, multilingual lexicalized semantic networks, and ontologies.

BabelNet

ConceptNet

Mediawiki, Wikipedia

Wordnet

Machine Learning

  • Decision Tree - a Ruby library which implements the ID3 (information gain) algorithm for decision tree learning
  • rb-libsvm - implementation of SVM, a machine learning and classification algorithm
  • RubyFann - a Ruby gem that binds to FANN (Fast Artificial Neural Network) from within a Ruby/Rails environment
  • tensorflow.rb - TensorFlow for Ruby
  • tensor_stream - A ground-up and standalone reimplementation of TensorFlow for Ruby

Machine Translation

Miscellaneous

  • Abbrev - Calculates the set of unique abbreviations for a given set of strings
  • calyx - A Ruby library for generating text with declarative recursive grammars
  • dialable - A Ruby gem that provides parsing and output of North American Numbering Plan (NANP) phone numbers, and includes location & time zones
  • gibber - Gibber replaces text with nonsensical Latin with a maximum size difference of +/- 30%
  • hiatus - a localization QA tool
  • language_filter - a Ruby gem to detect and optionally filter multiple categories of language
  • Naturally - Natural (version number) sorting with support for legal document numbering, college course codes, and Unicode
  • RLTK - The Ruby Language Toolkit http://chriswailes.github.io/RLTK/
  • ruby-spacy - A wrapper module for using spaCy natural language processing library from the Ruby programming language via PyCall
  • Shellwords - Manipulates strings like the UNIX Bourne shell
  • sort_alphabetical - sort UTF-8 strings alphabetically via an Enumerable extension
  • spintax_parser - A mixin to parse "spintax", a text format used for automated article generation. Can handle nested spintax.
  • stringex - some [hopefully] useful extensions to Ruby’s String class
  • twitter-text - gem that provides text processing routines for Twitter Tweets
  • nameable - A Ruby gem that provides parsing and output of person names, as well as Gender & Ethnicity matching

Multipurpose Tools

The following are libraries that integrate multiple NLP tools or functionality.

  • nlp - NLP tools for the Polish language
  • NlpToolz - Basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer and POS tagger, plus the Berkeley Parser
  • Open NLP (Ruby bindings)
  • Stanford Core NLP (Ruby bindings)
  • Treat - natural language processing framework for Ruby
  • twitter-cldr-rb - TwitterCldr uses Unicode's Common Locale Data Repository (CLDR) to format certain types of text into their localized equivalents
  • ve - a linguistic framework that's easy to use
  • zipf - a collection of various NLP tools and libraries

Named Entity Recognition

  • Confidential Info Redactor - a Ruby gem to semi-automatically redact confidential information from a text
  • ruby-ner - named entity recognition with Stanford NER and Ruby
  • ruby-nlp - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer

Ngrams

  • N-Gram - N-Gram generator in Ruby
  • ngram - break words and phrases into ngrams
  • raingrams - a flexible and general-purpose ngrams library written in Ruby

Numbers

Parsers

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb.

  • linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
  • Parslet - A small PEG based parser library
  • rley - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm
  • Treetop - a Ruby-based parsing DSL based on parsing expression grammars

Part-of-Speech Taggers

  • engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
  • rbtagger - a simple Ruby rule-based part-of-speech tagger
  • TreeTagger for Ruby - Ruby based wrapper for the TreeTagger by Helmut Schmid
  • treetagger-ruby - The Ruby based wrapper for the TreeTagger by Helmut Schmid

Readability

  • lingua - Lingua::EN::Readability is a Ruby module which calculates statistics on English text

Regular Expressions

Online resources

Ruby NLP Presentations

Sentence Generation

  • gabbler - Gab-bler (noun) - rapid, unintelligible talk
  • faker - A library for generating fake data such as names, addresses, and phone numbers
  • kusari - Japanese random sentence generator based on Markov chain
  • literate_randomizer - Using Markov chains, this generates near-English prose.
  • markov-sentence-generator - Generates a random, locally-correct sentence using textual input and a Markov model
  • marky_markov - Markov Chain Generator
  • poem-generator - A generator for gothic poems
  • poetry - poetry generator
  • pwqgen.rb - Ruby implementation of passwdqc's pwqgen, a random pronounceable password generator
  • ramble - library for generating sentences from a yacc grammar
  • token_phrase - A token phrase generator

Sentence Segmentation

Sentence segmentation (aka sentence boundary disambiguation, sentence boundary detection) is the problem in natural language processing of deciding where sentences begin and end. Sentence segmentation is the foundation of many common NLP tasks (machine translation, bitext alignment, summarization, etc.).
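
A short sketch shows why this is harder than it looks. The splitter below is a naive assumption for illustration (break after ., ! or ? when the next word starts with an uppercase letter), not how any particular segmenter works; `naive_sentences` is a hypothetical name.

```ruby
# Naive splitter: breaks on whitespace that follows ., ! or ? and precedes
# an uppercase ASCII letter. Illustrative only.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+(?=[A-Z])/)
end

p naive_sentences("It works. Does it? Yes!")
# => ["It works.", "Does it?", "Yes!"]

p naive_sentences("Dr. Smith arrived at 5 p.m. yesterday.")
# => ["Dr.", "Smith arrived at 5 p.m. yesterday."]
```

The second call splits after the abbreviation "Dr.", which is the kind of boundary ambiguity that dedicated segmentation libraries exist to resolve.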

Speech-to-Text

  • att_speech - A Ruby library for consuming the AT&T Speech API for speech to text
  • pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx
  • Speech2Text - provides a simple interface to convert audio files using the Google Speech to Text API

Stemmers

Stemming is the term used in linguistic morphology and information retrieval to describe the process for reducing inflected (or sometimes derived) words to their word stem, base or root form.
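
As a minimal sketch of the idea, the snippet below strips a handful of common English suffixes. It is a toy suffix stripper, not the Porter or Snowball algorithm; `SUFFIXES` and `naive_stem` are hypothetical names, and the suffix list and the minimum stem length of three characters are assumptions made for this example.

```ruby
# Toy suffix stripper for English. Longer suffixes are listed first so that
# they are tried before their shorter tails.
SUFFIXES = %w[ingly edly ing ed ly es s].freeze

def naive_stem(word)
  word = word.downcase
  # Pick the first suffix that matches and leaves at least three characters.
  suffix = SUFFIXES.find { |s| word.end_with?(s) && word.length - s.length >= 3 }
  suffix ? word[0, word.length - suffix.length] : word
end

p %w[jumping jumped jumps].map { |w| naive_stem(w) }
# => ["jump", "jump", "jump"]

p naive_stem("running")
# => "runn"
```

The "runn" result shows the limits of blind suffix removal; real stemmers add rules for doubled consonants and many other cases.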

Stop Words

  • clarifier
  • stopwords - really just a list of stopwords with some helpers
  • Stopwords Filter - a very simple and naive implementation of a stopwords filter that removes a list of banned words (stopwords) from a sentence

Summarization

Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.

  • Epitome - A small gem to make your text shorter; an implementation of the Lexrank algorithm
  • ots - Ruby bindings to open text summarizer
  • summarize - Ruby C wrapper for Open Text Summarizer
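
One simple way to approximate this is extractive summarization by word frequency, sketched below. This is not LexRank and not what Open Text Summarizer does internally; `STOP_WORDS`, `content_words`, and `summarize` are hypothetical names, and the tiny stop-word list is an assumption for the example.

```ruby
# Frequency-based extractive summarizer. Illustrative only.
STOP_WORDS = %w[a an and are in is it of the to].freeze

# Lowercased ASCII words with stop words removed.
def content_words(text)
  text.downcase.scan(/[a-z]+/) - STOP_WORDS
end

def summarize(text, max_sentences: 1)
  sentences = text.split(/(?<=[.!?])\s+/)
  frequencies = Hash.new(0)
  content_words(text).each { |word| frequencies[word] += 1 }

  # Score each sentence by the summed document frequency of its content words.
  top = sentences.max_by(max_sentences) do |sentence|
    content_words(sentence).sum { |word| frequencies[word] }
  end

  # Return the selected sentences in their original order.
  sentences.select { |sentence| top.include?(sentence) }
end

text = "Ruby is a dynamic language. Ruby has many NLP gems for text. The weather is nice."
p summarize(text)
# => ["Ruby has many NLP gems for text."]
```

Summing frequencies favors long sentences, so practical approaches usually normalize by sentence length or use graph-based ranking instead.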

Text Extraction

  • docsplit - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
  • rtesseract - Ruby library for working with the Tesseract OCR
  • Ruby Readability - a tool for extracting the primary readable content of a webpage
  • ruby-tesseract - This wrapper binds the TessBaseAPI object through ffi-inline (which means it will work on JRuby too) and then proceeds to wrap said API in a more ruby-esque Engine class
  • Yomu - a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit

Text Similarity

  • amatch - a collection of five types of distance between strings (including Levenshtein, Sellers, Jaro-Winkler, and 'pair distance'; the last one seems to work well for finding similarity in long phrases)
  • damerau-levenshtein - calculates edit distance using the Damerau-Levenshtein algorithm
  • FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
  • fuzzy-string-match - fuzzy string matching library for ruby
  • FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
  • Going the Distance - contains scripts that do various distance calculations
  • hotwater - Fast Ruby FFI string edit distance algorithms
  • levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
  • soundex - A soundex function coded in Ruby
  • text - Collection of text algorithms
  • TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
  • tf-idf-similarity - calculate the similarity between texts using tf*idf

Text-to-Speech

  • espeak-ruby - small Ruby API for utilizing 'espeak' and 'lame' to create text-to-speech mp3 files
  • Isabella - a voice-computing assistant built in Ruby
  • tts - a Ruby gem for converting text to speech using the Google Translate service

Tokenizers

  • Jieba - Chinese tokenizer and segmenter (JRuby)
  • MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
  • NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
  • Pragmatic Tokenizer - a multilingual tokenizer to split a string into tokens
  • rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
  • Textoken - Simple and customizable text tokenization gem
  • thailang4r - Thai tokenizer
  • tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
  • tokenizer - a simple multilingual tokenizer

Word Count

  • wc - a rubygem to count word occurrences in a given text
  • word_count - a word counter for String and Hash in Ruby
  • Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
  • WordsCounted - a highly customisable Ruby text analyser
