EngTagger

English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger

Description

A Ruby port of the Perl module Lingua::EN::Tagger, a probability-based, corpus-trained tagger that assigns POS tags to English text using a lookup dictionary and a set of probability values. Tags are chosen by conditional probability: the tagger examines the preceding tag to determine the appropriate tag for the current word. Unknown words are classified according to word morphology, or can be set to be treated as nouns or other parts of speech. The tagger also extracts as many nouns and noun phrases as it can, using a set of regular expressions.
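The idea of picking a tag from the preceding tag can be sketched with a toy example. This is only an illustration of the general technique, not EngTagger's actual code: the lexicon, the transition values, the suffix rules, and the names `LEXICON`, `TRANSITIONS`, `guess_tags`, and `tag_words` are all invented for this sketch.

```ruby
# Toy illustration only: every probability below is made up,
# not taken from EngTagger's data files.

# P(tag | word): how often each word carries each tag (hypothetical values).
LEXICON = {
  "the"  => { "det" => 1.0 },
  "to"   => { "to" => 1.0 },
  "duck" => { "nn" => 0.6, "vb" => 0.4 }
}.freeze

# P(tag | previous tag): hypothetical transition values.
TRANSITIONS = {
  "det" => { "nn" => 0.8, "vb" => 0.01 },
  "to"  => { "vb" => 0.8, "nn" => 0.1 }
}.freeze

# Fallback for words missing from the lexicon: guess from the suffix,
# defaulting to a common noun.
def guess_tags(word)
  case word
  when /ly\z/  then { "rb" => 1.0 }
  when /ing\z/ then { "vbg" => 1.0 }
  when /ed\z/  then { "vbd" => 1.0 }
  else              { "nn" => 1.0 }
  end
end

# Greedy left-to-right tagging: score each candidate tag by
# P(tag | word) * P(tag | previous tag) and keep the best one.
def tag_words(words)
  prev = nil
  words.map do |word|
    candidates = LEXICON.fetch(word.downcase) { guess_tags(word.downcase) }
    best, = candidates.max_by do |tag, word_prob|
      # 0.05 is an arbitrary floor for transitions missing from the table.
      word_prob * TRANSITIONS.fetch(prev, {}).fetch(tag, 0.05)
    end
    prev = best
    [word, best]
  end
end

tag_words(%w[the duck]) #=> [["the", "det"], ["duck", "nn"]]
tag_words(%w[to duck])  #=> [["to", "to"], ["duck", "vb"]]
```

The same word, "duck", comes out as a noun after a determiner and as a verb after "to", which is the effect the preceding tag has in a tagger of this kind.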

Features

  • Assigns POS tags to English text
  • Extracts nouns and noun phrases from tagged text
  • Returns nouns, proper nouns, past-tense verbs, and adjectives with occurrence counts

Synopsis

require 'engtagger'

# Create a parser object
tgr = EngTagger.new

# Sample text
text = "Alice chased the big fat cat."

# Add part-of-speech tags to text
tagged = tgr.add_tags(text)

#=> "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> <jj>big</jj> <jj>fat</jj> <nn>cat</nn> <pp>.</pp>"

# Get a list of all nouns and noun phrases with occurrence counts
word_list = tgr.get_words(text)

#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}

# Get a readable version of the tagged text
readable = tgr.get_readable(text)

#=> "Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP"

# Get all nouns from a tagged output
nouns = tgr.get_nouns(tagged)

#=> {"cat"=>1, "Alice"=>1}

# Get all proper nouns
proper = tgr.get_proper_nouns(tagged)

#=> {"Alice"=>1}

# Get all past tense verbs
pt_verbs = tgr.get_past_tense_verbs(tagged)

#=> {"chased"=>1}

# Get all the adjectives
adj = tgr.get_adjectives(tagged)

#=> {"big"=>1, "fat"=>1}

# Get all noun phrases of any syntactic level
# (same result as word_list, but takes tagged input)
nps = tgr.get_noun_phrases(tagged)

#=> {"Alice"=>1, "cat"=>1, "fat cat"=>1, "big fat cat"=>1}
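If a tag is needed that has no dedicated getter in the synopsis above, the tagged string can be post-processed directly. The following is a minimal sketch, assuming the `<tag>word</tag>` format shown for `add_tags`; `tagged_pairs` and `words_with_tag` are hypothetical helpers, not part of EngTagger, and `tally` requires Ruby 2.7 or later.

```ruby
# A string in the same format as the add_tags output above,
# hard-coded so the sketch runs without the gem installed.
tagged = "<nnp>Alice</nnp> <vbd>chased</vbd> <det>the</det> " \
         "<jj>big</jj> <jj>fat</jj> <nn>cat</nn> <pp>.</pp>"

# Split "<tag>word</tag>" tokens into [word, TAG] pairs.
# Assumes a token never contains "<", which holds for this sample.
def tagged_pairs(tagged)
  tagged.scan(%r{<(\w+)>([^<]+)</\1>}).map { |tag, word| [word, tag.upcase] }
end

# Count the words carrying a given tag, mirroring the {"word"=>count}
# shape of the get_* results.
def words_with_tag(tagged, tag)
  tagged_pairs(tagged).select { |_, t| t == tag.upcase }.map(&:first).tally
end

determiners = words_with_tag(tagged, "det")
#=> {"the"=>1}

adjectives = words_with_tag(tagged, "jj")
#=> {"big"=>1, "fat"=>1}
```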

Tag Set

The set of POS tags used here is a modified version of the Penn Treebank tagset. Tags with non-letter characters have been redefined to work better in our data structures. The determiner tag has also been renamed from DT to DET to avoid confusion with the HTML tag <DT>.

CC      Conjunction, coordinating               and, or
CD      Adjective, cardinal number              3, fifteen
DET     Determiner                              this, each, some
EX      Pronoun, existential there              there
FW      Foreign words
IN      Preposition / Conjunction               for, of, although, that
JJ      Adjective                               happy, bad
JJR     Adjective, comparative                  happier, worse
JJS     Adjective, superlative                  happiest, worst
LS      Symbol, list item                       A, A.
MD      Verb, modal                             can, could, 'll
NN      Noun                                    aircraft, data
NNP     Noun, proper                            London, Michael
NNPS    Noun, proper, plural                    Australians, Methodists
NNS     Noun, plural                            women, books
PDT     Determiner, prequalifier                quite, all, half
POS     Possessive                              's, '
PRP     Determiner, possessive second           mine, yours
PRPS    Determiner, possessive                  their, your
RB      Adverb                                  often, not, very, here
RBR     Adverb, comparative                     faster
RBS     Adverb, superlative                     fastest
RP      Adverb, particle                        up, off, out
SYM     Symbol                                  *
TO      Preposition                             to
UH      Interjection                            oh, yes, mmm
VB      Verb, infinitive                        take, live
VBD     Verb, past tense                        took, lived
VBG     Verb, gerund                            taking, living
VBN     Verb, past/passive participle           taken, lived
VBP     Verb, base present form                 take, live
VBZ     Verb, present 3SG -s form               takes, lives
WDT     Determiner, question                    which, whatever
WP      Pronoun, question                       who, whoever
WPS     Determiner, possessive & question       whose
WRB     Adverb, question                        when, how, however

PP      Punctuation, sentence ender             ., !, ?
PPC     Punctuation, comma                      ,
PPD     Punctuation, dollar sign                $
PPL     Punctuation, quotation mark left        ``
PPR     Punctuation, quotation mark right       ''
PPS     Punctuation, colon, semicolon, ellipsis :, ..., -
LRB     Punctuation, left bracket               (, {, [
RRB     Punctuation, right bracket              ), }, ]
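When output has to be compared with tools that use the unmodified Penn Treebank names, the renamed tags can be mapped back. The sketch below is a hypothetical helper, not part of EngTagger, and the right-hand side of `ENGTAGGER_TO_PENN` is an assumed correspondence based on the standard Penn Treebank tag names; check it against the tagset you are targeting.

```ruby
# Hypothetical mapping, not provided by EngTagger. The Penn Treebank
# names on the right are an assumption for illustration.
ENGTAGGER_TO_PENN = {
  "DET"  => "DT",
  "PRPS" => "PRP$",
  "WPS"  => "WP$",
  "PP"   => ".",
  "PPC"  => ",",
  "PPD"  => "$",
  "PPL"  => "``",
  "PPR"  => "''",
  "PPS"  => ":",
  "LRB"  => "-LRB-",
  "RRB"  => "-RRB-"
}.freeze

# Rewrite "word/TAG" tokens (the get_readable format) with Penn Treebank
# tags. Tags without an entry in the mapping are passed through unchanged.
def to_penn(readable)
  readable.split.map do |token|
    # rpartition splits on the last "/", so a word containing "/" survives.
    word, _, tag = token.rpartition("/")
    "#{word}/#{ENGTAGGER_TO_PENN.fetch(tag, tag)}"
  end.join(" ")
end

to_penn("Alice/NNP chased/VBD the/DET big/JJ fat/JJ cat/NN ./PP")
#=> "Alice/NNP chased/VBD the/DT big/JJ fat/JJ cat/NN ./."
```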

Install

gem install engtagger

Author of this Ruby library

  • Yoichiro Hasebe (yohasebe [at] gmail.com)

Contributors

Many thanks to the collaborators listed in the right column of this GitHub page.

Acknowledgement

This Ruby library is a direct port of Lingua::EN::Tagger, available on CPAN. Credit for the crucial parts of its algorithm and design therefore goes to Aaron Coburn, the author of the original Perl version.

License

This library is distributed under the GPL. Please see the LICENSE file.
