  • Stars: 108
  • Rank 321,259 (Top 7 %)
  • Language: Ruby
  • License: MIT License
  • Created about 12 years ago
  • Updated about 3 years ago

Repository Details

Lemmatizer for text in English. Inspired by Python's nltk.corpus.reader.wordnet.morphy

lemmatizer

Lemmatizer for English text, inspired by the morphy function in Python's nltk.corpus.reader.wordnet module.

Based on code posted by mtbr in his blog entry "WordNet-based lemmatizer".

Version 0.2 added the ability to supply user data at runtime.

Installation

sudo gem install lemmatizer

Usage

require "lemmatizer"
  
lem = Lemmatizer.new
  
p lem.lemma("dogs",    :noun ) # => "dog"
p lem.lemma("hired",   :verb ) # => "hire"
p lem.lemma("hotter",  :adj  ) # => "hot"
p lem.lemma("better",  :adv  ) # => "well"
  
# When the part-of-speech symbol is omitted as the second argument,
# lemmatizer tries :verb, :noun, :adj, and :adv, in this order.
p lem.lemma("fired")           # => "fire"
p lem.lemma("slow")            # => "slow"
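
The fallback order described above can be pictured with a small stand-alone sketch. It does not use the gem: `TOY_LEXICON`, `POS_ORDER`, and `lemma_with_fallback` are hypothetical names, and the tiny tables stand in for the real WordNet data, so this only illustrates the "try each part of speech in turn" idea, not the gem's internals.

```ruby
# Hypothetical stand-in for per-POS dictionaries; not the gem's real data.
TOY_LEXICON = {
  verb: { "fired" => "fire", "hired" => "hire" },
  noun: { "dogs" => "dog" },
  adj:  { "hotter" => "hot" },
  adv:  { "better" => "well" }
}.freeze

# Order in which parts of speech are tried when none is given.
POS_ORDER = %i[verb noun adj adv].freeze

# Returns the first lemma found when trying each candidate POS in order,
# or the word itself when no table contains it.
def lemma_with_fallback(word, pos = nil)
  candidates = pos ? [pos] : POS_ORDER
  candidates.each do |tag|
    found = TOY_LEXICON.fetch(tag, {})[word]
    return found if found
  end
  word
end

p lemma_with_fallback("fired")         # => "fire"
p lemma_with_fallback("dogs")          # => "dog"
p lemma_with_fallback("dogs", :verb)   # => "dogs"
p lemma_with_fallback("MacBooks")      # => "MacBooks"
```

Passing an explicit part of speech restricts the search to that one table, which is why "dogs" comes back unchanged when looked up as a verb in this toy setup.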

Limitations

# Lemmatizer leaves alone words that its dictionary does not contain.
# This keeps proper names such as "James" intact.
p lem.lemma("MacBooks", :noun) # => "MacBooks" 
  
# If an inflected form is itself listed as a lemma in the word index,
# lemmatizer may not return the expected result.
p lem.lemma("higher", :adj) # => "higher" not "high"!

# This happens because "higher" is itself an entry word listed in dict/index.adj.
# To fix this, either modify the original dict directly (lib/dict/index.{noun|verb|adj|adv})
# or supply custom dict files (recommended).
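
One way to picture why "higher" comes back unchanged is the toy sketch below. It assumes a morphy-style lookup in which a form found in the index is returned as is, and suffix stripping is only attempted otherwise. This ordering is an assumption made for illustration, not a reading of the gem's source, and `TOY_ADJ_INDEX` and `toy_adj_lemma` are hypothetical names.

```ruby
require "set"

# Toy adjective index; purely illustrative, not the contents of dict/index.adj.
TOY_ADJ_INDEX = Set["high", "higher", "tall"].freeze

# Assumed lookup order: a form found in the index is returned unchanged;
# only otherwise is the "-er" suffix stripped and the stem looked up.
def toy_adj_lemma(form)
  return form if TOY_ADJ_INDEX.include?(form)

  stem = form.sub(/er\z/, "")
  TOY_ADJ_INDEX.include?(stem) ? stem : form
end

p toy_adj_lemma("taller") # => "tall"
p toy_adj_lemma("higher") # => "higher"
```

Under this assumption, an override table consulted before the index (as user dict files are described as doing) is the natural fix, since it never touches the shipped data.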

Supplying a user dict

# You can supply custom dict files consisting of lines in the format <pos>\s+<form>\s+<lemma>.
# Data in user-supplied files overrides the preset data. Here is a sample.

# --- sample.dict1.txt (don't include hash symbol on the left) ---
# adj   higher   high
# adj   highest  high
# noun  MacBooks MacBook
# ---------------------------------------------------------------

lem = Lemmatizer.new("sample.dict1.txt")

p lem.lemma("higher", :adj)     # => "high"
p lem.lemma("highest", :adj)    # => "high"
p lem.lemma("MacBooks", :noun)  # => "MacBook"

# The argument to Lemmatizer.new can be either of the following:
# 1) a path string to a dict file (e.g. "/path/to/dict.txt")
# 2) an array of paths to dict files (e.g. ["./dict/noun.txt", "./dict/verb.txt"])
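
As a rough illustration of the `<pos>\s+<form>\s+<lemma>` line format, the sketch below reads such lines into a nested hash. `parse_user_dict` is a hypothetical helper written for this example and is not part of the gem; it can be handy for checking a dict file before passing its path to Lemmatizer.new.

```ruby
# Parses lines of "<pos> <form> <lemma>" into { pos => { form => lemma } }.
# Blank lines and lines with fewer than three fields are skipped.
def parse_user_dict(text)
  entries = Hash.new { |hash, key| hash[key] = {} }
  text.each_line do |line|
    line = line.strip
    next if line.empty?

    # Limit of 3 keeps any further whitespace inside the last field.
    pos, form, lemma = line.split(/\s+/, 3)
    next unless lemma

    entries[pos.to_sym][form] = lemma
  end
  entries
end

sample = <<~DICT
  adj   higher   high
  adj   highest  high
  noun  MacBooks MacBook
DICT

user_dict = parse_user_dict(sample)
p user_dict[:adj]["higher"]     # => "high"
p user_dict[:noun]["MacBooks"]  # => "MacBook"
```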

Resolving abbreviations

# You can use 'abbr' tag in user dicts to resolve abbreviations in text.

# --- sample.dict2.txt (don't include hash symbol on the left) ---
# abbr  utexas   University of Texas
# abbr  mit      Massachusetts Institute of Technology
# ---------------------------------------------------------------

# <NOTE>
# 1. Expressions on the right (substitutes) can contain whitespace,
#    while expressions in the middle (words to be replaced) cannot.
# 2. Double or single quotation marks can be used around substitute expressions,
#    but not around original expressions.

lem = Lemmatizer.new("sample.dict2.txt")

p lem.lemma("utexas", :abbr) # => "University of Texas"
p lem.lemma("mit", :abbr)    # => "Massachusetts Institute of Technology"
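
The two notes above (substitutes may contain whitespace and may be quoted, while the word being replaced may not) can be sketched without the gem as follows. `parse_abbr_line` and `expand_abbreviations` are hypothetical helpers for illustration only, and the quote handling shown is one plausible reading of the notes rather than the gem's actual behavior.

```ruby
# Splits one "abbr <word> <substitute>" line. The limit of 3 means the
# substitute keeps its inner spaces; the word itself cannot contain any.
def parse_abbr_line(line)
  tag, word, substitute = line.strip.split(/\s+/, 3)
  return nil unless tag == "abbr" && substitute

  # Drop one pair of matching surrounding quotes, if present.
  substitute = substitute.sub(/\A(["'])(.*)\1\z/, '\2')
  [word, substitute]
end

# Replaces every whitespace-separated token that has an entry in the table.
def expand_abbreviations(text, table)
  text.split(/\s+/).map { |token| table.fetch(token, token) }.join(" ")
end

lines = [
  'abbr  utexas   University of Texas',
  'abbr  mit      "Massachusetts Institute of Technology"'
]
table = lines.map { |line| parse_abbr_line(line) }.compact.to_h

p table["utexas"]  # => "University of Texas"
p expand_abbreviations("she studied at mit", table)
# => "she studied at Massachusetts Institute of Technology"
```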

Author

Thanks for assistance and contributions:

License

Licensed under the MIT license.

More Repositories

1

openai-chat-api-workflow

🎩 An Alfred 5 Workflow for using OpenAI Chat API to interact with GPT-4o 🤖💬 It also allows image generation 🖼️, image understanding 👀, speech-to-text conversion 🎤, and text-to-speech synthesis 🔈
299
star
2

engtagger

English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
Ruby
262
star
3

wp2txt

A command-line toolkit to extract text content and category data from Wikipedia dump files
Ruby
172
star
4

rsyntaxtree

Syntax tree generator for linguistic research
Ruby
98
star
5

whisper-stream

A bash script using OpenAI Whisper API for continuous audio transcription with automatic silence detection
Shell
88
star
6

ruby-spacy

A wrapper module for using spaCy natural language processing library from the Ruby programming language via PyCall
Ruby
63
star
7

fzf-alfred-workflow

An Alfred workflow to fuzzy-find files/directories using fzf and fd.
55
star
8

deepl-alfred-translate-rewrite-workflow

An Alfred workflow to help translate and rewrite text using DeepL API
31
star
9

monadic-chat

🤖 + 🐳 + 🐧 Monadic Chat is a framework designed to create and use intelligent chatbots. By providing a full-fledged Linux environment on Docker to GPT-4 and other LLMs, it allows the chatbots to perform advanced tasks that require external tools for searching, coding, testing, analysis, visualization, and more.
Ruby
24
star
10

fastmail-plus

A Chrome extension to make Fastmail web UI more usable and productive
JavaScript
21
star
11

vim-command-workflow

An Alfred workflow to search Vim command cheat sheet + type commands
Ruby
20
star
12

rginger

RGinger takes an English sentence and gives correction and rephrasing suggestions for it using Ginger proofreading API.
Ruby
17
star
13

monadic-chat-cli

Highly configurable CLI app for OpenAI's chat/text completion API
Ruby
10
star
14

rubyfca

Command line tool for Formal Concept Analysis written in Ruby
Ruby
7
star
15

code-packager

📦 A set of bash scripts that package and unpack your codebase into and from a single JSON file, ready to be analyzed and understood by large language models (LLMs) like GPT, Claude, Command R, and Gemini 🤖
Shell
7
star
16

finder-unclutter

An Alfred 🎩 workflow that removes duplicate Finder tabs and windows and arranges them into a single or dual-pane 👓 layout for a cleaner desktop experience 🖥️ 🧹
6
star
17

rubyplb

Command line Pattern Lattice building tool written in Ruby
Ruby
4
star
18

paradocs

Paradocs: A Paragraph-Oriented Text Document Presentation System
4
star
19

objective-wordnet

3
star
20

mac-dictionary-selector

An Alfred3 Workflow that lets you quickly look up words from a variety of dictionaries preinstalled in OSX
Ruby
3
star
21

ruby-wordle

A set of ruby scripts to generate word-lists, solve Wordle and play Wordle
Ruby
2
star
22

five-block-timer

⏱️ Five Block Timer is a flexible and customizable web-based timer app designed to help manage time effectively. It allows for the creation of up to four distinct time blocks plus an initial countdown block, making it ideal for various timing needs such as conference talks, exams, or productivity sessions.
JavaScript
1
star
23

quickanswers

QuickAnswers
JavaScript
1
star
24

rsyntaxtree_web

JavaScript
1
star
25

speak_slow

SpeakSlow modifies audio files adding pauses and/or altering speed to suit for language study
Ruby
1
star