Ruby gem to calculate the similarity between texts using tf*idf

Ruby Vector Space Model (VSM) with tf*idf weights


Calculates the similarity between texts using a bag-of-words Vector Space Model with Term Frequency-Inverse Document Frequency (tf*idf) weights. If your use case demands performance, use Lucene (see below).

Usage

require 'matrix'
require 'tf-idf-similarity'

Create a set of documents:

document1 = TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
document2 = TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
document3 = TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
corpus = [document1, document2, document3]

Create a document-term matrix using the Term Frequency-Inverse Document Frequency function:

model = TfIdfSimilarity::TfIdfModel.new(corpus)

Or, create a document-term matrix using the Okapi BM25 ranking function:

model = TfIdfSimilarity::BM25Model.new(corpus)
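
For intuition about what a BM25 weight looks like, here is a minimal sketch of the textbook Okapi BM25 formula in plain Ruby. The helper `bm25_weight`, its parameter names and the defaults `k1 = 1.2` and `b = 0.75` are assumptions made for illustration; they are not part of the gem's API, and the gem's exact variant may differ.

```ruby
# Hypothetical helper, not part of the gem: textbook Okapi BM25 weight of one
# term in one document.
#
# count         - occurrences of the term in the document
# doc_size      - number of tokens in the document
# avg_doc_size  - average number of tokens per document in the corpus
# doc_count     - number of documents in the corpus
# doc_frequency - number of documents that contain the term
# k1, b         - free parameters (assumed defaults, commonly used values)
def bm25_weight(count, doc_size, avg_doc_size, doc_count, doc_frequency, k1: 1.2, b: 0.75)
  idf = Math.log((doc_count - doc_frequency + 0.5) / (doc_frequency + 0.5))
  length_norm = 1 - b + b * doc_size.to_f / avg_doc_size
  idf * (count * (k1 + 1)) / (count + k1 * length_norm)
end

# Term frequency saturates: each extra occurrence adds less weight.
[1, 2, 4, 8].each do |count|
  puts format('%d occurrences: %.3f', count, bm25_weight(count, 100, 100, 1000, 10))
end
```

Unlike a plain tf*idf weight, the BM25 weight is bounded as the term count grows and is reduced for documents longer than average.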

Create a similarity matrix:

matrix = model.similarity_matrix
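
To show what a similarity matrix contains, the sketch below computes pairwise cosine similarities from a small weight matrix using plain arrays. It assumes the similarity measure is cosine similarity; the `weights` data and the helpers `unit_rows` and `similarity_matrix` are hypothetical and only mimic the idea, not the gem's implementation.

```ruby
# Hypothetical stand-in for a weighted document-term matrix:
# one row per document, one column per term.
weights = [
  [1.0, 1.0, 0.0],
  [0.0, 1.0, 1.0],
  [0.0, 0.0, 2.0],
]

# Scale each row to unit length so that dot products become cosines.
def unit_rows(rows)
  rows.map do |row|
    norm = Math.sqrt(row.sum { |w| w * w })
    norm.zero? ? row : row.map { |w| w / norm }
  end
end

# Entry [i][j] is the cosine similarity between documents i and j.
def similarity_matrix(rows)
  unit = unit_rows(rows)
  unit.map do |a|
    unit.map { |b| a.zip(b).sum { |x, y| x * y } }
  end
end

similarity_matrix(weights).each do |row|
  puts row.map { |value| format('%.2f', value) }.join(' ')
end
```

The result is symmetric, and every document has similarity 1 with itself, so only the entries above the diagonal carry information.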

Find the similarity of two documents in the matrix:

matrix[model.document_index(document1), model.document_index(document2)]

Print the tf*idf values for terms in a document:

tfidf_by_term = {}
document1.terms.each do |term|
  tfidf_by_term[term] = model.tfidf(document1, term)
end
puts tfidf_by_term.sort_by{|_,tfidf| -tfidf}

Tokenize a document yourself, for example by excluding stop words:

require 'unicode_utils'
text = "Lorem ipsum dolor sit amet..."
tokens = UnicodeUtils.each_word(text).to_a - ['and', 'the', 'to']
document1 = TfIdfSimilarity::Document.new(text, :tokens => tokens)

Supply the term counts and the number of tokens in the document yourself:

require 'unicode_utils'
text = "Lorem ipsum dolor sit amet..."
tokens = UnicodeUtils.each_word(text).to_a - ['and', 'the', 'to']
term_counts = Hash.new(0)
size = 0
tokens.each do |token|
  # Unless the token is numeric.
  unless token[/\A\d+\z/]
    # Remove all punctuation from tokens.
    term_counts[token.gsub(/\p{Punct}/, '')] += 1
    size += 1
  end
end
document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)

Or, use your own classes for the tokenizer and tokens, like in this example.

Read the documentation at RubyDoc.info.

Troubleshooting

NoMethodError: undefined method `[]' for Matrix:Module

The matrix gem conflicts with Ruby's internal Matrix module. Don't use the matrix gem.

Speed

Instead of using the Ruby Standard Library's Matrix class, you can use one of the GNU Scientific Library (GSL), NArray or NMatrix (0.0.9 or greater) gems for faster matrix operations. For example:

require 'narray'
model = TfIdfSimilarity::TfIdfModel.new(corpus, :library => :narray)

NArray seems to have the best performance of the three libraries.

The NMatrix gem gives access to Automatically Tuned Linear Algebra Software (ATLAS), which you may know of through Linear Algebra PACKage (LAPACK) or Basic Linear Algebra Subprograms (BLAS). Follow these instructions to install the NMatrix gem.

Extras

You can access more term frequency, document frequency, and normalization formulas with:

require 'tf-idf-similarity/extras/document'
require 'tf-idf-similarity/extras/tf_idf_model'

The default tf*idf formula follows the Lucene Conceptual Scoring Formula.
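
As a rough illustration, the sketch below assumes the classic Lucene term weights: the square root of the term count for tf, and 1 + log(N / (df + 1)) for idf. These formulas, the helper names and the toy corpus are assumptions; check the gem's source for the exact definitions it uses.

```ruby
# Assumed classic Lucene term frequency: square root of the raw count.
def lucene_tf(count)
  Math.sqrt(count)
end

# Assumed classic Lucene inverse document frequency.
# doc_count is the corpus size; doc_frequency is the number of documents
# that contain the term.
def lucene_idf(doc_count, doc_frequency)
  1 + Math.log(doc_count / (doc_frequency + 1.0))
end

def lucene_tfidf(count, doc_count, doc_frequency)
  lucene_tf(count) * lucene_idf(doc_count, doc_frequency)
end

# Hypothetical, pre-tokenized toy corpus.
corpus = [
  %w[lorem ipsum dolor],
  %w[ipsum ipsum dui],
  %w[dui sed leo],
]

# Count the documents in which each term appears.
doc_frequency = Hash.new(0)
corpus.each { |tokens| tokens.uniq.each { |term| doc_frequency[term] += 1 } }

# Weights for the terms of the second document.
corpus[1].tally.each do |term, count|
  puts format('%-6s %.3f', term, lucene_tfidf(count, corpus.size, doc_frequency[term]))
end
```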

Why?

At the time of writing, no other Ruby gem implemented the tf*idf formula used by Lucene, Sphinx and Ferret.

Term frequencies

  • The vss gem does not normalize the frequency of a term in a document; this occurs frequently in the academic literature, but only to demonstrate why normalization is important.
  • The tf_idf and similarity gems normalize the frequency of a term in a document to the number of terms in that document, which never occurs in the literature.
  • The tf-idf gem normalizes the frequency of a term in a document to the number of unique terms in that document, which never occurs in the literature.
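
The term frequency variants described above can be compared side by side. This is a sketch in plain Ruby; `tf_variants` is a hypothetical helper, and the square-root variant is included on the assumption that it is the Lucene-style formula this gem follows.

```ruby
# Hypothetical helper: several ways to turn a raw term count into a term frequency.
def tf_variants(tokens, term)
  counts = tokens.tally
  count = counts.fetch(term, 0)
  {
    raw: count,                                 # no normalization
    per_token: count.to_f / tokens.size,        # divided by the number of terms
    per_unique_term: count.to_f / counts.size,  # divided by the number of unique terms
    square_root: Math.sqrt(count),              # assumed Lucene-style damping
  }
end

tokens = %w[to be or not to be that is the question]
tf_variants(tokens, 'be').each do |name, value|
  puts format('%-16s %.3f', name, value)
end
```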

Document frequencies

  • The vss gem does not normalize the inverse document frequency.
  • The treat, tf_idf, tf-idf and similarity gems use variants of the typical inverse document frequency formula.

Normalization

  • The treat, tf_idf, tf-idf, rsemantic and vss gems have no normalization component.
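
To see why a normalization component matters, the sketch below compares a raw dot product with a length-normalized (cosine) score. It assumes that normalization here means scaling for document length; the vectors and the helpers `dot` and `cosine` are hypothetical.

```ruby
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Dot product divided by the lengths of both vectors.
def cosine(a, b)
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
end

query   = [1.0, 1.0, 0.0]
short   = [1.0, 2.0, 1.0]
doubled = short.map { |w| w * 2 } # same content, every weight doubled

# Without normalization, the longer document scores twice as high.
puts dot(query, short)
puts dot(query, doubled)

# With normalization, both documents score the same.
puts cosine(query, short)
puts cosine(query, doubled)
```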

Additional adapters

Adapters for the following projects were also considered:

  • Ruby-LAPACK is a very thin wrapper around LAPACK, which has an opaque Fortran-style naming scheme.
  • Linalg and RNum give access to LAPACK from Ruby but are old and unavailable as gems.

Further Reading

Lucene implements many more similarity functions and can even combine similarity measures.

Copyright (c) 2012 James McKinney, released under the MIT license
