  • Language
    Ruby
  • License
    Other
  • Created over 13 years ago
  • Updated about 8 years ago

Repository Details

Calculate similarity between documents using TF-IDF weights

Similarity

Overview

A Ruby library for calculating the similarity between pieces of text using Term Frequency-Inverse Document Frequency (TF-IDF) weights.

A bag of words model is used. Terms in the source documents are downcased and punctuation is removed, but stemming is not currently implemented.
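As a rough illustration of the bag of words step, the sketch below downcases a string, strips punctuation, and counts the remaining terms. The `bag_of_words` helper and its regular expression are assumptions made for this example; they are not taken from the library's source.

```ruby
# Hypothetical helper, not part of the library's API: a minimal
# bag-of-words tokenizer that downcases text and strips punctuation.
def bag_of_words(text)
  text.downcase
      .gsub(/[^a-z0-9\s]/, "") # drop anything that is not a letter, digit or space
      .split
      .each_with_object(Hash.new(0)) { |term, counts| counts[term] += 1 }
end

counts = bag_of_words("Chunky bacon, chunky BACON!")
puts counts.inspect
```

Because no stemming is applied, "word" and "words" would be counted as two different terms here.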

This library was written to help create the kind of diagrams described by Jonathan Stray in his post on full-text visualization of the Iraq War Logs. An example of how to generate a Gephi-compatible file, including labelling nodes with keywords, is included in the examples directory.

The library depends on the GNU Scientific Library (GSL) and the gsl Ruby gem. It does not use sparse matrix representations to speed up the calculations, since there is no support for them in the GSL. I am currently looking into fixing this, and would appreciate any help!

Dependencies

Similarity depends on the GNU Scientific Library and the gsl Ruby gem. On OS X with Homebrew (https://github.com/mxcl/homebrew), the GSL can be installed with:

brew install gsl

The gsl gem should then install normally. For other platforms, please add the information to the wiki and I'll add it to this readme.

Usage

First, load some documents into the corpus:

require 'similarity'

corpus = Corpus.new

doc1 = Document.new(:content => "A document with a lot of additional words some of which are about chunky bacon")
doc2 = Document.new(:content => "Another longer document with many words and again about chunky bacon")
doc3 = Document.new(:content => "Some text that has nothing to do with pork products")

[doc1, doc2, doc3].each { |doc| corpus << doc }

Then, to compare documents, use the similar_documents method:

corpus.similar_documents(doc1).each do |doc, similarity|
  puts "Similarity between doc #{doc1.id} and doc #{doc.id} is #{similarity}"
end

#=>
 Similarity between doc 70137042580340 and doc 70137042580340 is 0.9999999999999997
 Similarity between doc 70137042580340 and doc 70137042580240 is 0.06068602112714361
 Similarity between doc 70137042580340 and doc 70137042580160 is 0.04882114791611661
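A score close to 1.0 for a document compared with itself is what a cosine similarity over TF-IDF vectors would produce. The pure-Ruby sketch below shows that idea without GSL. It is an assumption about the general technique, not the library's implementation: the method names, the smoothed IDF formula, and the use of cosine similarity are all chosen for illustration, so the numbers it prints will not match the output above.

```ruby
# Pure-Ruby sketch of TF-IDF weighting plus cosine similarity.
# All names here are illustrative; none of them belong to the gem's API.

# Count terms after downcasing and stripping punctuation.
def term_counts(text)
  text.downcase.gsub(/[^a-z0-9\s]/, "").split
      .each_with_object(Hash.new(0)) { |term, counts| counts[term] += 1 }
end

# Build one TF-IDF vector (a Hash of term => weight) per input text.
def tfidf_vectors(texts)
  counts = texts.map { |text| term_counts(text) }
  doc_freq = Hash.new(0)
  counts.each { |c| c.each_key { |term| doc_freq[term] += 1 } }
  n = texts.size.to_f

  counts.map do |c|
    total = c.values.sum.to_f
    c.each_with_object({}) do |(term, count), vec|
      # Smoothed IDF (an assumption), so a term that appears in every
      # document still keeps a small non-zero weight.
      idf = Math.log((1 + n) / (1 + doc_freq[term])) + 1
      vec[term] = (count / total) * idf
    end
  end
end

# Cosine of the angle between two sparse vectors stored as Hashes.
def cosine(a, b)
  dot = a.sum { |term, weight| weight * b.fetch(term, 0.0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = norm.call(a) * norm.call(b)
  denom.zero? ? 0.0 : dot / denom
end

texts = [
  "A document with a lot of additional words some of which are about chunky bacon",
  "Another longer document with many words and again about chunky bacon",
  "Some text that has nothing to do with pork products"
]
v1, v2, v3 = tfidf_vectors(texts)
puts cosine(v1, v1) # approximately 1.0, up to floating-point rounding
puts cosine(v1, v2) > cosine(v1, v3) # the two bacon documents are closer
```

Storing each vector as a Hash keeps only the terms a document actually contains, which is one simple way to get a sparse representation in plain Ruby.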

The cross-similarity matrix (useful for creating graphs) is also available:

similarity_matrix = corpus.similarity_matrix
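To show how such a matrix can feed a graph, the sketch below turns a symmetric similarity matrix into a weighted edge list, keeping only pairs above a threshold. It works on a plain nested Array with made-up values. The `edges_from_matrix` helper and the `threshold` parameter are hypothetical, and converting the object returned by `similarity_matrix` into nested arrays is left as an assumption.

```ruby
# Hypothetical helper: convert a symmetric similarity matrix (nested
# Arrays) into [source, target, weight] edges for a graph tool.
def edges_from_matrix(matrix, threshold: 0.05)
  edges = []
  matrix.each_index do |i|
    # Only visit the upper triangle, so each pair is emitted once
    # and self-similarity on the diagonal is skipped.
    ((i + 1)...matrix.size).each do |j|
      weight = matrix[i][j]
      edges << [i, j, weight] if weight >= threshold
    end
  end
  edges
end

# Illustrative values only, not output from the library.
matrix = [
  [1.0,  0.6, 0.01],
  [0.6,  1.0, 0.2],
  [0.01, 0.2, 1.0]
]

edges_from_matrix(matrix, threshold: 0.1).each do |source, target, weight|
  puts "#{source},#{target},#{weight}"
end
```

Raising the threshold prunes weak links, which tends to make the resulting graph easier to read.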

For more examples, see the examples directory.

Todo

  • Performance improvements
    • Switch to storing document vector spaces in sparse form, using linalg or csparse?
  • (Optional) stemming of source terms

Contributing

  • Fork the project
  • Send a pull request
  • Don't touch the .gemspec, I'll do that when I release a new version

Author

Chris Lowis - BBC R&D
