  • Stars: 1,038
  • Rank: 44,388 (Top 0.9 %)
  • Language: Ruby
  • License: Other
  • Created about 13 years ago
  • Updated 3 months ago

Repository Details

Fast citation reference parsing

AnyStyle

AnyStyle is a fast and smart parser of bibliographic references. Originally inspired by parsCit and FreeCite, AnyStyle uses machine learning algorithms and aims to make it easy to train models with data that's relevant to you.

Using AnyStyle on the command line

$ [sudo] gem install anystyle-cli
$ anystyle --help
$ anystyle help find
$ anystyle help parse

See anystyle-cli for more details.

Using AnyStyle in Ruby

Install the anystyle gem.

$ [sudo] gem install anystyle

Now you can use the static Parser and Finder instances by calling the AnyStyle.parse or AnyStyle.find methods. For example:

require 'anystyle'

pp AnyStyle.parse 'Derrida, J. (1967). L’écriture et la différence (1 éd.). Paris: Éditions du Seuil.'
#-> [{
#  :author=>[{:family=>"Derrida", :given=>"J."}],
#  :date=>["1967"],
#  :title=>["L’écriture et la différence"],
#  :edition=>["1"],
#  :location=>["Paris"],
#  :publisher=>["Éditions du Seuil"],
#  :language=>"fr",
#  :scripts=>["Common", "Latin"],
#  :type=>"book"
#}]

You can also create your own AnyStyle::Parser or AnyStyle::Finder with custom options.

Using AnyStyle on the web

AnyStyle is available at anystyle.io.

The web application is open source and you're welcome to host your own instance!

Improving results for your data

Training

You can train custom Finder and Parser models. To do this, you need to prepare your own data sets for training. You can create your own data from scratch or build on AnyStyle's default sets. The default parser model uses the core data set. And though the finder model sources aren't available in their entirety, due to copyright restrictions, you can find several tagged documents here.

When you have compiled a data set for training, you will be ready to create your own model:

$ anystyle train training-data.xml custom.mod

This will save your new model as custom.mod. To use your model instead of AnyStyle's default, pass it with the -P (or --parser-model) flag; to use a custom finder model, pass it with -F (or --finder-model). For instance, the command below will parse a file bib.txt with the custom model and print the result to STDOUT in JSON format:

$ anystyle -P custom.mod -f json parse bib.txt -
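
The JSON printed by this command can be post-processed in Ruby. The following is a minimal sketch, not part of AnyStyle: the SAMPLE_OUTPUT string is an assumption modeled on the hash output shown earlier (the real output may contain more or different fields), and summarize is a hypothetical helper name.

```ruby
require 'json'

# Assumed shape of the JSON output, modeled on the hash example above.
SAMPLE_OUTPUT = <<~JSON
  [{
    "author": [{"family": "Derrida", "given": "J."}],
    "date": ["1967"],
    "title": ["L’écriture et la différence"],
    "location": ["Paris"],
    "publisher": ["Éditions du Seuil"],
    "type": "book"
  }]
JSON

# Hypothetical helper: turn each parsed reference into a one-line summary.
def summarize(json)
  JSON.parse(json).map do |ref|
    authors = Array(ref['author']).map do |a|
      [a['family'], a['given']].compact.join(', ')
    end
    # Most fields are arrays of strings; take the first value of each.
    format('%s (%s). %s',
           authors.join('; '),
           Array(ref['date']).first,
           Array(ref['title']).first)
  end
end

puts summarize(SAMPLE_OUTPUT)
```

To process real output, the sample string could be replaced with the text read from the command's STDOUT, for example via $stdin.read in a pipeline.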

When training your own models, it's good practice to check their quality using a second data set. For example, to check your custom model using AnyStyle's manually curated gold data set:

$ anystyle -P x.mod check ./res/parser/gold.xml
Checking gold.xml.................   1 seq  0.06%   3 tok  0.01%  3s

This command prints sequence and token error rates. Here, sequence errors are the number of references the parser tagged differently from the curated input; token errors are the total number of words in those references that received a different tag. In the example above, one reference (out of 1,700 at the time) was wrong because three of its words had a different tag.
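
To make the two error rates concrete, here is a minimal sketch of how they could be computed from a curated and a predicted tagging. It is an illustration only, not AnyStyle's implementation: the data, the labels and the error_rates helper are all hypothetical, and both taggings are assumed to contain the same tokens in the same order.

```ruby
# Each reference is an array of [token, label] pairs (illustrative data).
GOLD = [
  [['Derrida,', :author], ['J.', :author], ['(1967).', :date]],
  [['Paris:', :location], ['Seuil.', :publisher]]
].freeze

PREDICTED = [
  [['Derrida,', :author], ['J.', :author], ['(1967).', :date]],
  [['Paris:', :publisher], ['Seuil.', :publisher]]
].freeze

# Hypothetical helper: count references and tokens tagged differently.
def error_rates(gold, predicted)
  seq_errors = 0
  tok_errors = 0
  tokens = 0

  gold.zip(predicted).each do |gold_seq, pred_seq|
    # A token error is a word whose predicted label differs from the gold label.
    wrong = gold_seq.zip(pred_seq).count { |g, p| g.last != p.last }
    tok_errors += wrong
    # A sequence error is a reference with at least one token error.
    seq_errors += 1 if wrong.positive?
    tokens += gold_seq.size
  end

  {
    sequence_errors: seq_errors,
    sequence_rate: 100.0 * seq_errors / gold.size,
    token_errors: tok_errors,
    token_rate: 100.0 * tok_errors / tokens
  }
end

rates = error_rates(GOLD, PREDICTED)
printf("%d seq %.2f%%  %d tok %.2f%%\n",
       rates[:sequence_errors], rates[:sequence_rate],
       rates[:token_errors], rates[:token_rate])
```

In this toy data one of the two references is counted as a sequence error because one of its tokens carries a different label. The same arithmetic gives the figure in the example above: 1 wrong reference out of 1,700 is roughly 0.06%.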

When working with training data, it's a good idea to use the Wapiti::Dataset API in Ruby: it supports standard set operators and makes it easy to combine or compare data sets.
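
The sketch below illustrates what set operators do for training data. To stay free of dependencies it uses plain Ruby arrays of placeholder strings in place of Wapiti::Dataset objects; that substitution, the sample data and the build_sets helper are all assumptions made for illustration and do not show the Wapiti API itself.

```ruby
# Placeholder strings stand in for tagged references.
DEFAULT_SET = ['reference A', 'reference B', 'reference C'].freeze
CUSTOM_SET  = ['reference C', 'reference D', 'reference E'].freeze
HELD_OUT    = ['reference B', 'reference E'].freeze

# Hypothetical helper: combine two sets and keep a held-out set for checking.
def build_sets(default_set, custom_set, held_out)
  combined = default_set | custom_set   # union, duplicates dropped
  {
    overlap: default_set & custom_set,  # references present in both sets
    training: combined - held_out,      # train on everything not held out
    check: combined & held_out          # check on the held-out references
  }
end

sets = build_sets(DEFAULT_SET, CUSTOM_SET, HELD_OUT)
sets.each { |name, refs| puts "#{name}: #{refs.inspect}" }
```

Keeping the training and check sets disjoint, as done here with the difference operator, matters because checking a model against references it was trained on gives an overly optimistic error rate.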

Natural Languages used in AnyStyle

The core data set contains the manually marked-up references which comprise AnyStyle's default parser model. If your references include non-English documents, the distribution of natural languages in this corpus is relevant.

Language                                        n
ENGLISH                                       965
FRENCH                                         54
GERMAN                                         26
ITALIAN                                        11
Others                                          9
Not reliably determined (but mainly English)  449

(Measured using cld and AnyStyle version 1.3.13)

There is a strong prevalence of English-language documents with the conventions used in English-language bibliographies, with some representation of other European languages. The languages used reflect those used in scientific publishing as well as the maintainers' competencies. If you are working with documents in languages other than English, you might consider training the model with some examples in the relevant languages.
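
To put the counts in the table above in proportion, this short sketch converts them to percentages of the corpus. The counts are copied from the table; the language_shares helper is a hypothetical name used only for this illustration.

```ruby
# Counts copied from the language table above.
LANGUAGE_COUNTS = {
  'English' => 965,
  'French' => 54,
  'German' => 26,
  'Italian' => 11,
  'Others' => 9,
  'Not reliably determined' => 449
}.freeze

# Hypothetical helper: each language's share of the corpus in percent.
def language_shares(counts)
  total = counts.values.sum
  counts.transform_values { |n| (100.0 * n / total).round(1) }
end

language_shares(LANGUAGE_COUNTS).each do |language, share|
  printf("%-24s %5.1f%%\n", language, share)
end
```

References identified as English make up a little under two thirds of the corpus, and the share is higher still if most of the undetermined references are English, as the table suggests.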

AnyStyle works with references written in any Latin script, including most European languages, languages such as Indonesian and Malaysian, as well as romanized Arabic, Chinese and Japanese. It also supports non-Latin alphabets such as Cyrillic, although no examples of these appear in the default training sets. Languages written in syllabaries or complex symbols which don't use white space to separate tokens aren't compatible with AnyStyle's approach: this includes Chinese, Japanese, Arabic, and Indian languages.

Dictionary Adapters

During the statistical analysis of reference strings, AnyStyle relies on a large feature dictionary; by default, AnyStyle creates a persistent Ruby hash in the folder of the anystyle-data Gem. This uses up about 2MB of disk space and keeps the entire dictionary in memory. If you prefer a smaller memory footprint, you can use AnyStyle's GDBM dictionary. GDBM bindings are part of the Ruby standard library and supported on all platforms, though you may need to install GDBM before installing Ruby.

If you don't want to use the persistent Ruby hash or GDBM, you can store your dictionary in memory or use Redis. The best way to change the default dictionary adapter is to adjust AnyStyle's default configuration (when using the static parser instances, you must set the default before using the parser):

AnyStyle::Dictionary.defaults[:adapter] = :ruby
#-> Use a persistent Ruby hash;
#-> slower start-up than GDBM but no extra dependency

AnyStyle::Dictionary.defaults[:adapter] = :hash
#-> Use in-memory dictionary; slow start-up but uses no space on disk

require 'anystyle/dictionary/gdbm'
AnyStyle::Dictionary.defaults[:adapter] = :gdbm

To use Redis, install the redis and (optionally) redis-namespace gems and configure AnyStyle to use the Redis adapter:

AnyStyle::Dictionary.defaults[:adapter] = :redis

# Adjust the Redis-specific configuration
require 'anystyle/dictionary/redis'
AnyStyle::Dictionary::Redis.defaults[:host] = 'localhost'
AnyStyle::Dictionary::Redis.defaults[:port] = 6379

About AnyStyle

Contributing

The AnyStyle source code is hosted on GitHub. You can check out a copy of the latest code using Git:

$ git clone https://github.com/inukshuk/anystyle.git

If you've found a bug or have a question, please report the issue or, for extra credit, clone the AnyStyle repository, write a failing example, fix the bug and submit a pull request.

Credits

AnyStyle is a volunteer effort and you're encouraged to join! Over the years the main contributors have been:

License

Copyright 2011-2023 Sylvester Keil. All rights reserved.

AnyStyle is distributed under a BSD-style license. See LICENSE for details.
