  • Stars: 109
  • Rank 319,077 (Top 7 %)
  • Language: Python
  • License: MIT License
  • Created over 6 years ago
  • Updated over 1 year ago

Repository Details

Library for unit extraction - fork of quantulum for python3

quantulum3

Python library for information extraction of quantities, measurements and their units from unstructured text. It is able to disambiguate between similar looking units based on their k-nearest neighbours in their GloVe vector representation and their Wikipedia page.

This is the Python 3 compatible fork of recastrodiaz' fork of grhawks' fork of the original by Marco Lagi. The compatibility with the newest version of sklearn is based on the fork of sohrabtowfighi.

User Guide

Installation

pip install quantulum3

To install dependencies for using or training the disambiguation classifier, use

pip install quantulum3[classifier]

The disambiguation classifier is used when the parser finds two or more units that match the text.

Usage

>>> from quantulum3 import parser
>>> quants = parser.parse('I want 2 liters of wine')
>>> quants
[Quantity(2, 'litre')]

The Quantity class stores the surface of the original text it was extracted from, as well as the (start, end) positions of the match:

>>> quants[0].surface
'2 liters'
>>> quants[0].span
(7, 15)
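
The span can be used to slice the surface back out of the input string. The following is a minimal sketch in plain Python that reuses the values shown above instead of calling the library; it assumes the span is end-exclusive, which is consistent with this example.

```python
# Values copied from the example above; no library call is made here.
text = "I want 2 liters of wine"
surface = "2 liters"
span = (7, 15)

start, end = span
# Assumption: the span follows Python slicing conventions
# (start inclusive, end exclusive), as the numbers above suggest.
recovered = text[start:end]
print(recovered)
```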

The value attribute provides the parsed numeric value and the unit.name attribute provides the name of the parsed unit:

>>> quants[0].value
2.0
>>> quants[0].unit.name
'litre'

An inline parser that embeds the parsed quantities in the text is also available (especially useful for debugging):

>>> print(parser.inline_parse('I want 2 liters of wine'))
I want 2 liters {Quantity(2, "litre")} of wine

As the parser is also able to parse dimensionless numbers, this library can also be used for simple number extraction.

>>> print(parser.parse('I want two'))
[Quantity(2, 'dimensionless')]

Units and entities

All units (e.g. litre) and the entities they are associated with (e.g. volume) are reconciled against Wikipedia:

>>> quants[0].unit
Unit(name="litre", entity=Entity("volume"), uri=https://en.wikipedia.org/wiki/Litre)

>>> quants[0].unit.entity
Entity(name="volume", uri=https://en.wikipedia.org/wiki/Volume)

This library includes more than 290 units and 75 entities. It also parses spelled-out numbers, ranges and uncertainties:

>>> parser.parse('I want a gallon of beer')
[Quantity(1, 'gallon')]

>>> parser.parse('The LHC smashes proton beams at 12.8–13.0 TeV')
[Quantity(12.8, "teraelectronvolt"), Quantity(13, "teraelectronvolt")]

>>> quant = parser.parse('The LHC smashes proton beams at 12.9±0.1 TeV')
>>> quant[0].uncertainty
0.1

Non-standard units usually don't have a Wikipedia page. The parser will still try to guess their underlying entity based on their dimensionality:

>>> parser.parse('Sound travels at 0.34 km/s')[0].unit
Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
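
One way to picture this guess is as a lookup on dimensions: replace each base unit by its entity, then search for an entity with the same dimensionality. The sketch below illustrates that idea only. It is not the library's implementation, and the two small tables and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical miniature tables; the real units.json and entities.json
# contain many more entries.
UNIT_TO_ENTITY = {"kilometre": "length", "metre": "length", "second": "time"}
ENTITY_DIMENSIONS = {
    "speed": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "volume": [{"base": "length", "power": 3}],
}


def signature(dimensions):
    """Collapse a list of {base, power} dicts into a sorted, hashable tuple."""
    totals = Counter()
    for dim in dimensions:
        totals[dim["base"]] += dim["power"]
    return tuple(sorted((base, power) for base, power in totals.items() if power != 0))


def guess_entity(unit_dimensions):
    """Map each base unit to its entity, then find an entity with equal dimensions."""
    as_entities = [
        {"base": UNIT_TO_ENTITY[dim["base"]], "power": dim["power"]}
        for dim in unit_dimensions
    ]
    target = signature(as_entities)
    for name, dims in ENTITY_DIMENSIONS.items():
        if signature(dims) == target:
            return name
    return "unknown"


km_per_s = [{"base": "kilometre", "power": 1}, {"base": "second", "power": -1}]
print(guess_entity(km_per_s))  # speed
```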

Export/Import

Entities, Units and Quantities can be exported to dictionaries and JSON strings:

>>> quant = parser.parse('I want 2 liters of wine')
>>> quant[0].to_dict()
{'value': 2.0, 'unit': 'litre', 'entity': 'volume', 'surface': '2 liters', 'span': (7, 15), 'uncertainty': None, 'lang': 'en_US'}
>>> quant[0].to_json()
'{"value": 2.0, "unit": "litre", "entity": "volume", "surface": "2 liters", "span": [7, 15], "uncertainty": null, "lang": "en_US"}'
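
The two outputs differ in the way any Python-to-JSON conversion does: the span tuple becomes a list and None becomes null. The sketch below reproduces this with the standard json module on a dictionary copied from the output above; it does not call quantulum3.

```python
import json

# Dictionary copied from the to_dict() output shown above.
exported = {
    "value": 2.0, "unit": "litre", "entity": "volume", "surface": "2 liters",
    "span": (7, 15), "uncertainty": None, "lang": "en_US",
}

as_json = json.dumps(exported)
restored = json.loads(as_json)

print(as_json)
print(restored["span"])         # [7, 15], a list rather than a tuple
print(restored["uncertainty"])  # None
```

Code that compares a restored span against an original one may therefore want to convert it back with tuple().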

By default, only the unit and entity names are included in the exported dictionary, but the full unit and entity dictionaries can be included as well:

>>> quant = parser.parse('I want 2 liters of wine')
>>> quant[0].to_dict(include_unit_dict=True, include_entity_dict=True)  # same args apply to .to_json()
{'value': 2.0, 'unit': {'name': 'litre', 'surfaces': ['cubic decimetre', 'cubic decimeter', 'litre', 'liter'], 'entity': {'name': 'volume', 'dimensions': [{'base': 'length', 'power': 3}], 'uri': 'Volume'}, 'uri': 'Litre', 'symbols': ['l', 'L', 'ltr', 'ℓ'], 'dimensions': [{'base': 'decimetre', 'power': 3}], 'original_dimensions': [{'base': 'litre', 'power': 1, 'surface': 'liters'}], 'currency_code': None, 'lang': 'en_US'}, 'entity': 'volume', 'surface': '2 liters', 'span': (7, 15), 'uncertainty': None, 'lang': 'en_US'}

Similar export syntax applies to exporting Unit and Entity objects.

You can import Entity, Unit and Quantity objects from dictionaries and JSON. This requires that the object was exported with include_unit_dict=True and include_entity_dict=True (as appropriate):

>>> from quantulum3.classes import Entity, Quantity
>>> quant_dict = quant[0].to_dict(include_unit_dict=True, include_entity_dict=True)
>>> quant = Quantity.from_dict(quant_dict)
>>> ent_json = '{"name": "volume", "dimensions": [{"base": "length", "power": 3}], "uri": "Volume"}'
>>> ent = Entity.from_json(ent_json)

Disambiguation

If the parser detects an ambiguity, a classifier based on the Wikipedia pages of the ambiguous units or entities tries to guess the right one:

>>> parser.parse('I spent 20 pounds on this!')
[Quantity(20, "pound sterling")]

>>> parser.parse('It weighs no more than 20 pounds')
[Quantity(20, "pound-mass")]

or:

>>> text = 'The average density of the Earth is about 5.5x10-3 kg/cm³'
>>> parser.parse(text)[0].unit.entity
Entity(name="density", uri=https://en.wikipedia.org/wiki/Density)

>>> text = 'The amount of O₂ is 2.98e-4 kg per liter of atmosphere'
>>> parser.parse(text)[0].unit.entity
Entity(name="concentration", uri=https://en.wikipedia.org/wiki/Concentration)

In addition, the classifier is trained on the words most similar to each of the units' surfaces, according to their distance in the GloVe vector representation.
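
The intuition behind context-based disambiguation can be shown with a much simpler stand-in: score each candidate unit by how many of its typical context words occur in the sentence. This keyword-overlap sketch is not the library's classifier, and the keyword sets are hypothetical placeholders for text that would come from Wikipedia pages and similar words.

```python
# Hypothetical keyword sets standing in for text from the units' Wikipedia pages.
UNIT_CONTEXT = {
    "pound sterling": {"spent", "cost", "price", "pay", "money", "currency"},
    "pound-mass": {"weighs", "weight", "mass", "heavy", "kilogram"},
}


def disambiguate(sentence, candidates=UNIT_CONTEXT):
    """Pick the candidate whose keyword set overlaps most with the sentence."""
    words = {word.strip("!?.,").lower() for word in sentence.split()}
    return max(candidates, key=lambda unit: len(candidates[unit] & words))


print(disambiguate("I spent 20 pounds on this!"))        # pound sterling
print(disambiguate("It weighs no more than 20 pounds"))  # pound-mass
```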

Spoken version

Quantulum classes include methods to convert them to a speakable form.

>>> parser.parse("Gimme 10e9 GW now!")[0].to_spoken()
ten billion gigawatts
>>> parser.inline_parse_and_expand("Gimme $1e10 now and also 1 TW and 0.5 J!")
Gimme ten billion dollars now and also one terawatt and zero point five joules!

Manipulation

While quantities cannot be manipulated within this library, there are many great options out there:

Extension

Training the classifier

If you want to train the classifier yourself, you will need the dependencies for the classifier (see installation).

Use quantulum3-training on the command line, the script quantulum3/scripts/train.py or the method train_classifier in quantulum3.classifier to train the classifier.

quantulum3-training --lang <language> --data <path/to/training/file.json> --output <path/to/output/file.joblib>

You can pass multiple training files to the training command. The output is in joblib format.

To use your custom model, pass the path to the trained model file to the parser:

quants = parser.parse(<text>, classifier_path="path/to/model.joblib")

Example training files can be found in quantulum3/_lang/<language>/train.

If you want to create a new or different similars.json, install pymagnitude.

For the extraction of nearest neighbours from a vector word representation file, use scripts/extract_vere.py. It automatically extracts the k nearest neighbours in vector space of the vector representation for each of the possible surfaces of the ambiguous units. The resulting neighbours are stored in quantulum3/similars.json and automatically included for training.

The file provided should be in .magnitude format; other formats are first converted to a .magnitude file on the fly. See the pre-formatted Magnitude word embeddings and the Magnitude project for more information.
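
The idea of extracting the k nearest neighbours of a surface can be sketched in plain Python with cosine similarity. This is only an illustration of the technique, not the code in scripts/extract_vere.py: the three-dimensional vectors below are made up, and the function names are hypothetical.

```python
import math

# Made-up 3-dimensional vectors; real word embeddings have far more dimensions.
VECTORS = {
    "pound": [0.9, 0.8, 0.1],
    "kilogram": [0.1, 0.9, 0.0],
    "dollar": [0.9, 0.1, 0.1],
    "euro": [0.8, 0.2, 0.0],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbours(word, k):
    """Return the k words most similar to `word`, most similar first."""
    query = VECTORS[word]
    others = [w for w in VECTORS if w != word]
    return sorted(others, key=lambda w: cosine(query, VECTORS[w]), reverse=True)[:k]


print(nearest_neighbours("pound", 2))
```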

Additional units

It is possible to add additional entities and units to be parsed by quantulum. These are added to the default units and entities. The following code shows an example invocation:

>>> from quantulum3.load import add_custom_unit, remove_custom_unit
>>> add_custom_unit(name="schlurp", surfaces=["slp"], entity="dimensionless")
>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")
[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

The keyword arguments to the function add_custom_unit are directly translated to the properties of the unit to be created.

Custom Units and Entities

It is possible to load a completely custom set of units and entities. This can be done by passing a list of file paths to the load_custom_units and load_custom_entities functions. Loading custom units and entities replaces the default units and entities that are normally loaded.

The recommended way to load quantities is via a context manager:

>>> from quantulum3 import load, parser
>>> with load.CustomQuantities(["path/to/units.json"], ["path/to/entities.json"]):
...     parser.parse("This extremely sharp tool is precise up to 0.5 slp")
...
[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]
>>> # default units and entities are loaded again

But it is also possible to load custom units and entities manually:

>>> from quantulum3 import load, parser

>>> load.load_custom_units(["path/to/units.json"])
>>> load.load_custom_entities(["path/to/entities.json"])
>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")

[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

>>> # remove custom units and entities and load default units and entities
>>> load.reset_quantities()

See the Developer Guide below for more information about the format of units and entities files.

Developer Guide

Adding Units and Entities

See units.json for the complete list of units and entities.json for the complete list of entities. The criteria for adding units have been:

It's easy to extend these two files to the units/entities of interest. Here is an example of an entry in entities.json:

"speed": {
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}],
    "URI": "Speed"
}
  • The name of an entity is its key. Names are required to be unique.
  • URI is the name of the Wikipedia page of the entity (e.g. https://en.wikipedia.org/wiki/Speed => Speed).
  • dimensions is the dimensionality, a list of dictionaries each having a base (the name of another entity) and a power (an integer, can be negative).

Here is an example of an entry in units.json:

"metre per second": {
    "surfaces": ["metre per second", "meter per second"],
    "entity": "speed",
    "URI": "Metre_per_second",
    "dimensions": [{"base": "metre", "power": 1}, {"base": "second", "power": -1}],
    "symbols": ["mps"]
},
"year": {
    "surfaces": [ "year", "annum" ],
    "entity": "time",
    "URI": "Year",
    "dimensions": [],
    "symbols": [ "a", "y", "yr" ],
    "prefixes": [ "k", "M", "G", "T", "P", "E" ]
}
  • The name of a unit is its key. Names are required to be unique.
  • URI follows the same scheme as in entities.json.
  • surfaces is a list of strings that refer to that unit. The library takes care of plurals, no need to specify them.
  • entity is the name of an entity in entities.json
  • dimensions follows the same schema as in entities.json, but the base is the name of another unit, not of another entity.
  • symbols is a list of possible symbols and abbreviations for that unit.
  • prefixes is an optional list. It can contain metric and binary prefixes, and the corresponding prefixed units are generated automatically. If you want to add specifics (like different surfaces) to a prefixed version, you need to create a separate entry for it.
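
To illustrate what generating prefixed units could look like, the sketch below expands the "year" entry above into one variant per listed prefix. It is a guess at the general idea, not the library's code: the function name, the naming scheme (prefix name plus unit name) and the "factor" field are assumptions made for this example.

```python
# Metric prefix symbols mapped to (name, power of ten). Only the prefixes
# used by the "year" example above are listed.
METRIC_PREFIXES = {
    "k": ("kilo", 3), "M": ("mega", 6), "G": ("giga", 9),
    "T": ("tera", 12), "P": ("peta", 15), "E": ("exa", 18),
}


def expand_prefixes(name, entry):
    """Return hypothetical prefixed variants of a unit entry."""
    variants = {}
    for symbol in entry.get("prefixes", []):
        prefix_name, exponent = METRIC_PREFIXES[symbol]
        variants[prefix_name + name] = {
            "surfaces": [prefix_name + surface for surface in entry["surfaces"]],
            "symbols": [symbol + unit_symbol for unit_symbol in entry["symbols"]],
            "entity": entry["entity"],
            "factor": 10 ** exponent,  # illustrative field, not part of units.json
        }
    return variants


year = {
    "surfaces": ["year", "annum"],
    "entity": "time",
    "URI": "Year",
    "dimensions": [],
    "symbols": ["a", "y", "yr"],
    "prefixes": ["k", "M", "G", "T", "P", "E"],
}

variants = expand_prefixes("year", year)
print(sorted(variants))
print(variants["kiloyear"]["symbols"])  # ['ka', 'ky', 'kyr']
```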

All fields are case sensitive.
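
Because names are matched case-sensitively, a quick consistency check before loading custom files can catch mismatches between the two tables. The checker below is a hypothetical helper written for this guide, not part of quantulum3; it only relies on the field layout described above and uses small in-memory tables instead of real files.

```python
import json

# Small example tables in the format described above.
entities = json.loads("""{
    "length": {"dimensions": [], "URI": "Length"},
    "time": {"dimensions": [], "URI": "Time"},
    "speed": {"dimensions": [{"base": "length", "power": 1},
                             {"base": "time", "power": -1}], "URI": "Speed"}
}""")
units = json.loads("""{
    "metre": {"surfaces": ["metre", "meter"], "entity": "length", "URI": "Metre",
              "dimensions": [], "symbols": ["m"]},
    "second": {"surfaces": ["second"], "entity": "time", "URI": "Second",
               "dimensions": [], "symbols": ["s"]},
    "metre per second": {"surfaces": ["metre per second"], "entity": "Speed",
                         "URI": "Metre_per_second",
                         "dimensions": [{"base": "metre", "power": 1},
                                        {"base": "second", "power": -1}],
                         "symbols": ["mps"]}
}""")


def check_consistency(units, entities):
    """Return a list of problems; an empty list means the tables agree."""
    problems = []
    for name, unit in units.items():
        if unit["entity"] not in entities:
            problems.append(f"unit {name!r}: unknown entity {unit['entity']!r}")
        for dim in unit["dimensions"]:
            if dim["base"] not in units:
                problems.append(f"unit {name!r}: unknown base unit {dim['base']!r}")
    for name, entity in entities.items():
        for dim in entity["dimensions"]:
            if dim["base"] not in entities:
                problems.append(f"entity {name!r}: unknown base entity {dim['base']!r}")
    return problems


# "Speed" does not match the key "speed", so one problem is reported.
print(check_consistency(units, entities))
```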

Contributing

If you'd like to contribute, follow these steps:

  1. Clone a fork of this project into your workspace.
  2. Run pip install -e . at the root of your development folder.
  3. Run pip install pipenv and then pipenv shell.
  4. Inside the project folder, run pipenv install --dev.
  5. Make your changes.
  6. Run scripts/format.sh and scripts/build.py from the package root directory.
  7. Test your changes with python3 setup.py test (optional; this is done automatically after pushing).
  8. Create a pull request once you have committed and pushed your changes.

Language support

There is a branch for language support, namely language_support. By inspecting the README file in the _lang subdirectory and the functions and values given in the new _lang.en_US submodule, you should be able to create your own language submodules. New language modules should be picked up automatically and be available both through the lang= keyword argument of the parser functions and in the automated unit tests.

No changes outside your own language submodule folder (e.g. _lang.de_DE) should be necessary. If you run into problems implementing a new language, don't hesitate to open an issue.
