• Stars: 299
• Rank: 139,269 (Top 3%)
• Language: Python
• License: MIT License
• Created over 4 years ago
• Updated almost 3 years ago

Repository Details

📃 Language Model based sentences scoring library

lm-scorer

Synopsis

This package provides a simple programming interface to score sentences using different ML language models.

A simple CLI is also available for quick prototyping.
You can run it locally or directly on Colab using this notebook.

Do you believe that this is useful? Has it saved you time? Or maybe you simply like it?
If so, support this work with a Star ⭐️.

Install

pip install lm-scorer

Usage

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

# Available models
list(LMScorer.supported_model_names())
# => ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "distilgpt2"]

# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=batch_size)

# Return token probabilities (provide log=True to return log probabilities)
scorer.tokens_score("I like this package.")
# => (scores, ids, tokens)
# scores = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
# ids    = [40,       588,       428,      5301,       13,      50256]
# tokens = ["I",      "Ġlike",   "Ġthis",  "Ġpackage", ".",     "<|endoftext|>"]

# Compute sentence score as the product of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="prod")
# => 6.0231e-12

# Compute sentence score as the mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="mean")
# => 0.064593

# Compute sentence score as the geometric mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="gmean")
# => 0.013489

# Compute sentence score as the harmonic mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="hmean")
# => 0.0028008

# Get the log of the sentence score.
scorer.sentence_score("I like this package.", log=True)
# => -25.835

# Score multiple sentences.
scorer.sentence_score(["Sentence 1", "Sentence 2"])
# => [1.1508e-11, 5.6645e-12]

# NB: Computations are done in log space so they should be numerically stable.
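
The four reduce strategies above differ only in how the per-token probabilities are combined. The sketch below is not the library's implementation. It is a minimal standalone illustration, using hypothetical helpers named `reduce_token_probs` and `logsumexp`, that applies each strategy in log space to the token probabilities printed in the `tokens_score` example.

```python
import math


def logsumexp(values):
    """Return log(sum(exp(v))) without overflowing or underflowing."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def reduce_token_probs(probs, reduce="prod", log=False):
    """Combine per-token probabilities into one sentence score.

    Hypothetical helper for illustration only; all arithmetic is done on
    log probabilities and converted back at the end.
    """
    if not probs or any(p <= 0.0 for p in probs):
        raise ValueError("probs must be a non-empty sequence of positive numbers")
    log_probs = [math.log(p) for p in probs]
    n = len(log_probs)
    if reduce == "prod":
        # log of the product is the sum of the logs
        log_score = sum(log_probs)
    elif reduce == "mean":
        # log of the arithmetic mean: log(sum(p)) - log(n)
        log_score = logsumexp(log_probs) - math.log(n)
    elif reduce == "gmean":
        # log of the geometric mean is the mean of the logs
        log_score = sum(log_probs) / n
    elif reduce == "hmean":
        # log of the harmonic mean: log(n) - log(sum(1 / p))
        log_score = math.log(n) - logsumexp([-lp for lp in log_probs])
    else:
        raise ValueError(f"unknown reduce strategy: {reduce!r}")
    return log_score if log else math.exp(log_score)


# Token probabilities copied from the tokens_score output above.
token_probs = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
for name in ("prod", "mean", "gmean", "hmean"):
    print(name, reduce_token_probs(token_probs, reduce=name))
```

Because the inputs are the rounded values shown above, the printed numbers should agree with the sentence scores in the usage example only to within rounding.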

CLI

The pip package includes a CLI that you can use to score sentences.

usage: lm-scorer [-h] [--model-name MODEL_NAME] [--tokens] [--log-prob]
                 [--reduce REDUCE] [--batch-size BATCH_SIZE]
                 [--significant-figures SIGNIFICANT_FIGURES] [--cuda CUDA]
                 [--debug]
                 sentences-file-path

Get sentences probability using a language model.

positional arguments:
  sentences-file-path   A file containing sentences to score, one per line. If
                        - is given as filename it reads from stdin instead.

optional arguments:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME, -m MODEL_NAME
                        The pretrained language model to use. Can be one of:
                        gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2.
  --tokens, -t          If provided it provides the probability of each token
                        of each sentence.
  --log-prob, -lp       If provided log probabilities are returned instead.
  --reduce REDUCE, -r REDUCE
                        Reduce strategy applied on token probabilities to get
                        the sentence score. Available strategies are: prod,
                        mean, gmean, hmean.
  --batch-size BATCH_SIZE, -b BATCH_SIZE
                        Number of sentences to process in parallel.
  --significant-figures SIGNIFICANT_FIGURES, -sf SIGNIFICANT_FIGURES
                        Number of significant figures to use when printing
                        numbers.
  --cuda CUDA           If provided it runs the model on the given cuda
                        device.
  --debug               If provided it provides additional logging in case of
                        errors.
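
To drive the CLI from a script, one option is to assemble the argument list from the flags documented in the help text and pipe sentences through stdin by passing - as the file name. The helper names `build_lm_scorer_command` and `score_with_cli` below are hypothetical, and since the help text does not describe the output format, this sketch returns stdout unparsed.

```python
import subprocess

# Values copied from the help text above.
SUPPORTED_MODELS = ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "distilgpt2")
REDUCE_STRATEGIES = ("prod", "mean", "gmean", "hmean")


def build_lm_scorer_command(sentences_path="-", model_name="gpt2", reduce="prod",
                            log_prob=False, tokens=False, batch_size=1):
    """Assemble an lm-scorer argument list using only documented flags."""
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model_name!r}")
    if reduce not in REDUCE_STRATEGIES:
        raise ValueError(f"unsupported reduce strategy: {reduce!r}")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    cmd = ["lm-scorer", "--model-name", model_name, "--reduce", reduce,
           "--batch-size", str(batch_size)]
    if log_prob:
        cmd.append("--log-prob")
    if tokens:
        cmd.append("--tokens")
    cmd.append(sentences_path)  # "-" makes the CLI read from stdin
    return cmd


def score_with_cli(sentences, **options):
    """Pipe one sentence per line to the CLI and return its raw stdout."""
    cmd = build_lm_scorer_command("-", **options)
    completed = subprocess.run(cmd, input="\n".join(sentences) + "\n",
                               capture_output=True, text=True, check=True)
    return completed.stdout
```

Calling `score_with_cli(["I like this package."], reduce="gmean", log_prob=True)` requires the lm-scorer executable to be on the PATH; `build_lm_scorer_command` can be inspected without it.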

Development

You can install this library locally for development using the commands below. If you don't have it already, you need to install poetry first.

# Clone the repo
git clone https://github.com/simonepri/lm-scorer
# CD into the created folder
cd lm-scorer
# Create a virtualenv and install the required dependencies using poetry
poetry install

You can then run commands inside the virtualenv by using poetry run COMMAND.
Alternatively, you can open a shell inside the virtualenv using poetry shell.

If you wish to contribute to this project, run the following commands locally before opening a PR and check that no error is reported (warnings are fine).

# Run the code formatter
poetry run task format
# Run the linter
poetry run task lint
# Run the static type checker
poetry run task types
# Run the tests
poetry run task test

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the license file for details.

More Repositories

1. geo-maps (JavaScript, 1,265 stars): 🗺 High Quality GeoJSON maps programmatically generated.
2. upash (JavaScript, 536 stars): 🔒 Unified API for password hashing algorithms
3. sympact (JavaScript, 439 stars): 🔥 Stupid Simple CPU/MEM "Profiler" for your JS code.
4. datasets-knowledge-embedding (Shell, 145 stars): A collection of common datasets used in knowledge embedding
5. country-iso (JavaScript, 144 stars): 🗺 Get the ISO 3166-1 alpha-3 country code from geographic coordinates.
6. pidtree (JavaScript, 124 stars): 🚸 Cross platform children list of a PID.
7. geojson-geometries-lookup (JavaScript, 90 stars): ⚡️ Fast geometry in geometry lookup for large GeoJSONs.
8. is-sea (JavaScript, 49 stars): 🌊 Check whether a geographic coordinate is in the sea or not on the earth.
9. osm-geojson (JavaScript, 49 stars): 🔰 Get GeoJSON of an OpenStreetMap relation from the API.
10. fitbit2garmin (Python, 38 stars): ⬇ Downloads lifetime Fitbit data and exports it into the format supported by Garmin Connect data importer. This includes historical body composition data (weight, BMI, and fat percentage), activity data (calories burned, steps, distance, active minutes, and floors climbed), and individual GPS exercises (TCX).
11. env-dot-prop (JavaScript, 33 stars): ♻️ Get, set, or delete nested properties of process.env using a dot path
12. roboprime (Arduino, 27 stars): 🤖 Full featured 21 DOF 3D Printed Humanoid Robot based on ATmega328P chip
13. competitive-programming (C++, 21 stars): 🏅 This repository contains all the problems I solved while training myself for programming competitions
14. phc-argon2 (JavaScript, 17 stars): 🔒 Node.JS Argon2 password hashing algorithm following the PHC string format.
15. phc-format (JavaScript, 17 stars): PHC String Format implementation for Node.JS
16. upash-cli (JavaScript, 15 stars): 🌌 Hash password directly from your terminal
17. ni (JavaScript, 12 stars): 📦 A better `npm init` **NOT RELEASED**
18. phc-pbkdf2 (JavaScript, 12 stars): 🔒 Node.JS PBKDF2 password hashing algorithm following the PHC string format.
19. osm-countries (JavaScript, 11 stars): 🔎 Get the OpenStreetMap's relation id from a country code.
20. varname-seq2seq (Python, 10 stars): 📄 Source code variable naming using a seq2seq architecture
21. project-version (JavaScript, 10 stars): 👀 Get the current version of your project.
22. fever-transformers (Python, 10 stars): 📄 Evidence Retrieval and Claim Verification for the FEVER shared task using Transformer Networks
23. bin-manager (JavaScript, 8 stars): 🌀 Binaries available as local nodeJS dependencies
24. leadoii (Vue, 8 stars): 🏆 Leaderboard Generator for the Italian Olympiads of Informatics Training Platform
25. phc-scrypt (JavaScript, 6 stars): 🔒 Node.JS scrypt password hashing algorithm following the PHC string format.
26. phc-bcrypt (JavaScript, 6 stars): 🔒 Easy to use Unified API for bcrypt password hashing algorithm
27. act (JavaScript, 6 stars): Multi-purpose URI tracker.
28. tsse (JavaScript, 3 stars): ⏱ Timing safe string equals.
29. restify-errors-options (JavaScript, 3 stars): 🔧 Add custom options to Restify's errors
30. text-tokenizers-colab (Jupyter Notebook, 3 stars): 🔪 Tokenize text on the fly on Colab.
31. leadoii-static (HTML, 3 stars): 🏅 Pre-Generated Leaderboards of the Italian Olympiads of Informatics Training Platform Users
32. sudoku-solver (Java, 2 stars): 🔢 Sudoku Solutions Enumerator (Sequential and Parallel)
33. text2error (Python, 2 stars): 〰 Introduce errors in error free text
34. restify-errors-thrower (JavaScript, 2 stars): 💥 Throw Restify errors easily!
35. kdf-salt (JavaScript, 2 stars): 🎲 Crypto secure salt generator
36. docker-osrm-backend (Shell, 2 stars): 🛣 The Open Source Routing Machine Docker ready!
37. geojson-geometries (JavaScript, 2 stars): ⛏ Extract elementary geometries from a GeoJSON inheriting properties.
38. css-viewport-units-cross-browser (CSS, 2 stars): Cross-Browser CSS3 Viewport Units: (vh, vw, vmin, vmax)
39. talking-unicorn (Arduino, 2 stars): 🦄 An Arduino based greeting unicorn.
40. edgelist-mapper (Python, 1 star): 📊 Maps nodes and edges of a multi-relational graph to integer
41. ardutank (C++, 1 star): 🚗 An Arduino based rover
42. rgcn-link-prediction-experiments (1 star)
43. restify-errors-options-errno (JavaScript, 1 star): ☎️ Add errno to Restify's errors