

Repository Details

Python Keyphrase Extraction module

pke - python keyphrase extraction

pke is an open source python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.



Installation

To install pke from GitHub with pip:

pip install git+https://github.com/boudinfl/pke.git

pke relies on spaCy (>= 3.2.3) for text processing and requires a spaCy language model to be installed:

# download the English model
python -m spacy download en_core_web_sm

Minimal example

pke provides a standardized API for extracting keyphrases from a document. Start with the five statements below. To use another model, replace pke.unsupervised.TopicRank with any other implemented model (see Implemented models).

import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document; here the document is expected to be a
# plain text string, and preprocessing is carried out using spaCy
extractor.load_document(input='text', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
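
The `(Noun|Adj)*` pattern mentioned in the comments above can be pictured without pke at all. The following is a minimal sketch, not pke's implementation: the function name `select_candidates`, the tag set `VALID_POS`, and the hand-tagged sentence are all hypothetical, and real part-of-speech tags would come from spaCy.

```python
from typing import List, Tuple

# Part-of-speech tags treated as nouns or adjectives.
# This set is an assumption made for the sketch.
VALID_POS = {"NOUN", "PROPN", "ADJ"}


def select_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return the longest runs of consecutive noun/adjective tokens."""
    candidates: List[str] = []
    current: List[str] = []
    for word, pos in tagged:
        if pos in VALID_POS:
            current.append(word)
        elif current:
            # A non-matching token closes the current run.
            candidates.append(" ".join(current))
            current = []
    if current:
        # Flush a run that reaches the end of the input.
        candidates.append(" ".join(current))
    return candidates


# Hand-tagged toy sentence (hypothetical data).
sentence = [
    ("Keyphrase", "NOUN"), ("extraction", "NOUN"), ("finds", "VERB"),
    ("the", "DET"), ("important", "ADJ"), ("topics", "NOUN"),
    ("of", "ADP"), ("a", "DET"), ("document", "NOUN"), (".", "PUNCT"),
]

print(select_candidates(sentence))
# prints ['Keyphrase extraction', 'important topics', 'document']
```

Each maximal run becomes one candidate, which is why "Keyphrase extraction" is kept whole rather than split into two single-word candidates.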

A detailed example is provided in the examples/ directory.
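
Because every model is driven through the same four calls, the pipeline can be wrapped once and reused when swapping models. The helper below is a sketch under that assumption: `run_pipeline` is a hypothetical name, not part of pke, and it relies only on the method names and keyword arguments shown in the minimal example, with each step left at its default parameters.

```python
from typing import Any, List, Tuple


def run_pipeline(extractor: Any, text: str, n: int = 10,
                 language: str = "en") -> List[Tuple[str, float]]:
    """Run the four pipeline steps on any extractor exposing the calls above.

    `extractor` is duck-typed: it only needs load_document,
    candidate_selection, candidate_weighting and get_n_best.
    """
    extractor.load_document(input=text, language=language)
    extractor.candidate_selection()   # model-specific candidate pattern
    extractor.candidate_weighting()   # model-specific scoring
    return extractor.get_n_best(n=n)  # list of (keyphrase, score) tuples
```

With pke and a spaCy model installed, a call such as `run_pipeline(pke.unsupervised.TopicRank(), text)` should behave like the minimal example. Models that need extra arguments at the selection or weighting step would require a more specific wrapper.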

Getting started

To get your hands dirty with pke, we invite you to try our tutorials out.

Name                                                 Link
Getting started with pke and keyphrase extraction    Open In Colab
Model parameterization                               Open In Colab
Benchmarking models                                  Open In Colab

Implemented models

pke currently implements the following keyphrase extraction models:

Model performances

For comparison purposes, overall results of implemented models on commonly-used benchmark datasets are available in results. Code for reproducing these experiments is in the benchmarking notebook (also available on Colab).
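
To make the idea of scoring a model against a benchmark concrete, here is a minimal sketch of precision, recall and F1 over the top N extracted keyphrases. It assumes exact matching after lowercasing, which is a simplification chosen for illustration; the metric and normalization actually used for the reported results are not stated here, and `f1_at_n` is a hypothetical helper, not a pke function.

```python
from typing import Iterable, List, Tuple


def f1_at_n(predicted: List[str], gold: Iterable[str],
            n: int = 10) -> Tuple[float, float, float]:
    """Precision, recall and F1 of the top-n predictions against gold keyphrases."""
    # Normalize both sides so that matching ignores case and outer whitespace.
    top = [p.lower().strip() for p in predicted[:n]]
    reference = {g.lower().strip() for g in gold}
    if not top or not reference:
        return 0.0, 0.0, 0.0
    matches = len(set(top) & reference)
    precision = matches / len(top)
    recall = matches / len(reference)
    if matches == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical output of get_n_best (scores dropped) and gold annotations.
predicted = ["topic model", "Random Walk", "graph", "spacy"]
gold = ["random walk", "graph", "keyphrase extraction"]

precision, recall, f1 = f1_at_n(predicted, gold, n=10)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The `predicted` list would typically be built from the first element of each `(keyphrase, score)` tuple returned by `get_n_best`.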

Citing pke

If you use pke, please cite the following paper:

@InProceedings{boudin:2016:COLINGDEMO,
  author    = {Boudin, Florian},
  title     = {pke: an open source python-based keyphrase extraction toolkit},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {69--73},
  url       = {http://aclweb.org/anthology/C16-2015}
}

More Repositories

1. ake-datasets (Shell, 138 stars): Large, curated set of benchmark datasets for evaluating automatic keyphrase extraction algorithms.
2. takahe (Python, 54 stars): takahe is a multi-sentence compression module
3. sume (Python, 36 stars): Sume is an implementation of the concept-based ILP model for summarization.
4. centrality_measures_ijcnlp13 (Python, 13 stars): Centrality Measures for Graph-Based Keyphrase Extraction
5. taln-archives (TeX, 12 stars): TALN Archives is a digital archive of French research articles in Natural Language Processing
6. kea (JavaScript, 11 stars): A tokenizer for French
7. ir-using-kg (Python, 11 stars): Keyphrase Generation for Scientific Document Retrieval
8. acm-cr (TeX, 8 stars): ACM-CR: A Manually Annotated Test Collection for Citation Recommendation
9. hulth-2003-pre (Shell, 8 stars): Preprocessed Inspec keyphrase extraction benchmark dataset
10. duc-2001-pre (7 stars): Preprocessed DUC 2001 keyphrase extraction benchmark dataset
11. semeval-2010-pre (7 stars): Preprocessed SemEval-2010 benchmark dataset for keyphrase extraction
12. marujo-2012-pre (Shell, 5 stars): Preprocessed Marujo keyphrase extraction benchmark dataset
13. redefining-absent-keyphrases (Python, 5 stars): Code and dataset for the paper "Redefining Absent Keyphrases and their Effect on Retrieval Effectiveness"
14. krapivin-2009-pre (Python, 4 stars): Preprocessed Krapivin keyphrase extraction benchmark dataset
15. lina-msc (3 stars): LINA-msc is a dataset for evaluating Multi-sentence Compression in French.
16. kepy (Python, 2 stars): kepy is a keyphrase extraction module in Python
17. cross-language_IR (TeX, 2 stars): A two-hour course on cross-lingual information retrieval
18. wikinews-2013-pre (Python, 1 star): Preprocessed Wikinews Keyphrase benchmark dataset
19. boudinfl.github.io (HTML, 1 star): website
20. CLIREC (Jupyter Notebook, 1 star): CLinical Information Retrieval Evaluation Collection
21. pke-benchmarking (Jupyter Notebook, 1 star)