rake-nltk
RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing how frequently each word appears and how often it co-occurs with other words in the text.
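To make the idea concrete, here is a minimal standard-library sketch of the degree-to-frequency scoring described in the RAKE paper. It is an illustration only, not this package's implementation: the names STOPWORDS, candidate_phrases and rank_phrases are hypothetical, and the tiny stop word list is an assumption made to keep the example self-contained.

```python
import re
from collections import defaultdict

# Assumption: a tiny hand-picked stop word list, for illustration only.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}


def candidate_phrases(text):
    """Split text into runs of consecutive non-stop words."""
    phrases = []
    # Punctuation acts as a phrase boundary, so split on it first.
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rank_phrases(text):
    """Return (score, phrase) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)  # how often each word appears
    degree = defaultdict(int)     # co-occurrences within phrases, itself included
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    # Each word scores degree / frequency; a phrase scores the sum of its words.
    scored = {
        phrase: sum(degree[w] / frequency[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(((s, " ".join(p)) for p, s in scored.items()), reverse=True)
```

Calling rank_phrases("Keyword extraction is fun. Keyword extraction with python is simple.") ranks "keyword extraction" first, because both of its words repeatedly co-occur in a two-word phrase, while the single-word phrases each score 1.0.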
Features
- Ridiculously simple interface.
- Configurable word and sentence tokenizers, language-based stop words, etc.
- Configurable ranking metric.
Setup
Using pip
pip install rake-nltk
Directly from the repository
git clone https://github.com/csurfer/rake-nltk.git
python rake-nltk/setup.py install
Quick start
from rake_nltk import Rake
# Uses stopwords for English from NLTK, and all punctuation characters by
# default
r = Rake()
# Extraction given the text.
r.extract_keywords_from_text(<text to process>)
# Extraction given the list of strings where each string is a sentence.
r.extract_keywords_from_sentences(<list of sentences>)
# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()
# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()
Debugging Setup
If you see a stopwords error, it means that you do not have the stopwords corpus downloaded from NLTK. You can download it using the command below.
python -c "import nltk; nltk.download('stopwords')"
References
This is a Python implementation of the algorithm described in the paper "Automatic keyword extraction from individual documents" by Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley.
Why did I choose to implement it myself?
- It is extremely fun to implement algorithms by reading papers. It is the digital equivalent of DIY kits.
- There are some rather popular implementations out there, in Python (aneesha/RAKE) and Node (waseem18/node-rake), but neither seemed to use the power of NLTK. By making NLTK an integral part of the implementation, I get the flexibility and power to extend it in other creative ways, if I see fit later, without having to implement everything myself.
- I plan to use it in my other pet projects to come and wanted it to be modular and tunable. This way I have complete control.
Contributing
Bug Reports and Feature Requests
Please use the issue tracker for reporting bugs or requesting features.
Development
- Check out the repository.
- Make your changes and add/update relevant tests.
- Install poetry (pip install poetry).
- Run poetry install to create the project's virtual environment.
- Run tests using poetry run tox (this fails for any Python versions you don't have installed). Fix failing tests and repeat.
- Make any relevant documentation changes.
- Install pre-commit (pip install pre-commit) and run pre-commit run --all-files to do lint checks.
- Generate documentation using: poetry run sphinx-build -b html docs/ docs/_build/html
- Generate requirements.txt for automated testing using: poetry export --dev --without-hashes -f requirements.txt > requirements.txt
- Commit the changes and raise a pull request.
Buy the developer a cup of coffee!
If you found the utility helpful you can buy me a cup of coffee using