  • Language
    Python
  • License
    Apache License 2.0
  • Created over 7 years ago
  • Updated over 1 year ago


EmbedRank: Unsupervised Keyphrase Extraction using Sentence Embeddings (official implementation)

This is the implementation of the following paper: https://arxiv.org/abs/1801.04470

Installation

Local Installation

  1. Download full Stanford CoreNLP Tagger version 3.8.0 http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip

  2. Install sent2vec from https://github.com/epfml/sent2vec

    • Clone or download the repository
    • Go to the sent2vec directory
    • git checkout f827d014a473aa22b2fef28d9e29211d50808d48
    • make
    • pip install cython
    • Inside the src folder:
      • python setup.py build_ext
      • pip install .
      • (On macOS) If setup.py throws an error (warnings can be ignored), open setup.py and add '-stdlib=libc++' to the compile_opts list.
    • Download a pre-trained model (see the README of the sent2vec repository), for example wiki_bigrams.bin
  3. Install requirements

    After cloning this repository, go to its root directory and run pip install -r requirements.txt

  4. Download NLTK data

import nltk 
nltk.download('punkt')
  5. Launch the Stanford CoreNLP tagger

    • Open a new terminal
    • Go to the extracted stanford-corenlp-full directory
    • Run the server: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos -status_port 9000 -port 9000 -timeout 15000 &
  6. Set the paths in config.ini.template

    • You can leave the [STANFORDTAGGER] parameters empty
    • For [STANFORDCORENLPTAGGER]:
      • Set host to localhost
      • Set port to 9000
    • For [SENT2VEC]:
      • Set model_path to the location of the pre-trained model, e.g. your_path_to_model/wiki_bigrams.bin (if you chose wiki_bigrams.bin)
    • Rename config.ini.template to config.ini
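
The configuration step above can also be scripted. The following is a minimal sketch, not part of the repository: it assumes the template is a standard INI file with the section and key names mentioned above (host, port, model_path). The helper name fill_config and the template text are hypothetical, and the real config.ini.template may contain additional keys.

```python
import configparser
import io


def fill_config(template_text, model_path, host="localhost", port=9000):
    """Return the text of a config.ini built from the template text.

    Hypothetical helper: section and key names follow the README,
    everything else is an assumption.
    """
    # interpolation=None so that '%' characters in paths are kept literally
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(template_text)

    # Create the sections if the template does not define them
    for section in ("STANFORDCORENLPTAGGER", "SENT2VEC"):
        if not parser.has_section(section):
            parser.add_section(section)

    parser.set("STANFORDCORENLPTAGGER", "host", host)
    parser.set("STANFORDCORENLPTAGGER", "port", str(port))
    parser.set("SENT2VEC", "model_path", model_path)

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


# Assumed template contents, for illustration only
TEMPLATE = """
[STANFORDTAGGER]

[STANFORDCORENLPTAGGER]
host =
port =

[SENT2VEC]
model_path =
"""

print(fill_config(TEMPLATE, "your_path_to_model/wiki_bigrams.bin"))
```

Writing the returned text to config.ini in the repository root has the same effect as editing and renaming the template by hand. Note that configparser drops any comments present in the template.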

Docker

Probably the easiest way to get started is by using the provided Docker image. From the project's root directory, the image can be built like so:

$ docker build . -t keyphrase-extraction

This can take a few minutes to finish. Pre-trained sent2vec models are not downloaded during the build, since each model is several GB in size. Remember to allocate enough memory to your Docker container, because the models are loaded into RAM.

To launch the container in interactive mode, so that you can run your own code, use:

$ docker run -v {path to wiki_bigrams.bin}:/sent2vec/pretrained_model.bin -it keyphrase-extraction
# Run the corenlp server
/app # cd /stanford-corenlp
/stanford-corenlp # nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos -status_port 9000 -port 9000 -timeout 15000 &
# Press enter to get stdin back
/stanford-corenlp # cd /app
/app # python
>>> import launch

You have to specify the path to your sent2vec model using the -v argument. If you use a model other than wiki_bigrams.bin, adjust the path accordingly (and remember to remove the curly brackets).

Usage

Once the CoreNLP server is running:

import launch

embedding_distributor = launch.load_local_embedding_distributor()
pos_tagger = launch.load_local_corenlp_pos_tagger()

kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en')  #extract 10 keyphrases
kp2 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text2, 10, 'en')
...

For each text, this returns a tuple containing three lists:

  1. The top N candidates (strings), i.e. the keyphrases
  2. For each keyphrase, the associated relevance score
  3. For each keyphrase, a list of aliases (other candidates very similar to the one selected as keyphrase)
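
As an illustration of how the three lists line up, here is a small sketch that pairs each keyphrase with its score and aliases. The helper name summarize and the sample values are made up for this example; they are not output of the actual model.

```python
def summarize(result):
    """Pair each keyphrase with its relevance score and its aliases."""
    keyphrases, scores, aliases = result
    return [
        {"keyphrase": kp, "score": score, "aliases": alias_list}
        for kp, score, alias_list in zip(keyphrases, scores, aliases)
    ]


# Made-up values standing in for the tuple returned by launch.extract_keyphrases
example_result = (
    ["keyphrase extraction", "sentence embeddings"],
    [0.91, 0.84],
    [["keyphrase extractor"], []],
)

for row in summarize(example_result):
    print(f"{row['score']:.2f}  {row['keyphrase']}  aliases={row['aliases']}")
```

With the real library, the same helper would be called as summarize(kp1).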

Method

The method is described in the following paper: https://arxiv.org/abs/1801.04470


By using sentence embeddings, EmbedRank embeds both the document and candidate phrases into the same embedding space.

N candidates are selected as keyphrases using Maximal Marginal Relevance (MMR): the cosine similarity between each candidate and the document models informativeness, and the cosine similarity between candidates models diversity.

A hyperparameter, beta (default 0.55), controls the trade-off between informativeness and diversity when extracting keyphrases (beta = 1: informativeness only; beta = 0: diversity only). You can change the value of beta when calling extract_keyphrases:

kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, raw_text, 10, 'en', beta=0.8)  #extract 10 keyphrases with beta=0.8

If you want to replicate the results of the paper, set beta to 1 or 0.5 and turn off the alias feature by passing alias_threshold=1 to extract_keyphrases.
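
To make the role of beta concrete, the selection rule described in this section can be sketched in plain Python. This is a simplified illustration of Maximal Marginal Relevance, not the repository's code: it omits any score normalisation the actual implementation may apply, and the function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, candidates, n, beta=0.55):
    """Greedily pick up to n phrases from candidates (phrase -> embedding).

    Each step maximises
        beta * sim(candidate, document)
        - (1 - beta) * max sim(candidate, already selected phrase)
    """
    relevance = {p: cosine(v, doc_vec) for p, v in candidates.items()}
    remaining = list(candidates)
    selected = []

    def score(phrase):
        # Redundancy is 0 while nothing has been selected yet
        redundancy = max(
            (cosine(candidates[phrase], candidates[s]) for s in selected),
            default=0.0,
        )
        return beta * relevance[phrase] - (1 - beta) * redundancy

    while remaining and len(selected) < n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: "phrase b" is a near-duplicate of "phrase a"
DOC = [1.0, 1.0]
CANDIDATES = {
    "phrase a": [1.0, 0.9],
    "phrase b": [1.0, 0.8],
    "phrase c": [0.2, 1.0],
}

print(mmr_select(DOC, CANDIDATES, 2, beta=1.0))
print(mmr_select(DOC, CANDIDATES, 2, beta=0.5))
```

With these toy vectors, beta = 1 keeps the two phrases closest to the document even though they are near-duplicates, while beta = 0.5 replaces the second one with the more distinct "phrase c".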
