  • Stars: 549
  • Rank: 77,801 (Top 2 %)
  • Language: HTML
  • License: Apache License 2.0
  • Created about 6 years ago
  • Updated 4 months ago

Repository Details

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

News

[05 August 2021] - We are releasing version 3.0 of NLP-Cube and its models, and introducing FLAVOURS. This is a major update, but we did our best to keep the same API, so code written against previous versions should not break. The list of supported languages is smaller, but you can open an issue for an unsupported language and we will do our best to add it. Alternatively, you can pin the pip package to version 1.0.8: pip install nlpcube==0.1.0.8.

[15 April 2019] - We are releasing version 1.1 models - check all supported languages below. Both 1.0 and 1.1 models are trained on the same UD2.2 corpus; however, models 1.1 do not use vector embeddings, thus reducing disk space and time required to use them. Some languages actually have a slightly increased accuracy, some a bit decreased. By default, NLP Cube will use the latest (at this time) 1.1 models.

To use the older 1.0 models just specify this version in the load call: cube.load("en", 1.0) (en for English, or any other language code). This will download (if not already downloaded) and use this specific model version. Same goes for any language/version you want to use.

If you already have NLP Cube installed and want to use the newer 1.1 models, type either cube.load("en", 1.1) or cube.load("en", "latest") to auto-download them. After this, calling cube.load("en") without version number will automatically use the latest ones from your disk.


NLP-Cube

NLP-Cube is an open-source Natural Language Processing framework with support for the languages included in the UD Treebanks (list of all available languages below). Use NLP-Cube if you need:

  • Sentence segmentation
  • Tokenization
  • POS Tagging (both language independent (UPOSes) and language dependent (XPOSes and ATTRs))
  • Lemmatization
  • Dependency parsing

Example input: "This is a test.", output is:

1       This    this    PRON    DT      Number=Sing|PronType=Dem        4       nsubj   _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     _
3       a       a       DET     DT      Definite=Ind|PronType=Art       4       det     _
4       test    test    NOUN    NN      Number=Sing     0       root    SpaceAfter=No
5       .       .       PUNCT   .       _       4       punct   SpaceAfter=No
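
If you want to post-process this tabular output yourself, the sketch below shows one way to turn it into dictionaries using only the standard library. The column names follow the token attributes described in this README; the name of the ninth column (misc) is an assumption borrowed from CoNLL-U, and parse_rows is a hypothetical helper, not part of NLP-Cube.

```python
from typing import Dict, List

# Column names mirror the token attributes listed in this README.
# "misc" for the ninth column is an assumption borrowed from CoNLL-U.
FIELDS = ["index", "word", "lemma", "upos", "xpos", "attrs", "head", "label", "misc"]


def parse_rows(block: str) -> List[Dict[str, str]]:
    """Split every non-empty line on whitespace and name its columns."""
    rows = []
    for line in block.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines between sentences
        rows.append(dict(zip(FIELDS, parts)))
    return rows


SAMPLE = """\
1  This  this  PRON   DT   Number=Sing|PronType=Dem  4  nsubj  _
2  is    be    AUX    VBZ  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin  4  cop  _
3  a     a     DET    DT   Definite=Ind|PronType=Art  4  det  _
4  test  test  NOUN   NN   Number=Sing  0  root  SpaceAfter=No
5  .     .     PUNCT  .    _  4  punct  SpaceAfter=No
"""

rows = parse_rows(SAMPLE)
# The root of the dependency tree is the token whose head is 0.
root = next(row for row in rows if row["head"] == "0")
print(root["word"], root["upos"])  # prints: test NOUN
```

Splitting on whitespace is enough here because none of the columns in the example contain spaces; a stricter parser would split on tabs.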

If you just want to run it, here's how to set it up and use NLP-Cube in a few lines: Quick Start Tutorial.

For advanced users that want to create and train their own models, please see the Advanced Tutorials in examples/, starting with how to locally install NLP-Cube.

Simple (PIP) installation / update version

Install (or update) NLP-Cube with:

pip3 install -U nlpcube

API Usage

To use NLP-Cube programmatically (in Python), follow this tutorial. In short:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
document=cube(text)            # call with your own text (string) to obtain the annotations

The document object now contains the annotated text, one sentence at a time. To print the third word's POS (in the first sentence), just run:

print(document.sentences[0][2].upos) # [0] is the first sentence and [2] is the third word

Each token object has the following attributes: index, word, lemma, upos, xpos, attrs, head, label, deps, space_after. For detailed info about each attribute please see the standard CoNLL format.
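
As a small illustration of these attributes, the sketch below collects (word, upos) pairs for every sentence. It assumes each sentence can be iterated the same way it is indexed above; word_tag_pairs is a hypothetical helper, and the stand-in document is made-up data so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def word_tag_pairs(document):
    """Return one list of (word, upos) pairs per sentence."""
    return [
        [(token.word, token.upos) for token in sentence]
        for sentence in document.sentences
    ]


# Stand-in for the object returned by cube(text): hypothetical data that only
# mimics the .sentences / .word / .upos attributes described above.
fake_document = SimpleNamespace(
    sentences=[
        [
            SimpleNamespace(word="This", upos="PRON"),
            SimpleNamespace(word="is", upos="AUX"),
            SimpleNamespace(word="a", upos="DET"),
            SimpleNamespace(word="test", upos="NOUN"),
            SimpleNamespace(word=".", upos="PUNCT"),
        ]
    ]
)

pairs = word_tag_pairs(fake_document)
print(pairs[0][2])  # prints: ('a', 'DET')
```

With a real model loaded, the same function should work on the document returned by cube(text), provided the assumption about iteration holds.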

Flavours

Previous versions of NLP-Cube were trained on individual treebanks. This means that the same language was supported by multiple models at the same time. For instance, you could parse English (en) text with en_ewt, en_esl, en_lines, etc. The current version of NLP-Cube combines all flavours of a treebank under the same umbrella, by jointly optimizing a conditioned model. You only need to load the base language, for example en, and then select which flavour to apply at runtime:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."


# Parse using the default flavour (in this case EWT)
document=cube(text)            # call with your own text (string) to obtain the annotations
# or you can specify a flavour
document=cube(text, flavour='en_lines') 
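
If you want to compare what different flavours produce for the same text, a thin wrapper like the one below may help. It only relies on the flavour keyword shown above; annotate_with_flavours and fake_cube are hypothetical names, and fake_cube is a stand-in so the sketch runs without a model.

```python
from typing import Callable, Dict, Iterable


def annotate_with_flavours(
    annotate: Callable, text: str, flavours: Iterable[str]
) -> Dict[str, object]:
    """Call annotate(text, flavour=...) once per flavour and collect the results."""
    return {flavour: annotate(text, flavour=flavour) for flavour in flavours}


def fake_cube(text, flavour=None):
    """Stand-in for a loaded Cube object; returns a string instead of a document."""
    return f"{flavour}: {len(text.split())} whitespace-separated tokens"


results = annotate_with_flavours(fake_cube, "This is a test.", ["en_ewt", "en_lines"])
for flavour, output in results.items():
    print(output)
```

In real use you would pass the loaded cube object instead of fake_cube and inspect the returned documents side by side.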

Webserver Usage

The current version dropped webserver support, since most people preferred to implement their own NLP-Cube service.

Cite

If you use NLP-Cube in your research we would be grateful if you would cite the following paper:

  • NLP-Cube: End-to-End Raw Text Processing With Neural Networks, Boroș, Tiberiu and Dumitrescu, Stefan Daniel and Burtica, Ruxandra, Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics. p. 171--179. October 2018

or, in bibtex format:

@InProceedings{boro-dumitrescu-burtica:2018:K18-2,
  author    = {Boroș, Tiberiu  and  Dumitrescu, Stefan Daniel  and  Burtica, Ruxandra},
  title     = {{NLP}-Cube: End-to-End Raw Text Processing With Neural Networks},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {171--179},
  abstract  = {We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview on the results obtained in the competition.},
  url       = {http://www.aclweb.org/anthology/K18-2017}
}

Languages and performance

For comparison, the performance of 3.0 models is reported on the 2.2 UD corpus, but distributed models are obtained from UD 2.7.

Results are reported against the test files for each language (available in the UD 2.2 corpus) using the CoNLL 2018 evaluation script. Please see more info about what each metric represents here.

Notes:

  • Version 1.1 models no longer need the large external vector embedding files. This makes loading the 1.1 models faster and less RAM-intensive.
  • All results reported here are end-to-end: for example, we measure tagging accuracy on text segmented by NLP-Cube itself, as this is the real use case. CoNLL results are mostly reported on "gold" (pre-segmented) text, which leads to higher accuracy for the tagger, parser, etc.

Language           Model   Token  Sentence  UPOS   XPOS   AllTags  Lemmas  UAS    LAS
Chinese            zh-1.0  93.03  99.10     88.22  88.15  86.91    92.74   73.43  69.52
                   zh-1.1  92.34  99.10     86.75  86.66  85.35    92.05   71.00  67.04
                   zh-3.0  95.88  87.36     91.67  83.54  82.74    85.88   79.15  70.08
English            en-1.0  99.25  72.8      95.34  94.83  92.48    95.62   84.7   81.93
                   en-1.1  99.2   70.94     94.4   93.93  91.04    95.18   83.3   80.32
                   en-3.0  98.95  75.00     96.01  95.71  93.75    96.06   87.06  84.61
French             fr-1.0  99.68  94.2      92.61  95.46  90.79    93.08   84.96  80.91
                   fr-1.1  99.67  95.31     92.51  95.45  90.8     93.0    83.88  80.16
                   fr-3.0  99.71  93.92     97.33  99.56  96.61    90.79   89.81  87.24
German             de-1.0  99.7   81.19     91.38  94.26  80.37    75.8    79.6   74.35
                   de-1.1  99.77  81.99     90.47  93.82  79.79    75.46   79.3   73.87
                   de-3.0  99.77  86.25     94.70  97.00  85.02    82.73   87.08  82.69
Hungarian          hu-1.0  99.8   94.18     94.52  99.8   86.22    91.07   81.57  75.95
                   hu-1.1  99.88  97.77     93.11  99.88  86.79    91.18   77.89  70.94
                   hu-3.0  99.75  91.64     96.43  99.75  89.89    91.31   86.34  81.29
Italian            it-1.0  99.89  98.14     86.86  86.67  84.97    87.03   78.3   74.59
                   it-1.1  99.92  99.07     86.58  86.4   84.53    86.75   76.38  72.35
                   it-3.0  99.92  98.13     98.26  98.15  97.34    97.76   94.07  92.66
Romanian (RO-RRT)  ro-1.0  99.74  95.56     97.42  96.59  95.49    96.91   90.38  85.23
                   ro-1.1  99.71  95.42     96.96  96.32  94.98    96.57   90.14  85.06
                   ro-3.0  99.80  95.64     97.67  97.11  96.76    97.55   92.06  87.67
Spanish            es-1.0  99.98  98.32     98.0   98.0   96.62    98.05   90.53  88.27
                   es-1.1  99.98  98.40     98.01  98.00  96.6     97.99   90.51  88.16
                   es-3.0  99.96  97.17     96.88  99.91  94.88    98.17   92.11  89.86
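
To make the UAS and LAS columns concrete, here is a simplified sketch of how the two attachment scores are commonly defined: UAS counts tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. This is not the official evaluation script, which aligns system tokens to gold tokens and reports F1 scores; the sketch assumes gold and predicted tokenization are identical, and attachment_scores is a hypothetical helper.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages, assuming identical tokenization."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS: the head matches. LAS: both the head and the label match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    arc_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * arc_hits / total


# Gold arcs for "This is a test." and a made-up prediction with two mistakes.
gold = [(4, "nsubj"), (4, "cop"), (4, "det"), (0, "root"), (4, "punct")]
pred = [(4, "nsubj"), (4, "aux"), (4, "det"), (0, "root"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")  # prints: UAS=80.0 LAS=60.0
```

Because the end-to-end numbers in the table also depend on segmentation and tokenization quality, they will generally be lower than scores computed this way on gold tokens.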
