
Text Extensions for Pandas


Natural language processing support for Pandas dataframes.

Text Extensions for Pandas turns Pandas DataFrames into a universal data structure for representing intermediate data in all phases of your NLP application development workflow.

Web site: https://ibm.biz/text-extensions-for-pandas

API docs: https://text-extensions-for-pandas.readthedocs.io/

Features

SpanArray: A Pandas extension type for spans of text

  • Connect features with regions of a document
  • Visualize the internal data of your NLP application
  • Analyze the accuracy of your models
  • Combine the results of multiple models
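The idea behind a span can be sketched in plain Python. This is a conceptual illustration only, not the library's actual `SpanArray` API: a span ties a feature to a region of the source document by character offsets.

```python
from dataclasses import dataclass

# Conceptual sketch of what a span represents -- NOT the library's API.
@dataclass(frozen=True)
class Span:
    text: str   # the full source document
    begin: int  # inclusive start offset
    end: int    # exclusive end offset

    @property
    def covered_text(self) -> str:
        # Slice the document to recover the region this span covers
        return self.text[self.begin:self.end]

doc = "Text Extensions for Pandas adds NLP support to DataFrames."
span = Span(doc, 0, 26)
print(span.covered_text)  # -> "Text Extensions for Pandas"
```

`SpanArray` stores many such (begin, end) pairs against a shared target text, so a whole column of spans can live in one Pandas series.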

TensorArray: A Pandas extension type for tensors

  • Represent BERT embeddings in a Pandas series
  • Store logits and other feature vectors in a Pandas series
  • Store an entire time series in each cell of a Pandas series
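The problem TensorArray addresses can be seen with plain Pandas and NumPy (a sketch of the concept, not the extension type itself): storing a fixed-shape array in each cell of an ordinary series forces the column to object dtype.

```python
import numpy as np
import pandas as pd

# Three 4-dimensional vectors, e.g. token embeddings (toy values).
embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)

# Without an extension type, per-cell arrays fall back to object dtype:
s = pd.Series(list(embeddings))
print(s.dtype)          # object
print(s.iloc[0].shape)  # (4,)

# TensorArray instead keeps the whole (3, 4) block as one contiguous
# array while still behaving like a series of 3 tensor-valued cells.
```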

Pandas front-ends for popular NLP toolkits

CoNLL-2020 Paper

Looking for the model training code from our CoNLL-2020 paper, "Identifying Incorrect Labels in the CoNLL-2003 Corpus"? See the notebooks in this directory.

The associated data set is here.

Installation

This library requires Python 3.7+, Pandas, and NumPy.

To install the latest release, just run:

pip install text-extensions-for-pandas

Depending on your use case, you may also need the following additional packages:

  • spacy (for SpaCy support)
  • transformers (for transformer-based embeddings and BERT tokenization)
  • ibm_watson (for IBM Watson support)
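A quick way to see which of these optional integrations are present in your environment is a stdlib-only check (the package names come from the list above; any of them may legitimately be absent):

```python
import importlib.util

# Probe for each optional dependency without importing it.
optional = ["spacy", "transformers", "ibm_watson"]
available = {name: importlib.util.find_spec(name) is not None
             for name in optional}
print(available)  # e.g. {'spacy': True, 'transformers': False, ...}
```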

Alternatively, you can install the package from conda-forge into a conda environment with:

conda install --channel=conda-forge text_extensions_for_pandas

Installation from Source

If you'd like to try out the very latest version of our code, you can install directly from the head of the master branch:

pip install git+https://github.com/CODAIT/text-extensions-for-pandas

You can also directly import our package from your local copy of the text_extensions_for_pandas source tree. Just add the root of your local copy of this repository to the front of sys.path.
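Prepending the source tree to `sys.path` looks like this (the path below is a hypothetical checkout location; substitute your own):

```python
import sys

# Hypothetical location of your local checkout -- adjust as needed.
repo_root = "/home/me/text-extensions-for-pandas"

# Put the source tree ahead of any installed copy so that
# `import text_extensions_for_pandas` resolves to the checkout.
if repo_root not in sys.path:
    sys.path.insert(0, repo_root)
```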

Documentation

For examples of how to use the library, take a look at the example notebooks in this directory. You can try out these notebooks on Binder by navigating to https://mybinder.org/v2/gh/frreiss/tep-fred/branch-binder?urlpath=lab/tree/notebooks

To run the notebooks on your local machine, follow these steps:

  1. Install Anaconda or Miniconda.
  2. Check out a copy of this repository.
  3. Use the script env.sh to set up an Anaconda environment for running the code in this repository.
  4. Type jupyter lab from the root of your local source tree to start a JupyterLab environment.
  5. Navigate to the notebooks directory and choose any of the notebooks there.

API documentation can be found at https://text-extensions-for-pandas.readthedocs.io/en/latest/

Contents of this repository

  • text_extensions_for_pandas: Source code for the text_extensions_for_pandas module.
  • env.sh: Script to create a conda environment pd capable of running the notebooks and test cases in this project.
  • generate_docs.sh: Script to build the API documentation.
  • api_docs: Configuration files for generate_docs.sh.
  • binder: Configuration files for running notebooks on Binder.
  • config: Configuration files for env.sh.
  • docs: Project web site.
  • notebooks: Example notebooks.
  • resources: Various input files used by our example notebooks.
  • test_data: Data files for regression tests. The tests themselves are located adjacent to the library code files.
  • tutorials: Detailed tutorials on using Text Extensions for Pandas to cover complex end-to-end NLP use cases (work in progress).

Contributing

This project is an IBM open source project. We are developing the code in the open under the Apache License, and we welcome contributions from both inside and outside IBM.

To contribute, just open a GitHub issue or submit a pull request. Be sure to include a copy of the Developer's Certificate of Origin 1.1 along with your pull request.

Building and Running Tests

Before building the code in this repository, we recommend that you use the provided script env.sh to set up a consistent build environment:

$ ./env.sh --env_name myenv
$ conda activate myenv

(replace myenv with your choice of environment name).

To run tests, navigate to the root of your local copy and run:

pytest text_extensions_for_pandas

To build pip and source code packages:

python setup.py sdist bdist_wheel

(outputs go into ./dist).

To build API documentation, run:

./generate_docs.sh
