  • Stars
    181
  • Rank 210,934 (Top 5 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 6 years ago
  • Updated almost 2 years ago

Repository Details

Utilities for preprocessing text for deep learning with Keras

Note: This utility is old and no longer maintained. Use keras.layers.TextVectorization instead.

Utilities for pre-processing text for deep learning in Keras.

ktext performs common pre-processing steps associated with deep learning (cleaning, tokenization, padding, truncation). Most importantly, ktext runs these steps in parallel across multiple processes. If you don't expect to benefit from parallelization, consider using the text preprocessing utilities in keras instead.

ktext helps you with the following:

  1. Cleaning: You may want to clean your data by removing HTML or by replacing items such as phone numbers and email addresses with generic tags. This step is optional, but it can reduce noise in your data.

  2. Tokenization: Take a raw string, e.g. "Hello World!", and split it into tokens so it looks like ['Hello', 'World', '!'].

  3. Generating Vocabulary and a {Token -> index} mapping: Map each unique token in your corpus to an integer value. This is usually stored as a dictionary. For example, {'Hello': 2, 'World': 3, '!': 4} might be a valid mapping from tokens to integers. You usually want to reserve an integer for rare or unseen words (ktext uses 1) and another integer for padding (ktext uses 0). You can set a threshold for rare words (see documentation).

  4. Truncating and Padding: While it is not strictly necessary, it is much easier to work with documents that all have the same length, which you can achieve by truncating and padding. Documents shorter than the desired length are padded with 0s, and documents longer than the desired length are truncated. This utility lets you build a histogram of your document lengths so you can choose a sensible document length for your corpus.
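The four steps above can be sketched in plain Python using only the standard library. This is an illustration of the concepts, not ktext's implementation or API: the function names, the regular expressions, the `_EMAIL_` and `_PHONE_` tags, the `min_count` threshold, and the choice to pad and truncate at the end of each document are all assumptions made for this example. Only the reserved ids (0 for padding, 1 for rare or unseen tokens) come from the description above.

```python
import math
import re
from collections import Counter

PAD_ID = 0   # reserved for padding (ktext uses 0)
RARE_ID = 1  # reserved for rare or unseen tokens (ktext uses 1)

# Deliberately simple patterns, chosen for this sketch only.
HTML_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks


def clean(text):
    """Step 1: drop HTML tags, replace emails and phone numbers with tags."""
    text = HTML_RE.sub(" ", text)
    text = EMAIL_RE.sub("_EMAIL_", text)
    return PHONE_RE.sub("_PHONE_", text)


def tokenize(text):
    """Step 2: split a string into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def build_vocab(token_lists, min_count=1):
    """Step 3: map tokens seen at least min_count times to ids 2, 3, ..."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, n in counts.most_common() if n >= min_count]
    return {tok: i for i, tok in enumerate(kept, start=2)}


def length_covering(token_lists, coverage=0.95):
    """Step 4a: smallest length that fits `coverage` of the documents whole."""
    lengths = sorted(len(toks) for toks in token_lists)
    idx = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[idx]


def encode(tokens, vocab, max_len):
    """Step 4b: map tokens to ids, truncate to max_len, pad the end with 0s."""
    ids = [vocab.get(tok, RARE_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    docs = ["<b>Hello</b> World!", "Hello again, World", "Mail jane@example.com"]
    token_lists = [tokenize(clean(d)) for d in docs]
    vocab = build_vocab(token_lists, min_count=2)
    max_len = length_covering(token_lists)
    for toks in token_lists:
        print(encode(toks, vocab, max_len))
```

Picking the document length from a coverage fraction is one way to read the length histogram: it keeps most documents intact while capping the cost of a few very long ones.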

This utility accomplishes all of the above using process-based parallelism for speed. Sklearn-style fit, transform, and fit_transform interfaces are provided (but they are not yet directly compatible with sklearn). Pull requests and comments are welcome.
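To make the combination of process-based parallelism and a fit/transform interface concrete, here is a minimal sketch built on the standard library's `concurrent.futures.ProcessPoolExecutor`. It is not ktext's code: the class name `SketchProcessor`, its parameters (`max_len`, `min_count`, `workers`), and the `parallel_map` helper are hypothetical names invented for this example.

```python
import re
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Defined at module level so worker processes can locate it by name."""
    return TOKEN_RE.findall(text)


def parallel_map(func, items, workers=None, chunksize=256):
    """Apply func to every item in worker processes (serially if workers == 1)."""
    if workers == 1:
        return [func(item) for item in items]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches items to reduce inter-process overhead.
        return list(pool.map(func, items, chunksize=chunksize))


class SketchProcessor:
    """Hypothetical sklearn-style wrapper; NOT the ktext API."""

    def __init__(self, max_len, min_count=1, workers=None):
        self.max_len = max_len
        self.min_count = min_count
        self.workers = workers
        self.vocab_ = {}

    def _fit_tokens(self, token_lists):
        counts = Counter(tok for toks in token_lists for tok in toks)
        kept = [t for t, n in counts.most_common() if n >= self.min_count]
        # Ids 0 (padding) and 1 (rare or unseen) are reserved.
        self.vocab_ = {t: i for i, t in enumerate(kept, start=2)}

    def _encode(self, tokens):
        ids = [self.vocab_.get(t, 1) for t in tokens][: self.max_len]
        return ids + [0] * (self.max_len - len(ids))

    def fit(self, docs):
        self._fit_tokens(parallel_map(tokenize, docs, self.workers))
        return self

    def transform(self, docs):
        token_lists = parallel_map(tokenize, docs, self.workers)
        return [self._encode(toks) for toks in token_lists]

    def fit_transform(self, docs):
        # Tokenize once, then both learn the vocabulary and encode.
        token_lists = parallel_map(tokenize, docs, self.workers)
        self._fit_tokens(token_lists)
        return [self._encode(toks) for toks in token_lists]


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing the module.
    proc = SketchProcessor(max_len=4, min_count=2)
    print(proc.fit_transform(["Hello World!", "Hello again, World"]))
```

Only tokenization is sent to the worker processes here, since it is the per-document step that parallelizes trivially; building the vocabulary needs counts from the whole corpus, so it stays in the parent process.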

Note: This utility is useful if all of your data fits into memory on a single node. If it does not, consider a distributed computing framework such as Hive, Spark, or Dask.

Documentation

This notebook contains a tutorial on how to use this library.

Installation

$ pip install ktext
