• Stars
    star
    126
  • Rank 284,543 (Top 6 %)
  • Language
    Python
  • Created almost 6 years ago
  • Updated about 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Find strings/words in text; convenience and C speed πŸŽ†

textsearch

Build Status Coverage Status PyPI PyPI

Find and/or replace multiple strings in text; focusing on convenience and using C for speed.

It mainly helps with providing convenience for NLP / text search related tasks. For example, it will help find tokens by default only if it is a full word match (and not a sub-match).

Features

  • βœ“ Compared to equivalent in regex, usually ~30-100x faster
  • βœ“ Tokenizer, string replacer, spell checker and many more things can be built on this (stay-tuned)
  • βœ“ Extendable (write your own handlers)
  • βœ“ Supports a prefix- or postfix based regex, usually something missing with other matchers
  • βœ“ Few depencies (relies on a C-module, credits to WojciechMula/pyahocorasick)
  • βœ“ Similar to flashtext (by my friend Vikash), but 30% faster, as convenient, and more features
  • βœ“ Optional support for accented characters (~15% slowdown)
  • βœ“ Lots of tests (good place to see examples for inspiration) and good coverage

Showcase

Here a list of the projects that are depending on textsearch.

Name Explanation By
rebrand Refactor your software using programming language independent string replacement. kootenpv
contractions Fixes contractions such as you're to you are kootenpv
tok Find strings/words in text; convenience and C speed kootenpv

Installation

Works on Python 3+ (make an issue if you really need Python 2)

pip install textsearch

Main functionality

ts = TextSearch("ignore", "norm")
ts.add("hi", "HI")
"hi" in ts            # True
ts.contains("hi!")    # True
ts.replace("hi!")     # "HI!"
ts.remove("hi")
ts.contains("hi!")    # False
ts
TextSearch(case='ignore', returns='norm', num_items=0)

More examples

See tests/ for more usage examples.

from textsearch import TextSearch

ts = TextSearch(case="ignore", returns="match")
ts.add("hi")
ts.findall("hello, hi")
# ["hi"]

ts = TextSearch(case="ignore", returns="norm")
ts.add("hi", "greeting")
ts.add("hello", "greeting")
ts.findall("hello, hi")
# ["greeting", "greeting"]

ts = TextSearch(case="ignore", returns="match")
ts.add(["hi", "bye"])
ts.findall("hi! bye! HI")
# ["hi", "bye", "hi"]

ts = TextSearch(case="insensitive", returns="match")
ts.add(["hi", "bye"])
ts.findall("hi! bye! HI")
# ["hi", "bye", "HI"]

ts = TextSearch("sensitive", "object")
ts.add("HI")
ts.findall("hi")
# []
ts.findall("HI")
# [TSResult(match='HI', norm='HI', start=0, end=2, case='upper', is_exact=True)]

ts = TextSearch("sensitive", dict)
ts.add("hI")
ts.findall("hI")
# [{'case': 'mixed', 'end': 2, 'exact': True, 'match': 'hI', 'norm': 'hI', 'start': 0}]

Usage

TextSearch takes the following arguments:

case: one of "ignore", "insensitive", "sensitive", "smart"
    - ignore: converts both sought words and to-be-searched to lower case before matching.
              Text matches will always be returned in lowercase.
    - insensitive: converts both sought words and to-be-searched to lower case before matching.
                   However: it will return the original casing as it uses the position found.
    - sensitive: does not do any conversion, will only match on exact words added.
    - smart: takes an input `k` and also adds k.title(), k.upper() and k.lower(). Matches sensitively.
returns: one of 'match', 'norm', 'object' (becomes TSResult) or a custom class
    See the examples!
    - match: returns the sought key when a match occurs
    - norm: returns the associated (usually normalized) value when a key match occurs
    - object: convenient object for working with matches, e.g.:
      TSResult(match='HI', norm='greeting', start=0, end=2, case='upper', is_exact=True)
    - class: bring your own class that will get instantiated like:
      MyClass(**{"match": k, "norm": v, "case": "lower", "exact": False, start: 0, end: 1})
      Trick: to get json-serializable results, use `dict`.
left_bound_chars (set(str)):
    Characters that will determine the left-side boundary check in findall/replace.
    Defaults to set([A-Za-z0-9_])
right_bound_chars (set(str)):
    Characters that will determine the right-side boundary check in findall/replace
    Defaults to set([A-Za-z0-9_])
replace_foreign_chars (default=False): replaces 'Γ‘' with 'a' for example in both input and target.
    Adds a roughly 15% slowdown.
handlers (list): provides a way to add hooks to matches.
    Currently only used when left and/or right bound chars are set.
    Regex can only be used when using norm
    The default handler that gets added in any case will check boundaries.
    Check how to conveniently add regex at the `add_regex_handler` function.
    Default: [(False, True, self.bounds_check)]
    - The first argument should be the normalized tag to fire on.
    - The second argument should be whether to keep the result
    - The third should be the handler function, that takes the arguments:
      - text: the original sentence/document text
      - start: the starting position of said string
      - stop: the ending position of said string
      - norm: the normalized result found
      Should return:
      - start: In case start position should change it is possible
      - stop: In case end position should change it is possible
      - norm: the new returned item. In case this is None, it will be removed
    Custom example:
      >>> def custom_handler(text, start, stop, norm):
      >>>    return start, stop, text[start:stop] + " is OK"
      >>> ts = TextSearch("ignore", "norm", handlers=[("HI", True, custom_handler)])
      >>> ts.add("hi", "HI")
      >>> ts.findall("hi HI")
      ['hi is OK', 'HI is OK']

Benchmarks

Coming soon, will compare what's available in Java and Go as well, and flashtext, regex.

More Repositories

1

whereami

Uses WiFi signals πŸ“Ά and machine learning to predict where you are
Python
5,100
star
2

yagmail

Send email in Python conveniently for gmail using yagmail
Python
2,639
star
3

neural_complete

A neural network trained to help writing neural network code using autocomplete
Python
1,152
star
4

gittyleaks

πŸ’§ Find sensitive information for a git repo
Python
741
star
5

sky

πŸŒ… next generation web crawling using machine intelligence
Python
328
star
6

contractions

Fixes contractions such as `you're` to `you are`
Python
308
star
7

access_points

Scan your WiFi and get access point information and signal quality
Python
187
star
8

brightml

Convenient Machine-Learned Auto Brightness (Linux)
Python
120
star
9

shrynk

Using Machine Learning to learn how to Compress ⚑
Python
109
star
10

loco

Share localhost through SSH. Local/Remote port forwarding made safe and easy.
Python
106
star
11

cliche

Build a simple command-line interface from your functions πŸ’»
Python
105
star
12

tok

Fast and customizable tokenization 🚀
Python
64
star
13

just

Just is a wrapper to automagically read/write a file based on extension
Python
50
star
14

aserve

Easily mock an API β˜•
Python
50
star
15

spacy_api

Server/Client around Spacy to load spacy only once
Python
46
star
16

xtoy

Automated Machine Learning: go from 'X' to 'y' without effort.
Python
46
star
17

requests_viewer

View requests objects with style
Python
42
star
18

cant

For those who can't remember how to get a result
Python
34
star
19

aioyagmail

makes sending emails very easy by doing all the magic for you, asynchronously
Python
29
star
20

sysdm

Scripts as a service. Builds on systemd (for Linux)
Python
21
star
21

deep_eye2mouse

Move the mouse by your webcam + eyes
Python
20
star
22

reddit_ml_challenge

Reddit Machine Learning: Tagging Challenge
Python
19
star
23

inthenews.io

Get the latest and greatest in news (on Python)
CSS
19
star
24

crtime

Get creation time of files for any platform - no external dependencies ⏰
Python
16
star
25

natura

Find currencies / money talk in natural text
Python
15
star
26

rebrand

✨ Refactor your software using programming language independent, case-preserving string replacement πŸ’„
Python
15
star
27

emacs-kooten-theme

Dark color theme by kootenpv
Emacs Lisp
14
star
28

justdb

Just a thread/process-safe, file-based, fast, database.
Python
8
star
29

fastlang

Fast Detection of Language without Dependencies
Python
7
star
30

quickpip

A template for creating a quick, maintainable and high quality pypi project
Python
7
star
31

xdb

Ambition: Single API for any database in Python
Python
6
star
32

nostalgia_chrome

Self tracking your online life!
Python
5
star
33

cnn_basics

NLP using CNN on Cornell Movie Ratings
Python
4
star
34

kootenpv.github.io

Pascal van Kooten's website hosted on github.io
CSS
3
star
35

gittraffic

Save your gittrafic data so it won't get lost!
Python
3
star
36

flymake-solidity

flymake for solidity, using flymake-easy: live feedback on writing solidity contracts
Emacs Lisp
3
star
37

ppm

Safe password manager
C
2
star
38

automl_presentation

Example code for the presentation "Automated Machine Learning"
Python
2
star
39

dot_access

Makes nested python objects easy to go through
Python
1
star
40

feedview

View a feed url with `feedview <URL>`
Python
1
star
41

PassMan

android app for ppm
C
1
star
42

mockle

Automatic Mocking by Pickles
Python
1
star
43

emoji-picker

Python
1
star