• Stars
    star
    2,094
  • Rank 22,061 (Top 0.5 %)
  • Language
    C++
  • License
    Apache License 2.0
  • Created over 4 years ago
  • Updated about 2 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc ๐Ÿฆ–

StringZilla ๐Ÿฆ–

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" ๐Ÿ˜…

Performance

StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms. So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help ๐Ÿ˜Š

Backend \ Device IoT Laptop Server
Speed Comparison ๐Ÿ‡
Python for loop 4 MB/s 14 MB/s 11 MB/s
C++ for loop 520 MB/s 1.0 GB/s 900 MB/s
C++ string.find 560 MB/s 1.2 GB/s 1.3 GB/s
Scalar StringZilla 2 GB/s 3.3 GB/s 3.5 GB/s
Hyper-Scalar StringZilla 4.3 GB/s 12 GB/s 12.1 GB/s
Efficiency Metrics ๐Ÿ“Š
CPU Specs 8-core ARM, 0.5 W/core 8-core Intel, 5.6 W/core 22-core Intel, 6.3 W/core
Performance/Core 2.1 - 3.3 GB/s 11 GB/s 10.5 GB/s
Bytes/Joule 4.2 GB/J 2 GB/J 1.6 GB/J

Partition & Sort

Coming soon.

Quick Start: Python ๐Ÿ

1๏ธ. Install via pip: pip install stringzilla
2. Import classes: from stringzilla import Str, File, Strs

Basic Usage

StringZilla offers two mostly interchangeable core classes:

from stringzilla import Str, File

text1 = Str('some-string')
text2 = File('some-file.txt')

The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Basic Operations

  • Length: len(text) -> int
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> str

Advanced Operations

  • 'substring' in text -> bool
  • text.contains('substring', start=0, end=9223372036854775807) -> bool
  • text.find('substring', start=0, end=9223372036854775807) -> int
  • text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
  • text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
  • text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Need copies?

sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

lines.append('Pythonic string')
lines.extend(shuffled_copy)

Quick Start: C ๐Ÿ› ๏ธ

There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

#include "stringzilla.h"

// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};

// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);

// Perform collection level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);

Contributing ๐Ÿ‘พ

Future development plans include:

  • Replace PyBind11 with CPython.
  • Reverse-order operations in Python #12.
  • Bindings for JavaScript #25, Java, and Rust.
  • Faster string sorting algorithm.
  • Splitting CSV rows into columns.
  • Splitting with multiple separators at once #29.
  • UTF-8 validation.
  • Arm SVE backend.

Here's how to set up your dev environment and run some tests.

Development

CPython:

# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

NodeJS:

npm install && node javascript/test.js

Benchmarking

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"

Packaging

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

License ๐Ÿ“œ

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.


If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack ๐Ÿค—

More Repositories

1

SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics โ€” for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for both AVX2, AVX-512, NEON, SVE, & SVE2 ๐Ÿ“
C
913
star
2

SwiftSemanticSearch

Real-time on-device text-to-image and image-to-image Semantic Search with video stream capture using USearch & UForm AI Swift SDKs for Apple devices ๐Ÿ
Swift
81
star
3

ParallelReductionsBenchmark

Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SyCL - all it takes to sum a lot of numbers fast!
C++
73
star
4

usearch-molecules

Searching for structural similarities across billions of molecules in milliseconds
Python
46
star
5

memchr_vs_stringzilla

memchr vs stringzilla - up to 7x throughput difference between two SIMD-accelerated substring search libraries in Rust
Rust
45
star
6

usearch-images

Semantic Search demo featuring UForm, USearch, UCall, and StreamLit, to visual and retrieve from image datasets, similar to "CLIP Retrieval"
Python
38
star
7

BenchmarkingTutorial

Google Benchmark tutorial for C/C++ developers diving into High-Performance Computing and Numerical Methods โฑ๏ธ
C++
26
star
8

usearch-binary

Binary vector search example using Unum's USearch engine and pre-computed Wikipedia embeddings from Co:here and MixedBread
Jupyter Notebook
19
star
9

tinysemver

Tiny Semantic Versioning (SemVer) library and GitHub CI, that doesn't depend on 300K lines of JavaScript code and fits in a single Python file
Python
16
star
10

cpp-cuda-python-starter-kit

Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake using PyBind11
Cuda
16
star
11

abusing-vector-search

Example of using Vector Search algorithms for non-traditional workloads, like GIS, stock prices, and sets
Python
12
star
12

MongooseMiner

Documentation retrieval system to help LLMs navigate less-popular (yet often more powerful) Python libraries
Python
11
star
13

HashTableBenchmark

A simple cross-platform speed & memory-efficiency benchmark for the most common hash-table implementations in the C++ world
C++
11
star
14

LibSee

Link to this library and it will log all the LibC functions you are calling and how much time you are spending in them!
C
11
star
15

extrapolaTED

Bringing TED experiences to every topic with Gen AI
Jupyter Notebook
8
star
16

affine-gaps

Less-wrong single-file Numba-accelerated Python implementation of Gotoh affine gap penalty extensions for the Needlemanโ€“Wunsch, Smith-Waterman, and Levenshtein algorithms for sequence alignment
Python
7
star
17

AssemblyStats

A research project highlighting the rarity of SIMD instructions in modern software
Jupyter Notebook
6
star
18

TenPack

Fast Tensors Packaging library for text, image, video, and audio data compatible with PyTorch, TensorFlow, & NumPy ๐Ÿ–ผ๏ธ๐ŸŽต๐ŸŽฅ โžก๏ธ ๐Ÿง 
C++
6
star
19

scaling-democracy

GPU-accelerated Schulze voting method in Python, Numba, and CUDA, using ideas from Algebraic Graph Theory
Cuda
5
star
20

acid-redis

Tiny Redis-like Persistent ACID Store on RocksDB with JSON-RPC using UCall and UStore
CMake
4
star
21

PolyglotBot

Bot we've build for the Poe.com hackathon at the AGI house to gather results from multiple LLMs and trigger specialized models on-demand
Python
4
star
22

HaversineSimSIMD

Staging area for Haversine distance computations in SimSIMD and USearch
C++
4
star
23

image-search

Semantic Image Search Server with UForm, USearch, UCall
Python
3
star
24

spacev-1b

Billion-scale Semantic Search dataset derived from Microsoft SpaceV for Vector Search benchmarks
1
star
25

AppResources

Collection of resources for app development
1
star
26

CppNeuralSTL

Simple neural network models from scratch using only C++ STL
C++
1
star