• Stars
    star
    1,749
  • Rank 25,507 (Top 0.6 %)
  • Language
    C++
  • License
    Apache License 2.0
  • Created over 3 years ago
  • Updated 15 days ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc ๐Ÿฆ–

StringZilla ๐Ÿฆ–

StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" ๐Ÿ˜…

Performance

StringZilla uses a heuristic so simple it's almost stupid... but it works. It matches the first few letters of words with hyper-scalar code to achieve memcpy speeds. The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms. So if you're haunted by open(...).readlines() and str().splitlines() taking forever, this should help ๐Ÿ˜Š

Backend \ Device IoT Laptop Server
Speed Comparison ๐Ÿ‡
Python for loop 4 MB/s 14 MB/s 11 MB/s
C++ for loop 520 MB/s 1.0 GB/s 900 MB/s
C++ string.find 560 MB/s 1.2 GB/s 1.3 GB/s
Scalar StringZilla 2 GB/s 3.3 GB/s 3.5 GB/s
Hyper-Scalar StringZilla 4.3 GB/s 12 GB/s 12.1 GB/s
Efficiency Metrics ๐Ÿ“Š
CPU Specs 8-core ARM, 0.5 W/core 8-core Intel, 5.6 W/core 22-core Intel, 6.3 W/core
Performance/Core 2.1 - 3.3 GB/s 11 GB/s 10.5 GB/s
Bytes/Joule 4.2 GB/J 2 GB/J 1.6 GB/J

Partition & Sort

Coming soon.

Quick Start: Python ๐Ÿ

1๏ธ. Install via pip: pip install stringzilla
2. Import classes: from stringzilla import Str, File, Strs

Basic Usage

StringZilla offers two mostly interchangeable core classes:

from stringzilla import Str, File

text1 = Str('some-string')
text2 = File('some-file.txt')

The Str is designed to replace long Python str strings and wrap our C-level API. On the other hand, the File memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Basic Operations

  • Length: len(text) -> int
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> str

Advanced Operations

  • 'substring' in text -> bool
  • text.contains('substring', start=0, end=9223372036854775807) -> bool
  • text.find('substring', start=0, end=9223372036854775807) -> int
  • text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
  • text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
  • text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices.

lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)

Need copies?

sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)

Basic list-like operations are also supported:

lines.append('Pythonic string')
lines.extend(shuffled_copy)

Quick Start: C ๐Ÿ› ๏ธ

There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.

#include "stringzilla.h"

// Initialize your haystack and needle
strzl_haystack_t haystack = {your_text, your_text_length};
strzl_needle_t needle = {your_subtext, your_subtext_length, your_anomaly_offset};

// Perform string-level operations
size_t character_count = strzl_naive_count_char(haystack, 'a');
size_t character_position = strzl_naive_find_char(haystack, 'a');
size_t substring_position = strzl_naive_find_substr(haystack, needle);

// Perform collection level operations
strzl_array_t array = {your_order, your_count, your_get_begin, your_get_length, your_handle};
strzl_sort(&array, &your_config);

Contributing ๐Ÿ‘พ

Future development plans include:

  • Replace PyBind11 with CPython.
  • Reverse-order operations in Python #12.
  • Bindings for JavaScript #25, Java, and Rust.
  • Faster string sorting algorithm.
  • Splitting CSV rows into columns.
  • Splitting with multiple separators at once #29.
  • UTF-8 validation.
  • Arm SVE backend.

Here's how to set up your dev environment and run some tests.

Development

CPython:

# Clean up and install
rm -rf build && pip install -e . && pytest scripts/test.py -s -x

# Install without dependencies
pip install -e . --no-index --no-deps

NodeJS:

npm install && node javascript/test.js

Benchmarking

To benchmark on some custom file and pattern combinations:

python scripts/bench.py --haystack_path "your file" --needle "your pattern"

To benchmark on synthetic data:

python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"

Packaging

To validate packaging:

cibuildwheel --platform linux

Compiling C++ Tests

# Install dependencies
brew install libomp llvm

# Compile and run tests
cmake -B ./build_release \
    -DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
    -DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
    -DSTRINGZILLA_USE_OPENMP=1 \
    -DSTRINGZILLA_BUILD_TEST=1 \
    && \
    make -C ./build_release -j && ./build_release/stringzilla_test

License ๐Ÿ“œ

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.


If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack ๐Ÿค—

More Repositories

1

SimSIMD

Up to 200x Faster Inner Products and Vector Similarity โ€” for Python, JavaScript, Rust, and C, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE ๐Ÿ“
C
707
star
2

ParallelReductionsBenchmark

Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SyCL - all it takes to sum a lot of numbers fast!
C++
59
star
3

memchr_vs_stringzilla

memchr vs stringzilla - up to 7x throughput difference between two SIMD-accelerated substring search libraries in Rust
Rust
39
star
4

SwiftSemanticSearch

S3 stands for Swift Semantic Search - using USearch & UForm to builds scalable multimodal Retrieval-Augmented pipelines on Apple devices ๐Ÿ
Swift
32
star
5

usearch-molecules

Searching for structural similarities across billions of molecules in milliseconds
Python
29
star
6

usearch-images

Semantic Search demo featuring UForm, USearch, UCall, and StreamLit, to visual and retrieve from image datasets, similar to "CLIP Retrieval"
Python
27
star
7

BenchmarkingTutorial

Google Benchmark examples and tutorials for C/C++ developers diving into High-Performance Computing and Numerical Methods โฑ๏ธ
C++
21
star
8

usearch-binary

Binary vector search example using Unum's USearch engine and pre-computed Wikipedia embeddings from Co:here and MixedBread
Jupyter Notebook
17
star
9

abusing-vector-search

Example of using Vector Search algorithms for non-traditional workloads, like GIS, stock prices, and sets
Python
11
star
10

HashTableBenchmark

A simple cross-platform speed & memory-efficiency benchmark for the most common hash-table implementations in the C++ world
C++
10
star
11

LibSee

Link to this library and it will log all the LibC functions you are calling and how much time you are spending in them!
C
8
star
12

extrapolaTED

Bringing TED experiences to every topic with Gen AI
Jupyter Notebook
6
star
13

TenPack

Fastest Tensors Packaging library for image, video, and audio data designed for PyTorch, TensorFlow, & NumPy ๐Ÿ–ผ๏ธ๐ŸŽต๐ŸŽฅ โžก๏ธ ๐Ÿง 
C++
6
star
14

acid-redis

Tiny Redis-like Persistent ACID Store on RocksDB with JSON-RPC using UCall and UStore
CMake
4
star
15

PolyglotBot

Bot we've build for the Poe.com hackathon at the AGI house to gather results from multiple LLMs and trigger specialized models on-demand
Python
4
star
16

image-search

Semantic Image Search Server with UForm, USearch, UCall
Python
3
star
17

HaversineSimSIMD

Staging area for Haversine distance computations in SimSIMD and USearch
C++
3
star
18

AssemblyStats

A research project highlighting the rarity of SIMD instructions in modern software
Jupyter Notebook
2
star
19

AppResources

Collection of resources for app development
1
star
20

CppNeuralSTL

Simple neural network models from scratch using only C++ STL
C++
1
star