  • Stars: 479
  • Rank: 91,752 (Top 2 %)
  • Language: Rust
  • License: MIT License
  • Created about 6 years ago
  • Updated over 2 years ago

Repository Details

Serverless full-text search with Cloudflare Workers, WebAssembly, and Roaring Bitmaps

Edgesearch

Build a full text search API using Cloudflare Workers and WebAssembly.

Features

Demos

Check out the demo folder for live demos with source code.

How it works

Edgesearch builds an inverted index that maps each term to a compressed bit set (using Roaring Bitmaps) of the IDs of documents containing the term, then generates a custom worker script and data to upload to Cloudflare Workers.

Data

An array of term-documents pairs sorted by term is built, where term is a string and documents is a compressed bit set.
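
As a rough illustration of the sorted term-documents array, the sketch below builds one from a list of per-document terms. It is not Edgesearch's actual code: the function name buildIndex is hypothetical, document IDs are assumed to be the position of each document in the input, and plain arrays of IDs stand in for Roaring Bitmaps.

```typescript
// A term paired with the IDs of the documents that contain it.
type TermDocuments = [string, number[]];

// Hypothetical helper: documentTerms[i] holds the terms of document i.
function buildIndex(documentTerms: string[][]): TermDocuments[] {
  const index = new Map<string, number[]>();
  documentTerms.forEach((terms, docId) => {
    // De-duplicate so a document ID appears at most once per term.
    Array.from(new Set(terms)).forEach((term) => {
      const ids = index.get(term);
      if (ids) {
        ids.push(docId);
      } else {
        index.set(term, [docId]);
      }
    });
  });
  // Sort by term so the pairs can later be binary searched.
  return Array.from(index.entries()).sort(([a], [b]) =>
    a < b ? -1 : a > b ? 1 : 0
  );
}
```

Because documents are visited in order, each ID list comes out sorted, which is the natural input order for a compressed bit set.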

This array is then split into chunks of up to 25 MiB, as each Cloudflare Workers KV entry can hold a value up to 25 MiB in size.

To find the documents bit set associated with a term, a binary search is done to find the appropriate chunk, and then the pair within the chunk.

The same structure and process is used to store and retrieve document contents.

Packing multiple bit sets/documents reduces read/write costs and deploy times, and improves caching and execution speed due to fewer fetches.
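
The two-level lookup described above might be sketched as follows. This is an illustration under stated assumptions, not the worker's real implementation: chunks are split by a pair count rather than by the 25 MiB byte limit, each chunk is assumed to record its first term (the firstTerm field is hypothetical), and ID arrays again stand in for compressed bit sets.

```typescript
type Pair = [string, number[]];
// Assumption: each chunk records its first term so the right chunk can be located.
type Chunk = { firstTerm: string; pairs: Pair[] };

// Split sorted pairs into chunks; a real build would split by byte size instead.
function splitIntoChunks(pairs: Pair[], maxPairsPerChunk: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (let i = 0; i < pairs.length; i += maxPairsPerChunk) {
    const slice = pairs.slice(i, i + maxPairsPerChunk);
    chunks.push({ firstTerm: slice[0][0], pairs: slice });
  }
  return chunks;
}

function lookup(chunks: Chunk[], term: string): number[] | undefined {
  // Binary search for the last chunk whose first term is <= the wanted term.
  let lo = 0;
  let hi = chunks.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (chunks[mid].firstTerm <= term) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  if (found < 0) return undefined;
  // Binary search for the exact term within that chunk.
  const pairs = chunks[found].pairs;
  lo = 0;
  hi = pairs.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const key = pairs[mid][0];
    if (key === term) return pairs[mid][1];
    if (key < term) lo = mid + 1;
    else hi = mid - 1;
  }
  return undefined;
}
```

Only the chunk that can contain the term has to be fetched, which is where the saving in reads comes from.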

Searching

Search terms have an associated mode. There are three modes that match documents in different ways:

Mode     Document match condition
Require  Has all terms with this mode.
Contain  Has at least one term with this mode.
Exclude  Has none of the terms with this mode.

For example, a document with terms a, b, c, d, and e would match the query require (d, a) contain (g, b, f) exclude (h, i).

The results are generated by doing bitwise operations across multiple bit sets. The general computation could be summarised as:

result = (req_a & req_b & req_c & ...) & (con_a | con_b | con_c | ...) & ~(exc_a | exc_b | exc_c | ...)
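
To make the formula concrete, here is a small sketch that uses a 32-bit number as a stand-in for a Roaring Bitmap, with bit i set when document i contains the term. The function names are hypothetical, and treating an empty require or contain group as "no constraint" is an assumption; the text above does not say how Edgesearch handles that case.

```typescript
// Bit i is set when document i matches; limited to IDs 0..31 for illustration.
type BitSet = number;

function evaluate(req: BitSet[], con: BitSet[], exc: BitSet[], universe: BitSet): BitSet {
  // AND of all required sets; an empty group imposes no constraint (assumption).
  const required = req.reduce((acc, b) => acc & b, universe);
  // OR of all "contain" sets; an empty group imposes no constraint (assumption).
  const contained = con.length === 0 ? universe : con.reduce((acc, b) => acc | b, 0);
  // OR of all excluded sets, negated in the final expression.
  const excluded = exc.reduce((acc, b) => acc | b, 0);
  return (required & contained & ~excluded) >>> 0;
}

// Convert a bit set back to a list of document IDs.
function toIds(bits: BitSet): number[] {
  const ids: number[] = [];
  for (let i = 0; i < 32; i++) {
    if ((bits >>> i) & 1) ids.push(i);
  }
  return ids;
}

// Four documents: 0 has a, b, d; 1 has a, d, h; 2 has a, d; 3 has b, g.
const a = 0b0111, b = 0b1001, d = 0b0111, g = 0b1000, f = 0, h = 0b0010, i = 0;
// require (d, a) contain (g, b, f) exclude (h, i)
const matches = toIds(evaluate([d, a], [g, b, f], [h, i], 0b1111)); // [0]
```

Document 1 is dropped by the exclude group, document 2 has none of the contain terms, and document 3 lacks the required terms, so only document 0 remains.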

Cloudflare

There are some nice advantages when only using Cloudflare Workers:

  • Faster than a VM or container with fewer cold starts, because code runs in a V8 isolate.
  • Naturally distributed to the edge for very low latency.
  • Takes advantage of Cloudflare for SSL, caching, and distribution.
  • No need to worry about scaling, networking, or servers.

WebAssembly

The C implementation of Roaring Bitmaps is compiled to WebAssembly. A minimal implementation of the essential C standard library functionality is included to make compilation possible.

Usage

Get the CLI

LLVM 9 or higher is required to use the CLI for building the worker.

Precompiled binaries are available for x86-64:

Linux | macOS | Windows

Build CLI from source

Rust must be installed.

bash ./prebuild.sh
cargo build --release

The CLI will be available at ./target/release/edgesearch.

Build the worker

The data needs to be formatted into two files:

  • Documents: contents of all documents, each terminated by NUL (ASCII 0), including the last one.
  • Document terms: terms for each corresponding document. Each term ends with NUL (ASCII 0), and each document's list of terms ends with an additional NUL.

This format allows for simple reading and writing without libraries, parsers, or loading all the data into memory. Terms are separate from documents for easy switching between or testing of different documents-terms mappings.

The relation between a document's terms and content is irrelevant to Edgesearch and terms do not necessarily have to be words from the document.

A document must be a JSON serialised value, such as "hello", 123, or {"prop1": 1, "prop2": {}}.

For example:

File: documents

{"title":"Stupid Love","artist":"Lady Gaga","year":2020} \0
{"title":"Don't Start Now","artist":"Dua Lipa","year":2020} \0
...

File: document-terms

title_stupid \0 title_love \0 artist_lady \0 artist_gaga \0 year_2020 \0 \0
title_dont \0 title_start \0 title_now \0 artist_dua \0 artist_lipa \0 year_2020 \0 \0
...
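
A minimal sketch of producing the contents of both files is shown below. The Entry type and serialise function are hypothetical names, and it assumes the spaces and line breaks in the example above are for readability only, so the output contains nothing but the JSON values, the terms, and NUL characters.

```typescript
// Hypothetical input shape: a JSON-serialisable document and its terms.
type Entry = { document: unknown; terms: string[] };

const NUL = "\0";

function serialise(entries: Entry[]): { documents: string; documentTerms: string } {
  // Each document is its JSON form followed by NUL, including the last one.
  const documents = entries
    .map((e) => JSON.stringify(e.document) + NUL)
    .join("");
  // Each term ends with NUL, and each document's term list ends with one more NUL.
  const documentTerms = entries
    .map((e) => e.terms.map((t) => t + NUL).join("") + NUL)
    .join("");
  return { documents, documentTerms };
}
```

The two returned strings can then be written to the documents and document-terms files, for example with Node.js's fs.writeFileSync. Both are built from the same entries in the same order, so the terms stay aligned with their documents.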

A folder needs to be provided for Edgesearch to write temporary and built code and data files. It's advised to provide a folder for the exclusive use of Edgesearch with no other contents.

edgesearch \
  --data-store kv \
  --documents documents \
  --document-terms document-terms \
  --maximum-query-results 20 \
  --output-dir /path/to/edgesearch/build/output/dir/

Deploy the worker

edgesearch-deploy-cloudflare handles deploying to Cloudflare.

This will upload the worker script and associated WASM to Cloudflare Workers, and write every key to Cloudflare Workers KV:

npx edgesearch-deploy-cloudflare \
  --account-id CF_ACCOUNT_ID \
  --account-email CF_ACCOUNT_EMAIL \
  --global-api-key CF_GLOBAL_API_KEY \
  --name my-edgesearch \
  --output-dir /path/to/edgesearch/build/output/dir/ \
  --namespace CF_KV_NAMESPACE_ID \
  --upload-data

Testing locally

edgesearch-test-server loads a built worker to run locally.

This will create a local server on port 8080:

npx edgesearch-test-server \
  --output-dir /path/to/edgesearch/build/output/dir/ \
  --port 8080

The client can be used with a local test server; provide the origin (e.g. http://localhost:8080) to the constructor (see below).

Calling the API

A JavaScript client for the browser and Node.js is available for using a deployed Edgesearch worker:

import * as Edgesearch from 'edgesearch-client';

type Document = {
  title: string;
  artist: string;
  year: number;
};

const client = new Edgesearch.Client<Document>('https://my-edgesearch.me.workers.dev');
const query = new Edgesearch.Query();
query.add(Edgesearch.Mode.REQUIRE, 'world');
query.add(Edgesearch.Mode.CONTAIN, 'hello', 'welcome', 'greetings');
query.add(Edgesearch.Mode.EXCLUDE, 'bye', 'goodbye');
let response = await client.search(query);
query.setContinuation(response.continuation);
response = await client.search(query);

Performance

Searches that retrieve entries not cached at edge locations will be slow. To reduce cache misses, ensure that there is consistent traffic.
