  • Language
    Rust
  • License
    Other
  • Created almost 6 years ago
  • Updated about 2 months ago

Repository Details

A string type for Rust that is not required to be valid UTF-8.

bstr

This crate provides extension traits for &[u8] and Vec<u8> that enable their use as byte strings, where byte strings are conventionally UTF-8. Byte strings differ from the standard library's String and str types in that they are not required to be valid UTF-8, but may be fully or partially valid UTF-8.

Documentation

https://docs.rs/bstr

When should I use byte strings?

See this part of the documentation for more details: https://docs.rs/bstr/1.*/bstr/#when-should-i-use-byte-strings.

The short story is that byte strings are useful when it is inconvenient or incorrect to require valid UTF-8.
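As a rough illustration of that point, the sketch below uses only the standard library to show what happens when a mostly-ASCII line contains a single invalid byte. The helper contains_bytes is a hypothetical stand-in written for this example; it is not part of bstr or the standard library.

```rust
use std::str;

/// Hypothetical helper: returns true if `haystack` contains `needle`,
/// comparing raw bytes only. No UTF-8 validation is performed.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    // `windows(0)` panics, so handle the empty needle first.
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    // A line that is mostly ASCII but contains one invalid byte (0xFF).
    let line: &[u8] = b"Dimension\xFF: 3";

    // Requiring valid UTF-8 rejects the whole line.
    assert!(str::from_utf8(line).is_err());

    // Lossy conversion succeeds, but it allocates and alters the data.
    let lossy = String::from_utf8_lossy(line);
    assert_eq!(lossy, "Dimension\u{FFFD}: 3");

    // Treating the line as bytes needs neither validation nor a copy.
    assert!(contains_bytes(line, b"Dimension"));
}
```

The first two options are the "inconvenient or incorrect" cases: one drops the line, the other rewrites it. Working on the bytes directly avoids both.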

Usage

cargo add bstr

Examples

The following examples exhibit both the API features of byte strings and the I/O convenience functions provided for reading line-by-line quickly.

This first example simply shows how to efficiently iterate over lines in stdin, and print out lines containing a particular substring:

use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        if line.contains_str("Dimension") {
            stdout.write_all(line)?;
        }
        Ok(true)
    })?;
    Ok(())
}

This example shows how to count all of the words (Unicode-aware) in stdin, line-by-line:

use std::{error::Error, io};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut words = 0;
    stdin.lock().for_byte_line_with_terminator(|line| {
        words += line.words().count();
        Ok(true)
    })?;
    println!("{}", words);
    Ok(())
}

This example shows how to convert a stream on stdin to uppercase without performing UTF-8 validation, while amortizing allocation by reusing a single output buffer. On standard ASCII text, this is quite a bit faster than what you can (easily) do with standard library APIs. (N.B. Any invalid UTF-8 bytes are passed through unchanged.)

use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    let mut upper = vec![];
    stdin.lock().for_byte_line_with_terminator(|line| {
        upper.clear();
        line.to_uppercase_into(&mut upper);
        stdout.write_all(&upper)?;
        Ok(true)
    })?;
    Ok(())
}

This example shows how to extract the first 10 visual characters (as grapheme clusters) from each line, where invalid UTF-8 sequences are generally treated as a single character and are passed through correctly:

use std::{error::Error, io::{self, Write}};
use bstr::{ByteSlice, io::BufReadExt};

fn main() -> Result<(), Box<dyn Error>> {
    let stdin = io::stdin();
    let mut stdout = io::BufWriter::new(io::stdout());

    stdin.lock().for_byte_line_with_terminator(|line| {
        let end = line
            .grapheme_indices()
            .map(|(_, end, _)| end)
            .take(10)
            .last()
            .unwrap_or(line.len());
        stdout.write_all(line[..end].trim_end())?;
        stdout.write_all(b"\n")?;
        Ok(true)
    })?;
    Ok(())
}

Cargo features

This crate comes with a few features that control standard library, serde and Unicode support.

  • std - Enabled by default. This provides APIs that require the standard library, such as Vec<u8> and PathBuf. Enabling this feature also enables the alloc feature.
  • alloc - Enabled by default. This provides APIs that require allocations via the alloc crate, such as Vec<u8>.
  • unicode - Enabled by default. This provides APIs that require sizable Unicode data compiled into the binary. This includes, but is not limited to, grapheme/word/sentence segmenters. When this is disabled, basic support such as UTF-8 decoding is still included. Note that currently, enabling this feature also requires enabling the std feature. It is expected that this limitation will be lifted at some point.
  • serde - Enables implementations of serde traits for BStr, and also BString when alloc is enabled.

Minimum Rust version policy

This crate's minimum supported rustc version (MSRV) is 1.60.0.

In general, this crate will be conservative with respect to the minimum supported version of Rust. MSRV may be bumped in minor version releases.

Future work

Since it is plausible that some of the types in this crate might end up in your public API (e.g., BStr and BString), we will commit to being very conservative with respect to new major version releases. It's difficult to say precisely how conservative, but unless there is a major issue with the 1.0 release, I wouldn't expect a 2.0 release to come out any sooner than some period of years.

A large part of the API surface area was taken from the standard library, so from an API design perspective, a good portion of this crate should be on solid ground. The main differences from the standard library are in how the various substring search routines work. The standard library provides generic infrastructure for supporting different types of searches with a single method, whereas this library prefers to define new methods for each type of search and drop the generic infrastructure.
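To make the contrast concrete, here is a small sketch using only the standard library. The first half shows the generic approach: str::find accepts several kinds of needle through one method. The second half shows the "one method per kind of search" style on byte slices; find_byte and find_slice are hypothetical helpers written for this example, and only illustrate the shape of such an API, not bstr's actual implementation.

```rust
/// Hypothetical helper: search for a single byte.
fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Hypothetical helper: search for a byte substring (naive scan).
fn find_slice(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // Standard library: one `find` method, generic over the needle type.
    let text = "foo bar baz";
    assert_eq!(text.find("bar"), Some(4));
    assert_eq!(text.find('z'), Some(10));
    assert_eq!(text.find(|c: char| c.is_whitespace()), Some(3));

    // Per-search methods: a separate, non-generic function for each needle kind.
    let bytes = b"foo\xFFbar";
    assert_eq!(find_byte(bytes, 0xFF), Some(3));
    assert_eq!(find_slice(bytes, b"bar"), Some(4));
}
```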

Some probable future considerations for APIs include, but are not limited to:

  • Unicode normalization.
  • More sophisticated support for dealing with Unicode case, perhaps by combining the use cases supported by caseless and unicase.

Here are some examples that are probably out of scope for this crate:

  • Regular expressions.
  • Unicode collation.

The exact scope isn't quite clear, but I expect we can iterate on it.

In general, as stated below, this crate brings lots of related APIs together into a single crate while simultaneously attempting to keep the total number of dependencies low. Indeed, every dependency of bstr, except for memchr, is optional.

High level motivation

Strictly speaking, the bstr crate provides very little that can't already be achieved with the standard library Vec<u8>/&[u8] APIs and the ecosystem of library crates. For example:

  • The standard library's Utf8Error can be used for incremental lossy decoding of &[u8].
  • The unicode-segmentation crate can be used for iterating over graphemes (or words), but is only implemented for &str types. One could use Utf8Error above to implement grapheme iteration with the same semantics as what bstr provides (automatic Unicode replacement codepoint substitution).
  • The twoway crate can be used for fast substring searching on &[u8].
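The first point above, incremental lossy decoding driven by Utf8Error, can be sketched with the standard library alone. The function name decode_lossy is an assumption made for this example. It collects the output into a String for simplicity, where a streaming version would hand each chunk to a callback instead.

```rust
use std::str;

/// Decodes `bytes` chunk by chunk, copying valid UTF-8 through and
/// substituting U+FFFD for each invalid sequence.
fn decode_lossy(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            // Everything that remains is valid: finish.
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // The prefix up to `valid_up_to()` is known to be valid.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                out.push_str(str::from_utf8(valid).expect("prefix already validated"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and keep decoding.
                    Some(len) => bytes = &rest[len..],
                    // The input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}

fn main() {
    assert_eq!(decode_lossy(b"a\xFFb"), "a\u{FFFD}b");
    assert_eq!(decode_lossy(b"abc"), "abc");
}
```

This works, but it is exactly the kind of loop each user would otherwise have to write and test themselves.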

So why create bstr? Part of the point of the bstr crate is to provide a uniform API of coupled components instead of relying on users to piece together loosely coupled components from the crate ecosystem. For example, if you wanted to perform a search and replace in a Vec<u8>, then writing the code to do that with the twoway crate is not that difficult, but it's still additional glue code you have to write. This work adds up depending on what you're doing. Consider, for example, trimming and splitting, along with their different variants.
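For a sense of what that glue code looks like, here is a minimal search-and-replace over bytes using only the standard library. The function replace_all is hypothetical, and it uses a naive scan rather than the twoway algorithm, so it is a sketch of the bookkeeping involved and not a performance reference.

```rust
/// Hypothetical helper: replaces every non-overlapping occurrence of
/// `needle` in `haystack` with `replacement`, working on raw bytes.
fn replace_all(haystack: &[u8], needle: &[u8], replacement: &[u8]) -> Vec<u8> {
    // An empty needle would match everywhere; return the input unchanged.
    if needle.is_empty() {
        return haystack.to_vec();
    }
    let mut out = Vec::with_capacity(haystack.len());
    let mut i = 0;
    while i < haystack.len() {
        if haystack[i..].starts_with(needle) {
            // Emit the replacement and skip past the match.
            out.extend_from_slice(replacement);
            i += needle.len();
        } else {
            // No match here: copy one byte through untouched.
            out.push(haystack[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // The invalid byte 0xFF is passed through unchanged.
    let replaced = replace_all(b"foo\xFFfoo", b"foo", b"quux");
    assert_eq!(replaced, b"quux\xFFquux".to_vec());
}
```

Even this small function has to settle the empty-needle case and the overlap policy, and the same kinds of decisions come up again for trimming and splitting.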

In other words, bstr is partially a way of pushing back against the micro-crate ecosystem that appears to be evolving. Namely, it is a goal of bstr to keep its dependency list lightweight. For example, serde is an optional dependency because there is no feasible alternative. In service of this philosophy, currently, the only required dependency of bstr is memchr.

License

This project is licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

The data in src/unicode/data/ is licensed under the Unicode License Agreement (LICENSE-UNICODE), although this data is only used in tests.
