  • Stars: 101
  • Rank: 338,166 (Top 7 %)
  • Language: Rust
  • License: MIT License
  • Created almost 6 years ago
  • Updated about 2 years ago

Repository Details

A Rust wc clone

cw - Count Words

A fast wc clone in Rust.

Synopsis

-% cw --help
cw 0.5.0
Thomas Hurst <[email protected]>
Count Words - word, line, character and byte count

USAGE:
    cw [FLAGS] [OPTIONS] [input]...

FLAGS:
    -c, --bytes              Count bytes
    -m, --chars              Count UTF-8 characters instead of bytes
    -h, --help               Prints help information
    -l, --lines              Count lines
    -L, --max-line-length    Count bytes (default) or characters (-m) of the longest line
    -V, --version            Prints version information
    -w, --words              Count words

OPTIONS:
        --files0-from <files0_from>    Read input from the NUL-terminated list of filenames in the given file.
        --files-from <files_from>      Read input from the newline-terminated list of filenames in the given file.
        --threads <threads>            Number of counting threads to spawn [default: 1]

ARGS:
    <input>...    Input files

-% cw Dickens_Charles_Pickwick_Papers.xml
 3449440 51715840 341152640 Dickens_Charles_Pickwick_Papers.xml
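
The --files0-from and --files-from options listed in the help output both read a delimited list of filenames. A minimal sketch of how such a list could be split, assuming empty entries are skipped (the function name `split_file_list` is hypothetical and is not taken from cw's source):

```rust
/// Split a filename list on `delim` (0 for a NUL-terminated list,
/// b'\n' for a newline-terminated one).
/// Assumption: empty entries, such as the one after a trailing
/// delimiter, are ignored.
fn split_file_list(list: &[u8], delim: u8) -> Vec<&[u8]> {
    list.split(|&b| b == delim)
        .filter(|name| !name.is_empty())
        .collect()
}
```

Working on raw bytes rather than `str` keeps the sketch usable for filenames that are not valid UTF-8. Calling `split_file_list(b"a.txt\0b.txt\0", 0)` yields the two names `a.txt` and `b.txt`.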

Performance

Counts of multiple files may be accelerated by use of the --threads option:

  'xargs <files cw --threads=12' ran
    2.01 ± 0.03 times faster than 'xargs <files cw --threads=4'
    7.07 ± 0.09 times faster than 'xargs <files cw'
   11.55 ± 0.15 times faster than 'xargs <files wc'
   17.31 ± 0.23 times faster than 'xargs <files gwc'
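
One way to spread per-file counting across worker threads is sketched below. This is an illustration only, not cw's actual implementation: the function name `count_lines_parallel` is hypothetical, in-memory buffers stand in for files, and `std::thread::scope` requires Rust 1.63 or later.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Count newline bytes in each input, sharing the work across `threads`
/// workers. Results are returned in input order.
fn count_lines_parallel(inputs: &[&[u8]], threads: usize) -> Vec<usize> {
    let next = AtomicUsize::new(0); // index of the next unclaimed input
    let results = Mutex::new(vec![0usize; inputs.len()]);

    thread::scope(|s| {
        for _ in 0..threads.max(1) {
            s.spawn(|| loop {
                // Claim one input at a time so that a single large input
                // does not hold up a whole pre-assigned batch.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= inputs.len() {
                    break;
                }
                let n = inputs[i].iter().filter(|&&b| b == b'\n').count();
                results.lock().unwrap()[i] = n;
            });
        }
    });

    results.into_inner().unwrap()
}
```

Each worker pulls the next index from a shared counter, so the speed-up depends on having several inputs to hand out; a single input is still counted by one thread.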

Line counts are optimized using the bytecount crate:

  'cw -l Dickens_Charles_Pickwick_Papers.xml' ran
    3.44 ± 0.04 times faster than 'wc -l Dickens_Charles_Pickwick_Papers.xml'
    4.17 ± 0.05 times faster than 'gwc -l Dickens_Charles_Pickwick_Papers.xml'
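
Counting lines reduces to counting newline bytes in each chunk read from the input. The sketch below uses only the standard library so that it stands alone; the scalar `count_newlines` helper is a stand-in for `bytecount::count(buf, b'\n')`, and the function names and the 64 KiB buffer size are assumptions rather than cw's own choices.

```rust
use std::io::{self, Read};

/// Scalar stand-in for `bytecount::count(buf, b'\n')`.
fn count_newlines(buf: &[u8]) -> usize {
    buf.iter().filter(|&&b| b == b'\n').count()
}

/// Read `reader` in fixed-size chunks and total the newline bytes.
fn count_lines_in<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut buf = vec![0u8; 64 * 1024]; // chunk size is an arbitrary choice
    let mut lines = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => return Ok(lines), // end of input
            Ok(n) => n,
            // Retry reads interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        lines += count_newlines(&buf[..n]) as u64;
    }
}
```

Because a newline count needs no state between chunks, the per-chunk helper can be swapped for a SIMD implementation without touching the read loop.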

Line counts with line length are optimized using the memchr crate:

  'cw -lL Dickens_Charles_Pickwick_Papers.xml' ran
    1.73 ± 0.01 times faster than 'wc -lL Dickens_Charles_Pickwick_Papers.xml'
   15.07 ± 0.07 times faster than 'gwc -lL Dickens_Charles_Pickwick_Papers.xml'
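
When line length is also needed, the position of each newline matters, not just the total. A minimal sketch, assuming the whole input is in one buffer and that a final line without a trailing newline still counts towards the maximum; `iter().position` is a standard-library stand-in for `memchr::memchr(b'\n', rest)`, and the function name is hypothetical.

```rust
/// Return (newline count, longest line length in bytes) for a buffer.
fn lines_and_max_len(buf: &[u8]) -> (usize, usize) {
    let mut lines = 0;
    let mut longest = 0;
    let mut rest = buf;
    // Stand-in for `memchr::memchr(b'\n', rest)`.
    while let Some(pos) = rest.iter().position(|&b| b == b'\n') {
        lines += 1;
        longest = longest.max(pos); // `pos` is the length of this line
        rest = &rest[pos + 1..];    // continue after the newline
    }
    // Assumption: an unterminated final line counts towards the maximum.
    (lines, longest.max(rest.len()))
}
```

For the input `ab\ncdef\ng` this returns two lines and a maximum length of 4.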

Note that without -m, cw operates only on bytes, and it never cares about your locale.

  'cw Dickens_Charles_Pickwick_Papers.xml' ran
    1.45 ± 0.01 times faster than 'wc Dickens_Charles_Pickwick_Papers.xml'
    2.05 ± 0.00 times faster than 'gwc Dickens_Charles_Pickwick_Papers.xml'

-m enables UTF-8 processing, with a fast-path for just character length, again using bytecount:

  'cw -m Dickens_Charles_Pickwick_Papers.xml' ran
   30.21 ± 0.39 times faster than 'gwc -m Dickens_Charles_Pickwick_Papers.xml'
   70.36 ± 0.91 times faster than 'wc -m Dickens_Charles_Pickwick_Papers.xml'
  'cw -m test-utf-8.html' ran
   84.74 ± 1.12 times faster than 'wc -m test-utf-8.html'
  124.21 ± 1.64 times faster than 'gwc -m test-utf-8.html'
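
Character counting can take a fast path because, in valid UTF-8, every character contains exactly one byte that is not a continuation byte (a byte of the form 10xxxxxx). The sketch below is a scalar illustration of that idea, comparable in purpose to `bytecount::num_chars`; the function name is hypothetical and the code is not taken from cw.

```rust
/// Count UTF-8 characters by counting the bytes that start a character,
/// that is, every byte that is not a continuation byte (0b10xxxxxx).
/// Assumption: the input is valid UTF-8; no validation is performed.
fn count_utf8_chars(buf: &[u8]) -> usize {
    buf.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

The count needs no state between bytes, so a buffer can be split at any offset, even in the middle of a character, and the per-chunk totals still add up correctly.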

And another path for character and line length:

  'cw -mlL Dickens_Charles_Pickwick_Papers.xml' ran
    3.88 ± 0.01 times faster than 'gwc -mlL Dickens_Charles_Pickwick_Papers.xml'
    9.05 ± 0.02 times faster than 'wc -mlL Dickens_Charles_Pickwick_Papers.xml'
  'cw -mlL test-utf-8.html' ran
    9.42 ± 0.01 times faster than 'wc -mlL test-utf-8.html'
   18.95 ± 0.03 times faster than 'gwc -mlL test-utf-8.html'

And a slow path for everything else:

  'cw -mLlw Dickens_Charles_Pickwick_Papers.xml' ran
    1.35 ± 0.00 times faster than 'gwc -mLlw Dickens_Charles_Pickwick_Papers.xml'
    3.15 ± 0.00 times faster than 'wc -mLlw Dickens_Charles_Pickwick_Papers.xml'
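
Once words are requested, each byte has to be classified, which is why this path is slower. The following single-pass sketch counts lines, words, characters and bytes together. It assumes words are separated by ASCII whitespace as defined by `u8::is_ascii_whitespace` (which excludes vertical tab), omits the maximum line length for brevity, and uses the hypothetical names `Counts` and `count_all`; cw's real slow path may differ.

```rust
#[derive(Debug, Default, PartialEq)]
struct Counts {
    lines: usize,
    words: usize,
    chars: usize,
    bytes: usize,
}

/// Count lines, words, UTF-8 characters and bytes in one pass.
fn count_all(buf: &[u8]) -> Counts {
    let mut c = Counts { bytes: buf.len(), ..Counts::default() };
    let mut in_word = false;
    for &b in buf {
        if b == b'\n' {
            c.lines += 1;
        }
        // Every byte that is not a UTF-8 continuation byte starts a character.
        if (b & 0xC0) != 0x80 {
            c.chars += 1;
        }
        // A word starts at the first non-whitespace byte after whitespace.
        if b.is_ascii_whitespace() {
            in_word = false;
        } else if !in_word {
            in_word = true;
            c.words += 1;
        }
    }
    c
}
```

The `in_word` flag is the only state carried from byte to byte; a chunked version would need to carry it across reads as well.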

These tests were run on FreeBSD 12 on a 2.1GHz Westmere Xeon. gwc is from GNU coreutils 8.30; note that its performance here is pessimised in some areas by FreeBSD's rather weak memchr implementation. YMMV.

For best results build with:

cargo build --release --features runtime-dispatch-simd

This enables SIMD optimizations for line and character counting. It has no effect if you count anything else.

Future

  • Test suite.
  • Factor internals out into a library. (#1)
  • Improve multibyte support.
  • Possibly implement locale.
  • Replace clap/structopt with something lighter.

See Also

uwc focuses on following Unicode rules as precisely as possible, taking into account less-common newlines, counting graphemes as well as codepoints, and following Unicode word-boundary rules precisely.

The cost of this is currently a great deal of performance, with counts on my benchmark file taking over a minute.

cw was originally called rwc until I noticed that a project with that name already existed. It's quite old and doesn't appear to compile.

A little library that only does plain newline counting, along with a binary called lc. Version 0.2 will use the same algorithm as cw.

More Repositories

  1. Compactor (Rust, 1,148 stars): A user interface for Windows 10 filesystem compression
  2. monotime (Ruby, 156 stars): A sensible interface to monotonic time in Ruby
  3. tarssh (Rust, 127 stars): A simple SSH tarpit inspired by endlessh
  4. rtss (Rust, 51 stars): Relative TimeStamps for Stuff
  5. zfsnapr (Ruby, 24 stars): Recursive ZFS snapshot mounter
  6. rust-linereader (Rust, 23 stars): A fast Rust line reader
  7. borg-backup.sh (Shell, 20 stars): A simple shell script for driving BorgBackup
  8. fast-memchr (C, 19 stars): A port of rust-memchr's fallback and SSE2 memchr() to C
  9. faccess (Rust, 16 stars): Cross-platform file access checks in Rust
  10. rust-proctitle (Rust, 16 stars): A safe cross-platform interface to setting process titles
  11. checkrestart (C, 14 stars): sysutils/checkrestart: A FreeBSD tool to find stale processes that may need restarting after an upgrade
  12. gcstool (Rust, 13 stars): A small tool for creating and searching Golomb Compressed Sets
  13. rust-filesize (Rust, 12 stars): Physical disk use retrieval
  14. pqsort (C++, 12 stars): A generic partial quicksort macro for C99.
  15. run-one (Shell, 11 stars): A BSD-compatible reimplementation of Ubuntu's run-one
  16. compresstimator (Rust, 10 stars): Simple and fast compressibility tester
  17. mkjail (Ruby, 9 stars): Create minimal jail environments on FreeBSD
  18. elite_shield_tester (Rust, 8 stars): A Rust port of Down To Earth Astronomy's Elite Dangerous shield tester
  19. mkpass (Rust, 5 stars): Generates reasonably secure passwords
  20. annoirc (Rust, 5 stars): A bot to annotate IRC with information about posted links
  21. portacl-rc (Roff, 4 stars): A FreeBSD rc(8) script for mac_portacl(4)
  22. pkg-cruft (Ruby, 4 stars): Find cruft on pkgng systems like FreeBSD
  23. TerraIntrimmer (Rust, 3 stars): Trim the notification queue from Terra Invicta saves
  24. fast-bytecount (M4, 3 stars): A port of the Rust bytecount SSE2 and AVX2 algorithms to C
  25. ruby-reattempt (Ruby, 3 stars): Yet another Ruby retry library.
  26. rust-bitrw (Rust, 3 stars): A Rust library for bit-level reading and writing
  27. esc (Rust, 3 stars): Email Search Command, because Email Sucks Completely
  28. blooming-rust (Rust, 3 stars): Disk-backed Bloom Filters for Rust
  29. par_qsort (Rust, 2 stars): A quick and dirty parallel quicksort in Rust
  30. simplepass (Rust, 2 stars): Simple Ruby and Rust password generation
  31. tikibar (Ruby, 2 stars): Prototypical Ruby progress bar library
  32. ruby-capsicum (Ruby, 2 stars): A Ruby interface to Capsicum sandboxing
  33. blooming-ruby (Ruby, 1 star): Ruby BitArray and BloomFilter library
  34. quickhash (Rust, 1 star): Multithreaded stream hashing
  35. IMSErious (Rust, 1 star): Execute commands in response to Dovecot's Internet Message Store Event notifications
  36. ruby-filemon (Ruby, 1 star): A Ruby interface to FreeBSD's filemon(4) device
  37. numastat (Python, 1 star): FreeBSD NUMA domain memory monitor
  38. nfo.fcgi (Ruby, 1 star): Newzbin's ancient FastCGI NFO service
  39. 123-spellcheck (Rust, 1 star): An email spellchecker I made for a friend
  40. unprivileged (Rust, 1 star): Privilege dropping for Rust
  41. precache (Rust, 1 star): Read the contents of a directory tree and hope it has useful side-effects
  42. ruby-gcs (Ruby, 1 star): A small Ruby library for creating and searching Golomb Compressed Sets
  43. swapflush (C, 1 star): Flush swap devices on FreeBSD