uwuify

fastest text uwuifier in the west

transforms

Hey, I think I really love you. Do you want a headpat?

into

hey, (κˆα΄—κˆ) i think i weawwy wuv you. ^β€’ο»Œβ€’^ do y-you want a headpat?

there's an uwu'd version of this readme

faq

what?

u want large amounts of text uwu'd in a smol amount of time

where?

ur computer, if it has a recent x86 cpu (intel, amd) that supports sse4.1

why?

why not?

how?

tldr: 128-bit simd vectorization plus some big brain algos

after hours of research, i've finally understood the essence of uwu'd text

there are a few transformations:

  1. replace some words (small -> smol, etc.)
  2. nya-ify (eg. naruhodo -> nyaruhodo)
  3. replace l and r with w
  4. stutter sometimes (hi -> h-hi)
  5. add a text emoji after punctuation (,, ., or !) sometimes
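
a rough scalar sketch of passes 1, 3, and 4 from the list above. this is not the simd code in the crate: the word table, the function names, and the `should_stutter` callback are all made up for illustration, and punctuation handling, nya-ifying, and emoji are left out.

```rust
/// passes 1 and 3 for a single word: look it up in a tiny hypothetical
/// word table first, otherwise replace ascii l/r with w.
fn uwu_word(word: &str) -> String {
    match word {
        "small" => "smol".to_string(),
        "love" => "wuv".to_string(),
        _ => word
            .chars()
            .map(|c| match c {
                'l' | 'r' => 'w',
                other => other,
            })
            .collect(),
    }
}

/// pass 4 on top of that: stutter ("hi" -> "h-hi") whenever the caller's
/// `should_stutter` callback says so. a real version would ask a random
/// number generator; a callback keeps this sketch deterministic.
fn uwuify_sketch(text: &str, mut should_stutter: impl FnMut() -> bool) -> String {
    text.split(' ')
        .map(|word| {
            let uwu = uwu_word(word);
            match uwu.chars().next() {
                Some(c) if c.is_ascii_alphabetic() && should_stutter() => {
                    format!("{c}-{uwu}")
                }
                _ => uwu,
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // never stutter
    println!("{}", uwuify_sketch("i really love small cats", || false));
    // always stutter
    println!("{}", uwuify_sketch("i really love small cats", || true));
}
```

the word lookup runs before the l/r pass on purpose: otherwise "small" would already be "smaww" by the time the table is checked.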

these transformation passes take advantage of sse4.1 vector intrinsics to process 16 bytes at once. for string searching, i'm using a custom simd implementation of the bitap algorithm for matching against multiple strings. for random number generation, i'm using XorShift32. for most character-level detection within simd registers, it's all masking and shifting to simulate basic state machines in parallel
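
for the curious, here are plain scalar sketches of the two named building blocks. these are the textbook versions, not the crate's simd code: the xorshift32 below uses the common 13/17/5 shift constants (an assumption, the crate may pick differently), and the bitap matcher handles one pattern of up to 64 bytes instead of many patterns at once. all names are hypothetical.

```rust
/// textbook xorshift32 with the common 13/17/5 shifts.
struct XorShift32(u32);

impl XorShift32 {
    fn new(seed: u32) -> Self {
        // a zero state would stay zero forever, so swap in a non-zero constant
        XorShift32(if seed == 0 { 0x9E37_79B9 } else { seed })
    }

    fn next_u32(&mut self) -> u32 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.0 = x;
        x
    }
}

/// scalar bitap (shift-and) exact search for one pattern of 1..=64 bytes.
/// bit i of `state` is set when pattern[..=i] matches the text ending at
/// the current position, so the whole "state machine" is one u64.
fn bitap_find(text: &[u8], pattern: &[u8]) -> Option<usize> {
    let m = pattern.len();
    if m == 0 || m > 64 {
        return None;
    }
    // masks[b] has bit i set when pattern[i] == b
    let mut masks = [0u64; 256];
    for (i, &b) in pattern.iter().enumerate() {
        masks[b as usize] |= 1u64 << i;
    }
    let accept = 1u64 << (m - 1);
    let mut state = 0u64;
    for (pos, &b) in text.iter().enumerate() {
        // advance every partial match by one byte, and start a new one
        state = ((state << 1) | 1) & masks[b as usize];
        if state & accept != 0 {
            return Some(pos + 1 - m);
        }
    }
    None
}

fn main() {
    let mut rng = XorShift32::new(1);
    println!("first random value: {}", rng.next_u32());
    println!("{:?}", bitap_find(b"a small cat", b"small"));
}
```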

multithreading is supported, so u can exploit all of ur cpu cores for the noble goal of uwu-ing massive amounts of text

utf-8 is handled elegantly by simply ignoring non-ascii characters in the input
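
a sketch of why ignoring non-ascii bytes is enough, using the l/r -> w pass as the example. every byte of a multi-byte utf-8 sequence is 0x80 or above, so a byte-wise compare against ascii letters never matches it and those bytes come out unchanged. this is an illustration under assumptions, not the crate's code: the function names are made up, and it falls back to a scalar loop when sse4.1 is missing or the target is not x86_64.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// scalar fallback: replace ascii 'l'/'r' with 'w', byte by byte.
fn replace_lr_scalar(bytes: &mut [u8]) {
    for b in bytes.iter_mut() {
        if *b == b'l' || *b == b'r' {
            *b = b'w';
        }
    }
}

/// process exactly 16 bytes with two compares and one blend.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn replace_lr_block(block: &mut [u8]) {
    assert_eq!(block.len(), 16);
    unsafe {
        let v = _mm_loadu_si128(block.as_ptr() as *const __m128i);
        // each mask byte is 0xff where the input byte matches, else 0x00
        let is_l = _mm_cmpeq_epi8(v, _mm_set1_epi8(b'l' as i8));
        let is_r = _mm_cmpeq_epi8(v, _mm_set1_epi8(b'r' as i8));
        let mask = _mm_or_si128(is_l, is_r);
        // take 'w' where the mask is set, the original byte everywhere else
        let out = _mm_blendv_epi8(v, _mm_set1_epi8(b'w' as i8), mask);
        _mm_storeu_si128(block.as_mut_ptr() as *mut __m128i, out);
    }
}

fn replace_lr(bytes: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse4.1") {
            let mut chunks = bytes.chunks_exact_mut(16);
            for chunk in chunks.by_ref() {
                // safety: sse4.1 support was checked at runtime just above
                unsafe { replace_lr_block(chunk) };
            }
            // fewer than 16 bytes left over: finish them one at a time
            replace_lr_scalar(chunks.into_remainder());
            return;
        }
    }
    replace_lr_scalar(bytes);
}

fn uwu_lr(text: &str) -> String {
    let mut bytes = text.as_bytes().to_vec();
    replace_lr(&mut bytes);
    // only ascii bytes were rewritten to ascii, so this is still valid utf-8
    String::from_utf8(bytes).expect("only ascii bytes were rewritten")
}

fn main() {
    println!("{}", uwu_lr("héllo wörld, really?"));
}
```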

unfortunately, due to both simd parallelism and multithreading, some words may not be fully uwu'd if they were lucky enough to cross the boundary of a simd vector or a thread's buffer. they won't escape so easily next time
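
the boundary effect is easy to reproduce with a toy version. the sketch below splits the input into one chunk per thread and runs a made-up "small" -> "smol" pass on each chunk independently. the function names and the chunking scheme are assumptions for illustration, not how the crate divides its buffers.

```rust
use std::thread;

/// hypothetical per-chunk pass: replace every "small" with "smol".
fn replace_small(chunk: &[u8]) -> Vec<u8> {
    let (pat, rep) = (b"small", b"smol");
    let mut out = Vec::with_capacity(chunk.len());
    let mut i = 0;
    while i < chunk.len() {
        if chunk[i..].starts_with(pat) {
            out.extend_from_slice(rep);
            i += pat.len();
        } else {
            out.push(chunk[i]);
            i += 1;
        }
    }
    out
}

/// split `input` into roughly equal chunks, process each on its own
/// thread, then stitch the results back together in order.
fn uwu_parallel(input: &[u8], threads: usize) -> Vec<u8> {
    let threads = threads.max(1);
    let chunk_len = ((input.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        // spawn everything first so the chunks really run concurrently
        let handles: Vec<_> = input
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || replace_small(chunk)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}

fn main() {
    let one = uwu_parallel(b"so small", 1);
    let two = uwu_parallel(b"so small", 2);
    println!("{}", String::from_utf8_lossy(&one));
    println!("{}", String::from_utf8_lossy(&two));
}
```

with one thread the word is replaced. with two threads the 8-byte input is cut into "so s" and "mall", neither chunk contains the full word, and "small" escapes untouched.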

ok i want uwu'd text, how do i run this myself?

install command-line tool

  1. install rust: run curl https://sh.rustup.rs -sSf | sh on unix, or go here for more options
  2. run cargo install uwuify
  3. run uwuify which will read from stdin and output to stdout. make sure u press ctrl + d (unix) or ctrl + z and enter (windows) after u type stuff in stdin to send an EOF

if you are having trouble running uwuify, make sure you have ~/.cargo/bin in your $PATH

it is possible to read and write from files by specifying the input file and output file, in that order. u can use --help for more info. pass in -v for timings

this is on crates.io here

include as library

  1. put uwuify = "^0.2" under [dependencies] in your Cargo.toml file
  2. the library is called uwuifier (slightly different from the name of the binary!) use it like so:

     use uwuifier::uwuify_str_sse;
     assert_eq!(uwuify_str_sse("hello world"), "hewwo wowwd");

documentation is here

build from this repo

  1. install rust
  2. run git clone https://github.com/Daniel-Liu-c0deb0t/uwu.git && cd uwu
  3. run cargo run --release

testing

  1. run cargo test

benchmarking

  1. run mkdir test && cd test

warning: large files of 100mb and 1gb, respectively

  2. run curl -OL http://mattmahoney.net/dc/enwik8.zip && unzip enwik8.zip
  3. run curl -OL http://mattmahoney.net/dc/enwik9.zip && unzip enwik9.zip
  4. run cd .. && ./bench.sh

i don't believe that this is fast. i need proof!!1!

tldr: can be almost as fast as simply copying a file

raw numbers from running ./bench.sh on a 2019 macbook pro with an eight-core 2.3 ghz intel i9 cpu and 16 gb of ram are shown below. the dataset used is the first 100mb and first 1gb of english wikipedia. the same dataset is used for the hutter prize for text compression

1 thread uwu enwik8
time taken: 178 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 0.55992 gb/s

2 thread uwu enwik8
time taken: 105 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 0.94701 gb/s

4 thread uwu enwik8
time taken: 60 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 1.64883 gb/s

8 thread uwu enwik8
time taken: 47 ms
input size: 100000000 bytes
output size: 115095591 bytes
throughput: 2.12590 gb/s

copy enwik8

real	0m0.035s
user	0m0.001s
sys	0m0.031s

1 thread uwu enwik9
time taken: 2087 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 0.47905 gb/s

2 thread uwu enwik9
time taken: 992 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 1.00788 gb/s

4 thread uwu enwik9
time taken: 695 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 1.43854 gb/s

8 thread uwu enwik9
time taken: 436 ms
input size: 1000000000 bytes
output size: 1149772651 bytes
throughput: 2.29214 gb/s

copy enwik9

real	0m0.387s
user	0m0.001s
sys	0m0.341s
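
to sanity-check the numbers above, here is a small sketch that recomputes throughput and the 8-thread speedup from the printed times. it assumes 1 gb = 10^9 bytes, and the helper name is made up. the results land slightly off the reported throughput figures, presumably because the tool measures with finer precision than the whole milliseconds it prints.

```rust
/// throughput in gb/s (1 gb = 10^9 bytes) from a byte count and elapsed ms.
fn throughput_gb_s(bytes: u64, millis: u64) -> f64 {
    (bytes as f64 / 1e9) / (millis as f64 / 1e3)
}

fn main() {
    // enwik8 numbers copied from the output above
    let one_thread = throughput_gb_s(100_000_000, 178);
    let eight_threads = throughput_gb_s(100_000_000, 47);
    println!("1 thread:  {one_thread:.3} gb/s");
    println!("8 threads: {eight_threads:.3} gb/s");
    println!("speedup:   {:.2}x", eight_threads / one_thread);

    // how much the text grows when uwu'd
    let growth = 115_095_591f64 / 100_000_000f64;
    println!("output is {growth:.2}x the input size");
}
```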

//TODO: compare with other tools

why isn't this readme uwu'd?

so it's readable

if u happen to find uwu'd text more readable, there's always an uwu'd version

ok but why aren't there any settings i can change?!1?!!1

free will is an illusion

wtf this is so unprofessional how are u gonna get hired at faang now?!

don't worry, i've got u covered

Title: uwu is all you need

Abstract

Recent advances in computing have made strides in parallelization, whether at a fine-grained level with SIMD instructions, or at a high level with multiple CPU cores. Taking advantage of these advances, we explore how the useful task of performing an uwu transformation on plain text can be scaled up to large input datasets. Our contributions in this paper are threefold: first, we present, to our knowledge, the first rigorous definition of uwu'd text. Second, we show our novel algorithms for uwu-ing text, exploiting vectorization and multithreading features that are available on modern CPUs. Finally, we provide rigorous experimental results that show how our implementation could be the "fastest in the west." In our benchmarks, we observe that our implementation was almost as fast as a simple file copy, which is entirely IO-bound. We believe our work has potential applications in various domains, from data augmentation and text preprocessing for natural language processing, to giving authors the ability to convey potentially wholesome or cute meme messages with minimal time and effort.

// TODO: write paper

// TODO: write more about machine learning so i get funding

ok i need to use this for something and i need the license info

mit license

ok but i have an issue with this or a suggestion or a question not answered here

open an issue, be nice

projects using this

  • uwu-tray: a tray icon to uwuify your text
  • uwubot: discord bot for uwuifying text
  • uwupedia: the uwuified encycwopedia
  • discord uwu webhook: automatically uwuifies all sent messages on discord via webhooks
  • twent weznowor: best twitter bot ever
  • alaia: a simple yet powerful intuitive chatbot for discord
  • uwuify-mdbook: an mdbook pre-processor for all your uwuify needs
  • uwu-joke: automatically uwuifies typed text and text copied to your clipboard
  • discordbot (go): discord (and telegram and slack) bot for fun
  • let me know if u make a project with uwuify! i appreciate u all!
