  • Language
    C
  • License
    Apache License 2.0

UTF-8 Text Processing (R Package)

utf8

utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R’s UTF-8 handling.

Installation

Stable version

utf8 is available on CRAN. To install the latest released version, run the following command in R:

install.packages("utf8")

Development version

To install the latest development version, run the following:

devtools::install_github("patperry/r-utf8")

Usage

library(utf8)

Validate character data and convert to UTF-8

Use as_utf8() to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
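
When a vector mixes sources, it can help to find the offending entries before calling as_utf8(), rather than reading them off the error message one at a time. The following is a minimal sketch using utf8_valid() from the same package; the decision to treat every invalid entry as latin-1 is an assumption made for this illustration and would need to be checked against the real provenance of the data.

```r
library(utf8)

# Same input as above: entry 2 holds latin-1 bytes but is declared UTF-8.
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# Flag the entries that cannot be translated to valid UTF-8.
ok <- utf8_valid(x)

# Assumption for this sketch: every invalid entry is really latin-1.
Encoding(x[!ok]) <- "latin1"

# With the declarations corrected, the conversion goes through.
y <- as_utf8(x)
```

Here which(!ok) gives the positions that needed attention, which is convenient for logging or for inspecting the raw bytes with charToRaw().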

Normalize data

Use utf8_normalize() to convert text to Unicode composed normal form (NFC). Optionally, apply compatibility maps to obtain NFKC normal form, or apply case folding.

# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
               map_compat = TRUE)
#> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."
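
One practical reason to normalize is that visually identical strings can compare as different, which silently breaks joins and deduplication. The sketch below illustrates this with a made-up pair of keys (the vector name keys and its contents are hypothetical); it uses only utf8_normalize() and its map_compat argument as described above.

```r
library(utf8)

# The same visible word stored two ways: with a precomposed "é" (U+00E9)
# and with "e" followed by a combining acute accent (U+0301).
keys <- c("caf\u00e9", "cafe\u0301")

# Without normalization the two strings are distinct.
raw_equal <- keys[1] == keys[2]
n_raw <- length(unique(keys))

# After NFC normalization they collapse to a single key.
norm <- utf8_normalize(keys)
norm_equal <- norm[1] == norm[2]
n_norm <- length(unique(norm))

# Compatibility mapping additionally expands ligatures such as U+FB01.
lig <- utf8_normalize("\ufb01le", map_compat = TRUE)
```

Normalizing once at the point where text enters a pipeline, rather than at each comparison, tends to keep later code simpler.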

Print emoji

On some platforms (including macOS), the R implementation of print() uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print() for an updated print function:

print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​…"

utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​😬​😭​😮​😯​😰​😱​😲​😳​😴​😵​😶​😷​😸​😹​😺​😻​😼​😽​😾​😿​🙀​🙁​🙂​🙃​🙄​🙅​🙆​🙇​🙈​🙉​🙊​🙋​🙌​🙍​🙎​🙏​"
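
The truncation above depends on how many terminal columns each character occupies. As a rough sketch of that idea, the package's utf8_width() and utf8_format() functions can be used to measure and shorten text without printing it. Passing utf8 = TRUE here is a choice made for this example so that the result does not depend on whether the current console reports UTF-8 support.

```r
library(utf8)

emoji <- intToUtf8(0x1F600 + 0:79)

# Display width in terminal columns: emoji count as wide characters.
w <- utf8_width(c("abc", "\U0001F600"), utf8 = TRUE)

# Format with a small character limit; the result is a truncated string
# rather than printed output, so it can be reused in messages or tables.
short <- utf8_format(emoji, chars = 10, utf8 = TRUE)
```

Comparing nchar(short) with nchar(emoji) shows the effect of the limit; the exact truncated text may vary with the package version, so it is not reproduced here.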

Citation

Cite utf8 with the following BibTeX entry:

@Manual{,
  title = {utf8: Unicode Text Processing},
  author = {Patrick O. Perry},
  year = {2018},
  note = {R package version 1.1.4},
  url = {https://github.com/patperry/r-utf8},
}

Contributing

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you’d like to contribute, either

  • fork the repository and submit a pull request;

  • file an issue;

  • or contact the maintainer via e-mail.

This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.
