utf8 is an R package for manipulating and printing UTF-8 text that fixes multiple bugs in R's UTF-8 handling.
utf8 is available on CRAN. To install the latest released version, run the following command in R:
install.packages("utf8")
To install the latest development version, run the following:
devtools::install_github("patperry/r-utf8")
library(utf8)
Use as_utf8() to validate input text and convert it to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
#> Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4

# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
#> [1] "façile" "façile" "façile"
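When a vector is long, it can help to locate the mis-encoded entries before converting. The following is a minimal sketch that uses only base R (validUTF8() and iconv(), neither of which is part of utf8). The input vector is illustrative, and the sketch assumes that every invalid entry is really Latin-1, which may not hold for your data:

```r
# Illustrative input: the second entry holds Latin-1 bytes
x <- c("fa\u00E7ile", "fa\xE7ile")

# validUTF8() (base R) reports which entries are valid UTF-8 byte sequences
ok <- validUTF8(x)

# Re-encode only the invalid entries, ASSUMING they are Latin-1
fixed <- x
fixed[!ok] <- iconv(x[!ok], from = "latin1", to = "UTF-8")

# Every entry should now be valid UTF-8
stopifnot(all(validUTF8(fixed)))
```

Unlike as_utf8(), this sketch does not check the declared Encoding() of each entry; it only inspects the bytes.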
Use utf8_normalize() to convert to Unicode composed normal form (NFC). Optionally apply compatibility maps for NFKC normal form or case-fold.
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
#> [1] "Å" "Å" "Å"
utf8_normalize(angstrom) == "\u00c5"
#> [1] TRUE TRUE TRUE

# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
#> [1] "grösse"

# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.", map_compat = TRUE)
#> [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."
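One practical use of these options is loose text matching. The sketch below combines the map_case and map_compat arguments shown above; same_text() is a hypothetical helper written for this illustration, not a function provided by utf8:

```r
library(utf8)

# same_text() is a hypothetical helper, not part of utf8: it compares two
# character vectors after case-folding and compatibility normalization
same_text <- function(a, b) {
  fold <- function(s) utf8_normalize(s, map_case = TRUE, map_compat = TRUE)
  fold(a) == fold(b)
}

# "Größe" and "GRÖSSE" differ byte-for-byte but share a case-folded form
same_text("Gr\u00f6\u00dfe", "GR\u00d6SSE")
```

Folding both sides with the same options keeps the comparison symmetric; whether compatibility mapping is appropriate depends on how much typographic distinction your application needs to preserve.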
On some platforms (including macOS), the R implementation of print() uses an outdated version of the Unicode standard to determine which characters are printable. Use utf8_print() for an updated print function:
print(intToUtf8(0x1F600 + 0:79)) # with default R print function
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"

utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫…"

utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
#> [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"
Cite utf8 with the following BibTeX entry:
@Manual{,
title = {utf8: Unicode Text Processing},
author = {Patrick O. Perry},
year = {2018},
note = {R package version 1.1.4},
url = {https://github.com/patperry/r-utf8},
}
The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you'd like to contribute, either
- fork the repository and submit a pull request, or
- contact the maintainer via e-mail.
This project is released with a Contributor Code of Conduct, and if you choose to contribute, you must adhere to its terms.