• Language: Ruby
• License: MIT License
• Created over 12 years ago
• Updated 2 months ago

Character encoding detection, brought to you by ICU

CharlockHolmes

Character encoding detecting library for Ruby using ICU

Usage

First, require the library:

require 'charlock_holmes'

Encoding detection

contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}

# optionally there will be a :language key as well, but
# that's mostly only returned for legacy encodings like ISO-8859-1

NOTE: CharlockHolmes::EncodingDetector.detect will return nil if it was unable to find an encoding.

For binary content, :type is set to :binary.
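Since detect can return nil, and binary content is flagged through :type, callers usually want to check both before trusting :encoding. A minimal sketch of that check follows; describe_detection is a hypothetical helper, not part of charlock_holmes, and it only relies on the hash keys shown above.

```ruby
# Hypothetical helper (not part of charlock_holmes): summarizes the value
# returned by CharlockHolmes::EncodingDetector.detect, which is a Hash or nil.
def describe_detection(detection)
  # detect returns nil when no encoding could be found
  return 'unknown encoding' if detection.nil?
  # binary content is reported through the :type key
  return 'binary content' if detection[:type] == :binary

  summary = "#{detection[:encoding]} (confidence #{detection[:confidence]})"
  # :language is optional, so only mention it when present
  summary += ", language #{detection[:language]}" if detection[:language]
  summary
end

# Usage, assuming the gem is installed and required:
#   describe_detection(CharlockHolmes::EncodingDetector.detect(File.read('test.xml')))
```

Keeping the nil and :binary checks in one place means the rest of the code only ever deals with text detections.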

It's more efficient to reuse one detector instance:

detector = CharlockHolmes::EncodingDetector.new

detection1 = detector.detect(File.read('test.xml'))
detection2 = detector.detect(File.read('test2.json'))

# and so on...
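When many inputs need checking, the reuse pattern above can be wrapped in a small helper. This is only a sketch: detect_all is a hypothetical name, and the helper assumes nothing about the detector beyond a detect method like the one shown above.

```ruby
# Hypothetical helper (not part of charlock_holmes): runs a single detector
# over several named byte strings and returns a Hash of name => detection.
# `detector` is anything that responds to #detect, for example a
# CharlockHolmes::EncodingDetector instance.
def detect_all(detector, contents_by_name)
  contents_by_name.transform_values { |contents| detector.detect(contents) }
end

# Usage, assuming the gem is installed and required:
#   detector = CharlockHolmes::EncodingDetector.new
#   files = { 'test.xml' => File.read('test.xml'),
#             'test2.json' => File.read('test2.json') }
#   detect_all(detector, files)
```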

String monkey patch

Alternatively, you can use the detect_encoding method that this library adds to the String class:

require 'charlock_holmes/string'

contents = File.read('test.xml')

detection = contents.detect_encoding

Ruby 1.9 specific

NOTE: This method only exists on Ruby 1.9+

If you want to use this library to detect and set the encoding flag on strings, you can use the detect_encoding! method on the String class:

require 'charlock_holmes/string'

contents = File.read('test.xml')

# this will detect and set the encoding of `contents`, then return self
contents.detect_encoding!

Transcoding

Being able to detect the encoding of some arbitrary content is nice, but what you probably want is to transcode that content into the encoding your application uses.

content = File.read('test2.txt')
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'

The first parameter is the content to transcode, the second is the source encoding (the encoding the content is assumed to be in), and the third parameter is the destination encoding.
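Because detection can come back nil or report binary content, it may be worth guarding the conversion. The following is a sketch under those assumptions: transcode_to_utf8 and its convert: keyword are hypothetical names introduced here for illustration, and only CharlockHolmes::Converter.convert comes from the library.

```ruby
# Hypothetical helper (not part of charlock_holmes): transcodes content to
# UTF-8 based on a detection Hash, returning nil when there is nothing
# sensible to convert.
def transcode_to_utf8(content, detection, convert: nil)
  # No encoding found, or binary content: leave the decision to the caller
  return nil if detection.nil? || detection[:type] == :binary

  # Default to the library's converter. The constant is only looked up here,
  # so charlock_holmes must already be required when this line runs.
  convert ||= ->(bytes, from, to) { CharlockHolmes::Converter.convert(bytes, from, to) }
  convert.call(content, detection[:encoding], 'UTF-8')
end

# Usage, assuming the gem is installed and required:
#   content = File.read('test2.txt')
#   detection = CharlockHolmes::EncodingDetector.detect(content)
#   utf8 = transcode_to_utf8(content, detection)
```

The convert: keyword exists only so a different converter can be swapped in, for example in tests; in normal use it can be omitted.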

Installing

If the traditional gem install charlock_holmes doesn't work, you may need to specify the path to your installation of ICU with the --with-icu-dir option, either during the gem install or by configuring Bundler to pass that argument to RubyGems.

Configure Bundler to always use the correct arguments when installing:

bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c

Using Gem to install directly without Bundler:

gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c

If you get a compile-time error that looks like error: delegating constructors are permitted only in C++11, or something else related to C++11, you need to set the --with-cxxflags=-std=c++11 option:

Bundler:

bundle config build.charlock_holmes --with-icu-dir=/path/to/installed/icu4c --with-cxxflags=-std=c++11

Installing directly:

gem install charlock_holmes -- --with-icu-dir=/path/to/installed/icu4c --with-cxxflags=-std=c++11

Homebrew

If you're installing on Mac OS X, Homebrew is the easiest way to install ICU.

However, be warned: it is a keg-only install (see homedir issue #167 for more info), meaning RubyGems won't find it unless you specify --with-icu-dir.

To install ICU with Homebrew:

brew install icu4c

Configure Bundler to always use the correct arguments when installing:

bundle config build.charlock_holmes --with-icu-dir=/usr/local/opt/icu4c

Using Gem to install directly without Bundler:

gem install charlock_holmes -- --with-icu-dir=/usr/local/opt/icu4c
