  • Language: Crystal
  • License: MIT License

Repository Details

Fast HTML5 parser with CSS selectors for the Crystal language

MyHTML

Fast HTML5 parser (Crystal binding for lexborisov's awesome myhtml and Modest libraries). This shard is used in production to parse millions of pages per day and is very stable and fast.

WARNING: the original libraries (myhtml and Modest) have not been maintained since July 2020. I recommend switching to the successor parser: Lexbor.

Installation

Add this to your application's shard.yml:

dependencies:
  myhtml:
    github: kostya/myhtml

And run shards install

Usage example

require "myhtml"

html = <<-HTML
  <html>
    <body>
      <div id="t1" class="red">
        <a href="/#">O_o</a>
      </div>
      <div id="t2"></div>
    </body>
  </html>
HTML

myhtml = Myhtml::Parser.new(html)

myhtml.nodes(:div).each do |node|
  id = node.attribute_by("id")

  if first_link = node.scope.nodes(:a).first?
    href = first_link.attribute_by("href")
    link_text = first_link.inner_text

    puts "div with id #{id} has link [#{link_text}](#{href})"
  else
    puts "div with id #{id} has no links"
  end
end

# Output:
#   div with id t1 has link [O_o](/#)
#   div with id t2 has no links

CSS selectors example

require "myhtml"

html = <<-HTML
  <html>
    <body>
      <table id="t1">
        <tr><td>Hello</td></tr>
      </table>
      <table id="t2">
        <tr><td>123</td><td>other</td></tr>
        <tr><td>foo</td><td>columns</td></tr>
        <tr><td>bar</td><td>are</td></tr>
        <tr><td>xyz</td><td>ignored</td></tr>
      </table>
    </body>
  </html>
HTML

myhtml = Myhtml::Parser.new(html)

p myhtml.css("#t2 tr td:first-child").map(&.inner_text).to_a
# => ["123", "foo", "bar", "xyz"]

p myhtml.css("#t2 tr td:first-child").map(&.to_html).to_a
# => ["<td>123</td>", "<td>foo</td>", "<td>bar</td>", "<td>xyz</td>"]
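
As a further sketch (not part of the original examples), the calls already shown above, css, attribute_by and inner_text, can be combined to collect link targets. The HTML snippet and the links variable are made up for illustration, and the code assumes attribute_by returns nil when the attribute is absent.

```crystal
require "myhtml"

# Hypothetical input, invented for this sketch.
html = <<-HTML
  <html>
    <body>
      <ul id="menu">
        <li><a href="/docs">Docs</a></li>
        <li><a href="/blog">Blog</a></li>
        <li><a>No target</a></li>
      </ul>
    </body>
  </html>
HTML

myhtml = Myhtml::Parser.new(html)

# Pairs of {link text, href}; anchors without an href are skipped.
links = [] of {String, String}

myhtml.css("#menu li a").each do |node|
  # Assumption: attribute_by yields nil for a missing attribute,
  # so the `if` guard filters out the third anchor.
  if href = node.attribute_by("href")
    links << {node.inner_text, href}
  end
end

links.each do |link|
  puts "#{link[0]} -> #{link[1]}"
end
```

Collecting into an array first, rather than printing inside the loop, keeps the extracted data available for later processing.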

More Examples

See the examples directory in the repository.

Development Setup:

git clone https://github.com/kostya/myhtml.git
cd myhtml
make
crystal spec

Benchmark

Parse a Google page (600 KB) 1000 times, then run a CSS select 1000 times. Programs: myhtml-program, crystagiri-program, nokogiri-program.

Lang      Shard       Lib              Parse time, s  CSS time, s  Memory, MiB
Crystal   lexbor      lexbor           2.54           0.099        7.8
Crystal   myhtml      myhtml(+modest)  3.17           0.16         8.4
Ruby 2.7  Nokogiri    libxml2          9.19           10.76        139.8
Crystal   Crystagiri  libxml2          11.27          -            25.0
