Repository Details

High-performance HTML5 parser for Ruby based on Lexbor, with support for both CSS selectors and XPath.

Nokolexbor

Nokolexbor is a drop-in replacement for Nokogiri. In the benchmarks below, it parses HTML 5.2x faster and evaluates CSS selectors up to 997x faster.

It's a performance-focused HTML5 parser for Ruby based on Lexbor. It supports both CSS selectors and XPath. Nokolexbor's API is designed to be as close to 1:1 compatible with Nokogiri's API as possible.

Requirements

Nokolexbor ships pre-compiled gems for the most common platforms:

  • Linux: x86_64, with glibc >= 2.17
  • macOS: x86_64 and arm64
  • Windows: ucrt64, mingw32 and mingw64

If you are on a supported platform, just jump to the Installation section. Otherwise, you need to install CMake to compile C extensions:

macOS

brew install cmake

Linux (Debian, Ubuntu, etc.)

sudo apt-get install cmake

Installation

Add to your Gemfile:

gem 'nokolexbor'

Then, run bundle install.

Or, install the gem directly:

gem install nokolexbor

Quick start

require 'nokolexbor'
require 'open-uri'

# Parse HTML document
doc = Nokolexbor::HTML(URI.open('https://github.com/serpapi/nokolexbor'))

# Search for nodes by css
doc.css('#readme h1', 'article h2', 'p[dir=auto]').each do |node|
  puts node.content
end

# Search for text nodes by css
doc.css('#readme p > ::text').each do |text|
  puts text.content
end

# Search for nodes by xpath
doc.xpath('//div[@id="readme"]//h1', '//article//h2').each do |node|
  puts node.content
end

Features

  • Nokogiri-compatible APIs.
  • High-performance HTML parsing, DOM manipulation, and CSS selector engine.
  • XPath search engine (ported from libxml2).
  • CSS selector support for text nodes: ::text.

Searching methods overview

  • css and at_css
    • Based on Lexbor.
    • Only accepts CSS selectors; mixed syntax such as div#abc /text() is not supported.
    • To select text nodes, use the ::text pseudo-element, e.g. div#abc > ::text.
    • Performance is much higher than libxml2 based methods.
  • xpath and at_xpath
    • Based on libxml2.
    • Only accepts XPath syntax.
    • Works in the same way as Nokogiri's xpath and at_xpath.
  • nokogiri_css and nokogiri_at_css (require Nokogiri to be installed)
    • Based on libxml2.
    • Accepts mixed syntax such as div#abc /text().
    • Works in the same way as Nokogiri's css and at_css.

Different behaviors from Nokogiri

  • For the :nth-of-type(n) selector, n is not affected by preceding filters. For example, suppose we want to select the 3rd div after excluding those with class a or class b, which is the last div in the following HTML:

    <body>
      <div></div>
      <div class="a"></div>
      <div class="b"></div>
      <div></div>
      <div></div>
    </body>
    

    In Nokogiri, the selector should be div:not(.a):not(.b):nth-of-type(3).

    In Nokolexbor, :not does not affect the position of the last div (the same as in browsers), so the selector should be div:not(.a):not(.b):nth-of-type(5), although this defeats the purpose of filtering.
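
To make the two counting rules concrete, here is a small pure-Ruby model of the five sibling divs from the HTML above. It does not call Nokogiri or Nokolexbor; the method names nth_of_type_browser and nth_of_type_filter_first and the hash layout are hypothetical and exist only for this sketch.

```ruby
# Hypothetical model of the five sibling <div>s: position among siblings
# of the same type, plus the list of classes on each one.
divs = [
  { position: 1, classes: [] },
  { position: 2, classes: ['a'] },
  { position: 3, classes: ['b'] },
  { position: 4, classes: [] },
  { position: 5, classes: [] }
]

# :not(.a):not(.b) expressed as a plain predicate.
not_a_or_b = ->(div) { !div[:classes].include?('a') && !div[:classes].include?('b') }

# Browser / Nokolexbor rule: n counts all siblings of the same type, and
# the other filters are checked independently of that count.
def nth_of_type_browser(divs, n, filter)
  divs.select.with_index(1) { |div, index| index == n && filter.call(div) }
end

# Filter-first rule (the Nokogiri behaviour described above): n counts
# only the siblings that passed the earlier filters.
def nth_of_type_filter_first(divs, n, filter)
  [divs.select(&filter)[n - 1]].compact
end

browser_3 = nth_of_type_browser(divs, 3, not_a_or_b).map { |d| d[:position] }
browser_5 = nth_of_type_browser(divs, 5, not_a_or_b).map { |d| d[:position] }
filtered_3 = nth_of_type_filter_first(divs, 3, not_a_or_b).map { |d| d[:position] }

puts browser_3.inspect   # [] because the 3rd div has class b
puts browser_5.inspect   # [5]
puts filtered_3.inspect  # [5]
```

One possible workaround, which is not from the original README, is to select with div:not(.a):not(.b) and pick the third element of the result in Ruby.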

Benchmarks

The benchmarks parse a Google results page (368 KB) and select nodes using CSS and XPath. They were run on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.

Run with: ruby bench/bench.rb

            Nokolexbor (iters/s)   Nokogiri (iters/s)   Diff
parsing     487.6                  93.5                 5.22x faster
at_css      50798.8                50.9                 997.87x faster
css         7437.6                 52.3                 142.11x faster
at_xpath    57.077                 53.176               same-ish
xpath       51.523                 58.438               same-ish

Raw data
Warming up --------------------------------------
    Nokolexbor parse    56.000  i/100ms
      Nokogiri parse     8.000  i/100ms
Calculating -------------------------------------
    Nokolexbor parse    487.564  (±10.9%) i/s -      9.688k in  20.117173s
      Nokogiri parse     93.470  (±21.4%) i/s -      1.736k in  20.024163s

Comparison:
    Nokolexbor parse:      487.6 i/s
      Nokogiri parse:       93.5 i/s - 5.22x  (± 0.00) slower

Warming up --------------------------------------
   Nokolexbor at_css     5.548k i/100ms
     Nokogiri at_css     6.000  i/100ms
Calculating -------------------------------------
   Nokolexbor at_css     50.799k (±13.8%) i/s -    987.544k in  20.018481s
     Nokogiri at_css     50.907  (±35.4%) i/s -    828.000  in  20.666258s

Comparison:
   Nokolexbor at_css:    50798.8 i/s
     Nokogiri at_css:       50.9 i/s - 997.87x  (± 0.00) slower

Warming up --------------------------------------
      Nokolexbor css   709.000  i/100ms
        Nokogiri css     4.000  i/100ms
Calculating -------------------------------------
      Nokolexbor css      7.438k (±14.7%) i/s -    145.345k in  20.083833s
        Nokogiri css     52.338  (±36.3%) i/s -    816.000  in  20.042053s

Comparison:
      Nokolexbor css:     7437.6 i/s
        Nokogiri css:       52.3 i/s - 142.11x  (± 0.00) slower

Warming up --------------------------------------
 Nokolexbor at_xpath     2.000  i/100ms
   Nokogiri at_xpath     4.000  i/100ms
Calculating -------------------------------------
 Nokolexbor at_xpath     57.077  (±31.5%) i/s -    920.000  in  20.156393s
   Nokogiri at_xpath     53.176  (±35.7%) i/s -    876.000  in  20.036717s

Comparison:
 Nokolexbor at_xpath:       57.1 i/s
   Nokogiri at_xpath:       53.2 i/s - same-ish: difference falls within error

Warming up --------------------------------------
    Nokolexbor xpath     3.000  i/100ms
      Nokogiri xpath     3.000  i/100ms
Calculating -------------------------------------
    Nokolexbor xpath     51.523  (±31.1%) i/s -    903.000  in  20.102568s
      Nokogiri xpath     58.438  (±35.9%) i/s -    852.000  in  20.001408s

Comparison:
      Nokogiri xpath:       58.4 i/s
    Nokolexbor xpath:       51.5 i/s - same-ish: difference falls within error
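
The "x faster" figures in the summary appear to be the ratio of the two iterations-per-second values. The sketch below recomputes them from the numbers in the raw output; the results hash and the speedup method are hypothetical names and are not part of bench/bench.rb.

```ruby
# Iterations per second, copied from the raw benchmark output above.
results = {
  'parsing' => { nokolexbor: 487.564, nokogiri: 93.470 },
  'at_css'  => { nokolexbor: 50_798.8, nokogiri: 50.907 },
  'css'     => { nokolexbor: 7_437.6, nokogiri: 52.338 }
}

# Speedup is Nokolexbor's throughput divided by Nokogiri's.
def speedup(result)
  (result[:nokolexbor] / result[:nokogiri]).round(2)
end

speedups = results.transform_values { |result| speedup(result) }

speedups.each do |name, ratio|
  puts format('%-8s %.2fx faster', name, ratio)
end
```

The at_xpath and xpath rows are omitted because the benchmark reports their differences as falling within the margin of error.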
