  • Language: PHP
  • License: MIT License
  • Created about 10 years ago
  • Updated about 1 year ago

Repository Details

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.

hQuery.php

An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my humble tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses about half the RAM.

See tests/README.md.

API Documentation

💡 Features

  • Very fast parsing and lookup
  • Parses broken HTML
  • jQuery-like style of DOM traversal
  • Low memory usage
  • Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
  • Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
  • Caches response for multiple processing tasks
  • PSR-7 friendly (see hQuery::fromHTML($message))
  • PHP 5.3+
  • No dependencies

🛠 Install

Just add this folder to your project, include_once 'hquery.php';, and you are ready to hQuery.

Alternatively, install with Composer: composer require duzun/hquery

Or install with npm: npm install hquery.php, then require_once 'node_modules/hquery.php/hquery.php';.

⚙ Usage

Basic setup:

// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() makes a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour

I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.
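
The cache settings above expect a folder that already exists and is writable. A minimal sketch of preparing one before configuring hQuery follows; the folder name hquery-cache and the ten-minute lifetime are illustrative assumptions, and the class_exists() guard is only there so the snippet also runs when the library has not been loaded:

```php
<?php
// Hypothetical cache location; any writable folder works.
$cacheDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'hquery-cache';

// Create the folder on first use (the second is_dir() covers a race
// where another process creates it between the check and mkdir()).
if (!is_dir($cacheDir) && !mkdir($cacheDir, 0775, true) && !is_dir($cacheDir)) {
    throw new RuntimeException("Cannot create cache folder: $cacheDir");
}
if (!is_writable($cacheDir)) {
    throw new RuntimeException("Cache folder is not writable: $cacheDir");
}

// Only touch hQuery if the library is available.
if (class_exists('duzun\hQuery')) {
    \duzun\hQuery::$cache_path    = $cacheDir;
    \duzun\hQuery::$cache_expires = 600; // ten minutes, in seconds (assumed value)
}
```

Failing early with an exception makes a misconfigured cache path visible at startup rather than as repeated, uncached requests later.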

Load HTML from a file

hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make an HTTP request through a proxy, see #26.
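
As a rough illustration of the $context argument, the sketch below builds a stream context that routes the request through a proxy. It is not the example from #26; the proxy address is a placeholder, and the timeout and header values are assumptions. The option names are PHP's standard http stream context options, which also apply to https:// URLs.

```php
<?php
// Placeholder proxy address (an assumption); replace with a real one.
$proxy = 'tcp://proxy.example.com:8080';

$context = stream_context_create([
    'http' => [
        'method'          => 'GET',
        'proxy'           => $proxy,
        'request_fulluri' => true, // many proxies expect the full URI in the request line
        'timeout'         => 7,    // seconds
        'header'          => "Accept: text/html\r\n",
    ],
]);

// Pass the context as the third argument, as in the remote example above.
// Guarded so the snippet also runs when hQuery has not been loaded.
if (class_exists('duzun\hQuery')) {
    $doc = \duzun\hQuery::fromFile('https://example.com/', false, $context);
}
```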

Load HTML from a string

hQuery::fromHTML( string $html, string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);
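
Since the request body may also be a string, one way to produce it is PHP's http_build_query(). This is only a sketch: the Content-Type header is an assumption about what the target endpoint expects, not something hQuery is documented here to require, and the class_exists() guard lets the snippet run without the library.

```php
<?php
// Form fields from the POST example above.
$fields = ['username' => 'Me', 'fullname' => 'Just Me'];

// Encode as application/x-www-form-urlencoded (spaces become "+").
$body = http_build_query($fields);

// Assumed headers: the Content-Type matches the encoding chosen above.
$headers = [
    'Accept'       => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8',
    'Content-Type' => 'application/x-www-form-urlencoded',
];

// Send the pre-encoded string as the body.
if (class_exists('duzun\hQuery')) {
    $doc = \duzun\hQuery::fromUrl(
        'http://example.com/someDoc.html',
        $headers,
        $body,
        ['method' => 'POST', 'timeout' => 7]
    );
}
```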

For building advanced requests (POST, parameters, etc.) see hQuery::http_wr(), though I recommend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing results. See Guzzle, for example.

PSR-7 example:

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.

use duzun\hQuery;

use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET',
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

hQuery::find( string $sel, array|string $attr = NULL, hQuery_Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery_Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style')
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

🖧 Live Demo

On DUzun.Me

A lot of people ask for the source of my Live Demo page. Here it is:

view-source:https://duzun.me/playground/hquery

πŸƒ Run the playground

You can easily run any of the examples/ on your local machine. All you need is PHP installed on your system. After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options for starting a web server.

Option 1:

cd hQuery.php/examples
php -S localhost:8000

# open browser http://localhost:8000/

Option 2 (browser-sync):

This option starts a live-reload server and is good for playing with the code.

npm install
gulp

# open browser http://localhost:8080/

Option 3 (VSCode):

If you are using VSCode, simply open the project and run debugger (F5).

🔧 TODO

  • Unit test everything
  • Document everything
  • Cookie support (implemented in memory for redirects)
  • Improve selectors to be able to select by attributes
  • Add more selectors
  • Use HTTPlug internally

💖 Support my projects

I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and this project helps you reduce development time, please consider:

  • ★ Starring and sharing the projects you like (and use)
  • ☕ Buying me a cup of coffee - PayPal.me/duzuns (contact at duzun.me)
  • ₿ Sending me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa
