  • Stars: 424
  • Rank: 102,009 (Top 3%)
  • Language: PHP
  • Created: about 12 years ago
  • Updated: over 1 year ago

Repository Details

A PHP library/toolkit designed to handle all of your web scraping needs under a MIT or LGPL license. It also includes web server and WebSocket server classes for building custom servers.

Ultimate Web Scraper Toolkit

A PHP library of tools designed to handle all of your web scraping needs under a MIT or LGPL license. This toolkit easily makes RFC-compliant web requests that are indistinguishable from a real web browser, has a web browser-like state engine for handling cookies and redirects, and includes a full cURL emulation layer for web hosts without the PHP cURL extension installed. The powerful tag filtering library TagFilter is included to easily extract the desired content from each retrieved document, and it can also be used to process HTML documents offline.

This toolkit also comes with classes for creating custom web servers and WebSocket servers. That custom API you want the average person to install on their home computer or roll out to devices across the enterprise just became easier to deploy.

Donate | Discord

Features

  • Carefully follows the IETF RFC Standards surrounding the HTTP protocol.
  • Supports file transfers, SSL/TLS, and HTTP/HTTPS/CONNECT proxies.
  • Easy to emulate various web browser headers.
  • A web browser-like state engine that emulates redirection (e.g. 301) and automatic cookie handling for managing multiple requests.
  • HTML form extraction and manipulation support. No need to fake forms!
  • Extensive callback support.
  • Asynchronous/Non-blocking socket support. For when you need to scrape lots of content simultaneously.
  • WebSocket support.
  • A full cURL emulation layer for drop-in use on web hosts that are missing cURL.
  • An impressive CSS3 selector tokenizer (TagFilter::ParseSelector()) that carefully follows the W3C Specification and passes the official W3C CSS3 static test suite.
  • Includes a fast and powerful tag filtering library (TagFilter) for correctly parsing really difficult HTML content (e.g. Microsoft Word HTML) and can easily extract desired content from HTML and XHTML using CSS3 compatible selectors.
  • TagFilter::HTMLPurify() produces XSS defense results on par with HTML Purifier.
  • Includes the legacy Simple HTML DOM library to parse and extract desired content from HTML. NOTE: Simple HTML DOM is only included for legacy reasons. TagFilter is much faster and more accurate as well as more powerful and flexible.
  • DNS over HTTPS support.
  • International domain name (IDNA/Punycode) support.
  • An unnecessarily feature-laden web server class with optional SSL/TLS support. Run a web server written in pure PHP. Why? Because you can, that's why.
  • A decent WebSocket server class is included too. For a scalable version of the WebSocket server class, see Data Relay Center.
  • Can be used to download entire websites for offline use.
  • Has a liberal open source license. MIT or LGPL, your choice.
  • Designed for relatively painless integration into your project.
  • Sits on GitHub for all of that pull request and issue tracker goodness to easily submit changes and ideas respectively.

Getting Started

Web Scraping - Techniques and tools of the trade

Example object-oriented usage:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	// Retrieve a URL (emulating Firefox by default).
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	// Check for connectivity and response errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Get the final URL after redirects.
	$baseurl = $result["url"];

	// Use TagFilter to parse the content.
	$html = TagFilter::Explode($result["body"], $htmloptions);

	// Retrieve a pointer object to the root node.
	$root = $html->Get();

	// Find all anchor tags inside a div with a specific class.
	// A useful CSS selector cheat sheet:  https://gist.github.com/magicznyleszek/809a69dd05e1d5f12d01
	echo "All the URLs:\n";
	$rows = $root->Find("div.someclass a[href]");
	foreach ($rows as $row)
	{
		echo "\t" . $row->href . "\n";
		echo "\t" . HTTP::ConvertRelativeToAbsoluteURL($baseurl, $row->href) . "\n";
	}

	// Find all table rows that have 'th' tags.
	$rows = $root->Find("tr")->Filter("th");
	foreach ($rows as $row)
	{
		echo "\t" . $row->GetOuterHTML() . "\n\n";
	}

	// Find the OpenGraph URL in the meta tags of the HTML (if any).
	// For example:  <meta property="og:url" content="SOMEURL" />
	// The next line first finds all matching rows and then current() returns the first row.
	$metaurl = $root->Find("meta[property=\"og:url\"]")->current();
	if ($metaurl !== false)  echo trim($metaurl->content) . "\n\n";
?>

Example direct ID usage:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	// Retrieve a URL (emulating Firefox by default).
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	// Check for connectivity and response errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Get the final URL after redirects.
	$baseurl = $result["url"];

	// Use TagFilter to parse the content.
	$html = TagFilter::Explode($result["body"], $htmloptions);

	// Find all anchor tags inside a div with a specific class.
	// A useful CSS selector cheat sheet:  https://gist.github.com/magicznyleszek/809a69dd05e1d5f12d01
	echo "All the URLs:\n";
	$result2 = $html->Find("div.someclass a[href]");
	if (!$result2["success"])
	{
		echo "Error parsing/finding URLs.  " . $result2["error"] . "\n";
		exit();
	}

	foreach ($result2["ids"] as $id)
	{
		// Faster direct access.
		echo "\t" . $html->nodes[$id]["attrs"]["href"] . "\n";
		echo "\t" . HTTP::ConvertRelativeToAbsoluteURL($baseurl, $html->nodes[$id]["attrs"]["href"]) . "\n";
	}

	// Find all table rows that have 'th' tags.
	// The 'tr' tag IDs are returned.
	$result2 = $html->Filter($html->Find("tr"), "th");
	if (!$result2["success"])
	{
		echo "Error parsing/finding table rows.  " . $result2["error"] . "\n";
		exit();
	}

	foreach ($result2["ids"] as $id)
	{
		echo "\t" . $html->GetOuterHTML($id) . "\n\n";
	}
?>

Example HTML form extraction:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$url = "https://www.somewebsite.com/login/";

	// Turn on the automatic forms extraction option.  Note that Javascript is not executed.
	$web = new WebBrowser(array("extractforms" => true));
	$result = $web->Process($url);

	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	if (count($result["forms"]) != 1)
	{
		echo "Was expecting one form.  Received:  " . count($result["forms"]) . "\n";
		exit();
	}

	// Forms are extracted in the order they appear in the HTML.
	$form = $result["forms"][0];

	// Set some or all of the variables in the form.
	$form->SetFormValue("username", "cooldude123");
	$form->SetFormValue("password", "password123");

	// Submit the form.
	$result2 = $form->GenerateFormRequest();
	$result = $web->Process($result2["url"], $result2["options"]);

	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Use TagFilter to parse the content.
	$html = TagFilter::Explode($result["body"], $htmloptions);

	// Do something with the response here...
?>

Example POST request:

<?php
	require_once "support/web_browser.php";

	$url = "https://api.somesite.com/profile";

	// Send a POST request to a URL.
	$web = new WebBrowser();
	$options = array(
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);

	$result = $web->Process($url, $options);

	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Do something with the response.
?>

Example large file/content retrieval:

<?php
	require_once "support/web_browser.php";

	function DownloadFileCallback($response, $data, $opts)
	{
		if ($response["code"] == 200)
		{
			$size = ftell($opts);
			fwrite($opts, $data);

			if ($size % 1000000 > ($size + strlen($data)) % 1000000)  echo ".";
		}

		return true;
	}

	// Download a large file.
	$url = "http://downloads.somesite.com/large_file.zip";
	$fp = fopen("the_file.zip", "wb");
	$web = new WebBrowser();
	$options = array(
		"read_body_callback" => "DownloadFileCallback",
		"read_body_callback_opts" => $fp
	);
	echo "Downloading '" . $url . "'...";
	$result = $web->Process($url, $options);
	echo "\n";
	fclose($fp);

	// Check for connectivity and response errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Do something with the response.
?>

Example custom SSL options usage:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";

	// Generate default safe SSL/TLS options using the "modern" ciphers.
	// See:  https://mozilla.github.io/server-side-tls/ssl-config-generator/
	$sslopts = HTTP::GetSafeSSLOpts(true, "modern");

	// Adjust the options as necessary.
	// For a complete list of options, see:  http://php.net/manual/en/context.ssl.php
	$sslopts["capture_peer_cert"] = true;

	// Demonstrates capturing the SSL certificate.
	// Returning false terminates the connection without sending any data.
	function CertCheckCallback($type, $cert, $opts)
	{
		var_dump($type);
		var_dump($cert);

		return true;
	}

	// Send a POST request to a URL.
	$url = "https://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"sslopts" => $sslopts,
		"peer_cert_callback" => "CertCheckCallback",
		"peer_cert_callback_opts" => false,
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, $options);

	// Check for connectivity and response errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Do something with the response.
?>

Example debug mode usage:

<?php
	require_once "support/web_browser.php";

	// Send a POST request to a URL with debugging enabled.
	// Enabling debug mode for a request uses more RAM since it collects all data sent and received over the wire.
	$url = "https://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"debug" => true,
		"postvars" => array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		)
	);
	$result = $web->Process($url, $options);

	// Check for connectivity errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	echo "------- RAW SEND START -------\n";
	echo $result["rawsend"];
	echo "------- RAW SEND END -------\n\n";

	echo "------- RAW RECEIVE START -------\n";
	echo $result["rawrecv"];
	echo "------- RAW RECEIVE END -------\n\n";
?>

Uploading Files

File uploads are handled several different ways so that very large files can be processed. The "files" option is an array of arrays that represents one or more files to upload. File uploads will automatically switch a POST request's Content-Type from "application/x-www-form-urlencoded" to "multipart/form-data".

<?php
	require_once "support/web_browser.php";

	// Upload two files.
	$url = "http://api.somesite.com/photos";
	$web = new WebBrowser();
	$options = array(
		"postvars" => array(
			"uid" => 12345
		),
		"files" => array(
			array(
				"name" => "file1",
				"filename" => "mycat.jpg",
				"type" => "image/jpeg",
				"data" => file_get_contents("/path/to/mycat.jpg")
			),
			array(
				"name" => "file2",
				"filename" => "mycat-hires.jpg",
				"type" => "image/jpeg",
				"datafile" => "/path/to/mycat-hires.jpg"
			)
		)
	);
	$result = $web->Process($url, $options);

	// Check for connectivity and response errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Do something with the response.
?>

Each file in the "files" array must have the following options:

  • name - The server-side key to use.
  • filename - The filename to send to the server. Well-written server-side software will generally ignore this other than to look at the file extension (e.g. ".jpg", ".png", ".pdf").
  • type - The MIME type to send to the server. Run a Google search for "mime type for xyz" where "xyz" is the file extension of the file you are sending.

One of the following options must also be provided for each file:

  • data - A string containing the data to send. This should only be used for small files.
  • datafile - A string containing the path and filename of the file to send OR a seekable file resource handle. This is the preferred method for uploading large files. Files exceeding 2GB may have issues under 32-bit PHP. A sketch of the file handle variant follows this list.
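
As mentioned above, "datafile" also accepts an already open, seekable file handle, which avoids loading the file into memory. A minimal sketch; the endpoint URL and field names are placeholders:

<?php
	require_once "support/web_browser.php";

	// Open the large file and pass the open handle directly as "datafile".
	$fp = fopen("/path/to/mycat-hires.jpg", "rb");

	// Hypothetical endpoint and field names for illustration.
	$url = "http://api.somesite.com/photos";
	$web = new WebBrowser();
	$options = array(
		"files" => array(
			array(
				"name" => "file1",
				"filename" => "mycat-hires.jpg",
				"type" => "image/jpeg",
				"datafile" => $fp
			)
		)
	);
	$result = $web->Process($url, $options);

	fclose($fp);
?>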

File uploads within extracted forms are handled similarly to the above. When calling $form->SetFormValue(), pass in an array containing the file information with "filename", "type", and "data" or "datafile". The "name" key-value will automatically be filled in when calling $form->GenerateFormRequest().
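
A minimal sketch of attaching a file to an extracted form; it assumes $form and $web were obtained as in the HTML form extraction example above, and the field name "photo" is a placeholder:

<?php
	// Attach a file to a (hypothetical) file input field named "photo".
	// The "name" key is filled in automatically by GenerateFormRequest().
	$form->SetFormValue("photo", array(
		"filename" => "mycat.jpg",
		"type" => "image/jpeg",
		"datafile" => "/path/to/mycat.jpg"
	));

	// Submit the form.
	$result2 = $form->GenerateFormRequest();
	$result = $web->Process($result2["url"], $result2["options"]);
?>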

Sending Non-Standard Requests

The vast majority of requests to servers are GET, POST application/x-www-form-urlencoded, and POST multipart/form-data. However, there may be times that other request types need to be sent to a server. For example, a lot of APIs being written these days want JSON content instead of a standard POST request to be able to handle richer incoming data.

Example:

<?php
	require_once "support/web_browser.php";

	// Send a POST request with a custom body.
	$url = "http://api.somesite.com/profile";
	$web = new WebBrowser();
	$options = array(
		"method" => "POST",
		"headers" => array(
			"Content-Type" => "application/json"
		),
		"body" => json_encode(array(
			"id" => 12345,
			"firstname" => "John",
			"lastname" => "Smith"
		), JSON_UNESCAPED_SLASHES)
	);
	$result = $web->Process($url, $options);

	// Check for connectivity and response errors.
	if (!$result["success"])
	{
		echo "Error retrieving URL.  " . $result["error"] . "\n";
		exit();
	}

	if ($result["response"]["code"] != 200)
	{
		echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";
		exit();
	}

	// Do something with the response.
?>

Working with such APIs is best done by building an SDK. Several SDKs built on this toolkit, along with their relevant API documentation, may serve as useful starting points.

Debugging SSL/TLS

Connecting to an SSL/TLS enabled server is fraught with difficulties. SSL/TLS connections are much more fragile through no fault of the toolkit; that is simply SSL/TLS doing its thing. Here are the known reasons an SSL/TLS connection will fail to establish:

  • Network failures. The server gives up because SSL/TLS is expensive on both local and remote system resources (mostly CPU), there is a temporary network condition (just retry the request), or the request is being actively blocked by a firewall (e.g. port blocking for a range of abusive IPs).
  • The server (or client) SSL/TLS certificate is incomplete or does not validate against a known root CA certificate list.
  • The server (or client) SSL/TLS certificate has expired. Not much can be done here except completely disable SSL validation.
  • A bug in Ultimate Web Scraper Toolkit exposed due to underlying TLS bugs in PHP. This is really rare though.

PHP does not expose much of the underlying SSL/TLS layer to applications when establishing connections, which makes it incredibly difficult to diagnose certain issues with SSL/TLS. To diagnose network related problems, use the 'openssl s_client' command line tool from the same host the problematic script is running on. Setting the "cafile", "auto_peer_name", "auto_cn_match", and "auto_sni" SSL options may help too.

If all else fails and secure, encrypted communication with the server is not required, disable the "verify_peer" and "verify_peer_name" SSL options and enable the "allow_self_signed" SSL option. Note that making these changes results in a connection that is no more secure than plaintext HTTP. Don't send passwords or other information that should be kept secure. This solution should only ever be used as a last resort. Always try to get the toolkit working with verification first.
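
A minimal sketch of that last-resort configuration, built on the earlier SSL options example; the URL is a placeholder:

<?php
	require_once "support/http.php";
	require_once "support/web_browser.php";

	// Start from the safe defaults, then weaken verification ONLY as a last resort.
	$sslopts = HTTP::GetSafeSSLOpts(true, "modern");
	$sslopts["verify_peer"] = false;
	$sslopts["verify_peer_name"] = false;
	$sslopts["allow_self_signed"] = true;

	// Placeholder URL.
	$url = "https://broken-cert.somesite.com/";
	$web = new WebBrowser();
	$result = $web->Process($url, array("sslopts" => $sslopts));

	// Remember:  The connection is now no more secure than plaintext HTTP.
?>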

Handling Pagination

There is a common pattern in the scraping world: Pagination. This is most often seen when submitting a form and the request is passed off to a basic search engine that usually returns anywhere from 10 to 50 results.

Unfortunately, you need all 8,946 results for the database you are constructing. There are two ways to handle the scenario: Fake it or follow the links/buttons.

"Faking it" eliminates the need to handle pagination in the first place. What is meant by this? Well, a lot of GET/POST requests in pagination scenarios pass along the "page size" to the server. Let's say 50 results are being returned but the number '50' in a size attribute is also being sent to the server either on the first page or subsequent pages in the URL. Well, what happens if the value '10000' is sent for the page size instead of '50'? About 85% of the time, the server-side web facing software assumes it will only be passed the page size values provided in some client-side select box. Therefore, the server-side just casts the submitted value to an integer and passes it along to the database AND does all of its pagination calculations from that submitted value. The result is that all of the desired server-side data can be retrieved with just one request. Frequently, if the page size is not in the first page of search results, page 2 of those search results will generally reveal what parameter is used for page size. The ability to fake it on such a broad scale just goes to show that writing a functional search engine is a difficult task for a lot of developers.

But what if faking it doesn't work? Plenty of server-side software can't handle processing/returning large amounts of data and will instead return an error - for example, with some experimenting you may find that a server fails to return more than 3,000 rows at a time, which is still significantly more than 50 rows at a time. Or the developer anticipated that their data might get scraped and enforces an upper limit on the page size anyway. Doing so just hurts them more than anything else, since the scraping script will end up using more of their system resources to retrieve the same amount of data. Regardless, if the data can't be retrieved all at once, paginating at whatever limit is imposed by the server is the only option. If the requests are just URL-based, then pagination can be done by manipulating the URL. If the requests are POST-based, then extracting forms from the page may be required. It depends entirely on how the search engine was constructed.

Example:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$url = "http://www.somesite.com/something/?s=3000";
	$web = new WebBrowser(array("extractforms" => true));

	do
	{
		// Retrieve a URL.
		$retries = 3;
		do
		{
			$result = $web->Process($url);
			$retries--;
			if (!$result["success"])  sleep(1);
		} while (!$result["success"] && $retries > 0);

		// Check for connectivity and response errors.
		if (!$result["success"])
		{
			echo "Error retrieving URL.  " . $result["error"] . "\n";

			exit();
		}

		if ($result["response"]["code"] != 200)
		{
			echo "Error retrieving URL.  Server returned:  " . $result["response"]["code"] . " " . $result["response"]["meaning"] . "\n";

			exit();
		}

		$baseurl = $result["url"];

		// Use TagFilter to parse the content.
		$html = TagFilter::Explode($result["body"], $htmloptions);

		// Retrieve a pointer object to the root node.
		$root = $html->Get();

		$found = false;
		// Attempt to extract information.
		// Set $found to true if there is at least one row of data.

		if ($found)
		{
			$row = $root->Find("div.pagination a[href]")->Filter("/~contains:Next")->current();
			if ($row === false)  break;

			$url = HTTP::ConvertRelativeToAbsoluteURL($baseurl, $row->href);
		}
	} while ($found);
?>

One other useful tip is to try wildcard SQL characters or text patterns to extract more data than the website operator likely intended. If a search box requires some field to be filled in for a search to be accepted, try a single '%' to see if the server is accepting wildcard LIKE queries. If not, then walking through the set of possible alphanumeric values (e.g. "a", "b", "c", "d") may work, taking care to exclude duplicated data (e.g. "XYZ, Inc." would show up in six different search result sets).
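
A minimal sketch of walking single-character search values and de-duplicating results by absolute URL; the search URL, query parameter, and result selector are hypothetical:

<?php
	require_once "support/web_browser.php";
	require_once "support/tag_filter.php";

	// Retrieve the standard HTML parsing array for later use.
	$htmloptions = TagFilter::GetHTMLOptions();

	$web = new WebBrowser();
	$seen = array();

	foreach (str_split("abcdefghijklmnopqrstuvwxyz0123456789") as $ch)
	{
		// Hypothetical search URL and query parameter.
		$result = $web->Process("http://www.somesite.com/search/?q=" . urlencode($ch));
		if (!$result["success"] || $result["response"]["code"] != 200)  continue;

		$html = TagFilter::Explode($result["body"], $htmloptions);
		$root = $html->Get();

		// Hypothetical result selector.  De-duplicate by absolute URL.
		$rows = $root->Find("div.result a[href]");
		foreach ($rows as $row)
		{
			$seen[HTTP::ConvertRelativeToAbsoluteURL($result["url"], $row->href)] = true;
		}
	}

	echo count($seen) . " unique result URLs found.\n";
?>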

Another useful tip is to watch the URLs of detail pages. For example, if one item from a search has "id=2018000001" in its detail page URL and another item has "id=2017003449", there may be a predictable pattern of year + sequence within that year as part of the ID for any given item. Searching may not even be necessary since it may be possible to generate the URLs directly (e.g. "id=2018000001", "id=2018000002", "id=2018000003") if the goal is to copy all records.
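
A minimal sketch of generating detail page URLs directly from a predictable year + sequence pattern; the URL format and ID range are hypothetical:

<?php
	require_once "support/web_browser.php";

	$web = new WebBrowser();

	// Hypothetical year + zero-padded sequence ID pattern and record count.
	for ($x = 1; $x <= 3449; $x++)
	{
		$url = "http://www.somesite.com/details/?id=" . sprintf("2017%06d", $x);
		$result = $web->Process($url);

		if (!$result["success"] || $result["response"]["code"] != 200)  continue;

		// Parse $result["body"] and store the record here...
	}
?>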

Advanced Troubleshooting

Let's say you encounter a webpage that displays just fine in a regular web browser but does not return the expected data when the same apparent request is made using Ultimate Web Scraper Toolkit. The first thing to do when troubleshooting is to go back to the very beginning with a fresh session:

  • Start a Private Web Browsing/Incognito window. (Close existing private/incognito windows first.)
  • Open Developer Tools, go to the Network tab, and enable Persistent Logs. Be sure to turn this off when you are done.
  • Visit the starting URL and traverse to the problem location.

Persistent Logs keeps track of every single request made, including redirects.

The next step is to analyze the entire history looking for cookies that are set and various request headers that are sent to each server. There's always something that is overlooked. Here are several things to specifically look for:

  • A missing or incorrect HTTP cookie. A missing cookie may be set by a request that was skipped. That is, simply following a more natural path that a regular user would take, with some extra requests, may be all that is needed to set the correct cookie. A cookie could also be set by Javascript, which usually requires setting such cookies manually: call $currstate = $web->GetState(), modify $currstate, and then call $web->SetState($currstate) (see the sketch after this list).
  • A missing or incorrect HTTP header. Many servers expect a valid Referer header to access certain resources. Again, following a more natural path solves this issue. Another example: when working with XHR requests, servers may expect the X-Requested-With: XMLHttpRequest header as well as the correct GET/POST method.
  • The incorrect HTTP method is used. If the web browser makes a POST request but a GET request is made using the toolkit, the server might not be able to handle the request. Check and make sure that the correct HTTP method is being sent to the server.
  • The actual request is done inside a WebSocket connection. Rare. Watch for HTTP 101 (Switching Protocols) responses that upgrade the connection to a WebSocket. There could be communication happening over the WebSocket that unlocks something. Note that many WebSocket connections require valid session establishment via cookies first before they will function (see above).
  • A missing request to the server that involves none of the above. Super rare. For example, I have run into one system that required making a request for a specific JPEG image on a completely different server before it would allow a download to take place. The server hosting the download was obviously checking session cookies behind the scenes to see if the user had made the request for the image on the other server before allowing the download to take place. A real web browser tends to follow all instructions while web scrapers tend to take shortcuts. There was no indicator that requesting the JPEG was required. Requesting each URL in the returned HTML and then narrowing things down to the specific request revealed how the system worked. A clever attempt to prevent scraping but easily dealt with.
  • Running the request from a different IP address than the web browser. CloudFlare, in particular, has detection systems in place for requests originating from DigitalOcean, AWS, Azure, Tor, web proxies, etc. and can deliver different content than would be delivered to a typical home IP address. CloudFlare is a popular frontend caching proxy/CDN for backend websites. Some tools exist that can sometimes un-proxy a domain to reveal the original backend server IP address. Connecting directly by IP address instead of domain name requires special treatment when using the toolkit (i.e. a custom Host header, custom SSL/TLS setup, etc.). It can be done but is a little tricky/finicky to get right.
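
As the first item above notes, a Javascript-set cookie can be injected by modifying the browser state between requests. A minimal sketch; the exact layout of the state array (the "cookies" key) and the cookie and header values shown are assumptions, so dump the state first and adjust to match:

<?php
	require_once "support/web_browser.php";

	$web = new WebBrowser();

	// Inspect the current state to see how cookies are actually stored.
	$currstate = $web->GetState();
	var_dump($currstate);

	// ASSUMPTION:  The state array exposes cookies under a "cookies" key.
	// Adjust the structure to match whatever var_dump() showed above.
	$currstate["cookies"][".somesite.com"]["/"]["session_hint"] = array("value" => "1");
	$web->SetState($currstate);

	// Send a Referer and X-Requested-With header along with an XHR-style request.
	$options = array(
		"headers" => array(
			"Referer" => "http://www.somesite.com/search/",
			"X-Requested-With" => "XMLHttpRequest"
		)
	);
	$result = $web->Process("http://www.somesite.com/ajax/data/", $options);
?>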

There are probably plenty of other sneaky things floating around out there, but those tips cover 98% of the more difficult cases. Basically, problems arise from not perfectly replicating the parts of requests that the server side assumes will reasonably exist under normal circumstances. Enabling debug mode for requests using the toolkit and dumping the request and response to the screen may also reveal differences between what the browser is sending/receiving and what code using the toolkit is sending/receiving.

Offline Downloading

Included with Ultimate Web Scraper Toolkit is an example script to download a website starting at a specified URL. The script demonstrates bulk concurrent downloading and processing of HTML, CSS, images, Javascript, and other files almost like a web browser would do.

Example usage:

php offline_download_example.php offline-test https://barebonescms.com/ 3

That will download content up to three links deep to the local computer system starting at the root URL of barebonescms.com. All valid URLs to barebonescms.com are transformed into local disk references. CDNs for images and Javascript are transformed into subdirectories. The script also attempts to maintain the relative URL structure of the original website wherever possible.

The script is only an example of what a website downloader might look like since it lacks features that a better tool might have (e.g. the ability to exclude certain URL paths). It's a great starting point though for building something more complete and/or a custom solution for a specific purpose.

There are some limitations. For example, any files loaded via Javascript won't necessarily be retrieved. See the Limitations section below for additional information.

Limitations

The only real limitation with Ultimate Web Scraper Toolkit is its inability to process Javascript. A simple regex here and there to extract data hardcoded via Javascript usually works well enough.
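
A minimal sketch of that regex approach for data hardcoded in an inline script block; the Javascript variable name and URL are hypothetical:

<?php
	require_once "support/web_browser.php";

	// Placeholder URL.
	$url = "http://www.somesite.com/something/";
	$web = new WebBrowser();
	$result = $web->Process($url);

	if ($result["success"] && $result["response"]["code"] == 200)
	{
		// Extract a JSON blob assigned to a hypothetical Javascript variable.
		if (preg_match('/var\s+pagedata\s*=\s*(\{.*?\});/s', $result["body"], $matches))
		{
			$data = json_decode($matches[1], true);

			var_dump($data);
		}
	}
?>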

For the 0.5% of websites where there is useful content to scrape but the entire page content is generated using Javascript or is protected by Javascript in unusual ways, a real web browser is required. Fortunately, there is PhantomJS (headless Webkit), which can be scripted (i.e. automated) to handle the aforementioned Javascript-heavy sites. However, PhantomJS is rather resource intensive and slooooow. After all, PhantomJS emulates a real web browser which includes the full startup sequence and then it proceeds to download the entire page's content. That, in turn, can take hundreds of requests to complete and can easily include downloading things such as ads.

It is very rare though to run into a website like that. Ultimate Web Scraper Toolkit can handle most anything else.

More Information

Full documentation and more examples can be found in the 'docs' directory of this repository.

More Repositories

1. email_sms_mms_gateways (184 stars) - A simple repo containing a list of known e-mail to SMS, MMS, and Rich Messaging carrier gateways in JSON format under a MIT or LGPL license.
2. js-fileexplorer (JavaScript, 181 stars) - A zero dependencies, customizable, pure Javascript widget for navigating, managing, uploading, and downloading files and folders or other hierarchical object structures on any modern web browser.
3. sso-server (PHP, 122 stars) - Do you need a PHP login system that rocks? Then install this SSO server. It's an awesome, scalable, secure, flexible PHP login system for the modern era.
4. createprocess-windows (C++, 94 stars) - A complete, robust command-line utility to construct highly customized calls to the CreateProcess() Windows API. Released under a MIT or LGPL license.
5. php-app-server (PHP, 88 stars) - Create lightweight, installable applications written in HTML, CSS, Javascript, and PHP for the Windows, Mac, and Linux desktop operating systems.
6. cloud-storage-server (PHP, 79 stars) - An open source, extensible, self-hosted cloud storage API. The base server implements a complete file system similar to Amazon Cloud Drive, B2 Cloud Storage and other providers. MIT or LGPL.
7. cross-platform-cpp (C++, 74 stars) - A wonderful, lightweight, cross-platform C++ snippet library under a MIT or LGPL license.
8. jquery-fancyfileuploader (JavaScript, 53 stars) - A jQuery plugin to convert the HTML file input type into a fancy file uploader under a MIT or LGPL license. Mobile-friendly too!
9. portable-apache-maria-db-php-for-windows (PHP, 47 stars) - Portable Apache + Maria DB + PHP for Windows is for web developers who prefer manually editing configuration files and want "manual" but quick startup of Apache and Maria DB (no Windows services). No more hunting for ZIP files for each separate piece of software.
10. php-filemanager (JavaScript, 45 stars) - A fantastic mobile-friendly, web-based file manager, code editor, and file previewer for the web. Can be used to create HTML/CSS/Javascript embeds for websites, a web-based file sharing portal, and much more. MIT or LGPL, your choice.
11. service-manager (PHP, 37 stars) - The world's first cross-platform, open source (MIT or LGPL), programming AND scripting language-agnostic solution to system service development. Source code:
12. php-license-server (PHP, 37 stars) - A high-performance license server system service for creating and managing products, major versions, and software licenses for the purpose of selling installable software products.
13. cloud-backup (PHP, 26 stars) - A flexible, powerful, and easy to use rolling incremental backup system that pushes collated, compressed, and encrypted data to online cloud storage services. MIT or LGPL.
14. ultimate-email (PHP, 26 stars) - A PHP library/toolkit designed to handle all of your one-off e-mail needs under a MIT or LGPL license.
15. windows-pe-artifact-library (23 stars) - Contains over 375 samples of Windows Portable Executable (PE) files ranging from the common to the completely esoteric with detailed origin information for each sample. Spans decades of computing in roughly 64MB of disk storage. Unique, ultra-rare PE file format artifacts. Any researcher's most delightful find!
16. php-ext-qolfuncs (C, 23 stars) - A set of quality of life improvement functions designed for PHP core.
17. php-cool-file-transfer (PHP, 22 stars) - Directly transfer files between two devices (PC, tablet, smartphone, refrigerator) using nothing more than a web browser and a standard PHP enabled web server. MIT or LGPL.
18. messagebox-windows (C++, 19 stars) - A complete, robust command-line utility to construct calls to the MessageBox() and MessageBeep() Windows APIs. Released under a MIT or LGPL license.
19. ssh-win64 (19 stars) - An automatically updated repository of MSYS SSH binaries for 64-bit Windows. Sourced via Git Portable.
20. server-instant-start (PHP, 19 stars) - Spin up a fully configured Ubuntu/Debian-based web server in under 10 minutes with Nginx (w/ HTTPS), PHP FPM, Postfix, OpenDKIM, MySQL/MariaDB, PostgreSQL, and more. Deploy your web application too.
21. sso-client-php (PHP, 19 stars) - The PHP SSO Client portion of the Barebones SSO Server/Client. Pairs with the SSO Server, which is an awesome, scalable, secure, flexible PHP login system that's become a bit ridiculous - but it still rocks anyway.
22. portable-apache-mysql-php-for-windows (PHP, 17 stars) - SEE NOTE. Portable Apache + MySQL + PHP for Windows is for web developers who prefer manually editing configuration files and want "manual" but quick startup of Apache and MySQL (no Windows services). No more hunting for ZIP files for each separate piece of software.
23. service-manager-src (C++, 16 stars) - The source code to the world's first cross-platform, open source (MIT or LGPL), programming AND scripting language-agnostic solution to system service development. Binaries:
24. php-winpefile (PHP, 16 stars) - Windows Portable Executable file format command-line tools and PHP classes. Easily extract structures and information, modify files, and even construct files from scratch in the Windows Portable Executable (PE) file format (EXEs, DLLs, etc).
25. xcron (PHP, 10 stars) - xcron is the souped up, modernized cron/Task Scheduler for Windows, Mac OSX, Linux, and FreeBSD server and desktop operating systems. MIT or LGPL.
26. digitalocean (PHP, 10 stars) - DigitalOcean PHP SDK plus a feature-complete command-line interface with full support for all DigitalOcean APIs. MIT or LGPL.
27. ssh-win32 (9 stars) - An automatically updated repository of MSYS SSH binaries for 32-bit Windows. Sourced via Git Portable.
28. php-decomposer (PHP, 9 stars) - Generate no-conflict standalone builds of PHP Composer/PSR-enabled software. MIT or LGPL.
29. php-ext-sync (C, 8 stars) - SEE NOTE. CubicleSoft authored PHP Extension: Synchronization objects (sync). MIT license.
30. admin-pack (PHP, 8 stars) - A PHP toolkit designed specifically for programmers to quickly create a nice-looking, custom-built, secure administrative web interface. MIT or LGPL.
31. php-obsmanager (PHP, 7 stars) - Remotely control Open Broadcaster (OBS) scene selection anywhere in the world via a standard web browser. MIT or LGPL.
32. barebones-cms (PHP, 7 stars) - The official Barebones CMS release distribution. Accept no substitutes. MIT or LGPL.
33. admin-pack-with-extras (JavaScript, 7 stars) - A PHP toolkit designed specifically for programmers to quickly create a nice-looking, custom-built, secure administrative web interface. MIT or LGPL.
34. php-discord-sdk (PHP, 7 stars) - An ultra-lightweight PHP SDK for accessing the Discord API and Discord webhook endpoints.
35. cloud-storage-tools (PHP, 7 stars) - Useful tools to access Cloud Storage Server APIs directly from the command-line.
36. php-ssl-certs (PHP, 7 stars) - Easily manage SSL Certificate Signing Requests (CSRs) and SSL certificate chains with a pure PHP-based command-line tool. MIT or LGPL.
37. php-libs (PHP, 7 stars) - A single repository containing all CubicleSoft PHP libraries. Fully automated nightly updates. MIT or LGPL.
38. status-tracker (PHP, 7 stars) - A simple and elegant cron script and web application health status tracker written in PHP. MIT or LGPL.
39. web-knocker-firewall-service (PHP, 7 stars) - Not your average port knocker. A web-based service written in pure PHP that opens protected TCP and UDP ports in response to encrypted requests from a correctly configured client for a limited but renewable time period. MIT or LGPL.
40. php-drc (PHP, 6 stars) - Data Relay Center (DRC) is a powerful multi-channel, secure, general-purpose WebSocket PHP server and client SDKs for PHP and Javascript. Similar to how Internet Relay Chat (IRC) works but designed specifically for data and the web!
41. ifds (PHP, 6 stars) - Easily create your own custom file format with the Incredibly Flexible Data Storage (IFDS) file format. Repository contains: The IFDS specification (CC0) and the official PHP reference implementation of IFDS, a paging file cache class, and example usage classes (MIT or LGPL).
42. offline-forms (JavaScript, 5 stars) - A form designer and data gathering tool for use in areas with spotty or unknown Internet connectivity powered by any standard web browser. MIT or LGPL.
43. getsidinfo-windows (C++, 5 stars) - Dumps information about Windows Security Identifiers (SIDs) as easy-to-consume JSON output. MIT or LGPL.
44. portable-apps-mirror-proxy (PHP, 5 stars) - An unofficial WSUS-like mirror/proxy server with system service integration for the Portable Apps platform client software. Choose from a slightly modified MIT or LGPL license.
45. webroute (PHP, 5 stars) - Official reference implementation and technical description of the WebRoute Internet protocol. MIT or LGPL.
46. php-zipstreamwriter (PHP, 4 stars) - A fast, efficient streaming library for creating ZIP files on the fly in pure userland PHP without any external tools, PHP extensions, or physical disk storage requirements.
47. php-csprng (PHP, 4 stars) - A PHP library that utilizes available CSPRNGs and a set of convenience functions for generating random data under a MIT or LGPL license.
48. php-flexforms (PHP, 4 stars) - FlexForms is a powerful HTML forms generator/builder PHP class to output HTML forms using PHP arrays. MIT or LGPL.
49. network-speedtest-cli (PHP, 4 stars) - A command-line tool for cloud and network speed testing of single TCP connections. Supports most common setups (e.g. TCP ports 22, 80, 443, and others) and has integrations for Digital Ocean, speedtest.net, and custom OoklaServer installs.
50. license-server-demo (PHP, 4 stars) - A complete Stripe + PHP License Server integration + product support center + demo app ready to adjust and deploy. Get back to writing software in minutes.
51. gettokeninformation-windows (C++, 4 stars) - A complete, robust command-line utility to dump the contents of Windows security tokens using the GetTokenInformation() Windows API as JSON. MIT or LGPL.
52. remoted-api-server (PHP, 4 stars) - Allows any standard TCP/IP server to be remoted with low-overhead TCP connectivity. Allows TCP/IP clients to easily and directly connect to a TCP/IP server operating completely behind a firewall by utilizing the WebRoute protocol. MIT or LGPL.
53. php-short-open-tag-finder (PHP, 4 stars) - Intelligent command-line tool to find software references to short open tags with an optional "ask to replace" mode. Compatible with all versions of PHP, including PHP 8.
54. efss (PHP, 4 stars) - Encrypted File Storage System (EFSS). A real, virtual, mountable block-based file system for PHP. MIT or LGPL.
55. voicemail-manager (PHP, 4 stars) - Works with Twilio-compatible systems to automatically route incoming calls to a voicemail queue. Has flexible management options available through phone and web interfaces.
56. cloud-storage-server-ext-scripts (PHP, 4 stars) - A powerful and flexible cross-platform /scripts extension for the self-hosted cloud storage API for starting and monitoring long-running scripts. Includes a PHP SDK for interacting with the /scripts API. MIT or LGPL.
57. json-base64 (C#, 4 stars) - A single, massive repository containing all official reference implementations of JSON-Base64 as well as related applications, addons, icons, and logos.
58. jquery-ui-masked-picker (HTML, 3 stars) - A jQuery UI plugin that implements a masked picker to fill in a custom-formatted string.
59. sso-native-apps (C++, 3 stars) - Example native app frameworks for integrating Android, iOS, Windows Phone 8, and popular desktop operating systems (Windows, Mac, Linux) with the SSO Server.
60. js-flexforms (JavaScript, 3 stars) - FlexForms is a powerful HTML forms generator/builder Javascript class to output HTML forms using JSON-style arrays. MIT or LGPL.
61. php-misc (PHP, 3 stars) - Miscellaneous, lonely PHP classes that don't already have a home in another CubicleSoft product but want to be free and open source and loved. MIT or LGPL.
62. php-flexforms-modules (PHP, 3 stars) - Official PHP modules for FlexForms (charts, HTML editor, etc). MIT or LGPL.
63. php-libs-to-composer (PHP, 3 stars) - CubicleSoft PHP Software Development Libraries for Composer.
64. ssh-extract (PHP, 3 stars) - Extracts MSYS SSH binaries from Git Portable and automatically pushes them to ssh-win32 and ssh-win64. MIT or LGPL.
65. jquery-tablebodyscroll (JavaScript, 3 stars) - A really nice jQuery plugin to scroll the body of long tables so the table fits on a single screen vertically. MIT or LGPL. Mobile-friendly data tables.
66. php-libs-namespaced (PHP, 3 stars) - A single repository containing all CubicleSoft PHP libraries inside a CubicleSoft namespace. Fully automated nightly updates. MIT or LGPL.
67. resilient-ip-php (PHP, 3 stars) - The very first prototype implementation of the Resilient Internet Protocol (ResIP). ResIP is a modern tunneling protocol designed to withstand a wide range of network conditions and has features beyond what TCP/IP can offer. This prototype is written in PHP. MIT or LGPL.
68. csdb (PHP, 3 stars) - A portable, cross-platform, cross-database, lightweight, debuggable, replication-aware, migration-friendly, transaction-capable, data access layer (DAL) for PHP.
69. barebones-cms-sdks (PHP, 2 stars) - The Barebones CMS SDKs development repository. Changes made here propagate to the Barebones CMS release distribution repository.
70. jquery-tablecards (JavaScript, 2 stars) - A really nice jQuery plugin to convert tables to responsive cards via templates. MIT or LGPL license. Mobile-friendly data tables.
71. net-test (PHP, 2 stars) - Command-line network testing tool. Sets up a debuggable TCP/IP echo server or client.
72. sso-client-aspnet (C#, 2 stars) - The ASP.NET (C#) SSO Client portion of the Barebones SSO Server/Client. Pairs with the SSO Server, which is an awesome, scalable, secure, flexible PHP login system that's become a bit ridiculous - but it still rocks anyway.
73. php-web-tester (PHP, 2 stars) - An ultra lightweight testing framework for creating repeatable, instrumented builds of PHP-based software products. MIT or LGPL.
74. php-twilio-sdk (PHP, 2 stars) - An ultra-lightweight PHP SDK for accessing Twilio APIs and emitting valid TwiML verbs in response to webhook calls. Also works with SignalWire!
75. matrix-multiply (C++, 2 stars) - A set of pure ANSI C/C++ matrix multiplication implementations and a test suite. MIT or LGPL.
76. file-tracker (2 stars) - The world's first cross-platform, bulk visual diff/merge tool. Massively deploy and synchronize changes/updates across multiple software projects and systems in minutes instead of days or weeks.
77. cloud-storage-server-ext-feeds (PHP, 2 stars) - A powerful and flexible cross-platform /feeds extension for the self-hosted cloud storage API for sending and filtering notifications with data payloads. Includes a PHP SDK for interacting with the /feeds API. MIT or LGPL.
78. php-concurrency-tester (PHP, 2 stars) - A simple program that executes another PHP command-line script and (hopefully) collects output in CSV format for later analysis. Mostly for performance testing/verifying localhost TCP/IP servers.
79. barebones-cms-extensions (PHP, 1 star) - A list of available plugins, language packs, and other extensions for Barebones CMS. Fully automated repository updated daily.
80. barebones-cms-shortcode-bb_syntaxhighlight (CSS, 1 star) - Barebones CMS 1.x Syntax Highlighter shortcode | NOT COMPATIBLE WITH Barebones CMS 2.0!
81. php-flexforms-extras (JavaScript, 1 star) - FlexForms Extras adds the most commonly used best-of-class Javascript components to the already excellent FlexForms PHP class. MIT or LGPL.
82. barebones-cms-plugin-demo_mode (PHP, 1 star) - Barebones CMS 1.x Demo/Kiosk Mode plugin | NOT COMPATIBLE WITH Barebones CMS 2.0!
83. barebones-cms-widget-bb_page_protect (PHP, 1 star) - Barebones CMS 1.x SSO Server/Client Page Protection widget | NOT COMPATIBLE WITH Barebones CMS 2.0!
84. barebones-cms-plugin-edit_area (JavaScript, 1 star) - Barebones CMS 1.x EditArea plugin | NOT COMPATIBLE WITH Barebones CMS 2.0!
85. barebones-cms-shortcode-bb_flash (PHP, 1 star) - Barebones CMS 1.x Flash Object (SWF) shortcode | NOT COMPATIBLE WITH Barebones CMS 2.0!
86. barebones-cms-api (PHP, 1 star) - The Barebones CMS API development repository. Changes made here propagate to the Barebones CMS release distribution repository.
87. product-hashes (1 star) - Latest JSON release files containing multiple file hashes for all released CubicleSoft commercial products.
88. barebones-cms-instant-start (PHP, 1 star) - Quickly, easily, and automatically install all components of Barebones CMS in just a couple of minutes.
89. barebones-cms-docs (1 star) - The Barebones CMS documentation repository. Changes made here propagate to the Barebones CMS release distribution repository.
90. barebones-cms-widget-bb_langmap_modifier (PHP, 1 star) - Barebones CMS 1.x Language Map Modifier widget | NOT COMPATIBLE WITH Barebones CMS 2.0!
91. barebones-cms-plugin-sso_plugin (PHP, 1 star) - Barebones CMS 1.x SSO server/client integration plugin | NOT COMPATIBLE WITH Barebones CMS 2.0!
92. getiptables-windows (C++, 1 star) - Dumps information about Windows TCP/IP and UDP/IP tables (both IPv4 and IPv6) as JSON. MIT or LGPL.
93. barebones-cms-plugins-demo-site (PHP, 1 star) - The plugins used on the Barebones CMS demo site. MIT or LGPL.