A universal web-util for PHP.

PHP Scraper: a web utility for PHP

For full documentation, visit phpscraper.de.

PHPScraper is a versatile web-utility for PHP. Its primary objective is to streamline the process of extracting information from websites, allowing you to focus on accomplishing tasks without getting caught up in the complexities of selectors, data structure preparation, and conversion.

Under the hood, it builds on several third-party packages. See composer.json for more details.

⏲️ PHPScraper Explained in 5 Minutes

Here are a few impressions of the way the library works. More examples are on the project website.

Basics: Flexible Calling as an Attribute or Method

All scraping functionality can be accessed either as a method call or as a property. For example, the title can be accessed in two ways:

// Prep
$web = new \Spekulatius\PHPScraper\PHPScraper;
$web->go('https://google.com');

// Returns "Google"
echo $web->title;

// Also returns "Google"
echo $web->title();
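How a class can offer both call styles is easiest to see in a small sketch. The following uses PHP's `__get` magic method to forward property access to a method of the same name. The `PageInfo` class is a hypothetical stand-in written for this illustration (PHP 8.0 or later); it is not PHPScraper's actual implementation, which may work differently.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in class: NOT PHPScraper's real implementation.
final class PageInfo
{
    public function __construct(private string $html)
    {
    }

    // Regular method: extracts the text of the first <title> element.
    public function title(): ?string
    {
        if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $this->html, $match) === 1) {
            return trim(html_entity_decode($match[1]));
        }

        return null;
    }

    // Magic getter: forwards property-style access to a method of the same name.
    public function __get(string $name): mixed
    {
        if (method_exists($this, $name)) {
            return $this->$name();
        }

        throw new \BadMethodCallException("Unknown property or method: {$name}");
    }
}

$page = new PageInfo('<html><head><title>Example Domain</title></head></html>');

echo $page->title, PHP_EOL;   // property-style access
echo $page->title(), PHP_EOL; // method-style access
```

Both calls run the same method, so the two styles cannot drift apart.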

🔋 Batteries included: Meta data, Links, Images, Headings, Content, Keywords, ...

Many common use cases are already covered. There are prepared extractors for various HTML tags, including their most useful attributes, and you can filter and combine them as needed. Some extractors come in a simple and a detailed version. The detailed version for links is linksWithDetails:

$web = new \Spekulatius\PHPScraper\PHPScraper;

// Contains:
// <a href="https://placekitten.com/456/500" rel="ugc">
//   <img src="https://placekitten.com/456/400">
//   <img src="https://placekitten.com/456/300">
// </a>
$web->go('https://test-pages.phpscraper.de/links/image-urls.html');

// Get the first link on the page and print the result
print_r($web->linksWithDetails[0]);
// [
//     'url' => 'https://placekitten.com/456/500',
//     'protocol' => 'https',
//     'text' => '',
//     'title' => null,
//     'target' => null,
//     'rel' => 'ugc',
//     'image' => [
//         'https://placekitten.com/456/400',
//         'https://placekitten.com/456/300'
//     ],
//     'isNofollow' => false,
//     'isUGC' => true,
//     'isSponsored' => false,
//     'isMe' => false,
//     'isNoopener' => false,
//     'isNoreferrer' => false,
// ]

If there aren't any matching elements (here links) on the page, an empty array will be returned. If a method normally returns a string it might return null. Details such as follow_redirects, etc. are optional configuration parameters (see below).
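Because the detailed results are plain arrays, filtering them and guarding against null values takes only standard PHP. The sketch below works on data shaped like the linksWithDetails output above. The helper names `filterLinksByFlag` and `linkLabel` and the second sample entry are hypothetical and exist only for this illustration.

```php
<?php

declare(strict_types=1);

/**
 * Keeps only the links whose boolean flag (e.g. 'isUGC') is true.
 * An empty input, or no match, yields an empty array.
 */
function filterLinksByFlag(array $links, string $flag): array
{
    return array_values(array_filter(
        $links,
        static fn (array $link): bool => ($link[$flag] ?? false) === true
    ));
}

/** Returns the title or text of a link, falling back to the URL when both are null or empty. */
function linkLabel(array $link): string
{
    foreach (['title', 'text'] as $key) {
        if (($link[$key] ?? '') !== '') {
            return $link[$key];
        }
    }

    return $link['url'];
}

$links = [
    // Abbreviated copy of the entry printed above.
    ['url' => 'https://placekitten.com/456/500', 'text' => '', 'title' => null, 'isNofollow' => false, 'isUGC' => true],
    // Hypothetical second entry, added only for illustration.
    ['url' => 'https://example.com/', 'text' => 'Example', 'title' => null, 'isNofollow' => false, 'isUGC' => false],
];

foreach (filterLinksByFlag($links, 'isUGC') as $link) {
    echo linkLabel($link), PHP_EOL;
}
```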

Most of the DOM should be covered by the built-in methods. A full list of methods with example code can be found on phpscraper.de. Further examples are in the tests.

Download Files

Besides processing the content on the page itself, you can download files using fetchAsset:

// Absolute URL
$csvString = $web->fetchAsset('https://test-pages.phpscraper.de/test.csv');

// Relative URL after navigation
$csvString = $web
  ->go('https://test-pages.phpscraper.de/meta/lorem-ipsum.html')
  ->fetchAsset('/test.csv');

You will only need to write the content into a file or cloud storage.
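Writing the downloaded content to disk could look like the following sketch. The `saveAsset` helper and the target path are hypothetical. A literal string stands in for the return value of fetchAsset so that the example runs without network access.

```php
<?php

declare(strict_types=1);

/**
 * Writes $content to $path, creating the parent directory if needed.
 * Returns the number of bytes written.
 */
function saveAsset(string $content, string $path): int
{
    $directory = dirname($path);

    if (!is_dir($directory) && !mkdir($directory, 0775, true) && !is_dir($directory)) {
        throw new \RuntimeException("Could not create directory: {$directory}");
    }

    // LOCK_EX avoids interleaved writes when several processes share the target.
    $bytes = file_put_contents($path, $content, LOCK_EX);

    if ($bytes === false) {
        throw new \RuntimeException("Could not write file: {$path}");
    }

    return $bytes;
}

// In a real script this string would come from $web->fetchAsset(...).
$csvString = "date,value\n1945-02-06,4.20\n1952-03-11,42\n";

$target = sys_get_temp_dir() . '/phpscraper-demo/test.csv';
$written = saveAsset($csvString, $target);

echo "Wrote {$written} bytes to {$target}", PHP_EOL;
```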

Process the RSS feeds, sitemap.xml, etc.

PHPScraper can assist in collecting feeds such as RSS feeds, sitemap.xml-entries and static search indexes. This can be useful when deciding on the next page to crawl or building up a list of pages on a website.

Here we are processing the sitemap into a set of FeedEntry-DTOs:

$sitemap = (new \Spekulatius\PHPScraper\PHPScraper)
    ->go('https://phpscraper.de')
    ->sitemap;

var_dump($sitemap);

// array(131) {
//   [0]=>
//   object(Spekulatius\PHPScraper\DataTransferObjects\FeedEntry)#165 (3) {
//     ["title"]=>
//     string(0) ""
//     ["description"]=>
//     string(0) ""
//     ["link"]=>
//     string(22) "https://phpscraper.de/"
//   }
//   [1]=>
// ...

Whenever post-processing is applied, you can fall back to the underlying *Raw-methods.
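Once the entries are available, picking the next pages to crawl is a matter of comparing their links against the pages already visited. The sketch below assumes only that each entry exposes a public `link` property, as in the dump above. `DemoFeedEntry`, `pendingLinks` and the example.com URLs are hypothetical stand-ins so the code runs without the library.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in mirroring the public fields shown in the dump above.
final class DemoFeedEntry
{
    public function __construct(
        public string $link,
        public string $title = '',
        public string $description = ''
    ) {
    }
}

/**
 * Returns the unique entry links that are not in $visited, in their original order.
 *
 * @param object[] $entries Objects with a public string property "link"
 * @param string[] $visited URLs that were crawled already
 */
function pendingLinks(array $entries, array $visited): array
{
    $links = array_map(static fn (object $entry): string => $entry->link, $entries);

    return array_values(array_diff(array_unique($links), $visited));
}

$entries = [
    new DemoFeedEntry('https://example.com/'),
    new DemoFeedEntry('https://example.com/a'),
    new DemoFeedEntry('https://example.com/a'), // duplicate, reported only once
    new DemoFeedEntry('https://example.com/b'),
];

print_r(pendingLinks($entries, ['https://example.com/']));
```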

Process CSV-, XML- and JSON files and URLs

PHPScraper comes out of the box with file / URL processing methods for CSV-, XML- and JSON:

  • parseJson
  • parseXml
  • parseCsv
  • parseCsvWithHeader (generates an associative array using the first row as keys)

Each method can process both strings as well as URLs:

// Parse JSON into an array (the input is a JSON list with one object):
$json = $web->parseJson('[{"title": "PHP Scraper: a web utility for PHP", "url": "https://phpscraper.de"}]');
// [
//     [
//         'title' => 'PHP Scraper: a web utility for PHP',
//         'url' => 'https://phpscraper.de'
//     ]
// ]

// Fetch and parse CSV into a simple array:
$csv = $web->parseCsv('https://test-pages.phpscraper.de/test.csv');
// [
//     ['date', 'value'],
//     ['1945-02-06', 4.20],
//     ['1952-03-11', 42],
// ]

// Fetch and parse CSV with the first row as header into an associative array structure:
$csv = $web->parseCsvWithHeader('https://test-pages.phpscraper.de/test.csv');
// [
//     ['date' => '1945-02-06', 'value' => 4.20],
//     ['date' => '1952-03-11', 'value' => 42],
// ]

Additional CSV parsing parameters such as separator, enclosure and escape are possible.
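To show what separator, enclosure and escape control, here is a minimal sketch built on PHP's native str_getcsv. It deliberately does not call PHPScraper, since the exact parameter order of its CSV methods is not shown here. The `csvWithHeader` helper is hypothetical. Unlike the example output above, every value stays a string because no type casting is applied.

```php
<?php

declare(strict_types=1);

/**
 * Parses a CSV string and uses the first row as keys for all following rows.
 * Limitation of this sketch: quoted fields must not contain line breaks.
 */
function csvWithHeader(
    string $csv,
    string $separator = ',',
    string $enclosure = '"',
    string $escape = '\\'
): array {
    $lines = preg_split('/\R/', trim($csv));

    $rows = array_map(
        static fn (string $line): array => str_getcsv($line, $separator, $enclosure, $escape),
        $lines
    );

    $header = array_shift($rows);

    // array_combine() throws a ValueError if a row has a different column count.
    return array_map(
        static fn (array $row): array => array_combine($header, $row),
        $rows
    );
}

// Semicolon-separated variant of the sample data used above.
$csv = "date;value\n1945-02-06;4.20\n1952-03-11;42\n";

print_r(csvWithHeader($csv, ';'));
```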

There is more!

There are plenty of examples on the PHPScraper website and in the tests.

Check out playground.php if you prefer learning by doing. You can get it up and running with:

$ git clone git@github.com:spekulatius/PHPScraper.git && cd PHPScraper && composer update

💪 Roadmap

The future development is organized into milestones. Releases follow semver.

v1: Building the first stable version

  • Improve documentation and examples.
  • Organize code better (move websites into separate repos, etc.)
  • Add support for feeds and some typical file types.

v2: Service Upgrade

v3: Expand the functionality and cover more 'types'

  • Expand to parse a wider range of types, elements, embeds, etc.
  • Improve performance with caching and concurrent fetching of assets
  • Minor improvements for parsing methods

v4: Expand to provide more guidance on building custom scrapers on top of PHPScraper

TBC.

😍 Sponsors

PHPScraper is sponsored by:

With your support, PHPScraper can become the PHP Swiss Army knife for the web. If you find PHPScraper useful for your work, please consider a sponsorship or donation. Thank you 💪

⚙️ Configuration (optional)

If needed, you can use the following configuration options:

User Agent

You can set the browser agent using setConfig:

$web->setConfig([
  'agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0'
]);

It defaults to Mozilla/5.0 (compatible; PHP Scraper/1.x; +https://phpscraper.de).

Proxy Support

You can configure proxy support with setConfig:

$web->setConfig(['proxy' => 'http://user:password@127.0.0.1:3128']);

Timeout

You can set the timeout using setConfig:

$web->setConfig(['timeout' => 15]);

Setting the timeout to zero will disable it.

Disabling SSL

Although it is not recommended, you might need to disable SSL checks. You can do so using:

$web->setConfig(['disable_ssl' => true]);

You can call setConfig multiple times. It stores the config and merges each call with the previous settings. Keep this in mind in the unlikely case that you need to unset a value: omitting a key from a later call does not remove it.
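The store-and-merge behaviour can be pictured with the sketch below. `DemoConfigHolder` is a hypothetical stand-in, and using array_merge is an assumption made for illustration. PHPScraper's internal merging may differ in detail.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in: illustrates "store and merge", not PHPScraper's real code.
final class DemoConfigHolder
{
    private array $config = [];

    public function setConfig(array $options): self
    {
        // Later values win; keys that are not mentioned keep their earlier value.
        $this->config = array_merge($this->config, $options);

        return $this;
    }

    public function getConfig(): array
    {
        return $this->config;
    }
}

$holder = (new DemoConfigHolder)
    ->setConfig(['timeout' => 15, 'disable_ssl' => true])
    ->setConfig(['timeout' => 30]);

print_r($holder->getConfig());
```

Under this assumption, the second call changes only the timeout and disable_ssl stays true. To revert a setting you would pass an explicit value, such as 'disable_ssl' => false, instead of leaving the key out.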

🚀 Installation with Composer

composer require spekulatius/phpscraper

After the installation, the package will be picked up by the Composer autoloader. If you are using a common PHP application or framework such as Laravel or Symfony you can start scraping now 🚀

Otherwise, or if you are building a standalone scraper, include the autoloader from vendor/ at the top of your file:

<?php

require __DIR__ . '/vendor/autoload.php';

// ...

You can now use any of the examples from the documentation website or from the tests/ folder.

Please consider supporting PHPScraper with a star or sponsorship:

composer thanks

Thank you 💪

Testing

The library comes with a PHPUnit test suite. To run the tests, run the following command from the project folder:

composer test

You can find the tests in the tests/ folder. The test pages are publicly available.

MISC: Issues, Ideas, Contributing, CHANGELOG, UPGRADING, LICENSE
