• Stars: 2,400
• Rank: 19,162 (top 0.4%)
• Language: PHP
• License: MIT License
• Created: about 9 years ago
• Updated: over 1 year ago


Repository Details

An easy-to-use, powerful crawler implemented in PHP. It can execute JavaScript.

🕸 Crawl the web using PHP 🕷


This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to crawl multiple URLs concurrently.

Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, Chrome and Puppeteer are used to power this feature.

Support us

We invest a lot of resources into creating best-in-class open source packages. You can support us by buying one of our paid products.

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on our contact page. We publish all received postcards on our virtual postcard wall.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that extends the \Spatie\Crawler\CrawlObservers\CrawlObserver abstract class:

namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /**
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /**
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
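
For illustration, here is a minimal observer that writes results to the console. The class name (LoggingCrawlObserver) and the echo-based logging are assumptions made for this sketch, not part of the package:

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Hypothetical observer, used only to illustrate the extension points above.
class LoggingCrawlObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Print the status code and URL of every successfully crawled page.
        echo $response->getStatusCode().' - '.(string) $url.PHP_EOL;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        // Report pages that could not be crawled.
        echo 'Failed: '.(string) $url.' ('.$requestException->getMessage().')'.PHP_EOL;
    }
}

Crawler::create()
    ->setCrawlObserver(new LoggingCrawlObserver())
    ->startCrawling($url);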

Using multiple observers

You can set multiple observers with setCrawlObservers:

Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
     ])
    ->startCrawling($url);

Alternatively, you can add observers one by one with addCrawlObserver:

Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);

Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

Crawler::create()
    ->executeJavaScript()
    ...

In order to make it possible to get the body HTML after the JavaScript has been executed, this package depends on our Browsershot package. Browsershot uses Puppeteer under the hood. Here are some pointers on how to install it on your system.

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default, the crawler will instantiate a new Browsershot instance. If needed, you can set a custom instance using the setBrowsershot(Browsershot $browsershot) method.

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling executeJavaScript().
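
If Browsershot's guess is wrong on your machine, one option is to point a custom instance at the right binaries before handing it to the crawler. A rough sketch, assuming Node and npm live at the paths shown (adjust them for your system):

use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

// Placeholder binary paths; change these to wherever node and npm are installed.
$browsershot = (new Browsershot())
    ->setNodeBinary('/usr/local/bin/node')
    ->setNpmBinary('/usr/local/bin/npm');

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ->startCrawling($url);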

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile method. That method expects an object that extends Spatie\Crawler\CrawlProfiles\CrawlProfile (a sketch of a custom profile follows the list of built-in profiles below):

/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;

This package comes with three CrawlProfiles out of the box:

  • CrawlAllUrls: this profile will crawl all URLs on all pages, including URLs to external sites.
  • CrawlInternalUrls: this profile will only crawl the internal URLs on the pages of a host.
  • CrawlSubdomains: this profile will only crawl the internal URLs of a host and of its subdomains.
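
For example, a custom profile could skip a whole section of a site. The class name and the /admin path below are assumptions made for this sketch:

use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;
use Spatie\Crawler\Crawler;

// Hypothetical profile that skips everything under /admin.
class IgnoreAdminPagesProfile extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        // Only crawl URLs whose path does not start with /admin.
        return ! str_starts_with($url->getPath(), '/admin');
    }
}

Crawler::create()
    ->setCrawlProfile(new IgnoreAdminPagesProfile())
    ->startCrawling($url);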

Custom link extraction

You can customize how links are extracted from a page by passing a custom UrlParser to the crawler.

Crawler::create()
    ->setUrlParserClass(<class that implements \Spatie\Crawler\UrlParsers\UrlParser>::class)
    ...

By default, the LinkUrlParser is used. This parser extracts all links from the href attribute of <a> tags.

There is also a built-in SitemapUrlParser that will extract and crawl all links from a sitemap. It supports sitemap index files.

use Spatie\Crawler\UrlParsers\SitemapUrlParser;

Crawler::create()
    ->setUrlParserClass(SitemapUrlParser::class)
    ...

Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:

Crawler::create()
    ->ignoreRobots()
    ...

Robots data can come from a robots.txt file, meta tags, or response headers. More information on the spec can be found here: http://www.robotstxt.org/.

Parsing robots data is done by our package spatie/robots-txt.

Accept links with rel="nofollow" attribute

By default, the crawler will reject all links that have the rel="nofollow" attribute. It is possible to disable these checks like so:

Crawler::create()
    ->acceptNofollowLinks()
    ...

Using a custom User Agent

In order to respect robots.txt rules for a custom user agent, you can specify your own custom user agent.

Crawler::create()
    ->setUserAgent('my-agent')

You can then add a specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified as 'my-agent':

# Disallow crawling for my-agent
User-agent: my-agent
Disallow: /

Setting the number of concurrent requests

To improve the speed of the crawl, the package crawls 10 URLs concurrently by default. If you want to change that number, you can use the setConcurrency method.

Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one

Defining Crawl Limits

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with limitations such as a serverless environment.

The crawl behavior can be controlled with the following two options:

  • Total Crawl Limit (setTotalCrawlLimit): This limit defines the maximum number of URLs to crawl.
  • Current Crawl Limit (setCurrentCrawlLimit): This defines how many URLs are processed during the current crawl run.

Let's take a look at some examples to clarify the difference between these two methods.

Example 1: Using the total crawl limit

The setTotalCrawlLimit method allows you to limit the total number of URLs to crawl, no matter how often you call the crawler.

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

Example 2: Using the current crawl limit

The setCurrentCrawlLimit method will set a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages with each execution, without a total limit on the number of pages to crawl.

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 3: Combining the total and current crawl limit

Both limits can be combined to control the crawler:

$queue = <your selection/implementation of a queue>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

Example 4: Crawling across requests

You can use setCurrentCrawlLimit to break up long-running crawls. The following example demonstrates a (simplified) approach: an initial request and any number of follow-up requests that continue the crawl.

Initial Request

To start crawling across different requests, you will need to create a new queue using your selected queue driver. Start by passing the queue instance to the crawler. The crawler will fill the queue as pages are processed and new URLs are discovered. Serialize and store the queue after the crawler has finished (using the current crawl limit).

// Create a queue using your queue-driver.
$queue = <your selection/implementation of a queue>;

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

Subsequent Requests

For any following requests you will need to unserialize your original queue and pass it to the crawler:

// Unserialize queue
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);

The behavior is based on the information in the queue. The limits only work as described when the same queue instance is passed in; when a completely new queue is passed in, the limits of previous crawls, even for the same website, won't apply.

An example with more details can be found here.
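
As a rough sketch of how the pieces could fit together, the serialized queue might be stored in a file between runs. The file path and the use of the built-in ArrayCrawlQueue are assumptions for this example:

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

// Hypothetical storage location for the queue between runs.
$queueFile = __DIR__.'/crawl-queue.dat';

// Continue with the stored queue if a previous run left one behind, otherwise start fresh.
$queue = file_exists($queueFile)
    ? unserialize(file_get_contents($queueFile))
    : new ArrayCrawlQueue();

Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Persist the queue so the next run continues where this one stopped.
file_put_contents($queueFile, serialize($queue));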

Setting the maximum crawl depth

By default, the crawler continues until it has crawled every page it can find starting from the supplied URL. If you want to limit the depth of the crawl, you can use the setMaximumDepth method.

Crawler::create()
    ->setMaximumDepth(2)

Setting the maximum response size

Most HTML pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler will only use responses that are smaller than 2 MB. If, while streaming a response, it becomes larger than 2 MB, the crawler will stop streaming it and an empty response body will be assumed.

You can change the maximum response size.

// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)

Add a delay between requests

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the setDelayBetweenRequests() method to add a pause between every request. This value is expressed in milliseconds.

Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms

Limiting which content-types to parse

By default, every found page will be downloaded (up to setMaximumResponseSize() in size) and parsed for additional links. You can limit which content types should be downloaded and parsed by calling setParseableMimeTypes() with an array of allowed types.

Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])

This will prevent downloading the body of pages with other mime types, like binary files, audio/video, ... that are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in ArrayCrawlQueue.

When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the Spatie\Crawler\CrawlQueues\CrawlQueue interface. You can pass your custom crawl queue via the setCrawlQueue method on the crawler.

Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)

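For reference, the default in-memory queue can also be passed explicitly; a minimal sketch before swapping in your own implementation:

use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

Crawler::create()
    ->setCrawlQueue(new ArrayCrawlQueue())
    ->startCrawling($url);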

Change the default base URL scheme

By default, the crawler will set the base URL scheme to http if none is set. You have the ability to change that with setDefaultScheme.

Crawler::create()
    ->setDefaultScheme('https')

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Testing

First, install the Puppeteer dependency, or your tests will fail.

npm install puppeteer

To run the tests, you'll have to start the included Node-based server first in a separate terminal window.

cd tests/server
npm install
node server.js

With the server running, you can start testing.

composer test

Security

If you've found a bug regarding security, please mail [email protected] instead of using the issue tracker.

Postcardware

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards on our company website.

Credits

License

The MIT License (MIT). Please see License File for more information.

More Repositories

1. laravel-permission - Associate users with roles and permissions (PHP, 11,600 stars)
2. laravel-medialibrary - Associate files with Eloquent models (PHP, 5,427 stars)
3. laravel-backup - A package to backup your Laravel app (PHP, 5,337 stars)
4. laravel-activitylog - Log activity inside your Laravel app (PHP, 5,316 stars)
5. browsershot - Convert HTML to an image, PDF or string (PHP, 4,434 stars)
6. laravel-query-builder - Easily build Eloquent queries from API requests (PHP, 3,675 stars)
7. laravel-analytics - A Laravel package to retrieve pageviews and other data from Google Analytics (PHP, 2,948 stars)
8. image-optimizer - Easily optimize images using PHP (PHP, 2,450 stars)
9. async - Easily run code asynchronously (PHP, 2,401 stars)
10. laravel-responsecache - Speed up a Laravel app by caching the entire response (PHP, 2,248 stars)
11. data-transfer-object - Data transfer objects with batteries included (PHP, 2,220 stars)
12. laravel-translatable - Making Eloquent models translatable (PHP, 2,030 stars)
13. laravel-sitemap - Create and generate sitemaps with ease (PHP, 2,011 stars)
14. dashboard.spatie.be - The source code of dashboard.spatie.be (PHP, 1,940 stars)
15. laravel-fractal - An easy to use Fractal wrapper built for Laravel and Lumen applications (PHP, 1,845 stars)
16. package-skeleton-laravel - A skeleton repository for Spatie's Laravel Packages (PHP, 1,714 stars)
17. period - Complex period comparisons (PHP, 1,618 stars)
18. laravel-collection-macros - A set of useful Laravel collection macros (PHP, 1,602 stars)
19. laravel-newsletter - Manage Mailcoach and MailChimp newsletters in Laravel (PHP, 1,570 stars)
20. checklist-going-live - The checklist that is used when a project is going live (1,489 stars)
21. laravel-tags - Add tags and taggable behaviour to your Laravel app (PHP, 1,454 stars)
22. opening-hours - Query and format a set of opening hours (PHP, 1,340 stars)
23. schema-org - A fluent builder Schema.org types and ld+json generator (PHP, 1,337 stars)
24. eloquent-sortable - Sortable behaviour for Eloquent models (PHP, 1,268 stars)
25. laravel-cookie-consent - Make your Laravel app comply with the crazy EU cookie law (PHP, 1,268 stars)
26. laravel-data - Powerful data objects for Laravel (PHP, 1,240 stars)
27. laravel-sluggable - An opinionated package to create slugs for Eloquent models (PHP, 1,236 stars)
28. laravel-settings - Store strongly typed application settings (PHP, 1,218 stars)
29. laravel-searchable - Pragmatically search through models and other sources (PHP, 1,217 stars)
30. pdf-to-image - Convert a pdf to an image (PHP, 1,207 stars)
31. laravel-mail-preview - A mail driver to quickly preview mail (PHP, 1,171 stars)
32. once - A magic memoization function (PHP, 1,159 stars)
33. laravel-honeypot - Preventing spam submitted through forms (PHP, 1,134 stars)
34. laravel-image-optimizer - Optimize images in your Laravel app (PHP, 1,121 stars)
35. laravel-google-calendar - Manage events on a Google Calendar (PHP, 1,119 stars)
36. regex - A sane interface for php's built in preg_* functions (PHP, 1,097 stars)
37. laravel-multitenancy - Make your Laravel app usable by multiple tenants (PHP, 1,092 stars)
38. image - Manipulate images with an expressive API (PHP, 1,064 stars)
39. array-to-xml - A simple class to convert an array to xml (PHP, 1,056 stars)
40. laravel-uptime-monitor - A powerful and easy to configure uptime and ssl monitor (PHP, 1,020 stars)
41. db-dumper - Dump the contents of a database (PHP, 987 stars)
42. laravel-webhook-client - Receive webhooks in Laravel apps (PHP, 985 stars)
43. laravel-model-states - State support for models (PHP, 968 stars)
44. laravel-view-models - View models in Laravel (PHP, 963 stars)
45. simple-excel - Read and write simple Excel and CSV files (PHP, 930 stars)
46. laravel-web-tinker - Tinker in your browser (JavaScript, 925 stars)
47. laravel-webhook-server - Send webhooks from Laravel apps (PHP, 920 stars)
48. calendar-links - Generate add to calendar links for Google, iCal and other calendar systems (PHP, 904 stars)
49. laravel-db-snapshots - Quickly dump and load databases (PHP, 889 stars)
50. laravel-mix-purgecss - Zero-config Purgecss for Laravel Mix (JavaScript, 887 stars)
51. laravel-schemaless-attributes - Add schemaless attributes to Eloquent models (PHP, 880 stars)
52. blender - The Laravel template used for our CMS like projects (PHP, 879 stars)
53. fork - A lightweight solution for running code concurrently in PHP (PHP, 863 stars)
54. laravel-schedule-monitor - Monitor scheduled tasks in a Laravel app (PHP, 859 stars)
55. laravel-menu - Html menu generator for Laravel (PHP, 854 stars)
56. phpunit-watcher - A tool to automatically rerun PHPUnit tests when source code changes (PHP, 831 stars)
57. laravel-failed-job-monitor - Get notified when a queued job fails (PHP, 826 stars)
58. laravel-model-status - Easily add statuses to your models (PHP, 818 stars)
59. form-backend-validation - An easy way to validate forms using back end logic (JavaScript, 800 stars)
60. temporary-directory - A simple class to work with a temporary directory (PHP, 796 stars)
61. laravel-feed - Easily generate RSS feeds (PHP, 789 stars)
62. laravel-event-sourcing - The easiest way to get started with event sourcing in Laravel (PHP, 772 stars)
63. enum - Strongly typed enums in PHP supporting autocompletion and refactoring (PHP, 769 stars)
64. laravel-server-monitor - Don't let your servers just melt down (PHP, 769 stars)
65. laravel-package-tools - Tools for creating Laravel packages (PHP, 767 stars)
66. laravel-tail - An artisan command to tail your application logs (PHP, 726 stars)
67. valuestore - Easily store some values (PHP, 722 stars)
68. laravel-health - Check the health of your Laravel app (PHP, 719 stars)
69. geocoder - Geocode addresses to coordinates (PHP, 709 stars)
70. pdf-to-text - Extract text from a pdf (PHP, 707 stars)
71. ssh - A lightweight package to execute commands over an SSH connection (PHP, 696 stars)
72. menu - Html menu generator (PHP, 688 stars)
73. laravel-url-signer - Create and validate signed URLs with a limited lifetime (PHP, 685 stars)
74. ssl-certificate - A class to validate SSL certificates (PHP, 675 stars)
75. laravel-route-attributes - Use PHP 8 attributes to register routes in a Laravel app (PHP, 674 stars)
76. laravel-validation-rules - A set of useful Laravel validation rules (PHP, 663 stars)
77. laravel-pdf - Create PDF files in Laravel apps (PHP, 661 stars)
78. url - Parse, build and manipulate URL's (PHP, 659 stars)
79. laravel-html - Painless html generation (PHP, 654 stars)
80. laravel-event-projector - Event sourcing for Artisans 📽 (PHP, 642 stars)
81. laravel-server-side-rendering - Server side rendering JavaScript in your Laravel application (PHP, 636 stars)
82. vue-tabs-component - An easy way to display tabs with Vue (JavaScript, 626 stars)
83. macroable - A trait to dynamically add methods to a class (PHP, 621 stars)
84. laravel-blade-javascript - A Blade directive to export variables to JavaScript (PHP, 618 stars)
85. laravel-onboard - A Laravel package to help track user onboarding steps (PHP, 616 stars)
86. laravel-csp - Set content security policy headers in a Laravel app (PHP, 614 stars)
87. laravel-cors - Send CORS headers in a Laravel application (PHP, 607 stars)
88. laravel-short-schedule - Schedule artisan commands to run at a sub-minute frequency (PHP, 607 stars)
89. laravel-translation-loader - Store your translations in the database or other sources (PHP, 602 stars)
90. vue-table-component - A straight to the point Vue component to display tables (JavaScript, 591 stars)
91. activitylog - A very simple activity logger to monitor the users of your website or application (PHP, 586 stars)
92. phpunit-snapshot-assertions - A way to test without writing actual test cases (PHP, 584 stars)
93. http-status-check - CLI tool to crawl a website and check HTTP status codes (PHP, 584 stars)
94. laravel-queueable-action - Queueable actions in Laravel (PHP, 584 stars)
95. ray - Debug with Ray to fix problems faster (PHP, 574 stars)
96. freek.dev - The sourcecode of freek.dev (PHP, 571 stars)
97. server-side-rendering - Server side rendering JavaScript in a PHP application (PHP, 568 stars)
98. string - String handling evolved (PHP, 558 stars)
99. laravel-http-logger - Log HTTP requests in Laravel applications (PHP, 538 stars)
100. laravel-blade-x - Use custom HTML components in your Blade views (PHP, 533 stars)