• Stars
    12,707
  • Rank 2,433 (Top 0.05 %)
  • Language
    TypeScript
  • License
    Apache License 2.0
  • Created almost 8 years ago
  • Updated 20 days ago

Repository Details

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee
A web scraping and browser automation library

ℹ️ Crawlee is the successor to Apify SDK. 🎉 Fully rewritten in TypeScript for a better developer experience, and with even more powerful anti-blocking features. The interface is almost the same as Apify SDK so upgrading is a breeze. Read the upgrading guide to learn about the changes. ℹ️

Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.

Crawlee is available as the crawlee NPM package.

👉 View full documentation, guides and examples on the Crawlee project website 👈

Installation

We recommend visiting the Introduction tutorial in Crawlee documentation for more information.

Crawlee requires Node.js 16 or higher.

With Crawlee CLI

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler
cd my-crawler
npm start

Manual installation

If you prefer to add Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, you also need to install Playwright, which is not bundled with Crawlee to keep the install size down.

npm install crawlee playwright

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

By default, Crawlee stores data in ./storage in the current working directory. You can override this directory via Crawlee configuration. For details, see the Configuration guide, Request storage and Result storage.
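
To make the default storage layout concrete, here is a minimal sketch of how the dataset path mentioned above could be resolved. It does not call Crawlee. The helper name resolveDatasetDir is hypothetical, and using the CRAWLEE_STORAGE_DIR environment variable as the override is an assumption that should be checked against the Configuration guide. Paths are assumed to be POSIX-style.

```typescript
// Illustrative helper, not part of Crawlee.
// Assumption: CRAWLEE_STORAGE_DIR overrides the default ./storage directory.
function resolveDatasetDir(
    env: Record<string, string | undefined>,
    cwd: string,
    datasetName: string = 'default',
): string {
    // Fall back to <cwd>/storage when the variable is unset or empty.
    const storageDir = env['CRAWLEE_STORAGE_DIR'] || `${cwd}/storage`;
    // Drop trailing slashes so the joined path has no double separators.
    const trimmed = storageDir.replace(/\/+$/, '');
    return `${trimmed}/datasets/${datasetName}`;
}
```

In a Node.js script this could be called as resolveDatasetDir(process.env, process.cwd()) to see where Dataset.pushData() results from the example above are expected to land.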

🛠 Features

  • Single interface for HTTP and headless browser crawling
  • Persistent queue for URLs to crawl (breadth & depth first)
  • Pluggable storage of both tabular data and files
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling and retries
  • Dockerfiles ready to deploy
  • Written in TypeScript with generics
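
The "breadth & depth first" item above can be illustrated with a small in-memory queue. This is a sketch of the general idea only, not Crawlee's RequestQueue. The class SketchQueue, the link graph and the crawlOrder helper are hypothetical. The forefront flag is named after the option Crawlee accepts when adding requests, but how closely the real queue follows this logic is an assumption.

```typescript
// Illustrative in-memory queue; real crawlers persist this state to storage.
type QueuedRequest = { url: string };

class SketchQueue {
    private readonly pending: QueuedRequest[] = [];
    private readonly seen = new Set<string>();

    // Returns false when the URL was already enqueued (simple de-duplication).
    add(url: string, options: { forefront?: boolean } = {}): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        if (options.forefront) {
            this.pending.unshift({ url }); // processed next: depth-first order
        } else {
            this.pending.push({ url }); // processed last: breadth-first order
        }
        return true;
    }

    next(): QueuedRequest | undefined {
        return this.pending.shift();
    }
}

// Hypothetical link graph standing in for real pages.
const links: Record<string, string[]> = {
    '/': ['/a', '/b'],
    '/a': [],
    '/b': ['/b/1'],
    '/b/1': [],
};

// Crawl the graph from '/' and return the order in which pages were visited.
function crawlOrder(forefront: boolean): string[] {
    const queue = new SketchQueue();
    queue.add('/');
    const visited: string[] = [];
    for (let req = queue.next(); req !== undefined; req = queue.next()) {
        visited.push(req.url);
        for (const link of links[req.url] ?? []) {
            queue.add(link, { forefront });
        }
    }
    return visited;
}
```

Calling crawlOrder(false) visits every page at one depth before going deeper, while crawlOrder(true) follows the most recently discovered link first. The same queue supports both orders because only the insertion position changes.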

👾 HTTP crawling

  • Zero config HTTP2 support, even for proxies
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Integrated fast HTML parsers: Cheerio and JSDOM
  • Yes, you can scrape JSON APIs as well

💻 Real browser crawling

  • JavaScript rendering and screenshots
  • Headless and headful support
  • Zero-config generation of human-like fingerprints
  • Automatic browser management
  • Use Playwright and Puppeteer with the same interface
  • Chrome, Firefox, WebKit and many others

Usage on the Apify platform

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud. Visit the Apify SDK website to learn more about deploying Crawlee to the Apify platform.

Support

If you find a bug or other problem with Crawlee, please submit an issue on GitHub. For questions, ask on Stack Overflow or in GitHub Discussions, or join our Discord server.

Contributing

Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.

More Repositories

1. proxy-chain (JavaScript, 798 stars): Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining.
2. fingerprint-suite (TypeScript, 760 stars): Browser fingerprinting tools for anonymizing your scrapers. Developed by Apify.
3. got-scraping (TypeScript, 417 stars): HTTP client made for scraping based on got.
4. actor-page-analyzer (JavaScript, 149 stars): Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSON-LD metadata, analyzes AJAX requests, etc.
5. apify-cli (TypeScript, 115 stars): Apify command-line interface helps you create, develop, build and run Apify actors, and manage the Apify cloud platform.
6. actor-scraper (JavaScript, 114 stars): House of Apify Scrapers. Generic scraping actors with a simple UI to handle complex web crawling and scraping use cases.
7. apify-sdk-python (Python, 110 stars): The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
8. apify-sdk-js (TypeScript, 108 stars): Apify SDK monorepo
9. browser-pool (TypeScript, 87 stars): A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
10. apify-actor-docker (Dockerfile, 64 stars): Base Docker images for Apify actors.
11. fingerprint-generator (TypeScript, 63 stars): Generates realistic browser fingerprints
12. fingerprint-injector (TypeScript, 62 stars): Home of fingerprint injector.
13. header-generator (TypeScript, 62 stars): NodeJs package for generating browser-like headers.
14. apify-client-js (TypeScript, 61 stars): Apify API client for JavaScript / Node.js.
15. covid-19 (JavaScript, 45 stars): Open APIs with statistics about Covid-19
16. apify-client-python (Python, 42 stars): Apify API client for Python
17. apify-docs (API Blueprint, 22 stars): This project is the home of Apify's documentation.
18. xlsx-stream (JavaScript, 22 stars): JavaScript / Node.js library to stream data into an XLSX file
19. apify-ts (TypeScript, 21 stars): Crawlee dev repo
20. got-cjs (JavaScript, 21 stars): An action to release a CommonJS version of the popular library got, which is soon to be available only in an ESM format.
21. actor-templates (Python, 21 stars): This project is the 🏠 home of Apify actor template projects to help users quickly get started.
22. actor-content-checker (JavaScript, 18 stars): You can use this act to monitor any page's content and get a notification when content changes.
23. actor-web-automation-agent (TypeScript, 16 stars): This is the experimental version of Web Automation Agent. The agent uses natural language instructions to browse the web and extract data.
24. actor-quick-start (Dockerfile, 15 stars): Contains a boilerplate of an Apify actor to help you get started quickly build your own actors.
25. devtools-server (JavaScript, 13 stars): Runs a simple server that allows you to connect to Chrome DevTools running on dynamic hosts, not only localhost.
26. apify-shared-js (TypeScript, 11 stars): Utilities and constants shared across Apify projects.
27. better-sqlite3-with-prebuilds (10 stars): Better SQLite prebuild & publish action
28. chat-with-a-website (Python, 9 stars): A simple app that lets you chat with a given website.
29. actor-scrapy-executor (Python, 9 stars): Apify actor to run web spiders written in Python in the Scrapy library
30. apify-zapier-integration (JavaScript, 8 stars): Apify integration for Zapier
31. homebrew-tap (Ruby, 7 stars): A Homebrew tap for Apify tools
32. workflows (6 stars): Apify's reusable github workflows
33. actor-legacy-phantomjs-crawler (JavaScript, 6 stars): The actor implements the legacy Apify Crawler product. It uses PhantomJS headless browser to recursively crawl websites and extract data from them using a piece of JavaScript code.
34. idcac (JavaScript, 6 stars): I Don't Care About Cookies extension compiled for use with Playwright/Puppeteer
35. act-crawler-results-to-s3 (JavaScript, 6 stars): Apify actor to upload crawler results to AWS S3.
36. super-scraper (TypeScript, 5 stars): Generic REST API for scraping websites. Drop-in replacement for ScrapingBee, ScrapingAnt, and ScraperAPI services. And it is open-source!
37. actor-example-python (Python, 5 stars): Example Apify Actor written in Python
38. browser-headers-generator (JavaScript, 4 stars): Package generating randomized browser-like headers.
39. input-schema-editor-react (JavaScript, 4 stars): Apify input schema editor written in React.js
40. act-crawl-url-list (JavaScript, 4 stars): Apify actor to crawl a list of URLs
41. apify-storage-local-js (TypeScript, 3 stars): Local emulation of the apify-client NPM package, which enables local use of Apify SDK.
42. actor-imagediff (JavaScript, 3 stars): Returns an image containing difference of two given images.
43. apify-web-covid-19 (JavaScript, 3 stars): A list of public COVID-19 APIs to be rendered on https://apify.com/covid-19
44. http-request (JavaScript, 3 stars): A HTTP request library for Node.js, with a common-sense API, support for Brotli compression and without bugs in "request" NPM package
45. aidevworld2023 (JavaScript, 3 stars): How to get clean web data for chatbots and LLMs slides and supporting materials.
46. crawlee-parallel-scraping-example (TypeScript, 3 stars): An example repository showcasing how you can scrape in parallel using one request queue
47. actor-example-php (PHP, 2 stars): Example of Apify actor using PHP
48. apify-php-tutorial (PHP, 2 stars)
49. actor-example-proxy-intercept-request (JavaScript, 2 stars): Example: Intercept requests from https connection using "Man in the middle" proxy solution.
50. apify-eslint-config (JavaScript, 2 stars): Apify ESLint preset to be shared between projects
51. slack-messages-action (TypeScript, 2 stars): It wraps up messages sending from Apify GitHub workflows into Slack.
52. scraping-tools-js (JavaScript, 2 stars): A library of utility functions that make scraping, data extraction and usage of headless browsers easier and faster.
53. actor-beautifulsoup-scraper (Python, 2 stars)
54. apify-tsconfig (Shell, 1 star): TypeScript configuration shared across projects in Apify.
55. generative-bayesian-network (JavaScript, 1 star)
56. waw-file-specification (1 star): Contains specification of the Web Automation Workflow (WAW) file.
57. playwright-test-actor (TypeScript, 1 star): Source code for the Playwright Test public actor.
58. apify-sdk-v2 (JavaScript, 1 star): Snapshot of Apify SDK v2 + sdk.apify.com website. This project is no longer maintained. See the https://github.com/apify/apify-sdk-js repo instead!
59. actor-algolia-website-indexer (JavaScript, 1 star): Apify actor that crawls website and indexes selected web pages to Algolia index. It's used to power the search on https://help.apify.com
60. apify-eslint-config-ts (JavaScript, 1 star): Typescript ESLint configuration shared across projects in Apify.
61. actor-proxy-test (JavaScript, 1 star)
62. appmixer-components (JavaScript, 1 star): Home of all the future Appmixer components on the Apify platform.
63. actor-example-secret-input (Dockerfile, 1 star): Example actor showcasing the secret input fields
64. actor-scrapy-books-example (Python, 1 star): Example of Python Scrapy project. It scrapes book data from https://books.toscrape.com/.
65. komparz (JavaScript, 1 star): Special, yet insignificant actors
66. actor-crawler-cheerio (JavaScript, 1 star): DEPRECATED: An actor that crawls websites and parses HTML pages using Cheerio library. Supports recursive crawling as well as URL lists.
67. actor-crawler-puppeteer (JavaScript, 1 star): DEPRECATED: An Apify actor that enables crawling of websites using headless Chrome and Puppeteer. The actor is highly customizable and supports recursive crawling of websites as well as lists of URLs.
68. scrapy-migrator (Python, 1 star): A standalone POC script for wrapping Scrapy projects with Apify middleware.