  • Stars
    14,725
  • Rank 2,016 (Top 0.04 %)
  • Language
    TypeScript
  • License
    Apache License 2.0
  • Created over 8 years ago
  • Updated 3 months ago

Repository Details

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Crawlee
A web scraping and browser automation library

ℹ️ Crawlee is the successor to Apify SDK. 🎉 Fully rewritten in TypeScript for a better developer experience, and with even more powerful anti-blocking features. The interface is almost the same as Apify SDK so upgrading is a breeze. Read the upgrading guide to learn about the changes. ℹ️

Crawlee covers your crawling and scraping end-to-end and helps you build reliable scrapers. Fast.

Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs.

Crawlee is available as the crawlee NPM package.

👉 View full documentation, guides and examples on the Crawlee project website 👈

Installation

We recommend visiting the Introduction tutorial in the Crawlee documentation for more information.

Crawlee requires Node.js 16 or higher.

With Crawlee CLI

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the Getting started example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

npx crawlee create my-crawler
cd my-crawler
npm start

Manual installation

If you prefer adding Crawlee to your own project, try the example below. Because it uses PlaywrightCrawler, we also need to install Playwright. It's not bundled with Crawlee, to reduce install size.

npm install crawlee playwright

import { PlaywrightCrawler, Dataset } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
    // Uncomment this option to see the browser window.
    // headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

By default, Crawlee stores data in ./storage in the current working directory. You can override this directory via Crawlee configuration. For details, see the Configuration guide, Request storage, and Result storage.
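
As a rough illustration of the default location and its override, the sketch below resolves the dataset directory from an environment-like object. It assumes the override is supplied through the `CRAWLEE_STORAGE_DIR` environment variable (check the Configuration guide for the options your version supports). `resolveStorageDir` and `datasetDir` are hypothetical helpers written for this example, not part of the Crawlee API.

```typescript
// Hypothetical helpers for illustration only; they are not part of Crawlee.
type Env = Record<string, string | undefined>;

// Returns the override when it is set and non-empty,
// otherwise the documented default "./storage".
function resolveStorageDir(env: Env): string {
    const override = env['CRAWLEE_STORAGE_DIR']; // assumed override variable
    return override && override.length > 0 ? override : './storage';
}

// Mirrors the layout mentioned in the example above:
// <storage>/datasets/<dataset name>.
function datasetDir(env: Env, datasetName = 'default'): string {
    const base = resolveStorageDir(env).replace(/\/+$/, ''); // drop trailing slashes
    return `${base}/datasets/${datasetName}`;
}
```

In a Node.js script you could call `datasetDir(process.env)`. With no override set, this yields `./storage/datasets/default`, the path named in the crawler example.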

🛠 Features

  • Single interface for HTTP and headless browser crawling
  • Persistent queue for URLs to crawl (breadth & depth first)
  • Pluggable storage of both tabular data and files
  • Automatic scaling with available system resources
  • Integrated proxy rotation and session management
  • Lifecycles customizable with hooks
  • CLI to bootstrap your projects
  • Configurable routing, error handling and retries
  • Dockerfiles ready to deploy
  • Written in TypeScript with generics
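
To make the "breadth & depth first" queue item above more concrete, here is a minimal in-memory sketch of how one de-duplicating URL queue can give either order, depending on whether new links go to the back or the front. It is not Crawlee's implementation: the real request queue is persistent and has a richer API. `MiniRequestQueue`, `crawlOrder` and the `site` link graph are hypothetical names made up for this example.

```typescript
// Illustrative in-memory queue; all names here are hypothetical.
class MiniRequestQueue {
    private pending: string[] = [];
    private seen = new Set<string>();

    // Adds a URL once. `forefront: true` puts it at the head of the queue,
    // which yields depth-first order; the default (tail) yields breadth-first.
    add(url: string, options: { forefront?: boolean } = {}): boolean {
        if (this.seen.has(url)) return false; // skip URLs that were already added
        this.seen.add(url);
        if (options.forefront) this.pending.unshift(url);
        else this.pending.push(url);
        return true;
    }

    // Returns the next URL to process, or undefined when the queue is empty.
    next(): string | undefined {
        return this.pending.shift();
    }
}

// Walks a hypothetical link graph and returns the order in which pages are visited.
function crawlOrder(
    links: Record<string, string[]>,
    start: string,
    depthFirst: boolean,
): string[] {
    const queue = new MiniRequestQueue();
    const visited: string[] = [];
    queue.add(start);
    let url = queue.next();
    while (url !== undefined) {
        visited.push(url);
        const children = links[url] ?? [];
        // Reverse when pushing to the front so children keep their page order.
        const ordered = depthFirst ? [...children].reverse() : children;
        for (const child of ordered) queue.add(child, { forefront: depthFirst });
        url = queue.next();
    }
    return visited;
}

const site: Record<string, string[]> = {
    '/': ['/a', '/b'],
    '/a': ['/a/1'],
    '/b': ['/b/1'],
};
console.log(crawlOrder(site, '/', false).join(' ')); // prints "/ /a /b /a/1 /b/1"
console.log(crawlOrder(site, '/', true).join(' '));  // prints "/ /a /a/1 /b /b/1"
```

The only difference between the two traversals is where newly discovered links are inserted, so a single queue can serve both strategies.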

👾 HTTP crawling

  • Zero config HTTP2 support, even for proxies
  • Automatic generation of browser-like headers
  • Replication of browser TLS fingerprints
  • Integrated fast HTML parsers: Cheerio and JSDOM
  • Yes, you can scrape JSON APIs as well

💻 Real browser crawling

  • JavaScript rendering and screenshots
  • Headless and headful support
  • Zero-config generation of human-like fingerprints
  • Automatic browser management
  • Use Playwright and Puppeteer with the same interface
  • Chrome, Firefox, WebKit and many others

Usage on the Apify platform

Crawlee is open-source and runs anywhere, but since it's developed by Apify, it's easy to set up on the Apify platform and run in the cloud. Visit the Apify SDK website to learn more about deploying Crawlee to the Apify platform.

Support

If you find any bug or issue with Crawlee, please submit an issue on GitHub. For questions, you can ask on Stack Overflow, in GitHub Discussions or you can join our Discord server.

Contributing

Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE.md file for details.

More Repositories

1

crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Python
3,734
star
2

fingerprint-suite

Browser fingerprinting tools for anonymizing your scrapers. Developed by Apify.
TypeScript
875
star
3

proxy-chain

Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining.
JavaScript
825
star
4

got-scraping

HTTP client made for scraping based on got.
TypeScript
490
star
5

actor-page-analyzer

Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSON-LD metadata, analyzes AJAX requests, etc.
JavaScript
149
star
6

apify-cli

Apify command-line interface helps you create, develop, build and run Apify actors, and manage the Apify cloud platform.
TypeScript
119
star
7

apify-sdk-js

Apify SDK monorepo
TypeScript
117
star
8

apify-sdk-python

The Apify SDK for Python is the official library for creating Apify Actors in Python. It provides useful features like actor lifecycle management, local storage emulation, and actor event handling.
Python
115
star
9

actor-scraper

House of Apify Scrapers. Generic scraping actors with a simple UI to handle complex web crawling and scraping use cases.
JavaScript
115
star
10

browser-pool

A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
TypeScript
87
star
11

fingerprint-generator

Generates realistic browser fingerprints
TypeScript
67
star
12

apify-actor-docker

Base Docker images for Apify actors.
Dockerfile
67
star
13

apify-client-js

Apify API client for JavaScript / Node.js.
JavaScript
63
star
14

fingerprint-injector

Home of fingerprint injector.
TypeScript
63
star
15

header-generator

Node.js package for generating browser-like headers.
TypeScript
63
star
16

covid-19

Open APIs with statistics about Covid-19
JavaScript
46
star
17

apify-client-python

Apify API client for Python
Python
43
star
18

apify-docs

This project is the home of Apify's documentation.
API Blueprint
24
star
19

actor-templates

This project is the 🏠 home of Apify actor template projects to help users quickly get started.
Python
24
star
20

xlsx-stream

JavaScript / Node.js library to stream data into an XLSX file
JavaScript
23
star
21

apify-ts

Crawlee dev repo
TypeScript
22
star
22

got-cjs

An action to release a CommonJS version of the popular library got, which is soon to be available only in an ESM format.
JavaScript
21
star
23

actor-web-automation-agent

This is the experimental version of Web Automation Agent. The agent uses natural language instructions to browse the web and extract data.
TypeScript
19
star
24

actor-content-checker

You can use this act to monitor any page's content and get a notification when content changes.
JavaScript
17
star
25

super-scraper

Generic REST API for scraping websites. Drop-in replacement for ScrapingBee, ScrapingAnt, and ScraperAPI services. And it is open-source!
TypeScript
16
star
26

devtools-server

Runs a simple server that allows you to connect to Chrome DevTools running on dynamic hosts, not only localhost.
JavaScript
15
star
27

actor-quick-start

Contains a boilerplate Apify actor to help you quickly get started building your own actors.
Dockerfile
15
star
28

apify-shared-js

Utilities and constants shared across Apify projects.
TypeScript
12
star
29

better-sqlite3-with-prebuilds

Better SQLite prebuild & publish action
10
star
30

chat-with-a-website

A simple app that lets you chat with a given website.
Python
9
star
31

actor-scrapy-executor

Apify actor to run web spiders written in Python in the Scrapy library
Python
9
star
32

apify-zapier-integration

Apify integration for Zapier
JavaScript
8
star
33

idcac

I Don't Care About Cookies extension compiled for use with Playwright/Puppeteer
JavaScript
8
star
34

homebrew-tap

A Homebrew tap for Apify tools
Ruby
7
star
35

workflows

Apify's reusable GitHub workflows
6
star
36

actor-legacy-phantomjs-crawler

The actor implements the legacy Apify Crawler product. It uses PhantomJS headless browser to recursively crawl websites and extract data from them using a piece of JavaScript code.
JavaScript
6
star
37

act-crawler-results-to-s3

Apify actor to upload crawler results to AWS S3.
JavaScript
6
star
38

actor-example-python

Example Apify Actor written in Python
Python
5
star
39

browser-headers-generator

Package generating randomized browser-like headers.
JavaScript
4
star
40

input-schema-editor-react

Apify input schema editor written in React.js
JavaScript
4
star
41

act-crawl-url-list

Apify actor to crawl a list of URLs
JavaScript
4
star
42

crawlee-parallel-scraping-example

An example repository showcasing how you can scrape in parallel using one request queue
TypeScript
4
star
43

actor-imagediff

Returns an image containing difference of two given images.
JavaScript
3
star
44

apify-web-covid-19

A list of public COVID-19 APIs to be rendered on https://apify.com/covid-19
JavaScript
3
star
45

actor-example-proxy-intercept-request

Example: Intercept requests from https connection using "Man in the middle" proxy solution.
JavaScript
3
star
46

apify-storage-local-js

Local emulation of the apify-client NPM package, which enables local use of Apify SDK.
TypeScript
3
star
47

aidevworld2023

How to get clean web data for chatbots and LLMs slides and supporting materials.
JavaScript
3
star
48

actor-example-php

Example of Apify actor using PHP
PHP
2
star
49

apify-php-tutorial

PHP
2
star
50

apify-eslint-config

Apify ESLint preset to be shared between projects
JavaScript
2
star
51

http-request

An HTTP request library for Node.js, with a common-sense API, support for Brotli compression, and without the bugs of the "request" NPM package
JavaScript
2
star
52

actor-vector-database-integrations

Transfer data from Apify Actors to vector databases (Chroma, Milvus, Pinecone, PostgreSQL (PG-Vector), Qdrant, and Weaviate)
Python
2
star
53

slack-messages-action

It wraps up messages sending from Apify GitHub workflows into Slack.
TypeScript
2
star
54

scraping-tools-js

A library of utility functions that make scraping, data extraction and usage of headless browsers easier and faster.
JavaScript
2
star
55

actor-beautifulsoup-scraper

Python
2
star
56

apify-tsconfig

TypeScript configuration shared across projects in Apify.
Shell
1
star
57

generative-bayesian-network

JavaScript
1
star
58

waw-file-specification

Contains specification of the Web Automation Workflow (WAW) file.
1
star
59

playwright-test-actor

Source code for the Playwright Test public actor.
TypeScript
1
star
60

actor-algolia-website-indexer

Apify actor that crawls a website and indexes selected web pages to an Algolia index. It's used to power the search on https://help.apify.com
JavaScript
1
star
61

apify-eslint-config-ts

TypeScript ESLint configuration shared across projects in Apify.
JavaScript
1
star
62

actor-proxy-test

JavaScript
1
star
63

appmixer-components

Home of all the future Appmixer components on the Apify platform.
JavaScript
1
star
64

actor-example-secret-input

Example actor showcasing the secret input fields
Dockerfile
1
star
65

actor-scrapy-books-example

Example of Python Scrapy project. It scrapes book data from https://books.toscrape.com/.
Python
1
star
66

komparz

Special, yet insignificant actors
JavaScript
1
star
67

apify-sdk-v2

Snapshot of Apify SDK v2 + sdk.apify.com website. This project is no longer maintained. See the https://github.com/apify/apify-sdk-js repo instead!
JavaScript
1
star
68

actor-crawler-cheerio

DEPRECATED: An actor that crawls websites and parses HTML pages using Cheerio library. Supports recursive crawling as well as URL lists.
JavaScript
1
star
69

actor-crawler-puppeteer

DEPRECATED: An Apify actor that enables crawling of websites using headless Chrome and Puppeteer. The actor is highly customizable and supports recursive crawling of websites as well as lists of URLs.
JavaScript
1
star
70

actor-monorepo-example

An example repository with multiple Apify Actors sharing code between each other.
JavaScript
1
star
71

apify-haystack

The official integration for Apify and Haystack 2.0
Python
1
star
72

openapi

An OpenAPI specification for the Apify API.
JavaScript
1
star
73

scrapy-migrator

A standalone POC script for wrapping Scrapy projects with Apify middleware.
Python
1
star