  • Stars: 490
  • Rank: 89,305 (Top 2 %)
  • Language: TypeScript
  • Created over 3 years ago
  • Updated 4 months ago

Repository Details

HTTP client made for scraping based on got.

Got Scraping

Got Scraping is a small but powerful got extension for sending browser-like requests out of the box. Blending in with regular website traffic is essential in web scraping.

Installation

$ npm install got-scraping

Note:

  • Node.js >=15.10.0 is required due to instability of HTTP/2 support in lower versions.

API

The Got Scraping package is built using the got.extend(...) functionality, so it supports all the features Got has.

Interested in what's under the hood?

const { gotScraping } = require('got-scraping');

gotScraping
    .get('https://apify.com')
    .then(({ body }) => console.log(body));

options

proxyUrl

Type: string

URL of the HTTP or HTTPS based proxy. HTTP/2 proxies are supported as well.

const { gotScraping } = require('got-scraping');

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://usernamed:[email protected]:1234',
    })
    .then(({ body }) => console.log(body))

useHeaderGenerator

Type: boolean
Default: true

Whether to generate browser-like headers.

headerGeneratorOptions

See the HeaderGeneratorOptions docs.

const response = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89
            }
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    }
});

sessionToken

A non-primitive unique object which describes the current session. By default, it's undefined, so new headers will be generated every time. Headers generated with the same sessionToken never change.
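
The requirement that sessionToken be a non-primitive object suggests that headers are cached by object identity. The sketch below illustrates that idea with a WeakMap. The names headersForSession and generateHeaders are hypothetical, and this is not necessarily how Got Scraping implements it internally.

```javascript
// Hypothetical stand-in for a real header generator: returns a fresh,
// distinguishable header set on every call.
let counter = 0;
function generateHeaders() {
    counter += 1;
    return { 'user-agent': `example-agent/${counter}` };
}

// Cache keyed by object identity. WeakMap keys must be objects, which is
// one reason a session token would need to be non-primitive.
const sessionHeaders = new WeakMap();

// Hypothetical helper: returns cached headers for a token, or new headers
// on every call when no token is given.
function headersForSession(sessionToken) {
    if (sessionToken === undefined) {
        return generateHeaders();
    }
    if (!sessionHeaders.has(sessionToken)) {
        sessionHeaders.set(sessionToken, generateHeaders());
    }
    return sessionHeaders.get(sessionToken);
}

const tokenA = {};
const tokenB = {};
console.log(headersForSession(tokenA) === headersForSession(tokenA)); // true
console.log(headersForSession(tokenA) === headersForSession(tokenB)); // false
console.log(headersForSession() === headersForSession()); // false
```

With the real package, the equivalent would be to create one object per logical session and pass it as sessionToken on every request that belongs to that session.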

Under the hood

Thanks to the included header-generator package, you can choose various browsers from different operating systems and devices. It generates all the headers automatically so you can focus on the important stuff instead.

Yet another goal is to simplify the usage of proxies. Just pass the proxyUrl option and you are set. Got Scraping automatically detects the HTTP protocol that the proxy server supports. After the connection is established, it does another ALPN negotiation for the end server. Once that is complete, Got Scraping can proceed with HTTP requests.

Using the same HTTP version that browsers do is important as well. Most modern browsers use HTTP/2, so Got Scraping uses it too. Fortunately, this is already supported by Got: it automatically handles ALPN protocol negotiation to select the best available protocol.

HTTP/1.1 headers are always automatically formatted in Pascal-Case. However, there is an exception: x- headers are not modified in any way.
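
As a rough illustration of the rule above, the following sketch capitalizes each dash-separated part of a header name and leaves x- headers alone. The function names are hypothetical and the case-insensitive x- check is an assumption; the package's real formatting logic may differ in detail.

```javascript
// Hypothetical helper: convert one header name to Pascal-Case,
// e.g. 'user-agent' becomes 'User-Agent'.
function toPascalCaseHeader(name) {
    // Assumption: any name starting with "x-" (in any case) is left untouched.
    if (name.toLowerCase().startsWith('x-')) {
        return name;
    }
    return name
        .split('-')
        .map((part) => part.charAt(0).toUpperCase() + part.slice(1).toLowerCase())
        .join('-');
}

// Hypothetical helper: apply the conversion to every key of a headers object.
function formatHeaders(headers) {
    return Object.fromEntries(
        Object.entries(headers).map(([name, value]) => [toPascalCaseHeader(name), value]),
    );
}

console.log(formatHeaders({
    'user-agent': 'test',
    'accept-encoding': 'gzip',
    'x-requested-with': 'XMLHttpRequest',
}));
// { 'User-Agent': 'test', 'Accept-Encoding': 'gzip', 'x-requested-with': 'XMLHttpRequest' }
```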

By default, Got Scraping uses an insecure HTTP parser, which allows access to websites served by non-spec-compliant web servers.

Last but not least, Got Scraping comes with an updated TLS configuration. Some websites fingerprint it and compare it with real browsers. Although Node.js doesn't support OpenSSL 3 yet, the current configuration should still work flawlessly.

To get more detailed information about the implementation, please refer to the source code.

Tips

This package generates only the standard headers. You might want to add the referer header if necessary. Please bear in mind that these headers are made for GET requests for HTML documents. If you want to make POST requests, or GET requests for any other content type, you should alter these headers according to your needs. You can do so by passing a headers option or by writing a custom Got handler.

This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require some additional special care.

Overriding request headers

const response = await gotScraping({
    url: 'https://apify.com/',
    headers: {
        'user-agent': 'test',
    },
});

For more advanced usage please refer to the Got documentation.

JSON mode

You can parse JSON with this package too, but please bear in mind that the request header generation is done specifically for HTML content type. You might want to alter the generated headers to match the browser ones.

const response = await gotScraping({
    responseType: 'json',
    url: 'https://api.apify.com/v2/browser-info',
});
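
One way to alter the generated headers for a JSON request is to build the options object up front and override the accept header. This is a minimal sketch: buildJsonRequestOptions is a hypothetical helper, and the accept value is an assumed choice, not something the package prescribes.

```javascript
// Hypothetical helper: builds an options object for a JSON request.
// Caller-supplied headers take precedence over the default accept value.
function buildJsonRequestOptions(url, headers = {}) {
    return {
        url,
        responseType: 'json',
        headers: {
            // Assumed value: a typical accept header for an API call.
            accept: 'application/json',
            ...headers,
        },
    };
}

const options = buildJsonRequestOptions('https://api.apify.com/v2/browser-info');
console.log(options.headers.accept); // application/json
```

The resulting object could then be passed to gotScraping(options). With responseType set to 'json', Got exposes the parsed payload as response.body.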

Error recovery

This section covers possible errors that might happen due to different site implementations.

RequestError: Client network socket disconnected before secure TLS connection was established

The error above can occur when the server does not support the provided TLS settings. Try changing the ciphers parameter to either undefined or a custom value.
