• Stars: 632
  • Rank: 68,674 (Top 2%)
  • Language: Go
  • License: MIT License
  • Created: about 1 year ago
  • Updated: 16 days ago

Repository Details

Scrape data from Google Maps. Extracts data such as the name, address, phone number, website URL, rating, number of reviews, latitude and longitude, reviews, email and more for each place.

Google Maps scraper

A command line Google Maps scraper built using the scrapemate web crawling framework.

You can use this repository either as is, or you can use its code as a base and customize it to your needs.

Update: Added support for extracting emails from the business website.

Features

  • Extracts many data points from Google Maps
  • Exports the data to CSV, JSON or PostgreSQL
  • Performance: about 55 URLs per minute (-depth 1 -c 8)
  • Extendable to write your own exporter
  • Dockerized for easy running on multiple platforms
  • Scalable across multiple machines
  • Optionally extracts emails from the website of the business

Notes on email extraction

By default, email extraction is disabled.

If you enable email extraction (see quickstart), the scraper will visit the website of the business (if one exists) and try to extract the emails from the page.

For the moment it checks only one page of the website (the one registered in Google Maps). At some point, support will be added for trying other pages like about, contact, impressum, etc.

Keep in mind that enabling email extraction results in longer processing times, since more pages are scraped.
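
The following is only a sketch of the technique, not this project's actual implementation: it fetches a single page with Go's standard library and pulls out email-like strings with a simple regular expression. The URL and the pattern are placeholder assumptions for the example.

package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

// A deliberately simple pattern; real extraction has to cope with
// mailto: links, obfuscated addresses, etc.
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}`)

// extractEmails downloads a single page and returns the unique
// email-like strings found in its HTML.
func extractEmails(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	seen := make(map[string]struct{})
	var emails []string
	for _, m := range emailRe.FindAllString(string(body), -1) {
		if _, ok := seen[m]; !ok {
			seen[m] = struct{}{}
			emails = append(emails, m)
		}
	}
	return emails, nil
}

func main() {
	// example.com stands in for the business website found in the listing.
	emails, err := extractEmails("https://example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(emails)
}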

Extracted Data Points

link
title
category
address
open_hours
popular_times
website
phone
plus_code
review_count
review_rating
reviews_per_rating
latitude
longitude
cid
status
descriptions
reviews_link
thumbnail
timezone
price_range
data_id
images
reservations
order_online
menu
owner
complete_address
about
user_reviews
emails

Note: emails is empty by default; enable extraction with the -email flag (see Quickstart and Command line options).
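
As a rough illustration of how these data points could be modeled, here is a hypothetical Go struct covering a subset of the fields. The names follow the list above, but the struct and its field types are assumptions, not the repository's actual definitions.

package main

import "fmt"

// Entry is a hypothetical model for a single scraped place. It is not the
// type used by this repository; it only illustrates how some of the data
// points listed above could be represented. Field types are assumptions.
type Entry struct {
	Link             string
	Title            string
	Category         string
	Address          string
	OpenHours        map[string][]string // e.g. "Monday" -> ["09:00-17:00"]
	Website          string
	Phone            string
	PlusCode         string
	ReviewCount      int
	ReviewRating     float64
	ReviewsPerRating map[int]int // rating (1-5) -> number of reviews
	Latitude         float64
	Longitude        float64
	Status           string
	Emails           []string
}

func main() {
	e := Entry{Title: "Example Cafe", ReviewCount: 120, ReviewRating: 4.5}
	fmt.Printf("%+v\n", e)
}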

Quickstart

Using docker:

touch results.csv && docker run -v $PWD/example-queries.txt:/example-queries -v $PWD/results.csv:/results.csv gosom/google-maps-scraper -depth 1 -input /example-queries -results /results.csv -exit-on-inactivity 3m

The file results.csv will contain the parsed results.

If you want emails, additionally use the -email parameter.
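
Once the run finishes you can post-process the CSV with any tool. As one example using only Go's standard library, the sketch below reads results.csv and prints the title column; it assumes the header names match the data points listed above, which you should verify against your own output.

package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("results.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	records, err := csv.NewReader(f).ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	if len(records) == 0 {
		return
	}

	// Locate the "title" column from the header row; the exact header
	// names are assumed to match the data points listed above.
	titleIdx := -1
	for i, h := range records[0] {
		if h == "title" {
			titleIdx = i
			break
		}
	}
	if titleIdx == -1 {
		log.Fatal("no title column found")
	}
	for _, row := range records[1:] {
		fmt.Println(row[titleIdx])
	}
}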

On your host

(tested only on Ubuntu 22.04)

git clone https://github.com/gosom/google-maps-scraper.git
cd google-maps-scraper
go mod download
go build
./google-maps-scraper -input example-queries.txt -results restaurants-in-cyprus.csv -exit-on-inactivity 3m

Be a little patient: on the first run it downloads the required libraries.

The results are written to the results file you specified as they arrive.

If you want emails, additionally use the -email parameter.

Command line options

Try ./google-maps-scraper -h to see the available command line options:

  -c int
        sets the concurrency. By default it is set to half of the number of CPUs (default 8)
  -cache string
        sets the cache directory (no effect at the moment) (default "cache")
  -debug
        Use this to perform a headful crawl (it will open a browser window) [only when running without docker]
  -depth int
        is how much you allow the scraper to scroll in the search results. Experiment with that value (default 10)
  -dsn string
        Use this if you want to use a database provider
  -email
        Use this to extract emails from the websites
  -exit-on-inactivity duration
        program exits after this duration of inactivity (example value '5m')
  -input string
        is the path to the file where the queries are stored (one query per line). By default it reads from stdin (default "stdin")
  -json
        Use this to produce a json file instead of csv (not available when using db)
  -lang string
        is the language code to use for Google (the hl URL param). Default is en. For example use de for German or el for Greek (default "en")
  -produce
        produce seed jobs only (only valid with dsn)
  -results string
        is the path to the file where the results will be written (default "stdout")

Using Database Provider (PostgreSQL)

For running on your local machine:

docker-compose -f docker-compose.dev.yaml up -d

The above starts a PostgreSQL container and creates the required tables.

To access the database:

psql -h localhost -U postgres -d postgres

Password is postgres

Then from your host run:

go run main.go -dsn "postgres://postgres:postgres@localhost:5432/postgres" -produce -input example-queries.txt --lang el

(configure your queries and the desired language)

This will populate the table gmaps_jobs.

You may then run the scraper using:

go run main.go -c 2 -depth 1 -dsn "postgres://postgres:postgres@localhost:5432/postgres"

If you have a database server and several machines, you can start multiple instances of the scraper as above.
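
To keep an eye on progress you can query the gmaps_jobs table from psql, or, as a rough sketch, from Go with database/sql and the lib/pq driver. Only the table name is taken from the docs above; the count query makes no assumption about its columns, and sslmode=disable is an assumption that matches a typical local docker setup.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // external PostgreSQL driver (go get github.com/lib/pq)
)

func main() {
	// Matches the DSN used above; sslmode=disable assumes the local dev setup.
	dsn := "postgres://postgres:postgres@localhost:5432/postgres?sslmode=disable"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var jobs int
	// Only the table name gmaps_jobs is taken from the docs; adapt the
	// query to the actual schema if you need more detail.
	if err := db.QueryRow("SELECT COUNT(*) FROM gmaps_jobs").Scan(&jobs); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("jobs in gmaps_jobs: %d\n", jobs)
}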

Kubernetes

You may run the scraper in a Kubernetes cluster. This makes it easier to scale.

Assuming you have a kubernetes cluster and a database that is accessible from the cluster:

  1. First populate the database as shown above
  2. Create a deployment file scraper.deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: google-maps-scraper
spec:
  selector:
    matchLabels:
      app: google-maps-scraper
  replicas: {NUM_OF_REPLICAS}
  template:
    metadata:
      labels:
        app: google-maps-scraper
    spec:
      containers:
      - name: google-maps-scraper
        image: gosom/google-maps-scraper:v0.9.3
        imagePullPolicy: IfNotPresent
        args: ["-c", "1", "-depth", "10", "-dsn", "postgres://{DBUSER}:{DBPASSWD}@{DBHOST}:{DBPORT}/{DBNAME}", "-lang", "{LANGUAGE_CODE}"]

Please replace the values or the command args accordingly.

Note: Keep in mind that because the application starts a headless browser, it requires CPU and memory. Use an appropriately sized Kubernetes cluster.

Performance

Expected speed with a concurrency of 8 and depth 1 is about 55 jobs per minute. Each search is 1 job plus the number of results it contains.

Based on the above: if we have 1000 keywords to search and each contains 16 results => 1000 * (1 + 16) = 17,000 jobs.

We expect this to take about 17000/55 ≈ 309 minutes ≈ 5 hours.
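
The same back-of-the-envelope estimate, written out in Go with the example numbers from above:

package main

import "fmt"

func main() {
	keywords := 1000
	resultsPerKeyword := 16
	jobsPerMinute := 55.0

	// Each search is 1 job plus the number of results it contains.
	jobs := keywords * (1 + resultsPerKeyword)
	minutes := float64(jobs) / jobsPerMinute

	fmt.Printf("%d jobs, about %.0f minutes (~%.1f hours)\n", jobs, minutes, minutes/60)
}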

If you want to scrape many keywords, it's better to use the Database Provider in combination with Kubernetes for convenience and start multiple scrapers on more than one machine.

References

For more instruction you may also read the following links

Licence

This code is licenced under the MIT Licence

Contributing

Please open an ISSUE or make a Pull Request

Notes

Please use this scraper responsibly

The banner is generated using OpenAI's DALL·E.

More Repositories

1. scrapemate (Go, 58 stars): Golang crawling and scraping framework
2. address-parser-go-rest (Go, 10 stars): Address Parser Go REST is a REST API that provides address parsing functionality using the libpostal library. Users can submit a request to parse an address into its individual components, and the API returns a JSON response with the parsed components.
3. context-spell-correct (Go, 8 stars): Context-based spelling correction REST API implemented in Golang
4. IpVanishManager (Python, 7 stars): Starts a VPN connection to IPVanish
5. gosql2pc (Go, 7 stars): A Golang library for implementing two-phase commit transactions in PostgreSQL, ensuring atomicity and consistency across distributed systems.
6. MaximumFlowMPM (Python, 3 stars): Python implementation of the Malhotra, Pramodh Kumar, and Maheshwari (MPM) algorithm for finding maximum flow in a network
7. hermeshooks (Go, 2 stars): A scheduler for webhooks written in Golang and Postgres
8. RAKE (QMake, 1 star): C++ implementation of the RAKE algorithm
9. pydata-intro-sql (Python, 1 star): Presentation for PyData Limassol meetup
10. kit (Go, 1 star): A Golang toolkit
11. scrapemate-example-scrapemelive (Go, 1 star)
12. scrapemate-highlevel-api-example (Go, 1 star): This is an example of how you can use scrapemate to scrape a website
13. matrix-multiplication-benchmark (Python, 1 star)
14. githubwebhooks-googlesheets (Python, 1 star): When a PR is approved, catches the GitHub webhook and writes some data to a Google Sheet (Python & GCP Cloud Functions)
15. go-minhash (Go, 1 star): An implementation of MinHash in Golang
16. path-planning (Python, 1 star): Dijkstra and A* implementation to compute shortest paths (Uni Bonn assignment)
17. QuantifiedAndroidRepo (Java, 1 star)
18. toy-browser-engine (Makefile, 1 star)