  • Stars: 171
  • Rank 222,266 (Top 5 %)
  • Language: JavaScript
  • License: GNU Lesser Genera...
  • Created over 5 years ago
  • Updated over 1 year ago

Repository Details

Web data extraction tool implemented as a Chrome extension

Web Scraper

Web Scraper is a Chrome browser extension built for extracting data from web pages. With this extension you can create a plan (sitemap) that describes how a website should be traversed and what should be extracted. Web Scraper then navigates the site according to the sitemap and extracts the data. Scraped data can later be exported as CSV or JSON Lines.
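
To make the two export formats concrete, here is a minimal sketch of how a list of scraped records could be serialized as JSON Lines and as CSV. It is not the extension's own exporter: the record fields (`title`, `price`) and the function names `toJsonLines` and `toCsv` are assumptions made up for this illustration.

```javascript
// Hypothetical scraped records; the field names are illustrative only.
const records = [
  { title: 'First "quoted" item', price: '10,50' },
  { title: 'Second item', price: '7' },
];

// JSON Lines: one JSON object per line, separated by newlines.
function toJsonLines(rows) {
  return rows.map((row) => JSON.stringify(row)).join('\n');
}

// Minimal CSV: every field is quoted and embedded quotes are doubled.
function toCsv(rows, delimiter = ',') {
  // Collect the union of keys so rows with missing fields still line up.
  const columns = [...new Set(rows.flatMap((row) => Object.keys(row)))];
  const quote = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
  const lines = [columns.map(quote).join(delimiter)];
  for (const row of rows) {
    lines.push(columns.map((column) => quote(row[column])).join(delimiter));
  }
  return lines.join('\r\n');
}

console.log(toJsonLines(records));
console.log(toCsv(records));
```

JSON Lines keeps nested values intact and can be processed line by line, while CSV flattens every value to a string, which is why the sketch quotes all fields.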

Latest Version

Read about the installation process on the installation page.

Changelog

v0.3.6

  • Updated table support (improved vertical table support; added complex headers and data rows)
  • Added export and import of sitemaps from a file
  • Added Russian translations and i18n support, which makes it possible to add translations for any language
  • Added REST API CRUD storage for sitemaps
  • Moved to the webpack bundler
  • Added ID hints from a predefined model
  • Added selectors for Constants and Documents
  • Refactored preview data and added search in scraped data
  • Refactored returned items model to JSON
  • Added saving in JSON Lines

v0.3

  • Enabled pasting of multiple start URLs (by @jwillmer)
  • Added scraping of dynamic table columns (by @jwillmer)
  • Added style extraction type (by @jwillmer)
  • Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by @jwillmer)
  • Added image improvements to find images in div background (by @jwillmer)
  • Added support for vertical tables (by @jwillmer)
  • Added random delay function between requests (by @Euphorbium)
  • Start URL can now also be a local URL (by @3flex)
  • Added CSV export options (by @mohamnag)
  • Added Regex group for select (by @RuneHL)
  • JSON export/import of settings (by @haisi)
  • Added date and number pattern in URL (by @codoff)
  • Added pagination selector limit (by @codoff)
  • Improved CSV export (by @haisi)
  • Added click limit option (by @panna-ahmed)
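
The text manipulation entry above (trim, replace, prefix, suffix, remove HTML) can be pictured as a small pipeline applied to each extracted value. The sketch below is only an illustration of that idea: the function name `manipulateText`, the option names, and the order of the steps are assumptions, not the extension's actual settings.

```javascript
// Hypothetical option names; the extension's real settings may differ.
function manipulateText(text, options = {}) {
  let result = String(text);
  if (options.removeHtml) {
    // Naive tag stripping, good enough for an illustration only.
    result = result.replace(/<[^>]*>/g, '');
  }
  if (options.trim) {
    result = result.trim();
  }
  if (options.replace) {
    // Replace every occurrence of a literal substring.
    result = result.split(options.replace.from).join(options.replace.to);
  }
  if (options.prefix) {
    result = options.prefix + result;
  }
  if (options.suffix) {
    result = result + options.suffix;
  }
  return result;
}

console.log(
  manipulateText('  <b>42</b> USD ', {
    removeHtml: true,
    trim: true,
    replace: { from: 'USD', to: 'EUR' },
    prefix: 'Price: ',
    suffix: '.',
  })
);
```

Running the steps in a fixed order keeps the result predictable, for example trimming before a prefix is added so that the prefix is never followed by stray whitespace.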

v0.2

  • Added Element click selector
  • Added Element scroll down selector
  • Added Link popup selector
  • Improved table selector to work with any HTML markup
  • Added Image download
  • Added keyboard shortcuts when selecting elements
  • Added configurable delay before using selector
  • Added configurable delay between page visiting
  • Added multiple start URL configuration
  • Added form field validation
  • Fixed a lot of bugs

v0.1.3

  • Added Table selector
  • Added HTML selector
  • Added HTML attribute selector
  • Added data preview
  • Added ranged start URLs
  • Fixed a bug that prevented the selector tree from showing on some operating systems
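
As an illustration of what a ranged start URL could mean in practice, the sketch below expands a single URL containing a numeric range into a list of concrete URLs. The bracket syntax `[start-end]` and the function name `expandRangedUrl` are assumptions made for this example; check the extension's documentation for the syntax it actually accepts.

```javascript
// Expands the first "[start-end]" range found in a URL (assumed syntax).
// Zero padding of the start value is preserved, so [08-10] gives 08, 09, 10.
// If start is greater than end, the result is an empty list.
function expandRangedUrl(url) {
  const match = url.match(/\[(\d+)-(\d+)\]/);
  if (!match) {
    return [url]; // No range present: the URL is used as-is.
  }
  const [token, startText, endText] = match;
  const start = Number(startText);
  const end = Number(endText);
  const width = startText.length;
  const urls = [];
  for (let i = start; i <= end; i += 1) {
    urls.push(url.replace(token, String(i).padStart(width, '0')));
  }
  return urls;
}

console.log(expandRangedUrl('http://example.com/page/[08-10]'));
```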

Bugs

When submitting a bug, please attach an exported sitemap if possible.

Development

Read the Development Instructions before you start.

License

LGPLv3

More Repositories

  1. casr (Rust, 279 stars): Collect crash (or UndefinedBehaviorSanitizer error) reports, triage, and estimate severity.
  2. dedoc (Python, 153 stars): Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
  3. oss-sydr-fuzz (C, 127 stars): OSS-Sydr-Fuzz - OSS-Fuzz fork for hybrid fuzzing (fuzzer+DSE) of open source software.
  4. Futag (Python, 51 stars): FUTAG (FUzzing Target Automated Generator) - an automatic generator of fuzzing wrappers for libraries
  5. scrapy-puppeteer (Python, 43 stars): Library that helps use puppeteer in scrapy.
  6. pu4spark (Scala, 40 stars): Positive-Unlabeled Learning for Apache Spark
  7. rop-benchmark (Python, 36 stars): ROP Benchmark is a tool to compare ROP compilers
  8. crusher (Python, 35 stars)
  9. qdt (Python, 34 stars): QEMU Development Toolkit
  10. atr4s (Scala, 33 stars): Toolkit with state-of-the-art Automatic Terms Recognition methods in Scala
  11. spark-openstack (Jinja, 30 stars): Scripts to set up a Spark cluster (any version) in any OpenStack environment with optional useful tools.
  12. juliet-dynamic (23 stars): Juliet C/C++ Dynamic Test Suite
  13. qemu-gui (C++, 20 stars): GUI for QEMU
  14. hdl-benchmarks (Verilog, 17 stars): Collection of open HDL modules, subsystems and microprocessors (benchmarks) that are used for testing related tools.
  15. michman (Go, 17 stars): Service for distributed systems deployment; part of Asperitas
  16. natch (Shell, 16 stars): Natch: a tool for determining the attack surface
  17. sydr-benchmark (C++, 15 stars): Sydr benchmark applications
  18. quix86 (C, 15 stars): An x86-64 instruction decoder.
  19. cotea (Python, 14 stars): cotea: Ansible control tool
  20. EcgLib (Python, 13 stars)
  21. centos6.9-build-docker (Dockerfile, 11 stars): CentOS 6.9 build Docker environment to distribute portable Linux binaries
  22. swat (C, 11 stars): SWAT - System-Wide Analysis Toolkit
  23. proceedings (TeX, 10 stars): Proceedings of ISP RAS LaTeX Template
  24. v8-aotc (C++, 10 stars): V8 ahead-of-time compilation project
  25. scrapy-puppeteer-service (JavaScript, 10 stars): A special service that runs puppeteer instances.
  26. tact (C, 8 stars)
  27. lingvodoc-react (JavaScript, 7 stars)
  28. texterra-py (Python, 7 stars): Texterra Python SDK
  29. utopia-hls (C++, 7 stars): Utopia: a High-Level Synthesis framework
  30. lingvodoc (JavaScript, 7 stars): More advanced Python version for Dialeqt project
  31. riscv-avs (Assembly, 7 stars): RISC-V Architecture Verification Suite (AVS)
  32. microtesk-old (7 stars): MicroTESK: Specification-Based Framework for Developing Test Program Generators
  33. tm (HTML, 6 stars): Regularized multilingual Probabilistic Semantic Analysis Scala implementation.
  34. TrustedDynamic (Dockerfile, 6 stars)
  35. proceedings-md (TypeScript, 6 stars): Automatic markdown to docx converter that follows the Ispras proceedings design requirements
  36. clouni (Python, 5 stars): Cloud Unifier Tool for Service Orchestration
  37. dedoc-utils (Python, 5 stars): Useful utilities for automatic document image processing
  38. FuzzedDataProviderCS (C#, 5 stars): FuzzedDataProvider for C#, inspired by Google's FuzzedDataProvider.
  39. parmasan (C++, 5 stars): Mirror repository with parmasan project
  40. microtesk (Java, 5 stars): MicroTESK: Specification-Based Framework for Developing Test Program Generators
  41. gocotea (Go, 5 stars): gocotea: Ansible control tool on Golang
  42. endometrium-dataset-analysis (Jupyter Notebook, 4 stars): This repository is dedicated to the analysis of the EndoNuke dataset
  43. esoc (Stata, 3 stars): Ethernet Switch on Configurable Logic
  44. angiocells_analysis (Jupyter Notebook, 3 stars)
  45. libosuction (C, 3 stars): A tool for stripping dynamic libraries of unneeded symbols
  46. news-page-dataset (3 stars)
  47. I3S (Python, 2 stars)
  48. parmasan-remake (C, 2 stars): Mirror repository with patched remake for parmasan
  49. minimap2_index_modifier (C, 2 stars)
  50. hls-idct (Verilog, 2 stars): Inverse Discrete Cosine Transform (IDCT) algorithm implementations written in languages for High-Level Synthesis (HLS) and Hardware Construction (HC) tools.
  51. sv-tests (Verilog, 1 star): Test suites based on Verilog and SystemVerilog standards
  52. cv (Python, 1 star): Klever Continuous Verification Framework
  53. flagsup (Python, 1 star): Build flags extractor and summarizer.
  54. mammo_crop (Jupyter Notebook, 1 star)
  55. dedockerfiles (Dockerfile, 1 star): Collection of dockerfiles for dedoc group projects
  56. qdt-guest-agent (C++, 1 star)
  57. NetBlox (Java, 1 star)
  58. staccato (C, 1 star): Fork for the STACCATO project of University of Michigan
  59. flint (Scala, 1 star): Scalable machine learning framework
  60. gephi-graphson (Java, 1 star): Importer and exporter plugins for Gephi for GraphSON format
  61. PTAHA (R, 1 star): Patent Timesaving Automatic Helpful Apparatus
  62. RISC-V-nML (1 star): RISC-V nML is a specification of the RISC-V ISA in the nML architecture description language.