  • Stars: 171
  • Rank: 217,591 (Top 5 %)
  • Language: JavaScript
  • License: GNU Lesser Genera...
  • Created almost 5 years ago
  • Updated 12 months ago


Repository Details

Web data extraction tool implemented as a Chrome extension

Web Scraper

Web Scraper is a Chrome browser extension built for extracting data from web pages. With this extension you can create a plan (sitemap) that describes how a website should be traversed and what should be extracted. Web Scraper then navigates the site according to the sitemap and extracts the data. Scraped data can later be exported as CSV or JSON Lines.
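To make the sitemap idea concrete, here is a minimal sketch of a sitemap as a tree of selectors, plus a function that lists the order in which a depth-first traversal would visit them. The field names (`id`, `startUrls`, `selectors`, `parent`, `css`) and the `_root` marker are illustrative assumptions, not the extension's confirmed schema.

```javascript
// Hypothetical sitemap: all field names here are assumptions made for illustration.
const sitemap = {
  id: "example-shop",
  startUrls: ["https://example.com/catalog"],
  selectors: [
    { id: "category", type: "link", parent: "_root", css: "a.category" },
    { id: "product", type: "link", parent: "category", css: "a.product" },
    { id: "title", type: "text", parent: "product", css: "h1" },
    { id: "price", type: "text", parent: "product", css: ".price" },
  ],
};

// Return selector ids in the order a depth-first traversal would visit them,
// starting from the children of the given parent.
function traversalOrder(map, parentId = "_root") {
  const order = [];
  for (const selector of map.selectors) {
    if (selector.parent === parentId) {
      order.push(selector.id);
      order.push(...traversalOrder(map, selector.id));
    }
  }
  return order;
}

console.log(traversalOrder(sitemap));
```

Link selectors lead to further pages, while text selectors mark the values to extract on the page they belong to; the parent reference is what turns a flat list into a traversal plan.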

Latest Version

Read about the installation process on the installation page.

Changelog

v0.3.6

  • Updated table support (improved vertical table handling; added complex headers and data rows)
  • Added sitemap export to and import from a file
  • Added Russian translations and i18n support, which makes it possible to add translations for any language
  • Added REST API CRUD storage for sitemaps
  • Moved to webpack bundler
  • Added id hints from predefined model
  • Added selectors for Constants and Documents
  • Refactored preview data and added search in scraped data
  • Refactored returned items model to JSON
  • Added saving as JSON Lines
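For readers unfamiliar with JSON Lines, the format stores one JSON object per line. The sketch below shows one way scraped records could be written and read back in that format; the function names `toJsonLines` and `fromJsonLines` are hypothetical and are not taken from the extension's code.

```javascript
// Serialize an array of records as JSON Lines: one JSON document per line,
// with a trailing newline after the last record.
function toJsonLines(records) {
  return records.map((record) => JSON.stringify(record)).join("\n") + "\n";
}

// Parse JSON Lines text back into an array of records, skipping blank lines.
function fromJsonLines(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

const rows = [
  { title: "First item", price: "12.50" },
  { title: "Second item", price: "8.00" },
];
console.log(toJsonLines(rows));
```

Because each line is an independent JSON document, records can be appended or processed one at a time without loading the whole file.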

v0.3

  • Enabled pasting of multiple start URLs (by @jwillmer)
  • Added scraping of dynamic table columns (by @jwillmer)
  • Added style extraction type (by @jwillmer)
  • Added text manipulation (trim, replace, prefix, suffix, remove HTML) (by @jwillmer)
  • Improved image extraction to find images in div backgrounds (by @jwillmer)
  • Added support for vertical tables (by @jwillmer)
  • Added random delay function between requests (by @Euphorbium)
  • Start URL can now also be a local URL (by @3flex)
  • Added CSV export options (by @mohamnag)
  • Added Regex group for select (by @RuneHL)
  • JSON export/import of settings (by @haisi)
  • Added date and number pattern in URL (by @codoff)
  • Added pagination selector limit (by @codoff)
  • Improved CSV export (by @haisi)
  • Added click limit option (by @panna-ahmed)
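The text manipulation options listed for v0.3 (trim, replace, prefix, suffix, remove HTML) can be pictured as a small pipeline applied to each extracted value. The following is a sketch under assumptions: the option names, the `cleanText` function, and the order in which the steps run are hypothetical and may differ from the extension's behavior.

```javascript
// Apply optional clean-up steps to an extracted value.
// Assumed order: remove HTML, trim, replace, then add prefix and suffix.
function cleanText(value, options = {}) {
  let text = String(value);
  if (options.removeHtml) {
    // Naive tag stripping; good enough for simple markup only.
    text = text.replace(/<[^>]*>/g, "");
  }
  if (options.trim) {
    text = text.trim();
  }
  if (options.replace) {
    // Replace every literal occurrence of `from` with `to`.
    text = text.split(options.replace.from).join(options.replace.to);
  }
  return (options.prefix || "") + text + (options.suffix || "");
}

console.log(
  cleanText("  <b>12,50</b> ", {
    removeHtml: true,
    trim: true,
    replace: { from: ",", to: "." },
    prefix: "EUR ",
  })
);
```

Running the steps in a fixed order keeps the result predictable, for example so that a prefix is never affected by the replace step.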

v0.2

  • Added Element click selector
  • Added Element scroll down selector
  • Added Link popup selector
  • Improved table selector to work with any HTML markup
  • Added Image download
  • Added keyboard shortcuts when selecting elements
  • Added configurable delay before using selector
  • Added configurable delay between page visiting
  • Added configuration of multiple start URLs
  • Added form field validation
  • Fixed many bugs

v0.1.3

  • Added Table selector
  • Added HTML selector
  • Added HTML attribute selector
  • Added data preview
  • Added ranged start URLs
  • Fixed a bug that prevented the selector tree from showing on some operating systems
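A ranged start URL stands for a series of pages that differ only by a number. The sketch below expands one such URL into the full list. The `[start-end]` bracket syntax, the zero-padding rule, and the `expandRange` name are assumptions made for illustration; the changelog does not specify the exact syntax.

```javascript
// Expand the first "[start-end]" token in a URL into one URL per number.
// A leading zero in the start value is taken to mean zero-padded output.
function expandRange(url) {
  const match = url.match(/\[(\d+)-(\d+)\]/);
  if (!match) {
    return [url]; // No range token: the URL is used as-is.
  }
  const [token, startText, endText] = match;
  const start = Number(startText);
  const end = Number(endText);
  const width = startText.startsWith("0") ? startText.length : 0;
  const urls = [];
  for (let i = start; i <= end; i++) {
    urls.push(url.replace(token, String(i).padStart(width, "0")));
  }
  return urls;
}

console.log(expandRange("https://example.com/page/[1-3]"));
```

A URL without a range token is returned unchanged, so the same function can be applied to every start URL in a list.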

Bugs

When submitting a bug report, please attach an exported sitemap if possible.

Development

Read the Development Instructions before you start.

License

LGPLv3
