• This repository was archived on 10 Dec 2018
  • Stars: 532
  • Rank: 83,377 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created almost 13 years ago
  • Updated about 7 years ago

Repository Details

[not actively maintained] A lightweight Python library that uses WebKit to enable easy scraping of dynamic, JavaScript-heavy web pages

NOTE: This package is not actively maintained. It uses QtWebKit, which is end-of-life and probably doesn't get security fixes backported. Consider using a similar package like Spynner instead.

Overview

Author: Niklas Baumstark

dryscrape is a lightweight web scraping library for Python. It uses a headless WebKit instance to evaluate JavaScript on the visited pages. This enables painless scraping of plain web pages as well as JavaScript-heavy “Web 2.0” applications like Facebook.

It is built on the shoulders of capybara-webkit's webkit-server. A big thanks goes to thoughtbot, inc. for building this excellent piece of software!

Changelog

  • 1.0: Added Python 3 support, small performance fixes, header names are now properly normalized. Also added the function dryscrape.start_xvfb() to easily start Xvfb.
  • 0.9.1: Changed semantics of the headers function in a backwards-incompatible way: It now returns a list of (key, value) pairs instead of a dictionary.
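
Code written against the old dictionary-style result of the headers function may need a small adapter after the 0.9.1 change. The following is a minimal sketch, not part of dryscrape: the helper name `headers_to_dict` and the sample data are hypothetical, and lowercasing the names is an assumption made here so that lookups do not depend on how the library normalizes them.

```python
def headers_to_dict(pairs):
    """Group (key, value) header pairs into a dict of lists.

    Header names are lowercased (an assumption of this sketch), and
    repeated headers such as Set-Cookie keep all of their values in order.
    """
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key.lower(), []).append(value)
    return grouped


# Hypothetical sample standing in for the list of pairs described above.
sample_pairs = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "a=1"),
    ("Set-Cookie", "b=2"),
]
headers = headers_to_dict(sample_pairs)
print(headers["set-cookie"])
```

Keeping a list per name avoids silently dropping repeated headers, which a plain `dict(pairs)` would do.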

Supported Platforms

The library has been confirmed to work on the following platforms:

  • Mac OS X 10.9 Mavericks and 10.10 Yosemite
  • Ubuntu Linux
  • Arch Linux

Other Unix-like systems should work just fine.

Windows is not officially supported, although dryscrape should work with Cygwin.

A word about Qt 5.6

The 5.6 version of Qt removes the Qt WebKit module in favor of the new module Qt WebEngine. So far webkit-server has not been ported to WebEngine (and likely won't be in the near future), so Qt <= 5.5 is a requirement.
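
The version constraint above can be checked before attempting a build. This is a minimal sketch under stated assumptions: the function name `qt_supports_webkit_server` is hypothetical, and the version string is assumed to be in the usual `major.minor[.patch]` form.

```python
def qt_supports_webkit_server(version):
    """Return True if the given Qt version is 5.5 or older.

    Only the major and minor components are compared, so "5.5.1"
    counts as supported and "5.6.0" does not.
    """
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) <= (5, 5)


for candidate in ("5.5.1", "5.6.0"):
    print(candidate, qt_supports_webkit_server(candidate))
```

On systems where `qmake` is installed, the version string can typically be obtained with `qmake -query QT_VERSION`.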

Installation, Usage, API Docs

Documentation can be found at dryscrape's ReadTheDocs page.

Quick installation instructions for Ubuntu (the leading # indicates a root shell):

# apt-get install qt5-default libqt5webkit5-dev build-essential python-lxml python-pip xvfb
# pip install dryscrape
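
Once the package is installed, a basic session might look like the sketch below. Only `dryscrape.start_xvfb()` is named in this README; `Session`, `visit` and `body` are assumed from the library's documented usage and should be checked against the ReadTheDocs page. The wrapper name `fetch_rendered_html` is hypothetical.

```python
import sys


def fetch_rendered_html(url):
    """Visit url in a headless WebKit session and return the page HTML.

    Assumes dryscrape and its Qt dependencies are installed.
    """
    # Imported lazily so that defining this function does not require dryscrape.
    import dryscrape

    # On Linux machines without a display, start a virtual X server first.
    if sys.platform.startswith("linux"):
        dryscrape.start_xvfb()

    session = dryscrape.Session()
    session.visit(url)
    # The HTML returned here reflects the page after its scripts have run.
    return session.body()


# Example call (requires a working dryscrape installation):
#   html = fetch_rendered_html("http://example.com/")
```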

Contact, Bugs, Contributions

If you have any problems with this software, don't hesitate to open an issue on GitHub, open a pull request, or send an email to niklas baumstark at Gmail.
