Websites crawler with built-in exploration and control web interface

Hyphe: web corpus curation tool & links crawler

Welcome to Hyphe, a research-driven web crawler developed at the Sciences Po médialab for the DIME-SHS Web project (ANR-10-EQPX-19-01).

Hyphe aims to provide a tool for building web corpora by crawling data from the web and generating networks between what we call "web entities", which can be single pages, a whole website, subdomains or parts of a website, or even a combination of those.

Demo & Tutos

You can try a limited version of Hyphe at the following url: http://hyphe.medialab.sciences-po.fr/demo/

You can find extensive tutorials on Hyphe's Wiki. See also these videos on how to grow a Hyphe corpus and what a web entity is.

How to install?

Before running Hyphe, you may want to adjust its settings. The default config will work, but you may want to tune it to your own needs. There is a procedure to change the configuration after installation; however, we recommend taking a look at the Configuration documentation first for a detailed explanation of each available option.

Warning: Hyphe can be quite disk-consuming. A big corpus with a few hundred crawls at depth 2 can easily take up to 50GB, so if you plan on allowing multiple users, you should ensure at least a few hundred gigabytes are available on your machine. You can reduce disk usage by setting the option store_crawled_html_content to false and limiting the max_depth allowed.
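
As a rough pre-flight check, the free space on the target drive can be compared with the figures above. The sketch below is only an illustration: the function names free_gb and check_disk are hypothetical (they are not part of Hyphe), the 50GB default simply mirrors the "big corpus" figure quoted above, and the parsing assumes POSIX df -Pk output with a filesystem name that contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hyphe: print the free space (in whole GB)
# of the filesystem holding the given path, based on POSIX "df -Pk" output.
free_gb() {
    df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Warn when fewer than the requested number of GB are free.
# The 50 GB default is an assumption taken from the warning above.
check_disk() {
    path="${1:-.}"
    needed="${2:-50}"
    available="$(free_gb "$path")"
    if [ "$available" -lt "$needed" ]; then
        echo "Only ${available}GB free on ${path}, ${needed}GB advised" >&2
        return 1
    fi
    echo "${available}GB free on ${path}: OK"
}

# Demonstration with a threshold of 0 GB, which always passes.
check_disk . 0
```

For a shared server, a call such as check_disk /var/opt/hyphe/ 200 would be closer to the "few hundred gigabytes" advice; both the path and the threshold are to be adapted.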

Migrating older versions

Hyphe has changed a lot in the past few years. Migrating from an older version by pulling the code from git is no longer guaranteed to work; reinstalling from scratch is highly recommended. Older corpora can be rebuilt by exporting the list of web entities from the old version and recrawling from that list of URLs in the new Hyphe.

Easy install: using Docker

For an easy install on Linux, Mac OS X or Windows, the best solution is to rely on Docker.

Docker enables isolated installation and execution of software stacks, which makes it easy to install a whole set of dependencies.

Docker's containers are sizeable: you should ensure at least 4GB of free space is available before installing. In any case, as stated above, for regular and complete use of Hyphe, you should ensure at least 100GB are available.

Note for Mac OS: Apple's XCode is no longer required to run Docker on Mac OS, although it is still preferable to have it installed for other reasons, such as git.

1. Install Docker

First, you should deploy Docker on your machine following its official installation instructions.

Once you've got Docker installed and running, you will need Docker Compose to set up and orchestrate Hyphe services in a single line. Docker Compose is already installed along with Docker on Windows and Mac OS X, but you may need to install it for Linux.

2. Download Hyphe

Collect Hyphe's source code from this git repository (the recommended way to benefit from future updates) or download and uncompress a zipped release, then enter the resulting directory:

git clone https://github.com/medialab/hyphe.git hyphe
cd hyphe

Or, if you do not have git (for instance on a Mac without XCode), you can also download and uncompress the files from Hyphe's latest release by clicking the link to "Source code (zip)" or "Source code (tar.gz)" from the following page: https://github.com/medialab/hyphe/releases

3. Configure

Then, copy the default configuration files and edit them to adjust the settings to your needs:

# use "copy" instead of "cp" under Windows powershell
cp .env.example .env
cp config-backend.env.example config-backend.env
cp config-frontend.env.example config-frontend.env

The .env file lets you configure:

  • TAG: the reference Docker image you want to work with, among:

    • prod: for the latest stable release
    • preprod: for intermediate unstable developments
    • A specific version, for instance 1.3.0. You will find the list on Hyphe's Docker Hub page and descriptions for each version on GitHub's releases page.
  • PUBLIC_PORT: the web port on which Hyphe will be served (usually 80 for a single service server, or for a shared host any other port you like which will need to be redirected)

  • DATA_PATH: using Hyphe can quickly consume several gigabytes of hard drive. By default, volumes will be stored within Docker's default directories but you can define your own path here.

    WARNING: DATA_PATH MUST be either empty, or a full absolute path including leading and trailing slashes (for instance /var/opt/hyphe/).

    It is not currently supported under Windows, and should always remain empty in this case (so you should install Hyphe from a drive with enough available space).

  • RESTART_POLICY: the choice of autorestart policy you want Hyphe containers to apply

    • no: (default) containers will not be restarted automatically under any circumstance
    • always: containers will always restart when stopped
    • on-failure: containers will restart only if they exit with a non-zero (error) exit code
    • unless-stopped: containers will always restart unless explicitly stopped

    If you want Hyphe to start automatically at boot, you should use the always policy and make sure the Docker daemon is started at boot time with your service manager.
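
The constraints listed above for the .env values can be checked before starting the containers. The following is a minimal sketch under stated assumptions: the three function names are hypothetical and not provided by Hyphe, and the 1 to 65535 range for PUBLIC_PORT is the usual TCP port range rather than a rule stated in this document.

```shell
#!/bin/sh
# Hypothetical validators for three .env values; the function names are
# illustrative only and are not provided by Hyphe.

# DATA_PATH: empty, or an absolute path with leading and trailing slashes.
valid_data_path() {
    case "$1" in
        "") return 0 ;;
        /*/) return 0 ;;
        *) return 1 ;;
    esac
}

# RESTART_POLICY: one of the four policies listed above.
valid_restart_policy() {
    case "$1" in
        no|always|on-failure|unless-stopped) return 0 ;;
        *) return 1 ;;
    esac
}

# PUBLIC_PORT: digits only, within the usual TCP range (an assumption).
valid_public_port() {
    case "$1" in
        ""|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Demonstration with the example values mentioned in the text.
valid_data_path "/var/opt/hyphe/" && echo "DATA_PATH ok"
valid_restart_policy "always" && echo "RESTART_POLICY ok"
valid_public_port "80" && echo "PUBLIC_PORT ok"
```

A value such as /var/opt/hyphe (no trailing slash) is rejected by valid_data_path, which matches the warning about DATA_PATH above.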

Hyphe's internal settings are adjustable within config-backend.env and config-frontend.env. Adjust the settings values to your needs following recommendations from the config documentation.

If you want to restrict access to Hyphe to a selected few, you should leave HYPHE_OPEN_CORS_API set to false in config-backend.env, and set up HYPHE_HTPASSWORD_USER & HYPHE_HTPASSWORD_PASS in config-frontend.env (use openssl passwd -apr1 to generate your password's encrypted value).
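
The openssl command mentioned above can be wrapped as follows. This is a sketch rather than an official procedure: the function name hash_password is hypothetical, "change-me" is a throwaway example password, and the -stdin option is used on the assumption that your OpenSSL build supports it (it is a documented option of openssl passwd).

```shell
#!/bin/sh
# Hypothetical wrapper, not part of Hyphe: print the APR1 (Apache MD5)
# hash of a password. The password is fed on stdin so that, in shells
# where printf is a builtin, it does not appear in the process list.
hash_password() {
    printf '%s\n' "$1" | openssl passwd -apr1 -stdin
}

# Example with a throwaway password. The salt is random, so the output
# differs on every run, but it always starts with "$apr1$".
hash_password "change-me"
```

The printed value is what would then be stored in HYPHE_HTPASSWORD_PASS.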

4. Prepare the Docker containers

You have two options: either pull or build Hyphe's Docker images.

  • Recommended: Pull our official preassembled images from Docker Hub

    docker-compose pull
  • Alternative: Build your own images from the source code (mostly for development or if you intend to edit the code, and for some very specific configuration settings):

    docker-compose build

Pulling should be faster, but it will still take a few minutes to download or build everything either way.

5. Start Hyphe

Finally, start Hyphe containers with the following command, which will run Hyphe and display all of its logs in the console until stopped by pressing Ctrl+C.

docker-compose up

Or run the containers as a background daemon (for instance for production on a server):

docker-compose up -d

Once the logs say "All tests passed. Ready!", you can access your Hyphe install at http://localhost:80/ (or http://localhost:<PUBLIC_PORT>/ if you changed the port value in the .env configuration file).
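
When the containers run in the background, the ready message can be detected from a script instead of by eye. The sketch below assumes the message appears verbatim on a single log line; the function name wait_for_ready is hypothetical and the sample log lines are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper, not part of Hyphe: read log lines on stdin and
# succeed as soon as the ready message shows up; fail if the stream
# ends without it.
wait_for_ready() {
    grep -q -F "All tests passed. Ready!"
}

# Self-contained demonstration on canned log lines (invented here).
printf '%s\n' "starting backend" "All tests passed. Ready!" | wait_for_ready \
    && echo "Hyphe is ready"
```

With a real install, the function could be fed from the live logs, for instance with docker-compose logs -f | wait_for_ready. Note that the pipeline only returns once the log command writes again after the match and notices the closed pipe.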

6. Stop and monitor Hyphe

To stop containers running in the background, use docker-compose stop (or docker-compose down to also remove the stopped containers and their networks).

You can inspect the logs of the various Docker containers using docker-compose logs, adding the -f option to follow the latest entries as tail -f would.

Whenever you change any configuration file, restart the Docker container to take the changes into account:

docker-compose stop
docker-compose up -d

Run docker-compose help to get more explanations on any extra advanced use of Docker.

If you encounter issues with the Docker builds, please report an issue including the "Image ID" of the Docker images you used from the output of docker images or, if you installed from source, the last commit ID (read from git log).

7. Update to future versions

WARNING: Do not do this unless you are sure of what you are doing. Upgrading to a major new version can potentially break your existing corpora and make it really complex to get your data back.

If you installed from git by pulling our builds from DockerHub, you should be able to update Hyphe to future minor releases by simply doing the following:

docker-compose down
git pull
docker-compose pull
# if needed, edit your configuration files to use new options
docker-compose up -d

Manual install (complex and only for Linux)

If your computer or server relies on an old Linux distribution unable to run Docker, if you want to contribute to Hyphe's backend development, or for any other personal reason, you might prefer to install Hyphe manually by following the manual install instructions.

Please note that there are many dependencies which are not always trivial to install, and that you might run into quite a few issues. You can ask for help by opening an issue and describing your problem; hopefully someone will find some time to try and help you.

Hyphe relies on a web interface with a server daemon which must be running at all times. When manually installed, one must start, stop or restart the daemon using the following command (without sudo):

bin/hyphe <start|restart|stop> [--nologs]

By default the starter will display Hyphe's log in the console using tail. You can use Ctrl+C whenever you like to stop displaying logs without shutting Hyphe down. Use the --nologs option to disable logs display on start. Logs are always accessible from the log directory.

All settings can be configured directly from the global configuration file config/config.json. Restart Hyphe afterwards to take changes into account: bin/hyphe restart.

Serve Hyphe on the web

As soon as the Docker containers or the manual daemon start, you can use Hyphe's web interface on your local machine (with Docker, at http://localhost:<PUBLIC_PORT>/, as described in step 5).

For personal use, you can already work with Hyphe as such. However, if you want to let others use it as well (typically if you installed it on a remote server), you need to serve it on a webserver and make a few adjustments.

Read the dedicated documentation to do so.

Advanced developers features & contributing

Please read the dedicated Developers documentation and the API description.

What's next?

See our roadmap!

Papers & references

Publications about Hyphe

  • OOGHE-TABANOU, Benjamin, JACOMY, Mathieu, GIRARD, Paul & PLIQUE, Guillaume, "Hyperlink is not dead!", in Proceedings of the 2nd International Conference on Web Studies (WS.2 2018), Everardo Reyes, Mark Bernstein, Giancarlo Ruffo, and Imad Saleh (Eds.). ACM, New York, NY, USA, 12-18. DOI: https://doi.org/10.1145/3240431.3240434

  • PLIQUE, Guillaume, JACOMY, Mathieu, OOGHE-TABANOU, Benjamin & GIRARD, Paul, "It's a Tree... It's a Graph... It's a Traph!!!! Designing an on-file multi-level graph index for the Hyphe web crawler", presentation at FOSDEM, Brussels, Belgium, February 3rd, 2018.

  • JACOMY, Mathieu, GIRARD, Paul, OOGHE-TABANOU, Benjamin, et al., "Hyphe, a curation-oriented approach to web crawling for the social sciences", in International AAAI Conference on Web and Social Media. Association for the Advancement of Artificial Intelligence, 2016.

Credits & License

Mathieu Jacomy, Benjamin Ooghe-Tabanou & Guillaume Plique @ Sciences Po médialab

Discover more of our projects at médialab tools.

This work is supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Hyphe is free open source software released under the AGPL 3.0 license.

Thanks to https://www.useragents.me for maintaining a great updated list of common user agents which are reused within Hyphe!

"[...] I hear kainos [(Greek: "now")] in the sense of thick, ongoing presence, with hyphae infusing all sorts of temporalities and materialities."

Donna J. Haraway, Staying with the Trouble: Making Kin in the Chthulucene, p. 2
