• This repository has been archived on 22/Jan/2020
  • Stars: 453
  • Rank 96,573 (Top 2 %)
  • Language: Shell
  • License: Other
  • Created over 11 years ago
  • Updated over 11 years ago

dhtcrawler2

dhtcrawler is a DHT crawler written in Erlang. It can join a DHT network and crawl many P2P torrents. The program saves all torrent info into a database and provides an HTTP interface for searching torrents by keyword.

dhtcrawler2 is an extended version of dhtcrawler. It crawls considerably faster and is much more stable.

This git branch maintains pre-compiled Erlang files so that dhtcrawler2 can be started directly. You don't need to compile it yourself; just download it and run it to collect torrents and search them by keyword.

Enjoy it!

Usage

  • Install Erlang R16B or newer

  • Download MongoDB and start it first:

      mongod --dbpath your-database-path --setParameter textSearchEnabled=true
    
  • Start the crawler; on Windows, just double-click win_start_crawler.bat

  • Start hash_reader; on Windows, just double-click win_start_hash.bat

  • Start httpd; on Windows, just double-click win_start_http.bat

  • Wait several minutes, then check localhost:8000
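
The last step above amounts to polling until the HTTP front-end answers. Below is a minimal sketch of such a wait loop in POSIX shell. It is not part of dhtcrawler2; the function name wait_until and its arguments are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with dhtcrawler2.
# wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it succeeds or MAX_TRIES attempts are used up.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        try=$((try + 1))
        # Do not sleep after the final failed attempt.
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Assuming curl is installed and httpd listens on the port mentioned above, `wait_until 60 5 curl -fsS http://localhost:8000/` would retry every 5 seconds for up to 60 attempts. The retry count and delay are arbitrary choices, not values taken from the project.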

You can also compile the source code and run it manually. The source code is in the src branch of this repo.

You can also find more technical information on my blog (in Chinese): codemacro.com

Source code

dhtcrawler is fully open source and can be used for any purpose, but please keep my name and copyright notice. You can check out the dhtcrawler2 source code in the src branch of this git repo.

Config

Most config values are in priv/dhtcrawler.config; this file is generated automatically the first time you run dhtcrawler. The other config values are passed as arguments to Erlang functions. In most cases you don't need to change these values, except for the network addresses.

Mongodb Replica set

This is not related to dhtcrawler, only to MongoDB, so try to figure it out yourself.

Another http front-end

Yes, of course you can write another HTTP front-end UI on top of the torrent database. If you're interested, I can help you with the database format.

Sphinx

Yes, dhtcrawler2 supports Sphinx search. A tool named sphinx-builder loads torrents from the database and creates the Sphinx index. crawler-http can also search text through Sphinx.

dhtcrawler2 uses MongoDB text search by default. To use Sphinx, follow the steps below:

  • Download Sphinx. The tested version is coreseek4.1, a Sphinx fork that supports Chinese characters.
  • Unzip the binary archive and add its bin directory to the PATH environment variable so that dhtcrawler can invoke the indexer tool
  • Configure the etc/csft.conf file
    • add a delta index, e.g.:

        source delta:xml
        {
            type = xmlpipe2
            xmlpipe_command = cat g:/downloads/coreseek-4.1-win32/var/test/delta.xml
        }
        index delta:xml
        {
            source = delta
            path = g:/downloads/coreseek-4.1-win32/var/data/delta
        }
      
    • change the other directories; absolute paths are recommended

  • Run win_init_sphinx_index.bat to generate a default sphinx-builder config file, then terminate win_init_sphinx_index.bat
  • Edit priv/sphinx_builder.config: specify the main and delta Sphinx index source file names, the main and delta index names, and the Sphinx config file. These names must match the settings you wrote in etc/csft.conf
  • Run win_init_sphinx_index.bat again to initialize the Sphinx index files, then terminate it. Once the index has been initialized successfully, never run it again
  • Run the Sphinx searchd server
  • Run win_start_sphinx_builder to start sphinx-builder; it reads torrents from your torrent database and builds them into the Sphinx index
  • Set search_method to sphinx in priv/hash_reader.config so that hash_reader no longer builds the MongoDB text search index
  • Set search_method to sphinx in priv/httpd.config so that crawler-http searches keywords through Sphinx

Lots of details! You had better know Sphinx well.
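
The delta source above uses type xmlpipe2, which means the indexer reads an XML document set from the output of xmlpipe_command. To give a feel for that format, here is a minimal shell sketch that writes such a document set. It is only an illustration: the field name "name", the tab-separated input, and the functions xml_escape and emit_docset are assumptions made here, not the schema or code that sphinx-builder actually uses.

```shell
#!/bin/sh
# Hypothetical illustration of the Sphinx xmlpipe2 format; the real
# sphinx-builder schema may differ.

# xml_escape TEXT: escape the characters that are special in XML text.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# emit_docset: read "ID<TAB>NAME" lines on stdin and write an xmlpipe2
# document set with a single full-text field called "name" (assumed).
emit_docset() {
    printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>'
    printf '%s\n' '<sphinx:docset>'
    printf '%s\n' '<sphinx:schema>'
    printf '%s\n' '<sphinx:field name="name"/>'
    printf '%s\n' '</sphinx:schema>'
    tab=$(printf '\t')
    while IFS="$tab" read -r id name; do
        # Skip empty lines; Sphinx requires a unique positive integer ID.
        [ -n "$id" ] || continue
        printf '<sphinx:document id="%s">\n' "$id"
        printf '<name>%s</name>\n' "$(xml_escape "$name")"
        printf '%s\n' '</sphinx:document>'
    done
    printf '%s\n' '</sphinx:docset>'
}
```

For example, `printf '1\tsome torrent name\n' | emit_docset > delta.xml` would produce a file that an xmlpipe_command such as the cat invocation above could feed to the indexer, provided the schema matches what the rest of your Sphinx configuration expects.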

LICENSE

See LICENSE.txt
