• This repository has been archived on 18/Oct/2019
  • Stars: 140
  • Rank 261,473 (Top 6%)
  • Language: Ruby
  • Created about 13 years ago
  • Updated about 5 years ago

Repository Details

A direct web spider framework for Ruby

direct_web_spider

direct_web_spider is a direct (targeted) web spider framework for Ruby.

Requirements

  • Ruby 1.9.2+
  • sudo apt-get install libcurl4-gnutls-dev
  • bundle install

Features

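  • Abstracts targeted crawling of a website into several independent stages (Fetcher, Paginater, Digger, Parser)
  • Encapsulates the common functionality that crawling requires, so developers only need to care about the logic for extracting information from pages
  • Encoding handling module
  • Logging module, for easier debugging and error tracking
  • Multiple download interfaces (single-threaded, multi-threaded, Event IO) that can be switched at any time
  • Simple image recognition (for e-commerce sites where price information is rendered as an image)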
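
The switchable download interfaces listed above can be pictured as interchangeable strategy objects selected by a mode string. The following is a minimal sketch under stated assumptions, not the project's actual code: the `Downloader` module, its `Normal` and `Threaded` classes, the `build` method, and the injected `fetch` callable are all hypothetical names. Only the `normal` and `ty` modes are shown; the `em` (Event IO) mode is left out.

```ruby
require "thread"

# Hypothetical sketch: these names are illustrative, not the project's API.
module Downloader
  # Single-threaded: fetches URLs one after another in the calling thread.
  class Normal
    def initialize(fetch)
      @fetch = fetch
    end

    def run(urls)
      urls.each_with_object({}) { |url, results| results[url] = @fetch.call(url) }
    end
  end

  # Multi-threaded: a fixed number of worker threads drain a shared queue.
  class Threaded
    def initialize(fetch, pool_size = 4)
      @fetch = fetch
      @pool_size = pool_size
    end

    def run(urls)
      queue = Queue.new
      urls.each { |url| queue << url }
      results = {}
      lock = Mutex.new

      workers = Array.new(@pool_size) do
        Thread.new do
          begin
            # Non-blocking pop raises ThreadError once the queue is empty.
            while (url = queue.pop(true))
              body = @fetch.call(url)
              lock.synchronize { results[url] = body }
            end
          rescue ThreadError
            # Queue drained; this worker is done.
          end
        end
      end

      workers.each(&:join)
      results
    end
  end

  # Mode strings follow the -d values documented in the Run section.
  MODES = { "normal" => Normal, "ty" => Threaded }.freeze

  # Returns a downloader for the given mode; raises KeyError for unknown modes.
  def self.build(mode, fetch)
    MODES.fetch(mode).new(fetch)
  end
end
```

Because both classes expose the same `run(urls)` method, calling code can switch modes by changing only the string passed to `Downloader.build`. The `fetch` argument is any callable that takes a URL and returns the page body, which keeps the sketch independent of a particular HTTP library.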

Run

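  • Run order
    1. ruby script/run_fetcher
    2. ruby script/run_paginater
    3. ruby script/run_digger
    4. ruby script/run_parser
  • Startup arguments: ruby script/run_parser -eproduction -dty -sdangdang -n500
    • -e specifies the environment (production/development). The default is development.
    • -d selects the download method (ty -> multi-threaded, normal -> single-threaded, em -> Event IO). The default is normal; ty is recommended.
    • -s specifies which website to run (dangdang, jingdong, etc.). The default is dangdang.
    • -n specifies how many records to crawl per run. The default is 1000.
    • -h shows help information.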
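
The startup arguments above map naturally onto Ruby's standard `optparse` library. The sketch below is an assumption about how such flags could be parsed, not the project's actual script: `parse_spider_options`, `DEFAULTS`, and the option hash keys are hypothetical names, and only the flags and default values come from the list above.

```ruby
require "optparse"

# Hypothetical defaults, mirroring the values documented above.
DEFAULTS = {
  environment: "development",
  downloader: "normal",
  site: "dangdang",
  number: 1000
}.freeze

# Parses flags such as: -eproduction -dty -sdangdang -n500
# Returns a hash of options; missing flags fall back to DEFAULTS.
def parse_spider_options(argv)
  options = DEFAULTS.dup

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ruby script/run_parser [options]"

    opts.on("-e ENV", %w[production development], "Environment") do |value|
      options[:environment] = value
    end
    opts.on("-d MODE", %w[ty normal em], "Download method") do |value|
      options[:downloader] = value
    end
    opts.on("-s SITE", "Website to run") do |value|
      options[:site] = value
    end
    opts.on("-n COUNT", Integer, "Records to crawl per run") do |value|
      options[:number] = value
    end
    opts.on("-h", "Show help") do
      puts opts
      exit
    end
  end

  parser.parse(argv) # non-destructive: leaves argv untouched
  options
end
```

A script could then call `parse_spider_options(ARGV)` and read, for example, `options[:downloader]` to decide which download interface to use.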

db-china.org
