read-art
- A readability module modeled on Arc90's.
- Automatically scrape the article from any page.
- Make any web page readable, whether it is in Chinese or English.
Quickly scrape the article title and content from web pages. Suited to node.js crawlers and built to feed ElasticSearch.
Guide
- Features
- Performance
- Installation
- Usage
- Debug
- Score Rule
- Extract Selectors
- Image Fallback
- Threshold
- Customize Settings
- Output
- Notes
How it works
In my case, the spider processes about 1500k documents per day. The maximum crawling speed is 1.2k documents per minute and the average is 1k per minute. Each spider kernel uses about 200 MB of memory. Accuracy is about 90%, and the remaining 10% can be fixed by customizing Score Rules or Selectors. It is better than any other readability module.
(4) Server info:
- 20M fibre-optic bandwidth
- 8 × Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
- 32 GB memory
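As a rough sanity check, the daily and per-minute figures quoted above can be compared with a few lines of arithmetic. This is only a sketch: it calls nothing from read-art, the variable and function names are made up for illustration, and it assumes the spider runs around the clock.

```javascript
// Arithmetic check of the quoted throughput figures (no read-art calls).
// Assumption: the spider runs 24 hours a day.
const docsPerDay = 1500 * 1000;   // "1500k documents per day"
const peakPerMinute = 1200;       // "1.2k per minute" maximum
const minutesPerDay = 24 * 60;    // 1440

// Hypothetical helper: derives comparable rates from the quoted numbers.
function summarizeThroughput() {
  const avgPerMinute = docsPerDay / minutesPerDay;
  return {
    avgPerMinute: Math.round(avgPerMinute),          // close to the quoted "1k per minute"
    avgPerSecond: Number((avgPerMinute / 60).toFixed(1)),
    peakDocsPerDay: peakPerMinute * minutesPerDay    // ceiling if the peak rate were sustained
  };
}

const summary = summarizeThroughput();
console.log(summary);
```

The daily total works out to roughly 1k documents per minute, which matches the quoted average, and a sustained peak rate would give a ceiling of about 1.7 million documents per day.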