Web Crawler for Elasticsearch

(Note: River Web is no longer kept in sync with the latest Elasticsearch. Fess is an enterprise search server that provides the same features as River Web. See Fess.)

Elasticsearch River Web

Overview

Elasticsearch River Web is a web crawler application for Elasticsearch. It crawls web sites and extracts their content using CSS queries. (As of version 1.5, River Web is no longer an Elasticsearch plugin; it runs as a standalone application.)

If you are looking for a full-text search server, please see Fess.

Version

River Web   Tested on ES   Download
master      2.4.X          Snapshot
2.4.0       2.4.0          Download
2.0.2       2.3.1          Download
2.0.1       2.2.0          Download
2.0.0       2.1.2          Download

For older versions, see README_ver1.md or README_ver1.5.md.

Issues/Questions

Please file an issue. (The Japanese forum is here.)

Installation

Install River Web

Zip File

$ unzip elasticsearch-river-web-[VERSION].zip

Tar.GZ File

$ tar zxvf elasticsearch-river-web-[VERSION].tar.gz

Usage

Create Index To Store Crawl Data

An index for storing crawl data is needed before starting River Web. For example, to store data in "webindex/my_web", create the index as below:

$ curl -XPUT 'localhost:9200/webindex' -d '
{  
  "settings":{  
    "index":{  
      "refresh_interval":"1s",
      "number_of_shards":"10",
      "number_of_replicas" : "0"
    }
  },
  "mappings":{  
    "my_web":{  
      "properties":{  
        "url":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "method":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "charSet":{  
          "type":"string",
          "index":"not_analyzed"
        },
        "mimeType":{  
          "type":"string",
          "index":"not_analyzed"
        }
      }
    }
  }
}'

Feel free to add any properties other than the above if you need them.
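
To double-check that the index and mapping were created as intended, you can fetch the mapping back (an optional sanity check):

$ curl -XGET 'localhost:9200/webindex/_mapping?pretty'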

Register Crawl Config Data

A crawling configuration is created by registering a document in the .river_web index, as below. This example crawls the sites http://www.codelibs.org/ and http://fess.codelibs.org/.

$ curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://www.codelibs.org/", "http://fess.codelibs.org/"],
    "include_urls" : ["http://www.codelibs.org/.*", "http://fess.codelibs.org/.*"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "target" : [
      {
        "pattern" : {
          "url" : "http://www.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body"
          },
          "bodyAsHtml" : {
            "html" : "body"
          },
          "projects" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      },
      {
        "pattern" : {
          "url" : "http://fess.codelibs.org/.*",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          },
          "body" : {
            "text" : "body",
            "trimSpaces" : true
          },
          "menus" : {
            "text" : "ul.nav-list li a",
            "isArray" : true
          }
        }
      }
    ]
}'

The configuration is:

Property Type Description
index string Name of the index where crawled data is stored.
type string Name of the type where crawled data is stored.
urls array Starting URLs for the crawl.
include_urls array White list of URL patterns to crawl.
exclude_urls array Black list of URL patterns to skip.
max_depth int Maximum crawl depth.
max_access_count int Maximum number of documents to crawl.
num_of_thread int Number of crawler threads.
interval int Interval (ms) between requests.
incremental boolean Enable incremental crawling.
overwrite boolean Delete old documents that have a duplicated URL.
user_agent string User-agent name used when crawling.
robots_txt boolean Set to false to ignore robots.txt.
authentications object BASIC/DIGEST/NTLM authentication info.
target.pattern.url string URL pattern whose contents are extracted by CSS query.
target.properties.name string "name" is used as the property name in the index.
target.properties.name.text string CSS query for the property value (text).
target.properties.name.html string CSS query for the property value (HTML).
target.properties.name.script string Script (e.g. Groovy) to rewrite the property value.
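
The optional properties above can be combined in the same configuration document. As an illustrative sketch only (the values are examples, not recommendations), an incremental crawl that skips PDF URLs and overwrites older duplicates could look like this:

$ curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://www.codelibs.org/"],
    "include_urls" : ["http://www.codelibs.org/.*"],
    "exclude_urls" : ["http://www.codelibs.org/.*\\.pdf"],
    "max_depth" : 3,
    "max_access_count" : 100,
    "num_of_thread" : 5,
    "interval" : 1000,
    "incremental" : true,
    "overwrite" : true,
    "user_agent" : "MyCrawler/1.0",
    "target" : [ ... ]
}'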

Start Crawler

./bin/riverweb --config-id [config doc id] --cluster-name [Elasticsearch Cluster Name] --cleanup

For example,

./bin/riverweb --config-id my_web --cluster-name elasticsearch --cleanup
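
If you want to run the crawl periodically, a plain cron entry is one option (this is not a River Web feature; the installation path below is only an assumption about where the archive was extracted):

# hypothetical cron entry: crawl every day at 02:00
0 2 * * * /opt/elasticsearch-river-web/bin/riverweb --config-id my_web --cluster-name elasticsearch --cleanup >> /var/log/riverweb.log 2>&1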

Unregister Crawl Config Data

If you want to stop the crawler, kill the crawler process and then delete the config document as below:

$ curl -XDELETE 'localhost:9200/.river_web/config/my_web'
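
To confirm that the configuration is gone, fetch it again; the response should contain "found" : false:

$ curl -XGET 'localhost:9200/.river_web/config/my_web?pretty'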

Examples

Full-Text Search for Your Site (e.g. http://fess.codelibs.org/)

$ curl -XPUT 'localhost:9200/.river_web/config/fess_site' -d '{
    "index" : "webindex",
    "type" : "fess_site",
    "urls" : ["http://fess.codelibs.org/"],
    "include_urls" : ["http://fess.codelibs.org/.*"],
    "max_depth" : 3,
    "max_access_count" : 1000,
    "num_of_thread" : 5,
    "interval" : 1000,
    "target" : [
      {
        "pattern" : {
            "url" : "http://fess.codelibs.org/.*",
            "mimeType" : "text/html"
        },
        "properties" : {
            "title" : {
                "text" : "title"
            },
            "body" : {
                "text" : "body",
                "trimSpaces" : true
            }
        }
      }
    ]
}'
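
Once the crawl has run, the stored documents can be searched like any other Elasticsearch documents. For example, a simple match query on the body field defined above (the search term is arbitrary):

$ curl -XGET 'localhost:9200/webindex/fess_site/_search?pretty' -d '{
  "query" : { "match" : { "body" : "crawler" } }
}'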

Extract title/content from news.yahoo.com

$ curl -XPUT 'localhost:9200/.river_web/config/yahoo_site' -d '{
    "index" : "webindex",
    "type" : "my_web",
    "urls" : ["http://news.yahoo.com/"],
    "include_urls" : ["http://news.yahoo.com/.*"],
    "max_depth" : 1,
    "max_access_count" : 10,
    "num_of_thread" : 3,
    "interval" : 3000,
    "user_agent" : "Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko",
    "target" : [
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/video/.*html",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "title"
          }
        }
      },
      {
        "pattern" : {
          "url" : "http://news.yahoo.com/.*html",
          "mimeType" : "text/html"
        },
        "properties" : {
          "title" : {
            "text" : "h1.headline"
          },
          "content" : {
            "text" : "section#mediacontentstory p"
          }
        }
      }
    ]
}'

(If the layout of news.yahoo.com changes, the CSS queries above need to be updated accordingly.)

Others

BASIC/DIGEST/NTLM authentication

River Web supports BASIC/DIGEST/NTLM authentication. Set the "authentications" object:

...
"num_of_thread" : 5,
"interval" : 1000,
"authentications":[
  {
    "scope": {
      "scheme":"BASIC"
    },
    "credentials": {
      "username":"testuser",
      "password":"secret"
    }
  }],
"target" : [
...

The configuration is:

Property Type Description
authentications.scope.scheme string BASIC, DIGEST or NTLM.
authentications.scope.host string (Optional) Target hostname.
authentications.scope.port int (Optional) Port number.
authentications.scope.realm string (Optional) Realm name.
authentications.credentials.username string Username.
authentications.credentials.password string Password.
authentications.credentials.workstation string (Optional) Workstation for NTLM.
authentications.credentials.domain string (Optional) Domain for NTLM.

For example, to authenticate as an Active Directory user, the configuration is:

"authentications":[
  {
    "scope": {
      "scheme":"NTLM"
    },
    "credentials": {
      "domain":"your.ad.domain",
      "username":"taro",
      "password":"himitsu"
    }
  }],

Use attachment type

River Web supports Elasticsearch's attachment type. For example, create a mapping that contains an attachment field:

curl -XPUT "localhost:9200/web/test/_mapping?pretty" -d '{
  "test" : {
    "properties" : {
...
      "my_attachment" : {
          "type" : "attachment",
          "fields" : {
            "file" : { "index" : "no" },
            "title" : { "store" : "yes" },
            "date" : { "store" : "yes" },
            "author" : { "store" : "yes" },
            "keywords" : { "store" : "yes" },
            "content_type" : { "store" : "yes" },
            "content_length" : { "store" : "yes" }
          }
      }
...

and then register your crawl config and start the crawler. In the "properties" object, when the value of "type" is "attachment", the content of the crawled URL is stored as base64-encoded data.

curl -XPUT localhost:9200/.river_web/config/2 -d '{
      "index" : "web",
      "type" : "data",
      "urls" : "http://...",
...
      "target" : [
...
        {
          "settings" : {
            "html" : false
          },
          "pattern" : {
            "url" : "http://.../.*"
          },
          "properties" : {
            "my_attachment" : {
              "type" : "attachment"
            }
          }
        }
      ]
...
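
After crawling, the extracted sub-fields of the attachment can be queried and returned like other stored fields. A sketch against the mapping above, assuming the attachment mapper extracts title and content_type as usual (the search term is arbitrary):

$ curl -XGET 'localhost:9200/web/test/_search?pretty' -d '{
  "query" : { "match" : { "my_attachment.title" : "report" } },
  "fields" : [ "my_attachment.title", "my_attachment.content_type" ]
}'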

Use Multibyte Characters

An example for a Japanese environment is below. First, put the following configuration files into the conf directory of Elasticsearch.

$ cd $ES_HOME/conf    # ex. /etc/elasticsearch if using rpm package
$ sudo wget https://raw.github.com/codelibs/fess-server/master/src/tomcat/solr/core1/conf/mapping_ja.txt
$ sudo wget http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/lang/stopwords_ja.txt 

and then create "webindex" index with analyzers for Japanese. (If you want to use uni-gram, remove cjk_bigram in filter)

$ curl -XPUT "localhost:9200/webindex" -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "custom",
          "char_filter" : ["mappingJa"],
          "tokenizer" : "standard",
          "filter" : ["word_delimiter", "lowercase", "cjk_width", "cjk_bigram"]
        }
      },
      "char_filter" : {
        "mappingJa": {
          "type" : "mapping",
          "mappings_path" : "mapping_ja.txt"
        }
      },
      "filter" : {
        "stopJa" : {
          "type" : "stop",
          "stopwords_path" : "stopwords_ja.txt"
        }
      }
    }
  }
}'
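
To verify that the custom analyzer is used as the index default, a quick _analyze request against the index can help (the sample text is arbitrary):

$ curl -G 'localhost:9200/webindex/_analyze?analyzer=default&pretty' --data-urlencode 'text=全文検紒サーバー'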

Rewrite a property value by Script

River Web allows you to rewrite crawled data with Java's ScriptEngine; "javascript" is available. In the "properties" object, add a "script" value to the property you want to rewrite.

...
        "properties" : {
...
          "flag" : {
            "text" : "body",
            "script" : "value.indexOf('Elasticsearch') > 0 ? 'yes' : 'no';"
          },

In the above, if the text of the body element contains "Elasticsearch", the "flag" property is set to "yes"; otherwise it is set to "no".
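
As another illustrative sketch (the "titleLower" property name is hypothetical), a script can also normalize a value, for example lower-casing the page title:

...
        "properties" : {
...
          "titleLower" : {
            "text" : "title",
            "script" : "value.toLowerCase();"
          },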

Use HTTP proxy

Put "proxy" property in "crawl" property.

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "index" : "webindex",
    "type" : "my_web",
...
        "proxy" : {
          "host" : "proxy.server.com",
          "port" : 8080
        },

Specify the next URLs to crawl

If the "isChildUrl" property is set to true, the property values are used as the next URLs to crawl.

...
    "target" : [
      {
...
        "properties" : {
          "childUrl" : {
            "value" : ["http://fess.codelibs.org/","http://fess.codelibs.org/ja/"],
            "isArray" : true,
            "isChildUrl" : true
          },

Intercept start/execute/finish/close actions

You can insert your own scripts into the crawler lifecycle, such as Executing Crawler (execute) and Finished Crawler (finish). To insert scripts, put a "script" property as below:

curl -XPUT 'localhost:9200/.river_web/config/my_web' -d '{
    "script":{
      "execute":"your script...",
      "finish":"your script...",
    },
    ...

FAQ

What does "No scraping rule." mean?

In a river setting, "url" is starting urls to crawl a site, "include_urls" filters urls whether are crawled or not, and "target.pattern.url" is a rule to store extracted web data. If a crawling url does not match "target.pattern.url", you would see the message. Therefore, it means the crawled url does not have an extraction rule.

How to extract an attribute of a meta tag

For example, to grab the content attribute of the description meta tag, the configuration is:

...
"target" : [
...
  "properties" : {
...
    "meta" : {
      "attr" : "meta[name=description]",
      "args" : [ "content" ]
    },

Incremental crawling does not work?

The "url" field needs to be "not_analyzed" in the mapping of your stored index. See Create Index To Store Crawl Data.

Where is crawled data stored?

Crawled data is stored in the ".s2robot" index during crawling, the data extracted from it is stored in the index specified by your configuration, and the data in the ".s2robot" index is removed when the crawler finishes.
