• Stars
    star
    632
  • Rank 70,884 (Top 2 %)
  • Language
    Java
  • Created over 7 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Parser Microservice for the YaCy Grid

YaCy Grid Component: Parser

The YaCy Grid is the second-generation implementation of YaCy, a peer-to-peer search engine. A YaCy Grid installation consists of a set of micro-services which communicate with each other using the MCP, see https://github.com/yacy/yacy_grid_mcp

Purpose

The Parser is a microservices which can be deployed i.e. using Docker. When the Parser Component is started, it searches for a MCP and connects to it. By default the local host is searched for a MCP but you can configure one yourself.

What it does

The Parser is able to read a WARC file and parses it's content. The content is analyzed, the plain text, links, images and more entities are extracted. The result is stored in a JSON Object. Calling the parser will generate a list of JSON Objects, each containing the analyzed content of one internet resource. The parser understands not only HTML but also a wide range of different document formats, including PDF, all OpenOffice and MS Office document formats and much more.

Installation: Download, Build, Run

At this time, yacy_grid_parser is not provided in compiled form, you easily build it yourself. It's not difficult and done in one minute! The source code is hosted at https://github.com/yacy/yacy_grid_parser, you can download it and run loklak with:

> git clone --recursive https://github.com/yacy/yacy_grid_parser.git
> cd yacy_grid_parser
> gradle run

This repository uses git submodules to integrate yacy_grid_mcp into yacy_grid_parser. In case that you clones this repository without the --recursive do now:

> git submodule update --init --recursive

The submodules require, that each subsequent

> git pull origin master

requires also a pull for the submodules, in case anything has changed there. You can do that easily with:

> git submodule foreach git pull origin master

Example for Parsing a set of Documents

For this example, a hosted version of yacy_grid_parser is provided at http://yacygrid.com:8500. The example shows, how a web site is crawled using wget, then parsed with yacy_grid_parser and finally indexed with yacy_search_server ('legacy' YaCy/1.x) using the surrogate dump reading method:

First, crawl a site (here:publicplan.de):

> wget -r -l3 "https://www.publicplan.de/" --warc-file="publicplan.de"

This produces the file "publicplan.de.warc.gz". That file can then be send to the hosted yacy_grid_parser with:

> curl -X POST -F "[email protected]" -F "flatfile=true" -o "publicplan.de.flatjson" http://localhost:8500/yacy/grid/parser/parser.json

The result is a file "publicplan.de.flatjson" with 774 single JSON objects, each printed in it's own line in the flatjson file. To index that file with legacy YaCy (YaCy/1.x) just copy it into the yacy_search_server/DATA/SURROGATES/in/ path.

Contribute

This is a community project and your contribution is welcome!

  1. Check for open issues or open a fresh one to start a discussion around a feature idea or a bug.
  2. Fork the repository on GitHub to start making your changes (branch off of the master branch).
  3. Write a test that shows the bug was fixed or the feature works as expected.
  4. Send a pull request and bug us on Gitter until it gets merged and published. :)

What is the software license?

LGPL 2.1

Have fun!

@0rb1t3r

More Repositories

1

yacy_search_server

Distributed Peer-to-Peer Web Search Engine and Intranet Search Appliance
Java
3,077
star
2

yacy_grid_mcp

The YaCy Grid Master Connect Program
Java
654
star
3

cider

"Content Integration Framework: Document Extraction and Retrieval" - A document parser framework that stores parsed entities into jena ( http://jena.sourceforge.net/ ) RDF vocabularies and provides knowledge-base enhanced semantic ananlysis of content. Annotated content can be used by search engines to present content navigation which will be implemented in the YaCy Search Engine
Java
650
star
4

yacy_webclient_bootstrap

YaCy Search Client using bootstrapcss
HTML
639
star
5

yacy_grid_crawler

Crawler Microservice for the YaCy Grid
Java
636
star
6

yacy_search_androidclient

An Android App which searches on a YaCy search server
Java
625
star
7

yacy_grid_loader

A headless browser as loader microservice in the YaCy Grid
Java
620
star
8

YaCyBar

A firefox toolbar for YaCy
JavaScript
617
star
9

yacy_webclient_yaml4

A web client for a YaCy search server based on yaml4 css
CSS
615
star
10

yacy_webclient_authentication

Authentication layer for a YaCy webclient
PHP
613
star
11

yacy_artwork

Yacy Artwork
610
star
12

yacy_docs

Documentation Project for Yacy
603
star
13

yacy_grid_framework

Starting Point for a local YaCy Grid
Shell
433
star
14

yacy_forum_archive

A back-up of the YaCy forum which lived until february 2018
HTML
425
star
15

yacy_grid_cluster

Management tools for a Raspberry Pi Demonstration Cluster of the YaCy Grid
JavaScript
287
star
16

yacy_grid_search

Search API and Search Aggregation module for the YaCy Grid
Java
92
star
17

searchlab

Portal for YaCy Grid and Data Science Applications
Java
19
star
18

searchlab_apps

Search Apps for the Searchlab
CSS
9
star