Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

Open Semantic Search

https://opensemanticsearch.org

Open Semantic Search is:

  • an integrated search server,
  • an ETL framework for document processing (crawling, text extraction, text analysis, named entity recognition and OCR for images and for images embedded in PDF files),
  • search user interfaces, text mining, text analytics and search apps for fulltext search, faceted search, exploratory search and knowledge graph search.

Documentation

This README.md is documentation for software developers.

Documentation for users and admins

The documentation for users and admins is included in the software packages/images and linked in the search user interface (Menu "Help").

Software architecture

You can find the documentation of the search engine architecture in docs/doc/modules/README.md.

Documentation format

This integrated HTML documentation is generated by the static site generator MkDocs with the config file mkdocs.yml.

The source of the documentation (Markdown format) and the charts (mermaid format) is editable in the directory docs.

Build

How to build the deb package for installation on a Debian or Ubuntu server, or the Docker images for running in Docker containers:

Clone git repositories

Clone the repository including the dependencies:

git clone --recurse-submodules --remote-submodules https://github.com/opensemanticsearch/open-semantic-search.git
cd open-semantic-search

Build deb package

To build a deb package for Debian GNU/Linux or Ubuntu Linux, call the build script build-deb as user root (change user by su or sudo su):

./build-deb

Build Desktop Search VM

How to build an Open Semantic Desktop Search Appliance for VirtualBox is documented in src/open-semantic-desktop-search/README.md.

Build docker images

Build the Docker images using the default docker-compose config docker-compose.yml:

docker-compose build

Run docker containers

After the build, all Docker images, dependencies and services can be started together by docker-compose with the config file docker-compose.yml.

You can start the whole environment by running:

docker-compose up

which will expose the web user interface on port 8080.

You can browse the Open Semantic Search user interface in your favourite browser by this URL:

http://localhost:8080/search/
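Because the services may need some time to start, a script that depends on the user interface can poll the URL first. The following is only a minimal sketch: the helper name wait_for_url and its retry parameters are assumptions for illustration and are not part of this repository.

```shell
# Hypothetical helper (not part of this repository): poll a URL until it
# answers successfully or the number of attempts is exhausted.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-2}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: treat HTTP errors as failure, -s: no progress output,
        # -o /dev/null: discard the response body
        if curl -fs --max-time 5 -o /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example call (commented out so that sourcing this file does not block):
# wait_for_url "http://localhost:8080/search/" 60 5 && echo "UI is reachable"
```

The function returns 0 as soon as a request succeeds and 1 if all attempts fail, so it can be used directly in an if statement or with &&.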

Automated tests

Several kinds of automated tests are available for CI/CD:

Integration tests

The submodule Open Semantic ETL depends on several services such as Solr, spaCy services and Tika-Server, which it accesses via HTTP and REST APIs. Many automated tests therefore run as integration tests within the docker-compose environment configured in docker-compose.etl.test.yml, so that these services are available while the unit tests and integration tests run:

docker-compose -f docker-compose.etl.test.yml build
docker-compose -f docker-compose.etl.test.yml up

End to end tests

Some automated integration tests and end-to-end (E2E) tests run within a web browser controlled by the browser automation framework Playwright and the Node.js/JavaScript-based test framework Jest.

You can extend the automated tests in test/test.js.

They run in the Docker image defined by Dockerfile-test and need the services of the docker-compose environment docker-compose.test.yml:

docker-compose -f docker-compose.test.yml build
docker-compose -f docker-compose.test.yml up
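In a CI job it can be convenient to wrap the build and up steps of both test environments in one helper. The sketch below is an assumption about how such a wrapper could look; the names run_step, run_compose_tests and the DRY_RUN variable are hypothetical and not part of this repository. It uses the docker-compose option --abort-on-container-exit, which stops all containers once any container exits, so the command returns instead of running indefinitely. How the resulting exit code maps to test success depends on the compose configuration and should be checked before relying on it.

```shell
# Hypothetical CI wrapper (names are assumptions, not part of this repository).
# With DRY_RUN=1 the commands are printed instead of executed.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build and start the docker-compose environment given as first argument.
run_compose_tests() {
    compose_file="$1"
    run_step docker-compose -f "$compose_file" build || return 1
    run_step docker-compose -f "$compose_file" up --abort-on-container-exit
}

# Example calls (commented out so that sourcing this file starts nothing):
# run_compose_tests docker-compose.etl.test.yml
# run_compose_tests docker-compose.test.yml
```

Running with DRY_RUN=1 first is a cheap way to review which commands a CI job would execute.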

Dependencies

Dependencies are resolved automatically by building or by installation of the Debian or Ubuntu packages or by building the Docker images.

The following documentation of these dependencies may help when debugging dependency issues or when installing in other environments:

Build dependencies on Source code (GIT)

Dependencies on other Git repositories / submodules of components like Open Semantic ETL are defined in the Git config file .gitmodules.

The submodules will be checked out automatically to the subdirectory src, if you check out this repository by git in recursive mode.
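If a build fails because a component is missing, it can help to check whether every submodule was actually checked out. The following sketch reads the path entries from .gitmodules and reports directories that are missing or empty. The function names are hypothetical, and the script assumes it is run from the root of the repository.

```shell
# Hypothetical helpers (not part of this repository).

# Print the path of every submodule declared in a .gitmodules file.
list_submodule_paths() {
    sed -n 's/^[[:space:]]*path[[:space:]]*=[[:space:]]*//p' "${1:-.gitmodules}"
}

# Print each declared submodule path whose directory is missing or empty.
missing_submodules() {
    list_submodule_paths "${1:-.gitmodules}" | while IFS= read -r sub; do
        if [ -z "$(ls -A "$sub" 2>/dev/null)" ]; then
            echo "$sub"
        fi
    done
}

# Example call (run from the repository root):
# missing_submodules .gitmodules
```

If the function prints any paths, running git submodule update --init --recursive in the repository root fetches the missing submodules.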

Packaging dependencies of Java archives (JAR)

The submodules src/tika-server.deb and src/solr.deb need the JAR of Apache Tika-Server and Apache Solr.

If the JARs are not present, they will be downloaded from the Apache Software Foundation by wget in the build-deb script or in the submodule's Dockerfile.

Installation dependencies on Debian/Ubuntu packages (DEB)

Dependencies on tools and libraries that are available in the Debian or Ubuntu package repositories are defined in the section Depends of the deb package config file DEBIAN/control.

Installation dependencies on Python packages (PIP)

Dependencies on Python libraries that are not available as packages of the Linux distribution but are available in the Python Package Index (PyPI) are defined in

src/open-semantic-etl/src/opensemanticetl/requirements.txt

These dependencies will be installed automatically, either on installation of the Debian/Ubuntu packages by their DEBIAN/postinst script or during docker build as configured in the Dockerfile, by

pip3 install -r /usr/lib/python3/dist-packages/opensemanticetl/requirements.txt
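When debugging a dependency problem in another environment, a first step can be to list which packages the requirements file names at all. The sketch below is a simplified assumption about the file format: it handles comments, blank lines, pip option lines and common version specifiers, but not every syntax that pip accepts. The function name list_requirements is hypothetical.

```shell
# Hypothetical helper (not part of this repository): print the bare package
# names from a pip requirements file, one per line.
list_requirements() {
    # 1. strip comments, 2. strip leading whitespace, 3. drop blank lines,
    # 4. drop pip option lines such as "-r other.txt",
    # 5. cut each line at the first extra, version specifier, marker or space
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e '/^$/d' \
        -e '/^-/d' \
        -e 's/[][<>=!~;[:space:]].*$//' "$1"
}

# Example call (path as given above):
# list_requirements src/open-semantic-etl/src/opensemanticetl/requirements.txt
```

The resulting names can then be passed to pip3 show to see which of them are installed, and pip3 check reports installed packages with incompatible or missing requirements.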

Contributors

Most contributors are not shown by the GitHub user interface as "Contributors" of this repository, since this main repository is structured by Git submodules like Open Semantic ETL and other modules, which are managed in separate Git(Hub) repositories.

So thanks to all (current and former) contributors:

  • Markus Mandalka (@mandalka)
  • @g-braeunlich
  • @maehr
  • @sdinten
  • @wsldankers
  • @rivimey
  • @rbussche
  • @mosea3
  • @bhelou
  • @hpiedcoq
  • @andreclinio
  • @agharbeia
  • @ciyer
  • @davidshq ...

Feel free to extend if you contributed/supported/sponsored in different forms.

More Repositories

  • open-semantic-etl (Python, 258 stars): Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elastic search index & linked data graph database
  • open-semantic-entity-search-api (Python, 177 stars): Open Source REST API for named entity extraction, named entity linking, named entity disambiguation, recommendation & reconciliation of entities like persons, organizations and places for (semi)automatic semantic tagging & analysis of documents by linked data knowledge graph like SKOS thesaurus, RDF ontology, database(s) or list(s) of names
  • open-semantic-search-apps (CSS, 94 stars): Python/Django based webapps and web user interfaces for search, structure (metadata management like thesaurus, ontologies, annotations and named entities) and data import (ETL like text extraction, OCR and crawling filesystems or websites)
  • open-semantic-visual-graph-explorer (HTML, 78 stars): Open Semantic Visual Linked Data Graph Explorer: Open Source tool (web app) and user interface (UI) for discovery, exploration and visualization of direct and indirect connections between named entities like persons, organizations, locations & concepts from thesaurus or ontologies within your documents and knowledge graph
  • solr-ontology-tagger (Python, 46 stars): Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri
  • solr-php-ui (HTML, 21 stars): Solr client and user interface for search
  • solr-relevance-ranking-analysis (Python, 17 stars): Solr Relevance Ranking Analysis and Visualization Tool
  • open-semantic-search-appliance (Shell, 12 stars): Open Semantic Search Appliance (VM)
  • lexemes (Python, 7 stars): Import lexemes (dictionary including different grammar forms/lexical forms for each lexical entry) from Wikidata to Apache Solr synonyms config
  • solr-synonames (Python, 6 stars): Import synonames (multilingual variants of first names from Wikidata) to Solr managed synonyms graph
  • spacy-services.deb (Shell, 5 stars): Debian & Ubuntu package for REST microservices for spaCy natural language processing and machine learning framework for named entity recognition
  • tika-server.deb (Dockerfile, 5 stars): Apache Tika Server as Debian GNU/Linux and Ubuntu Linux package
  • open-semantic-etl-filemonitoring-remote (Python, 5 stars): File monitoring of filesystem by inotify for indexing new/changed files immediately by a remote API on remote search server
  • tesseract-ocr-cache (Python, 5 stars): Tesseract OCR wrapper for Apache Tika and/or Open Semantic ETL caching the OCR results, so Tika-Server or Open Semantic ETL do not have to reprocess slow and expensive OCR on the same images again
  • tika-python.deb (3 stars): tika-python as Debian GNU/Linux and Ubuntu Linux package
  • neo4j.deb (Shell, 2 stars): Debian package of Neo4j graph database preconfigured for Open Semantic ETL and Open Semantic Search
  • solr.deb (Shell, 2 stars): Apache Solr as Debian package with preconfigured schema for Open Semantic ETL and Open Semantic Search