Repository Details

Entity Extraction Text Processor

Baleen

Baleen 2 has now reached the end of its life, and we encourage all users to move to Baleen 3. There is no intention to provide further updates or any significant support for Baleen 2.

Baleen is an extensible text processing capability that allows entity-related information to be extracted from unstructured and semi-structured data sources. It takes things of interest that would otherwise be locked in formats such as text documents (references to people, organisations, unique identifiers and locations, for example) and makes them available in a structured format.
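To make the idea of turning unstructured text into structured entities concrete, here is a minimal, self-contained sketch. It is not Baleen's implementation and does not use Baleen or UIMA classes: the EntitySketch class, the Entity type, and the simplified email pattern are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: NOT part of Baleen. Shows free text going in and
// structured records (type, value, character offsets) coming out.
public class EntitySketch {

    /** One thing of interest found in the text, with its position. */
    public static final class Entity {
        public final String type;
        public final String value;
        public final int begin;
        public final int end;

        Entity(String type, String value, int begin, int end) {
            this.type = type;
            this.value = value;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + "]=" + value;
        }
    }

    // Deliberately simplified pattern; real email address syntax is broader.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    /** Returns every match of the simplified email pattern as an Entity. */
    public static List<Entity> extractEmails(String text) {
        List<Entity> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(new Entity("EmailAddress", m.group(), m.start(), m.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Contact alice@example.org or bob@example.com for details.";
        for (Entity e : extractEmails(text)) {
            System.out.println(e);
        }
    }
}
```

A regular expression is only the simplest possible stand-in here; identifying people, organisations or locations generally needs more than pattern matching.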

Baleen is written in Java 8 using the software project management tool Maven 3 and draws heavily on the Apache Unstructured Information Management Architecture (UIMA) which provides a framework, components and infrastructure to handle unstructured information management.

Baleen was written by the Defence Science and Technology Laboratory (Dstl) in support of UK Defence users looking to extract entities and search unstructured text documents. License information can be found in the accompanying LICENSE.txt file in this repository and the licenses of libraries on which Baleen is dependent are listed in the file THIRD-PARTY.txt.

Baleen is still under active development, and is released here not as a final product but as a work in progress. As such, there may be bugs, issues, typos, mistakes in the documentation, and more. We hope that contributions from other users will improve Baleen and result in a better framework for others to use.

Upgrading to 2.4 and later

Baleen 2.4 and later versions contain a number of changes that may make them incompatible with older pipelines. For upgrade guidance, see the Upgrading Between Versions wiki page.

Getting Started

Baleen includes an in-built server, which hosts full documentation and guides on how to use Baleen. To get started, you will need to launch this server and read this documentation. To launch the server, run the following command.

java -jar baleen-2.8.0-SNAPSHOT.jar

Once running, the server can be accessed at http://localhost:6413.
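If you want to confirm from code that the server has come up, a small check along the following lines may help. This is a sketch under assumptions: ServerCheck and isReachable are hypothetical names, not part of Baleen, and the check treats any HTTP response from the address above as "running" without inspecting the content.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative helper, not part of Baleen: checks whether something answers
// HTTP requests at the given address.
public class ServerCheck {

    /** Returns true if an HTTP GET to the URL receives any HTTP response. */
    public static boolean isReachable(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            // Assumes an http:// or https:// URL; other schemes are not handled.
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            // getResponseCode() returns -1 if the reply is not valid HTTP.
            return conn.getResponseCode() > 0;
        } catch (IOException e) {
            // Malformed URL, connection refused, timeout, and similar failures.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://localhost:6413";
        boolean up = isReachable(url, 2000);
        System.out.println(url + (up ? " is responding" : " is not reachable"));
    }
}
```

Run it after starting the JAR; the server may take a little while to begin answering, so retrying after a short wait is reasonable.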

If you require the Javadoc to be available through the in-built server, then you should place the Baleen Javadoc JAR in the same directory as the Baleen JAR.

Prerequisites

Running

To run Baleen, you will need:

  • A sensible amount of RAM. Start with 4GB and alter according to the number of annotators being employed.
  • Java 8 or later
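The two requirements above can be checked from inside a JVM with a short program such as the following. It is a sketch, not something shipped with Baleen: PrereqCheck and majorVersion are hypothetical names, and the only platform facilities used are the standard java.specification.version system property and Runtime.maxMemory().

```java
// Illustrative check, not part of Baleen: reports the Java version and the
// maximum heap available to the JVM it runs in.
public class PrereqCheck {

    /** Maps "1.8" to 8 and "11" to 11 (the two specification version styles). */
    public static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        int first = Integer.parseInt(parts[0]);
        // Versions up to Java 8 are reported as "1.x"; later ones as "x".
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    public static void main(String[] args) {
        int major = majorVersion(System.getProperty("java.specification.version"));
        // maxMemory() is the most heap this JVM will try to use; it can be
        // slightly lower than the configured maximum.
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024L * 1024L);

        System.out.println("Java major version: " + major);
        System.out.println("Maximum heap (MiB): " + maxHeapMiB);
        if (major < 8) {
            System.out.println("Warning: Java 8 or later is required.");
        }
    }
}
```

The reported heap is the JVM's own limit, not the machine's physical RAM. The standard JVM option -Xmx (for example -Xmx4g, placed before -jar on the command line) sets that limit explicitly.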

Developing

To develop with Baleen, we suggest you use:

  • Oracle Java JDK 1.8
  • Eclipse Mars or greater (assumed to include Maven)
  • Maven

Baleen requires Java 8 or later.

Licence

Crown copyright 2017-2019

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Data

Baleen contains data derived from other data sources. For more information, please refer to the Baleen source code.

Code-Point Open

Licensed under the Open Government Licence (OGL) v3 - http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

Contains OS data (c) Crown copyright and database right 2015

Contains Royal Mail data (c) Royal Mail copyright and database right 2015

Contains National Statistics data (c) Crown copyright and database right 2015

Countries JSON

Licensed under the ODC Open Database Licence (ODbL) 1.0 - http://opendatacommons.org/licenses/odbl/1.0/

Any rights in individual contents of the database are licensed under the Database Contents License - http://opendatacommons.org/licenses/dbcl/1.0/

Countries GeoJSON

Licensed under the ODC Public Domain Dedication and Licence (PDDL) 1.0 - http://opendatacommons.org/licenses/pddl/1.0/

OpenNLP Language Models

Licensed under the Apache Software License 2.0 - http://www.apache.org/licenses/LICENSE-2.0
