

CLAVIN


CLAVIN (Cartographic Location And Vicinity INdexer) is an open source software package for document geoparsing and georesolution that employs context-based geographic entity resolution. It combines a variety of open source tools with natural language processing techniques to extract location names from unstructured text documents and resolve them against gazetteer records. Importantly, CLAVIN does not simply "look up" location names; rather, it uses intelligent heuristics-based combinatorial optimization in an attempt to identify precisely which "Springfield" (for example) was intended by the author, based on the context of the document. CLAVIN also employs fuzzy search to handle incorrectly-spelled location names, and it recognizes alternative names (e.g., "Ivory Coast" and "Côte d'Ivoire") as referring to the same geographic entity. By enriching text documents with structured geo data, CLAVIN enables hierarchical geospatial search and advanced geospatial analytics on unstructured data.
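
To make the idea of context-based resolution concrete, here is a toy sketch in Java. It is not CLAVIN's algorithm: it is a much simpler greedy stand-in for the combinatorial optimization described above, and every name in it (`ToyResolver`, `Candidate`, `support`, `resolve`, `demoRegion`) is hypothetical. For each mentioned place name it prefers the candidate whose region also appears among the candidates of other mentions in the same document, and breaks ties by population. The population figures are placeholders, not real statistics.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyResolver {
    /** Hypothetical gazetteer record: a place name, its containing region, and a population. */
    public static final class Candidate {
        public final String name;
        public final String region;
        public final long population;

        public Candidate(String name, String region, long population) {
            this.name = name;
            this.region = region;
            this.population = population;
        }
    }

    /** Counts how many OTHER mentions have at least one candidate in the given region. */
    public static int support(String mention, String region, Map<String, List<Candidate>> mentions) {
        int count = 0;
        for (Map.Entry<String, List<Candidate>> other : mentions.entrySet()) {
            if (other.getKey().equals(mention)) {
                continue; // a mention does not support itself
            }
            for (Candidate o : other.getValue()) {
                if (o.region.equals(region)) {
                    count++;
                    break; // count each other mention at most once
                }
            }
        }
        return count;
    }

    /** Picks, per mention, the candidate with the highest support; ties go to the larger population. */
    public static Map<String, Candidate> resolve(Map<String, List<Candidate>> mentions) {
        Map<String, Candidate> chosen = new LinkedHashMap<>();
        for (Map.Entry<String, List<Candidate>> entry : mentions.entrySet()) {
            Candidate best = null;
            int bestSupport = -1;
            for (Candidate c : entry.getValue()) {
                int s = support(entry.getKey(), c.region, mentions);
                if (s > bestSupport || (s == bestSupport && c.population > best.population)) {
                    best = c;
                    bestSupport = s;
                }
            }
            if (best != null) {
                chosen.put(entry.getKey(), best);
            }
        }
        return chosen;
    }

    /** Demo document mentioning "Springfield" and "Chicago"; populations are placeholder values. */
    public static String demoRegion() {
        Map<String, List<Candidate>> mentions = new LinkedHashMap<>();
        mentions.put("Springfield", List.of(
                new Candidate("Springfield", "Illinois", 100),
                new Candidate("Springfield", "Massachusetts", 200)));
        mentions.put("Chicago", List.of(new Candidate("Chicago", "Illinois", 300)));
        return resolve(mentions).get("Springfield").region;
    }

    public static void main(String[] args) {
        System.out.println(demoRegion()); // prints "Illinois"
    }
}
```

Here the mention of "Chicago" tips the choice toward the Illinois "Springfield" even though the Massachusetts candidate has the larger placeholder population. A plain lookup has no such signal to use.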

CLAVIN natively uses Apache OpenNLP to extract place names from text. CLAVIN now also integrates with Novetta's own AdaptNLP project for place name extraction. To use AdaptNLP, follow the instructions in that repository to bring up an instance of the extractor. Lastly, we also maintain the clavin-nerd project (which will be updated in the near future), which enables CLAVIN to use Stanford NER.

Novetta also maintains the CLAVIN-Rest project, which provides a RESTful microservice wrapper around CLAVIN or CLAVIN-NERD. CLAVIN-Rest includes the configuration and instructions needed to build and run it as a Docker image.

Breaking changes

This release includes a breaking change: all namespaces have been renamed from com.bericotech to com.novetta, reflecting a change in corporate ownership and a realignment to our new domain.

How to build & use CLAVIN:

  1. Check out a copy of the source code:
git clone https://github.com/Novetta/CLAVIN.git
  2. Move into the newly-created CLAVIN directory:
cd CLAVIN
  3. Download the latest version of the allCountries.zip gazetteer file from GeoNames.org:
curl -O http://download.geonames.org/export/dump/allCountries.zip
  4. Unzip the GeoNames gazetteer file:
unzip allCountries.zip
  5. Compile the source code:
mvn compile
  6. Create the Lucene Index (this one-time process will take several minutes):
MAVEN_OPTS="-Xmx4g" mvn exec:java -Dexec.mainClass="com.novetta.clavin.index.IndexDirectoryBuilder"
  7. Run the example program:
MAVEN_OPTS="-Xmx2g" mvn exec:java -Dexec.mainClass="com.novetta.clavin.WorkflowDemo"

If you encounter an error that looks like this:

... InvocationTargetException: Java heap space ...

Increase the memory available to Maven by setting the MAVEN_OPTS environment variable, for example with export MAVEN_OPTS=-Xmx4g or similar.

Once that all runs successfully, feel free to modify the CLAVIN source code to suit your needs.

N.B.: Loading the worldwide gazetteer uses a non-trivial amount of memory. When using CLAVIN in your own programs, if you encounter Java heap space errors (like the one described in Step 7), bump up the maximum heap size for your JVM.
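
If you are unsure how much heap your own program actually has, a quick check at startup can help. The sketch below uses only the standard `Runtime.maxMemory()` call; the class name `HeapCheck`, its helper methods, and the 1800 MiB threshold are assumptions chosen for illustration (a little under the 2 GiB used for the example program above, since the JVM may report slightly less than the `-Xmx` value).

```java
public class HeapCheck {
    /** Maximum heap the JVM will attempt to use, in MiB, as reported by Runtime.maxMemory(). */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True when the reported maximum heap is at least the given number of MiB. */
    public static boolean hasAtLeast(long requiredMiB) {
        return maxHeapMiB() >= requiredMiB;
    }

    public static void main(String[] args) {
        // Hypothetical threshold; adjust it to whatever your workload needs.
        long requiredMiB = 1800;
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        if (!hasAtLeast(requiredMiB)) {
            System.err.println("Heap may be too small for the gazetteer; try a larger -Xmx (e.g. -Xmx2g).");
        }
    }
}
```

Running it with, say, `java -Xmx2g HeapCheck` shows whether the setting took effect before you load the gazetteer.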

Add CLAVIN to your project:

CLAVIN is published to Maven Central. You can add a dependency on the CLAVIN project:

<dependency>
   <groupId>com.novetta</groupId>
   <artifactId>CLAVIN</artifactId>
   <version>3.0.0</version>
</dependency>

You will still need to build the GeoNames Lucene Index as described in steps 3, 4, and 6 in "How to build & use CLAVIN".

Choosing an Extractor

When using this library, you can choose between two extractors: Novetta AdaptNLP and Apache OpenNLP. AdaptNLP requires a running instance of its extractor; see the AdaptNLP repository for setup instructions.

AdaptNLP

Creating an AdaptNlpExtractor:

LocationExtractor extractor = new AdaptNlpExtractor();

OpenNLP

Creating an ApacheExtractor:

LocationExtractor extractor = new ApacheExtractor();

There are also some convenience methods in the GeoParserFactory for Apache OpenNLP.

For example, setting up the Gazetteer, AdaptNLP extractor, and GeoParser from scratch with default settings looks like this:

// the maximum hit depth for CLAVIN searches
private int maxHitDepth = 3;

// the maximum context window for CLAVIN searches
private int maxContextWindow = 5;

// switch controlling use of fuzzy matching
private boolean fuzzy = false;

// adaptnlp host, port
private String host = "http://localhost";
private int port = 5000;

// gazetteer backed by the Lucene index built earlier (pathToLuceneIndex is its directory)
Gazetteer gazetteer = new LuceneGazetteer(new File(pathToLuceneIndex));
// extractor that talks to the running AdaptNLP instance
LocationExtractor extractor = new AdaptNlpExtractor(host, port);
GeoParser parser = new GeoParser(extractor, gazetteer, maxHitDepth, maxContextWindow, fuzzy);

License:

Copyright (C) 2012-2020 Novetta

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
