  • Stars: 720
  • Rank: 62,908 (Top 2%)
  • Language: Python
  • License: MIT License
  • Created over 8 years ago
  • Updated about 3 years ago

Repository Details

Full text geoparsing as a Python library. Extract the place names from a piece of English-language text, resolve them to the correct place, and return their coordinates and structured geographic information.

Mordecai is ready for an upgrade! Please take the user survey here to help shape what v3 will look like.

Example usage

>>> from mordecai import Geoparser
>>> geo = Geoparser()
>>> geo.geoparse("I traveled from Oxford to Ottawa.")

[{'country_conf': 0.96474487,
  'country_predicted': 'GBR',
  'geo': {'admin1': 'England',
   'country_code3': 'GBR',
   'feature_class': 'P',
   'feature_code': 'PPLA2',
   'geonameid': '2640729',
   'lat': '51.75222',
   'lon': '-1.25596',
   'place_name': 'Oxford'},
  'spans': [{'end': 22, 'start': 16}],
  'word': 'Oxford'},
 {'country_conf': 0.83302397,
  'country_predicted': 'CAN',
  'geo': {'admin1': 'Ontario',
   'country_code3': 'CAN',
   'feature_class': 'P',
   'feature_code': 'PPLC',
   'geonameid': '6094817',
   'lat': '45.41117',
   'lon': '-75.69812',
   'place_name': 'Ottawa'},
  'spans': [{'end': 32, 'start': 26}],
  'word': 'Ottawa'}]
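
The result is a plain list of dictionaries, so it can be post-processed without Mordecai itself. The sketch below assumes the structure shown above (string-valued 'lat' and 'lon', character offsets in 'spans'). The helper name summarize is hypothetical, and skipping entries that lack a 'geo' block is an assumption about how unresolved place names are returned.

```python
def summarize(text, results):
    """Flatten geoparse-style results into small dicts with numeric coordinates."""
    rows = []
    for entry in results:
        geo = entry.get("geo")
        if geo is None:
            # Assumption: a mention that could not be resolved has no 'geo' block.
            continue
        # Each span should slice the original text back to the matched word.
        spans_ok = all(
            text[s["start"]:s["end"]] == entry["word"] for s in entry["spans"]
        )
        rows.append({
            "word": entry["word"],
            "country": geo["country_code3"],
            "lat": float(geo["lat"]),   # coordinates arrive as strings
            "lon": float(geo["lon"]),
            "spans_ok": spans_ok,
        })
    return rows


# Trimmed copy of the example output above (only the fields used here).
text = "I traveled from Oxford to Ottawa."
results = [
    {"word": "Oxford", "spans": [{"start": 16, "end": 22}],
     "geo": {"country_code3": "GBR", "lat": "51.75222", "lon": "-1.25596"}},
    {"word": "Ottawa", "spans": [{"start": 26, "end": 32}],
     "geo": {"country_code3": "CAN", "lat": "45.41117", "lon": "-75.69812"}},
]

for row in summarize(text, results):
    print(row["word"], row["country"], row["lat"], row["lon"], row["spans_ok"])
```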

Mordecai requires a running Elasticsearch service with Geonames in it. See "Installation and Requirements" below for instructions.

Installation and Requirements

  1. Mordecai is on PyPI and can be installed for Python 3 with pip:

     pip install mordecai

     Note: It's strongly recommended that you run Mordecai in a virtual environment. The libraries that Mordecai depends on are not always the most recent versions and using a virtual environment prevents libraries from being downgraded or running into other issues:

     python -m venv mordecai-env
     source mordecai-env/bin/activate
     pip install mordecai

  2. You should then download the required spaCy NLP model:

     python -m spacy download en_core_web_lg

  3. In order to work, Mordecai needs access to a Geonames gazetteer running in Elasticsearch. The easiest way to set it up is by running the following commands (you must have Docker installed first):

     docker pull elasticsearch:5.5.2
     wget https://andrewhalterman.com/files/geonames_index.tar.gz --output-file=wget_log.txt
     tar -xzf geonames_index.tar.gz
     docker run -d -p 127.0.0.1:9200:9200 -v $(pwd)/geonames_index/:/usr/share/elasticsearch/data elasticsearch:5.5.2

See the es-geonames repository for the code used to produce this index.

To update the index, shut down the old container, re-download the index archive from the URL above, and restart the container with the new index.
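
Before instantiating Mordecai it can help to confirm that the container is actually answering on the published port. The following is a minimal sketch using only the standard library. It assumes the Docker setup above (Elasticsearch on 127.0.0.1:9200 without authentication); the function name elasticsearch_info is hypothetical, and the check only looks at the Elasticsearch root endpoint, not at the Geonames index itself.

```python
import http.client
import json
import urllib.request


def elasticsearch_info(url="http://127.0.0.1:9200", timeout=2.0):
    """Return the JSON served at the Elasticsearch root URL, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return None


if __name__ == "__main__":
    info = elasticsearch_info()
    if info is None:
        print("Elasticsearch is not reachable on 127.0.0.1:9200")
    else:
        # The root endpoint reports the server version under version.number.
        print("Elasticsearch version:", info.get("version", {}).get("number"))
```

Elasticsearch can take a little while to start after docker run returns, so a None result immediately after starting the container does not necessarily mean the setup is broken.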

Citing

If you use this software in academic work, please cite as

@article{halterman2017mordecai,
  title={Mordecai: Full Text Geoparsing and Event Geocoding},
  author={Halterman, Andrew},
  journal={The Journal of Open Source Software},
  volume={2},
  number={9},
  year={2017},
  doi={10.21105/joss.00091}
}

How does it work?

Mordecai takes in unstructured text and returns structured geographic information extracted from it.

  • It uses spaCy's named entity recognition to extract placenames from the text.

  • It uses the geonames gazetteer in an Elasticsearch index (with some custom logic) to find the potential coordinates of extracted place names.

  • It uses neural networks implemented in Keras and trained on new annotated English-language data labeled with Prodigy to infer the correct country and correct gazetteer entries for each placename.

The training data for the two models includes copyrighted text so cannot be shared freely. Applying Mordecai to non-English language text would require labeling data in the target language and retraining.
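
As a rough illustration of the first step only, the sketch below filters spaCy named entities down to place-like mentions. It is not Mordecai's own code: the helper names are hypothetical, and keeping only the GPE and LOC labels is an assumption, not necessarily the label set Mordecai uses. The spaCy import is deferred so the filtering logic can be tried with stand-in entities.

```python
from collections import namedtuple

# Assumption: these spaCy entity labels are treated as place names.
PLACE_LABELS = frozenset({"GPE", "LOC"})


def place_mentions(ents, labels=PLACE_LABELS):
    """Return (text, start_char, end_char) for entities whose label is place-like."""
    return [(e.text, e.start_char, e.end_char) for e in ents if e.label_ in labels]


def extract_places(text, model="en_core_web_lg"):
    """Run spaCy NER over text and keep the place-like entities."""
    import spacy  # imported lazily; requires spaCy and the downloaded model

    nlp = spacy.load(model)
    return place_mentions(nlp(text).ents)


# Stand-in entities with the same attributes as spaCy spans, so the filter
# can be demonstrated without loading a model.
Ent = namedtuple("Ent", "text label_ start_char end_char")
demo_ents = [
    Ent("Oxford", "GPE", 16, 22),
    Ent("Ottawa", "GPE", 26, 32),
    Ent("Tuesday", "DATE", 36, 43),
]
print(place_mentions(demo_ents))
```

The remaining two steps (candidate lookup in Elasticsearch and model-based selection of the country and gazetteer entry) depend on the running index and the trained models, so they are not sketched here.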

API and Configuration

When instantiating the Geoparser() module, the following options can be changed:

  • es_hosts : List of hosts where the Geonames Elasticsearch service is running. Defaults to ['localhost'], which is where it runs if you're using the default Docker setup described above.
  • es_port : Port the Geonames Elasticsearch service is running on. Defaults to 9200, the port published by the Docker setup above.
  • es_ssl : Whether Elasticsearch requires an SSL connection. Defaults to False.
  • es_auth : Optional HTTP auth parameters to use with ES. If provided, it should be a two-tuple of (user, password).
  • country_confidence : Set the country model confidence below which no geolocation will be returned. If it's really low, the model's probably wrong and will return weird results. Defaults to 0.6.
  • verbose : Whether to return all the features used in the country-picking model. Defaults to False.
  • threads : Whether to use threads to make parallel queries to the Elasticsearch database. Defaults to True, which gives a ~6x speedup.
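
A minimal sketch of collecting the connection-related options before constructing the parser. The option names and defaults are taken from the list above; passing them as keyword arguments to Geoparser is an assumption that should be checked against the installed version. The helpers geoparser_options and make_geoparser are hypothetical, and the host name and credentials in the usage lines are placeholders.

```python
ALLOWED = {"es_hosts", "es_port", "es_ssl", "es_auth", "verbose", "threads"}


def geoparser_options(**overrides):
    """Merge overrides into the documented defaults and validate them."""
    unknown = set(overrides) - ALLOWED
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    # Defaults as documented in the option list above.
    options = {
        "es_hosts": ["localhost"],
        "es_port": 9200,
        "es_ssl": False,
        "verbose": False,
        "threads": True,
    }
    options.update(overrides)
    auth = options.get("es_auth")
    if auth is not None and (not isinstance(auth, tuple) or len(auth) != 2):
        raise ValueError("es_auth must be a (user, password) two-tuple")
    return options


def make_geoparser(**overrides):
    """Build a Geoparser from validated options (needs Mordecai and a live index)."""
    from mordecai import Geoparser  # imported lazily so validation works without it

    return Geoparser(**geoparser_options(**overrides))


# Placeholder host and credentials, for illustration only.
opts = geoparser_options(es_hosts=["es.example.internal"], es_ssl=True,
                         es_auth=("user", "secret"))
print(opts)
```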

geoparse is the primary endpoint and the only one that most users will need. Other, mostly internal, methods may be useful in some cases:

  • lookup_city takes a city name, country, and (optionally) ADM1/state/governorate and does a rule-based lookup for the city.
  • infer_country takes a document and attempts to infer the most probable country for each place name in it.
  • query_geonames and query_geonames_country can be used for performing a search over Geonames in Elasticsearch.
  • methods with the _feature prefix are internal methods for calculating country picking features from text.

batch_geoparse takes in a list of documents and uses spaCy's nlp.pipe method to process them more efficiently in the NLP step.
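
For a large corpus, one way to use batch_geoparse while keeping memory bounded is to feed it fixed-size chunks. The sketch below assumes that batch_geoparse returns one result list per input document, in order; the helper names and the chunk size are hypothetical. A stand-in object is used in place of a real Geoparser so the chunking logic can be run without Elasticsearch.

```python
def chunks(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def geoparse_in_chunks(geo, docs, chunk_size=100):
    """Call geo.batch_geoparse on each chunk and concatenate the results."""
    results = []
    for chunk in chunks(docs, chunk_size):
        # Assumption: one result list comes back per document, in input order.
        results.extend(geo.batch_geoparse(chunk))
    return results


class FakeGeoparser:
    """Stand-in with the same method name; records the batch sizes it sees."""

    def __init__(self):
        self.calls = []

    def batch_geoparse(self, docs):
        self.calls.append(len(docs))
        return [[] for _ in docs]


fake = FakeGeoparser()
out = geoparse_in_chunks(fake, ["a", "b", "c", "d", "e"], chunk_size=2)
print(len(out), fake.calls)
```

With a real parser, the first argument would be the Geoparser instance instead of the stand-in.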

Advanced users on large machines can increase the lru_cache parameter from 250 to 1000. This will use more memory but will increase parsing speed.
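
The trade-off can be seen with a generic functools.lru_cache example. This is not Mordecai's code: the lookup function is a hypothetical stand-in for an expensive gazetteer query, and the only point is that repeated place names are served from memory, with maxsize bounding how many distinct results are kept.

```python
from functools import lru_cache

calls = []  # records which lookups actually ran


@lru_cache(maxsize=250)
def lookup(placename):
    """Stand-in for an expensive query; runs once per distinct cached name."""
    calls.append(placename)
    return placename.upper()


for name in ["Oxford", "Ottawa", "Oxford", "Oxford"]:
    lookup(name)

info = lookup.cache_info()
print(info.hits, info.misses, info.maxsize)  # 2 2 250
print(calls)  # ['Oxford', 'Ottawa']
```

A larger maxsize keeps more distinct results in memory, which helps when a corpus mentions many different places repeatedly.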

Tests

Mordecai includes unit tests. To run the tests, cd into the mordecai directory and run:

pytest

The tests require access to a running Elastic/Geonames service to complete. The tests are currently failing on TravisCI with an unexplained segfault but run fine locally. Mordecai has only been tested with Python 3.

Acknowledgements

An earlier version of this software was donated to the Open Event Data Alliance by Caerus Associates. See Releases or the legacy-docker branch for the 2015-2016 and the 2016-2017 production versions of Mordecai.

This work was funded in part by DARPA's XDATA program, the U.S. Army Research Laboratory and the U.S. Army Research Office through the Minerva Initiative under grant number W911NF-13-0332, and the National Science Foundation under award number SBE-SMA-1539302. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, ARO, Minerva, NSF, or the U.S. government.

Contributing

Contributions via pull requests are welcome. Please make sure that changes pass the unit tests. Any bugs and problems can be reported on the repo's issues page.
