geodict
-------

A simple Python library/command-line tool for pulling location information from unstructured text

Installing
----------

This library uses a large geo-dictionary of countries, regions and cities, all stored in a MySQL database. The source data required is included in this project. To get started:

- Enter the details of your MySQL server and account into geodict_config.py
- Install the MySQLdb module for Python ('easy_install MySQL-python' may do the trick)
- cd into the folder you've unpacked this to, and run ./populate_database.py

This may take several minutes, depending on your machine, since there are over 2 million cities.

Running
-------

Once you've done that, give the command-line tool a try:
./geodict.py < testinput.txt

That should produce something like this:
Spain
Italy
Bulgaria
New Zealand
Barcelona, Spain
Wellington New Zealand
Alabama
Wisconsin

Those are the actual strings that the tool picked out as locations. If you want more information
on each of them in a machine-readable format you can specify JSON or CSV:
./geodict.py -f json < testinput.txt
[{"found_tokens": [{"code": "ES", "matched_string": "Spain", "lon": -4.0, "end_index": 4, "lat": 40.0, "type": "COUNTRY", "start_index": 0}]}, {"found_tokens": [{"code": "IT", "matched_string": "Italy", "lon": 12.833299999999999, "end_index": 10, "lat": 42.833300000000001, "type": "COUNTRY", "start_index": 6}]}, {"found_tokens": [{"code": "BG", "matched_string": "Bulgaria", "lon": 25.0, "end_index": 19, "lat": 43.0, "type": "COUNTRY", "start_index": 12}]}, {"found_tokens": [{"code": "NZ", "matched_string": "New Zealand", "lon": 174.0, "end_index": 42, "lat": -41.0, "type": "COUNTRY", "start_index": 32}]}, {"found_tokens": [{"matched_string": "Barcelona", "lon": 2.1833300000000002, "end_index": 52, "lat": 41.383299999999998, "type": "CITY", "start_index": 44}, {"code": "ES", "matched_string": "Spain", "lon": -4.0, "end_index": 59, "lat": 40.0, "type": "COUNTRY", "start_index": 55}]}, {"found_tokens": [{"matched_string": "Wellington", "lon": 174.78299999999999, "end_index": 70, "lat": -41.299999999999997, "type": "CITY", "start_index": 61}, {"code": "NZ", "matched_string": "New Zealand", "lon": 174.0, "end_index": 82, "lat": -41.0, "type": "COUNTRY", "start_index": 72}]}, {"found_tokens": [{"code": "AL", "matched_string": "Alabama", "lon": -86.807299999999998, "end_index": 196, "lat": 32.798999999999999, "type": "REGION", "start_index": 190}]}, {"found_tokens": [{"code": "WI", "matched_string": "Wisconsin", "lon": -89.638499999999993, "end_index": 332, "lat": 44.256300000000003, "type": "REGION", "start_index": 324}]}]
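
If you want to consume that JSON from another script, the following is a minimal sketch rather than part of geodict itself. It assumes only the structure visible in the output above: a list of locations, each with a "found_tokens" list whose entries carry "matched_string", "type", "lat" and "lon". The function name summarize and the shortened sample are made up for illustration.

```python
import json

# Shortened copy of the JSON output shown above (two of the eight entries).
sample_output = """
[{"found_tokens": [{"code": "ES", "matched_string": "Spain", "lon": -4.0,
                    "end_index": 4, "lat": 40.0, "type": "COUNTRY",
                    "start_index": 0}]},
 {"found_tokens": [{"matched_string": "Barcelona", "lon": 2.1833300000000002,
                    "end_index": 52, "lat": 41.383299999999998, "type": "CITY",
                    "start_index": 44},
                   {"code": "ES", "matched_string": "Spain", "lon": -4.0,
                    "end_index": 59, "lat": 40.0, "type": "COUNTRY",
                    "start_index": 55}]}]
"""


def summarize(json_text):
    """Return one (name, type, lat, lon) tuple per location.

    The name joins every matched token; type and coordinates come from the
    first token, which is an illustrative choice (it happens to agree with
    the "Barcelona, Spain" row of the CSV output).
    """
    rows = []
    for location in json.loads(json_text):
        tokens = location["found_tokens"]
        name = ", ".join(token["matched_string"] for token in tokens)
        first = tokens[0]
        rows.append((name, first["type"], first["lat"], first["lon"]))
    return rows


for row in summarize(sample_output):
    print(row)
```

Note that CITY tokens in the output above have no "code" key, so use token.get("code") rather than token["code"] if you need it.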

./geodict.py -f csv < testinput.txt
location,type,lat,lon
Spain,country,40.0,-4.0
Italy,country,42.8333,12.8333
Bulgaria,country,43.0,25.0
New Zealand,country,-41.0,174.0
"Barcelona, Spain",city,41.3833,2.18333
Wellington New Zealand,city,-41.3,174.783
Alabama,region,32.799,-86.8073
Wisconsin,region,44.2563,-89.6385
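
The CSV form is easy to load with Python's standard csv module. This is a rough sketch that assumes the four-column header shown above (location,type,lat,lon); read_locations is a hypothetical helper name, and the sample is a subset of the rows above.

```python
import csv
import io

# A few rows copied from the CSV output above.
sample_csv = '''location,type,lat,lon
Spain,country,40.0,-4.0
"Barcelona, Spain",city,41.3833,2.18333
Alabama,region,32.799,-86.8073
'''


def read_locations(csv_text):
    """Parse geodict-style CSV text into dicts with numeric coordinates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "location": row["location"],
            "type": row["type"],
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
        }
        for row in reader
    ]


# Keep only the city-level matches.
cities = [loc for loc in read_locations(sample_csv) if loc["type"] == "city"]
print(cities)
```

The csv module handles the quoted "Barcelona, Spain" field, which a plain split on commas would break in two.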

For more of a real-world test, try feeding in the front page of the New York Times:
curl -L "http://newyorktimes.com/" | ./geodict.py
Georgia
Brazil
United States
Iraq
China
Brazil
Pakistan
Afghanistan
Erlanger, Ky
Japan
China
India
India
Ecuador
Ireland
Washington
Iraq
Guatemala

The tool treats its input as plain text, so in production you'd want to use something like
Beautiful Soup to strip the tags out of the HTML first, but even with messy input like that
it works reasonably well.
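
Beautiful Soup is a third-party package. If you'd rather avoid the dependency, here is a sketch of the same tag-stripping idea using only html.parser from the Python standard library. It is a swapped-in alternative, not what geodict does, and the TextExtractor and strip_tags names are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Floods hit <b>Pakistan</b> and India.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_tags(page))  # prints: Floods hit Pakistan and India.
```

The cleaned text can then be piped into ./geodict.py in place of the raw page. Joining chunks with a space keeps adjacent paragraphs from running together, at the cost of splitting a word that has a tag in the middle of it.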

Developers
----------

To use this from within your own Python code:
import geodict_lib

and then call:
locations = geodict_lib.find_locations_in_text(text)

The code itself may be a bit non-idiomatic; I'm still getting up to speed with Python!
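
As a rough sketch of working with the result, the helper below slices the original text using the reported indices. It assumes, without confirmation from the library source, that find_locations_in_text returns the same structure as the -f json output shown earlier, and that end_index is inclusive (in that output "Spain" runs from 0 to 4). So that it runs without a database, the example uses a hand-written stand-in for the return value rather than importing geodict_lib; matched_spans is a hypothetical name.

```python
def matched_spans(text, locations):
    """Yield (substring, type) for every token in every location.

    Assumes each token has "start_index", "end_index" (inclusive) and "type",
    as in the JSON output of the command-line tool.
    """
    for location in locations:
        for token in location["found_tokens"]:
            start = token["start_index"]
            end = token["end_index"]
            yield text[start:end + 1], token["type"]


text = "Spain Italy"

# Hand-written stand-in for geodict_lib.find_locations_in_text(text),
# using the indices reported in the JSON output shown earlier.
locations = [
    {"found_tokens": [{"matched_string": "Spain", "type": "COUNTRY",
                       "start_index": 0, "end_index": 4}]},
    {"found_tokens": [{"matched_string": "Italy", "type": "COUNTRY",
                       "start_index": 6, "end_index": 10}]},
]

for span, kind in matched_spans(text, locations):
    print(span, kind)
```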

Credits
-------

© Pete Warden, 2010 - http://www.openheatmap.com/

World cities data is from MaxMind: http://www.maxmind.com/app/worldcities

All code is licensed under the GPL V3. For more details on the license see the included gpl.txt
file or go to http://www.gnu.org/licenses/
