

🏣📮〠 Japanese postal code data.

posuto


Posuto is a wrapper for the postal code data distributed by Japan Post. It makes mapping Japanese postal codes to addresses easier than working with the raw CSV.

You can read more about the motivations for posuto in Parsing the Infamous Japanese Postal CSV.

Issues do not need to be written in English. (issueを英語で書く必要はありません。)

Postbox character by Irasutoya

Features:

  • multi-line neighborhoods are joined
  • parenthetical notes are put in a separate field
  • change reasons are converted from flags to labels
  • kana records are unified for easy access
  • codes with multiple areas provide a list of alternates

Romaji provided by JP Post were previously included in this library, but they are extremely low quality and, because they are updated separately, hard to keep in sync. If you need romaji, cutlet is recommended instead.

To install:

pip install posuto

Example usage:

import posuto as 〒

🗼 = 〒.get('〒105-0011')

print(🗼)
# "東京都港区芝公園"
print(🗼.prefecture)
# "東京都"
print(🗼.kana)
# "トウキョウトミナトクシバコウエン"
print(🗼.note)
# None

Note: Unfortunately 〒 and 🗼 are not valid identifiers in Python, so the above is pseudocode. See examples/sample.py for an executable version.

You can provide a postal code with basic formatting, and postal data will be returned as a named tuple with a few convenience functions. Read on for details of how quirks in the original data are handled.
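
As an illustration of what "basic formatting" can mean, here is a minimal sketch of input normalization. The function name `normalize_code` and its exact rules are assumptions made for this example, not posuto's actual implementation.

```python
import unicodedata


def normalize_code(raw):
    """Reduce a user-supplied postal code to seven ASCII digits.

    Hypothetical helper: the accepted formats are an assumption for this sketch.
    """
    # NFKC folds full-width digits and the full-width hyphen to ASCII.
    text = unicodedata.normalize("NFKC", raw).strip()
    # Drop the postal mark and the hyphen between the 3- and 4-digit parts.
    text = text.replace("〒", "").replace("-", "")
    if len(text) != 7 or any(ch not in "0123456789" for ch in text):
        raise ValueError(f"not a 7-digit postal code: {raw!r}")
    return text


print(normalize_code("〒105-0011"))
print(normalize_code("1050000"))
```

Rejecting anything that is not exactly seven digits after cleanup keeps malformed input from silently matching the wrong record.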

Details

The original CSV files are kept in source control in the repository but are not distributed as part of the pip package. Instead, the CSV is converted to JSON, which is then loaded into an SQLite database that ships with the package. As a result, most of the code complexity in this package is in the build step rather than at runtime.
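
The CSV to JSON to SQLite flow can be sketched with the standard library alone. This is only an illustration under stated assumptions: the four-column sample row, the `postal` table, and the `build_db` and `lookup` helpers are all hypothetical, and the real Japan Post CSV has more columns.

```python
import csv
import io
import json
import sqlite3

# Simplified stand-in for the raw data (assumed layout, not the real CSV).
SAMPLE_CSV = "1050011,東京都,港区,芝公園\n"


def build_db(csv_text):
    """Build step: parse CSV rows, serialize each as JSON, store in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE postal (code TEXT PRIMARY KEY, data TEXT)")
    for code, prefecture, city, neighborhood in csv.reader(io.StringIO(csv_text)):
        record = {
            "prefecture": prefecture,
            "city": city,
            "neighborhood": neighborhood,
        }
        conn.execute(
            "INSERT INTO postal VALUES (?, ?)",
            (code, json.dumps(record, ensure_ascii=False)),
        )
    conn.commit()
    return conn


def lookup(conn, code):
    """Runtime step: a single indexed read, then JSON decoding."""
    row = conn.execute("SELECT data FROM postal WHERE code = ?", (code,)).fetchone()
    return json.loads(row[0]) if row else None


conn = build_db(SAMPLE_CSV)
print(lookup(conn, "1050011"))
```

With this split, all the parsing work happens once at build time, and a runtime lookup is just a primary-key query.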

The postal code data has many irregularities and oddities. This section explains how they are handled.

Note also that posuto has no dependencies in normal usage. Only when building the postal data from the raw CSVs are mojimoji (for character conversion) and iconv (for encoding conversion) used.

Field names

The primary fields of an address and the translations preferred here for each are:

  • 都道府県: prefecture
  • 市区町村: city
  • 町域名: neighborhood

import posuto

# 🗼
tt = posuto.get('〒105-0011')
print(tt.prefecture, tt.city, tt.neighborhood)
# "東京都 港区 芝公園"

Notes

The postal data often includes notes in the neighborhood field. These are always in parentheses, with one exception: "以下に掲載がない場合" ("when not listed below"). All notes are put in the note field, and no attempt is made to extract their yomigana or romaji (which are often unavailable anyway).

minatoku = posuto.get('1050000')
print(minatoku.note)
# "以下に掲載がない場合"
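
One way to separate a parenthetical note from the neighborhood name is sketched below. The helper `split_note`, its regular expression, and the choice to return an empty neighborhood for the special case are assumptions for illustration. posuto's build code may work differently.

```python
import re

# The one note that is not wrapped in parentheses.
UNLISTED = "以下に掲載がない場合"

# Name, then a single parenthetical running to the end of the field.
# Both full-width and ASCII parentheses are accepted (an assumption).
NOTE_RE = re.compile(r"^(?P<name>[^（(]*)[（(](?P<note>.*)[）)]$")


def split_note(raw):
    """Return (neighborhood, note) for a raw neighborhood field."""
    if raw == UNLISTED:
        # Assumption: the whole field is treated as a note.
        return "", raw
    match = NOTE_RE.match(raw)
    if match:
        return match.group("name"), match.group("note")
    return raw, None


print(split_note("大宮町（今出川通河原町西入）"))
print(split_note("芝公園"))
```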

Yomigana

Yomigana are converted to full-width kana.
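
For readers who want to see what this conversion looks like, here is a minimal sketch. The text above says the build uses mojimoji for character conversion. This example swaps in the standard library's NFKC normalization instead, so it is an approximation rather than posuto's method, and `to_full_width_kana` is a hypothetical name.

```python
import unicodedata


def to_full_width_kana(text):
    """Convert half-width katakana to full-width using NFKC.

    Caveat: NFKC also changes other characters (for example, full-width
    ASCII becomes half-width), so it is broader than a kana-only conversion.
    """
    return unicodedata.normalize("NFKC", text)


print(to_full_width_kana("ﾄｳｷｮｳﾄ"))
print(to_full_width_kana("ｼﾊﾞｺｳｴﾝ"))
```

NFKC also composes a half-width kana followed by a voicing mark into a single character, so ﾊﾞ becomes バ.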

Long Neighborhood Names

The postal data README explains that when the neighborhood field is over 38 characters it will be continued onto multiple lines. This is not explicitly marked in the data, and where line breaks are inserted in long neighborhoods appears to be random (it's often neither after the 38th character nor at a reasonable word boundary). The only indicator of long lines is an unclosed parenthesis on the first line. Such long lines are always in order in the original file.
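
The unclosed-parenthesis rule described above can be sketched as follows. This is a simplified illustration: `join_continued` is a hypothetical helper that looks only at the neighborhood strings, whereas a real build would presumably also check that the continued rows share a postal code.

```python
def join_continued(neighborhoods):
    """Merge consecutive fields until parentheses are balanced.

    `neighborhoods` is a list of raw neighborhood fields in file order.
    """
    joined = []
    buffer = ""
    for part in neighborhoods:
        buffer += part
        opens = buffer.count("（") + buffer.count("(")
        closes = buffer.count("）") + buffer.count(")")
        if opens > closes:
            # Still inside a parenthetical: the next row continues this one.
            continue
        joined.append(buffer)
        buffer = ""
    if buffer:
        # Unterminated trailing record: keep it rather than drop data.
        joined.append(buffer)
    return joined


rows = ["大宮町（今出川通河原町西入、", "今出川通寺町東入）", "芝公園"]
print(join_continued(rows))
```

This works only because continued lines are always in order in the original file, so no sorting or lookahead is needed.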

In posuto, the parenthetical information is considered a note and put in the note field.

omiya = posuto.get('6020847')
print(omiya)
# "京都府京都市上京区大宮町"
print(omiya.note)
# "今出川通河原町西入、今出川通寺町東入、今出川通寺町東入下る、河原町通今出川下る、河原町通今出川下る西入、寺町通今出川下る東入、中筋通石薬師上る"

Multiple Regions in One Code

Sometimes a postal code covers multiple regions. Often the city is the same and just the neighborhood varies, but sometimes part of the city field varies, or even the whole city field. Codes like this are indicated by the "一つの郵便番号で二以上の町域を表す場合の表示" field in the original CSV data, which is called multi here.

For now, if more than one region shares a single code, the main entry is the first region listed in the main CSV, and the other regions are stored as a list in the alternates property. There may be a better way to do this.
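
The "first region is the main entry, the rest are alternates" rule can be modeled with a small sketch. The `Region` type, the `group_by_code` helper, and the sample rows (placeholder values, not real postal data) are all assumptions for illustration. Only the `alternates` name comes from the text above.

```python
from typing import NamedTuple


class Region(NamedTuple):
    postal_code: str
    city: str
    neighborhood: str
    alternates: list


def group_by_code(rows):
    """Group (code, city, neighborhood) rows, given in CSV order, by code."""
    grouped = {}
    for code, city, neighborhood in rows:
        entry = Region(code, city, neighborhood, [])
        if code not in grouped:
            # The first region seen for a code becomes the main entry.
            grouped[code] = entry
        else:
            # Later regions with the same code are kept as alternates.
            grouped[code].alternates.append(entry)
    return grouped


# Placeholder rows, not real data.
rows = [
    ("0000001", "City X", "Area A"),
    ("0000001", "City X", "Area B"),
    ("0000002", "City Y", "Area C"),
]
main = group_by_code(rows)["0000001"]
print(main.neighborhood, [alt.neighborhood for alt in main.alternates])
```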

Programming Notes

This section is for notes on the use of the library itself as opposed to notes about the data structure.

Multi-threaded Environments

By default, posuto creates a DB connection and cursor on startup and reuses it for all requests. In the typical single-threaded, read-only scenario this is not a problem, but it causes warnings (and may cause problems) in a multi-threaded scenario. In that case you can manage db connections manually using a context manager object.

from posuto import Posuto

with Posuto() as pp:
    tower = pp.get('〒105-0011')

When the object is used this way, the connection is closed automatically when the with block exits.

License

The original postal data is provided by JP Post with an indication they will not assert copyright. The code in this repository is released under the MIT or WTFPL license.
