Russian Names
russiannames is a Python 3 library for parsing Russian first names, surnames and middle names, and for identifying a person's gender and the format in which a full name is written. It uses MongoDB as a backend to speed up name parsing.
Documentation
Documentation is built automatically and can be found on https://russiannames.readthedocs.org/en/latest/
Installation
To install the library, run pip install russiannames, or run python setup.py install from a source checkout.
To use the database you need a MongoDB instance. Unpack the db_dump_bson.zip archive from https://github.com/datacoon/russiannames/blob/master/data/bson/db_dump_bson.zip and use the mongorestore command to restore the names database with its 3 collections: names, surnames and midnames.
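After restoring, it can be worth checking that the three collections are present and non-empty. The sketch below assumes the dump restores into a database called `names` on a local MongoDB at the default port. The URI, the database name and the helpers `collection_counts` and `check_restore` are assumptions for illustration and are not part of russiannames.

```python
def collection_counts(db, collections=("names", "surnames", "midnames")):
    """Return {collection name: document count} for the given database.

    `db` is expected to behave like a pymongo Database: indexing it by
    collection name yields an object with count_documents(filter).
    """
    return {name: db[name].count_documents({}) for name in collections}


def check_restore(uri="mongodb://localhost:27017", db_name="names"):
    """Connect to MongoDB and print the size of each collection.

    Both `uri` and `db_name` are assumptions; adjust them to your setup.
    """
    # Imported lazily so collection_counts() can be used without pymongo.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        for name, count in collection_counts(client[db_name]).items():
            print(f"{name}: {count} documents")
    finally:
        client.close()
```

Calling `check_restore()` with a running MongoDB instance prints one line per collection; a count of zero suggests the restore went into a different database.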
Features
Database of names used for identification
- 375449 surnames - collection: surnames
- 32134 first names - collection: names
- 48274 midnames - collection: midnames
Detailed database statistics by gender and collection
collection | total | males | females | universal or unidentified |
---|---|---|---|---|
names | 32134 | 19297 | 8278 | 1196 |
midnames | 48274 | 30114 | 16143 | 0 |
surnames | 375274 | 124662 | 111534 | 38827 |
Supports 13 formats for writing Russian full names
Format | Example | Description |
---|---|---|
f | Ольга | only first name |
s | Петров | only surname |
Fs | О. Сидорова | first letter of first name and full surname |
sF | Николаев С. | full surname and first letter of first name |
sf | Абрамов Семен | full surname and full first name |
fs | Соня Камиуллина | full first name and full surname |
fm | Иван Петрович | full first name and full middle name |
SFM | М.Д.М. | first letters of surname, first name and middle name |
FMs | А.Н. Егорова | first letters of first and middle names and full surname |
sFM | Николаенко С.П. | full surname and first letters of first and middle names |
sfM | Петракова Зинаида М. | full surname, full first name and first letter of middle name |
sfm | Казаков Ринат Артурович | full name as surname, first name and middle name |
fms | Светлана Архиповна Волкова | full name as first name, middle name and surname |
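The format codes in the table follow a simple convention: `s`, `f` and `m` stand for surname, first name and middle name; a lowercase letter means the part is written in full, and an uppercase letter means only its initial is given. The sketch below renders name parts according to a format code to make the convention concrete. The function `render_name` is hypothetical and not part of russiannames, and its spacing rule (adjacent initials written together) is inferred from the examples in the table.

```python
def render_name(fmt, surname="", first="", middle=""):
    """Write name parts according to a format code such as 'sFM' or 'fms'."""
    parts = {"s": surname, "f": first, "m": middle}
    out = ""
    prev_initial = False
    for code in fmt:
        key = code.lower()
        if key not in parts:
            raise ValueError(f"unknown format letter {code!r} in {fmt!r}")
        value = parts[key]
        if not value:
            raise ValueError(f"format {fmt!r} needs a value for {key!r}")
        is_initial = code.isupper()
        token = value[0].upper() + "." if is_initial else value
        # Adjacent initials are joined without a space ("Р.А.");
        # every other pair of tokens is separated by a single space.
        if out and not (is_initial and prev_initial):
            out += " "
        out += token
        prev_initial = is_initial
    return out
```

For example, `render_name("sFM", "Нигматуллин", "Ринат", "Ахметович")` returns `'Нигматуллин Р.А.'`, the same shape as the `sFM` row in the table.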
Supports ethnic identification of names
9 ethnic types are supported for first names, surnames and middle names
key | name (en) | name (rus) |
---|---|---|
arab | Arabic | Арабское |
arm | Armenian | Армянское |
geor | Georgian | Грузинское |
germ | German | Немецкие |
greek | Greek | Греческие |
jew | Jewish | Еврейские |
polsk | Polish | Польские |
slav | Slavic (Russian) | Славянские |
tur | Turkic | Тюркские (тюркоязычные) |
Limitations
- very rare names, surnames or middle names may fail to parse
- ethnic identification is still at an early stage
Speed optimization
- uses preconfigured and pre-indexed MongoDB collections
Usage and Examples
Parse name and identify gender
Parses a name and returns its format, surname, first name, middle name, a parsed flag (True/False) and gender
>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.parse('Нигматуллин Ринат Ахметович')
{'format': 'sfm', 'sn': 'Нигматуллин', 'fn': 'Ринат', 'mn': 'Ахметович', 'gender': 'm', 'text': 'Нигматуллин Ринат Ахметович', 'parsed': True}
>>> parser.parse('Петрова C.Я.')
{'format': 'sFM', 'sn': 'Петрова', 'fn_s': 'C', 'mn_s': 'Я', 'gender': 'f', 'text': 'Петрова C.Я.', 'parsed': True}
The gender field takes one of the following values:
- m: Male
- f: Female
- u: Unknown / unidentified
- -: Impossible to identify
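The two results above use different keys: full parts appear as `sn`, `fn` and `mn`, while initials appear as `fn_s` and `mn_s`. A small helper, sketched below, can read either shape and turn the gender code into a label. `GENDER_LABELS` and `summarize` are illustrative names, not part of the library. The key names are taken only from the two example outputs shown here, so other formats may use keys this sketch does not cover, and it assumes an unparsed result has `parsed` set to False or missing.

```python
GENDER_LABELS = {
    "m": "male",
    "f": "female",
    "u": "unknown / unidentified",
    "-": "impossible to identify",
}


def summarize(result):
    """Return a one-line summary of a dict shaped like NamesParser.parse() output."""
    if not result.get("parsed"):
        return f"{result.get('text', '')!r}: not parsed"
    # Prefer the full part; fall back to the initial (keys ending in _s).
    surname = result.get("sn", "?")
    first = result.get("fn") or result.get("fn_s", "?")
    middle = result.get("mn") or result.get("mn_s", "?")
    gender = GENDER_LABELS.get(result.get("gender"), "unrecognized code")
    return f"{surname} / {first} / {middle} ({gender}, format {result.get('format')})"
```

Applied to the second result above, `summarize` returns `'Петрова / C / Я (female, format sFM)'`.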
Ethnic identification (experimental)
Parses surname, first name and middle name and tries to identify the person's ethnic affiliation
>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.classify('Нигматуллин', 'Ринат', 'Ахметович')
{'ethnics': ['tur'], 'gender': 'm'}
>>> parser.classify('Алексеева', 'Ольга', 'Ивановна')
{'ethnics': ['slav'], 'gender': 'f'}
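The `ethnics` list contains the short keys from the ethnic types table. A minimal sketch for turning them into readable English labels follows. `ETHNIC_NAMES` and `ethnic_labels` are hypothetical helpers built from that table, not something the library provides, and keys missing from the mapping are passed through unchanged.

```python
ETHNIC_NAMES = {
    "arab": "Arabic",
    "arm": "Armenian",
    "geor": "Georgian",
    "germ": "German",
    "greek": "Greek",
    "jew": "Jewish",
    "polsk": "Polish",
    "slav": "Slavic (Russian)",
    "tur": "Turkic",
}


def ethnic_labels(classification):
    """Translate the 'ethnics' keys of a classify()-style dict into English names."""
    return [ETHNIC_NAMES.get(key, key) for key in classification.get("ethnics", [])]
```

With the first result above, `ethnic_labels({'ethnics': ['tur'], 'gender': 'm'})` returns `['Turkic']`.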
Supported languages
- Russian
Requirements
- pymongo
- click
Related projects
- Slavic names https://github.com/wb-08/SlavicNames - the same data shipped as an SQLite database