  • Language
    Python
  • License
    BSD 3-Clause "New...
Russian names parsers, gender identification and processing tools

Russian Names

russiannames is a Python 3 library that parses Russian first names, surnames and middle names (midnames), identifies a person's gender from the full name, and detects the format in which the name is written. It uses MongoDB as a backend to speed up name parsing.

Documentation

Documentation is built automatically and can be found at https://russiannames.readthedocs.org/en/latest/

Installation

To install the Python library, run pip install russiannames, or run python setup.py install from a source checkout.

To use the database, you need a running MongoDB instance. Download and unpack db_dump_bson.zip from https://github.com/datacoon/russiannames/blob/master/data/bson/db_dump_bson.zip, then use the mongorestore command to restore the names database with its 3 collections: names, surnames and midnames.
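
After restoring the dump, it can help to confirm that all three collections are present before parsing anything. The sketch below is an illustration and not part of russiannames: missing_collections is a hypothetical helper that only assumes an object exposing list_collection_names(), as a pymongo Database does. The database name depends on how the dump was restored, so it is left as a placeholder in the commented usage.

```python
REQUIRED_COLLECTIONS = ("names", "surnames", "midnames")


def missing_collections(db):
    """Return the required collections that are absent from `db`.

    `db` is any object with a list_collection_names() method,
    for example a pymongo Database (hypothetical helper, not library API).
    """
    existing = set(db.list_collection_names())
    return [name for name in REQUIRED_COLLECTIONS if name not in existing]


# Usage against a live MongoDB instance (database name is a placeholder):
#
#   from pymongo import MongoClient
#   db = MongoClient("localhost", 27017)["<restored database name>"]
#   print(missing_collections(db) or "all collections present")
```

An empty list means the restore produced everything the parser is described as needing.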

Features

Database of names used for identification

  • 375449 surnames - collection: surnames
  • 32134 first names - collection: names
  • 48274 midnames - collection: midnames

Detailed database statistics by gender and collection

| collection | total | males | females | universal or unidentified |
|---|---|---|---|---|
| names | 32134 | 19297 | 8278 | 1196 |
| midnames | 48274 | 30114 | 16143 | 0 |
| surnames | 375274 | 124662 | 111534 | 38827 |

Supports 13 writing formats for Russian full names

| Format | Example | Description |
|---|---|---|
| f | Ольга | only first name |
| s | Петров | only surname |
| Fs | О. Сидорова | first letter of first name and full surname |
| sF | Николаев С. | full surname and first letter of first name |
| sf | Абрамов Семен | full surname and full first name |
| fs | Соня Камиуллина | full first name and full surname |
| fm | Иван Петрович | full first name and full middle name |
| SFM | М.Д.М. | first letters of surname, first name and middle name |
| FMs | А.Н. Егорова | first letters of first and middle names and full surname |
| sFM | Николаенко С.П. | full surname and first letters of first and middle names |
| sfM | Петракова Зинаида М. | full surname, full first name and first letter of middle name |
| sfm | Казаков Ринат Артурович | full name as surname, first name and middle name |
| fms | Светлана Архиповна Волкова | full name as first name, middle name and surname |
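
The format codes follow a simple convention that can be read off the table: s, f and m stand for surname, first name and middle name, a lowercase letter means the part is written in full, and an uppercase letter means only its initial is given. A minimal sketch of decoding a code under that reading (describe_format is a hypothetical helper, not a function of the library):

```python
PARTS = {"s": "surname", "f": "first name", "m": "middle name"}


def describe_format(code):
    """Expand a format code such as 'sFM' into (part, form) pairs.

    Lowercase letters stand for a part written in full,
    uppercase letters for a part abbreviated to its initial.
    """
    described = []
    for letter in code:
        part = PARTS[letter.lower()]  # raises KeyError for letters other than s/f/m
        form = "initial" if letter.isupper() else "full"
        described.append((part, form))
    return described


print(describe_format("sFM"))
# [('surname', 'full'), ('first name', 'initial'), ('middle name', 'initial')]
```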

Supports ethnic identification of names

Nine ethnic types are supported for first names, surnames and middle names

| key | name (en) | name (rus) |
|---|---|---|
| arab | Arabic | Арабское |
| arm | Armenian | Армянское |
| geor | Georgian | Грузинское |
| germ | German | Немецкие |
| greek | Greek | Греческие |
| jew | Jewish | Еврейские |
| polsk | Polish | Польские |
| slav | Slavic (Russian) | Славянские |
| tur | Turkic | Тюркские (тюркоязычные) |

Limitations

  • very rare first names, surnames or middle names may not be parsed
  • ethnic identification is still at an early stage

Speed optimization

  • preconfigured and preindexed MongoDB collections are used

Usage and Examples

Parse name and identify gender

Parses a name and returns its format, surname, first name, middle name, gender, the original text and a parsed flag (True/False)

>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.parse('Нигматуллин Ринат Ахметович')
{'format': 'sfm', 'sn': 'Нигматуллин', 'fn': 'Ринат', 'mn': 'Ахметович', 'gender': 'm', 'text': 'Нигматуллин Ринат Ахметович', 'parsed': True}
>>> parser.parse('Петрова C.Я.')
{'format': 'sFM', 'sn': 'Петрова', 'fn_s': 'C', 'mn_s': 'Я', 'gender': 'f', 'text': 'Петрова C.Я.', 'parsed': True}

The gender field can have one of the following values:

  • m: Male
  • f: Female
  • u: Unknown / unidentified
  • -: Impossible to identify
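
The gender codes lend themselves to a small lookup table when turning parse results into readable output. The sketch below works on a plain dictionary shaped like the parse() output shown earlier, so it runs without MongoDB. GENDER_LABELS and summarize are hypothetical names, and the handling of an unparsed result is an assumption, since the source does not show what such a result contains.

```python
GENDER_LABELS = {
    "m": "male",
    "f": "female",
    "u": "unknown / unidentified",
    "-": "impossible to identify",
}


def summarize(result):
    """Build a one-line summary from a parse() result dictionary."""
    # Assumption: an unparsed result carries parsed=False (or lacks the key).
    if not result.get("parsed"):
        return "not parsed: " + result.get("text", "")
    gender = GENDER_LABELS.get(result.get("gender"), "unrecognized gender code")
    return "{} [{}]: {}".format(result["text"], result["format"], gender)


# Result dictionary copied from the parse() example in this section
sample = {
    "format": "sfm", "sn": "Нигматуллин", "fn": "Ринат", "mn": "Ахметович",
    "gender": "m", "text": "Нигматуллин Ринат Ахметович", "parsed": True,
}
print(summarize(sample))
# Нигматуллин Ринат Ахметович [sfm]: male
```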

Ethnic identification (experimental)

Takes a surname, first name and middle name and tries to identify the person's ethnic affiliation

>>> from russiannames.parser import NamesParser
>>> parser = NamesParser()
>>> parser.classify('Нигматуллин', 'Ринат', 'Ахметович')
{'ethnics': ['tur'], 'gender': 'm'}
>>> parser.classify('Алексеева', 'Ольга', 'Ивановна')
{'ethnics': ['slav'], 'gender': 'f'}
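
The ethnics list holds the short keys from the ethnic types table, so a lookup table is enough to turn them into display names. A minimal sketch operating on a dictionary shaped like the classify() output (ETHNIC_NAMES and ethnic_labels are hypothetical names, not part of the library; unknown keys are passed through unchanged as a defensive assumption):

```python
ETHNIC_NAMES = {
    "arab": "Arabic",
    "arm": "Armenian",
    "geor": "Georgian",
    "germ": "German",
    "greek": "Greek",
    "jew": "Jewish",
    "polsk": "Polish",
    "slav": "Slavic (Russian)",
    "tur": "Turkic",
}


def ethnic_labels(classified):
    """Map the 'ethnics' keys of a classify() result to English names."""
    keys = classified.get("ethnics") or []
    # Keys missing from the table are returned as-is rather than dropped.
    return [ETHNIC_NAMES.get(key, key) for key in keys]


print(ethnic_labels({"ethnics": ["tur"], "gender": "m"}))
# ['Turkic']
```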

Supported languages

  • Russian

Requirements

  • pymongo
  • click

Related projects