normalize

License: MIT

Simple library for fuzzy text sanitizing, normalizing and comparison.

Why

People type differently. This can be a problem if you need to associate user input with some internal entity or compare inputs from two different users. Say abc-01 and ABC 01 must be treated as the same string in your system. There are many heuristics we can apply to make this work:

  • Remove special characters.
  • Convert everything to lowercase.
  • etc.

This library is essentially an easily configurable set of useful helpers implementing all these transformations.

Installation

go get -u github.com/avito-tech/normalize 

Features

Normalize fuzzy text

package main 

import (
	"fmt"
	"github.com/avito-tech/normalize"
)

func main() {
	fuzzy := "VAG-1101"
	clean := normalize.Normalize(fuzzy)
	fmt.Print(clean) // vag1101

	manyFuzzy := []string{"VAG-1101", "VAG-1102"}
	manyClean := normalize.Many(manyFuzzy)
	fmt.Print(manyClean) // [vag1101 vag1102]
}

Default rules (in order of actual application):

  • Any character other than Latin/Cyrillic letters, German umlauts (ä, ö, ü) and digits is removed.
  • The rare Cyrillic letters ё and й are replaced with their common equivalents е and и.
  • Latin/Cyrillic look-alike pairs are normalized to Latin letters, so В (в) becomes B (b). See the WithCyrillicToLatinLookAlike normalizer in normalizers.go for all replacement pairs.
  • German umlauts ä, ö, ü are converted to Latin a, o, u.
  • The whole string is lowercased.
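The rule order above can be illustrated with a standard-library-only sketch. This is not the library's implementation: the function name normalizeSketch, the helper isKept, the short look-alike table and the handling of uppercase umlauts are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isKept reports whether r survives the first rule: ASCII Latin letters,
// ASCII digits, German umlauts and Cyrillic letters.
// Keeping uppercase umlauts is an assumption of this sketch.
func isKept(r rune) bool {
	switch {
	case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
		return true
	case strings.ContainsRune("äöüÄÖÜ", r):
		return true
	}
	return unicode.Is(unicode.Cyrillic, r) && unicode.IsLetter(r)
}

var (
	// Rule 2: rare Cyrillic letters become their common equivalents.
	rareCyrillic = strings.NewReplacer("ё", "е", "Ё", "Е", "й", "и", "Й", "И")
	// Rule 3: a small, hypothetical subset of look-alike pairs
	// (Cyrillic on the left, Latin on the right).
	lookAlike = strings.NewReplacer(
		"А", "A", "а", "a", "В", "B", "в", "b", "Е", "E", "е", "e",
		"О", "O", "о", "o", "С", "C", "с", "c",
	)
	// Rule 4: German umlauts become plain Latin vowels.
	umlauts = strings.NewReplacer("ä", "a", "ö", "o", "ü", "u", "Ä", "A", "Ö", "O", "Ü", "U")
)

// normalizeSketch applies the five rules in the order they are listed.
func normalizeSketch(s string) string {
	s = strings.Map(func(r rune) rune {
		if isKept(r) {
			return r
		}
		return -1 // a negative rune drops the character
	}, s)
	s = rareCyrillic.Replace(s)
	s = lookAlike.Replace(s)
	s = umlauts.Replace(s)
	return strings.ToLower(s)
}

func main() {
	fmt.Println(normalizeSketch("VAG-1101"))       // vag1101
	fmt.Println(normalizeSketch("Müller & Söhne")) // mullersohne
}
```

Running the replacements as separate passes is what makes the order visible: ё first becomes Cyrillic е, and only then does the look-alike pass turn it into Latin e.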

Compare fuzzy texts

Compare two strings after applying all the normalizations described above. Pass a threshold to control how similar the strings must be for the function to return true. The threshold is a relative value, so 0.5 roughly means "the strings may differ by up to 50% after all normalizations are applied".

Levenshtein distance is used under the hood to compute the distance between strings.

package main

import (
	"fmt"
	"github.com/avito-tech/normalize"
)

func main() {
	fuzzy := "Hyundai-Kia"
	otherFuzzy := "HYUNDAI"
	similarityThreshold := 0.3
	result := normalize.AreStringsSimilar(fuzzy, otherFuzzy, similarityThreshold)

	// distance(hyundaikia, hyundai) = 3
	// 3 / len(hyundaikia) = 0.3 
	fmt.Print(result) // true
}

Default rules

  • Apply default normalization (described above).
  • Calculate the Levenshtein distance and return true if distance / strlen <= threshold.
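To make the threshold check concrete, here is a minimal standard-library sketch of the distance / strlen <= threshold rule. The names levenshtein, min3 and similarSketch are hypothetical, the inputs are assumed to be already normalized, and dividing by the length of the longer string is an assumption; the library may choose strlen differently.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes,
// using the classic two-row dynamic programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution (or match)
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// similarSketch reports whether the relative distance stays within threshold.
// Assumption: the distance is divided by the rune length of the longer string.
func similarSketch(a, b string, threshold float64) bool {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return true // two empty strings are treated as identical
	}
	return float64(levenshtein(a, b))/float64(longest) <= threshold
}

func main() {
	fmt.Println(levenshtein("hyundaikia", "hyundai"))        // 3
	fmt.Println(similarSketch("hyundaikia", "hyundai", 0.3)) // true
	fmt.Println(similarSketch("hyundaikia", "hyundai", 0.2)) // false
}
```

With the strings from the example above, the distance is 3 and the longer string has 10 characters, so the ratio is exactly 0.3 and a threshold of 0.3 is the smallest value that still returns true.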

Configuration

Both AreStringsSimilar and Normalize accept an arbitrary number of normalizers as optional parameters. A normalizer is any function that accepts a string and returns a string.

For example, the following option leaves the string unchanged.

package main

import "github.com/avito-tech/normalize"

func WithNoNormalization() normalize.Option {
	return func(str string) string {
		return str
	}
}

You can configure normalization to use only the options you need. For example, you can apply only lowercasing and Cyrillic-to-Latin look-alike conversion. Note that the order of options matters.

package main

import (
	"fmt"
	"github.com/avito-tech/normalize"
)

func main() {
	fuzzy := "АВ-123"
	clean := normalize.Normalize(fuzzy, normalize.WithLowerCase(), normalize.WithCyrillicToLatinLookAlike())
	fmt.Print(clean) // ab-123
}
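
Why the order matters can be shown with a self-contained sketch that does not use the library. The Option type below only mirrors the "function from string to string" shape described above; apply, withLowerCase and withLowercaseLookAlike are hypothetical names, and the look-alike table deliberately covers only two lowercase letters so that the effect of ordering is visible.

```go
package main

import (
	"fmt"
	"strings"
)

// Option mirrors the shape described above: a function from string to string.
type Option func(string) string

// apply runs the options left to right, feeding each result into the next.
func apply(s string, opts ...Option) string {
	for _, opt := range opts {
		s = opt(s)
	}
	return s
}

// withLowerCase lowercases the whole string.
func withLowerCase() Option {
	return strings.ToLower
}

// withLowercaseLookAlike replaces two lowercase Cyrillic letters with Latin
// look-alikes. The table is intentionally tiny and lowercase-only; it is an
// assumption for illustration, not the library's table.
func withLowercaseLookAlike() Option {
	r := strings.NewReplacer("а", "a", "в", "b")
	return r.Replace
}

func main() {
	fuzzy := "АВ-123" // Cyrillic А and В

	// Lowercasing first lets the lowercase-only table match.
	fmt.Println(apply(fuzzy, withLowerCase(), withLowercaseLookAlike())) // ab-123 (Latin)

	// Reversed order: the table sees uppercase letters and matches nothing,
	// so the result keeps lowercase Cyrillic letters.
	fmt.Println(apply(fuzzy, withLowercaseLookAlike(), withLowerCase())) // ав-123 (Cyrillic)
}
```

Each option only sees the output of the previous one, so a normalizer that expects a particular casing or character set has to come after the option that produces it.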
