normalize

License: MIT

A simple library for sanitizing, normalizing, and comparing fuzzy text.

Why

People type differently. This becomes a problem when you need to associate user input with an internal entity or compare inputs from two different users. For example, your system may need to treat abc-01 and ABC 01 as the same string. Several heuristics can make this work:

  • Remove special characters.
  • Convert everything to lowercase.
  • etc.

This library is essentially an easily configurable set of useful helpers implementing all these transformations.

Installation

go get -u github.com/avito-tech/normalize 

Features

Normalize fuzzy text

package main 

import (
	"fmt"
	"github.com/avito-tech/normalize"
)

func main() {
	fuzzy := "VAG-1101"
	clean := normalize.Normalize(fuzzy)
	fmt.Println(clean) // vag1101

	manyFuzzy := []string{"VAG-1101", "VAG-1102"}
	manyClean := normalize.Many(manyFuzzy)
	fmt.Println(manyClean) // [vag1101 vag1102]
}

Default rules (in order of actual application):

  • All characters except Latin/Cyrillic letters, German umlauts (ä, ö, ü), and digits are removed.
  • Rare Cyrillic letters ё and й are replaced with their common equivalents е and и.
  • Latin/Cyrillic look-alike pairs are normalized to Latin letters, so В (в) becomes B (b). See the WithCyrillicToLatinLookAlike normalizer in normalizers.go for all replacement pairs.
  • German umlauts ä, ö, ü are converted to Latin a, o, u.
  • The whole string is lowercased.
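
The order of the rules above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the function name normalizeSketch is hypothetical, the look-alike table is a small assumed subset of the real pairs in normalizers.go, and the character filter accepts any Latin-script letter, which may be broader than the library's own filter.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

var (
	// Rule 2: rare Cyrillic letters to common equivalents.
	rareCyrillic = strings.NewReplacer(
		"\u0451", "\u0435", // ё → е
		"\u0401", "\u0415", // Ё → Е
		"\u0439", "\u0438", // й → и
		"\u0419", "\u0418", // Й → И
	)
	// Rule 3: Cyrillic look-alikes to Latin (assumed, partial list).
	lookAlike = strings.NewReplacer(
		"\u0410", "A", // А → A
		"\u0412", "B", // В → B
		"\u0430", "a", // а → a
		"\u0432", "b", // в → b
	)
	// Rule 4: German umlauts to plain Latin vowels.
	umlauts = strings.NewReplacer(
		"\u00e4", "a", "\u00f6", "o", "\u00fc", "u", // ä ö ü
		"\u00c4", "A", "\u00d6", "O", "\u00dc", "U", // Ä Ö Ü
	)
)

// keepAllowed implements rule 1: drop everything that is not a
// Latin or Cyrillic letter or an ASCII digit.
func keepAllowed(s string) string {
	return strings.Map(func(r rune) rune {
		isLetter := unicode.IsLetter(r) && unicode.In(r, unicode.Latin, unicode.Cyrillic)
		if isLetter || (r >= '0' && r <= '9') {
			return r
		}
		return -1 // a negative rune tells strings.Map to drop the character
	}, s)
}

// normalizeSketch applies the five rules in the documented order.
func normalizeSketch(s string) string {
	s = keepAllowed(s)
	s = rareCyrillic.Replace(s)
	s = lookAlike.Replace(s)
	s = umlauts.Replace(s)
	return strings.ToLower(s)
}

func main() {
	fmt.Println(normalizeSketch("VAG-1101"))         // vag1101
	fmt.Println(normalizeSketch("\u0410\u0412-123")) // ab123 (input starts with Cyrillic АВ)
	fmt.Println(normalizeSketch("Gr\u00fcn 7"))      // grun7
}
```

Running the character filter first means the later replacement steps only ever see letters and digits.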

Compare fuzzy texts

Compare two strings with all the normalizations described above applied. Provide a threshold parameter to tweak how similar the strings must be for the function to return true. The threshold is a relative value, so 0.5 roughly means "strings are 50% different after all normalizations are applied".

Levenshtein distance is used under the hood to compute the distance between strings.

package main

import (
	"fmt"
	"github.com/avito-tech/normalize"
)

func main() {
	fuzzy := "Hyundai-Kia"
	otherFuzzy := "HYUNDAI"
	similarityThreshold := 0.3
	result := normalize.AreStringsSimilar(fuzzy, otherFuzzy, similarityThreshold)

	// distance(hyundaikia, hyundai) = 3
	// 3 / len(hyundaikia) = 0.3 
	fmt.Print(result) // true
}

Default rules

  • Apply default normalization (described above).
  • Calculate the Levenshtein distance and return true if distance / strlen <= threshold.
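
The relative-distance check can be sketched as follows. This is a standalone illustration, not the library's code: the names levenshtein and similar are hypothetical, and the sketch assumes the denominator is the length (in runes) of the longer string, which matches the Hyundai-Kia example above but is not confirmed by the source.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes,
// using the classic dynamic-programming recurrence with two rows.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

// similar reports whether distance / length stays within threshold.
// Assumption: length is the rune count of the longer input.
func similar(a, b string, threshold float64) bool {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return true // two empty strings are identical
	}
	return float64(levenshtein(a, b))/float64(longest) <= threshold
}

func main() {
	fmt.Println(levenshtein("hyundaikia", "hyundai"))  // 3
	fmt.Println(similar("hyundaikia", "hyundai", 0.3)) // true
	fmt.Println(similar("hyundaikia", "hyundai", 0.2)) // false
}
```

The inputs here are already normalized; in the library, normalization runs before the distance is computed.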

Configuration

Both AreStringsSimilar and Normalize accept an arbitrary number of normalizers as optional parameters. A normalizer is any function that accepts a string and returns a string.

For example, the following option leaves the string unchanged.

package main

import "github.com/avito-tech/normalize"

func WithNoNormalization() normalize.Option {
	return func(str string) string {
		return str
	}
}

You can configure normalization to use only the options you need. For example, you can apply only lowercasing and Cyrillic-to-Latin look-alike conversion. Note that the order of options matters.

package main

import (
	"fmt"
	"github.com/avito-tech/normalize"
)

func main() {
	fuzzy := "АВ-123" // "АВ" here are Cyrillic letters
	clean := normalize.Normalize(fuzzy, normalize.WithLowerCase(), normalize.WithCyrillicToLatinLookAlike())
	fmt.Print(clean) // ab-123
}
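
Why the order of options matters can be shown with a small standalone sketch. Everything here is hypothetical and independent of the library: the option type, the apply helper, and the two options withCyrToLat and withASCIIAlnumOnly are made up for illustration and only mimic the "function from string to string" idea described above.

```go
package main

import (
	"fmt"
	"strings"
)

// option mirrors the idea of a normalizer: a function from string to string.
type option func(string) string

// apply runs the options left to right, feeding each result into the next.
func apply(s string, opts ...option) string {
	for _, o := range opts {
		s = o(s)
	}
	return s
}

// withASCIIAlnumOnly is a hypothetical option that keeps only ASCII letters and digits.
func withASCIIAlnumOnly() option {
	return func(s string) string {
		return strings.Map(func(r rune) rune {
			switch {
			case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
				return r
			}
			return -1 // drop everything else
		}, s)
	}
}

// withCyrToLat is a hypothetical option mapping two Cyrillic letters
// to their Latin look-alikes.
func withCyrToLat() option {
	r := strings.NewReplacer(
		"\u0410", "A", // А → A
		"\u0412", "B", // В → B
	)
	return r.Replace
}

func main() {
	in := "\u0410\u0412-123" // Cyrillic АВ followed by "-123"

	// Convert first, then filter: the converted letters survive.
	fmt.Println(apply(in, withCyrToLat(), withASCIIAlnumOnly())) // AB123

	// Filter first: the Cyrillic letters are dropped before they can be converted.
	fmt.Println(apply(in, withASCIIAlnumOnly(), withCyrToLat())) // 123
}
```

The same two options produce different results depending on their position, so options that remove characters usually belong after options that rewrite them.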
