  • Stars: 514
  • Rank 85,646 (Top 2 %)
  • Language: JavaScript
  • License: MIT License
  • Created almost 9 years ago
  • Updated about 2 years ago

Repository Details

A big list of homoglyphs and some code to detect them

Homoglyphs

Java Quick Start

Include the Homoglyph library in your project by downloading it from Maven Central:

<dependency>
    <groupId>net.codebox</groupId>
    <artifactId>homoglyph</artifactId>
    <version>1.1.1</version>
</dependency>

Then use the HomoglyphBuilder class to build a Homoglyph object, and call its search() method with the text you want to search and the word(s) you want to search for:

String textToSearch = "Get free ϲrEd1ᴛ";
String[] bannedWords = new String[]{"credit"};
Homoglyph homoglyph = HomoglyphBuilder.build();
List<SearchResult> results = homoglyph.search(textToSearch, bannedWords);

JavaScript Quick Start

Include the Homoglyph library in your project by downloading it from NPM:

npm install homoglyph-search

Then call the module's search() function with the text you want to search and the word(s) you want to search for:

var homoglyphSearch = require('homoglyph-search');
var bannedWords = ['credit'];
var textToSearch = 'Get free ϲrEd1ᴛ';
var results = homoglyphSearch.search(textToSearch, bannedWords);

Background

Homoglyphs are characters with different meanings that look similar or identical to each other, such as the digit '0' and the capital letter 'O'.

Homoglyphs within a single alphabet tend to be rare, since an alphabet whose letters were easily confused would be hard to read. These days, however, the internet runs on Unicode, which means that it is possible to mix the letters from many different languages together in one place, massively increasing the number of homoglyphs.

For example, each of the characters shown below (all rendered using the same font) is different, with its own unique Unicode codepoint value, but they all look more or less like the capital letter 'A':

A Α А Ꭺ ᗅ ᴀ ꓮ Ａ 𐊠 𝐀 𝐴 𝑨 𝒜 𝓐 𝔄 𝔸 𝕬 𝖠 𝗔 𝘈 𝘼 𝙰 𝚨 𝛢 𝜜 𝝖 𝞐
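
As a small illustration of the point above, the sketch below prints the codepoint values of a few of these look-alike characters. The describeCodePoint helper is a hypothetical name introduced for this example and is not part of the library; the characters are written as escapes so that it is clear which ones are being compared.

```javascript
// Four characters that all render like a capital 'A'.
// Escapes are used so the exact codepoints are unambiguous.
const lookalikes = [
  'A',          // LATIN CAPITAL LETTER A
  '\u0391',     // GREEK CAPITAL LETTER ALPHA
  '\u0410',     // CYRILLIC CAPITAL LETTER A
  '\u{1D400}',  // MATHEMATICAL BOLD CAPITAL A
];

// Hypothetical helper: format the first codepoint of a string as U+XXXX.
function describeCodePoint(ch) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  return 'U+' + hex;
}

const codePoints = lookalikes.map(describeCodePoint);
console.log(codePoints);         // [ 'U+0041', 'U+0391', 'U+0410', 'U+1D400' ]
console.log('A' === '\u0391');   // false: same appearance, different character
```

Because string comparison works on codepoint values rather than on appearance, none of these characters compares equal to any of the others.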

As well as creating general confusion, homoglyphs can cause particular problems for software developers. For example, if a social media website wants to protect its users from offensive language, it may create a 'blacklist' of forbidden words and block any content that contains them. However, someone wishing to use one of the blacklisted words could replace one of its letters with a homoglyph. The word would no longer match the one on the blacklist, but its meaning would still be apparent to anyone who saw it.
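
The bypass described above can be sketched in a few lines. This is only an illustration under stated assumptions: SAMPLE_HOMOGLYPHS is a tiny hand-picked table invented for this example (it is not the library's data), and naiveContains and foldHomoglyphs are hypothetical helper names, not functions exported by homoglyph-search.

```javascript
// Hypothetical, hand-picked mapping used only for this example.
const SAMPLE_HOMOGLYPHS = new Map([
  ['\u03F2', 'c'], // GREEK LUNATE SIGMA SYMBOL, looks like 'c'
  ['1', 'i'],      // digit one, often used in place of 'i'
  ['\u1D1B', 't'], // LATIN LETTER SMALL CAPITAL T, looks like 't'
]);

// A plain substring check, as a simple blacklist filter might do it.
function naiveContains(text, word) {
  return text.toLowerCase().includes(word);
}

// Replace each known look-alike with the plain letter it imitates.
function foldHomoglyphs(text) {
  let folded = '';
  for (const ch of text.toLowerCase()) {
    folded += SAMPLE_HOMOGLYPHS.has(ch) ? SAMPLE_HOMOGLYPHS.get(ch) : ch;
  }
  return folded;
}

const text = 'Get free \u03F2rEd1\u1D1B';

console.log(naiveContains(text, 'credit'));            // false: the filter is bypassed
console.log(foldHomoglyphs(text).includes('credit'));  // true: match found after folding
```

A one-way fold like this is deliberately simplistic. It would also rewrite genuine digits, and it cannot handle a character that resembles more than one letter, which is one reason a larger list of homoglyph groups is needed in practice.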

I have tried to compile a list of all the homoglyphs I could find, and to make the list easier to use in software by processing it in various ways. The list is based on the one that appears on the Unicode Consortium website. That list, although long, was incomplete, so I added some further pairs found thanks to homoglyphs.net.

JavaScript and Unicode

As noted by Mathias Bynens, JavaScript has a Unicode Problem. String processing code that works perfectly well with regular English characters can behave in unexpected ways when more exotic ones are used.

In the example below, the string 'FOUR' has a length of 4, as we would expect. However, when the letters are replaced with high-value homoglyphs, the length is reported as 8. This happens for any character with a Unicode codepoint value higher than U+FFFF, because JavaScript strings are sequences of 16-bit code units and each such character occupies two of them:

>'FOUR'.length
4
>'𐊇𐊒𝐔𝐑'.length
8

This can cause problems when processing strings in order to detect homoglyph substitutions. However, the JavaScript search function mentioned above uses the ECMAScript 6 for...of construct, which iterates over a string by codepoint rather than by 16-bit code unit, so individual characters are extracted intact before the search is performed.
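
The difference between the two ways of walking a string can be seen in the minimal sketch below. The countCodePoints helper is a hypothetical name for illustration only, and the sample string uses mathematical bold letters written as escapes rather than the exact characters from the console example above.

```javascript
// 'FOUR' written with MATHEMATICAL BOLD CAPITAL letters, all above U+FFFF.
const fancyFour = '\u{1D405}\u{1D40E}\u{1D414}\u{1D411}';

// Hypothetical helper: count characters by iterating with for...of,
// which yields one whole codepoint per step.
function countCodePoints(text) {
  let count = 0;
  for (const ch of text) {
    count += 1;
  }
  return count;
}

console.log(fancyFour.length);              // 8: counts 16-bit code units
console.log(countCodePoints(fancyFour));    // 4: counts codepoints
console.log(Array.from(fancyFour).length);  // 4: Array.from uses the same iterator
console.log(fancyFour.charAt(0).length);    // 1: half of a surrogate pair, not a full character
```

Index-based access such as charAt() or text[i] still works on code units, so code that needs whole characters should iterate or use codePointAt().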

Java and Unicode

Unfortunately, Java also has a Unicode problem. When the language was designed, the Unicode standard used only 16 bits to encode each character, so the corresponding Java char data type was specified to have 16 bits as well. The Unicode standard has since been extended with many more characters, and more than 16 bits are now required to represent them all. This means that we must be careful when handling Strings that contain high-value characters: we can't rely, for example, on the .length() method returning the number of characters in a String, because it counts 16-bit char values.

This Java class provides a homoglyph-aware search function that correctly handles high-value Unicode characters by using the int datatype to represent codepoint values.

More Repositories

1

mosaic

Python script for creating photomosaic images
Python
531
star
2

image_augmentor

Data augmentation tool for images
Python
440
star
3

bitmeteros

BitMeter OS - a cross-platform bandwidth monitor
C
330
star
4

reading-list-mover

A Python utility for moving bookmarks/reading lists between services
Python
200
star
5

markov-text

Python utility that uses a Markov Chain to generate random sentences using a source text
Python
185
star
6

moment-precise-range

A moment.js plugin to display human-readable date/time ranges
JavaScript
150
star
7

bayesian-classifier

A Naive Bayesian Classifier written in Python
Python
102
star
8

star-charts

Generate SVG star charts using Python
Python
100
star
9

mazes

JavaScript Maze Generator
JavaScript
64
star
10

monkeyshine

A collection of slightly evil JavaScript
JavaScript
45
star
11

lunar-calendar

Generates an HTML Lunar Calendar
HTML
42
star
12

old-time-radio

An internet radio station streaming classic shows from the Golden Age of Radio
JavaScript
34
star
13

convnet-designer

A utility for designing Convolutional Neural Networks
JavaScript
27
star
14

readable-regex

Java library for creating readable regular expressions
Java
26
star
15

javabean-tester

JavaBean Tester
Java
24
star
16

js-planet-phase

A small JavaScript library for rendering realistic moon and planet phases in HTML
JavaScript
22
star
17

table-sorter

Table Sorter
JavaScript
21
star
18

clipper

Page Clipper Bookmarklet
JavaScript
21
star
19

generative-patterns

Web-Based Generative Pattern Maker
JavaScript
19
star
20

wordvis

This is a Python script to generate Sunburst Charts that visualise the structure of English words.
Python
16
star
21

process-roulette

A shell script game where you kill random processes on your computer, the more you kill the higher your score!
Shell
14
star
22

sarsa-lambda

A Python implementation of the SARSA λ reinforcement learning algorithm
Python
11
star
23

regex_parser

A regular expression parser written in JavaScript
JavaScript
10
star
24

https-certificate-expiry-checker

A Python script for checking when HTTPS certificates will expire
Python
10
star
25

star-rise-and-set-times

A browser-based tool for calculating the rising and setting times of stars
JavaScript
9
star
26

connect4

Python
8
star
27

blockchain

Minimum Viable Blockchain in Python
Python
8
star
28

maze.js

Maze Generation Algorithms and Rendering Code
JavaScript
8
star
29

save-restore

Save/Restore Bookmarklet
JavaScript
8
star
30

boggle

A boggle solver and game
Python
7
star
31

bokeh

Bokeh Animation with JavaScript
JavaScript
7
star
32

video-barcode-generator

Video Barcode Generator
Shell
6
star
33

show-passwords

Show Passwords Bookmarklet
HTML
6
star
34

scheme-interpreter

An interpreter for a basic subset of the Scheme programming language
Python
6
star
35

algorithms

Python
5
star
36

solar-system-moons

Python code for generating infographic posters that visualise data about the outer planets in our solar system
Python
5
star
37

top-down-parser

A simple top-down parser written in JavaScript
JavaScript
4
star
38

magnetic-pendulum

An interactive simulation of a Magnetic Pendulum
JavaScript
4
star
39

planetary-systems

Python code to generate an SVG showing planetary systems
Python
4
star
40

stellar-classification-parser

A parser for star classification codes
JavaScript
4
star
41

bitmeteros-python-client

A graphical client for BitMeter OS
Python
4
star
42

zen.sh

A shell script meditation timer for macOS
Shell
3
star
43

netmo

Browser-based remote network monitoring tool
Python
3
star
44

gradient-descent

Python implementations of both Linear and Logistic Regression using Gradient Descent
Python
3
star
45

neural-net

A simple neural network implemented in Python
Python
2
star
46

hltracker

Hotline Tracker
Visual Basic
2
star
47

crypto-tools

Python
1
star
48

twitter-min

Minimalist Twitter Search Client, contained in a single HTML file
HTML
1
star
49

analog-digital-clock

Analogue/Digital Clock
JavaScript
1
star
50

peg-parser

The beginnings of a Parsing Expression Grammar parser
Python
1
star
51

landscape

JavaScript
1
star
52

ball_stand

A 3D model of a ball-stand, useful for displaying the clear acrylic balls often used for contact juggling.
OpenSCAD
1
star
53

reinforcement-robot

Experiment with Reinforcement Learning using robots!
JavaScript
1
star
54

tcpdump-web

A web interface for tcpdump
JavaScript
1
star
55

wordle.sh

A Wordle-solving shell script
Shell
1
star
56

svg-stars

Generate pretty animated SVGs for stars
JavaScript
1
star