This project has been adopted from the original confusable_homoglyphs by Victor Felder.
"A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar." (Wikipedia: Homoglyph)
Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.
- AlaskaJazz is single script: only Latin characters.
- ΑlaskaJazz is mixed-script: the first character is a Greek letter.

You might also want to avoid people being tricked into entering their password on www.microsоft.com or www.faϲebook.com instead of www.microsoft.com or www.facebook.com. Here is a utility to play with these confusable homoglyphs.
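
The single-script versus mixed-script distinction can be illustrated with the standard library alone. This is a rough sketch, not this library's implementation: it assumes the first word of a character's Unicode name (for example LATIN, GREEK, CYRILLIC) is a good enough stand-in for its script, and the helper names `script_prefixes` and `has_mixed_scripts` are hypothetical.

```python
import unicodedata


def script_prefixes(text):
    """Collect the first word of the Unicode name of every letter in text."""
    prefixes = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN CAPITAL LETTER A" -> "LATIN"
            prefixes.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return prefixes


def has_mixed_scripts(text):
    """True when the letters of text come from more than one script prefix."""
    return len(script_prefixes(text)) > 1


print(has_mixed_scripts("AlaskaJazz"))            # False: Latin only
print(has_mixed_scripts("\u0391laskaJazz"))       # True: starts with GREEK CAPITAL LETTER ALPHA
print(has_mixed_scripts("www.micros\u043eft.com"))  # True: contains CYRILLIC SMALL LETTER O
```

Non-letters such as the dots in a domain name are ignored here, which is a simplification chosen for the example.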
Not all mixed-script strings have to be ruled out, though: you could exclude only those mixed-script strings that contain characters that might be confused with a character from Unicode blocks of your choosing.
- Allo and ρττ are fine: single script.
- AlloΓ is fine when our preferred script alias is 'latin': mixed script, but Γ is not confusable.
- Alloρ is dangerous: mixed script and ρ could be confused with p.

This library is compatible with Python 3.
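
The Allo examples can be reproduced with a small sketch. It is only an illustration under stated assumptions: the confusables table below is a tiny hand-picked excerpt rather than the full Unicode data, scripts are approximated by the first word of each character's Unicode name, and `looks_dangerous` is a hypothetical helper, not a function of this library.

```python
import unicodedata

# Hand-picked excerpt for illustration only: non-Latin characters that
# resemble a Latin letter. The real table is built from the Unicode data.
CONFUSABLE_WITH_LATIN = {
    "\u03c1": "p",  # GREEK SMALL LETTER RHO
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u03f2": "c",  # GREEK LUNATE SIGMA SYMBOL
}


def script_of(ch):
    """Approximate the script by the first word of the Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def looks_dangerous(text):
    """True if text mixes scripts AND contains a Latin look-alike."""
    letters = [ch for ch in text if ch.isalpha()]
    if len({script_of(ch) for ch in letters}) <= 1:
        return False  # single script: nothing to flag
    return any(ch in CONFUSABLE_WITH_LATIN for ch in letters)


print(looks_dangerous("Allo"))              # False: single script
print(looks_dangerous("\u03c1\u03c4\u03c4"))  # False: single script (Greek)
print(looks_dangerous("Allo\u0393"))        # False: mixed, but GAMMA is not in the table
print(looks_dangerous("Allo\u03c1"))        # True: mixed, and RHO resembles p
```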
Yep.
The Unicode block aliases and names for each character are extracted from this file provided by the Unicode Consortium.
The matrix of which characters can be confused with which other characters is built using this file provided by the Unicode Consortium.
This data is stored in two JSON files: categories.json and confusables.json. If you delete them, they will both be recreated by downloading and parsing the two abovementioned files and stored as JSON files again.
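
The delete-and-recreate behaviour described above follows a common load-or-rebuild caching pattern. The sketch below shows that pattern in general terms; it is not this library's code. `load_or_rebuild` and the `rebuild` callable are hypothetical names, and the download-and-parse step is left to the caller so that the example needs no network access.

```python
import json
import os


def load_or_rebuild(path, rebuild):
    """Return the JSON data at path, regenerating the file if it is missing.

    rebuild is a caller-supplied function (hypothetical here) that would
    download and parse the source data and return a JSON-serialisable object.
    """
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    data = rebuild()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data
```

With this shape, deleting the JSON file simply causes the next call to invoke `rebuild` again and write a fresh copy.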