Brand Monitor: Typo generation FAQ
Why should I monitor misspellings for my domain name?
Brand Alert API and Brand Alert Monitor provide a toolkit for searching newly registered or recently removed domain names by substring.
However, checking the exact brand name is often not enough, since newly registered domain names may differ from it slightly.
Registering and using a domain name that resembles the victim's but contains intentional typos is called typosquatting.
Misspelled domain names pose a potential risk to both domain owners and end users, since they may be used for brandjacking, redirecting traffic to competitors, distributing harmful content, etc.
How many misspellings can you generate?
The number of possible typos depends on the search term's length: the longer the word, the more misspellings can be generated from it.
However, finding all possible typos would be a demanding task, and searching through every possible letter combination is typically of little use.
As of now, the number of misspellings is limited to 1,000 per search term.
If for some reason you need more variations, please contact us.
How do you generate typos for a domain?
Our service supports the following rules for misspelling generation:
- Bitsquatting
- Homoglyph substitution
- Wildcard substitution
- Natural language differences
- Common misspellings dictionary
- Word splitting
- Letter mistyping
What is bitsquatting?
Bitsquatting is a form of typosquatting that mostly matters for machine-to-machine interaction. The general idea is to flip one or more bits in the domain name's binary representation.
If a machine's RAM is faulty, a bit of the domain name may be flipped at any of the underlying network levels, so the machine ends up connecting to a different, potentially malicious host.
For example, if the search term is google, its bit representation is:
Letter | g | o | o | g | l | e |
Bits | 01100111 | 01101111 | 01101111 | 01100111 | 01101100 | 01100101 |
Bit indices | 0-7 | 8-15 | 16-23 | 24-31 | 32-39 | 40-47 |
Let's switch bit 7 from 1 to 0:
Letter | f | o | o | g | l | e |
Bits | 01100110 | 01101111 | 01101111 | 01100111 | 01101100 | 01100101 |
Bit indices | 0-7 | 8-15 | 16-23 | 24-31 | 32-39 | 40-47 |
The resulting search term is foogle.
The same mechanism can be applied to any bit, or to several bits at once.
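The single-bit case can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's actual implementation: the function name bitsquat_variants is hypothetical, bit 0 is taken to be the most significant bit of the first character (as in the table above), and results containing characters outside letters, digits, and the hyphen are discarded.

```python
import string

# Characters assumed valid in a domain label: letters, digits, hyphen.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(term):
    """Return all labels that differ from `term` by exactly one flipped bit."""
    original = term.lower()
    data = original.encode("ascii")
    variants = set()
    for index in range(len(data) * 8):
        byte_pos, bit_pos = divmod(index, 8)
        flipped = bytearray(data)
        # Bit 0 is the most significant bit of the first byte.
        flipped[byte_pos] ^= 0x80 >> bit_pos
        if flipped[byte_pos] >= 0x80:
            continue  # no longer ASCII
        # Domain names are case-insensitive, so normalize to lowercase.
        candidate = flipped.decode("ascii").lower()
        if candidate != original and set(candidate) <= ALLOWED:
            variants.add(candidate)
    return sorted(variants)


print(bitsquat_variants("google"))
```

Flipping bit 7 of google turns the first byte from 01100111 (g) into 01100110 (f), so foogle appears among the printed variants.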
What is homoglyph substitution?
Across natural and artificial languages, there are characters that look very similar or even identical, e.g. the Latin "C" looks identical to the Cyrillic "С".
Another example is the Latin "G" and the phonetic symbol "ɢ". Such characters are called homoglyphs.
Despite their similarity, they are different characters in terms of Unicode.
This means that a fraudster can register a domain name containing such look-alike characters and redirect traffic to a malicious resource.
Such an attack is called an IDN homograph attack.
To guard against it, it is advisable to monitor all possible homoglyph combinations for your domain name.
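As a rough illustration, homoglyph combinations can be enumerated by offering, for each character, the original plus its look-alikes and taking every combination. The mapping below is a tiny hand-picked sample and the function name homoglyph_variants is hypothetical; a real table would be far larger. The default limit of 1,000 simply mirrors the per-term limit mentioned earlier in this FAQ.

```python
from itertools import product

# Small illustrative sample of look-alike characters (an assumption,
# not the service's actual table).
HOMOGLYPHS = {
    "c": ["\u0441"],  # Cyrillic small letter es
    "o": ["\u043e"],  # Cyrillic small letter o
    "e": ["\u0435"],  # Cyrillic small letter ie
    "g": ["\u0262"],  # Latin letter small capital G
}


def homoglyph_variants(term, limit=1000):
    """Return up to `limit` variants of `term` with look-alike characters."""
    # For each position: the original character followed by its homoglyphs.
    choices = [[ch] + HOMOGLYPHS.get(ch, []) for ch in term]
    variants = []
    for combo in product(*choices):
        candidate = "".join(combo)
        if candidate == term:
            continue  # skip the unchanged term
        variants.append(candidate)
        if len(variants) >= limit:
            break
    return variants


for variant in homoglyph_variants("google")[:5]:
    # Show the code points, since the variants look alike on screen.
    print(variant, [f"U+{ord(ch):04X}" for ch in variant])
```

Printing the code points makes the difference visible even though the variants are hard to tell apart by eye.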
What is the common misspellings dictionary?
It is a dictionary of real-world misspellings. A rather comprehensive one is available at wikipedia.org.
We check the search term against this list to generate possible typos for it.
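A minimal sketch of such a lookup, assuming the dictionary maps a correctly spelled word to its known misspellings. The three entries and the function name dictionary_variants are illustrative assumptions only; the real list is much larger.

```python
# Illustrative excerpt of a misspellings dictionary (assumed structure).
COMMON_MISSPELLINGS = {
    "receive": ["recieve"],
    "calendar": ["calender"],
    "business": ["buisness", "bussiness"],
}


def dictionary_variants(term):
    """Replace each known word found in `term` with its common misspellings."""
    variants = set()
    for word, typos in COMMON_MISSPELLINGS.items():
        if word in term:
            for typo in typos:
                variants.add(term.replace(word, typo))
    return sorted(variants)


print(dictionary_variants("mybusinesscalendar"))
```

Each variant changes one dictionary word at a time, which keeps the output small and close to the original term.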
What are natural language differences?
In different regions, people may spell and pronounce the same words differently. For example, British and American English have common differences such as:
- behaviour → behavior
- catalogue → catalog
- centre → center
- etc.
We generate misspelled domain names based on the most common of these differences.
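One simple way to model such differences is a list of spelling pairs applied in both directions. The sketch below is an assumption about how this could work, not the service's rule set: SPELLING_PAIRS covers only the three examples above, regional_variants is a hypothetical name, and plain substring matching will also produce some implausible variants for unrelated words.

```python
# (British, American) letter groups taken from the examples above.
SPELLING_PAIRS = [
    ("our", "or"),    # behaviour / behavior
    ("ogue", "og"),   # catalogue / catalog
    ("tre", "ter"),   # centre / center
]


def regional_variants(term):
    """Swap British and American spellings at every matching position."""
    variants = set()
    for british, american in SPELLING_PAIRS:
        # Apply each pair in both directions.
        for old, new in ((british, american), (american, british)):
            start = term.find(old)
            while start != -1:
                variants.add(term[:start] + new + term[start + len(old):])
                start = term.find(old, start + 1)
    variants.discard(term)
    return sorted(variants)


print(regional_variants("behaviour"))
print(regional_variants("catalog"))
```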
What is word splitting?
Sometimes domain names consist of several concatenated words, e.g. thelongestdomainnameever.com.
One possible typosquatting technique for such a domain name is to split it into words and then join them with hyphens ("-").
We use a pre-trained model to detect the original words in a character sequence and then join them using every possible combination of hyphens.
For instance, for the domain name thelongestdomainnameever.com, we get:
- the-longest-domain-name-ever.com
- thelongest-domainname-ever.com
- the-longestdomainnameever.com
- etc.
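Once the words are known, enumerating the hyphen combinations is straightforward. The sketch below assumes the segmentation has already been done (the pre-trained model is not reproduced here) and takes the word list as input; the function name hyphen_variants is hypothetical.

```python
from itertools import product


def hyphen_variants(words):
    """Join `words` with every combination of "" and "-" separators,
    skipping the combination that uses no hyphen at all."""
    if len(words) < 2:
        return []
    variants = []
    for separators in product(["", "-"], repeat=len(words) - 1):
        if not any(separators):
            continue  # identical to the original name
        parts = [words[0]]
        for separator, word in zip(separators, words[1:]):
            parts.append(separator + word)
        variants.append("".join(parts))
    return variants


words = ["the", "longest", "domain", "name", "ever"]
for label in hyphen_variants(words):
    print(label + ".com")
```

For n words there are 2^(n-1) - 1 variants, so the five-word example yields 15 hyphenated names.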
What is letter mistyping?
The most natural misspellings are typing mistakes in one or two letters.
For instance, they can be produced by replacing or combining the source domain name's letters with their keyboard neighbors.
Such domain names may look very similar to the original ones, but they may well lead to a harmful resource.
We use the following letter mistyping rules:
- Letter repetition - repeat a letter of the original word, e.g. google.com → gooogle.com or ggoogle.com
- Letter replacement - replace a letter with one of its keyboard neighbors, e.g. google.com → toogle.com or boogle.com
- Letter addition - add a keyboard-neighbor letter before or after a character, e.g. google.com → gfoogle.com
- Letter reversion - swap two adjacent letters in a word, e.g. google.com → goolge.com, googel.com
- Letter omission - skip a letter, e.g. google.com → gogle.com, googl.com
- Vowel swapping - replace a vowel in a word with another vowel, e.g. google.com → gaogle.com, guogle.com
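The rules above can be sketched as small Python functions that each change a single position. This is an illustration under stated assumptions rather than the service's implementation: all function names are hypothetical, and NEIGHBORS is a partial QWERTY adjacency map covering only the letters of the example.

```python
# Partial QWERTY adjacency map (assumption; only letters used in the example).
NEIGHBORS = {"g": "ftyhbv", "o": "iklp", "l": "kop", "e": "wsdr"}
VOWELS = "aeiou"


def repetition(s):
    return {s[:i] + s[i] + s[i:] for i in range(len(s))}


def replacement(s):
    return {s[:i] + n + s[i + 1:]
            for i, ch in enumerate(s) for n in NEIGHBORS.get(ch, "")}


def addition(s):
    out = set()
    for i, ch in enumerate(s):
        for n in NEIGHBORS.get(ch, ""):
            out.add(s[:i] + n + s[i:])          # neighbor before the letter
            out.add(s[:i + 1] + n + s[i + 1:])  # neighbor after the letter
    return out


def reversion(s):
    return {s[:i] + s[i + 1] + s[i] + s[i + 2:]
            for i in range(len(s) - 1) if s[i] != s[i + 1]}


def omission(s):
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def vowel_swap(s):
    return {s[:i] + v + s[i + 1:]
            for i, ch in enumerate(s) if ch in VOWELS
            for v in VOWELS if v != ch}


def mistype_variants(domain):
    """Apply every rule to the first label and keep the rest of the domain."""
    label, dot, rest = domain.partition(".")
    variants = set()
    for rule in (repetition, replacement, addition,
                 reversion, omission, vowel_swap):
        variants |= rule(label)
    variants.discard(label)
    return sorted(v + dot + rest for v in variants)


print(len(mistype_variants("google.com")))
print(mistype_variants("google.com")[:10])
```

Keeping each rule as a separate function makes it easy to enable or disable individual rules, or to chain two of them to model mistakes in two letters.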