Charles Leaver – This Is Why Edit Distance Is Used In Detection Part 2

Written By Jesse Sampson And Presented By Charles Leaver CEO Ziften


In the first post about edit distance, we looked at hunting for malicious executables with edit distance (i.e., the number of character edits needed to make two text strings match). Now let’s look at how edit distance can be used to hunt for malicious domains, and how edit distance features can be combined with other domain features to identify suspicious activity.

Background

What are bad actors up to with malicious domains? It could be as simple as using a close misspelling of a common domain to trick careless users into viewing ads or picking up adware. Legitimate sites are increasingly catching on to this technique, sometimes called typosquatting.

Other malicious domains are the product of domain generation algorithms (DGAs), which can be used for all sorts of mischief, such as evading countermeasures that block known compromised sites, or overwhelming domain name servers in a distributed denial-of-service (DDoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.

Edit distance can help with both use cases: let’s see how. First, we’ll exclude common domains, since these are generally safe, and a list of popular domains also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to registered domain names and ignore sub-domains (e.g. ziften.com, not www.ziften.com). A sketch of this cleaning step follows.
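To make the cleaning step concrete, here is a minimal SQL sketch. The table and column names (observed_hosts, top_domains, candidate_domains) and the two-label regex for extracting the registered domain are assumptions for illustration; a public-suffix-aware parser is more robust in practice, and REGEXP_SUBSTR is Vertica's spelling (other databases may differ).

-- Minimal cleaning sketch (illustrative table and column names).
-- observed_hosts(hostname):  host names observed in the wild.
-- top_domains(domain, tld):  popular domains (e.g. from a Quantcast-style list).
CREATE TABLE candidate_domains AS
SELECT DISTINCT
       REGEXP_SUBSTR(LOWER(hostname), '[^.]+\.[^.]+$') AS domain,  -- e.g. www.ziften.com -> ziften.com
       REGEXP_SUBSTR(LOWER(hostname), '[^.]+$')        AS tld      -- e.g. com
FROM   observed_hosts
WHERE  REGEXP_SUBSTR(LOWER(hostname), '[^.]+\.[^.]+$')
       NOT IN (SELECT domain FROM top_domains);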

After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its possible neighbors in the same top-level domain (the last part of a domain name – classically .com, .org, etc., and now almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domain names that are one step away from their nearest neighbor, we can easily spot typo-ed domain names. By finding domain names far from their nearest neighbor (the normalized edit distance we introduced in Part 1 is useful here), we can also find anomalous domain names in edit distance space. A sketch of this search is shown below.
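Here is a minimal sketch of that nearest-neighbor search, assuming the candidate_domains table from the cleaning step above, the same top_domains(domain, tld) baseline, and Vertica's editDistance function (substitute levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle). Normalizing by the candidate's length is one reasonable choice for illustration, not necessarily the exact formula from Part 1.

-- Nearest-neighbor sketch (illustrative names).
SELECT c.domain,
       MIN(editDistance(c.domain, t.domain))                           AS nn_distance,
       MIN(editDistance(c.domain, t.domain)) / LENGTH(c.domain)::FLOAT AS nn_distance_norm
FROM   candidate_domains c
JOIN   top_domains t
  ON   t.tld = c.tld              -- compare only within the same top-level domain
 AND   t.domain <> c.domain
GROUP  BY c.domain
ORDER  BY nn_distance_norm DESC;  -- far-from-neighbor anomalies first; nn_distance = 1 flags possible typos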

What were the Results?

Let’s look at how these results appear in practice. Use caution when browsing to these domains, since they may host malicious content!

Here are a few possible typos. Typosquatters target well-known domains since there are more chances that somebody will visit them. Several of these are flagged as suspicious by our threat feed partners, but there are some false positives as well, with charming names like “wikipedal”.

Here are some odd-looking domain names that are far from their neighbors.

So now we have produced two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: rank of the nearest neighbor, distance from the nearest neighbor, and an edit-distance-1-from-neighbor flag, indicating a risk of typo shenanigans. Other features that pair well with these are other lexical features such as word and n-gram distributions, entropy, and string length – and network features like the number of failed DNS requests. A sketch of assembling the edit distance features is shown below.
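As one possible shape of that feature table, here is a sketch assuming the nearest-neighbor output above has been stored in a table nn_results(domain, nn_rank, nn_distance, nn_distance_norm), where nn_rank is the popularity rank of the nearest neighbor; the entropy, n-gram, and DNS-failure features would be joined in from other sources.

-- Feature sketch for a downstream model (illustrative names).
CREATE TABLE domain_features AS
SELECT domain,
       nn_rank,                                                       -- rank of the nearest neighbor
       nn_distance_norm,                                              -- distance from the nearest neighbor
       CASE WHEN nn_distance = 1 THEN 1 ELSE 0 END AS possible_typo,  -- edit distance 1 from a popular domain
       LENGTH(domain)                              AS name_length     -- simple lexical feature
FROM   nn_results;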

Simple Code That You Can Play Around With

Here is a simplified version of the code to play with! It was developed on HP Vertica, but this SQL should work on most modern databases. Note that the Vertica editDistance function may be named differently in other systems (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).