UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection

In computer security, botnets still represent a significant cyber threat. Concealing techniques such as dynamic addressing and domain generation algorithms (DGAs) require an improved and more effective detection process. To this extent, this data descriptor presents a collection of over 30 million manually-labeled algorithmically generated domain names, decorated with a ready-to-use feature set for machine learning (ML) analysis. The proposed dataset has been co-submitted with the research article "UMUDGA: a dataset for profiling DGA-based botnet" [1], and it aims to let researchers move past the data collection, organization, and pre-processing phases, enabling them to focus on the analysis and the production of ML-powered solutions for network intrusion detection. In this research, we selected 50 of the most notorious malware variants to be as exhaustive as possible. Herein, each family is available both as a list of domains (generated by executing the malware DGAs in a controlled environment with fixed parameters) and as a collection of features (generated by extracting a combination of statistical and natural language processing metrics).


© 2020 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license.

Specification table
• Subject area: Computer Network and Communications, Artificial Intelligence
• More specific subject area: Network Security, Machine Learning, Natural Language Processing, Intrusion Detection Systems
• Type of data: TXT, CSV, and ARFF files
• How data were acquired: Domain Generation Algorithms have been implemented and executed, and their data have been collected and processed to extract the identified features
• Data format: Raw: lists of Fully Qualified Domain Names (FQDNs) in the form of TXT files. Analyzed: lists of features in the form of ARFF and CSV files
• Parameters for data collection: Domain Generation Algorithms (DGAs) have been executed to collect a fixed number of generated domains. Whenever required, the random generator has been initialized with the string "3138C81ED54AD5F8E905555A6623C9C9"
• Description of data collection: Phase 1: 37 DGAs have been collected and executed to generate at least 10,000 AGDs each. One million legitimate FQDNs have also been added to the collection, for a total of 38+ million domain names. Phase 2: each FQDN has been processed and compared with the English language to extract 100+ numerical features
• Data source location: Faculty of Computer Science, University of Murcia, Murcia, Spain
• Data accessibility: Data repository: UMUDGA: University of Murcia Domain Generation Algorithm Dataset [2]. Data identification number:

Data
The proposed dataset is publicly available through Mendeley Data [2]. As depicted in Fig. 1, the dataset is composed of four root folders that encompass different functionalities and scopes. In order of importance, these are:

• The domain generation algorithms - for each malware variant, this folder contains the DGA executable, the source code, and the reference to the analysis.
• The actual data folder (named Fully Qualified Domain Names) - for each malware variant plus the legitimate domains, there are three subfolders:
  - Raw list - includes the TXT lists of Fully Qualified Domain Names (FQDNs) in different tiers (e.g., 1000, 10,000);
  - ARFF features - includes the data processed and exported in the ARFF (see [4]) format;
  - CSV features - includes the data processed and exported as comma-separated CSV files.
• The language data - this folder contains the executables to preprocess any given language and the preprocessed, ready-to-use data for the English language (i.e., the raw wordlists obtained from the Leipzig Corpora [5] and the lists of extracted nGrams).
• The utility folder - this folder contains the executables and the source code for any relevant package that might be helpful to researchers, e.g., the collision checker.
In the following sections, we will refer to several figures and tables. Specifically:

• Figures:
  - Dataset structure - the figure mentioned above (Fig. 1) reports the Mendeley Data [2] repository structure;
  - Framework architecture - from the main co-submitted article [1, Fig. 3];
  - Domain levels - where {TLD, 2LD, OLD} denotes the domain levels.
• Tables:
  - General feature statistics (Table 2) - presents the mean, standard deviation, minimum, and maximum metrics for each feature and each nGrams set.
• Algorithms:
  - Algorithm 1 (LCS(d, A)) - presents the pseudocode for the Longest Consecutive Sequence algorithm;
  - Algorithm 2 (PE(d, p)) - presents the pseudocode for the percentiles calculation algorithm;
  - Algorithm 3 (R(t, A)) - presents the pseudocode for the ratio of characters algorithm.

Alongside the Mendeley Data [2], a duplicated copy of the source code, packages, executables, and documentation is available in a GitHub public repository [3] that serves as the official project page. Moreover, the GitHub wiki page "Feature Statistics" [3] also provides metrics and charts for each feature calculated and available in the dataset.

Experimental design, materials and methods
Before introducing the dataset, it is worth mentioning a few terms and definitions that will be used throughout the article. Firstly, with botnet we identify a group of infected machines, called bots or zombies, that communicate with one or more Command & Control (C&C) servers, which act as relays for the commands issued by the botmaster (the botnet owner). Bots often use pseudo-random domain generators, called domain generation algorithms (DGAs), to communicate with the C&C servers. These DGAs generate thousands of domain names, called algorithmically generated domains (AGDs). A deep dive on the subject, with specific attention to machine learning (ML) techniques, is offered by Plohmann et al. [6] and in [7,8].
The primary research article [1] thoroughly describes the architecture of the data generation framework (see [1, Fig. 3] ). To be precise, the figure highlights both the required inputs (the malware DGAs and the English Language Data) and the provided outputs (the AGD lists and the AGD features sets) that have been implemented to guarantee the scientific accuracy and reproducibility of the dataset.
A selected list of 50 malware variants has been collected, analyzed, processed, and included in the proposed dataset to be as complete as possible. The primary research article [1, Table 1] presents these malware variants according to their tier level, i.e., the number of AGDs generated for that specific malware variant. It is important to remark that several variants, such as Pizd, Gozi, or Rovnix, have wordlist-based DGAs; thus, their possible AGDs are limited. Firstly, each of the 50 malware variant DGAs included in the dataset has been collected from online sources [9][10][11] and implemented in a module named Domain List Generation. Their fixed initialization parameters are described in the dedicated subsection below. To be more precise, whenever a malware variant, such as Gozi, needs one or more wordlists in order to generate the domain names, we have considered each wordlist as a separate variant and stored the wordlist itself in the corresponding DGA folder.
Secondly, the raw lists of AGDs are then processed by the secondary module, named Feature Extraction , that calculates the features according to their formal definitions as described in the following dedicated subsection.
The generated AGD lists present 551 collisions, which are available in a separate file in the root of the project.

Domain list generation
Several independent executables, one implementing each malware variant DGA, constitute the backbone of the Domain List Generation module. The main output of this module is a list of AGDs generated by the malware variants. To be as precise as possible, each DGA implementation utilizes a fixed seed for the pseudorandom number generator (PRNG) and, whenever available, adopts the original initialization vectors for the specific malware sample analyzed. Each malware family also includes the links to the source code and the related analysis.

Feature extraction
The Feature Extraction module is composed of two independent processes, namely the NLP Processor and the nGrams Processor. The features extracted are the ones belonging to the Context-Free family, defined, quoting Zago et al. [7], as: "A feature that is related only to a Fully Qualified Domain Name (FQDN) and thus is independent of contextual information, including, but not limited to, timing, origin or any other environment configuration." The first and foremost example of this family is the lexical analysis of the domain name.
The Domain Inspector processes each AGD generated, as presented in [1, Fig. 3]. To be precise, the two primary submodules mentioned above require validated FQDNs augmented with their nGrams sets. Specifically, as reported in [1], this research only focuses on the first three sets of nGrams (i.e., n = 1, 2, 3).
The first process ( i.e ., the NLP Processor ) extracts a total of 22 features by analyzing the domain name as a string. Table 1 presents the extracted list with their formal definitions.
The second process (i.e., the nGrams Processor) compares the different sets of nGrams generated by the Domain Inspector with the ones provided by the Leipzig Corpora [5] for the English language (one million words from Wikipedia, 2016 update), generating a total of 29 features per nGrams set. Section 2.3 presents the formal definitions and the algorithms required for extending and validating the feature set.

Feature definitions
In order to provide a formal declaration of the proposed features, it is necessary to establish a set of standard definitions. Firstly, it is necessary to introduce a series of well-defined terms that will be used throughout most of the definitions. Intuitively, these definitions will refer to the set of nGrams (Def. 1) and its distributions, either absolute (Def. 3) or relative (Def. 4), and the application that calculates them (Def. 2). Moreover, since most of the features aim to compare this distribution with the one obtained from the English language, another series of definitions is necessary, namely the absolute (Def. 6) and relative (Def. 8) distributions and the applications that calculate them (Def. 5 and Def. 7, respectively). To avoid symbol ambiguity, with |·| we will refer to the size of the collection "·", while with ABS(·) we will refer to the absolute value of the variable "·". It is important to notice that Def. 1 explicitly excludes the dot (".") character, due to its reserved use as hierarchical separator [12], and the underscore ("_") character, as per RFC 1034 [12].
Having the definition of the nGrams set, we define the application that transforms any FQDN into a vector of fixed length representing the occurrences of each nGram.

Definition 2 (nGrams Application). Let d be a FQDN, G its sorted nGrams set (see Def. 1), n the size of the nGrams, and let F(g, d) be the absolute frequency for all the nGrams g ∈ G of the domain d. Then we define as ρ the linear application that associates each element of G of the domain d with a real number, in the form of a vector of absolute frequencies:

ρ(d) = ( F(g_1, d), F(g_2, d), …, F(g_|G|, d) )

Definition 3 (nGrams Vector). Let d be a FQDN. Then we define as w_d the vector resulting from applying ρ(·) to the nGrams set G obtained from the domain d. Formally:

w_d = ρ(d)

Definition 4 (nGrams Relative Vector). Let w̄_d be the vector of relative frequencies obtained by dividing each element of w_d by the total sum. Mathematically:

w̄_d[i] = w_d[i] / Σ_{j=1}^{|w_d|} w_d[j]

By construction, the elements of w̄_d sum to 1, and any nGram g ∈ G that does not occur in d contributes 0 to both w_d and w̄_d.
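Defs. 2-4 can be sketched in a few lines of Python. The helper names `ngrams_vector` and `relative_vector` are illustrative only; the dataset's own extractor is the closed-source Java module described later.

```python
from collections import Counter

def ngrams_vector(domain: str, n: int, grams: list[str]) -> list[int]:
    """Absolute-frequency vector w_d over the sorted nGrams set G (Defs. 2-3).
    Dots are excluded from the nGrams, per Def. 1."""
    s = domain.replace(".", "")
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    return [counts.get(g, 0) for g in grams]

def relative_vector(w: list[int]) -> list[float]:
    """Relative-frequency vector w̄_d (Def. 4): each element divided by the
    total sum; an all-zero vector is returned unchanged."""
    total = sum(w)
    return [x / total for x in w] if total else [0.0] * len(w)

# Toy example: 1-grams of a short FQDN over its own sorted 1-gram set.
grams = sorted(set("umudga"))                 # ['a', 'd', 'g', 'm', 'u']
w = ngrams_vector("umu.dga", 1, grams)
print(w, relative_vector(w))                  # [1, 1, 1, 1, 2] and its normalization
```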
The obtained nGrams vector can be compared with virtually any language data, namely the nGrams relative frequency, i.e., the frequency of the nGrams in the target language.
Definition 5 (nGrams Language Application). Let d be a FQDN, G its sorted nGrams set (see Def. 1), n the size of the nGrams, and let L(g, T) be the absolute frequency in the target language dictionary T for all the nGrams g ∈ G of the domain d. Within the scope of this article, T is the English language dictionary [5]. Then we define as σ the linear application that associates each element of G of the domain d with a real number, in the form of a vector of absolute frequencies:

σ(d) = ( L(g_1, T), L(g_2, T), …, L(g_|G|, T) )

Definition 6 (nGrams Language Vector). Let d be a FQDN. Then we define as φ_d the vector resulting from applying σ(·) to the nGrams set G obtained from the domain d. Formally:

φ_d = σ(d)

Definition 7 (nGrams Language Relative Application). Let d be a FQDN, G its sorted nGrams set (see Def. 1), n the size of the nGrams, and let L̄(g, T) be the relative frequency in the target language dictionary T for all the nGrams g ∈ G of the domain d. Within the scope of this article, T is the English language dictionary [5]. Then we define as σ̄ the linear application that transforms the domain d into a vector of relative frequencies:

σ̄(d) = ( L̄(g_1, T), L̄(g_2, T), …, L̄(g_|G|, T) )

Definition 8 (nGrams Language Relative Vector). Let d be a FQDN. Then we define as φ̄_d the vector resulting from applying σ̄(·) to the domain d. Formally:

φ̄_d = σ̄(d)
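Defs. 5-8 then amount to a frequency lookup. A sketch follows, with made-up 1-gram relative frequencies (illustrative values only, not the actual Leipzig Corpora [5] figures):

```python
# Hypothetical English 1-gram relative frequencies (illustrative, not [5]).
ENGLISH_1G = {"a": 0.082, "d": 0.043, "g": 0.020, "m": 0.024, "u": 0.028}

def language_relative_vector(domain: str, n: int,
                             freq: dict[str, float]) -> list[float]:
    """phi-bar_d (Def. 8): relative language frequency for each nGram of d,
    over the domain's sorted nGrams set G; unseen nGrams map to 0."""
    s = domain.replace(".", "")
    grams = sorted({s[i:i + n] for i in range(len(s) - n + 1)})
    return [freq.get(g, 0.0) for g in grams]

print(language_relative_vector("umu.dga", 1, ENGLISH_1G))
```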

Domain name as string
The first set of features are the ones that do not depend on the size of the chosen nGrams, and they are presented in Table 1. In the table, we make use of three algorithms: i) the Longest Consecutive Sequence (LCS(d, A), Algorithm 1), that extracts the longest consecutive sequence of characters of d belonging to a given alphabet A; ii) the Percentiles calculation (PE(d, p), Algorithm 2); and iii) the Ratio of characters (R(t, A), Algorithm 3). A FQDN such as "www.um.es" is composed of several Domain Levels (LD), and in this article we will refer to "es" as the top level domain (TLD), to "um" as the second level domain (2LD), and to "www" concatenated with any other subdomain level as the other level domain (OLD).
The features defined in Table 1 include properties such as the number of domain levels; the longest consecutive sequences of consonants, vowels, and numbers; and multiple ratios between sets of characters and the domain name.
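For instance, LCS(d, A) from Algorithm 1 reduces to a single scan of the domain string. The sketch below is illustrative (the dot separator simply breaks any run, since it belongs to no alphabet):

```python
def lcs(domain: str, alphabet: set[str]) -> int:
    """Longest consecutive sequence of characters of `domain` drawn from
    `alphabet` (a sketch of Algorithm 1, LCS(d, A))."""
    best = run = 0
    for ch in domain:
        run = run + 1 if ch in alphabet else 0
        best = max(best, run)
    return best

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

# "www" is the longest consonant run of "www.um.es";
# the Kraken-style AGD below has a much longer one.
print(lcs("www.um.es", CONSONANTS), lcs("dajsrmdwhv.tv", CONSONANTS))
```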

Domain name as nGram
With regard to the features that depend on the size of the nGrams, the following paragraphs introduce their formal definitions with the relative description and mathematical notation. Each feature is repeated for each distinct value of n; in this proposed dataset (available at [2]), the values of n are n = 1, 2, 3. In the following paragraphs, each feature is individually formalised.

Algorithm 2 Percentiles calculation - PE(d, p)
Require: w_d, the nGrams array; p, the desired percentile
Ensure: w_d is sorted

Feature nG-E: Entropy. Entropy is the average rate at which information is produced by a stochastic source of data.
Mathematically, let φ̄_d be the English relative vector (see Def. 8) of the domain d; then the entropy of the domain is defined as:

E(d) = − Σ_{i=1}^{m} φ̄_d[i] · log₂(φ̄_d[i]), with m = |φ̄_d| and the convention 0 · log₂ 0 = 0

Feature nG-COV: Covariance. The sample covariance is a measure of the joint variability of two random variables. Let w̄_d be the nGrams relative vector (see Def. 4) and φ_d be the nGrams language vector (see Def. 6). Covariance allows us to determine whether a dependence exists between w̄_d and φ_d for a given d. We will use the following formula:

COV(w̄_d, φ_d) = (1 / (m − 1)) Σ_{i=1}^{m} (w̄_d[i] − mean(w̄_d)) (φ_d[i] − mean(φ_d))

where mean(·) denotes the arithmetic mean of "·" and m = |w̄_d| = |φ_d|.

Feature nG-KEN: Kendall's Correlation. Let i, j be two independent indexes running over the size m = |w̄_d| = |φ̄_d| (see Def. 4 and Def. 8). Then, for any two pairs (w_i ∈ w̄_d, φ_i ∈ φ̄_d) and (w_j ∈ w̄_d, φ_j ∈ φ̄_d), Kendall's correlation defines them as concordant if their ranks agree (both w_i > w_j and φ_i > φ_j, or both w_i < w_j and φ_i < φ_j) and as discordant otherwise. It follows:

τ = (n_c − n_d) / n_0

where n_0 = m(m − 1)/2, n_c is the number of concordant pairs, and n_d is the number of discordant pairs.

Feature nG-PEA: Pearson's Correlation. Let w̄_d be the nGrams relative vector (see Def. 4) and φ̄_d be the nGrams language relative vector (see Def. 8), and let m = |w̄_d| = |φ̄_d| be the size of the two vectors. We define the Pearson's correlation as:

PEA(w̄_d, φ̄_d) = Σ_{i=1}^{m} (w̄_d[i] − mean(w̄_d))(φ̄_d[i] − mean(φ̄_d)) / sqrt( Σ_{i=1}^{m} (w̄_d[i] − mean(w̄_d))² · Σ_{i=1}^{m} (φ̄_d[i] − mean(φ̄_d))² )

where mean(·) denotes the arithmetic mean of "·".

Feature nG-SPE: Spearman's Correlation. Computes Spearman's rank correlation of the domain d with respect to the English language. It is implemented with the Apache Commons Math SpearmansCorrelation class [13].
We will also refer to this feature with the symbol "w".
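As a concrete sketch, the entropy, covariance, and Pearson features above can be computed in plain Python as follows (function names are illustrative; the dataset's own extractor is the closed-source Java module described later):

```python
import math

def entropy(phi: list[float]) -> float:
    """Feature nG-E: Shannon entropy of the language relative vector
    (zero-probability terms contribute 0, matching the convention above)."""
    return -sum(p * math.log2(p) for p in phi if p > 0)

def covariance(w: list[float], phi: list[float]) -> float:
    """Feature nG-COV: sample covariance between the two frequency vectors."""
    m = len(w)
    mw, mp = sum(w) / m, sum(phi) / m
    return sum((a - mw) * (b - mp) for a, b in zip(w, phi)) / (m - 1)

def pearson(w: list[float], phi: list[float]) -> float:
    """Feature nG-PEA: Pearson's correlation between the two vectors;
    returns 0.0 for degenerate (constant) inputs."""
    m = len(w)
    mw, mp = sum(w) / m, sum(phi) / m
    num = sum((a - mw) * (b - mp) for a, b in zip(w, phi))
    den = math.sqrt(sum((a - mw) ** 2 for a in w)
                    * sum((b - mp) ** 2 for b in phi))
    return num / den if den else 0.0

# A domain whose nGrams all have zero English probability has zero entropy,
# as noted in the Feature Statistics section for "dajsrmdwhv.tv".
print(entropy([0.0, 0.0, 0.0]))
```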
Feature nG-QMEAN: Quadratic mean of frequencies. Represents the quadratic mean (or root mean square) of the relative frequencies for the domain d. Let w̄_d be the nGrams relative vector (see Def. 4) of the domain d, with m = |w̄_d|:

QMEAN(w̄_d) = sqrt( (1/m) Σ_{i=1}^{m} w̄_d[i]² )

Feature nG-SUMSQ: Squared sum of frequencies. Represents the squared sum of the relative frequencies of the domain d. Mathematically, let w̄_d be the nGrams relative vector (see Def. 4) of the domain d:

SUMSQ(w̄_d) = Σ_{i=1}^{m} w̄_d[i]²

Feature nG-VAR: Variance of frequencies. Represents the variance of the relative frequencies of the domain d. Mathematically, let w̄_d be the nGrams relative vector (see Def. 4) of the domain d:

VAR(w̄_d) = (1/(m − 1)) Σ_{i=1}^{m} (w̄_d[i] − mean(w̄_d))²

Feature nG-PSTD: Population standard deviation of frequencies. Represents the population standard deviation of the relative frequencies of the domain d. Mathematically, let w̄_d be the nGrams relative vector (see Def. 4) of the domain d:

PSTD(w̄_d) = sqrt( (1/m) Σ_{i=1}^{m} (w̄_d[i] − mean(w̄_d))² )

The kurtosis is not defined for collections with fewer than 3 elements. Such an event cannot occur in our environment because the size of the vector |w̄_d| is always greater than 3.
The skewness is not defined for collections with fewer than 2 elements. Such an event cannot occur in our environment because the size of the vector |w̄_d| is always greater than 2.
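A minimal Python rendering of the four frequency statistics above (sketches, not the dataset's Java implementation; that nG-VAR uses the sample estimator is an assumption based on the "sample covariance" wording earlier):

```python
import math

def qmean(w: list[float]) -> float:
    """nG-QMEAN: quadratic mean (root mean square) of the frequencies."""
    return math.sqrt(sum(x * x for x in w) / len(w))

def sumsq(w: list[float]) -> float:
    """nG-SUMSQ: squared sum of the frequencies."""
    return sum(x * x for x in w)

def variance(w: list[float]) -> float:
    """nG-VAR: variance of the frequencies (sample estimator assumed)."""
    m = sum(w) / len(w)
    return sum((x - m) ** 2 for x in w) / (len(w) - 1)

def pstd(w: list[float]) -> float:
    """nG-PSTD: population standard deviation of the frequencies."""
    m = sum(w) / len(w)
    return math.sqrt(sum((x - m) ** 2 for x in w) / len(w))

print(qmean([3.0, 4.0]), sumsq([1.0, 2.0]),
      variance([1.0, 2.0, 3.0]), pstd([1.0, 1.0]))
```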
Feature nG-TPVAR: Population variance of target language frequencies. Represents the population variance of the English language frequencies for the nGrams of d. Mathematically, with m = |φ_d|:

TPVAR(φ_d) = (1/m) Σ_{i=1}^{m} (φ_d[i] − mean(φ_d))²

The kurtosis is not defined for collections with fewer than 3 elements. Such an event cannot occur in our environment because the size of the vector |φ_d| is always greater than 3.
The skewness is not defined for collections with fewer than 2 elements. Such an event cannot occur in our environment because the size of the vector |φ_d| is always greater than 2.
Feature nG-PRO: Pronounceability Score. This feature calculates how pronounceable a domain d is. As described by [14, Linguistic Filter 2], it quantifies "the extent to which a string adheres to the phonotactics of the English language". However, we consider the whole FQDN as the base for the computation, not only the 2LD. Let φ̄_d be the English relative vector (see Def. 8) of the domain d and n the nGrams size; the score then follows from the phonotactic frequencies collected in φ̄_d, as per [14].

Feature nG-NORM: Normality Score. This feature calculates a score that reflects how closely the domain matches the attributes of the English language, as defined by [15, Feature 9]. Mathematically, let w_d be the nGrams vector (see Def. 3) and φ_d the nGrams language vector (see Def. 6); the score is built on the Jaccard similarity coefficient J(w_d, φ_d) between the two vectors, where ABS(·) denotes the absolute value of "·".
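The Jaccard coefficient between two frequency vectors admits several formulations; a common weighted generalization (an assumption here, since the exact expression did not survive extraction) is:

```python
def jaccard(w: list[float], phi: list[float]) -> float:
    """Weighted Jaccard similarity between two non-negative frequency
    vectors: sum of element-wise minima over sum of element-wise maxima.
    This generalization is an assumption, not necessarily the dataset's
    exact formula."""
    num = sum(min(a, b) for a, b in zip(w, phi))
    den = sum(max(a, b) for a, b in zip(w, phi))
    return num / den if den else 0.0

# Identical vectors score 1.0; disjoint supports score 0.0.
print(jaccard([1.0, 0.0], [1.0, 0.0]), jaccard([1.0, 0.0], [0.0, 1.0]))
```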
Feature nG-DST-EM: Earth Mover's Distance. Calculates the Earth Mover's distance (also known as the 1st Wasserstein distance) of the relative frequencies w̄_d with respect to the English language. It is implemented with the Apache Commons Math EarthMoversDistance class [13].

Table 2 presents classic statistical measures for the features, considering the whole dataset altogether. It is worth mentioning that, for each feature, the class-wise boxplot distribution is available at [3].

Feature Statistics
By looking at Table 2, it is worth noticing a few values that stand out for two different reasons, namely having a zero value for either the minimum or the standard deviation: • Minimum value equal to zero - the reason behind these values is to be found in the nature of the feature. For example, the NLP-1G-MED feature reports the median value of the frequency distribution, which in most AGDs is zero. When considering the NLP-3G-E feature, however, the reason is different: if each 3Gram has zero probability, e.g., the AGD "dajsrmdwhv.tv" belonging to the Kraken (2nd version) variant, then the entropy is defined as zero.
• Standard deviation equal to zero - in order to have zero standard deviation, all the values of the feature must be equal. This is the case for a group of features calculated over 2Grams and 3Grams, namely NLP-nG-25P, NLP-nG-50P, NLP-nG-75P, and NLP-nG-MED, where n = 2, 3. Once again, having most of the terms at zero in the AGD distributions causes these features to have a zero value themselves. This does not happen in the 1Gram case because of the non-zero probability of each term. For completeness, these features are nevertheless still included in the dataset.
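The zero-valued percentile features can be reproduced with a short sketch of PE(d, p) (linear interpolation between ranks is an assumption; the dataset's exact rule is the one in Algorithm 2). Over a typical sparse AGD 2Gram/3Gram distribution, the quartiles and the median all collapse to zero:

```python
def percentile(w_d: list[float], p: float) -> float:
    """p-th percentile (0 <= p <= 100) of the frequency array, computed on
    the sorted values with linear interpolation between ranks (assumed)."""
    w = sorted(w_d)
    k = (len(w) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(w) - 1)
    return w[lo] + (w[hi] - w[lo]) * (k - lo)

# Sparse distribution: most nGrams never occur, so the 25th, 50th, and
# 75th percentiles are all zero -- hence the zero standard deviations.
sparse = [0.0] * 7 + [0.25, 0.75]
print(percentile(sparse, 25), percentile(sparse, 50), percentile(sparse, 75))
```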

Code and data availability
As specified in the previous section, there are two main code components that interact to generate the proposed dataset, namely the Domain List Generation and the Feature Extraction modules. The dataset with the released code has been published on the well-known platform Mendeley Data [2] . Fig. 1 highlights the structure of the repository.

Domain list generation module
This module is mainly realized in Python 2.7, and it has been released under the MIT license. As specified before, the PRNGs have been initialized with a specific seed (either an integer or a string), available within each DGA source code.
Specifically, the fixed parameters for each DGA are:

• PRNG Seed - each random generator has been initialized with the hardcoded integer value "521496385".
• String Seed - whenever a DGA requires a string seed as initialization vector, the module uses the string "3138C81ED54AD5F8E905555A6623C9C9".
• Malware variant specific seeds - security vendors often release, along with the related signatures, the initialization vectors for each variant discovered in the wild (either TLDs, numbers, strings, or wordlists). In such cases, the initialization vectors are coded in the generator and marked with the online source for reference.
• Random date range - most of the DGAs require a random date in order to generate the AGDs. When not fixed by some internal constraint, the dates are generated randomly from 01/01/1970 01:00 AM to 01/01/3000 01:10 AM.
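For illustration only, the fixed-seed policy can be mirrored with a toy, hash-based generator. This is NOT one of the dataset's 50 malware DGAs; the function and its output format are invented for the example, and only the string seed is taken from the documentation above.

```python
import hashlib

# Documented string seed, reused here so that runs are reproducible.
SEED = "3138C81ED54AD5F8E905555A6623C9C9"

def generate(seed: str, count: int, tld: str = "com") -> list[str]:
    """Toy DGA-style generator: iterate an MD5 state from the seed and map
    digest bytes to lowercase letters, yielding deterministic domains."""
    domains = []
    state = seed.encode()
    for _ in range(count):
        state = hashlib.md5(state).digest()
        name = "".join(chr(ord("a") + b % 26) for b in state[:10])
        domains.append(f"{name}.{tld}")
    return domains

print(generate(SEED, 3))
```

Because every run starts from the same fixed seed, two executions produce identical domain lists, which is exactly the reproducibility property the fixed parameters above guarantee for the real DGAs.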

Feature extraction module
This module implements the feature definitions as described in Section 2.3. It has been realised in Java 1.8, making use primarily of Apache Commons Math [13] as the main library for statistical and mathematical purposes.
The code, however, is closed source and is not, and will not be, released to the general public.

Technical validation
When considering the list of FQDNs that we assume legitimate, two main problems are to be considered. As specified before, each domain is firstly validated by the Apache Commons Validator library. A total of 178 FQDNs fail to pass the validation procedure. To be more precise: • 140 domains are technically invalid because of the presence of at least one underscore character ("_"): the validation library checks the domains against RFC 1123 [16], which limits host names to letters, digits, and hyphens. The policy for the underscore character has been clarified later in RFC 2181 [17, Section 11];
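A minimal sketch of the RFC 1123 rule that these 140 domains violate (a simplified, illustrative check, not the Apache validation library actually used by the framework):

```python
import re

# RFC 1123 label: letters, digits, and hyphens only; 1-63 characters;
# must not start or end with a hyphen. Underscores are rejected.
LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def is_valid_hostname(fqdn: str) -> bool:
    """Return True when every dot-separated label of `fqdn` satisfies the
    simplified RFC 1123 host-name syntax above."""
    if len(fqdn) > 253:          # overall length limit for a host name
        return False
    return all(LABEL.match(label) for label in fqdn.split("."))

print(is_valid_hostname("www.um.es"), is_valid_hostname("foo_bar.example.com"))
```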