MaldomDetector: A System for Detecting Algorithmically Generated Domain Names with Machine Learning

One of the leading problems in cyber security at present is the unceasing emergence of sophisticated attacks, such as botnets and ransomware, that rely heavily on Command and Control (C&C) channels to conduct their malicious activities remotely. To avoid channel detection, attackers constantly try to create different covert communication techniques. One such technique is the Domain Generation Algorithm (DGA), which allows malware to generate numerous domain names until it finds its corresponding C&C server. It is highly resilient to detection systems and reverse engineering, while allowing the C&C server to have several redundant domain names. This paper presents a malicious domain name detection system, MaldomDetector, which is based on machine learning. It is capable of detecting DGA-based communications and circumventing the attack before it makes any successful connection with the C&C server, using only the characters of the domain name. MaldomDetector uses a set of easy-to-compute and language-independent features in addition to a deterministic algorithm to detect malicious domains. The experimental results demonstrate that MaldomDetector can operate efficiently as a first alarm to detect DGA-based domains of malware families while maintaining high detection accuracy.

between the victims and the controllers, i.e., cybercriminals, and thus limit the harmful effects of these sophisticated malware attacks [12]. Domains generated by DGAs do not usually resemble typical word-based domains; however, identifying them is a difficult process [13]. In this paper, a detection system, MaldomDetector, is proposed to detect DGA-based domain names effectively, before any successful connection with the C&C server can be made. It consists of two modules. The first is the Data Preparation Module, which extracts informative features and employs a deterministic algorithm called the Randomness Measuring Algorithm (RMA) to measure the randomness in the domain name characters. The second is the Decision Making Module, which runs a machine learning classifier to process the extracted features and produce a decision offering high detection accuracy. The remainder of this paper is organized as follows. Section 2 discusses the related work. Section 3 describes MaldomDetector's high-level architecture. Section 4 discusses MaldomDetector's implementation, including feature extraction, dataset creation, and building and evaluating the classifier. Section 5 discusses the properties of DGA-based domain name detection systems. Finally, Section 6 concludes the paper.

Related Work
Several research works have attempted to detect algorithmically generated domain names by analysing the names' strings. Xu et al. [14] proposed a method to detect DGA-based domain names by inspecting the domain's character strings, combining an n-gram approach with a deep convolutional neural network. They provided evaluation results for different types of DGA using bigram and trigram representations. The average accuracies for 2-gram and 3-gram were 94.15% and 98.29% respectively. However, n-gram based approaches are computationally expensive and language-dependent. Yu et al. [15] investigated the use of deep neural network techniques to detect DGA-based domains based solely on the domain name characters, trained with large amounts of heuristically labelled real traffic. However, this method has many parameters that must be estimated, causing high computational cost, and requires large amounts of training data to learn, i.e., has long training times. Selvi et al. [16] presented a machine learning approach, which used a Random Forest algorithm to detect DGA-based domain names. They extracted features that rely on several characteristics, such as the lexical attributes of the domain names, some statistical information, and masked n-grams. The experimental results showed a detection accuracy of 98.91% and a false positive rate of 0.76%. However, most of the extracted features are statistical (i.e., mean, variance, and standard deviation) and become less effective when the domain name is short. Additionally, this method requires a relatively long training time (approximately 1.21 hours). Lv et al. [17] analysed malicious and benign domain names using a Hidden Markov Model (HMM), where a total of 12 attributes from five categories were extracted from the characteristics of DNS communications. One of these categories relates to the characters of the domain name.
However, HMM requires expensive computations and incurs a time delay due to its dependence on a set of features extracted from the DNS response packet. The classification accuracy and the recall rate obtained from the test results were 91.52% and 89.32% respectively. Mac et al. [18] presented a thorough investigation of various methods, such as supervised learning techniques, Hidden Markov Model (HMM), and bidirectional Long Short-Term Memory Network (LSTM), to detect botnet attacks based on DGA. A detailed comparative analysis of these methods was presented. The maximum precision and recall obtained from this research were 92.32% and 93.09% respectively. However, just like n-gram-based approaches, Markov-based techniques are language-dependent and require expensive computations. Shi et al. [19] proposed the Extreme Learning Machine (ELM) technique to detect harmful domain names. Three out of the nine features were lexical attributes. However, the rest of the features require data from the DNS responses and access to the information in the WHOIS lookup web service, both of which may increase the time taken to identify the malicious domains. The dataset used was imbalanced, and the accuracy rate was 96.28%. Song et al. [20] presented a method based on a Random Forest classifier to detect algorithmically generated domain names. Ten features were extracted from the characters of the domain name. Some features were based on n-gram computations. The achieved precision was 93.5% and the false positive rate was 3.49%. Truong and Cheng [21] proposed a system for detecting domain fluxing by analysing DNS traffic and extracting lexical features from the character strings in malicious and benign domain names. A machine learning method was used to build this system, achieving 92.3% accuracy and a 4.8% false positive rate.
The methods discussed above are based on a set of features that are either extracted from whole DNS communications (i.e., DNS query and response packets) or that require data from external sources, such as the WHOIS site. Furthermore, most of them are language-dependent and require complex computations. However, building a system that can accurately identify malicious domain names based on features extracted from the initial DNS requests would be preferable, as it offers potentially earlier detection. This paper proposes a machine learning-based detection system, MaldomDetector. It provides high detection accuracy (~98%) and a low false positive rate (~4%), depending solely on character-level features extracted from the domain name string of the initial DNS request. These features are easy to compute and do not need access to any external sources.

3-MaldomDetector High-Level Architecture
The architecture of MaldomDetector is presented in this section. As depicted in Fig

Raw dataset collection
A labelled dataset is required to train the classifier of MaldomDetector using a supervised learning method. It is important that this dataset is made up of samples from many types of DGA families, so that the classifier is trained on different types of malicious domains. The quality of the underpinning dataset is crucial to any machine learning task. A high-quality pre-labelled ground truth dataset (i.e., verified malicious and benign data) was chosen to train and evaluate MaldomDetector [14] [16] [15]. The malicious domains were collected from a known ground truth dataset, DGArchive, which had been used in previous research [14] [15]. Real DGA-based domains that appeared on the internet have been collected in the DGArchive project [22] [23]. In this project, different DGA-based malware families, such as Locky and Cryptolocker, were analysed comprehensively, and all the possible domain names generated dynamically by some types of malware have been resolved or enumerated to cover the majority of real and active DGAs. In addition, a set of DGA-based domains was obtained from the Bambenek consulting feeds [24], which is also a known ground truth dataset of DGA-based domain names collected by reverse engineering specific malware families observed in real traffic. Bambenek malicious domains have also been used in past work, e.g., [15] and [25]. Bambenek feeds were used to create the testing dataset to check MaldomDetector's performance on unseen examples. All of the domain names collected from the DGArchive project and the Bambenek feeds were labelled as malicious. The malware families employing DGA collected from DGArchive, together with the DGAs taken from the Bambenek dataset, are summarised in Table 1. The ground truth benign domains were collected from the Alexa top domains site [26], which lists the domain names of the most visited websites on the internet.
Alexa top ranked lists were often used in the preceding research works [14] [16] [18] [19] [15] [25] because they represent trustworthy sets of normal domain names. It ranks the websites based on their popularity using different criteria, such as page views and unique visitors, and provides various lists of top sites, e.g. top 500 and top 1000. We selected 85,000 domain names from the top 1 million sites to build the required benign dataset. Since the domains of Alexa are ranked based on their popularity, the first top domains were selected instead of random selection to form the ground truth benign dataset, because they are more representative of how a benign domain looks [14] [19] [15].

Domain Name Analysis
The domain names in the raw datasets of Table 1 were analysed to extract a set of attributes that can be used to detect DGA-based domain names. The top-level domains (TLDs) were excluded from this analysis because the DGA-based domains use the same TLDs that are used in benign domains. It is noticeable that most of the DGA-based domain names contain meaningless strings that are difficult to pronounce or read. This difficulty stems from the presence of several sequential consonant or vowel letters in most malicious domain names. Table 2 displays some examples of the malicious domain names. Fig. 2 compares the frequency of the number of sequential consonant letters between the benign and DGA-based domain names. According to some sources, such as [27], the letter "y" is a special letter that can represent both kinds of speech sounds, i.e., vowel and consonant, depending on its position and the letters surrounding it. Since some features extracted in this research are related to pronunciation, such as the maximum number of sequential consonant and vowel letters, the status of the letter "y" can affect the values of these features and hence the detection accuracy. Therefore, we considered both cases of "y" during the implementation of the experiments using RMA, and we found that treating "y" as a vowel gives better accuracy. Shannon's entropy can serve as a good metric to measure the randomness in the distribution of characters within a given domain name [8]. It can be calculated using equation (1).
$$H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i \tag{1}$$
where $P$ is the probability distribution $P = (p_1, \dots, p_n)$ with $p_i \geq 0$ for $i = 1, \dots, n$ and $\sum_{i=1}^{n} p_i = 1$. Entropy exceeding a threshold value can be a useful indicator to identify DGA-based domain names. The normal distribution curves of entropy values for the DGA-based and benign domains are shown in Fig. 3. It is evident from Fig. 3 that there is a clear differentiation in the probability distribution of the entropy values between the curves, which suggests that entropy is a useful feature to classify the domains. Therefore, the entropy value of the domain name was selected as a feature to identify the DGA-based domain names. The entropy values of some domain names are indicated in Table 2. RMA has been constructed to initially identify harmful domain names by measuring the randomness in the domain name's characters. It is an enhanced version of our earlier work in [28]. RMA is a deterministic algorithm that accepts a subset of the basic features as an input, i.e., the entropy, the maximum sequential consonants, the maximum sequential vowels, and the domain name length. It then processes them according to the threshold values depicted in Fig. 4. These threshold values were determined based on the domain name analysis conducted in section 4.2.
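As an illustration of equation (1), the following minimal Python sketch computes the character-level entropy of a domain label. It assumes a base-2 logarithm and per-character relative frequencies as the probabilities; `strip_tld` is a hypothetical helper, not part of MaldomDetector, and it naively treats the last dot-separated label as the TLD.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy (in bits) of the character distribution of `label`."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    # p_i is the relative frequency of each distinct character.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def strip_tld(domain: str) -> str:
    """Hypothetical helper: lower-case the name and drop the last label.

    Assumption: the TLD is a single label, so multi-label suffixes such as
    "co.uk" are not handled.
    """
    domain = domain.lower()
    return domain.rsplit(".", 1)[0] if "." in domain else domain


if __name__ == "__main__":
    for name in ("google.com", "xkqzjvhwpt.net"):
        print(name, round(shannon_entropy(strip_tld(name)), 3))
```

A label made of a few repeated characters yields a low value, whereas a label in which every character is distinct reaches the maximum of log2 of its length.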

Fig. 4 The Randomness Measuring Algorithm (RMA)
After inspecting the probability distribution of the entropy values in Fig. 3, we identified three base points in the distribution curves that can be used in RMA as thresholds, as shown in Fig. 5. The following were found:
• The ratio of the benign domains that have entropy (H) <= 2 is 18.99%, while the ratio of the DGA-based domains is 0.093%.
• The ratio of the DGA-based and benign domains that have H > 3.24 is 77.83% and 10.44% respectively.
• The values 2 < H <= 3.24 occur in both the malicious and benign domains in different proportions, making the recognition of these domains a non-trivial task, which requires additional complex feature(s).
Since the goal of this research is to build a detection system using uncomplicated features, two simple features, i.e., maximum sequential consonants and maximum sequential vowels, were added to the rules of RMA to help increase discrimination between sample distributions and reduce the false rate. The added rule states that most of the DGA-based domains have four or more sequential consonants or vowels, while many of the benign domains have fewer than four sequential consonants or vowels. RMA was implemented in Python and evaluated using various DGA families. RMA does not contain any parameters related to dictionary words or the frequency distribution of letters; therefore, it is a language-independent algorithm. RMA has been applied to 20 types of malware families of the DGArchive dataset and the results are indicated in Table 3. RMA was also applied to 85,000 Alexa domain names, where the results show that the detection accuracy and false positive rate are 83.14% and 16.86% respectively. It is noticeable from the results above that RMA has high accuracy in detecting most DGA families and reasonable accuracy on a few types such as gozi and dnschanger. The detection process becomes difficult when a malware DGA generates domain names containing meaningful words or low randomness in their characters.
However, the majority of the DGA families generate random domains because they rely on a random algorithm to generate many domains. In addition, the botmaster must register a few domains to enable the malware to make a connection. Therefore, the malware developer avoids using wordlists in algorithmically generated domains in order to avoid conflict with legitimate domains during the registration process. It is also noticeable that although some DGA families in Table 1 have domain names that contain alphanumeric characters, such as Bamital, Dyre, and Murofetweekly, they still contain a number of sequential consonant and vowel letters, and most of them have an entropy value (H) > 3.24, as shown in some examples of Table 2. These families have been detected by RMA with high accuracy, as indicated in Table 3.
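The thresholds discussed above can be sketched in Python as follows. This is only a simplified illustration under stated assumptions, not the RMA of Fig. 4: the figure is not reproduced in the text, so the order of the rules is assumed, the domain name length input is ignored, and the function name `rma_sketch` is hypothetical. The letter "y" is treated as a vowel, as reported in section 4.2.

```python
import math
from collections import Counter

VOWELS = set("aeiouy")  # "y" counted as a vowel, following the text


def entropy(label: str) -> float:
    """Character-level Shannon entropy in bits (base-2 log assumed)."""
    if not label:
        return 0.0
    total = len(label)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(label).values())


def max_run(label: str, predicate) -> int:
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in label:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def rma_sketch(label: str) -> int:
    """Return 1 (looks random / DGA-like) or 0 (looks benign).

    Assumed rule order: low entropy -> benign, high entropy -> malicious,
    middle band decided by runs of four or more consonants or vowels.
    """
    label = label.lower()
    h = entropy(label)
    if h <= 2.0:
        return 0
    if h > 3.24:
        return 1
    consonant_run = max_run(label, lambda ch: ch.isalpha() and ch not in VOWELS)
    vowel_run = max_run(label, lambda ch: ch in VOWELS)
    return 1 if max(consonant_run, vowel_run) >= 4 else 0
```

Digits and hyphens break a run in this sketch, which is one possible reading of "sequential" letters; the original implementation may treat them differently.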
In order to increase the detection accuracy and build a system that can address low randomness in the domain name strings, a machine learning-based system has been constructed. MaldomDetector employs the output of RMA along with other engineered features to detect malicious domains effectively. The next section illustrates the process of building the system.

Feature extraction and selection
Two types of features have been extracted based on the domain name analysis in section 4.2: basic features and derived features. The basic features include the entropy, max sequential consonants, max sequential vowels, the total number of consonants, the total number of vowels, and the length of a given domain name. The derived features have been calculated from the basic features, based on domain knowledge, as indicated in Table 4.

| Feature | Name | Description |
|---|---|---|
| F11 | ratio-max-sequential-consonants-to-length-domain | The ratio of the maximum sequential consonant letters to the length of a given domain. |
| F12 | ratio-max-sequential-vowels-to-length-domain | The ratio of the maximum sequential vowel letters to the length of a given domain. |
| F13 | ratio-max-sequential-consonants-to-consonants | The ratio of the maximum sequential consonant letters to the total number of consonants of a given domain name. |
| F14 | ratio-max-sequential-vowels-to-vowels | The ratio of the maximum sequential vowel letters to the total number of vowels of a given domain. |
| F15 | ratio-max-sequential-consonants-to-max-sequential-vowels | The ratio of the maximum sequential consonant letters to the maximum sequential vowel letters of a given domain. |
| F16 | Randomness | The output of the RMA algorithm. |
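The ratio features F11 to F15 can be derived from the basic counts as in the hedged sketch below. The function names are hypothetical, "y" is treated as a vowel, and a ratio with a zero denominator is mapped to 0.0, which is an assumption because the text does not say how such cases were handled. F16 (the RMA output) is left out to keep the example self-contained.

```python
VOWELS = set("aeiouy")  # "y" counted as a vowel, following section 4.2


def max_run(label: str, predicate) -> int:
    """Longest run of consecutive characters for which predicate is true."""
    best = current = 0
    for ch in label:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def safe_ratio(numerator: float, denominator: float) -> float:
    """Assumption: an undefined ratio (zero denominator) is reported as 0.0."""
    return numerator / denominator if denominator else 0.0


def derived_ratio_features(label: str) -> dict:
    """Compute the F11 to F15 ratios for a domain label (TLD already removed)."""
    label = label.lower()

    def is_vowel(ch):
        return ch in VOWELS

    def is_consonant(ch):
        return ch.isalpha() and ch not in VOWELS

    length = len(label)
    consonants = sum(1 for ch in label if is_consonant(ch))
    vowels = sum(1 for ch in label if is_vowel(ch))
    max_cons = max_run(label, is_consonant)
    max_vow = max_run(label, is_vowel)
    return {
        "F11": safe_ratio(max_cons, length),
        "F12": safe_ratio(max_vow, length),
        "F13": safe_ratio(max_cons, consonants),
        "F14": safe_ratio(max_vow, vowels),
        "F15": safe_ratio(max_cons, max_vow),
    }
```

All five values depend only on the characters of the label, so they can be computed from the DNS request alone.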
Feature importance was calculated for all features in Table 4. A standard technique for calculating feature importance is to use the correlation, which is formally referred to as Pearson's correlation coefficient (PCC) [29]. The formula of PCC for two variables is indicated in the following equation.

$$\rho_{x,y} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \tag{4}$$
where cov(x, y) is the covariance of the two variables x and y, and σ is the standard deviation of a variable, i.e., the square root of its variance. The covariance (cov) can be used to measure the linear relationship between two variables, i.e., it tells us how much the two variables vary together. In this work, the variable x can be any feature of Table 4, whereas the variable y is the class. We calculated PCC between each extracted feature in Table 4 and the class (response) using the SciPy library of Python. The rank and score of the importance of each feature are shown in Fig. 6. The feature importance scores can be used to reduce the number of extracted features by keeping only those whose importance score is greater than a specific value (threshold). Since there is no fixed rule for assigning the threshold [30], we set the threshold to 0.2 in this case. Therefore, the features that have an importance score greater than 0.2, i.e., F16, F7, F4, F5, F1, F2, F10, F12, F15, F8, F6, and F3, have been selected, while the features F9, F11, F14, and F13 have been eliminated because their score is less than 0.2, as indicated in Fig. 6. Since the features F16, F7, F4, F5, F1, and F2 have an importance score greater than 0.5, they have been considered the most important features for our problem. Table 5 shows the selected features that will be used to build the classifier.

| No. | Feature | Name |
|---|---|---|
| 1 | F1 | entropy |
| 2 | F2 | max-sequential-consonants |
| 3 | F3 | max-sequential-vowels |
| 4 | F4 | length-domain |
| 5 | F5 | consonants |
| 6 | F6 | vowels |
| 7 | F7 | ratio-entropy-to-length-domain |
| 8 | F8 | ratio-consonants-to-vowels |
| 9 | F10 | ratio-vowels-to-length-domain |
| 10 | F12 | ratio-max-sequential-vowels-to-length-domain |
| 11 | F15 | ratio-max-sequential-consonants-to-max-sequential-vowels |
| 12 | F16 | Randomness |
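The selection step can be illustrated with the short sketch below. The paper used SciPy for this calculation; the example instead implements equation (4) in plain Python so that it stays self-contained. Taking the absolute value of the coefficient as the importance score is an assumption, and the feature columns and labels are made-up toy values, not data from the paper.

```python
import math


def pearson(x, y) -> float:
    """Pearson's correlation coefficient: cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and standard deviations cancel out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a constant column carries no importance
    return cov / (sd_x * sd_y)


def select_features(columns: dict, labels, threshold: float = 0.2):
    """Rank features by |PCC| with the class and keep those above threshold."""
    scores = {name: abs(pearson(values, labels))
              for name, values in columns.items()}
    selected = sorted((name for name, s in scores.items() if s > threshold),
                      key=lambda name: -scores[name])
    return selected, scores


if __name__ == "__main__":
    toy_columns = {  # made-up values for illustration only
        "entropy": [1.0, 1.5, 3.5, 3.8],
        "noise": [1.0, 2.0, 2.0, 1.0],
    }
    toy_labels = [0, 0, 1, 1]  # 0 = benign, 1 = malicious
    print(select_features(toy_columns, toy_labels))
```

With a binary class label, this coefficient is equivalent to the point-biserial correlation, which is why it can be applied directly to the 0/1 class column.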

Building a labelled dataset
We have written several Python modules to build the required dataset for training and evaluating the classifier. The main module extracted the values of the features given in Table 4 from many malicious and benign domain names selected from the raw datasets in Table 1 and saved them in CSV files. Thereafter, the dataset was cleaned by handling incorrect data and removing duplicates. A group of unduplicated samples was selected from each DGA family of DGArchive in Table 1 and labelled as malicious domains to create the malicious dataset, while a set of domains was selected from the Alexa data and labelled as benign domains to create the benign dataset. These malicious and benign datasets have the same number of samples and were combined to form the final dataset required to build the classifier. Table 6 summarizes the details of the dataset.
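A minimal sketch of the cleaning and balancing steps is given below. It is not the authors' module: the function names, the column names, and the random sampling of malicious domains are assumptions, and the feature columns are omitted for brevity. Benign domains are taken in their given order, mirroring the top-ranked selection described in section 4.1.

```python
import csv
import io
import random


def deduplicate(domains):
    """Normalise case and whitespace, drop empties and duplicates, keep order."""
    cleaned = (d.strip().lower() for d in domains)
    return list(dict.fromkeys(d for d in cleaned if d))


def build_balanced_rows(malicious, benign, seed: int = 0):
    """Return (domain, label) rows with equal class sizes (1 = malicious)."""
    mal = deduplicate(malicious)
    ben = deduplicate(benign)
    n = min(len(mal), len(ben))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rows = [(d, 1) for d in rng.sample(mal, n)]
    rows += [(d, 0) for d in ben[:n]]  # keep the highest-ranked benign names
    return rows


def write_csv(rows, handle) -> None:
    """Write the labelled rows with a header line to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["domain", "label"])
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv(build_balanced_rows(["a.com", "A.com", "b.net"], ["x.com"]), buffer)
    print(buffer.getvalue())
```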

Model training
The classification learner app of MATLAB R2019a [31] was used to train the domain name classifier of Fig. 1 using the dataset of Table 6. The k-fold cross-validation test option was chosen to train and evaluate the selected machine learning algorithms in order to protect against overfitting and provide an accurate model performance estimation [32] [33]. The default value of k in the classification learner is 5; however, it was set to 10 as it provided optimal accuracies. At first, we explored all the classification algorithms available in the classification learner, such as decision tree and support vector machine (SVM), to train the model using all the features in Table 5. The hyperparameters of the selected algorithms were tuned manually to improve their performance; however, we found that the algorithms produce the best performance when the default hyperparameter settings were adopted. After training a set of models using the default hyperparameter settings, five best models, i.e., Decision tree (Fine Tree), Ensemble (Boosted Tree), Naïve Bayes (Gaussian), and KNN (Coarse), were selected based on some evaluation metrics. Fig. 7 shows the training of the models using the classification learner app of MATLAB.
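The models in this work were trained with MATLAB's classification learner; the following Python sketch only illustrates how a 10-fold split partitions the samples, and is not the procedure MATLAB applies internally. The function name and the shuffling with a fixed seed are assumptions.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(25, k=10):
        print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every model is evaluated on data it did not see during that round of training, and the k scores are averaged to estimate performance.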

Evaluating the classifier
Several binary classification metrics [34], such as accuracy, False Positive Rate (FPR), precision, recall and F1 score, were used to evaluate the performance of the trained machine learning models. These metrics can be derived from the confusion matrix and are defined as shown in the equations below.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP is True Positive, FP is False Positive, TN is True Negative, and FN is False Negative. The evaluation results of the best models that were selected after performing the training and validation task are indicated in Table 7, and the models' performance is depicted in Fig. 8. Although applying a cross-validation procedure is considered sufficient to evaluate the performance of the models [35] [34], we made an additional assessment of the models' performance using a second dataset, i.e., Bambenek feeds, that was not used during the model building process. Some DGA families were selected from the Bambenek dataset for this evaluation, whereas the benign domains were collected from the dataset used in [16]; they differ from the benign domains of Table 6. Table 8 shows the summary of the testing dataset, while Table 9 illustrates the extra evaluation results of the models. The results demonstrate that MaldomDetector is a reliable and efficient system for detecting DGA-based domains.
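The metrics named above follow directly from the four confusion-matrix counts, as the short sketch below shows. The function name is hypothetical, the counts in the usage example are arbitrary illustrative numbers rather than results from Table 7 or Table 9, and returning 0.0 for an undefined ratio is an assumption.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, FPR, precision, recall and F1 from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Assumption: report 0.0 when a metric is undefined (zero denominator).
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "fpr": ratio(fp, fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Arbitrary illustrative counts, not taken from the paper.
    print(binary_metrics(tp=90, fp=10, tn=90, fn=10))
```

Here the malicious class is taken as the positive class, so the false positive rate is the share of benign domains that are wrongly flagged.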

5-Discussion
Since the datasets used in previous research works are not identical and there is no standard DGA to generate domain names for comparison, it is difficult to compare MaldomDetector with the systems presented in the related work section. In addition, the variety of DGA implementations creates wide fluctuation in the detection results when the detection methods are applied to these DGAs. In this section, we will only summarise some characteristics of 11 detection methods discussed in section 2, as illustrated in Table 10. As shown in Table 10, MaldomDetector has several advantages while maintaining high accuracy. The method in [20] used a probabilistic language model, i.e., n-gram, which assigns probabilities to every n-sequence of characters. It estimates these probabilities by calculating the relative frequency of each n-sequence of characters within a given dataset. However, this makes the model very dependent on the training dataset and inefficient in dealing with new types of DGA-based domain names [36] [37]. The method in [21] depended on the frequency distribution of alphanumeric letters of the domain names, while [18] (handcrafted features-based) used a dictionary matching score to measure the degree to which a word in a domain name can be explained by a dictionary. MaldomDetector depends on a set of pronunciation-based features that do not depend on the training dataset, and it does not adopt any probabilistic language model, i.e., it is a language-independent system. The method employed in [19] also used two features that require access to an external site, i.e., the WHOIS lookup service, to get data before detecting the malicious domains. However, getting this information adds a time delay and requires the system to be always online to function properly. The methods in [17] and [19] require information from the DNS response packets, such as the TTL (time to live) and the number of domain name servers, before making a decision.
Although this information can be useful to reduce the false positive rates, it adds a time delay that can be exploited by the malware to contact its C&C server or exfiltrate information before detection. The objective of this research is to build a detection system that requires little information to check the status of the requested domain names, while trying to detect malicious communications early and thus reducing risk to the network. Therefore, MaldomDetector does not use any data from an external site or from the DNS response message to classify the domains, even if they are possibly useful in the detection process. MaldomDetector has been built to depend solely on a deterministic algorithm and computationally inexpensive features extracted out of the DNS request message. This enables the system to check the domain names before sending them to the DNS server to resolve them, as a first layer of detection.

6-Conclusion
This paper presents an effective detection system, MaldomDetector, to detect algorithmically generated malicious domain names. MaldomDetector employs an algorithm, i.e., RMA, to measure the randomness in the domain name characters. MaldomDetector feeds RMA's outputs, along with a set of basic and derived features extracted from the initial DNS request, to a machine learning classifier for processing and classification. Several classification algorithms have been explored to build the classifier. Building upon the state of the art in malicious domain name detection, MaldomDetector does not employ any probabilistic language models. Rather, it employs a character-based approach to detect DGA-based domain names. It performs measurements solely on the DNS request packet and does not need to wait for the DNS response to extract extra features or require information from any external sources. The evaluation results show that MaldomDetector can effectively detect different types of DGA-based domains generated by several types of malware, with a detection accuracy of ~98%. MaldomDetector can be used to raise early alarms about potential malicious DNS communications while maintaining high accuracy.