Detecting Domain Generation Algorithms with Bi-LSTM

Abstract: Botnets often use domain generation algorithms (DGAs) to locate their command and control (C2) server: the compromised hosts try many generated domains until one resolves to the C2 server. Detecting DGA domains is critical both for blocking the C2 server and for identifying the compromised hosts. However, detection is difficult because some DGA domain names look normal. Much of the previous work, based on statistical analysis and machine learning, relies on manually engineered features and contextual information, which leads to long response times and prevents real-time detection. In addition, when a new DGA family appears, the classifier has to be retrained from scratch. This paper presents a deep learning approach to DGA domain detection based on a bidirectional long short-term memory (Bi-LSTM) model. The classifier extracts features automatically, without manual feature engineering, and the trainable model can effectively deal with members of new, unknown DGA families. Moreover, the proposed model needs only the domain name itself, without any additional contextual information. All domain names are preprocessed with bigrams, and the length of each processed domain name is fixed to a value longer than most samples. The Bi-LSTM model receives the encoded data and returns a label indicating whether a domain name is normal or not. Experiments show that our model outperforms state-of-the-art approaches and reliably detects new DGA families.

operating botnet, so as to control the victim machines [Kührer, Rossow and Holz (2014)]. Some malware relies on static lists of hardcoded domains and IP addresses to connect compromised machines to the C2 server [Stonegross, Cova, Gilbert et al. (2011)]. The domains are often coded into the malicious programs, giving attackers the flexibility to change the domains and their IP addresses easily [Hampton and Baig (2015)]. The biggest advantage of this kind of connection is that it is easy to implement; the disadvantage is that it is very easy for the authorities to detect. Because the number of domains and IP addresses is limited, defenders can blacklist them based on reverse engineering. Attackers counter this by using domain generation algorithms to dynamically generate a large number of pseudo-random domain names in a short period of time, which greatly increases the difficulty of blacklisting and detection. A DGA produces a series of pseudo-random domain names containing letters and numbers, built from seeds and encryption or arithmetic operations. We can predict the generated domains by collecting samples and reverse engineering the algorithm, and then preregister the domains or put them on a blacklist. However, a huge number of domains may be generated in a short time, and the list cannot be updated in time. Therefore, real-time detection of malicious domain names produced by DGAs is needed. As DGAs evolve, the number of generated domain names keeps growing and defense becomes more difficult. The accuracy of traditional classification algorithms and hidden Markov models is low, and feature-selection methods based on the natural-language characteristics of domain names cannot handle the large number of features.
In addition, some DGAs are even built to generate large numbers of pseudo-domain names that conform to the characteristics of normal domain names. In this paper, we design a model to detect domain names generated by DGAs based on bidirectional LSTM neural networks. Compared with traditional detection methods, the proposed scheme has the following advantages:
1. Our scheme handles domain names in a featureless way, retaining as much of the information contained in the domain name as possible. It also avoids manual feature selection and the difficulty of judging feature effectiveness.
2. Our scheme adapts well to the detection of pseudo-domain names generated by new DGAs. While traditional detection schemes must be retrained on large data samples, the proposed scheme only needs to continue training from the original model.
3. Our scheme operates in a real-time and low-cost way. The model trained on the pre-training data samples can be deployed and used directly, and can classify domain names and quickly blacklist suspicious ones without requiring further information.
In this paper, we make the following contributions:
1. We collect statistics on the frequency distribution of the composition and length of domain names, and analyze the differences between normal and DGA domain names. While retaining the original domain name information as much as possible, we determine that bigram processing makes these differences more pronounced.
Chen et al. [Chen, Yan, Pang et al. (2018)] trained a classifier model using a Support Vector Machine (SVM), which is based on supervised machine learning. Huang et al. [Huang, Wang, Zang et al. (2018)] proposed Helios, a DGA detection approach based on a neural language model, which exploits the word formation of domain names to identify those generated by DGAs. Yadav et al. [Yadav, Reddy, Reddy et al.
(2012)] described a model that tests the distribution of alphanumeric characters and bigrams in all domains to detect DGA domain names. Wang et al. [Wang and Chen (2017)] proposed N-gram features to increase the accuracy of classification models. Schiavoni et al. [Schiavoni, Maggi, Cavallaro et al. (2014)] also studied DGA domain detection, and character-level language models have long been applied to sequence prediction [Mahoney (1999); Mikolov, Karafiat, Burget et al. (2010); Robinson (1994)]. Hochreiter et al. analyzed the gradient explosion and vanishing problems introduced by the back-propagation-through-time algorithm, which cause gradient oscillation and make learning difficult; the LSTM network structure was proposed to address this [Gers, Schmidhuber and Cummins (2000); Gers, Schraudolph and Schmidhuber (2002); Hochreiter and Schmidhuber (1996); Hochreiter and Schmidhuber (1997)]. The basic idea of the bidirectional recurrent neural network (BRNN) is to train each sequence with two RNNs, one forward and one backward [Schuster and Paliwal (1997)]. This structure provides complete past and future context for each point in the input sequence at the output layer. As a member of the BRNN family, Bi-LSTM shares these structural characteristics [Graves and Schmidhuber (2005)]. Liu et al. used Bi-LSTM in a sentence-encoding-based model for recognizing text entailment [Liu, Sun, Lin et al. (2016)]. The method presented in this paper builds on the operations presented in Woodbridge et al. [Woodbridge, Anderson, Ahuja et al. (2016)] and improves the existing approach as follows:
1. By analyzing the character composition and length of domain names, we find the relationships between adjacent characters in a domain name and preprocess the domain names with the binary grammar (bigram) method.
2. By training on the sample data in both directions with the Bi-LSTM network, future context is introduced in addition to past context.
3.
Based on a large number of DGA data samples, the results are more suitable for practical use.

System implementation
For effective detection of domain names, we set up a dictionary that maps the characters of domain names to corresponding positive integer values, based on an analysis of domain name characteristics. According to the form of the domain names, we convert the characters into one-dimensional vectors. After the vectors pass through the Embedding layer, we feed them into LSTM or Bi-LSTM neural networks to obtain labels, based on which we determine whether a test domain name is generated by a DGA. The normal set and each DGA family set in the data set are divided into training and test sets at a ratio of 4:1. The design and implementation of the system is divided into three parts: characteristic analysis, data processing, and the neural network model.
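The 4:1 split described above can be sketched as follows (a minimal illustration; the function name, the fixed seed, and the toy sample list are ours, not from the paper):

```python
import random

def split_4_to_1(samples, seed=42):
    """Shuffle a list of samples and split it 4:1 into train and test sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]

train, test = split_4_to_1(["a.com", "b.net", "c.org", "d.io", "e.me"])
# 5 samples -> 4 train, 1 test
```

In practice the same split would be applied separately to the normal set and to each DGA family set, as the text describes.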

Characteristic analysis
Because the different levels of domain names on the Internet are managed by different agencies, the ways the agencies manage domain names and the rules for naming them also differ. But there are some common naming rules: a domain name contains the 26 English letters (case insensitive), the 10 Arabic numerals, and a few other characters. We mainly analyze the following four aspects:
• Source
The experimental domain name data used in this paper come from the global top one million domain names published by the Alexa website and more than 1.4 million domain names generated by 28 different domain generation algorithms, which ensures that the data samples are representative and authoritative.

• Character composition
Through the analysis of word-frequency statistics of the domain name samples processed by unigram, we find regularity in the distribution of each character in the domain names. It can be seen from Fig. 1 that the frequency distributions of each character in the normal samples and the DGA samples are significantly different. In both samples, the frequency of numbers is low and the frequency of English letters is relatively high, which indicates that English letters are the main constituent characters of domain names. In the DGA samples, the frequency of numbers is still low, but higher than in the normal samples. The frequency distribution of English letters in DGA domain names is close to uniform, and the overall fluctuation is smaller than in normal samples. Fig. 2 shows that the international suffix ".com" appears very often in normal samples, so the frequencies of the bigrams "co", ".c", "om" and "m%" are much higher than those of other tokens, while the frequency of the remaining tokens decreases smoothly. In the DGA samples the frequency differences among tokens are large, but the high-frequency tokens are not as dominant as the high-frequency tokens in the normal samples. Obviously, the same domain name sample processed by unigram and by bigram differs by only 1 in length. As can be seen from Fig. 3, most domain names are between 4 and 35 characters long, and the distribution characteristics of the two data samples are different.
In the normal samples, the domain length distribution is close to a normal distribution: domains of length 12 are the most frequent, and other lengths are relatively rare. In the DGA samples, due to the limitations of the generation algorithms, the length distribution is concentrated, in accordance with the characteristics of pseudo-random generation. The statistics of the first character of the original domain names in Fig. 4 show that, in both the normal and the DGA samples, the first character is far more often an English letter than a number. In the normal samples, the letters 'q', 'x' and 'z' are relatively rare as first characters, while in the DGA samples the frequency of each English letter varies more gradually. However, because of individual DGA family algorithms, among the numeric first characters the probability of '0' and '1' is much higher than that of the other digits.
Based on the above analysis of domain name characteristics, we are convinced that there is a natural-language, textual difference between normal and DGA domain names, and that bigram processing makes this difference even greater.
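The word-frequency statistics behind Figs. 1-4 can be reproduced with a short script; this is a minimal sketch (the function name and the two example domains are ours):

```python
from collections import Counter

def char_frequencies(domains):
    """Relative frequency of each unigram character across a list of domain names."""
    counts = Counter(ch for d in domains for ch in d.lower())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

freqs = char_frequencies(["google.com", "baidu.com"])
```

The same Counter-based approach extends to bigram tokens and to first-character statistics by changing what is counted.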

Data processing
We create a dictionary of all the characters that appear by applying the unary grammar (unigram) to all existing domain name samples. The statistics show that the domain names consist of 44 distinct characters. By assigning each character a different positive integer, each domain name in the sample is transformed into a one-dimensional vector.
For the binary grammar (bigram) processing of all domain names, we mark the beginning and end of each domain name with '%' (the symbol '%' does not appear in any known domain name). The sequence lengths differ depending on the processing: unigram yields N tokens and bigram yields N+1 (where N is the length of the domain name). We collect statistics on the processed samples and create the dictionary. The results of the two processings differ: unigram splits each domain name into a sequence of single characters, while bigram extracts adjacent character pairs (including the start and end marker '%') one by one to form a sequence. The data set contains a collection of various DGA domain names, as well as the global top one million normal domain names downloaded from the Alexa website. We label every DGA domain string as 1 and every normal domain name as 0. For the different DGA types, we also generate 40,000 DGA domain names across the 28 different types; 40,000 normal domain names are randomly selected from the normal data set and labeled in the same way as the DGA samples.
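The bigram processing and dictionary construction described above can be sketched as follows (a minimal illustration; the function names are ours, and we assume 1-based integer codes so that 0 remains free for padding):

```python
def to_bigrams(domain):
    """Wrap the domain with the '%' start/end markers and emit adjacent pairs.

    A domain of length N yields N + 1 bigrams, matching the statistics above.
    """
    wrapped = "%" + domain + "%"
    return [wrapped[i:i + 2] for i in range(len(wrapped) - 1)]

def build_dictionary(sequences):
    """Map every distinct token to a distinct positive integer (1-based)."""
    vocab = sorted({tok for seq in sequences for tok in seq})
    return {tok: i + 1 for i, tok in enumerate(vocab)}

bigrams = to_bigrams("abc")          # ['%a', 'ab', 'bc', 'c%']
vocab = build_dictionary([bigrams])
```

The unigram case is the same pipeline with `list(domain)` in place of `to_bigrams(domain)`.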

Neural network model
In the experiments, we construct the neural networks with Keras. Four training models are set up by combining unigram and bigram preprocessing with LSTM and Bi-LSTM networks, and we develop modules that use the different grammars to generate vectors and perform the training. Each training model is a sequential model with an Embedding layer, an LSTM (or Bi-LSTM) layer, a Dropout layer, and a Dense layer (with sigmoid activation).

• Embedding layer
The input to the embedding layer of the deep learning model is a vector that encodes each character of the domain name string as a sequence of positive integers via a dictionary. This dictionary is obtained by counting the characters that appear in all domain names and encoding each with a non-zero positive integer. The vector dimension can be either variable or fixed; since a fixed dimension greatly improves training, we choose a fixed vector dimension.
As for the choice of vector dimension, the usual method is to find the longest domain name in the data set and use its length as the vector dimension, so that it holds all domain name strings. Vectors shorter than this dimension are padded with zeroes. This can be done with the sequence preprocessing function in Keras. The statistics in Fig. 5 show that domain name lengths are concentrated in one region, while very long domain names are rare: 99.9% of all domain names fall in the range of 4-38 characters, and the number of domain names over 38 characters is very small. Therefore, we set the vector dimension to 38, and fix all vectors to dimension 38 using the sequence preprocessing function in Keras. For simplicity, we first introduce some notation. Let V be a vector, let v be an element of the vector, and let V_{a:b} denote the subvector of V from index a to index b. Based on the obtained dictionary of integer values, we convert all domain names to a matrix X of m rows and T columns, where the row vector V_m represents the m-th domain name in the data set and the element x_{m,1} is the first component of that domain name (a single character under unigram, a character pair under bigram). From the length analysis above, and to speed up the program, we take T to be 38 and m to be the total number of domain names. To match the input of the LSTM and Bi-LSTM network models, we set the dimension of each row of the matrix (tensor) output by the embedding layer to 128. According to the size l of the input dictionary, the embedding layer one-hot encodes the elements of each column vector, multiplies them by the weight matrix W stored in the layer, and outputs the resulting matrix.
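The zero-padding to a fixed dimension of 38 can be sketched without Keras (a minimal stand-in for its sequence preprocessing function, assuming pre-padding and pre-truncation as in the Keras default):

```python
MAX_LEN = 38  # covers 99.9% of domains per the statistics above

def pad_sequence(ids, max_len=MAX_LEN):
    """Left-pad an integer sequence with zeros to max_len; truncate the front if too long."""
    ids = ids[-max_len:]
    return [0] * (max_len - len(ids)) + ids

padded = pad_sequence([5, 12, 7])
```

Because the dictionary assigns only positive integers to real tokens, the 0 used for padding never collides with a token code.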
For the determination of l, consider the set A of all distinct elements:
A = {x_{1,1}} ∪ {x_{1,2}} ∪ … ∪ {x_{m,T}}, l = |A| (9)
Each element v of a column vector V is encoded by a one-hot code:
v = (0 … 1 … 0) (10)
For the sake of demonstration, assume a column vector V = (x_{m,1}, x_{m,2}, x_{m,3}) (11)
whose elements have the one-hot codes
x_{m,1} = (1 0 0) (12)
x_{m,2} = (0 1 0) (13)
x_{m,3} = (0 0 1) (14)
and the weight matrix W is
W = ( w_{1,1} … w_{1,128}
      w_{2,1} … w_{2,128}
      w_{3,1} … w_{3,128} ) (15)
Then the column vector V is converted to the matrix S by the weight matrix W:
S = V × W (16)
and every element s_{i,j} of the matrix S satisfies
s_{i,j} ∈ {w_{1,1}} ∪ {w_{1,2}} ∪ … ∪ {w_{l+1,128}} (17)
Through the different domain name processing methods of unigram and bigram, we obtain dictionaries of different lengths. According to our data set statistics, the dictionary produced by unigram has size 44 and the dictionary produced by bigram has size 1789, so the one-hot code length and the number of rows of the weight matrix W in the Embedding layer are 45 and 1790 respectively (one extra entry accounts for the padding value 0). Although the matrix S obtained by the weight matrix transformation is a matrix of 38 rows and 128 columns in both cases, the bigram and unigram preprocessings differ greatly in the dimension involved in the conversion.
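Because each one-hot vector has a single 1, the product V × W in Eq. (16) reduces to selecting rows of W; the following toy sketch illustrates this (the tiny 2-column W is ours, chosen only for readability; the real layer uses 128 columns):

```python
def embed(ids, W):
    """Embedding lookup: the one-hot product V x W reduces to selecting rows of W.

    W has l + 1 rows (index 0 is reserved for the padding value) and, in the
    real model, 128 columns; here we use tiny dimensions for illustration.
    """
    return [W[i] for i in ids]

# toy weight matrix: 4 rows (l = 3 plus the padding row), 2 columns
W = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
S = embed([1, 3, 0], W)  # -> [[0.1, 0.2], [0.5, 0.6], [0.0, 0.0]]
```

This is why the layer can store W once and never materialize the one-hot vectors explicitly.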

• The neural networks (LSTM and Bi-LSTM)
Each domain name is encoded according to the dictionary and converted by the embedding layer, and the encoded domain name is input into the corresponding network model for the output calculation. The structure of the LSTM network model is shown in Fig. 6.
Each input vector S_t corresponds to a structure h_t, which receives the previous structure h_{t-1} and the current input vector S_t, and passes its result on to the next structure h_{t+1}; that is,
h_t = f(h_{t-1}, S_t) (21)
The final output y is obtained by some function g of the last structure h_T:
y = g(h_T) (22)
To restrict the output to [0, 1], we use the sigmoid function
Sig(y) = 1 / (1 + e^{-y}) (23)
Then the final output Y is
Y = Sig(y) (24)
  = Sig(g(h_T)) (25)
We make the decision according to the output result: the domain name is classified as DGA-generated (label 1) if Y ≥ 0.5, and as normal (label 0) otherwise. (26)
The structure of the Bi-LSTM network model is shown in Fig. 7. The basic structure of the Bi-LSTM network is similar to LSTM, except that the final output is obtained by an operation on both the last forward structure h_T and the last backward structure h'_T, followed by the sigmoid function; that is,
Y = Sig(g'(h_T, h'_T)) (27)
The main process of the training part of the system is as follows.
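The decision step in Eqs. (24)-(27) can be sketched as follows (a simplified illustration: we reduce g' to a weighted sum of two scalar hidden states and assume a 0.5 threshold, which are our simplifications, not the paper's exact formulation):

```python
import math

def sigmoid(y):
    """Squash a real value into [0, 1], as in Eq. (23)."""
    return 1.0 / (1.0 + math.exp(-y))

def classify(h_forward, h_backward, w_f, w_b, threshold=0.5):
    """Combine the last forward and backward hidden states (Eq. 27, with g'
    simplified to a weighted sum) and threshold the sigmoid output:
    1 = DGA-generated, 0 = normal."""
    y = w_f * h_forward + w_b * h_backward
    return 1 if sigmoid(y) >= threshold else 0
```

In the real model the hidden states are vectors and g' is a learned Dense layer, but the thresholding of the sigmoid output is the same.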

Performance evaluation
In this section, we analyze the results of training and testing. The experiments focus on distinguishing DGA domain names from normal ones. The normal data come from the Alexa top one million domains and the DGA data are generated by 28 DGA families. Tab. 1 shows the DGA families we used and the number of domains in each family. We process the one million normal domains and the more than 467,000 DGA domains with unigram and bigram grammar, and then use the LSTM and Bi-LSTM network models for training and testing. We use several performance metrics for the evaluation of the detection network models: True Positive Rate (TPR, equivalent to Recall), False Positive Rate (FPR), Precision, and accuracy (ACC). TPR is the ratio of correctly detected DGA domains to the total number of DGA domains. FPR is the ratio of normal domains incorrectly classified as DGA to the total number of normal domains. Precision is the ratio of correctly detected DGA domains to the total number of domains detected as DGA, whereas ACC is the ratio of correctly detected DGA domains plus correctly detected normal domains to the total number of test domains. In the following experiments, we evaluate our proposed Bi-LSTM network model in three cases. Case 1 uses unigram, named Bi-LSTM-Ug. Case 2 uses bigram, named Bi-LSTM-Bg. Case 3 is a hybrid model combining Bi-LSTM-Bg and LSTM.
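The four metrics defined above can be computed directly from the confusion-matrix counts; here is a minimal sketch with made-up counts (the numbers are illustrative only, not experimental results):

```python
def metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),        # recall: detected DGA / all DGA
        "FPR": fp / (fp + tn),        # normal misclassified as DGA / all normal
        "Precision": tp / (tp + fp),  # detected DGA that really are DGA
        "ACC": (tp + tn) / (tp + fp + tn + fn),
    }

m = metrics(tp=90, fp=5, tn=95, fn=10)
```

A guard against zero denominators would be needed in production; it is omitted here for brevity.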
We also compare our model with state-of-the-art methods in their respective domains.
• A featureless LSTM model defined in Woodbridge et al. [Woodbridge, Anderson, Ahuja et al. (2016)].
• An SVM classifier model using manual features defined in Chen et al. [Chen, Yan, Pang et al. (2018)]. The manual features of the SVM include the following:
• the length of the domain name;
• the ratio of vowels in the domain name;
• the entropy of the character distribution of the domain name;
• the count of bigram frequency distribution occurrences.

Evaluation metrics
FN: False Negative, a sample predicted negative that is actually positive. FP: False Positive, a sample predicted positive that is actually negative. TN: True Negative, a sample predicted negative that is actually negative. TP: True Positive, a sample predicted positive that is actually positive. TPR: True Positive Rate, calculated as TPR = TP / (TP + FN). The results of the three sets of models differ little from each other on each performance metric. With regard to Precision, the Bi-LSTM-Bg model is 1.852% higher than the worst model and 1.434% higher than the second-best model. In the case of Recall, it is the worst model, 2.035% below the best model and 1.231% below the second-worst model. Precision is the ratio of true positives to the total number of results returned, while Recall is the ratio of true positives to the whole positive set, retrieved or not. The reasons for these results can be further explained by Tab. 3. We also detect the domain names of five unknown DGA families: sisron, github_malware, javascript_malware, unknown_malware and vawtrak. Comparing the FPR results of each model in Fig. 9, the FPR of the Bi-LSTM-Bg model is much lower than that of the SVM model, which shows that Bi-LSTM-Bg distinguishes unknown DGA family domains better. Therefore, its ability to detect unknown DGA families is much stronger than that of the traditional machine learning model.

Figure 9: FPR comparison
In the experiment, the other models are used to test DGA domain name samples generated by DGA families not seen in training, and the results are shown in Tab. 5. When we test these untrained DGA samples, we find that the misjudgment rates of the LSTM and Bi-LSTM-Ug models are better than that of the Bi-LSTM-Bg model. Some DGA families generate domain names that all four models detect well, while a large number of domains that the otherwise optimal Bi-LSTM-Bg model handles poorly are detected well by the LSTM model. Compared with machine learning models, neural network models have a unique ability to continue training on new data. Therefore, we combine the two best models, LSTM and Bi-LSTM-Bg, into a new model, Hybrid-LSTM. The experimental results are shown in Fig. 10. Each group of models was further trained with the suspicious samples it detected, and the new domain data of the five families above were tested with the retrained model. In the Hybrid-LSTM model, LSTM is used for detection in the early stage, and the Bi-LSTM-Bg model is used for training on the detected suspicious samples and for the subsequent detection and analysis. Num is the sample size of each DGA family. As Fig. 10 shows, the Hybrid-LSTM model performs best among all the neural network models after training on a small amount of data.