Analysis of E-mail Account Probing Attack Based on Graph Mining

Wen, Yi; Chen, Xingshu; Zeng, Xuemei; Wang, Wei

doi:10.1038/s41598-020-63191-5

Open access
Published: 29 April 2020


Scientific Reports volume 10, Article number: 7240 (2020)


Abstract

E-mail has become a main carrier for spreading malicious software and is widely used for phishing and even advanced persistent threats. E-mail accounts with high social reputation are primary targets for attackers to compromise and exploit, and they suffer large numbers of probing attacks over long periods. In this paper, to understand the probing pattern of e-mail account attacks, we use graph mining to analyse a log of e-mail account probing captured in a campus network. By analysing the characteristics of the dataset in different dimensions, we identify a kind of e-mail account probing attack and give it a new definition. Based on the analysis results, we work out its probing pattern. Viewed at the level of probing groups and at the level of individual sources, the attack shows opposite characteristics. Owing to this probing pattern and its characteristics, the attacks can escape detection by security devices, which harms e-mail users and administrators. The analysis results of this paper provide support for the detection of and defence against such distributed attacks.


Introduction

E-mail has become a common social tool on the internet, and its security has become a hot topic. Attackers often use e-mail to deliver malware or phishing messages in order to carry out different kinds of attacks, from traditional network attacks to advanced persistent threat (APT) penetration. In recent years, the proportion of users attacked by malicious e-mail has increased significantly¹. E-mail account probing is the most common early action for attackers. Its main purpose is to crack passwords and open a breach for subsequent attacks. Accounts with high social reputation, especially those of users in universities and well-known enterprises, are the main targets and suffer many probing attacks. Compromised e-mail accounts are then used to send spam and harmful attachments such as worms, Trojans and ransomware, which may bring huge losses to organizations or individuals. The high social reputation of these accounts raises the success rate of such attacks. In the past few years, a series of attacks have successfully compromised government systems, well-known businesses, politicians and organizations². Protecting e-mail accounts is therefore one of the most important tasks in reducing the risk of network attacks.

The security of e-mail has attracted the attention of many scholars because of its importance and necessity. At present, research on e-mail security mainly concerns the identification and filtering of malicious e-mails. Commonly used technologies include detection and filtering of malicious e-mails based on expert knowledge, machine learning or automatic rule extraction^3,4,5,6,7. However, with the development of network attack technology, attacks against e-mail have gradually evolved from traditional attacks with a single, isolated source into automated and distributed ones⁸. Many attackers, including some APT organizations, probe e-mail accounts and send spam through distributed botnets⁹. Attacks that probe accounts in a distributed and covert way can escape malicious e-mail detection, and a growing number of attackers adopt them because of their good concealment and high efficiency. Traditional methods have limitations in distinguishing malicious login behaviours from benign ones, so attackers can evade detection in this way and intrude into the intranet. Nevertheless, prior research offers few methods for detecting distributed e-mail attacks, and it also lacks description and analysis of e-mail attackers’ behaviours.

How to detect and defend against this kind of distributed probing attack is one of the key problems in e-mail security. To figure out how attackers probe accounts, we need to analyse the characteristics of the probing pattern, describe the behaviours of attack sources, and propose targeted solutions. However, the distributed nature of such attacks makes it difficult to extract features directly by expert knowledge from a small amount of data. Researchers need to analyse a large amount of data to understand the probing pattern, and data mining methods are needed to uncover the hidden information in the dataset. Graph mining is a subclass of data mining that has become a popular area of research in recent years because of its numerous applications in a wide variety of practical fields¹⁰. Because graph mining methods can reflect the characteristics of data and predict its evolution, they can provide theoretical support for this analysis.

Theories of graph mining have attracted extensive attention in recent years in the study of human network and communication data. Zhou et al.¹¹ summarized recent research methods for the spatial and temporal characteristics of human behaviours. Wang et al.¹² introduced recent progress in the study of coevolution spreading dynamics. Jiang et al.¹³ studied phone-call data and identified three abnormal communication groups by analysing the distribution of the data. By analysing the distribution of Twitter data, Bovet et al.¹⁴ assessed the influence of fake news on the U.S. general election. As for time characteristics in networks, Masuda et al.¹⁵ proposed a method to assign discrete states to the systems in social temporal networks. Bai et al.¹⁶ analysed the temporal structures of two networks for the early detection of infectious disease. In the analysis of network traffic behaviours, many researchers use graph mining methods to solve different problems. Francois et al.¹⁷ proposed constructing a graph from flows to analyse network communication patterns and used the Authority and Hub eigenvalues of the graph to detect botnets. Weigert et al.¹⁸ proposed a graph-based community discovery method, which showed that the IP addresses in a community were similar in their network flows and that the method could identify low-intensity attacks on multiple hosts. Ye¹⁹ proposed using Graphlet to quantify the correlation of eigenvalues, fusing the attributes of graph nodes with Graphlet attributes to describe individual behaviour characteristics on the internet. Prior research thus shows that analysis methods based on graph mining theories are feasible for describing network traffic behaviours.

Accordingly, in this work, we use graph mining methods such as temporal analysis and graph node evaluation to analyse the distribution of e-mail probing data collected in a campus network. We focus on the characteristics of the data in the time dimension and in the spatial dimension, that is, the space constructed by probing sources and targets. Attackers’ behaviours and probing patterns are described from the views of individual probing sources and of the whole dataset. The results of our analysis can be used to assess the security risk of e-mail accounts and to help defend the e-mail systems of colleges or enterprises. The main contributions of this paper are as follows.

First, our work is based on real e-mail traffic data from a campus network. The dataset we collected contains abundant information and is of high research value, especially for analysing attackers’ behaviour tendencies in e-mail systems.

Secondly, we describe and analyse attackers’ behaviours in the network traffic, which helps fill the gap in research on this kind of distributed e-mail probing attack.

Finally, this paper uses theories of temporal analysis, network structure and graph node evaluation as its analysis methods. We analyse the temporal features of the dataset and construct networks to work out the correlation between attackers and targets, which can provide a reference for the analysis of similar security data, especially distributed attack data.

The rest of this paper is structured as follows. The second section introduces the dataset. The third section presents the analysis methods in three aspects: time characteristics, network model structures and network node attributes. The fourth section presents the analysis results obtained with these methods, analysing the characteristics of attack behaviour in different dimensions. The fifth section summarizes and discusses the paper.

Dataset

The original dataset is an e-mail login log collected from a campus network. It contains traffic records of failed logins over 333 days. Each record includes the login time, login IP address, network segment of the IP address, login e-mail username and other fields. Normal users may occasionally mistype a password, which also generates a failed-login record. Therefore, we count the number of login failures for each class C network segment and filter out records whose segment has fewer than 4 login failures in total in the previous seven days. For convenience of analysis, we aggregate the data with one day as the minimum unit. To remove sensitive information, we anonymize the dataset by randomly numbering the date, IP address, network segment and username of each record, so that each field of a record is represented by an ID. After this pre-processing, the campus network e-mail probing dataset (hereinafter referred to as CNEPD) is formed. The basic information of CNEPD is shown in Table 1.

Table 1 Basic information about CNEPD.


Each record in CNEPD consists of the attack date (Date), IP address (IP), class C network segment (Segment), username of the e-mail account (Username) and the number of probing attacks (Count). The format of a record is as follows:

$$\{Date,IP,Segment,Username,Count\}$$

Attackers frequently change their IP addresses to avoid tracking, or sometimes use compromised hosts in the same class C network segment²⁰. Accordingly, we assume that all the IP addresses are class C addresses, which means that if the first three octets of two IP addresses are the same, they belong to the same network segment. Therefore, we change the minimum unit of probing source from IP address to network segment.
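As an illustration of the segment assumption, the following minimal sketch maps a raw IPv4 address to its /24 network using Python's standard `ipaddress` module. The function name `segment_of` and the example addresses are hypothetical; CNEPD itself stores only anonymized IDs, so this step would apply to the raw log before anonymization.

```python
import ipaddress


def segment_of(ip: str) -> str:
    """Return the /24 network (the "class C segment") that contains an IPv4 address."""
    # strict=False lets us pass a host address; the host bits are masked off.
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network)


# Addresses that share their first three octets fall into the same segment.
same = segment_of("192.0.2.17") == segment_of("192.0.2.200")
different = segment_of("192.0.2.17") != segment_of("192.0.3.17")
```

Treating every address as part of a /24 is the simplification stated in the text; real allocations may use other prefix lengths.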

Methods

The analysis methods consist of two parts. First, we analyse the time distribution of CNEPD to obtain the time characteristics of each probing source. Secondly, we establish a probing relationship network based on attacker segments and e-mail accounts, so that we can find out whether there is a relationship between probing sources and targets. Based on this network, we analyse the relationship between probing sources by mapping the network to a new one that reveals the similarity between attackers. From the time dimension and the spatial dimension of the network, we obtain different characteristics that describe the pattern of this kind of probing attack. Section 2.1 presents the time feature extraction method, and sections 2.2 and 2.3 present the methods of network structure analysis. Since CNEPD shows characteristics of distributed attacks, we examine behaviours from two aspects: (1) the overall probing behaviour characteristics of CNEPD; (2) the behaviour characteristics of a single node. Together, these two aspects describe such probing attacks at the level of groups and of individuals.

Time feature extraction

Time characteristics are important attributes for describing the probing pattern because they show the tendency of attacks in the time dimension. For describing the time characteristics of graphs or complex networks, the concepts of burstiness and memory were proposed in ref. ²¹. Burstiness is the distribution characteristic of the time intervals among nodes, while memory indicates how long nodes continue to appear. Drawing on these two characteristics, we propose two definitions to describe the time characteristics of e-mail probing behaviour. Figure 1 shows part of the distribution of probing dates and attack counts for node No. 80, and the definitions of the time characteristics are as follows:

Definition 1: Probing Interval (t_interval): Probing interval refers to the number of days between two adjacent probing attacks. As shown in Fig. 1, from day 285 to 296 there is no probing, and the date difference is the t_interval.

Definition 2: Probing Duration (t_duration): Probing duration refers to the number of consecutive days that probing attacks continuously occur. As shown in Fig. 1, from day 261 to 269, probing attacks persisted, and the date difference is the t_duration.

t_interval and t_duration are calculated to describe the time distribution of CNEPD, which can reflect the tendency of attackers’ probing patterns. On the big data analysis platform CSRI-BDP established in our laboratory, we aggregate CNEPD to extract time features according to algorithm 1. The input of the algorithm is CNEPD, the failed-login dataset, which is stored in the Hadoop Distributed File System (HDFS). HDFS is the storage system of the Hadoop framework, a distributed file system that runs on commodity hardware and is suited to processing unstructured data. The dataset is loaded as a Resilient Distributed Dataset (RDD), which performs well in processing big data. The outputs are the extracted interval sequence and duration sequence.

The specific implementation is shown in algorithm 1. First, in lines 1–2, each record of CNEPD is changed to {Segment, (Date, IP, Segment, Username, Count)}, and the key-value pairs are grouped by the key, Segment. Then, in lines 3–18, the Date values of each group are extracted and sorted in ascending order to form a probing time sequence. Afterwards, the time sequences of duration and interval are calculated in lines 6–7 and 9–18, respectively. Finally, the set of time characteristics is obtained.
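The grouping and sequence extraction described for algorithm 1 can be sketched in plain Python as a stand-in for the RDD-based implementation. This is not the authors' code. It assumes that dates are integer day IDs, that t_interval is the date difference between two consecutive probing days more than one day apart, and that t_duration is the date difference between the first and last day of a run of consecutive probing days (so a single isolated day gives 0). The function name `time_features` is hypothetical.

```python
from collections import defaultdict


def time_features(records):
    """Extract per-segment interval and duration sequences.

    records: iterable of (date, ip, segment, username, count) tuples,
    where date is an integer day ID.
    Returns two dicts keyed by segment: intervals and durations.
    """
    # Group the probing dates by segment (the key), as in lines 1-2.
    dates_by_segment = defaultdict(set)
    for date, _ip, segment, _username, _count in records:
        dates_by_segment[segment].add(date)

    intervals, durations = {}, {}
    for segment, dates in dates_by_segment.items():
        days = sorted(dates)  # ascending probing time sequence
        seg_intervals, seg_durations = [], []
        run_start = days[0]
        for prev, cur in zip(days, days[1:]):
            gap = cur - prev
            if gap > 1:
                # The run of consecutive days ends at prev (assumed definition).
                seg_durations.append(prev - run_start)
                seg_intervals.append(gap)
                run_start = cur
        seg_durations.append(days[-1] - run_start)  # close the final run
        intervals[segment] = seg_intervals
        durations[segment] = seg_durations
    return intervals, durations
```

For a segment that probes on days 261 to 269, then on day 285 and on day 296, this sketch yields durations `[8, 0, 0]` and intervals `[16, 11]` under the stated assumptions.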

Construction of network model

One important probing characteristic is the attackers’ tendency in choosing targets. From CNEPD, we can get a list of probed e-mail accounts. To work out this tendency, we need to understand the relationship between probing sources and probed accounts in our dataset. Hence, we construct a network based on the fields “Segment” and “Username”. The result is an attack association bipartite graph, the probing relationship network, as shown in Fig. 2(a).

CNEPD shows that a large number of attackers probe different accounts over the whole period. To understand the probing pattern of attackers, it is important to work out whether there is a relationship between them. To analyse the correlation between attackers, following ref. ²², we convert the probing relationship network into the probing source mapping network, which is composed of attack nodes only, by one-mode projection. As shown in Fig. 2(b), each node represents an attack network segment, and an edge between two nodes indicates that both segments have probed at least one common e-mail account. With these two network models, we can analyse the relationship between segments and usernames as well as the relationship among attackers.
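A minimal sketch of the two network models, using only dictionaries of sets, is given below. It assumes records in the {Date, IP, Segment, Username, Count} format. The function name `build_networks` and the variable names are hypothetical, and edge weights are ignored here.

```python
from collections import defaultdict
from itertools import combinations


def build_networks(records):
    """Build the bipartite probing relationship network and its projection.

    records: iterable of (date, ip, segment, username, count) tuples.
    Returns (targets_of, projection), where targets_of maps each segment to
    the set of usernames it probed, and projection maps each segment to the
    set of segments that share at least one probed username with it.
    """
    targets_of = defaultdict(set)  # segment -> usernames (bipartite edges)
    sources_of = defaultdict(set)  # username -> segments
    for _date, _ip, segment, username, _count in records:
        targets_of[segment].add(username)
        sources_of[username].add(segment)

    # One-mode projection onto the segment side.
    projection = {segment: set() for segment in targets_of}
    for segments in sources_of.values():
        for a, b in combinations(segments, 2):
            projection[a].add(b)
            projection[b].add(a)
    return dict(targets_of), projection
```

The projection loop is quadratic in the number of segments per account, which is acceptable for a sketch but may need a sparse-matrix formulation for accounts probed by very many segments.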

Method of node distribution attribute analysis

In a network system, the distribution attribute of nodes is one of the most important characteristics for describing the network. To find the important nodes and determine whether nodes are related, we analyse the distribution of nodes in the two network models. The degree of a node is the most basic attribute of the node distribution and represents the number of edges connected to the node. The degree of node i is denoted k_i. In scale-free networks, the degree distribution usually follows a power law, which can be expressed as:

$${p}_{k}\sim {k}^{-\gamma }. \tag{1}$$

In the probing relationship network and the probing source mapping network, the more important a node is, the more likely it is to be similar and relevant to other nodes in its probing pattern. To work out the distribution of probing nodes and the relationship between them in the network, we describe the importance of nodes by calculating their centrality. In our work, we focus on the following four types of centrality.

(1) Degree Centrality is the degree of a node, that is, the number of edges attached to it. The degree centrality of node i is defined as:

$${k}_{i}=\sum_{j=1}^{n}{a}_{ij}, \tag{2}$$

where a_ij is the element of the adjacency matrix for nodes i and j, and n is the number of nodes.

(2) Closeness Centrality is defined as the reciprocal of the average distance from one node to the other nodes in the network. The smaller the average distance from a node to the others, the more central the node is. The closeness centrality of node i is defined as:

$${C}_{C}(i)=\frac{n-1}{\sum_{j}{d}_{ij}}, \tag{3}$$

where d_ij is the length of the shortest path between nodes i and j.

(3) Betweenness Centrality measures how often a node lies on the shortest paths along which information travels between other nodes. Nodes with high betweenness centrality are the nodes that transmit the most information. In the probing source mapping network, a node with high betweenness centrality may have the probing behaviour most similar to that of other nodes. The betweenness centrality of node i is defined as:

$${C}_{B}(i)=\sum_{st}\frac{{n}_{st}(i)}{{g}_{st}}. \tag{4}$$

In Eq. (4), st is a pair of nodes in the network, g_st is the total number of shortest paths from s to t, and n_st(i) is the number of those shortest paths that pass through node i.

(4) Eigenvector Centrality is an extension of degree centrality. It increases with the importance of a node’s neighbour nodes: the eigenvector centrality of a node is proportional to the sum of the eigenvector centralities of its neighbours. A node with high eigenvector centrality has important neighbours, which indicates that the node itself is very important in the network. The eigenvector centrality of node i can be defined as:

$${x}_{i}={k}_{1}^{-1}\sum_{j}{A}_{ij}{x}_{j}. \tag{5}$$

In Eq. (5), k_1 is a constant (in the standard formulation, the largest eigenvalue of the adjacency matrix) and A_ij is the element of the adjacency matrix of the network for nodes i and j.
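To make Eqs. (2) to (4) concrete, the following sketch computes degree, closeness and betweenness centrality for a small unweighted, undirected graph stored as a dictionary of neighbour sets. It is illustrative only: it assumes a connected graph, counts each unordered pair st once in the betweenness sum, uses Brandes' accumulation scheme rather than whatever tool the authors used, and omits eigenvector centrality. All function names are hypothetical.

```python
from collections import deque


def degree_centrality(adj):
    """Eq. (2): number of edges attached to each node."""
    return {v: len(neighbours) for v, neighbours in adj.items()}


def closeness_centrality(adj):
    """Eq. (3): (n - 1) divided by the sum of shortest-path distances."""
    n = len(adj)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (n - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """Eq. (4), summed over unordered pairs, via Brandes' algorithm."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in non-decreasing distance
        preds = {v: [] for v in adj}    # shortest-path predecessors
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):       # back-propagate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Each unordered pair was visited from both endpoints.
    return {v: c / 2 for v, c in cb.items()}


# Toy graph: a triangle a-b-c with a pendant node d attached to a.
example = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

On the toy graph, node a has degree 3, closeness 1.0 and betweenness 2.0, because it lies on the only shortest paths from d to b and from d to c.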

In addition to the above four types of centrality, we also use the K-shell algorithm²³ to measure the importance and distribution of nodes in the network. K-shell, also known as k-core, divides the network into layers from the centre to the periphery. The K-shell value of a node is denoted k_s. The algorithm first deletes all nodes with degree 1 from the network. These deletions may produce new nodes with degree 1, which are deleted in turn until no nodes with degree 1 remain. All nodes deleted in this round have k_s = 1 and constitute the shell of k_s = 1. Next, the same procedure is repeated for degrees 2, 3, 4 and so on up to the maximum n, each time deleting nodes whose remaining degree is at most the current value. After that, the network is divided into n shell layers, and all nodes in the same layer have the same k_s value. Figure 3 demonstrates a simple example of the k-core structure.
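The peeling procedure can be sketched as follows for a graph given as a dictionary of neighbour sets. This is a straightforward, unoptimized rendering of the description rather than the authors' implementation. The function name `k_shell` is hypothetical, and isolated nodes are assigned k_s = 0, a case the text does not discuss.

```python
def k_shell(adj):
    """Return the k-shell value k_s of every node.

    adj: dict mapping each node to the set of its neighbours (undirected).
    """
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(neighbours) for v, neighbours in adj.items()}
    shell = {}
    k = 0
    while remaining:
        while True:
            # Nodes whose remaining degree is at most k belong to shell k.
            to_remove = [v for v, ns in remaining.items() if len(ns) <= k]
            if not to_remove:
                break
            for v in to_remove:
                shell[v] = k
                for u in remaining[v]:
                    remaining[u].discard(v)
                del remaining[v]
        k += 1
    return shell


# Triangle a-b-c with a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
shells = k_shell(toy)
```

In the toy graph the pendant node d is peeled first (k_s = 1), after which the triangle nodes all have degree 2 and form the shell k_s = 2.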

Besides individual node attributes, the dispersion and aggregation of nodes are also important for reflecting the characteristics of network structure. The clustering coefficient describes the degree of clustering of nodes in the network and reflects the tendency of nodes with neighbours in common to be directly connected²⁴. For the probing source mapping network, the clustering coefficient reflects the likelihood of direct association between attackers. This paper uses the local clustering coefficient, which represents the probability that any pair of neighbours of a node are directly adjacent to each other. The local clustering coefficient c_i is defined as:

$${c}_{i}=\frac{2{E}_{i}}{{k}_{i}({k}_{i}-1)}. \tag{6}$$

In Eq. (6), E_i is the number of pairs of neighbours of node i that are directly adjacent, that is, the number of edges among the neighbours of node i, and $\frac{{k}_{i}({k}_{i}-1)}{2}$ is the number of possible neighbour pairs of node i.
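Eq. (6) translates directly into code. The sketch below computes the local clustering coefficient for every node of a graph stored as a dictionary of neighbour sets. The function name is hypothetical, and nodes with fewer than two neighbours are assigned 0 by convention, which is an assumption since the formula is undefined for them.

```python
from itertools import combinations


def local_clustering(adj):
    """Local clustering coefficient c_i = 2 E_i / (k_i (k_i - 1))."""
    coefficients = {}
    for v, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            coefficients[v] = 0.0  # convention for undefined cases
            continue
        # E_i: number of neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        coefficients[v] = 2 * links / (k * (k - 1))
    return coefficients


# Triangle a-b-c with a pendant node d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
clustering = local_clustering(graph)
```

Here node a has three neighbours of which only the pair (b, c) is connected, so c_a = 2·1/(3·2) = 1/3, while b and c each have a fully connected neighbourhood and get 1.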

Results and Analysis

Results and analysis of time feature

The probing interval and duration of CNEPD are calculated according to algorithm 1. The probing interval and duration sequences are obtained, and the distribution of each feature is calculated. The fitting curves of the distributions are then computed by maximum likelihood estimation, and the results are shown in Fig. 4(a,b).
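The text does not state which estimators were used, so the following is only a sketch of two textbook maximum likelihood estimators that could serve this purpose: the continuous power-law estimator γ = 1 + n / Σ ln(x_i / x_min) and the exponential rate estimator λ = n / Σ x_i. The function names and the default `x_min` are assumptions, and applying continuous estimators to integer day counts is an approximation.

```python
import math


def power_law_exponent(samples, x_min=1.0):
    """MLE of gamma for p(x) ~ x^(-gamma) on the tail x >= x_min (continuous case)."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one sample strictly above x_min")
    return 1.0 + len(tail) / log_sum


def exponential_rate(samples):
    """MLE of the rate for p(x) ~ exp(-rate * x): the reciprocal of the mean."""
    total = sum(samples)
    if total <= 0:
        raise ValueError("samples must have a positive sum")
    return len(samples) / total
```

With this convention the estimated γ is positive, matching the form of Eq. (1); a fitted slope read off a log-log plot would carry the opposite sign.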

Figure 4(a) shows the distribution of probing intervals, which follows a power-law distribution. The red line corresponds to the power-law fit with exponent γ = −2.01. Most probing intervals are shorter than ten days. A few outliers with long intervals appear in the tail because the probing intervals of some network segments are longer than 100 days. The result indicates that attackers prefer to make persistent probes at short intervals.

Figure 4(b) shows the distribution of probing duration. It fits an exponential distribution with a slope of −0.25889, which exhibits the memoryless property. The probing duration of each network segment is less than 30 days, and most continuous attacks have a short duration. We therefore conclude that attackers prefer probing with short durations.

The average number of probing attacks per day in CNEPD is close to 1000. From the daily probing attacks, we calculate the number of probed accounts per day, and the result is shown in Fig. 4(c). About a fifth of the time, the number of daily probed accounts ranges from 0 to 100. In the remaining four fifths of the time it exceeds 100, and the maximum reaches 2600. These results show that both the number of probing attacks and the number of probed accounts per day are quite large.

In order to analyse the attack behaviour of each single probing source, we calculate the distribution of the number of probing attacks. The distribution for each probing source and the distribution for each account are shown in Fig. 4(d,e).

Figure 4(d) shows that most network segments appear on fewer than 100 days, and segments that probe on fewer than 30 days make up the largest share. Figure 4(e) shows that most attack counts are concentrated in the range of abscissa values below 10,000. From Fig. 4(d,e), we find that most network segments probe e-mail accounts on only a few days and generate only several attacks on each probing day. The results indicate that the actual probing frequency of each network segment is low, and its number of probes over the 333 days is small as well.

From the above results, we conclude that the main characteristics of probing are short durations and short intervals. The number of probing attacks and the number of probed accounts are both fairly large. From the point of view of a single node, however, behaviour is characterized by low frequency and a small number of attacks over time. The attackers’ probing pattern is therefore highly concealed because of the low frequency, yet harmful because of the huge number of attacks across the entire dataset.

Results and analysis of the probing relationship network

We construct the probing relationship network based on CNEPD. With IP addresses aggregated into segments, we calculate the degree distribution of nodes in the network. The results are shown in Fig. 5(a). The degree distribution of attack nodes fits a power-law distribution with exponent γ = −0.8748. The degrees of most nodes are less than 10, which means that the majority of network segments probe no more than 10 accounts over the whole period. The proportion of probing sources that attack a large number of accounts is low. The results show that the number of accounts probed by most probing sources is quite small.

In the probing relationship network, each edge between two nodes has a weight, which indicates the number of attacks from the attacker to the target. We calculate the number of attacks launched by each network segment and obtain the distribution of the sum of edge weights per probing node. The results are shown in Fig. 5(b). The weight distribution of attack nodes fits a power-law distribution with exponent γ = −0.42865, and the weights of most nodes are less than 100. The result shows that most segments probe only a few times and that only a few nodes have larger weights. Attackers prefer to use multiple network segments rather than a single segment or IP address to carry out probes.

We take each “segment-username” pair as a connection, aggregate identical connections and calculate the number of attacks on each. The distribution of connection weights is shown in Fig. 5(c). It fits a power-law distribution with exponent γ = −2.9005. The result indicates that a network segment probes any given account only a few times.

From the above analysis, we find that each attacker probes only a small number of accounts and probes each account only a few times, although the number of probing attacks in the whole dataset is more than 3,000,000. Many network segments each generate no more than 100 probing attacks in these 333 days. Generally speaking, a large number of probing sources each aim at a small number of accounts, and these segments launch only several attacks against each account.
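The aggregation of connection weights and per-node weights can be sketched with `collections.Counter`. The function names are hypothetical, and records are assumed to follow the {Date, IP, Segment, Username, Count} format.

```python
from collections import Counter


def connection_weights(records):
    """Total number of attacks on each (segment, username) connection."""
    weights = Counter()
    for _date, _ip, segment, username, count in records:
        weights[(segment, username)] += count
    return weights


def node_weights(weights):
    """Sum of edge weights per probing segment."""
    totals = Counter()
    for (segment, _username), weight in weights.items():
        totals[segment] += weight
    return totals


def weight_histogram(weights):
    """How many connections (or nodes) carry each weight value."""
    return Counter(weights.values())
```

The histogram returned by `weight_histogram` is the empirical distribution that would be plotted on log-log axes before fitting a power law.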

Results and analysis of the probing source mapping network

In the probing source mapping network, each node represents a network segment, that is, a probing source. An edge between two nodes indicates that the two nodes have probed at least one common e-mail account. To give a rough picture of the relationship between attackers, Fig. 6 shows examples of the network constructed from the data of one day, one week, one month and the entire dataset. As time accumulates, the network becomes increasingly complex, and the numbers of nodes and edges increase greatly. For the whole of CNEPD, the network contains 1737 network segment nodes and 88,040 edges. Because of the complex connections among nodes, we initially infer that there are strong correlations among a large number of nodes, which might have similar attack behaviours. The following subsections analyse the network structure and the characteristics of nodes to describe attack behaviours.

Analysis of the probing source mapping network construction

After a preliminary analysis of the probing source mapping network, we conclude that there is a high correlation between the nodes. To analyse the network structure and work out the relationship between attackers, we calculate the degree and clustering coefficient of each node. The result is shown in Fig. 7(a). The clustering coefficient fluctuates greatly across node degrees, and the average clustering coefficient decreases slightly as degree increases. Even so, nodes with higher degree still have a large average clustering coefficient. The result indicates that the network contains many clustered groups of nodes that include a large number of high-degree nodes. In other words, the probing behaviours of these nodes are highly similar and aim at the same one or several e-mail accounts.

To confirm the similarity of different nodes’ probing behaviours, we calculate the characteristics of complete sub-graphs in the network. For most networks, a complete sub-graph indicates that all nodes in it are strongly correlated. In the probing source mapping network, each complete sub-graph indicates that all nodes in it have probed the same one or several e-mail accounts. The number of complete sub-graphs is calculated from the constructed network, as shown in Fig. 7(b). The distribution of complete sub-graphs is dispersed. Although most complete sub-graphs contain fewer than 100 nodes, there are still large ones. The largest complete sub-graph contains more than 800 nodes, which indicates that more than 800 segments have probed the same one or several e-mail accounts.

From the above results, multiple network segments probe the same one or more e-mail accounts. Combined with the conclusion of the previous subsection that a single network segment probes an account only a few times, this explains why the average number of attacks on each e-mail account is large while the actual number of attacks from each network segment is small. Although each segment probes only a few times, multiple attackers probe accounts in a cooperative way. Together, these probing nodes generate a huge number of attacks and pose a threat to e-mail accounts.
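The text does not say how complete sub-graphs were enumerated. One common choice is the Bron–Kerbosch algorithm with pivoting, which lists all maximal cliques. The sketch below is an assumed stand-in that operates on a dictionary of neighbour sets. The function name is hypothetical, and the recursive form is suitable only for small illustrative graphs, since cliques approaching Python's recursion limit in size would need an iterative variant.

```python
def maximal_cliques(adj):
    """Enumerate maximal complete sub-graphs with Bron-Kerbosch and pivoting.

    adj: dict mapping each node to the set of its neighbours (undirected).
    Returns a list of sets, one per maximal clique.
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # nothing can extend the clique
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(candidates | excluded, key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Triangle a-b-c with a pendant node d attached to a.
net = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
found = maximal_cliques(net)
sizes = sorted(len(c) for c in found)
```

For the toy graph the maximal cliques are {a, b, c} and {a, d}, so the size distribution is one clique of size 2 and one of size 3.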

Analysis of the node characteristic

To work out the distribution of nodes in the probing source mapping network, we decompose the network by k-shell, and the result is shown in Fig. 8. The distribution of k_s is uneven, and many nodes are located in the outermost and innermost shells. A large number of external nodes (located in the outer shells with k_s < 10) sit loosely in the outer layer of the network and launch only a few attacks. The maximum k_s is 158, and 211 nodes have k_s = 158.

In some networks, the core nodes are the most influential and important. In other networks, however, nodes in high shells (the innermost core) are not influential spreaders²⁵. To verify that these core nodes are actually important in the network, we calculate the centrality of all nodes and rank them. The results are shown in Table 2.

Table 2 Top 20 probing node IDs of 4 types of centrality.


All the nodes in Table 2 are located in the shell of k_s = 158, which implies that these core nodes are highly correlated and similar in their probing behaviours. These nodes have a large degree and high centrality. We can speculate that some of the probing sources cooperate in probing certain e-mail accounts.

From the above results, viewed over the entire dataset, the probing sources corresponding to high-degree nodes are highly correlated, with many shared edges. Besides, the number of nodes in the largest complete sub-graph is quite large, and there is an overlap in the probing targets of different attack nodes. In other words, there is strong cooperation between probing sources in CNEPD. These nodes attack a batch of accounts in a collaborative way over a long time. Based on these results, we speculate that the probed accounts may be easily accessible public e-mail accounts (such as accounts whose contact information is disclosed on public campus websites), which are of great value to the attackers behind these probing sources.

Conclusion of the probing attack

Through the above analysis results, we can define this kind of attack as the Distributed E-mail Cooperative Probing Attack. Characteristics of the attack are as follows: (1) It possesses a distributed probing pattern and the number of attack sources is quite large. (2) It attacks a lot and the frequency is high, both the number of probing attacks and the number of probed accounts in the whole time are large, and the distribution of probing time is relatively dense. (3) Each single node probes few and the attack frequency is low, each single node probe few accounts and the number of probing attacks is small as well. (4) With highly cooperative probing sources, with high correlation between probing sources, there is a certain overlap in targeted accounts, demonstrating strong cooperation.

This probing pattern is highly covert, difficult to detect and harmful. Because a single probing source probes at low frequency and only a few times, it is unlikely to be detected by traditional security equipment. Viewed across the entire dataset, however, this kind of attack produces a huge number of probes, and its effect is no less than that of probing a large number of accounts in a short period of time, as in credential stuffing (database collision) or brute-force cracking. Furthermore, the probing sources are highly correlated; by cooperating, their continuous probing poses a great security risk to the accounts that are repeatedly probed.

Discussions

In this paper, we define a kind of distributed e-mail cooperative probing attack and analyse its probing pattern. The results of our analysis can enable security administrators to propose targeted detection strategies based on its behavioural characteristics. The analysis results have also been submitted to the security department of the campus network to remind e-mail users to strengthen the protection of their accounts and privacy, so as to reduce the risk of asset losses. It is worth mentioning that we analyse the attack pattern based on CNEPD, a dataset collected in the campus network, and make full use of its characteristics in the dimensions of time and correlation space to depict the attack pattern. In addition, as far as we know, applying graph mining methods to the analysis of e-mail login data is novel. This paper may also serve as a reference for applying graph mining to other security data.

According to the characteristics of the probing pattern, the following strategies can be adopted for detection. (1) Increase the time window of the detection program. Retain as much data as possible for analysis, so that probing records accumulated over a long time span can be found. (2) Expand the range of detection targets. Detection and analysis should not be limited to a single IP address or class C network segment. Instead, relevant thresholds and baselines should be set by analysing the entire dataset, so as to improve the detection rate of such distributed and cooperative probing sources.
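
The two strategies can be sketched together as follows. This is a hedged illustration rather than a detection system from this work: the event format (timestamp, source IP, account), the function name `flag_cooperative_targets` and all threshold values are hypothetical, and real thresholds would have to be derived from baselines over the entire dataset.

```python
from collections import defaultdict
import ipaddress


def flag_cooperative_targets(events, window_start, window_end,
                             min_sources=3, max_probes_per_source=5,
                             min_networks=2):
    """Flag accounts probed by many low-rate sources from several /24 networks.

    events: iterable of (timestamp, source_ip, account) for failed logins.
    All thresholds are hypothetical placeholders.
    """
    per_account = defaultdict(lambda: defaultdict(int))
    for timestamp, source, account in events:
        if window_start <= timestamp <= window_end:
            per_account[account][source] += 1

    flagged = {}
    for account, counts in per_account.items():
        # Sources that individually stay under a per-source alarm level.
        quiet = [s for s, c in counts.items() if c <= max_probes_per_source]
        # Group sources by /24 so detection is not tied to one segment.
        networks = {ipaddress.ip_network(f"{s}/24", strict=False) for s in quiet}
        if len(quiet) >= min_sources and len(networks) >= min_networks:
            flagged[account] = sorted(quiet)
    return flagged


# Hypothetical events; timestamps are day numbers.
events = [
    (1, "192.0.2.10", "dean"),
    (9, "198.51.100.7", "dean"),
    (20, "203.0.113.4", "dean"),
    (21, "192.0.2.10", "dean"),
    (3, "192.0.2.10", "student1"),
]
long_window = flag_cooperative_targets(events, 0, 30)
short_window = flag_cooperative_targets(events, 0, 5)
```

With the 30-day window the account "dean" is flagged because three quiet sources from three different /24 networks probed it, while the 5-day window sees only one source and flags nothing. This mirrors strategy (1): the cooperation only becomes visible once records are accumulated over a long time span.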

In future work, we plan to focus on the dynamic evolution of CNEPD, study how probing behaviours develop over time, and predict such probing attacks. In addition, based on the characteristics proposed in this paper, we can implement a detection system for such attacks.

Data availability

Our data set is available at http://csri.scu.edu.cn/news/728. The data set is divided into eight parts because of the website's limit on the size of uploaded files.

References

Symantec. Internet security threat report (istr) 2019. https://www.symantec.com/security-center/threat-report (2019).
Ho, G., Sharma, A., Javed, M., Paxson, V. & Wagner, D. Detecting credential spearphishing in enterprise settings. In 26th USENIX Security Symposium 469–485 (2017).
Outbound blacklist and alert for preventing inadvertent transmission of email to an unintended recipient. by Meister, M. (2016, Jun, 28). Patent US 9,378,487 B2 [Online]. http://www.freepatentsonline.com/9378487.html (2016).
Yang, T., Kai, Q., Dan, C. T. L., Nasr, K. A. & Ying, Q. Spam filtering using association rules and naïve bayes classifier. In IEEE International Conference on Progress in Informatics & Computing (2016).
Tuan V, M., Tran, Q. A., Jiang, F. & Tran, V. Q. Multilingual rules for spam detection. Journal of Machine to Machine Communications (2015).
System and method for filtering spam messages based on user reputation. by Yablokov, V. V. (2016, Jun, 07). Patent US 9,631,605 B2 [Online]. http://www.freepatentsonline.com/9631605.html (2016).
Zhang, Y. Design and implementation of the spam filtering system based on vsto. Master dissertation, Xidian University, China (2012).
Zhuang, Z. Research of email coordinated attack detection method. Unpublished Master dissertation, Sichuan University, China (2017).
Fang, B., Cui, X. & Wang, W. Survey of botnets. Journal of Computer Research and Development 48, 1315–1331 (2011).
Aggarwal, C. C. & Wang, H. Graph Data Management and Mining: A Survey of Algorithms and Applications. In Managing and Mining Graph Data 13–68 (2010).
Zhou, T. et al. Statistical mechanics on temporal and spatial activities of human. Journal of University of Electronic Science and Technology of China 42, 481–540 (2013).
Wang, W. et al. Coevolution spreading in complex networks. Physics Reports 820, 1–51 (2019).
Jiang, Z. et al. Calling patterns in human communication dynamics. Proceedings of the National Academy of Sciences of the United States of America 110, 1600–1605 (2013).
Bovet, A. & Makse, H. A. Influence of fake news in twitter during the 2016 us presidential election. Nature Communications 10, 7 (2019).
Masuda, N. & Holme, P. Detecting sequences of system states in temporal networks. Scientific Reports 9, 795 (2019).
Bai, Y. et al. Optimizing sentinel surveillance in temporal network epidemiology. Scientific Reports 7, 4804 (2017).
François, J., Wang, S., State, R. & Engel, T. Bottrack: Tracking botnets using netflow and pagerank. In NETWORKING 2011 - 10th International IFIP TC 6 Networking Conference, Valencia, Spain, May 9-13, 2011, Proceedings, Part I (2011).
Weigert, S., Hiltunen, M. & Fetzer, C. Community-based analysis of netflow for early detection of security incidents. In USENIX LISA. 20–20 (2011).
Ye, X. Study on key technology of anomaly detection of network traffic based on behavior analysis. Unpublished PhD dissertation, Sichuan University, China (2018).
Shao, G. Research on key technologies of deep learning in advanced persistent threat detection. Unpublished PhD dissertation, Sichuan University, China (2018).
Goh, K.-I. & Barabási, A.-L. Burstiness and memory in complex systems. Europhysics Letters 81 (2008).
Newman, M. Networks: an introduction (Oxford University Press, 2010).
Erdos, P. & Bollobas, L. Graph theory and combinatorics: proceedings of the Cambridge Combinatorial Conference, in honour (Academic Press, 1984).
Watts, D. J. & Strogatz, S. H. Collective dynamics of small-world networks. Nature 393 (1998).
Liu, Y., Tang, M., Zhou, T. & Do, Y. Core-like groups result in invalidation of identifying super-spreader by k-shell decomposition. Scientific Reports 5, 9602–9602 (2015).

Acknowledgements

Open access funding provided by the National Natural Science Foundation of China (Grant No. U19A2081).

Author information

Authors and Affiliations

College of Cybersecurity, Sichuan University, Chengdu, 610065, China
Yi Wen & Xingshu Chen
Cybersecurity Research Institute, Sichuan University, Chengdu, 610065, China
Xingshu Chen, Xuemei Zeng & Wei Wang

Authors

Yi Wen
Xingshu Chen
Xuemei Zeng
Wei Wang

Contributions

Y.W. and W.W. conceived analysing methods, X.C. provided the dataset, Y.W. conducted the experiment and analysed the results, Y.W. wrote the paper, X.C., W.W. and X.Z. commented on and revised the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Xingshu Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wen, Y., Chen, X., Zeng, X. et al. Analysis of E-mail Account Probing Attack Based on Graph Mining. Sci Rep 10, 7240 (2020). https://doi.org/10.1038/s41598-020-63191-5

Received: 02 December 2019
Accepted: 16 March 2020
Published: 29 April 2020
DOI: https://doi.org/10.1038/s41598-020-63191-5
