An Apriori-Based Learning Scheme Towards Intelligent Mining of Association Rules for Geological Big Data

The past decade has witnessed rapid advances in geological data analysis techniques, which facilitate the development of modern agricultural systems. However, several technical challenges must still be addressed to fully exploit the potential of geological big data while massive amounts of data are gathered in this application field. In general, a good representation of the correlations in geological big data is critical to making full use of multi-source geological data, discovering relationships in the data, and mining mineral prediction information. In this article, a scheme is therefore proposed for intelligent mining of association rules from geological big data. Firstly, we obtain word embeddings of the geological data via the word2vec technique. Secondly, the embedded data are clustered with the self-organizing map (SOM) and the K-means algorithm to improve the performance of analysis and mining. On this basis, an unsupervised Apriori learning algorithm is developed to analyze and mine the association rules in the data. Finally, experiments are conducted to verify that our scheme can effectively mine the potential relationships and rules in mineral deposit data.


Introduction
Data analysis, as an effective technique, has been extensively employed in the era of big data [1][2][3][4][5], and over the last two decades it has driven many applications [6][7][8][9][10][11][12]. Specifically, when applied to agricultural upgrading and reconstruction, such methods not only accelerate the process of agricultural modernization, but also play an important role in realizing sustainable development [13]. Since the formation and development of agriculture is partly related to the evolution of surface geology, geological data analysis approaches play an active role in supporting modern agricultural systems [14,15]. As large amounts of geological data are collected in this field, however, analyzing and utilizing geological big data remains challenging, which hinders its wide application in modern agricultural systems. In response to this limitation, this article explores the use of machine learning algorithms to achieve data-enabled intelligent analysis of geological big data.
Generally, geological big data concerns the various layers of the Earth, the history of the Earth's formation and evolution, the material composition and changes of the Earth, and many other topics [16,17]. Geological big data thus has the "4V" characteristics of traditional big data, namely volume, variety, velocity, and value. Moreover, it has its own particularities, such as multi-source heterogeneity, spatial-temporal correlation, and complexity and fuzziness. Meanwhile, owing to the huge spatio-temporal scope over which geological objects develop and evolve, and the numerous factors affecting geological processes, the characteristics of high dimensionality, high complexity, and high uncertainty are especially pronounced, so that geological big data faces unprecedented opportunities and challenges [18].
For geological big data, extracting valuable information from these multi-source data, given their complex characteristics, is a challenging and important issue: it underpins analyzing the regularity of mineral yields, summarizing the characteristics of a specific type of mineral deposit, and discovering the attributes they may contain [19]. Hence, mining association rules from geological big data is particularly difficult.
In this field, increasing emphasis is being placed on using the Apriori algorithm to mine association rules for geological big data [20][21][22]. One algorithm converts spatial data in a geographic information system (GIS) database into non-spatial data, and then uses Apriori in a multi-level, multi-relational spatial association rule mining method based on inductive logic programming [23]. Based on a multi-source geological spatial database and spatial data mining technology, and considering the spatial characteristics and uncertainty of geological data, a regional metallogenic prediction method was developed; experimental comparisons showed that predictions obtained by mining geological spatial data with Apriori were more accurate than those of the traditional weights-of-evidence model, and that the method was effective for regional mineralization prediction [24]. Furthermore, Apriori has been used to extract the frequent itemsets of the associated ores and intrusive rocks of hydrothermal gold deposits, and it was found that the associated minerals are closely related to the acidity and alkalinity of the intrusive rocks [25].
However, when the amount of data is too large, the algorithms developed in the works mentioned above may have the following disadvantages.
(1) If the support or confidence threshold is too low, the Apriori algorithm generates many useless rules, which prevents users from quickly screening and judging the rules and makes it difficult to find truly useful knowledge.
(2) Conversely, if the support threshold is too high, some valid and strong association rules will be missed.
(3) When the number of database scans is too large or the frequent itemsets to be searched are too large, the algorithms consume too much time or memory.

To avoid these limitations and design a more practical method, this article combines the self-organizing map (SOM) and K-means with the Apriori algorithm to effectively mine more valid and strong association rules, especially for geological big data. The motivation of the proposed scheme is as follows.
Currently, SOM is an effective unsupervised learning algorithm that can be used for clustering. However, the number of clusters obtained by SOM alone is particularly large, so the K-means algorithm can be used to perform a secondary clustering of the data. K-means is a classic clustering algorithm [26]. Its core idea is to first fix the number of clusters and then partition all data into that many clusters according to Euclidean distance. However, the choice of initial cluster centers greatly influences the K-means result, the algorithm easily falls into local optima, and it is sensitive to noise and isolated points; these defects greatly limit its clustering effect. By feeding the SOM clustering result into K-means as input, i.e., using K-means to perform a secondary clustering of the SOM result, these defects can be overcome and the data can be divided accurately into the specified number of clusters [27,28]. Moreover, the combination of SOM and K-means makes it possible to improve the Apriori algorithm [29,30].
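The two-stage ("secondary") clustering idea can be sketched as follows. For brevity, a fine-grained K-means pass stands in for the SOM stage, and all names, sizes, and the random data are illustrative assumptions, not the paper's settings:

```python
# A minimal numpy sketch of secondary clustering: a fine-grained first
# pass (standing in for SOM) produces many prototypes, then K-means
# groups the prototypes into a few final categories.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))               # e.g. 50-dim embedding vectors

# Stage 1: many micro-clusters (the role SOM plays in the paper).
prototypes, proto_labels = kmeans(X, k=30)

# Stage 2: secondary clustering of the prototypes into 3 categories.
_, proto_category = kmeans(prototypes, k=3)

# Each sample inherits the category of its stage-1 prototype.
final_labels = proto_category[proto_labels]
print(final_labels.shape)   # (300,)
```

Because the second stage clusters only the prototypes rather than the raw data, it is far less sensitive to noisy or isolated points, which is the defect of K-means that the combination is meant to overcome.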
Motivated by this, a SOM-K clustering-optimized unsupervised Apriori learning algorithm, named SOM-K-Apriori, is developed to mine the association rules within each category of geological big data. In this article, we specifically combine the "Support", "Confidence", and "Lift" of the association rules in Apriori as the evaluation criteria, which further helps the scheme find more valid and strong association rules. Furthermore, to present the mining results effectively, the association rules are clearly demonstrated by tables and a parallel visualization method.
The contributions of this article are as follows. (1) Aiming at the practical demand for geological big data analysis, a machine learning-based intelligent mining scheme for association rules is developed, which greatly improves the computational performance of the mining task.
(2) In the field of geological data analysis, through the use of the unsupervised learning algorithm SOM-K-Apriori, the proposed scheme can find more valid and strong association rules with low support, while providing an intuitive visualization of those association rules.

The rest of this article is arranged as follows. Section 2 introduces the background, including the SOM and Apriori algorithms. Section 3 details the proposed scheme. In Section 4, experiments on mining association rules from mineral deposit data are conducted to evaluate the performance of the proposed scheme. Finally, Section 5 summarizes the conclusions.

Background
In this section, we briefly introduce the key technologies related to our method.

Self-Organizing Map (SOM)
SOM was proposed by Kohonen and is also known as the Kohonen network [31]. It is an unsupervised learning method. The main task of SOM is to map input data of any dimension into one-dimensional or two-dimensional discrete data through a computational mapping. It can be used for many applications, such as clustering, high-dimensional visualization, data compression, and feature extraction [32,33].
Essentially, SOM is a neural network with only an input layer and an output layer. The input layer receives external input information, and the output layer responds to it. The input data can be of any dimension, and each node in the output layer represents a class to be clustered. Competitive learning is implemented in the training phase: each input sample finds the node most similar to it in the output layer, called the active node or winning neuron. An appropriate rule is then applied to update the parameters of the active node. Meanwhile, the neighboring nodes also update their parameters appropriately, based on their distance to the active node.
Therefore, there is a topological relationship between the nodes in the output layer, and this relationship must be specified manually. When a one-dimensional model is needed, the output nodes are connected in a line; when a two-dimensional topological relationship is needed, the output nodes are connected to form a plane. For example, the network structure of a two-dimensional topological relationship is shown in Fig. 1, where the nodes of the output layer are fully connected to the nodes of the input layer.

Apriori Algorithm
The Apriori algorithm was proposed in [34]. It is now a classical algorithm in association analysis and mining, and it is used to find the itemsets that appear frequently in the data [35][36][37]. Finding patterns among these frequent sets can help us make decisions [21,38]. For a rule A → B, three measures are used:

Support(A → B) = P(A ∩ B),

where P(A ∩ B) represents the probability that an itemset contains both set A and set B;

Confidence(A → B) = P(B | A) = count(A ∩ B) / count(A),

where count(•) is the count of the itemset; and

Lift(A → B) = Confidence(A → B) / Support(B).

If Lift(A → B) > 1, the rule A → B is a valid and strong association rule. Conversely, Lift(A → B) < 1 means that A → B is an invalid association rule. Specifically, Lift(A → B) = 1 indicates that A and B are independent.
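As a concrete illustration, the three measures can be computed over a toy transaction database (the transactions below are invented for the example and are not from the paper's dataset):

```python
# Hedged sketch of Support, Confidence, and Lift for a rule A -> B
# over a made-up transaction database.
transactions = [
    {"rhyolite", "tuff"},
    {"rhyolite", "tuff", "andesite"},
    {"andesite"},
    {"rhyolite", "tuff"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / N

A, B = {"rhyolite"}, {"tuff"}
sup = support(A | B)                  # P(A and B together)
conf = support(A | B) / support(A)    # P(B | A)
lift = conf / support(B)              # > 1 means a valid and strong rule

print(sup, conf, lift)
```

Here the rule {rhyolite} → {tuff} has support 0.75, confidence 1.0, and lift about 1.33, so under the criterion above it would count as valid and strong.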

The Proposed Scheme
In this section, we will present each step of the scheme we proposed. The ultimate goal is to effectively mine the valid and strong association rules between mineral deposit attributions and visualize them.
The core process can be divided into four parts. Firstly, all geological data are expressed as word embedding vectors through the word2vec technique. Secondly, the SOM algorithm maps the vector data into a two-dimensional space, placing similar data in adjacent locations; however, since the number of SOM clusters is relatively large and the result may be unsatisfactory, the K-means method is used for a further clustering. Thirdly, the Apriori algorithm is applied to analyze and mine the association rules within each category of geological data. Finally, an evaluation criterion combining the "Support", "Confidence", and "Lift" measures of Apriori is designed to evaluate the quality of the results achieved by the SOM-K-Apriori method. The whole SOM-K-Apriori model is shown in Fig. 2.

Preprocessing for Geological Data
First, the data are cleaned: consistency is checked, and invalid and missing values are handled. Then, some key attributes of the data are analyzed with simple statistical methods to reveal their evolution trends.

Word Embedding Using Word2vec
Since the geological data contain a large amount of text, we use the word2vec model to process the dataset and obtain word embeddings, thereby normalizing the data. Specifically, each attribute value is converted into a 50-dimensional word embedding vector before mining association rules.
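The paper presumably trains word2vec (e.g. gensim's implementation) on the geological text; here, to keep the sketch self-contained, a stand-in embedding table with random 50-dimensional vectors illustrates how each attribute value becomes a fixed-length vector (the vocabulary and vectors are invented):

```python
# Illustrative stand-in for the embedding step: a lookup table mapping
# each attribute value to a 50-dim vector, as word2vec would produce.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["andesite", "rhyolite", "porphyritic", "massive"]
embedding = {w: rng.normal(size=50) for w in vocab}   # fake word2vec output

vec = embedding["andesite"]
print(vec.shape)   # (50,)
```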

Association Analysis via SOM-K-Apriori Algorithm
After obtaining vectors for all data, we use the SOM-K-Apriori model to cluster the data and mine association rules.
Firstly, the topological relationship of the SOM network should be determined. Let the number of input data be N. Each input vector is x = [x_1, x_2, ⋯, x_D]^T with dimension D. Each node in the output layer has the same dimension as the input, so the weight vector of output node j is recorded as w_j = [w_{j1}, w_{j2}, ⋯, w_{jD}]^T, j = 1, 2, ⋯, K, where K is the number of output nodes.

1) Initialization
The weight vector of each output node is initialized with small random values.

2) Competition
The competitive process finds the weight vector w_j that is most similar to the input vector x. The Euclidean distance can be used as the discriminant function: the smaller the distance, the more similar w_j is to x. The index i(x) of the output node most similar to the input vector x can be expressed as

i(x) = arg min_j ‖x − w_j‖, j = 1, 2, …, K.
The output node satisfying the above condition is called the active node or winning neuron; it is the node most similar to the input vector x.
3) Cooperation

Let h_{j,i(x)} denote the influence on the excited nodes in the output layer that are affected by the active node i(x), where j is the node index in the output layer. Let d_{j,i} denote the distance between the active node i(x) and an excited node j. Then h_{j,i(x)} is a unimodal function of the distance d_{j,i}: the smaller the distance between the active node and an excited node, the greater the impact on that excited node. Thus, h_{j,i(x)} measures how strongly the excited node is affected.
Generally, h_{j,i(x)} is a Gaussian function:

h_{j,i(x)}(n) = exp(−d_{j,i}² / (2σ(n)²)),

where i(x) is the location of the active node and σ(n) is the effective width of the topology neighborhood at step n. In addition, σ(n) decays as

σ(n) = σ(0) exp(−n / τ₁),

where σ(0) is the initial value of σ and τ₁ is a time constant. Usually, the initial value σ(0) is the radius of the output mesh, and the time constant is τ₁ = 1000/log(σ(0)).

4) Weight update
The weight vectors of the active node and its surrounding excited nodes are adjusted by gradient descent:

w_j(n + 1) = w_j(n) + η(n) h_{j,i(x)}(n) (x − w_j(n)),

where 0 < η(n) ≤ 1 is the learning rate, defined by

η(n) = η(0) exp(−n / τ₂),

where η(0) is the initial value of η and τ₂ is another time constant.
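The competition, cooperation, and weight-update steps above can be sketched in a few lines of numpy. The grid size, decay constants, step count, and random data below are illustrative assumptions, not the paper's settings:

```python
# Compact numpy sketch of SOM training: competition (BMU search),
# cooperation (Gaussian neighborhood), and weight update, with both
# the neighborhood width and the learning rate decaying over time.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # input vectors
grid_h, grid_w, D = 6, 6, X.shape[1]
W = rng.normal(size=(grid_h * grid_w, D))  # one weight vector per node
# 2-D grid coordinates of each output node, for neighborhood distances
coords = np.array([(r, c) for r in range(grid_h)
                   for c in range(grid_w)], dtype=float)

sigma0, eta0, tau1, tau2, steps = 3.0, 0.5, 1000.0, 1000.0, 500
for n in range(steps):
    x = X[rng.integers(len(X))]
    i = np.argmin(np.linalg.norm(x - W, axis=1))    # competition: winner
    sigma = sigma0 * np.exp(-n / tau1)              # shrinking radius
    eta = eta0 * np.exp(-n / tau2)                  # decaying learning rate
    d2 = np.sum((coords - coords[i]) ** 2, axis=1)  # squared grid distance
    h = np.exp(-d2 / (2 * sigma ** 2))              # Gaussian neighborhood
    W += eta * h[:, None] * (x - W)                 # cooperative update

print(W.shape)   # (36, 50)
```

After training, each output node's weight vector is a prototype, and nearby nodes on the grid hold similar prototypes, which is what makes the subsequent K-means pass over the SOM output meaningful.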
Then, we use the K-means algorithm for further clustering. Let the data vectors mapped through SOM be y_i (i = 1, 2, ⋯, N), with the same dimension as the input data of SOM, where N is the total number of data. Moreover, let C = {c_1, c_2, ⋯, c_k} (k ≤ N) be the set of cluster centers. The cluster centers can then be selected as follows.

1) k initial cluster centers are selected.

2) The distance between each data vector y_i and each cluster center c_j is calculated, and y_i is assigned to its nearest cluster center. The distance can be expressed as

d(y_i, c_j) = ‖y_i − c_j‖.

3) For each cluster j, the new cluster center is recalculated:

c_j = Σ_i γ_{ij} y_i / Σ_i γ_{ij},

where γ_{ij} = 1 when the data vector y_i belongs to the j-th cluster, and γ_{ij} = 0 otherwise.

4) Steps 2) and 3) are repeated until the cluster centers no longer change.

Thirdly, we use the Apriori algorithm to mine the association rules in the data. The Apriori algorithm uses an iterative, layer-by-layer search, where the frequent k-itemsets are used to explore the (k + 1)-itemsets. In general, the process of mining association rules can be divided into two steps.

1) Find all the frequent itemsets. First, the database is scanned, the count of each item is accumulated, and the items whose frequency is greater than or equal to the minimum support are collected; this yields the set of frequent 1-itemsets, recorded as L1. Then, L1 is used to find the set L2 of frequent 2-itemsets, L2 is used to find L3, and so on, until no more frequent k-itemsets can be found. A complete scan of the database is required to find each Lk.
The flow chart of finding frequent itemsets in Apriori algorithm is shown in Fig. 3, where min_Sup means minimum support.
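The level-wise search can be sketched on a toy transaction set as follows (the transactions and the min_Sup value are made up for illustration; candidate generation is simplified to unions of frequent (k−1)-itemsets):

```python
# Minimal Apriori sketch: find all frequent itemsets level by level,
# using Lk-1 to build the candidates for Lk.
from itertools import combinations

transactions = [
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"},
    {"b", "c"}, {"a", "b", "c"},
]
min_sup = 0.4                      # minimum support, as a fraction
N = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / N

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
frequent = list(Lk)
k = 2
while Lk:
    # candidate k-itemsets from unions of frequent (k-1)-itemsets
    candidates = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k}
    Lk = [c for c in candidates if support(c) >= min_sup]
    frequent.extend(Lk)
    k += 1

print(sorted(tuple(sorted(s)) for s in frequent))
```

On this toy database every candidate survives, so the search returns all seven non-empty itemsets over {a, b, c}; with real data, most candidates are pruned at each level.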
2) Generate strong association rules from the frequent itemsets. Once all the frequent itemsets have been found, the next step is to find the association rules. Given a frequent itemset S, let A and B be non-empty subsets of S with A ∩ B = ∅. If

Confidence(A → B) ≥ minConf,

where minConf is the minimum confidence, then A → B is a strong association rule.
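Step 2 can be sketched as follows: for one frequent itemset S, enumerate every split of S into a non-empty antecedent A and consequent B, and keep the rules whose confidence reaches minConf (the support values and threshold below are invented for the example):

```python
# Sketch of rule generation from one frequent itemset S: A and B
# partition S (so A ∩ B = ∅), and a rule is kept when
# Confidence(A -> B) = support(S) / support(A) >= minConf.
from itertools import combinations

# supports as might come out of the frequent-itemset step (made up)
support = {
    frozenset("a"): 0.8, frozenset("b"): 0.8,
    frozenset("ab"): 0.6,
}
minConf = 0.7

S = frozenset("ab")
rules = []
for r in range(1, len(S)):
    for A in map(frozenset, combinations(S, r)):
        B = S - A                        # consequent: the rest of S
        conf = support[S] / support[A]   # P(B | A)
        if conf >= minConf:
            rules.append((set(A), set(B), conf))

print(rules)
```

Both {a} → {b} and {b} → {a} have confidence 0.75 here, so both pass the 0.7 threshold.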

Performance Evaluation
Currently, there is no standard way to assess the quality of the results obtained by Apriori. However, the "Lift" of each association rule reflects whether the rule is valuable, while its "Support" and "Confidence" indicate its importance. Therefore, in the proposed scheme, an evaluation criterion (EC) is designed by combining the "Support", "Confidence", and "Lift" of all M association rules mined by Apriori, as follows.
where Support_i, Confidence_i, and Lift_i are the "Support", "Confidence", and "Lift" of the i-th association rule, respectively. With this criterion, the number of K-means clusters can be adjusted according to the EC: when the best EC is found, the mined association rules are optimal.

Experimental Description
The dataset used here comes from hydrothermal copper deposits in China; in total, 252 such deposits are used in the experiment. From each deposit we select six important attributes: metallogenic epoch, mineral composition, rock structure, rock fabric, alteration type, and alteration intensity.
Our experiments are conducted in a Python 3.6.4 environment on a computer with an Intel(R) Core(TM) i7-6700 CPU and 16 GB of RAM.

Parameters Optimization
In our method, there are some parameters to be optimized, such as the dimension of each mineral deposit attribute, the size of SOM network, and the number of cluster centers in K-means.
Generally, the dimension of Chinese word embeddings varies across scenarios; in most cases it is 50, 100, 200, or 300. Considering the size of our data, 50-dimensional vectors are enough to represent all deposit attributes. From Fig. 4 we find that, with more cluster centers, more association rules are mined, but the differences in EC are small; meanwhile, with more cluster centers, more association rules between categories are also lost. Hence, weighing the size of the data, the EC, and the number of association rules, the number of clusters is set to 3, i.e., the data are divided into 3 categories. Lastly, according to an empirical equation, the size of the SOM network is set to 21 × 12.

Figure 4:
The relationship between EC, the number of association rules, and minimum support while using algorithm SOM-K-Apriori with different numbers of cluster centers

Performance Comparison
To verify the superiority of the SOM-K-Apriori algorithm, we compare it with the Apriori algorithm; the results are shown in Figs. 5 and 6.
In Fig. 5, for a fixed minimum support, the running time of SOM-K-Apriori is always shorter than that of Apriori. This is because the input/output cost of Apriori increases exponentially with the size of the data, and once the scope of the database scan is reduced, the input/output cost also drops greatly.
In addition, Fig. 6 shows that SOM-K-Apriori is able to find more valid and strong association rules. This is because some valid and strong association rules have low support in the whole database; once the data are clustered into categories of similar characteristics by SOM and K-means, those rules have high support within their categories. This phenomenon becomes clearer as the support threshold increases.

Figure 5:
The relationship between running time and minimum confidence for the SOM-K-Apriori and Apriori algorithms

Figure 6: The relationship between EC, the number of association rules, and minimum support for the SOM-K-Apriori and Apriori algorithms

From the above experimental results, we find that, on the one hand, the SOM-K-Apriori algorithm can quickly find all association rules in the database; on the other hand, it can also find association rules that have low support but are valid and strong.

Experimental Results
Firstly, for the six selected properties of each hydrothermal copper deposit, we clean the abnormal data, and a simple statistical method is employed to analyze the regularities of the key deposit properties, including the epochs, the rock assemblage composition, and the rock structure. The results are shown in Figs. 7-10. Fig. 7 shows the distribution of minerogenetic epochs in domestic hydrothermal copper deposits: the epochs are mainly concentrated in the Jurassic to Cretaceous, with a support of nearly 0.366. Fig. 8 shows the statistics of the rock assemblage composition in domestic hydrothermal copper deposits; the most common rock composition is Andesite, with a support of nearly 0.360. To obtain more information on the domestic hydrothermal copper deposits, we also examine the rock structures and rock fabrics of the main composition, Andesite, in Figs. 9 and 10. Fig. 9 shows the distribution of the rock structures of Andesite in domestic hydrothermal copper deposits; Andesite has many kinds of rock structures, the most important of which is the Porphyritic texture, whose support is as high as 0.769.
Similarly, Fig. 10 indicates that Andesite has many kinds of rock fabrics, but the main one is the Massive structure, whose support is 0.824.
Secondly, we convert each attribute value into a 50-dimensional word embedding vector using the word2vec model. Because some attributes contain multiple values, each mineral deposit is represented by a 600-dimensional vector, in which missing positions are padded with "0". The word embedding expressions of the mineral deposit attributes are shown in Tab. 1.
After obtaining the 600-dimensional representation of each mineral deposit, we cluster the data. After 200 iterations, the result obtained by SOM alone is shown in Fig. 11(a), and the final result obtained by SOM and K-means is shown in Fig. 11(b), where each color represents one big category. The numbers of mineral deposits contained in the big categories are 72, 76, and 104, respectively; they are named cluA, cluB, and cluC. Then, we use the Apriori method to mine the data after running SOM and K-means. Through many experiments, the minimum support is set to 0.09 and the minimum confidence to 0.6. The association rules of the associated rocks, alteration types, and their intensities in the different big categories are shown in Tabs. 2-4, respectively.
Tab. 2 indicates that Rhyolite is the rock highly associated with Rhyolitic Tuff, and Andesite is the mineral highly associated with Andesitic Volcanic Clastic Rock. Finally, to show the mined results intuitively, we use an improved parallel visualization method [39] to depict the mined association rules in Figs. 12, 13, and 14. In those figures, the antecedents are below the "−−" and the consequents are above it. The three figures correspond to Tabs. 2-4, respectively.

Figure 12:
The association rules of associated minerals

Figure 13: The association rules of minerals with alteration type and alteration intensity

Conclusion
In this article, an intelligent scheme is proposed to analyze and mine the association rules of hydrothermal copper deposits. The proposed scheme consists of three main steps. First, word embedding vectors are generated for the geological data. Then, SOM and K-means are used to cluster similar attribute characteristics into one category. Finally, the evaluation criterion-guided Apriori algorithm is used to mine the association relationships within every category. The experimental results show that, compared with the traditional Apriori, the proposed scheme has two main advantages: it quickly mines all the association rules by reducing the scope of scanning, and it mines valid and strong association rules that have low support. Hence, this intelligent scheme is more suitable for mining valid and strong association rules from geological data. However, since the data are clustered by SOM and K-means in the SOM-K-Apriori model, the association rules among categories are lost. Therefore, in the future, for data with more key attributes, we will reduce their dimensionality to prevent the loss of association rules between categories.

Conflicts of Interest:
No potential conflict of interest was reported by the authors.