Blockchain-Based Decision Tree Classification in Distributed Networks

In a distributed system such as the Internet of Things, the data volume at each node may be limited, which can constrain the performance of a machine learning classification model. How to effectively improve classification performance in a distributed system has been a challenging problem in the field of data mining. Sharing data across the distributed network can enlarge the training data volume and improve the classification model’s accuracy. In this work, we take data sharing and the quality of shared data into consideration and propose an efficient Blockchain-based ID3 Decision Tree Classification (BIDTC) framework for distributed networks. The proposed BIDTC takes advantage of three techniques: blockchain-based ID3 decision tree, enhanced homomorphic encryption, and stimulation smart contract to conduct classification while effectively considering the data privacy and the value of user data. BIDTC employs a data federation scheme based on homomorphic encryption and blockchain to share more training data without sacrificing data privacy. Meanwhile, smart contracts are integrated into BIDTC to incentivize users to share more high-quality data. Our extensive experiments have demonstrated that the proposed BIDTC significantly outperforms existing schemes in constructed consortium blockchain networks.


Introduction
Large volumes of data are produced by social networks, engineering sciences, biomolecular research, commerce, and security logs [1]. To extract the information hidden in such big data, machine learning techniques such as statistical model estimation and predictive learning have emerged [2]. Classification is a critical supervised machine learning technique that learns from training data and assigns test data to predefined classes [3]. Many classification algorithms such as Iterative Dichotomiser 3 (ID3), Support Vector Machine (SVM), and K-Nearest Neighbor (KNN) have been intensively studied [4,5]. Most of the existing classification schemes are based on centralized settings where a large training dataset is available on a single host. However, in a distributed computing system such as the Internet of Things (IoT), the data is likely scattered around the system, which makes it difficult to have a large centralized dataset for training and classification [6][7][8]. For example, Ang et al. [6] proposed the ensemble approach PINE to classify concepts of interest in a distributed computing system. PINE combines reactive adaptation, proactive handling of upcoming changes, and adaptation across peers to achieve better accuracy. A distributed classification algorithm (P2P-RVM) for peer-to-peer networks, based on relevance vector machines, was proposed in Khan et al. [7]. To solve the distributed multi-label classification problem, Xu et al. [8] proposed a quantized distributed semi-supervised multi-label learning algorithm, where the kernel logistic regression function is used, and the common low-dimensional subspace shared by multiple labels is learned. The work in Vu et al. [9,10] tries to consider data privacy by using encrypted traffic. Similarly, the flow-based relation network classification model RBRN was proposed in Zheng et al. [11] to overcome the imbalance issues of encrypted traffic.
However, in these existing approaches, either the data privacy or the value of the user data was not taken into consideration.
It is challenging to optimize the classification accuracy while effectively taking the data privacy and data value into consideration in a distributed system. As each user node in a distributed network system has a limited amount of data for model training, the classification accuracy may be limited due to the insufficient training data at the node. Data sharing among nodes can be employed to enlarge the training dataset and improve classification accuracy. However, such data sharing gives rise to data privacy leakage, which is of great importance for many security-sensitive IoT applications. In this work, we propose an efficient Blockchain-based ID3 Decision Tree Classification (BIDTC) framework to take data sharing and the quality of shared data into consideration during the classification process. The proposed BIDTC employs a blockchain-based distributed storage and fully homomorphic encryption scheme for data sharing among the distributed nodes. By adopting the blockchain-based data federation classification and the smart contract-based stimulation scheme, the proposed BIDTC allows an individual node to have an enlarged training dataset in the distributed environment. As the decision tree-based classification is widely adopted and requires a short training time for knowledge acquisition in various applications [12,13], the proposed BIDTC integrates the decision tree-based classification with the blockchain-based scheme.
The organization of the rest of the paper is as follows. The related literature is summarized in Section 2. Section 3 proposes a blockchain-based data sharing architecture for training the classification model. A blockchain-based ID3 decision tree classification algorithm for the distributed environment is presented in Section 4. Experimental evaluations and the analysis of the results are presented in Section 5. Finally, Section 6 concludes the paper.

Related Work
This section summarizes the related work on decision tree-based classification, fully homomorphic encryption, and blockchain technologies.

Decision Tree-based Classification
The decision tree technique is widely used in data analysis and prediction [14][15][16][17][18][19][20][21]. For example, in [16], the C4.5 decision tree algorithm is applied to achieve precision marketing prediction. The C5.0 decision tree classifier is proposed in [17] for general and medical datasets, in which the gain calculation function is modified by adopting the Tsallis entropy function. A service decision tree-based post-pruning prediction approach is proposed to classify services into the corresponding reliability level after discretizing the continuous attributes of services in service-oriented computing [18]. ID3 is one of the standard algorithms for decision tree learning; it calculates the entropy to select the condition attributes [19][20][21].

Fully Homomorphic Encryption
Several privacy-preserving machine learning classification schemes have been proposed recently [22,23]. For example, fully homomorphic encryption (FHE) is proposed for classification without leaking user privacy, especially in the outsourcing scenarios of the distributed environment [24]. An ElGamal Elliptic Curve (EGEC) homomorphic encryption scheme for safeguarding the confidentiality of data stored in a cloud is proposed in Vedara et al. [25]. In Ren et al. [26], a practical homomorphic encryption scheme is proposed to allow IoT systems to operate on encrypted data. A privacy-preserving distributed analytics framework is presented for big data in the cloud by using the FHE cryptosystem [27]. In order to reduce excessive interactions and ciphertext transformations, the work in Smart et al. [28] proposed a SIMD technique that improves the efficiency of homomorphic operations by encrypting multiple small plaintexts into a single ciphertext. In [29], a private decision tree classification algorithm with SIMD-based fully homomorphic encryption is proposed.

Blockchain
The blockchain is a distributed ledger database and has attracted much recent attention in the academic community [30]. The blockchain paradigm takes advantage of key technologies such as peer-to-peer networking, the distributed ledger, the consensus mechanism, and smart contracts, and has many applications in fields such as the Internet of Things (IoT), finance, and manufacturing [31]. In Wang et al. [32], a blockchain-powered parallel healthcare system (PHS) framework is proposed to support comprehensive healthcare data sharing and care auditability. A blockchain-based framework for supply chain provenance is proposed in Cui et al. [33], and the framework is analyzed to ensure its security and reliability. A theoretical framework for trust in IoT scenarios and a blockchain-based trust provision system are investigated in Bordel et al. [34]. The blockchain technique is deployed to create a secure and reliable data exchange platform across multiple data providers in Nguyen et al. [35]. In Wang et al. [36], a blockchain-based secure data storage mechanism for sensor networks is proposed. Blockchain-based privacy-aware content caching in the cognitive Internet of Vehicles is presented in Qian et al. [37], in which privacy protection and secure content transactions are examined.

Data Sharing for Classification
The dataset owned by a single node in a distributed system is usually limited and insufficient for training a classification model with high accuracy. To improve the classification accuracy, data sharing among nodes is needed. In addition, both the value and the privacy of the shared data are of great importance in applications such as healthcare and finance. To jointly take data sharing, data privacy, and the value of data into consideration, we propose a blockchain-based data sharing architecture for classification, as shown in Fig. 1.
There are double chains and different types of nodes in the proposed data sharing architecture. As shown in Fig. 1, a node in the blockchain network can be a data provider, data requestor, storage server, or ledgering node. The data providers in the blockchain network can share valuable data with encryption throughout the whole network. The sharing procedure will be recorded by the ledgering node and finally be written into the corresponding blockchain. If one of the data requestors demands more training datasets to improve the classification accuracy, it can send the request message to a storage server in the blockchain network. As a result, better performance of classification can be achieved by data requestors, and the financial profits can be obtained by data providers when the predefined blockchain-based smart contracts are executed, as shown in Fig. 1.

Double Blockchains
In the proposed blockchain-based data sharing architecture, the consortium chain is employed to store and share the training datasets among multiple nodes in the blockchain network. The data in the consortium chain is mainly from several related nodes such as institutions or companies [38]. In Fig. 1, we propose double blockchains according to the various transactions in the system. One chain for Transaction I is used to store the block data and share the encrypted data by data providers. The other chain is for Transaction II, which is used to store the block data for improving the classification performance by enlarging the volume of the related training dataset. The chain with Transaction II enables some nodes to make financial profits through the blockchain-based pre-negotiated smart contracts between the data providers and the data requestors.

Roles of Node
As shown in Fig. 1, every node in the consortium blockchain network has one or multiple roles: data provider, data requestor, storage server, or ledgering node. The data provider needs to encrypt the plaintext data M to generate ciphertext data C, then upload the ciphertext file and the corresponding encryption algorithm to a data storage server. At the same time, the data provider can obtain the download address of the file and calculate the hash value of the ciphertext data to verify the data integrity. The access policies for the uploaded data can be defined by data providers. The data owned by data providers can be packed as a transaction and added to a blockchain (after confirmation by the ledgering nodes in the consortium blockchain network). Note that the storage server is not a physical centralized storage node or device. It can be a virtual (logical) node, such as cloud-based storage, within the consortium blockchain network.
The data requestors can issue a request to the ledgering nodes for some shared data. The ledgering nodes verify the different identities of access policies corresponding to the requested data. Once approved, the data requestors can download the requested encrypted data from the storage servers and train the classification models on the federated training datasets. In the meantime, the smart contracts for the transactions associated with data sharing between the data requestors and the data providers can be executed automatically.
Figure 1: The architecture of the blockchain-based data sharing

Data Storage and Sharing
Each node that has valuable data can obtain some rewards from data sharing. The implementation process requires two phases associated with the two chains of the blockchain network. In phase I, the data providers share their valuable encrypted data to the storage servers. Such sharing is recorded and validated by the ledgering nodes running the consensus algorithm. The data requestors can then issue requests for specific shared data and receive the shared data along with the encryption algorithm after authentication. In phase II, the data requestors encrypt their local data using the obtained encryption algorithm and federate the obtained encrypted training data with their local encrypted data, then train the classification models on the newly federated training data. Correspondingly, the data requestors will pay the predetermined electronic currency to the data providers according to the blockchain-based smart contracts.

Blockchain-based Improved ID3 Decision Tree Classification
In this section, we present a new Blockchain-based ID3 Decision Tree Classification (BIDTC) framework for the blockchain-based data sharing architecture. The proposed BIDTC takes into account the relation between the current condition attributes, the other condition attributes in the learning process, and the stimulation mechanism in smart contracts.

An Improved ID3 Decision Tree Classification
The original ID3 classification algorithm only takes the current condition attributes and decision attributes into consideration when calculating the gain. Here, we present an improved ID3 algorithm that takes advantage of all the attributes in the system, including the relationship between the current condition attribute and the other condition attributes. Specifically, we denote A = (A_1, A_2, ..., A_N) as a set of N condition attributes with values (R_1, R_2, ..., R_N), respectively. Assuming that the number of occurrences of attribute A_i (i = 1, 2, ..., N) is N_i, the frequency of A_i is defined in Eq. (1).
Then the weight of the attribute A_i can be calculated as Eq. (2).
Assume that Ỷ is a decision attribute with M possible values, that A_i has U_i possible values, and that R_i (i ∈ {1, 2, ..., N}) is set as R_i = (a_1, a_2, ..., a_{U_i}). Then, the relationship degree between the condition attribute A_i and the decision attribute Ỷ can be defined as Eq. (3).
The A_kj in Eq. (3) is the number of instances in which the k-th value of A_i belongs to the j-th class of the decision attribute Ỷ. According to Eq. (3), we can calculate the weighted degree as Eq. (4).
Assume that the training data samples are in S, where each sample x_i has a corresponding output class label Ỷ_i. Let P_j be the percentage of training samples belonging to class j of the decision attribute Ỷ. Then, the class-involved entropy E(Ỷ) for the attribute Ỷ is defined in Eq. (5).
Similarly, the condition entropy E(Ỷ|A_i) for each attribute A_i can be defined in Eq. (6).
Therefore, the information gain of the condition attribute A_i is defined in Eq. (7).
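For reference, the class entropy, condition entropy, and information gain referred to as Eqs. (5)-(7) can be written in their standard ID3 form as below. This is a sketch under the assumption that the framework starts from the textbook definitions; the exact notation of the original equations may differ. Here Y stands for the decision attribute Ỷ, and S_k (the subset of S in which A_i takes its k-th value) and P_{j|k} (the fraction of S_k belonging to class j) are symbols introduced only for this illustration.

```latex
% Standard ID3 definitions (assumed form of Eqs. (5)-(7))
E(Y) = -\sum_{j=1}^{M} P_j \log_2 P_j

E(Y \mid A_i) = \sum_{k=1}^{U_i} \frac{|S_k|}{|S|}
                \left( -\sum_{j=1}^{M} P_{j \mid k} \log_2 P_{j \mid k} \right)

\mathrm{Gain}(Y \mid A_i) = E(Y) - E(Y \mid A_i)
```

With these definitions, an attribute whose values separate the classes perfectly has E(Y | A_i) = 0, so its gain equals E(Y), the largest value the gain can take.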
The ID3 decision tree algorithm starts with the dataset at the root node and recursively partitions the data into lower-level nodes based on the split criterion. Only nodes that contain multiple different classes need to be split further. Eventually, the algorithm stops the growth of the tree based on a stopping criterion. We set two stopping criteria for the algorithm. Criterion I is whether all samples in the training dataset are labeled as a single class. Criterion II is whether the attribute set A is empty (or all attribute values of S are the same). Accordingly, we propose an improved blockchain-based ID3 decision tree algorithm with the following steps.
Step 1. Check stopping Criteria I and II. If Criterion I is true, mark the current node as a leaf node of class Ỷ; if Criterion II is true, mark the tree as a leaf and set the most common value of Ỷ in S as the label. Otherwise, go to Step 2.
Step 2. Calculate the information gain Gain(Ỷ|A_i) of each condition attribute A_i according to Eq. (7), and set the parameters sW = 0 and pW = 0. For each attribute value a_i ∈ R_i, calculate the weight of each attribute using the training set S_i of each value a_i.
Step 3. For attribute values in A_j ∈ A \ {A_i}, calculate the relationship degree using Eq. (3) and the weighted relationship degree using Eq. (4). Then the new value of pW is obtained, and from it the value of the comprehensive information gain.
Step 4. Determine the best splitting attribute A_best that has the maximum comprehensive information gain, A_best = arg max_A Gain(Ỷ|A), and go to Step 1.
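The recursion in Steps 1 to 4 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it uses the standard information gain as the split criterion instead of the comprehensive (weighted) gain of Steps 2 and 3, and the function names and the dictionary-based tree representation are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Standard ID3 gain: E(Y) - E(Y | attr)."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def build_tree(rows, labels, attrs):
    """Recursively build a tree; leaves are class labels, inner nodes are dicts."""
    # Criterion I: all samples share a single class.
    if len(set(labels)) == 1:
        return labels[0]
    # Criterion II: no attribute left, or all samples agree on every attribute.
    if not attrs or all(all(r[a] == rows[0][a] for a in attrs) for r in rows):
        return Counter(labels).most_common(1)[0][0]
    # Assumption: plain information gain stands in for the comprehensive gain.
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attr": best, "children": {}}
    for value in sorted({r[best] for r in rows}):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node


def predict(tree, row, default=None):
    """Walk the tree; return `default` for an attribute value unseen in training."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], default)
    return tree
```

Each sample is a dictionary that maps attribute names to discrete values, so `build_tree(rows, labels, ["outlook", "windy"])` returns either a class label (a leaf) or a nested dictionary keyed by the chosen attribute's values.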

Enhanced Homomorphic Encryption
To consider both privacy and efficiency, we adopt the vector homomorphic encryption (VHE) method [39] for the proposed BIDTC framework. Assuming that the data requestor and the data provider are denoted as R and P, respectively, we present the setup, training, and classification processes of BIDTC as follows.
Phase 1. P: Setup. The data providers identify the security parameter and the training data D_t (t = 1, 2, ...), where t represents the sequence number of the transferred data. With the key generation algorithm KeyGen(), the data providers obtain the VHE public and private keys and the matrix H. The data providers encrypt D_t to D_t' by using the encryption algorithm Encrypt(pk, x_i). Then, the data providers send Encrypt(pk, x_i), D_t', and the matrix H to the corresponding storage servers.
Phase 2. R: Training Classifier_ID3(D_∪'). The data requestors encrypt the local dataset D to D' by using the encryption algorithm Encrypt(pk, x_i), which is combined with the received dataset D_t' to generate a new dataset D_∪'. Then the classification model is trained by performing the improved ID3 algorithm on the federated training dataset D_∪'.
Phase 3. R: Testing_ID3(VD'). The data requestors encrypt the local testing dataset VD = {x_1, x_2, ..., x_m} to obtain the encrypted testing dataset VD' = {c_1, c_2, ..., c_m} by using the same encryption operations as mentioned above. The classification accuracy is calculated by the data requestors after completing the classification task on the testing dataset VD'.
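Phases 1 to 3 rely on computing over encrypted data. As a minimal illustration of that idea, and explicitly not the VHE scheme of [39], the sketch below uses a toy Paillier cryptosystem (a swapped-in, additively homomorphic scheme with deliberately tiny and insecure primes) to add two encrypted counts without decrypting them. All function names and parameter values are illustrative assumptions, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)   # public key n, private key (lam, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, private_key, c):
    """Recover the plaintext with the private key (lam, mu)."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n
```

For example, a provider and a requestor could each encrypt a class count, and `decrypt(n, sk, add(n, c_provider, c_requestor))` then yields the sum of the two counts. A production system would use a vetted library and the vector scheme adopted by the framework rather than this toy construction.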

Stimulation Scheme with Smart Contract
In this section, we develop a stimulation scheme with smart contracts for the proposed BIDTC framework.
In the blockchain network, the transactions in a smart contract can be executed automatically, and the corresponding inputs, outputs, and states affected by executing the smart contracts are negotiated and agreed on by all participating nodes [40,41]. Here, we propose a stimulation scheme to incentivize the providers to share more valuable data. For each data-sharing transaction, there are two types of transaction fees: a basic transaction fee and an additional transaction fee. We assume that the basic transaction fee the data providers receive from the data requestors is D ethers. The additional transaction fee depends on the percentage increase of the classification accuracy due to the data sharing. Let Δacc denote the percentage increase of the classification accuracy between the original classification model and the newly constructed one (i.e., after the data sharing). If Δacc > 0, the data requestors pay an additional transaction fee to the data providers, according to Tab. 1. If the classification accuracy does not increase compared with the original model, the data requestors do not pay an additional transaction fee to the data providers for the data sharing.
The higher the quality of the data shared by the providers, the better the classification accuracy, and the greater the financial profit the data providers obtain from sharing the training data. Therefore, the data providers in various blockchain networks have incentives to share more valuable datasets.
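The fee rule of the stimulation scheme can be sketched off-chain in Python as follows. This is a minimal sketch under stated assumptions: the tier schedule `EXAMPLE_TIERS` is hypothetical and does not reproduce the values of Tab. 1, the function name is illustrative, and the accuracy gain is taken as the plain difference between the two accuracies.

```python
def transaction_fee(acc_before, acc_after, base_fee, tiers):
    """Total fee (in ethers) owed by a data requestor for one sharing transaction.

    tiers is a list of (min_gain, extra_fee) pairs; the extra fee of the
    highest tier whose min_gain lies strictly below the accuracy gain applies.
    """
    gain = acc_after - acc_before
    extra = 0.0
    if gain > 0:  # no additional fee unless the accuracy actually improves
        for min_gain, extra_fee in sorted(tiers):
            if gain > min_gain:
                extra = extra_fee
    return base_fee + extra


# Hypothetical schedule for illustration only, NOT the values of Tab. 1.
EXAMPLE_TIERS = [(0.00, 0.1), (0.02, 0.3), (0.05, 0.5)]
```

With this hypothetical schedule, an accuracy gain of about 0.04 on top of a base fee of 1 ether falls into the middle tier, while a requestor whose accuracy does not improve pays only the base fee.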

The Proposed BIDTC Framework
The proposed Blockchain-based ID3 Decision Tree Classification (BIDTC) framework takes advantage of three techniques: blockchain-based ID3 decision tree, enhanced homomorphic encryption, and stimulation smart contract to conduct the classification in the distributed environment while effectively considering the data privacy and the value of the user data. Fig. 2 shows the overall process of the proposed BIDTC framework, whose primary operations are listed below.
(i) The distributed blockchain network is set up, and the Ethereum-based consortium chains are constructed. The distributed blockchain network consists of a large number of data providers, the ledgering nodes, and the data requestors.
(ii) The data providers encrypt their local training data by using the vector homomorphic encryption, then upload the encrypted data to a storage server in the blockchain network. The ledgering nodes with the consensus algorithm can validate the transactions involved with sharing data. All the transactions will be stored in the consortium chain.
(iii) The data requestors train on the local training dataset with the ID3-based algorithm and obtain a classification model. This model is then validated on the testing dataset, and the accuracy (say acc_0) is obtained. The data requestors can then issue requests to the blockchain network for more shared training data. After authentication by the ledgering nodes, the data requestors can receive the encrypted training data shared by the providers. At the same time, a smart contract is established between the data providers and the corresponding data requestors. Once they receive the encrypted training data, the data requestors encrypt the local training data by using the same encryption scheme as the data providers, federate it with the received encrypted training dataset, and perform the improved ID3 algorithm to obtain a new classification model and accuracy (say acc_1).
(iv) The smart contracts and the stimulation scheme are triggered when the accuracy difference Δacc = acc_1 - acc_0 is above a certain threshold.

Performance Evaluation
In this section, we conduct simulations to validate the proposed blockchain-based BIDTC framework and analyze the performance.

Experiment Settings
We simulated the blockchain-based BIDTC network with Python 3.7. The simulation platform is built on a machine with Ubuntu 16.04 LTS, an Intel Core i5-8250U CPU (3.40 GHz), and 8.0 GB of RAM. In the consortium blockchain network, each node is deployed with Geth 1.7.2 (Go Ethereum). The configuration file genesis.json includes the chain identifier (chain ID), the random number nonce, and the timestamp. The smart contracts are coded and tested in Remix, a browser-based IDE. The account address, the balance, and the indexes of datasets are defined in the structs, as shown in Fig. 3.
We carry out the experiments using the MNIST dataset [42]. We set 60000 samples as the training dataset and 10000 samples as the testing dataset. The training dataset is further divided into four equal parts and stored in four random nodes, namely, Node A, Node B, Node C, and Node D.

Experimental Results
As data privacy is built into the encrypted data sharing, we focus here on evaluating the accuracy and speed of the proposed BIDTC. The confusion matrix includes True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP represents a sample that is actually positive and predicted to be positive; TN represents a sample that is actually negative and predicted to be negative; FP represents a sample that is actually negative and predicted to be positive; and FN represents a sample that is actually positive and predicted to be negative. If there are M classes, we can calculate the classification accuracy according to Eq. (8).
Figure 3: The illustration of the deployment for blockchain-based smart contracts
As Eq. (8) shows, the classification accuracy AC equals the ratio of correctly classified samples to all classified samples in the corresponding testing dataset. The speed of the classification can be measured by the time consumed in training the model and classifying the testing samples. Fig. 4 shows the classification accuracy for the four random nodes. From Fig. 4, we can see that the classification accuracy of all four nodes improves significantly as their training data volume increases. The initial values of the classification accuracy of the four nodes are different in Fig. 4. Specifically, Fig. 4a has the maximum accuracy of 0.84, and Fig. 4b has the minimum accuracy of 0.8. This is because the quality of the training dataset in Node A is the highest among the four nodes, while Node B has the worst data quality. We use Q_i to denote the dataset quality of Node i. The quality relationship among the four nodes, Q_A > Q_D > Q_C > Q_B, is further verified in Fig. 5, where each node works as a data requestor and federates more training data from the other three data providers. From Fig. 5, we can see that the classification accuracy improves as the amount of federated data increases, and the nodes with high-quality datasets achieve a greater gain in classification accuracy.
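The accuracy measure described for Eq. (8), the ratio of correctly classified samples to all classified samples, can be computed from an M-class confusion matrix as sketched below. The function name and the matrix layout (rows are true classes, columns are predicted classes) are assumptions made for this illustration.

```python
def accuracy_from_confusion(matrix):
    """Multi-class accuracy from a square confusion matrix.

    matrix[i][j] is the number of test samples whose true class is i and
    whose predicted class is j, so the diagonal holds the correct predictions.
    """
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total
```

For a three-class matrix `[[50, 2, 3], [4, 40, 6], [1, 5, 39]]`, the diagonal sums to 129 of 150 samples, which gives an accuracy of 0.86.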

Classification Accuracy versus Data Quality
Eq. (9) is defined to measure the quality of the training dataset, where N(S_i) is the total number of samples in the training dataset of Node i, and N(S_i^e) is the number of low-quality samples in the training dataset of Node i. A sample with a blurry picture or an incorrect class label in the training dataset is marked as a low-quality sample.
In this experiment, we uniformly select 10% of the original MNIST training dataset from each class and replace their class labels with random integers in the range 0 to 9. As a result, we obtain 6000 low-quality training samples, denoted by LQ. For each network node, when the volume of training data reaches a threshold Θ, we add some low-quality training samples to the corresponding node. For example, when the volume of the federated training data in Node A reaches 20000, we gradually add 0% to 20% of low-quality training samples from LQ to its training dataset. Fig. 6 shows the experiment results when setting Θ to 10^4, 2×10^4, and 3×10^4. We can see that Fig. 6a has the maximum initial accuracy of 0.79 when the data volume amounts to 30000. Fig. 6b has the minimum initial accuracy of 0.66 when the data volume amounts to 10000. This is because the initial data quality of the training dataset in Node A is the highest, while that in Node B is the lowest. Again, we can see that for a given node, the classification accuracy improves significantly as the training dataset volume increases. In addition, better training data quality results in higher classification accuracy from BIDTC.
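The construction of the low-quality set LQ and the quality measure can be sketched in Python as follows. This is a minimal sketch under two assumptions: Eq. (9) is taken to have the form Q_i = (N(S_i) - N(S_i^e)) / N(S_i), which is consistent with the definitions of N(S_i) and N(S_i^e) but is not confirmed by the text, and the function names, the seed, and the list-based label representation are illustrative.

```python
import random


def corrupt_labels(labels, fraction=0.10, num_classes=10, seed=0):
    """Replace the labels of `fraction` of each class with random integers.

    Returns (new_labels, corrupted_indices). A replaced label is drawn from
    0..num_classes-1, so it can occasionally coincide with the original one.
    """
    rng = random.Random(seed)
    new_labels = list(labels)
    corrupted = []
    for cls in range(num_classes):
        members = [i for i, y in enumerate(labels) if y == cls]
        chosen = rng.sample(members, round(len(members) * fraction))
        for i in chosen:
            new_labels[i] = rng.randrange(num_classes)
        corrupted.extend(chosen)
    return new_labels, corrupted


def dataset_quality(total_samples, low_quality_samples):
    """Assumed form of Eq. (9): share of samples not marked as low quality."""
    return (total_samples - low_quality_samples) / total_samples
```

Applied to 60000 labels with a fraction of 0.10, the procedure marks roughly 6000 samples (exactly 10% of each class, up to rounding), and `dataset_quality(60000, 6000)` evaluates to 0.9 under the assumed form of Eq. (9).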

Comparing BIDTC with Traditional Classification Algorithms
In this experiment, we compare the proposed BIDTC algorithm with existing algorithms, including the original ID3 algorithm (OIDA), the Neural Networks algorithm (NNA) [43], and the Random Forest algorithm (RFA) [44]. Without loss of generality, we generate a dataset based on MNIST and augment it with low-quality samples from LQ such that the average quality level is 0.9. The volume of the initial training dataset is 10000 in each node, while the volume of the testing dataset is 2000. Tab. 2 shows the running time and accuracy of all algorithms in the same distributed network environment.
From Tab. 2, we can see that the running time of both OIDA and BIDTC is smaller than that of NNA and RFA, at the cost of a slight accuracy loss. Here we define the average classification efficiency for K nodes in Eq. (10), where CE is the average value of the classification efficiency, AC_i is the classification accuracy of node i, and CT_i is the corresponding classification running time of node i. Fig. 7 shows how the classification efficiency CE varies when the volume of the training datasets increases from 10^4 to 3×10^4. From Fig. 7, we can see that the average classification efficiency of BIDTC is significantly higher than that of the other three algorithms. This is because the proposed BIDTC can take advantage of the three techniques: blockchain-based ID3 decision tree, enhanced homomorphic encryption, and stimulation smart contract to effectively conduct classification in the distributed environment.
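The average classification efficiency can be sketched in Python as follows. The exact form of Eq. (10) is an assumption here: the sketch takes CE to be the mean over the K nodes of the accuracy-to-time ratio AC_i / CT_i, which matches the description of CE, AC_i, and CT_i but may differ from the original equation. The function name is illustrative.

```python
def classification_efficiency(accuracies, running_times):
    """Assumed form of Eq. (10): mean of AC_i / CT_i over the K nodes."""
    if not accuracies or len(accuracies) != len(running_times):
        raise ValueError("need one accuracy and one running time per node")
    ratios = [ac / ct for ac, ct in zip(accuracies, running_times)]
    return sum(ratios) / len(ratios)
```

Under this assumed form, an algorithm that is slightly less accurate but much faster obtains a higher CE, which is the trade-off the comparison in Tab. 2 and Fig. 7 describes.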

Conclusion and Future Direction
In this work, we have proposed a Blockchain-based improved ID3 Decision Tree Classification (BIDTC) algorithm for the distributed environment. The proposed BIDTC takes advantage of three techniques: blockchain-based ID3 decision tree, enhanced homomorphic encryption, and stimulation smart contract to conduct classification while effectively considering the data privacy and the value of the user data. BIDTC employs the proposed blockchain-based data sharing architecture to enlarge the volume of the training datasets, coupled with a smart contract-based stimulation scheme to enhance the quality of the training data. Our extensive experiments have shown that our algorithm significantly outperforms the existing techniques in terms of classification efficiency. In the future, we will explore how to improve the performance of the proposed algorithm for online data with high dimensions.