Interpretable collaborative data analysis on distributed data

This paper proposes an interpretable non-model sharing collaborative data analysis method as one of the federated learning systems, an emerging technology for analyzing distributed data. Analyzing distributed data is essential in many applications, such as medical, financial, and manufacturing data analyses, due to privacy and confidentiality concerns. In addition, the interpretability of the obtained model plays an important role for practical applications of federated learning systems. By centralizing intermediate representations, which are individually constructed in each party, the proposed method obtains an interpretable model, achieving a collaborative analysis without revealing the individual data or the learning models distributed over local parties. Numerical experiments indicate that the proposed method achieves better recognition performance for artificial and real-world problems than individual analysis.


Motivation
In many applications, e.g., medical, financial, and manufacturing data analyses, sharing the original data for analysis may be difficult due to privacy and confidentiality requirements. Distributed data analyses without revealing the individual data have recently attracted significant attention, resulting in federated learning systems, including model share-type federated learning [19,20,22,25] and non-model share-type collaborative data analysis [4,14,15,37]. In addition, for practical applications, it is known that the interpretability (i.e., "the degree to which a human can understand the cause of a decision" according to Miller's definition [26]) of the obtained model plays an important role [11,27].
A motivating example is the distributed analysis of medical data for employees of companies. In this scenario, employees (i.e., data samples) are distributed over multiple companies, and their medical and work records (i.e., features) are distributed over multiple parties: e.g., medical treatment and checkup records are held by different medical institutions, while the employees' work situations are stored in each company's personnel department. Due to the limited number of samples and features, the data held by a single party may lack information useful for analysis. Centralizing the data from multiple parties for collaborative analysis could help to learn more useful information and obtain high-quality predictions. However, due to privacy concerns, it is difficult to share individual medical records and work situations across multiple parties. A similar situation occurs in financial and manufacturing data analyses. Thus, collaborative data analysis for distributed data, partitioned according to both samples and features, is essential and important.
Moreover, when companies aim to adopt policies or decisions according to analyses of machine-learning systems, the model should be interpretable; i.e., people need to understand the reasons why the system obtained such results [2]. This will allow people to make more useful decisions. Therefore, when distributed data analysis is used as a tool to support decision making, the model needs to be interpretable.

Main purpose and contributions
Federated learning is typically based on deep neural networks, and collaborative data analysis constructs a multi-layer model via intermediate representations. Thus, the interpretability of the obtained models is not high, which could limit their use in some application areas. To the best of our knowledge, there have been limited investigations on interpretable model construction for distributed data in the literature.
To meet the above needs of distributed data analysis and interpretability, we propose an interpretable non-model sharing collaborative data analysis on distributed data. The proposed method generates dimensionally-reduced intermediate representations from individual data in local parties, which are then shared instead of the individual data and models. The proposed method constructs an interpretable model for each party.
The main contributions of this paper are summarized as follows: • The proposed method generates an interpretable model for distributed data based on sharing intermediate representations without revealing the private data and sharing the model.
• The obtained interpretable model is based on the whole feature set of the distributed data, which cannot be achieved by individual analysis.
• Each party can individually select an interpretable model according to its own needs.
• Numerical experiments on both artificial and real-world data show that the proposed method constructs an interpretable model with better recognition performance than individual analysis and comparable to that of centralized analysis.

Outline of the paper
In Section 2, we state the target distributed data and review related works. In Section 3, we propose a novel interpretable collaborative data analysis. Numerical results are reported in Section 4. Finally, in Section 5, we summarize the results and conclude the paper. Note that throughout the paper, we use the MATLAB colon notation to refer to ranges of matrix elements.
2 Problem setting and related works

Problem setting
In this paper, we consider simple horizontal and vertical partitions. However, we note that the proposed method can also be applied to the more complicated situations described in [15].
Let m and n denote the numbers of features and training data samples, respectively. In addition, let $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times m}$ and $Y = [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times \ell}$ be the training dataset and the corresponding ground truth, respectively. The n data samples are partitioned into c institutions and the m features are partitioned into d parties as follows:

$$X = \begin{bmatrix} X_{1,1} & \cdots & X_{1,d} \\ \vdots & & \vdots \\ X_{c,1} & \cdots & X_{c,d} \end{bmatrix}, \quad Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_c \end{bmatrix},$$

where $n = \sum_{i=1}^{c} n_i$ and $m = \sum_{j=1}^{d} m_j$. Then, the (i, j)-th party has the partial dataset and the corresponding ground truth,

$$X_{i,j} \in \mathbb{R}^{n_i \times m_j}, \quad Y_i \in \mathbb{R}^{n_i \times \ell}.$$

Individual analysis of the dataset in a local party may not yield high-quality predictions due to a lack of feature information or insufficient samples. If the datasets could be centralized from the multiple parties and analyzed as one dataset, i.e., centralized analysis, we would expect high-quality predictions. However, it is difficult to share the individual data for centralization due to privacy and confidentiality concerns.
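As a concrete illustration of this partitioning, the following sketch (in Python with NumPy; all dimensions are hypothetical toy values, not taken from the paper) splits a data matrix X by samples and by features into the blocks X_{i,j}:

```python
import numpy as np

# Toy dimensions (hypothetical): n = 6 samples, m = 5 features,
# c = 2 institutions (sample partition), d = 2 parties (feature partition).
rng = np.random.default_rng(0)
n, m = 6, 5
X = rng.standard_normal((n, m))

row_splits = [3]   # institution 1 holds samples 0-2, institution 2 holds 3-5
col_splits = [2]   # feature group 1 is features 0-1, group 2 is features 2-4

# blocks[i][j] is X_{i,j}, the block held by the (i, j)-th party.
blocks = [np.hsplit(rows, col_splits) for rows in np.vsplit(X, row_splits)]

X_11 = blocks[0][0]   # shape (3, 2): n_1 samples, m_1 features
X_22 = blocks[1][1]   # shape (3, 3): n_2 samples, m_2 features
assert X_11.shape == (3, 2) and X_22.shape == (3, 3)
# Stacking all blocks back reproduces X exactly.
assert np.allclose(np.vstack([np.hstack(b) for b in blocks]), X)
```

In the target setting, no party ever materializes the full matrix X; the reassembly check above is only to verify that the blocks tile X.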
All parties want to obtain an interpretable model that achieves prediction results competitive with those of centralized analysis, without sharing the private datasets $X_{i,j}$.

Distributed data analysis
Typical techniques for privacy-preserving distributed data analysis include cryptographic computations (or secure multi-party computation) [5,9,17] e.g., using fully homomorphic encryption [8], and methods using differential privacy [1,6,18], where randomization is used to protect the privacy of the original datasets.
Recently, federated learning has been actively studied for distributed data analysis [19,20,25,36], where the learning model is centralized while the original datasets remain distributed in local parties. Google first proposed the concept of federated learning in [19,20], which is typically used for Android phone model updates [25]. There have since been several efforts to improve federated learning; see, e.g., [22,36] and references therein. Note that, for federated learning, care may be needed regarding the privacy of the original dataset due to the shared functional model [35]. Hence, non-model sharing-type methods, i.e., collaborative data analysis, have been proposed for supervised learning [14,15] and feature selection [37]. A performance comparison between collaborative data analysis and federated learning is reported in [4].

Needs for interpretability in machine learning
In recent years, as machine learning has been used in various applications in society, there has been an active discussion on developing interpretable machine learning [2]. While regressions, rules, and decision trees have all been considered interpretable machine-learning models, decision trees in particular have long been used in the context of decision support owing to their high transparency [2,11]. Also, the need for model transparency from various stakeholders has increased, in order to replace the high-performance black-box models currently used for making predictions [29]. Hence, to create interpretable models with high prediction accuracy, researchers have developed interpretable models that mimic the behavior of black-box models.
In response to these needs, it has been argued that, since interpretability is a domain-specific concept, it is necessary to build models that consider ease of use and data structure [30]. Therefore, interpretable model construction algorithms need to be designed such that users are allowed to freely select the model according to their own needs.

Interpretable collaborative data analysis
Here, we briefly introduce the algorithm of collaborative data analysis [15] and propose a novel interpretable non-model sharing collaborative data analysis on distributed data.

Collaborative data analysis
Collaborative data analysis has been proposed in [14,15] for distributed data together with a practical operation strategy to address privacy and confidentiality concerns. Here, we briefly introduce the algorithm based on the practical operation strategy.
In the practical operation strategy, collaborative data analysis is operated by two roles: users and an analyst. Users have the private datasets $X_{i,j}$ and the corresponding ground truth $Y_i$, which need to be analyzed without sharing $X_{i,j}$. Each user individually constructs a dimensionality-reduced intermediate representation and shares it with the analyst. To allow each user to use an individual function for generating the intermediate representation, the analyst transforms the shared intermediate representations into an incorporable form called collaboration representations and analyzes them as one dataset.

Training Phase: Construction of intermediate representations
Each user constructs the intermediate representation

$$\tilde{X}_{i,j} = f_{i,j}(X_{i,j}) \in \mathbb{R}^{n_i \times \tilde{m}_{i,j}},$$

where $f_{i,j}$ denotes a linear or nonlinear row-wise mapping function. A typical setting for $f_{i,j}$ is dimensionality reduction with $\tilde{m}_{i,j} < m_j$, including unsupervised [12,24,28] and supervised methods [7,13,23,34]. To address privacy and confidentiality concerns, the function $f_{i,j}$ should be set such that:
• The private data $X_{i,j}$ can be obtained only by someone who has both the corresponding intermediate representation $\tilde{X}_{i,j}$ and the mapping function $f_{i,j}$ or its approximation.
• The mapping function $f_{i,j}$ can be approximated only by someone who has both the input and output of $f_{i,j}$.
Then, the resulting intermediate representations $\tilde{X}_{i,j}$ are centralized to the analyst instead of the original private data $X_{i,j}$ or the trained model. By sharing the intermediate representations $\tilde{X}_{i,j}$ while keeping the mapping functions $f_{i,j}$ in each party, collaborative data analysis addresses the privacy and confidentiality concerns.
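A minimal sketch of one user's side, assuming a linear PCA projection as the dimensionality-reduction map f_{i,j} (the paper allows any of the cited unsupervised or supervised methods, e.g., LPP; PCA is just a stand-in):

```python
import numpy as np

def make_pca_map(X_ij, k):
    """Return a row-wise linear map f_{i,j}: R^{m_j} -> R^k fitted on X_ij.

    PCA via SVD is used here as a stand-in for the dimensionality-reduction
    methods cited in the text; the projection matrix W and the mean mu stay
    local to the user, so only the output of the map is ever shared.
    """
    mu = X_ij.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ij - mu, full_matrices=False)
    W = Vt[:k].T                      # m_j x k projection, kept private
    return lambda Z: (Z - mu) @ W     # f_{i,j}

rng = np.random.default_rng(1)
X_ij = rng.standard_normal((50, 10))      # private data of one party
f_ij = make_pca_map(X_ij, k=4)            # private mapping function
X_tilde = f_ij(X_ij)                      # intermediate representation, shared
assert X_tilde.shape == (50, 4)
```

Since only `X_tilde` leaves the party and `f_ij` is kept private, neither condition above for recovering `X_ij` can be met by the recipient alone.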

Training Phase: Construction and analysis of collaboration representations
Since $f_{i,j}$ depends on the user (i, j), the analyst cannot analyze the shared intermediate representations as one dataset. To overcome this problem, the intermediate representations $\tilde{X}_{i,j}$ are transformed into the incorporable collaboration representations as follows:

$$\hat{X}_i = g_i(\tilde{X}_i) \in \mathbb{R}^{n_i \times \hat{m}}, \quad \tilde{X}_i = [\tilde{X}_{i,1}, \tilde{X}_{i,2}, \ldots, \tilde{X}_{i,d}] \in \mathbb{R}^{n_i \times \tilde{m}_i},$$

where $g_i$ is a row-wise mapping function with $\tilde{m}_i = \sum_{j=1}^{d} \tilde{m}_{i,j}$ and $\hat{m} = \min_i \tilde{m}_i$. To construct the mapping functions $g_i$ for incorporable collaboration representations, an anchor dataset $X^{\rm anc} \in \mathbb{R}^{r \times m}$, which is a shareable dataset consisting of public data or randomly constructed dummy data, is introduced. The anchor dataset is shared with all users and is partitioned according to features, i.e.,

$$X^{\rm anc} = [X^{\rm anc}_{:,1}, X^{\rm anc}_{:,2}, \ldots, X^{\rm anc}_{:,d}], \quad X^{\rm anc}_{:,j} \in \mathbb{R}^{r \times m_j}.$$

At the user side, each mapping function $f_{i,j}$ is applied to the corresponding subset $X^{\rm anc}_{:,j}$ of the anchor dataset, yielding

$$\tilde{X}^{\rm anc}_{i,j} = f_{i,j}(X^{\rm anc}_{:,j}),$$

which is centralized to the analyst. Then, the mapping functions $g_i$ are constructed such that

$$g_i(\tilde{X}^{\rm anc}_i) \approx g_{i'}(\tilde{X}^{\rm anc}_{i'}) \quad (i, i' = 1, 2, \ldots, c).$$

For computing $g_i$, the authors of [14,15] introduced a practical method via a total least squares problem [16] when $g_i$ is linear, and also sketched an idea for the case when $g_i$ is nonlinear.
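The construction of the g_i can be sketched as follows, under simplifying assumptions: g_i is linear, the common target is taken as the leading left singular vectors of the concatenated anchor representations, and ordinary least squares is used in place of the total least squares formulation of [16]:

```python
import numpy as np

def build_collab_maps(anchor_reps, k):
    """Build linear maps g_i with g_i(anchor_reps[i]) ~ common target Z.

    anchor_reps[i] is the r x m_i anchor intermediate representation of
    institution i.  Sketch only: Z = leading k left singular vectors of the
    concatenation; each G_i solved by ordinary least squares (the paper
    cites a total least squares formulation [16]; OLS is a simplification).
    """
    concat = np.hstack(anchor_reps)
    U, _, _ = np.linalg.svd(concat, full_matrices=False)
    Z = U[:, :k]                                   # r x k common target
    Gs = [np.linalg.lstsq(A, Z, rcond=None)[0] for A in anchor_reps]
    return [(lambda A, G=G: A @ G) for G in Gs]    # the maps g_i

rng = np.random.default_rng(2)
# Two institutions mapped the SAME r = 30 anchor rows through different
# (here: linear, rank-3) private functions f_{i,j}.
base = rng.standard_normal((30, 3))
reps = [base @ rng.standard_normal((3, 5)), base @ rng.standard_normal((3, 4))]
g = build_collab_maps(reps, k=3)
# The collaboration representations of the anchor nearly coincide.
assert np.allclose(g[0](reps[0]), g[1](reps[1]), atol=1e-8)
```

The final assertion is exactly the defining property $g_i(\tilde{X}^{\rm anc}_i) \approx g_{i'}(\tilde{X}^{\rm anc}_{i'})$ quoted above.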
Finally, the obtained collaboration representations $\hat{X}_i$ can be analyzed as one dataset, together with the shared ground truth $Y_i$, using supervised machine learning and deep learning methods. This yields a model $h$ such that $h(\hat{X}) \approx Y$.

Prediction Phase
Let $X^{\rm test} \in \mathbb{R}^{s \times m}$ be a test dataset partitioned according to features and samples as $X^{\rm test}_{i,j}$, analogously to the training dataset. The intermediate representations

$$\tilde{X}^{\rm test}_{i,j} = f_{i,j}(X^{\rm test}_{i,j})$$

are constructed at the user side and shared with the analyst. At the analyst side, the predictions via the intermediate and collaboration representations,

$$Y^{\rm test}_i = h(g_i([\tilde{X}^{\rm test}_{i,1}, \tilde{X}^{\rm test}_{i,2}, \ldots, \tilde{X}^{\rm test}_{i,d}])),$$

are returned to the corresponding users.
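The prediction flow can be sketched end-to-end as below; all mapping functions here are hypothetical linear stand-ins, and only the division of work between user side and analyst side follows the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical trained pieces (all linear for brevity):
# f_{i,j} at the user side; g_i and h at the analyst side.
W00, W01 = rng.standard_normal((4, 2)), rng.standard_normal((3, 2))
f = {(0, 0): lambda Z: Z @ W00, (0, 1): lambda Z: Z @ W01}
G0 = rng.standard_normal((4, 3))                  # g_0 as a matrix
H = rng.standard_normal((3, 1))                   # stand-in analysis model h

# User side: split the test rows by features, map each block locally, share.
X_test = rng.standard_normal((5, 7))
blocks = {(0, 0): X_test[:, :4], (0, 1): X_test[:, 4:]}
tilde = {k: f[k](v) for k, v in blocks.items()}   # only these leave the users

# Analyst side: concatenate, apply g_0, predict, return to institution 0.
X_tilde_0 = np.hstack([tilde[(0, 0)], tilde[(0, 1)]])
Y_pred = (X_tilde_0 @ G0) @ H
assert Y_pred.shape == (5, 1)
```

Note that the analyst sees only the intermediate representations `tilde`, never `X_test` itself.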

Derivation of an interpretable collaborative data analysis
As shown in Section 3.1.3, the obtained model for the i-th institution is

$$Y_i \approx h(g_i([f_{i,1}(X_{i,1}), f_{i,2}(X_{i,2}), \ldots, f_{i,d}(X_{i,d})])),$$

which is a multi-layer model via intermediate and collaboration representations. The model is held separately by the users and the analyst, such that $f_{i,j}$ exists only at the user side, while $g_i$ and $h$ exist only at the analyst side. Therefore, the interpretability of the overall model is not high, even if a highly interpretable model, e.g., a decision tree, is used for $h$. To address this, we propose an interpretable collaborative data analysis. We first revisit the anchor data $X^{\rm anc}$. In collaborative data analysis, the anchor data are shareable data consisting of public data or randomly constructed dummy data, and are used for constructing the collaboration representations (see Section 3.1.2).
The basic concept of the proposed method is to mimic the multi-layer model of collaborative data analysis, that is, 1. Predict the anchor data X anc using collaborative data analysis.
2. Construct an interpretable model with the anchor data X anc and their predictions.
The predictions of the anchor data $X^{\rm anc}$ are

$$Y^{\rm anc}_i = h(g_i([f_{i,1}(X^{\rm anc}_{:,1}), f_{i,2}(X^{\rm anc}_{:,2}), \ldots, f_{i,d}(X^{\rm anc}_{:,d})]))$$

for each i. In collaborative data analysis, the analyst already holds $\tilde{X}^{\rm anc}_{i,j} = f_{i,j}(X^{\rm anc}_{:,j})$. Therefore, $Y^{\rm anc}_i$ can be obtained without additional communication from the users. Note that additional communication may increase privacy and confidentiality risks. To achieve higher recognition performance, a dataset different from the anchor data used for constructing $g_i$ in collaborative data analysis can be employed for this purpose. Then, the predictions $Y^{\rm anc}_i$ of the anchor data $X^{\rm anc}$ are returned to the i-th user. At the user side, an interpretable model is individually constructed as

$$Y^{\rm anc}_i \approx t_i(X^{\rm anc}),$$

where the obtained model $t_i$ depends on i. Note that, since the anchor data have all the features of $X$ instead of only those of the private dataset $X_{i,j}$, the obtained interpretable model $t_i$ is based on the whole feature set. For example, for a decision tree, $t_i$ can use all the features in its branches, which is not feasible in an individual analysis that only uses $X_{i,j}$. Moreover, each party can individually select an interpretable model according to its own needs. This is an advantage of the proposed method for practical applications.
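The two-step construction of t_i can be sketched with scikit-learn's decision tree as the interpretable model; the collaborative predictions Y^anc_i are simulated here by a simple rule standing in for the full pipeline h(g_i([f_{i,1}(...), ...])):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

# Shareable anchor data with ALL m = 6 features, and the labels that the
# collaborative pipeline predicted for them (simulated by a simple rule).
X_anc = rng.standard_normal((500, 6))
y_anc = (X_anc[:, 0] + X_anc[:, 3] > 0).astype(int)

# Step 2 of the proposed method: the i-th user fits an interpretable model
# t_i on (X_anc, y_anc) that mimics the multi-layer collaborative model.
t_i = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_anc, y_anc)
assert t_i.score(X_anc, y_anc) > 0.85   # the tree mimics the predictions well
```

Because `X_anc` spans all six features, the fitted tree may branch on features the user's own private block does not contain, which is the point of the construction.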

Algorithm 1 Interpretable collaborative data analysis
Input (user side): $X_{i,j} \in \mathbb{R}^{n_i \times m_j}$ and $Y_i \in \mathbb{R}^{n_i \times \ell}$, held individually
Output (user side): interpretable models $t_i$ (i = 1, 2, ..., c), which depend on i

User side (i, j):
1: Generate $X^{\rm anc}_{i,j}$ and share it with all users
2: Set $X^{\rm anc}$ and $X^{\rm anc}_{:,j}$
3: Construct $f_{i,j}$
4: Compute $\tilde{X}_{i,j} = f_{i,j}(X_{i,j})$
5: Compute $\tilde{X}^{\rm anc}_{i,j} = f_{i,j}(X^{\rm anc}_{:,j})$
6: Share $\tilde{X}_{i,j}$, $\tilde{X}^{\rm anc}_{i,j}$, and $Y_i$ with the analyst

Analyst side:
7: Set $\tilde{X}_i$ and $\tilde{X}^{\rm anc}_i$ for all i
8: Construct $g_i$ from $\tilde{X}^{\rm anc}_i$ for all i
9: Compute $\hat{X}_i = g_i(\tilde{X}_i)$ for all i
10: Set $\hat{X}$ and $Y$
11: Analyze $\hat{X}$ and obtain $h$ such that $h(\hat{X}) \approx Y$
12: Compute $Y^{\rm anc}_i = h(g_i(\tilde{X}^{\rm anc}_i))$ for all i and return $Y^{\rm anc}_i$ to the i-th institution

User side (i):
13: Construct the interpretable model $t_i$ such that $t_i(X^{\rm anc}) \approx Y^{\rm anc}_i$

Note that the performance of the proposed method depends on the choice of the anchor data $X^{\rm anc}$. The simplest way to set $X^{\rm anc}$ is via a random matrix [4,14,15]. However, to improve the performance, the anchor data need to preserve some statistics of $X$. One practical idea is to generate $X^{\rm anc}_{i,j}$ for each private dataset by using methods such as generative adversarial nets (GAN) [10], autoencoders based on (deep) neural networks, or dimensionality reduction with data augmentation. Then, $X^{\rm anc}_{i,j}$ is shared with all users and $X^{\rm anc}$ is set by combining the shared local anchor datasets. We will investigate practical techniques for constructing suitable anchor data in the future. The pseudo-code of the proposed method is summarized in Algorithm 1. As shown in Algorithm 1, the proposed interpretable collaborative data analysis is a one-pass algorithm that does not require iterative steps with data communication.
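Local anchor generation in the low-rank-plus-perturbation style used later in the experiments might be sketched as follows; the exact augmentation scheme and noise level are assumptions, not specified in the text:

```python
import numpy as np

def make_local_anchor(X_ij, rank, r, noise=0.1, seed=0):
    """Sketch of local anchor generation: a rank-`rank` SVD approximation
    of the private block X_ij, resampled to r rows with random perturbation
    as a simple data augmentation.  Hypothetical scheme: the paper names
    SVD low-rank approximation with perturbation and augmentation but does
    not fix the details.
    """
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(X_ij - X_ij.mean(0), full_matrices=False)
    low = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # low-rank part
    rows = rng.integers(0, X_ij.shape[0], size=r)   # resample r rows
    return low[rows] + noise * rng.standard_normal((r, X_ij.shape[1]))

rng = np.random.default_rng(5)
X_ij = rng.standard_normal((100, 8))                # private block
X_anc_ij = make_local_anchor(X_ij, rank=3, r=40)    # shareable local anchor
assert X_anc_ij.shape == (40, 8)
# No row of the private data appears verbatim in the anchor data.
assert not any(np.allclose(a, x) for a in X_anc_ij for x in X_ij)
```

The anchor thus preserves coarse statistics of `X_ij` (its dominant low-rank structure) while the perturbation keeps individual rows from being reproduced.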

Discussion on privacy and confidentiality
In the proposed method (Algorithm 1), each user shares the local anchor data $X^{\rm anc}_{i,j}$ with the other users and shares the intermediate representations $\tilde{X}_{i,j}$, $\tilde{X}^{\rm anc}_{i,j}$ with the analyst. We discuss how the privacy of the private data $X_{i,j}$ is preserved against both the users and the analyst. Here, we assume that the users do not trust each other and want to protect their training data $X_{i,j}$ against honest-but-curious users and the analyst; that is, the users and the analyst strictly follow the strategy, but try to infer as much information as possible. We also assume that the analyst does not collude with any users.
Regarding the privacy of $X_{i,j}$ against other users, each user shares only the local anchor data $X^{\rm anc}_{i,j}$ with the other users. The local anchor data do not contain $X_{i,j}$ but may preserve some useful information. Since the local anchor data are constructed by the users themselves, using methods such as GAN and autoencoders with data augmentation, users can control how much information is preserved, although this may result in a performance trade-off. Note that collaborative data analysis works well even when using random anchor data, as demonstrated in [4,14,15].
Regarding the privacy of $X_{i,j}$ against the analyst, each user shares the intermediate representations $\tilde{X}_{i,j}$, $\tilde{X}^{\rm anc}_{i,j}$ with the analyst. If the analyst had the mapping function $f_{i,j}$ or its approximation, he or she could obtain an approximation of $X_{i,j}$. However, the function $f_{i,j}$ is private and cannot be approximated by others, because no one else has both the input and output of $f_{i,j}$. Therefore, the analyst cannot obtain an approximation of $X_{i,j}$ from the intermediate representations.
In our future studies, we will further analyze more details of the privacy of the proposed method.

Experiments
This section evaluates the performance of the proposed interpretable collaborative data analysis (Algorithm 1) and compares it with interpretable centralized and individual analyses for classification problems. Note that centralized analysis is considered an ideal case, since the private datasets $X_{i,j}$ cannot be shared in our target situation. The proposed collaborative data analysis aims to achieve better performance than individual analysis.
We use a simple decision tree as the interpretable model. In the proposed method, each intermediate representation is designed from $X_{i,j}$ using locality preserving projections (LPP) [12], an unsupervised dimensionality reduction method. We use a kernel version of ridge regression (K-RR) [32] with a Gaussian kernel for the collaborative analysis, setting the regularization parameter of K-RR to λ = 0.01. The local anchor data $X^{\rm anc}_{i,j}$ are constructed by a low-rank approximation based on the singular value decomposition (SVD) with random perturbation and data augmentation. We set the number of anchor data to r = 2,500.
We set the ground truth $Y$ as a binary matrix whose (i, j) entry is 1 if the training sample $x_i$ is in class j and 0 otherwise. This type of ground truth has been applied to various classification algorithms, including ridge regression and deep neural networks [3].
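The analysis model h and the ground-truth encoding above can be sketched with scikit-learn's KernelRidge (RBF kernel, alpha corresponding to λ = 0.01); the kernel width gamma and the toy data are hypothetical choices not given in the text:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(6)
X_hat = rng.standard_normal((200, 5))          # collaboration representations
labels = (X_hat[:, 0] > 0).astype(int)         # toy two-class labels
Y = np.eye(2)[labels]                          # binary 0/1 ground-truth matrix

# K-RR with a Gaussian (RBF) kernel and regularization 0.01, as in the
# experiments; gamma = 0.5 is an assumed kernel width.
h = KernelRidge(alpha=0.01, kernel="rbf", gamma=0.5).fit(X_hat, Y)
pred = h.predict(X_hat).argmax(axis=1)         # class = largest output column
assert pred.shape == (200,)
assert (pred == labels).mean() > 0.9           # fits the training data well
```

The argmax over output columns is the usual way to read a class label back out of this binary-matrix regression target.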
In this paper, we evaluate the performance of the methods in terms of normalized mutual information (NMI) [33] and accuracy (ACC). Moreover, to evaluate the similarity of the prediction models obtained by individual and collaborative data analyses to that of centralized analysis, we use the fidelity to centralized analysis under NMI (Fidelity to CA), that is,

$$\text{Fidelity to CA} = \text{NMI}(Y^{\rm CA}, Y^{\rm IA}) \ \text{or} \ \text{NMI}(Y^{\rm CA}, Y^{\rm CDA}),$$

where $\text{NMI}(\cdot, \cdot)$ denotes the NMI value between two predictions, and $Y^{\rm CA}$, $Y^{\rm IA}$, and $Y^{\rm CDA}$ are the predictions of centralized, individual, and collaborative data analyses, respectively.
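The NMI-based fidelity measure can be computed with scikit-learn; the toy prediction vectors below are hypothetical:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

# Fidelity to CA: NMI between the centralized-analysis predictions and those
# of another method; higher means the two models make more similar decisions.
y_ca  = np.array([0, 0, 1, 1, 2, 2, 0, 1])   # centralized predictions (toy)
y_cda = np.array([0, 0, 1, 1, 2, 2, 0, 1])   # collaborative: identical here
y_ia  = np.array([0, 1, 1, 0, 2, 1, 0, 2])   # individual: a worse match

assert np.isclose(nmi(y_ca, y_cda), 1.0)     # perfect fidelity
assert nmi(y_ca, y_ia) < 1.0                 # lower fidelity
```

Note that NMI is invariant to relabeling of the classes, so a model that makes the same splits under permuted labels still scores fidelity 1.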
All the numerical experiments are performed using MATLAB R2019b.

Experiment I: Artificial data
We used a 20-dimensional artificial dataset for two-class classification; Fig. 1(a) illustrates the training dataset. We considered the case where the dataset in Fig. 1(a) is distributed into four parties (c = d = 2). For the horizontal partitioning, the 1st group of parties $X_{1,1}$, $X_{1,2}$ holds the dataset shown in Fig. 1(b) and the 2nd group of parties $X_{2,1}$, $X_{2,2}$ holds the dataset shown in Fig. 1(c). For the vertical partitioning, $X_{1,1}$, $X_{2,1}$ have features 1-10 and $X_{1,2}$, $X_{2,2}$ have features 11-20. Fig. 1(d) illustrates features 1 and 11 of the test dataset and their ground truth. For the proposed method, we set the dimensionality of the intermediate representations to $\tilde{m}_{i,j} = 4$ for all parties.
Fig. 2 presents the recognition results, and Table 1 shows the average and standard error of NMI, ACC, and Fidelity to CA over 10 trials. From these results, we observe that individual analysis does not obtain good recognition results, for the following reasons. Since $X_{1,1}$ has feature 1 of the samples shown in Fig. 1(b) and $X_{2,2}$ has feature 11 of the samples shown in Fig. 1(c), the distributions of the two classes overlap; therefore, using only $X_{1,1}$ or $X_{2,2}$ cannot separate the two classes. Moreover, $X_{1,2}$ has feature 11 of the samples shown in Fig. 1(b) and $X_{2,1}$ has feature 1 of the samples shown in Fig. 1(c); therefore, the classification boundaries for $X_{1,2}$ and $X_{2,1}$ are horizontal and vertical, respectively. In contrast, the proposed collaborative data analysis (Fig. 2(a)) achieves good recognition results, comparable to those of centralized analysis (Fig. 2(b)).

Experiment II: Financial data
We used the credit rating dataset "CreditRating_Historical.dat" from the MATLAB Statistics and Machine Learning Toolbox. The dataset contains five financial ratios: working capital to total assets (WC_TA), retained earnings to total assets (RE_TA), earnings before interest and tax to total assets (EBIT_TA), market value of equity to book value of total debt (MVE_BVTD), and sales to total assets (S_TA), together with industry sector labels from 1 to 12, for 3,932 customers. The dataset also includes credit ratings from "AAA" to "CCC" for all customers. Note that this dataset is simulated, not real. We aim to predict the credit rating from the five financial ratios and the industry sector label.
We considered the case where the training dataset with 3,000 samples is distributed into four parties (c = d = 2), where $X_{1,1}$, $X_{2,1}$ have the 1st group of features (WC_TA, RE_TA, and EBIT_TA) and $X_{1,2}$, $X_{2,2}$ have the 2nd group of features (MVE_BVTD, S_TA, and the industry sector label).
The decision trees obtained by centralized, individual, and collaborative data analyses are shown in Fig. 3, while the average and standard error of NMI, ACC, and Fidelity to CA over 10 trials are shown in Table 2. In Fig. 3, the features marked with * are in $X_{1,1}$, $X_{2,1}$ and the features marked with • are in $X_{1,2}$, $X_{2,2}$. As shown in Fig. 3, the proposed collaborative analysis (Fig. 3(a)) yields a tree with the same two features as centralized analysis, and these features belong to different groups. This cannot be achieved by individual analysis, as shown in Fig. 3(c)-(f).

Experiment III: Real-world data
We next evaluate the performance of centralized, individual, and collaborative data analyses on binary and multi-class classification problems obtained from [21,31] and feature selection datasets 1 .
We considered the case where the dataset is distributed into six parties: c = 2 and d = 3. The performance of each method is evaluated by using a five-fold cross-validation framework. For the proposed method, we set m i,j = 15.
The numerical results of the centralized analysis, the average of the individual analyses, and the proposed method for 10 test problems are presented in Table 3. We observe from Table 3 that the recognition performance of the proposed method is better than that of individual analysis and comparable to that of centralized analysis on most datasets.

Conclusions
To address the needs of distributed data analysis and achieve interpretability, we proposed an interpretable non-model sharing collaborative data analysis on distributed data. The proposed method generated an interpretable model for distributed data by sharing intermediate representations without revealing the private data and the model. The obtained interpretable model was based on the whole features of distributed data, which cannot be achieved in individual analysis. Numerical experiments on both artificial and real-world data showed that the proposed method constructed an interpretable model with better recognition performance than individual analysis and comparable to centralized analysis.
Distributed data analysis and interpretable model construction are essential and important challenges in real-world situations, including medical, financial, and manufacturing data analyses. The proposed interpretable collaborative data analysis could be a breakthrough technology for such kinds of distributed data analysis.
In our future studies, we will further analyze the privacy and confidentiality concerns and the accuracy of the proposed method. Moreover, practical techniques for improving the performance of the proposed method including other suitable anchor data will be investigated. The authors will also apply the proposed method to practical distributed data in other fields, such as medical or manufacturing, and evaluate its recognition performance.