A multiorder feature tracking and explanation strategy for explainable deep learning

Abstract: A good AI algorithm can make accurate predictions and provide reasonable explanations for the field in which it is applied. However, the application of deep models makes the black box problem, i.e., the lack of interpretability of a model, more prominent. In particular, when there are multiple features in an application domain and complex interactions between these features, it is difficult for a deep model to intuitively explain its prediction results. Moreover, in practical applications, multiorder feature interactions are ubiquitous. To break the interpretation limitations of deep models, we argue that a multiorder linearly separable deep model can be divided into different orders to explain its prediction results. Inspired by the interpretability advantage of tree models, we design a feature representation mechanism that can consistently represent the features of both trees and deep models. Based on the consistent representation, we propose a multiorder feature-tracking strategy to provide a prediction-oriented multiorder explanation for a linearly separable deep model. In experiments, we have empirically verified the effectiveness of our approach in two binary classification application scenarios: education and marketing. Experimental results show that our model can intuitively represent complex relationships between features through diversified multiorder explanations.


Introduction
In an era when data are increasingly accessible, the exploration and prediction of many fields can be completed by computers, thereby reducing the burden on human beings. In this process, machine learning (ML) algorithms try to predict results or provide decisions by learning from large amounts of information [1,2]. However, unlike humans, most ML algorithms cannot explain the reasons for their predictions or decisions, which is often referred to as the black box problem [3] in the ML field. The black box problem refers to a model's lack of interpretability [4], meaning that we cannot understand the model's internal mechanisms by only observing its parameters [5]. To address the lack of interpretability, Lipton and Guidotti et al. [3,6] propose delving into a black box model in three ways: model explanation, outcome explanation, and model inspection. Different approaches can start from features, also known as feature attribution or a feature-based mostly different for a certain model. Fortunately, existing studies have shown that Taylor expansions are applicable to most deep neural networks (DNNs) [36,37]; that is, most DNNs can be approximately expanded into several linearly addable parts with different orders. Furthermore, Taylor expansions can be applied to explain different orders of network layers. Motivated by an extension of Taylor's theorem in neural networks, we speculate that the parts of a deep model with different orders can be leveraged to explain its prediction outcomes separately if that deep model is linearly separable or approximately linearly separable. Under this assumption, as long as the methods of feature combination are different, explanations of different orders can be achieved by consistent features, thus solving the second problem.
Therefore, we propose a GBDT-based interpretable strategy for deep models named multiorder feature-tracking explanation (MFTE), which employs consistent memory representations to track features from GBDTs to a multiorder deep model and produce prediction-oriented explanations. The detailed contributions of our work are summarized as follows:
- Novelty: Different from existing methods that explain features directly, we design a novel memory representation to make features consistent for the explanation.
The remaining contents of the article are organized as follows. In Section 2, we discuss research related to the proposed approach. The framework and details of our strategy are introduced in Section 3 and instantiated in Section 4. In Section 5, we conduct extensive experiments in the fields of education and marketing to verify the interpretability of MFTE, and then discuss and summarize the experimental results. Finally, we conclude our work and provide a future perspective in Section 6.

Related work
In this section, we first introduce feature attribution approaches for solving black box problems. Then, we study tree-based explanation methods in ML, and finally, we discuss existing explainable approaches based on both trees and deep models.

Feature attribution for black box problems
In ML, the black box problem does not have a standardized definition. However, scholars generally believe that the black box problem is caused by a model's lack of interpretability [3,4]. Molnar [5] argues that interpretability is the degree to which a human can understand the cause of a decision or a prediction. Lipton and Guidotti et al. [3,6] proposed achieving interpretability by using different explanation methods, such as model explanation, outcome explanation, and model inspection. The feature-based explanation [7] can be considered a model-oriented explanation or an outcome-/prediction-oriented explanation, and it is formally termed feature attribution because it directly captures the importance of the features [38,39]. In addition to eliminating redundant features in the data preprocessing stage, feature selection methods analyze the impact of features on the results and mine behavior information. For example, Kim et al. [40] proposed a model-based feature selection method to explain the importance of features in malware classification. The typical feature attribution method is the SHAP approach [8], which implements a feature-based explanation through the additive nature of feature importance (FI). SHAP can calculate the local contribution of features. Moreover, its combination with trees can reduce computational complexity [41]. A tree-based SHAP can further calculate the importance of feature interactions [11], because the interpretable nature of the tree structure allows decision trees to automatically mine the relationships between features [9,10]. Therefore, maximizing the interpretable advantages of decision trees and allocating values for features is a valuable research problem in the field of interpretable ML.
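The additive nature of Shapley-style attribution mentioned above can be illustrated with a small sketch. It computes exact Shapley values by enumerating feature subsets for a toy value function; the toy model, the input `x`, and the convention that absent features take the value 0 are assumptions made here for illustration, and this is not the tree-based SHAP algorithm of [10,41].

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values of a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical input and toy model: two additive terms plus one pairwise interaction.
x = [1.0, 2.0, 3.0]

def toy_value(present):
    get = lambda j: x[j] if j in present else 0.0  # absent features are set to 0
    return 2.0 * get(0) + 1.0 * get(1) + 0.5 * get(1) * get(2)

phi = shapley_values(toy_value, 3)
total = toy_value(frozenset({0, 1, 2})) - toy_value(frozenset())
```

The attributions sum to the difference between the full prediction and the empty baseline, and the interaction term between features 1 and 2 is split equally between them.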

Tree-based explanation
Among various decision trees, GBDTs [12] have been widely used because of their automatic feature selection ability and excellent prediction performance [13,42,43]. The gradient boosting process of GBDTs ensures consistency in the feature selection and prediction result explanations [16,17]. For instance, Stojić et al. [14] applied extreme gradient boosting (XGBoost) [13] to predict the distribution and migration of chemical substances in the environment and generate SHAP values to explain important features; Fernández [15] adopted random forests to monitor bank stability in the United States and make multiway interpretations of important variables. Consequently, GBDTs have inherent advantages in selecting necessary features for prediction. Beyond feature importance, Shih et al. [18] proposed the concept of the prime implicant (PI), i.e., explanations based on the tree structure. They argue that there exist minimal feature subsets related to the prediction results, and that these subsets are sufficient for prediction and interpretation. Furthermore, ref. [19] standardizes the definition of necessary feature subsets. Izz et al. [44] propose a way to calculate PI explanations in decision tree learning. In addition to explaining with features, GBDTs can also explain the original model by simulating the behavior of that model [3]. In particular, the path in the tree is an important explanatory tool to visually show the logical relationship between features [18]. In this work, we make full use of the interpretable advantages of GBDTs to explain deep models. The following subsection introduces the research progress on trees and deep models.

Explanation based on trees and deep models
The main difference between deep models and traditional ML models is the representation of features. Traditional models generally adopt feature values to represent features, while deep models often use feature embedding vectors to train the model [29,32]. However, the structure of a deep model [20] is usually more complicated than that of a traditional feature interaction model [22]. For example, the nonlinear feature relations in a multilayer perceptron or in an attention mechanism [21] are difficult to capture by GBDTs. Therefore, to avoid direct tracking and explanation of feature interactions, Zilke et al. [25] utilized the advantages of the tree structure to extract rules from DNNs. Other researchers argue that the tree structure has limitations for understanding a deep model, so they try to achieve interpretability by replacing the last layer of the neural network with a decision tree [26]. However, these methods change the original model structure to varying degrees and cannot allow the deep model to take advantage of the tree-selected features. Existing studies have demonstrated that decision trees can initialize neural networks and improve performance [28], which encourages researchers to use tree-selected features as input for a deep model. For example, TEM [27] is a loosely coupled model that uses GBDTs to choose important cross-features as the input of a neural network and explains feature interactions via attention weights [45]. TEM ensures the interpretability and completeness of the feature attribution process, meaning that it achieves feature tracking from input to prediction. However, it still cannot solve the problem of inconsistent feature representation, and the cross-features are fixed; thus, it cannot continue to learn feature interactions in the following deep model. In recent years, some progress has been made in capturing feature interactions of deep models. Researchers have shown that Taylor expansions are applicable to most deep models [36,37], which allows the multiple orders of feature interactions to be represented separately. In this work, we mainly study how to improve external feature representations such as memory [33,34] to capture the semantics of feature interactions and then cooperate with GBDTs to explain the prediction results of a deep model in a different way.

The MFTE strategy
This section mainly introduces the framework of the MFTE strategy in Section 3.1 and the detailed design of the model in Section 3.2.

The framework of MFTE
In Figure 1, we illustrate the entire training process of MFTE, including feature selection and consistent representation, collaborative training between a deep model (on the right) and the relevant explainable constraints (on the left), and the generation of explanations.
First, one of the important functions of GBDTs is feature selection, which is the basis for GBDTs' interpretability. On the other hand, deep models generally take features as input. Therefore, MFTE takes advantage of GBDTs by adopting the features that the trees select to benefit prediction as the input of a deep model. To ensure a distinct input, MFTE employs multiple trees to select important features from all features of the original data. These diverse features are responsible for both prediction and interpretation; they run through the entire process of model training, prediction, and result explanation. Consequently, we design an independent feature representation method based on the memory mechanism [33,34] to achieve the full tracking of various features. In particular, we allocate an independent feature memory $M_G$ for diverse features to support explainable storage, encapsulation, and representation.
The left side of Figure 1 shows the explainable feature and consistent representation mechanism. We borrow a feature-lookup operation to link the features selected from the trees with the memory and achieve the independent representation of feature memory $M_G(x)$. However, purely independent representation cannot keep track of feature changes, especially when the deep model has multiorder feature interactions.
To explore the explainable advantages of $M_G(x)$, we design an encapsulating process $\phi$ to reorganize $M_G(x)$ into a new explanation representation $\phi(M_G(x))$. The purpose of encapsulation is to track and interpret the deep model's multiorder feature interactions, because the encapsulating process can functionalize the memory features and uniformly represent the contribution of each order of the deep model. This kind of order-separated interpretation design requires the deep model to be linearly separable, such as in NFM [20] and attentional factorization machine (AFM) [21]. Specifically, assuming that the highest order of the input features is $n$ (for example, a product of $n$ features $x$), a deep model $\mathcal{F}$ comprises subfunctions $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_n$ ranging from the first-order features to the $n$th-order feature interactions [35], where $\mathcal{F}_1$ only contains first-order features, $\mathcal{F}_2$ only contains second-order feature interactions, and so on. In this work, we consider a type of model that can be expressed or approximately expressed as a linear combination from $\mathcal{F}_1$ to $\mathcal{F}_n$, namely $\mathcal{F} = \mathcal{F}_1 + \mathcal{F}_2 + \cdots + \mathcal{F}_n$. Intuitively, $n$-order feature interactions have $n$ different contributions to the prediction results. Therefore, we propose an explainable constraint method to quantify and explain the contribution of each order separately according to a prediction result. In particular, the constraint method combines $n$ explanation representations to generate an explainable constraint set $\{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_n\}$. Each element in the constraint set corresponds to and constrains a suborder function of model $\mathcal{F}$. Motivated by the extension of Taylor's theorem in neural networks, we make a loosely coupled relationship between $\mathcal{C}_i$ and $\mathcal{F}_i$. Thus, the deep model can be expressed as an $n$-order linear combination under $n$ explainable constraints. Unlike the coefficients of Taylor expansions, our constraint coefficients are not derived from the original model; they are extended consistent feature representations that improve the interpretability of that model. Therefore, through collaborative training between the explainable constraints and the deep model, $n$ explainable constraints can separately express the contributions of $n$ suborder functions without excessively interfering with $\mathcal{F}$'s prediction performance. The generation of the explainable constraint set and the details of loose coupling are introduced in Section 3.2.
In the explanation stage, MFTE allows different constraints $\mathcal{C}_i$ to explain the contributions of feature modeling belonging to different $\mathcal{F}_i$ in the predicted results. For example, $\mathcal{C}_1$ represents the contribution of a single feature to the result; $\mathcal{C}_2$ represents the contribution of pairwise feature interactions to the result; and $\mathcal{C}_n$ can explain the contribution of high-order nonlinear feature interactions to the result. We store the $n$-order explainable constraints in an explanation pool to allow MFTE to present different orders of explainable representation according to actual needs. For example, we can select only the first-order features, only the second-order feature interactions, or both for explanation. In particular, according to different selected features, we can retrieve the original GBDTs and visualize tree paths containing these features, thereby implementing multiorder and diversified explanations. Thus, this kind of feature explanation based on tree paths naturally achieves feature consistency between predictions and explanations.
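To make the notion of linear separability concrete, the following sketch splits a plain second-order factorization machine into a bias, a first-order part, and a second-order part that add up to the prediction. It is only an illustration under stated assumptions: the weights, embeddings, and input are hypothetical, a plain FM stands in for the deep models discussed here, and the dictionary `pool` is a simplified stand-in for the explanation pool.

```python
def fm_parts(x, w0, w, v):
    """Return (bias, first_order, second_order) of a second-order FM."""
    # First-order part: weighted sum of single features.
    first = sum(wi * xi for wi, xi in zip(w, x))
    # Second-order part: pairwise interactions weighted by embedding dot products.
    second = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            second += dot * x[i] * x[j]
    return w0, first, second

# Hypothetical parameters and input.
x = [1.0, 0.0, 2.0]
w0 = 0.1
w = [0.5, -0.3, 0.2]
v = [[1.0, 0.5], [0.5, 0.5], [0.5, 2.0]]

bias, f1, f2 = fm_parts(x, w0, w, v)
prediction = bias + f1 + f2

# A simplified explanation pool keyed by order; any subset of orders can be selected.
pool = {1: f1, 2: f2}
only_second_order = {order: pool[order] for order in (2,)}
```

Because the parts are summed, each order can be inspected on its own without re-running the model, which is the property the order-separated explanation relies on.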

The design of MFTE
To generate diverse features, we use a GBDT ensemble $G$ of $T$ trees trained by ensemble learning. Consider a single tree in the ensemble. Features are selected according to equation (1), where $X_r \subset X$ is a set recording the feature nodes that ultimately fall on the $r$th leaf node $v_r$. The diverse features are generated by the features selected by the $T$ trees; thus, we have $T$ feature sets. The features in each feature set are used as input to the deep model on one side and used to find explainable feature vectors $M_G(x)$ from the feature memory $M_G$ on the other side. In particular, supposing that the feature memory $M_G$ stores the original vectors of all features, we select the explainable feature vectors according to the diverse features via the feature-lookup process of equation (2), where $\sqcap$ indicates the selection of the corresponding vector according to the feature identification number.
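The selection and lookup steps can be sketched as follows. This is a minimal illustration, assuming that the features a tree selects for a sample are those tested on the root-to-leaf path the sample follows, and that the feature memory is a dictionary from feature name to vector. The tree structures, thresholds, memory vectors, and sample are all hypothetical; only the feature names come from the education dataset.

```python
def path_features(tree, sample):
    """Return the features tested on the root-to-leaf path that `sample` follows."""
    used = []
    node = tree
    while "leaf" not in node:
        feature = node["feature"]
        if feature not in used:
            used.append(feature)
        # Go left when the feature value is at or below the split threshold.
        branch = "left" if sample[feature] <= node["threshold"] else "right"
        node = node[branch]
    return used

def lookup(memory, features):
    """Select the memory vectors of the given features by feature name."""
    return {name: memory[name] for name in features}

# Two hypothetical depth-2 trees.
tree_a = {"feature": "active_days", "threshold": 15,
          "left": {"leaf": 0},
          "right": {"feature": "num_chapters", "threshold": 5,
                    "left": {"leaf": 0}, "right": {"leaf": 1}}}
tree_b = {"feature": "explored", "threshold": 0,
          "left": {"leaf": 0},
          "right": {"feature": "active_days", "threshold": 30,
                    "left": {"leaf": 0}, "right": {"leaf": 1}}}

# Hypothetical feature memory: one small vector per feature.
memory = {"active_days": [0.2, 0.1], "num_chapters": [0.4, -0.3],
          "explored": [0.9, 0.5], "days": [0.0, 0.7]}

# Hypothetical sample.
sample = {"active_days": 50, "num_chapters": 10, "explored": 1, "days": 71}

selected = [path_features(tree, sample) for tree in (tree_a, tree_b)]
explainable = [lookup(memory, features) for features in selected]
```

Each tree yields its own feature set, so the same feature (here `active_days`) can appear in several sets while always resolving to the same memory vector.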
To obtain the feature-related explanation representation, we apply $\phi$ to package the explainable features. The encapsulating process is defined in equation (3). Here, the encapsulation function $\phi$ can differ according to the order of the subfunction. For the first-order subfunction, $\phi$ can be a self-defined function. For subfunctions of order two or above, we need to consider the combination operation when performing encapsulation to obtain $n$ explainable constraints. Since the explainable constraint of each order is related to $T$ trees, we finally accumulate the $T$ encapsulated features from the $T$ trees to obtain a single constraint. In particular, let $G(x_t)$ represent the features selected by the $t$th tree. Thus, $\phi(M_G(x_t))$ indicates the encapsulated features from the $t$th tree according to equation (3).

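For the $n$th-order constraint, the encapsulated features of $n$ arbitrary explainable features are combined, where:
- $\odot$: the combination operation that can be instantiated into specific operations.
The combination operation $\odot$ is instantiated as element-wise multiplication in our experiments. Then, the $n$ constraints are loosely coupled with the corresponding $n$-order subfunctions via a constraint process. Without loss of generality, we define an $n$-order linearly separable deep model containing a feature-independent variable $\mathcal{F}_0$ as follows:
$$\mathcal{F} = \mathcal{F}_0 + \mathcal{F}_1 + \mathcal{F}_2 + \cdots + \mathcal{F}_n.$$
The constraint process is defined as follows:
$$\tilde{\mathcal{F}} = \mathcal{F}_0 + \mathcal{C}_1 \otimes \mathcal{F}_1 + \mathcal{C}_2 \otimes \mathcal{F}_2 + \cdots + \mathcal{C}_n \otimes \mathcal{F}_n,$$
where:
- $\mathcal{F}_1, \ldots, \mathcal{F}_n$: subfunctions representing suborder feature interactions
- $\otimes$: the constraint operation that can be instantiated into specific functions.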
It can be observed that the final prediction result $\tilde{\mathcal{F}}$ is a linear combination of $n$ subfunctions after being constrained. In a simple case, the constraint operation $\otimes$ can be multiplication to achieve collaborative training. Specifically, the constraint term and the subfunction become the gradient of each other during the gradient optimization process. In this case, the $n$th-order constraint $\mathcal{C}_n$ is naturally optimized as an explainable representation of the contribution of the $n$th-order function $\mathcal{F}_n$. The constraint operation is superior to the cross-feature mechanism of TEM [27], because the cross-feature mechanism fuses multiorder feature information before the deep model training, so that the information of each order cannot be tracked separately. In addition to being consistent with each order of the deep model, MFTE does not need to modify the original loss function during training. For example, in the two domain applications of this work, we employ the binary cross-entropy of the original deep model as the MFTE loss function $L$ to solve the binary classification problem as follows:
$$L(F, \tilde{F}) = -\frac{1}{N}\sum_{h=1}^{N}\left[F_h \log \tilde{F}_h + (1 - F_h)\log\left(1 - \tilde{F}_h\right)\right].$$
Here, $N$ denotes the number of samples in a training batch, $F_h$ indicates a binary ground-truth value that can be 1 or 0, and $\tilde{F}_h$ is the predicted value corresponding to $F_h$. By minimizing $L(F, \tilde{F})$, we can complete the training of the entire model.
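The constraint process with multiplication as the constraint operation, together with the binary cross-entropy loss, can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: constraints and subfunction outputs are treated as scalars, a sigmoid is assumed to map the constrained sum to a probability, the loss is averaged over the batch, and all numeric values are hypothetical.

```python
import math

def constrained_prediction(f0, subfunctions, constraints):
    """Combine subfunction outputs under multiplicative constraints.

    Computes f0 + sum_i c_i * f_i and squashes it with a sigmoid
    (the sigmoid is an assumption made for this sketch).
    """
    logit = f0 + sum(c * f for c, f in zip(constraints, subfunctions))
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy(targets, preds, eps=1e-12):
    """Mean binary cross-entropy over a batch of (target, prediction) pairs."""
    total = 0.0
    for y, p in zip(targets, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(targets)

# Hypothetical first- and second-order outputs and their constraints.
pred = constrained_prediction(0.0, [1.0, 2.0], [0.5, -0.25])
loss = binary_cross_entropy([1, 0], [pred, pred])
```

With a product $c_i f_i$, the partial derivative with respect to $c_i$ is $f_i$ and vice versa, which is the sense in which the constraint term and the subfunction act as each other's gradient.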

The instantiation of MFTE
Considering that MFTE can be applied to an $n$-order linearly separable deep model, we adopt NFM [20] to instantiate our strategy. The difference between the instantiated MFTE and NFM is that the features of each order in NFM are not constrained by the feature-tracking strategy available in MFTE. Because NFM is a general deep version of factorization machines (FMs) [22,35] and is directly linearly separable, it is a multiorder feature interaction model that contains first- and second-order subfunctions. Consequently, we employ the first two constraints $\mathcal{C}_1$ and $\mathcal{C}_2$ to constrain the first- and second-order parts of NFM, respectively. Let $G(x)$ represent the union feature set selected by $T$ trees. By using the tree-selected features $G(x_i)$ in equation (1) as input, the combined NFM model with explainable constraints is built from the following components:
- $w_0$: model bias that is feature-independent
- $w_i$: model parameters of the first-order feature modeling
- $f_{deep}$: neural network with $L$ deep layers for the second-order feature modeling
- $v$: model embedding vectors corresponding to features.
We specifically define the constraint operation $\otimes$ as feature-corresponding multiplication. For example, we multiply $w_i G(x_i)$ by $\mathcal{C}_1^{i}$ and leverage $\mathcal{C}_2^{ij}$ to multiply and constrain the second-order interaction of features $i$ and $j$; the formal definitions of $\mathcal{C}_1^{i}$ and $\mathcal{C}_2^{ij}$ are provided in the following discussion. First, based on the explainable feature memory $M_G(x)$, we specify the encapsulation function $\phi$, whose explanation representation is employed for encapsulating the explainable feature memory. In the first-order constraint $\mathcal{C}_1^{i}$, $p$ is a weight matrix that can change the shape of $\mathcal{C}_1^{i}$ to adapt to the combination of subfunctions. Furthermore, given the $j$th feature memory $M_G(x_j)$, we specify the second-order constraint $\mathcal{C}_2^{ij}$ based on the feature memory interaction. The combination operation $\odot$ denotes element-wise multiplication to represent the feature memory interaction, and $q$ indicates a weight matrix that can transform the shape of $\mathcal{C}_2^{ij}$ to adapt to the combination of subfunctions. After $\mathcal{C}_1^{i}$ and $\mathcal{C}_2^{ij}$ are trained, they can represent the contributions of the first-order features and the second-order interactions to the prediction results, thus achieving diversification of explanations. Furthermore, we demonstrate the detailed advantages of this diversified and multiorder explanation in the experiments section.
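A rough sketch of how first- and second-order constraints might be computed from feature memory vectors is given below. The exact definitions of $\phi$, $p$, and $q$ are not reproduced here, so this sketch assumes that $\phi$ is the identity and that `p` and `q` are weight vectors projecting a memory vector (or an element-wise product of two memory vectors) to a scalar. The function names and all values are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def first_order_constraint(m_i, p):
    """Project the memory vector of feature i to a scalar constraint (phi assumed identity)."""
    return dot(p, m_i)

def second_order_constraint(m_i, m_j, q):
    """Combine two memory vectors by element-wise multiplication, then project with q."""
    interaction = [a * b for a, b in zip(m_i, m_j)]
    return dot(q, interaction)

# Hypothetical memory vectors and projection weights.
m_i = [1.0, 2.0]
m_j = [3.0, -1.0]
p = [0.5, 0.5]
q = [1.0, 1.0]

c1_i = first_order_constraint(m_i, p)
c2_ij = second_order_constraint(m_i, m_j, q)
```

Under these assumptions the second-order constraint is symmetric in the two features, which matches the use of element-wise multiplication as the combination operation.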

Experiments
To verify the interpretability of our strategy, we selected two datasets in different application fields (education and marketing) for experiments. We conduct various experimental analyses on these two datasets to answer three research questions (RQ1 to RQ3). To answer these questions, we first describe the two datasets in Section 5.1. Then, Section 5.2 addresses the performance maintenance problem in RQ1. Next, Section 5.3 answers both RQ2 and RQ3 through the experimental results and detailed analysis. Finally, we discuss the entire experiment in Section 5.4.

Data description
The education dataset¹ [46,47] collects students' online learning behaviors in five online courses launched by Harvard University on the edX platform, covering the period from the autumn of 2012 to the summer of 2013. The features in this dataset can be roughly divided into two categories. One is related to the students themselves, including birth_year, gender, education, etc. The other represents the interactive behaviors between students and courses, including total_events (total events in the server log file, including the number of clicks), active_days (the number of days a student participated in course activities), num_chapters (the number of chapters a student learned), days (the number of days between a student's registration and the completion of a course), etc. We employ the dataset to predict and explain whether a student can obtain course certification (ground truth = 1 means the student can obtain a course certificate; ground truth = 0 means the student cannot).
The marketing dataset² [48], collected by the marketing team of the Bank of Portugal (2008-2015), stores information about the telemarketing business to attract clients to subscribe to term deposits. This dataset is used to predict whether a client will subscribe to a term deposit, where the ground truth takes two values: "Yes" (ground truth = 1) and "No" (ground truth = 0). There are four categories of features. The first category is client information, including age and mortgage. The second category is social and economic factors, including emp.var.rate (employment variation rate, quarterly indicator), cons.price.idx (consumer price index, monthly indicator), euribor3m (euribor 3 month rate, daily indicator), nr.employed (number of employees, quarterly indicator), etc. The third category is related to the last contact of the current campaign, such as month (last contact month of year) and pdays (number of days that passed after the client was last contacted in a previous campaign). The final category contains all other features, such as campaign.

Performance evaluations (RQ1)
We have instantiated MFTE with NFM (called $\mathrm{MFTE}_N$) in Section 4. In addition to NFM, other members of the FM family, such as the embedding version of FM [30] and AFM, are also linearly separable. Therefore, in the experiment, we also instantiate MFTE with FM (named $\mathrm{MFTE}_F$) and with AFM (called $\mathrm{MFTE}_A$), which together with $\mathrm{MFTE}_N$ form our models. We compare the three instantiated MFTE approaches with FM, NFM, and AFM to evaluate whether the MFTE strategy can maintain the same performance as the original models. In addition, XGBoost, an advanced method representing GBDTs, is also employed for performance comparison, because we can combine XGBoost and SHAP to implement feature-attribution explanations in subsequent experiments. Finally, we introduce TEM as another comparison method, because it is an interpretable method in which a deep model accepts tree-selected features as input and shows good performance and explanation effects. The comparison methods are therefore FM, NFM, AFM, XGBoost, and TEM.
On the education dataset, we uniformly set the regularization terms of the XGBoost and TEM models to 0.01, and the regularization values of the FM family models and the corresponding MFTE instantiated models to 0.001. The learning rates are set to 0.00001 for FM/$\mathrm{MFTE}_F$, 0.000001 for NFM/$\mathrm{MFTE}_N$, 0.0001 for AFM/$\mathrm{MFTE}_A$, 0.0001 for XGBoost, and 0.00001 for TEM. In our models, to ensure the diversity of features, we set the number of GBDTs to 30 and the height of all trees to 4. Moreover, XGBoost and the tree model part of TEM maintain the same settings as our models. On the marketing dataset, the learning rates and regularization values are the same as those for the education dataset, but the number of GBDTs and the height of the trees are set to 38 and 5, respectively. For performance evaluation, we adopt the widely accepted area under the receiver operating characteristic curve (AUC) and F1-measure as metrics. The performance comparison results of all models are shown in Tables 1 and 2.
It can be observed that the two explainable models $\mathrm{MFTE}_F$ and $\mathrm{MFTE}_A$ achieve the best performance on F1 and AUC, respectively, on the education dataset. Except that the AUC of $\mathrm{MFTE}_F$ is slightly lower than that of FM, both $\mathrm{MFTE}_N$ and $\mathrm{MFTE}_A$ improve on their original models. On the marketing dataset, $\mathrm{MFTE}_A$ and $\mathrm{MFTE}_N$ outperform the other models on F1 and AUC. The comparison results show that although MFTE is designed for explanation, it can maintain the performance of the original deep models, which provides a positive answer to RQ1. In subsequent experiments, we focus on the explanation comparison of MFTE with other methods.
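For reference, the two metrics can be computed as in the sketch below. AUC is computed from its rank interpretation (the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half), and F1 is the harmonic mean of precision and recall. The labels, scores, and the 0.5 decision threshold are hypothetical and unrelated to the reported results.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def f1_score(labels, preds):
    """F1-measure for binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and model scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc = auc_score(labels, scores)
f1 = f1_score(labels, preds)
```

AUC depends only on the ranking of the scores, whereas F1 depends on the chosen threshold, which is why the two metrics can favor different models.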

Explanation comparison (RQ2 and RQ3)
Note for Tables 1 and 2: the bold values indicate the best performance on AUC or F1-measure.
The explainable effect is mainly reflected in whether the model properly shows the contribution of features. General tree models (such as XGBoost) directly employ feature importance (FI) [13,38] to explain the contribution of each feature. SHAP provides both first- and second-order explanation tools for trees to calculate diverse contribution values. TEM is a typical method combining trees and a deep network. It uses the cross-features selected by the trees as the input of the deep model and leverages the attention of the cross-features to explain the results. In contrast, our model achieves diverse explanation effects through multiorder feature tracking. Specifically, the comparable explanation methods in this section are summarized as follows:
- FI: a traditional explainable approach based on XGBoost [13,38].
- SHAP: a representative FI explanation method, including first- and second-order explanations. We employ the tree-version SHAP [10].
- TEM: an explainable method based on trees that only provides cross-feature explanations extracted from tree components [27].
- MFTE: our approach that supports multiorder explanations. We employ the $\mathrm{MFTE}_N$ version to provide first- and second-order explanations.
To better compare the explanation effects, we analyze the explanation results of the two fields separately in Sections 5.3.1 and 5.3.2. In each section, we separate the first- and second-order explanations and compare them to better present the experimental results.

Statistical analysis
First, we provide a simple statistical feature correlation analysis. The result shows that whether a student is certified in the course has the greatest correlation with the explored feature, with a correlation value of 0.5. In addition, when ground truth = 0, more than half of the active_days values are concentrated in the range [0, 15]. When ground truth = 1, the active_days values of over 60% of the students lie in [15, 50].

The first-order explanations
Statistical analysis mainly describes the distribution of feature values, but it cannot reveal the exact feature contributions to the prediction. Tree-based models can calculate the FI of the whole model and regard it as the model explanation. We apply FI-based XGBoost to the education dataset and obtain the FI ranking from largest to smallest: total_events, num_chapters, days, active_days. The higher the importance of the feature, the greater its contribution to the prediction result. However, like statistical analysis, FI reflects the global contribution of all the features, so there is only one FI ranking and it is fixed. Compared with FI, SHAP can flexibly calculate the contribution value of each feature to the prediction result of a single sample. To evaluate SHAP's explanations for individuals, we randomly selected two students, one who obtained a certificate and one who did not, and listed the relevant feature values and ground truths in Table 3. Correspondingly, Figures 2 and 3 show the SHAP values of the two samples.
The red values indicate features that play positive roles in predicting that the student can obtain the certificate, whereas the blue ones play negative roles. In particular, the longer the color bar, the greater the absolute value of the feature's contribution. Therefore, the feature with the largest contribution in Figure 2 is num_chapters. By searching the relevant feature values in Table 3, we observe that num_chapters = 1, indicating that the student with id = 1571 has only read one chapter. The salient feature of Figure 3 is explored (explored = 1), indicating that the student (id = 2264) has explored the whole course. Although the predictions for the two students are consistent with the ground truths, the tree-based SHAP value depends on the original feature value: when the original feature value is relatively small, the feature contribution value is affected. For MFTE, the first-order explanation values corresponding to the two samples are shown in Table 4. It can be observed that the first-order explanation of each feature has multiple values corresponding to multiple trees, meaning that MFTE achieves diversified first-order explanations. The order of average feature contributions for the student with id = 1571 is explored > active_days > days, and that for the student with id = 2264 is active_days > num_chapters > total_events.
According to the feature values in Table 3, the values of active_days and num_chapters of the student with id = 2264 are 50 and 10, respectively (the total number of days is 71 and the total number of chapters is 11). Thus, the first-order explanation of MFTE makes sense, and MFTE further provides more diversified explanations. Figure 4 shows the second-order SHAP explanation heatmap corresponding to the two samples, where we only select the features with large SHAP values and visualize their second-order contributions. The second-order feature explanations mainly indicate the contributions of the feature interactions to the predictions. Consequently, we do not consider the interaction value between a feature and itself, that is, the contribution value on the diagonal. In this case, for the student with id = 1571, the course_id-total_events feature interaction has the largest negative contribution (−0.00064) to the prediction. In contrast, for the student with id = 2264, the course_id-viewed and num_chapters-course_id feature interactions both have the largest positive contributions (0.00019) to the prediction.

The high-order explanations
Although the SHAP values are calculated from a trained XGBoost model, the second-order explanations of SHAP do not reflect the relationships between features in the trees. In contrast, the TEM model introduces "cross-features" in the trees and takes the cross-feature embeddings as the input to a deep attention model. Figure 5 shows the attention values of the features included on the cross-feature paths (e.g., v66), which TEM uses as an explanation. Table 5 lists the relevant cross-feature paths corresponding to Figure 5. For example, path v23 contains the cross-features explored, num_chapters, active_days, and days, meaning that they are on the same tree. However, these cross-feature paths in the trees are fixed, and the features cannot be tracked continuously in the subsequent deep model training, which leaves the trees disconnected from the deep model.
In Figure 6, we illustrate both the first-order explanations (represented by rectangular boxes) and the second-order explanations (represented by ellipses) of MFTE. The greater the absolute value of an explanation, the greater its contribution to the result. Thus, the five features in the figure are the important features that contribute substantially to the results. Moreover, the feature selection paths containing these features in the original four trees are randomly retrieved and displayed in Figure 6. We use straight lines with arrows to represent the optimal feature selection path from the root of a tree to its leaf nodes and use different colors to represent the different trees. The first-order explanations in the figure correspond to the tree nodes, and the second-order explanations correspond to directed edges. Positive or negative explanation values denote positive or negative contributions of the tree features to the predicted results. In this way, the total contribution of a tree can be obtained by combining the contributions of its nodes and edges. If the overall contribution is greater than 0, the tree predicts that the student will obtain a course certificate. A distinguishing characteristic of MFTE is that it can illustrate the feature selection paths in the trees, making the relationship between first- and second-order explanations clear. In addition, the multiorder explanations are less disturbed by the original feature values. Take the student with id = 1571 as an example. Figure 6(a) shows that the feature days has both a large positive and a large negative impact on the predicted results in different trees. Table 3 shows that the feature value of days is 216, which may suggest that the student has been studying for a long time. In this case, we need to further investigate the second-order explanations to analyze the impact of days.
It can be found that the interactions of three pairs of features, num_chapters-days, active_days-days, and explored-days, all contribute negatively to the results, because the original values of the three features interacting with days are very small. In addition, MFTE indicates that the largest negative second-order contribution comes from the feature interaction explored-total_events. Combining these two conditions (explored = 0 and total_events = 8) greatly increases the probability that the student will not obtain the course certificate. For the student with id = 2264 in Figure 6(b), MFTE automatically learns two second-order feature interaction pairs that contribute substantially: active_days-num_chapters and total_events-active_days. Table 3 shows that total_events (i.e., the number of clicks) of the student with id = 2264 is as high as 2707. Therefore, the multiorder explanations of MFTE complement each other and help to better understand the students' certification results.
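The way node and edge contributions combine into a per-tree contribution can be sketched as follows. This is a minimal illustration that assumes "combining" means summation along one feature selection path; the path, the numeric values, and the function name are hypothetical and are not read from Figure 6.

```python
def tree_contribution(path, first_order, second_order):
    """Sum node (first-order) and directed-edge (second-order) values along a path.

    Summation is an assumption of this sketch, not a formula stated in the text.
    """
    node_total = sum(first_order[feature] for feature in path)
    # Consecutive features on the path form the directed edges.
    edge_total = sum(second_order[(a, b)] for a, b in zip(path, path[1:]))
    return node_total + edge_total


# Hypothetical feature selection path from the root of one tree to a leaf.
path = ["explored", "total_events", "days"]
first_order = {"explored": -0.30, "total_events": -0.20, "days": 0.25}
second_order = {
    ("explored", "total_events"): -0.40,
    ("total_events", "days"): -0.10,
}

total = tree_contribution(path, first_order, second_order)
predicts_certificate = total > 0  # a positive total means the tree predicts a certificate
print(round(total, 2), predicts_certificate)
```

In this toy case the positive first-order value of days is outweighed by the negative edge values, mirroring the situation described for the student with id = 1571.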

The first-order explanations
We apply FI-based XGBoost to the marketing dataset and obtain the importance "weight" of all the features; the top-3 important features are euribor3m, month, and pdays. To compare with tree-based SHAP, we randomly select a sample of failed marketing (client_id = 2556) and a sample of successful marketing (client_id = 1666) as examples. The first-order SHAP contribution values of the two samples are shown in Figures 7 and 8, respectively. Table 6 lists the relevant and important feature values according to the SHAP results. For the sample with id = 2556, the feature nr.employed has the largest contribution value and the relevant feature value is 5195.8. In our correlation analysis results, the correlation value between nr.employed and the ground truth is −0.31, meaning that the smaller the feature value of nr.employed, the more likely the marketing is to succeed. In this case, the feature value 5195.8 is higher than the average value 5167, which indicates that the first-order SHAP explanation is consistent with the correlation analysis results. For the successful sample with id = 1666, the feature with the largest contribution value is emp.var.rate and its SHAP value is 0.002. Table 7 lists the first-order explanations of MFTE. For the sample with id = 2556, the top-3 important features are emp.var.rate, euribor3m, and nr.employed. For the sample with id = 1666, the top-3 important features are nr.employed, euribor3m, and emp.var.rate. The contribution values of the two samples can be compared with the feature values to explain the results. In addition, the feature emp.var.rate is important in both samples because it appears three times across the four trees. Therefore, the first-order explanations of MFTE can intuitively reflect the importance of features through the number of contributions and their specific values. Figure 9 shows the second-order explanations of SHAP for the two samples.
For both samples, the most important feature interaction is euribor3m-emp.var.rate. In particular, the second-order SHAP values of euribor3m-emp.var.rate for the samples with id = 2556 and id = 1666 are −0.00019 and 0.00064, respectively, that is, negative for the failed sample and positive for the successful one. The colors of the contribution values therefore show intuitively that the second-order explanations of SHAP are reasonable.
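The consistency check applied to nr.employed earlier in this subsection can be written as a small worked example. The three numbers (correlation −0.31, feature value 5195.8, average 5167) come from the text; the sign rule itself is a simple heuristic assumed for illustration, and the function name is hypothetical.

```python
def implied_direction(correlation, value, average):
    """Direction implied by a feature's correlation with the label.

    Returns +1 (towards success), -1 (towards failure), or 0 (undetermined).
    Heuristic assumed for this sketch: sign of correlation * (value - average).
    """
    product = correlation * (value - average)
    return (product > 0) - (product < 0)


# nr.employed for the failed sample (id = 2556): above-average value
# combined with a negative correlation points towards failure.
direction = implied_direction(correlation=-0.31, value=5195.8, average=5167)
print(direction)
```

A result of −1 agrees with the failed marketing outcome of this sample, which is the sense in which the first-order SHAP explanation is consistent with the correlation analysis.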

The high-order explanations
In contrast, the explanations of TEM in Figure 10 can provide more details for the predictions with the help of the tree paths in Table 8. For the sample with id = 2556, the highest attention value comes from the feature emp.var.rate on the path v135. The other two features with large contributions on the path v135 are month and nr.employed. This indicates that the three features interact on the same tree and have an important impact on the prediction. Similarly, for the sample with id = 1666, the key path is v83 because it contains four important cross-features, one of which has the highest attention value.
The symbols in Figure 11 are consistent with those in Figure 6. It can be observed that the first-order explanations of the feature campaign include both positive and negative contributions in both samples. In this case, by further observing the interaction between the features campaign and euribor3m, it can be seen that campaign contributes indirectly to the prediction. Specifically, for the sample with id = 2556 in Figure 11(a), the second-order explanation of the campaign-euribor3m interaction has a negative value. In contrast, for the sample with id = 1666 in Figure 11(b), the second-order explanation of the campaign-euribor3m interaction contributes positively to the prediction, which is consistent with the ground truth. According to the marketing dataset, campaign means "number of contacts performed during this campaign and for this client," whereas euribor3m indicates "euribor 3 month rate - daily indicator." The corresponding feature values in Table 6 show that the campaign values of both samples are 1. The difference is that the euribor3m value for the failure prediction is 4.12, while the euribor3m value for the success prediction is 1.365. Consequently, the second-order contributions from campaign-euribor3m of both samples are consistent with the original feature meanings, which means that the second-order explanations of MFTE are appropriate. The experimental results in these two fields empirically answer RQ2 and RQ3. In the next section, we will further answer the three research questions to summarize the experiments.

Experiment discussion
Our experiments evaluate the MFTE strategy in terms of performance and explanation. In the performance comparisons, most of the deep models equipped with MFTE perform better than the original models. Therefore, while MFTE provides explanations, it can also help to improve performance. In terms of explanation evaluation, MFTE can explain an individual sample, which is more advantageous than the global explanation provided by the FI-based XGBoost. Moreover, another highlight of MFTE is that the first- and second-order explanations can be displayed separately, which makes the explanations clearer. Therefore, the experiments gave a positive answer to RQ1, meaning that MFTE can provide an intuitive explanation without reducing the performance of the model. In the explanation representations, the visualized tree paths make the relationship between multiorder explanations clear, thereby maximizing the interpretability advantage of the trees. In contrast, although TEM can also record the paths of the trees through cross-features, the paths are fixed during the training of the deep model. This fixed use of cross-features does not allow a single feature vector to be trained further, nor does it take further advantage of the trees' interpretability. Moreover, MFTE is more personalized because each sample has its own feature selection path, which has advantages over the second-order explanation based on global feature interaction in SHAP and the fixed second-order explanation in TEM. Therefore, the experiments also gave a positive answer to RQ2, confirming that MFTE can maximize the interpretability advantages of the trees.
Furthermore, if a deep model has higher-order feature interactions, MFTE can also provide the corresponding explanations. In contrast, although SHAP can also produce first- and second-order explanations for samples, it has higher computational complexity for higher-order interpretations, so the scalability of MFTE is relatively better. The third highlight of MFTE is that we store multiorder explainable constraints in an explanation pool, which allows MFTE to present explainable representations of different orders according to actual needs. Therefore, MFTE can show diversified representations for the feature contributions of the same order, because we use the memory mechanism to store the features selected by different trees and keep the features consistent from selection, through training, to explanation. The above three highlights answered RQ3 and confirmed that the explanations of MFTE are diverse.
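One way to picture an explanation pool that serves explanations of a requested order is the sketch below. It is a hypothetical data structure written for illustration only: the class name, its methods, and the stored values are assumptions and do not describe the actual MFTE implementation.

```python
from collections import defaultdict


class ExplanationPool:
    """Hypothetical store of explanation values, grouped by interaction order."""

    def __init__(self):
        self._by_order = defaultdict(list)

    def add(self, tree_id, features, value):
        # The order of an explanation is the number of features involved.
        features = tuple(features)
        self._by_order[len(features)].append((tree_id, features, value))

    def explanations(self, order):
        """Return all stored (tree_id, features, value) entries of one order."""
        return list(self._by_order.get(order, []))


pool = ExplanationPool()
pool.add(0, ["days"], 0.25)    # first-order value from tree 0
pool.add(1, ["days"], -0.30)   # same feature, different tree
pool.add(0, ["explored", "total_events"], -0.40)  # second-order value

first_order_entries = pool.explanations(1)
second_order_entries = pool.explanations(2)
print(first_order_entries)
print(second_order_entries)
```

Keeping one entry per tree for the same feature is what allows several values of the same order to be shown side by side, which corresponds to the diversified representations described above.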
In general, the experiments empirically verified the applicability of MFTE in different fields, thereby providing a practical approach for a prediction-oriented explanation.

Conclusion and future work
This work investigates a challenging problem in ML applications: the black box problem. Our model mainly solves the problem of inconsistent feature representation between the tree model and the deep model, and exploits the feature-tracking strategy to track features from the beginning of the tree, through the training of the deep model, to the explanation of the final result. It intuitively reflects the complex feature interactions in the deep model. The experiments verify that our model not only performs better than previous work but also provides more diversified explanations. In addition, we show empirically that the feature-tracking strategy is applicable to linearly or approximately linearly separable deep models and is suitable for different application fields.
In future work, we will further investigate our multiorder explanation framework. In particular, we will compare linearly separable and approximately linearly separable deep models and try to express their multiorder feature interactions in a unified way. Furthermore, we plan to design a more automated multiorder explanation, so that the predictions and their explanations can be presented more intuitively.