Evaluation of Fuzzy Measures Using Dempster–Shafer Belief Structure: A Classifier Fusion Framework

This article studies the high complexity of the calculation of fuzzy measures that can be used in fuzzy integrals to combine the decisions of different learning algorithms. To this end, this article proposes an alternative low-complexity method for the calculation of fuzzy measures that have been applied to a Choquet integral for the fusion of deep learning models across different application domains for increasing the accuracy of the overall model. This article shows that the Dempster–Shafer (DS) belief structure provides partial information about the fuzzy measures associated with a variable, and this article devises a method to use this partial information for the calculation of fuzzy measures. An infinite number of fuzzy measures are associated with the DS belief structure. This article proposes a theorem to calculate the general form of a specific set of fuzzy measures associated with the DS belief structure. This specific set of fuzzy measures can be expressed as a weighted summation of the basic assignment function of the DS belief structure. The main advantage of expressing the fuzzy measures in this format is that the monotonic condition that needs to be maintained during the calculation of the fuzzy measure can be avoided, and only the basic assignment function needs to be evaluated. The calculation of the basic assignment function is formulated using a method inspired by the Monte Carlo approach used to calculate value functions in the Markov decision process.


I. INTRODUCTION
FUSION learning is the aggregation of the decisions of multiple classifiers, learning algorithms, or experts to enhance the ability of the overall learning model. The maximum advantage of fusion learning can be expected when the models selected for forming the ensemble are diverse. This is intuitive because when the classifiers make different errors, a strategic combination of the results may reduce the overall error.
Aggregation operators or functions like weighted arithmetic mean, geometric mean, median, arithmetic mean, etc. are some widely used approaches to fusion learning. However, many problems exist with these commonly used aggregation operators. Some of them are as follows: 1) outliers have a high influence on the mean calculation and 2) these aggregation functions do not consider the correlation among the decisions of the different classifiers or experts.
Keeping the above facts in mind, a more complex class of aggregation functions, called fuzzy integrals, has been proposed in the literature. Fuzzy integrals like the Choquet integral [1] and the Sugeno integral [2] take into consideration the interaction among the decisions of the different classifiers. In doing so, fuzzy integrals generally assign weights not only to the individual classifiers but also to the subsets of classifiers. These weights are known as fuzzy measures.
Before explaining how we are going to use fuzzy measures in the decision-making process, we look at how we mathematically define fuzzy measures [2].
Definition 1 (Fuzzy measure): Let $(X, \Sigma)$ be a measurable space. Then, $\mu : \Sigma \to [0, 1]$ is a fuzzy measure if it satisfies
1) $\mu(\emptyset) = 0$;
2) $\mu(X) = 1$;
3) $\mu(A) \le \mu(B)$ for all $A, B \in \Sigma$ with $A \subseteq B$.
The last condition in the above definition is monotonicity; therefore, a fuzzy measure is sometimes called a "monotone measure."
Now, we explain how we use fuzzy measures in our decision-making process. Let $X$ be the universe of discourse of classifiers and $P_j \subset X$ $(j = 1, 2, 3, \ldots)$ be the different subsets of classifiers. $\mu(P_j)$ indicates the relative importance of the different subsets of classifiers. In other words, $\mu(P_1) > \mu(P_2)$ indicates that the subset $P_1$ of classifiers conveys much more relevant information than the subset $P_2$ of classifiers. Thus, fuzzy measures, which can be intuitively thought of as weights in the Choquet integral, give more importance to subsets of classifiers that provide more relevant information than to subsets that provide less relevant information. The other constraint, namely the monotonic constraint, is also justified: whenever $P_2 \subset P_1$, $P_1$ would intuitively convey more information than $P_2$, as it contains a larger number of classifiers than $P_2$.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
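As a hedged illustration of the conditions a fuzzy measure must satisfy, the following Python sketch checks the boundary values and monotonicity for a measure stored as a dictionary over all subsets of a small set of classifiers. The helper names `powerset` and `is_fuzzy_measure`, the dictionary representation, and the toy additive measure are assumptions made for this example and are not taken from the article.

```python
from itertools import combinations


def powerset(universe):
    """Yield every subset of `universe` as a frozenset (2**n subsets in total)."""
    items = sorted(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_fuzzy_measure(mu, universe, tol=1e-9):
    """Check mu(empty) = 0, mu(X) = 1, and monotonicity for a dict frozenset -> float."""
    universe = frozenset(universe)
    subsets = list(powerset(universe))
    if any(s not in mu for s in subsets):
        return False  # the measure must be defined on every subset
    if abs(mu[frozenset()]) > tol or abs(mu[universe] - 1.0) > tol:
        return False  # boundary conditions
    # Monotonicity: by transitivity it is enough to compare each subset
    # with the subsets obtained by adding one more element.
    for s in subsets:
        for x in universe - s:
            if mu[s] > mu[s | {x}] + tol:
                return False
    return True


# Toy example (assumed values): an additive measure over three classifiers.
X = {0, 1, 2}
weights = {0: 0.2, 1: 0.3, 2: 0.5}
mu_additive = {s: sum(weights[i] for i in s) for s in powerset(X)}
print(is_fuzzy_measure(mu_additive, X))
```

The check enumerates all $2^{|X|}$ subsets, which also shows why specifying a general fuzzy measure by hand quickly becomes impractical as the number of classifiers grows.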
Even though we understand how fuzzy measures can be used in the decision-making process, the calculation of these fuzzy measures is an NP-hard problem. An additive fuzzy measure was proposed to deal with this problem. However, additive fuzzy measures are considered too simplistic an approach to handle practical problems. Another popular approximation is the λ-fuzzy measure [2]. However, the λ-fuzzy measure is only an approximation, and it becomes increasingly difficult to determine the value of λ as the number of classifiers and, hence, the number of fuzzy measures increases. In this article, we propose an approach, which is not NP-hard, to calculate a set of fuzzy measures without any form of approximation, wherein we take the help of the Dempster–Shafer (DS) theory. To explain the set of fuzzy measures that can be associated with the DS belief structure, we elaborate on the DS theory [3].
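To make the remark about determining λ concrete, the sketch below shows the standard construction of a Sugeno λ-fuzzy measure, in which λ is the nonzero root greater than −1 of $\prod_i (1 + \lambda g_i) = 1 + \lambda$ for the individual classifier densities $g_i$. This is offered only as an illustration of the approximation discussed above, not as part of the proposed method. The function names, the bisection solver, and the example densities are assumptions, and the code assumes at least two densities, each strictly between 0 and 1.

```python
import math


def solve_lambda(densities, tol=1e-12, max_iter=200):
    """Find lambda > -1 with prod(1 + lambda * g_i) = 1 + lambda by bisection."""
    if len(densities) < 2:
        raise ValueError("need at least two densities")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already sum to 1: the measure is additive

    def f(lam):
        return math.prod(1.0 + lam * g for g in densities) - (1.0 + lam)

    if total < 1.0:
        # Positive root: f < 0 just right of 0 and f > 0 for large lambda.
        hi = 1.0
        while f(hi) <= 0.0:
            hi *= 2.0
        lo = hi
        while f(lo) >= 0.0:
            lo /= 2.0
    else:
        # Root in (-1, 0): f(-1) > 0 and f < 0 just left of 0.
        lo, hi = -1.0, -0.5
        while f(hi) >= 0.0:
            hi /= 2.0

    a, b, fa = lo, hi, f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        fm = f(mid)
        if (fm < 0.0) == (fa < 0.0):
            a, fa = mid, fm  # root lies in the right half
        else:
            b = mid          # root lies in the left half
        if b - a < tol:
            break
    return 0.5 * (a + b)


def lambda_measure(subset, densities, lam):
    """Measure of a subset of classifier indices under the lambda rule."""
    if lam == 0.0:
        return sum(densities[i] for i in subset)
    return (math.prod(1.0 + lam * densities[i] for i in subset) - 1.0) / lam


densities = [0.3, 0.3, 0.3]          # assumed individual importances
lam = solve_lambda(densities)
print(lam, lambda_measure(range(3), densities, lam))
```

Every classifier added raises the degree of the polynomial that has to be solved, which is one way to read the difficulty mentioned above.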
Definition 2 (DS belief structure): Let $X$ be a universe of discourse. A DS belief structure is a set mapping $m : 2^X \to [0, 1]$ satisfying the following properties:
1) $m(\emptyset) = 0$;
2) $\sum_{B \subseteq X} m(B) = 1$.
Here, $m$ is referred to as the basic assignment function, while we refer to the subsets $B_i$ such that $m(B_i) > 0$ as focal elements. Let $B_i$ be the subsets of classifiers in the set of classifiers $X$. Now, we assign a value to the mass function of each subset of classifiers depending on how relevant that subset is in the decision-making process. In other words, a higher mass function value should be assigned to a subset of classifiers conveying more relevant information. However, it can be observed that the monotonic condition associated with the fuzzy measures is not considered in the assignment of the mass functions. Another important point is that when $B_i \cap P_j \neq \emptyset$, the relevant information conveyed by $P_j$ can be associated in some way with the relevant information conveyed by the subset $B_i$. Thus, the $m(B_i)$s would contribute in some way to the calculation of $\mu(P_j)$, provided $B_i \cap P_j \neq \emptyset$. In other words, the DS belief structure can provide partial information about the fuzzy measures associated with the subsets of classifiers.
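A minimal sketch of how a basic assignment function could be represented and validated is given below, assuming the usual DS requirements that the empty set carries no mass and that the masses sum to 1. The dictionary representation, the function names, and the example masses are assumptions made for illustration only.

```python
def is_belief_structure(m, universe, tol=1e-9):
    """Check that m (dict frozenset -> mass) is a valid basic assignment on `universe`."""
    universe = frozenset(universe)
    if any(not frozenset(b) <= universe for b in m):
        return False  # every key must be a subset of the universe
    if any(v < 0.0 or v > 1.0 for v in m.values()):
        return False  # masses lie in [0, 1]
    if m.get(frozenset(), 0.0) > tol:
        return False  # no mass on the empty set
    return abs(sum(m.values()) - 1.0) <= tol


def focal_elements(m):
    """Return the subsets that carry strictly positive mass."""
    return [b for b, v in m.items() if v > 0.0]


# Assumed example with three classifiers "a", "b", "c".
m_example = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
    frozenset({"b", "c"}): 0.2,
}
print(is_belief_structure(m_example, {"a", "b", "c"}), focal_elements(m_example))
```

Note that nothing in this check involves monotonicity, in line with the observation above that the mass functions are not subject to that constraint.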
In this article, we propose a general form of a set of fuzzy measures that can be associated with both the fuzzy measure theory and the DS belief structure. We have shown that the fuzzy measures can be expressed using the formula $\mu(P) = \sum_{i=1}^{p} w_P^{B_i}\, m(B_i)$. However, the weights $w_P^{B_i}$ have to satisfy some specific properties for the $\mu(P_j)$s to be fuzzy measures. We have established the properties in the theorem proposed in this article. Furthermore, we have logically chosen a set of weights for evaluating our method on different datasets belonging to various domains. This set of weights satisfies all the properties mentioned in our theorem.
After we logically set the weights, we calculate the basic assignment functions. In the later sections, we have shown how this calculation is performed by adapting the Monte Carlo (MC) method used for solving the Markov decision process (MDP) in a specific way. In other words, the MC method for solving the MDP has inspired us in developing this novel scoring system.
We demonstrate the efficacy of the evaluated fuzzy measures by applying them to fuzzy ensembles, where the Choquet integral is used to combine the decisions of various deep learning models. We use datasets from three diverse domains, namely breast cancer histology classification, chest X-ray (CXR) image classification, and human action recognition.
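For readers unfamiliar with the Choquet integral, the following Python sketch shows the standard discrete form applied to classifier confidence scores. It assumes that the fuzzy measure is available as a dictionary over all subsets of classifiers, that fusion is carried out separately for each class, and that the class with the highest fused score is predicted. The function names and the toy additive measure are hypothetical and are not the measures proposed in this article.

```python
from itertools import combinations


def choquet_integral(scores, mu):
    """Discrete Choquet integral of `scores` (dict classifier -> value in [0, 1])
    with respect to `mu` (dict frozenset of classifiers -> measure)."""
    order = sorted(scores, key=scores.get, reverse=True)  # highest score first
    total, previous = 0.0, 0.0
    chosen = set()
    for c in order:
        chosen.add(c)
        current = mu[frozenset(chosen)]
        total += scores[c] * (current - previous)
        previous = current
    return total


def fuse(confidences, mu):
    """Fuse per-class scores (dict classifier -> list of class scores).
    Returns the fused score per class and the index of the largest one."""
    n_classes = len(next(iter(confidences.values())))
    fused = [
        choquet_integral({c: s[k] for c, s in confidences.items()}, mu)
        for k in range(n_classes)
    ]
    return fused, max(range(n_classes), key=fused.__getitem__)


# Toy additive measure (assumed values); with it the integral reduces to a weighted mean.
w = {"a": 0.2, "b": 0.3, "c": 0.5}
mu_toy = {
    frozenset(s): sum(w[c] for c in s)
    for r in range(len(w) + 1)
    for s in combinations(w, r)
}
fused_scores, predicted = fuse(
    {"a": [0.9, 0.1], "b": [0.6, 0.4], "c": [0.3, 0.7]}, mu_toy
)
print(fused_scores, predicted)
```

With a non-additive measure, the same code rewards or penalizes particular coalitions of classifiers, which is the interaction effect that simple averaging cannot express.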

A. Motivation and Contributions
As mentioned above, the calculation of the fuzzy measures is an NP-hard problem. Hence, in this article, we aim to find a set of fuzzy measures that are easier to calculate and properly incorporate the uncertainty involved with the variable. The contributions of our work are as follows.
1) We propose a novel generalized set of fuzzy measures, which is associated with both the DS belief structure and the fuzzy measure theory. We prove a theorem that provides us with the conditions for these sets of fuzzy measures to exist. Of these fuzzy measures, we select one logical set of fuzzy measures that can easily be calculated and properly used for fusion learning.
2) For calculating the basic assignment functions of the DS theory, we have developed a novel scoring system inspired by the MC method for evaluating value functions in MDPs.
The rest of this article is organized as follows. Section II provides a literature survey of the state-of-the-art fuzzy ensemble techniques and advancements in the DS theory. In Section III, we describe the proposed classifier fusion framework. In Section IV, we report the dataset description, experimental results, comparison with the state-of-the-art methods, and error analysis. Finally, Section V concludes this article and provides some future directions.

A. Fuzzy Ensemble Techniques
The Choquet integral of fuzzy-number-valued functions based on σ − λ rules was formulated by Li et al. [4] followed by the proposal of genetic-algorithm-based optimization for computing fuzzy measures on fuzzy-number-valued data. Beliakov and Wu [5] introduced k-interactivity, which reduces the number of variables and constraints and reduces the complexity of learning fuzzy measures. The fuzzy measures were learned using a linear programming problem. The devised method facilitated the efficient learning of the fuzzy measures, and it significantly outperformed the fuzzy measures that are calculated using k-additive and k-maxitive fuzzy measures. For learning monotone models, Tehrani et al. [6] demonstrated the use of the Choquet integral as an aggregator operator in machine learning problems. Murillo et al. [7] found that hierarchical least mean square (HLMS), a gradient-based algorithm for the identification of fuzzy measures, had several convergence issues. As a result, they suggested a revamped HLMS implementation that enhanced the convergence by improving the formula for the iterative estimation of fuzzy measure coefficients and the monotonicity check.
Mesiar [8] proposed two types of generalizations of k-order additive discrete fuzzy measures given by Grabisch [9]. The authors proposed a formula for evaluating Choquet-like integrals on finite spaces based on the Möbius-like transform of the discrete fuzzy measure, followed by the proposal of k-order additive fuzzy measures defined on an arbitrary space. Finally, the two generalizations were combined to obtain a k-order pan-additive fuzzy measure on a general space. Liu and Kao [10] devised a mathematical programming approach for the derivation of fuzzy measures based on the correlation coefficient. Grabisch [9] conducted another research study that introduced a gradient algorithm for identifying fuzzy measures from empirical data. The paper presented the use of a mix of standard optimization algorithms, such as Lemke's method, and utility theory for identifying the $2^n$ coefficients based on semantical considerations.

B. DS Theory
Beynon et al. [11] described the potential offered by the DS theory of evidence as an alternative approach to multicriteria decision modeling. Their article describes the DS theory as a generalization of classical probability theory, followed by the incorporation of this theory in a modified version of the analytic hierarchy process (AHP), termed the DS/AHP method. The DS/AHP approach tackles the inherent drawbacks of the standard AHP and allows opinions on sets of decision alternatives. Although this method did not necessarily facilitate obtaining the highest ranked decision alternative, its major contribution lay in the reduction of the number of serious contenders and of the computational complexity. Yang et al. [12] borrowed the design idea of Schubert's degree of falsity and proposed a distance-based degree of disagreement through which discounting factors can be generated for discounting combinations of unreliable evidence.
Yager [13] inferred that the DS belief structure provided partial information about the underlying fuzzy measure associated with an uncertain variable. Hence, they proposed a class of fuzzy measures that can be seen as possible completions of a DS belief structure. By the introduction of a measure of entropy on a fuzzy measure, they demonstrated that the three noted completions have the same entropy. However, their studies did not explore the general characterizations of the entropy associated with the class of completions proposed in their paper. Based on the distance of basic belief assignments, Martin et al. [14] proposed some conflict measures of a group of experts. Essentially, the conflict is evaluated for one expert against the rest of the group. These measures of conflict were further used for an a posteriori estimation of the relative reliability between the sources of information.
The presented measures of conflict and the associated reliability were evaluated and debated on random basic belief assignments and found their application in real radar target recognition tasks.
Wang et al. [15] incorporated an ambiguity measure and the DS theory of evidence to enhance fuzzy soft set-based decision-making capabilities. This approach reduced the uncertainty caused by people's subjective cognition and raised the choice decision level. Yu et al. [16] developed the concepts of the supporting probability function and its distance to demonstrate the degree of correlation between evidence in a body of evidence (BOE) and the distances between the BOEs. In addition, a combination method for the conflicting BOEs based on the supporting probability distance was introduced, in which the basic probability mass of conflicting evidence was corrected by the credibility measure calculated via supporting probability distance before using the Dempster combination rule. Zhao et al. [17] proposed an improved combination method for the conflicting evidence based on inconsistent measurements. They introduced a new approach for measuring the conflict between two pieces of evidence, followed by the revision of the conflicting evidence by the selection of discount coefficients.

III. PROPOSED WORK
Even though we have introduced the required terms in Section I, we define them here again for the convenience of the readers. $X$ is the universe of discourse of classifiers. The $B_i$s are subsets of classifiers. The fuzzy measure $\mu(P)$, assigned to each subset $P$ of classifiers, determines the importance of that subset of classifiers in the decision-making process. In other words, if a higher value is assigned to a subset of classifiers, that subset provides more relevant information in the decision-making process than a subset that has been assigned a lower value. Similarly, the $m(B_i)$s also provide us with information regarding the importance of a subset of classifiers, with higher values assigned to subsets of classifiers that provide more relevant information and vice versa. As already mentioned in Section I, the DS belief structure can be thought to provide partial information about the fuzzy measures that we aim to calculate here. Moreover, the monotonicity constraint needs to be satisfied by the fuzzy measures, but the mass functions do not need to satisfy it.
The information about the true class label that is conveyed by the subset $B_i$ of classifiers in the form of the mass function, namely $m(B_i)$, provides information about the relevant information conveyed by the subset $P_j$ of classifiers in the form of $\mu(P_j)$, provided that $B_i \cap P_j \neq \emptyset$. Now, once we have understood that only the focal elements with $P \cap B_i \neq \emptyset$ can provide partial information about the fuzzy measure, we need to find a way to combine these elements. However, the values derived after the combination must follow the properties of the fuzzy measures. We argue that we can combine the measures using the formula
\[ \mu(P) = \sum_{i=1}^{p} w_P^{B_i}\, m(B_i). \tag{3} \]
Here, $P \subseteq X$, $p$ is the number of focal elements, $m(B_i)$ is the basic assignment function for the focal elements, and $w_P^{B_i}$ is the weight that determines the relative importance of the focal element $B_i$ in conveying information similar to that conveyed by the subset $P$ of classifiers.
The weights $w_P^{B_i}$, however, cannot be assigned arbitrary values, as the calculated $\mu(P)$ must satisfy the properties of the fuzzy measures. We calculate the set of such values of $w_P^{B_i}$ and use them in our calculation.
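However, besides the two constraints of the fuzzy measures, two more constraints are imposed by the DS theory. $B_i$ may be totally or partially contained in $P$. When $B_i$ is partially contained in $P$, it is not yet decided what portion of $m(B_i)$ should be allocated to $\mu(P)$. In (3), the two extreme cases of the allocation method are as follows.
Scheme 1: $m(B_i)$ is allocated to $\mu(P)$ only if $B_i$ is totally contained in $P$, that is, $B_i \subseteq P$.
Scheme 2: $m(B_i)$ is allocated to $\mu(P)$ whenever $B_i$ is at least partially contained in $P$, that is, $B_i \cap P \neq \emptyset$.
Thus, Scheme 1 of allocation is the strictest version of allocating $m(B_i)$ to $\mu(P)$, while Scheme 2 of allocation is the most lenient version of allocating $m(B_i)$ to $\mu(P)$. These two conditions give rise to the following constraint:
\[ \sum_{B_i \subseteq P} m(B_i) \;\le\; \mu(P) \;\le\; \sum_{B_i \cap P \neq \emptyset} m(B_i). \tag{4} \]
We now derive a set of weights such that $\mu(P)$ satisfies the three rules to be followed by the fuzzy measures and the two imposed by the DS belief structure mentioned above.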
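The weighted-sum form and the two extreme allocation schemes can be sketched in Python as follows. The sketch assumes the basic assignment is stored as a dictionary from frozensets to masses, and it treats Scheme 1 as weight 1 only for focal elements fully contained in $P$ and Scheme 2 as weight 1 for any focal element that overlaps $P$. The function names and example masses are illustrative assumptions, and these 0/1 weights are only the two extremes, not the weighting scheme adopted in this article.

```python
def weighted_measure(P, m, weight):
    """Compute sum_i weight(P, B_i) * m(B_i) over the focal elements of m."""
    P = frozenset(P)
    return sum(weight(P, B) * mass for B, mass in m.items() if mass > 0.0)


def strict_weight(P, B):
    """Scheme 1 (strictest): count m(B) only when B is fully contained in P."""
    return 1.0 if B <= P else 0.0


def lenient_weight(P, B):
    """Scheme 2 (most lenient): count m(B) whenever B overlaps P."""
    return 1.0 if B & P else 0.0


# Assumed example masses over classifiers "a", "b", "c".
m_demo = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
    frozenset({"b", "c"}): 0.2,
}
for P in ({"a", "b"}, {"c"}):
    lower = weighted_measure(P, m_demo, strict_weight)
    upper = weighted_measure(P, m_demo, lenient_weight)
    print(sorted(P), lower, upper)
```

Any admissible weight function passed to `weighted_measure` should produce a value between these two bounds for every subset $P$.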
Theorem 1: A fuzzy measure $\mu$ can be defined as
\[ \mu(P) = \sum_{i=1}^{p} w_P^{B_i}\, m(B_i) \tag{5} \]
provided that the weights $w_P^{B_i}$ have the following properties.
1) The weights must be fuzzy measures themselves.
Proof: To prove 1), $\mu$ must satisfy the three criteria of the fuzzy measure theory, namely $\mu(\emptyset) = 0$, $\mu(X) = 1$, and $\mu(W) \le \mu(Q)$ whenever $W \subseteq Q$, where $w_Q^{B_i}$ and $w_W^{B_i}$ are the weights associated with the decomposition of $\mu(Q)$ and $\mu(W)$, respectively. Based on the above three criteria, we can see that $w_P^{B_i}$ must be a set of fuzzy measures for $\mu(P)$ to be a fuzzy measure.
Next, we show the other two conditions, which are imposed by the DS theory. Thus, we see that (5) represents the fuzzy measures, provided that the weights follow the three conditions mentioned in this theorem.
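The role of the first property can be illustrated with a short worked derivation. The following is only a sketch of one sufficient argument, assuming that $m(B_i) \ge 0$, that $\sum_{i=1}^{p} m(B_i) = 1$, and that for each fixed $B_i$ the weights $w_P^{B_i}$, viewed as a function of $P$, satisfy the three fuzzy-measure conditions. It is not a reproduction of the authors' proof.

```latex
\begin{align*}
\mu(\emptyset) &= \sum_{i=1}^{p} w_{\emptyset}^{B_i}\, m(B_i) = 0
  && \text{since } w_{\emptyset}^{B_i} = 0 \text{ for all } i, \\
\mu(X) &= \sum_{i=1}^{p} w_{X}^{B_i}\, m(B_i) = \sum_{i=1}^{p} m(B_i) = 1
  && \text{since } w_{X}^{B_i} = 1 \text{ for all } i, \\
\mu(Q) - \mu(W) &= \sum_{i=1}^{p} \bigl( w_{Q}^{B_i} - w_{W}^{B_i} \bigr)\, m(B_i) \ge 0
  && \text{since } w_{W}^{B_i} \le w_{Q}^{B_i} \text{ whenever } W \subseteq Q .
\end{align*}
```

Under these assumptions, the monotonicity of $\mu$ is inherited from the weights, so no separate monotonicity check on $\mu$ is required once the basic assignment values are known.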
It is to be noted that any fuzzy measure that satisfies (4) would be a fuzzy measure associated with both the fuzzy measure theory and the DS theory. Thus, there can be an infinite number of such fuzzy measures. In (5), we calculate a few such sets of fuzzy measures that can be written in that form. However, the weights $w_P^{B_i}$ must follow the three conditions mentioned in the theorem for the quantities to be fuzzy measures associated with both the fuzzy measure theory and the DS theory.
We choose a specific set of weights for evaluating our method on different datasets. We define this set of weights in (6), where $|\cdot|$ denotes the number of elements of a set. It can easily be checked that this set of weights satisfies the three conditions in the theorem. $m(B_i)$ would not have any effect on the value of $\mu(P)$ if $B_i \cap P = \emptyset$, as $B_i$ then does not provide us with any information about whether information about the true class is being conveyed by the subset $P$. So, $w_P^{B_i}$ is set to zero. At the other extreme, when $B_i \cap P = B_i$, we have detailed knowledge of $B_i$, but we are still uncertain about the elements of $P$ that are not part of $B_i$. So, we assign a value to $w_P^{B_i}$ that is high but not equal to 1. However, if $B_i \cap P = P$, knowing $B_i$ gives us complete knowledge of $P$, that is, $B_i$ is no longer providing partial information. Hence, $w_P^{B_i}$ is assigned a value equal to 1 in this case. For all the other cases, when $B_i \cap P \subset B_i$, the higher the value of $|B_i \cap P|$, the more information is conveyed by $B_i$, and hence, a higher value must be assigned to $w_P^{B_i}$. Since (6) is consistent with this logic, we select (6) as the weighting scheme.
For calculating the fuzzy measures used in this article, we combine (5) and (6), which gives the formula (7). The only quantities that we need to know to calculate $\mu(P)$ using (7) are the basic assignment values $m(B_i)$ of the focal elements.
[...] which should not be the case. Hence, the weights must be fuzzy measures, as shown in the theorem.
2) Example 2: Let us consider the subset $W = \{1\}$. Now, subsets for which [...] It can be seen very clearly that condition 2 is violated using these weights, as the right-hand side is equal to 0.081, while the left-hand side is equal to 0.539. It can be seen that, as a result, the condition $\mu(P)$ [...]
Similarly, we can devise an example to show that condition 3, if violated, would not generate a fuzzy measure from the mass functions, as the generated quantity would violate the primary rules of the fuzzy measure theory.
$m(B_i)$ conveys the extent to which the relevant information about the true class is conveyed by the subset of classifiers $B_i$. However, it is to be noted that $m(B_i)$ does not specify which subset of $B_i$ conveys the most relevant information in the determination of the true class label. Thus, $m(B_i)$ gives us the information that the true class label matches the decision made by the subset of classifiers $B_i$. Thus, the fuzzy measure for the subset of classifiers $P$ (i.e., $\mu(P)$) conveys the relevant information conveyed by the subset of classifiers $P$ and is expressed as the weighted sum of the $m(B_i)$s, which, in turn, provide the relevant information of the true class label conveyed by the subsets of classifiers $B_i$.
We calculate the values of the $m(B_i)$s keeping the following criteria in mind.
1) How well the subset $B_i$ of classifiers can classify such that their combined decision matches the true label.
2) How well the classifiers in $B_i$ combine with other classifiers such that the decision made by the resulting subset is the same as the true label.
Thus, we make $m(B_i)$ dependent not only on the accuracy of the decision made by this subset of classifiers but also on how compatible they are with the other classifiers being used.
To take the above two points into consideration while calculating the values of $m(B_i)$, we use a reward system that we developed, inspired by the MC method for calculating the value function in an MDP.
Before explaining how we adopt a method that is close to the MC prediction method for MDPs in our work, we explain it in brief. In the MC prediction technique, the state-value function is evaluated with respect to a specific policy. In this technique, given a specific policy, the value function of a specific state with respect to the policy is evaluated. Episodes are generated with respect to the specific policy, and the average return (expected return) of the specific state with respect to the policy is considered as the value function of that state. In our work, we adapt this MC prediction technique to calculate the values of the $m(B_i)$s.
It is important to note that in our work, we do not make a decision or try to evaluate a value function for a specific policy; rather, we evaluate the average of the value functions for a given number of different policies. For determining the $m(B_i)$s, we take the average of several value functions calculated with respect to different policies that we define. We do this to get an intuition of how good the state is with respect to all the policies, each of which captures an important aspect of the subset of classifiers. The number of value functions whose average is to be taken depends on $|B_i|$ and is equal to $|X| - |B_i| + 1$. The policies with respect to which these value functions are calculated are defined as follows. The first policy is to classify based on the information conveyed by the classifiers included in $B_i$. The second policy is to randomly add one of the classifiers from the remaining set of classifiers and then classify based on the information conveyed by the new set of classifiers, namely the classifiers present in $B_i$ and the randomly added classifier, while the third policy is to add two classifiers randomly and then use the cumulative information conveyed by the new set of classifiers. New policies are added until all the classifiers present in the set $X$ of classifiers are taken into consideration for the decision-making process. This makes the number of policies equal to $|X| - |B_i| + 1$, as mentioned above.
Each of these policies generates a value function based on the reward structure. Before we define the reward, we explain why the average of these value functions gives us the value of $m(B_i)$. The value function generated from the policy where no additional classifier is included in the subset, and the decision is taken solely based on the decisions of the classifiers in the subset $B_i$, indicates how accurate the classifiers in the subset $B_i$ are. The value functions for all the other policies, where we randomly add one or more classifiers that are not part of $B_i$ and then make the decision, indicate how compatible the classifier subset $B_i$ is with the other classifiers in the set of classifiers $X$.
We now define the actions and the states. There are two actions, namely, making a classification and adding a classifier randomly. Let us assume that there are five classifiers. Then, there are 32 different states, consisting of the different combinations of classifiers. We define the rewards for the MDP as follows: +1 for a correct decision, −1 for a wrong decision, and 0 for adding other classifiers randomly.
Just like in MC prediction for solving the MDP, we divide the entire task into episodes, where the classification of each item (or the decision made by the classifier subset per item in the training set) is considered as one episode. Then, we take the average of the returns (expected returns) in each episode to evaluate the state-value functions for different policies. The $m(B_i)$s are equal to the average of the state-value functions with respect to the different policies, as mentioned above. Furthermore, this problem satisfies the Markov property because, at any particular instant, the present state summarizes the entire history. This is because, at any state, the decision is taken based on the classifiers present in that state, and it does not depend on whether a classifier was originally a part of $B_i$ or on when that classifier was added. Thus, the decision is influenced only by the present state and not by the history.
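The scoring procedure described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the text: the combined decision of a subset is taken as the argmax of the averaged class scores, each training item contributes one episode per policy, and the averaged values are turned into a basic assignment by clipping negative scores to zero and normalizing them to sum to 1. All function names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def subset_decision(item_scores, subset):
    """Assumed combination rule: average the class scores over the subset, then argmax."""
    n_classes = len(next(iter(item_scores.values())))
    avg = [sum(item_scores[c][k] for c in subset) / len(subset) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


def policy_value(B, n_added, data, classifiers, rng):
    """Average return of the policy 'add n_added random classifiers to B, then classify'."""
    rest = sorted(set(classifiers) - set(B))
    total = 0
    for item_scores, label in data:  # one episode per training item
        state = set(B) | set(rng.sample(rest, n_added))  # adding a classifier earns reward 0
        total += 1 if subset_decision(item_scores, state) == label else -1
    return total / len(data)


def raw_score(B, data, classifiers, rng):
    """Average of the |X| - |B| + 1 policy values for subset B."""
    n_policies = len(classifiers) - len(B) + 1
    values = [policy_value(B, k, data, classifiers, rng) for k in range(n_policies)]
    return sum(values) / n_policies


def basic_assignment(data, classifiers, seed=0):
    """Assumed normalization: clip negative scores to 0 and rescale so the masses sum to 1."""
    rng = random.Random(seed)
    subsets = [
        frozenset(c)
        for r in range(1, len(classifiers) + 1)
        for c in combinations(sorted(classifiers), r)
    ]
    scores = {B: max(raw_score(B, data, classifiers, rng), 0.0) for B in subsets}
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no subset obtained a positive score")
    return {B: s / total for B, s in scores.items()}


# Toy training data (assumed): per-classifier class scores and the true label.
toy_data = [
    ({"a": [0.9, 0.1], "b": [0.8, 0.2], "c": [0.7, 0.3]}, 0),
    ({"a": [0.2, 0.8], "b": [0.3, 0.7], "c": [0.4, 0.6]}, 1),
]
m_toy = basic_assignment(toy_data, ["a", "b", "c"])
print(m_toy)
```

Because the reward for adding a classifier is 0, the return of each episode is simply the final classification reward, so each policy value reduces to the difference between the fractions of correct and wrong decisions under that policy.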

A. Dataset Description
In the following section, we have discussed three different kinds of publicly available image datasets that have been used for the performance evaluation of the ensemble learning methodology proposed in this article.

3) UTD Multimodal Human Action Dataset (UTD-MHAD):
The UTD-MHAD [21] is a large public human action recognition dataset consisting of four data modalities, namely RGB videos, depth videos, skeleton sequences, and inertial signals, captured with the aid of a Microsoft Kinect camera and a wearable inertial sensor. It is a comprehensive and well-annotated dataset consisting of 27 human action classes performed by eight subjects, and each action is repeated four times by each subject. The dataset is publicly available at [22].

B. Experimental Setup
To implement the proposed method, we have used the Python programming language with the Keras package and TensorFlow as the deep learning framework, on an Nvidia GeForce GTX 1080 Ti GPU with 11-GB RAM, 3584 CUDA cores, and a memory clock of 5505 MHz.

1) BACH dataset:
Under image preprocessing protocols, the H&E stained breast microscopy images are downsized to a resolution of 512 × 512 pixels using bicubic interpolation, followed by the application of Macenko stain normalization [23]. Five fine-tuned deep convolutional neural network (DCNN) architectures, pretrained on the ImageNet dataset, namely VGG16, VGG19, Xception, InceptionV3, and InceptionResnetV2, are used for the extraction of deep features from specific convolutional layers after performing a thorough analysis of the activation maps of each layer in the DCNN model. These extracted features are then fused to generate deep image descriptors. These image descriptors are further fed into a multilayered perceptron (MLP) classifier, which yields the four-class classification performance. A confidence matrix is generated by combining the confidence scores obtained across the five DCNN models used, which is then passed through the proposed DS-based fuzzy ensemble strategy for decision score aggregation. In the four-class breast cancer classification, the ensemble method is used to combine the confidence scores of each classifier for each of the four classes, and the final combined score is computed. The results are depicted in Table I. It can be observed that the accuracy after the ensemble is superior to the accuracy of the individual classifiers on both the test and validation sets. In fact, on the test set, the accuracy of the proposed DS-based fuzzy ensemble model outperforms the best accuracy of the individual classifiers by 4%.

2) Novel COVID-19 Chestxray Repository:
Essential data preprocessing protocols, such as image resizing, color space translation, image denoising filters, and division of the CXR dataset, are applied to the CXR radiographs.
Offline image augmentation techniques are applied to the training and validation sets. The CXR images are then fed as input into a DCNN architecture, pretrained on the ImageNet dataset. VGG16, Xception, and InceptionV3 are employed as transfer learning DCNN models, which are then fine-tuned. The global average pooling function is applied to the convolution layers of the DCNN model having salient activation maps, which reduces the feature maps into 1-D vectors. These feature vectors are then concatenated into a single feature embedding. This feature embedding is then passed through an MLP network, and then, the entire architecture is trained for 100 epochs for obtaining a refined deep feature embedding. As shown in (8), we have used a combination of cross-entropy loss ($L_{CE}$) and contrastive loss ($L_{con}$) as our overall training objective. This loss function facilitates the regularization of the deep feature embedding and helps in addressing the issue of domain gap while dealing with heterogeneous CXR data sources, which hence results in robust joint training
\[ L_{overall} = L_{CE} + \alpha L_{con} \tag{8} \]
where $\alpha$ is a hyperparameter. The refined feature embedding is then fed into a classifier with a 512-neuron fully connected layer, with rectified linear unit activation and a dropout rate of 0.5 to curtail the problem of overfitting. For addressing the three-class CXR classification problem under consideration, there is finally a three-neuron output layer with softmax activation. The three classifiers, corresponding to the three DCNN models used, generate confidence scores of validation and unseen test CXR images, which are combined to form confidence matrices of validation and test sets, respectively. The DS-based fuzzy ensemble method proposed here is used to combine the confidence scores of each classifier.
TABLE II: Experimental results obtained for the three-class CXR classification problem, when evaluated on the Novel COVID-19 Chestxray Repository dataset.
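The structure of the overall objective, a cross-entropy term plus a contrastive term scaled by a hyperparameter α, can be illustrated with a small pure-Python sketch. The exact form of the contrastive loss is not given in this section, so the sketch assumes a pairwise margin-based contrastive loss on two embeddings. The function names, the margin, and the value of α are assumptions made for illustration and may differ from the loss actually used.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one sample: negative log-probability of the true class."""
    return -math.log(probs[label])


def contrastive(z1, z2, same_class, margin=1.0):
    """Assumed pairwise contrastive loss: pull same-class embeddings together,
    push different-class embeddings at least `margin` apart."""
    d = math.dist(z1, z2)
    return d ** 2 if same_class else max(0.0, margin - d) ** 2


def overall_loss(probs, label, z1, z2, same_class, alpha=0.5):
    """Weighted sum of the two terms: cross-entropy plus alpha times contrastive."""
    return cross_entropy(probs, label) + alpha * contrastive(z1, z2, same_class)


# Toy values (assumed): a softmax output and two 2-D embeddings from different classes.
print(overall_loss([0.7, 0.2, 0.1], 0, [0.0, 0.0], [0.3, 0.4], same_class=False))
```

In a deep learning framework, the same weighted sum would be computed per batch on tensors. The scalar version here is only meant to show how α balances the two terms.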
We have tabulated the results in Table II. In this dataset also, we can observe that the accuracy of the proposed ensemble method outperforms the accuracy of all the individual classifiers. Even in the worst-case scenario, the fusion method outperforms the test accuracy of the individual classifier by 2.32%.

3) UTD-MHAD:
We have extracted three geometric feature representations from the raw skeleton joint sequences provided in the skeleton data folder of the UTD-MHAD. These geometric features, namely pairwise joint distance motion ($JJd_{motion}$), pairwise line angle ($LLa$), and pairwise planar angle ($PPa$), are used for capturing various spatiotemporal dynamic information during human actions. These distilled feature representations are then converted into RGB image encodings, which are fed into DCNN classifiers. We have implemented our proposed DS-based fuzzy ensemble for the aggregation of the decision scores generated by DCNN classifiers. Fig. 1 shows the RGB image encodings for action classes A9 (Draw a circle

D. Comparison With Other Fuzzy Measures Derived From Overlap Functions
To compare the performance of the proposed fuzzy measure with other fuzzy measures, we have implemented the capacities (fuzzy measures) used in [24]. We evaluated the efficacy of these capacities on the BACH dataset, following the calculation rules given in that paper. The results are shown in Table IV.

1) BACH Dataset:
We have compared our method with the state-of-the-art methods using the four-class breast cancer classification dataset. Our proposed method has outperformed most of the existing methods in terms of accuracy evaluated on the test set, as demonstrated in Table V.
Here, we briefly explain the existing breast histology image classification methods that we consider for comparison and that were applied to the said dataset. Fan et al. [25] employed a deep attention network and trained it with a batch size of 16, a stochastic gradient descent (SGD) optimizer, and a learning rate of 0.1, recording results for up to 50 epochs. DenseNet-161 was utilized with considerable offline data augmentation by Kohl et al. [26], with an 80%/20% split for training and validation, respectively. Yang et al. [27] employed a guided attention network with extensive augmentation, accompanied by fourfold cross-validation. Yang et al. [28] incorporated an ensemble of nine fine-tuned pretrained transfer learning models (architectural variants of DenseNet and ResNet) for the four-class classification problem. Roy et al. [29] used the DCNN along with majority voting (MV), patch extraction, and data augmentation. They divided the dataset into training, validation, and testing sets. Of the 400 images, 40 were used as the test set for image classification, while patches were extracted from the remaining 360 images, 20% of which were  [25], which used a complex attention-based module.

2) Novel COVID-19 Chestxray Repository:
We have evaluated our proposed DS-based fuzzy ensemble approach on the Novel COVID-19 Chestxray Repository proposed in [40] and compared our method with the existing techniques. The results are tabulated in Table VI. Bhowal et al. [40] developed a method of calculating the fuzzy measures of individual classifiers using coalition game theory and information theory before using the lambda fuzzy approximation to calculate the fuzzy measures of the set of classifiers. They also incorporated three different weighting schemes for calculating the Shapley values. Our proposed DS-based fuzzy ensemble yields a gain of 0.31% for the three-class CXR classification.

3) UTD-MHAD:
In Table VII, we have tabulated the comparison of our proposed DS-based fuzzy ensemble with the existing approaches evaluated on the UTD-MHAD. Since the joint trajectory map technique employed by Wang et al. [41] relies mainly on temporal dynamics, it fails to capture the discriminative spatial features essential for distinguishing actions such as classes A22 (Jog) and A23 (Walk). Using only the Kinect modality, our method outperforms existing methods [42], [43], [44], [45] that incorporate a multimodal approach to harness salient information for multimodal feature learning. As illustrated in Table VII, the proposed method beats most of the existing methods, which supports the claim that the proposed method is quite general and works with a variety of deep learning and machine learning models on a variety of datasets.

F. Error Analysis
Although the proposed method performs well on varied datasets, the proposed fusion technique has two main limitations, which are listed as follows.
1) Besides the MC method that we used for the calculation of the $m(B_i)$s, other optimization techniques may be used, each of which may produce a separate set of basic assignment functions and, as a result, a separate set of fuzzy measures. Which algorithm to use has to be decided by the programmer using the code.
2) Even though we find the general formula for a set of fuzzy measures that can be associated with the DS theory and can be expressed in the form $\mu(P) = \sum_{i=1}^{p} w^{P}_{B_i} m(B_i)$, there are infinitely many sets of fuzzy measures that can be associated with the DS belief structure but cannot be expressed in this form. Thus, it is not feasible to consider all the possible sets of fuzzy measures associated with the DS belief structure and select the best possible one.
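To make the weighted-sum form of the fuzzy measure concrete, the following Python sketch evaluates $\mu(P)$ as a weighted sum of basic assignment values over the focal elements $B_i$. The three weighting schemes shown (belief, plausibility, and a proportional-overlap weight) are illustrative choices and are not claimed to be the weighting schemes used in the article; the mass values and classifier names are hypothetical.

```python
from itertools import combinations

def fuzzy_measure(P, m, weight):
    """mu(P) = sum_i weight(P, B_i) * m(B_i) over the focal elements B_i."""
    return sum(weight(P, B) * mass for B, mass in m.items())

# Illustrative weighting schemes (assumptions, not taken from the article).
def w_belief(P, B):
    return 1.0 if B <= P else 0.0          # B_i fully contained in P

def w_plausibility(P, B):
    return 1.0 if B & P else 0.0           # B_i intersects P

def w_proportional(P, B):
    return len(P & B) / len(B)             # fraction of B_i covered by P

def subsets(universe):
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_monotone(universe, m, weight, tol=1e-12):
    """Check that A subset-of B implies mu(A) <= mu(B) for all subsets."""
    all_sets = subsets(universe)
    return all(
        fuzzy_measure(A, m, weight) <= fuzzy_measure(B, m, weight) + tol
        for A in all_sets for B in all_sets if A <= B
    )

# Hypothetical basic assignment over three classifiers; masses sum to 1.
universe = frozenset({"c1", "c2", "c3"})
m = {
    frozenset({"c1"}): 0.5,
    frozenset({"c2", "c3"}): 0.3,
    universe: 0.2,
}

if __name__ == "__main__":
    P = frozenset({"c1", "c2"})
    for w in (w_belief, w_proportional, w_plausibility):
        print(w.__name__, fuzzy_measure(P, m, w), is_monotone(universe, m, w))
```

With these weights, monotonicity follows from the weights themselves being monotone in $P$, so only the basic assignment values need to be chosen; the `is_monotone` helper simply verifies this by enumeration for a small universe.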

V. CONCLUSION AND FUTURE DIRECTIONS
In this article, our contribution is twofold. First, we provide a specific generalized set of fuzzy measures associated with both the fuzzy measure theory and the DS theory. We also prove a theorem to provide us with the conditions such that the fuzzy measures can be generated using a predecided weighting scheme and the basic assignment functions of the DS theory. We further use a method inspired by the MC method for the calculation of the value function to calculate the basic assignment functions of the DS theory.
To demonstrate the efficacy of the fuzzy measures derived here, we apply them in the Choquet integral to fuse the decisions of deep learning models evaluated on three different kinds of datasets, namely, the BACH, Novel COVID-19 Chestxray Repository, and UTD-MHAD.
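As a minimal sketch of how a fuzzy measure enters the fusion step, the following Python code computes the discrete Choquet integral of per-classifier confidence scores and uses it to fuse class scores. The measure values, classifier names, and helper names are hypothetical and chosen only to illustrate the computation.

```python
def choquet(scores, mu):
    """Discrete Choquet integral of `scores` with respect to fuzzy measure `mu`.

    scores: dict mapping classifier name -> confidence in [0, 1]
    mu:     dict mapping frozenset of classifier names -> measure value,
            with mu[full set] == 1
    """
    # Visit classifiers from highest to lowest confidence, growing the coalition.
    order = sorted(scores, key=scores.get, reverse=True)
    total, prev = 0.0, 0.0
    coalition = set()
    for name in order:
        coalition.add(name)
        cur = mu[frozenset(coalition)]
        total += scores[name] * (cur - prev)  # score times marginal measure gain
        prev = cur
    return total

def fuse(confidences, mu):
    """Fuse per-class scores; confidences maps classifier -> list of class scores."""
    n_classes = len(next(iter(confidences.values())))
    fused = [
        choquet({clf: s[k] for clf, s in confidences.items()}, mu)
        for k in range(n_classes)
    ]
    return fused, max(range(n_classes), key=fused.__getitem__)

# Hypothetical monotone fuzzy measure over three classifiers.
mu = {
    frozenset({"a"}): 0.4, frozenset({"b"}): 0.3, frozenset({"c"}): 0.3,
    frozenset({"a", "b"}): 0.7, frozenset({"a", "c"}): 0.6,
    frozenset({"b", "c"}): 0.5, frozenset({"a", "b", "c"}): 1.0,
}

if __name__ == "__main__":
    confidences = {"a": [0.9, 0.1], "b": [0.6, 0.4], "c": [0.2, 0.8]}
    print(fuse(confidences, mu))
```

Because the integral depends on the measure of each coalition of classifiers rather than on a single weight per classifier, it can reward or discount agreement among particular subsets of models.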
However, the method has two specific limitations. First, the values of the $m(B_i)$s can be calculated using a number of different algorithms besides ours, and the choice among them must be made by the user. Second, although we derive the generalized formula for the set of fuzzy measures associated with the DS theory that can be expressed as $\mu(P) = \sum_{i=1}^{p} w^{P}_{B_i} m(B_i)$, there are infinitely many sets of fuzzy measures that cannot be expressed in this form but can be associated with the DS belief structure. Hence, the set that gives the best result on a dataset may not be of the form that we have derived. In our future work, we will try to generalize our theorem so that it can represent more of the fuzzy measures associated with the DS theory. We will also evaluate our method with different algorithms and datasets to assess which algorithm performs best across a wide range of datasets.

APPENDIX SPATIOTEMPORAL GEOMETRIC FEATURE EXTRACTION TECHNIQUES FOR 3-D HUMAN SKELETON ACTION RECOGNITION
Consider a video portraying a particular human activity or movement performed by a subject. This video comprises $F$ frames, each consisting of $J$ key joints. Each joint $i$, where $i \in J$, is a position vector $p_i = \{x_i, y_i, z_i\}$. For the $f$th frame, where $f \in F$, the position vector across all the joints is given as $P_f = \{p_1, p_2, p_3, \ldots, p_{J-1}, p_J\}$, and across the $F$ frames, we thus have a sequence $P = \{P_1, P_2, P_3, \ldots, P_{F-1}, P_F\}$. A detailed description of the spatiotemporal feature extraction techniques is given as follows.
1) Pairwise joint distance motion ($JJd_{motion}$): In every frame, for $i, j \in J$, $i \neq j$, the pairwise joint distance metric of $p_i$ and $p_j$ is equal to the Euclidean distance between them and is given as
$$JJd^{f}_{ij} = \lVert p^{f}_{i} - p^{f}_{j} \rVert_2. \tag{9}$$
In the UTD-MHAD, for the $f$th frame having $J = 20$ key joints, $\binom{20}{2} = 190$ pairwise joint distance features are extracted. Across $F$ frames of the entire video sequence, the joint-joint distance features are given as
$$JJd = \{JJd_0, JJd_1, JJd_2, \ldots, JJd_{F-1}, JJd_F\}. \tag{10}$$
$JJd_{motion}$ is a feature representation that captures the kinematic motion patterns of skeleton joint sequences over time. It is computed by finding the temporal difference over the extracted pairwise joint distance features. For a given frame interval $N$ and for $f \in F$, $f < F - N$, the joint-joint distance motion feature sequence ($DM$) is given as
$$DM_f = JJd_{f+N} - JJd_f.$$
Across the entire video, we obtain the following feature sequence:
$$JJd_{motion} = \{DM_f : f \in F,\ f < F - N\}.$$
2) Pairwise line angle (LLa): Pairwise joint orientation (JJo) is given as the unit vector pointing from joint $i$ to joint $j$
$$JJo^{f}_{ij} = \frac{p^{f}_{j} - p^{f}_{i}}{\lVert p^{f}_{j} - p^{f}_{i} \rVert_2}.$$
Pairwise line angle (LLa) is defined as the angle, in radians ($0 < \theta < \pi$), between lines $L_{J_i \to J_j}$ and $L_{J_p \to J_q}$
$$LLa^{f}_{ijpq} = \arccos\left(JJo^{f}_{ij} \cdot JJo^{f}_{pq}\right) \tag{15}$$
where $i, j, p, q \in J$ are the joint indices, with $(i, j) \neq (p, q)$ in (15). The features in this vector sequence change more swiftly than the distance features, which helps to capture discriminative spatial information corresponding to angular movements, such as Draw circle, Baseball swing, Wave, Throw, Basketball shoot, and so on, performed by the subjects. In the UTD-MHAD, for the $f$th frame having 34 lines in total, $\binom{34}{2} = 561$ pairwise line angle features are extracted. Across $F$ frames of the entire video sequence, the line-line angle features are given as
$$LLa = \{LLa_0, LLa_1, LLa_2, \ldots, LLa_{F-1}, LLa_F\}. \tag{16}$$
3) Pairwise planar angle (PPa): Pairwise planar angle (PPa) is defined as the angle, in radians ($0 < \theta < \pi$), between the normal vectors of the plane through joints $i, j, k$ and the plane through joints $p, q, r$
$$PPa^{f}_{ijkpqr} = \arccos\left(\left(JJo^{f}_{ij} \otimes JJo^{f}_{ik}\right) \cdot \left(JJo^{f}_{pq} \otimes JJo^{f}_{pr}\right)\right) \tag{17}$$
where $i, j, k, p, q, r \in J$ are the joint indices, with $(i, j, k) \neq (p, q, r)$ in (17), and $\otimes$ denotes the vector cross product. In the UTD-MHAD, for the $f$th frame having five planes in total, $\binom{5}{2} = 10$ pairwise planar angle features are extracted. Across $F$ frames of the entire video sequence, the plane-plane angle features are given as
$$PPa = \{PPa_0, PPa_1, PPa_2, \ldots, PPa_{F-1}, PPa_F\}. \tag{18}$$
The obtained feature sequence is then converted to its corresponding three-channel RGB image encoding by using the encoding scheme, as shown in (19), proposed by Du et al. [50]
$$p = \left\lfloor 255 \times \frac{c - c_{min}}{c_{max} - c_{min}} \right\rfloor \tag{19}$$
where $p$ indicates the pixel intensity and $c_{max}$ and $c_{min}$ are the maximum and minimum of all the joint coordinates, respectively. The obtained image encoding is then resized to a dimension of 256 × 256 pixels using bicubic interpolation. As illustrated in Fig. 1, the spatial structure and the temporal dynamics are depicted along the width and height of each image encoding, respectively. The image encodings are then passed through the AlexNet model, trained from scratch, to generate decision scores for each feature vector.
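The feature extraction steps described in this appendix can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the skeleton sequence is taken to be an array of shape (F, J, 3), the function names are hypothetical, the scaling uses the minimum and maximum of the array it is given, and the construction of the three RGB channels and the bicubic resizing are omitted.

```python
import numpy as np

def pairwise_joint_distances(seq):
    """seq: (F, J, 3) joint positions -> (F, J*(J-1)/2) Euclidean distances."""
    n_joints = seq.shape[1]
    i, j = np.triu_indices(n_joints, k=1)      # all unordered pairs with i < j
    return np.linalg.norm(seq[:, i, :] - seq[:, j, :], axis=-1)

def distance_motion(jjd, interval):
    """Temporal difference of distance features over a frame interval N >= 1."""
    return jjd[interval:] - jjd[:-interval]

def line_angle(p_i, p_j, p_p, p_q):
    """Angle in radians between the lines joint i -> j and joint p -> q."""
    u = (p_j - p_i) / np.linalg.norm(p_j - p_i)
    v = (p_q - p_p) / np.linalg.norm(p_q - p_p)
    return float(np.arccos(np.clip(u @ v, -1.0, 1.0)))  # clip guards rounding

def encode_uint8(features):
    """Min-max scale a feature array to integer pixel intensities in [0, 255]."""
    c_min, c_max = features.min(), features.max()
    if c_max == c_min:                          # constant input: avoid 0/0
        return np.zeros(features.shape, dtype=np.uint8)
    scaled = 255.0 * (features - c_min) / (c_max - c_min)
    return np.floor(scaled).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.normal(size=(30, 20, 3))          # synthetic 30-frame, 20-joint clip
    jjd = pairwise_joint_distances(seq)         # 190 features per frame
    motion = distance_motion(jjd, interval=5)
    image = encode_uint8(motion)                # rows: time, columns: features
    print(jjd.shape, motion.shape, image.dtype)
```

For a 20-joint skeleton the distance array has 190 columns per frame, matching the count given above; the motion array has F − N rows because only frames with f < F − N have a partner frame N steps ahead.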