Stochastics and Statistics
Maximising entropy on the nonparametric predictive inference model for multinomial data

https://doi.org/10.1016/j.ejor.2011.01.020

Abstract

The combination of mathematical models and uncertainty measures can be applied in the area of data mining for diverse objectives, with the final aim of supporting decision making. The maximum entropy function is an excellent measure of uncertainty when the information is represented by a mathematical model based on imprecise probabilities. In this paper, we present algorithms to obtain the maximum entropy value when the available information is represented by a new model based on imprecise probabilities: the nonparametric predictive inference model for multinomial data (NPI-M), which represents a type of entropy-linear program. To reduce the complexity of the model, we prove that the NPI-M lower and upper probabilities for any general event can be expressed as a combination of the lower and upper probabilities for the singleton events, and that this model cannot be associated with a closed polyhedral set of probabilities. An algorithm to obtain the maximum entropy probability distribution on the set associated with NPI-M is presented. We also consider a model which uses the closed and convex set of probability distributions generated by the NPI-M singleton probabilities, a closed polyhedral set; we call this model A-NPI-M. A-NPI-M can be seen as an approximation of NPI-M, and this approximation is simpler to use because it is not necessary to consider the set of constraints associated with the exact model.

Introduction

When we have a set of data about individual items, including their characteristics and a variable under study, the model (normally a mathematical one) used to represent the information that the data give us is important. To quantify the information that a characteristic, or a set of them, carries about the variable under study, we can use measures of information-based uncertainty (for simplicity, we call these uncertainty measures) on the mathematical model used.

Mathematical models and uncertainty measures can be combined to develop procedures in the areas of supervised classification learning, variable selection, clustering, etc. All of these fall within the general area of data mining and can be considered important tools for decision support. References about these types of procedures include [2], [4], [5], [8].

There are many mathematical models that can be used to represent information without single-valued probabilities. These models are generalisations of classical probability theory, such as belief functions, reachable probability intervals, capacities of various orders, upper and lower probabilities, and closed and convex sets of probability distributions (also called credal sets). The term imprecise probability (Klir [30], Walley [39]) subsumes these theories. Some of these generalised theories are more appropriate than others in specific situations.

The Imprecise Dirichlet model (IDM), presented by Walley [40], is a mathematical model for statistical inference from multinomial data which was developed to correct shortcomings of previous objective models. It verifies a set of principles which are claimed by Walley to be desirable for inference (see Walley [40]). The IDM can be seen as a model which gives imprecise probabilities that can be expressed via a set of reachable probability intervals and a belief function (Abellán [1]). The IDM has been applied to various statistical problems and a description of these applications was presented by Bernard [12]. However, the use of the IDM has recently been questioned for some practical applications (Piatti et al. [34]). Shortcomings of the IDM were already discussed in detail by Walley [40], and by many discussants of that paper, leading Walley to strongly motivate researchers to develop alternative models for such inference.

Coolen and Augustin [10], [17] presented nonparametric predictive inference for multinomial data (NPI-M) as such an alternative, which does not suffer from some of the main drawbacks of the IDM [40]. It is different from the IDM in the sense that NPI-M learns from data in the absence of prior knowledge and with relatively few modelling assumptions, most notably a post-data exchangeability-like assumption together with a latent variable representation of data as lines on a probability wheel. NPI-M does not satisfy all of the principles for inference suggested by Walley [40], specifically the Representation Invariance Principle (RIP), but Coolen and Augustin do not consider this to be a shortcoming [18]. In fact, they present strong arguments against general adoption of the RIP for inference and propose an alternative, weaker principle, which NPI-M satisfies.

With the emergence of models that extend classical probability theory, an extension of information-based uncertainty theory has been needed. In the 1990s, using the Shannon entropy (Shannon [37]) measure for probabilities as a starting point, a large amount of research was carried out to study measures that quantify the different types of uncertainty inherent in belief functions. In recent years, this study has been extended to general credal sets. The maximum entropy measure appears to be a suitable total uncertainty measure for general credal sets, verifying a set of desirable properties (Abellán et al. [6], Klir [30]). This measure has been questioned as an aggregate measure in some theories ([3], [29], [31]), where the efficiency of its computation is very important ([33], [38]). Since Jaynes [26], [27], [28] expounded his principle of maximum entropy, this measure has been used extensively in the literature.

In this paper, we study the NPI-M model with a view to practical applications, principally in data mining. With this aim in mind, we consider the applications in [8], [2], [4], where the IDM is applied with the maximum entropy measure. NPI-M is an alternative model for uncertainty quantification that can replace the IDM in some situations where the use of the IDM is questioned. The inferences given by NPI-M are in the form of lower and upper probabilities for events. We prove that these bounds comprise sets of reachable probability intervals, which enables more efficient computation (Campos et al. [13]).

An important characteristic of the set of probability distributions generated by NPI-M is that it is not a closed and convex set. We can determine bounds on the probability of each general event via the probabilities of the singleton events, but not all of the distributions in the associated credal set are compatible with the theoretical NPI-M model. For ease of application, an approximation of this model can be used, specifically the use of the credal set associated with the set of reachable probability intervals rather than the actual set of distributions valid under NPI-M. This approximation will be referred to as A-NPI-M.

When working with reachable probability intervals, several algorithms can be used to find the maximum entropy distribution within the associated credal set, such as the algorithm presented by Abellán and Moral [7] for probability intervals or the more general algorithm presented by Abellán and Moral [9]. However, these algorithms cannot be used with NPI-M because, due to the set of constraints of the model, it does not give a closed and convex set of probability distributions. Taking into account all constraints of NPI-M, we present an algorithm to obtain the maximum entropy distribution on the set of probability distributions generated by NPI-M.

Finally, we present an efficient algorithm to obtain the maximum entropy distribution on the set generated by A-NPI-M. This is simpler than the NPI-M algorithm and we can base it on the algorithm presented by Abellán and Moral [7] because A-NPI-M generates a closed and convex set of probability distributions.

This paper is organised as follows: in Section 2, we present a summary of the principal theories of imprecise probability; in Section 3, we explain NPI-M; Section 4 is devoted to an overview of uncertainty measures; in Section 5 we present an algorithm to calculate the maximum entropy probability distribution on the set obtained from NPI-M; in Section 6, we present an algorithm to calculate the maximum entropy probability distribution on the set obtained from A-NPI-M; and finally, Section 7 is devoted to our conclusions.

Section snippets

Credal sets

Theories of imprecise probability (Klir [30], Walley [39], Weichselberger [43]) share some common characteristics: for example, the evidence within each theory can be described by a lower probability function $\underline{P}$ on a finite set X or, alternatively, by an upper probability function $\overline{P}$ on X. These functions are always regular monotone measures (Wang and Klir [41]) and satisfy
$$\sum_{x \in X} \overline{P}(\{x\}) \geq 1, \qquad \sum_{x \in X} \underline{P}(\{x\}) \leq 1.$$
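The two singleton conditions above are easy to check numerically. A minimal sketch, assuming the lower and upper probabilities are given as dictionaries over a finite set X (the example values are illustrative, not from the paper):

```python
def satisfies_singleton_conditions(lower, upper):
    """Check the two conditions shared by imprecise probability theories:
    the upper singleton probabilities must sum to at least 1, and the
    lower singleton probabilities must sum to at most 1."""
    return sum(upper.values()) >= 1.0 and sum(lower.values()) <= 1.0

lower = {"a": 0.2, "b": 0.3, "c": 0.1}   # lower probabilities of singletons
upper = {"a": 0.5, "b": 0.6, "c": 0.4}   # upper probabilities of singletons
print(satisfies_singleton_conditions(lower, upper))  # True
```

If either condition fails, the bounds cannot enclose any probability distribution, so no credal set corresponds to them.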

If the set of probability

Probability intervals from NPI-M

The NPI model for multinomial data (NPI-M) was recently developed by Coolen and Augustin [10], [17], [18]. The model is based on a variation of Hill’s A(n) assumption [24], [25], which relates to predictive inference involving real-valued data observations. Nonparametric predictive inference is an inferential framework with attractive properties which has been applied to many problems in Statistics, Reliability and Operations Research [16]. The assumption made by Coolen and Augustin with regard

Uncertainty measures

It has been well established that uncertainty in classical possibility theory can be suitably quantified by the Hartley measure (Hartley [23]). For each nonempty and finite set A ⊆ X of possible alternatives, the Hartley measure, H(A), is defined by the formula
$$H(A) = \log_2 |A|,$$
where |A| denotes the cardinality of A. Since H(A) = 1 when |A| = 2, H defined by (17) measures uncertainty in bits. The uniqueness of H was proven on axiomatic grounds by Rényi [35]. The type of uncertainty measured by H is
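The Hartley measure is a one-liner in practice; the following minimal sketch computes it for a finite set of alternatives:

```python
import math

def hartley(alternatives):
    """Hartley measure H(A) = log2|A| of a nonempty finite set of
    possible alternatives, expressed in bits."""
    if not alternatives:
        raise ValueError("Hartley measure is defined for nonempty sets only")
    return math.log2(len(alternatives))

# Two equally possible alternatives carry exactly one bit of uncertainty:
print(hartley({"heads", "tails"}))  # 1.0
```

Note that H depends only on the number of alternatives, not on any probabilities: it quantifies nonspecificity rather than conflict.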

Maximising entropy for NPI-M

An algorithm for finding the maximum entropy distribution (pmaxE) within a credal set was developed by Abellán and Moral [7]. This algorithm is applicable to situations where the set L of probability intervals is reachable, i.e. the intervals are never unnecessarily wide.

The premise of the algorithm is to initially set p(cj) equal to lj for all categories, giving an initial set of values {p(cj)}. The smallest values are then augmented by an equal amount, leading to a new set of values for {p(cj
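The premise described in this snippet, starting every category at its lower bound and raising the smallest values equally until the mass sums to one, can be sketched as a water-filling scheme. The following is an illustrative sketch in the spirit of the Abellán and Moral [7] algorithm, not their exact recursive procedure: by the optimality conditions for entropy maximisation over interval constraints, every coordinate not stuck at a bound ends at a common level, which we locate by bisection.

```python
def max_entropy_in_intervals(lower, upper, tol=1e-12):
    """Maximum entropy distribution within the credal set defined by
    reachable probability intervals [l_j, u_j] with sum(p) = 1.
    Every non-saturated coordinate equals a common level lam; we find
    lam by bisection on the (nondecreasing) total clipped mass."""
    def total(lam):
        # Mass obtained by clipping the common level into each interval.
        return sum(min(max(lam, l), u) for l, u in zip(lower, upper))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return [min(max(lam, l), u) for l, u in zip(lower, upper)]

# Three categories; the uniform distribution is feasible here, so it is
# the maximum entropy solution:
p = max_entropy_in_intervals([0.1, 0.2, 0.0], [0.5, 0.4, 0.6])
```

When the uniform distribution lies inside all intervals it is returned; otherwise categories pinned at l_j or u_j keep their bounds and the remaining mass is shared equally.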

Maximising entropy for A-NPI-M

For A-NPI-M, i.e. for the credal set associated with the reachable set of NPI-M probability intervals, we present a more efficient and non-recursive algorithm based on the algorithm by Abellán and Moral [7].

Let J(t) be the set J(t) = {j : n_j = t} and K(t) = |J(t)|; then $\sum_t K(t) = K$ and $n = \sum_t t\,K(t)$. Let K′ be the number K′ = K − (K(0) + K(1)). For the set
$$L = \left\{ [l_i, u_i] \;\middle|\; l_i = \max\left\{0, \frac{n_i - 1}{n}\right\};\; u_i = \min\left\{\frac{n_i + 1}{n}, 1\right\};\; i = 1, 2, \ldots, K;\; \sum_{i=1}^{K} n_i = n \right\},$$
the algorithm attains the array $\hat{p} = (\hat{p}(x_1), \ldots, \hat{p}(x_K)) \equiv (\hat{p}_1, \ldots, \hat{p}_K)$ of maximum entropy probabilities and
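As a small illustration of the interval set L above, the following sketch (assuming only the formulas for l_i and u_i shown in this snippet) computes the A-NPI-M singleton probability intervals from a vector of multinomial counts:

```python
def anpi_m_intervals(counts):
    """A-NPI-M singleton probability intervals from multinomial counts:
    l_i = max(0, (n_i - 1)/n), u_i = min((n_i + 1)/n, 1),
    where n is the total number of observations."""
    n = sum(counts)
    return [
        (max(0.0, (ni - 1) / n), min((ni + 1) / n, 1.0))
        for ni in counts
    ]

# Counts (2, 1, 0) over K = 3 categories, so n = 3 observations.
# An unobserved category still gets upper probability 1/n.
intervals = anpi_m_intervals([2, 1, 0])
```

The resulting intervals can then be fed to any maximum entropy routine for reachable probability intervals, which is exactly the simplification A-NPI-M offers over the exact NPI-M constraints.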

Conclusions

In this paper we have analysed the set of probability distributions associated with the NPI model for multinomial data. We have simplified its use by proving that the NPI lower and upper probabilities can be obtained via the singleton probabilities only. We proved that the set of probability distributions valid under NPI-M is not closed and convex.

With the aim of using NPI-M in applications via the maximum entropy measure, we have presented algorithms to obtain this measure within the following

Acknowledgements

This work has been supported by the Spanish "Consejería de Economía, Innovación y Ciencia de la Junta de Andalucía" under Project TIC-06016. The work of the first-named author has also been partially supported by the Spanish "Ministerio de Educación y Ciencia" under project TIN2007-67418-C03-03.

References (43)

  • J. Abellán et al., Disaggregated total uncertainty measure for credal sets, International Journal of General Systems (2006).
  • J. Abellán et al., Maximum entropy for credal sets, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (2003).
  • J. Abellán et al., Building classification trees using the total uncertainty criterion, International Journal of Intelligent Systems (2003).
  • J. Abellán et al., An algorithm that computes the upper entropy for order-2 capacities, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (2005).
  • R.M. Baker, Multinomial Nonparametric Predictive Inference: Selection, Classification and Subcategory Data, PhD thesis, ...
  • L.M. De Campos et al., Probability intervals: A tool for uncertainty reasoning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (1994).
  • G. Choquet, Théorie des capacités, Annales de l'Institut Fourier (1953/54).
  • F.P.A. Coolen, International encyclopedia of statistical sciences, chapter nonparametric predictive inference, ...
  • F.P.A. Coolen, T. Augustin, Learning from multinomial data: A nonparametric predictive alternative to the imprecise ...
  • A.P. Dempster, Upper and lower probabilities induced by a multivalued mapping, The Annals of Mathematical Statistics (1967).
  • M. Grabisch, The interaction and Möbius representations of fuzzy measures on finite spaces, k-additive measures: A survey.