Stochastics and Statistics
Maximising entropy on the nonparametric predictive inference model for multinomial data

https://doi.org/10.1016/j.ejor.2011.01.020

Abstract

The combination of mathematical models and uncertainty measures can be applied in the area of data mining for diverse objectives, with the final aim of supporting decision making. The maximum entropy function is an excellent measure of uncertainty when the information is represented by a mathematical model based on imprecise probabilities. In this paper, we present algorithms to obtain the maximum entropy value when the available information is represented by a new model based on imprecise probabilities: the nonparametric predictive inference model for multinomial data (NPI-M), which represents a type of entropy-linear program. To reduce the complexity of the model, we prove that the NPI-M lower and upper probabilities for any general event can be expressed as a combination of the lower and upper probabilities for the singleton events, and that this model cannot be associated with a closed polyhedral set of probabilities. An algorithm to obtain the maximum entropy probability distribution on the set associated with NPI-M is presented. We also consider a model which uses the closed and convex set of probability distributions generated by the NPI-M singleton probabilities, a closed polyhedral set; we call this model A-NPI-M. A-NPI-M can be seen as an approximation of NPI-M, and this approximation is simpler to use because it is not necessary to consider the set of constraints associated with the exact model.

Introduction

When we have a set of data about individual items, including their characteristics and a variable under study, the model (normally a mathematical one) used to represent the information that the data give us is important. To quantify the information that a characteristic, or a set of them, carries about the variable under study, we can use measures of information-based uncertainty (for simplicity, we call these uncertainty measures) on the mathematical model used.

Mathematical models and uncertainty measures can be combined to develop procedures in the areas of supervised classification learning, variable selection, clustering, etc. All of these fall within the general area of data mining and can be considered important tools for decision support. References about these types of procedures include [2], [4], [5], [8].

There are many mathematical models that can be used to represent information without single-valued probabilities. These models are generalisations of classical probability theory, such as belief functions, reachable probability intervals, capacities of various orders, upper and lower probabilities, and closed and convex sets of probability distributions (also called credal sets). The term imprecise probability (Klir [30], Walley [39]) subsumes these theories. Some of these generalised theories are more appropriate than others in specific situations.

The Imprecise Dirichlet model (IDM), presented by Walley [40], is a mathematical model for statistical inference from multinomial data which was developed to correct shortcomings of previous objective models. It verifies a set of principles which are claimed by Walley to be desirable for inference (see Walley [40]). The IDM can be seen as a model which gives imprecise probabilities that can be expressed via a set of reachable probability intervals and a belief function (Abellán [1]). The IDM has been applied to various statistical problems and a description of these applications was presented by Bernard [12]. However, the use of the IDM has recently been questioned for some practical applications (Piatti et al. [34]). Shortcomings of the IDM were already discussed in detail by Walley [40], and by many discussants of that paper, leading Walley to strongly motivate researchers to develop alternative models for such inference.

Coolen and Augustin [10], [17] presented nonparametric predictive inference for multinomial data (NPI-M) as such an alternative, which does not suffer from some of the main drawbacks of the IDM [40]. It is different from the IDM in the sense that NPI-M learns from data in the absence of prior knowledge and with relatively few modelling assumptions, most notably a post-data exchangeability-like assumption together with a latent variable representation of data as lines on a probability wheel. NPI-M does not satisfy all of the principles for inference suggested by Walley [40], specifically the Representation Invariance Principle (RIP), but Coolen and Augustin do not consider this to be a shortcoming [18]. In fact, they present strong arguments against general adoption of the RIP for inference and propose an alternative, weaker principle, which NPI-M satisfies.

With the emergence of models that extend classical probability theory, an extension of information-based uncertainty theory has been needed. In the 1990s, using the Shannon entropy (Shannon [37]) measure for probabilities as a starting point, a large amount of research was carried out to study measures that quantify the different types of uncertainty inherent in belief functions. In recent years, this study has been extended to general credal sets. The maximum entropy measure appears to be a suitable total uncertainty measure for general credal sets, verifying a set of desirable properties (Abellán et al. [6], Klir [30]). This measure has been questioned as an aggregate measure in some theories ([3], [29], [31]), where the efficiency of its computation is very important ([33], [38]). Since Jaynes [26], [27], [28] expounded his principle of maximum entropy, this measure has been used extensively in the literature.

In this paper, we study the NPI-M model with a view to practical applications, principally in data mining. With this aim in mind, we consider the applications in [8], [2], [4], where the IDM is applied with the maximum entropy measure. NPI-M is an alternative model for uncertainty quantification that can replace the IDM in some situations where the use of the IDM is questioned. The inferences given by NPI-M are in the form of lower and upper probabilities for events. We prove that these bounds comprise sets of reachable probability intervals, which enables more efficient computation (Campos et al. [13]).

An important characteristic of the set of probability distributions generated by NPI-M is that it is not a closed and convex set. We can determine bounds on the probability of each general event via the probabilities of the singleton events, but not all of the distributions in the associated credal set are compatible with the theoretical NPI-M model. For ease of application, an approximation of this model can be used, specifically the use of the credal set associated with the set of reachable probability intervals rather than the actual set of distributions valid under NPI-M. This approximation will be referred to as A-NPI-M.

When working with reachable probability intervals, several algorithms can be used to find the maximum entropy distribution within the associated credal set, such as the algorithm presented by Abellán and Moral [7] for probability intervals or the more general algorithm presented by Abellán and Moral [9]. However, these algorithms cannot be used with NPI-M because, due to the set of constraints of the model, it does not give a closed and convex set of probability distributions. Taking into account all constraints of NPI-M, we present an algorithm to obtain the maximum entropy distribution on the set of probability distributions generated by NPI-M.

Finally, we present an efficient algorithm to obtain the maximum entropy distribution on the set generated by A-NPI-M. This is simpler than the NPI-M algorithm and we can base it on the algorithm presented by Abellán and Moral [7] because A-NPI-M generates a closed and convex set of probability distributions.

This paper is organised as follows: in Section 2, we present a summary of the principal theories of imprecise probability; in Section 3, we explain NPI-M; Section 4 is devoted to an overview of uncertainty measures; in Section 5 we present an algorithm to calculate the maximum entropy probability distribution on the set obtained from NPI-M; in Section 6, we present an algorithm to calculate the maximum entropy probability distribution on the set obtained from A-NPI-M; and finally, Section 7 is devoted to our conclusions.

Section snippets

Credal sets

Theories of imprecise probability (Klir [30], Walley [39], Weichselberger [43]) share some common characteristics: for example, the evidence within each theory can be described by a lower probability function $\underline{P}$ on a finite set X or, alternatively, by an upper probability function $\overline{P}$ on X. These functions are always regular monotone measures (Wang and Klir [41]) and satisfy
$$\sum_{x \in X} \overline{P}(\{x\}) \geq 1, \qquad \sum_{x \in X} \underline{P}(\{x\}) \leq 1.$$
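The two singleton conditions above are easy to check numerically. A minimal sketch, assuming the lower and upper probabilities are given as dictionaries over a finite set X (the example values are illustrative, not from the paper):

```python
def satisfies_singleton_conditions(lower, upper):
    """Check the two conditions shared by imprecise probability theories:
    the upper singleton probabilities must sum to at least 1, and the
    lower singleton probabilities must sum to at most 1."""
    return sum(upper.values()) >= 1.0 and sum(lower.values()) <= 1.0

lower = {"a": 0.2, "b": 0.3, "c": 0.1}   # lower probabilities of singletons
upper = {"a": 0.5, "b": 0.6, "c": 0.4}   # upper probabilities of singletons
print(satisfies_singleton_conditions(lower, upper))  # True
```

If either condition fails, the bounds cannot enclose any probability distribution, so no credal set corresponds to them.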

If the set of probability

Probability intervals from NPI-M

The NPI model for multinomial data (NPI-M) was recently developed by Coolen and Augustin [10], [17], [18]. The model is based on a variation of Hill’s A(n) assumption [24], [25], which relates to predictive inference involving real-valued data observations. Nonparametric predictive inference is an inferential framework with attractive properties which has been applied to many problems in Statistics, Reliability and Operations Research [16]. The assumption made by Coolen and Augustin with regard

Uncertainty measures

It has been well established that uncertainty in classical possibility theory can be suitably quantified by the Hartley measure (Hartley [23]). For each nonempty and finite set A ⊆ X of possible alternatives, the Hartley measure, H(A), is defined by the formula
$$H(A) = \log_2 |A|,$$
where |A| denotes the cardinality of A. Since H(A) = 1 when |A| = 2, H defined by (17) measures uncertainty in bits. The uniqueness of H was proven on axiomatic grounds by Rényi [35]. The type of uncertainty measured by H is
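The Hartley measure is a one-liner in practice; the following minimal sketch computes it for a finite set of alternatives:

```python
import math

def hartley(alternatives):
    """Hartley measure H(A) = log2|A| of a nonempty finite set of
    possible alternatives, expressed in bits."""
    if not alternatives:
        raise ValueError("Hartley measure is defined for nonempty sets only")
    return math.log2(len(alternatives))

# Two equally possible alternatives carry exactly one bit of uncertainty:
print(hartley({"heads", "tails"}))  # 1.0
```

Note that H depends only on the number of alternatives, not on any probabilities: it quantifies nonspecificity rather than conflict.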

Maximising entropy for NPI-M

An algorithm for finding the maximum entropy distribution (pmaxE) within a credal set was developed by Abellán and Moral [7]. This algorithm is applicable to situations where the set L of probability intervals is reachable, i.e. the intervals are never unnecessarily wide.

The premise of the algorithm is to initially set p(cj) equal to lj for all categories, giving an initial set of values {p(cj)}. The smallest values are then augmented by an equal amount, leading to a new set of values for {p(cj
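The premise described in this snippet, starting every category at its lower bound and raising the smallest values equally until the mass sums to one, can be sketched as a water-filling scheme. The following is an illustrative sketch in the spirit of the Abellán and Moral [7] algorithm, not their exact recursive procedure: by the optimality conditions for entropy maximisation over interval constraints, every coordinate not stuck at a bound ends at a common level, which we locate by bisection.

```python
def max_entropy_in_intervals(lower, upper, tol=1e-12):
    """Maximum entropy distribution within the credal set defined by
    reachable probability intervals [l_j, u_j] with sum(p) = 1.
    Every non-saturated coordinate equals a common level lam; we find
    lam by bisection on the (nondecreasing) total clipped mass."""
    def total(lam):
        # Mass obtained by clipping the common level into each interval.
        return sum(min(max(lam, l), u) for l, u in zip(lower, upper))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return [min(max(lam, l), u) for l, u in zip(lower, upper)]

# Three categories; the uniform distribution is feasible here, so it is
# the maximum entropy solution:
p = max_entropy_in_intervals([0.1, 0.2, 0.0], [0.5, 0.4, 0.6])
```

When the uniform distribution lies inside all intervals it is returned; otherwise categories pinned at l_j or u_j keep their bounds and the remaining mass is shared equally.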

Maximising entropy for A-NPI-M

For A-NPI-M, i.e. for the credal set associated with the reachable set of NPI-M probability intervals, we present a more efficient and non-recursive algorithm based on the algorithm by Abellán and Moral [7].

Let J(t) be the set J(t) = {j : n_j = t} and K(t) = |J(t)|; then $\sum_t K(t) = K$ and $n = \sum_t t\,K(t)$. Let K′ be the number K′ = K − (K(0) + K(1)). For the set
$$L = \left\{ [l_i, u_i] \;\middle|\; l_i = \max\left\{0, \frac{n_i - 1}{n}\right\};\; u_i = \min\left\{\frac{n_i + 1}{n}, 1\right\};\; i = 1, 2, \ldots, K;\; \sum_{i=1}^{K} n_i = n \right\},$$
the algorithm attains the array $\hat{p} = (\hat{p}(x_1), \ldots, \hat{p}(x_K)) \equiv (\hat{p}_1, \ldots, \hat{p}_K)$ of maximum entropy probabilities and
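As a small illustration of the interval set L above, the following sketch (assuming only the formulas for l_i and u_i shown in this snippet) computes the A-NPI-M singleton probability intervals from a vector of multinomial counts:

```python
def anpi_m_intervals(counts):
    """A-NPI-M singleton probability intervals from multinomial counts:
    l_i = max(0, (n_i - 1)/n), u_i = min((n_i + 1)/n, 1),
    where n is the total number of observations."""
    n = sum(counts)
    return [
        (max(0.0, (ni - 1) / n), min((ni + 1) / n, 1.0))
        for ni in counts
    ]

# Counts (2, 1, 0) over K = 3 categories, so n = 3 observations.
# An unobserved category still gets upper probability 1/n.
intervals = anpi_m_intervals([2, 1, 0])
```

The resulting intervals can then be fed to any maximum entropy routine for reachable probability intervals, which is exactly the simplification A-NPI-M offers over the exact NPI-M constraints.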

Conclusions

In this paper we have analysed the set of probability distributions associated with the NPI model for multinomial data. We have simplified its use by proving that the NPI lower and upper probabilities can be obtained via the singleton probabilities only. We proved that the set of probability distributions valid under NPI-M is not closed and convex.

With the aim of using NPI-M in applications via the maximum entropy measure, we have presented algorithms to obtain this measure within the following

Acknowledgements

This work has been supported by the Spanish "Consejería de Economía, Innovación y Ciencia de la Junta de Andalucía" under Project TIC-06016. The work of the first-named author has also been partially supported by the Spanish "Ministerio de Educación y Ciencia" under project TIN2007-67418-C03-03.

References (43)

  • J. Abellán et al., Disaggregated total uncertainty measure for credal sets, International Journal of General Systems (2006).
  • J. Abellán et al., Maximum entropy for credal sets, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (2003).
  • J. Abellán et al., Building classification trees using the total uncertainty criterion, International Journal of Intelligent Systems (2003).
  • J. Abellán et al., An algorithm that computes the upper entropy for order-2 capacities, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (2005).
  • R.M. Baker, Multinomial Nonparametric Predictive Inference: Selection, Classification and Subcategory Data, PhD thesis, ...
  • L.M. De Campos et al., Probability intervals: A tool for uncertainty reasoning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (1994).
  • G. Choquet, Théorie des capacités, Annales de l'Institut Fourier (1953/54).
  • F.P.A. Coolen, International encyclopedia of statistical sciences, chapter nonparametric predictive inference, ...
  • F.P.A. Coolen, T. Augustin, Learning from multinomial data: A nonparametric predictive alternative to the imprecise ...
  • A.P. Dempster, Upper and lower probabilities induced by a multivalued mapping, The Annals of Mathematical Statistics (1967).
  • M. Grabisch, The interaction and Möbius representations of fuzzy measures on finite spaces, k-additive measures: A survey.