Non-Quadratic Distances in Model Assessment

One natural way to measure model adequacy is by using statistical distances as loss functions. A related fundamental question is how to construct loss functions that are scientifically and statistically meaningful. In this paper, we investigate non-quadratic distances and their role in assessing the adequacy of a model and/or ability to perform model selection. We first present the definition of a statistical distance and its associated properties. Three popular distances, total variation, the mixture index of fit and the Kullback-Leibler distance, are studied in detail, with the aim of understanding their properties and potential interpretations that can offer insight into their performance as measures of model misspecification. A small simulation study exemplifies the performance of these measures and their application to different scientific fields is briefly discussed.


Introduction
Model assessment, that is assessing the adequacy of a model and/or ability to perform model selection, is one of the fundamental components of statistical analyses. For example, in the model adequacy problem one usually begins with a fixed model and interest centers on measuring the model misspecification cost. A natural way to create a framework within which we can assess model misspecification is by using statistical distances as loss functions. These constructs measure the distance between the unknown distribution that generated the data and an estimate from the data model. By identifying statistical distances as loss functions, we can begin to understand the role distances play in model fitting and selection, as they become measures of the overall cost of model misspecification. This strategy will allow us to investigate the construction of a loss function as the maximum error in a list of model fit questions. Therefore, our fundamental question is the following. How can one design a loss function ρ that is scientifically and statistically meaningful? We would like to be able to attach a specific scientific meaning to the numerical values of the loss, so that a value of the distance equal to 4, for example, has an explicit interpretation in terms of our statistical goals. When we select between models, we would like to measure the quality of the approximation via the model's ability to provide answers to important scientific questions. This presupposes that the meaning of "best fitting model" should depend on the "statistical questions" being asked of the model. Lindsay [1] discusses a distance-based framework for assessing model adequacy. A fundamental tenet of the framework for model adequacy put forward by Lindsay [1] is that it is possible and reasonable to carry out a model-based scientific inquiry without believing that the model is true, and without assuming that the truth is included in the model. 
All of this, of course, assumes that we have a way to measure the quality of the approximation to the "truth" that is offered by the model. This point of view never assumes the correctness of the model. Of course, it is rather presumptuous to label any distribution as the truth, since any basic modeling assumption generated by the sampling scheme that provided the data is never exactly true. An example of a basic modeling assumption is that the observations are independent and identically distributed from an unknown distribution τ. If the statistical questions of interest are expressed as a list of functionals T(·) that we wish to be uniformly close under the model M and under the true distribution τ, the model fit questions can be turned into a loss via
$$\rho(\tau, M) = \sup_{T} \left| T(\tau) - T(M) \right|,$$
where the supremum is taken over the class of functionals of interest. Using the supremum of the individual errors is one way of assessing overall error, and this measure has the nice feature that its value gives a bound on all individual errors. The statistical questions of interest may be global, such as: is the normal model correct in every aspect? Or we may be interested in answers about a few key characteristics, such as the mean.
Lindsay et al. [2] introduced a class of statistical distances, called quadratic distances, and studied their use in the context of goodness-of-fit testing. Furthermore, Markatou et al. [3] discuss extensively the chi-squared distance, a special case of quadratic distance, and its role in robustness. In this paper, we study non-quadratic distances and their role in model assessment. The paper is organized as follows. Section 2 presents the definition of a statistical distance and its associated properties. Sections 3-5 discuss in detail three popular distances, total variation, the mixture index of fit and the Kullback-Leibler distance, with the aim of understanding their role in model assessment problems. The likelihood distance is also briefly discussed in Section 5. Section 6 illustrates computation and applications of total variation, mixture index of fit and Kullback-Leibler distances. Finally, Section 7 presents discussion and conclusions pertaining to the use of total variation and mixture index of fit distances.

Statistical Distances and Their Properties
If we adopt the usual convention that loss functions are nonnegative in their arguments, are zero if the correct model is used, and take larger values when the two distributions are less similar, then the loss ρ(τ, M) can also be viewed as a distance between τ and M. In fact, we will always assume that for any two distributions F, G
$$\rho(F, G) \ge 0, \quad \text{with} \quad \rho(F, F) = 0.$$
If this holds, we will say that ρ is a statistical distance. Unlike the requirements for a metric, we do not require symmetry. In fact, there is no reason that the loss should be symmetric, as the roles of τ and M are different. We also do not require ρ to be nonzero when the arguments differ. This zero property will allow us to specify that two distributions are equivalent as far as our statistical purposes are concerned by giving them zero distance.
Furthermore, it is important to note that if τ is in H with τ = M_θ0, and M ∈ H, say M = M_θ, then the distance ρ(τ, M) induces a loss function on the parameter space via
$$L_\rho(\theta_0, \theta) = \rho(M_{\theta_0}, M_\theta).$$
Therefore, if τ is in the model, the losses defined by ρ are parametric losses.
We begin within the discrete distribution framework. Let T = {0, 1, 2, ..., T}, where T is possibly infinite, be a discrete sample space. On this sample space we define a true probability density τ(t), as well as a family of densities M = {m_θ(t) : θ ∈ Θ}, where Θ is the parameter space. Assume we have independent and identically distributed random variables X_1, X_2, ..., X_n producing the realizations x_1, x_2, ..., x_n from τ(·). We record the data as d(t) = n(t)/n, where n(t) is the number of observations in the sample with value equal to t. We note here that we use the word "density" in a generic fashion that incorporates both probability mass functions and probability density functions.
A rather formal definition of the concept of statistical distance is as follows.
Definition 1. Let τ, m be two probability density functions. Then ρ(τ, m) is a statistical distance between the corresponding probability distributions if ρ(τ, m) ≥ 0, with equality if and only if τ and m are the same for all statistical purposes.
We would require ρ(τ, m) to indicate the worst mistake that we can make if we use m instead of τ. The precise meaning of this statement is obvious in the case of total variation, which we discuss in detail in Section 3 of the paper.
We would also like our statistical distances to be convex in their arguments.

Definition 2.
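Let τ, m be a pair of probability density functions, with m being represented as m = αm_1 + (1 − α)m_2, 0 ≤ α ≤ 1, where m_1, m_2 are two probability density functions. We say that the statistical distance ρ(τ, m) is convex in the right argument if
$$\rho(\tau, \alpha m_1 + (1-\alpha) m_2) \le \alpha\, \rho(\tau, m_1) + (1-\alpha)\, \rho(\tau, m_2).$$

Definition 3.
Let τ, m be a pair of probability density functions, with τ being represented as τ = ατ_1 + (1 − α)τ_2, 0 ≤ α ≤ 1, where τ_1, τ_2 are two densities. Then, we say that ρ(τ, m) is convex in the left argument if
$$\rho(\alpha \tau_1 + (1-\alpha) \tau_2, m) \le \alpha\, \rho(\tau_1, m) + (1-\alpha)\, \rho(\tau_2, m).$$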
Lindsay et al. [2] define and study quadratic distances as measures of goodness of fit, a form of model assessment. In the next sections, we study non-quadratic distances and their role in the problem of model assessment. We begin with the total variation distance.

Total Variation
In this section, we study the properties of the total variation distance. We offer a loss function interpretation of this distance and discuss sensitivity issues associated with its use. We will begin with the case of discrete probability measures and then move to the case of continuous probability measures. The results presented here are novel and are useful in selecting the distances to be used in any given problem.
The total variation distance is defined as follows.

Definition 4.
Let τ, m be two probability distributions. We define the total variation distance between the probability mass functions τ, m to be
$$V(\tau, m) = \frac{1}{2} \sum_{t} \left| \tau(t) - m(t) \right|.$$
This measure is also known as the L_1-distance (without the factor 1/2) or index of dissimilarity.
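As a minimal sketch of this definition, assume both probability mass functions are supplied as lists of probabilities over the same finite sample space {0, 1, ..., T}, and that the distance carries the factor 1/2 mentioned above, so that V(τ, m) is half the sum of |τ(t) − m(t)|. The function name `total_variation` and the example vectors are illustrative choices, not taken from the paper.

```python
def total_variation(tau, m):
    """Half the L1 distance between two pmfs given on the same finite support."""
    if len(tau) != len(m):
        raise ValueError("both pmfs must be defined on the same support")
    return 0.5 * sum(abs(a - b) for a, b in zip(tau, m))


tau = [0.5, 0.3, 0.2]   # hypothetical "true" pmf
m = [0.4, 0.4, 0.2]     # hypothetical model pmf

print(total_variation(tau, m))
# Mutually singular pmfs sit at the maximum possible distance.
print(total_variation([1.0, 0.0], [0.0, 1.0]))
```

Representing a pmf as a plain list keeps the example dependency-free; for a sparse or infinite support a dictionary keyed by the sample point would be the more natural container.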

Proposition 1. The total variation distance is a metric.
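Proof. By definition, the total variation distance is non-negative, and V(τ, m) = 0 if and only if τ(t) = m(t) for all t. Moreover, it is symmetric because V(τ, m) = V(m, τ), and it satisfies the triangle inequality since, for any probability mass function g,
$$V(\tau, m) = \frac{1}{2}\sum_t |\tau(t) - m(t)| \le \frac{1}{2}\sum_t |\tau(t) - g(t)| + \frac{1}{2}\sum_t |g(t) - m(t)| = V(\tau, g) + V(g, m).$$
Thus, it is a metric.
The following proposition states that the total variation distance is convex in both the left and the right argument.
Proposition 2. Let τ, m be two probability mass functions. Then, for 0 ≤ α ≤ 1,
$$V(\alpha\tau_1 + (1-\alpha)\tau_2, m) \le \alpha V(\tau_1, m) + (1-\alpha) V(\tau_2, m), \qquad V(\tau, \alpha m_1 + (1-\alpha) m_2) \le \alpha V(\tau, m_1) + (1-\alpha) V(\tau, m_2).$$
Proof. It is a straightforward application of the definition of the total variation distance.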
The total variation measure has major implications for prediction probabilities. A statistically useful interpretation of the total variation distance is that it can be thought of as the worst error we can commit in probability when we use the model m instead of the truth τ. The maximum value of this error equals 1 and it occurs when τ, m are mutually singular.
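Denote by P_τ the probability of a set under the measure τ and by P_m the probability of a set under the measure m.
Proposition 3. Let τ, m be two probability mass functions. Then
$$V(\tau, m) = \sup_{A \subset B} \left| P_\tau(A) - P_m(A) \right|,$$
where A is a subset of the Borel set B.
Proof. Define the sets
$$B_1 = \{t : \tau(t) > m(t)\}, \quad B_2 = \{t : \tau(t) < m(t)\}, \quad B_3 = \{t : \tau(t) = m(t)\}.$$
Because on the set B_3 the two probability mass functions are equal, P_τ(B_3) = P_m(B_3), and hence
$$P_\tau(B_1) - P_m(B_1) = P_m(B_2) - P_\tau(B_2).$$
Note that, because of the nature of the sets B_1 and B_2, both terms in the last expression are nonnegative. Therefore
$$V(\tau, m) = \frac{1}{2}\left[ \{P_\tau(B_1) - P_m(B_1)\} + \{P_m(B_2) - P_\tau(B_2)\} \right] = P_\tau(B_1) - P_m(B_1).$$
For any event A, |P_τ(A) − P_m(A)| cannot exceed this value, because removing from A the points of B_2 (or of B_1) can only increase the difference in one direction; the supremum is therefore attained at A = B_1.
Remark 1. The model misspecification measure V(τ, M) has a "minimax" expression
$$V(\tau, \mathcal{M}) = \inf_{m \in \mathcal{M}} \sup_{A} \left| P_\tau(A) - P_m(A) \right|.$$
This indicates the sense in which the measure assesses the overall risk of using m instead of τ, and then chooses the m that minimizes this risk.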
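Proposition 3 can be checked numerically on a small sample space by enumerating every event A. The sketch below is an illustration rather than part of the paper: it assumes V(τ, m) is half the L1 distance, the names `total_variation` and `worst_event_error` and the example probabilities are hypothetical, and the brute-force search is feasible only for a handful of support points.

```python
from itertools import combinations


def total_variation(tau, m):
    """Half the L1 distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(tau, m))


def worst_event_error(tau, m):
    """Largest |P_tau(A) - P_m(A)| over all subsets A of the support."""
    points = range(len(tau))
    worst = 0.0
    for size in range(len(tau) + 1):
        for event in combinations(points, size):
            p_tau = sum(tau[t] for t in event)
            p_m = sum(m[t] for t in event)
            worst = max(worst, abs(p_tau - p_m))
    return worst


tau = [0.1, 0.4, 0.2, 0.3]
m = [0.25, 0.25, 0.25, 0.25]

# The event B1 = {t : tau(t) > m(t)} is where the supremum is attained.
b1 = [t for t in range(len(tau)) if tau[t] > m[t]]
gap_on_b1 = sum(tau[t] - m[t] for t in b1)

print(total_variation(tau, m), worst_event_error(tau, m), gap_on_b1)
```

The three printed numbers agree up to rounding, which is the "worst error in probability" reading of the distance.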
We now offer a testing interpretation of the total variation distance. We establish that the total variation distance can be obtained as the solution to a suitably defined optimization problem. Specifically, it equals the largest achievable difference between the power and the level over all test functions of a suitably defined testing problem.

Definition 5.
A randomized test function for testing a statistical hypothesis H 0 versus the alternative H 1 is a (measurable) function φ defined on R n and taking values in the interval [0, 1] with the following interpretation. If x is the observed value of X and φ(x) = y, then a coin whose probability of falling heads is y is tossed and H 0 is rejected when head appears. In the case where y is either 0 or 1, ∀x, the test is called non-randomized.
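The testing interpretation can be illustrated with a single discrete observation. The sketch below assumes a simple null H_0 with pmf f and a simple alternative H_1 with pmf g, represents a test function as a list of rejection probabilities φ(x), and compares the non-randomized test that rejects where g(x) > f(x) with a batch of randomly drawn randomized tests. All names (`power_minus_level`, `likelihood_ratio_test`) and the two pmfs are hypothetical.

```python
import random


def total_variation(f, g):
    """Half the L1 distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))


def power_minus_level(phi, f, g):
    """E_{H1}[phi(X)] - E_{H0}[phi(X)] for one observation, H0: f versus H1: g."""
    return sum(p * (gx - fx) for p, fx, gx in zip(phi, f, g))


def likelihood_ratio_test(f, g):
    """Non-randomized test that rejects H0 exactly where g(x) > f(x)."""
    return [1.0 if gx > fx else 0.0 for fx, gx in zip(f, g)]


f = [0.5, 0.3, 0.2]   # pmf under H0
g = [0.2, 0.3, 0.5]   # pmf under H1

best = power_minus_level(likelihood_ratio_test(f, g), f, g)

# Randomized tests drawn at random never beat the test above.
rng = random.Random(0)
random_tests = [[rng.random() for _ in f] for _ in range(1000)]
best_random = max(power_minus_level(phi, f, g) for phi in random_tests)

print(best, total_variation(f, g), best_random)
```

The first two printed values coincide, while the third stays below them, consistent with the distance being the maximal gap between power and level.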

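Proposition 4. Consider testing H_0 : τ(x) = f(x) versus H_1 : τ(x) = g(x), and let φ be a test function. Then
$$V(f, g) = \max_{\varphi} \left\{ E_{H_1}[\varphi(X)] - E_{H_0}[\varphi(X)] \right\}.$$
Proof. We have
$$E_{H_1}[\varphi(X)] - E_{H_0}[\varphi(X)] = \sum_x \varphi(x)\,[g(x) - f(x)].$$
Since 0 ≤ φ(x) ≤ 1, this quantity is maximized by the test function with φ(x) = 1 when g(x) > f(x) and φ(x) = 0 otherwise, and the maximum equals
$$\sum_{x :\, g(x) > f(x)} [g(x) - f(x)] = V(f, g).$$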
An advantage of the total variation distance is that it is not sensitive to small changes in the density. That is, if τ(t) is replaced by τ(t) + e(t), where ∑_t e(t) = 0 and ∑_t |e(t)| is small, then
$$\left| V(\tau + e, m) - V(\tau, m) \right| \le \frac{1}{2} \sum_t |e(t)|.$$
Therefore, when the changes in the density are small, V(τ + e, m) ≈ V(τ, m).
When describing a population, it is natural to describe it via the proportion of individuals in various subgroups. Having V(τ, m) small would ensure uniform accuracy for all such descriptions. On the other hand, populations are also described in terms of a variety of other variables, such as means. Having the total variation measure small does not imply that means are close on the scale of standard deviation.
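The last point can be made concrete with a small numerical sketch. It assumes pmfs stored as dictionaries mapping sample points to probabilities, and it uses a made-up model with mass 1/2 at 0 and at 1, perturbed by moving 1% of the mass to the distant point 1000. The helper names and all numbers are hypothetical.

```python
from math import sqrt


def total_variation(tau, m):
    """Half the L1 distance between two pmfs stored as {point: probability}."""
    support = set(tau) | set(m)
    return 0.5 * sum(abs(tau.get(t, 0.0) - m.get(t, 0.0)) for t in support)


def mean(p):
    return sum(t * w for t, w in p.items())


def std_dev(p):
    mu = mean(p)
    return sqrt(sum(w * (t - mu) ** 2 for t, w in p.items()))


delta = 0.01
m = {0: 0.5, 1: 0.5}
tau = {0: 0.5 * (1 - delta), 1: 0.5 * (1 - delta), 1000: delta}

v = total_variation(tau, m)
shift_in_sd_units = (mean(tau) - mean(m)) / std_dev(m)

print(v, shift_in_sd_units)
```

Here every event probability is reproduced by the model to within 0.01, yet the mean of τ lies roughly twenty model standard deviations away from the model mean.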

Remark 2.
The total variation distance is not differentiable in its arguments. Using V(d, m_θ) as an inference function, where d denotes the data estimate of τ (i.e., τ̂), yields estimators of θ that are not smooth and asymptotically normal when the model is true [4]. This feature is related to the pathologies of the variation distance described by Donoho and Liu [5]. However, if parameter estimation is of interest, one can use alternative divergences that are free of these pathologies.
We now study the total variation distance in continuous probability models.
Definition 6. The total variation distance between two probability density functions τ, m is defined as
$$V(\tau, m) = \frac{1}{2} \int \left| \tau(x) - m(x) \right| dx.$$
The total variation distance has the same interpretation as in the discrete probability model case.
One of the important issues in the construction of distances in continuous spaces is the issue of invariance, because the behavior of distance measures under transformations of the data is of interest. Suppose we take a monotone transformation of the observed variable X and use the corresponding model distribution; how does this transformation affect the distance between X and the model?
Invariance seems to be desirable from an inferential point of view, but difficult to achieve without forcing one of the distributions to be continuous and appealing to the probability integral transform for a common scale. In multivariate continuous spaces, the problem of transformation invariance is even more difficult, as there is no longer a natural probability integral transformation to bring data and model on a common scale.

Proposition 5.
Let V(τ_X, m_X) be the total variation distance between the densities τ_X, m_X for a random variable X. If Y = a(X) is a one-to-one transformation of the random variable X, then
$$V(\tau_Y, m_Y) = V(\tau_X, m_X).$$
Proof. Write
$$V(\tau_Y, m_Y) = \frac{1}{2}\int \left| \tau_Y(y) - m_Y(y) \right| dy = \frac{1}{2}\int \left| \tau_X(b(y)) - m_X(b(y)) \right| \, |b'(y)| \, dy,$$
where b(y) is the inverse transformation. Next, we do a change of variable in the integral. Set x = b(y), from which we obtain y = a(x) and dy = a'(x)dx; the prime denotes derivative with respect to the corresponding argument. Then, since |b'(a(x))| |a'(x)| = 1,
$$V(\tau_Y, m_Y) = \frac{1}{2}\int \left| \tau_X(x) - m_X(x) \right| dx = V(\tau_X, m_X).$$
A fundamental problem with the total variation distance is that it cannot be used to compute the distance between a discrete distribution and a continuous distribution, because the total variation distance between a continuous measure and a discrete measure is always the maximum possible, that is, 1. This inability of the total variation distance to discriminate between discrete and continuous measures can be interpreted as asking "too many questions" at once, without any prioritization. This limits its use despite its invariance characteristics.
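The invariance property can be checked numerically. The sketch below is an illustration under stated assumptions: τ_X is taken to be N(0, 1), m_X is N(1, 1), the transformation is the logistic map Y = 1/(1 + exp(−X)), and the integrals are approximated with a plain midpoint rule. The function names and the choice of transformation are hypothetical and not from the paper.

```python
from math import erf, exp, log, pi, sqrt


def normal_pdf(x, mu):
    """Density of N(mu, 1) at x."""
    return exp(-0.5 * (x - mu) ** 2) / sqrt(2.0 * pi)


def tv_midpoint(f, g, lo, hi, n=20000):
    """Midpoint-rule approximation of (1/2) * integral of |f - g| over [lo, hi]."""
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * step
        total += abs(f(x) - g(x))
    return 0.5 * total * step


def transformed_pdf(y, mu):
    """Density of Y = 1 / (1 + exp(-X)) when X ~ N(mu, 1).

    The inverse map is b(y) = log(y / (1 - y)), with b'(y) = 1 / (y * (1 - y)).
    """
    return normal_pdf(log(y / (1.0 - y)), mu) / (y * (1.0 - y))


tv_x = tv_midpoint(lambda x: normal_pdf(x, 0.0), lambda x: normal_pdf(x, 1.0), -10.0, 11.0)
tv_y = tv_midpoint(lambda y: transformed_pdf(y, 0.0), lambda y: transformed_pdf(y, 1.0), 0.0, 1.0)

# For two unit-variance normals one unit apart, V = 2 * Phi(1/2) - 1.
closed_form = erf(0.5 / sqrt(2.0))

print(tv_x, tv_y, closed_form)
```

The distance computed on the original scale, on the transformed scale, and from the closed form agree to within the integration error.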
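We now discuss the relationship between the total variation distance and Fisher information. Denote by m^(n) the joint density of n independent and identically distributed random variables. Then we have the following proposition.
Proposition 6. The total variation distance is locally equivalent to the Fisher information number, that is, for θ in a neighborhood of θ_0 and n large,
$$V\left(m^{(n)}_{\theta}, m^{(n)}_{\theta_0}\right) \approx |\theta - \theta_0| \sqrt{\frac{n\, I(\theta_0)}{2\pi}},$$
where m_θ, m_θ0 are two discrete probability models and I(θ_0) is the Fisher information number of a single observation.

Proof. By definition,
$$V\left(m^{(n)}_{\theta}, m^{(n)}_{\theta_0}\right) = \frac{1}{2} \sum_{t} \left| m^{(n)}_{\theta}(t) - m^{(n)}_{\theta_0}(t) \right|, \qquad t = (t_1, \dots, t_n).$$
Expand m^(n)_θ(t) using a Taylor series in the neighborhood of θ_0 to obtain
$$m^{(n)}_{\theta}(t) \approx m^{(n)}_{\theta_0}(t) + (\theta - \theta_0)\, {m^{(n)}_{\theta_0}}'(t),$$
where the prime denotes derivative with respect to the parameter θ. Further, write
$${m^{(n)}_{\theta_0}}'(t) = m^{(n)}_{\theta_0}(t) \sum_{i=1}^{n} u_{\theta_0}(t_i), \qquad u_{\theta_0}(t) = \frac{\partial}{\partial\theta} \log m_\theta(t) \Big|_{\theta = \theta_0},$$
so that
$$V\left(m^{(n)}_{\theta}, m^{(n)}_{\theta_0}\right) \approx \frac{1}{2} |\theta - \theta_0| \; E_{\theta_0} \left| \sum_{i=1}^{n} u_{\theta_0}(T_i) \right|.$$
Therefore, assuming that (1/√n) ∑ u_θ0(t_i) converges to a normal random variable in absolute mean, then
$$E_{\theta_0} \left| \frac{1}{\sqrt{n}} \sum_{i=1}^{n} u_{\theta_0}(T_i) \right| \to \sqrt{\frac{2 I(\theta_0)}{\pi}}, \qquad \text{and hence} \qquad V\left(m^{(n)}_{\theta}, m^{(n)}_{\theta_0}\right) \approx |\theta - \theta_0| \sqrt{\frac{n\, I(\theta_0)}{2\pi}}.$$
The total variation is a non-quadratic distance. It is, however, related to a quadratic distance, the Hellinger distance, defined as
$$H^2(\tau, m) = \frac{1}{2} \sum_t \left( \sqrt{\tau(t)} - \sqrt{m(t)} \right)^2,$$
by the following inequality.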

Proposition 7.
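Let τ, m be two probability mass functions. Then
$$H^2(\tau, m) \le V(\tau, m) \le \left\{ H^2(\tau, m) \left[ 2 - H^2(\tau, m) \right] \right\}^{1/2}.$$
Proof. Straightforward, using the definitions of the distances involved and the Cauchy-Schwarz inequality.
The Hellinger distance is closely related to Matusita's distance [6,7], which is based on the quantity ∑_t (√τ(t) − √m(t))² = 2H²(τ, m). Further, define the affinity between two probability densities by
$$\sum_t \tau^{1/2}(t)\, m^{1/2}(t),$$
so that H²(τ, m) equals one minus the affinity. The above inequality indicates the relationship between total variation and Matusita's distance.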
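Both the Hellinger bounds and the local link with Fisher information can be explored numerically. The sketch below rests on several assumptions that are not stated in this form in the text: H² is taken as half the sum of squared differences of square roots, the upper bound is read as {H²(2 − H²)}^(1/2), and the local approximation is read as V ≈ |θ − θ_0| √(n I(θ_0)/(2π)). For n independent Bernoulli(θ) observations, the distance between the joint distributions equals the distance between the corresponding Binomial(n, θ) count distributions, which is what the code computes. All function names and parameter values are hypothetical.

```python
from math import comb, pi, sqrt


def total_variation(tau, m):
    """Half the L1 distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(tau, m))


def hellinger_sq(tau, m):
    """H^2 = (1/2) * sum (sqrt(tau) - sqrt(m))^2."""
    return 0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(tau, m))


def affinity(tau, m):
    return sum(sqrt(a * b) for a, b in zip(tau, m))


def binomial_pmf(n, theta):
    return [comb(n, k) * theta ** k * (1.0 - theta) ** (n - k) for k in range(n + 1)]


# Hellinger bounds on total variation for two arbitrary pmfs.
tau = [0.5, 0.3, 0.2]
m = [0.2, 0.3, 0.5]
h2 = hellinger_sq(tau, m)
v = total_variation(tau, m)
upper = sqrt(h2 * (2.0 - h2))
print(h2, v, upper)

# Local behaviour for n i.i.d. Bernoulli observations near theta0.
n, theta0, theta = 100, 0.50, 0.51
exact = total_variation(binomial_pmf(n, theta), binomial_pmf(n, theta0))
fisher = 1.0 / (theta0 * (1.0 - theta0))   # Fisher information of one Bernoulli draw
approx = abs(theta - theta0) * sqrt(n * fisher / (2.0 * pi))
print(exact, approx)
```

The approximation is only expected to be good when θ is close to θ_0 and n is moderately large; moving θ further away makes the two printed values drift apart.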

Mixture Index of Fit
Rudas, Clogg, and Lindsay [8] proposed a new index of fit approach to evaluate goodness of fit in the analysis of contingency tables, based on the mixture model framework. The approach focuses attention on the discrepancy between the model and the data, and allows comparisons across studies. Suppose M is the baseline model. The family of models proposed for evaluating goodness of fit is a two-point mixture model given by
$$\mathcal{M}_\pi = \left\{ \tau : \tau(t) = (1 - \pi)\, m_\theta(t) + \pi\, e(t),\ m_\theta \in \mathcal{M},\ e(t)\ \text{arbitrary},\ \theta \in \Theta \right\}.$$
Here π denotes the mixing proportion, which is interpreted as the proportion of the population outside the model M. In the robustness literature the mixing proportion corresponds to the contamination proportion, as explained below. In the contingency table framework, m_θ(t) and e(t) describe the tables of probabilities for each latent class. The family of models M_π defines a class of nested models as π varies from zero to one. Thus, if the model M does not fit the data well, then by increasing π, the model M_π will be an adequate fit for π sufficiently large.
We can motivate the index of fit by thinking of the population as being composed of two classes with proportions 1 − π and π respectively. The first class is perfectly described by M , whereas the second class contains the "outliers". The index of fit can then be interpreted as the fraction of the population intrinsically outside M , that is, the proportion of outliers in the sample.
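Definition 7. The mixture index of fit of τ with respect to the model M is
$$\pi^*(\tau) = \inf\{\pi : \tau \in \mathcal{M}_\pi\}.$$
Definition 8. For a fixed model element m, define the statistical distance
$$\pi^*(\tau, m) = \inf\{\pi : \tau(t) = (1 - \pi)\, m(t) + \pi\, e(t),\ e\ \text{a probability distribution}\}.$$
Proposition 8. The set of distributions τ with π*(τ, m) ≤ π_0 is a simplex with extremal points (1 − π_0)m + π_0 δ_i, i = 0, ..., T.
Proof. Given τ with π* ≤ π_0, there exists a representation of the form
$$\tau = (1 - \pi_0)\, m + \pi_0\, e.$$
Write any arbitrary discrete distribution e as follows: e = e_0 δ_0 + ... + e_T δ_T, where ∑_{i=0}^{T} e_i = 1 and δ_i takes the value 1 at the (i + 1)th position and the value 0 everywhere else. Then
$$\tau = \sum_{i=0}^{T} e_i \left[ (1 - \pi_0)\, m + \pi_0\, \delta_i \right],$$
which belongs to a simplex.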

Proposition 9.
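We have
$$\pi^*(\tau, m) = \sup_t \left[ 1 - \frac{\tau(t)}{m(t)} \right] = 1 - \inf_t \frac{\tau(t)}{m(t)}.$$
Proof. Define
$$\lambda = \sup_t \left[ 1 - \frac{\tau(t)}{m(t)} \right].$$
Then
$$\tau(t) - (1 - \lambda)\, m(t) \ge 0,$$
with equality at some t. Let now the error term be
$$e^*(t) = \frac{1}{\lambda} \left[ \tau(t) - (1 - \lambda)\, m(t) \right].$$
Then τ(t) = (1 − λ)m(t) + λe*(t), and λ cannot be made smaller without making e*(t) negative at a point t_0. This concludes the proof.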
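The construction in this proof translates directly into a few lines of code. The sketch below assumes the closed form π*(τ, m) = 1 − min_t τ(t)/m(t), with the minimum taken over points where m(t) > 0, and builds the residual distribution e* used in the proof. The function names and the two example pmfs are hypothetical.

```python
def mixture_index_of_fit(tau, m):
    """pi*(tau, m) = 1 - min_t tau(t) / m(t), taken over points with m(t) > 0."""
    return 1.0 - min(a / b for a, b in zip(tau, m) if b > 0)


def residual_distribution(tau, m):
    """e*(t) = [tau(t) - (1 - pi*) m(t)] / pi*, defined when pi* > 0."""
    pi_star = mixture_index_of_fit(tau, m)
    if pi_star <= 0.0:
        raise ValueError("tau equals m, so there is no residual component")
    return [(a - (1.0 - pi_star) * b) / pi_star for a, b in zip(tau, m)]


tau = [0.5, 0.25, 0.25]
m = [0.25, 0.5, 0.25]

pi_star = mixture_index_of_fit(tau, m)
e_star = residual_distribution(tau, m)

# Rebuild tau from the two-point mixture (1 - pi*) m + pi* e*.
rebuilt = [(1.0 - pi_star) * b + pi_star * e for b, e in zip(m, e_star)]

print(pi_star, e_star, rebuilt)
```

The residual e* is a proper pmf that vanishes at the point where τ(t)/m(t) is smallest, which is why π* cannot be reduced any further.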
One of the advantages of the mixture index of fit is that it has an intuitive interpretation that does not depend upon the specific nature of the model being assessed. Liu and Lindsay [9] extended the results of Rudas et al. [8] to the Kullback-Leibler distance. Computational aspects of the mixture index of fit are discussed in Xi and Lindsay [4] as well as in Dayton [10] and Ispány and Verdes [11].
Finally, a new interpretation of the mixture index of fit was presented by Ispány and Verdes [11]. Let P be the set of probability measures and H ⊂ P. If d is a distance measure on P and N(H, π) = {Q : Q = (1 − π)M + πR, M ∈ H, R ∈ P}, then π* = π*(P, H) is the least non-negative solution of the equation d(P, N(H, π)) := min_{Q ∈ N(H, π)} d(P, Q) = 0 in π.
Next, we offer some interpretations associated with the mixture index of fit. The statistical interpretations made with this measure are attractive, as any statement based on the model applies to at least 1 − π* of the population involved. However, while the "outlier" model seems interpretable and attractive, the distance itself is not very robust.
In other words, small changes in the probability mass function do not necessarily mean small changes in distance. This is because if m(t_0) = ε, then a change of ε in τ(t_0) from ε to 0 causes π*(τ, m) to go to 1. Moreover, assume that our framework is that of continuous probability measures, and that our model is a normal density. If τ(t) is a lighter tailed distribution than our normal model m(t), then
$$\lim_{|t| \to \infty} \frac{\tau(t)}{m(t)} = 0,$$
and therefore
$$\pi^*(\tau, m) = \sup_t \left[ 1 - \frac{\tau(t)}{m(t)} \right] = 1.$$
That is, light tailed densities are interpreted as 100% outliers. Therefore, the mixture index of fit measures error from the model in a "one-sided" way. This is in contrast to total variation, which measures the size of "holes" as well as the "outliers" by allowing the distributional errors to be neutral.
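The lack of robustness described here is easy to reproduce. The sketch below assumes the closed form π*(τ, m) = 1 − min_t τ(t)/m(t) and the half-L1 form of the total variation distance; the two-cell example with ε = 0.001 and the function names are hypothetical.

```python
def total_variation(tau, m):
    """Half the L1 distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(tau, m))


def mixture_index_of_fit(tau, m):
    """pi*(tau, m) = 1 - min_t tau(t) / m(t), taken over points with m(t) > 0."""
    return 1.0 - min(a / b for a, b in zip(tau, m) if b > 0)


eps = 0.001
m = [eps, 1.0 - eps]            # the model puts a tiny mass eps on the first cell
tau_before = [eps, 1.0 - eps]   # identical to the model
tau_after = [0.0, 1.0]          # the mass eps is removed from the first cell

print(mixture_index_of_fit(tau_before, m), total_variation(tau_before, m))
print(mixture_index_of_fit(tau_after, m), total_variation(tau_after, m))
```

Emptying a cell of probability ε moves the total variation distance by only ε, while the mixture index of fit jumps from 0 to 1, because a "hole" in τ relative to m cannot be explained by adding outliers.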
In what follows, we show that if we can find a mixture representation for the true distribution then this implies a small total variation distance between the true probability mass function and the assumed model m. Specifically, we have the following.

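Proposition 10. Let π* be the mixture index of fit. If τ(t) = (1 − π)m(t) + πe(t), then V(τ, m) ≤ π*.
Proof. Write
$$V(\tau, m) = \frac{1}{2} \sum_t \left| (1 - \pi)\, m(t) + \pi\, e(t) - m(t) \right| = \pi \cdot \frac{1}{2} \sum_t \left| e(t) - m(t) \right| = \pi\, V(e, m),$$
with π = π*. This is because there always exists a smallest unique π such that τ(t) can be represented as a mixture model. Thus, the above relationship can be written as
$$V(\tau, m) = \pi^*\, V(e, m) \le \pi^*,$$
since V(e, m) ≤ 1.
There is a mixture representation that connects total variation with the mixture index of fit. This is presented below.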
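Proposition 11. Write τ_i = τ(i) and m_i = m(i), and let π̃ be the largest value of π for which there exist probability mass functions e_1, e_2 such that
$$\pi\, \tau_i + (1 - \pi)\, e_{1i} = \pi\, m_i + (1 - \pi)\, e_{2i}, \qquad i = 0, \dots, T. \tag{1}$$
Then 1 − π̃ = V(τ, m)/(1 + V(τ, m)).
Proof. Let q_1i = (1 − π̃)e_1i and q_2i = (1 − π̃)e_2i, and note that, since ∑_i e_1i = ∑_i e_2i = 1,
$$\sum_i q_{1i} = \sum_i q_{2i} = 1 - \tilde\pi.$$
Rewrite now Equation (1) as follows:
$$q_{2i} - q_{1i} = \tilde\pi (\tau_i - m_i) = \tilde\pi (\tau_i - m_i)_+ - \tilde\pi (\tau_i - m_i)_-,$$
where (x)_+ = max(x, 0) and (x)_− = −min(x, 0). Thus, ignoring the constraints, every pair (e_1i, e_2i) satisfying the equation above also satisfies
$$q_{1i} = \tilde\pi (\tau_i - m_i)_- + \varepsilon_i, \qquad q_{2i} = \tilde\pi (\tau_i - m_i)_+ + \varepsilon_i,$$
for some number ε_i. Moreover, such a pair must have ε_i ≥ 0 in order for the constraints q_1i ≥ 0, q_2i ≥ 0 to be satisfied. Hence, varying ε_i over ε_i ≥ 0 gives a class of solutions. To determine π̃, sum each of the two equations over i,
$$1 - \tilde\pi = \tilde\pi \sum_i (\tau_i - m_i)_- + \sum_i \varepsilon_i, \qquad 1 - \tilde\pi = \tilde\pi \sum_i (\tau_i - m_i)_+ + \sum_i \varepsilon_i,$$
and adding these we obtain
$$2(1 - \tilde\pi) = 2 \tilde\pi\, V(\tau, m) + 2 \sum_i \varepsilon_i, \qquad \text{that is,} \qquad \tilde\pi = \frac{1 - \sum_i \varepsilon_i}{1 + V(\tau, m)},$$
and the maximum value is obtained when ∑ ε_i = 0 ⇒ ε_i = 0, ∀i. Therefore
$$\tilde\pi = \frac{1}{1 + V(\tau, m)},$$
and so
$$1 - \tilde\pi = \frac{V(\tau, m)}{1 + V(\tau, m)}.$$
Therefore, for small V(τ, m) the mixture index of fit and the total variation distance are nearly equal.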
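The two relationships discussed in this part of the section can be verified on a toy example. The sketch below is built on assumptions about the form of the results: the one-sided representation τ = (1 − π*)m + π*e is taken to give V(τ, m) = π* V(e, m) ≤ π*, and the two-sided representation is read as wτ + (1 − w)e_1 = wm + (1 − w)e_2 with largest weight w = 1/(1 + V). The variable names and pmfs are hypothetical.

```python
def total_variation(tau, m):
    """Half the L1 distance between two pmfs on the same finite support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(tau, m))


def mixture_index_of_fit(tau, m):
    """pi*(tau, m) = 1 - min_t tau(t) / m(t), taken over points with m(t) > 0."""
    return 1.0 - min(a / b for a, b in zip(tau, m) if b > 0)


tau = [0.5, 0.25, 0.25]
m = [0.25, 0.5, 0.25]
v = total_variation(tau, m)
pi_star = mixture_index_of_fit(tau, m)

# One-sided representation: tau = (1 - pi*) m + pi* e, hence V(tau, m) = pi* V(e, m).
e = [(a - (1.0 - pi_star) * b) / pi_star for a, b in zip(tau, m)]
one_sided = pi_star * total_variation(e, m)

# Two-sided representation with the largest weight w = 1 / (1 + V).
w = 1.0 / (1.0 + v)
e1 = [w * max(b - a, 0.0) / (1.0 - w) for a, b in zip(tau, m)]   # built from (tau - m)_-
e2 = [w * max(a - b, 0.0) / (1.0 - w) for a, b in zip(tau, m)]   # built from (tau - m)_+
lhs = [w * a + (1.0 - w) * x for a, x in zip(tau, e1)]
rhs = [w * b + (1.0 - w) * y for b, y in zip(m, e2)]

print(v, pi_star, one_sided)
print(1.0 - w, v / (1.0 + v), lhs, rhs)
```

In this example e_1 and e_2 turn out to be point masses, and the mixing weight 1 − w equals V/(1 + V), which is close to V itself whenever V is small.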

Kullback-Leibler Distance
The Kullback-Leibler distance [12] is extensively used in statistics and in particular in model selection. The celebrated AIC model selection criterion [13] is based on this distance. In this section, we present the Kullback-Leibler distance and some of its properties with particular emphasis on interpretations.
Definition 9. The Kullback-Leibler distance between two densities τ, m is defined as
$$K^2(\tau, m) = \sum_t m(t) \left[ \log m(t) - \log \tau(t) \right],$$
with the sum replaced by an integral in the continuous case.
Proposition 12. The Kullback-Leibler distance is nonnegative, that is
$$K^2(\tau, m) \ge 0,$$
with equality if and only if τ(t) = m(t).
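A minimal computational sketch of this distance follows. It assumes the convention K²(τ, m) = ∑_t m(t)[log m(t) − log τ(t)] for probability mass functions on a common finite support, treats 0·log 0 as 0, and returns infinity when m places mass where τ has none. The function name and the example pmfs are hypothetical.

```python
from math import inf, log


def kullback_leibler(tau, m):
    """K^2(tau, m) = sum_t m(t) * [log m(t) - log tau(t)], with 0 * log 0 taken as 0."""
    total = 0.0
    for a, b in zip(tau, m):
        if b == 0.0:
            continue       # a cell with m(t) = 0 contributes nothing
        if a == 0.0:
            return inf     # m puts mass where tau has none
        total += b * log(b / a)
    return total


tau = [0.7, 0.2, 0.1]
m = [0.4, 0.4, 0.2]

print(kullback_leibler(tau, m))    # distance with the arguments as given
print(kullback_leibler(m, tau))    # reversing the roles gives a different value
print(kullback_leibler(tau, tau))  # zero when the two pmfs coincide
```

The first two printed values differ, which is the lack of symmetry of this distance.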

Definition 10.
We define the likelihood distance between two densities τ, m as
$$\lambda^2(\tau, m) = \sum_t \tau(t) \left[ \log \tau(t) - \log m(t) \right].$$
The intuition behind the above expression of the likelihood distance comes from the fact that the log-likelihood in the case of discrete random variables taking n_j discrete values, with ∑_{j=1}^{m} n_j = n and m the number of groups, can be written, after appropriate algebraic manipulations, in the above form.
Alternatively, we can write the likelihood distance as
$$\lambda^2(\tau, m) = \sum_t m(t) \left[ r(t) \log r(t) - r(t) + 1 \right], \qquad r(t) = \frac{\tau(t)}{m(t)},$$
and use this relationship to obtain insight into connections of the likelihood distance with the chi-squared measures studied by Markatou et al. [3]. Specifically, if we write Pearson's chi-squared statistic as
$$P^2(\tau, m) = \sum_t \frac{[\tau(t) - m(t)]^2}{m(t)} = \sum_t m(t) \left[ r(t) - 1 \right]^2,$$
then from the functional relationship r log r − r + 1 ≤ (r − 1)² we obtain that λ²(τ, m) ≤ P²(τ, m). However, it is also clear from the right tails of the functions that there is no way to bound λ²(τ, m) below by a multiple of P²(τ, m). Hence, these measures are not equivalent in the same way that Hellinger distance and symmetric chi-squared are (see Lemma 4, Markatou et al. [3]). In particular, knowing that λ²(τ, m) is small is no guarantee that all Pearson z-statistics are uniformly small. On the other hand, one can show by the same mechanism that S² ≤ 2kλ², where k < 32/9 and S² is the symmetric chi-squared distance given as
$$S^2(\tau, m) = \sum_t \frac{[\tau(t) - m(t)]^2}{\frac{1}{2}\tau(t) + \frac{1}{2} m(t)}.$$
It is therefore true that small likelihood distance λ 2 implies small z-statistics with blended variance estimators. However, the reverse is not true because the right tail in r for S 2 is of magnitude r, as opposed to r log r for the likelihood distance.
These comparisons provide some feeling for the statistical interpretation of the likelihood distance. Its meaning as a measure of model misspecification is unclear. Furthermore, our impression is that likelihood, like Pearson's chi-squared, is too sensitive to outliers and gross errors in the data.
Despite the theoretical and computational advantages of the Kullback-Leibler distance, a point of inconvenience in the context of model selection is its lack of symmetry. One can show that reversing the roles of the arguments in the Kullback-Leibler divergence can yield substantially different results. The sum of the Kullback-Leibler distance and the likelihood distance produces the symmetric Kullback-Leibler distance or J divergence. This measure is symmetric in the arguments, and when used as a model selection measure it is expected to be more sensitive than each of the individual components.
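The comparisons in this section can be reproduced for any pair of strictly positive probability mass functions. The sketch below assumes the forms λ²(τ, m) = ∑ τ log(τ/m), K²(τ, m) = λ²(m, τ), P²(τ, m) = ∑ (τ − m)²/m and S²(τ, m) = ∑ (τ − m)²/((τ + m)/2); these readings, the function names and the example pmfs are assumptions made for illustration.

```python
from math import log


def likelihood_distance(tau, m):
    """lambda^2(tau, m) = sum_t tau(t) * [log tau(t) - log m(t)], for positive pmfs."""
    return sum(a * log(a / b) for a, b in zip(tau, m))


def kullback_leibler(tau, m):
    """K^2(tau, m): the likelihood distance with the roles of tau and m reversed."""
    return likelihood_distance(m, tau)


def pearson_chi2(tau, m):
    return sum((a - b) ** 2 / b for a, b in zip(tau, m))


def symmetric_chi2(tau, m):
    return sum((a - b) ** 2 / (0.5 * (a + b)) for a, b in zip(tau, m))


def j_divergence(tau, m):
    """Symmetric Kullback-Leibler distance: K^2 plus lambda^2."""
    return kullback_leibler(tau, m) + likelihood_distance(tau, m)


tau = [0.7, 0.2, 0.1]
m = [0.4, 0.4, 0.2]

lam = likelihood_distance(tau, m)
print(lam, pearson_chi2(tau, m))                       # lambda^2 does not exceed P^2
print(symmetric_chi2(tau, m) / lam)                    # ratio S^2 / lambda^2
print(lam, kullback_leibler(tau, m))                   # the two directions differ
print(j_divergence(tau, m), j_divergence(m, tau))      # J is symmetric
```

Only the ratio S²/λ² is reported, rather than a specific constant, since the bound quoted in the text is stated in terms of an unspecified k.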

Computation and Applications of Total Variation, Mixture Index of Fit and Kullback-Leibler Distances
The distances discussed in this paper are used in a number of important applications. Euán et al. [14] use the total variation to detect changes in wave spectra, while Alvarez-Esteban et al. [15] cluster time series data on the basis of the total variation distance. The mixture index of fit has found a number of applications in the social sciences. Rudas et al. [8] provided examples of the application of π* to two-way contingency tables. Applications involving differential item functioning and latent class analysis were presented in Rudas and Zwick [16] and Dayton [17], respectively. Formann [18] applied it in regression models involving continuous variables. Finally, Revuelta [19] applied the π* goodness-of-fit statistic to finite mixture item response models that were developed mainly in connection with Rasch models [20,21].
The Kullback-Leibler (KL) distance [12] is fundamental in information theory and its applications. In statistics, the celebrated Akaike Information Criterion (AIC) [13,22], widely used in model selection, is based on the Kullback-Leibler distance. There are numerous additional applications of the KL distance in fields such as fluid mechanics, neuroscience, and machine learning. In economics, Smith, Naik, and Tsai [23] use the KL distance to simultaneously select the number of states and variables associated with Markov-switching regression models that are used in marketing and other business applications. The KL distance is also used in diagnostic testing for ruling in or ruling out disease [24,25], as well as in a variety of other fields [26].
Table 1 presents the software, written in R, that can be used to compute the aforementioned distances. Additionally, Zhang and Dayton [27] present a SAS program to compute the two-point mixture index of fit for two-class latent class analysis models with dichotomous variables. There are a number of different algorithms that can be used to compute the mixture index of fit for contingency tables. Rudas et al. [8] propose to use a standard EM algorithm, while Xi and Lindsay [4] use sequential quadratic programming and discuss technical details and numerical issues related to applying nonlinear programming techniques to estimate π*. Dayton [10] discusses explicitly the practical advantages associated with the use of nonlinear programming as well as the limitations, while Pan and Dayton [28] study a variety of additional issues associated with computing π*. Additional algorithms associated with the computation of π* can be found in Verdes [29] and Ispány and Verdes [11].
We now describe a simulation study that aims to illustrate the performance of the total variation, Kullback-Leibler, and mixture index of fit as model selection measures. Data are generated from either an asymmetric (1 − ε)N(0, 1) + εN(µ, σ²) contamination model, or from a symmetric (1 − ε)N(0, 1) + εN(0, σ²) contamination model, where ε is the percentage of contamination. Specifically, we generate 500 Monte Carlo samples of sample sizes 200, 1000, and 5000 as follows. If the sample has size n and the percentage of contamination is ε, then nε observations are generated from the N(µ, σ²) or N(0, σ²) model and the remaining n(1 − ε) from a N(0, 1) model. We use µ = 1, 5, 10 and σ² = 1 in the N(µ, σ²) model and σ² = 4, 9, 16 in the N(0, σ²) model.
The total variation distance was computed between the simulated data and the N(0, 1) model. The Kullback-Leibler distance was calculated between the data generated from the aforementioned contamination models and a random sample of the same size n from N(0, 1). When computing the mixture index of fit, we specified the component distribution as a normal distribution with initial mean 0 and variance 1. All simulations were carried out on a laptop computer with an Intel Core i7 processor and a 64-bit Windows 7 operating system. The R packages used are presented in Table 1.
Tables 2 and 3 present means and standard deviations of the total variation and Kullback-Leibler distances as a function of the contamination model and the sample size. To compute the total variation distance we use the R function "TotalVarDist" of the R package "distrEx". It smooths the empirical distribution of the provided data using a normal kernel and computes the distance between the smoothed empirical distribution and the provided continuous distribution (in our case this distribution is N(0, 1)).
We note here that the package "distrEx" provides an alternative option to compute the total variation, which relies on discretizing the continuous distribution and then computing the distance between the discretized continuous distribution and the data. We think that smoothing the data to obtain an empirical estimator of the density and then calculating its distance from the continuous density is a more natural way to handle the difference in scale between the discrete data and the continuous model. Lindsay [1] and Markatou et al. [3] discuss this phenomenon and call it discretization robustness. The Kullback-Leibler distance was computed using the function "KLD.matrix" of the R package "bioDist".
We observe from the results of Tables 2 and 3 that the total variation distance for small percentages of contamination is small and generally smaller than the Kullback-Leibler distance for both asymmetric and symmetric contamination models, with a considerably smaller standard deviation. This behavior of the total variation distance in comparison to the Kullback-Leibler distance manifests itself across all sample sizes used.
Table 4 presents the mixture index of fit computed using the R function "pistar.uv" from the R package "pistar" (https://rdrr.io/github/jmedzihorsky/pistar/man/; accessed on 5 June 2018). Since the fundamental assumption in the definition of the mixture index of fit is that the population on which the index is applied is heterogeneous and expressed via the two-point model, we only used the asymmetric contamination model for various values of the contamination distribution.

Table 4. Means and standard deviations (SD) for the mixture index of fit. Data are generated from an asymmetric contamination model of the form (1 − ε)N(0, 1) + εN(µ, 1), µ = 1, 5, 10 with sample sizes, n, of 1000, 5000. The number of Monte Carlo replications is 500.
[Table body not recovered. Columns: percentage of contamination ε; summary; N(1, 1), N(5, 1) and N(10, 1), each for n = 1000 and n = 5000.]

We observe that the mixture index of fit generally estimates well the mixing proportion ε. We observe (see Table 4) that when the second population is N(1, 1) the bias associated with estimating the mixing (or contamination) proportion can be as high as 30.6%. This is expected because the population N(1, 1) is very close to N(0, 1), creating essentially a unimodal sample. As the means of the two normal components get more separated, the mixture index of fit provides better estimates of the mixing quantity and of the percentage of observations that need to be removed so that N(0, 1) provides a good fit to the remaining data points.
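The total variation part of the simulation design described in this section can be sketched in a few lines. This is not the code used in the paper and does not reproduce the algorithm of "TotalVarDist": it is a hedged illustration that assumes a Gaussian kernel density estimate with a rule-of-thumb bandwidth 1.06·n^(−1/5) on the unit scale of the N(0, 1) model, a midpoint-rule integral, and far fewer replications and a smaller sample size than the study. All function names and settings are hypothetical.

```python
import random
from math import exp, pi, sqrt
from statistics import mean, stdev


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2.0 * pi))


def contaminated_sample(n, eps, mu, sigma, rng):
    """round(n * eps) draws from N(mu, sigma^2), the remaining ones from N(0, 1)."""
    n_out = round(n * eps)
    return ([rng.gauss(mu, sigma) for _ in range(n_out)]
            + [rng.gauss(0.0, 1.0) for _ in range(n - n_out)])


def tv_to_standard_normal(data, grid_size=300):
    """TV between a Gaussian-kernel density estimate of the data and N(0, 1)."""
    n = len(data)
    h = 1.06 * n ** (-0.2)   # assumed bandwidth rule, not the one used by distrEx
    lo = min(min(data), -6.0) - 4.0 * h
    hi = max(max(data), 6.0) + 4.0 * h
    step = (hi - lo) / grid_size
    total = 0.0
    for i in range(grid_size):
        x = lo + (i + 0.5) * step
        kde = sum(normal_pdf(x, xi, h) for xi in data) / n
        total += abs(kde - normal_pdf(x))
    return 0.5 * total * step


def monte_carlo(n, eps, mu, sigma, replications, seed=0):
    """Mean and standard deviation of the estimated distance over replications."""
    rng = random.Random(seed)
    values = [tv_to_standard_normal(contaminated_sample(n, eps, mu, sigma, rng))
              for _ in range(replications)]
    return mean(values), stdev(values)


clean = monte_carlo(n=200, eps=0.0, mu=5.0, sigma=1.0, replications=10)
dirty = monte_carlo(n=200, eps=0.2, mu=5.0, sigma=1.0, replications=10)
print(clean, dirty)
```

Even without contamination the estimated distance is not zero, because of sampling variability and kernel smoothing; the contaminated samples give a visibly larger value, in line with the qualitative behavior reported for the study.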

Discussion and Conclusions
Divergence measures are widely used in scientific work, and popular examples of these measures include the Kullback-Leibler divergence, Bregman Divergence [30], the power divergence family of Cressie and Read [31], the density power divergence family [32] and many others. Two relatively recent books that discuss various families of divergences are Pardo [33] and Basu et al. [34].
In this paper we discuss specific divergences that do not belong to the family of quadratic divergences, and examine their role in assessing model adequacy. The total variation distance might be preferable as it seems closest to a robust measure, in that if the two probability measures differ only on a set of small probability, such as a few outliers, then the distance must be small. This was clearly exemplified in Tables 2 and 3 of Section 6. Outliers influence chi-squared measures more. For example, the Pearson's chi-squared distance can be made dramatically larger by increasing the amount of data in a cell with small model probability m θ (t). In fact, if there is data in a cell with model probability zero, the distance is infinite. Note that if data occur in a cell with probability, under the model, equal to zero, then it is possible that the model is not true. Still, even in this case, we might wish to use it on the premise that m θ provides a good approximation.
There is a pressing need for the further development of well-tested software for computing the mixture index of fit. This measure is intuitive and has found many applications in the social sciences. Reiczigel et al. [35] discuss bias-corrected point estimates of π*, as well as a bootstrap test and new confidence limits, in the context of contingency tables. Well-developed and tested software will further popularize the dissemination and use of this method.
The mixture index of fit ideas were extended in the context of testing general model adequacy problems by Liu and Lindsay [9]. Recent work by Ghosh and Basu [36] presents a systematic procedure of generating new divergences. Ghosh and Basu [36], building upon the work of Liu and Lindsay [9], generate new divergences through suitable model adequacy tests using existing divergences. Additionally, Dimova et al. [37] use the quadratic divergences introduced in Lindsay et al. [2] and construct a model selection criterion from which we can obtain AIC and BIC as special cases.
In this paper, we discuss non-quadratic distances that are used in many scientific fields where the problem of assessing the fitted models is of importance. In particular, our interest centered around the properties and potential interpretations of these distances, as we think this offers insight into their performance as measures of model misspecification. One important aspect for the dissemination and use of these distances is the existence of well-tested software that facilitates computation. This is an area where further development is required.
Author Contributions: M.M. developed the ideas and wrote the paper. Y.C. contributed to the proofs and simulations presented.
Funding: This research received no external funding.