On Entropy-type Measures and Divergences with Applications in Engineering, Management and Applied Sciences

In this work we review Entropy-type measures and Divergences, discuss their properties and unfold their diverse applicability. In addition, we compare distances between populations and distributions via weighted Entropy-type measures relying mainly on Relative Entropy and Jeffrey’s Distance with weights. Finally, we introduce the Absolute Weighted Relative Entropy and the Absolute Weighted Jeffrey’s Distance. Two applications are presented for illustration, one from Geosciences and one from Financial Mathematics.


Introduction
Information theory is a branch of pure and applied science that deals with the quantification of information; its roots lie in modern communication theory, where a communication system was first formulated as a stochastic process. Tuller (1950) and, later, Pierce (1956) observed the strong similarities between the underlying mechanisms of communication theory and information theory.
The evolution of the field, as well as the mathematical rigor that governs it, is attributed to Fisher (1956), Shannon (1956) and Wiener (1956). The most fundamental measure in information theory is entropy, which was first recognized, formulated and defined in statistical mechanics (Fisher, 1936; Shannon and Weaver, 1949) and consequently triggered the enormous development of the field. It should be noted that the concept of entropy was first used in Physics, in the field of thermodynamics (Clausius, 1865), while its statistical definition was developed by Boltzmann (around 1870); its applications, however, go well beyond Physics.
In the present work we approach entropy from a probabilistic or stochastic viewpoint and combine it with the concept of distance, which has numerous applications in Applied Sciences, Financial Mathematics, Engineering and Management Sciences. The concept of divergence is fundamental in data analysis, since it quantifies the distance between two populations, two models or two functions. By combining the two concepts and relying on Entropy-type divergences or measures, we can provide both researchers and practitioners with useful probabilistic tools for modelling purposes in various scientific areas, including goodness of fit in Reliability Theory or Survival Analysis, portfolio selection in Financial Mathematics, decision making in Management Sciences, Geosciences, etc.
Entropy has played an essential role in information science since the middle of the 20th century, when engineers and scientists began using the term "information" as something to be quantified. Claude Shannon (Shannon and Weaver, 1949), with his work "The Mathematical Theory of Communication", pioneered the branch of information theory.
The first scientist who tried to quantify the information of a message source, using only two numbers, was Ralph V. Hartley (1928). In 1948 Shannon provided a generalized form of Hartley's information measure that represents the information (or uncertainty) carried on average by a variable. In that article, Shannon introduced and examined the notions of entropy and mutual information. Entropy is a measure for quantifying the uncertainty of a random variable.
Mutual information measures the mutual dependence between two variables by quantifying the "amount of information" (in bits, nats or bans, depending on the base of the logarithm used) collected about one variable by observing the other. After Shannon's definition, many scientists tried to define other types of entropy. One particular generalization is the structural α-entropy of Havrda and Charvát (1967); different values of its parameter result in distinct entropy measures, and Shannon entropy is recovered as a special case when the parameter tends to 1. Tsallis entropy (1988), introduced by Constantino Tsallis, is another generalization of Shannon entropy and is similar to the Havrda-Charvát structural α-entropy (up to a different multiplying factor). Tsallis proposed replacing the usual Shannon entropy with his non-extensive entropy and maximizing it. Tsallis entropy has plenty of applications in astrophysics, fractal random walks, time series analysis and classification. Tsallis Relative entropy (1998), also introduced by Tsallis, is a generalization of Kullback Cross-entropy, which is one of the simplest distance measures (see below).
The Rényi (1961) entropy, which generalizes the Shannon entropy, involves a single parameter, called the order, that modifies the measure. Rényi entropy has a straightforward relation to Tsallis entropy, although its axiomatic characterizations are not as simple as those of Tsallis entropy. Rényi entropy has a variety of applications in applied fields such as Information theory, Time series, Classification and Cryptography. The relation between Information theory and Statistics was established by Kullback and Leibler (1951), who extended Shannon's notion of entropy and created the Kullback-Leibler measure of divergence, also known as "Relative Entropy". Their book "Information Theory and Statistics" marked the beginning of a new mathematical field called Statistical Information Theory.
Before Kullback and Leibler, scientists such as Mahalanobis (1936) and later Bhattacharyya (1943) had proposed various types of divergences, but the work of Kullback and Leibler made divergences mainstream in the scientific community. Divergence measures have applications in many scientific fields, such as Applied Mathematics, Probability theory, Statistics and Financial Mathematics. With divergence measures we establish the "distance" between two samples or two distributions; however, divergence measures are not metrics in the strict mathematical sense, because they are not symmetric and most of them do not fulfil the triangular inequality.
At this point Jeffreys (1946), in his work "An invariant form for the prior probability in estimation problems", proposed Jeffrey's Distance, which is the symmetric version of Relative entropy.
Divergence measures also play a significant role in statistical inference based on Entropy-type measures. In the field of Model Selection, Akaike (1973) was the first to propose the well-known Akaike Information Criterion (AIC), constructed as an unbiased estimator of the expected Relative entropy. Relative entropy is also a very useful tool in clustering. Yang et al. (2019) used a hierarchical clustering analysis method based on Relative entropy, applied to geochemical exploration data; they observed that the Relative entropy can describe the dissimilarity of pairwise geochemical datasets. Mager et al. (2004) used the Relative entropy as a clustering technique in the power spectral analysis of beat-to-beat heart rate variability (HRV). Goodness of fit tests are important tools for examining whether a dataset is compatible with a theoretical probability distribution or whether two datasets share the same distribution; the Relative entropy goodness of fit test was proposed by Song (2002).
In this work we extend the classical Entropy-type measures to weighted ones. Weighted Entropy-type measures play a significant role in many scientific fields, as mentioned above. If we wish to focus on a specific characteristic of two populations more than on others, we have to assign different weights to different parts of the support of the distribution. This need led scientists to rebuild the original Shannon entropy into a weighted one. Guiaşu (1971) was the first to propose the weighted entropy and to establish its properties. The remaining sections of the paper are organized as follows. In Section 2 we review the basic definitions of Divergence measures, different entropies and Entropy-type measures. In Section 3 we introduce the Weighted Entropy-type and Absolute Weighted Entropy-type measures; by using these measures, one can focus on specific parts of a distribution or population. In Sections 4, 5 and 6 we present some of the areas of application of Entropy-type measures. For the applications we have chosen an example from Geosciences and a second one from Financial Mathematics.

Entropy-type Measures and Divergence Measures
The sense of distance plays a very important role in probability theory, mathematical statistics, engineering, applied sciences, etc. The concept of measuring distances provides various significant results for the populations we wish to study. Usually, we measure a characteristic of a population against a reference point in order to derive useful results. Below we present the definitions of some of the most popular distance/divergence measures and provide some important identities. Among others, we present the definitions of a metric and a divergence, the Shannon entropy, and some generalizations and extensions of entropy such as the Rényi and Tsallis entropies.

Definition 2. (Divergence)
Suppose S is the space of all probability distributions with the same support. A Divergence on S is a function D(·,·) : S × S → ℝ⁺ ∪ {0} satisfying:
(i) D(P, Q) ≥ 0 for all P, Q ∈ S, and
(ii) D(P, Q) = 0 if and only if P = Q.
It is important to observe the distinction between divergences and metrics. Divergence measures are not necessarily metrics, because they do not have to be symmetric or to fulfil the triangular inequality. A classic example of a non-metric divergence is the Relative Entropy, also known as the Kullback-Leibler divergence (see Definitions 9 and 10 below), which is not symmetric but whose role, especially in statistical inference, is fundamental (e.g. maximum likelihood estimation).
The notion of Entropy in Information Theory plays an essential role in measuring information or the lack of it (uncertainty). Entropy is related to uncertainty through the probability of occurrence of the event of interest. Shannon (1948) connected the two concepts for measuring, or quantifying, the information from a discrete stochastic signal. The following definition is self-explanatory.

Definition 3. (Shannon Entropy)
Let a stochastic source be described by a discrete random variable X with distribution P, support S and probability mass function p(x). The Entropy of X is defined by:

H(X) = − ∑_{x∈S} p(x) log p(x).

Three standard properties of Shannon Entropy are:
(i) H(X) ≥ 0.
(ii) It is zero if and only if X describes a certain event.
(iii) It increases by adding an independent component and decreases by conditioning.
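As a quick numerical illustration of the definition and of properties (i)-(ii), consider the following minimal sketch (the function name is ours):

```python
import math

def shannon_entropy(p, base=2):
    """H(X) = -sum_i p_i log p_i for a probability vector p (bits by default)."""
    s = -sum(pi * math.log(pi, base) for pi in p if pi > 0)
    return s + 0.0  # normalises -0.0 to 0.0 for the degenerate case

print(shannon_entropy([0.5, 0.5]))   # 1.0 -- a fair coin carries one bit
print(shannon_entropy([1.0]))        # 0.0 -- a certain event has zero entropy
print(shannon_entropy([0.25] * 4))   # 2.0 -- more equally likely outcomes, more uncertainty
```

The uniform distribution maximizes the entropy for a fixed number of outcomes, which is why the four-outcome example exceeds the coin toss.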

Definition 5. (Tsallis Entropy)
For any positive real number α ≠ 1, the Tsallis Entropy (1988) of order α of a probability measure p on a finite set X is defined as

S_α(p) = (1 / (α − 1)) (1 − ∑_{x∈X} p(x)^α).

Note that entropies of this type had been studied by Havrda and Charvát (1967) long before Tsallis. The characterization of the Tsallis entropy is the same as that of the Shannon entropy, except that for the Tsallis Entropy the degree of homogeneity under the convex linearity condition is α instead of 1.

Definition 6. (The Tsallis Relative Entropy) Tsallis (1998) introduced a generalization of Cross Entropy called the Tsallis Relative Entropy, which is given by:

D_α(p ∥ p₀) = (1 / (1 − α)) (1 − ∑_{x∈S} p(x)^α p₀(x)^{1−α}),

where S is the support of the random variable, p(x) is a probability distribution and p₀(x) is a reference distribution.

Definition 7. (The Rényi Entropy) The Rényi Entropy of order α is defined by

H_α(p) = (1 / (1 − α)) log ( ∑_{x∈S} p(x)^α ),

where α is a positive constant with α ≠ 1 and p(x) is the probability mass (or, in the continuous case, density) function.
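The two generalizations above can be checked numerically; the sketch below (natural logarithms, function names ours) illustrates that both recover the Shannon entropy as the order α tends to 1:

```python
import math

def tsallis_entropy(p, alpha):
    """Tsallis entropy of order alpha != 1: (1 - sum p_i^alpha) / (alpha - 1)."""
    return (1.0 - sum(pi ** alpha for pi in p)) / (alpha - 1.0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha != 1: log(sum p_i^alpha) / (1 - alpha)."""
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

def shannon_entropy(p):
    """Shannon entropy in nats: the alpha -> 1 limit of both measures."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = [0.7, 0.2, 0.1]
# Both generalizations approach the Shannon entropy as alpha -> 1:
print(abs(tsallis_entropy(p, 1.0001) - shannon_entropy(p)) < 1e-3)  # True
print(abs(renyi_entropy(p, 1.0001) - shannon_entropy(p)) < 1e-3)    # True
```

For α far from 1 the three measures genuinely differ, which is what gives the generalized families their flexibility.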
The main focus of this work is on Relative entropy (Kullback and Leibler, 1951) which is an extremely useful tool in many scientific fields including engineering and applied sciences. Its definition is given below for both the discrete and the continuous case.
Definition 9. (Relative Entropy, discrete case) Consider two distributions P, Q with probability mass functions p = (p_1, …, p_k) and q = (q_1, …, q_k) respectively. Then the discrete version of the Relative Entropy is:

D(p, q) = ∑_{i=1}^{k} p_i log(p_i / q_i).

As is easily seen, the Relative entropy does not fulfil the property of symmetry. In such cases Jeffrey's Distance (Jeffreys, 1946), which is symmetric and closely associated with the Relative entropy, is preferred.
Proof. This proof shows the connection between Relative entropy, Cross-entropy and Shannon entropy, and also establishes that the Relative entropy is non-negative. We take H = − ∑_{i=1}^{k} p_i log p_i and introduce appropriate weights q = (q_1, …, q_k). Let p_1, …, p_k be the objective probabilities associated with q_1, …, q_k (i.e. q_i is the estimate of the theoretical p_i). Then

− ∑_{i=1}^{k} p_i log q_i ≥ − ∑_{i=1}^{k} p_i log p_i,

i.e. the subjective-objective measure of uncertainty (the Cross entropy) is at least as large as the measure of objective uncertainty (the Shannon entropy); rearranging gives D(p, q) = ∑_{i=1}^{k} p_i log(p_i / q_i) ≥ 0. This is due to the fact that the uncertainty of the objective probabilities is increased as a result of the uncertainty associated with the estimation of the p_i's by the q_i's. If p_i = q_i for all i, we have equality.
Similarly for ∑_{i=1}^{k} q_i log(q_i / p_i), where the q_i's are assumed to be the theoretical probabilities and p_i is the estimate of q_i. Note that this is a totally different setting, since the uncertainty of the true q_i's is not the same as before, and the same goes for the Cross entropy − ∑_{i=1}^{k} q_i log p_i.
For the non-negativity of Jeffrey's Distance, the proof is an immediate consequence of the previous proposition.
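The quantities appearing in the argument above can be verified numerically; a minimal sketch (function names ours):

```python
import math

def shannon(p):
    """Objective uncertainty: -sum_i p_i log p_i (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Subjective-objective uncertainty: -sum_i p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def relative_entropy(p, q):
    """D(p, q) = cross_entropy(p, q) - shannon(p) = sum_i p_i log(p_i / q_i)."""
    return cross_entropy(p, q) - shannon(p)

def jeffreys(p, q):
    """Jeffrey's Distance: the symmetrised Relative entropy."""
    return relative_entropy(p, q) + relative_entropy(q, p)

p, q = [0.6, 0.4], [0.3, 0.7]
print(cross_entropy(p, q) >= shannon(p))  # True: the inequality in the proof
print(relative_entropy(p, q) >= 0)        # True: non-negativity follows
print(jeffreys(p, q) == jeffreys(q, p))   # True: symmetry of Jeffrey's Distance
```

Note that relative_entropy(p, q) and relative_entropy(q, p) differ, which is exactly the asymmetry that Jeffrey's Distance removes.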

Weighted Entropy-type Measures
Sometimes the entropy alone is not informative enough. For example, if we want to focus on a specific characteristic of two populations more than on others, we have to assign different weights to different parts of the support. This need led scientists to rebuild the original Shannon Entropy into a weighted one. Guiaşu (1971) was the first to propose the Weighted Entropy. In this section we also provide the definition of the Weighted Relative Entropy.
Definition 12. (Weighted Shannon Entropy) Let a stochastic source be described by a discrete random variable with k possible events, distribution P, probability mass function p = (p_1, …, p_k), and let w = (w_1, …, w_k) be a vector of weights associated with these states, where w_i ≥ 0, i = 1, …, k. The weighted Shannon Entropy measure is defined by:

H_w(p) = − ∑_{i=1}^{k} w_i p_i log p_i.

Some of the standard properties of the weighted Shannon entropy (for details see Guiaşu, 1971) are:
(i) H_w(p) ≥ 0.
(ii) Suppose that A, B are two incompatible events of the experiment. We require that the weight of the union of these events is equal to the mean value of the weights of the respective events, i.e.

w(A ∪ B) = (P(A) w(A) + P(B) w(B)) / (P(A) + P(B)).
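A hedged sketch of the weighted entropy (the function name is ours): with unit weights it reduces to the ordinary Shannon entropy, while up-weighting a state emphasises that part of the support.

```python
import math

def weighted_shannon_entropy(p, w):
    """Guiasu weighted entropy H_w(p) = -sum_i w_i * p_i * log(p_i), w_i >= 0."""
    return -sum(wi * pi * math.log(pi) for wi, pi in zip(w, p) if pi > 0)

p = [0.5, 0.3, 0.2]
# Unit weights recover the ordinary Shannon entropy (in nats);
# up-weighting the third state increases its contribution.
print(weighted_shannon_entropy(p, [1, 1, 1]))
print(weighted_shannon_entropy(p, [1, 1, 5]))  # larger: third state up-weighted
```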
Through the modification of the above definition with absolute values, the resulting Absolute Weighted Relative Entropy (A.W.R.E.) is always non-negative. Using the A.W.R.E. as a proper distance tool, we can give more "attention" to a special part of a distribution. In order to incorporate the symmetry property into the previous definition, we propose below the Absolute Weighted Jeffrey's Distance, which is both non-negative and symmetric:

Definition 15. (Absolute Weighted Jeffrey's Distance (A.W.J.D))
Consider two distributions P, Q with probability mass functions p = (p_1, …, p_k) and q = (q_1, …, q_k) respectively, and let w = (w_1, …, w_k) be a vector of weights. Then the discrete version of the Absolute Weighted Jeffrey's Distance is:

AWJD(p, q; w) = ∑_{i=1}^{k} w_i ( |p_i log(p_i / q_i)| + |q_i log(q_i / p_i)| ).

The use of weights and absolute values allows the researcher to focus exclusively on specific parts of the distribution, and the corresponding Relative entropy terms are non-negative. In the three sections that follow we present three scientific areas where Entropy-type measures find some of their numerous applications.
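A minimal sketch of how these measures can be computed, assuming the absolute value is applied to each summand of the weighted Jeffrey's Distance (the function names awre and awjd are ours):

```python
import math

def awre(p, q, w):
    """Absolute Weighted Relative Entropy (assumed form):
    sum_i w_i * |p_i * log(p_i / q_i)|, non-negative by construction.
    Assumes q_i > 0 wherever p_i > 0."""
    return sum(wi * abs(pi * math.log(pi / qi))
               for wi, pi, qi in zip(w, p, q) if pi > 0)

def awjd(p, q, w):
    """Absolute Weighted Jeffrey's Distance (assumed form): the symmetrised
    version, so awjd(p, q, w) == awjd(q, p, w)."""
    return awre(p, q, w) + awre(q, p, w)

p, q = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]
w = [1.0, 1.0, 4.0]  # up-weight the last interval of the support
print(awjd(p, q, w) == awjd(q, p, w))  # True: symmetric
print(awre(p, q, w) >= 0)              # True: non-negative
```

Raising the weight of an interval inflates the contribution of any discrepancy located there, which is the intended "focusing" behaviour of the weighted measures.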

Clustering based on Entropy-type Measures
The problem of clustering concerns grouping a set of objects into groups or classes within each of which the objects are similar (homogeneous). Frequently we wish to quantify the dissimilarity between two populations, and clustering is a classical method for distributing populations into clusters: the greater the number of populations, the greater the number of clusters. There are many ways to measure the dissimilarity between two clusters, one of them being the Relative entropy defined previously. Note that the Relative entropy technique is quite similar to the Mahalanobis distance (1936), the main difference being that through the Relative entropy we can express the value in terms of the difference between the two clusters. By evaluating the Relative entropy we can show statistically whether two clusters are similar or not, making the Relative entropy a very useful tool in clustering. Yang et al. (2019) used a hierarchical clustering analysis method based on Relative entropy, applied to geochemical exploration data; they observed that the Relative entropy can describe the dissimilarity of pairwise geochemical datasets. Mager et al. (2004) used the Relative entropy as a clustering technique in the power spectral analysis of beat-to-beat heart rate variability (HRV); their research concerned the development of an algorithm that utilizes continuous wavelet transform (CWT) parameters as inputs to a Kohonen self-organizing map (Kohonen, 1990), providing a method of clustering.
All the above clearly show that the Relative entropy is a powerful and useful tool for comparing populations for clustering purposes.

Goodness of Fit based on Entropy-type Measures
Goodness of fit tests are important tools for testing whether a dataset is compatible with a theoretical probability distribution, or whether two datasets share the same distribution. The Relative entropy goodness of fit test was proposed by Song (2002). The relation between the goodness of fit test and the Relative entropy is shown below. Assume the test hypothesis

H₀ : p = q versus H₁ : p ≠ q.

The hypothesis of equality between the two densities p, q is equivalent to a test based on the measure D(·,·) defined in the previous section. Let C_i be a category, n_i the frequency of results belonging to C_i, and q_i = q(C_i), i = 1, …, k. Then the maximum likelihood (ML) estimator of p_i is p̂_i = n_i / n, and the ML-estimator of D(p, q) is

D(p̂, q) = ∑_{i=1}^{k} p̂_i log(p̂_i / q_i).

The vector n = (n_1, …, n_k) follows the k-dimensional multinomial distribution M(n; p_1, …, p_k). For very large sample sizes the vector p̂ has an asymptotic multivariate normal distribution N(p, (Σ − p p′)/n), where Σ is a diagonal matrix with diagonal elements p_i, i = 1, …, k, and p = (p_1, …, p_k). Simple algebra then shows that, in testing the hypothesis H₀ : D(p, q) = 0 against H₁ : D(p, q) > 0, we reject H₀ when the appropriately standardized version of D(p̂, q) exceeds z_{1−α}, the 1−α quantile of the standard normal distribution.
The previous result is very close to G², the well-known likelihood ratio test statistic (Neyman and Pearson, 1933), but in simulations (Sharifdoost et al., 2009) it appears to be more sensitive than G². Goodness of fit tests based on the Relative entropy are more sensitive than the usual methods for rejecting distributions which are close to the distribution at hand (Sharifdoost et al., 2009).
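As a hedged sketch of the connection above, the statistic 2n·D(p̂, q) is exactly the likelihood-ratio G² statistic, which is asymptotically chi-square with k − 1 degrees of freedom under H₀ (this is the G² form, not Song's exact normalized statistic):

```python
import math

def kl_gof_statistic(counts, q):
    """G^2 = 2n * D(p_hat || q) = 2 * sum_i n_i * log(n_i / (n * q_i)),
    asymptotically chi-square with k-1 degrees of freedom under H0."""
    n = sum(counts)
    return 2.0 * sum(ni * math.log(ni / (n * qi))
                     for ni, qi in zip(counts, q) if ni > 0)

# 600 die rolls tested against a fair-die hypothesis:
counts = [95, 102, 98, 107, 99, 99]
q = [1 / 6] * 6
g2 = kl_gof_statistic(counts, q)
print(g2)  # well below the chi-square(5) 5% critical value of about 11.07
```

Since the observed counts are close to the expected 100 per face, the statistic is small and the fairness hypothesis is not rejected.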

Model Selection based on Entropy-type Measures
Model selection is the field of statistics concerned with selecting an ideal statistical model from a set of candidate models. Model selection plays an important role in every scientific field. The first scientist who studied this concept in depth was Akaike (1973), who proposed the Akaike Information Criterion (AIC) by constructing an unbiased estimator of the expected Relative entropy. Below we briefly discuss the main characteristics of the AIC as well as of the Divergence Information Criterion (DIC).
Let g be the true model and f_θ a model which is used to estimate g. The Relative entropy (equivalent to Definition 10) between g and f_θ is:

D(g, f_θ) = ∫_S g(x) log( g(x) / f(x|θ) ) dx,

where θ is a parameter associated with f, for the estimation of which one uses the available data.
D(g, f_θ), with support S, represents the information lost when f_θ is used to estimate g. Equivalently we can write:

D(g, f_θ) = E_g[log g(X)] − E_g[log f(X|θ)].

The first expectation is a constant, say C, irrespective of the model used, so by computing E_g[log f(X|θ)], which is the continuous version of the Cross entropy (see Definition 4), we easily obtain the relative distance D(g, f_θ) − C between g and f_θ. Although this quantity cannot be computed directly, Akaike found that the expectation E_Y[E_X[log f(X|θ̂(Y))]] can be estimated. For this quantity, known as the expected Relative entropy information, the asymptotically unbiased estimator found by Akaike is log f(x|θ̂) − k, where k is the dimension of the parameter θ and θ̂ is a consistent estimate of θ. Then the AIC is:

AIC = −2 log f(x|θ̂) + 2k,

where θ̂ is the maximum likelihood estimator (or, equivalently, the minimum Relative entropy estimator). When selecting among various candidate models, the model with the smallest AIC value is the one with the least Relative entropy between the true distribution and the estimated one. We now present another useful information criterion for model selection, the Divergence Information Criterion (DIC; Mattheou et al., 2009). Following the same methodology as for the AIC, Mattheou et al. (2009) used the BHHJ Divergence (Basu et al., 1998) to develop this new criterion.
Consider a random sample X_1, …, X_n from the distribution g (the true model) and a candidate model f_θ. For the construction of the DIC one uses the expected overall discrepancy, which coincides with the BHHJ divergence between g and f_θ without its last term; that term remains constant, independent of the model f_θ. The target quantity is then the expected overall discrepancy evaluated at θ = θ̂, where θ̂ is an asymptotically normal estimator of θ; this expression can be viewed as the average distance between g and f_θ. Mattheou et al. (2009) derived an asymptotically unbiased estimator of n times the expected overall discrepancy evaluated at θ̂, which defines the DIC, and an adjusted DIC for the case of the MLE of θ is given by Mantalos et al. (2010). We must point out that the MLE method is computationally faster than other estimation methods. Moreover, the DIC criterion shows high accuracy in simulations and can also be used in applications with outliers and contaminated observations. Based on all the above we conclude that the DIC is a powerful criterion for model selection.
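To make the AIC recipe concrete, here is a minimal sketch (our own toy data and models, natural logarithms) comparing a fitted normal model against a fixed reference model:

```python
import math

def aic(log_likelihood, k):
    """AIC = 2k - 2 * logL; the model with the smaller value is preferred."""
    return 2 * k - 2 * log_likelihood

def normal_loglik(data, mu, sigma):
    """Log-likelihood of data under a Normal(mu, sigma^2) model."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

data = [4.1, 4.3, 4.2, 4.6, 4.4, 4.2, 4.5, 4.3]
mu = sum(data) / len(data)                                       # MLE of the mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))  # MLE of the sd

# Model A: Normal(mu_hat, sigma_hat), k = 2 estimated parameters.
# Model B: Normal(0, 1) fixed in advance, k = 0 estimated parameters.
aic_a = aic(normal_loglik(data, mu, sigma), 2)
aic_b = aic(normal_loglik(data, 0.0, 1.0), 0)
print(aic_a < aic_b)  # True: the fitted model wins despite its penalty
```

The 2k term is the penalty that discourages overfitting: a richer model is selected only when its likelihood gain outweighs the extra parameters.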

Applications
In this section we study two examples, one from Geosciences, focusing on a dataset of earthquakes, and a second from Financial Mathematics, focusing on the price comparison of two stocks. The purpose of the analysis is to assess the performance and the capabilities of the proposed absolute weighted entropy-type measures as opposed to standard entropy-type measures. In the first example we compare the distribution of the dataset with a specific candidate distribution, while in the second we compare the distributions of the logarithmic returns of two stocks.
The analysis is based on an algorithmic procedure whose steps are presented below. The method requires splitting the support of the distribution into a number of subintervals, for each of which the associated probabilities are evaluated. The method, which we call the "Middle method", uses the probabilities obtained for each of the subintervals of the support. It is then applied to the Entropy-type measures discussed in the previous sections in order to compare them. For the examples considered in this work we use n = 10 subintervals.
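The binning step of the "Middle method" can be sketched as follows (a hedged illustration with made-up data; the function name is ours):

```python
def interval_probs(data, edges):
    """Empirical probability mass of each subinterval [edges[i], edges[i+1]);
    the final subinterval is treated as closed on the right."""
    n = len(data)
    probs = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        last = (i == len(edges) - 2)
        count = sum(1 for x in data if lo <= x < hi or (last and x == hi))
        probs.append(count / n)
    return probs

# Ten illustrative magnitudes and a coarse split of the support [4, 5.25]:
data = [4.1, 4.3, 4.35, 4.6, 4.8, 5.1, 4.2, 4.4, 4.05, 5.25]
edges = [4.0, 4.25, 4.5, 4.75, 5.0, 5.25]
print(interval_probs(data, edges))  # [0.3, 0.3, 0.1, 0.1, 0.2]
```

The resulting probability vectors (one per population, or one empirical and one theoretical) are then fed into the weighted entropy-type measures defined earlier.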

The Geosciences Example
We collected data from the Hellenic Institute of Geodynamics (National Observatory of Athens), www.gein.noa.gr. The data concern 5384 earthquakes in Greece, from 1973 to 2004, with magnitude of at least 4 on the Richter scale. In this part of the work we examine the relationship between our dataset and the Shifted Exponential Distribution, which is a displacement of the Exponential Distribution to the right by 4 units. Firstly, we present the histogram of the data together with the Shifted Exponential Distribution (Figure 1). The average of the data is 4.280758 and the standard deviation is 0.3451494. We assume that the distribution that best fits the data is the Exponential Distribution shifted by 4 units, with parameter estimated from the sample mean 4.280758. For the implementation of the proposed method using the algorithm described earlier, we divide the support as follows: [4, 4.25), [4.25, 4.5), [4.5, 4.75), [4.75, 5), [5, 5.25), [5.25, 5.5), [5.5, 6), [6, 6.25), [6.25, 6.5), [6.5, 7]. The main idea behind splitting the support of the data is to assign specific weights to each interval. We then calculate the required Entropy-type measures. Table 1 provides the percentage of data in every interval for the real data and the Shifted Exponential Distribution respectively. Table 1. Percentages by interval: dataset vs shifted exponential distribution.

Figure 2 describes the "Middle method" and the comparison of four Weighted Entropy-type techniques. From Figure 2 we observe that the Relative entropy is not useful: firstly, it is not symmetric, and secondly it takes negative values. The latter occurs because whenever the numerator of the logarithm is smaller than the denominator, the corresponding term is negative. These defects can be resolved by using Jeffrey's Distance, which is both symmetric and always positive. Note, though, that Jeffrey's Distance is still not fully satisfactory: although each term is positive, the two elements of each term are not both positive; one is positive and one is negative, so that the result (even with the use of a large weight) will not be as extreme as it should. This defect can be resolved by using the Absolute Jeffrey's Distance, which combines the advantages of Jeffrey's Distance with the absolute value. It should be noted that using squares instead of the absolute value would not have the same effect, since each term in each summation is less than 1 and squaring would reduce the magnitude of the contribution of the most significant intervals (terms). Observe further that the use of Jeffrey's Distance together with the absolute value increases the measure when we focus on the last two intervals, where the difference is maximum.
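For reference, the theoretical percentages per interval can be reproduced from the fitted shifted exponential; the sketch below assumes the rate is estimated by moment matching, λ̂ = 1/(x̄ − 4), which is also the MLE when the shift is known:

```python
import math

# Fitted shifted exponential: shift c = 4, rate estimated from the
# sample mean 4.280758 reported in the text (mean = c + 1/lam).
c = 4.0
lam = 1.0 / (4.280758 - c)

def shifted_exp_prob(a, b):
    """P(a <= X < b) for X with density lam * exp(-lam * (x - c)), x >= c."""
    return math.exp(-lam * (a - c)) - math.exp(-lam * (b - c))

edges = [4.0, 4.25, 4.5, 4.75, 5.0, 5.25, 5.5, 6.0, 6.25, 6.5, 7.0]
probs = [shifted_exp_prob(a, b) for a, b in zip(edges[:-1], edges[1:])]
print([round(p, 4) for p in probs])  # strictly decreasing, as expected
```

These probabilities form the theoretical vector that the "Middle method" compares against the empirical interval percentages of the earthquake data.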

The Financial Mathematics Example
For the second application we collected, from www.finance.yahoo.com, the logarithmic returns of the index S&P500 and the logarithmic returns of Barrick Gold Corporation (GOLD) for a five-year period, from 05/Jan/2016 until 05/Jan/2021. Our purpose is to find the relationship ("distance") between the two stocks via Absolute Weighted and Weighted Entropy-type measures. The dataset contains 1258 observations for each stock. Firstly, we present the histograms of the data separately (Figure 3). Note that, although the choice of the above two stocks is purely illustrative of the capabilities of the proposed algorithmic procedure, financial products are often representative examples due to the negative correlation they frequently exhibit, a result of the belief that gold moves higher when economic conditions worsen and stock markets go down.
For the S&P500 returns the average is 0.000483 and the standard deviation is 0.012184876, while for the Barrick Gold Corporation returns the average is 0.000929 and the standard deviation is 0.025566474. For the implementation of the method we divide the support of the dataset into subintervals in the same manner as in the previous example. Table 2 provides the percentage of data in every interval of the dataset. Applying again the "Middle method" algorithm (as in the previous example) we reveal the relation ("distance") between the stocks S&P500 and GOLD and compare the relevant Entropy-type measures used (Figure 4).
Observe again that the Relative entropy takes negative values, while the Absolute Relative entropy is non-negative but takes smaller values than Jeffrey's Distance. Further, the largest differences between the two stocks are reported by the Absolute Jeffrey's Distance method. In conclusion, the Absolute Jeffrey's Distance gives the highest values for the distance between the two stocks. Observe also that if the Relative entropy is used, the two stocks appear to be almost equidistant.

Conclusions
The main purpose of this work is to review Entropy-type measures and Divergences, discuss their properties and unfold their diverse applicability. After presenting the necessary theory on Divergences and Entropy, we proposed and compared Weighted Entropy-type measures and revealed and explored their advantages as distance measures. More specifically, we first observed that the Weighted Relative Entropy technique is less accurate because it takes negative values, which violates the main idea of a distance. Then we presented the Weighted Jeffrey's Distance, which is symmetric but not sufficiently sensitive. Finally, we introduced the Absolute Weighted Relative Entropy (A.W.R.E.) and the Absolute Weighted Jeffrey's Distance (A.W.J.D.), both of which, and especially the second, gave larger distance values between two datasets while fulfilling the properties of symmetry and non-negativity.
To check the performance of the proposed methodology, we applied all the previous theoretical results in two applications. The first experiment, from Geosciences, focuses on the closeness of the distribution of earthquake magnitudes to a fitted distribution, while the second, from Financial Mathematics, deals with measuring the distance and the relation between two stocks. By introducing the Absolute Weighted Entropy-type methods we observed that the Absolute Jeffrey's Distance provides the best results (higher values) among all methods considered. Although entropy-type measures are important and useful in many fields, the Absolute Entropy-type measures proposed in this work can prove extremely useful in special cases.
In conclusion, based on the two real applications, the Absolute Jeffrey's Distance appears to be the most sensitive Entropy-type measure among all the techniques studied. This means that it produces larger values when we focus on specific parts which otherwise have indistinguishable dissimilarities, and it therefore provides the researcher with a useful tool for the many scientific fields where the interest focuses not on the entire distribution but on specific (special) parts of it.