Attention to the Variation of Probabilistic Events: Information Processing with Message Importance Measure

Events with different probabilities attract different degrees of attention in many scenarios, such as anomaly detection and security systems. To characterize the importance of events from a probabilistic perspective, the message importance measure (MIM) has been proposed as a semantics-analysis tool. Similar to Shannon entropy, the MIM has its own role in information representation, in which the parameter of the MIM plays a vital part. In fact, this parameter dominates the properties of the MIM, giving rise to three working regions in which the measure can be used flexibly for different goals. When the parameter is positive but not too large, the MIM not only provides a new viewpoint for information processing but also shares some similarities with Shannon entropy in information compression and transmission. In this regard, this paper first constructs a system model involving the message importance measure and proposes the message importance loss to enrich information-processing strategies. Moreover, the message importance loss capacity is proposed to measure the harvest of message importance in a transmission. Furthermore, the message importance distortion function is discussed to give an upper bound on information compression based on the MIM. Additionally, bitrate transmission constrained by the message importance loss is investigated to broaden the scope of Shannon information theory.


Introduction
In recent years, massive data have attracted much attention in various realistic scenarios. Indeed, data processing faces many challenges, such as distributed data acquisition, huge-scale data storage and transmission, and the representation of correlation or causality [1][2][3][4][5]. Facing these obstacles, it is promising to make good use of information theory and statistics to deal with mass information. For example, a method based on Max Entropy in Metric Space (MEMS) is utilized for local feature extraction and mechanical system analysis [6]; as an information measure different from Shannon entropy, Voronoi entropy is discussed to characterize random 2D patterns [7]; category theory, which can characterize the Kolmogorov-Sinai and Shannon entropy as unique functors, is used in autonomous and networked dynamical systems [8].
To some degree, probabilistic events attract different interest according to their probabilities. For example, considering that small-probability events hidden in massive data contain more semantic importance [9][10][11][12][13], people usually pay more attention to rare events (rather than common events) and design corresponding strategies for their information representation and processing in many applications, including outlier detection in the Internet of Things (IoT), smart cities and autonomous driving [14][15][16][17][18][19][20][21][22]. Therefore, the processing of probabilistic events has special value in information technology based on the semantic analysis of message importance.
In order to characterize the importance of probabilistic events, a new information measure named the MIM has been presented to generalize Shannon information theory [23][24][25]. Here, we investigate information processing, including compression (or storage) and transmission based on the MIM, to bring some new viewpoints to information theory. We first give a short review of the MIM.

Review of Message Importance Measure
Essentially, the message importance measure (MIM) was proposed to focus on the importance of probabilistic events [23]. In particular, the core idea of this information measure is that weights of importance are allocated to different events according to the corresponding events' probabilities. In this regard, as an information measure, the MIM may provide an applicable criterion to characterize message importance from the viewpoint of the inherent properties of events, without human subjective factors. For convenience of calculation, an exponential expression of the MIM is defined as follows.
Definition 1. For a discrete distribution P(X) = {p(x_1), p(x_2), ..., p(x_n)}, the exponential expression of the message importance measure (MIM) is given by

L(ϖ, X) = ∑_{i=1}^{n} p(x_i) e^{ϖ(1−p(x_i))},   (1)

where the adjustable parameter ϖ is nonnegative and p(x_i)e^{ϖ(1−p(x_i))} is viewed as the self-scoring value of event i, which measures its message importance.
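As a quick numerical sketch of Definition 1 (reading the exponential form as the sum of the self-scoring values p(x_i)e^{ϖ(1−p(x_i))}, with `w` standing for the importance coefficient ϖ):

```python
import math

def mim(probs, w):
    """Exponential form of the message importance measure (Definition 1):
    L(w, X) = sum_i p(x_i) * exp(w * (1 - p(x_i)))."""
    return sum(p * math.exp(w * (1.0 - p)) for p in probs)

# For a uniform binary source and w = 1, each self-scoring value is
# 0.5 * e^0.5, so L = e^0.5 ≈ 1.6487.
print(mim([0.5, 0.5], w=1.0))
```

A deterministic source scores exactly 1, since its single self-scoring value is 1 · e^0.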
Actually, from the perspective of generalized Fadeev's postulates, the MIM can be viewed as a rational information measure similar to Shannon entropy and Renyi entropy, which are respectively defined by

H(X) = − ∑_{i=1}^{n} p(x_i) log p(x_i),

H_α(X) = (1/(1−α)) log ∑_{i=1}^{n} p(x_i)^α   (α > 0, α ≠ 1),

where the condition on the variable X is the same as that described in Definition 1. In particular, a postulate for the MIM weaker than that for Shannon entropy and Renyi entropy is given by

F(PQ) ≤ F(P) + F(Q),

while F(PQ) = F(P) + F(Q) is satisfied for Shannon entropy and Renyi entropy [26], where P and Q are two independent random distributions and F(·) denotes a kind of information measure. Moreover, the crucial operator of the MIM for handling probability elements is the exponential function, while the corresponding operators of Shannon entropy and Renyi entropy are the logarithmic function and a polynomial function, respectively. In this case, the MIM can be viewed as a map assigning events' importance weights, or as the collection of the self-scoring values of events, different from conventional information measures.
As far as the application of the MIM is concerned, this information measure may provide a better method for detecting unbalanced events in signal processing. Ref. [27] has investigated minor-probability event detection by combining the MIM with Bayesian detection. Moreover, it is worth noting that the components of the MIM correspond physically to the normalized optimal data recommendation distribution, which makes a trade-off between users' preference and system revenue [28]. In this respect, the MIM plays a fundamental role in recommendation systems (a popular application of big data) from the theoretical viewpoint. Therefore, the MIM does not come from imagination directly; rather, it is a meaningful information measure originating from practical scenarios.

The Importance Coefficient in MIM
In general, the parameter ϖ, viewed as the importance coefficient, has a great impact on the MIM. Actually, different values of ϖ lead to different properties and performances of this information measure. In particular, to measure a distribution P(X) = {p(x_1), p(x_2), ..., p(x_n)}, there are three kinds of working regions of the MIM, classified by the parameter, whose details are discussed as follows.
(i) If the parameter satisfies 0 ≤ ϖ ≤ 2/max_i{p(x_i)}, the convexity of the MIM is similar to that of Shannon entropy and Renyi entropy. Actually, these three information measures all have maximum-value properties and allocate weights to the probability elements of the distribution P(X). It is notable that the MIM in this working region focuses on typical sets rather than atypical sets, which implies that the uniform distribution reaches the maximum value. In brief, the MIM in this working region can be regarded as the same class of message measure as Shannon entropy and Renyi entropy for dealing with problems of information theory.
(ii) If ϖ > 2/max_i{p(x_i)}, the small-probability elements become the dominant factor of the MIM in measuring a distribution. That is, small-probability events are highlighted more in this working region of the MIM than in the first one. Moreover, in this working region, the MIM pays more attention to atypical sets and can be viewed as a magnifier for rare events. In fact, this property corresponds to common scenarios where anomalies catch more eyes, such as anomaly detection and alarms. In this case, some problems (including communication and probabilistic event processing) can be rehandled from the perspective of rare-event importance. Particularly, compression encoding and maximum entropy rate transmission have been proposed based on the non-parametric MIM (namely NMIM) [24]; in addition, a distribution goodness-of-fit approach has also been presented by use of the differential MIM (namely DMIM) [29].
(iii) If the MIM has parameter ϖ < 0, the large-probability elements become the main contributors to the value of this information measure. In other words, normal events attract more attention in this working region of the MIM than rare events. In practice, this can be used in many applications where regular events are popular, such as filter systems and data cleaning.
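The contrast between the first two working regions can be illustrated numerically. The sketch below uses the exponential MIM of Definition 1 written as L(ϖ, X) = ∑_i p(x_i)e^{ϖ(1−p(x_i))}, with `w` standing for ϖ:

```python
import math

def mim(probs, w):
    # Exponential MIM: L(w, X) = sum_i p(x_i) * exp(w * (1 - p(x_i)))
    return sum(p * math.exp(w * (1.0 - p)) for p in probs)

uniform = [0.5, 0.5]   # 2/max{p(x_i)} = 4
skewed = [0.1, 0.9]    # 2/max{p(x_i)} ≈ 2.22

# Working region (i): w = 1 lies below both thresholds; the uniform
# distribution attains the larger MIM, as with Shannon entropy.
print(mim(uniform, 1.0) > mim(skewed, 1.0))   # True

# Working region (ii): w = 10 exceeds 2/max{p(x_i)}; the rare event
# with p = 0.1 dominates and the skewed distribution scores higher.
print(mim(uniform, 10.0) < mim(skewed, 10.0))  # True
```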
As a matter of fact, by selecting the parameter properly, we can exploit the MIM to solve several problems in different scenarios. The importance coefficient facilitates more flexibility of MIM in applications beyond Shannon entropy and Renyi entropy.
To focus on a concrete object, in this paper we mainly investigate the first working region of the MIM (namely 0 ≤ ϖ ≤ 2/max_i{p(x_i)}) and intend to dig out some novelties related to this metric for information processing.

Similarities and Differences between Shannon Entropy and MIM
In fact, when the parameter satisfies 0 ≤ ϖ ≤ 2/max_i{p(x_i)}, the MIM is similar to Shannon entropy with regard to expression and properties. The exponential operator of the MIM is a substitute for the logarithmic operator of Shannon entropy. As a tool based on probability distributions, the MIM with parameter 0 ≤ ϖ ≤ 2/max_i{p(x_i)} has the same concavity and monotonicity as Shannon entropy, which can characterize the information otherness of different variables.
By resorting to the exponential operator of the MIM, the weights of small-probability elements are amplified to some degree more than those of large-probability ones, which can be considered message importance allocation based on the self-scoring values. In this regard, the MIM may add fresh factors to information processing, taking into account the effects of probabilistic events' importance from an objective viewpoint.
In conventional Shannon information theory, data transmission and compression can both be viewed as an information transfer process from a variable X to Y. The capacity of information transmission is achieved by maximizing the mutual information between X and Y. Actually, there exists distortion of probabilistic events during an information transfer process, which denotes the difference between the source and its corresponding reconstruction. Due to this fact, it is possible to compress data based on an allowable information loss to a certain extent [30][31][32]. In Shannon information theory, rate-distortion theory is investigated for lossy data compression, whose essence is mutual information minimization under the constraint of a certain distortion. However, in some cases involving distortion, small-probability events containing more message importance require higher reliability than those with large probability. In this sense, another aspect of information distortion may be essential, in which message importance is considered a reasonable metric. In particular, the information transfer process can be characterized by the MIM (rather than the entropy) while controlling the distortion, which can be viewed as a new kind of information compression, compared with the conventional scheme of compressing redundancy to save resources. In fact, some information measures with respect to message importance have been investigated to extend the range of Shannon information theory [33][34][35][36][37]. In this regard, it is worthwhile to explore information processing in the sense of the MIM. Furthermore, it is also promising to investigate the Shannon mutual information constrained by the MIM in an information transfer process, which may become a novel system invariant.
In addition, similar to Shannon conditional entropy, a conditional message importance measure for two distributions is proposed to process conditional probability.

Definition 2.
For two discrete probability distributions P(X) = {p(x_1), p(x_2), ..., p(x_n)} and P(Y) = {p(y_1), p(y_2), ..., p(y_n)}, the conditional message importance measure (CMIM) is given by

L(ϖ, X|Y) = ∑_{y_j} p(y_j) ∑_{x_i} p(x_i|y_j) e^{ϖ(1−p(x_i|y_j))},   (4)

where p(x_i|y_j) denotes the conditional probability of x_i given y_j. The component p(x_i|y_j)e^{ϖ(1−p(x_i|y_j))} is similar to the self-scoring value. Therefore, the CMIM can be considered a system invariant which indicates the average total self-scoring value in an information transfer process.
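A small numerical sketch of Definition 2, reading the CMIM as the p(y_j)-weighted sum of the conditional self-scoring values (`w` stands for ϖ):

```python
import math

def cmim(p_y, p_x_given_y, w):
    """CMIM: L(w, X|Y) = sum_j p(y_j) * sum_i p(x_i|y_j) * exp(w * (1 - p(x_i|y_j)))."""
    return sum(py * sum(p * math.exp(w * (1.0 - p)) for p in cond)
               for py, cond in zip(p_y, p_x_given_y))

# If X and Y are independent, p(x|y) = p(x) and the CMIM collapses to
# the MIM of X, so conditioning on Y brings no importance change.
p_x = [0.3, 0.7]
print(cmim([0.4, 0.6], [p_x, p_x], w=1.0))
```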
Actually, the MIM is a metric with mathematical and physical meanings different from those of Shannon entropy and Renyi entropy, which provides its own perspective on processing probabilistic events. However, due to the similarity between the MIM and Shannon entropy, they may have analogous performance in some respects. To this end, information processing based on the MIM is discussed in this paper.

Motivation and Contributions
The purpose of this paper is to characterize probabilistic event processing, including compression and transmission, by means of the MIM. Particularly, in terms of the information processing system model shown in Figure 1, the message source ϕ (regarded as a random variable whose support set corresponds to the set of event types) can be measured by the amount of information H(·) and by the message importance L(·) according to its probability distribution. Then, the information transfer process, whose details are presented in Section 2, can be characterized based on these two metrics. Different from the mathematically probabilistic characterization of a traditional telecommunication system, this paper mainly discusses information processing from the perspective of message importance. In this regard, the harvest of message importance in a transmission is characterized by the proposed message importance loss capacity. Moreover, the upper bound of information compression based on the MIM is described by the message importance distortion function. In addition, we also investigate the trade-off between bitrate transmission and message importance loss to bring some inspiration to conventional information theory.

Organization
The rest of this paper is organized as follows. In Section 2, a system model involving message importance is constructed to help analyze data compression and transmission in big data. In Section 3, we propose a kind of message transfer capacity to investigate the message importance loss in transmission. In Section 4, the message importance distortion function is introduced and its properties are presented in detail. In Section 5, we discuss bitrate transmission constrained by message importance to widen the horizon of Shannon theory. In Section 6, some numerical results are presented to validate the propositions and the theoretical analysis. Finally, we conclude this paper in Section 7. Additionally, the fundamental notations of this paper are summarized in Table 1.

Notation	Description
ϕ	The message source in the information processing system model
ϕ̃	The mapped or compressed message with respect to ϕ
Ω	The received message transferred from ϕ̃
Ω̃	The recovered message with respect to ϕ obtained by the decoding process
ϖ	The importance coefficient
L(ϖ, ·)	The message importance measure (MIM) described in Definition 1
L(ϖ, ·|·)	The CMIM described in Definition 2
H(·)	The Shannon entropy
H(·|·)	The conditional Shannon entropy
∆L	The message importance loss described in Definition 3
C	The message importance loss capacity (MILC) described in Definition 4
p(y|x)	An information transfer matrix from the variable X to Y
{X, p(y|x), Y}	An information transfer process from the variable X to Y
β_s, β_e, β_k	The parameters of the binary symmetric matrix, binary erasure matrix and k-ary symmetric matrix, respectively
R_ϖ(D)	The message importance distortion function described in Definition 5
From the viewpoint of generalized information theory, a two-layer framework is considered to understand this model, where the first layer is based on the amount of information characterized by the Shannon entropy H(·), while the second layer rests on the message importance measure of events, denoted by L(·). Since the former has been discussed quite thoroughly, we mainly investigate the latter in this paper.
Considering the source-channel separation theorem [38], the above information processing model involves two problems, namely data compression and data transmission. On the one hand, the data compression of the system can be achieved by using classical source coding strategies to reduce redundancy, in which the information loss is described by H(ϕ) − H(ϕ|ϕ̃) under the information transfer matrix p(ϕ̃|ϕ). Similarly, from the perspective of message importance, the data can be further compressed by discarding worthless messages, where the message importance loss can be characterized by L(ϕ) − L(ϕ|ϕ̃). On the other hand, data transmission is discussed to obtain the upper bound of the mutual information H(ϕ̃) − H(ϕ̃|Ω), namely the information capacity. In a similar way, L(ϕ̃) − L(ϕ̃|Ω) means the harvest of message importance in the transmission.
In essence, data compression and transmission can both be considered information transfer processes {X, p(y|x), Y}, and they can be characterized by the difference between {X} and {X|Y}. In order to facilitate the analysis of the above model, the message importance loss is introduced as follows.
Definition 3. For an information transfer process {X, p(y|x), Y}, the message importance loss is defined as

∆L(ϖ) = L(ϖ, X) − L(ϖ, X|Y),

where L(ϖ, X) and L(ϖ, X|Y) are given by Equations (1) and (4), respectively.
In fact, according to the intrinsic relationship between L(ϖ, X) and L(ϖ, X|Y), it is readily seen in the light of Jensen's inequality that ∆L(ϖ) ≥ 0 if 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}, since each self-scoring term pe^{ϖ(1−p)} is then concave in p.
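The nonnegativity of the message importance loss can be checked numerically; the sketch below computes L(ϖ, X) − L(ϖ, X|Y) from a toy joint distribution (`w` stands for ϖ):

```python
import math

def score(probs, w):
    # sum of self-scoring values p * exp(w * (1 - p))
    return sum(p * math.exp(w * (1.0 - p)) for p in probs)

def importance_loss(joint, w):
    """Message importance loss L(w, X) - L(w, X|Y) for a joint
    distribution joint[i][j] = p(x_i, y_j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    cond_term = sum(py * score([joint[i][j] / py for i in range(len(joint))], w)
                    for j, py in enumerate(p_y))
    return score(p_x, w) - cond_term

# A correlated joint distribution: observing Y sharpens the posterior
# on X, so the loss is strictly positive for 0 < w < 2.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(importance_loss(joint, w=1.0))
```

For an independent joint distribution the loss vanishes, since p(x|y) = p(x).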

Message Importance Loss in Transmission
In this section, we introduce the CMIM to characterize the information transfer process. To do so, we define a kind of message transfer capacity measured by the CMIM as follows.

Definition 4.
Assume that there exists an information transfer process

{X, p(y|x), Y},

where p(y|x) denotes a probability distribution matrix describing the information transfer from the variable X to Y. We define the message importance loss capacity (MILC) as

C(ϖ) = max_{p(x)} { L(ϖ, X) − L(ϖ, X|Y) },   (9)

where L(ϖ, X) is defined by Equation (1), L(ϖ, X|Y) is defined by Equation (4), and 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}.
In order to have an insight into the applications of MILC, some specific information transfer scenarios are discussed as follows.

Binary Symmetric Matrix
Consider the binary symmetric information transfer matrix, in which each original symbol is flipped with a certain transfer probability, as seen in the following proposition.

Proposition 1.
Assume that there exists an information transfer process {X, p(y|x), Y}, where the information transfer matrix is

p(y|x) = [ 1−β_s   β_s
           β_s     1−β_s ],   (10)

which indicates that X and Y both follow binary distributions. In that case, we have

C(β_s) = e^{ϖ/2} − {(1−β_s)e^{ϖβ_s} + β_s e^{ϖ(1−β_s)}}.   (11)

Proof of Proposition 1. Assume that the distribution of the variable X is a binary distribution (p, 1 − p). According to Equation (10) and Bayes' theorem (namely, p(x|y) = p(x)p(y|x)/p(y)), it is not difficult to obtain the posterior distributions p(x|y). Furthermore, in accordance with Equations (4) and (9), the objective C(p, ϖ, β_s) = L(ϖ, X) − L(ϖ, X|Y) can be written out, and setting its partial derivative with respect to p to zero yields the solution p = 1/2. In the light of the sign of the corresponding second-order derivative, if β_s ≠ 1/2, the extreme value is indeed the maximum value of C(p, ϖ, β_s), attained at p = 1/2. Similarly, if β_s = 1/2, the solution p = 1/2 also results in the same conclusion. Substituting p = 1/2 into C(p, ϖ, β_s) gives Equation (11).

Remark 1.
According to Proposition 1, on the one hand, when β_s = 1/2, that is, the information transfer process is purely random, we obtain the lower bound of the MILC, namely C(β_s) = 0. On the other hand, when β_s = 0, namely the information transfer process is deterministic, we have the maximum MILC, C(0) = e^{ϖ/2} − 1. As for the distribution of the variable X, the uniform distribution is preferred in order to achieve the capacity.
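Proposition 1 and Remark 1 can be probed numerically. The sketch below sweeps the input probability p for the binary symmetric transfer matrix (hypothetical helper names; `w` stands for ϖ and `bs` for β_s):

```python
import math

def bsc_importance_loss(p, w, bs):
    """L(w, X) - L(w, X|Y) for a Bernoulli(p) input through the binary
    symmetric transfer matrix [[1-bs, bs], [bs, 1-bs]]."""
    mim = p * math.exp(w * (1 - p)) + (1 - p) * math.exp(w * p)
    py0 = p * (1 - bs) + (1 - p) * bs       # P(Y = 0)
    cmim = 0.0
    for py, post in ((py0, p * (1 - bs) / py0),
                     (1 - py0, p * bs / (1 - py0))):   # Bayes' theorem
        cmim += py * (post * math.exp(w * (1 - post))
                      + (1 - post) * math.exp(w * post))
    return mim - cmim

# The loss is maximized by the uniform input p = 1/2 ...
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: bsc_importance_loss(p, w=1.0, bs=0.1))
print(best_p)   # close to 0.5

# ... and at p = 1/2 the loss equals the value obtained by direct
# evaluation: e^{w/2} - (1-bs)e^{w*bs} - bs*e^{w*(1-bs)}.
closed = math.exp(0.5) - 0.9 * math.exp(0.1) - 0.1 * math.exp(0.9)
print(abs(bsc_importance_loss(0.5, 1.0, 0.1) - closed))
```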

Binary Erasure Matrix
The binary erasure information transfer matrix is similar to the binary symmetric one; however, in the former, part of the information is lost rather than corrupted. The MILC of this kind of information transfer matrix is discussed as follows.
Proposition 2. Consider an information transfer process {X, p(y|x), Y}, in which the information transfer matrix is described as

p(y|x) = [ 1−β_e   β_e   0
           0       β_e   1−β_e ],

which indicates that X follows a binary distribution and Y follows a 3-ary distribution (the middle symbol denoting the erasure). Then, we have

C(β_e) = (1 − β_e){e^{ϖ/2} − 1},   (16)

where 0 ≤ β_e ≤ 1 and 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}. Proof of Proposition 2. Assume the distribution of the variable X is (p, 1 − p). According to the binary erasure matrix and Bayes' theorem, the transfer matrix conditioned on the variable Y is as follows: when Y is not erased, p(x|y) is degenerate, while p(x|erasure) = (p, 1 − p). Then, it is not difficult to obtain L(ϖ, X|Y) = (1 − β_e) + β_e L(ϖ, p). Furthermore, it is readily seen that

C(p, ϖ, β_e) = (1 − β_e){L(ϖ, p) − 1},

where L(ϖ, p) = pe^{ϖ(1−p)} + (1 − p)e^{ϖp}. Moreover, the solution p = 1/2 leads to ∂L(ϖ, p)/∂p = 0, and the corresponding second derivative is

∂²L(ϖ, p)/∂p² = −ϖ{(2 − ϖp)e^{ϖ(1−p)} + (2 − ϖ(1 − p))e^{ϖp}} < 0,

which results from the condition 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}. Therefore, in the case p = 1/2, the capacity C(p, ϖ, β_e) reaches its maximum value, which gives Equation (16) with L(ϖ, 1/2) = e^{ϖ/2}. Remark 2. Proposition 2 indicates that, in the case β_e = 1, the lower bound of the capacity is obtained, that is, C(β_e) = 0. However, if the information transfer process is deterministic (namely β_e = 0), we have the maximum MILC. Similar to Proposition 1, the uniform distribution should be selected to reach the capacity in practice.
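For the erasure case the loss has a particularly simple form, since an unerased symbol reveals X exactly while an erasure leaves the prior unchanged; a sketch (`w` stands for ϖ, `be` for β_e):

```python
import math

def erasure_importance_loss(p, w, be):
    """L(w, X) - L(w, X|Y) for a Bernoulli(p) input through the binary
    erasure matrix. Unerased outputs make p(x|y) degenerate (score 1);
    an erasure keeps the prior, so L(w, X|Y) = (1-be) + be * L(w, p)."""
    L = p * math.exp(w * (1 - p)) + (1 - p) * math.exp(w * p)
    return L - ((1 - be) + be * L)     # = (1 - be) * (L - 1)

# be = 1 erases everything, so no importance is harvested.
print(erasure_importance_loss(0.3, w=1.0, be=1.0))   # 0.0

# For be < 1 the loss is maximized at p = 1/2, giving (1-be)(e^{w/2}-1).
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: erasure_importance_loss(p, w=1.0, be=0.2))
print(best_p)   # close to 0.5
```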

Strongly Symmetric Backward Matrix
As for the strongly symmetric backward matrix, it can be viewed as a special example of information transmission. The discussion of the message transfer capacity in this case is similar to that for the symmetric matrix, and the details are given as follows.
Proposition 3. For an information transmission from the source X to the sink Y, assume that there exists a strongly symmetric backward matrix as follows:

p(x|y) = [ 1−β_k       β_k/(K−1)   ...   β_k/(K−1)
           β_k/(K−1)   1−β_k       ...   β_k/(K−1)
           ...
           β_k/(K−1)   β_k/(K−1)   ...   1−β_k ],

which indicates that X and Y both obey K-ary distributions. We have

C(ϖ, β_k) = e^{ϖ(K−1)/K} − {(1−β_k)e^{ϖβ_k} + β_k e^{ϖ(1−β_k/(K−1))}},   (22)

where 0 ≤ β_k ≤ 1 and 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}. Proof of Proposition 3. For given K-ary variables X and Y whose distributions are {p(x_1), p(x_2), ..., p(x_K)} and {p(y_1), p(y_2), ..., p(y_K)} respectively, we can use the strongly symmetric backward matrix to obtain the relationship between the two variables as follows:

p(x_i) = (1−β_k)p(y_i) + ∑_{j≠i} (β_k/(K−1)) p(y_j),   (23)

which implies that p(x_i) is a one-to-one onto function of p(y_i).
In accordance with Definition 2, it is easy to see that

L(ϖ, X|Y) = ∑_{y_j} p(y_j){(1−β_k)e^{ϖβ_k} + β_k e^{ϖ(1−β_k/(K−1))}} = (1−β_k)e^{ϖβ_k} + β_k e^{ϖ(1−β_k/(K−1))},

which does not depend on the distribution of Y. Moreover, by virtue of the definition of the MILC in Equation (9), it is readily seen that

C(ϖ, β_k) = max_{p(x)} ∑_{x_i} p(x_i)e^{ϖ(1−p(x_i))} − {(1−β_k)e^{ϖβ_k} + β_k e^{ϖ(1−β_k/(K−1))}},   (25)

subject to ∑_i p(x_i) = 1. Then, by using the Lagrange multiplier method and setting the partial derivatives of the Lagrangian to zero, it can be readily verified that the extreme value of ∑_{x_i} p(x_i)e^{ϖ(1−p(x_i))} is achieved by the uniform distribution, that is, p(x_1) = p(x_2) = ... = p(x_K) = 1/K. In the case 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}, this extreme value is indeed a maximum. In addition, according to Equation (23), the uniform distribution of the variable X results from the uniform distribution of the variable Y.
Therefore, by substituting the uniform distribution for p(x) into Equation (25), we obtain the capacity C(ϖ, β_k).
Furthermore, in light of Equation (22), we can take the derivative of C(ϖ, β_k) with respect to β_k. By setting ∂C(ϖ, β_k)/∂β_k = 0, it is apparent that C(ϖ, β_k) reaches its extreme value in the case β_k = (K−1)/K. Additionally, when the parameter satisfies 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}, the second derivative of C(ϖ, β_k) with respect to β_k is positive, which indicates that the convex C(ϖ, β_k) reaches its minimum value 0 in the case β_k = (K−1)/K.
Remark 3. According to Proposition 3, when β_k = (K−1)/K, namely the channel is purely random, we obtain the lower bound of the capacity, namely C(ϖ, β_k) = 0. On the contrary, when β_k = 0 (that is, the channel is deterministic), we have the maximum capacity.

Distortion of Message Importance Transfer
In this section, we focus on information transfer distortion, a common problem of information processing. In a real information system, there exists inevitable information distortion caused by noise or other disturbances, even though the devices and hardware of telecommunication systems keep developing. Fortunately, there are still some bonuses from allowable distortion in some scenarios. For example, in conventional information theory, rate distortion is exploited for source compression, such as predictive encoding and hybrid encoding, which can save a lot of hardware resources and communication traffic [39].
Similar to the rate distortion theory for Shannon entropy [38], a kind of information distortion function based on MIM and CMIM is defined to characterize the effect of distortion on the message importance loss. In particular, there are some details of discussion as follows.

Definition 5.
Assume that there exists an information transfer process {X, p(y|x), Y} from the variable X to Y, where p(y|x) denotes a transfer matrix (the distributions of X and Y are denoted by p(x) and p(y), respectively). For a given distortion function d(x, y) (d(x, y) ≥ 0) and an allowable distortion D, the message importance distortion function is defined as

R_ϖ(D) = min_{p(y|x): D̄ ≤ D} { L(ϖ, X) − L(ϖ, X|Y) },   (29)

where

D̄ = ∑_{x_i} ∑_{y_j} p(x_i)p(y_j|x_i) d(x_i, y_j)   (30)

is the average distortion.
In this model, the information source X is given and our goal is to select an adaptive p(y|x) to achieve the minimum allowable message importance loss under the distortion constraint. This provides a new theoretical guidance for information source compression from the perspective of message importance.
In contrast to the rate distortion of Shannon information theory, this new information distortion function depends on the message importance loss rather than the entropy loss to choose an appropriate information compression matrix. In practice, there are some similarities and differences between rate-distortion theory and the message importance distortion in terms of source compression. On the one hand, both kinds of distortion-based encoding can be regarded as special information transfer processes, just with different optimization objectives. On the other hand, the new distortion theory tries to preserve rare events as far as possible, while conventional rate distortion focuses on the amount of information itself. To some degree, by reducing more redundant common information, the new source compression strategy based on rare events (viewed as message importance) may save more computing and storage resources in big data.

Properties of Message Importance Distortion Function
In this subsection, we discuss some fundamental properties of the distortion function based on message importance in detail.

Domain of Distortion
Here, we investigate the domain of allowable distortion, namely [D min , D max ], and the corresponding message importance distortion function values as follows.
(i) The lower bound D_min: Due to the fact that 0 ≤ d(x_i, y_j), the average distortion is nonnegative, namely 0 ≤ D̄. Considering D̄ ≤ D, we readily have the minimum allowable distortion, that is,

D_min = ∑_{x_i} p(x_i) min_{y_j} d(x_i, y_j),

which, for distortion functions with d(x, x) = 0, implies the distortionless case, namely Y is the same as X.
In addition, when the lower bound D_min (namely the distortionless case) is satisfied, it is readily seen that p(x_i|y_j) = 1 for x_i = y_j and p(x_i|y_j) = 0 otherwise, so that L(ϖ, X|Y) = 1, and according to Equation (29) the message importance distortion function is

R_ϖ(D_min) = L(ϖ, X) − 1,

where L(ϖ, X) = ∑_{x_i} p(x_i)e^{ϖ(1−p(x_i))} and 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}.
(ii) The upper bound D_max: When the allowable distortion satisfies D ≥ D_max, the variables X and Y can be made independent, that is, p(y|x) = p(y). Furthermore, it is not difficult to see that, when the distribution of the variable Y follows p(y_j) = 1 and p(y_l) = 0 (l ≠ j), we have the upper bound

D_max = min_{y_j} ∑_{x_i} p(x_i) d(x_i, y_j).

Additionally, on account of the independence of X and Y, namely p(x|y) = p(x), it is readily seen that R_ϖ(D) = 0 for D ≥ D_max.
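These two endpoints are easy to compute directly; the helper below (a hypothetical sketch) evaluates D_min = ∑_x p(x) min_y d(x, y) and D_max = min_y ∑_x p(x) d(x, y) for a distortion matrix:

```python
def distortion_domain(p_x, d):
    """Endpoints of the allowable distortion interval [D_min, D_max]
    for source probabilities p_x and distortion matrix d[i][j]."""
    # D_min: reproduce each source symbol by its best output symbol.
    d_min = sum(p * min(row) for p, row in zip(p_x, d))
    # D_max: the best single constant output y_j, which makes Y
    # independent of X (so the importance loss can reach 0).
    d_max = min(sum(p * d[i][j] for i, p in enumerate(p_x))
                for j in range(len(d[0])))
    return d_min, d_max

# Hamming distortion on a binary source with p = (0.3, 0.7):
# D_min = 0 (distortionless) and D_max = 1 - max{p(x_i)} = 0.3.
print(distortion_domain([0.3, 0.7], [[0, 1], [1, 0]]))
```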

The Convexity Property
For two allowable distortions D_a and D_b, whose optimal allowable information transfer matrices are p_a(y|x) and p_b(y|x) respectively, we have

R_ϖ(δD_a + (1−δ)D_b) ≤ δR_ϖ(D_a) + (1−δ)R_ϖ(D_b),

where 0 ≤ δ ≤ 1 and 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}. Proof. Refer to Appendix A.

The Monotonically Decreasing Property
For two given allowable distortions D_a and D_b, if 0 ≤ D_a < D_b < D_max is satisfied, we have R_ϖ(D_a) ≥ R_ϖ(D_b). On account of Equation (36) and the convexity property mentioned in Equation (37), it is not difficult to see that

R_ϖ(D_b) = R_ϖ(γD_a + (1−γ)D_max) ≤ γR_ϖ(D_a) + (1−γ)R_ϖ(D_max) = γR_ϖ(D_a) ≤ R_ϖ(D_a),

where 0 < γ < 1.

The Equivalent Expression
For an information transfer process {X, p(y|x), Y}, given a distortion function d(x, y), an allowable distortion D and the average distortion D̄ defined in Equation (30), the message importance distortion function defined in Equation (29) can be rewritten as

R_ϖ(D) = min_{p(y|x): D̄ = D} { L(ϖ, X) − L(ϖ, X|Y) },

where L(ϖ, X) and L(ϖ, X|Y) are defined by Equations (1) and (4), and D_min ≤ D < D_max.
Proof. For a given allowable distortion D, if there existed an allowable distortion D* (D_min ≤ D* < D < D_max) whose corresponding optimal information transfer matrix p*(y|x) achieved R_ϖ(D), we would have R_ϖ(D) = R_ϖ(D*), which contradicts the monotonically decreasing property.

Analysis for Message Importance Distortion Function
In this subsection, we investigate the computation of the message importance distortion function, which has a great impact on probabilistic event analysis in practice. Actually, the definition of the message importance distortion function in Equation (29) can be regarded as a special optimization, namely the minimization of the message importance loss with the symbol error less than or equal to the allowable distortion D. In particular, Definition 5 can also be expressed as the following optimization:

min_{p(y|x)} { L(ϖ, X) − L(ϖ, X|Y) }
s.t. ∑_{x_i} ∑_{y_j} p(x_i)p(y_j|x_i) d(x_i, y_j) ≤ D,  ∑_{y_j} p(y_j|x_i) = 1,  p(y_j|x_i) ≥ 0,

where L(ϖ, X) and L(ϖ, X|Y) are the MIM and CMIM defined in Equations (1) and (4). To take a computable optimization problem as an example, we consider the Hamming distortion as the distortion function d(x, y), namely d(x_i, y_i) = 0 and d(x_i, y_j) = 1 (i ≠ j). In order to reveal some intrinsic meanings of R_ϖ(D), we investigate the information transfer of a Bernoulli source as follows.

Proposition 4.
For a Bernoulli(p) source denoted by a variable X (without loss of generality, 0 ≤ p ≤ 1/2) and an information transfer process {X, p(y|x), Y} with Hamming distortion, the message importance distortion function is given by

R_ϖ(D) = L(ϖ, p) − L(ϖ, D)  for 0 ≤ D ≤ p,  and  R_ϖ(D) = 0  for D > p,

where L(ϖ, t) = te^{ϖ(1−t)} + (1−t)e^{ϖt}, and the corresponding information transfer matrix can be induced from the backward test channel p(x|y) with crossover probability D. Proof of Proposition 4. Refer to Appendix B.
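Proposition 4 can be probed with a brute-force search over binary transfer matrices (a sketch, not the paper's derivation): minimize the importance loss over channels whose average Hamming distortion stays within D, and observe that the result decreases in D and vanishes once D reaches p.

```python
import math

def channel_loss(p, w, a, b):
    """Importance loss L(w, X) - L(w, X|Y) for Bernoulli(p) X through
    the transfer matrix [[1-a, a], [b, 1-b]]."""
    joint = [[p * (1 - a), p * a], [(1 - p) * b, (1 - p) * (1 - b)]]
    mim = sum(q * math.exp(w * (1 - q)) for q in (p, 1 - p))
    cmim = 0.0
    for j in range(2):
        py = joint[0][j] + joint[1][j]
        if py > 0:
            cmim += py * sum((joint[i][j] / py) * math.exp(w * (1 - joint[i][j] / py))
                             for i in range(2))
    return mim - cmim

def R(D, p=0.3, w=1.0, steps=60):
    """Brute-force message importance distortion function for Hamming
    distortion: min loss over channels with average distortion <= D."""
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            a, b = i / steps, j / steps
            if p * a + (1 - p) * b <= D:
                best = min(best, channel_loss(p, w, a, b))
    return best

# R decreases with D and hits 0 at D = p (map everything to the
# likelier symbol); at D = 0 it equals L(w, X) - 1.
print(R(0.0), R(0.1), R(0.3))
```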

Bitrate Transmission Constrained by Message Importance
In this section, we investigate the information capacity in the case of a limited message importance loss. The objective is to achieve the maximum transmission bitrate under the constraint of a certain message importance loss ε. The maximum transmission bitrate is one of the system invariants of a transmission process, which provides an upper bound on the amount of information obtained by the receiver.
In an information transmission process, the information capacity is the mutual information between the encoded signal and the received signal, with dimension bit/symbol. In a real transmission, there always exists an allowable distortion between the sending sequence X and the received sequence Y, while a maximum allowable message importance loss is required to avoid too much distortion of important events. From this perspective, the message importance loss can be considered another constraint on the information transmission capacity beyond the information distortion. Therefore, it might play a crucial role in the design of transmission in information processing systems.
In particular, we characterize the maximization of mutual information constrained by a controlled message importance loss as follows:

P2:  max_{p(x)} I(X||Y)   s.t.  L(ϖ, X) − L(ϖ, X|Y) ≤ ε,   (44)

where I(X||Y) = ∑_{x_i, y_j} p(x_i)p(y_j|x_i) log [p(y_j|x_i)/p(y_j)], p(y_j) = ∑_{x_i} p(x_i)p(y_j|x_i), L(ϖ, X) and L(ϖ, X|Y) are the MIM and CMIM defined in Equations (1) and (4), and 0 < ϖ < 2 ≤ 2/max_i{p(x_i)}. Actually, bitrate transmission with a message importance loss constraint has a special solution in certain scenarios. In order to give specific examples, we investigate the optimization problem for a Bernoulli(p) source with a symmetric or erasure transfer matrix as follows.

Binary Symmetric Matrix
Proposition 5. For a Bernoulli(p) source X whose distribution is {p, 1 − p} (0 ≤ p ≤ 1/2) and an information transfer process {X, p(y|x), Y} with the transfer matrix

p(y|x) = [ 1−β_s   β_s
           β_s     1−β_s ],

the solution for P2 defined in Equation (44) is the mutual information evaluated at p*_s = min{p_s, 1/2}, as in Equation (46), where p_s is the solution of L(ϖ, X) − L(ϖ, X|Y) = ε (with L(ϖ, X) and L(ϖ, X|Y) as in the optimization problem P2), whose approximate value is given in Equation (47) with the parameter Θ given in Equation (48), and H(·) denotes the operator of the Shannon entropy, that is, H(p) = −[p log p + (1 − p) log(1 − p)].
Proof of Proposition 5. Considering the Bernoulli(p) source X following {p, 1 − p} and the binary symmetric matrix, it is not difficult to obtain

I(X||Y) = H(β_s + (1 − 2β_s)p) − H(β_s).

Moreover, define the Lagrange function G_s(p) = I(X||Y) + λ_s(L(ϖ, X) − L(ϖ, X|Y) − ε), where ε > 0, 0 ≤ p ≤ 1/2 and λ_s ≥ 0. It is not difficult to obtain the partial derivative of G_s(p), in which ∂C(p, ϖ, β_s)/∂p is given by Equation (14). By virtue of the monotonically increasing function log(x) for x > 0, it is easy to see that both I(X||Y) and the message importance loss are increasing in p on [0, 1/2]; hence the optimal solution is

p*_s = min{p_s, 1/2},

where p_s is the solution of L(ϖ, X) − L(ϖ, X|Y) = ε, and C_{β_s} is the MILC mentioned in Equation (11). By using a Taylor series expansion, the equation L(ϖ, X) − L(ϖ, X|Y) = ε can be solved approximately, whose solution is the approximate p_s of Equation (47). Therefore, by substituting p*_s into Equation (49), we have Equation (46).

Remark 4.
Proposition 5 gives the maximum transmission bitrate under the constraint of message importance loss. In particular, the maximum transmission bitrate at the receiver has a growth region and a smooth region with respect to the message importance loss ε. When the message importance loss is constrained to a small range, the real bitrate is less than the Shannon information capacity, and it involves the entropy of the symmetric matrix parameter, H(β_s).

Binary Erasure Matrix
Proposition 6. Assume that there is a Bernoulli(p) source X following the distribution {p, 1 − p} (0 ≤ p ≤ 1/2) and an information transfer process {X, p(y|x), Y} with the binary erasure matrix

p(y|x) = [ 1 − β_e, β_e, 0 ; 0, β_e, 1 − β_e ],

where 0 ≤ β_e ≤ 1. In this case, the solution for P2 described in Equation (44) is

C(ε) = (1 − β_e)H(p*_e),  with p*_e = min{p_e, 1/2},

where p_e is the solution of (1 − β_e){p e^{ϖ(1−p)} + (1 − p)e^{ϖp} − 1} = ε, whose approximate value is given in Equation (56).

Proof of Proposition 6. In the binary erasure matrix, considering the Bernoulli(p) source X whose distribution is {p, 1 − p}, it is readily seen that

I(X||Y) = (1 − β_e)H(p),

where H(·) denotes the Shannon entropy operator, namely H(p) = −[(1 − p) log(1 − p) + p log p].
Moreover, according to Definitions 1 and 2, it is easy to see that

L(ϖ, X) − L(ϖ, X|Y) = (1 − β_e){L(ϖ, p) − 1},

where L(ϖ, p) = p e^{ϖ(1−p)} + (1 − p)e^{ϖp}. Similar to the proof of Proposition 5, and considering that H(p) and L(ϖ, p) are monotonically increasing in p ∈ [0, 1/2], it is not difficult to see that the optimal solution p*_e is the maximal available p in the case 0 ≤ p ≤ 1/2, namely p*_e = min{p_e, 1/2}, where p_e is the solution of (1 − β_e){L(ϖ, p) − 1} = ε, and the upper bound C_{β_e} is given in Equation (16). By resorting to a Taylor series expansion, this equation can be approximated, from which the approximate solution p_e in Equation (56) is obtained. Therefore, Equation (55) is obtained by substituting p*_e into Equation (57).
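The erasure case admits a direct numerical check, since both the loss (1 − β_e){L(ϖ, p) − 1} and the bitrate (1 − β_e)H(p) are explicit in p. A minimal sketch, with illustrative function names:

```python
import numpy as np

def L(p, w):
    # L(w, p) = p e^{w(1-p)} + (1-p) e^{w p}
    return p * np.exp(w * (1.0 - p)) + (1.0 - p) * np.exp(w * p)

def H2(p):
    # binary Shannon entropy in bits
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def max_bitrate_bec(eps, beta_e, w=1.0, tol=1e-12):
    loss = lambda p: (1.0 - beta_e) * (L(p, w) - 1.0)
    if loss(0.5) <= eps:
        pe = 0.5                     # smooth region: erasure capacity point
    else:
        lo, hi = 0.0, 0.5            # loss is increasing on [0, 1/2]
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if loss(mid) < eps:
                lo = mid
            else:
                hi = mid
        pe = 0.5 * (lo + hi)
    return (1.0 - beta_e) * H2(pe), pe
```

For a large enough ε, the routine returns the ordinary erasure-channel capacity (1 − β_e) at p_e = 1/2; otherwise it returns the loss-constrained bitrate of Proposition 6.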
Remark 5. From Proposition 6, there are two regions for the maximum transmission bitrate with respect to the message importance loss. One depends on the message importance loss threshold ε; the other is related only to the erasure matrix parameter β_e.
Note that single-letter models are discussed here to show some theoretical results for information transfer under the constraint of message importance loss, which may be used in some special potential applications such as maritime international signals or switch signal processing. As a matter of fact, in practice, it is preferable to operate with multi-letter models, which can be applied to more scenarios such as multimedia communication, cooperative communications, multiple access, etc. As for these complicated cases, which may differ from conventional Shannon information theory, we shall consider them in the near future.

Numerical Results
This section provides numerical results to validate the theoretical analyses in this paper.

The Message Importance Loss Capacity
First of all, we give some numerical simulations with respect to the MILC in different information transmission cases. In Figure 2, it is apparent that if the Bernoulli source follows the uniform distribution, namely p = 0.5, the message importance loss reaches its maximum for the different matrix parameters β_s. That is, the numerical results of the MILC are {0.4081, 0.0997, 0, 0.2265} for the parameters β_s = {0.1, 0.3, 0.5, 0.8} and ϖ = 1, which corresponds to Proposition 1. Moreover, we also see that if β_s = 0.5, namely the transfer matrix is completely random, the MILC reaches its lower bound, that is, C = 0. In contrast, if the parameter satisfies β_s = 0, the upper bound of the MILC is attained, namely {0.1618, 0.4191, 0.6487, 1.7183} for ϖ = {0.3, 0.7, 1.0, 2.0}. Figure 3 shows that, for the binary erasure matrix, the MILC is reached under the same condition as that for the binary symmetric matrix, namely p = 0.5.
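The upper-bound values quoted for β_s = 0 can be reproduced in closed form: for a noiseless matrix the CMIM equals 1, and for the uniform source L(ϖ, X) = e^{ϖ/2}, so the MILC reduces to e^{ϖ/2} − 1. A quick check (this closed form is our reading of the quoted numbers, not an equation restated in this section):

```python
import numpy as np

# For a noiseless matrix (beta_s = 0) and the uniform Bernoulli source p = 1/2,
# L(w, X) = exp(w/2) and L(w, X|Y) = 1, so the MILC is exp(w/2) - 1.
for w, quoted in [(0.3, 0.1618), (0.7, 0.4191), (1.0, 0.6487), (2.0, 1.7183)]:
    milc = np.exp(w / 2) - 1.0
    print(f"w = {w}: MILC = {milc:.4f} (quoted {quoted})")
```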

Message Importance Distortion
We focus on the distortion of message importance transfer and give some simulations in this subsection. Figure 5 illustrates that the message importance distortion function R(D) is monotonically non-increasing with respect to the distortion D, which validates some properties mentioned in Section 4. Figure 6 shows the allowable maximum bitrate (characterized by the mutual information) constrained by a message importance loss in the Bernoulli(p) source case. It is worth noting that there are two regions for the mutual information in both transfer matrices. In the first region, the mutual information is monotonically increasing with respect to ε; in the second region, the mutual information is stable, namely the information transmission capacity is attained.

Experimental Simulations
In this subsection, we take a binary stochastic process (in which the random variable follows a Bernoulli distribution) as an example to validate the theoretical results. In particular, the Bernoulli(p) source X (whose distribution is denoted by P(X) = {p, 1 − p}, where 0 < p < 1) with the symmetric or erasure matrix (described by Equations (10) and (15)) is considered to reveal some properties of the message importance loss capacity (Section 3), the message importance distortion function (Section 4), as well as the bitrate transmission constrained by message importance (Section 5).
From Figure 7, it is seen that the uniform information source X (that is, P(X) = {1/2, 1/2}) leads to the maximum message importance loss (namely, the MILC) in both the symmetric matrix and the erasure matrix cases, which supports Propositions 1 and 2. Moreover, as the number of samples increases, the measured message importance loss tends to smooth out. In addition, the MILC for the symmetric transfer matrix is larger than that for the erasure one when the matrix parameters β_s and β_e are the same.
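The sample-based behavior described above can be reproduced with a short Monte Carlo sketch (illustrative code, using the same exponential MIM form as before): draw n symbols from the Bernoulli(p) source, pass them through the symmetric matrix, and evaluate the loss on the empirical distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_loss(p, beta_s, w, n):
    # Simulate n source symbols through the binary symmetric matrix and
    # estimate L(w, X) - L(w, X|Y) from the empirical joint distribution.
    x = rng.choice(2, size=n, p=[p, 1.0 - p])         # P(X = 0) = p
    y = np.where(rng.random(n) < beta_s, 1 - x, x)    # flip with prob beta_s
    joint = np.zeros((2, 2))
    np.add.at(joint, (x, y), 1.0)
    joint /= n
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    L = np.sum(px * np.exp(w * (1.0 - px)))
    Lc = sum(py[j] * np.sum((joint[:, j] / py[j]) * np.exp(w * (1.0 - joint[:, j] / py[j])))
             for j in range(2) if py[j] > 0)
    return L - Lc

# the uniform source should (approximately) attain the MILC
losses = {p: empirical_loss(p, 0.1, 1.0, 100_000) for p in (0.1, 0.3, 0.5)}
```

With n = 100,000 samples, the estimate at p = 0.5 lands close to the theoretical MILC value 0.4081 quoted earlier for β_s = 0.1 and ϖ = 1.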
As for the distortion of message importance transfer, we investigate the message importance loss based on different transfer matrices, as shown in Figure 8, where p_optimal(y|x) is the optimal transfer matrix, p_certain(y|x) = [1, 0; 0, 1] is the distortion-free identity matrix, D is the allowable distortion, and p is the probability element of the Bernoulli(p) source. Figure 8 illustrates that, when p_optimal(y|x) is selected as the transfer matrix, the message importance loss reaches the minimum, which corresponds to Proposition 4. In addition, if the transfer matrix is not the certain one (i.e., distortion exists), the message importance loss decreases with the increase of the allowable distortion. Considering the transmission with a message importance loss constraint, Figure 9 shows that, when p*_s (given by Equation (52)) and p*_e (given by Equation (59)) are selected as the probability elements of the Bernoulli(p) source for the symmetric matrix and the erasure matrix, respectively, the corresponding mutual information values are larger than those based on other probabilities (such as p_random1 = (1 − √(1 − 8ε))/2 and p_random2 = (1 − √(1 − 4ε))/2). In addition, it is not difficult to see that, when the parameter β_s is equal to β_e, the mutual information (constrained by a message importance loss) for the symmetric transfer matrix is larger than that for the erasure one. Figure 9. The mutual information I(X||Y) versus the message importance loss threshold ε (with parameter ϖ = 0.1) in the case of a Bernoulli(p) source X (that is, P(X) = {p, 1 − p} with different probability p). The number of samples observed from the source X is n = 10,000, and the transfer matrix is the symmetric matrix with parameter β_s = 0.1 or the erasure matrix with parameter β_e = 0.1.

Conclusions
In this paper, we investigated information processing from the perspective of an information measure, i.e., the MIM. Actually, with the help of the parameter ϖ, the MIM has more flexibility and can be used widely. Here, we focused on the MIM with 0 ≤ ϖ ≤ 2/max{p(x_i)}, which not only has the property of self-scoring values for probabilistic events but also has similarities with Shannon entropy in information compression and transmission. In particular, based on a system model with message importance processing, the message importance loss was presented. This measure can characterize the information distinction before and after a message transfer process. Furthermore, we proposed the message importance loss capacity, which provides an upper bound on the message importance harvest in an information transmission. Moreover, the message importance distortion function, which selects an information transfer matrix to minimize the message importance loss, was discussed to characterize the performance of lossy information compression from the viewpoint of the message importance of events. In addition, we exploited the message importance loss to constrain the bitrate transmission, so that the combined factors of message importance and amount of information can guide an information transmission. To validate the theoretical analyses, numerical results and experimental simulations were also presented in detail. As the next step, we look forward to exploiting real data to design applicable strategies for information processing based on the MIM, as well as investigating the performance of multivariate systems in the sense of the MIM.
Author Contributions: R.S., S.L. and P.F. all contributed to this work on investigation and writing.

Funding:
The authors appreciate the support of the National Natural Science Foundation of China (NSFC) No. 61771283.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
As for an allowable distortion D_0 = δD_a + (1 − δ)D_b, the average distortion for the information transfer matrix p_0(y|x) = δp_a(y|x) + (1 − δ)p_b(y|x) satisfies

∑_{x_i, y_j} p(x_i)p_0(y_j|x_i)d(x_i, y_j) = δ ∑_{x_i, y_j} p(x_i)p_a(y_j|x_i)d(x_i, y_j) + (1 − δ) ∑_{x_i, y_j} p(x_i)p_b(y_j|x_i)d(x_i, y_j) ≤ δD_a + (1 − δ)D_b = D_0,

which indicates that p_0(y|x) is an allowable information transfer matrix for D_0.
Moreover, by using Jensen's inequality and Bayes' theorem, we have for the CMIM with respect to p_0(y|x) that

L_0(ϖ, X|Y) ≥ δL_a(ϖ, X|Y) + (1 − δ)L_b(ϖ, X|Y),

where L(ϖ, X) is the MIM of the given information source X, while L_a(ϖ, X|Y) and L_b(ϖ, X|Y) denote the CMIM with respect to p_a(y|x) and p_b(y|x), respectively. Consequently, the loss L(ϖ, X) − L_0(ϖ, X|Y) is no larger than the corresponding mixture of losses, and the convexity property is verified.
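The Jensen step can be spot-checked numerically: with ϖ inside the work region (here ϖ = 1 ≤ 2), the CMIM evaluated at a mixed transfer matrix should dominate the mixture of CMIMs, which is exactly what makes the loss convex. A sketch with randomly generated matrices (illustrative code, assuming the posterior-averaged CMIM form used earlier):

```python
import numpy as np

rng = np.random.default_rng(1)

def cmim(px, W, w):
    # posterior-averaged CMIM for source px and transfer matrix W
    joint = px[:, None] * W
    py = joint.sum(axis=0)
    out = 0.0
    for j in range(len(py)):
        if py[j] > 0:
            q = joint[:, j] / py[j]
            out += py[j] * np.sum(q * np.exp(w * (1.0 - q)))
    return out

def random_channel(n_in, n_out):
    # random row-stochastic transfer matrix
    W = rng.random((n_in, n_out))
    return W / W.sum(axis=1, keepdims=True)

w, delta = 1.0, 0.3
px = np.array([0.2, 0.5, 0.3])
violations = 0
for _ in range(1000):
    Wa, Wb = random_channel(3, 4), random_channel(3, 4)
    W0 = delta * Wa + (1.0 - delta) * Wb
    # Jensen: CMIM of the mixture dominates the mixture of CMIMs
    if cmim(px, W0, w) < delta * cmim(px, Wa, w) + (1.0 - delta) * cmim(px, Wb, w) - 1e-9:
        violations += 1
print("violations:", violations)
```

No violations should be observed, since t e^{ϖ(1−t)} is concave on [0, 1] for ϖ ≤ 2 and the perspective construction preserves concavity.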
By substituting β = (D − pα)/(1 − p) into Equation (A8), it is easy to obtain Equation (A9), where max{0, 1 + (D − 1)/p} ≤ α ≤ min{1, D/p} results from the constraints in Equation (A6). Moreover, it is not difficult to obtain the partial derivative of L_D(ϖ, X|Y) in Equation (A9) with respect to α. In addition, in light of the domain of D mentioned in Equation (35), it is easy to see that D_max = min{p, 1 − p} in the Bernoulli source case. That is, the allowable distortion satisfies 0 ≤ D ≤ min{p, 1 − p}. Thus, the domain of α, namely max{0, 1 + (D − 1)/p} ≤ α ≤ min{1, D/p}, reduces to 0 ≤ α ≤ D/p. Then, the appropriate solution of α follows by checking the sign of the second derivative, where 0 ≤ D ≤ min{p, 1 − p}.
Consequently, by substituting the matrix in Equation (A12) into Equation (40), it is not difficult to verify this proposition.