Generalized predictive information criteria for the analysis of feature events

This paper develops two weighted measures for model selection by generalizing the Kullback-Leibler divergence. Weighted measures allow the model selection process to take into account special features of the underlying model. New information criteria are defined using a bias correction of an expected weighted log-likelihood estimator. Using weight functions that match the features of interest in the underlying statistical models, the new information criteria are applied to simulation studies of spline regression and copula model selection. Real data applications are also given for predicting the incidence of disease and for quantile modeling of environmental data.


Introduction
An information theoretic approach (ref. [2]) for model selection expresses a statistical model in the form of a probability distribution. The model is evaluated using an estimate of the Kullback-Leibler information (ref. [16]) as an overall measure of the divergence of the fitted model from the true model generating the data. According to [2], if the specified model contains the true model and the model is estimated by the maximum likelihood method, then Akaike's information criterion (AIC) can be used to evaluate the constructed models. Previous studies have developed information criteria, as estimators of the expected log-likelihood, under model mis-specification (e.g. [22]) and for evaluating mis-specified models constructed by various types of estimation procedures (e.g. [14,20]). [21,3,4] extend the information criteria approach to the Bayesian paradigm. These studies focus on the overall fit of the constructed model to the true underlying structure.

This paper proposes methods that allow model selection to emphasize particular features of the true model. When selecting a model, we usually have well-defined purposes that guide our choice of the class of models. For example, we may want to identify distributions that can assess the probability of the occurrence of rare events. In a bivariate setting, we may select a copula model that captures the tail dependence of two random variables. In medical disease prognosis, it is important to develop statistical models that can accurately predict the incidence of a disease. A common characteristic of these examples is that some key features of the true model are identified prior to model selection. From both the practical and the statistical points of view, it is desirable for model selection schemes to incorporate such features into the selection procedure. A feature capturing concept was proposed by [23], who emphasized feature matching in time series modeling.
They suggested estimating models by treating the multiple-step conditional mean forecasts and autocorrelation functions as features. In this paper, we develop two information criteria which embed pre-specified relevant features that the true model should contain. We first define a weighted Kullback-Leibler (KL) measure as follows.
Definition. Suppose G and F represent the true underlying distribution and the fitted distribution of a random variable Y, respectively. Define a positive real-valued function w(y) which is bounded and does not depend on the sample size. Denote the expectations with respect to G and F by E_G[·] and E_F[·], respectively. Under the regularity condition

E_G[w(Y)] ≥ E_F[w(Y)],  (1.1)

a weighted KL measure is defined as

K_w(G, F) = E_G[w(Y) log{g(Y)/f(Y)}] = ∫ w(y) g(y) log{g(y)/f(y)} dy,  (1.2)

where g(y) and f(y) are probability densities or probability mass functions of G and F, respectively.
The weighted KL measure satisfies (i) K_w(G, F) ≥ 0; and (ii) K_w(G, F) = 0 if and only if G = F. A proof is given in the Appendix. Note that an alternative weighted KL measure, satisfying (i) and (ii) above without requiring the feature condition, is defined as

K̄_w(G, F) = K_w(G, F) + E_F[w(Y)] − E_G[w(Y)].  (1.3)

The weighted KL measures in Equations (1.2) and (1.3) reduce to the traditional KL measure, denoted by K(G, F), when w(y) = 1. The main reason for constructing the weighted KL measures is to put more/less weight on the feature region of G so that any discrepancy between G and F in the feature region will be emphasized/de-emphasized.

Figure 1. The plot of K(G, F), K_w(G, F) and K̄_w(G, F) against β, which represents the "difference" between the true logistic model G and the constant probability model F.
The inequality in Equation (1.1) is called the feature condition and is a core condition in constructing the weighted KL measure in Equation (1.2). We give examples below to show how the feature condition is linked to the characteristics of the true model that we want the selected model to match.
Example 1. Consider a discrete probability distribution G(y), where y = 0 represents a non-disease group and y = 1 represents a disease group.
The distribution F can represent a screening test detecting the disease, and the fitted model is P_F(Y = 1) = exp(−2)/[1 + exp(−2)] ≈ 0.12. If the main focus of the model selection is the sensitivity of the screening test, we can specify w(0) = 1 and w(1) > 1. The resulting measures are plotted in Figure 1. We find that both K_w(G, F) and K̄_w(G, F) increase faster than K(G, F) with β if x ≥ 0. This is understandable because when x ≥ 0, the feature condition in Equation (1.1) is satisfied. Even when x < 0, where Equation (1.1) does not hold and K_w(G, F) may not be positive, K̄_w(G, F) is still greater than K(G, F) for all β, indicating that K̄_w(G, F) can have higher discriminating power than K(G, F) in practice.
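The behaviour in Example 1 can be checked numerically. The sketch below assumes the reconstructed form K_w(G, F) = Σ_y w(y) g(y) log{g(y)/f(y)} for two Bernoulli distributions; the particular true probabilities iterated over are illustrative, not from the paper.

```python
import math

def kl_weighted(p_g, p_f, w):
    """Weighted KL between two Bernoulli distributions:
    K_w(G, F) = sum_y w(y) * g(y) * log(g(y) / f(y))."""
    return sum(w[y] * g * math.log(g / f)
               for y, g, f in [(0, 1 - p_g, 1 - p_f), (1, p_g, p_f)])

p_f = math.exp(-2) / (1 + math.exp(-2))   # fitted constant probability, ~0.12
w_plain = {0: 1.0, 1: 1.0}                # w(y) = 1 recovers the ordinary KL
w_feat  = {0: 1.0, 1: 2.0}                # extra weight on the disease class, w(1) > 1

# As the true disease probability moves above p_f (feature condition holds),
# the weighted divergence grows faster than the ordinary one.
for p_g in (0.2, 0.4, 0.6):
    assert kl_weighted(p_g, p_f, w_feat) > kl_weighted(p_g, p_f, w_plain) > 0
```

The assertions mirror the figure: whenever the true model puts more mass than F on the emphasized class, the weighted measure amplifies the discrepancy.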
Figure 2. The plot of K(G, F), K_w(G, F) and K̄_w(G, F) against 2 − β, which represents the "difference" between the true exponential power distribution G and the standard normal distribution F.
Example 2. Suppose the true model G(y) is an exponential power distribution with mean 0 and variance 1. Its density is

g(y) = β / {2αΓ(1/β)} exp{−(|y|/α)^β} with α = {Γ(1/β)/Γ(3/β)}^{1/2},

and the fitted model is the standard normal. If the feature of interest is the tail part of the true model, we can set w(y) = 2 if |y| > 3 and w(y) = 1 otherwise. In this case, E_G[w(Y)] = P_G(|Y| > 3) + 1 and E_F[w(Y)] = P_F(|Y| > 3) + 1, which satisfies the feature condition because P_G(|Y| > 3) ≥ P_F(|Y| > 3). The heavy-tailed feature of the distribution is emphasized by this choice of w(y), which puts higher weight on the "tail discrepancy". Figure 2 plots K(G, F), K_w(G, F), and K̄_w(G, F), and shows that both K_w(G, F) and K̄_w(G, F) are greater than K(G, F) when β decreases from 2, that is, as the exponential power distribution becomes more heavy-tailed.

We expect that our weighted KL measures will give a larger discrepancy between the true and the fitted model if the fitted model cannot capture the target feature. In practice, this property of K_w(G, F) and K̄_w(G, F) can help to differentiate potential models with respect to the target feature, as the divergence between the true model and a fitted "wrong" model will be amplified by suitably defining the weights to match the feature.

This paper has three objectives. First, we propose two weighted KL measures that allow certain characteristics of the underlying model to be incorporated into the model selection. Second, we develop two new information criteria based on the weighted KL measures to determine the best model among candidate choices. Third, we demonstrate the advantage of having the weight function w(y) in the model selection, with a focus on spline regression and copula model selection.

The remainder of the paper is organized as follows. In Section 2, we develop two information criteria for model selection using the weighted KL measures.
Section 3 investigates the performance of the proposed information criteria on spline regression and copula model selection, using Monte Carlo simulations. In Section 4, we apply the developed approach to real data. Section 5 provides a discussion, including possibilities for future work.
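The tail-weighting effect of Example 2 can also be verified numerically. This is a sketch under two assumptions: the weighted measure has the reconstructed form ∫ w(y) g(y) log{g(y)/f(y)} dy, and a simple Riemann sum over a truncated range is accurate enough for the comparison.

```python
import math

def epd_pdf(y, beta):
    """Exponential power density with mean 0 and variance 1 (Example 2)."""
    alpha = math.sqrt(math.gamma(1 / beta) / math.gamma(3 / beta))
    return beta / (2 * alpha * math.gamma(1 / beta)) * math.exp(-(abs(y) / alpha) ** beta)

def norm_pdf(y):
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

def kl(beta, w=lambda y: 1.0, lo=-12.0, hi=12.0, n=20001):
    """Riemann-sum approximation of the (weighted) KL measure
    between the exponential power model and the standard normal."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        y = lo + i * h
        g, f = epd_pdf(y, beta), norm_pdf(y)
        total += w(y) * g * math.log(g / f) * h
    return total

w_tail = lambda y: 2.0 if abs(y) > 3 else 1.0   # emphasize the tail region
for beta in (1.0, 1.5):                          # heavier-tailed than the normal
    assert kl(beta, w_tail) > kl(beta) > 0       # weighting amplifies the discrepancy
```

At β = 2 the exponential power density coincides with the standard normal, so both measures vanish, matching the left end of Figure 2.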

Main result
Let y = (y_1, . . . , y_n)^T be a set of n observations from the true model G(y). For α = 1, . . . , n, we express the fitted model density function of y_α as f(y_α; θ), where θ is the model parameter. One can estimate the parameter θ by maximizing the penalized weighted log-likelihood

ℓ_λ(θ | y) = Σ_{α=1}^n w(y_α) log f(y_α; θ) − nλ p(θ),  (2.1)

where the weights w(y_α) can be specified by decision makers to target specific features in G(y), λ is a regularization parameter and p(θ) is a penalty function. As shown in Section 3.1, the penalty term can improve the performance in spline regression. After obtaining the parameter estimate θ̂ by maximizing Equation (2.1), we produce a model by replacing the parameter θ with the sample estimate θ̂. The problem is how to choose the best model among the candidates. This section constructs two information criteria for evaluating the fitted model under possible model mis-specification from a predictive point of view.

Based on the definition of the weighted measure K_w(G, F), we assess the goodness of the fitted model through the expected weighted log-likelihood

∫ ℓ_w(θ̂, z) dG(z), where ℓ_w(θ, z) = Σ_{α=1}^n w(z_α) log f(z_α; θ),  (2.3)

and z = (z_1, . . . , z_n)^T are replicates of the observations y_α. Note that if we specify the weight function as w(z_α) = 1, α = 1, 2, . . . , Equation (2.3) reduces to the expected log-likelihood extensively analyzed by [14]. The expected weighted log-likelihood depends on the unknown true distribution G(y) and on the observed data y_1, . . . , y_n. A natural estimator of the expected weighted log-likelihood is the sample-based weighted log-likelihood ℓ_w(θ̂, y), formally obtained by replacing the unknown true distribution with the empirical distribution Ĝ, which puts probability mass 1/n on each observation. The weighted log-likelihood generally has a positive bias as an estimator of the expected weighted log-likelihood, because the same data are used both to estimate the parameters of the model and to evaluate the expected weighted log-likelihood.
We therefore consider a bias correction of the weighted log-likelihood. The bias is defined by

b(G) = E_{G(y)}[ ℓ_w(θ̂, y) − ∫ ℓ_w(θ̂, z) dG(z) ].

Then an estimator of the expected weighted log-likelihood is given by ℓ_w(θ̂, y) − bias(Ĝ), where bias(Ĝ) is an estimator of the bias of the empirical weighted log-likelihood in estimating the expected weighted log-likelihood. The following theorem provides an expression for the bias term.
The statement of Theorem 2.1 involves T^{(1)}(z; G), the first-order derivative of the functional defining the estimator θ̂. The Appendix gives the proof of Theorem 2.1, which relies on the innovative idea of [14], who introduced the statistical functional framework to the information theoretic approach. By replacing the unknown distribution G with the empirical distribution Ĝ, we can calculate the bias term. After correcting the bias of the weighted log-likelihood, we propose a new information criterion

IC = −2 Σ_{α=1}^n w(y_α) log f(y_α; θ̂) + 2 tr{R(Ĝ)^{−1} K(Ĝ)}.  (2.7)

We select the model that minimizes the IC score in Equation (2.7) with respect to the models of interest defined by f(y; θ) and λ.
If the alternative weighted measure K̄_w(G, F) is used instead, the bias term in Equation (2.7) changes accordingly, yielding the second criterion; the bias term tr{R(Ĝ)^{−1} K(Ĝ)} follows Equations (2.8) and (2.9) when we use a maximum penalized weighted likelihood estimator and a maximum penalized likelihood estimator, respectively.

Remark. [1] also used weighted likelihood estimates, but for robust model selection using AIC rather than for developing new information criteria. Compared with [6], who proposed the focused information criterion to focus on some parameters of competing models, we do model selection with respect to feature events, which are matched by weight functions. If we set w(y) to be a constant, the two proposed criteria reduce to GIC (ref. [14]), which further reduces to AIC if the model f(y; θ) is correctly specified, giving R(Ĝ) = K(Ĝ). Therefore, the proposed criteria can be regarded as an extension of the standard information theoretic approach. The developed information criteria require an i.i.d. framework. It is our conjecture that similar results can be proved, for large sample sizes, in a time series context such as an AR(p) model; see, for instance, one of the examples in [15].
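To make the criterion concrete, the sketch below computes the IC score for a deliberately simple hypothetical case: a normal mean model with known variance 1 fitted by weighted maximum likelihood. It assumes the reconstructed form IC = −2 ℓ_w(θ̂, y) + 2 tr{R̂⁻¹K̂}, with K̂ and R̂ estimated empirically from the weighted score and negative Hessian; the general expression of Theorem 2.1 is not reproduced here.

```python
import math

def ic_weighted_normal_mean(y, w):
    """IC score for f(y; mu) = N(mu, 1) fitted by weighted maximum likelihood.
    K-hat averages squared weighted scores; R-hat averages the negative
    Hessian of the weighted log-likelihood (both scalars here)."""
    n = len(y)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)          # weighted MLE of the mean
    loglik = sum(wi * (-0.5 * (yi - mu) ** 2 - 0.5 * math.log(2 * math.pi))
                 for wi, yi in zip(w, y))
    K = sum((wi * (yi - mu)) ** 2 for wi, yi in zip(w, y)) / n  # mean squared score
    R = sum(w) / n                                              # mean negative Hessian
    return -2 * loglik + 2 * (K / R)                            # bias term tr{R^-1 K} = K / R

# Hypothetical data with extra weight on the two largest observations.
y = [0.3, -1.2, 2.5, 0.8, -0.4, 3.1]
w = [1.0, 1.0, 2.0, 1.0, 1.0, 2.0]
assert ic_weighted_normal_mean(y, w) > 0
```

With w ≡ 1 and a correctly specified model, K/R estimates the parameter count 1, recovering the AIC penalty, consistent with the remark above.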

Penalized B-spline regression
Monte Carlo experiments are conducted to compare the performance of our method with the standard weighted likelihood approach and GIC. Suppose we have n independent observations {(y_α, x_α); α = 1, 2, . . . , n}, where y_α are response variables and x_α are explanatory variables. In generalized linear models, y_α are assumed to follow the exponential family of distributions with densities

f(y_α; ξ_α, φ) = exp{ [y_α ξ_α − u(ξ_α)] / φ + v(y_α, φ) },

where u(·) and v(·, ·) are functions specific to each distribution, and φ is an unknown scale parameter.
Using the basis expansion approach given in [8,13], the unknown predictor η_α is approximated by a linear combination of basis functions, η_α = Σ_{j=1}^m β_j b_j(x_α) = β^T b(x_α), where β = (β_1, . . . , β_m)^T is the m-dimensional coefficient vector and b(x) = (b_1(x), . . . , b_m(x))^T is the m-dimensional basis function vector. With the basis expansion predictor, the probability density function of the generalized linear model is obtained by relating the mean μ_α to the predictor through the link function h(·), i.e. h(μ_α) = η_α. The penalty function in Equation (2.1) is defined by

p(θ) = (1/2) β^T D^T D β,

where ∆ is the difference operator, i.e. ∆β_j = β_j − β_{j−1}, and D is a matrix representation of the difference operator. Combining these equations with the weight function w(·), which is specified below, the unknown parameter θ is estimated by maximizing Equation (2.1). The remaining problem is how to choose the smoothing parameter λ.

Substituting the density and penalty functions given by Equations (3.1) and (3.2) into Equation (2.7), we derive a tailor-made version of the model selection criterion for evaluating generalized linear models with basis expansion predictors,

IC = −2 Σ_{α=1}^n w(y_α) log f(y_α | x_α; θ̂) + 2 tr{R(θ̂)^{−1} K(θ̂)},  (3.3)

where K(θ̂) and R(θ̂) are (m + 1) × (m + 1) matrices involving B = (b(x_1), . . . , b(x_n))^T, W = diag{w(y_1), . . . , w(y_n)}, e = (1, . . . , 1)^T, the n × n diagonal matrices Λ and Γ, and the n-dimensional vectors p and q. We choose the smoothing parameter λ as the minimizer of the criteria.
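The difference-operator penalty matrix D is easy to build explicitly. The sketch below constructs the matrix whose rows apply the k-th order difference ∆^k to the coefficient vector, so that the penalty is β^T D^T D β; this is a standard P-spline device, not code from the paper.

```python
def difference_matrix(m, order=2):
    """Return the (m - order) x m matrix D such that (D @ beta)_j is the
    order-th difference of consecutive B-spline coefficients."""
    # first-order difference matrix: row j has -1 at j and +1 at j + 1
    D = [[0.0] * m for _ in range(m - 1)]
    for j in range(m - 1):
        D[j][j], D[j][j + 1] = -1.0, 1.0
    # higher orders: difference the rows of the current matrix
    for _ in range(order - 1):
        D = [[D[j + 1][c] - D[j][c] for c in range(m)] for j in range(len(D) - 1)]
    return D

D = difference_matrix(5, order=2)
assert len(D) == 3 and len(D[0]) == 5
assert D[0][:3] == [1.0, -2.0, 1.0]     # second-difference stencil (1, -2, 1)
```

The second-order stencil (1, −2, 1) penalizes curvature of the coefficient sequence, which is what makes the fitted P-spline smooth for large λ.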
To illustrate the use of Equation (3.3), we consider P-spline Gaussian regression modeling in the simulation. Data sets {(x_α, y_α); α = 1, . . . , n} are repeatedly generated from the true regression model y_α = m(x_α) + ε_α for x_α = (2α − 1)/(2n). The errors ε_α are assumed to be independently distributed according to the normal distribution with mean 0 and variance σ². Two true functions are studied in the simulation: m(x) = sin(5πx^{2.5}) and m(x) = 0.005(1 − 48x + 218x³ + 145x⁴) + cos(4π(x − 1)³). We estimate the unknown function m(x) using the P-spline Gaussian regression model

f(y_α | x_α; θ) = (2πσ²)^{−1/2} exp{ −(y_α − β^T b(x_α))² / (2σ²) },  (3.4)

with weights w_α = 5/3 if x_α ≥ 0.5. This weight setting is considered when the relative importance of prediction differs over a particular range of x; in this case, we put more emphasis on the predictive performance at "large" x. Taking u(ξ_α) = ξ²_α/2, φ = σ², v(y_α, φ) = −y²_α/(2σ²) − log(2πσ²)/2 and h(μ_α) = μ_α in the criterion of Equation (3.3), we obtain a tailor-made version of an information criterion for evaluating the estimated P-spline Gaussian regression model.

Table 1 compares the averaged "weighted" mean squared error (WMSE) between the true and estimated functions ŷ(x_α) = β̂^T b(x_α). This WMSE is set to align with the objective of emphasizing the prediction of y when x ≥ 0.5. This performance measure is used because we want to be able to emphasize specific features that are selected by users according to their research focus. We also estimate the regression model by GIC and by the standard weighted maximum likelihood approach, which is implemented by setting the smoothing parameter λ = 0 in Equation (2.1). As the parameter estimate of θ cannot be obtained by the weighted likelihood method when we set a large number of basis functions, such as m = 15, we prepare several values of the number of basis functions, m = {6, 9, 12}, and then construct the regression model.
In addition, by setting w(y_α) = 1, the model f(y_α|x_α; θ) in Equation (3.4) is also estimated by the usual maximum penalized likelihood approach. This approach is employed to test the importance of the weight function. An optimal value of the smoothing parameter λ is determined by GIC (ref. [14]). The simulation results are obtained from 100 repeated Monte Carlo trials. The simulation results in Table 1 show that our method is superior to the competitors; it gives the smallest value of the average WMSE in all combinations of n and σ². Finally, we note that our criteria can also be applied to P-spline logistic regression modeling. It is easy to derive the information criterion for evaluating the estimated P-spline logistic regression model by taking u(ξ_α) = log{1 + exp(ξ_α)}, v(y_α, φ) = 0, h(μ_α) = log{μ_α/(1 − μ_α)}, and φ = 1 in Equation (3.3). For a Poisson model, we take u(ξ_α) = exp(ξ_α), v(y_α, φ) = −log(y_α!), h(μ_α) = log(μ_α) and φ = 1 in Equation (3.3).
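For the Gaussian case, maximizing the penalized weighted log-likelihood has a ridge-type closed form. The sketch below assumes the objective (y − Bβ)^T W (y − Bβ)/2 + λ β^T D^T D β/2; the polynomial basis used as a stand-in for B-splines and the weight value 1 for x < 0.5 are illustrative assumptions (the text only specifies w_α = 5/3 for x_α ≥ 0.5).

```python
import numpy as np

def fit_pspline_gaussian(B, y, w, lam, D):
    """Maximum penalized weighted likelihood estimate for the Gaussian model:
    beta_hat = (B^T W B + lam * D^T D)^{-1} B^T W y."""
    W = np.diag(w)
    A = B.T @ W @ B + lam * D.T @ D
    return np.linalg.solve(A, B.T @ W @ y)

rng = np.random.default_rng(0)
x = (2 * np.arange(1, 51) - 1) / 100            # x_alpha = (2*alpha - 1) / (2n), n = 50
y = np.sin(5 * np.pi * x ** 2.5) + 0.1 * rng.standard_normal(50)
B = np.vander(x, 6)                              # polynomial basis, stand-in for B-splines
w = np.where(x >= 0.5, 5 / 3, 1.0)               # weight 1.0 below 0.5 is an assumption
D = np.diff(np.eye(6), n=2, axis=0)              # second-order difference penalty matrix
beta = fit_pspline_gaussian(B, y, w, lam=0.1, D=D)
assert beta.shape == (6,)
```

In the paper's procedure, this fit is repeated over a grid of λ and the value minimizing the criterion in Equation (3.3) is retained.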

Copula model selection
Recently, there has been active research on copula model selection; see, for example, [5,11,19,9]. In this section we apply our IC to find the best copula models to explain certain dependence structures in the data. Let y_α = (y_α1, . . . , y_αp)^T, α = 1, . . . , n, be p-dimensional independent observations. With y_αj, j = 1, . . . , p, distributed as Uniform[0,1], the distribution of y_α = (y_α1, . . . , y_αp)^T is specified by F(y_α; θ) = C(y_α1, . . . , y_αp; θ), where C(·) is a copula function and θ is the copula parameter. The main objective is to select an appropriate C(·). To perform the copula model selection, we use the IC in Equation (2.7) with θ estimated by maximum likelihood and the bias term in Equation (2.9).

As discussed in the previous sections, a novelty of our approach is that we can choose the weight function with respect to a special feature of the distribution of the random variables of interest. Suppose p = 2 and we want to select an appropriate copula function for y_α1 and y_α2, where the feature of attention is their lower tail dependence structure. In this case, we use w(y_α) = λ if y_α1, y_α2 ≤ 0.1 and w(y_α) = 1 otherwise, where λ > 0. The rationale of this choice of w(y_α) is to emphasize or de-emphasize the influence of the co-occurrence of extreme observations, i.e. when both y_α1 ≤ 0.1 and y_α2 ≤ 0.1.

To study the effect of our approach, we simulate n = 100 observations from three copula functions with different lower tail dependence behavior: the t, Clayton and Gumbel copulas. The t copula has density

c(u_1, u_2) = t^{(2)}_ρ( T^{−1}(u_1), T^{−1}(u_2) ) / [ t(T^{−1}(u_1)) t(T^{−1}(u_2)) ],

where t is the univariate t density with degrees of freedom ν, t^{(2)}_ρ is the bivariate t density with correlation ρ and degrees of freedom ν, and T^{−1} is the inverse of the univariate t distribution function. We choose ρ = 0.59 and ν = 8 in the t copula, and θ = 0.67 and 1.67 for the Clayton and Gumbel copulas, respectively, to match their Kendall's tau.
By construction, both the t and Clayton copulas have positive lower tail dependence, whereas the Gumbel copula has zero lower tail dependence. We select the best model among the fitted t, Clayton and Gumbel copula models using the IC.

Table 2. Simulation results of 100 replications of size n = 100 generated by the t, Clayton and Gumbel copulas. The table shows the median ratios of the KL measure, the L_1-norm, the L_2-norm and the Hellinger measure for different λ over that for the benchmark method (λ = 1).

As the target feature is the lower tail dependence, we want to see how the fitted models match the true model in A = {y_α : y_α1, y_α2 ≤ 0.1}. Statistically, we assess the performance by comparing the true tail distribution g(y|A) = g(y)I(y ∈ A)/P_G(y ∈ A) and the fitted tail distribution f(y|A) = f(y)I(y ∈ A)/P_F(y ∈ A), where f and F are from a fitted model, and P_G and P_F are probabilities evaluated under G and F, respectively. Four performance measures are adopted, namely, the KL measure, ∫_A g(y|A) log{g(y|A)/f(y|A)} dy; the L_1-norm, ∫_A |g(y|A) − f(y|A)| dy; the L_2-norm, ∫_A [g(y|A) − f(y|A)]² dy; and the Hellinger measure, ∫_A [√g(y|A) − √f(y|A)]² dy. The smaller the performance measures, the better the fitted model.
The model selection results using the IC in Equation (2.7) with different λ are produced for 100 replications of each of the three copula models. Table 2 shows the ratio of the median of the KL measure based on 100 replications, for each λ, over the median KL measure of the benchmark (λ = 1), together with the corresponding ratios for the median L_1-norm, L_2-norm and Hellinger measure. A smaller-than-one ratio of the median performance measure means that the model selection based on the IC is better than that of the benchmark method. For the t-copula data generating process, our method outperforms the benchmark when λ > 1 (most of the median ratios are less than 1). For Clayton, we observe the same pattern: for λ > 1, all median ratios are less than one. For Gumbel, the outperformance appears for λ < 1, where the best-performing cases are λ = 0.2 and 0.4. These findings suggest that when λ is appropriately chosen to focus on one feature of the copula models, our IC based on the weighted KL measure K_w(G, F) is superior to that based on the usual KL measure. For the Clayton and t copulas, which have lower tail dependence, it is reasonable to choose λ > 1 to emphasize the effect of extreme observations, whereas for Gumbel, which has zero lower tail dependence, λ < 1 is appropriate because we want to de-emphasize the effect of extreme observations to match the zero or weak lower tail dependence of the true model. In practice, whether we want to emphasize or de-emphasize the extreme observations can depend on prior beliefs regarding the tail dependence property of the true model.
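The data-generating side of this experiment is straightforward to sketch. Below, Clayton samples are drawn by the standard conditional-inversion method (not code from the paper), and two of its properties used above are checked: Kendall's tau is θ/(θ + 2), and joint lower-tail co-occurrences are far more frequent than the 1% implied by independence.

```python
import random

def clayton_sample(n, theta, seed=1):
    """Sample (u, v) pairs from a Clayton copula via conditional inversion:
    v = [u^{-theta} * (t^{-theta/(theta+1)} - 1) + 1]^{-1/theta}, t ~ U(0,1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u, t = rng.random(), rng.random()
        v = (u ** (-theta) * (t ** (-theta / (theta + 1)) - 1) + 1) ** (-1 / theta)
        out.append((u, v))
    return out

def kendall_tau(pairs):
    """Sample Kendall's tau: concordant minus discordant pair fraction."""
    n = len(pairs)
    conc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            conc += 1 if s > 0 else -1
    return 2 * conc / (n * (n - 1))

theta = 0.67                                  # Clayton parameter used in the simulation
data = clayton_sample(2000, theta)
assert abs(kendall_tau(data) - theta / (theta + 2)) < 0.08

# Lower tail dependence: both coordinates fall below 0.1 much more often
# than the 1% that independence would give.
tail_frac = sum(1 for u, v in data if u <= 0.1 and v <= 0.1) / len(data)
assert tail_frac > 0.02
```

The excess mass in the region A = {u ≤ 0.1, v ≤ 0.1} is exactly what the weight w(y_α) = λ, λ > 1, amplifies when Clayton or t data are being fitted.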

Predicting the incidence of disease
A motivation for introducing the weighted KL measures is that in medical research, the proportion of disease cases, e.g., heart disease and cancer, is much smaller than the proportion of non-disease cases. If we design a suitable weight function that weights the disease cases more heavily, we expect that the accuracy rate (i.e., the chance of correctly predicting that a cancer patient has cancer) can be improved. We consider an analysis of the South African heart disease data (ref. [18]) studied in [10], p. 122, to illustrate our method using a logistic regression model. The data set consists of 160 heart-disease cases (y = 1) and a sample of 302 controls (y = 0). Using a set of four predictors, tobacco (cumulative tobacco, x_1), ldl (low density lipoprotein cholesterol, x_2), famhist (family history of heart disease, x_3), and age (age at onset, x_4), we model the conditional probability Pr(y = 1|x) = π(x) by logistic regression and use the weighted log-likelihood

ℓ_w(β, y) = Σ_α w(y_α){ y_α log π(x_α) + (1 − y_α) log(1 − π(x_α)) }.
Because our concern is to increase the chance of correctly predicting that a heart-disease patient has heart disease, we set the weight values as w(1) = 4 and w(0) = 1. The unknown parameter vector β = (β_0, . . . , β_4)^T is estimated by maximizing the weighted log-likelihood ℓ_w(β, y). The corresponding IC in Equation (2.7) is then computed with the estimated conditional probability π̂(x_α). As a result, we obtain the minimum IC score of −1.052 with the model including all four predictors. The thick line in Figure 3 shows the receiver operating characteristic (ROC) curve, obtained by plotting the fraction of true positives out of the positives (TPR = true positive rate) versus the fraction of false positives out of the negatives (FPR = false positive rate).
To compare the performance with the un-weighted method (w(y_α) = 1 for α = 1, . . . , 462), we use the GIC in [14] to repeat the model selection. The selected model contains only two predictors, x_1 and x_2, with the ROC curve given by the thin line in Figure 3. As this ROC curve lies below the curve obtained from our approach, the model selected by the IC score is superior. We observe an improved predictive power of the logistic regression model when it puts more weight on the disease cases and when the model is selected using our IC score.

Quantile modeling of environmental data
In contrast to classical linear regression models, where a conditional expectation of the response variable is in focus, quantile regression estimates the τ-th conditional quantile of y given x = (x_1, . . . , x_p)^T,

q_τ(x) = β_0(τ) + Σ_{j=1}^p β_j(τ) x_j,

where β_0(τ), . . . , β_p(τ) are coefficients dependent on the quantile level τ. When we set τ = 0.5, the model reduces to conditional median regression, which is more robust to outliers than conditional mean regression. The unknown parameters β(τ) = (β_0(τ), . . . , β_p(τ))^T are estimated by minimizing the weighted loss

Σ_{α=1}^n w(y_α) ρ_τ(y_α − q_τ(x_α)),

with ρ_τ(u) = u(τ − I(u < 0)), the usual loss function for standard quantile regression modeling. To evaluate estimated quantile regression models, we use the IC score in Equation (2.7). The criterion is then

IC = 2 Σ_{α=1}^n w(y_α) ρ_τ(y_α − q̂_τ(x_α)) + 2 tr{R(Ĝ)^{−1} K(Ĝ)},  (4.1)

with X = (x_1, . . . , x_n)^T, K(Ĝ) = (1/n) τ(1 − τ) X^T W X, and R(Ĝ) = (1/n) X^T M X. Here W = diag{w(y_1), . . . , w(y_n)}, and M = diag{g(ξ_1(τ)), . . . , g(ξ_n(τ))} is the n × n diagonal matrix whose α-th element is the conditional density evaluated at the τ-th conditional quantile, with ξ_α(τ) = G^{−1}(τ|x_α) and P(y_α ≤ y|x_α) = G(y|x_α). Although our formulation in Equation (4.1) allows heterogeneous weights w(y_α) for the observations y_α, even the IC score with equal weights w(y_α) = 1 is new in the literature. In the study of the precipitation data discussed below, we set the weight function as w(y_α) = 1 + I(y_α ≥ a_u) for upper quantiles and w(y_α) = 1 + I(y_α ≤ a_l) for lower quantiles, where a_u and a_l are thresholds. This setting is in line with environmental issues like drought, climate change and flooding, in which the precipitation is either very high or very low.
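The weighted check-loss objective can be illustrated in the simplest possible setting: an intercept-only model fitted by grid search. This is a didactic sketch with made-up "precipitation" values; a real fit would use linear programming over covariates rather than a grid.

```python
def check_loss(u, tau):
    """rho_tau(u) = u * (tau - I(u < 0)), the quantile-regression loss."""
    return u * (tau - (1 if u < 0 else 0))

def weighted_quantile(y, w, tau, grid):
    """Minimize sum_a w_a * rho_tau(y_a - q) over candidate q values:
    a grid-search sketch of weighted quantile estimation (intercept only)."""
    return min(grid, key=lambda q: sum(wa * check_loss(ya - q, tau)
                                       for ya, wa in zip(y, w)))

y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]          # one extreme "precipitation" value
grid = [v / 10 for v in range(0, 1100)]
# Unweighted 0.9 quantile versus one that doubles the weight of extreme values,
# in the spirit of w(y) = 1 + I(y >= a_u) with a hypothetical threshold of 10.
q_plain = weighted_quantile(y, [1] * 10, 0.9, grid)
q_heavy = weighted_quantile(y, [2 if v >= 10 else 1 for v in y], 0.9, grid)
assert q_heavy >= q_plain   # upweighting extremes pushes the fitted quantile up
```

Doubling the weight of observations above the threshold pulls the estimated upper quantile toward the extremes, which is the intended behaviour of the weight functions used for the precipitation data below.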
First, we apply our method to monthly precipitation data collected at one of the meteorological stations of the Hong Kong Observatory. The data period is from September 1997 to December 2010. The quantile autoregression model of [12] is considered,

q_τ(y_α) = β_0(τ) + Σ_{j=1}^p β_j(τ) y_{α−j},

where y_α is the observed precipitation value at time α. Setting the value of τ to 5% and 95%, we construct two quantile functions. To emphasize the effect of extreme precipitation, we use the weight functions w(y_α) = 1 + I(y_α ≥ 800) for the 95% quantile and w(y_α) = 1 + I(y_α ≤ 20) for the 5% quantile. To determine an optimal lag value p, the IC score is used: we fit candidate models with lags p = {0, 1, . . . , 20} and then choose the lag that minimizes the IC score. When we construct the IC score in Equation (4.1), the values of the conditional density g(ξ_α(τ)) in the matrix M are estimated by the adaptive kernel method of [17]. We find that the optimal lag for both quantile functions is p = 12. Figure 4(a) shows the estimation result: the solid lines are the fitted quantile functions q̂_τ(y_α) = β̂_0(τ) + Σ_{j=1}^p β̂_j(τ) y_{α−j}, and the circles are realized values.
Next, we apply our method to hourly precipitation data from 1:00 May 1, 2006 to 23:00 June 30, 2006, using the same modeling procedure described above. Here our focus is the τ = 95% quantile, which has implications for flooding, and we use the weight function w(y_α) = 1 + I(y_α ≥ 10). The optimal lag is p = 1. This result seems natural because the most recent observation contains the most useful information for forecasting. Figure 4(b) shows the estimation result.

Discussion
The concept of model selection with respect to particular model features has been introduced in the information-theoretic paradigm. The standard KL measure is generalized to include a weight function designed for the specified features. The IC score based on the weighted KL measure K_w(G, F) is defined via Theorem 2.1 by deriving the bias correction of the estimator of the expected weighted log-likelihood ∫ ℓ_w(θ̂, z) dG(z). The scope of applications of our IC score is very wide; in real-life applications, there are many cases that allow us to use the weighted KL measure, such as the logistic probability model (Example 1) and the exponential power distribution (Example 2) described in Section 1. We show the usefulness of our statistical modeling procedures through simulation studies, including penalized B-spline regression (Section 3.1) and copula model selection (Section 3.2). We also illustrate our information criteria by applying them to real data: predicting the incidence of disease (Section 4.1) and quantile modeling of monthly and hourly precipitation data (Section 4.2). We believe that our statistical modeling will contribute to the advancement of empirical studies.