Dealing with under-reported variables: An information theoretic solution

Article history: Received 30 November 2016; Received in revised form 21 March 2017; Accepted 3 April 2017; Available online 7 April 2017.


Introduction
Under-reporting commonly occurs in survey data, and refers to respondents who under-report their answer to a question, for example due to a perceived social stigma [35]. A famous example is maternal smoking during pregnancy, which is a key risk factor for adverse offspring outcomes including preterm birth and low birth weight (LBW). As with many health behaviours, accurately measuring smoking habits during pregnancy can be difficult and expensive. For that reason, many studies use self-reported data, e.g. Wright et al. [37]. Given that most smokers know their habit to be harmful, both to themselves and their unborn child, there are strong motivations for women to under-report or deny their smoking status [10]. As such, the frequency of smokers in a sample is expected to be significantly lower than would be expected according to expert knowledge, derived for example from blood test results. Gorber et al. [15] presented a comprehensive analysis of the literature, comparing prevalence estimates of smoking based on self-reported data against those based on directly measured smoking biomarkers. According to this analysis, self-reported smoking is generally under-reported in such a way that the true smoking figures may be underestimated by up to 47%.
Estimating the association between an under-reported (UR) variable and another variable will be biased in a manner that is specific to the degree/pattern of UR. Thus, any policy decisions made on the basis of such a biased result will be questionable. For example, government policies on tobacco control, e.g. [1], may be ill-formed if they do not take into account UR. Maternal smoking and alcohol consumption are our focus in this paper, but there are many important health applications where corrections for UR are needed, e.g. HIV prevalence [38].
One method to correct UR bias is to spend time and resources to manually identify individuals that are likely to have misreported, and ignore/correct their testimony, e.g. identifying smokers by performing cotinine blood tests. Unfortunately, in many applications it is impossible to completely correct the misreported cases; for example, Gorber et al. [15] present some possible flaws of the biochemical markers that identify smoking. As an alternative, authors in medical statistics treat it as a problem of misclassification bias, and combine data with a prior belief over the pattern of misclassification. They use this prior knowledge to derive corrected estimators for the log-odds ratio [7,12] and the relative risk [27], or to suggest ways of performing tests of independence [5].
These solutions suffer from a number of weaknesses, which are addressed in this paper. For testing independence, they do not control both types of error (false positive/false negative). For estimating effect sizes, the suggested solutions naturally apply only to the correlation between a binary UR variable and a binary target variable; target variables with more than two categories are handled via a one-vs-one or one-vs-all strategy. Furthermore, ranking of variables in relation to a target, a common need in feature selection and other machine learning tasks, is not straightforward. Our strategy to overcome these limitations is to provide an information theoretic insight into the under-reporting problem, by disentangling three intimately related activities: testing, estimation, and ranking of features. Our main goal is to derive corrected estimators for the mutual information (MI), a measure of effect size widely used in machine learning applications with several interesting properties [4].
To achieve our goal we reinterpret the challenge not as dealing with misclassification and biased data, but as a problem of learning from missing data, and particularly learning from positive and unlabelled data [13]. By this interpretation, we present solutions using a graphical representation called missingness or m-graph [23], which is a tool to naturally incorporate a prior belief over the misreporting at the population (or appropriate sub-demographic) level. Furthermore, with our work, we show how to correct MI for under-reporting by examining independence properties observable via the m-graph representation.
In this paper, we present the following novel contributions in relation to UR variables:
• Testing: Section 3 suggests a way of testing independence between an UR feature and the class. Using our test and a derived correction factor, we can control both false positive and false negative errors.
• Estimation: Section 4 derives corrected estimators of MI terms that occur in UR scenarios. Section 4.1 presents estimators that capture the relevance between an UR variable and an arbitrary categorical variable, while Section 4.2 presents an estimator that captures the redundancy between two UR variables. Furthermore, we provide interval estimates where possible.
• Ranking: Section 5 presents two different methods for information theoretic feature ranking in the presence of UR variables, using our suggested estimators. Section 5.1 presents Corrected-MIM, a univariate approach that provides rankings capturing only relevance, while Section 5.2 presents Corrected-mRMR, a multivariate method that captures both the relevance and the redundancy between the features.
All the above contributions are novel, with the exception of Sections 4.1 and 5.1, which have been published in a conference paper [33]. Furthermore, we provide further experimental results in two applications. Firstly, Section 6 derives rankings of risk factors related to low birth weight infants using a case study of 13,776 births in northern England, where we demonstrate some significant false conclusions that might be drawn when ranking variables without the correction factors. Finally, Section 7 presents a machine learning application where we derive rankings of features when training/test distributions differ.

Background material
To the best of our knowledge, our work is the first that tackles the problem of estimating MI in under-reporting scenarios. In classic statistics there are some works that estimate other types of effect sizes (i.e. odds/risk ratios, limited to binary data), and we review them in Section 2.1. Section 2.2 shows how the under-reporting problem can be phrased as a missing data problem. Finally, Sections 2.3, 2.4 and 2.5 give the background on testing, estimation and ranking using information theoretic measures.

Under-reporting as a misclassification bias problem
We assume that we have two random variables X and Y, representing a scenario where X is likely to be UR. In this case, we cannot observe the true value of X, but instead receive observations from a proxy variable X̃. In the notation below we use lower case letters (y, x, x̃) to denote realisations of these variables. In our example of smoking during pregnancy, y ∈ {0, 1} is a binary indicator of LBW, x ∈ {0, 1} is whether the mother smoked during pregnancy (1 for smoking and 0 for not smoking), and x̃ ∈ {0, 1} is whether the mother reported that she smoked in pregnancy (1 for reported smoking and 0 for reported not smoking). While in our running example Y is binary, the techniques presented in this work also apply to categorical data with more than two levels, |Y| > 2.
A classical solution to the under-reporting problem is to consider it as a misclassification bias [17]. Following Greenland's [17] terminology, for an under-reported variable the specificity is p(x̃ = 0 | x = 0) = 1, while the sensitivity is p(x̃ = 1 | x = 1) < 1. Here, the specificity is the probability that a non-smoker would tell the truth (equal to 1 in this setting) and the sensitivity is the probability that a smoker would tell the truth (in our setting strictly < 1; if it is equal to 1 the variable is not UR). As presented, this is the simplest scenario, referred to as non-differential: the probabilities do not vary with respect to Y. The more complex case, where the sensitivity depends on Y, that is p(x̃ = 1 | x = 1, y), is known as differential misclassification [17]. In this work we focus on the non-differential UR scenario, and leave the differential one as future work, outlined in Section 8. The non-differential assumption is reasonable for cohort studies (such as the Born in Bradford project [37] presented in Section 6), since, as Greenland [16] states: "... studies that collect exposure data before the outcome occurs (such as most cohort studies) provide settings for reasonable employment of the non-differential assumption."
Estimating the strength of association between variables, using this misclassification approach, is a well explored challenge in epidemiology. For example, Chu et al. [7] derive corrected estimators for the log-odds ratio, while Rahardja and Young [27] derive them for the relative risk. To derive these corrections, knowledge of the specificities/sensitivities, or in other words of the misclassification rates, is needed. This can be obtained in different ways, such as validation studies or domain prior knowledge. A different way of estimating these effect sizes is to use a model to impute the values of the possibly misclassified examples; for example, Edwards et al. [12] present a way of using multiple imputation to estimate log-odds ratios. With our work we derive corrections for the MI, by incorporating simple forms of prior knowledge.
A further challenge, other than estimation, is to conduct a valid independence test. Testing independence under misclassification is a very old problem: Mote and Anderson [24] showed that the usual χ²-test of independence remains valid when the misclassification is non-differential, but its statistical power (i.e. true positive rate) is reduced. We found only one study [5] which suggested a correction factor that captures the amount of power loss in the χ²-test. In that work, a binomial difference-of-proportions test is analysed and used to suggest properties of the χ²-test, using as an argument the equivalence of these two tests under the null hypothesis. With our work we derive a correction factor which estimates the power loss more accurately than the correction factor suggested by Bross [5], for the same quality of prior knowledge.

Under-reporting as a missing data problem
A different way to phrase the under-reporting problem is by connecting it with the equivalent problem from the missing data literature. The first step is to consider the under-reporting bias as a positive and unlabelled (PU) problem [13], that is, a semi-supervised binary classification problem where we have a set of positive examples and a separate set of unlabelled examples, which can be either positive or negative. The positive examples can be seen as the reported "smoking" cases (x̃ = 1), while the unlabelled can be seen as the reported "non-smoking" cases (x̃ = 0). Furthermore, from the missing data literature we borrow a graphical representation which helps make apparent the assumptions behind the under-reporting mechanism. Mohan et al. [23] introduced a formalism for graphical modelling in the presence of missing data, known as missingness graphs or m-graphs. While in the literature of misclassification bias there is a different graphical representation [17], our modification of the m-graphs provides more useful information, by capturing both the data generation model and the causal mechanisms responsible for the misclassification process. Fig. 1(a) shows the simplest case of non-differential UR. A solid node indicates a fully observed variable, whilst dashed nodes represent unobserved variables. Associated with every unobservable variable X there are two additional nodes: firstly M_X, which controls whether a value of X is correctly reported (m_x = 1) or not (m_x = 0), and secondly the proxy variable X̃, which is fully observed. The major difference between the missingness graphs used by Mohan et al. [23] and those here is that the mechanism M_X is not observed, and for that reason we must incorporate prior knowledge over the sensitivity p(m_x = 1 | x = 1) and the specificity p(m_x = 1 | x = 0).
The m-graph representation allows us to read off independence properties such as Y ⊥⊥ M_X | x = 1, which corresponds to the selected-completely-at-random assumption in the positive and unlabelled literature [13]. Fig. 1(b) shows a more complex situation, where we have two UR variables.
The current paper shows (in its simplest case) how to recover I(X; Y) from the observable I(X̃; Y), by deriving a correction based on prior belief over the mechanism M_X, and how to test the independence between X and Y using the observed variables X̃ and Y.

Background on testing independence
The most usual way to decide independence between categorical variables is through the χ²-test, calculated from sample data as:

χ²(X; Y) = N Σ_{x∈X} Σ_{y∈Y} ( p̂(x, y) − p̂(x)p̂(y) )² / ( p̂(x)p̂(y) ).

We note that this is related to the squared-loss mutual information (SMI) [31,34] as follows:

χ²(X; Y) = 2N Î₂(X; Y),

where Î₂(X; Y) is the maximum likelihood estimate of the squared-loss mutual information. Under the null hypothesis that X and Y are statistically independent, the χ²-statistic is asymptotically χ²-distributed, with ν = (|X| − 1)(|Y| − 1) degrees of freedom [2].

Fig. 1. A graphical representation for under-reporting. (a) Non-differential: low birth weight (LBW) Y is assumed to be associated with smoking X, so we want to know the strength of association I(X; Y) on this arc. However, X is under-reported, so its true value is unobservable, and instead we have a proxy X̃, determined by X and the misclassification mechanism M_X. (b) Two non-differential correlated under-reported variables X and Z. In this case we want to know the strengths of I(X; Y) and I(Z; Y), but we are also interested in the strength of I(X; Z). The dashed lines indicate that there may or may not be a correlation between the variables.
To decide independence between X and Y, for a given sample of data, we calculate the statistic and check whether it exceeds the critical value F⁻¹(1 − α), where α is the user-specified significance level of the test and F⁻¹ is the inverse cumulative distribution function of the χ²-distribution with ν degrees of freedom. If the critical value is not exceeded, we fail to reject the null hypothesis of independence. The user-specified significance level defines the probability of type I error (α), which is the probability that the test will falsely reject the null hypothesis. To calculate the probability that the test will falsely fail to reject the null hypothesis, the probability of type II error (β = 1 − power), we should perform a power analysis [8]. To do this we need to know the sampling distribution of the test statistic under the alternative hypothesis. The χ²-statistic asymptotically follows a non-central χ²-distribution under the alternative hypothesis, with the same degrees of freedom ν and with non-centrality parameter λ_{χ²(X;Y)} = 2N·I₂(X; Y) [2, Section 6.6.4]. Thus, the probability of a type II error depends also on the sample size N and on the population value of the SMI, I₂(X; Y).
Given this context, a very important tool of a priori power analysis is sample size determination [8]. In this prospective procedure we specify the significance level of the test (e.g. α = 0.05), the desired power (e.g. power = 0.99, i.e. a false negative probability of 0.01) and the population value of the desired effect size, described in terms of I₂(X; Y); from these we determine the minimum number of examples required to detect that effect with the given probabilities of error. Section 3 presents how we can use the above methodology when we have UR features.
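The sample size determination procedure above can be sketched in code. The following is an illustrative sketch (not part of the original paper) for the 2×2 case (ν = 1), where a non-central χ² variate with non-centrality λ is distributed as (Z + √λ)² for standard normal Z, so both the critical value and the power have closed forms; the function name and the binary-search tolerance are our own choices.

```python
from math import ceil, sqrt
from statistics import NormalDist

def chi2_sample_size(i2, alpha=0.05, power=0.99):
    """Minimum N to detect SMI I2(X;Y) = i2 with a 1-df chi-squared test.

    Uses the identity that a noncentral chi^2 with 1 degree of freedom
    and noncentrality lambda is distributed as (Z + sqrt(lambda))^2,
    so the critical value and the power have closed forms.
    """
    nd = NormalDist()
    c = nd.inv_cdf(1 - alpha / 2) ** 2          # critical value of central chi^2_1

    def test_power(lam):
        # P((Z + sqrt(lam))^2 > c)
        return (1 - nd.cdf(sqrt(c) - sqrt(lam))) + nd.cdf(-sqrt(c) - sqrt(lam))

    # binary search for the smallest noncentrality reaching the target power
    lo, hi = 0.0, 1000.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if test_power(mid) < power:
            lo = mid
        else:
            hi = mid
    # lambda = 2 * N * I2(X;Y), so solve for N
    return ceil(hi / (2 * i2))
```

For instance, detecting I₂(X; Y) = 0.01 at α = 0.05 with power 0.99 requires on the order of nine hundred examples under this approximation.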

Background on estimating mutual information
In practical applications we want to explore relationships between random variables. Just giving a yes/no answer through a hypothesis test may not be of much interest, and estimating the size of the effect gives more useful information; for example, how strongly smoking is correlated with low birth weight. In machine learning, one of the main ways of measuring the strength of this association is by estimating Shannon's mutual information (MI) [4]. The maximum likelihood (ML) estimate of the MI is:

Î(X; Y) = Σ_{x∈X} Σ_{y∈Y} p̂(x, y) ln ( p̂(x, y) / (p̂(x)p̂(y)) ).   (1)

MI has several attractive properties. Firstly, it is a non-negative quantity which takes its minimum value of zero when the random variables are independent. Furthermore, it can be associated with both upper and lower bounds on the Bayes error [14,19]. Brown et al. [6] present an extensive discussion of this in the context of feature selection, including various heuristics which provide approximations for high dimensional data, resulting in a unifying theoretical framework derived from a simple probabilistic model. Together with point estimates, it is good practice to give an interval estimate: a range of possible values that the mutual information can take. Asymptotic distribution theory has a set of tools to derive the sampling distribution of the ML-MI estimator, and the following theorem presents this known result [4].

Theorem 1 (ML-MI estimator, asymptotic distribution). For the estimator Î(X; Y) it holds that:

√N ( Î(X; Y) − I(X; Y) ) →d N(0, σ²),  where  σ² = Σ_{x∈X} Σ_{y∈Y} p(x, y) ln²( p(x, y) / (p(x)p(y)) ) − I(X; Y)².

The standard error of the estimator is:

SE( Î(X; Y) ) = σ / √N.   (2)

Proof sketch: this result can be proved using delta methods [2]. While the asymptotic variance here depends on the population values p(x, y), in practice, for interval estimation, we replace them by their sample values p̂(x, y). This standard procedure [2, Section 3.1.7] is followed for all the sampling distributions that we present in this work. Section 4 presents how we can estimate mutual information in UR scenarios.
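To make the above concrete, here is a small illustrative sketch (our own, not from the paper) computing the ML-MI point estimate and the delta-method standard error of Theorem 1 from a contingency table of counts:

```python
from math import log, sqrt

def ml_mi(counts):
    """ML estimate of Shannon MI (in nats) and its asymptotic standard
    error, from a contingency table counts[x][y] of raw frequencies."""
    n = sum(sum(row) for row in counts)
    px = [sum(row) / n for row in counts]
    py = [sum(counts[x][y] for x in range(len(counts))) / n
          for y in range(len(counts[0]))]
    mi, second_moment = 0.0, 0.0
    for x, row in enumerate(counts):
        for y, c in enumerate(row):
            if c == 0:
                continue                      # zero cells contribute nothing
            pxy = c / n
            term = log(pxy / (px[x] * py[y]))
            mi += pxy * term
            second_moment += pxy * term ** 2
    # delta-method variance: E[log^2(...)] - I^2, divided by n
    se = sqrt(max(second_moment - mi ** 2, 0.0) / n)
    return mi, se
```

The `max(..., 0.0)` guards against tiny negative values from floating-point rounding when the variables are exactly independent.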

Background on information theoretic feature ranking and selection
In most real world problems we have more than one feature, i.e. we observe a sample dataset of N examples, each described by a set of d features X = {X₁, …, X_d} and a class label Y. With a slight abuse of notation, in the rest of this section we interchange the symbol for a set of variables and for their joint random variable.
In this scenario, it is also useful to order the features according to their relationship with the target variable, a procedure known as feature ranking. Feature rankings provide very useful information, and applications of this principle range from model selection to decision tree construction. There are two main categories of feature rankings [29]: univariate methods, which consider only the individual relevance of each feature and multivariate methods, which also take into account dependencies between features.
In information theoretic feature selection, firstly we rank the features according to a score measure and then select the ones that contain most of the useful information (i.e. the highest scores). By ranking the features with respect to their mutual information with the target variable, we derive a ranking that takes into account only the relevance to the target. Choosing the features according to this ranking corresponds to the univariate Mutual Information Maximization (MIM) criterion [21], where the score of each feature X_k is given by:

J_MIM(X_k) = Î(X_k; Y).

This approach does not take into account the redundancy between the features. More advanced multivariate techniques take into account both relevance and redundancy, without having to compute very high dimensional distributions. For example, a popular multivariate criterion is the minimal Redundancy Maximal Relevance (mRMR) criterion, which ranks the features according to the score [25]:

J_mRMR(X_k) = Î(X_k; Y) − (1/|X_θ|) Σ_{X_j ∈ X_θ} Î(X_k; X_j),

where X_θ is the set of the features already selected. The first term of the RHS captures the relevance and the second the redundancy. Section 5 derives extensions of MIM and mRMR that handle UR features.
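As an illustration of the two criteria (a hedged sketch with our own function names, not the paper's code), the following ranks features by MIM and greedily by mRMR, using a plug-in ML estimate of MI:

```python
from math import log
from collections import Counter

def mi(a, b):
    """ML estimate of Shannon MI (nats) between two lists of categorical values."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def rank_mim(features, y):
    """Univariate MIM ranking: order feature indices by I(X_k; Y), descending."""
    return sorted(range(len(features)), key=lambda k: -mi(features[k], y))

def rank_mrmr(features, y):
    """Greedy mRMR ranking: relevance minus mean redundancy with the
    features already selected."""
    selected, remaining = [], list(range(len(features)))
    while remaining:
        def score(k):
            red = (sum(mi(features[k], features[j]) for j in selected)
                   / len(selected)) if selected else 0.0
            return mi(features[k], y) - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Note the greedy loop re-scores the remaining features at each step, since mRMR is a forward selection procedure whose redundancy term changes as X_θ grows.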

Testing independence in under-reported scenarios
To answer the question of whether two variables are independent we need a hypothesis testing procedure in which we can control the two probabilities of error: false positive (Type I error) and false negative (Type II error). Testing independence in under-reported scenarios is not straightforward, and this section explores the dynamics of this test.
In non-differential under-reporting, it is valid to test independence by using the under-reported variable X̃ [24,16]. This can be easily proved, since X̃ depends on Y only through X, so X ⊥⊥ Y implies X̃ ⊥⊥ Y. This can also be read directly from the m-graph in Fig. 1(a): when there is no direct arc between X and Y, there is no path that connects X̃ with Y, and vice versa. Quantifying the loss of power in the χ²-test is more challenging. We now show how to derive a correction factor that captures this loss effectively.
We first need to write the non-centrality parameter of the under-reported test, λ_{χ²(X̃;Y)}, in terms of the same parameter of the unobservable test, λ_{χ²(X;Y)}.

Theorem 2 (UR test of independence). In the non-differential under-reported scenario, the non-centrality parameters of the ideal test, λ_{χ²(X;Y)}, and of the observable test, λ_{χ²(X̃;Y)}, have the following relationship:

λ_{χ²(X̃;Y)} = λ_{χ²(X;Y)} / κ,  where  κ = p(x = 1)(1 − p(x̃ = 1)) / ( p(x̃ = 1)(1 − p(x = 1)) ).   (3)

Proof can be found in Appendix A.1. The proof builds upon the fact that the following relationship holds between the population values of the SMI terms: I₂(X̃; Y) = I₂(X; Y)/κ. Since κ > 1 whenever X is under-reported, the under-reported test will always be less powerful than the unobservable correctly-reported test. Furthermore, by having a correction factor, we can quantify the amount of the power loss by incorporating knowledge over p(x = 1).
As a result, a χ²-test between the under-reported X̃ and Y with N examples will have the same power as a χ²-test between X and Y with N/κ examples, referred to as the effective sample size. Bross [5] suggested a different approach to deriving a correction factor in non-differential misclassification, using as a starting point the equivalence, under the null hypothesis, between the χ²-test and the difference-of-proportions test; we denote the resulting factor by κ̃ (the analytical derivation under our notation can be found in Appendix A.2). By comparing the two correction factors, κ and κ̃, we can derive the following proposition.
Proposition 1 (Comparing effective sample sizes). Deriving the effective sample size using the correction factor κ results in a more powerful test than using κ̃.
The proof of this proposition is straightforward. In UR scenarios it holds that p(x = 1) > p(x̃ = 1) ⇔ κ < κ̃. Thus, the relationship between the effective sample sizes is N/κ > N/κ̃, which means that using our correction factor κ results in a larger effective sample size and, as a result, a more powerful test.
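The relationship behind Theorem 2 can be checked at the population level. The sketch below (illustrative, with assumed parameter values) builds the true and observed joint distributions under non-differential UR with perfect specificity and sensitivity s, and verifies that the SMI of the observable pair equals the SMI of the ideal pair divided by the odds-ratio factor κ = p(x = 1)(1 − p(x̃ = 1)) / (p(x̃ = 1)(1 − p(x = 1))):

```python
def smi(joint):
    """Squared-loss mutual information from a joint probability table p[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(joint[x][y] for x in range(len(joint)))
          for y in range(len(joint[0]))]
    return 0.5 * sum((joint[x][y] - px[x] * py[y]) ** 2 / (px[x] * py[y])
                     for x in range(len(joint)) for y in range(len(joint[0])))

# Assumed parameters: true prevalence, sensitivity, and p(y=1 | x)
p_x1, s = 0.30, 0.50
p_y_given_x = {0: 0.10, 1: 0.40}
joint_true = [[(1 - p_x1) * (1 - p_y_given_x[0]), (1 - p_x1) * p_y_given_x[0]],
              [p_x1 * (1 - p_y_given_x[1]), p_x1 * p_y_given_x[1]]]
# Non-differential UR: p(x~=1, y) = s * p(x=1, y); everything else reports x~ = 0
joint_obs = [[joint_true[0][y] + (1 - s) * joint_true[1][y] for y in (0, 1)],
             [s * joint_true[1][y] for y in (0, 1)]]
p_xt1 = s * p_x1                              # observed prevalence p(x~ = 1)
kappa = (p_x1 * (1 - p_xt1)) / (p_xt1 * (1 - p_x1))
# Numerically: I2(X~;Y) should equal I2(X;Y) / kappa
print(smi(joint_obs), smi(joint_true) / kappa)
```

Because the reporting mechanism scales every cell p(x = 1, y) by the same sensitivity, the equality holds exactly at the population level for binary X.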
To verify our theoretical results experimentally, we generate synthetic random variables X and Y with different degrees of dependency and explore the false positive/negative rates through the graphs presented in Fig. 2. On the x-axis we have different effect sizes in terms of the squared-loss mutual information between X and Y, while on the y-axis we have the acceptance rate of the null hypothesis H₀ (over 2,000 repeats). The y-intercept represents 1 − False Positive Rate, and should be close to 1 − α in order for the tests to be valid, while elsewhere the plots indicate the False Negative Rate. Fig. 2 verifies experimentally the correctness of our correction factor κ (verification of Theorem 2) and shows its superiority over κ̃ (verification of Proposition 1). Having a known correction factor is very useful for power analysis activities, such as sample size determination [30]. Furthermore, it shows that testing using X̃ instead of X is a valid approach, since all lines have the same intercept at 1 − α, and thus the tests have the same false positive rate. Section 6 shows how our results on the UR test of independence can be useful in analysing a clinical dataset.

Estimating mutual information in under-reported scenarios
In many applications, just giving a yes/no answer through a hypothesis test may not be of much interest, while estimating the size of the effect gives more useful information. In this section we present different ways to estimate mutual information despite under-reporting.
The ideal method to completely correct UR is to spend resources to identify the individuals that have misreported, and correct their testimony; for example, by performing cotinine blood tests on all women who reported non-smoking (x̃ = 0). This approach is expensive, and still it may not be possible to identify all the individuals that have misreported [15]. On the other hand, the simplest way to estimate mutual information in under-reported scenarios is to follow a naive approach and just use the observed data. Unfortunately, this estimator, Î(X̃; Y), is asymptotically biased for estimating I(X; Y). This can be easily proved, since under the model of Fig. 1(a) the following strict inequality holds between the population values: I(X̃; Y) < I(X; Y). Another way to estimate mutual information is by trying to "predict" the real values of the misclassified examples using some prediction model, then imputing new values for these examples, and finally estimating MI using the imputed data. This is similar to solving the missing data problem by imputation [3]. In our running example this means imputing the actual values of the women who reported not smoking (x̃ = 0). To do so we need to build a model to derive the Bayesian posterior distribution⁴ p(x = 1 | y, x̃ = 0), and we use this model to impute the values for the examples with x̃ = 0.
Then, we can use these imputed values to derive point and interval estimates of the MI using the expressions presented in Section 2.4. One limitation of single imputation is that estimating the standard error using conventional methods, such as eq. (2), does not take into account the fact that some of the data were imputed [28]. One solution to this problem is to perform multiple imputations and use improved ways of estimating the standard errors, such as Rubin's rule presented in [3, Chapter 5]. Multiple imputation has some limitations; for example, it is computationally expensive, while, in the case of estimating⁵ MI, there are no guarantees that the confidence intervals derived by Rubin's rule will have the coverage defined by the nominal (user-specified) level. For more details on the strengths and weaknesses of multiple imputation we refer to Rubin [28].
In the next section we present a corrected estimator for the mutual information that takes into account the under-reporting and overcomes the above limitations: (1) it is consistent, unlike the naive approach; (2) it produces valid interval estimates, unlike single imputation; and (3) it is computationally efficient, unlike multiple imputation.

Correcting for under-reporting the mutual information that captures relevance
To estimate mutual information in the under-reported scenario, we need a way to estimate marginal and joint/conditional probabilities despite the restrictions of the problem. While we can estimate the marginal p(y) from all the data, the conditionals are more challenging. For example, the conditional p(y | x = 1) is inaccessible, as we do not have access to the full set of the examples with x = 1, i.e. we do not know the identities of all smokers, but only of those that self-reported it (x̃ = 1). Because of the event-based independence assumption Y ⊥⊥ M_X | x = 1, it holds that:

p(y | x = 1) = p(y | x̃ = 1).   (5)

To find the other conditional, p(y | x = 0), we use a simple trick first introduced by Denis et al. [9] in the context of positive and unlabelled data. By using (5) we can write the marginal as p(y) = p(y | x̃ = 1)p(x = 1) + p(y | x = 0)p(x = 0), and solving for p(y | x = 0):

p(y | x = 0) = ( p(y) − p(y | x̃ = 1)p(x = 1) ) / ( 1 − p(x = 1) ).   (6)

Finally, since we do not have access to the marginal distribution p(x = 1), and since it cannot be estimated without modelling assumptions, we incorporate prior knowledge⁶ as a parameter γ_x, provided by a user's belief over the true prevalence p(x = 1). Incorporating prior knowledge over the true prevalence is a widely used approach in the positive and unlabelled literature [30].
By assuming perfect knowledge over the prevalence, γ_x = p(x = 1), and using only the observed variables Y and X̃, we can estimate I(X; Y) using the following corrected estimator.

Definition 1 (Corrected ML-MI estimator).
The corrected estimator of the MI between an UR variable and the target is given by:

Î_{γ_x}(X̃; Y) = Σ_{y∈Y} Σ_{x∈{0,1}} p̂_{γ_x}(x) p̂_{γ_x}(y | x) ln ( p̂_{γ_x}(y | x) / p̂(y) ),   (7)

where p̂_{γ_x}(x = 1) = γ_x, p̂_{γ_x}(y | x = 1) = p̂(y | x̃ = 1) by (5), and p̂_{γ_x}(y | x = 0) is obtained from (6) with γ_x in place of p(x = 1).
The following lemma proves the consistency of the estimator.
⁴ It is worth pointing out that imputation usually assumes missing at random (MAR), which is not the case in UR, where we have missing not at random (MNAR). To calculate this posterior, we also need prior knowledge of p(x = 1).
⁵ Rubin's rule assumes normality, while the ML-MI estimator can be severely non-normal. For example, for small effect sizes a non-central χ²-distribution provides a better fit; for more details see [31, Section 2.1.1].
⁶ Using prior knowledge over p(x = 1) is equivalent to using prior knowledge over the sensitivity, an approach that is followed to correct the misclassification bias of epidemiological effect sizes [17]. This can be shown by the fact that in non-differential under-reporting it holds that p(x̃ = 1) = p(x̃ = 1 | x = 1) p(x = 1), and p(x̃ = 1) can be estimated from the observed data.
Proving that the estimator is consistent is straightforward, since when we have perfect prior knowledge, γ_x = p(x = 1), by using (5) and (6) it holds that the population value of Î_{γ_x}(X̃; Y) equals I(X; Y). Proving the consistency of the corrected estimator alone is not enough; we also need to capture its variance at finite sample sizes. We do so through the following theorem.
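A minimal sketch of the corrected estimator (our own illustration; the function and argument names are not from the paper): given the observable quantities p(y | x̃ = 1) and p(y), and the belief γ_x, it reconstructs p(y | x = 0) via the Denis et al. trick of eq. (6) and plugs the recovered distribution into the MI formula:

```python
from math import log

def corrected_mi(p_y_given_xt1, p_y, gamma_x):
    """Corrected ML-MI estimate (nats) from observable quantities.

    p_y_given_xt1[y]: p(y | x~ = 1), estimated from the reported positives
    p_y[y]:           marginal p(y), estimated from all the data
    gamma_x:          prior belief over the true prevalence p(x = 1)
    """
    mi = 0.0
    for y, py in enumerate(p_y):
        p_y_x1 = p_y_given_xt1[y]                        # eq. (5)
        p_y_x0 = (py - p_y_x1 * gamma_x) / (1 - gamma_x)  # eq. (6)
        for p_y_x, p_x in ((p_y_x1, gamma_x), (p_y_x0, 1 - gamma_x)):
            if p_y_x > 0:
                mi += p_x * p_y_x * log(p_y_x / py)
    return mi
```

With γ_x equal to the true prevalence the recovered conditionals match the true ones, so at the population level the estimate equals I(X; Y); setting γ_x = p(x̃ = 1) instead recovers the uncorrected I(X̃; Y).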

Theorem 3 (Corrected ML-MI estimator, asymptotic distribution). The estimator Î_{γ_x}(X̃; Y) is asymptotically normally distributed; its asymptotic variance, from which the standard errors used for the interval estimates of eq. (8) are obtained, is derived in Appendix A.3.
Proof can be found in Appendix A.3.

Experiments with synthetic data and perfect prior knowledge
As a "sanity check" for our theoretical results we generated synthetic random variables X and Y with different degrees of dependency. To create the data, firstly we generate the values of X by taking N samples from a Bernoulli distribution with parameter p(x = 1). Then, we randomly choose the parameters p(y | x) that guarantee the desired degree of dependency, expressed in terms of I(X; Y), and we use these parameters to sample the values of Y. To create the under-reported variable X̃, we sample with Sensitivity = p(x̃ = 1 | x = 1) the examples with x = 1. We estimate mutual information using five different methods:
• Ideal: using the unobservable estimator Î(X; Y) and eq. (2) for the standard error.
• No correction: using the under-reported estimator Î(X̃; Y) and eq. (2) for the standard error.
• Single imputation: using a model to impute the possibly misclassified data and then estimating MI and the standard error by eq. (2).
• Multiple imputation: using a model to impute multiple times, averaging MI across the imputed datasets, and using Rubin's rule [3] for the standard error. To decide the number of imputations, we used the White et al. [36] guideline that the number of imputations should be approximately 100 times the fraction of missing information; in under-reporting this can be phrased as using 100 × (1 − Sensitivity) imputations.
• Our correction: using our corrected estimator Î_{γ_x}(X̃; Y) presented in eq. (7) and the results of Theorem 3 for the standard error.
For the imputation-based approaches, we imputed the potentially misclassified examples through the following posterior, which can be naturally derived from the model of Fig. 1(a):

p(x = 1 | y, x̃ = 0) = p(y | x̃ = 1)( γ_x − p(x̃ = 1) ) / ( p(y) − p(y | x̃ = 1)p(x̃ = 1) ).

As we mentioned, we use perfect prior knowledge over γ_x, while the rest of the parameters are estimated through ML from the observed data. To get a fair comparison between the last three methods, we used the same modelling assumptions, and γ_x is assumed to be known and equal to p(x = 1).
Fig. 3 compares the five methods in terms of their mean squared error. The three methods that take into account the under-reporting (single/multiple imputation and our corrected estimator) outperform the naive estimator, which is not consistent. As the sample size/sensitivity increases, all three of these approaches tend to behave similarly to the ideal estimator. Our corrected estimator outperforms the imputation-based approaches, especially at small sample sizes and low levels of sensitivity, which are the most challenging situations. Interestingly, our method clearly outperforms the methods with the same computational complexity (no correction and single imputation).
Fig. 4 verifies that the suggested standard error in Theorem 3 is correct, and that our method is a valid way to derive interval estimates, similar to those derived using the ideal estimator. In this figure we estimate the proportion of times that the 90% confidence intervals, derived by using the different standard errors for the different methods, contain the true value of the mutual information I(X; Y). Since the estimated coverage probabilities for the ideal and our proposed method are at the nominal (user-specified) level of 90%, we can conclude that only these methods produce accurate interval estimates.
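The imputation posterior can be sanity-checked by enumeration. The sketch below (illustrative, with assumed parameter values) builds the full joint p(x, x̃, y) implied by Fig. 1(a) and compares the posterior p(x = 1 | y, x̃ = 0) computed by enumeration against a closed form written purely in observable terms plus the prior γ_x = p(x = 1):

```python
# Assumed parameters: prevalence, sensitivity, and p(y=1 | x)
p_x1, s = 0.30, 0.50
p_y1_given_x = {0: 0.10, 1: 0.40}

def joint(x, xt, y):
    """Full joint p(x, x~, y) under non-differential UR with perfect specificity."""
    if xt == 1 and x == 0:
        return 0.0                         # no over-reporting of smoking
    p_x = p_x1 if x == 1 else 1 - p_x1
    p_y = p_y1_given_x[x] if y == 1 else 1 - p_y1_given_x[x]
    p_xt = (s if xt == 1 else 1 - s) if x == 1 else 1.0
    return p_x * p_y * p_xt

def posterior(y):
    """p(x = 1 | y, x~ = 0) by direct enumeration of the joint."""
    return joint(1, 0, y) / (joint(1, 0, y) + joint(0, 0, y))

p_xt1 = s * p_x1                           # observable prevalence p(x~ = 1)

def posterior_closed(y):
    """Same posterior in observable terms plus the prior gamma_x = p(x = 1)."""
    p_y_xt1 = p_y1_given_x[1] if y == 1 else 1 - p_y1_given_x[1]
    p_y = sum(joint(x, xt, y) for x in (0, 1) for xt in (0, 1))
    return p_y_xt1 * (p_x1 - p_xt1) / (p_y - p_y_xt1 * p_xt1)
```

Both routes agree because p(y, x̃ = 1) = p(y | x̃ = 1)p(x̃ = 1) and, under the model, p(x = 1, x̃ = 0, y) = p(y | x̃ = 1)(p(x = 1) − p(x̃ = 1)).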

Experiments with synthetic data and uncertain prior knowledge
Perfect prior knowledge, i.e. γ_x = p(x = 1), will not always be available. It is therefore important to explore ways to deal with uncertain knowledge and to examine the behaviour under incorrect priors; results are presented below for an artificial scenario where we can control the "quality" of the prior knowledge. Let us assume that the weights of non-smoker births are drawn from a normal distribution with μ = 3500 g and σ = 500 g, while the weights of smoker births are drawn from a normal distribution with μ = 3000 g and σ = 500 g. A birth was considered low birth weight, y = 1, if the weight was < 2500 g [37]. We assume that in a cohort of N = 5000 pregnant mothers 30% are smokers, so p(x = 1) = 0.3, but that only half of the mothers on average would admit to this, so p(x̃ = 1) = 0.15. In a typical draw from this simulation, the mutual information estimated with the under-reported variable is strongly biased downwards. However, after using our corrected estimator and incorporating the prior knowledge that the X variable is non-differentially under-reported, the estimated mutual information increases by a factor of three (Fig. 5(a)).
One way to handle uncertain prior knowledge is by performing a sensitivity analysis, as Fig. 5(a) shows. To do so, we plot the interval estimates for the corrected MI, calculated by eq. (8), for different values of our belief over the probability of smoking (γ_x). As we observe, the point estimate for γ_x = p(x = 1) = 0.30 (perfect knowledge) is the same as the true (ideal) value of the MI. A different way to handle uncertainty is through a simulation-based analysis, where we represent the uncertainty over γ_x as a probability distribution, sample from this distribution many times, and estimate the corrected MI for each sampled value. For example, in Fig. 5(b) we model γ_x as a generalised Beta distribution (bounded between a minimum and a maximum value) and explore the resulting uncertainty in the estimate of the corrected mutual information. As we observe, the true value of the MI is very close to the average over the simulations.
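The birthweight scenario above can be reproduced with a short simulation. The `corrected_joint` function below is our own reconstruction from the non-differential identities γ_x = p(x = 1) and p(y | x̃ = 1) = p(y | x = 1); it is a sketch of a correction of this kind, not necessarily the exact estimator of eq. (7).

```python
import numpy as np

rng = np.random.default_rng(1)

def mi(joint):
    """Mutual information (nats) of a joint probability table."""
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def corrected_joint(x_tilde, y, gamma_x):
    """Rebuild p(x, y) from (x_tilde, y) via p(y | x=1) = p(y | x_tilde=1)."""
    p_y = np.array([np.mean(y == 0), np.mean(y == 1)])
    p_y_x1 = np.array([np.mean(y[x_tilde == 1] == 0),
                       np.mean(y[x_tilde == 1] == 1)])
    j = np.empty((2, 2))
    j[1] = gamma_x * p_y_x1          # p(x = 1, y)
    j[0] = p_y - j[1]                # p(x = 0, y)
    return np.clip(j, 0.0, None)

# 30% of 5000 mothers smoke; smokers' birth weights are N(3000, 500^2),
# non-smokers' N(3500, 500^2); only half of the smokers admit it
N, gamma_x = 5000, 0.3
x = rng.random(N) < gamma_x
weight = rng.normal(np.where(x, 3000.0, 3500.0), 500.0)
y = (weight < 2500).astype(int)                   # low birth weight
x_tilde = (x & (rng.random(N) < 0.5)).astype(int)

naive = mi(np.histogram2d(x_tilde, y, bins=2)[0] / N)
corrected = mi(corrected_joint(x_tilde, y, gamma_x))
print(naive, corrected)   # the corrected estimate should be well above the naive
```

Sampling `gamma_x` from a (generalised) Beta distribution and re-running the last two lines gives the simulation-based uncertainty analysis of Fig. 5(b).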

Correcting for under-reporting the mutual information that captures redundancy
Using the results of the previous section, we can measure redundancy terms when only one of the two features is under-reported, i.e. I(X̃; Z), but we cannot measure terms where both features are under-reported, i.e. I(X̃; Z̃) in Fig. 1(b). To do so we will use the following estimator.

Definition 2 (Corrected ML-MI estimator between two UR variables).
The corrected estimator of the MI between two UR variables, denoted I_{γ_x,γ_z}(X̃; Z̃), is given by eq. (9). The following lemma proves that, with perfect prior knowledge, this estimator is consistent.

Lemma 2 (Corrected ML-MI estimator between two UR variables, consistency). When we have perfect prior knowledge over the prior probabilities, i.e. γ_x = p(x = 1) and γ_z = p(z = 1), the suggested estimator of Definition 2 is consistent, since it holds that I_{γ_x,γ_z}(X̃; Z̃) = I(X; Z).
Proof can be found in Appendix A.4. As a "sanity check" of our theoretical results, we generated synthetic random variables X and Z with different degrees of dependency. To create the under-reported variables X̃ and Z̃, we sample with Sensitivity_x = p(x̃ = 1 | x = 1) the examples with x = 1, and with Sensitivity_z = p(z̃ = 1 | z = 1) the examples with z = 1. We estimate the mutual information using three different methods:
• Ideal: using the unobservable estimator I(X; Z).
• No correction: using the under-reported estimator I(X̃; Z̃).
• Our correction: using our corrected estimator I_{γ_x,γ_z}(X̃; Z̃) presented in eq. (9).
We do not compare against imputation-based methods since, in this scenario, there is no correctly reported variable on which to build the imputation model. Fig. 6 compares the three methods in terms of their mean squared error. As the sample size or sensitivity increases, our suggested method tends to behave similarly to the ideal estimator, and it clearly outperforms the naive method (no correction).
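Assuming independent non-differential mechanisms, the joint of the clean variables can be recovered from the observed joint and the priors, since p(x̃ = 1, z̃ = 1) = s_x s_z p(x = 1, z = 1) with s_x = p(x̃ = 1)/γ_x. The sketch below uses this identity as a stand-in for eq. (9); the data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def mi(joint):
    """Mutual information (nats) of a joint probability table."""
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def corrected_joint_two_ur(xt, zt, gamma_x, gamma_z):
    """Recover p(x, z) from two UR variables with independent mechanisms."""
    s_x = xt.mean() / gamma_x          # sensitivity_x = p(xt = 1)/p(x = 1)
    s_z = zt.mean() / gamma_z
    p11 = np.mean((xt == 1) & (zt == 1)) / (s_x * s_z)   # p(x = 1, z = 1)
    j = np.array([[1.0 - gamma_x - gamma_z + p11, gamma_z - p11],
                  [gamma_x - p11, p11]])
    return np.clip(j, 0.0, None)

# dependent binary x and z, both under-reported with sensitivity 0.7
N = 100_000
x = (rng.random(N) < 0.4).astype(int)
z = np.where(x == 1, rng.random(N) < 0.8, rng.random(N) < 0.2).astype(int)
xt = x * (rng.random(N) < 0.7)
zt = z * (rng.random(N) < 0.7)

gamma_x, gamma_z = x.mean(), z.mean()   # perfect prior knowledge
ideal = mi(np.histogram2d(x, z, bins=2)[0] / N)
naive = mi(np.histogram2d(xt, zt, bins=2)[0] / N)
corrected = mi(corrected_joint_two_ur(xt, zt, gamma_x, gamma_z))
print(naive, corrected, ideal)   # corrected should land close to ideal
```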

Ranking features in under-reported scenarios
By using our theoretical results on estimating mutual information terms in under-reported scenarios, we can derive two different algorithms for producing feature rankings. Before that, we introduce some extra notation: the feature vector X consists of two parts, X = {X_cr, X_ur}, where X_cr contains the features that are correctly reported and X_ur those that are under-reported.

Rankings that capture only relevance
Using our findings from Section 4.1 we suggest Corrected-MIM, an extension of MIM (Section 2.5) suitable when we have UR features. The score of each feature X_k is estimated by:
• I(X_k; Y) when X_k ∈ X_cr, with the MI estimated using eq. (1);
• I_{γ_{x_k}}(X̃_k; Y) when X_k ∈ X_ur, with the MI estimated using eq. (7). (10)
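The dispatch in eq. (10) can be sketched in code. The `corrected_mi` function below is our own stand-in reconstruction of a corrected estimator (built from the non-differential identity p(y | x = 1) = p(y | x̃ = 1) and a known γ_x), not necessarily the paper's eq. (7); the toy data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def mi_from_joint(j):
    """Mutual information (nats) of a joint probability table."""
    px, py = j.sum(1, keepdims=True), j.sum(0, keepdims=True)
    nz = j > 0
    return float((j[nz] * np.log(j[nz] / (px @ py)[nz])).sum())

def mi_ml(x, y):
    """Plug-in ML mutual information for binary variables (eq. (1))."""
    j = np.histogram2d(x, y, bins=2)[0]
    return mi_from_joint(j / j.sum())

def corrected_mi(xt, y, gamma_x):
    """Stand-in for the corrected estimator of eq. (7): rebuilds p(x, y)
    from p(y) and p(y | x_tilde = 1) using gamma_x = p(x = 1)."""
    p_y = np.bincount(y, minlength=2) / len(y)
    p_y_x1 = np.bincount(y[xt == 1], minlength=2) / max((xt == 1).sum(), 1)
    j = np.vstack([p_y - gamma_x * p_y_x1, gamma_x * p_y_x1])
    return mi_from_joint(np.clip(j, 0.0, None))

def corrected_mim(features, y, gammas):
    """Eq. (10): score UR features with eq. (7), the rest with eq. (1)."""
    score = {n: corrected_mi(x, y, gammas[n]) if n in gammas else mi_ml(x, y)
             for n, x in features.items()}
    return sorted(score, key=score.get, reverse=True)

# toy data: 'smoking' is relevant but under-reported (sensitivity 0.5),
# 'age' is correctly reported but irrelevant
n = 20_000
y = rng.integers(0, 2, n)
smoking = (rng.random(n) < np.where(y == 1, 0.5, 0.2)).astype(int)
smoking_tilde = smoking * (rng.random(n) < 0.5)
age = rng.integers(0, 2, n)
ranking = corrected_mim({"smoking": smoking_tilde, "age": age}, y,
                        {"smoking": smoking.mean()})
print(ranking)  # -> ['smoking', 'age']
```

Note that with sensitivity 1 and γ_x equal to the sample frequency, `corrected_mi` reduces exactly to the plug-in estimator, as consistency demands.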

Rankings that capture both relevance and redundancy
To derive feature rankings that capture both relevance and redundancy, we need to combine our findings from Sections 4.1 and 4.2 with the mRMR paradigm presented in Section 2.5. Our suggested method, Corrected-mRMR, ranks the features according to the score

J(X_k) = I(X_k; Y) − (1/|X_θ|) Σ_{X_j ∈ X_θ} I(X_k; X_j),

where X_θ is the set of features already selected, the relevance term I(X_k; Y) is given by eq. (10), and the redundancy term I(X_k; X_j) is estimated as follows:
• when X_k ∈ X_cr and X_j ∈ X_cr: I(X_k; X_j), using eq. (1);
• when X_k ∈ X_ur and X_j ∈ X_cr: I_{γ_{x_k}}(X̃_k; X_j), using eq. (7);
• when X_k ∈ X_cr and X_j ∈ X_ur: I_{γ_{x_j}}(X_k; X̃_j), using eq. (7);
• when X_k ∈ X_ur and X_j ∈ X_ur: I_{γ_{x_k},γ_{x_j}}(X̃_k; X̃_j), using eq. (9).
In the following two sections we present two applications of our theoretical results.
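The greedy Corrected-mRMR loop can be sketched as follows. The `rel` and `red` callables are hypothetical stand-ins for the corrected estimators (the dispatch on eqs. (1)/(7)/(9) would live inside `red`), and the toy scores are made up for illustration.

```python
def corrected_mrmr(feature_names, rel, red, k):
    """Greedy mRMR ranking.

    rel(name)        -> relevance score of a feature (eq. (10))
    red(name, other) -> redundancy between two features, dispatched to
                        eq. (1), (7) or (9) depending on which of the
                        pair is under-reported (supplied by the caller)
    """
    selected, remaining = [], list(feature_names)
    while remaining and len(selected) < k:
        def score(f):
            if not selected:
                return rel(f)
            return rel(f) - sum(red(f, s) for s in selected) / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# toy usage with hand-made scores: f2 is relevant but redundant with f1
relv = {"f1": 1.0, "f2": 0.9, "f3": 0.4}
redm = {("f2", "f1"): 0.8, ("f1", "f2"): 0.8}
order = corrected_mrmr(["f1", "f2", "f3"],
                       rel=relv.get,
                       red=lambda a, b: redm.get((a, b), 0.0),
                       k=3)
print(order)  # -> ['f1', 'f3', 'f2']
```

In the toy run, f2 has the second-highest relevance but is penalised for its redundancy with the already-selected f1, so the independent f3 overtakes it.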

Application in ranking risk factors for low birth weight infants
In this section we present a real-world application of our results: ranking the risk factors that may lead to adverse birth outcomes, derived from a large real-world dataset.
To describe the usefulness of our theoretical findings we use data from a prospective birth cohort, the Born in Bradford (BiB) study. BiB is a longitudinal multi-ethnic birth cohort study that aims to examine the impact of environmental, psychological and genetic factors on maternal and child health and well-being [37]. Bradford is a city in northern England with high levels of socio-economic deprivation and ethnic diversity. The full BiB cohort recruited 12,453 women with 13,776 pregnancies between 2007 and 2010, and the cohort is broadly characteristic of the city's maternal population in terms of age, deprivation and ethnicity. Ethics approval for the study was granted by the Bradford Research Ethics Committee (Ref. 07/H1302/112). In our analyses we focus on term births only, and we excluded ethnic groups (such as Pakistani mothers) that are much less likely to smoke and drink alcohol than the rest of the cohort [37]. As a result, the number of suitable pregnancies was reduced to 5,457.
We show how to rank several risk factors according to their association with LBW. The risk factors that we focus on are the correctly reported categorical variables ethnicity X_E (3 levels), age X_Ag (3 levels), Body Mass Index (BMI) X_B (4 levels), index of multiple deprivation X_I (5 levels), gestational diabetes X_G (binary) and vitamin intake X_V (binary), and the following binary UR variables: any smoking X_S, passive smoking X_P and alcohol consumption X_Al during pregnancy. We assume that 1/3 of the women overall under-report these three variables, and that the UR is non-differential.
In Fig. 7(a) we observe the MIM ranking derived by using the MI between the observed covariates and the target variable (LBW). Then, to correct the three UR variables, we use prior knowledge and our corrected estimators presented in Sect. 4, and we derive the Corrected-MIM ranking of Fig. 7(b). Finally, to take into account the correlations between the risk factors, we derive the Corrected-mRMR ranking of Fig. 7(c). (In the figures, units are milli-nats; a single star * means that the null hypothesis of independence between the reported covariate and LBW is rejected at α = 0.01, a double star ** at α = 0.001.) From these three rankings we can draw the following conclusions:
1. Smoking is the most important risk factor for LBW even without taking the UR into account (Fig. 7(a)).
2. By correcting for UR, alcohol gains one position in the ranking (Fig. 7(b)). Another interesting observation is that the test of independence between passive smoking and LBW does not reject the null hypothesis. Failure to reject the null does not necessarily imply insignificance, as the test may not have sufficient power to detect an actual effect, which is likely the case in an under-reported test, as we showed in Section 3.
3. By using our method for deriving corrected multivariate rankings (Corrected-mRMR), we see that alcohol and passive smoking become the least important factors (Fig. 7(c)). This result matches the known correlation between (passive) smoking, BMI and alcohol [11,22]. Our Corrected-mRMR methodology takes into account the redundancy between these factors; thus, by conditioning on smoking, both alcohol and passive smoking become less important.
The differences between the three rankings illustrate the importance of having techniques that can produce estimates that correct for under-reporting and take into account the correlations between the risk factors. We have demonstrated that failure to correct for potential under-reporting of an exposure will lead to biased estimates of the ranking of the relative effects between variables and outcomes. To properly validate our statistical findings, access to the true values of X_S, X_P and X_Al, e.g. through a blood test, would be necessary; unfortunately, this information was not available. For that reason, in the following section we present the merits of our analysis in a machine learning application, using datasets for which we have access to the ground truth.

Application in feature ranking in under-reported scenarios
In this section we present a machine learning application of our findings: producing feature rankings when the training/test distributions differ and, in particular, when the features are non-differentially under-reported. We used four categorical UCI datasets (Splice, Chess, Mushroom and Congress) and two artificial lung cancer datasets (LUCAS0 and LUCAP0) from the Causation and Prediction Challenge [18]. Splice is a 3-class classification problem, while the rest are binary. Categorical attributes are expanded into several binary attributes. Table 1 shows the characteristics of each dataset. We assumed that the features of these datasets are correctly reported, and to generate UR datasets we randomly under-reported the original features.

Deriving MIM ranking
In this section, we compare the rankings derived by using different MIM methods. Fig. 8 compares the similarities between the MIM rankings derived by the UR methods and the ideal ranking, i.e. the MIM ranking that we would obtain if we had access to the actual values of the features. To measure the similarity between two rankings we use Spearman's ρ correlation coefficient [20], which takes values in [−1, 1]: 1 means the two rankings are identical, 0 means there is no correlation between them, and −1 means one ranking is the inverse of the other. We compare the three UR methods with the same complexity: no correction, single imputation and our correction (for a fair comparison we used perfect prior knowledge for the last two approaches). Fig. 8 shows that our suggested approach, Corrected-MIM, outperforms the other approaches in all settings. Next we compare the different methods in terms of their misclassification error. As a classification model we use a k-nearest-neighbour (k = 3) classifier, which makes few assumptions about the data and treats all features equally, a desirable property when comparing different feature selection/ranking approaches [6]. Fig. 9 compares the different UR approaches in terms of their misclassification error. In the Splice and Chess datasets our approach outperforms the others and achieves performance similar to the ideal estimator. In the remaining datasets all methods perform similarly, and our method always performs comparably to the ideal.
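Spearman's ρ between two rankings can be computed as below; this minimal implementation assumes untied rankings (scipy.stats.spearmanr gives the same values in that case), and the item names are illustrative.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho between two rankings of the same items.

    Each argument lists the item names from best to worst.  Without ties
    this is 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    difference between the positions of item i in the two rankings.
    """
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

ideal = ["smoking", "bmi", "alcohol", "age"]
print(spearman_rho(ideal, ideal))                   # -> 1.0
print(spearman_rho(ideal, list(reversed(ideal))))   # -> -1.0
```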

Deriving mRMR ranking
Next we compare the rankings derived by using different mRMR methods. Fig. 10 compares the similarities between the mRMR rankings derived by the UR methods and the ideal ranking. Our suggested approach, Corrected-mRMR, outperforms the naive uncorrected approach in all settings. Fig. 11 compares the different approaches in terms of their misclassification error. In the Mushroom and Chess datasets our approach outperforms the naive method and achieves performance similar to the ideal estimator. In the remaining datasets all methods perform similarly, and our method always performs comparably to the ideal.

Conclusions and future work
In this work we have provided an information theoretic solution to the problem of under-reported variables. Initially, by reinterpreting under-reporting as a missing data problem, we showed how the tool of missingness graphs [23] can be used to provide graphical representations of the different under-reporting scenarios. Then, using these representations, we explored valid ways to test independence and derived a correction factor that quantifies the power loss of the χ²-test.
Furthermore, by incorporating simple prior knowledge, we derived ways of estimating mutual information quantities that capture both relevance and redundancy. Our suggested estimators are computationally efficient, while their error is similar to that of more complex imputation-based approaches. Additionally, for the mutual information that captures relevance, we derived confidence intervals that achieve the ideal coverage. Using our suggested estimators we proposed two methods for feature ranking: Corrected-MIM, which captures only relevance, and Corrected-mRMR, which captures both relevance and redundancy. Our theoretical results are supported by experiments with synthetic data. Finally, we showed how our findings can be used in a real-world health care application (ranking the risk factors that may lead to low birth weight) and in a machine learning application (feature ranking when the training/test distributions differ).
In many practical applications the misreporting mechanisms are correlated, for example through a latent variable that is a parent of both missingness mechanisms M_X and M_Z in Fig. 1(b). One limitation of our work is that it assumes independent misreporting mechanisms, so an interesting future direction is to explore ways of estimating redundancy terms without making any independence assumption. Furthermore, providing ways of testing, estimating and ranking under differential under-reporting (i.e. when there is a direct arc between the missingness mechanism M_X and the variable Y in Fig. 1(a)) seems challenging. Lastly, finding ways to consistently estimate conditional mutual information terms would provide us with algorithms for structure learning or Markov blanket discovery in UR scenarios [32]. In our earlier work [33] we suggested a way of correcting conditional mutual information when we condition on correctly reported variables; a promising future direction is to derive a corrected estimator for conditioning on under-reported variables.
We believe our results are highly applicable in a wide variety of machine learning applications where we face the problem of under-reporting. Estimating mutual information, testing independence, ranking sets of features according to their relevance/redundancy, learning Bayesian network structures and sample size determination for experimental design are some, but not all, of the possible applications.

Acknowledgements

Born in Bradford is only possible because of the enthusiasm and commitment of the Children and Parents in BiB. We are grateful to all the participants, practitioners and researchers who have made Born in Bradford happen.
Data access statement: All research data supporting this publication are directly available within this publication, apart from the Born in Bradford dataset, which is obtained upon request and is subject to licence restrictions. Due to the potentially identifiable nature of this dataset, which comes from a small geographical area, we are unable to deposit it in the public domain. Full details of how these data were obtained are available in [37], while further details on the application procedure can be found on the Born in Bradford website (http://www.borninbradford.nhs.uk/research-scientific/how-to-request-access-to-raw-bib-data/).

A.1. Proof of Theorem 2
The non-centrality parameter of the under-reported χ²-test is equal to λ_{χ²(X̃;Y)} = 2N I²(X̃; Y), while the non-centrality parameter of the unobserved test is equal to λ_{χ²(X;Y)} = 2N I²(X; Y). To derive a relationship between the two parameters, we must derive a relationship between the two squared-loss mutual information terms.
We will start by re-expressing I²(X; Y) as in eq. (A.1).
Following exactly the same procedure for I²(X̃; Y) we obtain eq. (A.2). Under the non-differential assumption, because of the event-based independence Y ⊥⊥ M_X | x = 1, it holds that p(y | x = 1) = p(y | x = 1, m_x = 1) ⟺ p(y | x = 1) = p(y | x̃ = 1), so from (A.1) and (A.2) we derive the relationship between the two squared-loss terms; multiplying both sides by 2N then gives the relationship between the non-centrality parameters. □
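As a numerical illustration of how the non-centrality parameters translate into test power, the sketch below evaluates the power of the χ²-test for made-up squared-loss MI values, using λ = 2N·I² as above; under-reporting shrinks I² and hence the power.

```python
from scipy.stats import chi2, ncx2

def chi2_power(i2, N, df=1, alpha=0.05):
    """Power of the chi-square test of independence whose non-centrality
    parameter is lambda = 2 * N * I2 (I2: squared-loss mutual information)."""
    lam = 2.0 * N * i2
    crit = chi2.ppf(1.0 - alpha, df)
    return 1.0 - ncx2.cdf(crit, df, lam)

# made-up I2 values for illustration only
print(chi2_power(0.005, N=1000))   # clean variable
print(chi2_power(0.002, N=1000))   # under-reported: noticeably less power
```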

A.2. Derivation of the correction suggested by Bross [5]
Bross [5, p. 484] suggests that in order to calculate the power of the χ²-test we should use an effective sample size of 1/κ times the actual sample size, where the correction factor κ is given in eq. (1.02) of [5]. Bross derived this correction factor by starting from the equivalence, under the null hypothesis, between the χ²-test and the difference-of-proportions test.
Using our notation, this correction factor can be written as in eq. (A.3). In our UR setting we have p(x̃ = 0 | x = 0) = 1 ⟺ p(x̃ = 1 | x = 0) = 0. By substituting p(x̃ = 1 | x = 0) = 0 into eq. (A.3) we get eq. (A.4). The conditional probability can be written as p(x̃ = 1 | x = 1) = p(x̃ = 1, x = 1)/p(x = 1). Because of the UR constraint (details in Section 2.1), whenever an example has x̃ = 1 it also holds that x = 1. This means that p(x̃ = 1, x = 1) = p(x̃ = 1), and as a result the conditional probability takes the form p(x̃ = 1 | x = 1) = p(x̃ = 1)/p(x = 1). Substituting this expression into eq. (A.4) gives the expression for κ presented in Section 3. □

A.3. Proof of Theorem 3
To derive the asymptotic distribution of I_{γ_x}(X̃; Y) we will use the delta method [2, Section 16.1.4], which we formally present in the following lemma.
Lemma 3 (Delta method). Suppose that the cell counts n = {n_{x,y}} have a multinomial distribution with cell probabilities p = {p(x, y)}, ∀x ∈ X, y ∈ Y. Let N = Σ_{x∈X} Σ_{y∈Y} n_{x,y}, and let p̂ denote the sample proportions, p̂(x, y) = n_{x,y}/N. Let g(p) ∈ ℝ be a differentiable function, and let φ_{x,y} = ∂g/∂p(x, y) evaluated at p, ∀x ∈ X, y ∈ Y. Assume that at least one φ_{x,y} is nonzero. Then the distribution of √N (g(p̂) − g(p)) converges to the normal distribution N(0, σ²) when N → ∞, where σ² = Σ_{x∈X} Σ_{y∈Y} p(x, y) φ²_{x,y} − (Σ_{x∈X} Σ_{y∈Y} p(x, y) φ_{x,y})².
The expression of the corrected estimator I_{γ_x}(X̃; Y) is given in eq. (7).
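The delta-method variance of Lemma 3 can be checked numerically for the plug-in MI estimator. For g = I(X; Y), φ_{x,y} = log(p(x, y)/(p(x)p(y))) up to an additive constant that cancels in σ²; the joint below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# a fixed 2x2 joint p(x, y) for the illustration
p = np.array([[0.3, 0.2],
              [0.1, 0.4]])

def mi(j):
    """Mutual information (nats) of a joint probability table."""
    px, py = j.sum(1, keepdims=True), j.sum(0, keepdims=True)
    nz = j > 0
    return float((j[nz] * np.log(j[nz] / (px @ py)[nz])).sum())

# delta method: phi_{x,y} = dg/dp(x,y); for g = MI this is
# log(p(x,y) / (p(x) p(y))) up to an additive constant, which cancels in
# sigma^2 = sum p*phi^2 - (sum p*phi)^2
phi = np.log(p / (p.sum(1, keepdims=True) @ p.sum(0, keepdims=True)))
sigma2 = (p * phi ** 2).sum() - (p * phi).sum() ** 2
se_delta = np.sqrt(sigma2 / 10_000)

# empirical standard error of the plug-in MI over repeated samples
reps = [mi(rng.multinomial(10_000, p.ravel()).reshape(2, 2) / 10_000)
        for _ in range(2000)]
se_emp = float(np.std(reps))
print(se_delta, se_emp)   # the two should agree closely
```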