A unified framework for evaluating the risk of re-identification of text de-identification tools

Objectives: It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine if these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases.
Methods: We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall in order to illustrate the difference between the proposed evaluation framework and the current baseline method.
Results: We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. For the probability of re-identification using our evaluation framework we obtained a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release following commonly used de-identification criteria for structured data.
Discussion: Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification.
Conclusions: This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools.


Introduction
There has been significant research on developing tools for the de-identification of free-form medical text [1,2]. The evaluation methods currently used to determine whether these tools are performing well enough are borrowed from the areas of entity extraction and information retrieval [3]. There has been some recognition that these evaluation approaches are not always the most appropriate for measuring the probability of re-identification, nor are the benchmarks typically used to decide what is "good enough" directly relevant to the de-identification task [4]. Such concerns triggered the current work.
In this paper we critically examine the methods that are currently used to evaluate medical text de-identification tools [1,2], identify their weaknesses, and propose improvements. We then propose a unified framework for evaluation in terms of the probability of re-identification when medical text is de-identified using automated tools. Our framework builds on existing work, and its main contribution is that it brings multiple concepts together from the disclosure control literature, the information retrieval literature, and the risk modeling literature to provide a more detailed evaluation scheme for measuring re-identification risk.
The issues we identify in current evaluation methods can in some instances inflate the performance of de-identification tools by making them look better than they really are, and in other instances may also penalize them by making them seem much worse than they really are. This means that our proposed evaluation framework will not consistently give higher risk values or lower risk values than currently used methods, although we argue that it represents a more accurate modeling of the probability of re-identification because it better accounts for the distribution of identifiers in documents. We illustrate the differences between our framework and conventional evaluation approaches using theoretical and empirical examples. We then illustrate the application of this framework on a clinical data set, and compare the findings to what would be obtained using current evaluation methods.

Evaluation approaches used in text de-identification
Most of the current text de-identification systems treat Personal Health Information (PHI) identification as a named entity recognition problem. Consequently, they evaluate the identification performance with metrics used in the named entity recognition and information retrieval literature [3]. In particular, they typically annotate different types of entities (or categories), such as date, patient name, and ID, and report performance primarily using three metrics: precision, recall, and f-measure. Let $tp$ be the number of true positive annotations, $fp$ be the number of false positive annotations, and $fn$ be the number of false negative annotations. Then, recall $r$ is given by
$$r = \frac{tp}{tp + fn} \tag{1}$$
and precision $p$ is given by
$$p = \frac{tp}{tp + fp} \tag{2}$$
Recall and precision answer two questions about a de-identification tool, respectively: "Did we find all that we were looking for?" and "Did we only label what we were looking for?" The metric f-measure combines precision and recall, typically by taking the harmonic mean of the two. To get a sense of the overall performance of a system, the most commonly used metrics are micro-average and macro-average precision, recall, and f-measure. To compute micro-average, one creates a confusion matrix for all categories and then computes precision and recall from this table, giving equal weight to each PHI instance irrespective of its category. To compute macro-average, one computes precision and recall for each category separately and then averages them over all categories, giving equal weight to each category, to get an overall measure of performance.
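As a minimal illustration of the difference between the two averages, the following Python sketch computes micro-average and macro-average recall. The category names and counts are hypothetical values invented for illustration; they do not come from any data set discussed in this paper.

```python
from typing import Dict, Tuple

# Hypothetical per-category counts: category -> (true positives, false negatives).
COUNTS: Dict[str, Tuple[int, int]] = {
    "date": (950, 50),
    "patient_name": (90, 10),
    "id": (6, 4),
}


def recall(tp: int, fn: int) -> float:
    """Recall r = tp / (tp + fn)."""
    return tp / (tp + fn)


def micro_average_recall(counts: Dict[str, Tuple[int, int]]) -> float:
    """Pool all instances into one table, so each PHI instance has equal weight."""
    total_tp = sum(tp for tp, _ in counts.values())
    total_fn = sum(fn for _, fn in counts.values())
    return recall(total_tp, total_fn)


def macro_average_recall(counts: Dict[str, Tuple[int, int]]) -> float:
    """Average the per-category recalls, so each category has equal weight."""
    per_category = [recall(tp, fn) for tp, fn in counts.values()]
    return sum(per_category) / len(per_category)


if __name__ == "__main__":
    print("micro-average recall:", round(micro_average_recall(COUNTS), 4))
    print("macro-average recall:", round(macro_average_recall(COUNTS), 4))
```

With these invented counts the frequent "date" category dominates the micro-average, while the poorly detected "id" category pulls the macro-average down.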
In Appendix A we summarize evaluation metrics currently used in the text de-identification literature. This review indicates that micro-average recall is a primary metric for evaluating such tools. We also conclude that the number of clinical notes (i.e., number of patients) used in different studies ranges from 100 to 7193, and that the number of test documents used in different studies ranges from 220 to 514.
In the context of text de-identification, current evaluation approaches are limited in three ways. First, they report performance on all instances of an entity across all documents. However, none of them consider the number of PHI elements missed within a document, which is an important aspect in de-identification, as a document typically corresponds to a patient and any leaks within a document mean potentially revealing the identity of that patient. In other words, current evaluation approaches do not truly reflect the risk of a patient being re-identified. Second, they evaluate all types of entities with the same evaluation metric, giving equal weight to each entity type even though directly identifying entities, such as name and address, have a higher risk of re-identification compared to indirectly identifying entities, such as age and race. Finally, they do not account for the distribution of PHI across documents. For example, an entity type that is rare and appears in very few documents will have a higher sensitivity to the performance of an information extraction tool than a more prevalent entity type. We examine each of these issues below.

Basic concepts
The key assumptions that we make in developing our evaluation framework are detailed below. Some of these assumptions are already made in the literature implicitly, but it is important in our context to make them explicit.

One document = one patient
We assume that every document that is being analyzed pertains to an individual patient (i.e., there is a one-to-one mapping between documents and patients). This means that if a document pertains to multiple patients then that information is split into multiple documents. This assumption simplifies the presentation of our framework and its rationale.
In the case where a simple split is not possible, as in the case of clinical study reports from clinical trials, then we assume that all of the information pertaining to an individual trial participant can be extracted as a unit and treated as a separate virtual document for the purposes of evaluation.
This assumption also means that each patient only has one document in the corpus. For example, if the evaluation corpus consists of hospital discharge records, then each patient has a single discharge record.

Information leak = re-identification
Furthermore, we assume that if an annotation is not detected (i.e., "leaked") then it can be used to re-identify a patient. So the probability of re-identifying a patient is conditional on a leak occurring. We have:
$$\Pr(\text{reid}, \text{leak}) = \Pr(\text{reid} \mid \text{leak}) \times \Pr(\text{leak}) \tag{3}$$
The probability of a leak in a set of documents is directly related to recall, $r$, given by:
$$\Pr(\text{leak}) = 1 - r \tag{4}$$
Based on our assumptions we can then say:
$$\Pr(\text{reid} \mid \text{leak}) = 1 \tag{5}$$
We will examine further below how much information needs to be leaked to re-identify a patient. This simplifying assumption is conservative in that it will inflate the risk of re-identification.

Re-identification from correct information extraction
A corollary to the assumption above is that if an annotation is detected, or "caught", then it is either redacted or re-synthesized, such that the probability of re-identifying a patient from that information is zero.
We can formulate this probability as:
$$\Pr(\text{reid}, \text{catch}) = \Pr(\text{reid} \mid \text{catch}) \times \Pr(\text{catch}) \tag{6}$$
where $\Pr(\text{catch}) = 1 - \Pr(\text{leak})$, which is recall. Clearly the annotations that were leaked versus those that were caught are mutually exclusive. The overall probability of re-identification is therefore given by $\Pr(\text{reid}, \text{catch}) + \Pr(\text{reid}, \text{leak})$, or:
$$\bigl[\Pr(\text{reid} \mid \text{catch}) \times (1 - \Pr(\text{leak}))\bigr] + \bigl[\Pr(\text{reid} \mid \text{leak}) \times \Pr(\text{leak})\bigr] \tag{7}$$
which, given the assumption that $\Pr(\text{reid} \mid \text{leak}) = 1$ in equation (5), simplifies to:
$$\Pr(\text{reid} \mid \text{catch}) + \bigl[\Pr(\text{leak}) \times (1 - \Pr(\text{reid} \mid \text{catch}))\bigr] \tag{8}$$
The above equation represents the overall probability of re-identification from annotations that were detected during information extraction and modified, and those that were leaked. For now, we will assume that $\Pr(\text{reid} \mid \text{catch}) = 0$, which will be valid in most cases where redaction or re-synthesis are used. For specific contexts in which generalization or other transformations are performed on the detected identifiers, such as for documents shared in the context of clinical trials transparency efforts, we drop this assumption and allow $\Pr(\text{reid} \mid \text{catch}) > 0$. The relaxation of this assumption is discussed further in Appendix B.

Distinction between direct and indirect identifiers
As is commonly done in the disclosure control literature [5][6][7], we consider two types of PHI annotations in text: direct identifiers and quasi-identifiers. Direct identifiers are annotations such as first name, last name, telephone numbers, unique identifiers (for example, medical record numbers (MRNs) and social security numbers (SSNs)), and email addresses. Quasi-identifiers are annotations that can indirectly identify the patients, such as dates, ZIP codes, city, state, and facility names. Direct and quasi-identifiers are the types of features in health information that are typically targeted during the de-identification of health data [8]. In our analysis we will make a distinction between these two types of annotations because the manner in which they need to be evaluated will differ.

Focus on micro-average recall
Given that our focus is mostly on a unified framework for measuring re-identification risk, recall is most relevant. This does not mean that precision is not important as a metric to evaluate the performance of de-identification tools: only that in the context of the current paper it will not be the focus of our analysis.
Since we do not consider precision further in this paper, we also do not consider the f-measure since it combines recall and precision. We can also see in the literature review in Appendix A that the most commonly used metric for evaluating the risk of re-identification is micro-average recall. Micro-average recall is therefore used as the baseline measure of re-identification risk.

Critical appraisal of performance evaluation methods
We now consider the weaknesses in conventional approaches to performance evaluation and address these weaknesses. To illustrate some of these points, we use the 2006 i2b2 de-identification challenge data set [2]. The data from this challenge has become a standard for text de-identification evaluation. The data has been manually de-identified "for the challenge by replacing authentic PHI with synthesized surrogates"; however, the surrogate PHI is not realistic. For our purposes, we used a rule-based de-identification tool described in [9] for our illustrations below.

All-or-nothing recall
Imagine there is an evaluation set of 100 clinical documents, and these documents have 250 different instances of the last name of a patient. Then micro-average recall would be computed across all of these 250 instances. If 230 of the instances were detected by the de-identification tool then the recall would be 0.92 (i.e., 230/250).
The micro-average does not account for the fact that there were 100 documents, and it does not account for how these names were distributed across these documents. This is important because for direct identifiers, the general assumption is that a single instance of a direct identifier is sufficient to determine the identity of the patient. Although one can come up with counter-examples to this assumption (for example, the name "James" would not directly identify a unique patient because it is so common), it is one assumption commonly made in the disclosure control community and errs on the conservative side. The implication of this assumption is that we will be conservative because any single leaked direct identifier is equated with a successful re-identification. If re-identification could in fact only happen when two direct identifiers leak, then we would be overly protective.
If a single instance of a direct identifier in a document can reveal the identity of the patient, then all that is needed to reveal the identity of a patient is for a single direct identifier to leak (or not to be detected) in a document. If a document has 10 instances of a patient's last name and 9 of those instances are detected, from a re-identification risk perspective this is not a 90% recall but a 0% recall because there was at least one leak. This is the all-or-nothing recall.
To continue with our example, if the 230 names that were correctly detected were all the names in 80 documents, and the remaining 20 names that were not detected were in the other 20 documents (i.e., one name in each document), then determining the identity of the patient in these 20 documents is almost certain. The micro-average recall of 0.92 inflates the performance of the de-identification tool. The all-or-nothing recall in this case is 0.8, and the correct probability of re-identifying an individual in these documents is then 0.2 instead of 0.08. Therefore, for direct identifiers it is important to use the all-or-nothing recall value rather than the micro-average recall value [9]. Consider Table 1, which illustrates the magnitude of the differences between micro-average recall and all-or-nothing recall on the i2b2 data set. The "DI" group contains all annotation types that would be classified as direct identifiers. Notice how the micro-average recall remains fairly constant when including many PHI types, while the all-or-nothing recall drops markedly. Adding more annotation types can only add more opportunities to leak values, which leads to monotonically decreasing all-or-nothing recall. However, the micro-average, by definition of an average, need not decrease; adding an annotation type with a high recall could increase the average, even though the documents previously containing leaks still contain leaks. The micro-average can be extremely misleading about the rate of re-identification for leaked direct identifiers.
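The worked example above can be reproduced with a short Python sketch. The split of the 230 detected names across the 80 fully protected documents (70 documents with three names and 10 with two) is an assumption made only so that the totals match the example; the function names are illustrative.

```python
from typing import List, Tuple

# Each document is (detected instances, total instances) of the patient's last name.
Document = Tuple[int, int]


def micro_average_recall(docs: List[Document]) -> float:
    """Recall pooled over all instances, ignoring document boundaries."""
    detected = sum(d for d, _ in docs)
    total = sum(t for _, t in docs)
    return detected / total


def all_or_nothing_recall(docs: List[Document]) -> float:
    """Fraction of documents containing the identifier in which every instance was detected."""
    relevant = [(d, t) for d, t in docs if t > 0]
    fully_protected = sum(1 for d, t in relevant if d == t)
    return fully_protected / len(relevant)


# Assumed layout: 80 documents with all names detected (230 names in total)
# and 20 documents with a single, undetected name each.
DOCS: List[Document] = [(3, 3)] * 70 + [(2, 2)] * 10 + [(0, 1)] * 20

if __name__ == "__main__":
    print("micro-average recall: ", micro_average_recall(DOCS))   # 0.92
    print("all-or-nothing recall:", all_or_nothing_recall(DOCS))  # 0.8
```

A document with 9 of 10 instances detected, written as `(9, 10)`, would count as unprotected in `all_or_nothing_recall` even though it contributes 0.9 to the pooled figure.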

Masking recall
During information extraction a particular type of annotation is detected. For example, if there is a "James" in the document then it is identified and then classified as a "First Name". If both of these steps (identification and classification) are true, then this is typically considered a true positive. However, from a de-identification/recall perspective it does not matter whether "James" is classified as a first name or a last name. All that matters is that it has been detected. Of course the classification as a "First Name" may matter from a precision perspective, but it does not matter from a recall perspective.
Consider Table 2 where the annotation provided manually by an expert does not match what a de-identification tool could determine. However, in a redacted document the net effect is the same: the name of the facility will be protected. All the identifying information is removed. Therefore, a more precise recall would consider the organization completely masked, even though it is masked by several annotations of different types.
Therefore, we define masking recall as the recall value calculated only based on whether a particular direct or quasi-identifier in the text has been detected or not [4] (also called PHI-level evaluation in [10]). Masking recall should use a token level evaluation: evaluate that each token is masked.
Consider Table 3, which shows the comparison of masking recall and conventional recall for different annotation types in the i2b2 data set. For all annotation types, masking recall is markedly higher than the conventional recall. The crucial question from the de-identification perspective is whether we missed a PHI or not. The masking recall more clearly answers this question, as it indicates the extent to which instances of an annotation type were identified as PHI. For example, among all IDs, 83.96% of them were identified as a PHI of some annotation type.
However, a token level evaluation would be problematic if the frequencies of tokens in the data set are not similar. For example, consider a data set with 1000 documents. All 1000 documents have a first name, and only 10 have a last name. The de-identification tool detected the first names in 999 of the 1000 documents and only 2 of the last names. If we pool both names as suggested above the recall would be 1001/1010 = 0.99. This, however, completely hides the very low recall on last names (2/10 = 0.2) because of the extreme imbalance in the frequency of occurrence of each name. Therefore, the concept of masking recall is only appropriate if the frequencies of all of the direct and quasi-identifiers are more-or-less the same in the data set. In practice this cannot be ensured and therefore we need a more robust approach for evaluation.
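The imbalance in this example can be checked with a few lines of Python. This is only an illustrative sketch using the counts stated in the paragraph above; the dictionary keys are arbitrary labels.

```python
# Counts taken from the example: documents in which each name type was detected / present.
DETECTED = {"first_name": 999, "last_name": 2}
PRESENT = {"first_name": 1000, "last_name": 10}

# Pooled recall treats every name the same, regardless of its type.
pooled_recall = sum(DETECTED.values()) / sum(PRESENT.values())

# Per-type recall exposes the weak performance on the rare type.
per_type_recall = {name: DETECTED[name] / PRESENT[name] for name in PRESENT}

if __name__ == "__main__":
    print("pooled recall:  ", round(pooled_recall, 4))
    print("per-type recall:", per_type_recall)
```

The pooled figure rounds to 0.99 while the last-name recall is 0.2, which is the effect the more robust approach in the next section is meant to address.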

One or more leaks of direct identifiers
As noted earlier, for direct identifiers we assumed that a leak of a single value in a document would result in the patient being re-identified. To be precise, we are concerned about at least one of the direct identifiers leaking from the de-identification process. We also need to evaluate this in a manner that accounts for the different frequencies of different types of identifiers. Let $s_i$ be the number of documents that a particular identifier $i$ appears in, and $n$ the total number of documents. Then we can define the probability that a direct identifier is missed or leaks given that it actually appears in the corpus being evaluated as:
$$\Pr(\text{leak}_i \mid \text{appears}_i) = 1 - r_i \tag{9}$$
which gives the probability that a leak will occur given that the identifier actually appears in the data. The probability that direct identifier $i$ leaks and appears in a document is given by:
$$\Pr(\text{leak}_i, \text{appears}_i) = w_i (1 - r_i) \tag{10}$$
where $w_i = s_i / n$ and $r_i$ is the all-or-nothing recall. The probability that a document will leak at least one direct identifier is therefore given by:
$$1 - \prod_i \bigl(1 - w_i (1 - r_i)\bigr) \tag{11}$$
This gives us the combined probability of a leak for all direct identifiers. Since each direct identifier type is dealt with independently, the frequency with which specific direct identifiers appear in the data set will not affect this calculation directly (except when computing the confidence intervals).
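A minimal Python sketch of this calculation follows. It assumes that the combined probability takes the product form $1 - \prod_i (1 - w_i(1 - r_i))$, with $w_i = s_i/n$ and $r_i$ the all-or-nothing recall, and that identifier types are treated as independent. The identifier names, document counts, and recall values are hypothetical.

```python
from typing import Dict, Tuple


def direct_leak_probability(identifiers: Dict[str, Tuple[int, float]], n: int) -> float:
    """Probability that a document leaks at least one direct identifier.

    identifiers maps an identifier type to (s_i, r_i), where s_i is the number
    of documents containing the type and r_i is its all-or-nothing recall.
    n is the total number of documents.
    """
    prob_no_leak = 1.0
    for s_i, r_i in identifiers.values():
        w_i = s_i / n                           # share of documents containing the type
        prob_no_leak *= 1.0 - w_i * (1.0 - r_i)  # this type appears and leaks with prob w_i * (1 - r_i)
    return 1.0 - prob_no_leak


# Hypothetical corpus of 1000 documents.
EXAMPLE = {
    "last_name": (1000, 0.98),
    "mrn": (500, 0.96),
    "phone": (100, 0.90),
}

if __name__ == "__main__":
    print(round(direct_leak_probability(EXAMPLE, n=1000), 4))
```

For these invented values the result is about 0.049; the rare "phone" type contributes little despite its lower recall because it appears in only 10% of documents.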

Quasi-identifier risk
For quasi-identifiers, a single value is not necessarily uniquely identifying. However, there is evidence that, in a number of jurisdictions, two quasi-identifiers such as the date of birth and the ZIP or postal code, are unique across most of the population [11][12][13][14][15]. For example, that uniqueness approaches 100% in Canada and the Netherlands [11][12][13], and is closer to 63% in the US [14]. We therefore make the conservative assumption that at least two quasi-identifiers must leak in the same document to re-identify a patient.
Let $m$ be the number of times, on average, that a quasi-identifier value in a document is repeated (i.e., the average number of instances per quasi-identifier value). Also, let $r_q$ be the micro-average recall computed across all quasi-identifiers. Then the probability of at least one quasi-identifier instance being leaked would be given by $1 - (r_q)^m$. This means that the more instances that a quasi-identifier has in a document, the greater the likelihood that there will be a leak.
Finally, let $n_q$ be the average number of distinct quasi-identifier values per document. Since we do not know which two or more quasi-identifiers will be leaked, we need to account for all combinations of 2 or more leaks. This can be represented as a binomial distribution with $n_q$ trials:
$$\Pr(X \ge 2) \quad \text{for } X \sim B\bigl(n_q,\; 1 - (r_q)^m\bigr) \tag{12}$$
where $B(a, b)$ is a binomial distribution with $a$ trials and $b$ probability of success. This is a suitable distribution even when the population is known to be finite. The values for $m$ and $n_q$ are computed from the data. The expression in equation (12) assumes that the instances for the same quasi-identifier are protected independently. In practice, this is a conservative assumption since the ability to detect one instance of a quasi-identifier could be quite similar across all instances of that quasi-identifier in a document. For example, the recall for a date of birth will be the same for all instances of date of birth. A less conservative approach for modeling of at least two quasi-identifiers leaking would then be $\Pr(X \ge 2)$ for $X \sim B(n_q,\; 1 - r_q)$. We nevertheless err on the conservative side because the recall will also depend on the context in which a quasi-identifier is used and how it is expressed, and that will not necessarily always be the same across all instances. For example, the name of a facility may be "The Ottawa Hospital", "TOH", and "the general hospital in Ottawa" and all of these instances refer to the same quasi-identifier but will have different recall values.
Table 3. Comparison of masking recall and conventional recall on i2b2 data.
In the i2b2 data set the proportion of documents with at least two leaked quasi-identifiers was 0.3704, and the probability as computed using equation (12) was 0.467. Therefore, we can see in this example that equation (12) sets an upper bound on the risk and errs on the conservative side.
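The binomial tail in equation (12) can be sketched in Python as follows. The sketch assumes that $n_q$ has been rounded to a whole number of trials (the text defines it as an average) and uses hypothetical values for $r_q$, $m$, and $n_q$; it does not reproduce the i2b2 figures quoted above.

```python
def prob_at_least_two(n_trials: int, p: float) -> float:
    """Pr(X >= 2) for X ~ Binomial(n_trials, p), via 1 - Pr(X = 0) - Pr(X = 1)."""
    if n_trials < 2:
        return 0.0
    p_zero = (1.0 - p) ** n_trials
    p_one = n_trials * p * (1.0 - p) ** (n_trials - 1)
    return 1.0 - p_zero - p_one


def quasi_identifier_risk(r_q: float, m: float, n_q: int,
                          independent_instances: bool = True) -> float:
    """Probability that at least two quasi-identifier values leak in a document.

    With independent_instances=True each of the m instances of a value must be
    caught separately (the conservative form); otherwise one recall draw per
    value is used (the less conservative form mentioned in the text).
    """
    p_value_leaks = 1.0 - r_q ** m if independent_instances else 1.0 - r_q
    return prob_at_least_two(n_q, p_value_leaks)


if __name__ == "__main__":
    # Hypothetical inputs: recall 0.9, two instances per value, five values per document.
    print("conservative:     ", round(quasi_identifier_risk(0.9, 2, 5), 4))
    print("less conservative:", round(quasi_identifier_risk(0.9, 2, 5, False), 4))
```

With these invented inputs the conservative form gives roughly 0.24 and the less conservative form roughly 0.08, showing how strongly the independence assumption drives the estimate.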

Re-synthesis recall
It is common practice to replace the elements in text that are annotated by the de-identification tool as direct or quasi-identifiers with fake values. These would be randomly generated values that are substituted for the original values. Such a re-synthesis of the original text ensures that the de-identified text looks realistic.
It has been shown that an adversary who attempts to re-identify individuals from a re-synthesized document has difficulty in determining which identifiers are re-synthesized ones versus original ones that were missed by the de-identification tool [16,17]. For example, if the de-identified text has the names "James" and "Alan" in the document, there will be uncertainty as to which one of these is the real name of the patient. For this reason, re-synthesis allows leaks to be hiding in plain sight.
The probability that a document will leak at least one direct identifier that is recognized by an adversary, and therefore the probability of re-identification, is given by:
$$\Pr(\text{recognize}, \text{leak}, \text{appears}) = \Pr(\text{recognize} \mid \text{leak}, \text{appears}) \times \Pr(\text{leak} \mid \text{appears}) \times \Pr(\text{appears}) \tag{13}$$
Let $h$ be the probability that a leaked identifier value fails to hide in plain sight, i.e., the probability that an adversary can correctly determine that an identifier is an original one that was leaked rather than one that was re-synthesized. This is the term $\Pr(\text{recognize} \mid \text{leak}, \text{appears})$ in equation (13). The above formulation for direct identifiers can be computed as:
$$1 - \prod_i \bigl(1 - h\, w_i (1 - r_i)\bigr) \tag{14}$$
where $w_i = s_i / n$ is the proportion of documents in which identifier $i$ appears and $r_i$ is the all-or-nothing recall for direct identifiers. For quasi-identifiers we have:
$$\Pr(X \ge 2) \quad \text{for } X \sim B\bigl(n_q,\; h(1 - (r_q)^m)\bigr) \tag{15}$$
Based on previous experiments [16] a reasonable value can be computed as $h = 0.1$, which also errs on the more conservative side given that some studies found that $h = 0$ [17].

Strict recall
Equation (14) could result in quite small values of recall giving seemingly acceptable levels of re-identification probability. For example, if we use $h = 0.1$ from [16], $w_i = 1$, and $r_i = 0.4$, then the overall probability of re-identification with re-synthesis would be 0.06, even though the value of $r_i$ is quite low. Furthermore, with a low value for $r_i$ the density of identifiers that have leaked will be high and it is not clear that the $h$ value from these previous studies would still hold. Therefore, we need to specify a minimal value for the recall values in order to use the re-synthesis adjustment. This adjusts the equations above so that the factor $h$ is applied only for recall values of at least 0.9. For direct identifiers we have:
$$1 - \prod_i \bigl(1 - h_i\, w_i (1 - r_i)\bigr), \qquad h_i = \begin{cases} h & \text{if } r_i \ge 0.9 \\ 1 & \text{if } r_i < 0.9 \end{cases} \tag{16}$$
In this case we assumed that a high recall of 0.9 for direct identifiers would be necessary for the published $h$ value to hold. We use a slightly lower cutoff value than is reported in the literature [16] because the literature consistently uses micro-average recall rather than all-or-nothing recall, which results in inflated recall values. Therefore, the lower threshold is an attempt to adjust for that. Note the impact that $w$, the probability that a direct identifier appears in a document, has on the overall risk from direct identifiers. On the one hand $w < 1$ will decrease risk, possibly even compensating for the loss of the factor $h = 0.1$ when recall is below 0.9; on the other hand $w$ will increase the variance of the recall estimate (which depends on $s_i = n \times w_i$). In order to justify the use of the factor $h$ we need to ensure that the recall is significantly greater than or equal to 0.9 (see the discussion of confidence intervals in Section 2.3.8).
And for quasi-identifiers,
$$\Pr(X \ge 2 \text{ if } r_q \ge 0.7, \text{ or } Y \ge 2 \text{ if } r_q < 0.7) \quad \text{for } X \sim B\bigl(n_q,\; h(1 - (r_q)^m)\bigr),\; Y \sim B\bigl(n_q,\; 1 - (r_q)^m\bigr) \tag{17}$$
where 0.7 is the minimum recall value. This is the value that we have used in our analysis based on our subjective judgement and what would be acceptable to the institution releasing the data in our study, but it is a parameter that can be adjusted by the analyst.
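The effect of the minimum recall values can be sketched in Python. The sketch assumes one reading of the strict-recall rule: the re-synthesis factor $h$ is applied per direct identifier only when its all-or-nothing recall is at least 0.9, and to quasi-identifiers only when $r_q$ is at least 0.7; otherwise the factor is replaced by 1. Function names and the test inputs are illustrative, and $n_q$ is assumed to be a whole number.

```python
from typing import List, Tuple


def prob_at_least_two(n_trials: int, p: float) -> float:
    """Pr(X >= 2) for X ~ Binomial(n_trials, p)."""
    if n_trials < 2:
        return 0.0
    return 1.0 - (1.0 - p) ** n_trials - n_trials * p * (1.0 - p) ** (n_trials - 1)


def direct_risk_strict(identifiers: List[Tuple[float, float]],
                       h: float = 0.1, min_recall: float = 0.9) -> float:
    """Direct-identifier risk; identifiers is a list of (w_i, r_i) pairs."""
    prob_no_reid = 1.0
    for w_i, r_i in identifiers:
        factor = h if r_i >= min_recall else 1.0  # no hiding-in-plain-sight credit below the cutoff
        prob_no_reid *= 1.0 - factor * w_i * (1.0 - r_i)
    return 1.0 - prob_no_reid


def quasi_risk_strict(r_q: float, m: float, n_q: int,
                      h: float = 0.1, min_recall: float = 0.7) -> float:
    """Quasi-identifier risk with the same cutoff logic applied to r_q."""
    factor = h if r_q >= min_recall else 1.0
    return prob_at_least_two(n_q, factor * (1.0 - r_q ** m))


if __name__ == "__main__":
    # The example from the text: h = 0.1, w_i = 1, r_i = 0.4.
    print("without cutoff:", round(direct_risk_strict([(1.0, 0.4)], min_recall=0.0), 4))
    print("with cutoff:   ", round(direct_risk_strict([(1.0, 0.4)]), 4))
```

Setting `min_recall=0.0` reproduces the 0.06 figure from the text, while the default cutoff raises the same case to 0.6, which is the behaviour the strict rule is intended to enforce.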

Accounting for attempted attack
If a de-identified text document is going to be disclosed publicly, then the results in equations (16) and (17) would be the correct ones to use. However, for non-public data releases it is necessary to take into account the probability that an adversary will actually attempt to re-identify an individual in the data set [18]. Considering the probability of attempt is common disclosure control practice for health data and has been included in recent guidance and standards [19][20][21][22].
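This can be modeled as follows for direct identifiers:
$$\Pr(\text{reid}, \text{attempt}, \text{leak}, \text{appears}) = \Pr(\text{attempt} \mid \text{leak}, \text{appears}) \times \Bigl[1 - \prod_i \bigl(1 - h_i\, w_i (1 - r_i)\bigr)\Bigr] \tag{18}$$
where $h_i = h$ if $r_i \ge 0.9$ and $h_i = 1$ otherwise, as in equation (16). And for quasi-identifiers:
$$\Pr(\text{reid}, \text{attempt}, \text{leak}, \text{appears}) = \Pr(\text{attempt} \mid \text{leak}, \text{appears}) \times \Pr(X \ge 2 \text{ if } r_q \ge 0.7, \text{ or } Y \ge 2 \text{ if } r_q < 0.7) \quad \text{for } X \sim B\bigl(n_q,\; h(1 - (r_q)^m)\bigr),\; Y \sim B\bigl(n_q,\; 1 - (r_q)^m\bigr) \tag{19}$$
A scheme based on subjective probability has been developed for computing a value for $\Pr(\text{attempt} \mid \text{leak}, \text{appears})$, and has been in use for a number of years to evaluate the probability of re-identification for health data [8]. This uses checklists to evaluate the security and privacy practices of the data recipient, the types of contractual controls in place, and the motives and (technical and financial) capacity of the data recipient to re-identify the data set.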

Confidence intervals
In the literature it has been assumed thus far that the computed recall value is an accurate point estimate, and typically no confidence interval was computed for it. However, because during validation studies the computed value is an estimate of recall, it is important to report the confidence interval around that estimate as well. That confidence interval will be affected by, for example, the sample size of the corpus and the frequency of identifiers in the data.
Therefore the recall can then be represented by a normal distribution with the observed value as the mean and the estimate of the variance would be $r_i (1 - r_i) / s_i$. Similarly, the weight $w_i$ can then be represented by a normal distribution with the observed value as the mean and the estimate of the variance would be $w_i (1 - w_i) / n$.
Because each identifier will have a different frequency in the data, the computations of recall will have different accuracy, and this needs to be accounted for in an evaluation framework. For example, a direct identifier that appears in 1000 documents will have a recall value that is computed more accurately after evaluation than a direct identifier that only appears in 10 documents. We therefore need to account for this uncertainty.
Document frequency and all-or-nothing recall can be treated as proportion estimates; document frequency is the estimated proportion of documents with a particular type of PHI and all-or-nothing recall the estimated proportion of documents correctly annotated. Proportion estimates follow a binomial distribution since they are modeled as Bernoulli trials; however, it is common practice to approximate this with a normal distribution [23].
The value of PrðattemptÞ can also be represented as a triangular distribution which is a common approach to represent uncertainty with subjective probabilities [24,25]. The counts n q and m can be represented as Poisson distributions given that there will be variation in their values across documents as well.
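The variable weight and recall values can be represented as normal distributions denoted by $N(a, b)$, where $a$ is the mean and $b$ is the standard deviation. The triangular distribution is given by $\mathrm{Triang}(a, b, c)$ where $b$ is the most likely value and $a$ and $c$ the minimum and maximum values. Therefore, we can then formulate the overall probability distribution for direct identifiers as follows:
$$\Pr(\text{reid}, \text{attempt}, \text{leak}, \text{appears}) = \Bigl[1 - \prod_i \bigl(1 - h_i\, W_i (1 - R_i)\bigr)\Bigr] \times A \tag{20}$$
where $W_i$, $R_i$, and $A$ are the random variables for the weight, the all-or-nothing recall, and the probability of attempt, and $h_i = h$ if $r_i \ge 0.9$ and $h_i = 1$ otherwise. And for quasi-identifiers:
$$\Pr(\text{reid}, \text{attempt}, \text{leak}, \text{appears}) = \Pr(X \ge 2 \text{ if } r_q \ge 0.7, \text{ or } Y \ge 2 \text{ if } r_q < 0.7) \times A \quad \text{for } X \sim B\bigl(N_q,\; h(1 - (R_q)^M)\bigr),\; Y \sim B\bigl(N_q,\; 1 - (R_q)^M\bigr) \tag{21}$$
where $N_q$, $R_q$, and $M$ are the random variables for $n_q$, $r_q$, and $m$. The distribution of the terms in equations (20) and (21) can be computed using a Monte Carlo simulation and the 95% confidence interval for the overall probability of re-identification derived from that empirical distribution [25].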
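A Monte Carlo simulation of this kind can be sketched for direct identifiers using only the Python standard library. This is an illustration under stated assumptions, not the simulation used in the study: recall and weight are drawn from the normal approximations described above and clipped to [0, 1] (the clipping is an assumption of this sketch), the re-synthesis factor is applied according to the observed recall with a 0.9 cutoff, and the triangular parameters for the probability of attempt are hypothetical.

```python
import math
import random
from typing import List, Tuple


def clip01(x: float) -> float:
    """Keep a sampled proportion inside [0, 1] (an assumption of this sketch)."""
    return min(1.0, max(0.0, x))


def simulate_direct_risk(identifiers: List[Tuple[int, float]], n: int,
                         attempt: Tuple[float, float, float],
                         h: float = 0.1, min_recall: float = 0.9,
                         draws: int = 10000, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of the simulated direct-identifier risk.

    identifiers: list of (s_i, r_i) with s_i documents containing the type and
    r_i its observed all-or-nothing recall. attempt: (minimum, most likely,
    maximum) of the triangular distribution for the probability of attempt.
    """
    rng = random.Random(seed)
    a, b, c = attempt
    samples = []
    for _ in range(draws):
        prob_no_reid = 1.0
        for s_i, r_i in identifiers:
            w_i = s_i / n
            # Normal approximations with variances r(1-r)/s_i and w(1-w)/n.
            recall_draw = clip01(rng.gauss(r_i, math.sqrt(r_i * (1 - r_i) / s_i)))
            weight_draw = clip01(rng.gauss(w_i, math.sqrt(w_i * (1 - w_i) / n)))
            factor = h if r_i >= min_recall else 1.0
            prob_no_reid *= 1.0 - factor * weight_draw * (1.0 - recall_draw)
        attempt_draw = rng.triangular(a, c, b)  # signature is (low, high, mode)
        samples.append(attempt_draw * (1.0 - prob_no_reid))
    samples.sort()
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws) - 1]
    return sum(samples) / draws, lower, upper


if __name__ == "__main__":
    # Hypothetical corpus of 220 documents with two direct identifier types.
    mean, lower, upper = simulate_direct_risk(
        [(220, 0.95), (50, 0.92)], n=220, attempt=(0.1, 0.3, 0.5))
    print(f"mean={mean:.4f}, 95% interval=({lower:.4f}, {upper:.4f})")
```

The upper limit returned here is the quantity that would be compared against a threshold; a quasi-identifier version would additionally need Poisson draws for $n_q$ and $m$, which are omitted to keep the sketch dependency-free.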

Setting thresholds
In this section we discuss how to evaluate the re-identification probability distribution by comparing it to an appropriate threshold for each of the direct and quasi-identifiers.
2.3.9.1. Evaluating the distribution for direct identifiers. For direct identifiers, we create a benchmark or threshold distribution and compare the actual distribution obtained from this data with that threshold distribution. This threshold distribution is derived from existing practices in the literature. If the actual distribution does not cover a risk greater than what is covered by the threshold distribution, then we have sufficient evidence to conclude that the actual risk is the same as or lower than the threshold risk and is therefore considered acceptably low. This is illustrated in panel (a) of Fig. 1. In other words we need the upper confidence limit of the actual distribution to be less than or equal to the upper confidence limit of the threshold distribution. Otherwise we cannot conclude that the risk is lower than the threshold distribution, or that the risk is acceptably low. As illustrated in panel (b) of Fig. 1, the upper confidence limit of the actual distribution is greater than the upper confidence limit of the benchmark distribution, and therefore we cannot conclude that the actual risk is less than or equal to the benchmark distribution.
This can be thought of in terms of a null hypothesis, where the actual risk is greater than the benchmark distribution. In panel (a) we can reject this null hypothesis and conclude that the actual risk is not greater than the threshold distribution, but in panel (b) there is insufficient evidence to reject it and we therefore conclude that the actual risk may be greater than the threshold distribution.
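The comparison of upper confidence limits described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `upper_limit` and `risk_acceptably_low` are hypothetical, the inputs are assumed to be simulated samples of the actual and benchmark risk, and the values in the usage lines are toy numbers rather than results from this study.

```python
def upper_limit(samples, level=0.95):
    """Upper limit of the central `level` interval (nearest-rank percentile)."""
    ordered = sorted(samples)
    q = 1.0 - (1.0 - level) / 2.0  # 0.975 for a 95% interval
    last = len(ordered) - 1
    return ordered[min(last, round(q * last))]


def risk_acceptably_low(actual_samples, benchmark_samples, level=0.95):
    """True when the actual upper limit does not exceed the benchmark's.

    This corresponds to rejecting the null hypothesis that the actual
    risk is greater than the benchmark risk.
    """
    return upper_limit(actual_samples, level) <= upper_limit(benchmark_samples, level)


# Toy samples only: the "actual" risks are half the "benchmark" risks.
actual = [i / 1000 for i in range(100)]
benchmark = [i / 500 for i in range(100)]
print(risk_acceptably_low(actual, benchmark))   # panel (a) situation
print(risk_acceptably_low(benchmark, actual))   # panel (b) situation
```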
For the benchmark distribution we need to determine an acceptable recall for direct identifiers that will result in a measure of risk that is equivalent to existing standards. The authors in [26] recommended that a recall of at least 0.95 would be acceptable for direct identifiers. We have extended this criterion to all-or-nothing recall, which is more conservative than these authors had intended, since they were referring to micro-average recall.
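The difference between the two recall measures can be illustrated with a short Python sketch. It assumes that all-or-nothing recall counts a document as a success only when every instance of the identifier in that document is detected, and that micro-average recall pools instances across documents. The function names and the per-document counts are hypothetical and are not taken from the study data.

```python
def micro_average_recall(docs):
    """docs: list of (instances_present, instances_detected) per document."""
    present = sum(p for p, _ in docs)
    detected = sum(d for _, d in docs)
    return detected / present


def all_or_nothing_recall(docs):
    """Fraction of documents containing the identifier where all instances were detected."""
    containing = [(p, d) for p, d in docs if p > 0]
    fully_detected = sum(1 for p, d in containing if d == p)
    return fully_detected / len(containing)


# Hypothetical counts: one document has a single missed instance.
docs = [(3, 3), (2, 1), (5, 5), (1, 1)]
print(micro_average_recall(docs))    # 10 of 11 instances detected
print(all_or_nothing_recall(docs))   # 3 of 4 documents fully detected
```

In this toy case a single missed instance lowers all-or-nothing recall much more than micro-average recall, because the whole document is counted as a failure.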
When constructing the benchmark distribution we assume that $w = 1$, the worst case in terms of risk in that it assumes that all of the direct identifiers are present in each document. By examining the literature review in Appendix A we see that the smallest data set that was used to evaluate a rule-based de-identification tool, or as the testing data set for a machine-learning based tool, was 220 documents. We therefore assume that $n = 220$ for the benchmark distribution.
When we put these values into equation (20) we obtain a conservative benchmark probability distribution that reflects what has been considered acceptable performance for the detection and removal of direct identifiers. Note that if a particular data set has n < 220 then this would result in an actual confidence interval that is wider than the benchmark distribution, increasing the chance that the actual risk may cover a risk that is greater than the benchmark distribution. Therefore, we do not set minimal data set sizes for evaluations because that is already accounted for.
When $w < 1$ the overall risk from direct identifiers will decrease, but this will also increase variability because recall depends on $s_i = n \times w_i$. In this case the actual distribution may cover a risk that is greater than the benchmark distribution, and we would not conclude that the risk is acceptably low. Now that we have a conservative benchmark distribution, we can perform the comparisons illustrated in Fig. 1 to determine if the actual distribution covers a risk greater than the benchmark, and therefore decide if the actual re-identification risk is acceptable.
2.3.9.2. Evaluating the distribution for quasi-identifiers. Previous work in the automated de-identification literature has suggested a fixed 85% recall threshold for quasi-identifiers [27]. However, a fixed recall value for quasi-identifiers would be inconsistent with how the re-identification risk from quasi-identifiers in structured data sets is evaluated, as illustrated below.
The benchmark for acceptable probability of re-identification is determined by a threshold computed from the sensitivity of the data, the potential subjective and objective harm that can affect a patient if there was an inappropriate disclosure of their data or re-identification, and the extent to which the patient had consented for their information to be used for the anticipated secondary purposes [8,28]. These are the same criteria that are used to determine the acceptable probability of re-identification for quasi-identifiers in structured data sets.
When considering these three criteria, there are some strong precedents for choosing a probability value that is acceptable. Historically, data custodians have used the "cell size of five" rule as a threshold for deciding whether data has a low risk of re-identification [29-44]. This rule was originally applied to count data in tables. Count data, however, can be easily converted to individual-level data; therefore these two representations are in effect the same thing. A minimum "cell size of five" rule would translate into a maximum probability of re-identifying a single record of 0.2. Some custodians use a cell size of 3 [45-49], which is equivalent to a probability of re-identifying a single record of 0.33. For the public release of data a cell size of 11 has been used in the US [50-54], and a cell size of 20 for public Canadian and US patient data [55,56]. Cell sizes from 5 to 30 have been used across the US to protect students' personally identifying information [57]. Other cell sizes such as 4 [58], 6 [59-62], 10 [63], 16 [63], and 20 [63] have been used in different scenarios in various countries.
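The conversion from a minimum cell size to a maximum probability of re-identifying a single record is a reciprocal, as the 5 to 0.2 and 3 to 0.33 examples indicate. A small Python sketch, with the hypothetical function name `max_reid_probability`, makes the arithmetic explicit for the cell sizes mentioned above.

```python
def max_reid_probability(cell_size):
    """Maximum probability of re-identifying one record for a minimum cell size."""
    if cell_size < 1:
        raise ValueError("cell size must be at least 1")
    return 1.0 / cell_size


# Cell sizes cited in the text.
for k in (3, 5, 11, 20):
    print(k, round(max_reid_probability(k), 4))
```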
Once an appropriate value is determined from within this range using the three criteria and the checklist and scoring scheme in [8], we can derive the following inequality from equation (21):

$$\Pr\left(X \ge 2 \text{ if } r_q \ge 0.7, \text{ or } Y \ge 2 \text{ if } r_q < 0.7\right) \times A \le s$$

for $X \sim B\left(N_q, h\left(1 - (R_q)^M\right)\right)$ and $Y \sim B\left(N_q, 1 - (R_q)^M\right)$, where $s$ is the threshold probability. If the inequality is met then the risk of re-identification is considered acceptable. The upper confidence limit of the 95% confidence interval needs to be below the threshold value to be able to conclude that the risk is acceptably small.

Summary
The framework that we have presented above for calculating the probability of re-identification from a de-identified text document provides more precise modeling of the risks from an adversary. It may yield higher probability estimates than existing approaches in some instances, and lower ones in others. Nevertheless, it represents a more accurate way to assess the probability of re-identification than current approaches.
We have also presented techniques to account for the uncertainty in the estimated values and to compare the computed risk values to benchmarks or thresholds in a formal manner. This allows a precise determination of whether the actual probability of re-identification is acceptably small. These techniques account for the size of the corpus that is used to perform the evaluations.
The notation used in formulating our framework is summarized in Table 4. The application of the overall model in a hypothetical context is described in Appendix C, which shows how the equations can be used in practice.

Data set
Our purpose in the empirical application of the evaluation framework is to illustrate its use on a real data set, and to show how to interpret the results. We applied the evaluation framework to a data set from the University of Michigan Medical School. The data comes in four groups: one is a random assortment of documents from the full collection of over 80 million, while the other three are a stratified random sample of three document types: Social Work Notes, History and Physical Notes, and Progress Notes.
Each document is between 1 and 2 pages in length and has different emphasis that is evident in the content and organization of the document. The random group allows us to analyze each stratum against a general representation of the overall corpus.
There are 30 documents in each group for a total of 120 expert annotated documents. The entire corpus was annotated by a single expert, and subsequently reviewed by a second expert. Where there was disagreement the two experts met and reached consensus on the appropriate annotation to use.

De-identification
The de-identification was performed with the rule-based engine that was described elsewhere ([9], Ch. Free-Form Text). Because this was a rule-based de-identification engine, no training data set was required to construct a model before applying it. The de-identification engine was applied "out-of-the-box" without modification or customization.
The set of direct and quasi-identifiers that were targeted for extraction in these documents is consistent with those that are typically used in the literature [1]. These include: IDs, phone numbers, people names, email addresses, street addresses, organization names, ZIP codes, ages, country, state, and city.
We will compare our risk assessment results with those that would be obtained using a typical contemporary micro-average evaluation of recall. This will illustrate the difference between the proposed evaluation framework and the current baseline.

Risk thresholds
For the purposes of our case study, we will use a threshold based on the commonly used "cell size of five" rule, which is equivalent to a probability of re-identification of 0.2 for quasi-identifiers. The upper confidence limit of the quasi-identifier confidence interval needs to be below that value. In the case of direct identifiers the data confidence interval is compared with the benchmark confidence interval.

Table 5 contains information on the data element type (annotation) frequency by document and the number of instances of annotations found in the corpus. The table refers to the gold standard annotation set, which was expertly annotated and reviewed. The "document" column indicates the number of documents containing that annotation, while the "annotations" column represents the number of instances (individual annotations).

Evaluation results
The evaluation results are split into two sets. First, the results using a more traditional micro-average recall are shown in Table 6.
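The leak rate reported alongside micro-average recall in Table 6 is one minus the recall. The following Python sketch, with the hypothetical function name `leak_rate`, reproduces that arithmetic from the two recall values in the table.

```python
def leak_rate(recall):
    """Probability of a leak under the traditional evaluation: 1 - recall."""
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be between 0 and 1")
    return 1.0 - recall


# Micro-average recall values as reported in Table 6.
table_6_recall = {"Direct identifiers": 0.9758, "Quasi-identifiers": 0.8757}
for name, recall in table_6_recall.items():
    print(name, round(leak_rate(recall), 4))
```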
In the second set of results we show the 95% confidence intervals for the probability of re-identification using our evaluation framework. In this case we have a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The confidence interval for the direct identifiers from the data is compared to the benchmark in Fig. 2. This shows that the upper confidence limit for the re-identification risk from the data is below the upper confidence limit of the benchmark distribution, and therefore we can conclude that the risk of re-identification for direct identifiers is acceptably small.
In Fig. 3 we show the 95% confidence interval for the quasi-identifiers. The upper confidence limit is below the 0.2 threshold that we are using in our example. Therefore we can conclude that the risk of re-identification for quasi-identifiers is acceptably small.
The comparison of these two sets of results shows that the numeric outcomes of the evaluation will be different, and that our evaluation framework, because it takes context into account, will often be less pessimistic about the real risks. Again, as noted earlier, this will not always be the case. However, we would expect to see differences in the numerical values and the conclusions about the risk of re-identification.

Table 4. Summary of notation.

| Notation | Definition |
| --- | --- |
| $s_i$ | The number of documents that a particular direct identifier $i$ appears in |
| $r_i$ | The all-or-nothing recall for direct identifier $i$ |
| $n$ | The number of documents |
| $r_q$ | The micro-average recall computed across all quasi-identifiers |
| $m$ | The average number of times that a quasi-identifier value in a document is repeated (the average number of instances per quasi-identifier value) |
| $n_q$ | The average number of unique quasi-identifier values per document |
| $s_q$ | The number of documents that a quasi-identifier appears in. In most cases this will be the same as $n$ |
| $h$ | The HIPS factor |

Table 6. Summary of results that would be obtained by a more traditional micro-average recall calculation (and the leak rate, which is one minus the recall).

| | Micro-average recall | Probability of a leak |
| --- | --- | --- |
| Direct identifiers | 0.9758 | 0.0242 |
| Quasi-identifiers | 0.8757 | 0.1243 |

Fig. 2. The 95% confidence intervals for the probability of re-identification for direct identifiers using our evaluation scheme.

Summary
In this paper we have presented a new framework for evaluating the performance of free-form text de-identification tools that accounts for the many subtleties and distributional variances that one sees in real data sets. It attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions about re-identification that are made using the more conventional evaluation approach. This framework arguably provides a more realistic estimate of the true probability of re-identification. The framework was illustrated on a heterogeneous corpus of documents from the University of Michigan Medical School.
The application of this framework to the de-identification of clinical reports from clinical trials, as required by the European Medicines Agency, is described further in Appendix B.

Limitations
Our framework does not consider the precision of the de-identification tool used. Our focus has been on the risk of re-identification only. However, in practice precision would need to be considered as well when evaluating real de-identification systems. Our focus on recall and the risk of re-identification is not intended to diminish the importance of considering precision when evaluating de-identification solutions.
Furthermore, in very rare diseases the risk of re-identification may still be present with a single quasi-identifier. In future work, we will consider the implications of disease frequency in the global population and re-identification risks.