De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset

Background There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.

Objective To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.

Methods We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high, then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.

Results An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was .0084, below the .05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.
Conclusions It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.

The support of a value X, denoted sup(X), was computed as the number of other patients with the value X. A score was defined based on the support values for the quasi-identifiers in a claim.
Let us say that a claim had quasi-identifier values a_irh, where i was an index for the patient, r an index for the claim, and h an index for the quasi-identifiers. We wished to assign a higher score to claims with quasi-identifiers that had low support. If we had N patients in the data set, then we could compute the score as

score_ir = max_h [ N / sup(a_irh) ]

This score gave the quasi-identifier with the lowest support the most weight in deciding the overall score for a claim. The reasoning for this was that if any claim had a quasi-identifier that was quite rare in the data set (among patients), then its score would be quite high.
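A minimal sketch of this scoring step is shown below. The dictionary layout, field names, and the exact form of the score (the rarest value's N/sup dominating via a maximum) are our assumptions for illustration; the paper's precise formula may differ.

```python
from collections import defaultdict

def compute_supports(patients):
    """Map each (quasi-identifier, value) pair to the set of patients
    holding it. `patients` maps a patient ID to a list of claims, and
    each claim maps a quasi-identifier name to its value."""
    owners = defaultdict(set)
    for pid, claims in patients.items():
        for claim in claims:
            for qi, value in claim.items():
                owners[(qi, value)].add(pid)
    return owners

def claim_score(pid, claim, owners, n_patients):
    """Score a claim so that its rarest quasi-identifier value dominates.
    sup(X) is the number of *other* patients with value X; the score is
    taken as the maximum of N / sup(X) over the claim's values (clamped
    so a value unique to this patient does not divide by zero)."""
    best = 0.0
    for qi, value in claim.items():
        sup = len(owners[(qi, value)] - {pid})
        best = max(best, n_patients / max(sup, 1))
    return best
```

With four patients, a claim carrying a diagnosis held by a single patient scores higher than one shared by three patients, as intended.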

Removal of High Risk Patients
The following patient groups were removed from the data set during the pre-processing step:
- Patients with diagnoses indicating Human Immunodeficiency Virus (HIV), abortion, abuse, psychosexual disorders, mental retardation, or plastic surgery:
  o Plastic surgery for unacceptable cosmetic appearance (ICD-9 code V50.1) or aftercare involving the use of plastic surgery (ICD-9 code V51).
  o Patients having undergone procedures indicating intersex surgery (CPT codes starting with 5597 or 5598).
- Patients with a claim where the "place of service" is Intermediate Care-Mentally Retarded, or a Residential or Nonresidential Substance Abuse Treatment Facility (codes 55-57).
The following claims were removed because they would indicate an important event in a patient's life, making the patient more easily identifiable, or were not considered relevant:
- Claims for newborns (up to and including 28 days after birth).
- Claims where the CPT code is not a medical procedure (all codes with a letter prefix, known as Level II codes, and three-digit revenue or chapter codes).
When a high-risk patient or claim was deleted from the data set, all claims with the same MemberID were removed from the data. The concern behind this decision was the possibility that information in other claims for the same patient could be used to infer the deleted information. For example, if only a claim with a direct diagnosis of HIV were deleted, other claims with diagnoses of infections that are very strongly related to HIV would allow an adversary to infer that the patient did have HIV.
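The whole-patient removal rule can be sketched as follows. The `member_id` and `dx` field names and the flat list-of-claims layout are illustrative assumptions, not the HHP schema:

```python
def remove_high_risk_patients(claims, high_risk_codes):
    """Remove *all* claims for any patient who has at least one claim
    with a high-risk diagnosis code, so that the patient's remaining
    claims cannot be used to infer the deleted information."""
    # Identify every member holding at least one high-risk claim.
    flagged = {c["member_id"] for c in claims if c["dx"] in high_risk_codes}
    # Drop every claim belonging to a flagged member, not just the
    # high-risk claim itself.
    return [c for c in claims if c["member_id"] not in flagged]
```

Note that the patient's unrelated claims are dropped along with the triggering claim, mirroring the rule described above.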

Differences Between Excluded and All Patients
In this section we describe the differences between the high-risk patients that were removed during the previous step and the remaining patients. In Table 1 we see that the gender distributions are quite similar.
On the other hand, there were differences in some of the variables. We found that the excluded patients tended to have a longer interval between claims (Table 2), to be older (Table 3), and to require longer stays in hospital (Table 4). This is not surprising, as the diagnoses and procedures that were excluded are most likely to occur in an older population. The median number of claims per excluded patient was 34, compared to 11 for the remaining patients. This indicates that the excluded patients tended to have more procedures performed than the general patient population, reflecting the fact that their conditions tend to be serious and often chronic.

Computing the Adversary Power
In this section we describe how we computed the power of the adversary by considering the number of claims that a patient has and the diversity, or variability, in their quasi-identifier values.
We made two assumptions about the knowledge of the adversary: (a) the adversary would not know which values on the quasi-identifiers were in the same claim (the "inexact knowledge" assumption), and (b) the adversary would not know the exact order of the claims (the "inexact order" assumption) beyond what is revealed through the DSFC quasi-identifier, which is consistent with other models of transactional data in the disclosure control literature [2-6]. However, we did test the sensitivity of our results to these assumptions in our empirical evaluation.
For the first assumption, for example, the adversary could know that a patient had a heart attack and stayed for a week at the hospital during the period covered by the data, but would not know that these two values pertained to the same episode of care. For instance, a patient could have had the following series of primary condition groups (a generalization of the diagnosis code): <AMI, AMI, UTI, RENAL3> and a series of LOS values <2,4,1,7>, but the values in the two quasi-identifier sequences were not ordered the same way. The adversary could know that there were two AMI diagnoses but not know that the LOS was 7 days for one of them.
For the second assumption, we assumed that the adversary would not necessarily know the exact order of the quasi-identifier values. For example, if an AMI diagnosis and a UTI diagnosis occurred within the same day, the adversary would not know that the UTI occurred before the heart attack. These two diagnoses would appear in two separate claims sharing the same date.
For each patient, we defined their variability δ_ih, which started from zero, indicating no variability in the quasi-identifier values for that patient. This characterized how often the values in their set of quasi-identifier values varied. For example, a patient with a chronic disease, such as kidney disease, who made many dialysis visits would have low variability in their diagnosis codes across many claims. On the other hand, a patient with multiple acute incidents would have high variability in their diagnosis codes, since one would expect these to be unrelated to each other.
A patient with low variability would be easier to re-identify because an adversary would need to know little about them to re-identify them. An adversary who knew little would be able to predict information in the rest of the claims because there was little variability. Consequently, the adversary would be likely to have background information about many of the quasi-identifier values, which would make the p value higher.
For a patient who had high variability, every additional piece of information about the patient would be so different from previous information that the adversary could not use existing knowledge to infer other information. This means that the adversary would be likely to have less background information about such a patient, and hence p would be lower.
We also defined c_i as the number of claims that patient i had. The number of claims was independent of the quasi-identifier, as it was constant across all quasi-identifiers. The more claims a patient had, the more information was available to be used for re-identification. Therefore, it would be easier for an adversary to have more background information about patients who had many claims. We could then define conceptually four quadrants of patients, as in Figure 1. For patients in quadrant (3), little information would be available to an adversary because there were few claims and much diversity in the patient's information (i.e., it would be more difficult to use known information to predict additional information). Therefore, we assigned them a low p value. On the other hand, for patients in quadrant (2), it would be easier to get more background information about them because they had many similar claims: knowing a little, the adversary could predict the rest. Therefore the power of the adversary would be much higher. The other quadrants were in between.
Let the p value for quadrant (1) be denoted by p(1), and similarly for the other quadrants. The basic relationships among the p values are p(2) > p(1) > p(3) and p(2) > p(4) > p(3). Therefore, we expected a monotonic relationship between c_i/δ_ih and the p value. If we strengthen that monotonic assumption and say that the relationship is linear, then the value of p can be computed as

p_ih = p_m × (c_i / δ_ih) / max_j (c_j / δ_jh)

where p_m is the maximum value of p that we were willing to assume.
We set the value of p_m at 5. While there are no precedents for this number, it represented a significant amount of background information about the patients: with 6 quasi-identifiers in each claim, this would mean that the adversary could have up to 30 pieces of information about the patient to use for re-identification, plus the 4 quasi-identifiers in the patients' table. This was a significant amount of background information and therefore represented quite a knowledgeable adversary. Because many patients had more than 5 claims, it was also assumed that all claims were equally likely to be within the background knowledge of the adversary. As part of our empirical evaluation we evaluated the sensitivity of our results to the use of p_m = 5.
In some cases there was absolutely no variability in the data, and δ_ih would be zero. In such a case, because there was no diversity in the data, the maximum value of p was always selected.
If there were patients with an extreme number of claims, they would skew the calculations of p_ih lower. Therefore we capped the value of c_i at the mean plus two standard deviations. This meant that if a member had more claims than that, the additional claims were not considered to provide the adversary with additional information for a re-identification attack. In practice this cutoff was still quite high, but it did prevent extreme skewness in the distribution of p_ih.
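The capping and quadrant logic can be sketched as follows. This is a minimal sketch under our assumptions: the symbol names, the convention that zero variability yields the maximum power, and the linear normalization by the largest c/δ ratio are our reconstruction, not the paper's exact formula.

```python
import statistics

def adversary_power(claim_counts, variability, p_max=5.0):
    """Per-patient adversary power for one quasi-identifier.
    claim_counts: patient -> number of claims (c_i)
    variability:  patient -> diversity value in [0, 1] (delta_ih)
    Claim counts are capped at the mean plus two standard deviations;
    a patient with zero variability always receives p_max."""
    counts = list(claim_counts.values())
    cap = statistics.mean(counts) + 2 * statistics.stdev(counts)
    ratios = {}
    for pid, c in claim_counts.items():
        d = variability[pid]
        # None marks the zero-variability case (maximum power).
        ratios[pid] = None if d == 0 else min(c, cap) / d
    finite = [r for r in ratios.values() if r is not None]
    top = max(finite) if finite else 1.0
    # Scale linearly so the largest ratio maps to p_max.
    return {pid: p_max if r is None else p_max * r / top
            for pid, r in ratios.items()}
```

A patient with many claims and low variability (quadrant 2) ends up near p_max, while a patient with few, highly varied claims (quadrant 3) ends up near zero.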

Computation of Variability
Although the Shannon index is commonly used for estimating variability, it is sensitive to sample size and difficult to interpret [7]. Instead, we estimated variability using the more robust Simpson index [8].
Let X be a categorical variable with k categories, let n_k be the frequency of occurrence of category k, and let N be the total frequency of occurrence across all categories (i.e., N = Σ_k n_k). For a finite population, the Simpson index is calculated as

D = Σ_k n_k(n_k − 1) / [N(N − 1)]

which represents the probability of two randomly selected occurrences being in the same category. For example, if X_h is the primary condition group (quasi-identifier h), then D is the probability that two randomly selected claims for a patient have the same primary condition group.

Calculating diversity from the Simpson index
To determine diversity from the Simpson index, one can take the complement 1 − D, the reciprocal 1/D, or −ln(D). The reciprocal can have variance problems, and some therefore recommend using −ln(D). Still others argue for using 1 − D, which is easily interpretable as it is a probability. We therefore chose to use this latter measure of diversity.
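The finite-population Simpson index and the 1 − D diversity measure described above can be computed directly from a patient's sequence of quasi-identifier values; a short sketch:

```python
from collections import Counter

def simpson_index(values):
    """Finite-population Simpson index
        D = sum_k n_k (n_k - 1) / (N (N - 1)),
    the probability that two occurrences drawn without replacement fall
    in the same category. Returns 1.0 when there are fewer than two
    occurrences (no pair to compare)."""
    counts = Counter(values)
    total = sum(counts.values())
    if total < 2:
        return 1.0
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))

def diversity(values):
    """Gini-Simpson diversity 1 - D: the probability that two randomly
    selected occurrences are in *different* categories."""
    return 1.0 - simpson_index(values)
```

A dialysis patient with identical diagnosis codes across claims gets diversity 0, while a patient whose claims all carry distinct codes gets diversity 1.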

Correlation Among Diversity Values
The correlation matrix of the diversity values among the quasi-identifiers is shown in Table 5. This shows that the relationship among variables in terms of their diversity varies considerably, ranging from moderately negative, through close to zero, to moderately positive. While the signs and magnitudes of these correlations have face validity, they also reinforce the need to treat the diversity of each quasi-identifier separately in computing the power of the adversary, rather than attempting to compute a single diversity value across all claims.

Node Computation
Below we describe the precise steps in computing the re-identification risk for each node in the lattice.
We used i to index patients and h to index quasi-identifiers in a claim. The following are the calculation steps followed within a node. The objective of the processing within the node was to determine whether it was a candidate solution node:
1. The data is generalized according to the specifications for the node. Let this be data set D.
2. The re-identification risk is computed on D.
3. If the risk is above the threshold, the node is not a candidate, and we exit the node calculation; otherwise, this is a candidate node.
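The per-node check can be sketched as below. This is a simplified outline: the `generalize` and `risk` functions are caller-supplied stand-ins for the paper's actual generalization and risk-measurement procedures.

```python
def is_candidate_node(dataset, node_spec, generalize, risk, threshold):
    """Evaluate one lattice node: generalize the data according to the
    node's specification, measure the re-identification risk of the
    result, and accept the node only if the risk is at or below the
    threshold."""
    generalized = generalize(dataset, node_spec)
    return risk(generalized) <= threshold
```

Searching the lattice then amounts to calling this predicate on each node and keeping the candidate nodes for the utility comparison.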

Derivation of Sample Marketer Risk
We let J be the set of equivalence classes in the population data set, which has N records. The size of each equivalence class was given by F_j, where j ∈ J. A sample was drawn from the population. The set of equivalence classes in the sample was denoted by S, such that S ⊆ J. The sample size was given by n, and the size of an equivalence class in the sample was given by f_j, where j ∈ S.
Marketer risk was given by λ = (1/n) Σ_{j∈S} f_j / F_j [9]. We had f_j / F_j ≈ n/N, the sampling fraction. Therefore, we ended up with λ ≈ |S|/N, where |S| was the number of equivalence classes in the sample.
To determine the value of λ we needed to compute the expected value of |S|.
Assume we randomly draw n records from the population. We let U_j be a random variable associated with each f_j such that U_j = 1 if f_j > 0 and U_j = 0 otherwise. We let U = Σ_{j∈J} U_j, which represented the number of non-zero equivalence classes in the sample, so that E[U] = E[|S|]. Then:

E[U] = Σ_{j∈J} [1 − Pr(f_j = 0)] = Σ_{j∈J} [1 − C(N−F_j, n) / C(N, n)]

Putting this in the equation for λ, we had an approximation:

λ ≈ (1/N) Σ_{j∈J} [1 − C(N−F_j, n) / C(N, n)]

The above equation could be simplified to:

λ ≈ (1/N) Σ_{j∈J} [1 − (1 − n/N)^{F_j}]   (2)

Based on that calculation, the proportion of HHP patients that could be correctly matched to a voter list, on average, would be calculated as λ in equation (2).
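A minimal sketch of the simplified marketer-risk approximation λ ≈ (1/N) Σ_j [1 − (1 − n/N)^{F_j}], one plausible reading of the derivation above (the function name and input layout are ours):

```python
def expected_marketer_risk(class_sizes, n):
    """Approximate expected marketer risk for a simple random sample of
    n records drawn from a population whose equivalence classes have
    sizes F_j: the expected number of non-empty sample classes over N."""
    N = sum(class_sizes)
    frac = n / N
    # Each class is non-empty in the sample with probability
    # approximately 1 - (1 - n/N)^F_j.
    nonempty = sum(1 - (1 - frac) ** F for F in class_sizes)
    return nonempty / N
```

As a sanity check, when every population record is unique (all F_j = 1) the approximation reduces to n/N, the sampling fraction, and larger equivalence classes drive the risk down.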

Grouping ICD-9 Codes into Primary Condition Groups
The following is a summary of how ICD-9 codes were grouped into larger sets for the primary condition groups [10]:

ANES
Head, neck, thorax, intrathoracic, spine and spinal cord, upper and lower abdomen, perineum, pelvis, upper and lower leg, knee and popliteal area, shoulder and axilla, upper arm and elbow, forearm, wrist and hand, radiological procedures, burn excisions or debridement, obstetric, and other.

EM
Office or other outpatient services, hospital observation services, hospital inpatient services, consultations, emergency department services, critical care services, continuing intensive care services, nursing facility services, domiciliary, rest home, or custodial care services, home services, prolonged services, case management services, care plan oversight services, preventive medicine services, special evaluation and management services, and other manual and management services.

MED
All other medicine including drug administration, vaccines, toxoids, hydration, therapeutic, prophylactic, and diagnostic injections and infusions, psychiatry, dialysis, ophthalmology, contact lens services, spectacle services, medical tests and measurements, analysis, assessment, intervention, evaluative and therapeutic services, diagnostic studies, drug administration, physical medicine and rehabilitation, education and training for patient self-management, special services, procedures and reports, moderate sedation, and home health procedures and services (codes 90281-99199, 99500-99602).

PL
Organ or disease panels, drug testing, therapeutic drug assays, evocative and suppression testing, consultations, urinalysis, chemistry, molecular diagnostics, infectious agent: detection of antibodies, microbiology infectious agent detection, anatomic pathology, cytopathology, cytogenetic studies, and surgical pathology.

RAD
Diagnostic radiology, diagnostic ultrasound, radiation oncology, and nuclear medicine.

SAS
External ear, middle ear, inner ear, and temporal bone, middle fossa approach.

SCS
Heart and pericardium, and arteries and veins.

Grouping Place of Service
The authors, in consultation with hospital physicians, grouped the place of service into the following categories.

Each place-of-service group and its included place-of-service definitions are listed below.

AMBULANCE
Ambulance: A vehicle specifically designed, equipped, and staffed for lifesaving and transporting the sick or injured. Ambulatory surgical center: A freestanding facility, other than a physician's office, where surgical and diagnostic services are provided on an ambulatory basis.

HOME
Location, other than a hospital or other facility, where the patient receives care in a private residence.

INPATIENT HOSPITAL
A facility, other than psychiatric, that primarily provides diagnostic, therapeutic, and rehabilitation services by physicians for admitted patients.

INDEPENDENT LAB
A laboratory certified to perform diagnostic or clinical tests independent of an institution or a physician's office.

OFFICE
Location where the health professional routinely provides health examinations, diagnosis, and treatment of illness or injury on an ambulatory basis.

OUTPATIENT HOSPITAL
A portion of a hospital that provides diagnostic, therapeutic (both surgical and nonsurgical), and rehabilitation services to sick or injured persons who do not require hospitalization or institutionalization.

URGENT CARE
Urgent care facility: Location whose purpose is to diagnose and treat illness or injury for unscheduled, ambulatory patients seeking immediate medical attention. Emergency room, hospital: A portion of a hospital where emergency diagnosis and treatment of illness or injury is provided.

OTHER
All other places of service: Assisted living facility, birthing center, community mental health center, comprehensive inpatient rehabilitation facility, custodial care facility, end-stage renal disease treatment facility, federally qualified health center, group home, hospice, independent clinic, inpatient psychiatric facility, mass immunization center, military treatment facility, mobile unit, nursing facility, other place of service, psychiatric facility, psychiatric residential treatment center, rural health clinic, skilled nursing facility, public health clinic, unassigned, unknown, tribal 638 provider-based facility.

Relationship to Previous Work
Previous work relevant to the de-identification of longitudinal medical records consists of research on the de-identification of transactions or the de-identification of trajectories. Transactions consist of items, for example, merchandise bought at a store. All of the items in a particular grouping are called an itemset. Trajectories appear in the context of de-identifying the movements of individuals, for example, the wireless telephone cell towers people pass as they travel. Below we explain why this previous work cannot be applied directly to our particular problem:
- Methods for the generalization of transactions often employ local recoding [11,12]. This means that the precision of, say, a claim's date can vary by claim and by patient. For example, one patient may have a claim's date given as the quarter and year, another claim by the same patient may have only the year as the date, whereas another patient's claim date could be generalized to a month and year. This inconsistency in generalization makes a data set difficult to analyze using the most common statistical techniques. An argument has been made that using local recoding in de-identification algorithms creates data analysis difficulties and that global recoding is therefore always preferable [13]. In our de-identification we used only global recoding.
- Previous work that looked at trajectories considered sequences of points [14,15]. An analogy to our context would be if only adjacent claims were allowed to be part of the adversary's knowledge. This assumption does not apply, and our problem is more complex because the claims do not need to be in a sequence or adjacent. For example, for a power of 3, an adversary may have background on a patient's first, fifteenth, and twentieth claims in the data set.
- Some previous work looked at the de-identification of graph data [11,16-19]. These papers either use permutation to alter the graph structure (deleting and inserting edges) and/or they partition the nodes of the graph and their corresponding entities into groups, and only the mappings between the groups are revealed. To illustrate the permutation and grouping approaches, assume we have a set of 3 patients, A, B, and C, and a set of 5 claims: claims 1 and 2 correspond to patient A, claims 3 and 4 to patient B, and claim 5 to patient C (see panel a in Figure 2).
A permutation can assign an additional claim to a patient, say patient A to claim 4 (panel b), or remove a claim from the record of a patient. Grouping is illustrated in panel c, where it hides the mappings between the entities and only reports whether two groups have a map between them. In this case we would know that between patients B and C they had claims 3, 4, and 5, but would not know which patient had which claim(s).
Permutation (whether combined with grouping or not) does not retain the structure of the graph. In [11], the authors use only grouping, thus preserving the graph structure exactly; however, the resulting anonymized data would be locally recoded.

Table 6: Example of using generalization and permutation to de-identify transactional data.
- In [20], the authors use generalization and permutation to deal with the problem of attribute disclosure. Their method relies on grouping transactions with varying sensitive values together, thus forming several "anonymized groups". They then publish the quasi-identifiers of each group together with a summary of the sensitive items in the group. In other words, the sensitive attributes are linked to the whole group and not to a particular transaction in the group. Table 6 provides an example of 5 purchase transactions after being anonymized using the method in [20]. Note that the data are divided into 2 groups: the first 3 records form one group and the other 2 form the second group. Each group has one set of sensitive values associated with it. An adversary knows that the sensitive values correspond to records in their group, but the exact correspondence is hidden.
Moreover, our focus is on protecting against identity disclosure, not attribute disclosure. The above method would lead to significant data distortion and produce data sets that are difficult to analyze. Furthermore, this method relies on the fact that sensitive attributes are rare in transaction data, which does not apply to our case.
Other research that considered the power of the adversary has always assumed that the power is fixed for all patients [2-6]. We have argued that this simplifying assumption may not hold in practice, because patients differ in how easy it is to construct background knowledge about them, and we developed a method to model such variation. Some researchers have taken a different approach and suggested that data custodians should define possible groupings of the items in a transaction to meet certain privacy and utility requirements [21,22]. For the HHP data set it is not clear how all of the quasi-identifiers could be grouped a priori, nor how the proposed approach would work with multiple transaction streams (one for each quasi-identifier).