A Privacy Attack on Multiple Dynamic Match-key based Privacy-Preserving Record Linkage

Abstract

Introduction: Over the last decade, the demand for linking records about people across databases has increased in various domains. Privacy challenges associated with linking sensitive information led to the development of privacy-preserving record linkage techniques. The multiple dynamic match-key encoding approach recently proposed by Randall et al. (IJPDS, 2019) is one such technique, aimed at providing sufficient privacy for linkage applications while obtaining high linkage quality. However, the use of this encoding in large databases can reveal frequency information that can allow the re-identification of encoded values.

Objectives: We propose a frequency-based attack to evaluate the privacy guarantees of multiple dynamic match-key encoding. We then present two recommendations that can be used with this match-key encoding approach to prevent such a privacy attack.

Methods: The proposed attack analyses the frequency distributions of individual match-keys in order to identify the attributes used for each match-key, where we assume the adversary has access to a plain-text database with similar characteristics as the encoded database. We employ a set of statistical correlation tests to compare the frequency distributions of match-key values between the encoded and plain-text databases. Once the attribute combinations used for match-keys are discovered, we re-identify encoded sensitive values using a frequency alignment method. We then propose two recommendations, one to alter the original frequency distributions and another to make the frequency distributions uniform; both help to prevent frequency-based attacks.

Results: We evaluate our privacy attack using two large real-world databases. The results show that in certain situations the attack can successfully re-identify a set of sensitive values encoded using the multiple dynamic match-key encoding approach. On the databases used in our experiments, the attack can re-identify plain-text values with both precision and recall of up to 98%. Furthermore, we show that our proposed recommendations make this attack harder to perform with only a small reduction in linkage quality.

Conclusions: Our proposed privacy attack demonstrates weaknesses of multiple match-key encoding that should be taken into consideration when linking databases that contain sensitive personal information. Our proposed recommendations ensure that the multiple dynamic match-key encoding approach can be used securely while retaining high linkage quality.


Introduction
Privacy-preserving record linkage (PPRL) is the process of linking records that belong to the same individual across different databases while preserving the privacy of the individuals that are represented by the records in these databases [1]. Existing PPRL approaches have a trade-off between linkage quality, scalability, and privacy [1], with some being vulnerable to privacy attacks [2,3,4].
The multiple dynamic match-key approach [5] is a recently proposed method for PPRL which aims to provide privacy protection against frequency attacks, while achieving high linkage quality. As we describe next, the main idea of this approach is to generate distinct hash-codes (called match-key values) using attribute value combinations (called match-keys) and compare these match-key values across databases to identify matching pairs of records.

Multiple dynamic match-key approach
We now briefly describe the main steps of multiple dynamic match-key encoding as proposed by Randall et al. [5]. Common notations used throughout the paper are shown in Table 1. We assume a sensitive database, D, with n records, where each record has k attributes. A match-key is a combination of attributes from D. Given there are k attributes in D, there can be 2^k - 1 attribute combinations. A list of match-keys, mk_e, is first selected and then match-key values are created for each record r ∈ D. These match-key values are hash-codes generated by hashing the concatenated values of each match-key mk_e ∈ mk_e, as illustrated in Figure 1. An encoded database (consisting of hash-codes) generated from D is denoted as H, a matrix of hash-codes. The matrix H has n rows and m columns, where n = |D| and m = |mk_e|. A row in H corresponds to a record r ∈ D and contains a list of m hash-codes. A column in H corresponds to a certain match-key mk_e ∈ mk_e, where a given column contains the hash-codes generated for all r ∈ D. Hence a cell H_ij represents a hash-code, h_ij, generated for the i-th record using the j-th match-key, where 1 ≤ i ≤ n and 1 ≤ j ≤ m.
The list mk_e is determined using the probabilistic record linkage method proposed by Fellegi and Sunter [6], which utilises the error characteristics of the databases being linked. As described in detail by Randall et al. [5], for each potential match-key a weight score w is calculated using the method proposed by Fellegi and Sunter [6]. For the multiple dynamic match-key approach, these weight scores w are then compared against a user-defined weight threshold, w_t. Only those match-keys that have a weight score of at least w_t are selected for mk_e to generate match-key values. The authors of the multiple match-key approach also recommend selecting match-keys with at least two attributes [5].
Due to the large number of match-keys that are possible for k attributes (2^k - (k+1) in total when only combinations of two or more attributes are considered), the authors employ a superset pruning approach to reduce the number of match-keys being selected, in order to reduce the computational costs [5]. For instance, if the match-key {FirstName, LastName} has a total score of at least w_t then all supersets of this match-key are pruned. This is because a superset such as {FirstName, LastName, Gender} will not result in any additional matches compared to those already identified by the match-key {FirstName, LastName}.
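The weight-threshold selection combined with superset pruning could be sketched as follows. This is a minimal illustration, not the authors' implementation: the attribute names and weight scores are invented for the example, and we assume the Fellegi-Sunter weight for each candidate match-key has already been computed.

```python
from itertools import combinations

def select_match_keys(attributes, weights, w_t):
    """Select match-keys (attribute combinations of length >= 2) whose
    weight score is at least w_t, pruning supersets of already selected
    match-keys. `weights` maps a frozenset of attributes to its
    pre-computed Fellegi-Sunter weight score."""
    selected = []
    k = len(attributes)
    # Enumerate candidates from shortest to longest so subsets are seen first.
    for length in range(2, k + 1):
        for combo in combinations(sorted(attributes), length):
            candidate = frozenset(combo)
            if any(sel < candidate for sel in selected):
                continue  # proper superset of a selected match-key: prune it
            if weights.get(candidate, 0.0) >= w_t:
                selected.append(candidate)
    return selected

# Illustrative weights; {FirstName, LastName, Gender} scores above the
# threshold but is pruned because its subset is already selected.
weights = {
    frozenset({"FirstName", "LastName"}): 12.0,
    frozenset({"FirstName", "Gender"}): 3.0,
    frozenset({"FirstName", "LastName", "Gender"}): 14.0,
}
print(select_match_keys({"FirstName", "LastName", "Gender"}, weights, 10.0))
```

With the weights above, only {FirstName, LastName} survives: its superset is pruned and the other pairs fall below the threshold w_t = 10.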
Once a list of match-keys mk_e is selected, match-key values are generated for each match-key using the corresponding attribute values of each record. To generate match-key values, a keyed cryptographic hash function can be used, such as the hashed message authentication code (HMAC) [7], which uses a hash function and a secret key (known only to the owners of the databases that are being encoded for linkage). If one of the attribute values used in a match-key is missing, then that match-key value is left empty. Given the list of selected match-keys mk_e, each record in the database D will have up to |mk_e| match-key values assigned to it (fewer if the record has missing values in one or more of the attributes used in mk_e). We provide an example of this generation of match-key values in Figure 1. The match-key values for each r ∈ D are stored as lists (i.e. with an order), and only match-key values at the same position in these lists are compared between records. This reduces the computational complexity of the comparison, since one match-key value of a record will only be compared with one match-key value of another record.
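The generation of a single match-key value can be sketched with Python's standard `hmac` module. This is an illustrative sketch only: the secret key, attribute names, and the choice of SHA-256 are assumptions for the example, not details prescribed by the match-key approach.

```python
import hmac
import hashlib

# Illustrative secret key shared by the database owners (assumption).
SECRET_KEY = b"shared-secret"

def match_key_value(record, match_key):
    """Generate one match-key value (hash-code) for a record, or None
    if any attribute used in the match-key is missing."""
    values = [record.get(attr) for attr in match_key]
    if any(v is None or v == "" for v in values):
        return None  # missing attribute value: leave the match-key value empty
    concatenated = "".join(values)
    return hmac.new(SECRET_KEY, concatenated.encode("utf-8"),
                    hashlib.sha256).hexdigest()

record = {"FirstName": "Peter", "LastName": "Miller", "BirthYear": "1990"}
mk = ("FirstName", "LastName")
print(match_key_value(record, mk))                      # a 64-character hex hash-code
print(match_key_value(record, ("FirstName", "City")))   # None (City is missing)
```

Because HMAC is keyed, the same concatenated value always yields the same hash-code within one linkage, while an adversary without the key cannot simply hash candidate values to test them.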
In a linkage protocol, the owners of the sensitive databases to be linked first generate such a list of match-key values for each record in their database. Assuming two databases, D_A and D_B, these databases are encoded into two matrices of match-key values (hash-codes), H_A and H_B, respectively. In the comparison and classification step, assumed to be conducted by a third party known as the linkage unit [1], the corresponding match-key values (hash-codes in the same column in H_A and H_B) are then compared to classify each candidate record pair either as a match or non-match. Following Randall et al. [5], a record pair is classified as a match (assumed to refer to the same individual) if one or more of its match-key values are the same. If none of the corresponding match-key values of a record pair are the same, then the record pair is classified as a non-match (assumed to refer to different individuals).
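The column-aligned comparison rule can be sketched in a few lines; the hash-code strings below are placeholders, and `None` marks a match-key value left empty due to a missing attribute.

```python
def classify(hashes_a, hashes_b):
    """Classify a record pair as a match if any corresponding (same-column)
    match-key values are present and equal."""
    return any(h_a is not None and h_a == h_b
               for h_a, h_b in zip(hashes_a, hashes_b))

# Hash-code lists for two records, one entry per selected match-key.
row_a = ["aa1f", None, "c3d9"]
row_b = ["aa1f", "b2e0", "ffff"]
print(classify(row_a, row_b))             # True: the first match-key values agree
print(classify(["x", None], ["y", "z"]))  # False: no column agrees
```

The explicit `is not None` guard matters: two empty match-key values (both `None`) must not count as agreement.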

Contribution
In this paper, we analyse the privacy guarantees of this multiple dynamic match-key approach using a privacy attack that aims to re-identify the plain-text values that were encoded into hash-codes, using a known (publicly available) plain-text database. The attack tries to identify the attributes used for match-keys by analysing the frequency distributions of individual match-key values. Using two large real-world databases, we show that the attack can re-identify plain-text values with an acceptable level of accuracy. We then propose two improvements as countermeasures for multiple match-key encoding and discuss how these methods can strengthen the privacy guarantees of the match-key encoding approach and prevent this type of privacy attack.

Methods
We define the following terms which we use to describe the match-key approach and the proposed privacy attack (as also summarised in Table 1). Assume a sensitive database D is encoded using the multiple dynamic match-key approach, which results in a matrix H of hash-codes, as shown in Figure 1. Each column in H contains hash-codes generated using a unique match-key, mk_e ∈ mk_e. We refer to these columns as encoded match-keys. As with existing attacks on PPRL [2,3,4], we assume the adversary has access to a plain-text database V with a frequency distribution of attribute values similar to the encoded database D. Such plain-text databases can be sourced externally (e.g. a telephone directory or voter database) or internally (if the attack is conducted by an insider who has access to some type of plain-text database that is similar to the encoded database). We then generate a list of match-keys, mk_p, from the plain-text database V to compare against the encoded match-keys. We refer to these match-keys as plain-text match-keys. The resulting plain-text match-key matrix is denoted as M, where each column is a plain-text match-key mk_p ∈ mk_p. We assume that the plain-text database V contains some or all attributes that are used for the encoded match-keys in D. We also assume different levels of knowledge of the adversary, as we discuss in the Results section.

Table 1 (excerpt): Common notation used throughout the paper.
M, M_*j : Plain-text match-key matrix, and a column in M which represents a mk_p ∈ mk_p
C, C_i*, C_*j : Correlation matrix, a row in C (mk_p ∈ mk_p), and a column in C (correlation metric)
f : A frequency distribution
i, j : The i-th row and j-th column in a matrix

Figure 1: An example encoding process assuming three attribute combinations (match-keys) where |mk_e| = 3. The top table shows the plain-text database attributes before encoding and the bottom table shows the generated match-key values using HMAC [7]. Note that no match-key value (hash-code) is generated if a match-key has a missing attribute value.

As we describe in detail below, the proposed privacy attack consists of five main steps:

1. Calculate the frequency distributions of encoded match-keys (f_e^j) and plain-text match-keys (f_p^j).
2. Calculate the correlations between the frequency distribution of each encoded match-key H_*j ∈ H and those of all plain-text match-keys M_*j ∈ M, and for each H_*j build a correlation matrix, C.
3. Select candidate plain-text match-keys for each encoded match-key H_*j by comparing the correlation values in C.

4. Assign highly correlated plain-text match-keys mk_p to each encoded match-key mk_e.
5. Re-identify plain-text values using frequency alignment of plain-text match-keys.
The first two steps of the attack are illustrated in Figure 2. We discuss these two steps in detail in the following two sections.
Step 1: Calculating frequency distributions

In the first step of the attack we calculate the frequency distribution of match-key values, f_e^j, for each encoded match-key (i.e. column H_*j ∈ H), where 1 ≤ j ≤ |mk_e|. Since the match-key values of records are stored as lists (i.e. with an order), we are able to calculate a frequency distribution of match-key values for each encoded match-key.
Next, we generate different match-keys from the plain-text database V. Assuming there are k attributes in V (including some or all attributes used for the encoded match-keys), we generate all match-keys of length two to k, since match-keys are assumed to be generated only from combinations of two or more attributes [5]. For the resulting list of plain-text match-keys, mk_p, we generate for each record r ∈ V a list of plain-text match-key values using all mk_p ∈ mk_p. This results in a matrix of plain-text match-key values, M, where a column M_*j ∈ M represents a plain-text match-key mk_p ∈ mk_p. Then, for each column M_*j, we calculate the frequency distribution, f_p^j, of the concatenated attribute values in mk_p from M.
For instance, if there are three attributes FirstName, LastName, and BirthYear in the plain-text database V, we can generate four match-keys: (FirstName+LastName), (FirstName+BirthYear), (LastName+BirthYear), and (FirstName+LastName+BirthYear). Here + represents string concatenation. We then calculate frequency distributions for all four match-keys. At the end of the first step the adversary has the frequency distributions of all encoded match-keys from H (f_e^j) and of all plain-text match-keys from M (f_p^j).

Figure 2: An overview of the first two steps of the proposed privacy attack. We first generate the plain-text match-key matrix, M, and then we calculate the frequency distributions, f_e^j and f_p^j, of all encoded match-keys H_*j ∈ H and plain-text match-keys M_*j ∈ M. Next, every possible pair of f_e^j and f_p^j is compared using several correlation metrics. This results in a correlation matrix C for each encoded match-key H_*j.
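The enumeration of plain-text match-keys and their frequency distributions can be sketched as follows. The records and attribute names are illustrative; the `+` separator matches the concatenation notation used in the example above.

```python
from collections import Counter
from itertools import combinations

def plain_text_match_keys(attributes):
    """All attribute combinations of length two up to k."""
    attrs = sorted(attributes)
    return [combo for length in range(2, len(attrs) + 1)
            for combo in combinations(attrs, length)]

def frequency_distribution(records, match_key):
    """Frequency distribution of the concatenated attribute values
    for one plain-text match-key."""
    return Counter("+".join(r[a] for a in match_key) for r in records)

records = [
    {"FirstName": "Peter", "LastName": "Miller", "BirthYear": "1990"},
    {"FirstName": "Peter", "LastName": "Miller", "BirthYear": "1990"},
    {"FirstName": "Anna", "LastName": "Smith", "BirthYear": "1985"},
]
mks = plain_text_match_keys({"FirstName", "LastName", "BirthYear"})
print(len(mks))  # 4 match-keys for k = 3, i.e. 2^3 - (3+1)
print(frequency_distribution(records, ("FirstName", "LastName")))
```

For k = 3 this yields the four match-keys from the example; for realistic k the count grows as 2^k - (k+1), which motivates the filtering step described later.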
Step 2: Measuring the correlation of frequency distributions

In the second step, we use a series of statistical tests to compare the frequency distributions of all plain-text match-keys, f_p^j, with one given encoded match-key, f_e^j. For each encoded match-key, H_*j, we build a correlation matrix, C, which contains all correlations calculated between each plain-text match-key, M_*j, and that encoded match-key, H_*j (each pair of f_e^j and f_p^j), using different correlation metrics. Then, by comparing these correlation values, we obtain the most similar plain-text match-keys for the current encoded match-key, H_*j. In the following we discuss this process in more detail.
The statistical tests we use can be divided into two categories. The first consists of basic statistics which can be calculated for a distribution, while the second consists of correlation tests which can be used to compare two distributions. As basic statistics we calculate the mean, standard deviation, variance, and skewness of the two frequency distributions and calculate the absolute differences of those values for different pairs of f_e^j and f_p^j. We then employ the following correlation metrics to compare the two distributions f_e^j and f_p^j:

1. Earth mover's distance (EMD) [8]: The EMD measures the dissimilarity between two probability distributions over some feature space. Given one distribution as the baseline, EMD calculates the least amount of work needed to transform it into the second distribution.
2. Kolmogorov-Smirnov (KS) test [9]: The KS test measures the equality of two continuous probability distributions and can be used to test whether one distribution is a sample of the other.
3. Pearson's correlation coefficient [10]: Given two continuous variables x and y, Pearson's correlation coefficient measures the linear correlation between these two variables, based on their covariance.

4. Spearman's rank correlation [10]: Spearman's rank correlation measures the correlation between two ranked variables; specifically, it calculates the strength and direction of the association between them.

5. Relative entropy (Kullback-Leibler divergence) [11]: Relative entropy measures how one probability distribution differs from a reference probability distribution. We use relative entropy to measure how much one frequency distribution differs from the other.
6. Histogram intersection [12]: Histogram intersection measures the similarity of two probability distributions by calculating the intersection of their histograms. For this calculation the two histograms should be of the same length, or both distributions can be placed into the same number of bins using an appropriate binning technique such as equal depth binning. Given two histograms, x and y, with b bins each, the intersection of the two histograms can be calculated as:

I(x, y) = Σ_{i=1}^{b} min(x_i, y_i),

where min() outputs the minimum of x_i and y_i. To normalise the intersection value, we divide I(x, y) by the average total count of the two histograms:

I(x, y)_norm = I(x, y) / ((Σ_{i=1}^{b} x_i + Σ_{i=1}^{b} y_i) / 2).

In our attack, we use histogram intersection to calculate the intersection of the frequencies of the distribution pairs f_e^j and f_p^j. The higher the value of I(x, y)_norm, the more similar the two distributions are.
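Histogram intersection is simple enough to sketch directly; the bin counts below are illustrative, and we normalise by the average total count of the two histograms as described above.

```python
def histogram_intersection(x, y):
    """Intersection of two histograms of equal length b."""
    return sum(min(x_i, y_i) for x_i, y_i in zip(x, y))

def normalised_intersection(x, y):
    """Intersection normalised by the average total count of the two
    histograms; 1.0 means identical frequency distributions."""
    return histogram_intersection(x, y) / ((sum(x) + sum(y)) / 2.0)

x = [50, 30, 20]
y = [40, 30, 30]
print(histogram_intersection(x, y))   # 40 + 30 + 20 = 90
print(normalised_intersection(x, y))  # 90 / 100 = 0.9
```

Two identical histograms give a normalised intersection of exactly 1.0, which is why higher values indicate more similar distributions.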
These six metrics are commonly used to compare probability or frequency distributions. Given we are interested in measuring the correlations between two frequency distributions, we aim to identify correlated match-keys by employing these metrics. Furthermore, our evaluation will help to determine the suitability of these correlation metrics in such an attack scenario.
The correlation values calculated using the above statistical tests are added to a single vector, c, of ten values (four values from the basic statistical metric differences and six values from the correlation metrics discussed above) for each pair of frequency distributions f_e^j and f_p^j. For each encoded match-key, H_*j, and its corresponding frequency distribution, f_e^j, we obtain a set of correlation vectors by comparing f_e^j with the frequency distributions f_p^j of all possible plain-text match-keys in M. This results in a correlation matrix C^j (for that particular encoded match-key H_*j), where C_i* (row i) represents a correlation vector c calculated for a certain plain-text match-key and C_*j (column j) represents a vector of correlations for a certain correlation metric (i.e. one of the above ten tests). In practice it is important to understand which correlation metrics are robust when calculating correlations between different distributions. This aspect is outside the scope of this paper, and we consider the performance of all correlation metrics to be equal in our context. In the following steps we describe how the obtained correlation values can be used to filter plain-text match-keys that are unlikely to match encoded match-keys.

Frequency based filtering: Given there are k attributes in V, there are potentially 2^k - (k+1) attribute combinations that we need to analyse. Hence, we propose a filtering step to reduce this number of candidate plain-text match-keys by comparing their frequencies. We identify impossible plain-text match-keys through their top most frequent values. For instance, an encoded match-key with a highest frequency of 50 will not be compared with a plain-text match-key that has a highest frequency of 150. This assumes that the two databases have a similar number of records and similar frequency distributions in their attribute values.
We define a frequency range parameter, ε, for the range of frequency values that can be used to filter plain-text match-keys. An encoded match-key with a highest frequency of max(f_e^j) will only be compared with plain-text match-keys whose highest frequency max(f_p^j) lies within the range [max(f_e^j) - ε, max(f_e^j) + ε]. Here max() represents the top frequency (highest count) of that particular frequency distribution. The value for ε needs to be adjusted based on the encoded and plain-text database sizes, |D| and |V|.
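This frequency filter reduces to a single comparison per candidate. The sketch below assumes, as described above, that a candidate is kept when its top frequency lies within ε of the encoded match-key's top frequency; the numbers are illustrative.

```python
def within_frequency_range(max_encoded, max_plain, epsilon):
    """Keep a plain-text match-key only if its highest frequency lies
    within epsilon of the encoded match-key's highest frequency."""
    return abs(max_encoded - max_plain) <= epsilon

# Adaptive epsilon: a ratio of the encoded match-key's top frequency
# (ratios from 0.1 to 0.9 were explored in the experiments).
max_f_e = 50
epsilon = 0.5 * max_f_e
print(within_frequency_range(max_f_e, 60, epsilon))   # True  (|50 - 60| <= 25)
print(within_frequency_range(max_f_e, 150, epsilon))  # False (|50 - 150| > 25)
```

Because this test needs only the top frequency of each distribution, it can discard most of the 2^k - (k+1) candidates before any expensive correlation test runs.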
Step 3: Filtering candidate plain-text match-keys

In the third step of the attack we first normalise the correlation values in the matrix C for each statistical test using min-max normalisation. Correlation values are normalised such that the highest correlation becomes 1.0 and the lowest becomes 0.0. Distance values (such as Earth mover's distance and relative entropy), however, are normalised such that the highest value becomes 0.0 and the lowest becomes 1.0. This ensures that all values obtained using the different statistical tests are in the same range and comparable.
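The normalisation step can be sketched as follows; the `invert` flag handles distance metrics, where smaller values mean more similar distributions, and the input values are illustrative.

```python
def min_max_normalise(values, invert=False):
    """Scale values to [0, 1]; with invert=True (for distance metrics such
    as EMD and relative entropy) the smallest value maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)  # all candidates equally similar
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled

print(min_max_normalise([2, 5, 8]))               # [0.0, 0.5, 1.0]
print(min_max_normalise([2, 5, 8], invert=True))  # [1.0, 0.5, 0.0]
```

After this step a value of 1.0 consistently means "most similar" for every metric, so the per-metric columns of C can be compared and averaged directly.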
Next, for each correlation metric, C_*j, we sort the plain-text match-keys mk_p in descending order based on their normalised correlation values and select the top plain-text match-keys from this sorted list. For each pair of adjacent plain-text match-keys, mk_x and mk_{x+1}, we calculate the difference ratio, α_d, of their normalised correlation values. We do this to keep the candidate plain-text match-keys that have the highest correlation values with only small differences between them. Only plain-text match-keys that have a difference ratio below a user-defined difference ratio limit, α, are considered. Assuming the correlation values of mk_x and mk_{x+1} are s_x and s_{x+1}, respectively, the difference ratio between those two values, α_d, is calculated as:

α_d = (s_x - s_{x+1}) / s_x.

If α_d ≤ α, then we select both match-keys mk_x and mk_{x+1}. If α_d > α, then we stop the selection of candidate match-keys at that point. This process is performed for all statistical tests, resulting in a filtered matrix of plain-text match-keys for each encoded match-key. Table 2 shows an example matrix C obtained after the filtering process for the encoded match-key (LastName+Gender+BirthYear). Plain-text match-keys with crossed-out correlation values are not selected for that particular correlation metric due to having a too low correlation value, as discussed in the above filtering step.
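The candidate-selection walk down the sorted list could look as follows. This is a sketch under one assumption: the difference ratio of two adjacent values is computed relative to the larger (preceding) value; the match-key labels and scores are invented for the example.

```python
def select_candidates(sorted_match_keys, alpha):
    """Walk a list of (match-key, normalised correlation) pairs sorted in
    descending order and keep candidates while the difference ratio
    between adjacent correlation values stays within alpha."""
    selected = [sorted_match_keys[0][0]]
    for (_, s_x), (mk_next, s_next) in zip(sorted_match_keys,
                                           sorted_match_keys[1:]):
        alpha_d = (s_x - s_next) / s_x  # difference ratio of adjacent values
        if alpha_d > alpha:
            break  # the next candidate's correlation drops too sharply
        selected.append(mk_next)
    return selected

ranked = [("F+L+B", 1.00), ("L+G+B", 0.97), ("F+B", 0.60)]
print(select_candidates(ranked, alpha=0.05))  # ['F+L+B', 'L+G+B']
```

With α = 0.05 the third candidate is cut off, since its correlation drops by roughly 38% relative to the one above it.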
At the end of this step, we obtain one filtered correlation matrix C^j per encoded match-key, H_*j. This C^j contains the plain-text match-keys, mk_p, whose frequency distributions are most similar to that encoded match-key H_*j, based on the highest correlation values.
Step 4: Re-identifying encoded match-keys

In the fourth step of the attack, we try to identify the correct plain-text match-key for each encoded match-key, H_*j, from its corresponding matrix C^j. Since we analyse a single encoded match-key at a time, we omit the index j from all notations in this section to improve readability. Note, however, that the following process is applied to each encoded match-key H_*j separately.
For a given filtered matrix C, we first calculate an overall correlation, a_c, for each plain-text match-key mk_p by averaging the correlation value vector of that match-key. As discussed above, this vector consists of the normalised correlation values for the ten different statistical tests. Given that a plain-text match-key mk_p ∈ C has a correlation vector c_p, the overall correlation a_c can be calculated as:

a_c = (1/10) Σ_{i=1}^{10} c_{p,i},

where c_{p,i} is the i-th element of c_p.

Table 2: An example correlation matrix C for the encoded match-key (LastName+Gender+BirthYear) after the filtering process discussed in the third step of the attack. Crossed-out values represent the plain-text match-keys that are not selected for the corresponding metric due to their low correlation.
Next we look at the number of unique match-key values of the plain-text match-key (the number of unique concatenated plain-text values) and of the encoded match-key (the number of unique hash-codes). Note that even if the distributions of an encoded match-key and a plain-text match-key are similar, these two match-keys might have different numbers of unique values. Therefore, apart from the correlation measurements, the number of unique values can also be used to compare the two match-keys. If the encoded match-key has q_e unique values and a plain-text match-key mk_p ∈ C has q_p unique values, we calculate the difference ratio d_c of those values as:

d_c = abs(q_e - q_p) / max(q_e, q_p),

where abs() calculates the absolute value. We next combine the two values, a_c and d_c, as a weighted average, where the difference ratio d_c is converted into the similarity 1 - d_c so that a higher score always indicates a more similar match-key. Assuming the weight values for a_c and (1 - d_c) are ω and 1 - ω, respectively (with 0 ≤ ω ≤ 1), the average value s_c is calculated as:

s_c = ω · a_c + (1 - ω) · (1 - d_c).

For each encoded match-key we obtain |C| of these average values, where |C| is the number of plain-text match-keys assigned to that encoded match-key. We then sort the plain-text match-keys according to their s_c values and assign the plain-text match-key with the highest s_c value to that encoded match-key. We repeat this step for all encoded match-keys H_*j. At the end of this step we obtain a ranked list of plain-text match-keys that are most similar to each encoded match-key.
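The combined score could be sketched as follows. Two points here are our reading of the method rather than stated detail: the difference ratio is normalised by the larger unique-value count, and it is folded in as the similarity 1 - d_c; the correlation values and counts are illustrative.

```python
def overall_score(correlations, q_e, q_p, omega=0.7):
    """Combine the average correlation a_c over all statistical tests with
    the unique-value difference ratio d_c into a weighted score s_c."""
    a_c = sum(correlations) / len(correlations)      # overall correlation
    d_c = abs(q_e - q_p) / max(q_e, q_p)             # unique-value difference ratio
    return omega * a_c + (1.0 - omega) * (1.0 - d_c)

# Four illustrative normalised correlation values (a full vector has ten),
# with 100 unique hash-codes against 90 unique plain-text values.
print(round(overall_score([0.9, 0.8, 1.0, 0.9], q_e=100, q_p=90), 3))  # 0.9
```

Sorting candidates by this s_c in descending order and taking the top entry yields the plain-text match-key assigned to the encoded match-key.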
Based on the superset pruning method suggested by the authors of the multiple dynamic match-key approach [5], we can also check for super- or subsets among the identified plain-text match-keys for each encoded match-key H_*j. If an encoded match-key has already been assigned a plain-text match-key mk_p^x, then none of the super- or subsets of mk_p^x can be assigned to another encoded match-key, under the assumption that the first assignment of mk_p^x is correct. An example is discussed in the Introduction section.
Step 5: Re-identification of plain-text values

Once the plain-text match-key for each encoded match-key H_*j is identified, the final step of the attack is to perform plain-text value re-identification by applying a frequency alignment method [2]. For each encoded match-key, H_*j, we obtain the frequency distribution of match-key values, and for the identified plain-text match-key, mk_p^j, of H_*j, we obtain the frequency distribution of the corresponding plain-text values. Both frequency distributions are then sorted in descending order, as shown in Table 3. First, we adjust these frequency values according to the sizes of the databases D and V: we can either adjust the frequency values of the match-key values by multiplying each value by |V| / |D|, or alternatively adjust the frequency values of the plain-text values by multiplying each value by |D| / |V|. Then we loop over each pair of frequency-aligned match-key value and plain-text value and assign the i-th plain-text value to the corresponding i-th match-key value. We continue this process as long as the difference of their frequencies is at most Δ_t, a user-defined threshold. Given the frequencies of the i-th match-key value and the i-th plain-text value are f_i^j and f_i^c, we calculate the frequency difference Δ_i as:

Δ_i = abs(f_i^j - f_i^c) / max(f_i^j, f_i^c),

where abs() returns the absolute value. If Δ_i > Δ_t, then we stop the alignment of plain-text values to encoded match-key values (hash-codes).
Furthermore, if several adjacent frequencies of match-key values and plain-text values in the sorted lists are the same, then it is difficult to determine which plain-text value corresponds to which match-key value. For example, if the i-th and (i+1)-th plain-text values have the same frequency, we cannot be certain whether the i-th match-key value refers to the i-th or the (i+1)-th plain-text value. In such scenarios we assign both the i-th and (i+1)-th plain-text values to both the i-th and (i+1)-th match-key values. An example of this plain-text alignment process is illustrated in Table 3: as shown, values with the same frequency are aligned together in order to improve the accuracy of the alignment. It is also important to note that this alignment works when the underlying attribute value distributions are similar in both V and D. In cases where the attribute value distributions are dissimilar, this frequency alignment can potentially lead to incorrect assignments.
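The alignment loop, including the tie handling and the stopping threshold, could be sketched as follows. The relative frequency difference normalised by the larger value is our assumption, and the hash-codes, names, and counts are invented for the example.

```python
def align(enc_freqs, plain_freqs, scale, delta_t):
    """Align hash-codes with plain-text values by descending frequency.
    enc_freqs / plain_freqs are (value, frequency) pairs; plain-text
    frequencies are first scaled by |D| / |V|. When adjacent frequencies
    tie, every tied hash-code is assigned all tied plain-text values."""
    enc = sorted(enc_freqs, key=lambda p: -p[1])
    pla = [(v, f * scale) for v, f in sorted(plain_freqs, key=lambda p: -p[1])]
    assignments = {}
    for (h, f_e), (_, f_p) in zip(enc, pla):
        if abs(f_e - f_p) / max(f_e, f_p) > delta_t:
            break  # frequencies diverge too much: stop the alignment
        # collect every plain-text value whose scaled frequency ties f_p
        assignments[h] = [w for w, g in pla if g == f_p]
    return assignments

enc = [("d1", 12), ("d2", 7), ("d3", 7)]
pla = [("Peter+Miller", 11), ("Anna+Smith", 7), ("Tom+Jones", 7)]
print(align(enc, pla, scale=1.0, delta_t=0.2))
```

Here d1 is aligned with the single top plain-text value, while d2 and d3 are each assigned both tied values, mirroring the tie handling described above.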
Using the process discussed above we can now identify the combined plain-text values from V for the most frequent match-key values in each encoded match-key H_*j. In the following section we describe experiments conducted using our proposed attack method and discuss the results we obtained.

Table 3: An example of plain-text alignment for the match-key FirstName+LastName+BirthYear (F+L+B) using the frequency values of match-key values and plain-text values. As discussed in step 5 of the attack, record d1 from the encoded database D is aligned with record v1 from the plain-text database V. Since the three records d2, d3, and d4 all have a frequency of 7 and the three records v2, v3, and v4 in the plain-text database have the same frequency, we assign each of d2, d3, and d4 to all three values v2, v3, and v4. Similarly, we assign each of d5 and d6 to both values v5 and v6, since they all have a frequency of 4.


Results
We evaluated our proposed attack method on the multiple dynamic match-key approach [5] using two large real-world databases. We investigated different parameter settings under different attack scenarios to measure the feasibility of our attack in real linkage situations.

Databases
We first used the North Carolina Voter Registration (NCVR) database, where we used six snapshots collected in October, August, and February 2019, October 2018 and 2017, and October 2011. We used the snapshot from October 2019 as the sensitive encoded database, D, and the other snapshots as the plain-text databases, V, to represent several time intervals (2 months, 8 months, 1 year, 2 years, and 8 years, respectively) between the encoded and the plain-text databases. Our aim is to evaluate how temporal differences in records, such as attribute changes, between databases affect the accuracy and scalability of the attack. Our hypothesis is that the larger the time difference, the larger the differences in attribute value distributions between D and V, as more voters change their names and/or addresses. The second database was the Michigan Voter Registration (MVR) database, where we used four snapshots collected in September and January 2016, September 2014, and June 2013. We used the snapshot from September 2016 as the sensitive encoded database, D, and the other snapshots as the plain-text databases, V. As with the NCVR databases, our aim is to evaluate how the time difference between databases affects the attack.

Evaluation criteria
To evaluate the quality of the identification of plain-text match-keys for each encoded match-key H_*j, we analysed the top three to five plain-text match-keys assigned to each encoded match-key and checked whether the attack was able to identify the correct match-key with a high correlation. We then evaluated the accuracy of the re-identified plain-text values for match-key values using precision and recall [13,14], which measure how many of the identified plain-text values are correct and how many of the correct values are contained in the identified set of plain-text values, respectively.
We also define three different attack scenarios based on the assumed knowledge an adversary has about the match-key encoding settings. The scenarios we considered are as follows:

• Knowledge of attribute combinations (Comb): In this scenario we assume that the adversary has knowledge about the actual attribute combinations (match-keys) used in the match-key generation process, but does not know which encoded match-key H_*j corresponds to which plain-text match-key M_*j. In this situation the adversary does not have to consider all possible plain-text match-keys; only the known combinations need to be evaluated to measure the correlation of frequency distributions. This scenario can be considered the best possible attack scenario on an encoded database from an adversary's point of view.
• Knowledge of used attributes (Attr): In this scenario we assume that the adversary has knowledge only about the set of attributes used for the match-key generation, but does not know which combinations of attributes are used. Unlike the previous scenario, in this situation the adversary has to conduct the first step of the attack (as discussed in the Methods section) using all possible plain-text match-keys to identify the best matching plain-text match-keys for each encoded match-key, and then perform the second step of the attack.
• Domain knowledge (Dom): In this final scenario we assume that the adversary does not have any specific knowledge about the attributes, nor the match-keys, used in the match-key generation process. The adversary only knows what kind of entities are encoded in H (such as people with their names, addresses, and so on) and therefore can partially assume what kind of quasi-identifiers are being used for the encoding. Hence, the adversary needs to run the attack on all possible match-keys using all attributes available in V that potentially were encoded into match-keys. This scenario can be considered the worst case scenario on an encoded database for an adversary.
We implemented all attack methods using Python 2.7 and the experiments were run on a server with 64-bit Xeon 2.1 GHz 16-Core CPU, 512 GBytes of memory and running Ubuntu 18.04. To facilitate reproducibility the prototype programs are available from: https://dmm.anu.edu.au/pprlattack/.
For the attack parameter settings, we set α = 0.05, ω = 0.7, and Δt = 0.2. An adaptive value is selected for the parameter ε based on the maximum frequency value of the considered encoded match-key: a ratio from 0.1 to 0.9 is first selected and then multiplied by the maximum frequency value of the encoded match-key to calculate ε. These settings provided good results in a series of set-up experiments. The parameter α can be set to an appropriate value by analysing the correlation values obtained for each correlation metric. If the frequency distributions of the encoded and plain-text match-keys are similar, then α can be set to a low value such as α < 0.1. Conversely, with contrasting frequency distributions, α can be set to a higher value such as α = 0.3 to select a wide range of candidate plain-text match-keys. However, setting α to an even higher value such as α > 0.4 will potentially result in a large selection of candidate plain-text match-keys. We set the weight ω to a value higher than 0.5 because the average correlation a_c was consistently more accurate in correctly identifying encoded match-keys than the difference ratio d_c in our experiments. The value for Δt should be set to a low value such as Δt < 0.3 to obtain accurate re-identifications. If the frequency distributions of the two databases are known to differ, a higher value such as Δt = 0.4 can be set. This should, however, be done carefully, because allowing substantially different value distributions to be mapped can lead the attack to wrongly align dissimilar plain-text values with each other. It is also worth noting that the optimal values for these parameters are data dependent and should therefore be set over several iterations of experiments.
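The adaptive choice of ε and the weighting with ω can be sketched as below. This is our own minimal reconstruction: the linear combination of the average correlation a_c and the difference ratio d_c is an assumption based on the description above, and the exact scoring formula is defined in the Methods section.

```python
def adaptive_epsilon(encoded_freqs, ratio=0.5):
    """epsilon = ratio * maximum frequency value of the considered encoded
    match-key, with the ratio selected from the range 0.1 to 0.9."""
    assert 0.1 <= ratio <= 0.9
    return ratio * max(encoded_freqs)

def match_key_score(a_c, d_c, omega=0.7):
    """Assumed weighted combination of the average correlation a_c and the
    difference ratio d_c; omega > 0.5 favours a_c, which was the more
    reliable of the two measures in the experiments."""
    return omega * a_c + (1.0 - omega) * d_c

print(adaptive_epsilon([26, 10, 4], ratio=0.5))  # 13.0
```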

Match-key generation
Following the originally proposed dynamic match-key encoding [5], we used the following sets of attributes in the experiments for the two databases NCVR and MVR.
• NCVR: FirstName, MiddleName, LastName, BirthYear, StreetAddress, and ZipCode
• MVR: FirstName, MiddleName, LastName, BirthYear, StreetNumber, StreetName, and ZipCode
These attributes are commonly used in linking records across different databases [1]. Following Randall et al. [5], we selected a list of match-keys that provided the highest linkage quality (according to the F-measure) using the above attributes to encode the sensitive databases.
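The match-key generation step can be sketched as follows. This is a minimal illustration only: plain SHA-256 over lower-cased, concatenated attribute values stands in for the keyed hashing an actual PPRL deployment would use, and the example record and attribute combinations are hypothetical.

```python
import hashlib

def encode_match_key(record, attribute_combination):
    """Concatenate the (normalised) values of one attribute combination
    and hash the result into a match-key value (hash-code)."""
    plain = ''.join(record[a].strip().lower() for a in attribute_combination)
    return hashlib.sha256(plain.encode('utf-8')).hexdigest()

record = {'FirstName': 'Paul', 'LastName': 'Thomas', 'BirthYear': '1992'}

# One match-key value per selected attribute combination:
match_keys = [
    encode_match_key(record, ['FirstName', 'LastName', 'BirthYear']),
    encode_match_key(record, ['FirstName', 'BirthYear']),
]
```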

Discussion
In Tables 4 and 5 we show the re-identification results for each encoded match-key used in different databases from different time intervals using the NCVR and MVR databases. As can be seen, with NCVR we are able to identify the attribute combinations used for each encoded match-key with high accuracy despite the time difference and the corresponding percentages of attribute changes between these databases. The correct match-key was always amongst the top three identified plain-text match-keys. Furthermore, the attack was able to identify the correct combination as the first-ranked (the match-key with the highest correlation) in 20 out of 24 cases.
With the MVR databases the re-identification accuracy of the match-keys is lower compared to the NCVR databases. However, as can be seen in Table 5, the identified match-keys are always a superset or a subset of the correct match-key. The correct match-key was amongst the top five identified match-keys in 17 out of 21 re-identifications. The lower accuracy is caused by low and similar values in the frequency distributions f^e_j of match-keys. As shown in Table 5, the highest frequency values of encoded match-keys in MVR are only 3 and 2, compared to the NCVR databases where they range from 4 to 26. Therefore, the correlation values measured using different plain-text match-keys (especially the supersets and subsets of the correct match-key) are all highly similar, and it is difficult to distinguish the correct match-key from other possible candidates. While the results on the NCVR databases illustrate the robustness of the attack in re-identifying encoded match-keys when there is enough frequency information in match-keys, the results on the MVR databases highlight the need for sufficient frequency information and distinct frequency distributions for the attack to be highly successful.
Table 6 shows the precision, recall, and the number of re-identifications of plain-text values for the NCVR and MVR databases. These are the results obtained in the final step of the attack as discussed in the Methods section. As can be seen, with the NCVR databases, the attack was able to re-identify plain-text values encoded in match-key values with high accuracy when the time difference between databases was small. However, both precision and recall drop as the time difference between databases increases.
This is because, even though the attack was able to identify the correct attribute combinations used for encoded match-keys by analysing the frequency distributions, exact re-identifications of plain-text values as described in the fifth step of the attack are difficult due to changes in the actual attribute values. Across the different scenarios the attack maintains fairly consistent results. As expected, with the Dom scenario the numbers of re-identifications are slightly lower than in the other scenarios because the accuracy of the re-identification of match-keys is lower.
With the MVR databases, the plain-text alignment accuracy is relatively low compared to the NCVR database experiments. This is because plain-text alignment is done using only the top-ranked selected plain-text match-key, as discussed in the Methods section. According to Table 5, the attack was not able to identify the correct encoded match-key as the top-ranked plain-text match-key in most cases. Therefore, the plain-text alignment did not perform as accurately as with the NCVR databases. However, we are still able to identify some of the attribute values correctly because all identified plain-text match-keys for the MVR are either a superset or a subset of the correct encoded match-keys.
With regard to time efficiency, the first and second steps are the most time-consuming steps of the attack, requiring a maximum of 3,840 and 7,352 seconds, respectively. The third and fourth steps took less than one second for all experiments, and the fifth step took a maximum of 307 seconds. Overall, the attack took less than 3.5 hours to complete for all databases and attack scenarios 3 .
Out of all the tests we utilised to compare frequency distributions, the correlation metrics (Earth mover's distance, the KS test, Pearson's correlation, Spearman's rank correlation, relative entropy, and histogram intersection) consistently provided better results than the four basic statistical measures across the different attack scenarios, databases, and time intervals. This is due to the ability of these correlation metrics to identify common characteristics of distributions even when the actual frequency values differ.
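Three of these measures can be illustrated with the following pure-Python sketch (our own simplified implementations for two equal-length frequency distributions; in practice, library routines such as those in scipy.stats would be used):

```python
import math

def pearson(x, y):
    """Pearson's correlation between two frequency distributions."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def relative_entropy(p, q):
    """KL divergence between two normalised frequency distributions
    (0.0 when the distributions are identical)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def histogram_intersection(p, q):
    """Overlap of two normalised distributions (1.0 means identical)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

# Perfectly linearly related frequencies correlate fully:
print(round(pearson([1, 2, 3], [2, 4, 6]), 9))  # 1.0
```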

Privacy improvements and recommendations
Considering the identified privacy vulnerabilities in the match-key generation process, we now propose two recommendations to strengthen the privacy guarantees of the multiple dynamic match-key encoding approach. We further show empirical evidence to support our claim that these recommendations make the match-key encoding resilient to frequency-based privacy attacks.
Recommendation 1: In the encoded database, the first value in each encoded record corresponds to the first match-key, the second value to the second match-key, and so on. This is because the match-key values are stored in the same order for all records.
This ordering of match-key values allows the calculation of frequency distributions of each match-key in the attack. An adversary can safely assume that a single column in the encoded database D (a column H *j ∈ H) corresponds to a single encoded match-key. To prevent such assumptions by the adversary we propose to store match-key values in sets as opposed to lists. Each row H i* ∈ H in the encoded database should be a set of match-key values where there is no order. Now the adversary cannot assume a single column represents a certain encoded match-key.
This improved privacy comes with a trade-off. A single match-key value from a record now needs to be compared with every match-key value from another record, which leads to increased time consumption. Furthermore, the possibility of obtaining false positives will also increase. For instance, if two records r1 and r2 have the following attribute values: r1: {FirstName = "Paul", LastName = "Thomas", BirthYear = "1992"} and r2: {FirstName = "Thomas", LastName = "Paul", BirthYear = "1992"}, then match-key (F+B) of r1 and match-key (L+B) of r2 will have the same value and the record pair (r1, r2) will be classified as a match. An increase in false positives will reduce the overall quality of the linkage. However, this can be solved by prefixing an attribute specific value (such as an attribute qualifier or an attribute index) to the plain-text value before hashing.
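A minimal sketch of this fix follows, with plain SHA-256 standing in for the actual keyed hash used by the encoding, and 'F+B' / 'L+B' as assumed attribute-combination qualifiers:

```python
import hashlib

def prefixed_hash(qualifier, plain_value):
    """Prefix an attribute-specific qualifier before hashing so that equal
    plain-text values from different match-keys yield different hash-codes."""
    return hashlib.sha256((qualifier + '-' + plain_value).encode()).hexdigest()

# r1 has FirstName "Paul", r2 has LastName "Paul"; without a qualifier
# their (F+B) and (L+B) values both hash "paul1992" and collide:
colliding_value = hashlib.sha256(b'paul1992').hexdigest()

# With qualifiers the two hash-codes differ, and storing each record's
# match-key values as an unordered set removes the column ordering an
# attacker could otherwise exploit:
r1_keys = {prefixed_hash('F+B', 'paul1992')}
r2_keys = {prefixed_hash('L+B', 'paul1992')}
```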
For example, if we use 'F' for first name and 'L' for last name, we obtain match-key values such as "F-Thomas1992" and "L-Thomas1992". This approach is different from the record specific salting values used in the context of Bloom filter encoding to avoid frequency attacks. Note that adding an attribute specific string does not change the frequency distribution of a match-key; it is recommended solely to lower the false positive rate.
Recommendation 2: As discussed in the Methods section, our proposed attack compares the correlations between the frequency distributions of encoded match-keys and plain-text match-keys. For this to be successful, the encoded match-keys need to have distinct frequencies for different match-key values. If the frequency distributions are close to uniform, an adversary will not be able to compare those distributions with plain-text frequency distributions.
As our second recommendation, we propose making the frequency distributions of encoded match-keys close or equal to uniform. In the encoded database H, we set the match-key values that have a frequency larger than x, where x ≥ 1, as missing. This ensures that the frequencies of all remaining match-key values (hash-codes) for each encoded match-key lie between 1 and x, making the distribution close to uniform (and, for x = 1, all remaining values unique). Hence, the adversary can no longer conduct any meaningful correlation analysis on these distributions since they are close to uniform and cannot be distinguished from each other. As with our first recommendation, this method trades improved privacy for reduced linkage quality. Setting match-key values as missing will lower the overall recall of the linkage since potential matches could be missed. At the same time, the overall precision of the linkage will improve since potential false positives are removed.
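The suppression step can be sketched as follows (a minimal illustration; representing a match-key column as a list of hash-codes with None for missing values is our own assumption):

```python
from collections import Counter

def suppress_frequent(match_key_column, x=1):
    """Set match-key values occurring more than x times to missing, so that
    all remaining frequencies lie between 1 and x. This flattens the
    frequency distribution an attacker would need for correlation analysis."""
    counts = Counter(match_key_column)
    return [v if counts[v] <= x else None for v in match_key_column]

column = ['h1', 'h1', 'h1', 'h2', 'h3']
print(suppress_frequent(column, x=1))  # [None, None, None, 'h2', 'h3']
```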
We conducted multiple experiments with x = {50, 20, 10, 5, 2, 1}, where we set the match-key values with frequency > x as missing. As expected, the precision of the linkage increased, with a minimum and a maximum increase of +0.69% and +10.4%, respectively. The recall dropped because true matches are missed when their match-key values are removed, with a minimum and a maximum decrease of -0.04% and -1.29%, respectively. The small changes in both precision and recall are because the number of records with match-key value frequencies > 1 is small compared to the 7 million records in the databases. However, if the databases contain more errors or missing values in attributes, this will potentially lead to the selection of encoded match-keys, H*j, with a small number of attributes in order to improve the recall of the linkage. As we discuss further below, selecting encoded match-keys with a smaller number of attribute values (such as only 2 or 3 attributes in a H*j) will increase the frequencies of those encoded match-keys. In such a case, using this recommendation to improve privacy will also lead to a potentially substantial reduction in recall.
It is possible to apply one or both of the above recommendations to increase the privacy of the sensitive values in the encoded database. However, we recommend initially applying the first recommendation and then checking whether any distinct frequency distributions can still be obtained and aligned between the plain-text and encoded databases. Depending on the results of this first modification, the second recommendation can then be applied if further improvement of privacy is required. Furthermore, it is worth noting that if the second recommendation is applied first, there is no practical advantage in applying the first recommendation afterwards, because the attack is no longer possible once the second recommendation has been applied.
Practical considerations of the multiple match-key encoding
While the original multiple dynamic match-key encoding proposed by Randall et al. [5] provides high linkage quality with acceptable privacy, there are a few practical aspects that need to be taken into consideration when using this encoding for PPRL in real-world applications.
1. The selection of match-keys depends on the Fellegi and Sunter agreement and disagreement weights of the individual attributes. In [5] the authors conducted their experiments on selected match-keys that provided the highest F-measure values for a given database. However, in practice these agreement and disagreement weights need to be estimated based on partially available ground truth or domain knowledge. Hence, with estimated values, there is no guarantee that the F-measure, and thus the linkage quality of the final record linkage, will be optimal.
2. The multiple match-key encoding approach depends on exact matching of hash-codes. As also discussed by the authors in [5], this can be a disadvantage if the databases to be matched contain errors or variations in attribute values. Even a small typographical error or a single character change (Christine versus Christina) in an attribute value will lead to a completely different hash-code, whereas with approximate matching techniques such as Bloom filter encoding [15] this might not lead to false negatives. Furthermore, in our experiments we observed that selected match-keys always contain a name attribute (FirstName, MiddleName, or LastName), as can be seen in Tables 4 and 5. This is because name attributes generally have more discriminatory power compared to the other attributes used. Therefore, if the name attributes in the databases to be linked contain more errors or changes in their values, this could lead to lower linkage quality even if the other attributes contain few errors.