Secure distributed genome analysis for GWAS and sequence comparison computation

Background The rapid increase in the availability and volume of genomic data makes significant advances in biomedical research possible, but sharing of genomic data poses challenges due to the highly sensitive nature of such data. To address the challenges, a competition for secure distributed processing of genomic data was organized by the iDASH research center. Methods In this work we propose techniques for securing computation with real-life genomic data for minor allele frequency and chi-squared statistics computation, as well as distance computation between two genomic sequences, as specified by the iDASH competition tasks. We put forward novel optimizations, including a generalization of a version of mergesort, which might be of independent interest. Results We provide implementation results of our techniques based on secret sharing that demonstrate practicality of the suggested protocols and also report on performance improvements due to our optimization techniques. Conclusions This work describes our techniques, findings, and experimental results developed and obtained as part of iDASH 2015 research competition to secure real-life genomic computations and shows feasibility of securely computing with genomic data in practice.


Introduction
The iDASH (Integrating Data for Analysis, Anonymization and SHaring) research center at the University of California, San Diego hosts an annual competition, which in 2015 was dedicated to secure genome analysis. The two challenges corresponded to secure non-interactive analysis of genomic data based on homomorphic encryption and secure interactive analysis using secure multiparty computation techniques. We focus on the second challenge and report our design and implementation of the competitions tasks, which consisted of distributed GWAS (Genome-Wide Association Study) computation and secure sequence comparisons in the form of the Hamming distance or edit distance. Here, GWAS is a study of common genetic variants in different individuals using case and control groups to determine if any variant is associated with a specific trait or genetic conditions. We utilize secure multi-party computation (SMC) techniques based on secret sharing with lightweight computational footprints. This requires that all computation carried out jointly by the parties (i.e., computation that cannot be performed locally by data owners) is dataoblivious, which means that all instructions and accessed memory locations must be independent of the data. While this does not pose a challenge for some simpler computational tasks, meeting this objective often involves using non-trivial techniques for more complex functionalities. In particular, computing both Hamming and edit distances of two genome sequences involves a form of aligning the input sequences which is not straightforward to achieve in secure setting. A logical tool to resort to is to utilize secure set intersection for computing chromosome positions that appear in both input sequences, for which both two-party and multi-party implementations are known. Due to the specifics of our setting, we rely on the ideas from [1] for computing the set of positions common to both sequences, which in turn utilize oblivious sorting. The fastest oblivious sorting mechanism available to us at the time of competition preparation was sorting based on Batcher's mergesort [2], which works only on input sets, the size of which is a power of 2. This posed a problem because padding an input set of a large size to have the size equal to a power of 2 often can result in a significant performance slowdown which we wanted to overcome. Thus, the most challenging component of distance computation was generalizing the mergesort (more precisely, the merging step as described later) component of the computation to work with inputs of arbitrary sizes, which might be of independent interest. This and other optimizations and design decisions constitute the main contribution of this work. We also report on performance of our algorithms on real genome data.

Tasks of the challenge
The challenge for secure multi-party computation based genomic data analysis had two tasks: 1 The first task was to develop secure distributed protocols for GWAS computations. The input consists of the genotypes of two groups of individuals (one case and one control group) over a number of Single Nucleotide Polymorphisms (SNP) with each of them being a DNA sequence variation, where a single nucleotide (A, T, C, or G) in the genome differs between individuals. The input is horizontally partitioned among two sites (e.g., two institutions, medical facilities, etc.), where each site cannot reveal its input to other parties. The task consists of securely computing minor allele frequencies (MAF) and chisquared statistics for each of the SNPs in the case and control groups distributed across the two input parties. We provide the details of the computation below. 2 The second task was to develop secure distributed protocols for genomic sequence comparisons. The input consists of two genomic datasets, one from each individual, which are organized as the genotypes over many SNPs across the whole human genome. Each genomic dataset belongs to a different entity, and the data owner cannot reveal any information about its data to other parties. The task consists of securely computing either the Hamming distance or edit distance between the two genomic datasets, and we concentrate on Hamming distance computation. The computation involved in the Hamming distance computation of two genomic datasets differs from the traditional formulation of the Hamming distance and we describe the computation used in determining the distance later in this section.
Before we proceed with a detailed description of the tasks, we would like to note that the specification of the tasks, including all information about the participants' datasets that should be treated as public (such as the number records in one's dataset), was provided by the competition organizers. The goal was thus to provide a secure evaluation of the specified functionality using at least the semi-honest security model (see below for detail), and the extent of the information about the other participant's data that can be deduced from the output is beyond the scope of this work.
We next describe the computation involved in the first task in more detail. The input comes as a list of SNPs, where for each SNP a number of genotypes corresponding to the individuals from the case and control groups are given. Let P denote the number of SNPs in the input and N c (N t ) denote the number of individuals in the case (resp., control) group, whose genomic data is provided for each SNP. For each SNP, a genotype corresponding to an individual consists of two nucleotides with three possible variations denoted by AA, AB, and BB, where A and B each represent a character from the set {A, T, C, G} and are alleles in our context.
For the purpose of MAF computation, there is no distinction between case and control groups, and all individuals in both groups are treated in the same way. We denote the total number of individuals by N = N c + N t . To determine MAF for a given SNP, one first needs to count the number of occurrences of alleles A and B as n A = 2n AA + n AB and n B = 2n BB + n AB , respectively, where n AA , n AB , and n BB denote the number of individuals with genotypes AA, AB, and BB for the given SNP. For an allele A or B, we compute its frequency as n A /2N or n B /2N , respectively, where 2 is the length of each genotype. To simplify notation, we also let N ' = 2N (and N c = 2N c , N t = 2N t ). The smaller frequency corresponds to the minor allele and constitutes the output of MAF computation. We obtain the following: Definition 1 Minor allele frequency (MAF) refers to the frequency at which the least common allele occurs in a given population and is computed as We can simplify the computation by directly obtaining n A and n B after counting the number of times each of the two nucleotide values appears in the provided genotypes. In the case when the individuals are partitioned into two (case and control) groups, we will have n A = n cA + n tA and n B = n cB + n tB , where n cA and n tA represent the value of n A in the case and control groups, respectively, and similarly n cB and n tB represent the value of n B in the case and control groups. Furthermore, in our case the data are partitioned among two different entities and thus each of n A and n B need to be computed as the sum of the corresponding values at the respective sites. If we let superscripts (1) and (2) represent the values present in the genotypes of the individuals at sites 1 and 2, respectively, we now obtain and n B = n (1) cB + n (1) tB + n (2) cB + n (2) tB . Also, now N corresponds to the total number of individuals in the data at both sites and in both case and control groups. Using notation similar to the above, we let The chi-squared test is also performed for each SNP independently, but now the data of the individuals in the case and control groups play different roles.
Definition 2 Chi-squared (c 2 ) test is a statistical test for comparing observed data with those expected according to a specific hypothesis and is represented as for some setting-dependent m.
In our case (for a single dataset), m = 4 and the observed values obs i 's correspond to the observed allele counts for a SNP, namely, n cA , n cB , n tA , n tB . The corresponding expected allele counts expi's are (n cA + n tA )N c /N , (n cA + n tA )N c /N , (n cA + n tA )N c /N , (n cB + n tB )N t /N . The resulting computation can be simplified to become χ 2 = (n cA n tB − n cB n tA ) 2 N N c N t (n cA + n tA )(n cB + n tB ) When the dataset is horizontally partitioned among two sites, the counts n cA , n tA , n cB , n tB , N c , N t , N become the sum of their respective values at both sites.
We can now proceed with the description of the computation involved in the second task, namely, the Hamming distance of two genomic datasets. In the traditional formulation of the Hamming distance, on input of two sequences of equal length, the distance is defined as the number of positions at which the corresponding symbols in the input sequences are different. This is not directly applicable to genomic sequences because they are not represented as perfectly aligned strings of the same length and thus the computation is more complex. Before we proceed with the details of the computation, we need to specify how the input (i.e., genomic datasets) are represented.
Genomic sequences are represented in the Variant Call Format (VCF), where each genomic sequence is a set of records. In each record, chromosome CHROM represents an identifier from the reference genome and position POS represents the reference position within the reference sequence CHROM. In other words, the pair 〈CHROM, POS〉 represents the location of the data associated with this record in the genome. The fields REF and ALT represent the reference and alternate bases, respectively, expressed as a sequence of one or more nucleotides. The field SVTYPE represents the type of the record, which is one of SUB, SNP, DEL, or INS. Only records of type SUB and SNP are used in the computation of the Hamming distance. In records of type SNP, both REF and ALT fields are one character long, while in records of type SUB, both fields can be longer. We also found that in two different inputs records at the same location 〈CHROM, POS〉 may be represented using different types (SUB and SNP). end if 19: end if 20: end for 21: return dist To compute the Hamming distance between two genomic datasets, we initially set the distance to 0. Then for all records in the datasets with type SUB or SNP, if a location 〈CHROM, POS〉 is found only in one of the datasets (and is absent in the other), the Hamming distance is incremented by 1. Also, if the location is found in both datasets and the corresponding values of the REF fields are the same, but the values of the ALT fields are different, the Hamming distance is also incremented by 1. A more detailed specification of how this procedure may be implemented is given in Algorithm 1. The algorithm uses a map to store all records of type SUB and SNP from the first dataset and (optimistically) increments the distance by 1 for each record placed in the map (lines 3-7). Then for each record of type SUB or SNP from the second dataset, if there was no record with the same location 〈CHROM, POS〉 in the first dataset, the distance is incremented by 1 (lines [10][11][12]. If, however, the location is present in both datasets, the distance is first decremented by 1 (line 14). The algorithm then compares the fields REF and ALT of the records from the two datasets with the same location. If the former are equal and the latter differ, the distance is incremented by 1 (lines [15][16][17].
Additional information about the tasks, including examples that illustrate the computation, can be found on the competition web site [3] as well as in an article [4] being prepared by the competition organizers.

Secure multi-party computation background
Secure multi-party computation allows two or more participants to jointly evaluate a function on their private inputs without revealing any information about the private data other than the output of the agreed-upon function. There are two standard security models used for secure function evaluation on private data that differ with respect to the types of adversaries they can tolerate. The first security model known as semi-honest (or honest-but-curious or passive) requires that all computation participants follow the computation as prescribed, but might save any information observed throughout the computation and compute with it with the goal to discover additional information about private input values. A protocol is said to be secure if no coalition of semihonest participants (adversaries) can learn any additional information about private inputs of other parties other than what they can already compute from their legitimate output. It also follows from the security properties that any outside party is unable to learn any information about the participants' data and protocol output corresponds to evaluating the correct function on the provided data. Security in presence of semi-honest participants was a minimum security requirement for this competition.
The second, stronger, security model permits malicious (also known as active) participants who can arbitrarily deviate from the prescribed computation (and coordinate their actions). Security in this model holds if the same data protection and output correctness properties are achieved as before. Known techniques for achieving security in this model typically involve substantially larger overhead than in the semi-honest model, and we do not use it in our implementation.
There are a variety of available techniques on which secure multi-party computation protocols can be based. For the purposes of this competition, we utilize an (n, t)-threshold linear secret sharing scheme for representation of and secure computation over private values. With such a scheme, each private value is split into n secret shares (using n computational parties each of whom receives a share), such that combining t or fewer shares information-theoretically reveals no information about the private value, but combining t + 1 or more shares allows the value to be reconstructed exactly. All computation proceeds on secret shares, which means that before the computation commences each participant distributes her private data among n computational parties and at the end of the computation reconstructs the result from the shares obtained from at least t+1 computation participants. We utilize Shamir's secret sharing scheme [5] and a typical way of conducting computation using this scheme requires that t <n/2. Thus we use n = 3 computational parties and set t = 1 (i.e., the parties are assumed not to collude).
In both tasks of the challenge, there are two parties who contribute their input. They will play the role of computational parties together with another party who contributes no input. As mentioned above, each input provider produces secret shares of her data and distributes them among the participants and reconstructs the output at the end of the computation. We assume that the three computational parties are connected by pairwise secure authenticated channels (that provide secrecy and authenticity) with each other, which can be achieved using standard means.
The underlying secret sharing scheme [5] requires that shares are represented as elements of a field, which means that the input needs to be provided in the form of integer values. With a linear secret sharing scheme, a linear combination of secret-shared values can be performed by each computational party locally, without any interaction. Multiplication of two secret-shared values, on the other hand, requires communication between all of them and is treated as an elementary building block of secure protocols (we assume the multiplication protocol from [6]). These operations are typically used as the main building blocks in more complex computations, e. g., comparisons and divisions.
We utilize a number of efficient protocols for integer computation that have previously been shown secure in the standard security model. It is also known that by invoking the composition theorem [7], secure building blocks can be combined together to achieve security of the overall computation. The building blocks that will be used in the computation of the two tasks are listed next (as mentioned before, all correspond to integer computation). When performance of a building block depends on the size of the arguments provided into the function, the size is listed as a separate argument.
is a multiplication protocol that on input two secret-shared values x and y outputs a secret-shared product z = xy.
is an equality protocol that on input two secret-shared values x and y of bitlength at most ℓ outputs a bit b which is set to 1 iff x = y.
is a comparison protocol that on input two secret-shared values x and y of bitlength at most ℓ outputs a bit b which is set to 1 iff x <y.
is a division protocol that on input two secret-shared values x and y of bitlength at most f outputs a secret-shared quotient z that satisfies z = ⌊ x/y⌋.
As shown above, each protocol takes shares of its input and produces shares of the output. It means that these protocols can be naturally and securely invoked as part of larger computation and we use them as steps in larger computation. In our implementation, we use Mult from [6], EQ and LT from [8], and Div from [9], and we refer the reader for the details of these protocols to the respective publications.
Performance of secure computation protocols is of a paramount importance for their practical use. In the case of techniques based on secret sharing, the computation is normally lightweight and thus performance is measured in terms of two parameters: (i) the number of interactive operations (e.g., multiplications) necessary to perform the computation and (ii) the number of sequential interactions, i.e., rounds. Our goal is to minimize both of these parameters for the computation performed for each task.
Before we conclude this section, we would like to say that other options for securely evaluating the functions of the competition tasks are possible. In particular, the garbled circuit evaluation approach [10] allows any function to be securely evaluated in the two-party setting. Similarly, any function can be evaluated using homomorphic encryption, or special-purpose building blocks such as private set intersection (e.g., [11]) can be used as the basis for building a custom solution for a task of the competition. Furthermore, secure computation compilers such as Fairplay [12], Sharemind [13], PICCO [14], etc. are able to produce secure implementations given function specification in a form of a program. This competition, however, allowed for custom solutions that can tune general building blocks to the needs of the tasks and result in improved performance. Because no secure implementations of the competition tasks were available to us prior to the competition, we are unable to directly compare performance of different approaches in this paper.

Secure distributed GWAS computation
In this section we describe our approach to securely computing the task of distributed GWAS computation, namely, computing minor allele frequencies and chisquared statistics.
According to the task specification, the size of the input at each site, i.e., the number of SNPs and the number of individuals in the case and control groups, are treated as public and are not protected. This means that parameters P, N (1) are known to all computation participants. All remaining data (i.e., the genotypes themselves) are private.
In what follows, we first describe a basic version of our solution and then provide optimization techniques that improve the runtime of program execution.

Basic solution
For each SNP in the input, the computation is identical (and independent of other SNPs) and thus it suffices to describe the computation for a single SNP.
We divide the overall computation into three phases: input preparation, computation execution, and output reconstruction, which proceed as follows. Observe that each input site i can locally compute n This is what is done as part of input preparation, after which each input site secret shares each of its computed values and distributes the shares among all three computational parties. We use notation [a] to denote that the value of a is secret-shared among the computational parties (i.e., each party holds a different share of a).
During computation execution, the computation proceeds on the shares to compute MAF and chi-squared statistics using equations 1 and 2 and secure building blocks from the previous section. We choose to perform only the private portion of the computation on secret shares, while postponing the computation with public constants to the output reconstruction phase. This is done for performance reasons to reduce the size of values used in the computation.
To calculate the MAF for each SNP in parallel, the computation follows equation 1 with provisions to make the computation data-oblivious. That is, each computational party performs the following steps: In this section we describe our approach to securely 1 [n A ] = [n (1) cA ] + [n (1) tA ] + [n (2) cA ] + [n (2) tA ]; 2 [n B ] = [n (1) cB ] + [n (1) tB ] + [n (2) cB ] + [n (2) The first two steps that aggregate the input values are local to each computational party, but steps 3 and 4 that produce the minimum of n A and n B involve joint computation by all of them. We subsequently discuss the choice of the parameter ℓ 1 .
To compute the chi-squared statistics for each SNP in parallel, we similarly follow the computation in equation 2 using the following steps: 1 [n cA ] = [n (1) cA ] + [n (2) cA ]; 2 [n tA ] = [n (1) tA ] + [n (2) tA ]; 3 [n cB ] = [n (1) cB ] + [n (2) cB ]; 4 [n tB ] = [n (1) tB ] + [n (2) N  , N c , and N t is omitted). The numerator is then scaled up by a factor of k to ensure that using integer division will provide sufficient precision of the result. The bitlength of k will be on the order of the precision of the answer in bits. We defer discussion of the choice of ℓ 2 to the next section.
At the end of the computation, all computational parties send their shares of the result res 1 and res 2 for each SNP to one of the input sites who reconstruct the values. The output party then sets the result of MAF computation to res 1 /N ' and the result of the chi-squared computation to (res 2 · N )/(kN c N t ) .

Optimizations
We applied several optimizations to the computation to improve its runtime.
1 The nature of the computation in this task allows all interactive operations to run in parallel in a single batch for all SNPs. That is, all P comparisons corresponding to line 3 of MAF computation are executed simultaneously. The same applies to line 4 of MAF computation and lines 5-9 of chisquared computation. We can further reduce the number of rounds in chisquared computation by running interactive independent operations at the same time. In particular, this means that lines 5-7 of the computation can be executed in a single round. 2 We modify chi-squared computation to use floating point instead of integer division after converting both operands d and c to floating point representation. This is primarily driven by the fact that performance of division we rely on (described in [15,9]) depends on the maximum of the bitlengths of its arguments and we can use substantially shorter values with floating point division compared to integer division (i.e., the bitlength can be comparable to that of k instead of the sum of the bitlengths of d and k). The savings noticeably outweigh the cost of integer-to-floating point conversion, or normalization (to use floating point division we need to normalize two values, while integer division needs to compute normalization of one of its arguments). We additionally slightly optimize integer to floating point conversion and floating point division compared to those given in [9] using information known about d and c (e.g., the fact that they are positive). 3 For performance reasons, we want to set parameters ℓ 1 and ℓ 2 (as well as the bitlength of secret shared values) to their minimum values that guarantee correctness. When the bitlength of the arguments to both comparison and division differ, the larger value is to be used. In particular, for ℓ 1 , the largest value of n A or n B in the LT protocol appears when only one nucleotide is present in all genotypes in both case and control groups (i.e., max(n A , n B ) = N ' and min(n A , n B ) = 0), and we set ℓ 1 = ⌈log N ' ⌉ (where the extra 1 is due to the specifics of the LT operation). For ℓ 2 , the largest value of d or c appears when n cB = n tA = 0, which leads to n cA = N c , n tB = N t , and d = (N c ) 2 (N t ) 2 , and we set n tB = N t , and d = (N c ) 2 (N t ) 2 , and we set 2 = 2(log N c + log N t ) . For integer division, this value of ℓ 2 needs to be additionally incremented by the bitlength of precision k, but fortunately after we switch to floating point representation, we can reduce the bitlength to the desired precision of the result because the values are represented in a normalized form.

Secure distributed genomic Hamming distance computation
We next concentrate on the second task of securely computing the Hamming distance between a pair of genomic datasets in a distributed setting.
According to the task specification, the number of records in each of the two datasets are known to all parties and we denote them as N 1 and N 2 , respectively. The content of the records, however, is private (in particular, the values that fields CHROM, POS, REF, ALT, and SVTYPE take). Because only records with SVTYPE equal to SUB and SNP are relevant for the computation, for ease of notation we refer to them as SUB and SNP records, respectively.
The high-level idea behind our solution is as follows: we first let each input site extract SUB and SNP records from its dataset and pad the resulting set with dummy records to hide its size. After each input site secret shares its records across all computational parties, the parties then run a set operation to identify all records that appear in both dataset (conceptually similar to set intersection) using 〈CHROM, POS〉 as the key as well as all records that appear only in one dataset (conceptually similar to symmetric difference). We accomplish this by obliviously sorting all records from both datasets using Batcher's mergesort [2] and scanning the sorted set examining every two adjacent elements in it to determine if the Hamming distance needs to be incremented by one for that pair.
At the time of competition preparation, Batcher's mergesort was available to us as one of the best options for oblivious sorting (based on the overall amount work as well as its round complexity). It is particularly well suited to this task because it is a recursive algorithm that works by first sorting the first and the second half of its input set and then merging the sorted halves. In our setup this means that the input datasets can be pre-sorted by each input site locally and only the merge step needs to be run jointly. Unfortunately, Batcher's mergesort (including the merge step) has the drawback that the number of elements in the input set has to be a power of 2, which may unnecessarily increase the runtime.
In what follows, we start by describing in detail a basic solution in the first subsection and then discuss two optimizations in the two consecutive subsections.

Basic solution
As before, we divide the overall computation into three phases: input preparation, joint computation execution, and output reconstruction.
Input preparation. Each input site i extracts all SUB and SNP records from its dataset and pads them with dummy records to size N i + 1 (we require at least one dummy record). (If the combined fraction of SUB and SNP records is guaranteed to be within a certain fraction a < 1 of the total size for typical genomic datasets, then the datasets can be padded to aN i + 1 records. For this competition, a could not be lower than 1.) Furthermore, to meet Batcher's mergesort requirements, the input parties additionally pad the sets with dummy records so that the combined size of the two datasets is 2 q , where q = ⌈log2(N1 + N2 + 2)⌉. We use this newly formed dataset as the input into the computation and refer to it as a "dataset".
Next, the values in each record need to be converted to integers, which we accomplish as follows: 1 The location 〈CHROM, POS〉 is represented as V 1 = CHROM · L + POS, where L is the maximum length of any existing human chromosome. CHROM ranges from 1 to 24 (22 autosomes, plus × and Y), and for dummy records we set V1 = 25L + 1 to avoid overlap with real records. 2 REF and ALT fields are represented as strings of nucleotides in the input. To produce their numeric counterparts, we map each nucleotide value to a two-bit integer (e.g., A = 0, C = 1, G = 2, and T = 3) and concatenate two-bit integers from a string to form a single number. To hide information about the size of the fields, the values need to be represented using the same bitlength for all records based on the maximum string length M. Because shorter strings need to be padded to the maximum size, we need to ensure that strings of different sizes will always be different (i.e., the padding character cannot be one of 0-3). Instead of introducing a separate padding character, which increases the bitlength of one character from 2 to 3 bits, we append the string length in bits at the end of the string and use 0 for padding. Thus, all strings are represented using 2M +log M bits. Let After computing a 3-tuple (V 1 , V 2 , V 3 ) for each record in its dataset, an input site i sorts the records by the V 1 field to form set S i , generates shares of all records in S i , and distributes them to the computational parties (we slightly abuse notation and use [S i ] to denote shares of all values in S i ). It also distributes shares of the number of dummy records d i in S i .
Computation execution. After receiving two sorted sets of ([V 1 ], [V 2 ], [V 3 ]) triples from both input sites, the computational parties run oblivious merge using [V1] as the key. The algorithm is built using an input-independent sequence of compare-and-exchange operations. Each operation takes two integers and either swaps them or leaves them unchanged so that the first output (min) is always smaller than the second (max). In our framework, it is implemented as follows: The computational parties next compute the Hamming distance as specified in Algorithm 2. Sets S 1 and S 2 represent sorted input triples of the input parties and parameters ℓ 1 and ℓ 2 correspond to the bitlengths of fields V 1 and V 2 (or V 3 ), as discussed previously.
Because a specific location V 1 appears only once in each of the input datasets (except for dummy records), there will be at most two records with the same V 1 in the combined set. The algorithm works by looking at each pair of two consecutive elements in the combined sorted set and adds 1 to dist if this is the first time the location appears on the list (i.e., a i = 0 on line 4). The distance is incremented automatically for the first record (dist = 1 on line 2). Then, if a location appears for the second time (a i = 1 on line 4), the algorithm examines the values of V 2 and V 3 fields (lines 5-6) to determine whether the condition for incrementing the distance is satisfied (i.e., b i = 1 and c i = 0). If not (b i = 0 or c i = 1), dist is decremented by 1 to compensate for the fact that it was increased during previous loop iteration. All dummy records collectively contribute distance −d 1 −d 2 +2 (i.e., 0 for the first two records and −1 for each additional record) and this is why we adjust the computed distance at the end (line 9). We note that all loop iterations and all comparisons within a loop iteration can be carried out in parallel.

Separating SUB and SNP records
As the first significant optimization, we separate computation of the distance for SNP and SUB records and consequently reconstruct the overall distance from the two values. The main reason for this is to reduce the time comparisons of V 2 and V 3 take. Recall that SNP records contain a single character in REF and ALT fields, while SUB records can contain longer strings. In the genomic datasets we worked with, a great majority of all records were SNP records that can be processed using 2-bit comparisons for V 2 and V 3 (i.e., ℓ 2 = 2). In the basic solution, however, the bitlength had to be unnecessarily increased by two orders of magnitude for most records to meet privacy requirements. Thus, the idea consists of extracting two sets from each input dataset: one consisting of SNP records and another consisting of SUB records. Then the distance for SUB records is computed separately from the distance for SNP records and the sum is returned as the result. This strategy works well if all records with the same 〈CHROM, POS〉 pair are always marked with the same type across all datasets. It is, however, possible for two datasets to contain SUB and SNP records corresponding to the same location. Because of the existence of such records, the Hamming distance will not be computed correctly if we simply add the two distances together. That is, if one record appears in the SUB set and another with the same location appears in the SNP set, they collectively will contribute 2 to the overall distance instead of correct 0 or 1 (depending on other attributes).
To address this, we need to find all such pairs and compensate for the difference they introduced, which is the most subtle part of our solution. We next provide more detail about the solution and highlight the differences from the basic scheme.
Input preparation. Given a dataset, an input entity produces two subsets: one composed of SUB records and one composed of both SUB and SNP records from the dataset. As before, both sets need to be padded with dummy records to hide their number and make the size to be a power of 2 to the combined size of 2 qs and 2 q , where q s = ⌈log(a s (N 1 + N 2 ) + 2)⌉ and q = ⌈log(a(N 1 + N 2 ) + 2⌉ and a s (a) denotes the maximum fraction of SUB (resp., SNP and SUB) records in genomic datasets (we were given a = 1 and a s = 0.3). All records in the SUB set are converted to (V 1 , V 2 , V 3 ) triples as before. For the SNP&SUB set, one-character REF and ALT fields in SNP or SUB records are represented using integers 0-3, while these fields of longer length in SUB records are represented using integer 4 (i.e., V 2 and V 3 fields are 3 bits long). This will guarantee that comparison of a one-character long REF or ALT field in a SNP record with a longer REF or ALT field in a SUB record will result in their inequality. We also add another binary attribute V 4 to each record of the SNP&SUB set that indicates whether the record is of SUB type (V 4 = 0) or SNP type (V 4 = 1). We set V 4 = 0 in dummy records.
Each input entity now produces shares of (V 1 , V 2 , V 3 ) in its SUB set and (V 1 , V 2 , V 3 , V 4 ) in its SNP&SUB set (together with the number of dummy records in each set) and distributes them to the computational parties. We note that computation with SNP&SUB sets can be performed on shorter bitlengths, which results in faster arithmetic, and thus we setup two different instances of the secret sharing scheme and process SUB sets separately from SNP&SUB sets.
Computation execution. To compute the Hamming distance correctly, we now distinguish between different cases: (i) SUB records that don't have a SNP record with identical location in the other dataset, (ii) SNP records that don't have a SUB record with identical location in the other dataset, and (iii) records that have another record with identical location but different type present in the other dataset. Let N0 denote the number of records in the third category.
10: end for 11: return dist The computational parties execute Algorithm 2 on two SUB datasets. This computes the distance corresponding to the records in category 1, but also introduces offset N 0 . The parties then execute Algorithm 3 on two SNP&SUB sets that computes the distance corresponding to categories 2 and 3 and additionally compensates for the offset. The output will then be the sum of the distances computed by both algorithms.
In Algorithm 3, when examining each pair of consecutive records, we only consider those that contain at least one SNP record within the pair (d i = 1 on line 9). Furthermore, similar Algorithm 2, when observing a location for the first time, we add 1 to the Hamming distance, but only if it is a SNP record ( V Note that dummy records do not introduce any error in Algorithm 3. That is, d i = 0 and e i = 0 when both records i and i − 1 are dummy and the expression on line 9 evaluates to 0. Similarly, when record i−1 is real while record i is fake that expression also evaluates to 0 because a i = 0 and V (i) After computing the distances corresponding to SUB and SNP&SUB sets, the parties need to convert shares of one of them into shares of the same value in the secret sharing setup used by the other algorithm. Then the distances can be locally added to compute the overall result. Output reconstruction is performed as before by exchanging the shares and recovering the result.
The performance gain achieved by this optimization highly depends on the values of public parameters a s , a, and M. While the gain stems from using shorter values for V 2 and V 3 with SNP&SUB sets, the total number of records processed using this solution (2 q + 2 q s ) is greater than in the basic scheme (2 q ). Therefore, this optimization is recommended with relatively small a s and large M. In our experiments with a s = 0.3, a = 1, and M = 100, we observed approximately 30% performance improvement compared to the basic scheme.

Reducing set size
Our second optimization is with respect to oblivious sort and removing the requirement that the input size has to be a power of 2 for the merge step of Batcher's mergesort. To explain how our optimization works, we need to provide additional details about Batcher's mergesort algorithm.
Recall that the merge step takes two sorted sets L 1 = (a 1 , a 2 , ..., a m ) and L 2 = (b 1 , b 2 ,..., b n ), where m + n is a power of 2. It first combines them into a single sequence that first monotonically increases and then decreases as L = (a 1 , a 2 , ..., a m , b n , ..., b 2 , b 1 ), after which a sequence of compare-and-exchange operation is executed as specified by the following pseudo-code: for (r = (m + n)/2; r > 0; r = r/2) for (j = 0; j < m + n; j = j + 2r) for (k = j; k < j + r; k = k + 1)

compare -and -exchange(L[k], L[k + r])
After executing the first iteration of the outer loop, the first (second) half of L will contain (m + n)/2 smallest (resp., largest) elements of L although they are not necessarily sorted. After its second iteration, the ith quarter of L will contain the ith quarter of elements in the final sorted list for i = 1, ..., 4. This process continues until each sublist contains one element and L becomes sorted. Notice that the algorithm uses log(m + n) iterations of the outer loop, and in each iteration every element in the list is used in a compare-and-exchange operation, which is the reason for requiring the size of the list to be a power of 2.
We next proceed with describing our strategy for generalizing the merge operation to work with inputs of arbitrary sizes, which might be of independent interest. There will be no need to pad the input in the beginning to make the overall input size to be a power of 2, but dummy records are now added throughout algorithm execution as needed. This means that earlier loop iterations use a smaller number of elements and are therefore faster than in the original algorithm. In particular, at each loop iteration, if the size of a sublist is odd, we append a copy of its last element to the end. This will ensure that comparisons can be performed at the current level while still preserving the necessary properties of the (partially) sorted list. For example, suppose we want to merge (3,6,8) and (5,7). Before the first iteration 5 will be added to the list (3,6,8,7,5) because the number of elements in it is odd, and we obtain (3,5,5,7,6,8) at the end of that iteration. At the time of second iteration, the size of sublists (3,5,5) and (7,6,8) is also odd and they are modified to be (3, 5, 5, 5) and (7,6,8,8). In the next iteration no additional padding is used and we obtain (3,5,5,5,6,7,8,8) at the end of the algorithm.
In the context of Hamming distance computation, we

Results
In this section, we provide experimental results of securely computing GWAS statistics and the Hamming distance in a distributed setting. We ran experiments in LAN and WAN settings with three computational parties connected by pair-wise secure authenticated channels with each other. The LAN experiments were conducted using 2.4 GHz 6-core Red Hat Linux machines connected through 1 Gb/s Ethernet with pairwise round-trip times 0.3 msec.
Our WAN experiments used two machines from the LAN setting and employed another 2.1 GHz 8-core machine from the GENI infrastructure [17] at a different geographic location with Red Hat Linux. The pairwise roundtrip times between these machines were 0.3 msec, 9.2 msec, and 9.2 msec. Each GWAS experiment was run 20 times and each Hamming distance experiment was run 5 times and the median over all runs is reported. For GWAS computation, case and control groups consisted of genotypes of 200 individuals each (100 individuals at each input site in each group). We measured the runtime of the MAF and chi-squared computation by varying the number of SNPs in the input. The results are given in Table 1. Modulus size (Mod) corresponds to the bitlength of secret shared values.
Our implementation incorporates all optimizations and uses parameters ℓ 1 = 11 and ℓ 2 = 21 computed as described in the optimizations subsection of the secure GWAS computation section (ℓ 2 = 35 + |k| would be required for integer division, but a lower parameter is requested precision). We can see from the table that sufficient with floating point operation to obtain the the execution time is linear in the number of SNPs in both settings, and the overhead in WAN is almost three times as large as that in LAN, which is primarily due to larger communication delays in WAN. Another observation not present in the table is that division performed in chi-squared computation contributes almost the entire runtime (close to 99%) and thus any optimizations applied to division can lead to direct improvement of chi-squared performance.
For the Hamming distance computation, we conducted four sets of experiments in the LAN setting, that correspond to the basic scheme (i) with no optimizations, (ii) with the first optimization, (iii) with the second optimization, and (iv) with both optimizations. By comparing execution times of different schemes, we can see performance gains from different optimizations. In the WAN setting, we only report the timings of the best (last) scheme. For each set of experiments, we varied the number of records in the genomic dataset at each input site. The results are presented in Table 2.
We used two different secret sharing bitlengths for schemes that apply the first optimization (one for the computation with SUB records and another for the computation with SNP&SUB records). The complexity of the merge is O(n log n) for combined sequences of size n and computing the distance itself is linear in n (with larger constants), which the runtimes in Table 2 follow. In the LAN setting, the two optimizations result in performance improvement up to 27.9% and 13.1% on our set of parameters when applied separately, and 40.9% when applied together. Performance gain of the first optimization heavily depends on parameters (a, a s , and M ), while the gain of the second optimization depends on the difference between the combined input size N and 2 ⌈log N⌉ . The the smaller the difference is, the smaller improvement is expected.

Conclusions
In this work we report on our experience with participation in the 2015 iDASH secure genomic computation competition. We show how to securely compute MAF and chi-squared statistics in the context of GWAS computation and the Hamming distance between two genomic datasets and report on their performance results. We develop a number of novel optimizations, some of which may be of independent interest.