Solving the Arizona search problem by imputation

SUMMARY
An “Arizona search” is an evaluation of the numbers of pairs of profiles in a forensic-genetic database that possess partial or complete genotypic matches; such a search assists in establishing the extent to which a set of loci provides unique identifications. In forensic genetics, however, the potential for performing Arizona searches is constrained by the limited availability of actual forensic profiles for research purposes. Here, we use genotype imputation to circumvent this problem. From a database of genomes, we impute genotypes of forensic short-tandem-repeat (STR) loci from neighboring single-nucleotide polymorphisms (SNPs), searching for partial STR matches using the imputed profiles. We compare the distributions of the numbers of partial matches in imputed and actual profiles, finding close agreement. Despite limited potential for performing Arizona searches with actual forensic STR profiles, the questions that such searches seek to answer can be posed with imputation-based Arizona searches in increasingly large SNP databases.


INTRODUCTION
In a common setting in forensic genetics, the genotype of a sample of biological material from an unknown individual is queried against a database of genotypic profiles of known individuals. 1,2 The procedure relies on a standardized set of genetic markers typed both in the profiles in the database and in the sample whose identity is sought. A full genotypic match to a database profile can recover the identity of the source of the sample; a partial genotypic match can be informative as well, suggesting that the unknown individual is a relative of the contributor of the partially matching profile. 3,4 For the procedure to produce accurate identifications, genotypic profiles across the standardized set of genetic markers must be sufficiently variable that with high probability, a match of a full genotypic profile uniquely identifies an individual across the human species, up to monozygous sibships. 5 At the same time, it is desirable for the system to possess the fewest loci necessary for establishing uniqueness. The use of a small number of loci minimizes the intrusion of marker systems on genetic privacy, so that profiles contain as little information as possible about individual genotypes and phenotypes; the use of a small number of loci also minimizes the genotyping cost in systems that process many profiles.
What is the minimal size required for a set of loci to achieve the goal that profiles based on that set are unique? As it is impractical to perform the required empirical evaluation (obtaining the genotypes of all possible individuals for a large set of loci and choosing the optimal subset by analysis of the resulting enormous dataset), the determination must rely in part on a mathematical model of the level of individual identifiability contained in proposed sets of loci. Indeed, widely used marker sets have been designed using model-based calculations that rely on allele frequencies in small datasets. 6,7 In the United States, the set of loci in current use (the “CODIS loci,” abbreviated from the “Combined DNA Index System”) has contained 13 highly variable short-tandem-repeat (STR) loci that were first chosen in the 1990s 8 and that were later augmented with 7 additional loci in 2017. 9 As profiles on the initial CODIS marker set began accumulating in the 1990s, empirical evaluation of the uniqueness of profiles in forensic databases became possible to perform in principle. In such an evaluation, all profiles are compared with all other profiles. The number of pairs of diploid profiles that match at k alleles is tabulated, for each value of k from 1 to twice the number of loci in the marker set.
Such a pairwise analysis of all profiles in a database has come to be known as an “Arizona search,” after one such evaluation, in which a team working with the forensic profile database for the state of Arizona conducted a search of pairs of profiles in the database. 10 The analysis identified partial matches at a level that was unexpectedly high, high enough to raise the concern among some that the 13-locus set then in use might not produce a sufficiently high level of uniqueness for individual profiles. 11,12 The “Arizona search” incident has had a number of lasting consequences. First, it contributed to the clarification of protocols for forensic databases. 12,13 As the purpose of the databases is their operational use for testing query profiles against database profiles, implementation protocols have been clarified so that calculations such as Arizona searches that do not fall into the operational purview generally would not be performed by forensic employees with access to actual profiles. 12 In the United States, discussions of the possibility for other scientists to access such forensic profiles for research purposes 12,14,15 (for example, to conduct “Arizona searches” themselves) have not resulted in such access.

A second consequence was a further understanding of the conceptual meaning of the level of pairwise matching in a forensic query database. The central application of such a database is to assess if some database profile has a match to a profile at hand. The probability that a match exists between two profiles in a database solves a fundamentally different problem, analogous to the probability that two people in a group have a shared birthday rather than the probability that someone in the group has a shared birthday with person X. 11,12,16,17 Nevertheless, the pairwise match probability is informative about the conceptual uniqueness of matches and the fit of probability models to forensic databases. 11,18,19 Finally, recognizing the utility of Arizona searches in understanding the properties of forensic databases, a third consequence is that several studies have sought to provide substitute calculations that mimic a pairwise database search in the absence of access to actual databases. In the model-based Arizona search of Mueller, 18 independence of a set of forensic loci is assumed. Profiles are generated from allele frequency parameters under independence, producing hypothetical databases. The fraction of profile pairs with complete or partial matches is then obtained. Studies such as that of Mueller 18 have generally found that models provide a reasonable description of the number of partial matches in databases.
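The birthday analogy in the preceding paragraph can be made concrete with a short calculation. The sketch below is illustrative only and is not part of the analysis: the function names are hypothetical, a single per-comparison match probability p is assumed, and the pairwise version treats all pairs as independent, which is an approximation.

```python
from math import comb


def prob_match_to_query(n_profiles: int, p: float) -> float:
    """P(at least one of n database profiles matches one fixed query profile)."""
    return 1.0 - (1.0 - p) ** n_profiles


def prob_some_pair_matches(n_profiles: int, p: float) -> float:
    """Approximate P(at least one matching pair among n profiles).

    Assumption: the comb(n, 2) pairwise comparisons are treated as
    independent, which is only an approximation.
    """
    return 1.0 - (1.0 - p) ** comb(n_profiles, 2)


# Birthday setting: 23 people, 365 equally likely birthdays.
query_prob = prob_match_to_query(23, 1 / 365)    # someone shares a birthday with person X
pair_prob = prob_some_pair_matches(23, 1 / 365)  # some two people share a birthday
```

In the birthday setting, the pairwise probability is several times larger than the single-query probability. This gap is why a high count of pairwise partial matches in a database does not by itself contradict a small match probability for a single query.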
A limitation on such studies is that they use model-based profiles rather than actual profiles. Some studies with sets of actual profiles have been performed, 19,20 comparing model-based predictions of the number of pairwise database matches to empirical assessments. Although these studies have tens of thousands of individuals, their numbers of profiles remain small compared to the millions of profiles now present in actual forensic databases. Hence, the potential for understanding pairwise database matches in practical settings continues to rely on mathematical models together with evaluations of the level of empirical matching in smaller datasets.
We and others have recently employed techniques for the imputation of the alleles of forensic STR loci from neighboring SNPs, 21-23 introducing a new possibility for evaluating pairwise match probabilities in databases. Non-forensic genomic SNP databases are increasing in size, so that the possibility that millions of SNP profiles will be available for pairwise comparison can be envisioned. With a large database of SNP profiles, the alleles of forensic STRs could conceivably be imputed from the SNPs. From probabilistically imputed STR alleles, the probability of database matches could then be obtained.
An imputation-based calculation enables an Arizona search from SNP profiles, where instead of using a model that generates profiles from allele frequencies, as in the work of Mueller, 18 the model employed is the imputation model for STR allele probabilities on the basis of the neighboring SNPs. Hence, assuming that the potential for performing Arizona searches from actual STR profiles continues to remain limited, use of imputation in increasingly large SNP datasets can increase the database size for Arizona searches.
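For contrast with the imputation-based approach, the allele-frequency model mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not Mueller's implementation: alleles are drawn independently within and across loci, and the function names and allele frequencies are invented for the example.

```python
import random
from itertools import combinations


def simulate_profile(allele_freqs, rng):
    """Draw one diploid profile: two alleles per locus, independently across loci."""
    profile = []
    for freqs in allele_freqs:
        alleles = list(freqs)
        weights = [freqs[a] for a in alleles]
        a1, a2 = rng.choices(alleles, weights=weights, k=2)
        profile.append(tuple(sorted((a1, a2))))  # unordered genotype
    return profile


def fraction_full_matches(profiles):
    """Fraction of profile pairs that are identical at every locus."""
    pairs = list(combinations(profiles, 2))
    return sum(p == q for p, q in pairs) / len(pairs)


# Hypothetical allele frequencies for two loci (illustrative values only).
example_freqs = [
    {"12": 0.3, "13": 0.5, "14": 0.2},
    {"8": 0.6, "9": 0.4},
]
rng = random.Random(0)
hypothetical_db = [simulate_profile(example_freqs, rng) for _ in range(200)]
full_match_fraction = fraction_full_matches(hypothetical_db)
```

The imputation-based search replaces the allele-frequency draws with allele probabilities conditioned on each individual's neighboring SNPs.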
In this study, we assess the feasibility of performing an Arizona search of forensic STR profiles by imputation in databases of SNP profiles. We consider individuals for which both SNP and STR genotypes are available. We empirically perform the search using the actual STR profiles, tabulating numbers of partial matches. We then repeat the search by the imputation of STR profiles from SNP profiles, assessing the agreement of the number of partial matches in the imputed data with that in the empirical genotypes. The results suggest that increasingly large SNP databases can indeed be used, together with imputation, to perform searches that mimic Arizona searches of unavailable STR databases.

Arizona search with imputed genotypes
We begin by using a dataset of phased SNP-STR genotypes derived from the 1000 Genomes project 23 to simulate a forensic database (see STAR Methods: Data and code availability). We randomly split the 2,504 individuals in the dataset into a reference panel (60%, 1,502 individuals) for use in the imputation procedure, and a database set (40%, 1,002 individuals), in which the Arizona searches are performed. We consider 100 replicate reference-database splits to ensure that results are not affected by artifacts of random splitting.
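The replicate splitting described above might be set up along the following lines. This is a sketch only: the function name, the use of integer sample indices, the seeding scheme, and the rounding rule (truncating 60% of 2,504 to 1,502) are assumptions made for illustration.

```python
import random


def split_reference_database(sample_ids, reference_fraction=0.6, seed=0):
    """Randomly split samples into a reference panel and a database set."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_ref = int(len(shuffled) * reference_fraction)  # 2,504 samples -> 1,502
    return shuffled[:n_ref], shuffled[n_ref:]


# 100 replicate splits, one seed per replicate (hypothetical seeding scheme).
replicates = [split_reference_database(range(2504), seed=r) for r in range(100)]
```

Storing the splits once allows the same 100 replicates to be reused for every downstream analysis.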
For individuals in a database set, we have two kinds of genotype data available: the true STR genotypes at 18 CODIS loci, and STR genotypes imputed with the BEAGLE program 24,25 using neighboring SNP genotypes and the reference panel (STAR Methods: Imputation with BEAGLE). We refer to the imputed genotypes as “BEAGLE-called” genotypes.
For the true genotypes, we calculate the numbers of matching alleles, loci matching at both alleles (“fully matching”), and loci matching at exactly one allele (“partially matching”) for each of the $\binom{1002}{2} = 501{,}501$ possible pairs of individuals.
We then repeat this calculation for BEAGLE-called genotypes and compare the values obtained with those for true genotypes. We refer to this approach as Scheme 1 (Figure 1A).
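The pairwise tabulation used in Scheme 1 can be sketched as below. The sketch assumes that each profile is a list of unordered two-allele genotypes, one per locus, and that the number of matching alleles at a locus is the size of the multiset intersection of the two genotypes (0, 1, or 2). The function names are illustrative and do not come from the study's code.

```python
from collections import Counter
from itertools import combinations
from math import comb


def shared_alleles(g1, g2):
    """Number of alleles shared by two unordered diploid genotypes (0, 1, or 2)."""
    return sum((Counter(g1) & Counter(g2)).values())


def compare_profiles(p1, p2):
    """Count matching alleles, fully matching loci, and partially matching loci."""
    per_locus = [shared_alleles(a, b) for a, b in zip(p1, p2)]
    return {
        "alleles": sum(per_locus),
        "full": sum(s == 2 for s in per_locus),
        "partial": sum(s == 1 for s in per_locus),
    }


def arizona_search(profiles):
    """Compare every profile with every other profile."""
    return [compare_profiles(p, q) for p, q in combinations(profiles, 2)]


# A database of 1,002 profiles yields comb(1002, 2) = 501,501 comparisons.
n_pairs = comb(1002, 2)
```

The same functions apply unchanged to true genotypes and to called genotypes, so the two searches differ only in their input.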

Arizona search with imputed allele probabilities
The BEAGLE-called genotypes do not capture all the information that is produced by the imputation procedure. The imputation algorithm also estimates allele probabilities for each locus on each chromosome for every sample, representing the uncertainty in the imputation. The BEAGLE-called genotypes are then assigned to be the alleles with the highest probability.
In a second experiment, working with the same 100 random splits of reference and database samples, we used the estimated allele probabilities directly to compute expected numbers of allele and locus matches for each pair of individuals in the database, as described in STAR Methods: Expected number of matches. Expected numbers of matches represent the similarity between a pair of individuals across all possible genotype combinations, weighted according to the imputed allele probabilities. We refer to this approach as Scheme 2 (Figure 1B).
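One way to read the Scheme 2 calculation at a single locus is sketched below. The exact formulation is given in STAR Methods and is not reproduced here, so this is an assumption-laden illustration. Each individual is represented by two dictionaries (one per chromosome) mapping alleles to imputed probabilities, the four chromosomes are treated as independent, and all names are hypothetical.

```python
from collections import Counter
from itertools import product


def shared_alleles(g1, g2):
    """Number of alleles shared by two unordered diploid genotypes (0, 1, or 2)."""
    return sum((Counter(g1) & Counter(g2)).values())


def called_genotype(ind):
    """Most probable allele on each chromosome (the Scheme 1 style call)."""
    return tuple(max(chrom, key=chrom.get) for chrom in ind)


def expected_locus_matches(ind1, ind2):
    """Expected allele matches, full-match and partial-match probabilities at one locus.

    Assumption: allele probabilities on the four chromosomes are independent.
    """
    exp_alleles = exp_full = exp_partial = 0.0
    for (a1, pa1), (a2, pa2), (b1, pb1), (b2, pb2) in product(
        ind1[0].items(), ind1[1].items(), ind2[0].items(), ind2[1].items()
    ):
        weight = pa1 * pa2 * pb1 * pb2  # probability of this genotype combination
        s = shared_alleles((a1, a2), (b1, b2))
        exp_alleles += weight * s
        exp_full += weight * (s == 2)
        exp_partial += weight * (s == 1)
    return exp_alleles, exp_full, exp_partial
```

Summing these per-locus expectations over loci would give expected numbers of matches for a pair of profiles. When every probability is 0 or 1, the result reduces to the discrete counts obtained from called genotypes.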

Distributions of numbers of matches
We perform Arizona searches using true STR genotype data and imputed STR genotype data obtained using Schemes 1 and 2. Figure 2 shows match distributions over the $\binom{1002}{2}$ possible comparisons in the database, averaged over all 100 replicates. The distributions are summarized in Table 1.
In an Arizona search with the true data, the median number of matching alleles is 10, and the maximal value observed across the replicates is 24. The theoretical maximum is 36, corresponding to a comparison of identical samples. For the counts of fully matching loci, the median of the true distribution is 1, and the largest observed value is 8, compared to a theoretical maximum of 18. Finally, for partially matching loci, the median is 8, and the observed maximum is 17.
Both ways of using imputed data produce distributions of matches close to those of the true data. An Arizona search using BEAGLE-called genotypes (Scheme 1) recovers the correct medians (Table 1). Visually, the distributions of the numbers of allele matches, fully matching loci, and partially matching loci are close to the true ones. The range of values is larger with imputed data: most noticeably, the maximal numbers for counts of fully matching loci are 8 and 11 for true and imputed genotypes, respectively.
Using the expected numbers of matches computed from imputed allele probabilities (Scheme 2) yields a distribution of the numbers of allele matches that is more concentrated than the true discrete distributions (Figure 2A). The medians are close to true values, as are the observed maxima (Table 1).

Match error due to imputation
The Arizona searches using imputed data recover the distributions of allele and locus matches across pairs of individuals; we now evaluate the procedure at the level of specific pairs of individuals.
Figure 3 compares the numbers of matches for true and imputed data for each pair of individuals. The numbers computed using Scheme 1 are reasonably correlated with the true values (Spearman correlations of 0.66, 0.51, and 0.55 for allele matches, fully matching loci, and partially matching loci, respectively). In each category of matches, for more than 50% of pairs, the absolute difference between the number of matches in Scheme 1 and the true number is no more than 1. In 90% of pairs, Scheme 1 differs from true values by 3 or less (Table 2). Scheme 2 increases the agreement of the algorithm with the true values. Correlations of true and expected numbers of matches are higher (0.71, 0.58, and 0.61 for allele matches, fully matching loci, and partially matching loci, respectively). Median absolute error is also near one allele or locus (Table 2).
To further characterize the differences between true numbers of matches and those computed with imputed data, we use the Hodges-Lehmann estimator of the difference of means for paired samples. 26 Let $T_i$ be the true number of matches (for any of the three match categories) and let $I_i$ be the number of matches with imputed data (with either Scheme 1 or 2), for $i = 1, 2, \ldots, 501501$. Let $E_i = I_i - T_i$. Rearrange the $E_i$ in non-decreasing order, $E_1 \le E_2 \le \cdots \le E_{501501}$. Our estimate of the difference between numbers of imputed and true matches is the median of averages of all pairs in the set $\{E_i\}$:
$$\hat{\theta} = \operatorname{median}\left\{ \frac{E_i + E_j}{2} : 1 \le i \le j \le 501501 \right\}. \qquad \text{(Equation 1)}$$
The estimator $\hat{\theta}$ is well suited to our problem, as it does not introduce any assumptions on the distributions of the numbers of matches and it is robust to outliers. The Hodges-Lehmann estimates, shown in Figure 4 as distributions over 100 replicate splits, lie in $[-0.15, 0.15]$. Hence, on average, using called genotypes (Scheme 1) or expected matches (Scheme 2) computed from SNP data biases the Arizona search results by less than 0.15 of a match.
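The median of pairwise averages described above can be written directly for small inputs. The sketch below assumes that the pairs include each value paired with itself (the usual Walsh averages, i ≤ j); the function name is illustrative. This direct version forms on the order of n²/2 averages, so it is not practical for all 501,501 differences without a faster algorithm.

```python
from statistics import median


def hodges_lehmann(diffs):
    """Median of the pairwise averages (E_i + E_j) / 2 over all i <= j."""
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


# Hypothetical differences (imputed minus true) for four pairs of individuals;
# one large outlier leaves the estimate at zero.
example_estimate = hodges_lehmann([0, 0, 0, 10])
```

The example shows the robustness noted in the text: the mean of the four differences is 2.5, whereas the estimate is unaffected by the single outlier.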

DISCUSSION
We have analyzed the possibility of performing Arizona searches of STR databases using SNP genotype data and imputation. Using 18 of the 20 CODIS STR loci and neighboring SNPs, we have described Arizona searches by imputation that use either most likely STR genotype assignments (Scheme 1) or STR allele probabilities (Scheme 2) obtained by imputation using surrounding SNPs (Figure 1).
Both schemes recover the true distributions of the numbers of matching alleles and loci (Figure 2), and the medians of three classes of matches closely agree with the true values. For the maximal number of matches, Scheme 2 provides values close to those of the true data; Scheme 1 sometimes yields pairs with higher numbers of matches (Table 1). That Scheme 1 would not perform as well on this metric is sensible: although the calculation using imputed allele probabilities reasonably captures the uncertainty in the imputation algorithm, Scheme 1 is systematically biased toward selecting more probable (and more frequent) alleles for each individual, increasing the probability of observing pairs with high numbers of matches.
When specific pairs of individuals are considered, the median absolute error in the number of matches computed by imputation is near 1 (Table 2). Correlations between numbers of imputed and true matches are reasonably high (Figure 3), though error can be nontrivial for specific pairs. As in other imputation studies, 27,28 it is likely that some of this error can be eliminated with larger reference panels.
As forensic genetics has been increasingly examining new SNP sets that could eventually augment or even replace existing STR systems, 29,30 it is possible that the Arizona search question of understanding the distribution of pairwise agreement among profiles will become relevant for new potential marker sets. Although we have focused here on imputing STRs from SNPs, imputation of the relevant SNPs in proposed marker sets from neighboring SNPs could proceed similarly, and indeed would be more similar than our present SNP-STR analysis to typical biomedical imputations of SNPs from other SNPs.
Table 2 note. The differences are computed after merging results on 100 independent replicate splits of the starting dataset into reference and database samples.
Imputation has appeared in a variety of problems in forensic genetics; 21-23,31-34 its use for the Arizona search problem is one of an increasing number of scenarios in which loci external to forensic systems can assist in understanding forensic genetic matching. Imputation has enabled the matching of genetic records between profiles of SNP loci and profiles of STR loci, linking SNP and STR databases in principle. 21,22,33 It can also help in testing STR loci for phenotypic associations while attempting to understand the phenotypes that might be associated with particular forensic profiles. 31,34

Limitations of the study
Our somewhat simplistic analysis in the 1000 Genomes data, a dataset with relatively few individuals compared to that in which the largest reported Arizona search has been performed, 19 provides a demonstration that the imputation-based Arizona search approach is feasible. However, we note a number of limitations. First, the 1000 Genomes SNP-STR haplotype panel we used was itself obtained using imputation based on an external family-based reference dataset. 23 While the accuracy of this procedure was found to be high, 23 imputation errors could still be present in the data. It is important to be cautious in interpreting our computations for any particular pair of individuals, and it will be useful to perform similar analyses in datasets containing SNP and STR genotypes obtained directly. We note also that we have not taken into account population structure among profiles in the database of profiles; a future direction is to examine imputation in the context of approaches to Arizona searches incorporating the Balding-Nichols model that takes population structure into account. 19 The possibility that the database contains siblings, parents and offspring, or other close relatives could also be considered.
Finally, we note that in our analysis of the 1000 Genomes data, we are relying on an assumption that a forensic database accurately represents the profiles of its sampled individuals. Genotyping errors, recording errors, sample mislabelings, and sample duplications can alter the relationship between the set of individuals for whose profiles an Arizona search is of interest and the actual profiles employed in such a search. Such factors will be important to consider in interpreting any imputation-based Arizona searches performed beyond the controlled scenario of a simulation.

STAR METHODS
Detailed methods are provided in the online version of this paper and include the following:

Figure 1. The experimental design. Rectangular boxes represent data, rounded boxes represent actions, and circles mean that the actions below are repeated multiple times. (A) Scheme 1: Arizona search using BEAGLE-called genotypes. (B) Scheme 2: Arizona search using STR allele probabilities inferred by BEAGLE for each individual in the database. The 100 replicate splits are the same in Schemes 1 and 2.

Figure 2. Distributions of the numbers of matching alleles, fully matching loci, and partially matching loci in Arizona searches in simulated forensic databases. Normalized histograms are plotted for discrete match counts using true STR genotypes (green) and imputed STR genotypes (Scheme 1, orange). Kernel density estimates are plotted for expected matches (Scheme 2, purple). All 100 replicate splits are combined to produce a single distribution. (A) Number of matching alleles between two individuals. (B) Number of fully matching loci between two individuals. (C) Number of partially matching loci between two individuals.

Figure 3. Comparison of numbers of matches with imputed and true data for all pairs of individuals in the database. In each panel, the x axis is the number of matching alleles or loci with true STR genotype data, and the y axis shows the corresponding number with imputed data. The Spearman correlation coefficient $\rho$ is shown for each panel. The panels show matches in all 100 replicates combined into a single distribution. In the figure, for integers $(x, y)$, the unit square centered at $\left(x + \frac{1}{2}, y + \frac{1}{2}\right)$ depicts values in $[x, x+1) \times [y, y+1)$. (A) Scheme 1, allele matches. (B) Scheme 1, fully matching loci. (C) Scheme 1, partially matching loci. (D) Scheme 2, allele matches. (E) Scheme 2, fully matching loci. (F) Scheme 2, partially matching loci.

Table 1. Summaries of distributions of the numbers of matching alleles, fully matching loci, and partially matching loci in Arizona searches in simulated forensic databases. Medians and maximal observed values are computed after pooling results on 100 replicate splits of the starting dataset into reference and database samples.

Table 2. Absolute difference between the number of matches in Schemes 1 and 2 and the true values

RESOURCE AVAILABILITY: Lead contact; Materials availability; Data and code availability
METHOD DETAILS: Imputation with BEAGLE; Expected number of matches