Data on the interexaminer variation of minutia markup on latent fingerprints

The data in this article supports the research paper entitled “Interexaminer variation of minutia markup on latent fingerprints” [1], and describes the variability in minutia markup both during analysis of the latents and during comparison between latents and exemplars. The data was collected in the “White Box Latent Print Examiner Study,” in which each of 170 volunteer latent print examiners provided detailed markup documenting their examinations of latent-exemplar pairs of prints randomly assigned from a pool of 320 pairs. Each examiner examined 22 latent-exemplar pairs; an average of 12 examiners marked each latent.

The test workflow, summarized in [2], conforms broadly to the prevailing ACE (Analysis, Comparison, Evaluation) methodology. The Verification phase was not addressed. Examiners could review and revise their work prior to submitting their results. Examiners were free to modify the markup and value determination for the latent after the exemplar was presented, but any such changes were recorded and could be compared with their Analysis responses. The test procedure is described in detail in [2], including the complete test instructions and introductory video.

Fingerprint data
The fingerprints were collected at the FBI Laboratory and at Noblis under controlled conditions, and from operational casework datasets collected by the FBI. We provide a detailed description of the fingerprint data selection process in Appendix S.5 in [2]. All prints were impressions of distal segments of fingers, including some sides and tips.
The latents were processed using a variety of development techniques. The processed latents were captured electronically at 8-bit grayscale, uncompressed, at a resolution of 1000 pixels per inch.
The exemplars included both rolled and plain impressions captured as inked prints on paper cards or using FBI-certified livescan devices; they were captured at 8-bit grayscale, 1000 or 500 pixels per inch and either uncompressed or compressed using Wavelet Scalar Quantization [3].
The fingerprint pairs were selected to vary broadly over a four-dimensional design space: number of corresponding minutiae, image clarity, presence or absence of corresponding cores and deltas, and complexity (based on distortion, background, or processing). The primary focus was to test the boundaries of sufficiency for individualization determinations, and therefore we deliberately limited the proportion of image pairs on which we expected unanimous determinations.
We selected nonmated pairs to result in challenging comparisons, either by down-selecting among exemplar prints returned by searches of the FBI's Integrated AFIS (IAFIS) or by selecting exemplars from neighboring fingers of the same subject.
To ensure coverage of the design space and balance of image pairs across examiners, the assignments of fingerprint images to examiners were randomized based on an incomplete block design (with examiners as blocks, image pairs as factor levels), balanced to the extent possible (using the criterion of D-Optimality).
For each image pair assigned to an examiner, the test process saved two data files: one saved upon completion of the Analysis phase (before the exemplar print was presented) and a second upon completion of the Comparison phase. The files complied with the ANSI/NIST-ITL [4] standard, using the COMP transaction described in the Latent Interoperability Transmission Specification [5].

Local ridge clarity
The annotations of local ridge clarity complied with the Extended Feature Set (EFS), which is part of the ANSI/NIST-ITL standard [4]. EFS defines a color-coding method for describing clarity [6]. For minutiae, the primary distinction with regard to clarity is that in green or better areas the examiner is "certain of the location, presence, and absence of all minutiae" (White Box Instructions, Appendix 22 in [2]). Yellow areas indicate the opposite: location, presence, and/or absence are not certain. Black or red areas should not contain any marked minutiae; when this occurs it is often due to imprecise painting of the clarity, or to not following instructions. For this analysis, we simplified the classification to clear (green or better) vs. unclear (yellow or worse).
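As a minimal sketch of this clear/unclear simplification: assuming an ordered palette of EFS clarity colors (the exact color names and ordering below are illustrative, not the standard's normative encoding), the binarization is a threshold at green.

```python
# Illustrative ordering of EFS clarity colors from worst to best; the exact
# palette here is an assumption for this sketch, not the normative EFS encoding.
EFS_CLARITY_ORDER = ["black", "red", "yellow", "green", "aqua", "blue"]

def is_clear(color: str) -> bool:
    """Green or better counts as clear; yellow or worse counts as unclear."""
    return EFS_CLARITY_ORDER.index(color) >= EFS_CLARITY_ORDER.index("green")
```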
Unless otherwise stated, we report the clarity as marked by that examiner. In some analyses we use the median clarity across multiple examiners, which combines the clarity maps from the examiners who were assigned that pair to represent a group consensus. This reduces the impact of outlier opinions and imprecision. When constructing the median clarity maps, we excluded four examiners whose clarity markup did not comply with the test instructions.
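A median clarity map of this kind can be sketched as a per-pixel median over binarized examiner maps (1 = clear, 0 = unclear); the list-of-lists representation here is an assumption for illustration.

```python
from statistics import median

def median_clarity(maps):
    """Combine per-examiner clarity maps (2-D lists of 1 = clear, 0 = unclear)
    into a consensus map by taking the per-pixel median across examiners."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[median(m[i][j] for m in maps) for j in range(cols)]
            for i in range(rows)]
```

With an odd number of examiners the per-pixel median is a majority vote, which damps outlier opinions and imprecise painting as described above.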

Examiner responses: determinations and markup data
As detailed in Appendix SI-5 of [2], we received valid responses from 170 participants. Each participant was assigned 22 image pairs from a pool of 320 total pairs. Early in the testing process, a problem was identified in seven image pairs; ten responses on these image pairs were excluded, yielding a total of 3730 valid responses from the Analysis phase. Examiners marked 44,941 minutiae on 3550 latents (180 Analysis-phase markups included no minutiae).
Comparison-phase responses include 2966 comparisons where neither the latent nor the exemplar was assessed to be NV; this omits 2 invalid determinations (software issue) and 762 NV determinations (713 Analysis-phase latent NV, 43 Comparison-phase latent NV, and 6 Comparison-phase exemplar NV). Our previous report on changes made from Analysis to Comparison [7] omitted an additional nine responses whose Analysis-phase markup was not captured until after the exemplar had been presented. The number of valid responses per image pair is summarized in Fig. 1.
The corresponding minutia data excludes markups by five examiners who routinely did not annotate correspondences, and two markups that were missing a Comparison determination. This resulted in 3618 valid markups for analyses of corresponding minutiae (45,130 Comparison-phase minutiae marked on the latent). For some analyses, we include all minutiae marked during Analysis (including deletions) or added during Comparison (52,155 minutiae, 50,894 of which are on the 3618 markups with valid corresponding minutiae).
Fig. 2 shows four example markups [1]. Marked minutiae are shown as small black dots inside color-coded clusters. For the Analysis phase, cluster colors indicate the proportion of examiners who marked within that cluster; for the Comparison phase, colors indicate the proportion of comparing examiners who corresponded the minutia as marked on the latent. The third row of images ("Latent with Analysis minutiae") shows all minutiae as marked in the Analysis phase; the fourth row ("Latent with corresponding minutiae") shows markup from the Comparison phase limited to those minutiae that examiners marked as corresponding; the fifth row ("Exemplar with corresponding minutiae") shows the locations of the corresponding minutiae as marked on the exemplar. Because marked minutiae from one cluster on the latent did not always correspond to one cluster on the exemplar (whether due to examiner disagreements or to behavior of the clustering algorithm), the fifth row uses the color-coding from the latent markup to help visualize the correspondences. Table 1 describes, for each of the four examples shown in Fig. 2, the number of examiners contributing to the clusters and their determinations.

Example markups
Note that example D is the one comparison on which an erroneous individualization occurred (also shown as an example in Fig. 2 of [7]). Five examiners marked correspondences (two of whom also marked discrepancies), one additional examiner marked debatable correspondences, and one additional examiner marked discrepancies. Even after omitting the examiner who individualized, more correspondences were marked on this image pair (22, in 11 clusters) than on any other nonmated image pair in the test. Other top examples of nonmated image pairs with many correspondences marked included one with 18 correspondences (in 12 clusters, by two of ten comparing examiners) and another with 13 correspondences (in 8 clusters, by five of eight comparing examiners).

Effect of clustering parameters
Examiners' markups differed in whether or not individual minutiae were marked, and in the precise location where the minutiae were marked. In order to focus on whether examiners agree on the presence or absence of minutiae, we need to see past minor variations in minutia location. Neumann et al. [8] used ellipses to determine whether two minutiae should be considered the same, based on an expectation of more variation in location along the direction of the ridge than perpendicular to ridge flow; here we did not collect minutia direction, making this approach impractical. In [7], our technique of classifying features as retained, moved, added, or deleted was based on a fixed radius of 0.5 mm (0.02 in., or approximately the average inter-ridge distance). Although that approach was satisfactory for two markups where one was derived from the other, it is not well suited to comparing more than two markups.
We used automated clustering algorithms in order to classify minutiae marked by multiple examiners as representing the same minutia on the latent. Clustering was implemented in two stages as follows:
1. For each fingerprint, the set of all minutia x,y coordinates (as marked by the examiners) was preliminarily clustered using DBSCAN with a given radius r and no lower limit on cluster size; that is, singletons were treated as valid clusters, not labeled as "noise."
2. Oversized preliminary clusters were split using agglomerative hierarchical clustering, with ceiling(mean number of marks per examiner) as the cutoff point. Hierarchical clustering assembles a tree of cluster relationships; there is no assumption of a fixed radius.
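With no lower limit on cluster size (every point effectively a core point), the first stage reduces to finding connected components of the r-neighborhood graph. The following is a minimal pure-Python sketch of that behavior, assuming coordinates in inches and the study's radius of 0.015"; it is not the implementation used in the study.

```python
from math import hypot

def dbscan_no_noise(points, r=0.015):
    """Group minutia (x, y) coordinates so that any two points within r of
    each other (directly or through a chain of points) share a cluster label.
    Singletons receive their own labels rather than being marked as noise."""
    labels = [None] * len(points)
    next_label = 0
    for seed in range(len(points)):
        if labels[seed] is not None:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:  # flood-fill the r-neighborhood graph
            p = stack.pop()
            for q in range(len(points)):
                if labels[q] is None and hypot(
                        points[p][0] - points[q][0],
                        points[p][1] - points[q][1]) <= r:
                    labels[q] = next_label
                    stack.append(q)
        next_label += 1
    return labels
```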
Neither algorithm makes use of any information from the fingerprint images themselves; they rely entirely on the x,y coordinates of the minutiae as marked by examiners. The implementation of Density-based Spatial Clustering of Applications with Noise (DBSCAN) we used was written by Michal Daszykowski of the University of Silesia in 2004 [9,10]. The DBSCAN radius was set to 0.015" (0.38 mm) after extensively reviewing the algorithm's performance over a range of radius settings. In our review, we considered several standard clustering performance measures and visually assessed the resulting clusters as plotted superimposed over the latent prints. As shown in Fig. 3 and Table 2, any choice of radius substantially biases the reproducibility distributions: increasing the radius increases the measured mean reproducibility and decreases the measured number of clusters. We selected a slightly large radius in order to aggregate some of the less precisely focused clusters; we then split many of the oversized clusters in the second step.
Oversized preliminary clusters were selected for subsequent splitting by agglomerative hierarchical clustering based on a criterion of (mean number of marked minutiae per examiner) > 1.5. This arbitrary threshold was selected because (1) automated splitting of clusters meeting this criterion was highly successful, and (2) for lower values (between 1 and 1.5), it was usually not apparent even to a human how to split correctly without careful interpretation of the fingerprint image. The oversized preliminary clusters often contained multiple, clearly distinct ridge events, but otherwise were difficult to resolve by visual inspection. We used MATLAB's implementation of the agglomerative hierarchical clustering algorithm; Ward's method was selected for computing the distance between clusters. Ward's method helps overcome the main flaw of DBSCAN, which is that it tends to fail when faced with highly heteroskedastic data (data in which the variance differs among subsets). Clustering was performed separately on Analysis markup (n = 44,941 minutiae), Comparison markup (n = 46,205 minutiae), and combined markup (n = 52,155 minutiae). Combined markup (used in Sections 9 and 10.2) includes both deleted and added minutiae. 94% of the Analysis-phase clusters have a maximum radius less than 1 mm; 99.2% less than 1.5 mm; 99.95% less than 2 mm.
Tables 3 and 4 and Figs. 4 and 5 describe associations between reproducibility and clarity, and between consensus and clarity. While clarity as painted by the examiners who marked the minutiae is a strong predictor of reproducibility, consensus descriptions of clarity provide a better explanation of interexaminer variation in minutia markup. Minutiae that were more highly reproduced were more likely to be found in clear areas of the latent. Table 4 illustrates how median clarity explains this association better than examiner clarity.
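The second-stage splitting can be sketched as a naive agglomerative loop using Ward's merge cost (merge the pair of clusters whose union least increases within-cluster variance) until a target cluster count is reached. This is an illustrative re-implementation, not the MATLAB code used in the study.

```python
from math import dist

def ward_split(points, k):
    """Agglomeratively merge (x, y) points under Ward's criterion until only
    k clusters remain; per the text, k would be ceiling(mean number of marks
    per examiner) for an oversized preliminary cluster."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                ca = [sum(c) / len(a) for c in zip(*a)]
                cb = [sum(c) / len(b) for c in zip(*b)]
                # Ward merge cost: increase in total within-cluster variance
                cost = len(a) * len(b) / (len(a) + len(b)) * dist(ca, cb) ** 2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```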

Reproducibility and consensus by clarity
The latent prints included many areas where examiners did not agree on clarity. Fig. 4 indicates how these areas of "debatable clarity" contribute to reproducibility, by showing the associations between consensus and clarity. Fig. 5 shows the distribution of minutia clarity conditioned on the proportion of examiners describing that location as clear: minutia reproducibility is very high when examiners concur that a location is clear, very low when examiners concur that a location is unclear, and varied when there is no concurrence on clarity. This can explain some of the lack of association seen in Fig. 4.
Table 5. "Perfect" agreement counts those Analysis-phase markups in which (1) all minutiae that the examiner marked in clear areas were in majority clusters and (2) the examiner marked in all majority clusters (in any clarity). The 90% and 75% agreement columns require that at least 90% (75%) of the minutiae that the examiner marked in clear areas were in majority clusters and that the examiner marked at least 90% (75%) of the majority clusters. Latents lacking any clear minutiae or majority clusters trivially satisfy both criteria for "perfect" agreement.

Reproducibility of entire markups
In addition to assessing interexaminer variability by minutiae (reproducibility) and by clusters (consensus), we can assess variability by entire markups. Table 5 describes the extent to which the examiners' minutia markup was in complete (or near-complete) agreement on each latent, conditioned on the presence of clear minutiae and majority clusters. Table 6 shows the distribution of singletons per markup. With a mean of 12 examiners per latent, 50% of the Analysis-phase markups had singletons. 15% of all markups had more than two singletons, and these markups accounted for 59% of all singletons. 6.6% of examiner clear minutiae were singletons; 16.8% of examiner unclear minutiae were singletons.
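The agreement criteria used in Table 5 amount to two threshold checks per markup. A sketch, with the counts supplied by the caller and empty denominators trivially satisfied as stated in the caption:

```python
def markup_agrees(clear_in_majority, clear_total,
                  majority_marked, majority_total, frac=1.0):
    """Agreement criterion at a given fraction (1.0 = 'perfect', 0.9, 0.75):
    at least `frac` of the examiner's clear-area minutiae fall in majority
    clusters, AND the examiner marked at least `frac` of the majority
    clusters. Latents with no clear minutiae or no majority clusters
    satisfy the corresponding condition trivially."""
    ok_clear = clear_total == 0 or clear_in_majority / clear_total >= frac
    ok_major = majority_total == 0 or majority_marked / majority_total >= frac
    return ok_clear and ok_major
```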

Singletons and solo misses
Analogous to singletons are "solo misses," i.e., minutiae that were marked by all but one of the examiners. Unlike singletons, solo misses occur primarily in clear areas: there were a total of 640 solo misses during Analysis (6% of clusters), 610 of which were in median clear areas. Although singletons are far more numerous than solo misses, solo misses disproportionately affect measures such as mean reproducibility, because reproducibility counts each singleton once (as reproducibility = 0) while it counts solo misses once for each examiner who marked that minutia (e.g., as mean reproducibility = 92% if 11 of 12 examiners marked a minutia).
Table 6. Distribution of singletons per markup (Analysis phase, mean of 12 examiners per latent).
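This weighting effect can be made concrete with a small sketch, assuming (per the text) that a minutia's reproducibility is the proportion k/n of examiners who marked its cluster, with singletons scored as 0:

```python
def mean_reproducibility(cluster_sizes, n):
    """Per-minutia mean reproducibility for one latent: a cluster marked by
    k of n examiners contributes k values of k/n (a singleton contributes a
    single 0). A solo miss (k = n - 1) therefore carries far more weight in
    the mean than a singleton does."""
    vals = [0.0 if k == 1 else k / n
            for k in cluster_sizes for _ in range(k)]
    return sum(vals) / len(vals)
```

For example, one singleton plus one solo miss among 12 examiners yields (0 + 11 × (11/12)) / 12 ≈ 0.84, a mean dominated by the solo miss.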

Reproducibility of minutia with respect to value determinations
Minutia reproducibility tended to be higher on latents that examiners agreed were VID than on those that examiners agreed were not VID. However, as shown in Fig. 6, most of this association can be accounted for in terms of differences in clarity: latents that examiners agreed were VID tend to have more minutiae marked in clear areas.
We have previously reported [2,7] that when one examiner assesses a latent to be VID and another examiner assesses that same latent to be NV, the examiner assessing the latent to be VID can be expected to mark more minutiae. Here we take a closer look at how differences in value assessments relate to whether examiners mark specific minutiae.
The following logistic regression model was used to estimate the probability that an examiner would mark a minutia, given the level of consensus for that minutia and the examiner's value assessment. This model allows us to estimate how much effect is specifically associated with the value assessments, as opposed to other factors (such as clarity, or which regions of the prints examiners chose to mark) that are largely accounted for by conditioning on consensus:

logit(π) = β0 + βvalue + β1 · c,

where π is the probability that this examiner marked the minutia given this examiner's value assessment of the latent, βvalue is the effect associated with that value assessment, and c is the proportion of all examiners who marked this minutia. The probability estimates are summarized in Table 7. Even after accounting for the level of consensus on each minutia, examiners are more likely to mark minutiae when they assess a latent to be VID.
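Evaluating a model of this form is a one-line inverse logit; the coefficients below are hypothetical placeholders for illustration (the study's fitted probability estimates appear in Table 7).

```python
from math import exp

def p_mark(consensus, value_effect, b0=-2.0, b1=5.0):
    """Inverse logit of b0 + b1*consensus + value_effect: the modeled
    probability that an examiner marks a minutia. All coefficient values
    here are hypothetical, chosen only to illustrate the functional form."""
    z = b0 + b1 * consensus + value_effect
    return 1.0 / (1.0 + exp(-z))
```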
The decisions to mark or not mark minutiae on a single latent are not independent events. For example, examiners occasionally mark no minutiae on latents assessed to be NV or VEO; this may contribute to the lower probability of examiners marking minutiae in majority clusters on these responses. Because of this lack of independence, conditioning on the level of consensus, as shown in Table 7, does not completely remove the confounding effects of factors such as clarity. Figs. 7 and 8 show that when examiners assessed latents to be VID, they almost always marked most of the majority clusters; when they assessed latents to be NV or VEO, they often marked fewer than half of the majority clusters. Tables 8 and 9 summarize Analysis-phase reproducibility by latent value assessment and clarity.
Fig. 9 shows reproducibility of cores and deltas. Examiners were instructed to mark all cores and deltas on the latents, provided they could be located within approximately three ridge intervals. On those latents that had one or more cores or deltas marked by any examiner, typically only about half of the examiners marked them: no cores or deltas were unanimously marked. Table 10 shows the prevalence of nonminutia features in the area of minutia clusters. Features other than minutiae were sometimes present in or near minutia clusters, which could indicate a disagreement as to whether a feature should be marked as a minutia, a nonminutia feature, or both. However, this did not explain much of the interexaminer variability: only 4.5% of clusters contained features other than minutiae.
Fig. 7. The median number of majority clusters marked (dashed line) was 71% for NVs, 75% for VEOs, and 89% for VIDs. No majority clusters were marked (left extreme) on 13% of NV latents, 6% of VEO latents, and 0% of VID latents. All majority clusters were marked (right extreme) on 34% of NVs, 27% of VEOs, and 28% of VIDs.
Fig. 10. Same data as Table 11, shown graphically, color-coded by examiner B clarity.

Table 12
Examiner B clarity by examiner A clarity at each cluster center. Data is constructed from all pairs of examiners on each latent regardless of whether the examiners marked in the cluster; each cluster is weighted equally (n = 10,324 clusters). The tables summarize the clarity examiners assigned to each cluster without regard to whether those examiners marked a minutia in the cluster.

Agreement in clarity markup (Analysis phase)
Examiners often disagreed as to whether or not minutiae were present, and as to whether the locations of minutiae were sufficiently clear to be certain of the presence or absence of minutiae. Table 11 and Fig. 10 show for every minutia (n = 44,941) the distribution of clarity assigned to that location by other examiners, regardless of whether the other examiners marked a minutia at that location. When an examiner marked a minutia in an area that that examiner described as unclear, other examiners were about equally likely to describe that area as clear or unclear.
Fig. 11. Examiner B clarity by examiner A clarity at each cluster center. Same data as Table 12, shown graphically, color-coded by examiner B clarity.

Table 14
Percentage of minutiae that are "relatively far" (more than 0.1", about 5 ridge intervals on average) or "very far" (more than 0.2", about 10 ridge intervals) from the nearest majority cluster, by phase and minutia clarity. The total minutia count is limited to latents that had at least one majority cluster. For corresponding minutiae, distance is measured to the nearest cluster that was marked and corresponded by a majority of comparing examiners. (Analysis phase, n = 44,729; another 212 minutiae were marked on latents having no majority clusters.)
Table 12 and Fig. 11 show for every cluster center (n = 10,324) the distribution of clarity assigned to that location by pairs of examiners, regardless of whether those examiners marked a minutia at that location. Selecting examiner pairs and cluster centers at random, the probability of the two examiners agreeing whether to describe that location as clear vs. unclear was 65%. Table 13 shows for every minutia marked (n = 44,941) the distribution of clarity assigned to that location by other examiners, conditioned on whether the second examiner marked at that location. When a second examiner agreed on the presence of a minutia, that examiner was much more likely to describe the location as clear, whereas if the second examiner did not mark the minutia, that examiner was likely to describe the location as unclear.
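The pairwise agreement probability can be sketched by averaging, over cluster centers weighted equally, the fraction of examiner pairs that agree on clear vs. unclear at that center (clarity binarized as 1 = clear, 0 = unclear; the data layout is an assumption for illustration):

```python
from itertools import combinations

def pairwise_clarity_agreement(clarity_by_cluster):
    """clarity_by_cluster: one list per cluster center giving each examiner's
    clear/unclear (1/0) call at that location. Returns the probability that a
    randomly chosen pair of examiners agrees at a randomly chosen cluster,
    with each cluster weighted equally."""
    per_cluster = []
    for calls in clarity_by_cluster:
        pairs = list(combinations(calls, 2))
        per_cluster.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_cluster) / len(per_cluster)
```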

Differences in regions with marked minutiae
Some examiners mark minutiae far away from those marked by other examiners. This may be due to disagreements regarding the boundaries of the impression being considered (i.e., the region of interest), or disagreements on which areas in the region of interest are of sufficient quality to mark minutiae. Table 14 describes what proportion of minutiae were marked far from the nearest majority cluster. Fig. 12 (Analysis phase) and Fig. 13 (corresponding minutiae, Comparison phase) show the distributions of the distances from marked minutiae to the nearest majority cluster.
Fig. 13. Distance of corresponding minutiae to the nearest cluster corresponded by a majority of comparing examiners, by examiner latent clarity. Distance is measured in units of 0.001". The set of majority clusters was limited to those in which at least three examiners marked corresponding minutiae; "majority" was calculated among those examiners who marked at least one correspondence on the image pair. (Comparison phase, n = 27,486; another 454 corresponding minutiae were marked on latents having no majority cluster.)
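The "relatively far"/"very far" proportions in Table 14 reduce to a nearest-neighbor distance computation. A sketch, with coordinates in inches:

```python
from math import hypot

def far_fraction(minutiae, majority_centers, threshold=0.1):
    """Fraction of minutiae lying more than `threshold` inches from the
    nearest majority-cluster center (0.1" for 'relatively far',
    0.2" for 'very far')."""
    def nearest(p):
        return min(hypot(p[0] - c[0], p[1] - c[1]) for c in majority_centers)
    return sum(nearest(p) > threshold for p in minutiae) / len(minutiae)
```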

Consensus and sufficiency (Analysis and comparison phases)
Previously, we reported [2] that the number of minutiae annotated by examiners is strongly associated with their own value and comparison determinations, and that seven minutiae was an approximate "tipping point": "for any minutia count greater than seven, the majority of value determinations were VID, and for any corresponding minutia count greater than seven, the majority of comparison determinations were individualization." Across multiple examiners, a mean of seven corresponding minutiae was also the point at which approximately 50% of examiners individualized (approximately 50% of examiners assessed latents to be VID when the mean minutia count was seven).
Here we report similar thresholds as measured by consensus on minutia clusters. We find counts of majority clusters comparable to mean minutia counts as predictors of examiner determinations. For example, when predicting VID determinations using logistic regression, r² = 0.4253 for mean minutia counts vs. r² = 0.4310 for majority clusters. As shown in Fig. 14, these majority cluster statistics are highly correlated with the mean number of minutiae, which tends to be slightly larger than the number of majority clusters.
As shown in Figs. 15 and 16A, latents with fewer than 5 majority clusters were usually not assessed as VID; latents with 10 or more majority clusters were usually assessed to be VID. Fig. 16B shows a similar association for clusters corresponded by the majority of comparing examiners: almost all image pairs with 7 or more clusters that were corresponded by a majority of comparing examiners were individualized by the majority of examiners; almost no image pairs with 5 or fewer majority corresponding clusters were individualized by the majority of examiners.
In [2] we included several figures to show the association between minutia counts and value determinations, and between corresponding minutia counts and comparison determinations. Fig. 17 is comparable to Fig. 5 of [2] except that it includes a data series for the number of clusters corresponded by a majority of examiners who compared the image pair; it also includes data for both mated and nonmated image pairs. In general, the number of majority clusters tends to be approximately equal to the mean minutia count.

Reproducibility of analysis-comparison changes
As previously reported, examiners often modified their latent Analysis markup during the Comparison phase [7]. For each pair of latent markups (Analysis and Comparison phases), we classified features as retained, moved, deleted, or added. A retained feature is one that is present at exactly the same pixel location in both markups; a moved feature is one that was deleted during Comparison and replaced by another within 0.5 mm (approximately one ridge width); a deleted feature is one that was present in the Analysis markup only (no Comparison feature within 0.5 mm); an added feature is one that was present in the Comparison markup only (no Analysis feature within 0.5 mm). Tables 15 and 16 show that deleted and added minutiae are strongly associated with low reproducibility. This association is stronger in clear areas than unclear areas: using logistic regression to predict deletions and additions from minutia reproducibility, we find that for deleted minutiae, r² = 0.1243 (clear) and 0.0686 (unclear); for added minutiae, r² = 0.0640 (clear) and 0.0332 (unclear).
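The retained/moved/deleted/added scheme follows directly from those definitions. A sketch, with coordinates in mm, exact-location matching for retained features, and a 0.5 mm radius for moved features:

```python
from math import hypot

def classify_changes(analysis, comparison, radius=0.5):
    """Classify minutiae between Analysis and Comparison markups per the
    definitions in the text: retained = same location in both markups;
    moved = Analysis-only point with a Comparison-only point within
    `radius` mm; deleted/added = present in only one markup with no
    counterpart nearby."""
    def near(p, pts):
        return any(hypot(p[0] - q[0], p[1] - q[1]) <= radius for q in pts)
    retained = [p for p in analysis if p in comparison]
    only_a = [p for p in analysis if p not in comparison]
    only_c = [q for q in comparison if q not in analysis]
    moved = [p for p in only_a if near(p, only_c)]
    deleted = [p for p in only_a if not near(p, only_c)]
    added = [q for q in only_c if not near(q, only_a)]
    return retained, moved, deleted, added
```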
Having shown that reproducibility and clarity are strongly associated, we took a closer look at how reproducibility and clarity are associated with changes. We used logistic regression to model deleted and added minutiae as responses to reproducibility and clarity. Predicting deleted minutiae from reproducibility and examiner clarity (r² = 0.1114), only the reproducibility term is significant; clarity provides no additional information (using median clarity makes no meaningful improvement to the model: r² = 0.1116). Predicting added minutiae from reproducibility and examiner clarity (r² = 0.0762), both terms are significant, though the reproducibility term contributes much more than clarity (predicting added minutiae from reproducibility alone results in r² = 0.0682; from examiner clarity alone, r² = 0.0271; from median clarity alone, r² = 0.0359). Examiners are more likely to add minutiae in low-clarity areas even after accounting for reproducibility of those minutiae. Our ability to predict deleted minutiae is not further improved by knowing clarity after accounting for reproducibility.
The net effect on minutia reproducibility was to increase from the Analysis to Comparison phase, but only for those latents compared to mated exemplars (not for those compared to nonmated exemplars). Fig. 19 shows this effect on a subset of 19 latents, each of which was assigned in both mated and nonmated image pairs; this subset controls for any differences in how latents were selected for the  mated and nonmated pairs. Minutia reproducibility for mated pairs increased in both clear and unclear areas, which is generally representative of what was observed across all latents. For further discussion of how changes in markup relate to whether or not the exemplar was mated, see [7].

Probability of correspondence
The probability of examiners corresponding marked minutiae was correlated with the reproducibility of those minutiae. Fig. 20 shows the probability of examiners corresponding minutiae as estimated by four logistic regression models, one for each combination of clarity (as marked by that examiner) and whether the examiner individualized.
Table 16. Reproducibility of Comparison minutiae by clarity and change type (n = 46,119 Comparison-phase minutiae). Data are limited to 2957 comparisons of 313 image pairs, which excludes markups where either the latent or exemplar was assessed to be NV and some data collection problems (detailed in [7]).

Reproducibility of corresponding minutiae
In our previous work [2], we noted that "disagreements on sufficiency for individualization tend to be associated with substantial disagreements on corresponding minutiae." Tables 17 through 20 describe reproducibility by type of correspondence markup as conditional probabilities: when examiner A marked a minutia, what did examiner B do? Table 17 summarizes reproducibility across all data; Tables 18 through 20 summarize reproducibility on subsets of the data. The probabilities are calculated as weighted sums over all other examiners who marked each latent, such that each minutia marked by examiner A is weighted equally. The final column, "Marked and compared minutiae that were definitely corresponded," is the probability that examiner B definitely corresponded a minutia given that examiner B marked that minutia and compared the latent to the exemplar. For example, Table 17 shows that when examiners corresponded minutiae marked as clear, 68.8% of the time other examiners also corresponded those minutiae; 20.0% of the time other examiners did not mark those minutiae at all. The data in these tables is limited to the 3618 markups described in Section 1.4.
Table 17 shows the very substantial interexaminer differences as to which minutiae were marked. Often when one examiner said a latent was NV, other examiners corresponded minutiae on that latent (recall that fingerprint comparisons in this test were selected to be borderline value). In addition to marking "definite" correspondences, examiners were instructed to indicate discrepancies (features in one print that definitely do not exist in the other print) as needed to support an exclusion determination. Examiners were also permitted to mark "debatable" correspondences: features "that potentially correspond, but do not meet your threshold for supporting an ID." The correspondences referred to in [1] include only "definite" correspondences.
Whereas definite correspondences occurred much more often in clear than unclear areas (3x), debatable correspondences occurred about equally in clear and unclear areas. After controlling for clarity, minutiae that were marked as debatable correspondences have a similar, but slightly lower, reproducibility distribution to all minutiae.
Similar to the preceding tables, Tables 21 and 22 describe reproducibility by type of correspondence markup and whether the examiners changed their Analysis markup during Comparison. Fig. 21 shows the distribution of the proportion of examiners who corresponded each cluster, by clarity, among examiners who compared each image pair; Fig. 22 shows similar data limited to examiners who individualized the image pairs. These charts show that while consensus is generally low in unclear areas, consensus is mixed in clear areas: often a minority of examiners correspond minutiae in clear areas.

Reproducibility of minutia with respect to exclusion determinations
Responses included 561 exclusions on 81 mated and 75 nonmated pairs. When examiners determined that the latent and exemplar were not from the same source, they were asked to indicate a reason for the exclusion. Table 24 summarizes the distribution of reasons given. The distributions were not substantially different for nonmated and mated pairs (true and false exclusions). For 80% of exclusions, the reason given was "one or more minutiae differ." There were 25 mated pairs and 70 nonmated pairs that more than one examiner excluded. Agreement on exclusion reasons beyond chance was low. For example, the probability that examiner B said "minutiae differ" given that examiner A said "minutiae differ" was 67% for mated pairs and 48% for nonmated pairs (each image pair weighted equally).
When examiners said "minutiae differ," they usually did not mark discrepancies (discrepancies were marked for 34% of mates, 42% of nonmates, 40% overall). Agreement on discrepancies was greater than chance, but not substantially. There were 47 image pairs on which at least two examiners marked discrepancies.
Upon completing the examinations that resulted in exclusions, examiners had marked 1744 minutiae (in 1264 clusters) on mated latents, 123 (7.1%) of which were marked as discrepant, and 4901 minutiae (in 1703 clusters) on nonmated latents, 425 (8.7%) of which were marked as discrepant. As shown in Table 25, there were 18 clusters with 3 discrepancies marked and 8 clusters with 4 discrepancies marked on nonmated image pairs (vs. 7 and 1 predicted from simulations that randomly assigned the "discrepant" label across the minutiae at the average rates for mates and nonmates). Table 26 describes agreement on the marking of discrepancies. When discrepancies were marked, they were more likely to be in clusters marked by many examiners; this pattern largely reflects chance, since a cluster marked by more examiners presents more opportunities for some examiner to note a discrepancy.
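The chance baseline can be approximated by a simulation along these lines. This is a simplified sketch assuming only a list of cluster sizes and a fixed per-minutia labeling rate; the study's actual simulations may have differed in detail.

```python
import random

def simulate_discrepancy_clusters(cluster_sizes, rate, trials=1000, seed=1):
    """Randomly assign the 'discrepant' label to minutiae at a fixed
    per-minutia rate and count, per trial, how many clusters end up
    with 3 or more discrepant minutiae; returns the mean count.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        n_heavy = 0
        for size in cluster_sizes:
            # Each minutia in the cluster is independently labeled discrepant.
            hits = sum(rng.random() < rate for _ in range(size))
            n_heavy += hits >= 3
        totals.append(n_heavy)
    return sum(totals) / trials
```

Comparing the simulated mean against the observed counts (18 and 8 on nonmated pairs vs. 7 and 1 predicted) indicates how much the observed co-marking of discrepancies exceeds chance.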

Variation in minutia locations
In order to better understand the lack of reproducibility, we clustered minutiae marked on the exemplars and then looked to see how these exemplar clusters corresponded to latent clusters. We expected to find many examples of exemplar clusters whose corresponding minutiae on the latents had not been assigned to a single cluster because of variation in the precise location at which examiners marked minutiae in unclear areas on the latent.
Clustering was performed on the 3618 exemplar markups (Comparison phase) described in Section 1.4, using the same clustering procedures and parameters as were used for the latents (3). Although clustering was performed on all minutiae marked on the exemplars, our analyses of variation in minutia locations focused on a subset of those minutiae that examiners marked as corresponding. In defining this subset, an additional 60 markups were omitted because of documentation errors in how the correspondences were marked. Most of these omitted markups were initially identified on the basis of having abnormally high bending energy (a measure of the nonlinear component of the relative distortion between the minutiae marked on the latent and exemplar) [11,12]. Each of the omitted markups was manually reviewed, and most were identified as having "crossed" correspondences that were clearly incorrect (and presumably inadvertent documentation errors).
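Bending energy of this kind can be computed from a thin-plate-spline fit between the corresponded point sets; it is zero for a purely affine (linear) distortion and grows with nonlinear warping. The following is a minimal sketch using the classical thin-plate-spline formulation, not the study's implementation.

```python
import numpy as np

def tps_bending_energy(src, dst):
    """Thin-plate-spline bending energy between two 2-D point sets:
    zero for a purely affine mapping, positive for nonlinear warps."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    n = len(src)
    # Radial basis U(r) = r^2 log r^2, with U(0) = 0.
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=2)
    K = np.where(d2 == 0, 0.0, d2 * np.log(np.where(d2 == 0, 1.0, d2)))
    P = np.hstack([np.ones((n, 1)), src])  # affine terms [1, x, y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    Y = np.zeros((n + 3, 2))
    Y[:n] = dst
    coeffs = np.linalg.solve(L, Y)
    W = coeffs[:n]                         # non-affine warp weights
    return float(np.trace(W.T @ K @ W))    # summed over x and y
```

Markups whose bending energy is far above the norm for their image pair are candidates for review, which is how most of the 60 omitted markups were flagged.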
13,397 clusters were constructed from the 41,071 minutiae on the 3618 markups; 27,159 of these minutiae were marked as corresponding (after omitting the documentation errors). The 27,159 corresponding minutiae were contained in 5470 clusters on the exemplars and corresponded to 5794 clusters on the latents. Table 27 summarizes correspondences among latent and exemplar clusters. 15% (830/5470) of exemplar clusters were corresponded to more than one latent cluster; 9% (538/5794) of latent clusters were corresponded to more than one exemplar cluster. 31% (1672/5470) of exemplar clusters were corresponded to only one latent cluster simply because only one minutia within the cluster was corresponded; similarly, 35% (2015/5794) of latent clusters. Just as most minutiae were marked in median clear areas, this variation in the location at which examiners marked minutiae was most often observed in median clear areas: although examiners could be confident in the presence of these minutiae, certain aspects of clarity can interfere more with determining the precise location of minutiae than with determining their presence or absence. Variation in location (together with the clustering criteria) accounts for most of the lack of one-to-one correspondence between latent and exemplar clusters; examples of incorrect alignment of the latent and exemplar were also noted.

Table 24 Exclusion reasons. Examiners were instructed to select the first option that applied. The exclusion reason was missing for one comparison. (Columns: Exclusion reason, Mates, Nonmates.)

Table 25 Counts of discrepant minutiae among clusters on exclusion determinations by whether the cluster was a singleton. For example, 97 clusters on mated pairs that were marked by more than one examiner ("Not singleton") were marked as discrepant by exactly one examiner. In no case did more than four examiners mark a minutia as discrepant.

                         Mates                       Nonmates
Number of discrepancies  0     1    2  3  Total      0     1    2   3   4  Total
Singleton                252   17   0  0  269        663   48   0   0   0  711
Not singleton            894   97   3  1  995        714   212  40  18  8  992
Total clusters           1146  114  3  1  1264       1377  260  40  18  8  1703

Table 26 Percentage of clusters marked as discrepant by any comparing examiner by Comparison-phase consensus.
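The cluster fan-out statistics (the fraction of exemplar clusters corresponded to more than one latent cluster, and vice versa) reduce to simple set bookkeeping. This is a minimal sketch assuming one (exemplar cluster, latent cluster) pair per corresponded minutia, a hypothetical representation for illustration.

```python
from collections import defaultdict

def cluster_fanout(correspondences):
    """Given one (exemplar_cluster, latent_cluster) pair per corresponded
    minutia, return the fraction of exemplar clusters corresponded to
    more than one latent cluster, and the reverse fraction."""
    ex_to_lat = defaultdict(set)
    lat_to_ex = defaultdict(set)
    for ex, lat in correspondences:
        ex_to_lat[ex].add(lat)
        lat_to_ex[lat].add(ex)
    frac_ex = sum(len(v) > 1 for v in ex_to_lat.values()) / len(ex_to_lat)
    frac_lat = sum(len(v) > 1 for v in lat_to_ex.values()) / len(lat_to_ex)
    return frac_ex, frac_lat
```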