An algorithm for decoy-free false discovery rate estimation in XL-MS/MS proteomics

Abstract

Motivation: Cross-linking tandem mass spectrometry (XL-MS/MS) is an established analytical platform used to determine distance constraints between residues within a protein or between physically interacting proteins, thus improving our understanding of protein structure and function. To aid biological discovery with XL-MS/MS, pairs of chemically linked peptides must be accurately identified, a process that requires (i) a database search, which creates a ranked list of candidate peptide pairs for each experimental spectrum, and (ii) false discovery rate (FDR) estimation, which determines the probability of a false match in a group of top-ranked peptide pairs with scores above a given threshold. Currently, the only available FDR estimation mechanism in XL-MS/MS is the target-decoy approach (TDA). However, despite its simplicity, TDA has both theoretical and practical limitations that affect estimation accuracy and increase run time relative to potential decoy-free approaches (DFAs).

Results: We introduce a novel decoy-free framework for FDR estimation in XL-MS/MS. Our approach relies on multi-sample mixtures of skew normal distributions, where the latent components correspond to the scores of correct peptide pairs (both peptides identified correctly), partially incorrect peptide pairs (one peptide identified correctly, the other incorrectly), and incorrect peptide pairs (both peptides identified incorrectly). To learn these components, we exploit the score distributions of first- and second-ranked peptide-spectrum matches for each experimental spectrum and subsequently estimate FDR using a novel expectation-maximization algorithm with constraints. We evaluate the method on ten datasets and provide evidence that the proposed DFA is theoretically sound and a viable alternative to TDA owing to its good performance in terms of accuracy, variance of estimation, and run time.

Availability and implementation: https://github.com/shawn-peng/xlms


Introduction
Cross-linking mass spectrometry (XL-MS) proteomics has emerged as a key technique in molecular and structural biology, particularly for the exploration of protein assemblies and protein-protein interactions under native cellular conditions (Sinz 2003, 2006, Yu and Huang 2018, Piersimoni et al. 2021). In a typical experiment, a chemical reagent (linker), capable of forming covalent bonds with side chains of specific residues (e.g. lysine) on each end, is first introduced to the protein mixture. The sample is then digested and processed using liquid chromatography (LC) coupled with tandem mass spectrometry (MS/MS) to identify pairs of inter- or intra-protein cross-linked peptides. Since the chemically linked residues must be located within the distance of the spacer arm of the linker (e.g. 10-30 Å), the experiment provides a set of distance constraints that can be key to resolving protein structure and interaction sites, often in combination with other techniques (Rappsilber 2011, Piersimoni et al. 2021). XL-MS also offers distinct advantages, including the rapid interrogation of protein isoforms, post-translationally modified proteins, membrane proteins, and disordered proteins (Piersimoni et al. 2021). The digestion step in XL-MS/MS further removes restrictions on protein size, localization, or conformational landscape, although the dynamic range of modern analytical instrumentation still limits studies to relatively abundant proteoforms.
Despite its promise, XL-MS/MS has not fully matured and a number of challenges remain (Piersimoni et al. 2021). One key challenge relates to the data processing pipelines, particularly the computational and statistical difficulties associated with the identification and quantification of cross-linked peptides (Yang et al. 2012). XL-MS/MS peptide identification follows a workflow similar to that of traditional MS/MS (Steen and Mann 2004). The first step is database search (Yates et al. 1995, Perkins et al. 1999, Kim and Pevzner 2014, Kong et al. 2017), where the experimental spectra from the instrument are searched against a database of theoretical spectra of peptides (peptide pairs in XL-MS/MS) whose mass (sum of masses, including the linker) is within the instrument's tolerance of the measured mass. This produces a ranked list of peptides (peptide pairs) for each experimental spectrum, or peptide-spectrum matches (PSMs), with only the top-ranked PSM eligible for downstream identification if its score is sufficiently large. The second step is a procedure devised to control the error rate of identifications (Storey 2002, Choi and Nesvizhskii 2008, Aggarwal and Yadav 2016, Burger 2018), with the objective of determining the threshold above which all top PSMs will be considered identified with the false discovery rate (FDR) below a predetermined value, e.g. FDR = 1%.
In XL-MS/MS, both steps add complexity to the traditional pipeline. For example, an XL-MS/MS search engine must scan a database of peptide pairs instead of single peptides (Rinner et al. 2008, Hoopmann et al. 2015, Ji et al. 2016, Netz et al. 2020), thus increasing the run time (quadratically) and the competition for each experimental spectrum.
Similarly, FDR control is more difficult in part because the incorrect identifications include PSMs where both peptides in a pair are incorrectly identified and also PSMs where one peptide is correctly identified and the other is not (Walzthoeni et al. 2012). Depending on the fragmentation patterns and the search engine, these so-called partially incorrect identifications can have relatively high scores. Overall, the increased search space and the complexity of error control both contribute to the smaller fraction of identified spectra compared to the traditional MS/MS, for the same estimated FDR (Piersimoni et al. 2021).
To control for FDR, traditional MS/MS platforms rely on both target-decoy (TDA) and decoy-free (DFA) approaches, with TDAs being the preferred option. TDAs typically search a database of peptides that are potentially present in the sample (target sequences) together with an equal-sized set of peptides that cannot be present in the sample (decoy sequences; often reversed target sequences). The high-scoring decoy PSMs are then used to estimate the number of false target identifications (Elias and Gygi 2007, Jeong et al. 2012). In contrast, DFAs typically fit two-component mixture models to the distribution of top PSM scores, where one of the latent components corresponds to the correct and the other to incorrect identifications. The expectation-maximization (EM) algorithm is then used to resolve the component distributions using parametric families such as Gaussian, gamma, or skew normal (SN) (Keller et al. 2002, Li 2008, Peng et al. 2020). In XL-MS/MS, by contrast, TDA is the exclusive error control mechanism (Walzthoeni et al. 2012); the absence of DFAs may be due to the fact that the set of top-ranked PSM scores cannot be easily modeled using parametric two-component mixtures, especially the heterogeneous distribution of incorrect PSMs.
Despite their simplicity and prevalence in standard workflows, TDAs exhibit important theoretical and practical limitations (Käll et al. 2008a,b, Gupta et al. 2011, Cooper 2011, 2012, He et al. 2015, Danilova et al. 2019, Peng et al. 2020). Theoretically, FDR estimates can exceed 1, and the strategy of competing target with decoy peptides for the available experimental spectra is problematic and may lead to biased estimates. Practically, the search time for TDA is increased, it cannot be applied to de novo searches (Dancik et al. 1999, Frank and Pevzner 2005), it shows high variance of the score cutoffs at low FDR (Peng et al. 2020), and the estimation is inaccurate for samples with low amounts of biological material (Li et al. 2015, Budnik et al. 2018, Peng et al. 2020). The run-time problems are further amplified in XL-MS/MS, quadrupling the search times over those that could be achieved with DFAs.
To address these problems, this study introduces a novel DFA for FDR estimation in XL-MS/MS. We exploit the score distributions of top-ranked and second-ranked PSMs and model them as five-component mixtures (with shared parameters) from the SN family. We then devise a multi-sample EM algorithm with constraints to resolve the latent components. Our results show that this method holds promise for enhancing the accuracy and reliability of XL-MS/MS studies and is an attractive alternative to TDAs.

Terminology and notation
Let X = {x_i} be a set of spectra collected from a mass spectrometer and P = {(p_α, p_β)_j} a set of candidate pairs of (sorted) peptides. A search engine produces a set of triplets (x, (p_α, p_β), s) ∈ X × P × ℝ, where s is the score assigned to the PSM (x, (p_α, p_β)). The higher the score, the more likely that the spectrum x was generated from (p_α, p_β).
Now let x be generated from an unknown peptide pair (q_α, q_β) and let ((x, (p_α, p_β)_1, s_1), (x, (p_α, p_β)_2, s_2), ...) be a ranked list of PSMs from a search engine for x such that s_1 ≥ s_2 ≥ .... A PSM (x, (p_α, p_β)) for which (p_α, p_β) = (q_α, q_β) is called a correct match. If only one of the peptides matches the ground truth, we refer to the PSM as a partially incorrect match, whereas all other PSMs involving x are called incorrect matches. Furthermore, given the ranked list of PSMs, the PSM with the highest score, (x, (p_α, p_β)_1), is called the top-ranked or first PSM, (x, (p_α, p_β)_2) is called the second-ranked or second PSM, etc. We similarly distinguish between partially incorrect and incorrect PSMs.
An MS/MS analysis pipeline looks at top-ranked PSMs and determines a threshold τ such that the pair of peptides (p_α, p_β) from each top hit (x, (p_α, p_β)) is considered identified when the score s from (x, (p_α, p_β), s) satisfies s ≥ τ. If, further, (p_α, p_β) = (q_α, q_β), it is considered a correct identification. The threshold τ must be established to satisfy the desired estimated FDR for the biological study.

Skew normal distributions
Azzalini (1985) introduced the SN family of distributions as a generalization of the normal family that allows for skewness. It has a location (μ), a scale (σ), and a shape (λ) parameter, where λ controls the skewness. The distribution is right skewed when λ > 0, left skewed when λ < 0, and reduces to a normal distribution when λ = 0. The probability density function (pdf) of a random variable S ~ SN(μ, σ, λ) is given by

f(s; μ, σ, λ) = (2/σ) φ((s − μ)/σ) Φ(λ(s − μ)/σ),

where μ, λ ∈ ℝ, σ ∈ ℝ⁺, and φ and Φ are the pdf and the cumulative distribution function (cdf) of N(0, 1), respectively. Alternatively, the SN family can be parameterized by Δ and Γ (Table 1), instead of λ and σ. The alternative parametrization arises naturally in the following stochastic representation of an SN random variable (Henze 1986)

S =_d μ + ΔT + Γ^(1/2) U,

where T ~ TN⁺(0, 1), the standard normal distribution truncated below 0; U ~ N(0, 1); and =_d reads as "equal in distribution". The stochastic representation is exploited in developing EM algorithms for maximum likelihood estimation of SN distributions and their mixtures (Lin et al. 2007, Lin 2009).

Target-decoy approach

In XL-MS/MS, the TDA estimates the FDR at a threshold τ as

FDR_TDA(τ) = (TD(τ) − DD(τ)) / TT(τ),

where TT(τ) is the number of (top) PSMs matched to target sequences for both peptides, TD(τ) is the number of PSMs with exactly one peptide matched to decoy sequences, and DD(τ) is the number of PSMs with both peptides matched to decoy sequences, all with scores above τ.
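As a concrete illustration, the SN pdf and the stochastic representation above can be sketched in Python (a minimal sketch assuming NumPy/SciPy; the function names are ours, not from the paper's implementation):

```python
import numpy as np
from scipy.stats import norm, skewnorm, truncnorm  # skewnorm only used for cross-checking


def sn_pdf(s, mu, sigma, lam):
    """Skew normal pdf: f(s) = (2/sigma) * phi(z) * Phi(lam * z), z = (s - mu)/sigma."""
    z = (s - mu) / sigma
    return 2.0 / sigma * norm.pdf(z) * norm.cdf(lam * z)


def sn_sample(n, mu, sigma, lam, rng=None):
    """Sample via the Henze (1986) stochastic representation:
    S = mu + Delta*T + sqrt(Gamma)*U, with T ~ TN+(0,1), U ~ N(0,1),
    delta = lam/sqrt(1+lam^2), Delta = sigma*delta, Gamma = sigma^2*(1-delta^2)."""
    rng = np.random.default_rng(rng)
    delta = lam / np.sqrt(1.0 + lam ** 2)
    Delta, Gamma = sigma * delta, sigma ** 2 * (1.0 - delta ** 2)
    T = truncnorm.rvs(0.0, np.inf, size=n, random_state=rng)  # standard normal truncated below 0
    U = rng.standard_normal(n)
    return mu + Delta * T + np.sqrt(Gamma) * U
```

`sn_pdf` agrees with `scipy.stats.skewnorm`, and the sample mean of `sn_sample` approaches μ + Δ·sqrt(2/π), the known SN mean under this representation.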

Data and database search
We downloaded ten XL-MS/MS datasets from the PRoteomics IDEntifications (PRIDE) database (Perez-Riverol et al. 2022), as shown in Table 2. The data were downloaded as raw spectra. We used ProteoWizard to convert the raw spectra into peak files in the .mzML format. After the peaks were identified, we used OpenPepXL (Netz et al. 2020) to run a search for each file against the protein sequence database described in the original paper. For each file, we ran two searches, one with decoy sequences for the TDA experiments and another without decoy sequences for the DFA experiments.
To run a TDA search, we concatenated the original database with reversed protein sequences.The search parameters for each dataset were also identical to those in the original papers.
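Given the per-PSM target/decoy labels produced by such a search, the TDA estimate can be computed as in the following sketch (a simplified stand-in; the (TD − DD)/TT form follows Walzthoeni et al. 2012, and `tda_fdr` is our illustrative name, not an OpenPepXL function):

```python
def tda_fdr(scores, labels, tau):
    """Estimate the TDA FDR at threshold tau for XL-MS/MS top PSMs.
    labels[i] in {'TT', 'TD', 'DD'}: both peptides target, exactly one decoy,
    or both decoy. Uses the xProphet-style estimator FDR = (TD - DD) / TT
    over PSMs with score >= tau."""
    tt = sum(1 for s, l in zip(scores, labels) if s >= tau and l == 'TT')
    td = sum(1 for s, l in zip(scores, labels) if s >= tau and l == 'TD')
    dd = sum(1 for s, l in zip(scores, labels) if s >= tau and l == 'DD')
    return (td - dd) / tt if tt else 0.0
```

The DD count is subtracted because a PSM with both peptides from decoys estimates the chance that a TD match is itself a double error rather than a half-correct target match.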
After the search was completed, we extracted the scores for the top two PSMs corresponding to each experimental spectrum. If a spectrum matched only one candidate peptide pair, we marked the second score as missing. If the top two candidates for a spectrum had the same peptides but differed in the position of the cross-linked residues, we retained only the top one and promoted the third-ranked peptide pair, if available, to the second position.

Approach
To estimate the FDR in an XL-MS/MS search, we build on our mixture approach for the traditional LC-MS/MS search (Peng et al. 2020). The proposed method uses a mixture of SN distributions to model the PSM scores not only for the top hits but also for the second hits, to improve the estimation of the latent components.
Let S1 and S2 be random variables giving the top and second PSM scores for a spectrum, respectively, where S1 and S2 can come from a correct match (both chains matched to the correct peptide pair), a partially incorrect match (only one chain matched to the correct peptide), or an incorrect match (both chains matched to incorrect peptides). We define C as the random variable giving the score corresponding to the correct match; J1 and J2 as those for the highest- and second-highest-scoring partially incorrect matches, respectively; and I1 and I2 as those for the highest- and second-highest-scoring incorrect matches, respectively. Note that S1 and S2 are observed in the data, whereas C, J1, J2, I1, and I2 are not observed (latent). Our approach to FDR estimation relies on modeling S1 and S2 as mixtures with C, J1, J2, I1, and I2 as latent components, and fitting the data to uncover their distributions. As justified by our experimental results, incorporating the second score has advantages over a model based on the top score only. We model each latent variable using an SN distribution, i.e. X ~ SN(θ_X) for X ∈ {C, J1, J2, I1, I2}.
We next describe our model and estimation algorithm, focusing on the approach that uses both top- and second-ranked PSMs. In Section 5, however, we evaluate this algorithm against TDA and against a one-sample model that relies only on the top-ranked PSMs.

Two-sample statistical model
We model S1 as a mixture with SN(θ_C), SN(θ_J1), and SN(θ_I1) as components, since the top score can only come from the correct match (C), the top-scoring partially incorrect match (J1), or the top-scoring incorrect match (I1). Now, S2 may also come from C (when S1 ≠ C), J1 (when S1 ≠ J1), or I1 (when S1 ≠ I1). However, it can alternatively come from J2 (when S1 = J1 and J2 is greater than C, I1, and I2) or I2 (when S1 = I1 and I2 is greater than C, J1, and J2). Hence, we model the two scores as the mixtures

f_S1(s) = w_C f_SN(s; θ_C) + w_J1 f_SN(s; θ_J1) + w_I1 f_SN(s; θ_I1),
f_S2(s) = v_C f_SN(s; θ_C) + v_J1 f_SN(s; θ_J1) + v_J2 f_SN(s; θ_J2) + v_I1 f_SN(s; θ_I1) + v_I2 f_SN(s; θ_I2),

where w_X > 0 and Σ_X w_X = 1 for X ∈ {C, J1, I1}, and v_X > 0 and Σ_X v_X = 1 for X ∈ {C, J1, J2, I1, I2}, give the mixing proportions (weights) of the components within the mixtures. The sharing of SN(θ_C), SN(θ_J1), and SN(θ_I1) between the two mixtures allows incorporating information from both scores to learn the parameters θ_C, θ_J1, and θ_I1. The additional components, SN(θ_J2) and SN(θ_I2), in the second mixture and the differing mixing proportions allow capturing the distributional differences between S1 and S2.
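A direct way to see the parameter sharing is to evaluate both mixture densities against a single set of component parameters (all numeric values below are hypothetical, chosen for illustration only):

```python
import numpy as np
from scipy.stats import skewnorm

# Hypothetical component parameters (mu, sigma, lam) and weights, for illustration only.
theta = {'C': (40, 8, 2), 'J1': (25, 7, -1), 'J2': (20, 7, -1),
         'I1': (15, 6, -2), 'I2': (10, 6, -2)}
w = {'C': 0.4, 'J1': 0.25, 'I1': 0.35}                           # top-score mixture
v = {'C': 0.001, 'J1': 0.3, 'J2': 0.25, 'I1': 0.3, 'I2': 0.149}  # second-score mixture


def f_s1(s):
    """Density of the top score S1: components C, J1, I1 only."""
    return sum(w[k] * skewnorm.pdf(s, theta[k][2], loc=theta[k][0], scale=theta[k][1])
               for k in w)


def f_s2(s):
    """Density of the second score S2: shares theta_C, theta_J1, theta_I1 with f_s1,
    and adds the J2 and I2 components with its own weights."""
    return sum(v[k] * skewnorm.pdf(s, theta[k][2], loc=theta[k][0], scale=theta[k][1])
               for k in v)
```

Because the three shared components appear in both densities, every observed pair (s1, s2) contributes information about θ_C, θ_J1, and θ_I1.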

Constraints
In addition to the mixture formulation with parameter sharing, we incorporate inequality constraints on the mixing proportions (weights) that naturally emerge due to the latent structure between the top two scores.Additionally, we incorporate intuitive constraints between the density functions of the latent variables.

Weight constraints
Because the top two scores come from the same spectrum, they cannot come from the same component. Furthermore, the second score can be J2 or I2 only if the top score comes from J1 or I1, respectively. These observations lead to several inequality constraints between the mixing proportions. For example, when the top score comes from C, the second score has to come from J1 or I1; it cannot be J2 or I2 because those are lower than J1 and I1, respectively. This implies that the proportion of C in the top score is upper bounded by the sum of the proportions of J1 and I1 in the second score, i.e. w_C ≤ v_J1 + v_I1. The constraint set A was derived in this manner.
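A sketch of checking the example constraint above, together with the standard simplex conditions (only the single inequality stated in the text; the remaining inequalities of sets A/B are derived in the paper and Supplementary materials):

```python
def check_weight_constraints(w, v, tol=1e-9):
    """Check the simplex conditions on the mixing proportions and the example
    inequality w_C <= v_J1 + v_I1 (if the top score is the correct match, the
    second score must come from J1 or I1). Partial check for illustration only."""
    simplex_ok = (abs(sum(w.values()) - 1.0) < tol
                  and abs(sum(v.values()) - 1.0) < tol
                  and all(x > 0 for x in list(w.values()) + list(v.values())))
    return simplex_ok and (w['C'] <= v['J1'] + v['I1'] + tol)
```

Each remaining inequality in the set is obtained by the same style of argument about which components the second score can occupy given the component of the top score.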
In our implementation, we modify the constraint set A to account for cases where the search returns only a single match for a spectrum, i.e. when the second score is missing. We observed, however, that the second scores are not missing at random: missing second scores occur preferentially at the left tail of the top-score distribution (see Supplementary materials). Due to this non-random missingness, the constraint set A becomes invalid. To address this issue, we derive the constraint set B, which gives a set of valid constraints irrespective of the non-random nature of the missing second scores (derivation in Supplementary materials). These constraints explicitly incorporate the proportion of spectra with missing second scores, v_Φ, as a constant; for brevity, the rescaled second-score mixing proportions are still denoted v. We incorporate the constraint set B in our algorithm.

Density constraints
In addition to the constraints between mixing proportions, we also enforce pairwise constraints between the density functions of the components. Since correct scores are higher than partially incorrect scores on average, we expect the mode of the C density to be higher than that of J1. We similarly expect C (J1) to have a higher density at any given point in its right (left) tail compared to J1 (C); we also expect such relationships between other pairs of densities, such as C and I1, J1 and I1, J1 and J2, and I1 and I2. We formalize such constraints between a pair of densities, f and g, using a strict partial order f ≻ g (f dominates g) defined by the conditions

mode(f) > mode(g),
f(x) > g(x) for all x > mode(f),
g(x) > f(x) for all x < mode(g),

where mode(f) = argmax_x f(x) is the mode of density f. We enforce the pairwise constraints f_C ≻ f_J1, f_J1 ≻ f_I1, f_J1 ≻ f_J2, and f_I1 ≻ f_I2 in our approach. Due to the transitivity of the strict partial order, the ordering f_C ≻ f_J1 ≻ f_I1 ≻ f_I2 of the densities holds, along with f_J1 ≻ f_J2.

We derive an EM algorithm-based maximum likelihood estimation with several weight and density constraints to capture the structure inherent to XL-MS/MS data. To enforce the weight constraints, we convert the so-called Q-function in the maximization step of the EM algorithm into a Lagrangian function with additional terms and parameters for the constraints. The density constraints are enforced in each step by performing a binary search between the old and the new parameters.
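The strict partial order f ≻ g can be checked numerically on a finite grid, as in this sketch (our own helper names; parameter tuples are (λ, μ, σ) for scipy's skewnorm, and the grid and bounds are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import skewnorm


def sn_mode(lam, mu, sigma):
    """Numerically locate the mode of SN(mu, sigma, lam)."""
    res = minimize_scalar(lambda x: -skewnorm.pdf(x, lam, loc=mu, scale=sigma),
                          bounds=(mu - 5 * sigma, mu + 5 * sigma), method='bounded')
    return res.x


def dominates(pf, pg, grid):
    """Check f > g in the strict partial order: mode(f) > mode(g),
    f > g to the right of mode(f), and g > f to the left of mode(g),
    evaluated on a finite grid."""
    mf, mg = sn_mode(*pf), sn_mode(*pg)
    f = lambda x: skewnorm.pdf(x, pf[0], loc=pf[1], scale=pf[2])
    g = lambda x: skewnorm.pdf(x, pg[0], loc=pg[1], scale=pg[2])
    right = grid[grid > mf]
    left = grid[grid < mg]
    return mf > mg and np.all(f(right) > g(right)) and np.all(g(left) > f(left))
```

A grid check of this kind is only a surrogate for the exact evaluation described in the Supplementary materials, but it captures the three defining conditions.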

Derivation of the Q-function
We first write the log-likelihood function for the model as

L(Θ) = Σ_{s1} log [ Σ_{X∈X} w_X f_SN(s1; θ_X) ] + Σ_{s2} log [ Σ_{Y∈Y} v_Y f_SN(s2; θ_Y) ],     (2)

where X = {C, J1, I1} and Y = {C, J1, J2, I1, I2}, and Θ = {{w_X}_{X∈X}, {v_Y}_{Y∈Y}, {θ_Y}_{Y∈Y}} contains all model parameters. Next, we introduce the hidden variables for the EM framework.

Let {W_X(s1)}_{X∈X} and {V_Y(s2)}_{Y∈Y} be two sets of binary variables giving the source component for s1 and s2, respectively; W_X(s1) = 1 if and only if s1 comes from component X, and similarly for V_Y(s2). Each score, given its component affiliation, is an SN variable and consequently has an associated TN⁺(0, 1) variable from its stochastic representation (Section 2.2), one for each component it may come from. Let {T_X(s1)}_{X∈X} and {T_Y(s2)}_{Y∈Y} be the sets of such TN⁺(0, 1) variables for s1 and s2, respectively. Omitting s1 and s2 as arguments, the complete data log-likelihood, up to an additive constant in Θ, is given by

L_cmp(Θ) = Σ_{s1} Σ_{X∈X} W_X [ log w_X − (1/2) log Γ_X − (s1 − μ_X − Δ_X T_X)² / (2Γ_X) ]
         + Σ_{s2} Σ_{Y∈Y} V_Y [ log v_Y − (1/2) log Γ_Y − (s2 − μ_Y − Δ_Y T_Y)² / (2Γ_Y) ].

The Q-function for the EM algorithm is defined as the conditional expectation of L_cmp(Θ) given the observed data (S1, S2), computed using the current estimate of the parameters, Θ̄. The hidden variables {W_X(s1)}_{X∈X}, {V_Y(s2)}_{Y∈Y}, {T_X(s1)}_{X∈X}, and {T_Y(s2)}_{Y∈Y} are the random quantities in L_cmp(Θ), so the expectation is taken with respect to their conditional distribution given S1 and S2. The current parameters, Θ̄, are used only for taking the expectation; they do not replace the parameters in the expression for L_cmp(Θ). Consequently, the Q-function is a function of both Θ and Θ̄:

Q(Θ | Θ̄) = Σ_{s1} Σ_{X∈X} W̄_X(s1) [ log w_X + Q(s1; θ_X, θ̄_X) ] + Σ_{s2} Σ_{Y∈Y} V̄_Y(s2) [ log v_Y + Q(s2; θ_Y, θ̄_Y) ],

where Q(s; θ, θ̄) = q(s; ξ1(s; θ̄), ξ2(s; θ̄); θ); ξ1(s; θ̄) and ξ2(s; θ̄) are the first and second moments of a truncated normal distribution, respectively, as defined in Table 3; and W̄_X(s1) (V̄_Y(s2)) is the probability that s1 (s2) comes from X (Y) under the current parameters (Table 3). The EM approach relies on finding new parameters, Θ̂, at each iteration that increase the Q-function, i.e. Q(Θ̂ | Θ̄) ≥ Q(Θ̄ | Θ̄), to indirectly increase the log-likelihood (Equation 2).

Component parameter updates
To update the component parameters {θ_Y}_{Y∈Y}, we adopt the Expectation Conditional Maximization (ECM) approach of optimizing Q(Θ | Θ̄) one parameter at a time, as it leads to simpler closed-form update equations without compromising the monotonicity of the Q-function and the log-likelihood (Meng and Rubin 1993); see Supplementary materials. Taking the partial derivatives of Q(Θ | Θ̄) with respect to μ_Y, Δ_Y, and Γ_Y and equating them to 0 gives closed-form update equations in terms of the quantities m_Y(s; Δ), d_Y(s; μ), and g_Y(s; μ, Δ) defined in Table 3; the cases Y ∈ {C, J1, I1}, where the component is shared between the two mixtures, and Y ∈ {J2, I2} lead to slightly different equations. The new component parameters are guaranteed not to decrease the Q-function; see Supplementary materials. Note that for each component Y, its three parameters should be updated in the order μ̂_Y → Δ̂_Y → Γ̂_Y, due to dependencies among the equations.

Pairwise density constraints
To enforce the pairwise density constraints, we developed a binary search procedure (Supplementary materials) that is applied whenever a component density parameter (μ_Y, Δ_Y, or Γ_Y) is being updated, with the new parameter from Section 4.4.2 as a candidate. Specifically, in the case of μ_Y, if the new parameter, μ̂_Y, violates a pairwise density constraint involving component Y, a binary search is performed on the line segment connecting μ_Y and μ̂_Y to find a feasible point, μ̃_Y (not violating the constraints), closest to μ̂_Y. A binary search is similarly performed when updating Δ_Y and Γ_Y. This approach is guaranteed to give feasible parameters at each iteration provided the first set of component parameters is feasible; see Supplementary materials and Section 4.4.5. The parameters obtained from the binary search are also guaranteed not to decrease the Q-function; see Supplementary materials. Pairwise density constraints are efficiently evaluated as described in Supplementary materials. Note that if μ̂_Y (Δ̂_Y) is not feasible, then the feasible μ̃_Y (Δ̃_Y) from the binary search is used in the subsequent parameter updates of Δ_Y and Γ_Y in Section 4.4.2.
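The binary search can be sketched generically: given any feasibility predicate (a stand-in here for the pairwise density checks), move along the segment from the last feasible value toward the candidate and keep the feasible point closest to the candidate.

```python
def feasible_update(old, candidate, is_feasible, iters=50):
    """Binary search on the segment between a feasible old parameter value and
    a candidate update. Returns the feasible point closest to the candidate.
    Assumes `old` itself is feasible, as guaranteed by the initialization."""
    if is_feasible(candidate):
        return candidate
    lo, hi = 0.0, 1.0  # fractions along the segment old -> candidate
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        x = old + mid * (candidate - old)
        if is_feasible(x):
            lo = mid   # feasible: move closer to the candidate
        else:
            hi = mid   # infeasible: back off toward the old value
    return old + lo * (candidate - old)
```

Because the feasible region along the segment includes the old value, the loop converges to the boundary of feasibility from the feasible side.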

Weight updates under constraints
We update the weight parameters, {w_X}_{X∈X} and {v_Y}_{Y∈Y}, by optimizing Q(Θ | Θ̄) under the weight constraint set B and the standard mixture constraints, Σ_{X∈X} w_X = 1 and Σ_{Y∈Y} v_Y = 1. Using the Karush-Kuhn-Tucker (KKT) approach for constrained optimization leads to a Lagrangian objective O(Θ; c, g), in which c = {γ1, γ2} and g = {η_i}_{i=1..8} are the KKT multipliers for the two equality constraints and the eight inequality constraints of set B, respectively. (Throughout, quantities accented with a bar take the current parameter estimates, Θ̄, as an implicit argument, and component-specific quantities are subscripted by the component placeholder X ∈ X = {C, J1, I1} or Y ∈ Y = {C, J1, J2, I1, I2}; δ, Δ, and Γ are related to the canonical SN parameters σ and λ as per Table 1; TN⁺(α, ψ²) denotes the normal distribution with location α and scale ψ², truncated below 0; and E denotes the expectation operator.) The KKT conditions lead to a system of equations.
As per KKT theory, the solution [ŵ, v̂, ĉ, ĝ] to this system of equations gives the optimal mixing proportions, [ŵ, v̂], maximizing Q(Θ | Θ̄) while satisfying the constraint set B and the standard mixing proportion constraints. Note that the last eight equations, arising from the inequality constraints, require special consideration. For example, when η_1 = 0, the first inequality constraint does not need to be enforced explicitly: the optimal feasible solution lies in the interior region of the inequality and already satisfies it.
The case η_1 > 0 covers the situation when the optimal feasible solution lies on the boundary of the inequality, which is consequently enforced as an equality constraint. A practical implementation would first solve the equations with η_1 = 0; if a feasible solution is obtained, it is optimal. If it violates the inequality, then the correct solution must lie on the boundary and, consequently, the system of equations with the constraint active must be solved.
Since there are multiple inequality constraints, finding the optimal solution would require an exhaustive search, solving all possible systems of equations obtained by equating each subset of {η_i}_{i=1..8} to 0. This would lead to 256 different systems of equations, which is prohibitively expensive since the equations are solved in each iteration of the EM algorithm. As a practical solution, we adopt a greedy approach: we first solve the system of equations without explicitly enforcing any inequality constraint, i.e. η_i = 0 for i = 1, 2, ..., 8. If none of the inequality constraints is violated, this gives the optimal solution. If any inequality constraint is violated, we solve the system of equations with each of the violated constraints active, separately, i.e. with a single η parameter non-zero. If a feasible solution is found, it gives the optimal solution. If no feasible solution is found, we solve the system again with two violated inequality constraints active, i.e. exactly two η parameters non-zero. Proceeding in this manner, we next check 3, 4, ..., 8 active inequality constraints, if necessary. In all our experiments, a feasible solution was obtained with a maximum of two active inequality constraints. Note that due to the inequality constraints, the updated weight parameters are not guaranteed to maintain the monotonicity of the Q-function and the log-likelihood in each iteration; experimentally, however, we still observe the log-likelihood to increase over multiple iterations.
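The greedy strategy can be illustrated on a one-inequality toy version of the weight update (a hypothetical objective Σ_X n_X log w_X, not the paper's exact system): solve first with the constraint inactive and, only if it is violated, re-solve with it active on the boundary.

```python
def greedy_weight_update(n, b):
    """Toy greedy active-set update: maximize sum_X n_X * log(w_X) subject to
    sum_X w_X = 1 and w['C'] <= b. n maps component names to responsibility
    counts; b is the inequality bound."""
    total = sum(n.values())
    w = {k: c / total for k, c in n.items()}  # eta = 0: inequality inactive
    if w['C'] <= b:
        return w                              # feasible, hence optimal
    # Violated: activate the constraint, i.e. solve with w_C on the boundary,
    # distributing the remaining mass proportionally among the other components.
    rest = total - n['C']
    return {k: (b if k == 'C' else (1.0 - b) * c / rest) for k, c in n.items()}
```

The same pattern scales to the full system: try the unconstrained solution, then re-solve with progressively larger sets of violated constraints activated.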

Parameter initialization
To generate a diverse set of initial parameters, we adopted a random initialization approach. First, a normal distribution is fitted to the top scores, with μ and σ being the fitted parameters. Then five points are sampled randomly from N(μ, σ) and sorted. They are used to initialize the location parameters μ_C, μ_J1, μ_J2, μ_I1, and μ_I2 of the five SN components, assigned in that order. This makes it likely that the modes of the component densities follow the ordering described in Section 4.3.2. The scale parameter of component Y ∈ Y, σ_Y, is picked uniformly from [σ/4, σ]. To initialize the skewness parameters, first a λ_0 ∈ {1, 2, 5} is picked; then the absolute value of the component Y skewness parameter, λ_Y, is picked uniformly from [1/λ_0, λ_0]. The sign of λ_C is initialized to be positive, and the signs of λ_J2, λ_I1, and λ_I2 are initialized to be negative. λ_J1 is assigned a positive value in one initialization and a negative value in another; in this manner, two initializations with identical parameters except for the sign of λ_J1 are obtained. If the initial parameters thus obtained violate the density constraints, they are discarded and resampled. Varying the value of λ_0 in {1, 2, 5}, six initializations are obtained. In each initialization, the mixing proportions w_C, w_J1, and w_I1 are all set to 1/3. v_C is set to 0.001, since the second score is expected to include only a small number of correct hits, and v_J1, v_J2, v_I1, and v_I2 are all set to 0.999/4. Unlike the density constraints, it is not necessary for the initial parameters to satisfy the weight constraints.
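The sampling scheme can be sketched as follows (a simplified sketch assuming the sorted locations are assigned in descending order, C first; the resampling on density-constraint violations and the paired λ_J1-sign initializations are omitted):

```python
import numpy as np


def init_params(top_scores, lam0, sign_J1, rng=None):
    """One random initialization: component locations drawn from a normal fit
    to the top scores (largest assigned to C), scales in [sigma/4, sigma],
    skewness magnitudes in [1/lam0, lam0] with the prescribed signs.
    Returns {component: (mu, sigma, lam)}."""
    rng = np.random.default_rng(rng)
    mu, sigma = np.mean(top_scores), np.std(top_scores)
    locs = np.sort(rng.normal(mu, sigma, size=5))[::-1]  # descending: C gets the largest
    names = ['C', 'J1', 'J2', 'I1', 'I2']
    signs = {'C': 1, 'J1': sign_J1, 'J2': -1, 'I1': -1, 'I2': -1}
    return {name: (loc,
                   rng.uniform(sigma / 4, sigma),
                   signs[name] * rng.uniform(1.0 / lam0, lam0))
            for name, loc in zip(names, locs)}
```

Each call produces one of the six per-restart initializations; calling it with `sign_J1 = +1` and `sign_J1 = -1` for each λ_0 ∈ {1, 2, 5} reproduces the described grid.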
We ran the above sampling procedure 40 times, resulting in 240 = 40 × 6 initializations. After running our algorithm once for each initialization, we pick the solution attaining the maximum log-likelihood (Equation 2) among them.

Single-sample model
We also considered a single-sample model that incorporates only the top score, as the three-component mixture

f_S1(s) = w_C f_SN(s; θ_C) + w_J1 f_SN(s; θ_J1) + w_I1 f_SN(s; θ_I1).
The weight constraints are not applicable to this model; however, the density constraints f_C ≻ f_J1 and f_J1 ≻ f_I1 are enforced. The parameter update equations for this model are given in Supplementary materials. A similar parameter initialization approach and the same number of restarts as in the two-sample model were used for a fair comparison.

FDR estimation
The FDR of the fitted mixture model at a given threshold τ can be estimated as

FDR(τ) = [ w_J1 S_SN(τ; θ_J1) + w_I1 S_SN(τ; θ_I1) ] / [ w_C S_SN(τ; θ_C) + w_J1 S_SN(τ; θ_J1) + w_I1 S_SN(τ; θ_I1) ],

where S_SN(τ; θ) = 1 − F_SN(τ; θ) is the SN survival function;

F_SN(τ; θ) = Φ((τ − μ)/σ) − 2 O((τ − μ)/σ, λ)

is the SN cdf; Φ is the cdf of N(0, 1); and O is Owen's T function, computed approximately (Young and Minder 1974).
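This estimator can be sketched with scipy, whose skewnorm cdf is evaluated via Owen's T internally (the parameter values below are hypothetical, for illustration only; theta tuples are (μ, σ, λ)):

```python
import numpy as np
from scipy.stats import skewnorm


def sn_survival(tau, mu, sigma, lam):
    """S_SN(tau; theta) = 1 - F_SN(tau; theta) for SN(mu, sigma, lam)."""
    return skewnorm.sf(tau, lam, loc=mu, scale=sigma)


def fdr(tau, w, theta):
    """Estimated FDR at threshold tau from the fitted top-score mixture:
    expected fraction of non-correct (J1 or I1) top PSMs among those above tau."""
    surv = {k: w[k] * sn_survival(tau, *theta[k]) for k in w}
    total = sum(surv.values())
    return (total - surv['C']) / total if total > 0 else 0.0


# Hypothetical fitted parameters, for illustration only.
theta = {'C': (40, 8, 2), 'J1': (25, 7, -1), 'I1': (15, 6, -2)}
w = {'C': 0.4, 'J1': 0.25, 'I1': 0.35}
```

The reported 1% FDR threshold is then the smallest τ with `fdr(tau, w, theta) <= 0.01`.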

Results
We carried out experiments on ten datasets (Table 2) and investigated the quality of fit, variance of estimation, and the number of identified peptides as a function of estimated FDR.

Quality of modeling
The quality of fit is shown in Fig. 1 for five selected datasets; for all datasets, please refer to the Supplementary materials. In each case, the first two rows show the results of the two-sample model, i.e. joint modeling of the distributions of the top-ranked PSMs (blue histogram) and the second-ranked PSMs (green histogram). The third row shows the one-sample approach, i.e. when only the distribution of top-ranked PSMs is considered. Overall, the fit is excellent, indicating that the SN distribution was a reasonable choice for modeling competition between PSMs, which is a theoretically grounded result (Arellano-Valle et al. 2006).
Compared to the TDA, the two-sample model gives competitive results. The 1% FDR threshold, shown by vertical lines in Fig. 1, is similar in four out of ten datasets (D1810, Peplib, RPA, QE). For the other datasets, the two-sample method gives a different 1% FDR threshold, sometimes more permissive and other times stricter than the TDA threshold. By visual inspection and examination of some MS/MS spectra, we concluded that the two-sample solution may in fact be advantageous. The ALott dataset is an interesting case, as our model leads to a more permissive FDR estimate. We inspected multiple PSM identifications and concluded that the FDR = 1% threshold may in fact be closer to the one estimated by our method because, at least in some cases, the experimental spectra appear to be mixtures of two different pairs of cross-linked peptides with very similar total masses. Other cases of high-scoring second PSMs appear to correspond to top PSMs with even higher scores. The dependency in the latter case cannot be easily modeled by mixtures of distributions and is a limitation of our approach.
The one-sample model provides relatively good results, often with an even better log-likelihood than the two-sample model; however, the lack of the second sample in parameter learning leads to a considerable error in distribution placement and, consequently, FDR estimation. One such example is the Alinden data, where the 1% FDR threshold of 195 is lower than the TDA's threshold of 230 and the two-sample model's threshold of 247. This result is problematic because the tail area of the score distribution of the second-ranked PSMs (which this model does not consider) above 195 is quite large and cannot be attributed to correct PSMs. Therefore, this model is clearly inferior to the two-sample model in its quality of fit.
We used constraints to control the relative placement of the component distributions. For example, the correct component should occupy the rightmost part of the score distribution of the top-ranked PSMs, as correct matches should have higher mean scores than incorrect or partially incorrect matches.
Similarly, when both the top and second hits are partially incorrect, their score distributions should have similar shapes. While this was not enforced by the constraints, we observed that the first and second partially incorrect distributions indeed have similar shapes, and that the second partially incorrect distribution had a lower mean than the first.
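The ordering of component locations described above can be enforced mechanically after each M-step. The sketch below is a simplified illustration under our own naming (the actual constraints in the method also couple mixture weights across the two samples): it projects the component locations onto the required ordering with a small pool-adjacent-violators pass.

```python
def isotonic(vals):
    """Pool-adjacent-violators: return the nondecreasing sequence closest
    (in least squares) to vals, pooling any adjacent blocks out of order."""
    blocks = [[v] for v in vals]
    i = 0
    while i < len(blocks) - 1:
        mean_i = sum(blocks[i]) / len(blocks[i])
        mean_j = sum(blocks[i + 1]) / len(blocks[i + 1])
        if mean_i > mean_j:              # violation: pool the two blocks
            blocks[i] += blocks.pop(i + 1)
            i = max(i - 1, 0)            # pooling may expose an upstream violation
        else:
            i += 1
    out = []
    for blk in blocks:
        out += [sum(blk) / len(blk)] * len(blk)
    return out

def enforce_location_ordering(locs):
    """Repair component locations so that
    incorrect <= partially incorrect <= correct (keys are hypothetical)."""
    keys = ["incorrect", "partial", "correct"]   # ascending expected means
    return dict(zip(keys, isotonic([locs[k] for k in keys])))
```

Applying this projection after each M-step keeps the correct component to the right of the partially incorrect one without altering updates that already satisfy the ordering.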

Variance of the FDR threshold
We used fifty bootstrapping (Efron and Tibshirani 1986) experiments to study the variance of estimated 1% FDR thresholds (Fig. 2).In most cases the two-sample model gives a stable threshold, although TDA performs well on large datasets such as Alinden.This is expected and is the case when TDA assumptions are likely to be satisfied (Elias and Gygi 2007).The larger variance of the 1% FDR thresholds in the one-sample mixture suggests an identifiability issue, which is mitigated by incorporating the second sample and the weight constraints.
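The bootstrap procedure itself is simple: resample the observed scores with replacement, re-estimate the 1% FDR threshold on each replicate, and report the spread. In the sketch below, `fit_threshold` is a hypothetical stand-in (a fixed quantile) for the full constrained mixture refit, so the loop runs end to end; the score data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic top-ranked PSM scores (stand-in for a real dataset):
# a low-scoring and a high-scoring population.
scores = np.concatenate([
    rng.normal(120.0, 25.0, 400),
    rng.normal(260.0, 50.0, 600),
])

def fit_threshold(s):
    """Stand-in estimator: the full method would refit the constrained
    skew-normal mixture on s and return its 1% FDR threshold. A simple
    quantile keeps this sketch self-contained."""
    return float(np.quantile(s, 0.40))

B = 50  # number of bootstrap replicates, as in the study
boot = np.array([
    fit_threshold(rng.choice(scores, size=scores.size, replace=True))
    for _ in range(B)
])
mean_t, sd_t = boot.mean(), boot.std(ddof=1)
```

The standard deviation `sd_t` across replicates is the quantity compared between the one-sample, two-sample, and TDA estimators in Fig. 2.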

Spectral identifications
Figure 3 shows the number of identified PSMs as a function of the estimated FDR threshold. We first observe that our model generates smooth curves, which is desirable. In some cases, the two-sample model shows strong agreement with the TDA, suggesting that decoy data may not provide any information that is not already incorporated by the DFA. Examples of such cases are the Ecoli and D1810 datasets. Interestingly, however, the bootstrapping results on these datasets show increased stability of the DFA and suggest that the DFA should be the preferred choice for such data.
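Curves like those in Fig. 3 can be produced from per-PSM posterior error probabilities (under a fitted mixture, the posterior mass of the incorrect components for each PSM): the estimated FDR at any score cutoff is the average PEP above it, and q-values make the resulting curve monotone. A minimal sketch, assuming PEPs are already available (the function name is ours):

```python
import numpy as np

def q_values_from_pep(scores, pep):
    """q-value for each PSM from posterior error probabilities: at each
    score cutoff the estimated FDR is the mean PEP of PSMs above it, and
    the q-value is the minimum such FDR over all cutoffs including the PSM."""
    order = np.argsort(-scores)                      # best score first
    fdr = np.cumsum(pep[order]) / np.arange(1, len(pep) + 1)
    q = np.minimum.accumulate(fdr[::-1])[::-1]       # enforce monotonicity
    out = np.empty_like(q)
    out[order] = q
    return out
```

Counting PSMs with `q <= alpha` over a grid of `alpha` values then yields the identification curve; smoothness follows from the continuity of the fitted mixture densities.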

Discussion
Accurate false discovery rate estimation is key to biological discovery (Nesvizhskii 2010, Aggarwal and Yadav 2016) and is integral to protein identification methodology (Li and Radivojac 2012, Serang and Noble 2012) and protein function studies (Sinz 2003, 2006). However, the field is confronted with computational and statistical challenges, and the quality of methods is difficult to evaluate owing to the lack of ground truth associated with experimental spectra.
To the best of our knowledge, this work is the first to propose decoy-free FDR estimation in XL-MS/MS. We accomplished this by modeling top-ranked and second-ranked PSM score distributions as multi-component mixtures of (latent) SN distributions with shared parameters. We formulated the problem as constrained maximum likelihood optimization and then derived an EM algorithm to learn model parameters from data. We extensively evaluated this method to show that low-variance, high-quality FDR estimation can be achieved without decoy data. The proposed algorithm is a nontrivial generalization of the multi-sample decoy-free approaches we developed for traditional MS/MS (Peng et al. 2020), although the use of multiple components to model incorrect PSM scores in XL-MS/MS required constrained optimization and a far more complex solution. However, as before, modeling the score distribution of the second-ranked PSMs stabilized learning and helped avoid target-decoy competition, leading to an accurate inference procedure with significant run-time savings. The reasoning behind this algorithm can be further applied to other large search-space MS/MS scenarios, including de novo searches (Dancik et al. 1999), searches of semi-tryptic (Alves et al. 2008) and post-translationally modified (Fu 2012) peptides, as well as metaproteomics searches (Heyer et al. 2017).
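The overall estimation loop is a constrained EM over shared skew-normal components. As a deliberately simplified, unconstrained stand-in, the sketch below implements EM for a two-component one-dimensional Gaussian mixture; the actual algorithm replaces the Gaussian E- and M-steps with skew-normal updates, ties parameters across the two samples, and adds the ordering and weight constraints discussed above.

```python
import numpy as np

def em_gmm2(x, iters=200, tol=1e-8):
    """Minimal EM for a two-component 1-D Gaussian mixture: a stripped-down
    stand-in for the constrained multi-sample skew-normal EM."""
    mu = np.array([x.min(), x.max()], dtype=float)   # spread-apart init
    var = np.array([x.var(), x.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    ll_old = -np.inf
    for _ in range(iters):
        # E-step: per-point responsibilities of each component
        dens = (w / np.sqrt(2 * np.pi * var)) * \
               np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        tot = dens.sum(axis=1, keepdims=True)
        r = dens / tot
        ll = np.log(tot).sum()
        if ll - ll_old < tol:                        # converged
            break
        ll_old = ll
        # M-step: weighted moment updates
        n_k = r.sum(axis=0)
        w = n_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
    return w, mu, var
```

Even in this toy form, the log-likelihood check mirrors the stopping criterion of the full algorithm; the constrained version simply projects the M-step output onto the feasible parameter region before the next E-step.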

Figure 1.
The quality of the fit is visualized for five datasets (columns), with the remaining ones available in the Supplement. The data are shown as histograms, with blue representing top-ranked and green representing second-ranked PSM scores. The mixture components are plotted separately, and each density is weighted by its mixture weight estimated by the model as described in Section 4. 1SMix and 2SMix are the one-sample and two-sample mixture models, respectively. The vertical solid line shows the 1% FDR threshold given by the mixture model, while the vertical dashed line shows the 1% FDR threshold given by TDA. The average log-likelihood of the fitted model on each sample is shown in the upper right corner. FDR curves are shown in blue with the y-axis and its scale shown on the right.

Table 1.
Relationship between the alternate and canonical SN parameters.

Table 2.
Summary of datasets and search parameters used in this study.

Table 3.
Useful quantities for the parameter update equations.