Research paper
Novel Method for Accurately Assessing Pull-up Artifacts in STR Analysis

https://doi.org/10.1016/j.fsigen.2020.102410Get rights and content

Highlights

  • Novel algorithm accurately identifies pull-up artifacts

  • Handles imperfect spectral matrix and overloaded samples

  • Expert system pattern analysis increases accuracy of artifact identification

  • Can correct alleles for coinciding pull-up signal

  • Implemented in free, open source OSIRIS STR DNA analysis software

Abstract

OSIRIS is a mathematically-based software tool for Short Tandem Repeat (STR) and DNA fragment analysis (https://www.ncbi.nlm.nih.gov/osiris/). As part of its routine sample analyses, OSIRIS computes unique quality metrics that can be used for sample quality assessment. A common artifact of STR analysis is cross-channel pull-up or pull-down (negative pull-up). This occurs because of the spectral overlap between the dyes used with the marker set, and the failure of the color deconvolution matrix to isolate the colors in the dye set adequately. This paper describes a mathematical method for analyzing and quantifying the pull-up patterns across sample channels and effectively identifying and correcting the pull-up artifacts, as implemented in OSIRIS. Unlike approaches to pull-up that require a training set composed of previous samples, the algorithm described here uses a mathematical model of the underlying causes of pull-up. It is based solely on the information intrinsic to the sample it is analyzing and therefore incorporates the effects of the ambient conditions and the specific procedures used in creating the sample. These conditions are the physical determinants of the level of pull-up in the sample and are not likely to be represented in a training set consisting of past samples.

Introduction

Electrophoresis data is typically divided into “channels” through the targeted addition of dye molecules to the synthesized fragments of DNA. During electrophoresis, each dye is excited by a laser and the resulting emission spectra are recorded by a charge-coupled device (CCD). The response of the overall system, including process-induced artifacts, is due to the combined performance of the dye set, laser and CCD.

Each color in a dye set has a fluorescent emission spectrum that overlaps with the other dyes in the set by an amount that depends on the dye emission wavelengths and on ambient conditions (temperature, pH, etc.) affecting the dye’s environment [1]. Because of these overlaps, in a kit with blue, green, yellow and red dyes, for example, a pure blue peak will register a measurement in each of the four colors. That is, a blue peak will display blue, of course. However, in addition to blue, there will be companion peaks of various sizes in each of the green, yellow and red bands. At set ambient conditions, the spectral composition for each dye can be measured and used to build a so-called deconvolution matrix (or color separation matrix). In order to decipher the spectral overlap, then, the pre-generated deconvolution matrix is applied to the raw data. Under appropriate run conditions, the application of the matrix completely resolves the issue of spectral overlap, and the peaks of the dye channels of the deconvoluted data are independent of each other.

On the other hand, if the matrix color deconvolution matrix does not work properly, then color can bleed from one spectral channel into another, a phenomenon referred to as pull-up. Pull-up manifests as a peak in one channel comigrating with larger peaks in one or more other channels. We will refer to the large peaks that cause pull-up in one or more other channels as primary pull-up peaks. Because the operation of the deconvolution matrix is dependent on the properties of the dyes and the operation of the laser and CCD, we expect that any cross-channel pull-up patterns will manifest channel-wide.

The most common reasons for deconvolution matrix failure are overloading sample DNA and a change in conditions between when the matrix was created and when the sample was analyzed. The simplest way the deconvolution matrix can fail is if it does not represent the current spectral overlap due to a change in ambient conditions, including current condition of the capillary electrophoresis unit, consumables and reagents. Incorrect matrix/calibration, or a mismatch between the conditions during the calibration versus the analysis run can result in peaks, both positive and negative, comigrating with peaks in other channels [2]. These are both possible manifestations of pull-up. Since the amplitude of a peak is measured in relative florescence units, or RFU, and uncorrected RFU values are always non-negative, the only way in which peaks can be negative is for the application of the deconvolution matrix to fail.

Poor calibration of the deconvolution matrix causes pull-up that is linear in regard to the heights of the primary pull-up peaks. That is, if a primary peak in the blue channel of height h causes a pull-up of height p, then a primary peak of, say, θ times the height, θh, would cause a pull-up of θ times the height: θp. This is illustrated in [1, Figures 13.6 and 13.7], which show a well calibrated deconvolution matrix, and the plots of the raw STR data before and after the application of the deconvolution matrix. Before deconvolution, every peak is present in every color, the manifestation of the spectral overlap between the dyes. After application of the deconvolution matrix, peaks appear in one color with no pull-up between channels. Since transforming the channel data by matrix multiplication across the entire time range is a linear operation and an appropriately calibrated deconvolution matrix can completely compensate for spectral overlap between the dyes, the spectral overlap in a channel’s peaks must be related linearly to the heights of peaks in the other channels. Therefore, in the case of an inappropriate deconvolution matrix, there is a dominantly linear channel-wide pattern relating the primary pull-up peak heights in one channel to the pull-up peak heights in another.

There can be non-linear causes of pull-up, as well, in which this simple relationship between primary height and pull-up height does not occur, because the CCD used for measuring the intensity of the dye fluorescence does not have a linear response throughout its dynamic range [3, “Dynamic Range” section].

The two regions of non-linearity of the CCD depend on the quantity of DNA. The most extreme occurs with so-called “off-scale” data, which happens when the sample is overloaded. In this case, signal from one or more of the dye colors has saturated the CCD, known as the “full well capacity” of the device [3]. This situation represents a state change in that one or more of the dye measurements no longer represents the true quantity of that dye in the capillary. Therefore, the deconvolution matrix will fail if the quantity of dye being measured causes the CCD response to be off-scale (beyond the full well capacity of the CCD) [1]. For off-scale peaks, there is still a pattern relating primary pull-up heights in one channel to pull-up peak heights in another, even if the pattern is different from when the peak heights are not off-scale. Larger primary peaks will generally result in more extreme pull-up peaks, but the two cases must be considered as separate analyses: off-scale vs. in-scale.

The other region of non-linearity of the CCD response, which also depends on the quantity of DNA, occurs before the spectral signal fully saturates the CCD, but above a mid-level signal. “CCD response becomes nonlinear before full well capacity is reached” [3]. In other words, the CCD has a linear range and the deconvolution matrix can fail when the signal data is outside that linear range.

Therefore, the pattern between primary pull-up heights in one channel and pull-up heights in another is linear when the primary pull-up heights are within the linear range of the CCD, and the pattern is non-linear when the primary pull-up heights are above the linear range of the CCD.

In summary, for each pair of channels, we seek two pull-up patterns, one for in-scale and one for off-scale data, and to be certain to represent the full range of possibilities, we allow that each pattern may represent a non-linear relationship between pull-up heights and the associated primary pull-up heights.

In quantifying the pull-up pattern across channels, the goal is to determine which peaks are pure pull-up signals and which peaks are partial pull-up – actual allele peaks that include some pull-up signal. The pure pull-up peaks can then be ignored as known artifacts. For partial pull-up, a secondary benefit would be to estimate the true allele signal heights for the actual allele peaks that also contain pull-up signal, after correcting for pull-up. The analysis of pull-up coefficients is a multi-variable estimation problem, based on data consisting of the heights of pairs of peaks on different channels that coincide in time. The task of collecting the data and performing the estimation, is complicated by several factors. First, pairs of peaks rarely coincide in time exactly. Second, there may be data that is important to estimating the true pull-up pattern that does not present itself in the form of pairs of peaks. So, collecting and representing all the relevant data is critical. Third, not all pairs of peaks on separate channels that coincide in time represent pull-up, rather a pair may comigrate by coincidence, resulting in either no pull-up or a partial pull-up, which must be taken into account in the choice of estimation algorithm. It is important to note that, in general, partial pull-ups will not adhere to the same pattern of coefficients as pure pull-ups because of the mixture of allelic DNA and pull-up signals. Additionally, if the coinciding signal in a partial pullup is negative, from a pull-down, that artifactually reduces the height of the allele peak signal.

The following begins with the estimation algorithm, assuming a knowledge of how to select the data, after which the selection of data is described. Finally, the entire algorithm is summarized, and we conclude with some examples and the enumeration of results from the analysis of a set of data. A glossary of terms and variable names is included in the Appendix.

Given the number of possible explanations for two peaks on different channels to occur at the same time, it is impossible to determine, based on a single pair of coincident peaks, what, if any, pattern of pull-up may exist between the two channels. One cannot say with any confidence that the larger peak is a causative primary peak and the smaller one is a pull-up. Nor can one say that they are simply independent but comigrating peaks. More evidence is needed.

To determine a pattern between two channels, pairs of heights, (x, y), must be collected, where x is the height of a peak on the first channel and y is the height of a peak that occurs at the same time on the second channel. Our goal is to discover what, if any, pattern exists between these pairs and to use this pattern to classify the pairs (xi, yi), where xi is the (potential) primary pull-up height and yi is the (potential) pull-up height, which type of pull-up each represents among those listed in the Introduction, and, finally, which yi’s do not represent pull-up at all. The last case occurs when a pattern can be established and the pattern is that there is no pull-up between the channels. In the two cases of partial pull-up and no pull-up, xi and yi represent the heights of two peaks that are both alleles and that happen to coincide. A comparison of ladder peaks on adjacent channels in a GlobalFiler ladder show that these coincidences are expected. Fig. 1 shows an overlay of two loci, vWA in blue (6-FAM) and D21S11 in green (VIC), two channels with a significant number of comigrating alleles (See [1, p. 334]). Six allele pairs differ in spacing by fewer than 0.3 base pairs and three allele pairs differ by fewer than 0.1 base pairs. These would normally be considered to have fallen within the same allele bin and might be mistaken for primary pull-up / pull-up pairs.

Therefore, while seeking a pull-up pattern between two channels, one must be mindful of the likelihood of coincidental peaks that are both legitimate alleles. Such coincidences, whether non-pull-up peaks or partial pull-up peaks, should manifest as outliers in the pull-up pattern.

Ideally, because of the possibility that a pull-up peak may be caused by coincidental primary peaks in more than one other channel, one should solve a mathematical model involving all channels at one time. For simple linear pull-up correction, this would amount to finding a replacement to the original deconvolution matrix, which would have N=k2-k unknowns, where k is the number of channels. Furthermore, there would potentially be as many equations as there are peaks in the sample. Each additional non-linear term adds another matrix of unknowns having the same dimensions. The size and complexity of such a formulation may pose a significant computational burden. On the other hand, for the correction we seek, multiple channel interactions, when present, may be susceptible to other analysis techniques. Restricting attention to two channels at a time simplifies the analysis and reduces both the number of unknowns and the number of equations. Cross-channel issues involving multiple channels are comparatively rare, but even in that case, our handling of outliers works.

Henceforth, we will narrow our focus to two channels, the primary pull-up channel which we call the α-channel, and the actual pull-up channel, which we call the β-channel. The α-channel is the one that causes the pull-up into the β-channel.

Assuming that there are n peaks in the α-channel, some or all of which correspond (described below) to peaks in the β-channel. The heights of the corresponding peaks in the β-channel will be designated byyi, for i = 1 , … , n.

The yi may be positive (pure or partial pull-up), negative (pull-down) or zero (no pull-up). See Fig. 2.

We define xi to be the height of the ith α-channel peak. Then, the data that may predict the pull-up portion of yi includes xi and functions of xi.

The classic model for the required formula is as follows:yi=hi1θ1+hi2θ2++hipθp+eiwhere

ei represents (unknown) noise for i=1,,n,

θj, for j=1,,p, are unknown parameters to be determined as part of the solution,

and

hi1=xi and the other hij are non-linear functions of xi.

The most common mathematical tool for quantifying a pattern between such a collection of pairs of points is linear least squares regression (LS), or least squares fitting.

For a vector of θj’ s, letri=ri(θ1,,θp)=yihi1 θ1hi2 θ2hipθprepresent the ith residual. As stated above, the hij represent the various functions of the heights of potential primary pull-up peaks on other channels, and the yi represent the heights of the peak pulled up by the primary peaks.

The mathematical problem posed by linear least squares regression is to find the vector θˆ = (θ1ˆ,…, θpˆ) which minimizes the sumi=1nri2,over all parameter vectors θ.

The properties of the least squares approach to modeling data have been studied extensively and contrasted with other approaches. One aspect of this comparison comes in the form of the “breakdown point”, e*, defined to be the smallest percentage of contaminated data that can cause the estimator to take on arbitrarily large aberrant values [4, p.871]. For least squares regression, the breakdown point e* has been calculated to be 0. In other words, a single outlier can be enough to skew the results of the regression by an arbitrarily large amount. This property disqualifies least squares regression as a tool for solving the pull-up pattern problem because outliers are not just possible, they are expected.

In most applications of statistical regression, whether least squares regression or some other algorithm, the fit is what is important and the outliers, if any, are to be rejected. In our case, it is the outliers that are important, because they represent alleles and the fit is merely a means to an end. In fact, because an “outlier” in our model may also be affected by pull-up, we would like to use the ultimate fit to estimate the extent of the influence. A peak that fits the pull-up pattern is a pure pull-up and not an allele. In terms of the model, an outlier is a peak that does not fit the pull-up pattern but nevertheless may contain pull-up signal. We have defined such a peak as a partial pull-up. Finally, there can be an established pattern of pull-up where the pattern is that there is no pull-up. In this case, all the potential pull-up peaks are (full) outliers and are neither pull-up nor alleles containing pull-up signal. They are pure alleles. Therefore, it is of utmost importance to employ a mathematical tool that tolerates not just one outlier, but several.

Fortunately, there is a tool that has been proven to tolerate up to half the data as outliers and to be able to identify which values are outliers: the least median of squares regression (LMS) [4,5,6,7,8]. The optimization criterion is to choose the vector of parameters, θˆ = (θ1ˆ,…, θpˆ) which minimizes the median of the residuals squared, or,minθmedri2(θ1,,θp)

This approach has been shown [4] to have a breakdown point e* = 50%. In other words, at least half the data must be pure pull-ups to avoid skewing the result. In practice, this is the largest breakdown point possible. Because the outliers do not provide evidence of the underlying pattern, we would not trust a pattern derived from data in which more than half the measurements are outliers.

This, then, is our problem statement: to find the pull-up parameters that minimize the median residual squared based on coinciding peaks across channels. The next section shows how to solve this problem in the context of our application.

The criterion in (1) is simple but, unlike least squares regressions, least median squared problems are not easily solved. As the number, p, of unknown parameters grows, the difficulty of solution increases rapidly. For p=1, the solution may be computed analytically, as it may be for p=2, with considerably more difficulty. However, for p>2, analytic solutions beyond exhaustive enumeration are unknown. Hence, in the interest of computational efficiency, it is in our best interest to restrict ourselves to p=1 or p=2.

To this problem statement, we include a single non-linear term to model spectral distortion due to overamplification. We approximate the non-linear contribution as a quadratic term, which means that our model for a single primary channel (causing the pull-up) and a single pull-up channel, yieldsy=θ1x+θ2x2where y is the pull-up height, x is the primary peak height and θ1 and θ2 are the unknown parameters to be determined. Thus, for a 5-channel multiplex, there are 20 two-parameter least median square problems. In general, that is k2-k two-parameter LMS problems, where k is the number of channels. Further study of the non-linear pull-up phenomenon may yield a refinement of this simple quadratic approximation, but it is sufficient as a first approximation.

The solution to a least median squared regression allows us to compute associated outliers using the following formulas. We defines=1.4826(1+5n-p-1)Wwhere W is the optimal median of the squared residuals. Then, for a data point with residual r, ifrs>2.5that point is an outlier [8]. This raises the possibility of a simpler solution approach.

Because, using (3) and (4), one can determine which points are outliers and which fall within the purview of a pattern, LMS and LS can be combined as follows. First, perform LMS using a linear model:y=θ1xthen remove the outliers using (3) and (4) and, finally, perform LS on the remaining data points using the non-linear model:y=θ1x+θ2x2

The LMS solution to (1) and (5) is computationally straightforward [4]. The algorithm consists first of sorting the n ratiosτi=yixi,fori=1,ninto numerical order. If k is n2+1 for n even and n-12+1 for n odd, find the k consecutive elements of the sorted ratios with the least displacement τi+k-τi. Let iˆ be the integer I that gives the least such displacement. The LMS minimum is thenθ*=(τiˆ+k+τiˆ)/2for the selected iˆ.

With this solution in hand, we apply (3) and (4) with W in (3) computed usingW=(θ*)2to compute s in (4), which gives the threshold to remove any outliers. Then, use LS on the remaining peaks to find θˆ1 and θˆ2 to substitute in (2). Finally, use the formula in (2) to provide pull-up corrections from all relevant peaks in the α -channel to all corresponding peaks in the β -channel.

All peaks in the β -channel that were not outliers in the above process are pure pull-up, with a pull-up-corrected height of 0. These peaks are not given allele calls. When all pull-up corrections have been made for all channels, all peaks in the β -channel that were outliers but have overall corrected height below the minimum height threshold (in RFU) specified by the user, are identified as partial pull-up peaks with corrected height below minimum analytical RFU. These peaks are assigned a corrected height of 0 and are not given allele calls. All peaks in the β -channel that do not fall into either of the above categories are identified as partial pull-up peaks with the computed corrected height and are given allele calls.

There are two more conditions that can occur. First, either there are no pairs of peaks in the two channels, in which case there is no pull-up correction, or, there are insufficiently many pairs of peaks to establish a pattern, in which case the pairs that are found are identified as uncertain pull-up but with no conclusion as to pull-up level. The second condition is that there is a pattern of pull-up between the α and β channels, but that pattern is found to have coefficients θˆ1 and θˆ2 equal to zero. In this case, no pull-up is identified between the channels and no height correction is made.

The decision to calculate pull-up two channels at a time was based on the expectation that the incidence of peaks that represent pull-up from multiple channels is relatively rare and that the computing of outliers is sufficient to accommodate any occurrences. The examples show that the way OSIRIS handles such instances validates the assumption.

This section concludes with a note about negative peaks. OSIRIS fits raw data peaks that are negative in addition to those that are positive. Because pull-up can be manifested with either a positive coefficient or a negative one, to establish the true pattern of pull-up across a sample, it is necessary to have both positive and negative raw data available. We have observed that there are some proprietary STR analysis systems where negative raw data values are truncated. This reduces the accuracy of the pull-up pattern assessment where there is pull-down. If the α-channel pulls-up into the β-channel with a negative sign, it means that the β-channel peaks that are reduced by pull-down from the α-channel should be taller than they appear in the plot, or the electropherogram.

To determine pull-up, it is necessary to establish which peaks on the α and β-channels “coincide”. The reason this is an issue is that, based on empirical evidence, there can be shifting between the channels. That is, the centers of raw data pull-up peaks do not always align with those of the cross-channel primary peak responsible for the pull-up. OSIRIS deals with this problem by fitting peaks of predefined shapes [9, pp. 1921-1922] and, because of the nature of the curve fit, peak locations are accurately computed [9]. Even though electrophoresis data is sampled data, OSIRIS routinely locates peak centers, or modes, even when, as usually happens, the modes are located between sampled points. Similarly, OSIRIS estimates peak heights based on the fitted curves, and so heights are calculated somewhat more accurately than if the raw data maximum alone were used.

The identification of cross-channel patterns requires that the underlying peak analysis comprehensively locates and fits peaks on each channel. In addition to large, easily identified peaks, it is necessary to effectively identify low level peaks, including those below the user-specified minimum analytical threshold. Important analysis algorithms include: the assessment of the static noise level for each channel; the normalization of the baseline; and shoulder peak identification.

The static noise level is used explicitly in the pull-up analysis and is estimated as follows. As a rule, an electrophoresis analysis is run long enough to ensure that all DNA peaks have had a chance to migrate past the laser. It is safer to err by running longer than necessary, so, generally, for larger times, there are no peaks, only static noise. For each channel, OSIRIS samples the raw data for an interval at the end of the measured time sequence and performs two functions. First, an average of that data is computed, and, if the average is non-zero, then it represents a fixed offset that is then subtracted from the raw data across the entire time spectrum. This is a static normalization process. Second, relative to the average, OSIRIS computes the maximum peak-to-trough distance. This represents the maximum oscillation of the raw data, in the absence of signal, and varies from channel to channel. This oscillation is saved as the noise level for each channel. The maximum end-of-sample oscillation is an important part of many OSIRIS algorithms, including peak identification and curve-fitting, because it is a measure of the minimum signal that can be distinguished from noise [9].

Once peak identification and curve fitting is complete, we can collect cross-channel patterns. Coinciding peaks are specified in terms of the width of primary peak, the peak in the α-channel that (potentially) causes the pull-up. We define the width of a peak as the horizontal distance between the right and left tail at the half height of the peak. Given a potential primary pull-up peak with width w (in time units) in one channel, a potential pull-up peak in another “coincides” in time with the first peak if its mode falls within w/2 seconds of the mode of the primary. In other words, the two modes lie within w/2 seconds of each other. This flexibility is required by the potential shifting between channels noted above. It is now a straightforward matter to collect the set of all pairs of coinciding peaks.

Another question that must be resolved is the number of coinciding α and β-channel peaks required for OSIRIS to determine a pattern. In general, one or two pairs is an insufficient number for OSIRIS to impute a pattern, because the probability of two pairs of comigrating peaks may not be small. The higher the number of comigrating peak pairs, the lower the probability that more than half are coincidental. For the purposes of the OSIRIS analysis, we have chosen 4 as the minimum number of peak pairs in which half could be identified as outliers. Because the algorithm takes into account peaks in the raw data below the user-defined detection limit, the requirement of at least 4 pairs is usually attained, as observed in the data analysis described below. In cases where there are insufficient numbers of pairs of peaks for the pull-up algorithm to establish a pattern between two channels, OSIRIS instead flags comigrating peaks in the β-channel as uncertain pull-up peaks to ensure user review.

Using the following example, we examine some consequences of the problem statement in (1) for modeling the pull-up pattern, assuming the OSIRIS method of peak pair selection.

Fig. 3 shows a graphical depiction of a possible scenario. The α-channel is blue and there are two peaks, of heights 2,000 RFU and 1,500 RFU, that coincide with peaks in the green β-channel with heights 500 RFU and 400 RFU, respectively. The algorithm as specified above restricts the data to pairs of peaks only, and ignores the other, unpaired peaks in the blue channel. Because it requires at least four pairs of peaks to apply the least median of squares algorithm, there is not enough information to establish a pattern. However, one can see that there are three other peaks in the α-channel with heights 2,500 RFU, 1,700 RFU and 1,200 RFU, that have no comigrating peaks in the β-channel. A knowledgeable analyst reviewing this would conclude that, since the 2,500 RFU (tallest), 1,700 RFU and 1,200 RFU peaks have no comigrating peaks, therefore there is no pull-up. The coefficients θˆ1 and θˆ2, above, should be equal to zero and the two peaks in the β-channel, with heights 500 and 400 RFU, respectively, should be called as alleles.

The algorithm must have broader criteria for data collection. As in Fig. 3, there is information about the pattern contained in the α-channel peaks that have no corresponding β-channel peaks. It is significant that the tallest α-channel peak has no coinciding peak: if the tallest peak on a channel causes no pull-up, then a lesser peak should also cause no pull-up. The set of pairs of peaks must be broadened to include null pull-up peaks, that is, regions in the β-channel which correspond to α-channel peaks and for which no peak has been fit in the β-channel.

The relevant pull-up data must include all instances of the following cases:

  • a)

    Peaks in the α-channel that have coincidental peaks of lesser height, in the β-channel;

  • b)

    Peaks in the α-channel that have a raw data peak (defined below) in the β-channel;

  • c)

    Peaks in the α-channel that are masked (defined below) in the β-channel;

  • d)

    Peaks in the α-channel that do not fit into category (a), (b) or (c).

Category (a) consists of the easily identified peak pairs discussed above, except that we omit α / β pairs for which the α-channel peak is a possible pull-up from a third channel. The reason for the exception is that an α-channel peak may not be a real peak if it turns out to be a pure pull-up or may have a corrected height that is different from what is measured because of pull-up correction if it turns out to be a partial pull-up. If there are no peak pairs of category (a), then there is no pull-up from the α-channel into the β-channel. In this case, none of the other categories needs to be tested between the two channels. If there is at least one category (a) peak pair, then all the other categories must be tested between the two channels.

Category (b) “raw data peaks” are small peaks below the analytical threshold that are too small to be fitted by the peak-fitting algorithm. To find category (b) peaks, we define “raw data peak” in the β-channel corresponding to a peak in the α-channel as follows: A time range is established as the time interval centered on the α-channel peak and with width equal to the α-channel peak width. If c is the time of the α-channel peak center (mode) and w is its peak width, then the interval is[c-w2,c+w2]

This is the same time interval specified above for ascertaining whether two peaks on different channels coincide, or comigrate. Within this time interval, we calculate the maximum raw data value and the minimum raw data value on the β-channel. If the (positive) difference between the two is greater than the measured static noise for the β-channel, then we say that the β-channel has a raw data peak at the time equal to the α-channel peak mode. The value of the pull-up is either the maximum raw data value or the minimum, whichever has the larger absolute value. Note that either or both of the minimum and maximum values can be negative. We include α-channel peaks as primary pull-ups in our collection of data with an associated pull-up having this calculated value in the β-channel.

For category (c), we say that an α-channel peak is “masked” with respect to the β-channel if there is a fitted peak in the β-channel that does not align closely enough to be pull-up, but its tail might interfere with the ability to determine if there were any pull-up into the β-channel from the α-channel peak location. The numerical test for this condition is as follows. If f(t) is the function for the fitted curve in the β-channel for an interval of times, t, containing c, the time of the α-channel peak mode, and if Nβ is the measured noise level for the β-channel, then the α-channel peak is said to be masked in the β-channel iffc>Nβ

In other words, even though the β-channel peak does not comigrate with the α-channel peak, it still has tail height above the β-channel noise level at the time of the mode of the α-channel peak, potentially interfering with the measurement of a possible comigrating peak in the β-channel. By their very nature, masked peaks prevent us from determining what, if any, pull-up they may cause. Therefore, they are excluded from our analysis.

Finally, category (d) peaks in the α-channel have no coincidental fitted curve in the β-channel; they have no raw data peak; and they are not masked. Therefore, category (d) peaks, depending on height, can be indications that there is no pull-up. We include such peaks in our collection of data as primary pull-ups causing zero pull-up.

There is one more issue that concerns data collection. The ratios τi in (7) do not retain any information related to the height of primary pull-up peak in the α-channel. Yet, the heights of the primary pull-ups are critical. In our example, the 2,500 RFU peak that has no corresponding β-channel peak contains much more information than the α-channel 1,200 RFU peak with no corresponding β-channel peak. In this instance, assume that the noise level on the β-channel has been estimated to be 15 RFU and further, that the actual pull-up level is 1%. Then, the pull-up from the 2,500 RFU peak should be 0.01 × 2,500, or 25 RFU, while the pull-up from the 1200 RFU peak should be only 12 RFU. The latter is below the noise level and so one might expect that there would be no detectable pull-up from the 1,200 RFU peak. By contrast, the expected 25 RFU pull-up from the 2,500 RFU peak should be distinguishable visibly on the graph and analytically by the algorithm.

In general, this is a signal-to-noise ratio issue. If the pull-up coefficient from the α-channel to the β-channel is P, the β-channel measured noise level is N, and there are two α-channel peaks with heights H1 and H2 such thatH1<H2then, the signal-to-noise level for each potential pull-up is|P|HiN

Because of (10), the pull-up from H2 has the higher signal-to-noise ratio and has the more reliable information. Therefore, α-channel peaks that are taller should have a greater influence on the algorithm than peaks that are shorter. One way to accomplish this is to add multiple copies of pair data incorporating “tall” α-channel peaks and fewer copies of pair data incorporating “shorter” α-channel peaks. On the other hand, if the ratio in (11) is less than 1, then the pull-up associated with the ratio has no information about the pull-up pattern and should be removed from the collection of data.

If we knew P, we could use the relation|P|HiN<1to find the minimum height threshold for a peak to be considered a primary pull-up on theα -channel. Rewriting the above, yieldsHi<N|P|which means that α -channel peaks with heights below N/|P| would result in pull-ups at most below the β -channel noise level. These peaks would not exhibit measurable pull-up. Thus,N|P|can be used as the minimum threshold for an α-channel primary pull-up into the β-channel. Note that this threshold is dependent on the α-channel / β-channel pair.

The issue with (13) is how to estimate P. We will estimate P using a value Q, to be specified below, and test each α-channel peak height, H, as follows. IfH<N|Q|then that peak is excluded from the α-channel /β-channel pull-up analysis. The calculated minimum height threshold for an α-channel primary pull-up bears an inverse relationship with the (absolute) estimated pull-up fraction, |Q |. Larger values for the pull-up fraction result in lower values for the minimum height threshold and smaller values for the pull-up fraction result in larger values for the minimum height threshold.

Now that we see how the estimated pull-up fraction is to be used, the estimated pull-up fraction is calculated as follows. Because the tallest peaks in the α-channel carry the most information, the preliminary estimate is based on these peaks. We create a set consisting of the ratios of the five tallest peaks of category (a), (b) or (d) in the α-channel and calculate Q, the median of these ratios. The result is an estimate for Q that, when substituted in (14), is likely to cause the inclusion of the same five tallest peaks, plus others from the α-channel, while rejecting shorter peaks from the α-channel that could bias the final pattern analysis.

Note that category (b) peaks, those having raw data pull-up, are similar to category (d) peaks, those having no pull-up discernable from noise, in that such peaks convey more information about a potential pull-up pattern than shorter category (a) peaks because any possible pull-up from a category (b) or (d) peak falls below the level of curve-fitting detection.

Summarizing, no α-channel peak should be added to the data if its height, H, satisfies (14). Also, α-channel peaks that are category (b) or (d) should be added to the data multiple times, depending on the height of the α-channel peak. Category (c) α-channel peaks should not be added at all.

It is important to note that pull-up peaks with off-scale signal, for which the detector is saturated, are likely to exhibit a different pull-up pattern from those that are not off-scale. Accordingly, the full algorithm is applied for all channel pairs first to non-off-scale peaks and then again, for all channel pairs that have off-scale signal. The algorithm is the same in each case, only the data differs.

The sequence of the complete pull-up analysis algorithm integrates the data selection algorithm with the estimation algorithms in (5) – (9). The overall algorithm becomes:

  • A

    Gather all category (a) peak pairs into a list, except those for which the α-channel peak is labeled as a pull-up from a third channel. Add to the list all α-channel category (b) peaks, possibly multiple times. Calculate m, the minimum α-channel peak height for the list so far. Then, add to the list any category (d) peaks for which the α-channel peak height is greater than or equal to m. Do not include any category (c) peaks.

  • B

    If all β-channel peaks comigrating with category (a) peaks in (A) are either below the user-specified detection threshold for the β-channel or are pull-down, or if there are insufficiently many peaks (at least 4) in the collection after step (A), then quit. There is no reportable pull-up from the α-channel to the β-channel, and we assign both pull-up coefficients, θ1 and θ2 in (6) to be 0. Otherwise, if there are sufficiently many peaks and there is at least one category (a) pair with a β-channel peak that is above the detection threshold, then the algorithm continues with step (C).

  • C

    Find the median pull-up ratio, Q, of the five tallest peaks gathered in (A). Remove from the list formed in (A) all pairs for which the α-channel peak height is H and |Q|H<Nβ, where Nβ is the measured noise on the β-channel. Add to the list in (A) any category (d) α-channel peaks with height H<m for which |Q|HNβ. Peaks may be added multiple times, depending on height (see below). If none are removed or added, skip to step (E).

  • D

    Calculate the median pull-up ratio, μ=θ*, equation (8), of all the peaks gathered through step (C).

  • E

    Using μ, calculate the displacement (3) and (4) that discriminates the peaks that are outliers from the peaks that are within bounds.

  • F

    Create a subset of the peaks used to calculate μ consisting solely of peaks that are not outliers, based on the calculations from step (E). Using this set of data, find pull-up coefficients θ1 and θ2 in (6) using a linear least square fit algorithm.

  • G

    For each outlier peak pair, assign a “partial pull-up” artifact to the β-channel peak and calculate its height correction due to pull-up using

hc=θ1x+θ2x2where x is the height of the corresponding α-channel peak. Non-outlier peaks in the β-channel are assigned a “pure pull-up” artifact and their height is corrected to zero due to pull-up, hc=0 (i.e., not an allele signal). Partial pull-up peaks are generally assigned allele calls (see step (H) for an exception). Pure pull-up peaks are not assigned allele calls.
  • H

    When all pairs of channels have been analyzed, test partial pull-up peaks for cumulative corrected heights (the corrected height that results from partial pull-ups from every other channel) that fall below the user-specified analytical threshold. These peaks are given the artifact “partial pull-up corrected below minimum RFU”. No allele call is given to such peaks.

A flow chart of this algorithm appears in Fig. 4.

Section snippets

Examples

Fig. 5 shows a pair of comigrating peaks, the 250 peak in LIZ and the FGA 26 allele in PET (solid arrow). However, the 114, 180 and 240 peaks in LIZ, which are taller than the 250 LIZ peak, have neither comigrating peaks nor raw data pull-up in PET (dashed arrows). The algorithm expands the list of comigrating peaks to include the 114, 180 and 240 peaks in LIZ, but with zero pull-up. The resulting calculation yields zero pull-up from LIZ into PET, and the FGA 26 allele is a valid allele with no

Algorithm performance

Thirty-one single source Promega PowerPlex 16 samples were analyzed to test the efficacy of the new pull-up analysis algorithm. Typical allele peaks had heights in the 1500 – 4000 RFU range, the analytical threshold was set at 100 RFU, and the detection threshold was set at 20 RFU. To reduce visual clutter, OSIRIS does not report pull-up if the pull-up peaks in the channel are either 1) all below the detection threshold or 2) all negative pull-up. Those pull-up peaks are not included in the

Summary

The OSIRIS pull-up analysis algorithm makes unique use of the estimation technique least median of squares, leveraging the algorithm’s ability to identify and, for the purposes of estimation, reject outliers. From the examples and the algorithm performance with ground-truth samples, it can be seen that the algorithm correctly identifies the pull-up pattern between channels and that the ability to reject outliers allows alleles with partial pull-up signals to be recognized. Further, although the

Acknowledgements

This work was supported by the Intramural Research Program of the National Library of Medicine,National Institutes of Health. We gratefully acknowledge Tim Kalafut of DFSC for helpful discussions/review and Brandi Kattman of NCBI and the reviewers for comments that have improved this paper.

References (11)

There are more references available in the full text version of this article.

Cited by (4)

  • Recent advances in forensic biology and forensic DNA typing: INTERPOL review 2019–2022

    2023, Forensic Science International: Synergy
    Citation Excerpt :

    Electropherograms generated by CE instruments exhibit both STR alleles and artifacts that complicate data interpretation. Efforts are underway to understand and model instrumental artifacts [248–251] as well as biological artifacts of the PCR amplification process such as STR stutter products [252,253]. Machine learning approaches are being applied to classify artifacts versus alleles with the goal to eventually replace manual data interpretation with computer algorithms [254–257].

  • Developmental validation of FaSTR™ DNA: Software for the analysis of forensic DNA profiles

    2021, Forensic Science International: Reports
    Citation Excerpt :

    The latter is due to unbound fluorescent dye molecules [23]. Pull-up arises from cross-channel detection of fluorescent signal due to spectral overlap between the dyes used in the marker set [24]. Spikes are artefactual peaks caused by anomalous events in the CE instrumentation and exhibit a narrow peak signal in multiple dye channels [25].

View full text