Development of a program for in silico optimized selection of oligonucleotide-based molecular barcodes

In Seok Yang; Sang Won Bae; BeumJin Park; Sangwoo Kim

doi:10.1371/journal.pone.0246354

Abstract

Short DNA oligonucleotides (~4 mer) have been used to index samples from different sources, such as in multiplex sequencing. Presently, longer oligonucleotides (8–12 mer) are being used as molecular barcodes with which to distinguish among raw DNA molecules in many high-tech sequence analyses, including low-frequent mutation detection, quantitative transcriptome analysis, and single-cell sequencing. Despite some advantages of using molecular barcodes with random sequences, such an approach, however, makes it impossible to know the exact sequences used in an experiment and can lead to inaccurate interpretation due to misclustering of barcodes arising from the occurrence of unexpected mutations in the barcodes. The present study introduces a tool developed for selecting an optimal barcode subset during molecular barcoding. The program considers five barcode factors: GC content, homopolymers, simple sequence repeats with repeated units of dinucleotides, Hamming distance, and complementarity between barcodes. To evaluate a selected barcode set, penalty scores for the factors are defined based on their distributions observed in random barcodes. The algorithm employed in the program comprises two steps: i) random generation of an initial set and ii) optimal barcode selection via iterative replacement. Users can execute the program by inputting barcode length and the number of barcodes to be generated. Furthermore, the program accepts a user’s own values for other parameters, including penalty scores, for advanced use, allowing it to be applied in various conditions. In many test runs to obtain 100000 barcodes with lengths of 12 nucleotides, the program showed fast performance, efficient enough to generate optimal barcode sequences with merely the use of a desktop PC. We also showed that VFOS has comparable performance, flexibility in program running, consideration of simple sequence repeats, and fast computation time in comparison with other two tools (DNABarcodes and FreeBarcodes). Owing to the versatility and fast performance of the program, we expect that many researchers will opt to apply it for selecting optimal barcode sets during their experiments, including next-generation sequencing.

Citation: Yang IS, Bae SW, Park B, Kim S (2021) Development of a program for in silico optimized selection of oligonucleotide-based molecular barcodes. PLoS ONE 16(2): e0246354. https://doi.org/10.1371/journal.pone.0246354

Editor: Ruslan Kalendar, University of Helsinki, FINLAND

Received: November 10, 2020; Accepted: January 15, 2021; Published: February 18, 2021

Copyright: © 2021 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. 2019R1A2C2008050). SK was additionally funded by a faculty research grant from Yonsei University College of Medicine [6-2016-0081]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: α value, Percentage of barcodes excluded from a barcode set; , Converted value of cp according to different l; CP, Complementarity; cp, Maximum number of bases paired; , Normalized gcc score; GCC, GC contents; gcc, Percentage of GCC; , Normalized hp score; HD, Hamming distance; hd, HD-based edit distance; HP, Homopolymer; hp, Length of HPs; l, Barcode length; ld, LD-based edit distance; LD, Levenshtein distance; MD, Minimum distance; N, Number of barcodes to be generated; P_a, One of six total penalty scores (P_GCCt, P_HPt, P_SRt, P_HDt, P_CPt or P_WT) during the initial cycle (= 0^th cycle); P_b, One of six total penalty scores (P_GCCt, P_HPt, P_SRt, P_HDt, P_CPt or P_WT) during the best cycle (a cycle showing the lowest P_WT value in a run); P_CP, Penalty score for CP of a barcode; P_CPi, Penalty score for CP of the i-th barcode; P_CPt, Total penalty score for CP of a barcode set; P_DEC, Percent decrease between P_a and P_b; P_EX, Probability of barcode exclusion; P_GCC, Penalty score for GCC of a barcode; P_GCCi, Penalty score for GCC of the i-th barcode; P_GCCt, Total penalty score for GCC of a barcode set; P_HD, Penalty score for HD of a barcode; P_HDi, Penalty score for HD of the i-th barcode; P_HDt, Total penalty score for HD of a barcode set; P_HP, Penalty score for HPs of a barcode; P_HPi, Penalty score for HP of the i-th barcode; P_HPt, Total penalty score for HP of a barcode set; P_SR, Penalty score for SRs of a barcode; P_SRi, Penalty score for SR of the i-th barcode; P_SRt, Total penalty score for SR of a barcode set; P_Ti, Weighted total penalty scores for GCC, HP, SR, HD, and CP of the i-th barcode; P_WT, Weighted total penalty score of a barcode set; P_WTa, P_WT value during the initial cycle (= 0^th cycle); P_WTb, P_WT value during the best cycle (a cycle showing the lowest P_WT value in a run); , Normalized sr score; sr, Repeat number of SRs; SR, Simple sequence repeat with repeated units of dinucleotides; w1, Weight for P_GCC; w2, Weight for P_HP; w3, Weight for P_SR; w4, Weight for P_HD; w5, Weight for P_CP

Introduction

DNA barcodes are oligonucleotide sequences tagged to target DNA molecules that allow researchers to identify specific molecules in an experiment, including sequencing experiments [1, 2]. There are two general types of DNA barcodes [3]: The first are DNA barcodes that permit the identification of individual samples in a pooled mixture. For the purpose, short DNA barcodes (~4 mer oligonucleotides) are frequently used. The second are molecular barcodes, also known as unique molecular identifiers, that allow for consensus-based error correction by facilitating the unique labeling of individual molecules [4]. In many high-tech sequence analyses, longer barcodes (8–12 mer) of this second type are used to identify raw DNA molecules. DNA barcodes can also be characterized according to their design (i.e., rationally designed or randomly produced) [5], and random barcodes are often used for molecular barcoding [2, 6–8]: note that the barcodes mentioned in this study indicate “in-line barcodes” to be sequenced together with target DNA sequences.

Next-generation sequencing (NGS) has been widely applied in genomic and transcriptomic analyses for various purposes [9], including clinical research [10, 11]. In NGS data analysis, the read depth (i.e., number of reads) of a target region has been used to identify variant alleles and the frequencies thereof [12] and to estimate expression levels of genes [13]. However, this approach sometimes suffers from amplification bias between samples during library preparation and errors in the sequencing step, including incorrect base incorporation. To overcome these issues, researchers have turned to tagging barcodes to individual DNA or RNA fragments, otherwise known as molecular barcoding, in an attempt to reduce amplification bias and to eliminate false positive variants by filtering out duplicate reads. This approach has been applied in various ways: One, it has been used in many clinical studies to identify true variants with very low allele frequencies less than or equal to 1% in liquid biopsies from cancers [14]. Two, it has been implemented in quantitative transcriptome analysis to allow for more accurate quantification of transcript levels [15]. Three, it has also been applied in single-cell sequencing of large cell populations, wherein micro-fluidic droplet barcoding was used to label the genomes of single cells in each droplet [16, 17].

For research purposes, barcodes for random sequences have been used [2, 6–8], which is reasonable because any number of barcodes can be readily produced. However, this makes it impossible to know the exact sequences of all barcodes used in an experiment, thereby hindering a researcher’s ability to identify which base(s) in a barcode was/were mutated. Furthermore, misclustering of barcodes in data analysis due to unexpected mutations in their sequences can sometimes lead to inaccurate interpretation in the identification of mutations with low allele frequency, in the quantification of transcript-level expressions in disease samples, and even in single-cell sequencing. Therefore, we presumed that the preselection of an optimal barcode set in sequencing experiments would considerably reduce the occurrences of the above.

To date, several studies have addressed DNA barcode design [18–23], focusing on detection of mismatches (substitution, insertion, and deletion) between expected (original) and observed barcodes, as well as the correction of such errors and the removal of duplicated reads (i.e., demultiplexing). For barcode design, researchers have considered the length of barcodes, the minimum distance (MD) between two barcodes, GC content (GCC), homopolymers (HPs), and complementarity (CP) between barcodes. For example, balanced GCC (e.g., 40–60%), HPs of minimal length (e.g., <3), and/or minimum CPs (e.g., <3) have been utilized to filter out undesirable barcodes in the generation step. Furthermore, since mutations can be occurred in-line barcodes due to the amplification with primers during library preparation and sequencing together with target DNA molecules during sequencing reactions, mismatched bases between original and mutated ones (i.e., k errors) should be considered in the barcode design step. For detecting and correcting k errors, MD values in barcode sets have been controlled at greater than or equal to k+1 and 2k+1, respectively, calculated based on distance metrics such as Hamming and Levenshtein distances (HD and LD, respectively). Meanwhile, however, researchers have yet to account for simple sequence repeats with repeated units of dinucleotides (SRs), which are somewhat similar to HPs and are frequently observed in longer barcodes (e.g., 8–12 mer). In this reason, SR is another possible source of errors in PCR amplification and sequencing steps as previously reported [24].

The barcode factors of lengths, GCC, HP, SR, and CP have been commonly considered in PCR primer design [25]. For example, primers with unbalanced GCCs can lead to mispriming and misannealing during PCRs [26]. Also, the presence of long HPs or SRs in the primers can cause polymerase slippage [27], thereby yielding insertion or deletion in the regions [28]. Sometimes, two primers complementing each other can form a dimer during PCRs, reducing the product yield [25]. Finally, platform-specific errors occur in NGS, especially in certain genome regions with GC-rich, long HPs, and repeat sequences [29, 30].

Here, we present a versatile and fast program (VFOS) that is capable of selecting an optimal subset of oligonucleotide sequences (barcodes) by considering penalty scores for five barcode factors, along with their relative importance. The program allows a potential user to adjust penalty scores and weights to yield barcode sets that best fit their needs. Therefore, we expect that many researchers will choose VFOS to obtain their own barcode sets that can be applied to various experimental conditions.

Methods

The VFOS program was developed by implementing C++ language, and the g++ compiler was used for its compilation from source code on a Linux operating system. The detailed methods for selecting optimal barcode sequences with VFOS are described below.

Definition of a barcode sequence

A barcode is an oligonucleotide sequence of four bases {A, T, G, C}. We define l as the length of a barcode. For given l, there are 4^l possible barcodes. For example, a total of 1048576 and 16777216 unique barcodes can be generated for l = 10 and 12, respectively.

Definition of barcode factors

We defined five barcode factors (GCC, HP, SR, HD, and CP) that are considered in selecting an optimal barcode subset. gcc is the percentage of nitrogenous bases (guanine [G] and cytosine [C]) in a barcode, calculated as follows: , where n_G(B) and n_C(B) are the number of guanines and cytosines in B, respectively. For example, gcc of the sequence “AGCTAAGCTACC” is 50% (= 6/12). hp is defined as the length of the longest running single base repeats. For example, the hp of the sequence “GTAAACGGGGGC” is five. sr is similar to hp, but defined for dinucleotide repeats (e.g., AGAGAG or TCTCTCTC). In calculating sr, single base repeats are also considered, but as the repeat of dinucleotides (e.g., “GGGGGG”is a 3×“GG”repeat). hd and cp are defined for barcode pairs. hd is HD-based edit distance (the number of positions at which the corresponding bases are different in the ungapped pairwise alignment) between two barcodes. For example, hd of {“AAAAC”, “AGAAG”} is two. CP measures the degree of base pairing (“A-T” or “G-C”) between two barcodes, in the formation of self- or cross-dimers. For given two barcodes B₁ and B₂, cp is defined as the maximum number of Watson-Crick base pairs that is determined by sliding B₁ on the B₂ or vice versa with minimum overlap of l = 3 (see S1 Fig)_. For example, there can be two sequences (“AGACAT” and “GTGTCC”) in relationship of cp = 4, because “GACA” in the first one and “TGTC” in the second one are reverse complementary. Note that if directionality of DNA or RNA sequences is omitted, the left and right ends of the sequences indicate 5′- and 3′-ends, respectively. Accordingly, “AGACAT” and “GTGTCC” mean “5′-AGACAT-3′” and “5′-GTGTCC-3′,” respectively.

From the definitions, we can deduce the desired conditions of the five factors. In general, extremely biased gcc towards 0% or 100% is avoided in the design of PCR primers [25], and can be applied to the barcode design. hp, sr, and cp should remain low to prevent unwanted errors (e.g., replication slippage or formation of barcode dimers). In addition, hd should be maximized for better discrimination between barcodes and tolerance from nucleotide variations. As these factors are considered in up to millions of barcodes, simultaneously, appropriate scoring with weight is the key to the optimization of the final set.

Generation of random sequences to characterize DNA barcodes

To examine the characteristics of the five barcode factors, we generated random barcodes for three different lengths (l = 8, 10, and 12): these were chosen based on their frequency of use. For each l, we generated 1000 random barcode sets, each of which consists of 10⁵ (l = 10 and 12) or 10⁴ barcodes (l = 8).

Determination of penalty scores for evaluating individual barcodes

To apply the barcode factors to the barcode set optimization, we used a penalty-based selection strategy. To determine the most appropriate penalty scores for GCC, HP, and SR (P_GCC, P_HP, and P_SR, respectively), we obtained appropriate equations using the curve fitting method on the website MyCurveFit [31] based on distributions observed in random barcodes with l = 12. To obtain gcc distribution, we calculated mean barcode counts at each gcc value. A Gaussian bell curve with the formula was used to fit its distribution, where x indicates gcc. We set f(gcc) to be maximum if gcc was in the range from 40% to 60% and minimum at the points gcc = 0% or 100%. Then, we determined P_GCC values with a maximum value of 10⁶ at the points gcc = 0% or 100% and a minimum value of 0 in the range of gcc = 40% to 60% by inverting the distribution of f(gcc) values. For distributions of hp and sr, we adopted median barcode counts at each data point of hp and sr, respectively. When hp and sr were examined in a barcode sequence, we only considered them greater than or equal to 2, because there are many cases with hp = 1 and sr = 1 in a barcode. An exponential curve obtained with the formula f(x) = a + b × e^cx was applied to their random distributions, where x represents hp or sr. In the equations, constant terms (a) were removed to obtain f(hp) or f(sr) values greater than or equal to 0. We determined P_HP and P_SR scores by subtracting f(hp) and f(sr) values from 10⁶, respectively, with a modification to transform penalty scores into positive integers.

To generate penalty scores for HD (P_HD), we implemented the concept of accumulation of mutations (herein, substitutions of a base at a nucleotide position) to represent distances or differences between barcode sequences. In a given barcode set with l = n, there exist barcode pairs with a relationship of hd values ranging from 1 to n. Since barcode sequences are sequenced with sample DNA or RNA fragments, unexpected mutations can be observed in the sequences. In result, barcodes with hd = 1, 2, or 3 are more sensitive to misclustering than others, sometimes leading to inaccurate interpretation in data analysis. Accordingly, we only dealt with P_HD values for hd = 1, 2, and 3 in this study. Since mutations can occur during library preparation or a sequencing step, rates for two error types (those for polymerases and sequencing platforms) warrant consideration. The former is known to range from 1/10⁶ to 1/10⁵ [32] and the latter approximately 1/10³ with the Illumina sequencing platform [29]. Of these, we selected the latter, because it is one thousand times higher than that of the former. Accordingly, if the probability that a mutation is detected in a sequence is m (= 1/10³), the probabilities that two and three mutations simultaneously occur would be m² (= 1/10⁶) and m³ (= 1/10⁹), respectively. In this context, we defined P_HD values of barcodes with hd = 1, 2, and 3 as 10⁶, 10³, and 1, respectively, by dividing the probabilities by 1/10⁹. In addition, we set P_HD scores for barcodes with hd≥4 to 0, since the values were smaller than 1 and closer to 0 in the hd range (for example, 1/10³ and 1/10⁶ for those with hd = 4 and hd = 5, respectively). In addition, three look-up-tables containing precalculated hd values for short subsequences with l = 2, 3, and 4 are used to reduce computation time in HD calculation. For example, a barcode pair with l = 8 is composed of two pairs of subsequences with l = 4. Another pair with l = 9 is composed of three pairs of subsequences with l = 2, 3, and 4. Another pair l = 10 is composed of two pairs of subsequences with l = 4 and a pair of subsequences with l = 2. So, the total hd value of the pair can be obtained by summing hd values of subsequence pairs referred from look-up-tables. Note that longer subsequences are preferentially considered than shorter subsequences when referring to hd values from the look-up-tables.

To define penalty scores of CP, we employed maximum Watson-Crick base pairs between given barcodes rather than predicted free energies that has been widely used to predict DNA cross-hybridizations [33], because we had to consider computational time of free energies for all combinations of them. We also assumed that the strength of CP would increase dramatically if the cp between them became greater than a certain limit. Otherwise, it would become close or equal to its minimum value. We set the limit to 2/3 of l. The assumption aims to prevent inclusion of barcodes in relationships of full reverse complementary or close to it. For barcodes with l = 8, 10, and 12, the limits become 5.3, 7.5, and 9, respectively. The maximum value of P_CP score was set to 10⁸ for the barcodes in full complementary relationship. We also determined that the score will decrease by 10² when cp decreases by 1. Thus, P_CP score was decreased to 1 at cp = 8. To reflect the nature of P_CP, we chose an exponential model with the formula f(x) = a + b × e^cx. In the equation, constant terms (a) were removed to obtain a f(x) value greater than or equal to 0.

Unlike penalty scores for GCC, HP, and SR, those for HD and CP are obtained by pairwise comparison between two barcodes and, thus, demand a greater number of calculations. To reduce calculation times for P_HP and P_CP for a barcode pair, we used look-up-tables containing pre-calculated hd values for two barcodes with l = 2, 3, and 4, respectively, as described above. Nonetheless, a total of calculations is required to obtain them for all barcode pairs, where N represents the total number of barcodes.

Algorithm for selecting optimal barcode sets

In the VFOS program, we employed a simple algorithm consisting of two steps: random generation of an initial barcode set and optimal barcode selection via iterative replacement of barcodes. In every cycle of the second step, i) calculation of penalty scores for the selected barcode set, ii) determination of excluded barcodes based on a α value, and iii) addition of new barcodes are repeated. During the process, a weighted total penalty score (P_WT) is applied to check whether a selected barcode set in each cycle is optimal and the best choice, which is calculated by summing all weighted penalty scores for the five barcode factors as shown in Eq (1) (1)

In the equation, P_GCCt, P_HPt, P_SRt, P_HDt, and P_CPt indicate total penalty scores for individual barcode factors in a barcode set as follows: , , , , and , where i stands for each barcode and N represents the total number of barcodes. This algorithm is stopped if P_WT score reaches its minimum limits through further running of an additional five cycles.

Determination of weights for penalty scores and an appropriate α value

We first employed percent decrease (P_DEC) to show the change of penalty score between initial and optimized states, which was calculated using the equation "(median P_a − median P_b)/median P_a × 100" for each run, where P_a and P_b indicate initial and the lowest values for any penalty scores including P_WT, respectively. Then, to find what combination of weights lead to the highest P_DEC values in a given space, we examined the values on the five dimensional space corresponding to weights (w₁, w₂, w₃, w₄, and w₅) for five barcode factors (GCC, HP, SR, HD, and CP). Because P_DEC values at all data points could not be calculated due to the limitation of resource and time for computation, we chose only four points (1, 5, 10, and 20, respectively) per axis to consider total 4⁵ (= 1024) points on the space. Initial weights were set to values at the origin (w₁ = 1, w₂ = 1, w₃ = 1, w₄ = 1, and w₅ = 1). By comparing all P_DEC values, final weights were determined to weight values at the point having the highest P_DEC value.

An appropriate α value was determined to that from a condition showing the minimum P_WT value at nearly 100 data points including the origin after running the VFOS program in six different conditions with α values of 5%, 10%, 20%, 30%, 40%, and 50%.

Exclusion of barcodes in each cycle

The probability of barcode exclusion (P_EX) is determined as shown in Eq (2), which is composed of the total penalty score for a barcode (P_Ti = w₁ × P_GCCi + w₂ × P_HPi + w₃ × P_SRi + w₄ × P_HDi + w₅ × P_CPi), P_WT, N, and α.

(2)

A single barcode factor or a certain combination of the factors can be used as a monitoring target(s) during a program run by providing a user’s own penalty scores and weights. As all barcodes can be excluded with their own probabilities, depending on penalty scores, the algorithm used in the VFOS program avoids instances in which a barcode set falls into a local minimum depending on an initial state.

Performance of the VFOS program

We tested the performance of VFOS in several conditions in which four cases were selected as representative examples: i) all barcode factors (GCC, HP, SR, HD, and CP) were considered; ii) only barcode factors GCC and HD were included; iii) only barcode factor GCC was examined; and iv) only barcode factor HD was considered in the test. The first condition was tested as a default setting of the VFOS program, and the remaining three conditions were examined as examples of user-defined settings. All results were obtained through 1000 repetitions of each condition.

Comparison with other tools

For comparison of barcode sets from VFOS and two other tools (DNABarcodes [22] and FreeBarcodes [23]) that were designed to provide error-correcting barcode sets, we first created four barcode sets containing barcodes with l = 12 using DNABarcodes with minimum distances of 3 and 5, and FreeBarcodes with number of errors of 1 and 2. And barcodes with the same conditions for length and number that created under the respective settings of the two tools were also generated through 1000 repetitions using VFOS. Then, we compared the barcode sets with 6 criteria comprising five barcode factors (GCC, HP, SR, HD, and CP) and an additional factor (LD) as well as computation time between VFOS and other tools.

Results

Characteristics of random barcodes

We obtained distributions of random barcodes for five barcode factors according to l. S1–S3 Figs show the distributions for barcodes with l = 12. gcc distribution was well fitted to binomial distribution and centered on a gcc of 50% as expected (A). In hp distribution, the number of barcodes containing hp values for each base was nearly identical (B). Furthermore, HPs with hp = 2 were most frequently detected for all bases, and the number of HPs with hp>2 was drastically decreased with increasing hp (C). Note that only the hp distribution of an “A” base was represented due to a lack of space and very similar distributions between “A” and other bases (“T”, “G”, and “C”). The distribution of sr was similar with that for hp, in which SRs with sr = 2 were most frequently observed (D). hd distribution followed a right skewed pattern (E). When examining mode hd values according to l, approximately 2/3rds of bases for each l appeared (hd = 6, 8, and 9 for barcodes with l = 8, 10, and 12, respectively; S1–S3 Figs). cp distribution followed a different pattern than HD distribution (F), in which mode values were observed at approximately 1/3rds of bases for each l (cp = 3, 4, and 4 for barcodes with l = 8, 10, and 12, respectively; S1–S3 Figs).

GCC penalty score (P_GCC)

By fitting random gcc distribution to a Gaussian bell curve, we obtained an equation (Eq (3)) wherein μ = 50 and σ = 14.74996 (S5a Fig).

(3)

Then, we defined normalized gcc score, , as represented by Eq (4), to reflect that the best gcc range is from 40% to 60% in primer design for polymerase chain reaction and to set maximum to 10⁶.

(4)

In distribution, values ranged from 4023 (for the barcodes with gcc = 0% or 100%) to 10⁶ (for the barcodes with gcc values from 40% to 60%). Finally, we determined P_GCC values using Eq (5), where the minimum value of (min()) represents at gcc = 0% or 100%.

(5)

Fig 1A shows distributions of P_GCC according to the gcc range at different l (8, 10, and 12). Detailed values of gcc, f(gcc), , and P_GCC are represented in S1 Table.

Download:

Fig 1. Distributions of penalty scores (P_GCC, P_HP, P_SR, P_HD, and P_CP) for 5 barcode factors (GCC, HP, SR, HD, and CP).

A, P_GCC distribution according to the percent GCC (gcc) at different l (8, 10, and 12). B, P_HP distribution according to the homopolymer length (hp) value. C, P_SR distribution according to the repeat number (sr) of simple sequence repeats. D, P_HD distribution according to the Hamming distances (hd). E, P_CP distribution according to the maximum number of bases paired (cp).

https://doi.org/10.1371/journal.pone.0246354.g001

Penalty scores for HP (P_HP) and SR (P_SR)

We obtained two equations (Eqs (6) and (7)) by the curve fitting method from random distributions of hp and sr, respectively, along exponential curves, with modification as described earlier (S5b and S5c Fig, respectively).

(6)

(7)

Since the number of barcodes with hp = 2 and sr = 2 were most frequently observed in distributions of hp (S4c Fig) and sr (S4d Fig), respectively, we set normalized hp score () at hp = 2 and normalized sr score () at sr = 2 as maximum values (10⁶). With a similar approach as that for score, we defined equations to calculate and scores (Eqs (8) and (9), respectively).

(8)

(9)

Finally, P_HP and P_SR values were determined with Eqs (10) and (11), respectively, by subtracting and scores from 10⁶ with a modification to fix maximum penalty score to 10⁶ and to make the minimum penalty score a positive integer.

(10)

(11)

Fig 1B and 1C show distributions of P_HP and P_SR, respectively. In detail, S2 Table represents the values of f(hp), , and P_HP for HP (A); and f(sr), , and P_SR for SR (B) according to increases in hp and sr, respectively.

HD penalty score (P_HD)

According to the assumption for P_HD score, we could represent its equation as shown in Eq (12), where the equation satisfying the conditions of hd = 1, 2, and 3 were obtained by curve fitting, as shown in S5d Fig.

(12)

As shown in Fig 1D and S3a Table, P_HD scores for barcodes in the range between hd = 1 and hd = 3 become 10⁶, 10³, and 1, respectively, and scores in the range of hd≥4 are zero (0).

CP penalty score (P_CP)

We introduced an increment of 10² in the P_CP score system whenever cp increased by 1 to efficiently exclude barcodes that are close to or in a full complementary relationship. For barcodes with l = 12, P_CP scores at cp = 8, 9, 10, 11, and 12 become 1, 10², 10⁴, 10⁶, and 10⁸, respectively. For l<8, we set P_CP scores to 0 as a minimum. Using this P_CP distribution, we determined the equation satisfying the conditions of by using curve fitting approach (Eq (13) and S5e Fig), and treated P_CP score to 0 (zero) in the range of according to the assumption.

(13)

In the equation, was employed to obtain P_CP for different l (8 and 10) and was calculated using Eq (14). For barcodes with l = 12, is equal to cp.

(14)

Fig 1E shows distributions of P_CP scores according to the cp at different l, which increase dramatically if cp is greater than the limit (2/3 of l) and become zero (0) if cp is less than the limit. Detailed values of the scores are also presented in S3b Table.

Weights for penalty scores and α values for barcode exclusion

To determine final weights (w₁, w₂, w₃, w₄, and w₅) for five barcode factors (GCC, HP, SR, HD, and CP), we examined P_DEC values for P_WT scores at total 4⁵ (= 1024) points on the five dimensional space from the origin (1, 1, 1, 1, 1) to the farthest point (20, 20, 20, 20, 20). We could observe the changes of penalty scores (P_GCC, P_HP, P_SR, P_HD, and P_CP) for individual barcode factors together with P_WT between initial and best cycles at the median level as shown in S4 Table, which may be useful as a guideline to produce user own barcode sets by configuring one or more weights. We also obtained distributions of median P_DEC values according to the weights for the respective factors as shown in S6 Fig, where similar patterns of the distributions were observed between two intrinsic factors (GCC (a) and HP (b)), and between two pairwise factors (HD (d) and CP (e)). By comparing all P_DEC values for P_WT in S4 Table, we found the minimum value of 87.43% at the point of (20, 20, 20, 1, 1), revealing that higher intrinsic barcode factors (GCC, HP, and SR) and lower pairwise relationship factors (HD and CP) lead to higher P_DEC (i.e., minimized P_WT). Therefore, we set the final weights to “20, 20, 20, 1, and 1,” respectively, as default values (Table 1).

Download:

Table 1. Initial and final weights determined in this study.

https://doi.org/10.1371/journal.pone.0246354.t001

As shown in S5 Table, we could determine an appropriate α value of 0.2 by comparing P_WT values from the results of six different conditions, which gave us good results in many test runts. Therefore, the value was employed as a default in VFOS.

Inputs and outputs of the VFOS program

Several input parameters are required to run this program: i) l (default: 12); ii) N (default: 10⁵); iii) α value (default: 0.2 [20%]); iv) penalty scores for five barcode factors, of which default values are set for barcodes with l = 12 (See S7 Fig for default penalty scores); and v) weights for penalty scores of GCC, HP, SR, HD, and CP (default: 20, 20, 20, 1, and 1, respectively). Default values are stored in parameter files (“param_general” for l and N; “param_alpha” for α value, “param_GCC, param_HP, param_SR, param_HD, and param_CP” for penalty scores for five barcode factors; and “param_weights” for weights). User-defined settings can be easily implemented by modifying the values of the parameters that are stored in their respective files. In addition, three files, “hd2_table, hd3_table, and hd4_table,” are mandatory for running the program as precalculated HD look-up-tables for short sequences with lengths of 2, 3, and 4, respectively. S7 Fig shows all of the input files containing respective default values.

Output files in text format are generated for each cycle, as well as at the initial state, and contain the selected barcodes and information on barcode length, the number of barcodes generated, and total penalty scores for the five barcode factors, as shown in S8 Fig.

Performance of the VFOS program

To test the performance of VFOS, we set up four conditions with different weights, in which we were able to obtain optimal barcode sets from the conditions. Users should be aware that the term “optimal” means finally optimized under a certain condition defined by parameters. Therefore, if different penalty scores and weights are given, totally different barcodes will be produced based on the parameters.

Results from the default setting.

Since the default setting showed minimized penalty scores for all barcode factors, the setting can be used for general use to obtain an optimal barcode set. As shown in Fig 2A, S6a Table, and S9a Fig, our test results in this setting revealed that initial P_WT values (P_WTa) from all 1000 runs ranged from 1.189E+12 to 1.213E+12 (median P_WTa = 1.198E+12), and the lowest P_WT values (P_WTb) were observed between 1.466E+11 and 1.586E+11 (median P_WTb = 1.506E+11). Consequently, P_DEC values of P_WT scores appeared between 86.73% and 87.82% (median P_DEC = 87.43%), revealing an overall performance of VFOS in this setting of 87.43%. When we examined P_DEC values of penalty scores for individual barcode factors (P_GCCt, P_HPt, P_SRt, P_HDt, and P_CPt), their median values were 89.76% for GCC, 89.99% for HP, 90.07% for SR, 52.83% for HD, and 66.91% for CP. The P_DEC values of GCC, HP, and SR were greater than the overall performance, and those for HD and CP were smaller than it. From the results, we confirmed that many barcodes in the final set were selected to have more balanced GCCs, shorter HPs, shorter SRs, higher HDs, and lower CPs toward the direction intended in previous studies [18–23] and primer design [25].

Download:

Fig 2. Results from test runs in four different conditions.

All five barcode factors (A), two factors (GCC and HD; B), one factor (GCC; C) and another single factor (HD; D) were considered in the conditions for selecting an optimal barcode set in each condition. Steel blue and tomato red bars represent penalty scores of the barcode factors in initial and final sets, respectively. P_GCC, P_HP, P_SR, P_HP, and P_CP represent penalty scores for GCC, HP, SR, HD, and CP, respectively. P_DEC means a percent decrease in penalty scores obtained by comparing results between initial and best cycles.

https://doi.org/10.1371/journal.pone.0246354.g002

Results from the first user-defined setting.

In the first user-defined setting (Fig 2B, S6b Table, and S9b Fig), we selected two barcode factors of GCC and HD as monitoring targets to obtain a barcode set with minimized P_GCCt and P_HDt. For the purpose, we adjusted the weights of their penalty scores as follows: w₁ = 20.0; w₂ = 0; w₃ = 0; w₄ = 1.0; and w₅ = 0. Accordingly, P_WT score was calculated as a summed value of P_GCCt and P_HDt. In this setting, P_WTa was observed between 4.047E+11 and 4.112E+11 (median P_WTa = 4.078E+11), and P_WTb ranged from 1.115E+10 to 1.248E+10 (median P_WTb = 1.192E+10). P_DEC values of P_WT scores appeared between 96.94% and 97.27% (median P_DEC = 97.08%), revealing an overall performance of VFOS in this setting of 97.08%. Because only two barcode factors were considered in this condition, the performance of this condition was highly improved, compared to the default setting. Furthermore, penalty scores for GCC and HD were greatly decreased to 97.33% and 92.53% at the median level. In contrast, penalty scores for other barcode factors (HP, SR, and CP) were either slightly decreased or increased (16.16%, 24.88%, and -49.21%, respectively, at the median level).

Results from the second user-defined setting.

In the second user-defined setting (Fig 2C, S6c Table, and S9c Fig), a single barcode factor of GCC was selected as a monitoring target, and a weight (w₁) was given as a positive value (20.0). In this condition, P_WTb scores (= P_GCCb) decreased to 0 (P_DEC of 100%), meaning that gcc values of all barcodes were in the range between 40% and 60%. In addition, penalty scores for other barcode factors (HP, SR, HD, and CP) slightly decreased or greatly increased (16.79%, 25.55%, -34.42% and -52.99%, respectively, at the median level). From these results, we could note a positive relationship between GCC and HP/SR and a negative one between GCC and HD/CP.

Results from the third user-defined setting.

In the third user-defined setting (Fig 2D, S6d Table, and S9d Fig), the barcode factor HD was selected as a monitoring target, and a weight (w₄) was given as a positive value (1.0). P_WT scores (= P_HDt) dropped to 2.177E+10 at the median level (median P_DEC of 89.92%). Unlike the case of GCC, we discovered that HD showed no or little association with other factors.

Total cycles and running time per cycle.

We examined the running time of VFOS for selecting 10⁵ optimal barcode sets in two different computing environments: i) a dual-core desktop PC and ii) a node cluster system capable of 20 threads. A similar number of cycles per run were recorded in the range between eight and 43 for both systems. The calculated running times per cycle were ~ 722 sec (~13 min) in the former system and ~118 sec (~2 min) in the latter system. For a run of 50 cycles, the expected run time would be ~10 hours 50 min and ~ 1 hour 40 min for the two systems, respectively.

Possible examples of misclustering of barcodes

We found two possible examples of inaccurate interpretation in the data analysis procedure that true variants could occasionally be missed out due to misclustering of barcodes when random barcodes were used to detect low frequent variants, which were related to HD between two barcodes and HPs in barcodes, respectively.

The first example was found in the analysis of a targeted DNA sequencing data set from lung cancer liquid biopsy samples with a L858R mutation of T>G transversion in the epidermal growth factor receptor gene, in which random barcodes were used for accurate identification of mutations (unpublished data). In a sample, the depth-based frequency of the mutation was detected at 0.51% (15 of 2922; forward 7 and reverse 8 barcodes; S9a Fig). However, its barcode-based frequency was identified at 0.75% (13 of 1739; forward 6 and reverse 7 barcodes; S9b Fig). In addition, two barcodes with the relationship hd = 1 (“GGGGCAGTCGGG” vs. “GGGGCAGACGGG”) were observed in both forward and reverse directions of the reads with the mutation. The corresponding reads were not clustered into the same group due to differences in their length and sequence. However, if the reads were clustered into the same group due to an unexpected mutation in one of their barcodes, its barcode-based frequency would be lowered to 0.63% (11 of 1737; forward 5 and reverse 6 barcodes; S9c Fig).

The second example was found in previous research [6], in which barcodes with l = 12 were used for variant identification from liquid biopsy samples of cancer patients by sequencing DNA fragments using the Ion Torrent platform. The sequencing system is known to have high insertion and deletion error rates, especially in HP regions [29, 30]. If the possible errors are not considered in barcode clustering, misclustering of barcodes will occur. For this reason, Kukita et al. [6] employed an approach in which barcodes with l = 11 and 13 that only differed from a barcode with l = 12 by the insertion or deletion of a single base were grouped with the corresponding barcode of l = 12, as shown in S7 Table.

Comparison with other tools

There are four features of VFOS distinct from two other tools (DNABarcodes [22] and FreeBarcodes [23]). The first feature of VFOS is flexibility in program running that can accept user-own parameters of the number of barcodes created, weights, α value, and even penalty scores. Especially, different combinations of weights could be applied to generate different barcode sets as shown in Fig 2 and S4 Table. The second feature is that VFOS can create a large amount of barcodes suitable for molecular barcoding. When we created barcodes with l = 12 using DNABarcodes and FreeBarcodes, 2857 and 98536 barcodes were generated from DNABarcodes with MD (dist) of 3 and 5, respectively, and 178 and 17213 barcodes with l = 12 were obtained from FreeBarcodes with number of errors (num_errors) of 1 and 2, respectively. Unlike the programs, we could generate more than 100000 barcodes using VFOS.

In addition, when we generated the same number of barcodes (178, 2857, 17213, and 98536) using VFOS and then compared barcode sets from VFOS and the other tools with 6 criteria (GCC, HP, SR, HD, LD, and CP) as shown in S8–S13 Tables, we could observe that VFOS produced barcodes with balanced gcc (S11a Fig and S8 Table), shorter hp (S11b Fig and S9 Table), and higher hd (S11c Fig and S10 Table) and ld (S11d Fig and S11 Table). The results revealed that VFOS has comparable performance to the two tools, although inclusion of a small portion of non-allowed or unfavorable barcodes were permitted for generating large amounts of barcodes in limited barcode lengths (l). In comparison with CP distributions, VFOS also produced barcodes with more concentrated cp toward the peak (cp = 5), which was affected to reduce barcodes with unfavorable cp, especially at cp = 12 that means perfect reverse-complementary between barcodes (S12 Table).

The third feature is that only VFOS considers SRs that are often observed in barcode sequence as a typical repeat consisting of dinucleotides. When SR distributions were compared from the results, VFOS showed that many barcodes have shorter sr than those from the two tools (S11e Fig and S13 Table), thereby enabling to reduce possible error sources in PCR amplification and sequencing steps. The fourth feature is that VFOS showed faster or similar performance over computation time compared to the other tools as shown in Fig 3. In comparison with DNABarcodes, VFOS was faster or similar computation time when we produced 2857 (3A) and 98536 barcodes (3B). In comparison with FreeBarcodes, VFOS showed overwhelmingly faster performance when we created 178 (3A) and 17213 barcodes (3D).

Download:

Fig 3. Comparison of computation time for barcode generation between VFOS and other two tools (DNABarcodes and FreeBarcodes).

Computation time between VFOS and DNABarcodes with 2857 (A) and 98536 barcode generation (B), and between VFOS and FreeBarcodes with 178 (C) and 17213 barcode generation (D).

https://doi.org/10.1371/journal.pone.0246354.g003

Discussions

We developed VFOS to provide users with a program that can generate an optimal set of barcode sequences for NGS with unique molecular identifier (UMI) (a.k.a. molecular barcoding) techniques. As the purpose of UMI sequencing is to tag every single DNA fragment in the library preparation step, the number of barcodes should be large enough (e.g., >1 million) in most cases. We would like to note that previous algorithms or tools [18–23] also aimed to design optimal sets of DNA barcodes, which are more applicable for multiplex sequencing. In multiplex sequencing, barcodes are used to discriminate samples that are being sequenced in a same NGS lane, the number of which is a few hundred at most. We found that none of the tools was applicable to UMI due to the limitation in the output size (<100000) for typical barcode lengths (8 to 12). Furthermore, we showed that VFOS has comparable performance by comparing with DNABarcodes [22] and FreeBarcodes [23] in the same condition (l = 12), although different design was applied in VFOS. In addition, VFOS has flexibility in program running, the consideration of SRs as a barcode factor, and fast computation time in comparison with the tools.

In the implementation, we applied many computational techniques and heuristics to increase the speed of VFOS, i.e., pre-calculation of frequently used values and the use of look-up-tables, which enabled the generation of large barcode sets of multi-million scales. Multi-threading was also implemented and effective in speeding up the program (see Results). In addition, we noted that many users do not need to generate new barcode sets again, but can reuse any of those that are already optimized. For such users, we provide several sets of prebuilt barcodes with different conditions (e.g., barcode length and parameters) in the program webpage [34]. These sets can be downloaded and instantly applied for sequencing, reducing time and cost for users substantially.

With the rapid development of sequencing technologies, molecular barcoding is now widely in various systems. Sometimes, complex designs are required for barcoding. For example, the Chromium platform (10x Genomics Inc.) exploits a double barcoding system that consists of the 10x-barcodes and UMI, which identify the single cell and molecule, respectively [35]. In most cases, sample index is further attached to the libraries, presenting three different barcode systems in a read. We expect that further optimization can be done for multiple barcoding, such as considering barcodes of longer lengths and interactions among different barcoding systems.

Conclusions

We have developed a versatile and fast tool (VFOS) for selecting oligonucleotide subsets that can be utilized in the detection of low-frequent somatic mutations, in the quantification of gene- or transcript-level expression, and in single-cell sequencing. Although our in silico work requires experimental validation, we demonstrated the versatility and fast performance of VFOS by providing optimal barcode sets for various conditions. We also showed that VFOS has comparable performance, flexibility in program running, consideration of SRs, and fast computation time in comparison with other two tools (DNABarcodes and FreeBarcodes). Therefore, we expect that many researchers will opt to apply the program for selecting optimal barcode sets during their experiments, including next-generation sequencing.

Availability and requirements

Project name: VFOS

Project home page: https://sourceforge.net/projects/vfos/

Operating system: Linux

Programming language: C++

Other requirements: None

License: Non-Commercial Research Use Only

Supporting information

S1 Fig. Calculation of maximum number of base paired (cp) between two barcodes.

https://doi.org/10.1371/journal.pone.0246354.s001

(PPTX)

S2 Fig. Distributions of five barcode factors (GCC, HP, SR, HD, and CP) observed in random barcodes with l = 8.

https://doi.org/10.1371/journal.pone.0246354.s002

(PPTX)

S3 Fig. Distributions of five barcode factors (GCC, HP, SR, HD, and CP) observed in random barcodes with l = 10.

https://doi.org/10.1371/journal.pone.0246354.s003

(PPTX)

S4 Fig. Distributions of five barcode factors (GCC, HP, SR, HD, and CP) observed in random barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s004

(PPTX)

S5 Fig. Curve fitting results that used to derive penalty scores for five barcode factors.

https://doi.org/10.1371/journal.pone.0246354.s005

(PPTX)

S6 Fig. Distributions of P_DEC values according to the weights (w₁, w₂, w₃, w₄, and w₅).

https://doi.org/10.1371/journal.pone.0246354.s006

(PPTX)

S7 Fig. Parameters setting that produces 10000 barcodes with l = 12 using default penalty scores and weights.

https://doi.org/10.1371/journal.pone.0246354.s007

(PPTX)

S8 Fig. Example results in an output file in text format.

The results comprise barcode length, number of barcodes, total penalty scores of the barcode set, and selected barcodes ordered alphabetically.

https://doi.org/10.1371/journal.pone.0246354.s008

(PPTX)

S9 Fig. Histograms of penalty scores (P_GCC, P_HP, P_SR, P_HD, P_CP, and P_WT) at initial and best cycles for Fig 2.

https://doi.org/10.1371/journal.pone.0246354.s009

(PPTX)

S10 Fig. Schematic representation of mutation frequencies of EGFR L858R.

https://doi.org/10.1371/journal.pone.0246354.s010

(PPTX)

S11 Fig. Comparison results between VFOS and other programs (DNABarcodes and FreeBarcodes).

https://doi.org/10.1371/journal.pone.0246354.s011

(PPTX)

S1 Table. Penalty scores for GCC (P_GCC) depending on the lengths of barcodes (8, 10, and 12 nt).

https://doi.org/10.1371/journal.pone.0246354.s012

(XLSX)

S2 Table. Penalty scores for HP (P_HP) and SR (P_SR) according to the lengths of HP (hp) and the repeat numbers of SR (sr) regardless of the lengths of barcodes.

P_HP for hp = 1 and P_SR for sr = 1 are the same with the cases of hp = 2 and sr = 2, respectively.

https://doi.org/10.1371/journal.pone.0246354.s013

(XLSX)

S3 Table. Penalty scores for two pairwise barcode factors (P_HD and P_CP) according to barcode length.

https://doi.org/10.1371/journal.pone.0246354.s014

(XLSX)

S4 Table. Penalty scores (P_GCC, P_HP, P_SR, P_HD, P_CP, and P_WT) and their P_DEC values according to the weights.

https://doi.org/10.1371/journal.pone.0246354.s015

(XLSX)

S5 Table. Decreasing patterns of P_WT scores in 6 conditions with individual values (0.05, 0.1, 0.2, 0.3, 0.4, and 0.5).

The lowest P_WT score appeared at 5th cycle when α was set to 0.2.

https://doi.org/10.1371/journal.pone.0246354.s016

(XLSX)

S6 Table. Comparison with performance results between default and user-define modes.

In user-defined mode, it was tested to a barcode set having minimized penalty scores for GCC and HD (P_GCC and P_HD).

https://doi.org/10.1371/journal.pone.0246354.s017

(XLSX)

S7 Table. Clustering examples of barcodes considered that insertion or deletion of a single base occurred.

Different bases were underlined in the barcode sequence. Note that this example was adopted from Kukita et al.

https://doi.org/10.1371/journal.pone.0246354.s018

(XLSX)

S8 Table. GC content (GCC) distribution for barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s019

(XLSX)

S9 Table. Homopolymer (HP) distribution for barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s020

(XLSX)

S10 Table. Hamming distance (HD) distribution for barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s021

(XLSX)

S11 Table. Levenshtein distance (LD) distribution for barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s022

(XLSX)

S12 Table. Complementarity (CP) distribution for barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s023

(XLSX)

S13 Table. Simple sequence repeat (SR) distribution for barcodes with l = 12.

https://doi.org/10.1371/journal.pone.0246354.s024

(XLSX)

Acknowledgments

We used a cluster system equipped at Yonsei Genome Center for developing and testing the VFOS program, and we would like to thank the staff involved in the management of the system.

References

1. Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2010;2010(6):pdb.prot5448. pmid:20516186
- View Article
- PubMed/NCBI
- Google Scholar
2. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet. 2018;19(5):269–285. pmid:29576615
- View Article
- PubMed/NCBI
- Google Scholar
3. MacConaill LE, Burns RT, Nag A, Coleman HA, Slevin MK, Giorda K, et al. Unique, dual-indexed sequencing adapters with UMIs effectively eliminate index cross-talk and significantly improve sensitivity of massively parallel sequencing. BMC Genomics. 2018;19(1):30. pmid:29310587
- View Article
- PubMed/NCBI
- Google Scholar
4. Ogawa T, Kryukov K, Imanishi T, Shiroguchi K. The efficacy and further functional advantages of random-base molecular barcodes for absolute and digital quantification of nucleic acid molecules. Sci Rep. 2017;7(1):13576. pmid:29051542
- View Article
- PubMed/NCBI
- Google Scholar
5. Lyons E, Sheridan P, Tremmel G, Miyano S, Sugano S. Large-scale DNA Barcode Library Generation for Biomolecule Identification in High-throughput Screens. Sci Rep. 2017;7(1):13899. pmid:29066821
- View Article
- PubMed/NCBI
- Google Scholar
6. Kukita Y, Matoba R, Uchida J, Hamakawa T, Doki Y, Imamura F, et al. High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients. DNA Res. 2015;22(4):269–277.
- View Article
- Google Scholar
7. Kou R, Lam H, Duan H, Ye L, Jongkam N, Chen W, et al. Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations. PLoS One. 2016;11(1):e0146638. pmid:26752634
- View Article
- PubMed/NCBI
- Google Scholar
8. Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A. 2012;109(36):14508–14513. pmid:22853953
- View Article
- PubMed/NCBI
- Google Scholar
9. Metzker ML. Sequencing technologies—the next generation. Nat Rev Genet. 2010;11(1):31–46. pmid:19997069
- View Article
- PubMed/NCBI
- Google Scholar
10. Yohe S, Thyagarajan B. Review of Clinical Next-Generation Sequencing. Arch Pathol Lab Med. 2017 Nov;141(11):1544–1557. pmid:28782984
- View Article
- PubMed/NCBI
- Google Scholar
11. Roy S, LaFramboise WA, Nikiforov YE, Nikiforova MN, Routbort MJ, Pfeifer J, et al. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch Pathol Lab Med. 2016;140(9):958–975. pmid:26901284
- View Article
- PubMed/NCBI
- Google Scholar
12. Xu C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Comput Struct Biotechnol J. 2018;16:15–24. pmid:29552334
- View Article
- PubMed/NCBI
- Google Scholar
13. Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, et al. A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS One. 2014;9(8):e103207. pmid:25119138
- View Article
- PubMed/NCBI
- Google Scholar
14. Newman AM, Lovejoy AF, Klass DM, Kurtz DM, Chabon JJ, Scherer F, et al. Integrated digital error suppression for improved detection of circulating tumor DNA. Nat Biotechnol. 2016;34(5):547–555. pmid:27018799
- View Article
- PubMed/NCBI
- Google Scholar
15. Fu GK, Xu W, Wilhelmy J, Mindrinos MN, Davis RW, Xiao W, et al. Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Proc Natl Acad Sci U S A. 2014;111(5):1891–1896. pmid:24449890
- View Article
- PubMed/NCBI
- Google Scholar
16. Lan F, Demaree B, Ahmed N, Abate AR. Single-cell genome sequencing at ultra-high-throughput with microfluidic droplet barcoding. Nat Biotechnol. 2017;35(7):640–646. pmid:28553940
- View Article
- PubMed/NCBI
- Google Scholar
17. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–1201. pmid:26000487
- View Article
- PubMed/NCBI
- Google Scholar
18. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008;5(3):235–237. pmid:18264105
- View Article
- PubMed/NCBI
- Google Scholar
19. Krishnan AR, Sweeney M, Vasic J, Galbraith DW, Vasic B. Barcodes for DNA sequencing with guaranteed error correction capability. Electronics Lett. 2011;47(4):236–237.
- View Article
- Google Scholar
20. Bystrykh LV. Generalized DNA barcode design based on Hamming codes. PLoS One. 2012;7(5):e36852. pmid:22615825
- View Article
- PubMed/NCBI
- Google Scholar
21. Buschmann T, Bystrykh LV. Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics. 2013;14:272. pmid:24021088
- View Article
- PubMed/NCBI
- Google Scholar
22. Buschmann T. DNABarcodes: an R package for the systematic construction of DNA sample tags. Bioinformatics. 2017;33(6):920–922.
- View Article
- Google Scholar
23. Hawkins JA, Jones SK Jr, Finkelstein IJ, Press WH. Indel-correcting DNA barcodes for high-throughput sequencing. Proc Natl Acad Sci U S A. 2018;115(27):E6217–E6226. pmid:29925596
- View Article
- PubMed/NCBI
- Google Scholar
24. Fujimoto A, Fujita M, Hasegawa T, Wong JH, Maejima K, Oku-Sasaki A, et al. Comprehensive analysis of indels in whole-genome microsatellite regions and microsatellite instability across 21 cancer types. Genome Res. 2020;30(3):334–346. pmid:32209592
- View Article
- PubMed/NCBI
- Google Scholar
25. PCR Primer Design Guidelines. http://www.premierbiosoft.com/tech_notes/PCR_Primer_Design.html (accessed on 09 November 2020)
26. Guido N, Starostina E, Leake D, Saaem I. Improved PCR Amplification of Broad Spectrum GC DNA Templates. PLoS One. 2016;11(6):e0156478. pmid:27271574
- View Article
- PubMed/NCBI
- Google Scholar
27. Fan H, Chu JY. A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics. 2007;5(1):7–14. pmid:17572359
- View Article
- PubMed/NCBI
- Google Scholar
28. Fazekas A, Steeves R, Newmaster S. Improving sequencing quality from PCR products containing long mononucleotide repeats. Biotechniques. 2010;48(4):277–285. pmid:20569204
- View Article
- PubMed/NCBI
- Google Scholar
29. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51. pmid:23718773
- View Article
- PubMed/NCBI
- Google Scholar
30. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–351. pmid:27184599
- View Article
- PubMed/NCBI
- Google Scholar
31. MyCurveFit. https://mycuvefit.com (accessed on 09 November 2020)
32. Which polymerase is right for me? http://barricklab.org/twiki/bin/view/Lab/ProtocolsTaq (accessed on 09 November 2020)
33. Tulpan D, Andronescu M, Leger S. Free energy estimation of short DNA duplex hybridizations. BMC Bioinformatics. 2010;11:105. pmid:20181279
- View Article
- PubMed/NCBI
- Google Scholar
34. Prebuilt barcode sequences. https://sourceforge.net/projects/vfos/files/Prebuilt_barcode_sequences/ (accessed on 09 November 2020)
35. Reveal the Full Complexity of Cellular Diversity with Single Cell Transcriptomics. https://pages.10xgenomics.com/rs/446-PBO-704/images/10x_PS030_SCGE_SingleCell_v3_digital.pdf (accessed on 09 November 2020)

[ref1] 1. Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2010;2010(6):pdb.prot5448. pmid:20516186
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Salk JJ, Schmitt MW, Loeb LA. Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations. Nat Rev Genet. 2018;19(5):269–285. pmid:29576615
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. MacConaill LE, Burns RT, Nag A, Coleman HA, Slevin MK, Giorda K, et al. Unique, dual-indexed sequencing adapters with UMIs effectively eliminate index cross-talk and significantly improve sensitivity of massively parallel sequencing. BMC Genomics. 2018;19(1):30. pmid:29310587
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Ogawa T, Kryukov K, Imanishi T, Shiroguchi K. The efficacy and further functional advantages of random-base molecular barcodes for absolute and digital quantification of nucleic acid molecules. Sci Rep. 2017;7(1):13576. pmid:29051542
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Lyons E, Sheridan P, Tremmel G, Miyano S, Sugano S. Large-scale DNA Barcode Library Generation for Biomolecule Identification in High-throughput Screens. Sci Rep. 2017;7(1):13899. pmid:29066821
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Kukita Y, Matoba R, Uchida J, Hamakawa T, Doki Y, Imamura F, et al. High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients. DNA Res. 2015;22(4):269–277.
View Article
Google Scholar

[22] View Article

[23] Google Scholar

[ref7] 7. Kou R, Lam H, Duan H, Ye L, Jongkam N, Chen W, et al. Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations. PLoS One. 2016;11(1):e0146638. pmid:26752634
View Article
PubMed/NCBI
Google Scholar

[25] View Article

[26] PubMed/NCBI

[27] Google Scholar

[ref8] 8. Schmitt MW, Kennedy SR, Salk JJ, Fox EJ, Hiatt JB, Loeb LA. Detection of ultra-rare mutations by next-generation sequencing. Proc Natl Acad Sci U S A. 2012;109(36):14508–14513. pmid:22853953
View Article
PubMed/NCBI
Google Scholar

[29] View Article

[30] PubMed/NCBI

[31] Google Scholar

[ref9] 9. Metzker ML. Sequencing technologies—the next generation. Nat Rev Genet. 2010;11(1):31–46. pmid:19997069
View Article
PubMed/NCBI
Google Scholar

[33] View Article

[34] PubMed/NCBI

[35] Google Scholar

[ref10] 10. Yohe S, Thyagarajan B. Review of Clinical Next-Generation Sequencing. Arch Pathol Lab Med. 2017 Nov;141(11):1544–1557. pmid:28782984
View Article
PubMed/NCBI
Google Scholar

[37] View Article

[38] PubMed/NCBI

[39] Google Scholar

[ref11] 11. Roy S, LaFramboise WA, Nikiforov YE, Nikiforova MN, Routbort MJ, Pfeifer J, et al. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch Pathol Lab Med. 2016;140(9):958–975. pmid:26901284
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref12] 12. Xu C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Comput Struct Biotechnol J. 2018;16:15–24. pmid:29552334
View Article
PubMed/NCBI
Google Scholar

[45] View Article

[46] PubMed/NCBI

[47] Google Scholar

[ref13] 13. Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, et al. A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS One. 2014;9(8):e103207. pmid:25119138
View Article
PubMed/NCBI
Google Scholar

[49] View Article

[50] PubMed/NCBI

[51] Google Scholar

[ref14] 14. Newman AM, Lovejoy AF, Klass DM, Kurtz DM, Chabon JJ, Scherer F, et al. Integrated digital error suppression for improved detection of circulating tumor DNA. Nat Biotechnol. 2016;34(5):547–555. pmid:27018799
View Article
PubMed/NCBI
Google Scholar

[53] View Article

[54] PubMed/NCBI

[55] Google Scholar

[ref15] 15. Fu GK, Xu W, Wilhelmy J, Mindrinos MN, Davis RW, Xiao W, et al. Molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. Proc Natl Acad Sci U S A. 2014;111(5):1891–1896. pmid:24449890
View Article
PubMed/NCBI
Google Scholar

[57] View Article

[58] PubMed/NCBI

[59] Google Scholar

[ref16] 16. Lan F, Demaree B, Ahmed N, Abate AR. Single-cell genome sequencing at ultra-high-throughput with microfluidic droplet barcoding. Nat Biotechnol. 2017;35(7):640–646. pmid:28553940
View Article
PubMed/NCBI
Google Scholar

[61] View Article

[62] PubMed/NCBI

[63] Google Scholar

[ref17] 17. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–1201. pmid:26000487
View Article
PubMed/NCBI
Google Scholar

[65] View Article

[66] PubMed/NCBI

[67] Google Scholar

[ref18] 18. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008;5(3):235–237. pmid:18264105
View Article
PubMed/NCBI
Google Scholar

[69] View Article

[70] PubMed/NCBI

[71] Google Scholar

[ref19] 19. Krishnan AR, Sweeney M, Vasic J, Galbraith DW, Vasic B. Barcodes for DNA sequencing with guaranteed error correction capability. Electronics Lett. 2011;47(4):236–237.
View Article
Google Scholar

[73] View Article

[74] Google Scholar

[ref20] 20. Bystrykh LV. Generalized DNA barcode design based on Hamming codes. PLoS One. 2012;7(5):e36852. pmid:22615825
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref21] 21. Buschmann T, Bystrykh LV. Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinformatics. 2013;14:272. pmid:24021088
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref22] 22. Buschmann T. DNABarcodes: an R package for the systematic construction of DNA sample tags. Bioinformatics. 2017;33(6):920–922.
View Article
Google Scholar

[84] View Article

[85] Google Scholar

[ref23] 23. Hawkins JA, Jones SK Jr, Finkelstein IJ, Press WH. Indel-correcting DNA barcodes for high-throughput sequencing. Proc Natl Acad Sci U S A. 2018;115(27):E6217–E6226. pmid:29925596
View Article
PubMed/NCBI
Google Scholar

[87] View Article

[88] PubMed/NCBI

[89] Google Scholar

[ref24] 24. Fujimoto A, Fujita M, Hasegawa T, Wong JH, Maejima K, Oku-Sasaki A, et al. Comprehensive analysis of indels in whole-genome microsatellite regions and microsatellite instability across 21 cancer types. Genome Res. 2020;30(3):334–346. pmid:32209592
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref25] 25. PCR Primer Design Guidelines. http://www.premierbiosoft.com/tech_notes/PCR_Primer_Design.html (accessed on 09 November 2020)

[ref26] 26. Guido N, Starostina E, Leake D, Saaem I. Improved PCR Amplification of Broad Spectrum GC DNA Templates. PLoS One. 2016;11(6):e0156478. pmid:27271574
View Article
PubMed/NCBI
Google Scholar

[96] View Article

[97] PubMed/NCBI

[98] Google Scholar

[ref27] 27. Fan H, Chu JY. A brief review of short tandem repeat mutation. Genomics Proteomics Bioinformatics. 2007;5(1):7–14. pmid:17572359
View Article
PubMed/NCBI
Google Scholar

[100] View Article

[101] PubMed/NCBI

[102] Google Scholar

[ref28] 28. Fazekas A, Steeves R, Newmaster S. Improving sequencing quality from PCR products containing long mononucleotide repeats. Biotechniques. 2010;48(4):277–285. pmid:20569204
View Article
PubMed/NCBI
Google Scholar

[104] View Article

[105] PubMed/NCBI

[106] Google Scholar

[ref29] 29. Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias in sequence data. Genome Biol. 2013;14(5):R51. pmid:23718773
View Article
PubMed/NCBI
Google Scholar

[108] View Article

[109] PubMed/NCBI

[110] Google Scholar

[ref30] 30. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–351. pmid:27184599
View Article
PubMed/NCBI
Google Scholar

[112] View Article

[113] PubMed/NCBI

[114] Google Scholar

[ref31] 31. MyCurveFit. https://mycuvefit.com (accessed on 09 November 2020)

[ref32] 32. Which polymerase is right for me? http://barricklab.org/twiki/bin/view/Lab/ProtocolsTaq (accessed on 09 November 2020)

[ref33] 33. Tulpan D, Andronescu M, Leger S. Free energy estimation of short DNA duplex hybridizations. BMC Bioinformatics. 2010;11:105. pmid:20181279
View Article
PubMed/NCBI
Google Scholar

[118] View Article

[119] PubMed/NCBI

[120] Google Scholar

[ref34] 34. Prebuilt barcode sequences. https://sourceforge.net/projects/vfos/files/Prebuilt_barcode_sequences/ (accessed on 09 November 2020)

[ref35] 35. Reveal the Full Complexity of Cellular Diversity with Single Cell Transcriptomics. https://pages.10xgenomics.com/rs/446-PBO-704/images/10x_PS030_SCGE_SingleCell_v3_digital.pdf (accessed on 09 November 2020)

Figures

Abstract

Introduction

Methods

Definition of a barcode sequence

Definition of barcode factors

Generation of random sequences to characterize DNA barcodes

Determination of penalty scores for evaluating individual barcodes

Algorithm for selecting optimal barcode sets

Determination of weights for penalty scores and an appropriate α value

Exclusion of barcodes in each cycle

Performance of the VFOS program

Comparison with other tools

Results

Characteristics of random barcodes

GCC penalty score (PGCC)

Penalty scores for HP (PHP) and SR (PSR)

HD penalty score (PHD)

CP penalty score (PCP)

Weights for penalty scores and α values for barcode exclusion

Inputs and outputs of the VFOS program

Performance of the VFOS program

Results from the default setting.

Results from the first user-defined setting.

Results from the second user-defined setting.

Results from the third user-defined setting.

Total cycles and running time per cycle.

Possible examples of misclustering of barcodes

Comparison with other tools

Discussions

Conclusions

Availability and requirements

Supporting information

S1 Fig. Calculation of maximum number of base paired (cp) between two barcodes.

S2 Fig. Distributions of five barcode factors (GCC, HP, SR, HD, and CP) observed in random barcodes with l = 8.

S3 Fig. Distributions of five barcode factors (GCC, HP, SR, HD, and CP) observed in random barcodes with l = 10.

S4 Fig. Distributions of five barcode factors (GCC, HP, SR, HD, and CP) observed in random barcodes with l = 12.

S5 Fig. Curve fitting results that used to derive penalty scores for five barcode factors.

S6 Fig. Distributions of PDEC values according to the weights (w1, w2, w3, w4, and w5).

S7 Fig. Parameters setting that produces 10000 barcodes with l = 12 using default penalty scores and weights.

S8 Fig. Example results in an output file in text format.

S9 Fig. Histograms of penalty scores (PGCC, PHP, PSR, PHD, PCP, and PWT) at initial and best cycles for Fig 2.

S10 Fig. Schematic representation of mutation frequencies of EGFR L858R.

S11 Fig. Comparison results between VFOS and other programs (DNABarcodes and FreeBarcodes).

S1 Table. Penalty scores for GCC (PGCC) depending on the lengths of barcodes (8, 10, and 12 nt).

S2 Table. Penalty scores for HP (PHP) and SR (PSR) according to the lengths of HP (hp) and the repeat numbers of SR (sr) regardless of the lengths of barcodes.

S3 Table. Penalty scores for two pairwise barcode factors (PHD and PCP) according to barcode length.

S4 Table. Penalty scores (PGCC, PHP, PSR, PHD, PCP, and PWT) and their PDEC values according to the weights.

S5 Table. Decreasing patterns of PWT scores in 6 conditions with individual values (0.05, 0.1, 0.2, 0.3, 0.4, and 0.5).

S6 Table. Comparison with performance results between default and user-define modes.

S7 Table. Clustering examples of barcodes considered that insertion or deletion of a single base occurred.

S8 Table. GC content (GCC) distribution for barcodes with l = 12.

S9 Table. Homopolymer (HP) distribution for barcodes with l = 12.

S10 Table. Hamming distance (HD) distribution for barcodes with l = 12.

S11 Table. Levenshtein distance (LD) distribution for barcodes with l = 12.

S12 Table. Complementarity (CP) distribution for barcodes with l = 12.

S13 Table. Simple sequence repeat (SR) distribution for barcodes with l = 12.

Acknowledgments

References

GCC penalty score (P_GCC)

Penalty scores for HP (P_HP) and SR (P_SR)

HD penalty score (P_HD)

CP penalty score (P_CP)

S6 Fig. Distributions of P_DEC values according to the weights (w₁, w₂, w₃, w₄, and w₅).

S9 Fig. Histograms of penalty scores (P_GCC, P_HP, P_SR, P_HD, P_CP, and P_WT) at initial and best cycles for Fig 2.

S1 Table. Penalty scores for GCC (P_GCC) depending on the lengths of barcodes (8, 10, and 12 nt).

S2 Table. Penalty scores for HP (P_HP) and SR (P_SR) according to the lengths of HP (hp) and the repeat numbers of SR (sr) regardless of the lengths of barcodes.

S3 Table. Penalty scores for two pairwise barcode factors (P_HD and P_CP) according to barcode length.

S4 Table. Penalty scores (P_GCC, P_HP, P_SR, P_HD, P_CP, and P_WT) and their P_DEC values according to the weights.

S5 Table. Decreasing patterns of P_WT scores in 6 conditions with individual values (0.05, 0.1, 0.2, 0.3, 0.4, and 0.5).