RASLseqTools: open-source methods for designing and analyzing RNA-mediated oligonucleotide Annealing, Selection, and, Ligation sequencing (RASL-seq) experiments

RNA-mediated oligonucleotide Annealing, Selection, and Ligation (RASL-seq) is a method to measure the expression of hundreds of genes in thousands of samples for a fraction of the cost of competing methods. However, enzymatic inefficiencies of the original protocol and the lack of open source software to design and analyze RASL-seq experiments have limited its widespread adoption. We recently reported an Rnl2-based RASL-seq protocol (RRASL-seq) that offers improved ligation efficiency and a probe decoy strategy to optimize sequencing usage. Here, we describe an open source software package, RASLseqTools, that provides computational methods to design and analyze RASL-seq experiments. Furthermore, using data from a large RRASL-seq experiment, we demonstrate how normalization methods can be used for characterizing and correcting experimental, sequencing, and alignment error. We provide evidence that the three principal predictors of RRASL-seq reproducibility are barcode/probe sequence dissimilarity, sequencing read depth, and normalization strategy. Using dozens of technical and biological replicates across multiple 384-well plates, we find simple normalization strategies yield similar results to more statistically complex methods.


INTRODUCTION
Gene expression signature profiling has found widespread adoption in a number of biological and medical applications 1 , 2 . However current methods to assess gene expression signatures are either too expensive or too cumbersome to be applied to large-scale patient sampling or high throughput drug discovery. Rnl2-based RNA Annealing, Selection, and Ligation-sequencing (RRASL-seq, Figure   1) provides a scalable and cost-efficient method to profile the expression of more than one-hundred transcripts for less than one dollar per sample, thereby facilitating large scale drug discovery screens 3 .
Briefly, RASL-seq assays rely upon RNA-templated ligation of oligonucleotide probe pairs that anneal to complementary sequences 4 . RRASL-seq utilizes T4 RNA Ligase 2 (Rnl2), which requires two ribonucleotide bases in the 3'OH position of the RRASL acceptor probe and a donor phosphate on the 5' end of the RRASL donor probe. After annealing, excess RRASL probes are washed away, increasing the specificity of the ligation reaction. Each RRASL probe pair is designed to contain flanking adaptor sequences, allowing addition of well and plate-specific barcodes during the PCR step. Samples are then pooled for sequencing. This protocol yields a defined structure for the Illumina RRASL-seq FASTQ reads: well barcode (8 nucleotides), adaptor 1 (17 nucleotides), ligated RRASL probe pair sequence (~40 nucleotides), adaptor 2 (17 nucleotides), and plate barcode (7 nucleotides).
This defined structure reduces the complexity of the alignment process and facilitates the analysis of observed error profiles.
For some applications the quantitative accuracy of RRASL-seq probe mapping does not affect the results (e.g. viral RNA detection). However, highly accurate RRASL probe alignment, barcode demultiplexing, and data normalization are essential for any use case that requires transcript abundance estimates to be compared across barcoded samples (e.g. high throughput screening).
Additionally, given the cost-efficiency of RASL-seq, computational methods that provide timely analysis of thousands of samples are needed for alignment, demultiplexing, and normalization.
Here, we describe computational methods to design and analyze RRASL-seq experiments that have been implemented as an open source software package, RASLseqTools. Using sample data from a large RRASL-seq experiment, we provide empirical estimates of experimental, sequencing, and alignment error that offer guidance for planning and analyzing RRASL-seq experiments. In addition, we describe visualization strategies and five different normalization methods that minimize systematic error, such as batch effects. Finally, we identify three key predictors of RRASL-seq reproducibility: barcode/probe sequence dissimilarity, sequencing read depth and normalization method.

RRASL-seq Probe Design
One hundred and ten RRASL probe pairs targeting fifty-eight genes were designed using the previously described methodology (Supplementary File 1). Briefly, a probe design protocol similar to Primer-BLAST 5 was constructed using Python 6 , pandas 7 , Primer3 8 , and MELTING 9 . RRASL probe pairs were designed to complement ~40 nucleotides (nt) of a selected sequence near the three-prime end of each transcript. Each RRASL probe pair was extended 17 nts in the five and three prime directions to accommodate adaptor sequences. Critical probe design parameters included an optimal melting temperature of ~68ºC and GC content of ~50%. We define on-target RRASL probe pairs as combinations of donor and acceptor probes designed to measure a target gene. We define mismatched RRASL probe pairs as combinations of donor and acceptor probes not intended to measure a target gene. Additional details and filtering criteria for probe design can be found in the previously published manuscript 3 .

Human B Cell Experiments
CD19+ B cells isolated from human peripheral blood mononuclear cells were cultured for 7 days in 384-well plates at a concentration of 2,000 cells/well in IgE polarizing conditions (50ng/ml recombinant human IL4 and 20ng/ml megaCD40L), as described previously 3 . Each experimental well contained a single small molecule or natural product at one of three concentrations (0.08uM, 0.3uM, and 3uM). Approximately 3000 small molecules and natural products with known pharmacological activities were selected from the Enzo (n=1730), SelleckChem (n=688), and LOPAC (n=687

RRASL-seq Demultiplexing
We designed twenty-seven plate barcodes that have an edit distance no less than two for all pairs of plate barcodes (Supplementary File 2). Plate barcode demultiplexing for sequencing runs 1 and 2 was accomplished using a Levenshtein edit distance (the minimum number of edits required to change one string into another) < 2. Plate barcode error rates were calculated using reads possessing a Levenshtein edit distance of 1 with respect to any single plate barcode. We also designed 384 well barcodes to possess an equal base distribution at each position, in addition to all pairs of well barcodes possessing an edit distance greater than or equal to 3 (Supplementary File 2). Well barcode demultiplexing was accomplished by identifying reads containing a perfect match to a well barcode in the first 8 base calls. We required all reads that did not contain a perfect match to a well barcode be no more than 2 Levenshtein edits, in the first 8 base calls, to a single well barcode. Well barcode error rates were calculated using reads with a Levenshtein edit distance less than three with respect to any single well barcode.

RRASL-seq Alignment Read Set & Alignment Methods
We randomly selected 200,000 reads from each of the two sequencing runs, henceforth referred to as the alignment read set. We excluded reads with >20% N base calls, more than 2 N base calls in the well barcode region, or missing an exact match to the first 6 base calls of the AD1 adaptor sequence (Supplementary Files 3 & 4). We created a gold-standard alignment call for each FASTQ sequence ( Figure  1D) in the alignment read set using exact string matching or Smith-Waterman alignment. We identified FASTQ reads not containing a perfect match to either an on-target or mismatched RRASL probe pair, henceforth referred to as 'imperfect' reads and isolated a 45 base-pair substring beginning after the adaptor 1 sequence. Isolated substrings were then aligned to all unique combinations of donor and acceptor RRASL probes (n=11,009) using the Smith-Waterman (SW) algorithm 10 implemented in the 'swalign' python package. The following scoring parameters were used: +2 match, -1 mismatch, -2 gap opening, -2 gap extension. Therefore, we define the maximum alignment score as 2 * probe pair length. We used two criteria to classify 'imperfect' read sequences as unmapped.
First, 'imperfect' reads mapping to multiple on-target or mismatched probe pairs were classified as unmapped. Second, we calculated the Smith-Waterman Difference Score (SW Difference Score), defined as the difference between the maximum SW Alignment Score (maximum SW Alignment Score between read sequence and any combination of RRASL probes) and the SW Threshold (maximum SW alignment score(s) for the mapped RRASL probe pair, excluding self-alignment). Any read with a SW Difference Score less than one was classified as unmapped. We note that the SW Difference Score provides a conservative threshold, yielding a reduction in recall, but an increase in precision. Each mapped read was further classified as aligning to an on-target or mismatched RRASL probe pair.
The following STAR alignment parameters were used for read alignment: seedSearchStartLmax 1;; seedPerReadNmax 15000;; scoreDelOpen -1;; scoreDelBase -2;; scoreInsOpen -1;; scoreInsBase -2;; outFilterMultimapNmax 100;; outFilterMatchNminOverLread 0.7;; scoreGapNoncan -1;; outFilterScoreMinOverLread 0.7. Due to the defined structure of RRASL-seq FASTQ reads we specified an integer value to trim five-prime (25 nucleotides) and three-prime base calls (19 nucleotides for Sequencing Run1 reads and 33 nucleotides for Sequencing Run2 reads), in order to remove barcode and adaptor sequences using the clip5pNbases and clip3pNbases command-line arguments. This hard base clipping strategy allows alignment of a small portion of the adaptor 2 (AD2) sequence for any probe pair less than 42 bases long. We filtered all FASTQ read alignments containing sequence identity less than 70%. Multiprocessing of FASTQ read files was accomplished using GNU parallel 12 .

RRASL-seq Normalization Samples
Reads from the control cells (n=20 per plate), Day 0 control lysates (n=6 per plate), and Day 7 control lysates (n=6 per plate) were used to benchmark normalization methods. Briefly, FASTQ reads were mapped using the STAR aligner and aggregated in the same manner as the alignment read set, creating a matrix of RRASL-seq probe well counts (RRASL-seq probes rows, sample wellscolumns). We were most interested in assessing normalization effectiveness in the context of reasonable sampling error, therefore we only considered samples from plates with a median sample library size ≥ 1,000 on-target mapped reads (n=21 plates), as technical precision is poor for samples with a library size lower than this level (Supplementary Figure  2).

RRASL-seq Normalization Methods
Five normalization methods: Library Size, housekeeping gene (HK gene), median geometric mean ratio (MGMR), median geometric mean ratio by plate (MGMR by plate) and Regularized Logarithm were compared using read count data from the control cells and control lysates. A pseudocount of 1.0 was added to each RRASL-seq probe well count in order to avoid zero division errors. In order to minimize the less interesting effect of low sequencing depth we only considered RRASL-seq probes with a median unnormalized read count > 20 across the control wells and probes targeting immunoglobulin isotype transcripts in the benchmarking experiments (n=64 RRASL-seq on-target probe pairs, Supplementary File 1). However, as described below, the normalization step used all 110 on-target RRASL-seq probe pairs.
Briefly, the Library Size normalized probe count, Y i , was calculated by dividing each probe's read count plus 1 by the corresponding well library size (total on-target mapped reads in the sample plus Normalized RRASL-seq probe well counts were also standardized, defined here as subtracting the average value across samples (mean-centering) and dividing by the standard deviation (scaling). This post-normalization processing step was applied across all samples in the indicated experiment, e.g.
standardization of all sixty control cells for the principal component analysis described in Figure  8.

Visualization of Well Position and Batch Effects
To visualize differences in sequencing depth across plates and reagent batches we plotted the sum of the on-target read counts for each plate grouped by sequencing run, with a color label indicating reagent batch. To visualize well position effects we calculated and plotted the average library size for each well position across plates using the pandas and seaborn python packages. To visualize the variation of sample library sizes stratified by well position we calculated and plotted the coefficient of variation across plates using the pandas and seaborn python packages.

RRASL-seq Normalization Benchmarking
Once the normalized probe well counts were calculated, we used several approaches to assess normalization effectiveness. Our control cells are technical replicates confounded by reagent batch, plate, and sequencing read depth and therefore useful for normalization benchmarking. We utilized the control cells to assess the effects of five normalization methods using relative log expression (RLE) plots of housekeeping genes, Pearson's correlation between immunoglobulin RRASL-seq probes and principal component analysis. To assess the dependence of normalization on sequencing read depth and global gene expression patterns, we used all control cells to compute the first and second principal components, and then plotted these components against sample library size. Sample library size is defined as the sum of ontarget RRASL-seq probe counts in a sample, and is the value used for the Library Size normalization method.
We also computed and plotted the Pearson's correlation between RRASL-seq probes targeting immunoglobulin transcripts across all control cells. We judged each normalization method using a signal-to-noise score, defined as the difference between the average Pearson's correlation for probes targeting the same isotype and the average Pearson's correlation to probes targeting different isotypes (e.g. average pair-wise correlation between the probes targeting IGHA minus the average correlation between probes targeting IGHA and all probes targeting immunoglobulin isotypes other than IGHA).
We used generalized linear models (GLM, negative binomial distribution, log-link function), implemented in the statsmodels python package, to assess the reduction in library size bias. We constructed two nested GLMs for each RRASL-seq probe. The first model predicts RRASL-seq probe count as a function of plate, batch, and sample library size, and the second model predicts RRASLseq probe count as a function of plate and batch. We counted the number of GLMs that produced a statistically significant coefficient for the library size predictor using unnormalized or normalized data.
We then calculated the difference in variance explained between the two models to assess the influence of library size on prediction accuracy.
To further evaluate the usefulness of the five normalization methods we leveraged the 2000 cell, 1000 cell and 500 cell Day 0 and Day 7 control lysates. We restricted our analysis to plates with a median sample library size above 1000 reads (n=21 plates). Furthermore, we only used the technical duplicate with the higher sample library size because we observed a strong relationship between sample library size and probe count correlation for technical duplicates on the same plate (Supplementary Figure  3).
Principal component analysis of Day 0 and Day 7 serial dilution cell lysates from three plates was carried out, as described for the control samples, in order to evaluate whether normalization methods can produce both a clear separation between the biological conditions in the first principal component.
Fold changes (Day 7 / Day 0) for each probe were calculated after normalization, but before log2transformation, using data from cell lysates on the same plate. Box plots of fold changes (across plates) were plotted for several housekeeping genes. The standard deviation of the log2-transformed fold change across the twenty-one plates was computed for each of the 64 RRASL-seq probes. To assess the relationship between cell density and fold change precision we compared the log2transformed fold-change vectors and the standard deviation of the log2-transformed fold-change vectors from the 2000 cell and 500 cell conditions using paired t-tests. A paired t-test was used in order to control for variance inherent to each probe.

RASLseqTools Design Pattern
RASLseqTools has been implemented in Python 6  Levenshtein edit distance is helpful during barcode and RRASL probe selection prior to experimental execution, as this distance is a major determinant of unambiguous barcode demultiplexing and probe mapping. We have also implemented several visualization and normalization methods as convenience modules in RASLseqTools.

RRASL-seq B cell Screen
We recently conducted a RRASL-seq cell-based screen of ~10,000 conditions in 384 well plates, run as 9 separate reagent batches ( Figure  1

RRASL Barcode and Probe Design
Twenty-seven plate barcodes were used to distinguish individual 384-well plates. Twenty-one and six of these plate barcodes possessed a minimum Levenshtein edit distance of 3 and 2, respectively. We have previously described a set of 384 well barcodes that exhibit balanced base composition ( Figure   2A), a minimum pairwise Levenshtein edit distance of 3 ( Figure  2B), and a variable number of nearest neighbor barcodes with a Levenshtein edit distance of 3 ( Figure  2C).
One hundred and ten RRASL probe pairs were designed to profile fifty-eight distinct genes using our previously published method 3 . The vast majority of the RRASL probe pairs used in this screen had a length of 40 nucleotides, an average melting temperature of 68, and a GC content of 42.4%.
We define several useful concepts for Smith-Waterman read alignment classification. First, maximum Smith-Waterman alignment score (max SW Score) is computed as two times the RRASL probe pair length. The vast majority of RRASL probe pairs used in the B cell screen had a max SW Score of 80 ( Figure  2D). Nearest Neighbor SW Alignment Score is defined as the Smith-Waterman alignment score for a given probe and its nearest on-target/mismatched probe pair. Figure  2E

RRASL Barcode Demultiplexing
In order to assess the performance of our demultiplexing and alignment strategy, in addition to characterizing the observed error rates, we randomly selected 200,000 reads from each of the two sequencing runs (n=400,000 reads), to create an 'alignment read set'. We excluded FASTQ reads with excessive 'N' base calls or a mismatch base call in the first 8 bases of AD1. All presented percentages assume a denominator of 200,000 reads for the indicated sequencing run.
Summed read counts for the twenty-seven plate barcodes exhibited strong concordance across the alignment read set from sequencing runs 1 and 2 (on-target reads, Figure  3A). We find that 96.72% and 94.3% of the alignment read set possess a perfect sequence match to one of the twentyseven plate barcodes for sequencing run 1 and 2, respectively. Only 1.54% and 1.18% of the alignment read set reads are more than 1 edit distance away from the expected plate barcodes for sequencing runs 1 and 2, respectively.
With respect to well barcode demultiplexing, we observed 97.9% of the first eight bases in the alignment read set mapped perfectly and uniquely to one of the 384 well barcodes for both sequencing runs. 1.65% and 1.48% of the alignment read set well barcodes from sequencing run 1 and 2, respectively, map with one edit distance to a single well barcode. An additional 0.37% and 0.54% of all RRASL-seq reads possess an edit distance of 2, leaving less than 0.04% of reads unmapped due to errors in the well barcode sequence.  Figure  3C). One well barcode, AGTAGATC, has perfect sequence similarity with a standard Illumina adaptor sequence and therefore is ill-suited for RRASL-seq applications if standard demultiplexing methods are used. While most well barcodes exhibit an error rate between 0 and 0.5% ( Figure  3D) we have found a subset of well barcodes more prone to either synthesis or sequencing error (Supplementary File 5).    Table  1

RRASLprobe Alignment Benchmarking
The Smith-Waterman algorithm yields optimal pairwise local alignment given an arbitrary scoring scheme, however it is computationally expensive. We benchmarked the STAR aligner using the alignment read set sequences and a reference database composed of all combinations of donor and acceptor RRASL probes (STARrasl). We classified a read as mapped if STAR found greater than 70% sequence identity to a RRASL probe pair. All multi-mapping reads (0.04%, defined as aligning to more than 1 on-target or mismatched probe pair with the same mapping quality score) were classified as unmapped. 43.44% and 43.02% of reads mapped to an on-target RRASL probe pair, and 35.62% and 37.42% of reads mapped to mismatched RRASL probe pairs for sequencing runs 1 and 2, respectively. We observe excellent concordance, 0.999, between STAR and Smith-Waterman alignments ( Figure  6 and Table  2), despite the lower number of reads mapped by STAR. We also explored the sensitivity of STAR alignment to flanking bases and found excellent concordance when the expected percent identity filter is adjusted to account for additional flanking bases, i.e. 60% identity when 5 flanking bases are allowed. BLASTn mapping of the alignment read set yields statistically significantly reduced concordance to both Smith-Waterman and STAR alignments (~0.9), and requires at least 300% more wall time. Finally, we used the STAR aligner to map the alignment read set to the Human Genome reference (hg19), which also yielded excellent concordance, 0.999, with both the STARrasl and Smith-Waterman alignment results.

Visualization of Systematic Bias
Batch and/or well position effects may confound RRASL-seq experiments requiring inter-plate and/or intra-plate comparisons. We have implemented several visualization methods to identify these systematic errors. The first visualization for identifying batch effects revealed dramatic differences between the total on-target mapped read counts for each plate and batch using alignment read set data ( Figure  7A). The second visualization displays heatmaps of average sample library size and coefficient of variation (CV) of sample library sizes across plates by well position ( Figure  7B&C). An experiment with minimal well position effects will show a uniform average and a low CV across the plate. However, we observe substantial variation in average sample library size (well position effects) and coefficient of variation (batch effects) in the first four rows and edge columns, respectively, indicating systematic bias within plates. The column effect may be due to contamination of columns 1 and 2 during incubation, which prompted the investigators to remove media from these wells prior to lysis. A similar analysis using all reads from the RRASL-seq screen yields the same pattern.

Methods to Correct Systematic Bias
Given the systematic bias detected from the aforementioned visualization methods, we sought to evaluate more advanced visualization and normalization methods using the sixty control cell technical replicates, in addition to using Day 0 and Day 7 samples from the control lysates. We first consider the sixty control cell technical replicates.
Principal component analysis 17 (PCA) has been used extensively to assess systematic error in gene expression experiments as it captures higher order structure in low dimensional space 18 . We used STAR aligned on-target RRASL probe counts (n=64 RRASL-seq probes) from control cell technical replicates on three separate plates (20 control samples per plate, Figure  8A). Normalized The principal component analysis suggests that the normalization methods are reducing systematic error introduced by library size discrepancies;; however an alternative explanation is that the normalization methods are destroying the correlation between features arising from true differences in biological condition. In order to test whether the normalization methods preserve and enhance observed associations between features, we plotted the Pearson's correlation between probes targeting immunoglobulin isotypes using the sixty control samples ( Figure  8D). Visual inspection revealed a dampening of off-diagonal associations. We computed a signal-to-noise score by calculating the difference between the average Pearson's correlation for probes targeting the same isotype and the average Pearson's correlation of those probes targeting different isotypes ( Figure  8E).
We find every normalization method improves the signal-to-noise score, with Library Size and MGMR normalization yielding the largest increase over unnormalized data (Supplementary  Table  1) indicating these methods preserve and enhance true associations between probe features. The poor performance of MGMR by plate is likely explained by batch/plate effects driving spurious correlations, a hypothesis supported by the finding that standard scaling each plate individually yields much higher signal-to-noise scores for MGMR by plate. Pearson's correlation between immunoglobulin isotypes across the sixty control samples. E) Signal-tonoise score for RRASL-seq probes targeting IGHA and IGHE. Signal-to-noise score is defined as the difference between the average Pearson's correlation for probes targeting the same isotype and the average Pearson's correlation to probes targeting different isotypes. RASL-seq probe count data was normalized using the indicated method and subsequently mean-centered and scaled to have a standard deviation of 1.
We assessed the reduction in library size bias by constructing nested generalized linear models (GLM) to predict on-target RRASL-seq probe counts (n=64 RRASL-seq probes) in the control samples using plate, batch, +/- sample library size as predictors. We counted the number of GLM models yielding a statistically significant library size coefficient from the full model when using unnormalized or normalized probe count data. Fifty-eight out of sixty-four GLM models using unnormalized probe counts yielded a statistically significant (Bonferroni-corrected) coefficient for the library size predictor. Only two probes yielded statistically significant coefficients for the models trained using normalized data. We also found that library size explained, on average, 31% of the variance for the corresponding RRASL-seq probe count distribution when using unnormalized data.   Table  2). Unnormalized data produces an unacceptable 61 statistically significant probes out of 64. All normalization methods yielded dramatically fewer statistically significant probes.
Interestingly, although Regularized Log normalization produced the smallest variance, which is unsurprising given the methods shrinkage of low count probes to the average count across samples, it yields 26 statistically significant probes, which is far more than the two, three, and four statistically significant probes found using housekeeping gene, Library Size, and MGMR by plate normalization, respectively. MGMR normalization produced 19 statistically significant probes. Figure  10B displays boxplots of the observed standard deviations for the sixty-four RRASL-seq probes stratified by cell density and grouped by normalization method. Using a paired t-test, to control variance inherent to different RRASL-seq probes, we find the standard deviation (across plates) of fold changes is statistically significantly lower in the 2000 cell density condition when compared to the 500 cell density condition ( Figure  10B). The 2000 cell density samples had, on average, higher read counts, which could bias against the observed finding of lower variance for the 2000 cell density condition given the reported mean-variance association. However, the mean-variance association is likely countered by the increased precision of higher read counts. The exception to this general trend is the fold-change estimates using the Regularized Log normalized data and is likely explained by the replacement of low read count values with the mean across samples. These results offer additional evidence for the primary dependence of technical precision on read depth. We found RRASL-seq offers robust reproducibility despite relatively poor sequencing quality, given sufficient barcode/RRASL probe sequence dissimilarity and read depth. A minimum pairwise Levenshtein edit distance between barcodes is a crucial design parameter, particularly plate barcodes. We recommend an edit distance ≥3 to provide adequate divergence for error correction during the demultiplexing process.
Greater than 80% of RRASL-seq reads containing an Illumina adaptor sequence could be mapped to an on-target or mismatched probe combination, indicating the use of 40 nucleotide RRASL-seq probe pairs with a Smith-Waterman threshold greater than 10 offers ample leeway to correct the relatively low rates of sequencing error common to current Illumina sequencing protocols. In addition, we found the suffix array-based STAR aligner offers dramatically improved computational efficiency and excellent concordance to a conservative Smith-Waterman alignment protocol. We recommend STAR for standard processing of RRASL-seq data.
Any application that requires quantitative comparisons between samples requires identification and minimization of systematic error. We found large differences in sequencing depth across reagent batches, plates, and well positions when visualizing read counts at each of the aforementioned levels. The large variability of plate-specific read depth is likely a function of pre-sequencing pooling imbalances due to using equal volumes of PCR product, in contrast to equal-molar pooling.
Therefore, we recommend assessing library concentration for each plate before pooling in order to achieve a more balanced read depth across plates. Library size imbalances seen at different well positions are likely a function of either robotic pipetting or edge effects during the PCR step. Future experiments will test whether different robotic pipetting strategies change the relationship between well position and library size.
Furthermore, we have shown systematic bias dramatically influences biological inference and must be minimized before quantitative comparisons. We found normalization methods using scaling factors (sample library size or the median geometric mean ratio) offers robust correction for systematic bias introduced by read coverage disparities, well position, and/or reagent batch. More statistically complex transformations such as the regularized logarithm method also reduce systematic error, however we found this particular method very computationally intensive and did not outperform scaling factors in either the signal-to-noise ratio fold change or fold-change central tendency analysis.
It is likely that the regularized logarithm method is hampered by the relatively small number of features and may offer superior performance in experiments with far more RRASL-seq probes. While fold-change normalization has a long-standing tradition in experimental biology, we present evidence that normalization prior to fold-change estimation offers dramatic reductions in false discovery rates, even in the context of well-chosen negative controls.
There are several limitations to the presented work. First, although we attempted to design RRASL-seq probes that measure the abundance of a specific transcript isoform, more than half of the probe pairs used in this study bind to multiple isoforms of the target gene. Therefore, we cannot rule out the possibility that alternative splicing has influenced probe counts for RRASL-seq probe pairs that can bind to multiple isoforms. Measuring all known splicing junctions for each target gene is an elegant and useful strategy to overcome this limitation 22 . Second, outliers can produce spuriously large correlation coefficients and can undermine the use of correlation statistics as a measure of reproducibility. The presented scatterplots provide strong evidence that the reported correlations are an accurate reflection of reproducibility in the context of differing library sizes. Third, we recognize that a design matrix can and likely should be employed to further minimize batch effects for any formal statistical comparison between samples, however these investigations are beyond the scope of this manuscript. Finally, we have omitted consideration of robust statistics (e.g. median absolute deviation) in the current work, but acknowledge their utility in the context of large sample sizes enabled by the cost-efficiency of RRASL-seq. The utility of robust statistics will be presented with the comprehensive analysis of the B cell screen in a subsequent manuscript.
To facilitate the use of RRASL-seq we have developed an open source python package, RASLseqTools (https://github.com/erscott/RASLseqTools), offering access to methods useful for the design and analysis of RRASL-seq data. RASLseqTools offers methods to: 1) assess the sequence similarity of barcodes and RRASL probes, 2) demultiplex and align RRASL-seq FASTQ reads, 3) annotate demultiplexed and mapped reads with meta-data, 4) visualize batch effects, and 5) correct systematic error using a variety of normalization methods.