A Simple, Cost-Effective, and Robust Method for rRNA Depletion in RNA-Sequencing Studies

The ability to examine global patterns of gene expression in microbes through RNA sequencing has fundamentally transformed microbiology. However, RNA-seq depends critically on the removal of rRNA from total RNA samples. Otherwise, rRNA would comprise upward of 90% of the reads in a typical RNA-seq experiment, limiting the reads coming from mRNA or requiring high total read depth. A commonly used kit for rRNA subtraction from Illumina was recently unavailable for an extended period of time, disrupting routine rRNA depletion. Here, we report the development of a “do-it-yourself” kit for rapid, cost-effective, and robust depletion of rRNA from total RNA. We present an algorithm for designing biotinylated oligonucleotides that will hybridize to the rRNAs from a target set of species. We then demonstrate that the designed oligonucleotides enable sufficient rRNA depletion to produce RNA-seq data with 75 to 80% of reads coming from mRNA. The methodology presented should enable RNA-seq studies on any species or metagenomic sample of interest.

Precipitation of these oligonucleotides bound to rRNA by magnetic streptavidin beads then 23 depletes rRNA from a complex, total RNA sample such that ~75-80% of reads in a typical RNA-24 seq experiment derive from mRNA. Importantly, we demonstrate a high correlation of RNA 25 abundance or fold-change measurements in RNA-seq experiments between our method and the 26 previously available Ribo-Zero kit. Complete details on the methodology are provided, including 27 open-source software for designing oligonucleotides optimized for any bacterial species or 28 metagenomic sample of interest. 29 30 Importance 31 The ability to examine global patterns of gene expression in microbes through RNA-sequencing 32 has fundamentally transformed microbiology. However, RNA-seq depends critically on the 33 removal of ribosomal RNA from total RNA samples. Otherwise, rRNA would comprise upwards 34 of 90% of the reads in a typical RNA-seq experiment, limiting the reads coming from messenger 35

Introduction
We also applied our algorithm to the 5S rRNA from the 8 species considered. However, because 104 the 5S rRNA is both shorter and more poorly conserved than 16S and 23S rRNA, we were unable 105 to find oligos that are predicted to effectively hybridize to the 5S rRNA from all eight species. 106 Therefore, we ran the algorithm against individual 5S rRNAs and hand-selected two oligos specific 107 to the 5S from each species. In addition, we found that the algorithm was unable to find oligos 108 mapping near the 5'-and 3'-ends of the 23S due to its low conversation among species. To improve 109 binding to these regions, we also identified two oligos that were specific to either Gram-positive 110 or Gram-negative members of our target set of species. Thus, our final set of depletion oligos for 111 a given organism includes 21 total oligos: 17 common oligos targeting 16S and 23S rRNA, 2 oligos 112 that target 23S rRNA in a Gram positive-or Gram-negative-specific manner, and 2 species-113 specific 5S targeting oligos (Table S1; oligos each contain a 5'-biotin modification). 114 We then sought to determine whether our oligo libraries could effectively deplete bacterial rRNA. 115 To deplete rRNAs from a total RNA sample, we incubated biotinylated versions of the 21 designed 116 oligos with total RNA. Samples were then combined with magnetic streptavidin beads to 117 precipitate the oligos bound to rRNAs, followed by isolation of the supernatant, which should be 118 heavily enriched for mRNA ( Figure 2A). We extracted total RNA from exponentially growing 119 cultures of common lab strains of E. coli, B. subtilis, and C. crescentus and performed a single 120 round of rRNA depletion. For all three species, incubation with the 21 depletion oligos 121 substantially decreased the intensity of rRNA signal on a polyacrylamide gel, while tRNA and 122 ncRNA were generally unaffected ( Figure 2B). Moreover, this depletion was modular, as 123 incubation of E. coli total RNA with probes targeting only 16S, 23S, or 5S rRNA resulted in 124 selective depletion of the band corresponding to a given targeted rRNA ( Figure 2B, left). 125 To quantify how well our method depleted rRNA, we performed RNA-seq on the three RNA 126 samples pre-and post-rRNA depletion for E. coli, B. subtilis, and C. crescentus. We then 127 calculated the fraction of reads mapping to rRNA loci in each case ( Figure 2C). The fraction 128 mapping to rRNA decreased following depletion from >95% to 13%, 6%, and 22%, respectively. 129 To determine whether there was any bias for depletion of certain regions of the rRNAs, we 130 compared read counts at each nucleotide position pre-and post-depletion in each E. coli rRNA 131 ( Figure 2D). For the 16S, 23S, and 5S rRNAs, read density was relatively uniform but lower, 132 following depletion, indicating that no particular region of the rRNAs (e.g. regions prone to high structure or partial degradation, preventing effective depletion) was over-represented in our rRNA 134 reads. 135 Many RNA-seq studies are aimed at detecting significant differences in the expression of mRNA 136 in different strains or across different perturbations. To ensure that our depletion technique did not 137 affect the measurement of expression changes (e.g. through unintended depletion of particular 138 mRNAs), we treated E. coli cells with either rifampicin or chloramphenicol for 5 minutes and 139 compared fold changes measured from libraries generated using our depletion strategy to those 140 generated using the previously available commercial kit Ribo-Zero (Illumina). For each depletion 141 method, we calculated the log 2 fold-change in read counts in coding regions following antibiotic 142 treatment compared to a negative control ( Figure 3A-B). For both rifampicin and chloramphenicol 143 treatment, the correlation in log 2 fold-change per coding region between the two rRNA depletion 144 strategies was high (R 2 = 0.98 and 0.97 for rifampicin and chloramphenicol, respectively) across 145 a wide range of changes in gene expression. These results indicate that our method should provide 146 similar results to the Ribo-Zero kit for studies measuring changes in gene expression. 147 Importantly, we also determined if our kit differentially depleted particular mRNAs compared to 148 the Ribo-Zero kit. To do this, we directly compared the RPKM (reads per kilobase per million) 149 values for well-expressed coding regions from a library prepared using our depletion strategy and 150 one prepared using Ribo-Zero ( Figure 3C). Overall, there was a high correlation between the two 151 depletion methods (R 2 = 0.89). However, there were a few outliers. We first identified genes more 152 than two standard deviations away from a log-linear fit of RPKM in a comparison of RNA-seq 153 data generated using our method and Ribo-Zero ( Figure S2A). To ensure that fold-change in 154 expression could also be accurately calculated for these outliers, we returned to the data generated 155 following antibiotic treatment ( Figure 3A-B). For the outlier genes, the change in expression 156 following treatment with rifampicin or chloramphenicol was still highly correlated (Figure S2B-7   Finally, we compared the RPKM values for well-expressed coding regions between libraries  165 prepared using our depletion strategy and libraries from total RNA (no rRNA depletion) from E. 166 coli, B. subtilis, and C. crescentus ( Figure S3A). Depleted libraries for each species showed similar 167 correlations of R 2 = 0.76, 0.71, and 0.66 for E. coli, B. subtilis, and C. crescentus, respectively 168 ( Figure S3A). Though this analysis is complicated by the relatively few reads mapping to mRNA 169 in undepleted samples ( Figure 2C), these results confirm that our method is an effective strategy 170 for depleting rRNAs while maintaining transcriptome composition across multiple species. Taken 171 all together, we conclude that our DIY method provides a broadly applicable, customizable, and 172 cost-effective technique for determining changes in bacterial gene expression patterns in a wide 173 range of organisms and experimental contexts. 174

175
We have developed a simple, fast, easy-to-implement, and cost-effective method for efficiently 176 depleting rRNA from complex, total RNA samples. For three different species, E. coli, B. subtilis, 177 and C. crescentus, we demonstrated robust depletion of 23S, 16S, and 5S rRNAs in a single step 178 such that ~70-90% of reads in RNA-seq arise from non-rRNA sources. This level of mRNA 179 enrichment is sufficient for most RNA-seq studies. Our method showed relatively uniform 180 depletion of rRNAs and minimal, unwanted 'off-targeting' of mRNAs. Additionally, expression 181 changes measured using our method correlated very strongly (R 2 = 0.98, 0.97; Figure 3A-B) to 182 those measured using the previously available Ribo-Zero kit. This strong correlation both validates 183 our method and ensures that data generated via either method can be safely compared or combined. 184 Another method has recently been developed as an alternative to the discontinued Ribo-Zero 185 kit(11). This method is based on hybridization of DNA oligonucleotides to rRNAs followed by 186 digestion with RNase H, which recognizes DNA:RNA hybrids. This method also enabled robust 187 rRNA depletion, although a direct comparison of RNA-seq counts per gene generated using this 188 method and Ribo-Zero was not reported. Additionally, this alternative method requires extended 189 (~60 min) incubations with an RNase, albeit one that should be specific to DNA-RNA hybrids, 190 whereas ours involves only hybridization and a precipitation step. 191 The set of biotinylated oligonucleotides tested here were designed to deplete the rRNA from a set 192 of 8 selected organisms. These organisms span a large phylogenetic range so these 193 oligonucleotides are likely broadly applicable to different bacterial species or even metagenomic samples. However, the set of oligonucleotides can also be easily optimized for a different species 195 or set of species using the open-source software developed here and available on Github. As noted,196 because the 5S rRNA is shorter and less conserved, probes specific to the 5S from a given species 197 must typically be designed. However, the 5S rRNA does not yield nearly as many reads in  seq data for a total RNA sample and may not require depletion for all studies.

Oligonucleotide algorithm 212
The algorithm was initialized with 500 and 1000 oligos of length 15 to 24 nucleotides for the 16S 213 and 23S rRNA, respectively. Oligos were randomly positioned at non-gapped locations of the 214 alignment of the 8 species we selected. Sequences were chosen by randomly selecting a nucleotide 215 matching one or more species at each position. Sequences were then optimized to achieve the 216 target predicted T m of 62.5°C. T m calculations were conducted using the MeltingTemp module in 217 the biopython library. We used the default nearest-neighbor calculation table for RNA-DNA 218 hybrids(12). Notably, this model does not allow prediction of T m for some sequences with multiple 219 sequential mismatches; as such, many oligos begin the optimization with undefined T m . 220 Optimization was conducted by sequential rounds of 'mutation' on each oligo. Allowed mutations 221 included moving the probe from 1-4 bases, shrinking the probe from 1-4 bases (on either end), 222 extending the probe from 1-4 bases (on either end), or swapping the sequence of the oligo at one 223 position to a nucleotide matching a different aligned rRNA. In each round of mutation, the starting 224 oligo was mutated 25 times. From this set of mutated oligos, an oligo close to the target T m was 225 chosen probabilistically (probabilities were determined by a normal distribution centered at 62.5°C 226 with a standard deviation of 2°C). This probabilistic selection, coupled with the large number of 227 oligos initialized, enables oligos to sample the possible binding locations without greedily 228 descending on the first possible binding site they discover. Each oligo was mutated for 100 cycles 229 before oligos binding to a number of sites across the 16S and 23S were selected. 230 To enable better binding of the more variable 23S 5'-and 3'-ends, we split the organisms into two 231 groups (Ec, Pa, Cc, Rp and Ms, Mtb, Bs, Sa) and re-ran the optimization algorithm as above. For 232 each of these groups, we selected 2 additional oligos matching the 5'-and 3'-ends of the 23S. 233

Data and code availability 234
The code used to generate the oligonucleotides is available for download at 235 https://github.com/peterculviner/ribodeplete. The raw and processed sequencing data is available 236 on GEO (GSE142656). 237 Bacterial strains and culture condition 238 E. coli MG1655 was grown to mid-log phase at 37 ºC in LB medium or M9 medium supplemented 239 with 0.1% casamino acids, 0.4% glucose, 2 mM MgSO 4 , and 0.1 mM CaCl 2 . C. crescentus 240 CB15N/NA1000 was grown to mid-log phase in PYE medium at 30 ºC. B. subtilis 168 was grown 241 to mid-log phase at 37 ºC in LB medium. For quantifying changes in expression from antibiotic 242 treatment, cells were harvested 5 minutes after adding chloramphenicol or rifampicin at 50 µg/mL 243 or 25 µg/mL, respectively. 244 SUPERAse-In RNase Inhibitor (ThermoFisher) was added to the beads. The beads were then 287 incubated at room temperature until probe annealing (below) was complete. 288 To anneal biotinylated probes to rRNA, 2-3 µg total RNA, 20x SSC, 30 mM EDTA, water, and 289 the diluted probe mix were mixed on ice in the calculated quantities. The mixtures were incubated 290 in a thermocycler at 70 ºC for 5 min, followed by a slow ramp down to 25 ºC at a rate of 1 ºC per 291 30 sec. To pull down biotinylated probes bound to rRNA, annealing reactions were then added 292 directly to beads in 2x B&W buffer, mixed by pipetting and vortexing at medium speed, and 293 incubated for 5 min at room temperature. Reactions were then vortexed on medium speed and 294 incubated at 50 ºC for 5 min, and then placed directly on a magnetic rack to separate beads from 295 the remaining total RNA. The supernatant was pipetted away from the beads, placed on ice, and 296 diluted to 200 µL in DEPC H 2 O. RNA was then ethanol precipitated at -20 ºC for at least 1 hour 297 with 20 µL of 3 M NaOAc, 2 µL GlycoBlue (Invitrogen), and 600 µL ice-cold ethanol. Samples 298 were centrifuged at 4 ºC for 30 min at 21000 x g to pellet RNA, then washed twice with 500 µL 299 of ice-cold 70% ethanol, followed by centrifugation at 4 ºC for 5 min. RNA pellets were then air-spectrophotometer, and the efficiency of rRNA depletion was verified by running 50 ng of total 302 RNA on a Novex 6% TBE-urea polyacrylamide gel (Invitrogen). 303

Optimization of rRNA depletion 304
In the process of generating our depletion protocol, we tried multiple ratios of streptavidin-coated 305 beads to biotinylated oligos and biotinylated oligos to total RNA. We found that rRNA was 306 depleted robustly across a range of ratios. However, it was critical to have a significant excess of 307 streptavidin-coated beads over biotinylated oligos, as oligos that do not successfully capture rRNA 308 may bind streptavidin more rapidly, thus out-competing bound rRNA-bound oligos and reducing 309 rRNA capture efficiency. We selected our final ratios to achieve reliable depletion of rRNA at a 310 low per-reaction cost. 311

Cost calculation 312
The majority of reagents are common laboratory supplies for labs that work with RNA. To 313 maintain the optimized ratio between streptavidin beads, biotinylated oligos, and rRNA, more 314 oligos and beads must be used to deplete more total RNA. Considering the input, the cost per 315 reaction is approximately $10, $19, or $28 for 1, 2 or 3 µg of RNA, respectively. The majority of 316 the cost per reaction arises from streptavidin-coated magnetic beads; cost could likely be further 317 decreased by using cheaper streptavidin-coated beads or decreasing the quantity of beads used (see 318 above). The up-front cost of purchasing oligos (IDT) is approximately $1000 for large scale 319 synthesis or $500 for smaller scale synthesis (available for sets of oligos >24). However, a single 320 oligo synthesis order is adequate for hundreds of depletion reactions. 321

RNA-seq library preparation 322
Libraries were generated as previously with a few modifications described below(13). The library 323 generation protocol was a modified version of the paired-end strand-specific dUTP method using

Analysis of oligo depletion efficiency 445
To quantify the efficiency of rRNA depletion, the sum of reads mapping to rRNA loci was divided 446 by the total number of mapped reads in each sample. To compare the reads mapping to individual 447 coding regions following rRNA depletion and/or antibiotic treatment ( Figures 3A-C, S2A-F, and  448 S3A), coding regions were filtered for expression by RPKM, and then the correlation between 449 genes for which this ratio was less than or greater than two standard deviations from the mean line. 454 Outliers in Figure S2D were hand-picked. 455

456
The co-first authors discussed and mutually agreed on the author order. We thank I.  indicated. The mean T m of oligos for each species is also shown (red lines). Note that the same 519 oligos are used for each species, but because of 16S sequence variability, the T m can vary, as 520 illustrated for one particular oligo (blue). 521  were generated as in (C), but oligo T m minima were used to generate histograms rather than taking 563 the mean across all oligos. Oligos with undefined T m were not included in the histograms. 564 (E) Distribution of T m values for each 23S-targeting oligo (n = 11) for each individual species 565 indicated. The mean T m of oligos for each species is also shown (red lines). Note that the same 566 oligos are used for each species, but because of 23S sequence variability, the T m can vary, as 567 illustrated for one particular oligo (blue). 568  Figure 3C, with all genes at least two standard deviations away from the least squares fit line 571 (red) indicated in black (n = 69). 572 (B) Figure 3A, with outliers identified in Figure S3A marked in black. For these outliers, the 573 correlation between log 2 (rif+/negative control) for DIY depletion and Ribo-Zero treatment was (C) Figure 3B, with outliers identified in Figure S3A marked in black. For these outliers, the 576 correlation between log 2 (chl+/negative control) for DIY depletion and Ribo-Zero treatment was 577 0.90, compared to 0.97 for all well-expressed coding regions. 578 (D) Figure 3C, with 11 highly-expressed genes more depleted in our method than in Ribo-Zero 579 indicated in black. 580 (E) Figure 3A, with outliers identified in Figure S3D marked in black. For these outliers, the 581 correlation between log 2 (rif+/negative control) for DIY depletion and Ribo-Zero treatment was 582 0.96, compared to 0.98 for all well-expressed coding regions. 583 (F) Figure 3B, with outliers identified in Figure S3D