SMRT-Cappable-seq reveals complex operon variants in bacteria

Current methods for genome-wide analysis of gene expression require fragmentation of original transcripts into small fragments for short-read sequencing. In bacteria, the resulting fragmented information hides operon complexity. Additionally, in vivo processing of transcripts confounds the accurate identification of the 5′ and 3′ ends of operons. Here we develop a methodology called SMRT-Cappable-seq that combines the isolation of un-fragmented primary transcripts with single-molecule long read sequencing. Applied to E. coli, this technology results in an accurate definition of the transcriptome with 34% of known operons from RegulonDB being extended by at least one gene. Furthermore, 40% of transcription termination sites have read-through that alters the gene content of the operons. As a result, most of the bacterial genes are present in multiple operon variants reminiscent of eukaryotic splicing. By providing such granularity in the operon structure, this study represents an important resource for the study of prokaryotic gene network and regulation.

(Pearson corr=0.022, p=0.6487) between the degree of read-through of the TTS and the free energy of the RNA structure (X-axis) of the TTS can be observed. C. Motif logos for TTSs (top) that have high (more than 25%) read-through and for TTSs (bottom) that have low (less than 25%) read-through. Red arrow indicates the termination site. X-axis: Predicted TTS is located at position 0, negative and positive values correspond to positions upstream and downstream of TTS respectively. Y-axis: information content expressed in bits.

Supplementary Figure 4: Structure of the Leu operon.
Individual mapped reads ordered by TSS location and read size in the Leu operon locus. The red arrow denotes the termination site controlled by a riboswitch. Reads in blue represent transcripts originated from the TSS1 upstream of the leader peptide leuL (pos=83735). Reads in grey represent transcripts originated from the TSS2 downstream of the leader peptide (pos=83581).

Supplementary Figure 5: SMRT-Cappable-seq read profile in Rich condition.
Schematic representation of previously annotated operons from RegulonDB database (dash line indicates weak evidence, and solid line indicates strong evidence) and operons defined by SMRT-Cappable-seq. There are four annotated operons covering the coding genes in the shown genome region, three of which are also defined by SMRT-Cappable-seq (hslV-hslU, menA-rra, rra). SMRT-Cappable-seq additionally identifies four extended operons. Two of them (ftsN-hslV-hslU and hslU) have previously unidentified promoters and 5' genes. The other cytR-ftsN has the same promoter as the previous cytR unit, but includes an additional gene at the 3'end. Another extended operon is composed of the hslV-hslU-menA-rraA genes due to the readthrough at the hslV-hslU terminator. Red arrow indicates the previously known TTS for the operon.  Examples of staircase patterns generated from transcripts sharing the same TSS as a result of sequential read-through at termination sites. Data shown is from M9 growth condition. Arrows denote the direction of transcription.   A. An example of the dapA-ypfJ operon containing the previously known dapA-bamC TTS (red arrow), where the read-through is condition dependent. B. Schema of this TTS location. The attenuator structure (orange) that locates 10 bp upstream of the termination site (blue) is predicted and deleted to examine the role of this regulatory region in the control of transcription termination. C. D. The levels of read-through across the TTS of wild-type (WT) and deletion strain (DEL) were measured for bacteria grown in both M9 and Rich medium by qPCR. The qPCR1 primers amplify an upstream region of the predicted attenuator site. For qPCR product2, the forward primer binds to the 5'end of the known dapA-bamC TTS while the reverse primer binds to the downstream region of the TTS. Therefore, read-through ratio was calculated as the amount of qPCR2 product divided by the amount of qPCR1 product. Data shown are the means ± SD from five independent repeats. Unpaired t-test was used to determine significance. The read-through in DEL is significantly higher than in WT in Rich condition.

Supplementary Note 1: Comparison between Illumina and PacBio data
In order to investigate whether PacBio data can quantitatively estimate the expression level of transcripts, we compared the gene expression levels measured using SMRT-Cappable-seq and standard RNA-seq from Illumina. For this, we used a previously published RNA-seq experiment done on E. coli K12 strain MG1655 grown in similar conditions as our experiments (M9 medium) [1]. We estimated gene expression level (in RPKM) for both the Illumina dataset and the SMRT-Cappable-seq dataset based on the number of reads overlapping with annotated genes (Methods). At the gene level, the correlation between Illumina and PacBio dataset is estimated to be 0.798 (p<2.2e-16) using Spearman's rank correlation.

Supplementary Note 2: Comparison between biological replicates
In order to measure the reproducibility, we prepared two SMRT-Cappable-seq libraries using 5 µg of E. coli RNA from M9 medium as biological replicates. These two libraries were sequenced using PacBio Sequel platform as mentioned in Methods, and we generated around 0.2 million reads for each replicate. For both replicates, we estimated gene expression level and counted the number of reads at TSS and TTS. The correlation between two replicates is greater than 0.95 estimated using Pearson correlation ( Supplementary Fig. 2a, 2b and 2c).

Supplementary Note 3: Comparison of SMRT-Cappable-seq reads with processed/primary rRNA
For rRNA analysis, SMRT-Cappable-seq reads were mapped to E. coli U00096.2 genome and classified based on the overlap with its annotated genomic features. Overlap calculations were performed using bedtools intersect version v2.24.0 (-s parameter). Reads mapped to E. coli genome were classified into 3 categories (Fig. 1d): (1) starting at the known primary rRNA TSS sites [2] (primary rRNA); (2) overlapping with the rRNA genes but do not start at the primary TSS sites (processed rRNA); (3) overlapping with other coding genes (protein coding gene). Here we distinguished processed and primary rRNA because primary rRNA transcripts are expected to be enriched in the SMRT-Cappable-seq library compared to control, while the processed rRNA should be depleted.
Mappable reads can be classified into one of these categories with the following result: For E. coli grown under M9 condition, in SMRT-Cappable-seq library, from 279298 total mappable reads, 13621 reads start at the TSSs of primary rRNA transcripts (4.87%), 12170 reads are processed rRNA (4.35%) and 253507 are assigned to other protein coding genes (90.75%) .