DEBKS: A Tool to Detect Differentially Expressed Circular RNAs

Circular RNAs (circRNAs) are involved in various biological processes and disease pathogenesis. However, only a small number of functional circRNAs have been identified among hundreds of thousands of circRNA species, partly because most current methods are based on circular junction counts and overlook the fact that a circRNA is formed from the host gene by back-splicing (BS). To distinguish the expression difference originating from BS or the host gene, we present differentially expressed back-splicing (DEBKS), a software program to streamline the discovery of differential BS events between two rRNA-depleted RNA sequencing (RNA-seq) sample groups. By applying to real and simulated data and employing RT-qPCR for validation, we demonstrate that DEBKS is efficient and accurate in detecting circRNAs with differential BS events between paired and unpaired sample groups. DEBKS is available at https://github.com/yangence/DEBKS as open-source software.


Introduction
Circular RNAs (circRNAs) are a class of long non-coding RNAs with a covalently closed continuous loop structure that is formed via back-splicing (BS). Although circRNAs are believed to play important regulatory roles in a number of biological processes [1][2][3][4], only a small number of functional circRNAs have been recognized among hundreds of thousands of circRNA species [5][6][7][8]. A straightforward strategy to identify functional circRNAs is to detect differentially expressed circRNAs (DECs) that may contribute to certain traits or diseases [9][10][11].
Previously developed software, such as find_circ [12], CIRI2 [13], and CIRCexplorer2 [14], focused on the identification of circRNAs, as well as quantification by circular junction counts (CJCs), while recent tools, including CIRI-full [15], CircSplice [16], and CircAST [17], could detect circRNA alternative splicing (AS) and quantify CJCs at the isoform level. As these programs are not able to detect DECs, differentially expressed gene (DEG) detection software (mostly DESeq2 [18] and edgeR [19]) are employed to analyze the data obtained with these CJC-based quantification tools. However, the circRNA expression level is usually associated with the host gene [20], and it is therefore difficult to differentiate whether the difference is due to BS or is only a byproduct of the host gene.
Some programs, such as DCC [21], CIRI2, CIRIquant [22], and CLEAR [23], could measure circRNA expression level by the ratio of BS events, instead of CJCs, to normalize the noise of host gene expression. Given that the BS ratio is not applicable for DESeq2 and edgeR, here we report differentially expressed back-splicing (DEBKS), a tool that detects differential BS events between sample groups with paired or unpaired rRNA-depleted RNA-seq data. DEBKS includes four functional modules: merge for reformatting junction counts from circRNA detection software, count for quantifying expression level of circRNA linear counterpart, anno mainly for estimating circRNA length, and dec for detecting DECs. Using public, in-house, and simulated RNA-seq datasets, we demonstrate that DEBKS effectively improves the BS ratio calculation and has a better performance in DEC detection than current methods.

Workflow of DEBKS for identifying DECs
DEBKS is designed to analyze DECs from rRNA-depleted RNA-seq data without erasing linear RNA. DEBKS includes four modules: merge, count, anno, and dec ( Figure 1A). After detecting circRNAs by users, merge can collect and reformat CJCs from files generated by detection software. This module can also simultaneously collect linear junction counts (LJCs), if available, which is defined as junction counts across flanking introns of candidate back-spliced sites. Alternatively, count is employed for calculating LJCs from BAM files produced by RNA-seq aligners, such as HISAT2 [24] and STAR [25]. Similar to the percentage spliced in [26], the ratio of BS events is measured by percentage back-spliced in (PBSI, w), which is estimated by: where I l represents the effective linear length and I c represents the effective back-spliced length ( Figure 1B). The values of I l and I c are estimated by: where L represents the RNA-seq read length minus the minimum overhang length, and e represents the circRNA length, which can be optionally predicated with gene annotation by anno or fulllength analysis software, such as CIRI-full and CircAST. For smaller circRNAs (e < 2L), w is adjusted with smaller effective lengths. For larger circRNAs (e ! 2L), w is equivalent to the calculation of the ratio in other programs, such as CIRI2 and CIR-Iquant. Finally, the statistical framework of rMATS [26], which was developed for detecting differential AS, is employed to compare the difference of w (|Dw|) with a user-defined threshold between the paired or unpaired sample groups.

Public RNA-seq datasets
Two public rRNA-depleted RNA-seq datasets from the NCBI Sequence Read Archive database (SRA: SRP050149 and SRP156355) were employed to verify the performance of DEBKS. SRP050149 includes four samples of HEK293 cells with the co-depletion of ADAR1 and ADAR2 by RNAi (ADAR1/2 knockdown) and four untreated samples. SRP156355 includes five invasive ductal carcinoma (IDC) samples and five paired normal samples.

Mouse model of transient middle cerebral artery occlusion
Transient focal ischemia of three male mice (C57BL/6, 6-to-7 weeks old, 21 ± 1 g) was induced by middle cerebral artery occlusion (MCAO) using the endovascular suture method [27]. The operation was performed under isoflurane anesthesia. The right external carotid artery (ECA), internal carotid artery (ICA), and common carotid artery (CCA) were exposed under a stereoscopic anatomical microscope. The suture was gently pushed from the ECA to the ICA to block the origin of the middle cerebral artery (MCA; 9-10 mm). After 1 h, the suture was withdrawn to allow MCA reperfusion. After 6 h, mice were sacrificed. Then, the brain tissue from the ischemic core area on the right side of the brain was removed, and the normal brain tissue from the corresponding area on the left side was removed and utilized as a control. Next, RNA extraction was performed immediately.

RNA extraction
Total RNA of the ischemic core and normal areas was extracted by the FastPure Cell/Tissue Total RNA Isolation Kit (Catalog No. RC101, Vazyme, Nanjing, China) with 10-20 mg of brain tissue according to the manufacturer's instructions.

Library preparation for RNA-seq
The library construction and sequencing of rRNA-depleted RNA were performed by Annoroad Gene Technology (Beijing, China). A total of 3 lg RNA per sample was used as the input material for rRNA removal with the Ribo-Zero rRNA Removal Kit (Catalog No. MRZG12324, Illumina, San Diego, CA). Libraries were generated using the NEBNext Ultra Directional RNA Library Prep Kit for Illumina (Catalog No. E7420, NEB, Ipswich, MA), followed by sequencing on an Illumina HiSeq X platform with 150-bp paired-end reads.

RT-qPCR analysis
Total RNA (500 ng) was reverse-transcribed into complementary DNA (cDNA) by using HiScript III RT Strand SuperMix for qPCR (+gDNA wiper) (Catalog No. R323, Vazyme). For each circRNA, two primers inside the back-spliced site and two primers outside the back-spliced site were designed to detect circular junctions and upstream and downstream linear junctions ( Figure S1; Table S1). RT-qPCR was performed by using ChamQ Universal SYBR qPCR Master Mix (Catalog No. Q711, Vazyme) on an ABI7500 system (Applied Biosystems, Foster, CA) according to the manufacturer's procedures, and Actb was used as an internal reference gene. Then, w was calculated by determining the ratio of the relative expression level of the circular junctions to the sum of the circular junctions and the average of the upstream and downstream linear junctions.

Identification of DECs and DEGs in real RNA-seq datasets
For each RNA-seq dataset, raw reads were mapped to the reference genome (hg19 for SRP050149 and SRP156355; mm10 for mouse RNA-seq) by STAR (v2.5.3a). Next, the LJCs of circRNAs were calculated by DEBKS, while CJCs of circRNAs and counts of host genes were calculated by CIRCexplorer2 [14] (v2.3.8) and featureCounts (v1.6.0) [28], respectively. Only circRNAs with average CJCs ! 2 were kept in the following differential expression analysis. DECs and DEGs between the two groups were analyzed using DEBKS and DESeq2, respectively.

Simulation of rRNA-depleted RNA-seq datasets
For the simulated RNA-seq dataset1, we randomly selected 10,000 circRNA host genes based on human gene annotation (GENCODE v19) for 3 vs. 3 sample groups. We assumed that each host gene contained only one linear product but that half genes contained one circular product, while the other half contained two circular isoforms (same back-spliced site with exon skipping in one isoform). For each gene, the total number of linear and circular RNA molecules was sampled from a normal distribution, with the mean and the standard deviation (SD) being derived from the sum of CJCs and LJCs of the in-house MCAO mouse RNA-seq data. For the two sample groups, the ratios of BS events were simulated with 9500 pairs from a beta distribution (p = 1, q = 12) and 500 pairs from a beta distribution (p = 7, q = 10). Half of the ratio pairs had differences greater than 5% as DECs, and the ratio of BS events for each sample was simulated with the mean equal to the group ratio and the SD at five different levels, that is, 0.01, 0.02, 0.05, 0.10, and 0.20. Next, the number of circular and linear molecules was calculated with the total number of molecules and the ratio of BS events. For circRNAs with AS, the percentage of one circRNA isoform against total circular molecules was simulated with uniform distribution. Next, the number of each circular isoform molecule was calculated with the percentage and the total number of circular molecules. Finally, simulated rRNA-depleted RNA-seq dataset1 was generated with simulated numbers of circular and linear molecules.
For the simulated RNA-seq dataset2, again, 10,000 circRNA host genes based on human gene annotation (GENCODE v19) were randomly selected for 3 vs. 3 sample groups. Each gene contained one linear product and two circular isoforms from the same back-spliced site. The total number of linear and circular RNA molecules was sampled as described in dataset1. The ratio pairs between groups of each circular isoform against the circular and the linear isoforms were simulated with beta distribution (p = 1, q = 12 for 9500 ratio pairs; p = 7, q = 10 for 500 ratio pairs). For the two circular isoforms from the same gene, there was one and only one isoform that had a difference of more than 5% from a DEC at the isoform level. Next, the number of circular and linear molecules was calculated with the total number of molecules and the ratio of BS events. Finally, simulated rRNAdepleted RNA-seq dataset2 was generated with a number of circular and linear molecules.

Identification of DECs in simulated RNA-seq datasets
For each RNA-seq dataset, raw reads were mapped to the reference genome (hg19) by BWA mem [29] (v0.7.17-r1188). circRNAs were detected by CIRI2 (v2.0.6) with gene annotation (GENCODE v19). In the simulated RNA-seq dataset1, DEBKS, CircTest [21], and Fisher's exact test (FET) were employed to identify DECs based on the CJCs and LJCs from the output file of CIRI2. In the simulated RNA-seq dataset2, the CJCs of each isoform of circRNAs were detected by CIRI-full, and the corresponding LJCs were obtained from CIRI2.

Identification of DECs with DEBKS in two public RNA-seq datasets
We first evaluated the performance of DEBKS using two public rRNA-depleted RNA-seq datasets. The first dataset includes samples of HEK293 cells with or without the knockdown of both ADAR1 and ADAR2. Given that RNA editing is exclusive to circRNA formation [9], we expected to observe more up-regulated circRNAs in the ADAR1/2 knockdown treatment. With |Dw| > 0.01 at false discovery rate (FDR) < 0.05, DEBKS detected 14 DECs, all of which were upregulated in ADAR1/2 knockdown cells as expected (Figure 2A).
The second dataset includes RNA-seq data of five IDC samples and five paired normal samples. With or without the parameter '-p', DEBKS detected 377 DECs with the paired model and detected 62 DECs with the unpaired model ( Figure 2B). Most (372 of 377) DECs detected by the paired model were down-regulated in cancer samples, which is in keeping with the results of a previous study [30]. The |Dw| of 315 DECs only detected in the paired model was significantly lower (P = 2.6 Â 10 À5 , as determined by a two-tailed Student's t-test, Figure 2C) than that of the 62 DECs detected by both models, indicating that the statistical framework with paired information can result in considerably improved sensitivity in paired samples.  . P values were evaluated by DEBKS for RNA-seq, and by single-tailed paired Student's t-test for RT-qPCR. *, P < 0.1; **, P < 0.05; ***, P < 0.01.

DEBKS is consistent with RT-qPCR validation in mouse brain
Next, we detected DECs between the ischemic core and control area of the mouse brain with MCAO, which can induce differential expression of circRNAs in the ischemic core [31]. At the threshold of |Dw| > 5% and FDR < 0.05, a total of 74 DECs (10 up-regulated and 64 down-regulated in the ischemic area) were identified using rRNA-depleted RNA-seq data by DEBKS with a paired statistical model. Among highly expressed DECs (average CJC ! 10), five were randomly selected for validation by RT-qPCR, and they were all consistent with the RNA-seq data ( Figure 3A and B). Notably, we did not detect any DECs with either DEBKS in the unpaired model or CircTest of DCC, suggesting that paired samples are more powerful for detecting DECs, especially with DEBKS, the only software demonstrated to detect differential BS for paired groups with replicates to date.

DEBKS performs better than current methods in simulation data
To further evaluate the performance of DEBKS, we simulated rRNA-depleted RNA-seq data based on the real junction count distribution with five intragroup SDs, that is, 0.01, 0.02, 0.05, 0.10, and 0.20, which represented the variability of the ratio of BS events among replicates. In the simulated data, the ratio of BS events followed the beta distribution, which could better reflect the ratio distribution in real RNAseq data ( Figure S2A and B). We first compared the performance of BS ratio quantification with other programs, including CIRI2, CIRIquant, DCC, CIRCexplorer2, CLEAR, and sailfish-cir [32] (Figure 4). We found that w calculated by CJCs from CIRI2 and LJCs from DEBKS was most consistent with the simulated ratio. In addition, when combined with LJCs from DEBKS, w was more consistent with the simulated ratio than the ratio derived from CIRI2, CIRIquant, or DCC alone, supporting that DEBKS possesses an advantage for ratio calculation.
With minimal total CJCs of all samples ranging from 5 to 40 counts and SD from 0.01 to 0.20 in the simulated dataset1, we evaluated the performance of dec of DEBKS, as well as the performance of CircTest and FET for comparison ( Figure 5). In terms of the values of area under receiver operating characteristic (AUROC) and true positive rate (TPR) at 5% false positive rate (FPR), DEBKS was determined to perform better than any combination of minimal total CJCs and SD, indicating that DEBKS is powerful in DEC detection for both weakly and strongly expressed circRNAs. However, by applying CIRI-full to calculate CJCs of each circular isoform in the simulated dataset2, we observed that DEBKS possessed low accuracy for detecting DECs at the isoform level ( Figure S3), which was primarily observed because of the inaccurate quantification of CJCs ( Figure S4).

DEBKS is less heavily influenced by host gene expression than DESeq2
Given the low abundance of circRNAs in rRNA-depleted RNA-seq data, ratio-based methods were considered to be more likely to detect DECs than CJC-based methods, such as DESeq2. By using DESeq2 at FDR < 0.05, we failed to detect any DECs in the ADAR1/2 knockdown and MCAO RNA-seq data, while detected 34 DECs in the IDC RNAseq data, among which 47% (16/34) were also detected by DEBKS. In addition, the host genes of DECs detected by DESeq2 were considerably more likely to be DEGs (odds ratio = 7.33, P = 4.02 Â 10 À8 , two-tailed FET) compared to those detected by DEBKS with the paired model (odds ratio = 1.08, P = 0.67, two-tailed FET) and the unpaired model (odds ratio = 1.63, P = 0.15, two-tailed FET), suggesting that DEBKS identifies DECs with less influence from host gene expression.

Discussion
Given that circRNAs are BS isoforms of host genes, we developed DEBKS to identify DECs at the splicing level by BS ratio, which normalizes CJCs with local LJCs. In addition to CJC-based methods, ratio-based methods may help to identify potential functional circRNAs based on the biogenesis mechanism of circRNAs [4]. In our evaluation, DEBKS efficiently identified DECs with less influence from host gene expression. For example, among the 3080 circRNAs detected in IDC RNA-seq data, a strong correlation of log 2 fold change of expression level (log 2 FC) was observed between circRNA and the host gene (Pearson's r = 0.53, P < 2.2 Â 10 À16 ) (Figure S5), indicating that circRNA expression level was significantly affected by host gene expression. In contrast, only a weak correlation was observed between circRNA Dw and log 2 FC of the host gene (Pearson's r = 0.06, P = 4.7 Â 10 À4 ), suggesting that BS events measured by |Dw| were less strongly influenced by the expression level of the host gene.
At present, several software, including CircTest and CIRIquant, can detect DECs with ratio-based methods. Compared with simulated data, DEBKS shows better performance for length adjustment of smaller circRNAs and employment of the statistical framework of rMATS. In addition, DEBKS could be applied in paired or unpaired groups with replicates, which is distinct from CIRIquant, which utilizes a ratio-based test to detect DECs between groups without replicates.
Because DEBKS is based on the ratio of CJCs and LJCs, it is only applicable in RNA-seq analyses of both circRNAs and linear RNAs, such as rRNA-depleted or exome-capture RNAseq [10]. In addition, intergenic circRNAs without linear counterparts, which are estimated to account for less than 10% of total circRNAs [33], are not suitable for DEBKS.

Ethical statement
All experimental animals were purchased from the Experimental Animal Center of Peking University, China. All protocols involving animals, behavioral testing, surgery, and animal care were approved by the laboratory animal welfare ethics committee.

Code availability
The source code of DEBKS and the script to simulate the RNA-seq dataset are available at https://github.com/yangence/DEBKS.