TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs

Transcriptome assemblers aim to reconstruct full-length transcripts from RNA-seq data. We present TransComb, a genome-guided assembler developed based on a junction graph, weighted by a bin-packing strategy and paired-end information. A newly designed extension method based on weighted junction graphs can accurately extract paths representing expressed transcripts, whether they have low or high expression levels. Tested on both simulated and real datasets, TransComb demonstrates significant improvements in both recall and precision over leading assemblers, including StringTie, Cufflinks, Bayesembler, and Traph. In addition, it runs much faster and requires less memory on average. TransComb is available at http://sourceforge.net/projects/transcriptomeassembly/files/. Electronic supplementary material The online version of this article (doi:10.1186/s13059-016-1074-1) contains supplementary material, which is available to authorized users.


Reference transcripts downloaded for real datasets
In order to evaluate the performances of the assemblers on real datasets, we downloaded the reference transcripts of human and mouse from the Ensemble Genome Browser. The human reference transcripts are downloaded from http://ftp.ensembl.org/pub/release-73/gtf/homo_sapiens/Homo_sapiens.GRCh37.73.gtf.gz and the mouse reference transcripts from http://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/Mus_musculus.NCBIM37.67.gtf.gz

Construction of splicing graphs
The splicing graphs are constructed based on mapping the RNA-seq reads to a reference genome by Tophat2. Generally speaking, if the reference genome is well resolved and most RNA-seq reads are of high sequencing quality, the mapping results of Tophat2 could be highly reliable. Then TransComb could effectively and efficiently construct splicing graphs based on the mapped results detailed as follows.

1) Clustering reads into gene loci
Since the mapped reads in the output bam file of Tophat2 have been sorted according to their mapping positions from 5' to 3', one can easily cluster the mapped reads into their corresponding gene loci. To do so, by Max_Pos we denote the most right position of the current cluster of mapped reads in the reference genome, which is initialized by the 3' end of the first read. TransComb then determines the end-positions of the successive mapped read in the reference genome, which are represented respectively by 5'_end and 3'_end. If 5'_end is not larger than Max_Pos, TransComb clusters this read into the current gene locus, and updates Max_Pos by 3' _end if and only if 3' _end is larger than Max_Pos. TransComb repeats the procedure until it encounters a read with its 5'_end larger than Max_Pos, with which it resumes next gene locus, until all reads exhausted.

2) Identification of exons and splicing junctions in each gene locus
For each gene locus constructed above, we use all the reads clustered into this locus to detect exons and splicing junctions in this gene. To do so, we denote by Gene_Exon a zero vector with the same length as this gene locus, with its components corresponding to the positions of this locus, and by Gene_Junction a 3-column matrix to represent splicing junctions in this locus. For the matrix Gene_Junction, its first two columns store the start and end positions of corresponding junction, and the third column stores the number of reads supporting the junction. Then for each read clustered into this locus, if it is not a junction read, we then add 1/n to each aligned components of Gene_Exon, where n represents the number of different mappings of this read; otherwise, i.e. this read is a junction read, TransComb updates the matrix Gene_Junction by adding a row consisting of start and end positions of the junction as well as the number 1/n, if this junction has not been detected yet. However, if the junction has already existed, TransComb directly updates the matrix Gene_Junction by adding 1/n to the number of reads supporting this junction. TransComb Repeats the procedure until all mapped reads belonging to this locus have been exhausted. Then the zero components of Gene_Exon constitute the introns and the others the exons with their values representing the coverage of reads supporting this position.

3) Updating of gene loci
We remove those junctions from Gene_Junction of their coverage less than 1 because such junctions are not reliable. Notice that a gene locus may be split into two or more gene loci after the updating operation and the following analysis is subjected to updated gene loci.

4) Updating exons and junctions in a gene locus
Biologically, most introns are actually quite long in mammalian genomes. Therefore, for each two adjacent exons in a gene locus, TransComb combines them into one if either their distance is no more than a prespecified number (the default is set to 100 bp) or there are two or more paired-end reads supporting these two exons (see Figure S2A) and meanwhile the distance between them is more than 100 bp but less than the value of average pair-gap length (the default is set to 200 bp).

5) Updating exons and junctions between two gene loci
For two adjacent gene loci, TransComb merges them into one if either the distance between the two gene loci is no more than a prespecified number (the default is set to 100 bp), or they are supported by at least two paired-end reads if the distance between them is more than 100 bp but less than the value of average pair-gap length (the default is set to 200 bp).

6) Dividing wrongly merged exons
Some reads may be wrongly mapped to intron areas, leading to wrongly merged exons.
TransComb attempts to solve this problem by dividing one exon into two exons through sliding a window as follows. For an exon with length L longer than 200, TransComb computes the average coverage of the first 50 bp, denoted by Ave-left, and that of the last 50 bp, denoted by Ave-right. Then it slides a 50-bp-length widow from left to right by one bp at a time (See Figure S3) and computes the average coverage of each sliding window, and finally picks the minimum one, denoted by Ave-min. If Ave-min is no more than 25% of the smaller one of Ave-left and Ave-right, then this exon is divided into two exons from the middle of the corresponding window of Ave-min.

7) Generation of splicing graphs
For each gene locus, an exon recorded in Gene_Exon is represented by a node and any two nodes are connected by a directed edge from 5' end to 3' end if and only if there is a splicing junction recorded in Gene_Junction supporting it. The coverage of each node is computed as the average coverage of each base in the corresponding exon recorded in Gene_Exon and the coverage of each edge is recorded in the third column of Gene_Junction. By Pair_Edges we denote a 2-column matrix to record pair supporting information between two edges in the splicing graph, which is useful and important in extracting paths from the splicing graphs. For each two edges in a splicing graph, if they are supported by at least two mate pairs (see Figure   S2A), Pair_Edges will be updated by adding a row with its two entries being the indexes of these two edges.

Recovery of full-length transcripts from weighted junction graphs
To do so, by Nodes_Unused we denote the dynamic set of nodes in the junction graph J = (V(J), E(J)) that have not been covered by assembled paths and by filter a parameter for determination. Initially, Nodes_Unused is set to V(J). Each transcript-representing path in a splicing graph is extracted by recurrently extending a current path in the corresponding weighted junction graph originating from a seed node in Nodes_Unused. In the procedure, by nl and nr we respectively denote the left and right ends of the current path which is being extended in the junction graph. By NL (resp. NR ) we denote the set of left/in (resp. right/out) Then TransComb iteratively extracts transcript-representing paths in a splicing graph through the operations on the corresponding junction graph that are pseudo-coded by the following steps.
Step 1. If Nodes_Unused ≠ Ø, TransComb chooses from Nodes_Unused a seed node v of largest weight cov_max (nl = nr = v at the moment); or else, it terminates.
If cov_max < filter, it removes from Nodes_Unused the node v, returns to Step 1; or else, goes to Step 2; Step 2. If N L 2 ≠ Ø , TransComb extends from nl to a node nl' in NL 2 of maximum weight, resets nl = nl', updates the current sets NL, NL 2 and NL 1 accordingly, and returns to Step 2; or else if NL 1 ≠ Ø , TransComb extends from nl to a node nl' in NL 1 of maximum weight, resets nl = nl', updates the current sets NL, NL 2 and NL 1 accordingly, and returns to Step 2; or else, goes to Step 3; Step 3. If N R -1 ≠ Ø , TransComb extends from nr to a node nr' in NR -1 of maximum weight, resets nr = nr', updates the current sets NR, NR -1 and NR +1 accordingly, and returns to Step 3; or else if NR 1 ≠ Ø , TransComb extends from nr to a node nr' in NR 1 of maximum weight, resets nr = nr', updates the current sets NR, NR 2 and NR 1 accordingly, and returns to Step 3; or else, goes to Step 4; Step 4. TransComb removes from Nodes_Unused the nodes of the path obtained by Steps 1, 2 and 3, and returns to Step 1.
The default value of filter is set to 0 by TransComb, implying that it will not terminate until all the nodes in the junction graph have been covered by the assembled paths. It is worth mentioning that nodes removed from Nodes_Unused can still be used for extension of other paths.

Estimation of expression levels of the recovered transcripts
Intuitively, expression level of each recovered transcript would have been estimated by allocating the coverage of each edge to the expressed transcripts if the expressed transcripts were uniformly sampled and fragmented. However, the fragmentation and sequencing process may cause a lot of biases and even errors, and moreover, the process of mapping RNA-seq reads to a reference genome may also bring in many unexpected errors. On the other hand, the assembled transcripts are not necessarily the true expressed ones. So it may not be a good idea to estimate the expression levels for assembled transcripts based on allocation of the coverage of all the edges. To diminish the impact of the errors and biases mentioned above on estimating expression levels for the assembled transcripts, we would prefer to rely on those edges used as seed nodes in the corresponding junction graph during the path extraction procedure rather than all of them, and from now on such edges are called seed edges. Seed edges would be quite reliable because each of them has the highest coverage in the unused nodes during the process of extracting a set of paths from a junction graph. We found that the edges with higher (lower) coverage are less (more) impacted by mapping and sequencing errors. In addition, all assembled transcripts must be covered by these seed edges. Therefore, the expression levels would be better estimated by allocating the coverage only for a few of seed edges to the assembled transcripts. Notice that assembled transcripts one-to-one correspond to those seed nodes used in the extraction procedure. By E = {e1, e2, …, en} we denote the set of n seed edges, and T = {t1, t2, …, tn} the set of n assembled transcripts. It can be achieved by solving the following quadratic programming: where wi (i = 1, 2 ,…, n) represents the coverage of the seed edge ei, and xj (j = 1, 2, … , n) the expression level of transcript tj.

Comparison of correctly identified genes
A reference gene is considered to be correctly identified if at least one isoform in the gene is exactly matched by a predicted transcript in a predicted gene. On the simulated dataset, the comparison results showed that TransComb correctly identified 8948 genes, vs. StringTie 8734, Cufflinks 8281, Traph 7131, and Bayesembler 7814 (see Figure S5A). On the three real datasets, TransComb still performed the best among the other compared assemblers in terms of correctly identified genes while only slightly inferior to StringTie on human H1-cells dataset. For example, the numbers of correctly identified genes of TransComb on human K562-cells, human H1-cells and mouse datasets are respectively 8610,8762 and 8540,vs. StringTie 8591,8780 and 8411,Cufflinks 6649,7532 and 4953,and Bayesembler 8593,8710 and 8471 (see Figure S6 A, B and C).

Comparison of detecting unique true positives
To compare two assemblers in terms of their ability of correctly recovering novel isoforms, we define another true positive, termed unique true positive, which is a reference transcript recovered by exactly one of the two compared assemblers. Obviously, the more unique true positives an assembler recovers, the better it performs than others.
We compared the unique true positives between TransComb and each of the other four compared assemblers on the simulated dataset, and found that TransComb assembles much more unique true positives compared to each of them. For example, the unique true positives are 1385 and 845 between TransComb and StringTie, 2855 and 608 between TransComb and Cufflinks, 2889 and 660 between TransComb and Traph, and 1787 and 1094 between TransComb and Bayesembler (see Figure S5B and Table S5). Overall, TransComb substantially outperforms all the compared assemblers in terms of their unique true positives even in the case where it assembles fewer candidates than others.
The competition in unique true positives between two assemblers was also carried out on real datasets. The competition results show that TransComb assembled much more unique true positives than each of the compared assemblers (see Figure S7 and Table S6, S7 and S8).
Overall, we conclude that TransComb consistently and overwhelmingly outperforms each of the compared assemblers in terms of unique true positives on both synthetic and real datasets.

Results of processing wrongly merged exons
If some reads were wrongly mapped to intron areas, then this intron and its two neighboring exons may be wrongly assembled into a single exon, which could result in two transcripts merged into one. TransComb attempts to solve this problem by dividing the whole exon into two as mentioned in the supplementary methods section. Successfully handling this problem significantly improves the performance of TransComb, especially in real datasets. On the simulated dataset, RNA-seq reads are indeed from the reference genome, which will be used as the mapping template, so the error mapping rates may be quite lower than on real datasets, which may contain mutations, indels, individual differences and some other unknown factors resulting from the sequencing process. So much fewer exons in simulated datasets will be divided than in real datasets. For example on the simulated dataset, the number of exons divided is 52. While in real datasets, the numbers are 5739, 3760 and 2096 on human K562-cells, human H1-cells and the mouse datasets, respectively. Then we computed the number of assembled true positives resulting from those divided exons on both simulated and real datasets. On the simulated dataset, the number of such true positives is 32 ( Figure S5C and Table S9), while on the real datasets, the numbers are 1627, 1079 and 650 ( Figure S8 and Table S9) on human K562-cells, human H1-cells and the mouse datasets, respectively. We can see that the exon-splitting tip indeed makes a great contribution to the performance of TransComb, especially on real dataset.

Comparison of expression level estimations
After assembling the RNA-seq reads into transcripts, another challenging task is to estimate the expression abundance of each assembled transcript in original cells. To evaluate the estimators, we compared the estimated abundances with the genuine ones for those correctly assembled transcripts only by all the five assemblers. The distribution of the simulated transcripts against their expression levels is shown in Figure S9A. For each estimator, we used the Spearman correlation coefficient between the simulated and estimated FPKMs to measure the estimators. As shown in Figure S9, the correlation coefficient of TransComb is 0.97, against 0.98 of StringTie, 0.97 of both Bayesembler and Cufflinks, and 0.87 of Traph.
The comparison results showed that it reaches comparable accuracy level to the existing tools of same kind, e.g. StringTie, Cufflinks and Bayesembler, while it is much better than Traph.
From the design of our method and then its performance, it seems that the expression levels of encoded transcripts would be mainly determined by several key junction edges in the splicing graph. As this method is really unusual and very simple, it may hopefully provide a hint for scientists who are interested in developing more powerful estimator. Figure S1. Transcriptome assembly pipeline for TransComb.       Assembled true positives of the four compared assemblers on mouse dataset. Figure S9. The distribution of simulated isoform expression levels and Correlation between simulated (y-axis) and estimated (x-axis) expression levels on simulated dataset using only transcripts that were correctly assembled by the five assemblers. ρ represents the Spearman correlation coefficient between simulated and estimated FPKM values.. Tables   Table S1 Candidates and true positives of the assemblers on the simulated dataset.