LCAT: an isoform-sensitive error correction for transcriptome sequencing long reads

RNA carries genetic information from genes to proteins. Transcriptome sequencing is an important way to obtain transcriptome sequences and is the basis of transcriptome research. With the development of third-generation sequencing, long reads can cover full-length transcripts and reflect the composition of different isoforms. However, the high error rate of third-generation sequencing reduces the accuracy of long reads and of downstream analysis. Current error correction methods seldom account for the existence of different isoforms in RNA, which leads to a serious loss of isoform diversity. Here, we introduce LCAT (long-read error correction algorithm for transcriptome sequencing data), a wrapper algorithm of MECAT, which reduces the loss of isoform diversity while keeping MECAT’s error correction performance. The experimental results show that LCAT not only improves the quality of transcriptome sequencing long reads but also retains the diversity of isoforms.


1) Introduction
LCAT (an isoform-sensitive error correction for transcriptome sequencing long reads) is a wrapper algorithm of MECAT that reduces the loss of isoform diversity while keeping MECAT's error correction performance. The experimental results show that LCAT not only improves the quality of transcriptome sequencing long reads but also preserves the diversity of isoforms.

3) Quick Start
LCAT can be used to correct RNA long reads produced by PacBio and Nanopore platforms. The options and commands for processing different types of data are introduced below.

4) Program Descriptions
The modules designed in LCAT are introduced in the following sections, which also describe the options and output format of each module.
-d <string> reads file name.
-o <string> output file name.
-w <string> working folder name; it will be created if it does not exist.
-k <integer> minimum number of k-mer matches that a matched block must have. Default: k=
-g <0/1> whether to print the gapped extension start point, 0 = no, 1 = yes. Default: 0.

Output Format
The results are output in can format. Each result occupies one line and has 9 fields. If the -g option is set to 1, two more fields indicating the extension starting points are given. In the strand field, 0 stands for the forward strand and 1 stands for the reverse strand. All positions are zero-based and refer to the forward strand, regardless of the strand to which the sequence is mapped.
❖ lcat2cns module
Input Format
can format files.

lcat2cns [options] input reads output
Options
-x <0/1> sequencing platform: 0 = PACBIO, 1 = NANOPORE. Default: 0
-t <Integer> number of threads (CPUs)
-p <Integer> batch size that the reads will be partitioned
-r <Real> minimum mapping ratio
-a <Integer> minimum overlap size
-c <Integer> minimum coverage under consideration
-l <Integer> minimum length of corrected sequence
-k <Integer> number of partition files when partitioning overlap results (if < 0, then it will be set to system limit value)
-d <Real> identity threshold
-w <Integer> slide window length
-m <Real> minimum coverage rate of modify region
-h print usage info.

Output Format
The corrected sequences are given in FASTA format. The header of each corrected sequence consists of four components separated by underscores: >A_B_C_D, where A is the original read id, B is the left-most effective position, C is the right-most effective position, and D is the length of the corrected sequence. By effective position we mean a position in the original sequence that is covered by at least c reads (c is the argument to the option -c).
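The header layout described above can be unpacked with a few lines of Python. This is only a minimal sketch, not part of LCAT: the names CorrectedHeader and parse_header and the example header are hypothetical, and it assumes the header carries nothing after the D field. Splitting from the right is an assumption made so that a read id which itself contains underscores stays intact.

```python
from typing import NamedTuple


class CorrectedHeader(NamedTuple):
    """Fields of a corrected-read header of the form >A_B_C_D (hypothetical helper)."""
    read_id: str   # A: original read id
    left: int      # B: left-most effective position
    right: int     # C: right-most effective position
    length: int    # D: length of the corrected sequence


def parse_header(line: str) -> CorrectedHeader:
    # Drop the leading '>' and any surrounding whitespace.
    name = line.strip().lstrip(">")
    # Split from the right so underscores inside the read id (A) are preserved.
    read_id, left, right, length = name.rsplit("_", 3)
    return CorrectedHeader(read_id, int(left), int(right), int(length))


# Made-up header, for illustration only.
print(parse_header(">read_17_5_1204_1180"))
```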

1) Introduction of evaluation tool
LR_EC_analyser stands for Long Read Error Correction analyser. It is a Python script that analyses the output of long-read error correctors such as LoRDEC, NaS, PBcR, proovread, canu, daccord, LoRMA, MECAT, and pbdagcon. It runs AlignQC (https://github.com/jason-weirather/AlignQC) on the BAMs built by mapping the output of the error correction tools to a reference genome (using, for example, gmap or minimap2), parses the AlignQC output, creates other custom plots, and puts all the relevant information in an HTML report. It also makes use of IGV.js (https://github.com/igvteam/igv.js) for an in-depth gene and transcript analysis.
LR_EC_analyser can be applied to evaluate the extent to which existing long-read DNA error correction methods are capable of correcting long reads. It reports not only classical error-correction metrics but also the effect of correction on long-read connectivity (which impacts the inference of transcript structure and exon coupling), gene families, isoform diversity, bias toward the major isoform, and splice site detection.
BAM files of the reads output by the HYBRID correctors mapped to the genome (preferably using gmap -n 10 -f samse).

2) Usage of the evaluation tool
--genome GENOME The genome in Fasta file format.
--gtf GTF The transcriptome in GTF file format.
--pdf Produce .pdf files of the plots in the <output_folder>/plots directory.
--skip_bam_process Skips BAM processing (i.e. sorting and indexing BAM files); assumes this has already been done.
--skip_alignqc Skips AlignQC calls; assumes this has already been done.
--skip_copying Skips copying the genome and transcriptome to the output folder; assumes this has already been done.

3) Reference and annotation files for four species used in the evaluation
The long reads of mouse, zebra finch, Calypte anna, and human were used in our experiments. The mouse and human data were sequenced with Nanopore technology, while the zebra finch and Calypte anna data were sequenced with PacBio technology (Table 1). The corresponding reference genomes and annotation files were obtained from the NCBI website (https://www.ncbi.nlm.nih.gov/) and the Ensembl website (ftp://ftp.ensembl.org/pub/). The version numbers of the genomes and annotation files are shown in Table 2.

4) Specific steps for evaluation
First, use minimap2 to align the original reads and the error-corrected reads to the reference genome to obtain the SAM files. Then use LR_EC_analyser to label each read with the gene and isoform structure to which it belongs, based on the SAM files and the gene and exon information in the genome annotation files.
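The alignment step could be scripted roughly as follows. This is a sketch under assumptions, not the exact command used in the evaluation: the file names are placeholders, the helper minimap2_command is hypothetical, and the choice of the minimap2 "splice" preset (-ax splice, the preset for spliced long-read alignment with SAM output) is an assumption, since the parameters actually used are not stated here. The snippet only prints the commands; it does not run minimap2.

```python
import shlex
from typing import List


def minimap2_command(reference: str, reads: str, threads: int = 4) -> List[str]:
    """Build a minimap2 call for spliced alignment (hypothetical helper).

    -a        : write alignments in SAM format
    -x splice : preset for spliced long-read alignment (assumed, not from the source)
    -t        : number of threads
    """
    return ["minimap2", "-ax", "splice", "-t", str(threads), reference, reads]


# Placeholder file names for the original and the error-corrected reads.
for reads in ("original_reads.fasta", "corrected_reads.fasta"):
    cmd = minimap2_command("reference.fa", reads)
    sam = reads.rsplit(".", 1)[0] + ".sam"
    # Show the shell equivalent; to execute it, pass cmd to subprocess.run
    # with stdout redirected to the SAM file.
    print(shlex.join(cmd), ">", sam)
```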
The numbers of isoforms of each gene in the original reads and in the error-corrected reads were counted separately, and the change in isoform number was calculated as the difference between the two. Counting the genes whose isoform numbers increase or decrease reflects the degree of isoform loss after error correction.
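The counting described in this step can be illustrated with a short Python sketch. It assumes that each read has already been assigned to a gene and an isoform, so that the input is a mapping from gene to the set of isoforms observed; the function names, the sign convention (corrected minus original), and the toy data are all hypothetical and are not taken from LR_EC_analyser.

```python
from typing import Dict, Set, Tuple


def isoform_changes(original: Dict[str, Set[str]],
                    corrected: Dict[str, Set[str]]) -> Dict[str, int]:
    """Per gene: number of isoforms after correction minus number before."""
    genes = set(original) | set(corrected)
    return {
        gene: len(corrected.get(gene, set())) - len(original.get(gene, set()))
        for gene in genes
    }


def summarize(changes: Dict[str, int]) -> Tuple[int, int, int]:
    """Count genes whose isoform number increased, decreased, or stayed the same."""
    increased = sum(1 for delta in changes.values() if delta > 0)
    decreased = sum(1 for delta in changes.values() if delta < 0)
    unchanged = sum(1 for delta in changes.values() if delta == 0)
    return increased, decreased, unchanged


# Toy data, for illustration only: gene -> isoforms observed in the reads.
original = {"geneA": {"iso1", "iso2", "iso3"}, "geneB": {"iso1"}}
corrected = {"geneA": {"iso1"}, "geneB": {"iso1"}}

changes = isoform_changes(original, corrected)
print(changes)
print(summarize(changes))
```

A negative value for a gene means isoforms were lost during correction, so the number of genes in the "decreased" group is the quantity that reflects the loss of isoform diversity.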