Research Article
Revised

YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut

[version 2; peer review: 3 approved]
PUBLISHED 06 Nov 2015

This article is included in the Agriculture, Food and Nutrition gateway.

Abstract

The transcriptome provides a functional footprint of the genome by enumerating the molecular components of cells and tissues. The field of transcript discovery has been revolutionized through high-throughput mRNA sequencing (RNA-seq). Here, we present a methodology that replicates and improves existing methodologies, and implements a workflow for error estimation and correction followed by genome annotation and transcript abundance estimation for RNA-seq derived transcriptome sequences (YeATS - Yet Another Tool Suite for analyzing RNA-seq derived transcriptome). A unique feature of YeATS is the upfront determination of the errors in the sequencing or transcript assembly process by analyzing open reading frames of transcripts. YeATS identifies transcripts that have not been merged, result in broken open reading frames or contain long repeats as erroneous transcripts. We present the YeATS workflow using a representative sample of the transcriptome from the tissue at the heartwood/sapwood transition zone in black walnut. A novel feature of the transcriptome that emerged from our analysis was the identification of a highly abundant transcript that had no known homologous genes (GenBank accession: KT023102). The amino acid composition of the longest open reading frame of this gene classifies this as a putative extensin. Also, we corroborated the transcriptional abundance of proline-rich proteins, dehydrins, senescence-associated proteins, and the DNAJ family of chaperone proteins. Thus, YeATS presents a workflow for analyzing RNA-seq data with several innovative features that differentiate it from existing software.

Keywords

RNA-seq, transcriptome, open reading frame, extensin, proline-rich proteins, dehydrins, senescence-associated proteins, Computational genomics, Juglans nigra, black walnut, heartwood/sapwood transition zone

Revised Amendments from Version 1

In this version, we have

  1. Added two new authors based on their inputs to the manuscript
     
  2. Provided IDs to the submissions of the transcriptome(s).
     
  3. Created github repository with README. It is to be noted that this is not meant to be a software article, so the software provided is not release quality. https://github.com/sanchak/YEATSCODE2
     
  4. Incorporated several minor points raised by reviewer.

See the authors' detailed response to the review by Michael I. Love
See the authors' detailed response to the review by Varodom Charoensawan

Introduction

Analysis of the complete set of RNA molecules in a cell, the transcriptome, is critical to understanding the functional aspects of the genome of an organism. Most transcripts get translated into proteins by the ribosome1. Non-translated transcripts (noncoding RNAs) may be alternatively spliced and/or broken into smaller RNAs, the importance of which has only recently been recognized2. Transcriptional levels vary significantly based on environmental cues3 and/or disease4. Quantifying transcriptional levels constitutes an important methodology in current biological research. Traditional methods like RNA:DNA hybridization5 and short sequence-based approaches6 have been supplanted recently by a high-throughput DNA sequencing method, RNA-seq7,8. Concomitant with the introduction of RNA-seq has been the development of a diverse set of computational methods for analyzing the resultant data9–21.

In the current work, we present a methodology for analyzing RNA-seq data that has been assembled into transcripts (YeATS - Yet Another Tool Suite for analyzing RNA-seq derived transcriptome). The process of associating genomic open reading frames (ORF) to a set of transcripts (transcriptome) is the key step in YeATS, enabling identification and correction of specific errors arising from sequencing and/or assembly, a novel feature missing in most known tools. These errors include transcripts that have not been merged, a transcript having broken ORFs and transcripts containing long repeats. Also, YeATS identifies noncoding RNAs by comparison to compiled databases22, transcripts with multiple coding sequences and highly transcribed genes (based on simple normalization of raw counts followed by sorting).

Here, the YeATS workflow is demonstrated using a representative sample of the transcriptome from the tissue at the heartwood/sapwood transition zone in black walnut (Juglans nigra L.). We have identified transcripts that have sequencing and/or assembly errors (~5%). A novel feature that emerged from our analysis was the presence of a highly transcribed gene that had no known homologous counterpart in the entire BLAST database. The amino acid composition of the longest open reading frame of this gene consists of a high percentage of leucine, histidine and valine, and classifies this as a putative extensin23. Given the economic and ecological importance of black walnut timber, characterization of such genes will enhance our understanding of the mechanisms underlying the unique properties associated with the wood of these trees24. The significance of proline-rich proteins25, dehydrins26, senescence-associated proteins27 and DNAJ28 proteins to the formation of heartwood was established through their transcriptional abundance. Finally, based on transcripts that have no known homologs, we have identified noncoding RNAs by comparison with the noncoding RNA database for Arabidopsis22. Thus, in the current work, we present a workflow (YeATS) with several novel features absent in most currently available software.

Methods

In silico methods

The input to YeATS is a set of post assembly transcripts as a fasta file (φTRS). The first step is to identify the set of genes (proteins) encoded by φTRS. This is done by associating a proper open reading frame (ORF) to each transcript. This involves a comprehensive automated BLAST run29.

For each transcript in φTRS, we generate the three longest ORFs (using the ‘getorf’ utility in the EMBOSS suite30) (Figure 1). These three ORFs are BLAST’ed against the full non-redundant protein sequences (‘nr’) database. For a given E-value cutoff (1E-12 in the current work), we partition the transcripts into four sets (1, 2, 3a and 3b):

  1. Exactly one ORF has a match with an E-value below the cutoff: the transcript is uniquely annotated.

  2. None of the ORFs has a match with an E-value below the cutoff: the transcript has no known homologs.

  3. More than one ORF has a match with an E-value below the cutoff.

     (a) The ORFs map to different fragments of the same protein. This points to an error in the sequencing or the assembly, which breaks the contiguous ORF into fragments.

     (b) The ORFs map to different proteins. These are instances of a transcript having two valid ORFs. We duplicate the transcript, associating each copy with a different protein sequence.
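The partition into four sets can be sketched in a few lines of Python. This is an illustrative sketch, not the YeATS code: the names `OrfHit` and `classify_transcript` are hypothetical, and "different fragments of the same protein" is approximated here by two significant ORFs sharing the same best-match accession.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-12  # cutoff quoted in the text


@dataclass
class OrfHit:
    orf_id: str      # hypothetical field: ORF label, e.g. "ORF_36"
    subject_id: str  # hypothetical field: accession of the best 'nr' match
    e_value: float   # E-value of that match


def classify_transcript(hits, cutoff=E_VALUE_CUTOFF):
    """Assign a transcript to one of the four sets from its (up to three) ORF hits."""
    significant = [h for h in hits if h.e_value < cutoff]
    if not significant:
        return "no_homolog"            # set 2
    if len(significant) == 1:
        return "unique"                # set 1
    subjects = {h.subject_id for h in significant}
    if len(subjects) == 1:
        return "error_broken_orf"      # set 3a: same protein hit by several ORFs
    return "multiple_genes"            # set 3b: distinct proteins


# Example using the E-values reported for transcript C15259_G1_I1
hits = [
    OrfHit("ORF_36", "CAN64666.1", 6e-92),
    OrfHit("ORF_9", "CAN64666.1", 7e-45),
]
print(classify_transcript(hits))
```

A mixed case (two ORFs hitting one protein and a third hitting another) is reported as `multiple_genes` in this sketch; a real workflow would probably want to flag it separately.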


Figure 1. Flowchart for YeATS.

For each transcript, the three longest open reading frames (ORFs) are obtained using ‘getorf’ and BLAST’ed against the full non-redundant protein sequences (‘nr’) database. The transcriptome is then partitioned based on the number of significant matches. Unique genes have only one significant match, erroneous transcripts have multiple ORFs matching the same gene, and duplicate genes have multiple distinct matches.

To produce the uniquely annotated set of genes, we ignored entries containing the keywords chromosome, hypothetical, unnamed, unknown and uncharacterized, so that the annotation carries a functional characteristic, provided the final annotated entry has a low E-value. In addition to comparing E-values, we compared BLAST scores: an ORF was chosen as unique if its BLAST score was more than twice any other BLAST score, even if the other scores satisfied the E-value criterion.
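The two annotation rules above (skipping uninformative entries, and preferring an ORF whose score dominates) can be illustrated as follows. This is a minimal sketch under stated assumptions: `is_informative` and `dominant_orf` are hypothetical helper names, and the scores are taken to be BLAST bit scores keyed by ORF label.

```python
# Keywords listed in the text as carrying no functional information
UNINFORMATIVE = ("chromosome", "hypothetical", "unnamed", "unknown", "uncharacterized")


def is_informative(description):
    """True if a hit description contains none of the uninformative keywords."""
    text = description.lower()
    return not any(word in text for word in UNINFORMATIVE)


def dominant_orf(scores):
    """Return the ORF whose score is more than twice every other score, else None.

    scores: dict mapping ORF label -> BLAST score (assumed to be bit scores).
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    others = [s for orf, s in scores.items() if orf != best]
    if all(scores[best] > 2 * s for s in others):
        return best
    return None


print(is_informative("hypothetical protein VITISV_043422 [Vitis vinifera]"))
print(dominant_orf({"ORF_1": 300.0, "ORF_2": 120.0}))
```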

Algorithm 1 describes the process of merging transcripts (SI Figure 1). For a given length n (which varies from 5 to 15 in this case), the 5’ and 3’ sequences and identifiers of each transcript are stored in new string databases: 3’=Begin; 5’=End. Repetitive strings (strings composed of fewer than three distinct letters) are ignored, as it is difficult to ensure their uniqueness. For each string of length n in the Begin (3’) string database, we check whether: a) a unique match of length n (one-to-one mapping) is present in the End (5’) string database and b) the prefixes (initial transcript identifiers) of the two transcripts are the same.

Algorithm 1. MergeTRS - Merge two transcripts

Input: φTRS ⇐ Set of transcripts

Output: φTRSMERGED: Pairs of transcripts that can be merged

begin
      φTRSMERGED ← ∅;
      while NewStatesAdded do
           foreach TRSi in φTRS do
                 φBEGIN ← ∅;
                 φEND ← ∅;
                 foreach len:5..15 do
                      AddBeginingofTRS(φBEGIN,TRSi,len);
                      AddEndofTRS(φEND,TRSi,len);
                 end
                 foreach stringi in φBEGIN do
                      /* ignore strings that have less than 3 letters, these are repetitive */
                      IgnoreRepeats(stringi);
                      if (∃ only one stringj in φEND such that prefixof(TRSi) == prefixof(TRSj)) [
                           φTRSMERGED ← AddtoMergeableSet(TRSi,TRSj);
                      ]
                 end
           end
      end
      return φTRSMERGED;
end
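The end-matching idea behind MergeTRS can be sketched in Python. This is a hedged illustration rather than the published implementation: it builds the Begin and End string databases over all transcripts for each length, treats a merge candidate as an exact, one-to-one match between the end of one sequence and the beginning of another, and assumes that the "prefix" of an identifier such as C55368_G1_I3 is the part before the first underscore. The function names are hypothetical.

```python
from collections import defaultdict


def prefix_of(transcript_id):
    """Assumed convention: 'C55368_G1_I3' -> 'C55368'."""
    return transcript_id.split("_")[0]


def is_repetitive(s):
    """Strings built from fewer than three distinct letters are skipped."""
    return len(set(s)) < 3


def find_mergeable(transcripts, min_len=5, max_len=15):
    """Return a set of (upstream_id, downstream_id) pairs.

    transcripts: dict mapping transcript id -> sequence string.
    """
    pairs = set()
    for n in range(min_len, max_len + 1):
        begins = defaultdict(list)  # first n letters -> transcript ids
        ends = defaultdict(list)    # last n letters  -> transcript ids
        for tid, seq in transcripts.items():
            if len(seq) >= n:
                begins[seq[:n]].append(tid)
                ends[seq[-n:]].append(tid)
        for s, starters in begins.items():
            if is_repetitive(s):
                continue
            enders = ends.get(s, [])
            # require a one-to-one mapping between the two databases
            if len(starters) == 1 and len(enders) == 1:
                up, down = enders[0], starters[0]
                if up != down and prefix_of(up) == prefix_of(down):
                    pairs.add((up, down))
    return pairs


toy = {
    "C1_G1_I1": "MKTAYIAKQRNFDENRGALNSH",
    "C1_G2_I1": "NFDENRGALNSHQWERTYPLK",
}
print(find_mergeable(toy))
```

The toy sequences are invented for the example; only the shared 12-letter stretch is borrowed from the merged pair discussed in the Results.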

Algorithm 2 describes the iterative method for identifying homologous genes in the genome based on the transcriptome. First, the transcriptome is converted to a set of protein sequences by choosing the appropriate ORF (described above) as the representative protein sequence, and a BLAST database (TRSDB) is created. An input protein sequence (possibly from another organism) of a gene of interest is used to query TRSDB using BLAST29. This results in a set of significant transcript matches, which is pruned based on a cutoff identity (40% in this case) and the criterion that the sequence length should not differ by more than another parameterizable value (50 in this case). The transcripts that pass these filters are now potential genes, and the above-mentioned process is repeated for each of them, until no new transcripts are added.

Algorithm 2. FindGene - Iterative method to identify homologous genes based on the transcriptome

Input: G ⇐ Amino acid sequence of gene

Input: TRSDB ⇐ BLAST database of the protein sequences from each transcript, choosing the longest ORF as the representative protein sequence

Input: identitycutoff ⇐ Ignore matches which are less than identitycutoff % identical to the sequence under consideration

Input: lengthcutoff ⇐ Ignore matches where the sequence length differs by more than lengthcutoff % from the sequence under consideration

Output: φgenes

begin
     φgenes ← {G};
     φprocessed ← ∅;
     NewStatesAdded ← 1;
     while NewStatesAdded do
          NewStatesAdded ← 0;
          foreach Gi in φgenes such that Gi is not in φprocessed do
               φprocessed ← φprocessed ∪ {Gi};
               ϕiBLAST = BLAST Gi on TRSDB;
               foreach TRSi in ϕiBLAST do
                    difflength ← |length(Gi) – length(TRSi)|;
                    if (identity(TRSi, Gi) > identitycutoff ∧ (difflength < lengthcutoff)) [
                         NewStatesAdded ← 1;
                         φgenes ← φgenes ∪ {TRSi};
                    ]
               end
          end
     end
     /* This is not a TRS, but an input - remove this from the set */
     remove G from φgenes;
     return φgenes;
end
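The iteration in FindGene is a closure computation: keep querying with every newly accepted sequence until nothing new is added. The sketch below shows that control flow in Python. It is an assumption-laden illustration: `search` is a hypothetical stand-in for a BLAST query against TRSDB that returns `(hit_id, percent_identity)` pairs, and the length cutoff is applied as an absolute difference in residues.

```python
def find_genes(query_id, sequences, search, identity_cutoff=40.0, length_cutoff=50):
    """Iteratively collect transcripts homologous to a seed protein.

    query_id:  key of the seed protein in `sequences`
    sequences: dict id -> protein sequence (seed plus transcript-derived proteins)
    search:    callable(sequence) -> iterable of (hit_id, percent_identity);
               a placeholder for running BLAST on TRSDB
    """
    genes = {query_id}
    processed = set()
    new_added = True
    while new_added:
        new_added = False
        for gene in sorted(genes - processed):
            processed.add(gene)
            for hit_id, identity in search(sequences[gene]):
                diff = abs(len(sequences[gene]) - len(sequences[hit_id]))
                if hit_id not in genes and identity > identity_cutoff and diff < length_cutoff:
                    genes.add(hit_id)
                    new_added = True
    genes.discard(query_id)  # the seed is an input, not a transcript
    return genes


# Toy data: a canned hit table replaces the real BLAST search
seqs = {"seed": "A" * 100, "T1": "A" * 110, "T2": "A" * 120, "T3": "A" * 300}
table = {
    seqs["seed"]: [("T1", 97.4)],
    seqs["T1"]: [("T2", 84.4), ("T3", 90.0)],
    seqs["T2"]: [("T1", 84.4)],
}
print(find_genes("seed", seqs, lambda s: table.get(s, [])))
```

In the toy run, T2 is reached only through T1, which is the point of iterating; T3 is rejected because its length differs too much from T1.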

The raw counts for each transcript are normalized according to Equation 1, assuming a read length of 100.

\[ \mathrm{Score}_{\mathrm{normal}} = 100 \times \frac{\mathrm{Score}_{\mathrm{raw}}}{\mathrm{Length}(\mathrm{transcript})} \qquad (1) \]
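As a worked example of Equation 1, a transcript of length 1000 with a raw count of 500 gets a normalized score of 100 × 500 / 1000 = 50. A minimal Python sketch of the normalization and the subsequent sorting is given below; the function names are hypothetical, and dropping zero-count transcripts follows the description in the Results.

```python
def normalized_score(raw_count, transcript_length, read_length=100):
    """Equation 1: read_length * raw_count / transcript_length."""
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    return read_length * raw_count / transcript_length


def rank_transcripts(raw_counts, lengths):
    """Drop zero-count transcripts, normalize, and sort from high to low."""
    scored = {
        tid: normalized_score(count, lengths[tid])
        for tid, count in raw_counts.items()
        if count > 0
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


print(normalized_score(500, 1000))
print(rank_transcripts({"a": 10, "b": 0, "c": 300}, {"a": 100, "b": 200, "c": 1000}))
```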

The sequence alignment was done using ClustalW31. The alignment images were generated using SeaView32.

The runtime for most of the processing required in YeATS is a few hours on a simple 16 GB, 16-core machine, barring the search for homologies in the BLAST ‘nr’ database. This search can be significantly accelerated, to runtimes under a day, when the organism under investigation has well-annotated protein databases (as in the current case), along the lines of the newly introduced SMARTBLAST (http://blast.ncbi.nlm.nih.gov/smartblast/).

In vitro methods

Total RNA was isolated from the xylem region immediately external to the heartwood of a 16-year-old black walnut. The tree was felled in November, and cross sections about 1 inch thick were taken from the base and dropped immediately into liquid nitrogen. After the sections were fully frozen they were transported to the lab on dry ice. The transition zone was then chiseled and the xylem was ground using a freezer mill. The RNA was extracted from 100 g of ground wood using lithium chloride extraction buffer, and subsequently treated with DNAse (to remove genomic DNA) using an RNA/DNA Mini Kit (Qiagen, Valencia, CA) per the manufacturer's protocol. Presence of RNA was confirmed by running an aliquot on an Experion Automated Electrophoresis System (Bio-Rad Laboratories, Hercules, CA).

The cDNA libraries were constructed following the Illumina mRNA-sequencing sample preparation protocol (Illumina Inc., San Diego, CA). Final elution was performed with 16 μL RNase-free water. Each library was run as an independent lane on a Genome Analyzer II (Illumina, San Diego, CA) to generate paired-end sequences of 85bp in length from each cDNA library.

Prior to assembly, all reads underwent quality control for paired-end reads and trimming using Sickle33. The minimum read length was 45bp with a minimum Sanger quality score of 35. The quality controlled reads of 19 libraries from J. nigra were de novo assembled with Trinity14 v2.0.6 (standard parameters with minimum contig length of 300bp) (manuscript in submission, bioproject id PRJNA232394). Subsequently, the reads from the TZ of J. nigra were aligned to this transcriptome and counts were obtained using BWA’s short read aligner v.0.6.2 (‘bwa aln’) (http://bio-bwa.sourceforge.net/)34. The Illumina reads for the transition wood transcriptome can be accessed at http://www.ncbi.nlm.nih.gov/sra/SRX404331.

Results

The input dataset to the YeATS tool was a set of transcripts, transcript identifiers and their corresponding raw counts (see Supporting information), obtained from the tissue at the heartwood/sapwood transition zone (TZ) in black walnut (Juglans nigra L.) (Figure 2). These raw counts were normalized (see Methods), and transcripts with zero counts were ignored (see rawcounts.normalized.TZ in Dataset 1). There were ~24K such transcripts (ϕtranscriptTZ).


Figure 2. Heartwood/sapwood transition zone in black walnut.

A cross section of a mature black walnut (Juglans nigra) stem showing the light-colored sapwood (secondary xylem) and the darkly colored heartwood, which contains no living cells. The transition zone (TZ) is immediately external to the heartwood, highlighted by the yellow line in the red box. Cell death is actively occurring in this TZ tissue.

Dataset 1. YeATS Dataset.

  • README
  • FASTADIR.tgz : 24k transcripts
  • ORFS.tgz : open reading frames from 24k transcripts computed with the ‘getorf’ tool from the Emboss suite
  • list.merged.txt : transcripts that have been merged based on overlapping ends
  • High.TZ.genome.annotated.csv : transcripts having only one ORF with a high significance match
  • Lower.TZ.genome.annotated.csv : transcripts having only one ORF with a lower significance match
  • TZ.genome.annotated.none.csv : transcripts with no match
  • TZ.genome.errors : transcripts which have two ORFs matching with high significance to the same gene
  • TZ.genome.annotated.morethanone.csv : transcripts having more than one ORF matching different genes with high significance
  • rawcounts.TZ : raw counts
  • rawcounts.normalized.TZ : normalized counts

In order to associate a transcript to a specific open reading frame (ORF), the ORFs of ϕtranscriptTZ were obtained using ‘getorf’ from the Emboss suite30 (see ORFS.tgz in Supporting information) (Figure 1). The three longest ORFs for each transcript were BLAST’ed against the full non-redundant protein sequences (‘nr’) database, and the results were used to characterize the genes.
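To make the "three longest ORFs" step concrete, the following Python sketch enumerates candidate ORFs in all six reading frames and keeps the longest ones. It is an illustration only and does not reproduce ‘getorf’: an ORF is taken here to be any stretch of codons between stop codons under the standard genetic code, and no minimum length is enforced. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def longest_orfs(dna, top=3):
    """Return up to `top` tuples (length, strand, frame, protein), longest first."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for frame in range(3):
            # split the frame translation at stop codons
            for segment in translate(seq[frame:]).split("*"):
                if segment:
                    orfs.append((len(segment), strand, frame, segment))
    orfs.sort(key=lambda orf: orf[0], reverse=True)
    return orfs[:top]


for length, strand, frame, protein in longest_orfs("ATGGCCTAAATGAAATTTGGGCCCTGA"):
    print(length, strand, frame, protein)
```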

There were ~1200 transcripts that had possible sequencing or assembly errors, ~22K transcripts that had significant matches (E-value<E-12) in the ‘nr’ database, 113 transcripts that had lower matches (E-12<E-value<E-08) in the ‘nr’ database, ~700 transcripts that had no matches in the ‘nr’ database and about 200 transcripts that could be merged based on overlapping amino acid sequences. We describe these in detail below.

Possible sequencing error or mis-assembly of transcripts

We observed transcripts that had multiple ORFs that matched the same gene with high significance (E-value<E-10). The possibility that such an occurrence is not an experimental artifact is low. Transcript C15259_G1_I1 is one such example, having two ORFs - ORF_36 (length = 144) and ORF_9 (length = 122) - both of which match the ATP-dependent Clp protease proteolytic subunit 2, mitochondrial35 (GenBank: CAN64666.1) from Vitis vinifera with E-values of 6E-92 and 7E-45, respectively. The alignment of these two ORFs to the Vitis vinifera protein (Figure 3) indicates the possible site of the sequencing error or transcript misassembly. This aspect of the YeATS methodology can be used to estimate the sequencing and transcript assembly error rate. For example, in the current transcriptome of the walnut TZ, we found a 5% (1200 out of 24,000) error rate.


Figure 3. Error detection in sequencing or transcript assembly by YeATS.

Transcript C15259_G1_I1 has two ORFs - 9 and 36 - both of which match to the mitochondrial ATP-dependent Clp protease proteolytic subunit 2, mitochondrial (GenBank: CAN64666.1) from Vitis vinifera with E-values of 6E-92 and 7E-45, respectively. It is likely that the error occurred near the amino acid sequence ‘SAG’ marked in the figure. The current transcriptome of the walnut TZ had a 5% (1200 out of 24,000) error rate for this class of error.

Long repeat within the same transcript

A small number of transcripts had long repeats (on the reverse strand), as identified by transcripts that had multiple identical ORFs. For example, transcript C50369_G5_I2 has two ORFs (length = 143) that matched to an uncharacterized protein (Uniprot id: XP_009362671, E-value= 4e-13). These ORFs were located on the reverse strand, and were exactly the same (Figure 4). There were only 8 such cases.


Figure 4. Erroneous transcripts with an exact long repeat (on the reverse strand).

Transcript C50369_G5_I2 had an ORF (length = 143, Uniprot id: XP_009362671, uncharacterized protein), with an exact match on the reverse strand. There were only eight such cases, and they could be manually corrected.

Merging transcripts

About 200 transcripts were merged by YeATS using conservative metrics (see Methods, list.merged.txt in Supporting information). For example, transcripts C55368_G1_I3 and C55368_G2_I1 were merged based on a stretch of 12 amino acids (NFDENRGALNSH) (Figure 5). The indicated single nucleotide difference might be the reason for the failure of the assembly program to merge these two transcripts. Transcript C55368_G1_I3 had two exact repeats of this stretch, which is a likely assembly error.


Figure 5. Transcripts that could be merged.

(a) Transcripts C55368_G1_I3 and C55368_G2_I1 could be merged based on a stretch of 12 amino acids (NFDENRGALNSH) obtained from their ORFs. (b) The partial nucleotide sequences of these transcripts show the repeat with only a single nucleotide difference, which may explain the failure of the assembly program to merge these two transcripts. Interestingly, transcript C55368_G1_I3 had two exact repeats of this stretch at the end, which may also have contributed to this failure.


Figure 6. Identification of transcripts encoding multiple genes.

These ORFs belong to the same transcript, and have significant matches to different proteins. (a) Genes on the reverse strand, having no overlap - clathrin light chain (E-value=3E-126) and a leucine repeat rich receptor-like serine/threonine protein kinase (E-value=0). (b) Genes on the same strand, having no overlap - RING/U-box superfamily protein (E-value=7E-149) and a homeodomain-like superfamily protein isoform (E-value=0).

Single transcripts with two ORFs

Some transcripts were associated with multiple ORFs with distinct significant matches in the ‘nr’ database. We demonstrate this for the transcript C8909_G1_I1, which had two ORFs - ORF_104 (length = 331) and ORF_45 (length = 390) - which matched a clathrin light chain36 (Uniprot id: XP_006481016.1, E-value=3E-126) and a leucine repeat rich receptor-like serine/threonine protein kinase37 (Uniprot id: XP_007026739.1, E-value=0), respectively. These ORFs were on opposite strands, and did not overlap. It was not possible to ascertain which was the correct gene product, and it is a distinct possibility that both strands were transcribed38. A slightly different situation arose when both ORFs were on the same strand39. For example, in transcript C54995_G6_I2, there were two ORFs - ORF_157 (length = 464) and ORF_231 (length = 543) - that matched a RING/U-box superfamily protein40 (Uniprot id: XP_007042454.1, E-value=7E-149) and a homeodomain-like superfamily protein isoform41 (Uniprot id: XP_007030696.1, E-value=0), respectively. Both of these ORFs were on the same (reverse) strand of the transcript. These transcripts are candidates for chimeric42 or fusion43 genes, since the ribosome is known to bypass small nucleotide stretches separating two ORFs44.

Highly transcribed genes

Table 1 shows the transcripts with the highest counts. Interestingly, the most abundant transcript (C52369_G2_I1; GenBank accession: KT023102) had no homologous counterpart in the full BLAST ‘nr’ or ‘nt’ database. A proline-rich protein (PRP), a part of the protein superfamily of cell wall proteins consisting of extensins and nodulins, was found to have the second most abundant transcript23,45. Proline comprises 19% of the amino acids in the ORF of this transcript. PRPs are found as structural proteins in wood, and it was hypothesized that these proteins occur in the xylem cell walls during lignification, and influence the properties of wood46. PRPs were associated with carrot storage root formation47, were wound and auxin inducible47 and were implicated in cell elongation48. PRPs are also an integral component of saliva, responsible for the precipitation of antinutritive and toxic polyphenols by forming complexes49. Two DNAJ/HSP40 chaperone proteins, which are involved in proper protein folding, transport and stress response, showed high transcriptional levels28. Two DNAJ/HSP40 chaperone homologs (GenBank accession id: BI677935 and BI642398) were shown to be differentially expressed during summer at the sapwood/heartwood TZ of black locust50. The transcription levels of dehydrin-related proteins were shown to be seasonally regulated in the wood of deciduous trees26,51. However, this dehydrin protein is homologous to a 24kDa dehydrin (Uniprot id: AGC51777) from Jatropha manihot, a drought resistant plant52, unlike the ~100kDa proteins investigated in 26. Senescence-associated proteins, and the related tetraspanins, were also highly transcribed27. One highly expressed transcript was homologous to a protein that is yet to be characterized.

Table 1. A sample of highly transcribed genes with high normalized counts (NC).

There are several highly transcribed genes in the representative sample of the transcriptome from the tissue at the heartwood/sapwood transition zone (TZ) in black walnut that did not have any significant homologs (NSL) in the complete ‘nr’ or ‘nt’ database. For the ‘nr’ database, we use the three longest ORFs as query. The significance of dehydrins, senescence-associated and DNAJ proteins can be observed through their transcription abundance.

ID            NC     Description                                                                  E-value
C52369_G2_I1  43040  NSL (putative extensin based on amino acid composition)                      -
C51134_G2_I2  15200  ref|XP_008224364.1|PREDICTED: extensin-like [Prunus mume]                    1e-08
C40830_G1_I1  14169  ref|XP_006365673.1|dnaJ protein homolog isoform X2 [Solanum tuberosum]       0
C46581_G1_I1  10651  PREDICTED: Probable zinc transporter protein [Phoenix dactylifera]           8e-09
C51134_G2_I3  10631  emb|CAN59948.1|hypothetical protein VITISV_043422 [Vitis vinifera]           6e-09
C44353_G2_I1  7769   gb|AGC51777.1|dehydrin protein [Manihot esculenta]                           6e-09
C44353_G1_I1  6652   gb|AAF01465.2|AF190474_1 bdn1 [Paraboea crassifolia]                         2e-19
C43130_G3_I1  6601   gb|KEH16988.1|senescence-associated protein, putative [Medicago truncatula]  2e-129
C44922_G1_I1  5584   ref|XP_008363477.1|tetraspanin-3-like [Malus domestica]                      2e-169
C40830_G1_I2  5113   ref|XP_007010484.1|DNAJ [Theobroma cacao]                                    0

Finding genes

We demonstrated the (iterative) gene finding methodology in YeATS on a transcription factor that has an AP2 DNA binding motif (RAP2.6L in Arabidopsis, At5g13330)53. This protein showed differential tissue specific expression, and is likely to be involved in plant developmental processes and stress response54. Recently, the sequence of a homolog of RAP2.6L was deduced (Uniprot id: C1KH72, JnRap2) from an EST sequence isolated from tissue at the heartwood/sapwood TZ in black walnut (Juglans nigra L.), and its role in the integration of ethylene and jasmonate signals in the xylem and other tissues was established55,56. Using the sequence of JnRap2, we probed for other RAP2 genes in the TZ of walnut. We found three possible genes (C38523_G2_I1, C53728_G7_I1 and C53728_G7_I2) (Figure 7). C53728_G7_I2 was closest to the JnRap2 gene (97.4% identity, 98.2% similarity), and is probably the same gene. C53728_G7_I1 was also significantly homologous to the JnRap2 gene (84.4% identity, 92.4% similarity), and it appears to be an allelic or splice variant, a question that can be resolved after the publication of the complete walnut genome. Raw counts (see Supporting information) demonstrated that the transcript C38523_G2_I1 had negligible expression levels in TZ, corroborating the previous detection of only one RAP2 protein in 55.


Figure 7. Finding genes from a template sequence.

Multiple sequence alignment of possible genes for a transcription factor that has an AP2 DNA binding motif, compared to JnRap2, which was deduced from an EST sequence obtained from tissue at the heartwood/sapwood transition zone in black walnut.

Transcripts with no significant matches in the ‘nr’ database - possible long non-coding RNA genes?

The top three ORFs of ~600 transcripts had no match in the BLAST ‘nr’ database. Although these may be unique genes, another possibility that must be considered is that these are non-coding RNA genes2. The nucleotide sequences of these 600 transcripts were BLAST’ed to the database of noncoding RNAs in Arabidopsis22. Three matches were identified: C52424_G5_I11, C52424_G5_I4 and C53565_G3_I1. Both C52424_G5_I11 and C52424_G5_I4 are homologous to CR20, a cytokinin-repressed gene in excised cotyledons of cucumber, hypothesized to be non-coding RNA57. Analogous to the current work, the CR20 gene had alternate splicing57. C53565_G3_I1 had a 100% match to the Arabidopsis locus ATMG01380, a mitochondrial 5S ribosomal RNA, which is a component of the 50S large subunit of mitochondrial ribosome58.

Discussion

High-throughput mRNA sequencing (RNA-Seq) has revolutionized the field of transcript discovery, providing several advantages over traditional methods7,8. Following isolation and fragmentation of RNA and subsequent generation of cDNA libraries, a high-throughput sequencing platform is selected to generate short reads59. Reconstruction of transcripts from these short reads (assembly) may be performed using a reference genome or de novo algorithms15–18,21,60. Sequencing biases, variable coverage, sequencing errors, alternate splicing and repeat sequences are some of the challenges faced by these assemblers14,61.

Several post assembly computational tools provide further curation of transcripts resulting from the assemblers. The curation step involves identifying redundancies19,20, finding coding regions62, annotating the transcripts (https://transdecoder.github.io/) and detecting inaccuracies by aligning the transcripts to the genome63. In the current work, we present an integrated workflow for RNA-seq analysis (YeATS). YeATS includes most features of the tools mentioned above. Additionally, YeATS delivers several capabilities absent in these tools. A comprehensive BLAST analysis of the top three open reading frames of each transcript enables the identification of erroneous transcripts arising out of sequencing or assembly errors. These erroneous transcripts can be classified as: a) transcripts that have not been merged, b) transcripts that result in broken ORFs and c) transcripts that have long improbable repeats. Finally, YeATS provides annotation of the genes, enumerates homologous genes based on a template sequence and specified similarity threshold and identifies transcripts with multiple ORFs. The ribosome is known to bypass small nucleotide stretches separating two ORFs44. These are rare events, however, and thus unlikely to apply to the ~1200 transcripts that have broken ORFs pointing to the same gene64. Transcripts having multiple ORFs on the same strand are good candidates for chimeric42 or fusion43 genes dependent on ribosome bypassing.

The current work reveals and corroborates several aspects of the biology of hardwood trees. Perhaps the most interesting is the detection of a highly transcribed gene (C52369_G2_I1) with no known homologs in the complete protein and nucleotide BLAST database, and no significant matches in a database of long non-coding RNA genes22. If the longest ORF of this transcript does encode a protein, it is 143 amino acids long, and is leucine (18%), histidine (13%) and valine (10%) rich (Figure 8). Although this could be a protein with leucine rich repeats, such proteins are typically larger65. On the other hand, histidine and valine rich extensins have been reported to be constituents of plant cell walls of dicots23. The regulatory stimuli of extensins are different for monocots (which also have different amino acid composition) and dicots23. A significant presence of extensin-like proteins in the cell wall of both developing and mature xylem (wood) has been reported for pine46,66. The publication of the walnut genome will aid the characterization of these genes by elucidating their promoter sequences.


Figure 8. Percentage amino acid composition of the two most highly transcribed genes.

C52369_G2_I1 has a high percentage of leucine, histidine and valine, and is a putative extensin. C51134_G2_I2 is proline and lysine rich, and is homologous to an extensin and nodulin.
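The percentage composition plotted in Figure 8 is a simple residue count. A minimal Python sketch is shown below; `composition_percent` is a hypothetical helper, and the input string is a made-up toy sequence, not the ORF of C52369_G2_I1.

```python
from collections import Counter


def composition_percent(protein):
    """Return {amino acid: percentage}, most frequent first; non-letters are ignored."""
    residues = [aa for aa in protein.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("empty protein sequence")
    counts = Counter(residues)
    return {aa: 100.0 * n / len(residues) for aa, n in counts.most_common()}


# Toy sequence with a trailing stop symbol, which is skipped
for aa, pct in composition_percent("LLLLHHVVPA*").items():
    print(aa, round(pct, 1))
```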

Well characterized proteins like proline-rich proteins25,46, dehydrins26, senescence-associated proteins27 and DNAJ/HSP40 chaperone50 proteins were also abundant in the transcriptome. While Arabidopsis supports secondary growth, it fails to accumulate wood; it is therefore interesting to identify highly transcribed genes that are missing in the Arabidopsis proteome (Table 2). The DNAJ/HSP40 chaperone, dehydrins and tetraspanin proteins are found in the Arabidopsis proteome (TAIR10_pep_20101214)67, while the putative extensin, the proline-rich protein, a probable zinc transporter protein, an uncharacterized protein and senescence-associated protein appear to be unique to the walnut proteome.

Table 2. Identifying highly transcribed genes that are not present in the Arabidopsis proteome.

Walnut and Arabidopsis differ markedly in wood quality. It is informative to identify genes (proteins) that are absent in Arabidopsis, since they are likely to be responsible for the differences. The DNAJ/HSP40 chaperone, dehydrins and tetraspanin proteins are found in the Arabidopsis proteome, while the putative extensin, the proline-rich protein, a probable zinc transporter protein, an uncharacterized protein and senescence-associated protein appear to be unique to the walnut proteome.

TRS           Arabidopsis Id  Description                                E-value  Significant?
C52369_G2_I1  AT5G04990.1     SUN1, ATSUN1 | SAD1/UNC-84 domain protein  0.75
C51134_G2_I2  AT3G18440.1     AtALMT9, ALMT9 | aluminum-activated malat  0.046
C40830_G1_I1  AT5G22060.1     ATJ2, J2 | DNAJ homologue 2 | chr5:730379  0        Y
C46581_G1_I1  AT5G51930.1     Glucose-methanol-choline (GMC) oxidore     8.1
C51134_G2_I3  AT1G79090.2     FUNCTIONS IN: molecular function unkno     1.3
C44353_G2_I1  AT1G76180.2     ERD14 Dehydrin family protein | chr1:28    1e-05    Y
C44353_G1_I1  AT1G20450.2     LTI29, LTI45, ERD10 | Dehydrin family pro  1e-07    Y
C43130_G3_I1  AT1G72110.1     O-acyltransferase (WSD1-like) family p     1.7
C44922_G1_I1  AT3G45600.1     TET3 | tetraspanin3 | chr3:16733973–16735  8e-156   Y
C40830_G1_I2  AT3G44110.1     ATJ3, ATJ | DNAJ homologue 3 | chr3:15869  1e-179   Y
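
The "Significant?" column of Table 2 amounts to thresholding the E-value of the best Arabidopsis hit for each transcript. The sketch below illustrates that step under stated assumptions: the cutoff `EVALUE_CUTOFF = 1e-5` is not given in this section and was picked only because it reproduces the flags in the table, and the function name `split_by_significance` is hypothetical. The E-values are those read from Table 2 (the value for C40830_G1_I1 is read as 0).

```python
# Assumed cutoff (not stated in the text); it reproduces the Y flags of Table 2.
EVALUE_CUTOFF = 1e-5

# (transcript, best Arabidopsis hit, E-value) as read from Table 2.
best_hits = [
    ("C52369_G2_I1", "AT5G04990.1", 0.75),
    ("C51134_G2_I2", "AT3G18440.1", 0.046),
    ("C40830_G1_I1", "AT5G22060.1", 0.0),
    ("C46581_G1_I1", "AT5G51930.1", 8.1),
    ("C51134_G2_I3", "AT1G79090.2", 1.3),
    ("C44353_G2_I1", "AT1G76180.2", 1e-05),
    ("C44353_G1_I1", "AT1G20450.2", 1e-07),
    ("C43130_G3_I1", "AT1G72110.1", 1.7),
    ("C44922_G1_I1", "AT3G45600.1", 8e-156),
    ("C40830_G1_I2", "AT3G44110.1", 1e-179),
]


def split_by_significance(hits, cutoff=EVALUE_CUTOFF):
    """Split transcripts into those with and without a significant hit."""
    found, missing = [], []
    for trs, _hit_id, evalue in hits:
        # A hit counts as significant when its E-value is at or below the cutoff.
        (found if evalue <= cutoff else missing).append(trs)
    return found, missing


found_in_arabidopsis, absent_in_arabidopsis = split_by_significance(best_hits)
print("Significant Arabidopsis homolog:", found_in_arabidopsis)
print("No significant homolog:", absent_in_arabidopsis)
```

A stricter or looser cutoff would move borderline transcripts such as C51134_G2_I2 (E-value 0.046) between the two groups, so the choice of threshold should be reported alongside any such comparison.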

Also, we corroborated the presence of a transcription factor that has an AP2 DNA-binding motif53,55, and identified additional splice/allelic variants with similar transcriptional levels. Once again, knowledge of the walnut genome would enable a deeper understanding of such genes.

Conclusions

In summary, the current work presents an integrated workflow for RNA-seq analysis with several innovative features for identifying and correcting erroneously assembled transcripts. We demonstrated this workflow by characterizing the transcriptome of the tissue at the heartwood/sapwood TZ in black walnut.

Data availability

F1000Research: Dataset 1. YeATS Dataset, 10.5256/f1000research.6617.d49730 (ref. 68)

Software availability

Archived source code as at the time of publication

http://dx.doi.org/10.5281/zenodo.33137

Software license

GNU General Public License version 3.0 (GPLv3)

how to cite this article
Chakraborty S, Britton M, Wegrzyn J et al. YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut [version 2; peer review: 3 approved] F1000Research 2015, 4:155 (https://doi.org/10.12688/f1000research.6617.2)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions
Version 2 (revised), published 06 Nov 2015
Reviewer Report 19 Feb 2016
Binay Panda, Genomics Applications and Informatics Technology laboratories (GANIT Labs), Bio-IT Centre, Institute of Bioinformatics and Applied Biotechnology(IBAB), Bangalore, Karnataka, India 
Approved
Chakraborty et al. implemented a workflow for error estimation and correction, functional annotation and abundance estimation in RNA-seq data. They explored a methodology of analyzing longest ORFs of transcripts, using BLAST, as means to identify important genes. Although BLAST has ...
HOW TO CITE THIS REPORT
Panda B. Reviewer Report For: YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut [version 2; peer review: 3 approved]. F1000Research 2015, 4:155 (https://doi.org/10.5256/f1000research.7788.r12065)
Reviewer Report 28 Jan 2016
Michael I. Love, Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA 
Approved
I do not have expertise in transcript assembly, but I can comment on the general readability and usability of the article and tool suite.
  1. As with the report from Dr. Charoensawan, I was expecting that the tool suite would be more integrated ...
HOW TO CITE THIS REPORT
Love MI. Reviewer Report For: YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut [version 2; peer review: 3 approved]. F1000Research 2015, 4:155 (https://doi.org/10.5256/f1000research.7788.r12066)
  • Author Response 01 Feb 2016
    Sandeep Chakraborty, Plant Sciences Department, University of California, Davis, 95616, USA
    We thank you for taking the time to review this paper, and for your comments.

    As we have mentioned previously in response to Dr Charoensawan, this manuscript is not meant to ...
Reviewer Report 04 Jan 2016
Varodom Charoensawan, Mahidol University, Bangkok, Thailand 
Approved
The authors have addressed most of my previous comments.

However, I still have one reservation on the use of Arabidopsis "proteome" (instead of publicly available transcriptomes) as a benchmark for walnut transcripts found, in the section "Identifying highly transcribed genes that are ...
HOW TO CITE THIS REPORT
Charoensawan V. Reviewer Report For: YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut [version 2; peer review: 3 approved]. F1000Research 2015, 4:155 (https://doi.org/10.5256/f1000research.7788.r11300)
  • Author Response 06 Jan 2016
    Sandeep Chakraborty, Plant Sciences Department, University of California, Davis, 95616, USA
    Dear Dr Charoensawan,
    We would like to thank you once again for critically reviewing, and accepting the revised version.

    As for the Table 2, which mentions the "Identifying highly transcribed genes ...
Version 1, published 17 Jun 2015
Reviewer Report 05 Oct 2015
Varodom Charoensawan, Mahidol University, Bangkok, Thailand 
Approved with Reservations
Chakraborty and coworkers proposed a new platform for analysing transcriptomic data from RNA-seq (YeATS - Yet Another Tool Suite for analyzing RNA-seq derived transcriptome). The key feature of the tool highlighted by the authors is error estimation and correction of assembled ...
HOW TO CITE THIS REPORT
Charoensawan V. Reviewer Report For: YeATS - a tool suite for analyzing RNA-seq derived transcriptome identifies a highly transcribed putative extensin in heartwood/sapwood transition zone in black walnut [version 2; peer review: 3 approved]. F1000Research 2015, 4:155 (https://doi.org/10.5256/f1000research.7105.r10335)
  • Author Response 06 Nov 2015
    Sandeep Chakraborty, Plant Sciences Department, University of California, Davis, 95616, USA
    We would like to thank you for taking the time to review this paper. Please find our responses below.

    Chakraborty and coworkers proposed a new platform for analysing transcriptomic data from ...
