Patching Holes in the Chlamydomonas Genome

The Chlamydomonas genome has been sequenced, assembled, and annotated to produce a rich resource for genetics and molecular biology in this well-studied model organism. However, the current reference genome contains ∼1000 blocks of unknown sequence (‘N-islands’), which are frequently placed in introns of annotated gene models. We developed a strategy to search for previously unknown exons hidden within such blocks, and determine the sequence, and exon/intron boundaries, of such exons. These methods are based on assembly and alignment of short cDNA and genomic DNA reads, completely independent of prior reference assembly or annotation. Our evidence indicates that a substantial proportion of the annotated intronic N-islands contain hidden exons. For most of these, our algorithm recovers full exonic sequence with associated splice junctions and exon-adjacent intronic sequence. These new exons represent de novo sequence generally present nowhere in the assembled genome, and the added sequence improves evolutionary conservation of the predicted encoded peptides.

In Supplemental_Fasta 2-5, uppercase letters indicate sequence that was aligned by Blastn to existing Phytozome transcript models. Lowercase letters represent unaligned sequence (i.e. sequence present in the Trinity object, but potentially missing from the genome assembly).

MATLAB SAMPLE ANALYSIS (File S6)
A subset of the raw data is provided in the file 'Matlab_sample_analysis_and_README.zip'. These raw data can be analyzed with the Matlab script (File S5).

Overview of computational pipeline
Generating the Trinity assembly and alignment of genomic reads
Single-end 50 bp reads from RNAseq time-courses (Tulin and Cross, 2015) from six libraries were used. We collected all reads that mapped to gene models that contain internal (intronic or exonic) and flanking N-islands (Trinity6_Nisland_genes.bed). We also included unmapped reads from two libraries. In total, 57 million reads were used as input to Trinity (Haas et al. 2013).
Trinity generated 10134 sequences. These were filtered down to 3473 sequences by requiring a maximal e-value of 0.01 to either Arabidopsis TAIR10 or Volvox 2.0 peptides. To reduce redundancy within this set, we filtered again with the usearch program (http://www.drive5.com/usearch/):
$ usearch -cluster_fast Trinity6_blastx_filtered.fa -sort length -id 0.8 -strand both -centroids
This gave the final set of 3114 Trinity objects that we analyze in the paper. To facilitate downstream analysis, we took the reverse complement of Trinity objects, where necessary, to match the strand of the top Blastn hit to Phytozome Chlamydomonas primary transcripts.
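The e-value filter and the strand correction described above can be illustrated with a minimal Python sketch. It is not the script used in this work: the function names, the dictionary of best e-values per peptide database, and the "plus"/"minus" strand labels are all hypothetical, and only the 0.01 cut-off comes from the text.

```python
from typing import Dict, Optional

# Complement table for DNA, keeping the case of each letter.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def passes_evalue_filter(best_evalues: Dict[str, Optional[float]],
                         cutoff: float = 0.01) -> bool:
    """Keep a Trinity object if its best hit to at least one peptide
    database has an e-value at or below the cut-off.

    best_evalues maps a database label (hypothetical, e.g. "TAIR10",
    "Volvox") to the best e-value found, or None if there was no hit.
    """
    return any(e is not None and e <= cutoff for e in best_evalues.values())


def strand_correct(seq: str, top_hit_strand: str) -> str:
    """Reverse-complement the Trinity sequence when its top Blastn hit
    lies on the minus strand (strand labels are an assumption here)."""
    return reverse_complement(seq) if top_hit_strand == "minus" else seq
```

For example, strand_correct("AACG", "minus") returns "CGTT", while a plus-strand hit leaves the sequence unchanged.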
A bowtie2 index was generated from the 3114 Trinity objects, and genomic 100 bp single-end reads from wild type Chlamydomonas (CC-124) were aligned to it with bowtie2. The genomic reads used corresponded to ~100x coverage of the Chlamydomonas genome.
The mismatch (--mp) and read-gap (--rdg) penalties were set higher than the bowtie2 defaults (defaults: --mp 6; --rdg 5,3) to promote perfect alignments to the Trinity sequence. We expected no polymorphisms, since the Trinity assembly and the genomic reads come from the same strain background.
The alignment to Trinity sequences was used to build 'connected islands' by Matlab (Figure 5,6).
Analysis of Blastn alignment between Trinity and Phytozome transcript models
We performed a Blastn alignment between the strand-corrected Trinity objects and Phytozome primary transcript models:
$ blastn -query accessory_files/Creinhardtii_281_v5.5.transcript_primaryTranscriptOnly.fa -db ~/blast_db/Trinity6_strand_corrected_raw -ungapped -out Creinhardtii_281_v5.5.transcript_primaryTranscriptOnly_vs_Trinity6_strand_corrected_raw_blastn_ungapped.bls
The Blastn alignment was analyzed by 'myNisland_accounting.pl' and accessory files in Supplemental Files according to the basic algorithm:
COVERED: An intronic N island is marked by two adjacent nucleotides in the Phytozome transcript sequence. If there is a Blastn HSP to a Trinity object that extends across the N-island-junction, we considered this evidence for the Phytozome assignment of the N-island as fully intronic. We required that the Trinity object be contained within a 'connected island' that extends at least 100 bp to the left and right of the N-island-junction. A covered N-island-junction is shown in Figure 1B (top panel).
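The COVERED test can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of 'myNisland_accounting.pl'. The HSP record, the function name and the coordinate conventions (1-based, inclusive, plus/plus orientation after strand correction) are hypothetical. The 'connected island' is assumed to be available as a precomputed (start, end) interval in Trinity coordinates. Because the alignment is ungapped, the junction position is mapped linearly from transcript to Trinity coordinates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HSP:
    """One ungapped Blastn HSP (hypothetical record; 1-based, inclusive)."""
    t_start: int  # start in Phytozome transcript coordinates
    t_end: int    # end in Phytozome transcript coordinates
    y_start: int  # start in Trinity coordinates
    y_end: int    # end in Trinity coordinates


def is_covered(hsp: HSP, junction: int, island: Tuple[int, int],
               flank: int = 100) -> bool:
    """Return True if the HSP spans the N-island-junction and the connected
    island extends at least `flank` bp on both sides of it.

    `junction` is the transcript position of the nucleotide immediately
    left of the N-island-junction; its right neighbour is junction + 1.
    """
    # The HSP must contain both nucleotides that mark the junction.
    if not (hsp.t_start <= junction and hsp.t_end >= junction + 1):
        return False
    # Ungapped alignment: transcript and Trinity coordinates differ by a
    # constant offset inside the HSP.
    y_left = hsp.y_start + (junction - hsp.t_start)
    y_right = y_left + 1
    isl_start, isl_end = island
    return isl_start <= y_left - flank and isl_end >= y_right + flank
```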
BRIDGED: If a single Trinity object aligns with one HSP on either side of the N-island-junction, with unaligned Trinity sequence between the two HSPs, we called the N island BRIDGED. To score as BRIDGED, we also required that:
1. the Trinity object be contained within a 'connected island' that extends at least 100 bp to the left and right of the unaligned Trinity segment.
2. <100% of the Trinity object is aligned.
A bridged N-island-junction is shown in Figure 1B (bottom panel).
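A minimal Python sketch of the BRIDGED criteria is given below. The record layout, names and coordinate conventions (1-based, inclusive, plus/plus orientation) are assumptions made for illustration, and the connected island is taken as a precomputed interval in Trinity coordinates. For simplicity the sketch requires each HSP to lie strictly on its own side of the junction.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HSP:
    """One ungapped Blastn HSP (hypothetical record; 1-based, inclusive)."""
    t_start: int  # start in Phytozome transcript coordinates
    t_end: int    # end in Phytozome transcript coordinates
    y_start: int  # start in Trinity coordinates
    y_end: int    # end in Trinity coordinates


def is_bridged(left: HSP, right: HSP, junction: int, trinity_len: int,
               island: Tuple[int, int], flank: int = 100) -> bool:
    """Return True if two HSPs of one Trinity object flank the junction
    with unaligned Trinity sequence between them, and the connected island
    extends `flank` bp beyond both ends of that unaligned segment."""
    # One HSP on each side of the junction (junction | junction + 1).
    if left.t_end > junction or right.t_start < junction + 1:
        return False
    # Unaligned Trinity segment between the two HSPs.
    gap_start, gap_end = left.y_end + 1, right.y_start - 1
    if gap_end < gap_start:
        return False
    # Criterion 2: less than 100% of the Trinity object is aligned.
    aligned = (left.y_end - left.y_start + 1) + (right.y_end - right.y_start + 1)
    if aligned >= trinity_len:
        return False
    # Criterion 1: connected-island support around the unaligned segment.
    isl_start, isl_end = island
    return isl_start <= gap_start - flank and isl_end >= gap_end + flank
```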
HALF-BRIDGED: We considered the possibility that the Trinity sequence may in some cases be too short to complete a full bridge. In such cases, the Trinity object would align (Blastn) with a single HSP on either the left, or the right side of the N-island-junction, with an unaligned 'tail' extending in the direction of the N island.
To score a half-bridged intronic N island, we required:
1. that the HSP ends (if on the left side) or begins (if on the right side) within 50 bp of the N-island-junction.
2. that the HSP does not overlap by more than 10 bp from the left to the right side or vice versa. An overlap of up to 10 bp is accepted since this occurs frequently in Blastn alignments due to short sequence repeats at the splice junction.
3. that the HSP length (aligned sequence) is <40% of the Trinity length. This is a subjective cut-off to focus on cases where a substantial proportion of the Trinity sequence is unaccounted for.
4. that a 'connected island' spans at least 200 bp around the border between the HSP and the unaligned tail.
In addition, due to the lower confidence in half-bridges (since they are not anchored on both sides of the N-island-junction), we required a sequence-dependent score increase of 10 or more by the Blastx test to Volvox described in the main text, i.e. the half-bridges require support from evolutionary conservation.
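The half-bridge criteria can be combined into a single predicate, sketched here in Python. All names and the record layout are hypothetical. Two interpretations are assumptions of this sketch rather than statements from the text: the 200 bp connected-island span is read as 100 bp on each side of the HSP-tail border, and the Blastx score increase is passed in as a precomputed number.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HSP:
    """One ungapped Blastn HSP (hypothetical record; 1-based, inclusive)."""
    t_start: int  # start in Phytozome transcript coordinates
    t_end: int    # end in Phytozome transcript coordinates
    y_start: int  # start in Trinity coordinates
    y_end: int    # end in Trinity coordinates


def is_half_bridged(hsp: HSP, side: str, junction: int, trinity_len: int,
                    island: Tuple[int, int], blastx_gain: float,
                    max_dist: int = 50, max_overlap: int = 10,
                    max_frac: float = 0.40, span: int = 200,
                    min_gain: float = 10) -> bool:
    """Apply the half-bridge criteria to one HSP lying to the `side`
    ("left" or "right") of the junction (junction | junction + 1)."""
    if side == "left":
        dist = junction - hsp.t_end          # negative means overlap
        border = hsp.y_end                   # tail extends to the right
        has_tail = hsp.y_end < trinity_len
    elif side == "right":
        dist = hsp.t_start - (junction + 1)  # negative means overlap
        border = hsp.y_start                 # tail extends to the left
        has_tail = hsp.y_start > 1
    else:
        raise ValueError("side must be 'left' or 'right'")
    if not has_tail:
        return False
    # Criteria 1 and 2: close to the junction, at most 10 bp across it.
    if dist > max_dist or -dist > max_overlap:
        return False
    # Criterion 3: the HSP accounts for <40% of the Trinity length.
    if (hsp.y_end - hsp.y_start + 1) >= max_frac * trinity_len:
        return False
    # Criterion 4 (assumed to mean span/2 on each side of the border).
    half = span // 2
    isl_start, isl_end = island
    if not (isl_start <= border - half and isl_end >= border + half):
        return False
    # Evolutionary support: Blastx score increase of 10 or more.
    return blastx_gain >= min_gain
```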
FLANKING: To identify cases where an N island may cause premature termination of a gene model, we looked for Trinity objects that align (Blastn) within the transcript body, with an unaligned 'tail' extending in the direction of a flanking N-island (an N-island assigned to lie outside the 3' or 5' border of the gene). As for the half-bridged N-islands, we required support from a 'connected island' at the HSP-tail border, and evolutionary support from the Blastx test to Volvox.

Result
A total of 40 exonic, 136 flanking and 789 intronic N islands were analyzed. This is not the full set of N islands in the Chlamydomonas genome (v5.3.1); we limited the analysis to those N-island-containing transcripts that were the top hit of some Trinity object by Blastn. This was done to prevent scoring N islands based on spurious Trinity-Phytozome alignments.
We scored:
272 COVERED N islands (Supplemental Table 1)
104 intronic BRIDGED N islands (Supplemental Table 2)
4 exonic BRIDGED N islands (Supplemental Table 3)
13 intronic HALF-BRIDGED N islands (Supplemental Table 4)
11 FLANKING N islands (Supplemental Table 5)

Alignments
We constructed nucleotide sequence alignments (clustalo, Sievers et al. 2011) between the Trinity objects and their corresponding Phytozome transcript models (Supplemental Alignments). The files are labeled as:
Cre01.g003532.t1.1_comp5218_c0_seq1.aln
where Cre01.g003532.t1.1 is the Phytozome identifier of the N-island-containing transcript, and comp5218_c0_seq1 is the Trinity object containing new sequence information.
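For readers who want to process the alignment files in bulk, the file-naming scheme can be split into its two identifiers with a short Python sketch. The function name is hypothetical, and the regular expression assumes that every Trinity identifier has the form comp<number>_c<number>_seq<number>, as in the example above.

```python
import re
from typing import Tuple

# Phytozome transcript ID, underscore, Trinity ID, then the .aln suffix.
ALN_NAME = re.compile(r"^(?P<transcript>.+?)_(?P<trinity>comp\d+_c\d+_seq\d+)\.aln$")


def split_alignment_name(filename: str) -> Tuple[str, str]:
    """Return (Phytozome transcript ID, Trinity object ID) for one file name."""
    match = ALN_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected alignment file name: {filename}")
    return match.group("transcript"), match.group("trinity")
```

Applied to the example file name, the function returns the pair ("Cre01.g003532.t1.1", "comp5218_c0_seq1").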