Revision and reannotation of the Halomonas elongata DSM 2581T genome

Abstract The genome of the Halomonas elongata type strain DSM 2581, an industrial producer, was reevaluated using the Illumina HiSeq2500 technology. To resolve duplication‐associated ambiguities, PCR products were generated and sequenced. Outside of duplications, 72 sequence corrections were required, of which 24 were point mutations and 48 were indels of one or few bases. Most of these were associated with polynucleotide stretches (poly‐T stretch overestimated in 19 cases, poly‐C underestimated in 15 cases). These problems may be attributed to using 454 technology for original genome sequencing. On average, the original genome sequence had only one error in 56 kb. There were 23 frameshift error corrections in the 29 protein‐coding genes affected by sequence revision. The genome has been subjected to major reannotation in order to substantially increase the annotation quality.

We have reported the complete genome sequence of H. elongata in 2011 (Schwibbert et al., 2011). Although the number of disrupted genes was small, we subsequently found several of them to be functional. Resequencing of individual genes identified frameshift errors in the original 454-based genome sequence. Here, we report an improved genome sequence based on a large-scale resequencing effort using Illumina technology.

| Genome resequencing and genome error identification
Having identified several functional protein-coding genes, which seemed disrupted according to our genome sequence, we decided to resequence the genome using the Illumina HiSeq2500 technology (GATC Biotech, Konstanz, Germany) for a high genome coverage. The reads were used to pinpoint genome sequence errors, applying the methods described for an Aeromonas salmonicida genome sequencing project (Zamora-Lagos et al., in prep.). In brief, reads were mapped to the reference genome (using BowTie, TopHat), followed by mismatch and indel detection (with Samtools, Varscan) (Koboldt et al., 2009;Langmead & Salzberg, 2012;Li et al., 2009;Trapnell, Pachter, & Salzberg, 2009). In the A. salmonicida genome project, we had randomly spotted additional sequence errors, which had not been detected by the above tools. Therefore, we developed a k-mer-based algorithm that permitted to identify additional sequence errors (Zamora-Lagos et al., in prep.). Briefly, k-mers (49-mers) were extracted from the Illumina reads, applying a one-step window. For each k-mer, the occurrence count in the Illumina reads was determined (with k-mers and their reverse complements being merged). In parallel, k-mers were identified in the genome sequence. At each of the bases covered by a k-mer, the occurrence count was added in order to estimate Illumina read coverage. The resulting data were analyzed in two ways: (a) We searched for drops in Illumina read coverage, which are expected to indicate genome sequence errors. The corresponding genome positions were subjected to manual curation. (b) Also, we searched for k-mers with a high occurrence count in the Illumina reads, which are not found in the genome sequence. An attempt was made to identify the genome region to which these reads belong. In several cases, this uncovered additional polymorphic versions of duplicated genome regions.

| Genome reannotation
In the original genome sequencing report, we had only subjected the metabolic network relevant for ectoine biosynthesis to detailed manual curation. Thereafter, we have improved the annotation by applying manual curation to the complete genome. (a) For about one-third of the genome, we applied a stringent annotation strategy (Pfeiffer & Oesterhelt, 2015). This strategy requires that annotations are transferred only from homologs, which themselves have been subjected to experimental analysis (Gold Standard Proteins). Briefly, we use the SwissProt section of UniProt as a rich source for such proteins (last accessed release: 2017_01, 18-Jan-2017) (UniProt Consortium, 2016). In addition, a significant amount of literature work has been performed during this project. Literature work was mainly required when experimental data were yet unavailable via UniProt (we provided feedback to UniProt in such cases). (b) As this procedure is extremely time-consuming, we developed a more relaxed annotation strategy for the remainder of the genome. This is based on our experience that annotations in the SwissProt section of UniProt are highly reliable. Proteins were compared to the SwissProt section of UniProt using BLASTp (Altschul et al., 1997). Homologs with high levels of sequence identity (seqid) (commonly >50%) were assumed to be equivalogs (orthologs with identical function). The SwissProt annotation was transferred to the H. elongata protein-coding genes. In case of more distantly related homologs (40%-50% seqid), the SwissProt sequence was compared to the H. elongata proteins using BLASTp to verify sequence orthology before annotation transfer. For proteins with only more distant homologs in SwissProt functional conservation was considered unlikely. Thus, only a general protein name was assigned in order to escape the "overannotation problem" (Pfeiffer & Oesterhelt, 2015). In such cases, protein names are shaped according to the more distant UniProt/SwissProt homologs and/or assigned InterPro domains in order to keep a high level of naming consistency. We also attempted to minimize EC number assignment inconsistencies between our annotation and the KEGG database (last accessed: 17-Jan-2017) (Kanehisa, Furumichi, Tanabe, Sato, & Morishima, 2017).

| RESULTS AND DISCUSSION
The available genome sequence is mainly based on data obtained with the 454 technology (Schwibbert et al., 2011). Upon further study of the organism, we found several genes, which were expected to be functional but were disrupted by frameshift according to the genome sequence. Examples are PyrC (Helo_2340), required for pyrimidine biosynthesis and MoeA (Helo_2621), required for MoCo biosynthesis. Also, Zwf (Helo_3637) is required for the cytosolic conversion of glucose-6-phosphate to gluconate-6-phosphate at the beginning of the Entner-Doudoroff pathway. Low-scale resequencing of PCR products showed that several of these genes were not disrupted by frameshift but that the genome sequence was not correct. Therefore, we decided for a systematic approach to increase genome sequence reliability. We selected the Illumina HiSeq2500 technology and achieved a 650-fold genome coverage (21 million reads of 2.6 Gbp total length).
Initially, we applied standard technologies for detection of differences between the reference genome and the set of Illumina reads (read mapping, followed by mismatch and indel detection). Commonly, such tools are used for SNP (single-nucleotide polymorphism) analysis.
However, a sequence error in the reference genome can be considered as an equivalent to a mutant in a sequence dataset. By this approach, we identified 62 genome sequence errors. We had very recently performed another such analysis in a genome sequencing project on Aeromonas salmonicida subsp. pectinolytica 34mel (Zamora-Lagos et al., in prep.). During that project, we had randomly spotted genome sequence errors which had been overlooked by the standard methodology. Several of the corresponding errors proved to be unusually severe (e.g., three closely spaced point mutations), which interfered with read mapping at the applied stringency. Thus, the genome sequence differed considerably from that of the Illumina reads but the standard methodology did not list this case. As a consequence, a required correction of a more severe genome error would have been missed while the minor errors would have been corrected.
To overcome this problem, we developed a complementary parameter-independent algorithm for detection of potential genome errors (Zamora-Lagos et al., in prep.). Upon application of this methodology to the H. elongata genome, we detected only very few additional problems in unique genome regions. However, our algorithm proved efficient to identify previously overlooked genome sequence errors in duplicated regions of the H. elongata genome.
(1) In one case, two copies of a gene (Helo_1778 and Helo_2803) were previously considered truly identical over the N-terminal 335 codons, even at the DNA sequence level, while our analysis uncovered that the sequences differ at two base positions, both mutations being silent. This problem escaped detection by the standard methodology (reads for the incorrect sequence are available due to the other gene copy and are mapped to both genome regions, thus masking the problem). However, as the variant sequences were frequent in the Illumina reads but were missing in the genome, they were detected as high-frequency novel k-mers. (2)   Within unique genome regions, we found 72 differences (Table 2), with fewer simple mutations (23 out of 24 were point mutations) than frameshifts (48). There were only three simple onebase frameshifts. Most frameshifts (37) occurred in polynucleotide runs. In 19 cases, T-runs were one base too long; in 15 cases, C-runs were one base too short; in the remaining three cases, C-runs were one base too long. The overrepresentation of homopolymer-related frameshifts in a sequence originally obtained by 454 pyrosequencing is noteworthy.
Overall, 29 protein-coding genes were affected by mutations (Table 1, Table 3). Of the six genes affected only by point mutation, two have a modified protein sequence, while the mutations are silent in the other four. There were 23 frameshift differences. In 13 cases, the gene was annotated as disrupted but is functional. In the other 10 cases, the gene was annotated as functional but had an invalid C-terminal (7) or N-terminal (3) sequence.
In the original sequencing report, we applied a stringent annotation strategy only to proteins directly connected to ectoine metabolism, while the remainder of the genome was left as predicted by an annotation robot (Schwibbert et al., 2011). In the meantime, we have applied the stringent annotation strategy (Pfeiffer & Oesterhelt, 2015) to one-third of the genome. This detailed postannotation was performed during a study concerned with detailed analysis of osmoregulation in H. elongata (Kindzierski et al., 2017). For the remainder, a more relaxed manual curation strategy was applied which benefits from the high reliability of annotations in the UniProt/SwissProt database (see methods). For a significant number of genes, start codons were reassigned, applying a BLAST-based start codon checking procedure (Pfeiffer et al., 2008). Also, a number of previously missed gene annotations were resolved by postprediction, commonly detected in long "intergenic" regions. The genome reannotation was managed in Differences are summarized based on genome category (duplication-associated or unique), mutations category (nonshifting or indel). Numbers of cases are provided for different mutation classes and summarized by mutation category (sum1), by genome category (sum2), and for all cases (total).

Mutation category Sequence revision class Number Sum Total
Nonshifting mutation only Silent mutation 4 6 29 Protein sequence differs 2

Frameshift
Repair of known frameshift 13 23 C-term region replaced 7 N-term region replaced 3 Numbers of cases are provided for different sequence revision class and summarized for mutation categories and for all cases (total).

T A B L E 3 Protein-coding genes affected by genome sequence error corrections
HaloLex (Pfeiffer et al., 2008). The revised and reannotated genome has a length of 4,061,825 bp (previously 4,061,296 bp) and has 3,644 protein-coding genes.  (Wu et al., 2012). As the substrate specificity cannot be securely predicted for the H. elongata homologs, we assigned gene names spuE1 and spuE2. P. aeruginosa SpuA is somewhat more closely related to Helo_1903 than E. coli PuuD (49% and 44% seqid, respectively) but only E. coli PuuD has been functionally characterized and thus this annotation is preferred for the H. elongata homolog.
Finally, we want to comment on one annotation detail, the 3′ end of the 16S rRNA. We consider GGATCACCTCCTTA to be the correct 3′ sequence, while our previous annotation had a 2 nt extension (AT).
Our previous annotation was based on the publication by Lin, Chang, Chiang, & Tang (2008). These authors are clearly correct in their general conclusion that many 16S rRNA sequences in the databases are too short because they lack the conserved anti-SD sequence CCTCC.
However, the authors added 5 instead of 3 nt after the anti-SD sequence, even for the E.coli 16S rRNA (Suppl.  (Nawrocki et al., 2015). We also made other RNA annotations consistent with RFAM except for tRNAs which we made consistent with GtRNADB (Chan & Lowe, 2016).
The nucleotide sequence accession number is FN869568, sequence version 2.