Updated functional annotation of the Mycobacterium bovis AF2122/97 reference genome

Mycobacterium bovis AF2122/97 is the reference strain for the bovine tuberculosis bacillus. Here we report an update to the M. bovis AF2122/97 genome annotation to reflect 616 new protein identifications that replace many of the old hypothetical coding sequences and proteins of unknown function in the genome. These changes integrate information from functional assignments of orthologous coding sequences in the Mycobacterium tuberculosis H37Rv genome. We have also added 69 additional new gene names.


INTRODUCTION
Mycobacterium bovis is a causative agent of bovine tuberculosis (bTB) and the most widely studied animal-adapted member of the Mycobacterium tuberculosis complex (MTBC). The genome of the M. bovis AF2122/97 strain was first sequenced in 2003 [1] and is considered to be the reference sequence for this species. Mycobacterium bovis has considerable importance as the basis for comparative studies into animal-and human-adapted species of the MTBC. This genome was revised in 2017 [2] with an updated sequence that included the previously missing RD900 region and added 42 new coding sequences. These revisions brought the annotation in line with updates to the M. tuberculosis H37Rv genome, to which M. bovis shares high identity.

Hypothetical and unknown proteins
Proteins encoded by genes with no clear functional activity have traditionally been annotated with designations such as 'unknown protein' , 'hypothetical protein' or 'conserved hypothetical' . These labels are usually placed in the /product field of the annotation file (in this context we refer to the GenBank file format). This /product qualifier is not automatically updated when new protein functions are discovered, and hence genome annotation files become out of date over time if not regularly updated. Cross-references to databases with functional information are automatically added in a /db_xref qualifier during updates on the DDBJ/ EMBL/GenBank system; however, these are not human readable. Therefore, it is a valuable and necessary exercise to update the protein product information directly in genome annotations of reference species.

New sources of annotation
Since the original annotation of M. bovis AF2122/97, the function of many hypothetical and newly identified proteins has been recognized. These data were collated by Doerks et al. in 2012 [3], who added approximately 620 new functional assignments to the remaining unknowns in the M. tuberculosis H37Rv reference genome. This was done using orthology and genomic context evidence. If a hypothetical protein was a member of a known orthologous group in the eggNOG database, this annotation was transferred to the corresponding M. tuberculosis protein.
The STRING tool was also used, combining gene fusion events and significant co-occurrence to predict links with known proteins. The annotations vary in specificity; those predicted through association are more 'functional hints' as to the nature of the protein function rather than ascribing a specific function.
The PATRIC database [4] used their own pipeline based on RASTtk to reannotate H37Rv. The PATRIC annotation has some additional functional assignments and small genes that were not in the M. tuberculosis H37Rv reference. Some of the genes found by Doerks et al. are also assigned in PATRIC and a few have since been assigned to the reference. Fig. 1 shows the overlap in these two sets with the existing unknown proteins in the H37Rv reference.
UniProt [5] stores up-to-date protein annotations for the M. tuberculosis H37Rv strain, several of which are recent and were not present in either the PATRIC or the Doerks et al. data. These three sources were used to make an integrated table that was used to update the protein product field of the M. bovis AF2122/97 genome annotation.

MeTHODS
Data from three sources were used: (1) Doerks et al.  The resulting table with new protein products was then matched to the M. bovis orthologous genes using a mapping between M. tuberculosis H37Rv and M. bovis locus tags. In a final step we performed a blast [6] of the remaining unknowns to the Protein Data Bank and found five additional proteins that have structures and function ascribed, and these were also added. Analysis was performed in Python, utilizing the Biopython [7] and pandas [8] libraries.

ReSUlTS
The current M. bovis genome annotation contains 3989 protein coding genes [2] . Of these, 1097 were marked as hypothetical, conserved or unknown proteins. These data are summarized in Table 1, with two M. tuberculosis H37Rv annotations for comparison. We have now added a total of 616 new protein product annotations to the genome, including five products from a PDB search. Sixty-nine new gene names were added from the UniProt data. There are now 488 hypothetical/ unknowns remaining in the updated M. bovis AF2122/97 annotation. This revised annotation has been submitted to DDBJ/ENA/GenBank and is available under the accession no. LT708304.

CONClUSION
Reference genomes stored on the International Nucleotide Sequence Database Collaboration (INSDC) provide the primary sources of annotation for virtually all bacterial species. Although there are now multiple alternative information sources for bacterial genomes, most are specialist databases and not universally known. The proliferation of data sources risks fragmentation of genome annotation and linked functional information; it is vitally important that reference sequence annotation in GenBank and Ensembl, the first port of call for the majority of researchers, are as up to date as possible. The M. bovis AF2122/97 strain is well established as a reference for M. bovis and the MTBC, and will remain a research cornerstone for the foreseeable future; maintaining an updated genome annotation for the research community drove our current work. We note that the latest reference annotation of M. tuberculosis H37Rv also lacks many of the updates we have added here to hypothetical proteins, underlining the need for constant curation of reference sequence annotation.