Maize GO Annotation—Methods, Evaluation, and Review (maize‐GAMER)

Abstract We created a new high‐coverage, robust, and reproducible functional annotation of maize protein‐coding genes based on Gene Ontology (GO) term assignments. Whereas the existing Phytozome and Gramene maize GO annotation sets only cover 41% and 56% of maize protein‐coding genes, respectively, this study provides annotations for 100% of the genes. We also compared the quality of our newly derived annotations with the existing Gramene and Phytozome functional annotation sets by comparing all three to a manually annotated gold standard set of 1,619 genes where annotations were primarily inferred from direct assay or mutant phenotype. Evaluations based on the gold standard indicate that our new annotation set is measurably more accurate than those from Phytozome and Gramene. To derive this new high‐coverage, high‐confidence annotation set, we used sequence similarity and protein domain presence methods as well as mixed‐method pipelines that were developed for the Critical Assessment of Function Annotation (CAFA) challenge. Our project to improve maize annotations is called maize‐GAMER (GO Annotation Method, Evaluation, and Review), and the newly derived annotations are accessible via MaizeGDB (http://download.maizegdb.org/maize-GAMER) and CyVerse (B73 RefGen_v3 5b+ at doi.org/10.7946/P2S62P and B73 RefGen_v4 Zm00001d.2 at doi.org/10.7946/P2M925).

We have generated this illustration to clarify the specificity of the predicted GO annotations. The specificity of a predicted annotation falls into one of three categories: case (A), where the prediction is less specific than the gold-standard annotation; case (B), where the prediction and the gold standard have the same specificity; and case (C), where the prediction is more specific than the gold standard. A less specific prediction is not necessarily incorrect, but it is not as detailed (granular) as the gold standard. This is a known issue when predicting GO terms, because the amount of available data is inadequate for constructing mathematical or machine-learning methods.
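The three specificity cases above can be distinguished programmatically using the GO hierarchy: a prediction is less specific than the gold standard when it is an ancestor of the gold term, and more specific when it is a descendant. The sketch below illustrates this; the function name and the ancestor-map representation are illustrative assumptions, not part of the maize-GAMER codebase.

```python
def classify_specificity(pred, gold, ancestors):
    """Classify a predicted GO term relative to a gold-standard term.

    pred, gold: GO term identifiers (strings).
    ancestors: dict mapping each GO term to the full set of its
    ancestral terms (the is_a/part_of closure in the GO DAG).

    Returns 'less' (case A: prediction is broader than the gold term),
    'same' (case B), 'more' (case C: prediction is more granular),
    or 'unrelated' when neither term subsumes the other.
    """
    if pred == gold:
        return "same"                               # case B
    if pred in ancestors.get(gold, set()):
        return "less"                               # case A: pred is an ancestor of gold
    if gold in ancestors.get(pred, set()):
        return "more"                               # case C: pred is a descendant of gold
    return "unrelated"
```

For example, with a toy hierarchy where `GO:child` has ancestor `GO:root`, predicting `GO:root` against a gold annotation of `GO:child` is classified as case (A), "less" specific.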
I am wondering whether the authors' efforts to annotate maize genes can be implemented in other plants, or even animals. If so, can we still get a 100% annotation rate?
Because we have made the code and methods available so that others can try them out, we are hopeful that these questions will be pursued, by our own group and/or others. Preliminary analyses (not reported here) indicate that the system works well in other plants.
Table 2: "Descripti" should be "Description".
Thank you for bringing that to our notice. There was indeed a margin issue with that row, and we have fixed it to show the full text of that cell.

Reviewer #3:
The manuscript entitled "Maize GO Annotation—Methods, Evaluation, and Review (maize-GAMER)" by Wimalanthan et al. reports a new maize functional annotation resource that will be very helpful for the maize community, as well as bioinformatics tools for gene functional annotation that can be applied to a wide range of organisms.
A major strength of this work is the use of a set of expert-curated annotations to guide the automatic annotation process. In addition, the pipeline is based on state-of-the-art methods. The result is a functional annotation of maize genes of unprecedented richness and coverage. Without any doubt, this work will be very useful to many researchers interested in gene function.

Major comments
Maize-GAMER is an aggregate of many Component Annotation Sets. Some of these sets are derived from distinct reference organisms. It is therefore possible that these components include conflicting or inconsistent annotations for a single gene. The hF1 metric can tell how far a predicted annotation is from the gold standard (GS), but not why. The annotation of a single gene may include more terms than the GS, with the possibility of inconsistencies making the annotation of the gene difficult to use (decreased specificity?). Or, in contrast, the predicted annotation may miss terms present in the gold standard. Or both. It may be interesting to discuss this point. I would also appreciate seeing the hF1 and specificity distributions (for all genes of the GS) for each component and the aggregated annotation; it might be interesting and better support the conclusion of the na1 case study (part 4.3).
We thank the reviewer for the insightful comments. We used the hF1 metric to assess how the different methods used to produce the GO dataset for maize performed. The hF1 metric is the harmonic mean of hierarchical Precision (hPr) and hierarchical Recall (hRc). These metrics are similar to Precision and Recall but take the hierarchical nature of the Gene Ontology into account when evaluating against the gold standard. The hPr metric indicates the proportion of GO terms predicted by a method that were correct; a low value reflects the inconsistencies introduced when a method annotates a large number of GO terms, and it can also be used to gauge how specific the annotations from a given method are. hRc gives the proportion of GO terms in the gold standard dataset that were correctly predicted; low hRc values reflect the inability of a particular method to capture the GO terms annotated to a particular gene in the gold standard. We agree that comparing the distributions of hF1, hPr, and hRc would give a clearer picture of the value of the annotations from the different component methods and the existing maize annotation datasets. We have provided supplemental figure S1 with the distributions of hF1, hPr, and hRc for the different component methods, the aggregate maize-GAMER dataset, and the existing maize GO datasets. We have also added text to the discussion section to explain the differences among the tools.
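The hierarchical metrics described in the response above can be sketched in a few lines: each annotation set is expanded with the ancestral terms of its GO terms before precision and recall are computed, which is what makes the metrics "hierarchical." Function and variable names below are illustrative assumptions, not the maize-GAMER implementation.

```python
def hierarchical_metrics(predicted, gold, ancestors):
    """Compute hPr, hRc, and hF1 for one gene.

    predicted, gold: sets of GO term IDs annotated to the gene.
    ancestors: dict mapping each GO term to the set of its ancestral
    terms (is_a/part_of closure), so comparisons respect the hierarchy.
    """
    def expand(terms):
        # Close each term set over its ancestors in the GO DAG.
        out = set(terms)
        for t in terms:
            out |= ancestors.get(t, set())
        return out

    P, G = expand(predicted), expand(gold)
    overlap = len(P & G)
    hpr = overlap / len(P) if P else 0.0  # fraction of (expanded) predictions that are correct
    hrc = overlap / len(G) if G else 0.0  # fraction of (expanded) gold terms recovered
    hf1 = 2 * hpr * hrc / (hpr + hrc) if (hpr + hrc) else 0.0
    return hpr, hrc, hf1
```

As a toy example: if the gold term and the predicted term are siblings sharing one common ancestor, the expanded sets overlap only on that ancestor, giving hPr = hRc = hF1 = 0.5, whereas flat precision and recall would both be zero.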

Minor comments
3.5.1 "Redundancy is the proportion of shared ancestral terms annotated for each gene averaged across all genes." While the term ancestral is well defined in Defoin-Platel et al., this term should be explained in this manuscript, as it might be misleading to geneticists and evolution researchers.
We have defined and explained ancestral GO terms immediately before Redundancy is defined in the text.
3.5.1 "An ideally cleaned annotation set for a single gene would have no duplication, no redundancy, and high coverage and specificity." Not sure what coverage means at the scale of a single gene if I consider the definition of coverage given above: "Coverage is the proportion of genes that have at least one GO term assigned."
We thank the reviewer for the attention to detail, and agree that coverage does not apply to a single gene. We have updated the text to describe an annotation set rather than a single gene.
4.4.1 "The IPRS annotation set had a lower number of annotations compared to the CAFA mixed-method pipelines, but covered more genes than methods." Covered more genes than which methods?
We appreciate the comment and have corrected the sentence: it should have read "sequence-similarity methods" instead of "methods".
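The coverage and redundancy measures discussed in these comments can be sketched as follows. Coverage follows the definition quoted above; for redundancy we use one plausible reading (the fraction of a gene's annotated terms that are ancestors of another term annotated to the same gene), which is an assumption for illustration rather than the exact formulation of Defoin-Platel et al.

```python
def coverage(annotations, all_genes):
    """Proportion of genes with at least one GO term assigned.

    annotations: dict mapping gene ID to a set of GO terms.
    all_genes: list of all gene IDs in the genome.
    """
    return sum(1 for g in all_genes if annotations.get(g)) / len(all_genes)


def gene_redundancy(terms, ancestors):
    """Illustrative per-gene redundancy: the fraction of a gene's
    annotated terms that are ancestors of another term annotated to
    the same gene (i.e., terms that add no new information).
    """
    if not terms:
        return 0.0
    redundant = {t for t in terms
                 for other in terms
                 if other != t and t in ancestors.get(other, set())}
    return len(redundant) / len(terms)
```

Under this reading, an annotation set containing both a term and its ancestor (e.g. `GO:a` and its descendant `GO:b`) has redundancy 0.5, while a set of mutually unrelated terms has redundancy 0.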