Website Review: How to Get the Best From Fission Yeast Genome Data

Researchers increasingly depend on centralized resources to access the vast amount of information reported in the literature and generated by systematic sequencing and functional genomics projects. Biological databases have become everyday working tools for many researchers. This dependency goes both ways: the databases require continuous feedback from the research community to maintain accurate, reliable, and up-to-date information. The fission yeast Schizosaccharomyces pombe has recently been sequenced, setting the stage for the post-genome era of this popular model organism. Here, we provide an overview of relevant databases available, or being developed, together with a compilation of Internet resources containing useful information and tools for fission yeast.

The Schizosaccharomyces pombe genome sequence and a preliminary analysis have recently been reported [15], together with several articles celebrating this achievement [11,16,14]. This landmark will further establish and expand the role of fission yeast as a major experimental model organism. It will also increase the need for organized and continuously updated data repositories to allow online access to biological information for fission yeast and related data in other organisms. This paper provides a guide to available databases and other Internet resources relevant to fission yeast. We hope that colleagues will find this compilation helpful, whether they work with fission yeast or wish to access data on this model organism for computational or comparative analyses. We also describe how researchers can contribute to the development and contents of these resources, which is essential to provide accurate and current information for the community. Figure 1 shows the dataflow between the major databases and resources described in the text.

Repositories of genome sequence and annotation
1.1. The S. pombe genome project
http://www.sanger.ac.uk/Projects/S_pombe
The S. pombe genome project home page at the Sanger Institute will continue to maintain links to the following resources and primary datasets available to download by ftp:

1.2. Primary DNA and protein sequence databases

EMBL, GenBank and DDBJ
http://www.ebi.ac.uk/embl
http://www.ncbi.nlm.nih.gov/Genbank
http://www.ddbj.nig.ac.jp/
EMBL/GenBank/DDBJ is a collaboration of the primary nucleotide sequence databases. S. pombe genome project data and updates are submitted directly to EMBL. The three databases are synchronised on a daily basis, and the accession numbers are managed consistently. These databases are redundant and provide minimal error checking.

TrEMBL
http://www.ebi.ac.uk/swissprot
TrEMBL contains the automatically annotated translations of known and predicted coding sequences (CDS) present in the EMBL database that are not yet integrated into SWISS-PROT; it can be considered a preliminary section of SWISS-PROT. Entries are assigned SWISS-PROT accession numbers (e.g., P04551) but no identifiers (e.g., CDC2_SCHPO).
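
The difference between an accession number and an identifier can be checked mechanically. The following is a minimal Python sketch; the two regular expressions are assumptions inferred only from the examples given above (P04551 and CDC2_SCHPO), not an official format definition, and the function name is hypothetical.

```python
import re

# Assumed patterns, inferred from the two examples in the text only.
ACCESSION_RE = re.compile(r"[A-Z][0-9][A-Z0-9]{3}[0-9]")    # e.g. P04551
IDENTIFIER_RE = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")  # e.g. CDC2_SCHPO


def classify_label(label: str) -> str:
    """Return 'accession', 'identifier', or 'unknown' for a database label."""
    if ACCESSION_RE.fullmatch(label):
        return "accession"
    if IDENTIFIER_RE.fullmatch(label):
        return "identifier"
    return "unknown"
```

Under these assumptions, a label lacking the underscore-separated form would be treated as an accession number only, which is how a TrEMBL entry that has no identifier yet would appear.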

SWISS-PROT
http://www.ebi.ac.uk/swissprot
SWISS-PROT consists of curated, non-redundant sequence entries. It contains high-quality annotation and is cross-referenced to several other databases. A complete list of the S. pombe entries curated into SWISS-PROT is accessible at: http://expasy.ch/cgi-bin/lists?pombe.txt. SWISS-PROT release 40.0 contains 1842 curated S. pombe entries; the remaining 3672 entries are in TrEMBL and will be curated into SWISS-PROT with the removal of redundant entries.

Figure 1. Dataflow between the major databases and resources described in the text. Constant updates, new submissions, and feedback from specialized users are crucial to maintain accurate and up-to-date information in the databases (dotted arrows). Fission yeast specific tools are shown in bold, most of which are at an early stage of their development. Databases to pool functional genomic information from various organisms are also being developed and will be important to complement (and partially supersede) the currently scattered information (e.g., [2,4,8,12]).

PombePD
http://www.incyte.com/sequence/proteome/databases/PombePD.shtml
PombePD is a commercial database developed by Proteome Inc. with much initial input from the fission yeast community [7]. It is now part of the BioKnowledge library of Incyte Genomics. Despite previous promises to contributors [7], Incyte has recently started to charge yearly subscription fees, even for academic users. PombePD provides curated reports for each S. pombe protein and is integrated with databases of other organisms within the library. Weekly updates add new scientific content from the literature. In April 2002, 989 fission yeast proteins were listed as characterized by genetics or biochemistry, as reported in 2451 references.

InterPro
http://www.ebi.ac.uk/interpro/index.html
Protein sequence signature databases such as PROSITE, PRINTS, SMART, Pfam, ProDom, and TIGRFAMs are vital resources for identifying potential motifs and domains, particularly in novel sequences. InterPro (URL above) is a collaboration between these databases and provides an integrated resource of defined signatures and a facility for text and sequence-based searches [1]. In addition, all of the participating databases provide sequence search options from their individual websites (Pfam and TIGRFAMs also allow the adjustment of thresholds to enable the identification of less conserved domains). Protein signature searches were an integral part of the primary fission yeast annotation and are increasingly important as a resource for 'domain-driven' researchers.

Gene Ontology Consortium
GO
http://www.geneontology.org/
The Gene Ontology (GO) Consortium provides 'a dynamic controlled vocabulary that can be applied to any organism even as knowledge of gene and protein roles in cells is accumulating and changing' [5,6]. A common vocabulary to describe the attributes of gene products will facilitate consistent comparisons between organisms and will allow the automated querying of genes and proteins based on shared biology. It will also aid the interpretation of large datasets created by functional genomics projects [6]. The majority of eukaryotic genome projects already use the GO annotation system, and GO annotations are being incorporated into SWISS-PROT and GeneDB (see section 1.5).
Gene products are annotated using three GO ontologies: biological process, molecular function, and cellular component. Each ontology contains a set of well-defined terms with clearly described, specific relationships to each other. To represent biological reality accurately, the GO vocabularies are structured such that any term may have multiple parents as well as zero, one, or more children. A gene product may be annotated to a term at any level within the ontology. Because annotation to a term implies assigning its parents, a gene product can be retrieved from a search for the actual terms assigned to it, or for parent terms.
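
The retrieval behaviour described above, where annotation to a term implies annotation to all of its parents, can be sketched in Python as a walk over a directed acyclic graph. This is only an illustration: the term names (`root`, `term_a`, and so on), the gene names, and both function names are hypothetical placeholders, not real GO terms or part of any GO tool.

```python
from typing import Dict, Set

# Hypothetical toy ontology: each term maps to its set of parent terms.
# "term_c" has two parents, which the GO structure explicitly allows.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "term_a": {"root"},
    "term_b": {"root"},
    "term_c": {"term_a", "term_b"},
    "term_d": {"term_c"},
}

# Hypothetical direct annotations: gene product -> terms assigned to it.
ANNOTATIONS: Dict[str, Set[str]] = {
    "gene1": {"term_d"},
    "gene2": {"term_a"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_for_term(query: str,
                   annotations: Dict[str, Set[str]],
                   parents: Dict[str, Set[str]]) -> Set[str]:
    """Gene products annotated to `query` directly or via a more specific term."""
    hits: Set[str] = set()
    for gene, terms in annotations.items():
        if any(t == query or query in ancestors(t, parents) for t in terms):
            hits.add(gene)
    return hits
```

In this toy example, a search for `term_a` returns both `gene1` and `gene2`, although `gene1` was only annotated to the more specific `term_d`.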
GO is continually expanded and altered to reflect increasing biological knowledge. To facilitate this process, suggestions for new terms or alterations to existing ontologies can be submitted via the GO website above. The ontologies can be searched and browsed using a number of specially designed tools such as the AmiGO ontology browser at http://www.godatabase.org/cgi-bin/go.cgi. This tool also allows access to all gene products annotated to specific terms from all the participating databases. Each assignment to a GO term is attributed to a source, which may be a published paper, a database cross-reference, or a computational analysis, and indicates the type of evidence supporting the annotation. Evidence types include 'inferred from mutant phenotype' (abbreviated IMP), 'inferred from direct assay' (IDA) and others.
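
An assignment as described above combines a gene product, a term, a source, and an evidence type. A minimal Python sketch of such a record follows. The class, field, and function names are hypothetical and do not reflect any GO file format; only the two evidence codes named in the text are included, and the example records are placeholders rather than real annotations.

```python
from dataclasses import dataclass
from typing import List

# Only the two evidence types named in the text; GO defines others.
EVIDENCE_LABELS = {
    "IMP": "inferred from mutant phenotype",
    "IDA": "inferred from direct assay",
}


@dataclass(frozen=True)
class GOAnnotation:
    gene_product: str   # name of the annotated gene product
    term: str           # GO term the product is assigned to
    evidence_code: str  # abbreviated evidence type, e.g. "IMP"
    source: str         # paper, database cross-reference, or analysis


def with_evidence(annotations: List[GOAnnotation], code: str) -> List[GOAnnotation]:
    """Keep only the annotations supported by the given evidence code."""
    if code not in EVIDENCE_LABELS:
        raise ValueError(f"unknown evidence code: {code}")
    return [a for a in annotations if a.evidence_code == code]


# Placeholder records for illustration only.
EXAMPLES = [
    GOAnnotation("geneA", "example term 1", "IMP", "hypothetical publication"),
    GOAnnotation("geneB", "example term 2", "IDA", "hypothetical publication"),
]
```

Filtering by evidence code in this way is one simple means of separating experimentally supported assignments from other kinds.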

Fission yeast genome database
GeneDB
http://www.genedb.org/pombe
Database development. Fission yeast is one of the initial organisms funded for inclusion into the GeneDB genomics database being developed at the Wellcome Trust Sanger Institute. The GeneDB project will develop and maintain database resources to support sequence and annotation at both the DNA and protein level. It will also provide a repository for the storage of data derived from functional genomics projects (see section 3). Integration of various data with existing information will help to interpret data within the framework of the whole genome. Functionality for the annotation and curation of features and attributes of both DNA (e.g., genes, transcripts, exons, introns, UTRs, promoters, repeats) and proteins (e.g., functions, domains, interactions, phenotype) will be provided. The resource will also display the results of predictive software (e.g., signal sequences, transmembrane helices, domains). Sequence visualisation will be provided initially by map and contig views, and in the longer term by additional views (e.g., interaction, pathway). Extensive cross-references will allow retrieval of related information from external resources. Search tools, comprehensive data retrieval facilities, and a helpdesk will provide levels of access suitable for both novice and expert users. A prototype of GeneDB is now available which includes one-page reports for each protein-coding gene. These pages provide basic information, location details, predicted peptide properties, GO associations, domain information, database cross references, and sequence access. A BLAST server and browseable catalogues of annotated descriptions, GO associations, and Pfam domains are also available.
Fission yeast curation within GeneDB. Fission yeast annotations are updated on a daily basis to reflect new characterizations from EMBL/GenBank submissions, publications, and user feedback. The annotation currently provides basic descriptive information including known or predicted compartment, process and function, presence of domains, and similarity to budding yeast (closest homolog). At present, 3443 genes have some functional information attached, ~1300 from published data, and the remainder inferred from similarity.
Annotations have been manually curated to include domain descriptions using Pfam (see section 1.3; [3]). Pfam provides high coverage for fission yeast (more than 65%, which is higher than any other eukaryote), with a low incidence of false positives. Domain identification is also an ongoing process, and new domains are continually identified and included in the core annotation.
GO associations (see section 1.4) for S. pombe genes are currently created semi-automatically, by comparing the curated annotations to a set of curated keywords that are always associated with a particular GO term. As an example, Figure 2 shows a list of the terms from the 'cellular component' ontology to which the S. pombe Arp2/3 complex proteins have been assigned. All seven identified fission yeast Arp2/3 complex proteins are annotated as 'Arp2/3 actin-organizing complex', and this structured syntax is used to assign these genes to the 'Arp2/3 protein complex' term and its 'parent' terms shown in Figure 2. Similarly, these proteins are assigned to several GO terms under 'biological process', the most specific one being 'actin cytoskeleton organization and biogenesis'. More specific 'child' terms are available, including 'actin nucleation' and 'actin filament organization'. Fission yeast genes have not yet been assigned to these terms, so the higher-level category serves as a 'place holder' for later refinement of the associations. New fission yeast annotations use structured syntax wherever possible; this not only enables preliminary GO assignments to be automated, but also allows similar annotations to be grouped together and browsed in GeneDB.
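
The keyword-based assignment described above can be illustrated with a short Python sketch. It assumes that matching is a simple case-insensitive substring search, which is a simplification and not a description of the actual GeneDB procedure. The mapping holds only the Arp2/3 example from the text, and the function name and the example descriptions are hypothetical.

```python
from typing import Dict, List, Tuple

# Curated keyword -> (ontology, GO term) pairs. The single entry mirrors the
# Arp2/3 example in the text; a real keyword table would be far larger.
KEYWORD_TO_GO: Dict[str, List[Tuple[str, str]]] = {
    "Arp2/3 actin-organizing complex": [
        ("cellular component", "Arp2/3 protein complex"),
        ("biological process", "actin cytoskeleton organization and biogenesis"),
    ],
}


def assign_go_terms(description: str,
                    keyword_map: Dict[str, List[Tuple[str, str]]] = KEYWORD_TO_GO
                    ) -> List[Tuple[str, str]]:
    """Return the GO terms whose curated keyword occurs in the description.

    Assumption: case-insensitive substring matching stands in for whatever
    structured-syntax comparison the curators actually use.
    """
    assigned: List[Tuple[str, str]] = []
    lowered = description.lower()
    for keyword, terms in keyword_map.items():
        if keyword.lower() in lowered:
            for pair in terms:
                if pair not in assigned:
                    assigned.append(pair)
    return assigned
```

A description that uses the structured phrase receives both terms, while a free-text description with no curated keyword receives none. This is why consistent syntax in new annotations makes preliminary assignment easier to automate.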
For fission yeast genes, the annotations currently use only 130 of the 4747 available 'biological process' terms (9221 assignments), and 82 of the 5010 available 'cellular component' terms (4207 assignments), but many terms are not relevant to yeast. The next phase of the fission yeast annotation will involve the manual curation of GO assignments, with the addition of evidence codes and supporting citations (see section 1.4). No fission yeast genes have, as yet, been assigned to the 'molecular function' ontology, but this is also planned for the future. The long-term aim for fission yeast (as for other organisms: [10]) is to associate each characterised gene product with one or more GO terms.

GeneDB
http://www.genedb.org/genedb/pombe/curator.jsp
Updates can be submitted to GeneDB through the general update form at the URL above, or using the forms provided on the individual gene pages. Submission forms will be structured to simplify the submission of experimental data and supporting publications. Additional data, comments, and suggestions outside the scope of the submission forms can be submitted directly to the curator or the database developers and are actively encouraged.
Functional genomics data will also be incorporated, and submitters should contact the curator to discuss submission formats and data types to ensure rapid inclusion in GeneDB.

Resources for functional genomics
To increase the accessibility and comparability of post-genomic datasets within and between organisms, it will be important to develop central data resources similar to public sequence databases. Post-genomic data are typically much more complex than sequence data, but promising initiatives have been launched to set standards for recording and reporting microarray-based gene expression experiments [4]. For budding yeast, user-friendly resources to visualize and survey microarray and other functional genomics data have been established [2], [8],
