Methodology for Constructing Problem Definitions in Bioinformatics

Motivation: A recurrent criticism is that certain bioinformatics tools do not account for crucial biology and therefore fail to answer the targeted biological question. We posit that the single most important reason for such shortcomings is an inaccurate formulation of the computational problem. Results: Our paper describes how to define a bioinformatics problem so that it captures both the underlying biology and the computational constraints of a particular problem. The proposed model comprehensively delineates the biological problem and conducts an item-by-item bioinformatics transformation, resulting in a germane computational problem. This methodology not only facilitates interdisciplinary information flow but also accommodates emerging knowledge and technologies.


Introduction
A number of recent papers have identified 'open' problems in bioinformatics. From a computer science perspective, these problems have been classified broadly into those (i) related to the 'central dogma' (i.e. DNA to RNA to protein), (ii) related to data in general and (iii) simulating biological processes (Backofen and Gilbert, 2001). From a life science perspective, open bioinformatics questions are concrete questions, such as 'which structural RNAs are encoded in a genome?' (Eisenberg et al. 2006; Goodman, 2002; Yu et al. 2004). Yet, there is a fundamental difference between the bioinformatics problems described above and the aim of this paper, which proposes a systematic procedure for constructing definitions of such problems.
For the most part, 'how-to' practices in bioinformatics address the application of software engineering and database management principles to computational issues. For example, Parker et al. (2003) proposed comprehensive management of information flow for large-scale genome projects through system-wide management of metadata and data dependencies across both biological and computational processes. This led to the implementation of numerous integrated systems, commonly referred to as pipelines or workflows (e.g. Garcia Castro et al. 2005). Indeed, these efforts made significant contributions to bioinformatics 'in-the-large'. Still lacking, however, are how-to practices for bioinformatics problems 'in-the-small'.
We propose a methodology for formulating bioinformatics problems by defining the cognate life science and computational problems in an explicit and integrated fashion. The goal of this methodology is to guide development of bioinformatics tools that account for critical biology. Our work differs from what is typically published in the field of bioinformatics: we focus on methods for formulating a problem rather than for solving an already formulated problem.

Methodology
Our procedure has three components: a biological model, a bioinformatics transformation and a computational model (Table 1). The biological model specifies a question of interest and defines the biological problem in a way that captures the breadth of the phenomenon. The bioinformatics transformation translates biological features and criteria into a set of computational rules, which, as a whole, circumscribe the biological problem. Finally, the computational model reformulates the problem mathematically by incorporating rules derived in the transformation and describes a computational approach to the initial biological question.
As an example of our methodology, we pose the biological question 'which tRNAs are encoded in a genome?' Information on tRNAs is available in the supplemental materials and in the literature.

Biological model
A biological model consists of a concise 'biological question' and a comprehensive description of the 'biological knowledge.' Both portions are critical to the entire bioinformatics model. The biological question is usually straightforward; for instance, the case example used here seeks to detect tRNA genes (i.e. a known phenomenon) in genomic sequences. In contrast, formulating the knowledge portion is more difficult, requiring a comprehensive, taxonomically broad review of the life science literature and other available resources. As we detail below, the knowledge portion has three elements: (i) an abstract, 'global' definition of the phenomenon together with a textual exposé of observed scenarios, (ii) a comprehensive dataset of observed instances, and (iii) a description of yet unobserved, conceivable scenarios. Obviously, the state of knowledge about a particular biological problem determines how the question is formulated and how the biological phenomenon is described (see, for example, the early work on tRNA sequence and structure (Holley et al. 1965; Levitt, 1969)).
A sample biological model for tRNA gene identification is available in the supplemental Table S1. For the sake of simplicity, the model does not include more advanced criteria such as minimum free energy.

Definition and observed variation of a biological phenomenon
One first provides a brief, abstract 'textbook' definition that views the phenomenon in a larger biological context, together with a typical scenario. Then, one should describe the extent and frequency of biological diversity, both within an organism and across taxa, indicating both significant and minor differences. For tRNA gene identification, this may read as follows: Transfer RNA molecules have two significantly different types of secondary structures, the cloverleaf and the two-arm. The more common structure consists of four stems (or three arms) in the form of a cloverleaf (Fig. S1). Yet, an unusual three-stem (or two-arm) type is common in certain animal lineages (Okimoto and Wolstenholme, 1990, Fig. S2). Notably, both types of tRNAs fold into a similar L-shaped tertiary structure (Fig. S3).… A minor difference observed is the size of the D-arm loop (D-loop), which typically is 8 nt long but can be up to 10 nt long.
The description of the phenomenon should be comprehensive yet constrained to relevant features. For example, if the question involves identification of tRNA genes in genomic sequences, introns are relevant, but not so if the question solely addresses tRNA secondary structure. Similarly, mapping of a codon to an amino acid (the genetic code) is irrelevant for defining tRNA secondary structures, but relevant for identifying tRNA function.

Compilation of a comprehensive dataset
Concrete instances of a phenomenon make a description explicit and tangible. A compilation should span the breadth of taxonomic diversity and should contain a sampling of frequent occurrences as well as all known instances of rare and unique ones. As we discuss later, such a collection will be crucial for benchmarking bioinformatics tools.

Description of conceivable scenarios
Life scientists continually uncover novel occurrences of a given biological phenomenon, and not infrequently, these novelties fall within the expected range of diversity. To accommodate future discoveries within the framework of the biological model, one may include knowledge about biological structures and mechanisms that enable extrapolation of unobserved scenarios. Knowledge may be inferred from the same system, or from other systems or even other disciplines.
For example, we know that a discontiguous molecule can assume the same structure or function as one that is contiguous. Thus, we can extrapolate that a tRNA gene could be encoded by multiple pieces, which are transcribed independently and join post-transcriptionally to form a functional structure. In fact, RNAs 'in pieces' have been documented for ribosomal RNA (rRNA) of mitochondria from several eukaryotic groups and of bacteria, and rare examples of discontiguous tRNA genes have been reported as well (Randau et al. 2005; Soma et al. 2007).
A comprehensive biological model, as illustrated above, leads directly into the biological criteria of the next component, the bioinformatics transformation phase.

Bioinformatics transformation
This component converts a biological description into a set of mathematical formulas. The translation process involves conversion of the biological knowledge description into biological criteria and transformation of these criteria into computational rules. A sample bioinformatics transformation for tRNA gene searches is available in the supplemental Table S2; note that all sample criteria below are excerpts from this table.

Biological criteria
This first step converts the biological description into simple, concise verbal statements, termed biological criteria (BCs). A separate criterion is formulated for each characteristic of a feature, such as the length variation allowed for the stem of the D-arm (D-stem) of a tRNA, e.g.

The D-stem length is 3 or 4 nt.
In addition to this list of criteria, it is useful to group related statements (e.g. particular features or major variants) in order to add meaning and to aid organization. See Table S2 (section BC) for criteria for searching tRNA genes in genomic sequences.
After conversion to BCs, the statements should be checked for ambiguity. For instance, the criterion above indicates that the D-stem in tRNAs can vary in length. Yet, this statement is ambiguous because it leaves open which pairings make up a stem: only Watson-Crick pairings, or also interactions such as G-U and G-G. To resolve this ambiguity, a new criterion that defines permissible pairings must be added:

Allowable nucleotide pairs: A-U, C-G and G-U.
More generally, clarification of the intended biological meaning may involve adding new criteria or enhancing already formulated ones.
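As an illustration of how explicit the two sample criteria above become once written down, the following minimal sketch encodes them as a check. All names (`ALLOWED_PAIRS`, `D_STEM_LENGTHS`, `is_valid_d_stem`) are hypothetical, and the sketch assumes that pairings are symmetric and that the stem contains no bulges; neither assumption is stated in the criteria themselves.

```python
# Hypothetical encoding of two sample biological criteria (illustration only).
# BC: "Allowable nucleotide pairs: A-U, C-G and G-U" (assumed to be symmetric).
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("C", "G"), ("G", "C"),
    ("G", "U"), ("U", "G"),
}
# BC: "The D-stem length is 3 or 4 nt."
D_STEM_LENGTHS = (3, 4)


def can_form_stem(five_prime: str, three_prime: str) -> bool:
    """Check position-by-position pairing of two strands of equal length.

    Both strands are written 5'->3'; the second one is reversed because
    the two sides of a stem are antiparallel. Bulges are not modeled.
    """
    if len(five_prime) != len(three_prime):
        return False
    return all(
        (a, b) in ALLOWED_PAIRS
        for a, b in zip(five_prime, reversed(three_prime))
    )


def is_valid_d_stem(five_prime: str, three_prime: str) -> bool:
    """Combine the length criterion with the pairing criterion."""
    return len(five_prime) in D_STEM_LENGTHS and can_form_stem(five_prime, three_prime)
```

Under these assumptions, `is_valid_d_stem("GCUC", "GGGC")` accepts a stem containing a U-G pair, which a check restricted to Watson-Crick pairs would reject.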

Computational rules
This step is the most crucial one of the bioinformatics transformation. Here, the BCs described above are converted into mathematical formulas, called computational rules (CRs). Generally, a single BC leads directly to a single CR, such as the one-to-one mappings for nucleotide pairings and for D-stem length variation (see Table S2). However, occasionally it may be necessary to combine several BCs to form a single CR (a many-to-one mapping). For example, a CR describing the length of a D-arm combines the following five BCs into a single CR (see Table S2):

The D-arm forms a hairpin closed by a stem (pos. 10 to 25).
The D-loop length is 8 to 10 nt. If positions 13 and 22 do not pair, it increases to 7 to 11 nt.
The … 15, 16, 17, 17a, 18, 19, 20, 20a, 20b, 21. Optional positions: 17a, 20a and 20b.

Alternatively, a single BC may be utilized by several CRs (a one-to-many mapping), such as permissibility of stem bulges used by each of the four stems (see Table S2; BC 3.1 is used by CRs 1.5, 1.6 and 1.9). An important feature of the conversion step is the explicit mapping of BC to CR, which acts as a conduit that shuttles knowledge through the model. As laid out in the discussion, this mapping facilitates two important tasks: updating a bioinformatics model to accommodate new biological discoveries and assessing the biological capabilities of tools.
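The explicit BC-to-CR mapping can be kept as a small, queryable data structure. The sketch below is one possible representation, not the authors' implementation. Only the link between BC 3.1 and CRs 1.5, 1.6 and 1.9 is taken from the text; the entry `CR X` and its criteria are hypothetical placeholders for a many-to-one mapping, and the lists are not meant to be complete.

```python
from collections import defaultdict

# CR -> BCs it is derived from. Lists are illustrative and incomplete.
CR_TO_BCS = {
    "CR 1.5": ["BC 3.1"],          # from the text: BC 3.1 is used by CR 1.5
    "CR 1.6": ["BC 3.1"],          # ... by CR 1.6
    "CR 1.9": ["BC 3.1"],          # ... and by CR 1.9 (one-to-many)
    "CR X": ["BC a", "BC b"],      # hypothetical many-to-one example
}


def invert_mapping(cr_to_bcs):
    """Build the reverse index: BC -> sorted list of CRs that rely on it."""
    index = defaultdict(list)
    for cr, bcs in cr_to_bcs.items():
        for bc in bcs:
            index[bc].append(cr)
    return {bc: sorted(crs) for bc, crs in index.items()}


def affected_rules(changed_bc, cr_to_bcs):
    """List the CRs that must be revisited when a criterion changes."""
    return invert_mapping(cr_to_bcs).get(changed_bc, [])
```

With such an index, a new biological finding that alters BC 3.1 immediately points to the three rules that need review, which is the traceability the mapping is meant to provide.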
Some of the CRs represent the core of the problem, whereas others represent peripheral details. It is important to identify and mark as 'key' those features that are essential for the bioinformatics model. For the tRNA example, the basic cloverleaf and two-arm structures are critical features of the biological phenomenon. The search for intron-less tRNA genes may be central to a particular biological question. Finally, the T-arm and D-arm consensus sequences are essential for effective computational analysis. Rules marked as crucial receive special attention, not only during construction of the computational model but also later, during software development, when infeasibility causes modification of the problem and removal of required rules (requirements).

Computational model
The third phase of defining bioinformatics problems consists of reformulating the biological problem into a purely computational one. Unlike the two previous components, this phase does not devise a specific procedure, as techniques for defining a computational problem (model) are well established. Instead, we focus on what to include in such a definition.
First, a global problem definition should re-state the biological question as a computational problem. For example, Problem: Given a DNA sequence, S, locate all genes, G, capable of forming a functional tRNA structure.
Second, we define a set of smaller problems (problem-set) that together describe a general approach satisfying the global problem. Each (smaller) problem should define a specific task and should state explicitly the CRs that apply to this particular problem. In addition, even the most trivial assumptions required by a problem should be stated explicitly in the definition. For example, assuming a four-nucleotide alphabet (for RNA sequences) lends itself to a highly efficient, bit-based computational approach. If the alphabet size were to increase, this approach would be less effective and the problem itself might require revision. Third, each CR must be stated as a requirement for at least one (smaller) problem within a set. More challenging is determining all problems that rely on a given CR.
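The third requirement, that every CR be stated for at least one problem in the set, lends itself to a mechanical check. The following sketch assumes a problem-set is represented as a mapping from problem name to the CRs it declares; the problem names and the rule `CR 2.1` are hypothetical.

```python
def uncovered_rules(all_crs, problem_set):
    """Return the CRs that no problem in the problem-set lists as a requirement.

    all_crs:     iterable of rule identifiers from the transformation.
    problem_set: dict mapping a problem name to the rules it declares.
    """
    required = set().union(*problem_set.values())
    return sorted(set(all_crs) - required)


# Hypothetical problem-set for illustration; rule identifiers are placeholders.
ALL_CRS = ["CR 1.5", "CR 1.6", "CR 1.9", "CR 2.1"]
PROBLEM_SET = {
    "locate D-arm": ["CR 1.5"],
    "locate AC-arm": ["CR 1.6"],
    "locate T-arm": ["CR 1.9"],
}
```

Here `uncovered_rules(ALL_CRS, PROBLEM_SET)` reports `CR 2.1`, signalling a rule that the computational model has silently dropped. The check covers only the "at least one problem" condition; deciding whether a rule is needed by further problems still requires judgment.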
Obviously, more than one computational approach can address the same global problem; hence, alternative problem-sets can be formulated. For example, identifying tRNA genes using a machine learning approach may subdivide the problem in a manner that differs greatly from a purely deterministic, algorithmic approach. We recommend specification of all these alternatives, as they facilitate development and comparison of software that uses alternative computational techniques, be it hidden Markov models (Rabiner, 1989), Bayesian networks (Pearl, 1988), or rule-based systems (e.g. Snyder and Stormo, 1993).

Discussion
A bioinformatics model that is constructed according to the proposed methodology captures a problem in its entirety. This is achieved by specifying three separate yet inter-connected modules: a comprehensive biological description of the problem, a computational definition of the problem and an explicit transformation from one to the other.
Models constructed with our procedure provide a solid foundation for development, testing and comparison of analytical bioinformatics tools. Software development can focus fully on efficiency because the model ensures correct translation of the biology into a computational problem. Tools can be tested more easily, since comprehensive positive test data are readily available in the biological model. Once tools are available based on the same model, they can be compared to determine adherence to the model, performance against a benchmark dataset as well as time and space efficiency; all of these facilitate selection of the 'best' tool for a given analytical task.

Information management and software engineering
Systematic procedures and information management principles are not new in bioinformatics. Informatics-leaning bioinformaticians have been applying these strategies for many years through explicit management of rules ('requirements'), tracking of relationships between rules as well as data ('dependencies') and clear definitions of the extent of the problem ('scope'). Currently though, such principles are applied predominantly to the informatics realm of bioinformatics rather than to bioinformatics as a whole, spanning both the informatics and biology realms.

Interdisciplinary communication
For any interdisciplinary science, effective communication is a challenge. Bioinformatics has to deal with differences inherent to the life and computational sciences in terms of basic notions, ways of reasoning and scientific language. These differences present a substantial barrier to both comprehending and explaining ideas. Less obvious is a fundamental difference in conveying information. For instance, life science aims at extracting common patterns from the full breadth of natural diversity. In contrast, computer science aims at bounding a problem by defining assumptions, rules and constraints. Consequently, each side reduces the breadth of the problem through either generalizations or bounds. These reductions must be communicated.
The advantage of our methodology is that it facilitates interdisciplinary communication. First, the scientific language of a particular discipline is used to ensure accurate biological and computational models while translation from one model into the other occurs in a separate step. This provides explicit connections between the two models, connections that link specific biological criteria with specific computational constraints. As a consequence, tracing back a criterion/constraint from one model to the other becomes an easy task.

Changes in knowledge and technology
Biological sciences are about discovering new features of Life. Therefore, bioinformatics tools and resources need to incorporate new knowledge continually. For example, an early definition of a gene regarded it as a contiguous region on a chromosome that specifies an RNA or protein product. The subsequent discovery of alternative splicing and trans-splicing fundamentally changed the assumptions underlying gene-finding tools.
Our methodology anticipates both incorporation of new knowledge and application of new technology. Advances in the life sciences can be accommodated because the model records relevant biological facts in a systematic fashion and specifies how they are interconnected with computational rules. Similarly, new computational technology can be accommodated because the model defines the scope and requirements for tool development. At most, a new problem-set needs to be added to the computational model. Construction of this new problem-set is simplified substantially by the computational rules contained in the transformation module.

Putting this proposal into practice
To allow cooperative model formulation and tool development, models should be openly available (preferably web-accessible). To accommodate new science, they should be easily expandable (e.g. managed in a database). Finally, new models need to reuse components of existing models (e.g. the description and the translated rules for nucleotide pairing). This not only reduces scientific effort and speeds up model construction but also retains a consistent scientific representation across different bioinformatics models.

Conclusion
Bioinformatics needs standards and methodologies that span its entire breadth from biology to informatics. Establishment of systematic procedures is necessary to transform the 'art' of defining bioinformatics problems into a science.

Authors' Contributions
AH conceived and refined the methodology. GB critiqued the methodology and made significant contributions both to the biological model and to comprehension of the methodology by life scientists. Both authors drafted the manuscript.

Computational model: problematic simplifications
Certain computational practices imprecisely simplify bioinformatics problems. For example, an algorithm used to solve a previous problem is re-used to address a new problem, or alternatively, a problem is well-formulated but does not address known exceptions. More generally, three levels of simplification can negatively affect bioinformatics analysis and therefore should be utilized with extra caution.

Exclusion of computational rules (CR)
One obvious type of simplification is to disregard certain biological scenarios. For example, when searching for tRNA genes, one may ignore all rules associated with the two-arm tRNA structure and focus only on the cloverleaf shape. Omission of CRs for biological exceptions can significantly simplify both the global task and individual problems in the problem-set, but it also can significantly change the analytical capabilities of the tool.

Partial omission of CR
Certain simplifications represent an oversight during construction of the computational model. For instance, an individual problem in a problem-set may locate the tRNA anticodon (AC) arm without accounting for introns in the stem, while another problem may explicitly look for introns. However, if the intron search is performed after AC-stem identification, then genes with an intron in the AC-stem will not be located. Here, intron analysis occurs in a separate (smaller) problem and yet, the scenario where introns occur in the AC-stem is not analyzed correctly. Thus, to resolve this case, one would need either to conduct the intron search prior to stem identification or to add a CR for introns to the stem identification step. This example illustrates that some CRs may be essential to more than one individual problem and that one must ensure that each CR serves as a criterion for every relevant problem.

Inaccurate computational assumptions
Inaccurate, implicit assumptions can change the technical nature of a computational problem in the same manner as omitting CRs. For example, suppose that stem-permissible base pairings were not specified as a rule during the bioinformatics transformation phase (i.e. Table S2: BC 3.2). An inaccurate assumption could be that only Watson-Crick (WC) nucleotide interactions are valid pairings in a stem. Then, finding two pairing sequences capable of forming a stem is equivalent to identifying inverted repeats, since inverted repeats can pair to form stems. Yet, stems containing non-WC interactions will not be identified. To uncover such a fundamental error requires a good understanding of both the computational and biological implications.
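The effect of the WC-only assumption can be made concrete with a short sketch. The function and constant names below are hypothetical, the example strands are invented for illustration, and bulges are not modeled. The extended pairing set adds G-U, as in the sample criterion for allowable nucleotide pairs.

```python
# Pairing sets written as symmetric (5' base, 3' base) tuples.
WC_ONLY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
WC_PLUS_GU = WC_ONLY | {("G", "U"), ("U", "G")}


def forms_stem(five_prime, three_prime, allowed):
    """True if the antiparallel strands pair at every position under `allowed`."""
    if len(five_prime) != len(three_prime):
        return False
    return all(
        (a, b) in allowed
        for a, b in zip(five_prime, reversed(three_prime))
    )


# Invented example: the strands pair as G-U, G-C, U-A.
LEFT, RIGHT = "GGU", "ACU"
missed_by_wc_only = (
    not forms_stem(LEFT, RIGHT, WC_ONLY) and forms_stem(LEFT, RIGHT, WC_PLUS_GU)
)
```

In this example `missed_by_wc_only` is true: the strands are not inverted repeats, so an inverted-repeat search never reports the stem, although it satisfies the stated pairing criterion.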

Impact
Explicit or implicit omission of CRs can lead to such a fundamental change that the biological question is no longer addressed appropriately. In the previous example, without explicit allowance of non-WC pairs, one assumes that stems contain only WC pairs. This assumption creates a trivial computational problem where the sequence on one side of the stem can pair with only one other sequence, a problem solved by many existing algorithms (e.g. string matching algorithms). On the other hand, if non-WC pairs are permissible, then the sequence on one side has numerous potential sequence pairings for the other side, F(n). This is stated mathematically as follows.
Given a sequence S of length n, the number of sequences having permissible pairings at every position in S is $F(n) = p^{w} \times q^{x} \times r^{y} \times s^{z}$, where w is the number of positions in S having an A, x is the number having C, y is the number having G, z is the number having T/U and $w + x + y + z = n$, and where p is the number of permissible pairing interactions with nucleotide A, q is the number with C, r is the number with G and s is the number with T/U. Exclusive use of WC pairs means that p through s each have a value of one (a one-to-one mapping of A with T/U and C with G) and that the number of sequences, F(n), pairing to S is one. Further interactions in addition to WC pairs mean that one or more of p through s will have a value greater than one. For instance, if G can interact with C, G and U, then r has a value of three and the number of matching sequences increases at an exponential rate, $r^{y}$.
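As a worked illustration of F(n), the sketch below counts the partner sequences for a given strand. The function name and the example sequence are hypothetical; the two partner tables encode the WC-only case and the case from the text in which G can interact with C, G and U.

```python
from collections import Counter
from math import prod


def count_partner_sequences(seq, partners):
    """Compute F(n): the product over nucleotides of (number of partners) ** (count).

    seq:      RNA sequence over A, C, G, U.
    partners: dict mapping each nucleotide to the set of bases it may pair with.
    """
    counts = Counter(seq)
    return prod(len(partners[nt]) ** k for nt, k in counts.items())


# WC-only: p = q = r = s = 1.
WC_PARTNERS = {"A": {"U"}, "C": {"G"}, "G": {"C"}, "U": {"A"}}
# Example from the text: G interacts with C, G and U, so r = 3.
G_THREE_PARTNERS = {"A": {"U"}, "C": {"G"}, "G": {"C", "G", "U"}, "U": {"A"}}

SEQ = "GGAG"  # invented example: w = 1, x = 0, y = 3, z = 0
f_wc = count_partner_sequences(SEQ, WC_PARTNERS)
f_ext = count_partner_sequences(SEQ, G_THREE_PARTNERS)
```

For this sequence `f_wc` is 1 and `f_ext` is 3^3 = 27, showing how a single added interaction multiplies the search space with every G in the strand.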
Obviously, a minor simplification can have a devastating effect on the analytical results.
Less obviously, simplifying also can lead to poor selection of an algorithm, one which cannot easily be extended to accommodate omitted CRs. Thus, details which may seem minor to life scientists are crucial to framing the computational problem. This is clearly the case with specification of stem-permissible base pairings, where a WC-only assumption not only increases the rate of false negatives but also leads to choosing an "efficient" algorithm which is incapable of handling non-WC pairings.
By this example, it may seem that all simplifications have an adverse effect, but this is not the case. Rather, one should be aware of the pitfalls inherent in common computational practices and take appropriate precautions during development and use of the computational model. In conclusion, it is important that the approach presented in the computational model spells out the algorithmic tasks in a manner that not only satisfies the computational requirements but also correctly answers the original biological question.

Table S1. Sample biological model for tRNA gene identification.

Biological model
Question: Which tRNAs are encoded in a genome?
Relevant knowledge on tRNAs (description of gene, gene product and function)
1. Brief definition
"Transfer RNA (tRNA) … is a small RNA molecule (70-90 nucleotides). The tRNAs, by binding at one end to a specific codon in the mRNA and at their other end to the amino acid specified by that codon, enable amino acids to line up according to the sequence of nucleotides in the mRNA. Each tRNA is designed to carry only one of the 20 amino acids …. Each of the 20 amino acids has at least one type of tRNA assigned to it, and most have several tRNAs. Before an amino acid is incorporated into a protein chain, it is attached by its carboxyl end to the 3' end of … a tRNA containing the correct anticodon-the sequence of three nucleotides that is complementary to the three-nucleotide codon that specifies that amino acid on an mRNA molecule. Codon-anticodon pairings enable each amino acid to be inserted into a growing protein chain according to the dictates of the sequence of nucleotides in the mRNA, thereby allowing the genetic code to be used to translate nucleotide sequences into protein sequences." (Alberts et al. 1994).
2. Description
2.1 Observed tRNAs (gene product) and genes
2.1.1 tRNA structure
"tRNAs can form the loops and base-paired stems of a cloverleaf structure, and all are thought to fold further to adopt the L-shaped conformation" (Alberts et al. 1994).
The cloverleaf structure is composed of three arms (D, anticodon (AC) and T) and a highly variable (V) loop between the AC- and T-arms, enclosed by the aminoacyl (AA) stem (Fig. 1). Generally, the D-stem forms four base pairings but a stem of three is possible. Likewise, the D-loop is typically 8 nt long but may expand to 9 or 10 nt. Overlapping the D-arm is one of two promoters recognized by transcription factor TFIIIC and having a conserved sequence of 5'-GTGGCNNAGT-3'. A major alternative to the cloverleaf structure is composed of two instead of three arms, lacking either the D- or the T-arm (Fig. 2). Often, the V-loop expands and establishes extra stabilizing interactions. Such tRNAs lacking entire domains have been documented in certain animal lineages.
In Archaea, the 'strictly invariant' nucleotide U at position 8 is replaced by a C.

tRNA genes
The sequence of the tRNA (gene product) differs from that of its gene. The transcribed sequence is subjected to various processes, which effectively change it relative to the sequence of the gene. Most common are post-transcriptional nucleotide modifications. For example, the T-loop contains the modified base pseudouridine (Ψ), which is encoded in the gene as T. In addition, the CCA tail at the 3' end of tRNAs is added post-transcriptionally. Less common are changes incurred by RNA editing, by which nucleotides are replaced, inserted or deleted. For example, mis-pairings in the AA-arm portion of the gene between 1-72, 2-71, and 3-70 are corrected post-transcriptionally by RNA editing (Bullerwell and Gray, 2005, and references therein).

Sample instances
Below is a list of sequences representative of tRNA genes. This list is sufficiently broad to serve as a benchmark that measures the effectiveness of tRNA identification software. Journal references are provided where appropriate.
GenBank Acc. No. DQ256197, positions 78-145 and references therein.

etc.
A compilation of tRNA genes identified by tRNAscan-SE (Lowe and Eddy, 1997) in complete or nearly complete genomes is available at http://lowelab.ucsc.edu/GtRNAdb. Additional tRNA resources are available at http://www.uni-bayreuth.de/departments/biochemie/trna.

Conceivable genes
The unifying structure is the L-shaped tertiary structure required to perform its translational function (Fig. 3). Nucleotides are also important for processing amino-acylation, binding of initiation and elongation factors, etc. The constraints on the shape are …. It is conceivable that the gene is encoded by multiple gene pieces that are transcribed independently, similar to ribosomal RNA. Compensation for the missing stem is provided by a larger V-loop and an increase in non-stem, nucleotide interactions.
… 15, 16, 17, 17a, 18, 19, 20, 20a, 20b, 21. Optional positions: 17a, 20a and 20b.
4.6. Detailed nucleotide and base-pairing distributions are available (see Tables 1 and 2 in Marck and Grosjean, 2002).
5. Conserved sequences
5.1. In eukaryotes, the conserved sequence, 5'-GTGGCNNAGT-3', is found at position 8 (Sharp et al. 1981).

Figure S2. The two-arm tRNA-Arg molecule in Caenorhabditis elegans.