  • Review Article

Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes

Key Points

  • The identification of evolutionarily constrained sequences is an unbiased approach for finding functional sequences in genomes. However, their identification is strongly affected by upstream analyses.

  • Genomes are sequenced to different levels of finishing, which affects downstream comparative analyses.

  • Before genomic sequences can be aligned, segments of homologous collinearity must be identified. Errors at this stage can have a dramatic effect on the identification of constrained sequences.

  • Base-pair sequence-alignment programs differentially handle the complexities of evolution, such as insertions/deletions and duplications. In addition, new approaches to identify regions of alignment uncertainty can be used.

  • Current approaches that utilize evolutionary sequence constraint focus on regions that are deeply constrained. Newer approaches combined with sequences from more species can now be pursued to identify weakly and/or lineage-specific constrained sequences.

  • The amount of detectable constrained sequence depends on the phylogenetic scope being pursued as well as the resolution and intensity of the desired detectable constraint.

  • Large collaborative projects such as ENCODE are shedding light on the correlation between sequence constraint and sequence function. Additional methods are also available for determining the biological significance of constrained sequences.

Abstract

The comparison of genomic sequences is now a common approach to identifying and characterizing functional regions in vertebrate genomes. However, for theoretical reasons and because of practical issues, the generation of these data sets is non-trivial and can have many pitfalls. We are currently seeing an explosion of comparative sequence data, the benefits and limitations of which need to be disseminated to the scientific community. This Review provides a critical overview of the different types of sequence data that are available for analysis and of contemporary comparative sequence analysis methods, highlighting both their strengths and limitations. Approaches to determining the biological significance of constrained sequence are also explored.

Figure 1: Overview of comparative sequence analysis.
Figure 2: Vertebrate genomic sequence data.
Figure 3: Challenges in the reconstruction of homologous collinear regions.
Figure 4: Relationship between sequence length and the quantity of constrained sequence detected.

Author information

Corresponding author

Correspondence to Elliott H. Margulies.

Related links

FURTHER INFORMATION

Elliott Margulies's homepage

Ewan Birney's homepage

ENCODE Project

ENSEMBL

Galaxy

GRIMM

Mercator

NIH Intramural Sequencing Center

PECAN

UCSC Genome Bioinformatics

VISTA

Glossary

Purifying selection

The evolutionary process of rejecting substitutions in functional DNA, thereby making such sequences more similar when compared among different species.

Orthologous

Homologous sequences in different species that arose from a speciation event.

Homologous

Sequences that have a common ancestor but might be related by either speciation or duplication events in a genome. Pragmatically, homology is detected by the presence of similarity between two sequences.

Contig

Contiguous piece of DNA that is assembled from shorter overlapping sequence reads.

Whole-genome shotgun

The process of shearing the DNA from an entire genome into smaller pieces that are randomly (or 'shotgun') sequenced en masse.

Genome coverage

The total number of bases that are sequenced divided by the genome size. Actual coverage differs from this nominal value, following the statistical properties of a Poisson distribution, because reads are sequenced at random.
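As a rough illustration of the Poisson argument, the sketch below assumes idealized reads placed uniformly at random, in which case the expected fraction of bases covered at least once at a nominal coverage c is 1 - e^(-c). The function names and the example genome size are hypothetical and chosen only for illustration.

```python
import math


def nominal_coverage(total_bases_sequenced, genome_size):
    """Nominal coverage: total sequenced bases divided by genome size."""
    return total_bases_sequenced / genome_size


def expected_fraction_covered(coverage):
    """Expected fraction of bases hit by at least one read.

    Assumes read starts are uniformly random, so the number of reads
    covering a base is approximately Poisson with mean `coverage`.
    """
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    genome_size = 3_000_000_000  # illustrative size, not taken from the text
    for total in (3e9, 6e9, 21e9):
        c = nominal_coverage(total, genome_size)
        frac = expected_fraction_covered(c)
        print(f"{c:.0f}x nominal coverage -> {frac:.4f} of bases covered")
```

Under these assumptions, low-redundancy sequencing leaves a substantial fraction of bases unsampled, which is one reason why the level of finishing affects downstream comparative analyses.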

Segmental duplications

Regions (or segments) of a genome that evolved from a single common ancestor and arose from a duplication event. As such, these sequences within the same genome are paralogous to each other.

Pseudogene

A presumably non-functional region of DNA with homology to an actual gene. Pseudogenes typically arise from the reincorporation of an RNA intermediate into the genomic sequence.

Paralogous

The homology between two genomic segments that arose from a duplication event.

Heuristic methods

For large compute problems, the application of workable but not formally correct solutions to help reduce the computational time. In the case of sequence alignment, common heuristic methods include the progressive alignment of closer sequences to each other first before aligning to more distant species, and the use of highly similar anchoring sequences to reduce the search space in the alignment.

Compute farms

Large groups of computers, each on their own only able to analyse a small piece of data (similar to a typical desktop PC), but which, when combined together, provide a powerful resource for analysing computationally intense problems.

Dynamic programming

An algorithm that is efficiently designed to analyse data, usually by elegantly breaking the computational problem down into smaller, simpler sub-problems.
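To make the idea concrete, here is a minimal sketch of a global pairwise alignment score computed by dynamic programming, in the style of the classic Needleman-Wunsch recurrence. The scoring values (match, mismatch, gap) are arbitrary assumptions, and this is not the method of any particular alignment program discussed in the Review.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b.

    score[i][j] holds the best score for aligning a[:i] with b[:j];
    each cell is built from three smaller, already solved sub-problems.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per base.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so time and memory grow with the product of the sequence lengths. That cost is why genome-scale aligners combine this kind of recurrence with heuristics such as anchoring.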

Suffix tree

An indexing technique to efficiently store all sub-sequences of a string of letters.

Hidden Markov model

Mathematical concept that describes a finite set of 'states' and a probabilistic model for transitioning from one state to another.
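A toy example may help. The sketch below defines two hypothetical states, 'neutral' and 'constrained', and uses the Viterbi algorithm to find the most probable state path for a string of alignment columns coded as 'M' (identical across species) or 'X' (not identical). All probabilities are invented for illustration and are not taken from any published constraint-detection method.

```python
import math

STATES = ("neutral", "constrained")
# Hypothetical parameters, chosen only to make the example readable.
START = {"neutral": 0.5, "constrained": 0.5}
TRANS = {
    "neutral": {"neutral": 0.9, "constrained": 0.1},
    "constrained": {"neutral": 0.1, "constrained": 0.9},
}
EMIT = {
    "neutral": {"M": 0.5, "X": 0.5},
    "constrained": {"M": 0.9, "X": 0.1},
}


def viterbi(observations):
    """Return the most probable state path for a string of 'M'/'X' symbols."""
    if not observations:
        return []
    # Log-probability of the best path ending in each state.
    best = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (
                best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            )
        back.append(pointers)
        best = new_best

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("XXXXMMMMMMMMXXXX"))
```

The high probability of remaining in the same state is what makes the model favour contiguous runs of 'constrained' labels over isolated single-column calls.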

Ancestral repeat

Relics of transposable elements that inserted before a speciation event and are therefore orthologous and presumed to be non-functional. Therefore, these regions are largely thought to be neutrally evolving.

Eutherian radiation

Approximately 80 million years ago a large diversity of mammalian species began to evolve. These placental mammals provide a rich resource for identifying constrained sequences.

Four-fold degenerate sites

Third positions of codons for which any base yields the same amino acid.
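The definition can be checked directly against the standard genetic code. The sketch below builds the standard codon table from its usual compact encoding (bases in the order T, C, A, G) and lists the two-base prefixes for which all four third-position bases encode the same amino acid. The helper names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def fourfold_prefixes():
    """Two-base prefixes whose four codons all encode one amino acid."""
    prefixes = []
    for first, second in product(BASES, repeat=2):
        encoded = {CODON_TABLE[first + second + third] for third in BASES}
        if len(encoded) == 1:
            prefixes.append(first + second)
    return prefixes


def is_fourfold_site(codon, position):
    """True if `position` (0-based) in `codon` is a four-fold degenerate site."""
    return position == 2 and codon[:2] in fourfold_prefixes()


if __name__ == "__main__":
    for prefix in fourfold_prefixes():
        print(prefix + "N", CODON_TABLE[prefix + "A"])
```

Because substitutions at these sites do not change the encoded amino acid, they are commonly used as an approximation of neutrally evolving sequence.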

False-discovery rate

A statistical measure of error, specifically defined as 1 – (true positives / (true + false positives)). Such an error estimate allows for greater fluctuation in the total amount of detected true-positives, as it reflects the proportion of false positives in the resulting data set rather than an absolute value of false positives.
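The formula above can be written as a short function. This is a minimal sketch; the function name and the example counts are hypothetical.

```python
def false_discovery_rate(true_positives, false_positives):
    """FDR = 1 - TP / (TP + FP), the proportion of calls that are false."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no positive calls, so the FDR is undefined")
    return 1.0 - true_positives / called


if __name__ == "__main__":
    # Two data sets with very different absolute numbers of false
    # positives share the same false-discovery rate.
    print(false_discovery_rate(95, 5))
    print(false_discovery_rate(950, 50))
```

Both calls describe a data set in which 5% of the reported elements are false positives, even though the absolute number of false positives differs tenfold. This is the sense in which the measure reflects a proportion rather than an absolute value.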

About this article

Cite this article

Margulies, E., Birney, E. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat Rev Genet 9, 303–313 (2008). https://doi.org/10.1038/nrg2185
