Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes

Margulies, Elliott H.; Birney, Ewan

doi:10.1038/nrg2185

Review Article
Published: April 2008


Nature Reviews Genetics volume 9, pages 303–313 (2008)


Key Points

The identification of evolutionarily constrained sequences is an unbiased approach for finding functional sequences in genomes. However, their identification is strongly affected by upstream analyses.
Genomes are sequenced to different levels of finishing, which affects downstream comparative analyses.
Before genomic sequences can be aligned, segments of homologous collinearity must be identified. Errors at this stage can have a dramatic effect on the identification of constrained sequences.
Base-pair sequence-alignment programs differentially handle the complexities of evolution, such as insertions/deletions and duplications. In addition, new approaches to identify regions of alignment uncertainty can be used.
Current approaches that utilize evolutionary sequence constraint focus on regions that are deeply constrained. Newer approaches combined with sequences from more species can now be pursued to identify weakly and/or lineage-specific constrained sequences.
The amount of detectable constrained sequence depends on the phylogenetic scope being pursued as well as the resolution and intensity of the desired detectable constraint.
Large collaborative projects such as ENCODE are shedding light on the correlation between sequence constraint and sequence function. Additional methods are also available for determining the biological significance of constrained sequences.
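As a rough illustration of what "identifying constrained sequence" from an alignment involves, the sketch below scores each column of a toy multiple alignment by the fraction of species sharing the most common base, then reports windows whose mean score exceeds a cutoff. This is a deliberately simplistic stand-in and not any of the methods the Review discusses; the function names, the toy alignment, the window size and the threshold are all assumptions chosen for illustration. Real methods model neutral substitution rates and phylogeny rather than raw identity.

```python
from typing import List, Tuple


def column_identity(column: str) -> float:
    """Fraction of non-gap bases in a column that match its most common base."""
    bases = [b for b in column.upper() if b != "-"]
    if len(bases) < 2:
        return 0.0  # fewer than two species aligned: nothing to compare
    most_common = max(set(bases), key=bases.count)
    return bases.count(most_common) / len(bases)


def constrained_windows(
    alignment: List[str], window: int, threshold: float
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges with mean identity >= threshold.

    Qualifying windows that overlap or touch are merged into one range.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    scores = [
        column_identity("".join(row[i] for row in alignment)) for i in range(length)
    ]
    ranges: List[Tuple[int, int]] = []
    for start in range(0, length - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            if ranges and start <= ranges[-1][1]:
                # Extend the previous range instead of opening a new one.
                ranges[-1] = (ranges[-1][0], start + window)
            else:
                ranges.append((start, start + window))
    return ranges


# Hypothetical three-species alignment: the first six columns are identical,
# the remaining six diverge.
toy_alignment = [
    "ACGTACGATTCA",
    "ACGTACTTAGCC",
    "ACGTACCAGTAT",
]
print(constrained_windows(toy_alignment, window=4, threshold=0.9))  # [(0, 6)]
```

Because the score depends only on the aligned columns, any upstream error in assembly, collinearity detection or alignment changes which windows pass the cutoff, which is the dependency the Key Points emphasize.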

Abstract

The comparison of genomic sequences is now a common approach to identifying and characterizing functional regions in vertebrate genomes. However, for theoretical reasons and because of practical issues, the generation of these data sets is non-trivial and can have many pitfalls. We are currently seeing an explosion of comparative sequence data, the benefits and limitations of which need to be disseminated to the scientific community. This Review provides a critical overview of the different types of sequence data that are available for analysis and of contemporary comparative sequence analysis methods, highlighting both their strengths and limitations. Approaches to determining the biological significance of constrained sequence are also explored.


**Figure 1: Overview of comparative sequence analysis.**

**Figure 2: Vertebrate genomic sequence data.**

**Figure 3: Challenges in the reconstruction of homologous collinear regions.**

**Figure 4: Relationship between sequence length and the quantity of constrained sequence detected.**

References

Miller, W., Makova, K. D., Nekrutenko, A. & Hardison, R. C. Comparative genomics. Annu. Rev. Genomics Hum. Genet. 5, 15–56 (2004).
Hardison, R. C. Comparative genomics. PLoS Biol. 1, E58 (2003).
Xie, X. et al. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434, 338–345 (2005).
Göttgens, B. et al. Analysis of vertebrate SCL loci identifies conserved enhancers. Nature Biotechnol. 18, 181–186 (2000).
Loots, G. G. et al. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288, 136–140 (2000).
Kimura, M. & Ohta, T. On some principles governing molecular evolution. Proc. Natl Acad. Sci. USA 71, 2848–2852 (1974).
Margulies, E. H. Confidence in comparative genomics. Genome Res. 18, 199–200 (2008).
International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
International Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002). The first vertebrate genome-wide comparative sequence analysis. The many seminal findings include initial estimates of the extent of evolutionary constraint among mammalian genomes and the fact that there is more than twice as much non-coding constrained sequence compared with protein-coding regions.
Margulies, E. H., NISC Comparative Sequencing Program & Green, E. D. Detecting highly conserved regions of the human genome by multispecies sequence comparisons. Cold Spring Harb. Symp. Quant. Biol. 68, 255–263 (2003).
Thomas, J. W. et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793 (2003). This landmark manuscript was one of the first to highlight the power of sequencing and analysing the genomes of many species and provided intellectual support for sequencing many vertebrate genomes.
Eddy, S. R. A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3, e10 (2005). The 'go-to' manuscript outlining the theoretical basis for choosing the amount and diversity of genomes to sequence in order to obtain a certain level of resolution for constrained sequences.
Green, E. D. Strategies for the systematic sequencing of complex genomes. Nature Rev. Genet. 2, 573–583 (2001). An excellent review outlining the approaches used for sequencing genomes.
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
Blakesley, R. W. et al. An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 14, 2235–2244 (2004).
Margulies, E. H. et al. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl Acad. Sci. USA 102, 4795–4800 (2005).
Green, P. 2x genomes — does depth matter? Genome Res. 17, 1547–1549 (2007).
Lander, E. S. & Waterman, M. S. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2, 231–239 (1988).
ENCODE Project Consortium. The ENCODE pilot project: functional annotation of 1% of the human genome. Nature 447, 799–816 (2007). This manuscript, along with Reference 20 and the entire April 2007 issue of Genome Research, highlights the first systematic identification and analysis of functional elements in the human genome. Of particular interest here is the section describing the extent of evolutionary sequence constraint that was observed in functional elements.
Margulies, E. H. et al. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17, 760–774 (2007).
Bentley, D. R. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 16, 545–552 (2006).
Drosophila 12 Genomes Consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218 (2007). This manuscript, along with Reference 56, highlights seminal multi-species sequence comparisons on a genome-wide scale. Furthermore, Reference 56 presents new approaches for computationally identifying and classifying the functions of constrained sequences.
Stein, L. D. et al. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 1, E45 (2003).
Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E. S. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241–254 (2003).
Dewey, C. N. Aligning multiple whole genomes with Mercator and MAVID. Methods Mol. Biol. 395, 221–236 (2007).
Tesler, G. GRIMM: genome rearrangements web server. Bioinformatics 18, 492–493 (2002).
Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl Acad. Sci. USA 100, 11484–11489 (2003).
Ma, J. et al. Reconstructing contiguous regions of an ancestral genome. Genome Res. 16, 1557–1565 (2006).
Lunter, G. et al. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res. 18, 298–309 (2008).
Holmes, I. & Durbin, R. Dynamic programming alignment accuracy. J. Comput. Biol. 5, 493–504 (1998).
Wong, K. M., Suchard, M. A. & Huelsenbeck, J. P. Alignment uncertainty and genomic analysis. Science 319, 473–476 (2008).
Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30, 2478–2483 (2002).
Higgins, D. G. & Sharp, P. M. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237–244 (1988).
Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721–731 (2003).
Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715 (2004).
Margulies, E. H., Chen, C. W. & Green, E. D. Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. Trends Genet. 22, 187–193 (2006).
Notredame, C., Higgins, D. G. & Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217 (2000).
Do, C. B., Mahabhashyam, M. S. P., Brudno, M. & Batzoglou, S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. 15, 330–340 (2005).
Bray, N. & Pachter, L. MAVID: constrained ancestral alignment of multiple sequences. Genome Res. 14, 693–699 (2004).
Pollard, D. A., Bergman, C. M., Stoye, J., Celniker, S. E. & Eisen, M. B. Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics 5, 6 (2004).
Stone, E. A., Cooper, G. M. & Sidow, A. Trade-offs in detecting evolutionarily constrained sequence by comparative genomics. Annu. Rev. Genomics Hum. Genet. 6, 143–164 (2005). An excellent review outlining how various comparative sequence analysis methods combined with different sequence data sets affect the sensitivity, specificity and phylogenetic scope for detecting constrained sequences.
Margulies, E. H., Blanchette, M., NISC Comparative Sequencing Program, Haussler, D. & Green, E. D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507–2518 (2003). This was the first report of a computational approach for logically combining the conservation information from many species' sequences for the identification of constrained elements. Also see References 43 and 45 for subsequent methods.
Cooper, G. M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005).
Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, UK,1998).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Kamal, M., Xie, X. & Lander, E. S. A large family of ancient repeat elements in the human genome is under strong selection. Proc. Natl Acad. Sci. USA 103, 2740–2745 (2006).
Siepel, A., Pollard, K. S. & Haussler, D. in Proc. 10th Int. Conf. Res. Comput. Mol. Biol. (eds Apostolico, A., Guerra, C., Istrail, S., Pevzner, P. & Waterman, M.) 190–205 (Springer, Berlin, 2006).
Rhesus Macaque Genome Sequencing and Analysis Consortium. Evolutionary and biomedical insights from the Rhesus macaque genome. Science 316, 222–234 (2007).
Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394 (2003).
Prabhakar, S. et al. Close sequence comparisons are sufficient to identify human cis-regulatory elements. Genome Res. 16, 855–863 (2006).
Wang, Q.-f. et al. Detection of weakly conserved ancestral mammalian regulatory sequences by primate comparisons. Genome Biol. 8, R1 (2007).
Wang, Q.-f. et al. Primate-specific evolution of an LDLR enhancer. Genome Biol. 7, R68 (2006).
Moses, A. M. et al. Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput. Biol. 2, e130 (2006).
Blanchette, M. & Tompa, M. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 12, 739–748 (2002).
Wang, T. & Stormo, G. D. Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 19, 2369–2380 (2003).
Stark, A. et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450, 219–232 (2007).
Mortlock, D. P., Guenther, C. & Kingsley, D. M. A general approach for identifying distant regulatory elements applied to the Gdf6 gene. Genome Res. 13, 2069–2081 (2003).
Pennacchio, L. A. et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature 444, 499–502 (2006).
Fisher, S., Grice, E. A., Vinton, R. M., Bessling, S. L. & McCallion, A. S. Conservation of RET regulatory function from human to zebrafish without sequence similarity. Science 312, 276–279 (2006).
Kuhn, R. M. et al. The UCSC genome browser database: update 2007. Nucleic Acids Res. 35, D668–D673 (2007).
Flicek, P. et al. Ensembl 2008. Nucleic Acids Res. 36, D707–D714 (2007).
Spudich, G., Fernández-Suárez, X. M. & Birney, E. Genome browsing with Ensembl: a practical overview. Brief. Funct. Genomic. Proteomic. 6, 202–219 (2007).
Giardine, B. et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 15, 1451–1455 (2005).
Blankenberg, D. et al. A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome Res. 17, 960–964 (2007).
Visel, A., Minovitsky, S., Dubchak, I. & Pennacchio, L. A. VISTA Enhancer Browser — a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007).


Author information

Authors and Affiliations

Genome Technology Branch, Genome Informatics Section, National Human Genome Research Institute, National Institutes of Health, 5625 Fishers Lane, Room 5N-01N, MSC9400, Bethesda, 20892, Maryland, USA
Elliott H. Margulies
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD, Cambridge, UK
Ewan Birney


Corresponding author

Correspondence to Elliott H. Margulies.

Supplementary information

Supplementary information S1 (table) (PDF 130 kb)

Glossary

Purifying selection: The evolutionary process of rejecting substitutions in functional DNA, thereby making such sequences more similar when compared among different species.
Orthologous: Homologous sequences in different species that arose from a speciation event.
Homologous: Sequences that have a common ancestor but might be related by either speciation or duplication events in a genome. Pragmatically, homology is detected by the presence of similarity between two sequences.
Contig: Contiguous piece of DNA that is assembled from shorter overlapping sequence reads.
Whole-genome shotgun: The process of shearing the DNA from an entire genome into smaller pieces that are randomly (or 'shotgun') sequenced en masse.
Genome coverage: The total number of bases that are sequenced divided by the genome size. Actual coverage differs depending on the statistical properties of a Poisson distribution, which takes into account the fact that reads are sequenced at random.
Segmental duplications: Regions (or segments) of a genome that evolved from a single common ancestor and arose from a duplication event. As such, these sequences within the same genome are paralogous to each other.
Pseudogene: A presumably non-functional region of DNA with homology to an actual gene. Pseudogenes typically arise from the reincorporation of an RNA intermediate into the genomic sequence.
Paralogous: The homology between two genomic segments that arose from a duplication event.
Heuristic methods: For large compute problems, the application of workable but not formally correct solutions to help reduce the computational time. In the case of sequence alignment, common heuristic methods include the progressive alignment of closer sequences to each other first before aligning to more distant species, and the use of highly similar anchoring sequences to reduce the search space in the alignment.
Compute farms: Large groups of computers, each on their own only able to analyse a small piece of data (similar to a typical desktop PC), but which, when combined together, provide a powerful resource for analysing computationally intense problems.
Dynamic programming: An algorithm that is efficiently designed to analyse data, usually by elegantly breaking the computational problem down into smaller, simpler sub-problems.
Suffix tree: An indexing technique to efficiently store all sub-sequences of a string of letters.
Hidden Markov model: Mathematical concept that describes a finite set of 'states' and a probabilistic model for transitioning from one state to another.
Ancestral repeat: Relics of transposable elements that inserted before a speciation event and are therefore orthologous and presumed to be non-functional. Therefore, these regions are largely thought to be neutrally evolving.
Eutherian radiation: Approximately 80 million years ago a large diversity of mammalian species began to evolve. These placental mammals provide a rich resource for identifying constrained sequences.
Four-fold degenerate sites: Third positions of codons for which any base yields the same amino acid.
False-discovery rate: A statistical measure of error, specifically defined as 1 – (true positives / (true + false positives)). Such an error estimate allows for greater fluctuation in the total amount of detected true-positives, as it reflects the proportion of false positives in the resulting data set rather than an absolute value of false positives.
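Three of the glossary entries above are quantitative and can be made concrete with a short sketch. The code below is illustrative only and its function names are assumptions, not part of the Review. It computes (1) the expected fraction of a genome covered by at least one read under the Poisson model mentioned under "Genome coverage", using the standard result 1 − e^(−c) for coverage c; (2) the four-fold degenerate third-codon positions of a coding sequence, derived from the standard genetic code; and (3) the false-discovery rate exactly as defined in the glossary.

```python
import math
from itertools import product
from typing import List

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order for each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Two-base prefixes for which any third base yields the same amino acid.
FOURFOLD_PREFIXES = {
    first + second
    for first, second in product(BASES, repeat=2)
    if len({CODON_TABLE[first + second + third] for third in BASES}) == 1
}


def expected_fraction_covered(coverage: float) -> float:
    """Probability that a base is hit by at least one read at mean coverage c.

    With reads placed at random, the number of reads over a base is roughly
    Poisson(c), so the chance of zero reads is exp(-c).
    """
    return 1.0 - math.exp(-coverage)


def fourfold_site_positions(cds: str) -> List[int]:
    """0-based indices of four-fold degenerate third positions in an in-frame CDS."""
    cds = cds.upper()
    return [
        i + 2
        for i in range(0, len(cds) - len(cds) % 3, 3)
        if cds[i:i + 2] in FOURFOLD_PREFIXES
    ]


def false_discovery_rate(true_positives: int, false_positives: int) -> float:
    """FDR as defined in the glossary: 1 - TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return 1.0 - true_positives / total


print(round(expected_fraction_covered(2.0), 3))   # 0.865
print(sorted(FOURFOLD_PREFIXES))
print(fourfold_site_positions("ATGGCTAAA"))       # [5]
print(round(false_discovery_rate(90, 10), 3))     # 0.1
```

At twofold coverage, for example, about 86% of bases are expected to be sampled at least once, which is why realized coverage falls short of the nominal figure. The four-fold sites are computed from the codon table rather than hard-coded so that the definition ("any base yields the same amino acid") is applied directly.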


About this article

Cite this article

Margulies, E., Birney, E. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat Rev Genet 9, 303–313 (2008). https://doi.org/10.1038/nrg2185



