Abstract
A phylogeny is the evolutionary history of a group of organisms; systematists (and other biologists) attempt to reconstruct this history from various forms of data about contemporary organisms. Phylogeny reconstruction is a crucial step in the understanding of evolution as well as an important tool in biological, pharmaceutical, and medical research. Phylogeny reconstruction from molecular data is very difficult: almost all optimization models give rise to NP-hard (and thus computationally intractable) problems. Yet approximations must be of very high quality in order to avoid outright biological nonsense. Thus many biologists have been willing to run farms of processors for many months in order to analyze just one dataset. High-performance algorithm engineering offers a battery of tools that can reduce, sometimes spectacularly, the running time of existing phylogenetic algorithms, as well as help designers produce better algorithms. We present an overview of algorithm engineering techniques, illustrating them with an application to the “breakpoint analysis” method of Sankoff et al., which resulted in the GRAPPA software suite. GRAPPA demonstrated a speedup in running time by over eight orders of magnitude over the original implementation on a variety of real and simulated datasets. We show how these algorithmic engineering techniques are directly applicable to a large variety of challenging combinatorial problems in computational biology.
References
L. Arge, J. Chase, J. Vitter, and R. Wickremesinghe. Efficient sorting using registers and caches. In Proceedings of the 4th Workshop on Algorithm Engineering (WAE 2000), Saarbrücken, Germany, 2000.
D. Bader and B. Moret. GRAPPA runs in record time. HPCwire, 9(47), 2000.
D. Bader, B. Moret, and P. Sanders. High-performance algorithm engineering for parallel computation. In Experimental Algorithmics, Lecture Notes in Computer Science. Springer-Verlag, Berlin, 2001.
D. Bader, B. Moret, and L. Vawter. Industrial applications of high-performance computing for phylogeny reconstruction. In H. Siegel, ed., Proceedings of the SPIE Commercial Applications for High-Performance Computing, vol. 4528, pp. 159–168. Denver, CO, 2001.
D. Bader, B. Moret, and M. Yan. A linear-time algorithm for computing inversion distance between signed permutations with an experimental study. Journal of Computational Biology, 8:483–491, 2001.
M. Bender, E. Demaine, and M. Farach-Colton. Cache-oblivious search trees. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS-00), pp. 399–409. Redondo Beach, Calif., 2000.
M. Blanchette, G. Bourque, and D. Sankoff. Breakpoint phylogenies. In S. Miyano and T. Takagi, eds., Genome Informatics, pp. 25–34. University Academy Press, Tokyo, Japan, 1997.
B. Cherkassky and A. Goldberg. On implementing the push-relabel method for the maximum flow problem. Algorithmica, 19:390–410, 1997.
B. Cherkassky, A. Goldberg, P. Martin, J. Setubal, and J. Stolfi. Augment or push: a computational study of bipartite matching and unit-capacity flow algorithms. ACM Journal of Experimental Algorithmics, 3(8), 1998. www.jea.acm.org/1998/CherkasskyAugment/.
B. Cherkassky, A. Goldberg, and T. Radzik. Shortest paths algorithms: theory and experimental evaluation. Mathematical Programming, 73:129–174, 1996.
M. Cosner, R. Jansen, B. Moret, L. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman. An empirical comparison of phylogenetic methods on chloroplast gene order data in Campanulaceae. In D. Sankoff and J. Nadeau, eds., Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment, and the Evolution of Gene Families, pp. 99–121. Kluwer Academic Publishers, Dordrecht, Netherlands, 2000.
M. Cosner, R. Jansen, B. Moret, L. Raubeson, L.-S. Wang, T. Warnow, and S. Wyman. A new fast heuristic for computing the breakpoint phylogeny and a phylogenetic analysis of a group of highly rearranged chloroplast genomes. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB00), pp. 104–115. San Diego, Calif., 2000.
N. Eiron, M. Rodeh, and I. Steinwarts. Matrix multiplication: a case study of enhanced data cache utilization. ACM Journal of Experimental Algorithmics, 4(3), 1999. www.jea.acm.org/1999/Eiron-Matrix/.
M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS-99), pp. 285–297. New York, 1999.
GenProbe. New oligonucleotides corresponding to HIV-1 sequences—used for selective amplification and as hybridisation probes for detection of HIV-1. Patent filing EP-617132–A (priority date), 1993.
A. Goldberg and K. Tsioutsiouliklis. Cut tree algorithms: an experimental study. Journal of Algorithms, 38:51–83, 2001.
E. Grossbard and D. Atkinson, eds., The Herbicide Glyphosate. Butterworths, Boston, 1985.
P. Halbur, M. Lum, X. Meng, I. Morozov, and P. Paul. New porcine reproductive and respiratory syndrome virus DNA—and proteins encoded by open reading frames of an Iowa strain of the virus; are used in vaccines against PRRSV in pigs. Patent filing WO9606619–A1 (priority date), 1994.
D. Johnson and L. McGeoch. The traveling salesman problem: A case study. In E. Aarts and J. Lenstra, eds., Local Search in Combinatorial Optimization, pp. 215–310. John Wiley, New York, 1997.
D. Jones. An empirical comparison of priority queues and event-set implementations. Communications of the ACM, 29:300–311, 1986.
R. Ladner, J. Fix, and A. LaMarca. The cache performance of traversals and random accesses. In Proceedings of the 10th Annual Symposium on Discrete Algorithms (SODA-99), pp. 613–622. Baltimore, 1999.
A. LaMarca and R. Ladner. The influence of caches on the performance of heaps. ACM Journal of Experimental Algorithmics, 1(4), 1996. www.jea.acm.org/1996/LaMarcaInfluence/.
A. LaMarca and R. Ladner. The influence of caches on the performance of heaps. In Proceedings of the 8th Symposium on Discrete Algorithms, pp. 370–379. New Orleans, LA, 1997.
K. Mehlhorn and S. Näher. LEDA: A Platform for Combinatorial and Geometric Computing. Cambridge University Press, 1999.
B. Moret, D. Bader, and T. Warnow. High-performance algorithm engineering for computational phylogenetics. In Proceedings of the 2001 International Conference on Computational Science, vol. 2073–2074. Lecture Notes in Computer Science. San Francisco, Calif., 2001.
B. Moret and H. Shapiro. Algorithms from P to NP, Vol. I: Design and Efficiency. Menlo Park, Calif., Benjamin-Cummings, 1991.
B. Moret and H. Shapiro. An empirical assessment of algorithms for constructing a minimal spanning tree. In DIMACS Monographs in Discrete Mathematics and Theoretical Computer Science: Computational Support for Discrete Mathematics, vol. 15, pp. 99–117. American Mathematical Society, 1994.
B. Moret and H. Shapiro. Algorithms and experiments: the new (and old) methodology. Journal of Universal Computer Science, 7:434–446, 2001.
B. Moret, L.-S. Wang, T. Warnow, and S. Wyman. New approaches for reconstructing phylogenies based on gene order. In Proceedings of the 9th International Conference on Intelligent Systems for Molecular Biology (ISMB 2001), pp. S165–S173. In Bioinformatics 17, 2001.
B. Moret and T. Warnow. Reconstructing optimal phylogenetic trees: a challenge in experimental algorithmics. In Experimental Algorithmics, Lecture Notes in Computer Science. Springer-Verlag, Berlin, 2001.
B. Moret, S. Wyman, D. Bader, T. Warnow, and M. Yan. A new implementation and detailed study of breakpoint analysis. In Proceedings of the 6th Pacific Symposium on Biocomputing (PSB2001), pp. 583–594. Hawaii, 2001.
R. Olmstead and J. Palmer. Chloroplast DNA systematics: a review of methods and data analysis. American Journal of Botany, 81:1205–1224, 1994.
J. Palmer. Chloroplast and mitochondrial genome evolution in land plants. In R. Herrmann, ed., Cell Organelles, pp. 99–133. Springer-Verlag, Berlin, 1992.
I. Pe'er and R. Shamir. The median problems for breakpoints are NP-complete. Technical report 71, Electronic Colloquium on Computational Complexity, 1998.
N. Rahman and R. Raman. Analysing cache effects in distribution sorting. In Proceedings of the 3rd Workshop on Algorithm Engineering (WAE99), pp. 183–197. London, England, 1999.
L. Raubeson and R. Jansen. Chloroplast DNA evidence on the ancient evolutionary split in vascular land plants. Science, 255:1697–1699, 1992.
K. Rice, M. Donoghue, and R. Olmstead. Analyzing large datasets: rbcl500 revisited. Systematic Biology, 46:554–562, 1997.
N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstruction of phylogenetic trees. Molecular Biology and Evolution, 4:406–425, 1987.
P. Sanders. Fast priority queues for cached memory. ACM Journal of Experimental Algorithmics, 5(7), 2000. www.jea.acm.org/2000/SandersPriority/.
D. Sankoff and M. Blanchette. Multiple genome rearrangement and breakpoint phylogeny. Journal of Computational Biology, 5:555–570, 1998.
J. Stasko and J. Vitter. Pairing heaps: experiments and analysis. Communications of the ACM, 30:234–249, 1987.
P. Szekeres, A. Muir, L. Spinage, J. Miller, S. Butler, A. Smith, G. Rennie, P. Murdock, L. Fitzgerald, H. Wu, L. McMillan, S. Guerrera, L. Vawter, N. Elshourbagy, J. Mooney, D. Bergsma, S. Wilson, and J. Chambers. Neuromedin U is a potent agonist at the orphan G protein-coupled receptor FM3. Journal of Biological Chemistry, 275:20247–20250, 2000.
L.-S. Wang. Improving the accuracy of evolutionary distances between genomes. In Proceedings of the 1st Workshop on Algorithms in Bioinformatics (WABI'01), Lecture Notes in Computer Science, vol. 2149, pp. 176–190. Århus, Denmark, 2001.
L.-S. Wang and T. Warnow. Estimating true evolutionary distances between genomes. In Proceedings of the 33rd Annual Symposium on Theory of Computing (STOC 2001), pp. 637–646, 2001.
L. Xiao, X. Zhang, and S. Kubricht. Improving memory performance of sorting algorithms. ACM Journal of Experimental Algorithmics, 5(3), 2000. www.jea.acm.org/2000/XiaoMemory/.
Y. Zhu, D. Michalovich, H. Wu, K. Tan, G. Dytko, I. Mannan, R. Boyce, J. Alston, L. Tierney, X. Li, N. Herrity, L. Vawter, H. Sarau, R. Ames, C. Davenport, J. Hieble, S. Wilson, D. Bergsma, and L. Fitzgerald. Cloning, expression, and pharmacological characterization of a novel human histamine receptor. Molecular Pharmacology, 59:434–441, 2001.
Cite this article
Moret, B.M.E., Bader, D.A. & Warnow, T. High-Performance Algorithm Engineering for Computational Phylogenetics. The Journal of Supercomputing 22, 99–111 (2002). https://doi.org/10.1023/A:1014362705613