Published by De Gruyter March 19, 2015

H-CLAP: hierarchical clustering within a linear array with an application in genetics

  • Samiran Ghosh and Jeffrey P. Townsend

Abstract

In most cases where clustering of data is desirable, the underlying data distribution to be clustered is unconstrained. However, clustering of site types in a discretely structured linear array, as is often desired in studies of linear sequences such as DNA, RNA or proteins, represents a problem where data points are not necessarily exchangeable and are directionally constrained within the array. Each position in the linear array is fixed, and could be either “marked” (i.e., of interest, such as polymorphic or substituted sites) or “non-marked.” Here we describe a method for clustering those marked sites. Since the cluster-generating process is constrained by discrete locality inside such an array, traditional clustering methods need adjustment to be appropriate. We develop a hierarchical Bayesian approach. We adopt a Markov clustering algorithm, revealing any natural partitioning in the pattern of marked sites. The resulting recursive partitioning and clustering algorithm is named hierarchical clustering in a linear array (H-CLAP). It employs domain-specific directional constraints directly in the likelihood construction. Our method, being fully Bayesian, is more flexible in cluster discovery than a standard agglomerative hierarchical clustering algorithm. It provides not only a hierarchical clustering but also cluster boundaries, which may have their own biological significance. We have tested the efficacy of our method on two biological data sets and several simulated ones.


Corresponding author: Samiran Ghosh, Department of Family Medicine and Public Health Sciences and Center of Molecular Medicine and Genetics, Wayne State University School of Medicine, 3127 Scott Hall, 540 East Canfield, Detroit, MI, USA

Acknowledgments

The research of the first author is partly supported by NIH grant number P30-ES020957. The authors would like to thank Dr. Dipak Dey for his comments and initial involvement. We would also like to thank an anonymous reviewer for his/her many insightful comments and constructive suggestions, which have greatly improved the scope and presentation of the paper.

Appendix

A Proof of Theorem 2.1

For notational simplicity, define $U = C_E - C_S + 1$. With a slight abuse of notation, the pmf (probability mass function) of the random variable $U$ is given by

$$f(u) = \binom{u}{n_C} p_C^{n_C} (1-p_C)^{u-n_C} \binom{N-u}{n-n_C} p_O^{n-n_C} (1-p_O)^{N-u-n+n_C}.$$

Without loss of generality, we assume that exactly $n_C = n$ consecutive sites are marked and $U = n$. Then the probability of such a cluster is given by

$$f(u=n) = p_C^{n} (1-p_O)^{N-n}.$$

Now consider $U = n+1$, i.e., a cluster with $n$ consecutively marked sites and only one non-marked site placed at one end. We first consider the case when the non-marked site is at the very beginning. The other situation, when the non-marked site is at the extreme end, can be reasoned similarly. In the likelihood equation (1) the binomial coefficient yields $\binom{n+1}{n} = n+1$. However, this does not take into account the constraint of consecutively marked sites: it produces $n+1$ by placing the non-marked site in every possible position, which clearly violates the above assumption. Under the assumption that the $n$ marked sites are all consecutive and the only non-marked site is at the very beginning, the binomial coefficient yields 1 (the only possible arrangement). Hence,

$$f(u=n+1) = p_C^{n} (1-p_C) (1-p_O)^{N-n-1}.$$

It is simple algebra to show that $\frac{f(u=n)}{f(u=n+1)} = \frac{1-p_O}{1-p_C} \geq 1$, since $p_C \geq p_O$ by assumption. Following a similar argument, it is easy to note that $f(u=n) \geq f(u=l)$ for any $l \geq n$.

Now, to prove the other side, consider $U = n-1$. The interval $[C_S, C_E]$ then contains $n-1$ marked cells (i.e., $n_C = n-1$), and the only remaining marked site falls outside $[C_S, C_E]$. Hence,

$$f(u=n-1) = p_C^{n-1} p_O (1-p_O)^{N-n}.$$

Consider again the ratio $\frac{f(u=n)}{f(u=n-1)} = \frac{p_C}{p_O} \geq 1$. Following a similar argument, it is easy to note that $f(u=n) \geq f(u=l)$ for any $l \leq n$. Thus the likelihood in equation (1) is maximized when $U = C_E - C_S + 1 = n$, i.e., by a cluster covering exactly the consecutively marked cells.
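
As a numerical illustration of Theorem 2.1, the following Python sketch evaluates the likelihood of a fixed 0/1 array for every candidate window [C_S, C_E] and returns the maximizer. It assumes, as in the proof, that the binomial coefficients are replaced by 1 because the arrangement of marked sites is fixed. The function names, the example array and the values p_C = 0.8 and p_O = 0.1 are hypothetical choices made for illustration and are not taken from the paper.

```python
def window_likelihood(marks, start, end, p_c, p_o):
    """Likelihood of a fixed 0/1 array for the cluster window [start, end].

    Positions are 1-indexed and inclusive. Binomial coefficients are set
    to 1 (assumption: the arrangement of marked sites is fixed).
    """
    N = len(marks)                      # length of the linear array
    n = sum(marks)                      # total number of marked sites
    u = end - start + 1                 # window length U = C_E - C_S + 1
    n_c = sum(marks[start - 1:end])     # marked sites inside the window
    inside = p_c ** n_c * (1 - p_c) ** (u - n_c)
    outside = p_o ** (n - n_c) * (1 - p_o) ** (N - u - (n - n_c))
    return inside * outside


def best_window(marks, p_c, p_o):
    """Brute-force search over all windows for the likelihood maximizer."""
    N = len(marks)
    windows = [(s, e) for s in range(1, N + 1) for e in range(s, N + 1)]
    return max(windows,
               key=lambda w: window_likelihood(marks, w[0], w[1], p_c, p_o))


if __name__ == "__main__":
    # Hypothetical example: one run of five marked sites at positions 4..8.
    marks = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    print(best_window(marks, p_c=0.8, p_o=0.1))  # prints (4, 8)
```

With p_C > p_O, each marked site contributes more to the likelihood inside the window and each non-marked site contributes more outside it, so the search returns the window that covers exactly the run of marked sites.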

B Proof of Theorem 3.1

We first note the joint likelihood of $(C_S, C_E)$ given in (2). From this, the marginal pmf of $C_E$ is given by

$$\pi(C_E = c_E) = \sum_{c_S=1}^{c_E} \frac{1}{N(N - c_S + 1)}, \quad \text{for } 1 \leq c_E \leq N.$$

To prove the asymmetry and monotonicity, consider any integer $x \in [1, N-1]$. Clearly, from the above,

$$\pi(C_E = x+1) = \sum_{c_S=1}^{x+1} \frac{1}{N(N - c_S + 1)} = \sum_{c_S=1}^{x} \frac{1}{N(N - c_S + 1)} + \frac{1}{N(N - x)} \geq \pi(C_E = x).$$

Thus the marginal pmf of $C_E$ is non-decreasing and hence cannot be symmetric. Since the above inequality holds for all $x \in [1, N-1]$, the pmf is monotonically increasing, from which left skewness is immediate.
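
The monotonicity can be checked numerically. The sketch below is a minimal Python illustration that assumes only the summand 1/(N(N - c_S + 1)) appearing in the marginal pmf above; the function name `marginal_ce` and the choice N = 4 are hypothetical. Exact rational arithmetic is used so that the checks do not depend on rounding.

```python
from fractions import Fraction


def marginal_ce(N):
    """Marginal pmf of C_E on 1..N as a list of exact fractions.

    pi(C_E = c) is the running sum of 1 / (N * (N - c_S + 1))
    over c_S = 1, ..., c.
    """
    pmf = []
    running = Fraction(0)
    for c in range(1, N + 1):
        running += Fraction(1, N * (N - c + 1))  # term for c_S = c
        pmf.append(running)
    return pmf


if __name__ == "__main__":
    pmf = marginal_ce(4)
    print(pmf)                                        # 1/16, 7/48, 13/48, 25/48
    print(sum(pmf) == 1)                              # True
    print(all(a < b for a, b in zip(pmf, pmf[1:])))   # True
```

For N = 4 the probabilities are 1/16, 7/48, 13/48 and 25/48, which sum to one and increase with the end position, so most of the mass sits at the right-hand end of the array.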

C Proof of Theorem 3.2

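The proof is very similar to that of Theorem 2.1. Denote $U = C_E - C_S + 1$. The joint pmf of the random variables $U$ and $C_S$ is given by

$$f(u) = \binom{u}{n_C} p_C^{n_C} (1-p_C)^{u-n_C} \binom{N-u}{n-n_C} p_O^{n-n_C+\alpha-1} (1-p_O)^{N-u-n+n_C+\beta-2} \, \frac{1}{N(N - C_S + 1)}.$$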

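Again we assume that $n_C = n$ consecutive sites are marked and $U = n$. Then the probability of such a cluster is given by

$$f(u=n) = p_C^{n} p_O^{\alpha-1} (1-p_O)^{N-n+\beta-2} \, \frac{1}{N(N - l + 1)},$$

where $C_S = l$ for $l = 1, \ldots, N-n+1$. Also note that

$$f(u=n+1) = p_C^{n} (1-p_C) p_O^{\alpha-1} (1-p_O)^{N-n+\beta-3} \, \frac{1}{N(N - C_S + 1)}.$$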

The reason that $\binom{u}{n_C} = \binom{n+1}{n}$ is replaced by 1 in the above is exactly the same as in the proof of Theorem 2.1. Since the above involves $C_S$, the simplification for the choice $U = n+1$ depends on the value assumed by $C_S$. Note that there are two cases for a cluster of length $n+1$, namely $C_S = l-1$ and $C_S = l$. To include one more site in the cluster of size $n$ (i.e., $U = n$), we may keep $C_S = l$ fixed and move $C_E$ by one site, or we may keep $C_E$ fixed and move $C_S$ by one site so that $C_S = l-1$. Of course, the choice $C_S = l-1$ is not available if $l = 1$. Note that in both cases all consecutively marked sites are in one cluster. When $C_S = l$ is kept fixed, it is easy to show that $\frac{f(u=n)}{f(u=n+1)} = \frac{1-p_O}{1-p_C} \geq 1$. For $C_S = l-1$, $\frac{f(u=n)}{f(u=n+1)} = \frac{(N-l+2)(1-p_O)}{(N-l+1)(1-p_C)} > 1$. Following a similar argument, it is easy to note that $P(U=n) \geq P(U=k)$ for any integer $k \geq n$.

Now, to prove the other side, consider $U = n-1$. Hence,

$$f(u=n-1) = p_C^{n-1} p_O^{\alpha} (1-p_O)^{N-n+\beta-2} \, \frac{1}{N(N - C_S + 1)}.$$

This again corresponds to two choices, namely $C_S = l$ and $C_S = l+1$, i.e., either move $C_E$ one site backward or move $C_S$ one site forward. When $C_S = l$ is kept fixed, it is easy to show that $\frac{f(u=n)}{f(u=n-1)} = \frac{p_C}{p_O} \geq 1$. For $C_S = l+1$, it is easy to note that $\frac{f(u=n)}{f(u=n-1)} = \frac{(N-l)\,p_C}{(N-l+1)\,p_O}$. Hence $\frac{f(u=n)}{f(u=n-1)} \geq 1$ provided $\frac{p_C}{p_O} \geq \frac{N-l+1}{N-l}$. Following a similar argument, it is easy to note that $P(U=n) \geq P(U=k)$ for any integer $k \leq n$. Hence equation (4) is maximized when $U = C_E - C_S + 1 = n$. This completes the proof.
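
The four ratios used in this proof can be verified with exact arithmetic. The Python sketch below evaluates the expression for f(u) from this appendix with the binomial coefficients set to 1, as the proof does, and compares the cluster of length n starting at C_S = l with its four neighbours of length n+1 and n-1. The function names, the dictionary keys and the numerical values (N = 20, n = 5, l = 6, p_C = 4/5, p_O = 1/10, alpha = 2, beta = 3) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def kernel(N, n, u, n_c, c_s, p_c, p_o, alpha, beta):
    """f(u) from Appendix C with both binomial coefficients set to 1."""
    return (p_c ** n_c * (1 - p_c) ** (u - n_c)
            * p_o ** (n - n_c + alpha - 1)
            * (1 - p_o) ** (N - u - n + n_c + beta - 2)
            / (N * (N - c_s + 1)))


def neighbour_ratios(N, n, l, p_c, p_o, alpha, beta):
    """Ratios f(u=n) / f(neighbour) for the four one-site moves."""
    base = kernel(N, n, n, n, l, p_c, p_o, alpha, beta)
    args = (p_c, p_o, alpha, beta)
    return {
        # U = n+1: one non-marked site joins the cluster
        "extend_end": base / kernel(N, n, n + 1, n, l, *args),
        "extend_start": base / kernel(N, n, n + 1, n, l - 1, *args),
        # U = n-1: one marked site is left outside the cluster
        "shrink_end": base / kernel(N, n, n - 1, n - 1, l, *args),
        "shrink_start": base / kernel(N, n, n - 1, n - 1, l + 1, *args),
    }


if __name__ == "__main__":
    ratios = neighbour_ratios(N=20, n=5, l=6,
                              p_c=Fraction(4, 5), p_o=Fraction(1, 10),
                              alpha=2, beta=3)
    for move, ratio in ratios.items():
        print(move, ratio)
```

For these values the ratios are 9/2, 24/5, 8 and 112/15, matching the closed forms in the proof. All exceed one, and the condition p_C/p_O ≥ (N-l+1)/(N-l) = 15/14 needed for the last case is satisfied.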

Published Online: 2015-3-19
Published in Print: 2015-4-1

©2015 by De Gruyter

Downloaded on 19.4.2024 from https://www.degruyter.com/document/doi/10.1515/sagmb-2013-0076/html