H-CLAP: hierarchical clustering within a linear array with an application in genetics

Samiran Ghosh; Jeffrey P. Townsend

doi:10.1515/sagmb-2013-0076

Published by De Gruyter March 19, 2015

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

In most cases where clustering of data is desirable, the underlying data distribution to be clustered is unconstrained. However, clustering of site types in a discretely structured linear array, as is often desired in studies of linear sequences such as DNA, RNA or proteins, represents a problem where data points are not necessarily exchangeable and are directionally constrained within the array. Each position in the linear array is fixed, and could be either “marked” (i.e., of interest, such as polymorphic or substitute sites) or “non-marked.” Here we describe a method for clustering those marked sites. Since the cluster-generating process is constrained by discrete locality inside such an array, traditional clustering methods need adjustment to be appropriate. We develop a hierarchical Bayesian approach. We adopt a Markov clustering algorithm, revealing any natural partitioning in the pattern of marked sites. The resulting recursive partitioning and clustering algorithm is named hierarchical clustering in a linear array (H-CLAP). It employs domain-specific directional constraints directly in the likelihood construction. Our method, being fully Bayesian, is more flexible in cluster discovery than a standard agglomerative hierarchical clustering algorithm. It provides not only a hierarchical clustering but also cluster boundaries, which may have their own biological significance. We have tested the efficacy of our method on two biological data sets and several simulated ones.

Keywords: constrained prior; genetics; hierarchical clustering; linear array; Markov clustering

Corresponding author: Samiran Ghosh, Department of Family Medicine and Public Health Sciences and Center of Molecular Medicine and Genetics, Wayne State University School of Medicine, 3127 Scott Hall, 540 East Canfield, Detroit, MI, USA, e-mail: sghos@med.wayne.edu

Acknowledgments

The research of the first author is partly supported by NIH grant number P30-ES020957. The authors would like to thank Dr. Dipak Dey for his comments and initial involvement. We would also like to thank an anonymous reviewer for his/her many insightful comments and constructive suggestions, which have greatly improved the scope and presentation of the paper.

Appendix

A Proof of Theorem 2.1

For notational simplicity, define $U = C_E - C_S + 1$. With a slight abuse of notation, the pmf (probability mass function) of the random variable $U$ is given by

$$f(u) = \binom{u}{n_C} p_C^{n_C} (1-p_C)^{u-n_C} \binom{N-u}{n-n_C} p_O^{n-n_C} (1-p_O)^{N-u-n+n_C}.$$

Without loss of generality, we assume that exactly $n_C = n$ consecutive sites are marked and that $U = n$. The probability of such a cluster is then given by

$$f(u=n) = p_C^{n} (1-p_O)^{N-n}.$$

Now consider $U = n+1$, i.e., a cluster with $n$ consecutively marked sites and a single non-marked site placed at one end. We first consider the case in which the non-marked site is at the very beginning; the case in which it is at the extreme end can be argued similarly. In the likelihood equation (1), the binomial coefficient yields $\binom{n+1}{n} = n+1$. However, this count ignores the constraint that the marked sites are consecutive: it arises from placing the non-marked site in every possible position, which clearly violates the above assumption. Under the assumption that the $n$ marked sites are all consecutive and the only non-marked site is at the very beginning, there is only one possible arrangement, so the binomial coefficient is replaced by 1. Hence,

$$f(u=n+1) = p_C^{n} (1-p_C) (1-p_O)^{N-n-1}.$$

Simple algebra shows that $\frac{f(u=n)}{f(u=n+1)} = \frac{1-p_O}{1-p_C} \ge 1$, since $p_C \ge p_O$ by assumption. Following a similar argument, $f(u=n) \ge f(u=l)$ for any $l \ge n$.

Now, to prove the other side, consider $U = n-1$. The interval $[C_S, C_E]$ then contains $n-1$ marked cells (i.e., $n_C = n-1$), and the one remaining marked site falls outside $[C_S, C_E]$. Hence,

$$f(u=n-1) = p_C^{n-1} p_O (1-p_O)^{N-n}.$$

Consider again the ratio $\frac{f(u=n)}{f(u=n-1)} = \frac{p_C}{p_O} \ge 1$. Following a similar argument, $f(u=n) \ge f(u=l)$ for any $l \le n$. Thus the likelihood in equation (1) is maximized when $U = C_E - C_S + 1 = n$, i.e., by a cluster covering exactly the consecutively marked cells.
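The two ratios derived in this proof can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `constrained_likelihood` and the values of `N`, `n`, `p_c` and `p_o` are illustrative assumptions, and the binomial coefficients are set to 1 as the proof does under the consecutiveness constraint.

```python
def constrained_likelihood(u, n_c, n, N, p_c, p_o):
    """Likelihood of a cluster of length u holding n_c of the n marked sites.

    Hypothetical helper: both binomial coefficients are replaced by 1,
    following the consecutiveness argument used in the proof.
    """
    return (p_c ** n_c * (1 - p_c) ** (u - n_c)
            * p_o ** (n - n_c) * (1 - p_o) ** (N - u - n + n_c))


# Illustrative values only (assumed, not taken from the paper); p_c >= p_o.
N, n, p_c, p_o = 50, 6, 0.7, 0.1

f_n = constrained_likelihood(n, n, n, N, p_c, p_o)                # U = n
f_plus = constrained_likelihood(n + 1, n, n, N, p_c, p_o)         # U = n + 1
f_minus = constrained_likelihood(n - 1, n - 1, n, N, p_c, p_o)    # U = n - 1

ratio_plus = f_n / f_plus     # should equal (1 - p_o) / (1 - p_c)
ratio_minus = f_n / f_minus   # should equal p_c / p_o
print(ratio_plus, ratio_minus)
```

With any choice satisfying `p_c >= p_o`, both ratios are at least 1, which is the content of the two inequalities above.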

B Proof of Theorem 3.1

We first note the joint likelihood of $(C_S, C_E)$ given in (2). From this, the marginal pmf of $C_E$ is given by

$$\pi(C_E = c_E) = \sum_{c_S=1}^{c_E} \frac{1}{N(N - c_S + 1)}, \quad \text{for } 1 \le c_E \le N.$$

To prove the asymmetry and monotonicity, consider any integer $x \in [1, N-1]$. Clearly, from the above,

$$\pi(C_E = x+1) = \sum_{c_S=1}^{x+1} \frac{1}{N(N - c_S + 1)} = \sum_{c_S=1}^{x} \frac{1}{N(N - c_S + 1)} + \frac{1}{N(N - x)} \ge \pi(C_E = x).$$

Thus the marginal pmf of $C_E$ is non-decreasing and hence cannot be symmetric. Since the above inequality holds for all $x \in [1, N-1]$, the pmf is monotonically increasing, from which left skewness is immediate.
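The monotonicity argument can be illustrated by tabulating the marginal pmf directly. This is a small sketch under stated assumptions: the function name `marginal_pmf_ce` and the choice `N = 10` are ours, and exact rational arithmetic is used only so that the checks are not affected by rounding.

```python
from fractions import Fraction


def marginal_pmf_ce(N):
    """Return [pi(C_E = 1), ..., pi(C_E = N)] as exact fractions.

    Hypothetical helper: pi(C_E = c_E) is the partial sum over
    c_S = 1..c_E of 1 / (N * (N - c_S + 1)), as in the display above.
    """
    pmf = []
    running = Fraction(0)
    for c_s in range(1, N + 1):
        running += Fraction(1, N * (N - c_s + 1))
        pmf.append(running)  # partial sum up to c_s equals pi(C_E = c_s)
    return pmf


pmf = marginal_pmf_ce(10)  # N = 10 is an illustrative choice
is_increasing = all(a < b for a, b in zip(pmf, pmf[1:]))
print(sum(pmf), is_increasing)
```

Each step adds the strictly positive term $1/(N(N-x))$, so the tabulated values increase with $c_E$ and the mass concentrates toward the right end of the array.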

C Proof of Theorem 3.2

The proof is very similar to that of Theorem 2.1. Denote $U = C_E - C_S + 1$; the joint pmf of the random variables $U$ and $C_S$ is given by

$$f(u) = \binom{u}{n_C} p_C^{n_C} (1-p_C)^{u-n_C} \binom{N-u}{n-n_C} p_O^{n-n_C+\alpha-1} (1-p_O)^{N-u-n+n_C+\beta-2} \frac{1}{N(N - C_S + 1)}.$$

Again, we assume that $n_C = n$ consecutive sites are marked and that $U = n$. The probability of such a cluster is then given by

$$f(u=n) = p_C^{n} p_O^{\alpha-1} (1-p_O)^{N-n+\beta-2} \frac{1}{N(N - l + 1)},$$

where $C_S = l$ for $l = 1, \ldots, N-n+1$. Also note that

$$f(u=n+1) = p_C^{n} (1-p_C) p_O^{\alpha-1} (1-p_O)^{N-n+\beta-3} \frac{1}{N(N - C_S + 1)}.$$

The reason that $\binom{u}{n_C} = \binom{n+1}{n}$ is replaced by 1 in the above is exactly the same as in the proof of Theorem 2.1. Since the above involves $C_S$, the simplification for $U = n+1$ depends on the value assumed by $C_S$. There are two cases for a cluster of length $n+1$, namely $C_S = l$ and $C_S = l-1$. To include one more site in the cluster of size $n$ (i.e., $U = n$), we may keep $C_S = l$ fixed and move $C_E$ by one site, or we may keep $C_E$ fixed and move $C_S$ by one site so that $C_S = l-1$. Of course, the choice $C_S = l-1$ is not available if $l = 1$. Note that in both cases all consecutively marked sites are in one cluster. When $C_S = l$ is kept fixed, it is easy to show that $\frac{f(u=n)}{f(u=n+1)} = \frac{1-p_O}{1-p_C} \ge 1$. For $C_S = l-1$, $\frac{f(u=n)}{f(u=n+1)} = \frac{(N-l+2)(1-p_O)}{(N-l+1)(1-p_C)} > 1$. Following a similar argument, $P(U=n) \ge P(U=k)$ for any integer $k \ge n$.

Now, to prove the other side, consider $U = n-1$. Hence,

$$f(u=n-1) = p_C^{n-1} p_O^{\alpha} (1-p_O)^{N-n+\beta-2} \frac{1}{N(N - C_S + 1)}.$$

This again corresponds to two choices, namely $C_S = l$ and $C_S = l+1$, i.e., either move $C_E$ one site backward or move $C_S$ one site forward. When $C_S = l$ is kept fixed, it is easy to show that $\frac{f(u=n)}{f(u=n-1)} = \frac{p_C}{p_O} \ge 1$. For $C_S = l+1$, $\frac{f(u=n)}{f(u=n-1)} = \frac{(N-l)\,p_C}{(N-l+1)\,p_O}$. Hence $\frac{f(u=n)}{f(u=n-1)} \ge 1$ provided $\frac{p_C}{p_O} \ge \frac{N-l+1}{N-l}$. Following a similar argument, $P(U=n) \ge P(U=k)$ for any integer $k \le n$. Hence equation (4) is maximized when $U = C_E - C_S + 1 = n$. This completes the proof.
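The four ratios in this proof, and the side condition for the case $C_S = l+1$, can be explored numerically. The sketch below is an illustration under assumptions, not the authors' implementation: the function names, the `move_start` flag and all parameter values are hypothetical. The $\alpha$ and $\beta$ terms do not appear because they cancel in each ratio.

```python
def ratio_extend(N, l, p_c, p_o, move_start):
    """f(u=n) / f(u=n+1) for a cluster that starts at C_S = l.

    move_start=False keeps C_S = l and moves C_E forward by one site;
    move_start=True keeps C_E and moves the start to C_S = l - 1.
    """
    ratio = (1 - p_o) / (1 - p_c)
    if move_start:
        if l < 2:
            raise ValueError("C_S = l - 1 is not available when l = 1")
        ratio *= (N - l + 2) / (N - l + 1)
    return ratio


def ratio_shrink(N, l, p_c, p_o, move_start):
    """f(u=n) / f(u=n-1) for a cluster that starts at C_S = l.

    move_start=False keeps C_S = l and moves C_E back by one site;
    move_start=True keeps C_E and moves the start to C_S = l + 1.
    """
    ratio = p_c / p_o
    if move_start:
        ratio *= (N - l) / (N - l + 1)
    return ratio


# Illustrative values only (assumed, not taken from the paper).
typical = ratio_shrink(50, 10, 0.7, 0.1, move_start=True)
# Near the right end with p_c / p_o = 1.5 < (N-l+1)/(N-l) = 2, the
# side condition fails and the ratio falls below 1.
edge = ratio_shrink(50, 49, 0.15, 0.1, move_start=True)
print(typical, edge)
```

The second example shows why the proviso $p_C/p_O \ge (N-l+1)/(N-l)$ is needed: it binds only when the cluster start is close to the end of the array or when $p_C$ is barely larger than $p_O$.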

Published Online: 2015-3-19

Published in Print: 2015-4-1
