RhierBAPS: An R implementation of the population clustering algorithm hierBAPS

Identifying structure in collections of sequence data sets remains a common problem in genomics. hierBAPS, a popular algorithm for identifying population structure in haploid genomes, has previously only been available as a MATLAB binary. We provide an R implementation which is both easier to install and use, automating the entire pipeline. Additionally, we allow for the use of multiple processors, improve on the default settings of the algorithm, and provide an interface with the ggtree library to enable informative illustration of the clustering results. Our aim is that this package aids in the understanding and dissemination of the method, as well as enhancing the reproducibility of population structure analyses.


Introduction
Identifying sub-populations in collections of genetic sequences is a common problem in population genetics, molecular ecology, epidemiology and microbiology. In general, the aim of genetic clustering algorithms is to identify separate panmictic clusters within a broader, more heterogeneous population. In large sequence data sets, it is helpful to identify smaller subpopulations which can be further analysed for associations with particular phenotypes as well as recombination 1,2 , as long as potential biases introduced through taking clusters from a larger population are taken into account 3 .
A frequently used model assumes that each individual sequence is drawn from one of K distinct subpopulations with each cluster having its own set of allele frequencies. The aim is then to identify which cluster each sequence originates from and the corresponding allele frequencies within that cluster.
There are a number of methods for solving this problem including STRUCTURE 4,5 , snapclust 6 and BAPS (Bayesian Analysis of Population Structure 7-10 ). The BAPS algorithm 9,11 is distinct in that it attempts to estimate the partition of individual sequences into clusters directly by analytically integrating over the allele frequencies parameters for each subpopulation. This allows for the latent number of underlying subpopulations, K, to be estimated as part of the model fitting procedure. The hierBAPS algorithm extends this approach by enabling the investigation of a population at multiple resolutions. This is achieved by initially clustering the entire dataset using the BAPS algorithm before iteratively applying the algorithm to each of the resulting clusters.
Similar to other approaches 4 , BAPS assumes that alleles are drawn independently from a multinomial distribution with a Dirichlet prior. However, unlike STRUCTURE, which uses Gibbs sampling to estimate the posterior distribution, BAPS attempts to find the partition of the data S that maximises the posterior probability of an allocation over all other possible allocations. A partition S is defined as the allocation of each sequence to one of K possible clusters. The maximum possible value of K is given in the hierBAPS algorithm. Here S indicates the set of all possible partitions with up to K max clusters. The hierBAPS algorithm attempts to choose S to maximise where P(data|S) is the marginal likelihood of having the allele frequency parameters analytically integrated out leading to where n i jl is the count for allele l at locus j in cluster i and α i jl is the corresponding hyper-parameter for the Dirichlet prior. The BAPS algorithm attempts to find the partition S that maximizes the posterior probability using a greedy stochastic search approach. A discretised uniform distribution of the cluster size K (K = 1,…, K max ) is used in hierBAPS to provide the prior probability of each partition P(S). The Dirichlet hyperparameters are set at is the number of of distinct alleles at locus j.
Currently, hierBAPS is only available as a MATLAB binary, which can be both difficult to install and use as separate runtime libraries are generally needed for different OS versions for MacOS X, Windows and Linux systems. The documentation is also lacking, making it difficult for less computationally experienced researchers to use. There is currently no clear guide on how to use the output of the MATLAB binary to produce informative plots for interpretation. Whilst there are other algorithms available to cluster genetic data in R, such as snapclust 6 and DAPC 12 , neither make use of the partition approach used in BAPS. By providing an R implementation of hierBAPS, we aim to increase its usability and the reproducibility of analyses using the software.

Implementation
RhierBAPS is implemented in the R language 13 . The core program relies upon the R packages ape 14 , dplyr, gmp, purrr and ggplot2. Additional plotting functionality makes use of ggtree 15 and phytools 16 . The structure of the code is very similar to the original MATLAB code and has similar runtimes. The development version of the package can be installed using devtools.
install.packages("devtools") d e v t o o l s : : i n s t a l l _ g i t h u b ( " g t o n k i n h i l l / rhierbaps") Unlike the MATLAB version, rhierBAPS by default only considers SNP loci that have a minor allele in at least two sequences. This has been found to improve the results of the analysis as although singleton SNPs are important when constructing phylogenies they introduce noise into the model used in hierBAPS leading to poorer quality clusterings. It is currently recommended that singletons SNPs are removed before running the MATLAB version of the software.

Operation
RhierBAPS can be installed on any computer where R versions 3.5 and above can be installed. The package can be run using just a few lines of R code where the variable "fasta.file.name" should be replaced with the location of the FASTA formated multiple sequence alignment of the sequences of interest.

Use cases
RhierBAPS requires a multiple sequence alignment in FASTA format. In all examples we make use of sequences from the Bacillus cereus Multi Locus Sequence Typing website (https://pubmlst.org/bcereus/) 17 . The sequences used are included as part of the R package.
The algorithm requires an initial number of clusters to be set which should be higher than the maximum number of expected clusters in the dataset. If a dataset is likely to contain many distinct lineages, for example, if there are many samples from many locations, then a higher initial number of clusters should be set. Conversely, if the samples are from only a small number of sites and little variation is expected then a smaller initial cluster size can be set. To get an idea of a good initial cluster size, agglomerative clustering with complete linkage using pairwise SNP distances can be performed initially. The number of levels over which clustering should be performed is also required as input to the algorithm.
In the preceding example, we ran rhierBAPS with 20 initial clusters at two clustering levels. Additional parameters that can be set include the number of cores to use and whether the program should generate progress information. The hierBAPS function generates a data frame indicating the assignment of sequences to clusters at each level. This, along with the marginal log likelihoods can be saved to file.
write.csv(hb.results$partition.df, file = "hierbaps_partition.csv", col.names = TRUE, row.names = FALSE) save_lml_logs(hb.results, "hierbaps_logML.txt") Finally, as the program is written in R we are able to take advantage of the excellent plotting capabilities available. Given a phylogenetic tree generated using IQTREE 18 with model selection 19 using the command iqtree −s, we can annotate it with the BAPS clusters using ggtree 15 (Figure 1).

Summary
Clustering is an essential component of many genetic analysis pipelines. We have presented rhierBAPS, an R package that implements the hierBAPS algorithm for clustering genetic sequence data. It is both easy to install and use, whilst providing additional plotting capabilities and the ability to run using multiple cores. We believe it will aid in the reproducibility of population structure analysis.

Software availability
The package is available on CRAN: https://cran. The identification of groups (or clusters) with genetic or genomic data is one the basic questions asked by population geneticists. Several methods exist with various software implementations. The software described in this note is a valuable addition to the set of R packages for population genetics and related fields. I greatly appreciate to have this method implemented in R. The analytical integration is a really great feature and that makes this method attractive.
I tried the package and the examples run very smoothly as expected. I also tried the main function with a small DNA alignment (the 'woodmouse' data in ape) and the results made sense. The graphical tools provided with the package are helpful, and the results output as a list in R make easy to use a custom graphical display. For instance, I was able to make my own display with functions in ape in just four lines of code.
I have a few comments or suggestions that the Authors may find useful for future versions of their article and/or package.
At present, it seems that the package handles SNP data. Does this mean that only biallelic DNA loci can be analysed? Can other types of biallelic genetic data be handled? What if more than two alleles are observed at a DNA site?
One suggestion for future developments would be to better integrate with other data classes, particularly from ape and adegenet which are more and more widely used. Also, ape has now efficient links with the data classes used in BioConductor, which makes possible to integrate a wide range of approaches (bioinformatics, genomics, phylogenetics, population genetics) within R.
It seems that the present package does not calculate the individual relative probabilities of assignment to the different clusters as done in Corander . . This might be a valuable addition for future versions, and et al it would help to compare the results from different methods, for instance using the nice compoplot function in adegenet.