Coinfinder: detecting significant associations and dissociations in pangenomes

The accessory genes of prokaryote and eukaryote pangenomes accumulate by horizontal gene transfer, differential gene loss, and the effects of selection and drift. We have developed Coinfinder, a software program that assesses whether sets of homologous genes (gene families) in pangenomes associate or dissociate with each other (i.e. are ‘coincident’) more often than would be expected by chance. Coinfinder employs a user-supplied phylogenetic tree in order to assess the lineage-dependence (i.e. the phylogenetic distribution) of each accessory gene, allowing Coinfinder to focus on coincident gene pairs whose joint presence is not simply because they happened to appear in the same clade, but rather that they tend to appear together more often than expected across the phylogeny. Coinfinder is implemented in C++, Python3 and R and is freely available under the GNU license from https://github.com/fwhelan/coinfinder.


2.
A list of the Identifiers of the genomes used within as well as all input/output files are available at https:// github. com/ fwhelan/ coinfinder-manuscript.

InTRoDuCTIon
Pangenomes consist of core genes, common across all strains of a species, and accessory genes that are present in some but not all strains [1]. Accessory genes by definition are not essential to every strain of a species. Accessory genes are often pathogenicity islands, or associated with niche adaptation, or defence from predation, and so forth [2]. Why some members of a species might have some of these genes, while others do not is subject to debate [3,4]. It is likely that some genes co-occur, or associate, because they positively influence each other's fitness in a particular set of host genomes. Similarly, we expect some genes to avoid, or dissociate with one another because their co-occurrence produces a negative fitness effect. We expect that genes whose products function together in a biochemical pathway, or that can combine to form a useful heteromeric protein complex, will appear together in the same genome more often than their observed frequency in the dataset would predict. For example, MYD88 consistently co-occurs with the genetic components of the MYD88-dependent TLR-signalling pathway in vertebrate species [5]. In contrast, genes that produce a toxic by-product when they are expressed in the same cell, or that perform the same function and therefore induce functional redundancy, are expected to appear together less often than their observed frequency in the dataset would predict. This is seen, for example, with siderophore biosynthetic gene clusters in Salinispora spp. where an isolate either has one iron-chelating siderophore or a different non-homologous system, but never both [6]. As a first step towards understanding these kinds of gene-to-gene interactions in the accessory pangenome, it is useful to identify genes that appear together or that avoid one another significantly more often than would be expected by chance.
Previously established methodology can identify various forms of co-occurrence patterns in prokaryotes. For example, many tools (e.g. [7][8][9]) and tool comparisons [10] are available for the identification of species-species co-occurrence patterns in microbial communities. For example, the program SparCC identifies correlations in compositional data, including species presence-absence patterns within microbial communities [11]. Other tools, such as NetShift [12], find differences in species association networks of microbial communities across datasets (e.g. healthy versus diseased states). Similarly, methods have been established to identify associations between genotypic and phenotypic traits in pangenomes (i.e. gene-trait co-occurrence). Usually called pangenome genome-wide association studies (pan-GWAS), tools such as bugwas [13] and Scoary [14] compare components of the pangenome to a user-provided list of phenotypic traits. New methods such as SpydrPick [15] identify single nucleotide polymorphisms (SNP)-SNP co-occurrence patterns by comparing SNPs in multiple sequence alignments of proteins in microbial population genomic datasets.
A few approaches have focussed on gene-gene co-occurrence. Pantagruel [16] uses gene and species trees to identify genes which have similar patterns of gain and loss in a pangenome to define co-evolved gene modules. Similarly, CoPAP [17] searches for correlated patterns of gene gain and loss across a species tree to find co-evolutionary interactions of clustered orthologous groups (COGs). While conceptually similar to Coinfinder, these methodologies are based on phyletic patterns; further, the dissociation of genes is not considered by either method. The most similar method to Coinfinder in concept is the identification of correlogs and anti-correlogs, genes which favour or disfavour co-occurrence within a genome, by Kim and Price [18]. However, this method was not packaged into publicly available software and was not coupled with the pangenome concept, instead focusing on global patterns of gene associations across the bacterial Domain.
Currently, the identification of gene-gene coincident patterns is not part of pangenome analyses tools. Pangenome pipelines -such as Roary [19], PIRATE [20], or Pandora (https:// github. com/ rmcolq/ pandora) -cluster open reading frames (ORFs) into homologous gene clusters and report a presence-absence matrix of these clusters in relation to each input genome. These pipelines also generate statistics as to the numbers of core and accessory genes, a core gene alignment (from which a phylogeny can be determined), and the distribution of genes across genomes; however, these pipelines fail to determine statistically significant gene-gene relationships.
Here, we present Coinfinder, a command line software program that identifies coincident (associating or dissociating) genes across a set of input genomes. Coinfinder can run in any Unix environment using a user-specified number of processing cores. Coinfinder can be used to investigate the structure of strain-or species-pangenomes and is not restricted to prokaryote or eukaryote genomic input.

THEoRy AnD ImpLEmEnTATIon Input
Coinfinder accepts genome content data in one of two formats: (a) the gene_ presence_ absence. csv output from Roary [19]; or (b) as a tab-delimited list of the genes present in each strain. If option (b) is used, genes should be clustered into orthologous groups/gene clusters prior to using Coinfinder (for example, using blast [21] and a clustering algorithm, such as MCL [22,23]). Additionally, Coinfinder requires a Newick-formatted phylogeny of the genomes in the dataset. We suggest that this phylogeny can be constructed using the core genes from the input genomes as produced using programs such as Roary, or using ribosomal RNA genes, or a similar approach [24].
Heatmap images (R, ggplot2 [35], ggtree [36]) of the presence-absence patterns of coincident components across input genomes. The heatmap is split across multiple files when needed for ease of visibility

Impact Statement
Coinfinder identifies genes that co-occur (associate) with or avoid (dissociate) each other across the accessory genomes of a pangenome of interest. Genes that associate or dissociate more often than expected by chance, suggest that those genes have a connection (attraction or repulsion) that is interesting to explore. Identification of these groups of genes will further the field's understanding of the importance of accessory genes. Coinfinder is a freely available, open-source software, which can identify gene patterns locally on a personal computer in a matter of hours.     (Fig. 1a). Six of these genes were correctly labelled with their gene names via the Prokka/Roary pipeline; one gene was given an alternative gene name often used as a synonym in the literature; a further two genes were listed as 'hypothetical proteins'. Collectively, the nine genes that compose the V-ATPase/ntp operon form cliques with an additional 51 genes. These cliques are shown as a network (b) and as a presence-absence heatmap (c). In the heatmap, unlabelled gene columns represent unnamed hypotheticals.
gene i and gene j are observed together or apart in the input genomes more often than would be expected by chance.
As a pre-processing step, the input gene set is culled for high-and low-abundance genes. Genes present in every genome (i.e. core genes) are removed as they cannot statistically associate or dissociate (i.e. be coincident with) another gene more or less often than expected. Similarly, genes whose presence is constrained to a small number of genomes will not produce significant associations, therefore low-abundance genes can be removed from the input at a user-determined cutoff. Coinfinder's default is to remove any gene present in less than 5 % of the input genomes.
Coinfinder has two modes for identifying coincident relationships: association and dissociation. When testing for gene associations, Coinfinder evaluates whether gene i and gene j of a given gene pair are observed together in the input genomes more often than would be expected by chance. More formally, for a set of genomes N, we define the probability of observing gene i as In each mode, Coinfinder's default behaviour is to use a Bonferroni-corrected binomial exact test statistic (adapted from https:// github. com/ chrchang/ stats) of the expected and observed rates to evaluate whether each gene pair are significantly coincident with each other.
Coincident genes that share an evolutionary history are more likely to have indirect correlations with each other. For example, if two genes are found to associate and each is observed only within a particular clade, the most parsimonious explanation for the observation is that the last common ancestor of the clade obtained both genes at the same evolutionary step. These two genes may, or may not, have a functional relationship with one another, and are of potential interest. However, non-monophyletic -or lineage-independent -genes that are dispersed throughout a phylogeny and are found to be significantly coincident are more likely to have a direct relationship with each othertheir patchy phylogenetic distribution, combined with their statistically significant rate of association is prima facie evidence that they interact in some way, prefer a particular ecological niche, or have some other direct association with each other. Thus, Coinfinder focuses on identifying coincident relationships between lineage-independent accessory genes. To do this, Coinfinder uses a previously established phylogenetic measure of binary traits (D, as coded into the R function phylo.d [25]) to determine the lineage-dependence of each coincident gene. D is a measure of phylogenetic signal strength of a binary trait, which quantifies the amount of dispersion of the trait -here, the presence of a gene -over a phylogenetic tree [25]. Coinfinder does not implement a particular threshold for lineage independence but instead reports the D value of each gene in the output for the user to consider in conjunction with their input phylogeny. The calculations for gene association/dissociation and lineage independence are conducted independently of one another.

output
Coinfinder visualizes the results of its analysis in two ways. First, Coinfinder produces a network in which each node is a gene family and each edge is a statement of significant gene association (corrected for lineage effects) or significant gene dissociation. The size of a node is proportional to the gene's D value. Second, Coinfinder generates a presence-absence heatmap, indicating the presence of coincident genes in the context of the input phylogeny. The genes in the heatmap are ordered by D value (from most lineage-independent to least) and are coloured according to coincident patterns.
Coinfinder produces a number of output files, with the default prefix of coincident_, as described in Table 1. Examples of the network and heatmap outputs of Coinfinder are shown in Fig. 1.

RESuLTS
As an example, Coinfinder was executed using 534 Streptococcus pneumoniae genomes as input, a subset of the Global Pneumococcal Sequencing Project (GPS; https://www. pneumogen. net/ gps/) whose ORFs were identified using Prokka [26] and clustered into orthologous gene families using Roary [19]. Coinfinder took 7.2 min (using 20 cores; see Table 2 for more runtime details) to examine the relationships between 2 813 gene families across 534 genomes (3 957 891 pairwise tests in total). Coinfinder identified 104 944 associating gene pairs, which clustered into 32 connected components or sets of genes that associate with each other. Similarly, Coinfinder took 7.5 min using 20 cores to identify 98 461 dissociate gene relationships within this dataset. The network and heatmap outputs of Coinfinder from this example set are shown in Fig. 1.
Although the availability of sequenced genomes is increasing rapidly, it is still rare to have access to such a large specieslevel pangenomic dataset. As such, the user could consider analyses at the genus-or family-level to increase dataset size. In order to identify the effect of input dataset size on Coinfinder's ability to identify gene-gene associations, we randomly subsetted the 534 genome S. pneumoniae dataset into datasets sized between 400 and 50 genomes ( Table 3). Analyses of these data with Coinfinder returned less genegene associations than observed in the full dataset, and the number of associations observed decreased substantially with smaller numbers of genome inputs, culminating with no associations identified with an input of 50 genomes. Although this provides an estimate of the necessary number of input genomes to Coinfinder, it should be noted that the power of Coinfinder will vary based on the average number of genes per genome as well as the diversity of genes within the dataset (i.e. the 'openness' of the pangenome).
Of the gene associations and dissociations that Coinfinder identified, many are in line with previous investigations of S. pneumoniae pangenomes. For example, we identify a large number of associations between widely dispersed genes, which agrees with evidence that S. pneumoniae has an extensive set of 'soft core' genes in its pangenome [27]. Further, many genes involved in coincident relationships are lineage independent, which is expected given the high natural competency of the species and, therefore, affinity for horizontal gene transfer events [28]. As an example, we focus on a V-ATPase present in S. pneumoniae. V-ATPases are enzymes which transport protons across the cell membrane in a process which hydrolyses ATP [29]. V-ATPases are intricate protein complexes, providing an excellent use case for Coinfinder's potential to identify the genes expected to co-occur as part of a multi-protein enzyme. While the V-ATPase in S. pneumoniae has been understudied, it has been welldocumented in S. pyogenes and sister taxon Enterococcus hirae [29,30]. In E. hirae the V-ATPase consists of 10-11 proteins organized into the ntp operon: ntpFIKECGABD(H) J [29]. In S. pneumoniae, the V-ATPase complex is predicted to contain nine proteins (KEGG pathway spx_M00159 [29]). In the annotation of S. pneumoniae that we performed here, only six genes of the ntp operon were annotated successfully: ntpA, ntpB, ntpC, ntpD, ntpG and ntpK. Coinfinder identified consistent co-occurrence relationships between these six genes, forming a clique (i.e. a complete subgraph of gene associations; Fig. 2a). However, these six genes also co-occurred with other genes in the dataset; we extended our analyses to determine whether any other genes consistently co-occurred with all six genes of this operon. In doing so, we identified three genes -atpE, and two unnamed geneswith homology to ntpE, ntpI and ntpG/H, respectively, that consistently co-occur with the rest of the ntp operon (Fig. 2a). An additional 51 genes formed cliques with the genes of the ntp operon. Of the 51 genes, three encode neuraminadase genes from nan gene clusters (Fig. 2b-c). Another three genes co-occurring with the V-ATPase complex belong to the dpnMAB operon which encode the DpnII system implicated in DNA transformation (among other functions) [31] and an additional three are homologous to transposase IS66-related domains, perhaps suggesting how this operon has been horizontally transferred in this species (Fig. 2b-c). Additionally, four of these proteins contained a putative cell wall binding repeat ('CW_binding_1') which has been implicated in choline binding [32]. Choline-binding proteins (CBPs) contain a choline-binding module/domain which allows them to bind to the cell wall of S. pneumoniae, functioning as essential elements of cell division, as well as strong determinants of virulence [32,33]. It is unknown why four CBPs co-occur with the V-ATPase complex; in eukaryotes, it has been shown that acetylcholine can be transmitted via the V-ATPase complex of vacuoles [34] but the result has not been generalized to prokaryotic cell membranes. A further 11 genes are of uncharacterized function. This example shows the power of Coinfinder in (a) identifying gene associations between proteins in a known protein complex; (b) being able to overcome poor gene annotations by looking for patterns in gene co-occurrence and gene association networks; and (c) being able to extrapolate those results to other genes with known protein interactions.
Coinfinder uses parallel processing to compute pairwise tests of coincident relationships. The most time-consuming step is the determination of the lineage-dependence of each gene; consequently, we have programmed this part of the software to run in parallel for only those genes that are found in statistically significant coincident relationships. For the S. pneumoniae example, using the input set of 2 813 accessory gene families, the lineage-dependence calculation was only necessary on the 1 961 genes deemed to be in coincident relationships. Using these data, the computation time varied from 6 to 31 min when using 32 to 2 CPUs, respectively ( Table 2).

Conclusions
Coinfinder is an accurate and efficient tool for the identification of coincident gene relationships within pangenomes. Coinfinder is open-source software available from https:// github. com/ fwhelan/ coinfinder.