Published April 12, 2023 | Version v2
Dataset Open

Host symbiont gene reconciliation supplementary material

  • 1. Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, F-69622 Villeurbanne, France
  • 2. Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR5558, F-69622 Villeurbanne, France, INRIA Grenoble Rhône-Alpes, F-38334 Montbonnot, France

Description

The Cinara aphids dataset was obtained from the data of the article by Manzano-Marin et al. (ISME, 2019); we chose a representative subset of the species present in the gene trees. For the enterobacteria in the gene trees that are not associated with Cinara aphids (and are thus "free living" in this setting), we took the phylogeny from an external source, Annotree (Mendler et al., Nucleic Acids Research, 2019).

The Helicobacter pylori dataset was constructed by Alexia Nguyen Trung by gathering the whole genome sequences available on NCBI that have an assigned geographic population on NCBI or pubMLST.
A phylogenetic tree was built from the concatenation of the universal unicopy genes (322 genes), and a sample of 113 strains representing the diversity of H. pylori in the Old World (excluding strains from the Americas) was selected using Treemmer (Menardo et al., BMC Bioinformatics, 2018).
Then, 6 non-pylori strains were added as an external group (H. hepaticus, H. acinonychis, H. canadensis, H. felis, H. bizzozeronii, H. cetorum).
In this study we considered 1034 gene families, including the 322 universal unicopy families, that contain strains from the external group and from at least 3 continents.
The taleoutput directory contains the recphyloxml outputs of the methods used in the paper. am stands for the approach that amalgamates the universal unicopy genes to construct a strain tree, while nc refers to the results obtained with the tree constructed from the concatenate; the subdirectories then refer to the putative population trees 1, 2, 3 and 4. 0u_0 is the strain gene reconciliation in recphyloxml format. 0upper is the host strain reconciliation in recphyloxml format.

Finally, the last directory contains the simulated dataset. It was generated using Sagephy (https://compbio.engr.uconn.edu/software/sagephy/, https://doi.org/10.1093/bioinformatics/btz081).
For each instance, a host tree, a symbiont tree, and 5 gene trees were generated.

We used the parameters proposed in https://doi.org/10.1145/3307339.3342168 as representative of small (D 0.133, T 0.266, L 0.266), medium (D 0.3, T 0.6, L 0.6) and high (D 0.6, T 1.2, L 1.2) transfer rates, without replacing transfers. The software allows an inter transfer rate to be specified, which is the probability that a gene transfer occurs between different hosts. When a horizontal transfer is drawn during the generation of a gene tree (inside a symbiont tree, and given a host/symbiont reconciliation), it is chosen to be an inter-host transfer with probability equal to the inter transfer rate. An inter transfer rate of 0 therefore yields only intra-host transfers, and a rate of 1 yields only transfers between symbionts in separate hosts.
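The role of the inter transfer rate can be illustrated with a minimal Python sketch. This is not the Sagephy implementation: the function name draw_transfer_kind and the example rate of 0.4 are hypothetical, and the sketch only shows the single Bernoulli draw that decides whether a transfer is inter-host or intra-host.

```python
import random


def draw_transfer_kind(inter_transfer_rate, rng):
    """Return "inter" with probability inter_transfer_rate, otherwise "intra".

    Hypothetical helper for illustration only; it does not pick the actual
    donor or receiver branch of the transfer.
    """
    if not 0.0 <= inter_transfer_rate <= 1.0:
        raise ValueError("inter_transfer_rate must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so a rate of 0 never gives "inter"
    # and a rate of 1 always does.
    return "inter" if rng.random() < inter_transfer_rate else "intra"


rng = random.Random(0)

# Boundary cases described in the text.
only_intra = {draw_transfer_kind(0.0, rng) for _ in range(1000)}
only_inter = {draw_transfer_kind(1.0, rng) for _ in range(1000)}

# Intermediate rate (0.4 is an arbitrary example value, not one from the dataset).
kinds = [draw_transfer_kind(0.4, rng) for _ in range(10_000)]
inter_fraction = kinds.count("inter") / len(kinds)
```

With an intermediate rate, the observed fraction of inter-host transfers approaches the rate itself as the number of draws grows.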


We constructed two simulated datasets: one combining different rates for the DTL parameters (varrates), and one with only medium rates but with different proportions of "inter" and "intra" transfers (coevol).
For the first dataset, we used all 9 combinations of small, medium and high rates for the symbiont generation and the gene generation, with only intra-host gene transfers (i.e. an inter transfer rate of zero).
For the second dataset, we used only medium rates for both symbiont and gene generation, with 6 inter transfer rates ranging from 0 to 1.
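The 9 rate combinations of the varrates dataset can be enumerated as follows. This is a sketch for orientation only: the D, T and L values are those quoted above, while the dictionary layout and key names are assumptions made for this example. The 6 inter transfer rates of the coevol dataset are not enumerated because their exact values are not listed in this description.

```python
from itertools import product

# Duplication (D), transfer (T) and loss (L) rates quoted in the description.
RATE_SETS = {
    "small": {"D": 0.133, "T": 0.266, "L": 0.266},
    "medium": {"D": 0.3, "T": 0.6, "L": 0.6},
    "high": {"D": 0.6, "T": 1.2, "L": 1.2},
}

# One entry per (symbiont generation, gene generation) pair of rate levels.
# The key names below are hypothetical, chosen for this illustration.
varrates_grid = [
    {
        "symbiont_level": symbiont_level,
        "gene_level": gene_level,
        "symbiont_rates": RATE_SETS[symbiont_level],
        "gene_rates": RATE_SETS[gene_level],
        "inter_transfer_rate": 0.0,  # varrates uses intra-host transfers only
    }
    for symbiont_level, gene_level in product(RATE_SETS, repeat=2)
]

for setting in varrates_grid:
    print(setting["symbiont_level"], setting["gene_level"])
```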

For both datasets, and for each set of rates, we generated 50 instances consisting of 1 host tree with 100 leaves, 1 symbiont tree and 5 gene trees, each generated in the pruned version of the previous tree (branches that do not reach the present are pruned before the next tree is generated). We then kept each host leaf with a probability of 0.08 to simulate non-exhaustive sampling, resulting in host trees with an average size of 8 leaves.
This resulted in 399 instances for the first dataset and 226 instances for the second one, with at least 29 instances of 5 genes for each set of parameters.
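The leaf sampling step and the expected host tree size (100 leaves kept with probability 0.08 gives 100 × 0.08 = 8 leaves on average) can be checked with a short sketch. The function name sample_leaves and the leaf labels are hypothetical, and the sketch samples leaf names only; restricting the tree to the kept leaves is not shown.

```python
import random


def sample_leaves(leaves, keep_probability, rng):
    """Keep each leaf independently with the given probability."""
    return [leaf for leaf in leaves if rng.random() < keep_probability]


rng = random.Random(42)
host_leaves = [f"host_{i}" for i in range(100)]  # placeholder leaf labels

# Repeat the sampling many times to estimate the average number of kept leaves.
sizes = [len(sample_leaves(host_leaves, 0.08, rng)) for _ in range(2000)]
mean_size = sum(sizes) / len(sizes)
print(f"average number of sampled host leaves: {mean_size:.2f}")
```

Individual samples vary around this average, so some sampled host trees are noticeably smaller or larger than 8 leaves.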

Each directory is a simulation instance: varrates_k_l_i corresponds to simulation number i with lower rate k and upper rate l. The genes, symbiont and species directories contain the gene, symbiont and host trees in Newick format. Gene trees are unrooted. The lower matching is a matching between gene and symbiont leaves, the upper matching is between symbiont and host leaves, and gene_host_matching is a matching between gene and host leaves. The transfer list contains one file per gene listing all simulated transfers, with the donor and receiver symbiont internal nodes.
Each instance also contains the output of tale used in the paper for the three heuristics: the 2-level symbiont gene heuristic (_2l), the 3-level sequential heuristic (_dec) and the 3-level Monte Carlo approach (_mc) with 50 iterations. All runs used 5 rounds of parameter estimation (the default usage). The 0u_0 files contain a sampled symbiont gene reconciliation in recphyloxml and, for the 3-level heuristics, 0upper contains the corresponding host symbiont reconciliation in recphyloxml (for the Monte Carlo approach there are multiple ones, i_upper and iu_0 for i from 0 to 49).
A known error in the recphyloxml transcription script, now corrected, introduced errors in some of the recphyloxml files: for some transfers, the species indicated as the receiver is actually the donor. Redundancy in the format makes it possible to recover the information, because the species of the next event of the gene is the receiver species. We chose to leave the files as they are instead of relaunching all the computations, as this has no impact on the figures and results presented in our article (which were mostly constructed using the freq files).
The freq files contain the frequencies of the different events summed over the gene symbiont reconciliations. lower_log_likelihood contains the log likelihood of the host and symbiont trees knowing the genes (the probability of the genes knowing the host and symbiont).
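The recovery of the receiver species from the next event of the gene can be sketched in Python as follows. This is an illustration under assumptions, not a script shipped with the dataset: the element and attribute names (eventsRec, transferBack, destinationSpecies, speciesLocation) follow the general recPhyloXML format and should be checked against the actual files, the function names are hypothetical, and the inline sample is made up for this example.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: the transferBack event wrongly names the donor (S1),
# while the next event of the same gene sits in the true receiver (S2).
SAMPLE = """
<recGeneTree>
  <phylogeny rooted="true">
    <clade>
      <name>g_root</name>
      <eventsRec><branchingOut speciesLocation="S1"/></eventsRec>
      <clade>
        <name>g_a</name>
        <eventsRec><leaf speciesLocation="S1"/></eventsRec>
      </clade>
      <clade>
        <name>g_b</name>
        <eventsRec>
          <transferBack destinationSpecies="S1"/>
          <leaf speciesLocation="S2"/>
        </eventsRec>
      </clade>
    </clade>
  </phylogeny>
</recGeneTree>
"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}' from a tag, if present."""
    return tag.rsplit("}", 1)[-1]


def receivers_from_next_event(xml_text):
    """Return (stated_receiver, receiver_from_next_event) for each transfer."""
    root = ET.fromstring(xml_text)
    results = []
    for elem in root.iter():
        if local_name(elem.tag) != "eventsRec":
            continue
        events = list(elem)
        # Pair each event with the one that follows it in the same eventsRec.
        for current, following in zip(events, events[1:]):
            if local_name(current.tag) == "transferBack":
                stated = current.get("destinationSpecies")
                recovered = following.get("speciesLocation")
                results.append((stated, recovered))
    return results


for stated, recovered in receivers_from_next_event(SAMPLE):
    status = "consistent" if stated == recovered else "use next event"
    print(stated, recovered, status)
```

When the stated and recovered species differ, the species of the next event is the one to use as the receiver, as explained above.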

Files

host_symbiont_gene_reconciliation.zip

Files (215.6 MB)

md5:c70adf0d3ae2be229c78a08d2d08bc02