Evolutionary history of dimethylsulfoniopropionate (DMSP) demethylation enzyme DmdA in marine bacteria

Dimethylsulfoniopropionate (DMSP), an osmolyte produced by oceanic phytoplankton and bacteria, is primarily degraded by bacteria belonging to the Roseobacter lineage and other marine Alphaproteobacteria via DMSP-dependent demethylase A protein (DmdA). To date, the evolutionary history of DmdA gene family is unclear. Some studies indicate a common ancestry between DmdA and GcvT gene families and a co-evolution between Roseobacter and the DMSP-producing-phytoplankton around 250 million years ago (Mya). In this work, we analyzed the evolution of DmdA under three possible evolutionary scenarios: (1) a recent common ancestor of DmdA and GcvT, (2) a coevolution between Roseobacter and the DMSP-producing-phytoplankton, and (3) an enzymatic adaptation for utilizing DMSP in marine bacteria prior to Roseobacter origin. Our analyses indicate that DmdA is a new gene family originated from GcvT genes by duplication and functional divergence driven by positive selection before a coevolution between Roseobacter and phytoplankton. Our data suggest that Roseobacter acquired dmdA by horizontal gene transfer prior to an environment with higher DMSP. Here, we propose that the ancestor that carried the DMSP demethylation pathway genes evolved in the Archean, and was exposed to a higher concentration of DMSP in a sulfur-rich atmosphere and anoxic ocean, compared to recent Roseobacter eco-orthologs (orthologs performing the same function under different conditions), which should be adapted to lower concentrations of DMSP.


INTRODUCTION 58
Dimethylsulfoniopropionate (DMSP) is an osmolyte synthesized by oceanic phytoplankton 59 (Galinski, 1995;Yoch, 2002). This molecule became abundant in the oceans 250 million years ago 60 (Mya), coinciding with the expansion and diversification of dinoflagellates (Bullock et al., 2017). 61 Since then, it has played an important role in the biogeochemistry of sulfur cycle on Earth 62 (Lovelock, 1983). DMSP is the main precursor of the climate-relevant gas dimethylsulfide (DMS; 63 and hosted in the MarRef database (Klemetsen et al., 2018). The sequences were obtained as 136 described by González et al. (2019). The DmdA homologs included were obtained using a HMM 137 designed for DmdA orthologs (González et al., 2019), with a relaxed maximum e-value (e-50). A 138 total of 204 sequences from 184 genomes were used to infer the evolutionary history of DmdA gene 139 family (Supplementary Table 1). 140 141 142

Phylogenetic tree reconstruction and topology tests 143
The phylogenetic tree of the DmdA protein sequences included DmdA orthologs and DmdA 144 homologs (called non-DmdA). The sequences were aligned using MUSCLE (Edgar, 2004).  (Stamatakis, 2006) were used to generate 100 ML bootstrap trees, using the Le Gascuel (LG) model 152 with a discrete gamma distribution (+G) with four rate categories, as this was the model with the 153 lowest Akaike information criterion and Bayesian information criterion score. For the Bayesian 154 analysis, trees were constructed using the PhyloBayes program (Lartillot & Philippe, 2004155 Lartillot et al., 2007) with the CAT model that integrates heterogeneity of amino acid composition 156 across sites of a protein alignment. In this case, two chains were run in parallel and checked for 157 convergence using the tracecomp and bpcomp scripts provided in PhyloBayes. As an alternative, 158 we computed a phylogenetic tree using a Bayesian inference implemented in BEAST2 program 159 which was run with relaxed clock model and Birth Death tree prior (Bouckaert et al., 2014). Finally, 160 we used R v3.6.1 (R Core Team, 2017) with phangorn v2.5.5 (Schliep, 2011) to perform consensus 161 unrooted tree. 162 163 We ran several topology tests to establish whether the trees generated using the ML and BI methods 164 provided an equivalent explanation for the two main groups, i.e., the non-DmdA and DmdA clades. 165 pervasive and episodic selection during the evolution of dmdA orthologs. Likelihood-ratio tests 238 (LRTs) were used to compare models, and significant results (p-value<0.05) were determined 239 contrasting with a chi-square distribution (chisq) (Anisimova et al., 2001). 240

241
In the site-specific analysis, we tested for variability of selection (type and magnitude) across the 242 codons of the gene using three pairs of nested models. The first pair includes M0 (just one dN/dS 243 ratio) and M3 ("K" discrete categories of dN/dS) and has four degrees of freedom (df For the last set of models, we identified sites that have been under positive selection at a particular 259 point of evolution using branch-site models, in which dN/dS can vary among sites and among 260 branches (Zhang, 2005). We computed two models: a null model, in which the "foreground branch" 261 may have different proportions of sites under neutral selection to the "background branches", and 262 an alternative model in which the "foreground branch" may have a proportion of sites under 263 positive selection. We compare these models for each terminal branch with a LRT of one df. For 264 each branch-site analysis, we applied the Bonferroni correction for multiple testing. 265

266
In site and branch-site tests, we identified sites under positive selection as those with Bayes 267 Empirical Bayes (BEB) posterior probability above the 0.95 (Yang, 2005). We also checked for 268 convergence of the parameter estimates in PAML by carrying out at least two runs for each tree and 269 starting the analysis with different ω (0.2, 1, 1.2 and 2). In addition, to test for convergent selection 9 in several lineages, we ran at Branch-site analysis selecting as "foreground branches" all those whether selective pressures diverged following duplication that led to dmdA and non-dmdA genes 277 (Bielawski & Yang, 2004). We compared the M3 model, which accounts for ω variation among 278 sites but not among branches or clades, with a model allowing a fraction of sites to have different ω 279 between two clades of a phylogeny (clade model D). We also tested M0 and M3 models and we 280 used a posterior BEB probability above the 0.95 to identify sites evolving under divergent selective 281 pressures. We checked for convergence of the parameter estimates in PAML by carrying out at least 282 two runs for the tree and starting the analysis with different ω (0.1, 0.25, 2, 3 and 4). 283 284 Finally, we applied two branch-site models (as described above) to test dN/dS differences on the 285 branches representing the ancestral lineages of the DmdA and non-DmdA clades (see results) 286 ( Supplementary Fig 25). We considered the ancestral sequences from DmdA and non-DmdA clades 287 as foreground branches in two different models. 288 289 290

Reconstruction of ancestral DmdA sequence 291
To reconstruct the ancient conditions where dmdA gene prospered, we inferred the ancestral 292 sequences of the DmdA node using the FastML web server (Ashkenazy et al., 2012) and then 293 computed estimated physico-chemical properties on predecessor sequence using Compute 294 ProtParam tool from Expasy -SIB Bioinformatics Resource Portal (Gasteiger et al., 2005). 295 Moreover, we also reconstructed the ancestral sequence of the non-DmdA node, as well as the 296 ancestral sequence of both the DmdA, and the non-DmdA families. FastML was run considering the 297 alignment of proteins and the ML phylogenetic tree for those DmdA orthologs or homologs inferred 298 as we explained above. Posterior amino acid probabilities at each site were calculated using the Le 299 Gascuel (LG) matrix (Le & Gascuel, 2008) and Gamma distribution. Both marginal and joint 300 probability reconstructions were performed. Protein sequences resulting from marginal 301 reconstructions were used to predict tertiary structure (see below) as well as to identify family 302 domains using Pfam v32 (Finn et al., 2010). 303 304 Protein tertiary structure analysis 306 Predicted three-dimensional structures of protein sequences were examined by Iterative Threading 307 ASSEmbly Refinement (I-TASSER) (Roy et al., 2010;Yang et al., 2015). First, I-TASSER uses 308 local meta-threading-server (LOMETS) (Wu & Zhang, 2007) to identify templates for the query 309 sequence in a non-redundant Protein Data Bank (PDB) structure library. Then, the top-ranked 310 template hits obtained are selected for the 3D model simulations. To evaluate positively the global 311 accuracy of the predicted model, a C-score should return between -5 and 2. At the end, top 10 312 structural analogs of the predicted model close to the target in the PDB (Berman et al., 2000) are 313 generated using TM-align (Zhang, 2005). The TM-score value scales the structural similarity 314 between two proteins, and should return 1 if a perfect match between two structures is found. A 315 TM-score value higher than 0.5 suggests that the proteins belong to the same fold family. We identify a total of 204 DmdA protein sequences out of 150 curated genomes, and reconstruct 324 their evolutionary relationships by Bayesian Inference (BI) (Fig 1) and Maximum Likelihood (ML) 325 ( Supplementary Fig 4). Unrooted trees in TOPD-FMTS indicated that split distances did not exceed 326 0.19, indicating that the phylogenetic reconstruction is robust, with minor variations in alignment 327 filtering and methods for inferring topologies (Supplementary Table 2). 328

329
The BI tree (Fig 1) shows a main duplication between two lineages. The larger phylogenetic group 330 comprises genes from Bacteroidetes, while the smaller group includes genes from 331 Alphaproteobacteria. We focused on this smaller group as it includes the DmdA sequences (Fig 1;  332 green color) and the closest homologs to DmdA (Fig 1; yellow color). 333 334 Using phylogenetic analyses including DmdA orthologs and DmdA homologs close to those (the 335 limit to select closer homologs was set to a maximum e-value of e-80) we resolve the position of the 336 first DmdA sequences isolated from two marine bacterial species, R. pomeroyi (AAV95190.1) and robust phylogenetic relationship of DmdA gene family (Fig 2). We detected a clear separation 339 between DmdA and putative non-DmdA families. Indeed, the four DmdA family trees constructed 340 using different methods compared in TOPD-FMTS using split distances (Supplementary Table 3)  341 and unrooted trees ( Supplementary Fig 5) agreed with this result. The average split distance was 342 0.60, indicating that the trees were neither identical (split difference=0) nor completely different 343 (1). A random split distance was calculated to analyze whether the split distances were significantly 344 different. Because the random split distance resulted in a value close to 1 (0.988), our observations 345 are unlikely to be given by chance. 346

347
To identify HGT and duplication events, we constructed a proxy for the species tree of the genomes 348 considered here by using a set of small subunit ribosomal protein (see Material and Methods). 349 Given this (proxy) species tree ( Supplementary Fig 6), the positions of many sequences on the 350 DmdA tree are better explained as cases of HGT ( Supplementary Fig 6; Fig 3) with high statistical 351 support. We then tested whether the topology for a common set of taxa within the DmdA family 352 ( Supplementary Fig 7) were similar to that of the species tree ( Supplementary Fig 8). We found 353 significant differences (at an alpha of 0.01) between the topology of DmdA group and that of the 354 proxy species tree (Table 1); this incongruence between phylogenies is true irrespective of the test 355 used (Kishino-Hasegawa, Shimodaira-Hasewaga and unbiased tests). From these results we 356 conclude that the phylogenetic relationships within each DmdA group are different to those of the 357 species tree, strongly supporting a HGT-based evolution of DmdA family ( Supplementary Fig 9). 358 359 Moreover, we found many genes that use different codons than the neighboring genomic regions. 360 These genes are inferred as having been horizontally transferred given their (G+C) wobble content 361 (Supplementary Table 1 The structure for DmdA orthologs inferred on the protein sequences by Iterative Threading 367 ASSEmbly Refinement (I-TASSER) were threaded onto the known structure of DMSP-dependent 368 demethylase A protein (PDB accession: 3tfhA) with a C-score<= 2 (Table 2). However, the 369 predicted models for DmdA homologs were threaded onto two types of known structure; DmdA We clustered sequences with a putative DmgdH structure in a separate group using principal 375 component analysis (Supplementary Fig 11). There is a clear 3D-structure coincidence between 376 DmdA clade (red color in Supplementary Fig 10a) and the majority of lineages from non-DmdA 377 clade (orange color in Supplementary Fig 10a)  validates the use of relaxed molecular clock approach to estimate the node ages throughout 391 Bayesian analysis (see Methods for details). We observed that the marginal densities for each run of 392 the divergence time estimate analysis were nearly identical, pointing that the runs converged on the 393 same stationary distributions. In all runs, the marginal densities for the standard deviation 394 hyperparameter of the uncorrelated log-normal relaxed clock model were quite different from the 395 prior, with no significant density at zero and with a coefficient of variation around 0.2. Analyses 396 using three different calibrated prior dates showed not discrepancies in the final divergence time 397 estimates (Table 3). 398

399
The time estimates for the MRCA of each gene family (Table 3 and Fig 4) indicate that the most 400 recent common ancestor of DmdA gene family occurred in the late Archean, around 2,400 Mya, 401 after a gene duplication event. Also, a duplication within the DmdA lineage generated a separated 402 SAR11 and Roseobacter DmdA lineage in the early Precambrian ca. 1,894 Mya (Fig 4: red arrow).
comprising the Roseobacter DmdA family is larger than in SAR11 (Fig 4). 407 408 We detected two duplication events within the putative non-DmdA clade (Fig 4;  FastML inferred the 100 most likely ancestral sequences of the DmdA family. We observed that the 420 same sequences were always inferred. Indeed, the difference in log-likelihood between the most 421 likely ancestral sequence at this node (N1; Supplementary Fig 12) and the 100th most likely 422 sequence was only 0.105, indicating that both sequences are almost as likely to reflect the "true" 423 ancestral sequence. That ancestral protein contains both PF01571 (GCV_T) and PF08669 424 (GCV_T_C) domains, found in the DmdA orthologs and it is nearly identical to Ca. P. ubique 425 HTCC1062 DmdA sequence. Moreover, PSI-BLAST search confirmed that the ancestral sequence 426 in node 1 close to DmdA genes hosted in EMBL-EBI databases ( Supplementary Fig 13) and the 427 structure for Ca. P. ubique apoenzyme DmdA was the closest analog to our predicted models (Table  428 2; Supplementary Data 1). Inferred physico-chemical properties are identical between Ca. P. ubique 429 and the DmdA ancestral sequence (Supplementary Table 4 Fig 18) fits better the data, as 458 the LRT was 23.777 and p-value < 0.01 (Table 4). dN/dS value in Roseobacter (ω 1 : 0.0767) was 459 significantly lower than in the remaining branches (ω 2 : 0.1494), suggesting stronger purifying 460 selection on dmdA in Roseobacter. When we tested the intensity of selection over evolutionary time 461 using the free-ratio model (Table 4), we found changes in the selection pressure from the branches 462 which defines the separation of SAR11 and Roseobacter DmdA gene families ( Supplementary Fig  463   19: branches from nodes 21 to 23). In particular, we observed a dN/dS value > 1 in the branch 464 connecting nodes 21-23. We also identified some more recent branches (connecting nodes 25-26 465 and 28-29) for which dN/dS >> 1 was estimated (Supplementary Fig 19). 466 467 Finally, we applied the two branch-site models to test for sites under selection on the individual 468 lineages associated with dmdA ( Supplementary Fig 20). Four sequences (WP_047029467, 469 AHM05061.1, ABV94056.1, AFS48343.1) had a significant LRT after correcting for multiple 470 testing (Table 5), suggesting episodic positive selection on these lineages (Supplementary Fig 20).
sites are closed to conserved positions (17E; 87Y; Supplementary Fig 21). The residue 87Y is 474 adjacent to the conserved interaction site with THF (88Y; Supplementary Fig 21). Interestingly, 475 since the selected lineages are separated in the tree, the adaptive mutations seem to have occurred 476 through three parallel independent changes (Supplementary Fig 22). 477 478 479 Functional divergence during the molecular evolution of DmdA sequences 480 We tested whether DmdA and non-DmdA gene families were subjected to different functional 481 constrains after gene duplication (Supplementary Fig 5). We estimated the one-ratio model (M0) 482 that yielded a value ω = 0.053 (Table 6), indicating that purifying selection dominated the evolution 483 of these proteins. The discrete model (M3) was applied to these sequences (Table 6) and the LRTs 484 comparing M0 and M3 indicated significant variation in selective pressure among sites (Table 6; 485 Supplementary Fig 23). 486 487 The M3 model was compared with Model D, which accommodates both heterogeneity among sites 488 and divergent selective pressures. The LRT was significant and supported the model D (Table 6) (Table 6). Nineteen sites were located within the alpha helix (red 496 tube in Supplementary Fig 24) of the secondary structure prediction and sixteen were located in the 497 beta sheet (green arrows in Supplementary Fig 24). According to the global dN/dS estimates, for all 498 divergent positions dmdA sequences seem to be more conserved than non-dmdA sequences. 499 Moreover, this data is only compatible with recombination breaking linkage disequilibrium within 500 the gene set that we observed with the HGT analysis. 501 502 Finally, we are interested in knowing if adaptive evolution has occurred in the lineages immediately 503 following the main duplication event (Supplementary Fig 25). We applied two branch-site models 504 to test for sites under selection on the ancestor associated with the DmdA and non-DmdA clades 505 (Table 5). The LRT was significant for both ancestral branches (LRT > 7 and p-value < 0.05). 506 Nonetheless, the foreground ω for class 2 sites tended to infinite (ω=999) in both cases, indicating lack of synonymous substitutions (dS=0) in these sites. We also performed two-ratio models to 508 estimate global ω on these branches, but both estimates tended to infinite (Supplementary Table 5 The DmdA clade is a member of aminomethyltransferase (AMT/GCV_T) family with DMSP-530 dependent demethylase tertiary structure while non-DmdA clade includes an ancestor with a tertiary 531 structure that better matches the dimethylglycine dehydrogenase oxidorreductase (DmgdH, EC 532 1.5.99.2) (Fig. 7) and members with DmdA tertiary structure. To establish structural convergence as 533 the reason of this DmdA structure coincidence between DmdA and non-DmdA members, we used a 534 phylogenetic approach based on reconstructing ancestral sequences of the two clades, and then to 535 model the ancestral proteins. We determined different structural features between ancestral 536 sequence reconstructed from DmdA and non-DmdA families. In the first case, the ancestral 537 sequence reconstructed coincides with a DmdA tertiary structure, as well as with a DmdA sequence 538 with physico-chemical properties inferred in this study and agree with previous ones (Reisch et al., (within non-DmdA clade) where the majority of sequence gained DmdA structure (Fig. 7). 542 Therefore, DmdA structural features seem to have emerged independently in both clades: DmdA 543 and non-DmdA. This finding is extremely interesting, since known cases of structural convergence 544 of proteins are rare (Zakon, 2002). Experimental assays expressing and screening the activity of the 545 ancestral proteins at different conditions will be required to corroborate the structural convergence. In the second scenario, our data does not support the hypothesis of a co-evolution sceneario 562 between Roseobacter and DMSP-producing-phytoplankton (Luo et al., 2013). On the contrary, we 563 found an ancestor sequence of DmdA cluster similar to DmdA from a strain of Ca. P. ubique that 564 diverged after a more recent duplication event, before the dinoflagellate radiation in the late 565 Permian. This finding indicates that the enzyme activity has not changed in the course of DmdA 566 evolution. Indeed, we found that most of the codons in DmdA clade are under purifying selection 567 probably due to the importance of this pathway for sulfur acquisition. Nonetheless, we also detected 568 episodic positive selection in four sequences affecting a few sites, suggesting that adaptive 569 evolution fine-tuned the function of DmdA in Roseobacter. Furthermore, positively selected 570 residues were located around the GcvT domain and close to the residue involved in conserved 571 interaction with THF, reinforcing the idea of adaptive evolution in response to the external 572 environment.
During the study of this scenario, we suspected that dmdA was acquired by HGT in Roseobacter 575 and SAR11. This agrees with Luo et al., (2013) and Tang et al. (2010) which found that the 576 expansion of dmdA was by HGT. Moreover, our study evidence that DmdA ancestral sequence in 577 our phylogeny comes from a marine heterotrophic bacteria adapted to presence of DMSP in the 578 Archean, after a HGT event from this bacteria to another linage that acquired the dmdA ancestral 579 sequence. However, after the HGT events, some dmdA sequences have acquired similar residue 580 changes by independent (parallel) evolution, reinforcing the idea of functional/ecological 581 constrains. Therefore, Rhodobacteraceae can live in an environment where DMSP is the main 582 source of sulfur because they acquired the DmdA ancestor sequence by HGT, prior to have been 583      Hypothesis of DmdA evolution. BI phylogeny under uncorrelated relaxed clock model and that the similarity between two trees produced by BI or ML is better than random. This random 1185 analysis is repeated 100 times and the result is the mean and SD of the different repetitions. 1186