Pyridoxal 5’-phosphate synthesis and salvage in Bacteria and Archaea: predicting pathway variant distributions and holes

Pyridoxal 5’-phosphate or PLP is a cofactor derived from B6 vitamers and essential for growth in all known organisms. PLP synthesis and salvage pathways are well characterized in a few model species even though key components, such as the vitamin B6 transporters, are still to be identified in many organisms including the model bacteria Escherichia coli or Bacillus subtilis . Using a comparative genomic approach, PLP synthesis and salvage pathways were predicted in 5840 bacterial and archaeal species with complete genomes. The distribution of the two known de novo biosynthesis pathways and previously identified cases of non-orthologous displacements were surveyed in the process. This analysis revealed that several PLP de novo pathway genes remain to be identified in many organisms, either because sequence similarity alone cannot be used to discriminate among several homologous candidates or due to non-orthologous displacements. Candidates for some of these pathway holes were identified using published TnSeq data, but many remain. We find that ~10 % of the analysed organisms rely on salvage but further analyses will be required to identify potential transporters. This work is a starting point to model the exchanges of B6 vitamers in communities, predict the sensitivity of a given organism to drugs targeting PLP synthesis enzymes, and identify numerous gaps in knowledge that will need to be tackled in the years to come.

annotated as PdxK, PdxY, RbsK, ThiD, and ThiD2, a subfamily used in bacilli as a PdxK replacement (named pdxK_basu in our analysis). Once the subfamilies had been defined, these were used as seeds to create new HMM profiles. The different seed sequences were aligned using MAFFT v7.453 [6] (linsi algorithm, using the group's subtree as guide tree, option --treein). The poorly aligned regions at the extremities of the alignment were manually trimmed in the alignment using SEAVIEW v5.0.2 [2]. The trimmed alignments were used to build the HMM profile using hmmbuild (default parameters) from HMMER package v3.3 [2], resulting in a set of 113 HMM protein profiles (Table S4).
In the second round of refinements of the HMM profiles, we ran hmmsearch (default parameters) from HMMER package v3.3 [2] using the 113 HMM protein profiles against the same database and we decided to keep all the hits with an e-value > 1e-6 and a coverage of the alignment >= 50%. Using these sequences, phylogenetical trees were inferred for each annotated protein group as described in the Methods section. In parallel, each set of sequences was divided in groups using the EFI-EST sequence similarity network [7] using an alignment score threshold corresponding to 35 to 40% pairwise identity. Each tree was then divided into sub-trees using the clusters identified by the SSNs and rarefied to reduce their size and redundancy to 20 representatives by group using Treemmer v0.3 [8] (option -X 20). Each protein sequence of the representatives belonging to the same superfamily was concatenated into the same multifasta file. The superfamily tree was then inferred as described in the method section. For each superfamily, groups inside the superfamily were separated using EFI-EST SSN with a step-by-step evolution of the alignment score threshold to identify the best threshold for each superfamily to separate PLP pathway protein from the other functional groups (supplemental data 1 in FigShare [9]). After mapping the SSN groups on the tree of each superfamily to make sure that the groups were homogenous (supplemental data 2 in FigShare [9])._ HMM protein profiles were built for each group as described below.

HMM protein profile generation
An initial alignment was performed using MAFFT v7.453 [10] ("linsi" algorithm, using the tree of the superfamily as guide tree, option --treein). Then, the resulting multiple alignments were analyzed using BMGE v1.12 [11] (options -m BLOSUM62-g 0.2 and -oh to have an HTML output). Next, the HTML output was parsed to identify the nonconserved block at the border of the alignment and trimed them out. Finally, the trimmed alignment was used to build the HMM profile using hmmbuild (default parameters) from HMMER package v3.3 [2]. All final profiles are given in Supplemental data 3 in FigShare

Sequence similarity networks (SSNs) and genome context
SSNs for PF02913 (UniProt Release 2022_01) were generated using the EFI-EST web tool [7]. For the Desulfovibrionales Order (Fig. 5, panel A), the SSN was generated using UniProt IDs; an alignment score of 140 that separates paralogous groups was used to display the SSN. For the Bacteria Superkingdom (Fig. 5, panel C), the SSN was generated using UniRef90 clusters; an alignment score of 140 was used to display the SSN. Genome neighborhood diagrams (GNDs) were generated for the five paralogs in Desulfovibrio vulgaris str. Hildenborough using the EFI-GNT web tool [7]. and five of those also harbored epd, suggesting the pdxJ gene might have been missed.

Analyses of the missing PdxJ proteins
Using tblastn (v2.11.0+, default parameters) [12,13], we could detect, in 4 of the genomes, pdxJ in two fragments suggesting an artifactual frameshift was causing the absence of this protein in the final Refseq file. Furthermore, for one of the genomes, Pseudomonas aeruginosa VRFPA04, we saw that the genome version in contig in PATRIC harbors the missing pdxJ in the gene set. We could not find any trace of pdxJ in the three remaining genomes but anticipate this must be caused by problems in genome assembly as the same species/genus harbor pdxJ in their genomes.