Revealing potential functions of hypothetical proteins induced by genistein in the symbiosis island of Bradyrhizobium japonicum commercial strain SEMIA 5079 (= CPAC 15)

Bradyrhizobium japonicum strain SEMIA 5079 (= CPAC 15) is a nitrogen-fixing symbiont of soybean broadly used in commercial inoculants in Brazil. Its genome has about 50% of hypothetical (HP) protein-coding genes, many in the symbiosis island, raising questions about their putative role on the biological nitrogen fixation (BNF) process. This study aimed to infer functional roles to 15 HP genes localized in the symbiosis island of SEMIA 5079, and to analyze their expression in the presence of a nod-gene inducer. A workflow of bioinformatics tools/databases was established and allowed the functional annotation of the HP genes. Most were enzymes, including transferases in the biosynthetic pathways of cobalamin, amino acids and secondary metabolites that may help in saprophytic ability and stress tolerance, and hydrolases, that may be important for competitiveness, plant infection, and stress tolerance. Putative roles for other enzymes and transporters identified are discussed. Some HP proteins were specific to the genus Bradyrhizobium, others to specific host legumes, and the analysis of orthologues helped to predict roles in BNF. All 15 HP genes were induced by genistein and high induction was confirmed in five of them, suggesting major roles in the BNF process.

biological nitrogen fixation (BNF) achieved by inoculation of soybean seeds with Bradyrhizobium elite strains selected by the research over the past decades provides highs rates of fixed atmospheric nitrogen (N 2 ), supplying the entire N needs from the plants [2]. Estimates are that approximately 12 Tg of mineral N (fertilizer) per year are saved in Brazil due to the BNF contribution to the soybean crop, representing an economy of about US$ 13 billion annually to the country [3].
Strain SEMIA 5079 (= CPAC 15) was isolated in 1992 from a Cerrado soil that originally had no compatible soybean rhizobia and was inoculated with strain SEMIA 566; as both strains share identical rep-PCR profiles and differ from all other strains from the germplasm bank, SEMIA 5079 is considered a natural variant of SEMIA 566 [2]. SEMIA 566 was isolated from plants inoculated with a North American inoculant in the late 1960s [4], and belongs to the same serogroup as USDA 123, highly competitive in the United States [2,4]. The variant strain SEMIA 5079 shows higher N 2 fixation than its parental, and since 1992 it composes the group of four Bradyrhizobium strains used in Brazilian inoculants [5]. Nowadays, SEMIA 5079 is the main strain composing more than 70 million doses of commercial inoculants sold per year in Brazil, used in more than 30 million hectares.
The genome of SEMIA 5079 was estimated at 9.58 Mbp, and is composed of one circular chromosome with two ribosomal operons and G + C content of 63.54%. The annotation predicted 4,203 putative genes with assigned function (48.6%) and 4,445 hypothetical CDSs (coding DNA sequences), totaling 8,648 predicted genes [6]. The most representative gene category is of amino acid metabolism (15.95%) and xenobiotic biodegradation and metabolism (13.24%). Other classes of genes associated with nutrient transport, iron acquisition, and indole acetic acid (IAA) metabolism potentially correlate with the higher saprophytic capacity and competitiveness properties of strain SEMIA 5079. The symbiosis island size was estimated at 700,212 bp, with 569 predicted CDSs, including all nodulation (nod, noe) and fixation (fix and nif) genes, 247 hypothetical ORFS, and 90 mobile elements. Additionally, the symbiosis island encompasses a probable non-functional hydrogenase operon, cytochromes for energy supply, ABC-transporters, and operons for secretion systems.
The Bradyrhizobium-soybean symbiosis is highly specific, with exchange of molecular signals between the plant and its symbiotic rhizobium. In the symbiosis, isoflavonoids (mainly genistein and daidzein) secreted by the soybean seeds and roots attract the specific Bradyrhizobium and activate the NodD family of transcription activators [7,8]. The nodD gene activates the expression of common and host-specific nod genes. Following, rhizobia synthesize and secrete Nod factors (lipo-chitooligosaccharides, LCOs), which are signal molecules that interact with the soybean plant, inducing signal transduction cascades [9]. The recognition of Nod factors by kinase-like receptors on the legume root epidermal cells triggers the root hair deformation, resulting in nodule formation [10]. Besides being the primary inducers of nod gene expression, isoflavonoids are also related to chemotactic signals [11], induction of type III secretion system (T3SS) effectors [12], with the rapid proliferation of rhizobia [13], promotion of bacterial growth [14], and activation of rhizobial quorum sensing systems [15]. Altogether, these phenotypes may affect rhizobial competitiveness and nodulation [16,17].
Nowadays, more than 187 genomes of Bradyrhizobium strains are publicly available, including several B. japonicum strains [18]. There has been a significant increase of new species identified in recent years, e.g. nine new Bradyrhizobium species have been described and their genomes released in the last two years [19]. Despite the increase of Bradyrhizobium genomes, there is still a lack of transcriptional studies supporting gene expression levels, their molecular functions during symbiosis, and their interactions with isoflavonoids.
The majority of the transcriptome studies has used strain USDA 110 (previously classified as B. japonicum, and later reclassified as the type strain of B. diazoefficien) as a model in micro and macro-arrays studies to assess the whole-genome expression profiling of soybean bradyrhizobia, evaluating gene expression levels in both bacteroids and free-living conditions [20][21][22]. Studies have also evaluated the response to drought stress [23] and gene expression when induced by genistein [24,25], soybean seed extracts [24], soybean root exudates [16], coumestrol (a polyphenolic compound isolated from soybean roots) [26], and peribacteroid (PBS) solution purified from soybean root nodules [27].
Unfortunately, there is only one study reporting gene expression patterns in B. japonicum SEMIA 5079 [31]. Furthermore, the high number of genes located in the symbiosis island coding for hypothetical proteins points out to the need of conducting more transcriptional studies to understand the roles of these genes in the BNF process. As SEMIA 5079 is the most competitive strain used in commercial inoculants in Brazil [4], and as it belongs to the same serogroup as the very competitive USDA 123 [2], these studies can enlighten properties related to both the efficiency of BNF and competitiveness.
This study aimed to investigate gene expression levels of hypothetical protein-coding genes of the symbiosis island of B. japonicum strain SEMIA 5079 in response to genistein. We confirmed gene expression of different functional classes of genes, including enzymes, transporters, and potential effectors in response to genistein.
We also provided a framework of helpful bioinformatics tools for functional annotation and curation of hypothetical proteins of rhizobia.

Selection of hypothetical proteins coding genes and characterization of gene neighborhood
To define hypothetical protein-coding (HP) genes in the symbiosis island of B. japonicum SEMIA 5079 with possible roles in BNF and competitiveness, we searched for putative genes located relatively close to genes related to saprophytic capacity and/or BNF traits, such as the nodulation genes and secretion systems, located in a size range of 600-2,500 bp. By adopting this strategy, 15 HP genes were selected, but one of them was later identified  Table 1, while their genomic neighborhoods are shown in Fig. 1.
The closest gene to the bjs_07605 is the nodulation noeI, with a distance of 8,615 bp, followed by noeE (9,683 bp). The bjs_07621 is located close to several BNF genes, the closest being fixA. The bjs_07647 is located at 1,698 bp from nolZ gene, while bjs_08317 is located at 2,361 bp from the nopP gene (type III secreted effector). Following, bjs_08160 has as nearest gene the conjugal transfer gene trbL (15,779 bp), and bjs_08216 is closer to fixW gene (16,189 bp). The bjs_08251 and bjs_08254 are located near to each other (1,469 bp), and are close to the 6-O-carbamoyl transferase (nodU gene), with a distance of 3,381 bp and 1,022 bp, respectively. We also selected the aspC gene (bjs_08258) in order to compare the expression of an annotated gene in our set of HP, and the gene is close to nodU (315 bp). Close to the bjs_08267, an enoyl-CoA hydratase (fadB gene, 810 bp) and a LacI family transcription regulator coding (lacI) were identified. Several ABC transporters spanning a ~ 5 kb region are also located close to bjs_08267 (Fig. 1).

Distribution of HPs between strains of Bradyrhizobium genus
First, we analyzed the number of HPs retrieved against Bradyrhizobium genomes deposited in the NCBI database. We chose a threshold of ≥ 70% of identity and ≥ 50% of coverage for BLAST searches. The HPs set was split in two sets, the first composed of HPs broadly distributed within the Bradyrhizobium genus, and the second restricted to few species ( Fig. 2; Supplementary File S1). Within the first group, the BJS_07605 protein proved to be the most broadly distributed member in Bradyrhizobium, detected in 35 different described species and 50 Bradyrhizobium spp. strains, totaling 159 proteins, followed by BJS_7536, which is present in 32 described species and 53 Bradyrhizobium spp. strains, totaling 155 proteins. In contrast, BJS_08240 is present in only four described species in addition to five Bradyrhizobium spp., totaling 16 proteins. Other examples of hypothetical proteins present in a few species are the BJS_08254, with five species and six Bradyrhizobium spp., with total of 17 proteins; and BJS_08251, present in five species and five Bradyrhizobium spp., summing 17 proteins.
The analysis of the distribution of hypothetical proteins in different species was also carried out. We observed that at least one strain of B. japonicum and of B. diazoefficiens, were present in all 15 proteins selected (14 HPs and aspC), followed by strains of B. ottawaense (14 proteins) B. elkanii (12 proteins

Functional annotation of hypothetical protein-coding genes from symbiosis island
Several bioinformatics tools and resources were used to assign functions to the 14 HPs and to confirm the annotation of the aspC gene of the symbiosis island of SEMIA.  (Table 1). For two HPs (BJS_08317 and BJS_08774), although protein domains were identified in the Pfam, the results were not significant. In this case, we annotated these domains as a first attempt to assign functions and understand their roles.
According to the functional annotation, 11 HPs were assigned as enzymes playing roles in different pathways, such as in the metabolism of hopanoid, methionine, queuosine biosynthesis, oxidation-reduction, and proteolysis. The majority of the enzymes were categorized as peptidases and hydrogenases. Two HPs were identified as membrane transporters, one as a DNA binding protein and another as a transcription regulator (Table 1). In addition, as many of the HPs were identified as enzymes, the physicochemical parameters as pI (isoelectric point), extinction coefficient, grand average of hydropathicity (GRAVY), and 3D structures were predicted and are summarized in Supplementary File S3. Most of the pI (isoelectric point) values calculated were around 6.0 to 6.5, with higher values for BJS_08267 (9.44), BJS_08774 (9.08) and BJS_08160 (8.98). The pI values are helpful to estimate where the protein will be localized on a 2-D gel. We also predicted the cellular localization of the HPs, the majority in the cytoplasm (Table 1). However, the trypsin-like peptidase (BJS_08254) was predicted as a non-classical secreted protein by Secretome P. As expected, the transporters were located in the membrane, the porin protein (BJS_07647) in the outer membrane, and the MFS transporter on the inner membrane (BJS_08267).
Using the EffectiveDB database, we analyzed the HPs that could be related to virulence and competitiveness, by investigating type III secretion signals, conserved binding sites of type III chaperones, type IV secretion peptides, eukaryotic-like domains, and subcellular targeting signals in the host ( Table 2). The peptidase U49 (BJS_08240) was predicted as a potential virulent protein by the VIC-Mpred tool. The trypsin-like peptidase (BJS_08254) and the acetylcholinesterase (BJS_08774) were found to be virulent by the VirulentPred tool. The remaining HPs were predicted as related to information and storage and also to metabolic or cellular processes. Interestingly, the trypsin-like peptidase (BJS_08254) was predicted as a potential T3SS effector using EffectiveDB database, corroborating with the VirulentPred tool. Furthermore, a glutathione-dependent formaldehyde-activating (GFA) enzyme (BJS_07621), as well as a pyridoxal phosphatedependent aminotransferase (aspC gene, corresponding to BJS_08258) were predicted as potential T3SS effectors.

Phylogeny and genes enrichment in host-bacteria interactions based on ELD analysis
We predicted secreted proteins based on eukaryotic-like domains (ELD) through the EffectiveELD tool (Table 2).
These domains occur in eukaryotic genomes and show higher frequency in host-associated bacteria (pathogen or symbiont) genomes than in non-host-associated bacteria, indicating that they might play roles in the bacteriahost plant interaction. Some proteins were present in few bacterial genomes ( Following the criteria of threshold of ≥ 70% of identity and ≥ 50% of coverage for BLAST searches, we performed the phylogenetic analysis of all the 14 HPs and of the BJS_08258 (AspC) (Supplementary File S5). Our objective was to verify if there was a relationship with strains symbionts of soybean, and in general they were clustered together in all analyses. Interestingly, there were HPs as BJS_08254 not present either in B. japonicum type strain USDA 6 or in B. diazoefficiens type strain USDA 110, considering BLAST of both nucleotides and proteins, but other soybean symbionts were detected.
We chose to explore in more details the evolutionary relationships and functions of the GFA, since this protein showed higher number of copies compared to the Ku protein (Supplementary File S4). Therefore, we searched for GFA proteins in Bradyrhizobium spp. genomes deposited in NCBI, selecting proteins with identity ≥ 70% (Supplementary File S1). A phylogenetic tree relating bacteria with their host plants was built (Fig. 3). B. japonicum, including strain SEMIA 5079, and B. diazoefficiens were clustered together. In the same clade, strains of B. ottawaense, B. liaoningense and B. lupini were also present. In addition, the cluster included Bradyrhizobium sp. SG09, isolated from sorghum (Sorghum bicolor) and B. diazoefficiens Y21, isolated from a root nodule of common bean (Phaseolus vulgaris) in China, but there was no information about their ability to nodulate and fix nitrogen with soybean. All the remaining strains in this cluster are reported as capable of nodulating soybean. Interestingly, other species reported as nodulating soybean plants were closer to symbionts of other legumes instead of B. japonicum or B. diazoefficiens. For example, B. yuanmingense strains were grouped with strains associated with cowpea (Vigna unguiculate) and peanut (Arachis hypogaea). Also, almost all B. elkanii strains (except for BLY3-8, BLY6-1, and SEMIA 938) were grouped in the same clade and were closer to symbiotic strains of common bean (Phaseolus vulgaris), and pigeon pea (Cajanus cajan). Strains associated with other beans, such as lima bean (Phaseolus lunatus), hyacinth bean (Lablab purpureus) and least snout bean (Rhynchosia minima) were grouped together, and were closer to other legumes, as Retama monosperma and Stylosanthes viscosa. In general, the results indicated that GFA proteins are present in several species associated with a large range of legume plants, with low host specificity.
We also evaluated the phylogenetic trees of the two trypsin-like peptidases, since both were enrichment in bacteria-plant interactions by ELD analysis (Fig. 4). In both trees, the majority of the trypsin-like peptidases were identified in species nodulating soybean. In BJS_08251, only two proteins were retrieved from strains isolated from sorghum, while in BJS_08254 one protein was identified in a strain from sorghum and another was isolated in a soil under free-living conditions.
The expansion of these genes symbiotic genomes compared to non-symbiotic represents additional evidence of potential roles in the symbiotic process. While some proteins are broadly distributed in Bradyrhizobium species nodulating several legumes, as GFA and Ku proteins, indicating low host specificity, others show host specificity, such as the trypsin-like peptides, and may play roles in the Bradyrhizobium-soybean interaction.

Gene expression levels of HPs in presence of genistein
In the next step, we validated the relative expression levels of all 15 selected genes by quantitative RT-PCR. Significant expression levels were detected in all 15 genes after the induction with 2.5 µM of genistein. The induction levels ranged from 10.15-to 42.58-fold (Fig. 5). The expression levels of the nodC gene, used as a positive control were of 29.4-fold. The levels of the squalenehopene cyclase (bjs_07536) were higher than the other 14 genes. Similar expression level patterns, ranging from 26.72 to 32.64-fold were observed in the set composed by HxlR-type HTH transcription regulator (bjs_08317), pyridoxal phosphate-dependent enzyme (bjs_08258,=aspC), succinate-semialdehyde dehydrogenase (bjs_08261), MFS transporter (bjs_08267), acetylcholinesterase (bjs_08774), and peptidase U49 (bjs_08240).
To get more insights about gene expression, we identified RpoN and RpoD-like promoters on the SEMIA 5079 genome through a list of promoters predicted in B. japonicum USDA 110. From the initial set of 1,210 RpoN-like promoters, 910 were identified, with 199 RpoN-like promoters identified in the symbiosis island; from the initial set of 4,007 RpoD-like promoters, 2,683 were identified in SEMIA 5079, with 481 located in the symbiosis island. It is worth to mentioned that several USDA 110 promoters aligned into the same regions in SEMIA 5079 and for those, only one promoter was counted.
RpoN-like promoters upstream of the bjs_08317 and bjs_07605 genes were identified (Fig. 1). Interestingly, for bjs_07536, a RpoN-like promoter was located upstream of a cluster of genes (P450 and fpps genes), in which, bjs_07536 apparently belongs. Several RpoN-like promoters were identified in the region of nod and fix genes (close to bjs_07647 and bjs_07621). For instance, a RpoNlike promoter upstream of nodD1, which is followed by the nod box genes. Also, other RpoN-like promoters were found upstream of fixR and fixA genes. For the bjs_08774 region, a RpoN-like promoter was predicted upstream the nodM gene; however, a potential operon of three HPs, which contains the bjs_08774 could be observed preceded by a RpoN-like promoter. RpoD-like promoters were predicted upstream of the bjs_07647, and bjs_07621 (which contain multiple RpoD and RpoN promoters in the region). For the following genes, bjs_08261, bjs_08267, bjs_08251, bjs_08216, bjs_08254, and aspC (bjs_08258), neither RpoD or RpoN-like promoters were identified in their regions. While for bjs_08240, bjs_08160, and bjs_08523, RpoD-like promoters were predicted between genes upstream of their location and no clear operon organization was observed between those genes. In general, as expected, more RpoD-like promoters were identified than RpoN, in special in the bjs_07647, bjs_07621 and bjs_07605 gene regions. All the RpoN and RpoD-like promoter sequences predicted in SEMIA 5079 and their location are provided in Supplementary File S6.

Discussion
Genes coding for several enzymes are induced by genistein and might play roles in nitrogen fixation, competitiveness and environmental stress tolerance in SEMIA 5079 We submitted our sequences to several protein databases and bioinformatic tools that have been used to assign functions to hypothetical proteins in bacterial species, including Bradyrhizobium spp. [34,35], creating a helpful workflow for functional annotation and curation. For 13 HP proteins, we found robust information in multiple databases, especially regarding known protein domains, which are described as highly conserved structure during the evolution process [36]. Although we were unable to find significant protein domains in the two remaining HP proteins, we continued to investigate their possible functions.
We found 11 HP proteins as potential enzymes, and they were classified in six classes. Transferases and hydrolases, each one containing four and three HPs, respectively, another protein was categorized as lyase, one as oxidoreductase, one as hydrogenase (chaperone), and one as ligase. Enzymes play several critical roles in rhizobia, being necessary for bacterial survival and nutrition [37], nodulation process [38], adaptation to stressful environments [39], competitiveness [40], growth in freeliving conditions [30], physiological and metabolic processes [41], and virulence ability [33]. Saprophytic ability and competitiveness are important features to a successful symbiosis and in SEMIA 5079 these genes are located in the symbiosis island.

Transferases
Transferases catalyze the transfer of a specific group from one substance to another [42], and we identified four transferases (BJS_07605, BJS_08258, BJS_08523, and BJS_08774). BJS_07605 was predicted to be a 5-methyltetrahydropteroyltriglutamate-homocysteine methyltransferase (MetE protein; EC 2.1.1.14), belonging to the UROD/MetE-like superfamily. MetE protein has been reported as involving in methionine (cobalamin) biosynthetic pathway in soil bacteria as Pseudomonas putida [43], and Saccharopolyspora spinosa [44]. In rhizobia species, metE has been identified in transcriptomes studies in both bacteroids and free-living conditions in USDA 110 [35,36]. In the plant pathogen Ralstonia solanacearum, metE is under the control of the pathogenicity regulator HrpG, and metE mutants showed reduced disease symptoms on tomato (Solanum lycopersicum) [45]. The metE has also been related to stress conditions tolerance, e.g. to acetate and temperature in Escherichia coli [46], and copper in yeast Rhodotorula mucilaginosa [47]. Thus, we may suggest that the main roles of this gene in SEMIA 5079 may be the tolerance to environmental stress conferring higher saprophytic ability, in addition to competitiveness. Roles in BNF have also been proposed.
The disruption of the metH (cobalamin-dependent enzyme gene) in S. meliloti causes inability of fix nitrogen in alfalfa plants. The expression of metE in metH mutants complements the Fix − phenotype, as the average plant height, nitrogenase activity and the percentage of pink nodules are dramatically improved [48].
The BJS_08258 protein was predicted as an aminotransferase class I/II-fold pyridoxal phosphate (PLP)-dependent enzyme, also named aspartate aminotransferase (AspC protein, EC 2.6.1.1). Almost all PLP-dependent enzymes are involved in biochemical pathways with amino compounds, mainly amino acids [49]. In S. meliloti 4R3, alteration in aspartate transaminase resulted in mutants unable to catabolize aspartate and to induce effective nodules on M. sativa [50,51]. The aspC gene was found expressed in nitrogen-fixing bacteroids in the S. meliloti-Medicago truncatula symbiosis [52]. Also, an aspC mutant of S. meliloti showed a nitrogen-starvation phenotype during plant infection assays and significantly reduction in nitrogenase activity [53]. Finally, an aspartate aminotransferase was identified as differentially expressed in the presence of genistein in B. japonicum SEMIA 5079 [54], supporting our data that AspC is induced by genistein in this strain and might play a role in BNF. The BJS_07536 was predicted as a squalene-hopene cyclase (Shc protein, EC 5.4.99.17) involved in the hopanoid biosynthesis pathway (secondary metabolite biosynthesis) [55]. Hopanoids are lipids that are mainly located in bacterial membranes, associated with membrane fluidity, lipid rafts, and stress tolerance [56]. Many hopanoid-producing bacteria are capable of fixing nitrogen, including Bradyrhizobium spp. [57,58], and it has been suggested that they facilitate nitrogen metabolism, what could explain the highest gene expression level of this gene. In photosynthetic Bradyrhizobium sp. strain BTAi1, an shc mutant showed slower-growing under optimal growth conditions, lower resistance to stresses, and reduced ability to survive intracellularly in its host plant Aeschynomene [59].
In general, transferases play roles in spore germination, synthesis of lipoproteins, and virulence in bacteria [60,61]. Corroborating with that, in our study, two transferases, BJS_07605 (MetE) and BJS_07536 (Shc) were also associated with stress tolerance and virulence in bacteria species, while BJS_08258 (AspC) was associated with an effective nitrogen fixation process.

Hydrolases
Hydrolase enzymes catalyze the addition of water to a substrate through a nucleophilic substitution reaction and are the biocatalysts most commonly used in organic synthesis [62]. Hydrolases play essential roles in the invasion of the host tissue, escaping host defense mechanisms [63]. In our study four proteins were predicted as hydrolases. Three of them are peptidases, BJS_08251 and BJS_08254 predicted as putative trypsin-like peptidases belong to serine peptidases family S1, and BJS_08240 predicted as peptidase U49. BJS_08774 carries an acetylcholinesterase domain. In our results, BJS_08254 was predicted as secreted, a virulence factor, a potential T3SS effector, and enrichment by ELD tool. We may thus suggest that this trypsin-like peptidase could act in the infection process of soybean by Bradyrhizobium, as the protein was majorly identified in Bradyrhizobium symbionts of this legume (Fig. 4).
Trypsin peptidases (STS) have been described as markers in phytopathogenic fungi genomes [64]. Secretomes of plant pathogens are enriched with endopeptidases [65], and mutations lead to the loss of virulence ability [66,67]. Therefore, in Bradyrhizobium these proteins could facilitate the host plant-bacteria interaction. STS was identified among the secreted proteins in B. diazoefficiens USDA 110 under free-living conditions [68] and in bacteroids [69], but there is little information about their roles in the infection process. Serine protease DO-like proteins were identified on the secretome of R. etli CE3 [70], and in the proteomic analysis of Rhizobium favelukesii LPU83 in response to acid stress [71]. Also, extracellular serine proteases were identified in the genomes of Rhizobium strains tolerant to high temperatures [72], and mutations in the serine protease (CtpA protein) lead to increased sensitivity to desiccation in R. leguminosarum [73]. In the S. meliloti-M. truncatula symbiosis, a peptidase (HrrP) cleaves host-encoded signaling peptides and mediates symbiotic compatibility [74]. Little information is available about the functions of peptidases U49. Besides that, roles on microbial virulence and pathogenesis have been proposed [75]. We may then hypothesize the potential roles of these STS in SEMIA 5079 both in the infection process and stress tolerance.

Lyase, ligase, oxidoreductase and hydrogenase
The BJS_07621 protein was predicted as a lyase, a glutathione-dependent formaldehyde-activating (GFA) enzyme (Gfa protein, EC 4.4.1.22), of the glutathionedependent formaldehyde oxidation pathway, involved in C1 metabolism and methanol oxidation in B. diazoefficiens [76]. In B. diazoefficiens USDA 110, gfa was induced in cells grown with vanillin (aromatic compound) as sole source of carbon [77], in M. loti MAFF303099 in response to salt stress [78], in R. tropici CIAT 899 to acid stress [79], and in USDA 110 cells exposed to elevated atmospheric CO 2 [80]. Therefore, the enzyme has been associated with bacterial growth under different environmental stresses, and the several copies found in SEMIA 5079 might indicate advantages under stressful conditions.
The BJS_08261 protein was identified as an oxidoreductase enzyme, a succinate-semialdehyde dehydrogenase (GabD protein, EC 1.2.1.24) involved in the GABA (γ-aminobutyric acid) metabolism [81]. When R. leguminosarum bv. viciae strain 3841 was mutated in aminotransferase and succinate semialdehyde dehydrogenases, the mutants were still able to fix nitrogen on pea plants [82]. In R. etli, gabD is required for glutamine utilization, and mutations affected growth in culture medium with glutamine as only carbon source [83]. Proteomic analysis showed that GabD protein was increased in R. leguminosarum SRDI565 cultivated with sulfosugar sulfoquinovose (SQ) as a carbon source [84]. GabD proteins were identified in rhizobia proteomes, including B. diazoefficiens USDA 110 cells grown in HM (Mueller-Hinton) medium [85], R. tropici CIAT 899 induced by apigenin and salt stress [86], and R. favelukesii LPU83 in response to acid stress [71]. In an association mapping study (GWAS) using 153 strains of S. meliloti to map regions associated with symbiosis phenotypes [87], two of the reported genes (gabD and queC) were annotated and validated in our study. Therefore, GabD in SEMIA 5079 may be associated with the utilization of different carbon sources and also with tolerance to environmental stresses.
The BJS_08216 protein was predicted as the hydrogenase maturation factor HypA, a nickel metalloenzyme, an accessory protein for nickel incorporation into hydrogenases [88]. The hypA gene is part of hydrogenase uptake (hup) operon located in the symbiosis island of several rhizobia, such as SEMIA 5079 [6], SEMIA 5080 [6], USDA 110 [89], R. tropici CIAT 899, and PRF 81 [90]. Hydrogen-uptake genes improve symbiotic efficiency in several legumes, including soybean [91]. However, often the operon is not complete and functional. B. diazoefficiens USDA 110 has two sets of the hup gene cluster, a complete set of functional hup genes located outside the symbiosis island and an incomplete set of non-functional hup genes within the symbiosis island [6]. In B. japonicum USDA 6 and SEMIA 5079 the genes are duplicated and the operon incomplete, or are pseudogenes; therefore, also nonfunctional [6,89]. In addition, a Hup − phenotype has been reported in B. japonicum SEMIA 5079 [92]. In R. leguminosarum a mutation in hypA confirmed that the gene is essential for hydrogenase activity, as it is required for the correct processing of the large hydrogenase subunit [93]. Nevertheless, several hyp and hup genes, including hypA, were induced in USDA 110 cells during chemoautotrophic growth [94]. Similarly, hypA was expressed in SEMIA 5079 even though apparently the hydrogenase is not functional [94], what might indicate other possible functions.
The BJS_08523 protein was predicted as a 7-cyano-7-deazaguanine synthase, QueC protein (EC 2.7.7) involved in the biosynthesis pathway of queuosine [95]. Mutations in queC of S. meliloti resulted in ineffective BNF [96], and in a GWAS study performed to identify BNF traits in S. meliloti, queC was identified in one of the regions mapped [87]. In Shigella flexneri, full expression of the virulence VirF was shown to require queuosine Q34 [97]. Therefore, in SEMIA 5079 the queC gene might be relative both with the infection process and the effectiveness of the BNF process.

Transporters and other proteins
The BJS_07647 was predicted as a porin, a class of proteins that are β-barrel channels located in Gram-negative bacterial outer membranes and allow the diffusion of different molecules, being also associated with multidrugresistant [98]. Environmental conditions such as water availability, oxidative stress, heavy metals, temperature, and nutrients shortage affect the expression and/or activity of porins. [99]. In rhizobia, there are studies relating porin to increased expression in bacteroids of Bradyrhizobium sp. (Lupinus) in plants treated with high doses of glyphosate doses [100], required for USDA 110 cells growth under manganese deficiency [101], and involved in copper transport in R. etli CFN4 [102,103]. Also, porins were identified in proteomic and transcriptomic studies of cells and bacteroids of USDA 110 [68,69,92]. In our results, BJS_07647 was predicted as an outer membrane, as expected to porins, and we may attribute roles in tolerance to different toxic compounds, as the high manganese and aluminum levels of many Brazilian Cerrado soils.
The BJS_08267 was predicted as a major facilitator transporter (MFS), a superfamily composed of multidrug resistance efflux pumps (MDR). Besides the roles in multidrug resistance, the genes of this family have been primarily described in bacteria and might play a general role in detoxification of toxic compounds and in rhizobia actuating in competitiveness in the rhizosphere [104]. MFS transporters have been described in USDA 110 involved in nitrate assimilation and nitric oxide detoxification [105], and in Rhizobium sp. RC1, mediating the uptake of haloacids into the cells [106]. In relation to the symbiosis, MFSs have been described contributing to nodulation competitiveness in S. meliloti [107] and in R. leguminosarum bv. viciae [108]. Thus, this could also be another protein facilitating detoxification of toxic compounds and competitiveness in SEMIA 5079.
The BJS_8160 was identified as a non-homologous endjoining protein Ku (Ku protein); these proteins are specialized in repairing DNA damages [109] In S. meliloti Ku proteins have been identified in both free-living conditions and bacteroids [110], and related to stress conditions, as heat and nutrient starvation [111]; however, in S. meliloti, Ku protein was not essential to the symbiotic interaction with M. truncatula [112]. Our results indicate enrichment of this Ku protein in rhizobial species and we may suggest a role in SEMIA 5079 in repairing DNA damages derived from stressful conditions, such as high temperatures and acidity.
A probable HxlR-type HTH domain, present in putative transcription regulators with a winged helix-turnhelix (wHTH) structure, was found in BJS_08317. This DNA-binding protein acts as a positive regulator of the formaldehyde-inducible hxlAB operon in Bacillus subtilis, which is part of the ribulose monophosphate pathway involved in formaldehyde fixation [113]. HxlR transcriptional regulators have been induced in other bacteria, such as P. fluorescens in the presence of heavy metals [114], Lactobacillus acetotolerans in response to ethanol stress [115], and Bacillus atrophaeus UCMB-5137 stimulated by maize root exudates [116]. As other enzymes identified in our study, BJS_08317 could be associated with the saprophytic ability of SEMIA 5079, since this protein has been associated with bacterial survival in response to different compounds.

Expression levels of hypothetical protein-coding genes located in symbiosis island in presence of genistein
We applied some parameters to choose the hypothetical protein-coding genes for further functional annotation, followed by confirmation by qRT-PCR. Firstly, to identify potential new genes related to BNF traits, we selected genes from the symbiosis island of SEMIA 5079, which contains several predicted hypothetical protein-coding genes [6]. As the main genes related to BNF, such as nod, fix, and nif genes and are positioned in symbiosis islands in Bradyrhizobium and some Mesorhizobium strains [117][118][119], we selected the hypothetical protein-coding genes close to these genes, with an emphasis on those related to nodulation. In rhizobia, nod genes are usually organized in operons, as nodABC in S. meliloti [120] and nodYABCSUIJnolMNO in USDA 110 [121], indicating that their expression regulation involves common mechanisms, including induction by flavonoids and common transcription factors [122,123]. Therefore, the hypothetical protein-coding genes selected in our study have high probability of being under the same regulation as the nodulation or nitrogen fixation genes.
We quantified the gene expression levels of the 15 selected protein-coding genes in the presence of 2.5 µM of genistein, a known inducer of nodulation genes in Bradyrhizobium [7,8], and confirmed expression for all of them in cells induced at the initial exponential phase, an approach reported in previous studies [28,32]. For example, the porin gene (bjs_07647) was close to several nodulation genes previously identified as induced by genistein, as nodC in both B. japonicum SEMIA 5079 (1 µM) [31] and SEMIA B. diazoefficiens 5080 (5 µM) [35], and also close to nodD1, nodD2 and nodA genes, all induced in B. liaoningense CCBAU05525 (0.5 µM) [32]. Another example is the HxlR transcriptional gene (bjs_08317), close to the nopP gene, which was also previously induced by genistein in SEMIA 5079 [31]. Higher expression levels were detected in five HP genes (bjs_07536-hopanoid biosynthesis pathway; bjs_08240-peptidase, bjs_08774acetylcholinesterase, bjs_08267-MFS transporter and bjs_08261-succinate-semialdehyde dehydrogenase), reinforcing the hypothesis of major roles in BNF. These results also support our strategy of choosing genes close to known BNF genes, followed by a functional workflow of annotation and curation, and the transcription validation in the presence of a nod-gene inducer.
We predicted RpoN and RpoD-like promoters in SEMIA 5079 genome to get insights about gene expression regulation. We chose to select those types of promoters since RpoN promoters (σ 54 ) have been identified as associated with nitrogen-fixing ability [124][125][126] and RpoD promoters (σ 70 ) associated with free-living conditions [125][126][127][128]. Corroborating with the USDA 110 prediction, more RpoD than RpoN promoters were identified on SEMIA 5079 genome. We did not find any clear correlation between gene expression levels and promoter identification. Indeed, in five HPs genes, we did not find any RpoN and RopD-like promoters; however, three of them (bjs_08261, bjs_08267 and aspC) are located close in the genome and showed high gene expression levels compared to the remained genes. The bjs_08267 gene (MFS transporter) was preceded by several transporters-coding genes (Fig. 1), while bjs_08261 (GabD) and aspC are enzymes associated with biosynthesis and oxidation process, with roles in bacterial growth and BNF ability, respectively. Therefore, further investigation in order to identify other types of promoters regulating those genes should be performed. Also, other hypothesis could be the regulation of those genes by the two sigma-54 factors copies present in B. japonicum located in the symbiosis island: bjs_08297 (rpoN1 protein, 46,402 to 47,898 bp) and y4pA gene (bjs_08283, 58,369 to 60,123 bp). These genes have been reporting regulating several genes with diverse functions, including genes related to nitrogen fixation, free-living conditions, and transport on rhizobia species genes [124,129,130].
Interestingly, the highest gene expression level was observed to bjs_07536 (hopanoid biosynthesis pathway), which seems to belong to the P450 operon preceded by a RpoN promoter, as previously reported in B. japonicum USDA 110 [125,130,131] and identified in our study. Other genes as bjs_08317 (HTH transcription regulator) and bjs_07605 (metE) showed an RpoN promoter in their upstream region, and in special metE gene has its expression associated with the NifA-RpoN regulon in M. loti [132]. The genes, as bjs_07647 (porin protein) and bjs_07621 (GFA enzyme) were preceded by RpoD-like promoters, which makes sense, since both genes are associated with bacterial growth and tolerance to different environment stresses. In the bjs_08160 gene region, which was annotated as a protein Ku associated with free-living conditions and bacteroids, only RpoD-like promoters were predicted. In a similar way, the bjs_07647 (porin) region was predicted with a enrichment of RpoD-like promoters, thus, raising the idea the others HPs located in those regions could play important roles to free-living conditions in B. japonicum as bjs_07647 and bjs_08169. In general, the prediction the RpoN and RpoD-like promoters in SEMIA 5079 was similar to the predictions in the symbiosis island of USDA 110 [126]. Also, in some of our HPs with high level of expression we are not able to find RpoN and RpoD-like promoters, raising perspectives of future studies to find other promoters regulating those genes and playing important roles in the BNF capacity, saprophytic ability, and tolerance to stresses. Finally, we also provided here an extensive list of predicted promoters for future functional studies in SEMIA 5079.

Conclusions
The genome of B. japonicum strain SEMIA 5079 contains a large proportion of hypothetical protein-coding genes, especially in its main symbiosis island, a key region in the circular genome where the nodulation and nitrogen fixation genes are located. Our study brings a workflow of bioinformatics tools/databases to correctly assign the function of hypothetical protein-coding genes present in SEMIA 5079 symbiosis island and that may be applied to other rhizobia. Fourteen hypothetical protein-coding genes and another one later identified as aspC were studied in more detailed and were induced by the nod-gene inducer genistein, from 10.15-up to 42.58-fold. Some proteins were specific to Bradyrhizobium, while others were present in several rhizobia, and the analysis of orthologues helped to predict roles in BNF, including competitiveness and infectiveness. Although the results are descriptive and efforts should be done to identify the role of HP genes on symbiosis island, the study contributed to improve the knowledge about the genomics features of this outstanding strain used in more than 70 million doses of i soybeaninoculants yearlyin Brazil and in other countries of South America.

Functional annotation of hypothetical protein-coding genes from symbiosis island
The annotated genome of B. japonicum SEMIA 5079 (CP007569.1 -GenBank) was retrieved from the NCBI Genome database [133]. Hypothetical protein-coding (HP) genes located in the symbiosis island of SEMIA 5079 were initially collected based on an automatic annotation of this genome [6]. All genes predicted inside the symbiosis island were retrieved into a unique FASTA format file. To choose HP genes located in the symbiosis island, some criteria were used: genes located close to others previously described as playing roles on the BNF process (e.g. nod, fix, nif genes), and with size range of 600 to 2,500 bp (avoiding truncated genes or pseudogenes). The hypothetical protein-coding gene sequences were taken separately to assign their accurate function based on several bioinformatic tools.
In order to predict proteins related to soybean infection by Bradyrhizobium, prediction of virulence factors was performed using VICMpred (http:// crdd. osdd. net/ ragha va/ vicmp red/) and VirulentPred (http:// 203. 92. 44. 117/ virul ent/ index. html/) tools that are based on PSI-Blast and the support vector machine (SVM) method based on patterns, amino acid and dipeptide composition of bacterial protein sequences.
The occurrence of the studied hypothetical proteins in genomes of other strains of Bradyrhizobium was verified by similarity searches using BLASTX (E ≥ 1 × 10 − 6 , identity ≥ 70%), at the NCBI (https:// blast. ncbi. nlm. nih. gov/ Blast. cgi). Further, only sequences from Reference Sequence (RefSeq) database (http:// www. ncbi. nlm. nih. gov/ RefSeq/), which provides a non-redundant collection of sequences representing genomic data, transcripts and proteins were kept. Also, in cases where more than one protein per genome was identified, only the protein with more similarity with the hypothetical protein-coding gene from SEMIA 5079 was used. Protein alignments between HPs queries and the resulting protein sequences obtained from BLAST steps were performed using the CLUSTAL W program in MEGA version 11 [134]. Phylogenetic trees were constructed using the neighborjoining (NJ) method [135,136], with 1,000 bootstrap replicates.
We searched in SEMIA 5079 genome for both promoter motifs similar to σ 54 RpoN-dependent promoters, mainly activated in symbiosis and promoter motifs similar to σ 70 RpoD-dependent promoters, associated with free-living conditions. Through BLASTN, we searched on SEMIA 5079 genome for 1,201 RpoN-like promoters and 4,007 RpoN-like promoter sequences upstream of TSSs (transcriptional start sites) predicted in B. japonicum USDA 110. Only promoters located in intergenic regions were kept and analyzed.

Bacterial strain, growth conditions and genistein essay
Bradyrhizobium japonicum strain SEMIA 5079 (= CPAC 15, =DF 24, =CNPSo 7), was obtained from the "Diazotrophic and Plant Growth Promoting Bacteria Culture Collection" of Embrapa Soja (WFCC Collection #1213, WDCC Collection #1054). The strain was cultured at 28 °C in modified-YM solid medium supplemented with Congo red (0.025%) [137]. Bacterial pre-cultures were grown under shaking (100 rpm) at 28 °C in 10 mL of TY broth [138]. When reaching OD 600 of 0.4, 1 mL was inoculated into 100 mL of TY broth supplied with genistein at a final concentration of 2.5 µM, followed by incubation at 28 °C under shaking (100 rpm and 28 °C) for approximately 1 h, up to the exponential phase (OD 600 0.5 ~ 0.6). The cells were then recovered and used for RNA isolation and expression analysis. The experiment was performed with a randomized block design, with three biological replicates. The control treatment received only methanol, which was the solvent used to prepare the genistein solution.

RNA isolation and cDNA synthesis
Total RNA was extracted using TRIzol ® reagent (Thermo Fischer Scientific Inc., Waltham, MA, USA) following the manufacturer's instructions, with some modifications [30]. Thirty mL of cells were harvested by centrifugation (10,000 rpm at 4 °C for 10 min), the cell pellets were resuspended with 1 mL of NaCl solution (0.85%), the pellet solutions were transferred to new 2 mL microtubes and harvested by centrifugation as described above. Pellets were then resuspended in Table 3 Genomic position and primer sequences for the qRT-PCR assay of the 14 hypothetical and the aspC genes of the symbiosis island of Bradyrhizobium japonicum SEMIA 5079 (= CPAC 15) selected for this study (+) strand positive, (-) strand negative, F primer forward, R primer reverse 10 µL lysozyme at 5 mg/mL) and incubated at 40 °C for 5 min. Following, 1.5 mL of Trizol was added to the samples, vortexed for 15 s, and incubated at room temperature (24 ± 1 °C) for 5 min. Cells were then centrifuged (10,000 rpm at 4 °C for 10 min) and the aqueous phase (~ 800 µL) was transferred to a new microtube with 250 µL of chloroform, vortexed for 15 s and incubated at room temperature (5 min). Cells were then centrifuged as described above, the aqueous phase (~ 400 µL) transferred to new microtubes, and the RNA was precipitated with 500 µL isopropanol. After 20 min on ice, RNA samples were collected by centrifugation at 14,000 rpm for 15 min and resuspended in ultra-pure water (50 µL). After that, 3 M sodium acetate and 250 µL of 100% ethanol were added to the RNA samples and mixed well by gently inverting the tubes several times. A new step of centrifugation was performed as described above, the upper phase was discarded and the pellets were washed with 500 µl ethanol 70%. The samples were centrifuged again, and RNA pellets were kept to dry at room temperature (24 ± 1 °C) for 30 min. Finally, the dry pellets were dissolved in 50 µL of water ultra-pure. RNA concentration was estimated in a NanoDrop ™ 8000 Spectrophotometer (Thermo Fischer Scientific Inc., Waltham, MA, USA) and integrity was assessed in 1.0% agarose gel by electrophoresis. RNA samples were treated with DNAse I (Thermo Fischer Scientific Inc., Waltham, MA, USA), according to the manufacturer's instructions. Following, cDNA strands were synthesized from 5 µg of treated RNA using the SuperScript → III First-Strand Synthesis System (Thermo Fischer Scientific Inc., Waltham, MA, USA) and random primers, according to the manufacturer's instructions.

Quantitative real-time PCR (qRT-PCR) approach
Primers were designed using the Primer Express 3.0 ® software (Applied Biosystems, Foster). The default parameters for qRT-PCR were applied, with an amplicon range of 50-150 bp size. The primers sets were checked against SEMIA 5079 genome by BLASTN tool to verify their specificity, and the list of the primers used is shown in Table 3.
Reactions were performed on a qRT-PCR thermocycler 7500 (Applied Biosystems), according to the manufacturer's recommendations, using a Platinum SYBR Green qPCR SuperMix-UDG with ROX kit (Thermo Fischer Scientific Inc., Waltham, MA, USA). Initially, the PCR primers efficiency was determined using a quantitative standard curve for each pair of primers (10-fold serial dilution of a cDNA bulk containing all the biological replicates of both treatment and control). The Ct (cycle threshold) values obtained in each dilution were plotted as a function of logarithm of the cDNA dilutions, with the slope of regression curve used to estimate primer efficiency values [139]. All qPCR reactions were performed in 12.5 µL reactions containing 1 µL of cDNA, 6.25 µL of SYBR Green with ROX, 0.2 µM of each primer (reverse and forward), and 4.25 µL of ultra-pure water. The PCR cycling program consisted of 50 °C for 2 min, 95 °C for 10 min, followed by 45 cycles at 95 °C for 2 min, 60 °C for 30 s and 72 °C for 30 s. The reactions were performed with three technical replicates for each biological replicate. The calculation of the expression and the statistical analysis were carried out with the software REST 2009 [140]. The expression of the 16 S rRNA endogenous gene was used as an internal control for normalization [31,32], and the genistein responsive nodC gene was used as positive control [31,141].