Bioinformatics and Structural Characterization of a Hypothetical Protein from Streptococcus mutans: Implication of Antibiotic Resistance

As an oral bacterial pathogen, Streptococcus mutans is known as the aetiologic agent of human dental caries. Of the 1960 proteins identified in the genome of this organism, about 500 have no known function. One of these proteins, SMU.440, has very few homologs in the current protein databases and does not fall into any protein functional family. Phylogenetic studies showed that SMU.440 is associated with a particular ecological niche and is conserved specifically in some oral pathogens as a result of lateral gene transfer. The co-occurrence of a MarR protein within the same operon among these oral pathogens suggests that SMU.440 may be associated with antibiotic resistance. The structure determination of SMU.440 revealed that it shares the same fold and a similar pocket as polyketide cyclases, indicating that it is very likely to bind polyketide-like molecules. From the combined structural and bioinformatics studies, we conclude that SMU.440 could be involved in resistance to polyketide-like antibiotics, which provides a better understanding of this hypothetical protein. In addition, the combination of multiple methods used in this study can serve as a general approach for functional studies of proteins with unknown function.


Introduction
The Gram-positive oral pathogen Streptococcus mutans is the leading cause of dental caries [1]. As one of the early colonizers, S. mutans adheres to the tooth surface and enables the further colonization of other microorganisms, forming dental plaque as a result [2]. In addition to enduring a rather acidic environment, these microorganisms have to withstand stresses from changes in temperature, nutrition and osmotic pressure [3] as well as exposure to natural virulence factors and antibiotics.
Of the 1960 ORFs (open reading frames) in the S. mutans genome, 63% were initially assigned functions through bioinformatics studies, and more ORFs, or their orthologs, have since been characterized by microarray analysis, phenotype studies and other methods [4][5][6]. So far, fewer than 500 ORFs remain without a known function (http://cmr.jcvi.org/) [7]. One such case is SMU.440 (GeneID: 1029579), which is composed of 138 residues. There are very few similar proteins in the current databases, and these homologs are all hypothetical proteins without known function.
Lateral gene transfer (LGT) [8][9][10] is a major route by which organisms acquire novel genes, and it plays an important role in bacterial survival and adaptation to environmental changes as well as in pathogenicity [11,12]. Thus, studies of LGT can be helpful not only for the understanding of gene evolution and species diversification, but also for the development of drugs that inhibit the transfer of resistance genes. Phylogenetic analysis is a robust method for LGT identification [13].
LGT creates unusually high similarities among organisms, particularly those that are closely related or share the same habitat, which can be used for the detection of LGT [12,14].
In order to understand the function of unknown ORFs in the S. mutans genome, we initiated a structural genomics project at Peking University a few years ago [15], and SMU.440 was selected as one of the targets. Here, we report the bioinformatics studies and the crystal structure of SMU.440 from S. mutans. Phylogenetic analyses suggest that SMU.440 originated via LGT among certain oral pathogens. The crystal structure reveals a fold similar to that of known polyketide cyclases even though the amino acid sequences are quite different. SMU.440 also shares a similar binding pocket composed primarily of residues with aromatic and acidic side-chains, which points to potential binding of a polyketide-like molecule.

Homology Search
SMU.440 is a hypothetical protein without any known function or protein family classification. A BLAST [16] search against the nonredundant database (NRDB) returns 13 homologs of SMU.440, excluding proteins with short overlaps (<90 residues) or low sequence identity (<20%). The five most similar proteins share more than 40% sequence identity with SMU.440, much higher than the remaining proteins, for which identity drops abruptly to less than 26% (Fig. 1). Proteins with high identity (>40%) are referred to as SMU.440 close homologs and those with low identity (<26%) are referred to as SMU.440 remote homologs. Among the close homologs, SGO0266 and SSA0360 are, like SMU.440 itself, proteins from the genus Streptococcus. CBEI3892 is from Clostridium beijerinckii, which belongs to a different class from Bacilli, the class that contains Streptococcus. These four proteins all come from gram-positive bacteria of the same phylum, Firmicutes. On the other hand, FNP1018 and FNV2091 are from the genus Fusobacterium, in the phylum Fusobacteria of gram-negative bacteria. Thus, SMU.440 homologs are sparsely distributed among a few species across distantly related bacterial lineages.
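The filtering and grouping rules described above can be illustrated with a short Python sketch. The thresholds are those stated in the text; the `Hit` record, the `classify` function and the "intermediate" label for identities between 26% and 40% (a range in which no hits were reported) are assumptions introduced only for illustration, and the example hits are hypothetical values, not data from this study.

```python
from dataclasses import dataclass

MIN_OVERLAP = 90        # residues; hits with shorter overlaps are excluded
MIN_IDENTITY = 20.0     # percent; hits below this identity are excluded
CLOSE_CUTOFF = 40.0     # percent; hits above this are "close" homologs
REMOTE_CUTOFF = 26.0    # percent; hits below this are "remote" homologs


@dataclass
class Hit:
    """One BLAST hit (hypothetical record layout)."""
    name: str
    overlap: int        # aligned length in residues
    identity: float     # percent sequence identity


def classify(hit: Hit) -> str:
    """Apply the exclusion and grouping thresholds from the text."""
    if hit.overlap < MIN_OVERLAP or hit.identity < MIN_IDENTITY:
        return "excluded"
    if hit.identity > CLOSE_CUTOFF:
        return "close"
    if hit.identity < REMOTE_CUTOFF:
        return "remote"
    # No hits were reported in this range; the label is an assumption.
    return "intermediate"


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    hits = [Hit("hit_A", 135, 52.0), Hit("hit_B", 128, 24.5), Hit("hit_C", 60, 48.0)]
    for h in hits:
        print(h.name, classify(h))
```

Keeping the cut-offs as named constants makes it easy to repeat the grouping with the less stringent constraints used later for the PSI-BLAST search.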

Phylogenetic Analysis
A phylogenetic tree was generated based on the amino acid sequences of SMU.440 and its homologs. These 14 proteins were clearly divided into three groups, which are strongly correlated with habitat (Fig. 2A). SMU.440 close homologs fell into the same group as SMU.440 and, except for the protein from C. beijerinckii, they are all from organisms known as oral pathogens involved in the formation of dental plaque [17]. Although not considered a typical oral pathogen, C. beijerinckii has been isolated from human carious dentin in previous studies [18]. It was earlier shown by Wilson, Kreychman & Gerstein that at levels of sequence identity >40%, precise function is conserved for pairs of single-domain proteins [17]. Thus, SMU.440 and its close homologs are likely to share a similar function.
To search for further proteins with similar function, an iterative search using SMU.440 close homologs was carried out using PSI-BLAST [16] against the NRDB. However, no more sequences were found within the current genome databases, even when less stringent constraints (overlap >120 aa, identity >30%) were used, which further confirmed the highly specific distribution of SMU.440 and its close homologs in oral pathogenic bacteria.
SMU.440 remote homologs are grouped into terrestrial and aquatic bacterial proteins. SMU.440 shares an unusually high similarity to genes from rather divergent organisms. In addition, this scattered phylogenetic distribution appears to be habitat related. Together, these observations indicate that LGT has been involved in the spread of SMU.440 close homologs and possibly SMU.440 remote homologs. In addition, most of the remote homologs are assigned to the polyketide cyclase family (Pfam, PF10604) [19].

Co-evolution of SMU.440 and SMU.441
In bacteria, proteins with related functions are often clustered into the same operon, which provides useful information for the investigation of proteins with unknown functions. With four overlapping nucleotides in the coding sequences, SMU.440 and the adjacent SMU.441 protein (GeneID: 1027951) are located in the same operon (Fig. 2B). SMU.441 belongs to the MarR protein family of transcription regulators, which is involved in multiple antibiotic resistance [20]. Similar bioinformatics studies were performed on SMU.441. It was found that the top five BLAST hits are from the same organisms as the SMU.440 close homologs (Fig. S1), and their identities form a similar profile as that observed in the SMU.440 BLAST search result (Fig. 2B). Furthermore, if only the N-terminal region (1-40) of SMU.441, which is involved in the dimerization of MarR family proteins and less conserved [20], was selected as a query sequence to search for homologs, only proteins from the five oral bacteria mentioned above were found.
In summary, the homologs of SMU.440 and SMU.441 show a very similar conservation pattern and distribution, and their co-occurrence in genomes indicates that the genes encoding these two proteins are related and have been laterally co-transferred. The separation of the two genes in some species may be due to later gene rearrangements after LGT.

Overall Structure
The crystal structure of SMU.440 has been determined at 2.4 Å resolution using the SIRAS (single isomorphous replacement with anomalous scattering) method and all residues except the last one could be fitted into the electron density. In addition, five residues from the N-terminal His6-tag were also observed in the structure.
Statistics from the data collection and structure refinement are summarized in Table 1. The structure has been deposited in the Protein Data Bank and has been assigned PDB ID 3IJT.
SMU.440 is composed of three α-helices and a seven-stranded antiparallel β-sheet, bending into an unclosed β-barrel (Fig. 3A). The structure belongs to the SCOP superfamily of Bet v 1-like [21] proteins. There are two molecules per asymmetric unit (ASU) and they form a homodimer via a pair of antiparallel β-strands (Fig. 3B). The dimer has twofold symmetry, with the twofold axis in the center of the dimer interface and perpendicular to the plane of the extended β-sheet of the dimer. Prediction of assemblies by the PISA server [22] indicates that this dimer interface is the largest (interface area 946.6 Å²) with a favorable interaction energy (ΔiG −6.9 kcal/mol), in agreement with the dimeric state in solution observed during the gel filtration chromatography experiment. However, it is not clear if there is any functional advantage associated with the dimerization.
A large forked cavity is formed as the β-barrel wraps around the long C-terminal α3-helix. One end of the cavity is closed by helices α1 and α2 together with the loop between β2 and β3. The volume of the cavity is about 1050 Å³ with a surface area of ~700 Å². The cavity is lined by Trp26, Trp29, Glu30, Asp32, Met50, Met52, Met55, Phe60, Phe71, Asp73, Thr75, Thr77, Val82, Phe84, His86, His100, Val102, Phe116, Ile120, Asp123, Val124 and Ser127, most of which are well conserved among the SMU.440 homologs (Fig. 1, Fig. 4). In addition, these residues are spatially distributed in clusters. The bottom of the pocket contains primarily aromatic residues, whereas at the top of the pocket, close to the cavity opening, the residues can be divided into two parts, one side with mainly acidic and the other with neutral residues.
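The composition of the cavity can be tallied directly from the residue list above. The following Python sketch groups the 22 listed residues by side-chain type; the grouping scheme (in particular counting histidine as aromatic and methionine with the aliphatic residues) and all names in the code are assumptions made for illustration, not a classification used in this study.

```python
from collections import Counter

# Cavity-lining residues as listed in the text.
CAVITY = (
    "Trp26 Trp29 Glu30 Asp32 Met50 Met52 Met55 Phe60 Phe71 Asp73 "
    "Thr75 Thr77 Val82 Phe84 His86 His100 Val102 Phe116 Ile120 "
    "Asp123 Val124 Ser127"
).split()

# Assumed side-chain grouping (His counted as aromatic, Met as aliphatic).
CLASSES = {
    "aromatic": {"Phe", "Trp", "Tyr", "His"},
    "acidic": {"Asp", "Glu"},
    "aliphatic": {"Ala", "Val", "Leu", "Ile", "Met"},
    "polar": {"Ser", "Thr"},
}


def residue_class(residue: str) -> str:
    """Map a label such as 'Trp26' to its assumed side-chain class."""
    amino_acid = residue[:3]
    for label, members in CLASSES.items():
        if amino_acid in members:
            return label
    return "other"


counts = Counter(residue_class(r) for r in CAVITY)

if __name__ == "__main__":
    for label, n in counts.most_common():
        print(f"{label}: {n}")
```

Under this grouping the aromatic and acidic residues together account for 12 of the 22 cavity residues, in line with the description of the pocket given above.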

Comparison to Structures with Known Function
A structural similarity search was performed using the DALI web server [23], and, when uncharacterized proteins are excluded, the 30 most similar hits (RMSD <3.0 Å and Z-score >12) all corresponded to polyketide cyclases and class 10 pathogenesis-related (PR-10) proteins (Fig. 5). Polyketide cyclases play an important role in the synthesis of polyketides, where the cyclization patterns diversify the final aromatic products [24]. With a wide distribution throughout the plant kingdom, PR-10 proteins are presumed to be involved in plant resistance in incompatible interactions by binding plant hormones [25][26][27]. Proteins of these two families not only share the same fold, but also have cavities with similar features, including a preference for aromatic ligands. SMU.440 does not share a strictly conserved binding site with any protein of these two families. However, its binding pocket consists of several aromatic, hydrophobic and acidic residues that are conserved in orthologous sequences, matching the binding pocket patterns observed in polyketide cyclases and PR-10 proteins and indicating a potential ability to bind chemically similar classes of ligands. A superimposition of all known ligand complexes of the SMU.440 structural homologs illustrates the common ligand-binding features of this pocket (Fig. 5).
Among the DALI hits, XOXI (PDB ID 3CNW) forms a dimer interface similar to that of SMU.440. A hypothetical protein from Bacillus cereus, XOXI is predicted by Pfam to belong to the polyketide cyclase family. Furthermore, a profile-profile alignment was performed using the FFAS03 server [28], which showed that SMU.440 was more similar in sequence to the polyketide cyclase family proteins (Pfam, PF10604; score, 244) than to the PR-10 family proteins (Pfam, PF00407; score, 224). This is consistent with the earlier observation that SMU.440 remote homologs belong to the polyketide cyclase family.
In conclusion, we have observed that SMU.440 is only found and conserved in a small number of dental pathogenic bacteria through sequence and phylogenetic analysis. Further analysis shows that SMU.441, a MarR protein from the same operon as SMU.440, shares the same distribution pattern and very similar level of sequence conservation. It is thus suggested that SMU.440 has co-evolved with SMU.441, and that SMU.440 may be involved in antibiotic resistance. The determination of the SMU.440 crystal structure revealed a cavity which is similar to that of hormone and especially polyketide binding proteins sharing the same fold, indicating a polyketide-like binding site in SMU.440 homologs. The results reported here shed light on a likely function of a hypothetical protein found exclusively in the dental habitat bacteria.

Bioinformatics Analysis
BLAST [16] was used for the homology search against the NRDB. The resulting sequences were aligned with the software MUSCLE [29] and a phylogenetic tree was generated based on the alignment. The programs JALVIEW2.4 [30] and TREEVIEW [31] were used for alignment analysis and tree visualization, respectively. The cavity in the structure was analyzed using the web server CASTp [32], and DALI [33] was used to identify structures that share similarity with SMU.440.

Protein Expression and Purification
The SMU.440 gene was amplified from genomic S. mutans DNA by PCR using the primers SMU.440-F 5′-GCGGATCCATGAAATTTTCTTTTGAATTGG-3′ and SMU.440-R 5′-CCGCTCGAGTCATACTGTCTCCAAGATTT-3′, which contain BamHI and XhoI restriction sites, respectively. After digestion with BamHI and XhoI, the PCR-amplified fragment was ligated into the pET-28a (+) expression vector (Novagen, USA), which had been linearized with the same two restriction enzymes. Recombinant clones were selected and sequenced for verification. E. coli BL21 (DE3) cells transformed with plasmids encoding SMU.440 were grown in Luria-Bertani broth supplemented with 50 µg ml⁻¹ kanamycin at 310 K until the optical density at 600 nm reached 0.6. Recombinant protein expression was induced by adding isopropyl-β-D-thiogalactopyranoside to a final concentration of 1.0 mM, after which the culture was incubated for 4 hours at 303 K. Cells were harvested by centrifugation at 6700 g for 10 minutes at 277 K. The cell pellet was re-suspended in lysis buffer (20 mM Tris-HCl pH 7.5, 500 mM NaCl) supplemented with 1 mM phenylmethylsulfonyl fluoride, and then lysed by sonication. The crude cell extract was clarified by centrifugation (30000 g for 1 hour at 277 K) and the supernatant was purified using a Ni²⁺ chelating column (GE Healthcare, USA). The protein was eluted with a buffer containing 20 mM Tris-HCl pH 7.5, 500 mM NaCl and 500 mM imidazole, and further purified by size exclusion chromatography on a HiLoad Superdex 75 column (GE Healthcare, USA) using an elution buffer containing 20 mM Tris-HCl pH 7.5 and 150 mM NaCl. The protein was concentrated to 10 mg ml⁻¹ using an Amicon Ultra-15 concentrator (Millipore, USA). The purity of the SMU.440 protein was about 95% as judged by SDS-PAGE analysis.
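As a quick consistency check on the primer design, the short Python sketch below confirms that each primer carries the expected restriction site. The recognition sequences GGATCC (BamHI) and CTCGAG (XhoI) are standard; the function names are introduced here for illustration, and reading the ATG after the BamHI site and the TGA in the reverse-complemented reverse primer as the start and stop codons of SMU.440 is our interpretation of the primer sequences rather than a statement from the cloning protocol.

```python
# Recognition sequences of the two restriction enzymes used for cloning.
SITES = {"BamHI": "GGATCC", "XhoI": "CTCGAG"}

# Primer sequences as given in the text (5' to 3', line-break hyphens removed).
FORWARD = "GCGGATCCATGAAATTTTCTTTTGAATTGG"
REVERSE = "CCGCTCGAGTCATACTGTCTCCAAGATTT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sites_in(primer: str) -> list[str]:
    """List the enzymes whose recognition sequence occurs in the primer."""
    return [name for name, site in SITES.items() if site in primer]


if __name__ == "__main__":
    print("forward primer sites:", sites_in(FORWARD))
    print("reverse primer sites:", sites_in(REVERSE))
    # Codon directly after the BamHI site in the forward primer.
    start = FORWARD.index(SITES["BamHI"]) + len(SITES["BamHI"])
    print("codon after BamHI site:", FORWARD[start:start + 3])
    # Coding-strand view of the reverse primer.
    print("reverse primer, coding strand:", reverse_complement(REVERSE))
```

Because both recognition sequences are palindromic, searching the primer as written is sufficient; the reverse complement is only needed to read the 3′ end of the insert in the coding-strand orientation.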

Crystallization
Crystallization trials were performed by the hanging-drop vapor-diffusion method at 289 K using 24-well VDX plates (Hampton Research, USA). Crystallization drops were prepared by mixing 1 µl protein with 1 µl reservoir solution, followed by incubation at 289 K. Crystals were observed in several conditions tested in our initial experiments using the crystallization screening kits Crystal Screen, Crystal Screen II and Index Screen (Hampton Research, CA, USA). After optimization, well-diffracting crystals were obtained using a reservoir solution containing 0.2 M (NH4)2SO4, 0.1 M Tris-HCl pH 7.0 and 25% (w/v) PEG 3350. Mercury derivatives were prepared by soaking the crystals in the same solution supplemented with 2.0 mM ethylmercury thiosalicylate for three hours.

Data Collection and Processing
Diffraction data were collected on the I-711 beamline at MAX-Lab (Lund, Sweden) equipped with an Oxford Cryosystem and a Mar165 CCD detector. Crystals were flash cooled without further cryo-protection in a nitrogen cryostream. Data were collected at 100 K and indexed, integrated and scaled using DENZO and SCALEPACK from the HKL package [34] (Table 1).

Phasing and Model Building
The software SOLVE [37] was used to search for heavy atoms, and two mercury atoms were located per ASU. With both the native and the mercury derivative data, SIRAS phases were calculated using SOLVE and improved by solvent flattening using the program DM. Automatic model building was carried out with the software RESOLVE [37] and 112 residues were traced per ASU. In the partially built model, two helices could be assigned to each of the two molecules per ASU. A least-squares (LSQ) matching of the two helices was performed using the LSQ function in the program O [38] and the twofold NCS axis was located. The initial phases were improved using the program SHARP [39] (Table 1) and by density modification using DM [40]. As a result of imposing NCS averaging and using the automated tracing in RESOLVE, 218 residues were traced. The programs O and CNS were used for manual model building of the remaining parts and PHENIX.refine [41,42] for the final crystallographic refinement. Structure figures were generated with the software PYMOL [43].

Figure S1. Multiple sequence alignment of SMU.441 homologs.