Genomic Determinants Encode the Reactivity and Regioselectivity of Flavin-Dependent Halogenases in Bacterial Genomes and Metagenomes

ABSTRACT Halogenases create diverse natural products by utilizing halide ions and are of great interest in the synthesis of potential pharmaceuticals and agrochemicals. An increasing number of halogenases discovered in microorganisms are annotated as flavin-dependent halogenases (FDHs), but their chemical reactivities are markedly different and the genomic contents associated with such functional distinction have not been revealed yet. Even though the reactivity and regioselectivity of FDHs are essential in the halogenation activity, these FDHs are annotated inaccurately in the protein sequence repositories without characterizing their functional activities. We carried out a comprehensive sequence analysis and biochemical characterization of FDHs. Using a probabilistic model that we built in this study, FDHs were discovered from 2,787 bacterial genomes and 17 sediment metagenomes. We analyzed the essential genomic determinants that are responsible for substrate binding and subsequent reactions: four flavin adenine dinucleotide-binding, one halide-binding, and four tryptophan-binding sites. Compared with previous studies, our study utilizes large-scale genomic information to propose a comprehensive set of sequence motifs that are related to the active sites and regioselectivity. We reveal that the genomic patterns and phylogenetic locations of the FDHs determine the enzymatic reactivities, which was experimentally validated in terms of the substrate scope and regioselectivity. A large portion of publicly available FDHs needs to be reevaluated to designate their correct functions. Our genomic models establish comprehensive links among genotypic information, reactivity, and regioselectivity of FDHs, thereby laying an important foundation for future discovery and classification of novel FDHs.
IMPORTANCE Halogenases play an important role as tailoring enzymes in biosynthetic pathways. Flavin-dependent tryptophan halogenases (Trp-FDHs) are among the enzymes that have broad substrate scope and high selectivity. From bacterial genomes and metagenomes, we found highly diverse halogenase sequences by using a well-trained profile hidden Markov model built from the experimentally validated halogenases. The characterization of genotype, steady-state activity, substrate scope, and regioselectivity has established comprehensive links between the information encoded in the genomic sequence and reactivity of FDHs reported here. By constructing models for accurate and detailed sequence markers, our work should guide future discovery and classification of novel FDHs.

Halogen-containing molecules are ubiquitous in the environment. An increasing number of organohalogen products have been isolated and characterized in recent decades (1). Organic halides produced as secondary metabolites are considered potential pharmaceuticals and agrochemicals, constituting an important class of antibiotics and antitumor drugs represented by rebeccamycin, chloramphenicol, and vancomycin derivatives (2). Because dehalogenation of these molecules often lowers binding affinity to the target (3) and biological activities (4), halogenation is one of the essential reactions in the functionalization of natural products. Halogenated molecules can also engage in specific noncovalent interactions with heteroatom-containing functional groups or aromatic rings by halogen bonds (5). The highly directional and specific halogen bonds have been exploited to enhance ligand binding affinities of drugs and inhibitors, with minimal disruption of other important intermolecular interactions (6).
In addition to the substrate scope, regioselectivity is another key element in the identification of Trp-FDH reactivities. Based on the site of halogenation on the indole ring, Trp-FDHs can be further categorized into 5-, 6-, and 7-tryptophan halogenases (5-, 6-, and 7-Trp-FDHs, respectively) (7,32,33). Previous studies demonstrated that the regioselectivity of carbon-halogen bond formation is determined by the relative positioning of the substrate with respect to the catalytically critical lysine residue at the active site in conjunction with a glutamate located between the flavin adenine dinucleotide (FAD)-binding and tryptophan-binding domains (34,35). Additionally, site-directed mutagenesis and directed evolution studies revealed that tryptophan-binding sites are critical to determine the substrate scope and regioselectivity of selected enzymes (16,32,33,36-39).
The reactivity and regioselectivity of FDHs are essential in halogenation activity and are determined by the active sites that recognize aromatic substrates. However, no definitive set of genomic markers has yet been derived from large-scale sequence analysis to distinguish their reactivities at atomic accuracy. This is presumably due to the high variability in the amino acid composition of the substrate-binding pocket (40), and the FDH sequence motifs that underpin such functional diversity are yet to be elucidated (41).
Recent studies have shown that whole genomes as well as metagenomes can serve as an immense source of halogenase genes. Using PCR-based screening, Liao et al. found halogenase genes from Streptomyces and Nocardiopsis in Arctic marine actinomycete isolates (42). From a forest soil microbiome and 11 Botany Bay metagenomes, Weigold et al. (43) and Neubauer et al. (31) discovered putative FDHs. The increasing amount of metagenome data provides more opportunities to find novel halogenases with different substrate specificities.
Here, we report a comprehensive profiling of FDHs from bacterial genomes and metagenomes. A well-trained profile hidden Markov model (pHMM), built from experimentally validated halogenases, allowed us to find highly diverse halogenase genes from bacterial genomes and metagenomes. To discover remotely homologous genes, our sequence models include information for conserved class-specific cofactors and binding sites. Structural analysis on FDHs was integrated to identify key genomic determinants that are responsible for substrate binding and subsequent reactions. Using the isolated bacterial and metagenomic halogenases, we determined the steady-state activity, substrate scope, and regioselectivity of FDHs, thereby establishing comprehensive links between genotypic information and functional sites. Our work thus lays an important foundation for the future discovery and classification of novel FDHs by constructing accurate and detailed sequence models.

RESULTS
Genomic patterns of halogenases in bacterial genomes and metagenomes. We collected the experimentally validated FDH sequences to build a pHMM and searched for homologous genes in 2,787 bacterial genomes and 17 sediment microbiomes (see Table S1 in the supplemental material). In total, 103 and 68 halogenases were identified using the pHMM search from the bacterial genomes and metagenomes, respectively (see Fig. S1). The 103 halogenases from bacterial genomes were further divided into two groups: 20 located in biosynthetic gene clusters (BGCs) and 83 that were not. Additionally, 109 halogenases were obtained from the NCBI protein sequence database by keyword and pHMM search.
Similar patterns were also observed in the phylogenetic tree that was constructed with the same set of Trp-FDH and non-Trp-FDH sequences (Fig. 1b). The FDHs were grouped into three discrete groups, A, B, and C, in the tree. Consistently, the 17 known Trp-FDHs were clustered in groups A and B, whereas non-Trp-FDHs, such as MibH, BrvH, and KrmI, were found in group C, showing that the substrate scope is associated with the branch grouping. Regioselectivity appeared to be represented phylogenetically: 7-Trp-FDHs and 5-Trp-FDHs were found in groups A and B, respectively. Interestingly, 6-Trp-FDHs were found in both groups A and B, which was also observed in the embedding space in Fig. 1a.
Structure-guided analysis of FAD- and halide-binding regions in FDHs. To elucidate protein sequence-structure-activity relationships among FDHs, conserved regions were investigated. We screened 3-mers (i.e., sequences of three amino acids) using a multiple sequence alignment of 313 sequences (see Fig. S2A). Regions of low entropy were investigated using X-ray crystal structures of FDHs that are complexed with FAD, chloride, and tryptophan (Fig. 2a to f; Fig. S2B to H). We defined substrate-binding sites as the residues that make direct contacts with the molecule. As a result, 10 putative functional regions, FAD1 to -4, Lys (K), halide, and Trp1 to -4, were determined as well-conserved regions and subjected to further investigation (Fig. 2g).
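The low-entropy screen described above can be illustrated with a minimal sketch. It assumes per-column Shannon entropy averaged over windows of three alignment columns; the function names, the threshold, and the toy alignment are hypothetical and are not taken from the study, whose exact scoring procedure is not specified here.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def low_entropy_windows(alignment, k=3, max_entropy=0.3):
    """Return start columns of k-column windows whose mean entropy is <= max_entropy.

    `alignment` is a list of equal-length aligned sequences (assumed input format).
    """
    length = len(alignment[0])
    entropies = [column_entropy([seq[i] for seq in alignment]) for i in range(length)]
    hits = []
    for start in range(length - k + 1):
        window = entropies[start:start + k]
        if sum(window) / k <= max_entropy:
            hits.append(start)
    return hits

# Toy alignment (hypothetical sequences, for illustration only).
toy = ["GAGSAGWV", "GGGTAGWL", "GAGSAGWI", "GTGSAGWM"]
print(low_entropy_windows(toy, k=3, max_entropy=0.3))  # prints [2, 3, 4]
```

Windows flagged this way are only candidates; in the text they are subsequently checked against crystal structures to decide which ones contact FAD, halide, or tryptophan.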
We found four highly conserved regions related to FAD-binding (FAD1 to -4), as reported previously (18). We noted that FAD1 and FAD2 regions are consistently conserved in all identified FDHs in groups A to C. In contrast, FAD3 and -4 are less conserved than FAD1 and -2, exhibiting slight sequence alterations among groups. However, it is unclear whether the sequence variation in FAD3 and -4 is directly related to regioselectivity, as FAD3 and FAD4 are located remotely from the tryptophan-binding motifs.
The halide-binding site is also highly conserved in the sequences and structures of genomic and metagenomic FDHs. A large number of Trp-FDHs, including RebH (32,57) and PyrH (18), react with both Cl⁻ and Br⁻; the yields are either comparable or slightly higher with Cl⁻. In contrast, the recently identified halogenase BrvH exhibits catalytic reactivity only with Br⁻ (31), indicating that the halogenation activity is less likely determined by the halide-binding motif alone. Because the activated halogenating species, such as hypochlorite (ClO⁻) or hypobromite (BrO⁻), are likely to transfer from the halide-binding sites to tryptophan-binding regions, unidentified residues that are associated with the dynamic process might be related to halogen-specific reactivity.
Identification of tryptophan-binding regions in Trp-FDHs. In addition to the FAD- and halide-binding sites, the interactions with tryptophan are essential for halogenation (Fig. 2a to f). We determined Trp1 to -4 regions with low entropy of amino acid sequences (Fig. 2g): Trp1 (amino acids [aa] 50 to 55 in PrnA), Trp2 (aa 101 to 105 in PrnA), Trp3 (aa 160 to 163 in PyrH), and Trp4 (aa 443 to 461 in PrnA). Notably, different groups showed different patterns of tryptophan-binding regions. Trp1 and Trp2 motifs are found in all genes of groups A and B, whereas Trp3 is present only in group B, and Trp4 is present differentially in groups A and B. The Trp4 motif was not observed in group C.
For halogenases that have different regioselectivity for tryptophan, the Trp1 to -4 regions show considerable sequence variations. In particular, the third position in Trp1 is occupied by Ile (I), Val (V), or Phe (F) in 63%, 15%, and 14% of the sequences, respectively, and the fourth position is occupied by Pro (P) or Ser (S) in 84% and 13%, respectively. These positions are highly conserved within each subgroup, indicating that the regioselectivity could be related to the Trp1 motif. Structural analysis showed that the Trp1 motif undergoes considerable structural changes upon tryptophan binding (62). In particular, the third through fifth residues are involved in hydrophobic and hydrogen-bonding interactions with tryptophan (Fig. 2a to f), suggesting that Trp1 determines the regioselectivity (see the experiment results in the next section).
The Trp2 region also exhibits high sequence conservation. Here, hydrophobic residues constitute the first and third positions with 95% and 98% conservation, respectively, and engage in π-π stacking interactions with the indole ring portion of tryptophan. Although the side chain of the second residue in the HxF motif is not directed toward the substrate, noticeable sequence conservation is observed depending on phylogeny: Leu (L) in branch A1, and Pro (P) in branches A2, B1, and B2 (Fig. 2g). These observations imply that the least-conserved second residue in Trp2 might relate to altered orientations of the hydrophobic residues for π-π stacking interactions with tryptophan. Indeed, hydrogen-bonding interactions are observed between the indole N-H and the carbonyl group of Pro in the Trp2 motif in PyrH, suggesting its potential roles in regioselectivity (see the experiment results in the next section).
FIG 2 (partial legend) Residues comprising Trp1 to -4 are colored yellow, cyan, orange, and purple, respectively. Tryptophan molecules and catalytic lysine residues are shown in green and magenta bars, respectively. (g) Multiple-sequence alignment and conservation sites of FAD1 to -4, Lys (K), Trp1 to -4, and halide sites of FDHs. Secondary structure for each motif site is represented as α-helices, β-sheets, and loops using PrnA and PyrH.
The Trp3 region occurs only in branch B, and the number of nonconserved residues between the two Gln (Q) residues varies: QxxQ in group B1 and QxxxQ in group B2 (Fig. 2g). The enzyme PyrH belongs to group B1 and exhibits hydrogen bonds between tryptophan and the residues in Trp3, along with Trp1, -2, and Trp4 motifs (Fig. 2e). Notably, the residues positioned ahead of Trp3 (A153 to A158 in PyrH) become ordered upon complexation with tryptophan, indicating that these regions interact during substrate binding (62). The Trp3 region of PyrH (Q160 to Q163) lies in an alpha helix (α7) (Fig. 2g), which follows a longer loop than that observed in PrnA of group A. The side chains of Q160 and Q163 in PyrH interact with the carboxylate oxygen atom of tryptophan, and they are 3.21 and 2.88 Å apart from tryptophan, respectively (62). While Trp3 regions were not observed in the halogenases in group A, a strong signal of Trp4 was observed in alpha-helix regions (e.g., α15, α16, and α17 in PrnA; α14, α15, and α16 in Thal), which plays an important role in substrate binding. To verify the Trp3 functional site, we performed experimental characterizations (see the experiment results in the next section).
The region Trp4 shows high sequence and structural variations relative to Trp1 and Trp2. The Trp4 motif is assigned as YYxxF(E|D)(A|Y)EF(R|K)NFW(L|T)N(G|S)(N|S)YY in group A and (Y|H)YHG(L|F)(P|E)PYS in group B, whereas no analogous motif is found in group C (Fig. 2g). The structures of Trp-FDHs indicate that Trp4 motifs form enzyme-substrate interactions with tryptophan, which suggests their critical roles for substrate binding (Fig. 2a to e). This observation is consistent with the altered substrate scope and regioselectivity upon mutation of Trp4 motifs (33,37,39,63). The structure of BrvH (31) is in good agreement with the observation that halogenases lacking the Trp4 motifs do not interact with tryptophan. As such, it is likely that halogenases in group C might have a different substrate scope than those in groups A and B.
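As a rough illustration of how the Trp4 motifs quoted above could be used as sequence markers, the following sketch translates them into regular expressions, reading each parenthesized pair as an either/or choice and "x" as any residue. The function name and the test sequences are hypothetical; exact-match scanning is a simplification, and real sequences may need a tolerance for substitutions that this sketch does not provide.

```python
import re

# Trp4 motifs from the text, with "x" written as "." and alternatives as character classes.
TRP4_GROUP_A = re.compile(r"YY..F[ED][AY]EF[RK]NFW[LT]N[GS][NS]YY")
TRP4_GROUP_B = re.compile(r"[YH]YHG[LF][PE]PYS")

def trp4_group(protein_seq):
    """Return 'A', 'B', or None depending on which Trp4 motif (if any) is found."""
    seq = protein_seq.upper()
    if TRP4_GROUP_A.search(seq):
        return "A"
    if TRP4_GROUP_B.search(seq):
        return "B"
    return None  # no Trp4 motif, as described for group C

# Hypothetical sequence fragments built only to exercise the patterns.
print(trp4_group("MSKYYKRFEAEFRNFWLNGNYYTTA"))
print(trp4_group("MSKYYHGLPPYSTTA"))
print(trp4_group("MKTAYIAKQR"))
```

A sequence returning None would, under the reasoning in the text, be a candidate non-Trp-FDH, although that assignment would still need experimental confirmation.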
Characterization of putative Trp-FDHs. To validate the classification of FDHs, we isolated and characterized 11 putative halogenases: 7 bacterial genomic FDHs (Hal1 to -7) and 4 metagenomic FDHs (MHal1 to -4) (Fig. 1, Table 1, and Table S2). One or two sequences were randomly selected from each phylogenetic group (groups A to C). Hal1 and Hal7 are located close to 7-Trp-FDHs, such as PrnA and KtzQ, respectively, although Hal7 is located slightly apart from the majority of 7-Trp-FDHs. This pattern is also observed in the phylogenetic tree (Fig. 1b). In contrast, Hal2 and Hal3 are tightly clustered with multiple 6-Trp-FDHs in the embedding space and reside in group A2 with 6-Trp-FDHs such as ThdH (Thal), BorH, and AORI_5336 (Fig. 1). Notably, Hal3, which was previously named ThnH and reported as a part of the gene cluster for thienodolin synthesis (56), is clustered with a 6-Trp-FDH. Hal4 and Hal5 are tightly clustered with multiple 6-Trp-FDHs in the embedding space and belong to group B2 with 6-Trp-FDHs such as SttH, KtzR, and Th-Hal (Fig. 1). Lastly, Hal6 is close to 5-Trp-FDHs in the embedding space and is in group B1 with 5-Trp-FDHs such as PyrH, SpmH, and XszenFHal in the tree (Fig. 1). Notably, the selected enzymes possess strictly conserved residues involved in FAD and halide binding, whereas Hal1 to -7 genes exhibit the representative Trp1 to -4 motifs of each group shown in the conservation graph (Fig. 2g and Table 1). MHal1 to -4 exhibit an analogous Trp1 and -2 motif but no Trp4 motif.
The putative FDHs that we identified were heterologously expressed in Escherichia coli using the methods described in reference 32. We measured steady-state activities of putative Trp-FDHs from bacterial genomes (Hal1 to -7). The disappearance of the substrate, tryptophan, and the concurrent formation of new peaks were observed for all reactions (see Fig. S3). The products were analyzed by liquid chromatography-mass spectrometry (LC-MS) and ¹H and ¹³C nuclear magnetic resonance (NMR) (Table S3; Fig. S3; Text S1). Hal1 and Hal7 primarily produced 7-chlorotryptophan (7-Cl-Trp), while Hal7 also produced 6,7-dichlorotryptophan (6,7-Di-Cl-Trp) as a minor product. Hal2 to -5 produced 6-chlorotryptophan (6-Cl-Trp), and Hal6 produced 5-chlorotryptophan (5-Cl-Trp), which is consistent with the genomic patterns and the position in the embedding space (Fig. 3a and b). Notably, these results are inconsistent with the annotations in the NCBI gene database, where all selected genes, except for Hal3, are described as 7-Trp-FDHs. Such incorrect annotations might be due to the use of incorrect functional assignments based on homology transfer. Instead, the experimentally determined regioselectivity is consistent with the phylogenetic analysis (Fig. 1b), demonstrating that our bioinformatics approach correctly establishes the relationship between tryptophan-binding motifs and regioselectivity in halogenation.
Substitution of Cl⁻ with Br⁻ yielded comparable conversion (%) of tryptophan (Fig. 3a and c). LC-MS analysis indicated that the major products are monobrominated. The regioselectivity in bromination was maintained except for Hal4 to -5 in group B2, which produced mixtures of 5-Br-Trp and 6-Br-Trp. A lower regioselectivity here might indicate that Trp motifs in group B2 less stringently constrain the positioning of tryptophan to the catalytic lysine residue. Additionally, the more electrophilic and polarized nature of BrO⁻ over that of ClO⁻ might be partially responsible for the more reactive and less selective halogenation.
Altered regioselectivity of Trp-FDHs by the modification of Trp1 to -4 motifs. Our genomic analysis integrated with structural information indicated that the genomic patterns and the Trp1 to -4 motifs might determine the regioselectivity of tryptophan halogenation. We thus proceeded to investigate whether the discrete noncovalent interactions could be changed by site-directed mutagenesis, thus altering the regioselectivity. Key amino acids in Trp1 to -4 motifs were replaced by comparing the motifs between two groups of halogenases that differ in the regioselectivity. The first mutagenesis study was performed between 6-Trp-FDH and 7-Trp-FDH in group A (Fig. 4a). The second study was performed between 5-Trp-FDH and 6-Trp-FDH in group B (Fig. 4b).
As shown in Fig. 4a, the first mutagenesis study was performed between 6-Trp-FDH and 7-Trp-FDH in group A. For the conversion of Hal2 (6-Trp-FDH in group A2) to 7-Trp-FDH, we carried out the following mutations: ATVP to ATIP (Trp1), HPF to HLF (Trp2), and YYx₁₃GSYY to YYx₁₃GNYY (Trp4) (Fig. 4a). Mutation of the Trp1 or Trp2 motif alone or together yielded no detectable changes in regioselectivity (Fig. 4a). Substitution of the Trp4 motif resulted in the 1:0.6 production of 6-Cl-Trp and 7-Cl-Trp. Inclusion of Trp1 or Trp2 mutation in the Trp4 variant inverted the product distribution to 1:4. Finally, when all three mutations were combined, the ratio became 1:21, which is consistent with the recent work on the quintuple mutations of Thal (40). These data indicate that seemingly subtle mutations, such as Val to Ile (Trp1), Pro to Leu (Trp2), and Ser to Asn (Trp4), are responsible for regioselectivity, and the orchestration of these motifs is necessary to alter the positioning of tryptophan exclusively. These results also indicate that Trp1, Trp2, and Trp4 motifs can be utilized as effective sequence markers to distinguish 6-Trp-FDH from 7-Trp-FDH proteins. Similar effects of these mutations were observed in the bromination with Hal2 (Fig. 4a). Again, the Trp4 motif makes the largest contribution, and the highest conversion to 7-Br-Trp is observed when Trp1, Trp2, and Trp4 mutations were combined, producing 6-Br-Trp and 7-Br-Trp in a 1:6 ratio. The distribution of the brominated products was not as prominent as that with Cl⁻, which is consistent with our earlier observation that BrO⁻ is less selective than ClO⁻.
The reverse mutagenesis was performed with Hal7 (7-Trp-FDH in group A1) toward 6-Trp-FDH. The effects of swapping the Trp motifs were similar to those with the A2 protein (Fig. 4a). While Trp1 and/or Trp2 mutations yielded no or slight perturbation, the addition of the Trp4 mutation produced monohalogenated products at the 6 position in detectable amounts. When Trp1, Trp2, and Trp4 mutations were combined, the relative fraction of 6-Cl/Br-Trp was maximized, yielding 7-Br-Trp, 6,7-Di-Br-Trp, and 6-Br-Trp in a 1:0.46:1.23 ratio. We suspect that the 6,7-dihalogenated products were formed due to the loosely regulated regioselectivity. Subsequent growth and decay upon the series of mutations might indicate that more selective interactions are made with tryptophan.
The second mutation study was carried out between 5-Trp-FDH and 6-Trp-FDH in group B (Fig. 4b). They possess identical Trp1 and Trp2 motifs as ATFS and HPF, respectively. However, the tilted orientations of the conserved phenylalanines and a histidine were markedly different in the structures of PyrH (5-Trp-FDH in B1) and SttH (6-Trp-FDH in B2), thereby altering the directionality of the π-π stacking interactions. We speculated that the amino acids subsequent to the Trp1 and Trp2 motifs might be responsible for these distinct interactions. We thus extended the Trp1 and Trp2 motifs to include two additional amino acids: ATFSTV and HPFER in group B1 and ATFSDI and HPFEQ in group B2 (Fig. 2g and Table 1). Additionally, two glutamine residues in Trp3 (QxxQ in group B1 and QxxxQ in group B2) and Trp4 motifs (FE[T|S] in group B1 and LP[P|A] in group B2) were taken into consideration.
Since the role of Trp3 regions has not been investigated thus far, we first carried out experiments to confirm their involvement in the halogenation reaction. A double mutant was prepared with Hal6 by replacing both glutamines in Trp3 with alanines (from QxxQ to AxxA). Significantly lower conversion of tryptophan was observed in both chlorination and bromination reactions (see Table S3). The results suggest that two glutamine residues in Trp3 play an essential role in the reaction with tryptophan, possibly by the formation of hydrogen bonds with the amino and carboxyl moieties of the substrate.
Modification of Trp1 slightly altered the product distribution of 5-Cl-Trp and 6-Cl-Trp, whereas the ratio of brominated products was substantially changed to 1:4.6 (Fig. 4b).
However, mutations of either the Trp2 or Trp3 motif reduced the overall yields without considerable changes in regioselectivity. Trp4 mutation alone or combined with a Trp1 or Trp2 mutation completely inactivated the enzyme, which is consistent with previous studies on PyrH (33). However, enzymatic activity was restored by the introduction of a Trp3 mutation without altering regioselectivity. When Trp1 and Trp2 mutations were combined, the ratios of 5-Cl-Trp to 6-Cl-Trp and of 5-Br-Trp to 6-Br-Trp were 1:0.3 and 1:4.7, respectively.
In the reverse mutagenesis performed with Hal4 (6-Trp-FDH in group B2) toward 5-Trp-FDH in group B1, the regiochemical preference was improved. While Trp1 or Trp2 mutations of Hal4 alone yielded no alteration in the distribution of chlorinated products, the Trp4 mutation by itself altered the ratio of 6-Cl-Trp and 5-Cl-Trp to 1:0.4 (Fig. 4b), which is consistent with a previous report on the triple mutant of SttH (33). Substitutions on both Trp2 and Trp4 motifs resulted in negligible changes, whereas substitutions on Trp1 and Trp4 enhanced the ratio to 1:25. When Trp1, Trp2, and Trp4 motifs were altered simultaneously, the ratio was increased to 1:34 in chlorination, indicating that the combined interactions of these three motifs with tryptophan dictate regioselectivity in a nearly exclusive manner. The effects of this mutation were conserved in the bromination reaction except that Trp1 mutation alone induced considerable changes in the regioselectivity. Trp4 mutation produced mixed products at the ratio of 1:1; the combination of Trp1, Trp2, and Trp4 mutations maximized the ratio to 1:9.
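The product ratios reported in the mutagenesis experiments can be easier to compare when expressed as percentage shares of the regioisomers. The helper below is a small arithmetic sketch; the function name is hypothetical, and it assumes each ratio lists all products of interest in the order given in the text.

```python
def ratio_to_shares(ratio):
    """Convert a product ratio such as '1:21' into percentage shares per product."""
    parts = [float(p) for p in ratio.split(":")]
    total = sum(parts)
    return [100.0 * p / total for p in parts]

# Ratios quoted in the text: Hal2 triple mutant (6-Cl-Trp:7-Cl-Trp),
# Hal4 triple mutant (6-Cl-Trp:5-Cl-Trp), and the three-product Hal7 bromination.
for ratio in ("1:0.6", "1:21", "1:34", "1:0.46:1.23"):
    shares = ratio_to_shares(ratio)
    print(ratio, [round(s, 1) for s in shares])
```

For instance, a 1:21 ratio corresponds to roughly 95% of the newly favored regioisomer, which makes the step from the single Trp4 variant (1:0.6, about 38%) to the triple variant easier to appreciate.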
Biochemical characterization of metagenomic FDHs. Steady-state activity assays were carried out on MHal1 to -4. MHal2 to -4 actively halogenated indole, but no conversion was observed for tryptophan (Tables S3 and S4; Fig. S3; Text S1). The discrete reactivity indicates that Trp-binding site Trp4, which is absent in MHal1 to -4, is critical for the halogenation activity toward tryptophan. Notably, MHal2 to -4 were considerably more reactive in bromination than in chlorination, although there is no significant difference in the sequences and structures of FAD- and halide-binding sites in genomic and metagenomic FDHs. Similarly to that for genomic FDHs, the molecular basis for halide selectivity observed in MHals needs further investigation. The discrete chemical nature of BrO⁻ and ClO⁻ and/or kinetic and/or thermodynamic parameters in transfer reactions of hypohalite to tryptophan or the indole-binding region via catalytic lysine might be in part responsible for the determination of the reactivities. The latter is consistent with the molecular dynamics (MD) simulation studies of PrnA and PyrH (64), where significant conformational changes occur in the transfer of hypohalite. MHal1 was inactive even with indole, which is presumably related to the substantially altered sequence near the Trp-binding region (see Fig. S4).
In reactions with indole, MHal2 to -4 exhibited identical regioselectivity, exclusively producing 3-Cl- and 3-Br-indoles. We also measured the reactivities of Hal1 to -7 with indole. They also produced 3-Cl- and 3-Br-indoles, displaying regioselectivity identical to that of MHal2 to -4 (Fig. 5a). Additionally, Hal1, Hal6, and Hal7 yielded doubly halogenated products from indole, i.e., 2,3-dichloroindole or 2,3-dibromoindole. The formation of doubly halogenated products indicates that the binding affinity and additional halogenation reactivity of 3-Cl-indole or 3-Br-indole might be at least comparable to those of indole.
As the distinction between Hal1 to -7 and MHal2 to -4 is related to the Trp4 motif, we also explored whether the sequences in Trp4 may determine the reactivity with tryptophan versus that with indole. When two tyrosine residues (YY) at the beginning of Trp4 motif in Hal1 were mutated to alanines (Fig. 5b and c), the mutant yielded even higher conversion of tryptophan. When the other conserved residues in Trp4, glutamate and asparagine (EN), were mutated to alanines, the quartet mutant became completely inactive to tryptophan. Double or quartet mutants produced halogenated indoles with comparable or lower yields than Hal1, indicating that the absence of these four residues might severely weaken substrate-binding affinity, leading to lower catalytic activities, particularly for tryptophan but not as much as with indole. Therefore, the presence of the Trp4 motif might be critical to determining the substrate scopes, such as tryptophan versus indole, and the product yields.
We further investigated the substrate scopes of Hal1, Hal3 to -7, and MHal1 to -4 beyond indole (Fig. 6; Table S4; Fig. S3; Text S1). Hal2 was excluded due to protein instability. First, we tested whether other aromatic amino acids, such as tyrosine and phenylalanine, can be halogenated. No reactivity was observed, implying that the sp² carbon of aromatic rings might not be closely placed to the catalytic lysine. It is possible that the relatively smaller size of aromatic rings might have altered the mode of binding and interactions with halogenases. When indole derivatives, such as benzimidazole and benzothiazole having a heteroatom such as N and S at the C-3 position of indole, respectively, were added, no halogenated product was again observed, indicating that an alternative mode of binding of the indole moiety is disallowed, which contrasted with the discrete regioselectivity of tryptophan. Phenol was brominated by most of the proteins that exhibited activity with indole (Hal1, Hal3 to -7, and MHal3 and -4). The genomic and metagenomic FDHs were also active with 1-naphthol and 2-naphthol. Hal1 and Hal3 to -7 yielded both chlorinated and brominated products (11% to 83% and 21% to 100% conversion, respectively), whereas MHal2 to -4 reacted with 1- and 2-naphthols only in the presence of Br⁻ (4% to 69% conversion). Notably, irrespective of the phylogenetic positions of the genes, only the ortho and para positions of phenol and naphthols were halogenated, which reflects a strong intrinsic electronic bias for regioselectivity not overridden by substrate-protein interactions (Fig. 1b). Our results reveal that FDHs can act on diverse substrates, including a 6-membered benzene ring (phenol), 5-membered ring fused with 6-membered ring (indole), and two fused 6-membered rings (naphthol), with substantial variations in conversion and product distribution.

DISCUSSION
Many studies have performed computational analyses to estimate the functions of the genes in newly assembled genomes. For this purpose, sequence searches have been extensively used to find homologous genes of the same function. However, there are limitations in finding subtle differences in biochemical reactivities such as regioselectivity in functional group transformation and selectivity of substrates. Since halogenation is an important chemical reaction in pharmaceutical or agrochemical applications, we have carried out an integrated genomic and protein structural analysis to identify the genotypes that are critical to the biochemical functions of FDHs. From both genome and metagenome mining, halogenases with similar genomic patterns of active sites are clustered in the phylogenetic tree as well as embedding space.
In this study, we found primarily genomic patterns that determine the enzymatic activity toward tryptophan (Trp-FDH versus non-Trp-FDH) and the site of halogenation for tryptophan (5-, 6-, and 7-Cl/Br-Trp-FDH). Recent studies found that the sequences with different substrate scope were clustered based on the sequence similarity in the network analysis (19) and screened the halogenation with various substrates (65). Structure analysis of the known proteins identified the residues that bind to FAD and substrates (66-69). Compared with such studies, our work utilizes large-scale genomic information to propose a comprehensive set of active sites. Such sequence motifs for the active sites showed different patterns for the discrete regioselectivity of Trp-FDHs. Notably, the sequence motifs of Trp-FDHs that belong to A1 and A2 differ only moderately while they exhibit discrete regioselectivity. The results suggest that the halogenating reactivities are determined by the actions of several nonsequential and subtly different sequence motifs in concerted and complicated manners. In conjunction with biochemical assays, mutagenesis experiments suggest that the sequence motifs can be exchanged to create altered product distributions, implicating potential biotechnological applications in protein engineering and synthetic biology for the synthesis of novel natural products.
How substrate specificity and regioselectivity correlate with source organisms and their habitats is also an interesting and challenging question in the characterization of FDHs. In terms of taxonomic distribution, the FDHs were found in Actinobacteria, Cyanobacteria, and Proteobacteria (see Table S1 in the supplemental material). Because too few FDHs have been identified so far, it is hard to determine the correlation between substrate specificity/regioselectivity and taxonomic distribution. In this study, we used seawater microbiome data to find more diverse FDHs and found that most of the sequences were indole-FDHs. Screening of FDHs in more diverse environments could provide a better chance to characterize FDHs. The demonstration of discrete and diverse substrate scopes of genomic and metagenomic FDHs suggests that the FDHs annotated in the last decade are only a small portion of the genes, and more FDH genes could be discovered.
As discussed above, many gene annotations are based primarily on homology searches, which are fast but less accurate at predicting subtle functional differences. While the NCBI annotation reported six FDHs (Hal1 and -2 and Hal4 to -7) as 7-Trp-FDHs, we found that four of these genes are in fact 5- or 6-Trp-FDHs. To assign subtle functions such as regioselectivity, the genomic sequences of halogenases in the public repository need additional confirmation procedures using more informative models such as motif patterns or phylogenetic characteristics.

MATERIALS AND METHODS
Construction of Trp-FDH pHMM and halogenase identification. To search for halogenase genes in the bacterial genomes and metagenomes, the pHMM was constructed using 13 Trp-FDHs experimentally validated in previous studies (see Fig (70). The pHMM of halogenase genes was constructed using hmmbuild in HMMER3.0 (ver. 3.1b2) (71). To identify the genes that are homologous to the known Trp-FDHs in bacterial genomes and metagenomes, hmmscan in HMMER3.0 (71) was used with the --domtblout option. Halogenases identified in a BGC were included regardless of the E value. Homology search results were filtered using an in-house Python script. The models and script are available on GitHub (https://github.com/DMnBI/Halogenase). The E value threshold was set to <10⁻¹³⁰, and the alignment was required to cover >50% of the model length. A total of 351 sequences were collected from the NCBI repository by searching with the keyword "tryptophan n-halogenase [All Fields] AND bacteria [filter]," where n = 5, 6, and 7. Partial and redundant sequences were discarded, resulting in 333 sequences (Fig. S1). In addition, we found three additional Trp-FDHs from the literature survey, which were also included in the motif analysis.
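The filtering step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' in-house script (which is available at the GitHub site cited above). It assumes the standard whitespace-delimited hmmscan --domtblout layout, in which the target is the profile and the query is the protein sequence. It also assumes that the threshold applies to the full-sequence E value and that coverage is measured from the profile coordinates of each domain alignment; the text does not specify either point. The names Hit, parse_domtblout, filter_hits, and keep are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    sequence: str    # query protein (hmmscan "query name" column)
    model: str       # profile HMM (hmmscan "target name" column)
    evalue: float    # full-sequence E value (assumed to be the one thresholded)
    coverage: float  # fraction of the model spanned by the domain alignment


def parse_domtblout(lines: Iterable[str]) -> Iterator[Hit]:
    """Yield one Hit per domain row of an hmmscan --domtblout table."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split()
        model, model_len, sequence = f[0], int(f[2]), f[3]
        evalue = float(f[6])
        hmm_from, hmm_to = int(f[15]), int(f[16])
        yield Hit(sequence, model, evalue, (hmm_to - hmm_from + 1) / model_len)


def filter_hits(hits, max_evalue=1e-130, min_coverage=0.5, keep=frozenset()):
    """Keep hits passing both thresholds; sequences listed in `keep`
    (e.g., halogenases found inside a BGC) bypass the thresholds."""
    return [h for h in hits
            if h.sequence in keep
            or (h.evalue < max_evalue and h.coverage > min_coverage)]
```

The `keep` argument mirrors the statement that halogenases located in BGCs were retained regardless of the E value.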
Detection of halogenase genes in the BGCs. To identify BGCs in 2,787 bacterial complete genomes obtained from the NCBI repository, antiSMASH3.0 (72) was used. In total, 10,166 BGCs were predicted; halogenases inside a BGC were characterized subsequently. In total, 20 putative halogenases inside BGCs and 83 outside BGCs were classified.
Sample collection, preparation, and metagenome shotgun sequencing. A total of 17 metagenomic samples were obtained from different marine environments, such as coastal and deep-sea sediments (Table S1). The metagenome data sets analyzed in this study were collected in our previous studies and downloaded from the European Nucleotide Archive with the accession number ERP107268 (Table S1). All reads were 151-bp paired-end reads. Low-quality reads, including those with the ambiguous base "N," were trimmed and filtered using Sickle (downloaded from https://github.com/najoshi/sickle) with a Phred quality score threshold of >20. The metagenomic reads were assembled using MEGAHIT (v1.0.3) (73) with default k-mer options. The assembled contigs longer than 1,000 bp were translated into protein sequences using FragGeneScan (v1.19) (74).
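The contig length cutoff applied before gene prediction can be illustrated with a short Python sketch. It is not part of the published pipeline; the function names read_fasta and long_contigs are hypothetical, and the input is assumed to be a plain FASTA file of assembled contigs.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {contig_id: sequence}, joining wrapped lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}


def long_contigs(contigs: Dict[str, str], min_length: int = 1000) -> Dict[str, str]:
    """Keep only contigs strictly longer than `min_length` bp."""
    return {n: s for n, s in contigs.items() if len(s) > min_length}
```

Only the contigs returned by long_contigs would then be passed to the gene predictor.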
Embedding of halogenase genes and classifying halogenases. To investigate the similarity of genomic patterns among FDHs, t-distributed stochastic neighbor embedding (t-SNE) was carried out using the R package "Rtsne" (75) with a perplexity of 10 and 1,000 iterations. From multiple-sequence alignment of the known halogenase sequences, the profiles were constructed for six groups (5-, 6-, and 7-Trp-FDHs, indole-Hals, phenolic-Hals, and pyrrole-Hals). Using these profiles, the probability of the amino acid at each position of the halogenase genes was calculated as a feature value to learn their embedding. After obtaining embedding vectors for each halogenase, the class (5-Trp-FDHs in B1, 6-Trp-FDHs in B2 and A2, 7-Trp-FDHs in A1, and indole-Hals in C) was assigned using the k-nearest neighbor method with a k of 1.
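The feature construction and nearest-neighbor assignment can be sketched as follows. This is a simplified illustration under stated assumptions: all sequences are taken from one multiple-sequence alignment and therefore have equal length, the group profiles are plain column frequencies with a pseudocount (the actual profile construction is not detailed in the text), and the 1-nearest-neighbor step is applied directly to the feature vectors, whereas the study applied it to the t-SNE embedding vectors. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Profile = List[Dict[str, float]]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def build_profile(aligned: Sequence[str], pseudocount: float = 1.0) -> Profile:
    """Column-wise residue probabilities for one group of aligned sequences."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def features(seq: str, profiles: Sequence[Profile]) -> List[float]:
    """Probability of each residue of `seq` under every group profile,
    concatenated into one vector (unknown residues score 0)."""
    return [p[i].get(res, 0.0) for p in profiles for i, res in enumerate(seq)]


def nearest_label(query: List[float],
                  labelled: Sequence[Tuple[List[float], str]]) -> str:
    """Assign the label of the closest reference vector (k = 1, Euclidean)."""
    return min(labelled, key=lambda item: math.dist(query, item[0]))[1]
```

In practice the feature vectors would first be reduced with t-SNE, and the class of each new halogenase would be taken from its nearest labeled neighbor in the embedding.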
Phylogenetic analysis. Multiple-sequence alignments were conducted using MUSCLE (v3.8.31) (76), and a phylogenetic tree was constructed using MEGA6.0 (77). The evolutionary distance was calculated using the maximum likelihood method based on a Jones-Taylor-Thornton (JTT) matrix-based model. In the phylogenetic analysis, a total of 180 sequences (of 313) were used for better representation of the tree (Fig. S1). Among the 109 sequences from the keyword search, 6 were included. Among the 68 sequences obtained from metagenome data, 38 sequences were included by random sampling.
Examination of FDH structures. Three-dimensional structures of FDHs cocrystallized with native substrates such as FAD, halide, and/or tryptophan were obtained from the RCSB database and inspected for structural analysis. The figures were created using PyMOL software (78). The residues involved in direct enzyme-substrate interactions with the native substrates were manually examined to correlate the structural data with the sequence analysis (Table 1). Residues that form enzyme-substrate interactions were identified as FAD1 to -4, halide, and Trp1 to -4 (Fig. S2B to H). To represent the conservation of functional regions, the sequence logo was constructed using WebLogo (v.3.6.0) (79) with default options.
Biochemical characterization of putative halogenases from genomic and metagenomic sequences. From the sequence analysis data, we selected 11 putative FDH genes, 7 from genome data (Hal1 to -7) and 4 from metagenome data (MHal1 to -4). For steady-state coupled assays, an NADH-dependent reductase (MR) was selected from the metagenome with MHal4. The MR is functionally analogous to NAD(P)H-dependent flavin oxidoreductase (Fre) found in Escherichia coli (80).
The 12 genes were synthesized after codon optimization for E. coli heterologous expression (General Biosystems, Morrisville, NC). MHal1 to -3 were inserted into the pET-22b(+) vector using NdeI and XhoI as the cut sites. MR, Hal1 to -7, and MHal4 were inserted into the pET-28a(+) vector using NdeI and BamHI as the cut sites. A stop codon was inserted at the end of the sequence due to its native C-terminal His tag sequence in the pET-28a(+) vector. The plasmids were transformed into DH5α or BL21(DE3) strains of E. coli competent cells by the heat shock method for sequencing or protein expression, respectively. All putative FDHs were coexpressed with GroES and GroEL chaperones using pGro7 plasmid (TaKaRa Bio) as reported previously (32). For site-directed mutagenesis of FDHs, custom-made primers were used for PCR (Table S2). Detailed experimental procedures for cell growth, protein isolation, purification, and characterization are available in Text S1 and Fig. S5.
Characterization of an NADH-dependent reductase from arctic environments. The catalytic activity of a newly discovered reductase (MR) was determined by monitoring the consumption rates of NADH by the addition of 5 mM NADH, 10 μM FAD, 5 μM reductase, and 65 mM halide source, such as NaCl and NaBr. The conversions (%) of indole were measured using high-pressure liquid chromatography (HPLC) by calculating the residual indole in the presence of 5 mM NADH, 10 μM FAD, 5 μM reductase, 65 mM NaBr, and 25 μM MHal2 in 25 mM HEPES (pH 7.4) buffer at room temperature. For preliminary studies, we measured the activity of another reductase, Fre (Fig. S5) (81).
Activity assays of putative FDHs. For steady-state coupled activity assays, NADH (5 mM, 20 eq to the aromatic substrate), FAD (10 μM, 0.04 eq), aromatic substrates (250 μM, 1 eq), and NaCl or NaBr (65 mM, 260 eq) in 25 mM HEPES buffer, pH 7.4, were prepared in a 1.5-ml microcentrifuge tube at final concentrations. The reaction was initiated by adding putative FDH (25 μM, 0.1 eq) and MR (5 μM, 0.02 eq) to the premixed solution to make the final volume 100 μl. The assays were carried out at room temperature in darkness. After 5 h, 100 μl of methanol was added, and the resulting precipitates were removed by centrifugation at 15,928 × g for 10 min at 4°C. After syringe filtration, the supernatant was analyzed by reverse-phase HPLC (Agilent 1260 Infinity II) with an InfinityLab Poroshell column (120 EC-C18, 4.6 by 100 mm, 2.7 μm, or 120 EC-C18, 4.6 by 150 mm, 2.7 μm) with a linear gradient of H2O containing 0.05% trifluoroacetic acid (TFA) and acetonitrile (ACN) containing 0.1% TFA. A representative HPLC trace and the retention time are shown in Fig. S3. Averages and standard deviations of activity assays were obtained from triplicate experiments. The conversions and yields are listed in Tables S3 and S4.
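The stated equivalents follow directly from the ratio of each final concentration to that of the aromatic substrate. The short Python check below illustrates the arithmetic. It reads the FAD, substrate, FDH, and MR concentrations as micromolar and the NADH and halide concentrations as millimolar, which is the only reading consistent with the equivalents given in the text; the variable and function names are hypothetical.

```python
SUBSTRATE_UM = 250.0  # aromatic substrate, defined as 1 eq (micromolar, assumed unit)

# Final concentrations in micromolar (assumed units, see lead-in)
components_um = {
    "NADH": 5_000.0,           # 5 mM
    "FAD": 10.0,
    "NaCl or NaBr": 65_000.0,  # 65 mM
    "putative FDH": 25.0,
    "MR": 5.0,
}


def equivalents(conc_um: float, reference_um: float = SUBSTRATE_UM) -> float:
    """Molar equivalents relative to the aromatic substrate."""
    return conc_um / reference_um


for name, conc in components_um.items():
    print(f"{name}: {equivalents(conc):g} eq")
```

The computed values reproduce the 20, 0.04, 260, 0.1, and 0.02 eq quoted for NADH, FAD, halide, FDH, and MR, respectively.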
Characterization of the halogenated products. For NMR characterizations, we scaled up the reaction volumes to 90 ml without altering the concentrations or the ratios of the reaction components. We observed that increasing the volume of the reactions considerably increased the total yields of the products, presumably because the activities of coupled enzymatic reactions are dependent on the concentration of dissolved dioxygen in the buffer solution. To maximize the yields, the reaction mixtures were incubated overnight. Depending on the substrates, the workup procedure was slightly modified. For the reactions with tryptophan, 90 ml of methanol (MeOH) containing 0.1% TFA was added for quenching. After centrifugation at 2,935 × g for 10 min at 4°C and syringe filtration, the solution was directly injected into the semipreparative HPLC (ZORBAX SB-C18 semipreparative 9.4- by 250-mm, 5-μm column). For the reactions with substrates other than tryptophan, ca. 5 drops of concentrated HCl were added for quenching. Precipitates were removed after centrifugation at 2,935 × g for 10 min at 4°C, and the supernatants were extracted with dichloromethane (DCM). After vacuum evaporation, ACN-H2O (1:9 [vol/vol]) was added to the extracted fraction. The solution was filtered through a 0.25-μm syringe filter and purified by HPLC (ZORBAX SB-C18 semipreparative 9.4- by 250-mm, 5-μm column).
The reaction products of non-amino acidic substrates (indole, phenol, 1-naphthol, and 2-naphthol) were characterized by gas chromatography-mass spectrometry (GC-MS) (Hewlett Packard HP 6890 series GC system equipped with Agilent 5973 Network mass selective detector or Agilent Technologies 7820A GC system equipped with Agilent Technologies 5977E MSD). The reaction products from tryptophan were analyzed by LC-MS (Agilent Technologies 1200 Infinity series equipped with Agilent Technologies 6120 quadrupole LC-MS). The incorporation of chlorine or bromine was determined by the isotope ratios of ³⁵Cl to ³⁷Cl of 3:1 or ⁷⁹Br to ⁸¹Br of 1:1, respectively. The products were analyzed by ¹H and ¹³C NMR spectroscopy (Fig. S3; Text S1).
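The isotope-ratio criterion can be expressed as a simple check on the intensities of the M and M+2 peaks. The Python sketch below is illustrative only: it applies to products carrying a single halogen atom (dihalogenated products show a different pattern), and the function name and the 25% tolerance are assumptions rather than values from the study.

```python
def classify_halogen(m_intensity: float, m_plus_2_intensity: float,
                     tolerance: float = 0.25) -> str:
    """Guess the incorporated halogen from the M:(M+2) intensity ratio.

    A ratio near 3:1 suggests one chlorine (35Cl/37Cl); a ratio near 1:1
    suggests one bromine (79Br/81Br). `tolerance` is the allowed relative
    deviation from the expected ratio (an assumed value).
    """
    if m_intensity <= 0 or m_plus_2_intensity <= 0:
        return "none"
    ratio = m_intensity / m_plus_2_intensity
    for halogen, expected in (("Cl", 3.0), ("Br", 1.0)):
        if abs(ratio - expected) <= tolerance * expected:
            return halogen
    return "none"
```

For example, peak intensities of 100 and 33 would be classified as chlorinated, and intensities of 100 and 98 as brominated.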
Synthesis of 3-chloroindole and 3-bromoindole. To isolate large quantities of 2,3-dichloroindole and 2,3-dibromoindole for ¹H and ¹³C NMR experiments, we synthesized 3-chloroindole and 3-bromoindole and used them as substrates for the reactions with Hal1. In a dried 50-ml round-bottom flask, 588 mg of indole (1 eq) and 377 μl of 1,4-dimethylpiperazine (0.558 eq) were dissolved in 5 ml DCM. The temperature was lowered by placing the flask in iced water. After 10 min, 800 mg of N-chlorosuccinimide or N-bromosuccinimide (0.9 eq) was added slowly, and the mixture was stirred for 15 min to yield 3-chloroindole or 3-bromoindole, respectively. After 15 min, the temperature was raised to room temperature, and the reaction mixture was stirred for 3 h. Volatile fractions were removed by rotary evaporation, and the product was isolated by silica chromatography (ethyl acetate-hexane, 5:95). The isolated yields of 3-chloroindole and 3-bromoindole were 55.5% and 57%, respectively.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. TEXT S1, DOCX file, 0.1 MB.