Functional and Mechanistic Characterization of an Enzyme Family Combining Bioinformatics and High-Throughput Microuidics

Next-generation sequencing doubles genomic databases every 2.5 years. The accumulation of sequence data raises the need to speed up functional analysis. Herein, we present a pipeline integrating bioinformatics and microfluidics and its application for high-throughput mining of novel haloalkane dehalogenases. We employed bioinformatics to identify 2,905 putative dehalogenases and selected 45 representative enzymes, of which 24 were produced in soluble form. Droplet-based microfluidics accelerates subsequent experimental testing up to 20,000 reactions per day while achieving 1,000-fold lower protein consumption. This resulted in doubling the dehalogenation “ toolbox" characterized over three decades, yielding biocatalysts surpassing the efficiency of currently available enzymes. Combining microfluidics with modern global data analysis provided precious mechanistic information related to the high catalytic efficiency of new variants. This pipeline applied to other enzyme families can accelerate the identification of biocatalysts for industrial applications as well as the collection of high-quality data for machine learning. identify novel variants of the model enzyme family, haloalkane dehalogenases. Our results show that only 63 % of the identified putative HLDs were labelled correctly as dehalogenase enzymes in genomic databases. While miss annotations were rare, many proteins annotated as “α/β -hydrolase ” or “ hypothetical protein” would have been missed by a simple text-based search. Proteins from the α/β hydrolase fold superfamily are well-known for their catalytic promiscuity and tendency to catalyze diverse reactions using the same catalytic machinery. 31 Substrates are not currently known for 35 % of enzymes annotated as α/β -hydrolases, and thus their functions remain unclear. The current mining approach identified more than 2,578 putative HLDs, five more in in silico screening putative HLDs. The current screening approach only 97 sequences of


GRAPHICAL ABSTRACT INTRODUCTION
Nature is an experienced "protein engineer" since it launched its bioengineering experiments billions of years ago. 1 It has evolved a fascinating diversity of biocatalysts. This enormous pool of biodiversity represents an inventory of valuable biocatalysts fulfilling the demands for new and higher-quality products and services. We still only scratch the surface of potential biocatalytic applications as the current pool of available enzymes catering to biomedical, pharmaceutical, and industrial applications remain limited. 2 The advent of next-generation sequencing technologies has revolutionized genomic research, filling public databases with DNA and protein sequences. There are currently almost 300 million non-redundant protein sequences in genomic databases. 3 The rate of sequence data accumulation far exceeds the speed of functional studies. In practice, the low ratio of explored to unexplored protein sequences reflects the low-throughput efficiency of existing biochemical techniques compared to highthroughput next-generation sequencing technologies.
The magnitude of the "big data" generated in biology is enlarged by the fact that a large portion of functional annotations contains vague, indirect, or even incorrect descriptions. 4 Extrapolation from sequence to protein function is not trivial and often proves incorrect when tested experimentally. Thus, the challenge now is to explore how this new sequence information can be harnessed to obtain new biocatalysts and what modern techniques can be used to characterize them rapidly. Several strategies have been developed to exploit the vast number of enzyme sequences in genome and metagenome databases to discover novel biocatalysts. 5,6 Efficient exploration of the millions of uncharacterized enzymes in public databases can be achieved by computational approaches, which offer an adequate capacity for screening large pools of sequences. 7 In a previous study, we have developed a sequencebased strategy that identifies putative members of known enzyme families and facilitates their prioritization for experimental characterization. 8 The main benefit of such a genome mining approach is the effective identification of thousands of putative hits among millions of sequence entries and the rational selection of a restricted set of attractive targets. 9,10 Subsequent experimental characterization of selected representative candidates is a crucial but rate-limiting process. Conventional techniques used to collect experimental biochemical data are labour and time-demanding, cost-ineffective, and lowthroughput. Given the many attractive targets, analytical tools capable of collecting functional and mechanistic data with high throughput are particularly attractive to accelerate the discovery of new biocatalysts. 11,12 In this context, implementing microfluidic technology in experimental workflows offers key advantages, including low sample consumption, the possibility of parallelization, and integration with optical detection.
Herein, we describe an integrated workflow for effectively mining novel protein family members using computer-assisted genome mining and high-throughput experimental characterization (Fig. 1).
First, we applied an automated bioinformatics workflow to identify putative family members and select promising candidates (Fig. 1a). Using homology modelling, we predicted tertiary structures of all target proteins, analyzed them for cavities and access tunnels, and conducted molecular docking simulations with a set of representative substrates. Next, we experimentally characterized the selected hits from the in silico screening by employing small-scale expression followed by microfluidic and microanalytical analysis (Fig. 1b). Achieving an analytical throughput of up to 20,000 reactions per day and reduced protein consumption to only tens of micrograms, three orders of magnitude less than conventional approaches, droplet-based microfluidics enabled fast collection of combinatorial biochemical data, as well as detailed kinetic and thermodynamic properties providing valuable mechanistic information. In a single run of this workflow, we achieved doubling the number of experimentally characterized haloalkane dehalogenases (HLDs). We obtained a pool of new biocatalysts with industrially attractive characteristics such as high catalytic efficiencies, unique substrate specificities, high enantioselectivities, and broad temperature tolerance.

I. In Silico Screening Using an Automated Workflow
Model Enzyme Family. The automated database-mining protocol was tested with the HLDs (Fig. 1a). Three decades of intensive research on HLDs has made them benchmark enzymes for studying the structure-function relationships of the >100,000 members of the α/β-hydrolase fold superfamily 13 and the development of novel concepts in the field of protein engineering. [14][15][16] Members of this enzyme family have been employed in several practical applications: (i) biocatalytic preparation of optically pure building-blocks for organic synthesis, (ii) recycling of by-products from chemical processes, (iii) bioremediation of toxic environmental pollutants, (iv) decontamination of chemical warfare agents, (v) biosensing of environmental pollutants, and (vi) protein tagging for cell imaging and protein analysis. 17 Database Mining. The rate generation of genomic databases is doubling every ~2.5 years. Therefore, periodic mining of novel genes is highly required. We rerun the in silico screening with the same input sequence as previously 8 using the current version of the NCBI nr database and a recently developed tool for automated database mining. 18 In comparison to the initial version, the presented in silico workflow has been significantly expanded by: (i) application of EFI-EST 19 and Cytoscape 20 for calculation and visualization of the sequence similarity network, (ii) extraction of the biotic relationships and disease annotations of the source organisms from the BioProject database, 21 and (iii) the quantitative assessment of the quality of all homology models by MolProbity. 22 Sequence database searches using four known HLDs as query sequences generated 24,594 hits sharing minimal sequence similarity to at least one of the query sequences. The putative HLD sequences containing the target HLD domain were automatically recognized using global pairwise sequence identities and average-link hierarchical clustering. Artificial protein sequences annotated by the terms "artificial", "synthetic construct", "vector", "vaccinia virus", "plasmid", "HaloTag", or "replicon", were excluded.
Clustering and Filtering. The remaining 2,905 protein sequences were clustered into four subfamilies: HLD-I (915), HLD-II (1058), HLD-III (910), and HLD-IV (22), based on the sequence identity and the composition of their catalytic pentads. Despite having identical catalytic pentads, HLD-III and HLD-IV were clustered separately based on differences in their sequence similarity.
Incomplete and degenerated sequences were filtered out by the construction of multiple sequence alignments of individual subfamilies. Sequence-similarity networks were used to visualize relationships among putative HLD sequences (Fig. 2). The most apparent defining features are clustered in the distinct HLD subfamilies, implying that the sequence-similarity networks might provide a framework for identifying HLDs of similar structural and functional properties and surveying regions of sequence space with high diversity. To diversify HLD sequence space, redundant sequences with ≥ 90 % sequence identity to the set of 22 characterized dehalogenase sequences were filtered out.
Annotation and Homology Modelling. The remaining 2,578 putative HLD sequences were subjected to an annotation step consisting of information retrieval from biological databases and structure predictions. The annotation step revealed that the identified HLDs span a broad range of sequence and host diversity, including bacterial, archaeal, and eukaryotic proteins. The overall accuracy of annotation, judged by assignment to the HLD family, was 63 % but varied significantly among each of the HLD subfamilies. Most sequences in HLD-I (73 %) and HLD-II (86 %) subfamilies were annotated correctly. In contrast, the portion of correctly annotated sequences was reduced to 31 % for HLD-III and 56 % for HLD-IV (Table S1). Most members from the putative HLD-IV subfamily were annotated as HLDs, despite their low sequence identity to experimentally characterized HLDs or other HLD subfamily members. The annotation revealed four putative dehalogenases from psychrophilic organisms, 35 novel proteins from moderate halophilic organisms, and four proteins with known tertiary structures. Reliable homology models could be constructed for most members from subfamily HLD-I and HLD-II but only a limited number of HLD-III members and none of the HLD-IV members. The predicted volumes of catalytic pockets ranged from 60 Å 3 to 2,170 Å 3 (Fig. S1).
Prioritization and Selection of Targets. Rational selection of hits for experimental characterization was carried out to maximize the functional diversity of the studied protein family (Supporting Data Set). The dataset of 2,578 putative HLDs was summarized in 17 datasheets focused on different annotations or computed properties. Hits represented by homology models with MolProbity scores > 3.0 were removed from the datasheets summarizing the annotations based on the predicted homology structure, i.e., active site volume and tunnel properties. A few sequences were picked from each datasheet to make the selection as diverse as possible ( Table S2). The sequences with a higher predicted solubility and higher-quality homology models were prioritized. Simultaneously, we tried to balance the number of sequences from each haloalkane dehalogenase subfamily (HLD-I, HLD-II, and HLD-III).
The only exception was the HLD-IV subfamily, which contains multi-domain protein sequences derived from eukaryotic organisms. We avoided sequences with additional Pfam domains, as they are usually poorly expressible in bacterial host systems. Altogether 45 diverse sequences were selected as targets for experimental characterization (Table S3, Table S4).

II. Small-Scale Protein Expression
A representative set of 45 HLD genes was subjected to a small-scale expression in Escherichia coli in 96-deep well square plates and screening of HLD activity in whole cells (Fig. 2). Overall, 40 out of  (Table S5). A further thorough analysis of solubility profiles revealed that most of the proteins belonging to HLD-I (73 %) and HLD-II (71 %) sub-families were expressed in a soluble form, while a minority of HLD-III (40 %) proteins were soluble. We then probed the expressibility of all 45 HLD genes using a reconstituted cell-free transcription and translation system (PURExpress). Overall, 41 of 45 genes (91 %) were overexpressed, and 29 proteins (64 %) were obtained in soluble form (Fig. S3b). Application of the cell-free PURExpress system did not result in the desired improvement of solubility for the "difficult-to-express" HLDs. Activity measurements with whole cells showed that 40 out of 45 proteins (89%) had HLD activity with at least one of the five substrates tested (Fig. 2, Table S6). This confirms that most of the identified genes code for proteins with target activity. These include not only those 24 well-expressed genes yielding soluble proteins listed as sufficient for biochemical characterization but also 12 genes coding for HLDs with low or poor solubility and 4 genes showing weak expression in the small-scale expression analysis (Fig. S3a).

III. Microfluidic and Microanalytical Characterization
The experimental pipeline comprised commercial microanalytical instruments (e.g., capillary and microcuvette differential scanning fluorimetry) and custom-made microfluidic platforms leading to 1000-fold lower consumption of protein and increased throughput up to 20,000 reactions per day compared to 96 well plates. The methodology provided experimental data on protein stability, catalytic activity, temperature profiles, substrate specificity, kinetics, and thermodynamics ( Table 1). Protein Stability. The stability of the novel HLDs was analyzed in a high-throughput manner by monitoring changes in extrinsic (SYPRO orange dye) and intrinsic (tryptophan) fluorescence during temperature scanning experiments, using thermal shift assay (TSA) and microscale differential scanning fluorimetry (DSF), respectively. The midpoint of the denaturation curve (apparent melting temperature, Tm app ) was used for the stability evaluation. The results of fast microscale methods showed a very good agreement (R 2 0.79 and 0.93 for TSA and micro-DSF, respectively) with conventional circular dichroism (CD) spectroscopy (Table S7, Fig. S5). The apparent melting temperatures (Tm app ) values, shown in Table S7, reflect the mesophilic origins of the novel HLDs (40-55 °C) primarily. Exceptions are the DsmA and DppsA, which exhibited Tm app values at 35.7 and 38.1 °C, respectively, correlating with their psychrophilic origin. It is worth noting that the most stable protein identified was DspoA with a Tm app value of 60 °C.
Temperature Profiling. Temperature profiling was performed using a capillary-based droplet microfluidic system described elsewhere. 23 The new dehalogenases obtained in this study showed activity over a wide temperature range (Fig. 3b, Fig. S6, Table S8). DmaA was especially unique, as it retained more than 65 % dehalogenase activity at 5 °C. This dehalogenase performs equally well at this low temperature compared to benchmark dehalogenases at their temperature optima (30-45 °C). 17 A positive correlation was observed between the temperature of the highest observed activity (Tmax) and the temperature at which protein denaturation starts (Tonset) obtained from temperature scanning experiments (Fig. S7).
Substrate Specificity Profiling. Profiling of substrate specificity towards a set of 27 representative substrates was also conducted using the capillary-based droplet microfluidic system (Table S9). This structurally diverse set of substrates reflects the application of HLDs, including environmentally important compounds ( Table S10). The raw data of specific activities (Table S11) showed that HLDs exhibited better activities with the following order of preference: brominated > iodinated ≫ chlorinated.
Principal Component Analysis (PCA). First, we conducted PCA analysis using the untransformed specificity data of 8 benchmarks 23 and 24 newly identified HLDs. This analysis aimed to compare the enzymes according to their score along with the first principal component (t1), thus quantifying their global activity against the set of substrates activity (Fig. 3d). Surprisingly, 11 of the 24 newly characterized HLDs showed significantly higher global activity than the known benchmark HLDs. This result was validated using conventional activity measurements with a selected, overall well-converted substrate, 1,3-dibromopropane (Fig. S9). Six out of these 11 highly active enzymes were then chosen to characterize steady-state kinetics and reaction thermodynamics using the microfluidic approach (Fig. 3d). The second PCA was performed with log-transformed and weighted activity data allowing a direct comparison of the specific profiles of individual enzymes unbiased by the different levels of their global activity (Fig. S10). The benchmark HLDs (DbjA, LinB, DmbA, DhlA, and DhaA) were clustered in agreement with the previously reported substrate-specificity groups of HLDs. 24 In this analysis, two of the newly discovered variants, DstA and DthA, were separated from other enzymes due to their unusually narrow substrate specificity.
Hierarchical Clustering. The log-transformed specificity data were subjected to hierarchical clustering for identifying similarity in preferred substrates or selectivity of enzymes; both plotted as a double dendrogram heatmap (Fig. 3c). Our analysis clustered the substrates into three main groups. The

Steady-State Kinetics and Reaction Thermodynamics. Steady-state kinetics and reaction
thermodynamics parameters were determined for selected highly active enzymes, DspoA, DexA, DeaA, DprxA, DphxA, and DhxA, using a combination of droplet-based microfluidics and global numerical analysis of kinetic data (Fig. 4a). Complex data, including concentration and temperature dependence of the reaction, were collected by monitoring the conversion progress starting at 6 different substrate concentrations (0-1 mM 1,3-dibromopropane) each at 6 different temperatures from 25 to 50 °C in 5degree increments (Fig. 4b). The global numerical fitting of such a complex dataset provided unique estimates for the specificity constant (kcat/Km), turnover number (kcat), the equilibrium constant for enzyme-product complex dissociation (KP), and the corresponding energy barriers (Fig. 4c, Fig. S11, Table S13). There were two reasons to follow a new methodology in fitting steady-state data and calculating directly kcat/Km instead of Km. First, unlike Km, which has no mechanistic meaning, kcat/Km is interpreted as the apparent second-order rate constant for substrate binding and quantifies enzyme specificity, efficiency, and proficiency. 25 Second, there are smaller errors in the fitting process to derive kcat/Km directly rather than calculating the ratio of kcat and Km derived independently (see details in

Supporting Information).
All six selected enzymes showed higher values of kcat in comparison with the hitherto known enzyme from the HLDs family, including engineered variants with typical kcat below 10 s -1 . 17 The highest known turnover number for a dehalogenase, kcat of 56 s -1 , was determined for LinB86 in the conversion of 1,2dibromoethane. This multiple-point mutant with an introduced de novo access tunnel was obtained by several cycles of computer modelling and rational engineering. 16 Despite the high kcat, it still showed a relatively low specific constant (kcat/Km = 24 mM -1 .s -1 ). Interestingly, LinB wild type in the reaction with 1-chlorohexane exhibited a high specific constant (kcat/Km = 160 mM -1 .s -1 ), but in this case, associated with lower kcat = 2.6 s -1 . Significantly, the new biocatalysts identified in this study exceeded the currently available enzymes concerning high values in turnover numbers ranging from 13 to 80 s -1 (Fig. 4c, upper left) without the need for laborious engineering procedures and, in addition, these new enzymes showed an advantageous combination of high values of both, turnover numbers and specificity constants ( Fig. 4c), making them superior to any known haloalkane dehalogenase.
The temperature dependences analyzed for the catalytic rate (kcat) indicated that the free energy of activation is predominantly determined by a positive enthalpy or combination of both contributions, entropy, and enthalpy, in the case of DprxA and DhxA. Interestingly, DspoA, DexA, and DphxA showed a favourable entropic contribution lowering the activation energy of the catalytic turnover (Fig. 4c). The temperature dependences of kcat/Km indicated that the efficiency of substrate binding is similarly influenced predominantly by enthalpy (DeaA, DprxA, DhxA) or a combination of positive enthalpy and unfavourable loss of entropy (DspoA, DhxA). The other two interesting cases are DexA, with its specificity constant dominated by unfavourable entropy, and DeaA, with a favourable positive entropy compensating activation enthalpy and reducing the overall free energy of activation (Fig. 4c). The mechanistic information derived from the differences in the thermodynamic profiles provides an excellent starting point for rational design 26  Secondary and Quaternary Structure. Additionally, we analyzed the secondary and quaternary structure using far-UV-CD spectroscopy and size-exclusion chromatography, respectively. All HLDs exhibited CD spectra with one positive peak at 195 nm and two negative minima at 208 and 222 nm, characteristic for proteins with an α/β-hydrolase fold (Fig. S15). Newly identified HLDs were mostly monomeric, similar to the previously characterized HLD members (Table S15). Exceptions were DmmarA, which exists as a dimer, and DprxA, a mixture of monomer, dimer, and higher oligomeric states, respectively (Fig. S16). Interestingly, DstA showed sensitivity to the environment's oxidation/reduction potential and formed dimers only under oxidative conditions.

DISCUSSION
The biotechnology field employing enzymes as catalysts represents a billion-dollar industry, putting constant pressure on speeding up the identification and characterization of novel biocatalysts. 2 The avalanche of newly available sequences from next-generation sequencing represents an enormous potential but at the same time a significant challenge for the practical aspects of efficient search and throughput for experimental functional characterization. The application of rational genome mining can provide a potential solution to managing a large quantity of complex sequence data effectively. 30 Currently, it is not feasible to characterize all sequences being deposited in sequence databases. Instead, in silico screening and prioritizing a narrower selection of targeted sequences, followed by miniaturized high-throughput characterization, appears to be an attractive approach. This study has integrated computational genome mining with high-throughput microfluidic and microanalytical techniques to identify novel variants of the model enzyme family, haloalkane dehalogenases. Our results show that only 63 % of the identified putative HLDs were labelled correctly as dehalogenase enzymes in genomic databases. While miss annotations were rare, many proteins annotated as "α/β-hydrolase" or "hypothetical protein" would have been missed by a simple text-based search. Proteins from the α/βhydrolase fold superfamily are well-known for their catalytic promiscuity and tendency to catalyze diverse reactions using the same catalytic machinery. 31 Substrates are not currently known for 35 % of enzymes annotated as α/β-hydrolases, and thus their functions remain unclear. 32 The current mining approach identified more than 2,578 putative HLDs, nearly five times more hits than in the previous in silico screening (530 putative HLDs. 8 The current screening approach missed only 97 sequences out of the original set and identified 2,145 new sequences. Although in silico screening strategies for identifying novel enzymes are being used profitably, [33][34][35] there is no report providing automated protocols for rational selection of a limited number of promising hits for further characterization. In this project, several sequence-based analysis tools have been developed to predict key protein characteristics, e.g., thermostability, 36 optimal pH 33 , or protein solubility. [37][38][39][40] Other computational tools help automatically analyze, filter and visualize large sets of identified hits. 19,20 The in silico part of sequence mining analysis presented in this study is available as a user-friendly web tool EnzymeMiner (https://loschmidt.chemi.muni.cz/enzymeminer/), making the analysis widely accessible to the scientific and industrial communities. 18 In addition to the prediction of tertiary structures that can be achieved using the recently released AlphaFold 2, 41 analysis of cavities and access tunnels, modelling of enzyme-substrate complexes will be implemented in a future version of the presented web tool. The major limitation of in silico analysis is the prediction of protein solubility.
Despite applying the recent solubility prediction tool SoluProt, 37 , our comprehensive expression analysis of the whole set of 45 selected putative HLDs revealed only a 67 % success rate in terms of soluble proteins. A similar result was achieved by the previous screening of novel HLDs, where only 60 % of the constructed variants could be expressed in soluble form. 8 Protein production in E. coli can be improved by optimizing genetic constructs or expression conditions. However, related combinatorial variation is impractical for such a large set of proteins. Therefore, the production of soluble proteins remains a hit-or-miss affair and currently represents the biggest bottlenecktoward the functional and structural characterization of novel proteins. Improvement of the in silico solubility prediction tool is paramount for the increased success rate of protein characterization pipelines. 14,27 An essential component of the experimental workflow is the application of time-and materialefficient microscale and microfluidic methods. These miniaturized techniques can be parallelized, allowing high throughput with low demands on the amount of biological material. 42 The droplet microfluidics system described in this study provides a high-throughput platform for activity screening, temperature profiling, and recording kinetic data of different HLD variants in a miniaturized fashion.
Since each droplet is isolated and independent from the other droplets, large numbers of discrete experiments or assays can be performed on short timescales. The current droplet screening setup allows for the measurement of enzymatic activity of 24 HLDs against 27 substrates in a few days representing a marked improvement from state-of-the-art systems. Compared with plate-based screening approaches, our system provides enhanced levels of material efficiency, 1,000-fold lower protein consumption, and throughputs up to 20,000 reactions per day.
The temperature-controlled droplet-based platform described within this study was used to measure temperature-dependent kinetics and extract energetic and entropic contributions. This combination of kinetic and thermodynamic analysis identified variants that are superior to currently available enzymes concerning their catalytic efficiency and exhibiting notable variations in thermodynamic parameters, which essentially drive their catalytic force. Enzymes possessing a differential mix of enthalpy and entropy contributions to the catalytic activity provide unique starting points for laboratory evolution, targeting active sites, 43,44 access tunnels 45 , or dynamical protein loops. 46 Such valuable mechanistic information is rarely collected for multiple catalysts during protein discovery campaigns due to the timeconsuming nature of these experiments and complex data analysis.
This study significantly enriched the toolbox of model HLDs biocatalysts available for various biotechnological applications. 17 Homology modelling followed by the calculation of the active site volumes is a powerful approach for identifying enzymes with high catalytic activities: 11 of 24 characterized HLDs showed activities higher than the reported previously. 24 Moreover, the high selectivity described for two enzymes, DstA and DthA, where these prefer only one or two molecules from a wide set of representative substrates, has not been described so far. Several novel enzymes showed high enantioselectivity, including (S)-enantiopreference unusual for this family of enzymes.
Temperature profiling experiments identified DmaA, which retained > 65 % of its maximal activity at 5 °C, an attractive property for applying HLDs as environmental biosensors. 47 It is worth noting that some newly discovered enzymes combine several industrially attractive properties, such as high activity, stability, and selectivity.

CONCLUSIONS
In this study, we demonstrated how a combination of automated in silico screening and high throughput miniaturized experimental techniques can be used to obtain highly efficient catalysts from the large unexplored diversity of an enzyme family hidden in genome databases. A set of variants with catalytic efficiencies an order of magnitude higher compared to currently available family members, unique substrate specificity, and selectivity in combination with a wide range of operational temperature was obtained without the need for demanding engineering or high-throughput experimental screening.
The hereby-used in silico pipeline employs automated sequence similarity search accompanied by annotation and structural bioinformatics analyses. The experimental characterization of new variants (tens to hundreds) was accelerated by applying high-throughput microfluidic and microanalytical methods. The droplet-based microfluidics enabled minimized consumption of protein samples and high throughput acquisition of substrate specificity, activity, and steady-state kinetics data at various temperatures. The latter provided valuable mechanistic information and the extraction of thermodynamic parameters, entropy, and enthalpy, contributing to the catalysis. Accordingly, developing an automated droplet-based microfluidic device will open up new opportunities for future optimal data collection employing back-loops and machine learning algorithms. Another interesting perspective appears to be the application of cell-free methods for protein production and its integration into microfluidic systems, which will simplify the experimental workflow. Contrary to expectations, cell-free protein production did not increase the yield of soluble variants in this case. Therefore, developing more reliable prediction tools to refine the selection of protein candidates with good solubility remains a significant challenge. We also demonstrated that repetitive database mining generates a variety of novel enzymes with valuable industrial properties and enables the consistent collection of experimental data. High-quality datasets are attractive for machine learning methods, which may in the future provide an understanding of sequence-function relationships and contribute to the development of a new generation of tools in protein engineering.

EXPERIMENTAL SECTION
In Silico Screening. A previously developed in silico pipeline for the identification and characterization of putative HLDs was employed. 8   Each enzyme was assayed at the temperature closest to its Tmax value (0-10 °C below Tmax). A detailed description of the microfluidic method was provided previously. 56 Briefly, the droplets were generated using Mitos Dropix (Dolomite, UK). A custom sequence of droplets (150 nL aqueous phase, 300 nL spacing oil) was generated using negative pressure generated by a syringe pump. The droplets were guided through a polythene tubing to the incubation chamber. Within the incubation chamber, the halogenated substrate was delivered to the droplets via a combination of microdialysis and partitioning between the oil (FC-40 (3M, USA) + 0.5 % PicoSurf 1 (Sphere Fluidics, UK)) and the aqueous phase.
The reaction solution consisted of a weak buffer (1 mM HEPES, 20 mM Na2SO4, pH 8.2) and fluorescent indicator 8-hydroxypyrene-1,3,6-trisulfonic acid (HPTS, 50 μM). The composition of buffer was optimized to maintain both droplet stability and provide a weak buffering system for the sensitivity of pH assay. The buffer exchange of enzyme samples was carried out using the standard spin protocol of PD Minitrap™ G-25 (GE Healthcare, USA), where 2 centrifugation steps were applied, each at 1,000 g for 2 min. The fluorescence signal was obtained by using an optical setup with an excitation laser (450 nm), a dichroic mirror with a cut-off at 490 nm filtering the excitation light, and a Si-detector (Thorlabs, Germany). By employing a pH-based fluorescence assay, small changes in the pH were observed, enabling monitoring of the enzymatic activity. Reaction progress was analyzed as an end-point measurement recorded after 10 droplets/sample passed through the incubation chamber. The reaction time was 4 min. The raw signal was processed by a droplet detection script written in MATLAB 2017b (MathWorks, USA) to obtain the specific activities. The raw signal of every single measurement was at first processed by a LabView-based code (National Instruments, USA) developed in-house. The peaks were assigned to the particular sample using this software, and the mean signal was calculated. The output XLS file gathering mean signal values for every sample type (calibration, enzyme activity, buffer, blank buffer, and blank enzyme) for a particular dataset (e.g., 6 enzymes measured in one temperature with all 27 substrates) served as an input for the MATLAB script (MathWorks, USA) calculating the specific activities using the same principle described previously. 23,56 The activities were classified as "not determined" whenever the measured product concentration was below the limit of detection (LOD -3 times noise signal). Each substrate had a different calibration curve, so the LOD product concentration was in the range of 10-100 μM). Two PCA models were constructed to visualize systematic trends in the dataset. The first one was done on the raw data, which ordered the enzymes according to their total activity. The second PCA was carried out on the log-transformed data. Each specific activity needed to be incremented by 1 to avoid the logarithm of zero values. The resulting values were then divided by the sum of the values for a particular enzyme, and weighted values were estimated. These transformed data were used to calculate principal components, and the components explaining the highest variability in the data were then plotted to identify substrate specificity groups. Additionally, the hierarchical clustering analysis was performed on the log-transformed data using MATLAB (MathWorks, USA).

Equation 1
The dependence of partition coefficients on temperature was determined in the temperature range from 20 to 60 °C in 10-degree increments (Fig. S12). The partition coefficients of 1,3-dibromopropane were analyzed by monitoring the substrate concentration in a two-phase system, including reaction buffer excluding HPTS and FC-40 oil. Initially, 1 µL of the halogenated substrate was dissolved in 1 mL of the oil and continuously shaken for 30 min at room temperature. Then 1 mL of reaction buffer excluding HPTS was added to create the two-phase system similar to droplet-based microfluidics. After 30 min incubation in a water bath with continuous shaking at the target temperature, 30 µL samples of both the oil and the aqueous phase were extracted with 300 µL acetone containing 4.7 mM 1,2,3-trichloropropane as internal standard. In both phases, the concentration of a particular substrate was quantified using an Agilent 5975C mass spectrometer coupled to an Agilent 7890A (Agilent Technologies, USA) gas chromatographer equipped with a ZB-5 capillary column (30 m x 0.25 mm x 0.25 μm, Phenomenex, USA). The samples were taken by Automatic Liquid Sampler, and 1 L was injected into the Split-Splitless inlet at 250 °C, with a split ratio of 1:20. The temperature program was isothermal at 50 °C for 1 min, followed by an increase to 190 °C at 25 °C/min. The flow of carrier gas (He) was 2.2 ml.min -1 .
The concentration of 1,3-dibromopropane in reaction oil was determined using GC-MS similarly to the above-described measurement of partition coefficients. The extraction was performed using 50 µL of oil with substrate extracted in 250 or 500 µL of acetone with internal standard.
Global Numerical Integration of Rate Equations. The datasets consisting of temperature and concentration dependence of reaction rates were fit globally based on numerical integration of rate equations using KinTek Explorer software version 10 (KinTek Corporation, USA), 57 which includes the capability to fit temperature-dependent rate constants. 58 The software allows for the input of a given kinetic model via a simple text description, and the program then automatically derives the differential equations needed for numerical integration. An updated form of the steady-state model applying new standards for fitting kinetic data (Scheme 1) was used to obtain the values of turnover number (kcat) and specificity constant kcat/Km directly. By setting k−1 = 0, after substrate binds, it is always converted to the product, so kcat/Km is defined by the value of k1. 25 By setting k−2 = 0, the value of kcat is defined by k2.
The Michaelis constant was derived from the two primary steady-state kinetic parameters (Km = kcat / kcat/Km). Numerical analysis of full progress curve kinetics provided accurate parameter estimates, including product inhibition (KP = k3/k-3, when k-3 was varied as a fitted parameter and k3 = 1,000 s -1 was used as a fixed value to satisfy k3 >> k2 assumption).

Scheme 1
Numerical integration of rate equations searching a set of kinetic parameters that produce a minimum χ 2 value was performed using the Bulirsch-Stoer algorithm with adaptive step size, and nonlinear regression to fit data was based on the Levenberg-Marquardt method. To account for fluctuations in experimental data, the halogenated substrate concentrations were allowed to be slightly adjusted (± 5 %) to derive the best fits. Residuals were normalized by sigma value for each data point. The standard error (S.E.) was calculated from the covariance matrix during nonlinear regression. In addition to S.E. values, a more rigorous analysis of the variation of the kinetic parameters was accomplished by confidence contour analysis using FitSpace Explorer (KinTek Corporation, USA). 59 In these analyses, the lower and upper limits for each parameter were derived from the confidence contour by setting the χ 2 threshold at 0.95. The standard error estimates in the fitted parameters were propagated to obtain estimates of the error in the calculated value of Km. The global kinetic model was used to analyze the temperature dependence of obtained kinetic parameters using the Eyring equation, ln(kcat/T) = -ΔH ‡ /(R.T) + ln(kB/h) + ΔS ‡ /R, to obtain estimates for the enthalpy (ΔH ‡ ) and entropy of activation (ΔS ‡ ), where R is the universal gas constant, kB is the Boltzmann constant and h is Planck's constant. The Gibbs free energy (ΔG ‡ ) was defined as ΔG ‡ = ΔH ‡ -T.ΔS ‡ at the reference temperature 310.15 K.

Enantioselectivity.
Kinetic resolution experiments were performed at 20 °C. The reaction mixtures consisted of 1 mL glycine buffer (100 mM, pH 8.6) and 1 μL of a racemic mixture of 2-bromopentane or ethyl 2-bromopropionate. The glycine buffer was selected to maintain sufficient buffering capacity in the mildly alkaline pH range corresponding with the pH profiles for most characterized HLDs. A detailed description is provided by Vanacek et al. 8 The kinetic resolution data were fitted globally using KinTek Explorer software (KinTek Corporation, USA). Applying new standards for collecting and fitting steady-state kinetic data, 25

Equation 2
The spontaneous conversion of the substrate (reaction without enzyme) was included in the model and analyzed globally to obtain a specific effect of enzyme selectivity. Numerical integration of rate equations searching a set of kinetic parameters that produce a minimum χ2 value was performed using the Bulirsch-Stoer algorithm with adaptive step size, and nonlinear regression to fit data was based on    The sequences were first clustered at 50 % identity to reduce the number of nodes and edges. The sequences with higher identity are consolidated into a single node. Edge lengths indicate sequence similarity between representative sequences of the connected nodes. Sequence similarity networks of putative HLDs were calculated and visualized by EFI-EST 19 and Cytoscape v3.6.1 20 . The results from expression, solubility, and activity analysis is shown in the doughnut graphs (upper left). Enzymes were assigned to five distinct groups of enzymes based on their expressibility, solubility, and activity, indicated by different colours in doughnut graphs and the sequence similarity network. 24 well-soluble and active enzymes (green) were subjected to systematic biochemical characterization. Four weakly expressed genes (black) and twelve overexpressed genes providing proteins with low solubility (yellow) were tested positive with at least one of the five halogenated substrates in the whole-cell activity screening assay (Table S6). Four overexpressed genes providing insoluble proteins (blue) and one weakly-expressed gene (red) led to proteins that did not exhibit any activity with the substrates tested.  (4), fluorescence excitation laser (5), and a photodetector (6). b, Temperature profiles. The heat map represents the relative activity of individual enzymes. c, Multivariate analysis of substrate specificity. A double-dendrogram heat map of log-transformed data depicts the similarity of enzyme activity (vertical axis) and conversion of halogenated substrates (horizontal axis). Major groups of enzymes and substrates are highlighted with the same colour. d, Multivariate analysis of catalytic activity. The score plot t1 compares the enzymes in terms of their overall activity with 27 substrates and explains 85.1 % of data variance. The light red frame highlights new enzymes with an outstanding catalytic activity, which were characterized for their steady-state kinetics and reaction thermodynamics (Fig. 4). The previously characterized HLDs are coloured grey in c, d. The heat maps (b, c) and bars (d) are colour-coded by enzymatic activity from low activity (blue) to medium activity (yellow) and high activity (red).  (1) and oil phase (2), droplet generator (3), reaction zone with temperature control (4), motorized stage (5), excitation light source (6), dichroic mirror (7) and detection of emitted light (8). b, Example of the kinetic and thermodynamic data collected for DspoA by monitoring the enzymatic conversion under different substrate concentrations (0-1 mM 1,3-dibromopropane) at different temperatures (25-50 °C) in 1 mM HEPES buffer and pH 8.2. Each data point represents an average of 20 repetitions; the solid lines represent the best global fit. The data for all selected enzymes, parameter estimates, and statistics are summarized in Fig. S11 and Table S13. c, The kinetic parameters (top figures), turnover number (kcat), and specificity constant (kcat/Km), obtained by global fitting complex kinetic and thermodynamic data (values at reference temperature 310.15 K, 30 °C). The error bars represent standard errors. Grey columns represent previously reported values for the reaction of wild type LinB with 1-chlorohexane and engineered LinB86 with 1,2-dibromoethane (100 mM glycine buffer pH 8.6 at 37 °C). The