Background & Summary

High-temperature proteins (HTPs) find their way into wide-ranging applications spanning from drugs and diagnostics in human health to catalysing reactions in industrial processes and bioremediation1,2,3,4,5. The extraordinary adaptability of these proteins enables them to maintain functionality even in extreme conditions— a feat that unfortunately remains a formidable challenge in protein design and engineering6,7,8. One key determinant of protein adaptability is thermal stability, governed by the Gibbs free energy difference between the folded and native states. This energy balance, crucial for proteins to maintain their functional conformation, remarkably holds across various organisms and environments9,10. Even so, minor alterations in protein sequences can have profound impacts on their intricate structures, and consequently, on their stability and folding capacity11. Despite these challenges, life finds a way: thermophiles, which rely on HTPs, can thrive at extreme temperatures12,13,14,15.

Current methods that aim to produce proteins functional at higher temperatures, such as directed evolution or rational design, unfortunately, provide no guarantee of success and require substantial effort for each new protein of interest16,17,18,19,20,21,22,23,24. Recent advances in fully deep learned designers have shown impressive results and generating sequences that reliably fold to a target structure, particularly at ambient temperatures25,26,27,28. While new and improved models are continuing to be developed, they do not invariably produce designs that retain tertiary structure and activity at high temperatures29,30,31. Strategies which are based on substructures, energy functions, or patterns learned from protein structures represented in the PDB, are limited considering that the majority of proteins are non-thermophilic: only 5% of proteins from the top 25 most populous source organisms are thermophilic32,33,34,35. The temperature-dependent nature of enthalpic and entropic forces in the protein means that stability at ambient temperature does not necessarily translate to high-temperature stability36,37. Learning from high temperature proteins in a targeted manner may further improve reliability of novel methods. Consequently, researchers have been in search of rules employed by nature to produce HTPs for years14,38,39,40,41,42. Table 1 presents a selection of datasets that have been utilised in this search for design principles.

Table 1 A selection of the largest publicly available datasets of proteins and their thermal stability.

Existing datasets, apart from their size, suffer from a number of limitations. For example, single point mutation datasets such as FireProtDB allow for the study of stability independent from confounding factors like evolutionary drift, yet they offer limited variety and informational density when compared to the variety of known proteins37,43,44. Larger datasets that label substantial portions of proteins using parameters such as melting temperature (TM) or the more noisy optimal growth temperature (OGT) of the host organism can be used to discern average trends45,46,47,48. However, any thermal stability modes extracted in this way may not be universally applicable, as thermal stability is often specific to the protein fold itself42,49,50. Therefore, patterns identified on average may even prove destabilising when applied to a new target41,51. To account for this, thermal stability should be studied in a context dependent manner, where homologs are paired across temperature, with multiple examples across evolution. The largest dataset of protein pairs contains only 1.6k unique examples, and does not have redundancy across evolution52.

Recent advancements have shown that large deep learning models can effectively understand protein context in applications such as binding predictions, structure prediction, and backward design, among others31,53,54,55,56,57,58. Each of these applications, however, is predicated on large (>10k) datasets. To translate these successes to context-dependent thermal stability design, the development of a large dataset of homologous protein pairs across different temperatures becomes imperative. This work introduces such a dataset: learn2thermDB. The dataset includes 69 million protein pairs of 250 amino acids or fewer derived from 4739 mesophilic organisms (with OGT < 40 °C) and 289 thermophilic organisms (with OGT > 40 °C, up to 98 °C) using homology search. We first paired mesophilic organisms to thermophilic ones by evolutionary distance, and then homologous proteins among those taxa pairs. Applying a stricter thermophilic threshold of 60 °C, used by the current largest dataset, our database still retains 9 million protein pairs, a four orders of magnitude increase. It is worth noting that individual proteins often participate in multiple pairs, providing the database with some redundancy across the evolutionary landscape. Moreover, even those organisms that did not demonstrate pairing still contribute to the dataset with their proteomes labelled by OGT, yielding 23 million proteins from mesophiles and 1 million from thermophiles (with a maximum OGT of 102 °C) for further study. The dataset’s size, in terms of the included organisms and proteins, is depicted in Table 2. The featured organisms span a broad range of prokaryotes across the taxonomic tree (Fig. 1)-left and cover much of the known protein space (Fig. 1)-right. The current largest dataset of protein pairs across temperature from Hait et al. is also depicted for comparison in Fig. 1-right. The distribution of OGT difference between organisms is shown in Fig. 2-top, and the number of protein pairs as a function of OGT difference in Fig. 2-bottom. We can see that the distribution is skewed towards similar temperatures, however we still retain millions of protein pairs with OGT difference > 30 °C. The dataset as described in this manuscript is made available as-is on Figshare, please see the section “Data Descriptor” below59.

Table 2 Number of various entities in the dataset all with temperature labels.
Fig. 1
figure 1

The taxonomic and protein space covered by the protein homologous pairs (N = 69 mil) within the dataset. Left) NCBI taxonomic breakdown of the dataset88. The outer ring depicts super kingdom, Phylum, Class, and Order, where the size of wedges indicates the number of organisms in the classification that contain at least one protein in a learn2therm protein pair. Highly populated Phylum and Class are labelled. The inner ring moving inward is a histogram of the number of proteins participating in pairs per organism followed by a colour mapping labelling organisms as mesophilic in blue and thermophilic in red. Central connections indicate taxa pairs contributing to protein pairs. Right) Two dimensional mapping (using t-SNE) of a sample protein space as determined by Evolutionary Scale Model (ESM) embeddings89. In blue, a sample of data from the ESM Atlas with highest structural confidence. Note that this data contains eukaryotic proteins. In yellow, our proteins in pairs, and in orange, the current largest set of protein pairs across temperature. Size of samples conserves relative size of our proteins vs. the Atlas and reference dataset. For details of the mapping procedure, see Supplementary Information S8.

Fig. 2
figure 2

The data distribution as a function of difference in growth temperature between items. top) Histogram of difference in OGT between pairs of organisms. The data is skewed towards 0, but there are still many pairs with temperature differences >30 °C. bottom) Count of protein pairs remaining as minimum OGT difference is increased. We still retain 14 mil protein pairs with OGT difference >30 °C.

In addition to the static dataset provided, the entire pipeline used to produce the data is not only open source, but parameterized, such that an interested reader can reproduce or alter the produced dataset. Please see section “Usage Notes.” We believe that the carbon cost of compute efforts is as important as the result itself, thus our pipeline tracks approximate carbon cost (CodeCarbon, hhtps://codecarbob.io/). The cost of running the pipeline was estimated to be 11.5 kg carbon, according to the energy partition in Washington State, USA, and the compute total carbon cost of this research effort including development was estimated at about 70 kg.

Methods

Our process of identifying homologous protein pairs across temperature is discussed in the following sections. The pipeline is depicted in Fig. 3.

Fig. 3
figure 3

Workflow for labelling homologous protein pairs across temperature. Raw data includes RefSeq. 16s rRNA sequences, OGT labels from Engqvist, UniProtKB proteome metadata and proteins. Proteome metadata is parsed to identify a single proteome for highly represented organisms, while retaining data for weakly studied taxa. Proteins are filtered such that only ones from the chosen proteomes and for which we have OGT are kept. Protein pair search space is filtered by first identifying pairs of related organisms via 16s rRNA alignment. Protein pairs are searched for by alignment of sequences. Final database tables are taxa, pairs of meso/thermo taxa, proteins, and pairs of meso/thermo proteins.

Ingestion of raw data records

We retrieved Archaeal and Bacterial 16s rRNA sequences, along with their associated NCBI taxonomy identifier (taxid) from NCBI BioProjects 33175 and 33317 using the Entrez API60,61. We only retained full sequences, ranging from 1300 to 1600 bases. When multiple sequences mapped to the same taxid, we observed >98% identity and retained the longest. Curated OGTs, produced by Engqvist, were then downloaded62,63. For cases with multiple OGTs corresponding to the same species taxid, we used the average. The OGTs and 16s rRNA sequences were subsequently linked via taxid, generating a table with both quantities measured. For the next steps, we categorised organisms with OGT > 40 °C as thermophiles. It should be noted that this information is tracked, enabling the users to filter data using a more stringent thermophilic label, such as 60 °C3.

We downloaded Archaeal and Bacterial protein sequences from UniProtKB64. UniProtKB Proteome metadata was also retrieved to minimise redundancy while preserving taxonomic diversity. For strains with multiple proteomes, if any, we selected a single proteome based on UniProt’s priority labels: “Reference and representative proteome”, “Reference proteome”, “Representative proteome”. If these labels were absent, we discarded “Redundant proteome” and “Excluded proteome”, choosing the most populous remaining proteome. This process was repeated at the species level. Protein data files were parsed for primary sequence, host organism taxid, PDB, and AlphaFold database ID. Proteins not mapping to an organism in the taxa table were discarded. If a protein mapped to a proteome, it was retained if it belonged to the priority proteome identified for the organism, or if the priority proteome for the organism contained fewer than 2000 proteins. This avoided dataset saturation with model organism proteins while still including novel or infrequently studied proteins.

Filtering protein homologous pair search space

To provide context-dependent information on protein thermal stability, pairs of proteins must be identified from a large pool of thermophilic and mesophilic ones. To do this we identified homologous sequences using a BLAST-like local alignment65. The search space, considering only proteins <250 amino acids, was approximately 4.5 trillion pairs. A full pairwise BLAST-like search given the protein data space was projected to cost an unacceptable amount of carbon at 50,000 kg. To mitigate this, we first filtered the search space by identifying evolutionarily related taxa across temperatures. Using BLASTn, we aligned 16s rRNA sequences for mesophilic and thermophilic organisms. This sequence evolves slowly and is frequently used for inferring taxonomic relationships66,67,68. See Supplementary Information S3 for alignment parameters. Any organism with >81% gap compressed sequence identity and >98.5% coverage on both strands was considered a taxa pair for the subsequent protein homolog searches, resulting in 150k taxa pairs and a search space of 230 billion possible protein pairs. The taxonomic breakdown and protein contributions of these pairs are depicted in Fig. 1-left.

Identifying protein homologs across temperature

Using DIAMOND, we executed pairwise local sequence alignments for each thermophilic-mesophilic taxa pair, considering only proteins of 250 amino acids or less69. A resource test indicated that DIAMOND had a twofold speed increase, while preserving sensitivity (see Supplementary Information S4 for alignment parameters and a comparison with BLASTp). Several alignment metrics such as percent identity and alignment coverage were monitored (refer to Supplementary Information S2 for a comprehensive list and definitions). This process yielded 69 million putative protein pairs with an E-value less than 1e-4 and over 75% coverage of both sequences. Using a stricter definition of thermophilicity (OGT > 60 °C), the number of protein pairs reduces to 9 million. Despite this reduction, the dataset remains significantly larger than any existing homologous sequence collection across temperature52. The protein space occupied by protein pairs is depicted in Fig. 1-right, covering much of known protein space.

Data pipeline

The entire process of developing the dataset, from extraction of raw data to downstream validation steps (see Technical Validation), is tracked using data version control (DVC). Thus, the complete history of how the data changed as the code was developed and the parameters were changed (eg. maximum protein length, minimum alignment metrics etc.) is available. Additionally, the pipeline can be rerun with a single command after environment and computing cluster configuration. The data tables created in this process (taxa, proteins, taxa_pairs, and pairs) are collected and linked as a relational database using DuckDB, allowing for fast access and filtering of the data70. A depiction of the data pipeline is shown in Fig. 3. A key subset of tunable parameters is shown in Table 3 along with the values used for the presented data. We hope that this organisation and data transparency will allow others to experiment with the data pipeline to suit their needs. The carbon cost of compute consuming steps in the pipeline was estimated using CodeCarbon (). A comprehensive description of the pipeline steps and parameters is given in Supplementary Information S1.

Table 3 A small selection of tunable parameters used to produce the final dataset.

Data Records

The dataset is available on Figshare59 in the form of a relational database, as well as a dump of semicolon separated value files.

Schema

The dataset is structured as a relational database, implemented with DuckDB, an analytical database management system70. It consists of four main tables: ‘taxa’, ‘proteins’, ‘taxa_pairs’ and (protein) ‘pairs’. An abbreviated schema detailing the relationships between these tables is illustrated in Fig. 4. For a more detailed schema and a comprehensive description of each field, please refer to Supplementary Information S5.

Fig. 4
figure 4

Abbreviated schema for the presented database. The taxa table contains NCBI taxonomic information, from superkingdom to species, as well as 16s rRNA sequence and OGT. The protein table contains protein sequence and external database links. The taxa_pairs and pairs table contain metrics for alignment for 16s rRNA sequences and protein sequences, respectively.

For 16s rRNA alignment (table “taxa_pairs”) and protein alignment (table “pairs”) conducted using DIAMOND, various metrics including bit score, coverage, etc. are reported and referenced throughout the manuscript. Detailed definitions of these metrics are provided in Supplementary Information S2.

Accessing the dataset

To access the dataset, refer to the provided link on Figshare59. The dataset can be found in a zipped file named “database.tar.gz.” This file includes the database “learn2therm.ddb”, a minimal environment file for querying the database called “enviroment.yml,” and a set of instructions provided in the “README.md” file. The database employs SQL, with some specific elements defined by DuckDB. For example, to retrieve all protein pairs and their corresponding amino acid sequences with >95% alignment coverage of both strands and thermophilic temperature >80 °C, use the following query:

SELECT m.protein_seq AS meso_seq, t.protein_seq AS thermo_seq

FROM pairs

INNER JOIN proteins AS m ON (m.pid = pairs.meso_pid)

INNER JOIN proteins AS t ON (t.pid = pairs.thermo_pid)

INNER JOIN taxa ON (taxa.taxid = pairs.thermo_taxid)

WHERE pairs.query_align_cov > 0.95

AND pairs.subject_align_cov > 0.95

AND taxa.temperature > 80.0

This returns protein pairs in <10 seconds. See the README in the zipped data file for some additional example queries. DuckDB allows for exporting to csv, parquet, and many other desired formats, thus a user can retrieve the specific data they require on a taxa, protein, or protein pair basis70.

We have also provided dumps of the database tables as semicolon separated value files, however they may be unwieldy and it is recommended to use the database interface.

Technical Validation

Mapping to existing data

To ensure a proper mapping of OGT label to protein, we have joined our proteins with enzymes from Engqvist et al.63 via UniProt ID. Comparing the labels for the records in both datasets (N = 1.6 mil) yields an R2 of 0.995. Some small differences in OGT labels arise due to our strain aggregation procedure, described above.

Growth temperature as a proxy for melting temperature

To validate that the optimal growth temperature (OGT) is a suitable substitute for melting temperature as a measure of thermostability, and to ensure that temperature data has been accurately mapped, we compared proteins within our dataset to existing melting temperature datasets. We paired wild type proteins from FireProtDB and the Meltome atlas to proteins within our dataset using >99% coverage and identity, yielding 4,640 proteins with both internal OGT labels and external TM labels43,49. A Spearman’s correlation of 0.85 with a p-value of 0.0 was observed between these two quantities, suggesting a strong correlation. Furthermore, a binomial test was performed to determine if the melting temperature has a >99% chance of being greater than OGT, yielding a P-value of 2.68e-19. Figure 5 provides a parity plot of these two values. This analysis indicates that the trend observed in an organism’s protein melting temperatures is reflected in its growth temperature.

Fig. 5
figure 5

Melting temperature vs OGT, using data from two third party databases. The black dashed line is identity, and almost all examples fall on the side of TM > OGT. The adage that melting temperature is greater than OGT is clearly supported, with a Spearman’s of 0.85 (P value 0.0), and a binomial test of >99% chance of passing with alternative P-value 2.68e−19.

Comparing to alignments of known functional pairs

Hait et al. previously produced the current largest dataset of functional protein pairs across temperature at 1.6k pairs by starting with structures of PDB entries52. This dataset served as our benchmark, with the quality of their protein pairs used for comparison. To facilitate this comparison, we first aligned their dataset using DIAMOND, and subsequently compared the resulting alignment metrics to ours69. Statistically similar or even superior alignment scores to the baseline were observed in our dataset, as depicted in Table 4. Distributions of alignment percent identity and homology (as indicated by normalised bit score) are presented for both our data and the baseline pairs in Fig. 6a,b. Further comparisons given in the table and figure are discussed in the subsequent sections.

Table 4 Statistical comparison of learn2therm protein pairs to Hait et al.’s pairs.
Fig. 6
figure 6

Comparison of protein pair quality between our data and Hait et al.’s 1660 protein pairs52. (a) Empirical distribution of local alignment homology as bit score normalised to the average length of both protein strands. Our data has a right shifted score on average with t-test probability = 1.94e−8. (b) Same as A, except with percent identity. Our data has a right shifted score on average with t-test probability = 1.75e−6. (c) Empirical distribution of Jaccard score over Pfam annotations for our data (blue) compared to the baseline data (orange). Our full data has more annotation mismatches on average. When only the 25 mil protein pairs with BLAST coverage >95% are considered, the Pfam annotations become indistinguishable from the baseline with t-test probability = 3.24e−13. (d) Cumulative distribution of FATCAT structural alignment P-value for bins in BLAST coverage uniformly sampled from our data, compared to the baseline structural alignments. Even low coverage pairs are more likely to have less than one in a thousand P-value with binomial confidence >99%.

Pfam annotations

We used Pfam to annotate proteins in both our dataset and Hait’s pairs52,71,72,73. Retaining matches with an E-value < 1e-10 (normalised using the size of Pfam 35.0) resulted in 86.1% of our proteins and 99.8% of Hait’s proteins being labelled with at least one annotation. This suggests that our data includes novel proteins not extensively represented in Pfam. We evaluated the quality of a protein pair according to Pfam by calculating the set Jaccard score of accession annotations (Supplementary Information Eq. 1). We only considered pairs where at least one member was labelled with at least one annotation, as the remaining are out of scope for Pfam. Although our data contains slightly more annotation mismatches, if only the N = 24 million pairs with >95% sequence alignment coverage are considered, the scores are indistinguishable from the baseline pairs, with a t-test probability of 3.24e−13. The score distributions for both datasets are compared in Fig. 6c. A detailed description of this search is available in Supplementary Information S6. Homology searches using HMMs are highly sensitive to evolutionary distance, suggesting that true pairs of proteins with the same or similar functions are likely to share Pfam annotations, if available74.

Structural alignments

We used FATCAT to align PDB structure of the baseline pairs with flexible alignment75,76. Chain A was chosen if present. This flexible protein structure alignment algorithm provides an empirically scaled score “P-value” which indicates the likelihood that the raw alignment score were to occur between two random proteins, thus a P-value close to 0.0 indicates a quality structural overlap. We repeated this process for our protein pairs using PDB structures where available, otherwise utilising AlphaFold predicted structures54,77. For learn2therm protein pairs, we took a subset of 10k pairs due to computational cost limitations of structural alignment. We took this sample uniformly across 5 bins in sequence alignment coverage between 75% (the dataset minimum) and 100%. The cumulative distribution of probability scores from FATCAT for each of these subsets is given in Fig. 6d, where we see that our protein pairs have alignments even less likely to be observed randomly than the already quality baseline. To compare the alignment results to the baseline statistically, we considered a binary problem where FATCAT probability of occurring randomly less than one in a thousand is considered a quality protein pair. Conducting a binomial test, even pairs with sequence alignment coverage <80% are indistinguishable from the baseline, and higher alignment coverage yields structural alignment with smaller P-values than Hait’s on average with >99.9% confidence.

Signal of growth temperature predictors

In order to evaluate the signal for downstream deep learning models on our data, we consider a classifier of thermophilic versus mesophilic origin from protein sequence alone. This is one of the simplest models that can be produced from our data, and other work has shown this to be a learnable function78,79,80,81,82. To facilitate this test, we preprocess our data using the following steps: binarization by OGT <30 °C or > = 60 °C, class balancing, deduplication of similar sequences, and splitting based on NCBI taxonomy of host species, resulting in a training and test set of 290k and 28k proteins respectively, each labelled as mesophilic or thermophilic (see Supplementary Information S7 for details on the rigorous preprocessing conducted).

We then evaluated a recent predictor, TemStaPro, on our test set of 28k proteins79. This model is an ensemble of neural networks trained on top of the output embeddings of ProtT5XL, a Large Protein Language Model (LPLM)83. The TemStaPro ensemble outputs predictions in bins of 5 °C between ‘<40’ and ‘65< = ’, and conducts a self consistency check between ensembles. On our data, the model is consistent for 95.5% of examples, and of those, predicts the correct class with 91% accuracy compared to the null model of predicting the majority class with 61% accuracy. The distribution of predictions is shown in Fig. 7 below, where the model is clearly a predictor of > = 60 and < 30 °C labels.

Fig. 7
figure 7

Predictions on our test set by TemStaPro79. Thermophilic label is > = 60 °C versus < 30 °C. The model makes very few predictions between these two quantities where there is no data, as expected, and is able to recover the actual classes with 91% accuracy.

Given that TemStaPro was trained on UniParc proteins, it is possible that the model has seen data from our test set before. To ensure that high performance is not due to data leakage, we trained our own model on the development set of our data. We finetuned ProteinBERT, a LPLM, on our development set83. The training parameters and architectural details can be found in Supplementary Information S7. The performance on our held out test set is shown in Table 5 below. The model is highly predictive, confirming that our dataset of proteins and OGTs retain signal.

Table 5 Test set performance of our fine tuned LPLM.

It is clear that the data we produced is capable of supporting the foundational machine learning task of thermophilic classification, yet with almost an order of magnitude more data than previously. With the addition of homologous pairing, as well as its increased size, the dataset will open the door for more complex models such as thermal stability design models.

Usage Notes

The codebases that produced the results in this work are free and openly available. The pipeline leverages Data Version Control (DVC, https://dvc.org). A user can reproduce this data with a single command ‘dvc exp run’ (assuming available compute resources) or run the data pipeline using different parameters and produce a variant of the dataset, all the while monitoring how the data changes. Environments and parameters necessary for this process are retained within the repositories.

A snapshot of the data repositories at the time of production of these results is provided as a Figshare dump in addition to the repository links. Each of these repositories is DVC tracked, thus the set of parameters used to execute the pipeline is found in ‘params.yaml’. A description of each parameter is given as comments in the file, and in Supplementary Information S1. See Table 6 for the repositories and data dumps. In order to easily produce a directed acyclic graph of steps within the pipelines, dependencies and outputs of each pipeline stage, and view experiment results as parameters were changed, see DVC’s API.

Table 6 Location of open code and data.