Comprehensive database of secondary metabolites from cyanobacteria

Cyanobacteria form harmful mass blooms in freshwater and marine environments around the world. A range of secondary metabolites has been identified from cultures of cyanobacteria and biomass collected from cyanobacterial bloom events. A comprehensive database is necessary to correctly identify cyanobacterial metabolites and advance research on their abundance, persistence and toxicity in natural environments. We consolidated open access databases and manually curated missing information from the literature published between 1970 and March 2020. The result is the database CyanoMetDB, which includes more than 2000 entries based on more than 750 literature references. This effort has more than doubled the total number of entries with complete literature metadata and structural composition (SMILES codes) compared to publicly available databases to this date. Over the past decade, more than one hundred additional secondary metabolites have been identified yearly. We organized all entries into structural classes and conducted substructure searches of the provided SMILES codes. This approach demonstrated, for example, that 65% of the compounds carry at least one peptide bond, 57% are cyclic compounds, and 30% carry at least one halogen atom. Structural searches by SMILES code can be further specified to identify structural motifs that are relevant for analytical approaches, research on biosynthetic pathways, bioactivity-guided analysis, or to facilitate predictive science and modeling efforts on cyanobacterial metabolites. This database facilitates rapid identification of cyanobacterial metabolites from toxic blooms, research on the biosynthesis of cyanobacterial natural products, and the identification of novel natural products from cyanobacteria.


Introduction
Around the globe, cyanobacteria inhabit fresh waters including drinking water reservoirs, brackish waters, and marine environments where they can proliferate to form harmful blooms. During these events, cyanobacteria can produce high concentrations of a diverse mixture of rather unique secondary metabolites. Various countries have put forward drinking water guidelines for one metabolite, microcystin-LR, 1 for which the World Health Organization proposed a concentration threshold of 1 µg L⁻¹. The recently revised guidelines will also include the low molecular weight toxins anatoxin-a, saxitoxins and cylindrospermopsins. 2 Currently, data on the occurrence, fate, transformation processes, and toxicity of many other bioactive metabolites are lacking, and improved high-throughput analytical and effect-based methods are needed to close this information gap.
Research into analytical and toxicological methods relies on a comprehensive understanding of the structural information of cyanobacterial metabolites. One obstacle to such an understanding is the lack of a bioinformatics platform that the cyanobacteria research community collectively supports. Information from commercial databases of secondary metabolites is only accessible to paying customers (e.g., Antibase, MarinLit, The Dictionary of Natural Products). Several open-access databases exist but are often limited in the number of cyanobacterial metabolites or parameters listed (e.g., ALGALTOX List, NORINE database, Handbook of Marine Natural Products). [2][3][4] Some key open access databases are listed in Table 1. The "Cyanomet mass" list by LeManach et al. (2019) contains 852 entries, of which 35 belong to the class of microcystins; nearly 500 compounds are listed with complete molecular formulae and literature reference, but no further structural information is given. 3 The Natural Product Atlas (2019) contains a similar number of entries for cyanobacterial metabolites, including microcystins, and also provides structural codes (e.g., SMILES codes); stereochemistry is specified for 768 entries (isomeric SMILES codes). 4 In 2017 the Handbook of Cyanobacterial Monitoring and Cyanotoxin Analysis published a list of 246 microcystins. 5 Today, the most comprehensive list of microcystins and nodularins is curated by Miles et al. and was recently updated (2019) to include 279 microcystin variants with molecular formulae, references and a systematic naming system that implies the structural compositions, but without structural codes. 6,7 Here, we describe the compilation of information on secondary metabolites from cyanobacteria from these existing databases and literature curation into one database, CyanoMetDB.
We first present the methodology for compiling the information, provide an overview of the content of the database, and summarize areas of research that can benefit from this database including suspect screening by mass spectrometry (MS) as well as analysis of biosynthetic variation, bioactivity, and environmental behaviour.

Material and methods

Parameter selection.
CyanoMetDB is a flat-file database (i.e., a single table) comprising the following core fields: compound identifier (key), compound name, compound class, molecular formula, molecular weight, monoisotopic mass, primary reference that elucidated the structure, detailed structural information as simplified molecular input line entry system (SMILES) codes and other structural codes that serve as chemical identifiers for each compound (InChI, InChIKey, IUPAC names), whether NMR spectroscopy was used for structure elucidation, and information about the sample used to elucidate the structure of the compound (genus/species/strain or in situ sample); see Table 2. 8 Entries in the database correspond to compounds reported as cyanobacterial metabolites. In some instances, compounds in the database were first reported in non-cyanobacterial species and have since been reported in one or more cyanobacterial species. In these cases, the reference for structure elucidation refers to the primary publication. The database lists secondary references where those compounds have later been reported for cyanobacteria as well. We indicate in the database which entries provide NMR spectroscopy data for confirming the planar structure. Other entries are supported only by tandem mass spectrometry (MS/MS) data, and misidentification of some of these compounds is possible. We provide canonical SMILES codes representing the connectivity of each atom in a planar structure for all compounds, and we aim to provide isomeric SMILES codes where stereochemical information was available from the literature. For some compounds, we list a second reference that improved or provided additional structural information following the initial structure elucidation. For some compounds, the "Note" section indicates where uncertainties for structure elucidation were observed.
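As an illustration of the flat-file layout described above, the following minimal Python sketch reads a tab-separated table into typed records. The column headers (`CompoundIdentifier`, `CompoundName`, `MolecularFormula`, `MonoisotopicMass`, `SMILES`), the tab delimiter, the identifier `EX_0001`, and the `Entry` class are assumptions made for this example and may not match the released file; only a subset of the core fields is shown.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Entry:
    """One row of a flat-file table; field names are hypothetical."""
    identifier: str
    name: str
    formula: str
    monoisotopic_mass: float
    smiles: str


def read_entries(handle):
    """Parse a tab-separated table with a header row into Entry records."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        Entry(
            identifier=row["CompoundIdentifier"],
            name=row["CompoundName"],
            formula=row["MolecularFormula"],
            monoisotopic_mass=float(row["MonoisotopicMass"]),
            smiles=row["SMILES"],
        )
        for row in reader
    ]


# Illustrative single-row table (anatoxin-a); the identifier is a placeholder.
SAMPLE = (
    "CompoundIdentifier\tCompoundName\tMolecularFormula\t"
    "MonoisotopicMass\tSMILES\n"
    "EX_0001\tAnatoxin-a\tC10H15NO\t165.1154\tCC(=O)C1=CCCC2CCC1N2\n"
)

entries = read_entries(io.StringIO(SAMPLE))
print(entries[0].name, entries[0].formula, entries[0].monoisotopic_mass)
```

Because the table is a single sheet keyed by the compound identifier, each row can be handled independently, which keeps filtering by class, mass, or structure code straightforward.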
The stereochemistry of entries whose structures were established mainly through MS cannot be known with certainty, but where there was a reasonable basis for this, an assumed stereochemistry is included. For example, in microcystins with fully established stereochemistry, positions 2 and 4 have always been found to contain L-amino acids, with D-amino acids at positions 1, 3 and 6. Similarly, the stereochemistry of di- and tetrahydrotyrosine in microcystins was assumed to be that established in other bacteria by Walsh et al., 9,10 and will be amended if found to be incorrect. We include some additional entries that are known metabolites, oxidation products, or (semi)synthetic compounds and indicate them as such in the note section. However, we do not provide a comprehensive list of transformation products and synthetic compounds herein. The database also lists the material used to identify the compound in the primary literature, either as the genus and species of the cyanobacterium or as a field sample. The database does not provide a comprehensive list of all known producing species.

Data sources and curation procedure.
We manually verified and completed entries for the associated literature references using various bibliographic sources (e.g., existing open access databases, SciFinder, PubMed). From these references we extracted information on the sample type used therein (e.g., genus, species, field sample) and whether NMR spectroscopy was used for structure elucidation. Finally, with reference to the literature information, the molecules were manually drawn including known (or in some cases probable) stereochemistry, and SMILES codes and the other structural codes were generated. In this way, some disagreements with the literature were identified and corrected. Entries extracted from PubChem were carefully checked to verify the structure against the primary literature. In case of discrepancies between the structure reported in the primary literature reference and the one found in PubChem, we refer to the information derived from the primary literature reference herein. For selected compounds, remarks are included in the "Notes" field to highlight any shortcomings or relevant additional information, e.g., assumed stereochemistry of a compound. Table 2 provides a detailed content description and illustrative example for each data field.

Results and discussion
The database includes 2031 entries with complete literature and structural information. The earliest entry was published in 1970 for the cyanopeptolin micropeptin-996 followed by almost 100 additional compounds until 1990. Between 1990 and 2000, the number of reported cyanobacterial metabolites increased five-fold. The rapid increase during the 1990s was, in part, likely associated with the discovery that microcystins pose significant hepatotoxic risks to humans, which in turn led to MC-LR being included in the World Health Organization's water quality guidelines, prompting significant research on cyanobacteria. 2 More than one thousand compounds were identified by the year 2010, and a further one thousand compounds have been reported in the subsequent decade (to March 2020). The historic data show that 50 to 130 compounds have been identified annually (Figure 2A). Increased availability of advanced analytical instrumentation (e.g., HRMS, high-field NMR spectroscopy with cryogenic probes) has also contributed to the increased rate of discoveries in recent years.

Publication trends.
The discovery of cyanobacterial metabolites has not yet reached a plateau (Figure 2A). The question arises: How close are we to identifying the majority of these cyanobacteria-specific metabolites? Also, how many of [...] discoveries mostly relate structurally to previously published compounds and that the "range of scaffolds readily accessible from nature is limited". 12 This does not mean that new discoveries are expected to be exhausted but rather that most new discoveries are likely to share structural similarities with previous ones. Another reason may be that scientific results are limited by the environments that we predominantly explore, while less attention has thus far been paid to, for example, cyanobacteria from terrestrial or extreme environments. Compared to discoveries of natural products from other bacteria that started in the 1940s, 12 cyanobacterial metabolite discoveries began only in the 1970s. The overall trend is especially driven by discoveries of peptide-based metabolites, and common cyanopeptide classes make up more than one third of all peptidic metabolites in CyanoMetDB (Figure 2B). Within each class, compounds show high structural similarity, supporting the previous observation that the contribution of novel structural scaffolds is typically low for natural products within the pool of cyanobacterial metabolites. 12

Chemical space.
CyanoMetDB shows that cyanobacterial metabolites occupy a wide range of molecular weights between 100 and 2500 Da. Of these, 57% are cyclic compounds and 65% are peptides. The mass range of known compounds does not perfectly fit a normal distribution (Figure 3A). Analysing peptidic metabolites separately demonstrates that peptides mainly account for the compounds with high molecular weights (Figure 3B). Note here that by peptidic, we refer strictly to the presence of at least one peptide bond, not the biosynthetic pathway. Among the peptides, more compounds are present in the 1000-1100 Da range than expected from a normally distributed dataset. In particular, cyanopeptides can be classified to a large extent into common peptide classes including microcystins, anabaenopeptins, cyanopeptolins, and other depsipeptides that cover the majority of compounds with molecular weights above 900 Da (Figure 4). The distribution based on metabolite classes shows that microcystins and cyanopeptolins, in particular, contribute to the high abundance of known metabolites between 1000 and 1100 Da. Most likely this is because these classes have received more attention in studies focusing on identifying new members of these prominent classes that are often identified in the same species, or perhaps because microcystins and cyanopeptolins are especially abundant secondary metabolites from cyanobacteria.
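The binning behind a mass histogram of this kind can be sketched in a few lines of Python. The example is only an illustration: the bin width of 100 Da, the function name `bin_masses`, and the listed molecular weights are assumptions and are not values taken from the database.

```python
from collections import Counter

BIN_WIDTH = 100  # Da; an assumed bin width for illustration


def bin_masses(masses, width=BIN_WIDTH):
    """Count values per [k*width, (k+1)*width) bin, keyed by the lower edge."""
    counts = Counter(int(m // width) * width for m in masses)
    return dict(sorted(counts.items()))


# Made-up molecular weights (Da), for illustration only.
example_weights = [165.2, 415.5, 994.5, 1008.6, 1044.5, 1052.7, 1380.9]

histogram = bin_masses(example_weights)
for lower, n in histogram.items():
    print(f"{lower}-{lower + BIN_WIDTH} Da: {n}")
```

Running the same function separately on peptidic and non-peptidic entries would give the two distributions that are compared in the text.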
For the non-peptide metabolites, we see a particularly high contribution between 350 and 500 Da (Figure 3B). The molecular weight distribution of marine natural products has previously been shown to also center around 350 Da. 13 In general, the non-peptide-based metabolites from cyanobacteria are more difficult to classify because they lack unique structural features. The structural information (SMILES codes) in the database allows for substructure searching to identify common molecular motifs. For example, of the non-peptide compounds, 15% contain an ester bond and 11% a proline ring. Most compounds show unsaturation with 64% carrying at least one aromatic group, 30% having 1-2 aromatic rings, and 6% having 3-6 aromatic rings. Halogen atoms are present in 30% of the non-peptide compounds (26% Cl, and 4% Br) and 20% contain sulfur. The structural information also allows calculations of other physico-chemical properties such as hydrogen bonding, aromaticity, or pKa values. However, for those calculations the applicability domain of any underlying models needs to be considered. Secondary metabolites from cyanobacteria have mostly been discovered by "top-down" approaches from extraction of biomass and analyses that were guided by chemical motifs (peptides, molecular weight group, common fragments) or non-targeted searches by MS. Within CyanoMetDB, 12% of compounds were first identified from field samples and the remaining compounds were identified from laboratory-grown cultures. In total, more than 50 different cyanobacterial genera were used, dominated by Microcystis (17% of all entries), Moorea/Lyngbya, Nostoc, Anabaena/Dolichospermum, Oscillatoria/Planktothrix, Nodularia, Scytonema, Fischerella, and Symploca, in decreasing order of total entries. These genera are not necessarily the main producers of the respective metabolites but were predominantly used for initial structure elucidation.
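Two of the simpler structural questions raised in this section, whether a compound is cyclic and whether it carries a halogen, can be approximated directly on the SMILES string. The Python sketch below is a string-level heuristic written for illustration and is not the procedure used to compile the reported percentages. It relies on two properties of SMILES notation: rings are written only through ring-closure labels outside bracket atoms, and halogens appear as the symbols F, Cl, Br or I. Motifs such as peptide bonds, ester bonds or aromatic rings require graph-based substructure matching in a cheminformatics toolkit and are not covered here. The function names and the example molecules are assumptions.

```python
import re

# Tokens: bracket atoms, two-letter halogens, ring closures (%nn or a single
# digit), single letters, and any other single character (bonds, branches).
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|\d|[A-Za-z]|.")
BRACKET_HALOGEN = re.compile(r"\[\d*(Cl|Br|F|I)(?![a-z])")


def tokenize(smiles):
    return TOKEN.findall(smiles)


def is_cyclic(smiles):
    """True if a ring-closure label occurs outside bracket atoms."""
    return any(t.isdigit() or t.startswith("%") for t in tokenize(smiles))


def halogens(smiles):
    """Return the set of halogen symbols (F, Cl, Br, I) in the string."""
    found = set()
    for tok in tokenize(smiles):
        if tok.startswith("["):
            match = BRACKET_HALOGEN.match(tok)
            if match:
                found.add(match.group(1))
        elif tok in {"F", "Cl", "Br", "I"}:
            found.add(tok)
    return found


def fraction(smiles_list, predicate):
    """Share of entries for which the predicate holds."""
    return sum(1 for s in smiles_list if predicate(s)) / len(smiles_list)


# Small generic molecules used only to exercise the functions.
examples = ["CC(=O)C1=CCCC2CCC1N2", "ClC(Cl)Cl", "Brc1ccccc1", "CCO"]
print(fraction(examples, is_cyclic))
print(fraction(examples, lambda s: bool(halogens(s))))
```

Treating bracket atoms as single tokens keeps isotope labels, charges and hydrogen counts (for example `[13C]` or `[NH3+]`) from being mistaken for ring closures, and keeps symbols such as `[Fe]` from being counted as fluorine.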

Implications for suspect screening, biosynthetic and bioactivity-guided analysis.
The CyanoMetDB dataset contains structurally known cyanobacterial metabolites that will aid future research including, but not limited to, suspect screening by MS, analysis of biosynthetic variation, bioactivity-guided analysis, and environmental behaviour.
Suspect screening. Mass spectrometry-based analysis methods offer considerable opportunities and are the state-of-the-art analytical techniques to identify and quantify structurally known compounds, i.e., targets and suspects. Targeted LC-MS analysis typically uses tandem mass spectrometry (MS/MS) and identifies compounds by characteristic fragment ions using the selected reaction monitoring scan mode of a low-resolution triple quadrupole mass spectrometer. Quantification by targeted analysis relies upon knowledge of the characteristic fragment ions, obtained from reference spectra, and on the availability of reference standards. Reference spectra and reference standards are, however, currently only available for a small fraction of entries in the CyanoMetDB (e.g., a few microcystins, anatoxin-a, cylindrospermopsin, saxitoxins). Accordingly, targeted analyses can determine exact concentrations in environmental samples for only a limited range of cyanobacterial metabolites. Alternatively, LC-HRMS can be used for suspect screening of any compound with known molecular formula. One approach is to trigger fragmentation during MS/MS analysis of those spectral features in a sample that match the molecular formulae from a provided suspect list (i.e., an inclusion list). The fragments of a suspect can be matched against reference spectra or in-silico predictions to increase the level of confidence for the identification of a compound. 14 With advances in machine learning and network-based analyses, further opportunities are also arising for the discovery of novel, or structurally related, cyanobacterial metabolites based on holistic analysis and clustering of LC-HRMS/MS fragmentation datasets.
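The matching of measured features against a suspect list can be sketched as follows. This minimal Python example computes the monoisotopic mass from a molecular formula and compares the expected [M+H]+ value with an observed m/z within a mass tolerance. It is an illustration under stated assumptions rather than a description of any particular software: the 5 ppm tolerance, the restriction to singly protonated ions and to a handful of elements, the function names, and the two-compound suspect list are all choices made for this example.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes, rounded to 6 decimals.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074,
        "O": 15.994915, "S": 31.972071, "Cl": 34.968853}
PROTON = 1.007276  # Da, added to the neutral mass for [M+H]+


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C49H74N10O12'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONO[symbol] * (int(count) if count else 1)
    return total


def ppm_error(observed, theoretical):
    """Relative deviation of an observed m/z in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_suspects(observed_mz, suspects, tol_ppm=5.0):
    """Return names whose [M+H]+ lies within tol_ppm of observed_mz."""
    hits = []
    for name, formula in suspects.items():
        expected = monoisotopic_mass(formula) + PROTON
        if abs(ppm_error(observed_mz, expected)) <= tol_ppm:
            hits.append(name)
    return hits


# Hypothetical two-entry suspect list: name -> molecular formula.
suspects = {"microcystin-LR": "C49H74N10O12", "anatoxin-a": "C10H15NO"}
print(match_suspects(995.5560, suspects))
```

A match on accurate mass alone only flags a candidate. As noted above, confidence in the identification increases once the fragments of the suspect are compared with reference spectra or in-silico predictions.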
These procedures all require a collated knowledgebase of structurally known compounds, which CyanoMetDB provides.
Biosynthetic basis. Discovery of secondary metabolites based on genetic information is a promising "bottom up" approach that can reveal additional compounds, 15 though this remains largely theoretical at this stage. Recent advances in microbial genomics have greatly improved our understanding of the biochemical mechanisms responsible for the biosynthesis of cyanobacterial natural products. 16 The majority of these natural products are synthesized through secondary metabolic pathways that catalyze the synthesis of the complex secondary metabolites in coordinated enzyme cascades. 16 Approximately 75% of the natural products included in the CyanoMetDB dataset can be assigned to structural families for which biosynthetic pathways have been reported for one or more representatives. Many families of cyanobacterial secondary metabolites included in the CyanoMetDB exhibit extensive chemical variation but share a structural core that defines the family (Figure 1). Typically, the biosynthesis of compounds in one family shares a set of conserved biosynthetic enzymes for the synthesis of the defining structural core, but also involves a set of accessory tailoring enzymes that are not universally conserved. 16,17 The biosynthetic logic underlying the synthesis of most common cyanobacterial toxins, including microcystins, 18 nodularins, 19 saxitoxins, 20 cylindrospermopsins, 21 and anatoxins, 22 is now well understood. However, the genetic basis for the biosynthesis of specific secondary metabolite chemical variants is incomplete in many cases. Numerous molecular ecology methods have been developed to characterize the types of toxin producers in blooms based on the biosynthetic gene clusters. 17 Compilation of CyanoMetDB has shown that the chemical variation of the major secondary metabolites is more extensive than previously realized.
A more complete understanding of the biosynthetic basis for the chemical variation of secondary metabolites from cyanobacteria is necessary to ensure the unbiased detection of their biosynthetic pathways directly from environmental samples.
Bioactivity-guided analysis. With the fast-growing number of new secondary metabolites from cyanobacteria and the increasing awareness of their biological activities, a comprehensive database is essential. Cyanobacterial metabolites present cytotoxic, antimicrobial, antifungal, antiprotozoal, enzyme inhibiting, anti-inflammatory, dermatotoxic and neurotoxic activities that can also be exploited by the pharmaceutical industry to develop new drugs potentially beneficial to humans. [24][25][26] Future discoveries are likely to share structural similarities with previously discovered metabolites but may exhibit differing potency, and thus are critical to identify. Any modifications to the structure, such as replacement of amino acids by other residues, substitution by methylation, halogenation or oxidation and changes in configuration, can significantly affect the ability of cyanobacterial metabolites to evoke a biological response. For example, for the toxicity of microcystins, the cyclic structure and the Adda-D-Glu region, which interacts with the catalytic unit of protein phosphatases, are crucial. The modification of the less conserved positions 2 and 4 also plays a significant role for microcystin bioactivity (Figure 1). 6,27 The selective activity of cyanopeptolins against serine proteases depends on the residue in the Ahp-neighboring position. The Arg-Ahp-containing cyanopeptolin variants are mainly active against trypsin, whereas cyanopeptolins with hydrophobic amino acids (e.g., Phe, Tyr, Ile) show potent activity against chymotrypsin. 28 In the case of cryptophycins, which are cyclic depsipeptides composed of four fragments (i.e., A, B, C, and D), their effect on microtubule dynamics is determined by the intact 16-membered macrolide structure, reactive epoxide ring in unit A, methyl group in units A and C, O-methyl group and chloro-substituent in unit B, and isobutyl group in unit D. 29 The CyanoMetDB presents a dataset including structural information for each metabolite that is deposited in a format immediately accessible to software tools (e.g., SMILES codes). In silico studies, termed virtual screening (target-based or ligand-based), use such databases to overcome problems related to the limited availability of reference materials (i.e., pure compounds) and lack of information on their physico-chemical properties. 27 In the case of new drug development from cyanopeptides, virtual screening will also help to reduce the cost and increase the success rate of the process. Chemoinformatics can also be helpful to discover as-yet unknown molecular targets of cyanobacterial metabolites in so-called target-fishing. 28,29 The deposited structures in CyanoMetDB can be used as templates in these approaches and to design new chemical entities with desirable traits and ligand-target interactions.
Environmental behavior. As hundreds of secondary metabolites have been identified from cyanobacteria, we now face the questions: How toxic and abundant are the other cyanobacterial metabolites relative to known toxins? How stable are they in surface waters? How do their concentrations change during bloom events? Can they reach water treatment plant intakes? Answers to these questions are needed to quantify the exposure side of the risk equation and to prioritize cyanobacterial metabolites for toxicity testing, monitoring of surface waters, and evaluating removal in engineered water treatment systems. The inclusion of SMILES codes in CyanoMetDB offers the possibility of visualizing the planar or stereo-structure of a given compound. Access to the chemical structures enables use of chemical models to predict physico-chemical properties from quantitative structure-activity relationships. The structural information in CyanoMetDB also allows substructure searching for moieties with known reactivity, including biotic and abiotic processes. Open access tools that use known transformation rules to propose degradation of organic molecules can be applied to cyanobacterial metabolites of known structure as well. The predicted environmental transformation products can then also be included in virtual screening of bioactivity and suspect screening by MS as discussed above.

Conclusions
Cyanobacterial metabolites have been studied for over 50 years, and new discoveries continue to be reported each year. In this work, the disparate data sources and primary research articles dealing with cyanobacterial metabolites have been collated and synthesized into a freely available and accessible flat-file database termed CyanoMetDB. The database comprises over 2000 entries, each corresponding to a unique cyanobacterial metabolite and its associated descriptors: name; molecular formula; molecular weight; monoisotopic mass; SMILES code; etc. CyanoMetDB represents a complementary tool to aid dereplication and analyses of cyanobacterial metabolites. We aim to continue this work as a community-driven effort in the future. As a next step, the content of this database will be integrated into other open access platforms to make it widely available and to benefit from already existing infrastructures and online tools. The database may also serve as a framework for connecting and collating other disparate data sources associated with cyanobacterial metabolites. This in turn may help to enrich collaborative research efforts, enhance the frequency with which compound annotations are assigned and aid communication, comparison, and interpretation of results.

Genus
Genus of the specimen used for extraction and structure elucidation in the listed reference. This is not a list of all possible producing genera, only the genus used in the listed primary literature. When field samples were used, the entry reads "in situ sample". Example: Anabaena/Dolichospermum

Species
Species of the specimen used for extraction and structure elucidation in the listed reference. This is not a list of all possible producing species, only the species used in the listed primary literature.

Strain
Identification of strain used

Notes
Any additional comment on the entry