CyanoMetDB, a comprehensive public database of secondary metabolites from cyanobacteria

Harmful cyanobacterial blooms, which frequently contain toxic secondary metabolites, are reported in aquatic environments around the world. More than two thousand cyanobacterial secondary metabolites have been reported from diverse sources over the past fifty years. A comprehensive, publicly accessible database detailing these secondary metabolites would facilitate research into their occurrence, functions and toxicological risks. To address this need we created CyanoMetDB, a highly curated, flat-file, openly accessible database of cyanobacterial secondary metabolites collated from 850 peer-reviewed articles published between 1967 and 2020. CyanoMetDB contains 2010 cyanobacterial metabolites and 99 structurally related compounds. This has nearly doubled the number of entries with complete literature metadata and structural composition information compared to previously available open access databases. The dataset includes microcystins, cyanopeptolins, other depsipeptides, anabaenopeptins, microginins, aeruginosins, cyclamides, cryptophycins, saxitoxins, spumigins, microviridins, and anatoxins among other metabolite classes. A comprehensive database dedicated to cyanobacterial secondary metabolites facilitates: (1) the detection and dereplication of known cyanobacterial toxins and secondary metabolites; (2) the identification of novel natural products from cyanobacteria; (3) research on biosynthesis of cyanobacterial secondary metabolites, including substructure searches; and (4) the investigation of their abundance, persistence, and toxicity in natural environments. Crown Copyright © 2021 Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)


Introduction
Cyanobacteria inhabit diverse freshwater, marine and terrestrial environments across the globe and can survive under extreme irradiation, temperature, pH, or salinity conditions. When cyanobacteria proliferate rapidly to form harmful blooms, they can produce high amounts of unique and bioactive secondary metabolites (Baumann and Juttner, 2008; Ferranti et al., 2013; Gkelis et al., 2015; Grabowska et al., 2014; Jancula et al., 2014; Kurmayer et al., [...]). [...] metabolites have long been recognized, making them lead compounds for the development of drugs (Harvey et al., 2015; Shen, 2015). Pharmaceutical research actively explores their use, e.g., to fight cancers, cardiac and autoimmune disorders or infectious diseases. A number of cyanobacterial secondary metabolites show antimicrobial or antifungal activity (Swain et al., 2017). Despite the recognition of their bioactivity, little is known about the potential human and ecotoxicological risks posed by exposure to cyanobacteria and their less-studied secondary metabolites. Epidemiological studies indicate human health effects and potential cancer development from acute and chronic exposure to cyanopeptides, and other work offers clues regarding the toxicity of less-studied cyanopeptides (Janssen, 2019; Liu et al., 2017; Svircev et al., 2017).
Table 1. Publicly available suspect lists with total number of entries, number of microcystins, available molecular formulae, primary references for the structure elucidation and structural code (e.g., SMILES code). [Table body not recovered.]
Table footnotes (partial): Bouaicha et al., 2019; 4 This study; CyanoMetDB contains 2010 secondary metabolites identified in cyanobacteria as well as 99 additional entries: 50 structurally related compounds that are semi-synthetic and synthetic; 5 metabolites that have been identified upon bioaccumulation in other organisms feeding on cyanobacteria; 3 common oxidation products; and 41 metabolites that have been identified in other organisms and are structurally related to metabolites from cyanobacteria.
Various countries have put forward drinking water guidelines for one toxic metabolite, microcystin-LR (Ibelings et al., 2014), for which the World Health Organization (WHO) proposed a threshold concentration of 1 μg L⁻¹ for free and cell-bound exposure (WHO, 2004). Recent updates of these guidelines now include thresholds for total microcystin content and short-term exposure thresholds of 12 μg L⁻¹ for microcystins, as well as threshold values for cylindrospermopsin, anatoxin-a, and saxitoxins of 3, 30, and 3 μg L⁻¹ in drinking water, respectively (WHO, 2020a; b). Despite recent progress in establishing guidelines for these compounds, data on the occurrence, fate, transformation processes, and toxicity of most cyanobacterial secondary metabolites are lacking, and improved high-throughput analytical and effect-based methods are needed to overcome this information gap. Two of the major challenges associated with studying cyanobacterial secondary metabolites are: 1) the limited availability of analytical methods capable of detecting and identifying a broad range of compounds, and 2) the lack of a publicly available list of known metabolites that includes chemical structure information. Research into analytical and toxicological methods for cyanobacterial metabolites relies on a comprehensive understanding of their structures. One contributing obstacle is the lack of a bioinformatics platform that the cyanobacterial research community collectively supports. Information from commercial databases of secondary metabolites is accessible only to paying customers (e.g., Antibase, MarinLit, The Dictionary of Natural Products), whereas the open access databases that exist are often limited in the number of cyanobacterial metabolites or parameters listed (e.g., ALGAL-TOX List, NORINE database, Handbook of Marine Natural Products).
Recently, a new database termed COCONUT (COlleCtion of Open Natural ProdUcTs) unified 411,621 entries of natural products from various organisms using stereochemistry-free InChIKey information from 50 open and accessible databases (Sorokina and Steinbeck, 2020). Key open-access databases regarding (cyano)bacterial metabolites are listed in Table 1. The "Cyanomet mass" list by Le Manach et al. (2019) contains 852 entries, of which 35 belong to the class of microcystins and nearly 500 compounds are listed with complete molecular formulae and literature references, but no further structural information is given. In 2017, the Handbook of Cyanobacterial Monitoring and Cyanotoxin Analysis was published, including a list of 248 microcystins and 10 nodularins (Spoof and Catherine, 2017). The most comprehensive list of microcystins and nodularins to date was updated in 2019 to include 286 microcystin and 10 nodularin variants with molecular formulae, references and a systematic naming that implies the structural compositions, but no structural codes (Bouaicha et al., 2019; Miles and Stirling, 2019). The Natural Products Atlas is a repository maintained and actively updated by Simon Fraser University in Canada that covers all microbially-derived natural products. It includes compounds published in peer-reviewed primary literature and, as of December 2019, contained 1006 cyanobacterial metabolites with structural identifiers (isomeric SMILES codes) (van Santen et al., 2019).
Here, we describe a database of cyanobacterial secondary metabolites, termed CyanoMetDB, containing 2010 unique entries. CyanoMetDB has been compiled from existing databases and in-house libraries, as well as through manual curation of available peer-reviewed primary literature published between 1967 and 2020. We present an overview of the structure of this database and explain the methodology used in its compilation and curation. We summarize areas of research that can benefit from this openly accessible resource, including compound identification using liquid chromatography coupled with mass spectrometry, analysis of biosynthetic variations, bioactivity, and environmental behavior. Finally, we provide future perspectives on the database. Herein we supply CyanoMetDB and the literature metadata as (i) separate flat-files, and (ii) converted spreadsheet format files. The current and future versions of CyanoMetDB are available on Zenodo (Jones et al., 2021) and the NORMAN Suspect List Exchange (https://www.norman-network.com/nds/SLE/). We recommend citing the repository on Zenodo along with this article when CyanoMetDB content is used. This work was initiated at the 11th International Conference of Toxic Cyanobacteria (ICTC, 2019, Krakow, Poland) and we welcome participation of the wider research community in future editions of CyanoMetDB.

Data sources and curation procedure
CyanoMetDB was established as a consolidation of multiple, disparate sources of information pertaining to cyanobacterial secondary metabolites, including in-house libraries of the CyanoMetDB curation-team members and various open-access databases (Table 1). CyanoMetDB was then extended to include additional cyanobacterial secondary metabolites reported in the scientific literature. For each compound, the primary literature metadata were manually verified and, where required, corrected. The sample type from which a compound was extracted and identified (e.g., genus/species/strain of cyanobacterium, or field sample), as well as whether nuclear magnetic resonance spectroscopy was used for its structure elucidation, were recorded. Furthermore, the 2D chemical structure of each compound was manually drawn (ChemDraw, ChemDraw Professional, ACD/ChemSketch) based on the information provided in the primary references, from which structural identifiers were generated, including: simplified molecular input line entry system (SMILES) string, International Union of Pure and Applied Chemistry (IUPAC) International Chemical Identifier (InChI), InChIKey and IUPAC name. In some instances, this led to the correction of structures originally reported in one of the consolidated data sources, e.g., aeruginosin 101 and aeruginosin 98C both contain a D-allo-Ile in their structure, but were previously misreported as L-allo-Ile derivatives in the primary literature (Ishida et al., 1999). Entries extracted from PubChem were carefully checked to verify the structure against the primary literature. Where discrepancies were observed between structures reported in the primary literature and those found in PubChem, we report the information from the primary literature in CyanoMetDB. In the above-mentioned example, aeruginosin 101 was misreported as its L-allo-Ile derivative whereas aeruginosin 98C was correctly reported in PubChem (accessed 8 September 2020).
Some issues emerged during this process, for example with the representation of olefin stereochemistry when a compound's structure was drawn from an InChI code in ChemDraw. For this reason, we recommend the use of SMILES codes from CyanoMetDB for accurate representation of a compound's 2D structure.
After the initial compilation and expansion of CyanoMetDB, the accuracy of data in the database was confirmed through multiple rounds of database integrity checks. In each round, members of the CyanoMetDB curation team received subsets of CyanoMetDB compounds, for which they evaluated and, if necessary, corrected the assignment of the primary (and secondary) literature sources and chemical structural descriptors (SMILES, InChI, InChIKey and IUPAC name).

CyanoMetDB structure
CyanoMetDB is presented as a flat-file database comprising the following core fields: compound identifier (key), compound name, compound class, molecular formula, molecular weight, monoisotopic mass, primary reference (in which the compound was first reported), SMILES string, InChI string, InChIKey and IUPAC name. Together, the SMILES, InChI, InChIKey and IUPAC name fields serve as textual identifiers for each compound (Heller et al., 2015).
For each entry in CyanoMetDB, the following additional fields were also optionally populated to clarify aspects of a compound's identification. The field "Nuclear magnetic resonance spectroscopy (NMR) used" indicates whether nuclear magnetic resonance spectroscopy was used to confirm a compound's structure, or (partial) relative stereochemistry. The field "secondary reference", if populated, provides a citation to an article in which: 1) a compound was first reported in a cyanobacterial species, after having first been reported (in the primary reference) in a non-cyanobacterial species; or 2) the structural annotation information reported in the primary reference has since been clarified or refined. The "genus", "species" and "strain" fields, collectively, provide an overview of the sample type in which a compound was identified. The field "field sample" includes information for instances where a compound was identified in samples from bloom material comprising one, or multiple, (potentially undefined) cyanobacterial species. Finally, the "Notes" field provides further comments of interest concerning a compound's origin and structural annotation. Table 2 provides a detailed description and illustrative examples for each data field in CyanoMetDB. For a subset of 1097 compounds, we also provide the structural representation as a building block string with abbreviated monomers (e.g., three-letter amino acid codes).
For all entries in CyanoMetDB we provide, as a minimum, the canonical SMILES code describing the connectivity of each atom in the compound's planar structure. An isomeric SMILES code is provided when a compound's stereochemistry was clearly reported in the primary or secondary reference(s), or where related compounds or known biosynthetic pathways provide evidence for probable stereochemical configurations. For example, in all microcystins characterized to date by multiple and complementary analytical techniques, the monomers identified in positions 2 and 4 of the cyclic structure have always been found to be L-amino acids, while in positions 1, 3 and 6, D-amino acids are always reported. Consequently, the microcystins in CyanoMetDB identified solely by LC-MS have been assumed to retain this stereochemistry. The "Notes" column provides additional details as to whether uncertainty remains regarding some aspect of a compound's structure.
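As a rough illustration of how a flat file with the core fields described above might be consumed, the following Python sketch reads a tab-separated excerpt and checks that the core fields are populated. The column headers, the tab delimiter, the identifier `EXAMPLE_0001` and the single inline row are assumptions made for this sketch; the headers in the released CyanoMetDB files may differ (see Table 2), and the planar SMILES was written by hand for illustration rather than copied from the database.

```python
import csv
import io

# Hypothetical column headers modelled on some of the core fields listed above;
# the released CyanoMetDB files may name them differently.
CORE_FIELDS = [
    "CompoundIdentifier",
    "CompoundName",
    "CompoundClass",
    "MolecularFormula",
    "SMILES",
]

# One illustrative tab-separated row. The identifier is a placeholder and the
# planar SMILES was written by hand for this sketch.
EXAMPLE_TSV = (
    "CompoundIdentifier\tCompoundName\tCompoundClass\tMolecularFormula\tSMILES\tNotes\n"
    "EXAMPLE_0001\tAnatoxin-a\tAnatoxin\tC10H15NO\tCC(=O)C1=CCCC2CCC1N2\t\n"
)


def load_entries(handle):
    """Read a tab-separated flat file and return its rows as dictionaries.

    Raises ValueError if any core field of a row is empty; optional fields
    such as "Notes" are allowed to be blank.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    entries = []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        missing = [f for f in CORE_FIELDS if not (row.get(f) or "").strip()]
        if missing:
            raise ValueError(f"line {line_no}: missing core field(s) {missing}")
        entries.append(row)
    return entries


entries = load_entries(io.StringIO(EXAMPLE_TSV))
print(entries[0]["CompoundName"], entries[0]["MolecularFormula"])
```

Keeping optional fields permissive while rejecting rows with empty core fields mirrors the distinction drawn above between core and optionally populated fields; a real loader would likely also validate the InChI, InChIKey and reference columns.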

Results and discussion
To date, CyanoMetDB includes 2010 cyanobacterial secondary metabolites with complete literature and structural information, together with 99 additional entries: 50 semi-synthetic and synthetic compounds that have only been chemically derived or that are chemically modified cyanobacterial metabolites; 41 metabolites that are structurally related to metabolites from cyanobacteria but have so far only been identified in other organisms; 5 metabolites that have been identified in other organisms that were feeding on cyanobacteria; and 3 common oxidation products.
The earliest entry in CyanoMetDB was published in 1967 for the chromophore phycoerythrobilin, followed by more than 115 additional compounds by the end of 1990. There was a four-fold increase in the number of reported cyanobacterial metabolites between 1990 and 2000. This rapid increase in the number of reported secondary metabolites during the 1990s was, in part, likely associated with the realization that microcystins pose significant hepatotoxic risks to humans. This led to MC-LR being included in the WHO water quality guidelines (WHO, 2004), prompting significant research on cyanobacteria. Nearly one thousand cyanobacterial compounds were identified by the year 2010, and a further one thousand compounds have been reported in the subsequent decade. The increasing incidence of cyanobacterial blooms and the availability of advanced analytical instrumentation (e.g., high resolution mass spectrometry (HRMS) and high-field NMR spectroscopy with cryogenic probes) have contributed to discoveries in recent years (Fig. 2A inset).

Publication trends
Publication trends in the field suggest that the discovery of cyanobacterial metabolites has not yet reached a plateau (Fig. 2A). The question arises: how close are we to identifying the majority of cyanobacterial secondary metabolites? Also, how many of the newly described compounds are chemical variants of known families and how many describe new families? These are also overarching questions in the general domain of natural product research. Pye and coworkers recently surveyed the chemical space of natural products and concluded that new discoveries mostly relate structurally to previously published compounds and that the "range of scaffolds readily accessible from nature is limited" (Pye et al., 2017). This limitation does not mean that new discoveries are expected to be exhausted, but rather that most new secondary metabolites are likely to share structural similarities with previously reported ones. In CyanoMetDB, 27% of the entries have been identified by mass spectrometry, and such data acquisition or processing methods are generally optimized for known classes of compounds such as cyanopeptides (e.g., microcystins) or low molecular weight molecules (e.g., anatoxins). This increases the probability of identifying additional variants of these classes, rather than compounds with structural differences that could require different LC-MS settings. The discovery of new variants in a known class is still highly relevant because toxicity depends on chemical structure. For example, among 18 microcystin congeners, the IC50 values for enzyme inhibition of serine/threonine-protein phosphatases ranged over six orders of magnitude (Altaner et al., 2020). Another reason may be that scientific results are limited by the environments that are predominantly explored, while less attention has thus far been paid to, for example, cyanobacteria from terrestrial or extreme environments, or those that live symbiotically with other organisms (Kaasalainen et al., 2012).
Discoveries of natural products from other bacteria started in the 1940s (Pye et al., 2017), whereas cyanobacterial metabolite discoveries began only in the late 1960s. The overall trend is driven especially by discoveries of peptide-based metabolites, and common cyanopeptide classes make up more than one third of all peptidic metabolites in CyanoMetDB (top 5 contributing classes in Fig. 2B). To date, the database contains 2010 compounds including 310 microcystins, 193 cyanopeptolins (also called micropeptins), 211 other depsipeptides, 101 anabaenopeptins, 85 microginins, 67 aeruginosins, 64 cyclamides, 38 cryptophycins, 38 saxitoxins, 26 spumigins, 25 microviridins, 16 nodularins, 11 anatoxins, and 5 cylindrospermopsins. Within each class, compounds show high structural similarity, supporting the previous observation that the contribution of novel structural scaffolds is typically low for natural products, including within the pool of cyanobacterial metabolites (Pye et al., 2017).

Chemical space
CyanoMetDB shows that cyanobacterial secondary metabolites cover a wide range of molecular weights, between 118 and 2708 Da. Of these, 59% are cyclic compounds and 69% are peptides. The distribution of molecular weights of the 2010 compounds is shown in Fig. 3A and 3B, which demonstrate that compounds with at least one peptide bond account for most of the compounds with molecular weights of 900 Da and higher.
Particularly among the peptides, many compounds are present in the 1000-1100 Da range. These peptides can be classified to a large extent into common peptide classes including microcystins, cyanopeptolins and other cyclic depsipeptides that cover the majority of compounds with molecular weights above 900 Da (Fig. 4). The distribution based on metabolite classes shows that microcystins and cyanopeptolins, in particular, contribute to the high abundance of known metabolites between 1000 and 1100 Da. For the other non-peptide metabolites, there is a particularly high frequency of metabolites with masses between 350 and 500 Da (Fig. 3B), dominated by linear non-peptides (Fig. 4). The molecular weight distribution of more than 28,000 marine natural products has previously been shown to also center around 350 Da, with chemical diversity aligned with the biological diversity of the producing species (Blunt et al., 2018). These authors reported that compounds with higher molecular weights that diverged from this trend were predominantly produced by cyanobacteria as well as echinodermata, dinophyta and tracheophyta (mangroves).
Non-peptide secondary metabolites make up only 31% of entries in CyanoMetDB, but this might be an underrepresentation. In general, the non-peptide-based metabolites from cyanobacteria are more difficult to classify because they lack unifying structural features. The structural information (SMILES codes) in the database allows for substructure searching to identify common molecular motifs. For example, of the non-peptide compounds, 15% contain an ester bond and 11% a pyrrolidine ring. Most compounds show unsaturation with 64% carrying at least one aromatic ring, 30% having 1-2 aromatic rings, and 6% having 3-6 aromatic rings. Halogen atoms are present in 35% of the non-peptide compounds (30% Cl, and 5% Br) and 12% contain sulfur. Isomeric compounds, i.e., compounds with the same molecular formula but different atom connectivity, are common among the cyanobacterial secondary metabolites. For 270 unique molecular formulae, we identified between 2 and 11 isomeric compounds, affecting 706 compounds in the database. Data in Fig. 3C show the distribution of isomeric compounds across molecular weights. Microcystins show the highest number of isomeric compounds (5 to 11) for molecular weights around 1000 Da. The database contains 179 molecular formulae with 2 isomeric compounds, 56 formulae with 3 isomeric compounds and 20 formulae with 4 isomeric compounds. The presence of isomeric compounds makes unequivocal identification of individual compounds challenging for mass spectrometry-based analytical methods and highlights the importance of MS/MS data interpretation and the availability of authentic standards for retention time matching.
Secondary metabolites from cyanobacteria have mostly been discovered by "top-down" approaches from extraction of biomass and analyses that were guided by chemical motifs (e.g., peptides, molecular weight groups, common mass spectrometric product ions) or non-targeted searches by MS.
Within CyanoMetDB, 12% of compounds were first identified from field samples and the remaining compounds were identified from laboratory-grown cultures. In total, more than 50 different cyanobacterial genera were used, considering the genera listed in the associated publications that first identified the structure. Note that the taxonomic classification has subsequently changed for some genera, and additional changes may be introduced in accordance with future knowledge on the taxonomy of cyanobacteria. Dominant genera were: Moorea/Lyngbya, Microcystis, Nostoc, Anabaena/Dolichospermum, Oscillatoria/Planktothrix, Nodularia, Scytonema, Fischerella, and Symploca, in decreasing order of total entries. These genera are not necessarily the main producers of the respective metabolites but were predominantly used for initial structure elucidation.
Compounds from CyanoMetDB that were not previously listed in the Natural Products Atlas but that fulfil the requirements for inclusion have now been added to this online repository of bacterial secondary metabolites (NP Atlas version 2020_08; van Santen et al., 2019). A comparison of the chemical space of these 1640 natural products from cyanobacteria against more than 28,000 natural products from any bacterium in the NP Atlas shows a relatively even spread of cyanobacterial metabolites, supporting their high structural diversity (Fig. 5). The spherical plot uses the molecular composition to position each compound based on C:H ratios (radial value in xy-plane), C:O ratios (angle with z-axis), and C:N ratios (distance from origin), and nodes illustrate the relationship among clusters regarding their structural similarities (van Santen et al., 2019). Here, cyanobacterial secondary metabolites are linked by 239 out of 6800 nodes in the Natural Products Atlas repository.
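The isomer statistics reported in this section (how many molecular formulae are shared by two or more compounds, and how many compounds are affected) can be reproduced in principle by grouping entries on their molecular formula. The sketch below is one possible way to do this in Python and is not the procedure used by the authors. The toy records are textbook amino acids chosen only to exercise the code; they are not CyanoMetDB entries, and the function names are hypothetical.

```python
import re
from collections import Counter, defaultdict


def formula_key(formula):
    """Turn a formula such as 'C6H13NO2' into a sorted tuple of
    (element, count) pairs so that differently ordered spellings of the
    same formula compare equal."""
    counts = defaultdict(int)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return tuple(sorted(counts.items()))


def isomer_groups(records):
    """Group (name, formula) pairs by formula and keep only the groups
    that contain two or more compounds."""
    groups = defaultdict(list)
    for name, formula in records:
        groups[formula_key(formula)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Toy records for illustration only (not taken from CyanoMetDB). The second
# formula is deliberately written in a different element order.
records = [
    ("leucine", "C6H13NO2"),
    ("isoleucine", "C6H13O2N"),
    ("glycine", "C2H5NO2"),
]

groups = isomer_groups(records)
n_formulae = len(groups)                                      # formulae shared by >= 2 compounds
n_compounds = sum(len(names) for names in groups.values())    # compounds affected
size_histogram = Counter(len(names) for names in groups.values())
print(n_formulae, n_compounds, dict(size_histogram))
```

Applied to the full database, `n_formulae`, `n_compounds` and `size_histogram` would correspond to the kinds of counts quoted above (formulae with isomers, affected compounds, and formulae with 2, 3 or 4 isomers). Note that grouping on formula alone also merges stereoisomers, so the result depends on how stereoisomeric entries are treated.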

Implications for cyanobacterial research
CyanoMetDB holds significant potential as a tool for supporting diverse areas of cyanobacterial research, for instance: aiding the identification of known and novel cyanobacterial metabolites; exploring the impact of biosynthetic pathways on cyanobacterial metabolite profiles and dynamics; providing a framework around which information may be collated regarding structural characteristics and biological activity of cyanobacterial metabolites; and understanding the environmental occurrence, fate and transformation of cyanobacterial secondary metabolites. Below, we explore some of these opportunities in further detail.

Identification by liquid chromatography-mass spectrometry (LC-MS)
Just as mass spectrometry-based analytical methods have played a significant role in the rapid discovery of many of the cyanobacterial metabolites included in CyanoMetDB, modern LC-MS(/MS) methods are also one of the primary applications of the database. Comprehensive screening of culture- or field-derived samples for cyanobacterial secondary metabolites presents a significant analytical challenge, in part due to the large number of chemically diverse compounds that exist and due to the presence of many isomeric compounds. Moreover, commercially available chemical standards do not exist for the vast majority of these compounds. Together, these factors render targeted triple-quadrupole-based LC-MS/MS methods less effective for analysis of a wider range of cyanobacterial metabolites. Instead, LC-high resolution MS (LC-HRMS) methods that are able to efficiently separate and selectively detect ionizable chemicals in complex matrices, with or without fragmentation, hold great potential. A popular strategy for analyzing compounds by LC-HRMS is termed 'suspect-screening'. This approach involves searching full-scan HRMS data for exact m/z values of interest, which is usually done using commercial or open-source software (e.g., Thermo Trace Finder and Compound Discoverer, or Skyline, respectively) (Gunthardt et al., 2020; Natumi and Janssen, 2020). Detected compounds of interest require further confirmation, ideally through retention time matching against a chemical standard and detection of characteristic MS/MS fragmentation spectra, applying standardized criteria to specify the confidence of compound identification (Schymanski et al., 2014). Relatively comprehensive full-scan and tandem HRMS data can be acquired using either data-dependent or data-independent HRMS/MS acquisition strategies, which are suitable for suspect-screening and confirmation of abundant cyanobacterial secondary metabolites.
CyanoMetDB will greatly expand the power of such LC-HRMS-based suspect-screening workflows, allowing known metabolites to be detected more routinely and thereby avoiding "rediscoveries" of known compounds (i.e., dereplication). In the case of data-dependent acquisition, the m/z values in CyanoMetDB can be used as an inclusion list to preferentially trigger MS/MS fragmentation of precursors detected in full-scan with the m/z values of interest.
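To make the suspect-screening step described above more concrete, the following Python sketch derives the m/z of the protonated molecule [M+H]+ from a molecular formula and matches an observed m/z against a small suspect list within a ppm tolerance. It is a simplified illustration under stated assumptions, not the workflow of any of the software packages named above: only the [M+H]+ adduct and the elements C, H, N, O and S are handled, the 5 ppm tolerance is an arbitrary choice, the observed m/z value is hypothetical, and the function names are invented for this sketch. The molecular formulae of microcystin-LR and anatoxin-a are well established.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes; only the elements
# needed for this sketch are included.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.972071,
}
PROTON = 1.00727647  # mass of a proton (u)


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in a formula like 'C10H15NO'."""
    total = 0.0
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(number) if number else 1)
    return total


def mz_protonated(formula):
    """m/z of the singly charged protonated molecule [M+H]+."""
    return monoisotopic_mass(formula) + PROTON


def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_suspects(observed_mz, suspects, tol_ppm=5.0):
    """Return the names of suspects whose [M+H]+ lies within tol_ppm."""
    return [
        name
        for name, formula in suspects.items()
        if abs(ppm_error(observed_mz, mz_protonated(formula))) <= tol_ppm
    ]


# Small suspect list: compound name -> molecular formula.
suspects = {
    "microcystin-LR": "C49H74N10O12",
    "anatoxin-a": "C10H15NO",
}

# Inclusion list for data-dependent acquisition: one [M+H]+ value per suspect.
inclusion_list = {name: round(mz_protonated(f), 4) for name, f in suspects.items()}

observed = 995.5571  # hypothetical full-scan m/z, for illustration only
print(inclusion_list)
print(match_suspects(observed, suspects))
```

Because isomeric compounds share the same formula and hence the same m/z, a match of this kind yields only a candidate list; retention time matching and MS/MS evidence remain necessary for confident identification. A fuller implementation would also consider other adducts and charge states.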
In addition to suspect-screening, one can also interrogate LC-HRMS datasets in more holistic ways, referred to as untargeted or non-target analysis. Non-target screening is primarily done using commercial or open-source metabolomics software packages (Hohrenk et al., 2020; Schymanski et al., 2015; Walsh et al., 2019) or with dedicated custom tools (e.g., Global Natural Product Social Molecular Networking, GNPS) (Wang et al., 2016). These workflows typically build upon suspect-screening strategies, starting with the identification of expected compounds and then extending the search space to detect compounds outside the realm of known compounds (as defined by a list or database of known compounds). Here, novel compounds are revealed through similarities in MS/MS data, mass defect analysis, searching for retention time or molecular formula ranges of interest, and by assessing statistical differences between sample groups within a dataset. For example, untargeted workflows have been reported for microcystins that can detect new analogues based on characteristic product ions that are sensitively detected for most known microcystins (Ortiz et al., 2017; Roy-Lachapelle et al., 2019). The comprehensive structural and formula data within CyanoMetDB will enable development of such untargeted workflows and improve the detection of known and novel cyanobacterial secondary metabolites in cultured and environmental samples. Moreover, this information can also be used in combination with in silico MS/MS fragmentation tools (e.g., Mass Frontier, mMass, MetFrag) to predict possible structures for MS/MS product ions, i.e., to aid compound annotation (Natumi and Janssen, 2020; Niedermeyer and Strohalm, 2012; Ruttkies et al., 2016; Sheldon et al., 2009). CyanoMetDB is now among the databases available in MetFrag (https://msbi.ipb-halle.de/MetFragBeta/) for generating predicted MS/MS data.
Recent developments in the area of machine learning-guided MS/MS data interpretation hold great potential for enhancing the accuracy of in silico-based compound identification efforts, in particular through their capacity to predict both the types and intensities of product ions likely to be generated from a given structure, e.g., Competitive Fragmentation Model, CFM-ID (Djoumbou-Feunang et al., 2019). Naturally, the effectiveness of such approaches hinges upon the availability and quality of the MS/MS data, which cannot always be guaranteed for environmental samples with low-abundance metabolites or complex matrices.

Biosynthetic analysis
Discovery of secondary metabolites based on genetic information is a promising "bottom-up" approach that can reveal additional, unknown compounds (Moosmann et al., 2018). Recent advances in microbial genomics have greatly improved our understanding of the biochemical mechanisms responsible for the biosynthesis of cyanobacterial natural products (Dittmann et al., 2015). The majority of these natural products are synthesized through secondary metabolic pathways in coordinated enzyme cascades (Dittmann et al., 2015). Approximately 75% of the natural products included in the CyanoMetDB dataset can be assigned to structural families for which biosynthetic pathways have been reported from one or more representatives. Many families of secondary cyanobacterial metabolites exhibit extensive chemical variation but share a structural core that defines the family (Fig. 1). Typically, the biosynthesis of compounds in one family shares a set of conserved enzymes for the synthesis of the defining structural core, but also involves a set of accessory tailoring enzymes that are not universally conserved (Dittmann et al., 2013, 2015). The biosynthetic logic underlying the biosynthesis of most common cyanobacterial toxins, including microcystins (Tillett et al., 2000), nodularins (Moffitt and Neilan, 2004), saxitoxins (Kellmann et al., 2009), cylindrospermopsins (Mihali et al., 2008) and anatoxins (Mejean et al., 2009), is now well understood. However, the understanding of the biosynthetic basis of specific chemical variants is incomplete in many cases. Numerous molecular ecology methods have been developed to characterize the types of toxin producers in blooms, based on the biosynthetic gene clusters (Dittmann et al., 2013). The compilation of CyanoMetDB has shown that the chemical variation of the major secondary metabolites is more extensive than previously thought.
A more complete understanding of the biosynthetic basis for the chemical variation of secondary metabolites from cyanobacteria is necessary to ensure the unbiased detection of their biosynthetic pathways directly from environmental samples.

Bioactivity and structure
With the fast-growing number of new secondary metabolites from cyanobacteria and the growing awareness of their (sometimes beneficial, and sometimes detrimental) biological activities, the availability of a comprehensive database is essential. Cyanobacterial metabolites present cytotoxic, dermatotoxic, hepatotoxic, neurotoxic, enzyme-inhibiting, antimicrobial, antifungal, antiprotozoal, and anti-inflammatory activities that can also be exploited by the pharmaceutical industry to develop new drugs that are potentially beneficial to humans (Huang and Zimba, 2019; Kini et al., 2020; Singh et al., 2011). Future discoveries are likely to be structurally similar to previously discovered metabolites but may exhibit differing potencies, and are thus important to identify. Any modification to the structure, such as replacement of amino acids by other residues, substitution by methylation, halogenation or oxidation, and changes in configuration, can significantly affect the ability of cyanobacterial metabolites to evoke a biological, or toxic, response. For example, in the case of microcystins, biological activity toward protein phosphatases is underpinned by their cyclic structure and the Adda-D-Glu region, which interacts with the catalytic unit of these enzymes (McLellan and Manderville, 2017). The modification of the less conserved positions 2 and 4 (Fig. 1) also plays a significant role in microcystin bioactivity (Bouaicha et al., 2019; Fontanillo and Kohn, 2018). The selective activity of cyanopeptolins against serine proteases depends on the residue in the Ahp-neighboring position. The Arg-Ahp-containing cyanopeptolin variants are mainly active against trypsin, whereas cyanopeptolins with hydrophobic amino acids in this position (e.g., Phe, Tyr, Ile) show potent activity against chymotrypsin (Yamaki et al., 2005).
In the case of cryptophycins, which are cyclic depsipeptides composed of four subunits, the effect on microtubule dynamics is determined by the intact 16-membered macrolide structure, the reactive epoxide ring in unit 1, the methyl groups in units 1 and 3, the O-methyl group and chloro-substituent in unit 2, and the isobutyl group in unit 4 (units 1-4 in Fig. 1 correspond to units A-D in Golakoti et al., 1996). The interaction of a metabolite with cellular molecules that causes an observable adverse outcome (i.e., toxicodynamics) depends critically on the structure of the metabolite. In silico studies, termed virtual screening (target-based or ligand-based), use structural information of compounds deposited in databases to overcome problems related to the limited availability of chemical standards (i.e., pure compounds) and the lack of information on their physico-chemical properties (Kirchweger and Rollinger, 2018). In the case of new drug development from cyanopeptides, virtual screening could also help reduce the cost and increase the success rate of the process. Cheminformatics can also assist the discovery of as-yet unknown cyanobacterial metabolites in so-called target-fishing (Brzuzan et al., 2020; Liang et al., 2018; Zhu et al., 2015). Modern machine learning techniques can assist in predicting the effect of an unknown metabolite after training the model with known compounds, as recently demonstrated for microcystins (Altaner et al., 2020). CyanoMetDB provides structural information for each metabolite in a format immediately accessible to software tools. The deposited structures in CyanoMetDB can also be used as templates in these approaches and to design new chemical entities with desirable traits and ligand-target interactions.
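As an illustration of the ligand-based idea described above, the following minimal sketch ranks microcystin congeners by structural similarity to a query compound using the Tanimoto coefficient. It is not part of CyanoMetDB or of any published workflow. The feature encoding (one feature per ring position and residue) is a deliberately crude, hypothetical stand-in for the molecular fingerprints that cheminformatics tools would derive from the deposited structure codes, and the function names are chosen for illustration only.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def residue_features(residues):
    """Encode each residue together with its 1-based ring position.

    Assumption: a simplified stand-in for a real structural fingerprint.
    """
    return frozenset(enumerate(residues, start=1))


# Residues at positions 1-7 of three well-known microcystins; the congeners
# differ only in the variable positions 2 and 4.
MICROCYSTINS = {
    "MC-LR": ["D-Ala", "Leu", "D-MeAsp", "Arg", "Adda", "D-Glu", "Mdha"],
    "MC-RR": ["D-Ala", "Arg", "D-MeAsp", "Arg", "Adda", "D-Glu", "Mdha"],
    "MC-LA": ["D-Ala", "Leu", "D-MeAsp", "Ala", "Adda", "D-Glu", "Mdha"],
}


def rank_by_similarity(query, library):
    """Return (name, score) pairs for all other entries, most similar first."""
    q = residue_features(library[query])
    scores = [
        (name, tanimoto(q, residue_features(residues)))
        for name, residues in library.items()
        if name != query
    ]
    # Sort by descending score, then by name for a stable order on ties.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for name, score in rank_by_similarity("MC-RR", MICROCYSTINS):
        print(f"{name}: {score:.2f}")
```

In practice, the feature sets would be replaced by fingerprints computed from the SMILES codes, but the ranking logic would remain the same.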

Environmental behavior
As thousands of secondary metabolites have been identified from cyanobacteria, we now face the questions: how persistent are they? How stable are they in surface waters? How do their concentrations change during bloom events? Can they reach water treatment plant intakes, and does sufficient abatement occur during treatment? Answers to these questions are needed to quantify the exposure side of the risk assessment equation and to prioritize cyanobacterial metabolites for toxicity testing, monitoring of surface waters, and evaluating removal in engineered water treatment systems. Physico-chemical properties, reactivity with oxidants in surface waters and during water treatment, and biotransformation mechanisms are relevant to predict the behavior of secondary metabolites. With known chemical structures, models can be developed to predict physico-chemical properties from quantitative structure-activity relationships (QSARs). For example, octanol-water partition coefficients were predicted for 45 polar plant toxins with three different models (KOWWIN, ACD/Percepta, and Chemicalize) and the results were in close agreement with empirically derived values. Good correlations, i.e., QSARs, have also been observed among structurally related organic micropollutants and their reaction with oxidants used in (advanced) water treatment (e.g., O3, ClO2, HOCl, HFeO4−) (Lee and von Gunten, 2012). The use of training sets enables calibration of such models and improves their predictive power for properties of unknown compounds. Machine learning can be combined with such modeling to train QSARs, as recently demonstrated for the prediction of pKa values of more than 6000 organic molecules (Mansouri et al., 2019). Biotransformation mechanisms can also be predicted in part for compounds with known structures. For example, EnviPath is an open-access tool that uses known transformation rules to propose degradation pathways for organic molecules (Wickert et al., 2016).
Such predictions can also assist in the search for transformation products of known metabolites. The structural information in CyanoMetDB allows substructure searching for moieties with known reactivity in biotic and abiotic transformation processes, and the transformation products can be included in QSAR models. QSAR models may be particularly useful within compound classes that share a core structure, but where empirical parameters are only available for some congeners. Such models can be explored to prioritize secondary metabolites from cyanobacteria for further research and monitoring according to their expected toxicity, persistence and mobility in the environment.
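The following minimal sketch illustrates how a QSAR calibrated on congeners with empirical parameters could be used to rank untested congeners of the same class. It assumes the simplest possible model, a one-descriptor linear relationship fitted by ordinary least squares. The descriptor values, property values and congener names are placeholders chosen only to make the example runnable; they are not measured data and are not taken from CyanoMetDB.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a descriptor value."""
    slope, intercept = model
    return slope * x + intercept


# Placeholder training set: (descriptor, property) pairs for congeners with
# empirical values. Hypothetical numbers, for illustration only.
TRAINING = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

# Placeholder descriptors for congeners lacking empirical values.
UNTESTED = {"congener_A": 0.5, "congener_B": 1.5}

model = fit_line([d for d, _ in TRAINING], [p for _, p in TRAINING])

# Rank untested congeners by predicted property, highest first.
ranked = sorted(UNTESTED, key=lambda name: predict(model, UNTESTED[name]), reverse=True)

if __name__ == "__main__":
    for name in ranked:
        print(name, round(predict(model, UNTESTED[name]), 2))
```

A real application would use several structure-derived descriptors and an independent validation set; the single-descriptor fit is used here only to keep the calibrate-then-predict logic visible.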

Conclusion
Cyanobacterial metabolites have now been studied for over 60 years. During this time, thousands of chemical structures have been reported across hundreds of primary research articles, which have only in part been included in curated lists by individual research groups. In this work, with a view to promoting effective analysis and interchange of information about cyanobacterial metabolites, we have manually collated and evaluated these disparate resources (including 850 primary research articles, as of December 2020) to generate a single, unified database of known cyanobacterial secondary metabolites. Termed CyanoMetDB, this database contains 2010 individual records, each corresponding to a unique cyanobacterial secondary metabolite and associated chemical descriptors: SMILES string, molecular weight, monoisotopic mass, molecular formula, etc. We made CyanoMetDB an open-access resource to ensure that it is accessible to a broad audience, whilst also fulfilling the goal of supporting the analysis and identification of cyanobacterial secondary metabolites in the future. Herein, we supply CyanoMetDB and literature metadata as separate flat-files and converted spreadsheet format files. To this end, the current and future versions of CyanoMetDB are and will be available on Zenodo (Jones et al., 2021) and the NORMAN Suspect List Exchange (https://www.norman-network.com/nds/SLE/). We recommend citing the repository on Zenodo along with this article when CyanoMetDB content is used.
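To illustrate how a flat-file suspect list with the descriptors listed above might be used for dereplication, the following minimal sketch computes monoisotopic masses from molecular formulae and matches an observed [M+H]+ m/z value within a ppm tolerance. The column names (CompoundName, MolecularFormula) and the three-row inline file are assumptions made for this example and do not necessarily reflect the actual CyanoMetDB file layout; the molecular formulae are those of the named compounds.

```python
import csv
import io
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}
PROTON = 1.007276466812  # mass of a proton (u), added for [M+H]+

# Hypothetical flat-file excerpt; column names are assumed for illustration.
FLAT_FILE = """CompoundName,MolecularFormula
Microcystin-LR,C49H74N10O12
Anatoxin-a,C10H15NO
Cylindrospermopsin,C15H21N5O7S
"""


def monoisotopic_mass(formula):
    """Sum element masses for a simple formula such as 'C49H74N10O12'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def screen(observed_mz, records, tol_ppm=5.0):
    """Return (name, ppm error) for records whose [M+H]+ matches observed_mz."""
    hits = []
    for record in records:
        theoretical = monoisotopic_mass(record["MolecularFormula"]) + PROTON
        ppm = (observed_mz - theoretical) / theoretical * 1e6
        if abs(ppm) <= tol_ppm:
            hits.append((record["CompoundName"], round(ppm, 2)))
    return hits


records = list(csv.DictReader(io.StringIO(FLAT_FILE)))

if __name__ == "__main__":
    print(screen(995.5560, records))
```

A mass match alone yields only a tentative annotation, since isomeric metabolites share the same molecular formula; fragmentation spectra or reference standards are needed for confirmation.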
Moving forward, our aim is to continue the work presented herein by: 1) adding missing and newly reported cyanobacterial secondary metabolites to CyanoMetDB as they are published, and 2) using CyanoMetDB as a framework for connecting and collating various data sources associated with cyanobacterial metabolites, e.g., their tandem mass spectrometry product ion spectra, toxicity data, etc. We anticipate that the continued curation of CyanoMetDB will enrich existing and establish new collaborative research efforts, enhance the frequency with which compound annotations are assigned, and aid communication, comparison and interpretation of results.
The approach used to build the CyanoMetDB dataset could also be applied to build related datasets containing toxins and secondary metabolites from other organisms, e.g., those produced by marine microalgae. Such efforts would facilitate both research on and monitoring of marine microalgal toxins in the food chain. Indeed, some toxin classes, such as saxitoxins, were first identified in marine dinoflagellates before their discovery in cyanobacterial samples. Thus, datasets of cyanobacterial and marine microalgal metabolites will partially overlap.

Acknowledgments
We thank the following funders: São Paulo State Research Foundation (FAPESP, Grant No. 2014/50420-9) to E.P.; University of São Paulo Foundation (FUSP, Grant No. 1979 (Project No. 82845)) and Jane and Aatos Erkko Foundation to K.S. We acknowledge COST Action ES 1105 "CYANOCOST - Cyanobacterial blooms and toxins in water resources: Occurrence, impacts and management" for its role in connecting experts in the field. We thank Roger Linington and Jeffrey van Santen, curators of the Natural Products Atlas, for their collaboration.

Appendix
Genus: Genus of the specimen used for extraction and structure elucidation, as in the listed reference; this is not a list of possible producing genera, only the genus used in the listed primary literature. When field samples were used, this is given as "field sample".
Species: Species of the specimen used for extraction and structure elucidation, as in the listed reference; this is not a list of possible producing species, only the species used in the listed primary literature. Example: aeruginosa.
Strain: Identification of the strain used. Example: NIES-90.
Field sample: Information on the field sample, if one was used.
Notes: Any additional comment on the entry. Example: stereochemistry not completely resolved.