Unifying antimicrobial peptide datasets for robust deep learning-based classification

Leguminous crops are vital to sustainable agriculture due to their ability to fix atmospheric nitrogen, improving soil fertility and reducing the need for synthetic fertilizers. Additionally, they are an excellent source of protein for both human consumption and animal feed. Antimicrobial peptides (AMPs), found in various leguminous seeds, exhibit broad-spectrum antimicrobial activity through diverse mechanisms, including interaction with microbial cell membranes and interference with cellular processes, making them valuable for enhancing crop resilience and food safety. In the field of plant sciences, computational biology methods have been instrumental in the discovery and optimization of AMPs. These methods enable rapid exploration of sequence space and the prediction of AMPs using deep learning technologies. Optimizing AMP annotations through computational design offers a strategic approach to enhance efficacy and minimize potential side effects, providing a viable alternative to conventional antimicrobial agents. However, the presence of overlapping sequences across multiple databases poses a challenge for creating a reliable dataset for AMP prediction. To address this, we conducted a comprehensive analysis of sequence redundancy across various AMP databases. These databases encompass a wide range of AMPs from different sources and with specific functions, including both naturally occurring and artificially synthesized AMPs. Our analysis revealed significant overlap, underscoring the need for a non-redundant AMP sequence database. We present the development of a new database that consolidates unique AMP sequences derived from leguminous seeds, aiming to create a more refined dataset for the binary classification and prediction of plant-derived AMPs. This database will support the advancement of sustainable agricultural practices by enhancing the use of plant-based AMPs in agroecology, contributing to improved crop protection and food security.


© 2024 The Author(s)

Data collection
The dataset for the study on antimicrobial peptides (AMPs) was gathered through a rigorous analysis of sequence redundancy across multiple AMP databases, encompassing both naturally occurring and artificially synthesized AMPs. The Dover analyzer (http://mobiosd-hub.com/doveranalyzer/) was used to identify and eliminate redundant AMP entries, thereby revealing the relationships among 28 distinct AMP databases. To enrich the dataset, non-AMP data were sourced from the UniProtKB database (https://www.uniprot.org/uniprotkb), with a specific focus on plant proteins meeting predetermined criteria regarding sequence length and taxonomy. The non-AMP dataset underwent hierarchical redundancy removal using CD-HIT (https://www.bioinformatics.org/cd-hit/), a widely adopted clustering tool tailored for large sequence databases. After data collection, several preprocessing steps were applied to ensure data integrity and compatibility for downstream analysis. Special characters were systematically removed from the dataset entries to standardize the format and allow uniform processing. Furthermore, to distinguish between AMPs and non-AMPs, peptides shared across both datasets were identified and excluded from the final datasets. The processed AMP and non-AMP datasets were integrated with the Hugging Face dataset module (https://huggingface.co/datasets), a platform for machine learning and data science work, which keeps the data easily accessible for subsequent analysis and model development.

Data source location
The data for this study were collected from various publicly available databases, which are typically hosted online and accessible globally (see Table 1). The specific geographical coordinates of the data sources may vary, as they are often maintained by institutions or organizations worldwide. Additionally, the final datasets were integrated and stored using the resources available at our institution, namely the Recherche Data Gouv platform.

Data accessibility

Value of the Data
• The paper describes the creation of a non-redundant antimicrobial peptide (AMP) dataset by analyzing sequence overlap across multiple databases, with the aim of training a robust binary AMP classifier.
• A better-performing AMP classifier can accelerate the discovery and optimization of AMPs, which offer a potential solution to the escalating problem of antibiotic resistance.
• Model performance is strongly related to the size of the training data. This dataset retrieves AMPs from the databases in a comprehensive way and can serve as a knowledge base for researchers who want to develop their own specific predictors.
• In addition, despite the large number of available models, the number of standardized datasets is still very limited. This dataset is expected to become one of the standardized datasets in the future.
• Finally, in the context of deep learning modeling, the standardized Hugging Face release makes the dataset easier to access and use.

Background
The excessive use of synthetic chemical pesticides has led to environmental and health concerns, including the development of resistance among various harmful agricultural pests and pathogens [1]. The search for sustainable and effective biotic control options is a significant focus of contemporary agricultural research [2].
Antimicrobial peptides (AMPs), especially those extracted from plant resources, offer a promising avenue for developing new, environmentally friendly pest control solutions. These peptides are an important part of the innate immune system of various organisms [3], with broad activity against microorganisms by binding, penetrating, and interfering with microbial cells [4,5].
The in silico approach improves the accuracy of AMP design and the discovery of potential AMPs through pattern recognition, sequence alignment, and machine learning [6,7]. It reduces time and costs by efficiently exploring sequence space for antimicrobial activity prediction.
Dataset size has a critical impact on predictive model performance, and the growing number of AMP databases plays a vital role as a repository of AMPs. However, many AMP databases, built for various purposes, contain overlapping sequences [8]. This study aims to analyze sequence redundancy across different AMP databases and to create a non-redundant sequence database for binary AMP classification.

Data Description
Two AMP classification datasets (Fabaceae and Viridiplantae) were organized according to the non-AMP source. The Fabaceae dataset contains 58,119 training entries, 18,231 validation entries, and 14,530 test entries; the Viridiplantae dataset contains 91,131 training entries, 28,548 validation entries, and 22,783 test entries.
Each of the two datasets corresponds to a Hugging Face Dataset, and each entry contains five features (index, id, sequence, length and label).
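To make the five-feature entry layout concrete, the following minimal Python sketch builds entries from (id, sequence) pairs. The field names come from the description above; the label encoding (1 for AMP, 0 for non-AMP), the identifier format and the example peptide are assumptions made for illustration only and are not taken from the released datasets.

```python
from dataclasses import dataclass, asdict


@dataclass
class Entry:
    index: int     # running position within a split
    id: str        # source identifier (format is hypothetical here)
    sequence: str  # amino acid sequence
    length: int    # number of residues in the sequence
    label: int     # assumed encoding: 1 = AMP, 0 = non-AMP


def make_entries(records, label):
    """Build entries from (id, sequence) pairs that all share one label."""
    return [
        Entry(index=i, id=rec_id, sequence=seq, length=len(seq), label=label)
        for i, (rec_id, seq) in enumerate(records)
    ]


# Hypothetical example record, used only to show the layout.
amps = make_entries([("amp_0001", "GLFDIVKKVVGALGSL")], label=1)
rows = [asdict(entry) for entry in amps]
print(rows[0])
```

A list of plain dictionaries such as `rows` is a convenient intermediate form, since table-oriented dataset libraries can usually ingest it directly.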

Data Source
The 28 databases used to construct the AMP dataset are described in Table 1. Of these, 22 AMP databases were the AMP data sources used in [1], with the APD database updated [3] and the other 21 unchanged. In addition, 6 new databases (BaAMPs, dbAMP, DRAMP General, DRAMP Patent, DRAMP Clinical, DRAMP Specific) were added.
The non-AMPs were retrieved from the UniProtKB database. Because the research purpose is to predict AMPs from plant resources, especially from legume seeds, the protein selection was limited to the plant kingdom. Database searches were conducted according to predetermined criteria on sequence length and taxonomy. Two non-AMP datasets, Fabaceae and Viridiplantae, were generated, with 59,606 and 682,607 entries respectively.

Data Preprocessing
The Dover analyzer was used to remove redundant AMPs, and the results revealed the relationships among the 28 AMP databases. The percentage of non-redundant AMP content in each database is shown in Table 2, which indicates that all AMP databases contain more than 50% non-redundant AMP content. Two heatmaps show the percentage of overlap between databases: the first (Fig. 1) gives the percentage of each row database contained in each column database, while the second gives the percentage of each row database that is completely stored in a given number of databases. A total of 44,099 AMP entries were obtained after preprocessing.
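The row-versus-column overlap percentage described above can be illustrated with a short Python sketch. It assumes that redundancy means exact string identity between sequences, which may differ from what the Dover analyzer actually does; the database names and sequences are toy values.

```python
def overlap_matrix(databases):
    """Return {row: {col: percent of row's unique sequences also found in col}}."""
    sets = {name: set(seqs) for name, seqs in databases.items()}
    matrix = {}
    for row, row_set in sets.items():
        matrix[row] = {}
        for col, col_set in sets.items():
            shared = len(row_set & col_set)
            # Percentages are relative to the row database, so the matrix
            # is generally not symmetric.
            matrix[row][col] = 100.0 * shared / len(row_set) if row_set else 0.0
    return matrix


def merge_unique(databases):
    """Merge all databases into one list of unique sequences (first-seen order)."""
    seen = {}
    for seqs in databases.values():
        for seq in seqs:
            seen.setdefault(seq, None)
    return list(seen)


# Toy input: three hypothetical databases.
toy = {
    "db_a": ["AAK", "GLF", "KWK"],
    "db_b": ["GLF", "KWK"],
    "db_c": ["RRW"],
}
matrix = overlap_matrix(toy)
unique = merge_unique(toy)
```

In this toy case every sequence of `db_b` also appears in `db_a`, while only two of the three sequences of `db_a` appear in `db_b`, which is why the heatmap has to be read row by row.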
The redundant non-AMPs in the Viridiplantae dataset were hierarchically removed by CD-HIT (Table 3).
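As a rough illustration of hierarchical clustering with CD-HIT, the Python sketch below assembles a chain of command lines in which each level takes the output of the previous one as input. The identity thresholds (0.9, 0.7, 0.5) and all file names are hypothetical, since the thresholds used in this work are reported in the clustering table rather than here. The options `-i`, `-o`, `-c` and `-n` are standard CD-HIT options, and the word sizes follow the ranges suggested in the CD-HIT user guide.

```python
import shlex


def word_size(identity):
    """Word length (-n) suggested by the CD-HIT user guide for a threshold."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def hierarchical_commands(fasta, thresholds):
    """Chain CD-HIT runs: each level clusters the output of the previous one."""
    commands, current = [], fasta
    for t in thresholds:
        out = f"nr{int(round(t * 100))}.fasta"  # hypothetical output name
        commands.append(
            ["cd-hit", "-i", current, "-o", out,
             "-c", str(t), "-n", str(word_size(t))]
        )
        current = out
    return commands


# Hypothetical input file and thresholds, for illustration only.
cmds = hierarchical_commands("viridiplantae.fasta", [0.9, 0.7, 0.5])
for cmd in cmds:
    print(shlex.join(cmd))
```

The sketch only prints the commands; running them requires a local CD-HIT installation.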
All characters other than the common amino acids were removed from the AMP and non-AMP sequences during preprocessing. Finally, the peptides shared between the AMP and non-AMP sets were deleted.
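These two cleaning steps can be sketched in Python as follows. The sketch assumes that "common amino acids" means the 20 standard one-letter codes and that sequences are upper-cased before filtering; both points are assumptions rather than details stated in the text. Shared peptides are dropped from both sets.

```python
import re

# Assumption: the 20 standard amino acid one-letter codes.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
_NON_STANDARD = re.compile(f"[^{STANDARD}]")


def clean(seq):
    """Upper-case a sequence and strip every non-standard character."""
    return _NON_STANDARD.sub("", seq.upper())


def clean_and_separate(amps, non_amps):
    """Clean both sets, then drop peptides that occur in both of them."""
    amp_clean = {clean(s) for s in amps} - {""}
    non_clean = {clean(s) for s in non_amps} - {""}
    shared = amp_clean & non_clean
    return sorted(amp_clean - shared), sorted(non_clean - shared)


# Toy sequences for illustration only.
amp_final, non_amp_final = clean_and_separate(
    ["GLF-DIV*", "kwkl", "AAAA"],
    ["AAAA", "MKT LLV"],
)
```

Cleaning before comparing matters: two entries that differ only by a stray symbol collapse to the same peptide and are then recognized as shared.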

Dataset Preparation
The AMPs and the non-AMPs (Fabaceae and Viridiplantae) were used to build the datasets for AMP prediction modeling. A total of 8,876 AMPs from 11 validated databases were selected as test data, with the remaining 35,229 AMPs used as training data (Table 4).
The Fabaceae and Viridiplantae non-AMP datasets were each divided into a training set and a test set by equidistant sampling, containing 37,420 / 9,355 and 78,685 / 19,672 non-AMPs respectively.
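Equidistant sampling can be sketched as taking every k-th item for the test set and keeping the rest for training. The step of 5 used below is an assumption: it is not stated in the text, but it reproduces the reported split sizes (78,685 / 19,672 from 98,357 items and 37,420 / 9,355 from 46,775 items). The starting offset of 0 is likewise assumed.

```python
def equidistant_split(items, step=5):
    """Send every `step`-th item (starting at position 0) to the test set."""
    test = items[::step]
    train = [x for i, x in enumerate(items) if i % step != 0]
    return train, test


# Integers stand in for protein entries; only the counts matter here.
train_a, test_a = equidistant_split(list(range(98357)))
train_b, test_b = equidistant_split(list(range(46775)))
print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

Compared with random sampling, this split is deterministic, and when the input is ordered (for example by length or taxonomy) it spreads the test items evenly across that ordering.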
The datasets were then integrated through the Hugging Face dataset module.

Limitations
While the antimicrobial peptide dataset presented in this research offers a quantitative edge, it lacks comprehensive label annotations, such as origin, specific antimicrobial targets, immune test parameters, and whether a peptide is synthetically produced or predicted. It is recommended that future users of the dataset supplement it with more detailed descriptions to improve the efficacy of predictive models (Fig. 1 and Table 5).

Fig. 1. Heatmap of sequence overlap between each pair of AMP databases. Overlap is expressed as the percentage of overlapping sequences in the row database.

Table 1
Antimicrobial peptide databases used as input to amp-dover, including a description of entry source, target, current availability, last update date, total number of entries and references.

Table 2
The total number of sequences before/after filtration and unique sequences percentage in each database.

Table 3
Results of hierarchical protein clustering on the Viridiplantae dataset using CD-HIT. The first column shows the similarity metric for each clustering level, and the last two columns give the corresponding numbers of input and output protein sequence entries.

Table 4
AMP database selection for the test set, containing input and output entries and the percentage of non-redundant peptides to itself.