FLAMS: Find Lysine Acylations and other Modification Sites

Abstract
Summary
Today, hundreds of post-translational modification (PTM) sites are routinely identified at once, but the comparison of new experimental datasets to already existing ones is hampered by the current inability to search most PTM databases at the protein residue level. We present FLAMS (Find Lysine Acylations and other Modification Sites), a Python3-based command line and web tool that enables researchers to compare their PTM sites to the contents of the CPLM, the largest dedicated protein lysine modification database, and dbPTM, the most comprehensive general PTM database, at the residue level. FLAMS can be integrated into PTM analysis pipelines, allowing researchers to quickly assess the novelty and conservation of PTM sites across species in newly generated datasets, aiding in the functional assessment of sites and the prioritization of sites for further experimental characterization.

Availability and implementation
FLAMS is implemented in Python3, and freely available under an MIT license. It can be found as a command line tool at https://github.com/hannelorelongin/FLAMS and through pip and conda, and as a web service at https://www.biw.kuleuven.be/m2s/cmpg/research/CSB/tools/flams/.


Introduction
To expand the functional diversity accomplished by proteins, amino acids can be modified after protein synthesis by means of post-translational modifications (PTMs), which are commonly enzymatic in nature (Walsh et al. 2005). A wide range of PTMs exists, generally divided into three main categories: (i) proteolytic cleavage, (ii) linkage of amino acids, and (iii) (reversible) addition of chemical moieties. PTMs have been detected on all 20 of the standard amino acids (Li et al. 2022), and many proteins carry multiple PTMs simultaneously. By the modification of specific residues, PTMs influence the residues' chemical properties, and as a result, PTMs can impact a protein's charge, conformation and binding (e.g. Šoštarić and van Noort 2021), which could ultimately influence a protein's function. This is referred to as the PTM code (Sims and Reinberg 2008).
Phosphorylation has historically received the most attention, as reflected by the large number of detected sites and the plethora of phosphorylation databases [a recent review summarized over 60 (Zhao et al. 2023)]. However, other post-translationally modified amino acids are coming to the fore, with lysine emerging as the amino acid capable of carrying the largest diversity of PTMs, as counted by dbPTM PTM categories. Nonetheless, resources for other amino acids are far more limited than for phosphorylation. For lysine, the second most represented amino acid in dbPTM, only one (non-species-specific) dedicated resource has been developed and maintained over the years: the Compendium of Protein Lysine Modifications (CPLM; http://cplm.biocuckoo.cn) (Liu et al. 2014). CPLM currently integrates and curates experimental protein lysine modification data from literature and 10 databases, and supplements it with rich metadata from 102 additional data sources (Zhang et al. 2022). CPLM stores information on 25 lysine modification types, encompassing the majority of lysine modification types.
As reported in dbPTM, hundreds of thousands of PTM sites have been discovered, driven by the continuously improving detection methods (Keenan et al. 2021). Due to the massive number of identified sites, functional assignments of these PTM sites are lagging behind. A comparison of the identified sites to those stored in PTM databases (by means of a BLASTp search for similar proteins in these databases) could aid in identifying already known, potentially even functionally assigned, modification sites. In addition, non-functionally assigned sites, especially those with repeated identifications across studies, could then be prioritized for functional characterization. Finally, the comparison to pre-existing datasets can also serve as a quality check, allowing researchers to quickly assess the overlap between their sites and those found during similar experiments. However, today, position-based searches, specifying both a PTM and its exact location, are generally not possible against major PTM databases. The major exception here appears to be the historically well-researched PTM phosphorylation, with peptide matching in EPSD (Lin et al. 2021), and site searches in PhosphoSitePlus (Hornbeck et al. 2012, 2019). To address this issue, we developed FLAMS (Find Lysine Acylations and other Modification Sites), which serves to find previously identified modification sites in the same and similar proteins across species, by enabling a position-based search of the CPLM database and the experimentally supported subset of dbPTM. These databases were chosen to represent an up-to-date, comprehensive overview of the PTM landscape (dbPTM) and add additional information on a large subset of these PTMs (CPLM), as lysine is, after serine (for which phosphorylation sites can already be searched for on a position basis in other databases), the amino acid carrying the most PTM sites.

Implementation
FLAMS uses a sequence similarity-based approach to match a user-provided query (consisting of at least a protein and a modification site) to similar proteins carrying the same PTM at a similar site. FLAMS does so by searching the CPLM and the experimental subset of dbPTM for this information, and returning the results in a tabular format. The search is fast because the data needed to identify conserved sites is stored directly in the FASTA headers, circumventing the need to consult any additional data source after the initial BLAST search. FLAMS can be used both from the command line and through its web interface. An overview of the workflow used by FLAMS is given below, and depicted visually in Fig. 1.

Preprocessing and aggregating protein PTM data
In the data aggregation and preprocessing stage, PTM data for a specific modification type are downloaded from the CPLM (Zhang et al. 2022) and/or the experimental part of dbPTM (Li et al. 2022). Each CPLM/dbPTM entry is converted into a FASTA formatted sequence record, storing relevant information in its header (i.e. the modification type, modification position, UniProt identifier, protein name, protein length and species name, as well as CPLM/dbPTM evidence). For dbPTM, this requires fetching the FASTA file of each protein through UniProt's API (The UniProt Consortium 2023), as sequence information is not included in the dbPTM download files. All entries are written to one multi-FASTA file per data source (i.e. one for CPLM and one for dbPTM) using BioPython v.1.79 SeqIO (Cock et al. 2009). As dbPTM entries depend on the downloaded UniProt FASTA file, some dbPTM entries may be excluded from the multi-FASTA file, e.g. when the UniProt identifier in dbPTM is obsolete. The CPLM and dbPTM multi-FASTA files are then integrated into one multi-FASTA file per modification type. These FASTA files are pre-generated, and hosted on Zenodo (doi: 10.5281/zenodo.10143463). FASTA files will be updated in accordance with dbPTM and/or CPLM updates.
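The header-encoding idea described above can be illustrated with a minimal Python sketch that serializes one PTM entry into a FASTA record and parses the header back. The field order, the `|` delimiter and the names `PtmEntry`, `to_fasta` and `parse_header` are assumptions made for this illustration; they are not taken from the FLAMS source, which writes its records with BioPython SeqIO.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical header layout, chosen for this sketch only.
HEADER_FIELDS = ("uniprot_id", "position", "modification", "protein_name",
                 "length", "species", "evidence")
DELIMITER = "|"


@dataclass(frozen=True)
class PtmEntry:
    """One modification site plus the metadata listed in the text."""
    uniprot_id: str
    position: int       # 1-based position of the modified residue
    modification: str
    protein_name: str
    length: int
    species: str
    evidence: str
    sequence: str


def to_fasta(entry: PtmEntry, width: int = 60) -> str:
    """Serialize an entry as a FASTA record with all metadata in the header."""
    values = [str(getattr(entry, name)) for name in HEADER_FIELDS]
    if any(DELIMITER in value or "\n" in value for value in values):
        raise ValueError("header fields must not contain the delimiter or newlines")
    header = ">" + DELIMITER.join(values)
    # Wrap the sequence into fixed-width lines.
    chunks = [entry.sequence[i:i + width] for i in range(0, len(entry.sequence), width)]
    return "\n".join([header, *chunks]) + "\n"


def parse_header(header: str) -> dict:
    """Recover the metadata from a header written by to_fasta."""
    values = header.lstrip(">").split(DELIMITER)
    if len(values) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} fields, got {len(values)}")
    record = dict(zip(HEADER_FIELDS, values))
    record["position"] = int(record["position"])
    record["length"] = int(record["length"])
    return record
```

Because everything needed to report a hit travels with the sequence record, a BLAST hit can be interpreted from its header alone. In practice, a real header layout would also have to account for BLAST treating the first whitespace-delimited token of a header as the sequence identifier.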

Creating local BLAST modification databases
The multi-FASTA file containing all CPLM and/or dbPTM entries for a specific modification type is subsequently downloaded from Zenodo and used to generate a local BLAST database with BLAST+ v.13 makeblastdb (Camacho et al. 2009). This procedure is invoked once per modification type, namely the first time this specific modification type is called by the application.
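The "build once, on first use" behaviour described above might be sketched as follows. The file naming, the marker file and the function names are hypothetical; only the `makeblastdb` options `-in`, `-dbtype` and `-out` are standard BLAST+ options. The download from Zenodo is left out, so the multi-FASTA file is assumed to be present already.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def makeblastdb_command(fasta: Path, db_prefix: Path) -> list[str]:
    """Build the makeblastdb call for a protein database."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "prot", "-out", str(db_prefix)]


def ensure_blast_db(modification: str, data_dir: Path, runner=subprocess.run) -> Path:
    """Create the BLAST database for one modification type unless it already exists.

    `runner` defaults to subprocess.run and can be swapped out for testing.
    """
    fasta = data_dir / f"{modification}.fasta"       # assumed file naming
    db_prefix = data_dir / modification
    marker = data_dir / f"{modification}.db_ready"   # hypothetical marker file
    if marker.exists():
        return db_prefix                             # built during an earlier call
    if not fasta.exists():
        raise FileNotFoundError(f"missing multi-FASTA file: {fasta}")
    runner(makeblastdb_command(fasta, db_prefix), check=True)
    marker.touch()                                   # record the successful build
    return db_prefix
```

A marker file is used here instead of checking for the database files themselves, since the exact set of files written by `makeblastdb` depends on the BLAST+ version and database size.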

Assessing validity of user-provided arguments
Before actually performing a search for known PTM sites, user-provided arguments are parsed and checked for validity. Most relevantly, FLAMS checks (i) if the provided FASTA file is recognized as such by SeqIO, (ii) if the specified modification position is within the range of the protein size, and (iii) if the specified modification position points to an amino acid that is capable of carrying the given modification type(s). If users provide a UniProt identifier instead of a local FASTA file, FLAMS downloads the corresponding FASTA file through UniProt's API (The UniProt Consortium 2023).
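Checks (ii) and (iii) above can be illustrated with a short sketch. The residue table below is a deliberately incomplete toy mapping introduced for this example (real modification types can occur on more residues than listed), and `validate_query` is a hypothetical name, not a FLAMS function.

```python
from __future__ import annotations

# Toy, incomplete mapping of modification type to residues that can carry it.
ALLOWED_RESIDUES = {
    "acetylation": {"K"},
    "phosphorylation": {"S", "T", "Y"},
}


def validate_query(sequence: str, position: int, modifications: list[str],
                   allowed: dict[str, set[str]] = ALLOWED_RESIDUES) -> list[str]:
    """Return a list of problems with the query; an empty list means it is valid."""
    # Check (ii): the 1-based position must fall inside the protein.
    if not 1 <= position <= len(sequence):
        return [f"position {position} is outside the protein (length {len(sequence)})"]
    residue = sequence[position - 1]
    errors = []
    # Check (iii): the residue must be able to carry every requested modification.
    for modification in modifications:
        if modification not in allowed:
            errors.append(f"unknown modification type: {modification}")
        elif residue not in allowed[modification]:
            errors.append(f"{residue}{position} cannot carry {modification}")
    return errors
```

Collecting all problems in a list, rather than stopping at the first one, lets a caller report every invalid modification type in a single pass.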

Detecting conserved protein modifications
If all arguments are valid, FLAMS performs a BLASTp search against the local BLAST database(s) containing the PTM data for the specified modification type(s). High scoring pairs (HSPs) are filtered in three stages. First, only HSPs with an E-value passing the user-specified E-value threshold are retained. Second, only HSPs containing a modified amino acid in both the aligned query and target sequence are retained. Finally, only HSPs where the queried modification site aligns (within the user-specified range) to a modified amino acid in the target sequence are retained.
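The three filtering stages can be sketched on a simplified HSP representation. The `Hsp` class and the helper names are hypothetical, each target record is assumed to carry a single modified position, the E-value comparison is assumed to be inclusive, and the range is interpreted here as a window of query positions around the queried site; the actual FLAMS implementation may differ on each of these points.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Hsp:
    evalue: float
    query_start: int     # 1-based position of the first aligned query residue
    query_aln: str       # aligned query string, gaps as '-'
    target_start: int    # 1-based position of the first aligned target residue
    target_aln: str      # aligned target string, gaps as '-'
    target_mod_pos: int  # 1-based modified position stored with the target


def column_of(start: int, aln: str, pos: int) -> Optional[int]:
    """Return the 0-based alignment column holding sequence position `pos`."""
    current = start - 1
    for col, char in enumerate(aln):
        if char != "-":
            current += 1
            if current == pos:
                return col
    return None  # position lies outside the aligned region


def position_at(start: int, aln: str, col: int) -> Optional[int]:
    """Return the sequence position found in alignment column `col`, if any."""
    if aln[col] == "-":
        return None
    return start + sum(char != "-" for char in aln[:col])


def keep_hsp(hsp: Hsp, query_pos: int, max_evalue: float, pos_range: int = 0) -> bool:
    # Stage 1: E-value threshold (inclusive comparison is an assumption).
    if hsp.evalue > max_evalue:
        return False
    # Stage 2: both modified residues must lie inside the aligned region.
    q_col = column_of(hsp.query_start, hsp.query_aln, query_pos)
    t_col = column_of(hsp.target_start, hsp.target_aln, hsp.target_mod_pos)
    if q_col is None or t_col is None:
        return False
    # Stage 3: the target's modified residue must align to a query residue
    # within `pos_range` positions of the queried site.
    aligned_query_pos = position_at(hsp.query_start, hsp.query_aln, t_col)
    return aligned_query_pos is not None and abs(aligned_query_pos - query_pos) <= pos_range
```

Working in alignment columns rather than raw positions is what keeps the comparison correct when the alignment contains gaps or starts at different offsets in query and target.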
For each retained HSP, a row is written to the output file in .tsv format. Each row contains information on one conserved PTM, and specifies information on (i) the protein (UniProt identifier and protein name), (ii) the modification (type, position and the sequence surrounding this modification), (iii) the species, (iv) the BLASTp run (E-value, identity and coverage), (v) CPLM hits, if any (CPLM ID, evidence code and evidence links), and (vi) dbPTM hits, if any (evidence code and links).
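A tab-separated output of this shape could be written with Python's standard `csv` module, as in the sketch below. The column names are hypothetical stand-ins for the six groups of information listed above and are not the actual FLAMS column headers.

```python
from __future__ import annotations

import csv
from typing import Iterable, TextIO

# Hypothetical column names, one or more per information group in the text.
COLUMNS = (
    "uniprot_id", "protein_name",                    # (i) protein
    "modification", "position", "site_window",       # (ii) modification
    "species",                                       # (iii) species
    "evalue", "identity", "coverage",                # (iv) BLASTp run
    "cplm_id", "cplm_evidence_code", "cplm_links",   # (v) CPLM hits, if any
    "dbptm_evidence_code", "dbptm_links",            # (vi) dbPTM hits, if any
)


def write_hits(rows: Iterable[dict], handle: TextIO) -> None:
    """Write one row per retained HSP; columns missing from a row are left empty."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

Leaving absent CPLM or dbPTM fields empty keeps every row the same width, which simplifies loading the file into downstream tools.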

Batch mode
In batch mode, FLAMS first performs additional checks on the batch input file, verifying the provided UniProt identifiers and positions. Then, FLAMS is run iteratively (going through described stages 1-3), taking the UniProt identifiers and positions specified in the batch file as input and creating one output file per line in the batch file. Any additional specified command line options will be applied to all runs in the batch.
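The batch logic described above might look roughly like the following sketch. It assumes a tab-separated batch file with one UniProt identifier and one position per line, and an output naming scheme of `<identifier>_<position>.tsv`; both are assumptions for illustration. The `search` argument is a hypothetical callable standing in for a single FLAMS run, and the regular expression follows the accession pattern published in the UniProt documentation.

```python
from __future__ import annotations

import re
from typing import Callable, Iterable

# UniProt accession pattern as documented by UniProt.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def parse_batch(lines: Iterable[str]) -> list[tuple[str, int]]:
    """Validate every line of the batch file before any search is started."""
    jobs = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 'identifier<TAB>position'")
        accession, position = parts
        if not UNIPROT_RE.fullmatch(accession):
            raise ValueError(f"line {number}: invalid UniProt identifier {accession!r}")
        if not position.isdigit() or int(position) < 1:
            raise ValueError(f"line {number}: invalid position {position!r}")
        jobs.append((accession, int(position)))
    return jobs


def run_batch(jobs: list[tuple[str, int]], search: Callable[..., None], **options) -> list[str]:
    """Run one search per job, passing the shared options to every run."""
    outputs = []
    for accession, position in jobs:
        output_path = f"{accession}_{position}.tsv"  # assumed naming scheme
        search(accession, position, output_path, **options)
        outputs.append(output_path)
    return outputs
```

Validating the whole file up front means a malformed line is reported before any BLAST search has been run, rather than partway through the batch.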

Web tool
A modified version of the command line tool is made available as a web interface. The web interface is created from the raw Python3 scripts with Flask v.2.2.2. A Docker image, combining the Flask application with Gunicorn v.20.1.0, is created, and hosted on the KU Leuven hosting platform Elsschot.
The goal of FLAMS is to provide users a straightforward way to assess whether their modification sites have been reported previously, either exactly as found, or in similar proteins (as examined with a BLASTp search against dbPTM and CPLM). To showcase how this can be done, we examined whether the TatA acetylation on K66 in Dehalococcoides mccartyi strain CBDB1, as described by Greiner-Haas et al. (2021), had been previously detected. Using the command 'FLAMS --id A0A916NWA0 -p 66 -m acetylation -o tatA.tsv', we can find that this exact acetylation site has been previously reported in Escherichia coli (Weinert et al. 2013). In a more extensive example use case, we assessed all 152 modifications of the acylproteome of Syntrophus aciditrophicus reported by Muroski et al. (2022) for their novelty, and found that it contains 72 modifications not yet present in the CPLM and dbPTM database and 80 that were previously reported. Data preprocessing and the FLAMS utilities show that, while most of these are conserved in only a few other species, some sites are highly conserved. Details and more examples can be found in the iPython notebook (Supplementary Material).
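A novelty assessment of this kind amounts to checking, for each queried site, whether the search returned any hits and in how many species. The sketch below shows one way to summarize such results; the input structure, the `species` key and the function name are assumptions for illustration, and the accompanying notebook may proceed differently.

```python
from __future__ import annotations


def summarize_novelty(hits_per_site: dict[str, list[dict]]) -> dict:
    """Split queried sites into novel and previously reported ones.

    `hits_per_site` maps a site label to the rows found for that site;
    an empty list means no matching site was found in the databases.
    """
    novel = sorted(site for site, hits in hits_per_site.items() if not hits)
    # For reported sites, count the distinct species in which a match was found.
    species_per_site = {
        site: len({hit["species"] for hit in hits})
        for site, hits in hits_per_site.items() if hits
    }
    return {
        "novel": novel,
        "reported": sorted(species_per_site),
        "species_per_site": species_per_site,
    }
```

Sites with a high species count in such a summary are candidates for highly conserved modifications, while the novel list highlights sites absent from both databases.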

Discussion
Today, PTM studies frequently report a large number of identified sites, often varying between a hundred and over a thousand sites, depending on the modification type, equipment, protocols, etc. Typically, the number of identified sites vastly eclipses the number of sites that can be investigated experimentally, and researchers attempt to explain the majority of the identified sites by (i) protein set enrichment analysis and (ii) comparison to other PTM studies. However, these comparisons are usually limited to a small set of other modification datasets, carefully selected by the authors. For example, for acetylation, one of the most studied lysine modification types, cross-study comparisons are often limited to fewer than 10 datasets, as authors have to compare the different datasets manually following a BLAST between the different acetylomes (e.g. Chen et al. 2017, Pang et al. 2020, Sun et al. 2018). As a result, some authors report only protein-level conservation of acetylation, instead of a more informative and precise residue position-based conservation (e.g. Chen et al. 2017, Meng et al. 2016).
With FLAMS, it becomes straightforward to carry out the task of identifying previously reported modification sites stored in dbPTM and CPLM. Briefly, the example cases (Supplementary Material) showed that FLAMS can be used (i) to quickly verify whether modifications in a specific protein have been reported previously, (ii) to assess whether findings in one species might translate to other species, and (iii) to systematically assess the novelty and conservation of reported modification sites. The key to FLAMS' success is the clever use of FASTA headers, where all relevant search information is concisely stored. This innovative approach may also help other future projects where residue information is needed.
To conclude, FLAMS facilitates the comparison of new PTM datasets to currently known ones, by allowing a position-based search against the contents of the entire CPLM database and the experimentally supported subset of dbPTM. As such, it automates the oftentimes time-consuming manual comparisons and eliminates potential selection bias, stemming from the limited number of datasets used in these comparisons. Due to FLAMS' implementation as a Python3 command line tool, FLAMS can readily be integrated into larger analysis pipelines, which will likely become increasingly important in the field of PTMs, where the amount of data is increasing faster than the functional interpretation thereof. Its intuitive web interface also guarantees smooth access to the tool for the many experimentalists in the field.