Introduction

Protein functions and interactions are facilitated by amino acid (aa) sequences, so-called functional motifs, or domains, which participate in various processes, including protein interactions, trafficking, pre- or post-translational regulation, or recruiting enzyme [15]. They are either short linear motifs (SLiM), 3–11 residues (e.g., RGD), or long domain, >30 residues (e.g., Zinc finger, ankyrin or tetratricopeptide repeats (TPR)). Motifs may contain repeated residue(s) or region(s) (e.g., L-, SR-, AR- or PEST-rich motifs). Number of databases were established to catalogue these motifs, including PROSITE, ELM, and Minimotif Miner (MnM) databases [68]. MnM, MEME Suite, QSLiMFinder, SLiMSearch, 3of5, MotifHound, and DoReMi tools can be used to predict motif(s), pattern(s), or shared consensus within input sequence(s) [914]. Another approach uses hidden Markov model (phylo-HMM) to search for evolutionarily conserved functional motifs [15]. These tools were previously reviewed in [13, 16]. Briefly, they offer arena for searching and parsing de novo or pre-defined motifs. They may require sequence alignment, uploading background sequences, or connection to third-party tools or databases. Statistics, based on background sequences to overcome false-positive results, were provided. On the other hand, for finding sequences enriched with residues, EMBOSS provides a tool for finding PEST-rich motif within a query sequence (http://emboss.sourceforge.net/), whereas LCR-eXXXplorer is developed to visualize low-complexity regions (LCRs) [17].

Shetti-Motif was developed to help experimental biologists to mine for multiple (pre-defined or experimentally validated) motifs, consensus patterns, or motifs enriched with residues within a large dataset of protein sequences (e.g., entire proteome). The tool is interactive, versatile, and user-friendly, Fig. 1. It visualizes UniProt and PROSITE flat files and maps them in a human-readable table.

Fig. 1
figure 1

Screenshot of Shetti-Motif main window (a), and flowchart of features and method used in this study (b)

Method

Shetti-Motif is standalone and portable program, which is developed in C#.NET. The tool is free for academic uses. The main purpose of the tool is to mine for data within large dataset of sequences, and present them in a human-readable table. The input file is FASTA sequences, UniProt or PROSITE flat files, which are publically available in the databases. All the sequences were downloaded from UniProt, GeneBank and PROSITE (prosite.expasy.org/) websites during October 2015. Three modules were implemented in Shetti-Motif tool.

The first module is searching for x-rich motifs (i.e., motifs enriched with a residue(s), where x is any residue, e.g., Leu-, SR- or PEST-rich motifs) in multiple sequences (entire proteome). Coverage of the residue(s) within motif is the criterion to select the motif. The default coverage value is 30% (e.g., if the length of P-rich motif is 10 aa, P is enriched >3 aa) and can be modified by users. Using sliding window, Shetti-Motif slides over the sequence until residue coverage and motif length thresholds are fulfilled. The tool reports proteins enriched with the input residues, protein length, number of motifs in each protein, motif length, and coverage (number) of residue(s), Figs. S1, S2.

Shetti-Motif has additional interactive feature, which enables searching for one or multiple consensus pattern among multiple protein sequences, Figs. S3–S6 [18]. Shetti-Motif provides two built-in databases; the first obtained from PROSITE database, while the second obtained from literature, which are validated experimentally, Fig. S3, Tables 1, S1–S3. Users may select patterns from the list, or third-party motif/pattern of interest. Notably, the tool accepts PROSITE pattern syntax, Table S1. The tool uses perfect (exact) text-search method, including regular expression, to search for patterns. By this option, large datasets of proteomes can be parsed efficiently. The outputs are presented in a table or exported to text file, Figs S4–S6. Protein names, number of proteins, and enrichment of the proteins to total number of proteins on the dataset are reported.

Table 1 The motif-containing proteins (McPs) profile of poxviruses, table S1–S3

Third module can parse UniProt and PROSITE flat files and convert them to human-readable tables, Figs. S7–S9. Shetti-Motif maps them into one table, which includes PROSITE IDs, patterns, and name of proteins harboring these patterns, Fig. S9. The tables can be copied into clipboard or can be exported into a tabulated text file.

Implementation

Shetti-Motif tool, sample files, and documentation are available on http://sourceforge.net/projects/ShettiMotif/. The tool runs and it was tested on windows 7 or higher, without any preliminary installation. For Mac and Linux, MonoDevelop (http://www.monodevelop.com/) are needed. For details, see program’s user guide.

Case study

As a proof of concept, we analyzed proteomes encoded by eleven members of Poxviridae family (2251 proteins) against experimentally validated built-in motifs (Walker motifs, glycosylation, nuclear localization, SUMO-, ESCRT- and integrin-binding motifs, etc.), Tables 1, S2, S3 [1]. The viruses belong to Chordopoxvirinae (Orthopoxvirus: camelpox, cowpox, monkeypox, vaccinia and variola viruses; Avipoxvirus: canarypox and fowlpox viruses; Crocodylidpoxvirus: Nile crocodilepox virus; Leporipoxvirus: myxoma virus; Parapoxvirus: orf virus; and Yatapoxvirus: Yaba monkey tumor virus). Poxviruses are ubiquitous and infect wide-range of hosts [19]. Therefore, (i) entry, virus-cell interactions, or cellular trafficking mechanisms might not be conserved between species or subfamily members, and (ii) comparative proteomics approach is a potential benchmark to understand these interactions. First, the proteomes were searched for ≈100 query motifs, see Table 1, S2, S3. Then for each virus, the proteins harboring these motifs were counted and normalized to the total number of proteins in the proteome. Finally, the motif-profile table was constructed, Table 1, S3. For statistical analysis, the mean and maximum number of motif-containing proteins, standard deviation, and Spearman correlation coefficient were calculated, see Figs. 1, 2.

Fig. 2
figure 2

Poxviruses encode divergent number of motifs; the motif-containing proteins (McPs) profile of closely related viruses are correlated. The number of motif-containing proteins (i.e., protein containing at least one instance of the query motif) were counted and normalized (percentage) to total number of proteins encoded by a virus, Table 1, S3. a Box and whisker plot shows 1st, 2nd, 3rd quantiles (Q1, Q2 and Q3 respectively) of numbers of McPs, whiskers at 1.5 IQR (interquartile range) (Q3 + 1.5 IQR); b the average and maximum numbers of McPs, the error bars are based on standard deviation values; c Spearman correlation coefficient values and scatterplots (using STASTICA Data Miner; StatSoft, USA) of the number of McPs encoded by each virus. VacV: Vaccinia virus WR, VarV: Variola virus, MPxV: Monkeypox virus, YMTV: Yaba monkey tumor virus, FPxV: Fowlpox virus, CPxV: Canarypox virus, Orf V: Orf virus, CPxV: Cowpox virus, CmPxV: Camelpox virus M-96, MV: Myxoma virus, and NileCV: Nile crocodilepox virus

The results show that (i) the number of protein harboring these motifs significantly differs among poxviruses, Tables 1, S3, Fig. 2. Although a proteome harbors several copies of a motif, another proteome does not harbor any copy of the same motif (e.g.,. NLS motifs). (ii) The closely related viruses show a linear correlation, e.g., vaccinia and variola viruses (infect human cells and phylogenetically related) show similar motif-profile and Spearman correlation ≈0.99, Fig. 2. (iii) Some motifs were not detected in any of poxvirus proteomes (e.g., inhibitor of apoptosis, adenovirus fiber flexibility, and protein cleavage motifs, which characterize other viral families). This suggests that poxviruses encode wide range of proteins and functional motifs for fruitful interactions with wide range of host cells, and evolutionary events play roles to shape their proteome diversity. This explains the ubiquitous nature and ability of poxviruses to interact with wide range of hosts.

Results and discussion

Shetti-Motif has a user-friendly interface in which plain data are visualized as a table, and can be copied to clipboard and transferred into spreadsheet program. The sequences containing the x-rich motifs are exported directly to a FASTA file. Thus, the input and output files can be managed easily by experimental biologists. Shetti-Motif searches for multiple pre-defined motifs/patterns within proteome or large dataset of protein sequences. This functionality does not require to searching public databases, loading a background sequence file, or writing additional scripts. This offers flexible option for biologists to search wide range of protein sequences, which are not indexed in public databases. This issue could be critical when parsing proteome datasets of recently isolated microbiological and metagenomics samples. To the best our knowledge, this whole-proteome mining approach cannot be achieved by similar tools. Shetti-Motif was used to search for ≈100 experimentally validated patterns against poxvirus proteomes. The results show variation in enrichment of motif-containing proteins among the viruses, which support that motifs are correlated with evolutionary events, cellular interaction, or host-specificity.

LCRs are sequence repeats or extension of one or more residue(s), e.g., 6xHis-tag. Despite their functional importance, they are under-represented on publications, reviewed in [1, 17, 2022]. Their crystallization could be difficult; thus, previous efforts attempted to mask them. Another type of motifs, which are enriched with a residue(s) but interrupted by others, e.g., Cys-rich, Gly-rich or KR-rich motifs, reviewed in [1]. Notably, in literature, they are referred as x-rich motif, but not as LCRs. This could be due to the following: (i) they may not be considered as disordered repeats, (ii) may not conform to a known pattern, and (iii) could be structurally important. The difference between LCRs and x-rich motifs can be noticed in some proteins (e.g., Q5UNS9, E3VZK9, Q5UNX5, and Q5UQQ7), see SI-1, SI-2. Q5UNS9 harbors glycosylation sites LCRs, whereas the x-rich regions in the others are not masked by NCBI-BLASTp. For this reason, the criterion for finding x-rich motifs in Shetti-Motif is the coverage of the residue(s) to the total motif length. The x-rich proteins may share common biochemical or molecular interactions, e.g., post-translational modification for non-histone proteins. Therefore, it is beneficial to establish a dataset of proteins rich with particular residues, for investigating (experimentally) their molecular functions.

Short motifs are subjected to evolutionary changes, which could affect cellular processes, interactions, or protein characteristics [13]. Although proteins sharing functional motifs might share similar function, the consensus pattern is not the absolute measure for the protein functions, and other factors could influence the function, reviewed in [1]. Our bioinformatics approach may benefit in predicting tropism and pathogenicity for emerging infectious agents [23, 24], as well as determining potential protein dataset(s) among whole proteome for designing further experiments. Importantly, this approach includes exact text search of experimentally validated motifs, which increase the chances of true-positive results. However, motif-containing proteins may still have different functions from that being expected, which benefits studies on evolution of protein function.

In conclusion, Shetti-Motif has simple, versatile, user-friendly, and interactive features, which are useful for experimental biologists lacking prior knowledge of bioinformatics, such as search for pattern(s) or x-rich motifs in protein sequence(s) or entire proteome without loading background files and user-friendly interface to visualize UniProt and PROSITE flat files as tables.

We applied this pipeline to poxvirus proteomes, and we observed that our pipeline is able to correlate the closely related viruses. The results show that functional motifs are conserved within evolutionary related viruses and/or viruses that share similar molecular interactions. Therefore, we conclude that the pipeline is useful to compare between species; it will help in designing a dataset of candidate proteins for further experimental investigations, either by confirming the function or studying the evolution of protein function.