Research paper
BlockLogo: Visualization of peptide and sequence motif conservation

https://doi.org/10.1016/j.jim.2013.08.014

Highlights

  • We developed a tool for visualization of linear and non-linear immunological motifs.

  • Utility is demonstrated for neutralizing influenza B cell epitopes.

  • Utility is demonstrated for allergenic and hypoallergenic Bet v 1 allergens.

  • Utility is demonstrated for variability of HLA-DRB1 binding pocket P1.

  • The BlockLogo tool is available at http://research4.dfci.harvard.edu/cvc/blocklogo/.

Abstract

BlockLogo is a web-server application for the visualization of protein and nucleotide fragments, continuous protein sequence motifs, and discontinuous sequence motifs using block entropy calculated from multiple sequence alignments. The user input consists of a multiple sequence alignment, a selection of motif positions, the type of sequence, and the output format. The output comprises a BlockLogo, the corresponding sequence logo, and a table of motif frequencies. We deployed BlockLogo as an online application and demonstrated its utility through examples that show visualization of T-cell epitopes and B-cell epitopes (both continuous and discontinuous). An additional example shows the visualization and analysis of structural motifs that determine the specificity of peptide binding to HLA-DR molecules. The BlockLogo server also employs selected experimentally validated prediction algorithms to enable on-the-fly prediction of MHC binding affinity to 15 common HLA class I and class II alleles as well as visual analysis of discontinuous epitopes from multiple sequence alignments. It enables the visualization and analysis of structural and functional motifs that are usually described as regular expressions, and it provides a compact view of discontinuous motifs composed of distant positions within biological sequences. BlockLogo is available at http://research4.dfci.harvard.edu/cvc/blocklogo/ and http://met-hilab.bu.edu/blocklogo/.

Introduction

Sequence logos are useful tools for the visual display of conservation and variability in a multiple sequence alignment (MSA) of DNA, RNA, or protein sequences (Schneider and Stephens, 1990). The nucleotides or residues at each position in an MSA are displayed as a stack of characters, where the height of each character corresponds to its frequency relative to the frequencies of all the characters in that position, and the height of the stack is determined by the total information content (Shannon, 1948). Sequence logos aid the interpretation of sequence data by visualizing conserved motifs that represent various functional or structural properties. Examples of motifs that have been visually analyzed using sequence logos are: transcription factors (Wade et al., 2004), enzyme DNA sequences (Goll and Bestor, 2005), proteolytic cleavage sites (Mahrus et al., 2008), T-cell epitopes (Bryson et al., 2009, Olsen et al., 2011), and targets of neutralizing antibodies in HIV (Sun et al., 2008), among others. In a standard sequence logo, the characters in each stack are sorted by frequency, with the most frequent residue at the top of the stack and the least frequent at the bottom. The height of each logo element represents its log-transformed frequency, displayed in bits of information. Logos often do not display low-frequency motifs because their heights are below useful resolution.
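The per-position calculation behind a standard sequence logo can be sketched as follows. This is a minimal illustration, assuming a 20-letter protein alphabet and omitting the small-sample correction that logo generators commonly apply; the function name `column_information` and the toy alignment are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Return (information content in bits, {residue: letter height in bits})."""
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {res: n / total for res, n in counts.items()}
    # Shannon entropy of the observed residue distribution in this column
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    # Stack height: maximum possible entropy minus observed entropy
    # (assumption: no small-sample correction is applied here)
    info = math.log2(alphabet_size) - entropy
    # Each letter occupies a share of the stack proportional to its frequency
    heights = {res: p * info for res, p in freqs.items()}
    return info, heights

# Hypothetical toy alignment, not data from the paper
toy_msa = ["GLFGA", "GLFGA", "GIFGA", "GLYGS"]
for pos, column in enumerate(zip(*toy_msa), start=1):
    info, heights = column_information(column)
    print(pos, round(info, 2), {r: round(h, 2) for r, h in heights.items()})
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while rare residues receive letter heights that quickly fall below a legible size, which is the resolution problem described above.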

The most popular sequence logo web server is WebLogo (Crooks et al., 2004). It enables users to generate standard sequence logos for DNA, RNA, and protein sequences. In addition to WebLogo, several specialized logo generators have been developed to visualize specific motifs or functional sequence units that are not apparent from standard sequence logos. Examples of extensions to the basic sequence logo are: RNA structure logo (Gorodkin et al., 1997), which combines the standard sequence logo with information about base pairing and mutual information of base pairs; enoLOGOS (Workman et al., 2005), which displays energy measurements, probability matrices, and alignment matrices in addition to the standard sequence logo; two-sample logo (Vacic et al., 2006), which displays comparative sequence logos for two sets of MSA; CorreLogo (Bindewald et al., 2006), which calculates mutual information of nucleotides in different positions to determine correlation and potential base pairing; Phylo-mLogo (Shih et al., 2007), which creates sequence logos for the comparison of phylogenetically distinct clades within an MSA of DNA sequences; BLogo (Li et al., 2008), which displays a sequence logo with statistically significant bias of individual positions; RNAlogo (Chang et al., 2008), which extends the RNA structure logo with a graphical representation of secondary structure; PoreLogo (Oliva et al., 2009), which uses sequence logos and 3D protein structures to visualize motifs of channels in transmembrane proteins; iceLogo (Colaert et al., 2009), which provides a probability-based visualization by allowing users to define reference sequences of the sample's origin; Seq2Logo (Thomsen and Nielsen, 2012), which visualizes amino acid sequence profiles in terms of amino acid enrichment and depletion; RIlogo (Menzel et al., 2012), which visualizes RNA–RNA interactions; and CodonLogo (Sharma et al., 2012), which enables visualization of conserved codon patterns.

The BlockLogo web server (Fig. 1) enables visualization of continuous and discontinuous immune epitopes and various sequence motifs. To our knowledge, it is the first logo web server that specifically enables visualization and analysis of immunologically relevant motifs.

WebLogo can be used to visualize immunological motifs such as immune epitopes. A main limitation of the standard sequence logo for this type of application, however, is that it carries no information about the relationship between the residues in the logo and instead treats each position as independent. Such logos often have limited interpretability. For example, the sequence logo of influenza A HA peptide 232–241 (Fig. 2A) shows variability that is consistent with as many as 3072 different peptides (4 × 1 × 1 × 4 × 3 × 2 × 4 × 2 × 2 × 2, corresponding to the number of different residues in each position). The BlockLogo presented in Fig. 2B and Table 1 shows, at a glance, that the vast majority of actual sequence diversity is produced by only five peptides that can be read directly from the BlockLogo. The actual number of different peptides that produced the sequence logo displayed in Fig. 2A is seven, as shown in Table 1. The peptides visible in this BlockLogo have frequencies > 6%, while each of the two peptides not readable from the BlockLogo has a frequency of < 1%. Sequence logos can be useful for visualizing the variability of individual anchor positions of MHC binding peptides; however, since many motifs, such as T-cell epitopes, are recognized as linear peptides rather than as individual residues, they should be visualized as continuous sequence blocks or fragments. Typical MHC class I T-cell epitopes are between 8 and 11 amino acids long. MHC class II epitopes can be longer than 30 amino acids, but they bind MHC through a binding core that is nine amino acids long (Reinherz et al., 1999). The input to the BlockLogo web server tool is an MSA of nucleotides, of short peptides of equal length, or of a user-defined subset of positions (here termed a “block”) within an MSA of longer protein sequences. The user-defined positions from within an MSA (i.e. positions derived from the continuous or discontinuous motifs) define the blocks.
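The gap between the diversity implied by a sequence logo and the diversity actually present can be checked with a few lines of code. The sketch below first reproduces the 3072 figure from the per-position residue counts quoted above, then compares implied and observed peptide counts on a hypothetical toy set; the helper name `implied_and_observed` and the toy peptides are illustrative assumptions, not data from the paper.

```python
from math import prod

# Number of distinct residues at each position of HA 232-241, as listed in the text
residues_per_position = [4, 1, 1, 4, 3, 2, 4, 2, 2, 2]
print(prod(residues_per_position))  # 3072

def implied_and_observed(peptides):
    """Return (peptides implied by per-position variability, distinct peptides observed)."""
    # Treating positions as independent multiplies the residue counts per column
    implied = prod(len(set(column)) for column in zip(*peptides))
    observed = len(set(peptides))
    return implied, observed

# Hypothetical toy peptides of equal length
toy_peptides = ["ACDE", "ACDE", "GCDF", "ACHF"]
print(implied_and_observed(toy_peptides))  # (8, 3)
```

In the toy case the per-position view suggests eight possible peptides although only three occur, mirroring on a small scale the 3072-versus-seven contrast described for the HA peptide.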
The information content (Shannon entropy) and relative frequency of each block are calculated, and the sequences are printed in the BlockLogo, stacked according to frequency from the most frequent at the bottom to the least frequent at the top of the stack. An extension of BlockLogo enables the prediction of the binding affinity of identified peptides for a selection of common HLA molecules using the netMHC prediction algorithms (Lundegaard et al., 2011, Nielsen et al., 2007) that have been experimentally validated for accuracy.
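The frequency ranking that determines the stacking order can be sketched as below. This is an illustration of the idea rather than the server's implementation; the function name `block_stack` and the toy blocks are hypothetical.

```python
from collections import Counter

def block_stack(blocks):
    """Return [(motif, relative frequency)] ordered from the bottom to the top of the stack."""
    counts = Counter(blocks)
    total = sum(counts.values())
    # most_common() sorts by descending count, so the first entry is the bottom of the stack
    return [(motif, n / total) for motif, n in counts.most_common()]

# Hypothetical toy blocks, one per sequence in an alignment
toy_blocks = ["KLN"] * 6 + ["KLD"] * 3 + ["RLN"]
for motif, freq in block_stack(toy_blocks):
    print(f"{motif}\t{freq:.0%}")
```

The same list doubles as the table of motif frequencies, since each whole motif is counted as a single unit rather than position by position.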

Section snippets

Variability and conservation metrics

Calculation of information content of individual positions in an MSA of homologous protein sequences is based on Shannon entropy (Shannon, 1948). Similarly, Shannon entropy can be calculated for each motif within a defined block. Each block contains W unique motifs of length l in a dataset of N sequences. The formula used for the calculation of block entropy is (Olsen et al., 2011):

$$H(B_x) = -\sum_{w=1}^{W} P_{w,x} \log_2 P_{w,x}$$

where H(B_x) is the total entropy of a block of motifs starting at position x, and w is a
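As a worked illustration of block entropy, the sketch below treats each whole motif as one symbol and applies the Shannon formula with its conventional negative sign. It assumes that the probability of a motif is its relative frequency among the N sequences; the function name `block_entropy` and the toy motifs are hypothetical.

```python
import math
from collections import Counter

def block_entropy(motifs):
    """Shannon entropy in bits over whole motifs, one motif per sequence."""
    counts = Counter(motifs)
    n = sum(counts.values())
    # Assumption: P(w) is estimated as count(w) / N for each unique motif w
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy block: N = 4 sequences, W = 3 unique motifs
toy_motifs = ["AAA", "AAA", "AAB", "ABB"]
print(block_entropy(toy_motifs))  # 1.5
```

With frequencies of 0.5, 0.25, and 0.25 the entropy is 1.5 bits. A fully conserved block gives 0 bits, and W equally frequent motifs give the maximum of log2(W) bits.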

User interface

The user is prompted to copy/paste an MSA, or upload a file containing an MSA, in standard FASTA or ClustalW formats. Users can select a block from the MSA by specifying the start and end positions of the subset, or a series of individual positions corresponding to the positions of a discontinuous motif. The motifs that have a gap in any of the positions within the specified range will be excluded by default. In the analysis of discontinuous motifs, the sequences with gaps in specified
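The block selection described in this section can be sketched as follows for FASTA input. This is a minimal illustration, assuming 1-based alignment columns, "-" as the gap character, and exclusion of gapped motifs for both continuous and discontinuous selections; the function names `read_fasta` and `extract_blocks` and the toy alignment are hypothetical and do not represent the server's code.

```python
def read_fasta(text):
    """Parse aligned FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def extract_blocks(records, positions, skip_gapped=True):
    """Collect the residues at the given 1-based columns (contiguous or not)."""
    blocks = []
    for _, seq in records:
        block = "".join(seq[p - 1] for p in positions)
        # Assumption: motifs with a gap in any selected column are dropped
        if skip_gapped and "-" in block:
            continue
        blocks.append(block)
    return blocks

# Hypothetical toy alignment
toy_fasta = """>seq1
MKT-LLV
>seq2
MKTALLV
>seq3
MRTALIV
"""
records = read_fasta(toy_fasta)
print(extract_blocks(records, range(2, 5)))  # continuous block, columns 2-4
print(extract_blocks(records, [2, 6]))       # discontinuous block, columns 2 and 6
```

Here the continuous selection drops seq1 because column 4 is a gap, whereas the discontinuous selection keeps all three sequences because neither chosen column is gapped.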

Conservation of influenza A T-cell epitopes

To illustrate the utility of BlockLogo, we analyzed a block of peptides in 29,113 influenza virus HA protein sequences, containing approximately 36.1 bits of information. All peptides in the block of 10-mers starting at position 232 were predicted to bind to HLA-A*02:01 with similar affinities. The relative frequencies of individual peptides within the viral population cannot be determined from the standard sequence logo produced with WebLogo (Fig. 2A), but are clear from the BlockLogo (Fig. 2

Conclusion and discussion

BlockLogo is a novel sequence logo tool optimized for the visualization of user-defined continuous and discontinuous motifs, fragments, and peptides. Paired with the prediction of HLA binding, BlockLogo is a useful tool for the rapid assessment of the immunological potential of selected regions within an MSA, such as those containing human pathogen sequences or tumor antigen alignments. The BlockLogo tool provides an easily interpretable visual representation of the immunological status and

Funding

LRO was funded by the Novo Nordisk Foundation; UJK was funded by the Oticon Foundation; C. Simon was funded by the Novo Scholarship Programme; and GLZ, JS, VB, and ELR acknowledge funding from NIH grant U01 AI 90043.

References (35)

  • N. Colaert et al. Improved visualization of protein consensus sequences by iceLogo. Nat. Methods (2009)

  • G.E. Crooks et al. WebLogo: a sequence logo generator. Genome Res. (2004)

  • M.G. Goll et al. Eukaryotic cytosine methyltransferases. Annu. Rev. Biochem. (2005)

  • J. Gorodkin et al. Displaying the information contents of structural RNA alignments: the structure logos. Comput. Appl. Biosci. (1997)

  • K. Katoh et al. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability (2013)

  • W. Li et al. BLogo: a tool for visualization of bias in biological sequences. Bioinformatics (2008)

  • H.H. Lin et al. Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. (2008)