Comparative functional analysis of proteins containing low-complexity predicted amyloid regions

Background. Both prokaryotic and eukaryotic proteins contain regions in which a single amino acid or a small group of amino acids occurs repeatedly. These regions are termed low complexity regions (LCRs). Amino acid bias in LCRs has been directly linked to their uncontrolled expansion and to amyloid formation. However, a comparative analysis of the behavior of LCRs based on their constituent amino acids, and of the association between composition and amyloidogenic propensity, is not available. Methods. We first grouped all LCRs on the basis of their composition: homopolymers, positively charged amino acids, negatively charged amino acids, polar amino acids and hydrophobic amino acids. We analyzed the compositional pattern of LCRs in each group and their propensity to form amyloids. The functional characteristics of proteins containing different groups of LCRs were explored using DAVID. In addition, we analyzed the classes, pathways and functions of human proteins that form amyloids in LCRs. Results. Among homopolymeric LCRs, Gln repeats were the most common. LCRs composed of repeats of Met and aromatic amino acids were among the least frequent. LCRs composed of negatively charged and polar amino acids were more common than LCRs formed by positively charged and hydrophobic amino acids. Proteins with LCRs were generally involved in transcription, but those with Gly repeats were associated with translational activities. Our analysis suggests that proteins in which the LCR is composed of hydrophobic residues are more prone to amyloid formation. We also found that human proteins with amyloid-forming LCRs were generally involved in binding and catalytic activity. Discussion. The presented analysis summarizes the most common and least frequent LCRs in proteins. Our results show that although Gln repeats are the most abundant, Asn repeats form the longest stretches of low complexity. The results also show that the potential of LCRs to form amyloids varies with their amino acid composition.


INTRODUCTION
Low complexity regions (LCRs) in proteins are composed of repeats of either single amino acids or short amino acid motifs (Wootton & Federhen, 1996). Because they are enriched in one or a few amino acids, LCRs are characterized by low information content. Statistical analysis suggests that up to 25% of the proteome lies within LCRs (Wootton, 1994) and that their abundance is greater than expected by chance (Alba, Tompa & Veitia, 2007). The best-characterized examples of LCRs are single amino acid repeats, also called runs (Harbi, Kumar & Harrison, 2011). Proteins carrying Ala, Lys and Pro repeats are known to play key roles in several important biological processes (BP) such as development, immunity, reproduction and cellular localization (Albrecht et al., 2004; Toro Acevedo et al., 2017). The compositional bias of LCRs makes them prone to expansion or contraction, which ultimately influences the function of the protein in which they occur. For example, in many species length variation of LCRs affects circadian rhythm duration (Avivi et al., 2001) and phenotypic characters (Michael et al., 2007). LCRs are also of medical interest, because uncontrolled expansion of such regions may induce self-aggregation and the formation of amyloid fibrils in eukaryotes (Michelitsch & Weissman, 2000). Amyloids are fibrous protein forms assembled from cross β-sheet structure (Chiti & Dobson, 2006) that show a high degree of protease resistance. Toxic gain of function by amyloid fibrils is known to cause several devastating human diseases, which include, but are not limited to, Type II diabetes, rheumatoid arthritis and several progressive neurodegenerative disorders such as Alzheimer's disease, Parkinson's disease, spinocerebellar ataxias and Huntington's disease (Michelitsch & Weissman, 2000; Kreil & Ouzounis, 2003; Gunawardena & Goldstein, 2005).
Many studies have demonstrated that amyloids commonly form in Gln- and/or Asn-rich domains (Scherzinger et al., 1997; Warrick et al., 1998; Chow, Paulson & Bottomley, 2004). These regions form mutation-linked pathological amyloids as well as functional amyloids induced by a specific stimulus. For example, in TDP-43 (TAR DNA-binding protein), single-residue substitutions in the Gln/Asn-enriched LCR produce irreversible protein aggregates that are involved in amyotrophic lateral sclerosis and frontotemporal lobar dementia (Neumann et al., 2006; Johnson et al., 2009). In patients with Huntington's disease, expansion of polyGln runs in the huntingtin protein is responsible for the formation of intranuclear and cytoplasmic aggregates, which result in the disease (DiFiglia et al., 1997; Huang et al., 1998). In yeast, the LCRs of Cdc19, an isoform of pyruvate kinase, and of the termination proteins Nab3 and Nrd1 assemble to form functional amyloids (O'Rourke et al., 2015; Grignaschi et al., 2018). Moreover, the yeast prions [PUB1] and [SUP35] are known functional amyloids that form a microtubule-associated complex important for translation (Li et al., 2014; Nizhnikov et al., 2016). The peptide GNNQQNY drives aggregation of SUP35 (Garbuzynskiy, Lobanov & Galzitskaya, 2010).
Despite their importance in several pathological conditions, the functional properties of LCR-containing proteins, and the influence of constituent amino acids on the propensity of LCRs to form amyloids, have not been worked out in much detail. LCRs have also generally been excluded from wet-lab structure-function experiments because they are difficult to crystallize, and from in silico functional analysis because they are difficult to align. These exclusions are a major reason why little information is available on the impact of LCRs on protein function.
In this paper, we classified LCRs according to their amino acid composition. LCRs composed of more than one amino acid with the same physico-chemical properties were grouped into four classes: positively charged, negatively charged, polar and hydrophobic. We performed a comparative functional analysis of proteins containing LCRs that consist of a single amino acid or of amino acids with similar physico-chemical properties. Additionally, for all LCR groups, we predicted the propensity for amyloid formation and analyzed their compositional patterns and optimum size. We also predicted amyloid formation in LCRs of the human proteome and annotated the pathways, classes and functions in which the corresponding proteins participate. Overall, the results of this study should improve our understanding of the role of LCRs in triggering functional variation and amyloid formation.

MATERIALS AND METHODS
Dataset compilation
For this study, we downloaded 553,231 protein sequences from SwissProt. To remove redundancy among the proteins in the dataset, we clustered them with CD-HIT (Li & Godzik, 2006) at a 40% pairwise sequence identity threshold. Proteins containing non-standard amino acid residues were also removed. The final dataset contained 85,381 protein sequences.
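The removal of proteins with non-standard residues can be illustrated with a minimal sketch. The function names and the toy sequences below are hypothetical and are not taken from the study; the redundancy-reduction step with CD-HIT is not reproduced here.

```python
# Illustrative sketch only: names and example sequences are hypothetical.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def has_only_standard_residues(sequence):
    """Return True if a non-empty sequence uses only the 20 standard residues."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_RESIDUES


def filter_standard(records):
    """Keep entries of a {name: sequence} mapping that pass the residue check."""
    return {
        name: seq
        for name, seq in records.items()
        if has_only_standard_residues(seq)
    }


toy_records = {
    "p1": "MKTAYIAK",  # standard residues only, kept
    "p2": "MKXAB",     # contains X and B, removed
    "p3": "MSEUQ",     # contains U (selenocysteine), removed
}
kept = filter_standard(toy_records)
```

In this sketch ambiguity codes such as X, B and Z and the rare residues U and O are all treated as non-standard, which is an assumption about how the filter was applied.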

Extraction and classification of LCRs
We next identified LCRs in the protein sequences using SEG, which uses Shannon's entropy to search for regions of low complexity in a protein sequence. During LCR identification, SEG collects all possible subsequences of length L that have local sequence complexity K1. Overlapping subsequences with complexity K1 are merged in both directions for as long as the complexity of the contig created from them lies below K2 (Kumari, Kumar & Kumar, 2015). In this work we used the default values of SEG (L = 12, K1 = 2.2, K2 = 2.5). Among all LCRs, we kept only those with at least three amino acids. The final dataset had 186,637 LCRs obtained from 59,821 non-redundant proteins. The complete LCR dataset was divided into two sets: Set-I, which contained LCRs composed of runs of a single amino acid (Harbi, Kumar & Harrison, 2011), and Set-II, which contained LCRs composed of more than one type of amino acid with similar physico-chemical properties. Set-I thus has 20 subsets of LCRs, each corresponding to a distinct amino acid. Depending on the functional groups of the amino acids, Set-II was further divided into positively charged (Arg, Lys), negatively charged (Glu, Asp), polar (Arg, Lys, Asn, Gln, Asp, Glu) and hydrophobic (Cys, Ile, Leu, Met, Phe, Trp, Val) LCRs.
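The two ideas in this paragraph, entropy-based detection of low-complexity windows and composition-based grouping, can be sketched as follows. This is a simplified illustration and not SEG itself: it only computes Shannon entropy (in bits) over fixed windows and flags those at or below K1, without SEG's merging or refinement steps. The function names, the order in which the Set-II classes are tested (positive and negative before polar, since those residues also appear in the polar list) and the returned labels are assumptions made for the example.

```python
from collections import Counter
from math import log2

# Residue groups as listed in the text (one-letter codes).
POSITIVE = set("RK")
NEGATIVE = set("DE")
POLAR = set("RKNQDE")
HYDROPHOBIC = set("CILMFWV")


def shannon_entropy(window):
    """Shannon entropy, in bits, of the residue composition of a window."""
    n = len(window)
    return sum((c / n) * log2(n / c) for c in Counter(window).values())


def low_complexity_windows(seq, L=12, K1=2.2):
    """0-based start positions of length-L windows with entropy <= K1."""
    return [
        i for i in range(len(seq) - L + 1)
        if shannon_entropy(seq[i:i + L]) <= K1
    ]


def classify_lcr(lcr):
    """Assign an LCR to Set-I or a Set-II class; None if it fits neither."""
    residues = set(lcr.upper())
    if not residues:
        return None
    if len(residues) == 1:
        return "Set-I:" + next(iter(residues))
    # Assumed precedence: the more specific charged classes are tested first.
    if residues <= POSITIVE:
        return "Set-II:positive"
    if residues <= NEGATIVE:
        return "Set-II:negative"
    if residues <= POLAR:
        return "Set-II:polar"
    if residues <= HYDROPHOBIC:
        return "Set-II:hydrophobic"
    return None
```

For instance, `classify_lcr("QQQQ")` yields `"Set-I:Q"` and `classify_lcr("NQDE")` yields `"Set-II:polar"`, while a mixed stretch such as `"AGST"` falls outside both sets.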

Functional enrichment
To gain insight into their functions, we performed GO-term enrichment analyses on all proteins containing LCRs. This analysis was done using DAVID (Huang Da, Sherman & Lempicki, 2009), which can handle a number of heterogeneous annotation terms (e.g., GO terms, protein domains and pathways) or gene classes and thus helps in visualizing the larger biological picture. For the functional analysis we used the complete set of SwissProt proteins as background.
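The logic of term enrichment against a background can be illustrated with a plain hypergeometric upper-tail test. This is a sketch of the general idea only: DAVID reports its own modified Fisher exact statistic (the EASE score), which is not reproduced here, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: proteins in the background
    K: background proteins annotated with the GO term
    n: proteins in the query list (e.g., one LCR subset)
    k: query proteins annotated with the GO term
    """
    upper = min(n, K)
    # Count the ways of drawing at least k annotated proteins, then normalize.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


# Toy numbers: 10 background proteins, 5 annotated, 5 in the query list,
# all 5 of them annotated.
p = enrichment_p_value(k=5, n=5, K=5, N=10)
```

A small p-value indicates that the term occurs in the query list more often than expected from its frequency in the background.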

Data I
In protein sequences containing LCRs, Waltz (Maurer-Stroh et al., 2010) was used with default parameters to find potential amyloid-forming regions. Waltz can efficiently recognize local amyloid propensity and differentiate amyloids from "amorphous" aggregates, "proto fibrils," or mixtures of these. To find LCRs that may form aggregates, LCRs were mapped onto the Waltz predictions, and regions common to both were considered amyloidogenic LCRs. Using this approach, amyloidogenic regions were retrieved from the LCR sets. To achieve high reliability, only amyloid regions with at least three amino acids were considered for analysis. This set of predicted amyloidogenic LCRs is referred to as Data I.
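The mapping of LCRs onto predicted amyloid regions amounts to an interval intersection with a minimum-length filter. The sketch below assumes both tools' outputs have already been parsed into 1-based, inclusive (start, end) coordinates on the same protein; the function name and the example coordinates are hypothetical.

```python
def amyloidogenic_lcrs(lcrs, amyloid_regions, min_len=3):
    """Return segments shared by an LCR and a predicted amyloid region.

    Both inputs are lists of (start, end) tuples, 1-based and inclusive.
    Only shared segments of at least min_len residues are kept.
    """
    shared = []
    for lcr_start, lcr_end in lcrs:
        for amy_start, amy_end in amyloid_regions:
            start = max(lcr_start, amy_start)
            end = min(lcr_end, amy_end)
            if end - start + 1 >= min_len:
                shared.append((start, end))
    return sorted(shared)


# Toy example: two LCRs and three predicted amyloid regions on one protein.
lcrs = [(10, 25), (40, 50)]
amyloids = [(20, 30), (48, 55), (60, 70)]
hits = amyloidogenic_lcrs(lcrs, amyloids)
```

Here the first LCR shares residues 20 to 25 with an amyloid region and the second shares residues 48 to 50, so both are retained; a shared segment of only two residues would be discarded by the length filter.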

Data II
We also collected experimentally annotated amyloid proteins from the AmyPro database (Varadi et al., 2018). AmyPro had information on 174 amyloid regions distributed across 126 protein sequences. We located LCRs in these 126 proteins using SEG and found 76 protein stretches in which LCRs and amyloid regions coincided (hereafter named Data II). In AmyPro, 70 proteins were human. In 31 of these 70 proteins we found overlapping LCR and amyloid regions. This dataset is named Data IIh in this manuscript and was used for functional analysis of human proteins that form amyloids.

Prediction of amyloids in LCRs of human proteome
To study the aggregation tendency of LCRs in the human proteome, we used the human proteome compilation of HPRD (Keshava Prasad et al., 2009). Using the default parameters of SEG, we found LCRs in 23,727 of the 30,046 proteins. Amyloid regions were subsequently predicted in the LCR-containing proteins with Waltz.

Functional annotation of human proteins with amyloids in LCRs
Using Protein Analysis THrough Evolutionary Relationships (PANTHER) (Mi, Muruganujan & Thomas, 2013), we analyzed the classes, pathways and functions of human proteins that were predicted to have amyloidogenic LCRs. PANTHER annotates genes on the basis of evolutionary relationships, which are taken from the Gene Ontology Reference Genome project.

RESULTS
Compositional trends of LCRs: in general and in amyloids
We first identified the prevalent amino acids in LCRs and analyzed whether the amino acid composition of an LCR affects its aggregation tendency. We performed compositional analysis for each LCR set, categorized on the basis of homopolymeric runs, charge and hydrophobicity. The most common homopolymeric runs were polyGln, polyAsn, polySer, polyAla, polyGlu and polyPro, in decreasing order (Table 1). polyLeu, polyLys, polyAsp, polyGly and polyThr were also found in moderate numbers. The least frequent homo-repeats were polyVal (13 in number), polyPhe and polyIle (11 in number), polyTyr (seven in number), polyCys (two in number) and polyMet (one in number). We observed a total absence of polyTrp LCRs in our dataset (Table 1). Compositional trend analysis of Set-II LCRs revealed that the number of positively charged LCRs was approximately one third that of negatively charged LCRs. The results also showed that LCRs composed of polar amino acids outnumbered those composed of hydrophobic amino acids (Table 1). Prediction of amyloids in LCRs suggests that polyAla, polyPhe, polyLeu, polyAsn, polyGln, polar and hydrophobic LCRs have amyloidogenic capability. The majority of the amyloidogenic LCRs were composed of polyLeu, polyAsn and hydrophobic residues. In contrast, polyCys, polyAsp, polyGlu, polyLys, polyGly, polyHis, polyThr and charged LCRs accounted for a very small fraction of amyloidogenic LCRs with fewer than 10 residues. Runs of polyMet, polyTrp, polyPro and polyArg were predicted to lack amyloidogenic capability entirely (Table 1).
Table 1 Distribution of low complexity regions and amyloids predicted in them.
To verify the results obtained by prediction of amyloidogenic LCRs, we repeated the analysis of the effect of amino acid composition on amyloidogenesis in Data II, which contained only experimentally verified amyloidogenic LCRs. The results revealed Gln and Asn as the most abundant residues; Ala, Gly and Ser as moderately abundant; and Cys and Trp as the least preferred (Fig. 1). This observation was in line with our earlier observation on predicted amyloids in LCRs. We also identified two polar LCRs and one hydrophobic LCR in Data II.

Length analysis of amyloids
To investigate whether amyloids vary in size, we also analyzed the length of amyloidogenic LCRs. In Data I, the homopolymeric runs predicted to form amyloids ranged between 3 and 6 AA (Fig. 2). The maximum length of amyloid-forming polyLeu runs was 7 AA and that of polyAsn runs was 4 AA. For hydrophobic LCRs, shorter amyloids of length 3-11 predominated, whereas the length of amyloidogenic polar LCRs was 3-4. Hydrophobic LCRs had the longest stretch (18 AA). Because longer hydrophobic stretches are known to be toxic (Dorsman et al., 2002; Oma et al., 2004), this suggests that the amyloidogenic LCRs may be toxic.
Figure 1 (DOI: 10.7717/peerj.5823/fig-1)
In Data II, which contained only experimentally proven amyloidogenic LCRs, the length of LCRs composed of polar amino acids ranged between 8 and 9, and the hydrophobic LCR had a length of 16. This was in line with our observation for Data I.

Functional enrichment of LCRs
To study the functions of proteins containing amyloidogenic LCRs, we analyzed the abundance of BP, molecular function (MF) and cellular component (CC) terms in each category of LCR using DAVID (https://david.ncifcrf.gov/). For our analysis we considered only the top five enriched GO terms.
The results showed MF enrichment only in proteins having runs of His, Arg, Asp, Glu, Asn, Gln, Ser, Thr, Ala, Gly and Pro. These protein subsets were strongly associated with metal ion binding, transition metal ion binding and nucleotide binding (Fig. 3A and Table S1). Both polyAsn and polyGln were involved in similar functions, that is, DNA binding and transcription regulator activity. Runs of charged amino acids were involved in ion and DNA binding, but LCRs composed of combinations of positively or negatively charged amino acids showed an additional function, that is, protein binding (Fig. 4A and Table S1). Polar LCRs were involved in "DNA binding," "nucleotide binding," "ATP binding," "protein binding" and "metal ion binding." Perfect repeats of hydrophobic amino acids were not enriched in MF, whereas LCRs composed of combinations of hydrophobic amino acids were involved in "calcium ion binding," "hydrolase activity," "receptor activity," "serine-type endopeptidase activity" and "G-protein coupled receptor activity" (Table S1). Under the BP category, "regulation of transcription" was the most common GO term. Interestingly, whereas runs of all amino acids were involved in transcription, the topmost enriched function of proteins with polyGly was translational activity (Fig. 3B and Table S2). We also observed that runs of different amino acids were enriched in unique BP. For example, polyArg was involved in "cell surface receptor linked signal transduction," polyLys in "RNA biosynthetic process," polyAsp in "protein localization," polyGlu in "cellular response to stress," polyGln in "chromosome organization," polySer in "phosphate metabolic process," polyThr in "cell division," "peptidoglycan metabolic process," "glycosaminoglycan metabolic process" and "regulation of cell morphogenesis," polyPro in "cytoskeleton organization" and polyVal in "lipid biosynthetic process" (Fig. 3B and Table S2).
"Regulation of transcription, DNA-dependent" was a common process in proteins having runs of the positively charged amino acids, polyArg and polyLys. Apart from the GO terms "regulation of transcription, DNA-templated" and "transcription, DNA-templated," the BP of LCRs containing combinations of positively charged amino acids and those containing combinations of negatively charged amino acids were completely different. For example, positively charged LCRs were involved in "positive regulation of transcription from RNA polymerase II promoter," whereas negatively charged LCRs were involved in "negative regulation of transcription from RNA polymerase II promoter." Polar LCRs showed enrichment of "transcription, DNA-templated," "phosphorylation" and "transport," but hydrophobic LCRs were enriched for "signal transduction," "proteolysis," "innate immune response" and "transport" (Fig. 4B and Table S2). For CC, the enriched locations were "nuclear lumen," "organelle lumen" and "membrane enclosed lumen" (Figs. 3C and 4C; Table S3). However, polyLeu, polyPro and polyVal showed no lumen-related terms; their enriched terms were related to "plasma membrane."

Functional annotation of amyloidogenic LCRs in human proteins
We also analyzed the broad spectrum of functions of each human protein containing predicted amyloidogenic LCRs in terms of the GO slim functional categories, viz., BP, MF and CC, and functional classes using PANTHER. Under the BP category, the annotation terms "biological regulation" and "response to stimulus" were common to polyAla, polyPhe, polyLeu and hydrophobic LCRs (Figs. S1A-S4A). The GO term "cellular process" was found in proteins with polyAla, polyLeu, polar and hydrophobic LCRs (Figs. S2A-S5A). The processes "localization" and "locomotion" were specific to polyAla, polyLeu and hydrophobic LCRs (Figs. S2A-S4A); "developmental process" was specific to polyAla and hydrophobic LCRs (Figs. S2A and S4A), whereas "immune system process" and "biogenesis" were specific to polyLeu and hydrophobic LCRs (Figs. S3A and S4A). The processes "biological adhesion," "biological regulation" and "reproduction" were exclusive to proteins with hydrophobic LCRs (Fig. S4A). MF analysis in each category of LCRs showed involvement in "catalytic activity" for proteins with polyAla, hydrophobic and polar LCRs (Figs. S2B, S4B and S5B). The MF term "receptor activity" was observed in proteins with polyPhe (Fig. S1B), polyLeu (Fig. S3B) and hydrophobic LCRs (Fig. S4B), while "binding" was enriched in polyAla, polyLeu and hydrophobic LCRs. polyPhe and hydrophobic LCRs were also associated with "signal transducer activity." Two additional MF terms, "structural molecule activity" and "transporter activity," were unique to hydrophobic LCRs.
Furthermore, the CC analysis of human proteins containing amyloidogenic LCRs suggested that polyAla and hydrophobic LCRs were common in "cell part," "macromolecular complex" and "membrane" (Figs. S2C and S4C). The CC GO term "extracellular region" was found in proteins with polyLeu and hydrophobic LCRs (Fig. S3C). For hydrophobic LCRs, we found additional components such as "extracellular matrix" and "organelle" (Fig. S4C).
We could not find GO slim BP terms for proteins containing polyAsn, polyGln and polyVal, most likely because GO slim provides only a broad outline of GO content. We therefore searched PANTHER for specific terms for the proteins in these categories. Proteins with amyloids in polyGln had the GO BP terms "ion transport" and "neutrophil degranulation," and for proteins with polyVal the terms were "G-protein coupled receptor signaling pathway," "neuropeptide signaling pathway," "female pregnancy" and "hormone metabolic process." The MF term for polyGln was "nucleic acid binding," and for polyVal the terms were "G-protein coupled receptor activity," "neuropeptide Y receptor activity," "protein binding" and "neuropeptide receptor activity." The GO CC terms for polyGln were "lysosomal membrane," "microtubule organizing center," "plasma membrane," "integral component of membrane," "specific granule membrane," "intracellular membrane-bounded organelle," "extracellular exosome" and "tertiary granule membrane." Proteins containing amyloids in polyVal were associated with "plasma membrane" and "integral component of plasma membrane." PANTHER showed pathways only for polyAla and hydrophobic LCRs. Proteins of both categories were mainly associated with various signaling pathways (Table 2). The proteins forming amyloids in hydrophobic LCRs were also involved in many other pathways and cascades, such as the plasminogen activating cascade, cadherin signaling pathway, Wnt signaling pathway and Alzheimer disease-presenilin pathway (Table 2).
We found protein class annotations only for the LCRs composed of polyAla, polyLeu, polar and hydrophobic amino acids in the human proteome. Proteins containing amyloids in polyAla were kinases, DNA binding proteins, and forkhead and homeodomain transcription factors (Table 2). Some of the proteins containing polyLeu were Type I cytokine receptors and chemokines. Polar LCRs that formed amyloids in human proteins occurred in non-receptor serine/threonine kinases. For hydrophobic LCRs, proteins belonged to diverse classes such as glycosyltransferase, chemokine, protease, receptor, enzyme modulator, signaling molecule and transporter (Table 2).

DISCUSSION
This study outlines the relationship between the amino acid composition of LCRs and their abundance, function and amyloidogenic properties. We separated LCRs into different sets: LCRs composed of (i) single amino acid repeats, (ii) positively charged amino acids, (iii) negatively charged amino acids, (iv) polar amino acids and (v) hydrophobic amino acids. The number of LCRs varied widely across the different subsets. Among LCRs containing repeats of single amino acids, the most abundant was polyGln, followed by polyAsn (Table 1). A similar observation has been reported earlier (Faux et al., 2005). LCRs composed of polyCys, polyMet and aromatic amino acids were very rare. Although polyGln was the most abundant homopolymeric run, the total number of Asn residues was higher. This may be because polyAsn forms longer LCR stretches. When the abundance of LCRs was analyzed on the basis of their physico-chemical properties, the highest number was observed for LCRs composed of polar amino acids, followed by hydrophobic, negatively charged and positively charged LCRs. We further analyzed the aggregation propensity of the different LCR subgroups. Prediction of amyloids in LCRs showed that polyAla, polyIle, polyLeu, polyPhe, polyAsn, polyGln, polar and hydrophobic LCRs had a high potential to be amyloidogenic, whereas polyAsp, polyGlu, polyGly, polyLys and charged LCRs were the least amyloidogenic. All of these are major LCR groups. We also observed that complete stretches of polyIle and polyPhe were predicted to be involved in amyloid formation. In addition, when we analyzed experimentally validated amyloids, we found that they were also rich in Gln, Asn, Ala and Ser.
The biological functions of LCR-containing proteins were highly diverse and included signal transduction, RNA biosynthetic process, protein localization, cellular response to stress, chromosome organization, cell division, peptidoglycan metabolic process and transport. In addition, we noticed the involvement of proteins with LCRs in transcription, metal ion binding and nucleic acid binding. Because these functions are also observed in disordered proteins, this suggests an association between amyloidogenic LCRs and disordered proteins.
We also analyzed the functional annotation of the human proteins that showed amyloid formation in LCRs. Amyloids were predicted only in polyAla, polyPhe, polyLeu, polyAsn, polyGln, polyVal, hydrophobic and polar LCRs of human proteins. Our analysis showed that human proteins containing these amyloid-forming LCRs were mostly involved in biological regulation and cellular processes. The major MF of human proteins predicted to have amyloidogenic LCRs was binding. Whereas LCR-containing proteins in general were related to the lumen and plasma membrane, human proteins in which the LCR was predicted to be amyloidogenic were present only in the membrane. Some BP were absent in homopolymeric runs but appeared when the same residues formed LCRs in combination with other amino acids. For example, the process "reproduction" was seen in proteins containing hydrophobic LCRs but was absent in polyLeu (a hydrophobic amino acid).