A Systems Biology Approach for the Investigation of the Heparin/Heparan Sulfate Interactome*

A large body of evidence supports the involvement of heparan sulfate (HS) proteoglycans in physiological processes such as development and diseases including cancer and neurodegenerative disorders. The role of HS emerges from its ability to interact and regulate the activity of a vast number of extracellular proteins including growth factors and extracellular matrix components. A global view on how protein-HS interactions influence the extracellular proteome and, consequently, cell function is currently lacking. Here, we systematically investigate the functional and structural properties that characterize HS-interacting proteins and the network they form. We collected 435 human proteins interacting with HS or the structurally related heparin by integrating literature-derived and affinity proteomics data. We used this data set to identify the topological features that distinguish the heparin/HS-interacting network from the rest of the extracellular proteome and to analyze the enrichment of gene ontology terms, pathways, and domain families in heparin/HS-binding proteins. Our analysis revealed that heparin/HS-binding proteins form a highly interconnected network, which is functionally linked to physiological and pathological processes that are characteristic of higher organisms. Therefore, we then investigated the existence of a correlation between the expansion of domain families characteristic of the heparin/HS interactome and the increase in biological complexity in the metazoan lineage. A strong positive correlation between the expansion of the heparin/HS interactome and biosynthetic machinery and organism complexity emerged. The evolutionary role of HS was reinforced by the presence of a rudimentary HS biosynthetic machinery in a unicellular organism at the root of the metazoan lineage.

A large body of evidence supports the involvement of heparan sulfate (HS) proteoglycans in physiological processes such as development and diseases including cancer and neurodegenerative disorders. The role of HS emerges from its ability to interact and regulate the activity of a vast number of extracellular proteins including growth factors and extracellular matrix components. A global view on how protein-HS interactions influence the extracellular proteome and, consequently, cell function is currently lacking. Here, we systematically investigate the functional and structural properties that characterize HS-interacting proteins and the network they form. We collected 435 human proteins interacting with HS or the structurally related heparin by integrating literature-derived and affinity proteomics data. We used this data set to identify the topological features that distinguish the heparin/HS-interacting network from the rest of the extracellular proteome and to analyze the enrichment of gene ontology terms, pathways, and domain families in heparin/HS-binding proteins. Our analysis revealed that heparin/ HS-binding proteins form a highly interconnected network, which is functionally linked to physiological and pathological processes that are characteristic of higher organisms. Therefore, we then investigated the existence of a correlation between the expansion of domain families characteristic of the heparin/HS interactome and the increase in biological complexity in the metazoan lineage. A strong positive correlation between the expansion of the heparin/HS interactome and biosynthetic machinery and organism complexity emerged. The evolutionary role of HS was reinforced by the presence of a rudimentary HS biosynthetic machinery in a unicellular organism at the root of the metazoan lineage.
A major challenge for the postgenomics era is to establish functional and structural relationships between the components of biological systems. In the last decade, the development of high throughput methods for the study of genetic interactions (1) and protein-protein interactions (2)(3)(4) has enabled the collection of large data sets describing binary relationships between primary gene products. The accumulation of these large data sets required innovative ways to represent and analyze molecular networks, thus stimulating the development of a new discipline known as network biology (5)(6)(7)(8). This new approach has been successfully used to integrate data from different experimental platforms (9), infer properties of interaction networks by applying statistical theories (6), assign protein function (10), identify network signatures characteristic of diseases such as cancer (8,10), and investigate the evolution of interaction networks (11,12). However, the chemical complexity of secondary gene products such as glycans and lipids and the technical challenges associated with the study of their interactions have generated a gap in our current models of interaction networks, and as a consequence, the interactions of proteins with secondary gene products such as glycosaminoglycans (GAGs) 4 have been excluded from the above systematic analyses.
The GAGs are linear polysaccharides whose synthesis is not template-driven. As the most complex of biological polymers, they provide access to a vast chemical information space. This has been exploited in eumetazoans to provide structural frameworks and active mediation of cell-cell communication, both absolute requirements for multicellularity. The sulfated GAGs such as heparin/heparan sulfate (HS) are synthesized and serine-linked to the core proteins of proteoglycans (HSPGs) and are located on the plasma membrane and in the extracellular matrix (ECM). The chemical complexity of heparin/HS arises from the initially synthesized monotonous polymer being extensively modified (epimerization and sulfation at various positions in the sugar rings). These modifications are substoichiometric and grouped to produce characteristic domains, which vary in size and number within each chain (13,14). Analysis of functional structures in vivo demonstrates that there is specific regulation of the structures of heparin/HS that are expressed at the cellular level (15)(16)(17)(18), and thus, it seems that biology exploits a substantial amount of the chemical information space of heparin/HS. The functions of heparin/HS are exerted through their capacity to engage protein ligands. The consequences of these interactions range from elaborating large scale structures to regulating the gradient formation and signaling activities of growth factors, cytokines, and morphogens and the localization and activity of extracellular enzymes (Refs. 16, 19, and 20; for a review, see Ref. 17). The scope of these functions is evidenced by the size of the human heparin/HS interactome: 216 proteins in a review published in 2008 (17). Many pathogens express proteins that interact with heparin/HS as part of their molecular adaptation to infection of mammals (21). Thus, HSPGs are key players in molecular networks driving biological phenomena such as development (22), inflammation and immune response (23,24), and disease (21,25).
The first aim of this study was to integrate and rationalize available data on heparin/HS-protein interactions. The current coverage of heparin/HS-protein interactions in public databases is largely incomplete (Table 1). Therefore, a literature mining effort (17) was combined with an affinity proteomics approach for the identification of heparin/HS-binding proteins (HBPs) (supplemental Results and supplemental Figs. 1 and 2), and data were retrieved from public databases to generate a comprehensive list of the interactions between heparin/HS and proteins described so far. The term "HBP" is used because heparin is commonly used as an experimental proxy for the sulfated domains of HS, and many interactions have not been validated with HS. This data set then enabled a new systematic way of analyzing heparin/HS-protein interactions using tools widely applied in genomics and proteomics studies. The system-level analysis allowed the investigation of how HBPs interact with each other by computing the topological properties of the network they form and the identification of functional and structural features that are associated with the heparin/HS binding activity. Finally, to generate insights into the role of HSPGs in the evolution of multicellular organisms, the presence of orthologs of HS biosynthetic enzymes in the genome of the choanoflagellate Monosiga brevicollis was investigated. Choanoflagellates are unicellular and colony-forming organisms found in marine and freshwater environments. They use a single apical flagellum surrounded by a collar of actin-filled microvilli to swim and capture bacterial prey (26). Because choanoflagellates are not metazoans and did not evolve from sponges or more recently derived metazoan phyla, they are indicated as the last unicellular organisms that evolved before the origin and diversification of metazoans (27). Previous works indicate the presence in the genome of M. brevicollis of protein families that were thought to be exclusive to multicellular organisms (26 -28). The presence of functional signaling cas-cades based on tyrosine phosphorylation has been also demonstrated (28). For these reasons, the study of M. brevicollis is considered to be crucial for the identification of the molecular networks that were present in the last common ancestor of choanoflagellates and metazoans and that likely contributed to the emergence of multicellularity and the development of animals.

EXPERIMENTAL PROCEDURES
Construction of Heparin/HS Interactome-The heparin/HS interactome was built using a combination of literature curation, data retrieval from public databases, and experimental data obtained by the affinity proteomics approach described in the supplemental information. An initial version of the literature-curated data set was first published in 2008 (17) and originally included 216 HBPs. The original data set was expanded, and its current version (December 2009) includes 280 interactors. Additional data were retrieved from publicly available databases and gene ontology (GO) classifications using the search criteria listed in Table 1. For GO (29) and UniProtKB (30), the search was restricted to human genes, whereas this was not necessary for MatrixDB because all the interactions described in it are associated by default to human protein identifiers (31). Finally, the 147 HBPs identified by the heparin affinity proteomics approach described in the supplemental Results and supplemental Methods were included as a set of experimentally derived, non-literature-based data (supplemental  Figs. 1-3 and 5 and supplemental Tables 1 and 7). The integration of these data sets resulted in a non-redundant list of 435 human proteins. This list, referred to as the heparin/HS interactome, was used for all the subsequent analysis presented, and it is available in supplemental Table 1.
Construction and Analysis of Heparin/HS-interacting Network-A protein-protein interaction network based on the heparin/HS interactome was built using Cytoscape v.2.6.3 (32,33). The protein-protein interaction resource was the Cytoscape Human Interactome Dataset (2007), which was obtained by merging molecular interaction data from a variety of sources including IntAct (34), Database of Interacting Proteins (35), and Human Protein Reference Database (36). The NCBI Entrez GeneID for each HBP was obtained from their UniProt accession number using DAVID 6.7 (37) and used to extract HBPs and their interactions from the human interactome data set. Topological parameters of the heparin/HS-interacting network, treated as an undirected network, were computed using the Cytoscape plugin Network Analyzer v.2.6 (38). For a context-relevant analysis, the topological parameters of the extracellular human interactome were compared with those of the network formed by extracellular HBPs. Thus, the extracellular interactome was extracted from the human data set by applying  (29), Kyoto encyclopedia of genes and genomes (KEGG) pathways (39), and Pfam domain families (40) in the heparin/HS interactome was analyzed using the web-accessible program DAVID 6.7 (37). The list of UniProt accession numbers of the HBPs was used as the input list, and the default human proteome was used as the background list. The significance of the enrichments was statistically evaluated with a modified Fisher's exact test (Expression Analysis Systematic Explorer score), and a p value for each term was calculated by applying a Benjamini-Hochberg false discovery rate correction (37). Cutoff values were 0.01 for GO biological process terms enrichment and 0.05 for KEGG pathways and Pfam domains. Furthermore, Pfam domains significantly enriched in the heparin/HS interactome were associated with the corresponding structural classification of proteins (SCOP) superfamilies (41) to reduce redundancy and perform the analysis described in Fig. 4. For GO term enrichments, the GO FAT annotation available in DAVID was used. The GO FAT is a subset of the GO term set created by filtering out the broadest ontology terms to avoid overshadowing more specific terms. The enrichment of GO biological process terms was also analyzed using the Cytoscape plugin BiNGO v2.3 (42) using the complete GO term set and a hypergeometric statistical test with Benjamini-Hochberg false discovery rate correction.
Identification of HS Biosynthetic Enzymes in M. brevicollis-The orthologs of human biosynthetic enzymes responsible for synthesis of the protein-GAG linker tetrasaccharide and for the polymerization and modification of HS chains were identified using a reciprocal Blast best hit approach (43). Thus, the protein sequences of human enzymes were searched against the non-redundant protein sequences databases of M. brevicollis (taxid 81824), fungi (taxid 4751), Dictyostelium discoideum (taxid 44689), and plants (taxid 3193) using Blastp with default settings. The best hits for each group were then searched against the human non-redundant protein sequences database (taxid 9606) and selected as orthologs only in the case where the reciprocal best hit criterion was satisfied. In the cases where more than one paralog enzyme was present in the human genome, a paralog was arbitrarily chosen from the family and used for the Blast search. Only for these cases was the reciprocal best hit criterion considered satisfied if the best hit of the reciprocal Blast search was a member of the same enzyme family but not necessarily the paralog used for the first search.

Topological Analysis of Heparin/HS-interacting Network-
The first step toward a systematic analysis of the heparin/HS interactome was to build and analyze the topological parameters of the network of protein-protein interaction formed by HBPs. This analysis is based on the representation of proteinprotein interaction networks as graphs where proteins are represented as nodes connected by edges that indicate the presence of an interaction between them (Fig. 1A). By applying statistical tools used for graph theory, it is then possible to compute topological parameters describing the properties of the network and use such parameters to compare different networks (for reviews, see Refs. 6, 38, and 44) (Fig. 1, B and C). Thus, HBPs and their interactions were extracted from a data set of human protein interactions obtained by merging data retrieved from different repositories (see "Experimental Procedures"). Because the predominant location of HSPGs is extracellular, the extracellular heparin/HS-interacting network was compared with the network formed by extracellular non-heparin/HS-interacting proteins and the total extracellular interactome (Fig. 1A). These networks were extracted from the human interactome data set using filters based on GO cellular component terms associated with extracellular protein localization (see "Experimental Procedures"). The topological parameters of the networks analyzed are summarized in Fig. 1B. By comparing the properties of these three extracellular networks, the main topological parameter associated with the heparin/HS binding function was the high average clustering coefficient displayed by the heparin/HS-interacting network (Fig. 1, B and C). The clustering coefficient is a measure of the modularity of a network (i.e. the tendency of nodes to form groups or clusters). Therefore, a high average clustering coefficient indicates the presence of highly interconnected groups of nodes (modules) within the network. Clustering coefficients of the heparin/ HS-interacting network are distributed at higher values when compared with the non-heparin/HS-interacting network and the total extracellular interactome (Fig. 1C). The high average clustering coefficient of the extracellular heparin/HS-interacting network indicates a stronger tendency of HBPs to form highly interconnected modules than other extracellular proteins. In Fig. 2, selected examples of highly clustered modules extracted from the heparin/HS interactome are shown. These examples indicate how the tendency to form highly connected modules is independent of the nature of the HBP. The clusters shown are in fact formed by secreted growth factors such as VEGFB and transforming growth factor-␤2 (TGF␤2) and their transmembrane receptors (Fig. 2, A and D) as well as by structural components of the ECM such as fibrillins (Fig. 2B) and plasma proteins such as coagulation factors (Fig. 2C). Furthermore, the architecture of the heparin/HS-interacting network also has functional implications. These clusters represent examples of functional modules responsible for the regulation of complex biological processes such as angiogenesis (45), morphogenesis (46), ECM assembly (47), and regulation of the coagulation cascade (48). These data support the view of HSPGs as key mediators of the assembly of molecular complexes at the cell surface and in the extracellular space (49).
Functional Analysis of Heparin/HS Interactome-To gain insights into functional roles of HBPs, the over-representation (enrichment) of ontology terms and components of molecular pathways in the heparin/HS interactome was analyzed in comparison with their occurrence in the human proteome. The GO biological process terms describe biological objectives to which the gene or gene product contributes (29), and they can be either broad, generic terms such as "response to stimulus" or a more specific term such as "fibroblast growth factor receptor signaling pathway." Therefore, the identification of biological process terms associated with a particular set of genes, in this case the HBPs, can be useful to highlight their functional roles at a network level. Ninety-four percent of the HBPs were annotated with at least one biological process term (Table 2 and  supplemental Table 2). The remaining 6% were not annotated due to the incompleteness of the current GO annotation. From this analysis, a strong correlation between the heparin/HS interactome and biological functions characteristic of multicellular and higher organisms emerged. Significantly enriched terms are associated with fundamental processes common to all multicellular organisms such as cell-cell signaling but also with more complex processes such as wound healing and the immune response that are characteristic of higher organisms ( Table 2). As with other ontologies, the biological process terms can be visualized as a graph where directed links describe the hierarchy and relationships between terms (29). This kind of visualization helps to group highly related/redundant terms typical of ontology classifications and identify relevant functional modules. The graph shown in Fig. 3 highlights the existence of four main functional groups strongly enriched in the heparin/HS interactome. These include two clusters of biological processes involved in the control of the immune system and of developmental processes and two other clusters related to the regulation of cellular processes such as cell proliferation and cell-cell signaling (Fig. 3). It has to be highlighted that the graph in Fig. 3 is based for clarity only on the subsets of terms with the highest p value; therefore, other significant and perhaps less investigated biological functions (such as, for example, "cation homeostasis"; p value, 2.1eϪ16) were not included in this analysis.
Next, a similar approach to identify those pathways that have a statistically significant over-representation of HBPs was FIGURE 1. Topological and functional analysis of heparin/HS-interacting network. A, the extracellular heparin/HS-interacting network ("Ec_hepint"; blue) was extracted from a data set of extracellular protein-protein interactions, and it was compared with the non-heparin/HS-interacting network ("Ec_not-hepint"; red) and the whole extracellular interactome ("Ec"; green) (see "Experimental Procedures"). The position of the nodes in the networks and the length of edges are arbitrary and only have a graphical purpose. The properties and topological parameters of the network analyzed are summarized in B. "Proteins" indicate the number of proteins (nodes) that form each network, and "PPI" (protein-protein interactions) indicates the number of interactions (edges) connecting them. The "Average degree" indicates the mean number of neighbors per node in the network. The "Characteristic path length" is the average over the shorter distances (number of links) separating all pairs of nodes in the network and offers a measure of the overall navigability of a network. The clustering coefficient is defined as the number of links connecting the first neighbors of a given node divided by the total possible number of connections between them. It is a measure of the tendency of nodes to form highly interconnected modules. The "Avg. clustering coefficient" for a network is calculated as the mean of the clustering coefficients for each node having a degree Ն2. The "Ec_hepint-random" network was generated from the extracellular heparin/HS-interacting network by applying a degree-preserving random shuffle of the edges (1320 shuffles). The "Ec_random" network was generated by randomly selecting a network of the same size as the extracellular heparin/HS-interacting network from the total extracellular interactome. The procedure was iterated 50 times, and mean network parameters are shown with S.D. in parentheses. The average clustering coefficient of the extracellular heparin/HS-interacting network is six standard deviations higher than the average value calculated for randomly picked networks. In C, nodes are binned according to their degree, and the average (Avg.) clustering coefficient for each bin is plotted applying the same color code used in A. The distribution of the clustering coefficients is characterized by the typical slope of protein-protein interaction networks, which indicates the presence of hierarchical modularity.

System-level Analysis of Heparin/HS Interactome
applied by projecting the heparin/HS interactome on the KEGG collection of pathways. KEGG pathways are manually drawn graphs that are based on the current knowledge of molecular interactions and reaction networks (39). In Table 3  and supplemental Table 3, the pathways enriched in HBPs are summarized. HBPs are involved in pathways responsible for the control of key physiological and pathological processes characteristic of multicellular organisms. In particular, the enriched pathways highlight the role of the heparin/HS interactome in key mechanisms implicated in the regulation of the cellular response to external stimuli. These include interactions between soluble ligands and their cell surface receptors ("cytokine-cytokine receptor interaction") as well as cross-talk with components of the ECM ("ECM-receptor interaction"). These mechanisms are directly linked to the control of cell behavior via the regulation of processes such as cytoskeleton reorganization ("regulation of actin cytoskeleton" and "focal adhesion") and activation of intracellular signaling cascades (e.g. "TGF␤ signaling pathway"). The deregulation of these pathways can lead to the establishment of pathological conditions such as cancer (e.g. "melanoma") and immunological disorders (e.g. "systemic lupus erythematosus"). Furthermore, pathways linked to other pathologies caused by structural alteration of extracellular proteins and accumulation of amyloid plaques (e.g. "prion diseases") are significantly correlated with the heparin/HS binding activity because most of these proteins directly interact with GAGs.
In summary, the functional analysis of the heparin/HS interactome highlighted the following. (i) HBPs are functionally enriched in biological processes that are characteristic of multicellular and higher organisms. (ii) HBPs are candidates for a potential key role in mediating the information flow between the extracellular space and intracellular signaling pathways. (iii) This role could imply a direct involvement of the heparin/HS interactome in complex physiological and pathological systems such as organismal development, inflammation, cancer, and neurodegenerative disorders.
Structural Analysis of Heparin/HS Interactome-The same strategy used for the functional analysis of the heparin/HS interactome was performed at a structural level by investigating the enrichment of protein domains in HBPs. Ninety-eight per-   (Table 4)). Highlighting the structural diversity of the heparin/HS interactome, the list includes domains that are characteristic of small soluble, single domain proteins such as cytokines and growth factors (e.g. "small cytokines, interleukin-8-like," "fibroblast growth factor," and "transforming growth factor-␤-like domain"), domains that assemble in large multidomain proteins (e.g. "thrombospondin type 1 domain," "laminin G domain," and "fibronectin type III domain") and domains associated with enzymatic activity (e.g. "trypsin fam-ily," associated with proteolytic activity

. GO biological process terms enriched in heparin/HS interactome are shown as nodes connected by directed edges that indicate hierarchies and relationships between terms.
The node size is proportional to the number of HBPs belonging to the functional category. The node color indicates the corrected p value (Benjamini-Hochberg false discovery rate correction) for the enrichment of the term according to the legend. For clarity, only highly significant terms are displayed (p Ͻ 1eϪ21). The graphs were generated using Cytoscape v.2.6.3 (32) and its plugin BiNGO v2.3 (42).

TABLE 3 KEGG pathways enriched in heparin/HS interactome
"Count" indicates the number of HBPs, and "Percent" indicates the percentage of the mapped proteins associated to each term. NOD, nucleotide binding and oligomerization domain.  Table 4). This approach allowed the reduction of the redundancy of the Pfam classification by grouping domains of related structure into 25 superfamilies (for example the entries "EGF-like domain" and "laminin EGF-like" of Pfam are grouped in the SCOP superfamily "EGF/laminin") ( Table 5). As mentioned above, not all the structures are directly associated with heparin/HS binding activity. Literature curation of the heparin/HS-binding sites (HBSs) identified so far revealed that 11 of the 25 superfamilies have never been described as mediating the interaction with the carbohydrate (Table 5). Although the lack of evidence in the current literature does not necessarily exclude a role of these structures in heparin/HS binding, this analysis highlights those structures that are predominantly mediating the interaction with GAGs. For each of these structures, a refer-  Integrin domains 58010 Fibrinogen coiled coil and central regions enced example and, when available, a three-dimensional structure of the domain in complex with heparin or a heparin-related molecule are reported in Table 5.
In summary, the structural analysis of the heparin/HS interactome revealed a high level of diversity between structures associated with heparin/HS binding function possibly with a direct link to the structural complexity of GAGs. These structures are typically found in extracellular proteins, and the majority of them are specific to metazoans. They can either occur in single domain proteins, which are in general part of large families of paralogs, or as units mediating the heparin/HS binding activity that have been arranged during evolution in different multidomain architectures. Finally, a fraction of the structures enriched in the heparin/HS interactome does not seem to be directly involved in the interaction with the carbohydrate, and they might be over-represented because they are functionally and structurally linked with the heparin/HS-interacting modules.
Evolutionary Aspects of Heparin/HS Interactome-Because a strong association emerged between HBPs and biological functions and pathways characteristic of higher organisms, the evolution of the protein domains enriched in the heparin/HS interactome was analyzed in correlation with organism complexity across the tree of life. In a recent publication, Vogel and Chothia (50) investigated the expansion of domain superfamilies across the genomes of 38 uni-and multicellular eukaryotes and correlated it to the increase in organism complexity as measured by the number of different cell types of which the organisms are composed. They mapped the occurrence of different superfamilies using a database of 1219 hidden Markov models based on the superfamily classification of domains (51). For each genome, they annotated single domain proteins and the individual domains of multidomain proteins to their respective superfamily and then calculated the abundance of each superfamily as the number of proteins that contain at least one domain belonging to that particular superfamily. The normalized abundance profiles were then used to calculate a Pearson correlation coefficient (PCC), r, describing the correlation between superfamily abundance and the estimated number of cell types per genome (50). The PCCs for the 27 superfamilies enriched in the heparin/HS interactome were extracted from the data set of Vogel and Chothia (50), and they are plotted in Fig. 4. The distribution of PCCs of the heparin/HS interactome superfamilies indicates a strong correlation between their abundance and organism complexity (mean r ϭ 0.83 with r Ն 0.8 indicating a strong positive correlation). To establish a link between domain functions and their correlation with organism complexity, the authors extended the domain annotations by manually assigning functional categories to each superfamily (50). They observed that only 15% of the superfamilies have a strong positive correlation (r Ն 0.8) with organism complexity. Moreover, just two functional categories contributed nearly half of these positively correlated superfamilies. These two functional categories are superfamilies associated with extracellular processes (20%) and regulation (29%) (50). Therefore, because most of the superfamilies characteristic of the heparin/HS interactome are also associated with extracellular processes, their relative contribution to the extracellular functional category was investigated. Thus, the heparin/HS interactome superfamilies were annotated using the functional classification described by Vogel and Chothia (50), and the distribution of the correlation coefficients of extracellular superfamilies enriched in the heparin/HS interactome was compared with that of superfamilies annotated as extracellular but not enriched in the heparin/HS interactome and of the whole extracellular category (Fig. 4). The mean correlation coefficient of the heparin/HS interactome superfamilies annotated as extracellular is even higher than the whole heparin/HS interactome set (mean r ϭ 0.89), and most importantly, the distribution of their PCCs is significantly different from that of the whole extracellular category (p ϭ 9.2eϪ3; one-sided Wilcoxon rank sum test) and other extracellular superfamilies (p ϭ 1.9eϪ3; one-sided Wilcoxon rank sum test).
Next, the analysis of the correlation between HSPGs and organism evolution was extended to their biosynthetic enzymes. In this case, the analysis was not performed using the SCOP superfamily classification because functional specificity cannot be entirely recapitulated by the structural annotation (i.e. structurally related enzymes can be grouped in the same superfamily despite having different substrate specificity). Only the HS co-polymerases EXT1 and EXT2 are currently annotated to a superfamily ("nucleotide-diphospho-sugar transferases"; 53448), which also includes other sugar transferases such as galactosyltransferases. However, this superfamily also shows a positive correlation with organism complexity (r ϭ 0.77). Therefore, the occurrence of HS biosynthetic enzymes across the tree of life was investigated by multiple sequence alignment across 638 sequenced genomes using STRING (52). The results are schematically summarized in Fig. 5. HS biosynthetic enzymes appear to be characteristic of the eumetazoan lineage and therefore strongly associated with the emergence of multicellularity. No homology is detectable in unicellular eukaryotes such as fungi and in plants (Fig. 5). Interestingly, FIGURE 4. Correlation between abundance of superfamilies associated with heparin/HS interactome and organism complexity. The PCCs describing the association between superfamily abundance and organism complexity were extracted from Vogel and Chothia (50). The PCCs for superfamilies enriched in the heparin/HS interactome ("Hepint"), enriched in the heparin/HS interactome and annotated as extracellular in Vogel and Chothia (50) ("Ec_hepint"), annotated as extracellular ("Ec"), and annotated as extracellular but not enriched in the heparin/HS interactome ("Ec_not-hepint") are plotted along the y axis. For each group, a horizontal bar indicates the mean PCC. The dashed line indicates the threshold for strong correlation between superfamily abundance and organism complexity. some level of homology is detected in certain species of bacteria. The occurrence of enzymes able to synthesize GAG-related carbohydrates has been described already in some vertebrate pathogens such as Streptococci (53). The emergence of these microbial enzymes may have occurred by either convergent evolution or horizontal gene transfer under the selective pressure of the vertebrate host defense as a mechanism to enhance the pathogen virulence (53). Interestingly, some HS postsynthesis editing mechanisms appear to have evolved at a later stage than the core biosynthetic machinery. No homology is detectable for the extracellular HS-cleaving heparanases in the nematode Caenorhabditis elegans, although homology, albeit low, is observed for the extracellular HS sulfatases (Fig. 5). The association between an increase in organism complexity and HSPGs is further corroborated by the expansion of the HS biosynthetic enzymes in particular in the vertebrate lineage. Thus, although C. elegans possesses only one isoform for each of the key enzymes involved in HS biosynthesis (HS co-polymerase exostosin (EXT), N-deacetylase/N-sulfotransferase (NDST), glucuronic acid C5-epimerase (GLCE), and the sulfotransferases HS2ST, HS6ST, and HS3ST), in humans, all these enzymes have expanded by gene duplication with the exception of GLCE and HS2ST. Two EXT, four NDST, three HS6ST, and six HS3ST isoforms together with one isoform each for GLCE and HS2ST form the HS biosynthetic machinery in humans. A similar pattern is observed for the HS core proteins. Only one syndecan and two glypican genes are found in C. elegans, whereas humans possess four syndecan and six glypican isoforms (54). Interestingly, a similar expansion is not observed for the ECM core proteins perlecan and agrin (54).
In summary, a strong correlation emerged between the abundance in a genome of domain superfamilies associated with the heparin/HS interactome and organism complexity. This correlation has already been observed for extracellular domains (50), but it is statistically more pronounced for the heparin/HS interactome superfamilies than others. A similar correlation is also suggested for HS biosynthetic enzymes and core proteins based on the occurrence of these proteins across the tree of life and their expansion during evolution, especially in the vertebrate lineage.
Choanoflagellate M. brevicollis Possesses Rudimental HS Biosynthetic Machinery-Because GAGs are commonly considered a hallmark of eumetazoans and the expansion of HS biosynthetic enzymes and interacting domains is associated with an increase in organism complexity, the presence of orthologs of HS biosynthetic enzymes in the genome of M. brevicollis was investigated. Thus, orthologs of human HS biosynthetic enzymes were established using a reciprocal Blast best hit proce-  (52). The UniProt Accession numbers of the human HS biosynthetic enzymes were used as input. The conservation of each gene across different species is indicated by squares colored according to the sequence homology detected by STRING. For clarity, some species (e.g. bacteria) were grouped in collapsed nodes colored in gray, and the number indicated in parentheses reports the number of species grouped in each node. In these cases, split squares report the highest and lowest score for the given gene within the grouped species. dure (43) against the non-redundant protein sequences database of M. brevicollis. The Blast search was also performed against the protein sequence databases of the nematode C. elegans, known to produce GAGs, of the colony-forming slime mold D. discoideum, and of Fungi and Plantae.
The biosynthesis of HS and heparin requires the formation of a linker tetrasaccharide by sequential transfer of a xylose, two galactose units, and a glucuronic acid (GlcUA) unit to serine residues located in a serine-glycine repeat consensus on the core protein. This linker region is common to other GAGs such as chondroitin sulfate and dermatan sulfate. High score orthologs of the four enzymes required for the assembly of the protein-GAG linker tetrasaccharide are present in the genome of M. brevicollis, whereas clear orthology cannot be established for the full set in lower eukaryotes and plants (Fig. 6). The addition to the linker region of an N-acetylglucosamine (GlcNAc) or an N-acetylgalactosamine (GalNAc) residue commits the biosynthesis toward heparin/HS or chondroitin sulfate/dermatan sulfate, respectively (55). The biosynthetic reaction then proceeds with the polymerization of the sugar backbone by alternate addition of GlcUA and GlcNAc units in the case of heparin and HS. In human, the addition of the first GlcNAc residue is performed by the enzyme exostosin-like 3 (EXTL3) that possesses only GlcNAc-transferase activity, whereas the sugar polymerization is carried out by the enzymes EXTs that present both GlcUA-and GlcNAc-transferase activities (56). The two activities can be localized in the EXT enzymes in two separate domains with the N-terminal domain (EXT(N)) responsible for GlcUA-transferase activity and the C-terminal domain (EXT(C)) responsible for the GlcNAc-transferase activity (57). The EXT enzymes are likely to be the result of a gene fusion event between two functionally interacting enzymes carrying the GlcUA-and GlcNAc-transferase activities (55). The more rudimentary biosynthetic machinery of the nematode C. elegans comprises just a single enzyme (rib-2) with GlcNAc-transferase activity that is responsible for both chain initiation and polymerization and a separate enzyme (rib-1) that presents high sequence homology with EXT(N) and possesses GlcUA-transferase activity but only when in complex with rib-2 (58) (Fig. 6). Similarly, in M. brevicollis, two genes are present that possess homology with the C-terminal portion of EXTL3 and EXT(N) (Fig. 6). These two genes present conserved features that have been associated with GlcUA-and GlcNAc-transferase activity in human genes such as a conserved DXD motif in the C-terminal portion of EXTL3 (55, 57) (supplemental Fig. 4), and they indicate the possibility of the production of heparosan-like chains by choanoflagellates. The heparin/HS polymerizing chains undergo a series of modification by four classes of sulfotransferases (NDST, HS2ST, HS6ST, and HS3ST) that introduce sulfate groups at different positions on both the GlcUA and GluNAc units and by an epimerase (GLCE) that converts some GlcUA residues to iduronic acid (55,59). These modifications largely contribute to generate sequence diversity in HS chains. M. brevicollis possesses two genes homologous to members of two of the four families of HS sulfotransferases: HS2ST and HS3ST (Fig. 6). Also in this case, the choanoflagellate genes present conservation of key residues involved in sulfotransferase catalytic activity such as binding sites for the sulfate donor 3Ј-phosphoadenosine 5Ј-phosphosulfate (supplemental Fig. 4). These data suggest that M. brevicollis potentially possesses a complete, albeit rudimentary, HS biosynthetic machinery able to synthesize sulfated forms of HS- discoideum, and Plantae. The figure reports the Blastp score of the best hits only in the cases when the reciprocal best hit criterion was satisfied. The "Linker" enzymes are responsible for the synthesis of the protein-GAG linker tetrasaccharide, which is shared between HS and other GAGs. The "HS" group includes enzymes that are specific for the HS-specific biosynthetic pathway that follow the formation of the linker tetrasaccharide. The full report of the Blastp search including M. brevicollis hits is provided in supplemental Table 6. like polysaccharides. There is no evidence supporting the presence of a GLCE enzyme in M. brevicollis, suggesting that this kind of postsynthesis modification might have been a more recent innovation in the metazoan lineage.
In summary, the identification of orthologs of key GAG biosynthetic enzymes in the genome of a unicellular choanoflagellate suggests for the first time the presence of HS-like sulfated polysaccharides in non-metazoans. Intriguingly, the presence of a rudimentary HS biosynthetic machinery at the root of the metazoan lineage but not in other eukaryotes could support the role of GAGs as key molecules for the emergence of metazoan multicellularity as it has been suggested for other cell signaling, adhesion, and ECM molecules exclusively shared by M. brevicollis and metazoans (27,28). Further studies including the biochemical characterization of the enzymes identified in this study and isolation of GAGs from M. brevicollis cells are required to validate this hypothesis.

DISCUSSION
The analysis undertaken of the heparin/HS interactome allowed the investigation of the properties of the heparin/HSinteracting network of protein-protein interaction and the analysis at a system level of the functional and structural features characterizing HBPs. The heparin/HS interactome was built combining literature mining, data retrieval from public databases, and proteomics experimental data. The affinity proteomics strategy described in the supplementary information led to the identification of 147 extracellular proteins of which 32 were previously described HBPs (supplemental Table 1). The remaining 115 newly discovered HBPs were also included for the analysis of the heparin/HS interactome, although these interactions will require independent experimental validation. The resulting data set included 435 human proteins of which most are extracellular (supplemental Table 5). The analysis of the heparin/HS-interacting network revealed that HBPs tend to form more highly clustered modules than other extracellular proteins. These clusters can assemble both at the cell surface and in the ECM, and they often also represent functional modules. From a functional point of view, the heparin/HS interactome is strongly associated with biological processes characteristic of multicellular organisms and with pathways that are crucial for the conversion of extracellular cues into intracellular signaling events and finally into a phenotypic response. These processes and pathways are central to complex biological phenomena particular to higher organisms such as development and the immune response, and they are consequently linked to pathological conditions such as cancer and neurodegenerative disorders. In this perspective, potential intracellular roles of HSPGs could provide additional mechanisms for GAG-mediated regulation of intracellular signaling (60). The structural analysis of the heparin/HS interactome revealed the existence of two main categories of domains associated with heparin/HS binding activity: domains that are characteristic of soluble single domain proteins and domains that occur mainly in multidomain proteins and that have been assembled during evolution in different architectures. From an evolutionary perspective, the expansion of domains associated with the heparin/HS interactome strongly correlates with an increase in organism complexity that is independent of their nature. A similar correlation was already described for domains and proteins generically associated with extracellular processes (50,61); however, also within this set, the heparin/HS interactome-associated domains display a statistically significant higher correlation than other extracellular domains. It has been shown that the evolutionary rate of the extracellular proteome is faster then the intracellular proteome probably due to the less chemically constrained environment faced by extracellular proteins (62). This evolutionary plasticity of the extracellular proteome could have been a driving force for the organization of more complex systems of intercellular communication and organization. The functional and structural link between the heparin/HS interactome and biological processes characteristic of complex organisms and the fact that HBPs are more correlated than other extracellular proteins with an increase in organism complexity strongly suggest a pivotal role of HSPGs in driving the evolution of multicellular and higher organisms. In fact, the expansion of enzymes and core proteins responsible for the synthesis and localization of HS chains also correlates with an increase in organism complexity in the metazoan lineage. The core HSPG biosynthetic machinery was thought to have evolved in early eumetazoans concomitantly with the emergence of multicellularity (63). Biochemical data describing the presence of heparin/HS-like GAGs in phyla at the root of the eumetazoan lineage such as Cnidaria and Ctenophora (64,65) supported this view. However, the presence of orthologs of key HS biosynthetic enzymes in the choanoflagellate M. brevicollis suggests that GAGs are likely to be a molecular innovation that predates the origin of metazoans. HSPGs might have been a critical step for the assembly of an extracellular network of proteins required for the structural organization of the extracellular space and for the establishment of cell-cell communication and cell differentiation. Later on, gene duplication events, in particular the whole genome duplications that occurred at the root of the vertebrate lineage (66), contributed to expand the repertoire of both HSPG biosynthetic enzymes and their interacting partners on which evolution could act. The fact that domains characteristic of the heparin/HS interactome are strongly correlated with organism complexity could indicate that their expansion has been one of the driving forces toward the organization of the more sophisticated and tunable extracellular networks that are required for the development of higher organisms and for the control of organism-level biological processes such as the establishment of an immune system. In support of this, others have already proposed HSPGs as key molecules for the emergence of neural connectivity (54,67). The authors collected a series of experimental evidences obtained in different model organisms describing the specific involvement of HSPGs in all the key processes required for the establishment of neural connectivity including axon guidance, neuron-target interaction, and synapse development (54). Similar to what has been proposed here, the authors suggested a role for HSPGs as versatile extracellular scaffolds that modulate extracellular cues influencing the response of neurons to their environment (54). In the future, heparin affinity proteomics strategies similar to that described in the supplemental information could be implemented in combination with quantitative mass spectrometry techniques such as targeted proteomics (68) and staple isotope labeling with amino acids in cell culture (69) to investigate dynamic changes of the heparin/HS interactome associated, for example, with different developmental stages or pathological conditions. Such data, complemented by the characterization of the dynamic changes in HS structure, could be extremely valuable to elucidate HSPG-mediated mechanisms involved in the control of physiological and pathological processes. This opens the door to the design of new, network-based therapeutic strategies targeting multiple protein-glycan interactions that are associated with multifactorial diseases. Finally, the structural and functional characterization of the proteome and glycome of organisms at the root of the metazoan lineage could illuminate the role of protein and glycan co-evolution in the assembly of the extracellular molecular networks necessary for the development of complex forms of life.