The Development and Application of a Quantitative Peptide Microarray Based Approach to Protein Interaction Domain Specificity Space*

Protein interaction domain (PID) linear peptide motif interactions direct diverse cellular processes in a specific and coordinated fashion. PID specificity, or the interaction selectivity derived from affinity preferences between possible PID-peptide pairs is the basis of this ability. Here, we develop an integrated experimental and computational cellulose peptide conjugate microarray (CPCMA) based approach for the high throughput analysis of PID specificity that provides unprecedented quantitative resolution and reproducibility. As a test system, we quantify the specificity preferences of four Src Homology 2 domains and 124 physiological phosphopeptides to produce a novel quantitative interactome. The quantitative data set covers a broad affinity range, is highly precise, and agrees well with orthogonal biophysical validation, in vivo interactions, and peptide library trained algorithm predictions. In contrast to preceding approaches, the CPCMAs proved capable of confidently assigning interactions into affinity categories, resolving the subtle affinity contributions of residue correlations, and yielded predictive peptide motif affinity matrices. Unique CPCMA enabled modes of systems level analysis reveal a physiological interactome with expected node degree value decreasing as a function of affinity, resulting in minimal high affinity binding overlap between domains; uncover that Src Homology 2 domains bind ligands with a similar average affinity yet strikingly different levels of promiscuity and binding dynamic range; and parse with unprecedented quantitative resolution contextual factors directing specificity. The CPCMA platform promises broad application within the fields of PID specificity, synthetic biology, specificity focused drug design, and network biology.

methods. Here, we develop and apply a quantitative high throughput method capable of addressing these questions.
Peptide arrays (16,17), degenerate libraries (18,19), and phage display (20) are the most frequently applied high throughput approaches for investigating PID specificity. Phage display and degenerate library approaches sample a large ligand space and can produce consensus selectivity motifs that represent the most preferred residues at every position panned. This selectivity data is used to predict interactions, often via position specific scoring matrices (PSSMs) (21)(22)(23). However, neither approach can explicitly measure nonbinding events and only large phage display data sets can resolve a limited subset of high-affinity contextual binding information (24). Nonbinding information and contextual interplay, that is, correlated contributions between ligand positions to binding affinity, play important roles in defining the specificity landscapes for multiple PID families (25)(26)(27). Having explicit nonbinding or low-affinity information available helps uncover contextual binding information, and improves the accuracy of interaction priority assignment between multiple competing PIDs. Correspondingly, the availability of nonbinding and contextual information improves interaction prediction performance (28,29). Peptide arrays using physiological ligands do not have these limitations, yet may undersample PID specificity space because of smaller library sizes. Newly emerging ultrahigh density peptide arrays avoid this particular limitation and are capable of sampling the entire proteome (30). However, a common limitation for all of these techniques is their dependence on nonquantitative interaction information.
A comprehensive understanding of PID specificity space requires the quantitative assessment of pairwise interactions across a broad dynamic range of affinities. Common lowthroughput biophysical techniques used to measure proteinpeptide interaction affinities require highly pure and often large amounts of interactants along with prior knowledge of their interaction. To facilitate discovery and lessen the stringency of the purity and/or quantity requirements of interacting molecules, multiple quantitative high throughput methodologies have been developed. Thus far, the protein microarray (PMA) (31) and high-throughput fluorescence polarization (HTFP) (11,32) quantitative approaches have been used to examine PID specificity. Unfortunately, PMAs suffer from poor sensitivity, reproducibility, and measurement discrepancies (32,33). The alternative HTFP assay is more sensitive than PMAs, but also has poor reproducibility and is biased toward high affinity interactions (32). Further, PMAs and HTFP have minimal K D sensitivities of 2 and 20 M, respectively. This boundary limits their scope of application considering the importance of moderate affinity interactions for many PID families (34) and the general importance of these interactions in directing emergent phenomena such as ultrasensitivity and interaction gating (35,36).
The Src Homology 2 (SH2) domain phosphotyrosine interactome is a system of prominent physiological importance that has been the subject of multiple preceding interaction mapping and platform development efforts (17,19,31,32,37,38). Tyrosine phosphorylation and its recognition by SH2 domains is a uniquely metazoan adaptation (39), prominently involved in the transduction of growth signals (1,2,40). SH2 domain specificity has been leveraged in multiple contexts to rewire signaling pathways (15), and specificity dysregulation is associated with multiple diseases, including cancer (40 -42). Despite the importance of directing specific signals, many SH2 domains share sequence preferences, resulting in substantial binding crossover over a broad dynamic range of affinities (27). Yet, the extent of this crossover and the peptide motif characteristics responsible for its direction are poorly resolved. In light of these biological attributes and the availability of preceding information to guide platform development and assess its performance, we chose a significantly novel set of 124 physiological phosphopeptides and four SH2 domains as a pilot interactome for our arrays.
Here we develop a cellulose peptide conjugate microarray (CPCMA) based approach that is the first broadly applicable protein interaction mapping approach capable of explicitly capturing binding and nonbinding information, resolving the contextual interplay between amino acids that contribute to binding, and providing a quantitative measure of interaction affinity. We optimize multiple production and incubation parameters for the arrays and develop an experimental and computational pipeline to control for systematic artifacts and robustly estimate affinities, ultimately producing a physiological interactome of 368 measurements between 92 peptides and four domains. In contrast to preceding platforms, we find our approach is highly sensitive, extremely precise, and consistent with multiple forms of orthogonal biophysical and in vivo validation. Leveraging unique modes of analysis enabled by the CPCMAs, we demonstrate that SH2 domains vary considerably in promiscuity and binding dynamic range in a manner aligned with biological function. Consistent with recent evidence that pathway crosstalk is subjected to negative selection (10 -12, 43) we demonstrate for the first time that the degree distribution of the interactome is a function of affinity, and therefore little binding overlap is found among the highest affinity interactions, suggesting that such sites may be important points of in vivo coregulation. We highlight a few such interactions and corroborate others using available domainsite specific in vivo interaction information. Next, in order to provide insight into the specificity determinants responsible for this segregation we produce for the first time predictive domain specific binding motif affinity matrices. Lastly, we demonstrate the unique ability of the CPCMAs to resolve contextual contributions to binding, and highlight examples where contextual information contributes to binding specificity. Quantitative analysis of systems-level PID properties is a necessary prerequisite for improved mathematical modeling of signal transduction systems, specificity focused drug design, and engineering advanced synthetic biology circuits. Thus, our quantitative CPCMA based approach should have broad application in future network biology and PID specificity efforts.

EXPERIMENTAL PROCEDURES
SH2 Domain Production and Purification-SH2 domain sequences were cloned into pGEX vectors as GST fusions as described (27) and expressed in E. coli BL21 cells using autoinduction recipes and protocols according to Studier et al. (44). Following overnight culturing, cells were pelleted and resuspended in 50 mM HEPES, 150 mM NaCl, 1 mM EDTA, 1 mM dithiotreitol, and 0.1% Triton X-100 pH 7.5 ϩ protease inhibitor mixture (EMD). Unless stated otherwise, all purification steps were carried out at 4°C. The suspension was sonicated, pelleted, and the supernatant was mixed with polyethyleneimine (PEI) to .1% (v/v) forming a white precipitate. This precipitate was pelleted in an ultracentrifuge, and the GST-SH2 domain containing supernatant was subjected to batch affinity chromatography using glutathione (GSH) functionalized resin (Fisher) for 2 h at RT or overnight. The proteins were eluted from the beads with 10 mM GSH in .1 M HEPES pH 8.0. Eluted protein was buffer exchanged using Nap10 (GE) columns for cation exchange chromatography. Proteins were further purified with an AKTA purifier system (GE) using a cation exchange resin SP (GE), CM (GE), or heparin (GE) to remove nucleic acid contamination. In isolated cases, proteins were subjected to gel filtration (GE) for additional cleanup. Following centrifugal concentration (Thermo), proteins were buffer exchanged into 50 mM HEPES, 100 mM NaCl, 1 mM EDTA, 1 mM DTT, 20% v/v glycerol pH 7.5 using Nap10 or PD MiniTrap (GE) columns, flash frozen, and stored at Ϫ80°C until needed. Aliquots were subjected to SDS-PAGE with Coomassie Blue stain to verify purity. Concentration measurements were performed via BCA assay (Thermo) according to the manufacturer's instructions.
Soluble Peptide Production and Fluorescence Polarization-Peptides were synthesized using 9-Fluorenylmethoxycarbonyl (FMOC)chemistry onto preloaded tenta-gel resins. Peptides were labeled N-terminally with Rhodamine B (Abbey Color, Fremont, CA) or 5-6 carboyfluorescein (Anaspec) and cleaved using trifluoroacetic acid. Peptides were purified and mass quantified using a LC/MS or MALDI mass analyzer (Agilent 2100). Peptides were lyophilized and stored under nitrogen at Ϫ20°C until use. A subset of fluorescein labeled peptides was also purchased (Genscript, Piscataway, NJ). FP experiments were conducted using the Beacon 2000 (Invitrogen, Carlsbad, CA).
SPOT Synthesis-Membranes were synthesized using the MultiPep high throughput peptide synthesizer (Intavis, Cologne, Germany). All Fmoc-Amino Acids were purchased from Bachem, except Fmoc-pTyr (Anaspec). All solvents were purchased from Fisher unless otherwise indicated. Cellulose membranes were purchased from Intavis. The SPOT synthesis method was carried out on cellulose membranes using Fmoc chemistry as described in detail elsewhere (45). Each 11-mer peptide chain was synthesized with pTyr or Tyr in the 5th position relative to the N terminus. To prevent oxidation mediated artifacts, all cysteine residues were replaced with serine and all methionine residues were replaced with norleucine. All peptides were acetylated at the N terminus.
SC 2 Synthesis-Following synthesis, each of the 384 peptide spots were manually removed from the membrane using a hole-puncher and placed into individual wells of a 96 well plate (Matrix Technologies, Toledo, OH). Each well was treated with 150 l of a deprotection mixture consisting of 95% trifluoroacetic acid, 3% triisopropylsilane (TIPS), and 2% H 2 O for 1.5 h. Deprotection mixture is then aspirated from each well and replaced with 250 l of trifluoroacetic acid (93.5%), Trifluoromethanesulfonic acid (TFMSA) (2.5%), H 2 O (2.5%), TIPS (1.5%), or one of the alternative dissolution cocktails (see Table  I) for 17 h at RT with minor agitation. Subsequently each CPC is precipitated and washed three times with cold MTBE using the Precision 2000 pipetting robot (BioTek, Winooski, VT). The final wash supernatant is removed and each CPC is dissolved in 100 l dry dimethyl sulfoxide with agitation. The dissolved conjugates are transferred in 20 l aliquots to five 384 well plates (Genetix) by robotic or manual pipetting. The plates are then centrifuged and covered with metallic sealer to prevent evaporation (Dot Scientific). The plates are flush or flash frozen and stored at Ϫ20 in a desiccator until used for printing.
Microarray Fabrication-Plates are thawed to RT in a desiccator before removing the seal. Five microliter aliquots were used for printing each batch of arrays. Each print batch was fabricated using the Omnigrid 100 (Digilab, Marlborough, MA) contact arrayer using a single SMP3 split pin (Arrayit, Sunnyvale, CA). Arrays were printed on amine, aldehyde, acrylic, epoxy, or nontreated slides. All slides were purchased from Genetix with the exception of the nontreated slides that were purchased from Fisher. Each unique CPC or CPC dilution combination was spotted four times per microarray for control experiments, 12 times per microarray for quantitative arrays and substrate comparisons (one/well), and 24 times per microarray for dissolution condition comparisons (two/well). After printing slides were stored in a desiccator at 4°C until ready for use.
CPCMA Incubations and Analysis-All array incubations were performed using a custom designed ArraySlide and ArrayMix (Gel Company, San Francisco, CA) orbital shaker at 300 rpm according to the protocol outlined here. All steps were performed at RT unless otherwise indicated. All blocking and incubation steps were performed covered with seal plates to prevent evaporation. All wash steps were performed with 250 l of solution unless otherwise indicated. Microarrays were swollen with two 10 min washes of 500 l H 2 O followed by two 5 min washes with PBS, and two 5 min washes with PBS-T (.1% v/v). The arrays were then blocked for 1.5 h with 250 l PBS-T/dry milk (2% w/v) and 1 mM dithiotreitol. The arrays were then washed three times with 250 l PBS-T for 5 min. The following protocol was used for all CPCMAs. Primary analyte incubations were performed for 20 h at 4°C with 100 l PBS-T/dry milk (2% w/v), with (SH2 domains) or without (Anti-pY/CBD) 1 mM DTT. Anti-pY antibodies were arrayed at 2.5 nM and CBD-Protein L fusion (Fluka) at 45 nM. The arrays were then washed three times with PBS-T for 5 min. Secondary incubations were performed for 5 h with 100 l PBS-T/dry milk (2% w/v). Anti-GST (Genscript monoclonal A00865) for SH2 domains and control experiments was arrayed at 7 nM, human IgG for CBD-L experiments was arrayed at 70 nM. After another round of PBS-T washes, tertiary incubations were performed for 2.5 h with 100 l PBS-T/dry milk (2% w/v). AF633 conjugated anti-mouse and antihuman secondary detection antibodies (Invitrogen) were arrayed at 7 nM. The arrays were then washed thrice with PBS-T, twice with PBS, and twice with H 2 O. After the final wash with H 2 O slides were removed from the gaskets and washed in a bath of H 2 O covered in a glass tray (Ted Pella, Inc., Redding, CA) with slight agitation for 10 min. Slides were removed from their gaskets and dried with a stream of extra-dry Ar. Slides were placed covered in a desiccator at 4°C until scanned. Slides were usually scanned within 1 week of arraying.
Array Scanning-All images were acquired on a Genepix 4000B scanner. Genepix Pro 6.1 software was used for feature identification and image processing. All images were acquired with maximal laser power at 5 m resolution. Nonquantitative arrays were scanned at a single photomultiplier voltage such that saturated pixels were minimized. Spots with artifacts such as saturated pixels or bleed in from adjacent spots were omitted from further analysis. Average intensities were calculated for each feature on the microarray. For quantitative experiments, voltage levels were chosen such that the lowest setting captured no saturated pixels. Each array was scanned at a minimum of three different voltages, in steps of 100 V.
Data Analysis-Most data analysis steps were performed using the R statistical computing environment (http://www.R-project.org/). Some analysis was also performed using Originlab8.1 and Excel.
For CPCMAs used for substrate comparisons unique CPC/dilution factor replicates (n ϭ 4) for each bioassay were averaged for each slide. For CPCMAs used for dissolution condition comparisons unique CPC/dissolution condition replicates (n ϭ 24/slide) were averaged. For substrate and dissolution condition comparisons, each unique CPC condition was pooled into a single bioassay specific group for statistical comparisons. Group distributions were tested for normality. Normally distributed groups were compared by two sample t-tests and non-normally distributed groups were compared by Wilcoxon rank-sum test. Unless otherwise indicated, hypothesis testing was performed using a two-tailed t test. PCC comparisons were performed using Fisher's r-to-z transformations and a two tailed test of significance. Venn diagrams were produced using the R package "VennDiagram." The network graphic was produced using cytoscape (46).
Receiver Operating Characteristic (ROC) Analysis-The array produces numerical estimates of affinity and not binary affinity class assignments. We tested the performance of our array derived affinity estimates as a ranking classifier, using our paired FP affinity estimates as the reference set. The K D cutoff corresponds to the reference set of FP values and is used to define the actual class set. For example, in the case of the 10 M cutoff, if an interaction has an affinity greater than 10 M as measured by FP it is assigned a 1, and if not a 0. To produce the ROC curve, the interactions with class affiliation are ordered by apparent K D , and the apparent K D (ranking classifier) values are then thresholded, or stepped sequentially throughout the data range, producing different true positive (tpr) and false positive rates (fpr) at each step. Where tpr ϭ tp tp ϩ fn and fpr ϭ fp fp ϩ tn such that tp and tn indicate those interactions correctly classified as above or below the threshold and fp and fn indicate those interactions incorrectly classified as above or below the threshold. ROC curves from single concentration data and SMM derived predictions were generated in an analogous fashion. For the 10 M cutoff, the apparent K D value that produced the maximum accuracy (where accuracy ϭ tp ϩ tn tp ϩ tn ϩ fp ϩ fn ) was also calculated. ROC analyses and the associated plots were produced with the R package "ROCR." Literature Curation, SMALI, and Scansite Implementation-The literature was manually mined along with the databases Phosphosite (47), MINT (48), HPRD (49), DOMINO (50), and Pepcyber (51) for protein-protein, domain-protein, and domain-phosphopeptide in vivo interaction information (supplemental Table S2). SMALI (22) and Scansite (21) (low stringency) assigned a binding score to our 124 peptides in accordance with database instructions.
SMM and Logo Generation-The availability of high-quality binding affinity measurements for the SH2 domains offered us an opportunity to build accurate in silico models of the energy contributions from amino acid residues at different positions. We used the stabilized matrix method (SMM) algorithm to build such models (52). Briefly, SMM is a linear regression based method that generates a Position Specific Scoring Matrix (PSSM) given a set of peptides and their measured affinities. The algorithm optimizes a PSSM during "training" such that predicted affinities using the matrix are as close as possible to the measured ones as determined by a sum of squared errors. To predict a binding affinity of a peptide to an SH2 domain, corresponding entries in the domain's PSSM (i.e. residues/positions) are read off and summed. SMM is distinguished from simple applications of a linear regression technique by the following features: (1) trained models are adapted to noise in the affinity measurements, and (2) the algorithm can deal with bounded measurements. The SMM was applied to each domain as outlined previously (52) using the "reactive-quantitative" subset of peptides. Signals were converted into pK D scale, with higher numbers representing stronger affinities. Nonbinding interactions were assigned a value of 5, whereas binding interactions were assigned a value of 6. For each SH2 domain ten fivefold cross validations were performed. The scoring matrices shown in this study were averages of the 50 matrices. Matrices were converted to logos using the "Seq2logo" web tool (53). RESULTS CPCMA Production and Optimization-Frank and colleagues recently introduced the spotting compound conjugates (SC 2 ) technique (54) to manufacture multiple peptide microarrays from a single SPOT synthesis cellulose membrane (16) (Fig. 1A). To optimize the CPCMA platform, we varied substrate type and cellulose polymer length and analyzed their effect on assay sensitivity and precision using a series of bioassays. A cellulose-binding domain (CBD) was used to report the density of the printed cellulose matrix, and both a phosphotyrosine-specific antibody (Anti-pY) and a representative SH2 domain from the protein VAV1 were used to report accessible phosphopeptides (Fig. 1B). Noting superior performance in preliminary tests, we examined assay sensitivity, precision, and throughput capacity (spot size) of amine and aldehyde coated surfaces. Using a panel of physicochemically diverse CPCs (supplemental Table S1), a maximum of 1024 measurements was made per bioassay. Amine functionalized slides displayed a clear advantage in sensitivity, precision, spot size, and spot uniformity compared with aldehyde slides (Fig. 2), and were therefore used for subsequent analyses.
In SC 2 , the cellulose support is cleaved via acid mediated hydrolysis of glycosidic bonds, and the final cellulose polymer length depends on acid concentration (55). A longer cellulose chain may attach more robustly to the slide, thereby increasing CPC density and assay sensitivity, consistent with previous reports of increased sensitivity for protein and peptide microarrays employing three-dimensional polymer scaffolds for surface attachment (56,57). However, this advantage must be balanced against the general insolubility of cellulose, which may lead to precipitation and printing anomalies. To further optimize the CPCMAs we performed a separate synthesis of supplemental Table S1 peptides, subjecting them to five different dissolution cocktails (Table I). Although we could not recover peptide from the highest and lowest acidity conditions, we identified a "medium" acidity condition with the highest CPCMA sensitivity (supplemental Fig. S1, Table I), which we used in subsequent assays.
CPCMA Affinity Estimation Pipeline-The experimental pipeline includes a series of control arrays followed by SH2 domain titration arrays (Fig. 3A) and a CBD bioassay-mediated print control assay (supplemental Fig. S2B). We measured affinities for 124 physiological phosphopeptides against four SH2 domains (supplemental Table S2). We independently processed eighteen of these peptides and nonadjacently arrayed them twice to assess the precision of replicate measurements (supplemental Fig. S2A). In total, we spotted 1728 features per array with 144 per subarray. A subset of the peptides are known to bind to specific SH2 domains with high affinity, and almost all phosphopeptides arrayed are phosphorylated in vivo (47). We comprehensively covered selected well characterized receptor tyrosine kinases (RTKs) such as the ERBB family, as well as some under characterized RTKs such as DDR1 and DDR2 that are known to be highly phosphorylated in cancer contexts (58). Well over one third (53 of 124) of our phosphopeptide set has never been investigated in high throughput before (supplemental Table S2). This design enabled comparisons with preceding platforms and orthogonal biophysical data while producing an interactome with considerable novelty. We selected SH2 domains from GRB2, NCK1, ABL1, and SHP2 (N-terminal domain) proteins. Each of these domains derive from proteins with different functions and have distinct selectivity preferences (supplemental Table S3), so overlap is thought to be relatively rare and for physiological peptides, potentially consequential in vivo. We produced each domain as GST fusions and applied each to three replicate arrays from 0.5 nM to 5 M (Fig. 3A).
We encountered and addressed three sources of systematic error in the CPCMAs: image capture gain and image saturation, printing efficiency between arrays, and printing efficiency across blocks within a single array. Because we use a wide SH2 concentration range, we cannot capture all features from all arrays without saturation with a single photomultiplier voltage setting. We therefore scanned each array multiple times, varying the voltage and ensuring that at least one scan had no saturated pixels. Bias-free measurements of the same array at different voltages would yield intensity values that differ only through instrument gain. Therefore, M-A scatterplots of one voltage versus a second voltage should produce uniform horizontal lines with homogeneous scatter. However, a clear curvilinear bias exists at low intensities, whereas at high intensities saturated pixels underestimate the signal (Fig. 3B, top). To correct for this bias and control for saturated pixels, we applied an affine model of scanner characteristics (see supplemental Experimental Procedures in supplemental material) (59). This model allows us to use the signals from each voltage as replicate measures of probe content, after bias subtraction and scale normalization (Fig.  3B, bottom panels). The median of these replicates yields a robust estimate of fluorophore quantity on a common linear scale.
FIG. 1. Overview of CPCMA fabrication, layout, and bioassay interactions. A, 11-mer phosphopeptides are SPOT synthesized on a cellulose membrane via Fmoc SPPS. The CPC spots are manually removed and placed into microplates where they are simultaneously deprotected and acid solubilized. Following precipitation, washing and resolubilization (arrow), they are pin arrayed onto glass substrates with twelve addressable subarrays. Each subarray serves as an independent reaction vessel for each bioassay. B, Example subarray layout and interaction profile. Each subarray contains control spots consisting of cellulose alone (green square) and experimental CPC spots (purple square). The control spots can only be recognized by the CBD bioassay (red spots). All of the properly synthesized and displayed CPCs can be recognized by the CBD and Anti-pY antibody (yellow spots) whereas a subset of these can be recognized by the CBD, Anti-pY antibody, and SH2 domain (blue spots).
Array production often suffers from a systematic, gradual loss of printed mass during a production run (60, 61). Our system revealed significant loss across three replicate arrays (Fig. 3C). Using the CBD to monitor printed mass, we observe a similar loss across print blocks within a single array (supplemental Fig. S2C). We therefore applied a correction using model-derived estimates for printing effects to achieve consistent signal distributions (see supplemental Experimental Procedures in supplemental material). We assigned SH2-CPC interactions and subsequently the CPCs themselves into categories (supplemental Fig. S3A and S3B). Using this empirical approach to classification, a hierarchy emerged reflecting the information provided from each peptide. Approximately 74% of our peptide set bound to at least one domain. The great majority of these reactive peptides scored at least one quantitative reaction, yielding a rich interaction data set of 324 quantitative measurements and 44 binary measurements. ln [SI PMTϪVhigh * SI PMTϪVlow ]) of all PMT-V feature pairs for a single experimental array are shown before (top) and after correction for voltage specific offset (bottom left) and gain (bottom right). The bias corrected SI measurements are used to estimate a single SI value for each feature on the array, simultaneously correcting for nonlinearity and saturated pixels. C, Inter-array correction. Representative box plots of the SI distributions between replicate arrays before and after inter-array normalization are shown. In this instance, replicate "1" and "2" were printed in succession, whereas "3" was printed six slides after replicate "2." single concentration measurements, the usual measurement mode for high throughput assays (PCC ϭ 0.87, p ϭ 0.016, supplemental Fig. S4B).
To further validate the affinity measurements, we compared a subset of identical peptide-SH2 interactions using fluores-cence polarization (FP). The FP assay does not suffer from interfacial assay artifacts and is often used as an orthogonal measurement technique (11,27,33). We examined 10 peptides and each SH2 domain across an affinity range spanning four orders of magnitude. These peptides were printed twice  Tables S2 and S4). Note that the array overestimates the absolute affinities because of an avidity effect imparted by the constitutively dimerized GST expression tag (62). Although the affinity measurements must be interpreted in a relative sense, the correlation of the CPCMA and FP estimates for this diverse panel of interactions is good (PCC ϭ 0.65), as shown in Fig. 4B. Once again, single concentration measurements performed poorly (PCC ϭ Ϫ0.28, supplemental Fig. S4C).
Because the CPCMAs outperformed single concentration estimates of binding, we assessed the classification performance of each type using receiver operating characteristic (ROC) analysis (63) (Fig. 4C). ROC curves were generated with the set of peptides having both array and FP derived affinity estimates (Fig. 4B) using the FP values as the reference set for classification. We chose cutoff values-that is K D values from the FP data set to define the classification boundary-in the low micromolar range because this is near our median affinity and because low and submicromolar interactions are common for well-characterized SH2 domains. At cutoff levels of 10 and 1 M, apparent K D values classify more efficiently than single SI measurements (Fig. 4C). Apparent K D estimates at the 10 M cutoff performed extremely well (AUC ϭ 0.97), with maximal threshold accuracy achieved with Array log K D ϳ 1.74 nM. We chose this value to serve as a boundary between two affinity classes, "high" and "low" (corresponding to Ͻ10 M and Ͼ10 M, respectively). We further identified a third class of CPC-SH2 interactions, the "nonbinders." To further assess the appropriateness of the 10 M boundary, and to enrich our data set with measured affinities and in vivo interaction information, we manually mined the literature and multiple databases (supplemental Table S2). All of the 34 biophysically quantified in vitro interactions and almost all of the in vivo domain-site specific interactions in our data set (41/43) we classified as binders. Moreover, 91% (31/34) of the in vitro interactions mapped to the correct affinity category, and 85% (35/41) of the in vivo interactions we identified as high-affinity, as expected for relevant interactions among numerous competitors.
Next we compared our interactions to those predicted by two algorithms trained on in vitro peptide libraries, Scansite (21) and SMALI (22) (Fig. 4D). Interactions that produce a probability value or score above a threshold are predicted to be high-affinity binders. Consistent with this, our interaction set almost completely encompasses the predictions from these two algorithms, the bulk of which mapping to the highaffinity class (ϳ70%, Fig. 4D).
To assess the performance of CPCMAs relative to the preceding quantitative PMA and HTFP platforms, we compared an overlapping set of 64 and 84 interactions, respectively (31,32,64). Our comparisons show that the CPCMAs are ϳfivefold more sensitive, have improved reproducibility, and more consistently align with SMALI and Scansite algorithmic pre-dictions, known binding motif preferences, affinity measurements, and in vivo interactions (see Supplemental Results in supplemental material, supplemental Fig. S4D and S4E, supplemental Table S5).
Systems Level Analyses of the Quantitative SH2 Domainphosphotyrosine Interactome-The CPCMA platform yielded a high quality quantitative interactome of 280 interactions between 92 peptides (89/92 of these peptides are physiological, 69/89 currently annotated as phosphorylated in vivo, with the remainder potentially phosphorylated) and four SH2 domains. A network graphic depicting interactions between SH2 domains and physiological peptides, along with interaction strength, phosphorylation data, and in vivo interaction metadata parsed into protein-protein, domain-protein, and domain-site categories is supplied as supplemental Fig. S5. Given that over one-third of our peptides have never been queried in high throughput before and preceding approaches suffered from low sensitivity and/or lack of quantification (see Supplemental Results), this interactome is considerably novel. Unlike preceding interactomes, here interactions can be prioritized based on affinity and (for quantitative peptides) nonbinding events, or negative information can also be incorporated into analyses of signaling systems. The phosphorylation data and in vivo interaction data can also be used to prioritize interactions for further investigations. For instance, an interaction between a domain and a site, which is validated in vivo, lends greater confidence that a high affinity interaction identified for another domain without in vivo information may be physiologically relevant, simply because this phosphorylation site has already mediated a functional SH2 domain interaction.
Peptides that bind to multiple domains are potential sites of coregulation, possibly serving as critical points for pathway cross-talk. The four SH2 domains examined here are from different functional subfamilies where binding motifs differ with little overlap (17,19,33,37) and therefore high affinity binding overlap may be functionally relevant in vivo. Previous work in bacteria and yeast have shown that pathway crosstalk mediated by interactions between domains or kinases and short specificity determining peptide motifs may be subjected to negative evolutionary pressure (10,43), resulting in highly specific domain-peptide interactions with little binding overlap between domain family members. In higher eukaryotes, such as humans, this may not be the case at the interaction site level given the role of tissue specific expression in mediating signaling specificity. In addition, because of the expansion in the size of protein domain families in higher eukaryotes, secondary contacts (interactions outside of the primary domain-ligand binding interface), and coincident detection (multiple domain-ligand interactions are needed for a functional binding interaction) may play a more prominent role in insulating signaling pathways compared with the primary interaction interfaces.
The degree of interaction overlap between SH2 domains has heretofore not been examined with this level of quantitative resolution. Therefore, it is not known if the degree distribution of the interaction network is a function of interaction affinity. If pathway cross-talk is subjected to negative selection in the SH2 domain phosphopeptide interactome at the level of the primary interaction interface, the relative number of promiscuous peptides should decrease as the affinity of the interactions (and presumably the physiological relevancy) examined increases. Consistent with this hypothesis, promiscuous peptides within our high affinity subset are rare (Fig. 5A). Considering only high-affinity interactions, 63 peptides bind to at least one SH2 domain, whereas only six peptides bind to three SH2 domains and no peptides bind to all four domains. Indeed, a comparison of the degree distributions between the entire quantitative data-set (n ϭ 81 peptides and 267 interactions) and another for the high affinity subset (n ϭ 63 peptides and 102 interactions) reveals that node degree is a function of interaction affinity (supplemental Fig. S6A and S6B, Fisher's Exact Test p ϭ 1.0 ϫ 10 Ϫ7 ). This finding is consistent with the hypothesis that physiological domain-motif interactions evolved with signaling specificity as a constraint. To our knowledge, this is the first quantitative examination of this property for any domain-motif interactome in higher eukaryotes, and therefore may be representative of other families, such as the SH3, PDZ, Chromo, and Bromo domain families.
The binding mode of individual domains is best assessed quantitatively so that the distribution of binding affinities can be judged. The high sensitivity and quantitative accuracy provided by the CPCMAs enabled, for the first time a quantitative assessment of protein domain binding distributions ( Fig 5B). Here, we tested the hypotheses that each domain has a similar central tendency and distribution shape. This novel data analysis mode revealed that SH2 domains do not differ greatly in their average affinity for physiological phosphopeptides (ANOVA p ϭ 0.182, Kruskal-Wallis p ϭ 0.362, supplemental Fig. S6C). However, there were differences in promiscuity (number of interactions/domain) and selectivity (dynamic range of interactions) between the domains. Q-Q plots of the binding distributions for the four domains reveal similar centers for all the distributions but a large difference in slope for the ABL1 SH2 domain compared with the others, indicating less spread (selectivity) for its interactions (supplemental Fig. S6C). Two-way Anderson-Darling tests with multiple testing correction also supports a difference between the distributions uniquely for ABL1 at the p ϭ 0.05 significance level.
These systems level quantitative properties are likely to have important consequences in vivo. For ABL1, the high interaction promiscuity (large number of binders) and moderate selectivity (tight affinity distribution) is consistent with the role of the SH2 domain within the full length tyrosine kinase. There it enables the hyper-phosphorylation of diverse target proteins by tethering peptide ligands after initial phosphorylation, allowing processive phosphorylation of adjacent target sequences in cis (65). The binding profiles of the adapter proteins GRB2 and NCK1 are also consistent with their biological function, as they discriminate between potential ligands and also achieve high-affinity anchoring interactions to ensure effective signal transduction. Similarly, a highly-selective binding profile with high-affinity binding is consistent with the role of the N-terminal SH2 of the SHP2 tyrosine phosphatase. Here, SH2 ligands of SHP2 simultaneously release intramolecular inhibition and anchor the phosphatase to sites of intended action (66).
Quantitative Affinity Matrix Construction and Interaction Prediction-Previous PSSMs provide insight into the sequence motifs critical for high-affinity binding (supplemental Table S3). However, because these matrices were trained by sequence frequency and not interaction affinities, they may be sensitive to the binding/nonbinding threshold while ignoring information from low affinity interactions. The combined high sensitivity and quantitative capacity of our approach enables for the first time the application of a quantitative binding motif summarization methodology to a HTP domain-peptide data set. Here, we have applied the stabilized matrix method (SMM) (52), which scores each amino acid at every position according to its approximated contribution to the binding energy to produce an affinity matrix.
All of the matrices produced here are consistent with previously reported motifs (Fig. 5C) while yielding further information. For instance, the core pY-X-N-X and pY-[D/E]-X-[V/P] motifs for GRB2 and NCK1, respectively, are apparent, whereas a preference for C-terminal acidic residues is seen here for NCK1. Likewise, SHP2 has an extended motif for hydrophobic residues at the Ϫ2 and the ϩ1 through ϩ4 positions. We verified the predictive capacity of our matrices using cross-validation. This analysis revealed excellent classification accuracy for the GRB2 and NCK1 motifs, good accuracy for SHP2, and poor accuracy for ABL1, likely because of the high promiscuity and narrow interaction affinity distribution for this particular domain (supplemental Fig. S7A-S7D). For all SH2 domains except ABL1, the predictive accuracy is improved when the classifier boundary is restricted to high-affinity interactions, which are likely most important for biological function (supplemental Fig. S7A-S7D). Overall, the performance of these motifs demonstrates the value of quantitative affinity estimates for ligand discrimination, and they serve as a useful guide for identifying residues impacting binding specificity (see below).
Motif Targeted Investigation of Contextual Affinity Modulation-Selectivity summarization matrices are a fundamentally useful tool for visualizing general sequence preferences and in this case predicting binding interactions. However, because of the independence assumption they cannot incorporate contextual interplay between AAs into predictions. Although the entire sequence always affects binding, context is most important for high-affinity interactions anchored by hot-spot residues, such as ϩ2 Asn for GRB2. Neighboring residues modulate affinities, contributing to promiscuity and the assignment of binding priority across multiple domains (25,27). We (27) and others (67) have shown that the GRB2 SH2 domain is sensitive to particular nonpermissive residues within the pY-X-N-X motif. To investigate if contextual interplay can be resolved by the CPCMAs, we parsed our 19 pY-X-N-X peptides into two groups depending on whether nonpermissive residues are in the motif. Indeed, we found a highly significant (p Ͻ 0.001) reduction in binding affinity for peptides containing nonpermissive residues (supplemental Fig. S7E).
Given that the CPCMA platform was able to resolve submotif differences in affinity, we investigated interaction overlap within a core-motif defined context, focusing on GRB2 and NCK1. The matrices for these domains display excellent predictive accuracies (AUC ϳ 0.9), agree with the literature, and contain known hot-spot residues. We show peptide affinities for pairs of SH2 domains in Fig. 5D. Consistent with the affinity defining nature of the core motifs, all known domain and site-specific in vivo interactions (red dots) and points of coregulation (multiple interacting domains, red lines) are within the high-affinity subset. Generally, high-affinity overlap is common with ABL1 because of its promiscuous nature, yet much less common for the other three domains.
Contextual information modulates interaction crossover for many peptides. For example, GRB2 and SHP2 both interact with IRS1 Y895 (68) in vivo, and are predicted here to interact with the Y39 peptide RVQIYHNPTAN within vasodilator-stimulated phosphoprotein (VASP). ABL1 also binds with highaffinity to both peptides, and is known to phosphorylate VASP Y39 in vivo (69). Although IRS1 Y895 contains no hot-spot residues for NCK1, VASP Y39 contains a ϩ3P. However, this peptide lacks other permissive residues beyond the ϩ3 position for NCK1, preventing crossover with this domain (Fig.  5C). Perhaps not surprisingly given the impact of contextual information, core-motif overlap does not guarantee binding overlap, nor does the absence of motif overlap prevent binding overlap. For example, of the two peptides that conform to both the GRB2 and NCK1 core-motifs, only one peptide binds with high affinity to both domains. The transmembrane adaptor protein non-T cell activation linker (NTAL) Y136 peptide DANSYENVLIS only binds strongly to GRB2. This selection may be accomplished by a combination of a Ϫ1 Ser and ϩ4 Leu, which are detrimental for NCK1 (Fig. 5C). Similarly, ABL1 prefers a ϩ2 Asn and ϩ3 Pro, yet the HER4 site Y1162 PKQEYLNPVEE is this domain's lowest affinity pY-X-N-X peptide, perhaps because of a ϩ1 Leu (Fig. 5C). Conversely, this peptide binds to NCK1 because of its acidic C terminus, despite not conforming to the NCK1 core-motif (Fig. 5C). This motif-focused analysis highlights the exquisite specificity in PID-peptide interactions, which can only be revealed through quantitative affinity measurements. DISCUSSION We find that the CPCMA based approach has greatly improved sensitivity, reproducibility, and quantitative resolution compared with earlier methods, allowing us to examine PID domain specificity space with confidence. The performance of the CPCMA platform was achieved through the careful application of a combination of critical experimental and computational processing decisions. New tools, such as the ArrayMix™ and ArraySlide™ enabled prolonged, reproducible, kinetically controlled incubations while novel applications of existing tools, such as the CBD for quality control enabled the application of systematic error corrections. Here we provide an approach that addresses issues related to the adequate display of peptide, optimal incubation conditions, image capture, correction of systematic error, estimation of affinity, and categorization of interactions. The procedure developed herein is therefore the most comprehensive approach for mapping PID specificity to date. At the most direct level, our approach enables a confident assessment of binding priority between competing PIDs for binding sites. This ability is critically important for the development of testable hypotheses examining time-dependent signal propagation and information integration at hub proteins (70), which are often aberrant processes in cancers (42,71) and require the processing of high and low affinity interactions (34,(72)(73)(74). In addition, the CPCMA platform is capable of explicit measurement of nonbinding events, can resolve interplay between neighboring amino acids within a ligand, does not require the purification of both interacting components, and is economical to employ. Taken together, the CPCMA platform provides an appealing choice relative to existing alternatives for all manner of PID specificity investigations (supplemental Table S6). In support of this view, a number of new analysis modes for investigating PID specificity space were enabled by our approach.
Here we provided, to our knowledge for the first time, a quantitative investigation of domain specific binding distributions. These PID affinity signatures likely play an important role in shaping the systems level function for a given domain. For instance, a "promiscuous," or highly connected domain may bind to many peptides but do so at low affinity, thereby producing interactions unlikely (in the absence of other in vivo mechanisms such as spatial clustering or interactions potentiated by other domains) to be competitive with other domains in vivo. However, this may not be the case if multiple interaction sites are available on the same protein (in cis), where avidity may enable successful competition with domains that have higher interaction affinities. We found the promiscuity and selectivity signatures of SH2 domain family members aligned with systems level protein function. For instance, the ABL1 SH2 domain demonstrated a promiscuous but nonselective distribution signature. A common moderate affinity for its many ligands ensures that ABL1 SH2 interactions are readily outcompeted by other phosphopeptide binding domains and may be an important component of the ABL kinase processive hyper-phosphorylation ability (65,75). Together these properties may help confer ABL family kinases with the dynamism required to rapidly transition within and between its diverse sites of regulation (76). SH2 domain dependent processive multisite phosphorylation of diverse substrate proteins is also a hallmark of SRC family kinases (77), and it remains to be seen if this signature is common between these two families. More generally, how PID affinity signatures cluster within families, have changed during evolution, and can inform drug and biological circuit design are important future directions for exploration.
We demonstrated the ability of the CPCMA platform to resolve contextual contributions to binding affinity and produce predictive PID specific affinity matrices from a single quantitative data set. To our knowledge, this is the first demonstration of these abilities by a quantitative high-throughput platform and the first demonstration of the ability to resolve nonpermissive contributions to binding affinity in highthroughput. Previous work has incorporated quantitative high-throughput interaction information from multiple PIDs into unified interaction prediction algorithms (11,28,78). Although useful for predicting interactions for uncharacterized or under-characterized PID family members, such approaches are sensitive to the sequence similarity of test domains to the domains used in the training set (28). Going forward, we suggest that models trained incorporating the idiosyncratic contextual binding preferences of single domains are more likely to fully represent ligand binding preferences than those trained from entire PID families. Considering the significant impact of contextual binding information on PID specificity space (15), the combination of larger CPCMA derived interaction data sets with algorithms capable of in-corporating residue correlations is therefore an important future application. For instance, when such models are combined with in vivo information more accurate network (re)constructions are possible (14,17).
Using quantitative CPCMAs we recapture general themes derived from preceding PID specificity studies and demonstrate the ability to extend these concepts in new ways. PID-ligand interactions serve to transmit specific cell signals in a dynamic fashion and are therefore thought to possess modest interaction affinities (1). Consistent with previous reports to this effect (33,34,79) and recent findings that very high affinity PID-peptide interactions can impair cell signaling and viability (80,81), we found our SH2 domain interactions span a moderate affinity range. Within a PID family, domain selectivity profiles cluster into separate binding motif classes (23,33,37,82) that are aligned with sequence similarity and protein function (40,83). Yet, likely as a consequence of selection against the emergence of signaling systems with excessive cross-reactivity (10,43), PID family members have selectivity preferences that are evenly distributed across a continuum (11). This results in PID family specific interaction networks with perhaps surprisingly low levels of binding-site competition (12). Given the relatively evenly distributed selectivity profiles across family members and the moderate binding affinity norm, overlapping high-affinity interactions are thought to be uncommon, but potentially important sites of signaling cross-talk.
Here, we also observe low levels of high affinity interaction overlap across four SH2 domains with unique selectivity profiles. We demonstrate, for the first time that the degree distribution of a PID derived interactome is a function of affinity, with node degree decreasing with increasing affinity threshold. Moreover, consistent with the notion that CPCMA identified sites of high-affinity interaction overlap may represent rare sites of signaling coregulation, the vast majority of the known domain-site specific in vivo interactions map to our high-affinity subset, including some overlapping pairs known to perform a coregulatory function. For instance, we identified a high affinity overlap site between GRB2 and SHP2 at IRS1 Y895 that is known to function as a phenotypic switch regulating differentiation and proliferation in vascular smooth muscle cells. Here, SHP2 dephosphorylates this site upon stimulation with insulin-like growth factor-I, which serves as a switch to block GRB2/SOS mediated migration and proliferation while maintaining the PI3K/AKT signaling pathway, and thereby the differentiated cell state (68). SHP2 mediated dephosphorylation does not occur in dedifferentiated cells, resulting in cell proliferation and migration (68).
The ligand sequence determinants responsible for directing this high-affinity PID overlap have important implications within a systemic context, especially for ligands like phosphopeptides, which can serve as binding sites for kinases, phosphatases, and multiple PIDs. As demonstrated here using affinity matrices as a guide, core-motif defining hotspot resi-dues are necessary prerequisites for high affinity interactions. However, consistent with our previous analysis of SH2 domain binding specificity (27), we find that high-affinity interaction overlap is not solely directed by core motif overlap, and subtle combinations of permissive and nonpermissive residues play important roles in defining specificity space. This concept seems to be a general one for PIDs (15), and highlights that selectivity space-as defined by preferred binding motifs where residues are assumed to contribute independently to binding affinity-does not predetermine specificity space, and many physiologically important interactions may be dictated by submotif contextual factors. Because of the low sensitivity, quantitative resolution, or inherent limitations of previous high throughput techniques, such important modulations have remained elusive. The quantitative CPCMA based approach developed here is well poised to help address the functional role each individual residue within a ligand plays in directing cellular signals.
Further application or in vivo investigation of findings derived from in vitro PID-peptide interaction mapping should be applied in a context specific fashion. Fundamentally, interacting proteins must be coexpressed and colocalized for an interaction to occur. Given this, although it is generally true that interactions between a given PID and its ligand are necessary for an interaction to occur in vivo, they are not always sufficient. There are multiple instances where secondary contacts (interactions outside the ligand motif and binding interface of the PID) are also necessary for a biological outcome (5,15,84). These "secondary" interactions can be mediated by alternative regions of the domain containing protein or of the domain itself. In addition, the functional significance of any given biophysical interaction identified in vitro may be contingent upon combinatorial interactions between a given PID and multiple ligands (35) or between multiple domain-ligand pairs (85). Because these and other mechanisms exist in vivo to potentiate the functional consequences of low-affinity interactions, it is unwise to ignore such interactions in follow-up investigations. However, given the concordance of in vivo corroborated interactions and decrease in ligand promiscuity within the "high affinity" subset of our results, it is likely that functional interactions within this subset rely less on these mechanisms than those within the "low affinity" subset, and therefore are the priority subset for downstream investigation.
Regardless of the affinity, many of these interactions take place only when a cell responds to an external stimulus. Therefore, time resolved quantitative in vivo analysis approaches that can resolve the state of protein complexes such as those based on affinity purification mass spectrometry (86 -88) hold promise in the effort to tease apart the functional significance of putative PID-ligand interactions. In this way, in vitro interaction mapping and in vivo protein complex analysis provide highly complementary systems level approaches for understanding and manipulating cell signaling circuits.
The CPCMA based approach developed here can be readily applied to other PID ligand interaction systems. In addition to other natural systems, our approach is compatible with those incorporating unnatural amino acids, peptoids, and small molecules. Unlike other techniques, our approach does not require the maintenance of protein structure at an interface, the purification of each interacting component, sophisticated microfluidics, or imaging apparatus, is capable of multiplexing, and can use existing commercially-available equipment and open-source software suites. Taken together, the CPCMA platform should have broad applicability within the protein interaction network field to discover new interactions, new interaction features, and more effectively investigate emergent systems level properties of interactomes.