Alignment and quantification of ChIP-exo crosslinking patterns reveal the spatial organization of protein–DNA complexes

Abstract The ChIP-exo assay precisely delineates protein–DNA crosslinking patterns by combining chromatin immunoprecipitation with 5′ to 3′ exonuclease digestion. Within a regulatory complex, the physical distance of a regulatory protein to DNA affects crosslinking efficiencies. Therefore, the spatial organization of a protein–DNA complex could potentially be inferred by analyzing how crosslinking signatures vary between its subunits. Here, we present a computational framework that aligns ChIP-exo crosslinking patterns from multiple proteins across a set of coordinately bound regulatory regions, and which detects and quantifies protein–DNA crosslinking events within the aligned profiles. By producing consistent measurements of protein–DNA crosslinking strengths across multiple proteins, our approach enables characterization of relative spatial organization within a regulatory complex. Applying our approach to collections of ChIP-exo data, we demonstrate that it can recover aspects of regulatory complex spatial organization at yeast ribosomal protein genes and yeast tRNA genes. We also demonstrate the ability to quantify changes in protein–DNA complex organization across conditions by applying our approach to analyze Drosophila Pol II transcriptional components. Our results suggest that principled analyses of ChIP-exo crosslinking patterns enable inference of spatial organization within protein–DNA complexes.


INTRODUCTION
Each cell type is defined by a unique gene expression program, which is in turn determined by the activities of regulatory proteins binding to promoters, enhancers, and other genomic regions. Genomic regulatory regions are bound by particular combinations of sequence-specific transcription factors (TFs), co-regulators, and chromatin modifiers in a spatiotemporal dependent manner. While large-scale efforts are underway to map and functionally characterize potential regulatory regions (1,2), we still know relatively little about the structure and organization of individual protein-DNA complexes along the genome. To fully understand how gene regulatory programs are coordinated, it will be crucial to characterize precisely how regulatory complexes are assembled and organized.
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) enables genome-wide localization of regulatory proteins. However, the spatial resolution of ChIP-seq is limited, as chromatin fragmentation strategies can result in sequencing reads that map several hundred base pairs away from the site bound by the protein of interest. Therefore, while integrative analyses of ChIP-seq data collections can find groups of co-bound regulatory proteins (3)(4)(5), such analyses provide only limited insight into the spatial organization of proteins within regulatory complexes.
In contrast to ChIP-seq, ChIP-exo and related assays (e.g. ChIP-nexus (6)), precisely define protein-DNA binding locations via the use of lambda exonuclease (7). The exonuclease digests protein-bound DNA in a 5 to 3 direction and, on average, stops at 6 bp before a protein-DNA crosslinking point. The ChIP-exo tag distribution at a given regulatory region is thus the product of crosslinking events that formaldehyde or other crosslinking agents have induced between the targeted protein and DNA.
ChIP-exo's ability to map crosslinking signatures suggests a strategy for characterizing the spatial organization of regulatory complexes. At sites where a sequence-specific TF is bound directly to its cognate motif, the dominant crosslinking signature should result from direct interactions between the TF's residues and proximal DNA bases. However, regulatory proteins that alternatively (or additionally) interact with DNA via protein-protein interactions should display crosslinking signatures related to the TFs that recruit them. Since the physical distance of a regulatory protein to the recruiting TF will affect crosslinking efficiencies, different members of a regulatory complex should display distinct crosslinking patterns across a regulatory region. In principle, then, analysis of ChIP-exo crosslinking patterns should enable some degree of inference regarding the spatial organization of regulatory proteins within a protein-DNA complex.
Previous work suggests the feasibility of inferring the spatial organization of protein-DNA complexes via ChIPexo crosslinking analysis. ChIP-exo analysis of yeast general transcription factors found that crosslinking patterns at Pol II promoters were consistent with those expected from crystallographic models of the transcriptional machinery (8). ChIP-exo characterization of ribosomal protein gene (RPG)-specific factors introduced the idea that the ordering of indirect protein-DNA interactions can be inferred from analysis of crosslinking efficiencies (9). Specifically, high-resolution analysis of ChIP-exo crosslinking patterns at sites bound by the sequence-specific TF Rap1 show the same crosslinking pattern echoed in ChIP-exo experiments targeting Sfp1, Ifh1, and Fhl1, suggesting that these factors may be indirectly recruited to DNA by Rap1. We and others have made use of this concept to characterize indirect protein-DNA interactions in other systems. For example, in some mammalian cell types, both Glucocorticoid Receptor and Estrogen Receptor alpha may be indirectly recruited to certain binding sites via protein-protein interaction with FoxA TFs, as evidenced by near identical crosslinking patterns at those sites (10)(11)(12). One limitation of previous approaches is that they have relied on TF binding motifs or known genomic anchor points to align bound sites before characterizing crosslinking patterns (7,9,13). Naturally, such strategies limit the usefulness of ChIP-exo crosslinking analysis to protein-DNA complexes where the focal point of spatial organization is already known.
In this work, we formalize concepts suggested by previous studies by presenting ChExAlign, a systematic approach for characterizing ChIP-exo crosslinking patterns across multiple members of a protein-DNA regulatory complex ( Figure 1). Specifically, we first develop a multiple alignment procedure for characterizing consistent ChIPexo crosslinking signatures across multiple ChIP-exo experiments and across multiple regulatory regions. To make the alignment approach broadly applicable to different types of protein-DNA complexes, our procedure does not rely on sequence features or other genomic annotations, but rather directly aligns multi-protein ChIP-exo tag profiles. While there has been limited work on aligning broad tag distributions such as those from histone modification ChIP-seq and ChIP-chip data (14)(15)(16)(17), our approach is optimized for high-resolution, strand-separated ChIP-exo data. Given a multiple alignment of ChIP-exo tag profiles, our approach next applies a probabilistic mixture model to deconvolve individual protein-DNA crosslinking events. This approach allows consistent quantification of crosslinking strengths across multiple proteins in a regulatory complex. Finally, we apply principal component analysis (PCA) to visualize similarities between the crosslinking preferences of the regulatory proteins.
We demonstrate the utility of our approach by applying it to characterize the spatial organization of three distinct regulatory complexes. We first apply our method to yeast RPG ChIP-exo datasets in order to show that multiple profile alignment can be used to automatically align collections of ChIP-exo data across a collection of coordinately regulated regions. While previous analyses of the RPG ChIPexo data relied on a manual alignment around a sequence motif feature (Rap1 binding sites), our ChIP-exo alignment approach yields similar alignments, and the same biological conclusions, without knowledge of DNA sequence features. Secondly, we extend our analyses to 12 novel ChIP-exo datasets that characterize the occupancy of regulatory complexes at yeast tRNA genes. Due to the variable length of tRNA intragenic promoters, we extend our multiple profile alignment procedure to account for affine gaps. We further demonstrate that crosslinking event detection and quantification yields insight into the spatial organization of individual proteins within the Pol III transcriptional machinery. Finally, we demonstrate that crosslinking analysis provides a quantitative framework for characterizing changes in regulatory complex organization across conditions. By applying our methods to a collection of ChIP-nexus data that profile Drosophila Pol II transcriptional components under two experimental conditions, we demonstrate that we can quantify the degree to which individual proteins are relocalized when transcriptional initiation is inhibited.
In summary, our approaches provide a novel platform for examining the spatial organization of protein-DNA complexes from collections of high-resolution ChIP data.

Pairwise alignment of ChIP-exo profiles with affine gap penalties
To align multi-experiment ChIP-exo tag profiles across multiple genomic regions, we extend an affine gapped, overlap alignment version of the Needleman-Wunsch algorithm (18,19). The Needleman-Wunsch algorithm is popularly used to perform global alignment of protein or nucleotide sequences. Here, rather than aligning discrete residue sequences, our approach aligns real-valued matrices of ChIPexo tag counts. For a given length L genomic region, the corresponding input matrix contains rows for each of the two DNA strands in each of K ChIP-exo experiments. The elements in each row contain the number of ChIP-exo tag 5 positions (stranded) mapped to each base in the region. Thus, the dimensionality of each input matrix is 2K × L, and each ChIP-exo tag is counted in only a single bin of the matrix. The goal of the pairwise alignment procedure is to form a global alignment between ChIP-exo profile matrices corresponding to genomic regions X = (x 1 , . . . , x i , . . . , x L ) and Y = (y 1 , . . . , y j , . . . , y L ).
As in the original dynamic programming alignment with affine gap penalties, we construct matrices M, I x , and I y , indexed by i and j, where the value M(i, j) is the best score up to (i, j) given that x i is aligned to y j , I x (i, j) is the best score given that x i is aligned to a gap, and I y (i, j) is the best score given that y j is an insertion with respect to x. We use an affine gap cost structure to impose different penalties for opening and extending a gap of length g as γ (g) = −d − (g − 1)e, where d is the gap-open penalty and e is the gap-extension penalty. In this study, we use e = 0.1d. The recursion relationships are unchanged from the original algorithm: The similarity score of x i and y j coordinates, s(x i , y j ), is computed using the Pearson correlation coefficient as follows: q kw and q kc are the numbers of ChIP-exo tag 5 positions mapped from experiment k to the Watson and Crick strands (respectively) at the x i coordinate. p kw and p kc are the numbers of ChIP-exo tag 5 positions mapped from experiment k to the Watson and Crick strands (respectively) at the y j coor-dinate. The above Pearson correlation metric scores are normally distributed between -1 and 1 for pairs of randomly generated profile columns, and we confirmed that the above score yields the expected values for perfectly correlated and anti-correlated vectors (Supplementary Figures S4 and S5).
Alignment scores between random regions are normally distributed, and low compared with the scores observed between pairs of RPG regions (Supplementary Figure S6). We use gap opening scores of d = 50 in analyzing Pol III transcriptional complex at tRNA genes, d = 100 for Drosophila Pol II components, and d = 200 for analyzing RPG-specific factors. These values were chosen to encourage fewer gaps for the latter two settings. However, we observed that similarity scores between RPG regions (which are expected to contain gaps) are stable across a wide range of settings for gap opening (d) and gap extension (e) values (Supplementary Figure S7).

Progressive alignment of multiple regions
To build multiple alignments, we progressively align nodes by a succession of pairwise profile alignments. Nodes initially contain single region profiles. Progressive alignment proceeds by aligning the most similar pair of profiles, and merging the aligned pair into a composite profile generated according to the gapped alignment. Gaps are represented as an additional one-dimensional array alongside the ChIPexo profiles. Similarities are then recalculated between the aligned profile and all remaining regions. We repeat these steps until all the nodes are merged. After the final multiple alignment of all regions has been constructed, we subtract background signals from the composite ChIP-exo profiles. We first use NCIS (20) to scale a control experiment to each analyzed signal ChIP-exo experiment. We then subtract scaled per-base control tag counts from each signal ChIP-exo experiment.

Deconvolution and quantification of crosslinking profiles using a mixture model
After constructing composite ChIP-exo profiles representing the multiple alignment of analyzed regions, we aim to locate and quantify protein-DNA crosslinking positions within the profiles. Our approach models ChIP-exo composite data as being generated by a mixture of crosslinking events, and an Expectation Maximization (EM) learning scheme is used to probabilistically assign sequencing tags to crosslinking positions. By estimating crosslinking positions in the composite profiles, we implicitly assume that protein-DNA crosslinking patterns are consistent at all aligned regions, albeit with spacing differences between crosslinking sites as modeled by the alignment gaps.
Pr(r n |x) gives the probability of observing ChIP-exo tag r n from a crosslinking event located at genomic coordinate x, and is defined by a pair of Gaussian distributions (one on the positive strand, one on the negative strand) with = 6 bp and offset from each other by 12 bp (positive strand to the left). We define a vector of component locations μ where μ j is the genomic location of the crosslinking event j. We initialize potential crosslinking events, M, such that they are spaced in 5 bp intervals along the window. The overall likelihood of the observed set of tags, r, given the crosslinking event positions, μ, the binding event mixture probabilities (i.e. crosslinking event strengths), π , is given by: where N is the number of ChIP-exo tags that have been mapped to genomic locations.
To limit the number of modeled crosslinking positions, we place a sparseness promoting negative Dirichlet prior, α, on the crosslinking strength π (21). The latent assignments of tags to crosslinking events are represented by the vector z. The complete-data log posterior is as follows: The E-step that calculates the relative contribution of each crosslinking event in generating each tag is: The α parameter can thus be interpreted as the minimum number of ChIP-exo tags required to support a crosslinking event active in the model. MAP values of μ j are determined by enumerating over several possible values of μ j . Specifically, the MAP estimation of μ j is: where x starts at the previous values of the position and expands outwards to 50 bp each side. If the maximization step results in two components sharing the same positions, they are combined in the next iteration of the algorithm.
Crosslinking positions are simultaneously modeled across all analyzed ChIP-exo experiments in a given aligned profile, and information about crosslinking position estimates are shared between experiments at each EM step using a positional prior strategy described in the MultiGPS algorithm (22). This positional prior encourages consistency in estimated crosslinking positions across ChIP-exo experiments. Our rationale is that a given protein complex will be crosslinked to DNA at relatively few positions, but the signatures of these crosslinks will be present across experiments characterizing multiple members of the complex.
Finally, the relative crosslinking strengths of all estimated crosslinking points are quantified across all ChIP-exo profiles using Maximum Likelihood assignment of tag counts, yielding a matrix of crosslinking strengths.

Visualization of crosslinking relationships
The matrix of crosslinking strengths is used to assess the relationships between the crosslinking preferences of each protein in a protein-DNA complex. We normalize the crosslinking strength matrix such that the sum of crosslinking strengths for each protein is 1. We then apply a standard PCA to visualize the relationships between protein-DNA crosslinking preferences.

ChIP-exo experiments and processing
Tandem affinity purification (TAP)-tagged Brf1, Bdp1, Rpo31, Rpc17, Rpc40, Tfc1, Tfc3, Tfc4, Tfc6, Tfc7, and Tfc8 Saccharomyces cerevisiae strains in a BY4741 back-Nucleic Acids Research, 2020, Vol. 48, No. 20 11219 ground were obtained from Open Biosystems. TAP-tagged S. cerevisiae cultures were grown in yeast peptone dextrose (YPD) media at 25 • C to an OD 600 = 0.8-1.0. ChIP-exo assays were performed as previously described (13,23). In brief, TAP-tagged S. cerevisiae cultures were treated with 1% formaldehyde for 15 min. Reactions were quenched. Cells were disrupted by bead beating, and chromatin pellets were washed. Chromatin was solubilized by sonication. Extracts were incubated with rabbit IgG antibodies (Sigma, i5006) coupled to protein A sepharose. The first adaptor was ligated to the ChIP DNA while immobilized on beads. Washed immobilized DNA was digested in the 5 -3 direction with lambda exonuclese. Next, the material was eluted, and the second adaptor ligated to the exonuclease treated end. The resulting libraries were sequenced using Illumina machine. Mock IP control ChIP-exo experiments in yeast were performed using rabbit IgG (Sigma, i5006) in the BY4741 background strain (which does not contain a TAP tag sequence).
Libraries were paired-end sequenced and read pairs were mapped to the sacCer3 genome using BWA version 0.7.12 with options "mem -T 30 -h 5". Read pairs that share identical mapping coordinates on both ends are likely to represent PCR duplicates, and so Picard (http://broadinstitute. github.io/picard) was used to de-duplicate such pairs. Reads with MAPQ score <5 are filtered out using samtools (24).

Drosophila TSSs
We use TSS annotations from dm3 refGene protein-coding genes. To annotate a consensus TSS position within the aligned profiles, we use the position that contains the highest TSS frequency in the underlying aligned regions (i.e. the mode TSS position).

Motif analysis
We ran MEME-ChIP version 4.10.0 (28) on tRNA gene sequences to characterize the Box A and Box B motifs. Then, we scan 400 bp of regions used for alignment with the discovered motifs using a log-likelihood scoring threshold of 0.1% per base FDR defined using a second-order Markov model based on yeast genome nucleotide frequencies. We obtained the Rap1 cognate DNA-binding motif positions by scanning the corresponding cis-bp database motif (M4379 1.02) (29) in 1400 bp regions centered around TSSs using a log-likelihood scoring threshold of 0.1% per base FDR defined using a second-order Markov model based on yeast genome nucleotide frequencies.

ChExMix peak calling
We run ChExMix version 0.42 (11) with default parameters on Pol II ChIP-exo data in Kc167 cells and obtained the top 500 most enriched Pol II peaks using q-value.

ChExAlign overview
ChExAlign is a computational framework that aligns and quantifies ChIP-exo crosslinking patterns from multiple proteins across a set of coordinately bound regulatory regions. ChExAlign does not use sequence motif features and operates directly on the multi-protein, strand-separated ChIP-exo tag patterns. ChExAlign's multiple alignment procedure starts by obtaining strand-separated, per-base tag count profiles composed of data from each of the analyzed ChIP-exo experiments within a fixed-sized window at the defined input regions ( Figure 1A). Each region's multidimensional profiles are normalized by dividing all values by the maximum tag count of each protein in a region. ChExAlign uses an overlap Needleman-Wunsch alignment algorithm (18,19) with affine gap penalties to find the optimal pairwise alignments between all input regions. Within the pairwise alignment procedure, the similarities between each pair of positions is defined using Pearson correlation. Next, a progressive profile multiple alignment procedure is applied to produce a multiple profile alignment. Pairs of multi-dimensional profiles are progressively combined in order of their pairwise alignment similarity scores. After aligning a pair of profiles, the resulting composite profile is aligned against all remaining profiles. After the final multiple alignment of all regions has been constructed, we subtract background signals from the composite ChIP-exo profiles. We first scale a control experiment to each analyzed signal ChIP-exo experiment and then subtract scaled perbase control tag counts from each signal ChIP-exo experiment. The output of the multiple alignment procedure is a composite multi-dimensional profile that represents the aligned protein-DNA crosslinking signatures across all analyzed regulatory regions.
After producing the composite aligned ChIP-exo profile, ChExAlign quantifies protein-DNA crosslinking events within the complex ( Figure 1B). A probabilistic mixture model is applied to the composite profile to detect the positions of individual crosslinking events. Crosslinking strengths are estimated by assigning fractions of each protein's ChIP-exo signal to each detected crosslinking point. Lastly, dimensionality reduction is applied to visualize the relationships between crosslinking signatures for each protein within the regulatory complex. ChEx-Align takes mapped ChIP-exo experiments (e.g. BAM files) and starting coordinates (e.g. loci of interest) as inputs. Outputs are aligned ChIP-exo profiles and quantifications of protein-DNA crosslinking events across multiple proteins.

ChIP-exo profile alignment recovers the spatial organization of a regulatory complex at ribosomal protein genes
To demonstrate that ChExAlign provides an informative alignment of protein-DNA crosslinking patterns across sets of related regulatory regions, we first applied it to analyze the organization of a protein-DNA complex at yeast ribosomal protein genes (RPGs). Most yeast RPGs are coordinately regulated by a common set of regulatory proteins. The transcription factor Rap1 binds to cognate sequence motifs located 77-501 bp upstream of the transcription start site (TSS), and recruits Fhl1, Ifh1, and Sfp1. Hmo1 is also recruited at roughly half of the RPGs (9,30). Previous analyses of Rap1, Fhl1, Ifh1, Sfp1, and Hmo1 ChIP-exo data determined that the RPG regulatory complex has a well-defined spatial organization (9). Rap1 binding sites serve as an upstream boundary to the complex. Fhl1, Ifh1, and Sfp1 are almost identically positioned ∼100 bp downstream of Rap1, with some evidence of additional crosslinking through the Rap1 site (indicative of protein-protein interactions between Rap1 and the recruited factors). When present, Hmo1 occupies the region between Rap1 and the TSS, essentially overlapping where Fhl1/Ifh1/Sfp1 bind. Thus, a consistently organized regulatory complex is present in the upstream regions of most yeast RPGs, and this organization should in principle be recoverable by aligning ChIP-exo profiles that span the RPG regulatory regions.
A standard approach to analyzing collections of regulatory genomics data at a set of related gene loci might begin by producing composite profiles centered on the genes' TSSs. Applying this approach to the five ChIP-exo datasets at 134 RPGs produces a set of smooth composite profiles without any discernible organization between the members of the regulatory complex ( Figure 2A). Indeed, previous analysis has demonstrated that it is only when the five ChIPexo datasets are aligned by Rap1-bound motif locations (consistently oriented with respect to the RPG TSSs) that a more structured organization emerges from the data (9). Thus, the smooth profiles produced by a TSS-centric alignment are artefacts of the variable spacing between TSSs and the true organizing points of the regulatory complex (i.e. the Rap1 sites). We therefore asked whether ChExAlign's alignment procedure can recapitulate insights into the organization of the RPG regulatory complex without using sequence motif information or prior knowledge of Rap1 sites as the organizing loci.
We applied ChExAlign to produce an ungapped overlap multiple profile alignment across data from 5 ChIPexo experiments (Rap1, Flh1, Ifh1, Sfp1, and Hmo1) taken from 1,400bp windows centered on the TSSs of 134 yeast RPGs ( Figure 2B). Our alignment recovers sharply distributed composite profiles across the five factors (Figure 2B, E). The aligned composite plots enable some degree of inference regarding the organization of the regulatory complex. For example, high-resolution analysis of the crosslinking patterns displayed at the Rap1 binding sites suggest that Fhl1/Ifh1/Sfp1 are indirectly bound through protein-protein interactions with Rap1 ( Figure 2F). In addition, the multiple profile alignment appropriately clusters the subset of RPGs that displays Hmo1 enrichment, consistent with previous observations (9,30). Importantly, even though ChExAlign does not use sequence information during the alignment procedure, the multiple profile alignment induces an alignment of the underlying Rap1 motif sequences ( Figure 2C, D). Therefore, ChExAlign can ac-curately align multi-dimensional ChIP-exo profiles across a set of coordinately regulated regions, enabling insights into the organization of regulatory proteins within the regions.

Gapped ChIP-exo profile alignment enables consistent analysis of protein-DNA crosslinking patterns across 12 regulatory proteins at tRNA genes
Having demonstrated that multiple profile alignment can recover informative protein-DNA crosslinking patterns in a set of regions that share a tightly organized regulatory complex, we next aimed to demonstrate that the alignment procedure is robust to cases where the protein-DNA regulatory complex contains a more variably spaced organization. We chose to focus on protein-DNA interactions in the yeast Pol III transcriptional machinery at tRNA genes, as tRNA genes with varying lengths contain regulatory elements of consistent composition but variable internal spacing. Analyses of protein-DNA crosslinking patterns over tRNA genes might therefore be expected to require gapped ChIP-exo profile alignment strategies.
Eukaryotic tRNA genes are transcribed by Pol III, which is recruited by the multi-subunit transcription factor complexes TFIIIB and TFIIIC (31,32) ( Figure 3A). TFIIIB is composed of TBP, Brf1, and Bdp1, while TFIIIC contains two subcomplexes, A (composed of Tfc1, Tfc4, and Tfc7) and B (composed of Tfc3, Tfc6, and Tfc8). The TFIIIC subcomplexes bind to conserved intragenic promoter motifs, named Box A (bound by A) and Box B (bound by B), and enable assembly of the TFIIIB complex at a region ∼30 bp upstream of the tRNA transcription start site (33)(34)(35). Pol III is then recruited by TFIIIB, enabling transcription. Yeast tRNA genes vary in length between 74 and 134 bp, and this variation is reflected by variable spacing between the intragenic Box A and Box B promoter elements ( Figure 3B).
We applied ChIP-exo to generate novel high-resolution protein-DNA interaction profiles for three protein components of Pol III, the three components of TFIIIB, and the six components of TFIIIC ( Figure 3C, D). In order to account for the expected variable spacing between crosslinking signatures centered on the Box A and Box B sites in tRNA genes of different lengths, we extended our ChIP-exo multiple profile alignment strategy to incorporate affine gap penalties. We then applied our framework to align ChIPexo profiles from the 12 ChIP-exo targets across 279 yeast tRNA gene loci. Our goal is to demonstrate that ChEx-Align's ChIP-exo profile alignment procedure can recover informative protein-DNA crosslinking signatures for a regulatory complex without any prior knowledge of sequence features. However, the tRNA genes contain highly conserved sequence features, so initializing the alignments centered on tRNA TSSs would run the risk of being indirectly biased by the underlying conserved sequences (Figure 3B; Supplementary Figure S1A, B). We therefore chose to remove this potential confounding effect by initializing the alignments as centered on randomly shifted locations within a 60 bp range surrounding tRNA TSSs ( Figure 3C). This also has the effect of scrambling the relative locations of the Box A and Box B motifs in the initial alignment (Figure 3E).
The aligned ChIP-exo profiles produced by ChExAlign multiple profile alignment display pronounced peaks compared to the initial alignment ( Figure 3C, D: representative profiles for protein components of Pol III, TFIIIB, and TFIIIC A & B are shown for illustration; Supplementary Figure S2A-D: when single TBP ChIP-exo was used for alignment). Despite the fact that no sequence information is included in the alignment procedure, the multiple profile alignment induces a tight alignment of both Box A and Box B motif locations ( Figure 3F). The aligned ChIP-exo profiles incorporate a gap between the Box A and Box B motifs; as expected, the newly aligned short tRNA genes contain a larger gap compared with longer tRNA genes, having the effect that a consistent crosslinking profile alignment is constructed across all tRNA gene loci ( Figure 3D, F). Visual inspection shows that the multiple profile alignment contains sharp protein-DNA crosslinking peaks near the major regulatory elements (i.e. surrounding the Box A motif, Box B motif, and upstream TFIIIB binding location), consistent with where formaldehyde-induced protein-DNA crosslinking events would be expected to occur for the profiled regulatory proteins. Our results therefore demonstrate that ChExAlign can produce a high-quality multiple profile alignment of ChIP-exo crosslinking signatures, even in cases where the underlying regulatory elements occur at variable spacing in the constituent genomic loci.

ChIP-exo crosslinking quantification enables inference of protein-DNA complex organization
Visual inspection of aligned ChIP-exo crosslinking profiles enables some degree of insight into the spatial organization of protein-DNA complexes. For example, the aligned crosslinking profiles of tRNA transcriptional components show that TBP and other TFIIIB components are primarily crosslinked through a site upstream of the TSS, as expected from the direct protein-DNA crosslinks resulting from TBP binding to the TATA motif ( Figure 4A). Meanwhile, the TFIIIC components primarily display crosslinking through sites around the Box A and Box B motifs, again reflective of the expected protein-DNA binding events. However, TFI-IIC components also display additional, weaker crosslinking signatures through the same site that is crosslinked by TFIIIB components, indicative of the indirect protein-DNA crosslinking that results from protein-protein interactions between TFIIIC and TFIIIB. Within a regulatory complex, we should expect a protein's crosslinking efficiency at a given DNA site to decay as a function of the number of protein-protein crosslinks required to link it to the protein that directly contacts the DNA site. Therefore, careful analysis of the relative strengths of all crosslinking peaks may enable inference of protein spatial positioning within a regulatory complex. To facilitate such inference, we aimed to incorporate a principled approach to quantifying and visualizing crosslinking peaks into the ChExAlign framework. Our approach to quantifying ChIP-exo crosslinking signatures first applies a probabilistic mixture model to estimate the positions and relative strengths of crosslinking events within a multiple ChIP-exo profile alignment (see Materials and Methods). The mixture model probabilistically assigns observed ChIP-exo tags to crosslinking events using a predefined model of tag distributions around single crosslinking events. We use the Expectation Maximization (EM) algorithm to iteratively optimize the positions and strengths of crosslinking events using information from the assigned tags. The mixture model incorporates priors to keep the number of detected crosslinking events low (i.e. assuming that protein-DNA crosslinking occurs at relatively few bases in a regulatory region) and to keep the positions of crosslinking events consistent across the aligned multiprotein profiles (i.e. assuming that the same crosslinking events will recur across ChIP-exo profiles from multiple interacting proteins in a regulatory complex).
We applied our crosslinking deconvolution procedure to the 12-experiment multiple profile alignment of the tRNA regulatory complex (Figure 4A), first subtracting scaled ChIP-exo control signals from the individual profiles to account for elevated ChIP-exo background signals over the tRNA genes. The result of the deconvolution procedure is illustrated for Tfc6 in Figure 4B; crosslinking events are detected at eight positions within the Tfc6 signature, and each has a relative strength quantification which is derived from the proportion of Tfc6 ChIP-exo tags that is associated by the mixture model. The results of the procedure across all profiles can be summarized using a relative crosslinking matrix ( Figure 4C). The matrix illustrates that TBP has its highest crosslinking signal at position -60 in the alignment, a position that is located within the transcription bubble created by the Pol III pre-initiation complex (36) ( Figure  4C). Brf1 and Pol III components also show the strongest signal at the same position, suggesting that they associate with DNA via protein-protein interactions with TBP (Figure 4C). On the other hand, Tfc6 and Tfc8 from the TFIIIC B subcomplex show highest signals at positions +47 and +77, again consistent with their role in binding the Box B motif.
To visually summarize all crosslinking quantifications, we applied Principal Component Analysis (PCA) to the rows of the crosslinking matrix ( Figure 4D). This dimensionality reduction organizes the proteins in a manner that resembles the expected spatial organization of the tRNA transcriptional complex ( Figure 3A). TFIIIB components TBP and Brf1 are the left-most outliers of the plot, while Pol III and TFIIIC complexes are grouped in the middle and the right side of the plot. Notably, Tfc4, which is known to crosslink to both TFIIIB and to the rest of the A subcomplex (35,(37)(38)(39) is situated between Brf1 and other A subcomplex members on the PCA plot. Consistent with a previous study that showed Bdp1 crosslinking to the C34 subunit of Pol III (40), Bdp1 is situated proximal to Pol III subunits on the PCA plot. Moreover, in accordance with the finding that Tfc3 crosslinks with Tfc1 and Box B (39), the PCA plot shows Tfc3 in between A and B subunits of TFIIIC.
In summary, ChExAlign's crosslinking deconvolution and quantification procedures, and downstream dimensionality reduction, are promising approaches for generating hypotheses regarding the spatial organization of a protein-DNA complex from a collection of ChIP-exo data.

ChExAlign enables a principled quantification of regulatory complex reorganization across conditions
If the structure of a regulatory complex changes across different gene classes or between cell types, we should expect that the relative crosslinking signatures of members of the complex will also change. Provided that data from all conditions are analyzed simultaneously, ChExAlign's approach to aligning and quantifying protein-DNA crosslinking signatures is suitable for detecting spatial localization changes within a regulatory complex across conditions. To demonstrate our framework's ability to detect regulatory complex reorganization, we applied it to ChIP-nexus data pro-filing Pol II, the basal transcription factors TBP, TFIIA, TFIIB, TFIIF, TAF2, XPB, and negative elongation factor E (NELFE) in Drosophila melanogaster Kc167 cells (26). Shao and Zeitlinger collected data profiling these factors both under control conditions and after applying Triptolide (TRI) treatment, which blocks transcriptional initiation. We therefore sought to apply ChExAlign to this collection of ChIP-nexus datasets to quantify TRI-dependent changes in protein-DNA crosslinking profiles across factors.
We applied ChExAlign to align 16 ChIP-nexus profiles (eight factors, each in two conditions) across the top 500 most enriched Pol II peak locations ( Figure 5A, C). The multiple profile alignment procedure tightly aligns ChIPnexus profiles across these factors and conditions, as indicated by the high consistency between our alignment and annotated gene TSSs in the regions; 99% regions were aligned in the same orientation as the annotated TSS, and 53% of annotated TSSs were within 10 bp of the consensus TSS position within the multiple profile alignment ( Figure  5B; Supplementary Figure S3). We next used our mixture model approach to quantify crosslinking strengths for each factor and across conditions (e.g. Figure 5D).
As illustrated in Figure 5C and D, Pol II's crosslinking profile in the control condition shows one major peak at alignment position +33, while the profile in the TRI treatment condition shows additional crosslinking at alignment position -19. This apparent shift of Pol II occupancy is explained in detail in the source publication; the shift in crosslinking to the -19 position is due to the accumulation of Pol II at the pre-initiation complex (PIC) when it is prevented from moving into the downstream transcriptional pause site by TRI treatment (26). Similarly, an increase in TFIIB occupancy is observed at position -19 in the TRI treatment condition compared to the control. Our approach allows us to quantify this effect. Comparison of crosslinking quantification across conditions shows a 2-fold increase in Pol II crosslinking strength at the -19 PIC position after TRI treatment, while the +33 pause site position shows a 47% decrease in crosslinking strength. Other changes discussed in the original publication are also provided with a quantitative basis by our approach. For example, XPB, the direct target of TRI treatment, shows a 2-fold increase in crosslinking strength at the +16 position after TRI treatment, consistent with it blocking transcriptional initiation at that site. Similarly, a 60% decrease in TBP crosslinking at the +33 position after TRI treatment is consistent with TBP enrichment at that location being dependent on Pol II moving into the pause site. While these factors show distinct differences between the conditions, factors such as NELFE and TAF2 retained similar crosslinking patterns after the TRI treatment ( Figure 5C, E). Our results thus demonstrate that crosslinking profile alignment followed by mixture model analysis enables us to quantify changes in regulatory complex organization between different conditions as well as across different experimental targets.

DISCUSSION
We have presented ChExAlign, a principled framework for characterizing ChIP-exo crosslinking patterns across multiple members of a protein-DNA complex. Our approach implements a multiple alignment procedure that is optimized for aligning multi-dimensional ChIP-exo profiles, and which does not rely on sequence motif or genomic annotation information. ChExAlign also encapsulates a novel mixture modeling approach for estimating the locations and relative strengths of crosslinking events within a set of aligned ChIP-exo profiles. As we have demonstrated, the resulting crosslinking matrix can be used to visualize the relative spatial relationships between members of a protein-DNA complex (via dimensionality reduction), and to quantify shifts in spatial positioning across experiment conditions. One caveat is that ChExAlign is designed to analyze crosslinking signatures between known or hypothesized members of a regulatory complex; it is not designed to be an exploratory tool where one can determine if a given pro-Nucleic Acids Research, 2020, Vol. 48, No. 20 11225 tein belongs to a regulatory complex. ChExAlign's procedures will estimate relative crosslinking strengths for all constituent ChIP-exo experiments at characterized crosslinking positions, even if one of the underlying ChIP-exo profiles represents non-specific background signal. It is further challenging to assess the statistical significance of a marginal improvement in alignment scores in alignments that include or exclude a constituent experiment.
By applying ChExAlign to characterize the spatial organization of three distinct regulatory complexes, we have demonstrated that our approach is generally applicable to collections of ChIP-exo data profiling multiple members of a protein-DNA complex that is consistently organized across a set of genomic regions. Several previous studies have proposed the idea of characterizing the organization of protein-DNA complexes from ChIP-exo data (9,26,41). However, these previous approaches were typically limited to aligning data manually around known focal points of the protein-DNA complex (e.g. TF binding motifs or genomic annotations) and thus become impractical when analyzing large protein-DNA complexes that may be crosslinked at multiple unknown locations within regulatory regions.
The gold standard for characterizing the spatial organization of regulatory complexes is the creation of 3D structural models by applying structural biology techniques such as Xray crystallography or cryo-EM to purified complexes. Our approaches by no means result in the same type of structural information. However, careful analysis of protein-DNA crosslinking patterns might add useful orthogonal information to structural studies. For example, ChIP-exo crosslinking analysis might help to confirm the in vivo relevance of 3D structures that have been defined in vitro. Similarly, analysis of how crosslinking patterns vary across regulatory regions or experimental conditions may help to clarify how consistent the structure of a regulatory complex is in biological conditions.
In summary, we have demonstrated that our framework enables new forms of insight from analysis of multiple ChIP-exo crosslinking patterns, taking another step towards the fine-grained characterization of spatial organization within protein-DNA complexes. As larger collections of high-resolution protein-DNA interaction data become available, our analysis framework will contribute to investigating transcriptional mechanisms in greater detail.

DATA AVAILABILITY
Open source code (MIT license) and documentation for ChExAlign are available from https://github.com/seqcode/ chexalign. All ChIP-exo sequencing data produced in this study has been uploaded to GEO under accession GSE140923.