Structure of the catalytic core of the Integrator complex

Summary The Integrator is a specialized 3′ end-processing complex involved in cleavage and transcription termination of a subset of nascent RNA polymerase II transcripts, including small nuclear RNAs (snRNAs). We provide evidence of the modular nature of the Integrator complex by biochemically characterizing its two subcomplexes, INTS5/8 and INTS10/13/14. Using cryoelectron microscopy (cryo-EM), we determined a 3.5-Å-resolution structure of the INTS4/9/11 ternary complex, which constitutes Integrator’s catalytic core. Our structure reveals the spatial organization of the catalytic nuclease INTS11, bound to its catalytically impaired homolog INTS9 via several interdependent interfaces. INTS4, a helical repeat protein, plays a key role in stabilizing nuclease domains and other components. In this assembly, all three proteins form a composite electropositive groove, suggesting a putative RNA binding path within the complex. Comparison with other 3′ end-processing machineries points to distinct features and a unique architecture of the Integrator’s catalytic module.


INTRODUCTION
3 0 end processing of nascent RNA polymerase II (RNAPII) transcripts is one of the key steps in gene expression (Proudfoot, 2011;Shi and Manley, 2015). Nearly all protein-coding RNAPII transcripts are cleaved and polyadenylated by the cleavage and polyadenylation specificity factor (CPSF) (Bienroth et al., 1993;Murthy and Manley, 1992;Preker et al., 1997). The exceptions include replication-dependent histone pre-mRNAs, which are processed by a partially overlapping machinery that depends on the U7 small nuclear ribonucleoprotein particle (snRNP) (Mowry and Steitz, 1987;Pillai et al., 2003), whereas non-coding RNAPII transcripts, such as small nuclear RNAs (snRNAs), are processed by the Integrator complex (Baillat et al., 2005). The Integrator was discovered as a factor required for 3 0 end processing of metazoan snRNAs (Baillat et al., 2005), which critically depend on snRNA promoters driving transcription (Hernandez and Weiner, 1986;Lobo and Hernandez, 1989), and the Integrator is believed to couple both processes (Egloff et al., 2007). More recent studies suggest that transcription termination of human snRNA genes may also occur in an Integrator-independent manner (Davidson et al., 2020).
Given such a broad spectrum of functions, it is not surprising that several genetic disorders have been associated with mutations in the components of the Integrator complex (Krall et al., 2019;Oegema et al., 2017;Zhang et al., 2020a).
To date, 14 unique proteins have been identified as core components of the Integrator complex (Baillat et al., 2005;Chen et al., 2012), some of which interact with additional factors (Barbieri et al., 2018;Huang et al., 2020). None of the core Integrator subunits (INTS1-INTS14) are shared with the CPSF or the histone pre-mRNA processing machinery (Baillat and Wagner, 2015). However, based on sequence similarities, INTS9 and INTS11 have been identified as homologs of CPSF100 and CPSF73, respectively (Dominski et al., 2005b). Both proteins belong to the family of metallo-b-lactamase (MBL)/b-CASP (CPSF-Artemis-SNM1-Pso2) nucleases, which play important roles in various aspects of the RNA metabolism (Pettinati et al., 2016). Nucleases from this family are characterized by the presence of seven conserved sequence motifs (1-4 in the MBL and A-C in the b-CASP domains) (Dominski et al., 2013) typically coordinating two catalytic Zn 2+ ions within the active site, located in a deep cleft between the MBL and b-CASP domains (Mandel et al., 2006). Those sequence motifs, including the characteristic HxHxDH consensus sequence (motif 2), are conserved in INTS11 but not in INTS9, implying that the two proteins form a pair of active (INTS11) and catalytically impaired (INTS9) nucleases, reminiscent of CPSF73 and CPSF100 (Dominski et al., 2005b;Mandel et al., 2006). Indeed, mutations in the putative active site of INTS11 result in snRNA misprocessing and identify it as the bona fide catalytic subunit of the Integrator complex (Baillat et al., 2005).
Yeast two-hybrid and immunoprecipitation experiments showed that INTS9 and INTS11 form a heterodimer and do not cross-react with CPSF73 or CPSF100 (Albrecht and Wagner, 2012; Dominski et al., 2005b). The interaction of INTS9 and INTS11 is mediated by the C-terminal regions of both proteins and is necessary for proper pre-snRNA processing and stable association with other integrator subunits (INTSs) (Albrecht and Wagner, 2012;Wu et al., 2017). INTS4, a HEAT repeat-containing protein, has been identified as such a direct interaction partner of the INTS9/11 dimer (Albrecht et al., 2018). RNAi-mediated depletion of all three proteins results in a higher degree of pre-snRNA misprocessing than for any other INTS, suggesting that they form a minimal INTS4/9/11 core complex, hereafter referred to as the Integrator cleavage module (Albrecht et al., 2018). In the analogous mammalian cleavage factor (mCF) and in the histone cleavage complex (HCC), the CPSF73/100 dimer interacts directly with the HEAT repeat protein Symplekin, forming a catalytic core shared between the two machineries (Kolev and Steitz, 2005;Michalski and Steiniger, 2015;Sullivan et al., 2009). This, together with similarities in the primary sequence motifs, suggests that INTS4 and Symplekin might be functionally related factors (Albrecht et al., 2018).
In the past few years, tremendous progress has been made in structural studies of the 3 0 end-processing machineries, providing first insights into the architecture of the substrate recognition modules of the cleavage and polyadenylation factor (CPF)/CPSF (Casañ al et al., 2017;Clerici et al., 2018;Sun et al., 2018), their recruitment to the mCF (Zhang et al., 2020b) and activation of CPSF73 within a fully assembled histone pre-mRNA processing complex .
In contrast, very little is known about the molecular architecture of the Integrator complex despite its emerging importance for transcription attenuation and 3 0 end processing of a wide range of substrates. The structure of a small INTS9/11 C-terminal domain 2 (CTD2) dimer provided first detailed insights into one of several interfaces between these two proteins (Wu et al., 2017); however, the majority of the cleavage module remains structurally uncharacterized. This poses several questions regarding the relative orientation of the two nuclease domains, the role of the INTS4 in assembly of the cleavage module, the mechanism of substrate recognition and nuclease activation, and how the specificity is achieved.
Here we analyzed the composition of the native Integrator complex, identified several stable Integrator sub-complexes, and report a 3.5-Å cryoelectron microscopy (cryo-EM) reconstruction of the 250-kDa INTS4/9/11 ternary complex. Our structure provides insights into the molecular architecture of the Integrator cleavage module, revealing a tight association of INTS9 and INTS11 and a stabilizing role of INTS4 for several mobile domains.
Comparison with other 3 0 end-processing machineries highlights the unique architecture of the Integrator's catalytic core.

Modularity of the Integrator complex
The Integrator complex consists of 14 different subunits (Baillat et al., 2005;Chen et al., 2012), but little is known about its internal architecture or the assembly mechanism. To gain more insights into the composition of the Integrator complex, we generated a series of stable HEK293F cell lines ectopically overexpressing tagged variants of INTS4,INTS5,INTS7,INTS10,and INTS14 (data for INTS7 and INTS14 not shown). We performed tandem affinity purification followed by quantitative mass spectrometry analysis to identify factors co-purifying with each bait protein ( Figure 1A). INTS4 pull-down was used as an internal reference against which all enrichment ratios were calculated.
Using a stringent tandem affinity purification approach, we observed that each bait protein enriched a different subset of INTSs Figures 1B-1E). Other INTSs were also detected in our experiment, but their enrichments were significantly lower.
Because INTS4/9/11 has been described previously as a stable module of the Integrator (Albrecht et al., 2018), we concluded that two other assemblies detected here, INTS5/8 and INTS10/ 13/14, may also be similar, independent modules. To validate the newly identified subcomplexes, we expressed them in insect cells and purified and analyzed them by size-exclusion chromatography (SEC) (Figures 1F-1I). Each of the newly identified complexes eluted as a single symmetrical peak ( Figures 1G and 1I), indicating its biochemical homogeneity.
To reconstitute a larger core complex, we mixed all three modules at an equimolar ratio and subjected them to SEC. We observed a slight shift in the elution volume, which was also observed when mixing only INTS4/9/11 and INTS10/13/14, indicating that INTS4/9/11 and INTS10/13/14 interact with each other but not with INTS5/8 (Figures S1B-S1D). This interaction was confirmed further when the modules INTS4/9/11 and INTS10/ 13/14 were co-expressed in Hi5 cells and purified by tandem affinity purification with one tag on each module ( Figure S1A) Overall architecture of the Integrator cleavage module A ternary complex of INTS4/9/11 was produced by co-expression of full-length proteins in insect cells, followed by tandem affinity purification with an 83His tag attached to INTS4 and streptavidin-binding peptide (SBP) tag to INTS11 ( Figures S2A and  S2B). Using single-particle cryo-EM, we obtained a 3D reconstruction of the complex at an overall resolution of 3.5 Å (Figures 2 and S3-S6; Table 1).
The body of the structure is composed of two MBL/b-CASP domains of INTS9 (residues 1-506) and INTS11 (residues 1-449) facing each other in a head-to-head arrangement with a pseudo 2fold symmetry axis running along their interface (Figures 2A and  2B). The C termini of INTS9 507-556 and INTS11 450-502 extend from the body of the complex and form a tightly intertwined composite domain, hereafter referred to as the CTD1 dimer. A small ll OPEN ACCESS Article linker helix at the C-terminus of INTS9 CTD1 (residues 496-502) is also present in a recently reported structure of the INTS9/11 C-terminal regions (INTS9 582-658 /INTS11 503-600 ), which we define as the CTD2 dimer (Wu et al., 2017). Superposition of the two structures, together with focused classification and refinement, allowed unambiguous placement of the CTD2 dimer within our cryo-EM reconstruction ( Figures 2C, 2D, and S5A).
An a-helical repeat that wedges between the two nuclease domains has been identified as the N-terminal domain of INTS4 (INTS4 NTD ), consistent with the visible side-chain densities, secondary structure predictions, and cross-linking and mass spectrometry data (Figures 2 and 4). INTS4 CTD is not well resolved in our map, but the density near the INTS9 b-CASP domain could be unambiguously assigned to this part of the protein.   Figure S1.
2,944 Å 2 of buried surface area (BSA) (Figure 3). In the resulting pseudo-symmetric arrangement, CTDs wrap around one another and bring the nuclease domains into close proximity. Despite facing each other at a close distance, the nuclease domains of INTS9 and INTS11 (107 kDa combined) form only minor contacts (400 Å 2 BSA) and do not contribute significantly to heterodimer formation (Figures 3A-3C and S1H).
INTS9 CTD1 and INTS11 CTD1 , in contrast, form a composite domain and account for a significantly larger interaction surface despite its small size (1,200 Å 2 BSA, 11.5 kDa). The structure of the CTD1 dimer resembles a b-barrel with hydrophobic residues facing its inner core (Figures 3D and 3E). Extensive interactions between these two domains are achieved via formation of two intermolecular b-sheets involving INTS9 CTD1-b1 and INTS11 CTD1-b1 as well as INTS9 CTD1-b32 and INTS11 CTD1-b2 , which connect the two halves of the barrel ( Figure 3E). Formation of the CTD1 dimer tethers the two nuclease domains together and restricts their movement, facilitating formation of other weak interactions.
Another major contact involves INTS9 CTD2 and INTS11 CTD2 , which form the second independent dimerization domain, as reported previously (Wu et al., 2017). The CTD2 region appears flexible in our cryo-EM reconstruction, and it was modeled entirely by rigid body docking of the available crystal structure.
Formation of the INTS9/11 CTD1 dimer is functionally important and necessary for assembly of the cleavage module To analyze to what extent CTD1 dimer formation is important for Integrator function, we generated a series of INTS11 variants that aimed to disrupt the interface observed in the structure. We screened those variants for the effect on RNA processing in vivo ll OPEN ACCESS Article in a depletion/reconstitution assay utilizing the U7 snRNA reporter (Albrecht and Wagner, 2012). In this experimental setup, impaired 3 0 end cleavage of the U7 transcript results in transcriptional readthrough, allowing translation of the downstream GFP gene, which can be quantified by fluorescence intensity measurement ( Figure 3G).
Deletion CTD2 or CTD1 and CTD2 domains of INTS11 has a significant effect on the processing activity of the Integrator (Figure 3H). This is to be expected, considering that CTD2 is required for dimerization of INTS9/11, as shown previously (Wu et al., 2017). Deletion of INTS11 CTD1 alone has a severe effect on reporter RNA misprocessing, even though this domain is not required for dimerization of INTS9/11, as analyzed by co-expression and pull-down of the 33hemagglutinin (HA)-INTS9 and 33FLAG-INTS11 protein variants ( Figure S1F). To analyze this further, we designed a series of point mutations in INTS11 CTD1 in which pairs of amino acids located at the interface of the intermolecular b-sheet were substituted with prolines. These muta- tions are predicted to disrupt the secondary structure and, consequently, formation of the composite b-barrel-like domain. Although none of the introduced point mutations affected its dimerization properties ( Figure S1F), they increased reporter RNA misprocessing to levels comparable with deletion of the entire INTS11 CTD1 or introduction of the catalytic mutation E203Q at the INTS11 active site ( Figures 3H and 3I). This suggests that CTD1 dimer formation is important for Integrator function.
We hypothesized that this domain might be important for assembly of higher-order complexes, providing a quality control checkpoint. INTS4 recruitment control would be a likely target for such a checkpoint. To test this hypothesis, we co-expressed SBP-INTS4 with 33HA-INTS9 and 33FLAG-INTS11 variants in HEK293T cells, followed by HA-agarose pull-down and western blotting. Deletion of the CTD2 or CTD1 and CTD2 regions of INTS11 abolishes binding to INTS4 even though the nuclease domain of INTS11 has a significant interface with INTS4. Deletion of INTS11 CTD1 has no effect on dimerization of INTS9 and INTS11( Figure 4A) but fails to enrich INTS4 in the pull-down experiment ( Figure 4A). Consistently, point mutations, which disrupt CTD1 dimer formation, abolish recruitment of INTS4 ( Figure 4A, lanes 7-9). The fact that these residues are not involved in interactions with INTS4 suggests that proper CTD1 dimer formation is required before INTS9/11 can be incorporated into the Integrator complex.
INTS4 acts as a scaffold for the INTS9/11 dimer Secondary structure predictions show that INTS4 is almost an entirely a-helical protein, with the exception of the very C-terminal 150 residues, which display a high propensity to form b strands. Of 37 helical segments predicted in INTS4, 24 form helix-turn-helix motifs, which are clearly visible in our maps, and seven of them exhibit sequence signatures of a canonical HEAT repeat (Andrade et al., 2001).
Twelve HEAT repeat motifs (H1-H12) are well defined in our structure and constitute INTS4 NTD , which forms a curved solenoid-like structure ( Figures 2C-2E). The inner, concave surface of INTS4 NTD is in contact with the MBL domain of INTS9, and H1-H5 interact with the INTS9 MBL domain by forming specific polar contacts ( Figures 4B-4E). Other prominent contacts include the charged residues at the tip of H5, which interact with the CTD1 dimer, tethering it into its position ( Figure 4E).
Additional contacts with the CTD1 dimer are formed by H6, which also forms a contact with INTS11 MBL .
The outer, convex surface of H7-H12 faces the nuclease domain of INTS11, but the quality of the map in this region does not allow us to confidently assign an amino acid register. It is possible that additional contacts between the two proteins exist.
Importantly, the simultaneous interaction of INTS4 NTD with INTS9 MBL and INTS11 MBL via its inner and outer surfaces locks the relative orientation of the two nuclease domains, which would otherwise be associated loosely. Similarly, the CTD1 dimer does not form any contacts with the nuclease domains   nuclease domain of INTS11 shares 40% sequence identity with CPSF73, and the invariant residues include all 7 sequence motifs (1-4 in the MBL and A-C in the b-CASP domains ( Figure S7), which, in CPSF73, coordinate two catalytic Zn 2+ ions within a deep cleft between the MBL and b-CASP domains (Mandel et al., 2006). The geometry of these motifs is preserved in INTS11 and capable of supporting a similar coordination of the catalytic Zn 2+ ions ( Figures 5D and 5E). This implies that INTS11 forms a bona fide MBL/b-CASP nuclease active site.
The relative orientation of the MBL and b-CASP domains of INTS11 resembles that observed for CPSF73 (Mandel et al., 2006) and its homologs Ysh1 (Hill et al., 2019) and CPSF3 (Swale et al., 2019). In this configuration, a cleft leading to the active site is too narrow to accommodate the RNA substrate, suggesting that the nuclease was captured in a closed, inactive conformation ( Figure 5). The local resolution for the INTS11 b-CASP domain is relatively low ( Figure S6), indicating the dynamic nature of this region. It is consistent with the idea that displacement of the b-CASP domain would allow it to achieve an active configuration similar to the one observed for CPSF73  or RNase J (Dorlé ans et al., 2011).

The INTS9 MBL/b-CASP domain contains non-canonical insertions
INTS9 exhibits a domain architecture similar to INTS11; however, most of its conserved sequence motifs responsible for creating the active site are mutated ( Figures 5C and S7). In particular, the HxHxDH sequence in catalytic motif 2 is changed to NYHC, not only eliminating 3 of 4 ligands for the catalytic Zn 2+ ions but also causing significant changes in the geometry of the active site. This and other alterations render the INTS9 MBL/b-CASP domain incapable of coordinating catalytic divalent ions, reinforcing the notion that INTS9 is a pseudoenzyme.
INTS9 and CPSF100 share 23% sequence identity, and their superposition reveals a very good correspondence of their tertiary structures (root-mean-square deviation [RMSD] = 1.8 Å ). Despite this similarity, INTS9 has some unique features that are not present in other MBL/b-CASP proteins. The most prominent are two large loops inserted into the canonical MBL fold, with the first one (residues 40-82) between helix 4 and b strand 5 and the second (residues 148-192) between helices 2 and 3 (Figures 2 and 7). These two loops are well-ordered in our structure, interact with each other, and together form a structurally distinct domain in INTS9, referred to as the INTS9 accessory domain (INTS9 NAD ). Although conserved among Integrator sequences, the functional implications of these insertions remain unknown.

Structural implications for RNA substrate binding
We performed electrophoretic mobility shift assays (EMSAs) to assess the RNA-binding properties of INTS4/9/11 and other reconstituted sub-complexes ( Figures 6A-6C). Our data show that INTS4/9/11 can form an RNP complex with its cognate pre-U1 snRNA substrate, containing the 3 0 -box signal sequence (Ciliberto et al., 1986;Hernandez, 1985;Yuo et al., 1985), but also with an unstructured single-stranded RNA (ssRNA) ( Figure 6A). Similarly, INTS10/ 13/14 and INTS5/8 (to a lesser extent) exhibit RNA binding properties ( Figures 6B and 6C). However, in all three cases, the binding appears to be non-specific and in a low-affinity range. The lack of sequence specificity in RNA binding is consistent with the genome-wide functions of the Integrator and a broad spectrum of different substrates.
Next we wanted to find out whether INTS4 NTD could be involved in substrate binding in the Integrator complex in a manner similar to Symplekin in the HCC. In the histone processing machinery, Symplekin NTD is critical for its activity in vitro, and it bridges the CPSF73/100 dimer to the U7snRNP, forming a cavity that accommodates the U7 snRNA:histone pre-mRNA duplex . Analysis of the surface electrostatic potential of the INTS4/9/11 complex reveals a highly positively charged cavity formed by all three proteins ( Figures 6F and 6H). This composite cavity is formed between the INTS9/11 MBL domains and traces along the concave side of INTS4 leading toward the INTS11 b-CASP domain (i.e., the nuclease active site). In contrast, a similar charged surface is not present at the equivalent position in the CPSF73/100 complex, and neither do we see any other charged surfaces in the INTS9/11 dimer in a place that would correspond to histone the pre-mRNA binding site in the HCC. Interestingly, the side chains in the helix-loop-helix motifs of INTS4 NTD point toward the putative RNA-binding groove in a manner resembling the Symplekin NTD , where the corresponding loops bind the phosphate backbone of the H2a pre-mRNA.
We performed mutagenesis of the charged residues in INTS4 and INTS11, which are involved in formation of this positively charged surface, and analyzed their effect using our U7 snRNA reporter in vivo. Of seven point mutations tested, three impair the Integrator's 3 0 -processing activity (INTS11 K462E , INT-S4  Comparison with the histone pre-mRNA processing machinery reveals unique architectural features of the Integrator complex The arrangement of the INTS9/11 heterodimer closely resembles the one observed in the related CPSF73/100 complex , reported to be part of the histone pre-mRNA processing machinery ( Figures 7A and 7B). In both complexes, the two respective nucleases are brought together by dimerization of their two consecutive CTDs. Despite a similar overall conformation, the INTS11 MBL/b-CASP domain remains in a closed inactive configuration ( Figure 5).
The most striking differences between the two structures are the positions of the HEAT-repeat proteins INTS4 and Symplekin, which are believed to be functionally related (Albrecht et al., 2018). In both cases, they interact predominantly via their N termini with the pseudonuclease (INTS9 or CPSF100) but are located on opposite sides of their respective nuclease heterodimers ( Figures 7F-7H). In addition, Symplekin NTD binds head on with only two helices to the b-CASP domain of CPSF100, whereas INTS4 NTD stretches along the entire MBL domain of INTS9.
Because of limited resolution, we could not model INTS4 CTD , but our map shows a large elongated density associating with the INTS9 b-CASP domain, which we interpreted as the missing INTS4 CTD ( Figure 2C). The CTD of Symplekin, on the other hand, associates with a distantly located CTD2 dimer of the CPSF73/100 Zhang et al., 2020b). The NTDs and CTDs of Symplekin and INTS4 are not connected by a continuous density, indicating that a flexible central region might be functionally important in both cases. Our structure highlights that, despite similar primary sequence architecture and a bimodal binding manner, both proteins are recruited to distinct regions of their respective nuclease heterodimers.

DISCUSSION
The protein composition of the Integrator complex was established using affinity purification followed by western blotting and mass spectrometry (Baillat et al., 2005). Although this approach successfully uncovered factors involved in complex formation, it could not determine its stoichiometry or inter-subunit contacts, and such architectural information remained missing. Other biochemical approaches revealed the association of INTS4/9/11 (Albrecht et al., 2018) as well as INTS3/6 (Jia et al., 2020;Skaar et al., 2009;Zhang et al., 2013). Building on these findings, we developed a method that allowed us to identify two previously unknown sub-complexes, INTS5/8 and INTS10/13/14. The latter was recently confirmed by another study (Sabath et al., 2020), which revealed the structure of the INTS13/14 dimer and its RNA binding properties.
Our findings suggests that the Integrator complex is highly modular, not unlike the related CPSF/CPF complex (Casañ al et al., 2017; Zhang et al., 2020b). Furthermore, we provide evidence that two of the modules (INTS4/9/11 and INTS10/13/ 14) interact with each other ( Figure S1), consistent with recent reports (Mascibroda et al., 2020;Sabath et al., 2020). Notably, other INTSs were also detected in our experiment, but their abundance was much lower compared with the three complexes discussed. It is possible that those subunits are required to bridge the three modules and act as the limiting factors for the complex assembly. INTS7 would be a prime candidate for such a subunit because it is highly abundant in our MS experiment but equally enriched by INTS5/8 and INTS4/9/11 ( Figures  1B and 1C).
The association of the Integrator complex with 3 0 end processing was initially based on the similarity between INTS9/11 and a nuclease dimer of CPSF73/100 found in the HCC and CPSF (Dominski et al., 2005a). Our structure confirms that each dimer consists of four independent elements: a pseudonuclease domain, a bona fide MBL/b-CASP domain, and two separate CTD regions (CTD1 and CTD2) ( Figures 7A and 7B). In both cases, dimerization is driven by their tightly entangled CTDs, which bring the nuclease domains into close proximity.
From multiple INTS9-INTS11 interfaces, disruption of the CTD2 dimer has been shown to abolish binding of both proteins in vivo and, consequently, recruitment of other INTSs (Wu et al., 2017). This implies that an intact CTD1 is not sufficient for maintaining the interaction in the absence of the CTD2 dimer interaction or that the CTD1 dimer interface is not formed in this context.
Given the extensive CTD1 dimer interface, we believe the latter is true and that formation of the CTD2 dimer is a prerequisite for establishing the secondary CTD1 dimer interface. Indeed, impaired formation of CTD1 does not affect the interaction between INTS9 and INTS11, but it has severe consequences for recruitment of INTS4, implying that its formation plays a role in assembly of a higher-order complex. It is tempting to speculate that assembly of the INTS9/11 heterodimer may involve progressive formation of multiple interfaces nucleating with the CTD2 dimer, followed by the CTD1 dimer, which brings together two nuclease domains and allows them to form weak interdomain contacts.  (Hill et al., 2019;Mandel et al., 2006) conformation, enabling or preventing access of the substrate to the active center. INTS11 falls into the second category, consistent with not being engaged with an RNA substrate (Figure 5A). It has been noted previously that CPSF73 and its yeast homolog Ysh1 require repositioning of the b-CASP domain to achieve catalytic competence (Hill et al., 2019;Mandel et al., 2006). The mechanism of such a rearrangement has been shown recently for the histone pre-mRNA processing machinery . In this case, Lsm10, a component of the substrate recognizing the U7 snRNP module, provides a loop that wedges between the MBL and b-CASP domains of CPSF73, allowing the active center to be engaged with its cognate substrate . It is plausible that a similar mechanism might exist for the Integrator complex, but the factor that would trigger such activation remains elusive.
Our work identified a structured insertion in INTS9 MBL , referred to as the NAD domain (Figures 2 and 7). To our knowledge, this domain is unique to INTS9 and is not present in any other member of the MBL/b-CASP family. Notably, CPSF100 contains an unrelated, disordered insertion (approximately 145 residues) in a different region of the b -CASP domain ( Figures 7A-7E and  S7). This insertion has been shown recently to contain a linear sequence motif (mPSF interaction motif [PIM]) that is crucial for recruitment of the mammalian polyadenylation specificity factor to the mCF (Zhang et al., 2020b). It is tempting to speculate that the NAD in INTS9 might serve a similar purpose for recruitment of other INTSs. If true, then this observation would support a more general mechanism shared between different 3 0 end-processing machineries, where the inactive nuclease subunit mediates interactions with other modules to bring them into proximity of the endonuclease active site.
In a recent report (Zhang et al., 2020b), the nuclease domains of CPSF73 and CPSF100 in the mCF were captured at a wide angle, very different from our arrangement. A similar open arrangement has not been observed for the Integrator complex, but it is possible that such a configuration exists in the absence of INTS4. Our data suggest that INTS4 plays a role in achieving a compact nuclease/pseudonuclease configuration resembling that observed for CPSF73/100 in the histone-processing machinery .
INTS4 has been identified previously as a ''Symplekin-like'' factor (Albrecht et al., 2018) because both proteins interact directly with their respective nuclease heterodimers, and the resulting core complexes are critical for the activities of the corresponding machineries (Kolev and Steitz, 2005;Michalski and Steiniger, 2015;Sullivan et al., 2009). INTS4 and Symplekin are mostly a-helical HEAT repeat proteins and appear to be composed of two separate domains. Our structure reveals striking differences between the positions of INTS4 and Symplekin with respect to their nuclease heterodimer partners (Figures 7F-7H). Both proteins interact mainly via their NTDs with corresponding pseudonucleases (INTS9 or CPSF100) but are located on the opposite side of each complex. The CTD of INTS4 is poorly resolved in our map but identified unambiguously near the b-CASP domain of INTS9. An equivalent CTD domain of Symplekin binds the CPSF73/100 dimer at a very distant CPSF73/100 CTD2 dimerization domain (Zhang et al., 2020b). Such a differential binding mode is unexpected, given the high similarity of the nuclease heterodimer structures. This raises the question whether INTS4 and Symplekin are indeed functionally related. It is likely that the differences observed here are the result of a specialization of each machinery developed to recognize and process different sets of substrates. However, in principle, other INTSs could bind INTS9/ 11 in a manner resembling Symplekin binding, even in the presence of INTS4. Also, it cannot be excluded that each machinery could undergo a conformational rearrangement during assembly, and the two structures compared here may not represent the same functional state.
Our analysis revealed that INTS4/9/11 forms a highly electropositive composite groove leading toward the active site of INTS11, strongly suggesting a possible path for the RNA substrate within the complex ( Figure 6). Currently, it is not clear whether this channel would accommodate a substrate upstream or downstream of the cleavage site because, in principle, both modes of binding could be supported. Importantly, such a binding mode would be very different from the one observed for the histone processing machinery, where Symplekin NTD forms an RNA-binding cavity on the opposite side of the nuclease heterodimer. The differences observed here may represent how the specialization necessary to accommodate different substrates or substrate recognition modules is achieved. However, despite the very different architecture of each complex, we recognize that the design principles of both machineries have some important similarities. Nuclease dimer formation and recruitment of two unrelated helical proteins may functionally play analogous roles in creating a substrate binding cavity. Functional characterization of additional modules identified in this work, in particular identification of the substrate binding module and the factors required to trigger nuclease activity, will be crucial to understand the underlying mechanism and the specificity of the Integrator complex.
While this manuscript was under review, a structure of the integrator-containing PP2A-AC complex (INTAC) was reported (Zheng et al., 2020), providing new insights into the architecture of a nearly complete Integrator complex and its function as a non-canonical RNAPII phosphatase. This new structure comprises two of the modules discussed in this manuscript, INTS5/ 8 and INTS4/9/11, which localize to opposite sides of the complex and do not interact, consistent with our biochemical data. INTS10/13/14 is present within the INTAC complex but not resolved in the cryo-EM map. INTS4/9/11, described by Zheng et al. (2020), shows overall good agreement with our structure (RMSD of 2.2 Å over 5,400 atoms), with only minor discrepancies in the amino acid register in the middle part of INTS4 (residues 280-340). The atomic coordinates of INTS4 CTD can be readily fitted into the density attributed to this domain in our map. Interestingly, INTS11 remains in an inactive state despite numerous other components present in the complex, and its nuclease activation mechanism remains unknown.

Limitations of study
Our tandem affinity purification approach was designed to detect abundant sub-complexes containing unique components. Low-abundant INTSs, which were not enriched sufficiently in our analysis, might bridge different modules in vivo. Therefore, we cannot exclude the possibility that each of the modules described here could also exists within a larger assembly.
In addition, the nature of the control in our pull-down experiment (INTS4) implies that subunits shared between different sub-complexes would be well abundant but not enriched by either bait protein.
The RNA-binding properties of all three modules were tested in vitro on a model substrate. Although we observed reproducible and consistent formation of an RNP complex between INTS4/9/11 and the model substrate, we cannot prove that this binding is mediated by the electropositive patch identified as a putative RNA-binding surface. Mutagenesis of this surface and an in vivo assay ( Figure 6) suggest that this surface is functionally important; however, INTS4 residues with the strongest misprocessing effect are also partially involved in binding of INTS9, and their mutation may have a more convoluted effect.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
We acknowledge the EMBL Proteomic Core Facility, in particular Mandy Rettel, Per Haberkant, and Frank Stein, for assistance with mass spectrometry data acquisition and analysis; Felix Weis and Michael Hons for assistance with cryo-EM data collection at the EMBL cryo-EM platform and ESRF CM01 beamline; Estelle Marchal for technical assistance with RNA preparation; Erin Cutts and Alessandro Vannini for advice regarding the biGBac system; the EMBL Grenoble EEF platform for assistance with cell culture; and Lori Passmore, Daniel Peter, Stephen Cusack, Sagar Bhogaraju, and Michal Razew for critical comments on the manuscript.

AUTHOR CONTRIBUTIONS
M.M.P. identified, expressed, and purified stable Integrator sub-complexes, and performed cryo-EM data analysis, model building and refinement, EMSA assays, and pull-down validation. W.P.G. initiated and coordinated the project and provided support with experimental procedures. M.M.P. and W.P.G. analyzed the data and wrote the manuscript.

Article RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Wojciech P. Galej (wgalej@embl.fr).

Material availability
Unique and stable reagents generated in this study are available upon request.

Data and code availability
Cryo-EM maps obtained within this project have been deposited in the EMDB database with the following accession codes: EMD-12165 (High-resolution map), EMD-12166 (Medium-resolution, ESRF1 map), EMD-12164 (INTS4-focused map), EMD-12163 (CTD2focused map). The atomic coordinates have been deposited in PDBe with the following accession codes: 7BFP (model refined against the high-resolution map) and 7BFQ (pseudoatomic model fitted into the medium-resolution map).

EXPERIMENTAL MODEL AND SUBJECT DETAILS
HEK293T cells (ATCC) were propagated in DMEM medium (GIBCO) supplemented with 10% FBS (Thermo Fisher) and Pen Strep (GIBCO). SF21 Insect cells were cultured in SF-900 TM II SFM media (GIBCO). Hi5 cells were grown in Express Five TM SFM media (GIBCO) supplemented with L-Glutamine (GIBCO).

METHOD DETAILS Protein expression and purification
Full-length open reading frames (ORFs) of INTS4 (N-terminally 8xHis-tagged), INTS9 and INTS11 (N-terminally SBP-tagged) were assembled into one vector using the biGBac system (Weissmann et al., 2016). Similarly, INTS5 (N-terminally 8xHis-tagged) and INTS8 (N-terminally SBP-tagged) as well as INTS10 (N-terminally 8xHis-tagged), INTS13 (N-terminally SBP-tagged) and INTS14 were cloned into a single vector containing two or three ORFs respectively. Baculovirus was generated in SF21 cells grown in SF-900 TM II SFM media, as previously described (Bieniossek et al., 2012). Hi5 cells, used for protein production, were grown in Express Five TM SFM media and infected at a density of 1x10 6 with 1% volume of the SF21 pre-culture and incubated for 72h at 27 C.
Cells were split into 500 mL aliquots, harvested at 300 g for 15 min at 4 C (JLA8.1000), resuspended in PBS and transferred to 50 mL tubes. The cells were spun down again at 500 g for 10 min (A-4-62), frozen in liquid nitrogen and stored at À80 C until further use.
The cell pellets were thawed in 10 volumes Buffer1 (150 mM KCl; 20 mM HEPES KOH pH 7,8; 30 mM Imidazole) and sonicated (4 times 1 min, 30% Amplitude, 10 s ON/OFF cycle). The membrane fraction was pelleted at 48.384 g for 45 min at 4 C (JA-25.50) and the supernatant incubated on 10% (v/v) Ni-NTA resin (QIAGEN) at 4 C for 2 h on a turning wheel. The resin was transferred to Poly-Prep Chromatography Columns (Bio-Rad) washed 5 times in 2 volumes Buffer1 and subsequently eluted in 5 3 1ml Buffer2 (150 mM KCl; 20 mM HEPES KOH pH 7.8; 250 mM Imidazole). The elutions were pooled, transferred to 200 ml Streptavidin agarose resin and again incubated for 2h at 4 C on a turning wheel. Subsequently, the resin was washed 5 times with 1 mL Buffer3 (150 mM KCl; 20 mM HEPES KOH pH 7.8) and eluted in 3 steps with Buffer4 (150 mM KCl; 20 mM HEPES KOH pH 7.8; 10 mM desthiobiotin). Sample quality was monitored at the different stages by SDS-PAGE.

Crosslinking gradient (GraFix)
The elution fractions of the freshly purified complexes were pooled and transferred to a crosslinking gradient, GraFix (Kastner et al., 2008), with the following buffer composition: 10% -30% Glycerol, 150 mM KCl, 20 mM HEPES KOH pH 7.8; 0,05% Glutaraldehyde and spun for 14 h at 160,000 g (SW60Ti). The gradient was aliquoted in 150 ml fractions, quenched with 20 mM Tris-HCl pH 7.8 (final concentration) and the protein complex traced with a Dot-Blot against the SBP-tag. Fractions containing cross-linked complex were pooled and the glycerol was removed by multiple rounds of concentration using an 0.5 mL Amicon spin column (50 kDa cut-off).
Size exclusion chromatography 50 mL of the freshly purified complexes at 1-2 mg/ml were injected into a Superose6Ò Increase 3.2/300 column and eluted over 3.3 mL equilibrated in Buffer3 at the flowrate of 0.04 ml/min. Fractions of 100 mL were collected and subsequently analyzed by SDS-PAGE. Interactions were assessed by mixing the purified protein complexes in 1:1 molar ratio and incubating them on ice for 30 min before injections.
Sample vitrification R1.2/1.3 UltrAuFoil 300 mesh grids were glow-discharged from each side for 20 s at 25 mA at 0.3 bar using a Pelco EasyGlow device. The protein concentration was adjusted to 0.5 mg/ml as described above and 2.5 ml sample applied to each side of the grid. Excess of the sample was blotted away using a Vitrobot MARK IV at 4 C, 100% humidity for 2 s at À10 blotting force and plunge frozen in liquid ethane.

Cryo-EM data collection
The grids were loaded into a Titan Krios (FEI) electron microscope at the CM01 ESRF beamline (Kandiah et al., 2019) (dataset ESRF1 and ESRF3) or at the EMBL Heidelberg cryo-EM platform (dataset EMBL2), both equipped with a K2 Summit direct electron detector and a GIF Quantum energy filter (Gatan). Microscopes were operated at 300kV acceleration voltage in an EF-TEM mode. The cryo-EM data was acquired using Thermo Fisher EPU software (ESRF) or serialEM (Mastronarde, 2005) (EMBL) at a nominal magnification of 3 165 000, resulting in 0.83 Å $ pixel -1 (ESRF) and 0.81 Å $ pixel -1 (EMBL). ESRF movies were acquired for 4 s at a flux of 11.7 e $ Å -2 $ s -1 and the total fluence of 46.8 e $ Å -2 was fractionated into 40 movie frames. EMBL movies were acquired for 6 s at a flux of 7 e $ Å -2 $ s -1 and the total fluence of 42 e $ Å -2 was fractionated into 40 movie frames. A total of 6182 (ESRF) and 13,086 (EMBL) movies were acquired with a defocus range from À0.5 to À3.0 mm.

Cryo-EM data processing
All image processing was performed within Relion 3.0 (Zivanov et al., 2018), unless stated otherwise. For all three datasets beaminduced motion correction was performed using Relion's implementation of MotionCorr2 (Zheng et al., 2017) with a 5x5 patch model without binning followed by CTF estimation with CTFFIND 4.1 (Rohou and Grigorieff, 2015).
Automated particle picking of the ESRF1 dataset identified 1.02 million particles which were subjected to 3 rounds of 2D classification which reduced the particle number to 691 k. Two subsequent rounds of 3D classification with 3 classes each and a subsequent refinement resulted in a 8.3 Å resolution map with 302k particles. The refined map was classified further into 3 classes, the best class refined and again split into 5 classes. The best class containing 22 k particles was refined and yielded a 4.08 Å resolution map.
In case of the EMBL2 dataset automated picking in Relion identified 4.65 million particles. Due to the high number of particles, extraction was performed with 2-fold binning. The particles were subjected to 6 rounds of 2D classification to remove broken and poorly aligning particles, which reduced the particle number to 241 k. Those particles were re-extracted with their full pixel size and subjected to 3D classification into 6 classes. The best class, containing 84k particles, was refined and yielded a 3.98 Å resolution map.
The ESRF3 dataset was not processed separately as it was collected with the intention of merging the data with the previously collected datasets.
The final density of the EMBL2 dataset was used for template picking on all 3 datasets (ESRF1, EMBL2, ESRF2), yielding 2.9 million particles for the ESRF datasets and 6.1 million particles for the EMBL dataset. The ESRF particles were extracted from datasets 1 and 3 with full pixel size and merged. 2 rounds of 2D classification reduced the particle number to 786 k. The EMBL dataset was extracted 4-fold binned and subjected to 4 rounds of 2D classification which reduced the particle number to 1.6 million.
Since none of these datasets achieved the resolution required for de novo model building, we decided to merge all available data to improve the overall resolution. In order to do so, we determined a relative scaling factor between the maps from two different microscopes by iterative adjustment of the pixel size in one of the maps, followed by calculation of the correlation-coefficient in ChimeraX against the reference map (Wilkinson et al., 2019). As a result, a relative pixel size for the EMBL dataset was adjusted to 0.816 (instead of 0.810) and the CTF parameters were determined again for this dataset. To achieve the best possible scaling of this dataset against the ESRF reference (0.83 Å $ pixel -1 ), we determined optimal box sizes for the extraction and the 786 k preselected ESRF particles were re-extracted in a 348x348 pixel box, while preselected 570 k EMBL particles were re-extracted in a 354x354 pixel box, downscaled to 348x348 pixels during the extraction. Both datasets were merged and the resulting 1.35 million particles were subjected to 2 rounds of 2D and one round of 3D classification. The best 3D class containing 541 k particles was refined to 4.9 Å resolution. Further 3D classification into 5 classes gave 2 equally good reconstructions, which differed in angular distribution. As previous data processing indicated a missing angle problem, both classes (312 k particles) were used in a subsequent 3D refinement yielding a reconstruction at 3.9 Å resolution. This reconstruction was used for CTF refinement and Bayesian polishing (Zivanov et al., 2019). Although we could not identify any significant heterogeneity in our data at this stage, further 3D classification allowed us to select a subset of particles, with presumably higher signal-to-noise ratio, which refined to 3.6 Å -resolution (from 99K particles) and 3.5 Å -resolution (from 27 k particles). The latter was used for subsequent model building.
The map used for de novo model building is missing some of the peripheral regions including the INTS9/11 CTD2 dimer and INTS4 CTD . While we were performing focused classification and refinement, we realized that classes containing well-defined peripheral regions originate almost exclusively in the ESRF1 dataset. It is possible that variation in sample preparation, vitrification conditions and/or ice thickness preserved those fragile elements only in 1 out of 3 datasets. Therefore, we performed focused classification and refinement on this dataset alone. Briefly, the initially picked 1.4 million particles were subjected to 3 rounds of 2D and 3 rounds of 3D classifications in order to remove broken particles and to select well-aligning classes. The resulting 115 k particles were refined to 4.4 Å -resolution. The angles assigned in this refinement were used for subsequent signal subtraction (Bai et al., 2015) followed by 3D classification without image alignment and T = 40. The best classes were reverted to the original particles and refined. Both maps yielded a final resolution of 6.5 Å .
All reported resolutions were calculated within Relion using gold standard Fourier Shell Correlation (FSC) procedures (Scheres and Chen, 2012).

Model Building and Refinement
For INTS9 and INTS11 homology models were calculated using the I-TASSER web server (Zhang, 2008) and fitted into the cryo-EM map in ChimeraX (Goddard et al., 2018). Based on the slightly larger molecular weight of INTS9, this subunit was assigned to the larger lobe, which was later confirmed by the amino acid register and identification of a unique insertion (NAD) in the INTS9 MBL domain. Homology models were rebuilt manually in Coot (0.8.9.2-pre EL) (Emsley and Cowtan, 2004). The density of the CTD1 dimer was good enough for de novo modeling. The CTD2 dimer was modeled by rigid-body docking of the previously reported coordinates (PDB ID: 5V8W) into a low-resolution lobe protruding from the end of the CTD1 dimer. Additionally, the linker helix connecting the CTD1 and CTD2 of INTS11 (495-504), which is as well part of the previously reported crystal structure of CTD2 (PDB ID: 5V8W), associates more tightly with the CTD1 dimer in our structure and its position is in agreement with the current CTD2 assignment.
The directionality of the INTS4 helical repeat was identified by crosslinking and mass spectrometry. The model for HEAT repeats 1-9 was built de novo into the high-resolution map and the register assignment was based on visible side-chain densities and secondary structure predictions (Kelley et al., 2015). The length of predicted helices is in good agreement with the visible density. Additionally, residues 264-280 are predicted to form a short extended loop between repeats H6 and H7, which is indeed visible in our map and provides a landmark verifying a correct sequence assignment.
A medium resolution map obtained from focused classification and refinement was used to extend the model with 3 additional HEAT repeats, whose register was tentatively assigned based on the secondary structure predictions.
The atomic model was refined in reciprocal space with Refmac5 (Murshudov et al., 2011) with the secondary structure restrains generated in ProSMART (Nicholls et al., 2014) within the CCP-EM (1.3.0) software suit (Burnley et al., 2017) . Half-map validation was performed as previously described (Brown et al., 2015). The atomic model was visualized in ChimeraX (Goddard et al., 2018) and PyMol and the electrostatic calculations were performed with the APBS plugin in PyMol (Schrö dinger).

Cross-linking Mass spectrometry
Protein complexes were purified as previously described and adjusted to a concentration of 1 mg/ml in a 50 mL volume. DSS (disuccinimidyl suberate) was dissolved in Dimethylformamide with a final concentration of 50 mM. The dissolved crosslinker was added to the protein aliquot with a final concentration of 1 mM and 10 mM respectively and incubated at 35 C for 30 min. The reaction was quenched with 0.1 volumes of 1M Ammoniumbicarbonate and again incubated for 10 min at 35 C. Subsequently the sample was treated with 0.8 volumes of 10 M Urea and 250 mM Ammoniumbicarbonate as well as 0.05 volumes RapiGest SF Surfactant (Waters, cat. No. 186008090) and sonicated for 1 min in an ultrasound bath. Later, DTT was added with a final concentration of 10 mM, incubated for 30 min at 37 C and freshly prepared Iodoacetamide was added with a final concentration of 15 mM and again incubated for 30 min at room temperature (protected from light).
Subsequently, the sample was treated with different proteases beginning with Endoproteinase Lys-C (5 mL of an 0.1 mg/ml stock in Ammoniumbicarbonate) and incubated for 4 h at 37 C. The Urea concentration was adjusted to 1.5 M with HPLC grade water and Trypsin was added (1 mL of an 1 mg/ml stock) and incubated over night at 37 C. Subsequently the sample was acidified with Trifluoroacetic acid (1% v/v final concentration), incubated at 37 C for 30 min and spun down at 17,000 g for 5 min. The supernatant was discarded and the pellet was frozen in liquid nitrogen and stored at À80 C or dry ice. The mass spectrometry experiment and cross-links identification was performed by the EMBL Proteomics Core Facility in Heidelberg. The cell lines expressing Protein A -TEV -SBP N-terminal tagged INTS4, INTS5, INTS7, INTS10 or INTS14 were grown in 150 mL Freestyle media to a density of 1x10 6 cells/ml and harvested at 300 g (JA14) for 10 min and 4 C. The cells were disrupted and fractionated following Dignam's nuclear extract preparation protocol (Dignam et al., 1983) Nuclear extract and S100 fractions were frozen in liquid nitrogen and stored at À80 C until further use. The S100 or the nuclear extract fractions were thawed, and incubated with IgG beads for 2 h on a turning wheel at 4 C. Subsequently the beads were washed with Buffer3 (150 mM KCl, 20 mM HEPES-KOH pH 7.8) and 200 mL Buffer3 containing 12.5ug of TEV protease was added to the IgG beads to elute the proteins. The digestion was performed at 20 C for 90 min and the resin was additionally eluted 4 times with 200 mL of Buffer3. The elution fractions were pooled, added to Steptavidin beads and incubated again for 90 min on a turning wheel at 4 C. Subsequently, the beads were washed and eluted in 70 mL Buffer4 (150 mM KCl, 20 mM HEPES-KOH pH 7.8, 10 mM desthiobiotin). The eluate was spun down at 17,000 g and 4 C or 5 min, the supernatant collected and flash frozen in liquid nitrogen and stored at À80 C or on dry ice until further use.

Analytical pull-down and quantitative mass spectrometry
For the mass spectrometric analysis, 40 mL of SBP elution fractions obtained from nuclear extract (INTS4 and INTS5,shown in Figures 1B and 1C) or S100 fraction (INTS4 and INTS10,shown in Figures 1D and 1E) were subjected to an in-solution tryptic digest using a modified version of the Single-Pot Solid-Phase-enhanced Sample Preparation (SP3) protocol (Hughes et al., 2014;Moggridge et al., 2018). Samples were added to Sera-Mag Beads (Thermo Scientific, #4515-2105-050250, 6515-2105-050250) in 20 ml 15% formic acid and 60 ml of ethanol. Binding of proteins was achieved by shaking for 15 min at room temperature. SDS was removed by 4 subsequent washes with 200 ml of 70% ethanol. Proteins were digested with 0.4 mg of sequencing grade modified trypsin (Promega, #V5111) in 40 ml Na-HEPES, pH 8.4 in the presence of 1.25 mM TCEP and 5 mM chloroacetamide (Sigma-Aldrich, #C0267) overnight at room temperature. Beads were separated, washed with 10 ml of an aqueous solution of 2% DMSO and the combined eluates were dried down. Peptides were reconstituted in 10 ml of H 2 O and reacted with 80 mg of TMT10plex (Werner et al., 2014) (Thermo Scientific, #90111) label reagent dissolved in 4 ml of acetonitrile for 1 h at room temperature. Excess TMT reagent was quenched by the addition of 4 ml of an aqueous solution of 5% hydroxylamine (Sigma, 438227). Peptides were mixed to achieve a 1:1 ratio across all TMT-channels. Mixed peptides were subjected to a reverse phase clean-up step (OASIS HLB 96-well mElution Plate, Waters #186001828BA) and analyzed by LC-MS/MS on a Q Exactive Plus (Thermo Scentific) as previously described (Becher et al., 2018).
Briefly, peptides were separated using an UltiMate 3000 RSLC (Thermo Scientific) equipped with a trapping cartridge (Precolumn; C18 PepMap 100, 5 lm, 300 lm i.d. 3 5 mm, 100Å ) and an analytical column (Waters nanoEase HSS C18 T3, 75 lm 3 25 cm, 1.8 lm, 100Å ). Solvent A: aqueous 0.1% formic acid; Solvent B: 0.1% formic acid in acetonitrile (all solvents were of LC-MS grade). Peptides were loaded on the trapping cartridge using solvent A for 3 min with a flow of 30 ml/min. Peptides were separated on the analytical column with a constant flow of 0.3 ml/min applying a 2 h gradient of 2 -28% of solvent B in A, followed by an increase to 40% B. Peptides were directly analyzed in positive ion mode applying with a spray voltage of 2.3 kV and a capillary temperature of 320 C using a Nanospray-Flex ion source and a Pico-Tip Emitter 360 lm OD 3 20 lm ID; 10 lm tip (New Objective). MS spectra with a mass range of 375-1.200 m/z were acquired in profile mode using a resolution of 70.000 [maximum fill time of 250 ms or a maximum of 3e6 ions (automatic gain control, AGC)]. Fragmentation was triggered for the top 10 peaks with charge 2-4 on the MS scan (datadependent acquisition) with a 30 s dynamic exclusion window (normalized collision energy was 32). Precursors were isolated with a 0.7 m/z window and MS/MS spectra were acquired in profile mode with a resolution of 35,000 (maximum fill time of 120 ms or an AGC target of 2e5 ions).
Acquired data were analyzed using IsobarQuant (PMID: 26379230) and Mascot V2.4 (Matrix Science) using a reverse UniProt FASTA Homo sapiens database (UP000000589) including common contaminants. The following modifications were taken into account: Carbamidomethyl (C, fixed), TMT10plex (K, fixed), Acetyl (N-term, variable), Oxidation (M, variable) and TMT10plex (Nterm, variable). The mass error tolerance for full scan MS spectra was set to 10 ppm and for MS/MS spectra to 0.02 Da. A maximum of 2 missed cleavages were allowed. A minimum of 2 unique peptides with a peptide length of at least seven amino acids and a false discovery rate below 0.01 were required on the peptide and protein level (Savitski et al., 2015).
The raw output files of IsobarQuant were processed using the R programming language (ISBN 3-900051-07-0). Only proteins that were quantified with at least two unique peptides were considered for the analysis. Raw signal-sums (signal_sum columns) were normalized using variance stabilization normalization (Huber et al., 2002). Ratios were computed using these normalized TMT reporter ion signals. The top3 value is the average log10 MS1 intensity of the three most abundant peptides for each protein and serves as an estimator for the average abundance of a protein in the multiplexed mass spec run.
The mass spectrometry experiment and data analysis was conducted by the EMBL Proteomic Core Facility in Heidelberg.
RNA in vitro transcription and labeling RNA substrates for binding studies were generated by T7 run-off transcription using the Milligan transcription method from annealed template DNA oligonucleotides (Milligan et al., 1987). The U1 stem-loop 4 substrate comprises human U1 snRNA nucleotides 137-164 followed by 34 downstream nucleotides, including the 3 0 -box sequence. The control RNA is a scrambled sequence of the same size. Transcription products were gel purified, enzymatically capped with the vaccinia capping enzyme (NEB) following manufacturer recommendations and labeled with fluorescein at the 3 0 end. Briefly, the 3 0 vicinal diol was oxidized by re-suspending the RNA in 100 mL of freshly made oxidation solution (0.1 M sodium periodate and 0.1 M sodium acetate (pH 5.0)) and incubated at room temperature for 1.5 h in the dark. The reaction was quenched by the addition of 10 mL of 3 M KCl, placed on ice for 10 min, and the resulting insoluble KIO 4 was pelleted by a brief centrifugation. A thiosemicarbazide derivative of fluorescein (100 mM in DMSO) was added to a final concentration of 50 mM and incubated at room temperature for 4 h, as previously described (Hardin et al., 2015). Labeled products were purified using a denaturing PAGE.

EMSA
Fluorescein labeled RNA was adjusted to a final concentration of 10 nM, mixed with a 2-fold dilution series of purified INTS4/9/11 with a maximum concentration of 10 mM and incubated on ice for 2h. Subsequently the sample was loaded onto a 1% TBE-Agarose gel and run for 120 min at 50 V and 4 C. Gels were visualized with the Chemidoc imaging system (Bio-Rad).

Dot blot
Dot-Blots were used to trace the crosslinked protein samples after the GraFix. A PVDF membrane was incubated for 5 min in PBS and installed on a Dot blot rack (BioRad) and washed twice with 100 mL PBS. 10 mL of the crosslinked and fractionized GraFix were mixed with 100 mL PBS and blotted onto the PVDF membrane. Subsequently, the membrane was washed twice with 100 mL PBS and transferred to 5% milk in PBST. After 1 h incubation the milk was replaced by Anti-SBP antibody (Merck, cat. MAB10764), 1:5000 dilution) in 5% milk in PBST. 1 hour later, the antibody was removed, the membrane washed twice for 5 min with PBST and incubated again for 1 h with goat Anti-Mouse-IgG-HRP conjugate (Thermo Fisher, 3cat. 1430, 1:2000 dilution). Subsequently, the membrane was washed 4 times with PBST, developed with Pirece TM ECL Western Blotting Substrate (Thermo Fisher, cat. 32106, 1ml) and imaged with a Bio-Rad Chemidoc system.

GFP-based in vivo reporter assay
The reporter plasmid was cloned by inserting U7 snRNA gene (including 200 nt of the promoter and 70 nt of the downstream regions) into a modified backbone of the pFLAG_CMV10 vector (SIGMA) followed by EGFP ORF, based on the previously described design (Albrecht and Wagner, 2012).
Integrator subunits for the rescue experiments were cloned into a modified pFLAG_CMV10 vector with N-terminal affinity tags: 3xFLAG for INTS11, 3xHA for INTS9 and SBP for INTS4. 3xFLAG_INTS11 was carrying silent mutations providing resistance to the siRNA treatment.
24 h before transfection, cells were seeded into 24-well plates to reach 60%-80% confluency at the time of transfection. For each condition 0.8 pmol of siRNA was diluted in 50 mL of opti-MEM and mixed with 50ul of opti-MEM containing 2.4ul of RNAi-MAX (Invitrogen). After 5 min incubation, the transfection mixture was added dropwise to each well, containing 500 mL of DMEM/10% FBS (GIBCO). After 24 h incubations the cells were split and re-seeded in a 1:3 ratio. After 24 h incubation the cells were transfected again with siRNA as described above. The following day, cells were split in 1:2 ratio, 12h before transfection of the reporter and rescue plasmids. For each well, 300 ng of the U7_GFP reporter plasmid was mixed with 50 ng of P2A-mCherry-N1 (Addgene #84329) and 300 ng of the rescue plasmids (or pFLAG_CMV10 negative control) in a total volume of 50 mL with DMEM (GIBCO), without FBS. The reporter/rescue/DMEM solution was mixed with 50 mL DMEM (no FBS) containing 1.95ul of LipoD293 (Sinagen). After 15 min incubation at room temperature, the mixture was added dropwise to each well.
Cells were harvested 36 h later by removing the media, adding 500 mL of cold PBS and gentle pipetting until the cells were detached. The harvested cells were spun down at 300 g for 2 min in a cooled table top centrifuge, resuspended in 50 ul RIPA buffer (150mM NaCl, 50mM Tris-Cl pH 8.0, 1%NP-40, 0.5% deoxycholate, 0.1% SDS) and transferred to a black, 96-well plate (Corning) for the fluorescence readout using a Clariostar plate reader.
The raw GFP fluorescence intensity (FI) was divided by mCherry FI (transfection control) and normalized against the mock-treated control. Each condition was done in biological triplicates. The error bars in the figures correspond to the standard deviation of the measurements.
After FI readout, samples were analyzed for transgene expression by western blot.
Immunoprecipitation assay HEK293T cells were seeded into 6-well plates with a density of 500 000 cells per well in 1.5 mL D-MEM medium supplemented with 10% FBS. Cells were transfected 24 h later by preparing 50 mL Opti-MEM with 1 mg of each transfected plasmid and 50 mL Opti-MEM with 3 mg PEI25k per mg plasmid for each well. Subsequently, PEI and Plasmid were mixed, incubated at room temperature for 30 min and dropwise added into the cells. After 24 h of transfection 1.5 mL D-MEM/FBS was added to each well. 48 h after transfection the cells were harvested by removing the media, adding PBS and detaching the cells by gentle pipetting. The cells were spun down, resuspended in 400 mL lysis buffer (150 mM KCl, 20 mM HEPES-KOH pH 7.8 and 0.1% Triton X-100) and sonicated for 10 s at 30% amplitude. Next, the sonicated cells were spun down using a table top centrifuge at 20 000 g at 4 C for 30 min and the supernatant was added onto affinity resin (HA agarose) to capture the bait protein. The beads and the lysate were incubated for 2h on a turning wheel. Subsequently, the beads were spun down, the supernatant removed and the beads were washed 3 times with 150 mM KCl and 20 mM HEPES-KOH pH 7.8. Finally, the beads were resuspended in SDS sample buffer and heated up to 85 C for 5 min to release bound proteins. Input and elution fractions were analyzed by western blotting.

Western blot
A PVDF membrane (Merck) was activated for 2 s in 100% EtOH and incubated for 5 min in transfer buffer (1xTris-Glycine, 20%EtOH). A wet transfer was performed for 60-90min at 30V in an Invitrogen blotting chamber. The membrane was blocked with 5% milk in PBS supplemented with 0.1% Tween 20 (hereafter referred to as PBST) for 1h at room temperature. Primary antibodies were added (anti-INTS4 -1:2000; anti-FLAG -1:5000; anti-GAPDH -1:10000, anti-HA -1:5000; all antibodies were diluted in 5% milk with PBST) and incubated for 1h at room temperature. The membrane was washed 3 times for 5 min with PBST and in case of INTS4 or GAPDH detection anti-rabbit IgG (1:10 000 in 5% milk in PBST) was added and incubated at room temperature for 1h. The membrane was washed 3 times for 5min with PBST and imaged using chemiluminescent substrate detection based on HRP (Pierce) in a Chemidoc imager (Bio-Rad).

QUANTIFICATION AND STATISTICAL ANALYSIS
The error bars in Figures 3 and 6 correspond to the standard deviation of the 3 individual measurements. For the EMSA assay ( Figures 6 and S1), individual bands were quantified in Image lab (Bio-Rad) to calculate fraction of RNA bound by protein in each condition. The titration curves were modeled using a modified Hill equation, as previously described (Ryder et al., 2008). The uncertainty of the apparent Kd estimation ( Figure S6) was estimated based on standard deviation of the nearest experimental titration point. Examples of particles picked for processing are indicated with yellow circles. g, reference-free 2D class averages of a small negative stain dataset. h, reference-free 2D class averages of the cryo-EM dataset. Three datasets were merged and processed as described in details in the method section.    Figures 2, 3 and 4. a, FSC plot for all four models used in this publication. b, Orientation distribution of the final, 3.5 Å map used for model building. c-f, cryo-EM maps from different reconstructions, coloured according to the local resolution.