Protocol for quantitative analysis of RNA 3′-end processing induced by disassociated subunits using chromatin-associated RNA-seq data

Summary Sequencing chromatin-associated RNA using libraries from the chromatin fraction makes it possible to characterize RNA processing driven by disassociated subunits. Here, we present an experimental strategy and computational pipeline for processing chromatin-associated RNA-seq data to detect and quantify readthrough transcripts. We describe steps for constructing degron mouse embryonic stem cells, detecting readthrough genes, data processing, and data analysis. This protocol can be adapted to various biological scenarios and other types of nascent RNA-seq, such as TT-seq. For complete details on the use and execution of this protocol, please refer to Li et al. (2023).1


SUMMARY
Sequencing chromatin-associated RNA using libraries from the chromatin fraction makes it possible to characterize RNA processing driven by disassociated subunits. Here, we present an experimental strategy and computational pipeline for processing chromatin-associated RNA-seq data to detect and quantify readthrough transcripts. We describe steps for constructing degron mouse embryonic stem cells, detecting readthrough genes, data processing, and data analysis. This protocol can be adapted to various biological scenarios and other types of nascent RNA-seq, such as TT-seq. For complete details on the use and execution of this protocol, please refer to Li et al. (2023). 1

BEFORE YOU BEGIN
This section includes the minimal hardware requirements, the installation procedures of essential tools, as well as the construction of RPB3 and dRPB3 degron systems in mouse embryonic stem cells (mESCs), which are the fundamental cell lines used for creating ChAR-seq libraries and subsequent sequencing.

Construction of RPB3 and dRPB3 degron mESCs
Timing: 6-7 weeks 1. Use RPB3 C-terminal targeting sgRNA, CRISPR/Cas9, and donor plasmid to generate RPB3 degron cell line ( Figure 1). a. Co-transfect RPB3 C-terminal targeting sgRNA, CRISPR/Cas9, and donor plasmids into OsTir parental cell. b. Co-transfect mESCs using FuGene HD transfection reagent, following the manufacturer's instructions. c. After two days, treat the cells with 500 mg/mL Geneticin for one week to select successfully transfected cells, change the medium every two to three days. d. Pick colonies and culture in a 48-well plate for genotyping. e. Design forward and reverse genotyping primers for the region about 200 bp upstream or downstream from the sgRNA targeting sites. f. Amplify genomic DNA by PCR and check for successful knock-in by observing PCR products around 3.5 kb larger than those from wild-type cells. (A) tet-OsTIR1 is inserted into Rosa26 locus with CRISPR/Cas9 and selected with 5 mg/mL puromycin. Middle: mAID-GFP tag is inserted at the last exon of RPB3 gene with CRISPR/Cas9 and selected with 500 mg/mL geneticin for one week. Bottom: mouse cDNA of RPB3 with 3xFlag tag is inserted at the last exon of RPB1 gene with CRISPR/Cas9 and selected with 250 mg/mL hygromycin for one week. (B) A schematic illustration of genotyping primers used to detect the genotype of TIR1 parental cells at Rosa26 locus. (C) Genotyping results to detect the homogeneity of TIR1 parental cells. The product length of primers P1P2, P3P4, and P1P4 is 3504 bp and 3841 bp, respectively. Since the PCR product of P1P4 in TIR1 parental cells should be longer than 6000 bp, it is not easy to extract such long genomic DNA from cells. Therefore, we just got a product with 1827 bp of P1P4 that can be derived from WT cells but no bands in TIR1 parental cells.

Reagent Final concentration Amount
Tris-HCl (pH 7. 4. Cell lysates were gently put on the top of the sucrose cushion (24% sucrose in NP-40 lysis buffer), centrifuged at 12,000 g for 10 min. Perform sucrose cushion precipitation to isolate nuclei. Discard the supernatants representing the cytoplasm and resuspend the pellets in 0.5 mL glycerol buffer. 5. Add 0.5 mL nuclei lysis buffer to lyse the nuclei and incubate on ice for 2 min. 6. Centrifuge the mixture and discard the supernatant representing the nucleoplasm fraction. 7. Discard the supernatant and washed twice with ice-cold PBS. 8. Extract the RNA from the chromatin pellet using TRIzol reagent according to the manufacturer's manual. Troubleshooting 1. 9. Send the RNA samples to Novogene for ribosomal RNA depletion, strand-specific library construction, and sequencing.
Alternatives: Cells from other species can also be used for normalization. We recommend using higher percentage of spike-in controls, such as 10%-20%.
Note: In 1-9 steps, different samples should be processed at the same time by the same reagents and equipment to reduce batch effects.

Quality control and sequence alignment
Timing: 10 h The workflow for ChAR-seq data processing is shown in Figure 2.
The minimum computational requirement: this pipeline requires at least 10 GB of RAM. 10. Trim adapter and low-quality bases from the raw sequencing data using the cutadapt tool (v2.10) with customized settings. Low-quality bases with a Phred quality score of <15 will be trimmed and reads with a final length of <25 bp will be discarded.      #Define the list of samples, stages and strands as arrays samples=("RPB3_IAA" "dRPB3_IAA") stages=("0h" "1h" "3h") strands=("plus" "minus")    d. Run ngs.plot to generate sense and anti-sense profiles for specific regions of interest using the correct BAM file. Note that if the libraries were produced such that mate2 represents the forward strand, the sense and anti-sense profiles would be reversed.
e. Use R to combine the sense and anti-sense profiles on a single set of axes by utilizing the ngs.plot output. f. Scale the profile by employing the calculated calibration factors in step 17. g. Edit the resulting ".cnt" file to multiply the read count by the calibration factor, and re-run ngs.plot using the same parameters as before with the modified one.
# Note: Step 26d is part of a large script along with the step 26a and 26c.
# Variables are defined earlier in the same script.   # The following R code refers to metagene plot for dRPB3 in Figure 3C # Note: similar code can be employed for RPB3.
29. Check the strandedness of comma-separated GTF files, and use it as input to generate a gene loci annotation file.
Note: The output file is named loci_annotation.bed, to be used as input for Get_DoGs 30. Preprocess, sort and index the input bam files to keep their order constant. Note: This step may take a long time. Once finished, the output raw and downsampled files will be located in the same folder as the original bam files.

Detect readthrough gene candidates
Timing: 2 h 32. Use the downsampled bam files and a loci annotation bed file as input for the DoGFinder tool to identify readthrough gene candidates. a. Remove all genic reads from the bam file. b. Limit gene boundaries by the location of the nearest 3' neighboring gene in the genome, and discard genes with 3' nearest neighbor closer than 4 kb. c. Run the Get_DoGs function with the following parameters to identify readthrough gene candidates based on a minimum length of 4 kb and a minimum coverage of 60% over the entire downstream length: -S -minDoGLen 4000 -mode F -minDoGCov 0.6. d. Elongate the identified readthrough gene candidates to find their putative endpoint.  a. Gather all the annotation bed files of readthrough gene candidates from different replicates that need to be intersected. b. Use the " Common_DoGs_annotation" function from DoGFinder to get the list of readthrough candidate gene sets common to all biological replicates. c. Execute the "sortBed" function from bedtools to sort the intersected bed file by chromosome and coordinate. d. Choose the most downstream coordinate as the end coordinate for all intersected readthrough gene candidates. Use the "awk" command to extract the most downstream coordinate for each gene name and save it in a new bed file. a. Collect all DoG annotation bed files to be merged for different treatments. b. Use the " Union_DoGs_annotation" function from DoGFinder to create a single DoG annotation bed file. c. For any DoG that appears in more than one input file, unify them according to their maximal length. d. Execute the "sortBed" function from bedtools to sort the merged bed file by chromosome and coordinate. e. Format the output bed file as a tab-separated gene annotation bed file to be used as input for Get_DoGs   36. Exclude very short genes with a length of less than 2 kb from the analysis. 37. Generate a sub-list of active genes from our untreated PRO-seq data.
Note: Gene activity was determined based on the ratio of N/L, where N corresponds to the count of coding-strand Pro-seq reads from +1kb (relative to the TSS) to the end of each gene, and L denotes the number of mappable bases in this region. The statistical significance of a given gene's activity level was estimated by calculating the probability of observing a minimum of N reads within an interval of length L, derived from a Poisson distribution of mean l = 0.04 reads/kb. Genes with a probability of less than 0.01 were defined as active, as described in our publication. 2 38. Refine readthrough gene candidates identified in step 34 that overlap with active genes but do not overlap with readthrough regions that correspond to neighboring genes on either strand using Bedtools intersect analysis. a. Identify readthrough gene candidates that overlap with active genes. b. Find transcripts whose TSS or TTS region extended in 5 0 and 3 0 overlaps with any transcript from another gene. c. Remove readthrough gene candidates that overlap with neighboring genes on either strand. d. Keep each readthrough gene a unique name. Note: The bedtools and gtftk tools have already been installed in the system's PATH.
39. Obtain the genomic coordinates for the active genes and readthrough regions of neighboring genes. 40. Identify overlap between the active genes and the readthrough regions according to bedtools intersect analysis. a. Create a list of active genes that do not overlap with readthrough regions of neighboring genes on either strand. b. Subtract the readthrough genes from this list of active genes to create a new list of non-readthrough genes that fail to generate readthrough transcripts. c. Ensure that each non-readthrough is given a unique name.
Note: The script in this part is similar to that of defining readthrough genes.

Quantify the readthrough indices
Timing: 1-2 h 41. Consider genes with a minimum length of 2 kb to exclude very short genes from the analysis. Troubleshooting 4. 42. Define two sub-genic windows as the following: termination windows are from 3 to 6 kb downstream of polyA sites; pre-polyA windows are 1 kb upstream of respective polyA sites. 43. Calculate the ChAR-seq read counts in two sub-genic windows. 44. Filter out genes with fewer than five reads in either termination or pre-polyA windows from the analysis. 45. Determine the readthrough index as the ratio of the ChAR-seq read coverage in the termination window to that in the pre-polyA window.
Note: The readthrough indices are float numbers generated in step 45, which need to be rounded into an integer.

EXPECTED OUTCOMES
We performed ChAR-seq data analysis on six samples following this protocol. Our findings indicate that dRPB3 regulates the 3 0 end processing of ribosomal subunit genes. Specifically, the results demonstrate that the detected readthrough genes in dRPB3 were significantly higher than those in RPB3 degron cells ( Figure 5A). Moreover, Differential 3 0 end processing analysis between samples suggests that dRPB3 plays an important role in RNA processing, given the higher proportion of observed changes in the readthrough index for ribosomal subunits upon dRPB3 depletion (Figure 5B). Finally, in line with prior findings, metagene analysis of ChAR-seq signals at TTS revealed increased transcriptional readthrough for ribosomal subunit genes after dRPB3 depletion ( Figure 5C).

QUANTIFICATION AND STATISTICAL ANALYSIS
For the differential 3 0 end processing analysis, the changes in the readthrough index were identified using the negative binomial generalized linear models implemented in the DESeq2 R package. Statistical significance was assessed by a two-sided Wilcoxon rank-sum test.

LIMITATIONS
The Specific Degradation of Disassociated Subunits (SDDS) strategy employed in our study could be extended to investigate the disassociated subunits of other multi-molecular complexes such as ribosomes, spliceosomes, and DNA polymerases. However, to apply the SDDS approach to investigate such complexes, it is necessary to optimize the design to ensure that fusions with specific interacting proteins faithfully mimic the functions of endogenous protein complexes and do not cause noticeable adverse effects. In addition, deletion mutants, combined with functional assays, would be ideal to further validate the conclusions drawn from the experimental results.
To investigate the immediate functions of dRPB3 and RPB3 in RNA processing, we established a degron system for these proteins in murine embryonic stem cells and utilized auxin-inducible degron (AID) technology 1,2,14-17 to rapidly deplete them. We then conducted chromatin-associated RNA experiments with spike-in controls in Drosophila S2 cells to evaluate the impact of their depletion. This approach is similar to that employed in previous studies. 18,19 To enhance the statistical confidence of the ChAR-seq data analysis, we included two biological replicates in all the datasets to minimize experimental bias and ensure the robustness of the statistics performed using this protocol. For example, the replicates were treated separately for significance assessment for differential expression and 3 0 end processing analysis, whereas for constructing metagene profiles, replicates were merged to maximize the power of revealing differences in the gene body and readthrough signals. Nonetheless, this protocol did not test datasets with a variable number of replicates, leaving open the potential issue of how to process such data. Therefore, the handling of replicates should be tailored to each experiment stage and analysis purpose to ensure the reliability of the results.

TROUBLESHOOTING Problem 1
Low yield of RNA extraction of chromatin-associated RNA may arise in step 8.

Potential solution
Increase the starting materials (e.g., larger number of cells). Adjust the amount of TRIzol (for RNA extraction) to avoid incomplete cell lysis due to the insufficient amount of TRIzol.
# This R code refers to metagene plot for dRPB3 in Figure 5A # Note: similar code can be employed for RPB3.

Problem 2
Low mapping reads of dm6 may occur in step 17.

Potential solution
Increase the percentage of spike-in Drosophila S2 cells.

Problem 3
Compatibility issues due to old versions of software and algorithms used in the protocol.

Potential solution
We mentioned this issue in the Note of the software and algorithms section. In brief, all the custom scripts run seamlessly with the R version (v.4.1.1), Perl (v5.26.2) and Python (v3.7.4), and the same results can be reproduced. Although we anticipate that forthcoming versions of software employed for NGS read processing will sustain their analytical and statistical efficacy using the same basic parameters, their performance on our data has yet to be evaluated. In the near future, we aim to provide a Docker image containing the identical software version on the Github repository to resolve the issue.

Problem 4
When running a piece of code, the output is not the expected (i.e., an error message in the console or an unexpected feature in the plot).

Potential solution
For the proper execution of the piece of code, it is important to verify that the syntax is correct. Specifically, examine for syntax errors, such as missing commas, unclosed brackets or parentheses, quotations, and misspelled functions or filenames.

Problem 5
The variability in the number of readthrough gene candidates identified by DoGfinder in step 32, depends on the sequencing and library type.

Potential solution
Adjust the parameters used in the readthrough gene identification process, given its reliance on continuous coverage criteria. Specifically, stricter minimal length and coverage parameters could be chosen when analyzing non-polyA selected RNA-seq libraries, since readthrough genes are known to remain nuclear. In our recent work, 2 we have adopted default parameters of 4000 bases and 60% coverage for polyA-selected RNA-seq libraries, which have been found to be appropriate in previous studies. 12,20 To further refine the list of candidates, we suggest considering highly expressed genes by limiting the selection to those exceeding a certain RPKM value threshold.

Potential solution
Perform sample clustering and utilize visualization tools, including PCA plots and scatterplots, to comprehensively evaluate batch effects. For experiments involving different batches, it is recommended to apply specialized software such as SVA or RUVseq to correct batch effects accurately.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to the lead contact, Xiong Ji (xiongji@pku.edu.cn).

Materials availability
Regents generated in this study are available upon request to thelead contact.

Data and code availability
The data used in this study are described in Li et al. (2023). 1 The GEO number of RPB3 ChAR-seq data is GEO: GSE179962. The GEO number of dRPB3 ChAR-seq data is GEO: GSE225453. This protocol includes all codes. The source code for executing this pipeline is readily available for download as a supplementary file and archived at Zenodo [https://doi.org/10.5281/zenodo.7936207].