Protocol for integrated analysis of bacterial RNA-seq and ChIP-seq data for establishing a co-expression network

Summary Transcriptomic data and ChIP-seq data from bacteria are systematically analyzed and efficiently combined. We describe the software environment for analysis and the download and installation methods. Furthermore, we describe the analytical process and present the corresponding mini-test data, which can be conveniently restored and reproduced by users. Moreover, we provide the script for data consolidation, which allows multiple files to be rapidly merged. Overall, this protocol presents software parameters, R codes, and in-house Perl scripts for analyzing bacterial multi-omics data. For complete details on the use and execution of this protocol, please refer to Xin et al.

R language code for easy data visualization and figure graphing In-house scripts help to process intermediate text, making the analysis streamlined Mini data for sequencing and genomes for preliminary simulated analyses SUMMARY Transcriptomic data and ChIP-seq data from bacteria are systematically analyzed and efficiently combined. We describe the software environment for analysis and the download and installation methods. Furthermore, we describe the analytical process and present the corresponding mini-test data, which can be conveniently restored and reproduced by users. Moreover, we provide the script for data consolidation, which allows multiple files to be rapidly merged. Overall, this protocol presents software parameters, R codes, and in-house Perl scripts for analyzing bacterial multi-omics data. For complete details on the use and execution of this protocol, please refer to Xin et al.

BEFORE YOU BEGIN
Before you begin your analysis, you need to prepare a platform for data storage and analysis. This procedure is performed on Linux CentOS, and you can download software and R packages online. The memory of the platform should be at least 10 times the size of the genome data.
Processing of the ChIP-seq data is initiated with a quality check to ensure that the data do not contain the artificial sequences typically added during the experiment. The data filtered by removing adapter and primer sequences and low-quality reads are typically referred to as clean data. In the subsequent analysis, we use only clean data. After using Bowtie2 1 and Samtools 2 to translate the clean data into the genome, the Model-Based Analysis of ChIP-Seq (MACS) 3 framework is used to identify peaks. Bowtie2 is an ultrafast in-province tool for matching short sequences to template genomes. Bowtie2 uses the Burrows-Wheeler transform to establish the index for the genome, which has a small memory footprint and is convenient for rapid comparison. MACS is one of the most widely used peak-calling programs, which improves the spatial resolution by combining information from two sequencing tags to integrate the site location and orientation. The ChIPpeakAnno 4 package is used to perform the downstream analysis after peak calling, including searching for the genes closest to the peak region, gene ontology (GO) enrichment analysis of the peak adjacent genes, and extraction of the sequences of the peak and its surrounding regions. Spliced Transcripts Alignment to a Reference (STAR) is used to map the transcriptome data to the genome, HTSeq 5 is used to obtain the raw counts, and DESeq2 6 is used to determine the differential genes. In conclusion, the protocol uses the mainstream software and scripts to improve the ease and efficiency of ChIP-seq and RNA-seq data processing.

Install required software
Timing: 10-30 min Install Miniconda and the software required by ChIP-seq and RNA-seq.

Install Miniconda.
Miniconda 7 is a free minimal installer for Conda that includes only Conda, Python, the packages they depend on, and a small number of other useful packages.
Note: Bioconda is a channel on Conda that distributes bioinformation software and currently includes more than 2,700 apps. Bioconda can be used to search for available software on this channel.
2. Install the software required by ChIP-seq.
Note: FastQC 8 is a quality control tool for high-throughput sequence data, which allows a user to easily perform quality control checks on raw sequence data from high-throughput sequencing pipelines. The input data may be BAM, SAM, or FastQ files (any variant). The HTML-based permanent report indicates the problematic areas through summary graphs and tables. FastQC can be installed using Conda.
Note: Bowtie is a super-fast, memory efficient tool for splicing short sequences into template genomes. Bowtie2 can be installed using Conda.
Note: Samtools is designed to handle SAM/BAM format alignment files. This tool can sort and merge SAM files and export BAM files. Samtools can be installed using Conda.
Note: MACS is one of the most widely used peak-calling programs and can be installed using Conda. FastQC, STAR, and HTSeq are required for RNA-seq analysis.
Note: The objective of STAR 9 is to align the sequenced reads to the reference genome. The software is characterised by high accuracy and mapping speed. Notably, the mapping speed is 50 times that of other competitive software.
Note: HTSeq is a Python package for analyzing high-throughput sequencing data. HTseqcount is an application that counts the reads in a number of units located on the genome, which refers to a set of locations on the chromosome (commonly known as the gene exon).

ChIP-seq analysis
Timing: 30-50 min Complete the entire ChIP-seq analysis process, including quality control, alignment and peak-calling. The corresponding test Data can be obtained from Data S1-Methods S1 in the supplemental information.
1. Perform sequence quality control by FastQC.
Note: Major parameters: -o: storage path of the generated report file and file name of the generated report; -t: number of threads; -q progress in real time. Note: Bowtie is a super-fast, memory efficient tool for splicing short sequences into template genomes. Bowtie2 can be installed using Conda.
Note: Mapping the high-throughput data to the index. -threads: number of threads.
Note: Build indexes with the FASTA-format P. syringae genome files. -threads: number of threads; -1: the left part of the paired-end data; -2: the left part of the paired-end data; and -S: File for SAM output. More information regarding the genome file can be found here: (NCBI: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000012205.1/ or http://pseudomonas. com/). The corresponding test Data can be obtained from Data S1-Methods S1 in the support file.
Note: Samtools is designed to handle SAM/BAM format alignment files. This tool can sort and merge SAM files and export BAM files. To install Samtools using Conda, use the command 'conda install samtools'. The main function of the view command is to convert the input file to the output file. Typically, the SAM file post-alignment is converted to the BAM file, and then various operations are performed on the BAM file. -q: mapping quality >= INT; -bS: output BAM format file and ignore the input format.
Note: MACS is one of the most widely used peak-calling programs. MACS can be installed using Conda with the command 'conda install macs2'. macs2_callpeak is one of the main macs2 features that can find ChIP-seq peaks using BAM files. -t: treatment group; -c: control group; -f: format of the input file; -g: effective genome size; -n: name of the output file; -B: determines whether the remaining fragment pileup, control lambda, -log10pvalue, and -log10qvalue scores are saved to the bedGraph file; -p: e-value; -nomodel: the bias of the peak candidate region is not considered, and the background l is used as the local l; -extsize: length of the transcription-factor-binding region. Note: Three R packages are needed. The ChIPpeakAnno package includes functions to retrieve the sequences around the peak; obtain the enriched GO terms; and find the nearest gene, exon, miRNA, or custom features such as the most conserved elements and other transcription-factor-binding sites supplied by users. GenomicFeatures 12 is a set of tools and methods for preparing and manipulating transcript centric annotations. The GenomicRanges 13 package defines general purpose containers for storing and manipulating genomic intervals and variables defined along a genome. The test data file can be downloaded from this link: https://github.com/dengxinb2315/STAR-protocol-data.git
The corresponding test Data can be obtained from Data S2-Methods S2 in the support file.
3. Perform sequence quality control by FastQC.
4. Perform alignment. RNA-seq reads are mapped to the P. syringae genome using STAR, and only the uniquely mapped reads are retained for the subsequent analyses. The STAR software is designed to align sequenced reads to the reference genome. The software is characterised by high accuracy and mapping speed (50 times those of other competitive software). a. Build a genome index.
b. Perform alignment using STAR Note: -runThreadN: number of threads; -genomeDir: index storage location; -readFilesIn: storage location of the paired-end FASTQ files; -outFilterMultimapNmax: maximum number of multiple alignments for a read that will be output to the SAM/BAM files; -outFilterMis-matchNmax: alignment is output only if it has no more mismatches than this value.; -outSAMtype: type of SAM/BAM output; -outFileNamePrefix: prefix of the output file; -quantMode: type of quantification requested 5. Obtain the raw counts using HTSeq.
HTSeq is a Python package for analyzing high-throughput sequencing data. HTSeq-count is an application that counts the reads in a number of units located on the genome, which refers to a set of locations on the chromosome (commonly known as the gene exon).
12. Construct the regulatory network.
a. The log2 fold change (log2FC) of the DEGs of each mutant is extracted based on the RNAseq data. b. Cytoscape is used to construct the regulatory network. c. The format of the input file is the classic network style, which includes the mutant names in the first column, DEG names in the second column, and log2FC in the third column. d. Use Cytoscape to build the networks through the following steps: i. Open Cytoscape and import the network by file.
ii. Select the first and second columns as the source and target nodes, respectively. iii. Make personalized adjustments to the chart, for instance, enclose the upregulated targets and downregulated genes in red and blue circles, respectively. iv. The size of the mutant circles represents the number of regulons. v. The layout is set as the Circular Layout. 13. Construct the functional heatmap.
Select the DEGs regulated by the cluster mutants to perform the GO analysis. First, use the in-house Perl script to determine the GO terms.  Finally, construct the functional heatmap using the R package pheatmap. 17 The R code is as follows:

EXPECTED OUTCOMES
First, the DEGs are identified using DESeq2 based on the RNA-seq data. Second, the binding peaks are identified after mapping the ChIP-seq reads to the genome files. Additional information, such as the sequences around the peak; enriched GO terms; and nearest gene, exon, miRNA, or custom features, is obtained through annotation. Third, the network of all interactions involving mutant target genes is construct based on the ChIP-seq results. Finally, the mutant regulatory network is established by integrating the regulons of the cluster mutants from the ChIP-seq and RNA-seq data obtained in different medium conditions.

LIMITATIONS
This version of the protocol is based on the RNA-seq and ChIP-seq data of bacteria that possess small genomes. Although it may be effective for other species, it may require considerable computing resources and long runtimes when processing data with huge genome files.

Problem 1
Software installation through Conda is extremely slow (Related to the section 'Install the required software').

Potential solution
Update the Conda software to the latest version using the command 'conda update conda'. Install the software using MAMBA instead of Conda. First, use the command 'conda install mamba' and then 'mamba install software'.

Problem 3
Cannot install the R package in R. (Related to the section 'Install R')

Problem 4
Errors appear when using 'read.table' in R. (Related to the section 'Obtain the histogram and heatmap to clarify the number of genes uniquely targeted by mutants')

Potential solution
Check whether each column in the data is tab-separated. Check the line name for special symbols or spaces. Check data integrity (presence of empty rows or redundant data).

Problem 5
The program yields too many DEGs based on the RNA-seq data, rendering the results chaotic and disordered. (Related to the section 'RNA-seq analysis')

Potential solution
Adjust the screening criteria to obtain the appropriate number of DEGs. When too many DEGs are identified, log2FC can be increased to 2 or the p-value can be decreased to 0.01.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Xin Deng (xindeng@cityu.edu.hk).

Materials availability
This study did not generate new unique reagents.

Data and code availability
This study did not generate new datasets or codes.