I-ATAC: interactive pipeline for the management and pre-processing of ATAC-seq samples

Assay for Transposase Accessible Chromatin (ATAC-seq) is an open chromatin profiling assay adapted to interrogate chromatin accessibility from small cell numbers. ATAC-seq surmounted a major technical barrier and enabled epigenome profiling of clinical samples. With this advance in technology, ATAC-seq profiles of clinical samples are accumulating at an unprecedented rate. These epigenomic profiles hold the key to uncovering how transcriptional programs are established in diverse human cells and how they are disrupted by genetic or environmental factors. Thus, the barrier to deriving important clinical insights from clinical epigenomic samples is no longer data generation but data analysis. Specifically, we still lack easy-to-use software tools that enable non-computational scientists to analyze their own ATAC-seq samples. To facilitate systematic pre-processing and management of ATAC-seq samples, we developed an interactive, cross-platform, user-friendly and customized desktop application: interactive-ATAC (I-ATAC). I-ATAC integrates command-line data processing tools (FASTQC, Trimmomatic, BWA, Picard, ATAC_BAM_shiftrt_gappedAlign.pl, Bedtools and Macs2) into an easy-to-use platform with a user interface that automatically pre-processes ATAC-seq samples with parallelized and customizable pipelines. Its performance has been tested using public ATAC-seq datasets from GM12878 and CD4+ T cells, and a feature-based comparison was performed with several available interactive LIMS (Galaxy, SMITH, SeqBench, Wasp, NG6, openBIS). I-ATAC is designed to empower non-computational scientists to process their own datasets and to break the exclusivity of data analysis to computational scientists. Additionally, I-ATAC can process WGS and ChIP-seq samples, and can be customized by the user for one independent operation or multiple sequential operations.


1 Motivation

The use of high-throughput sequencing technologies has brought an enormous increase in the production of heterogeneous genomic data in recent decades. The importance of genomic dataset processing is well known in the genomics community, as it plays an important role in analysing the dynamics and complexities of gene regulation through the modelling and implementation of different statistical methods in data processing pipelines.

The traditional way of pre-processing next generation sequencing (NGS) data is complex and based on running a series of command-line applications in Unix, Linux, MAC and DOS environments, which requires good knowledge of bioinformatics tools and good programming skills. There are over 200 tools available for genome and exome sequencing data pre-processing and analysis (Pabinger

paired end). The next step is the processing of ATAC-seq samples. A typical ATAC-seq data processing pipeline's workflow is shown in S- Fig. 1

the field of epigenomic data analysis and are not familiar with ATAC-seq data processing steps. The GUI of I-ATAC (S- Fig. 2 A and B) is designed for simplicity and ease of use by following human computer interaction (HCI) guidelines (Ahmed et al., 2014). The concept behind the I-ATAC GUI design was to implement "One Click Operations", similar to a Google search that requires users to enter one natural-language query and click a search button. Similarly, along with the default or customized settings (S- Fig. 2 B), I-ATAC requires only the path to the sample data files (zipped or unzipped "FASTQ" files), a project name and a press of the "Run ATAC-Seq" button (S- Fig. 2

I-ATAC is a platform designed by following software engineering principles for sustainable bioinformatics software implementation (Ahmed et al., 2014). Here, we present its operational workflow, data structure and components' orientation.

Operational Workflow of I-ATAC
Following the default workflow (S- Fig. 3), the user can process ATAC-seq samples with the complete pipeline, which executes all integrated applications (FASTQC, Trimmomatic, BWA, Picard, ATAC_BAM_shiftrt_gappedAlign.pl, bedtools and macs2), but the user is not limited in the use of I-ATAC (S- Fig. 3

Users can handle sample data files remotely, either by keeping them in their parent directory and writing only the pre-processed results to the main project and sub-project directories, or by first copying the compressed files into the project directory, unzipping them and then processing them. Users can configure job (UNIX based Secure Shell Scripts) settings to process one or multiple samples at a time, as one job or as multiple jobs (one for each sample).

I-ATAC also lets users customize the data pre-processing steps by choosing between applications and by setting different parameters (S- Fig. 4), which allows the pipeline to be customized for the analysis of other data types, such as ChIP-seq data. As output, I-ATAC produces data quality reports that can be visualized within the platform. It also outputs ATAC-seq reads that are filtered, trimmed and aligned, as well as peak calls from these reads.

3.2 Applications integration, data processing pipeline and project's directory structure

The ATAC-seq data processing pipeline starts with a quality check; paired-end reads are then trimmed, aligned, filtered and sorted in a "sam" file. The "sam" file is compressed and indexed to a "bam" file, which is then used as input for peak calling. To manage pre-processed data, the proposed directory structure is followed and automatically created in the data cluster before data processing (S- Fig. 5).

All the quality reports ("zip" and "html" files) are placed in the "fastQC" sub-directory. Compressed files contain different output files including text ("txt") and web page ("html"

All trimmed and filtered "FASTQ" files are placed in the "trimmomatic" sub-directory, and all the sorted, shifted "sam", indexed "bam" and "bed" files are placed in the "bwa" sub-directory. All the observed peak files are placed in the "macs2" sub-directory. The nested directory structure provides organized and modular storage for the multi-level ATAC-seq data analysis pipeline. The results produced, in the form of sorted "sam" and "bam" files as well as peaks, can be visualized using available genome data browsers (e.g. UCSC, Chipster) and viewers (e.g. IGV).

S- Fig. 6: I-ATAC: Components workflow, operating systems and physical data storage in data cluster.
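The directory layout described in this section can be illustrated with a minimal Python sketch. It is not I-ATAC's own code: the function names, the per-sample nesting under the project directory, and the suffix-to-directory mapping (including the ".narrowPeak" suffix for peak files) are assumptions made for illustration. Only the four sub-directory names are taken from the text.

```python
from pathlib import Path

# Sub-directory names come from the text; the suffix rules are illustrative assumptions.
SUBDIR_BY_SUFFIX = {
    ".zip": "fastQC",          # quality reports
    ".html": "fastQC",
    ".fastq": "trimmomatic",   # trimmed and filtered reads
    ".sam": "bwa",             # sorted, shifted alignments
    ".bam": "bwa",
    ".bed": "bwa",
    ".narrowPeak": "macs2",    # peak files (assumed suffix)
}


def create_sample_layout(project_dir, sample):
    """Create <project>/<sample>/{fastQC,trimmomatic,bwa,macs2} and return the sample directory."""
    sample_dir = Path(project_dir) / sample
    for name in sorted(set(SUBDIR_BY_SUFFIX.values())):
        (sample_dir / name).mkdir(parents=True, exist_ok=True)
    return sample_dir


def destination_for(sample_dir, filename):
    """Return the path under which an output file would be stored, based on its suffix."""
    suffix = Path(filename).suffix
    if suffix not in SUBDIR_BY_SUFFIX:
        raise ValueError(f"no sub-directory rule for {filename!r}")
    return Path(sample_dir) / SUBDIR_BY_SUFFIX[suffix] / filename
```

Creating the layout before any tool runs, as the text describes, means each step can write straight to its final location, and a rerun does not fail on directories that already exist.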

Components workflow, operating systems and physical data storage in data cluster
The components workflow (S- Fig. 6

The sample, sequenced data files, applications (S- Table. 3), compilers and interpreters (S- Table. 4), pre-processed data and scripts need to be placed in the data cluster.

GUI Description
As shown in S- Fig. 2 A and B, the overall GUI of I-ATAC is divided into two modules: Process and Settings.

The Process module is used to generate and run the pipeline.

The GUI-B module is mainly used to set the parameters of the applications and directory paths. As shown in S- Fig. 8, it provides only four features: Applications Parameters, Directory Paths, Save and Load Parameters, and Reset Paths (S- Fig. 8 and S-

User Login: The user is required to enter the name of the host (attached data cluster or name of the personal computer), login name and password so that I-ATAC can log in to the host and access the sample data files ("FASTQ") and applications to perform data processing.

generating and submitting one data processing job (one for all.

3. Merge Replicates: Applicable only when multiple samples are processed at a time by submitting one data processing job for all. It enables selection of all generated "bam" files from all the pre-processed samples' directories (bwa) and performs peak calling.
4. Wall Time: Sets the time to be allocated for processing the queued job. In the case of multiple parallel jobs, the provided time is set for all jobs.
5. Nodes: Sets the number of nodes (connection points) requested for the job. The default is 1.
6. Processor per node (ppn): Sets the number of cores (virtual processors) per node. The default ppn is 1.
7. Email: Sets the e-mail address to be notified (cancelled, completed) about the status of the submitted job.

8. Create & Queue Jobs: If the host is a data cluster, I-ATAC will prepare and submit jobs.
9. Direct Processing: If the host is a personal computer, I-ATAC will prepare and submit instructions.
10. Create soft links: When this option is checked, I-ATAC creates soft links to the FASTQ files in the output directory.
11. Copy: When this option is checked, I-ATAC copies the FASTQ files into the output directory.
12. *.gz zipped files: When this option is checked, I-ATAC expects the input FASTQ files to be zipped; otherwise it expects them to be unzipped.

(input/output) redirected (one's output is treated as another's input, in terms of both data analysis and processing) and integrated method (S- Fig. 6). Additionally, it requires all needed compilers and interpreters to be downloaded and installed as well (S-

MACS2
Model-based Analysis of ChIP-Seq (MACS) (Zhang et al., 2008) is a tool for analyzing short reads with respect to the spatial resolution of the predicted sites, capturing local biases in the genome and generating peaks with detailed information about length, genome coordinates, summit, p-value, q-value, false-discovery rate (FDR) and fold enrichment. Details of the MACS2 version used, including input, output and download details, are given in S-

After launching I-ATAC and before starting data processing, it is important to set valid application paths and calling protocols (section: Graphical User Interface of I-ATAC). Our default parameters (S- Fig. 9A, 9B and 9C) are set according to our data cluster and the installed versions of the applications (S- Table. 3) and compilers/interpreters (S- Table. 4).

With the default configuration settings, I-ATAC assigns the logged-in user a default directory in the data cluster with the same name as the user (e.g. Zeeshan: "d:/data/Zeeshan/ATAC_PROJECTS/"). However, the user can alter, reset and save the default project directory settings.
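Because processing depends on valid application paths, it can help to check the configured paths before a run. The following Python sketch shows one way such a check might look. It is an illustration under stated assumptions and not part of I-ATAC: the function name and the shape of the settings dictionary (application name mapped to a path string) are hypothetical.

```python
import os
from pathlib import Path


def find_invalid_paths(app_paths):
    """Return {application: reason} for each configured path that cannot be used.

    `app_paths` maps an application name to its configured path (assumed format).
    """
    problems = {}
    for app, raw_path in app_paths.items():
        path = Path(raw_path).expanduser()
        if not path.exists():
            problems[app] = "path does not exist"
        elif not os.access(path, os.R_OK):
            problems[app] = "path is not readable"
    return problems
```

An empty result means every configured path was found and is readable; any entry points to a setting that should be corrected in the Settings module before jobs are created.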

7 Case Studies

In order to validate the performance of I-ATAC and to guide users, we present two case studies. The first uses example data: we created a small example dataset (provided in the supplementary material and downloadable from https://zenodo.org/record/46079#.VsJMg7S5LHM) with artificial names, to explain the process and execution steps in a simpler way. The purpose of the example study is to let the user run the application and observe results in the shortest possible time. It also helps in identifying and resolving problems (e.g. those caused by inappropriate installation of the downloaded applications and compilers/interpreters, or by any other exceptional condition). The second study uses publicly available data (GM12878, CD4), which a trained user can download and process using I-ATAC. In both case studies, I-ATAC was run on the Mac OS X Yosemite 10.10.5 platform.

Dataset Details
The raw dataset and the results produced in this example case study can be downloaded from the provided project web link. Sequenced, paired sample data ("FASTQ" or "FASTQ.gz") files need to be collected and placed in the attached data cluster.
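Since the input consists of paired files, each sample contributes two "FASTQ" or "FASTQ.gz" files. As an illustration only, the Python sketch below groups file names into per-sample pairs. The `<sample>_R1` / `<sample>_R2` naming convention and the function name are assumptions; the text does not state how I-ATAC matches mates.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq[.gz] and <sample>_R2.fastq[.gz]
_MATE_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(?:\.gz)?$")


def pair_fastq_files(filenames):
    """Group FASTQ file names into {sample: (read1, read2)}; other files are ignored."""
    mates = defaultdict(dict)
    for name in filenames:
        match = _MATE_PATTERN.match(Path(name).name)
        if match:
            mates[match["sample"]][match["mate"]] = name
    pairs = {}
    for sample, found in mates.items():
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate file")
        pairs[sample] = (found["1"], found["2"])
    return pairs
```

Raising an error for an incomplete pair, rather than silently skipping the sample, surfaces a missing file before any job is created.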

Input
The input to I-ATAC is the path to ATAC-seq sample data, which in our case is: "/data/zahmed/ATAC_PROJECTS/gz_fastq_files"

As shown in S- Fig. 10, there are two samples available (paired data, four zipped "FASTQ" files) in the above-mentioned directory, i.e. "gz_fastq_files", which are:

At successful verification, the file status window (S- Fig. 12) provides information about the located sample data files, which were copied, pasted and unzipped in the project directory (Example_with_gz_files). At the second successful verification, the ATAC-seq data processing pipeline was automatically scripted (S- Fig. 13) and the created job was queued to the data cluster (S- Fig. 14).
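To make the scripting step concrete, the following Python sketch renders a job script from the settings exposed in the GUI (wall time, nodes, processors per node and e-mail) and supports either one job for all samples or one job per sample. It is a sketch under stated assumptions, not the script I-ATAC generates: the text does not name the cluster scheduler, so the PBS/Torque-style `#PBS` directives, the default wall time, the job names and both function names are assumptions. The defaults of 1 node and 1 processor per node follow the text.

```python
def render_job_script(job_name, commands, walltime="01:00:00", nodes=1, ppn=1, email=None):
    """Render a PBS/Torque-style job script (scheduler syntax is an assumption)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l nodes={nodes}:ppn={ppn}",   # defaults of 1 node and 1 ppn follow the text
        f"#PBS -l walltime={walltime}",       # default wall time is an assumption
    ]
    if email:
        # "-m ae" requests mail when the job is aborted or ends
        lines += [f"#PBS -M {email}", "#PBS -m ae"]
    lines.append("set -e")  # stop at the first failing pipeline step
    lines.extend(commands)
    return "\n".join(lines) + "\n"


def render_jobs(commands_by_sample, one_job_for_all=True, **settings):
    """Return {job name: script}: one job for all samples, or one job per sample."""
    if one_job_for_all:
        merged = [cmd for cmds in commands_by_sample.values() for cmd in cmds]
        return {"all_samples": render_job_script("all_samples", merged, **settings)}
    return {
        sample: render_job_script(sample, cmds, **settings)
        for sample, cmds in commands_by_sample.items()
    }
```

For example, `render_jobs({"s1": ["echo s1"], "s2": ["echo s2"]}, one_job_for_all=False, walltime="04:00:00")` returns two scripts that share the same wall time, matching the description that the provided time applies to all parallel jobs.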

Output
After the successful execution of the ATAC-seq data processing pipeline, the system's generated output can be located in the mentioned output directory (S- Fig. 15). The output files were placed in the automatically created sub-directory structure of the proposed system (Section: Applications integration, data processing pipeline and project's directory structure), as shown in S- Fig. 14. We also input two samples and asked the system to produce merged replicates, so we observed results for both samples as well as for the merged replicates.

S- Fig. 15: Screen shot (Linux Terminal, using Mac-OS-X) of produced I-ATAC output project directory and files

The produced results from First_SampleData are shown in S- Fig. 16

Conclusions
To the best of our knowledge, the I-ATAC platform is the first desktop tool specialized for the processing and analysis of ATAC-seq data. I-ATAC provides a flexible algorithm and parameter setting GUI for non-computational scientists and a time-efficient parallel data analysis environment for computational scientists. Future work includes incorporating visualization and differential analysis modules into the I-ATAC platform.