Delila-PY, a Pipeline for Utilizing the Delila Suite of Software to Identify Potential DNA Binding Motifs

Predicting potential DNA binding motifs is a critical part of understanding gene expression across all domains of life. Here, we report the development of Delila-PY, an easy-to-use pipeline to utilize the Delila suite to identify DNA binding motifs.

Proteins that bind DNA to regulate transcription are key to growth and to responses to environmental stimuli. Identifying the DNA sequences bound by these proteins can elucidate the members and functions of their regulons. Several tools can identify these DNA motifs, including MEME (using expectation maximization) and BioProspector (using Gibbs sampling) (1,2). The Delila suite of tools identifies motifs by maximizing information content and provides extensive flexibility in defining the parameters of motif prediction (3-7). However, the published instance of Delila requires extensive computational knowledge to install and use. Here, we present Delila-PY (Delila-PYthon), a pipeline for running the Delila programs (Fig. 1). Written in Python3 and publicly available as a Docker image (8) (the recommended use case) and on GitHub, Delila-PY requires only a set of DNA sequence coordinates and a genome sequence.
Delila-PY requires the GenBank file (9, 10) for the target organism and a user-generated genomic position file with the following columns: chromosome, sequence name, strand, and genomic position (see the example files included in the GitHub and Docker repositories). The genomic position file can be constructed with scripts or a spreadsheet program (e.g., Microsoft Excel) and indicates the regions of the genome to search for the motif. Delila-PY works with individual organisms that have multiple chromosomes and mobile genetic elements, as well as with microbiomes, and has been tested with both bacteria and eukaryotes. To search for a DNA binding motif, users specify left and right boundaries relative to each genomic position in the genomic position file (e.g., the center of a chromatin immunoprecipitation sequencing [ChIP-seq] peak or a transcription start site). Users can also give the motif a title (the default is the species name) to keep track of multiple runs of Delila-PY.
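As an illustration of building the genomic position file with a script, the following minimal Python sketch writes the four columns named above in the stated order. The tab delimiter, the "+"/"-" strand encoding, the 1-based coordinates, and all chromosome and sequence names are assumptions made for this example, not details confirmed here; the example files distributed with Delila-PY should be treated as the authoritative format.

```python
import csv
import io

# Hypothetical ChIP-seq peak centers; all names and coordinates are illustrative.
SITES = [
    ("chr1", "peak_1", "+", 15230),
    ("chr1", "peak_2", "-", 88412),
]


def write_position_file(rows, handle):
    """Write rows as: chromosome, sequence name, strand, genomic position.

    Assumes a tab-delimited layout and '+'/'-' strand symbols (not confirmed
    against the Delila-PY example files).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for chrom, name, strand, position in rows:
        if strand not in {"+", "-"}:
            raise ValueError(f"unexpected strand {strand!r} for {name}")
        if int(position) < 1:
            raise ValueError(f"position must be positive for {name}")
        writer.writerow([chrom, name, strand, int(position)])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_position_file(SITES, buffer)
    print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as `handle`.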
Delila-PY processes data through the Delila software (Fig. 1A) (3-7). The pipeline starts by running the dbbk and catal programs to generate the required libraries from the GenBank file. The library and genomic position files are used by delila (note that delila is a specific program within the Delila-PY software suite) to generate a book of sequences for subsequent use. To maximize the information content in the DNA motif, the pipeline runs three cycles. The first cycle uses the malign and malin programs to search for the alignment with the highest information content among all input sequences. The results of the malin program are fed back to the delila program to generate an updated book file. In the second cycle, the aligned book is passed to a series of programs (alist, encode, cmp, and rseq) to calculate the information content of each sequence. Delila-PY removes poorly matching sequences (≤0 information content, ri program). A new file is then generated and fed back to the delila program to generate an updated book file. The final cycle generates the DNA motif (alist, encode, cmp, rseq, dalvec, and makelogo). Delila-PY produces a DNA sequence logo in PostScript and PDF formats, the position weight matrix (PWM) (11) values of the motif as a text file, and the individual DNA sequences used to generate the DNA motif. Each program in the pipeline requires a specifically defined parameter file. Delila-PY creates default parameter files, but users can supply their own.
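The information-content calculations in the second cycle can be illustrated with a short Python sketch. This is not the code of the Delila programs; it is a simplified version of the standard formulas, assuming equiprobable bases (a maximum of 2 bits per position), omitting the small-sample correction, and treating any sequence whose individual information is not positive as poorly matching. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(seqs):
    """Per-position base frequencies for aligned, equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    freqs = []
    for column in zip(*seqs):
        counts = Counter(column)
        freqs.append({b: counts.get(b, 0) / len(seqs) for b in BASES})
    return freqs


def total_information(freqs):
    """Sum over positions of 2 - H(l), in bits (no small-sample correction)."""
    total = 0.0
    for f in freqs:
        entropy = -sum(p * math.log2(p) for p in f.values() if p > 0)
        total += 2.0 - entropy
    return total


def individual_information(seq, freqs):
    """Sum over positions of 2 + log2 f(b, l) for the bases of one sequence."""
    total = 0.0
    for base, f in zip(seq, freqs):
        p = f.get(base, 0.0)
        if p == 0.0:
            return float("-inf")  # base never observed at this position
        total += 2.0 + math.log2(p)
    return total


def keep_informative(seqs):
    """Drop sequences whose individual information is <= 0 bits."""
    freqs = position_frequencies(seqs)
    return [s for s in seqs if individual_information(s, freqs) > 0]


if __name__ == "__main__":
    aligned = ["AAAA"] * 7 + ["CCCC"]  # toy alignment, illustrative only
    freqs = position_frequencies(aligned)
    for s in sorted(set(aligned)):
        print(s, individual_information(s, freqs))
    print(len(keep_informative(aligned)), "sequences kept")
```

Under these simplifications, the mean of the individual information values over the aligned sequences equals the total information of the alignment. This is why a sequence scoring at or below zero bits can be regarded as a poor match to the motif built from the set.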
As a proof of concept, we used Delila-PY to identify the sequence logo for the binding site of the Rhodobacter sphaeroides transcription factor FnrL using locations from ChIP-chip data (12). The resulting sequence logo from Delila-PY (Fig. 1B) resembles the logo generated from previously identified FnrL binding sites (Fig. 1C), supporting the utility and predictive power of Delila-PY (12,13). The files needed to run Delila-PY are available on the GitHub repository and in the Docker image. We predict that Delila-PY will allow more researchers in the life and computational sciences community to take advantage of the powerful tools within the Delila suite to identify high-quality DNA sequence motifs and sequence logos.
Fig. 1. [...] sphaeroides from the Delila-PY pipeline. The relative heights of the letters at each position indicate their frequencies, while the overall height of the stack indicates the degree of sequence conservation, all measured in bits of information (y axis). The x axis is the position relative to the DNA sequence coordinate used. Error bars indicate a 95% confidence interval. (C) Logo generated by WebLogo from previously identified FnrL binding sites in R. sphaeroides (12,13). The logo and axis descriptions are the same as those for panel B.