Rotate: A command-line program to rotate circular DNA sequences to start at a given position or string

Sequences derived from circular DNA molecules (i.e. most bacterial, viral and plastid genomes) are expected to be linearised and rotated to a common start position for most downstream analyses including alignment. Despite this being a common and straightforward task, available software is either limited to a small number of input sequences, lacks the option to specify a custom anchor string, or requires a commercial license. Here, we present rotate, a simple, open source command line program written in C with no external dependencies, which can rotate a set of input sequences to a custom anchor string (allowing for a specified number of mismatches), or offset the input sequences to the desired position. The combination of both functionalities allows the rotation of all input sequences to any desired starting position, enabling downstream analysis. rotate is extremely fast and scales linearly with the number of input sequences, taking only seconds to rotate over a thousand mitochondrial sequences.


Introduction
Some DNA sequences, such as most viral, bacterial and organelle genomes, are circular as opposed to linear. When these sequences are deposited into public repositories, they are formatted as linear sequences by assigning them starting positions and orientations. These starting positions and orientations can be somewhat arbitrary and often differ across individuals or taxa. However, most multiple sequence alignment programs (e.g. MAFFT [1], MUSCLE [2]) assume linearity of sequences, including the same starting position. Performing accurate downstream analyses, then, requires first standardizing both their orientation and starting position.
Multiple programs or software packages already exist with some sequence rotation functionality, but have various restrictions on input, extensive dependencies, or do not allow user-defined starting positions or anchor strings. For example, Geneious [3] can offset a sequence to a user-defined starting position, but does not allow automated rotation to a custom anchor string and requires a commercial license. CSA [4] is restricted to 32 total input sequences, and rotates to an "optimal" rotation for multiple sequence alignment instead of a user-defined starting position. Circlator's fixstart function [5] does not accept a user-defined starting position or anchor string, and instead tries to detect dnaA genes to guide the rotation, and the software has many dependencies (including BWA [6], Prodigal [7], MUMmer [8], and Canu [9] or SPAdes [10]). Similarly to CSA, MARS [11] uses a sophisticated algorithm to compute the optimal rotation and can even integrate this into a multiple alignment algorithm, but again does not allow for a user-specified string or position. Though an "optimal" rotation is desirable in many contexts, the ability to rotate to a user-defined sequence or position is highly valuable, for example because it allows for the iterative inclusion of new sequences without re-running the algorithm on the entire dataset. Here, we present a software tool which can rotate a set of input sequences to a custom anchor string (allowing for a specified number of mismatches), or offset the input sequences to the desired position.

Implementation
rotate is a command-line program written in C that takes as input an (optionally gzipped) FASTA file of DNA sequences and either a new starting position (offset in base pairs) or an anchor string which defines a new starting position, and outputs a FASTA file with the same sequences appropriately rotated. If a string is given as input, it can allow for any number of mismatches up to a user-defined threshold, and will also search for and output reverse complements when necessary. The program will fail if the specified input string is not unique in a target sequence, while allowing for mismatches, and return the locations instead. It is available at https://github.com/richarddurbin/rotate (see Software availability for more information).
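The anchor-based rotation described above can be illustrated with a minimal Python sketch. This is not rotate's C implementation: the function names (`find_anchor`, `rotate_to_anchor`) are hypothetical, mismatches are counted as a plain Hamming distance, and the choice to raise an error when the anchor matches more than once across both strands is an assumption based on the description in this section.

```python
# Illustrative sketch only; names and error handling are assumptions,
# not taken from the rotate source code.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_anchor(seq: str, anchor: str, max_mismatches: int = 0) -> list[int]:
    """Return every start index where anchor matches the circular seq."""
    n, k = len(seq), len(anchor)
    if k == 0 or k > n:
        return []
    doubled = seq + seq[: k - 1]  # lets a match wrap around the origin
    return [
        i for i in range(n)
        if hamming(doubled[i : i + k], anchor) <= max_mismatches
    ]


def rotate_to_anchor(seq: str, anchor: str, max_mismatches: int = 0) -> str:
    """Rotate seq (or its reverse complement) to start at the unique match."""
    rc = reverse_complement(seq)
    hits = [(seq, i) for i in find_anchor(seq, anchor, max_mismatches)]
    hits += [(rc, i) for i in find_anchor(rc, anchor, max_mismatches)]
    if len(hits) != 1:
        positions = [i for _, i in hits]
        raise ValueError(f"expected one match, found {len(hits)}: {positions}")
    strand, start = hits[0]
    return strand[start:] + strand[:start]
```

For example, `rotate_to_anchor("ACACCGGATT", "TTACA")` finds the anchor spanning the end and start of the linearised sequence and returns the rotation that begins with it. Searching a copy of the sequence extended by `k - 1` bases is one simple way to handle such wrap-around matches.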
Operation
rotate has no external dependencies and is called from the command line. It accepts several arguments to invoke the desired functions, which are explained at https://github.com/richarddurbin/rotate. rotate is extremely fast: its runtime scales linearly with the number of input sequences, since every sequence is processed separately. rotate is easy to compile, and we tested its functionality on macOS 12.5.1 and on Scientific Linux 7.9.

Use cases
Below we give an example of how to use rotate in combination with a multiple sequence alignment program of choice to rotate a large dataset of sequences to a common start position. To enhance reproducibility, we made the input data and an extended version of this use case available at https://github.com/MoritzBlumer/rotate_use_case. Briefly, we downloaded two sets of publicly available organelle assemblies from NCBI's RefSeq database: (1) all available complete mammalian mitochondrial genomes (n=1,546) and (2) all available complete chloroplast genomes of the Rosaceae family (n=465) (as of 24 January 2023). For the mitochondria, we selected a conserved mammalian mitochondrial sequence (using the 100 Vertebrate Cons track in the UCSC Genome Browser [12]) as the anchor string, while for the chloroplasts we used a common barcode primer [13]. See Underlying data for more information. We first rotated the sequences with rotate, specifying the anchor string (-s) and maximum number of mismatches (-m), respectively. Execution of this first step on a Linux machine using a single CPU with 6 GB RAM took 3.165 and 3.422 seconds to rotate 1,546 mitochondrial and 465 chloroplast sequences, respectively (wall clock time). Next, we generated a multiple sequence alignment for both rotated sets of sequences with MAFFT (version 7.520) [1] and then performed a second rotation by position (-x) to achieve conventional mitochondrial and chloroplast start positions in the multiple sequence alignments. For illustration, the first 750 base pairs of the 465 chloroplast sequences are shown in Figure 1, with the raw sequences from rosaceae.fa in panel (A) and the rotated sequences from rosaceae.rotated.fa in panel (B). The sequence files were visualized using AliView version 1.27 [14]. Note that neither panel in Figure 1 has undergone multiple sequence alignment.
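The two-step procedure works because rotations of a circular sequence compose: rotating by a and then by b gives the same result as a single rotation by (a + b) mod n, so the anchor does not have to sit at the desired start position. The following Python sketch illustrates this with a hypothetical helper, `rotate_by_offset`. The 0-based offset convention is an assumption made for the example and may differ from the convention used by the -x option.

```python
# Illustrative sketch; the 0-based offset convention is an assumption.
def rotate_by_offset(seq: str, offset: int) -> str:
    """Rotate a circular sequence so that index `offset` becomes the start."""
    if not seq:
        return seq
    offset %= len(seq)  # also maps negative offsets onto the circle
    return seq[offset:] + seq[:offset]


anchored = "TTACACCGGA"  # a sequence already rotated to its anchor
# Shifting by 3 and then by 4 is the same as shifting once by 7.
two_steps = rotate_by_offset(rotate_by_offset(anchored, 3), 4)
one_step = rotate_by_offset(anchored, 7)
```

In this toy case both `two_steps` and `one_step` are `"GGATTACACC"`. A negative offset moves the start in the opposite direction, so `rotate_by_offset(anchored, -3)` gives the same result here because the sequence has length 10.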

Conclusions
Here we have presented a fast and simple command-line program to rotate circular DNA sequences, for example chloroplast or mitochondrial sequences, to a common starting position. This is often required to create a multiple sequence alignment. rotate can account for an arbitrary number of mismatches, has no external dependencies, and can process thousands of sequences in seconds.

João Gabriel Rodinho Nunes Ferreira
Universidade Federal do Rio de Janeiro, Rio de Janeiro, State of Rio de Janeiro, Brazil

# General Assessment
In this article, the authors address a challenge in the field of bioinformatics: the rotation of circular DNA sequences such as those found in most viral, bacterial, and organelle genomes. Typically, these sequences are deposited into public repositories as linear sequences with assigned starting positions and orientations, which can vary arbitrarily and affect downstream sequence alignment processes. Most existing sequence alignment programs assume a uniform linearity and starting point, which often does not reflect the circular nature of these genomes. To overcome those limitations, the authors developed rotate, a command-line tool that standardizes the starting positions and orientations of circular DNA sequences for more accurate downstream alignment in genomic analyses. Unlike previous tools, Rotate offers enhanced flexibility and scalability, supporting arbitrary mismatches and efficiently handling thousands of sequences simultaneously without the need for commercial licenses or extensive software dependencies.
By providing a mechanism to standardize the orientation and starting positions of circular DNA sequences, the tool facilitates more accurate multiple sequence alignments, essential for subsequent analyses and comparative studies across different taxa. Its ability to process thousands of sequences rapidly, without external dependencies and regardless of sequence complexity, positions it as a valuable resource in genomic research. Such capabilities not only save valuable research time but also improve the reproducibility of scientific results, thereby contributing to advancements in the fields of genetics and molecular biology.

## Manuscript
"The program will fail if the specified input string is not unique in a target sequence," Could you clarify the behavior described where the program will fail if the specified input string is not unique in a target sequence? In my testing, I introduced a duplicate manually, and the program completed successfully (returning a status of 0). However, as expected the duplicated sequence was not included in the final FASTA output.
"We selected a conserved mammalian mitochondrial sequence (TACGACCTCGATGTTGGATCA) (using the 100 Vertebrate Cons track in the UCSC Genome Browser [12]) as the anchor string" A more detailed explanation of how the conserved mammalian mitochondrial sequence was selected could greatly assist users. Providing this information would enable users to better understand the criteria and process involved, potentially guiding them in choosing appropriate anchor strings for their own datasets.
On 'Fig. 1': It does not include the mammalian sequences, despite the detailed description of their processing in the Use Cases section. To maintain consistency, I suggest either removing the detailed processing description of the mammalian sequences from the Use Cases section or adding their plots to Figure 1 alongside the Rosaceae sequences. This adjustment would enhance the coherence and completeness of the visual data presented.
The primary purpose of 'rotate' is to prepare files for downstream analyses, such as multiple alignments. However, Figure 1 only presents a comparison between rotated and raw FASTA files without showing any actual alignment. To better support the tool's intended use, it would be valuable to include a comparison of aligned rotated sequences alongside aligned raw sequences. This would more effectively demonstrate the tool's efficacy in facilitating downstream analytical processes.
"sourced from from NCBI's RefSeq database" The word "from" is mistakenly repeated. It should read "sourced from NCBI's RefSeq database."

## Software
It would be helpful to provide details about what the program outputs to STDERR, as it seems to include statistics for each processed sequence, yet I found no specific documentation explaining these statistics. Additionally, consider implementing a 'verbose mode' (perhaps with a -v option) or directing these statistics to a separate file like 'rotation_stats.txt', which could improve usability. This change would prevent the terminal from becoming cluttered with statistics when processing multiple sequences, thereby enhancing clarity and focus on the main output.
It appears that if a sequence lacks a match to the -s string, it is almost silently discarded from the final output FASTA; the only indication is that its statistics in STDERR comprise a single line, whereas other sequences display two. To enhance user clarity, the program could more explicitly notify users of non-matching sequences, either through a warning message or by generating a file listing the IDs of discarded sequences.
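Until such a notification exists, discarded sequences can be identified after the fact by comparing the headers of the input and output FASTA files. The Python sketch below is a hypothetical helper, not part of rotate. It assumes that the output records keep the same sequence ID (the first whitespace-delimited token of the header) as the corresponding input records.

```python
# Hypothetical post-processing helper; assumes output headers keep input IDs.
def fasta_ids(text: str) -> list[str]:
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def discarded_ids(input_fasta: str, output_fasta: str) -> list[str]:
    """List the IDs present in the input but absent from the output."""
    kept = set(fasta_ids(output_fasta))
    return [seq_id for seq_id in fasta_ids(input_fasta) if seq_id not in kept]


example_input = ">seq1 first\nACGT\n>seq2 second\nGGCC\n>seq3\nTTAA\n"
example_output = ">seq1 first\nCGTA\n>seq3\nAATT\n"
missing = discarded_ids(example_input, example_output)  # ['seq2']
```

The functions take the file contents as strings, so gzipped input would need to be decompressed first, for example with Python's `gzip` module.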

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: genomics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Minor: 1) The authors might think about also including a short paragraph describing the second utility in the GitHub repository, `compose`.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatician working in the field of circular RNAs, a field closely related to circular DNAs in terms of algorithms and software.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
The Rotate tool is a useful utility for analyzing sequences where an expected substring is present.
The introduction covers the state of the art in circular sequence rotation. An additional tool you may want to include in the introduction is SeqKit, which has a "restart" command in v2. Again, it does not allow a user-specified string for the start.
Using the provided install command (i.e., clone the repo and run make), I was able to build the tool on an Apple Silicon machine running macOS 14.3.1.
For my use case, aligning viroids with an expected hammerhead ribozyme motif (CTGANGA), the ability to tolerate mismatches is very useful. However, I have run into a minor issue you may wish to consider. The command line usage states "use either -x and optionally -rc, or -s". Viroids may have their ribozyme on either polarity (or indeed both). It would appear that -rc and -s are incompatible but the command does still execute. However, when it finds a matching string in the opposite polarity, the output sequence is in the same polarity as the input sequence, not as the search substring.
For example, this command:
./rotate -m 1 -rc -s CTGANGA ../research/viroiddb/db/2021-09-07/avsunviroidae.fasta
includes in its output this line:
>AF404052.1 TGTTCTTCCCATCTTTCCCTGAAGTGACGAAGTGATCAAGAGATTGAAGACGAGTGAACTAATTTTTTTTTAATAAAAAGTTCA
The reverse complement of the final seven is indeed the search string but the output would not be directly usable. I would suggest either actively disallowing -rc and -s to be used together or adding the additional logic to output the reverse complement.
Finally, the ability to use ambiguous nucleotides would be helpful. I saw some references to ambiguity in seqio.c but it is not clear if that code is used in Rotate. Attempting to use ambiguous nucleotides in the search did not appear to work unless -m was used, but this does not seem as if it would be useful for anything other than N.
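To make the request concrete, ambiguity-aware matching could treat each IUPAC code in the search string as the set of bases it stands for, rather than spending a mismatch on it. The Python sketch below shows the idea on a linear sequence. It is an illustration only and does not reflect how Rotate or seqio.c handle ambiguity codes; the names `matches` and `find_motif` are hypothetical.

```python
# Illustrative sketch; not how Rotate or seqio.c treat ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(pattern: str, window: str) -> bool:
    """True if every base in window is allowed by the IUPAC code at that position."""
    if len(pattern) != len(window):
        return False
    return all(
        base in IUPAC.get(code, "")
        for code, base in zip(pattern.upper(), window.upper())
    )


def find_motif(seq: str, pattern: str) -> list[int]:
    """Return every start index where pattern matches the linear seq."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1) if matches(pattern, seq[i : i + k])]
```

With this approach, `find_motif("AACTGATGATT", "CTGANGA")` reports a hit at index 2 without any mismatch allowance, because N accepts any of the four bases while the other six positions must match exactly.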
Overall, a useful contribution to the community and a tool I expect I will be using myself in the future.
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Figure 1. Visualization of the first 750 base pairs of 465 chloroplast assemblies (A) before rotation and (B) after rotation to a shared anchor sequence. No multiple sequence alignment was performed here. The sequence files were visualized using AliView version 1.27 [14].

Reviewer Report 27 April 2024
https://doi.org/10.21956/wellcomeopenres.21678.r79585
© 2024 Jakobi T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Tobias Jakobi
College of Medicine Phoenix, The University of Arizona, Phoenix, Arizona, USA

In their manuscript, Durbin et al. present rotate, a command line tool designed to aid in the task of correctly rotating circular DNA sequences, as required to align multiple sequences based on custom offsets or sequence motifs. The software is very streamlined, uses no external dependencies and should work on any system that has a working C compiler. It is small and fast enough to easily integrate into existing software pipelines, e.g. for submission of sequences to GenBank. The manuscript is well written and clearly documents the software's functionality as well as the benchmarks and tests performed. The reasons why new software is required are clearly laid out. The benchmark documentation is available as a second repository and contains all necessary steps to perform the analysis.