Swellix: a computational tool to explore RNA conformational space

Background: The sequence of nucleotides in an RNA determines the possible base pairs for an RNA fold and thus also determines the overall shape and function of an RNA. The Swellix program presented here combines a helix abstraction with a combinatorial approach to the RNA folding problem in order to compute all possible non-pseudoknotted RNA structures for RNA sequences. The Swellix program builds on the Crumple program and can include experimental constraints on global RNA structures such as the minimum number and lengths of helices from crystallography, cryoelectron microscopy, or in vivo crosslinking and chemical probing methods.

Results: The conceptual advance in Swellix is to count helices and generate all possible combinations of helices rather than counting and combining base pairs. Swellix bundles similar helices and includes improvements in memory use and efficient parallelization. Biological applications of Swellix are demonstrated by computing the reduction in conformational space and entropy due to naturally modified nucleotides in tRNA sequences and by motif searches in Human Endogenous Retroviral (HERV) RNA sequences. The Swellix motif search reveals occurrences of protein and drug binding motifs in the HERV RNA ensemble that do not occur in minimum free energy or centroid predicted structures.

Conclusions: Swellix presents significant improvements over Crumple in terms of efficiency and memory use. The efficient parallelization of Swellix enables the computation of sequences as long as 418 nucleotides with sufficient experimental constraints. Thus, Swellix provides a practical alternative to free energy minimization tools when multiple structures, kinetically determined structures, or complex RNA-RNA and RNA-protein interactions are present in an RNA folding problem.

Electronic supplementary material: The online version of this article (10.1186/s12859-017-1910-7) contains supplementary material, which is available to authorized users.

This should populate the `viennabuild` directory with the resources that Swellix will use during computation. Inside this directory there is a thermodynamic parameter file called `rna_turner2004.par`, which supplies the parameters used at runtime to determine the free energy of a secondary structure. To use a different or updated parameter file, replace `rna_turner2004.par` with your file. For the new file to be picked up, you must also edit the Swellix Makefile: in the `PARAMFILE` variable, change `rna_turner2004.par` to the name of the file you want to use.
The Swellix Makefile gathers multiple options for compilation. The most useful are:

### SERIAL:
Compiling with
```
make serial
```
provides the serial version of Swellix with the most basic output needed. The program outputs the RNA sequence it was given and the number of structures computed. You may give Swellix the command-line flag `-d` with a value of 2 to print each structure to the terminal in dot & parenthesis format as it is computed.
### MPI:
Using
```
make mpi
```
will produce a result similar to `make serial` except the code will be compiled to run using the parallel version of the algorithm. Once compiled, you should be able to run Swellix with `mpirun` or the like.
**_NOTE_** The parallel code in Swellix was developed using OpenMPI, so problems could arise if using some other implementation of MPI.

### DISP (Display):
Compiling with
```
make disp
```
instructs Swellix to provide various levels of more detailed output. These levels depend on the `-d` flag as detailed at the bottom of this file in the table of commandline options. Level 1 is the same as the default output from using `make serial` above except with more information relating to the data structures that were used in the algorithm. This information is mainly the size of some lists. Exceeding Level 2 results in output which is really only useful for debugging or further development.
**_NOTE_** If you do choose to use the `-d` option for more detailed output, be wary of the size of your sequence and possible size of output. You can easily generate very large files from the output of structures alone. This is not to mention the debugging text if you have the display level set high enough. It all comes down to your imposed constraints.
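To get a feel for how quickly structure output grows, here is a rough back-of-the-envelope sketch. It assumes each structure is written as one dot & parenthesis string with the same length as the sequence, followed by a newline. The function name and the structure count in the example are hypothetical, not values produced by Swellix.

```python
def estimate_output_bytes(sequence_length: int, structure_count: int) -> int:
    """Rough size of a structure listing in bytes.

    Assumption: one dot & parenthesis string of `sequence_length`
    characters plus one newline per structure, and nothing else.
    """
    if sequence_length < 0 or structure_count < 0:
        raise ValueError("length and count must be non-negative")
    return structure_count * (sequence_length + 1)


# Hypothetical example: ten million structures for a 76-nucleotide sequence.
size = estimate_output_bytes(76, 10_000_000)
print(f"{size / 1e9:.2f} GB")  # 0.77 GB
```

Debugging text at higher display levels would come on top of this estimate.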
Swellix RNA sequence input must be formatted properly before being run in the program.
### GUIDELINES/RULES:
The first rule is the simplest, and we'll call it the 'One Line Rule'. Any input sequence should occupy only one line.
e.g. Say you have some arbitrary sequence `GCUCUAAAAGAGAG`. You shouldn't create an input file with your sequence formatted on two lines like this:
```
GCUCUAA
AAGAGAG
```
This is because it contains a new line indicator after the first 7 nucleotides. Swellix can't tell if the two lines are meant to be the same sequence or if you're trying to give it some kind of multi-sequence input, which the program doesn't currently handle. The result of this input would be Swellix running with only the first line as its input.
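If you already have a sequence that was wrapped across several lines, a small helper can join it before you hand it to Swellix. This is a minimal sketch and not part of Swellix. The function name is hypothetical, and it is only appropriate when every line belongs to the same sequence.

```python
def join_wrapped_sequence(text: str) -> str:
    """Join a sequence that was wrapped across several lines into one line.

    Leading and trailing whitespace on each line is dropped; blank lines
    contribute nothing.
    """
    return "".join(line.strip() for line in text.splitlines())


wrapped = "GCUCUAA\nAAGAGAG\n"
print(join_wrapped_sequence(wrapped))  # GCUCUAAAAGAGAG
```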
The second rule deals with which characters are used to represent the sequence. Swellix only knows how to handle sequences consisting of A, C, G, and U representing the nucleotides. If there are spaces, placeholders, nucleotides with chemical modifications, etc., Swellix will not handle the sequence properly. The output from the program will be incorrect or, even worse, the program will crash. So, any special characters need to be either removed or converted back to their corresponding A, C, G, or U.
e.g. Imagine that you have a sequence to run which contains some placeholding characters. For example, let the sequence look like `GCUCU--AAAAGA---GAG`. Since Swellix doesn't know what to do with these hyphens, you must first strip them from the sequence and then adhere to the previous one line rule. So, when sending this sequence to Swellix it should look exactly like the previous example: `GCUCUAAAAGAGAG`.
e.g. Now consider the case where there is some number of modified nucleotides which constrain the folding of the RNA. Let the sequence be `GCUCU"AAAKAGAG`, where `"` represents the 1-methyladenosine modification and `K` represents the 1-methylguanosine modification. (These modifications are arbitrarily chosen for the example.) We've stated that Swellix can't handle these characters properly on its own, so we need to first convert them to their unmodified characters. So the sequence will once again look like `GCUCUAAAAGAGAG` to Swellix.
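The two character-cleaning examples above can be sketched as a single preprocessing step. This is a minimal illustration and not code shipped with Swellix. The names `normalize_sequence`, `PLACEHOLDERS`, and `MODIFIED_TO_UNMODIFIED` are hypothetical, and the symbol table holds only the two arbitrary codes from the example, so you would substitute the codes used in your own data.

```python
VALID_BASES = set("ACGU")

# Characters that are simply removed (hyphens and whitespace).
PLACEHOLDERS = set("- \t\r\n")

# Hypothetical symbol table taken from the example above only.
MODIFIED_TO_UNMODIFIED = {'"': "A", "K": "G"}


def normalize_sequence(raw: str, modified=MODIFIED_TO_UNMODIFIED) -> str:
    """Strip placeholders, map modified symbols to A/C/G/U, and validate."""
    bases = []
    for position, symbol in enumerate(raw, start=1):
        if symbol in PLACEHOLDERS:
            continue
        base = modified.get(symbol, symbol)
        if base not in VALID_BASES:
            # Fail loudly rather than pass an unsupported character on.
            raise ValueError(
                f"unsupported symbol {symbol!r} at input position {position}"
            )
        bases.append(base)
    return "".join(bases)


print(normalize_sequence("GCUCU--AAAAGA---GAG"))  # GCUCUAAAAGAGAG
print(normalize_sequence('GCUCU"AAAKAGAG'))       # GCUCUAAAAGAGAG
```

The sketch rejects anything it does not recognize, including lowercase letters, so a typo in the input surfaces as an error instead of a silently altered sequence.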

### STANDARD INPUT:
To use standard input, simply pipe a sequence to Swellix like so:
```
echo "GCUCUAAAAGAGAG" | /path/to/swellix/swellix.exe [desired constraints]
```
where the desired constraints are some optional combination of the flags defined at the bottom of this file.
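If you drive Swellix from a script rather than a shell, the same pipe can be sketched with Python's standard `subprocess` module. This is an illustrative wrapper and not part of Swellix. The function name `run_swellix_stdin` and its `runner` parameter are hypothetical, and the executable path is the same placeholder used above.

```python
import subprocess


def run_swellix_stdin(sequence, executable="/path/to/swellix/swellix.exe",
                      extra_args=(), runner=subprocess.run):
    """Pipe one sequence line to Swellix on standard input; return its stdout.

    `extra_args` holds whatever command-line flags you want to pass.
    `runner` exists only so the call can be replaced in tests.
    """
    command = [executable, *extra_args]
    completed = runner(
        command,
        input=sequence + "\n",  # one sequence, one line
        capture_output=True,
        text=True,
        check=True,             # raise if Swellix exits with an error
    )
    return completed.stdout
```

A call such as `run_swellix_stdin("GCUCUAAAAGAGAG", extra_args=["-d", "2"])` would then correspond to the `echo ... |` command above with the `-d 2` option.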

### FILE INPUT:
There are two flavors of reading input from a file. In one, only the first line of the file, which should be the sequence, is read into the program. In the other, you can instruct Swellix to continue reading through the file to look for any defined constraints. For the second case, we'll refer to that file as a configuration file since it is providing Swellix with more information than just the sequence.
In general, to use a plain input file you need the `-i` flag and the path to the file.
```
/path/to/swellix/swellix.exe -i sequence.txt [optional command-line args]
```
However, if you would like to provide a configuration file with certain folding constraints, you will also need to include the `-k` flag. You will still use the `-i` flag and the path to your config file.
```
/path/to/swellix/swellix.exe -k -i config.swlx [optional args/constraints not in config file]
```

### ABOUT THE CONFIGURATION FILE:
There are many constraints that can be imposed just by command line arguments. The advantage of providing input via a config file is that you can specify constraints such as individual nucleotide pairing restrictions. For example, you can tell Swellix that a particular nucleotide absolutely must pair for a structure to be valid. For the time being, the constraints must follow a strict format. The specific formatting rules and an example of a properly written config file are provided in `configTutorial.swlx`. In this file, we use the same sequence as above but illustrate how to specify constraints.
**_NOTE_** If the case arises where you have provided input via both standard input and an input file, the sequence defined by the input file will override the sequence provided by standard input.

Swellix Configuration Tutorial
This document specifies and demonstrates the format restrictions on the constraints within a Swellix configuration file. For this example, we'll be using the same 14 nucleotide sample sequence which was our focus in the README file. Specifically, we'll use the version of the sequence which had chemically modified nucleotides to restrict pairing possibilities.
As a refresher, our sequence was 'GCUCU"AAAKAGAG'. The '"' and the 'K' are our modified nucleotides, and the first step is always to obtain the "unmodified" sequence. This just means replacing the symbols for the modified nucleotides with the corresponding symbols for the unmodified nucleotides. So, the sequence that we want to provide to Swellix looks like this: 'GCUCUAAAAGAGAG'.
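Since the pairing constraints refer to nucleotides by index, it helps to note where the modified nucleotides sit while producing the unmodified sequence. The following is a minimal sketch and not part of Swellix. The function name is hypothetical, and counting positions from 1 is an assumption made for this illustration, so check the index convention against the example config file before relying on it.

```python
def unmodify_with_positions(sequence, modified):
    """Return the unmodified sequence and the positions of modified symbols.

    Assumption: positions are counted from 1 at the start of the sequence.
    `modified` maps each modified-nucleotide symbol to its unmodified base.
    """
    bases, positions = [], []
    for position, symbol in enumerate(sequence, start=1):
        if symbol in modified:
            positions.append(position)
        bases.append(modified.get(symbol, symbol))
    return "".join(bases), positions


plain, positions = unmodify_with_positions('GCUCU"AAAKAGAG', {'"': "A", "K": "G"})
print(plain)      # GCUCUAAAAGAGAG
print(positions)  # [6, 10]
```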
Next, we need to translate the pairing constraints (if any) to Swellix.
For now, the syntax for pairing constraints consists of three key phrases: "S1 PAIRING", "V1 PAIRING", and "COVARIANCE". S1 PAIRING denotes a list of indices identifying nucleotides which MUST NOT pair. V1 PAIRING denotes the opposite of S1: indices of nucleotides which MUST pair. COVARIANCE constraints list ordered pairs indicating pairs of nucleotides which must pair with each other.
For our example, we are assuming that the chemical modifications prohibit pairing of the respective nucleotides, so we will create a config file using the S1 PAIRING indicator.
The first step is always the same as for a normal sequence input file: enter the sequence on the first line of the file.
So, the first line will be:

GCUCUAAAAGAGAG

Next, on a new line below the sequence, specify the type of pairing constraint using square brackets and the necessary key word/phrase. In our case, it is [S1 PAIRING]. Then, for each nucleotide to be constrained, enter the index surrounded by parentheses. So, the pairing constraint section for our test sequence will look like this:

[S1 PAIRING]