RNAlien – Unsupervised RNA family model construction

Abstract Determining the function of a non-coding RNA requires costly and time-consuming wet-lab experiments. For this reason, computational methods which ascertain the homology of a sequence and thereby deduce functionality and family membership are often exploited. In this fashion, newly sequenced genomes can be annotated in a completely computational way. Covariance models are commonly used to assign novel RNA sequences to a known RNA family. However, to construct such models several examples of the family have to be already known. Moreover, model building is the work of experts who manually edit the necessary RNA alignment and consensus structure. Our method, RNAlien, starting from a single input sequence collects potential family member sequences by multiple iterations of homology search. RNA family models are fully automatically constructed for the found sequences. We have tested our method on a subset of the RfamRNA family database. RNAlien models are a starting point to construct models of comparable sensitivity and specificity to manually curated ones from the Rfam database. RNAlien Tool and web server are available at http://rna.tbi.univie.ac.at/rnalien/.


Table of Contents
RNAlien -Unsupervised RNA family model construction -Supplement . .

B Implementation Details
RNAlien depends on several external tools and interfaces, which are listed in this section. System and function calls are included with their parameters and highlighted in italic. A starting point in the taxonomic tree is set, either specified by the input NCBI taxonomy id, or by running a nucleotide BLAST search via the NCBI REST interface and selecting the organism of the best hit. The model construction process starts at this organism and performs a initial model construction step. RNAlien retrieves the taxonomic lineage of starting organism from the NCBI ENTREZ REST interface. After each of these steps RNAlien proceeds to the taxnomic parent of the current taxonomic node. If a model was already constructed in a previous step then a model expansion step is performed, otherwise a initial model construction is reattemped. Once the root of the taxonomic tree has been reached model expansion stops and the model finalization step is performed.

B.1 Initial model construction
RNAlien tries to establish an inital set of sequences related to the input sequence, that serve as seed for further expansion of the model.
Search: Candidate search is performed via the nucleotide BLAST REST interface which returns a list of hits. The organisms to be searched are restricted by the current taxononomic node of the step in two ways. To avoid overenrichment of sequences similar to included ones, already visited organisms are excluded.
Only organisms that are are associated with children of the current node are searched. For example if the current taxonomic position is Enterobacteriacea and the previous node was Enterobacter all other organisms that belong to Enterobacteriacea excluding Enterobacter are searched. Summary of NCBI BLAST REST function call (one for each query, all other parameters default): blastHTTP parameter value program blastn database nt querySequence currentsequence hitlistSize 5000 e-value 0.001 uppertaxonomylimit currenttaxonomyid lowertaxonomylimit previoustaxonomyid Filtering hits: BLAST hits are filtered by consecutively by following criteria: Hit has to achive over 80% coverage of the query sequence.
Legend: Current Node Previous Node Search Node Fig. 2. Organisms used for candidate search are determined as follows. All organisms and their corresponding genomes that are associated with the currently selected position in the taxonomic tree are used for searching. Excepted from this are organisms that have already been searched in previous rounds.
The hit must not exceed query length by factor of three. Similarity of the hit to the query must be under 99%. The remaining hits are expanded to query length as explained in subsection B.5. The gene id contained in the BLAST result and the expanded coordinates are used to retrieve nucleotide sequence from the Entrez REST interface. The sequences are filtered by normalized structure conservation index (nSCI). To compute the nSCI for each candidate sequence we need the mimimum free secondary structure folding energy (MFE) of the candidate and the input sequence which is computed with RNAfold.
RNAfold parameter value −−noPS inputfilePath fastaFilePath outputFilepath foldFilePath Furthermore the structure conservation index and the sequence identity of the candidate and input sequence are required. The candidate sequence is pair-wise aligned with free end-gap setting (semi-globally) to the input sequence. For each of these alignments the structure conservation index SCI is computed via RNAalifold.
Model Construction: Candidate sequences that passed the nSCI filter are then used to build the inital model together with the input sequence. The sequences are structually aligned with mlocarna. mlocarna parameter value inputFilepath inputFastaFilePath outputFilepath mlocarnaFilePath cmbuild is applied to the resulting structural stockholm alignment to construct a covariance model. cmbuild parameter value −−refine inputModelFilepath cmFilePath inputAlignmentFilepath stockholmAlignmentFilePath outputLogFilepath logFilePath The covariance model is used in the model expansion rounds to filter candidates and is therefore calibrated with cmcalibrate. This step is very time-consuming but sped up by using nonstandard (--beta 10 −4 ) parameter. This affects the pre-filter steps of cmsearch, but not the final step where the sequence is aligned to the model via the CYK algorithm. Meaning that this increase in calibration speed reduces sensitivity but not specificity. Filtering based method is the default method and iteratively removes all entries from the list of collected sequences, that do not have at most 95% pairwise sequence identity. This method has less specificity and sensitivity in the benchmarks (see 5,6), but it is faster and removes the dependency on clustalo.
Clustering based method can alternatively be used by suppling the −m commandline switch with the value clustering to RNAlien.Clustal omega is used to compute a pairwise distance matrix of all collected sequences for clustering.
clustalo parameter value −−full −−distmat-out matrixFilePath −−infile fastaFilePath outputFilepath clustaloFilePath RNAlien clusters the sequences via unweighted pair group method with arithmetic mean (UPGMA) and then incrementally increases the cutoff distance until 5 clusters can be formed. If less than 5 seqences have been collected, then each of them will be used as query.

B.2 Model expansion
After a initial model has been constructed RNAlien enters into model expansion phase.
Search: Searching is performed as descibed in Initial model construction but with a relaxed e-value cutoff of 1 during the BLAST search.
blastHTTP parameter value program blastn database nt querySequence currentsequence hitlistSize 5000 e-value 1 uppertaxonomylimit currenttaxonomyid lowertaxonomylimit previoustaxonomyid Filtering hits: Filtering of BLAST hits and hit expansion is performed as described in Initial model construction.
Sequences are also retrieved via the NCBI Entrez REST interface but then filtered with a different approach. We use the calibrated covariance model of the previous round and apply it with cmsearch to the canidate sequences. Candidates are accepted into the growing model if their e-value is below 0.001 or as specified by the inputEvalueCutoff commandline argument.
To ensure a meaningful e-value cutoff we need to consider the size of the database. We reuse the size of the blast database the hit originates from.
The value is not by itself contained in the blast XML output, but all the parameters needed to compute it. The relationship of E-value and bitscore (Equation 3 adopted from [1]): where d = databasesize e = e-value b = bitscore q = querylength We compute the database size in Mbases that was used for the blast search as follows, by rearranging the equation above: As the secondary structure of the resulting stockholm alignment is not updated in this process, a consensus secondary structure of the new alignment is computed via RNAalifold, with settings specifically optimized to consider covariance contributions. The old consensus secondary structure is replaced with the new one in the alignment. Same strand BLAST hit are extended as follows,

RNAalifold
where h is the start coordinate of the hit, t is the extended start coordinate, q is the start coordinate of the hit on the query, H is the end coordinate of the hit, T is the extended endcoordinate, Q is the end cooridnate of the hit on the query, L is the length of the query sequence b is the length of the sequence the hit maps to s is the start coordinate of the extended sequence checked for being within the available coordinates of the hit sequence, E is the end coordinate of the extended sequence checked for being within the available coordinates of the hit sequence  Fig. 3. Extension of BLAST hit and query on the same strand to query length, where h is the start coordinate of the hit, t is the extended start coordinate, q is the start coordinate of the hit on the query, H is the end coordinate of the hit, T is the extended endcoordinate, Q is the end cooridnate of the hit on the query, L is the length of the query sequence b is the length of the sequence the hit maps to s is the start coordinate of the extended sequence checked for being within the available coordinates of the hit sequence, E is the end coordinate of the extended sequence checked for being within the available coordinates of the hit sequence Different Strand BLAST hit are extended as follows, where h is the start coordinate of the hit, t is the extended start coordinate, q is the start coordinate of the hit on the query, H is the end coordinate of the hit, T is the extended endcoordinate, Q is the end cooridnate of the hit on the query, L is the length of the query sequence b is the length of the sequence the hit maps to s is the start coordinate of the extended sequence checked for being within the available coordinates of the hit sequence, E is the end coordinate of the extended sequence checked for being within the available coordinates of the hit sequence C Rfam RNA families with known structure This section contains additional plots for the RNA families with known structure featured in the paper. The first 2 plots show the changes of specificity and Inclusion of paralogs and toggling of the refine switch for cmbuild were included first, this has improved both specificity, as well as recall. Additionally to this, we changed the method for selecting queries for searching candidates from clustering all collected sequences and picking one sequence per cluster to filtering all sequence that do not have a pairwise sequence identity of less then 95%.
While the specificity is slightly only lower, there is a decrease in specificity. Nevertheless we have selected the new query selection method as default, because it is substantially faster and it drops the dependency on clustal-omega.
Blast hits are now also checked for the hit to have at least 80% coverage of the query. This feature should have been included in RNAlien 1.0.0, but was faulty.
Query sequences submitted to blast can be softmasked with conservation information from /cmalign. This feature is not considered in the shown benchmarks, but can be activated via commandline switch.
All of the newly introduced features can be controlled via commandline switches, with exception of the cmbuild refinement.
The runtime of RNAlien for the structured RNA test set Following is the table of sequences from the Rfam 12.0 seed alignments of families with known structure, that was used in the result section. The first sequence of the family was picked with the exception of sequences that are associated with metagenomic tax ids that could not be processed by the NCBI REST BLAST interface.
To obtain a representative sample of Rfam families, for each of these tags the alphanumberically first 10 families (if avialable for that tag) were selected. As some families have multiple tags, the list was filtered to contain each family only once.
The benchmark was conducted in the same manner as for the families with known 3D structure. The plots show different combinations of e-value cutoffs and databasesizes. Without explicitly setting the database size cmsearch uses twice the sequence length (forward/backward strand).
The setting compareable to the one used for the structured dataset is diverse(ev-1e-3,db-1e-9), meaning a cmsearch e-value cutoff of 1e-3 and a databasesize of 10 9 bases in general and 10 6 bases for bacterial and viral RNA families.
The result with compareable settings to the structured dataset has 191 of 192 cases (99%) with at least half of the sequences collected by RNAlien are recognized as belonging to the Rfam model.

E Negative control set
We used coding sequences, ancestral repeats, untranslated regions (UTRs) and random sequences to perform a negative control. According to the procedure for structured and diverse RNA families the sequences of the negative control set were used as a input sequence for RNAlien. Taxonomic start points for the construction were set as below using taxids from NCBI taxonomy [2]. The results were summarized for each subset individually.