The core of the Vienna RNA Package is formed by a collection of routines for the prediction and comparison of RNA secondary structures. These routines can be accessed through stand-alone programs, such as RNAfold, RNAdistance etc., which should be sufficient for most users. For those who wish to develop their own programs we provide a library which can be linked to your own code.
This document only describes the library and will be primarily useful to programmers. The stand-alone programs are described in separate man pages. The latest version of the package including source code and html versions of the documentation can be found at the ViennaRNA page. This manual documents version 1.3.
Please send comments and bug reports to <Ivo.Hofacker@tbi.univie.ac.at>.
The library provides a fast dynamic programming minimum free energy folding algorithm as described by Zuker & Stiegler (1981). Associated functions are:
fold()
. If fold_constrained
(see section Global Variables for the Folding Routines)
is 1, the structure string is interpreted on input as a list of
constraints for the folding. The characters " | x < > " mark bases that
are paired, unpaired, paired upstream, or downstream, respectively;
matching brackets " ( ) " denote base pairs, dots "." are used for
unconstrained bases. Constrained folding works by assigning bonus energies
to all structures compliing with the constraint.
fold()
.
initialize_fold()
.
temperature
(see section Global Variables for the Folding Routines).
Prototypes for these functions are declared in `fold.h'.
Instead of the minimum free energy structure the partition function of all possible structures and from that the pairing probability for every possible pair can be calculated, using a dynamic programming algorithm as described by McCaskill (1990). The following functions are provided:
fold_constrained
(see section Global Variables for the Folding Routines) is 1, the structure string is
interpreted on input as a list of constraints for the folding. The
character "x" marks bases that must be unpaired, matching brackets " ( )
" denote base pairs, all other characters are ignored. Any pairs
conflicting with the constraint will be forbidden. This usually sufficient
to ensure the constraints are honored. If do_backtrack
(see section Global Variables for the Folding Routines) has been set to 0 base pairing probabilities will not
be computed (saving CPU time), otherwise the pr[iindx[i]-j]
(see section Global Variables for the Folding Routines) will contain the probability that bases i
and j pair.
pf_fold()
.
init_pf_fold()
.
Prototypes for these functions are declared in `part_func.h'.
We provide two functions that search for sequences with a given structure, thereby inverting the folding routines.
give_up
is set to 1, the function will return as soon as it is
clear that the search will be unsuccessful, this speeds up the algorithm
if you are only interested in exact solutions.
Since inverse_fold()
calls fold()
you have to allocate memory
for folding by calling initialize_fold()
inverse_fold()
.
Since inverse_pf_fold()
calls pf_fold()
you have to allocate
memory for folding by calling init_pf_fold()
char *symbolset
points to the allowed bases,
initially "AUGC"
. It can be used to design sequences from reduced
alphabets.
Prototypes for these functions are declared in `inverse.h'.
The following global variables change the behavior the folding algorithms or contain additional information after folding.
temperature
C.
Default is 37C. You have to call the update_..._params() functions after
changing this parameter.
dangles = 1
) dangling end energies are assigned only to unpaired
bases and a base cannot participate simultaneously in two dangling ends. In
the partition function algorithm pf_fold()
these checks are neglected.
If dangles
is set to 2, the fold()
and
energy_of_struct()
function will also follow this convention. This
treatment of dangling ends gives more favorable energies to helices
directly adjacent to one another, which can be beneficial since such
helices often do engage in stabilizing interactions through co-axial
stacking.dangles = 3
co-axial stacking is explicitely included for
adjacent helices in mutli-loops. The option affects only mfe folding
and energy evaluation (fold()
and energy_of_struct()
), as
well as suboptimal folding via re-evaluation of energies.
Co-axial stacking with one intervening mismatch is not considered so far.
Default is 1, pf_fold()
treats 1 as 2.
fold()
.
base_pair[0].i
contains the total number of pairs.
pf_fold()
.
pr[iindx[i]-j]
.
pf_fold()
to avoid overflows. Should be set
to approximately exp((-F/kT)/length), where F is an estimate
for the ensemble free energy, for example the minimum free energy. You must
call update_pf_params()
or init_pf_fold()
after changing this
parameter. If pf_scale is -1 (the default) , an estimate will be provided
automatically when calling init_pf_fold()
or
update_pf_params()
. The automatic estimate is usually insufficient
for sequences more than a few hundred bases long.
pf_fold()
; this is about
twice as fast. Default is 1.
inverse_fold()
; 'C': force (1,N) to be paired,
'M' fold as if the sequence were inside a multi-loop. Otherwise the
usual mfe structure is computed.
include `fold_vars.h' if you want to change any of these variables from their defaults.
A default set of parameters, identical to the one described in Mathews et.al. (1999), is compiled into the library. Alternately, parameters can be read from and written to a file.
The following describes the file format expected by
read_parameter_file()
. All energies should be given as integers in
units of 0.01kcal/mol.
Various loop parameters depend in general on the pairs closing the loops, as well as unpaired bases in the loops. Internally, the library distinguishes 8 types of pairs (CG=1, GC=2, GU=3, UG=4, AU=5, UA=6, nonstandard=7, 0= no pair), and 5 types of bases (A=1, C=2, G=3, U=4 and 0 for anything else). Parameters belonging to pairs of type 0 are not listed in the parameter files, but values for nonstandard pairs (type 7) and nonstandard bases (type 0) are. Thus, a table for symmetric size 2 interior loops would have 7*7*5*5 entries (2 pairs, two unpaired bases).
The order of entries always uses the closing pair or pairs as the first indices followed by the unpaired bases in 5' to 3' direction. To determine the type of a pair consider the base at 5' end of each strand first, i.e. use the pairs (i,j) and (q,p) for an interior loop with i<p<q<j . This is probably better explained by an example. Consider the symmetric size 4 interior loop
5'-GAUA-3' 3'-CGCU-5'
the first pair is GC, the second UA (not AU!) the unpaired bases are (in 5' to 3' direction, starting at the first pair) A U C G. Thus we need entry [2,6,1,4,2,3] of the corresponding table. Because the loop is symmetric you could equally well describe it by UA GC C G A U, i.e. entry [6,2,2,3,1,4]. Be careful to preserve this symmetry when editing parameter tables!
The first line of the file should read
## RNAfold parameter file
lines of the form
# token
mark the beginning of a list of energy parameters of the type specified by
token. The following tokens are recognized:
# stack_energies
The list of free energies for stacked pairs, indexed by the two closing
pairs. The list should be formated as symmetric an 7*7 matrix,conforming
to the order explained above. As an example the stacked pair
5'-GU-3' 5'-AC-3' 3'-CA-5' 3'-UG-5'
corresponds to the entry [2,5], which should be identical to [5,2]. Note that the format has changed from previous releases, to make it consistent with other loop parameters.
# stack_enthalpies
enthalpies for stacked pairs, used to rescale stacking energies to
temperatures other than 37C. Same format as stack_energies.
# hairpin
Free energies of hairpin loops as a function of size. The list should
contain 31 entries on one or more lines. Since the minimum size of a
hairpin loop is 3 and we start counting with 0, the first three values
should be INF to indicate a forbidden value.
# bulge
Free energies of bulge loops. Should contain 31 entries, the first one
being INF.
# internal_loop
Free energies of internal loops. Should contain 31 entries, the first 4
being INF (since smaller loops are tabulated).
# mismatch_interior
Free energies for the interaction between the closing pair of an interior
loop and the two unpaired bases adjacent to the helix. This is a three
dimensional array indexed by the type of the closing pair (see above) and
the two unpaired bases. Since we distinguish 5 bases the list contains
8*5*5 entries and should be formated either as an 8*25 matrix or 8 5*5
matrices. The order is such that for example the mismatch
5'-CU-3' 3'-GC-5'
corresponds to entry [1,4,2] (CG=1, U=4, C=2), (in this notation the first index runs from 1 to 7, second and third from 0 to 4)
# mismatch_hairpin
Same as above for hairpin loops.
# mismatch_enthalpies
Corresponding enthalpies for rescaling to temperatures other than 37C.
# int11_energies
Free energies for symmetric size 2 interior loops. 7*7*5*5 entries formated
as 49 5*5 matrices, or seven 7*25 matrices. Example:
5'-CUU-3' 3'-GCA-5'
corresponds to entry [1,5,4,2], which should be identical to [5,1,2,4].
# int21_energies
Free energies for size 3 (2+1) interior loops. 7*7*5*5*5 entries formated
in 5*5 or 5*25 matrices. The strand with a single unpaired base is listed
first, example:
5'-CU U-3' 3'-GCCA-5'
corresponds to entry [1,5,4,2,2].
# int22_energies
Free energies for symmetric size 4 interior loops. To reduce the size of
parameter files this table only lists canonical bases (A,C,G,U) resulting in
a 7*7*4*4*4*4 table. See above for an example.
# dangle5
Energies for the interaction of an unpaired base on the 5' side and
adjacent to a helix in multiloops and free ends (the equivalent of mismatch
energies in interior and hairpin loops). The array is indexed by the type
of pair closing the helix and the unpaired base and, therefore, forms a 8*5
matrix. For example the dangling base in
5'-C-3' 3'-GC-5'
corresponds to entry [1,2] (CG=1, C=2);
# dangle3
Same as above for bases on the 3' side of a helix.
# ML_params
For the energy of a multi-loop a function of the form
E = cu*n_unpaired + ci*loop_degree + cc
is used where n_unpaired is the number of unpaired bases in the loop and
loop_degree is the number of helices forming the loop. In addition a
"terminal AU" penalty is applied to AU and GU pairs in the loop.
The line following the token should contain these four values, in the order
cu cc ci termAU
. Ther terminal AU penalty is also used for the
exterior loop and size 3 hairpins, for other loop types it is already
included in the mismatch energies.
# Tetraloops
Some tetraloops particularly stable tetraloops are assigned an energy
bonus. Up to forty tetraloops and their bonus energies can be listed
following the token, one sequence per line. For example:
GAAA -200
assigns a bonus energy of -2 kcal/mol to tetraloops containing the sequence GAAA.
# END
Anything beyond this token will be ignored.
A parameter file need not be complete, it might may contain only a subset
of interaction parameters, such as only stacking energies. However, for
each type of interaction listed, all entries have to be present.
A `*' may be used to indicate entries of a list that are to retain
their default value. For loop energies a `x' may be used to indicate that
the value is to be extrapolated from the values for smaller loop sizes.
Parameter files may contain C-style comments, i.e. any text between
/*
and */
will be ignored. However, you may have no more
than one comment per line and no multi-line comments.
A parameter file listing the default parameter set should accompany your distribution as `default.par', the file `old.par' contains parameters used in version 1.1b of the Package.
The standard representation of a secondary structure is the "bracket notation", where matching brackets symbolize base pairs and unpaired bases are shown as dots. Alternatively, one may use two types of node labels, 'P' for paired and 'U' for unpaired; a dot is then replaced by '(U)', and each closed bracket is assigned an additional identifier 'P'. We call this the expanded notation. In Fontana et al. (1993) a condensed representation of the secondary structure is proposed, the so-called homeomorphically irreducible tree (HIT) representation. Here a stack is represented as a single pair of matching brackets labeled 'P' and weighted by the number of base pairs. Correspondingly, a contiguous strain of unpaired bases is shown as one pair of matching brackets labeled 'U' and weighted by its length. Generally any string consisting of matching brackets and identifiers is equivalent to a plane tree with as many different types of nodes as there are identifiers.
Bruce Shapiro (1988) proposed a coarse grained representation, which, does not retain the full information of the secondary structure. He represents the different structure elements by single matching brackets and labels them as 'H' (hairpin loop), 'I' (interior loop), 'B' (bulge), 'M' (multi-loop), and 'S' (stack). We extend his alphabet by an extra letter for external elements 'E'. Again these identifiers may be followed by a weight corresponding to the number of unpaired bases or base pairs in the structure element. All tree representations (except for the dot-bracket form) can be encapsulated into a virtual root (labeled 'R'), see the example below.
The following example illustrates the different linear tree representations used by the package. All lines show the same secondary structure.
a) .((((..(((...)))..((..)))).)). (U)(((((U)(U)((((U)(U)(U)P)P)P)(U)(U)(((U)(U)P)P)P)P)(U)P)P)(U) b) (U)(((U2)((U3)P3)(U2)((U2)P2)P2)(U)P2)(U) c) (((H)(H)M)B) ((((((H)S)((H)S)M)S)B)S) (((((((H)S)((H)S)M)S)B)S)E) d) ((((((((H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Above: Tree representations of secondary structures. a) Full structure: the first line shows the more convenient condensed notation which is used by our programs; the second line shows the rather clumsy expanded notation for completeness, b) HIT structure, c) different versions of coarse grained structures: the second line is exactly Shapiro's representation, the first line is obtained by neglecting the stems. Since each loop is closed by a unique stem, these two lines are equivalent. The third line is an extension taking into account also the external digits. d) weighted coarse structure, this time including the virtual root.
For the output of aligned structures from string editing, different representations are needed, where we put the label on both sides. The above examples for tree representations would then look like:
a) (UU)(P(P(P(P(UU)(UU)(P(P(P(UU)(UU)(UU)P)P)P)(UU)(UU)(P(P(UU)(U... b) (UU)(P2(P2(U2U2)(P2(U3U3)P3)(U2U2)(P2(U2U2)P2)P2)(UU)P2)(UU) c) (B(M(HH)(HH)M)B) (S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S) (E(S(B(S(M(S(HH)S)(S(HH)S)M)S)B)S)E) d) (R(E2(S2(B1(S2(M4(S3(H3)S3)((H2)S2)M4)S2)B1)S2)E2)R)
Aligned structures additionally contain the gap character '_'.
Several functions are provided for parsing structures and converting to different representations.
b2C()
.
All the above functions allocate memory for the strings they return.
tree_edit_distance()
function back to bracket notation with '_'
as the gap character. The result overwrites the input.
loop_size[0]
contains the
number of external bases.
Prototypes for the above functions can be found in `RNAstruct.h'.
A simple measure of dissimilarity between secondary structures of equal length is the base pair distance, given by the number of pairs present in only one of the two structures being compared. I.e. the number of base pairs that have to be opened or closed to transform one structure into the other. It is therefore particularly useful for comparing structures on the same sequence. It is implemented by
For other cases a distance measure that allows for gaps is preferable. We can define distances between structures as edit distances between trees or their string representations. In the case of string distances this is the same as "sequence alignment". Given a set of edit operations and edit costs, the edit distance is given by the minimum sum of the costs along an edit path converting one object into the other. Edit distances like these always define a metric. The edit operations used by us are insertion, deletion and replacement of nodes. String editing does not pay attention to the matching of brackets, while in tree editing matching brackets represent a single node of the tree. Tree editing is therefore usually preferable, although somewhat slower. String edit distances are always smaller or equal to tree edit distances.
The different level of detail in the structure representations defined above naturally leads to different measures of distance. For full structures we use a cost of 1 for deletion or insertion of an unpaired base and 2 for a base pair. Replacing an unpaired base for a pair incurs a cost of 1.
Two cost matrices are provided for coarse grained structures:
/* Null, H, B, I, M, S, E */ { 0, 2, 2, 2, 2, 1, 1}, /* Null replaced */ { 2, 0, 2, 2, 2, INF, INF}, /* H replaced */ { 2, 2, 0, 1, 2, INF, INF}, /* B replaced */ { 2, 2, 1, 0, 2, INF, INF}, /* I replaced */ { 2, 2, 2, 2, 0, INF, INF}, /* M replaced */ { 1, INF, INF, INF, INF, 0, INF}, /* S replaced */ { 1, INF, INF, INF, INF, INF, 0}, /* E replaced */ /* Null, H, B, I, M, S, E */ { 0, 100, 5, 5, 75, 5, 5}, /* Null replaced */ { 100, 0, 8, 8, 8, INF, INF}, /* H replaced */ { 5, 8, 0, 3, 8, INF, INF}, /* B replaced */ { 5, 8, 3, 0, 8, INF, INF}, /* I replaced */ { 75, 8, 8, 8, 0, INF, INF}, /* M replaced */ { 5, INF, INF, INF, INF, 0, INF}, /* S replaced */ { 5, INF, INF, INF, INF, INF, 0}, /* E replaced */
The lower matrix uses the costs given in Shapiro (1990). All distance functions use the following global variables:
edit_backtrack
set to 1. See section Representations of Secondary Structures, for
details on the representation of structures.
Tree
( essentially the postorder list ) of the
structure xstruc, for use in
tree_edit_distance()
.
xstruc may be any rooted structure representation.
Prototypes for the above functions can be found in `treedist.h'. The
type Tree
is defined in `dist_vars.h', which is automatically
included with `treedist.h'
string_edit_distance()
.
Prototypes for the above functions can be found in `stringdist.h'.
For comparison of base pair probability matrices, the matrices are first condensed into probability profiles which are the compared by alignment.
pr
(see section Global Variables for the Folding Routines) and calculates a profile, i.e. a vector containing for
each base the probabilities of being unpaired, upstream, or downstream
paired, respectively. The returned array is suitable for
profile_edit_distance
.
Prototypes for the above functions can be found in `profiledist.h'.
The following utilities are used and therefore provided by the library:
pf_fold()
from the
global array pr
and the pair list base_pair
produced by
fold()
and produces a postscript "dot plot" that is written to
filename. The "dot plot" represents each base pairing
probability by a square of corresponding area in a upper triangle
matrix. The lower part of the matrix contains the minimum free energy
structure.
base_pair
array anymore.
option
is an uppercase letter the
sequence
is used to label nodes, if option
equals 'X'
or 'x' the resulting file will coordinates for an initial layout of
the graph.
PS_rna_plot
and gmlRNA
. Current possibility are
0 for a simple radial drawing or 1 for the modified radial drawing taken from
the naview
program of Bruccoleri & Heinrich (1988).
erand48()
.
urn ()
. These should be set to some random number seeds
before the first call to urn ()
.
free()
when the string is no longer needed.
Prototypes for PS_rna_plot()
and PS_dot_plot()
reside in
`PS_dot.h', the other functions are declared in `utils.h'.
The following program exercises most commonly used functions of the library. The program folds two sequences using both the mfe and partition function algorithms and calculates the tree edit and profile distance of the resulting structures and base pairing probabilities.
#include <stdio.h> #include <math.h> #include "utils.h" #include "fold_vars.h" #include "fold.h" #include "part_func.h" #include "inverse.h" #include "RNAstruct.h" #include "treedist.h" #include "stringdist.h" #include "profiledist.h" void main() { char *seq1="CGCAGGGAUACCCGCG", *seq2="GCGCCCAUAGGGACGC", *struct1,* struct2,* xstruc; float e1, e2, tree_dist, string_dist, profile_dist, kT; Tree *T1, *T2; swString *S1, *S2; float **pf1, **pf2; /* fold at 30C instead of the default 37C */ temperature = 30.; /* must be set *before* initializing */ /* allocate memory for fold(), could be skipped */ initialize_fold(strlen(seq1)); /* allocate memory for structure and fold */ struct1 = (char* ) space(sizeof(char)*(strlen(seq1)+1)); e1 = fold(seq1, struct1); struct2 = (char* ) space(sizeof(char)*(strlen(seq2)+1)); e2 = fold(seq2, struct2); free_arrays(); /* free arrays used in fold() */ /* produce tree and string representations for comparison */ xstruc = expand_Full(struct1); T1 = make_tree(xstruc); S1 = Make_swString(xstruc); free(xstruc); xstruc = expand_Full(struct2); T2 = make_tree(xstruc); S2 = Make_swString(xstruc); free(xstruc); /* calculate tree edit distance and aligned structures with gaps */ edit_backtrack = 1; tree_dist = tree_edit_distance(T1, T2); free_tree(T1); free_tree(T2); unexpand_aligned_F(aligned_line); printf("%s\n%s %3.2f\n", aligned_line[0], aligned_line[1], tree_dist); /* same thing using string edit (alignment) distance */ string_dist = string_edit_distance(S1, S2); free(S1); free(S2); printf("%s mfe=%5.2f\n%s mfe=%5.2f dist=%3.2f\n", aligned_line[0], e1, aligned_line[1], e2, string_dist); /* for longer sequences one should also set a scaling factor for partition function folding, e.g: */ kT = (temperature+273.15)*1.98717/1000.; /* kT in kcal/mol */ pf_scale = exp(-e1/kT/strlen(seq1)); init_pf_fold(strlen(seq1)); /* calculate partition function and base pair probabilities */ e1 = pf_fold(seq1, struct1); pf1 = Make_bp_profile(strlen(seq1)); e2 = pf_fold(seq2, struct2); pf2 = Make_bp_profile(strlen(seq2)); free_pf_arrays(); /* free space allocated for pf_fold() */ profile_dist = profile_edit_distance(pf1, pf2); printf("%s free energy=%5.2f\n%s free energy=%5.2f dist=%3.2f\n", aligned_line[0], e1, aligned_line[1], e2, profile_dist); free_profile(pf1); free_profile(pf2); }
In a typical Unix environment you would compile this program using: cc -c example.c -Ihpath and link using cc -o example -Llpath -lRNA -lm where hpath and lpath point to the location of the header files and library, respectively.
Jump to: a - b - e - f - g - h - i - m - n - p - r - s - t - u - w
Jump to: * - a - b - c - d - e - f - g - h - i - l - n - p - r - t - u - x
This document was generated on 28 September 2000 using texi2html 1.56k.