RNA - interface to the Vienna RNA library Version 0.3

SYNOPSIS
DESCRIPTION

Folding Routines
Parsing and Comparing Structures
Utilities (from utils.h)
SWIG helper functions

SYNOPSIS

  use RNA;
  $seq = "CGCAGGGAUACCCGCG";
  ($struct, $mfe) = RNA::fold($seq);  #predict mfe structure of $seq
  RNA::PS_rna_plot($seq, $struct, "rna.ps");  # write PS plot to rna.ps
  $F = RNA::pf_fold($seq);   # compute partition function and pair pobabilities
  RNA::PS_dot_plot($seq, "dot.ps");          # write dot plot to dot.ps
  ...

DESCRIPTION

The RNA.pm package gives access to almost all functions in the libRNA.a library of the Vienna RNA PACKAGE. The Perl wrapper is generated using SWIG http://www.swig.org/ with relatively little manual intervention. For each C function in the library the perl package provides a function of the same name and calling convention (with few exceptions). For detailed information you should therefore also consult the documentation of the library (info RNAlib).

Note that in general C arrays are wrapped into opaque objects that can only be accessed via helper functions. SWIG provides a couple of general purpose helper functions, see the section at the end of this file. C structures are wrapped into Perl objects using SWIG's shadow class mechanism, resulting in a tied hash with keys named after the structure members.

For the interrested reader we list for each scalar type of the corepsonding C variable in brackets, and point out the header files containing the C declaration.

Folding Routines

Minimum free Energy Folding (from fold.h)

fold SEQUENCE

fold SEQUENCE, CONSTRAINTS

computes the minimum free energy structure of the string SEQUENCE and returns the predicted structure and energy, e.g.

  ($structure, $mfe) = RNA::fold("UGUGUCGAUGUGCUAU");

If a second argument is supplied and $fold_constrained==1 the CONSTRAINTS string is used to specify constraints on the predicted structure. The characters '|', 'x', '<', '>' mark bases that are paired, unpaired, paired upstream, or downstream, respectively; matching brackets "( )" denote base pairs, dots '.' are used for unconstrained bases.

In the two argument version the CONSTRAINTS string is modified and holds the predicted structure upon return. This is done for backwards compatibility only, and might change in future versions.

energy_of_struct SEQUENCE, STRUCTURE

returns the energy of SEQUENCE on STRUCTURE (in kcal/mol). The string structure must hold a valid secondary structure in bracket notation.

update_fold_params

recalculate the pair matrix and energy parameters after a change in folding parameters. In many cases (such as changes to $temperature) the fold() routine will call update_fold_params automatically when necessary.

free_arrays

frees memory allocated internally when calling fold.

cofold SEQUENCE

cofold SEQUENCE, CONSTRAINTS

works as fold, but SEQUENCE may be the concatenation of two RNAs in order compute their hybridization structure. E.g.:

  $seq1  ="CGCAGGGAUACCCGCG";
  $seq2  ="GCGCCCAUAGGGACGC";
  $RNA::cut_point = length($seq1)+1;
  ($costruct, $comfe) = RNA::cofold($seq1 . $seq2);

duplexfold SEQ1 SEQ2

compute the structure upon hybridization of SEQ1 and SEQ2. In contrast to cofold only intra-molecular pairs are allowed. Thus, the algorithm runs in O(n1*n2) time where n1 and n2 are the lengths of the sequences. The result is returned in a C struct containing the innermost base pair (i,j) the structure and energy. E.g:


  $seq1 ="CGCAGGGAUACCCGCG";
  $seq2 ="GCGCCCAUAGGGACGC";
  $dup  = RNA::duplexfold($seq1, $seq2);
  print "Region ", $dup->{i}+1-length($seq1), " to ", 
        $dup->{i}, " of seq1 ",
        "pairs up with ", $dup->{j}, " to ", 
        $dup->{j}+length($dup->{structure}-length($seq1)-2, 
        " of seq2\n";

Partition function Folding (from part_func.h)

pf_fold SEQUENCE

pf_fold SEQUENCE, CONSTRAINTS

calculates the partition function over all possible secondary structures and the matrix of pair probabilities for SEQUENCE and returns a two element list consisting of a string summarizing possible structures. See below on how to access the pair probability matrix. As with fold the second argument can be used to specify folding constraints. Constraints are implemented by excluding base pairings that contradict the constraint, but without bonus energies. Constraints of type '|' (paired base) are ignored. In the two argument version CONSTRAINTS is modified to contain the structure string on return (obsolete feature, for backwards compatibility only)

get_pr I, J

After calling pf_fold the global C variable pr points to the computed pair probabilities. Perl access to the C is facilitated by the get_pr helper function that looks up and returns the probability of the pair (I,J).

free_pf_arrays

frees memory allocated for pf_fold

update_pf_params LENGTH

recalculate energy parameters for pf_fold. In most cases (such as simple changes to $temperature) pf_fold will take appropriate action automatically.

pbacktrack SEQUENCE

return a random structure chosen according to it's Boltzmann probability. Use to produce samples representing the thermodynamic ensemble of structures.

  RNA::pf_fold($sequence);
  for (1..1000) {
     push @sample, RNA::pbacktrack($sequence);
  }

Suboptimal Folding (from subopt.h)

subopt SEQUENCE, CONSTRAINTS, DELTA

subopt SEQUENCE, CONSTRAINTS, DELTA, FILEHANDLE

compute all structures of SEQUENCE within DELTA*0.01 kcal/mol of the optimum. If specified, results are written to FILEHANDLE and nothing is returned. Else, the C function returnes a list of C structs of type SOLUTION. The list is wrapped by SWIG as a perl object that can be accesses as follows:

  $solution = subopt($seq, undef, 500);
  for (0..$solution->size()-1) {
     printf "%s %6.2f\n",  $solution->get($_)->{structure},
                           $solution->get($_)->{energy};
  }

Alignment Folding (from alifold.h)

alifold REF

fold REF, CONSTRAINTS

similar to fold() but compute the consensus structure for a set of aligned sequences. E.g.:

  @align = ("GCCAUCCGAGGGAAAGGUU", 
            "GAUCGACAGCGUCU-AUCG", 
            "CCGUCUUUAUGAGUCCGGC");
  ($consens_struct, $consens_en) = RNA::alifold(\@align);
  
  =item consensus REF
  =item consens_mis REF

compute a simple consensus sequence or "most informative sequence" form an alignment. The simple consensus returns the most frequent character for each column, the MIS uses the IUPAC symbol that contains all characters that are overrepresented in the column.

  $mis = consensus_mis(\@align);

Inverse Folding (from inverse.h)

inverse_fold START, TARGET: find a sequence that folds into structure TARGET, by optimizing the sequence until its mfe structure (as returned by fold) is TARGET. Startpoint of the optimization is the sequence START. Returns a list containing the sequence found and the final value of the cost function, i.e. 0 if the search was successful. A random start sequence can be generated using random_string.
inverse_pf_fold START, TARGET: optimizes a sequence (beginning with START) by maximising the frequency of the structure TARGET in the thermodynamic ensemble of structures. Returns a list containing the optimized sequence and the final value of the cost function. The cost function is given by energy_of_struct(seq, TARGET) - pf_fold(seq), i.e.-RT*log(p(TARGET))
$final_cost [float]: holds the value of the cost function where the optimization in inverse_pf_fold should stop. For values <=0 the optimization will only terminate at a local optimimum (which might take very long to reach).
$symbolset [char *]: the string symbolset holds the allowed characters to be used by inverse_fold and inverse_pf_fold, the default alphabet is "AUGC"
$give_up [int]: If non-zero stop optimization when its clear that no exact solution can be found. Else continue and eventually return an approximate solution. Default 0.

Cofolding of two RNA molecules (from cofold.h)

Global Variables to Modify Folding (from fold_vars.h)

$noGU [int]: Do not allow GU pairs to form, default 0.
$no_closingGU [int]: allow GU only inside stacks, default 0.
$tetra_loop [int]: Fold with specially stable 4-loops, default 1.
$energy_set [int]: 0 = BP; 1=any mit GC; 2=any mit AU-parameter, default 0.
$dangles [int]: How to compute dangling ends. 0: no dangling end energies, 1: "normal" dangling ends (default), 2: simplified dangling ends, 3: "normal" + co-axial stacking. Note that pf_fold treats cases 1 and 3 as 2. The same holds for the main computation in subopt, however subopt will re-evalute energies using energy_of_struct for cases 1 and 3. See the more detailed discussion in RNAlib.texinfo.
$nonstandards [char *]: contains allowed non standard bases, default empty string ""
$temperature [double]: temperature in degrees Celsius for rescaling parameters, default 37C.
$logML [int]: use logarithmic multiloop energy function in energy_of_struct, default 0.
$noLonelyPairs [int]: consider only structures without isolated base pairs (helices of length 1). For pf_fold only eliminates pairs that can only occur as isolated pairs. Default 0.
$base_pair [struct bond *]: list of base pairs from last call to fold. Better use the structure string returned by fold.
$pf_scale [double]: scaling factor used by pf_fold to avoid overflows. Should be set to exp(-F/(RT*length)) where F is a guess for the ensmble free energy (e.g. use the mfe).
$fold_constrained [int]: apply constraints in the folding algorithms, default 0.
$do_backtrack [int]: If 0 do not compute the pair probabilities in pf_fold (only the partition function). Default 1.
$backtrack_type [char]: usually 'F'; 'C' require (1,N) to be bonded; 'M' backtrack as if the sequence was part of a multi loop. Used by inverse_fold
$pr [double *]: the base pairing prob. matrix computed by pf_fold.
$iindx [int *]: Array of indices for moving withing the pr array. Better use get_pr.

Parsing and Comparing Structures

from RNAstruct.h: these functions convert between strings representating secondary structures with various levels of coarse graining. See the documentation of the C library for details

b2HIT STRUCTURE: Full -> HIT [incl. root]
b2C STRUCTURE: Full -> Coarse [incl. root]
b2Shapiro STRUCTURE: Full -> weighted Shapiro [i.r.]
add_root STRUCTURE: {Tree} -> ({Tree}R)
expand_Shapiro COARSE: add S for stacks to coarse struct
expand_Full STRUCTURE: Full -> FFull
unexpand_Full FSTRUCTURE: FFull -> Full
unweight WCOARSE: remove weights from coarse struct
unexpand_aligned_F ALIGN
parse_structure STRUCTURE: computes structure statistics, and fills the following global variables:; $loops [int] number of loops (and stacks) $unpaired [int] number of unpaired positions $pairs [int] number of paired positions $loop_size[int *] holds all loop sizes $loop_degree[int *] holds all loop degrees $helix_size[int *] holds all helix lengths

from treedist.h: routines for computing tree-edit distances between structures

make_tree STRUCTURE: make input for tree_edit_distance (a Tree * pointer).
tree_edit_distance T1, T2: compare to structures using tree editing. T1, T2 must have been created using tree_edit_distance
print_tree T: mainly for debugging
free_tree T: free space allocated by make_tree

from stringdist.h routines to compute structure distances via string-editing

Make_swString STRUCTURE: [ returns swString * ] make input for string_edit_distance
string_edit_distance S1, S2: [ returns float ] compare to structures using string alignment. S1, S2 should be created using Make_swString

from profiledist

Make_bp_profile LENGTH: [ returns (float *) ] condense pair probability matrix pr into a vector containing probabilities for upstream paired, downstream paired and unpaired. This resulting probability profile is used as input for profile_edit_distance
profile_edit_distance T1, T2: [ returns float ] align two probability profiles produced by Make_bp_profile
print_bppm T: [ returns void ] print string representation of probability profile
free_profile T: [ returns void ] free space allocated in Make_bp_profile

Global variables for computing structure distances

$edit_backtrack [int]: set to 1 if you want backtracking
$aligned_line [(char *)[2]]: containes alignmed structures after computing structure distance with edit_backtrack==1
$cost_matrix [int]: 0 usual costs (default), 1 Shapiro's costs

Utilities (from utils.h)

space SIZE: allocate memory from C. Usually not needed in Perl
nrerror MESSGAE: die with error message. Better use Perl's die
$xsubi [unsigned short[3]]: libRNA uses the rand48 48bit random number generator if available, the current random number is always stored in $xsubi.
init_rand: initialize the $xsubi random number from current time
urn: returns a random number between 0 and 1 using the random number generator from the RNA library.
int_urn FROM, TO: returns random integer in the range [FROM..TO]
time_stamp: current date in a string. In perl you might as well use locatime
random_string LENGTH, SYMBOLS: returns a string of length LENGTH using characters from the string SYMBOLS
hamming S1, S2: calculate hamming distance of the strings S1 and S2.
pack_structure STRUCTURE: pack secondary structure, using a 5:1 compression via 3 encoding. Returns the packed string.
unpack_structure PACKED: unpacks a secondary structure packed with pack_structure
make_pair_table STRUCTURE: returns a pair table as a newly allocated (short *) C array, such that: table[i]=j if (i.j) pair or 0 if i is unpaired, table[0] contains the length of the structure.
bp_distance STRUCTURE1, STRUCTURE2: returns the base pair distance of the two STRUCTURES. dist = {number of base pairs in one structure but not in the other} same as edit distance with open-pair close-pair as move-set

from PS_plot.h

PS_rna_plot SEQUENCE, STRUCTURE, FILENAME: write PostScript drawing of structure to FILENAME. Returns 1 on sucess, 0 else.
PS_rna_plot_a SEQUENCE, STRUCTURE, FILENAME, PRE, POST: write PostScript drawing of structure to FILENAME. The strings PRE and POST contain PostScript code that is included verbatim in the plot just before (after) the data. Returns 1 on sucess, 0 else.
gmlRNA SEQUENCE, STRUCTURE, FILENAME, OPTION: write structure drawing in gml (Graph Meta Language) to FILENAME. OPTION should be a single character. If uppercase the gml output will include the SEQUENCE as node labels. IF OPTION equal 'x' or 'X' write graph with coordinates (else only connectivity information). Returns 1 on sucess, 0 else.
ssv_rna_plot SEQUENCE, STRUCTURE, SSFILE: write structure drfawing as coord file for SStructView Returns 1 on sucess, 0 else.
xrna_plot SEQUENCE, STRUCTURE, SSFILE: write structure drawing as ".ss" file for further editing in XRNA. Returns 1 on sucess, 0 else.
PS_dot_plot SEQUENCE, FILENAME: write a PostScript dot plot of the pair probability matix to FILENAME. Returns 1 on sucess, 0 else.
$rna_plot_type [int]: Select layout algorithm for structure drawings. Currently available 0= simple coordinates, 1= naview, default 1.

from read_epars.c

read_parameter_file FILENAME: read energy parameters from FILENAME
write_parameter_file FILENAME: write energy parameters to FILENAME

SWIG helper functions

The package includes generic helper functions to access C arrays of type int, float and double, such as:

intP_getitem POINTER, INDEX: return the element INDEX from the array
intP_setitem POINTER, INDEX, VALUE: set element INDEX to VALUE
new_intP NELEM: allocate a new C array of integers with NELEM elements and return the pointer
delete_intP POINTER: deletes the C array by calling free()

substituting intP with floatP or doubleP gives the corresponding functions for arrays of float or double. You need to know the correct C type however, and the function work only for arrays of simple types.

On the lowest level the cdata function gives direct access to any data in the form of a Perl string.

cdata POINTER, SIZE: copies SIZE bytes at POINTER to a Perl string (with binary data)
memmove POINTER, STRING: copies the (binary) string STRING to the memory location pointed to by POINTER.

In combination with Perl's unpack this provides a generic way to convert C data structures to Perl. E.g.

  RNA::parse_structure($structure);  # fills the $RNA::loop_degree array
  @ldegrees = unpack "I*", RNA::cdata($RNA::loop_degree, ($RNA::loops+1)*4);

Warning: using these functions with wrong arguments will corrupt your memory and lead to a segmentation fault.


=head1 AUTHOR

Ivo L. Hofacker <ivo@tbi.univie.ac.at>