Tools for Comparative Genomics

 
 

Using the rVISTA web interface

  1. Input to the server
    1. TRANSFAC matrices
    2. User-defined consensus sequence
    3. User-defined matrices
  2. Output
    1. Options
    2. Individual and Group Clustering
    3. Conserved and Aligned binding sites
  3. Visualization

These instructions explain how to submit an alignment to rVISTA. If you are submitting a pre-calculated alignment, you only need this document. If you wish to align your own sequences, read the mVISTA instructions first.

To read about the process rVISTA uses to produce its predictions, see the about page.

  1. Input to the Server

    TRANSFAC matrices

    If you choose to use TRANSFAC matrices, you can select from several options to refine your search. The line of checkboxes allows the user to select factor groups based on the type of organism being analyzed (vertebrates, plants, nematodes, insects, fungi, or bacteria).

    The following cut-off selections are available, as described by TRANSFAC:

    Recommended Threshold:
    The recommended thresholds are calculated automatically by subtracting one standard deviation from the average score of the true binding sites for each matrix.
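    The rule above can be sketched in a few lines. This is only an illustration: the scores are made-up values for a single matrix, `recommended_threshold` is a hypothetical helper name, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev

def recommended_threshold(true_site_scores):
    """Average score of the true binding sites minus one standard deviation."""
    # Assumption: sample standard deviation (statistics.stdev).
    return mean(true_site_scores) - stdev(true_site_scores)

# Hypothetical matrix-similarity scores of known binding sites for one matrix.
scores = [0.90, 0.80, 1.00, 0.90]
threshold = recommended_threshold(scores)
print(round(threshold, 3))
```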

    Cut-off to minimize false positive matches (minFP):
    To estimate this cut-off, which reduces the number of random sites found by Match™, we applied the algorithm described above to third exon sequences, because these sequences are presumed to contain no biologically relevant TF binding sites. For every matrix, the lowest cut-off for which no match is found in the set of exon sequences is considered to be the minFP cut-off.

    When a minFP cut-off is applied to a DNA sequence, the algorithm finds a relatively low number of matches per nucleotide. The output contains only putative sites with good similarity to the weight matrix; however, some known genomic binding sites may not be recognized. This kind of cut-off is useful, for example, when searching for the most promising potential binding sites in extended genomic DNA sequences.
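    As a rough sketch of the minFP idea, the snippet below picks the lowest cut-off from a grid of candidates at which none of the background scores would be reported as a match. The background scores, the candidate grid, and the function name `min_fp_cutoff` are all assumptions made for illustration; a match is taken to mean a score greater than or equal to the cut-off.

```python
def min_fp_cutoff(background_scores, candidates):
    """Lowest candidate cut-off at which no background score counts as a match."""
    for cutoff in sorted(candidates):
        # A site is reported when its score is >= cutoff, so "no match"
        # means every background score lies strictly below the cut-off.
        if all(score < cutoff for score in background_scores):
            return cutoff
    return None  # every candidate still lets a background match through

# Hypothetical best scores of one matrix on sequences presumed to hold no sites.
background = [0.62, 0.71, 0.88, 0.79]
candidates = [i / 100 for i in range(50, 101)]  # 0.50, 0.51, ..., 1.00
cutoff = min_fp_cutoff(background, candidates)
print(cutoff)
```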

    Cut-off to minimize false negative matches (minFN):
    We used sets of generated oligonucleotides for estimating the cut-offs to minimize the false negative rate, using actual weight matrices to calculate the probability of a nucleotide occurring at a certain position of a binding site.

    For each matrix we applied the Match™ algorithm to these test sequence sets without using any matrix similarity cut-off. (The core similarity was set to 0.75.) We then set the cut-off to a value that recognizes at least 90% of the oligonucleotides; in other words, we tolerate an error rate of 10%. We call this set of cut-offs the minFN cut-offs.

    Applying the minFN cut-offs, the user will find most genomic binding sites, but in this case a high rate of false positives should be taken into account as well. The minFN cut-offs are useful for the detailed analysis of relatively short DNA fragments.
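    The 90% rule can be illustrated as follows. The scores are invented, `min_fn_cutoff` is a hypothetical helper, and the snippet assumes a site is recognized when its score is greater than or equal to the cut-off.

```python
import math

def min_fn_cutoff(site_scores, recognized_percent=90):
    """Highest cut-off that still recognizes the requested share of the sites."""
    if not site_scores:
        raise ValueError("at least one score is required")
    ranked = sorted(site_scores, reverse=True)
    needed = math.ceil(recognized_percent * len(ranked) / 100)
    return ranked[needed - 1]

# Hypothetical matrix-similarity scores of ten generated oligonucleotides.
oligo_scores = [0.95, 0.91, 0.88, 0.86, 0.84, 0.83, 0.80, 0.78, 0.75, 0.60]
cutoff = min_fn_cutoff(oligo_scores)
recognized = sum(score >= cutoff for score in oligo_scores)
print(cutoff, recognized)
```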

    Cut-offs for core and matrix similarity:
    The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary part of the input sequence. Analogously, the core similarity denotes the quality of a match between the core sequence of a matrix (i.e. the five most conserved positions within a matrix) and a part of the input sequence. A match has to contain the "core sequence" of a matrix, i.e. the core sequence has to match with a score higher than or equal to the core similarity cut-off. In addition, only those matches that score higher than or equal to the matrix similarity cut-off appear in the output.
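    The two-step filter can be sketched as below. This is a simplified illustration, not the scoring formula used by Match™: the score is a plain min-max normalized sum of matrix counts, the core is assumed to be the five columns with the highest maximum count, and the input is assumed to contain only uppercase A, C, G and T. The matrix, the sequence, and all names are illustrative.

```python
# Small count matrix: one row per base, one column per motif position.
MATRIX = {
    "A": [1, 0, 0, 3, 1, 2, 1, 1],
    "C": [0, 0, 0, 1, 0, 0, 0, 1],
    "G": [1, 4, 1, 0, 0, 0, 0, 0],
    "T": [2, 0, 3, 0, 3, 2, 3, 2],
}
WIDTH = len(MATRIX["A"])

def column(i):
    return [MATRIX[base][i] for base in "ACGT"]

# Assumption: the core is the five columns with the highest maximum count.
CORE = sorted(sorted(range(WIDTH), key=lambda i: -max(column(i)))[:5])

def similarity(window, positions):
    """Min-max normalized score of `window` over the given matrix columns."""
    current = sum(MATRIX[window[i]][i] for i in positions)
    lowest = sum(min(column(i)) for i in positions)
    highest = sum(max(column(i)) for i in positions)
    return (current - lowest) / (highest - lowest)

def scan(sequence, core_cutoff, matrix_cutoff):
    """Report (start, core similarity, matrix similarity) for passing windows."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        window = sequence[start:start + WIDTH]
        core = similarity(window, CORE)
        if core < core_cutoff:
            continue  # the core sequence must match first
        full = similarity(window, range(WIDTH))
        if full >= matrix_cutoff:
            hits.append((start, core, full))
    return hits

hits = scan("CCTGTATTTTCC", core_cutoff=0.75, matrix_cutoff=0.80)
print(hits)
```

    Checking the core first mirrors the description above: a window whose core falls below the core similarity cut-off is discarded before the full matrix similarity is considered.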

    After you choose a cut-off, a list of matrices will be presented for you to select from. This is the last step in the submission process.

    Consensus Sequence

    The user can specify their own motif as a consensus sequence written in IUPAC one-letter codes. The IUPAC recommendations include letters to represent all possible ambiguities at a single position in the sequence, except a gap.

    Symbol Name
    A Adenine
    B G or T or C
    C Cytosine
    D G or A or T
    G Guanine
    H A or C or T
    K G or T
    M A or C
    N A or G or C or T
    R G or A
    S G or C
    T Thymine
    V G or C or A
    W A or T
    Y T or C
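    To show how such a consensus is interpreted, the sketch below expands each IUPAC symbol from the table into a character class and scans a sequence for matches. It only searches the forward strand, and the consensus, the sequence, and the function names are hypothetical examples rather than anything the server itself exposes.

```python
import re

# IUPAC symbol -> bases it stands for (taken from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "D": "AGT", "H": "ACT", "K": "GT", "M": "AC",
    "N": "ACGT", "R": "AG", "S": "CG", "V": "ACG", "W": "AT", "Y": "CT",
}

def consensus_to_regex(consensus):
    """Turn an IUPAC consensus into a compiled regular expression."""
    return re.compile("".join("[" + IUPAC[s] + "]" for s in consensus.upper()))

def find_sites(consensus, sequence):
    """Start positions (0-based) of all matches, including overlapping ones."""
    pattern = consensus_to_regex(consensus)
    last_start = len(sequence) - len(consensus)
    return [i for i in range(last_start + 1) if pattern.match(sequence, i)]

# Hypothetical consensus and sequence; S matches either G or C.
print(find_sites("TGASTCA", "AATGACTCAGGTGAGTCATT"))
```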

    User-defined matrices

    In order to search for motifs defined by position weight matrices, please specify the name of a matrix and enter your matrix in the format shown below:

    A| 1 0 0 3 1 2 1 1
    C| 0 0 0 1 0 0 0 1
    G| 1 4 1 0 0 0 0 0
    T| 2 0 3 0 3 2 3 2

    If you want to search for several motifs in one request, press "Another matrix" after each matrix and press "Done" when you have entered all of them.
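    Before submitting, it can help to check that a matrix is well formed. The following is a minimal sketch that parses the row format shown above; it assumes integer counts and exactly one row per base, which may be stricter or looser than what the server accepts. `parse_matrix` is a hypothetical helper.

```python
def parse_matrix(text):
    """Parse rows such as 'A| 1 0 0 3' into a dict of base -> list of counts."""
    rows = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between rows
        base, _, counts = line.partition("|")
        rows[base.strip().upper()] = [int(value) for value in counts.split()]
    if set(rows) != set("ACGT"):
        raise ValueError("expected one row each for A, C, G and T")
    if len({len(values) for values in rows.values()}) != 1:
        raise ValueError("all rows must have the same number of columns")
    return rows

example = """
A| 1 0 0 3 1 2 1 1
C| 0 0 0 1 0 0 0 1
G| 1 4 1 0 0 0 0 0
T| 2 0 3 0 3 2 3 2
"""
matrix = parse_matrix(example)
print(len(matrix["A"]), matrix["G"])
```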

  2. Output

    Visualization Options screen

    [Image: sample visualization options screen]

    Above you can see a sample visualization options screen. It allows you to set the length of sequence that will be depicted on a single row of the VISTA graph (1), the width of the picture produced (2), clustering (3 and 4) and which binding sites will be visualized (5).

    Individual and Group clustering

    Clustering allows users to identify transcription factor binding sites that occur in groups, or clusters. For an individual cluster to occur (3), at least K binding sites for the same transcription factor must lie within N base pairs; K and N can be set separately for each site. For a group cluster to occur (4), at least K binding sites for any of the transcription factors must lie within N base pairs.
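    The difference between the two clustering modes can be sketched as follows. The factor names and positions are invented, `find_clusters` is a hypothetical helper, and "within N base pairs" is interpreted here as a distance of at most N between the first and last site start, which is an assumption about the exact definition.

```python
def find_clusters(positions, k, n):
    """(first, last) start positions of every run of k sites spanning <= n bp."""
    ordered = sorted(positions)
    clusters = []
    for i in range(len(ordered) - k + 1):
        first, last = ordered[i], ordered[i + k - 1]
        if last - first <= n:  # assumed meaning of "within N base pairs"
            clusters.append((first, last))
    return clusters

# Hypothetical predicted site positions for two factors.
sites = {"factor_a": [100, 130, 150, 900], "factor_b": [140, 2000]}

# Individual clustering: each factor is examined on its own.
individual = {name: find_clusters(pos, k=3, n=60) for name, pos in sites.items()}

# Group clustering: sites of all factors are pooled before searching.
pooled = [p for pos in sites.values() for p in pos]
group = find_clusters(pooled, k=4, n=60)

print(individual)
print(group)
```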

    Conserved and Aligned binding sites

    rVISTA can show its predictions for conserved, aligned, or all binding sites. Conserved binding sites are predicted binding sites located in sequence fragments that are conserved between the two species at a level above 80% over a 24 bp window. Aligned binding sites are those where the core positions of the potential binding sites in the two sequences correspond to each other in the alignment. "All binding sites" shows every predicted site, regardless of alignment and conservation.
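    The conservation criterion can be illustrated with a sliding window over a pairwise alignment. This is a sketch under stated assumptions, not rVISTA's implementation: conservation is measured as the fraction of identical, non-gap alignment columns, gaps count as mismatches, and a site is called conserved when every column it spans is covered by a qualifying window. The sequences and function names are hypothetical.

```python
def conserved_columns(aligned_a, aligned_b, window=24, min_identity=0.80):
    """Mark alignment columns covered by a window with identity > min_identity."""
    length = len(aligned_a)
    identical = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    covered = [False] * length
    for start in range(length - window + 1):
        if sum(identical[start:start + window]) / window > min_identity:
            for col in range(start, start + window):
                covered[col] = True
    return covered

def site_is_conserved(covered, start, end):
    """True if every column of the site [start, end) lies in a conserved window."""
    return all(covered[start:end])

# Hypothetical alignment: 30 identical columns followed by a 10-column gap.
aligned_a = "ACGT" * 10
aligned_b = aligned_a[:30] + "-" * 10
covered = conserved_columns(aligned_a, aligned_b)

print(site_is_conserved(covered, 5, 15))
print(site_is_conserved(covered, 30, 38))
```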

  3. Visualization

    [Image: VISTA graph of the alignment]

    The visualization features a VISTA graph of the alignment, which identifies regions of high conservation between the two species and provides an annotation, together with tick marks for all the transcription factor binding sites. Green tick marks represent conserved binding sites, red tick marks represent aligned sites, and blue tick marks represent all sites found.