|
About rVISTA
Finding potential regulatory elements in noncoding
regions of the human genome is a challenging problem. Analyzing
novel sequences for the presence of known transcription factor binding
sites or their weight matrices produces a huge number of false positive
predictions that are randomly and uniformly distributed.
We combined database searches with comparative sequence analysis,
and this procedure reduced the number of predicted transcription
factor binding sites by several orders of magnitude (Loots et al. 2002).
rVISTA (regulatory VISTA) combines searching the major transcription factor
binding site database TRANSFAC Professional from Biobase with comparative sequence analysis.
It can be used directly or through links in mVISTA, Genome VISTA, or VISTA Browser.
There are three steps in this process:
Human and mouse sequences are aligned using the global alignment
program AVID.
-
rVISTA runs the Match program with the TRANSFAC Professional 9.2 (June 30, 2005)
library, or with user-submitted matrices, to identify potential transcription
factor binding sites in each of the two aligned sequences. It then determines
which of the predicted sites are aligned and conserved between the species in
the alignment. Predictions can also be based on a user-submitted consensus
sequence. TRANSFAC searches use a default core similarity of 0.75 and a default
matrix similarity of 0.70, or parameters submitted by the user. The rVISTA
visualization program lets the user view binding sites for a single
transcription factor or for combinations of transcription factors, which makes
it easy to examine the clustering of binding sites for factors that are
believed to interact with one another.
After obtaining all the binding sites in the human and mouse sequences
independently, we select only the hits where the core positions of the
human and mouse potential binding sites correspond in the alignment
of the two sequences. We call these binding sites aligned hits.
A qualifying aligned hit is allowed a maximum core shift of 6 base pairs
(bp) and only one gap, of any length, inside it.
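The aligned-hit rule above can be illustrated with a minimal Python sketch. It is not the rVISTA implementation. It assumes the alignment is given as two equal-length gapped strings with "-" as the gap character, that core positions are 0-based indices into the ungapped sequences, and that "one gap of any length" means at most one contiguous run of gap columns inside the site. All function and parameter names are hypothetical.

```python
from typing import List


def seq_to_columns(aligned: str) -> List[int]:
    """Map each ungapped sequence position to its alignment column."""
    return [col for col, ch in enumerate(aligned) if ch != "-"]


def count_gap_runs(row_a: str, row_b: str, start: int, end: int) -> int:
    """Count contiguous runs of gap columns (gap in either row) in [start, end)."""
    runs = 0
    in_gap = False
    for col in range(start, end):
        gap = row_a[col] == "-" or row_b[col] == "-"
        if gap and not in_gap:
            runs += 1
        in_gap = gap
    return runs


def is_aligned_hit(human_aln: str, mouse_aln: str,
                   human_core: int, mouse_core: int, site_len: int,
                   max_shift: int = 6, max_gap_runs: int = 1) -> bool:
    """Assumed reading of the rule: cores at most max_shift columns apart,
    and at most max_gap_runs gap runs across the columns the site spans."""
    h_col = seq_to_columns(human_aln)[human_core]
    m_col = seq_to_columns(mouse_aln)[mouse_core]
    if abs(h_col - m_col) > max_shift:
        return False
    start = min(h_col, m_col)
    end = min(len(human_aln), max(h_col, m_col) + site_len)
    return count_gap_runs(human_aln, mouse_aln, start, end) <= max_gap_runs
```

For example, with the toy alignment `ACGT--ACGTAC` over `ACGTTTACGTAC`, human core position 4 and mouse core position 6 fall in the same alignment column, so the pair passes the shift test.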
-
The dual-species sequence conservation of a
DNA region spanning a transcription factor binding site is assessed
using a novel strategy that identifies the maximal percent identity
for the DNA fragment surrounding the core of a binding site by allowing
a dynamic shift. Only predicted binding sites located in
sequence fragments that are more than 80% identical between the two
species over a 24 bp window are selected for further consideration.
We call these hits conserved hits.
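One way to picture the dynamic-shift conservation test is a sliding window, sketched below in Python. This is an assumed interpretation, not the published algorithm: the window is measured in alignment columns, every 24-column window that contains the core column is tried, gap columns count as mismatches, and both rows are assumed to use the same letter case. The names `max_window_identity` and `is_conserved_hit` are hypothetical.

```python
def max_window_identity(human_aln: str, mouse_aln: str,
                        core_col: int, window: int = 24) -> float:
    """Best percent identity over all windows of `window` columns that
    contain core_col. Returns 0.0 if the alignment is shorter than the window."""
    n = len(human_aln)
    first = max(0, core_col - window + 1)   # leftmost window still covering the core
    last = min(core_col, n - window)        # rightmost window that fits in the alignment
    best = 0.0
    for start in range(first, last + 1):
        pairs = zip(human_aln[start:start + window],
                    mouse_aln[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        best = max(best, 100.0 * matches / window)
    return best


def is_conserved_hit(human_aln: str, mouse_aln: str, core_col: int,
                     window: int = 24, threshold: float = 80.0) -> bool:
    """A hit is kept when the best window identity is strictly over the threshold."""
    return max_window_identity(human_aln, mouse_aln, core_col, window) > threshold
```

With a 24-column window, "over 80%" works out to at least 20 matching columns, since 19 of 24 is about 79.2% and 20 of 24 is about 83.3%.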
Base sequence annotation determines
the genomic location of each predicted transcription factor hit.
Sites located within exons or UTRs are classified as coding and
the rest as noncoding. Since our focus is on noncoding regulatory
sequences, rVISTA output includes only those contained in intergenic
intervals. If the user does not provide an annotation, all predicted
conserved hits are included in the output of the program.
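The annotation filter can be sketched as a simple interval check. The Python below is illustrative only. It assumes hits and annotated features are half-open `(start, end)` intervals on the base sequence, and it treats any overlap with an exon or UTR as "coding", which is an assumption about how "located within" is decided. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the base sequence


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def classify_hit(hit: Interval, exons_and_utrs: List[Interval]) -> str:
    """Label a hit 'coding' if it touches any exon or UTR, else 'noncoding'."""
    if any(overlaps(hit, feature) for feature in exons_and_utrs):
        return "coding"
    return "noncoding"


def filter_hits(hits: List[Interval],
                exons_and_utrs: Optional[List[Interval]] = None) -> List[Interval]:
    """Keep noncoding hits; with no annotation, keep every conserved hit."""
    if exons_and_utrs is None:
        return list(hits)
    return [h for h in hits if classify_hit(h, exons_and_utrs) == "noncoding"]
```

Passing `None` for the annotation mirrors the case where the user provides none, so all conserved hits pass through unchanged.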
The rVISTA visualization program
displays the predicted binding sites selected by the user, both
individually and in various combinations, which allows the clustering
of sites from different families to be investigated.
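Clustering of sites from different factors is examined visually in rVISTA. A rough programmatic analogue is sketched below in Python, purely as an illustration. The hit representation `(factor_name, position)`, the `max_span` parameter, and the factor names and coordinates in the usage lines are all made up for the example and are not rVISTA output.

```python
from typing import List, Set, Tuple

Hit = Tuple[str, int]  # (factor name, position of the site on the base sequence)


def find_clusters(hits: List[Hit], factors: Set[str],
                  max_span: int = 100) -> List[List[Hit]]:
    """Return windows of hits, each starting at one hit and extending at most
    max_span bp, in which every requested factor appears at least once."""
    selected = sorted((h for h in hits if h[0] in factors), key=lambda h: h[1])
    clusters = []
    for i, (_, start) in enumerate(selected):
        window = [h for h in selected[i:] if h[1] - start <= max_span]
        if {name for name, _ in window} == set(factors):
            clusters.append(window)
    return clusters


# Hypothetical hits: two factors close together, plus unrelated sites.
example_hits = [("GATA1", 100), ("AP1", 150), ("GATA1", 900), ("SP1", 160)]
clusters = find_clusters(example_hits, {"GATA1", "AP1"}, max_span=100)
```

Windows that start at neighbouring hits may overlap, so a real analysis would probably merge them before reporting.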
How to Cite
When using results obtained with rVISTA, please cite this paper:
Loots, G., Ovcharenko, I., Pachter, L., Dubchak, I., Rubin, E. (2002) rVISTA for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res. 12:832-839
|