Finding potential regulatory elements in noncoding regions of the human genome is a challenging problem. Analyzing novel sequences for the presence of known transcription factor binding sites or their weight matrices produces a huge number of false-positive predictions that are randomly and uniformly distributed. We combined database searches with comparative sequence analysis, and this procedure reduced the number of predicted transcription factor binding sites by several orders of magnitude (Loots 2002).
rVISTA (regulatory VISTA) combines a search of the major transcription factor binding site database, TRANSFAC Professional from Biobase, with comparative sequence analysis. It can be used directly or through links in mVISTA, Genome VISTA, or VISTA Browser.
There are three steps in this process:
Human and mouse sequences are aligned using the global alignment program AVID.
rVISTA uses the Match program with the TRANSFAC Professional 9.2 (June 30, 2005) library or user-submitted matrices to identify potential transcription factor binding sites in each of the two aligned sequences, and then determines which of the predicted sites are aligned and conserved between the species in the alignment. Predictions can also be based on a user-submitted consensus sequence. TRANSFAC searches are performed using the default core similarity value of 0.75 and matrix similarity value of 0.70, or parameters submitted by the user. The visualization program for rVISTA lets the user view binding sites for a single transcription factor or for various combinations of transcription factors, which makes it easy to examine the clustering of binding sites for factors that are believed to interact with one another.
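The two thresholds can be illustrated with a simplified matrix scan. This is a sketch only, not the Match implementation: it uses a plain min-max normalised score without any position weighting, scans the forward strand only, and the function names, the count-matrix layout (one dict per position), and the `core_start`/`core_len` parameters are assumptions introduced for illustration. Only the 0.75 and 0.70 defaults come from the text above.

```python
def similarity(matrix, site):
    """Min-max normalised score of `site` against per-position count rows."""
    current = sum(row[base] for row, base in zip(matrix, site))
    lowest = sum(min(row.values()) for row in matrix)
    highest = sum(max(row.values()) for row in matrix)
    return (current - lowest) / (highest - lowest) if highest > lowest else 0.0


def scan(sequence, matrix, core_start, core_len=5,
         core_cutoff=0.75, matrix_cutoff=0.70):
    """Return (position, core_score, matrix_score) for every passing site.

    `matrix` is a list of dicts such as {"A": 10, "C": 0, "G": 0, "T": 0};
    `core_start` and `core_len` mark the assumed core inside the matrix.
    """
    width = len(matrix)
    core = matrix[core_start:core_start + core_len]
    hits = []
    for pos in range(len(sequence) - width + 1):
        site = sequence[pos:pos + width]
        if any(base not in "ACGT" for base in site):
            continue  # skip windows with ambiguous bases
        core_score = similarity(core, site[core_start:core_start + core_len])
        if core_score < core_cutoff:
            continue  # the core must pass before the full matrix is scored
        matrix_score = similarity(matrix, site)
        if matrix_score >= matrix_cutoff:
            hits.append((pos, core_score, matrix_score))
    return hits
```

Checking the core first mirrors the role of the two thresholds described above: a site is reported only if both the core and the full matrix score reach their cutoffs.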
After obtaining all the binding sites in the human and mouse sequences independently, we select only the hits where the core positions of the human and mouse potential binding sites correspond in the alignment of the two sequences. We call these binding sites aligned hits. A qualifying aligned hit is allowed a maximum core shift of 6 base pairs (bp) and only one gap of any length inside it.
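A minimal sketch of the aligned-hit test, under stated assumptions: the alignment is given as two gapped strings of equal length with "-" for gaps, hits carry 0-based coordinates in the ungapped sequences, and "one gap of any length" is read as at most one run of gap characters across both rows within the columns spanned by the two sites. The `Hit` type and all function names are hypothetical.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    site_start: int  # 0-based start of the site in the ungapped sequence
    site_end: int    # end of the site (exclusive)
    core_start: int  # 0-based start of the matrix core


def column_map(aligned_seq):
    """Map each ungapped sequence index to its alignment column."""
    return [col for col, ch in enumerate(aligned_seq) if ch != "-"]


def is_aligned_hit(human_aln, mouse_aln, human_hit, mouse_hit,
                   max_core_shift=6, max_gap_runs=1):
    h_cols = column_map(human_aln)
    m_cols = column_map(mouse_aln)

    # Core shift: distance between the two cores in alignment columns.
    shift = abs(h_cols[human_hit.core_start] - m_cols[mouse_hit.core_start])
    if shift > max_core_shift:
        return False

    # Count runs of gap characters in both rows over the columns
    # covered by either site.
    first = min(h_cols[human_hit.site_start], m_cols[mouse_hit.site_start])
    last = max(h_cols[human_hit.site_end - 1], m_cols[mouse_hit.site_end - 1])
    gap_runs = sum(len(re.findall(r"-+", row[first:last + 1]))
                   for row in (human_aln, mouse_aln))
    return gap_runs <= max_gap_runs
```

Working in alignment columns rather than sequence coordinates keeps the shift measurement independent of how many gaps precede the site in either sequence.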
The sequence conservation between the two species of a DNA region spanning a transcription factor binding site is assessed using a novel strategy that identifies the maximal percent identity for the DNA fragment surrounding the core of a binding site by allowing a dynamic shift. Only predicted binding sites located in sequence fragments conserved between the two species at over 80% identity across a 24 bp window are selected for further consideration. We call these hits conserved hits.
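One way to read the dynamic shift is as a sliding window: every 24-column window that fully contains the core is scored, and the best percent identity is kept. The sketch below follows that reading; it is an assumption about the procedure, not the rVISTA source. Windows are measured in alignment columns, a gap column counts as a mismatch, and the function names are hypothetical.

```python
def window_identity(human_win, mouse_win):
    """Percent of columns where both rows carry the same base."""
    matches = sum(1 for h, m in zip(human_win, mouse_win)
                  if h == m and h != "-")
    return 100.0 * matches / len(human_win)


def max_identity_around_core(human_aln, mouse_aln, core_start, core_end,
                             window=24):
    """Best identity over all windows containing columns [core_start, core_end)."""
    first = max(0, core_end - window)                # leftmost start covering the core
    last = min(core_start, len(human_aln) - window)  # rightmost start covering the core
    best = 0.0
    for start in range(first, last + 1):
        end = start + window
        best = max(best, window_identity(human_aln[start:end],
                                         mouse_aln[start:end]))
    return best  # stays 0.0 if the alignment is shorter than one window


def is_conserved_hit(human_aln, mouse_aln, core_start, core_end,
                     window=24, min_identity=80.0):
    best = max_identity_around_core(human_aln, mouse_aln,
                                    core_start, core_end, window)
    return best > min_identity  # strictly over 80%, as stated in the text
```

Taking the maximum over shifted windows means a site at the edge of a conserved block is still retained, provided at least one window around its core exceeds the threshold.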
Base sequence annotation determines the genomic location of each predicted transcription factor hit. The sites located within exons or UTRs are classified as coding and the rest as noncoding. Since our focus is on noncoding regulatory sequences, rVISTA output includes only those contained in intergenic intervals. If the user does not provide an annotation, all predicted conserved hits are included in the output of the program.
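The annotation filter can be sketched as an interval check. The assumptions here are that hits and annotated exon/UTR features are half-open `(start, end)` tuples in base-sequence coordinates, and that a hit counts as coding if it overlaps an annotated feature at all; the text says "located within", so a stricter containment test may be intended. The function name is hypothetical.

```python
def filter_noncoding(hits, coding_intervals=None):
    """Keep hits that do not overlap any exon or UTR interval.

    With no annotation (None), every hit is returned unchanged,
    matching the behaviour described for unannotated sequences.
    """
    if coding_intervals is None:
        return list(hits)

    def overlaps_coding(hit):
        hit_start, hit_end = hit
        return any(hit_start < end and start < hit_end
                   for start, end in coding_intervals)

    return [hit for hit in hits if not overlaps_coding(hit)]
```

For example, with one annotated exon at `(200, 300)`, a hit at `(205, 215)` is dropped while hits at `(100, 110)` and `(400, 412)` are kept.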
The visualization program for rVISTA displays the predicted binding sites selected by the user, both individually and in various combinations, which allows the clustering of sites from different families to be investigated.