An electronic version of these instructions can be found at http://hazelton.lbl.gov/vista/vistaexercises.shtml. In the electronic version, you can click on the web links discussed here to open them.
All exercises start at the VISTA portal: http://genome.lbl.gov/vista/
These tutorials will familiarize you with most features of the VISTA Portal for Comparative Genomics, including
- VISTA Browser
- Phylogenetic Shadowing for close species comparisons (includes mVISTA and gVISTA)
These exercises can be done in any order, but it is best to do the VISTA browser first.
Step-by-step keys can be found after the exercises.
Vista Browser allows users to interactively visualize a variety of whole genome alignments and quickly identify highly conserved regions.
Find the maximum percent conservation identity for which all of the exons on the LDL Receptor gene are conserved between Human (March 2006 assembly), Mouse and Dog. Retrieve the coordinates of the conserved regions.
Hint: the RefSeq name of the LDL Receptor gene is LDLR. Right-click on the curves to change parameters. You can select each curve and use the (Details) button to open the Text Browser and get detailed information regarding the alignment.
Answer: All the exons are conserved at 57% minimum conservation identity.
Identify the human coordinates of the non-coding regions in the HOXA3 gene that are conserved between Human (March 2006 assembly) and Chicken. Find the coordinates of the chicken genomic interval that aligns to human HOXA3.
Hint: Use the Text Browser and the longest isoform
Answer: chr7:27116905-27117116; chr7:27117899-27118079; chr7:27122799-27122980; chr7:27124456-27124592; chr7:27130228-27130336; chr7:27130411-27130552; chr7:27130744-27130927
The chicken genomic intervals are chr2:32,523,807-32,532,839 (+) and chr2:32,533,785-32,542,508 (+).
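When comparing your own results with the answer above, it can help to turn the coordinate strings into numbers. The following is a minimal Python sketch, not part of the VISTA tools: the function names `parse_region` and `region_length` are hypothetical, and the length calculation assumes that both end coordinates are inclusive.

```python
import re

# Matches strings such as "chr7:27116905-27117116" or "chr2:32,523,807-32,532,839".
REGION_RE = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_region(text):
    """Return (chromosome, start, end) from a 'chr:start-end' string."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    chrom, start, end = match.groups()
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


def region_length(text):
    """Length in bp, assuming both coordinates are inclusive."""
    _, start, end = parse_region(text)
    return end - start + 1


# The human coordinates listed in the answer above.
answer = (
    "chr7:27116905-27117116; chr7:27117899-27118079; chr7:27122799-27122980; "
    "chr7:27124456-27124592; chr7:27130228-27130336; chr7:27130411-27130552; "
    "chr7:27130744-27130927"
)
for item in answer.split(";"):
    print(item.strip(), region_length(item), "bp")
```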
Bonus: Try also adding the Fugu alignment and adjusting Human/Mouse parameters so that the conserved regions match those of Human/Fugu.
rVISTA is a tool that predicts transcription factor binding sites by combining a search of the Transfac database (Full Edition) with comparative sequence analysis.
We will now perform rVISTA analysis on the HOXA3 alignment to find predicted transcription factor binding sites.
Note that this is not the only way to use rVISTA – in addition to using it through this page, you can submit to rVISTA directly by going to the main VISTA site and submitting an existing alignment, or you can align two sequences with the main Vista program (mVISTA) and automatically submit to rVISTA from there.
Please remember that rVISTA has a 20Kb limit on the length of aligned sequences. If the sequence you want to analyze is larger than 20Kb, zoom in on the sequence until you have an interval smaller than 20Kb.
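As a quick check before submitting, the length limit can be expressed in a few lines of Python. This is only an illustrative sketch: it takes "20Kb" to mean 20,000 bp, assumes inclusive coordinates, and the helper names `fits_rvista` and `zoom_to_limit` are hypothetical. In the browser itself you zoom interactively.

```python
RVISTA_MAX_BP = 20_000  # the 20 Kb limit quoted above, taken here as 20,000 bp


def fits_rvista(start, end, limit=RVISTA_MAX_BP):
    """True if the inclusive interval start..end is shorter than the limit."""
    return (end - start + 1) < limit


def zoom_to_limit(start, end, limit=RVISTA_MAX_BP):
    """Shrink an interval around its midpoint until it is shorter than the limit."""
    if fits_rvista(start, end, limit):
        return start, end
    mid = (start + end) // 2
    half = (limit - 1) // 2
    new_start = mid - half
    new_end = new_start + limit - 2  # resulting length is limit - 1
    return new_start, new_end


print(zoom_to_limit(1, 30_000))
```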
Submit the human-chicken HOXA3 alignment containing the HOXA3 5′ UTR to rVISTA, and search for HOXA4 transcription factor binding sites. Find how many clusters of at least 3 conserved HOXA4 binding sites within a 100 bp window are present in this human-chicken alignment, and note their approximate location.
Hint: click on "view in alignment" in Visualization Options to find precise position information about the rVISTA predictions.
Answer: 2 clusters, at 650-750 bp and 6400-6500 bp on the human sequence.
Whole Genome rVISTA is designed to aid the analysis of gene expression studies by scanning the regulatory regions of genes exhibiting similar expression patterns. In the current implementation, a gene's regulatory region is defined as the sequence upstream of the transcription start site, up to 5 kb.
Find the 5 transcription factors which are most overrepresented in the 5 kb upstream of the transcription start site of the following mouse genes:
Runx1, Tpm2, Mthfr, Tpbg, Armcx2, Lox, Pdk3, Bcat1, Cdc25c, Dpysl3, Gatm, Tnc, Gpx7, Tfpi2, Adam12, Tubb3.
Which of the above genes are regulated by both of the top two overrepresented transcription factors?
Hint: Use "TFBS in Mouse February 2006 (mm8) assembly conserved in the alignment with the Human March 2006 (hg18) assembly".
Answer: TITF1, HIF1, ETS2, FOXP3, AP4. Runx1, Lox, Tpm2, Tpbg, Armcx2.
The VISTA Enhancer Browser is a central resource for experimentally validated human noncoding fragments with gene enhancer activity as assessed in transgenic mice. This is a continually growing resource that has tested 1091 noncoding sequences in transgenic mice as of August 2009.
How many enhancers have been experimentally verified to be expressed in the neural tube?
How many of them are conserved in human/mouse/rat at 100% identity over 200 base pairs or more (the "ULTRA" conservation criterion)? View the experimental data for a conserved region with a positive enhancer result.
Hint: use "Advanced Search"
Answer: as of August 2009, there are 105 neural tube enhancers. 23 are ultraconserved.
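The "ULTRA" criterion (100% identity over 200 base pairs or more) can be illustrated with a short Python sketch. This is not how the Enhancer Browser computes its annotation; it only shows the idea on rows of a multiple alignment that you supply yourself. The function name `identical_runs` is hypothetical, and gap characters are assumed to be written as "-".

```python
def identical_runs(rows, min_length=200):
    """Return (start, end) column ranges, 0-based with end exclusive,
    where every aligned row carries the same non-gap base for at least
    min_length consecutive columns."""
    if not rows:
        return []
    rows = [row.upper() for row in rows]
    n_columns = min(len(row) for row in rows)
    runs = []
    run_start = None
    for i, column in enumerate(zip(*rows)):
        same = column[0] != "-" and all(base == column[0] for base in column)
        if same and run_start is None:
            run_start = i
        elif not same and run_start is not None:
            if i - run_start >= min_length:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the last column.
    if run_start is not None and n_columns - run_start >= min_length:
        runs.append((run_start, n_columns))
    return runs


# Toy three-row alignment with a shortened minimum length for readability.
toy = ["ACGTTT", "ACGATT", "ACGTTT"]
print(identical_runs(toy, min_length=2))
```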
Phylogenetic Shadowing is a strategy for the comparative analysis of multiple closely related species such as primates. In this exercise, we will use mVISTA to generate a multiple alignment of the sequence of 6 primate species and RankVISTA to quantitatively predict conserved regions across all species. We will also use the UCSC Browser and GenBank to retrieve the sequences required for this analysis, and gVISTA to generate the annotation file for the human sequence.
Perform Phylogenetic Shadowing analysis of the alpha-globin cluster regulatory region using the following sequences (all these sequences are available for download at VISTA_WashU.shtml):
human: chr16:43811-160260, May 2004 assembly;
Rhesus monkey: chr20:2423-137222, Jan 2006 assembly;
Colobus: accession # AC148220;
Marmoset: accession # AC146591;
Owl monkey: accession # AC146782;
Dusky titi: accession # AC145465
How many conserved non-coding sequences are predicted with a RankVISTA cutoff of 1 (P=0.1)? What are their coordinates on the human sequence?
Answer: 6. 37,589-38,733; 41,630-42,616; 53,997-54,487; 58,970-61,477; 64,036-65,027; 65,444-69,915
Go to http://genome.lbl.gov/vista/. Click on the "Browser" link located in the light blue line at the top of the page. Make sure "Human March 2006" is selected in the base genome box, and enter "LDLR" in the position box (note that you can only enter a RefSeq gene name or a chromosome coordinate in the position box). A new window will open with several matches to this gene name. Inspect the list to find the LDL Receptor gene: in this case it is the first match. Click on it to load the human/mouse comparison in VISTA Browser.
Note: downloading the applet may take a while − be patient. If you experience any difficulties, ask one of the lab assistants to help you.
VISTA browser loads the human/mouse comparison by default. Identify the strand on which LDLR is transcribed, the coding exons and UTRs (they are marked on the annotation track above the curve, and colored according to the color legend in the lower left-hand corner). Are all the exons and UTRs conserved? No, 2 coding exons and 1 UTR are not conserved.
Try adding a second species evolutionarily closer to human, such as dog, to the alignment to improve the exon prediction. Select "Dog" from the second drop-down menu on the left ("select/add") to add the Human-Dog alignment. Accept the default values in the pop-up menu for the display parameters for now. These values can be changed at any time by using the (Curve Parameters) button.
Are all the exons (coding/UTR) predicted by the human/dog comparison? Yes.
Try adjusting the parameters of the human/mouse comparison to emulate the human/dog comparison by requiring a lower amount of conservation for a region to be considered conserved. To do this, click on the curve you want to modify and select the "Curve Parameters" button from the top menu. Alternatively, you can access "Curve Parameters" by right-clicking on the curve you want to adjust and selecting "Parameters" in the pop-up menu. A description of the parameters is available from the "Help" pages at http://pipeline.lbl.gov/vgb2help.shtml: navigate to 5.3, "Changing Curve Parameters". In this case, you will want to try lowering the "Cons Identity". Experiment with parameter values until all the exons are marked as conserved. Lowering the conservation to 57% identifies all coding exons/UTRs as conserved in the human/mouse comparison.
In the "position" box of the Vista Browser entry page, enter "HOXA3" and click "Go." Three matches will come up – double-click the last match (the other two matches are alternatively spliced forms of the gene, which cover only a part of the region we want). Identify the strand of the HOXA3 gene, its exons and UTRs. Since we are displaying a relatively short interval (can you tell from the display how long the interval is? Ans.: 21kb), we can change the visualization to display the human/mouse curve on only one row using the "# rows" from the left control panel.
When looking at a highly conserved gene such as this one, it is useful to gain some evolutionary distance in order to identify the most strongly conserved regions. Add the chicken alignment to the display (use the second drop-down menu on the left, or the button from the top menu and then click on the "Track" drop-down menu). Identify regions that are highly conserved in all three species (human, mouse and chicken).
You will notice that some of the highly conserved sequences are non-coding (pink-colored). Those areas might seem like good candidates for further analysis.
Click on the second (human-chicken) curve to select it. Now click on the button ("alignment details") in the toolbar at the top of the screen. A new browser window, called "Text Browser", will open with detailed information regarding the segment of the human-chicken alignment you were looking at.
In this window, you can see detailed information about the aligned regions, including their genomic coordinates. The coordinates of the Chicken region that aligned to human can be found in the second column. A detailed description of all the options available from the Text Browser can be found in the Help pages (see link at 1.4).
To retrieve the coordinates of regions conserved between human and chicken, click on the "Get CNS: human-chicken" link or on the "CNS: human-chicken" link found in the Alignment column. The legend for this table is in the top line. The coordinates of conserved non-coding sequences are those marked as "non-coding". Note that clicking on the links on this page will give you the sequences of the conserved regions, with retrieval options that facilitate the design of PCR primers for further studying these sequences.
You should still have the Text Browser window open for the Human-Chicken HOXA3 alignment (if not, bring up the alignment again in the browser and click the "alignment details" button).
To find the coordinates of the chicken genomic interval that aligns to human HOXA3, look for the "Location on chicken" column. In this case, 2 major chicken contigs align to the human interval.
From the Text Browser, select the human-chicken alignment that contains the HOXA3 gene 5′ UTR. If you are not certain which is the right alignment, go back to the Genome Browser and check the direction of transcription of the gene.
Click on the rVISTA link in the "Alignment" column. Enter your email as prompted. You have now started the rVISTA submission process. The default values filled in on the next screen are sufficient for our purposes; however, if you wish to learn about these options, a description is available at http://genome.lbl.gov/vista/rvista/instructions.shtml#options
Click on "Submit" to go to a list of possible Transcription Factor Binding sites (matrices) for which to check. There are a large number of matrices here; find the box labeled "HOXA4" and check it. Note that the program becomes slower the more Transcription Factor Binding sites you select. Click on "Submit". Within a few moments you should get an email with a link to a web page that contains your results.
The various visualization options are described at http://genome.lbl.gov/vista/rvista/instructions.shtml#out . Check the "conserved," "aligned," and "all" boxes in the "Binding sites to visualize" column. To identify conserved HOXA4 binding sites occurring in clusters of 3 or more, in the "Clustering" area select "Individual Clustering": sites=3, base pairs=100 (note that "Group Clustering" is not applicable in this case as we have only searched for 1 transcription factor). Click on "Submit" to look at the predicted transcription factor binding sites (shown as tick marks above a regular VISTA curve). The conserved predicted sites are shown in green. Only conserved predicted sites occurring in clusters of 3 or more in a 100 bp window are shown.
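The clustering setting used here (sites=3, base pairs=100) can be sketched in Python as follows. This is an illustration of the idea only, not rVISTA's implementation: it assumes that a group of sites qualifies when the first and last site are at most 100 bp apart, and that overlapping qualifying groups are merged into one cluster. The function name `find_clusters` and the site positions are hypothetical.

```python
def find_clusters(positions, min_sites=3, window=100):
    """Return (first_site, last_site, site_count) for each cluster.

    A group of min_sites consecutive sites qualifies when it spans at most
    `window` bp; overlapping qualifying groups are merged into one cluster.
    """
    sites = sorted(positions)
    clusters = []
    current = None  # [index of first site, index of last site]
    for i in range(len(sites) - min_sites + 1):
        j = i + min_sites - 1
        if sites[j] - sites[i] <= window:
            if current is not None and i <= current[1]:
                current[1] = j  # extend the cluster being grown
            else:
                if current is not None:
                    clusters.append(current)
                current = [i, j]
    if current is not None:
        clusters.append(current)
    return [(sites[a], sites[b], b - a + 1) for a, b in clusters]


# Hypothetical site positions, for illustration only (not rVISTA output).
example_sites = [120, 150, 190, 900, 2000, 2040, 2080, 2095]
print(find_clusters(example_sites))
```

Setting `min_sites=1` reproduces the "no clustering requirement" case, in which every site is reported.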
To appreciate the differences between the various visualization options ("conserved," "aligned," and "all"), remove the clustering requirement by selecting "Individual Clustering": sites=1, base pairs=100 in the Visualization Options area at the bottom of the page and resubmit. Inspect the new plot and note how moving from "all" to "aligned" to "conserved" reduces the number of site predictions.
On the main VISTA page (http://genome.lbl.gov/vista/), click on "Whole Genome rVISTA".
Click on "GO" next to "TFBS in Mouse February 2006 (mm8) assembly conserved in the alignment with the Human March 2006 (hg18) assembly".
In the pop-up menu determining the size of the upstream region scanned by Whole Genome rVISTA, select "5000" (this is the default value).
Input the gene names in the rectangular input field. Note that the names provided in this exercise are RefSeq names.
Select "I am submitting gene names" and click on "Submit".
The top table in the "Results" page shows all transcription factors that are overrepresented in the 5000 bp upstream of all the 16 genes submitted for this exercise at a p-value cutoff of 0.005. Note that not all the 16 genes need to have binding sites for a given transcription factor. The bottom table shows the transcription factors overrepresented upstream of each gene individually.
Click on the names of the top two overrepresented transcription factors to obtain, for each, the list of genes in which that factor is overrepresented. Compare the two lists to find the genes present in both.
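If the two gene lists are long, the comparison can be done with a few lines of Python instead of by eye. This is a minimal sketch: the function name `genes_in_both` is hypothetical, gene names are compared case-insensitively as a convenience, and the example lists are placeholders rather than Whole Genome rVISTA output.

```python
def genes_in_both(genes_for_tf1, genes_for_tf2):
    """Return the genes present in both lists, in the order of the first list."""
    second = {gene.lower() for gene in genes_for_tf2}
    return [gene for gene in genes_for_tf1 if gene.lower() in second]


# Placeholder gene names, for illustration only.
list_a = ["GeneA", "GeneB", "GeneC", "GeneD"]
list_b = ["geneD", "GeneB", "GeneE"]
print(genes_in_both(list_a, list_b))
```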
Go to the VISTA homepage http://genome.lbl.gov/vista/
Click on the Enhancer Browser link in the navigation bar at the top of the page.
Click on the "Advanced Search" link at the bottom of the page.
Click the checkbox next to neural tube as expression pattern.
Click the radio button for positives to be shown in the result.
Click the "Search" button to retrieve all tested enhancers, with positive enhancer data, with the given expression pattern.
The resulting table should contain at least 66 chromosomal regions positively tested. Remember that new elements are continuously tested.
To retrieve only ultraconserved elements, choose neural tube as the expression pattern together with conserved in ultra, and again show only positives. Click the "Search" button.
This time, at least 23 enhancer regions should be displayed in the resulting table.
Click the location link chr16:53780686-53781784 to retrieve information about the Id 26 ultraconserved enhancer element.
On the resulting page, examine the expression pattern description as text (near the top). View the experimental data.
All sequences required for this exercise can be downloaded from here
Preparation of the annotation file.
The annotation file can be prepared manually or using the program gVISTA. This program is used to align user-generated sequences to the human and other sequenced genomes. In addition to a pairwise conservation profile between the base (i.e. reference) genome and the user's sequence, gVISTA returns the annotation of the base genome. Submission of the human sequence to be used for phylogenetic shadowing will enable you to retrieve the annotation for that sequence.
- Upload from your computer to the gVISTA server the human sequence for your region of interest as a plain text file in FASTA format using the "Browse" button.
- Use "Human Genome" as the base genome. Enter your email address and a name for your project, and click on the "Submit query" button.
- After a few minutes, you will receive an email with a link to your results. On the results page, click on "Text Browser".
- On the results page, copy the genomic coordinates to which your sequence aligns on the human genome. These are: "chr16:43,811-178,231".
- Open a new browser window and open the entry page of the VISTA browser, available from the VISTA Home Page. Paste the genomic coordinates you just obtained in the "Position" box.
- Click on the "Alignment details" button to open the Text Browser.
- Click on the "Download RefSeq genes" link. If you cannot see the address bar (this depends on your browser and its configuration), then you need to go back and force the browser to open the link in a different window. You can normally do this by clicking on a link with the right mouse button and selecting "open in a new window". Note that the coordinates in the file obtained from the "Download RefSeq genes" link are relative to the sequence of the human genome and not to the sequence in your human sequence file.
- To express the coordinates relative to your region of interest, point your cursor to the address bar in your browser. It should look something like
- Position the cursor at the end of the URL (you may have to use the right arrow on your keyboard) and add
The address should now look something like this:
Hit "Enter". The RefSeq coordinates will now be expressed relative to the sequence you submitted.
- Save the resulting annotation file on your computer using the "File|Save as" menu. Make sure to select "Text files" in the "Save as type" drop-down menu.
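The conversion from genome coordinates to coordinates relative to the submitted sequence amounts to subtracting the start of the aligned region. The Python sketch below illustrates the arithmetic only; it assumes 1-based, inclusive coordinates, uses the region start from the gVISTA result quoted above (chr16:43,811-178,231), and the function names are hypothetical. The server performs this conversion for you in the step described above.

```python
REGION_START = 43_811   # start of chr16:43,811-178,231
REGION_END = 178_231    # end of the same region


def to_relative(genomic_pos, region_start=REGION_START):
    """Convert a 1-based genomic position to a 1-based position in the region."""
    return genomic_pos - region_start + 1


def interval_to_relative(start, end, region_start=REGION_START, region_end=REGION_END):
    """Convert an inclusive genomic interval; reject intervals outside the region."""
    if start < region_start or end > region_end or start > end:
        raise ValueError("interval is not inside the region")
    return to_relative(start, region_start), to_relative(end, region_start)


print(to_relative(REGION_START))  # first base of the region
print(to_relative(REGION_END))    # last base of the region
```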
Submission to mVISTA: Sequence data fields.
To obtain a multiple sequence alignment for your sequences, submit your 6 sequences (the human and 5 non-human primate sequences) to the mVISTA server. Sequences can be uploaded in FASTA format from a local computer using the "Browse" button or, if available in GenBank, they can be retrieved by entering the corresponding GenBank accession number in the "GENBANK identifier" field. Enter the human sequence as sequence #1.
Submission to mVISTA: Choice of alignment program.
Three genomic alignment programs are available in mVISTA. "LAGAN" is the only program that produces multiple alignments of finished sequences, and is the most appropriate choice for phylogenetic shadowing. Note that if some of the sequences are not ordered and oriented in a single sequence, your query will be redirected to AVID to obtain multiple pairwise alignments. "AVID" and "Shuffle-LAGAN" are not appropriate genomic aligners for phylogenetic shadowing as they produce only all-against-all pairwise alignments.
Submission to mVISTA: Additional options
- "Name": Select the names for your species that will be shown in the legend. It is advisable to use something meaningful, such as the name of an organism, the number of your experiment, or your database identifier. When using a GenBank identifier to input your sequence, it will be used by default as the name of the sequence.
- "Annotation": If a gene annotation of the sequence is available, you can submit it in a simple plain text format to be displayed on the plot. Although to display the annotation on the RankVISTA plot you need to submit the annotation file for one species only, usually human, annotation files for all other species can also be submitted.
- "RepeatMasker": Masking a base sequence will result in better alignment results. You can submit either masked or unmasked sequences. If you submit a masked sequence in which the repetitive elements are replaced by the letter "N", select the "one-celled/do not mask" option in the pull-down menu. mVISTA also accepts softmasked sequences, where repetitive elements are shown as lowercase letters while the rest of the sequence is shown in capital letters; in this case, select the "softmasked" option in the menu. If your sequences are unmasked, mVISTA will mask repeats with RepeatMasker: select "human/primate" in the drop-down menu. If you do not want your sequence to be masked, select "one-celled/do not mask".
- Leave the "Find potential transcription factor binding sites using rVISTA" and "Use translated anchoring in LAGAN/Shuffle-LAGAN" options unchecked.
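If you are unsure how a sequence file was masked, a simple heuristic can suggest which "RepeatMasker" menu option from the list above applies. The Python sketch below is an assumption-laden illustration, not part of mVISTA: it treats mixed upper and lower case as softmasking and any "N" as hard masking (although "N" can also mark unsequenced gaps), and the function names are hypothetical.

```python
def masking_style(sequence):
    """Classify a nucleotide string as 'softmasked', 'hardmasked' or 'unmasked'."""
    letters = [c for c in sequence if c.isalpha()]
    has_lower = any(c.islower() for c in letters)
    has_upper = any(c.isupper() for c in letters)
    if has_lower and has_upper:
        return "softmasked"
    if any(c in "Nn" for c in letters):
        return "hardmasked"
    return "unmasked"


def suggested_option(sequence):
    """Map the detected style to the menu option named in the text (primate data)."""
    return {
        "softmasked": "softmasked",
        "hardmasked": "one-celled/do not mask",
        "unmasked": "human/primate",
    }[masking_style(sequence)]


for seq in ("ACGTacgtACGT", "ACGTNNNNACGT", "ACGTACGT"):
    print(seq, masking_style(seq), suggested_option(seq))
```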
Submission to mVISTA: Parameters for RankVISTA
- The RankVISTA algorithm, used for the quantitative analysis of the multiple primate sequence comparisons, is run automatically on the alignment generated by mVISTA. The option "RankVISTA probability threshold (0 < p < 1)" tells the RankVISTA algorithm to ignore predictions with a p-value greater than that indicated in the box. The default setting of "0.5" means that all conserved sequences with a conservation p-value between 1 and 0.5 will not be reported.
- If you know the phylogenetic tree relating the species you are submitting, enter it at "Pairwise phylogenetic tree for the sequences"; otherwise LAGAN will calculate the tree automatically.
- Click on "Submit" to send the data to the mVISTA server. If the mVISTA server finds problems with the submitted files, you will receive a message stating the type of problem; if not, you will receive a message saying that submission was successful. Several minutes after submitting your sequences, you will receive an email from email@example.com containing your personal Web link to the location where you can access the results of your analysis.
Retrieval of the results.
Clicking on the link found in the body of the email takes you to the results page. It lists every organism you submitted, and provides you with three viewing options using each organism as base. These three options are:
- the "Text Browser", which provides all the detailed information - sequences, alignments, conserved sequence statistics, and RankVISTA results for multiple sequence comparisons. This is where you retrieve the coordinates of conserved regions predicted by phylogenetic shadowing;
- the "Vista Browser", an interactive visualization tool that can be used to dynamically browse the resulting alignments and view a graphical presentation of RankVISTA results;
- a PDF file, which is a static VISTA plot of all pair-wise alignments, and is not relevant to multiple primate comparisons.
It is important to note that while mVISTA shows the results of all pair-wise comparisons between one species chosen as the base (reference) sequence and all other submitted sequences, RankVISTA shows the result of the multiple (simultaneous) sequence comparisons of all submitted sequences and is independent of the choice of base sequence.
- The Text Browser brings you to the results of your analysis in text format. At the top of the page is a banner that displays the aligned organisms. The sequence listed in the darker header area is acting as base (the choice of the base sequence is irrelevant for RankVISTA analysis). This banner also lists the algorithm used to align your sequences. If you did not submit your own tree, you can click on "phylogenetic tree" to inspect the tree computed by MLAGAN and compare it with the one expected based on the known phylogeny of the species analyzed. Underneath is the navigation area, which shows the coordinates of the region currently displayed and offers a link to the Vista Browser (see below) and a link to a list of all conserved regions found. Following that is the main table, which lists each pairwise alignment that was generated for the base organism. Each row is a separate alignment. Each column, except the last one, refers to the sequences that were submitted for analysis.
- The last column (labeled "alignment") contains a link to the RankVISTA results and information pertaining to the whole alignment. It also provides links to alignments in human readable and MFA (multi-fasta alignment) formats, a list of conserved regions from this alignment alone, and links to pdf plots of this alignment alone.
RankVISTA Text Browser.
Clicking on the "RankVISTA" link in the alignment column takes you to a summary table of the RankVISTA analysis of multiple primate sequences. This is the primary result page for phylogenetic shadowing. The table shows the start and end coordinates, relative to the sequence of the organism chosen as the base, for all regions predicted to be conserved across all primate species analyzed, and the length of the conserved regions. The p-value column shows the probability of seeing that level of conservation by chance in a neutrally-evolving 10-kb segment of the base sequence, thus enabling the user to rank conserved sequences on the basis of their likelihood of conservation. The last column indicates whether the conserved sequence is coding or non-coding, and is based on the annotation file submitted to mVISTA.
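Once you have copied the coordinates and p-values from this table, ranking and thresholding them is straightforward. The Python sketch below is illustrative only: the region records are hypothetical placeholders, the function names are not part of RankVISTA, and the score is taken to be -log10(p), which matches the exercise's statement that a cutoff of 1 corresponds to P=0.1.

```python
import math


def rank_score(p_value):
    """Return -log10(p): 1 for p = 0.1, 2 for p = 0.01, and so on."""
    return -math.log10(p_value)


def filter_and_rank(regions, threshold=0.5):
    """Keep regions with p <= threshold, most significant first."""
    kept = [region for region in regions if region["p"] <= threshold]
    return sorted(kept, key=lambda region: region["p"])


# Hypothetical records, for illustration only (not RankVISTA output).
regions = [
    {"start": 100, "end": 300, "p": 0.7},
    {"start": 500, "end": 900, "p": 0.001},
    {"start": 1200, "end": 1500, "p": 0.1},
]
for region in filter_and_rank(regions):
    print(region["start"], region["end"], round(rank_score(region["p"]), 2))
```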
Vista Browser is an interactive Java applet designed to visualize pairwise and multiple alignments using the mVISTA (default) and RankVISTA scoring schemes, and to identify regions of high conservation across multiple species.
- Clicking on the "Vista Browser" link will launch the applet with the corresponding organism selected as base. The VISTA Browser main page displays all pairwise comparisons between the base sequence and all other submitted sequences using the mVISTA scoring scheme, which measures conservation based on the number of identical nucleotides (% conservation) in a 100bp window. Multiple pair-wise alignments sharing the same base sequence can be displayed simultaneously, one under another. The plots are numbered, so that you can identify each plot in the list underneath the VISTA panel. The many additional features of the browser are described in detail in the online help pages, accessed by clicking on the "help" button in the top left corner of the browser.
- To access the RankVISTA plot, click on the "1 more organism" drop-down menu found in the left panel of the browser ("Control Panel") and select "RankVISTA". The "Add curve" pop-up dialog prompts the user to set parameters for the RankVISTA plot. The "minimum Y" and "maximum Y" parameters set the range of displayed p-values, on a logarithmic scale. The default values of "0" and "5" instruct the server to display conserved regions with p-values ranging from 10^0 (= 1) to 10^-5 (note that the default "RankVISTA probability threshold" set at the mVISTA submission stage instructs RankVISTA to cut off predictions at that given p-value). Both parameters can be adjusted later after displaying the plot. The resulting RankVISTA plot is displayed in the Genome Browser below the pairwise comparisons.
- The position of the plot can be reordered by selecting RankVISTA in the list underneath the VISTA panel and clicking on the "up" arrow. Conserved sequence regions predicted by RankVISTA are colored according to their annotation, with light blue regions corresponding to annotated exons and pink regions to non-coding sequences. Note that RankVISTA coloring is based on exon annotations of all aligned sequences (if annotation files were submitted for more than one species), not just the one currently used as the base. Consequently, an unannotated region in the base sequence might still be colored as an exon because of annotations from other sequences. In another deviation from the standard scheme, RankVISTA colors UTRs and coding exons the same, since they are treated identically by the underlying algorithm. The width of a predicted conserved region corresponds to its length, while the height corresponds to its conservation p-value.
- To adjust the RankVISTA plot parameters, first select the plot by clicking on it. By clicking on the "Curve parameters" button, you can modify the y-axis bounds.
- By clicking on the "Alignment details" button, you can quickly shift to the Text Browser and retrieve the coordinates of conserved sequences.
- To print the plot, click the "Print" button. The first time you do this, you will get a dialog box to confirm that you indeed requested that something be sent to the printer. This is a security measure in Java intended to handicap malicious code. Click "yes". A standard printing dialog box will appear. Proceed as you would with any other printing job.
- To save the plot, click the "save as" button. In the menu that will appear, select the file type you want, adjust parameters such as image width if desired, and press "ok". If you have pop-up blocking software such as the Google toolbar or a later version of IE browser, you may need to hold down the CTRL key while clicking the OK button.
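The mVISTA scoring scheme mentioned in the list above (percentage of identical nucleotides in a 100 bp window) can be sketched in Python. This is a simplified illustration, not the VISTA implementation: it slides the window over alignment columns rather than over positions of the base sequence, counts a gap as a mismatch, and the function name `window_identity` is hypothetical.

```python
def window_identity(aligned_a, aligned_b, window=100):
    """Percent identity for every window position over two aligned rows.

    Both rows must have the same length (gaps written as '-'). Returns one
    value per window start, or an empty list if the alignment is shorter
    than the window.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    if len(matches) < window:
        return []
    total = sum(matches[:window])
    scores = [100.0 * total / window]
    for i in range(window, len(matches)):
        # Slide the window by one column: add the new column, drop the oldest.
        total += matches[i] - matches[i - window]
        scores.append(100.0 * total / window)
    return scores


# Toy alignment with a 4-column window for readability.
print(window_identity("ACGTACGT", "ACGTACGA", window=4))
```

A region would then be called conserved where the score stays at or above the chosen "Cons Identity" value for the required length.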