HSCI324
From MonkeyWiki
This assignment is due on Thursday, March 3rd (2011), the first lecture after the mid-term exam. Please submit answers to all questions that are highlighted like this:
Contents |
Part 1: Coalescent analysis using BEAST
BEAST (Bayesian Evolutionary Analysis by Sampling Trees) is an important software package that allows you to analyze sequence data using the coalescent model in a Bayesian framework. BEAST is written in Java, so it can be executed on most computing platforms, i.e., Linux, Mac, Windows.
Downloading BEAST
If you are not using a computer at the Blusson Hall undergraduate computer lab then you will need to download and install a copy of BEAST on your own computer. You can obtain the latest version from the BEAST homepage. On a Mac, this will mount a disk image from which you can copy the entire contents to a folder on your Desktop or wherever you prefer. These contents will include (among many other things), two main applications:
- BEAST - the program that performs the analysis
- BEAUTi - a utility program that converts a file containing your genetic sequences into an XML file that contains instructions for the BEAST analysis
Complete mtDNA dataset
Download the file complete_mtDNAs_Americas.nex either from WebCT (it can be found in the Course Materials folder labelled "Computer Lab Materials") or directly from here. This file contains 26 complete human mitochondrial DNA (mtDNA) genomes from individuals in North America. This is an example of the NEXUS file format, which is a widely-used format in phylogenetics and bioinformatics. Open this file in a text editor. You should see the following:
#NEXUS [ Generated by HYPHY 2.0020101022beta(MP) for MacOS(Universal Binary) on Mon Feb 7 00:48:42 2011 ] BEGIN TAXA; DIMENSIONS NTAX = 26; TAXLABELS 'gi|13272850|gb|AF346984.1| Homo sapiens mitochondrion, complete genome'
The #NEXUS tag tells the computer program that the file it is reading will conform to the NEXUS file format standard. (Unfortunately, NEXUS is also one of the most abused file formats in phylogenetics and bioinformatics. No program uses NEXUS in a consistent way. Efforts have been made to enforce a standard, but it hasn't made much ground.)
A NEXUS file is organized into "blocks". Each block starts with a BEGIN tag and ends with an END tag. Comments are enclosed within square brackets. Glance over the file just to get an idea of what a NEXUS file contains and what it looks like. Go ahead and close the file.
Using BEAUTi
Open BEAUTi by double-clicking on the software icon.This will spawn a window (displayed below):
Click on the plus-sign (+) at the bottom-left corner of this window. This will spawn another window that will let you browse through your hard drive to find a file to open in BEAUTi. Locate and open complete_mtDNAs_Americas.nex. Now a row should have appeared in the main field of the default window, beginning with "complete_mtDNA..." under the column heading "Partition Name". If you do not see this, then you might have navigated away from the Data Partitions view. Check to make sure that the Data Partitions button/tab at the top of the default window is highlighted. This is where we load sequence data, as we have just done. We can also load multiple data sets; for example, if we have sequenced multiple chromosomes from the same sample of people, but we want to allow each chromosomes to have its own evolutionary history (because chromosomes are unlinked).
We are not going to change any of the settings in this window, so go ahead and click over to the Clock Models view by clicking on the button with that phrase at the top of the default window. You should now see something like this:
Note that you should see a "1" instead of the "1.7E-8" value that is listed under Rate. You need to change your BEAUTi settings to this value by double-clicking on the number and typing:
1.7E-8
Warning! The text field where you would edit this number collapsed to a very thin line in my version of BEAUTi. This is a bug that I will let the developers know about. If this happens to you, then just type in the text in the above box blind and hit ENTER. You will be able to see if you made a mistake when the value is displayed after editing.
The reason we do this is because the mtDNA genome sequences have all been sampled at the same time. Well, they were probably sampled months apart, but that's negligibly small on the time scale of human evolution. That means that we have no way of calibrating the molecular clock, i.e., estimating a constant mutation rate U. Instead, we are going to use a previous estimate of U as being equal to 1.7 times 10 to the power of negative 8, or 0.000000017. This is the estimate of the per-nucleotide mutation rate per year that was used in the original paper.
Click over to the Trees view. You will see a drop-down menu labelled Tree Prior. This enables us to fit different coalescent models to the data. For example, we have a good reason to believe that there has been exponential growth in the past, then that is going to influence the effective population size and in turn influence the coalescent. For this exercise, however, let's just leave it at the default setting of Coalescent: Constant Size, because that's the basic coalescent model that we've learnt about in class.
Click over to the "MCMC" view. This is where we are going to tell BEAST (through BEAUTi) how many steps to run the Markov chain for (under Length of chain). The default value (10000000) is usually too short for real data sets, but because we are looking at a pretty small data set and a simple model we can get away with this. In fact, we can make it even shorter. Let's change this number to 1000000, i.e., a million steps. This analysis will get done in about a minute.
Now we're almost done! Modify the File name stem to something that you will remember. This is the name that BEAST will use to write various output files, sticking some extension (like *.log) to the end. Click on the Generate BEAST file... button at the bottom-right corner of the MCMC view. BEAUTi will warn you about unchanged default priors. We're just going to ignore that; the default priors are uninformative (basically they say that we are coming in with no specific prior belief) and good enough for our exercise. So hit Continue.
Now we just have to tell BEAUTi where to write the XML file. The XML file will contain all the sequence data that was in the NEXUS file, plus a bunch of other instructions for BEAST. Choose some location that you will be able to find later.
Using BEAST
Double-click on the BEAST software icon to launch BEAST. It looks very similar to the BEAUTi icon only that it does not feature the "light-switch" panel in the background. Two windows will open. We first want to deal with the window that is shown in the inset to the right. Click on the Choose File button next to the label BEAST XML File. That will open a file browser window where you need to find the XML file that you just created using BEAUTi. Select that file and hit Open to load it into BEAST. That is all we need to do with this window, so dismiss it by hitting Run. Now it will close and all the action will unfold in the other text-based window.
Now a ton of text is going to zoom by in this second window. You should start seeing rows of numbers scroll up. This is displaying the population genetics parameters (like effective population size, time to the most recent common ancestor, and all that good stuff) being sampled by the MCMC process. It takes about 40 seconds to run this on my laptop, so you won't have to wait long. If it does not run, then it may be because you or someone else has tried running an XML file already and there is a *.log file sitting in the same folder. That log file contains a lot of important results from a BEAST analysis and this is a fail-safe that prevents you from overwriting your previous work. If this is the case then the text window will display the following error message:
The log file foobar.log already exists in the working directory.
Find that file and delete it.
Otherwise, everything should have gone fine and we are ready to quit BEAST and go to Tracer.
Using Tracer
Tracer is yet another software package written in Java that displays results from a BEAST analysis. BEAST writes many rows of numbers into what is called a log file. You will have set the name of the log file back when you were preparing the XML file with BEAUTi. Find it! Of course, it's difficult to understand what is going on from so many numbers. It would be nice to plot those numbers out and get an idea of what our Bayesian sample looks like. That's where Tracer comes in.
Double-click on its icon to launch. This will spawn a window that is composed of three major elements. First, in the top-left corner, there is a list that will contain all the open log files. You can either click-and-drag the log file to this field, or click on the plus-sign (+) button to open a browser window and select the log file that way.
Once you've loaded the log file, a lot of things will happen. The trace field (Traces) in the bottom-right corner will get populated with a list of all the population genetic parameters that have been recorded in the MCMC sample that's stored in your log file. Also, the main viewer field that takes up the right half of the window will display some numbers and a graph. By default, this will be a histogram of the posterior probability trace. However, the example that I'm displaying below has flicked the graph over from the Estimates tab to the Trace tab, which is much more useful for scrutinizing our sample of the posterior. Click on the Trace tab so that the viewer now displays a graph that looks like the figure below:
Let's take a step back and think about what this means. BEAST works by starting at a particular coalescent tree and then using some tree rearrangement to get to a different tree. Then it calculates the posterior probability of that tree (that's the prior probability times the likelihood of the tree given the sequence data). If the new posterior is higher, then it automatically takes that step to the new tree. If it's not higher, then it will still take a step with some probability x where x is equal to the ratio of the new posterior over the old posterior. This is how BEAST generates a random sample of trees from the posterior distribution, kind of like how Monte Carlo integration calculates the area under a curve by scattering points at random and seeing which ones are below the curve.
You should see that the first few posterior probability values sampled by BEAST are relatively low (note that Tracer is displaying the log of posterior, and because these are almost always much less than 1 but greater than 0, the log is a large negative value). This phase is called the burn-in. The idea is to run BEAST long enough that we get past the burn-in period. This means that the analysis has had enough time to wander around and find the good patches of parameter space. We are looking for a nice jagged but level sample of posteriors for a long interval. The sample, or "trace", displayed in the window above is a pretty good example of what we are after.
Now click on treeModel.rootHeight in the parameter table (lower-left corner), and click on the tab at the top labeled Estimates. You should now see something like this:
This parameter measures how far back in time we need to go until we get to the root of the coalescent tree, i.e., the most recent common ancestor (MRCA) of our sample. This is measured in units of years because we set the mutation rate to a value expressed as substitutions per site per year. We are looking for a nice bell-shaped distribution, which is usually a good indication that our analysis has gotten past the burn-in phase, i.e., it has "burned in".
Click on the parameter table row constant.popSize. This is the parameter that represents the effective population size of the coalescent.
Whew! That was a lot of text to read through, but you should only have had to click on a few buttons to do a very sophisticated analysis of human mtDNA genomes!
Part 2: Phylogenetics analysis with HyPhy
Now we're going to do a little phylogenetics. Download the file called trim5alpha.seq, either from WebCT Course Materials in the Computer Lab Materials folder, or directly here. This contains 18 sequences of a gene called TRIM5alpha that encodes a post-entry retroviral restriction factor - in other words, something that recognizes that an RNA virus genome has gotten into the cell and cuts it up into smithereens. This innate immunity gene has been associated with natural immunity against HIV infection in macaques. The sequences in this data set are taken from 18 different primates including a human sequence.
WARNING: This exercise was prepared on a Mac. One of the major differences with HyPhy in Windows is that each window has its own menu bar. You may need to click around to different windows to find the right menu option. The console window is usually a safe bet.
Open the file in a text editor and have a look. This is a sequence format that is known as the MEGA format and is also the one used by default by HyPhy. On the line before each nucleotide sequence, there is a line that contains identifying information that starts with the "hash" (#) character. Pretty simple. Go ahead and close the file, and fire up HyPHy by double-clicking on its icon. (Note that we've recently changed the icon, so you may see either of the two icons on the right.)
The first time you fire up HyPhy, you will get a "welcome" image - actually, you will always get this image unless you click on the Don't show again check box. Just click OK to get to the HYPHY Console window (shown below). If you clicked on something else in the welcome window by accident and got taken to some menu, just cancel so you get back to the default HYPHY Console window.
In fact, we're not going to do anything with the console window - other than displaying output, it's not used for much else unless you're an advanced user. What we DO want to do is click on the File menu, mouse down to the Open option and then over to the Open Data File... item in the sub-menu that appears. Click on Open Data File.... (Oh, and note that I do all my development and work on Macs, so what you see in these screen captures will be different than what you see on a PC.)
A file browser window will open. Find the file trim5alpha.seq on your hard drive and open it. This will cause a data panel to open that displays the sequences in this file, something like the screen capture below (although your sequences will not be in solid colours; that's an option that I won't go into). You can scroll around the sequence alignments by click-and-dragging the open orange box in the big scroll bar at the top of the data panel. Have a look around!
There's a lot of stuff that you can do with the data panel. You can set up analyses, select a range or group of sites to look at, and many more. But we're going to take another approach to analyzing these data. So go over to the Analysis menu in the menu bar, click it to open the menu and mouse down to the Standard Analyses... option:
This will open a new window that displays a whole bunch of ways to analyze your data. Actually, HyPhy is essentially unlimited in the kinds of analyses that you can do, but this is a selection of ones that most people will want to do. Double-click on Phylogeny Reconstruction to open the sub-menu and then double-click on NeighborJoining.bf.
Neighbor joining
Okay, we're going to build a tree of primates using the TRIM5alpha sequences by the neighbor joining method. You want to choose the following options in the short chain of menus that are going to pop up. If you get it wrong, don't worry, just cancel and start over. The option menu title will be in bold and the option to choose will be given after that in normal text.
- Distance computation - Distance formulae
- Data type - Nucleotide/protein
- A browser window will open. Choose the file trim5alpha.seq. Yes, I realize that it's right there in that open data panel you just created, but we often work with multiple data sets so HyPhy has to be dumb and ask for which file you want every time.
- Negative branch lengths - Force Zero
- Nucleotide based distance formula - TN93 (remember these?)
That's it. The NJ analysis should be really fast. In the console window, there should now be a message displayed that says:
***********Save this tree to a file (y/n)?
In the text field at the bottom of the console window, type "y" and hit ENTER. This will open up a file browser window. Save the file.
Let's have a look at the tree we just made! Follow the same procedure you used to open the sequences in a data panel, only instead of Open Data File..., you need to select Open Tree File... Select the file you just saved the tree to and open it. A tree display panel will open showing you the tree. When you are done answering the question below, go ahead and close the tree display panel.
Tree reconstruction by maximum likelihood
Now we are going to use a maximum likelihood method to reconstruct the phylogeny. Go back to the Standard Analyses menu and instead of double-clicking on NeighborJoining.bf, choose SequentialAddition.bf. Remember that this is a method that starts with a basic tree containing 3 sequences, and then starts to add sequences to the tree by placing branches in such a way that maximizes the likelihood of the tree at each step. Enter the following options:
- Data type - Nucleotide/Protein
- A browser window will open. Choose the file trim5alpha.seq.
- Addition order - Random order
- Topology constraint - No constrant
- Branch swapping - No swapping
- Choose one of the standard models - HKY85
- Model options - Global
- Starting 3 taxa tree - Random
- Global parameters - Estimate always
A tree display panel will open up automatically and show you the tree growing as HyPhy tries adding sequences and calculating the likelihood of each new tree given the sequence data. This is kind of fun to watch :-) It will also take a few minutes.
Detecting selection
Okay, now we're in the area of phylogenetics where HyPhy can really flex its muscles. HyPhy was never designed to estimate phylogenies like we just did, but it's really useful to be able to do that in a pinch. No, HyPhy is made for testing hypotheses and that means fitting models of evolution to the data (which includes the alignment and a tree). Open up the Standard Analyses menu and this time open the sub-menu labelled Positive Selection and double-click on TestBranchDNDS.bf. This analysis is going to test the hypothesis that the rate of protein evolution accelerated in the ancestral lineage separating the Old World monkeys (including Saimiri sciureus and Ateles geoffrei) from the New World monkeys (including ourselves).
Enter the following options:
- Choose Genetic Code - Universal
- A file browser window will open ("Select a codon data file"). Choose the file trim5alpha.seq.
- Site-to-site rate variation model - dN only
- The console window will display the message "How many rate classes should there be" - Type in the number "4" into the bottom of the console window and hit ENTER.
- Nucleotide Substitution Model - Default
- Amino Acid Class Model - Default
- A file browser window will open again ("Please select a tree file for the data:"). Choose one of the tree files you generated in the previous sections. This example has been tested using the neighbor-joining tree only.
- Choose the branch to test - Choose the internal branch (Node##) separating the Old and New world monkeys. In my NJ tree this was Node11. (See the screen capture below for some help in choosing the right branch.)
- Parameters to test - There is only one option: "1"
Now HyPhy will fit two evolutionary models to the tree. One assumes that the rates of protein evolution are the same across the entire tree. The second assumes that the rates of protein evolution are different in the branch separating the Old World from the New World monkeys. Because these models are nested, it uses the likelihood ratio test to see if there is a statistically significant improvement in likelihood with the addition of parameters in the alternative model.
You're done! Congrats on making it to the end. Believe me, this was a lot more time-consuming for me to prepare this lab than it was for you to complete it. :-)















