Seminars:MexicoCity2008
From MonkeyWiki
On July 29, 2008, I'll be conducting a seminar on ancestral inference using population genetics at a phylogenetics workshop being held in Mexico City, 2008. I thought this would be a good opportunity to set up this wiki, first of all so that my lecture notes would be available to the participants, but also as a simple means of distributing exercises for the practical section of my seminar. If it really works out well, then maybe participants could also create accounts on the wiki and contribute results from the practical section!
External links to Wikipedia are to pages in Spanish (EspaƱol).
Contents |
Lecture slides
Follow the link below, and use the "Back to MonkeyWiki" link to bring your browser back to this page. Each slide has navigation buttons directly beneath: use the right arrow to advance forward, and left arrow to go to the previous slide. Use the "home" button to go back to the first slide.
External (non-wiki) link: MexicoCity2008 slides
Practice
Installing BEAST
If your computer does not have BEAST installed, follow this external link to download the application: BEAST Homepage. If the link is unresponsive (I've had trouble accessing it directly), try again or follow the link from the Oxford Evolutionary Group webpage.
Basics
BEAST is a Java application for analyzing genetic sequences using phylogenetic and coalescent methods. BEAST is distributed with several support applications, including BEAUti and Tracer. Because BEAST is essentially a non-interactive pipeline that takes an XML-formatted file as input, and outputs a tab-separated plain text file containing a log of the MCMC run.
Creating XML with BEAUti
Although it is technically possible to write an XML file for a BEAST analysis by yourself, in practice it is very inconvenient to do so. BEAUti is a Java application that provides a graphical user interface for reading NEXUS-formatted files and setting model parameters.
To start, we need to have our sequences in a NEXUS format. However, this is not always the case! There are many formats for multiple sequence alignments. To make matters worse, NEXUS is a very poorly supported standard; in other words, a NEXUS file generated by one application is not guaranteed to be read properly by another. I've found that it is actually quite difficult to convert an alignment from another format (e.g., FASTA) to a NEXUS file that can be read by BEAUti.
My solution is to use another application written by one of the BEAST developers (Andrew Rambaut) called Se-Al, which can be downloaded here. What's nice about Se-Al is that it is a manual alignment editor that can import and export alignments in several standard formats, so it fits in nicely with a typical workflow --- one usually has to inspect their alignment before importing it into an analytical package such as BEAST.
Example of a NEXUS-formatted file:
#NEXUS Begin data; Dimensions ntax=7 nchar=1434; Format datatype=nucleotide gap=-; Matrix CanineOralPV ATGGCAAGGAAAAGACGCGCAGCCCCTCAAGATATATACCCTGCTTGTAAAGTGTCCAACAC FelisPV1 ATGCTTAGGCAAAAACGTGCAGCCCCAAAAGATATTTACCCACAATGCAAGATATCCAACAC LynxPV1 ATGCTACGGCGAAAACGTGCAGCCCCCCATGATATCTACCCCCAATGCAAAATCTCCAACAC PumaPV1 ATGCTTAGGCGAAAACGTGCAGCCCCCAAAGATATTTACCCCCAATGCAAAATCTCCAACAC RacoonPV1 ATGACTCGCAAACGCCGCGCCGCTCCTCGTGATATATACCCCTCTTGCAAACTGGCAAATAC AsianLionPV1 ATGCTAAGGCGAAAACGTGCAGCCCCCTCAGATATCTACCCCCAATGCAAAATTTCCAATAC SnowLeopardPV1 ATGCTAAGGCGAAAACGTGCAGCCCCTTCTGATATTTACCCACAATGCAAAATTTCCAATAC ; End;
This file would contain 7 sequences (ntax) that I've truncated to a shorter length --- otherwise each one would comprise 1434 nucleotide characters as specified by 'nchar', far too long to be rendered properly in your browser!
For now, we'll use one of the example files provided by the BEAST developers:
This alignment contains partial genome sequences from five feline papilloma viruses (FPV) in addition to two related viruses isolated from a raccoon and a dog.
When you have downloaded the alignment to your computer, open BEAUti. By default, it will open a new window.
(Note, the program interfaces depicted in these images may differ from how they will be rendered on your computer because of platform-specific issues.)
Opening a NEXUS file
Click on the File drop-down menu and select Import NEXUS... to open a directory browser window.
Navigate through the file hierarchy until you find the FPV.nex file that you downloaded from this wiki, and open it. The BEAUti main panel should now look like this:
Going left to right, the columns contain the following information extracted from the NEXUS file:
- Name --- Unique "taxon" names associated with each sequence.
- Date --- Displays the dates assigned to each sequence, e.g., virus sequences sampled from the same infected patient on different days; ancient DNA labeled with radiocarbon years. These values may be entered manually by double-clicking on the relevant entry. They can also be entered automatically using the Guess Dates button on the main panel, which will open a dialog window in which you can specify a set of rules for extracting dates from sequence labels (more later).
- Height --- This numerical value quantifies the age of each sequence relative to the youngest sequence in the data set, i.e., the sequence that is closest to the current time.
- Sequence --- Displays the actual genetic sequence extracted from the NEXUS file. Unfortunately, these are rendered using a variable-width serif font, which can make it difficult to determine whether the sequences are aligned properly --- especially when they are long. This difficulty arises because the '-' character conventionally used to represent gaps in a sequence is smaller in width than most nucleotide or amino acid characters, so that some sequences will end up being longer than others in terms of numbers of pixels on the screen. (To see this for yourself, scroll to the far right of the main panel contents.) Hence, it's a good idea to be certain that your sequences are properly aligned BEFORE importing them into BEAUti.
Exercise 1
- In the BEAUti main panel displaying the contents imported from FPV.nex, reset the DATE entry for one of the sequences to 1000 years. Note what happens to the height of the other sequences. Set another sequence to DATE = 500 years. Note what happens to its height and explain why.
- Click on the Translation drop-down menu and select a genetic code. What happens to the Sequence entries for each sequence?
- Manually edit the FPV.nex file. Add some arbitrary numbers to each taxon label as follows: "CanineOralPV" -> "CanineOralPV_135" (it doesn't have to be 135, it can be any number you want!). Let's pretend that this number corresponds to the millions of years before present (mya) that the sequence was sampled from ancient DNA. Save this edited file under a new file name to avoid overwriting the original file. Open this file in a new window in BEAUti and click on the Guess Dates button. Click the radio button labeled Defined by its order and select last. This will select the last numerical field in the taxon label; in other words, it will read in the label "CanineOralPV_135" and extract the last occurrence of a contiguous series of digits, i.e., 135. Hit 'OK'. What happens in the main panel? Click on the second drop-down menu following the label Dates specified as and select Before the present. What happens to the Height entries, and why?
Set up taxon subsets
Close the window containing the sequences imported from the NEXUS file that you modified with bogus dates, and keep the original FPV.nex window open. Click on the menu tab labeled Taxa. This will replace the contents of the main panel (to go back, just select the Data tab) with something like the following, except that all the fields should initially be empty:
Click on the plus sign (+) at the lower-left corner of this window. This will create a new taxon subset object that is displayed in the first column. A taxon subset will tell BEAST to keep track of the most recent common ancestor (MRCA) for sequences contained in this subset. BEAST can also enforce a prior distribution on divergence times that involve this taxon subset, and enforce monophyly as well. You can edit the name of this new taxon subset by double-clicking on the corresponding field in this window. To assign taxa to this subset, highlight the desired taxa in the 'Excluded Taxa window on the left, and click on the green right-arrow button to transfer them to the Included Taxa window on the right. (To select multiple taxa at once, use Command-click on a Mac.)
Let's create the following subsets (name them whatever you want, but keep track of who's who):
- FelisPV1, LynxPV1, PumaPV1
- SnowLeopardPV1, AsianLionPV1
- FelisPV1, LynxPV1, PumaPV1, SnowLeopardPV1, AsianLionPV1
Specify an evolutionary model
Click on the next tab labeled Model. Here, you can specify a nucleotide or amino acid substitution model whose parameters will be estimated from your sequence data. It is important to note that model specification in BEAUti is limited compared to the scope of models that can be handled by BEAST. We have the following options:
- Substitution Model --- Only 2 models can be specified in BEAUti. In order to use other substitution models, we will need to directly edit the XML file generated by BEAUti before executing it with BEAST.
- HKY85 --- Hasegawa-Kishino-Yano model extends Felsenstein's (1981) model F81 by allowing a transition-transversion rate bias modeled by a single parameter.
- GTR --- General Time Reversible models use the maximum number of model parameters (6) in a symmetric rate matrix (such that the rate of going from A to G is the same as going from G to A). Usually not a good idea if you have limited or invariant sequence data.
- Base frequencies --- As you'll have learned by now, many evolutionary models are specified by a rate matrix whose entries include the character frequencies.
- In practice, we usually use the Empirical base frequencies, i.e., the frequencies we observe in the alignment, as a crude estimate of this stationary distribution.
- However, in some cases you may want to Estimate the parameters of this stationary distribution (i.e., the number of characters minus 1) by fitting a phylogenetic model to the data.
- Finally, you can simply set the frequencies to be All Equal, as simpler models assume (e.g., the Kimura 2-parameter model K2P).
- Site Heterogeneity Model --- Options for modeling rate variation across sites.
- None --- Constant rates across sites.
- Gamma --- Use a 1-parameter gamma distribution to model rate variation, with parameter alpha to be estimated.
- Invariant Sites --- Assumes that some unknown proportion of sites (estimated by a single parameter, p) remain fixed over evolutionary time.
- Gamma + Invariant Sites --- Combines the Gamma and Invariant Sites models for a total of 2 additional parameters, i.e., L(1-p) variant sites vary according to a discretized 1-parameter gamma distribution.
- Number of gamma categories --- The gamma probability density function is a continuous function, and hence computationally inefficient. Rather than integrating the area under the probability density curve between points a and b, Ziheng Yang proposed using a discretized version of the gamma distribution that split the area under the curve into N equiprobable intervals (gamma categories). Each interval would be summarized by its local mean. In practice, 3 or 4 gamma categories are usually a good enough approximation --- BEAST limits the number of categories within the range 4-10.
- Partition into codon positions --- Each of the three positions in a codon may evolve at a different rate because of the nature of genetic codes. For example, most nucleotide substitutions at the third position will not modify the amino acid encoded. ** Using 2 partitions will lump positions 1 and 2 together, and obtain a separate set of parameter estimates for the 3rd positions of the nucleotide alignment.
- Using 3 partitions will obtain independent estimates for each codon position. You may want to avoid analyzing multiple partitions if the sequences in your alignment are short or not sufficiently variable.
- Not all model parameters need to be estimated independently across partitions. For example, if you only want to estimate partition-specific base frequencies but enforce the same rates of substitution and rate heterogeneity parameters, then check the box labeled Unlink base frequencies across codon positions. A special case of linking/unlinking 2 partitions in this manner can be specified by clicking on Use SRD06 Model, which implements settings proposed by Shapiro, Rambaut, and Drummond (2006).
- Fix mean substitution rate --- In lieu of using a tree calibrated by fossil record data, you may want to use a mutation rate estimate when the mutational process in your organism is fairly well-characterized (e.g., some viruses). This constant corresponds to the &mu scaling constant discussed in the lecture portion. For example, setting this to 1 results in a conventional phylogenetic tree in which branch lengths are measured in units of expected substitutions per site (mutation rate and chronological time are confounded). Warning: Using this setting stipulates that the quantity is known WITHOUT ERROR. If you're not tremendously confident about your estimate of the mutation rate, try specifying a prior distribution on &mu (more later).
- Molecular Clock Model
- Strict Clock --- Enforces a constant rate of evolution (scaling constant &mu) across all branches of the tree.
- Relaxed Clocks --- Allows a more realistic model in which the rate of evolution varies over time. The rate for each branch is an independent and identically-distributed random outcome drawn from an underlying Exponential (1 parameter) or Log-Normal (2 parameters) distribution. The BEAST developers recommend using the latter distribution, as the estimate of the parameter
ucld.stdevgives you some idea of how clock-like your data is (close to 0.0 is clock-like; greater than 1.0 means substantial rate heterogeneity).
That's a lot of options! And just think, this is a small fraction of the range of models that can be specified in BEAST...
Priors
Click on the Priors tab. This panel allows you to specify the prior distributions for each parameter in your model. These are roughly grouped into 2 categories: tree priors, which relate to scaling time by effective population size; and model priors, which relate to mutation rate scaling.
Tree Priors
Various coalescent models can be used to predict the genealogical process going backwards in time. BEAST implements a number of parametric demographic functions within the coalescent model:
- Coalescent models:
- Constant size
- Exponential growth --- given present-day population size N0, size decays exponentially at rate r going backwards in time.
- Logistic growth --- initial growth is exponential until size approaches carrying capacity K.
- Expansion growth --- ancestral population at constant size until it invades an unoccupied niche, at which time growth is exponential to present-day size N0.
- Bayesian skyline --- group time intervals into N bins using Bayesian model selection procedures.
Each formulation comes with its own set of prior distributions that will appear in the window listing priors below. For example, the constant population size coalescent model needs only one prior distribution, namely the prior over size.
When dealing with species-level trees, population-level coalescent models are not appropriate. In such cases, use the Yule tree prior, which models the generation of lineages as a Markov birth-death process with constant rate of speciation.
UPGMA
Unweighted Pair Group Method with Arithmetic mean is a distance-based method for constructing a phylogeny from a sequence alignment. Although UPGMA has been much maligned as inaccurate, it is a convenient means of generating an initial tree that is "good enough", especially when there are many model parameters and constraints making execution time-consuming.
Specifying priors
A prior distribution is a representation of our knowledge about a model parameter before we incorporate the additional information provided by our data. Although this is a useful (and honest!) framework, it obligates us to be careful about how we select our priors. An inappropriate choice of prior could inadvertently influence the posterior distribution inferred from our data.
BEAST implements 6 different prior distributions:
- Uniform --- Prior information consists of an upper and lower bound to the distribution, e.g., "It couldn't POSSIBLY be more than THAT!"
- Exponential
- Normal --- Our uncertainty about the parameter follows a Gaussian distribution with a given mean and variance.
- Log-normal --- Same, but on a log scale. Most often used to calibrate node ages,
clock.rate, andconstant.popSizeparameters. - Gamma --- A popular choice for continuous-valued parameters bounded from below at 0.0.
- Jeffreys --- An uninformative prior with nice theoretical properties.
Operators
This panel provides access to tuning parameters for the MCMC analysis. Because of time constraints and my utter lack of experience in fiddling with these settings, we won't cover this topic.
MCMC
The MCMC panel lists several arguments that delineate the MCMC analysis executed by BEAST:
- Length of chain --- The number of generations that the Markov chain will be executed for. This includes steps that are discarded under Metropolis-Hastings sampling as well as those that are accepted, so the realized length of the chain will not vary from run to run.
- Echo state to screen every --- In computer parlance, 'echo' refers to writing text to the screen. It is an important tool that enables the user to monitor what the program is doing at consistent intervals while it is being run, rather than watching in frustration as the machine does apparently nothing useful. The value to which you set this argument will be the number of steps elapsed until BEAST will echo its current state to the main window. You do not this value to be too small, or else the screen will be deluged in a flood of text and you will slow down your analysis considerably! Unless your data set is very small and your chain extremely long, 1000 is probably a good default value.
- Log parameters every --- The same thing as above, except that instead of 'echoing' the current state of the MCMC run to the screen, BEAST will write it to a file. Because it is computationally less expensive to write to the hard disk than it is to render text on the screen, there is less hazard of slowing down your machine by setting this argument to a small value. On the other hand, you will also create a gigantic text file on your hard drive and possibly crash your computer! Again, the default value of 1000 is probably appropriate. Take into account the length of your chain and the amount of autocorrelation in the chain (see below).
- Log file name --- the file that BEAST will write log output to. Each line of the log file will consist of parameter values sampled at that generation in the Markov chain. This file will appear in the same directory as the BEAST XML file is written by BEAUti.
- Trees file name --- contains trees sampled from the posterior distribution, updated with the same frequency as log. You have the option of writing trees to file with branch lengths in units of the expected number of substitutions, which may be easier to compare to trees generated using conventional phylogenetic methods. If you check the box for this option, you will need to provide a third file name to which these trees will be written.
- Sample from prior only --- This isn't a well-documented feature of BEAST, but my interpretation is that checking this box will cause BEAST to ignore the alignment and generate a sample of parameters and trees from the prior distribution, which may be a good way of evaluating your prior assumptions.
Exercise 2
- Download the RSVA.nex data set. This alignment consists of 35 sequences of the G attachment protein for human respiratory syncytial virus subgroup A, isolated from various sites around the world at times ranging from 1956-2002. See reference ZLVV04.
- Translate the nucleotide sequences. What do you notice that may cause problems for subsequent analyses?
- Inspect the sequence identifiers. Can you guess what part contains information about the year of isolation? Import the NEXUS file into BEAUti and use this information to automate date entry. (HINT: you will need to compensate for some missing information by clicking on the boxes Add the following value and ...unless less than; keep the default values).
- Set the model to GTR with empirical base frequencies, no site heterogeneity, and 3 partitions on codon positions with respect to rate heterogeneity (in HyPhy, we call this a "comb"). Use a strict molecular clock for now. Use the default settings for priors. Set the MCMC chain to run for 150,000 steps echoing every 1,000 steps, and write to file every 150 steps.
- Export the XML file and open it with a text editor. Locate the section in the XML file that specifies the partitioning of the alignment by codon position. BEAUti tallies the number of "unique patterns" in the alignment for each codon position; is this consistent with what you observed in step 2?
- Let's manually modify the XML file so that BEAST will fit a TN93 (010020) nucleotide substitution model instead of GTR. Locate the
<gtrModel>tag that delimits the XML block specifying the substitution model. Note that the rate C-T is absent. This omission occurs because rates are defined as being relative, and BEAST happens to use the transition rate between C and T as a reference. In order to implement the TN93 model, you need to constrain the transversion ratesgtr.at,gtr.cgandgtr.gtto be equal togtr.ac. This can be accomplished by replacing the line:
- <parameter id="gtr.at" value="1.0" lower="1.0E-8" upper="Infinity"/>
- with
- <parameter idref="gtr.ac"/>
- and applying the same replacement for the <rateCG> and <rateGT> blocks. Important: You'll also need to remove sub-blocks containing the
gtr.at,gtr.cgandgtr.gtparameters from wherever else they occur in the file.
- and applying the same replacement for the <rateCG> and <rateGT> blocks. Important: You'll also need to remove sub-blocks containing the
Running with the BEAST
When you open the application BEAST, a file browser window will immediately appear. Locate the XML file that you generated for Exercise 2 and open it. BEAST will immediately parse the contents of the XML file and if it does not find any errors in syntax, executes the MCMC analysis based on the data and settings in your file. However, if we have manually edited the XML file instead of pipelining from BEAUti directly into BEAST, then there is always some chance of error.
Debugging XML
Here are some examples of error messages produced by BEAST upon failing to parse an XML file.
| Error message | Cause | How to fix it |
|---|---|---|
SEVERE: Parsing error - poorly formed XML (possibly not an XML file): | I attempted to open a file whose contents were completely unrelated to XML. | Make sure the file that you attempt to open in BEAST is an XML-formatted file. Open the file in a text editor to be absolutely certain. |
[Fatal Error] :340:30: Element type "parameter" must be followed by either attribute specifications, ">" or "/>". | I deleted a '>' from a closing '/>' tag on line 340 in the XML file. | BEAST tells us where to look for the error by providing the line number and character position on that line in the colon-delimited tuple :340:30:. In this case, the missing '>' was the 27th character on line 340 after three tabs. |
Exception in thread "AWT-EventQueue-0" java.lang.RuntimeException: A ConsoleApplication cannot do an Open command | This occurs when I attempt to open another file in BEAST under Mac OS X 10.5. | No idea :-/ This is probably a platform-specific issue related to how Java is implemented in Mac OS X 10.5. |
SEVERE: Parsing error - poorly formed BEAST file, RSVA_TN93.xml: Object with idref=gtr.c has not been previously declared. | I replaced 'gtr.ac' with 'gtr.c' in the <gtrModel> specification block. | Unfortunately, the error message does not tell us specifically where to look. We just need to be careful about our spelling when defining model parameters! |
Tree likelihood is zero. | This occurs when the likelihood of the starting tree is 0 or less than the smallest positive floating point number that can be represented by double precision, what we call an "underflow". Underflow usually occurs when one or more of the initial values assigned to model parameters, including the tree itself, violates assumptions stipulated by the prior distributions. Alternatively, the initial tree may simply occupy a very unfavorable position on the posterior probability surface; this is increasingly likely with larger sample sizes (more sequences) and a greater number of model parameters. | Examine the initial values assigned to parameters in the XML file. Look for obvious violations of prior assumptions, e.g., an initial value below the lower bound of a prior distribution. Try playing around with the initial values. Try performing the analysis for a small random sample of the sequences, and use the parameter estimates from this run to assign initial values for the complete analysis. Try using UPGMA to generate the starting tree instead of coalescent simulation (default setting). |
During an MCMC run
After a series of messages relaying the settings extracted from the XML file, BEAST begins the MCMC analysis. The first thing that you will see is something like this:
Pre-burnin (1500 states) 0 25 50 75 100 |--------------|--------------|--------------|--------------| *************
As "pre-burnin" suggests, these first iterations of the Markov chain are immediately discarded and not reported to the log file. BEAST is attempting to climb the posterior probability surface to regions with greater support given the data.
This is immediately followed with a long series of lines echoed to the BEAST main panel:
state Posterior Prior Likelihood Root Height Rate 0 -3,977.1448 -296.3786 -3,680.7662 629.669 1.32572E-4 1000 -3,549.4650 -257.1323 -3,292.3328 208.979 3.02816E-4 2000 -3,401.9006 -221.7817 -3,180.1189 114.144 5.85529E-4 3000 -3,366.1948 -204.4371 -3,161.7576 100.762 8.76853E-4
State refers to the step number in the chain following a pre-burnin interval. Posterior is the log-transformed posterior probability of the chain at that step. Prior is the log prior probability of the model at that step before data. Likelihood is the log likelihood of the data given the model at that step. Because posterior probability is simply the product of prior probability and likelihood, under a log transformation it should be the sum. (Test this out for yourself!) Root Height refers to the time to the most recent common ancestor (MRCA) of the sample genealogy relative to the most recent sequence. Rate refers to the mutation rate scaling factor; under a strict molecular clock, it refers to the clock rate. Although many more parameter values may be written out to the log file, these six quantities will always be echoed to the main panel during an MCMC analysis.
After an MCMC run
When the MCMC run is complete, BEAST provides an analysis on the MCMC operator settings. This is useful information to diagnose whether we have used appropriate settings to obtain convergence and good effective sample size. For example, if we set the proposal function to generate infinitesimally small modifications of a substitution rate parameter, then it would take a very long time for the chain to converge to a stationary marginal distribution with respect to this parameter. On the other hand, if it was set too high, then the proposal function would keep generating outrageous values to assign to this parameter with a very low probability of being accepted under the Metropolis-Hastings sampling process. This probability is reported under the heading Pr(accept).
Here is an abbreviated example of an operator analysis generated after an MCMC run on the RSVA XML file.
Operator analysis Operator Pr(accept) Performance suggestion gtr.ac 0.514 0.2829 gtr.ag 0.556 0.3347 Try setting scaleFactor to about 0.4323
Always helpful, BEAST suggests that we might obtain better performance if we reset the scaling factor for A-G substitutions in the GTR (now TN93) model to 0.4323, rather than its current setting at 0.75. The second column, which is unlabeled, reports the final tuning value associated with the model parameter. Here, we can see that BEAST tried to dampen the range of values proposed for gtr.ag from its initial setting of 0.75 to 0.556.
Analyzing BEAST logs with Tracer
Try opening the log file generated by BEAST in a text editor. That's a lot of numbers! Not only are there more model parameters being reported per step in the chain sample than we saw echoed to the main panel during the run, but there are a lot more steps under the settings we used. Clearly we need some help in making sense of this log.
Let's open the log file using Tracer. You should see something like this:
Actually, the default setting for Tracer is to display the distribution in mean values of the chain sample with respect to whatever parameter is highlighted in the left panel. I just selected Trace to have the actual chain displayed in the window instead of this distribution. Note that the trace doesn't begin to be displayed until after 15000 steps into the chain. Tracer is omitting the first 10% of steps in the chain sample that it extracted from our log file, effectively discarding this initial portion of the chain as an initial burnin period. Even so, the values immediately following this burnin are at the lower end of the overall distribution, indicating that we might have benefited from an even longer burnin period. On the whole, however, our chain looks rather good --- we're looking for no obvious autocorrelation in the trace (i.e., periodic 'waves') and no monotonic increase or decrease either (which again would suggest insufficient burnin).
Switch back to the Estimates display. Your window should now resemble the following:
The histogram in the bottom half of this panel essentially collapses the chain that was displayed in the previous panel. In other words, the chain occupied posterior probability values ranging from -3280 to -3270 for the majority of steps in the sample. (Click back to the Trace display to confirm this for yourself.)
In the upper half of this panel, Tracer reports a number of summary statistics with respect to posterior probability (assuming that Posterior probability is still highlighted in the Traces panel to the left). Many of these should be self-explanatory, i.e., mean. But what is HPD? HPD is an abbreviation for Highest Posterior Density. If our analysis is working correctly, then the Metropolis-Hastings sampling process should be occupying regions of relatively high posterior probability for the majority of iterations. The upper and lower 95% quantiles simply give us a rough idea of the confidence interval in our estimate of the mean.
Auto-correlation time estimates the number of iterations that would have to elapse between samples in order to obtain independent samples. Auto-correlation occurs because when the Markov chain takes a step, it typically doesn't step very far in parameter space. As a result, the parameter values and posterior probabilities will never be very different between adjacent steps. Indeed, we might as well ignore steps in the immediate vicinity of a given sample. This concept leads to the Effective Sample Size (ESS), which gives a rough estimate of the number of independent steps that we would sample from a perfectly-mixing chain.
Exercise 3
- Examine the distribution in clock rate estimates. What is the mean clock rate? How do we interpret this value? (Hint: What are its units?) What are the upper, lower bounds in HPD?
- What is the mean estimate for time to the most recent common ancestor of the respiratory syncitial virus A sequences in our sample? What are the units of this estimate?
- What is the estimated constant effective population size? What might this imply about the virus? (Hint: see References:ZLVV04)
- Select all three siteModelN.mu traces in the Traces panel and the Marginal Density tab. Interpret the multiple plot. (Hint: Insert a plot legend by selecting a position in the Legend drop-down menu.) Remember the issue that arose when we were generating this XML file from NEXUS data. Do these plots make sense in light of this?
- Examine the parameter estimates for gtr.ac and gtr.ag. Do these values make sense in the context of the TN93 model?
- Run TreeAnnotator, which is another Java application that is distributed with BEAST. Use a burnin value of 100 and median node heights. For the Input tree file, select the *.tree file generated by BEAST from analyzing the RSVA sequences. Create a file with an arbitrary name for Output tree file and view this file after TreeAnnotator is finished using yet another application called FigTree. Make sure that the Node Bars box is checked. You should see something pretty neat!
Bayesian skyline plots with BEAST
Exercise 4
In the following exercise, we will learn how to use BEAST and its associated applications in order to generate a skyline plot using Bayesian MCMC methods (DRSP05).
- Download the data set Flu.nex. This NEXUS file contains 21 H5N1 Influenza A virus (IAV) sequences sampled from birds, swine, and humans in Southeast Asia during the period 1997-2005.
- Open this file in BEAUti and use the Guess Dates function to extract the year of isolation from sequence names.
- Select any options that you'd like (though you'll probably want to keep things pretty simple) in the Model panel.
- In the Priors panel, select the Coalescent: Bayesian skyline option in the Tree Prior drop-down menu, and reduce the Number of Groups from the default value of 10 to 5. Set the number of steps in the chain to 1,000,000.
- Execute the XML file using BEAST. This analysis should take roughly 10-15 minutes to complete. Have a coffee!
- When BEAST is finished, open the .log file in Tracer and select the option Bayesian Skyline Reconstruction from the Analysis drop-down menu. A new panel will appear; for Trees log file, select the *.trees file that was generated by BEAST during your analysis. Keep the default values and click OK.
External Links
- http://beast.bio.ed.ac.uk/Tutorials --- A wiki maintained by the BEAST developers with tutorials, exercises, and FAQs.
- http://groups.google.com/group/beast-users --- A Google Groups mailing list for BEAST users. A good resource for those interested in customizing their analysis or who are running into problems using BEAST.








