DNAMLK -- DNA Maximum Likelihood program with molecular clock

This program implements the maximum likelihood method for DNA sequences under the constraint that the trees estimated must be consistent with a molecular clock. The molecular clock is the assumption that the tips of the tree are all equidistant, in branch length, from its root. This program is fairly slow, and can be expensive to run. The program is indirectly related to DNAML. Its algorithmic details are not yet published, but details of the algorithm of a previous version of DNAML are described in my paper in Journal of Molecular Evolution (1981a). For some uses of this program and further developments, the papers of Masami Hasegawa are particularly instructive (Hasegawa and Yano, 1984a, 1984b; Hasegawa et. al. 1985a, 1985b). As with DNAML, the model of base substitutions used in this program is more general than the model described in the 1981a paper, as it allows transition and transversion rates to be expected to be unequal.

The assumptions of the model are:

  1. Each site in the sequence evolves independently.
  2. Different lineages evolve independently.
  3. There is a molecular clock.
  4. Each site undergoes substitution at an expected rate which is chosen from a series of rates (each with a probability of occurrence) which we specify.
  5. All relevant sites are included in the sequence, not just those that have changed or those that are "phylogenetically informative".
  6. A substitution consists of one of two sorts of events:
    1. The first kind of event consists of the replacement of the existing base by a base drawn from a pool of purines or a pool of pyrimidines (depending on whether the base being replaced was a purine or a pyrimidine). It can lead either to no change or to a transition.
    2. The second kind of event consists of the replacement of the existing base by a base drawn at random from a pool of bases at known frequencies, independently of the identity of the base which is being replaced. This could lead either to a no change, to a transition or to a transversion.

The ratio of the two purines in the purine replacement pool is the same as their ratio in the overall pool, and similarly for the pyrimidines.

The ratios of transitions to transversions can be set by the user. The substitution process can be diagrammed as follows: Suppose that you specified A, C, G, and T base frequencies of 0.24, 0.28, 0.27, and 0.21.

First kind of event:


   1.  Determine whether the existing base is a purine or a pyrimidine.

   2.  Draw from the proper pool:

      Purine pool:                Pyrimidine pool:

     I               I            I               I
     I   0.4706 A    I            I   0.5714 C    I
     I   0.5294 G    I            I   0.4286 T    I
     I (ratio is     I            I (ratio is     I
     I  0.24 : 0.27) I            I  0.28 : 0.21) I
     -----------------            -----------------


Second kind of event:

   Draw from the overall pool:

              I                  I
              I      0.24 A      I
              I      0.28 C      I
              I      0.27 G      I
              I      0.21 T      I
              I                  I
               ------------------

Note that if the existing base is, say, an A, the first kind of event has a 0.4706 probability of "replacing" it by another A. The second kind of event has a 0.24 chance of replacing it by another A. This rather disconcerting model is used because it has nice mathematical properties that make likelihood calculations far easier. A closely similar, but not precisely identical model having different rates of transitions and transversions has been used by Hasegawa et. al. (1985b). The transition probability formulas for the current model were given (with my permission) by Kishino and Hasegawa (1989).

Note the assumption that we are looking at all sites, including those that have not changed at all. It is important not to restrict attention to some sites based on whether or not they have changed; doing that would bias branch lengths by making them too long, and that in turn would cause the method to misinterpret the meaning of those sites that had changed.

An important new development in the 3.5 release is the Hidden Markov Chain method of inferring different rates of evolution at different sites. This wil l be described in a future paper by me and Gary Churchill, of Cornell University. It allows us to specify to the program that there will be a number of different possible evolutionary rates, what the prior probabilities of occurrence of each is, and what the average length of a patch of sites all having the same rate. The program then computes the likelihood by summing it over all possible assignments of rates to sites, weighting each by its prior probability of occurrence.

For example, if we have used the C and R options (described below) to specify that there are three possible rates of evolution, 1.0, 2.4, and 0.0, that the prior probabilities of a site having these rates are 0.4, 0.3, and 0.3, and that the average patch length (number of consecutive sites with the same rate) is 2.0, the program will sum the likelihood over all possibilities, but giving less weight to those that (say) assign all sites to rate 2.4, or that fail to have consecutive sites that have the same rate.

This feature effectively removes the artificial assumption that all sites have the same rate, and also means that we need not know in advance the identities of the sites that have a particular rate of evolution.

INPUT FORMAT AND OPTIONS

Subject to these assumptions, the program is a correct maximum likelihood method. The input is fairly standard, with one addition. As usual the first line of the file gives the number of species and the number of sites. There follow the character W if the Weights options is being used.

Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats described in the Molecular Sequence Programs document. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.

After that are the lines (if any) containing the information for the C, and W options, as described below.

The options are selected using an interactive menu. The menu looks like this:



Nucleic acid sequence
   Maximum Likelihood method with molecular clock, version 3.5c

Settings for this run:
  U                 Search for best tree?  Yes
  T        Transition/transversion ratio:  2.0
  F       Use empirical base frequencies?  Yes
  C   One category of substitution rates?  Yes
  G                Global rearrangements?  No
  J   Randomize input order of sequences?  No. Use input order
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
The user either types "Y" (followed, of course, by a carriage-return) if the settings shown are to be accepted, or the letter or digit corresponding to an option that is to be changed.

The options U, J, O, M, and 0 are the usual ones. They are described in the main documentation file of this package. Option I is the same as in other molecular sequence programs and is described in the documentation file for the sequence programs.

The T option in this program does not stand for Threshold, but instead is the Transition/transversion option. The user is prompted for a real number greater than 0.0, as the expected ratio of transitions to transversions. Note that this is not the ratio of the first to the second kinds of events, but the resulting expected ratio of transitions to transversions. The exact relationship between these two quantities depends on the frequencies in the base pools. The default value of the T parameter if you do not use the T option is 2.0.

The F (Frequencies) option is one which may save users much time. If you want to use the empirical frequencies of the bases, observed in the input sequences, as the base frequencies, you simply use the default setting of the F option. These empirical frequencies are not really the maximum likelihood estimates of the base frequencies, but they will often be close to those values (what they are is maximum likelihood estimates under a "star" or "explosion" phylogeny). If you change the setting of the F option you will be prompted for the frequencies of the four bases. These must add to 1 and are to be typed on one line separated by blanks, not commas.

The C (categories) option allows the user to specify how many categories of substitution rates there will be, and what are the rates and probabilities for each. The user first is asked how many categories there will be (for the moment there is an upper limit of 5, which should not be very restrictive). Then the program asks for the rates for each category. These rates are only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4 have the exact same effect as rates 2.0, 4.0, and 4.8. Note that a category can have rate of change 0, so that this allows us to take into account that there may be a category of sites that are invariant. Note that the run time of the program will be proportional to the number of rate categories: twice as many categories means twice as long a run. Finally the program will ask for the probabilities of a random site falling into each of these categories. These probabilities must be nonnegative and sum to 1. Default for the program is one category, with rate 1.0 and probability 1.0 (actually the rate does not matter in that case).

If more than one category is specified, then another option, R, becomes visible in the menu. This allows us to specify that we want to assume that sites that have the same rate category are expected to be clustered. The program asks for the value of the average patch length. This is an expected length of patches that have the same rate. If it is 1, the rates of successive sites will be independent. If it is, say, 10.25, then the chance of change to a new rate will be 1/10.25 after every site. However the "new rate" is randomly drawn from the mix of rates, and hence could even be the same. So the actual observed length of patches with the same rate will be somewhat larger than 10.25. Note below that if you choose multiple patches, there will be an estimate in the output file as to which combination of rate categories contributed most to the likelihood.

With the current options C and R the program has gained greatly in its ability to infer different rates at different sites and estimate phylogenies under a more realistic model. Note that Likelihood Ratio Tests can be used to test whether one combination of rates is significantly better than another.

The G (global search) option causes, after the last species is added to the tree, each possible group to be removed and re-added. This improves the result, since the position of every species is reconsidered. It approximately triples the run-time of the program.

If the U (user tree) option is chosen another option appears in the menu, the L option. If it is selected, it signals the program that it should take any branch lengths that are in the user tree and simply evaluate the likelihood of that tree, without further altering those branch lengths. This means that if some branches have lengths and others do not, the program will estimate the lengths of those that do not have lengths given in the user tree. Note that the program RETREE can be used to add and remove lengths from a tree. This means that we can test hypothesis such as that a certain branch has zero length, and by doing a series of runs with different specified lengths for a branch we can plot a likelihood curve for its branch length while allowing all other branches to adjust their lengths to it. If all branches have lengths specified, none of them will be iterated. This is useful to allow a tree produced by another method to have its likelihood evaluated. The L option has no effect and does not appear in the menu if the U option is not used.

The W (Weights) option is invoked in the usual way, with only weights 0 and 1 allowed. It selects a set of sites to be analyzed, ignoring the others. The sites selected are those with weight 1. If the W option is not invoked, all sites are analyzed.

OUTPUT FORMAT

The output starts by giving the number of species, the number of sites, and the base frequencies for A, C, G, and T that have been specified. It then prints out the transition/transversion ratio that was specified or used by default. It also uses the base frequencies to compute the actual transition/transversion ratio implied by the parameter.

If the C (Categories) option is used a table of the relative rates of expected substitution at each category of sites is printed, as well as the probabilities of each of those rates.

There then follow the data sequences, with the base sequences printed in groups of ten bases along the lines of the Genbank and EMBL formats. The trees found are printed as a rooted tree topology. The internal nodes are numbered arbitrarily for the sake of identification. The number of trees evaluated so far and the log likelihood of the tree are also given. The branch lengths in the diagram are roughly proportional to the estimated branch lengths, except that very short branches are printed out at least three characters in length so that the connections can be seen.

A table is printed showing the length of each tree segment, and the time (in units of expected nucleotide substitutions per site) of each fork in the tree, measured from the root of the tree. I have not attempted in include code for approximate confidence limits on branch points, as I have done for branch lengths in DNAML, both because of the extreme crudeness of that test, and because the variation of time for different forks would be highly correlated.

The log likelihood printed out with the final tree can be used to perform various likelihood ratio tests. One can, for example, compare runs with different values of the expected transition/transversion ratio to determine which value is the maximum likelihood estimate, and what is the allowable range of values (using a likelihood ratio test, which you will find described in mathematical statistics books). One could also estimate the base frequencies in the same way. Both of these, particularly the latter, require multiple runs of the program to evaluate different possible values, and this might get expensive.

This program makes possible for the first time a (reasonably) legitimate statistical test of the molecular clock. To do such a test, run DNAML and DNAMLK on the same data. If the trees obtained are of the same topology (when considered as unrooted), it is legitimate to compare their likelihoods by the likelihood ratio test. In DNAML the likelihood has been computed by estimating 2n-3 branch lengths, if their are n tips on the tree. In DNAMLK it has been computed by estimating n-1 branching times (in effect, n-1 branch lengths). The difference in the number of parameters is (2n-3)-(n-1) = n-2. To perform the test take the difference in log likelihoods between the two runs (DNAML should be the higher of the two, barring numerical iteration difficulties) and double it. Look this up on a chi-square distribution with n-2 degrees of freedom. If the result is significant, the log likelihood has been significantly increased by allowing all 2n-3 branch lengths to be estimated instead of just n-1, and molecular clock may be rejected.

If the U (User Tree) option is used and more than one tree is supplied, and the program is not told to assume autocorrelation between the rates at different sites, the program also performs a statistical test of each of these trees against the one with highest likelihood. This test, due to Kishino and Hasegawa (1989), uses the mean and variance of log-likelihood differences between trees, taken across sites. If the mean is more than 1.96 standard deviations different then the trees are declared significantly different. This use of the empirical variance of log-likelihood differences is more robust and nonparametric than the classical likelihood ratio test, and may to some extent compensate for the any lack of realism in the model underlying this program. The program prints out a table of the log-likelihoods of each tree, the differences of each from the highest one, the variance of that quantity as determined by the log-likelihood differences at individual sites, and a conclusion as to whether that tree is or is not significantly worse than the best one. However the Kishino-Hasegawa-Templeton test is not available if we assume that there is autocorrelation of rates at neighboring sites (option R) and is not done in those cases.

The branch lengths printed out are scaled in terms of expected numbers of substitutions, counting both transitions and transversions but not replacements of a base by itself, and scaled so that the average rate of change, averaged over all sites analyzed, is set to 1.0 if there are multiple categories of sites. This means that whether or not there are multiple categories of sites, the expected fraction of change for very small branches is equal to the branch length. Of course, when a branch is twice as long this does not mean that there will be twice as much net change expected along it, since some of the changes occur in the same site and overlie or even reverse each other. The branch lengths estimates here are in terms of the expected underlying numbers of changes. That means that a branch of length 0.26 is 26 times as long as one which would show a 1% difference between the nucleotide sequences at the beginning and end of the branch. But we would not expect the sequences at the beginning and end of the branch to be 26% different, as there would be some overlaying of changes.

Because of limitations of the numerical algorithm, branch length estimates of zero will often print out as small numbers such as 0.00001. If you see a branch length that small, it is really estimated to be of zero length.

Another possible source of confusion is the existence of negative values for the log likelihood. This is not really a problem; the log likelihood is not a probability but the logarithm of a probability. When it is negative it simply means that the corresponding probability is less than one (since we are seeing its logarithm). The log likelihood is maximized by being made more positive: -30.23 is worse than -29.14.

At the end of the output, if the C option is in effect with multiple rate categories, the program will print a list of what site categories contributed the most to the final likelihood. This combination of rate categories need not have contributed a majority of the likelihood, just a plurality. Still, it will be helpful as a view of where the program infers that the higher and lower rates are. Note that the use in this calculations of the prior probabilities of different rates, and the average patch length, gives this inference a "smoothed" appearance: some other combination of rates might make a greater contribution to the likelihood, but be discounted because it conflicts with this prior information. See the example output below to see what this printout of rate categories looks like.

If the 3 (write out tree) option is used, then the tree with branch lengths will be written out on the tree file. The program DRAWGRAM can then be used to draw the tree more precisely.

PROGRAM CONSTANTS

The constants defined at the beginning of the program include "maxtrees", the maximum number of user trees that can be processed. It is small (10) at present to same some further memory but the cost of increasing it is not very great. Other constants include "maxcategories", the maximum number of site categories, "namelength", the length of species names in characters, and three others, "smoothings", "iterations", and "epsilon", that

The other CONSTants include maxcategories, the maximum number of site categories, namelength, the length of species names in characters, and three CONSTants, smoothings, iterations, and epsilon, that help "tune" the algorithm and define the compromise between execution speed and the quality of the branch lengths found by iteratively maximizing the likelihood. Reducing iterations and smoothings, and increasing epsilon, will result in faster execution but a worse result. These values will not usually have to be changed.

The program spends most of its time doing real arithmetic. Any software or hardware changes that speed up that arithmetic will speed it up by a nearly proportional amount. For example, microcomputers having a numeric co-processor will run the program much faster, if the compiled program uses it. The algorithm, with separate and independent computations occurring for each pattern, lends itself readily to parallel processing.

PAST AND FUTURE OF THE PROGRAM

This program was developed in 1989 by combining code from DNAPARS and from DNAML. It was speeded up by two major developments, the use of aliasing of nucleotide sites (version 3.1) and pretabulation of some exponentials (added by Akiko Fuseki in version 3.4). In version 3.5 the Hidden Markov Chain code was added and the method of iterating branch lengths was changed from an EM algorithm to direct search. The Hidden Markov Chain code slows things down, especially if there is autocorrelation between sites, so this version is slower than version 3.4. Nevertheless we hope that the sacrifice is worth it.

Two changes that are needed in the future are to put in some way of allowing for base composition of nucleotide sequences in different parts of the phylogeny, and to re-introduce the code that allowed us to assign relative rates of change to individual sites. By allowing prespecified rates and also allowing the Hidden Markov Chain method of inferring rates at different sites, we could then deal with cases in which (say) first, second, and third positions in a coding sequence had rates of (say) 1.0 : 0.8 : 2.7 but the Hidden Markov Chain also inferred rates that varied along the length of the coding sequence. This would be much more realistic.


----------------------------TEST DATA SET--------------------------

   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC

(it was run with two categories with rates 1.0 and 3.2, and with
probabilities 0.4 and 0.6 for these rates, and with patch length
parameter = 1.5)

----- CONTENTS OF OUTPUT FILE (with all numerical options on) ------


   5 Species,   13 Sites

Nucleic acid sequence
   Maximum Likelihood method with molecular clock, version 3.5c


Site category   Rate of change

          1        1.000
          2        3.200


Name            Sequences
----            ---------

Alpha        AACGTGGCCA AAT
Beta         ..G..C.... ..C
Gamma        C.TT.C.T.. C.A
Delta        GGTA.TT.GG CC.
Epsilon      GGGA.CT.GG CCC


Empirical Base Frequencies:

   A       0.24615
   C       0.29231
   G       0.24615
  T(U)     0.21538

Transition/transversion ratio =   2.000000

(Transition/transversion parameter =   1.523077)





                                                           +--Delta
  +--------------------------------------------------------4
  !                                                        +--Epsilon
--3
  !                                             +-------------Gamma
  +---------------------------------------------2
                                                !          +--Alpha
                                                +----------1
                                                           +--Beta


Ln Likelihood =   -72.40499

 Ancestor      Node      Node Time     Length
 --------      ----      ---- ----     ------
 root            3
   3             4         3.40037     3.40037
   4          Delta        3.55871     0.15834
   4          Epsilon      3.55871     0.15834
   3             2         2.75248     2.75248
   2          Gamma        3.55871     0.80623
   2             1         3.38653     0.63405
   1          Alpha        3.55871     0.17219
   1          Beta         3.55871     0.17219



Combination of categories that contributes the most to the likelihood:

             2222221111 112