CONTML - Gene Frequencies and Continuous Characters Maximum Likelihood method

This program estimates phylogenies by the restricted maximum likelihood method based on the Brownian motion model. It is based on the model of Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966), Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work leading to efficient algorithms. CONTML uses restricted maximum likelihood estimation (REML), which is the criterion used by Felsenstein (1973b). The actual algorithm is an iterative EM Algorithm (Dempster, Laird, and Rubin, 1977) which is guaranteed to always give increasing likelihoods. The algorithm is described in detail in a paper or mine (Felsenstein, 1981c), which you should definitely consult if you are going to use this program. Some simulation tests of it are given by Rohlf and Wooten (1988) and Kim and Burgman (1988).

The default (gene frequency) mode treats the input as gene frequencies at a series of loci, and square-root-transforms the allele frequencies (constructing the frequency of the missing allele at each locus first). This enables us to use the Brownian motion model on the resulting coordinates, in an approximation equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure of genetic distance and taking that to give distance between particles undergoing pure Brownian motion. It assumes that each locus evolves independently by pure genetic drift.

The alternative continuous characters mode (menu option C) treats the input as a series of coordinates of each species in N dimensions. It assumes that we have transformed the characters to remove correlations and to standardize their variances.

The input file is as described in the continuous characters documentation file above. Options are selected using a menu:


Continuous character Maximum Likelihood method version 3.5c

Settings for this run:
  U                       Search for best tree?  Yes
  C  Gene frequencies or continuous characters?  Gene frequencies
  A   Input file has all alleles at each locus?  No, one allele missing at each
  O                              Outgroup root?  No, use as outgroup species 1
  G                      Global rearrangements?  No
  J           Randomize input order of species?  No. Use input order
  M                 Analyze multiple data sets?  No
  0         Terminal type (IBM PC, VT52, ANSI)?  ANSI
  1          Print out the data at start of run  No
  2        Print indications of progress of run  Yes
  3                              Print out tree  Yes
  4             Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)
Option U is the usual User Tree option. Options C (Continuous Characters) and A (All alleles present) have been described in the Gene Frequencies and Continuous Characters Programs documentation file. The options G, J, O and M are the usual Global Rearrangements, Jumble order of species, Outgroup root, and Multiple Data Sets options.

The G and J options have no effect if the User Tree option is selected. User trees are given with a trifurcation (three-way split) at the base. They can start from any interior node. Thus the tree:


     A
     |
     *--B
     |
     *-----C
     |
     *--D
     |
     E
can be represented by any of the following:
(A,B,(C,(D,E)));
((A,B),C,(D,E));
(((A,B),C),D,E);
(there are of course 69 other representations as well obtained from these by swapping the order of branches at an interior node).

The output has a standard appearance. The topology of the tree is given by an unrooted tree diagram. The lengths (in time or in expected amounts of variance) are given in a table below the topology, and a rough confidence interval given for each length. Negative lower bounds on length indicate that rearrangements may be acceptable.

The units of length are amounts of expected accumulated variance (not time). The log likelihood (natural log) of each tree is also given, and it is indicated how many topologies have been tried. The tree does not necessarily have all tips contemporary, and the log likelihood may be either positive or negative (this simply corresponds to whether the density function does or does not exceed 1) and a negative log likelihood does not indicate any error. The log likelihood allows various formal likelihood ratio hypothesis tests. The description of the tree includes approximate standard errors on the lengths of segments of the tree. These are calculated by considering only the curvature of the likelihood surface as the length of the segment is varied, holding all other lengths constant. As such they are most probably underestimates of the variance, and hence may give too much confidence in the given tree.

One should use caution in interpreting the likelihoods that are printed out. If the model is wrong, it will not be possible to use the likelihoods to make formal statistical statements. Thus, if gene frequencies are being analyzed, but the gene frequencies change not only by genetic drift, but also by mutation, the model is not correct. It would be as well-justified in this case to use GENDIST to compute the Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a tree. If continuous characters are being analyzed, but if the characters have not been transformed to new coordinates that evolve independently and at equal rates, then the model is also violated and no statistical analysis is possible.

The program makes another kind of statistical analysis if the U (User Tree) option is invoked and if multiple user trees are input. Then the pairwise statistical test of Templeton (1983) as modified for likelihoods by Kishino and Hasegawa (1989) is carried out. This is a relative of the test that is called by Allan Wilson (see Holmquist, Miyamoto, and Goodman, 1988) the "winning sites" test. The test forms the differences between the log- likelihoods of the best tree and each given tree, locus by locus (or coordinate by coordinate in the continuous characters case). Then it carries out a pairwise t-test (actually a z-test as it uses the normal distribution, which is a bit rougher) between the two trees to see whether one of them is significantly more strongly supported by the data. We can consider the confidence interval of trees that are allowed by the data to be all those that pass the test (i.e. all those that have the sum of their log likelihood differences more than 1.96 standard deviations from zero). This test makes fewer assumptions than does the standard likelihood ratio test, and can be applied when the LRT is not valid, as when we compare different tree topologies. It may be valid even when there are different rates of evolution at different loci, for example.

One problem which sometimes arises is that the program is fed two species (or populations) with identical transformed gene frequencies: this can happen if sample sizes are small and/or many loci are monomorphic. In this case the program "gets its knickers in a twist" and can divide by zero, usually causing a crash. If you suspect that this has happened, check for two species with identical coordinates. If you find them, eliminate one from the problem: the two must always show up as being at the same point on the tree anyway.

The constants available for modification at the beginning of the program include "epsilon1", a small quantity used in the iterations of branch lengths, "epsilon2", another not quite so small quantity used to check whether gene frequencies that were fed in for all alleles do not add up to 1, "smoothings", the number of passes through a given tree in the iterative likelihood maximization for a given topology, "maxtrees", the maximum number of user trees that will be used for the Kishino-Hasegawa-Templeton test, and "namelength", the length of species names. There is no provision in this program for saving multiple trees that are tied for having the highest likelihood, mostly because an exact tie is unlikely anyway.

The algorithm does not run as quickly as the discrete character methods but is not enormously slower. Like them, its execution time should rise as the cube of the number of species. In one paper Astolfi, Kidd, and Cavalli-Sforza (1981) say that "ML requires a prohibitive amount of computer time." It should be realized that they were using a FORTRAN program of mine nearly ten years old at that time, which did not take advantage of the EM algorithm to speed up convergence to the best likelihood for a given topology. The current program should be much more practical than the one they had to use.

--------------TEST DATA SET--------------------------

    5    10
2 2 2 2 2 2 2 2 2 2
European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976

-------- TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON) --------

   5 Populations,   10 Loci

Continuous character Maximum Likelihood method version 3.5c


Numbers of alleles at the loci:
------- -- ------- -- --- -----

   2   2   2   2   2   2   2   2   2   2

Name                 Gene Frequencies
----                 ---- -----------

  locus:         1         2         3         4         5         6
                 7         8         9        10

European     0.28680   0.56840   0.44220   0.42860   0.38280   0.72850
             0.63860   0.02050   0.80550   0.50430
African      0.13560   0.48400   0.06020   0.03970   0.59770   0.96750
             0.95110   0.06000   0.75820   0.62070
Chinese      0.16280   0.59580   0.72980   1.00000   0.38110   0.79860
             0.77820   0.07260   0.74820   0.73340
American     0.01440   0.69900   0.32800   0.74210   0.66060   0.86030
             0.79240   0.00000   0.80860   0.86360
Australian   0.12110   0.22740   0.58210   1.00000   0.20180   0.90000
             0.98370   0.03960   0.90970   0.29760


  +------------------------------African
  !
  !          +--------American
--1----------2
  !          |                    +-------------Australian
  !          +--------------------3
  !                               +Chinese
  !
  +--European


remember: this is an unrooted tree!

Ln Likelihood =    33.29060

examined   15 trees

Between     And             Length      Approx. Confidence Limits
-------     ---             ------      ------- ---------- ------
  1       African           0.08464   (     0.02351,     0.17917)
  1          2              0.03569   (    -0.00262,     0.09493)
  2       American          0.02094   (    -0.00904,     0.06731)
  2          3              0.05098   (     0.00555,     0.12124)
  3       Australian        0.05959   (     0.01775,     0.12430)
  3       Chinese           0.00221   (    -0.02034,     0.03710)
  1       European          0.00624   (    -0.01948,     0.04601)