PhyloEM is the main function of the package. It uses maximum likelihood methods to fit a BM or an OU process for several traits evolving along a phylogenetic tree, with automatic shift detection on the branches of the tree. This function can handle missing data.

PhyloEM(
  phylo,
  Y_data,
  process = c("BM", "OU", "scOU", "rBM"),
  check_postorder = TRUE,
  independent = FALSE,
  K_max = max(floor(sqrt(length(phylo$tip.label))), 10),
  use_previous = FALSE,
  order = TRUE,
  method.selection = c("LINselect", "DDSE", "Djump"),
  C.BM1 = 0.1,
  C.BM2 = 2.5,
  C.LINselect = 1.1,
  method.variance = c("upward_downward", "simple"),
  method.init = "lasso",
  method.init.alpha = "estimation",
  method.init.alpha.estimation = c("regression", "regression.MM", "median"),
  methods.segmentation = c("lasso", "best_single_move"),
  alpha_grid = TRUE,
  nbr_alpha = 10,
  random.root = TRUE,
  stationary.root = random.root,
  alpha = NULL,
  check.tips.names = TRUE,
  progress.bar = TRUE,
  estimates = NULL,
  save_step = FALSE,
  rescale_OU = TRUE,
  parallel_alpha = FALSE,
  Ncores = 3,
  K_lag_init = 5,
  light_result = TRUE,
  tol_tree = .Machine$double.eps^0.5,
  allow_negative = FALSE,
  option_is.ultrametric = 1,
  trait_correlation_threshold = 0.9,
  ...
)

Arguments

phylo

A phylogenetic tree of class phylo (from package ape).

Y_data

Matrix of data at the tips, size p x ntaxa. Each line is a trait, and each column is a tip. The column names are checked against the tip names of the tree.

process

The model used for the fit. One of "BM" (for a full BM model, univariate or multivariate); "OU" (for an OU with independent traits, univariate or multivariate); or "scOU" (for a "scalar OU" model, see details).

check_postorder

Re-order the tree in post-order. If the Upward-Downward algorithm is used, the tree need to be in post-order. Default to TRUE if the upward-downward is used, otherwise automatically set to FALSE.

independent

Are the trait assumed to be independent from one another? Default to FALSE. OU in a multivariate setting only works if TRUE.

K_max

The maximum number of shifts to be considered. Default to \(max(|\sqrt ntaxa|, 10)\).

use_previous

Should the initialization for K+1 shifts use the estimation for $K$ shifts already obtained? Default to FALSE.

order

Should the estimations be done for K increasing (TRUE) or K decreasing (FALSE)? If use_previous=FALSE, this has no influence, except if one initialization fails. Default to TRUE.

method.selection

Method selection to be used. Several ones can be used at the same time. One of "LINselect" for the Baraud Giraud Huet LINselect method; "DDSE" for the Slope Heuristic or "Djump" for the Jump Heuristic, last two based the Birgé Massart method.

C.BM1

Multiplying constant to be used for the BigeMassart1 method. Need to be positive. Default to 0.1.

C.BM2

Multiplying constant to be used for the BigeMassart2 method. Default to 2.5.

C.LINselect

Multiplying constant to be used for the LINselect method. Need to be greater than 1. Default to 1.1.

method.variance

Algorithm to be used for the moments computations at the E step. One of "simple" for the naive method; of "upward_downward" for the Upward Downward method (usually faster). Default to "upward_downward".

method.init

The initialization method. One of "lasso" for the LASSO base initialization method; or "default" for user-specified initialization values. Default to "lasso".

method.init.alpha

For OU model, initialization method for the selection strength alpha. One of "estimation" for a cherry-based initialization, using nlrob; or "default" for user-specified initialization values. Default to "estimation".

method.init.alpha.estimation

If method.init.alpha="estimation", choice of the estimation(s) methods to be used. Choices among "regression", (method="M" is passed to nlrob); "regression.MM" (method="MM" is passed to nlrob) or "median" (nlrob is not used, a simple median is taken). Default to all of them.

methods.segmentation

For OU, method(s) used at the M step to find new candidate shifts positions. Choices among "lasso" for a LASSO-based algorithm; and "best_single_move" for a one-move at a time based heuristic. Default to both of them. Using only "lasso" might speed up the function a lot.

alpha_grid

whether to use a grid for alpha values. Default to TRUE. This is the only available method for scOU. This method is not available for OU with multivariate traits. OU with univariate traits can take both TRUE or FALSE. If TRUE, a grid based on the branch length of the tree is automatically computed, using function find_grid_alpha.

nbr_alpha

If alpha_grid=TRUE, the number of alpha values on the grid. Default to 10.

random.root

whether the root is assumed to be random (TRUE) of fixed (FALSE). Default to TRUE

stationary.root

whether the root is assumed to be in the stationary state. Default to TRUE.

alpha

If the estimation is done with a fixed alpha (either known, or on a grid), the possible value for alpha. Default to NULL.

check.tips.names

whether to check the tips names of the tree against the column names of the data. Default to TRUE.

progress.bar

whether to display a progress bar of the computations. Default to TRUE.

estimates

The result of a previous run of this same function. This function can be re-run for other model election method. Default to NULL.

save_step

If alpha_grid=TRUE, whether to save the intermediate results for each value of alpha (in a temporary file). Useful for long computations. Default to FALSE.

rescale_OU

For the Univariate OU, should the tree be re-scaled to use a BM ? This can speed up the computations a lot. However, it can make it harder for the EM to explore the space of parameters, and hence lead to a sub-optimal solution. Default to TRUE.

parallel_alpha

If alpha_grid=TRUE, whether to run the estimations with different values of alpha on separate cores. Default to FALSE. If TRUE, the log is written as a temporary file.

Ncores

If parallel_alpha=TRUE, number of cores to be used.

K_lag_init

Number of extra shifts to be considered at the initialization step. Increases the accuracy, but can make computations quite slow of taken too high. Default to 5.

light_result

if TRUE (the default), the object returned is made light, without easily computable quantities. If FALSE, the object can be very heavy, but its subsequent manipulations can be faster (especially for plotting).

tol_tree

tolerance to consider a branch length significantly greater than zero, or two lineages lengths to be different, when checking for ultrametry. (Default to .Machine$double.eps^0.5). See is.ultrametric and di2multi.

allow_negative

whether to allow negative values for alpha (Early Burst). See details. Default to FALSE.

option_is.ultrametric

option for is.ultrametric check. Default to 1.

trait_correlation_threshold

the trait correlation threshold to stop the analysis. Default to 0.9.

...

Further arguments to be passed to estimateEM, including tolerance parameters for stopping criteria, maximal number of iterations, etc.

Value

An object of class PhyloEM. Relevant quantities can be extracted from it using helper functions params_process.PhyloEM, imputed_traits.PhyloEM

Details

Several models can be used:

  • BM with fixed root, univariate or multivariate.

  • OU with fixed or stationary root, univariate or multivariate.

For the OU in the multivariate setting, two assumptions can be made:

  • Independent traits. This amounts to diagonal rate and selection matrices.

  • "Scalar OU" (scOU): the rate matrix can be full, but the selection strength matrix is assumed to be scalar, i.e. all the traits are supposed to go to their optimum values with the same speed.

Note that the "scalar OU" model can also be seen as a re-scaling of the tree. The selection strength parameter alpha can then be interpreted as a measure of the "phylogenetic signal":

  • If alpha is close to 0, then the process is similar to a BM on the original tree, and the signal is strong.

  • If alpha is large, then the re-scaled tree is similar to a star-tree, and the signal is weak.

When there are no shifts, and the root is taken to be constant, this model is actually equivalent to an AC model (Uyeda et al. 2015). With this interpretation in mind, one might want to explore negative values of alpha, in order to fit a DC (or Early Burst) model. With no shift and a fixed root, the same proof shows that the scOU with alpha negative is equivalent to the DC model. There are two strong caveats in doing that.

  • The interpretation of the OU as modeling the dynamic of a trait undergoing stabilizing selection is lost. In this case, the scOU can only be seen as a re-scaling of the tree, similar to Pagel's delta.

  • The values of the "optimal values", and of the shifts on them, cannot be interpreted as such (the process is actually going away from this values, instead of being attracted). When looking at these values, one should only use the un-normalized values happening of the underlying BM. You can extract those using the params_process function with rBM = TRUE.

Examples

if (FALSE) {
## Load Data
data(monkeys)
## Run method
# Note: use more alpha values for better results.
res <- PhyloEM(Y_data = monkeys$dat,        ## data
               phylo = monkeys$phy,         ## phylogeny
               process = "scOU",            ## scalar OU
               random.root = TRUE,          ## root is stationary
               stationary.root = TRUE,
               K_max = 10,                  ## maximal number of shifts
               nbr_alpha = 4,               ## number of alpha values
               parallel_alpha = TRUE,       ## parallelize on alpha values
               Ncores = 2)
## Plot selected solution (LINselect)
plot(res) # three shifts
## Plot selected solution (DDSE)
plot(res, method.selection = "DDSE") # no shift
## Extract and solution with 5 shifts
params_5 <- params_process(res, K = 5)
plot(res, params = params_5)
## Show all equivalent solutions
eq_sol <- equivalent_shifts(monkeys$phy, params_5)
plot(eq_sol)
}