PhyloEM
is the main function of the package. It uses maximum likelihood
methods to fit a BM or an OU process for several traits evolving along a
phylogenetic tree, with automatic shift detection on the branches of the tree.
This function can handle missing data.
PhyloEM(
phylo,
Y_data,
process = c("BM", "OU", "scOU", "rBM"),
check_postorder = TRUE,
independent = FALSE,
K_max = max(floor(sqrt(length(phylo$tip.label))), 10),
use_previous = FALSE,
order = TRUE,
method.selection = c("LINselect", "DDSE", "Djump"),
C.BM1 = 0.1,
C.BM2 = 2.5,
C.LINselect = 1.1,
method.variance = c("upward_downward", "simple"),
method.init = "lasso",
method.init.alpha = "estimation",
method.init.alpha.estimation = c("regression", "regression.MM", "median"),
methods.segmentation = c("lasso", "best_single_move"),
alpha_grid = TRUE,
nbr_alpha = 10,
random.root = TRUE,
stationary.root = random.root,
alpha = NULL,
check.tips.names = TRUE,
progress.bar = TRUE,
estimates = NULL,
save_step = FALSE,
rescale_OU = TRUE,
parallel_alpha = FALSE,
Ncores = 3,
K_lag_init = 5,
light_result = TRUE,
tol_tree = .Machine$double.eps^0.5,
allow_negative = FALSE,
option_is.ultrametric = 1,
trait_correlation_threshold = 0.9,
...
)
A phylogenetic tree of class phylo
(from package ape
).
Matrix of data at the tips, size p x ntaxa. Each line is a trait, and each column is a tip. The column names are checked against the tip names of the tree.
The model used for the fit. One of "BM" (for a full BM model, univariate or multivariate); "OU" (for an OU with independent traits, univariate or multivariate); or "scOU" (for a "scalar OU" model, see details).
Re-order the tree in post-order. If the Upward-Downward algorithm is used, the tree need to be in post-order. Default to TRUE if the upward-downward is used, otherwise automatically set to FALSE.
Are the trait assumed to be independent from one another? Default to FALSE. OU in a multivariate setting only works if TRUE.
The maximum number of shifts to be considered. Default to \(max(|\sqrt ntaxa|, 10)\).
Should the initialization for K+1 shifts use the estimation for $K$ shifts already obtained? Default to FALSE.
Should the estimations be done for K increasing (TRUE) or K decreasing (FALSE)? If use_previous=FALSE, this has no influence, except if one initialization fails. Default to TRUE.
Method selection to be used. Several ones can be used at the same time. One of "LINselect" for the Baraud Giraud Huet LINselect method; "DDSE" for the Slope Heuristic or "Djump" for the Jump Heuristic, last two based the Birgé Massart method.
Multiplying constant to be used for the BigeMassart1 method. Need to be positive. Default to 0.1.
Multiplying constant to be used for the BigeMassart2 method. Default to 2.5.
Multiplying constant to be used for the LINselect method. Need to be greater than 1. Default to 1.1.
Algorithm to be used for the moments computations at the E step. One of "simple" for the naive method; of "upward_downward" for the Upward Downward method (usually faster). Default to "upward_downward".
The initialization method. One of "lasso" for the LASSO base initialization method; or "default" for user-specified initialization values. Default to "lasso".
For OU model, initialization method for the selection
strength alpha. One of "estimation" for a cherry-based initialization, using
nlrob
; or "default" for user-specified
initialization values. Default to "estimation".
If method.init.alpha="estimation",
choice of the estimation(s) methods to be used. Choices among "regression",
(method="M" is passed to nlrob
); "regression.MM"
(method="MM" is passed to nlrob
) or "median"
(nlrob
is not used, a simple median is taken).
Default to all of them.
For OU, method(s) used at the M step to find new candidate shifts positions. Choices among "lasso" for a LASSO-based algorithm; and "best_single_move" for a one-move at a time based heuristic. Default to both of them. Using only "lasso" might speed up the function a lot.
whether to use a grid for alpha values. Default to TRUE. This
is the only available method for scOU. This method is not available for OU with
multivariate traits. OU with univariate traits can take both TRUE or FALSE. If
TRUE, a grid based on the branch length of the tree is automatically computed,
using function find_grid_alpha
.
If alpha_grid=TRUE
, the number of alpha values on the
grid. Default to 10.
whether the root is assumed to be random (TRUE) of fixed (FALSE). Default to TRUE
whether the root is assumed to be in the stationary state. Default to TRUE.
If the estimation is done with a fixed alpha (either known, or on a grid), the possible value for alpha. Default to NULL.
whether to check the tips names of the tree against the column names of the data. Default to TRUE.
whether to display a progress bar of the computations. Default to TRUE.
The result of a previous run of this same function. This function can be re-run for other model election method. Default to NULL.
If alpha_grid=TRUE, whether to save the intermediate results for each value of alpha (in a temporary file). Useful for long computations. Default to FALSE.
For the Univariate OU, should the tree be re-scaled to use a BM ? This can speed up the computations a lot. However, it can make it harder for the EM to explore the space of parameters, and hence lead to a sub-optimal solution. Default to TRUE.
If alpha_grid=TRUE, whether to run the estimations with different values of alpha on separate cores. Default to FALSE. If TRUE, the log is written as a temporary file.
If parallel_alpha=TRUE, number of cores to be used.
Number of extra shifts to be considered at the initialization step. Increases the accuracy, but can make computations quite slow of taken too high. Default to 5.
if TRUE (the default), the object returned is made light, without easily computable quantities. If FALSE, the object can be very heavy, but its subsequent manipulations can be faster (especially for plotting).
tolerance to consider a branch length significantly greater than zero, or
two lineages lengths to be different, when checking for ultrametry.
(Default to .Machine$double.eps^0.5). See is.ultrametric
and di2multi
.
whether to allow negative values for alpha (Early Burst). See details. Default to FALSE.
option for is.ultrametric
check. Default to 1.
the trait correlation threshold to stop the analysis. Default to 0.9.
Further arguments to be passed to estimateEM
, including
tolerance parameters for stopping criteria, maximal number of iterations, etc.
An object of class PhyloEM
. Relevant quantities can be extracted from it
using helper functions params_process.PhyloEM
,
imputed_traits.PhyloEM
Several models can be used:
BM with fixed root, univariate or multivariate.
OU with fixed or stationary root, univariate or multivariate.
For the OU in the multivariate setting, two assumptions can be made:
Independent traits. This amounts to diagonal rate and selection matrices.
"Scalar OU" (scOU): the rate matrix can be full, but the selection strength matrix is assumed to be scalar, i.e. all the traits are supposed to go to their optimum values with the same speed.
Note that the "scalar OU" model can also be seen as a re-scaling of the tree. The selection strength parameter alpha can then be interpreted as a measure of the "phylogenetic signal":
If alpha is close to 0, then the process is similar to a BM on the original tree, and the signal is strong.
If alpha is large, then the re-scaled tree is similar to a star-tree, and the signal is weak.
When there are no shifts, and the root is taken to be constant, this model is actually equivalent to an AC model (Uyeda et al. 2015). With this interpretation in mind, one might want to explore negative values of alpha, in order to fit a DC (or Early Burst) model. With no shift and a fixed root, the same proof shows that the scOU with alpha negative is equivalent to the DC model. There are two strong caveats in doing that.
The interpretation of the OU as modeling the dynamic of a trait undergoing stabilizing selection is lost. In this case, the scOU can only be seen as a re-scaling of the tree, similar to Pagel's delta.
The values of the "optimal values", and of the shifts on them, cannot
be interpreted as such (the process is actually going away from this values,
instead of being attracted). When looking at these values, one should only use
the un-normalized values happening of the underlying BM. You can extract those
using the params_process
function with rBM = TRUE
.
if (FALSE) {
## Load Data
data(monkeys)
## Run method
# Note: use more alpha values for better results.
res <- PhyloEM(Y_data = monkeys$dat, ## data
phylo = monkeys$phy, ## phylogeny
process = "scOU", ## scalar OU
random.root = TRUE, ## root is stationary
stationary.root = TRUE,
K_max = 10, ## maximal number of shifts
nbr_alpha = 4, ## number of alpha values
parallel_alpha = TRUE, ## parallelize on alpha values
Ncores = 2)
## Plot selected solution (LINselect)
plot(res) # three shifts
## Plot selected solution (DDSE)
plot(res, method.selection = "DDSE") # no shift
## Extract and solution with 5 shifts
params_5 <- params_process(res, K = 5)
plot(res, params = params_5)
## Show all equivalent solutions
eq_sol <- equivalent_shifts(monkeys$phy, params_5)
plot(eq_sol)
}