Fit a joint model of discrete and continuous characters via maximum-likelihood.

The hOUwie model jointly estimates the likelihood of a discrete and continuous character by combining the probability of the continuous character given a particular regime and the probability of that discrete regime painting, integrated over many regime paintings. Specifically, we combine hidden Markov models of discrete character evolution (Beaulieu et al. 2013; Boyko and Beaulieu 2021) with generalized Ornstein-Uhlenbeck models (Hansen 1997; Butler and King 2004; Hansen et al. 2008; Beaulieu et al. 2012; Ho and Ane 2014a).

This function takes as main input a phylogenetic tree and a data.frame containing species, character values, and potentially measurement error. The rate category is a way to specify the number of hidden states to be incldued, with one meaning no hidden states present. The structure of the model and association between discrete and continuous characters is specified via discrete_model and continuous model (see Details).

hOUwie(phy, 
       data, 
       rate.cat, 
       discrete_model, 
       continuous_model, 
       null.model          = FALSE,
       nSim                = 100, 
       root.p              = "yang", 
       dual                = FALSE, 
       collapse            = TRUE, 
       root.station        = FALSE, 
       get.root.theta      = FALSE, 
       tip.fog             = "none", 
       lb_discrete_model   = NULL, 
       ub_discrete_model   = NULL, 
       lb_continuous_model = NULL, 
       ub_continuous_model = NULL, 
       recon               = FALSE, 
       nodes               = "internal", 
       p                   = NULL, 
       ip                  = NULL, 
       optimizer           = "nlopt_ln", 
       opts                = NULL, 
       quiet               = FALSE, 
       sample_tips         = FALSE, 
       sample_nodes        = FALSE, 
       adaptive_sampling   = FALSE, 
       diagn_msg           = FALSE,
       n_starts            = 1, 
       ncores              = 1)

Arguments

phy: a phylogenetic tree, in ape “phylo” format and with internal nodes labeled denoting the ancestral selective regimes.
data: A data frane containing species information. The first column is always species names matching the tip labels of phy. The 2nd to ith columns are independent discrete characters. The ith+1 column is the continuous character value. Finally, there can be an ith+2 column for measurement error (optional).
rate.cat: Specifies the number of rate categories (see Details).
discrete_model: Either a user-supplied index of parameters to be optimized or one of "ARD", "SYM", or "ER". ARD: all rates differ. SYM: rates between any two states do not differ. ER: all rates are equal.
continuous_model: Either a user-supplied index matrix specifying the continuous model parameters to be estimated or one of "BM1", "BMV", "OU1", "OUA", "OUV", "OUM", "OUVA", "OUMV", "OUMA", "OUMVA" (See also getOUParamStructure).
null.model: A boolean indicating whether the model being run is a character-independent model with rate heterogeneity. Rate.cat must be greater than 1.
nSim: The number of stochastic maps evaluated per iteration of the ML search.
root.p: a vector used to fix the probabilities at the root, but “yang” can also be supplied to use the method of Yang (2006) (see Details).
dual: A boolean indicating whether or not to include dual transitions. If true, then transitions two or more states may change at any instant of time. For example, X0Y0 can change directly to X1Y1 without first going through X1Y0 or X0Y1.
collapse: A boolean indicating whether to collapse multiple character combinations into only the observed states. For example, if true a two character dataset contained (0,0), (1,0), and (1,1), this would be collapsed into 1,2,3. However, if set to false it would 1,2,4. In combination with a custom rate matrix this allows for the estimation of transitions between unobserved character combinations. Transitions to and from unobserved state combinations may not be estimatable: See Boyko and Beaulieu (2022).
root.station: A boolean indicating whether the starting state, \(\theta_0\), should be estimated (see Details of OUwie function).
get.root.theta: A boolean indicating whether the starting state, \(\theta_0\), should be estimated (see Details of OUwie function).
tip.fog: designates whether a fourth column in the data matrix contains within species variation for each species value ("known"). The tip fog is assumed to be the standard error of the species mean. The default is "none", but can be estimated by setting the option to "estimate".
lb_discrete_model: A single value for the lower bound of discrete character evolution during the ML search. The default is 1/(Tmax*10000).
ub_discrete_model: A single value for the upper bound of discrete character evolution during the ML search. The default is 1/(Tmax*0.0001).
lb_continuous_model: Three values specifying the lower bound of continuous character evolution during the ML search. Should be provided in the order of lower.bound for alpha, sigma squared, theta. The default is c(1e-10, 1e-10, smallest_continuous_value/10).
ub_continuous_model: Three values specifying the upper bound of continuous character evolution during the ML search. Should be provided in the order of upper.bound for alpha, sigma squared, theta. The default is c(log(2)/(0.01 * Tmax), log(2)/(0.01 * Tmax), largest_continuous_value*10).
recon: A boolean indicating whether marginal ancestral state reconstruction should take place following the ML search.
nodes: If recon=TRUE, a character or vector indicating which nodes should have their discrete ancestral state characters estimated. If a vector, nodes specified will be reconstructed with 1:Ntip representing tip reconstructions and Ntip+1 representing the root. A user can also indicate "all", "internal", or "external" to reconstruct all nodes, ancestral nodes, or extant nodes respectively. Reconstructing tip states ("external") is only useful if hidden states are included in the model.
p: A numeric vector of fixed parameters to be optimized. Parameters are organized such following c(p_trans, p_alpha, p_sigma.squared, p_theta)
ip: A numeric vector indicating initial parameters to start the ML search. Parameters are organized such following c(p_trans, p_alpha, p_sigma.squared, p_theta)
optimizer: One of "nlopt_ln", "nlopt_gn", or "sann" for a local search of parameter space, global search of paramater space, or simmulated annealing. The default is "nlopt_ln".
opts: A list of options to be passed to nloptr.
quiet: A boolean indicating whether or not to print messages.
sample_tips: A boolean indicating whether or not to sample tip probabilities when generating underlying regimes (see Deatils, but recommended to be set to FALSE).
sample_nodes: A boolean indicating whether or not to sample node probabilities when generating underlying regimes. Recommended that this is set to TRUE when evaluating character independent models, but can be set to FALSE for character dependent models.
adaptive_sampling: A boolean indicating whether each iteration will continue sampling stochastic maps until the likelihood of a set of parameters does not improve. Recommended that this is set to TRUE when evaluating character independent models, but can be set to FALSE for character dependent models.
diagn_msg: A boolean indiciating whether a message of loglikelihoods and parameters should be printed at each iteration. Can be useful for diagnosing problems.
n_starts: An integer, specifying the number of trials for fitting the model, using alternative (randomized) starting parameters at each trial. A larger n_starts reduces the risk of landing on a local non-global optimum of the likelihood function, and thus increases the chances of finding the MLE.
ncores: An integer, specifying the number of cores for running multiple fitting n_starts in parallel. Should generally not exceed the number of CPU cores on a machine, but must be a least 1.

Details

This model is composed of two processes: one that describes the evolution of a discrete character and the other describes the evolution of a continuous character. To model the evolution of a single continuous character we use an Ornstein-Uhlenbeck (OU) model (Hansen 1997; Butler and King 2004; Hansen et al. 2008; Beaulieu et al. 2012; Ho and Ane 2014a). This model combines the stochastic evolution of a trait through time with a deterministic component which models the tendency for a trait to evolve towards an optimum. In this model, the value of a trait, X_t, is pulled towards an optimum, theta, at a rate scaled by the parameter alpha. The optimum, theta, is a piecewise constant on intervals and takes values in a finite set theta. This can represent the set of "selective regimes", "regimes", or Simpson's "adaptive zones" (Cressler et al. 2015), though it is consistent with a variety of true underlying microevolutionary models (Hansen 2014). Additionally, random deviations are introduced by Gaussian white noise dB(t), which is distributed as a normal random variable with mean zero and variance equal to sigma^2. Thus, sigma^2 is a constant describing the rate of stochastic evolution around the optimum. We use the set of extensions introduced by Beaulieu et al. (2012). This allows for multiple primary optima theta in which both the pull strength (alpha) and the rate of stochastic evolution (sigma^2) can vary across the phylogeny.

This model allows for multiple rate categories (specified by the rate.cat argument). The default is one rate class in which only observed discrete states evolve. However, there are two main reasons one may be interested in increasing the number of rate categories. First, rate heterogeneity throughout the tree of life is more of a rule than a possibility. Not all things will evolve in the same way at the same time. For example, woody and herbaceous plants may evolve one way on the mainland and a completely different way on island. The challenge then is to allow for heterogeneity in the evolutionary process when we do not know what the 'mainland' or 'island' variables are. This is what hidden Markov models allow us to do (Boyko and Beaulieu 2021). The second reason to include hidden rate categories is to reduce the error rates of finding a false correlation. This has been discussed elsewhere in depth (see Maddison and FitzJohn 2015, Uyeda et al. 2018, Boyko and Beaulieu 2022). The problem lies in if we compare models with rate heterogeneity to models without it. Most character dependent models (correlation models) allow for different ways for the characters to evolve. In fact, their reliable inference depends on being able to define the differences between, for example, body size evolution on islands and mainlands. However, most character-independent models have no way to allow for variable ways for body size to evolve. So, whether or not the evolution of body size is connected to island systems, we may be biased towards selecting correlation models simply because they allow body size to have different rates of evolution.

It is not uncommon to map a discrete character on a phylogeny using a model of evolution for that discrete character and then run various continuous trait models using that mapping (or a stochastic set of these mappings), and OUwie has ways to use such mappings. However, that mapping is completely uninformed by the continuous trait – it could be that a nearly identical mapping fits the continuous trait far better. A different approach is to do the regime detection using the continuous trait only, as in SURFACE or OUwie.dredge. But if there is a reasonable discrete character (perhaps with hidden states), it can make the most sense to get the likelihood integrating across all different discrete mappings, summing the likelihood for the discrete and continuous traits while doing so. Because looking at all possible reconstructions is infeasible, and looking at them from a distribution from stochastic mapping biases it towards discrete-only mappings, we approximate this by summing over a large number of potential mappings centered on those that seem a good fit to the data.

One way that hOUwie differs from other implementations of automatic regime detection is that we explicitly calculate the the probability of discrete characters (D) and stochastic mapping (z). The difficulties of calculating the pathway probabilities explicitly are outlined in Boyko et al. (2022), but ultimately led us to use an approximation. This approximation relies on a finite number of degree-2 internodes and uses the standard Chapman-Kolmgorov equation to calculate the probabilities of beginning in a particular state i and ending in state j (Pagel 1994) and is identical to a joint probability of a set of state reconstructions (Yang 2006).

discrete_model defines the model between discrete states. This option controls the number of independent rates and the correspondence between independent and dependent rates. If a character, then it must be one of "ER", "SYM", "ARD". If an integer matrix, then it defines a custom parametric structure for the transition rates, by mapping entries of the transition matrix to a set of independent transition-rate parameters (numbered 1,2, and so on). All cells with the number 3 represent transitions that have the same estimated rate. Additionally, ambiguities (polymorphic taxa or taxa missing data) are assigned likelihoods following Felsenstein (2004, p. 255). Taxa with missing data are coded “?” with all states observed at a tip. Polymorphic taxa are coded with states separated by an “&”. For example, if a trait has four states and taxon A is observed to be in state 1 and 3, the character would be coded as “1&3”. Missing data are treated as ambiguous for all states, thus all states for taxa missing data are assigned a likelihood of 1.0 (some software may incorrectly assign a likelihood of 1.0/number of possible states; note this if comparing likelihoods). For example, for a four-state character (i.e. DNA), a taxon missing data will have likelihoods of all four states equal to 1.0 [e.g. L(A)=1.0, L(C)=1.0, L(G)=1.0, L(T)=1.0].

continuous_model option controls the number of independent continuous model parameters and the correspondence between continuous model parameters and underlying regime. If a character, then it must be one of "BM1", "BMV", "OU1", "OUM", "OUA", "OUV", "OUMV", "OUMA", "OUVA", or "OUMVA". These correspond to allowing theta ("OUM"), sigma_square ("BMV", "OUV"), or alpha ("OUA") to vary. If an integer matrix, then it defines a custom parametric structure for the transition rates, by mapping entries of the continuous model parameters to the states of the underlying regimes. Note that when some taxa have phenotypic values below 0, we add 50 to all continuous trait values to assist the optimization, but the estimated parameter values will be scaled back to the original values.

For each set of parameters evaluated during the maximum likelihood search, a set of stochastic mappings are generated to evaluate the discrete and continuous likelihoods. To do this efficiently, we first approximate of the conditional state probabilities at nodes. This is what happens if sample_nodes=TRUE. The conditional state probability, unlike the more common marginal reconstruction or joint state reconstruction, calculates the probability that a node has a particular state value conditioned only on the observations of its descendants. For a particular focal node, we calculate the probability of the observing all pairwise descendant values given the OU model parameters, integrated over all possible rootward node states, and observed tipward discrete states. This process is repeated for each iteration of parameters if adaptive_sample=TRUE, in which case it is repeated until the joint likelihood for a given set of mappings does not improve. We recommend that both of these arugements are set to TRUE for character independent models because it is often very difficult to generate mappings based soley on a discrete process that are good when the discrete character is predicted to be unlinked to the continuous character (i.e., when you're fitting a character independent model). In the case of a character dependent model, the continuous and discrete character are expected to influence one another and actually basing maps on the discrete character only (setting sample_nodes and adaptive_sampling to FALSE) can work fairly well.

IMPORTANT NOTE: The algorithm used to calculate the likelihood described in Beaulieu et al. (2012) involves matrix inversion – a computationally costly procedure. Therefore, we implement a linear-time computation of the likelihood of Gaussian trait models following (Ho and Ane 2014). For this we rely on changes in the most recent version of the phylolm package which is only available on github (not CRAN). Therefore, in order for hOUwie to function properly make sure that the most recent version of phylolm is installed which can be done by running the following code: devtools::install_github("lamho86/phylolm"). For more information on phylolm installation see: https://github.com/lamho86/phylolm.

Value

A named list with the following elements:

loglik: The maximum log-likelihood.
DiscLik: The marginal log-likelihood of the discrete state probability for the nSim maps.
ContLik: The marginal log-likelihood of the continuous value probability for the nSim maps.
AIC: Akaike information criterion.
AICc: Akaike information criterion corrected for sample-size.
BIC: Bayesian information criterion.
param.count: Number of parameters in the model.
solution.disc: The maximum likelihood estiamte of the transition rate matrix.
solution.cont: The maximum likelihood estimate of parameters which describe continuous evolution. Each column corresponds to a discrete state and each row is a different continuous model parameter. Estimated continuous parameters include alpha, sigma.squared, and theta.
recon: A data.frame of marginal probabilities of ancestral states given the best fitting model. If ancestral state reconstruction was not specified then NULL.
index.disc: An index matrix specifying which discrete model parameters are estimated.
index.cont: An index matrix specifying which continuous model parameters are estimated.
phy: User provided phylogenetic tree.
legend: A named vector with elements corresponding to input states and names corresponding to a numeric version of those discrete states. Useful for interpreting hOUwie solutions.
expected_vals: Expected values given OU parameters averaged over the joint likelihood of the nSim regime mappings.
data: User provided data.frame.
hOUwie.dat: Internally used data.frame (provided in case a user is concerned about character matching).
rate.cat: The number of rate categories.
discrete_model: The user provided index matrix or character string describing the discrete model.
continuous_model: The user provided index matrix or character string describing the continuous model.
root.p: The root prior used in model estimation.
time_slice: Sets how often internodes are sampled. At the moment it is automatically set to be the max(branching.times(phy))+1.
root.station: A boolean indicating whether the starting state, \(\theta_0\), was estimated (recommended FALSE).
get.root.theta: A boolean indicating whether the root.theta was included in the model.
tip.fog: Either includes tip fog for each species value ("known"). The default is "none".
sample_tips: A boolean indicating whether tips probabilities were sampled each iteration (recommended FALSE).
sample_nodes: A boolean indicating whether node probabilities were sampled each iteration (recommended TRUE).
adaptive_sampling: A boolean indicating whether each iteration continued sampling stochastic maps until the likelihood of a set of parameters did not improve (recommended TRUE).
lb_discrete_model: A vector of lower bounds on discrete character evolution.
ub_discrete_model: A vector of upper bounds on discrete character evolution.
lb_continuous_model: A vector of lower bounds of alpha, sigma square, and theta.
ub_continuous_model: A vector of lower bounds of alpha, sigma square, and theta.
p: Vector of maximum likeihood parameters if not given.
ip: Vector of starting parameters
nSim: Number of stochastic maps for each iteration.
opts: List of options input for nloptr optimization.
quiet: A boolean indicating whether quiet was TRUE or FALSE.
all_disc_liks: A vector of discrete probabilities in the order of the nSim simmaps.
all_cont_liks: A vector of continuous probabilities in the order of the nSim simmaps.
simmaps: nSim stochastic maps from the best fitting hOUwie parameters (to be removed).
run_time: Time to complete the model fit.

References

Beaulieu J.M., Jhwueng D.-C., Boettiger C., O'Meara B.C. 2012. Modeling Stabilizing Selection: Expanding the Ornstein–Uhlenbeck Model of Adaptive Evolution. Evolution. 66:2369–2383.

Beaulieu J.M., O'Meara B.C., Donoghue M.J. 2013. Identifying Hidden Rate Changes in the Evolution of a Binary Morphological Character: The Evolution of Plant Habit in Campanulid Angiosperms. Syst Biol. 62:725–737.

Boyko J.D., Beaulieu J.M. 2021. Generalized hidden Markov models for phylogenetic comparative datasets. Methods Ecol Evol. 12:468–478.

Butler M.A., King A.A. 2004. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution. The American Naturalist. 164:683–695.

Cressler C.E., Butler M.A., King A.A. 2015. Detecting Adaptive Evolution in Phylogenetic Comparative Analysis Using the Ornstein–Uhlenbeck Model. Systematic Biology. 64:953–968.

Felsenstein J. 2004. Inferring phylogenies. Sunderland MA: Sinauer Associates.

Hansen T.F. 1997. Stabilizing Selection and the Comparative Analysis of Adaptation. Evolution. 51:1341–1351.

Hansen T.F., Pienaar J., Orzack S.H. 2008. A Comparative Method for Studying Adaptation to a Randomly Evolving Environment. Evolution. 62:1965–1977.

Hansen T.F. 2014. Use and Misuse of Comparative Methods in the Study of Adaptation. In: Garamszegi L.Z., editor. Modern Phylogenetic Comparative Methods and Their Application in Evolutionary Biology. Berlin, Heidelberg: Springer Berlin Heidelberg. p. 351–379.

Ho L. si, Ane C. 2014. A Linear-Time Algorithm for Gaussian and Non-Gaussian Trait Evolution Models. Syst Biol. 63:397–408.

Author

James D. Boyko