Bumped version to 1.1.18
README.md fix
Some bug fixes
GLInvCI is a package that provides a framework for computing the maximum-likelihood estimates and asymptotic confidence intervals of a class of continuous-time Gaussian branching processes, including the Ornstein-Uhlenbeck branching process, which is commonly used in phylogenetic comparative methods. The framework is designed to be flexible enough that the user can easily specify their own parameterisation and obtain the maximum-likelihood estimates and confidence intervals of their own parameters.
The model in concern is the family of continuous-trait continuous-time Gaussian evolution processes along a known phylogeny, in which each species’ traits evolve independently of each others after branching off from their common ancestor and for every non-root node. Let k be a child node of i, and zₖ, zᵢ denotes the corresponding multivariate traits. We assume that zₖ|zᵢ is a Gaussian distribution with expected value wₖ+Φₖzᵢ and variance Vₖ, where the matrices (Φₖ,wₖ,Vₖ) are parameters independent of zₖ but can depend other parameters including tₖ. The traits zₖ and zᵢ can have different number of dimension.
The following command should install the latest version of the package:
install.packages('devtools')
devtools::install_url(
'https://git.sr.ht/~hckiang/glinvci/blob/latest-tarballs/glinvci_latest_main.tar.gz')
The package contains two levels of user interfaces. The high-level
interface, accessible through the glinv
function, provides facilities
for handling missing traits, lost traits, multiple evolutionary regimes,
and most importantly, the calculus chain rule. The lower-level
interface, accessible through the glinv_gauss
function, allows the
users to operate purely in the (Φₖ,wₖ,Vₖ)ₖ parameter space. The latter
parameter model obviously has huge number of parameters because k
corresponds to all the nodes and tips (except the root).
Most users should be satisfied with the high-level interface, even if they intend to write their own custom models.
To fit a model using this package, generally you will need two main pieces of input data: a rooted phylogenetic tree and a matrix of trait values. The phylogenetic tree can be non-ultrametric and can potentially contain multifurcation. The matrix of trait values should have the same number of columns as the number of tips.
For the purpose of demonstrating the functionality of the package, we will first generate a random tree.
library(glinvci)
set.seed(1)
ntips = 300 # No. of tips
k = 2 # No. of trait dimensions
tr = ape::rtree(ntips) # Random non-ultrametric tree
x0 = rnorm(k) # Root value
With the above material, we are ready to make a model object. We use the OU model as an example. The OU model is generally parameterised in three matrix parameters: the drift matrix H, the evolutionary optimum θ, and the symmetric positively definite Brownian motion covariance matrix Σ. In this example, we restrict H to be a symmetric positively definite matrix while leaving θ and Σ unrestricted:
repar = get_restricted_ou(H='logspd', theta=NULL, Sig=NULL, lossmiss=NULL)
mod = glinv(tr, x0, X=NULL, repar = repar)
print(mod)
# An alternative way to make the model is:
# mod = glinv(tr, x0, X=NULL,
# pardims = repar$nparams(k),
# parfns = repar$par,
# parjacs = repar$jac,
# parhess = repar$hess)
A GLInv model with 1 regimes and 8 parameters divided into 1 blocks,
all of which are associated to the only one existing regime.
The phylogeny has 300 tips and 299 internal nodes. Tip values are empty,
meaning that `fit()` and `lik()` etc. will not work. See `?set_tips`.
The first line above constructs a “re-parameterization object” repar
,
in which 'logspd'
means that H should be symmetric positively definite
with its diagonals parametrized in its log scale. There are 8 parameters
because the symmetric positive definite matrix H corresponds to 3
parameters (instead of 4 because of symmetry), θ has 2 elements, and Σ
has 3 parameters, again, because of symmetry.
Notice that we don’t have any trait vectors yet, and this is why the package says you cannot fit the model nor compute the likelihood of the model. To be able to fit the model, of course, we need some trait values as our data, and now we will generate some random trait vectors from the model and use it to fit the model.
But before we can generate data, we need to make some ground-truth parameters:
H = matrix(c(1,0,0,1), k)
theta = c(0,0)
sig = matrix(c(0.5,0,0,0.5), k)
sig_x = t(chol(sig))
diag(sig_x) = log(diag(sig_x)) # Pass the diagonal to log
sig_x = sig_x[lower.tri(sig_x,diag=T)] # Trim away upper-tri. part and flatten.
In the above, the first three lines defines the actual parameters that
we want, but notice that we performed a Cholesky decomposition on
sig_x
and took the logarithm of the diagonal. The package always
accept the variance-covariance matrix of the Brownian motion term in
this form, unless it is restricted to be a diagonal matrix. The
Cholesky decomposition ensures that, during numerical optimisation in
the model fitting, the matrix remain positively definite; and logarithm
further constrain the diagonal of the Cholesky factor to be positive,
hence eliminating multiple optima.
Because we have also constrained H
to be symmetric positively definite
(by passing H='logspd'
to get_restricted_ou
), we need to transform
H
in the same manner:
H_input = t(chol(H))
diag(H_input) = log(diag(H_input))
H_input = H_input[lower.tri(H_input,diag=T)]
This transformation depends on how you restrict your H
matrix. For
example, if you do not put any constrains on H
, by passing H=NULL
to
get_restricted_ou
, the above transformation is not needed.
Then we need to concatenate all parameters into a single vector
par_truth. All OU-related functions in the package assume that the
parameters are concatenated in the (H, theta, sig_x)
order, as
follows:
par_truth = c(H=H_input,theta=theta,sig_x=sig_x)
Now let’s simulate the some trait data and add the these trait data into
the model object mod
. The first line below simulates a set of trait
data using the parameters par_truth. The second line adds the simulated
data into the mod object.
X = rglinv(mod, par_truth, Nsamp=1)
set_tips(mod, X[[1]])
print(mod)
A GLInv model with 1 regimes and 8 parameters divided into 1 blocks,
all of which are associated to the only one existing regime.
The phylogeny has 300 tips and 299 internal nodes. Tip values are already set.
As you can see, after calling set_tips
, the warning that fit
etc.
will not work is gone.
Now let’s compute the likelihood, gradient, and Hessian of this model.
cat('Ground-truth parameters:\n')
print(par_truth)
cat('Likelihood:\n')
print(lik(mod)(par_truth))
cat('Gradient:\n')
print(grad(mod)(par_truth))
cat('Hessian:\n')
print(hess(mod)(par_truth))
Ground-truth parameters:
H1 H2 H3 theta1 theta2 sig_x1 sig_x2 sig_x3
0.00 0.00 0.00 0.00 0.00 -0.35 0.00 -0.35
Likelihood:
[1] -400
Gradient:
[1] 4.3 -6.2 -27.7 -9.9 -14.9 -12.8 -5.9 35.2
Hessian:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] -545 -29.4 0.0 99 0 432.2 13.8 0.0
[2,] -29 -289.0 -23.1 -13 50 2.7 328.3 9.7
[3,] 0 -23.1 -546.9 0 -26 0.0 3.9 496.3
[4,] 99 -13.0 0.0 -505 0 19.8 21.1 0.0
[5,] 0 49.5 -26.0 0 -505 0.0 14.0 29.9
[6,] 432 2.7 0.0 20 0 -574.3 5.9 0.0
[7,] 14 328.3 3.9 21 14 5.9 -574.3 11.9
[8,] 0 9.7 496.3 0 30 0.0 11.9 -670.4
The maximum likelihood estimates can be obtained by calling the
fit.glinv
method. We use the zero vector as the optimisation routine’s
initialisation:
par_init = par_truth
par_init[] = 0.
fitted = fit(mod, par_init)
print(fitted)
$mlepar
H1 H2 H3 theta1 theta2 sig_x1 sig_x2 sig_x3
-0.0247 -0.1219 -0.0058 -0.0350 -0.0485 -0.3860 -0.0783 -0.2966
$score
H1 H2 H3 theta1 theta2 sig_x1 sig_x2 sig_x3
-0.001171 -0.000099 -0.000025 0.000358 0.000461 0.000175 -0.000051 0.000109
$loglik
[1] -398
$counts
[1] 31 31
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH"
Once the model is fitted, one can estimate the variance-covariance
matrix of the maximum-likelihood estimator using varest
.
v_estimate = varest(mod, fitted)
The marginal confidence interval can be obtained by calling
marginal_ci
on the object returned by varest
.
print(marginal_ci(v_estimate, lvl=0.99))
Lower Upper
H1 -0.21 0.161
H2 -0.42 0.172
H3 -0.23 0.214
theta1 -0.17 0.097
theta2 -0.18 0.083
sig_x1 -0.56 -0.215
sig_x2 -0.27 0.118
sig_x3 -0.49 -0.104
It seems that these confidence intervals is indeed covering the ground truth.
Let’s assume we have the same data k
, tr
, X
, x0
as generated
before but we fit a simple Brownian motion model instead. To make a
Brownian motion model, we can call the following:
repar_brn = get_restricted_ou(H='zero', theta='zero', Sig=NULL, lossmiss=NULL)
mod_brn = glinv(tr, x0, X[[1]], repar=repar_brn)
print(mod_brn)
A GLInv model with 1 regimes and 3 parameters divided into 1 blocks,
all of which are associated to the only one existing regime.
The phylogeny has 300 tips and 299 internal nodes. Tip values are already set.
As you may might have already guessed, H='zero'
above means that we
restrict the drift matrix term of the OU to be a zero matrix. In this
case, theta
, the evolutionary optimum, is meaningless. In this case,
the package has a convention that in a Brownian motion we always have
H='zero', theta='zero'
.
The following calls demonstrates how to compute the likelihood:
par_init_brn = c(sig_x=sig_x)
print(c(likelihood=lik(mod_brn)(par_init_brn)))
likelihood
-536
The user can obtain the an MLE fit by calling
fit(mod_brn, par_init_brn)
. The marginal CI and the estimator’s
variance can be obtained in exactly the same way as in the OU example.
But just in this example, keep in mind that we have previously generated
X
using an OU process rather than from a Brownian motion.
Out of box, the package allows missing data in the tip trait matrix, as well as allowing multiple revolutionary regimes.
A ‘missing’ trait refers to a trait value whose data is missing due to
data collection problems. Fundamentally, they evolves in the same manner
as other traits. An NA
entry in the trait matrix X
is deemed
‘missing’. A lost trait is a trait dimension which had ceased to exists
during the evolutionary process. An NaN
entry in the data indicates a
‘lost’ trait. The package provides two different ways of handling lost
traits. For more details about how missingness is handled, the users
should read ?ou_haltlost
.
In this example, we demonstrate how to fit a model with two regimes, and
some missing data and lost traits. Assume the phylogeny is the same as
before but some entries of X
is NA
or NaN
. First, let’s
arbitrarily set some entries of X
to missingness, just for the purpose
of demonstration.
X[[1]][2,c(1,2,80,95,130)] = NA
X[[1]][1,c(180:200)] = NaN
The following call constructs a model object in which three evolutionary regimes are present: the first regime starts at the root and it has a Brownian motion evolution; the second regime starts at the beginning of the edge that leads to internal node number 390 and it has an OU process; and the third regime starts at the beginning of the edge that leads to internal node number 502 and it has an OU process that shares the same underlying ground-truth parameters as the second regime. In other words, there are three blocks of parameters. In a binary tree with N tips, the root’s node number is N+1; in other words, in our case, the root node number is 301. The following code constructs such a model:
repar_a = get_restricted_ou(H='logdiag', theta=NULL, Sig=NULL, lossmiss='halt')
repar_b = get_restricted_ou(H='zero', theta='zero', Sig=NULL, lossmiss='halt')
mod_tworeg = glinv(tr, x0, X[[1]], repar=list(repar_a, repar_b),
regimes = list(c(start=301, fn=2),
c(start=390, fn=1),
c(start=502, fn=1)))
# A long-form alternative way to do the same is this:
# mod_tworeg = glinv(tr, x0, X[[1]],
# pardims = list(repar_a$nparams(k), repar_b$nparams(k)),
# parfns = list(repar_a$par, repar_b$par),
# parjacs = list(repar_a$jac, repar_b$jac),
# parhess = list(repar_a$hess, repar_b$hess),
# regimes = list(c(start=301, fn=2),
# c(start=390, fn=1),
# c(start=502, fn=1)))
print(mod_tworeg)
A GLInv model with 3 regimes and 10 parameters divided into 2 blocks, whose
1-7th parameters are asociated with regime no. {2,3};
8-10th parameters are asociated with regime no. {1},
where
regime #1 starts from the branch that leads to node #301, which is the root;
regime #2 starts from the branch that leads to node #390;
regime #3 starts from the branch that leads to node #502.
The phylogeny has 300 tips and 299 internal nodes. Tip values are already set.
In the above, we have defined three regimes and two stochastic
processes. The repar
argument, or alternatively the pardims
,
parfns
, parjacs
, and parhess
argument, specifies the two
stochastic processes and the regime
parameter can be thought of as
‘drawing the lines’ to match each regime to a seperately defined
stochastic process. The start
element in the list specifies the node
number at which a regime starts, and the fn
element is an index to the
list passed to repar
. In this example, the first regime starts at the
root and uses repar_b
. If multiple regimes share the same fn
index
then it means that they shares both the underlying stochastic process
and the parameters. lossmiss='halt'
specifies how the lost traits (the
NaN
) are handled.
To compute the likelihood and initialize for optimisation, one need to
notice the input parameters’ format. When repar
(or alternatively
parfns
etc.) has more than one elements, the parameter vector that
lik
and fit
etc. accept is simply assumed to be the concatenation of
all of its elements’ parameters. The following example should illustrate
this.
logdiagH = c(0,0) # Meaning H is the identity matrix
theta = c(1,0)
Sig_x_ou = c(0,0,0) # Meaning Sigma is the identity matrix
Sig_x_brn = c(0,0,0) # The Brownian motion's parameters
print(lik(mod_tworeg)(c(logdiagH, theta, Sig_x_ou, Sig_x_brn)))
[1] -623
To write custom models, we cannot use the glinv(..., repar=repar)
shortcut anymore. Instead, one should use the parfns
, parjacs
, and
parhess
arguments.
It is important to note that the parfns
, parjacs
, and parhess
arguments to glinv()
are simply R functions, which the user can either
create themselves or obtain from calling get_restricted_ou()
, which is
simply a convenient helper function for making the likelihood, Jacobian,
Hessian functions automatically. Rather than writing all these from
scratch by yourself, it is often possible to customize a model is to
take the functions returned by get_restricted_ou()
and extending them.
In this example, we familiarize ourselves with these functions’s input
and output format and write a custom OU model with diagonal drift matrix
and a diagonal additive measurement error on each tips. The additive
measurement error could be estimated together if one wish (but
estimating all these together may require a fairly large tree).
Nonetheless, to investigate the format of the mentioned three functions,
first, we obtain the reparametrization likelihood, Jacobian, Hessian,
and number of parameter functions using get_restricted_ou()
:
repar = get_restricted_ou(H='diag', theta=NULL, Sig=NULL, lossmiss='halt')
print(sapply(repar, class))
par jac hess nparams
"function" "function" "function" "function"
We will deal with nparams later and let’s look at the other three first.
Recall that in the long-form calls in the previous example, repar$par
is always passed to parfns
, repar$jac
always to parjacs
and
repar$hess
to parhess
.Mathematically, the three functions map the OU
process parameters to (Φₖ,wₖ,Vₖ)ₖ, where k are the nodes. The input
format of all three of them are the same. They accepts four parameters,
and as an example, we could call
print(repar$par(c(1,1,0,0,0,0,0), 0.1, c('OK','OK'), c('OK','OK')))
[1] 0.905 0.000 0.000 0.905 0.000 0.000 0.091 0.000 0.091
In the call above:
The first argument passed to repar$par
is the parameters of the OU
model, with H being the identity matrix, represented by (1,1), the
evolutionary optimum θ being the 2D zero vector, represented by
(0,0), and Σ being the identity matrix, represented by (0,0,0).
Therefore concatenated together we have c(1,1,0,0,0,0,0)
.
The second argument is the branch length leading to node k.
The third argument is a vector of factors or string with three
levels OK
, LOST
, MISSING
, indicating which dimensions are
missing or lost in the mother node of node k. In our case, the
length of this vector is two because the we have two trait
dimensions; two OK
’s means that both the traits are “normal”,
neither missing nor lost.
The fourth argument is a vector of factors or string with the same three levels indicating the missingness of the node k. The format is the same as the third argument.
The return value is a concatenation of (Φₖ,wₖ,Vₖ), flattened in column-major order, which is the R default. This means that Φ is 0.905 times the identity matrix; w is the 2D zero vector and V is 0.091 times the identity matrix.
The repar$jac
function simply returns the Jacobian matrix of
repar$par
:
print(repar$jac(c(1,1,0,0,0,0,0), 0.1, c('OK','OK'), c('OK','OK')))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -0.0905 0.0000 0.000 0.000 0.00 0.000 0.00
[2,] 0.0000 0.0000 0.000 0.000 0.00 0.000 0.00
[3,] 0.0000 0.0000 0.000 0.000 0.00 0.000 0.00
[4,] 0.0000 -0.0905 0.000 0.000 0.00 0.000 0.00
[5,] 0.0000 0.0000 0.095 0.000 0.00 0.000 0.00
[6,] 0.0000 0.0000 0.000 0.095 0.00 0.000 0.00
[7,] -0.0088 0.0000 0.000 0.000 0.18 0.000 0.00
[8,] 0.0000 0.0000 0.000 0.000 0.00 0.091 0.00
[9,] 0.0000 -0.0088 0.000 0.000 0.00 0.000 0.18
The repar$hess
function also accepts the same argument but its return
values have a slightly different format:
tmp = repar$hess(c(1,1,0,0,0,0,0), 0.1, c('OK','OK'), c('OK','OK'))
print(names(tmp))
[1] "V" "w" "Phi"
print(sapply(tmp, dim))
V w Phi
[1,] 3 2 4
[2,] 7 7 7
[3,] 7 7 7
Notice that repar$hess
returns a list containing three elements, V
,
w
, and Phi
, each being a three-dimensional array. They contain all
the second-order partial derivatives of the repar$par
function, with
tmp$V[m,i,j]
containing ∂²Vₘ/∂ηᵢ∂ηⱼ, tmp$w[m,i,j]
containing
∂²wₘ/∂ηᵢ∂ηⱼ and tmp$Phi[m,i,j]
containing ∂²Φₘ/∂ηᵢ∂ηⱼ, where η denotes
the vector of parameters that repar$par
accepts and m means the index
of the matrices but not the node numbers. For example, in our situation,
tmp$w[2,3,4]
contains ∂²w₂/∂θ₁∂θ₂ and tmp$Phi[3,2,3]
is ∂²Φ₂₁/∂
H₂₂∂θ₁.
Having understood their input and output, we are now ready to make a
custom model. In this custom model, we assume that all species evolve
exactly the same as specified in repar, but we cannot measure the traits
at the tip accurately. To take into account this measurement error, we
add an uncorrelated Gaussian error at each tip. We assume that the tree
has a 800 tips in this example for simplicity. First, we extend
repar$par
to accept our additional parameters:
my_par = function (par, ...) {
phiwV = repar$par(par[1:7], ...)
if (INFO__$node_id > 800) # If not tip just return the original
return(phiwV)
Sig_e = diag(par[8:9]) # Our measurement error matrix
phiwV[7:9] = phiwV[7:9] + Sig_e[lower.tri(Sig_e, diag=T)]
phiwV
}
Note that we have accessed the node ID using INFO__$node_id
. In our
package, “node IDs” means the same thing as the node numbers in the
ape
package, hence the nodes with ID 1-300 are the tips and the rest
are the internal nodes. The INFO__
object is neither a global variable
nor an argument but a variable that lives in function’s enclosing
environment. Now let’s define the Jacobian function, which contains the
same Jacobian as the original no-measurement-error Jacobian, but with
some extra entries that are either one or zero.
my_jac = function (par, ...) {
new_jac = matrix(0.0, 9, 9)
new_jac[,1:7] = repar$jac(par[1:7], ...)
if (INFO__$node_id <= 800)
new_jac[7,8] = new_jac[9,9] = 1.0
new_jac
}
The Hessian matrix of our modified model is actually unchanged except that there are more zero entries, because the new parameters are simply a linear sum.
my_hess = function (par, ...)
lapply(repar$hess(par[1:7], ...), function (H) {
newH = array(0.0, dim=c(dim(H)[1], 9, 9))
newH[,1:7,1:7] = H[,,] # Copy the original part
newH # Other entries are just zero
})
Finally, we actually do not need to write our own repar$nparams
, which
accepts the number of trait dimensions and returns the number of
parameters, beacuase we know exactly we have 9 parameters in our
example. Now we can construct our custom model:
# Simulate a tree with 800 tips
set.seed(777)
tr = ape::rtree(800)
mod_measerr = glinv(tr, x0, NULL,
pardims = 9,
parfns = my_par,
parjacs = my_jac,
parhess = my_hess)
print(mod_measerr)
Now let’s make a ground truth parameter value, generate some random data and fit the model:
par_measerr_truth = c(H1=0.2, H2=0.5,
theta1=-1, theta2=1,
sig_x1=0, sig_x2=0, sig_x3=0,
sig_e1=0.4, sig_e2=0.8)
set.seed(999)
X_measerr = rglinv(mod_measerr, par_measerr_truth, Nsamp=1)
set_tips(mod_measerr, X_measerr[[1]])
## Fit the model
par_measerr_init = par_measerr_truth
par_measerr_init[] = 1.0
fitted_measerr = fit(mod_measerr, par_measerr_init,
method='BB', ## Try out different opt. methods
lower=c(H1=-Inf, H2=-Inf,
theta1=-Inf, theta2=-Inf,
sig_x1=-Inf, sig_x2=-Inf, sig_x3=-Inf,
sig_e1=1e-9, sig_e2=1e-9))
vest_measerr = varest(mod_measerr, fitted_measerr$mlepar)
confint = marginal_ci(vest_measerr, lvl=0.95)
cat('-- ESTIMATES --\n')
print(fitted_measerr)
cat('-- CONF. INTERVALS --\n')
print(confint)
-- ESTIMATES --
$mlepar
H1 H2 theta1 theta2 sig_x1 sig_x2 sig_x3 sig_e1 sig_e2
0.1467 0.3374 -1.8662 0.9153 -0.0407 -0.0039 -0.2291 0.4107 0.9276
$loglik
[1] -2614
$fn.reduction
[1] 498
$iter
[1] 207
$feval
[1] 208
$convergence
[1] 0
$message
[1] "Successful convergence"
$cpar
method M
2 50
$score
[1] -0.000042 -0.000207 -0.000413 0.000126 -0.000024 0.000062 -0.000058
[8] 0.000023 -0.000348
-- CONF. INTERVALS --
Lower Upper
H1 0.07 0.22
H2 0.14 0.54
theta1 -3.06 -0.67
theta2 0.62 1.21
sig_x1 -0.19 0.11
sig_x2 -0.11 0.10
sig_x3 -0.56 0.11
sig_e1 0.25 0.57
sig_e2 0.69 1.17