We briefly present the main optimization methods. Please refer to Numerical Optimization (Nocedal & Wright, 2006) or Numerical Optimization: Theoretical and Practical Aspects (Bonnans, Gilbert, Lemaréchal & Sagastizábal, 2006) for a good introduction. We consider the following problem \min_{x\in\mathbb{R}^n} f(x).
The Nelder-Mead method is one of the most well-known derivative-free methods, which uses only values of f to search for the minimum. It consists in building a simplex of n+1 points and moving/shrinking this simplex in the right direction.
The Nelder-Mead method is available in optim. By default, in optim, \alpha=1, \beta=1/2, \gamma=2 and \sigma=1/2.
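As an illustration, here is a minimal sketch of a Nelder-Mead call through optim; the Rosenbrock function fr is only a toy objective chosen for this example, and alpha, beta, gamma are the reflection, contraction and expansion factors.
# minimal sketch: Nelder-Mead via optim() on a toy objective (Rosenbrock function)
fr <- function(x) 100*(x[2] - x[1]^2)^2 + (1 - x[1])^2
optim(c(-1.2, 1), fr, method = "Nelder-Mead",
      control = list(alpha = 1, beta = 0.5, gamma = 2, maxit = 500))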
For smooth non-linear functions, the following approach is generally used: a local method combined with a line search, working on the scheme x_{k+1} = x_k + t_k d_k, where the local method specifies the direction d_k and the line search specifies the step size t_k \in \mathbb{R}.
A desirable property for d_k is that it ensures a descent, f(x_{k+1}) < f(x_k). Newton methods are such that d_k minimizes a local quadratic approximation of f based on a Taylor expansion, that is q_f(d) = f(x_k) + g(x_k)^T d +\frac{1}{2} d^T H(x_k) d, where g denotes the gradient and H denotes the Hessian.
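Indeed, minimizing q_f amounts to solving \nabla q_f(d) = g(x_k) + H(x_k) d = 0, that is H(x_k) d = -g(x_k), provided H(x_k) is positive definite.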
The Newton method consists in using the exact solution of the local minimization problem, d_k = - H(x_k)^{-1} g(x_k).
In practice, other methods are preferred (at least to ensure positive definiteness).
Quasi-Newton methods approximate the Hessian by a matrix H_k expressed as a function of H_{k-1}, x_k, f(x_k), and then d_k solves the system H_k d = - g(x_k).
Some implementations may also directly approximate the inverse of the Hessian by a matrix W_k in order to compute d_k = -W_k g(x_k). Using the Sherman-Morrison-Woodbury formula, we can switch between W_k and H_k.
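We recall the rank-1 Sherman-Morrison formula: (A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}, valid whenever 1 + v^T A^{-1} u \neq 0.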
To determine H_k (or W_k), it must first verify the secant equation H_k s_k = y_k, or equivalently s_k = W_k y_k, where y_k = g_{k+1}-g_k and s_k=x_{k+1}-x_k. To pin down the remaining terms, we generally impose a symmetry and a minimum-distance condition. We say we have a rank-2 update if H_k = H_{k-1} + a u u^T + b v v^T and a rank-1 update if H_k = H_{k-1} + a u u^T. Rank-n updates are justified by the spectral decomposition theorem.
There are two rank-2 updates which are symmetric and preserve positive definiteness: the Davidon-Fletcher-Powell (DFP) update and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update.
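For reference, the classical BFGS update of the Hessian approximation reads H_{k+1} = H_k - \frac{H_k s_k s_k^T H_k}{s_k^T H_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}.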
In R, the so-called BFGS scheme is implemented in optim.
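As an illustration, a minimal BFGS call (reusing the toy objective fr above, together with its analytic gradient) could look like:
# minimal sketch: BFGS via optim(), with an optional analytic gradient
grr <- function(x) c(-400*x[1]*(x[2] - x[1]^2) - 2*(1 - x[1]),
                     200*(x[2] - x[1]^2))
optim(c(-1.2, 1), fr, grr, method = "BFGS")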
Another possible method (which originally arose from quadratic problems) is the nonlinear conjugate gradient method. It consists in computing directions (d_0, \dots, d_k) that are conjugate with respect to a matrix close to the true Hessian H(x_k). Directions are computed iteratively by d_k = -g(x_k) + \beta_k d_{k-1} for k>1, once initiated by d_1 = -g(x_1). The coefficient \beta_k is updated according to a given scheme, typically the Fletcher-Reeves update \beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}} or the Polak-Ribiere update \beta_k = \frac{g_k^T (g_k - g_{k-1})}{g_{k-1}^T g_{k-1}}.
There also exist three-term formulas for computing the direction, d_k = -g(x_k) + \beta_k d_{k-1}+\gamma_{k} d_t for some t<k.
A possible scheme is the Beale-Sorenson update, defined as \beta_k = \frac{ g_k^T (g_k-g_{k-1} )}{d^T_{k-1}(g_{k}- g_{k-1})} and \gamma_k = \frac{ g_k^T (g_{t+1}-g_{t} )}{d^T_{t}(g_{t+1}- g_{t})} if k>t+1, and \gamma_k=0 otherwise.
See Yuan (2006) for other well-known schemes such as Hestenes-Stiefel, Dixon or Conjugate Descent.
The three updates (Fletcher-Reeves, Polak-Ribiere, Beale-Sorenson) of the (non-linear) conjugate gradient are available in optim.
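As an illustration (reusing fr and grr above), the update scheme is selected through the type component of the control list:
# minimal sketch: the three conjugate-gradient updates in optim()
optim(c(-1.2, 1), fr, grr, method = "CG", control = list(type = 1)) # Fletcher-Reeves
optim(c(-1.2, 1), fr, grr, method = "CG", control = list(type = 2)) # Polak-Ribiere
optim(c(-1.2, 1), fr, grr, method = "CG", control = list(type = 3)) # Beale-Sorenson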
Let \phi_k(t) = f(x_k + t d_k) for a given direction/iterate (d_k, x_k). We need conditions for a satisfactory step size t_k. In the literature, we consider the descent condition \phi_k'(0) < 0 and the Armijo condition \phi_k(t) \leq \phi_k(0) + t c_1 \phi_k'(0), which ensures a decrease of f. Nocedal & Wright (2006) present a backtracking (or geometric) approach satisfying the Armijo condition and a minimal-step condition, i.e. the Goldstein and Price condition.
This backtracking line search is available in optim.
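For illustration only, a minimal backtracking sketch enforcing the Armijo condition could be written as follows (this is not optim's internal code; f, g, x, d, c1 and rho are the generic ingredients described above):
# minimal sketch of a backtracking line search satisfying the Armijo condition
backtrack <- function(f, g, x, d, c1 = 1e-4, rho = 0.5, t = 1)
{
  phi0  <- f(x)
  slope <- sum(g(x) * d)   # phi'(0) = g(x)^T d, assumed negative (descent direction)
  while (f(x + t * d) > phi0 + c1 * t * slope)
    t <- rho * t           # geometric reduction of the step size
  t
}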
To simplify the benchmark of optimization methods, we create a fitbench function that computes the desired estimation method for all optimization methods. This function is currently not exported in the package.
fitbench <- function(data, distr, method, grad = NULL,
                     control = list(trace = 0, REPORT = 1, maxit = 1000),
                     lower = -Inf, upper = +Inf, ...)
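Since the function is not exported, it is retrieved with the ::: operator (assuming the internal name fitbench):
fitbench <- fitdistrplus:::fitbench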
The density of the beta distribution is given by f(x; \delta_1,\delta_2) = \frac{x^{\delta_1-1}(1-x)^{\delta_2-1}}{\beta(\delta_1,\delta_2)}, where \beta denotes the beta function, see the NIST Handbook of Mathematical Functions (https://dlmf.nist.gov/). We recall that \beta(a,b)=\Gamma(a)\Gamma(b)/\Gamma(a+b). Hence the log-likelihood for a set of observations (x_1,\dots,x_n) is \log L(\delta_1,\delta_2) = (\delta_1-1)\sum_{i=1}^n\log(x_i)+ (\delta_2-1)\sum_{i=1}^n\log(1-x_i) - n \log(\beta(\delta_1,\delta_2)). The gradient with respect to \delta_1 and \delta_2 is \nabla \log L(\delta_1,\delta_2) = \left(\begin{matrix} \sum\limits_{i=1}^n\log(x_i) - n\psi(\delta_1)+n\psi( \delta_1+\delta_2) \\ \sum\limits_{i=1}^n\log(1-x_i)- n\psi(\delta_2)+n\psi( \delta_1+\delta_2) \end{matrix}\right), where \psi(x)=\Gamma'(x)/\Gamma(x) is the digamma function, see the NIST Handbook of Mathematical Functions (https://dlmf.nist.gov/).
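As a sketch, the gradient formula above can be transcribed directly in R; grbeta below is a hypothetical helper written for illustration, not the internal function used by the package.
# minimal sketch: gradient of the beta log-likelihood, following the formula above
grbeta <- function(par, obs)
{
  d1 <- par[1]; d2 <- par[2]; n <- length(obs)
  c(sum(log(obs))     - n*digamma(d1) + n*digamma(d1 + d2),
    sum(log(1 - obs)) - n*digamma(d2) + n*digamma(d1 + d2))
}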
R implementation

As in the fitdistrplus package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in grlnlbeta. Neither the log-likelihood nor its gradient is exported.
lnL <- function(par, fix.arg, obs, ddistnam)
  fitdistrplus:::loglikelihood(par, fix.arg, obs, ddistnam)
grlnlbeta <- fitdistrplus:::grlnlbeta
#(1) beta distribution
n <- 200
x <- rbeta(n, 3, 3/4)
grlnlbeta(c(3, 4), x) #test
## [1] -133 317
hist(x, prob=TRUE, xlim=0:1)
lines(density(x), col="red")
curve(dbeta(x, 3, 3/4), col="green", add=TRUE)
legend("topleft", lty=1, col=c("red","green"), legend=c("empirical", "theoretical"), bty="n")
Define control parameters.
ctr <- list(trace=0, REPORT=1, maxit=1000)
Call mledist with the default optimization function (optim implemented in the stats package), with and without the gradient, for the different optimization methods.
unconstropt <- fitbench(x, "beta", "mle", grad=grlnlbeta, lower=0)
## BFGS NM CGFR CGPR CGBS L-BFGS-B NM-B G-BFGS
## 14 14 14 14 14 14 14 14
## G-CGFR G-CGPR G-CGBS G-BFGS-B G-NM-B G-CGFR-B G-CGPR-B G-CGBS-B
## 14 14 14 14 14 14 14 14
In the case of constrained optimization, mledist permits the direct use of the constrOptim function (also implemented in the stats package), which allows linear inequality constraints by using a logarithmic barrier. Here we instead use an exp/log transformation of the shape parameters \delta_1 and \delta_2 to ensure that the shape parameters are strictly positive.
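To see why the gradient is rescaled in the code below, note that for \eta_j = \log \delta_j the chain rule gives \partial \log L/\partial \eta_j = \delta_j \, \partial \log L/\partial \delta_j = \exp(\eta_j) \, \partial \log L/\partial \delta_j, which is the exp(par) factor used in grbetaexp.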
dbeta2 <- function(x, shape1, shape2, log)
  dbeta(x, exp(shape1), exp(shape2), log=log)
#take the log of the starting values
startarg <- lapply(fitdistrplus:::startargdefault(x, "beta"), log)
#redefine the gradient for the new parametrization
grbetaexp <- function(par, obs, ...)
  grlnlbeta(exp(par), obs) * exp(par)

expopt <- fitbench(x, distr="beta2", method="mle", grad=grbetaexp, start=startarg)
## BFGS NM CGFR CGPR CGBS G-BFGS G-CGFR G-CGPR G-CGBS
## 14 14 14 14 14 14 14 14 14
#get back to original parametrization
expopt[c("fitted shape1", "fitted shape2"), ] <- exp(expopt[c("fitted shape1", "fitted shape2"), ])
Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function to minimize and to its gradient (whether it is the theoretical gradient or the numerically approximated one).
Results are displayed in the following tables:
(1) the original parametrization without specifying the gradient (-B stands for bounded version),
(2) the original parametrization with the (true) gradient (-B stands for bounded version and G- for gradient),
(3) the log-transformed parametrization without specifying the gradient,
(4) the log-transformed parametrization with the (true) gradient (G- stands for gradient).
 | BFGS | NM | CGFR | CGPR | CGBS | L-BFGS-B | NM-B |
---|---|---|---|---|---|---|---|
fitted shape1 | 2.665 | 2.664 | 2.665 | 2.665 | 2.665 | 2.665 | 2.665 |
fitted shape2 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 |
fitted loglik | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 |
func. eval. nb. | 15.000 | 47.000 | 191.000 | 221.000 | 235.000 | 8.000 | 92.000 |
grad. eval. nb. | 11.000 | NA | 95.000 | 115.000 | 171.000 | 8.000 | NA |
time (sec) | 0.005 | 0.004 | 0.033 | 0.039 | 0.047 | 0.004 | 0.010 |
 | G-BFGS | G-CGFR | G-CGPR | G-CGBS | G-BFGS-B | G-NM-B | G-CGFR-B | G-CGPR-B | G-CGBS-B |
---|---|---|---|---|---|---|---|---|---|
fitted shape1 | 2.665 | 2.665 | 2.665 | 2.665 | 2.665 | 2.665 | 2.665 | 2.665 | 2.665 |
fitted shape2 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 |
fitted loglik | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 |
func. eval. nb. | 22.000 | 249.000 | 225.000 | 138.000 | 40.000 | 92.000 | 471.000 | 403.000 | 255.000 |
grad. eval. nb. | 5.000 | 71.000 | 69.000 | 43.000 | 6.000 | NA | 96.000 | 104.000 | 58.000 |
time (sec) | 0.007 | 0.068 | 0.066 | 0.041 | 0.017 | 0.019 | 0.129 | 0.131 | 0.077 |
 | BFGS | NM | CGFR | CGPR | CGBS |
---|---|---|---|---|---|
fitted shape1 | 2.665 | 2.664 | 2.665 | 2.665 | 2.665 |
fitted shape2 | 0.731 | 0.731 | 0.731 | 0.731 | 0.731 |
fitted loglik | 114.165 | 114.165 | 114.165 | 114.165 | 114.165 |
func. eval. nb. | 8.000 | 41.000 | 37.000 | 49.000 | 47.000 |
grad. eval. nb. | 5.000 | NA | 19.000 | 39.000 | 33.000 |
time (sec) | 0.004 | 0.004 | 0.008 | 0.013 | 0.012 |
 | G-BFGS | G-CGFR | G-CGPR | G-CGBS |
---|---|---|---|---|
fitted shape1 | 2.665 | 2.665 | 2.665 | 2.665 |
fitted shape2 | 0.731 | 0.731 | 0.731 | 0.731 |
fitted loglik | 114.165 | 114.165 | 114.165 | 114.165 |
func. eval. nb. | 21.000 | 175.000 | 146.000 | 112.000 |
grad. eval. nb. | 5.000 | 39.000 | 47.000 | 35.000 |
time (sec) | 0.009 | 0.044 | 0.049 | 0.036 |
Using llsurface, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).
llsurface(min.arg=c(0.1, 0.1), max.arg=c(7, 3), xlim=c(.1,7),
plot.arg=c("shape1", "shape2"), nlev=25,
lseq=50, data=x, distr="beta", back.col = FALSE)
points(unconstropt[1,"BFGS"], unconstropt[2,"BFGS"], pch="+", col="red")
points(3, 3/4, pch="x", col="green")
We can simulate bootstrap replicates using the bootdist function.
b1 <- bootdist(fitdist(x, "beta", method = "mle", optim.method = "BFGS"),
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
## Parametric bootstrap medians and 95% percentile CI
## Median 2.5% 97.5%
## shape1 2.73 2.272 3.283
## shape2 0.75 0.652 0.888
plot(b1, trueval = c(3, 3/4))
The p.m.f. of the negative binomial distribution is given by f(x; m,p) = \frac{\Gamma(x+m)}{\Gamma(m)x!} p^m (1-p)^x, where \Gamma denotes the gamma function, see the NIST Handbook of Mathematical Functions (https://dlmf.nist.gov/). There exists an alternative representation where \mu=m (1-p)/p or equivalently p=m/(m+\mu). Thus, the log-likelihood for a set of observations (x_1,\dots,x_n) is \log L(m,p) = \sum_{i=1}^{n} \log\Gamma(x_i+m) -n\log\Gamma(m) -\sum_{i=1}^{n} \log(x_i!) + mn\log(p) +\sum_{i=1}^{n} {x_i}\log(1-p). The gradient with respect to m and p is \nabla \log L(m,p) = \left(\begin{matrix} \sum_{i=1}^{n} \psi(x_i+m) -n \psi(m) + n\log(p) \\ mn/p -\sum_{i=1}^{n} {x_i}/(1-p) \end{matrix}\right), where \psi(x)=\Gamma'(x)/\Gamma(x) is the digamma function, see the NIST Handbook of Mathematical Functions (https://dlmf.nist.gov/).
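For instance, with m=10 and p=3/4 (the true values used below), \mu = m(1-p)/p = 10 \times (1/4)/(3/4) = 10/3.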
R implementation

As in the fitdistrplus package, we minimize the opposite of the log-likelihood: we implement the opposite of the gradient in grlnlNB.
grlnlNB <- function(x, obs, ...)
{
  m <- x[1]
  p <- x[2]
  n <- length(obs)
  c(sum(psigamma(obs+m)) - n*psigamma(m) + n*log(p),
    m*n/p - sum(obs)/(1-p))
}
#(2) negative binomial distribution
n <- 200
trueval <- c("size"=10, "prob"=3/4, "mu"=10/3)
x <- rnbinom(n, trueval["size"], trueval["prob"])
hist(x, prob=TRUE, ylim=c(0, .3), xlim=c(0, 10))
lines(density(x), col="red")
points(min(x):max(x), dnbinom(min(x):max(x), trueval["size"], trueval["prob"]),
col = "green")
legend("topright", lty = 1, col = c("red", "green"),
legend = c("empirical", "theoretical"), bty="n")
Define control parameters and make the benchmark.
ctr <- list(trace = 0, REPORT = 1, maxit = 1000)
unconstropt <- fitbench(x, "nbinom", "mle", grad = grlnlNB, lower = 0)
## BFGS NM CGFR CGPR CGBS L-BFGS-B NM-B G-BFGS
## 14 14 14 14 14 14 14 14
## G-CGFR G-CGPR G-CGBS G-BFGS-B G-NM-B G-CGFR-B G-CGPR-B G-CGBS-B
## 14 14 14 14 14 14 14 14
unconstropt <- rbind(unconstropt,
  "fitted prob" = unconstropt["fitted mu", ] / (1 + unconstropt["fitted mu", ]))
In the case of constrained optimization, mledist permits the direct use of the constrOptim function (also implemented in the stats package), which allows linear inequality constraints by using a logarithmic barrier. Here we instead use an exp transformation of the size parameter and a logit transformation of the probability parameter to ensure that they stay in their domains (size > 0 and 0 < prob < 1).
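The gradient in the new parametrization again follows from the chain rule: for size = \exp(\eta_1) the factor is \exp(\eta_1), and for prob = 1/(1+e^{-\eta_2}) (the logistic function) the factor is \mathrm{plogis}(\eta_2)(1-\mathrm{plogis}(\eta_2)); these are the multipliers used in grNBexp below.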
dnbinom2 <- function(x, size, prob, log)
  dnbinom(x, exp(size), 1 / (1 + exp(-prob)), log = log)
# transform starting values: convert the mu starting value to a prob value via p = m/(m+mu),
# then move both parameters to the log/logit scale
startarg <- fitdistrplus:::startargdefault(x, "nbinom")
startarg$mu <- startarg$size / (startarg$size + startarg$mu)
startarg <- list(size = log(startarg[[1]]),
                 prob = log(startarg[[2]] / (1 - startarg[[2]])))
# redefine the gradient for the new parametrization (chain-rule factors)
Trans <- function(x)
  c(exp(x[1]), plogis(x[2]))
grNBexp <- function(par, obs, ...)
  grlnlNB(Trans(par), obs) * c(exp(par[1]), plogis(par[2])*(1-plogis(par[2])))

expopt <- fitbench(x, distr="nbinom2", method="mle", grad=grNBexp, start=startarg)
## BFGS NM CGFR CGPR CGBS G-BFGS G-CGFR G-CGPR G-CGBS
## 14 14 14 14 14 14 14 14 14
# get back to original parametrization
expopt[c("fitted size", "fitted prob"), ] <-
  apply(expopt[c("fitted size", "fitted prob"), ], 2, Trans)
Then we extract the values of the fitted parameters, the value of the corresponding log-likelihood, and the number of calls to the function to minimize and to its gradient (whether it is the theoretical gradient or the numerically approximated one).
Results are displayed in the following tables:
(1) the original parametrization without specifying the gradient (-B stands for bounded version),
(2) the original parametrization with the (true) gradient (-B stands for bounded version and G- for gradient),
(3) the log-transformed parametrization without specifying the gradient,
(4) the log-transformed parametrization with the (true) gradient (G- stands for gradient).
 | BFGS | NM | CGFR | CGPR | CGBS | L-BFGS-B | NM-B |
---|---|---|---|---|---|---|---|
fitted size | 57.333 | 62.504 | 57.337 | 57.335 | 57.335 | 57.333 | 58.446 |
fitted mu | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 |
fitted loglik | -402.675 | -402.674 | -402.675 | -402.675 | -402.675 | -402.675 | -402.675 |
func. eval. nb. | 2.000 | 39.000 | 2001.000 | 1001.000 | 1001.000 | 2.000 | 0.000 |
grad. eval. nb. | 1.000 | NA | 1001.000 | 1001.000 | 1001.000 | 2.000 | NA |
time (sec) | 0.001 | 0.002 | 0.188 | 0.163 | 0.190 | 0.001 | 0.003 |
fitted prob | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 |
 | G-BFGS | G-CGFR | G-CGPR | G-CGBS | G-BFGS-B | G-NM-B | G-CGFR-B | G-CGPR-B | G-CGBS-B |
---|---|---|---|---|---|---|---|---|---|
fitted size | 57.333 | 57.333 | 57.333 | 57.333 | 57.333 | 58.446 | 57.333 | 57.333 | 57.333 |
fitted mu | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 | 3.440 |
fitted loglik | -402.675 | -402.675 | -402.675 | -402.675 | -402.675 | -402.675 | -402.675 | -402.675 | -402.675 |
func. eval. nb. | 28.000 | 28.000 | 28.000 | 28.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
grad. eval. nb. | 1.000 | 1.000 | 1.000 | 1.000 | NA | NA | NA | NA | NA |
time (sec) | 0.014 | 0.001 | 0.002 | 0.001 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 |
fitted prob | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 | 0.775 |
 | BFGS | NM | CGFR | CGPR | CGBS |
---|---|---|---|---|---|
fitted size | 57.335 | 62.320 | 58.595 | 60.173 | 60.875 |
fitted prob | 0.943 | 0.948 | 0.945 | 0.946 | 0.947 |
fitted loglik | -402.675 | -402.674 | -402.675 | -402.674 | -402.674 |
func. eval. nb. | 3.000 | 45.000 | 2501.000 | 2273.000 | 2272.000 |
grad. eval. nb. | 1.000 | NA | 1001.000 | 1001.000 | 1001.000 |
time (sec) | 0.004 | 0.003 | 0.242 | 0.226 | 0.211 |
 | G-BFGS | G-CGFR | G-CGPR | G-CGBS |
---|---|---|---|---|
fitted size | 57.333 | 57.333 | 57.333 | 57.333 |
fitted prob | 0.943 | 0.943 | 0.943 | 0.943 |
fitted loglik | -402.675 | -402.675 | -402.675 | -402.675 |
func. eval. nb. | 18.000 | 44.000 | 39.000 | 39.000 |
grad. eval. nb. | 1.000 | 3.000 | 3.000 | 3.000 |
time (sec) | 0.007 | 0.002 | 0.002 | 0.002 |
Using llsurface, we plot the log-likelihood surface around the true value (green) and the fitted parameters (red).
llsurface(min.arg = c(5, 0.3), max.arg = c(15, 1), xlim=c(5, 15),
plot.arg = c("size", "prob"), nlev = 25,
lseq = 50, data = x, distr = "nbinom", back.col = FALSE)
points(unconstropt["fitted size", "BFGS"], unconstropt["fitted prob", "BFGS"],
pch = "+", col = "red")
points(trueval["size"], trueval["prob"], pch = "x", col = "green")
We can simulate bootstrap replicates using the bootdist function.
b1 <- bootdist(fitdist(x, "nbinom", method = "mle", optim.method = "BFGS"),
               niter = 100, parallel = "snow", ncpus = 2)
summary(b1)
## Parametric bootstrap medians and 95% percentile CI
## Median 2.5% 97.5%
## size 57.33 57.33 57.33
## mu 3.46 3.24 3.72
plot(b1, trueval=trueval[c("size", "mu")])
Based on the two previous examples, we observe that all methods converge to the same
point, which is reassuring.
However, the number of function evaluations (and of gradient evaluations) differs
greatly from one method to another.
Furthermore, specifying the true gradient of the log-likelihood does not
help the fitting procedure at all and generally slows down the convergence.
Generally, the best method is the standard BFGS method or the BFGS method
with the exponential transformation of the parameters.
Since the exponential function is differentiable, the asymptotic properties are
still preserved (by the delta method), but in finite samples this may introduce a small bias.
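For completeness, a one-line sketch of the delta-method argument: if \sqrt{n}(\hat\eta - \eta) \to \mathcal{N}(0, \sigma^2) and the parameter of interest is \theta = \exp(\eta), then \sqrt{n}(\exp(\hat\eta) - \exp(\eta)) \to \mathcal{N}(0, \exp(\eta)^2 \sigma^2), so confidence intervals computed on the transformed scale can be carried back to the original scale.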