How to select a local optimizer#

This guide explains how to choose a local optimizer that works well for your problem. If your strategy for global optimization builds on local optimizers, it is also relevant for global optimization problems.

Important facts#

  • There is no optimizer that works well for all problems

  • Making the right choice can lead to enormous speedups

  • Making the wrong choice can mean that you don’t solve your problem at all. Sometimes, optimizers fail silently!

The three steps for selecting algorithms#

Algorithm selection is a mix of theory and experimentation. We recommend the following steps:

  1. Theory: Based on the properties of your problem, start with 3 to 5 candidate algorithms. You may use the decision tree below.

  2. Experiments: Run the candidate algorithms for a small number of function evaluations and compare the results in a criterion plot. As a rule of thumb, use between n_params and 10 * n_params evaluations.

  3. Optimization: Re-run the algorithm with the best results until convergence. Use the best parameter vector from the experiments as start parameters.

We will walk you through the steps in an example below. These steps work well for most problems but sometimes you need variations.

A decision tree#

This is a practical guide for narrowing down the set of algorithms to experiment with:

    graph LR
    classDef highlight fill:#FF4500;
    A["Do you have<br/>nonlinear<br/>constraints?"] -- yes --> B["differentiable?"]
    B["Is your objective function differentiable?"] -- yes --> C["ipopt<br/>nlopt_slsqp<br/>scipy_trust_constr"]
    B["differentiable?"] -- no --> D["scipy_cobyla<br/>nlopt_cobyla"]

    A["Do you have<br/>nonlinear constraints?"] -- no --> E["Can you exploit<br/>a least-squares<br/>structure?"]
    E["Can you exploit<br/>a least-squares<br/>structure?"] -- yes --> F["differentiable?"]
    E["Can you exploit<br/>a least-squares<br/>structure?"] -- no --> G["differentiable?"]

    F["differentiable?"] -- yes --> H["scipy_ls_lm<br/>scipy_ls_trf<br/>scipy_ls_dogbox"]
    F["differentiable?"] -- no --> I["nag_dflos<br/>pounders<br/>tao_pounders"]

    G["differentiable?"] -- yes --> J["scipy_lbfgsb<br/>nlopt_lbfgsb<br/>fides"]
    G["differentiable?"] -- no --> K["nlopt_bobyqa<br/>nlopt_neldermead<br/>neldermead_parallel"]

    

Going through the different questions will give you a list of candidate algorithms. All algorithms in that list are designed for the same problem class but use different approaches to solve the problem. Which of them works best for your problem can only be found out through experimentation.

Note

Many books on numerical optimization focus strongly on the inner workings of algorithms. They will, for example, describe the difference between a trust-region algorithm and a line-search algorithm in a lot of detail. We have an intuitive explanation of this too. Understanding these details is important for configuring and troubleshooting optimizations, but not for algorithm selection. For example, if you have a scalar, differentiable problem without nonlinear constraints, the decision tree suggests fides and two variants of lbfgsb. fides is a trust-region algorithm; lbfgsb is a line-search algorithm. Both are designed to solve the same kinds of problems, and which one works best needs to be found out through experimentation.

Filtering algorithms#

An even more fine-grained version of the decision tree is built into optimagic’s algorithm selection tool, which can filter algorithms based on the properties of your problem. To make this concrete, assume we are looking for a local optimizer for a differentiable problem with a scalar objective function and bound constraints.

To find all algorithms that match our criteria, we can simply type:

import optimagic as om

om.algos.Local.GradientBased.Scalar.Bounded.All
[om.algos.fides,
 om.algos.ipopt,
 om.algos.nlopt_ccsaq,
 om.algos.nlopt_lbfgsb,
 om.algos.nlopt_mma,
 om.algos.nlopt_slsqp,
 om.algos.nlopt_tnewton,
 om.algos.nlopt_var,
 om.algos.scipy_lbfgsb,
 om.algos.scipy_slsqp,
 om.algos.scipy_truncated_newton,
 om.algos.scipy_trust_constr]

The available filters are: GradientBased, GradientFree, Global, Local, Bounded, LinearConstrained, NonlinearConstrained, Scalar, LeastSquares, Likelihood, and Parallel. You can apply them in any order you want. They are also discoverable, i.e., the autocomplete feature of your editor will show you all filters you can apply on top of your current selection.
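For example, the selection below applies the same filters as above, just in a different order, and should therefore produce the same list:

om.algos.Bounded.Scalar.GradientBased.Local.All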

Using .All after applying filters shows you all algorithms optimagic knows of that satisfy your criteria. Some of them require optional dependencies. To show only the algorithms that are available with the packages you have currently installed, use .Available instead of .All.
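For instance, the following (a minimal sketch of the same selection as above) lists only the matching algorithms whose dependencies are installed in your environment:

om.algos.Local.GradientBased.Scalar.Bounded.Available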

An example problem#

As an example we use the Trid function. The Trid function has no local minimum except the global one. It is defined for any number of dimensions; we will pick 20. As starting values we will pick the vector [0, 1, …, 19].
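For reference, the d-dimensional Trid function is defined as

$$
f(x) = \sum_{i=1}^{d} (x_i - 1)^2 - \sum_{i=2}^{d} x_i x_{i-1}
$$

which is exactly what the code below implements.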

A Python implementation of the function and its gradient looks like this:

import warnings

warnings.filterwarnings("ignore")
import numpy as np


def trid_scalar(x):
    """Implement Trid function: https://www.sfu.ca/~ssurjano/trid.html."""
    return ((x - 1) ** 2).sum() - (x[1:] * x[:-1]).sum()


def trid_gradient(x):
    """Calculate the gradient of the Trid function."""
    # l1[i] = x[i - 1], with 0 for the first entry (x shifted to the right).
    l1 = np.insert(x, 0, 0)
    l1 = np.delete(l1, [-1])
    # l2[i] = x[i + 1], with 0 for the last entry (x shifted to the left).
    l2 = np.append(x, 0)
    l2 = np.delete(l2, [0])
    # df / dx_i = 2 * (x_i - 1) - x_{i-1} - x_{i+1}
    return 2 * (x - 1) - l1 - l2
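As a quick sanity check (an addition to this guide, not required for the optimization), we can compare the analytical gradient with a central finite-difference approximation at a random point:

rng = np.random.default_rng(seed=0)
x_check = rng.normal(size=20)
eps = 1e-6
# Central finite differences along each coordinate direction.
fd_gradient = np.array(
    [
        (trid_scalar(x_check + eps * basis) - trid_scalar(x_check - eps * basis))
        / (2 * eps)
        for basis in np.eye(len(x_check))
    ]
)
assert np.allclose(fd_gradient, trid_gradient(x_check), atol=1e-5)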

Step 1: Theory#

Let’s go through the decision tree for the Trid function:

  1. No nonlinear constraints our solution needs to satisfy

  2. No least-squares structure we can exploit

  3. Yes, the function is differentiable. We even have a closed form gradient that we would like to use.

We therefore end up with the candidate algorithms scipy_lbfgsb, nlopt_lbfgsb, and fides.

Note

If your function is differentiable but you do not have a closed form gradient (yet), we suggest using at least one gradient-based optimizer and one gradient-free optimizer in your experiments. Optimagic will use numerical gradients in that case. For details, see here.
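For example, the following call (a minimal sketch) simply omits the jac argument, so optimagic approximates the gradient numerically:

# No jac is passed, so the gradient is computed via numerical differentiation.
result = om.minimize(
    fun=trid_scalar,
    params=np.arange(20),
    algorithm="scipy_lbfgsb",
)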

Step 2: Experiments#

To find out which algorithms work well for our problem, we run optimizations with all candidate algorithms in a loop and store the results in a dictionary. We limit the number of function evaluations to 8. Since some algorithms only support a maximum number of iterations as a stopping criterion, we also limit the number of iterations to 8.

results = {}
for algo in ["scipy_lbfgsb", "nlopt_lbfgsb", "fides"]:
    results[algo] = om.minimize(
        fun=trid_scalar,
        jac=trid_gradient,
        params=np.arange(20),
        algorithm=algo,
        algo_options={"stopping_maxfun": 8, "stopping_maxiter": 8},
    )

fig = om.criterion_plot(results, max_evaluations=8)
fig.show(renderer="png")
[Criterion plot comparing scipy_lbfgsb, nlopt_lbfgsb, and fides over the first 8 function evaluations.]

All optimizers work pretty well here, and since this is a very simple problem, any of them would probably find the optimum in a reasonable time. However, nlopt_lbfgsb performs a bit better than the others, so we select it for the next step. In more difficult examples, the difference between optimizers can be much more pronounced.
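To complement the plot, we can also print the best objective value each candidate reached within the evaluation budget. Here we rely on the fun attribute of the result objects; check the OptimizeResult documentation if your optimagic version exposes it differently:

# Best objective value reached by each candidate within 8 evaluations.
for name, res in results.items():
    print(name, res.fun)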

Step 3: Optimization#

All that is left to do is run the best optimizer until convergence. To avoid duplicated calculations, we start from the best parameter vector found during the experiments:

best_x = results["nlopt_lbfgsb"].params
results["nlopt_lbfgsb_complete"] = om.minimize(
    fun=trid_scalar,
    jac=trid_gradient,
    params=best_x,
    algorithm="nlopt_lbfgsb",
)

Looking at the result in a criterion plot, we can see that the optimizer converges after a bit more than 30 function evaluations.

fig = om.criterion_plot(results)
fig.show(renderer="png")
[Criterion plot including the completed nlopt_lbfgsb run.]
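Because the Trid function has a known analytical solution, we can also verify the result directly: for dimension d, the minimizer is x_i = i * (d + 1 - i) with optimal value -d * (d + 4) * (d - 1) / 6, i.e. -1520 for d = 20. The following check is an addition to the original walkthrough; after convergence, both printed deviations should be very small:

# Compare the solution with the known analytical optimum of the Trid function.
d = 20
i = np.arange(1, d + 1)
x_star = i * (d + 1 - i)
f_star = -d * (d + 4) * (d - 1) / 6  # -1520 for d = 20
res = results["nlopt_lbfgsb_complete"]
print(abs(res.fun - f_star), np.max(np.abs(res.params - x_star)))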

Variations of the three steps#

The three steps described above work very well in most situations. However, sometimes it makes sense to deviate:

  • If you are unsure about some of the questions in step 1, select more algorithms for the experimentation phase and run more than 1 algorithm until convergence.

  • If it is very important to find a precise optimum, run more than 1 algorithm until convergence.

  • If you have a very fast objective function, simply run all candidate algorithms until convergence.

  • If you have a differentiable objective function but no closed form derivative, use at least one gradient based optimizer and one gradient free optimizer in the experiments. See here to learn more about derivatives.

How important was it?#

The Trid function is differentiable and very well behaved in almost every aspect. Moreover, it has a very short runtime. One would think that any optimizer can find its optimum. So let’s compare the selected optimizer with a few others:

results = {}
for algo in ["nlopt_lbfgsb", "scipy_neldermead", "scipy_cobyla"]:
    results[algo] = om.minimize(
        fun=trid_scalar,
        jac=trid_gradient,
        params=np.arange(20),
        algorithm=algo,
    )

fig = om.criterion_plot(results)
fig.show(renderer="png")
[Criterion plot comparing nlopt_lbfgsb, scipy_neldermead, and scipy_cobyla until convergence.]

We can see that our chosen optimizer solves the problem in fewer than 35 function evaluations. At that point, the two gradient-free optimizers have not yet made significant progress. COBYLA gets reasonably close to the optimum after about 4,000 evaluations. Nelder-Mead gets stuck after 8,000 evaluations and fails to solve the problem.

This example shows not only that the choice of optimizer is important, but also that the commonly held belief that gradient-free optimizers are generally more robust than gradient-based ones is dangerous! The Nelder-Mead algorithm did “converge” and reported success, but it did not find the optimum. It did not even get stuck in a local optimum, because we know that the Trid function has no local optima except the global one. It simply got stuck somewhere.

results["scipy_neldermead"].success
True
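To quantify how misleading this is, we can compare the objective values reached by the different optimizers (again using the fun attribute of the results):

# Nelder-Mead reports success, but its objective value is far from the one
# reached by the gradient-based optimizer (the optimum is -1520 for d = 20).
print(results["scipy_neldermead"].fun, results["nlopt_lbfgsb"].fun)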