Augmented Lagrangian method

Definitions

The nonlinear program we wish to solve is of the form:

$\begin{matrix} (P) & \begin{aligned} & \underset{x}{\text{minimize}} & & f(x) & & f : \mathbb{R}^{n} \to \mathbb{R} \\ & \text{subject to} & & \underline{x} \leq x \leq \overline{x} \\ & & & \underline{z} \leq g(x) \leq \overline{z} & & g : \mathbb{R}^{n} \to \mathbb{R}^{m} \end{aligned} \end{matrix}$

Define the convex sets $C$ and $D$ as

$\begin{matrix} (1) & \begin{aligned} C & = \{ x \in \mathbb{R}^{n} \mid \underline{x} \leq x \leq \overline{x} \} \\ D & = \{ z \in \mathbb{R}^{m} \mid \underline{z} \leq z \leq \overline{z} \} . \end{aligned} \end{matrix}$

These rectangular boxes can be decomposed as Cartesian products of 1-dimensional closed intervals:

$\begin{matrix} (2) & \begin{aligned} C & = C_{1} \times C_{2} \times \dots \times C_{n} & & \text{where } C_{i} = [\underline{x}_{i}, \overline{x}_{i}] \\ D & = D_{1} \times D_{2} \times \dots \times D_{m} & & \text{where } D_{i} = [\underline{z}_{i}, \overline{z}_{i}] \end{aligned} \end{matrix}$

Using these definitions, problem $(P)$ can equivalently be expressed as

$\begin{matrix} (3) & \begin{aligned} & \underset{x \in C}{\text{minimize}} & & f(x) \\ & \text{subject to} & & g(x) \in D . \end{aligned} \end{matrix}$

After introduction of a slack variable $z$ , problem $(3)$ can be stated as

$\begin{matrix} (P-ALM) & \begin{aligned} & \underset{x \in C, z \in D}{\text{minimize}} & & f(x) \\ & \text{subject to} & & g(x) - z = 0. \end{aligned} \end{matrix}$

The Lagrangian function of problem $(P-ALM)$ is given by:

$\begin{matrix} (4) & \begin{aligned} L : \mathbb{R}^{n} \times \mathbb{R}^{m} \times \mathbb{R}^{m} \to \mathbb{R} : (x, z, y) & \mapsto L(x, z, y) \\ & ≜ f(x) + ⟨ g(x) - z, y ⟩ . \end{aligned} \end{matrix}$

The vector $y \in \mathbb{R}^{m}$ is called the vector of Lagrange multipliers.

The augmented Lagrangian function with penalty factor $Σ$ of the problem $(P-ALM)$ is defined as the sum of the Lagrangian function and a quadratic term that penalizes the constraint violation:

$\begin{matrix} (5) & \begin{aligned} L_{Σ} : \mathbb{R}^{n} \times \mathbb{R}^{m} \times \mathbb{R}^{m} \to \mathbb{R} : (x, z, y) & \mapsto L_{Σ}(x, z, y) \\ & ≜ L(x, z, y) + \frac{1}{2} ‖ g(x) - z ‖_{Σ}^{2}, \end{aligned} \end{matrix}$

where $Σ$ is a symmetric positive definite $m \times m$ matrix that defines a norm on $\mathbb{R}^{m}$ by $‖ z ‖_{Σ}^{2} ≜ z^{⊤} Σ z$.
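Definitions (4) and (5) can be evaluated directly. The following is a minimal sketch that assumes a diagonal $Σ$ stored as a list of positive weights; the function names and the toy `f` and `g` are illustrative and not part of any library.

```python
def lagrangian(f, g, x, z, y):
    """L(x, z, y) = f(x) + <g(x) - z, y>, see (4)."""
    return f(x) + sum((gi - zi) * yi for gi, zi, yi in zip(g(x), z, y))


def augmented_lagrangian(f, g, x, z, y, sigma):
    """L_Sigma(x, z, y) from (5), assuming Sigma = diag(sigma) with sigma_i > 0."""
    penalty = 0.5 * sum(si * (gi - zi) ** 2 for si, gi, zi in zip(sigma, g(x), z))
    return lagrangian(f, g, x, z, y) + penalty


# Hypothetical toy data: f(x) = x1^2 + x2^2 and a single constraint g(x) = x1 + x2.
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: (x[0] + x[1],)

x, z, y, sigma = (1.0, 2.0), (1.0,), (0.5,), (4.0,)
L = lagrangian(f, g, x, z, y)                         # 5 + (3 - 1) * 0.5 = 6
L_sigma = augmented_lagrangian(f, g, x, z, y, sigma)  # 6 + 0.5 * 4 * 2^2 = 14
```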

The augmented Lagrangian method algorithm

The augmented Lagrangian method for solving Problem $(P-ALM)$ consists of the successive minimization of $L_{Σ}$ with respect to the decision variables $x$ and the slack variables $z$ (1), after which the Lagrange multipliers $y$ are updated (2), and the penalty factors $Σ_{i i}$ corresponding to constraints with high violation are increased (3).

The augmented Lagrangian function is used as an exact penalty function for problem $(P-ALM)$. The resulting method is equivalent to the shifted quadratic penalty method with shift $Σ^{-1} y$.
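The three steps can be sketched end to end on a small problem. The example below is an illustration under stated assumptions, not the library's implementation: the toy problem, the helper names, the parameter values and the plain projected-gradient inner solver are all hypothetical, and the slack variable is eliminated by projecting onto $D$.

```python
import math


def clamp(v, lo, hi):
    """Project the scalar v onto the interval [lo, hi]."""
    return min(max(v, lo), hi)


# Hypothetical toy problem (an assumption, not taken from the text):
#   minimize (x1 - 2)^2 + (x2 - 1)^2
#   subject to -10 <= x_i <= 10 and x1 + x2 <= 1
X_LB, X_UB = -10.0, 10.0        # box C
Z_LB, Z_UB = -math.inf, 1.0     # box D for the single constraint g(x) = x1 + x2


def grad_f(x):
    return [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]


def g(x):
    return x[0] + x[1]


GRAD_G = [1.0, 1.0]


def inner_solve(x, y, sigma, iters=500):
    """Step (1): projected gradient descent, with z eliminated by projection onto D."""
    step = 1.0 / (2.0 + 2.0 * sigma)  # 1/L, valid for this toy problem only
    for _ in range(iters):
        zeta = g(x) + y / sigma
        y_hat = sigma * (zeta - clamp(zeta, Z_LB, Z_UB))
        grad = [df + y_hat * dg for df, dg in zip(grad_f(x), GRAD_G)]
        x = [clamp(xi - step * gi, X_LB, X_UB) for xi, gi in zip(x, grad)]
    z = clamp(g(x) + y / sigma, Z_LB, Z_UB)
    return x, z


def alm(x, y=0.0, sigma=10.0, outer_iters=10, theta=0.25, delta=5.0, M=1e6):
    e_prev = math.inf
    for _ in range(outer_iters):
        x, z = inner_solve(x, y, sigma)       # (1) minimize over x and z
        e = g(x) - z                          # constraint violation
        y = clamp(y + sigma * e, 0.0, M)      # (2) multiplier update, y >= 0 since Z_LB = -inf
        if abs(e) > theta * abs(e_prev):      # (3) insufficient decrease: raise the penalty
            sigma *= delta
        e_prev = e
    return x, y


x_opt, y_opt = alm([0.0, 0.0])  # approaches x = [1, 0] with multiplier y = 2
```

With a single constraint, the scaling by the maximum violation reduces to multiplying the penalty by `delta`. A practical solver would replace the fixed iteration counts by termination tolerances.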

1. Minimization of the augmented Lagrangian

Using some algebraic manipulations, the augmented Lagrangian defined in $(5)$ can be expressed as

$\begin{matrix} (6) & L_{Σ}(x, z, y) = f(x) + \frac{1}{2} ‖ g(x) - z + Σ^{-1} y ‖_{Σ}^{2} - \frac{1}{2} ‖ y ‖_{Σ^{-1}}^{2} . \end{matrix}$
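The manipulation behind $(6)$ is a completion of the square. As a sketch, using only the symmetry of $Σ$ and the definition of the weighted norm:

$\begin{aligned} \frac{1}{2} ‖ g(x) - z + Σ^{-1} y ‖_{Σ}^{2} & = \frac{1}{2} (g(x) - z)^{⊤} Σ (g(x) - z) + (g(x) - z)^{⊤} Σ Σ^{-1} y + \frac{1}{2} y^{⊤} Σ^{-1} Σ Σ^{-1} y \\ & = \frac{1}{2} ‖ g(x) - z ‖_{Σ}^{2} + ⟨ g(x) - z, y ⟩ + \frac{1}{2} ‖ y ‖_{Σ^{-1}}^{2} . \end{aligned}$

Subtracting $\frac{1}{2} ‖ y ‖_{Σ^{-1}}^{2}$ and adding $f(x)$ gives $L(x, z, y) + \frac{1}{2} ‖ g(x) - z ‖_{Σ}^{2}$, which is definition $(5)$.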

At each iteration $ν$ of the ALM algorithm, the following minimization problem is solved:

$\begin{matrix} (7) & (x^{ν}, z^{ν}) = \underset{x \in C, z \in D}{\operatorname{argmin}} \; L_{Σ^{ν-1}}(x, z; y^{ν-1}) \end{matrix}$

2. Update of the Lagrange multipliers

The update of the Lagrange multipliers corrects the shift $Σ^{-1} y$ in $(6)$: if the constraint violation $g(x^{ν}) - z^{ν}$ is positive, the shift is increased, in an attempt to drive the next iterate towards a smaller constraint violation. The following update rule formalizes that idea:

$\begin{matrix} (8) & y^{ν} \leftarrow y^{ν - 1} + Σ^{ν - 1} (g (x^{ν}) - z^{ν}) \end{matrix}$

When the constraint violation becomes zero, the Lagrange multipliers are no longer updated.

As the penalty factors $Σ$ tend towards infinity, the shift $Σ^{- 1} y$ has to vanish, because in that case, the quadratic penalty method without shifts solves the problem exactly. For $Σ^{- 1} y$ to vanish, the Lagrange multipliers must be bounded, which is achieved by the following projection:

Let $M > 0$ be some large but finite bound.

$\begin{matrix} (9) & \underline{y}_{i} ≜ \begin{cases} 0 & \underline{z}_{i} = -\infty \\ -M & \text{otherwise,} \end{cases} \qquad \overline{y}_{i} ≜ \begin{cases} 0 & \overline{z}_{i} = +\infty \\ +M & \text{otherwise} \end{cases} \\ (10) & Y ≜ [\underline{y}_{1}, \overline{y}_{1}] \times \dots \times [\underline{y}_{m}, \overline{y}_{m}] \end{matrix}$

The result of $(8)$ is therefore clamped as follows:

$\begin{matrix} (11) & y^{ν} \leftarrow Π_{Y} (y^{ν - 1} + Σ^{ν - 1} (g (x^{ν}) - z^{ν})) \end{matrix}$
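The bounds $(9)$ and the clamped update $(11)$ could be written as follows. This is a sketch for a diagonal $Σ$ stored as a list; the function names and the sample numbers are assumptions made for illustration.

```python
import math


def multiplier_bounds(z_lb, z_ub, M):
    """Bounds (9): a multiplier may only act in the direction of a finite bound."""
    y_lb = [0.0 if lo == -math.inf else -M for lo in z_lb]
    y_ub = [0.0 if hi == math.inf else M for hi in z_ub]
    return y_lb, y_ub


def update_multipliers(y_prev, sigma, g_x, z, y_lb, y_ub):
    """Update (11): the step (8) followed by projection onto the box Y."""
    y_new = []
    for yi, si, gi, zi, lo, hi in zip(y_prev, sigma, g_x, z, y_lb, y_ub):
        candidate = yi + si * (gi - zi)            # unclamped update (8)
        y_new.append(min(max(candidate, lo), hi))  # projection onto [lo, hi]
    return y_new


# Hypothetical data: D = (-inf, 1] x [0, 2], with a deliberately small bound M = 2.
y_lb, y_ub = multiplier_bounds([-math.inf, 0.0], [1.0, 2.0], M=2.0)
y = update_multipliers([0.5, 0.0], [10.0, 10.0], [1.2, -0.3], [1.0, 0.0], y_lb, y_ub)
# The unclamped candidates are about 2.5 and -3.0; both are clamped by M.
```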

3. Update of the penalty factors

When the penalty factor for the $i$-th constraint, $Σ_{i i}$, is increased, minimizing the violation of this constraint becomes more important in $(7)$. Therefore, if the constraint violation cannot be reduced by updating the shifts alone, the penalty factors are increased.

Selecting when and by how much each penalty factor should be increased is largely heuristic. The strategy used here, which is the same as the one used in QPALM, compares the violation at the current iterate with the violation at the previous iterate. Denote the vector of constraint violations as $e^{ν} ≜ g (x^{ν}) - z^{ν}$ . Let $θ \in (0, 1)$ .
If $| e_{i}^{ν} | \leq θ | e_{i}^{ν - 1} |$ , meaning that the constraint violation has decreased by at least a factor $θ$ compared to the previous iteration, then the penalty factor is not updated.
If the constraint violation did not decrease sufficiently, then the penalty factor $Σ_{i i}$ is increased by a factor

$\begin{matrix} (12) & Δ \frac{| e_{i}^{ν} |}{‖ e^{ν} ‖_{\infty}}, \end{matrix}$

where $Δ > 1$ is a tuning parameter. The violation of each individual constraint is scaled by the maximum violation of all constraints, such that the penalty factors of constraints with a large violation are increased more aggressively. If the factor in $(12)$ is less than one, the penalty factor is not updated (otherwise it would result in a reduction of the penalty).
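The rule above can be sketched per constraint as follows. The default values of `theta` and `delta` and the sample data are assumptions chosen for illustration; the text does not prescribe them.

```python
def update_penalty(sigma, e, e_prev, theta=0.25, delta=5.0):
    """Return the updated diagonal of Sigma, following the rule around (12)."""
    e_max = max(abs(ei) for ei in e)  # infinity norm of the violation
    if e_max == 0.0:
        return list(sigma)            # all constraints satisfied: nothing to do
    new_sigma = []
    for si, ei, ei_prev in zip(sigma, e, e_prev):
        if abs(ei) <= theta * abs(ei_prev):
            new_sigma.append(si)      # sufficient decrease: keep the penalty
        else:
            factor = delta * abs(ei) / e_max
            new_sigma.append(si * max(1.0, factor))  # never reduce the penalty
    return new_sigma


# Hypothetical violations for three constraints.
sigma_new = update_penalty(
    sigma=[1.0, 1.0, 1.0],
    e=[0.1, 0.8, 0.1],
    e_prev=[1.0, 1.0, 0.2],
)
# Constraint 1 decreased enough, constraint 2 is scaled by delta,
# constraint 3 has a factor below one and is left unchanged.
```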

PANOC

PANOC is an algorithm that solves optimization problems of the form:

$\begin{matrix} (P-PANOC) & \begin{aligned} & \underset{x}{\text{minimize}} & & ψ(x) + h(x), \end{aligned} \end{matrix}$

where $ψ : \mathbb{R}^{n} \to \mathbb{R}$ has a Lipschitz-continuous gradient, and $h : \mathbb{R}^{n} \to \overline{\mathbb{R}}$ has a proximal operator that can be computed efficiently.

Recall the inner minimization problem $(7)$ in the first step of the ALM algorithm. It can be simplified to:

$\begin{matrix} (13) & \begin{aligned} \min_{x \in C, z \in D} L_{Σ}(x, z, y) & = -\frac{1}{2} ‖ y ‖_{Σ^{-1}}^{2} + \min_{x \in C} \left\{ f(x) + \min_{z \in D} \left\{ \frac{1}{2} ‖ z - (g(x) + Σ^{-1} y) ‖_{Σ}^{2} \right\} \right\} \\ & = -\frac{1}{2} ‖ y ‖_{Σ^{-1}}^{2} + \min_{x \in C} \underbrace{f(x) + \frac{1}{2} \operatorname{dist}_{Σ}^{2}(g(x) + Σ^{-1} y, D)}_{≜ \, ψ_{Σ}(x; y)} \end{aligned} \end{matrix}$

Within the PANOC algorithm, the parameters $y$ and $Σ$ remain constant, and will be omitted from the function names to ease notation:

$\begin{matrix} (14) & \begin{aligned} ψ(x) & = f(x) + \frac{1}{2} \operatorname{dist}_{Σ}^{2}(g(x) + Σ^{-1} y, D) \end{aligned} \end{matrix}$

The inner problem in $(13)$ has the same minimizers as the following problem that will be solved using the PANOC algorithm:

$\begin{matrix} (15) & \begin{aligned} & \underset{x \in C}{\text{minimize}} & & ψ(x). \end{aligned} \end{matrix}$

This problem is an instance of problem $(P-PANOC)$ where the nonsmooth term $h$ is the indicator of the set $C$ , $h (x) = δ_{C} (x)$ .

Evaluation

The following is a list of symbols and formulas that are used in the implementation of the PANOC algorithm.

$\begin{aligned}
y & \in \mathbb{R}^{m} & & \text{Current Lagrange multipliers} \\
Σ & \in \operatorname{diag}(\mathbb{R}_{> 0}^{m}) & & \text{Current penalty factor} \\
x^{k} & \in \mathbb{R}^{n} & & \text{Current PANOC iterate} \\
γ^{k} & \in \mathbb{R}_{> 0} & & \text{Current proximal gradient step size} \\
ζ^{k} & ≜ g(x^{k}) + Σ^{-1} y & & \text{Shifted constraint value} \\
\hat{z}^{k} & ≜ Π_{D}(g(x^{k}) + Σ^{-1} y) & & \text{Closest feasible value for slack variable } z \\
& = Π_{D}(ζ^{k}) \\
d^{k} & ≜ ζ^{k} - Π_{D}(ζ^{k}) & & \text{How far the shifted constraint value } ζ \\
& = ζ^{k} - \hat{z}^{k} & & \text{is from the feasible set} \\
e^{k} & ≜ g(x^{k}) - \hat{z}^{k} & & \text{Constraint violation} \\
\hat{y}^{k} & ≜ Σ d^{k} & & \text{Candidate Lagrange multipliers,} \\
& = Σ (g(x^{k}) + Σ^{-1} y - Π_{D}(g(x^{k}) + Σ^{-1} y)) & & \text{see (8)} \\
& = y + Σ (g(x^{k}) - \hat{z}^{k}) \\
& = y + Σ e^{k} \\
ψ(x^{k}) & = L_{Σ}(x^{k}, \hat{z}^{k}, y) + \frac{1}{2} ‖ y ‖_{Σ^{-1}}^{2} & & \text{PANOC objective function} \\
& = f(x^{k}) + \frac{1}{2} \operatorname{dist}_{Σ}^{2}(g(x^{k}) + Σ^{-1} y, D) \\
& = f(x^{k}) + \frac{1}{2} ‖ (g(x^{k}) + Σ^{-1} y) - Π_{D}(g(x^{k}) + Σ^{-1} y) ‖_{Σ}^{2} \\
& = f(x^{k}) + \frac{1}{2} ‖ ζ^{k} - \hat{z}^{k} ‖_{Σ}^{2} \\
& = f(x^{k}) + \frac{1}{2} ⟨ d^{k}, \hat{y}^{k} ⟩ \\
\nabla ψ(x^{k}) & = \nabla f(x^{k}) + \nabla g(x^{k}) Σ (g(x^{k}) + Σ^{-1} y - Π_{D}(g(x^{k}) + Σ^{-1} y)) & & \text{Gradient of the objective} \\
& = \nabla f(x^{k}) + \nabla g(x^{k}) Σ (ζ^{k} - \hat{z}^{k}) \\
& = \nabla f(x^{k}) + \nabla g(x^{k}) \hat{y}^{k} \\
& = \nabla f(x^{k}) + \sum_{i=1}^{m} \hat{y}_{i}^{k} \nabla g_{i}(x^{k}) \\
\nabla \hat{y}_{i}^{k}(x^{k}) & = Σ_{ii} (1 - \partial Π_{D_{i}}(g_{i}(x^{k}) + Σ_{ii}^{-1} y_{i})) \nabla g_{i}(x^{k}) \\
\nabla^{2} ψ(x^{k}) & = \nabla^{2} f(x^{k}) + \sum_{i=1}^{m} \hat{y}_{i}^{k} \nabla^{2} g_{i}(x^{k}) + \sum_{i=1}^{m} \nabla g_{i}(x^{k}) \nabla^{⊤} \hat{y}_{i}^{k} & & \text{Generalized Hessian of the objective} \\
& = \nabla_{xx}^{2} L(x^{k}, \hat{z}^{k}, \hat{y}^{k}) + \sum_{i=1}^{m} \nabla g_{i}(x^{k}) \hat{σ}_{i} \nabla g_{i}(x^{k})^{⊤} \\
\hat{σ}_{i} & \in \begin{cases} \{ Σ_{ii} \} & \text{if } g_{i}(x^{k}) + Σ_{ii}^{-1} y_{i} \notin D_{i} \\ [0, Σ_{ii}] & \text{if } g_{i}(x^{k}) + Σ_{ii}^{-1} y_{i} \in \operatorname{bd} D_{i} \\ \{ 0 \} & \text{if } g_{i}(x^{k}) + Σ_{ii}^{-1} y_{i} \in \operatorname{int} D_{i} \end{cases} \\
\hat{x}^{k} & ≜ T_{γ^{k}}(x^{k}) & & \text{Next proximal gradient iterate} \\
& = Π_{C}(x^{k} - γ^{k} \nabla ψ(x^{k})) \\
p^{k} & ≜ \hat{x}^{k} - x^{k} & & \text{Proximal gradient step} \\
r^{k} & ≜ \frac{1}{γ^{k}} p^{k} & & \text{Fixed-point residual (FPR)} \\
φ_{γ^{k}}(x^{k}) & = ψ(x^{k}) + h(\hat{x}^{k}) + \frac{1}{2 γ^{k}} ‖ \hat{x}^{k} - x^{k} ‖^{2} + \nabla ψ(x^{k})^{⊤} (\hat{x}^{k} - x^{k}) & & \text{Forward-backward envelope (FBE)} \\
& = ψ(x^{k}) + \frac{1}{2 γ^{k}} ‖ p^{k} ‖^{2} + \nabla ψ(x^{k})^{⊤} p^{k} \\
q^{k} & ≜ H_{k} r^{k} & & \text{Quasi-Newton step} \\
x^{k+1} & = x^{k} + (1 - τ) p^{k} + τ q^{k} & & \text{Next PANOC iterate}
\end{aligned}$
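As an illustration of the last rows of this list, the sketch below computes $\hat{x}^{k}$, $p^{k}$, $r^{k}$ and the forward-backward envelope for a box $C$, given the value and gradient of $ψ$ at $x^{k}$. The function name and the toy objective $ψ(x) = \frac{1}{2} ‖ x ‖^{2}$ are assumptions; because $\hat{x}^{k} \in C$, the term $h(\hat{x}^{k})$ is zero and is omitted.

```python
def proximal_gradient_quantities(x, psi_x, grad_psi_x, gamma, x_lb, x_ub):
    """Return x_hat, p, r and the FBE value for h = indicator of the box C."""
    # Proximal gradient iterate: gradient step followed by projection onto C.
    x_hat = [min(max(xi - gamma * gi, lo), hi)
             for xi, gi, lo, hi in zip(x, grad_psi_x, x_lb, x_ub)]
    p = [xh - xi for xh, xi in zip(x_hat, x)]  # proximal gradient step
    r = [pi / gamma for pi in p]               # fixed-point residual
    fbe = (psi_x
           + sum(pi * pi for pi in p) / (2.0 * gamma)
           + sum(gi * pi for gi, pi in zip(grad_psi_x, p)))
    return x_hat, p, r, fbe


# Hypothetical example: psi(x) = 0.5 * ||x||^2, so the gradient equals x.
x = [2.0, 0.5]
psi_x = 0.5 * (x[0] ** 2 + x[1] ** 2)
x_hat, p, r, fbe = proximal_gradient_quantities(
    x, psi_x, list(x), gamma=0.5, x_lb=[1.0, 1.0], x_ub=[2.0, 2.0]
)
# x_hat = [1.0, 1.0], p = [-1.0, 0.5], r = [-2.0, 1.0], fbe = 1.625
```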

Note that many of the intermediate values depend on the value of $x^{k}$, so it is sometimes easiest to define them as functions:

$\begin{aligned} ζ(x) & ≜ g(x) + Σ^{-1} y \\ \hat{z}(x) & ≜ Π_{D}(g(x) + Σ^{-1} y) \\ & = Π_{D}(ζ(x)) \\ d(x) & ≜ ζ(x) - Π_{D}(ζ(x)) \\ & = ζ(x) - \hat{z}(x) \\ \hat{y}(x) & ≜ Σ d(x) \\ & = Σ (g(x) + Σ^{-1} y - Π_{D}(g(x) + Σ^{-1} y)) \\ e(x) & ≜ g(x) - \hat{z}(x) \end{aligned}$
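These functions translate almost line by line into code. The sketch below assumes a diagonal $Σ$ stored as a list and a box $D$ given by its bounds, so that $Π_{D}$ is a componentwise clamp. The function names, the callback signatures and the toy problem are illustrative assumptions.

```python
import math


def project_box(v, lo, hi):
    """Componentwise projection onto the box [lo, hi]."""
    return [min(max(vi, l), h) for vi, l, h in zip(v, lo, hi)]


def psi_and_grad(x, y, sigma, f, grad_f, g, jac_g, z_lb, z_ub):
    """Evaluate psi(x) and its gradient, assuming Sigma = diag(sigma)."""
    zeta = [gi + yi / si for gi, yi, si in zip(g(x), y, sigma)]  # zeta(x)
    z_hat = project_box(zeta, z_lb, z_ub)                        # z_hat(x)
    d = [zi - zh for zi, zh in zip(zeta, z_hat)]                 # d(x)
    y_hat = [si * di for si, di in zip(sigma, d)]                # y_hat(x)
    psi = f(x) + 0.5 * sum(di * yh for di, yh in zip(d, y_hat))
    J = jac_g(x)  # J[i][j] = d g_i / d x_j
    grad = [gf + sum(y_hat[i] * J[i][j] for i in range(len(y_hat)))
            for j, gf in enumerate(grad_f(x))]
    return psi, grad


# Hypothetical toy problem with two constraints and D = [0, 1] x (-inf, 1].
f = lambda x: x[0] ** 2 + x[1] ** 2
grad_f = lambda x: [2.0 * x[0], 2.0 * x[1]]
g = lambda x: [x[0] * x[1], x[0] + x[1]]
jac_g = lambda x: [[x[1], x[0]], [1.0, 1.0]]
z_lb, z_ub = [0.0, -math.inf], [1.0, 1.0]
y, sigma = [0.5, 0.2], [2.0, 4.0]

x0 = [1.5, 1.0]
psi0, grad0 = psi_and_grad(x0, y, sigma, f, grad_f, g, jac_g, z_lb, z_ub)
# Here zeta = (1.75, 2.55), z_hat = (1, 1), y_hat = (1.5, 6.2), psi0 = 8.6175.
```

Comparing `grad0` against a finite-difference approximation of `psi0` is a cheap way to validate a hand-written Jacobian.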

The result of the PANOC algorithm is the triple $({\hat{x}}^{k}, \hat{y} ({\hat{x}}^{k}), \hat{z} ({\hat{x}}^{k}))$ .

The following graph visualizes the dependencies between the different values used in a PANOC iteration.

Structured PANOC

See [2] for details.

PANOC-OCP

Problem formulation

Consider the following general formulation of a nonlinear optimal control problem with finite horizon $N$ .

$\begin{matrix} (OCP) & \begin{aligned} & \underset{u, x}{\text{minimize}} & & \sum_{k=0}^{N-1} ℓ_{k}(h_{k}(x^{k}, u^{k})) + ℓ_{N}(h_{N}(x^{N})) \\ & \text{subject to} & & u^{k} \in U \\ & & & C(x^{k}) \in D \\ & & & x^{0} = x_{\text{init}} \\ & & & x^{k+1} = f(x^{k}, u^{k}) \quad (0 \leq k < N) \end{aligned} \end{matrix}$

The function $f : \mathbb{R}^{n_{x}} \times \mathbb{R}^{n_{u}} \to \mathbb{R}^{n_{x}}$ models the discrete-time, nonlinear dynamics of the system, which starts from an initial state $x_{\text{init}}$ . The functions $h_{k} : \mathbb{R}^{n_{x}} \times \mathbb{R}^{n_{u}} \to \mathbb{R}^{n_{h}}$ for $0 \leq k < N$ and $h_{N} : \mathbb{R}^{n_{x}} \to \mathbb{R}^{n_{h}^{N}}$ can be used to represent the (possibly time-varying) output mapping of the system, and the convex functions $ℓ_{k} : \mathbb{R}^{n_{h}} \to \mathbb{R}$ and $ℓ_{N} : \mathbb{R}^{n_{h}^{N}} \to \mathbb{R}$ define the stage costs and the terminal cost respectively.
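To make the structure of the objective concrete, the following sketch simulates the dynamics forward from $x_{\text{init}}$ for a given input sequence and accumulates the stage and terminal costs. It only evaluates the cost; the constraints on $u^{k}$ and $x^{k}$ are not enforced. The function name, the callback signatures and the scalar toy system are assumptions made for illustration.

```python
def rollout_cost(x_init, u_seq, f, h, h_N, ell, ell_N):
    """Simulate x^{k+1} = f(x^k, u^k) and sum the stage and terminal costs."""
    x = x_init
    states = [x]
    cost = 0.0
    for k, u in enumerate(u_seq):
        cost += ell(k, h(k, x, u))  # stage cost of the output at step k
        x = f(x, u)                 # dynamics
        states.append(x)
    cost += ell_N(h_N(x))           # terminal cost at x^N
    return cost, states


# Hypothetical scalar system with horizon N = 2.
f = lambda x, u: 0.5 * x + u
h = lambda k, x, u: (x, u)                 # output mapping, same for every k
h_N = lambda x: x
ell = lambda k, out: out[0] ** 2 + out[1] ** 2
ell_N = lambda out: out ** 2

cost, states = rollout_cost(1.0, [-0.5, 0.0], f, h, h_N, ell, ell_N)
# states = [1.0, 0.0, 0.0] and cost = 1.25
```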

See [3] for more details.
