The nonlinear program we wish to solve is of the form:
\[ \begin{equation} \begin{aligned} & \underset{x}{\text{minimize}} & & f(x) &&&& f : \Rn \rightarrow \R \\ & \text{subject to} & & \underline{x} \le x \le \overline{x} \\ &&& \underline{z} \le g(x) \le \overline{z} &&&& g : \Rn \rightarrow \R^m \end{aligned} \label{eq:problem-orig} \tag{P} \end{equation} \]
Define the convex sets \(C\) and \(D\) as
\[ \begin{equation} \begin{aligned} C &= \left\{ x \in \Rn \mid \underline x \le x \le \overline x \right\} \\ D &= \left\{ z \in \R^m \mid \underline z \le z \le \overline z \right\}. \\ \end{aligned} \label{eq:setCD} \end{equation} \]
These rectangular boxes can be decomposed as Cartesian products of 1-dimensional closed intervals:
\[ \begin{equation} \begin{aligned} C &= C_1 \times C_2 \times \dots \times C_n \quad&&\text{where } C_i = \left[ \underline x_i, \overline x_i \right] \\ D &= D_1 \times D_2 \times \dots \times D_m \quad&&\text{where } D_i = \left[ \underline z_i, \overline z_i \right] \\ \end{aligned} \label{eq:setCDcartprod} \end{equation} \]
Using these definitions, problem \(\eqref{eq:problem-orig}\) can equivalently be expressed as
\[ \begin{equation} \begin{aligned} & \underset{x\in C}{\text{minimize}} & & f(x) \\ & \text{subject to} & & g(x) \in D. \end{aligned} \label{eq:problem-in-setCD} \end{equation} \]
After introduction of a slack variable \(z\), problem \(\eqref{eq:problem-in-setCD}\) can be stated as
\[ \begin{equation} \label{eq:problem-origCD-alm} \tag{P-ALM} \begin{aligned} & \underset{x\in C,\; z\in D}{\text{minimize}} & & f(x) \\ & \text{subject to} & & g(x) - z = 0. \end{aligned} \end{equation} \]
The Lagrangian function of problem \(\eqref{eq:problem-origCD-alm}\) is given by:
\[ \begin{equation}\label{eq:def-lagr} \begin{aligned} \Lagr : \Rn \times \R^m \times \R^m \rightarrow \R : (x, z, y) &\mapsto \Lagr(x, z, y) \\ &\,\triangleq\, f(x) + \left\langle g(x) - z,\; y\right\rangle. \end{aligned} \end{equation} \]
The vector \(y \in \R^m\) is called the vector of Lagrange multipliers.
The augmented Lagrangian function with penalty factor \(\Sigma\) of the problem \(\eqref{eq:problem-origCD-alm}\) is defined as the sum of the Lagrangian function and a quadratic term that penalizes the constraint violation:
\[ \begin{equation}\label{eq:def-auglagr} \begin{aligned} \Lagr_\Sigma : \Rn \times \R^m \times \R^m \rightarrow \R : (x, z, y) &\mapsto \Lagr_\Sigma(x, z, y) \\ &\,\triangleq\, \Lagr(x, z, y) + \tfrac{1}{2} \left\|g(x) - z\right\|^2_\Sigma, \end{aligned} \end{equation} \]
where \(\Sigma\) is a symmetric positive definite \(m\times m\) matrix that defines a norm on \(\R^m\), \(\|z\|^2_\Sigma \triangleq z^\top \Sigma z\).
The augmented Lagrangian method for solving Problem \(\eqref{eq:problem-origCD-alm}\) consists of the successive minimization of \(\Lagr_\Sigma\) with respect to the decision variables \(x\) and the slack variables \(z\) (1), after which the Lagrange multipliers \(y\) are updated (2), and the penalty factors \(\Sigma_{ii}\) corresponding to constraints with high violation are increased (3).
The augmented Lagrangian function is used as an exact penalty function for problem \(\eqref{eq:problem-origCD-alm}\), it is equivalent to the shifted quadratic penalty method with shift \(\Sigma^{-1}y\).
Using some algebraic manipulations, the augmented Lagrangian defined in \(\eqref{eq:def-auglagr}\) can be expressed as
\[ \begin{equation} \label{eq:auglagr2} \Lagr_\Sigma(x, z, y) = f(x) \;\;+\;\; \frac{1}{2} \Big\Vert g(x) - z + \Sigma^{-1} y \Big\Vert_\Sigma^2 \;\;-\;\; \frac{1}{2}\Big\Vert y \Big\Vert_{\Sigma^{-1}.}^2 \end{equation} \]
At each iteration \(\nu\) of the ALM algorithm, the following minimization problem is solved:
\[ \begin{equation} \label{eq:alm-step-1-argmin} (x^\nu, z^\nu) \;\;=\;\; \underset{x\in C,\; z\in D}{\text{argmin}}\quad \Lagr_{\Sigma^{\nu-1}}(x,\, z;\, y^{\nu-1}) \end{equation} \]
The update of the Lagrange multipliers corrects the shift \(\Sigma^{-1} y\) in \(\eqref{eq:auglagr2}\): if the constraint violation \(g(x^\nu) - z^\nu\) is positive, the shift is increased, in an attempt to drive the next iterate towards a smaller constraint violation \(g(x^\nu) - z^\nu\). The following update rule formalizes that idea:
\[ \begin{equation}\label{eq:lagr-update-explanation} y^\nu \leftarrow y^{\nu-1} + \Sigma^{\nu-1} \left(g(x^\nu) - z^\nu\right) \end{equation} \]
When the constraint violation becomes zero, the Lagrange multipliers are no longer updated.
As the penalty factors \(\Sigma\) tend towards infinity, the shift \(\Sigma^{-1} y\) has to vanish, because in that case, the quadratic penalty method without shifts solves the problem exactly. For \(\Sigma^{-1} y\) to vanish, the Lagrange multipliers must be bounded, which is achieved by the following projection:
Let \(M > 0\) be some large but finite bound.
\[ \begin{gather} \underline y_i \triangleq \begin{cases} 0 & \underline z_i = -\infty \\ -M & \text{otherwise}, \end{cases} \quad\quad\quad\quad\quad\quad \overline y_i \triangleq \begin{cases} 0 & \overline z_i = +\infty \\ +M & \text{otherwise} \end{cases} \\[0.66em] Y \triangleq [\underline y_1, \overline y_1] \times \dots \times [\underline y_m, \overline y_m] \end{gather} \]
The result of \(\eqref{eq:lagr-update-explanation}\) is therefore clamped as follows:
\[ \begin{equation}\label{eq:lagr-update-explanation-clamp} y^\nu \leftarrow \Pi_Y\left(y^{\nu-1} + \Sigma^{\nu-1} \left(g(x^\nu) - z^\nu\right)\right) \end{equation} \]
When the penalty factor for the \(i\)-th constraint, \(\Sigma_{ii}\) is increased, minimizing the violation of this constraint becomes more important in \(\eqref{eq:alm-step-1-argmin}\). Therefore, if the constraint violation cannot be reduced by updating the shifts alone, the penalty factors are increased.
Selecting when and by how much each penalty factor should be increased is more of a heuristic. The strategy used here is to compare the violation at the current iterate with the violation at the previous iterate, it is the same strategy as used in QPALM. Denote the vector of constraint violations as \(e^\nu \triangleq g(x^\nu) - z^\nu\). Let \(\theta \in (0, 1)\).
If \(|e^\nu_i| \le \theta |e^{\nu-1}_i|\), meaning that the constraint violation has decreased by at least a factor \(\theta\) compared to the previous iteration, then the penalty factor is not updated.
If the constraint violation did not decrease sufficiently, then the penalty factor \(\Sigma_{ii}\) is increased by a factor
\[ \begin{equation}\label{eq:multipliers-update-factor} \Delta \dfrac{|e^\nu_i|}{\|e^\nu\|_\infty}, \end{equation} \]
where \(\Delta > 1\) is a tuning parameter. The violation of each individual constraint is scaled by the maximum violation of all constraints, such that the penalty factors of constraints with a large violation are increased more aggressively. If the factor in \(\eqref{eq:multipliers-update-factor}\) is less than one, the penalty factor is not updated (otherwise it would result in a reduction of the penalty).
PANOC is an algorithm that solves optimization problems of the form:
\[ \begin{equation} \begin{aligned} & \underset{x}{\text{minimize}} & & \psi(x) + h(x), \end{aligned} \label{eq:problem-panoc} \tag{P-PANOC} \end{equation} \]
where \(\psi : \Rn \rightarrow \R \) has Lipschitz gradient, and \(h : \Rn \rightarrow \overline \R \) allows efficient computation of the proximal operator.
Recall the inner minimization problem \(\eqref{eq:alm-step-1-argmin}\) in the first step of the ALM algorithm. It can be simplified to:
\[ \begin{equation} \label{eq:problem-inner} \begin{aligned} \min_{x\in C,\, z\in D} \Lagr_\Sigma(x, z, y) &= -\tfrac{1}{2} \lVert y \rVert_{\Sigma^{-1}}^2 + \min_{x\in C}\left\{ f(x) + \min_{z\in D} \left\{ \tfrac{1}{2} \left\Vert z - \left(g(x) + \Sigma^{-1} y\right)\right\Vert_\Sigma^2 \right\} \right\} \\ &= -\tfrac{1}{2} \lVert y \rVert_{\Sigma^{-1}}^2 + \min_{x\in C}\;\; \bigg\{ \underbrace{ f(x) + \tfrac{1}{2} \text{dist}_\Sigma^2 \left( g(x) + \Sigma^{-1}y, \ D \right)}_{\triangleq\, \psi_{\Sigma}(x;\,y)} \bigg\} \end{aligned} \end{equation} \]
Within the PANOC algorithm, the parameters \(y\) and \(\Sigma\) remain constant, and will be omitted from the function names to ease notation:
\[ \begin{equation} \begin{aligned} \psi(x) &= f(x) + \tfrac{1}{2} \text{dist}_\Sigma^2 \left( g(x) + \Sigma^{-1}y, \ D \right) \end{aligned} \label{eq:psi-inner-panoc} \end{equation} \]
The inner problem in \(\eqref{eq:problem-inner}\) has the same minimizers as the following problem that will be solved using the PANOC algorithm:
\[ \begin{equation} \begin{aligned} & \underset{x\in C}{\text{minimize}} & & \psi(x), \end{aligned} \label{eq:problem-inner-panoc} \end{equation} \]
This problem is an instance of problem \(\eqref{eq:problem-panoc}\) where the nonsmooth term \(h\) is the indicator of the set \(C\), \(h(x) = \delta_C(x)\).
The following is a list of symbols and formulas that are used in the implementation of the PANOC algorithm.
\[ \DeclareMathOperator{\prox}{\mathbf{prox}} \DeclareMathOperator*{\argmin}{\mathbf{argmin}} \DeclareMathOperator{\dist}{\mathbf{dist}} \DeclareMathOperator*{\minimize}{\mathbf{minimize}\;\;} \begin{aligned} y &\in \R^m &\text{Current Lagrange multipliers} \\ \Sigma &\in \text{diag}(\R^m_{>0}) &\text{Current penalty factor} \\ x^k &\in \Rn &\text{Current PANOC iterate} \\ \gamma_k &\in \R_{>0} &\text{Current proximal gradient step size} \\[1em] \zeta^k &\triangleq g(x^k) + \Sigma^{-1}y &\text{Shifted constraint value}\\ \hat{z}^k &\triangleq \Pi_D\left(g(x^k) + \Sigma^{-1}y\right) &\text{Closest feasible value for slack variable } z \\ &= \Pi_D(\zeta^k) \\ d^k &\triangleq \zeta^k - \Pi_D(\zeta^k) &\text{How far the shifted constraint value }\zeta \\ &= \zeta^k - \hat{z}^k &\text{is from the feasible set}\\ e^k &\triangleq g(x^k) - \hat z^k &\text{Constraint violation} \\ \hat{y}^k &\triangleq \Sigma\, d^k &\text{Candidate Lagrange multipliers,}\\ &= \Sigma\, \left(g(x^k) + \Sigma^{-1}y - \Pi_D\left(g(x^k) + \Sigma^{-1}y\right)\right) &\text{see \eqref{eq:lagr-update-explanation}}\\ &= y + \Sigma\,\left(g(x^k) - \hat z^k\right) \\ &= y + \Sigma\, e^k \\[1em] \psi(x^k) &= \Lagr_\Sigma(x^k, \hat z^k, y) + \tfrac{1}{2} \lVert y \rVert_{\Sigma^{-1}}^2 &\text{PANOC objective function} \\ &= f(x^k) + \tfrac{1}{2} \dist_\Sigma^2\left(g(x^k) + \Sigma^{-1}y,\;D\right) \\ &= f(x^k) + \tfrac{1}{2} \left\|\left(g(x^k) + \Sigma^{-1}y\right) -\Pi_D\left(g(x^k) + \Sigma^{-1}y\right)\right\|_\Sigma^2 \\ &= f(x^k) + \tfrac{1}{2} \left\|\zeta^k - \hat{z}^k\right\|_\Sigma^2 \\ &= f(x^k) + \tfrac{1}{2} \langle d^k, \hat{y}^k \rangle \\[1em] \nabla \psi(x^k) &= \nabla f(x^k) + \nabla g(x^k)\, \Sigma \left(g(x^k) + \Sigma^{-1}y - \Pi_D(g(x^k) + \Sigma^{-1}y)\right) &\text{Gradient of the objective} \\ &= \nabla f(x^k) + \nabla g(x^k)\, \Sigma\, \big(\zeta^k - \hat{z}^k\big) \\ &= \nabla f(x^k) + \nabla g(x^k)\, \hat{y}^k \\ &= \nabla f(x^k) + \sum_{i=1}^m \hat{y}^k_i\, \nabla g_i(x^k) \\[1em] \nabla \hat y^k_i(x^k) &= \Sigma_{ii} \left(1 - \partial \Pi_{D_i}\left(g_i(x^k) + \Sigma^{-1}_{ii} y_i\right) \right)\, \nabla g_i(x^k) \\ \nabla^2 \psi(x^k) &= \nabla^2 f(x^k) + \sum_{i=1}^m \hat{y}^k_i\, \nabla^2 g_i(x^k) + \sum_{i=1}^m \nabla g_i(x^k)\, \nabla^\top \hat{y}^k_i &\text{Generalized Hessian of the objective}\\ &= \nabla^2_{xx} \Lagr(x^k, y) + \sum_{i=1}^m \nabla g_i(x^k)\, \hat\sigma_i\, \nabla g_i(x^k)^\top \\ \hat\sigma_i &\in \begin{cases} \left\{\Sigma_{ii}\right\} & \text{if } g_i(x^k) + \Sigma_{ii}^{-1} y_i \not\in D_i \\ \left[0, \Sigma_{ii}\right] & \text{if } g_i(x^k) + \Sigma_{ii}^{-1} y_i \in \operatorname{bd} D_i \\ \left\{0\right\} & \text{if } g_i(x^k) + \Sigma_{ii}^{-1} y_i \in \operatorname{int} D_i \\ \end{cases} \\[1em] \hat{x}^k &\triangleq T_{\gamma^k}\left(x^k\right) &\text{Next proximal gradient iterate}\\ &= \Pi_C\left(x^k - \gamma^k \nabla \psi(x^k)\right) \\ p^k &\triangleq \hat{x}^k - x^k &\text{Proximal gradient step}\\ r^k &\triangleq \tfrac{1}{\gamma^k} p^k &\text{Fixed-point residual (FPR)}\\[1em] \varphi_{\gamma^k}(x^k) &= \psi(x^k) + h(\hat{x}^k) + \tfrac{1}{2\gamma^k} \lVert \hat{x}^k - x^k \rVert^2 + \nabla\psi(x^k)^\top (\hat{x}^k - x^k) &\text{Forward-backward envelope (FBE)}\\ &= \psi(x^k) + \tfrac{1}{2\gamma^k} \lVert p^k \rVert^2 + \nabla\psi(x^k)^\top p^k \\[1em] q^k &\triangleq H_k r^k &\text{Quasi-Newton step} \\ x^{k+1} &= x^k + (1-\tau) p^k + \tau q^k &\text{Next PANOC iterate} \\ \end{aligned} \]
Note that many of the intermediate values depend on the value of \(x^k\), it is sometimes easiest to define them as functions:
\[ \begin{aligned} \zeta(x) &\triangleq g(x) + \Sigma^{-1}y\\ \hat{z}(x) &\triangleq \Pi_D\left(g(x) + \Sigma^{-1}y\right) \\ &= \Pi_D(\zeta(x)) \\ d(x) &\triangleq \zeta(x) - \Pi_D(\zeta(x)) \\ &= \zeta(x) - \hat{z}(x) \\ \hat{y}(x) &\triangleq \Sigma\, d(x) \\ &= \Sigma\, \left(g(x) + \Sigma^{-1}y - \Pi_D\left(g(x) + \Sigma^{-1}y\right)\right)\\ e(x) &\triangleq g(x) - \hat z(x) \end{aligned} \]
The result of the PANOC algorithm is the triple \((\hat x^k,\;\hat y(\hat x^k),\;\hat z(\hat x^k))\).
The following graph visualizes the dependencies between the different values used in a PANOC iteration.
See [2] for details.
Consider the following general formulation of a nonlinear optimal control problem with finite horizon \( N \).
\[ \newcommand\U{U} \newcommand\D{D} \newcommand\nnu{{n_u}} \newcommand\nnx{{n_x}} \newcommand\nny{{n_y}} \newcommand\xinit{x_\text{init}} \newcommand\xref{x_\text{r}} \newcommand\uref{u_\text{r}} \begin{equation}\label{eq:OCP} \tag{OCP}\hspace{-0.8em} \begin{aligned} &\minimize_{u,x} && \sum_{k=0}^{N-1} \ell_k\big(h_k(x^k, u^k)\big) + \ell_N\big(h_N(x^N)\big)\hspace{-0.8em} \\ &\subjto && u^k \in \U \\ &&& C(x^k) \in \D \\ &&& x^0 = \xinit \\ &&& x^{k+1} = f(x^k, u^k) \quad\quad (0 \le k \lt N) \end{aligned} \end{equation} \]
The function \( f : \R^\nnx \times \R^\nnu \to \R^\nnx \) models the discrete-time, nonlinear dynamics of the system, which starts from an initial state \( \xinit \). The functions \( h_k : \R^\nnx \times \R^\nnu \to \R^{n_h} \) for \( 0 \le k \lt N \) and \( h_N : \R^\nnx \to \R^{n_h^N} \) can be used to represent the (possibly time-varying) output mapping of the system, and the convex functions \( \ell_k : \R^{n_h} \to \R \) and \( \ell_N : \R^{n_h^N} \to \R \) define the stage costs and the terminal cost respectively.
See [3] for more details.