BFGS optimization

Optimization

Suppose we have a function $f:\mathbb{R}^n \to \mathbb{R}$ that we want to minimize (or maximize). We can use the gradient descent method, which follows the current gradient and keeps going; the problem is that this might not be fast enough. So, for such a function, we can use a Taylor expansion around $x$:

$$f(x+\Delta x) = f(x) + \Delta x^T g_n + \frac{1}{2}\Delta x^T H_n \Delta x + \dots$$

Let $h_n(\Delta x) = f(x) + \Delta x^T g_n + \frac{1}{2}\Delta x^T H_n \Delta x$,

where $g_n$ and $H_n$ are the gradient and Hessian at $x_n$. Taking the derivative of $h_n$ with respect to $\Delta x$, we get $\frac{\partial h_n}{\partial \Delta x} = g_n + \frac{1}{2}(H_n^T + H_n)\Delta x = g_n + H_n \Delta x$.

Setting this to zero gives $\Delta x = -H_n^{-1} g_n$, so $-H_n^{-1} g_n$ is a good direction for minimizing the function. We can check this via $\langle -H_n^{-1} g_n, g_n \rangle = -g_n^T (H_n^{-1})^T g_n < 0$, so as long as we keep $H_n$ positive definite, we are moving in a direction that makes the function smaller.

In practice, we set the update to $x_{n+1} = x_n - \alpha H_n^{-1} g_n$, where $\alpha$ can be seen as a step size and is generally found by line search methods.
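
A minimal sketch of this update in code, assuming the objective, gradient, and Hessian are available as callables; the helper name `newton_step`, the backtracking (Armijo) line search, and the quadratic test problem are my own illustrative choices:

```python
import numpy as np

def newton_step(f, grad, hess, x, alpha0=1.0, beta=0.5, c=1e-4):
    """One damped Newton step with a backtracking (Armijo) line search."""
    g = grad(x)
    d = -np.linalg.solve(hess(x), g)   # Newton direction -H^{-1} g
    alpha = alpha0
    # Shrink the step size until a sufficient decrease is achieved.
    while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
        alpha *= beta
    return x + alpha * d

# Example: minimize the quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
hess = lambda x: A

x = np.zeros(2)
for _ in range(5):
    x = newton_step(f, grad, hess, x)
print(x, np.linalg.solve(A, b))   # converges to the exact minimizer A^{-1} b
```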

Quasi-Newton Method

What can be expected is that such a method costs a lot, since we have to compute and invert the Hessian matrix at every step; therefore, we need a quasi-Newton method. Broyden proposed using a rank-2 update to approximate the Hessian matrix, because a rank-1 update would not be able to keep the approximation positive definite.

Secant condition

As usual, we need a secant condition for our approximation. Here the condition is $H_{n+1}(x_{n+1} - x_n) = g_{n+1} - g_n$; writing $s_n = x_{n+1} - x_n$ and $y_n = g_{n+1} - g_n$, this leads to $H_{n+1}^{-1} y_n = s_n$. The DFP method approximates $H_n$ and updates the inverse using Sherman-Morrison, while the BFGS method directly updates the inverse of $H_n$. In the derivation below, I'll just use $H$ to represent the inverse.
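
A tiny numeric illustration of the secant condition (the quadratic below is an arbitrary example of mine, for which the true Hessian satisfies the condition exactly):

```python
import numpy as np

# For f(x) = 0.5 x^T A x - b^T x the true Hessian A satisfies the
# secant condition exactly: A (x_{n+1} - x_n) = g_{n+1} - g_n.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b

x0 = np.array([0.0, 0.0])
x1 = np.array([0.7, -0.2])
s, y = x1 - x0, grad(x1) - grad(x0)
print(np.allclose(A @ s, y))                  # H_{n+1} s_n = y_n
print(np.allclose(np.linalg.solve(A, y), s))  # equivalently H_{n+1}^{-1} y_n = s_n
```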

To find the updating rule, we need to solve the following problem,

$$\operatorname*{arg\,min}_{H}\ \|H - H_k\|_W \quad \text{subject to } H = H^T,\quad H y_k = s_k,$$

where the weighted Frobenius norm is defined as

$$\|A\|_W = \|W^{1/2} A W^{1/2}\|_F,$$

where the weight matrix $W$ can be any positive definite matrix that satisfies $W s_k = y_k$. One concrete choice is $W = \bar{G}_k$, where $\bar{G}_k$ is the average Hessian, $\bar{G}_k = \int_0^1 \nabla^2 f(x_k + \tau \alpha_k p_k)\, d\tau$; by Taylor's theorem, $\bar{G}_k s_k = y_k$, so this choice indeed satisfies the condition.
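
A numeric sketch of the claim $\bar{G}_k s_k = y_k$, approximating the integral with a midpoint rule; the test function $f(x) = \sum_i e^{x_i}$ and the chosen points are just illustrative:

```python
import numpy as np

# f(x) = sum(exp(x)) has gradient exp(x) and Hessian diag(exp(x)).
grad = lambda x: np.exp(x)
hess = lambda x: np.diag(np.exp(x))

x_k = np.array([0.3, -1.0])
s_k = np.array([0.5, 0.2])
y_k = grad(x_k + s_k) - grad(x_k)

# Midpoint rule for G_bar = \int_0^1 hess(x_k + tau * s_k) dtau.
taus = (np.arange(1000) + 0.5) / 1000
G_bar = sum(hess(x_k + t * s_k) for t in taus) / len(taus)
print(np.allclose(G_bar @ s_k, y_k))   # True (up to quadrature error)
```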

What else to notice here is that $W$ is a symmetric matrix, and any symmetric matrix can be diagonalized by an orthogonal matrix: $Q^T W Q = D$, where $D$ is a diagonal matrix with the eigenvalues of $W$ on its diagonal. And since $W$ is positive definite, there exists a matrix $B$ such that $BB = W$. For example, we know that $W = QDQ^T$; let $B = QD^{1/2}Q^T$, where $(D^{1/2})_{ii} = \sqrt{D_{ii}}$, and we can easily verify that $BB = QD^{1/2}Q^T QD^{1/2}Q^T = QDQ^T = W$.
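
A small numeric sketch of this construction; the matrix W below is just a randomly generated symmetric positive definite example:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
W = M @ M.T + 4 * np.eye(4)          # a random SPD matrix (illustrative)

D, Q = np.linalg.eigh(W)             # eigenvalues D (all > 0) and orthogonal Q
B = Q @ np.diag(np.sqrt(D)) @ Q.T    # B = Q D^{1/2} Q^T, also symmetric PD

print(np.allclose(B @ B, W))                  # B B = W
print(np.allclose(Q.T @ W @ Q, np.diag(D)))   # Q^T W Q = D
```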

Frobenius norm

The Frobenius norm is unitarily invariant, which means it is not changed by a unitary transformation. For any square matrix, the unitary (orthogonal) transformation can be written as $B = R^T A R$, where $R$ is a unitary (orthogonal) matrix.

It can also be easily proved:

$$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2} = \sqrt{\operatorname{tr}(A^TA)} = \sqrt{\sum_i \lambda_i} = \sqrt{\sum_i \sigma_i^2},$$

where $\lambda_i$ are the eigenvalues of $A^TA$ and $\sigma_i$ the singular values of $A$. Therefore, for $B = R^TAR$,

$$\|B\|_F^2 = \operatorname{tr}(B^TB) = \operatorname{tr}(R^TA^TRR^TAR) = \operatorname{tr}(R^TA^TAR) = \operatorname{tr}(ARR^TA^T) = \operatorname{tr}(AA^T) = \|A\|_F^2.$$
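
A quick numeric check of this invariance; the matrices A and R below are randomly generated, with R made orthogonal via a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal matrix

B = R.T @ A @ R
print(np.linalg.norm(A, "fro"), np.linalg.norm(B, "fro"))   # equal up to rounding
```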

Derivation

Define $\hat{H} = W^{1/2} H W^{1/2}$, $\hat{H}_k = W^{1/2} H_k W^{1/2}$, $\hat{s}_k = W^{1/2} s_k$, $\hat{y}_k = W^{-1/2} y_k$; then the problem becomes

$$\min_{\hat{H}}\ \|\hat{H}_k - \hat{H}\|_F \quad \text{subject to } \hat{H}\hat{y}_k = \hat{s}_k = \hat{y}_k.$$

Note that since $W s_k = y_k$, we have $\hat{s}_k = W^{1/2} s_k = W^{-1/2} y_k = \hat{y}_k$, so the constraint says that $\hat{y}_k$ is an eigenvector of $\hat{H}$ with eigenvalue 1: $\hat{H}\hat{y}_k = \hat{y}_k$. We can then introduce a new orthogonal matrix, with a new orthogonal basis,

$$U = \begin{bmatrix} u & u_\perp \end{bmatrix}, \quad \text{where } u = \frac{\hat{y}_k}{\|\hat{y}_k\|}$$

and the columns of $u_\perp$ form an orthonormal basis of the subspace orthogonal to $u$.

Since the Frobenius norm is unitarily invariant (its square is the sum of squares of the singular values), we get

$$\|\hat{H}_k - \hat{H}\|_F = \|U^T\hat{H}_kU - U^T\hat{H}U\|_F,$$

and writing out the two terms in this basis,

$$U^T\hat{H}_kU - U^T\hat{H}U = \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}\hat{H}_k\begin{bmatrix} u & u_\perp \end{bmatrix} - \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}\hat{H}\begin{bmatrix} u & u_\perp \end{bmatrix} = \begin{bmatrix} u^T\hat{H}_ku & u^T\hat{H}_ku_\perp \\ u_\perp^T\hat{H}_ku & u_\perp^T\hat{H}_ku_\perp \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & u_\perp^T\hat{H}u_\perp \end{bmatrix},$$

where the second block matrix uses the constraint $\hat{H}u = u$ together with symmetry ($u^T\hat{H}u = 1$, $u^T\hat{H}u_\perp = u_\perp^T\hat{H}u = 0$).

We can see that the first matrix is fixed, since it only involves $\hat{H}_k$; the only block we can adjust is $u_\perp^T\hat{H}u_\perp$, and the best we can do is make it equal to the corresponding block of $\hat{H}_k$ so that the two blocks cancel.

Therefore, $u_\perp^T\hat{H}u_\perp = u_\perp^T\hat{H}_ku_\perp$, which gives us the solution

$$\hat{H} = U\begin{bmatrix} 1 & 0 \\ 0 & u_\perp^T\hat{H}_ku_\perp \end{bmatrix}U^T = \begin{bmatrix} u & u_\perp \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & u_\perp^T\hat{H}_ku_\perp \end{bmatrix}\begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix} = uu^T + u_\perp u_\perp^T\hat{H}_k u_\perp u_\perp^T = uu^T + (I - uu^T)\hat{H}_k(I - uu^T).$$

Here we use the fact that $U$ is an orthogonal basis matrix:
$$I = UU^T = \begin{bmatrix} u & u_\perp \end{bmatrix}\begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix} = uu^T + u_\perp u_\perp^T \implies u_\perp u_\perp^T = I - uu^T.$$
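
A small numeric check of the projection form above, using a randomly generated symmetric positive definite $\hat{H}_k$ and a random unit vector $u$ (both purely illustrative), confirming that the resulting $\hat{H}$ is symmetric and keeps $u$ as an eigenvector with eigenvalue 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
Hk_hat = M @ M.T + np.eye(n)          # symmetric PD "current" estimate
u = rng.standard_normal(n)
u /= np.linalg.norm(u)                # unit vector playing the role of y_hat / ||y_hat||

P = np.eye(n) - np.outer(u, u)        # projector onto the complement of u
H_hat = np.outer(u, u) + P @ Hk_hat @ P

print(np.allclose(H_hat, H_hat.T))    # symmetric
print(np.allclose(H_hat @ u, u))      # u is an eigenvector with eigenvalue 1
```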

Now, remember that we have the property $W s_k = y_k$, so $W^{-1} y_k = s_k$ and $\|\hat{y}_k\|^2 = y_k^T W^{-1} y_k = y_k^T s_k$. Transforming back via $H_{k+1} = W^{-1/2}\hat{H}W^{-1/2}$ and expanding the formula, we get the BFGS update
$$H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \quad \text{where } \rho_k = \frac{1}{y_k^T s_k}.$$
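
A minimal sketch of this update in code; the helper name `bfgs_update` and the quadratic used to generate a valid $(s_k, y_k)$ pair are illustrative choices of mine:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H_k -> H_{k+1}.

    Implements H_{k+1} = (I - rho s y^T) H_k (I - rho y s^T) + rho s s^T
    with rho = 1 / (y^T s). Minimal sketch, no safeguard for tiny y^T s.
    """
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

# For a quadratic f(x) = 0.5 x^T A x - b^T x, gradients satisfy y_k = A s_k,
# so any step s gives a valid (s, y) pair with y^T s = s^T A s > 0.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
s = np.array([1.0, -0.5])
y = A @ s

H1 = bfgs_update(np.eye(2), s, y)
print(np.allclose(H1 @ y, s))    # the secant condition H_{k+1} y_k = s_k holds
print(np.allclose(H1, H1.T))     # the update preserves symmetry
```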

Interpretation

From here, we can see that after the unitary transformation with the orthogonal matrix, the optimization problem becomes easy to solve. What else to notice is that $\hat{y}_k$ is an eigenvector of $\hat{H}$, so the unitary transformation basically represents the matrix in the eigenvector space.

And looking at the update rule $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$ with $\rho_k = \frac{1}{y_k^T s_k}$, we know that as long as $\rho_k > 0$ and $H_k$ is positive definite, the approximated (inverse) Hessian matrix remains positive definite; $\rho_k > 0$, i.e., $y_k^T s_k > 0$, is satisfied if we have a strictly convex function. Then, as long as the first estimate of the Hessian matrix is positive definite, we can make sure that we are always going in a descent direction.
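
To spell out the positive-definiteness argument: for any nonzero $z$, write $w = (I - \rho_k y_k s_k^T)z = z - \rho_k (s_k^T z)\, y_k$; then

$$z^T H_{k+1} z = w^T H_k w + \rho_k (s_k^T z)^2 \geq 0,$$

and both terms can vanish only if $s_k^T z = 0$, in which case $w = z$ and the first term forces $z = 0$. Hence $H_{k+1}$ is positive definite whenever $H_k$ is and $\rho_k > 0$.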

Reference:

https://math.stackexchange.com/questions/2271887/how-to-solve-the-matrix-minimization-for-bfgs-update-in-quasi-newton-optimizatio