Optimization
Suppose we have a function $f:\mathbb{R}^n \to \mathbb{R}$ that we want to minimize (or maximize). We can use the gradient descent method, which follows the current gradient and keeps going; the problem is that this might not be fast enough. So, for such a function, we can use the Taylor expansion around $x$ to get
$$f(x + \Delta x) = f(x) + \Delta x^T g_n + \frac{1}{2}\Delta x^T H_n \Delta x + \dots$$
Let $h_n(\Delta x) = f(x) + \Delta x^T g_n + \frac{1}{2}\Delta x^T H_n \Delta x$,
where $g_n$ and $H_n$ are the gradient and Hessian at $x_n$. Taking the derivative with respect to $\Delta x$, we have
$$\frac{\partial h_n}{\partial \Delta x} = g_n + \frac{1}{2}(H_n^T + H_n)\Delta x = g_n + H_n \Delta x,$$
since the Hessian is symmetric.
Setting this derivative to zero gives $\Delta x = -H_n^{-1}g_n$, so $-H_n^{-1}g_n$ is a good direction for minimizing the function. We can check this by computing $\langle -H_n^{-1}g_n,\, g_n\rangle = -g_n^T(H_n^{-1})^Tg_n = -g_n^TH_n^{-1}g_n$; as long as we keep $H_n$ positive definite (and hence $H_n^{-1}$ as well), this inner product is negative and we are moving in a direction that makes the function smaller.
In practice, we set the update to $x_{n+1} = x_n - \alpha H_n^{-1}g_n$, where $\alpha$ can be seen as a step size and is generally chosen by line search methods.
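To make this concrete, here is a minimal sketch of a damped Newton step with a backtracking (Armijo) line search, assuming NumPy; the test function `f`, `grad`, and `hess` below are illustrative placeholders, not taken from any particular library.

```python
import numpy as np

def newton_step(f, grad, hess, x, alpha0=1.0, beta=0.5, c=1e-4):
    """One damped Newton update x <- x - alpha * H^{-1} g with backtracking."""
    g = grad(x)
    H = hess(x)
    dx = -np.linalg.solve(H, g)          # Newton direction -H^{-1} g
    alpha = alpha0
    # Backtracking (Armijo) line search on the step size alpha
    while f(x + alpha * dx) > f(x) + c * alpha * (g @ dx):
        alpha *= beta
    return x + alpha * dx

# Illustrative strictly convex test function (placeholder example)
f    = lambda x: 0.5 * x @ x + 0.25 * np.sum(x**4)
grad = lambda x: x + x**3
hess = lambda x: np.eye(len(x)) + np.diag(3 * x**2)

x = np.array([2.0, -1.5])
for _ in range(5):
    x = newton_step(f, grad, hess, x)
print(x)  # converges towards the minimizer at the origin
```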
Quasi-Newton Method
As can be expected, such a method costs a lot, because we have to compute the inverse of the Hessian matrix at every step; therefore, we need a quasi-Newton method to get around this. Broyden proposed using a rank-2 update to approximate the Hessian matrix, because a rank-1 update would not be able to keep the approximated Hessian positive definite.
Secant condition
As usual, we need a secant condition for our approximation. Here the condition is $H_{n+1}(x_{n+1} - x_n) = g_{n+1} - g_n$, which, writing $s_n = x_{n+1} - x_n$ and $y_n = g_{n+1} - g_n$, leads us to $H_{n+1}^{-1}y_n = s_n$. The DFP method actually approximates $H_n$ itself and updates the inverse using the Sherman-Morrison formula, while the BFGS method directly updates the inverse of $H_n$. In the later derivation, I'll just use $H$ to represent the inverse.
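As a quick sanity check (a sketch assuming NumPy), on a quadratic the secant condition is exact: the gradient is linear, so the true inverse Hessian maps $y_n$ to $s_n$.

```python
import numpy as np

# On a quadratic f(x) = 0.5 x^T A x - b^T x the gradient is g(x) = A x - b,
# so y_n = g(x_{n+1}) - g(x_n) = A (x_{n+1} - x_n) = A s_n exactly,
# and the secant condition H_{n+1}^{-1} y_n = s_n is satisfied by A^{-1}.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)          # a positive definite Hessian
b = rng.standard_normal(4)
grad = lambda x: A @ x - b

x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
s = x1 - x0
y = grad(x1) - grad(x0)
print(np.allclose(np.linalg.solve(A, y), s))   # True: A^{-1} y = s
```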
To find the updating rule, we need to solve the following problem,
$$\min_H \|H - H_k\|_W \quad \text{subject to} \quad H = H^T,\; Hy_k = s_k,$$
where the weighted Frobenius norm is defined as
$$\|A\|_W = \|W^{1/2}AW^{1/2}\|_F,$$
and the weight matrix $W$ is any positive definite matrix that satisfies $Ws_k = y_k$. For concreteness, we can take $W = \bar{G}_k$, where $\bar{G}_k$ is the average Hessian, $\bar{G}_k = \int_0^1 \nabla^2 f(x_k + \tau\alpha_k p_k)\,d\tau$, which satisfies $\bar{G}_k s_k = y_k$ by Taylor's theorem.
What else to notice here is that $W$ is a symmetric matrix, and any symmetric matrix can be diagonalized by an orthogonal matrix: $Q^TWQ = D$, where $D$ is a diagonal matrix with the eigenvalues as its diagonal values. And since $W$ is positive definite, there exists a matrix $B$ such that $BB = W$. For example, we know that $W = QDQ^T$, and we let $B = QD'Q^T$ where $D'_{ii} = \sqrt{D_{ii}}$; then $BB = QD'Q^TQD'Q^T = QD'D'Q^T = QDQ^T = W$. This $B$ is what we write as $W^{1/2}$ below.
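A minimal sketch of this construction with NumPy; `sqrtm_spd` is a hypothetical helper name, built exactly as described via the eigendecomposition $W = QDQ^T$.

```python
import numpy as np

def sqrtm_spd(W):
    """Symmetric square root of a symmetric positive definite matrix W.
    Diagonalize W = Q D Q^T and set B = Q D^{1/2} Q^T, so that B B = W."""
    D, Q = np.linalg.eigh(W)              # eigenvalues D, orthogonal Q
    return (Q * np.sqrt(D)) @ Q.T         # Q diag(sqrt(D)) Q^T

# Quick check on a random SPD matrix
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
W = M @ M.T + np.eye(5)
B = sqrtm_spd(W)
print(np.allclose(B @ B, W), np.allclose(B, B.T))   # True True
```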
Frobenius norm
The Frobenius norm is unitarily invariant, which means that the Frobenius norm is not changed by a unitary transformation. For any square matrix, the unitary (orthogonal) transformation can be written as $B = R^TAR$, where $R$ is a unitary (orthogonal) matrix.
This is easy to prove:
$$\|A\|_F = \sqrt{\sum_{i,j}a_{ij}^2} = \sqrt{\operatorname{tr}(A^TA)} = \sqrt{\sum_i \lambda_i} = \sqrt{\sum_i \sigma_i^2},$$
where $\lambda_i$ are the eigenvalues of $A^TA$ and $\sigma_i$ the singular values of $A$. Therefore
$$\|B\|_F^2 = \operatorname{tr}(B^TB) = \operatorname{tr}(R^TA^TRR^TAR) = \operatorname{tr}(R^TA^TAR) = \operatorname{tr}(ARR^TA^T) = \operatorname{tr}(AA^T) = \|A\|_F^2.$$
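A quick numerical illustration of this invariance (assuming NumPy), using a random orthogonal $R$ obtained from a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # random orthogonal matrix
B = R.T @ A @ R
print(np.isclose(np.linalg.norm(A, 'fro'), np.linalg.norm(B, 'fro')))  # True
```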
Derivation
Define $\hat{H} = W^{1/2}HW^{1/2}$, $\hat{H}_k = W^{1/2}H_kW^{1/2}$, $\hat{s}_k = W^{1/2}s_k$, $\hat{y}_k = W^{-1/2}y_k$; then the problem becomes
$$\min_{\hat{H}} \|\hat{H}_k - \hat{H}\|_F \quad \text{subject to} \quad \hat{H}\hat{y}_k = \hat{y}_k = \hat{s}_k.$$
(Indeed, $\hat{H}\hat{y}_k = W^{1/2}Hy_k = W^{1/2}s_k = \hat{s}_k$, and $Ws_k = y_k$ gives $\hat{y}_k = W^{-1/2}y_k = W^{1/2}s_k = \hat{s}_k$.) What else to notice is that $\hat{y}_k$ is an eigenvector of $\hat{H}$ with eigenvalue $1$, since $\hat{H}\hat{y}_k = \hat{y}_k$. We can then introduce a new orthogonal matrix, with a new orthogonal basis,
$$U = [\,u \mid u_\perp\,], \quad \text{where } u = \frac{\hat{y}_k}{\|\hat{y}_k\|}.$$
Since the Frobenius norm is unitarily invariant (its square is the sum of squares of the singular values), we get
$$\|\hat{H}_k - \hat{H}\|_F = \|U^T\hat{H}_kU - U^T\hat{H}U\|_F = \left\|\begin{bmatrix} * & * \\ * & * \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & * \end{bmatrix}\right\|_F,$$
since, using $\hat{H}u = u$ and the symmetry of $\hat{H}$,
$$U^T\hat{H}_kU - U^T\hat{H}U = \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}\hat{H}_k\begin{bmatrix} u & u_\perp \end{bmatrix} - \begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix}\hat{H}\begin{bmatrix} u & u_\perp \end{bmatrix} = \begin{bmatrix} u^T\hat{H}_ku & u^T\hat{H}_ku_\perp \\ u_\perp^T\hat{H}_ku & u_\perp^T\hat{H}_ku_\perp \end{bmatrix} - \begin{bmatrix} 1 & 0 \\ 0 & u_\perp^T\hat{H}u_\perp \end{bmatrix}.$$
We can see that the blocks of $U^T\hat{H}_kU$ are fixed, as are the $1$ and the zeros of $U^T\hat{H}U$; the only part we can adjust is the bottom-right block $u_\perp^T\hat{H}u_\perp$, and the best choice is to make it equal to the corresponding block of $U^T\hat{H}_kU$ so that the two cancel. Therefore $u_\perp^T\hat{H}u_\perp = u_\perp^T\hat{H}_ku_\perp$, which gives us the solution
$$\hat{H} = U\begin{bmatrix} 1 & 0 \\ 0 & u_\perp^T\hat{H}_ku_\perp \end{bmatrix}U^T.$$
We use the fact that $U$ is an orthogonal basis matrix:
$$I = UU^T = \begin{bmatrix} u & u_\perp \end{bmatrix}\begin{bmatrix} u^T \\ u_\perp^T \end{bmatrix} = uu^T + u_\perp u_\perp^T \quad\Longleftrightarrow\quad u_\perp u_\perp^T = I - uu^T.$$
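A small sketch (assuming NumPy) of building the orthonormal completion $u_\perp$ and checking the identity $uu^T + u_\perp u_\perp^T = I$; the SVD-based construction is just one convenient way to get a basis of the orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
y_hat = rng.standard_normal(n)
u = y_hat / np.linalg.norm(y_hat)

# Orthonormal completion of u: the left singular vectors of u (as an n x 1
# matrix) form an orthogonal U whose first column spans u and whose remaining
# columns span the orthogonal complement u_perp.
U_full, _, _ = np.linalg.svd(u.reshape(-1, 1), full_matrices=True)
u_perp = U_full[:, 1:]

print(np.allclose(np.outer(u, u) + u_perp @ u_perp.T, np.eye(n)))  # True
print(np.allclose(u_perp.T @ u, 0))                                 # True
```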
Expanding the solution with the identity $u_\perp u_\perp^T = I - uu^T$ gives
$$\hat{H} = uu^T + u_\perp u_\perp^T\,\hat{H}_k\,u_\perp u_\perp^T = uu^T + (I - uu^T)\hat{H}_k(I - uu^T).$$
Now, remember that we have the property $Ws_k = y_k$, so $W^{-1}y_k = s_k$ and $\|\hat{y}_k\|^2 = y_k^TW^{-1}y_k = y_k^Ts_k$. Transforming back with $H = W^{-1/2}\hat{H}W^{-1/2}$ and $u = \hat{y}_k/\|\hat{y}_k\|$, we find $W^{-1/2}uu^TW^{-1/2} = \frac{s_ks_k^T}{y_k^Ts_k}$ and $W^{-1/2}(I - uu^T)W^{1/2} = I - \frac{s_ky_k^T}{y_k^Ts_k}$, and we get the BFGS update
$$H_{k+1} = (I - \rho_k s_ky_k^T)\,H_k\,(I - \rho_k y_ks_k^T) + \rho_k s_ks_k^T, \quad\text{where } \rho_k = (y_k^Ts_k)^{-1}.$$
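A minimal NumPy sketch of this update; `bfgs_inverse_update` is a hypothetical helper name, and the example simply verifies that the updated matrix satisfies the secant condition $H_{k+1}y_k = s_k$.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H, given the step
    s = x_{k+1} - x_k and gradient difference y = g_{k+1} - g_k."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# The updated matrix satisfies the secant condition H_{k+1} y_k = s_k
rng = np.random.default_rng(4)
H = np.eye(3)                          # common initial guess for the inverse Hessian
s = rng.standard_normal(3)
y = s + 0.1 * rng.standard_normal(3)   # ensure y^T s > 0 for this illustration
H_next = bfgs_inverse_update(H, s, y)
print(np.allclose(H_next @ y, s))      # True
```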
Interpretation
From here, we can see that after the unitary transformation by the orthogonal matrix $U$, the optimization problem becomes easy to solve. What else to notice is that $\hat{y}_k$ is an eigenvector of $\hat{H}$, and the unitary transformation essentially represents the matrix in a basis aligned with that eigenvector.
Now look at the update rule
$$H_{k+1} = (I - \rho_k s_ky_k^T)\,H_k\,(I - \rho_k y_ks_k^T) + \rho_k s_ks_k^T, \quad \rho_k = (y_k^Ts_k)^{-1}.$$
Writing $V_k = I - \rho_k s_ky_k^T$, we have $z^TH_{k+1}z = (V_k^Tz)^TH_k(V_k^Tz) + \rho_k(s_k^Tz)^2$ for any vector $z$, so as long as $\rho_k > 0$ and $H_k$ is positive definite, the approximated (inverse) Hessian stays positive definite. The condition $\rho_k > 0$, i.e. $y_k^Ts_k > 0$, is satisfied if the function is convex. Then, as long as the first estimate of the (inverse) Hessian is positive definite, we can make sure that we are always moving in a descent direction.
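A quick numerical check of this claim (assuming NumPy): starting from the identity and applying the update only with pairs satisfying $y^Ts > 0$, the smallest eigenvalue stays strictly positive.

```python
import numpy as np

# Numerical check: starting from H_0 = I and feeding in (s, y) pairs with
# y^T s > 0, every eigenvalue of the updated matrix stays strictly positive.
rng = np.random.default_rng(5)
n = 4
H = np.eye(n)
for _ in range(20):
    s = rng.standard_normal(n)
    y = rng.standard_normal(n)
    if y @ s <= 0:          # skip pairs violating the curvature condition
        continue
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)
print(np.linalg.eigvalsh(H).min() > 0)   # True: H stays positive definite
```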
Reference:
https://math.stackexchange.com/questions/2271887/how-to-solve-the-matrix-minimization-for-bfgs-update-in-quasi-newton-optimizatio