The Optimal Weight Vector


As is well known from elementary calculus, to find an extremum of a function you

  1. differentiate the function with respect to the free variable(s),
  2. equate the result(s) with zero, and
  3. solve the resulting equation(s).
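These three steps can be illustrated on a one-variable toy problem (invented for illustration, not part of the derivation below): minimising g(w) = (3 - 2w)^2 + w^2/10, a miniature of the penalised cost used later in this appendix. A sketch using SymPy:

```python
import sympy as sp

# Toy function (invented for illustration): a one-weight version of a
# penalised sum-squared-error, g(w) = (3 - 2w)^2 + w^2/10.
w = sp.symbols('w')
g = (3 - 2 * w) ** 2 + sp.Rational(1, 10) * w ** 2

dg = sp.diff(g, w)                       # 1. differentiate
stationary = sp.solve(sp.Eq(dg, 0), w)   # 2. equate to zero, 3. solve
print(stationary)                        # [60/41]
```

Since g is a positive quadratic in w, the single stationary point is the minimum.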

In the case of least squares applied to supervised learning with a linear model, the function to be minimised is the sum-squared-error

    S = \sum_{i=1}^{p} \left( \hat{y}_i - f(\mathbf{x}_i) \right)^2 ,

where \{(\mathbf{x}_i, \hat{y}_i)\}_{i=1}^{p} is the training set, and the free variables are the weights \{w_j\}_{j=1}^{m} of the linear model f(\mathbf{x}) = \sum_{j=1}^{m} w_j \, h_j(\mathbf{x}).

To avoid repeating two very similar analyses we shall find the minimum not of S but of the cost function

    C = \sum_{i=1}^{p} \left( \hat{y}_i - f(\mathbf{x}_i) \right)^2 + \sum_{j=1}^{m} \lambda_j \, w_j^2

used in ridge regression. This includes an additional weight-penalty term controlled by the values of the non-negative regularisation parameters \{\lambda_j\}_{j=1}^{m}. To get back to ordinary least squares (without any weight penalty) is simply a matter of setting all the regularisation parameters to zero.
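In code, the cost function is a one-liner once the model outputs f(x_i) are collected into a vector (written f = Hw later in this appendix). A minimal sketch, with data values invented for the example:

```python
import numpy as np

def ridge_cost(H, w, y_hat, lam):
    """C = sum_i (y_hat_i - f(x_i))^2 + sum_j lam_j * w_j^2,
    where the model outputs are f = H @ w (H[i, j] = h_j(x_i))."""
    residual = y_hat - H @ w
    return residual @ residual + lam @ (w ** 2)

# Invented example: p = 3 training patterns, m = 2 weights.
H = np.array([[1.0, 0.5],
              [1.0, 1.0],
              [1.0, 2.0]])
y_hat = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 1.0])
lam = np.zeros(2)                     # all lambda_j = 0: plain sum-squared-error
print(ridge_cost(H, w, y_hat, lam))   # prints 0.5
```

Setting lam to a vector of positive values adds the weight penalty; lam = np.zeros(m) recovers the ordinary sum-squared-error S.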

So, let us carry out this optimisation for the j-th weight. First, differentiate the cost function:

    \frac{\partial C}{\partial w_j} = -2 \sum_{i=1}^{p} \left( \hat{y}_i - f(\mathbf{x}_i) \right) \frac{\partial f}{\partial w_j}(\mathbf{x}_i) + 2 \, \lambda_j \, w_j .

We now need the derivative of the model which, because the model is linear in the weights, is particularly simple:

    \frac{\partial f}{\partial w_j}(\mathbf{x}) = h_j(\mathbf{x}) .

Substituting this into the derivative of the cost function and equating the result to zero leads to the equation

    \sum_{i=1}^{p} f(\mathbf{x}_i) \, h_j(\mathbf{x}_i) + \lambda_j \, w_j = \sum_{i=1}^{p} \hat{y}_i \, h_j(\mathbf{x}_i) .
There are m such equations, for 1 \le j \le m, each representing one constraint on the solution. Since there are exactly as many constraints as there are unknowns, the system of equations has, except under certain pathological conditions, a unique solution.

To find that unique solution we employ the language of matrices and vectors: linear algebra. These are invaluable for the representation and analysis of systems of linear equations like the one above, which can be rewritten in vector notation as follows:

    \mathbf{h}_j^\top \mathbf{f} + \lambda_j \, w_j = \mathbf{h}_j^\top \hat{\mathbf{y}} ,

where \hat{\mathbf{y}} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_p)^\top is the vector of training-set outputs, \mathbf{f} = (f(\mathbf{x}_1), f(\mathbf{x}_2), \ldots, f(\mathbf{x}_p))^\top is the vector of model outputs, and \mathbf{h}_j = (h_j(\mathbf{x}_1), h_j(\mathbf{x}_2), \ldots, h_j(\mathbf{x}_p))^\top is the vector of responses of the j-th basis function over the training set.

Since there is one of these equations (each relating one scalar quantity to another) for each value of j from 1 up to m, we can stack them, one on top of another, to create a relation between two vector quantities:

    \begin{pmatrix} \mathbf{h}_1^\top \mathbf{f} + \lambda_1 w_1 \\ \vdots \\ \mathbf{h}_m^\top \mathbf{f} + \lambda_m w_m \end{pmatrix}
    = \begin{pmatrix} \mathbf{h}_1^\top \hat{\mathbf{y}} \\ \vdots \\ \mathbf{h}_m^\top \hat{\mathbf{y}} \end{pmatrix} .
However, using the laws of matrix multiplication, this is just equivalent to

    \mathbf{H}^\top \mathbf{f} + \mathbf{\Lambda} \, \mathbf{w} = \mathbf{H}^\top \hat{\mathbf{y}} ,

where \mathbf{w} = (w_1, w_2, \ldots, w_m)^\top is the weight vector, \mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m) is the diagonal matrix of regularisation parameters, and where \mathbf{H}, which is called the design matrix, has the vectors \mathbf{h}_j as its columns,

    \mathbf{H} = \left( \mathbf{h}_1 \;\; \mathbf{h}_2 \;\; \cdots \;\; \mathbf{h}_m \right) ,

and has p rows, one for each pattern in the training set. Written out in full it is

    \mathbf{H} = \begin{pmatrix}
        h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_m(\mathbf{x}_1) \\
        h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_m(\mathbf{x}_2) \\
        \vdots            & \vdots            &        & \vdots            \\
        h_1(\mathbf{x}_p) & h_2(\mathbf{x}_p) & \cdots & h_m(\mathbf{x}_p)
    \end{pmatrix} ,

which is the form in which the design matrix appears in the main text.
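Constructing the design matrix is mechanical once the basis functions are chosen. The sketch below uses Gaussian radial basis functions as an illustrative choice (the derivation itself holds for any fixed basis); the patterns, centres and radius are invented for the example:

```python
import numpy as np

def design_matrix(X, centres, r=1.0):
    """p-by-m design matrix H with H[i, j] = h_j(x_i), using the
    illustrative Gaussian basis h_j(x) = exp(-||x - c_j||^2 / r^2)."""
    # Squared distance between every pattern and every centre.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / r ** 2)

X = np.array([[0.0], [1.0], [2.0]])    # p = 3 training patterns (invented)
centres = np.array([[0.0], [2.0]])     # m = 2 basis-function centres (invented)
H = design_matrix(X, centres)
print(H.shape)   # (3, 2): one row per pattern, one column per basis function
```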

The vector \mathbf{f} can be decomposed into the product of two terms, the design matrix and the weight vector, since each of its components is a dot-product between two m-dimensional vectors. For example, the i-th component of \mathbf{f} when the weights are at their optimal values is

    f(\mathbf{x}_i) = \sum_{j=1}^{m} \hat{w}_j \, h_j(\mathbf{x}_i) = \tilde{\mathbf{h}}_i^\top \hat{\mathbf{w}} ,

where \tilde{\mathbf{h}}_i = (h_1(\mathbf{x}_i), h_2(\mathbf{x}_i), \ldots, h_m(\mathbf{x}_i))^\top. Note that while \mathbf{h}_j is one of the columns of \mathbf{H}, \tilde{\mathbf{h}}_i^\top is one of its rows. \mathbf{f} is the result of stacking the f(\mathbf{x}_i) one on top of the other, or

    \mathbf{f} = \mathbf{H} \hat{\mathbf{w}} .
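The decomposition of the model-output vector is easy to check numerically: each component of the matrix-vector product is exactly the dot-product of one row of the design matrix with the weight vector. A small sketch with invented values:

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 3))   # invented design matrix: p = 4, m = 3
w = rng.standard_normal(3)        # invented weight vector

f = H @ w                         # the stacked model outputs

# Component i of f is the dot-product of the i-th row of H with w.
f_by_rows = np.array([H[i] @ w for i in range(4)])
print(np.allclose(f, f_by_rows))  # True
```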
Finally, substituting this expression for \mathbf{f} into the previous equation gives

    \mathbf{H}^\top \mathbf{H} \hat{\mathbf{w}} + \mathbf{\Lambda} \hat{\mathbf{w}} = \mathbf{H}^\top \hat{\mathbf{y}} ,

the solution to which is

    \hat{\mathbf{w}} = \left( \mathbf{H}^\top \mathbf{H} + \mathbf{\Lambda} \right)^{-1} \mathbf{H}^\top \hat{\mathbf{y}} ,

which is the normal equation referred to in the main text.
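A numerical sketch of the general normal equation, with invented data: build \Lambda as a diagonal matrix, solve the linear system (preferable to forming the matrix inverse explicitly), and confirm that every partial derivative of the cost vanishes at the solution.

```python
import numpy as np

# Invented problem: p = 5 patterns, m = 3 basis functions.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 3))
y_hat = rng.standard_normal(5)
lam = np.array([0.1, 0.5, 1.0])   # one regularisation parameter per weight

# Normal equation: (H^T H + Lambda) w = H^T y_hat.
A = H.T @ H + np.diag(lam)
w_opt = np.linalg.solve(A, H.T @ y_hat)

# At the optimum, dC/dw_j = -2 h_j^T (y_hat - H w) + 2 lam_j w_j = 0 for all j.
grad = -2 * H.T @ (y_hat - H @ w_opt) + 2 * lam * w_opt
print(np.allclose(grad, 0.0))     # True
```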

The latter equation is the most general form of the normal equation which we deal with here. There are two special cases. In standard ridge regression all the regularisation parameters share a common value \lambda, so \mathbf{\Lambda} = \lambda \mathbf{I}_m, where \mathbf{I}_m is the m-dimensional identity matrix, and

    \hat{\mathbf{w}} = \left( \mathbf{H}^\top \mathbf{H} + \lambda \, \mathbf{I}_m \right)^{-1} \mathbf{H}^\top \hat{\mathbf{y}} .

Ordinary least squares, where there is no weight penalty, is obtained by setting all the regularisation parameters to zero, so

    \hat{\mathbf{w}} = \left( \mathbf{H}^\top \mathbf{H} \right)^{-1} \mathbf{H}^\top \hat{\mathbf{y}} .
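As a sanity check on this last case, the normal-equation solution can be compared against an off-the-shelf least-squares solver; the data below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((8, 3))   # invented design matrix: p = 8, m = 3
y_hat = rng.standard_normal(8)

# Normal equation with all regularisation parameters set to zero ...
w_normal = np.linalg.solve(H.T @ H, H.T @ y_hat)

# ... agrees with NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(H, y_hat, rcond=None)
print(np.allclose(w_normal, w_lstsq))   # True
```

In practice a QR- or SVD-based solver such as lstsq is numerically safer than forming \mathbf{H}^\top \mathbf{H} explicitly, especially when the design matrix is ill-conditioned.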