The Optimal Weight Vector

As is well known from elementary calculus, to find an extremum of a function you:

  1. differentiate the function with respect to the free variable(s),
  2. equate the result(s) with zero, and
  3. solve the resulting equation(s).
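
As a small worked example (not from the original text), applying this recipe to the one-dimensional quadratic $g(w) = (w - a)^2$ gives

\[ \frac{dg}{dw} \;=\; 2 \, (w - a) \;=\; 0 \qquad\Longrightarrow\qquad w \;=\; a , \]

its unique minimum. The derivation below applies the same three steps to each weight of a linear model.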

In the case of least squares applied to supervised learning with a linear model, the function to be minimised is the sum-squared-error

\[ S \;=\; \sum_{i=1}^{p} \left( \hat{y}_i - f(\mathbf{x}_i) \right)^2 , \]

where

\[ f(\mathbf{x}) \;=\; \sum_{j=1}^{m} w_j \, h_j(\mathbf{x}) , \]

and the free variables are the weights $\{ w_j \}_{j=1}^{m}$.

To avoid repeating two very similar analyses we shall find the minimum not of $S$ but of the cost function

\[ C \;=\; \sum_{i=1}^{p} \left( \hat{y}_i - f(\mathbf{x}_i) \right)^2 \;+\; \sum_{j=1}^{m} \lambda_j \, w_j^2 , \]

used in ridge regression. This includes an additional weight-penalty term controlled by the non-negative regularisation parameters $\{ \lambda_j \}_{j=1}^{m}$. To get back to ordinary least squares (without any weight penalty), simply set all the regularisation parameters to zero.
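
To make these definitions concrete, here is a minimal numerical sketch (not from the original text). It assumes Gaussian basis functions with arbitrary centres and width, synthetic training data, and an arbitrary weight vector, and simply evaluates $S$ and $C$.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative set-up (assumed, not from the text): m Gaussian basis functions,
    # p training points, a noisy target function and an arbitrary weight vector.
    p, m = 20, 5
    centres = np.linspace(-1.0, 1.0, m)       # assumed basis-function centres
    width = 0.5                               # assumed common width
    x = rng.uniform(-1.0, 1.0, p)             # training inputs x_1..x_p
    y_hat = np.sin(np.pi * x) + 0.1 * rng.standard_normal(p)   # training outputs
    lam = np.full(m, 0.01)                    # non-negative regularisation parameters
    w = rng.standard_normal(m)                # an arbitrary (not yet optimal) weight vector

    def basis(x):
        """Values of the m basis functions h_1..h_m at the points x, shape (len(x), m)."""
        return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))

    def model(x, w):
        """Linear model f(x) = sum_j w_j h_j(x)."""
        return basis(x) @ w

    S = np.sum((y_hat - model(x, w)) ** 2)    # sum-squared-error
    C = S + np.sum(lam * w ** 2)              # ridge cost: S plus the weight penalty
    print(S, C)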

So, let us carry out this optimisation for the $j$-th weight. First, differentiating the cost function gives

\[ \frac{\partial C}{\partial w_j} \;=\; 2 \sum_{i=1}^{p} \left( f(\mathbf{x}_i) - \hat{y}_i \right) \frac{\partial f(\mathbf{x}_i)}{\partial w_j} \;+\; 2 \, \lambda_j \, w_j . \]

We now need the derivative of the model, which, because the model is linear, is particularly simple.

\[ \frac{\partial f(\mathbf{x}_i)}{\partial w_j} \;=\; h_j(\mathbf{x}_i) . \]

Substituting this into the derivative of the cost function and equating the result to zero leads to the equation

\[ \sum_{i=1}^{p} f(\mathbf{x}_i) \, h_j(\mathbf{x}_i) \;+\; \lambda_j \, w_j \;=\; \sum_{i=1}^{p} \hat{y}_i \, h_j(\mathbf{x}_i) . \]

There are $m$ such equations, for $j = 1, 2, \ldots, m$, each representing one constraint on the solution. Since there are exactly as many constraints as there are unknowns, the system of equations has, except under certain pathological conditions, a unique solution.
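
The derivative formula can be checked numerically. The sketch below (using the same assumed Gaussian basis and synthetic data as the previous sketch, repeated so that it runs on its own) compares the analytic expression for $\partial C / \partial w_j$ with a central finite difference of the cost.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same illustrative set-up as the previous sketch (assumed, not from the text).
    p, m = 20, 5
    centres = np.linspace(-1.0, 1.0, m)
    width = 0.5
    x = rng.uniform(-1.0, 1.0, p)
    y_hat = np.sin(np.pi * x) + 0.1 * rng.standard_normal(p)
    lam = np.full(m, 0.01)
    w = rng.standard_normal(m)

    # Matrix of basis-function values: H[i, j] = h_j(x_i).
    H = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))

    def cost(w):
        """Ridge cost C = sum_i (y_i - f(x_i))^2 + sum_j lambda_j w_j^2."""
        return np.sum((y_hat - H @ w) ** 2) + np.sum(lam * w ** 2)

    # Analytic derivative: dC/dw_j = 2 sum_i (f(x_i) - y_i) h_j(x_i) + 2 lambda_j w_j.
    j = 2
    analytic = 2.0 * np.sum((H @ w - y_hat) * H[:, j]) + 2.0 * lam[j] * w[j]

    # Central finite-difference estimate of the same derivative.
    eps = 1e-6
    e_j = np.zeros(m)
    e_j[j] = eps
    numeric = (cost(w + e_j) - cost(w - e_j)) / (2.0 * eps)

    print(analytic, numeric)   # the two estimates should agree to several decimal places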

To find that unique solution we employ the language of matrices and vectors: linear algebra. This is invaluable for the representation and analysis of systems of linear equations like the one above, which can be rewritten in vector notation as follows.

\[ \mathbf{h}_j^\top \mathbf{f} \;+\; \lambda_j \, w_j \;=\; \mathbf{h}_j^\top \hat{\mathbf{y}} , \]

where

\[ \mathbf{h}_j \;=\; \begin{bmatrix} h_j(\mathbf{x}_1) \\ h_j(\mathbf{x}_2) \\ \vdots \\ h_j(\mathbf{x}_p) \end{bmatrix} , \qquad \mathbf{f} \;=\; \begin{bmatrix} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_p) \end{bmatrix} , \qquad \hat{\mathbf{y}} \;=\; \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_p \end{bmatrix} . \]

Since there is one of these equations (each relating one scalar quantity to another) for each value of $j$ from 1 up to $m$, we can stack them, one on top of another, to create a relation between two vector quantities.

\[ \begin{bmatrix} \mathbf{h}_1^\top \mathbf{f} \\ \mathbf{h}_2^\top \mathbf{f} \\ \vdots \\ \mathbf{h}_m^\top \mathbf{f} \end{bmatrix} \;+\; \begin{bmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_m w_m \end{bmatrix} \;=\; \begin{bmatrix} \mathbf{h}_1^\top \hat{\mathbf{y}} \\ \mathbf{h}_2^\top \hat{\mathbf{y}} \\ \vdots \\ \mathbf{h}_m^\top \hat{\mathbf{y}} \end{bmatrix} . \]

However, using the laws of matrix multiplication, this is just equivalent to

\[ \mathbf{H}^\top \mathbf{f} \;+\; \boldsymbol{\Lambda} \, \mathbf{w} \;=\; \mathbf{H}^\top \hat{\mathbf{y}} , \]

where

\[ \boldsymbol{\Lambda} \;=\; \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_m \end{bmatrix} , \qquad \mathbf{w} \;=\; \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix} , \]

and where $\mathbf{H}$, which is called the design matrix, has the vectors $\mathbf{h}_j$ as its columns,

\[ \mathbf{H} \;=\; \begin{bmatrix} \mathbf{h}_1 & \mathbf{h}_2 & \cdots & \mathbf{h}_m \end{bmatrix} , \]

and has $p$ rows, one for each pattern in the training set. Written out in full it is

\[ \mathbf{H} \;=\; \begin{bmatrix} h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_m(\mathbf{x}_1) \\ h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_m(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_1(\mathbf{x}_p) & h_2(\mathbf{x}_p) & \cdots & h_m(\mathbf{x}_p) \end{bmatrix} , \]

which is the form in which the design matrix appears in the main text.
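
Building the design matrix in code is a one-liner. The sketch below (again using the assumed Gaussian basis functions and synthetic inputs from the earlier sketches) constructs $\mathbf{H}$ with $p$ rows and $m$ columns, so that column $j$ contains the values of $h_j$ at every training input and each row corresponds to one training pattern.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same illustrative Gaussian basis as in the earlier sketches (an assumption).
    p, m = 20, 5
    centres = np.linspace(-1.0, 1.0, m)
    width = 0.5
    x = rng.uniform(-1.0, 1.0, p)

    # Design matrix: H[i, j] = h_j(x_i), p rows (patterns) by m columns (basis functions).
    H = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))

    print(H.shape)        # (p, m) = (20, 5)
    print(H[:, 2])        # one column: the third basis function evaluated at every x_i
    print(H[4, :])        # one row: every basis function evaluated at the fifth x_i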

The vector $\mathbf{f}$ can be decomposed into the product of two terms, the design matrix and the weight vector, since each of its components is a dot-product between two $m$-dimensional vectors. For example, the $i$-th component of $\mathbf{f}$ when the weights are at their optimal values is

\[ f(\mathbf{x}_i) \;=\; \sum_{j=1}^{m} \hat{w}_j \, h_j(\mathbf{x}_i) \;=\; \tilde{\mathbf{h}}_i^\top \hat{\mathbf{w}} , \]

where

\[ \tilde{\mathbf{h}}_i \;=\; \begin{bmatrix} h_1(\mathbf{x}_i) \\ h_2(\mathbf{x}_i) \\ \vdots \\ h_m(\mathbf{x}_i) \end{bmatrix} , \qquad \hat{\mathbf{w}} \;=\; \begin{bmatrix} \hat{w}_1 \\ \hat{w}_2 \\ \vdots \\ \hat{w}_m \end{bmatrix} . \]

Note that while $\mathbf{h}_j$ is one of the columns of $\mathbf{H}$, $\tilde{\mathbf{h}}_i^\top$ is one of its rows. $\mathbf{f}$ is the result of stacking the products $\tilde{\mathbf{h}}_i^\top \hat{\mathbf{w}}$ one on top of another, or

\[ \mathbf{f} \;=\; \begin{bmatrix} \tilde{\mathbf{h}}_1^\top \hat{\mathbf{w}} \\ \tilde{\mathbf{h}}_2^\top \hat{\mathbf{w}} \\ \vdots \\ \tilde{\mathbf{h}}_p^\top \hat{\mathbf{w}} \end{bmatrix} \;=\; \mathbf{H} \hat{\mathbf{w}} . \]

Finally, substituting this expression for $\mathbf{f}$ into the previous equation (in which the weights are likewise at their optimal values) gives

\[ \mathbf{H}^\top \mathbf{H} \hat{\mathbf{w}} \;+\; \boldsymbol{\Lambda} \, \hat{\mathbf{w}} \;=\; \mathbf{H}^\top \hat{\mathbf{y}} , \]

the solution to which is

\[ \hat{\mathbf{w}} \;=\; \left( \mathbf{H}^\top \mathbf{H} + \boldsymbol{\Lambda} \right)^{-1} \mathbf{H}^\top \hat{\mathbf{y}} , \]

which is where the normal equation comes from.
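
Numerically, it is better to solve the linear system than to form the inverse explicitly. The sketch below (same assumed Gaussian basis and synthetic data as before) computes the optimal weight vector via np.linalg.solve and then checks that the result satisfies the equation it was derived from.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same illustrative set-up as before (assumed Gaussian basis, synthetic data).
    p, m = 20, 5
    centres = np.linspace(-1.0, 1.0, m)
    width = 0.5
    x = rng.uniform(-1.0, 1.0, p)
    y_hat = np.sin(np.pi * x) + 0.1 * rng.standard_normal(p)
    lam = np.full(m, 0.01)                 # regularisation parameters lambda_1..lambda_m

    H = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))  # design matrix
    Lam = np.diag(lam)                     # the diagonal matrix Lambda

    # Solve (H^T H + Lambda) w = H^T y rather than inverting the matrix explicitly.
    A = H.T @ H + Lam
    b = H.T @ y_hat
    w_opt = np.linalg.solve(A, b)

    # The solution should satisfy the equation it was derived from.
    print(np.allclose(H.T @ (H @ w_opt) + Lam @ w_opt, H.T @ y_hat))   # True
    print(w_opt)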

The solution above is the most general form of the normal equation that we deal with here. There are two special cases. In standard ridge regression $\lambda_j = \lambda$ for all $j$, so

\[ \hat{\mathbf{w}} \;=\; \left( \mathbf{H}^\top \mathbf{H} + \lambda \, \mathbf{I}_m \right)^{-1} \mathbf{H}^\top \hat{\mathbf{y}} , \]

which is the ridge regression solution quoted in the main text.

Ordinary least squares, where there is no weight penalty, is obtained by setting all the regularisation parameters to zero, so

\[ \hat{\mathbf{w}} \;=\; \left( \mathbf{H}^\top \mathbf{H} \right)^{-1} \mathbf{H}^\top \hat{\mathbf{y}} . \]
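
The two special cases are equally direct to compute. In the sketch below (same assumed data as before), standard ridge regression adds $\lambda \mathbf{I}_m$ with a single assumed value of $\lambda$, and ordinary least squares drops the penalty; np.linalg.lstsq is included as a numerically more stable route to the same least-squares solution.

    import numpy as np

    rng = np.random.default_rng(0)

    # Same illustrative set-up as before (assumed Gaussian basis, synthetic data).
    p, m = 20, 5
    centres = np.linspace(-1.0, 1.0, m)
    width = 0.5
    x = rng.uniform(-1.0, 1.0, p)
    y_hat = np.sin(np.pi * x) + 0.1 * rng.standard_normal(p)
    H = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * width ** 2))  # design matrix

    # Standard ridge regression: a single regularisation parameter lambda.
    lam = 0.01
    w_ridge = np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y_hat)

    # Ordinary least squares: all regularisation parameters set to zero.
    w_ols = np.linalg.solve(H.T @ H, H.T @ y_hat)

    # The same OLS solution via a numerically more stable least-squares routine.
    w_lstsq, *_ = np.linalg.lstsq(H, y_hat, rcond=None)
    print(np.allclose(w_ols, w_lstsq))   # True (when H^T H is well conditioned)

    print(w_ridge)
    print(w_ols)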

