As is well known from elementary calculus, to find an extremum of a function you differentiate it with respect to its free variables and solve for the point at which the derivatives vanish.
In the case of least squares applied to supervised learning with a linear model the function to be minimised is the sum-squared-error

$$ S \;=\; \sum_{i=1}^{p} \bigl( \hat{y}_i - f(\mathbf{x}_i) \bigr)^2 $$

and the free variables are the weights $\{ w_j \}_{j=1}^{m}$.
To avoid repeating two very similar analyses we shall find the minimum not of $S$ but of the cost function

$$ C \;=\; \sum_{i=1}^{p} \bigl( \hat{y}_i - f(\mathbf{x}_i) \bigr)^2 \;+\; \sum_{j=1}^{m} \lambda_j\, w_j^2 $$

used in ridge regression. This includes an additional weight penalty term controlled by the values of the non-negative regularisation parameters $\{ \lambda_j \}_{j=1}^{m}$. To get back to ordinary least squares (without any weight penalty) is simply a matter of setting all the regularisation parameters to zero.
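In code, this cost can be evaluated directly. The following numpy sketch is illustrative: the function and variable names are my own, and the design matrix `H` (whose rows hold the basis-function values for each pattern, introduced formally below) is assumed given.

```python
import numpy as np

def ridge_cost(w, H, y, lam):
    """Penalised cost C = sum_i (y_i - f(x_i))^2 + sum_j lam_j * w_j^2,
    where f(x_i) is the i-th entry of H @ w."""
    residuals = y - H @ w               # y_i - f(x_i) for every training pattern
    return residuals @ residuals + lam @ (w * w)
```

Setting `lam` to a vector of zeros recovers the plain sum-squared-error $S$.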
So, let us carry out this optimisation for the $j$-th weight. First, differentiating the cost function gives

$$ \frac{\partial C}{\partial w_j} \;=\; 2 \sum_{i=1}^{p} \bigl( f(\mathbf{x}_i) - \hat{y}_i \bigr)\, \frac{\partial f}{\partial w_j}(\mathbf{x}_i) \;+\; 2\,\lambda_j\, w_j \,. $$
We now need the derivative of the model which, because the model

$$ f(\mathbf{x}) \;=\; \sum_{j=1}^{m} w_j\, h_j(\mathbf{x}) $$

is linear in its weights, is particularly simple:

$$ \frac{\partial f}{\partial w_j}(\mathbf{x}) \;=\; h_j(\mathbf{x}) \,. $$
Substituting this into the derivative of the cost function and equating the result to zero leads to the equation satisfied by the optimal weights, which we mark with a hat,

$$ \sum_{i=1}^{p} f(\mathbf{x}_i)\, h_j(\mathbf{x}_i) \;+\; \lambda_j\, \hat{w}_j \;=\; \sum_{i=1}^{p} \hat{y}_i\, h_j(\mathbf{x}_i) \,, $$

where $f$ is evaluated with the weights at their optimal values.
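The derivative used here can be verified numerically with central finite differences. The sketch below uses made-up data and my own function names; the analytic gradient implements $\partial C / \partial w_j = 2\sum_i (f(\mathbf{x}_i) - \hat{y}_i)\,h_j(\mathbf{x}_i) + 2\lambda_j w_j$.

```python
import numpy as np

def cost(w, H, y, lam):
    r = y - H @ w
    return r @ r + lam @ (w * w)

def grad(w, H, y, lam):
    # dC/dw_j = 2 * sum_i (f(x_i) - y_i) * h_j(x_i) + 2 * lam_j * w_j
    return 2 * H.T @ (H @ w - y) + 2 * lam * w

# Finite-difference check of the analytic derivative at an arbitrary point.
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 3))
y = rng.normal(size=5)
lam = np.array([0.1, 0.2, 0.3])
w = rng.normal(size=3)
eps = 1e-6
g_fd = np.array([(cost(w + eps * e, H, y, lam) - cost(w - eps * e, H, y, lam)) / (2 * eps)
                 for e in np.eye(3)])
```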
There are $m$ such equations, for $j = 1, 2, \dots, m$, each representing one constraint on the solution. Since there are exactly as many constraints as there are unknowns (the $m$ weights), the system of equations has, except under certain pathological conditions, a unique solution.
To find that unique solution we employ the language of matrices and vectors: linear algebra. It is invaluable for the representation and analysis of systems of linear equations like the one above, which can be rewritten in vector notation as

$$ \mathbf{h}_j^{\top} \mathbf{f} \;+\; \lambda_j\, \hat{w}_j \;=\; \mathbf{h}_j^{\top} \hat{\mathbf{y}} \,, $$

where $\mathbf{h}_j = \bigl( h_j(\mathbf{x}_1), \dots, h_j(\mathbf{x}_p) \bigr)^{\top}$, $\mathbf{f} = \bigl( f(\mathbf{x}_1), \dots, f(\mathbf{x}_p) \bigr)^{\top}$ and $\hat{\mathbf{y}} = \bigl( \hat{y}_1, \dots, \hat{y}_p \bigr)^{\top}$.
Since there is one of these equations (each relating one scalar quantity to another) for each value of $j$ from 1 up to $m$ we can stack them, one on top of another, to create a relation between two vector quantities:

$$ \begin{pmatrix} \mathbf{h}_1^{\top} \mathbf{f} \\ \vdots \\ \mathbf{h}_m^{\top} \mathbf{f} \end{pmatrix} + \begin{pmatrix} \lambda_1 \hat{w}_1 \\ \vdots \\ \lambda_m \hat{w}_m \end{pmatrix} = \begin{pmatrix} \mathbf{h}_1^{\top} \hat{\mathbf{y}} \\ \vdots \\ \mathbf{h}_m^{\top} \hat{\mathbf{y}} \end{pmatrix}. $$
However, using the laws of matrix multiplication, this is just equivalent to

$$ \mathbf{H}^{\top} \mathbf{f} \;+\; \boldsymbol{\Lambda}\, \hat{\mathbf{w}} \;=\; \mathbf{H}^{\top} \hat{\mathbf{y}} \,, $$

where $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$
and where $\mathbf{H}$, which is called the design matrix, has the vectors $\mathbf{h}_j$ as its columns and has $p$ rows, one for each pattern in the training set. Written out in full it is

$$ \mathbf{H} \;=\; \begin{pmatrix}
h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_m(\mathbf{x}_1) \\
h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_m(\mathbf{x}_2) \\
\vdots & \vdots & & \vdots \\
h_1(\mathbf{x}_p) & h_2(\mathbf{x}_p) & \cdots & h_m(\mathbf{x}_p)
\end{pmatrix}, $$

which is where the design matrix comes from.
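Constructing the design matrix is mechanical once the basis functions are fixed. The sketch below assumes one-dimensional inputs and Gaussian basis functions with arbitrarily chosen centres; both choices are purely illustrative.

```python
import numpy as np

def design_matrix(X, basis_fns):
    """Design matrix H: p rows (one per training pattern),
    m columns (one per basis function), with H[i, j] = h_j(x_i)."""
    return np.column_stack([h(X) for h in basis_fns])

# Example: three fixed Gaussian bumps on 1-d inputs (centres chosen for illustration).
centres = [-1.0, 0.0, 1.0]
basis_fns = [lambda x, c=c: np.exp(-(x - c) ** 2) for c in centres]
X = np.array([-0.5, 0.0, 0.5, 1.0])   # p = 4 training patterns
H = design_matrix(X, basis_fns)       # shape (4, 3)
```

Note that row $i$ of `H` holds the values of all $m$ basis functions at $\mathbf{x}_i$, while column $j$ holds the values of $h_j$ at all $p$ patterns.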
The vector $\mathbf{f}$ can be decomposed into the product of two terms, the design matrix and the weight vector, since each of its components is a dot-product between two $m$-dimensional vectors. For example, the $i$-th component of $\mathbf{f}$ when the weights are at their optimal values is

$$ f(\mathbf{x}_i) \;=\; \sum_{j=1}^{m} \hat{w}_j\, h_j(\mathbf{x}_i) \;=\; \mathbf{h}(\mathbf{x}_i)^{\top} \hat{\mathbf{w}} \,, $$

where $\mathbf{h}(\mathbf{x}_i) = \bigl( h_1(\mathbf{x}_i), \dots, h_m(\mathbf{x}_i) \bigr)^{\top}$. Note that while $\mathbf{h}_j$ is one of the columns of $\mathbf{H}$, $\mathbf{h}(\mathbf{x}_i)^{\top}$ is one of its rows. $\mathbf{f}$ is the result of stacking the $f(\mathbf{x}_i)$ one on top of the other, or

$$ \mathbf{f} \;=\; \begin{pmatrix} \mathbf{h}(\mathbf{x}_1)^{\top} \hat{\mathbf{w}} \\ \vdots \\ \mathbf{h}(\mathbf{x}_p)^{\top} \hat{\mathbf{w}} \end{pmatrix} \;=\; \mathbf{H}\, \hat{\mathbf{w}} \,. $$
Finally, substituting this expression for $\mathbf{f}$ into the previous equation gives

$$ \mathbf{H}^{\top} \mathbf{H}\, \hat{\mathbf{w}} \;+\; \boldsymbol{\Lambda}\, \hat{\mathbf{w}} \;=\; \mathbf{H}^{\top} \hat{\mathbf{y}} \,, $$

the solution to which is

$$ \hat{\mathbf{w}} \;=\; \bigl( \mathbf{H}^{\top} \mathbf{H} + \boldsymbol{\Lambda} \bigr)^{-1} \mathbf{H}^{\top} \hat{\mathbf{y}} \,, $$

which is where the normal equation comes from.
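Numerically it is better to solve the linear system $(\mathbf{H}^{\top}\mathbf{H} + \boldsymbol{\Lambda})\,\hat{\mathbf{w}} = \mathbf{H}^{\top}\hat{\mathbf{y}}$ than to form the inverse explicitly. A minimal numpy sketch, with names of my own choosing:

```python
import numpy as np

def normal_equation(H, y, lam):
    """Solve (H^T H + Lambda) w = H^T y, where Lambda = diag(lam).
    Solving the system directly avoids forming the matrix inverse."""
    A = H.T @ H + np.diag(lam)
    return np.linalg.solve(A, H.T @ y)
```

Passing a vector of zeros for `lam` gives the ordinary least squares weights, and a constant vector gives standard ridge regression.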
The latter is the most general form of the normal equation we deal with here. There are two special cases. In standard ridge regression all the regularisation parameters share a single value $\lambda$, so that $\boldsymbol{\Lambda} = \lambda\, \mathbf{I}_m$ and

$$ \hat{\mathbf{w}} \;=\; \bigl( \mathbf{H}^{\top} \mathbf{H} + \lambda\, \mathbf{I}_m \bigr)^{-1} \mathbf{H}^{\top} \hat{\mathbf{y}} \,, $$

which is where the standard ridge regression form of the normal equation comes from.
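The effect of the single regularisation parameter is to shrink the weights towards zero, which can be seen directly in code. The data below are made up for illustration:

```python
import numpy as np

def ridge_solution(H, y, lam):
    """Standard ridge regression: w = (H^T H + lam * I)^{-1} H^T y."""
    m = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(m), H.T @ y)

rng = np.random.default_rng(2)
H = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w_ols = ridge_solution(H, y, 0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_solution(H, y, 5.0)   # the weight penalty shrinks the weights
```

The norm of `w_ridge` is smaller than that of `w_ols`: in the singular basis of $\mathbf{H}$ each weight component is scaled down by the penalty.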
Ordinary least squares, where there is no weight penalty, is obtained by setting all the regularisation parameters to zero, so

$$ \hat{\mathbf{w}} \;=\; \bigl( \mathbf{H}^{\top} \mathbf{H} \bigr)^{-1} \mathbf{H}^{\top} \hat{\mathbf{y}} \,. $$
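As a sanity check on this special case, the normal-equation weights agree with numpy's built-in least-squares routine (the data here are again made up):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 3))          # design matrix: p = 10 patterns, m = 3 weights
y = rng.normal(size=10)               # targets

# Normal equation with every regularisation parameter set to zero.
w_normal = np.linalg.solve(H.T @ H, H.T @ y)

# Reference answer from numpy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)
```

The two weight vectors coincide (up to floating-point error) whenever $\mathbf{H}^{\top}\mathbf{H}$ is invertible; `lstsq` additionally handles the rank-deficient, "pathological" cases mentioned earlier.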