As is well known from elementary calculus, to find an extremum of a function you

- differentiate the function with respect to the free variable(s),
- equate the result(s) with zero, and
- solve the resulting equation(s).

In the case of least squares applied to supervised learning with a linear model the function to be minimised is the sum-squared-error

$$
S = \sum_{i=1}^{p} \big( \hat{y}_i - f(\mathbf{x}_i) \big)^2,
$$

where the sum runs over the $p$ input-target pairs $(\mathbf{x}_i, \hat{y}_i)$ of the training set, the linear model is

$$
f(\mathbf{x}) = \sum_{j=1}^{m} w_j \, h_j(\mathbf{x}),
$$

and the free variables are the weights $\{w_j\}_{j=1}^{m}$.

To avoid repeating two very similar analyses we
shall find the minimum not of *S* but of the cost function

$$
C = \sum_{i=1}^{p} \big( \hat{y}_i - f(\mathbf{x}_i) \big)^2 + \sum_{j=1}^{m} \lambda_j \, w_j^2
$$

used in ridge regression. This includes an additional weight penalty term controlled by the values of the non-negative regularisation parameters $\{\lambda_j\}_{j=1}^{m}$. To get back to ordinary least squares (without any weight penalty) is simply a matter of setting all the regularisation parameters to zero.
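As a concrete sketch of this cost function (the training set, basis functions, and parameter values below are illustrative assumptions, not taken from the text):

```python
import numpy as np

# Illustrative training set: p = 4 patterns with targets y_hat.
X = np.array([0.0, 1.0, 2.0, 3.0])
y_hat = np.array([0.1, 0.9, 2.1, 2.9])

def model(x, w):
    """Linear model with m = 2 assumed basis functions: h1(x) = 1, h2(x) = x."""
    return w[0] * 1.0 + w[1] * x

def ridge_cost(w, lam):
    """Sum-squared-error plus the weight penalty sum_j lam[j] * w[j]**2."""
    errors = y_hat - model(X, w)
    return np.sum(errors ** 2) + np.sum(lam * w ** 2)

w = np.array([0.0, 1.0])
print(ridge_cost(w, np.zeros(2)))          # just S: about 0.04 here
print(ridge_cost(w, np.array([0.1, 0.1]))) # S plus the penalty 0.1 * 1**2
```

Setting every regularisation parameter to zero reduces the cost to the plain sum-squared-error *S*, as noted above.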

So, let us carry out this optimisation for the *j*-th weight.
First, differentiating the cost function with respect to $w_j$ gives

$$
\frac{\partial C}{\partial w_j} = 2 \sum_{i=1}^{p} \big( f(\mathbf{x}_i) - \hat{y}_i \big) \frac{\partial f(\mathbf{x}_i)}{\partial w_j} + 2 \lambda_j w_j.
$$

We now need the derivative of the model which, because the model is linear, is particularly simple:

$$
\frac{\partial f(\mathbf{x}_i)}{\partial w_j} = h_j(\mathbf{x}_i).
$$

Substituting this into the derivative of the cost function and equating the result to zero leads to the equation

$$
\sum_{i=1}^{p} f(\mathbf{x}_i) \, h_j(\mathbf{x}_i) + \lambda_j w_j = \sum_{i=1}^{p} \hat{y}_i \, h_j(\mathbf{x}_i).
$$

There are *m* such equations, one for each value of $j$ from 1 to $m$,
each representing one constraint on the solution.
Since there are exactly as many constraints as there are
unknowns the system of equations has, except under
certain pathological conditions, a unique solution.

To find that unique solution we employ the language of matrices and vectors: linear algebra. These are invaluable for the representation and analysis of systems of linear equations like the one above, which can be rewritten in vector notation as

$$
\mathbf{h}_j^{\top} \mathbf{f} + \lambda_j w_j = \mathbf{h}_j^{\top} \hat{\mathbf{y}},
$$

where

$$
\mathbf{h}_j = \begin{bmatrix} h_j(\mathbf{x}_1) \\ h_j(\mathbf{x}_2) \\ \vdots \\ h_j(\mathbf{x}_p) \end{bmatrix}, \qquad
\mathbf{f} = \begin{bmatrix} f(\mathbf{x}_1) \\ f(\mathbf{x}_2) \\ \vdots \\ f(\mathbf{x}_p) \end{bmatrix}, \qquad
\hat{\mathbf{y}} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_p \end{bmatrix}.
$$

Since there is one of these equations (each relating one scalar quantity
to another) for each value of *j* from 1 up to *m* we can stack them,
one on top of another, to create a relation between two vector quantities:

$$
\begin{bmatrix} \mathbf{h}_1^{\top} \mathbf{f} \\ \mathbf{h}_2^{\top} \mathbf{f} \\ \vdots \\ \mathbf{h}_m^{\top} \mathbf{f} \end{bmatrix}
+ \begin{bmatrix} \lambda_1 w_1 \\ \lambda_2 w_2 \\ \vdots \\ \lambda_m w_m \end{bmatrix}
= \begin{bmatrix} \mathbf{h}_1^{\top} \hat{\mathbf{y}} \\ \mathbf{h}_2^{\top} \hat{\mathbf{y}} \\ \vdots \\ \mathbf{h}_m^{\top} \hat{\mathbf{y}} \end{bmatrix}.
$$

However, using the laws of matrix multiplication, this is just equivalent to

$$
\mathbf{H}^{\top} \mathbf{f} + \boldsymbol{\Lambda} \mathbf{w} = \mathbf{H}^{\top} \hat{\mathbf{y}},
$$

where

$$
\boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_m \end{bmatrix}, \qquad
\mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix},
$$

and where $\mathbf{H}$, which is called the *design matrix*,
has the vectors $\{\mathbf{h}_j\}_{j=1}^{m}$ as its columns,

$$
\mathbf{H} = \begin{bmatrix} \mathbf{h}_1 & \mathbf{h}_2 & \cdots & \mathbf{h}_m \end{bmatrix},
$$

and has *p* rows, one for each pattern in the training set.
Written out in full it is

$$
\mathbf{H} = \begin{bmatrix}
h_1(\mathbf{x}_1) & h_2(\mathbf{x}_1) & \cdots & h_m(\mathbf{x}_1) \\
h_1(\mathbf{x}_2) & h_2(\mathbf{x}_2) & \cdots & h_m(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
h_1(\mathbf{x}_p) & h_2(\mathbf{x}_p) & \cdots & h_m(\mathbf{x}_p)
\end{bmatrix}.
$$

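To make the design matrix concrete, here is a short sketch that builds it column by column; the Gaussian basis functions and centres are assumptions chosen purely for illustration, since any basis $h_j$ fills the matrix in the same way:

```python
import numpy as np

# Illustrative training inputs (p = 5 patterns) and basis centres (m = 3).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
centres = np.array([0.0, 2.0, 4.0])

def h(j, x):
    """j-th basis function: a Gaussian centred at centres[j] (assumed form)."""
    return np.exp(-(x - centres[j]) ** 2)

# H[i, j] = h_j(x_i): p rows (one per pattern), m columns (one per basis).
H = np.column_stack([h(j, X) for j in range(len(centres))])
print(H.shape)  # (5, 3)
```

Each column is one basis function evaluated at every training pattern; each row is every basis function evaluated at one pattern.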

The vector $\mathbf{f}$ can be decomposed into the product of two
terms, the design matrix and the weight vector, since each of
its components is a dot-product between two *m*-dimensional
vectors. For example, the *i*-th component of $\mathbf{f}$ when
the weights are at their optimal values is

$$
f(\mathbf{x}_i) = \tilde{\mathbf{h}}_i^{\top} \hat{\mathbf{w}},
$$

where

$$
\tilde{\mathbf{h}}_i = \begin{bmatrix} h_1(\mathbf{x}_i) \\ h_2(\mathbf{x}_i) \\ \vdots \\ h_m(\mathbf{x}_i) \end{bmatrix}, \qquad
\hat{\mathbf{w}} = \begin{bmatrix} \hat{w}_1 \\ \hat{w}_2 \\ \vdots \\ \hat{w}_m \end{bmatrix}.
$$

Note that while $\mathbf{h}_j$ is one of the columns of $\mathbf{H}$, $\tilde{\mathbf{h}}_i^{\top}$ is one of its rows. $\mathbf{f}$ is the result of stacking the $\{f(\mathbf{x}_i)\}_{i=1}^{p}$ one on top of the other, or

$$
\mathbf{f} = \begin{bmatrix} \tilde{\mathbf{h}}_1^{\top} \hat{\mathbf{w}} \\ \tilde{\mathbf{h}}_2^{\top} \hat{\mathbf{w}} \\ \vdots \\ \tilde{\mathbf{h}}_p^{\top} \hat{\mathbf{w}} \end{bmatrix} = \mathbf{H} \hat{\mathbf{w}}.
$$

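This decomposition is easy to verify numerically: stacking the row-by-row dot-products gives exactly the same vector as a single matrix-vector product. A sketch with an arbitrary random design matrix and weight vector:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 5, 3
H = rng.standard_normal((p, m))  # arbitrary design matrix: p rows, m columns
w = rng.standard_normal(m)       # arbitrary weight vector

# i-th component of f: dot-product of the i-th row of H with w.
f_stacked = np.array([H[i, :] @ w for i in range(p)])

# The same vector from a single matrix-vector product.
f_matrix = H @ w

print(np.allclose(f_stacked, f_matrix))  # True
```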
Finally, substituting this expression for $\mathbf{f}$ into the previous equation gives

$$
\mathbf{H}^{\top} \mathbf{H} \hat{\mathbf{w}} + \boldsymbol{\Lambda} \hat{\mathbf{w}} = \mathbf{H}^{\top} \hat{\mathbf{y}},
$$

the solution to which is

$$
\hat{\mathbf{w}} = \big( \mathbf{H}^{\top} \mathbf{H} + \boldsymbol{\Lambda} \big)^{-1} \mathbf{H}^{\top} \hat{\mathbf{y}},
$$

which is the normal equation.
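In code the normal equation is best handled with a linear solve rather than an explicit matrix inverse. The following sketch (random design matrix, illustrative weights and regularisation parameters) recovers the optimal weight vector:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 20, 4
H = rng.standard_normal((p, m))            # design matrix
w_true = np.array([1.0, -2.0, 0.5, 3.0])   # illustrative "true" weights
y_hat = H @ w_true                         # noise-free targets, for clarity

Lambda = np.diag(np.full(m, 0.1))          # regularisation parameters

# Normal equation: (H^T H + Lambda) w = H^T y_hat.
# Solving the linear system avoids forming the inverse explicitly.
w_opt = np.linalg.solve(H.T @ H + Lambda, H.T @ y_hat)
print(w_opt)  # close to w_true, shrunk slightly by the penalty
```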

The latter equation is the most general form of the normal equation which we deal with here. There are two special cases. In standard ridge regression a single global parameter is used, $\lambda_j = \lambda$ for all *j*, so $\boldsymbol{\Lambda} = \lambda \, \mathbf{I}_m$ and

$$
\hat{\mathbf{w}} = \big( \mathbf{H}^{\top} \mathbf{H} + \lambda \, \mathbf{I}_m \big)^{-1} \mathbf{H}^{\top} \hat{\mathbf{y}}.
$$

Ordinary least squares, where there is no weight penalty, is obtained by setting all the regularisation parameters to zero, so

$$
\hat{\mathbf{w}} = \big( \mathbf{H}^{\top} \mathbf{H} \big)^{-1} \mathbf{H}^{\top} \hat{\mathbf{y}}.
$$
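Both special cases can be sketched and cross-checked in a few lines; with all regularisation parameters set to zero the formula should agree with NumPy's built-in least-squares routine (the data here are random, purely for the check):

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 30, 3
H = rng.standard_normal((p, m))
y_hat = rng.standard_normal(p)

def ridge_weights(H, y, lam):
    """w = (H^T H + lam * I)^(-1) H^T y, via a linear solve."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

w_ridge = ridge_weights(H, y_hat, 1.0)  # standard ridge regression
w_ols = ridge_weights(H, y_hat, 0.0)    # ordinary least squares

# lam = 0 reduces to ordinary least squares, which lstsq computes directly.
w_lstsq, *_ = np.linalg.lstsq(H, y_hat, rcond=None)
print(np.allclose(w_ols, w_lstsq))  # True
```

As expected, the penalty shrinks the weights: the ridge solution has a smaller norm than the ordinary-least-squares one.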