#optimization_algorithm #normal_equation

The Ordinary Least Squares (OLS) is an analytical method used to compute the θ parameters of the hypothesis function.

According to the Gauss Markov theorem, there are assumptions to meet in order to guarantee the validity of OLS for estimating the coefficients of a regression:

Lack of knowledge of these assumptions could result in incorrect results.

⚠️ This method can use feature scaling, but it doesn't need it.

Univariate OLS Linear Regression

Given the hypothesis function of a univariate linear regression:

hθ(x) = θ0 + θ1x + ε

One can compute the unbiased estimators θ^0 and θ^1 of θ0 and θ1:

θ^1 = Sx,ySx2θ^0 = y¯  θ^1 y¯

with Sx,y = the covariance of the training inputs X and outputs YSx2 = the variance of the training inputs Xx¯ = the means of the training inputs Xy¯ = the means of the training outputs Y

Then y can be estimated (predicted) with:

y^ = θ^0 + θ^1 x

with x = the input for which one wants the predict the output yθ^i = the computed estimator of θi

And that's done!


But various metrics can also be computed to evaluate the model.

The unexplained error of each training example:

ei = yiy^i            ei2 is called the residual error

with yi = the expected output of the ith sampley^i = the predicted output of the ith sample

The sum of residuals ε (which is the OLS cost function):

ε = i=1mei2 = i=1m(yi  y^i)2 = i=1m(yi  (θ^0 + θ^1 xi))2

with m = the number of samplesei2 = the residual for the error ei of the ith sample

The unbiased residual variance (also called unexplained variance
or error variance)
:

σ^2 = 1m  1  1 i=1mei2

with m = the number of samplesei2 = the residual for the error ei of the ith sample

Vectorized Univariate OLS Linear Regression

There exist vectorized formulas for batch predictions:

hθ(x) = θ0[11] + θ1[x0xm] + [e0em]Y = θ0  1m  +  θ1   X     +     ε ε = ||Y  (θ0 1n  θ1 X)||l22

with || . ||l2 = the euclidean normY = the ouput valuesX = the input values1m = an array filled with 1, of size mm = he number of samples

Although this approach is very simple and works very well with univariate linear regressions, this is not the case with the multivariate version.

Multivariate OLS Linear Regression

Given the hypothesis function of a Multivariate linear regression
with p predictors
(or features):

hθ(x) = θ0 x0 + θ1 x1 +  + θp xp + ε

One can compute the unbiased estimators θ^0θ^p of θ0θp using the normal equation with:

θ^ = (XT X)1XT Y

with X = a matrix of i.i.d. training examples arranged as rowsY = a vector of all the expected output values

⚠️ X must be invertible.
And otherwise a pseudo inverse matrix can be used instead.

Then y for a single new input x can be estimated (predicted) with:

y^i = θ^0 xi 0 + θ^1 xi 1+ + θ^p xi p

with xi = the input for which one wants the predict the output yθ^ = a vector of all the estimators computed with the normal equationp = the number of predictors / features

Or a batch of predictions can be made using the vectorized version:

Y^ = θ^  X[y^1y^m]=[θ^1θ^m][x1 1x1 pxm 1xm p]

with X = a matrix of i.i.d. features arranged as rowsθ^ = a vector of all the estimators computed with the normal equationp = the number of predictors / featuresm = the number of samples

And that's done!


And similarly to the univariate linear regression various metrics can be computed to evaluate the model (e.g. : statistical tests).

The unexplained error of each training example:

ei = yiy^i            ei2 is called the residual error = YY^

with yi = the expected output of the ith sampley^i = the predicted output of the ith sample

The sum of residuals ε (which is the OLS cost function):

ε = i=1mei2 = i=1m(yi  y^i)2 = i=1m(yi  (θ^0 x0 + θ^1 x1 +  + θ^p xp))2 = i=1m(yi  j=1p(θ^j xi j))2

with m = the number of samplesp = the number of predictors / featuresei2 = the residual for the error ei of the ith sample

The unbiased residual variance (also called unexplained variance or error variance):

σ^2 = 1m  p  1 i=1mei2

with m = the number of samplesp = the number of predictors / featuresei2 = the residual for the error ei of the ith sample

Normal Equation complexity

One disadvantage of this approach, is that computing the inversion (XTX)1 has a complexity O(n3).

So if there is a very large number of features (e.g. more than 10000), it will be slow and it might be a good time to use an iterative process such as Gradient Descent.

Normal Equation Non-invertibility

In some cases, XTX may be non-invertible and the common causes are either:

The machine learning libraries tends to offer a way to protect against this problem. For instance, in Numpy or Octave the pinv( ) function can be used instead of the inv( ) function.

(Bonus) Deciphering the Normal Equation

Given the vectorized hypothesis of a #Multivariate OLS Linear Regression:

hθ(x) = θT XY^ = θ^ XX θ  Y

One way to find θ is to get rid of X by multiplying it by its inverse (if X is square) so that it turns into an Identity matrix.

X1 X θ  X1 Y1×θ  X1 Y      if X is squareθ  X1 Y

with (X1 X) = an Identity matrix

Fortunately, the gram matrix is always square... so one can compute the Gram matrix (XTX), then it's Identity matrix (X1X) to get the θ expression.

(XT X)1(XT X) θ  (XT X)1(XT Y)1× θ  (XT X)1(XT Y)

with (XT X) = a Gram matrix
       (X1 X) = an Identity matrix
       (XT X)1 (XT X) = an Identity matrix of a Gram matrix

So, if the Gram matrix is invertible, the Normal equation finally appears:

θ  (XT X)1(XT Y)