#optimization_algorithm

Gradient descent is an iterative method used to automatically find the optimal θ parameters of the hypothesis $h_\theta$ by minimizing a cost function J(θ).

The idea is to take repeated steps in the direction opposite to the gradient of the function at the current point (because this is the direction of steepest descent when minimizing), until the parameters stop changing.

gradient_descent_01.png

However, gradient descent is not guaranteed to find the global optimum (the best parameters): depending on the starting point, it can end up in one local optimum or another.

gradient_descent_02.png

⚠️ It is recommended to use feature normalization to speed up the descent.
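For example, a minimal sketch of mean normalization with NumPy (the function name and the feature values below are made up for illustration):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize each feature column: (x - mean) / standard deviation."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# Two features on very different scales (e.g. house size and number of rooms)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, sigma = normalize_features(X)   # every column now has mean 0 and std 1
```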

Four-step gradient descent example

gradient_descent_03.png

General Batch Gradient Descent algorithm

Repeat until convergence
{

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n)$$

}

with:
- α = the learning rate
- $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n)$ = the derivative of the cost function with respect to $\theta_j$
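As a sketch, this general rule can be implemented as a loop that stops once the update step becomes negligible. Gradient descent itself does not depend on the exact cost function, so the gradient function `grad_J` is assumed to be provided by the caller:

```python
import numpy as np

def gradient_descent(theta, grad_J, alpha=0.01, tol=1e-7, max_iters=10_000):
    """Generic batch gradient descent: theta_j := theta_j - alpha * dJ/dtheta_j."""
    for _ in range(max_iters):
        step = alpha * grad_J(theta)      # alpha times the gradient of J at theta
        theta = theta - step              # move against the gradient (steepest descent)
        if np.linalg.norm(step) < tol:    # "repeat until convergence"
            break
    return theta
```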

Normally, the derivative of the Mean Squared Error (MSE) cost function is as follows:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{2}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) = \frac{2}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$

But if the cost function expression is already divided by 2 (see Mean Squared Error (MSE)), the factor of 2 cancels when differentiating, and the formula becomes:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
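A quick numerical check of this simplification, as a sketch assuming the half-MSE cost $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2$ with $h_\theta(x) = \theta^T x$: the analytic $\frac{1}{m}$ gradient (no extra division by 2) matches a finite-difference gradient. The function names and the random data are only for illustration.

```python
import numpy as np

def half_mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum((h_theta(x) - y)^2), with h_theta(x) = X @ theta."""
    m = len(y)
    err = X @ theta - y
    return (err @ err) / (2 * m)

def analytic_gradient(theta, X, y):
    """1/m * sum((h_theta(x) - y) * x_j): the factor 2 has already cancelled."""
    m = len(y)
    return X.T @ (X @ theta - y) / m

# Compare against finite differences on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
theta = rng.normal(size=2)

eps = 1e-6
numeric = np.array([
    (half_mse_cost(theta + eps * e, X, y) - half_mse_cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(numeric, analytic_gradient(theta, X, y), atol=1e-6))  # expected: True
```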

Batch Gradient Descent for univariate linear regression

Repeat until convergence
{

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$$

}

with:
- α = the learning rate
- m = the number of training examples
- $x^{(i)}$ = the input (feature) of the $i^{th}$ training example
- $h_\theta(x^{(i)})$ = the computed output of the $i^{th}$ training example
- $y^{(i)}$ = the expected output of the $i^{th}$ training example
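A minimal NumPy sketch of those two updates for $h_\theta(x) = \theta_0 + \theta_1 x$ (the function name, data, learning rate and iteration count are made up for illustration):

```python
import numpy as np

def univariate_batch_gd(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
        errors = predictions - y                   # h_theta(x^(i)) - y^(i)
        grad0 = errors.sum() / m
        grad1 = (errors * x).sum() / m
        theta0 -= alpha * grad0                    # both parameters are updated
        theta1 -= alpha * grad1                    # from the same errors (simultaneous update)
    return theta0, theta1

# Example usage on a noiseless line y = 2 + 3x
x = np.linspace(0, 1, 50)
y = 2 + 3 * x
print(univariate_batch_gd(x, y, alpha=0.5, num_iters=5000))  # close to (2.0, 3.0)
```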

Batch Gradient Descent for multivariate linear regression

Repeat until convergence
{

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_1^{(i)}$$
$$\vdots$$
$$\theta_n := \theta_n - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_n^{(i)}$$

}

with:
- α = the learning rate
- n = the number of features
- m = the number of training examples
- $x^{(i)}$ = the input (features) of the $i^{th}$ training example
- $h_\theta(x^{(i)})$ = the computed output of the $i^{th}$ training example
- $y^{(i)}$ = the expected output of the $i^{th}$ training example
- $x_j^{(i)}$ = the value of feature j in the $i^{th}$ training example
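The same idea as a sketch, with one explicit update per feature j (the function name and hyper-parameters are illustrative, and the design matrix X is assumed to already contain the bias column $x_0 = 1$; the vectorized form below does the same thing without the inner loop):

```python
import numpy as np

def multivariate_batch_gd(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta^T x, with one update per theta_j."""
    m, n_features = X.shape          # X is assumed to include the bias column x_0 = 1
    theta = np.zeros(n_features)
    for _ in range(num_iters):
        errors = X @ theta - y       # h_theta(x^(i)) - y^(i) for every example
        new_theta = theta.copy()     # simultaneous update of every theta_j
        for j in range(n_features):
            new_theta[j] = theta[j] - alpha * (errors * X[:, j]).sum() / m
        theta = new_theta
    return theta
```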

Vectorized Batch Gradient Descent

$$\theta := \theta - \alpha \frac{1}{m} X^T (X\theta - y)$$

with:
- α = the learning rate
- m = the number of training examples
- X = the matrix of training examples, with one example per row (the design matrix)
- y = the vector of all the expected output values
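A minimal NumPy sketch of this vectorized update (same assumption as above: X contains a bias column of ones, and the example data is only illustrative):

```python
import numpy as np

def vectorized_batch_gd(X, y, alpha=0.01, num_iters=1000):
    """theta := theta - alpha/m * X^T (X theta - y): all parameters updated at once."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - (alpha / m) * X.T @ (X @ theta - y)
    return theta

# Example usage: a bias column of ones prepended to a single feature
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
print(vectorized_batch_gd(X, y, alpha=0.5, num_iters=5000))  # close to [2. 3.]
```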

The “default” gradient descent algorithm is batch gradient descent, which uses all the available training examples at each iteration.


⚠️ TODO : Add other types of gradient descents HERE.


Debugging Gradient Descent

The simplest way to debug gradient descent is to plot the cost function J(θ) against the number of iterations: if gradient descent is working properly, J(θ) should decrease after every iteration.

gradient_descent_04.png
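A sketch of such a plot with Matplotlib, recording J(θ) at every iteration of the vectorized update (the helper name and the data are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def gd_with_history(X, y, alpha, num_iters):
    """Run vectorized batch gradient descent and record J(theta) at every iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        history.append((errors @ errors) / (2 * m))   # J(theta), 1/(2m) convention
        theta = theta - (alpha / m) * X.T @ errors
    return theta, history

# Illustrative data: y = 2 + 3x, with a bias column of ones
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x

theta, history = gd_with_history(X, y, alpha=0.5, num_iters=400)
plt.plot(history)
plt.xlabel("Number of iterations")
plt.ylabel("Cost J(theta)")
plt.show()
```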

The learning rate α must be neither too large (J(θ) may fail to decrease, or even diverge) nor too small (convergence will be very slow).

So, one good way to choose α is to start with a small value and progressively multiply it by roughly 3 (… 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1 …) for as long as J(θ) does not increase.
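A sketch of that search: run a fixed number of iterations for each candidate α and keep the largest one for which the recorded J(θ) never increases (the helper name, data and iteration budget are made up for illustration):

```python
import numpy as np

def cost_history(X, y, alpha, num_iters=200):
    """J(theta) after each iteration of vectorized batch gradient descent."""
    m, theta, hist = len(y), np.zeros(X.shape[1]), []
    for _ in range(num_iters):
        errors = X @ theta - y
        hist.append((errors @ errors) / (2 * m))
        theta = theta - (alpha / m) * X.T @ errors
    return hist

# Illustrative data: y = 2 + 3x, with a bias column of ones
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x

# Keep the largest alpha (stepping up by roughly x3) for which J(theta) never increases
best_alpha = None
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    hist = cost_history(X, y, alpha)
    if all(b <= a for a, b in zip(hist, hist[1:])):
        best_alpha = alpha
print(best_alpha)
```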