Gradient descent is an algorithm for finding a local minimum of a differentiable function and is used in many machine learning algorithms.
In linear regression, we had the following minimization problem

$$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1),$$

where $J$ is the cost function. Gradient descent deals with the generalized minimization problem

$$\min_{\theta_0, \ldots, \theta_n} J(\theta_0, \ldots, \theta_n)$$
used in multiple linear regression. We can minimize any (well-defined) differentiable function using gradient descent. The goal is to minimize the error above (to obtain the line that best fits the data), and for that we will use the gradient, which points us in the ‘right’ direction towards the minimum.
Gradient descent is an optimization algorithm. It starts with some initial parameters $\theta_0, \ldots, \theta_n$ and keeps changing them with the goal of reducing the cost function $J(\theta_0, \ldots, \theta_n)$ until a minimum is reached.
Like the derivative, the gradient represents the slope of the tangent to the graph of a function. The gradient can be seen as the change in a function $f$ when the vector $\mathbf{x}$ changes by a small amount $\epsilon \mathbf{u}$, where $\mathbf{u}$ is a unit vector. If our function is smooth and we want to approximate it only up to the first order, we can write the Taylor expansion in the form

$$f(\mathbf{x} + \epsilon \mathbf{u}) \approx f(\mathbf{x}) + \epsilon\, \mathbf{u} \cdot \nabla f(\mathbf{x}).$$
Here $\epsilon$ represents the step size. It can change in every iteration, or we can keep it constant. The question is how to choose the step size to achieve fast convergence.
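The first-order approximation above can be checked numerically. The following sketch uses a made-up function $f(x_0, x_1) = x_0^2 + 3x_1$ and a small step $\epsilon$; the names and values are purely illustrative.

```python
import numpy as np

# Hypothetical example: verify f(x + eps*u) ~ f(x) + eps * u . grad f(x)
# for the made-up function f(x0, x1) = x0^2 + 3*x1.
def f(x):
    return x[0] ** 2 + 3 * x[1]

def grad_f(x):
    return np.array([2 * x[0], 3.0])   # analytic gradient of f

x = np.array([1.0, 2.0])
u = np.array([1.0, 0.0])               # unit direction vector
eps = 1e-3                             # small step size

exact = f(x + eps * u)
approx = f(x) + eps * u @ grad_f(x)
print(exact, approx)                   # the two values agree to first order
```

The smaller $\epsilon$ is, the closer the first-order approximation is to the exact value.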
The gradient points along the direction in which the function increases most rapidly, so taking steps along the gradient maximizes the function; we call this process gradient ascent.
In gradient descent, we are interested in minimization instead of maximization: we want the fastest way to reach the minimum. Similarly to the above, the negative gradient points along the direction in which the function decreases most rapidly, so we step in the negative gradient direction.
In this case we have

$$\mathbf{x}' = \mathbf{x} - \epsilon \nabla f(\mathbf{x}).$$
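The update rule can be sketched on a simple one-dimensional function. This illustrative example minimizes $f(x) = x^2$, whose derivative is $2x$ and whose minimum is at $x = 0$; the starting point and step size are arbitrary choices.

```python
# Minimal sketch of the update x <- x - eps * grad f(x) on f(x) = x^2.
def grad(x):
    return 2 * x       # derivative of x^2

x = 5.0                # arbitrary starting point
eps = 0.1              # constant step size
for _ in range(100):
    x = x - eps * grad(x)
print(x)               # x shrinks towards the minimum at 0
```

Each iteration multiplies $x$ by $(1 - 2\epsilon)$, so with $\epsilon = 0.1$ the iterate decays geometrically towards zero.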
Now we can iteratively guess our way towards a local minimum of $f$, and for the cost function we consider the generalization

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \ldots, \theta_n).$$
$\alpha$ is also called the learning rate. If the learning rate is too small, gradient descent can be very slow, but it will converge. On the other hand, if the learning rate is too big, the algorithm can overshoot the minimum or even diverge.
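The effect of the learning rate can be seen on $f(x) = x^2$ again, where the update $x \leftarrow x - \alpha \cdot 2x$ converges only when $|1 - 2\alpha| < 1$. The rates below are made-up values chosen to show the three regimes.

```python
# Illustrative comparison of learning rates on f(x) = x^2 (gradient 2x).
def run(alpha, steps=50, x=1.0):
    for _ in range(steps):
        x = x - alpha * 2 * x    # gradient descent update
    return x

print(run(0.01))   # too small: still far from 0 after 50 steps
print(run(0.45))   # well chosen: converges quickly
print(run(1.1))    # too big: |1 - 2*alpha| > 1, the iterates diverge
```

With $\alpha = 0.01$ the iterate has only shrunk to roughly a third of its starting value after 50 steps, while $\alpha = 1.1$ makes it grow without bound.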
We can now apply this to linear regression.
If we substitute the hypothesis of linear regression, $h_\theta(x) = \theta_0 + \theta_1 x$, into the cost function, divide by the number of training examples $m$ to average (and by 2 for simplicity), and take the partial derivatives, we get

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right),$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}.$$
Now we can run these two updates simultaneously over many iterations; in every iteration we obtain a line with a smaller error than in the previous one, and we eventually reach the minimum.
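The two update rules can be sketched as follows. The data set and hyperparameters here are made up for illustration: four points that lie exactly on the line $y = 1 + 2x$, so the updates should drive $\theta_0$ towards 1 and $\theta_1$ towards 2.

```python
import numpy as np

# Sketch of simultaneous gradient-descent updates for simple linear
# regression h(x) = theta0 + theta1 * x, on made-up data y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.1                                  # learning rate
for _ in range(5000):
    err = theta0 + theta1 * x - y            # h(x^(i)) - y^(i)
    # simultaneous update: both parameters use the same err vector
    theta0, theta1 = (theta0 - alpha * err.mean(),
                      theta1 - alpha * (err * x).mean())
print(theta0, theta1)                        # approaches theta0 = 1, theta1 = 2
```

Note the simultaneous update: both partial derivatives are computed from the same `err` before either parameter changes, matching the update rules above.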