1. Introduction
Gradient descent is an important optimization method. With the growing research interest in Machine Learning and Deep Learning, more and more people come across this term. However, many still feel confused after consulting different materials. Therefore, I decided to try to explain it based on my own understanding. Hopefully you will benefit from it.
Tip: the formulas in this post are numbered (1)–(10) and referenced throughout the text.
2. Background
1) Gradient
The gradient is a vector, which means it has both a direction and a magnitude. If we denote the gradient of a function f by ∇f, it is defined as shown in formula (1). The direction of ∇f is the orientation in which the directional derivative has the largest value, and ||∇f|| is the value of that directional derivative. Here, i, j, k are the unit direction vectors, and ∂f/∂x, ∂f/∂y, ∂f/∂z are the partial derivatives with respect to x, y, z respectively; they are the directional derivatives along the coordinate axes.
(1)  ∇f = (∂f/∂x) i + (∂f/∂y) j + (∂f/∂z) k
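To make formula (1) concrete, here is a minimal sketch using a toy function of my own choosing (not from any particular library): it computes the gradient of f(x, y, z) = x² + 2y² + 3z² both analytically and with finite differences, and prints ||∇f||, the largest directional derivative.

```python
import numpy as np

# Toy function: f(x, y, z) = x^2 + 2*y^2 + 3*z^2
def f(p):
    x, y, z = p
    return x**2 + 2*y**2 + 3*z**2

# Analytic gradient from formula (1): (df/dx, df/dy, df/dz)
def grad_f(p):
    x, y, z = p
    return np.array([2*x, 4*y, 6*z])

# Central finite-difference approximation of each partial derivative
def numeric_grad(func, p, h=1e-6):
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = h
        g[i] = (func(p + step) - func(p - step)) / (2 * h)
    return g

point = np.array([1.0, -1.0, 2.0])
print(grad_f(point))                   # analytic gradient
print(numeric_grad(f, point))          # close to the analytic one
print(np.linalg.norm(grad_f(point)))   # ||∇f||, the largest directional derivative
```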
2) Unconstrained Optimization
What is optimization? The aim of optimization is to find the best solution among all feasible solutions. This kind of problem is so common that you can easily find examples. In engineering design, how do we choose design parameters so as to meet all requirements while lowering the cost? In neural network algorithms, how do we adjust parameters to make the loss function converge quickly? Optimization theory has been applied in many areas and plays an important role in them. Optimization problems include linear optimization (also called linear programming) and nonlinear optimization (also called nonlinear programming). Depending on the conditions, optimization is also divided into constrained optimization and unconstrained optimization, the latter being the counterpart of the constrained problem with no restrictions on the variables. For more details, you can refer to the links. In the following, I will continue from unconstrained optimization.
- Extremum Condition
Consider the problem
(2)  min f(x),  x ∈ Rⁿ
where f(x) is a real-valued function. The goal is to find its minimum; this is called the unconstrained extremum problem. To solve it, we need a few theorems.
- Theorem One:
If f(x) is differentiable at the point x', and there exists a direction vector d that satisfies (3), then there must exist a small real number δ > 0 such that (4) holds for every λ ∈ (0, δ).
(3)  ∇f(x')ᵀ d < 0
(4)  f(x' + λd) < f(x'),  λ ∈ (0, δ)
In plain language, this theorem says that if a differentiable f(x) has a descent direction at the point x', then a step along that direction, however small, strictly reduces the function value, so formula (4) holds.
- Theorem Two:
If f(x) is second-order differentiable at the point x', and x' is a local minimum point of f(x), then the gradient of f at x' is zero, i.e. ∇f(x') = 0.
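Here is a small numerical sketch of both theorems, using a simple quadratic of my own choosing (assumed purely for illustration): it checks that a direction d with ∇f(x')ᵀd < 0 really decreases f for a small step, and that the gradient at the known minimizer is zero.

```python
import numpy as np

# Simple quadratic with known minimum at (1, -2)
def f(p):
    x, y = p
    return (x - 1)**2 + (y + 2)**2

def grad_f(p):
    x, y = p
    return np.array([2*(x - 1), 2*(y + 2)])

# Theorem One: if grad(x')^T d < 0, then f(x' + lam*d) < f(x') for small lam
x_prime = np.array([3.0, 0.0])
d = -grad_f(x_prime)                      # any direction with a negative inner product works
assert grad_f(x_prime) @ d < 0
lam = 1e-3
print(f(x_prime + lam * d) < f(x_prime))  # True: the function value decreases

# Theorem Two: at the local minimum, the gradient is zero
x_star = np.array([1.0, -2.0])
print(np.allclose(grad_f(x_star), 0.0))   # True
```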
Tip: extremum problems include both minimum problems and maximum problems; here I only take the minimum problem as an example. The principle for the maximum problem is similar.
3. Methodology
After this brief introduction to the background, I will now show you how to link these pieces together to help you understand Gradient Descent. From the unconstrained extremum problem, you can probably already guess that Gradient Descent is related to it. In fact, in optimization the process of finding an optimal solution is equivalent to finding an extremum: more specifically, a minimum (local or global) or a maximum (local or global). Therefore, our problem becomes finding the minimum.
- The meaning of parameters
From formula (4), d is the direction vector and λ is the increment applied to x', also called the step length. Theoretically, d can be set to any direction. However, for the sake of efficiency, we want the fastest way to reach the minimum. We have two parameters, which means we need to adjust both of them to achieve our aim. Compared with the step length λ, we care more about the direction, as discussed below; the direction is very important.
- The fastest direction
One thing to mention before we talk about the fastest direction: here we only consider directions under the constraint ||d|| ≤ 1. Other cases are not practical, so I will not discuss them. Generally, the fastest (steepest) descent direction specifically refers to the direction obtained under the condition ||d|| ≤ 1.
Let x = x' + λd. The first-order Taylor expansion of f(x) at the point x' is:
(5)  f(x) = f(x' + λd) = f(x') + λ∇f(x')ᵀ d + o(λ||d||)
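As a quick sanity check of formula (5), here is a sketch (again a toy function of my own choosing, not from the original post) showing that the first-order term λ∇f(x')ᵀd predicts the change in f better and better as λ shrinks.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3*y**2

def grad_f(p):
    x, y = p
    return np.array([2*x, 6*y])

x_prime = np.array([1.0, 2.0])
d = np.array([0.6, -0.8])                 # a unit direction, ||d|| = 1
for lam in (1e-1, 1e-2, 1e-3):
    actual = f(x_prime + lam * d) - f(x_prime)   # true change in f
    first_order = lam * grad_f(x_prime) @ d      # prediction from formula (5)
    print(lam, actual, first_order)              # the two values agree as lam shrinks
```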
Assume λ is fixed; then we only need to consider d. The third term of formula (5) is the remainder, and it is approximately zero, so we can ignore it. Our goal is to find the fastest descent direction, which means the difference between f(x) and f(x') should decrease as much as possible. That leaves only the second term to consider, as shown in formula (6).
(6)  min ∇f(x')ᵀ d,  subject to ||d|| ≤ 1
To find the optimal d, we have the following formulas: (7), (8), (9). The minimum is attained only when ||d|| equals one and d points in the reverse direction of the gradient; in that case f(x) achieves the fastest descent within the feasible region. Thus, since the fastest descent direction is the reverse direction of the gradient, this method is called Gradient Descent.
(7)  ∇f(x')ᵀ d ≥ −||∇f(x')|| · ||d|| ≥ −||∇f(x')||   (Cauchy–Schwarz inequality and ||d|| ≤ 1)
(8)  d* = −∇f(x') / ||∇f(x')||
(9)  ∇f(x')ᵀ d* = −||∇f(x')||
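The conclusion of (7)–(9) can also be checked numerically. The sketch below (a toy example, not part of the original derivation) samples many unit directions and confirms that ∇f(x')ᵀd is smallest when d is the reverse direction of the gradient, with minimum value −||∇f(x')||.

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2*x, 6*y])

x_prime = np.array([1.0, 2.0])
g = grad_f(x_prime)

# Sample many unit directions ||d|| = 1 and evaluate grad^T d for each
rng = np.random.default_rng(0)
dirs = rng.normal(size=(10000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
values = dirs @ g

best = dirs[np.argmin(values)]             # empirically best sampled direction
steepest = -g / np.linalg.norm(g)          # formula (8)
print(best, steepest)                      # nearly identical
print(values.min(), -np.linalg.norm(g))    # both close to -||grad f||, as in (9)
```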
- Independent variable step selection
Now that we have determined d, we can adjust the value of the independent variable x to make the function converge quickly by following formula (10). Here we have the other parameter, λ, which determines the step length. In some machine learning methods (e.g. neural networks), λ is also called the learning rate. The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. I will explain it in another post.
(10)  x_{k+1} = x_k − λ∇f(x_k)
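Putting formula (10) into practice, here is a minimal gradient descent sketch; the function, starting point, and learning rate are arbitrary choices of my own for illustration.

```python
import numpy as np

def f(p):
    x, y = p
    return (x - 1)**2 + 3*(y + 2)**2

def grad_f(p):
    x, y = p
    return np.array([2*(x - 1), 6*(y + 2)])

x = np.array([5.0, 5.0])   # starting point
lam = 0.1                  # step length / learning rate
for k in range(100):
    x = x - lam * grad_f(x)   # formula (10): x_{k+1} = x_k - lam * grad f(x_k)

print(x)        # converges to the minimizer (1, -2)
print(f(x))     # close to 0
```

With a step length this small, each iteration moves x a short distance along the reverse gradient direction, so the function value decreases steadily toward the minimum.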