1. Introduction
Gradient descent is an important optimization method. With the growing research interest in Machine Learning and Deep Learning, more and more people come across this term. However, many still feel confused after consulting different materials. Therefore, I decided to try to explain it based on my own understanding. Hopefully you will benefit from it.
Tip: the formulas in this post are numbered (1)–(10) and referenced throughout the text.
2. Background
1) Gradient
The gradient is a vector, which means it has both a direction and a magnitude. If we denote the gradient of a function f by ∇f, it is defined as shown in formula (1). The direction of ∇f is the orientation in which the directional derivative has the largest value, and ||∇f|| is the value of that directional derivative. Here, i, j, k are the unit direction vectors, and ∂f/∂x, ∂f/∂y, ∂f/∂z are the partial derivatives with respect to x, y, z respectively; they are the directional derivatives along the coordinate axes.
(1)  ∇f = (∂f/∂x) i + (∂f/∂y) j + (∂f/∂z) k
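To make formula (1) concrete, here is a minimal sketch using a toy function of my own choosing (not from any particular library): it computes the gradient of f(x, y, z) = x² + 2y² + 3z² both analytically and with finite differences, and prints ||∇f||, the largest directional derivative.

```python
import numpy as np

# Toy function: f(x, y, z) = x^2 + 2*y^2 + 3*z^2
def f(p):
    x, y, z = p
    return x**2 + 2*y**2 + 3*z**2

# Analytic gradient from formula (1): (df/dx, df/dy, df/dz)
def grad_f(p):
    x, y, z = p
    return np.array([2*x, 4*y, 6*z])

# Central finite-difference approximation of each partial derivative
def numeric_grad(func, p, h=1e-6):
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = h
        g[i] = (func(p + step) - func(p - step)) / (2 * h)
    return g

point = np.array([1.0, -1.0, 2.0])
print(grad_f(point))                   # analytic gradient
print(numeric_grad(f, point))          # close to the analytic one
print(np.linalg.norm(grad_f(point)))   # ||∇f||, the largest directional derivative
```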
2) Unconstrained Optimization
What is optimization? The aim of optimization is to find the best solution among all feasible solutions. This kind of problem is so common that you can easily find examples. In engineering design, how do we choose design parameters so as to meet all requirements while lowering the cost? In neural network algorithms, how do we adjust parameters to make the loss function converge quickly? Optimization theory has been applied in many areas and plays an important role in them. Optimization problems include linear optimization (also called linear programming) and nonlinear optimization (also called nonlinear programming). Depending on the conditions, optimization is also divided into constrained optimization and unconstrained optimization, the latter being the counterpart of the constrained problem with no restrictions on the variables. For more details, you can refer to the links. In the following, I will continue from unconstrained optimization.
- Extremum Condition
Consider the problem
(2)  min f(x),  x ∈ Rⁿ
where f(x) is a real-valued function. The goal is to find its minimum; this is called the unconstrained extremum problem. To solve it, we need a few theorems.
- Theorem One:
If f(x) is differentiable at the point x', and there exists a direction vector d that satisfies (3), then there must exist a small real number δ > 0 such that (4) holds for every λ ∈ (0, δ).
(3)  ∇f(x')ᵀ d < 0
(4)  f(x' + λd) < f(x'),  λ ∈ (0, δ)
In plain language, this theorem says that if a differentiable f(x) has a descent direction at the point x', then a step along that direction, however small, strictly reduces the function value, so formula (4) holds.
- Theorem Two:
If f(x) is second-order differentiable at the point x', and x' is a local minimum point of f(x), then the gradient of f at x' is zero, i.e. ∇f(x') = 0.
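Here is a small numerical sketch of both theorems, using a simple quadratic of my own choosing (assumed purely for illustration): it checks that a direction d with ∇f(x')ᵀd < 0 really decreases f for a small step, and that the gradient at the known minimizer is zero.

```python
import numpy as np

# Simple quadratic with known minimum at (1, -2)
def f(p):
    x, y = p
    return (x - 1)**2 + (y + 2)**2

def grad_f(p):
    x, y = p
    return np.array([2*(x - 1), 2*(y + 2)])

# Theorem One: if grad(x')^T d < 0, then f(x' + lam*d) < f(x') for small lam
x_prime = np.array([3.0, 0.0])
d = -grad_f(x_prime)                      # any direction with a negative inner product works
assert grad_f(x_prime) @ d < 0
lam = 1e-3
print(f(x_prime + lam * d) < f(x_prime))  # True: the function value decreases

# Theorem Two: at the local minimum, the gradient is zero
x_star = np.array([1.0, -2.0])
print(np.allclose(grad_f(x_star), 0.0))   # True
```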
Tip: extremum problems include both minimum problems and maximum problems; here I only take the minimum problem as an example. The principle for the maximum problem is similar.
3. Methodology
After this brief introduction to the background, I will now show you how to link these pieces together to help you understand Gradient Descent. From the unconstrained extremum problem, you can probably already guess that Gradient Descent is related to it. In fact, in optimization the process of finding an optimal solution is equivalent to finding an extremum: more specifically, a minimum (local or global) or a maximum (local or global). Therefore, our problem becomes finding the minimum.
- The meaning of parameters
From formula (4), d is the direction vector and λ is the increment applied to x', also called the step length. Theoretically, d can be set to any direction. However, for the sake of efficiency, we want the fastest way to reach the minimum. We have two parameters, which means we need to adjust both of them to achieve our aim. Compared with the step length λ, we care more about the direction, as discussed below; the direction is very important.
- The fastest direction
One thing to mention before we talk about the fastest direction: here we only consider directions under the constraint ||d|| ≤ 1. Other cases are not practical, so I will not discuss them. Generally, the fastest (steepest) descent direction specifically refers to the direction obtained under the condition ||d|| ≤ 1.
Let x = x' + λd. The first-order Taylor expansion of f(x) at the point x' is:
(5)  f(x) = f(x' + λd) = f(x') + λ∇f(x')ᵀ d + o(λ||d||)
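As a quick sanity check of formula (5), here is a sketch (again a toy function of my own choosing, not from the original post) showing that the first-order term λ∇f(x')ᵀd predicts the change in f better and better as λ shrinks.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3*y**2

def grad_f(p):
    x, y = p
    return np.array([2*x, 6*y])

x_prime = np.array([1.0, 2.0])
d = np.array([0.6, -0.8])                 # a unit direction, ||d|| = 1
for lam in (1e-1, 1e-2, 1e-3):
    actual = f(x_prime + lam * d) - f(x_prime)   # true change in f
    first_order = lam * grad_f(x_prime) @ d      # prediction from formula (5)
    print(lam, actual, first_order)              # the two values agree as lam shrinks
```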
Assume λ is fixed; then we only need to consider d. The third term of formula (5) is the remainder, and it is approximately zero, so we can ignore it. Our goal is to find the fastest descent direction, which means the difference between f(x) and f(x') should decrease as much as possible. That leaves only the second term to consider, as shown in formula (6).
(6)  min ∇f(x')ᵀ d,  subject to ||d|| ≤ 1
To find the optimal d, we have the following formulas: (7), (8), (9). The minimum is attained only when ||d|| equals one and d points in the reverse direction of the gradient; in that case f(x) achieves the fastest descent within the feasible region. Thus, since the fastest descent direction is the reverse direction of the gradient, this method is called Gradient Descent.
(7)  ∇f(x')ᵀ d ≥ −||∇f(x')|| · ||d|| ≥ −||∇f(x')||   (Cauchy–Schwarz inequality and ||d|| ≤ 1)
(8)  d* = −∇f(x') / ||∇f(x')||
(9)  ∇f(x')ᵀ d* = −||∇f(x')||
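The conclusion of (7)–(9) can also be checked numerically. The sketch below (a toy example, not part of the original derivation) samples many unit directions and confirms that ∇f(x')ᵀd is smallest when d is the reverse direction of the gradient, with minimum value −||∇f(x')||.

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2*x, 6*y])

x_prime = np.array([1.0, 2.0])
g = grad_f(x_prime)

# Sample many unit directions ||d|| = 1 and evaluate grad^T d for each
rng = np.random.default_rng(0)
dirs = rng.normal(size=(10000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
values = dirs @ g

best = dirs[np.argmin(values)]             # empirically best sampled direction
steepest = -g / np.linalg.norm(g)          # formula (8)
print(best, steepest)                      # nearly identical
print(values.min(), -np.linalg.norm(g))    # both close to -||grad f||, as in (9)
```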
- Independent variable step selection
Now that we have determined d, we can adjust the value of the independent variable x to make the function converge quickly by following formula (10). Here we have the other parameter, λ, which determines the step length. In some machine learning methods (e.g. neural networks), λ is also called the learning rate. The learning rate is a hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. I will explain it in another post.
(10)  x_{k+1} = x_k − λ∇f(x_k)
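Putting formula (10) into practice, here is a minimal gradient descent sketch; the function, starting point, and learning rate are arbitrary choices of my own for illustration.

```python
import numpy as np

def f(p):
    x, y = p
    return (x - 1)**2 + 3*(y + 2)**2

def grad_f(p):
    x, y = p
    return np.array([2*(x - 1), 6*(y + 2)])

x = np.array([5.0, 5.0])   # starting point
lam = 0.1                  # step length / learning rate
for k in range(100):
    x = x - lam * grad_f(x)   # formula (10): x_{k+1} = x_k - lam * grad f(x_k)

print(x)        # converges to the minimizer (1, -2)
print(f(x))     # close to 0
```

With a step length this small, each iteration moves x a short distance along the reverse gradient direction, so the function value decreases steadily toward the minimum.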