Saturday 1 December 2018

[Research] The learning algorithms in neural networks: (2) Restricted Boltzmann Machine

1. Introduction

This is the second post in the series on learning algorithms in neural networks. The last post mainly showed you the principle of the error backpropagation (BP) algorithm, taking the standard BP (with the logistic sigmoid, f(x)=1/(1 + exp(-x)), as the activation function) as the example. As we all know, BP is a supervised learning method which works well in shallow neural networks. Generally, the more hidden layers, the stronger the learning ability of the neural network. However, when we insert more hidden layers into the neural network, BP faces the vanishing gradient problem as the error propagates backward through many hidden layers: the gradients shrink until the optimization can no longer make progress. How should we train the parameters in this situation?
Hinton invented the Restricted Boltzmann Machine (RBM) to train the parameters in an unsupervised way, which offers a solution for training the parameters of deep multi-layer neural networks. Multiple RBMs stacked together form a deep belief network (DBN), which is used to pre-train the parameters of deep networks. After the initialization of all parameters, the BP method is then employed to propagate real errors backward through the entire network to update the parameters (supervised learning). Of course, unsupervised learning methods can also be combined with the DBN.
Though the DBN is less used in deep networks nowadays, this training strategy contributed greatly to the advancement of deep learning. So in this post, I will show you what the RBM and DBN are and how they work.

2. Background

1) Restricted Boltzmann Machine (RBM)

The Restricted Boltzmann Machine is a restricted version of the Boltzmann Machine (BM). Graphical representations of an example BM and an example RBM are given in Figures 1 and 2. From the comparison, we can see that the BM has higher complexity, which leads to lower training efficiency than the RBM. That's why we use the RBM instead of the BM to train artificial neural networks. So here I mainly focus on the RBM.
(Figure. 1. A graphical representation of an example BM)
(Figure. 2. A graphical representation of an example RBM)
An RBM has two layers: a visible layer (v) and a hidden layer (h). The parameters of an RBM include the connection weights and biases (all shown in Figure 2). The value of each neuron is binary, {0 or 1}. The energy function of the RBM is defined as:
Formula. 1: the energy function of the RBM: E(v, h) = -Σ_i a_i·v_i - Σ_j b_j·h_j - Σ_i Σ_j v_i·w_ij·h_j
And the joint probability distribution of the RBM is defined below. In an RBM, the value of every node in the visible or hidden layer depends on the values of the nodes it is connected to, which means connected nodes affect each other; this joint probability distribution captures the probabilistic relationship between them. Compared to the BM, where all nodes are fully connected, the RBM divides the nodes into two parts, a visible layer and a hidden layer, with no connections inside the visible layer or inside the hidden layer.
Formula. 2: the joint probability distribution of the RBM: P(v, h) = exp(-E(v, h)) / Z
where Z is the Partition Function shown in Formula 4 below.
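To make Formulas 1 and 2 concrete, here is a minimal sketch in Python with NumPy (my own naming; a toy illustration by exhaustive enumeration, not a practical implementation) that computes the energy and the exact joint probability of a tiny binary RBM:

```python
import numpy as np

def energy(v, h, W, a, b):
    """E(v, h) = -a.v - b.h - v.W.h for binary vectors v, h (Formula 1)."""
    return -(a @ v) - (b @ h) - v @ W @ h

def joint_prob(v, h, W, a, b):
    """P(v, h) = exp(-E(v, h)) / Z, with Z summed over all binary states (Formula 2)."""
    n_v, n_h = len(a), len(b)
    Z = 0.0
    for vi in range(2 ** n_v):
        for hi in range(2 ** n_h):
            vv = np.array([(vi >> k) & 1 for k in range(n_v)], dtype=float)
            hh = np.array([(hi >> k) & 1 for k in range(n_h)], dtype=float)
            Z += np.exp(-energy(vv, hh, W, a, b))
    return np.exp(-energy(v, h, W, a, b)) / Z

# Tiny example: 2 visible and 2 hidden units with hand-picked parameters
W = np.array([[0.5, -0.2], [0.1, 0.3]])
a = np.array([0.0, 0.1])
b = np.array([-0.1, 0.2])
v = np.array([1.0, 0.0])
h = np.array([0.0, 1.0])
p = joint_prob(v, h, W, a, b)
```

Enumerating all 2^(n_v + n_h) states is only feasible for toy sizes; that intractability is exactly why sampling methods are introduced later in this post.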

2) Deep Belief Network (DBN)

Intuitively, the DBN can be viewed as a composition of simple, unsupervised networks such as Restricted Boltzmann Machines or autoencoders (whose function is similar to the RBM's). Each sub-network includes an input (visible) layer and a hidden layer, and the hidden layer of one sub-network serves as the input (visible) layer of the next. The DBN is generally used to pre-train parameters for deep multi-layer neural networks.


(Figure. 4. A DBN composed of multiple Restricted Boltzmann Machines)

3) Energy Minimum and Simulated Annealing

In a dynamic system, it is of great importance to find a state that minimizes the system's global energy. This process is called simulated annealing (SA). The main idea of this method is illustrated in the figure below, where E means the system energy and T means the temperature. When the temperature is high, the system energy is also high because of the disorder. As the temperature decreases, the energy lowers and so does the disorder. Finally, the system settles into a state of minimum energy.
Figure. 5. An example of Simulated Annealing
The Boltzmann Machine can be viewed as a stochastic dynamic system. The system starts running at a high temperature, and its temperature gradually decreases until it reaches thermal equilibrium at a lower temperature. It may then converge to a distribution where the energy level fluctuates around the global minimum. This idea is the same as that of simulated annealing.

4) Boltzmann Distribution

In statistical mechanics, a Boltzmann Distribution (also called a Gibbs distribution) is a probability distribution of particles in a system over various possible states. The distribution is expressed in the form:
Formula. 3: P_α = exp(-E_α / (k·T)) / Z
where P_α is the probability of state α, E_α is the energy of state α, k is the Boltzmann constant and T is the system temperature. Z is called the Partition Function, which is used to normalize the probability and can be expressed in the form:
Formula. 4: Z = Σ_α exp(-E_α / (k·T))
One important feature of the Boltzmann distribution is that the ratio of the probabilities of two states depends only on the difference of their energies.
Formula. 5: P_α / P_β = exp((E_β - E_α) / (k·T))
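A quick numerical check of this property (a toy sketch; `boltzmann_ratio` is my own name): shifting both energies by the same constant leaves the probability ratio unchanged, because the partition function Z cancels.

```python
import math

def boltzmann_ratio(E_a, E_b, kT=1.0):
    """Formula 5: P_a / P_b = exp(-(E_a - E_b) / (k*T)); Z cancels out."""
    return math.exp(-(E_a - E_b) / kT)

# Two pairs of states with the same energy difference give the same ratio.
r1 = boltzmann_ratio(2.0, 1.0)
r2 = boltzmann_ratio(7.0, 6.0)
```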

5) The applications of Boltzmann Machine

The BM has two important applications. Firstly, it is used to solve the searching problem: given the weights and biases, determine the values of the neuron nodes that minimize the global energy. Secondly, it is used to solve the learning problem: given the values of some of the neuron nodes, for example, the nodes of the visible layer, find the optimal weights and biases that minimize the global energy.
  • The principle of BM for the searching problem:
From Formula 3, as the energy decreases, the probability of the state increases. To achieve the global energy minimum, the probability of the state should be maximized. Given the weights and biases, we can find at least one binary vector that maximizes the probability of state α.
Figure 6. exponential function graph
  • The principle of BM for the learning problem:
For a system in a state of minimum energy, given a group of specific neuron nodes (e.g. the visible layer) with a known binary vector as the input, there must be a group of weights and biases that maximizes the probability of that input. Therefore, the aim is to find the optimal parameters that maximize the probability of the input. This process is also called Maximum Likelihood Estimation.

6) Bayes Theorem

Let me give a brief introduction to Bayes' Theorem. It is used to describe the relationship between two conditional probabilities. Let P(A) and P(B) be the probabilities of A and B happening, respectively; each of them is also called a prior probability, i.e. an inherent probability. P(A|B) is the probability of A happening given that B happens; it is also called the posterior probability. P(A, B) is the probability of both happening at the same time. Then, we have the following equations:

Formula. 6: P(A, B) = P(A|B)·P(B) = P(B|A)·P(A)
Formula. 7: P(A|B) = P(B|A)·P(A) / P(B)
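As a small worked example of Formulas 6 and 7 (the numbers here are made up purely for illustration):

```python
def posterior(p_a, p_b_given_a, p_b):
    """Formula 7 (Bayes' theorem): P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.5, P(B) = 0.4
p_a, p_b_given_a, p_b = 0.3, 0.5, 0.4
p_a_given_b = posterior(p_a, p_b_given_a, p_b)  # 0.5 * 0.3 / 0.4 = 0.375

# Formula 6: both factorizations give the same joint probability P(A, B).
joint_1 = p_b_given_a * p_a
joint_2 = p_a_given_b * p_b
```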

7) Maximum Likelihood Estimation

In statistics, Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. Let me give you an example to show how MLE works.
  • Question: Assume P(A) is known and we want to figure out P(A|B). When P(B|A) and P(B) are both known, we can solve this directly using Formula 7. But if we cannot get any information about P(B|A) and P(B), what should we do?
Of course, there is a "stupid" method: first, randomly assign a probability to P(B); second, figure out P(A, B) relying on the connection between A and B; third, use Formula 6 to compute P(B|A); finally, use Formula 7 to get P(A|B). However, the problem is how to evaluate your assumption. Therefore, we use the principle of MLE to evaluate the aforesaid "assumption", that is, to make P(A|B) as high as possible. It is understandable that, given an observation, we only need to find one possible P(B) and P(B|A) that maximize the probability P(A|B).
Instead of enumeration, which is inefficient, we build a model for this kind of optimization, called the likelihood function. The aim of this sort of optimization problem is to find the maximum (global or local) of the likelihood function. The computation is carried out according to the relationship between the extremum and the derivative.
This process is how the BM is applied to the learning problem. For a BM, we only have the given observations (the visible layer); the weights and biases are unknown (we can initialize them randomly). Therefore, we build a likelihood function related to the probability of the input to estimate the parameters (weights and biases).

8) Gradient Ascent

In one of my past posts, I introduced gradient descent, which is utilized to solve minimum optimization problems. For maximum optimization problems, however, the commonly used method is gradient ascent. They share the same principle; the main difference is the parameter update: "-" for gradient descent, "+" for gradient ascent.
Formula. 8: parameter update in gradient descent: θ ← θ - α·∂f/∂θ
Formula. 9: parameter update in gradient ascent: θ ← θ + α·∂f/∂θ
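The sign difference can be sketched in a few lines (a toy example with f(x) = x²; function names are mine): descent on f drives x toward its minimum at 0, while ascent on -f drives x toward its maximum at 0.

```python
def gradient_step(theta, grad, alpha, ascent=False):
    """Formulas 8/9: theta - alpha*grad (descent) or theta + alpha*grad (ascent)."""
    return theta + alpha * grad if ascent else theta - alpha * grad

# Gradient descent on f(x) = x^2, whose gradient is 2x.
x = 5.0
for _ in range(100):
    x = gradient_step(x, 2 * x, alpha=0.1)

# Gradient ascent on g(x) = -x^2, whose gradient is -2x.
y = 5.0
for _ in range(100):
    y = gradient_step(y, -2 * y, alpha=0.1, ascent=True)
```

Both loops shrink the iterate by a factor 0.8 per step, so x and y both approach the optimum at 0.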

3. Parameters Learning Methods Analysis

1) The probability of the variables in the visible layer, P(v)

The variables of a simple RBM are shown in Figure 2 above; to avoid repetition, please refer to that figure. Let θ = (W, a, b) denote the parameters. According to the previous description in 7) Maximum Likelihood Estimation, given only the known variables in the visible layer, we employ the principle of MLE to build a maximum likelihood function for estimating the parameters θ. The maximum likelihood function can be expressed in the form:
Formula. 10: the maximum likelihood function: L(θ) = Π_v P(v) (product over the training samples v)
where the variables in v are independent and identically distributed. Since f(x) = ln x is a strictly monotonic function, we generally use ln P(v) as the final maximum likelihood function.
Formula. 11: the log maximum likelihood function: ln L(θ) = Σ_v ln P(v)
However, from Formula 2, we are only able to get P(v, h) for the current state, so we need to deduce P(v). This means computing the marginal distribution P(v) from the joint distribution P(v, h). As mentioned before, every node takes a binary value, namely {0 or 1}.
Formula. 12: the marginal distribution: P(v) = Σ_h P(v, h) = (1/Z)·Σ_h exp(-E(v, h))
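The marginalization in Formula 12 can be verified numerically. Because the hidden units are conditionally independent, the sum over all 2^n_h hidden configurations also has a closed form (the usual "free energy" factorization); the sketch below (my own naming, NumPy assumed) checks that both ways of computing the unnormalized P(v) agree:

```python
import numpy as np

def unnormalized_pv_bruteforce(v, W, a, b):
    """sum_h exp(-E(v, h)) by enumerating every binary hidden vector h."""
    n_h = len(b)
    total = 0.0
    for code in range(2 ** n_h):
        h = np.array([(code >> k) & 1 for k in range(n_h)], dtype=float)
        total += np.exp(a @ v + b @ h + v @ W @ h)
    return total

def unnormalized_pv_factored(v, W, a, b):
    """Same sum in closed form: exp(a.v) * prod_j (1 + exp(b_j + (v.W)_j))."""
    return np.exp(a @ v) * np.prod(1.0 + np.exp(b + v @ W))

# Tiny hand-picked example: 2 visible, 2 hidden units
W = np.array([[0.5, -0.2], [0.1, 0.3]])
a = np.array([0.0, 0.1])
b = np.array([-0.1, 0.2])
v = np.array([1.0, 0.0])
```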

2) Partial derivatives of the parameters based on one training sample

ln P(v) is the objective function, so we need to deduce its partial derivatives with respect to θ = (W, a, b). For a training set, we can train on one sample at a time or on the entire set; the principle is the same, so I first consider a single training sample v.

Formula. 13
Formula. 14: ∂ln P(v)/∂θ = Σ_h P(h|v)·∂(-E(v, h))/∂θ (part 1) - Σ_{v,h} P(v, h)·∂(-E(v, h))/∂θ (part 2)
Here, Formula 14 divides into two parts: part 1 and part 2 (shown in Formula 14). In part 1, P(h|v) stands for the probability of the hidden nodes given a known sample v. In part 2, P(v, h) is a Boltzmann distribution that is independent of any particular training sample. According to Bayes' theorem, we have Formula 15. To get the partial derivatives, we choose part 2 to deduce, as it is more general.
Formula. 15: P(h|v) = P(v, h) / P(v)
Here is the process:
Formula. 16
Formula. 17
Formula. 18
For a specific training sample, according to Formulas 16-18, we have:
Formula. 19
Then we need to deduce P(h_k=1|v) and P(v_k=1|h). Considering that v is a given training sample, we give the deduction of P(h_k=1|v):

Formula. 20: P(h_k=1|v) = σ(b_k + Σ_i v_i·w_ik), where σ(x) = 1/(1 + exp(-x))
Similarly:
Formula. 21: P(v_k=1|h) = σ(a_k + Σ_j w_kj·h_j)
If we want to extend this to multiple training samples, we have:
Formula. 22
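Formulas 20 and 21 are what we actually evaluate during sampling. A minimal sketch (NumPy; the function names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Formula 20: P(h_k = 1 | v) = sigmoid(b_k + sum_i v_i * w_ik)."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """Formula 21: P(v_k = 1 | h) = sigmoid(a_k + sum_j w_kj * h_j)."""
    return sigmoid(a + W @ h)

# Tiny hand-picked example: 2 visible, 2 hidden units
W = np.array([[0.5, -0.2], [0.1, 0.3]])
a = np.array([0.0, 0.1])
b = np.array([-0.1, 0.2])
ph = p_h_given_v(np.array([1.0, 0.0]), W, b)
pv = p_v_given_h(np.array([0.0, 1.0]), W, a)
```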

3) Approximation of Partition Function

The Partition Function, represented by Z, is used to normalize the probability. However, Z is intractable to compute exactly, so we can only approximate it using a sampling method, e.g. Gibbs sampling.
Formula. 23: the Partition Function: Z = Σ_{v,h} exp(-E(v, h))


Figure.7 Gibbs sampling of RBM
Figure 7 shows the process of Gibbs sampling. Only when the system reaches thermal equilibrium does the sampling distribution obey the Boltzmann distribution p(v, h), so we need to sample numerous times to approximate it.
1) Give a binary vector v0 for all nodes in the visible layer; initialize the parameters.
2) Use Formula 20 to compute P(h=1|v0), then sample a binary vector h0 for all nodes in the hidden layer based on P(h=1|v0).
3) Use (v0, h0) as well as their probabilities to compute Z as well as the data statistics of the current RBM, <·>_data.
4) Treat the current hidden layer as the new visible layer; use Formula 21 to compute P(v=1|h0), then sample a binary vector v1 based on P(v=1|h0).
5) Repeat steps 2 and 4......
6) When t→∞, (v_t, h_t) obeys p(v, h). Then use them to figure out Z as well as the model statistics of the current RBM, <·>_model.
7) Finally, update the parameters using Formula 24, where α stands for the learning rate.
Formula. 24: parameter update of Gibbs sampling: Δw_ij = α·(<v_i·h_j>_data - <v_i·h_j>_model)
Since the Gibbs method needs a massive number of sampling steps, it is inefficient. Hinton therefore proposed another, faster method, called contrastive divergence.
From formula. 19, for vk, we have:
Formula. 25
Subtracting Formula 25 from Formula 19, we obtain:
Formula. 26
We can see that this avoids computing p(v, h) as well as Z, which is why the method is more efficient. We only need to conduct k steps of Gibbs sampling, without the restriction that the system must be at thermal equilibrium.
1) Give a binary vector v0 for all nodes in the visible layer; initialize the parameters.
2) Use Formula 20 to compute P(h=1|v0), then sample a binary vector h0 for all nodes in the hidden layer based on P(h=1|v0).
3) Compute the partial derivatives of the parameters with respect to v0.
4) Treat the current hidden layer as the new visible layer; use Formula 21 to compute P(v=1|h0), then sample a binary vector v1 based on P(v=1|h0).
5) Repeat steps 2 and 4......
6) After k (self-defined) steps, compute the partial derivatives of the parameters with respect to vk.
7) Finally, update the parameters using Formula 27, where α stands for the learning rate.
Formula. 27: parameter update of Contrastive Divergence: Δw_ij = α·(P(h_j=1|v^(0))·v_i^(0) - P(h_j=1|v^(k))·v_i^(k))
The k in the contrastive divergence method (hence also called CD-k) is generally very small; sometimes a quite good learning state can be achieved even with k=1.
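The seven steps above, specialized to k = 1, fit in a short function. This is a sketch under my own naming conventions (it follows the commonly used CD-1 recipe in which the updates use the probabilities P(h=1|·) rather than sampled h; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, alpha=0.1):
    """One CD-1 parameter update for a binary RBM (updates W, a, b in place)."""
    # Step 2: hidden probabilities and a sampled hidden vector
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Step 4: reconstruct the visible layer, then recompute hidden probabilities
    pv1 = sigmoid(a + W @ h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Step 7: gradient ascent on <v h>_data - <v h>_reconstruction
    W += alpha * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += alpha * (v0 - v1)
    b += alpha * (ph0 - ph1)
    return W, a, b
```

Note the "+" in the updates: the RBM maximizes the log-likelihood, so this is gradient ascent.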

4) Initialization of Parameters

Generally, the weights w are initialized randomly from the normal distribution N(0, 0.01), and the bias b on the hidden-layer side is initialized to 0. The bias a comes from:
Formula. 28: a_i = ln(p_i / (1 - p_i))
where p_i stands for the fraction of training samples in which the i-th visible node is in the activated state (=1).
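Here is a sketch of this initialization scheme (my own function name; NumPy assumed, and the clipping of p_i is my addition to avoid ln(0) for nodes that are always off or always on in the data):

```python
import numpy as np

def init_rbm_params(train_v, n_hidden, rng=None):
    """W ~ N(0, 0.01), hidden bias b = 0, visible bias a_i = ln(p_i/(1-p_i))."""
    rng = rng or np.random.default_rng(0)
    n_visible = train_v.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b = np.zeros(n_hidden)
    p = train_v.mean(axis=0)            # activation frequency of each visible node
    p = np.clip(p, 1e-6, 1.0 - 1e-6)    # guard against ln(0)
    a = np.log(p / (1.0 - p))
    return W, a, b
```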

4. Supplement

1) Types of RBM

The data type of each node in the RBM introduced above is binary. However, in practice, the input to the visible layer could be continuous. To apply the RBM to different tasks, there are three different types of RBM.
  • Bernoulli-Bernoulli RBM (BB-RBM): this is the type exemplified above, in which the input and output are binary, {0 or 1}.
  • Gaussian-Bernoulli RBM (GB-RBM): the visible input obeys a Gaussian distribution (continuous variable) and the hidden output obeys a Bernoulli distribution (binary). The energy function of the RBM then becomes:
Formula. 29

  • Bernoulli-Gaussian RBM (BG-RBM): the visible input obeys a Bernoulli distribution (binary) and the hidden output obeys a Gaussian distribution (continuous variable). The energy function of the RBM then becomes:
Formula. 30

2) Greedy Layer-Wise Unsupervised Pretrain

As described in the background section on the DBN, it is a composition of multiple RBMs. Because it is hard to directly train deep sigmoid networks, the DBN treats every pair of adjacent layers as an RBM. This training method is therefore called greedy layer-wise unsupervised pre-training. As shown in the figure below, given the input v(0), we first train the RBM (v(0), h(0)). After training, we sample at h(0) and use the samples as the input for the next RBM, and continue like that...
Figure.8
Formula. 31
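The whole greedy layer-wise procedure can be sketched end-to-end (a toy illustration under my own naming; `train_rbm` is a minimal CD-1 trainer, and each trained layer's sampled hidden activations become the next layer's input):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=20, alpha=0.1):
    """Minimal CD-1 trainer for one binary RBM layer (sketch)."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(b + v0 @ W)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(a + W @ h0)
            v1 = (rng.random(n_visible) < pv1).astype(float)
            ph1 = sigmoid(b + v1 @ W)
            W += alpha * (np.outer(v0, ph0) - np.outer(v1, ph1))
            a += alpha * (v0 - v1)
            b += alpha * (ph0 - ph1)
    return W, a, b

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: train one RBM per layer, then feed
    its sampled hidden activations to the next RBM as the new visible data."""
    params = []
    x = data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        params.append((W, a, b))
        ph = sigmoid(b + x @ W)
        x = (rng.random(ph.shape) < ph).astype(float)
    return params
```

The returned per-layer weights would then initialize a deep network, which BP fine-tunes with labeled data.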

5. Conclusion

In this post, I mainly focused on the principle of the RBM as well as the learning process using a binary RBM. I think we can benefit a lot from a good understanding of the principle of the RBM. To help you better grasp the two extremely important learning methods, BP and RBM, here is a brief comparison between them:
  1. Learning Type: (BP) supervised learning; (RBM) unsupervised learning.
  2. Optimization Objective: (BP) minimum optimization; (RBM) maximum optimization.
  3. Objective Function: (BP) error function; (RBM) maximum likelihood function.
  4. Parameter Update Method: (BP) gradient descent; (RBM) gradient ascent.
  5. Purpose: (BP) fine-tune parameters; (RBM) pre-train parameters.
  6. Application: (BP-sigmoid) shallow neural networks; (RBM) deep neural networks.
