1. Introduction
Why we need Convolutional Neural Networks:
- too many parameters: Fully-connected feedforward networks have too many parameters. For example, suppose the input picture has the size 100*100*3 (height: 100; width: 100; RGB channels: 3). In a fully-connected feedforward network, each neuron in the first hidden layer then has 100*100*3 = 30,000 weights connecting it to the input layer. As the number of hidden neurons grows, the number of parameters explodes, which makes training inefficient.
- local invariant feature: Objects in natural pictures have locally invariant features, which means that operations such as scaling, rotating and shifting do not destroy the semantic information. This is key in image processing. However, a fully-connected network has difficulty capturing these locally invariant features.
- automatically extracting features: Neural networks, whether recurrent or feedforward, can take raw sensor signals as input. However, applying them to features derived from the raw sensor signals often leads to higher performance. Discovering adequate features requires expert knowledge, which necessarily limits a systematic exploration of the feature space. Convolutional neural networks (CNNs) have been suggested to address this.
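To make the parameter explosion concrete, here is a minimal sketch in plain Python. The hidden-layer sizes and the 16 kernels of size 3*3 are assumptions chosen only for illustration; the convolution-layer count uses the P*D*(m*n)+P formula from section 3.1.

```python
def fully_connected_params(num_inputs, num_hidden):
    """Weights plus one bias for every hidden neuron."""
    return num_hidden * (num_inputs + 1)


def conv_layer_params(kernel_h, kernel_w, in_channels, num_kernels):
    """Weights of every kernel plus one bias per kernel."""
    return num_kernels * in_channels * kernel_h * kernel_w + num_kernels


num_inputs = 100 * 100 * 3            # the 100*100*3 image from the text
for hidden in (10, 100, 1000):        # hidden-layer sizes are assumptions
    print(hidden, fully_connected_params(num_inputs, hidden))

# A convolution layer with 16 kernels of size 3*3 (both values are assumptions).
print(conv_layer_params(3, 3, 3, 16))  # 448
```

The fully-connected count grows with both the image size and the number of hidden neurons, while the convolution count depends only on the kernel size, the number of channels and the number of kernels.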
2. Background
2.1. 1-D Convolution
One-dimensional convolution is often used in signal processing.
A one-dimensional filter has the form [w1, w2, w3, ..., wm], where m is its length. [w1, ..., wm] is also called the filter or convolution kernel. For example, with the filter [-1, 0, 1]:
Figure 2.0 |
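As a rough illustration of what the filter [-1, 0, 1] does, here is a small sketch in plain Python. The input sequence is made up, and the function name conv1d is hypothetical. It follows the usual signal-processing convention in which the kernel is flipped before it slides over the signal, and it only keeps positions where the kernel fits entirely inside the signal.

```python
def conv1d(signal, kernel):
    """Narrow 1-D convolution (kernel flipped, no padding, stride 1)."""
    m = len(kernel)
    flipped = kernel[::-1]   # convolution flips the kernel
    out = []
    for t in range(len(signal) - m + 1):
        window = signal[t:t + m]
        out.append(sum(w * x for w, x in zip(flipped, window)))
    return out


signal = [1, 2, 4, 7, 11]    # made-up input sequence
kernel = [-1, 0, 1]          # the filter from the text
print(conv1d(signal, kernel))  # [-3, -5, -7]
```

Each output is the difference between the left and right neighbours of a position, so this filter responds to changes in the signal.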
2.2. Image Convolution (2-D)
Convolution was introduced in one of my past posts. Since that post was all about signal processing, here I will briefly describe its application in image processing. Because images have a two-dimensional structure, the convolution for image processing has to be extended to 2-D. Let the image be X∈R(M*N) and the filter be W∈R(m*n), where generally m<<M and n<<N. The image convolution is shown in the form:
Figure 2.1 |
Figure 2.2 |
Figure 2.3 |
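The 2-D case can be sketched the same way in plain Python. This is only an illustration under assumptions: the 3*3 image and 2*2 kernel are made up, the function name conv2d is hypothetical, and the output is the narrow form (no padding, stride 1).

```python
def conv2d(image, kernel):
    """Narrow 2-D convolution of an M*N image with an m*n kernel."""
    M, N = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    # Rotate the kernel by 180 degrees (flip rows and columns).
    flipped = [row[::-1] for row in kernel[::-1]]
    out = []
    for i in range(M - m + 1):
        out_row = []
        for j in range(N - n + 1):
            acc = 0
            for u in range(m):
                for v in range(n):
                    acc += flipped[u][v] * image[i + u][j + v]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]          # made-up 3*3 image
kernel = [[1, 0],
          [0, -1]]           # made-up 2*2 kernel
print(conv2d(image, kernel))  # [[4, 4], [4, 4]]
```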
2.3. Cross-correlation
Figure 2.4 |
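Cross-correlation slides the kernel over the image without flipping it, so convolution is the same as cross-correlation with the kernel rotated by 180 degrees. Below is a small plain-Python sketch of that relationship; the image, the kernel and the function names are made up for illustration.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it."""
    m, n = len(kernel), len(kernel[0])
    rows = len(image) - m + 1
    cols = len(image[0]) - n + 1
    return [[sum(kernel[u][v] * image[i + u][j + v]
                 for u in range(m) for v in range(n))
             for j in range(cols)]
            for i in range(rows)]


def rotate180(kernel):
    """Flip the kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


image = [[1, 2, 0],
         [0, 1, 3],
         [2, 1, 1]]          # made-up values
kernel = [[1, 2],
          [3, 4]]            # made-up values

correlation = cross_correlate2d(image, kernel)
# Convolution = cross-correlation with the kernel rotated by 180 degrees.
convolution = cross_correlate2d(image, rotate180(kernel))
print(correlation)  # [[9, 17], [12, 14]]
print(convolution)  # [[11, 13], [8, 16]]
```

Since the kernel weights are learned anyway, the two operations are equally expressive for feature extraction; they differ only in whether the learned kernel is stored flipped.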
2.4. Convolution parameters
- the dimension of the discrete convolution: e.g. 2-D (N=2)
- square inputs: the size of the input image, e.g. 100*100 (ix=iy=i). Of course, the image does not have to be square.
- square kernel size: the size of the convolution kernel, e.g. 3*3 (kx=ky=k). The kernel is square.
- same stride along both axes: the step length of the convolution window, e.g. s=1 (sx=sy=s). You can also set different strides for different dimensions.
- same zero padding along both axes: e.g. p=2 (px=py=p). You can also set different zero padding for different dimensions.
- square outputs: the size of the output after the convolution operation, e.g. 4*4 (ox=oy=o)
Relationship 1: no zero padding and unit stride, i.e. for any i and k, with s=1 and p=0
So the square outputs, o:
Figure 2.5 |
So the square outputs, o:
Figure 2.6 |
So the square outputs, o:
Figure 2.7 |
So the square outputs, o:
Figure 2.8 |
So the square outputs, o:
Figure 2.9 |
- narrow convolution: s=1, p=0; o=(i-k)+1
- wide convolution: s=1, p=k-1; o=i+k-1
- equal-width convolution: s=1, p=(k-1)/2 for odd k; o=i.
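The three special cases above all follow from the general output-size formula o = floor((i + 2p - k)/s) + 1. Here is a short sketch in plain Python that checks them; the function name and the example sizes i=5, k=3 are assumptions for illustration.

```python
def conv_output_size(i, k, s=1, p=0):
    """Output length along one axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


i, k = 5, 3   # example input and kernel sizes (assumed, k is odd)
print(conv_output_size(i, k))                  # narrow: i - k + 1 = 3
print(conv_output_size(i, k, p=k - 1))         # wide: i + k - 1 = 7
print(conv_output_size(i, k, p=(k - 1) // 2))  # equal-width: i = 5
print(conv_output_size(i, k, s=2, p=1))        # stride 2, padding 1: 3
```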
2.5. The derivative of Convolution (*)
Figure 2.10 |
If we use cross-correlation to replace convolution, we have:
Figure 2.11 |
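For the cross-correlation case, a standard result is that if Y is the cross-correlation of X with W and f is a scalar function of Y, then df/dW is the cross-correlation of X with df/dY. The sketch below checks this numerically in plain Python. All values are made up, g simply stands in for df/dY, and the function names are hypothetical; it is a sanity check under these assumptions, not a derivation.

```python
def cross_correlate2d(x, w):
    """Narrow cross-correlation of x with kernel w (no flipping)."""
    m, n = len(w), len(w[0])
    return [[sum(w[u][v] * x[i + u][j + v]
                 for u in range(m) for v in range(n))
             for j in range(len(x[0]) - n + 1)]
            for i in range(len(x) - m + 1)]


def loss(x, w, g):
    """Scalar f(Y) = sum(g * Y), where Y = cross_correlate2d(x, w)."""
    y = cross_correlate2d(x, w)
    return sum(g[i][j] * y[i][j]
               for i in range(len(y)) for j in range(len(y[0])))


x = [[1, 2, 0],
     [0, 1, 3],
     [2, 1, 1]]
w = [[1, 2],
     [3, 4]]
g = [[1, -1],
     [2, 0]]            # stands in for df/dY; made-up values

# Analytic gradient: df/dW is the cross-correlation of X with df/dY.
analytic = cross_correlate2d(x, g)

# Finite-difference check. f is linear in W, so a unit step is exact.
numeric = []
for u in range(len(w)):
    row = []
    for v in range(len(w[0])):
        bumped = [r[:] for r in w]
        bumped[u][v] += 1
        row.append(loss(x, bumped, g) - loss(x, w, g))
    numeric.append(row)

print(analytic)  # [[-1, 4], [3, 0]]
print(numeric)   # [[-1, 4], [3, 0]]
```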
3. The structure and feature of CNN
Current convolutional neural networks commonly consist of convolutional layers, pooling layers and fully-connected layers stacked alternately. They have three structural features: local connection, weight sharing and subsampling, which give convolutional neural networks a degree of invariance to shifting, scaling and rotation.
3.1 Convolutional Layer
The function of convolution is to extract local features. Different convolution kernels are capable of extracting different kinds of features. As CNNs are mainly used in image processing, the structure of the convolutional layer is designed around images.
An image generally has 3-D information: height (M), width (N) and depth (D). For gray images, D=1; for colour images, D=3 (RGB).
Let X be the input, with dimension M*N*D. We cut X into slices X_d (1≤d≤D).
Let Y be the output, the feature map. Its dimension is M'*N'*P. Similarly, P can be viewed as the depth of the feature map, and Y can be cut into slices Y_p (1≤p≤P). About the difference between D and P: D is the real depth of an image and is determined by the image itself, whereas P is set by the user. D and P are independent. P stands for the number of convolution kernels the user wants to apply simultaneously to the same input. The reason for using multiple kernels at the same time is that more kernels can extract more kinds of features, which usually helps classification accuracy.
Let W be the convolution kernel. Since the image has D channels and P convolution kernels are applied to each channel, the dimension of W is m*n*D*P, where m and n are the size parameters of the 2-D convolution kernel. W can be cut into slices W_p,d (1≤p≤P, 1≤d≤D).
Here is a figure for this:
Figure 3.1 |
From figure 3.1, we can see that the dimension of the output Y is M'*N'*P, while the dimension of W is m*n*D*P. So in order to get Y_p, we need to fuse the information from the D channels as shown below.
Figure 3.2 |
The fusion is done by summing the information from the D channels. Then we have the equation:
Figure 3.3 |
If we want to obtain P feature maps, we only need to repeat the process above for each of the P kernels. Then we can work out the number of parameters for the entire convolution layer: P*D*(m*n)+P (the weights plus one bias per kernel).
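The channel fusion and the parameter count can be sketched in plain Python as follows. This is a toy illustration under several assumptions: the arrays are stored channel-first as nested lists (X[d][i][j], W[p][d][u][v]), cross-correlation is used in place of flipped convolution, no padding or stride is applied, and the activation function is left out so that only the net input of each feature map is computed. All sizes, values and function names are made up.

```python
def conv_layer(X, W, b):
    """Net input of a convolution layer.

    X[d][i][j]: D input slices, W[p][d][u][v]: P*D kernel slices,
    b[p]: one bias per kernel. Returns Z[p][i][j].
    """
    D, M, N = len(X), len(X[0]), len(X[0][0])
    P, m, n = len(W), len(W[0][0]), len(W[0][0][0])
    Z = []
    for p in range(P):
        fmap = []
        for i in range(M - m + 1):
            row = []
            for j in range(N - n + 1):
                z = b[p]
                for d in range(D):      # fuse the D channels by summation
                    for u in range(m):
                        for v in range(n):
                            z += W[p][d][u][v] * X[d][i + u][j + v]
                row.append(z)
            fmap.append(row)
        Z.append(fmap)
    return Z


def num_parameters(P, D, m, n):
    """P*D kernel slices of m*n weights, plus one bias per kernel."""
    return P * D * (m * n) + P


# Toy sizes (assumed): D=2 channels of 3*3, P=2 kernels of 2*2.
X = [[[1, 0, 2], [0, 1, 0], [2, 0, 1]],
     [[0, 1, 0], [1, 0, 1], [0, 1, 0]]]
W = [[[[1, 0], [0, 1]], [[0, 1], [1, 0]]],
     [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]]
b = [0, 1]

Z = conv_layer(X, W, b)
print(Z)                                   # [[[4, 0], [0, 4]], [[3, 4], [4, 3]]]
print(num_parameters(P=2, D=2, m=2, n=2))  # 18
```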
3.2 Pooling Layer
It is also called the subsampling layer. Its function is to select features and reduce their number, which in turn reduces the number of parameters in the following layers. Though we can reduce the dimension of the features by increasing the stride, the number of neurons in the feature maps is still large, which easily leads to overfitting. So we add a pooling layer to reduce the features further. A pooling layer usually follows convolution layers.
Let X be the input of the pooling layer; it is actually the output of the convolution layer, X∈R(M'*N'*P). We can divide each feature map into many sub-areas R_p,m,n (1≤p≤P, 1≤m≤M', 1≤n≤N'). These sub-areas may or may not overlap. Pooling downsamples each sub-area to a single value.
There are two common pooling functions:
- Maximum pooling: take the max value of the neurons in a sub-area.
- Mean pooling: take the mean value of the neurons in a sub-area. (an example is shown below)
Figure 3.4 |
From this example, we can see that the pooling layer not only effectively reduces the number of neurons, but also keeps some small local features invariant.
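Both pooling functions can be sketched in a few lines of plain Python. The 4*4 feature map, the 2*2 non-overlapping regions and the function names are assumptions chosen for illustration.

```python
def pool2d(fmap, size, reduce_fn):
    """Apply reduce_fn to non-overlapping size*size regions of one feature map."""
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            region = [fmap[i + u][j + v]
                      for u in range(size) for v in range(size)]
            row.append(reduce_fn(region))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


fmap = [[1, 3, 2, 0],
        [5, 7, 1, 1],
        [0, 2, 4, 8],
        [2, 0, 6, 2]]            # made-up 4*4 feature map

print(pool2d(fmap, 2, max))   # [[7, 2], [2, 8]]
print(pool2d(fmap, 2, mean))  # [[4.0, 1.0], [1.0, 5.0]]
```

The 16 neurons are reduced to 4 in both cases, and with maximum pooling the output stays the same if the largest value moves to another position inside its region.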
3.3 Fully connected Layer (Dense Layer)
As said above, convolution extracts the features. After a sequence of convolution layers and pooling layers, we use fully connected layers to turn the features into a vector of numbers that can be used for classification. That is exactly the function of dense layers.
The principle of using fully-connected layers to represent the features is similar to that of pooling layers. Pooling layers use a small area to downsample, whereas fully connected layers use the whole area of a feature map to produce a value representing the feature. There can be more than one fully-connected layer.
How does it work?
Let X∈R(M'*N'*P) stand for the output of a pooling layer or convolution layer. Let Z∈R(1*1*Q) stand for the output of a fully connected layer.
To transform X into Z, we need a global convolution kernel, whose size is M'*N'*P*Q.
Sometimes we have multiple fully connected layers; the propagation between fully connected layers can be viewed as a convolution whose kernel size is 1*1.
The number of parameters:
If the number of neurons in a pooling layer or convolution layer is P (here P counts neurons, not feature maps) and the number of neurons in the fully connected layer is Q, then the number of parameters is Q*(P+1).
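A minimal sketch of a fully connected layer in plain Python is given below. The feature volume is flattened into a vector of P neurons, and each of the Q outputs is a weighted sum over all of them plus a bias. The sizes, the values and the function names are assumptions for illustration, and the volume is stored channel-first as nested lists.

```python
def flatten(X):
    """Flatten a feature volume (nested lists) into one vector."""
    return [value for plane in X for row in plane for value in row]


def dense(x, W, b):
    """z[q] = sum_i W[q][i] * x[i] + b[q] for each of the Q output neurons."""
    return [sum(w_i * x_i for w_i, x_i in zip(W[q], x)) + b[q]
            for q in range(len(W))]


def dense_parameters(P, Q):
    """Q neurons, each with P weights and one bias."""
    return Q * (P + 1)


# Toy volume (assumed): 2 feature maps of size 2*2, so 8 input neurons.
X = [[[1, 2], [3, 4]],
     [[0, 1], [1, 0]]]
x = flatten(X)
W = [[1, 0, 0, 0, 0, 0, 0, 1],     # Q = 2 output neurons
     [0, 1, 1, 0, 1, 1, 1, 0]]
b = [0.5, -1.0]

print(dense(x, W, b))                          # [1.5, 6.0]
print(dense_parameters(P=len(x), Q=len(W)))    # 18
```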
3.4 Softmax
In mathematics, the softmax function takes an un-normalized vector and normalizes it into a probability distribution. That is, prior to applying softmax, some vector elements could be negative or greater than one, and they might not sum to 1; but after applying softmax, each element xi is in the interval (0,1) and ∑xi=1. Softmax is often used in neural networks to map the non-normalized output to a probability distribution over the predicted output classes.
The definition and one example of softmax computation are given below.
Figure 3.5 |
Figure 3.6 |
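Softmax is short enough to write out directly. The sketch below uses only the standard library and the usual definition softmax(x)_i = exp(x_i) / sum_j exp(x_j). Subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result; the input scores are made up.

```python
import math


def softmax(scores):
    """softmax(x)_i = exp(x_i) / sum_j exp(x_j)."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, -1.0, 0.0]      # made-up un-normalized outputs
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(sum(probs))                   # close to 1, up to floating-point rounding
```

The largest score gets the largest probability, and negative scores still map to small positive probabilities.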
3.6 Local Connection
In a CNN, for a 2-D image as the input, each pixel is viewed as an input neuron as shown below.
Figure 3.7 |
All input neurons constitute a 2-D plane. The filter (or convolution kernel) holds the weights, and each neuron in the hidden layer holds the result of the convolution operation at one position of the input. According to the process of image convolution, each hidden-layer neuron is only connected to the neurons covered by the filter window (e.g. 9 pixels). That is exactly the local connection. So it can be expressed in the form:
Figure 3.8 |
l means the current layer and l-1 means the previous layer. z is the collection of the aforesaid hidden-layer neurons, so the current layer looks like a 2-D array of neurons. w is the weights, i.e. the convolution kernel. Notice that all neurons within the same layer share this convolution kernel (the so-called weight sharing). a is not the collection of individual pixels in the input layer; it is the collection of windows (each window contains 3*3 neurons). b is the bias for the current-layer neurons.
3.7 Weight Sharing
Weight sharing means all neurons in one layer share the same convolution kernel. If the current layer has n neurons and the filter has m elements, the parameters of the convolution layer are m weights and n biases, m+n in all. (If a single bias is shared per kernel, as in the count of section 3.1, this becomes m+1.)
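The effect of local connection and weight sharing on the number of weights can be compared with a small plain-Python sketch. The 1-D layer sizes below are assumptions for illustration, biases are left out of the count, and the function name is hypothetical.

```python
def count_weights(n_prev, n_curr, m):
    """Weights between a layer of n_prev neurons and a layer of n_curr neurons."""
    return {
        "fully connected": n_prev * n_curr,   # every pair of neurons is connected
        "local connection": n_curr * m,       # each neuron only sees m inputs
        "weight sharing": m,                  # one kernel reused by every neuron
    }


# 1-D toy layer (assumed sizes): 100 inputs, kernel of 3 elements, narrow convolution.
n_prev, m = 100, 3
n_curr = n_prev - m + 1
counts = count_weights(n_prev, n_curr, m)
for name, value in counts.items():
    print(name, value)
# fully connected 9800
# local connection 294
# weight sharing 3
```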
3.8 Subsampling
Please see section 3.2.
4. Parameters learning of CNN
In a CNN, the parameters are the weights of the convolution kernels and the biases. The BP (backpropagation) algorithm is used to train the parameters. In fully-connected feedforward networks, the error propagates backwards through each layer. But in a CNN, apart from fully-connected layers, there are two other kinds of layers: the convolution layer and the pooling layer. So we look into these two layers in turn.
Let l stand for the convolution layer and l-1 for its input layer, so X∈R(M*N*D) and Z∈R(M'*N'*P). For the p-th feature map, we have:
Figure 4.1 |
Let L stand for the loss function. Then, according to the chain rule, the derivatives of L with respect to W and b are:
Figure 4.2 |
So the next step is to work out the error term δ. Here we need to consider two cases: 1) the (l+1)-th layer is a pooling layer; 2) the (l+1)-th layer is a convolution layer.
- the (l+1)-th layer is a pooling layer (up is the upsampling function).
Figure 4.3 |
- the (l+1)-th layer is a convolution layer.
Figure 4.4 |
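The upsampling function used in the pooling case can be sketched in plain Python. For mean pooling, each error value is spread evenly over its region; for maximum pooling, it is routed to the position that held the maximum in the forward pass. The sketch only covers up itself (the full error term also multiplies by the derivative of the activation), and the 2*2 non-overlapping regions, the values and the function names are assumptions for illustration.

```python
def up_mean(delta, size):
    """Spread each error value evenly over its size*size pooling region."""
    share = 1.0 / (size * size)
    out = []
    for row in delta:
        expanded = [d * share for d in row for _ in range(size)]
        out.extend([expanded[:] for _ in range(size)])
    return out


def up_max(delta, fmap, size):
    """Route each error value to the position that held the region's maximum."""
    out = [[0.0] * len(fmap[0]) for _ in fmap]
    for i, row in enumerate(delta):
        for j, d in enumerate(row):
            cells = [(i * size + u, j * size + v)
                     for u in range(size) for v in range(size)]
            # On ties, the first maximum in row-major order receives the error.
            r, c = max(cells, key=lambda rc: fmap[rc[0]][rc[1]])
            out[r][c] = d
    return out


fmap = [[1, 3, 2, 0],
        [5, 7, 1, 1],
        [0, 2, 4, 8],
        [2, 0, 6, 2]]            # made-up feature map seen in the forward pass
delta_next = [[4.0, 8.0],
              [2.0, 6.0]]        # made-up error terms of the pooling layer

print(up_mean(delta_next, 2))
print(up_max(delta_next, fmap, 2))
```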
From the equations above, we see that δ is computed recursively: it depends on the error term of the next layer. Only the derivatives with respect to the convolution kernel and bias are given here. The fully connected layers and softmax also require derivative computations, which are to be added.
5. Conclusion
This post briefly introduces CNNs, including image convolution, the structure and features of CNNs, and their parameter learning.
CNNs are very commonly used in computer vision and image processing, so a full understanding of how they work is essential. From this post, we know that a CNN consists of 3 kinds of layers: convolution layers, pooling layers and fully connected layers. The classic CNN structure is (generally, M∈[2, 5], b∈[0, 1], N∈[1, 100 or more], K∈[0, 2]):
Figure 5.1 |
In the figure, ReLU acts as the activation function. The BP algorithm is used to train the parameters, so the CNN is trained as a whole: the error propagates backwards layer by layer to update the parameters of each layer.
There are many classic CNNs: LeNet-5, AlexNet, Inception, Residual Network (ResNet), VGG, GoogLeNet. To use them well, it is good to figure out the differences and features of these CNNs, which I plan to do when I start learning CV.