Understanding of Convolutional Neural Network (CNN): A Review

ABSTRACT


Introduction
In recent years, deep learning technology has been used in various sectors. Deep learning has developed human-like abilities, such as knowledge learning, problem-solving, and decision-making [1]. Big companies have tried to adopt the latest digital technologies, including the Internet of Things (IoT), Big Data, Artificial Intelligence (AI), and Blockchain [2]. Deep learning technology is a development of machine learning and Artificial Intelligence (AI) [3].
In general, machine learning and deep learning can perform self-training without repetitive programming by humans. Deep learning requires an initial data collection, called a data set, to predict the outcome of the data. Deep learning produces output data based on training and testing data [4]. After passing the learning evaluation, deep learning can predict data. Deep learning can be used for pattern recognition or data prediction using big data in several scenarios [5]. The methods used for the learning system include supervised and unsupervised learning. A supervised algorithm tries to identify the relationship between input and output data, creating a predictive model that predicts the output based on the matched input [5]. In contrast, an unsupervised algorithm employs a learning system using non-labeled data. The algorithm can classify training data according to their distinctive characteristics, primarily based on dimension reduction and grouping systems [6].
Deep learning [7] differs from traditional machine learning systems in that it allows automatic feature extraction from raw data through multiple levels of representation learning, from raw to high and abstract levels. Deep learning can increase learning capacity by amplifying significant patterns and suppressing irrelevant variation in the input data, together with the exponential advantage of representing complex non-linear functions of the large amounts of data that continuously accumulate within hidden deep network layers [8]. Techniques used in deep learning include convolutional, recurrent, and deep neural networks [9]. Deep learning technology utilizes artificial neural networks, especially the convolutional method.
One of the most widely used deep learning algorithms is the convolutional neural network (CNN). CNN was first introduced in the 1960s [10] and has shown promising performance in computer vision [11]. CNN has become the most representative neural network in deep learning [12]. CNN has been utilized to solve complicated visual tasks with high computation [11] and is mainly used in image classification [13], [14], segmentation, object detection, video processing, natural language processing, and speech recognition [15]. Implementations of CNN include video analysis in a study by Shri [16] and image analysis by Roncancio [17]. This article's contribution is to describe CNN in a brief yet comprehensive explanation. Each constructing element is presented as another point of view on the AI method.

Convolutional Neural Network Layer and Architecture
CNN has four layers: the convolution layer, pooling layer, fully connected layer, and nonlinearity layer [18]. Illustrations of those four layers are presented in Fig. 1 [19]. Each layer is described in the following subsections.

Convolutional Layer
A machine sees an image as a set of numbers, commonly known as matrices. Each number represents the light intensity at a particular point called a pixel. Adam Geitgey illustrates pixels in an image on his website, Medium, as shown in Fig. 2. The convolutional layer employs a kernel filter to calculate the convolution of input images, extracting the fundamental features. The kernel filter has the same number of dimensions but smaller constant parameter values than the input image [20]. For instance, an acceptable kernel filter for a 2D scalogram with a size of 35×35×2 is n×n×2, where n = 3, 5, 7, and so on; the filter size always has to be smaller than the size of the input image. The filter mask slides across the input image step by step and computes the dot product between the kernel filter weights and the pixel values of the input image. This process results in a 2D activation map, from which CNN learns the visual features of the image. The general equation of the convolutional layer can be expressed as in (1), where x is the input image, w is the kernel filter, and y is the resulting activation map. Fig. 3 shows a simple illustration of the computational process in CNN that results in the activation map.

y(i, j) = Σ_m Σ_n x(i + m, j + n) · w(m, n)     (1)

Fig. 3. Convolutional Layer
A convolutional layer is defined by three hyperparameters: kernel size, stride length, and padding [21]. Kernel size is the size of the sliding kernel filter [22]. Stride length is the number of pixels the kernel slides before computing the next dot product and creating an output pixel [23]. Padding is the width of the zero-valued frame set up around the input feature map [24].
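To make the sliding-window computation above concrete, the following is a minimal NumPy sketch (illustrative, not from the original article); the function name conv2d and the averaging kernel are assumptions for the example.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` and return the 2D activation map."""
    if padding > 0:
        image = np.pad(image, padding)          # zero-valued frame around the input
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1                # output height
    ow = (iw - kw) // stride + 1                # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # dot product between kernel weights and the pixel patch
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                  # 3x3 averaging filter
fmap = conv2d(image, kernel)
print(fmap.shape)                               # (3, 3): (5 - 3)/1 + 1 = 3
```

With a 5×5 input, a 3×3 kernel, stride 1, and no padding, the resulting activation map is 3×3, matching the output-size rule implied by the stride and padding definitions above.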

Pooling Layer
The pooling layer is commonly inserted between two consecutive convolutional layers. It reduces the number of parameters and the computational load by producing a down-sampled representation. The function in the pooling layer can take the maximum or the average value; max pooling is the most frequently used variant [25]. The pooling layer is also helpful in reducing overfitting and computation weights. Fig. 4 represents a simple dimension-reduction operation on an activation map using the max-pooling function [20].
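The down-sampling described above can be sketched as follows (an illustrative NumPy example, not from the original article), keeping only the maximum of each 2×2 window:

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Down-sample an activation map by keeping the max of each window."""
    h, w = fmap.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()            # keep only the strongest response
    return out

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 0.],
                 [1., 8., 3., 4.]])
pooled = max_pool(fmap)
print(pooled)       # [[6. 4.]
                    #  [8. 9.]]
```

A 4×4 map shrinks to 2×2, i.e., a 75% reduction in the number of values, which is why pooling cuts both parameters and computation in the layers that follow.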

Fully Connected Layer
The third layer is the fully connected layer, commonly called the convolutional output layer [26]. The fully connected layer is similar to a feedforward neural network, as shown in Fig. 5. The layer is commonly found at the bottom of the network. It receives input from the final pooling or convolutional layer, which is flattened before being sent to the subsequent layer. Flattening the output means unrolling all the values of the 3D matrix obtained after the last pooling or convolutional layer into a vector. This method is a simple technique for learning high-level non-linear combinations of the features represented by the convolutional output layer [26].
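The flattening step and the fully connected layer can be sketched in a few lines of NumPy (an illustrative example; the 5×5×16 volume and 120 output units are assumed sizes for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Output of the last pooling layer: a 3D volume (height x width x channels).
volume = rng.standard_normal((5, 5, 16))

# "Unrolling" the 3D matrix into a vector, as described above.
flat = volume.reshape(-1)                # 5 * 5 * 16 = 400 values

# A fully connected layer is a single matrix multiplication plus a bias.
weights = rng.standard_normal((120, 400)) * 0.01
bias = np.zeros(120)
output = weights @ flat + bias           # 120 activations feed the next layer
print(flat.shape, output.shape)          # (400,) (120,)
```

Every output unit is connected to all 400 flattened inputs, which is what lets this layer learn non-linear combinations of the features extracted by the earlier layers (once an activation function is applied to `output`).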

Nonlinearity Layer (Activation Function)
An activation function plays an essential role in CNN layers. The filtered output is passed through another mathematical function called the activation function [26]. ReLU, an abbreviation of Rectified Linear Unit [27], is the most common activation function in feature extraction using CNN. The main objective of the activation function is to decide the final output of a neural network, such as 'yes' or 'no'. The activation function maps the output values to a range such as 0 to 1 or -1 to 1, depending on the function.
The activation function can be differentiated into two categories: linear and non-linear activation functions [26]. A simplified mathematical expression of a linear activation function can be written as f(x) = cx. The input values are multiplied by the constant parameter c, which is the weight of each neuron, so the output is proportional to the input. Linear functions can do more than a step function, which gives only a single final answer of yes or no rather than a range of outputs. Some of the most popular activation functions in CNN and other neural networks are listed as follows [28].
1. Sigmoid: this activation function takes real numbers as inputs and limits the output between 0 and 1. The curve of the sigmoid function is S-shaped and can be mathematically represented as in (2): f(x) = 1 / (1 + e^(−x)).
2. Tanh: the tanh function is similar to sigmoid since both take real numbers as inputs. However, the tanh function limits its output between -1 and 1. The tanh function can be mathematically represented as in (3): f(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
3. ReLU: ReLU is the most common function used in CNN. All negative inputs are converted to zero, so the output is never negative. The computational load of ReLU is relatively lower than that of other functions. Mathematically, the ReLU function is represented as in (4): f(x) = max(0, x).
4. Leaky ReLU: whereas the ReLU function discards negative inputs entirely, the Leaky ReLU function scales them down so that inputs are never ignored. This function is used to solve the dying-ReLU issue. A mathematical representation of Leaky ReLU is presented in (5): f(x) = max(αx, x), with a small leak factor α (e.g., 0.01).
5. Noisy ReLU: this function extends ReLU by adding Gaussian-distributed noise to the input. A mathematical expression of the Noisy ReLU function is presented in (6).
f(x) = max(0, x + Y), with Y ~ N(0, σ(x))     (6)
6. Parametric Linear Units: this function largely adopts the concept of Leaky ReLU. The difference between the two functions is that the leak factor is updated during training. A mathematical representation of Parametric Linear Units can be seen in (7): f(x) = x for x > 0 and f(x) = a·x otherwise, where a is a learned parameter.
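The standard functions in the list above take only a few lines each; the following is an illustrative NumPy sketch of the first four (the deterministic ones):

```python
import numpy as np

def sigmoid(x):                 # output limited to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # output limited to (-1, 1)
    return np.tanh(x)

def relu(x):                    # negative inputs become zero
    return np.maximum(0.0, x)

def leaky_relu(x, leak=0.01):   # negative inputs are scaled, never ignored
    return np.where(x > 0, x, leak * x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))                  # [0. 0. 3.]
print(leaky_relu(x))            # the -2.0 input survives as -0.02
```

Comparing `relu` and `leaky_relu` on the same input shows the difference directly: ReLU zeroes the negative value, while Leaky ReLU keeps a small, scaled version of it, which is exactly what prevents the dying-ReLU issue.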

Popular CNN Architecture
The architecture of CNN is influenced by the organization and function of the visual cortex [26]. The design is made to resemble the neuron connections in human brains. Having described the layers of CNN, we discuss some popular CNN architectures in this section.

LeNet
Currently, the development of LeNet has reached the LeNet-5 version. This version is a gradient-based CNN learning structure and was first introduced for handwritten digit recognition [29]. The structure diagram of LeNet-5 is presented in Fig. 6 [30]. The input of LeNet-5 is a grayscale image with a dimension of 32×32×1, which passes through a convolutional layer with six 5×5 filters and a stride of 1, producing six 28×28 feature maps (a 28×28×6 output). The stride controls how far the filter slides across the input at each step; the layer uses the tanh activation function. The second layer is a pooling layer with a 2×2 filter, six feature maps, and a stride of 2; with the tanh function, it produces a 14×14×6 output.
The third layer is a second convolutional layer with 16 feature maps, a 5×5 filter, and a stride of 1, resulting in an output with a dimension of 10×10×16. The fourth layer is a pooling layer with a 2×2 filter, a stride of 2, and 16 feature maps, resulting in an output of 5×5×16; flattened, this gives 400 nodes. The fifth layer is a fully connected layer with 120 feature maps, each of dimension 1×1, using the tanh activation function; its 120 nodes are connected to the 400 nodes of the fourth layer. The sixth layer is a fully connected layer with 84 nodes, giving 10,164 trainable output parameters. The last layer in LeNet-5 is a fully connected layer with a softmax activation over the 10 digit classes, producing the classified output.
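The layer-by-layer dimensions above all follow the standard output-size rule (W − F + 2P)/S + 1. A small sketch tracing the LeNet-5 shapes (assuming stride 1 for the convolutions and stride 2 for the pooling layers, with no padding):

```python
def out_size(w, f, s=1, p=0):
    """Output width of a layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

w = 32                        # 32x32 grayscale input
w = out_size(w, f=5)          # C1: 5x5 conv, 6 maps   -> 28x28x6
w = out_size(w, f=2, s=2)     # S2: 2x2 pool, stride 2 -> 14x14x6
w = out_size(w, f=5)          # C3: 5x5 conv, 16 maps  -> 10x10x16
w = out_size(w, f=2, s=2)     # S4: 2x2 pool, stride 2 -> 5x5x16
print(w, 5 * 5 * 16)          # 5 400: the 400 flattened nodes mentioned above
```

Running the trace reproduces every intermediate size quoted in the text, which is a useful sanity check when designing or debugging a CNN of this kind.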

AlexNet
Alex Krizhevsky introduced AlexNet in 2012 for the ImageNet Large-Scale Visual Recognition Challenge [31]. This architecture is one of the CNN architectures with a basic, simple, yet effective layer design. AlexNet has five convolutional layers, some of which are followed by pooling layers, and three fully connected layers. In the AlexNet architecture, the convolutional kernels are learned during the back-propagation optimization procedure using stochastic gradient descent [31]. The convolutional layer operates by sliding the convolutional kernel, creating convolved feature maps that capture information within a given neighborhood window. Equation (8) is the half-wave rectifier function, f(x) = max(0, x), used in AlexNet, which significantly speeds up the training phase and helps avoid overfitting.
The dropout technique in AlexNet is used as a stochastic regularizer that sets a fraction of input neurons to zero, reducing the co-adaptation of neurons; it is commonly applied in the fully connected layers. The architecture of AlexNet can be seen in Fig. 7 [31].
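The dropout idea can be sketched as follows. This is an illustrative example using the "inverted dropout" formulation (survivors are rescaled at training time), a common variant rather than AlexNet's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, drop_prob=0.5, training=True):
    """Randomly zero a fraction of neurons; rescale survivors (inverted dropout)."""
    if not training:
        return activations                 # no dropout at inference time
    mask = rng.random(activations.shape) >= drop_prob
    # Scaling keeps the expected activation the same with or without dropout.
    return activations * mask / (1.0 - drop_prob)

x = np.ones(1000)
y = dropout(x, drop_prob=0.5)
# Each neuron is either zeroed or doubled; on average the sum is preserved.
```

Because each neuron must work even when random neighbors are silenced, the network cannot rely on fragile co-adapted groups of neurons, which is the regularizing effect described above.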

VGGNet
One of the most widely used versions of VGGNet is VGGNet-16. This architecture employs 13 convolutional layers and 3 fully connected layers [32]. The convolutional layers in VGG-16 have a size of 3×3 with a stride and padding of 1. Meanwhile, the pooling layers have a size of 2×2 with a stride of 2. The resolution of the input image in VGG-16 is 224×224. After each pooling layer, the width and height of the feature map are reduced by 50%. The last feature map before the fully connected layers is 7×7 with 512 channels, which is flattened into a vector of 7×7×512 elements [33]. The architecture of VGGNet-16 is represented in Fig. 8.
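The 224×224 input and the final 7×7 map are consistent with the halving rule above, as this small illustrative trace shows:

```python
# Spatial resolution through VGG-16's five pooling stages: each 2x2 pool
# with stride 2 halves the feature-map width and height.
size = 224
for stage in range(5):
    size //= 2                # 224 -> 112 -> 56 -> 28 -> 14 -> 7
print(size)                   # 7: the final 7x7 feature map, 512 channels
print(7 * 7 * 512)            # 25088 values in the flattened vector
```

Five halvings take 224 down to exactly 7, so the fully connected layers always receive a 7×7×512 = 25,088-element vector regardless of which convolutional stage produced which features.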

Discussion
Artificial intelligence combined with a deep network is commonly called deep learning. In this study, deep learning is explained through some popular network architectures, such as LeNet [34], AlexNet [35], and VGGNet [36]. In general, the network architectures can be differentiated by the depth of the network and the architectural approach. The resolution of input images used in each architecture differs based on the initial input criteria. LeNet uses a smaller input image (32×32) than AlexNet and VGGNet. The number of convolutional layers also differs; for instance, VGGNet has 13. LeNet in this study utilizes the MNIST database to measure accuracy, achieving an accuracy greater than 90%.
Meanwhile, AlexNet and VGGNet utilize the ILSVRC database in their error measurement, resulting in 15.3% and 6.8% error rates, respectively. The distinctive characteristics of each CNN architecture are listed in detail in Table 1. Another study, by Swapna [37], explained the error rates of each CNN architecture, in accordance with the results of this study.

Conclusion
In general, machine learning can perform self-learning without any repetitive programming by humans. Meanwhile, deep learning is an implementation of machine learning that aims to imitate the workings of human brains using artificial neural networks. One of the most popular methods in deep learning is the convolutional neural network (CNN). This algorithm has many essential applications, including image classification, segmentation, object detection, video processing, natural language processing, and speech recognition. CNN has four layers: a convolutional layer, a pooling layer, a fully connected layer, and a nonlinearity layer. The main technique in CNN algorithms is convolution: a filter slides over an input, combining the input and filter values into a feature map. The pooling layer is inserted between two consecutive convolutional layers. It also minimizes the number of parameters and the computational load by performing down-sampling. The function in the pooling layer can produce maximized or averaged results. The fully connected layer connects all activation neurons from the previous layer to the next layer. An activation function plays an important role in CNN layers.
The filtered output is passed through another mathematical function called an activation function. Different activation functions exist: Sigmoid, Tanh, ReLU, Leaky ReLU, Noisy ReLU, and Parametric Linear Units. Sigmoid takes real numbers as inputs and limits the output between 0 and 1. Tanh is similar to sigmoid since both take real numbers as inputs, but the tanh function limits its output between -1 and 1. ReLU is the most commonly used function in CNN: all negative inputs are converted to zero, and its computational load is relatively lower than that of other functions. Whereas ReLU discards negative inputs, the Leaky ReLU function scales them down so that inputs are never ignored; this function is used to solve the dying-ReLU issue. Noisy ReLU adds Gaussian-distributed noise to the input.
Meanwhile, Parametric Linear Units largely adopt the Leaky ReLU concept; the difference between the two functions is the leak factor, which is updated during training. Some of the popular CNN architectures are LeNet, AlexNet, and VGGNet. LeNet is one of the simplest CNN architectures, with 2 convolutional and 3 fully connected layers. In comparison, AlexNet has 5 convolutional and 3 fully connected layers, while VGGNet uses 13 convolutional and 3 fully connected layers. The various advantages of each CNN architecture make it suitable for solving complex visual tasks with high computational loads. CNN is also one of the most representative neural networks in deep learning.