Abstract

Convolutional neural network (CNN) is one of the best-performing neural networks for classification, segmentation, natural language processing (NLP), and video processing. A CNN consists of multiple layers, or structural parameters. Its architecture can be divided into three sections: convolution layers, pooling layers, and fully connected layers. CNNs have come into high demand because of their ability to learn features from images automatically, although this requires a massive amount of training data and high-performance computational resources such as GPUs. As these resources have become available, multiple CNN architectures have been reported. This study focuses on the working of the convolution, pooling, and fully connected layers of the CNN architecture; the origin of the architectures; the limitations and benefits of reported architectures; and a comparative analysis of contemporary architectures with respect to the number of parameters, architectural depth, and significant contribution.

1. Introduction

Convolutional neural network (CNN) is a neural network that has performed exceptionally well on computer vision problems [1]. CNNs are considered among the best models for learning from image data and have performed extraordinarily in image classification, segmentation, and detection. Nowadays, most image processing and computer vision-related problems apply CNNs to obtain a better solution [2]. The reason behind the better performance of CNNs in these tasks is their ability to work on raw data without prior knowledge [3]. CNN is a biologically inspired artificial neural network (ANN). In a CNN, information travels unidirectionally, as in a feed-forward network. Its architecture is modeled on the human brain’s visual cortex, consisting of simple and complex cells arranged in several alternating layers [4]. The figure below shows the complete overview of CNN architecture.

A CNN learns by adjusting its weights according to the target during training, using the backpropagation method. The human brain’s response-based learning is akin to the optimization of an objective function using backpropagation. A deep CNN’s multilayered, hierarchical structure allows it to extract low-, mid-, and high-level information. Low- and mid-level characteristics are combined to form high-level (more abstract) features. This hierarchical feature extraction mimics the deep, layered learning process of the neocortex in the human brain, which dynamically learns features from raw input. CNN’s appeal stems mainly from its ability to extract hierarchical features [5]. Several researchers have contributed to performance improvements of CNN. According to the literature, improvements in CNN are made by optimizing the architectural parameters and weights [6]. CNN performance can also be improved by increasing the training data, transforming the training data, and adjusting the parameters [7]. In this article, several CNN architectures are discussed, along with their strengths and limitations.

This study provides an understanding of the essential components and the theoretical and mathematical design principles of CNN. The rest of the paper is organized as shown in Figure 2. Section 1 develops a basic understanding of CNN; Section 2 describes the basic CNN components, namely convolution, pooling, and fully connected layers; Section 3 presents a mathematical representation of CNN; Section 4 presents a study of CNN architectures, including their evolution, origin, and a comparison of contemporary architectures; Section 5 presents the challenges of several CNN architectures; and finally the conclusion and future work are discussed.

2. Review Protocol

The design of this article is based on a systematic research study. This is an appropriate and consistent procedure for recording relevant points of interest within the study range and for inspecting and analyzing all current studies identified. A review protocol for searching and selecting relevant research articles from research databases was developed in this research. Figure 2 represents an overview of the review protocol. The proposed review protocol considers the publishers, selection criteria, and rejection criteria of research articles. The complete review protocol is described in Figure 3.

2.1. Publisher

In this survey study, articles were selected from IEEE, ACM, Springer, and Elsevier. We also picked some significant articles from Google Scholar and kept them in the category “Others.”

2.2. Selection Criteria

In this research, a standardized paradigm was prepared for the selection and rejection of articles. Several parameters, such as subject relevancy and year-wise range, were kept in mind when selecting relevant research articles from digital libraries. The selection and rejection criteria are briefly described below.

2.2.1. Subject Relevancy

An article selected for this study must fit the research setting defined by the criteria of the review protocol. It must provide relevant findings and must be assessed according to the predefined standards. Nonsignificant findings that do not agree with the predefined settings are removed.

2.2.2. Year-Wise Range

For this study, research articles from the last ten years (2011 to 2021) are considered. The search in the research repository is configured so that it does not return any research paper more than ten years old.

2.2.3. Result Oriented

The research articles selected for this study must be result oriented. Before finalizing an article, take a brief look at it, especially the result section, and confirm that the paper makes a significant contribution to the relevant field. If the article does not fulfill this criterion, dismiss it.

2.3. Rejection Criteria

This research aims to be a quality survey. Thus, all irrelevant research articles must be discarded. The settings below were used to keep only relevant articles in the research bank.

2.4. Repetition

It is challenging to incorporate all of the research articles collected through a review protocol. Thus, research articles that are not distinguishable from one another under the research settings are removed. Only the latest one is kept, and the remaining ones are discarded.

2.5. Title-Based Rejection

A brief look at the title of a research paper helps justify its selection for this research. Although assessing a research article in this way may require some experience, the result is fruitful. Substantial certainty and experimentation are necessary to support the study proposition and extreme consequences. If the title of the research article does not correspond to the research settings, dismiss it.

2.6. Abstract Based

It can be challenging to decide on an article by observing only its title. In such a case, a decision may be made by taking a brief look at the abstract. From the abstract, we can obtain information about the technique used and its results. With this information, the authors can confirm whether or not the paper is suitable. If the relevant technique is not applied, the article is rejected.

3. CNN Architecture Overview

3.1. Convolution Layers

The convolution layer is the first part of the CNN architecture after the input layer and consists of a set of convolution kernels (neurons) [8]. Each kernel (neuron) is associated with a small portion of the input image, called its receptive field. The layer operates by dividing the input image into these smaller pieces (receptive fields) and convolving them with a specific set of weights. The operation of a convolution layer in a CNN can be expressed as follows.

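At the lth convolution layer, we use the following notation:

(i) Input: $a^{[l-1]}$ with size $(n_H^{[l-1]}, n_W^{[l-1]}, n_C^{[l-1]})$, $a^{[0]}$ being the image input.
(ii) Padding: $p^{[l]}$; stride: $s^{[l]}$.
(iii) Number of filters: $n_C^{[l]}$, where each filter $K^{(n)}$ has dimensions $(f^{[l]}, f^{[l]}, n_C^{[l-1]})$.
(iv) Bias of the nth convolution: $b_n^{[l]}$.
(v) Activation function: $\psi^{[l]}$.
(vi) Output: $a^{[l]}$ with size $(n_H^{[l]}, n_W^{[l]}, n_C^{[l]})$.

We have

$$\dim\big(\mathrm{conv}(a^{[l-1]}, K^{(n)})\big) = (n_H^{[l]}, n_W^{[l]}).$$

Thus,

$$\dim(a^{[l]}) = (n_H^{[l]}, n_W^{[l]}, n_C^{[l]}),$$

with

$$n_{H/W}^{[l]} = \left\lfloor \frac{n_{H/W}^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \right\rfloor; \quad s > 0,$$

$$n_{H/W}^{[l]} = n_{H/W}^{[l-1]} + 2p^{[l]} - f^{[l]}; \quad s = 0,$$

$$n_C^{[l]} = \text{number of filters}.$$

The learned parameters at the lth layer are:

(i) Filters with $(f^{[l]} \times f^{[l]} \times n_C^{[l-1]}) \times n_C^{[l]}$ parameters (1).
(ii) Bias with $(1 \times 1 \times 1) \times n_C^{[l]}$ parameters (broadcasting) (2).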
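The output size and the parameter count of a convolution layer can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the example layer (a 32 x 32 x 3 input with 16 filters of size 5 x 5) are assumptions, not values from this study, and it assumes square filters and a stride of at least 1.

```python
def conv_output_size(n_in, f, p=0, s=1):
    """Spatial output size floor((n_in + 2p - f) / s) + 1 for one dimension.

    Assumes an integer stride s >= 1 (an assumption of this sketch).
    """
    if s < 1:
        raise ValueError("stride must be a positive integer")
    return (n_in + 2 * p - f) // s + 1


def conv_param_count(f, c_in, c_out):
    """Learned parameters: f*f*c_in weights per filter, plus one bias per filter."""
    weights = f * f * c_in * c_out
    biases = c_out
    return weights + biases


# Hypothetical layer: 32x32x3 input, 16 filters of size 5x5, no padding, stride 1.
out_h = conv_output_size(32, f=5, p=0, s=1)   # 28
n_params = conv_param_count(f=5, c_in=3, c_out=16)  # 1200 weights + 16 biases = 1216
print(out_h, n_params)
```

With "same"-style padding (here p = 2 for a 5 x 5 filter and stride 1) the same helper returns the input size unchanged, which is a quick sanity check on the formula.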

The convolution layer can be summed up as a graph given in Figure 4.
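As a rough illustration of how a kernel slides over receptive fields, the following sketch implements a single-channel convolution in plain Python. It is a minimal sketch under stated assumptions: one input channel, one square kernel, no padding, ReLU as the activation, and the cross-correlation form (no kernel flip) that CNN implementations commonly use. The function name and the example values are hypothetical.

```python
def conv2d_single(image, kernel, bias=0.0, stride=1):
    """Slide a square kernel over a 2D image (lists of lists), no padding.

    Each output value is ReLU(weighted sum over one receptive field + bias).
    """
    n_h, n_w = len(image), len(image[0])
    f = len(kernel)  # square f x f kernel assumed
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    out = []
    for x in range(out_h):
        row = []
        for y in range(out_w):
            # Receptive field: f x f patch with top-left corner (x*stride, y*stride).
            total = bias
            for i in range(f):
                for j in range(f):
                    total += kernel[i][j] * image[x * stride + i][y * stride + j]
            row.append(max(0.0, total))  # ReLU activation (assumed)
        out.append(row)
    return out


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d_single(image, kernel)
print(feature_map)  # [[0.0, 0.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 0.0]]
```

A 4 x 4 input with a 2 x 2 kernel and stride 1 yields a 3 x 3 feature map, in line with the output-size relation for convolution layers.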

3.2. Pooling Layer

After the convolution operation, the next layer in the CNN architecture is pooling. This layer performs downsampling [9]. Its task is to downscale the feature information collected from the convolution layer while keeping the essential information. For the pooling layer, the following notation is used:

Input: $a^{[l-1]}$ with size $(n_H^{[l-1]}, n_W^{[l-1]}, n_C^{[l-1]})$, $a^{[0]}$ being the image input.
Padding: $p^{[l]}$; stride: $s^{[l]}$.
Size of pooling filter: $f^{[l]}$.
Pooling function: $\phi^{[l]}$.
Output: $a^{[l]}$ with size $(n_H^{[l]}, n_W^{[l]}, n_C^{[l]} = n_C^{[l-1]})$ (3).

The pooling operation can be understood by Figure 5.
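The downsampling performed by pooling can be sketched in a few lines. The example below assumes max pooling as the pooling function, a 2 x 2 window, stride 2, and no padding. These are common choices but are assumptions here, as are the function name and the sample feature map.

```python
def max_pool2d(feature_map, f=2, stride=2):
    """Max pooling over a 2D feature map (list of lists), no padding."""
    n_h, n_w = len(feature_map), len(feature_map[0])
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    return [
        [
            # Keep only the largest value inside each f x f window.
            max(
                feature_map[x * stride + i][y * stride + j]
                for i in range(f)
                for j in range(f)
            )
            for y in range(out_w)
        ]
        for x in range(out_h)
    ]


fm = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 8, 3],
]
pooled = max_pool2d(fm)
print(pooled)  # [[6, 2], [2, 8]]
```

Pooling is applied to each channel independently, so the number of channels is unchanged while the height and width shrink. The layer has no learned parameters.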

3.3. Fully Connected Layer

Fully connected layers are an essential element of convolutional neural networks (CNNs), which have proved very effective in image classification [10]. A fully connected layer takes a finite number of neurons as input and maps them to the relevant classes [11]. The mathematical representation of the fully connected layer is described below.

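Considering the jth node of the ith layer, we have:

$$z_j^{[i]} = \sum_{l=1}^{n_{i-1}} w_{j,l}^{[i]} \, a_l^{[i-1]} + b_j^{[i]}.$$

The input $a^{[i-1]}$ might be the result of a convolution or pooling operation with dimensions $(n_H^{[i-1]}, n_W^{[i-1]}, n_C^{[i-1]})$.

For plugging into a fully connected layer, we need to flatten the tensor to a 1D vector having the dimension $(n_H^{[i-1]} \times n_W^{[i-1]} \times n_C^{[i-1]}, 1)$, so that $n_{i-1} = n_H^{[i-1]} \times n_W^{[i-1]} \times n_C^{[i-1]}$.

The learned parameters at this layer are:

Weights: $w_{j,l}$ with $n_{i-1} \times n_i$ parameters.

Bias: $b_j$ with $n_i$ parameters.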

Figure 6 represents the complete working of the fully connected layer.
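The flatten-then-connect step can be illustrated with a small sketch. The code below assumes a height x width x channel layout, ReLU as the activation, and hand-picked weights. All names and values are hypothetical and serve only to show the weighted sum per output node and the resulting parameter count.

```python
def flatten(volume):
    """Flatten an (H, W, C) nested list into a 1D list of length H*W*C."""
    return [v for row in volume for pixel in row for v in pixel]


def dense_forward(a_prev, weights, biases):
    """For each output node j: ReLU(sum_l weights[j][l] * a_prev[l] + biases[j])."""
    out = []
    for w_j, b_j in zip(weights, biases):
        z_j = sum(w * a for w, a in zip(w_j, a_prev)) + b_j
        out.append(max(0.0, z_j))  # ReLU activation (assumed)
    return out


# Hypothetical 2 x 2 x 2 output of a pooling layer.
volume = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[3.0, 0.0], [1.0, 1.0]],
]
a_prev = flatten(volume)        # 8 values
weights = [
    [0.5] * 8,                  # weights of output node 0
    [-1.0] * 8,                 # weights of output node 1
]
biases = [0.0, 1.0]

activations = dense_forward(a_prev, weights, biases)
n_params = len(weights) * len(a_prev) + len(biases)
print(activations, n_params)    # [4.5, 0.0] 18
```

With 8 inputs and 2 output nodes the layer holds 8 x 2 weights and 2 biases, which shows why fully connected layers tend to dominate the parameter count when the flattened input is large.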

3.4. Loss Function

In a CNN, the loss function is considered one of the most important components. The loss is also known as the error of the network, and the method by which it is calculated is called the loss function. Loss functions are used to calculate the gradients, and the gradients are in turn used to update the weights of the neural network. Mean square error, binary cross-entropy, categorical cross-entropy, and sparse categorical cross-entropy are some common loss functions.
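The four loss functions named above can be written down compactly. The sketch below uses their textbook definitions for a single sample (or a short list of scalar targets); the function names, the clipping constant eps, and the sample values are assumptions made for illustration.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Targets are 0 or 1; predictions are probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def categorical_cross_entropy(y_true_one_hot, y_pred, eps=1e-12):
    """Target is a one-hot vector; prediction is a probability vector."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true_one_hot, y_pred))


def sparse_categorical_cross_entropy(class_index, y_pred, eps=1e-12):
    """Same loss as above, but the target is given as an integer class index."""
    return -math.log(max(y_pred[class_index], eps))


probs = [0.2, 0.7, 0.1]  # hypothetical softmax output for three classes
print(mean_squared_error([1.0, 0.0], [0.5, 0.5]))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
print(categorical_cross_entropy([0, 1, 0], probs))
print(sparse_categorical_cross_entropy(1, probs))
```

The categorical and sparse categorical variants return the same value for the same sample; they differ only in how the target label is encoded.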

3.5. Architectural Evolution

CNNs are considered one of the best and most widely used biologically inspired techniques [12]. Their origin lies in neurobiological studies, which provided the basis for several cognitive models, all of which have since been replaced by CNNs. Several researchers have made efforts to improve CNN performance [13], and many have focused on the architectural evolution of CNN. Table 1 shows the architectural development of CNN. The main reason behind this focus is to improve the performance of CNN in terms of accuracy, training time, and miss rate. However, architectures are still developed manually, and automating their development remains an open gap.

4. Origin of Convolutional Neural Networks

Convolutional neural networks have been applied in practice since the late 1980s. The first multilayered CNN architecture, ConvNet, was introduced by LeCun et al. LeCun proposed supervised training of ConvNet with the backpropagation algorithm, in contrast to the unsupervised reinforcement learning used by its predecessor, the Neocognitron [14–17]. LeCun thereby laid the foundation of the modern 2D CNN. ConvNet showed promising results in handwritten digit and zip code recognition [18]. In 1998, ConvNet was improved and became known in the neural network family as LeNet-5, and it began to be applied to the classification of characters in document recognition. In the early 1990s, CNNs gained prominence owing to their promising results in fingerprint recognition. Because of this capability, banks and ATMs started using them for fingerprint recognition. The major drawback of LeNet-5 is that it does not perform well on image processing problems.

5. Comparative Analysis of CNN Architecture

Over the last several years, the application of CNNs to tasks such as image classification, image recognition, and speech recognition has increased [19]. Researchers have proposed several CNN architectures for specific applications. Table 2 presents brief information about each CNN architecture.

6. Impact of Hyperparameters

CNN has outstanding performance in several tasks, but designing the CNN architecture is still challenging. The design depends on choosing the best set of hyperparameters, such as the number of convolution layers, the type of pooling, the type of activation function, and the number of fully connected layers in the architecture. Recently proposed architectures are much deeper and more complex; they require thousands to millions of parameters to be trained for improved performance and therefore need high-performance machines and plenty of time. The hyperparameters of a CNN are usually tuned by manual search, grid search, or random search, all of which are very time consuming and require GPUs for processing. Researchers are now focused on finding better ways of tuning hyperparameters. Several researchers have applied particle swarm optimization, genetic algorithms, the search and rescue algorithm, and similar methods, but there is still room for improvement.
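The difference between grid search and random search can be sketched as follows. This is a toy illustration, not a tuning procedure used in this study: the search space is hypothetical, and the evaluate function is a made-up placeholder that stands in for training a CNN and measuring its validation accuracy, which is the expensive step in practice.

```python
import itertools
import random

# Hypothetical search space over structural hyperparameters.
search_space = {
    "num_conv_layers": [2, 3, 4],
    "pooling": ["max", "average"],
    "activation": ["relu", "tanh"],
    "num_fc_layers": [1, 2],
}


def evaluate(config):
    """Placeholder objective (assumption): returns a made-up score.

    In a real setting this would train a CNN built from `config`
    and return its validation accuracy.
    """
    score = 0.5 + 0.05 * config["num_conv_layers"]
    score += 0.03 if config["pooling"] == "max" else 0.0
    score += 0.02 if config["activation"] == "relu" else 0.0
    score -= 0.01 * config["num_fc_layers"]
    return score


def grid_search(space):
    """Evaluate every combination in the space and return the best one."""
    keys = list(space)
    candidates = (
        dict(zip(keys, values))
        for values in itertools.product(*(space[k] for k in keys))
    )
    return max(candidates, key=evaluate)


def random_search(space, n_trials, seed=0):
    """Evaluate n_trials randomly sampled configurations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)
    ]
    return max(candidates, key=evaluate)


best_grid = grid_search(search_space)            # 3 * 2 * 2 * 2 = 24 evaluations
best_random = random_search(search_space, n_trials=8)  # 8 evaluations
print(best_grid, evaluate(best_grid))
print(best_random, evaluate(best_random))
```

Grid search guarantees the best configuration within the grid but its cost grows multiplicatively with every added hyperparameter, whereas random search fixes the budget in advance at the price of possibly missing the best combination.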

7. Challenges of CNN

Convolutional neural networks (CNNs) have performed extraordinarily in image processing and several other vision-related tasks [34]. However, CNNs have some issues and limitations that need to be addressed. CNNs are based on a supervised learning mechanism and therefore need a large amount of labeled data for training, which is sometimes quite challenging to obtain [35]. The selection of hyperparameters has a significant impact on the performance of a CNN, and even a minor change in a hyperparameter value may affect its overall performance [36]. Careful selection of parameters is therefore a significant design issue that needs to be addressed through a suitable optimization strategy [37]. Powerful hardware such as GPUs is required for training CNNs, and there is still a gap in implementing them on smart devices [38]. Architecture-wise limitations and benefits are briefly discussed in Table 3.

8. Future Directions

As discussed in the sections above, creating a suitable convolutional neural network architecture depends on the number of convolution layers, the number of pooling layers, the number of filters, the filter size, the stride, and the placement of the pooling layers. All these parameters strongly affect classification performance in terms of accuracy, miss rate, precision, and recall. Suitable parameter selection is currently handcrafted, which takes a lot of time and requires high computational power, such as GPUs, for repeatedly training and testing combinations of parameters. In the future, we plan to develop an algorithm based on swarm intelligence for selecting the structural parameters automatically.

9. Conclusion

A convolutional neural network is considered one of the best techniques for vision-related tasks. Researchers have contributed a great deal over the last several years. Multiple CNN architectures have been proposed in response to application needs and to issues in existing CNN architectures. The improvements in CNN can be classified into activation, loss function, optimization, regularization, learning algorithms, and architectural advances. This work examines current advancements in CNN architectures, focusing on processing unit design trends, and proposes a taxonomy for contemporary CNN designs. In addition to categorizing CNNs into several classes, this article discusses the history of CNNs, their uses, problems, and prospects.

Through greater depth and other structural improvements, CNN’s learning ability has improved dramatically over time. In recent research, the greatest gain in CNN performance has come from substituting the usual layer structure with blocks. The creation of novel and effective block designs is now one of the main themes of study in CNN architectures. A block in a network can play the role of an auxiliary learner. To increase performance, these auxiliary learners may use spatial or feature-map information or even boost input channels. By enabling problem-aware learning, these blocks significantly improve CNN performance.

Data Availability

The dataset used to support the findings of this study is available from the corresponding authors upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.
