Comparative Analysis of Deep Neural Network Architectures for Visual Recognition in Autonomous Transport Systems

This paper presents an experimental investigation of the efficiency of modern object recognition models for computer vision systems of robotic complexes. The applicability of transformers to the classification problems under study is investigated. The comparison results are presented taking into account various limitations specific to robotics. Based on the results of these studies, recommendations on the use of the models for the marine vessel classification problem are proposed.


Introduction
Obstacle avoidance is one of the most important tasks of autonomous robots. It is also important to choose a collision avoidance route that takes the applicable rules and restrictions into account. In maritime navigation, such rules are the COLREGs, which rely on visual identification of obstacles and vessel classes.
Deep neural networks have shown their efficiency in different domains [1,2,3]. Several approaches [4,5], including those based on deep neural networks [6], have been used to recognize vessel classes. In recent years, various state-of-the-art (SOTA) approaches have emerged, each efficient in terms of speed or accuracy.
For example, the family of MobileNet models [7], thanks to depth-wise separable convolutions, can significantly reduce computational requirements without a significant loss of accuracy.
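The saving comes from factoring a standard convolution into a per-channel spatial filter followed by a 1x1 channel mixer. A minimal sketch of the parameter-count arithmetic (the kernel size and channel counts below are illustrative, not taken from the paper):

```python
# Parameter count: standard convolution vs depth-wise separable convolution.
# k - kernel size, c_in / c_out - input / output channel counts.

def standard_conv_params(k, c_in, c_out):
    # A standard convolution mixes space and channels in a single kernel.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depth-wise step: one k x k filter per input channel;
    # point-wise step: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)   # 73728 parameters
sep = separable_conv_params(k, c_in, c_out)  # 8768 parameters
print(std, sep, round(std / sep, 1))         # roughly an 8x reduction
```

For a 3x3 kernel the factorization cuts the parameter (and multiply-accumulate) count by roughly an order of magnitude, which is the source of MobileNet's efficiency.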
EfficientNet networks [8] were obtained with a method of automatic architecture design that jointly scales the depth and width (number of channels) of the network. EfficientNet Lite models can replace MobileNet in some tasks.
At the same time, it is promising to use transformers with attention blocks, which have revolutionized natural language processing (NLP). Vision Transformers (ViT) [9] can achieve better results than most modern CNNs on a variety of image recognition datasets while using significantly fewer computational resources.
However, it should be noted that most tests are carried out on fairly "clean" datasets such as MNIST or CIFAR, which are often far from real conditions.
Many research groups are working on the development of algorithms for recognizing surface objects from their images.
The authors of [10] propose to classify ships using Inception and ResNet. Another paper [11] adopts data augmentation and fine-tuning to further improve and optimize a baseline VGG16 model; the proposed model attains an average classification accuracy of 97.08%, compared to 88.54% for the baseline model.
The aim of this work is a comparative analysis of modern architectures and SOTA models for visual recognition applied to the problem of marine vessel classification from a USV camera.

Problem statement
It is necessary to train the most advanced models for visual image classification and evaluate them in terms of accuracy, inference speed, and model size.
A mathematical formulation of the problem of object recognition in the frames of a video stream is given in [12,13]: given a training set for supervised deep learning of a neural network, it is required to solve the pattern recognition problem, that is, to detect patterns in the form of feature estimates by applying a neural network that realizes a feature-extraction mapping, and to classify them using a classification mapping, according to a given criterion minimizing the classification error probability.
For training and testing the classifiers, a dataset [14] containing 40552 images of 22 vessel classes was assembled.

Exploratory Data Analysis
Basic Exploratory Data Analysis (EDA) was carried out for this dataset. Figure 1 shows the distribution of the number of images by class, used to identify class imbalance. Note that the classes are approximately balanced, each containing about 2000 images, so we can proceed to the preprocessing steps and build the models used in this comparison. Figure 2 shows samples of each class. Using a pretrained VGG19 neural network, the features of each image were extracted and subsequently clustered.
Cross-validation was used to select the optimal number of clusters; the experiments swept values from 25 to 250 with a step of 10.
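The sweep over cluster counts can be sketched as follows, assuming scikit-learn. The paper selects the cluster count by cross-validation; the sketch below uses silhouette score as one concrete selection criterion, and random vectors stand in for the VGG19 features (the sweep range is shrunk to keep the example small):

```python
# Sweep candidate cluster counts and keep the best-scoring one.
# `features` is a placeholder for VGG19 embeddings of the dataset images.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 64))  # stand-in for real VGG19 features

best_k, best_score = None, -1.0
for k in range(5, 30, 5):  # the paper sweeps 25..250 with a step of 10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    score = silhouette_score(features, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k)
```

On real features the sweep would run over the 25..250 range stated above; the structure of the loop is unchanged.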

Model training for vessel image classification
When developing an algorithm for surface object recognition, we studied various approaches to the algorithm design. To enlarge the dataset, we used augmentation with mirror reflection, rotation, and image scaling. The MobileNet and GhostNet architectures are specified in [6,14]. Using MobileNetV2, we achieved 96.03% on the Top-5 Accuracy metric.
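The augmentations mentioned (mirror reflection, rotation, scaling) can be sketched in plain NumPy; a production pipeline would normally use a library such as tf.keras preprocessing layers or albumentations, and the down-scaling shown here is a deliberately crude stand-in:

```python
# Minimal augmentation sketch: mirror, rotation, and scaling variants
# of a single H x W x C image array.
import numpy as np

def augment(image):
    """Yield simple variants of an image for dataset enlargement."""
    yield image                     # original
    yield np.fliplr(image)          # mirror reflection
    yield np.rot90(image, k=1)      # 90-degree rotation
    yield image[::2, ::2]           # crude 2x down-scaling by striding

img = np.zeros((32, 32, 3), dtype=np.uint8)
variants = list(augment(img))
print(len(variants), variants[3].shape)
```

Each source image thus yields several training samples; in practice rotations by arbitrary angles and proper interpolation-based rescaling would be used.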
However, recent research on neural networks for embedded devices, such as the Nvidia Jetson, has proposed more efficient architectures and approaches. In particular, the GhostNet architecture achieves higher recognition performance (for example, 75.7% top-1 accuracy) than MobileNetV3 at similar computational cost on the ImageNet ILSVRC-2012 classification dataset.
In this model, Ghost blocks are used, which compute only part of the feature maps with ordinary convolutions and generate the rest with cheap operations, maintaining performance at lower cost.
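A rough parameter-count sketch of the Ghost idea: generate c_out/s "intrinsic" maps with a regular convolution, then derive s-1 "ghost" maps per intrinsic map with cheap depth-wise operations. The kernel sizes and channel counts below are illustrative, not the paper's:

```python
# Parameter count of a Ghost module vs a plain convolution.
# s - expansion ratio, d - kernel size of the cheap depth-wise ops.
def ghost_params(k, c_in, c_out, s=2, d=3):
    intrinsic = c_out // s
    primary = k * k * c_in * intrinsic   # ordinary convolution
    cheap = d * d * intrinsic * (s - 1)  # depth-wise "ghost" generation
    return primary + cheap

k, c_in, c_out = 1, 64, 128
plain = k * k * c_in * c_out             # 8192 parameters
ghost = ghost_params(k, c_in, c_out)     # 4096 + 576 = 4672 parameters
print(plain, ghost)
```

With s=2 roughly half of the expensive convolution is replaced by cheap per-channel filters, which is where the speedup over a plain layer comes from.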
Using GhostNet, the loss reached 0.11 and Top-5 Accuracy reached 97% over 120 epochs. Precision and Accuracy on the testing sample were 78% and 78.3%, respectively.
Further development of reinforcement learning (RL) led to the Neural Architecture Search (NAS) method. NAS has been used to design networks that match or exceed the performance of hand-crafted architectures.
EfficientNet is a class of models that resulted from studying model scaling and balancing the depth, width (number of channels), and input image resolution of the network. The authors of [8] propose a compound scaling method that uniformly scales depth, width, and resolution with fixed proportions between them (Figure 5).
To improve performance, the researchers automatically selected the initial architecture (Figure 6) using AutoML methods. In this way EfficientNet-B1 to EfficientNet-B7 were built, with the integer at the end of the name indicating the value of the compound coefficient.
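The compound scaling rule can be written out explicitly: depth, width, and resolution grow as powers of a single coefficient phi. The base constants alpha=1.2, beta=1.1, gamma=1.15 are those reported for EfficientNet in [8], chosen so that alpha * beta^2 * gamma^2 is approximately 2 (i.e., each increment of phi roughly doubles FLOPs):

```python
# Compound scaling sketch after [8]: depth ~ alpha^phi, width ~ beta^phi,
# resolution ~ gamma^phi, all driven by one coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

d, w, r = compound_scale(2)  # roughly the B2 point in the family
print(round(d, 2), round(w, 2), round(r, 2))
```

Raising phi from 1 to 7 traces out the B1..B7 family mentioned above, with all three dimensions scaled in fixed proportion rather than tuned independently.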
The advantage of CNNs is that they avoid the need for hand-crafted features, instead learning to solve tasks directly from the data, end to end. However, CNNs have a number of problems and disadvantages:
• a CNN does not take into account the location of features relative to each other;
• during pooling, information about the exact location of an object in the image is lost.
Most models used for pattern recognition tasks are based on convolutional operations. However, [9] introduces transformers for image classification. While a CNN operates on pixel arrays, ViT divides the image into visual tokens. The input sequence consists of flattened (2D to 1D) vectors of the patch pixel values. Each patch is treated like a word in a sentence in an NLP problem. Classification is done using an MLP as the last layer. Figure 7 shows the complete ViT scheme from the authors' article [9].
Figure 7. Vision Transformers Architecture.
We chose 6x6 as the patch size. Figure 8 shows the visualization of splitting an image into patches.
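The patch-splitting step can be sketched with a NumPy reshape; the 6x6 patch size matches the choice above, and the 36x36 input below is an illustrative size, not the network's actual input resolution:

```python
# Split an H x W x C image into non-overlapping p x p patches and
# flatten each patch into a 1D token vector, as ViT does.
import numpy as np

def image_to_patches(image, p):
    h, w, c = image.shape
    assert h % p == 0 and w % p == 0, "image must tile evenly into patches"
    # (h/p, p, w/p, p, c) -> (h/p, w/p, p, p, c) -> (n_patches, p*p*c)
    patches = image.reshape(h // p, p, w // p, p, c).swapaxes(1, 2)
    return patches.reshape(-1, p * p * c)

img = np.arange(36 * 36 * 3, dtype=np.float32).reshape(36, 36, 3)
tokens = image_to_patches(img, 6)
print(tokens.shape)  # (36, 108): a 6x6 grid of patches, each 6*6*3 values
```

Each row of the result is one "visual word"; in the full model these vectors are linearly projected and combined with position embeddings before entering the transformer encoder.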

Results and Discussion
Using a testing sample obtained from the dataset, an experiment was carried out on equipment with the following parameters: Intel Core i7-5820K CPU, GeForce GTX 1080 Ti GPU. The efficiency of the described neural networks was compared in terms of standard classification quality metrics.
The testing sample comprises 225711 class-balanced images. The following results were obtained (Table 1) for the metrics: Precision, Recall, and F1-score by class; Accuracy and AUC overall; and the image processing speed in seconds.
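The per-class and overall metrics reported in Table 1 can be computed as sketched below, assuming scikit-learn; the label vectors here are small illustrative stand-ins, not the paper's predictions:

```python
# Compute macro-averaged Precision / Recall / F1 and overall Accuracy
# from ground-truth and predicted class labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]  # illustrative ground-truth labels
y_pred = [0, 1, 1, 1, 2, 0]  # illustrative model predictions

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_pred)
print(round(acc, 2))
```

Passing `average=None` instead of `"macro"` yields the per-class values used for the class-wise rows of the table.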

Conclusion
In this article, the applicability of widely used neural network architectures for object recognition in robotic systems was investigated. The application of various deep neural network architectures on modern embedded graphics accelerators for the detection and recognition of surface objects is proposed.
It is shown that the use of ViT achieves lower consumption of computational resources, which is especially important for embedded computing devices.
Simulation of the developed algorithms on a hardware prototype of the intelligent vessel control system was carried out. The use of modern SOTA architectures improves both accuracy and performance.
It is shown that networks with additional memory and an attention mechanism are superior to classical approaches, which confirms the applicability of these architectures to the classification problem.
In future work, the influence of noise and various weather conditions on recognition quality should be investigated.