Deep learning models and data-driven intelligent analytics are widely used components of artificial intelligence. Deep learning models discover features through automatic feature or representation learning and process them through artificial neural networks to produce the desired results. There are several base types of deep learning models, such as radial basis function networks (RBFNs), recurrent neural networks (RNNs), generative adversarial networks (GANs), long short-term memory networks (LSTMs), convolutional neural networks (CNNs), self-organizing maps (SOMs), restricted Boltzmann machines (RBMs), autoencoders, and multilayer perceptrons (MLPs). The choice among these types depends on the requirement; for example, the autoencoder is designed to transform input data into a different representation, such as regenerating or reconstructing an image. Similarly, self-organizing maps are designed for high-dimensional data in which the number of features exceeds the number of observations, and they use a winner-takes-all weight-update technique to identify distinctive features in such complex data. Industrial multimedia data, which include hypermedia, hypertext, 2D and 3D graphics, 3D animation, audio, and video, are fragile and complex, and given the variety of base deep learning models, it is difficult to know which type to use for a specific multimedia data problem.
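To make the winner-takes-all idea behind self-organizing maps concrete, here is a minimal Python sketch; the map size, learning rate, and the omission of the neighborhood update are simplifying assumptions for illustration only:

```python
import numpy as np

def som_step(weights, x, lr=0.1):
    """One simplified SOM update: select the best-matching unit
    (the 'winner') and move its weights toward the input x."""
    dists = np.linalg.norm(weights - x, axis=1)    # distance of each unit to x
    winner = int(np.argmin(dists))                 # winner-takes-all selection
    weights[winner] += lr * (x - weights[winner])  # pull only the winner toward x
    return winner

rng = np.random.default_rng(0)
weights = rng.random((4, 3))       # 4 map units, 3 input features (toy sizes)
x = np.array([0.5, 0.5, 0.5])
winner = som_step(weights, x)      # index of the unit that won this step
```

A full SOM also updates the winner's neighbors with a decaying radius; this sketch keeps only the selection step relevant to the discussion above.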

We observed recent research contributions and recognized the need for deep learning models in the industrial multimedia environment. We therefore carefully sought submissions on deep learning models scoped to multimedia data formats. In response, we received numerous submissions, of which thirteen papers were accepted after rigorous review. Below, we summarize these contributions from different parts of the world.

The paper by Tiago do Carmo Nogueira et al. proposes a novel image-captioning approach that uses an encoder–decoder structure to extract features from reference images and gated recurrent units (GRUs) to generate descriptions, with part-of-speech (PoS) analysis used to generate weights. They evaluated their technique on the MS-COCO and Flickr30k datasets, producing more descriptive captions for both predicted and KNN-selected captions.

The paper by Ahmed Barnawi et al. presents a new method of detecting COVID-19 using emergency services such as UAVs. They designed a transfer-learning-based deep CNN architecture to categorize patients into positive, negative, and null (pneumonia) categories. Using the developed model, they evaluated their technique through time-bounded services and achieved 94.92% accuracy.

The paper by Faria Nazir et al. proposes a deep learning model that addresses language pronunciation mistakes through speech-mistake analysis. They divide the problem into phonemic errors (confused phonemes) and prosodic errors (partially modified pronunciation variants of phones). They use a CNN-based clustering technique to identify the faults and categorize phonemes with the K-nearest-neighbors technique, a naïve Bayes classifier, and a support vector machine (SVM). They evaluated the model on an Arabic dataset of 28 individuals and achieved an accuracy of 97%, outperforming traditional models.
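As a hedged illustration of the K-nearest-neighbors step mentioned above, here is a minimal majority-vote classifier; the 2D feature points and labels are toy data, not drawn from the paper's Arabic phoneme dataset:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every sample
    nearest = np.argsort(dists)[:k]               # indices of k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return int(labels[np.argmax(counts)])         # most common label wins

# Toy "phoneme feature" points: class 0 near the origin, class 1 near (1, 1)
X_train = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y_train = np.array([0, 0, 1, 1])
knn_predict(X_train, y_train, np.array([0.05, 0.05]))  # → 0
```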

The paper by Linbo Wang et al. presents a collaborative transformational–spatial clustering model that identifies inliers with two-way proximities. In their technique, a generalized match is first transformed into a collaborative transformational–spatial space; a collaborative kernel density estimator then maps objects to images; finally, matching proximities are refined to improve applicability across different images. Their experiments achieve superior performance on feature-matching tasks, such as multi-object matching, duplicate-object matching, and object retrieval.

The paper by Loveleen Gaur et al. discusses a deep learning model that detects COVID-19 using autonomous deep convolutional neural networks. Using transfer learning on chest X-rays, they evaluate three pre-trained CNN models: EfficientNetB0, VGG16, and InceptionV3. They assess the technique with performance metrics such as accuracy, recall, precision, and F1 score, achieving an overall accuracy of 92.92% with a sensitivity of 94.79%.

The paper by Asma Kausar et al. proposes a deep learning model that automates left-atrium segmentation on magnetic resonance imaging (MRI) to assist diagnosis and treatment in cardiac surgery. They discuss a 3D multi-scale residual-learning-based model that preserves both fine-grained and high-level features throughout the network. They evaluated their model against the award-winning left-atrial-segmentation technique with fewer constraints, and note that no extensive pre-processing of the input data is required for the task.

The paper by Jimmy Ming-Tai Wu et al. proposes a graph-based CNN-LSTM deep learning model that predicts stock prices using leading indicators. They feed a financial time-series dataset into a joint convolutional neural network (CNN) and long short-term memory (LSTM) network and construct sequence arrays of historical data with leading indicators. They evaluated their model on US and Taiwan stock datasets and achieved better results than existing approaches.
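One common way to build such sequence arrays from a historical series is a sliding window; the window length and toy price series below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def make_sequences(series, window=5):
    """Slide a fixed-length window over a series to build
    (samples, window) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])   # past `window` values as input
        y.append(series[i + window])     # next value as prediction target
    return np.array(X), np.array(y)

prices = np.arange(10, dtype=float)      # toy price series 0.0 .. 9.0
X, y = make_sequences(prices, window=3)
# X[0] == [0., 1., 2.] and y[0] == 3.0
```

Each row of `X` would then be fed to the CNN-LSTM as one historical sequence, optionally stacked with leading-indicator columns.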

The paper by Gengsheng Xie et al. presents a re-identification (Re-ID) technique that focuses on a deep metric representation for extracting features from a dataset. They discuss a pose-guided feature region-based fusion network (PFRFN) that uses pose landmarks as local features. They evaluate the technique on several datasets, such as Market-1501, DukeMTMC, and CUHK03, and achieve improvements over traditional models.

The paper by Sumit Pundir et al. proposes an intelligent machine learning model that handles botnet attacks through malware detection in the IoT-enabled industrial multimedia environment. They use four methods to detect malware: naïve Bayes, logistic regression, artificial neural networks (ANNs), and random forests. They evaluate the idea and achieve a 99.5% detection rate with a 0.5% false-positive rate.

The paper by Akshi Kumar presents a model that draws on crowd knowledge to answer how-to questions on Q&A websites. For this, he develops a Siamese neural architecture that extracts similarity-matching features; training is then performed with a multilayer perceptron for prediction, and semantically matched questions are grouped to identify experts. He evaluated the technique by combining the multilayer perceptron with the Manhattan distance function and compared the results with existing models.
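Manhattan-distance Siamese setups commonly score two question embeddings as the exponential of the negative L1 distance, so identical embeddings score 1.0; a minimal sketch (the embedding vectors are made up for illustration and are not from the paper):

```python
import numpy as np

def manhattan_similarity(a, b):
    """Similarity from the Manhattan (L1) distance between two
    embeddings: exp(-|a - b|_1), in (0, 1], with 1.0 for identical inputs."""
    return float(np.exp(-np.abs(a - b).sum()))

q1 = np.array([0.2, 0.7, 0.1])   # embedding of question 1 (illustrative)
q2 = np.array([0.2, 0.7, 0.1])   # a duplicate question
q3 = np.array([0.9, 0.0, 0.4])   # an unrelated question
manhattan_similarity(q1, q2)     # → 1.0
manhattan_similarity(q1, q3)     # lower score for dissimilar embeddings
```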

Another paper, by Ranran Lou et al., proposes a model to protect the ocean environment, predict unknown elements, and support deep-sea resource monitoring. They use a data-driven analytics approach to analyze ocean data, including sound-source identification, element prediction, and physical constraints. They evaluated the model on standard ocean datasets and compared the results with existing approaches.

The paper by Mohib Ullah Khan et al. presents a technique focusing on social media reviews for the restaurant industry. They use a novel convolutional attention-based bidirectional modified LSTM over words, successive sequences, and patterns, with aspect category detection (ACD). They extract features of public reviews as entities and attributes to further develop sequences and patterns. They compare the technique on the SemEval-2015, SemEval-2016, and SentiHood datasets and achieve an average improvement of 79% over traditional models.

Finally, the paper by Celestine Iwendi et al. presents an experimental analysis of four deep learning models, namely recurrent neural networks (RNNs), bidirectional long short-term memory (BLSTM), long short-term memory (LSTM), and gated recurrent units (GRUs), for detecting insults in social media commentary. They develop a pipeline of text cleaning, tokenization, stemming, lemmatization, and stop-word removal, and then perform prediction using the models. They evaluate the deep learning models and report findings in comparison with existing models.
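The preprocessing steps named above can be sketched as follows; the tiny stop-word list and suffix-stripping "stemmer" are deliberately crude stand-ins for real components (a proper stemmer and lemmatizer), kept minimal for illustration:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of"}  # tiny illustrative list

def crude_stem(token):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # cleaning
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [crude_stem(t) for t in tokens]                # stemming

preprocess("The trolls are posting INSULTING comments!!")
# → ['troll', 'post', 'insult', 'comment']
```

The resulting token lists are what would be encoded and fed to the RNN/BLSTM/LSTM/GRU models for prediction.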

We are excited to share these contributions and hope that the research community will find the articles highly interesting and relevant to multimedia-based deep learning models. We thank Editor-in-Chief Prof. Changsheng Xu and the editorial staff, especially Senior Publisher Garth Haller, for their support and collaboration in producing this special issue of the Multimedia Systems Journal.