Rapid advances in multimedia technology have resulted in a proliferation of multimedia data on the Internet. This is reflected in the success of social networks such as Facebook, Twitter, YouTube, Flickr, and Pinterest, which have dramatically increased the volume of community-shared media, including images and videos. Although these websites allow users to annotate and rate media, accurate annotations of online media remain rare and are often unsatisfactory. Thus, accurately understanding this multimedia content is a significant and challenging problem.

Online multimedia understanding usually involves a large number of categories, unconstrained domains, and noise. Recent progress on the Visual Genome dataset and on deep models has opened an exciting new era of knowledge-based multimedia computing, which can provide a knowledge base of images and capture complex content with domain-specific knowledge. Moreover, work on dense image captioning, visual relationship detection, visual question answering, knowledge inference, and social network knowledge graphs also provides insight into knowledge-based multimedia computing. Prior knowledge, experience, and human perceptual theory play a critical role in accurately understanding multimedia in a networked environment.

This special issue aims to provide a forum for original and innovative research that reports the latest results and advances in the field of knowledge-based multimedia computing. It contains 11 papers, selected from 23 initial submissions after two rounds of blind review.

The paper “CrossbowCam: A cost-effective multi-camera system for advanced applications” by Che-Hao Hsu et al. [4] proposes a novel multi-functional, low-cost handheld multi-camera system, “CrossbowCam.” The system is suitable for multi-viewpoint image acquisition, smooth switching, alignment, and seamless stitching applications. With the proposed system, users can press a single button to rapidly change the configuration of the camera array to divergence (convex arc), parallel (linear), or convergence (concave arc). These three camera configurations suit applications such as panorama image stitching, autostereoscopic 3D display, bullet-time (time-freeze) visual effects, and 3D scene reconstruction.

The paper “Deep Learning based Basketball Video Analysis for Intelligent Arena Application” by Wu Liu et al. [7] proposes a deep learning based video analysis scheme for intelligent basketball arena applications. First, with multiple cameras or mobile devices capturing the activities in the arena, the proposed scheme can automatically select the camera that gives a high-quality broadcast in real time. Furthermore, using a deep convolutional neural network based on basketball energy images, scoring clips are detected as highlight video reels to support highlight replay and online sharing functions.

The paper “Hyperspectral Image Compression Based on Online Learning Spectral Features Dictionary” by Worku Jifara et al. [5] proposes a novel method of lossy hyperspectral image compression using an online learning dictionary. From the perspective of sparse coding, learning a sparse dictionary can achieve better data decorrelation. To compress the hyperspectral data, an online learning sparse coding dictionary that describes the characteristics of the spectral curve is created to represent and reconstruct the hyperspectral data. In the online learning phase, an effective clustering algorithm is applied to generate and update the dictionary more appropriately.

The paper “Sparse Representations based Distributed Attribute Learning for Person Re-identification” by Keyang Cheng et al. [1] addresses the person re-identification task and proposes a novel Sparse Representations based Distributed Attribute Learning model (SRDAL) to encode targets into semantic topics. Compared to other models such as ELF, the proposed model performs best by imposing semantic restrictions on the generation of human-specific attributes and by employing a deep convolutional neural network to generate features without supervision for the attribute learning model.

The paper “An Auto-Encoder-Based Summarization Algorithm for Unstructured Videos” by Meng-Xiong Han et al. [2] proposes an auto-encoder-based summarization algorithm for unstructured videos. The structure of each video is detected by an auto-encoder, and both the interestingness and the representativeness of each video segment are predicted from the reconstruction errors of the segment. The most interesting and representative summary is then generated within the limited summary length.

The paper “Efficient PCIe Transmission for Multi-Channel Video Using Dynamic Splicing and Conditional Prefetching” by Tingshan Liu et al. [6] proposes an efficient PCIe transmission method for multi-channel video. First, a dynamic splicing mechanism is introduced to combine the video analysis data and the compressed stream with the raw video, which avoids transmitting the auxiliary data individually. Second, a conditional prefetching mechanism is employed to determine whether any entire video frame exists in other channel buffers. Finally, in the host-side driver, a direct kernel buffer access technique is used to improve the performance of application I/O request packets (IRPs).

The paper “Pedestrian Detection based on Multi-Convolutional Features by Feature Maps Pruning” by Ting Rui et al. [10] proposes a feature map selection method that reduces the number of high-dimensional feature maps in shallow layers. The method prunes feature maps according to the correlation coefficient between kernels and completes detection with a HOG + SVM method. First, the feature maps of the shallow layers of a trained CNN are extracted. Then, by analyzing the correlation coefficients of the kernels, strongly relevant feature maps are merged and all weakly relevant feature maps are retained. Finally, HOG features of the chosen feature maps are extracted, and an SVM is used to complete the training and classification.
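
The merging step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy grouping strategy, the averaging used for merging, and the threshold of 0.9 are all assumptions made for illustration, and kernels and feature maps are represented as flat lists of numbers.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant kernel is treated as uncorrelated
    return cov / sqrt(var_a * var_b)


def group_kernels(kernels, threshold=0.9):
    """Greedily group kernel indices (hypothetical strategy).

    A kernel joins the first group whose first member it correlates with
    above `threshold`; otherwise it starts a new group.
    """
    groups = []
    for i, kernel in enumerate(kernels):
        for group in groups:
            if pearson(kernels[group[0]], kernel) > threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def prune_feature_maps(feature_maps, kernels, threshold=0.9):
    """Average the maps of each strongly correlated group.

    Maps whose kernels are only weakly correlated with the others end up
    in singleton groups and are therefore kept unchanged.
    """
    pruned = []
    for group in group_kernels(kernels, threshold):
        size = len(feature_maps[group[0]])
        merged = [sum(feature_maps[i][p] for i in group) / len(group)
                  for p in range(size)]
        pruned.append(merged)
    return pruned
```

For example, with kernels `[[1, 2, 3, 4], [2, 4, 6, 8], [4, 1, 3, 2]]`, the first two are perfectly correlated and are merged, while the third is kept on its own, so three feature maps are reduced to two. In the pipeline summarized above, HOG features would then be extracted from the reduced set and passed to an SVM.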

The paper “Salient Object Detection via Multiple Saliency Weights” by Weimin Tan et al. [11] proposes a novel bottom-up approach to automatically detect salient objects in an image via multiple visual cues. The key idea is to represent the saliency map of an image as an integration of multiple visual cues, i.e., local contrast weight, superpixel clarity weight, background probability weight, and central bias weight. To obtain the saliency map, the four resulting saliency weights are integrated in a principled way via multiplication- and summation-based fusion. Furthermore, a new superpixel-level saliency smoothing approach is proposed to refine the integrated results and produce clean and consistent saliency maps.
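
As a rough illustration of fusing several per-superpixel weights by multiplication and summation, consider the sketch below. The exact fusion rule of the paper is not given in this editorial, so the formula here (product of the four weights plus their average, followed by min-max normalization) is purely an assumption, as are the function names. Each weight is assumed to lie in [0, 1] and to be oriented so that larger values mean more salient.

```python
def normalize(values):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # flat map: no superpixel stands out
    return [(v - lo) / (hi - lo) for v in values]


def fuse_saliency(local_contrast, clarity, background_prob, central_bias):
    """Fuse four per-superpixel weights into one saliency score each.

    Hypothetical rule: the product term rewards superpixels on which all
    cues agree, while the average term keeps a superpixel from being
    zeroed out by a single weak cue.
    """
    fused = []
    for c, q, b, g in zip(local_contrast, clarity, background_prob, central_bias):
        product = c * q * b * g
        average = (c + q + b + g) / 4.0
        fused.append(product + average)
    return normalize(fused)
```

Each argument is a list with one value per superpixel, and the result is a list of the same length. A smoothing pass over neighboring superpixels, as mentioned above, would follow this step and is not shown here.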

The paper “Cross-media Similarity Metric Learning with Unified Deep Networks” by Jinwei Qi et al. [9] proposes the Unified Network for Cross-media Similarity Metric (UNCSM) to associate cross-media shared representation learning with a distance metric in a unified framework. First, a two-pathway deep network pre-trained with a contrastive loss is designed, and a double triplet similarity loss is employed for fine-tuning to learn the shared representation for each media type by modeling the relative semantic similarity. Second, a metric network is designed to effectively calculate the cross-media similarity of the shared representation by modeling the pairwise similar and dissimilar constraints.

The paper “Cross-lingual event-centered news clustering based on elements semantic correlations of different news” by Xudong Hong et al. [3] proposes a new method based on semantic correlations of news elements to solve the problem of similarity computation between bilingual documents. First, a bilingual entity lexicon and term co-occurrences in news are used to acquire the semantic correlations of news elements in different languages. Then, on this basis, the similarity between news in different languages is computed using the GVSM model. Finally, spectral clustering is applied to categorize news stories.

The paper “Justify Role of Similarity Diffusion Process in Cross-Media Topic Ranking: An Empirical Evaluation” by Junbiao Pang et al. [8] empirically revisits the correlations between the types of noise and the modalities for cross-media topic ranking, in order to provide the insight needed to understand when to choose a given noise. The paper reviews the existing unsupervised ranking methods and compares them under a unified evaluation criterion. The authors evaluate different noises, namely Poisson noise and Gaussian noise, for different modalities, i.e., texts and images.

These 11 papers cover a wide range of topics in knowledge-based multimedia computing, and we hope that readers will find interesting ideas in them. Finally, the guest editorial team would like to thank all the authors for contributing their work to this special issue, and the reviewers for their hard work and constructive comments. We would also like to express our gratitude to Prof. Borko Furht, Editor-in-Chief, for providing the opportunity to organize this special issue, as well as for his helpful guidance during the review process.