Elsevier

Expert Systems with Applications

Volume 106, 15 September 2018, Pages 107-120

Feature evaluation for unsupervised bioacoustic signal segmentation of anuran calls

https://doi.org/10.1016/j.eswa.2018.03.062

Highlights

  • Comparison of acoustic features for unsupervised segmentation of anuran calls.

  • Information Theory Quantifiers evaluation for bioacoustic signals description.

  • Separation of different patterns in temporal series using Permutation Entropy.

  • Simulation and evaluation of the impact of colored noises on signal segmentation.

  • Low-cost method for Wireless Acoustic Sensor Networks.

Abstract

We present a comprehensive study of temporal Low-Level acoustic Descriptors (LLDs) to automatically segment anuran calls in audio streams. Acoustic segmentation, or syllable extraction, is a key task shared by most bioacoustic species recognition systems. Consequently, syllable extraction has a direct impact on the classification rate. In this work, we assess several new entropy measures, including the recently developed Permutation Entropy, Weighted Permutation Entropy, and Permutation Min-Entropy, and compare them to the classical Energy, Zero Crossing Rate and Spectral Entropy. In addition, we propose an algorithm to estimate the optimal segmentation threshold value used to separate deterministic segments from stochastic ones, avoiding the creation of thin clusters. To assess the performance of our segmentation approach, we applied frame-by-frame, point-to-point and event-to-event comparisons. We show that in a scenario with severe noise conditions (SNR ≤ 0 dB), simple entropy descriptors are robust, achieving 97% of segmentation performance, while keeping a low computational cost. We conclude that there is no LLD that is suitable for all scenarios, and we must adopt multiple or different LLDs, depending on the expected noise conditions.

Introduction

The loss of amphibian biodiversity is a worldwide concern. Anurans (frogs and toads) have a close relationship with the environment. By monitoring anuran populations, we can detect ecological stress in its early stages (Carey, Heyer, Wilkinson, Alford, Arntzen, Halliday, et al., 2001, Cole, Bustamante, Reinoso, Funk, 2014, Luque, Romero-Lemos, Carrasco, Barbancho, 2017). The variations in anuran populations can help us understand what is happening in their environment. Most monitoring programs are based on acoustic surveys conducted by a group of experts and collaborators, who move from one place to another while counting the species and individuals (Gibbs, Whiteleather, Schueler, 2005, MacKenzie, Nichols, Hines, Knutson, Franklin, 2003). The full study takes many years and demands a lot of human and economic resources.

One possible solution to mitigate that cost is the development of an automatic method to detect the presence of different anuran species through their calls, without human intervention. In this context, the problem can be addressed by using Wireless Acoustic Sensor Networks (WASNs) (Colonna, Cristo, Nakamura, 2014, Colonna, Ribas, Santos, Nakamura, 2012, Ribas, Colonna, Figueiredo, Nakamura, 2012) and Machine Learning classification techniques to detect the presence of particular species (Brandes, 2008, Colonna, Peet, Ferreira, Jorge, Gomes, Gama, 2016, Somervuo, Harma, Fagerlund, 2006). However, the low cost of this technology results in hardware and software resource constraints, which demand algorithmic solutions of lower computational cost (Nakamura, Loureiro, Boukerche, Zomaya, 2014, Nakamura, Loureiro, Frery, 2007).

In the context of WASNs, the sound acquisition is performed non-intrusively by the sensor nodes, which allows us to monitor the environment over long periods. Replacing the sensor batteries may be too expensive or even unfeasible. Hence, we need to develop efficient methods that minimize the amount of information being processed, transmitted, or recorded by the sensor nodes.

To enable monitoring with WASNs, it is necessary to embed an Automatic Call Recognition (ACR) method into the sensor nodes. A general ACR method for recognizing frog species, based on their calls, is shown in Fig. 1. This method consists of three major processing blocks. The first block performs the acoustic signal segmentation, recognizing the start and end of a minor vocalization unit, named syllable (Huang, Yang, Yang, Chen, 2009, Somervuo, Harma, Fagerlund, 2006). The second block maps the syllable into a feature vector (Brandes, 2008). The last block is a pattern-matching algorithm that considers the input feature vector and a feature set representing all the species included in the reference dataset (Colonna, Cristo, Nakamura, 2014, McIlraith, Card, 1997, Wichern, Xue, Thornburg, Mechtley, Spanias, 2010). Note that the classification step is not covered in this article.

As we can observe, the segmentation block affects the final species recognition rate, i.e., the better the segmentation result, the greater the probability of correct species recognition (Alonso, Cabrera, Shyamnani, Travieso, Bolaños, García, et al., 2017, Colonna, Cristo, Salvatierra, Nakamura, 2015, Jaafar, Ramli, Shahrudin, 2013). Moreover, given the limitations of the sensors, keeping the segmentation method as economical as possible, from the viewpoint of computational complexity, is our major challenge. Therefore, here we provide a solid acoustic descriptor evaluation and comparison for bioacoustic signal segmentation that detects and extracts the syllables of an anuran call in an unsupervised manner.

In real situations, such as rain forests, the scenarios can be complex and present a high acoustic richness, as a result of the interaction of several species at the same place (Depraetere et al., 2012). Therefore, it is not possible to know all possible signal patterns a priori. Hence, we propose a change in the segmentation paradigm: instead of trying to identify different signal patterns, we identify only noise segments. Thus, the remaining segments may be considered syllables (see Section 2). This is possible because in our formulation the segmentation task is equivalent to an unsupervised binary classifier, in which we separate features that belong to segments of either “signal” class or “noise” class. After the segmentation, the final classifier (third block of Fig. 1) is responsible for the species recognition. The impact of the segmentation on the final recognition rate has been studied (Colonna, Cristo, Salvatierra, Nakamura, 2015, Jaafar, Ramli, Shahrudin, 2013, Somervuo, Harma, Fagerlund, 2006), but nothing was reported about individual LLDs applied to segmentation. The classification step is out of scope of this work.

We present an unsupervised segmentation approach. We focus our experiments on using only a reduced set of Low-Level acoustic Descriptors (LLDs) from temporal and spectral domains to cope with the hardware restrictions of low-cost sensors. Moreover, our method is useful for segmenting calls stored in a bioacoustic database in an unsupervised manner. We then analyze the segmentation performance under several noise conditions, including white and colored noises (blue, red, violet and pink). Few authors in the literature discuss the problem of such colored noises, but given the goal of our application, it is essential.
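To make the noise conditions concrete, the following is a minimal sketch of how white and colored noise could be simulated and mixed with a clean call at a chosen SNR. It is not the authors' simulation procedure. It assumes the usual power-law convention (spectral density proportional to 1/f^α, with α = 0 for white, 1 for pink, 2 for red, -1 for blue and -2 for violet) and shapes white Gaussian noise with the impulse response of (1 - z^-1)^(-α/2). The names `colored_noise`, `power` and `mix_at_snr` are hypothetical.

```python
import math
import random

# Assumed spectral exponents: power spectral density ~ 1 / f**alpha.
ALPHA = {"white": 0.0, "pink": 1.0, "red": 2.0, "blue": -1.0, "violet": -2.0}


def colored_noise(n, color, rng):
    """Return n samples of approximate 1/f^alpha noise (unnormalized)."""
    alpha = ALPHA[color]
    # Impulse response of (1 - z^-1)^(-alpha/2), from its binomial series:
    # h[0] = 1, h[k] = h[k-1] * (k - 1 + alpha/2) / k.
    h = [1.0]
    for k in range(1, n):
        h.append(h[-1] * (k - 1 + alpha / 2.0) / k)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Direct convolution, O(n^2): acceptable for a short illustration only.
    return [sum(h[k] * white[i - k] for k in range(i + 1)) for i in range(n)]


def power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so that 10*log10(P_signal / P_noise) equals snr_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(signal, noise)]


rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 0.05 * i) for i in range(200)]  # toy "call"
noisy = mix_at_snr(clean, colored_noise(200, "pink", rng), snr_db=-3.0)
```

With α = 2 the filter reduces to a running sum of the white samples, and with α = -2 to a first difference, which matches the usual descriptions of red and violet noise.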

The contributions of this work are twofold:

  1. a comparative assessment of three unconventional LLDs based on the new Permutation Entropy (PE) methodology and its variants (Weighted Permutation Entropy - WPE and Permutation Min-Entropy - PME), one LLD based on Spectral Entropy (HFFT), and two common temporal LLDs (Energy - E and Zero Crossing Rate - ZCR); and

  2. an algorithm to find the optimal segmentation threshold for the syllables using the descriptors mentioned above.
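As an illustration of the descriptors named above, the sketch below computes three of them for a single frame: Energy, Zero Crossing Rate, and the normalized Permutation Entropy of Bandt and Pompe. The function names, the default embedding order of 3 and the delay of 1 are assumptions made for this example. The section does not state the parameters used in the paper, and WPE, PME and HFFT are not shown.

```python
import math
import random
from collections import Counter


def energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(v * v for v in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def permutation_entropy(frame, order=3, delay=1):
    """Shannon entropy of ordinal patterns, normalized to [0, 1]."""
    span = (order - 1) * delay
    if len(frame) <= span:
        raise ValueError("frame too short for the chosen order and delay")
    patterns = Counter()
    for i in range(len(frame) - span):
        window = [frame[i + j * delay] for j in range(order)]
        # The ordinal pattern is the index order that sorts the window.
        patterns[tuple(sorted(range(order), key=window.__getitem__))] += 1
    total = sum(patterns.values())
    h = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return h / math.log(math.factorial(order))


rng = random.Random(0)
noise_frame = [rng.gauss(0.0, 1.0) for _ in range(2000)]
tone_frame = [math.sin(2.0 * math.pi * i / 100.0) for i in range(2000)]

pe_noise = permutation_entropy(noise_frame)  # stochastic: close to 1
pe_tone = permutation_entropy(tone_frame)    # deterministic: clearly lower
```

The gap between `pe_noise` and `pe_tone` is the property a threshold on PE would exploit to separate stochastic frames from deterministic ones.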

Hence, we perform several evaluations trying to answer why the same features, such as E, ZCR and HFFT, are often used in the literature, even in situations where the noises may have different spectral characteristics. To evaluate the performance of our algorithm to find the best threshold, we compared it against the Otsu and k-Means methods. To the best of our knowledge, this is the first work that applies and compares the new Permutation Entropy methodology, and its variants, to segment bioacoustic signals.
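For reference, the Otsu baseline mentioned above can be sketched for a one-dimensional descriptor as follows. This is not the threshold algorithm proposed in the paper, which is not described in this section. It is an exact, histogram-free form of Otsu's criterion that tests every split between sorted values and keeps the one maximizing the between-class variance. The names `otsu_threshold` and `label_frames`, and the toy values, are hypothetical.

```python
def otsu_threshold(values):
    """Return the split that maximizes between-class variance (Otsu's criterion)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    best_t, best_var = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # no valid cut between equal values
        w0 = i / n                      # weight of the lower class
        w1 = 1.0 - w0                   # weight of the upper class
        mu0 = left_sum / i              # mean of the lower class
        mu1 = (total - left_sum) / (n - i)
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = (xs[i - 1] + xs[i]) / 2.0
    return best_t


def label_frames(values, threshold, signal_below=True):
    """Binary labels: 1 for "signal", 0 for "noise".

    For entropy descriptors, signal frames are assumed to fall below the
    threshold; for Energy the comparison would be reversed.
    """
    if signal_below:
        return [1 if v < threshold else 0 for v in values]
    return [1 if v > threshold else 0 for v in values]


# Toy per-frame entropy values: low during a syllable, high in background noise.
entropy_per_frame = [0.95, 0.92, 0.12, 0.10, 0.11, 0.90, 0.93]
t = otsu_threshold(entropy_per_frame)
labels = label_frames(entropy_per_frame, t)
```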

Finally, the performance was quantified by computing: the Area Under the ROC Curve (AUC), the Acoustic Event Error Rate (AEER), the false positive and false negative rates (FPR and FNR), the F-Score (F1), and the accuracy (Acc). Then, supported by experimental results, we demonstrate that these entropy quantifiers are robust enough for real applications, even considering noise levels below 0 dB, achieving an accuracy above 95%. These results are significant to those who intend to design and implement a non-intrusive environmental monitoring method.

All these evaluations are equally important to obtain the final performance of the segmentation approach. Each metric helps to highlight different aspects of the segmentation. A complete assessment is generally not considered in the related works, in which the quality of the segmentation is frequently evaluated through the classification rate of the species. The problem with this is that the classifier also produces errors that mask the segmentation errors. This makes it difficult to identify the real failures of the whole system.
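The frame-level metrics listed above can be sketched as follows, assuming binary ground-truth and predicted labels per frame (1 for "signal", 0 for "noise") and a per-frame score for the AUC. The function names are hypothetical. AEER is omitted because its definition is not given in this section, and the AUC here is the generic rank-based estimate rather than necessarily the authors' procedure.

```python
def frame_metrics(truth, predicted):
    """FPR, FNR, F1 and accuracy from two equal-length binary label lists."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)

    def ratio(num, den):
        # Return 0.0 when a class is absent instead of dividing by zero.
        return num / den if den else 0.0

    return {
        "FPR": ratio(fp, fp + tn),
        "FNR": ratio(fn, fn + tp),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "Acc": ratio(tp + tn, len(truth)),
    }


def auc(truth, scores):
    """Probability that a signal frame scores higher than a noise frame.

    Ties count as one half. Requires at least one frame of each class.
    """
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = frame_metrics(truth, predicted)
```

Reporting these values directly on the segmentation output, rather than through the species classification rate, keeps segmentation errors from being masked by classifier errors.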

The remainder of this paper is organized as follows. Section 2 defines the segmentation problem of bioacoustic calls. Related works are presented in Section 3. Section 4 describes the set of LLDs assessed in our comparative study. The algorithm we propose to find the optimal threshold value is presented in Section 5. We also show three different performance assessments of the segmentation task in Sections 6.1 (frame-by-frame analysis), 6.2 (event-to-event analysis) and 6.3. Additionally, given that the LLDs are not necessarily correlated, in Section 6.4 we show a ranking of LLDs based on the Information Gain criterion and an evaluation of some LLD combinations. Section 7 discusses which LLDs are the most robust for segmenting anuran calls.

Section snippets

Problem description

The accuracy of species recognition depends on two major factors: (1) the classifier’s ability to separate different signal patterns represented in the feature space; and (2) the quality of the mapping function, which transforms the raw signal segments into discriminating features. The quality of features depends on the mapping function, but also depends on using the correct part of the input signal, which contains more useful information. Thus, the final accuracy of the complete system,

Related work

In bioacoustic monitoring approaches, the recognition task has been discussed and studied extensively. However, the audio segmentation is usually neglected, treated as a secondary task or performed manually (McIlraith, Card, 1997, Strout, Rogan, Seyednezhad, Smart, Bush, Ribeiro, 2017). For instance, Luque et al. (2017) explain that syllable extraction is a highly complex task, especially in the case of noisy recordings, and they proposed an alternative method based on the processing of

Fundamental concepts

This section presents the fundamental knowledge that supports this work.

Experimental methodology

The manual signal segmentation of the database is a crucial step to provide the Ground Truth (GT) and to assess the methods’ performance. We collected the audio of fourteen different frog species with 3155 syllables and 6324 segments, which were manually labeled by a human expert into two classes: “signal” or “background noise” (BN). These recordings were collected in situ in the Amazon rainforest, in geographic areas around the

Results

In the previous section we presented a segmentation example with the proposed method. In this section we analyze a larger set of species. To determine which are the most appropriate features, we used different metrics to evaluate the quality of segmentation: (a) a frame-by-frame metric quantified by receiver operating characteristic (ROC) curves and the AUC values; (b) an event-to-event metric, called Acoustic Event Error Rate (AEER); and (c) a point-to-point metric that accounts for errors by

Conclusions and final comments

In this work, we presented a comprehensive evaluation of different Low-Level acoustic Descriptors (LLDs or features) used for automatically segmenting anuran calls. As an additional contribution, we showed that, depending on the noise pattern, the Permutation Entropy (PE) quantifier and its variants can improve signal segmentation. The idea is to combine the simplicity of the energy models with the robustness of the probabilistic models in an unsupervised manner. Hence, we computed the entropy

Acknowledgments

Juan Colonna acknowledges the National Council of Technological and Scientific Development (CNPq, Brazil), FAPEAM (PROTI) and CAPES for the PhD scholarship. Eduardo Nakamura acknowledges FAPEAM for the support granted through the Anura Project (FAPEAM/CNPq PRONEX 023/2009) and CNPq (process 309471/2015-0). Osvaldo A. Rosso acknowledges the financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina.

References (71)

  • J.J. Noda et al.

    Methodology for automatic bioacoustic classification of anurans based on feature fusion

    Expert Systems with Applications

    (2016)
  • D.L. Rudnick et al.

    Red noise and regime shifts

    Deep Sea Research Part I: Oceanographic Research Papers

    (2003)
  • M. Sokolova et al.

    A systematic analysis of performance measures for classification tasks

    Information Processing & Management

    (2009)
  • T.M. Aide et al.

    Real-time bioacoustics monitoring and automated species identification

    PeerJ

    (2013)
  • C. Bandt et al.

    Permutation entropy: A natural complexity measure for time series

    Physical Review Letters

    (2002)
  • R. Bardeli

    Similarity search in animal sound databases

    IEEE Transactions on Multimedia

    (2009)
  • T.S. Brandes

    Feature vector selection and use with hidden Markov models to identify frequency-modulated bioacoustic signals amidst noise

    IEEE Transactions on Audio, Speech, and Language Processing

    (2008)
  • C. Carey et al.

    Amphibian declines and environmental change: Use of remote-sensing data to identify environmental correlates

    Conservation Biology

    (2001)
  • S.-S. Cheng et al.

    A sequential metric-based audio segmentation method via the Bayesian information criterion

    Proceedings of the European conference on speech communication and technology (INTERSPEECH)

    (2003)
  • W. Chu et al.

    Noise robust bird song detection using syllable pattern-based hidden Markov models

    Proceedings of the international conference on acoustics, speech and signal processing (ICASSP)

    (2011)
  • J.G. Colonna et al.

    A distributed approach for classifying anuran species based on their calls

    Proceedings of the 22nd international conference on pattern recognition (ICPR)

    (2014)
  • J.G. Colonna et al.

    Automatic classification of anuran sounds using convolutional neural networks

    Proceedings of the ninth international c* conference on computer science & software engineering

    (2016)
  • J.G. Colonna et al.

    Feature subset selection for automatically classifying anuran calls using sensor networks

    Proceedings of the international joint conference on neural networks (IJCNN)

    (2012)
  • T.L.F. Evangelista et al.

    Automatic segmentation of audio signals for bird species identification

    Proceedings of the international symposium on multimedia (ISM)

    (2014)
  • B. Fadlallah et al.

    Weighted-permutation entropy: A complexity measure for time series incorporating amplitude information

    Physical Review E

    (2013)
  • T. Finch

    Incremental calculation of weighted mean and variance

    Technical Report

    (2009)
  • J. Foote

    Automatic audio segmentation using a measure of audio novelty

    Proceedings of the international conference on multimedia and expo (ICME)

    (2000)
  • T. Giannakopoulos et al.

    A novel efficient approach for audio segmentation

    Proceedings of the nineteenth international conference on pattern recognition (ICPR)

    (2008)
  • D. Giannoulis et al.

    A database and challenge for acoustic scene classification and event detection

    Proceedings of the European signal processing conference (EUSIPCO)

    (2013)
  • J.P. Gibbs et al.

    Changes in frog and toad populations over 30 years in New York State

    Ecological Applications

    (2005)
  • S. Heinicke et al.

    Assessing the performance of a semi-automated acoustic monitoring system for primates

    Methods in Ecology and Evolution

    (2015)
  • H. Jaafar et al.

    Automatic syllables segmentation for frog identification system

    Proceedings of the ninth international colloquium on signal processing and its applications (CSPA)

    (2013)
  • H. Jaafar et al.

    MFCC based frog identification system in noisy environment

    Proceedings of the international conference on signal and image processing applications (ICSIPA)

    (2013)
  • Kamper, H., Livescu, K., & Goldwater, S. (2017). An embedded segmental k-means model for unsupervised segmentation and...
  • N.J. Kasdin

    Discrete simulation of colored noise and stochastic processes and 1/f^α power law noise generation

    Proceedings of the IEEE

    (1995)