Feature evaluation for unsupervised bioacoustic signal segmentation of anuran calls
Introduction
The loss of amphibian biodiversity is a worldwide concern. Anurans (frogs and toads) have a close relationship with their environment, so by monitoring anuran populations we can detect ecological stress at an early stage (Carey, Heyer, Wilkinson, Alford, Arntzen, Halliday, et al., 2001, Cole, Bustamante, Reinoso, Funk, 2014, Luque, Romero-Lemos, Carrasco, Barbancho, 2017). Variations in anuran populations can help us understand what is happening in their environment. Most monitoring programs are based on acoustic surveys conducted by a group of experts and collaborators, who move from one place to another while counting the species and individuals (Gibbs, Whiteleather, Schueler, 2005, MacKenzie, Nichols, Hines, Knutson, Franklin, 2003). A full study takes many years and demands considerable human and economic resources.
One possible solution to mitigate that cost is the development of an automatic method to detect the presence of different anuran species through their calls, without human intervention. In this context, the problem can be addressed by using Wireless Acoustic Sensor Networks (WASNs) (Colonna, Cristo, Nakamura, 2014, Colonna, Ribas, Santos, Nakamura, 2012, Ribas, Colonna, Figueiredo, Nakamura, 2012) and Machine Learning classification techniques to detect the presence of particular species (Brandes, 2008, Colonna, Peet, Ferreira, Jorge, Gomes, Gama, 2016, Somervuo, Harma, Fagerlund, 2006). However, the low cost of this technology results in hardware and software resource constraints, which demand algorithmic solutions of lower computational cost (Nakamura, Loureiro, Boukerche, Zomaya, 2014, Nakamura, Loureiro, Frery, 2007).
In the context of WASNs, sound acquisition is performed non-intrusively by the sensor nodes, which allows us to monitor the environment over long periods. Replacing the sensor batteries may be too expensive or even unfeasible. Hence, we need to develop efficient methods that minimize the amount of information being processed, transmitted, or recorded by the sensor nodes.
To enable monitoring with WASNs, it is necessary to embed an Automatic Call Recognition (ACR) method into the sensor nodes. A general ACR method for recognizing frog species, based on their calls, is shown in Fig. 1. This method consists of three major processing blocks. The first block performs the acoustic signal segmentation, recognizing the start and end of a minor vocalization unit called a syllable (Huang, Yang, Yang, Chen, 2009, Somervuo, Harma, Fagerlund, 2006). The second block maps the syllable into a feature vector (Brandes, 2008). The last block is a pattern-matching algorithm that compares the input feature vector with a feature set representing all the species included in the reference dataset (Colonna, Cristo, Nakamura, 2014, McIlraith, Card, 1997, Wichern, Xue, Thornburg, Mechtley, Spanias, 2010). Note that the classification step is not covered in this article.
As we can observe, the segmentation block affects the final species recognition rate, i.e., the better the segmentation result, the greater the probability of correct species recognition (Alonso, Cabrera, Shyamnani, Travieso, Bolaños, García, et al., 2017, Colonna, Cristo, Salvatierra, Nakamura, 2015, Jaafar, Ramli, Shahrudin, 2013). Moreover, given the limitations of the sensors, our major challenge is keeping the segmentation method as economical as possible in terms of computational complexity. Therefore, we provide a thorough evaluation and comparison of acoustic descriptors for bioacoustic signal segmentation that detects and extracts the syllables of an anuran call in an unsupervised manner.
In real situations, such as rain forests, the scenarios can be complex and present a high acoustic richness, as a result of the interaction of several species in the same place (Depraetere et al., 2012). Therefore, it is not possible to know all possible signal patterns a priori. Hence, we propose a change in the segmentation paradigm: instead of trying to identify different signal patterns, we identify only noise segments. Thus, the remaining segments may be considered syllables (see Section 2). This is possible because, in our formulation, the segmentation task is equivalent to an unsupervised binary classifier that separates features belonging to segments of either the “signal” class or the “noise” class. After the segmentation, the final classifier (third block of Fig. 1) is responsible for the species recognition. The impact of the segmentation on the final recognition rate has been studied (Colonna, Cristo, Salvatierra, Nakamura, 2015, Jaafar, Ramli, Shahrudin, 2013, Somervuo, Harma, Fagerlund, 2006), but nothing has been reported about individual LLDs applied to segmentation. The classification step is outside the scope of this work.
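The noise-versus-signal formulation can be illustrated with a short Python sketch. It is not the method evaluated in this paper: it assumes non-overlapping frames, uses short-time energy as the only descriptor, and takes a hand-picked threshold. The names `frame_energy` and `segment` and all numeric values are hypothetical.

```python
def frame_energy(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def segment(features, threshold):
    """Mark frames with feature <= threshold as noise and return the
    remaining runs as (start, end) frame indices, end exclusive."""
    segments, start = [], None
    for i, value in enumerate(features):
        is_noise = value <= threshold
        if not is_noise and start is None:
            start = i                      # a candidate syllable begins
        elif is_noise and start is not None:
            segments.append((start, i))    # the candidate syllable ends
            start = None
    if start is not None:                  # signal runs until the last frame
        segments.append((start, len(features)))
    return segments


# Toy recording (assumed values): quiet, loud burst, quiet, weaker burst.
signal = [0.01] * 8 + [1.0, -1.0] * 4 + [0.01] * 8 + [0.5, -0.5] * 2
energies = frame_energy(signal, frame_len=4)
syllables = segment(energies, threshold=0.05)
print(syllables)
```

With these toy values, frames 2 to 3 and frame 6 remain as candidate syllables once the noise frames are discarded. For an entropy-based descriptor the comparison would probably need to be reversed, since noise tends to produce high entropy values.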
We present an unsupervised segmentation approach. We focus our experiments on a reduced set of Low-Level acoustic Descriptors (LLDs) from the temporal and spectral domains to cope with the hardware restrictions of low-cost sensors. Moreover, our method is useful for segmenting calls stored in a bioacoustic database in an unsupervised manner. We then analyze the segmentation performance under several noise conditions, including white and colored noises (blue, red, violet and pink). Few authors in the literature discuss the problem of such colored noises, but addressing it is essential given the goal of our application.
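As an illustration of such noise conditions, the following Python sketch generates white, red and violet noise and mixes noise into a clean signal at a chosen signal-to-noise ratio. It is only a sketch under stated assumptions: red noise is approximated by a running sum of white noise and violet noise by its first difference, pink and blue noise (which need a fractional spectral shaping) are left out, and the function names, the test tone and the -3 dB value are hypothetical rather than taken from the experiments.

```python
import math
import random


def white_noise(n, rng):
    """Independent Gaussian samples: flat power spectrum."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def red_noise(n, rng):
    """Running sum of white noise: power falls roughly as 1/f^2."""
    out, acc = [], 0.0
    for w in white_noise(n, rng):
        acc += w
        out.append(acc)
    return out


def violet_noise(n, rng):
    """First difference of white noise: power grows roughly as f^2."""
    w = white_noise(n + 1, rng)
    return [w[i + 1] - w[i] for i in range(n)]


def power(x):
    """Mean squared amplitude."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Scale the noise so that signal power / noise power matches snr_db."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(signal, noise)]


rng = random.Random(0)                       # fixed seed for repeatability
tone = [math.sin(2 * math.pi * 0.05 * t) for t in range(2000)]
noise = violet_noise(len(tone), rng)
noisy = mix_at_snr(tone, noise, snr_db=-3.0)

# Check the achieved SNR from the added noise component.
residual = [a - b for a, b in zip(noisy, tone)]
achieved_snr_db = 10 * math.log10(power(tone) / power(residual))
print(round(achieved_snr_db, 3))
```

A negative value of `snr_db` means the added noise carries more power than the signal, which is the harder regime for any segmentation descriptor.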
The contributions of this work are twofold:
1. a comparative assessment of three unconventional LLDs based on the new Permutation Entropy (PE) methodology and its variants (Weighted Permutation Entropy - WPE and Permutation Min-Entropy - PME), one LLD based on Spectral Entropy (HFFT), and two common temporal LLDs (Energy - E and Zero Crossing Rate - ZCR); and
2. an algorithm to find the optimal segmentation threshold for the syllables using the descriptors mentioned above.
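To make the entropy-based descriptors concrete, the sketch below computes normalized PE, WPE and PME for a one-dimensional frame in plain Python. It follows the usual textbook definitions (ordinal patterns of `order` samples, variance weights for WPE, the most probable pattern for PME) and is not necessarily the exact formulation or parameter setting used in this paper. The function names, the default `order=3` and `delay=1`, and the test signals are assumptions.

```python
import math
import random
from collections import defaultdict


def _probabilities(x, order, delay, weighted=False):
    """Relative (optionally variance-weighted) frequency of ordinal patterns."""
    totals = defaultdict(float)
    for t in range(len(x) - (order - 1) * delay):
        window = [x[t + k * delay] for k in range(order)]
        # Ordinal pattern: indices that sort the window (ties keep time order).
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        if weighted:
            mean = sum(window) / order
            weight = sum((v - mean) ** 2 for v in window) / order
        else:
            weight = 1.0
        totals[pattern] += weight
    norm = sum(totals.values())
    if norm == 0:          # too short, or constant frame with variance weights
        return []
    return [w / norm for w in totals.values() if w > 0]


def permutation_entropy(x, order=3, delay=1, weighted=False):
    """Shannon entropy of the pattern distribution, normalized to [0, 1]."""
    p = _probabilities(x, order, delay, weighted)
    h = -sum(pi * math.log(pi) for pi in p)
    return h / math.log(math.factorial(order))


def permutation_min_entropy(x, order=3, delay=1):
    """Min-entropy (driven by the most probable pattern), normalized to [0, 1]."""
    p = _probabilities(x, order, delay)
    if not p:
        return 0.0
    return -math.log(max(p)) / math.log(math.factorial(order))


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # noise-like frame
ramp = list(range(50))                               # fully ordered frame

pe_noise = permutation_entropy(noise)
wpe_noise = permutation_entropy(noise, weighted=True)
pme_noise = permutation_min_entropy(noise)
pe_ramp = permutation_entropy(ramp)
```

In this sketch a noise-like frame yields values close to 1 and a perfectly ordered frame yields 0, which is the contrast an entropy-based segmentation relies on.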
Hence, we perform several evaluations to answer why the same features, such as E, ZCR and HFFT, are often used in the literature, even in situations where the noises may have different spectral characteristics. To evaluate the performance of our algorithm for finding the best threshold, we compare it against the Otsu and k-Means methods. To the best of our knowledge, this is the first work that applies and compares the new Permutation Entropy methodology, and its variants, to segment bioacoustic signals.
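For reference, the Otsu baseline mentioned above can be sketched for a list of per-frame descriptor values as follows. This is a generic histogram-based version of Otsu's method (it picks the cut that maximizes the between-class variance), not the threshold algorithm proposed in this paper. The function name, the number of bins and the sample values are assumptions.

```python
def otsu_threshold(values, bins=64):
    """Return the histogram bin edge that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # all values equal: nothing to split
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)   # clamp the maximum value
        hist[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    total_sum = sum(c * h for c, h in zip(centers, hist))

    best_var, best_k = -1.0, 0
    w0, sum0 = 0, 0.0                  # count and weighted sum of the low class
    for k in range(bins - 1):
        w0 += hist[k]
        sum0 += centers[k] * hist[k]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2   # proportional to the variance
        if between > best_var:
            best_var, best_k = between, k
    return lo + (best_k + 1) * width


# Assumed descriptor values: five noise-like frames and four signal-like frames.
features = [0.10, 0.12, 0.09, 0.11, 0.10, 0.90, 0.95, 0.88, 0.92]
threshold = otsu_threshold(features)
labels = [v > threshold for v in features]
```

When the two groups are well separated, as in this toy input, any cut between them gives the same between-class variance, and this sketch keeps the first such cut it finds.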
Finally, the performance was quantified by computing the Area Under the ROC Curve (AUC), the Acoustic Event Error Rate (AEER), the false positive and false negative rates (FPR and FNR), the F-Score (F1), and the accuracy (Acc). Supported by the experimental results, we demonstrate that these entropy quantifiers are robust enough for real applications, even at noise levels below 0 dB, achieving an accuracy above 95%. These results are significant for those who intend to design and implement a non-intrusive environmental monitoring method.
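The frame-level quantities among these metrics can be computed from a ground-truth labelling and a predicted labelling, as in the Python sketch below. It uses the standard definitions of FPR, FNR, F1 and accuracy, and computes the AUC as the probability that a signal frame receives a higher score than a noise frame. The AEER is not covered because its definition is not given in this section. The function names and the example labels and scores are assumptions.

```python
def frame_metrics(truth, pred):
    """FPR, FNR, F1 and accuracy for binary frame labels (1 = signal)."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, pred):
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    return {
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "FNR": fn / (fn + tp) if fn + tp else 0.0,
        "F1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
        "Acc": (tp + tn) / len(truth),
    }


def auc(truth, scores):
    """Probability that a signal frame outscores a noise frame (ties count 1/2)."""
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed example: eight frames, manual labels versus a segmentation output.
truth = [0, 0, 1, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0, 1]
scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.3, 0.7]   # descriptor per frame

metrics = frame_metrics(truth, pred)
area = auc(truth, scores)
```

The pairwise AUC is quadratic in the number of frames, which is acceptable for a small illustration but would likely be replaced by a rank-based computation on long recordings.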
All these evaluations are equally important to obtain the final performance of the segmentation approach. Each metric helps to highlight different aspects of the segmentation. A complete assessment is generally not considered in the related works, in which the quality of the segmentation is frequently evaluated through the classification rate of the species. The problem with this is that the classifier also produces errors that mask the segmentation errors. This makes it difficult to identify the real failures of the whole system.
The remainder of this paper is organized as follows. Section 2 defines the segmentation problem of bioacoustic calls. Related works are presented in Section 3. Section 4 describes the set of LLDs assessed in our comparative study. The algorithm we propose to find the optimal threshold value is presented in Section 5. We also show three different performance assessments of the segmentation task in Sections 6.1 (frame-by-frame analysis), 6.2 (event-to-event analysis) and 6.3. Additionally, given that the LLDs are not necessarily correlated, in Section 6.4 we show a ranking of LLDs based on the Information Gain criterion and an evaluation of some LLD combinations. Section 7 discusses which LLDs are the most robust for segmenting anuran calls.
Problem description
The accuracy of species recognition depends on two major factors: (1) the classifier’s ability to separate different signal patterns represented in the feature space; and (2) the quality of the mapping function, which transforms the raw signal segments into discriminating features. The quality of features depends on the mapping function, but also depends on using the correct part of the input signal, which contains more useful information. Thus, the final accuracy of the complete system,
Related work
In bioacoustic monitoring approaches, the recognition task has been discussed and studied extensively. However, the audio segmentation is usually neglected, treated as a secondary task or performed manually (McIlraith, Card, 1997, Strout, Rogan, Seyednezhad, Smart, Bush, Ribeiro, 2017). For instance, Luque et al. (2017) explain that syllable extraction is a highly complex task, especially in the case of noisy recordings, and they proposed an alternative method based on the processing of
Fundamental concepts
This section presents the fundamental knowledge that supports this work.
Experimental methodology
The manual signal segmentation of the database is a crucial step to provide the Ground Truth (GT) and to assess the methods’ performance. We collected the audio of fourteen different frog species with 3155 syllables and 6324 segments, which were manually labeled by a human expert into two classes: “signal” or “background noise” (BN). These recordings were collected in situ in the Amazon rainforest, in geographic areas around the
Results
In the previous section we presented a segmentation example with the proposed method. In this section we analyze a larger set of species. To determine which are the most appropriate features, we used different metrics to evaluate the quality of segmentation: (a) a frame-by-frame metric quantified by receiver operating characteristic (ROC) curves and the AUC values; (b) an event-to-event metric, called Acoustic Event Error Rate (AEER); and (c) a point-to-point metric that accounts for errors by
Conclusions and final comments
In this work, we presented a comprehensive evaluation of different Low-Level acoustic Descriptors (LLDs or features) used for automatically segmenting anuran calls. As an additional contribution, we showed that, depending on the noise pattern, the Permutation Entropy (PE) quantifier and its variants can improve signal segmentation. The idea is to combine the simplicity of the energy models with the robustness of the probabilistic models in an unsupervised manner. Hence, we computed the entropy
Acknowledgments
Juan Colonna acknowledges the National Council of Technological and Scientific Development (CNPq, Brazil), FAPEAM (PROTI) and CAPES for the PhD scholarship. Eduardo Nakamura acknowledges FAPEAM for the support granted through the Anura Project (FAPEAM/CNPq PRONEX 023/2009) and CNPq (process 309471/2015-0). Osvaldo A. Rosso acknowledges the financial support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina.
References
- et al. Automatic anuran identification using noise removal and audio activity detection. Expert Systems with Applications (2017)
- et al. Is there enough zooplankton to feed forage fish populations off Peru? An acoustic (positive) answer. Progress in Oceanography (2011)
- et al. Evaluation of BIC-based algorithms for audio segmentation. Computer Speech & Language (2005)
- et al. Spatial and temporal variation in population dynamics of Andean frogs: Effects of forest disturbance and evidence for declines. Global Ecology and Conservation (2014)
- et al. An incremental technique for real-time bioacoustic signal segmentation. Expert Systems with Applications (2015)
- et al. Monitoring animal diversity using acoustic indices: Implementation in a temperate woodland. Ecological Indicators (2012)
- et al. Classification of audio events using permutation transformation. Applied Acoustics (2014)
- An introduction to ROC analysis. Pattern Recognition Letters (2006)
- et al. Frog classification using machine learning techniques. Expert Systems with Applications (2009)
- et al. Localized algorithms for information fusion in resource constrained networks. Information Fusion (2014)