Audio segmentation using flattened local trimmed range for ecological acoustic space analysis

ABSTRACT

The acoustic space in a given environment is filled with footprints arising from three processes: biophony, geophony and anthrophony. Bioacoustic research using passive acoustic sensors can result in thousands of recordings. An important component of processing these recordings is to automate signal detection. In this paper, we describe a new spectrogram-based approach for extracting individual acoustic events. Spectrogram-based acoustic event detection (AED) relies on separating the spectrogram into background (i.e., noise) and foreground (i.e., signal) classes using a threshold, such as a global threshold, a per-band threshold, or one given by a classifier. These methods are either too sensitive to noise, designed for an individual species, or require prior training data. Our goal is to develop an algorithm that is not sensitive to noise, does not need any prior training data and works with any type of acoustic event. To do this, we propose: (1) a spectrogram filtering method, the Flattened Local Trimmed Range (FLTR) method, which models the spectrogram as a mixture of stationary and non-stationary energy processes and mitigates the effect of the stationary processes, and (2) an unsupervised algorithm that uses the filter to detect acoustic events. We measured the performance of the algorithm using a set of six thoroughly validated audio recordings and obtained a sensitivity of 94% and a positive predictive value of 89%. These values are very high, given that the validated recordings are diverse and were obtained under field conditions. The algorithm was then used to extract acoustic events in three datasets. Features of these acoustic events were plotted and showed the unique aspects of the three acoustic communities.


Introduction
The acoustic space in a given environment is filled with footprints of activity. These footprints arise as events in the acoustic space from three processes: biophony, or the sounds species make (e.g., calls, stridulation); geophony, or the sounds made by different earth processes (e.g., rain, wind); and anthrophony, or the sounds that arise from human activity (e.g., automobile or airplane traffic) (Krause, 2008). The field of Soundscape Ecology is tasked with understanding and measuring the relation between these processes and their acoustic footprints, as well as the total composition of this acoustic space (Pijanowski et al., 2011). Acoustic environment research depends more and more on data acquired through passive sensors (Blumstein et al., 2011; Aide et al., 2013) because recorders can acquire more data than is possible manually (Parker III, 1991; Catchpole and Slater, 2003; Remsen, 1994), and these data provide better coverage.

Currently, most soundscape analyses focus on computing indices for each recording in a given dataset (Towsey et al., 2014), or on plotting and aggregating the raw acoustic energy (Gage and Axel, 2014). An alternative approach is to use each individual acoustic event as the base data and aggregate features computed from these events, but up until now, it has been difficult to accurately extract individual acoustic events from recordings.

We define an acoustic event as a perceptual difference in the audio signal that is indicative of some activity. While this is a subjective definition, the perceptual difference can be reflected in transformations of the signal (e.g., a dark spot in a recording's spectrogram).

Normally, to use individual acoustic events as base data, a manual acoustic event extraction is performed (Acevedo et al., 2009). This is usually done as a first step to build species classifiers, and it can be done very accurately. By using an audio visualization and annotation tool, an expert is able to draw a boundary around an acoustic event; however, this method is very time-consuming, is specific to a set of acoustic events and does not scale easily to large datasets (e.g., > 1000 minutes of recorded audio), thus an automated detection method could be very useful.

Acoustic event detection (AED) has been used as a first step to build species classifiers for whales, birds and amphibians (Popescu et al., 2013; Neal et al., 2011; Aide et al., 2013). Most AED approaches rely on some sort of thresholding to binarize the spectrogram into background (i.e., noise) and foreground (i.e., signal) classes, using, for example, a global threshold, a per-band threshold, or a classifier to label each cell in the spectrogram as sound or noise. These methods are either too sensitive to noise, are specialized to specific species, require prior training data or require prior knowledge from the user. What is needed is an algorithm that works for any recording, is not targeted to a specific type of acoustic event, does not need any prior training data, is not sensitive to noise, is fast and requires as little user intervention as possible.

In this article we propose a spectrogram filtering method, the Flattened Local Trimmed Range (FLTR) method, and an unsupervised algorithm that uses this filter for detecting acoustic events. The method filters the spectrogram by modeling it as a mixture of stationary and non-stationary energy processes and mitigating the effect of the stationary processes. The detection algorithm applies FLTR to the spectrogram and then thresholds it globally. Afterward, each contiguous region above the threshold is considered an individual acoustic event.

We are interested in automatically detecting all acoustic events in a set of recordings. As such, the method avoids any specificity by design. Because of this, it can work as a form of data reduction: as a first step, it transforms the acoustic data into a set of events that can later feed further analysis.

The presentation of the article follows the workflow in Figure 1: given a set of recordings, we compute the spectrogram for each one, then the FLTR is computed, a global threshold is applied and, finally, we proceed to extract the acoustic events. These acoustic events are compared with manually labeled acoustic events to determine the precision and accuracy of the automated process. We then applied the AED methodology to recordings from three different sites. Features of the events were calculated and plotted to determine unique aspects of each site. Finally, events within a region of high acoustic activity were sampled to determine the sources of these sounds.

Modeling the Spectrogram

We model the spectrogram S_db(t, f) as a sum of different energetic processes:

S_db(t, f) = b(f) + ε(t, f) + Σ_{i=1}^{n} R_i(t, f) I_i(t, f),

where b(f) is a frequency-dependent process that is taken as constant in time, ε(t, f) is a process that is stationary in time and frequency with 0-mean, 0-median and some scale parameter, and R_i(t, f) is one of n localized, non-stationary energy processes with support function I_i(t, f) (with I_i(t, f) = 1 inside the process and 0 elsewhere). In this model, we assume that the set of localized energy processes has four properties:

A1 The localized energy processes are mutually exclusive and are not adjacent. That is, no two localized energy processes share the same (t, f) coordinate, nor do they have adjacent coordinates. Thus, ∀ 1 ≤ i, j ≤ n, i ≠ j, and for all coordinate pairs 0 ≤ t, t′ < τ, 0 ≤ f, f′ < η with |t − t′| ≤ 1 and |f − f′| ≤ 1, we have:

I_i(t, f) · I_j(t′, f′) = 0.

This can be assumed without loss of generality: if two such localized processes existed, we can just consider their union as one.

A2 The regions of localized energy processes dominate the energy distribution in the spectrogram on each given band. That is, ∀ 0 ≤ t1, t2 < τ, 0 ≤ f < η, we have:

Σ_i I_i(t1, f) = 1 and Σ_i I_i(t2, f) = 0 ⟹ S_db(t1, f) > S_db(t2, f).

A3 The proportion of samples within a localized energy process in a given frequency band, denoted as ρ(f), is less than half the samples in the entire frequency band. That is, ∀ 0 ≤ f < η, we have:

ρ(f) = (1/τ) Σ_{t=0}^{τ−1} Σ_{i=1}^{n} I_i(t, f) < 1/2.

A4 Each localized energy process dominates the energy distribution in its surrounding region, when accounting for frequency band-dependent effects. That is, for every (t1, f1) point that falls inside a localized energy process (I_i(t1, f1) = 1 for some 1 ≤ i ≤ n), there is a region-dependent time-based radius r_{i,1} and a frequency-based radius r_{i,2} such that, for every other point (t2, f2) with |t1 − t2| ≤ r_{i,1} and |f1 − f2| ≤ r_{i,2} that falls outside every localized energy process (Σ_j I_j(t2, f2) = 0), we have:

S_db(t1, f1) − b(f1) > S_db(t2, f2) − b(f2).

We want to extract the R_i components or, more specifically, their support functions I_i(t, f), from the spectrogram. If we can estimate b(f) reliably for a given spectrogram, we can then compute

Ŝ(t, f) = S_db(t, f) − b(f),

a spectrogram that is corrected for frequency intensity variations. Once that is done, we compute local statistics to estimate the I_i(t, f) regions and, thus, segregate the spectrogram into the localized energy processes R_i(t, f) and the ε(t, f) background process.
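As a concrete illustration of the model, the following minimal Python sketch builds a synthetic spectrogram from the three components; all shapes, sizes and distributions here are illustrative assumptions of ours, not taken from the paper's data:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, tau = 256, 400                       # frequency bins, time frames

# b(f): frequency-dependent, time-constant component (illustrative shape)
f = np.arange(eta)
b = 20.0 * np.exp(-f / 80.0)              # e.g., a decaying noise floor

# eps(t, f): stationary background process with zero mean and median
eps = rng.normal(0.0, 1.0, size=(eta, tau))

# sum of R_i(t, f) * I_i(t, f): localized, non-stationary energy processes
S_db = b[:, None] + eps
S_db[40:60, 100:140] += 12.0              # event 1: rectangular energy burst
S_db[120:150, 250:300] += 9.0             # event 2
```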

Flattening - Estimating b(f)

Other than A2, A3 and A4, we do not hold any assumptions about ε(t, f) or R_i(t, f). In particular, we do not presume to know their distributions. Thus, it is difficult to formulate a model to compute a maximum likelihood estimate of b(f). The frequency sample means of a given spectrogram do not give a good estimate of b(f), since they get mixed with the sum of non-zero expectations of any intersecting regions:

(1/τ) Σ_t S_db(t, f) = b(f) + (1/τ) Σ_t ε(t, f) + (1/τ) Σ_t Σ_i R_i(t, f) I_i(t, f).

Since ε(t, f) is a stationary 0-mean process, we do not need to worry about it, as it will eventually cancel itself out, but the localized energy process regions do not cancel out. Since our goal is to separate these regions from the rest of the spectrogram in a general manner, if an estimate of b(f) is to be useful, it should not depend on the particular values within these regions.

While the mean does not prove to be useful, we can use the frequency sample medians, along with A2 and A3, to remove any frequency-dependent time-constant bands from the spectrogram. We define m(f) as the sample median of each frequency band:

m(f) = median_t S_db(t, f).

Thus, assuming A2 and A3, m(f) gives an estimator whose bias is limited by the range of the ε process and is completely unaffected by the R_i processes. Furthermore, as ρ(f) approaches 0, m(f) approaches b(f).

We use the term band flattening to refer to the process of subtracting the estimated b(f) component from the spectrogram, Ŝ(t, f) = S_db(t, f) − m(f). Figure 2B shows the output from this flattening procedure. As can be seen, this procedure removes any frequency-dependent time-constant bands in the spectrogram.
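A minimal sketch of this step, assuming a spectrogram array shaped (frequency, time) in dB:

```python
import numpy as np

def flatten_bands(S_db):
    """Band flattening: subtract the per-band median m(f), an estimate of b(f).

    Under A2/A3 the median of each frequency band is unaffected by the
    localized energy processes, so subtracting it removes b(f) up to a
    bias bounded by the range of the background process."""
    m = np.median(S_db, axis=1, keepdims=True)   # m(f), one value per band
    return S_db - m                              # S_hat(t, f)
```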

We can use the band-flattened spectrogram Ŝ(t, f) to further estimate the I_i(t, f) regions, since:

Ŝ(t, f) ≈ ε(t, f) + Σ_{i=1}^{n} R_i(t, f) I_i(t, f).

We do this by computing the local α-trimmed range Ra_α{Ŝ}. That is, given some 0 ≤ α < 50 and some r > 0, for each (t, f) pair we compute:

Ra_α{Ŝ}(t, f) = P_{100−α}(Ŝ_ne_r(t, f)) − P_α(Ŝ_ne_r(t, f)),

where P_α(·) is the α percentile statistic, and Ŝ_ne_r(t, f) is the band-flattened spectrogram with its domain restricted to a square neighborhood of range r (in time and frequency) around the point (t, f).
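A minimal sketch of the local α-trimmed range, expressed as the difference of two running percentile filters (scipy.ndimage.percentile_filter); a neighborhood of range r corresponds to a (2r + 1) × (2r + 1) window:

```python
from scipy.ndimage import percentile_filter

def local_trimmed_range(S_hat, r=10, alpha=5):
    """FLTR statistic: P_{100-alpha} - P_{alpha} over the square
    neighborhood of range r around every (t, f) point."""
    size = 2 * r + 1
    hi = percentile_filter(S_hat, 100 - alpha, size=size)
    lo = percentile_filter(S_hat, alpha, size=size)
    return hi - lo
```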

Assuming A4, the estimator gives small values for neighborhoods without localized energy processes, but peaks around the borders of any such process. This statistic can then be thresholded to compute estimates of these borders and an estimate of the support functions I_i(t, f). Figure 2C shows the result of this computation.

There are many methods that can be used to threshold the resulting FLTR image (Sezgin and Sankur, 2004). Of these, we use the entropy-based method developed by Yen et al. (1995). This method works on the distribution of the values; it defines an entropic correlation TC(t) of the foreground and background classes as:

TC(t) = −ln ∫_m^t (f(x)/F(t))² dx − ln ∫_t^M (f(x)/(1 − F(t)))² dx,

where m and M are the minimum and maximum values of the FLTR spectrogram, and f(·) and F(·) are the probability density function (PDF) and cumulative distribution function (CDF) of these values. The PDF and CDF, in this case, are approximated with a histogram. The Yen threshold is then the value t̂ that maximizes this entropic correlation. That is:

t̂ = argmax_{m ≤ t ≤ M} TC(t).

(A sketch of this computation appears after the data description below.)

Data

To test the FLTR algorithm we used two datasets collected and stored by the ARBIMON system (Sieve Analytics). The first dataset, the validation dataset, consisted of 2051 manually labeled acoustic events from six audio recordings (every acoustic event was labeled) (Vega, 2016d). This set includes a recording from Lagoa do Peri, Brazil; one recording from the Amarakaeri Communal Reserve, Perú; one from El Yunque, Puerto Rico; and three underwater recordings from Mona Island, Puerto Rico.

The second dataset, the sites dataset, was a set of 240 recordings from three different sites, including the Amarakaeri Communal Reserve, Perú, and Sabana Seca, Puerto Rico.
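Returning to the thresholding step: below is a minimal histogram-based sketch of Yen's entropic-correlation threshold, following the discrete form of the formula above. Scikit-image ships an equivalent ready-made implementation, skimage.filters.threshold_yen.

```python
import numpy as np

def yen_threshold(values, bins=256):
    """Yen threshold: maximize the entropic correlation
    TC(t) = -ln sum_{i<=t} (p_i/F(t))^2 - ln sum_{i>t} (p_i/(1-F(t)))^2
    over a histogram approximation of the value distribution."""
    hist, edges = np.histogram(np.ravel(values), bins=bins)
    p = hist / hist.sum()                  # approximate PDF
    F = np.cumsum(p)                       # approximate CDF
    p2 = np.cumsum(p ** 2)                 # running sums of p_i^2
    eps = np.finfo(float).eps              # guards against log(0) and 0/0
    fg = p2[:-1]                           # class at or below candidate t
    bg = p2[-1] - p2[:-1]                  # class above candidate t
    TC = (-np.log(fg / (F[:-1] ** 2 + eps) + eps)
          - np.log(bg / ((1 - F[:-1]) ** 2 + eps) + eps))
    return edges[np.argmax(TC) + 1]        # threshold at the best bin edge
```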

Methodology

We divided our methodology into two main steps. In the first step, FLTR validation, we extracted acoustic events from the recordings in the first dataset, which we then validated against the manually labeled acoustic events. In the second step, site acoustic event visualization, we extracted the acoustic events from the second dataset, computed feature vectors for each event and plotted them. The recording spectrograms were computed using a Hann window function of 512 audio samples and an overlap of 256 samples.
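For reference, a sketch of this spectrogram computation with scipy; the window settings are the ones stated above, while the file name and the dB conversion are our own assumptions:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, audio = wavfile.read("recording.wav")   # hypothetical mono input file
freqs, times, Sxx = spectrogram(audio, fs=fs, window="hann",
                                nperseg=512, noverlap=256)
S_db = 10.0 * np.log10(Sxx + np.finfo(float).eps)   # power spectrogram in dB
```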

We used the FLTR algorithm with a 21 × 21 window and α = 5 to extract the acoustic events and compared them with the manually labeled acoustic events.
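Putting the pieces together, a sketch of the whole detection step with the parameters above (a 21 × 21 window corresponds to r = 10); we use scikit-image's threshold_yen and scipy's connected-component labeling as stand-ins for the thresholding and region-extraction steps:

```python
import numpy as np
from scipy.ndimage import label, find_objects, percentile_filter
from skimage.filters import threshold_yen

def detect_events(S_db, r=10, alpha=5):
    """FLTR detection sketch: flatten, local trimmed range, global Yen
    threshold, then each contiguous region becomes a candidate event."""
    S_hat = S_db - np.median(S_db, axis=1, keepdims=True)    # band flattening
    size = 2 * r + 1                                         # 21 x 21 window
    fltr = (percentile_filter(S_hat, 100 - alpha, size=size)
            - percentile_filter(S_hat, alpha, size=size))    # trimmed range
    mask = fltr > threshold_yen(fltr)                        # global threshold
    labeled, n_events = label(mask)                          # contiguous regions
    # bounding boxes as (f_start, f_stop, t_start, t_stop) index tuples
    return [(s[0].start, s[0].stop, s[1].start, s[1].stop)
            for s in find_objects(labeled)]
```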

For the validation, we used two comparison methods: the first is based on a basic intersection test between the automatically detected and the manually labeled events' bounds, and the second is based on an overlap percent. For each manual label and detected event pair, we defined the overlap percent as the ratio of the area of their intersection to the area of their union:

overlap(L, D) = A(L ∩ D) / A(L ∪ D),

where L is a manually labeled event, D is an automatically detected event, and A(L ∩ D) and A(L ∪ D) are the areas of the intersection and the union of their respective bounds.
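A minimal sketch of this overlap computation on axis-aligned bounding boxes; representing boxes as (t0, t1, f0, f1) tuples is our own convention:

```python
def box_area(box):
    t0, t1, f0, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)

def intersection_area(a, b):
    """Area of the overlap of two boxes; 0 when they are disjoint."""
    inter = (max(a[0], b[0]), min(a[1], b[1]),
             max(a[2], b[2]), min(a[3], b[3]))
    return box_area(inter)

def overlap_percent(L, D):
    """A(L ∩ D) / A(L ∪ D): intersection over union of the bounds."""
    inter = intersection_area(L, D)
    union = box_area(L) + box_area(D) - inter
    return inter / union if union > 0 else 0.0
```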

In the first comparison method, for each acoustic event whose bounds intersected the bounds of a manually labeled acoustic event, we registered it as a detected acoustic event. For each detected event whose bounds did not intersect any manually labeled acoustic event, we registered it as detected, but without an acoustic event. On the other hand, the manually labeled acoustic events that did not intersect any detected acoustic event were registered as undetected acoustic events.

The second method followed a similar path to the first, but required an overlap percent of at least 25%. Each acoustic event whose overlap percent with a manually labeled acoustic event was greater than or equal to 25% was registered as a detected acoustic event; each detected event that did not reach this overlap with any manually labeled acoustic event was registered as detected, but without an acoustic event, and manually labeled events without such an overlap were registered as undetected.
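A sketch of both comparison methods and the resulting counts, reusing the box helpers from the previous sketch; the counting follows the registrations described above and computes the sensitivity and positive predictive value defined next:

```python
def evaluate(labels, detections, min_overlap=None):
    """Confusion counts for one recording.

    min_overlap=None reproduces the plain intersection test; 0.25
    reproduces the 25% overlap-percent test. A manual label counts as a
    true positive when at least one detection matches it; a detection
    matching no label counts as a false positive."""
    if min_overlap is None:
        match = lambda L, D: intersection_area(L, D) > 0
    else:
        match = lambda L, D: overlap_percent(L, D) >= min_overlap
    matched_labels = {i for i, L in enumerate(labels)
                      for D in detections if match(L, D)}
    matched_dets = {j for j, D in enumerate(detections)
                    for L in labels if match(L, D)}
    tp = len(matched_labels)
    fn = len(labels) - tp
    fp = len(detections) - len(matched_dets)
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return sensitivity, ppv
```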

These data were used to create a confusion matrix to compute the FLTR algorithm's sensitivity and positive predictive value for each method. The sensitivity is computed as the ratio of the number of manually labeled acoustic events that were automatically detected (true positives) to the total count of manually labeled acoustic events (true positives plus false negatives):

Sensitivity = True Positives / (True Positives + False Negatives).
This measurement reflects the percent of the acoustic events present in the recording that were detected.

The positive predictive value is computed as the ratio of the number of manually labeled acoustic events that were automatically detected (true positives) to the total count of detected acoustic events (true positives plus false positives):

Positive Predictive Value = True Positives / (True Positives + False Positives).
This measurement reflects the percent of real acoustic events among the set of detected acoustic events.

For the site acoustic event visualization, we computed features for each detected event, including:

dur - Duration. Length of the acoustic event in seconds. That is, given T_i = {t | ∃f, I_i(t, f) = 1}, the duration is defined as:

dur_i = max(T_i) − min(T_i).

y max - Dominant Frequency. The frequency at which the acoustic event attains its maximum power:

ymax_i = argmax_f max_t S_db(t, f) I_i(t, f).

cov - Coverage Ratio. The ratio between the area covered by the detected acoustic event and the area of its bounds:

cov_i = A(I_i) / A(bounds_i).

Using these features, we generated a log-density plot matrix for all pairwise combinations of the features for each site. The plots in the diagonal are log histograms of the column feature. We measured the information content of each feature pair by computing their joint entropy H:

H = − Σ_{i=1}^{N1} Σ_{j=1}^{N2} (h_{i,j} / N) log2(h_{i,j} / N),

where log2 is the base-2 logarithm, h_{i,j} is the number of events in the (i, j)-th bin of the joint histogram, N is the total number of events, and N1 and N2 are the number of bins of each variable in the histogram. A higher value of H means a higher information content.

We also focused our attention on areas with a high and medium count of detected acoustic events (log of detected events greater than 6 and 3.5, respectively). As an example, we selected an area of interest in the feature space of the Sabana Seca dataset (i.e., a visual cluster in the bw vs. y max plot). We sampled 50 detected acoustic events from the area and categorized them manually.
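A sketch of these feature computations and of the joint entropy on a per-event support mask; the array conventions, the seconds-per-frame parameter dt and the histogram bin counts are our assumptions:

```python
import numpy as np

def event_features(S_db, mask, dt, freqs):
    """dur, y max and cov for one event.

    mask: boolean support I_i(t, f), shape (frequency, time);
    dt: seconds per spectrogram frame; freqs: frequency (Hz) per bin."""
    f_idx, t_idx = np.nonzero(mask)
    dur = (t_idx.max() - t_idx.min() + 1) * dt        # duration in seconds
    power = np.where(mask, S_db, -np.inf)             # restrict to the event
    y_max = freqs[np.unravel_index(np.argmax(power), power.shape)[0]]
    bounds = (t_idx.max() - t_idx.min() + 1) * (f_idx.max() - f_idx.min() + 1)
    cov = mask.sum() / bounds                         # coverage ratio
    return dur, y_max, cov

def joint_entropy(x, y, bins=(50, 50)):
    """Joint entropy H of a feature pair from its 2-D histogram."""
    h, _, _ = np.histogram2d(x, y, bins=bins)
    p = h[h > 0] / h.sum()                  # h_ij / N over non-empty bins
    return float(-np.sum(p * np.log2(p)))
```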

Results

Under the simple intersection test, out of 2051 manually labeled acoustic events, 1922 were detected (true positives) and 129 were not (false negatives).

The y max vs. tod plot shows areas of high event detection count from 5 pm to 5 am at 2.9−5.9 kHz and from 7 pm to 4 am at 9.6−10 kHz.

Three areas of medium event detection count can be found, including one from 6 pm to 5 am at 17.6−20.9 kHz.

In the Sabana Seca bw vs. y max plot, there are five areas with a high event detection count (Fig. 7). We focused on one of them, the area of interest described in the methodology. The 50 acoustic events sampled from this area were arranged into six groups (Table 3).

The majority of the events (40 out of 50) were the second note, "qui", of the call of the frog Eleutherodactylus coqui.

Discussion

The FLTR algorithm, in essence, functions as a noise filter for the recording. It takes the spectrogram from a noisy field recording (Fig. 2A) and outputs an image where, in theory, the only variation is due to localized acoustic energy (Fig. 2C). The model imposed on the spectrogram is also very generic, in the sense that no particular species or sound is modeled; rather, it models the different sources of acoustic energy. It is reasonable to think that any spectrogram is composed of these three components without loss of generality: (1) a frequency-based noise level, (2) some diffuse energy jitter, and (3) specific, localized events.

The end product of the flattening step is a spectrogram with no frequency-based components (Fig. 2B). By using the frequency band medians, we are able to stay ignorant of the nature of the short-term dynamics in the spectrogram while being able to remove any long-term nuisance effects, such as (constant) background noise or a specific audio sensor's frequency response. Thus, we end up with a spectrogram that is akin to a flat landscape with two components: (1) a roughness element (i.e., grass leaves) in the landscape and (2) a series of mounds, each corresponding to a given acoustic event. Due to this roughness element, a global threshold at this stage is ineffective. The local trimmed range, however, is able to enhance the contrast between the flat terrain and the mounds (Fig. 2C), enough to detect the mounds by using a simple global threshold. By using a Yen threshold, we maximize the entropy of both the background and foreground classes (Fig. 2D). In the end, the FLTR algorithm has the advantage of not trying to guess the distribution of the acoustic energy within the spectrogram; rather, it exploits robust statistics that work for any distribution to separate the three modeled components.

Even when the model's assumptions are not fully met, some values, most notably the highest values in the aggregate localized energy processes, will always be above b(f). Depending on factors such as their size and remaining intensity, they could very well be detected.

The degradation of the detection algorithm in relation to these assumptions, however, is still to be studied.

In all three sites, the plots with the most joint entropy were cov vs. y max, y max vs. tod and cov vs. tod. These feature distributions can be used, for example, to locate the frequencies and times of peak activity in a soundscape.

Using the confusion matrix in Table 1 as a ruler, we can estimate that around 94% of the acoustic events were detected and that they make up around 89% of the total number of detected events.

Conclusions

The FLTR algorithm uses a sound spectrogram model. Using robust statistics, it is able to exploit the spectrogram model without assuming any specific energy distributions. Coupled with the Yen threshold, we are able to extract acoustic events from recordings with high levels of sensitivity and precision.

Using this algorithm, we are able to explore the acoustic environment using acoustic events as base data. This provides us with an excellent vantage point from which any feature computed from these acoustic events can be explored, for example the time of day vs. dominant frequency distribution of an acoustic environment (i.e., a temporal soundscape), down to the level of the individual acoustic events composing it.

374
As a tool, the FLTR algorithm, or any improvement thereof, has the potential to shift the paradigm from using recordings to using acoustic events as the base data for ecological acoustic research.