Species-specific audio detection: a comparison of three template-based detection algorithms using random forests

8 We developed a web-based cloud-hosted system that allow users to archive, listen, visualize, and annotate recordings. The system also provides tools to convert these annotations into datasets that can be used to train a computer to detect the presence or absence of a species. The algorithm used by the system was selected after comparing the accuracy and efficiency of three variants of a template-based detection. The algorithm computes a similarity vector by comparing a template of a species call with time increments across the spectrogram. Statistical features are extracted from this vector and used as input for a Random Forest classifier that predicts presence or absence of the species in the recording. The fastest algorithm variant had the highest average accuracy and specificity; therefore, it was implemented in the ARBIMON web-based system. 9


18
Monitoring fauna is an important task for ecologists, natural resource managers, and conservationists. 19 Historically, most data were collected manually by scientists that went to the field and annotated their 20 observations (Terborgh et al., 1990). This generally limited the spatial and temporal extend of the 21 data. Furthermore, given that the data were based on an individual's observations, the information was 22 difficult to verify, reducing its utility for understanding long-term ecological processes (Acevedo and 23 Villanueva- Rivera, 2006). 24 To understand the impacts of climate change and deforestation on the fauna, the scientific community 25 needs long-term, wide-spread and frequent data (Walther et al., 2002). Passive acoustic monitoring (PAM) 26 can contribute to this need because it facilitates the collection of large amounts of data from many sites  Passive recorders can easily create a very large data set (e.g. 100,000s of recordings) that is over-   (Taylor, 1995;Grigg et al., 1996), In this study, we developed a method that detects the presence or absence of a species' specific call 48 type in recordings with a response time that allows researchers to create, run, tune and re-run models 49 in real time as well as detect hundreds of thousands of recordings in a reasonable time. The main 50 objective of the study was to compare the performance (e.g. efficiency and accuracy) of three variants 51 of a template-based detection algorithm and incorporate the best into the ARBIMON II bioacoustics 52 platform. The first variant is the Structural Similarity Index described in Wang et al. (2004), a widely 53 use method to find how similar two images are (in our case the template with the tested recording). The 54 second method filters the recordings with the dynamic thresholding method described in Wang et al. 55 (2004) and then use the Frobenius norm to find similarities with the template. The final method uses the 56 Structural Similarity Index, but it is only applied to regions with high match probability determined by 57 the OpenCV's matchTemplate procedure (Bradski, 2000).

59
Passive acoustic data acquisition 60 We gathered recordings from five locations, four in Puerto Rico and one in Peru. Some of the record-61 ings were acquired using the Automated Remote Biodiversity Monitoring Network (ARBIMON) data  Figure 1). The location in Peru was the Amarakaeri Communal Reserve in the Madre de Dios Region 69 (see Figure 2). In all the locations, the recorders were programmed to record one minute of audio every 70 10 minutes. The complete dataset has more than 100,000 1-minute recordings. We randomly chose 362 71 recordings from Puerto Rico and 547 recordings from Peru for comparing the three algorithm variants. We used the ARBIMON II web application interface to annotate the presence or absence of 21 species 73 in all the recordings. Regions in the recording where a species emits a sound were also marked using   Figure 3. The three phases of the algorithm to create the species-specific models. In the Model Training phase Rec i is a recording, V i is the vector generated by the recognition function on Rec i and in the Detection phase V is the vector generated by the recognition function on the incoming recording. In the following sections the Template Computation process will be explained, then the process of 87 using the Template to extract features from a recording is presented and finally, the procedures to use the 88 features to train the model and to detect recordings are discussed.

90
The template refers to the combination of all ROIs in the training data. To create a template, we first start 91 with the examples of the specific call of interest (i.e. ROIs) that were annotated from a set of recordings 92 for a given species and a specific call type (e.g. common, alarm). Each ROI encompasses an example 93 of the call, and is an instance of time between time t 1 and time t 2 of a given recording and low and high 94 boundary frequencies of f 1 and f 2 , where t 1 < t 2 and f 1 < f 2. In a general sense, we combine these 95 examples to produce a template of a specific song type of a single species.

96
Specifically, for each recording that has an annotated ROI, a spectrogram matrix (SM) is computed using the Short Time Fourier Transform with a frame size of 1024 samples, 512 samples of overlap and a Hann analysis window, thus the matrices have 512 rows. For a recording with a sampling rate of 44,100 Hz, the matrix bin bandwidth is approximately 43.06 Hz. The SM is arranged so that the row of index 0 represents the lowest frequency and the row with index 511 represents the highest frequency of the spectrum. Properly stated the columns c 1 to c 2 and the rows from r 1 to r 2 of SM were extracted, where: The rows and columns that represent the ROI in the recording (between frequencies f 1 and f 2 and between 97 times t 1 and t 2 ) are extracted. The submatrix of SM that contains only the area bounded by the ROI is 98 define as SM ROI and refer in the manuscript as the ROI matrix.

99
Since the ROI matrices can vary in size, to compute the aggregation from the ROI matrices we have 100 to take into account the difference in the number of rows and columns of the matrices. All recordings 101 have the same sampling rate, 44100Hz. Thus, the rows from different SMs, computed with the same 102 parameters, will represent the same frequencies, i.e. rows with same indexes represent the same frequency.

103
After the ROI matrix, SM ROI , has been extracted from SM, the rows of SM ROI will also represent specific 104 frequencies. Thus, if we were to perform an element-wise matrix sum between two ROI matrices with 105 potentially different number of rows, we should only sum rows that represent the same frequency.

Computer Science
To take into account the difference in the number of columns of the ROI matrices, we use the Frobenius 107 norm to optimize the alignment of the smaller ROI matrices and perform element-wise sums between 108 rows that represent the same frequency. We present that algorithm in the following section and a flow 109 chart of the process in Figure 4.
where a i, j are the elements of matrix A. 133 iv. Define Sub min{d k } as the Sub k matrix with the minimum d k . This is the optimal align- 135 v. Align the rows of Sub min{d k } and T temp so they represent equivalent frequencies, perform 136 an element-wise addition of the matrices and put the result in T temp .

137
vi. Add one to all the elements of the W matrix where the previous addition participated. 138 7. Define the matrix T template as the element-wise division between the T temp matrix and the W matrix.

139
The resulting T template matrix summarizes the information available in the ROI matrices submitted by 140 the user and it will be used to extract information from the audio recordings that are to be analyzed. In 141 this article each species T template was created using five ROIs.

142
In Figure    In the following section we present the details of the algorithm that processes a recording to create the 151 recognition function vector and in Figure 6, we present a flowchart of the process. Manuscript to be reviewed Computer Science 4. Define step, the step factor by which T template will progressed over the SPEC matrix.
where SPEC i is the submatrix of SPEC that spans the columns from i × step to i × step + c template and the same number of rows as T template and V = [meas 1 , meas 2 , meas 3 , · · · , meas n ] with n = c SPEC − c template step + 1.
1 Note that for recordings with a sample rate of 44100 when we calculate the STFT with a window of size 512 and a 50% overlap, one step is equivalent to 5.8 milliseconds, therefore, 16 steps is less than 100 milliseconds. Although this procedure may miss the strongest match, the length of the calls are much longer than the step interval; therefore, there is a high probability of detecting the species-specific call.

7/20
PeerJ Comput. Sci. reviewing PDF | (CS-2016:12:15269:2:0:NEW 3 Apr 2017) Manuscript to be reviewed Computer Science Second, the dynamic thresholding method (threshold_adaptive) described in Wang et al. (2004) with a block size of 127 and an arithmetic mean filter is used over both T template and SPEC i before multiplying them and applying the Frobenius norm and normalized by the norm of a matrix with same dimensions as T template and all elements equal to one. Therefore, meas i for the NORM variant is defined as: where again SPEC i is the submatrix of SPEC that spans the columns from i × step to i × step + c template , FN is the Frobenius norm, DTM is the dynamic thresholding method, U is a matrix with same dimensions as T template with all elements equal to one and . * performs an element-wise multiplication of the matrices. Again, V = [meas 1 , meas 2 , meas 3 , · · · , meas n ] with n = c SPEC − c template step + 1.
Finally, for the CORR variation we first apply the OpenCV's matchTemplate procedure (Bradski, 2000) with the Normalized Correlation Coefficient option to SPEC i , the submatrix of SPEC that spans the columns from i × step to i × step + c template . However, for this variant, SPEC i includes two additions rows above and below, thus it is slightly larger than the T template . With these we can define: where SPEC j,i is the submatrix of SPEC i that starts at row j (note that there are 5 such SPEC j,i matrices).

174
Now, we select 5 points at random from all the points above the 98.5 percentile of meas j,i and apply the Structural Similarity Index 5 strongly-matching regions. The size of these regions is eight thirds (8/3) of the length of T template , 4/3 before and 4/3 after the strongly-matched point and was selected. Then, define FilterSPEC as the matrix that contains these 5 strongly-matching regions and FilterSPEC i as the submatrix of FilterSPEC that spans the columns from i to i + c template then, the similarity measure for this variant is define as: and the resulting vector V = [meas 1 , meas 2 , meas 3 , · · · , meas n ] but this time with n = 5 × 8/3 × c template + 1 .

Computer Science
It is important to note that no matter which variant is used to calculate the similarity measures, the 175 result will always be a vector of measurements V . The idea is that the statistical properties of these 176 computed recognition functions have enough information to distinguish between a recording that has the 177 target species present and a recording that does not have the target species present. However, notice that 178 since c SPEC , the length of SPEC, is much larger that c template the length of the vector V for the CORR 179 variant is much smaller than the other two.

181
After calculating V for many recording we can train a random forest model. First, we need a set of 182 validated recordings with the specific species vocalization present in some recordings and absent in others.

183
Then for each recording we compute a vector V i as described in the previous section and extract the 184 statistical features presented in Table 2   To decide which of the three variants should be incorporated into the ARBIMON web-based system, we performed the algorithm explained in the previous section with each of the similarity measures. We computed 10-fold validations on each of the variants to obtained measurements of the performance of the algorithm. In each validation 90% of the data is used as training and 10% of the data is used as validation data. Each algorithm variant used the same 10-fold validation partition for each species. The measures calculated were the area under the receiver operating characteristic (ROC) curve (AUC), accuracy or correct detection rate (Ac), negative predictive value (N pv), precision or positive predictive value (Pr), sensitivity, recall or true positive rate (Se) and specificity or true negative rate (Sp). To calculate the AUC, the ROC curve is created by plotting the false positive rate (which can be calculated as 1 -specificity) against the true positive rate (sensitivity), then, the AUC is created by calculating the area under that curve. Notice that the further the AUC is from 0.5 the better. The rest of the measures are defined as follows: Ac = t p + t n t p + t n + f p + f n , N pv = t n t n + f n , Pr = t p t p + f p , Se = t p t p + f n and Sp = t n t n + f p with t p the number of true positives (number of times both the expert and the algorithm agree that the states that the species is present while the expert states is absent) and f n the number of false negatives 199 (number of times the algorithm states that the species is not present while the expert states it is present).

200
Note that accuracy is a weighted average of the sensitivity and the specificity.

201
Although we present and discuss all measures, we gave accuracy and the AUC more importance

218
The six measurements (area under the ROC curve -AUC, accuracy, negative predictive value, precision, 219 sensitivity and specificity) computed to compared the model across the three variants varied greatly among 220 the 21 species. The lowest scores were among bird species while most of the highest scores came from 221 amphibian species. Table 3 Table 3, while the CORR variant had a greater 224 number of species with 80% or greater for all the measures and an overall median accuracy of 81%. We 225 considered these two facts fundamental for a general-purpose species detection system.

226
The local species ANOVA suggested that there are significant accuracy differences at the 95% 227 significance level for 6 of the 21 species studied as well as 4 in terms of precision and 3 in terms of 228 specificity (see supplemental materials). The algorithm variant CORR had a higher mean and median 229 AUC at 78% and 81% respectively, but the SSIM variant seems to be more stable with a standard deviation 230 of 20%. In terms of accuracy, both the SSIM and CORR have higher mean accuracy than the NORM 231 variant. Nevertheless, variant CORR had the highest median accuracy of 81%, which is slightly higher 232 than the median accuracy of the SSIM variant at 76%. In addition, variant CORR had more species with 233 an accuracy of 80% or greater.  (Table 4).  (Table 3), and the mean execution 254 time of CORR variant did not increase with increasing T template size (Table 4).

256
The algorithm used by the ARBIMON system was selected by comparing three variants of a template-   where there was a strong-match determined by the OpenCV's matchTemplate procedure (Bradski, 2000).

288
The results show that this method performed better both in terms of ability to detect as well as in terms of A fast and accurate general-purpose algorithm for detecting presence or absence of a species com-291 pliments the other tools of the ARBIMON system, such as options for creating playlists based on many 292 different parameters including user-created tags (see Table 5). For example, the system currently has  Manuscript to be reviewed Computer Science   Table 6. Area Under the ROC Curve (AUC), Accuracy (Ac), negative predictive value (Npv), precision (Pr), sensitivity (Se) and specificity (Sp) of the 21 species and three variants of the algorithm. Best values are shaded and the cases where the ANOVA suggested a significant difference between the algorithm variants at the 95% confidence level are in bold .