Efficient Audio Fingerprint Application Verification Using the Adapted Computational Geometry Algorithm

An earlier work of the authors introduced an adapted version of the Computational Geometry Algorithm (CGA) designed to analyse an audio stream and produce a unique coding-independent fingerprint. As the adaptability and the induced calculation load of the proposed algorithm form a key characteristic for multiple applications, our current investigation aims to measure its performance and stability in dynamic, real-time applications, i.e., in large audio library indexing and dynamic audio recognition. In addition, we investigate the fact that context similarity is also evident across fingerprints; hence a number of comparisons are used to explore the possible uses of this highly desirable algorithmic feature.


Introduction
The wide availability of Internet connectivity and media-enabled devices has altered the way content is produced, distributed and finally reproduced. Increased demand for all types of audiovisual streams is clearly evident, as most network traffic today consists of multimedia data exchanged on a global scale (Deliyannis, 2012). The digitization and networked distribution of audio, video and live broadcasts, together with the increased demand for interactive control, lead authors and companies towards content re-use and reproduction via customization of existing content and distribution (Karydis, Deliyannis, & Floros, 2011) through networked multimedia databases and multicast systems. Within evolving markets such as electronic multimedia-content retail and exchange points, various new services emerge (Deliyannis, Karydis, & Anagnostou, 2011). In comparison to traditional media, new access methods alter the way that data are distributed and reproduced, often forming new applications and domains such as interactive and new-media arts (Trifonova, Jaccheri, & Bergaust, 2008), while changing the user culture in terms of content use (Gillespie, 2004). These global changes introduce new markets and services (Simpson, 2004), a fact that is clearly evident when observing the evolution of standards such as MPEG-7, which links content to context, and MPEG-21, which offers multimedia accessibility for all (Kosch, 2004). The considerable content availability through various media will certainly require new broadcast-control and verification mechanisms to be established, a sector that may be aided by current research.
Computers are also employed in the area of copyright management under a wide variety of applications, one example being the application of pattern-matching algorithms and techniques to identify copyrighted content (Furht & Kirovski, 2005; Karydi, Karydis, & Deliyannis, 2012). The task is straightforward in text-based applications. This particular data format is transferred and delivered in complete form, without loss of content during transfer and reproduction, a fact that significantly aids the pattern-matching process. In contrast, audio (and/or video) streams are often degraded in terms of quality due to the employment of various compression techniques and the inevitable stream re-compression processes introduced by the wide variety of transmission formats available in all media-enabled platforms. These algorithmic compression processes, such as MPEG 1, 2, 3 and 4, are based on mechanisms of human perception for minimizing the required transmission bandwidth. Ultimately, conversion between various media formats significantly alters the original information, a fact that introduces various problems in the identification process, as error and distortion are clearly evident when contrasting original versus transmitted data.
The present paper can be considered a continuation of our latest research, which laid the necessary theoretical foundation for audio fingerprinting based on convex layer definition in the frequency domain (Poulos, Deliyannis, & Floros, 2012). In this work the experimental aspects of a novel algorithm for defining convex layer areas over audio signal spectral peaks as a track identification procedure are addressed in an attempt to standardize the identification process. In our view, the latter process is a key issue that needs to be resolved before this technology may be exploited commercially. Our experimentation indicates that beyond direct pattern matching, dynamic content detection is also possible. In that respect, once the standards are related to the semantic layers (fingerprints) of information and communication systems, important consequences arise that require further research under Music Information Retrieval (MIR) research (Aucouturier & Pachet, 2003; Casey et al., 2008; Chandrasekhar, Sharifi, & Ross, 2011; Levy & Sandler, 2009; Logan, Ellis, & Berenzweig, 2003; Marsden, 2010; McFee, Barrington, & Lanckriet, 2010; McKay & Fujinaga, 2008; Slaney, Weinberger, & White, 2008; Wang, 2003). Our previous research indicates application areas such as gaming (Deliyannis, Karydis, & Anagnostou, 2011; Karydis et al., 2011) and copyright identification (Deliyannis, Karydis, & Karydi, 2011; Karydi et al., 2012).
The paper at hand is organized as follows. Section 2 briefly presents a synopsis of the Computational Geometry Algorithm (CGA) audio fingerprinting algorithm. This issue is covered fully in our recently published theoretical definition of the algorithm (Poulos et al., 2012), and the reader is encouraged to refer to that article for a detailed algorithmic and mathematical analysis. Next, Sections 3 and 4 provide extended experimentation cases based on a number of widely-employed application scenarios and present the results obtained using multiple forms of audio content as well as the statistical evaluation of the derived data. Finally, Section 5 concludes this work by proposing future research directions.

Related Study
In our latest study, a novel audio content identification (matching) approach is presented, based on the significant reduction of the original spectral peaks enclosed in convex layer areas (Poulos et al., 2012). This work introduced audio-track identification through the use of computational geometry algorithms, where the problem of matching sample peaks with original peaks was addressed using an intersection technique between convex layers. In particular, this approach produced a convex polygon in the frequency domain that resembles a coordinate-based pattern in terms of a unique set of points that can be considered to be the audio data "fingerprint." In the above work it was also shown that this fingerprint pattern is coding-independent, a fact that provides indications that the proposed algorithm may be suitable for multiple purposes and applications, including the categorisation of content identity and the identification of audio clips, hence providing support for the realisation of audio sorting/searching tasks and services.
The above-described method was realised via the Computational Geometry Algorithm (CGA), a computationally efficient scheme of onion-like layers that results in unique frequency-domain representations of the innermost onion layer (Poulos et al., 2012). More specifically, the digital audio signal under identification (test signal), denoted here as $x(n)$, is initially transformed into the frequency domain and represented in terms of its Power Spectral Density (PSD) $X(f)$ via Bartlett's estimation. The same procedure is applied to the original (reference) signal $x_{ref}(n)$, producing the PSD vector $X_{ref}(f)$ of size $N$. Then, the CGA algorithm is applied to the derived PSD data, producing onion-like layers, denoted as $S$ in the case of the reference signal. An example of such algorithmically constructed layers is graphically represented in Figure 1. Next, a critical algorithmic parameter, the total depth of layers (or the k-depth value), is defined following the algorithm described in our latest study (Poulos et al., 2012). Finally, by algorithmically isolating the k-th innermost layer, we obtain the convex subset $S_{xy}$ that corresponds to the reference signal. The same procedure is applied to the test signal PSD data and the k-th convex subset $N_{xy}$ is similarly derived. During the final matching/identification process, the intersection $R_{xy}$ of the above convex subsets $S_{xy}$ and $N_{xy}$ is computed, that is:

$$R_{xy} = S_{xy} \cap N_{xy} \qquad (1)$$

The identification procedure is completed by extracting the degrees of similarity $s_1$ and $s_2$ from the computed areas ($A$) of the calculated convex subsets ($S_{xy}$, $N_{xy}$ and $R_{xy}$), expressed as the fractions of Equation 2 (see also Figure 1). The above identification/matching process architecture is graphically illustrated in Figure 2.
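The layer extraction and matching steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each spectral peak is a 2-D point (frequency, power), uses Andrew's monotone chain for each hull and Sutherland-Hodgman clipping for the intersection, and assumes that the two similarity fractions are $A(R_{xy})/A(S_{xy})$ and $A(R_{xy})/A(N_{xy})$. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # assumed (frequency, power) pair of a spectral peak


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def convex_layers(points: List[Point]) -> List[List[Point]]:
    """Onion peeling: repeatedly remove the current hull until no points remain."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(list(remaining))
        layers.append(hull)
        remaining -= set(hull)
    return layers


def polygon_area(poly: List[Point]) -> float:
    """Shoelace formula; degenerate polygons (fewer than 3 vertices) have zero area."""
    n = len(poly)
    if n < 3:
        return 0.0
    twice = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
                for i in range(n))
    return abs(twice) / 2.0


def clip_convex(subject: List[Point], clipper: List[Point]) -> List[Point]:
    """Sutherland-Hodgman: clip `subject` against a convex, counter-clockwise `clipper`."""
    output = list(subject)
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        previous, output = output, []
        for j in range(len(previous)):
            p, q = previous[j - 1], previous[j]
            dp, dq = cross(a, b, p), cross(a, b, q)
            if (dp >= 0) != (dq >= 0):  # edge pq crosses the clipping line
                t = dp / (dp - dq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if dq >= 0:  # q lies on the inner side of the clipping line
                output.append(q)
        if not output:
            break
    return output


def similarity(ref_layer: List[Point], test_layer: List[Point]) -> Tuple[float, float]:
    """Assumed fractions: area of the intersection over each layer's own area."""
    common = polygon_area(clip_convex(ref_layer, test_layer))
    a_ref, a_test = polygon_area(ref_layer), polygon_area(test_layer)
    if a_ref == 0 or a_test == 0:
        return 0.0, 0.0
    return common / a_ref, common / a_test
```

In use, one would call `convex_layers` on the reference and test peak sets, select the layer at the chosen depth from each, and pass the two layers to `similarity`. Selection of the k-depth value follows the earlier study and is not reproduced here.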

Implementation Issues-Decision Stages
The degrees of correlation $s_1$ and $s_2$ (see Equations 1, 2) between $S_{xy}$ and $N_{xy}$ (see Section 2) are calculated according to the selected null hypothesis. The null hypothesis claims that there is no link between the two sampled subsets. Since the distribution of the subsets is unknown, a reasonable strategy is to test the hypothesis with a non-parametric approach, using permutations to obtain the distribution of the subsets under $H_0 = 0$ with $p = 0.05$, in which all the subsets are randomly distributed. In our case, however, we used an alternative hypothesis, $H_1$, which controls the specific similarities between the groups. More specifically, in the current study we investigated the following three decision stages: (a) the pairs of audio fingerprints are identified; (b) the pairs of audio fingerprints have common features; and (c) the pairs of audio fingerprints are not identified.
For this, the decision rules are extracted by the unsupervised k-means clustering procedure. One particular issue with the dataset of vector $S_i$ (see Equation 1) is to define the number of categories. In order to achieve a safe result, we use the well-established k-means clustering algorithm to examine whether the three targeted decision groups are the optimal clustering approach for the dataset presented later in Table 1. The procedure follows a simple way to classify a given data set through a certain number of clusters (assuming k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results, and a loop re-evaluates the assignment iteratively. In our case we partitioned the data set (of Table 1) into k = 3 clusters, using Equation 3.

The audio material was compressed using the MPEG-1 Layer III standard at different typical coding bitrates (i.e. 32, 48, 64, 96, 112, 128, 160, 192, 256 and 320 kbps) in order to produce the audio content under identification for the previously described application scenarios.
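The k = 3 clustering of the similarity pairs described above can be sketched as follows. This is a minimal illustration that assumes the standard k-means objective, $J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$, since the paper's Equation 3 is not reproduced in this text. The deterministic farthest-point initialisation and the function names are our own assumptions, chosen only to make the sketch repeatable.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]  # e.g. one (s1, s2) similarity pair


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))


def farthest_point_init(data: List[Vector], k: int) -> List[Vector]:
    """Assumed initialisation: start at data[0], then add the point farthest from all chosen centroids."""
    centroids = [data[0]]
    while len(centroids) < k:
        centroids.append(max(data, key=lambda x: min(sq_dist(x, c) for c in centroids)))
    return centroids


def kmeans(data: List[Vector], k: int = 3, max_iter: int = 100) -> Tuple[List[Vector], List[int]]:
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = farthest_point_init(data, k)
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(x, centroids[j])) for x in data]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels
```

Applied to a list of ($s_1$, $s_2$) pairs, the three resulting clusters would correspond to the three decision stages only if the data separate in that way, which is what the clustering check in the text is meant to establish.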
In order to realize the above application scenarios in practice, for the experiments that followed we divided the audio tracks database into two sets:
• The first set is formed by 10 (different typical coding compression schemes) × 3 (different music genres considered) = 30 (total music tracks).
• The second set consists of short-length audio clips (with a duration of 3 seconds). These clips were produced by randomly positioning a 3-second time window on the compressed audio data. For each compressed track of the three audio-recording genres considered (violin, rock and orchestra) we created 5 randomly selected short audio clips, resulting in a total of 5 × 30 = 150 clips in the second dataset.
In the testing procedure, we create 4500 pair combinations between the two sets of sizes m = 30 and k = 150, as given by Equation 4:

$$m \times k = 30 \times 150 = 4500 \qquad (4)$$

The three investigation scenarios presented earlier allow a number of possible application suggestions to be made. These include copyright applications such as the interactive and fairer calculation of royalty charges, which may then be attributed directly to the copyright owners. A networked sensing device installed to constantly monitor, identify and report on the commercial use of audio (radio and TV stations) is certainly a novel application, one that will allow producers the flexibility to interactively select and broadcast content. The partial-detection feature may also enable charging on a per-second tariff, instead of the flat charge that applies today. The limited sample duration required for this algorithm to detect the source is clearly an advantage that may also boost the commercial application of the method for personal use, enabling the optimisation of current algorithms used in proprietary applications. Finally, the availability of a tool that detects similarity across different versions of the same theme enables research or commercial tools for detecting audio influences, automatically categorising content and uncovering copyright issues that could have been missed in the past.
As will be shown in the next Sections, exhaustive testing across all datasets described above showed that the CGA-based fingerprint-matching algorithm performs well across different compression schemes and, most importantly, when randomly segmented tracks are considered. We must also note that the 3-second signal length used here compares favourably with most industrial-level algorithms used today in proprietary audio recognition applications, which work efficiently for samples longer than 10 seconds (Chandrasekhar et al., 2011).
The sequence of tests performed was organised in the following manner: for every music theme we assessed the performance of the proposed algorithm for all considered compression rates, which allowed us to identify how well the algorithm performed in music theme recognition at varying compression rates and audio quality. In the testing procedure (Note 1), we considered 4500 pair combinations between the two database sets defined previously.
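The construction of the two sets and of the 4500 test pairs can be made concrete with a short sketch. It only enumerates labels and window positions, not audio data. The track duration of 180 seconds, the fixed random seed and all names are hypothetical choices made for illustration.

```python
import itertools
import random
from typing import List, Tuple

BITRATES_KBPS = [32, 48, 64, 96, 112, 128, 160, 192, 256, 320]
GENRES = ["violin", "rock", "orchestra"]
CLIPS_PER_TRACK = 5
CLIP_SECONDS = 3.0

Track = Tuple[str, int]                        # (genre, bitrate in kbps)
Clip = Tuple[str, int, Tuple[float, float]]    # (genre, bitrate, (start, end) in seconds)

# First set: 3 genres x 10 bitrates = 30 compressed tracks.
tracks: List[Track] = [(genre, rate) for genre in GENRES for rate in BITRATES_KBPS]


def random_window(track_seconds: float, rng: random.Random) -> Tuple[float, float]:
    """Place a 3-second window at a uniformly random position inside the track."""
    start = rng.uniform(0.0, track_seconds - CLIP_SECONDS)
    return start, start + CLIP_SECONDS


def build_clips(track_seconds: float, seed: int = 0) -> List[Clip]:
    """Second set: 5 random windows per compressed track, 150 clips in total."""
    rng = random.Random(seed)
    return [(genre, rate, random_window(track_seconds, rng))
            for genre, rate in tracks
            for _ in range(CLIPS_PER_TRACK)]


# Hypothetical track duration of 180 s; the real durations are not stated in the text.
clips = build_clips(track_seconds=180.0)

# Every track is compared against every clip: 30 x 150 = 4500 pairs.
pairs = list(itertools.product(tracks, clips))
```

Each element of `pairs` would then be passed to the fingerprint-matching step to obtain one ($s_1$, $s_2$) result.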

Case 1: Compressed Application Scenario
As mentioned previously, under this application case, the three different audio tracks considered as testing material were encoded using the MPEG-1 Layer III lossy compression standard at different typical coding bitrates, specifically 32, 48, 64, 96, 112, 128, 160, 192, 256 and 320 kbps. The compressed content was then decoded, producing a distorted, uncompressed version of the original track. We then applied the CGA algorithm to the original data, as well as to this uncompressed version, and calculated the degrees of correlation for these signals. Based on the implementation analysis provided in Section 3, we obtained identification or no-identification results for all the combinations of the compressed audio material considered.
A representative set of the above results is presented in Tables 1 and 2. For illustration purposes, we present only the diagrams and values obtained for the violin recordings. Clearly, the CGA algorithm correctly identifies the compressed audio signal under any selected compression bit rate (even for those that imply lower sampling rates, such as 32 kbps). The same trends were observed for the other two music tracks considered here (rock and orchestra). This fact is illustrated in Figure 4, where the degrees of correlation $s_1$ and $s_2$ are graphically presented as a function of the compression bit rate. In all test cases, at least one of the $s_1$ or $s_2$ values exceeds the thresholds defined in Section 3.

Case 2
As mentio music con material.T usually, on rock-recor identificati formed as violin and music trac attributed for an orc particular identificati measured

Conclusions
This study's major objective was to evaluate a mechanism for audio similarity detection and fingerprinting that, under certain circumstances, can be inserted into the information-management strategies of (large) information organisations and large, co-operative libraries as a technique for the identification and possible control of the intellectual property of electronically published audio material. The existing techniques, and especially the digital signature schemes, could fulfil only the first part of this objective, namely identification.
In particular, this work employs the adjustment of a computational geometric algorithm for the semantic representation of the information of audio data in terms of a frequency-domain audio fingerprint. The idea for this construction came from the test of the onion-peeling algorithm in other areas of signal processing, such as the identification of humans by fingerprints. The aim of this application is to construct an audio fingerprint (i.e. in terms of a serial number) that could identify a copyright-protected published audio file even if its file format has changed from one type to another. Furthermore, it aims to provide a satisfactory amount of correlation similarity with other audio files created from the original by applying different coding/compression techniques, and to detect and automatically reject audio files that are not related to the original.
For a realistic implementation and efficiency assessment of the proposed audio fingerprinting algorithm, the authors created a small database with three different audio genres encoded using the MPEG-1 Layer III specification at multiple compression ratios, enabling experimentation with internal and external data-sets. This demonstrated the computational efficiency of the algorithm, which was successfully used under three different application scenarios. The first investigates matching of a full audio clip duration using varying compression settings. In the second scenario compression and sample duration vary, while the third introduces context testing across different performances and orchestrations. This last scenario introduces a "common feature" tracking mechanism, which allows automated comparison of different audio tracks that share musicological characteristics. For instance, we found that the audio clip containing the violin may be partially associated with instrumental tracks containing the violin (orchestra file), a characteristic that was consistent across a wide variety of experimental executions.
We additionally showed, via the Wilcoxon paired-sample test, that the categories of the intersection areas between different or same audio clips are strongly related. We also found that the fingerprint features must be aligned temporally; that is, if a set of features appears in both the original recording in the database and in a sample query, the relative positions of each feature within each recording must be the same. The computational load of the algorithm behaves linearly (i.e. $O(2n)$) for each compared tuple and may be bounded by a second-order polynomial, $O(2n^2)$, for the comparison procedure under the worst-case scenario.
Our future research on this topic will focus on the comparison of the algorithmic results for data with varying similarity. Mixed audio tracks and segments may be used for pattern matching, enabling automated copyright-verification to be performed. The authors believe that the same algorithm may also be utilised for other multimedia data types including images, video, text and combined applications such as web pages, multimedia systems and databases.

Figure 1. A graphical representation of the onion-like layer extraction process

Figure 2. Schematic representation of the preprocessing, feature extraction and identification stages

Table 6. Wilcoxon Paired-Sample test on k-means clustering