Classification of Overlapped Audio Events Based on AT, PLSA, and the Combination of Them

Audio event classification, an important part of Computational Auditory Scene Analysis, has attracted much attention. Current classification technology is mature enough to classify isolated audio events accurately, but it performs much worse on overlapped audio events. In real life, however, most audio documents contain a certain percentage of overlaps, so overlap classification is an important part of audio classification. Work on overlapped audio event classification is still scarce, and most existing overlap classification systems can recognize only one audio event for an overlap. In this paper, in order to deal with overlaps, we introduce the author-topic (AT) model, which was first proposed for text analysis, into audio classification, and combine it with PLSA (Probabilistic Latent Semantic Analysis). We propose 4 systems, i.e. AT, PLSA, AT-PLSA and PLSA-AT, to classify overlaps. The 4 proposed systems can recognize two or more audio events for an overlap. The experimental results show that the 4 systems perform well in classifying overlapped audio events, whether or not the type of overlap appears in the training set. They also perform well in classifying isolated audio events.


Introduction
Audio, as one form of multimedia information, carries rich information and has been studied and applied extensively [1][2][3][4][5]. Recently, audio event classification, an important part of Computational Auditory Scene Analysis (CASA), has attracted much attention. Unlike audio event detection, which determines both the identity and the occurrence time of the sounds in an audio document, audio event classification identifies the sounds in given audio segments. Audio event classification is useful in a variety of applications, including multimedia retrieval [6], intelligent robots [7], and smart home projects [8]. For an audio document, two types of audio event can be defined as follows:

Definition 1 Isolated Audio Event:
The audio event that does not have temporal overlap with other audio events. That is, at the time when the audio event occurs, no other audio events occur simultaneously.

Definition 2 Overlapped Audio Event:
The audio event that has temporal overlap with other audio events. That is, at the time when the audio event occurs, other audio events occur simultaneously.
Nowadays, audio classification technology is mature enough to classify isolated audio events accurately, but its performance degrades considerably on overlapped ones. In the CLEAR 2007 international evaluation campaign [9], the overlapped segments (the segments that contain overlapped audio events) accounted for more than 70% of the errors produced by every submitted system. Toni Heittola [10] pointed out that overlapped audio events make automatic sound event recognition more difficult to handle. Dealing with overlapped audio events is therefore a real challenge. In real life, however, most audio files contain a certain percentage of overlapped audio events, so overlapped audio event classification is an important part of audio file analysis. Overlapped audio events constitute a natural auditory scene. Most studies perform auditory scene recognition by modeling the global acoustic characteristics of the scene, and neglect the classification of the overlapped audio events. In this paper, we propose several overlap classification systems based on two topic models, i.e. AT (author-topic model) [11] and PLSA (Probabilistic Latent Semantic Analysis) [12]. Both AT and PLSA were first proposed in the text analysis field. AT can extract the topic information of authors, and PLSA can extract the topic information of documents. The related work is described in Sec. 2. The two topic models are briefly introduced in Sec. 3, which also discusses how to use them, and their combination, to classify overlaps. The experimental results are presented in Sec. 4. Finally, conclusions and future work are given in Sec. 5.

Related Work
Both AT and PLSA are specific cases of topic models. AT is in fact an extension of LDA (Latent Dirichlet Allocation) [13]. So far there has been no report on applying AT in the audio field, but much work has been done on applying LDA to audio retrieval. For example, Samuel Kim [14] assumed that an audio clip was a mixture of acoustic topics, and used LDA to extract the topic distribution of each audio clip for audio retrieval. Pengfei Hu [4] overcame the shortcoming of LDA in processing continuous data, and proposed a new topic model named Gaussian-LDA for audio retrieval. In this paper we introduce AT into audio classification based on the idea that an audio document can be expressed as a combination of acoustic topics as well as a combination of acoustic events. A similar idea is proposed in [15], where the LATEA (Latent Acoustic Topic and Event Allocation) model was proposed for acoustic scene analysis. The difference is that instead of expressing an audio document as a combination of acoustic events, LATEA expresses an acoustic topic as a combination of acoustic events. PLSA is a popular topic model in the audio processing field. Yuxin Peng [16] employed an audio PLSA model for semantic annotation. Through PLSA, Keansub Lee [17] decomposed the soundtrack into separate descriptions of specific sounds, and successfully applied it to classify consumer videos. With the latent topics learnt by PLSA, Timothy J. Hazen [18] proposed a method to automatically summarize the content of an audio corpus.
As pointed out in [19], the overlap problem can be addressed at different system levels. At the signal level, the overlap problem is related to source separation. For example, in order to detect sound events from everyday contexts, Toni Heittola [10] adopted source separation to split the audio signal into four individual signals, each of which was then processed and classified separately. At the decision level, the overlap problem is dealt with by assigning different weights to different microphones, based on the assumption that the audio sources are well separated in space. At the model level, the overlap problem is resolved by modeling all types of overlap. For example, in [19], an SVM-based audio event detection system, called ISO-CLUSTER, was proposed to detect the non-speech events that overlap with speech in a meeting-room environment. The ISO-CLUSTER system is a two-step approach. First, a [mp] class which contains all overlaps is defined. The [mp] class, along with the ISO system (the system that is constructed only from isolated audio events), is used to complete the set of 1 vs. 1 SVM classifiers. Then, in order to further classify the detected overlapped segments, an optimal decision tree is generated based on a confusion matrix. At each node of the decision tree, the audio event classes are split into two clusters by minimizing the splitting criterion shown in formula (1), and then an SVM model is trained.
In formula (1), $e_{ij}$ is the $(i,j)$-th element of the confusion matrix, and $|C_1|$, $|C_2|$ denote the cardinalities of the two clusters.
There are also other model-level systems dealing with the audio overlap problem. Another system designed to detect non-speech audio events was proposed by Miquel Espi [20]. In [20], hidden features were learnt from spectrogram patches, and were then integrated within a deep neural network to detect audio events. The exemplar-based NMF approach for audio event detection in [21], the context-dependent sound event detection in [10], and the HMM-based sound event detection in [22] share a similar idea. They all employed the Viterbi algorithm of HMM to detect the most likely event for each frame. Also, in order to detect more events in an overlapped frame, they all experimented with multiple Viterbi passes. At each pass, for each state, all events of the previous passes were forbidden. However, the authors of [21] pointed out that multiple Viterbi passes did not yield satisfactory results, because they caused large numbers of insertion errors. Considering the success of the tandem connectionist-HMM in automatic speech recognition, Xiaodan Zhuang [23] introduced it into real-world acoustic event detection. Other models, such as GMM [24] and the model constructed by NMF (Non-negative Matrix Factorization) [25], were also adopted to detect overlapped audio events.
Most of the model-level systems can recognize only one audio event for an overlap. In this paper, we aim to propose systems that recognize more than one audio event for an overlap. To do so, we adopt the topic models AT and PLSA, and the combination of them. When AT is combined with PLSA, one of them is used to find the potential audio events in an overlap, and the other is used to determine the final audio events among them. The contributions of our work are as follows. To deal with the overlap classification problem, we introduce AT into audio classification and combine it with PLSA. We design 4 systems, i.e. AT, PLSA, AT-PLSA and PLSA-AT, to resolve the overlap classification problem. The proposed systems can recognize two or more audio events in an overlap, which most current audio overlap classification systems cannot do. We have also tested the ability of AT, PLSA, AT-PLSA and PLSA-AT to classify isolated audio events.

Classification of Overlapped Audio Events Based on AT, PLSA, and the Combination of Them
PLSA was first proposed by Thomas Hofmann for text analysis [12]. It can discover the latent topical structure of text documents, and so is very useful in disambiguating polysemes and in exploring synonyms. PLSA is also feasible in the audio field, and many studies have applied it to audio analysis.
AT was first proposed to extract the author and topic information of large text collections [11]. It is a generative model based on the idea that a document can be represented as a mixture of topics. AT takes the authorship information into consideration, and is in fact an extension of LDA [13]. For text documents, AT can be applied to rank authors by topic, to rank topics by author, to parse abstracts by topics and authors, etc. An audio document is comparable to a text document. The audio components are equivalent to words, and the audio events are equivalent to the authors of the text. Thus it is feasible to introduce AT into the audio field.
In this section, we first introduce PLSA and AT briefly. The symbols of AT are consistent with those used in [11]. Then the two topic models, as well as the combination of them, are applied to the overlap classification problem.

PLSA
For a corpus with $D$ documents, assume the words in the corpus are taken from a dictionary with $W$ unique words, and there are $T$ topics in total, denoted as $z_t \in \{z_1, \dots, z_T\}$. Let $P(d_i)$ denote the probability that a word will be observed in document $d_i$, $P(w_j \mid z_t)$ the probability of word $w_j$ conditioned on the latent topic $z_t$, and $P(z_t \mid d_i)$ the probability of the latent topic $z_t$ conditioned on document $d_i$. With these definitions, the generation process of the corpus can be described as follows: (1) Pick a document $d_i$ with probability $P(d_i)$; (2) Choose a latent topic $z_t$ with probability $P(z_t \mid d_i)$; (3) Generate a word $w_j$ with probability $P(w_j \mid z_t)$.
The goal of PLSA modeling is to maximize the following joint probability

$$\mathcal{L} = \prod_{i=1}^{D} \prod_{j=1}^{W} P(d_i, w_j)^{n(d_i, w_j)}, \qquad P(d_i, w_j) = P(d_i) \sum_{t=1}^{T} P(w_j \mid z_t)\, P(z_t \mid d_i) \tag{2}$$

with the constraints

$$\sum_{j=1}^{W} P(w_j \mid z_t) = 1, \qquad \sum_{t=1}^{T} P(z_t \mid d_i) = 1.$$

Here $n(d_i, w_j)$ denotes the number of occurrences of word $w_j$ in document $d_i$, and $n(d_i) = \sum_j n(d_i, w_j)$ denotes the document length. EM (Expectation Maximization) is employed to solve the above maximum likelihood estimation, and finally $P(w_j \mid z_t)$ and $P(z_t \mid d_i)$ are obtained.
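As an illustration, the EM procedure for PLSA can be sketched as follows. This is a minimal sketch and not the implementation used in the experiments: the function name `plsa_em`, the random initialization, the fixed iteration count and the small constant added to the denominators are all assumptions made for the example. It takes a $D \times W$ count matrix $n(d_i, w_j)$ and returns estimates of $P(w_j \mid z_t)$ and $P(z_t \mid d_i)$.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a (D x W) count matrix n(d_i, w_j).

    Returns p_w_z with shape (T, W) holding P(w_j | z_t),
    and p_z_d with shape (D, T) holding P(z_t | d_i).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # assumption: guards against division by zero

    # Random initialization, normalized so that each distribution sums to 1.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z_t | d_i, w_j) is proportional to P(w_j | z_t) P(z_t | d_i).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]      # shape (D, T, W)
        post /= post.sum(axis=1, keepdims=True) + eps

        # M-step: re-estimate both distributions from expected counts.
        weighted = counts[:, None, :] * post              # shape (D, T, W)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

    return p_w_z, p_z_d
```

The dense $(D, T, W)$ posterior array keeps the sketch short; for a large corpus a sparse or per-document loop would be preferable.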

The Author-Topic Model
Assume there are $T$ topics and $A$ authors in the text corpus, and the words in the corpus are taken from a dictionary with $W$ unique words. $\Theta$ stands for a $T \times A$ matrix whose element $\theta_{ta}$ denotes the probability of assigning topic $t$ to a word generated by author $a$. The column $\theta_a$ of $\Theta$ is the multinomial distribution over topics for author $a$, and satisfies $\sum_{t=1}^{T} \theta_{ta} = 1$. $\Phi$ stands for a $W \times T$ matrix whose element $\phi_{wt}$ denotes the probability of generating word $w$ from topic $t$. The column $\phi_t$ of $\Phi$ is the multinomial distribution over words for topic $t$, and satisfies $\sum_{w=1}^{W} \phi_{wt} = 1$. Take the $A_d$-dimensional vector $\mathbf{a}_d$ to represent the authors of document $d$, and the $N_d$-dimensional vector $\mathbf{w}_d$ to represent the words in document $d$; then a corpus with $D$ documents can be represented by a vector $\mathbf{w}$ obtained by concatenating all document vectors, so that $\mathbf{w}$ has $N = \sum_{d=1}^{D} N_d$ entries. Each word in the corpus is associated with a latent author $x$ and a latent topic $z$, and we use the $N$-dimensional vectors $\mathbf{x}$ and $\mathbf{z}$ to represent the latent authors and the latent topics of the $N$ words of the corpus. Assume the prior distributions of $\theta_a$ and $\phi_t$ are symmetric Dirichlet with hyperparameters $\alpha$ and $\beta$ respectively, and the authors of each document are known in advance; then the generation process of the corpus can be described as follows:
(1) For each author $a$ ($a = 1, \dots, A$), generate $\theta_a$ according to the Dirichlet distribution with hyperparameter $\alpha$; for each topic $t$ ($t = 1, \dots, T$), generate $\phi_t$ according to the Dirichlet distribution with hyperparameter $\beta$.
(2) For each word $i$ ($i = 1, \dots, N_d$) of document $d$, given the authors $\mathbf{a}_d$: first, choose an author $x_{di}$ uniformly at random from $\mathbf{a}_d$; next, choose a topic $z_{di}$ according to the multinomial distribution $\theta_{x_{di}}$; finally, choose a word $w_{di}$ according to the multinomial distribution $\phi_{z_{di}}$.
The graphical model of the generation process is shown in Fig. 1.
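The two-step generation process can be made concrete with a short simulation. The following is a hedged sketch: the function names `draw_parameters` and `generate_document`, and the toy sizes and hyperparameter values in the usage line, are hypothetical and chosen only for illustration. Matrices follow the layout of the text, with one column of $\Theta$ per author and one column of $\Phi$ per topic.

```python
import numpy as np

def draw_parameters(n_topics, n_authors, n_words, alpha, beta, rng):
    """Step (1): draw Theta (T x A) and Phi (W x T) from symmetric Dirichlet priors."""
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_authors).T  # columns theta_a
    phi = rng.dirichlet(np.full(n_words, beta), size=n_topics).T       # columns phi_t
    return theta, phi

def generate_document(authors, theta, phi, doc_length, rng):
    """Step (2): generate the words of one document with known author list a_d."""
    x, z, w = [], [], []
    for _ in range(doc_length):
        a = int(rng.choice(authors))                         # x_di: uniform over a_d
        t = int(rng.choice(theta.shape[0], p=theta[:, a]))   # z_di ~ theta_{x_di}
        word = int(rng.choice(phi.shape[0], p=phi[:, t]))    # w_di ~ phi_{z_di}
        x.append(a)
        z.append(t)
        w.append(word)
    return np.array(w), np.array(x), np.array(z)

# Toy usage: a document "written" by authors 0 and 2.
rng = np.random.default_rng(0)
theta, phi = draw_parameters(n_topics=3, n_authors=4, n_words=10, alpha=0.5, beta=0.5, rng=rng)
words, latent_authors, latent_topics = generate_document([0, 2], theta, phi, 20, rng)
```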
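The key point of the author-topic model is to estimate the parameters $\Theta$ and $\Phi$. This is done by estimating the posterior distribution through the following equation:

$$P(x_{di} = a, z_{di} = t \mid w_{di} = w, \mathbf{z}_{-di}, \mathbf{x}_{-di}, \mathbf{w}_{-di}, \mathbf{a}_d, \alpha, \beta) \propto \frac{C^{WT}_{wt,-di} + \beta}{\sum_{w'} C^{WT}_{w't,-di} + W\beta} \cdot \frac{C^{TA}_{ta,-di} + \alpha}{\sum_{t'} C^{TA}_{t'a,-di} + T\alpha} \tag{3}$$

Here, following [11], $C^{WT}_{wt,-di}$ is the number of times word $w$ is assigned to topic $t$, and $C^{TA}_{ta,-di}$ is the number of times topic $t$ is assigned to author $a$, both excluding the current word $w_{di}$. To assess the convergence of the Markov chain, a perplexity score is used:

$$\mathrm{Perplexity}(\mathbf{w}_d \mid \mathbf{a}_d, \Theta, \Phi) = \exp\left[-\frac{\ln P(\mathbf{w}_d \mid \mathbf{a}_d, \Theta, \Phi)}{N_d}\right] \tag{4}$$

Here $P(\mathbf{w}_d \mid \mathbf{a}_d, \Theta, \Phi)$ is the posterior probability of the words $\mathbf{w}_d$ conditioned on $\mathbf{a}_d$, $\Theta$ and $\Phi$, and can be calculated as follows:

$$P(\mathbf{w}_d \mid \mathbf{a}_d, \Theta, \Phi) = \prod_{i=1}^{N_d} \left[\frac{1}{A_d} \sum_{a \in \mathbf{a}_d} \sum_{t=1}^{T} \phi_{w_{di} t}\, \theta_{ta}\right] \tag{5}$$

From (4) it can be seen that a lower perplexity value means a larger posterior probability $P(\mathbf{w}_d \mid \mathbf{a}_d, \Theta, \Phi)$, and therefore better performance of the model.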
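To make the perplexity score concrete, the sketch below computes it for one document, assuming that $\Theta$ ($T \times A$) and $\Phi$ ($W \times T$) have already been estimated. The per-word probability is the average over the document's authors of $\sum_t \phi_{wt}\theta_{ta}$, and the perplexity is the exponential of the negative mean log-probability. The function name and the floor value used to avoid $\log 0$ are assumptions of this example.

```python
import numpy as np

def perplexity(words, authors, theta, phi):
    """Perplexity of a document given its authors.

    words:   1-D integer array of word indices w_di
    authors: list of author indices a_d
    theta:   (T x A) matrix, column a is the topic distribution of author a
    phi:     (W x T) matrix, column t is the word distribution of topic t
    """
    words = np.asarray(words)
    # (W x T) @ (T x A_d) gives P(w | a) for every word and listed author.
    p_w_given_a = phi @ theta[:, authors]
    # Authors are chosen uniformly, so average over them.
    p_w = p_w_given_a.mean(axis=1)
    # Assumption: floor the probability to avoid log(0) for unseen words.
    log_lik = np.sum(np.log(np.maximum(p_w[words], 1e-300)))
    return float(np.exp(-log_lik / len(words)))
```

As a sanity check, when every topic is uniform over a dictionary of $W$ words, each word has probability $1/W$ and the perplexity equals $W$.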

Classification of Overlapped Audio Events

1) Classification through AT
In order to apply AT to classify overlapped audio events, three key problems need to be resolved:
(1) How to get the "words" of an audio document?
(2) What are the authors of an audio document?
(3) With AT, how to perform classification?
(1) To get the "words" of audio documents, we adopt vector quantization. The training audio documents are first split into frames, and for each frame, some audio features are extracted. Assume there are $L$ frames in the corpus in total, denoted as $\{f_1, f_2, \dots, f_L\}$. The frames are clustered by k-means. Assume the frames are clustered into $W$ clusters; the cluster centroids, denoted as $\{C_1, C_2, \dots, C_W\}$, are then taken as the dictionary of size $W$. With the dictionary, each frame gets an index as follows:

$$IDX(f_i) = \underset{j \in \{1, \dots, W\}}{\arg\min}\; Dis(f_i, C_j) \tag{6}$$

where $Dis(f_i, C_j)$ represents the distance between $f_i$ and $C_j$, and $IDX(f_i)$ represents the index of frame $f_i$. The indices $\{IDX(f_1), \dots, IDX(f_L)\}$ are then the "words" of the audio documents.
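The quantization step can be sketched as follows, assuming the centroids have already been obtained by k-means. The text does not specify the distance $Dis$; squared Euclidean distance is assumed here, and the function names are hypothetical. The second helper turns the frame indices of one segment into the word-count vector consumed by a topic model.

```python
import numpy as np

def assign_words(frames, centroids):
    """Map each frame feature vector to the index of its nearest centroid.

    frames:    (L x F) array of frame features
    centroids: (W x F) array of k-means centroids (the dictionary)
    """
    # Assumption: Dis is the squared Euclidean distance.
    dist = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # (L x W)
    return dist.argmin(axis=1)

def bag_of_words(indices, dict_size):
    """Count how often each dictionary word occurs in one audio segment."""
    return np.bincount(indices, minlength=dict_size)

# Toy usage with 2-dimensional features and a dictionary of size 2.
frames = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
idx = assign_words(frames, centroids)
hist = bag_of_words(idx, dict_size=2)
```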
(2) An audio document is comparable to a text document. The audio components are equivalent to words, and the audio event classes are equivalent to the authors. A text document can be represented as a combination of latent topics, and its authors are the people who wrote it. AT can extract the topic distribution of each author and the word distribution of each topic. The combination of the author-topic distributions and the topic-word distributions then generates the text. Similarly, an audio document can also be represented as a combination of latent topics, which can be understood in the same way as in a text document. We consider that an audio document is generated by the audio event classes it contains, and therefore take the audio event classes as its authors. For example, if an audio segment is an overlap of speech and music, then the authors of this audio segment are speech and music. With AT, the topic distribution of each audio event and the word distribution of each topic can be obtained, and an audio document can then be generated by combining the audio event-topic distributions and the topic-word distributions.
(3) In [11], the applications of AT include detecting unusual papers by authors and separating combined documents into their component parts. In this paper we extend its application and adapt it to classify overlaps. How to adapt it for classification is the key problem.
From (4) it can be seen that, conditioned on specific authors, perplexity can be used to estimate the posterior probability of a document. Inspired by this, we consider that, conditioned on a specific audio event, the perplexity value can be used to estimate the likelihood that the audio event is one of the real audio events in the document. One problem is that the calculation of perplexity requires the authors to be known in advance, while the authors of the audio documents are exactly what we want to find. To overcome this problem, we design the following classification scheme. For an overlapped audio segment to be tested, one perplexity value can be computed conditioned on each author, so the authors appearing in the training set yield a series of perplexity values. As discussed in Sec. 3.2, the lower the perplexity, the better the performance of the model; hence, if an author is one of the real authors of the segment, we have reason to believe that the perplexity value conditioned on it should be relatively small. Based on the above discussion, we take the authors with the smaller perplexity values as the audio events or potential audio events of the segment.
To express the above idea more clearly, we define some variables as follows. Assume there are $A$ authors in total, that is, $a \in \{1, \dots, A\}$. For an overlapped test segment $d_{test}$, conditioned on author $a$, a perplexity value, $\mathrm{Perplexity}(d_{test} \mid a, \Theta, \Phi)$, can be obtained according to (4). Then the audio events or potential audio events of audio segment $d_{test}$, denoted as $AE(d_{test})$, can be expressed as follows:

$$AE(d_{test}) = \underset{a \in \{1, \dots, A\}}{\arg F_{\min}^{M}}\; \mathrm{Perplexity}(d_{test} \mid a, \Theta, \Phi) \tag{7}$$

Here $F_{\min}^{M}$ denotes the first $M$ minimum values.
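The selection rule can be sketched in a few lines: compute one perplexity value per candidate author, treating that author as the sole author of the test segment, and keep the $M$ authors with the smallest values. This is an illustrative sketch under the assumption that $\Theta$ and $\Phi$ are already trained; the function names and the toy parameter values are hypothetical.

```python
import numpy as np

def perplexity_single_author(words, author, theta, phi):
    """Perplexity of a segment conditioned on one candidate author."""
    p_w = phi @ theta[:, author]                       # P(w | a) for every word
    log_lik = np.sum(np.log(np.maximum(p_w[np.asarray(words)], 1e-300)))
    return float(np.exp(-log_lik / len(words)))

def classify_at(words, theta, phi, m):
    """Return the m authors (audio events) with the smallest perplexity."""
    n_authors = theta.shape[1]
    perps = np.array([perplexity_single_author(words, a, theta, phi)
                      for a in range(n_authors)])
    return np.argsort(perps)[:m], perps

# Toy model: 2 topics, 4 words, 3 audio events (authors).
phi = np.array([[0.45, 0.05], [0.45, 0.05], [0.05, 0.45], [0.05, 0.45]])
theta = np.array([[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]])
events, perps = classify_at([0, 1, 0, 1], theta, phi, m=2)
```

In this toy case the segment contains only words favored by topic 0, so event 0 (which prefers topic 0) gets the smallest perplexity, followed by the mixed event 2.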

2) Classification through PLSA
To adapt PLSA to overlap classification, the concepts of word, topic and document should be redefined. The words and the construction of the dictionary are the same as in AT. The topics refer to audio events, or in other words, to authors, so $P(w_j \mid z_t)$ is rewritten as $P(w_j \mid a)$, $a \in \{1, \dots, A\}$, and $P(z_t \mid d_i)$ is rewritten as $P(a \mid d_i)$, $a \in \{1, \dots, A\}$. The document refers to an audio segment cut from the original audio documents. That is to say, the original audio documents are segmented into a series of shorter segments, and these segments are taken as the classification units. In the standard training stage of PLSA described in Sec. 3.1, $P(w_j \mid a)$ would be obtained through EM. Here, however, the topics are audio events, and the audio events in each training segment are known, so $P(w_j \mid a)$ does not need to be calculated through EM, but can simply be estimated from the word statistics of the labeled training segments. An overlapped segment in the training set contributes to the statistics of all the audio events it contains. For example, if an overlapped segment contains the audio events speech and music, then it contributes to the statistics of $P(w_j \mid a)$ for speech and also for music. In the test stage, for an audio segment $d_{test}$, $P(a \mid d_{test})$ is obtained through EM. In each M-step, only $P(a \mid d_{test})$ is updated, while the $P(w_j \mid a)$, $a \in \{1, \dots, A\}$, obtained from the training set are kept fixed. $P(a \mid d_{test})$ reflects the test segment-specific probability distribution over audio events.
The audio events with the larger probabilities can be taken as the audio events or the potential audio events in the test segment. That is:

$$AE(d_{test}) = \underset{a \in \{1, \dots, A\}}{\arg F_{\max}^{M}}\; P(a \mid d_{test}) \tag{8}$$

Here $F_{\max}^{M}$ denotes the first $M$ maximum values.
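The training and test stages of this PLSA variant can be sketched as below. The sketch makes several assumptions that are not stated in the text: $P(w_j \mid a)$ is estimated by accumulating word counts per event with a small additive smoothing constant, the test-stage EM starts from a uniform $P(a \mid d_{test})$, and a fixed number of iterations is used. All function names are hypothetical.

```python
import numpy as np

def estimate_p_w_given_a(segment_counts, segment_events, n_events, smooth=1e-3):
    """Estimate P(w | a) by counting, one row per audio event.

    segment_counts: (S x W) word counts of the training segments
    segment_events: list of S lists with the event ids present in each segment
    """
    n_words = segment_counts.shape[1]
    acc = np.full((n_events, n_words), smooth)   # assumption: additive smoothing
    for counts, events in zip(segment_counts, segment_events):
        for a in events:                          # an overlap counts for every event in it
            acc[a] += counts
    return acc / acc.sum(axis=1, keepdims=True)

def fold_in(test_counts, p_w_a, n_iter=100):
    """EM on a test segment: update only P(a | d_test), keep P(w | a) fixed."""
    n_events = p_w_a.shape[0]
    p_a_d = np.full(n_events, 1.0 / n_events)
    for _ in range(n_iter):
        post = p_a_d[:, None] * p_w_a                         # E-step, shape (A x W)
        post /= np.maximum(post.sum(axis=0, keepdims=True), 1e-300)
        p_a_d = (post * test_counts[None, :]).sum(axis=1)     # M-step
        p_a_d /= p_a_d.sum()
    return p_a_d

def classify_plsa(test_counts, p_w_a, m):
    """Return the m events with the largest P(a | d_test)."""
    p_a_d = fold_in(test_counts, p_w_a)
    return np.argsort(p_a_d)[::-1][:m], p_a_d

# Toy usage: three isolated training segments and one overlap of events 0 and 1.
train_counts = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10], [5, 5, 0]], dtype=float)
train_events = [[0], [1], [2], [0, 1]]
p_w_a = estimate_p_w_given_a(train_counts, train_events, n_events=3)
top, p_a_d = classify_plsa(np.array([6.0, 4.0, 0.0]), p_w_a, m=2)
```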

3) Classification through Combining AT with PLSA
From the above discussion it can be seen that both AT and PLSA can be used separately to classify overlaps, and both can recognize two or more audio events for an overlap. We can also combine AT with PLSA to classify overlaps. Here we design two combination strategies. In the first, AT is used to find the potential audio events of a test audio segment, and then, within these potential events, PLSA is applied to find the most likely audio events, which are taken as the classification result. This strategy is denoted as AT-PLSA hereafter. In the second, PLSA is used to find the potential audio events, and then, within these potential events, AT is applied to find the audio events with the smallest perplexity values, which are taken as the classification result. This strategy is denoted as PLSA-AT hereafter. More details about AT-PLSA and PLSA-AT are given below.
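AT-PLSA: For a test segment $d_{test}$, first, $M_1$ potential audio events, denoted as $a_i$, $i = 1, 2, \dots, M_1$, are determined through (7); then, for each $a_i$, $P(a_i \mid d_{test})$, $i = 1, 2, \dots, M_1$, is obtained as described in 2); finally, among these potential audio events, $M_2$ ($M_2 \le M_1$) audio events, selected through (8), are taken as the classification result. That is, the classification result can be expressed as:

$$AE(d_{test}) = \underset{a_i,\; i = 1, \dots, M_1}{\arg F_{\max}^{M_2}}\; P(a_i \mid d_{test}) \tag{9}$$

PLSA-AT: For a test segment $d_{test}$, first, $M_1$ potential audio events, denoted as $a_i$, $i = 1, 2, \dots, M_1$, are determined through (8); then, for each $a_i$, $\mathrm{Perplexity}(d_{test} \mid a_i, \Theta, \Phi)$, $i = 1, 2, \dots, M_1$, is obtained as described in 1); finally, among these potential audio events, $M_2$ ($M_2 \le M_1$) audio events, selected through (7), are taken as the classification result. That is, the classification result can be expressed as:

$$AE(d_{test}) = \underset{a_i,\; i = 1, \dots, M_1}{\arg F_{\min}^{M_2}}\; \mathrm{Perplexity}(d_{test} \mid a_i, \Theta, \Phi) \tag{10}$$

The 4 proposed systems, AT, PLSA, AT-PLSA and PLSA-AT, are all tested on overlap classification in the experimental section. We are also interested in the performance of the 4 systems in classifying isolated audio events and in classifying the complete test set (including overlapped and isolated audio events), so these two aspects are tested as well.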
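The two combination strategies differ only in which score is used for the shortlist and which for the final choice. The sketch below assumes that the per-event perplexities (from AT) and the per-event probabilities $P(a \mid d_{test})$ (from PLSA) have already been computed for a test segment; the function names are hypothetical, and `m1`, `m2` correspond to the $M_1$ and $M_2$ of the text.

```python
import numpy as np

def at_plsa(perplexities, p_a_d, m1, m2):
    """AT shortlists m1 events (smallest perplexity), PLSA keeps the m2 most probable."""
    candidates = np.argsort(perplexities)[:m1]
    order = np.argsort(p_a_d[candidates])[::-1][:m2]
    return candidates[order]

def plsa_at(perplexities, p_a_d, m1, m2):
    """PLSA shortlists m1 events (largest probability), AT keeps the m2 with smallest perplexity."""
    candidates = np.argsort(p_a_d)[::-1][:m1]
    order = np.argsort(perplexities[candidates])[:m2]
    return candidates[order]

# Toy scores for 5 audio events.
perplexities = np.array([3.0, 1.0, 2.0, 9.0, 8.0])
p_a_d = np.array([0.40, 0.10, 0.30, 0.15, 0.05])
result_at_plsa = at_plsa(perplexities, p_a_d, m1=3, m2=2)
result_plsa_at = plsa_at(perplexities, p_a_d, m1=3, m2=2)
```

In this toy case both strategies return events 0 and 2, but in a different order: AT-PLSA ranks event 0 first (highest probability in the shortlist), while PLSA-AT ranks event 2 first (lowest perplexity in the shortlist).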

Dataset, Feature and Metric
The proposed systems are evaluated on two datasets. One is constructed from the first 5 episodes of the drama "Band of Brothers", abbreviated as the BOB dataset, and the other is constructed from 5 episodes of the sitcom "Friends", abbreviated as the Friends dataset. The average length of one episode is about 55 minutes in the BOB dataset, and about 22 minutes in the Friends dataset. For both datasets, the audio events are hand-labeled, and the labeling results are shown in Tab. 1. The time intervals whose content is difficult to describe are labeled as unknown and are not used. The audio events that occur rarely in the dataset are not labeled and not used. Since the silence class can be easily classified through an energy threshold, it is also not used. The audio recordings are split into segments according to the labels. The audio segments are converted to mono, down-sampled to 16 kHz, and framed using a Hamming window. The frame length/shift is 32/16 ms. For each frame, several features are extracted. MFCCs, as one of the most effective audio features, are adopted first. Other features proposed in work on content-based audio analysis are also adopted, including energy entropy, signal energy, zero crossing rate, spectral rolloff, spectral centroid and spectral flux.
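As a worked example of the framing parameters: at 16 kHz, a 32 ms frame contains 512 samples and a 16 ms shift corresponds to 256 samples, so a 1-second segment yields $1 + \lfloor (16000 - 512) / 256 \rfloor = 61$ frames. The sketch below performs this framing with a Hamming window; the function name is hypothetical and the handling of a trailing partial frame (it is dropped) is an assumption.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=32, shift_ms=16):
    """Split a mono signal into Hamming-windowed frames.

    Returns an array of shape (n_frames, frame_len); a trailing partial
    frame is dropped (assumption of this sketch).
    """
    frame_len = sample_rate * frame_ms // 1000    # 512 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000        # 256 samples at 16 kHz
    n_frames = max(0, 1 + (len(signal) - frame_len) // shift)
    starts = shift * np.arange(n_frames)[:, None]
    idx = starts + np.arange(frame_len)[None, :]
    return signal[idx] * np.hamming(frame_len)

# One second of a constant signal gives 61 frames of 512 samples each.
frames = frame_signal(np.ones(16000))
```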

The evaluation metrics are the segment-based versions of audio event error rate (AEER), precision (Pre), recall (Rec), and F1-measure, which are defined as follows:

$$AEER = \frac{De + In + Su}{Num}, \qquad Pre = \frac{ce}{es}, \qquad Rec = \frac{ce}{gt}, \qquad F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}$$

Here $Num$, $De$, $In$, and $Su$ are, for a specific segment, the number of events to classify, the number of deletions, the number of insertions, and the number of substitutions respectively. $gt$, $es$, and $ce$ denote respectively the number of ground truth, estimated and correctly estimated audio events for a given audio segment. Segment-level metrics are averaged over the segments in the test set.
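A per-segment computation of these metrics can be sketched as follows. The text does not spell out how deletions, insertions and substitutions are counted for a segment; this sketch assumes the common convention that a missed event paired with a false event counts as one substitution, remaining missed events as deletions, and remaining false events as insertions. The function name is hypothetical.

```python
def segment_metrics(ground_truth, estimated):
    """Return (AEER, Pre, Rec, F1) for one segment.

    ground_truth, estimated: iterables of event labels for the segment.
    """
    gt, es = set(ground_truth), set(estimated)
    ce = len(gt & es)                      # correctly estimated events
    missed = len(gt) - ce
    false = len(es) - ce
    # Assumed convention: pair misses with false events as substitutions.
    su = min(missed, false)
    de = max(0, missed - false)
    ins = max(0, false - missed)
    aeer = (de + ins + su) / len(gt)
    pre = ce / len(es) if es else 0.0
    rec = ce / len(gt)
    f1 = 2 * pre * rec / (pre + rec) if pre + rec > 0 else 0.0
    return aeer, pre, rec, f1

# An overlap of speech and music where music is confused with a gunshot.
print(segment_metrics({"speech", "music"}, {"speech", "gunshot"}))
```

For the example above there is one substitution out of two events, so AEER, Pre, Rec and F1 are all 0.5.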

Experimental Setting
In order to determine the parameters of the topic models, one episode is chosen from each dataset to construct an experiment dataset. For the remaining episodes in each dataset, leave-one-out cross validation is adopted. Each time one episode is chosen as the test set, and the rest form the training set. The average performance over all training-test combinations is taken as the final result.
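The episode-level protocol can be sketched as a small generator. This is only an illustration: the function names are hypothetical, the episode identifiers are placeholders, and `evaluate` stands for whatever routine trains a system on the training episodes and returns a score on the test episode.

```python
def leave_one_out(episodes):
    """Yield (training_episodes, test_episode) pairs, one per episode."""
    for i, test_episode in enumerate(episodes):
        yield episodes[:i] + episodes[i + 1:], test_episode

def cross_validate(episodes, evaluate):
    """Average a score over all leave-one-out splits.

    evaluate: hypothetical callable (train_episodes, test_episode) -> float
    """
    scores = [evaluate(train, test) for train, test in leave_one_out(episodes)]
    return sum(scores) / len(scores)

# Four remaining episodes give four training-test combinations.
splits = list(leave_one_out(["ep2", "ep3", "ep4", "ep5"]))
```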
The proposed systems are compared with the baseline system and the ISO-CLUSTER system, both proposed in [19]. In [19], the authors only considered the overlapped segments in which one non-speech audio event overlaps with speech; other overlaps of two or more audio events, with or without speech, were not used. In this paper, the overlapped segments of two or more audio events, whether the audio events are speech or non-speech, are all considered. The baseline system is constructed from several SVM classifiers, and the 1 vs. 1 multi-class classification strategy is used. Both the segments of isolated audio events and the segments of overlapped audio events in the training set are used to train the SVM classifiers. The overlapped segments are divided evenly among the corresponding classes. For example, for the overlapped segments that contain the two audio events A and B, 50% of the segments are included in class A, and the other 50% in class B. The ISO-CLUSTER system is trained with segments of isolated and overlapped audio events as proposed in [19]. To be consistent with [19], all SVM classifiers use the RBF kernel function. The parameter of the kernel function and the penalty factor of the SVM are determined by 5-fold cross validation.
The baseline and the ISO-CLUSTER systems classify an overlapped segment as a single audio event. In other words, they cannot recognize two or more audio events in an overlapped segment. Obviously, for an overlapped segment, we wish to recognize as many of its audio events as possible. Recognizing two or more audio events in an overlapped segment helps to analyze the audio scene, which cannot be done well by recognizing only one audio event, and it is also useful in other applications. For example, for an overlapped audio segment of the type speech&bus, if both audio events are correctly recognized, we can infer that it is an outdoor scene, which cannot be inferred from speech alone. The labeling results of the two datasets show that most of the overlaps involve two audio events, so in order to classify an overlapped segment, we design the systems as follows: for AT, the 2 audio events determined through (7) are taken as the classification result; for PLSA, the 2 audio events determined through (8) are taken as the classification result; for AT-PLSA, 5 potential audio events are first found by AT, and then among them the 2 most likely audio events are determined by PLSA and taken as the classification result; for PLSA-AT, 5 potential audio events are first found by PLSA, and then among them the 2 with the smallest perplexity values are taken as the classification result.
We also test the performance of the proposed systems in classifying isolated audio events. An SVM classifier trained with some overlapped segments and some isolated segments (the segments that contain isolated audio events) is used to determine whether a test segment is overlapped or isolated. For an isolated segment, the classification strategy is similar to that for an overlapped segment. For AT/PLSA, the most likely audio event determined through (7)/(8) is taken as the classification result, and for AT-PLSA and PLSA-AT, among the 5 potential audio events, the most likely one determined by PLSA (for AT-PLSA) or by AT (for PLSA-AT) is taken as the classification result. With the classification results of isolated and overlapped audio events, the overall classification performance of the systems is also tested.
The burn-in time of the Gibbs sampler is set to 1000, and the parameters α and β are set to 200/W and 50/T respectively, as suggested in [11].

Determine W and T
To run the proposed systems, the size of the dictionary, W, and the number of topics, T, should be determined in advance. In our experiments, the optimal W and T are found by a full grid search over the F1-measure surface obtained on the experiment dataset. From each dataset, excluding the episode chosen to construct the experiment dataset, 3 of the remaining episodes are randomly chosen, and the resulting 6 episodes from the two datasets are used to construct the training set. The systems are trained on it and are then tested on the experiment dataset with different W and different T, as shown in Fig. 2.
It can be seen that for AT-PLSA, generally, W should be no less than 700, and for PLSA-AT, generally, W should be no less than 500.When W is big enough, T can decrease appropriately.An appropriate W can well describe the content of the audio corpus, and an appropriate T can well discover the latent semantic structure of the audio corpus.For AT and AT-PLSA, W and T are set to be 700 and 140.For PLSA and PLSA-AT, they are set to be 500 and 100.

Classification Results
In this section, the proposed systems are compared with the baseline system and the ISO-CLUSTER system. Table 2 and Table 3 show the classification results of overlapped segments on the BOB and Friends datasets respectively. Table 4 and Table 5 show the classification results of isolated segments on the BOB and Friends datasets respectively. Table 6 and Table 7 show the overall classification results (the classification results on the complete test set, including isolated segments and overlapped segments) on the BOB and Friends datasets respectively.
From Tab. 2 and Tab. 3 it can be seen that, when classifying overlapped audio events on the Friends dataset, the 4 proposed systems perform much better than the baseline and the ISO-CLUSTER systems; on the BOB dataset, the 3 proposed systems other than PLSA perform better than the baseline and ISO-CLUSTER systems in terms of Rec and F1-measure, but a little worse in terms of AEER and Pre. Comparing AT with AT-PLSA, they perform similarly on both datasets, which means that the classification performance of AT is not enhanced by combining it with PLSA, so for classifying overlapped audio events, AT alone is enough. Comparing PLSA with PLSA-AT, PLSA-AT performs better than PLSA on both datasets, which means that, for classifying overlapped audio events, the classification performance of PLSA can be enhanced by combining it with AT. In summary, AT can explore the authorship information of overlapped audio segments well, and therefore has the biggest advantage in classifying overlaps. On both datasets, the performance of ISO-CLUSTER is worse than that of the baseline system, which does not agree with the experimental results in [19]. This may be because in our experiments all overlaps of two or more audio events, whether speech or non-speech, are used, so the classification situation is more complex than that in [19]. Moreover, the imbalance problem of ISO-CLUSTER in constructing the decision tree would also cause performance degradation.
From Tab. 4 and Tab. 5 it can be seen that on both datasets, except for PLSA, the other 3 proposed systems perform better than the baseline and the ISO-CLUSTER systems from the perspective of all evaluation metrics, and among them, AT performs best.This means that the 3 systems (AT, AT-PLSA, PLSA-AT) proposed to classify overlaps can also better classify isolated audio events, and AT not only has the biggest advantage in classifying overlapped audio events, but also has the biggest advantage in classifying isolated audio events.PLSA performs a little worse, but its performance can be enhanced by combining it with AT (see the performance of PLSA-AT).
From Tab. 6 and Tab. 7 it can be seen that for dataset BOB, the baseline and ISO-CLUSTER systems perform much better than the 4 proposed systems in terms of AEER and Pre, while AT, AT-PLSA and PLSA-AT perform better in terms of Rec and F1. For dataset Friends, the 4 proposed systems all perform better than the baseline and ISO-CLUSTER systems on all evaluation metrics. In summary, except for PLSA on dataset BOB, the overall classification performance of our proposed systems is much better than that of the baseline and ISO-CLUSTER systems. Among the 4 proposed systems, AT performs best; PLSA performs worst, but its performance can be improved by combining it with AT (see the performance of PLSA-AT), and after combination the resulting PLSA-AT performs similarly to AT-PLSA.
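To make the trade-off between Pre, AEER, Rec and F1 concrete, the following is a minimal sketch of how such metrics can be computed for multi-label segments. The exact definitions used in this paper are given outside this section, so two things here are assumptions: Pre, Rec and F1 are micro-averaged over event labels, and AEER is taken as (substitutions + deletions + insertions) divided by the number of reference events, with per-segment counts derived from the false positives and false negatives. The function name `segment_metrics` and the example labels are hypothetical.

```python
def segment_metrics(reference, predicted):
    """Micro-averaged Pre, Rec, F1 and an event error rate.

    reference, predicted: lists of sets of event labels, one set per segment.
    The error-rate formula is an assumed formulation, not necessarily
    the paper's exact AEER definition.
    """
    tp = fp = fn = 0
    subs = dels = ins = 0
    for ref, hyp in zip(reference, predicted):
        seg_fp = len(hyp - ref)   # events output but not present
        seg_fn = len(ref - hyp)   # events present but missed
        tp += len(ref & hyp)
        fp += seg_fp
        fn += seg_fn
        # A missed event paired with a spurious one counts as a substitution.
        seg_subs = min(seg_fp, seg_fn)
        subs += seg_subs
        dels += seg_fn - seg_subs
        ins += seg_fp - seg_subs
    n_ref = tp + fn
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / n_ref if n_ref else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    aeer = (subs + dels + ins) / n_ref if n_ref else 0.0
    return {"Pre": pre, "Rec": rec, "F1": f1, "AEER": aeer}


# Hypothetical example: three segments, two of them overlaps.
reference = [{"speech", "music"}, {"laughter"}, {"speech", "door"}]
predicted = [{"speech"}, {"laughter", "music"}, {"speech", "door"}]
print(segment_metrics(reference, predicted))
```

Under these assumed definitions, a system that outputs a single event per overlap can never reach full recall on overlapped segments, while a system that outputs several events risks more insertions, which lowers Pre and raises the error rate.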

Testing on the Overlaps in Training Set
Testing on the overlaps in training set means that the types of overlap being tested have appeared in the training set. In practical applications, we would try to collect as many types of overlap as possible for the training set, in case they appear in the test set. If a type of overlap has been collected for training, we hope that the system will recognize it once it appears in the test set. In this section we test the ability of the proposed systems to recognize the types of overlap that have appeared in the training set. In each round of the leave-one-out cross validation in the test stage, the types of overlap in the test set that have appeared in the training set are chosen for testing. The classification results are shown in Tab. 8 and Tab. 9.
From Tab. 8 it can be seen that when classifying the overlaps in training set for dataset BOB, the 3 proposed systems other than PLSA perform better than the baseline and ISO-CLUSTER systems in terms of Rec and F1, but much worse in terms of AEER and Pre. Among the 4 proposed systems, PLSA performs worst, while the other 3 systems perform similarly. From Tab. 9 it can be seen that when classifying the overlaps in training set for dataset Friends, the 4 proposed systems perform better than the baseline and ISO-CLUSTER systems in terms of AEER, Rec and F1, but a little worse in terms of Pre. Once again, among the 4 proposed systems, PLSA performs worst, while the other 3 systems perform similarly. In summary, AT, AT-PLSA and PLSA-AT have similar ability in classifying the overlaps in training set, while PLSA is relatively weak at this task.

Testing on the Overlaps Out of Training Set
Testing on the overlaps out of training set means that the types of overlap being tested have never appeared in the training set, but the audio events in such overlaps have appeared in the training set, either as isolated events or within other types of overlap. In practical applications, although we would try our best to collect as many types of overlap as possible for the training set, there will always be cases where the type of overlap being tested has never appeared in the training set. This is because in real life an audio document may contain many audio events, so the number of their combinations, that is, the number of types of overlap, would be very large, and collecting all types of overlap for the training set would be unrealistic. In such cases, it is very important for the system to be able to recognize the overlaps out of training set.
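The distinction between overlaps in and out of the training set can be illustrated with a short sketch. It assumes that each segment is annotated with a set of event labels and that a "type of overlap" is identified by that label set; the function name `split_by_overlap_type` and the example labels are hypothetical and not taken from the experiments.

```python
def split_by_overlap_type(train_labels, test_labels):
    """Partition overlapped test segments by overlap type.

    train_labels, test_labels: lists of sets of event labels per segment.
    Returns three lists of test indices:
      in_train     - the same label combination occurs in training
      out_of_train - the combination is unseen, but every event in it
                     occurs somewhere in training (isolated or overlapped)
      skipped      - the overlap contains an event never seen in training
    """
    seen_types = {frozenset(s) for s in train_labels if len(s) >= 2}
    seen_events = set().union(*train_labels)
    in_train, out_of_train, skipped = [], [], []
    for i, labels in enumerate(test_labels):
        if len(labels) < 2:
            continue  # isolated segment, not an overlap
        key = frozenset(labels)
        if key in seen_types:
            in_train.append(i)
        elif key <= seen_events:
            out_of_train.append(i)
        else:
            skipped.append(i)
    return in_train, out_of_train, skipped


# Hypothetical example labels.
train = [{"speech"}, {"music"}, {"speech", "music"}, {"door", "laughter"}]
test = [{"speech", "music"}, {"speech", "door"}, {"speech", "phone"}, {"music"}]
print(split_by_overlap_type(train, test))
```

In this example, the overlap of speech and door is out of the training set: the combination was never observed, but each of its events was seen, once as an isolated event and once inside another overlap.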
In this section we test the ability of the proposed systems to recognize the types of overlap that have never appeared in the training set. In each round of the leave-one-out cross validation in the test stage, the types of overlap in the test set that have never appeared in the training set are chosen for testing. The classification results are shown in Tab. 10 and Tab. 11. [...] performance indicator to evaluate the classification system. Our proposed systems can classify the overlaps out of training set well, as long as the audio events in such overlaps have appeared in the training set, whether as isolated events or within other types of overlap. This indicates that, based on the AT model and the PLSA model, the proposed systems can discover the latent semantic structure of the audio corpus, so that when a new type of overlap appears, even though it has never appeared in the training set, the proposed systems can still recognize it based on latent semantic similarity.

Conclusions and Future Work
In this paper, we focus on the audio overlap classification problem, which is a big challenge in the audio classification field. Inspired by AT and PLSA, both of which were first proposed for text analysis, we propose 4 systems, i.e. AT, PLSA, AT-PLSA and PLSA-AT, to address it. Compared with the baseline and ISO-CLUSTER systems, the proposed systems have the following advantages: in general, they work better in classifying both overlapped and isolated audio events, and in classifying the types of overlap both in and out of training set; and they are able to recognize two or more audio events in an overlap, which the baseline and ISO-CLUSTER systems cannot do.
Audio event classification is a more controlled task, while audio event detection is closer to real applications. In the future, more work will be done to extend the proposed systems to the general audio event detection problem. One direction is that, for each frame of the audio document, one of the two models AT and PLSA is used to find the active audio events and the other is used to confirm the result; some post-processing, such as smoothing, is then performed to improve the detection result.
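As a rough illustration of this future direction only, the sketch below takes frame-level binary decisions for one event class from two models, keeps a frame as active only when both agree, and then applies majority-vote (median) smoothing over a sliding window. The agreement rule, the window size and the function name `confirm_and_smooth` are assumptions made for illustration; the paper does not specify how confirmation or smoothing would be carried out.

```python
def confirm_and_smooth(primary, secondary, window=5):
    """Combine two frame-level decision tracks and smooth the result.

    primary, secondary: lists of 0/1 decisions for one event class,
    one entry per frame (hypothetical outputs of the two models).
    window: odd number of frames used for majority-vote smoothing.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    # Confirmation step (assumed rule): both models must mark the frame active.
    confirmed = [a & b for a, b in zip(primary, secondary)]
    half = window // 2
    smoothed = []
    for i in range(len(confirmed)):
        lo = max(0, i - half)
        hi = min(len(confirmed), i + half + 1)
        votes = confirmed[lo:hi]
        # Active only if a strict majority of neighbouring frames is active.
        smoothed.append(1 if 2 * sum(votes) > len(votes) else 0)
    return smoothed


primary = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0]
secondary = [0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0]
print(confirm_and_smooth(primary, secondary, window=3))
```

With a window of 3 frames, the one-frame gap inside the active region is filled and the isolated one-frame detection near the end is removed.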

Fig. 1. Graphical model for the author-topic model [11].

Fig. 2. The F1-measure contours at different W and different T, obtained on the experiment dataset: (a) for AT-PLSA, and (b) for PLSA-AT.
The classification performances of the systems on dataset BOB when classifying the overlaps in training set.