Music Boundary Detection using Convolutional Neural Networks: A comparative analysis of combined input features

The analysis of the structure of musical pieces remains a challenging task for Artificial Intelligence, especially in the field of Deep Learning. It requires prior identification of the structural boundaries of the music pieces. This structural boundary analysis has recently been studied with unsupervised methods and \textit{end-to-end} techniques such as Convolutional Neural Networks (CNN) that use Mel-Scaled Log-magnitude Spectrograms (MLS), Self-Similarity Matrices (SSM) or Self-Similarity Lag Matrices (SSLM) as inputs and are trained with human annotations. The published studies, divided into unsupervised and \textit{end-to-end} methods, pre-process their inputs in different ways, using different distance metrics and audio features, so a generalized pre-processing method to compute model inputs is missing. The objective of this work is to establish a general pre-processing method for these inputs by comparing the inputs calculated from different pooling strategies, distance metrics and audio features, also taking into account the computing time needed to obtain them. We also establish the most effective combination of inputs to be delivered to the CNN in order to find the most efficient way to extract the boundaries of the structure of the music pieces. With an adequate combination of input matrices and pooling strategies we obtain an $F_1$ measure of 0.411, which outperforms the current result obtained under the same conditions.


I. INTRODUCTION
Music Information Retrieval (MIR) is the interdisciplinary science of retrieving information from music. MIR is a field of research that addresses different tasks in automatic music analysis, such as pitch tracking, chord estimation, score alignment or music structure detection. One of the most active communities and references in MIR is the Music Information Retrieval Evaluation eXchange (MIREX). This is the community that every year holds the International Society for Music Information Retrieval Conference (ISMIR). Algorithms are submitted to be tested on MIREX's datasets within the different MIR tasks. Most of the previous results analyzed and compared in this work have been presented in different MIREX campaigns.
The automatic structural analysis or Music Structure Analysis (MSA) of music is a very complex challenge that has been studied in recent years [1], but it has not yet been solved with an adequate accuracy that surpasses the analysis performed by musicians or specialists. This kind of analysis is only a part of the musical analysis, which involves musical aspects like harmony, timbre and tempo, and segmentation principles like repetition, homogeneity and novelty [2]. This automatic music analysis can be faced starting from music representations such as the score of the piece, the MIDI file of the piece, or the raw audio file.
In music, form refers to the structure of a musical piece, which consists of dividing the piece into small units, starting with the motifs, then the phrases, and finally the sections that express a musical idea. Boundary detection is the first step in musical form analysis and must be done before naming the different segments according to the similarity between them. This last step is called Labelling or Clustering. Translated to the most common genre in MIREX datasets, the pop genre, this task would be the detection and extraction of the chorus, verse, or introduction of the corresponding song. Detecting the boundaries of music pieces consists of identifying the transitions where these parts begin and end, a task that professional musicians do almost automatically by listening to a piece of music. This detection of the boundaries in a musical piece is based on the Audio Onset Detection task, which is the first step for several higher-level music analysis tasks such as beat detection, tempo estimation, and transcription.
This problem can be addressed with different techniques that have in common the need to pre-process the audio files in order to extract the desired audio features and then apply unsupervised or supervised methods. Several studies perform this pre-processing step in different ways, so there is not yet a generalized input pre-processing method. The best-performing end-to-end methods currently use CNNs trained with human annotations. The inputs to the CNN are Mel-Scaled Log-magnitude Spectrograms (MLSs) [3], Self-Similarity Lag Matrices (SSLMs) in combination with the MLSs [4], and combinations of these matrices with chromas [5].
One of the limitations of these methods is that the analysis and results obtained depend largely on the database annotator, since there can be inconsistencies between different annotators when analyzing the same piece. Therefore, these methods are limited by the quality of the labels given by the annotators and cannot outperform them. This paper deals with the issue of structure detection in music pieces. In particular, we compare different methods of detecting the boundaries between musical sections by means of Convolutional Neural Networks. The paper is structured as follows: Section II presents an overview of the related work and the previous studies on which this work is based. The Self-Similarity Matrices and the datasets used are also presented. In Section III, the pre-processing method of the matrices that will be used as inputs of the neural network (NN) is explained. Section IV introduces the database used for training, validating and testing, and the labelling process. Section V shows the NN structure and the thresholding and peak-picking strategies, and Section VI describes the metrics used to test the model and presents the results of the experiments and their comparison with previous studies. Finally, Section VII presents the discussion and Section VIII discusses proposals for future lines of work. All code used in this paper, including the pre-trained models of every case of study in this work, is made publicly available, and further results are shown on the website.

II. RELATED WORK
Several studies have been carried out in the field of structure recognition in music since Foote introduced the self-similarity matrix (SSM) in 1999 [6] and Goto derived from it the self-similarity lag matrix (SSLM) in 2003 [7]. Before the introduction of the SSMs and SSLMs, the studies were based on processing spectrograms [8], but in recent years it has been demonstrated that SSMs and SSLMs calculated from audio features in combination with spectrograms provide better results. We describe some previous works on both unsupervised and supervised methods which belong to the MIREX task Music Structure Segmentation.

A. Unsupervised Methods
The main idea of most of the unsupervised methods is to extract the musical structure of the music pieces but not necessarily the boundaries between the structure sections.
According to Paulus et al. [9], these methods can be summarized in three approaches based on novelty, homogeneity and repetition. These approaches are computed with unsupervised Machine Learning algorithms such as genetic algorithms (fitness functions), Hidden Markov Models (HMM), K-means, Linear Discriminant Analysis (LDA), Decision Stump or checkerboard-like kernels.
The Novelty-based approach consists of detecting the transitions between contrasting parts [1]. This approach performs well with checkerboard-like kernel methods, which were introduced by Foote in 2000 [10]. These methods have evolved over the years, and multiple-temporal-scale kernels, such as those of Kaiser and Peeters in 2013 [11], outperformed previous works by proposing a fusion of the novelty and repetition approaches.
The Homogeneity-based approach is based on the identification of sections that are consistent with respect to their musical properties [1]. These methods use Hidden Markov Models, as in Logan and Chu [12], Aucouturier and Sandler [13] and Levy and Sandler [14], or combinations of SSMs, as in Tralie and McFee [15] and McFee and Bello [16].
The Repetition-based approach refers to finding recurring patterns. These methods apply a clustering algorithm to the SSMs or SSLMs. They are more applicable for labeling the structural parts of music pieces rather than precise segmentation which is required by boundary detection. Lu et al. in 2004 [17], Paulus and Klapuri in 2006 [18], Turnbull et al. [19], McFee and Ellis [20], and McCallum [21] are examples of this method.
To conclude, we can affirm that unsupervised algorithms are very effective at the labelling (clustering) part but not at the boundary detection task, which is better performed by supervised neural networks. These appeared in 2014 and are described in section II-B.

B. Supervised Neural Networks
Supervised neural networks learn from input representations given the ground truth, which are the label annotations of the targets (Fig. 1).
Previous studies of boundary detection used Mel-Scaled Log-magnitude Spectrograms (MLS) as the inputs of CNNs [3]. This method was based on the Audio Onset Detection task [22], which consists of finding the starting points of every musically relevant event in an audio signal, specifically the beginning of a music note. This task can be interpreted as a computer vision problem, like edge detection, but applied to spectrograms instead of images with different textures.
Later on, in 2015, Grill and Schlüter improved their previous work by adding SSLMs, which yielded better results [4], and then by adding SSLMs with different lag factors to the input of the CNN [5], which outperformed the earlier method and reached the best result to date.
In Tables I and II we show a recap of the results of almost all of the previous works on boundary detection using both unsupervised methods and supervised neural networks. The results and algorithm nomenclature in Table I have been extracted from MIREX campaigns of different years. The results obtained with unsupervised methods in Table I are not as high as those obtained with supervised neural networks because, as mentioned in section II-A, the main goal of the unsupervised methods is not the boundary detection (segmentation) itself but the full structure identification (labelling).

C. Self-Similarity Matrices (SSMs)
The Self-Similarity Matrix [2] is a tool not only used in music structure analysis but also in time series analysis tasks. In these matrices, the different parts of the structure of a music piece can be identified as homogeneous regions. This representation of the structural elements of music analysis leads this matrix and its combination with spectrograms to be the input of almost every model described in sections II-A and II-B. For this work, this matrix is important because music is in itself self-similar, in other words, it is formed by similar time series.
Self-Similarity Matrices have been used under the name of Recurrence Plot for the analysis of dynamic systems [23], but they were introduced to the music domain by Foote [6] in 1999, and since then different techniques for computing these matrices have appeared. The SSM relies on the concept of self-similarity, which is measured by a similarity function that is applied to the audio feature representation. The similarity between two feature vectors $y_n$ and $y_m$ is a function $s$ that can be expressed as Eq. 1 shows. The result is an $N$-square matrix $\mathrm{SSM} \in \mathbb{R}^{N \times N}$, $N$ being the time dimension:
$$\mathrm{SSM}(n, m) = s(y_n, y_m) \qquad (1)$$
where $n, m \in [1, \dots, N]$.
The similarity function is obtained by calculating a distance between the two feature vectors $y$ mentioned before. In the literature, this distance is usually calculated as the Euclidean distance $\delta_{eucl}$ or the cosine distance $\delta_{cos}$:
$$\delta_{eucl}(u, v) = \lVert u - v \rVert_2 \qquad (2)$$
$$\delta_{cos}(u, v) = 1 - \frac{\langle u, v \rangle}{\lVert u \rVert \, \lVert v \rVert} \qquad (3)$$
where $u$ and $v$ are time series vectors. Self-Similarity Matrices can be computed from different audio feature representations, such as MFCCs or chromas, and they can also be obtained by combining different frame-level audio features [15]. Once the similarity function has been computed for each pair of audio feature vectors and the SSM has been calculated, we can filter the SSM by applying thresholding techniques, smoothing or transposition invariance. The SSM can also be obtained with other techniques such as clustering methods, as Serra et al. proposed [33], where the SSM is obtained by applying the k-nn algorithm.
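As an illustration of the two distance metrics and of how an SSM is filled from them, the following minimal sketch builds the matrix from a toy feature sequence. The function names and the toy data are assumptions made for this example and are not taken from the released code. Note that the matrix holds distances, so low values indicate similar frames.

```python
import numpy as np

def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine similarity of u and v."""
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def self_similarity_matrix(features, distance):
    """Build an N x N matrix of pairwise distances.

    features: array of shape (N, d), one feature vector per time frame.
    """
    n = features.shape[0]
    ssm = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            ssm[a, b] = distance(features[a], features[b])
    return ssm

# Toy input (made up): frames 0 and 2 are identical, frame 1 differs.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
ssm_eucl = self_similarity_matrix(frames, euclidean_distance)
ssm_cos = self_similarity_matrix(frames, cosine_distance)
```

In this toy case the repeated frame shows up as a zero at positions (0, 2) and (2, 0) of both matrices, which is the kind of off-diagonal structure that repetition-based analysis looks for.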
After Foote defined the SSM in 1999, Goto [7] defined in 2003 a variant of the SSM known as the Self-Similarity Lag Matrix (SSLM). The SSLM is a matrix that represents the similarities between low-level features of one point in time and points in the past, up to a certain lag time. This representation makes it possible to plot the relations between past events and their repetitions in the future. Some approaches calculate this SSLM after computing the SSM or the recurrence plot, as shown in Eq. 4, with $i = 1, \dots, N$, $j = 1, \dots, L$ and $k = (i + j - 2) \bmod N$. The dimensions of this matrix are not $N \times N$ as for the SSM but $N \times L$, $L$ being the lag time factor. That means that the SSLM is a non-square matrix: $\mathrm{SSLM} \in \mathbb{R}^{N \times L}$.
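The idea of comparing each frame only with its recent past can be sketched directly, without going through the full SSM. The code below is an illustrative assumption rather than the construction of Eq. 4: it stores in column l − 1 the distance between frame i and frame i − l, and it fills entries that would reach before the start of the piece with 0, a placeholder chosen only for this sketch.

```python
import numpy as np

def lag_matrix(features, max_lag, distance):
    """Distance between frame i and frame i - l for l = 1..max_lag.

    Returns an array of shape (N, max_lag). Entries whose past frame
    would fall before the start of the piece are left at 0 here; this
    fill value is a placeholder chosen for the sketch.
    """
    n = features.shape[0]
    sslm = np.zeros((n, max_lag))
    for i in range(n):
        for l in range(1, max_lag + 1):
            if i - l >= 0:
                sslm[i, l - 1] = distance(features[i], features[i - l])
    return sslm

def euclid(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

# Toy sequence (made up) that repeats every 2 frames.
frames = np.array([[0.0], [1.0], [0.0], [1.0], [0.0], [1.0]])
sslm = lag_matrix(frames, max_lag=3, distance=euclid)
```

For this toy sequence the column for lag 2 is zero from the third frame onwards, so a repetition appears as a line of low distances at a constant lag.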
The choice of the audio feature representation for computing the SSMs or SSLMs, and the choice between SSMs and SSLMs, is one of the most important steps when solving a MIR task and has to be studied depending on the issue we want to address.

D. Datasets
Previous works have been tested in the annual Music Information Retrieval Evaluation eXchange (MIREX [36]), which is a framework for evaluating music information retrieval algorithms.
The first dataset of the MIREX campaign for the structure segmentation task was the MIREX09 dataset, consisting of a collection of The Beatles' songs plus another smaller dataset. The Beatles dataset has two annotation versions: the Paulus Beatles or Beatles-TUT dataset, and the Isophonic Beatles or Beatles-ISO dataset. The second MIREX dataset was MIREX10, formed by the RWC [37] dataset. This dataset also has two annotation versions: RWC-A, from the QUAERO project, which is the one that corresponds to MIREX10, and RWC-B [38], which is the original annotated version following the annotation guidelines established by Bimbot et al. [39].
A few years later, the MIREX12 dataset provided a greater variety of songs than MIREX10 [40]. MIREX12 is formed by the "Structural Analysis of Large Amounts of Music Information" (SALAMI) dataset, which has evolved into its more recent version, the SALAMI 2.0 database. The analysis of the MIREX structure segmentation task was published in 2012 [41]. Our work uses the publicly available SALAMI 2.0 dataset.

III. AUDIO PROCESSING
This work builds on the previous works of Schlüter, Grill et al. [3], [4], who propose a pre-processing method to obtain the SSLMs from MFCC features. We extend these works by calculating the SSLMs from chroma features and by also applying the Euclidean distance, which has not been considered in preliminary studies, to compute the SSLMs, in order to provide a comparison and find the best-performing input to the NN model.

A. Mel Spectrogram
The first step of the pre-processing part is to extract the audio features. To do that, we first compute the Short-Time Fourier Transform (STFT) with a Hanning window of 46 ms (2048 samples at a 44.1 kHz sample rate) and an overlap of 50%, as Grill et al. proposed [4]. Then, we apply a mel-scaled filterbank of 80 triangular filters from 80 Hz to 16 kHz and scale the magnitudes logarithmically to obtain the mel-spectrogram (MLS). We used the librosa library [42] to compute the mel-spectrogram. After obtaining the MLS, we apply a max-pooling of p = 6 in the temporal dimension to give the Neural Network an input of manageable size. The size of the MLS matrix is P × N, with P being the number of frequency bins (equal to the number of triangular filters) and N the number of time frames. We define $x_i$, with $i = 1, \dots, N$, as the i-th frame of the MLS.
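The frame arithmetic and the temporal max-pooling described above can be checked with a short sketch. The window length, sample rate and overlap come from the text; the helper name `max_pool_time`, the toy matrix and the policy of dropping incomplete trailing windows are assumptions of this example.

```python
import numpy as np

SAMPLE_RATE = 44100   # Hz, from the text
WINDOW = 2048         # samples, from the text
HOP = WINDOW // 2     # 50% overlap -> 1024 samples

window_ms = 1000.0 * WINDOW / SAMPLE_RATE   # about 46.4 ms
hop_ms = 1000.0 * HOP / SAMPLE_RATE         # about 23.2 ms

def max_pool_time(matrix, factor):
    """Max-pool a (P, N) matrix along time (axis 1) by `factor`.

    Trailing frames that do not fill a complete pooling window are
    dropped; that edge policy is an assumption of this sketch.
    """
    p, n = matrix.shape
    n_out = n // factor
    trimmed = matrix[:, : n_out * factor]
    return trimmed.reshape(p, n_out, factor).max(axis=2)

# Toy "spectrogram" (made up): 2 bands, 13 frames.
mls = np.arange(26, dtype=float).reshape(2, 13)
pooled = max_pool_time(mls, factor=6)
```

With a pooling factor of 6, each output frame covers six hops, so the time resolution of the pooled MLS is roughly 0.14 s per frame under these settings.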

B. Self-Similarity Lag Matrix from MFCCs
The method that we used to generate the SSLMs is the same method that Grill and Schlüter used in [4] and [5], which in turn derives from Serra et al. [43].
The first step after computing the mel-spectrogram frames $x_i$ is to prepend a block Φ of noise at −70 dB, with a duration of L frames, to the beginning of the mel-spectrogram (Eq. 5), where Φ is a matrix of size L × P whose elements are equal to −70 dB.
Then, max-pooling by a factor of $p_1$ is applied in the time dimension, as shown in Eq. 6.
After that, we apply a Discrete Cosine Transform of Type II to each frame, omitting the first element (Eq. 7).
Here $P$ is the number of mel bands. We then stack the time frames by a factor $m$ to obtain the time series $X_i$ in Eq. 8, whose dimensions depend on $N$, the number of time frames before the max-pooling, and $L$, the lag factor in frames. The final SSLM matrix is obtained by calculating a distance between the vectors $X_i$. In our work, we use two different distance metrics: the Euclidean distance and the cosine distance. This allows us to compare them and conclude which SSLM performs better.
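The DCT and stacking steps can be illustrated with the following sketch. It assumes an unnormalised DCT-II and a stacking that concatenates each frame with its m − 1 predecessors; the scaling and the stacking direction actually used in Eqs. 7 and 8 may differ, and the function names are hypothetical.

```python
import numpy as np

def dct2(frame):
    """Unnormalised DCT-II of a 1-D array."""
    p = frame.shape[0]
    n = np.arange(p)
    k = n.reshape(-1, 1)
    basis = np.cos(np.pi / p * (n + 0.5) * k)   # shape (P, P), row k
    return basis @ frame

def mfcc_like(frames):
    """Apply the DCT-II to each frame and drop the first coefficient.

    frames: array of shape (N, P). Returns shape (N, P - 1).
    """
    return np.stack([dct2(f)[1:] for f in frames])

def stack_frames(features, m):
    """Concatenate each frame with its m - 1 predecessors.

    Returns shape (N - m + 1, m * d); the direction of the stacking
    is an assumption of this sketch.
    """
    n = features.shape[0]
    return np.stack([features[i - m + 1 : i + 1].ravel() for i in range(m - 1, n)])

# Toy mel frames (made up): a flat frame, a ramp, and the reversed ramp.
frames = np.array([[1.0, 1.0, 1.0, 1.0],
                   [1.0, 2.0, 3.0, 4.0],
                   [4.0, 3.0, 2.0, 1.0]])
coeffs = mfcc_like(frames)
stacked = stack_frames(coeffs, m=2)
```

Dropping the first coefficient removes the overall level of the frame, which is why the flat toy frame maps to a vector of zeros.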
Therefore, the distance between two vectors $X_i$ and $X_{i-l}$ using the distance metric $\delta$ is
$$D_{i,l} = \delta(X_i, X_{i-l}) \qquad (9)$$
where $\delta$ is the distance metric as defined in Eqs. 2 and 3. Then, we compute an equalization factor $\varepsilon_{i,l}$ as a quantile $\kappa$ of the distances (Eq. 10). We now remove the first $L/p$ lag bins in the time dimension of the distance matrix $D$ and of the equalization factor matrix $\varepsilon$, and we apply Eq. 6 with max-pooling factor $p_2$. Finally, we obtain the SSLM by applying Eq. 11.
Once the SSLM has been obtained, we need to pad noise at its beginning and end, because the labels used to train our model are given to the NN as Gaussians (see section IV), so the first and last labels need information on their left and right sides respectively. We pad the SSLM and the MLS with γ = 50 time frames of pink noise at the beginning and end. We then normalize each frequency band of the MLS, and each lag band of the SSLMs, to zero mean and unit variance. Note also that if some time frames have exactly the same values, the cosine distance would give a NaN (not-a-number) value. We avoid this by converting all NaN values into zero as the last step of the SSLM computation.
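The NaN replacement and the per-band normalization can be sketched as follows. The matrix layout (one band per row), the helper names and the guard against zero-variance bands are assumptions of this example; the pink-noise padding is not reproduced here.

```python
import numpy as np

def clean_nans(matrix):
    """Replace NaN entries with zero."""
    return np.where(np.isnan(matrix), 0.0, matrix)

def normalise_bands(matrix):
    """Scale each row (frequency or lag band) to zero mean, unit variance.

    Bands with zero variance are only centred, to avoid dividing by zero;
    that guard is an addition of this sketch.
    """
    mean = matrix.mean(axis=1, keepdims=True)
    std = matrix.std(axis=1, keepdims=True)
    safe_std = np.where(std == 0.0, 1.0, std)
    return (matrix - mean) / safe_std

# Toy SSLM (made up): 2 lag bands, 3 time frames, one NaN entry.
sslm = np.array([[1.0, 2.0, 3.0],
                 [4.0, np.nan, 8.0]])
cleaned = clean_nans(sslm)
normalised = normalise_bands(cleaned)
```

Cleaning the NaN values before normalizing matters in practice, since a single NaN would otherwise propagate to the mean and standard deviation of its whole band.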

C. Self-Similarity Lag Matrix from Chromas
The process of computing the SSLM from chroma features is similar to the method explained in section III-B. The difference is that instead of starting by padding the mel-spectrogram in Eq. 5, we pad the STFT. After applying the max-pooling in Eq. 6, we compute the chroma filters instead of computing the DCT in Eq. 7. The rest of the process is the same as described in section III-B.
All the parameter values used to obtain the Self-Similarity Matrices are summarized in Table III. In addition to the Euclidean and cosine metrics, and the MFCC and chroma audio features, we compare two pooling strategies. The first one applies a max-pooling of factor $p_1 = 6$ to the STFT (for the MLS calculation), and to the chromas or MFCCs for the SSLM computation, as described in Eq. 6. The other pooling strategy is the one shown in Fig. 2, where we first apply a pooling of $p_1 = 2$ and then a pooling of $p_2 = 3$ once the SSLMs are obtained. We denote these pooling variants as 6pool and 2pool3 respectively. The total time for processing all the SSLMs (MFCCs and cosine distance) was a factor of 4 lower for 6pool than for 2pool3, because applying a higher pooling factor in Eq. 6 makes the matrices $D$ and $\varepsilon$ much smaller, so computing them takes less time. However, it also implies a resolution loss that can affect the accuracy of the model, as [4] remarks.
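A rough way to see why 6pool is cheaper is to count the entries of the distance matrix under each strategy. The sketch below uses a hypothetical piece length and lag, and it only counts entries; it does not model the measured runtime, which also includes the feature extraction and the quantile computation.

```python
def distance_matrix_entries(n_frames, lag_frames, p1):
    """Number of entries in D when pooling by p1 happens before the SSLM."""
    return (n_frames // p1) * (lag_frames // p1)

def final_time_frames(n_frames, p1, p2):
    """Number of time frames after both pooling stages."""
    return n_frames // (p1 * p2)

# Hypothetical piece: 7200 STFT frames and a lag of 600 frames.
N_FRAMES, LAG_FRAMES = 7200, 600

entries_6pool = distance_matrix_entries(N_FRAMES, LAG_FRAMES, p1=6)
entries_2pool3 = distance_matrix_entries(N_FRAMES, LAG_FRAMES, p1=2)
ratio = entries_2pool3 / entries_6pool

frames_6pool = final_time_frames(N_FRAMES, p1=6, p2=1)
frames_2pool3 = final_time_frames(N_FRAMES, p1=2, p2=3)
```

Both strategies end with the same number of time frames, so the input size seen by the network is unchanged; what differs is the resolution at which the distances are computed.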
The general schema of the pre-processing block is depicted in Fig. 2.

IV. DATASET
The algorithm was trained, validated and tested on a subset of the Structural Analysis of Large Amounts of Music Information (SALAMI) dataset [44]. The SALAMI dataset contains 1048 double-annotated pieces, of which we could obtain 1006, since the dataset does not provide the audio files due to copyright restrictions. For the training of the model, we used the label text files from annotator 1, and for the songs that were not annotated by annotator 1 we used the corresponding text file from annotator 2.
It is important to highlight that, as described in [35], previous works such as [3], [4] and [5] use a private, non-accessible dataset of 733 songs, of which 633 pieces were used for training and 100 for validation. Therefore, we re-implemented the work presented in [4] but trained it on our dataset, composed only of public SALAMI pieces.
Fig. 1. Each background color contains the steps that are necessary to compute each of the inputs: MLS (green), SSLM from Chromas (orange) and SSLM from MFCCs (blue). The red background in the max-pooling blocks refers to the 2 variants used in this work: 2pool3 is the one shown in the scheme, while 6pool is computed by applying the max-pooling of factor 6 in the first red block and removing the second red block of the scheme.

A. Labelling Process
As explained in [3], it is necessary to transform the labels of the SALAMI text files into Gaussian functions so that the Neural Network can be trained correctly. We first set the center values of the Gaussian functions by transforming the labels in seconds into time frames, as shown in Eq. 12, constructing the vector $y$, which contains the centers of the Gaussians and whose dimension equals the number of labels in the text file. In Eq. 12, $label_i$ are the labels in seconds extracted from the SALAMI text file "functions", and $p_1$, $p_2$, $h$, $sr$ and $\gamma$ are defined in Table III.
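The conversion from annotated times to Gaussian targets might look like the sketch below. The mapping from seconds to frames is an assumption built from the quantities named in the text (sample rate, hop size, pooling factors and the γ padding), not a transcription of Eq. 12; the hop size of 1024 samples is inferred from the 50% overlap, and the Gaussian width used here is a hypothetical value in frames chosen so that the bumps are visible.

```python
import numpy as np

def labels_to_frames(labels_s, sr=44100, hop=1024, p1=2, p2=3, gamma=50):
    """Map label times in seconds to (fractional) pooled frame indices.

    Assumed mapping: seconds -> STFT frames (sr / hop per second),
    divided by the total pooling factor p1 * p2, shifted by the gamma
    padding frames added at the start.
    """
    labels_s = np.asarray(labels_s, dtype=float)
    return labels_s * sr / (hop * p1 * p2) + gamma

def gaussian_targets(centres, n_frames, sigma):
    """Sum of unit-height Gaussians centred on each label, clipped to 1."""
    t = np.arange(n_frames, dtype=float).reshape(-1, 1)
    bumps = np.exp(-0.5 * ((t - centres) / sigma) ** 2)
    return np.clip(bumps.sum(axis=1), 0.0, 1.0)

# Two made-up boundary labels at 10 s and 30 s.
centres = labels_to_frames([10.0, 30.0])
target = gaussian_targets(centres, n_frames=400, sigma=2.0)
```

The resulting `target` curve has the same length as the padded input in frames, which is what allows the network output to be compared with it frame by frame.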
Then, we apply a Gaussian function with standard deviation σ = 0.1 and $\mu_i$ equal to each label value in Eq. 12. Eq. 13 gives the expression of the Gaussians of the labels. To train the model, we removed the first tag from each text file, because the first two tags are very close in almost every file and there is no use in the Neural Network identifying the beginning of the file. It is also worth mentioning that we resampled all the songs in the SALAMI database to a single sampling rate of 44100 Hz, as shown in Table III.

V. MODEL
Our work and current methods that tackle the problem of boundary detection in MSA use neural network-based models that were originally developed for image processing tasks, in particular Convolutional Neural Networks (CNN). The model developed in this work for boundary detection is shown in Fig. 3. Once the matrices of the pre-processing step are obtained, they are padded and normalized to form the input of a Convolutional Neural Network (CNN). The obtained predictions are post-processed with a peak-picking and threshold algorithm to obtain the final predictions.

A. Convolutional Neural Network
The model proposed in this paper is nearly the same as the model proposed in [3] and [4], so that we can compare the results and evaluate different input strategies, as Cohen [35] did. However, we take into account more input combinations, with both high and low dimensions, in order to find the best input combination for the model. The model is composed of a CNN whose relevant parameters are shown in Table IV. The difference between this model and the model proposed in [3] and [4] is that our final two layers are not dense layers but convolutional layers in the time dimension: we do not crop the inputs to get a single probability value at the output, but give the Neural Network the whole matrix and obtain a time prediction curve at the output. The general schema of the CNN is shown in Fig. 4. The parameters of the CNN model have been chosen according to previous literature [4] for a fair comparison in the study of how different input features affect the performance of the MSA task. The changes with respect to the state-of-the-art model are the dilation parameter that we use in the layers of our model and the last layer of our implementation. This is because previous studies passed a segment of the SSLM through the CNN, while we pass the entire SSLM to it. The last layer of our implementation outputs one feature map that is passed through a sigmoid function, which outputs the boundary probability of each time frame of the entire music piece, so the output of the model is a vector whose length equals the number of time frames of the input. This differs from the literature models, where the output is the boundary probability of the segmented part of the input.

B. Training Parameters
We trained our CNN with Binary Cross Entropy as the loss function, using BCEWithLogitsLoss in PyTorch [45], whose implementation includes the sigmoid activation of the last layer of the Neural Network, with a learning rate of 0.001 and the Adam optimizer [46]. We perform early stopping during training to determine the best-performing model. The SSLMs and MLS have to be passed to the GPU one by one because they have different lengths, which means that one song at a time is passed forward and backward through the NN. However, to get more robust gradients and a more stable optimization process, the optimizer is executed with the average gradients of batches of 10 songs. In other words, we use a batch size of 1 in terms of GPU calls but a batch size of 10 in terms of training. The models were trained on an Nvidia GTX 980 Ti GPU and we used TensorboardX [47] to graph the loss and F-score of training and validation.
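The "one song per pass, one update per 10 songs" scheme is a form of gradient accumulation. The sketch below illustrates it in NumPy with deliberate simplifications that are assumptions of this example: a linear per-frame model stands in for the CNN, a plain gradient step stands in for Adam, and the songs are random toy data.

```python
import numpy as np

def bce_with_logits_grad(w, x, y):
    """Loss and gradient of mean binary cross-entropy for logits x @ w.

    x: (T, d) frames of one song, y: (T,) targets in [0, 1].
    """
    logits = x @ w
    probs = 1.0 / (1.0 + np.exp(-logits))
    loss = float(np.mean(np.logaddexp(0.0, logits) - y * logits))
    grad = x.T @ (probs - y) / len(y)
    return loss, grad

def train_epoch(w, songs, lr=0.001, accumulate=10):
    """One forward/backward pass per song, one update every `accumulate` songs."""
    grad_sum = np.zeros_like(w)
    updates = 0
    for index, (x, y) in enumerate(songs, start=1):
        _, grad = bce_with_logits_grad(w, x, y)
        grad_sum += grad
        if index % accumulate == 0:
            w = w - lr * grad_sum / accumulate   # plain gradient step, not Adam
            grad_sum = np.zeros_like(w)
            updates += 1
    return w, updates

rng = np.random.default_rng(0)
# 20 made-up "songs" of different lengths, 3 features per frame.
songs = []
for _ in range(20):
    length = int(rng.integers(5, 15))
    frames = rng.normal(size=(length, 3))
    targets = rng.integers(0, 2, size=length).astype(float)
    songs.append((frames, targets))

weights, n_updates = train_epoch(np.zeros(3), songs)
```

Because each song is processed on its own, no padding to a common length is needed, while averaging over 10 songs keeps the update less dependent on any single piece.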

C. Peak-Picking
Peak-picking consists of selecting the peaks of the output signal of the CNN that will be identified as boundaries between the different parts of the song. Each boundary on the output signal is considered true when no other boundary is detected within 6 seconds. Applying a threshold lets us discard boundary values that do not exceed an optimum threshold. We calculate the optimum threshold for our experiments by computing the average $F_1$ on our validation set for all candidate threshold values in the range [0, 1] and selecting the threshold for which this average is highest. The optimum threshold value may reasonably vary when training our model with the different combinations of inputs that we show in Table VI. When we train our model with isolated inputs (see Table V), we compute the threshold with the MLS and do not vary it when testing the models trained on SSLMs. We do vary the threshold value when we train our model with different input combinations, in order to optimize each case of study and obtain the best-performing method (see Table VI). In Fig. 5, we set a threshold of 0.205 for the models using only the MLS as input, and for the rest of the models we used the values indicated in Table VI. From the optimum threshold calculation, we observe that almost all optimum threshold values for each input variant belong to [0.205, 0.26]. Fig. 5 shows the Recall, Precision and F-score values (see Section VI-A) on the testing dataset evaluated for each possible threshold value.
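A simple reading of the peak-picking rule is sketched below: a frame is kept when it exceeds the threshold and is the maximum of its neighbourhood. The symmetric window, the tie handling and the conversion of 6 seconds into frames (which assumes a hop of 1024 samples and a total pooling factor of 6) are assumptions of this example, not a description of the released implementation.

```python
import numpy as np

def pick_peaks(curve, threshold, window_frames):
    """Return indices that exceed `threshold` and are the maximum of
    their neighbourhood of +/- `window_frames` frames."""
    peaks = []
    for i, value in enumerate(curve):
        if value <= threshold:
            continue
        lo = max(0, i - window_frames)
        hi = min(len(curve), i + window_frames + 1)
        is_local_max = value == np.max(curve[lo:hi])
        far_from_last = not peaks or i - peaks[-1] > window_frames
        if is_local_max and far_from_last:
            peaks.append(i)
    return peaks

# Assumed conversion of the 6 s window into pooled frames.
window_6s = round(6.0 * 44100 / (1024 * 6))

# Toy prediction curve (made up), with a small window for readability.
curve = np.array([0.0, 0.1, 0.9, 0.2, 0.3, 0.25, 0.1, 0.0, 0.6, 0.1])
peaks = pick_peaks(curve, threshold=0.205, window_frames=2)
```

In the toy curve the value 0.3 at index 4 exceeds the threshold but is suppressed because the stronger peak at index 2 lies inside its window, which is the behaviour the minimum-distance rule is meant to enforce.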

A. Evaluation Metrics
MIREX's campaigns use two evaluation measures: Median Deviation and Hit Rate. The Hit Rate (also called F-score or F-measure) is denoted by $F_\beta$, where β = 1 is the value most frequently used in previous works. Nieto et al. [48] set a value of β = 0.58, but $F_1$ continues to be the most used metric in MIREX works. We will later give our results for both β values. The Hit Rate score $F_1$ is normally evaluated for ±0.5 s and ±3 s time-window tolerances, but in recent works most of the results are given only for the ±0.5 s tolerance, which is the most restrictive one. We test our model with the MIREX algorithm [49], which gives us the Precision, Recall and F-measure values.
$$\text{Precision: } P = \frac{TP}{TP + FP} \qquad (15)$$
$$\text{Recall: } R = \frac{TP}{TP + FN} \qquad (16)$$
$$\text{F-measure: } F_\beta = (1 + \beta^2) \, \frac{P \cdot R}{\beta^2 P + R} \qquad (17)$$
Where:
• TP: True Positives. Estimated events of a given class that start and end at the same temporal positions as reference events of the same class, taking into account a tolerance time-window.
• FP: False Positives. Estimated events of a given class that start and end at temporal positions where no reference event of the same class does, taking into account a tolerance time-window.
• FN: False Negatives. Reference events of a given class that start and end at temporal positions where no estimated event of the same class does, taking into account a tolerance time-window.
The model is first trained with each input (see Fig. 3) separately in order to determine which input performs best. We trained the model using the MLS and the SSLMs obtained from MFCCs and chromas with Euclidean and cosine distances, and we also give the results for both of the pooling strategies mentioned before, 6pool (lower resolution) and 2pool3 (higher resolution). As mentioned in section IV, we removed the first label of the SALAMI text files, corresponding to the 0.0 s label. Results in terms of F-score, Precision and Recall are shown in Table V. Note that the results shown from previous works used a different threshold value.
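To make the Precision, Recall and F-measure definitions concrete, the following sketch scores a list of estimated boundary times against reference times with a tolerance window. It is a simplification under stated assumptions: boundaries are single time instants, and matching is greedy and one-to-one, whereas the MIREX evaluation code may use a different matching procedure. The boundary times are made up.

```python
def hit_rate(reference, estimated, tolerance=0.5, beta=1.0):
    """Precision, recall and F_beta for boundary times in seconds.

    Greedy one-to-one matching: each estimate is matched to the closest
    unused reference boundary within `tolerance` seconds.
    """
    unused = sorted(reference)
    true_pos = 0
    for est in sorted(estimated):
        candidates = [ref for ref in unused if abs(ref - est) <= tolerance]
        if candidates:
            closest = min(candidates, key=lambda ref: abs(ref - est))
            unused.remove(closest)
            true_pos += 1
    false_pos = len(estimated) - true_pos
    false_neg = len(reference) - true_pos
    precision = true_pos / (true_pos + false_pos) if estimated else 0.0
    recall = true_pos / (true_pos + false_neg) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# Made-up reference and estimated boundaries, in seconds.
reference = [10.0, 30.0, 55.0, 80.0]
estimated = [10.3, 31.0, 54.8, 70.0, 80.4]

p, r, f1 = hit_rate(reference, estimated, tolerance=0.5)
p3, r3, f1_3 = hit_rate(reference, estimated, tolerance=3.0)
_, _, f_058 = hit_rate(reference, estimated, tolerance=0.5, beta=0.58)
```

With the ±0.5 s window three of the five estimates are hits, while the ±3 s window also accepts the estimate at 31.0 s, which shows how strongly the reported score depends on the chosen tolerance.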

B. Results
The best-performing input when training our model with isolated inputs is the MLS, which has an $F_1$ value of 0.389 (see Table V). Considering only the 6pool pooling strategy, among the SSLMs computed from audio features (MFCCs and chromas) the best-performing ones are those computed from the MFCCs, with a difference of more than 5% over the SSLMs computed from chromas.
Regarding the distance measures with which we compute the SSLMs, we found that computing the SSLMs with the Euclidean or the cosine distance does not have a high impact on the results. The $F_1$ difference between the SSLMs computed with Euclidean and cosine distances is not higher than 1%. Overall, the best-performing SSLM for the 6pool pooling strategy is the SSLM MFCCs euclidean, with an $F_1$ value of 0.361, which is 2.8% lower than the MLS $F_1$ value of 0.389.
In view of the results in Table V, we can affirm that applying a max-pooling of 2, then computing the SSLMs and applying another max-pooling of 3 afterwards (2pool3) slightly improves the results but does not have a high impact on performance. The best-performing 2pool3 SSLM, the SSLM MFCCs euclidean, has an $F_1$ value of 0.375, which is less than 2% above the $F_1$ value of 0.361 for the same SSLM computed with the 6pool pooling strategy. This procedure takes much more time both to compute the SSLMs and to train the model, and it does not yield substantially better results in terms of F-score.
In Fig. 6 we show an example of the boundary detection results for some of our input variants on the MLS and SSLMs. We obtained lower results than [4] but higher results than [35], who tried to re-implement [4]. The reason for this difference could be that the database used by Grill and Schlüter [4] to train their model had 733 non-public pieces. Cohen and Peeters [35], as in our work, trained their model only with pieces from the SALAMI database, so our results can be compared with theirs, since we trained, validated and tested our Neural Network with the same database (although they had 732 SALAMI pieces and we had 1006).
2) Inputs Combination: We combine the best-performing inputs in Table V, as done in [4] and later in [35]. A summary of our results can be found in Table VI.
The input combination that performs best in [35] was MLS + (SSLM MFCCs cosine + SSLM chromas cosine), for which $F_1$ = 0.291. We surpass that result for the same combination of inputs, obtaining $F_1$ = 0.404. Although previous work [4] says that the cosine distance performs better, we show that in our model the Euclidean distance gives better results. We also found that the best-performing input combination is MLS + (SSLM chromas euclidean + SSLM chromas cosine + SSLM MFCCs euclidean + SSLM MFCCs cosine), for which $F_1$ = 0.411. The improvement in the F-measure obtained with this combination over the combination of the MLS with two SSLMs is not large, but it is still our best result.

VII. DISCUSSION
We can affirm that the best-performing input, when training the model with isolated inputs, is the Mel spectrogram, with an $F_1$ of 0.389. This is more than 0.02 higher than the next best-performing input representation, the SSLM MFCCs euclidean, whose $F_1$ is 0.361 (Table V).
We have also shown that applying a max-pooling of factor 6 at the beginning of the process not only takes much less pre-processing time but also makes the training of the neural network faster, and that it does not affect the results as much as might be expected. As an example, the SSLM MFCCs euclidean obtained with the 6pool method has an $F_1$ value of 0.361, versus 0.375 for the same input obtained with the 2pool3 method.
Although we could not replicate the results of the previous studies of Ullrich et al. [3] and Grill et al. [4], which used nearly the same model as the one described in our work, we outperform the results of Cohen et al. [35], who also tried to re-implement the model described in the previous literature. It should be highlighted that Ullrich et al. [3] and Grill et al. [4] had at their disposal a private dataset of 733 pieces that they used for training the model, whereas in this paper the model has been trained only with the publicly available SALAMI 2.0 dataset.

Fig. 6: Boundary predictions using the CNN on different inputs obtained from "Live at LaBoca on 2007-09-28" by DayDrug, which corresponds to song 1358 of the SALAMI 2.0 database. The ground truth from the SALAMI annotations is shown as red Gaussians, the model predictions as the white curve and the threshold as the horizontal yellow line. Note that the predictions have been rescaled in order to plot them on the MLS and SSLM images. All these images have been padded as explained in the previous paragraphs and then normalized to zero mean and unit variance.
(a) CNN predictions on MLS.
(b) CNN predictions on SSLM calculated with MFCCs and Euclidean distance with 2pool3 (best-performing SSLM input in terms of F-measure). In this case $F_1$ = 0.486 for a ±0.5 s tolerance.
(c) CNN predictions on SSLM calculated with MFCCs and cosine distance with 2pool3. In this case $F_1$ = 0.686 for a ±0.5 s tolerance.
(d) CNN predictions on SSLM from MFCCs with cosine distance for model MLS + (SSLM MFCCs euclidean + SSLM MFCCs cosine). In this case $F_1$ = 0.75 for a ±0.5 s tolerance.
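The F-measure with a ±0.5 s tolerance counts a predicted boundary as a hit when it falls within 0.5 s of an annotated boundary, with each annotation matched at most once. The following Python sketch illustrates the idea under simplifying assumptions: `boundary_f_measure` is a hypothetical helper, the boundary times are made up, and the greedy matching used here is a simplification that may differ from the matching used by standard evaluation tools.

```python
def boundary_f_measure(predicted, annotated, tolerance=0.5):
    """Return (precision, recall, F1) for boundary times given in seconds."""
    predicted = sorted(predicted)
    annotated = sorted(annotated)
    matched = set()  # indices of annotations already used by a prediction
    hits = 0
    for p in predicted:
        # Greedily pick the closest unmatched annotation within the tolerance.
        best = None
        for j, a in enumerate(annotated):
            if j in matched or abs(p - a) > tolerance:
                continue
            if best is None or abs(p - a) < abs(p - annotated[best]):
                best = j
        if best is not None:
            matched.add(best)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up example: two of the four predictions fall within 0.5 s of an annotation.
precision, recall, f1 = boundary_f_measure(
    [10.2, 31.0, 55.4, 80.0], [10.0, 30.0, 55.0]
)
```

In this example the precision is 2/4 and the recall is 2/3, which gives an F-measure of 4/7.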
Adding more inputs to the model does not improve the results significantly and is very time consuming, especially in our last case study, where we combine 4 SSLMs with the Mel spectrogram. That combination has an $F_1$ value of 0.411, against 0.402 for the MLS + SSLM MFCCs euclidean case, a difference of less than 0.01. This leads us to suggest that another neural network architecture that uses only the Mel spectrogram with one SSLM could outperform the current results.
The results obtained in this work improve those presented previously with the same database. However, the accuracy in obtaining the boundaries in musical pieces is relatively low and, to some extent, difficult to use. This makes it necessary, on the one hand, to continue studying different methods that allow a correct structural analysis of music and, on the other hand, to obtain databases that are properly labeled and contain a high number of musical pieces. In any case, the results obtained are promising and allow us to adequately set out the bases for future work.

VIII. CONCLUSIONS
In this work we have developed a comparative study to determine the most efficient way to compute the inputs to a convolutional neural network to identify boundaries in musical pieces, combining different methods of generating SSLM matrices. In order to make the comparison and analyse the optimal way to perform the boundary detection task in MSA, different audio features and different pooling strategies have been employed, as well as the combination of different inputs to the CNN.
With an adequate combination of input matrices and pooling strategies, we obtain an $F_1$ of 0.411, which outperforms the previous result obtained under the same conditions (same input data and same datasets for training and testing). Although the best result is given by combining four SSLMs and the MLS, the difference in F-measure between our best result and experiments that require less input data and less training time is not as large as might be expected. We can also affirm that the methods used to date for music boundary detection do not perform well, so the MSA task is not yet solved and needs further research.
Future work should use new neural network architectures that have not yet been applied to MSA. Architectures employed in language models from Natural Language Processing, such as Transformers, could outperform the results presented in this work thanks to the memory improvement that they provide in comparison with Long Short-Term Memory networks (LSTMs). In the case of Transformers, the self-attention mechanism can help the model to better process the SSMs and SSLMs. Further research, as mentioned before, should also consider data augmentation of the currently available public datasets in order to have more data to train deep neural network models. Data augmentation, if done, should use pitch-shifting or the addition of Gaussian noise to the inputs, but not rotation or scaling techniques, which affect the time distances of the input representations (horizontal axes) and thus the structure of the music pieces.
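As an illustration of the Gaussian-noise augmentation suggested above, the following minimal Python/NumPy sketch adds element-wise noise to an input matrix. It assumes the input has already been normalized to zero mean and unit variance. The function name, the input size and the noise level of 0.05 are arbitrary choices for illustration, not values evaluated in this work.

```python
import numpy as np

def add_gaussian_noise(x, std=0.05, rng=None):
    """Return a noisy copy of x. The shape, and hence the time axis, is unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, std, size=x.shape)

rng = np.random.default_rng(0)
# Hypothetical normalized input: 80 frequency bands x 500 time frames.
mls = rng.standard_normal((80, 500))
augmented = add_gaussian_noise(mls, std=0.05, rng=rng)
```

Because the noise is added element by element, the boundary annotations remain valid for the augmented input, unlike with rotation or time scaling.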