Music Similarity Ranking: Case Study Fairphonic

Abstract


B. Research Method
The overall flow of this experiment can be seen in Figure 1. Both datasets are formatted as depicted in Figure 1. After extracting features from each song in the datasets, we retrieve the songs most similar to the original input and then calculate the evaluation metrics. This section explains the selected algorithms and the evaluation metrics used.

Chroma & HPCP
Chroma, or the Chromagram, represents the tonal content of an audio signal [24]. In music, chroma features correspond to the 12 musical notes (pitch classes). Essentially, a Chromagram is like a Mel-Spectrogram that does not use the mel scale but instead divides the frequency range into 12 musical notes. One variant of the Chromagram replaces the Fourier transform with the constant-Q transform (CQT) when converting from the time domain to the frequency domain; it is known as Chroma CQT. CQT is considered more suitable for musical signals than the Fourier transform, although it is more computationally demanding. Another variant is Chroma Energy Normalized Statistics (CENS), a normalized Chroma CQT that is widely used for audio matching applications [25].
Another name for the Chromagram is the Pitch Class Profile (PCP). Whereas PCP considers only pitch, a variant called the Harmonic Pitch Class Profile (HPCP) also includes harmonic components in the calculation. Harmonic components are integer multiples of a base frequency. For example, if the base frequency is 50 Hz, the first harmonic component is 100 Hz, the second harmonic component is 150 Hz, and so on. The plain PCP calculation does not take these harmonic components into account.
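As a rough illustration of the ideas above, the following sketch folds spectral peaks into 12 pitch-class bins and lists the harmonic components of a base frequency. It is a minimal sketch, not the extraction pipeline used in this study: the function names, the 440 Hz tuning reference, and the use of equal temperament are assumptions made for the example.

```python
import math

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch-class index 0..11 (0 = C).

    Assumes 12-tone equal temperament tuned to A4 = a4_hz (an assumption).
    """
    midi_note = 69 + 12 * math.log2(freq_hz / a4_hz)
    return round(midi_note) % 12

def harmonic_components(base_hz, count):
    """Integer multiples of the base frequency, numbered as in the text:
    the first harmonic component is 2 * base, the second is 3 * base."""
    return [base_hz * k for k in range(2, count + 2)]

def simple_pcp(peaks):
    """Fold (frequency_hz, magnitude) spectral peaks into a 12-bin profile.

    Harmonics are NOT treated specially here, as in a plain PCP.
    """
    profile = [0.0] * 12
    for freq_hz, magnitude in peaks:
        profile[pitch_class(freq_hz)] += magnitude
    return profile

if __name__ == "__main__":
    print(harmonic_components(50, 2))          # harmonics of a 50 Hz base
    print(simple_pcp([(440.0, 1.0), (880.0, 0.5), (261.63, 2.0)]))
```

Because 440 Hz and 880 Hz are an octave apart, both fall into the same bin, which is the folding behaviour that makes chroma features insensitive to octave. An HPCP-style feature would additionally weight the contribution of each peak's harmonics.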

Rhythm Pattern
According to Lidy and Rauber [23], the Rhythm Pattern, or Fluctuation Pattern, is an audio feature that captures the strength of amplitude fluctuations within each frequency range. The frequency ranges in the rhythm pattern follow the Bark scale. The rhythm pattern can be visualized with the x-axis representing the rate of the fluctuations, the y-axis representing the frequency range on the Bark scale, and the color indicating the strength of the fluctuations.
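The core idea, measuring how strongly the amplitude in each band fluctuates at each rate, can be sketched with a plain discrete Fourier transform of each band's amplitude envelope. This is a simplified sketch under stated assumptions: it omits the Bark-scale grouping and the psychoacoustic processing of the full feature, and the function names and the envelope frame rate are hypothetical.

```python
import cmath
import math

def modulation_spectrum(envelope, frame_rate_hz):
    """Naive DFT of one band's amplitude envelope.

    Returns (modulation_frequency_hz, magnitude) pairs up to the Nyquist bin.
    """
    n = len(envelope)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(envelope[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append((k * frame_rate_hz / n, abs(coeff)))
    return spectrum

def fluctuation_matrix(band_envelopes, frame_rate_hz):
    """One row per frequency band, one column per modulation frequency."""
    return [[magnitude for _, magnitude in modulation_spectrum(env, frame_rate_hz)]
            for env in band_envelopes]

if __name__ == "__main__":
    # A band whose loudness pulses twice per second, sampled at 16 frames/s.
    frame_rate = 16.0
    envelope = [1 + math.cos(2 * math.pi * 2 * t / frame_rate) for t in range(32)]
    spectrum = modulation_spectrum(envelope, frame_rate)
    peak = max(spectrum[1:], key=lambda pair: pair[1])  # skip the DC bin
    print(peak[0])  # modulation frequency with the strongest fluctuation
```

Each row of the resulting matrix corresponds to the y-axis of the visualization described above, and each column to a fluctuation rate on the x-axis.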

Similarity Matrix Profile: SiMPle-Fast
SiMPle-Fast quickly calculates the similarity matrix profile by comparing subsequences from two time series, such as audio data [3]. The similarity matrix profile is formed by calculating the distance between subsequences. This distance calculation is performed using Mueen's Algorithm for Similarity Search (MASS), which relies on the Fourier transform and the convolution theorem [26]. The distance between two songs is then taken as the median of the similarity matrix profile. Because MASS computes the distances through Fourier transforms and convolution, the comparison remains efficient on large datasets and gives a robust way to identify similar patterns in complex audio signals.
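To make the definition concrete, the sketch below builds a similarity matrix profile by brute force and takes its median as the distance between two songs. It deliberately does not use MASS: every subsequence pair is compared directly, which is much slower than the Fourier-based computation but yields the same kind of profile. The function names and the subsequence length `m` are assumptions for the example.

```python
import math
import statistics

def subsequence_distance(window_a, window_b):
    """Euclidean distance between two equal-length runs of feature frames."""
    return math.sqrt(sum((x - y) ** 2
                         for frame_a, frame_b in zip(window_a, window_b)
                         for x, y in zip(frame_a, frame_b)))

def similarity_matrix_profile(series_a, series_b, m):
    """For each length-m subsequence of A, the distance to its nearest
    length-m subsequence of B (brute force, no MASS)."""
    profile = []
    for i in range(len(series_a) - m + 1):
        window_a = series_a[i:i + m]
        profile.append(min(
            subsequence_distance(window_a, series_b[j:j + m])
            for j in range(len(series_b) - m + 1)))
    return profile

def song_distance(series_a, series_b, m):
    """Median of the similarity matrix profile, as described in the text."""
    return statistics.median(similarity_matrix_profile(series_a, series_b, m))

if __name__ == "__main__":
    # Toy one-dimensional "frames"; real input would be 12-bin chroma frames.
    song = [(0.0,), (1.0,), (2.0,), (3.0,)]
    print(song_distance(song, song, m=2))  # a song compared with itself
```

A lower median means that most subsequences of one song have a close match somewhere in the other, so candidate songs can be ranked by ascending distance.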

Euclidean Distance
Euclidean distance calculates the shortest (straight-line) distance between two points [27] and extends naturally to higher dimensions. For two-dimensional Cartesian coordinates, the distance between the two points p = (x_1, y_1) and q = (x_2, y_2) in Figure 2 is given by Equation 1. Euclidean distance for more than two dimensions is described in Equation 2.
$$d(p, q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{1}$$
Euclidean distance is computationally efficient and straightforward, making it a popular choice for applications in clustering, classification, and similarity measurement. Despite its simplicity, Euclidean distance assumes that the feature space is isotropic and that all dimensions contribute equally to the distance, which might not always be the case in real-world data. However, its ease of implementation and clear interpretation make it a foundational tool in many analytical and scientific endeavors.
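The distance can be computed in a few lines. The sketch below uses the usual n-dimensional form, the square root of the sum of squared coordinate differences, which reduces to the two-point case in two dimensions. The function names and the toy feature vectors are assumptions for illustration, not the feature extraction used in this study.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def rank_by_distance(query, candidates):
    """Return candidate names ordered from most to least similar to the query.

    `candidates` maps a name to its feature vector (hypothetical layout).
    """
    return sorted(candidates,
                  key=lambda name: euclidean_distance(query, candidates[name]))

if __name__ == "__main__":
    print(euclidean_distance((0, 0), (3, 4)))
    print(rank_by_distance((0, 0), {"far": (10, 10), "near": (1, 1)}))
```

Ranking by ascending distance is what turns the distance into a similarity ranking: the closest feature vector is listed first.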

Binary Similarity Matrix
This approach [22] involves four main stages: pre-processing, similarity matrix creation, dynamic programming local alignment (DPLA), and post-processing. In the pre-processing stage, Harmonic Pitch Class Profile (HPCP) matrices are extracted and then averaged over time into a global HPCP vector for each song.

Figure 2. Two points in Cartesian coordinates
One song is then transposed to match the key of the other song. The transposition step first calculates the optimal transposition index from the global HPCP vectors. The similarity matrix is then created by evaluating the binary similarity function for each pair of HPCP frames from the two songs, one of which has already been transposed. If HPCP matrix A has dimensions (300, 12) and HPCP matrix B has dimensions (400, 12), the resulting binary similarity matrix has dimensions (300, 400), with each element being binary, either 1 or -1. DPLA is then applied to the binary similarity matrix to account for drastic changes in song structure, and finally, post-processing produces the similarity measure between the two songs. This method is said to improve accuracy by avoiding intermediary techniques such as key estimation or beat tracking and focusing on structural changes within the songs. By emphasizing structural modifications, the approach aims to provide more reliable similarity assessments across diverse musical genres and complex compositions. Its robustness to key transpositions and structural variations also allows it to capture musical relationships that simpler models might miss, which is advantageous in applications such as music recommendation systems, plagiarism detection, and musicological research.
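The transposition and matrix-creation stages can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated here: the transposition index is taken as the circular shift that maximizes the dot product of the two global profiles, two frames count as similar (1) when their own best shift is zero and dissimilar (-1) otherwise, and the DPLA and post-processing stages are omitted. All function names are hypothetical.

```python
def rotate(vector, k):
    """Circularly shift a pitch-class profile by k bins (semitones)."""
    vector = list(vector)
    k %= len(vector)
    return vector[-k:] + vector[:-k]

def optimal_transposition_index(a, b):
    """Shift of b that maximizes its dot product with a (ties favor 0)."""
    return max(range(len(b)),
               key=lambda k: sum(x * y for x, y in zip(a, rotate(b, k))))

def global_profile(frames):
    """Average the HPCP frames over time into one global vector."""
    n = len(frames)
    return [sum(frame[i] for frame in frames) / n
            for i in range(len(frames[0]))]

def binary_similarity_matrix(frames_a, frames_b):
    """Matrix of shape (len(frames_a), len(frames_b)) with entries 1 or -1."""
    shift = optimal_transposition_index(global_profile(frames_a),
                                        global_profile(frames_b))
    transposed_b = [rotate(frame, shift) for frame in frames_b]
    return [[1 if optimal_transposition_index(fa, fb) == 0 else -1
             for fb in transposed_b]
            for fa in frames_a]
```

With inputs of 300 and 400 frames, the returned matrix has 300 rows and 400 columns, matching the dimensions given above. Long diagonal runs of 1 entries indicate passages where the two songs follow the same tonal sequence, which is what the alignment stage then searches for.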

Mean Rank
Mean Rank is the average rank of the first cover song (the song considered similar) that appears in the ranking results [28]. For example, Figure 3 shows a dataset of original and cover song pairs, and Figure 4 shows the ranking results computed from their similarity levels.
Based on Figure 4, for original song 1, the correct cover song is ranked 2nd (Cover Song 1c); for original song 2, it is ranked 1st (Cover Song 2b); for original song 3, it is ranked 1st (Cover Song 3a); and for original song 4, it is ranked 3rd (Cover Song 4a). Therefore, the mean rank of this scheme is (2 + 1 + 1 + 3) divided by 4, which equals 1.75. This metric provides a straightforward way to evaluate similarity ranking algorithms by focusing on the rank position of the first relevant item. A lower mean rank indicates better performance, as it shows that the correct cover songs appear closer to the top of the ranking list. Mean Rank is therefore a useful metric for comparing different algorithms in music similarity ranking tasks.
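The calculation can be reproduced with a short sketch. The function names and the item identifiers are assumptions for illustration; the rank values are the ones read from Figure 4 in the example above.

```python
def first_relevant_rank(ranked_ids, relevant_ids):
    """1-based position of the first relevant item in a ranked list."""
    for position, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return position
    raise ValueError("no relevant item appears in the ranking")

def mean_rank(first_ranks):
    """Average of the first-relevant ranks over all queries."""
    return sum(first_ranks) / len(first_ranks)

if __name__ == "__main__":
    # Ranks of the first correct cover for original songs 1 to 4.
    print(mean_rank([2, 1, 1, 3]))  # (2 + 1 + 1 + 3) / 4
```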

Mean Reciprocal Rank
Mean Reciprocal Rank (MRR) averages the reciprocal of the rank of the first correct result over all queries, which maps rank positions into a range from 0 to 1, with higher values indicating better performance in the returned ranks [28]. A higher MRR value therefore signifies more accurate ranking results. The MRR calculation process is illustrated in Figure 5. This metric provides a clear and concise measure of ranking effectiveness, making it a valuable tool for evaluating similarity ranking algorithms in applications such as music similarity analysis and recommendation systems. By using MRR, one can easily assess and compare the precision of different algorithms.
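As a small sketch, MRR can be computed from the rank of the first correct result for each query. The function name is hypothetical, and the input ranks (2, 1, 1, 3) are illustrative values rather than the values shown in Figure 5.

```python
def mean_reciprocal_rank(first_ranks):
    """Average of 1 / rank of the first correct result over all queries."""
    return sum(1.0 / rank for rank in first_ranks) / len(first_ranks)

if __name__ == "__main__":
    # (1/2 + 1/1 + 1/1 + 1/3) / 4 = 17/24
    print(mean_reciprocal_rank([2, 1, 1, 3]))
```

A query whose correct result is ranked first contributes 1, while a correct result far down the list contributes a value close to 0, which is why MRR rewards top placements more strongly than Mean Rank does.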

Mean Average Precision
Mean Average Precision (MAP) is a widely used metric for evaluating the performance of similarity ranking systems [28]. MAP provides a single-figure measure of quality across all ranking positions, making it a comprehensive metric for performance evaluation. It is the average of the Average Precision (AP) values calculated for each query or task. Precision@K, a related metric, measures how many relevant items are present among the top K recommended items. However, Precision@K does not account for the actual ranking positions of these relevant items.
To address this, Average Precision@K modifies Precision@K by incorporating the ranks of the relevant items into the calculation, ensuring that items appearing earlier in the ranking contribute more to the score. This approach offers a more nuanced understanding of ranking effectiveness by emphasizing the importance of higher-ranked relevant items. An illustration of MAP is provided in Figure 6, which demonstrates how this metric helps in assessing the overall effectiveness of similarity ranking results.
Higher MAP values indicate better performance of the ranking system, reflecting a higher proportion of relevant items appearing at the top of the ranking list. By using MAP, researchers and practitioners can gain insights into how well their similarity ranking algorithms perform, allowing them to make informed decisions about algorithm improvements and optimizations. This metric is particularly useful in applications such as information retrieval, recommendation systems, and any domain where the quality of ranked results is crucial.
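The relationship between Precision@K, Average Precision, and MAP can be sketched as follows. This sketch assumes one common definition, in which AP averages Precision@K over the positions of the relevant items found in the list; the exact variant used in this study may differ. Each ranking is represented as a list of 1 (relevant) and 0 (not relevant) flags, and the function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k items that are relevant."""
    return sum(relevance[:k]) / k

def average_precision(relevance):
    """Mean of Precision@K taken at each position holding a relevant item."""
    hits = 0
    total = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / position  # Precision@position
    return total / hits if hits else 0.0

def mean_average_precision(relevance_lists):
    """Average of the per-query Average Precision values."""
    return (sum(average_precision(r) for r in relevance_lists)
            / len(relevance_lists))

if __name__ == "__main__":
    # Query 1: relevant items at ranks 1 and 3; query 2: relevant item at rank 2.
    print(mean_average_precision([[1, 0, 1], [0, 1]]))
```

In the example, the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, so moving a relevant item toward the top of either list raises the MAP.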

C. Result and Discussion
This section presents an analysis of the performance metrics used to evaluate the algorithms: the scores for Mean Rank, Mean Reciprocal Rank (MRR), and Mean Average Precision (MAP), alongside processing speed. The Mean Rank scores give the average position of the first correctly identified similar song in the ranking results, while the MRR scores map these positions into a range between 0 and 1, with higher values indicating better performance. MAP scores offer an average precision value that considers the ranking positions of relevant items, providing a nuanced measure of algorithm accuracy. Additionally, the processing speed, measured as the time taken to perform the comparisons, is evaluated to ensure that the algorithms are not only accurate but also efficient. These metrics collectively highlight the strengths and weaknesses of each algorithm, guiding the recommendation of the most suitable algorithm for Fairphonic's needs.
The experiment was run on two datasets: Covers80 and Fairphonic's internal production dataset. The Covers80 dataset comprises 39 original songs and 80 cover songs, while the production dataset contains 80 original songs and 160 cover songs. Using both a public dataset and a production dataset allows the algorithms to be evaluated under conditions that reflect real-world use and gives insight into their practical applicability and reliability.
The evaluation metric calculations for the three algorithms, Chroma CENS - SiMPle-Fast, Rhythm Pattern - Euclidean Distance, and HPCP - Binary Similarity Matrix, can be seen in Table 1. Based on Table 1, the most accurate algorithm is HPCP - Binary Similarity Matrix, which outperforms the other two algorithms on both datasets. The second-best algorithm is Chroma CENS - SiMPle-Fast. The Rhythm Pattern - Euclidean Distance algorithm performed poorly on the Covers80 dataset.
The superior performance of Chroma CENS - SiMPle-Fast and HPCP - Binary Similarity Matrix over Rhythm Pattern - Euclidean Distance may be because song similarity is influenced more by pitch features (Chroma CENS and HPCP) than by rhythm features (Rhythm Pattern). Additionally, Euclidean distance tends to be ineffective in high-dimensional spaces (the curse of dimensionality): as the number of dimensions increases, the data become sparse and the distances between points become less meaningful. Thus, the Rhythm Pattern - Euclidean Distance algorithm may not capture the nuances of musical similarity as effectively as the pitch-based algorithms. The findings suggest a need for more sophisticated methods to handle high-dimensional data in music similarity tasks.
This study also measures the time required for the three algorithms to perform 100 comparisons. The experiment was conducted 30 times, and the average time in seconds to perform 100 comparisons and its standard deviation were calculated. The results can be seen in Table 2.
The Rhythm Pattern - Euclidean Distance algorithm recorded a much faster time than the other two algorithms. This is due to its relatively simple computation, which only involves calculating the Euclidean distance. The second fastest algorithm is Chroma CENS - SiMPle-Fast, which recorded a time almost four times faster than HPCP - Binary Similarity Matrix. These findings highlight the tradeoff between speed and accuracy, emphasizing the need to balance computational efficiency and performance in practical applications.
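The timing protocol described above, 30 repetitions of a batch of 100 comparisons summarized by mean and standard deviation, can be sketched as a small harness. This is an assumed reconstruction of the procedure, not the script used in the study; `compare` stands for any of the three similarity functions, and the placeholder pairs are hypothetical.

```python
import statistics
import time

def time_comparisons(compare, pairs, runs=30):
    """Time `runs` repetitions of comparing every pair in `pairs`.

    Returns (mean_seconds, sample_standard_deviation_seconds) per repetition.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        for a, b in pairs:
            compare(a, b)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)

if __name__ == "__main__":
    # 100 placeholder pairs and a trivial comparison function.
    pairs = [((i, i), (i + 1, i + 2)) for i in range(100)]
    mean_s, std_s = time_comparisons(lambda a, b: sum(a) - sum(b), pairs)
    print(f"{mean_s:.6f} s +/- {std_s:.6f} s per 100 comparisons")
```

Using `time.perf_counter` keeps the measurement to elapsed wall-clock time, and reporting the standard deviation alongside the mean shows how stable the timing is across repetitions.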

D. Conclusion
Based on the experiments conducted, the values of Mean Rank, Mean Reciprocal Rank, and Mean Average Precision were calculated and recorded in Table 1. The most accurate algorithm, according to these evaluation metrics, is HPCP - Binary Similarity Matrix. However, the fastest algorithm is Rhythm Pattern - Euclidean Distance. The Chroma CENS - SiMPle-Fast algorithm falls in the middle, being reasonably fast but not as accurate as HPCP - Binary Similarity Matrix. Specifically, Chroma CENS - SiMPle-Fast takes about 72 percent less time (six seconds) than HPCP - Binary Similarity Matrix (22 seconds).
The HPCP - Binary Similarity Matrix algorithm excels in accuracy due to its robust handling of harmonic and pitch class features, which are crucial for precise music similarity assessment. Its analysis of musical structure allows for more reliable similarity determinations, making it well suited to applications requiring high accuracy. However, its computational cost and processing time are significant, which may limit its practicality in real-time applications or on large-scale datasets.
In contrast, the Rhythm Pattern - Euclidean Distance algorithm, while less accurate, offers a much faster processing time. Its simplicity in calculating Euclidean distances between rhythm patterns makes it efficient and suitable for scenarios where speed is prioritized over precision. This efficiency can be particularly beneficial in applications where quick responses are necessary, albeit with a tradeoff in similarity detection accuracy.
The Chroma CENS - SiMPle-Fast algorithm strikes a balance between speed and accuracy. It provides a reasonable level of precision in identifying similar music tracks while maintaining a faster processing time than HPCP - Binary Similarity Matrix. This makes it a versatile option for applications needing a compromise between computational efficiency and accuracy.
In summary, the choice of algorithm depends on the specific requirements of the application, whether that is high accuracy, fast processing time, or a balance of both.

E. Acknowledgment
The author wishes to express special thanks to his supervisor, Denny, S.Kom., M.I.T., Ph.D., for his excellent guidance, supervision, encouragement, and invaluable advice throughout the period of study and in writing this paper.

Figure 3. Example of original song - cover song pairs

Figure 4. Example of similarity ranking results

Figure 6. Mean Average Precision example

Table 1. Metrics Calculation Result

Table 2. Processing Speed Results