GPU-Based Similarity Metrics Computation and Machine Learning Approaches for String Similarity Evaluation in Large Datasets

The digital era brings, on the one hand, massive amounts of available data and, on the other hand, the need for parallel computing architectures for efficient data processing. String similarity evaluation is a processing task applied to large data volumes, commonly performed by applications such as search engines, bio-medical data analysis and software tools for defending against viruses and spyware.


Introduction
The amount of digital data produced and transmitted around the world is increasing at an exponential rate. Consequently, data processing tasks are becoming more and more complex, leading to the need for hardware resources with increased parallel computing capabilities. Regarding the current technologies used for intensive processing tasks, Moore's law did not survive: in the past years, the number of decisional elements in a compact integrated circuit has no longer doubled every 18 months. A few years ago, Moore underlined potential paths for further technology improvement, such as exploiting the speed of light and the atomic nature of materials [1]. State-of-the-art processing technologies in the context of the big-data paradigm are field programmable gate arrays (FPGAs), graphics processing units (GPUs) and multi-core processor architectures. Currently, GPUs take the lead, overcoming the disadvantages of FPGAs and of last-generation CPUs: the difficulty of programming and the increased time-to-market in the case of FPGAs, and the number of processing threads of CPUs, which is two orders of magnitude lower than that of GPUs. All these powerful computing resources are used to tackle the main problems brought up by the big-data paradigm.
String similarities in large datasets
A key issue, considering the available data on a tremendous variety of subjects, is to find the occurrences of a given string or pattern within large datasets. This task, also known as string similarity evaluation or approximate string matching, is considered to be at the heart of applications in various fields such as network intrusion detection systems, stock market estimation, web searching and even computational biology [2][3][4]. The importance of this task is underlined as follows. In terms of network security, undesired traffic cannot be filtered based on header information alone, because most threats target the application layer [5,6]. Deep packet inspection and payload checking are needed to assure protection against viruses, malware, spyware, spam and deliberate incursion efforts; this makes the task increasingly difficult and has led to the design of security filters with lower processing time using GPUs and FPGAs [7]. Bioinformatics applications, where a considerable number of reads must be mapped (or aligned) to a longer reference or database sequence, also make use of string similarity evaluation [8]. Several exact string-matching tools are available, one of the most successful being based on the Burrows-Wheeler Transform, which also benefits from efficient GPU implementations [9]. Moreover, to allow for the presence of some mismatches and structural variations, approximate string matching is also performed in bioinformatics applications [10]. Considering both exact and approximate string-matching approaches applied to large-scale RNA/DNA sequence data, GPUs have been successfully used for efficient implementations in terms of processing time and throughput [11]. Another type of application where algorithms for the determination of string similarities were successfully applied is stock market estimation [3]; in this case, symbolic aggregate approximation was used as a pattern recognition technique for identifying market and/or stock trends in large financial time series. Having underlined the importance of string similarity evaluation, we proceed to the description of the state-of-the-art algorithms used for this purpose.
String matching algorithms have been explored extensively using a variety of methodologies, including dynamic programming, finite state machines, bit parallelism, filtering, and indexing [12]. The literature survey led to the classification of these algorithms into automata-based algorithms, similarity metrics-based algorithms and artificial intelligence approaches. The automata-based algorithms can be split into deterministic finite automata (DFA) and nondeterministic ones (NDFA) [10]. In the first category, the Knuth-Morris-Pratt [13] and Aho-Corasick (AC) [14] algorithms have been widely used for exact string matching. Meanwhile, the typical nondeterministic approaches, namely the bit-parallel algorithms, are used for both approximate and exact string matching [15]. In the bit-parallel algorithms, the number of memory references is defined only by the text and pattern lengths. To put it another way, the bit-parallel approach has the same best- and worst-case time complexity, allowing it to deliver very reliable throughput while dealing with a variety of patterns with different text and pattern contents. By taking advantage of bit parallelism, the number of operations to be performed is reduced by a factor of w, the computer word size. Moreover, in addition to bit parallelism, parallel computation strategies using GPUs have been implemented to achieve increased computational speed [16]. Considering the DFAs, the parallel failureless Aho-Corasick (PFAC) algorithm [17] extends the AC algorithm for efficient parallelization on GPUs. Nevertheless, traditional methods such as the AC algorithm fail to meet the current need for efficiency [6]: performance degradation occurs if the text has many partially matching patterns and their matching lengths are relatively long, and such unstable search throughput is not desirable in the case of large data sets. Another algorithm in this class is the Wu-Manber algorithm [18], which uses a hash function to eliminate impossible matches; this considerable validation step reduces the algorithm's efficiency.
String similarity metrics are known for increased efficiency, and they are divided into two categories: string-based, where syntactic similarities are calculated, and language-based, where semantic similarities are evaluated [19]. As opposed to the first category, which computes similarities based on character appearance and sequence, semantic similarities are evaluated based on the likeness of meaning or semantic content rather than lexicographical similarity [20]. According to [21], the Jaro-Winkler algorithm is a heuristic character-based approach which delivers increased accuracy and precision in string similarity evaluation, close to those of the artificial intelligence approaches. Its 65.17% accuracy and 78% precision, when applied to five million pairs of toponyms, make the algorithm an important candidate for string similarity evaluation. Regarding GPU-based implementations of string similarity evaluation using specific metrics, approximate string-matching algorithms under various distances have been implemented using FPGAs [22], multi-core processors [23] and GPUs [24,25]. In [25], the implementation idea is based on the use of warp-shuffle operations to eliminate accesses to global or shared memory; the proposed implementation outperformed the previous parallel approaches on GPUs.
The machine learning approaches come into play when computing a similarity score alone is not enough for a match/mismatch decision. This means that, besides the string likeness, the context of their use is also important. For example, in the case of author-level scientometric indicators, supervised machine learning (ML) approaches are used to disambiguate author names in the Web of Science publication database [26]. The main parameters used to describe an ML approach are the precision, which quantifies how many of the positive class predictions actually belong to the positive class, the recall, which quantifies how many of all the positive examples in the dataset are predicted as positive, and the F-measure, which provides a single score that balances the concerns of precision and recall in one number. In [26], the random forest and logistic regression machine learning algorithms are used to achieve feature assessment, as opposed to classic feed-forward neural networks, together with increased precision and recall rates. In [21], randomized trees, support vector machines and random forests were used for string similarity evaluation, and increased accuracy was obtained on a large dataset as compared with classic approaches such as the Jaro-Winkler and Damerau-Levenshtein metrics. Moreover, in [27], an unsupervised machine learning approach is used for data reduction in string space.
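For reference, with $TP$, $FP$ and $FN$ denoting the true positive, false positive and false negative counts, these standard quality measures are defined as:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$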

String similarities in the music industry
In the music market, inequality still exists in establishing royalty rights for broadcasting recorded music. The intellectual rights owners (e.g., performers, producers, artists) look for a transparent and accelerated remuneration process in the context of TV and radio broadcasts; sometimes the remuneration comes more than one year after broadcasting. Diving into the context, we simplify the business flow as follows: producers record media content on different supports (e.g., audio and audio-video), which is used by users such as TV and radio stations, which in turn are obliged to pay a royalty according to the national legislation and to the existing records within the collective management organizations that represent the intellectual rights owners. A right owner is defined here as a record characterized by the following information: performer/artist, title, and producer. Each record also has a specific timestamp and a broadcasting duration. The repertoire represents all the right owner records managed by the collective management organization, and it is composed of audio and/or video records together with their corresponding record attributes: performer/artist, title, and producer. On the other hand, the playlist is a table of records for all broadcasted audio and/or video recordings of a TV or radio station; its records are described by the same three attributes. Based on the received playlists and on the existing repertoire of each collective management organization, the remuneration is established for each right owner. The repertoire is established based on the contractual terms of representation between collective management organizations and intellectual rights owners. The key process in establishing the remuneration is the matching between the records of the user-delivered playlists, described by their attributes (performer/artist, title, and producer), and the repertoire of the collective management organization. This process must be transparent and reliable, and its results have to be delivered efficiently in terms of computational time. The current paper is developed based on an IT system implemented for a collective management organization, meant to manage the repertoires, ingest the playlists, determine the matchings and create transparency and a space of dialog for the key stakeholders in the process of intellectual property royalties' distribution. The software application deals with large playlists composed of multiple string records that must be matched with repertory records in order to assure both copyright protection of all broadcast materials and royalty rights identification. The variety of string similarity evaluation algorithms, together with state-of-the-art computing capabilities and machine learning algorithms, shows great potential for developing a solution that is efficient in terms of both accuracy and computational time.
To answer the previous demands, the current paper provides a Jaro-Winkler algorithm implementation that takes advantage of the GPU parallel computing capabilities. The Jaro-Winkler approach was chosen for string similarity evaluation considering its precise quantification of string similarities compared with approximate and exact string-matching algorithms, and its accuracy performances comparable with those of ML algorithms [21]. As underlined in [15], GPU constraints regarding the data type and its fixed length have to be met, in spite of the variable string lengths for which the similarity evaluation needs to be performed. Thus, the proposed implementation complies with the previous constraints, delivering increased throughput compared with state-of-the-art approaches. Moreover, to improve the accuracy of the aforementioned implementation, two machine learning approaches are proposed. Considering the large datasets they are applied on, their computational complexity is analyzed, and a compromise between the algorithms' complexity and their string matching accuracy is discussed.
The remainder of this paper is organized as follows. Section II introduces the available computing tools of the GPU architecture, together with the Jaro-Winkler algorithm description and the proposed implementation for string similarity evaluation with increased throughput. Moreover, a simple threshold-based algorithm is implemented on the GPU for matching playlist records with repertory records, whereas machine learning approaches are used to increase the accuracy of the record matching procedure. The results and discussion, including the obtained throughput and accuracy as compared with state-of-the-art approaches, are described in Section III. Finally, conclusions are drawn in Section IV.

Material and Methods
The interest for efficient string similarity evaluation, especially in the case of large data sets, was detailed in the previous section. Besides search engines, stock market estimation and computational biology applications, we highlight another task, royalty rights identification, where string similarity evaluation is mandatory. The main objective can be summarized as the matching between the playlist and repertoire records, each of them characterized by three attributes: title, artist and producer. Let us mention that the records are not uniquely identified by a publicly available key, and the playlists are completed by various users, based on different methods, which means that artist or producer names may be reversed or abbreviated, whereas titles can be altered or mistyped. These premises lead to a challenging matching process, also known in the literature as author name disambiguation [26]. Considering the size of the datasets, in a particular case we have to compare 80,000 repertoire records with more than 15,000,000 playlist records on the three attributes in order to establish the audio recordings royalties.

String records matching using the Jaro-Winkler similarity metric
We now return to the challenge of finding the string matching approach that best suits our task. The scientific literature identifies the following units of analysis in matching between two texts: characters and tokens (words, n-grams). These main units are generally used in quantitative approaches for string matching tasks. Table 1 summarizes the two types of approaches, with the mention that hybrid approaches are often adopted. Character-based approaches for string similarity make use of specific distances between two strings, and they are employed in applications where semantic meaning is not as important as the similarity itself. The best results in terms of accuracy were obtained with the Jaro-Winkler [21] and Damerau-Levenshtein distances [28]. Considering the token-based similarities, similar results were obtained using the Jaccard N-grams and Monge-Elkan algorithms.

Jaro-Winkler string similarity algorithm description
Let us consider two strings denoted by $s_1 = a_1 a_2 \ldots a_N$ and $s_2 = b_1 b_2 \ldots b_M$, where $a_i$ and $b_j$ represent the characters included in the $s_1$ and $s_2$ strings, respectively. A character $a_i$ in $s_1$ is common to the two strings if $a_i = b_j$ for some $j$ with $|i - j| \le \lfloor \max(N, M)/2 \rfloor - 1$. Let $s'_1 = a'_1 \ldots a'_m$ be the common characters in $s_1$ and let $s'_2 = b'_1 \ldots b'_m$ be the common characters in $s_2$; a transposition of $s'_1$ and $s'_2$ is defined as a position $i$ such that $a'_i \neq b'_i$. Considering $m$ the number of common characters and $t$ half the number of transpositions, the Jaro similarity is defined as:

$$sim_J(s_1, s_2) = \begin{cases} 0, & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{N} + \frac{m}{M} + \frac{m - t}{m}\right), & \text{otherwise} \end{cases} \quad (1)$$

An improved variant of the aforementioned metric was proposed by Winkler, where a bonus for a common prefix is added to the similarity metric:

$$sim_{JW}(s_1, s_2) = sim_J(s_1, s_2) + l'\, p\, (1 - sim_J(s_1, s_2)) \quad (2)$$

where $l'$ is the length of the longest common prefix of $s_1$ and $s_2$ (at most 4 characters) and $p < 0.25$ (commonly set to 0.1).
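To make the metric concrete, a minimal CPU reference sketch in Python is given below; it follows the standard Jaro-Winkler definition of eqs. (1)-(2) and is illustrative only (the GPU kernel described later follows the same logic under fixed-length data constraints).

```python
# Minimal CPU reference sketch of the Jaro-Winkler similarity, eqs. (1)-(2).
def jaro(s1: str, s2: str) -> float:
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return 0.0
    window = max(max(n, m) // 2 - 1, 0)   # matching window for common characters
    matched1, matched2 = [False] * n, [False] * m
    common = 0
    for i, c in enumerate(s1):            # count common characters
        for j in range(max(0, i - window), min(m, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    k = mismatches = 0
    for i in range(n):                    # compare the two common-character sequences
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                mismatches += 1
            k += 1
    t = mismatches / 2                    # t = half the number of transpositions
    return (common / n + common / m + (common - t) / common) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    sim_j = jaro(s1, s2)
    l = 0                                 # common prefix length, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sim_j + l * p * (1.0 - sim_j)

print(jaro_winkler("MARTHA", "MARHTA"))   # ~0.961
```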

Fig. 1 Jaro-Winkler algorithm
Let us consider each record $i$ from our database described by the triplet of variables author ($a$), title ($t$) and producer ($p$), denoted by $(a, t, p)$. There are two sets of records called Repertoire and Playlist, with records denoted by $(a_r, t_r, p_r)$ and $(a_y, t_y, p_y)$ respectively, and with sizes denoted by $R$ and $V$. We extracted from the Repertoire three subsets of size $R$, containing the authors, song titles and producers, denoted by $A_R$, $T_R$ and $P_R$, and another three subsets $A_V$, $T_V$ and $P_V$, of size $V$ each, from the Playlist. The strings included in the set $A_R$ are evaluated in terms of similarity with all the strings from the set $A_V$, using the Jaro-Winkler algorithm. Considering a selected threshold $k_1$, all the entities from the set $A_V$ having a JW similarity score higher than $k_1$ are selected according to:

$$JW_s(a_r, a_y) > k_1 \quad (3)$$
In case of the selected records, the values of the title variable $t_r$ are further checked for similarity against the title variable $t_y$ in a similar manner, based on a threshold $k_2$. The pseudocode for the proposed algorithm is described in Fig. 2.
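As a rough illustration of this two-stage filtering, a minimal host-side sketch is given below, reusing the jaro_winkler function sketched earlier; the record layout is an assumption, and the producer attribute is carried along but left unchecked, as in the author/title stages described above.

```python
# Host-side sketch of the two-stage threshold matching; records are
# assumed to be (author, title, producer) string triplets.
def match_records(repertoire, playlist, k1, k2):
    matches = []
    for a_r, t_r, p_r in repertoire:
        # Stage 1 (eq. 3): keep playlist entries whose author clears k1.
        candidates = [rec for rec in playlist if jaro_winkler(a_r, rec[0]) > k1]
        # Stage 2: among the candidates, require the title to clear k2.
        for a_y, t_y, p_y in candidates:
            if jaro_winkler(t_r, t_y) > k2:
                matches.append(((a_r, t_r, p_r), (a_y, t_y, p_y)))
    return matches
```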
Please note that, in round numbers, the number of selected entities after the $k_1$ threshold is applied is equal to or greater than $R$. This approximation is used further on for the computational complexity estimation. The computational complexity of our proposed approach, as well as the input data size, have a significant impact on the implementation to be developed. For this reason, we evaluate the computational complexity of our approach, starting from the cost of a single Jaro-Winkler evaluation, $O_J(MN)$, where $M$ is the length of string $s_1$ and $N$ is the length of string $s_2$. The number of instances the JW algorithm is applied is $VR$ in the case of the authors' similarity evaluation between the two sets $A_R$ and $A_V$. The string similarity evaluations of titles and producers from the Repertoire against the ones from the Playlist are performed only for the entities that meet the criterion expressed by eq. (3); for these last cases, the number of JW algorithm instances is approximately $R^2$, which is an order of magnitude lower than the number of instances used for the determination of author similarities. This leads to a total computational complexity of

$$O_{total} = O(3\,VR\,MN) \quad (4)$$

The $V$ and $R$ parameters in our application have values of up to 1,000K and 100K, respectively. On a general-purpose computer, this results in computational times of up to several hours, even if multithreading capabilities are used in the implementation (see the back-of-the-envelope estimate below); the Results and Discussion section details these timing considerations. Thus, a GPU-based implementation is further on used in order to reduce the computational time. Before proceeding to the implementation, the specific tools for parallel computation available through GPU architectures are summarized.

It should be noted that the GPU alone still does not address the issue of the low accuracy of the existing similarity metrics (see Table 1). One can argue that the solution is to be found among the large variety of machine learning techniques. Nevertheless, the large datasets, the large variety of categorical variables and the large number of categories to be identified may lead to complex training procedures and increased computational time. Taking all of this into account, we conclude that employing both GPUs and machine learning techniques represents the solution for determining string similarity in the application at hand. But first, let us have a look at the GPU-based implementation of the Jaro-Winkler algorithm.
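As a rough check of the timing claim above, the following back-of-the-envelope estimate evaluates eq. (4); the average string length $n = 20$ and the rate of $10^9$ character comparisons per second are assumptions, not measurements:

```python
# Back-of-the-envelope cost of eq. (4); n and the ops/s rate are assumed.
V, R, n = 1_000_000, 100_000, 20
ops = 3 * V * R * n * n                   # O(3*V*R*M*N) with M = N = n
print(f"{ops:.2e} ops ~= {ops / 1e9 / 3600:.1f} h at 1e9 ops/s")
# ~1.2e14 ops ~= 33 h single-threaded; with the ~4.38x multi-threading
# speed-up reported in Section III this drops to several hours.
```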

GPU approach for parallel computing
Within the large variety of computing applications, graphics processing technology has evolved to provide unique benefits when parallel computation capabilities are needed. Thus, in addition to their intended use (i.e., accelerating the rendering of 3D graphics), the latest graphics processing units (GPUs) have become flexible and programmable, serving as important tools to significantly speed up complex tasks in high-performance computing, deep learning, and many other fields. These hardware devices are composed of tens of streaming multiprocessors (SMs). Each SM represents an individual computing unit which executes multiple 32-wide vector instructions in parallel. The CUDA application programming interface (API), used for building software applications which employ GPUs for general-purpose processing, refers to these vector instructions as warps. A warp is a group of 32 threads, where a thread can be defined as the smallest subunit of a computer program that can be executed independently. A group of warps is called a thread block, and it is associated with an SM. In the GPU architectures from version 1 up to Pascal v6.2, warps were executed synchronously, an implicit synchronization procedure being applied in case of any thread divergence within the same warp [29]. Starting with the Volta architecture, warp threads are executed asynchronously, and any thread synchronization must be explicitly programmed if needed [30].

In order to perform processing operations on the given data, the execution threads require memory to work with. The memory architecture of the GPUs is divided into three categories: global memory, shared memory, and local memory [31]. The global memory is in the order of GBs, accessible by all the SM threads and used to store the application data and the expected post-processing results. Given the high latency of the global memory, it is preferable that consecutive threads access memory addresses in consecutive order. This approach is known as coalesced memory access, and it ensures optimal bandwidth utilization and low latency. The shared memory is smaller and has a lower latency than the global memory; however, it can only be shared within a thread block. The local memory is the fastest but also the smallest, and it is private to each thread.

Having multiple threads and their corresponding memory available, there is one more step in building software routines which use the GPU parallel computing capabilities for data processing: writing CUDA kernel functions. A CUDA kernel is a GPU function which takes as arguments the input data arrays and the arrays corresponding to the output data. These functions do not return any values; their results are written into the output data arrays. Moreover, each kernel has its own structure of threads (i.e., the number of thread blocks and the number of threads per block), declared each time the kernel is executed. Important tools in kernel execution are stride and atomic operations. When similar independent computational steps are performed, they can be assigned to multiple threads executed at once. When there are more steps to be performed than available threads, the stride operation comes in handy: after the execution of all available threads, the stride operation feeds new data to the same threads for performing the next computational steps.
When results from one execution thread need to be used by another thread, atomic operations are used, which serialize the conflicting accesses to the shared data.
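As an illustration of the stride and atomic mechanisms just described, a minimal sketch using Numba's CUDA API follows; Numba is an assumption on our part for the illustration, and the sketch is not the paper's actual kernel.

```python
# Minimal grid-stride + atomic sketch using Numba's CUDA API (assumed
# tooling; it illustrates the concepts, not the paper's kernels).
import numpy as np
from numba import cuda

@cuda.jit
def count_above_threshold(scores, k, counter):
    i = cuda.grid(1)            # this thread's global index
    stride = cuda.gridsize(1)   # total threads launched in the grid
    while i < scores.size:      # grid-stride loop: reuse threads for extra work
        if scores[i] > k:
            cuda.atomic.add(counter, 0, 1)  # serialized cross-thread update
        i += stride

scores = cuda.to_device(np.random.rand(1_000_000).astype(np.float32))
counter = cuda.to_device(np.zeros(1, dtype=np.int32))
count_above_threshold[256, 128](scores, 0.9, counter)  # 256 blocks x 128 threads
print(counter.copy_to_host()[0])  # ~100,000 expected
```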

Parallel implementation of the Jaro-Winkler based record matching algorithm using the GPU
Both the Jaro-Winkler score computation and the record matching algorithm are implemented using the parallel computing capabilities of the GPUs. Two kernels are used, one for each of the algorithms described in Fig. 3 and Fig. 4, respectively. In the case of the first kernel, the subsets $A$, $T$ and $P$, where the subscript $R$ or $V$ denotes their correspondence to the Repertoire or the Playlist triplets, are copied into the GPU global memory. The similarity scores are computed for the pairs $(A_R, A_V)$, $(T_R, T_V)$ and $(P_R, P_V)$ using the same kernel. As shown in Fig. 3, for each $a_j$ element in $A_R$, the similarities with the $a_i$ elements, $i = 1 \ldots t_c$, in $A_V$ are computed in parallel using all $t_c$ threads specified within the kernel configuration. The $a_j$ similarity scores for the $a_i$ elements with $i > t_c$ are computed using stride operations. These computational steps support block-based data partitioning and coalesced memory access on the GPU, for reduced computational time. The resulting similarity scores of $a_j$ with the $a_i$ elements, $i = 1 \ldots V$, are stored within the global memory and further on copied to the HOST for storage. It should be mentioned that intermediate variable arrays of size $V$ are stored in the global memory and used in the similarity scores computation. The source code for the GPU implementation is available at the Kaggle data repository, together with a selection of entities from the Repertory and Playlist sets [32]. Considering the second kernel, it implements part of the algorithm described in Fig. 4 to determine the list of matches between the entities from the Repertory and the ones from the Playlist. The computational tasks corresponding to the pseudocode lines 3, 7 and 8 from Fig. 4 are executed on the GPU, where coalesced memory access is assured for efficient memory access and reduced computational time. This algorithm section is also chosen for GPU implementation due to the increased execution time of the conditional branches generated by the if statements in the case of general-purpose CPUs.
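A hedged sketch of the first kernel's data layout and launch pattern is given below. The fixed string width MAXLEN and the positional-overlap score are assumptions used to keep the sketch short; the real kernel computes the Jaro-Winkler score of eqs. (1)-(2), and the full source is available at [32].

```python
# Sketch of kernel 1's layout: strings padded to a fixed width (the GPU's
# fixed-size data constraint), one thread per playlist entry, stride for
# the rest. The score below is a simple positional-overlap placeholder.
import numpy as np
from numba import cuda

MAXLEN = 64  # assumed fixed string width

def encode(strings):
    """Pack variable-length strings into a fixed-width uint8 matrix."""
    out = np.zeros((len(strings), MAXLEN), dtype=np.uint8)
    for r, s in enumerate(strings):
        b = s.encode("utf-8")[:MAXLEN]
        out[r, :len(b)] = np.frombuffer(b, dtype=np.uint8)
    return out

@cuda.jit
def row_similarity(ref, a_v, out):
    i = cuda.grid(1)                  # one playlist string per thread
    stride = cuda.gridsize(1)
    while i < a_v.shape[0]:           # stride loop over remaining entries
        same = 0
        total = 0
        for c in range(MAXLEN):
            if ref[c] != 0 or a_v[i, c] != 0:
                total += 1
                if ref[c] == a_v[i, c]:
                    same += 1
        out[i] = same / total if total > 0 else 0.0
        i += stride

a_r = cuda.to_device(encode(["ARTIST ONE", "ARTIST TWO"]))
a_v = cuda.to_device(encode(["ARTIST 0NE", "SOMEONE ELSE"]))
out = cuda.device_array(2, dtype=np.float32)
for j in range(2):                    # host loop over repertoire rows a_j
    row_similarity[1, 64](a_r[j], a_v, out)
    print(out.copy_to_host())
```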

Adaptive Neural Network for increased string similarity evaluation accuracy
Despite the parallel computation capabilities used for the string similarity evaluation, the main disadvantage remains the reduced efficiency of the Jaro-Winkler algorithm: a precision of 78% and an accuracy of 68% are achieved with the threshold of 0.75 used in the implementation. On our database, a 0.75 threshold leads to a significantly increased number of false positive matches. Empirically, we determined that a threshold value of 0.9 for the authors and a threshold value of 0.8 for the titles and producers' information are needed in order to use the JW similarity score as a stand-alone approach for similarity evaluation. In order to increase the accuracy, threshold values $k < 0.75$ are instead used to preliminarily associate candidate matches for each of the authors from the Repertory to the ones from the Playlist. Let $M$ be the set of matches for a given author $a_i$. Even if significant false positive results are included, a neural network approach is used to further classify all the matches from the set $M$ into the Repertory entities corresponding to the author $a_i$, denoted by $N$.

The main issue is that both the Repertory and the Playlist are updated over time. This addition of new data has an impact on the neural network structure to be used for classification, an issue also known as incremental computing. The general idea behind our adaptive Neural Network (aNN), which copes with incremental computing, is to use multiple neural networks, one for each author $a_i$ and its corresponding titles. In this manner, when a new song $n$ is added to the Repertory, only one network $aNN_i$ is updated in terms of its size and training, according to Fig. 5.a. Regarding network training, a procedure based on automatic data generation is used to train the classifier on input data corresponding to the new item $n$.

The number of our aNNs and their size are determined based on the following observations related to the existing data set. In the case of an 80K Repertory size and a 1.5M Playlist size, there is a set $K$ of approximately 10K matches returned by the JW algorithm. Please note that the size of $K$ is similar to the Repertory size; nevertheless, only a few authors (in round numbers, $a_n = 500$) are present in the Playlist, due to the fact that the same title is associated with the same author multiple times. Also, it can be observed that an average of $a_k = 40$ items per author are present in the Repertory. These parameters, the items per author and the number of similarities between the two sets, lead to a number of $a_n$ neural networks with the structure presented in Fig. 5. Considering a given author $a_i$ from the Repertory, each neural network $aNN_i$ classifies the set $M$ from the Playlist into one of the entities from the Repertory corresponding to the same author $a_i$. The total number of input nodes is $a + t + p + c + l$, where all the terms are constant values. The output layer is composed of $a_k$ nodes $U_j$, $j = 1 \ldots a_k$, one for each of the Repertory records corresponding to the author $a_i$. The output layer represents the classification scores, real numbers giving the probability of the input data being associated with each of the $a_k$ Repertory entities. Two hidden layers are considered for the neural network. Ideally, the number of hidden layer units is between $N$ and $2N$, where $N$ is the number of units in the output layer; our choice is $3N/2$ nodes, considering the relatively large number of predicted outputs.
The node connections within the network layers are unidirectional, leading to a feed-forward topology. As for the activation function used by the output elements, the softmax function is applied:

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1 \ldots K \quad (5)$$

It applies the standard exponential function to each element of the input vector and normalizes these values by dividing them by the sum of all the exponentials, ensuring that the sum of the output vector components is 1. The aforementioned output can be interpreted as the probability of the input data values being associated with one of the output entities.

Further on, the network training and the outlier detection are discussed as the key elements of the implementation. For training purposes, besides the existing playlist and repertory record entries, synthetic datasets are generated by introducing reversed and shortened titles, producer and performer names, together with typos, for each $aNN_i$ classifier. On the other hand, considering the test data, there will be two types of inputs: inputs similar to the training data and inputs that differ in some respect from the data available during training. This classification task is also known as novelty detection, and it is used in our case for detecting abnormal data at the neural network input. For this purpose, according to [33] and [34], an auto-detector is built for the identification of the novelty inputs, which correspond to the playlist records that are not part of the group of records corresponding to the artist $a_i$, due to the low Jaro-Winkler threshold.

The network training and validation accuracy for two well-known Romanian artists, namely (PS) and (NP), are presented in Fig. 5. The number of output classes is 105 and 176 for PS and NP, respectively. A faster convergence towards 93% accuracy can be observed in the case of the networks having a smaller number of inputs. According to the proposed workflow, based on the JW similarities between the playlist and the repertory for a given author $a_i$, the resulting records are fed to the adaptive $aNN_i$ networks. Nevertheless, due to the low JW threshold, records which do not correspond to the $a_k$ classes of the author in question are also given to the network for classification, and consequently they have to be classified as intrusions. For this purpose, the adversarially learned one-class classifier for novelty detection proposed in [35] is used.
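A hedged Keras sketch of one such per-author classifier $aNN_i$ is shown below. The paper's implementation uses the Keras API; the fixed-length numeric input encoding of width $a + t + p + c + l$, the ReLU activations and the optimizer settings are assumptions on our part, while the layer sizes follow the $3N/2$ rule and the softmax output described above.

```python
# Sketch of one per-author classifier aNN_i; input width D = a+t+p+c+l
# (an assumed fixed-length numeric encoding of the record attributes),
# two hidden layers of 3N/2 units, softmax output over a_k classes.
from tensorflow import keras
from tensorflow.keras import layers

def build_ann(input_dim: int, num_classes: int) -> keras.Model:
    hidden = (3 * num_classes) // 2       # 3N/2 rule, N = output layer size
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(hidden, activation="relu"),
        layers.Dense(hidden, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g., artist PS with 105 repertoire records and an assumed input width:
ann_ps = build_ann(input_dim=192, num_classes=105)
```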

Results and Discussion
In this section, the performances of both the proposed GPU implementation for string similarity metrics computation and the proposed machine learning approaches for string similarity evaluation by means of classification are presented and compared with existing approaches. Consequently, the processing throughput and speed-up are analyzed considering multi-threading CPU and GPU implementations, whereas in terms of string similarity evaluation, quality measures such as the accuracy, precision and recall scores are computed for classic algorithms such as Jaro-Winkler, as well as for the machine learning algorithms used for performing the string-matching task. Please note that the proposed work envisages string matching by means of similarity metrics and neural network-based classification. Our experimental setup includes an Intel(R) i7-9750H CPU workstation with 6 cores, 12 logical processors at 2.60 GHz and 32 GB of RAM, and an NVIDIA GeForce RTX 2060 with 6 GB of RAM. We used Visual Studio Code running on Windows 10, with Python libraries for CUDA programming and the Keras API for implementing the neural-network classification algorithms.
Evaluation of the GPU based string similarity metrics computation
Throughput comparisons for several input packet sizes, applied to both the Jaro-Winkler and Levenshtein algorithms for similarity metrics computation, are presented in Table 2. Our implementation uses the Jaro-Winkler approach, whereas the implementation from [25] uses the Levenshtein distance for similarity estimation; the accuracy of the two approaches is described in Section II and in [21]. The throughput is calculated as the number of bytes in the input strings divided by the total processing time. Note that the search throughput does not include the data transfer time between the CPU and GPU.

Table 2 shows a throughput comparison of the proposed approach with a similar one for string similarity evaluation. The resemblance of the two approaches is underlined in terms of computational complexity, which is estimated based on the input data size and string length for all the approaches considered for string similarity evaluation. The number of computational steps performed by the full workflow used for the proposed record matching approach is given by $O(3VRn^2)$, according to eq. (4), where $M$ and $N$ were each replaced with the average string length $n$. Regarding the Levenshtein distance computation, regular implementations also have a computational complexity of $O(n^2)$, where $n$ represents the average string size; this leads to a total computational complexity of $O(Mn^3)$ for the overall approximate string-matching algorithm with $k$ differences under the Levenshtein distance. It can be observed that, by adopting GPU-based parallel execution, the performance was greatly enhanced compared to the CPU approach. Even though the computational complexity is similar, the increased throughput of our approach is mainly due to the independent computing tasks performed by our implementation, compared with the approach presented in [25].

To give a better image of the computational power introduced by the GPU, the entire workflow, including CPU to GPU data transfer and vice versa, is considered for the speed-up evaluation of the proposed implementation compared with a multi-threading-based CPU implementation and a classic CPU one. The hardware resources used for the implementation are specified at the beginning of the current section. Considering Repertory and Playlist sizes of 100K records, the speed-up factor introduced by using 10 threads reached 4.38x compared to the classic CPU implementation. When using the GPU implementation, the speed-up factor compared with the classic CPU-based approach is 21.6x, considering the entire processing workflow. Given these results, one can say the GPU approach for computing the string similarities can be used as a stand-alone procedure for the determination of playlist matches with the given repertory records, its powerful computation capabilities being used to compare the repertory records with the playlist ones one by one.
Evaluation of the machine learning approaches for string similarity evaluation
Aside from the aforementioned straightforward approach, an adaptive neural-network approach (aNN) was introduced for string similarity evaluation. Its main benefits are the reduced number of string comparisons (i.e., string similarity metrics computations), the improved accuracy and the possibility of deploying the computing tasks on multiple machines. The reduced number of operations can be observed by looking at the order of growth of the computational complexity of the aNN approach compared with the GPU-SA and a naive NN implementation, namely the multi-features neural network (MF-NN) (see Table 3).

Before discussing the computational complexity in more detail, the MF-NN is shortly summarized. Each record $p_i$ from the playlist is compared in terms of string similarity with the repertory records $r_i$. Multiple metrics (i.e., Jaro-Winkler, Levenshtein distance, permuted Jaro-Winkler, Jaccard N-grams and unique character count) are computed on the record strings, and they represent features describing the $(p_i, r_i)$ pair similarities. A multilayer perceptron classifier with the input layer corresponding to the previous features is trained and used to classify the $(p_i, r_i)$ pair either as a match or as two different records. It follows naturally that using multiple similarity features and machine learning approaches for evaluating the record similarities yields high accuracy and precision. Nevertheless, the computational complexity leads to an increased amount of hardware resources needed to perform the similarity evaluation in a reasonable time frame. Thus, considering 5 features for the MF-NN implementation, the order of growth of the computational complexity is $15VRn^2$ for the features computation and $VRn^5$ for the machine learning classification, leading to the order $O(VRn^5 + 15VRn^2)$.

Considering the aNN approach, let $g$ be the number of authors in the Playlist for which the JW score satisfies $JW_s(a_i, a_j) > k$ and $a_n$ be the total number of authors present in the playlist. Considering the $g a_n$ factor, the computational complexity of the aNN approach is three orders of magnitude lower than that of the MF-NN. Moreover, the MF-NN complexity term $VRn^2$ is also significantly higher than the corresponding term for the similarity metrics computation used in the aNN approach. Consequently, the increased accuracy of the MF-NN comes at the price of significantly increased computational complexity for large datasets.
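For completeness, a hedged sketch of the MF-NN feature extraction and classifier is given below; jaro_winkler is the function sketched earlier, and the remaining features are simplified stand-ins for the metrics named above (e.g., difflib's ratio replaces a true Levenshtein similarity).

```python
# Hedged sketch of the MF-NN baseline: five per-pair similarity features
# feeding a binary match / no-match MLP. The feature set approximates the
# metrics named in the text with simplified stand-ins.
from difflib import SequenceMatcher
from tensorflow import keras
from tensorflow.keras import layers

def ngrams(s: str, n: int = 2) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def features(p: str, r: str) -> list:
    jw = jaro_winkler(p, r)                                   # sketched earlier
    jw_perm = jaro_winkler(" ".join(reversed(p.split())), r)  # permuted-JW stand-in
    lev = SequenceMatcher(None, p, r).ratio()                 # Levenshtein stand-in
    jac = len(ngrams(p) & ngrams(r)) / max(len(ngrams(p) | ngrams(r)), 1)
    uniq = len(set(p) & set(r)) / max(len(set(p) | set(r)), 1)
    return [jw, jw_perm, lev, jac, uniq]

mf_nn = keras.Sequential([
    keras.Input(shape=(5,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # P(match)
])
mf_nn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```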
Commonly, the efficiency of a machine learning classifier, or of any other classifier, is described by the accuracy, precision and recall parameters, computed based on the true positive, true negative, false positive and false negative classifications performed by the method in question. In the case of the GPU-SA approach, using a threshold value of 0.92, the accuracy reached 0.83 for the full workflow of the entity match algorithm (Fig. 4, 54 MB of input data), due to the increased number of false negative classifications. Comparing the aNN with the GPU-SA procedure, the efficiency is significantly improved, whereas the computational complexity keeps the approach feasible in terms of hardware resources and computational time.

Conclusions
String similarity metrics can be successfully used for string matching decisions by tuning the similarity threshold values. When large datasets are involved, high-performance computing tools such as GPUs are available for similarity metrics computation, aiming to obtain increased processing throughput. Consequently, we proposed a string records matching procedure based on the Jaro-Winkler similarity metric for royalty rights identification and copyright protection of broadcast materials in the music industry. The procedure was implemented on a GPU, which leads to a processing throughput of 31.04 Gbps for the Jaro-Winkler similarity metric computation and 7.41 Gbps for the full string record matching algorithm. A matching accuracy of 0.83 was obtained, but the following limitations are still to be mentioned. First, the manual tuning of the similarity thresholds needs to be eliminated from the processing workflow, whereas the accuracy needs to be increased. Consequently, we proposed two machine learning approaches for string records matching, both starting from the parallel implementation of the Jaro-Winkler similarity metric. The first one, entitled the multi-features neural network approach, delivers increased accuracy (up to 0.99) for the record matching procedure, but its computational complexity is also increased, which makes it unsuitable for large datasets. On the other hand, the second approach, namely the adaptive neural network, leads to an accuracy of 0.92, whereas its computational complexity is approximately three orders of magnitude lower than that of the previous one. This fact qualifies it as an efficient solution for string records matching in the case of large datasets.

Statements and Declarations
Ethical approval. The authors declare that the submitted work is original and has not been published elsewhere in any form or language. All authors read and approved the final version of the manuscript.