An approach for improving searching algorithm in big data using hybrid technique

Search algorithms play an essential role in a wide variety of applications in computer science. Different search algorithms can be applied to different data structures. Each data structure has its own advantages and limitations, which in turn affect the complexity and performance of the search algorithm. In the context of Big Data, such limitations become more noticeable, because characteristics such as huge volume reduce the performance offered by the data structure and by the search algorithms applied to it. String datasets can be even more challenging: they cause more preprocessing overhead, and some of them, such as personal names, are skewed in a way that harms hash indexes by increasing the ratio of collisions. This paper offers a technique for improving the well-known hash table data structure by reducing the collision ratio in the average case, thus reducing the total number of comparisons made by the search algorithm. The reduction in collision ratio was achieved by building a hybrid approach that makes use of three common data structures, namely Hash Table, B-Tree and Linked List. The results showed a noticeable improvement in the time complexity of the data searching algorithm over the traditional data structures used for comparison.


Introduction
Big Data refers to massive and complex structured and unstructured datasets whose challenges and processing requirements traditional processing techniques and algorithms fail to handle properly [1]. These properties of big data emphasize the need for high-performance search algorithms and data structures. Since data structures play an important role in determining the potential of a search algorithm, improving the data structures used is a key step towards achieving better performance. Two of the most common data structures used to index datasets are the B-Tree and the Hash Table. The B-Tree is a widely known data structure that can be used to index large datasets. It is a balanced search tree that represents a natural generalization of the binary search tree. It consists of a set of nodes and the pointers that link them. A node in a B-Tree can have many tree pointers (children), with every two adjacent tree pointers separated by a key that guides the search through the tree. The B-Tree has a search complexity of O(log n) in the worst case [2]. A hash table is a data structure that can also be used to search through an indexed dataset. It is basically a table that uses a randomizing function (called the hash function) to assign each indexed element to a specific location. Hash tables are appealing because they offer O(1) search time complexity in the best case. On the other hand, they suffer from collisions, which occur when two items must be stored in the same place according to the hash function.
In the context of big data, both of these data structures have significant limitations. The B-Tree's O(log n) search time complexity can be considered large for large datasets, and the hash table's O(1) search time complexity can degrade to O(n) in the worst case due to collisions. It may therefore be appropriate to employ these indexes in a way that reduces some of their limitations to a specific degree. The Linked List data structure organizes elements as a sequential group and uses pointers (links) to keep the order of elements in the list. A linked list works well for insertion and deletion operations, yet its search complexity is O(n).

Problem statement
The problem addressed in this paper is improving existing index data structures by modifying them and combining them in a hybrid technique, in order to reduce the time complexity of searching the index.
(Wu, X. et al., 2019) [3] proposed an ordered index data structure that makes use of the strengths of three data structures, namely the hash table, the prefix tree, and the B+ tree, to come up with a single fast ordered index that has O(log L) worst-case time complexity for searching a key of length L. Publicly available datasets collected at Amazon.com and MemeTracker.org were used.
(Breslow, A. D. et al., 2016) [4] presented the Horton table, a revamped BCHT (bucketized cuckoo hash table) that reduces the expected cost of positive and negative lookups to fewer than 1.18 and 1.06 buckets, respectively, while still achieving load factors of 95%. The authors proposed remap entries, which are small in-bucket records that allow items that overflow buckets to be tracked and rehashed with one of the alternate hash functions, and which reduce the majority of negative searches to one bucket access.
(Xia, F. et al., 2017) [5] presented the Hybrid Index Key-Value Store (HiKV), a key-value store that focuses on the idea of building a hybrid index in hybrid memory. HiKV uses a hash table and a B+-tree, and it places each part of the index in the type of memory that suits it while maintaining the consistency of the two indexes. The hash table is built in Non-Volatile Memory (NVM) so as to retain its fast index searching. The B+-tree, on the other hand, is built in Dynamic Random Access Memory (DRAM) so as to avoid long NVM writes. With a single thread, HiKV outperforms the state-of-the-art NVM-based key-value stores by reducing latency by up to 86.6%, and with multiple threads it increases throughput by up to 6.4x under Yahoo! Cloud Serving Benchmark (YCSB) workloads.

Literature review
(Nguyen, M. K. et al., 2010) [6] proposed a novel type of index structure named the B+Hash Tree, which combines the strengths of the B-Tree data structure with the fast search capability of hash-based structures. The main idea is to use a hash map to improve the B+ Tree in order to allow constant retrieval time instead of the logarithmic time of the B+ Tree. The index was evaluated against existing RDF indexing schemes, and it was shown that a B+Hash Tree is at least twice as fast as its competitors and that this speed advantage should grow as dataset sizes increase.

Data set
In this paper, data was collected from real datasets of resident patients at public and private hospitals in several cities in Iraq. The collected data spans several years and contains more than 30 attributes, including personal name, age, diagnosis, date, job, etc. The collected data contains 607,119 records. A high ratio of recurring personal names was found. The most useful attribute for this research was the personal name, written in the Arabic language. A sample of the dataset is shown in Table 1, where names and mother names were removed to protect the privacy of patients. Several data preprocessing steps were performed: aggregation of data from many Excel files into one file, extraction of personal names, elimination of null names (after which the dataset had 600,007 names), and elimination of characters that are not Arabic letters (slashes, "parsing signs", etc.). Personal names (string keys) were then converted into natural numbers using Equation 1, which follows [2]. In Equation 1, Ci is the natural number that corresponds to the string key Ki (Ki being an element of the set of personal names, or one single name out of a full personal name), J is the position in the alphabet of the letter Ki,L (the letter at position L of key Ki), counting from 'ا' = 0, 'ب' = 1, and so on, and B is the base. For example, B can take one of the values 31, 33, 35, or 37.
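The conversion of a name into a natural number can be illustrated as a positional (radix-B) encoding. The Python sketch below is an illustration under assumptions, not a reproduction of Equation 1: it assumes the first letter is the most significant digit, it uses only the 28 basic Arabic letters in a hypothetical constant `ARABIC_LETTERS` whose order defines J, and the function name `name_to_number` is invented here. The default base 31 is one of the example values given above.

```python
# Assumed alphabet order: 'ا' = 0, 'ب' = 1, ... (28 basic letters only;
# a real dataset may need additional letter forms).
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"
LETTER_INDEX = {ch: i for i, ch in enumerate(ARABIC_LETTERS)}


def name_to_number(name: str, base: int = 31) -> int:
    """Read `name` as a number written in radix `base` (Horner's rule).

    Assumption: the first letter is the most significant digit.
    Raises KeyError for characters outside ARABIC_LETTERS.
    """
    value = 0
    for ch in name:
        value = value * base + LETTER_INDEX[ch]
    return value
```

Python integers have arbitrary precision, so long names do not overflow; an implementation with fixed-width integers would need to reduce the value modulo the table size as it goes.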

Proposed index
In this section, a description of the proposed index structure is provided, along with basic operations algorithms.

Overview of index structure
As mentioned previously, the proposed index structure combines three data structures, namely the B-Tree, the hash table, and the linked list. Each structure is explained in turn. The B-Tree used has the same structure as an ordinary B-Tree, with two differences. The first is a maximum allowed number of levels, which must be specified before any keys are inserted into the structure. The second is that each node in the B-Tree holds, aside from the ordinary B-Tree contents, an array. This array can be expanded under certain conditions into a linked list (at internal nodes) or into a hash table (at leaf nodes).
The hash table structure is built under certain conditions at leaf nodes by expanding a leaf node's associated array as mentioned above, so that each leaf node can build its own hash table as needed. Each hash table is represented as an array that uses chaining to resolve collisions. Each element of the hash table array has a pointer and can therefore build a linked list that represents the chained list of collided keys. The difference from an ordinary hash table is that each element in the chained list has, aside from the first pointer that forms the chained list, a second pointer that is used to group identical collided elements. This reduces the length of the chained list that a search has to traverse when the dataset includes recurring keys.
The third part of the structure is the linked list. One case where a linked list is used is the hash table chained list mentioned above. The second case is the linked lists at internal nodes. Each element of the array associated with a node has two pointers. At internal nodes, this array is expanded under certain conditions into an array with only one element, and that element forms a linked list through the first pointers; this can be referred to as the "first linked list". Elements are added to the first linked list if they are identical to a key stored in the internal node. The second pointer of each element in the first linked list is expanded as needed into a second linked list, which holds the values of recurring keys identical to the key stored at the element where the second linked list was initiated. The goal of the second linked list is likewise to reduce the length of the first linked list when the dataset includes recurring keys, so as to speed up searching.
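The grouped chaining used in the leaf hash tables can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: keys are already natural numbers, the hash function is the division method, a Python list stands in for the second (grouping) linked list, and the names `ChainEntry` and `GroupedHashTable` are hypothetical.

```python
class ChainEntry:
    """One unique key in a chain, with the values of all its occurrences."""

    def __init__(self, key, value, next_entry=None):
        self.key = key
        self.values = [value]   # stands in for the second (grouping) pointer
        self.next = next_entry  # first pointer: next unique key in the chain


class GroupedHashTable:
    def __init__(self, size):
        self.slots = [None] * size

    def _slot(self, key):
        # Assumed hash function: division method on a natural-number key.
        return key % len(self.slots)

    def insert(self, key, value):
        i = self._slot(key)
        entry = self.slots[i]
        while entry is not None:
            if entry.key == key:            # recurring key: attach value only
                entry.values.append(value)
                return
            entry = entry.next
        # New unique key: prepend it to the chain.
        self.slots[i] = ChainEntry(key, value, self.slots[i])

    def search(self, key):
        """Return the values of all occurrences of `key` (empty if absent)."""
        entry = self.slots[self._slot(key)]
        while entry is not None:
            if entry.key == key:
                return entry.values
            entry = entry.next
        return []

    def chain_length(self, i):
        """Number of unique keys a search must traverse in slot `i`."""
        length, entry = 0, self.slots[i]
        while entry is not None:
            length, entry = length + 1, entry.next
        return length
```

With this layout, inserting the same key many times lengthens only that key's value list, so the chain a search walks through grows with the number of distinct colliding keys rather than with the number of records.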

Index construction
Construction of the structure begins by inserting the first key into the B-Tree. After that, each key to be inserted is first searched for in the B-Tree. If it is found, the new key (or its value) is added to the linked list or hash table associated with the node holding that key. Note that each key is added to a node's associated linked list or hash table only once, along with its value; later occurrences of that key are not added again, and their values are instead attached to the occurrence already present. If the key is not found in the B-Tree, it is inserted into the B-Tree if possible. What makes insertion into the B-Tree not always possible is the maximum allowed number of levels.
Once the actual number of levels reaches the maximum allowed number, any further split of a full root node during insertion would violate the condition levels <= maximum allowed levels. Therefore, once the maximum has been reached, insertions into the B-Tree are still allowed as long as they do not require splitting the root. If inserting a key k would require splitting the root, k is instead inserted into the appropriate hash table (the hash table associated with the leaf node whose key range contains k). The insertion procedure finds the appropriate leaf node and inserts k into its associated hash table. This requires searching the hash table for k first, since k might already exist there, in which case only the new value is attached to the key already present. If k is not present in the hash table, k and its value are added to it.
An example that demonstrates insertion into BT_HT_LL is explained next. Let BT be a B-Tree of order P = 4 with a maximum number of levels of 2. Let S be the sequence of keys to be inserted, S = {1, 2, 3, 4, 1}. Insertion of the first four keys in order is depicted in Figure 1. Figure 2 depicts the insertion of the final key 1 into the structure last produced in Figure 1.
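The routing rule described above can be summarized as a small decision function. This is a sketch of the decision only, under assumptions: the B-Tree itself is not modelled, the caller is assumed to know whether the key was found and whether the insertion would split the root, and the names `InsertTarget` and `choose_insert_target` are hypothetical.

```python
from enum import Enum


class InsertTarget(Enum):
    ASSOCIATED_STRUCTURE = "attach to the node's linked list / hash table"
    BTREE = "insert into the B-Tree"
    LEAF_HASH_TABLE = "insert into the hash table of the matching leaf"


def choose_insert_target(found_in_tree, actual_levels, max_levels,
                         needs_root_split):
    """Decide where a key goes, following the construction rule above."""
    if found_in_tree:
        # Recurring key: only its value is attached next to the stored key.
        return InsertTarget.ASSOCIATED_STRUCTURE
    if actual_levels >= max_levels and needs_root_split:
        # Splitting the root would add a level beyond the allowed maximum.
        return InsertTarget.LEAF_HASH_TABLE
    # Either the tree may still grow, or no root split is needed.
    return InsertTarget.BTREE
```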

Insertion algorithm in BT_HT_LL
Insertion of key k into BT_HT_LL index follows the following steps:

Searching algorithm in BT_HT_LL
Searching for all occurrences of key k in BT_HT_LL can be described as follows:

Deletion algorithm in BT_HT_LL
Deletion in BT_HT_LL is based on the ordinary B-Tree-Delete procedure with preemptive merging found in [2], tuned to work with the proposed index. The most important points of difference are as follows:
• In BT_HT_LL, a key can have a list of values related to it. Therefore, once a key k is deleted, replaced, or moved to another location, its values list should also be handled according to the operation being conducted on k.
• In BT_HT_LL, if k is not found in the B-Tree, it might still exist in the appropriate hash table. Therefore, it is searched for there and deleted if it is found.
• When merging nodes, keys in the associated linked lists or hash tables should also be moved to the merged node along with their values lists. When merging leaf nodes, keys that are stored only in the hash tables should also be moved to the merged node.

Indexes used for comparisons
For the sake of comparison, an additional index was built that represents a part of BT_HT_LL. This additional index is called HashTable-LinkedList. It is a simple index that consists of a standard hash table that uses chaining to resolve collisions. The difference from an ordinary hash table is that inside each chained list, identical keys are grouped together in linked lists, as explained for BT_HT_LL. Therefore, each chained list contains the unique keys belonging to that hash table location in one sequence (the chained list), and the recurring values of each key are stored aside from this main chained list, in the key's associated linked list. Each unique key in a chained list has a value and an associated set of values that belong to it. Insertion, searching, and deletion are then straightforward. This index can also give some information about an ordinary hash table, since the number of keys stored in a chained list during insertion can be counted through the keys themselves, or through their values if recurring keys were inserted. With respect to the B-Tree, [7] describes a B-Tree that assumes all keys are unique. If keys have duplicates, the B-Tree algorithms can be modified to work with duplicate keys. The standard insertion procedure was sufficient for inserting duplicate keys. Searching was tuned by searching recursively through the sub-trees that surround each instance of k in the node where k is first found. Deletion was also tuned: each instance of key k is treated as a separate key and deleted separately, so as long as key k is still found in the B-Tree, the deletion procedure is called again to delete it. This tuned version will be referred to as "Tuned-B-Tree". Note that in the time complexity table, a factor related to the number of occurrences of the key may need to be multiplied by the complexity given in the table. This factor was omitted from the table because the described tuning of the B-Tree is straightforward, and better, more efficient alternatives may exist. The space complexity described in the space complexity table is based on the tuned version.
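The way one index can report on both layouts can be sketched by counting chain lengths with and without grouping. This is an illustrative sketch under assumptions: keys are natural numbers, the hash function is the division method, the average is taken over non-empty slots only (the averaging convention used in the experiments is not stated here), and the function name `average_chain_lengths` is hypothetical.

```python
from collections import Counter


def average_chain_lengths(keys, table_size):
    """Return (standard, grouped) average chain length over non-empty slots.

    standard: every occurrence of a key is its own chain element.
    grouped:  identical keys share one chain element (HashTable-LinkedList).
    """
    standard = Counter(k % table_size for k in keys)
    grouped = Counter(k % table_size for k in set(keys))

    def average(counts):
        return sum(counts.values()) / len(counts) if counts else 0.0

    return average(standard), average(grouped)
```

For the keys [1, 1, 1, 5, 2] and a table of size 4, the standard layout has chains of length 4 and 1, while the grouped layout has chains of length 2 and 1.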

Experiments and results
This subsection describes the experiments conducted to evaluate BT_HT_LL. BT_HT_LL offers a potential improvement on the time complexity of the B-Tree and the Hash Table. Therefore, the following experiments were conducted:
1. Average length of the chained list in the hash tables of BT_HT_LL versus in the chained list of a standard hash index table. Results of the first experiment are given in Figure 3 and Figure 4, and they are also shown in Table 2 and Table 3.

Table 3 Average length of chained list in Standard Hash Table.

Table 6 Total number of collisions in BT_HT_LL hash tables for total hash tables size ≈ 2n.
B-Tree order P                             10       10       10       15       15       20       20       10       50      100
B-Tree max number of levels                 3        4        5        3        4        3        4        6        3        2
B-Tree actual number of levels              3        4        5        3        4        3        4        6        3        2
Hash table size                          1200      120       12      355       23      150        7        1        9      120
Number of hash tables (incl. null)       1000    10000   100000     3375    50625     8000   160000  1000000   125000    10000
Total size of hash tables             1200000  1200000  1200000  1198125  1164375  1200000

• The B-Tree stores part of the keys apart from the hash tables.
• BT_HT_LL stores recurring occurrences of internal nodes' keys aside from the B-Tree nodes, in associated linked lists, which allows B-Tree nodes to free some space for other keys.
• BT_HT_LL may use only a limited amount of the available total size of the hash tables because several tables are null, which can increase collisions.
• BT_HT_LL uses a linked list to decrease the length of the chained list in the hash table.

Time and space complexity analysis
Time and space complexities are important measures that convey useful information about an algorithm. Time complexity denotes how the running time of an algorithm grows with the size of its input. Space complexity denotes the space required to run the algorithm. Table 9 and Table 10 summarize the time and space complexities of various indexes, including the proposed index. Table 11 lists the notation used to describe time and space complexities in Table 9 and Table 10 and in the explanation that follows. It is worth noting that, regardless of the specified maximum number of levels, the actual number of levels of the B-Tree in BT_HT_LL cannot exceed log(Unique-n), compared to log(n) in a standard B-Tree, where Unique-n <= n. This is because recurring occurrences of keys held in the B-Tree are stored aside from the B-Tree nodes.
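The bound on the number of levels can be made explicit with the standard B-Tree height result. The following is a sketch under assumptions not stated in the text: t >= 2 denotes the minimum degree of the B-Tree (for order P, roughly t = ⌈P/2⌉), h is the height counted in edges from the root to a leaf, n_u stands for Unique-n, and the tree is assumed to hold at most n_u keys because recurring occurrences are kept outside the nodes.

```latex
h \;\le\; \log_t \frac{n_u + 1}{2} \;\le\; \log_t \frac{n + 1}{2},
\qquad n_u \le n .
```

The second inequality follows from the logarithm being increasing for t > 1, so a dataset with many recurring names yields a bound that is no larger, and possibly smaller, than that of a B-Tree storing every occurrence.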

LenLL    Length of the linked list associated with internal nodes; the maximum of LenLL is P-1.
         Maximum allowed number of levels in B-Tree.
Act      Actual number of levels in B-Tree.

Conclusion
From the results obtained in Tables 2 through 8, it can be concluded that:
1. BT_HT_LL improves time complexity by grouping identical keys together, by limiting the maximum number of levels of the B-Tree to reduce B-Tree searching time, and by using hash tables to continue the search.
2. Judging by the expected and counted total collisions from the experiments and the count of identical keys found in the real data, grouping identical keys in HashTable-LinkedList has a significant impact on the time efficiency of hash table searching. Since this grouping is also used in BT_HT_LL, it can have an impact on BT_HT_LL as well.

Recommendations
From the results obtained in Tables 2 through 8, it can be recommended that:
1. For small amounts of data, the HashTable-LinkedList indexing technique can be used, since its basic operations are simple to implement and the index is sufficient when n is small and the worst case is acceptable.
2. For moderate and large amounts of data, BT_HT_LL has several features and tunable parameters that can improve on the time efficiency of the B-Tree and the Hash Table.