Performance analysis for similarity data fusion model for enabling time series indexing in internet of things applications

The Internet of Things (IoT) has penetrated the things and objects around us, giving them the ability to interact with the Internet, i.e., things become Smart Things (SThs). As a result, SThs produce massive real-time data (i.e., big IoT data). The smartness of IoT applications is based mainly on services such as automatic control, event handling, and decision making. Consumers of IoT services are not only human users but also SThs. Consequently, the potential of IoT applications relies on supporting services such as searching, retrieving, mining, analyzing, and sharing real-time data. To enhance the search service in the IoT, our previous work presented a promising solution, called Cluster Representative (ClRe), for indexing similar SThs in IoT applications. ClRe algorithms could reduce similar indexing by O(K − 1), where K is the number of Time Series (TS) in a cluster. Multiple extensions of the ClRe algorithms were presented in another work to enhance the accuracy of indexed data. In this context, this paper analyzes the performance of ClRe algorithms, proposes two novel execution methods, (a) Linear execution (LE) and (b) Pair-merge execution (PME), and studies the impact of sorting on TS execution for enhancing the similarity rate of some ClRe extensions. The proposed execution methods are evaluated with real examples and validated using the Szeged-weather dataset on ClRe 3.0 and its extensions, where they produce representatives with higher similarities compared to the other extensions. Evaluation results indicate that PME could improve the performance of ClRe 3.0 by ≈ 20.5%, ClRe 3.1 by ≈ 17.7%, and ClRe 3.2 by ≈ 6.4% on average.


INTRODUCTION
The Internet of Things (IoT) paradigm allows things and objects (e.g., animals, bulbs, cars, and persons) to speak on the Internet by enabling them to express their states. The lower layer of the IoT, called the sensing or perception layer, contains things that have smart devices such as sensors and actuators attached to them so that they can be managed through the Internet; such things are defined as Smart Things (SThs). They are the key elements for building IoT applications (Younan et al., 2020b). Entities of Interest (EoI) is another concept, used to index ranges of STh readings because the readings differ only slightly. This type of information is called semi-dynamic. According to Fathy et al. (2016), distributed indexes are recommended for indexing IoT data. Younan et al. (2020a) and Tran et al. (2017) recommend and present solutions for balancing data indexing and updating.
Because the search service has a significant impact on IoT applications, as studied in Younan et al. (2020a), the previous research works (Younan et al., 2020a; Younan et al., 2020c) focus on similarity data fusion for enabling balanced indexing of SThs Time Series (TS). Novel communication techniques were presented in Younan et al. (2020b), Papageorgiou, Cheng & Kovacs (2015), and Al-Qurabat, Abou Jaoude & Idrees (2019) to reduce the redundancy of readings and data transmission and thereby save power. A novel similarity data fusion model based on Dynamic Time Warping (DTW) (Salvador & Chan, 2007), called Cluster Representative (ClRe), was presented in Younan et al. (2020a) and improved in Younan et al. (2020c) to generate representatives with higher similarities to all TS. The novelty of this method is that it indexes a single STh dataset (i.e., the cluster representative) for each cluster of TS, which captures most of the SThs' behaviors (i.e., sensing patterns). The contributions of this paper are summarized as follows:
• Presenting a brief overview of ClRe algorithms and their extensions to analyze their performance in terms of run-time complexity, representative accuracy, and representative length.
• Proposing a new method for parallel execution, called Pair-merge execution (PME).
• Testing all permutations of datasets in each cluster to study the impact of sorting datasets by dissimilarity value on ClRe performance.
The remainder of this paper is organized as follows. Related works that recommend searching, crawling, and balanced indexing for IoT resources (SThs) are discussed in the next section. "ClRe Algorithm" presents a brief overview of ClRe algorithms and their extensions (proposed in the previous works). "The Proposed Execution Methods for ClRe Algorithm" proposes two execution methods for enhancing ClRe performance so that cluster representatives are indexed with higher accuracy. "Discussion and Performance Analysis" discusses the performance evaluation results of ClRe running in the proposed modes. The conclusion and future work are presented in "Conclusion".

RELATED WORK
This section discusses related work, starting with recent research that highlights recommendations on searching and indexing IoT resources, followed by recent search engines and frameworks that enable IoT search, and finally recent studies on balancing IoT data indexing.
Multiple studies, such as Younan et al. (2020b), Tran et al. (2017), and Barnaghi & Sheth (2016), review and summarize searching challenges and requirements in the IoT. Tran et al. (2017) and Pattar et al. (2018) highlight recommendations for an indexing mechanism that balances data indexing and refreshing. In Khalil et al. (2020), discovery techniques in the IoT were reviewed and classified into data-based and object-based. IoT resources require a special type of crawler (Baldassarre et al., 2019).
As mentioned earlier, data fusion is a promising solution for handling real-time big data in the IoT (Akbar et al., 2018). The idea is to reduce redundancy in the lower layer, where sensors are integrated (i.e., on the SThs level). Liu, Song & Liu (2020) implemented data aggregation for improving data gathering in Wireless Sensor Networks (WSNs) on the level of sensors to increase network lifetime; their proposed model is based on data generation and transmission cycles (i.e., transmission is based on wake-up cycles). This level of data aggregation corresponds to the perception layer in the IoT. For the same purpose, Liu, Song & Liu (2020) and Huang et al. (2020) propose the SPS-IUTO and ACMC schemes, respectively, for data gathering. In the previous work (Younan et al., 2020a), threshold data and time are used statically and semi-dynamically for deciding on data transmission, but the difference is that these models integrate stability and prediction of SThs readings in the decision. The second proposed model for data aggregation in Younan et al. (2020a, 2020c) is executed in the fog layer, on the gateway level (i.e., STh datasets are already gathered), for each cluster to build higher-level indexes as proposed in Younan, Khattab & Bahgat (2016). Ramachandran et al. (2018) reduce the search space by enabling a cluster-based search. Similarly, this work and the previous works (Younan et al., 2020a, 2020c) suppose that SThs are grouped/clustered to be indexed. Indexing cluster representatives not only allows data reduction, but also keeps indexes as up-to-date as possible. Moreover, enabling search in cluster representatives enhances search engine performance by saving the time consumed in crawling, indexing, retrieving, and ranking. Liu et al. (2018) focus on indexing the most valuable attributes. DTW (Salvador & Chan, 2007) has multiple applications in this theme. For instance, in Kocyan et al. (2013) and Maschi et al. (2018) it is used for identifying repeating episodes (i.e., assessing similarities between patterns). According to Vaughan & Gabrys (2016), it is used for combining time series datasets. Thus, this paper proposes new execution methods for improving ClRe performance to index similar SThs resources.

CLRE ALGORITHM
This section presents a brief overview of Cluster Representative (ClRe) algorithms and their extensions. ClRe algorithms are presented in the previous works (Younan et al., 2020a, 2020c). In Younan et al. (2020a), a new communication architecture is presented for reducing data transmission between SThs and their gateways, saving resources (e.g., battery) to increase SThs lifetime. Moreover, a novel similarity data fusion model for reducing indexed data in the IoT is proposed. This model is based on indexing cluster representatives for similar SThs. In IoT applications, SThs are clustered by type and location (i.e., SThs measure the same phenomenon in the same environment). The validity and consistency of clustering are assessed by DTW; Silhouette is another metric for effective clustering (Kalpakis, Gada & Puttagunta, 2001). Within the same cluster, dissimilarity between datasets is measured by DTW. The resulting cluster representatives have to combine/aggregate SThs datasets in a form that captures most of the SThs' behaviors. So, the dissimilarity value of the resulting cluster representative is the main criterion for assessing and improving ClRe algorithm performance. Table 2 shows a brief summary of the performance analysis for all ClRe extensions, where K is the number of datasets in the cluster and n is the maximum dataset length (maximum number of STh readings). All extensions are compared based on their run-time complexity, the memory complexity required to generate the cluster representative, the resulting representative length, and the average dissimilarity between the cluster representative and all datasets in the cluster. Average dissimilarity value and length are ranked in ascending order (1:5), i.e., '1' means the lowest dissimilarity value and the shortest length, while '5' means the highest dissimilarity value and the longest length.
The main idea of ClRe 1.0 and its extension (ClRe 1.1) is to select the dataset with the highest similarity as the cluster representative; assessing the similarity between each of the K datasets and all the others therefore takes O(K^2 · DTW) running time. Because ClRe 1.0 uses DTW, its running time complexity is O(K^2 n^2), while ClRe 1.1 takes O(K^2 n) because it uses FastDTW. The two extensions have the same memory complexity, which is needed to store the resulting dissimilarity values between each dataset and the others in the cluster. ClRe 2.0 averages the readings taken at the same time across all SThs, so its running time is O(Kn) and the resulting representative dataset takes O(n) memory space. ClRe 3.0 and its extensions (ClRe 3.1 and ClRe 3.2) are based on handling warped items after obtaining the warped path between each pair of datasets, capturing SThs behavior in a new dataset called the inner/temporary representative. Obtaining an accurate warped path relies on assessing similarity using DTW, so their running time is O(Kn^2); the memory required for storing the resulting dataset is in the range of the maximum dataset length.
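The selection and averaging ideas described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `dtw_distance`, `select_representative`, and `pointwise_average` are hypothetical, the DTW cost is assumed to be the absolute difference between readings, and the averaging helper assumes all TS have equal length.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic-programming DTW, O(len(a) * len(b)), with |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def select_representative(cluster: List[Sequence[float]]) -> int:
    """Selection idea: index of the TS with the least average DTW dissimilarity
    to the other TS in the cluster (K * (K - 1) DTW calls; needs K >= 2)."""
    k = len(cluster)
    averages = []
    for i in range(k):
        total = sum(dtw_distance(cluster[i], cluster[j]) for j in range(k) if j != i)
        averages.append(total / (k - 1))
    return min(range(k), key=averages.__getitem__)


def pointwise_average(cluster: List[Sequence[float]]) -> List[float]:
    """Averaging idea: mean of the readings that share the same time index
    (assumes equal-length TS; zip stops at the shortest one otherwise)."""
    return [sum(values) / len(values) for values in zip(*cluster)]
```

The quadratic number of pairwise comparisons in `select_representative` is what gives the K^2 factor in the selection-based variants, while `pointwise_average` touches each reading once.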
In brief, the promising solutions for balancing time series indexing in the IoT, depending on which criterion has priority, are as follows: ClRe 3.1 for capturing most of the SThs' behavior in the cluster, since it has the highest average similarity; ClRe 1.1 for a balance between running time, average similarity, and dataset length; and ClRe 3.2 for indexing only the behavior common to all SThs in the cluster, which means that the resulting representative will have fewer items (ω(n)).

THE PROPOSED EXECUTION METHODS FOR CLRE ALGORITHM
As listed earlier, one of the main reasons for assessing, measuring, or monitoring a certain phenomenon (e.g., CO2 level) using multiple SThs of the same type is the certainty of readings. In such cases, one of those SThs is selected to be indexed (i.e., filtering SThs), while in other cases the behaviors of all SThs are required to be expressed as a single STh (i.e., aggregation). Thus, the term accuracy in this theme means having the least dissimilarity score with respect to all TS in the cluster. ClRe algorithms (versions 1.0, 2.0, and 3.0) are presented in Younan et al. (2020a). Based on their performance analysis, ClRe 1.0, which is based on DTW, and its extension ClRe 1.1, which is based on FastDTW, have the minimum running time, O(Kn^2) and O(Kn), respectively, while ClRe 3.0 and its extension ClRe 3.1 have the higher accuracy (i.e., lower dissimilarity values for the TS in their clusters). For enabling similar SThs indexing, the main targets proposed in Younan et al. (2020a), Papageorgiou, Cheng & Kovacs (2015), and in this paper are: (a) reducing index size, which is achieved by indexing only one STh (TS) per cluster, and (b) increasing the accuracy of indexed data, which is achieved by decreasing the dissimilarity values of the indexed cluster representatives.
Because reducing the TS of the cluster representative could distort the main behaviors of SThs, the main goal of this paper is to increase the accuracy of indexed data by decreasing the dissimilarity values of the cluster representatives. This could be achieved if the resulting dataset (i.e., the cluster representative) captures most of the behaviors of the SThs in the cluster. Based on the experimental results discussed in the next section, ClRe 3.1 is the promising solution in this theme. Improving the running time of this extension is another goal in addition to its accuracy. So this section proposes two methods for execution, as shown in Fig. 1: (a) sequential execution (the trivial method), called linear representative, and (b) parallel execution, called pair-merge representative.
Similar to the merge sort algorithm, the pair-merge method generates a representative dataset for each pair of TS at the lowest level, and then recursively for the resulting representatives at each higher level up to the root (the final representative). The proposed execution methods are shown in Algorithms 1 and 2, respectively.
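The two execution orders can be sketched in Python as follows. This is a minimal illustration and not a transcription of Algorithms 1 and 2: `merge` stands in for any ClRe V call, `toy_merge` is a hypothetical placeholder (element-wise mean) used only so that the sketch runs, and carrying an unpaired dataset up to the next level unchanged is an assumption for odd K.

```python
from typing import Callable, List, Sequence

Series = Sequence[float]
Merge = Callable[[Series, Series], Series]


def linear_execution(cluster: List[Series], merge: Merge) -> Series:
    """LE: fold the cluster left to right into one accumulative representative."""
    acc = cluster[0]  # assumes a non-empty cluster
    for ts in cluster[1:]:
        acc = merge(acc, ts)  # one call per remaining dataset, K - 1 in total
    return acc


def pair_merge_execution(cluster: List[Series], merge: Merge) -> Series:
    """PME: merge neighbouring pairs level by level, as in merge sort."""
    level = list(cluster)  # assumes a non-empty cluster
    while len(level) > 1:
        # Calls inside one level are independent, so they could run in parallel.
        merged = [merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            merged.append(level[-1])  # assumption: odd one out moves up unchanged
        level = merged
    return level[0]


def toy_merge(a: Series, b: Series) -> List[float]:
    """Hypothetical stand-in for ClRe V: element-wise mean of two series."""
    return [(x + y) / 2 for x, y in zip(a, b)]
```

With four single-reading datasets `[[0.0], [4.0], [8.0], [12.0]]`, both functions issue three `merge` calls, but they combine the datasets in a different order and so can return different representatives.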
The running time complexity of the proposed execution methods is analyzed as follows, where ClRe V represents any extension of ClRe 3.0 (e.g., ClRe 3.1). As mentioned earlier (Table 2), ClRe 3.0 and its extensions take O(Kn^2).
Linear execution method (LE): this is the trivial execution method. It calls ClRe V repeatedly to generate a representative between the accumulative representation and the current dataset in the cluster, so the run-time complexity of ClRe V is not affected, as indicated in Eq. (1).

$$\sum_{i=2}^{K} ClRe(AccRep, List[i]) = (K - 1)\,\bigl(C_1(n^2) + C_2(N)\bigr) \quad (1)$$

where AccRep is the accumulative representation dataset, List[i] is the i-th TS in the cluster, and (C_1(n^2) + C_2(N)) is the running time consumed by ClRe V.
Pair-merge execution method (PME): this method starts at level (log K) by calling ClRe V on each pair of datasets. The running time consumed at level j is indicated by Eq. (2); it can also be expressed, as shown in Eq. (3), as a summation of the calls at each level.

$$\sum_{i = 1, 3, 5, \dots} ClRe(List_j[i], List_j[i+1]) \quad (2)$$

where List_j is the list of datasets at level j. From these equations, the two methods have the same running time complexity. Both generate the same number of inner representative datasets (K − 1) to obtain the final cluster representative, but the second method (pair-merge), when executed in parallel, could reduce the running time from O(Kn^2) to O(n^2 log K), and it outperforms linear execution in similarity rates. The pros and cons of the two methods are summarized in Table 3. However, although the resulting representative length with pair-merge execution is in the range of the maximum length of the datasets in the cluster, it is larger than the resulting representative length in sequential execution.
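As a worked check of these claims, the following derivation counts the ClRe V calls made by pair-merge execution. It is a sketch under two assumptions that are not stated in the text: K is a power of two, and every ClRe V call is bounded by a common cost T = C_1(n^2) + C_2(N).

```latex
% Calls per level: K/2 at the lowest level, then K/4, ..., down to 1 at the root.
\[
  \sum_{j=1}^{\log_2 K} \frac{K}{2^{j}}
  \;=\; \frac{K}{2} + \frac{K}{4} + \dots + 1
  \;=\; K - 1
\]
% Sequential work is therefore the same as in linear execution:
\[
  (K - 1)\,T \;=\; O(K n^{2})
\]
% If the calls inside each level run in parallel, only the number of levels counts:
\[
  \log_2 K \cdot T \;=\; O(n^{2} \log K)
\]
```

For example, K = 8 gives 4 + 2 + 1 = 7 calls spread over 3 levels, so a parallel run waits for 3 consecutive calls where linear execution waits for 7.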

DISCUSSION AND PERFORMANCE ANALYSIS
In this section, ClRe 3.0 and its extensions (ClRe 3.1 and ClRe 3.2) are implemented using the two proposed methods (LE and PME). Evaluation is done using real examples in the first subsection and using a real dataset (Szeged-weather (Budincsevity, 2016)) in the second subsection. Performance is measured over all TS permutations in two folds. The first fold assesses the dissimilarity score in terms of (a) Min: minimum dissimilarity value, (b) Max: maximum dissimilarity value, (c) Avg: average dissimilarity value, (d) 50th Perc: 50th percentile dissimilarity value, and (e) Range: dissimilarity range value (maximum value − minimum value). The second fold assesses the resulting representative length (number of readings or items) in terms of (a) Min: minimum number of items, (b) Max: maximum number of items, (c) Avg: average number of items, and (d) Range (maximum length − minimum length). All permutations are tested to clarify the impact of sorting on ClRe performance.

Table 3 A comparison between the proposed execution methods (LE and PME).

| | Linear execution (LE) | Pair-merge execution (PME) |
|---|---|---|
| | • Accumulative building of the final representative at each step. • One representative per iteration. • No parallel execution. | • Temporary representative for each pair. • Accumulative representative at each level (sub-tree). • Parallel: O(n^2 log2(K)). |
| Pros | • Less memory at each iteration (only one dataset of length N, where n < N < 2n). | • Generates K/2 representatives with less dissimilarity from the original datasets. |
| Cons | • Only sequential execution. • Average dissimilarity < Pair-merge dissimilarity. | • At the level (log K), it requires (log K) memory spaces, each of length N, where n < N < 2n, for representing each pair. |
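The per-permutation summary described above could be computed along the following lines. This is a hedged sketch: `summarize` and `permutation_scores` are hypothetical helper names, `build` and `score` are placeholders for a ClRe execution method and a dissimilarity measure, and the 50th percentile is taken to be the median.

```python
from itertools import permutations
from statistics import mean, median
from typing import Callable, Dict, Iterable, List, Sequence


def summarize(values: Iterable[float]) -> Dict[str, float]:
    """Min, Max, Avg, 50th percentile (median), and Range of a list of scores."""
    vals = list(values)
    return {
        "min": min(vals),
        "max": max(vals),
        "avg": mean(vals),
        "p50": median(vals),
        "range": max(vals) - min(vals),
    }


def permutation_scores(
    cluster: Sequence,
    build: Callable[[List], object],
    score: Callable[[object, Sequence], float],
) -> List[float]:
    """Build a representative for every ordering of the cluster and score it.

    `build` receives one ordering of the datasets and returns a representative;
    `score` compares that representative with the original cluster.
    K! orderings are evaluated, so this is only practical for small clusters.
    """
    return [score(build(list(order)), cluster) for order in permutations(cluster)]
```

Applying `summarize` to representative lengths instead of dissimilarity scores yields the statistics of the second fold; the extra `p50` entry can simply be ignored there.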

Real dataset
This subsection tests ClRe extensions using the linear and pair-merge execution methods on four clusters formed from the Szeged-weather dataset. This dataset contains real historical weather data on temperature, pressure, wind speed, and more. High-level clustering is done on sensor type (phenomenon). In this experiment, temperature is selected as the higher-level cluster, and annual cycles in temperature are extracted to represent datasets produced by similar sensors (i.e., lower-level clustering). Clusters are formed using temperature readings in January for the years 2011 to 2016, with TS labels from a to f. As in the previous subsection, the average dissimilarities between every dataset and the other datasets in the cluster are indicated in Table 9. The minimum dissimilarity value in each cluster is indicated in Table 10. ClRe extensions are tested on the four clusters in LE mode (Table 11) and in PME mode (Table 12). All permutations are tested to clarify the impact of sorting TS datasets by dissimilarity value on ClRe performance, and it is noticed that ClRe execution on ordered TS produces a dissimilarity value approximately equal to the average dissimilarity over all permutations. For instance, for CLS B in the real dataset experiments, ClRe 3.1 produces 1.54 in LE mode and 1.24 in PME mode (Tables 11 and 12).
To sum up, the proposed methods have an impact on the performance of ClRe 3.0 and its extensions (versions 3.1 and 3.2). Tables 13 and 14 summarize and compare their average dissimilarity scores and resulting average dataset lengths using the LE and PME methods. Figure 2 visualizes these results as well, where blue bars represent LE results and red bars represent PME results. The performance enhancement of the resulting representatives using LE and PME is measured using Eq. (4).

$$Enhancement(ClRe) = \left(1 - \frac{Dis_{PME}}{Dis_{LE}}\right) \times 100 \quad (4)$$

where Dis_PME and Dis_LE are the dissimilarity values obtained when running the ClRe algorithm in PME and LE modes, respectively. Based on this equation, the evaluation results of the real dataset experiments indicate that PME could improve the performance of ClRe versions 3.0, 3.1, and 3.2 on average by approximately 20.5%, 17.7%, and 6.4%, respectively.
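As a small worked example, Eq. (4), read here as (1 − Dis_PME / Dis_LE) × 100, can be applied to the CLS B values quoted earlier for ClRe 3.1 (1.54 in LE mode and 1.24 in PME mode). The function name `enhancement` is a hypothetical helper, and the result describes that single cluster only, not the averages reported above.

```python
def enhancement(dis_le: float, dis_pme: float) -> float:
    """Relative reduction in dissimilarity of PME over LE, in percent."""
    if dis_le <= 0:
        raise ValueError("dis_le must be positive")
    return (1.0 - dis_pme / dis_le) * 100.0


# CLS B, ClRe 3.1: Dis_LE = 1.54 and Dis_PME = 1.24 (values quoted in the text).
cls_b_gain = enhancement(1.54, 1.24)
print(f"{cls_b_gain:.1f}%")  # prints 19.5%
```

A positive value means PME produced a representative with lower dissimilarity than LE for that cluster; a negative value would indicate the opposite.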