Multiple Wide Tables with Vertical Scalability in Multitenant Sensor Cloud Systems

Software-as-a-service (SaaS) has emerged as a new computing paradigm for providing reliable software on demand. Sensor cloud systems can benefit from this infrastructure. Generally, sharing both the database and the schema is the most commonly used data storage model in the cloud. However, the data storage of tenants in the cloud faces schema null and schema evolution issues. To address these limitations, this paper proposes multitenant multiple wide tables with vertical scalability, based on an analysis of the features of multitenant data. To solve the schema null issue, an extended vertical part is used to trim down the number of schema null values. To reduce the probability of schema evolution, the wide table is divided into multiple clusters that we call multiple wide tables. This design strikes a balance between tenant customizing and performance. Besides, the partition and correctness of multiple wide tables with vertical scalability are discussed in detail. The experimental results indicate that our multiple wide tables with vertical scalability are superior to both a single wide table and a single wide table with vertical scalability in terms of spatial intensity and read performance.


Introduction
Software-as-a-service (SaaS), in its broadest sense, refers to on-demand software delivered as a service over the Internet [1]. SaaS has been incorporated into the strategy of all leading enterprise software companies to develop various multitenant applications [2]. One of the biggest selling points for these companies is the potential to reduce IT support costs. According to International Data Corporation's (IDC) latest market report, SaaS will grow at a compound annual rate of 29.2% from 2013 through 2017 [3]. With such an inspiring motivation, sensor cloud infrastructure is becoming popular because it can provide a flexible and configurable platform for several legacy Web sensor applications [4]. For emerging sensor cloud systems, serviceology for services is a general tendency [4], which explores theoretical structures of service concepts and generates scientific systematization of services and products. Since multitenancy is an essential component of SaaS, the data storage model is a prime problem in multitenant sensor cloud applications. Due to the increasing demand for sensor data and its support in multitenancy, the sensor cloud architecture has been introduced as an integration of cloud computing into wireless sensor networks (WSNs), enabling a number of new techniques, such as WSN cloud services, databases, and applications. Some relevant publications conduct their own research on tenant customizing and its performance issues [5,6]. Currently, there are at least three virtual data storage models: separate databases, sharing the database with separate schemas, and sharing the database with a shared schema [7,8]. The most popular approach is the third one. Salesforce.com has passed up traditional virtualization software in favor of custom technology that puts up to 10,000 customers on 15 databases and 100 servers using this approach [9].
There are some challenges in sensor cloud scenarios. First, cloud users still face practical difficulties, in particular when handling large amounts of sensor data [10]. The traditional data storage approach becomes a bottleneck when working on large sets of interactive sensor data. The second challenge is how tenants access their sensor data in the sharing storage model through a logical view.
The wide table [11] was introduced as an authoritative sparse data storage model for multitenancy. This raw model has some limitations. First, the data in a wide table are left-intensive: since tenants consume the columns of the wide table from left to right, the rightmost columns tend to be assigned null values. This wastes a lot of space when working with massive numbers of tenants [12]. An improved version of the wide table, the single wide table with vertical scalability [7], stores personalized data in an extended vertical part. Although this model can reduce the number of schema null values, it leads to another problem, the schema evolution issue [13]. As tenants' requirements increase, the length of a tenant's schema may exceed the preset width of the core horizontal part. In this situation, some data in the vertical part must be moved into the horizontal part manually to address the schema null issue. Further details are presented in Sections 2 and 3.
Tenant customizing and performance are two interdependent elements that constrain each other in the context of multitenant storage of sensor data. Our motivation in this paper is to achieve a balance between flexibility and complexity. We propose multiple wide tables with vertical scalability in the sensor cloud system, for the purpose of raising the data intensity of data spaces as well as spatial-temporal performance. The multitenant multiple wide tables are composed of core horizontal metadata and extended vertical metadata, in order to limit the schema null issue. In addition, dividing a single wide table into multiple wide tables with vertical scalability reduces the probability of schema evolution.
The contributions of this study are several. We attempt to address the schema null and evolution issues using this new model. First, our data model provides a logical relational view with a minimum of resources. Second, our data model improves read and write performance by adding indexes to frequently accessed columns. Third, this model reduces the waste of null values in the relational database through vertical scalability. Finally, dividing a wide table into multiple wide tables addresses the schema evolution issue effectively. In conclusion, our data model achieves a balance between tenant customizing and performance.
The remainder of the paper is organized as follows. Section 2 discusses the related work, and Section 3 presents the multitenant multiple wide tables with vertical scalability of the sensor cloud system. First, we add reserved columns and divide a wide table into multiple wide tables to address the schema evolution issue. Then, the data model of multiple wide tables with vertical scalability is discussed in detail. Next, we analyze the correctness and implementation of this model. Section 4 gives the experimental evaluation of our multiple wide tables with vertical scalability. In Section 5, we discuss the superiority of our proposed model with qualitative assessments. Brief conclusions and future work are outlined in the last section.

Related Work
In recent years, WSNs have become an established technology for a large number of applications, ranging from monitoring to event detection, as well as target tracking [14]. On one hand, cloud WSN applications with sharing storage usually lead to massive data. On the other hand, cloud tenants have requirements on personalization; that is, different tenants have different personalized columns. Accordingly, the storage of multitenant sharing data is more challenging. This section outlines some classical sharing multitenant data storage models and their features in the sensor cloud system, introducing the private table, the extension table, the document store, and the wide table. Figure 1 illustrates the structure and metadata of the different data models, with some columns holding the customizing data and others the personalized data.

2.1. Private Table. The most basic way to support extensibility is to give each tenant their own private table. In this simple approach, all the query-transformation layer needs to do is rename tables. Thus, this approach has stronger pertinence and better expansibility for customization and isolation. However, only moderate consolidation is provided, since many tables are required. Aulbach et al. discussed how relational structures might be made more competitive in the case of over 50,000 tables [15].

2.2. Extension Table. This approach splits the extension into a separate table. The sharing data are stored in the public storage (basic tables), while the separate data are stored in the extension table [16]. Because multiple tenants may use the same extension, extension tables as well as basic tables should be given a tenant identity column. However, this approach has not yet resolved the expansion problem. It still needs a mechanism to map the basic and extension tables into one logical table.

2.3. Document Store. NoSQL databases are often highly optimized key/value stores intended for simple retrieval and appending operations, with the goal of significant performance benefits in terms of latency and throughput [17]. The document store, for example, provides JSON-style documents with dynamic schemas to store data. Although this dynamic structure minimizes null values, relational querying is destroyed by it. Queries against this model produce a number of connective operations, which makes it difficult to refactor a relational tuple with its columns. We compare this solution with ours in Section 4.

2.4. Wide Table

2.4.1. Single Wide Table. A single wide table is usually highly sparsely populated, so that most data can fit into one line or record [18]. Using this solution, queries are composed on only a small subset of the attributes. However, this model produces the schema null problem, since it has too many null values. Moreover, it cannot provide dynamic customization capabilities, since the reserved columns are still limited in number. Indexing a table generally brings extra storage and update costs. When the data set is sparse, the extra cost can be overcome by using a sparse index, one special kind of partial index that maps only the non-NULL values to object identifiers [11].

2.4.2. Single Wide Table with Vertical Scalability [12]. This model extracts the personalized data from the relational wide table and then describes it using extended vertical metadata. Each row in an instance of extended vertical metadata is a key/value pair, which is used to store the personalization of tenants to fulfill their requirements. When the personalization of tenants is identical, the extended vertical metadata can be omitted. The advantage of this approach is that it reduces the waste of data resources efficiently. However, this model produces the schema evolution issue as the number of customizing columns increases.
In this case, the length of a tenant's schema exceeds the preset width of the wide table.

2.5. Challenges of the Current Wide Table. In the wide table solution, all the tenants' sensor data are stored in a sharing database. The features of multitenant sensor data bring two new challenges.

First, the schema null issue arises when tenants do not customize some columns. If a tenant has customized a column, a useful value may be assigned at run time; if not, the value of that column must be null. Multitenant sensor data are left-intensive, since tenants consume the data from left to right. This is a key issue of multitenant data storage to be solved.
In the relational wide table, we denote by m_i the customizing column count of tenant i's data and by n_i the personalized (non-customized) column count, where i is the index of the tenants. We denote the length of a wide table as LENGTH = m + n, where m is the preset number of customizing columns and n the number of personalized columns. We denote by r_i the row number of tenant i and by T the tenant number. The data intensity of the relational wide table is then

ρ = Σ_{i=1}^{T} r_i · m_i / (LENGTH · Σ_{i=1}^{T} r_i).

It is observed that 0 ≤ ρ ≤ 1. When n is 0 and every tenant consumes all m columns, ρ is 1. When m_i is far smaller than LENGTH, the proportion of null values 1 − ρ is the largest, which causes the schema null issue.
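As a rough sketch, the data intensity (the fraction of non-null cells in a left-intensive wide table) can be computed as follows, assuming the formula ρ = Σ r_i·m_i / (LENGTH · Σ r_i) reconstructed from the surrounding definitions; the tenant figures are hypothetical.

```python
def data_intensity(tenant_cols, tenant_rows, table_width):
    """Fraction of non-null cells in a left-intensive wide table.

    tenant_cols[i] is m_i (columns consumed by tenant i),
    tenant_rows[i] is r_i (rows of tenant i),
    table_width is LENGTH = m + n.
    """
    filled = sum(r * c for c, r in zip(tenant_cols, tenant_rows))
    total = table_width * sum(tenant_rows)
    return filled / total

# Three tenants customizing 4, 10, and 20 of 50 columns, 100 rows each.
rho = data_intensity([4, 10, 20], [100, 100, 100], 50)
# rho is well below 1, illustrating the schema null issue.
```

When every tenant consumes all columns of the table, the function returns 1.0, matching the boundary case noted above.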
Second, the single wide table solution with vertical scalability leads to the schema evolution issue. When the length of a tenant's schema exceeds the preset width of the single wide table, the extended vertical part must be transferred into the core horizontal part. We call this transfer schema evolution. Therefore, a fixed length of customizing columns introduces a flaw that affects the performance of the cloud database. In addition, the quantities m, n, and LENGTH in the expression of data intensity are forecasts of the tenants' customizing. Due to inaccurate forecasts of the tenants' requirements, the single wide table with vertical scalability faces the schema evolution issue.

Schema Evolution Issue.
In order to reduce the probability of schema evolution, we add additional reserved columns to improve the single wide table with vertical scalability. Moreover, the wide table is partitioned into multiple wide tables according to the sum of the numbers of customizing and reserved columns. The structure of multiple wide tables with vertical scalability is shown in Figure 2.
Each of the multiple wide tables, called a cluster, is divided into customizing and reserved columns. The personalized data are stored in the extended vertical part, and the customizing data are stored in the core horizontal part. On one hand, vertical scalability can reduce the access granularity of tuples; on the other hand, the reserved columns in the core horizontal part can reduce the probability of schema evolution. In the context of multiple wide tables, tenants' data are spread over different single wide tables. That is to say, multiple wide tables with a gradient distribution replace several single wide tables, meeting the dynamic storage requirements of tenants. A tenant's data are placed in either the core horizontal part or the extended vertical part according to the requirements of the tenant's customization.

Vertical Scalability.
In order to personalize the data of tenants, we design the vertical scaling method to solve the sparse and schema null issues of a single wide table. That is to say that we extract the personalized data from the wide table and then describe it using vertical metadata. As depicted in Figure 2, extended vertical metadata reduces the waste of data resources efficiently. Each row in the extended vertical metadata is a key/value pair, which is used to store the personalization of tenants to fulfill the requirements of different tenants. In the case that the personalization of tenants is the same, the extended vertical metadata can be omitted.
Although the extended vertical metadata reduces schema null values effectively, it increases the computational complexity. To decide whether the data of a column j are stored in the core horizontal part or the extended vertical part, we give an evaluation function E_j that determines whether the extended vertical part is worth adopting. E_j depends on the following factors: (i) p_j, the proportion of tenants that customize column j; (ii) a_j, the access number of the jth column; (iii) A_j, the access number of the tables containing column j; (iv) s_j, the service factor with which the tenant serves column j.
The larger E_j is, the less appropriate column j is for the extended vertical metadata.
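The exact closed form of the evaluation function was lost in extraction; the following sketch assumes a simple product of the listed factors, which preserves the stated monotonicity (heavily customized, heavily accessed columns score high and stay in the core horizontal part). The product form and all numbers are assumptions, not the paper's formula.

```python
def evaluate_column(p_j, a_j, A_j, s_j):
    """Hypothetical evaluation score E_j for column j.

    p_j: proportion of tenants customizing column j (0..1)
    a_j: access count of column j
    A_j: access count of the tables containing column j
    s_j: service factor for column j
    Higher score -> keep column j in the core horizontal part;
    lower score -> the extended vertical part is worth adopting.
    """
    return p_j * (a_j / A_j) * s_j

# A widely customized, frequently accessed column scores high ...
hot = evaluate_column(p_j=0.9, a_j=800, A_j=1000, s_j=1.0)
# ... while a rarely customized, rarely accessed one scores low.
cold = evaluate_column(p_j=0.1, a_j=10, A_j=1000, s_j=1.0)
```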
Since the data type of the value column is weak, we take advantage of families of multiple wide tables with different data types. Suppose there are three column families: family 1 has only a varchar attribute, family 2 only a timestamp attribute, and family 3 only a digit attribute. A table using this hybrid representation is described in Figure 3.

Table Partition.

This section discusses the reasonable partition of multiple wide tables with vertical scalability. Consider that tenant i has c_i customizing columns, and the maximum value of customizing columns is MAX(c_i). The tenant's data are assigned to wide table k, which has m_k customizing columns and Δ_k reserved columns. There is no need for tenant i to adjust the schema as long as c_i ≤ m_k + Δ_k. If c_i > m_k + Δ_k, tenant i needs to transfer the data to wide table k + 1. Therefore, a reasonable number of reserved columns and a reasonable partition can reduce the probability of schema evolution.
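The assignment rule above can be sketched as follows: a tenant's data stay in the narrowest cluster whose customizing plus reserved columns still fit the tenant's c_i customizing columns. The cluster widths are hypothetical, not taken from the paper.

```python
def assign_table(c_i, tables):
    """Pick the wide table (cluster) for a tenant with c_i customizing columns.

    tables: list of (m_k, delta_k) pairs ordered from narrowest to widest,
    where m_k is the customizing-column count and delta_k the reserved count.
    Returns the index of the first cluster with c_i <= m_k + delta_k,
    or None if no cluster fits (schema evolution would be required).
    """
    for k, (m_k, delta_k) in enumerate(tables):
        if c_i <= m_k + delta_k:
            return k
    return None

# Five hypothetical clusters of increasing width.
tables = [(5, 2), (14, 4), (20, 4), (30, 6), (46, 10)]
cluster = assign_table(16, tables)  # 16 <= 14 + 4, so the second cluster
```

The reserved columns delta_k are what absorb growth in c_i without a schema change, which is how this design lowers the probability of schema evolution.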
The statistical analysis of the customizing requirements shows that the number of customizing columns of tenants approximately follows the normal distribution N(μ, σ²), where μ is the average number of customizing columns of tenants, and σ reflects the differentiation of that number. We use the normal distribution to fit the frequency count of customizing columns of tenants; the frequency histogram is the partition basis of the multiple wide tables. According to the three-sigma rule [19], almost all (99.73%) of the values lie within 3 standard deviations of the mean. We divide [μ − 3σ, μ + 3σ] into intervals, and each interval corresponds to a kind of wide table. A mass of statistical analysis indicates two facts. First, we should make thinner granular partitions of the intervals whose number of common columns is close to μ, with smaller m_k and Δ_k. Second, we should make coarser granular partitions of the intervals whose number of common columns is far greater than μ, with larger m_k and Δ_k. Furthermore, the statistics suggest that the number of partition intervals should be set at 5 to 6 when MAX(c_i) ≤ 50.
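A minimal sketch of the three-sigma partitioning follows. For simplicity it cuts [μ − 3σ, μ + 3σ] into k equal-width intervals clamped at zero; the paper instead derives unequal widths from the frequency histogram (finer near μ), so equal widths are an assumption here, as are the example parameters.

```python
def partition_intervals(mu, sigma, k):
    """Split [mu - 3*sigma, mu + 3*sigma] into k equal-width intervals.

    The lower bound is clamped at zero, since a tenant cannot customize
    a negative number of columns. Each interval corresponds to one
    wide-table cluster.
    """
    lo, hi = max(0.0, mu - 3 * sigma), mu + 3 * sigma
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# Using the fitted parameters reported later in the experiments
# (mean 21.3, standard deviation 11.5) and 5 intervals.
intervals = partition_intervals(mu=21.3, sigma=11.5, k=5)
```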

Correctness Analysis.
In this section, we present an equivalence analysis between multitenant multiple wide tables with vertical scalability and a traditional relational table.

Theorem 1. Multitenant multiple wide tables with vertical scalability are equivalent to a relational table.
(1) A relational table can be converted into multitenant multiple wide tables with vertical scalability.
We denote a relational table R by a mathematical relation with customizing columns, reserved columns, and personalized columns, where (i) TenantID is the unique identification of the tenant; (ii) PK is the primary key of the relational table; (iii) C is the set of customizing columns, E is the set of reserved columns, and P is the set of personalized columns. If an attribute a is contained within C (a ∈ C), attribute a is assigned to store customizing data in the core horizontal part of the multiple wide tables.
If an attribute a is contained within E (a ∈ E), attribute a is assigned to store reserved data in the core horizontal part of the multiple wide tables.
If an attribute a is contained within P (a ∈ P), attribute a is assigned to store personalized data in the extended vertical part of the multiple wide tables. The mapping from R to the extended vertical part of the multiple wide tables is called UNPIVOT [20], denoted as

V = ⋃_{a ∈ {a_1, a_2, ..., a_n}} π_{TenantID, PK, 'a', a}(σ_{a IS NOT NULL}(R)),

where (i) {a_1, a_2, ..., a_n} = P; (ii) π is the relational projection operator; (iii) σ is the relational selection operator.
The mapping rule is also described in Algorithm 1.
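The UNPIVOT direction can be sketched in a few lines: every non-null personalized attribute of a relational row becomes one key/value row in the extended vertical part. Column names and row shapes here are illustrative, not the paper's.

```python
def unpivot(rows, personalized_cols):
    """Map relational rows to extended vertical metadata (Algorithm 1 sketch).

    rows: list of dicts, each with 'tenant_id', 'pk', and data columns.
    Returns (tenant_id, pk, column_name, value) tuples; NULL (None)
    values are simply not stored, which is where the space saving comes from.
    """
    vertical = []
    for row in rows:
        for col in personalized_cols:
            value = row.get(col)
            if value is not None:
                vertical.append((row["tenant_id"], row["pk"], col, value))
    return vertical

rows = [{"tenant_id": 1, "pk": 10, "temp": 23.5, "humidity": None}]
pairs = unpivot(rows, ["temp", "humidity"])
# Only the non-null pair is emitted: [(1, 10, 'temp', 23.5)]
```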
(2) Multitenant multiple wide tables with vertical scalability can be converted into a relational table. We use the core horizontal part and the extended vertical part of the multiple wide tables to refactor the relational table.
First, we pivot the extended vertical metadata into horizontal storage; this mapping, denoted PIVOT, is composed of (i) π, the relational projection operator; (ii) σ, the relational selection operator; (iii) ⋈, the relational join operator.
The mapping rule is also described in Algorithm 2.
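The inverse PIVOT direction folds the key/value rows back into wide rows, restoring NULL for every personalized column that has no stored pair. This is a sketch with the same illustrative shapes as the UNPIVOT example above.

```python
def pivot(vertical, personalized_cols):
    """Map extended vertical metadata back to wide rows (Algorithm 2 sketch).

    vertical: iterable of (tenant_id, pk, column_name, value) tuples.
    Returns a dict keyed by (tenant_id, pk); each value is a row dict with
    every personalized column present, defaulting to None (NULL) when no
    key/value pair was stored for it.
    """
    horizontal = {}
    for tenant_id, pk, col, value in vertical:
        row = horizontal.setdefault(
            (tenant_id, pk), {c: None for c in personalized_cols})
        row[col] = value
    return horizontal

wide = pivot([(1, 10, "temp", 23.5)], ["temp", "humidity"])
# wide[(1, 10)] == {"temp": 23.5, "humidity": None}
```

Composed with the earlier unpivot sketch, this round trip is exactly the equivalence the theorem claims: no information other than the NULL placeholders is lost in either direction.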
Finally, the relation R' = H ⋈ V' is the refactored relational table, where H is the core horizontal part and V' the pivoted vertical part.

Implementation.

Multiple wide tables with vertical scalability are composed of various single wide tables with vertical scalability, and each wide table is composed of two parts: core horizontal metadata and extended vertical metadata. We use MySQL 5.6 GA to store both parts. Figure 4 illustrates a running example that shows the mapping between multitenant wide tables with vertical scalability and a relational table. This operation is transparent to end users with the help of the transformation view. We use the example of Figure 4 to describe the creation and read process, with some columns in the core horizontal part and the others in the extended vertical part.
We provide a read-only logical relational view for queries using Algorithm 3.
Therefore, developers can query the relational view regardless of the actual storage, which is similar to a virtual wide table. This operation is transparent to the developers.
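Such a logical view can be sketched with standard SQL, here using an in-memory SQLite database in place of MySQL 5.6; the table and column names (core, vertical, h1, h2) are illustrative. The view joins the core horizontal part with a conditional-aggregation pivot of the extended vertical part, so a query sees one virtual wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE core (tenant_id INTEGER, pk INTEGER, a TEXT, b TEXT);
CREATE TABLE vertical (tenant_id INTEGER, pk INTEGER, col TEXT, val TEXT);
-- Read-only logical view: core horizontal part joined with the
-- pivoted extended vertical part (h1, h2 are personalized columns).
CREATE VIEW logical AS
SELECT c.tenant_id, c.pk, c.a, c.b,
       MAX(CASE WHEN v.col = 'h1' THEN v.val END) AS h1,
       MAX(CASE WHEN v.col = 'h2' THEN v.val END) AS h2
FROM core c LEFT JOIN vertical v
     ON v.tenant_id = c.tenant_id AND v.pk = c.pk
GROUP BY c.tenant_id, c.pk, c.a, c.b;
""")
conn.execute("INSERT INTO core VALUES (1, 10, 'x', 'y')")
conn.execute("INSERT INTO vertical VALUES (1, 10, 'h1', 'v1')")
row = conn.execute("SELECT * FROM logical").fetchone()
# The unset personalized column h2 surfaces as NULL in the view.
```

Developers query `logical` like an ordinary table; the split between horizontal and vertical storage never shows through.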

Performance Evaluation
The experiments were performed on a server with the following configuration: an Intel(R) Core(TM) i5-2300 3.0 GHz CPU with 8 GB RAM and a 100 M Realtek 8139 NIC on a 100 M LAN, running CentOS 6.4 x64. We conducted four experiments to evaluate the performance of our multiple wide tables with vertical scalability. We use actual sensor data collected from a WebSocket-based real-time monitoring system for remote intelligent buildings [14], which serves different cloud tenants. We use multiple wide tables with vertical scalability to store the different tenants' data, with MySQL 5.6 GA storing both the core horizontal part and the extended vertical part.

Spatial Intensity.
We select cloud sensor data generated by 20 different tenants for the first, spatial experiment. We compare a single wide table with 50 columns against our multiple wide tables with vertical scalability. Consider that tenants consume the wide table from left to right. The number of customizing columns of each tenant is c = {4, 6, 8, 9, 10, 12, 15, 16, 16, 17, 19, 20, 22, 22, 25, 28, 35, 40, 46, 48}, where c_i is the number of customizing columns of tenant i. As shown in Figure 5, c is in accordance with the normal distribution; the average number of customizing columns is 21.3, and the standard deviation is 11.5. To measure the data intensity, we suppose that each tenant has the same number of rows. Since MAX(c_i) ≤ 50, we divide the partition of customizing columns into five intervals: [0, 7], [8, 18], [19, 24], [25, 36], and [37, 56]. Each interval corresponds to a kind of wide table, as shown in Table 1. If the single wide table solution is adopted, the overall data intensity of the single wide table is 0.38. If the multiple wide tables with vertical scalability solution is adopted, the data intensity is 0.90. The experimental results show that multiple wide tables with vertical scalability can enhance the degree of data intensity and reduce schema null values. We can use a finer-grained partition to enhance the intensity when the tenants' requirements are fixed. As tenants' requirements change, as long as the length of customizing columns is no more than the sum of customizing and reserved columns, the multiple wide tables work without adjustment of the schema. Therefore, the probability of schema evolution is reduced.
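A back-of-the-envelope check of the single-table figure: under the stated assumption of equal row counts per tenant, the intensity of one shared 50-column table reduces to the mean customizing-column count divided by the table width. This simplified estimate comes out near 0.42; the reported 0.38 presumably reflects details (such as actual row counts) not reproduced here.

```python
# Per-tenant customizing-column counts from the experiment.
cols = [4, 6, 8, 9, 10, 12, 15, 16, 16, 17, 19, 20, 22, 22, 25, 28,
        35, 40, 46, 48]
width = 50  # the single wide table has 50 columns

# With equal rows per tenant, the fraction of non-null cells is just
# (sum of consumed columns) / (tenant count * table width).
single_intensity = sum(cols) / (len(cols) * width)
```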

Read Performance.
We conducted the second experiment to evaluate read performance. Consider five columns (a, b, c, d, and e) stored in a wide table. To change the selectivity of predicates, we use different distributions for the column values: there are only 10 records for which column a is defined, and then 100, 1,000, 10,000, and 100,000 records for which columns b through e are defined, respectively, so columns a and b are rarely selective. We made indexes on columns a, b, and c to optimize the queries.
We executed six queries on the sharing multitenant sensor data with the same query condition, extracting data from columns a, b, c, d, e, and all columns. The query time of each query under three solutions is shown in Figure 6. Solution 1 is the single wide table with vertical scalability, where columns c, d, and e are stored in the extended vertical part; solution 2 is the single wide table; solution 3 is our multiple wide tables with vertical scalability. By contrast, solution 1 is more time consuming than solutions 2 and 3, due to the cost of transforming the extended vertical data into core horizontal data in solution 1. After all the data have been assembled, the performance difference between solutions 1, 2, and 3 is the execution cost of PIVOT. An index seek is then performed on the rows to bring the performance of PIVOT close to that of the wide table. Despite this small deficiency in read performance, solution 3 offers much stronger spatial intensity than solution 2.

Next, we conducted the third experiment to observe the effect of concurrent transactions on read performance. We use multiple threads to simulate a large number of requests to wide tables with 10,000 records. We issue the query "query the monitoring sensor data of the past three months" 10 times and record the average query time. We added solution 4 for further comparison; in solution 4, multitenant sensor data are stored in the form of document stores with a free and dynamic schema. Figure 7 shows the average query time. Due to the data refactoring of the extended vertical part, the query time of solution 1 is always the largest. Since solution 3 adopts a relational view for querying, the effect of concurrent transactions on its read performance is small. When the amount of concurrency is less than 1,200, the query time of solution 3 is larger than that of solution 2; when the amount of concurrency exceeds 1,200, the query time of solution 2 rises sharply. The query time of solution 3 is close to that of solution 4, but solution 4 cannot provide SQL support.
It is difficult for the legacy sensor applications to utilize this solution transparently.

Write Performance.
We conducted the last experiment to observe the effect of concurrent transactions on write performance, measuring the number of writes per second. In solutions 1 and 3, writing one record generates more than one record in the database because of the vertical extension. We inserted 100,000 sensor records to observe the throughput of the different solutions. Figure 8 shows the write performance. When the amount of concurrency is less than 600, the throughput of solution 3 is the largest. In the case of high concurrency, solution 3 is a little worse than solution 4. Among the wide table-like solutions (solutions 1, 2, and 3), our solution ensures the best write performance. Future work will explore read/write splitting to improve write performance under high concurrency.

Discussion

The single wide table leads to the schema null issue, while the single wide table with vertical scalability leads to the extra schema evolution issue: with the addition of tenants' customizing columns, the extended vertical part is forced to move into the core horizontal part. The motivation of this paper is to solve the schema null and evolution issues at the same time.

Next, we evaluate the performance of our proposed multiple wide tables with vertical scalability in two ways. The first is spatial intensity. Denote the integral intensity of the single wide table by ρ_s, of the single wide table with vertical scalability by ρ_sv, of the multiple wide tables by ρ_m, and of the multiple wide tables with vertical scalability by ρ_mv; then 0 < ρ_s < ρ_m < ρ_sv ≈ ρ_mv < 1. The second is read and write performance. Compared with current multitenant data models, multiple wide tables with vertical scalability are the most powerful.

Conclusions and Future Work
Motivated by sharing multitenant storage in the sensor cloud system, this paper proposes a better solution for storing multitenant sensor data in the context of cloud computing. Compared with current multitenant data models, we propose multitenant multiple wide tables with vertical scalability for the sensor cloud system. This model consists of two parts: a core horizontal part, used to store customizing data, and an extended vertical part, used to store personalized data. To address the schema evolution issue, we divide a wide table into multiple clusters that we call multiple wide tables. On one hand, this model focuses on solving the schema null and evolution issues with high scalability; on the other hand, it meets the demand for tenants' personalization. Further, the partition of multiple wide tables with vertical scalability is discussed in detail. Besides, we illustrate the equivalence of the multitenant multiple wide tables with vertical scalability and a traditional relational table, with a running example of the transformation. The experimental results show that our multiple wide tables with vertical scalability are superior to the single wide table and the single wide table with vertical scalability in terms of spatial intensity and read/write performance.
The multiple wide tables with vertical scalability that we have proposed formulate a kind of sharing multitenant sensor data storage model. We store tenants' data together under the same schema; in this solution, the cloud sensor data are maintained under centralized management. Therefore, the method has some limitations in a distributed environment. Future work will explore distributed techniques to optimize our method, such as data sharing, data partition, and read/write splitting.