Data federation through on-demand queries in intelligent transport systems

The paper proposes a federated approach to data integration in intelligent transport systems, in which data are requested from various sources using the on-demand query technique, in contrast to the traditional approach, in which data are accumulated in a single storage. The proposed approach is designed to reduce the number of data manipulations, thereby increasing the processing speed and reducing the necessary computing resources. The study was carried out on synthetically generated data: 10 data sources are used, each containing up to 650000 records. Data federation reduces the memory used by an average of 53%, while creating the virtual federated database takes on average 71% less time than implementing the traditional integration approach. The approach to data federation is designed for implementation in intelligent transport systems of a new generation, which must process large amounts of data coming from various heterogeneous sources.


Introduction
With the growth of the amount and complexity of data used by modern intelligent transport systems (ITS), the need for approaches to collecting, processing, analyzing, and storing data increases; that is, effective data manipulation in ITS comes to the fore [1,2]. ITS rely on detailed data containing information not only about the characteristics of traffic flows but also about transport infrastructure facilities, technical means of traffic control, and various road, weather, and social events that lead to changes in traffic [3,4]. Thus, the data sources for ITS are many urban information systems and technical and personal mobile devices, which often duplicate data on the same characteristics used by ITS. In addition, due to the specifics of ITS operation, data must be processed online, which can be difficult with a large amount of data.
In this work, the author proposes a federated approach to data integration in intelligent transport systems, in which data are not accumulated in a single storage but are requested from the sources when needed, using the on-demand query technique. The proposed approach is designed to reduce the number of data manipulations, thereby increasing the processing speed and reducing the computing resources needed for data processing in ITS.

State of the art
The problem of data integration in ITS has been considered by many researchers. In [5], the authors proposed a two-tier architecture for collecting, storing, processing, and integrating information in real time; however, implementing the proposed solution requires significant changes to existing ITS subsystems, which entails colossal time, organizational, financial, and technical difficulties. In [6], ways to reduce the amount of transmitted data in ITS are considered, including a method based on indexing database tables, which, however, cannot be applied in the conditions of modern cities. For example, in [7] it is shown that the basis of ITS functioning is now formed by "non-traditional" data generated by "Smart City" systems, mobile applications, social networks, and, in general, any devices within the Internet of Things concept.
Another feature of ITS is the need for a system architecture that provides scalability, that is, the connection of new data sources (subsystems) to the ITS [8]. First of all, the rapid implementation of new data sources is facilitated by the development of data mining and machine learning technologies [9]. The development of intelligent technologies makes it possible to overcome direct and indirect problems of obtaining data on traffic and the events affecting it. In the short term, such technologies are therefore among the most promising data sources, although their data are presented in a structure that is not traditional for ITS. Taking this into account, the data integration system for ITS should support complex data structures coming from video analytics systems [10], audio analytics [11], crowdsourcing systems [12], and others [13,14,15].
An effective strategy for data integration in such conditions is federation, in which the raw data remain in the systems where they originated; that is, the data are not copied to the ITS database. Classical approaches to database federation cannot be applied in modern ITS, since such systems operate with various heterogeneous data sources [16]. A promising approach is to create a virtual storage that supports SQL-like queries and presents data from several data sources [17]. In this case, the data can be presented in the form of patterns [18,19], which provide flexibility both in the structure of the data record and in its attribute composition.
In this paper, the author proposes a data federation approach based on structured on-demand queries. The results are intended for use in modern ITS, where traffic control is based on data generated by multiple interacting subsystems.
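The idea of a virtual storage that answers SQL-like queries over several sources can be illustrated with a minimal Python sketch. It uses the standard `sqlite3` module only as a stand-in for real ITS subsystems and is not the mechanism of the cited works; the table names `detector_speed` and `mobile_speed`, the schema alias `src2`, the view `federated_speed`, and the helper `speeds_for_road` are all hypothetical.

```python
import sqlite3

# Two independent in-memory databases stand in for two ITS data sources
# (an assumption made only for this illustration).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS src2")

conn.execute("CREATE TABLE main.detector_speed (road_id INTEGER, speed REAL)")
conn.execute("CREATE TABLE src2.mobile_speed (road_id INTEGER, speed REAL)")
conn.executemany("INSERT INTO main.detector_speed VALUES (?, ?)",
                 [(1, 42.0), (2, 55.5)])
conn.executemany("INSERT INTO src2.mobile_speed VALUES (?, ?)",
                 [(1, 40.0), (3, 61.0)])

# A TEMP view may reference tables in any attached database. It stores no
# rows: the sources are read only when the view is queried.
conn.execute("""
    CREATE TEMP VIEW federated_speed AS
    SELECT 'detector' AS source, road_id, speed FROM main.detector_speed
    UNION ALL
    SELECT 'mobile' AS source, road_id, speed FROM src2.mobile_speed
""")


def speeds_for_road(road_id):
    """Query the virtual storage on demand for one road segment."""
    cursor = conn.execute(
        "SELECT source, speed FROM federated_speed "
        "WHERE road_id = ? ORDER BY source",
        (road_id,),
    )
    return cursor.fetchall()
```

Because the view holds no data of its own, a row added to either source after the view is created is visible in the next call to `speeds_for_road`.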

On-demand queries for data federation
Let one data record be represented by the following tuple:
$$r_j = \langle a_1, a_2, \ldots, a_n \rangle, \quad a_i \in D, \tag{1}$$
where $D$ is the domain of the data type; $a_i$ is the attribute with number $i$; $n$ is the total number of attributes; $j$ is the current record number. The domain $D$ of the data type of an attribute is one of the following sets:
$$D \in \{\emptyset, \mathbb{Z}, \mathbb{R}, C, S\}, \tag{2}$$
where $\emptyset$ is the empty set; $\mathbb{Z}$ is the set of integers; $\mathbb{R}$ is the set of real numbers; $C$ is a set of characters; $S$ is a set of strings, each string being a tuple $\langle c_1, c_2, \ldots, c_l \rangle$, $\forall c_i \in C$.
The dataset of one source is denoted as
$$DS_k = \{r_1, r_2, \ldots, r_m\},$$
where $r_j$ is a record in the source; $k$ is the source number, $k = \overline{1, t}$, where $t$ is the total number of sources; $m$ is the number of data records in the source.
Thus, the integration problem is to unify the datasets of all $t$ sources. When the problem is solved in the classical way (3), a preprocessing operation $f(a_i)$ is performed on each attribute of every record. In the simplest case, when data preparation operations are not required, $f(a_i) = a_i$; complex records require a non-trivial preprocessing operation. When the principle of federation is implemented (4), direct processing is replaced with reflection through on-demand queries (5) of the form $\pi(\cdot)$, where $\pi$ is a projection.
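The contrast between classical integration and federation described above can be sketched in Python. This is an illustration under stated assumptions, not the author's implementation: a record is modeled as a tuple of attributes, a source as a callable that returns its current records, and the names `integrate`, `federate`, `f`, and `projection` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, Union

# Attribute domains from (2): empty, integer, real, character, string.
Attribute = Union[None, int, float, str]
# A data record as in (1): a tuple of attributes.
Record = Tuple[Attribute, ...]
# Hypothetical source interface: each call returns the source's current records.
Source = Callable[[], Iterable[Record]]


def identity(attribute: Attribute) -> Attribute:
    """Simplest preprocessing: the attribute is passed through unchanged."""
    return attribute


def integrate(sources: List[Source],
              f: Callable[[Attribute], Attribute] = identity) -> List[Record]:
    """Classical integration: read every record now, preprocess it, store a copy."""
    unified = []
    for read in sources:
        for record in read():
            unified.append(tuple(f(a) for a in record))
    return unified


def federate(sources: List[Source],
             projection: Callable[[Record], Record]) -> Callable[[], Iterator[Record]]:
    """Federation: return a query; the sources are read only when it is called."""
    def query() -> Iterator[Record]:
        for read in sources:
            for record in read():
                yield projection(record)
    return query


# Toy sources (invented values, for illustration only).
detectors = [(1, 42.0, "A1"), (2, 55.5, "B7")]
weather = [(1, None, "rain")]
sources = [lambda: detectors, lambda: weather]

stored = integrate(sources)                                # copies 3 records
view = federate(sources, projection=lambda r: (r[0], r[2]))

detectors.append((3, 61.0, "C3"))                          # a source changes later
```

After the last line, `stored` still holds the three records copied at integration time, whereas `list(view())` reads the sources again and therefore includes the new record.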

Results
Data federation is implemented on the .NET Framework 4.7.2. The tests were carried out on a workstation running Windows 10 and equipped with 32 GB of DDR4-2933 RAM, an Intel Core i9-10900 processor (up to 5.2 GHz, 10 cores, 20 threads), an Nvidia GeForce RTX 2070 Super video card, and a 1 TB SSD. PostgreSQL 13.0 with standard configuration settings was used as the relational data storage. During data integration operations, the memory used and the operation time were measured.
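How the two measured quantities can be collected is sketched below in Python. The study itself was run on .NET, so this is only an analogous illustration using the standard `time` and `tracemalloc` modules; the helpers `measure` and `saving_percent` are hypothetical names, and `tracemalloc` counts only memory allocated by the Python interpreter.

```python
import time
import tracemalloc


def measure(operation):
    """Run operation() once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    result = operation()
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def saving_percent(baseline, candidate):
    """Relative saving of candidate against baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


# Toy comparison: materializing a copy versus creating a lazy view over it.
_, _, peak_copy = measure(lambda: [i * 2 for i in range(100_000)])
_, _, peak_lazy = measure(lambda: (i * 2 for i in range(100_000)))
```

In this toy comparison the lazy generator allocates far less than the materialized list, because no element is produced until the generator is consumed; the absolute numbers are not comparable with the figures reported in the study.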
The study was carried out on the following synthetic test data. Ten data sources are used. The number of data records ranges from 50000 to 650000. The length of a data record (1) is random but does not exceed 10 attributes. The domain D (2) of an attribute is distributed uniformly among all domain types. The length of the strings is random but does not exceed 255 characters. Characters from the standard UTF-8 set are used.
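A generator of test data with these properties might look like the following Python sketch. The limits of 10 attributes and 255 characters and the uniform choice of domain come from the description above; the alphabet, the numeric ranges, the minimum record length of one attribute, and the names `random_attribute`, `random_record`, and `generate_source` are assumptions made for illustration.

```python
import random
import string

MAX_ATTRIBUTES = 10        # from the description of the test data
MAX_STRING_LENGTH = 255    # from the description of the test data
DOMAINS = ("empty", "integer", "real", "character", "string")
# Assumption: a small ASCII alphabet stands in for the UTF-8 character set.
ALPHABET = string.ascii_letters + string.digits + " "


def random_attribute(rng):
    """Draw one attribute; its domain is chosen uniformly among all types."""
    domain = rng.choice(DOMAINS)
    if domain == "empty":
        return None
    if domain == "integer":
        return rng.randint(-10**6, 10**6)      # assumed value range
    if domain == "real":
        return rng.uniform(-1e6, 1e6)          # assumed value range
    if domain == "character":
        return rng.choice(ALPHABET)
    length = rng.randint(0, MAX_STRING_LENGTH)
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def random_record(rng):
    """Build a record with a random number of attributes (1 to 10)."""
    attribute_count = rng.randint(1, MAX_ATTRIBUTES)
    return tuple(random_attribute(rng) for _ in range(attribute_count))


def generate_source(record_count, seed):
    """Lazily yield record_count records; the seed makes runs reproducible."""
    rng = random.Random(seed)
    for _ in range(record_count):
        yield random_record(rng)
```

For example, `list(generate_source(50000, seed=1))` materializes one source of the smallest size used in the study, and ten different seeds give ten sources.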
Classical integration (3) is used as the alternative to federation (4); the results below compare the two.
The results of measuring memory usage are shown in figure 1. Figure 1 shows that traditional integration consumes more memory than federation. As the number of records increases, the difference in memory used becomes more and more significant. On average, federation saves 53% of memory.
The results of measuring the execution time when building a database are shown in figure 2. On average, creating the virtual federated database takes 71% less time than traditional integration.
The second step of the study increases the complexity of the data records (1): a data record can include links to other records, that is, an extension of the domain set (2) is proposed in which an attribute may be a tuple of linked records: $\exists\, a_i \notin D,\ a_i = \langle r_1, r_2, \ldots, r_p \rangle$. Domains are distributed uniformly. The number of data records ranges from 50000 to 450000. In this case, the advantage of federation becomes more and more noticeable, especially in the memory used (figure 3). On average, federation requires 72% less memory.
The study on data retrieval shows that the execution time of a simple query (5) is nearly equivalent: simple queries against a virtual database (federation) are faster than similar queries against a traditional database (integration), but the execution time differs on average by only 3%.

[Figure: operation time, integration vs. federation]

Conclusions
The approach to data integration presented in this work can be used in intelligent transport systems and situational centers of a new generation, which must process large amounts of data coming from various sources and partially duplicating each other. The use of data federation reduces the memory used by an average of 53% during ITS operation, while creating the virtual federated database takes on average 71% less time than the traditional approach. At the same time, simple queries against a virtual database are slightly faster (by about 3% on average) than similar queries against a traditional database, because queries against the virtual database are not performed on relational tables with a large number of records.
Further work on this topic will be aimed at developing ways to ensure the reliability and relevance of data, since the same characteristic of traffic and events can be represented by data from sources that differ in their adequacy. Another issue requiring further development is securing federated data: different sources may have different access rights, which will also require integration.