Data mining technologies in a distributed medical information system

A method is proposed for identifying hidden relationships in implicit, hard-to-structure data by means of data mining. Decision trees are proposed for this purpose, and their effectiveness is illustrated by a specific example. Applying OLAP analysis to data presented as a real or virtual information hypercube is an effective tool for improving the management of medical monitoring.


Introduction
One of the main elements of a distributed information system is hard-to-structure data. Analyzing such data is necessary to improve the efficiency of the system, but is often difficult to implement.
Let us consider a decision support system (DSS) [1,2] for the diagnosis and differential diagnosis of diseases.
Medical problems that have an implicit character are solved by explicit methods whose accuracy and convenience are entirely insufficient for widespread practical use in specific forecasting and decision-making problems [3]. Implicit problems in medicine have become an ideal field for intelligent data mining (IDM) [4].
Decision support systems are complex software and hardware systems designed to help decision makers manage complex objects and processes of various kinds [5]. The information on which a decision is based must be reliable, complete, consistent and adequate.
As a rule, a DSS is built on a so-called data warehouse, which prepares and stores data for the DSS using information from the organization's automated management system as well as from third-party sources [6].
The approach to data search and retrieval that is based on the observation that the frequency of requests is related to the degree of aggregation of the data they operate on (the more aggregated the data, the more often the request is executed) is called On-Line Analytical Processing (OLAP) [7]. The basic concept in OLAP is the multidimensional data cube, whose cells store the analyzed (numeric) data.
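The idea of a data cube whose cells hold a numeric measure can be sketched in a few lines. The example below is only an illustration under stated assumptions: the dimensions (clinic, diagnosis, quarter), the measure (number of visits), the records, and the helper names `roll_up` and `slice_cube` are all hypothetical and are not taken from the system described here.

```python
from collections import defaultdict

# Hypothetical fact records: (clinic, diagnosis, quarter, visits).
FACTS = [
    ("Clinic A", "hypertension", "Q1", 12),
    ("Clinic A", "hypertension", "Q2", 9),
    ("Clinic A", "diabetes", "Q1", 4),
    ("Clinic B", "hypertension", "Q1", 7),
    ("Clinic B", "diabetes", "Q2", 5),
]
DIMENSIONS = ("clinic", "diagnosis", "quarter")


def roll_up(facts, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    cube = defaultdict(int)
    for *coords, measure in facts:
        cube[tuple(coords[i] for i in idx)] += measure
    return dict(cube)


def slice_cube(facts, dimension, value):
    """Keep only the facts whose `dimension` coordinate equals `value`."""
    i = DIMENSIONS.index(dimension)
    return [f for f in facts if f[i] == value]


# Aggregated view: total visits per diagnosis.
by_diagnosis = roll_up(FACTS, keep=("diagnosis",))
# Cross-section: first-quarter visits per clinic.
q1_by_clinic = roll_up(slice_cube(FACTS, "quarter", "Q1"), keep=("clinic",))
print(by_diagnosis)
print(q1_by_clinic)
```

The more dimensions are dropped in `roll_up`, the more aggregated the resulting cells become; keeping no dimension at all yields a single grand total.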

Multidimensional data mining
IDM is understood as a decision support process based on the search for hidden patterns (information patterns) in data [8,9]. OLAP and intelligent data analysis are two integral parts of this decision support process. The multidimensional IDM tool (figure 1) finds patterns not only in detailed data, but also in aggregated data with varying degrees of generalization. A study of IDM methods in medical information systems (MIS) has shown that decision trees are the most effective for solving medical problems [10].
In the field under consideration, CART (Classification and Regression Tree) is the preferred algorithm among those that implement decision trees, since it solves not only classification problems but also regression problems [11].

CART mathematical tool
In the CART algorithm, each node in the decision tree has two children. At each step of tree construction, the rule generated at the node divides the given set of examples (the training set) into two parts: the part in which the rule holds and the part in which it does not. To select the optimal rule, a partition quality evaluation function is used, which is based on the intuitive idea of reducing impurity (uncertainty) in a node. In the CART algorithm, the idea of "impurity" is formalized by the Gini index [11,12,13].
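The construction described above can be sketched as a small recursive procedure. This is a minimal illustration rather than the implementation used in the system: it assumes numeric features, rules of the form "feature <= threshold", impurity measured with the Gini index 1 - Σ p_i², and a hypothetical training set of (age, systolic pressure) pairs. The names `Node`, `best_rule`, `build` and `predict` are introduced here for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prediction: str                    # majority class at this node
    feature: Optional[int] = None      # index of the feature tested by the rule
    threshold: Optional[float] = None  # rule: row[feature] <= threshold
    left: Optional["Node"] = None      # examples for which the rule holds
    right: Optional["Node"] = None     # examples for which it does not


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_rule(rows, labels):
    """Return (impurity, feature, threshold) of the best rule, or None."""
    best, n = None, len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build(rows, labels, max_depth=3):
    """Grow a binary tree until nodes are pure or max_depth is reached."""
    node = Node(prediction=Counter(labels).most_common(1)[0][0])
    splittable = max_depth > 0 and len(set(labels)) > 1
    rule = best_rule(rows, labels) if splittable else None
    if rule is None:
        return node  # leaf
    _, node.feature, node.threshold = rule
    holds = [r[node.feature] <= node.threshold for r in rows]
    node.left = build([r for r, h in zip(rows, holds) if h],
                      [y for y, h in zip(labels, holds) if h], max_depth - 1)
    node.right = build([r for r, h in zip(rows, holds) if not h],
                       [y for y, h in zip(labels, holds) if not h], max_depth - 1)
    return node


def predict(node, row):
    """Follow the rules from the root down to a leaf."""
    while node.left is not None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.prediction


# Hypothetical training set: (age, systolic pressure) -> class label.
ROWS = [(34, 118), (45, 122), (51, 135), (62, 128), (58, 150), (70, 160)]
LABELS = ["healthy", "healthy", "healthy", "healthy", "ill", "ill"]

tree = build(ROWS, LABELS)
print(tree.feature, tree.threshold)
print(predict(tree, (40, 120)), predict(tree, (66, 155)))
```

Every internal node has exactly two children, matching the binary structure of CART; the depth limit is a simple stopping rule chosen for the sketch.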
If the dataset T contains data from n classes, then the Gini index is defined as

$$\mathrm{Gini}(T) = 1 - \sum_{i=1}^{n} p_i^2,$$

where $p_i$ is the probability (relative frequency) of class i in T.
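As a quick numerical check, the Gini index can be computed directly from class frequencies. The sketch below assumes the standard definition Gini(T) = 1 - Σ p_i², with p_i taken as the relative frequency of class i; the labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty set
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure = ["healthy"] * 4                     # a single class: no impurity
mixed = ["healthy", "healthy", "ill", "ill"]  # two equally frequent classes
print(gini(pure))   # 0.0
print(gini(mixed))  # 0.5
```

A node containing a single class has index 0, and the index grows as the classes become more evenly mixed.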
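If the set T of N examples is divided into two parts T1 and T2 containing N1 and N2 examples, respectively, the partition quality indicator is

$$\mathrm{Gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{Gini}(T_1) + \frac{N_2}{N}\,\mathrm{Gini}(T_2).$$

The best partition is considered to be the one for which $\mathrm{Gini}_{split}(T)$ is minimal. Let N denote the number of examples in the ancestor node, L and R the numbers of examples in the left and right descendants, respectively, and $l_i$ and $r_i$ the numbers of instances of the i-th class in the left and right descendants. Then the quality of the partition is evaluated as

$$G_{split} = \frac{L}{N}\left(1 - \sum_{i=1}^{n}\left(\frac{l_i}{L}\right)^2\right) + \frac{R}{N}\left(1 - \sum_{i=1}^{n}\left(\frac{r_i}{R}\right)^2\right) \to \min.$$

Since L + R = N and multiplication by a constant does not play a role in minimization, the following transformations of the selection criterion are possible:

$$G_{split} = 1 - \frac{1}{N}\left(\frac{1}{L}\sum_{i=1}^{n} l_i^2 + \frac{1}{R}\sum_{i=1}^{n} r_i^2\right) \to \min,$$

$$\hat{G}_{split} = \frac{1}{L}\sum_{i=1}^{n} l_i^2 + \frac{1}{R}\sum_{i=1}^{n} r_i^2 \to \max.$$

Thus, the best partition is the one for which the value of $\hat{G}_{split}$ is maximal.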
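The equivalence of the two criteria can be checked numerically. The sketch below assumes the standard weighted form of the split index, (L/N)·Gini(left) + (R/N)·Gini(right), and the simplified criterion (1/L)·Σ l_i² + (1/R)·Σ r_i²; the training set of (systolic pressure, class) pairs and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(left, right):
    """Weighted Gini index of a two-way partition (to be minimized)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def g_hat(left, right):
    """Simplified criterion sum(l_i^2)/L + sum(r_i^2)/R (to be maximized)."""
    return sum(
        sum(c * c for c in Counter(part).values()) / len(part)
        for part in (left, right)
    )


# Hypothetical training set: (systolic pressure, class label).
DATA = [(110, "healthy"), (118, "healthy"), (125, "healthy"),
        (138, "ill"), (142, "healthy"), (150, "ill"), (165, "ill")]


def candidate_splits(data):
    """Yield (threshold, left labels, right labels) for 'value <= threshold'."""
    values = sorted({v for v, _ in data})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in data if v <= t]
        right = [y for v, y in data if v > t]
        yield t, left, right


splits = list(candidate_splits(DATA))
best_by_gini = min(splits, key=lambda s: gini_split(s[1], s[2]))
best_by_g_hat = max(splits, key=lambda s: g_hat(s[1], s[2]))
print(best_by_gini[0], best_by_g_hat[0])  # both criteria select the same threshold
```

The simplified criterion needs only class counts in the two descendants, so it avoids computing relative frequencies for every candidate rule.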

Application of data mining in a distributed medical information system
The necessary information is extracted from the databases of the information subsystems and placed in the data warehouse. The logical model of the data warehouse is shown in figure 2.
To create and manage the data warehouse, the Microsoft SQL Server DBMS is used because of its simpler installation procedure and its wide applicability and acceptance in MIS [1]. In addition, the main advantage of this DBMS is its decision support services (Microsoft Decision Support Services): tools that make the capabilities of OLAP analysis and the information stored in the data warehouse publicly available.
The Microsoft SQL Server Analysis Manager tool for administering analytical services contains implementations of algorithms for some IDM methods, in particular the algorithm for building decision trees (Microsoft Decision Trees) [14].
As a result, the data warehouse and decision trees are tightly integrated. OLAP analysis, through its various operations, makes it possible to collect the necessary data, store it, and take cross-sections of hypercubes. Decision trees built on this information help to identify how changes in observation results depend on particular indicators. Based on the results obtained, at each examination during dynamic outpatient observation the doctor makes the necessary additions and changes, determines treatment measures and the frequency of repeated examinations in accordance with changes in the course of the disease, and orders the necessary consultations and additional studies (according to indications).

Conclusion
A DSS based on data mining has been developed to help the doctor plan and monitor the timeliness, effectiveness and quality of dispensary monitoring of registered patients. Using a decision tree makes it possible to identify hidden relationships between the disease and the parameters that affect it.
Applying IDM to data that OLAP analysis systems present as a real or virtual information hypercube is an effective tool for improving the management of medical monitoring.
The results can also be applied to cloud technologies in educational institutions, approximation of the distribution law, switching subsystems within distributed operational annunciator and monitoring systems, the SOA reference model, policy control at the network edge, and air quality monitoring assessment based on an IoT platform.