APPLICATION OF THE STATISTICAL APPROACH TO DIAGNOSIS IN MEDICAL AND BIOLOGICAL RESEARCH

N.O. Komleva, D.D. Bondarenko, O.M. Komlevoi. Application of the statistical approach to diagnosis in medical and biological research. In a number of cases, the diagnostic task in medical and biological research can be solved using a statistical approach. Of current interest are studies of whether statistical analysis can be used to diagnose the state of the human respiratory system from the percentage contributions of particles of different sizes contained in exhaled air. The aim of the work is to identify certain patterns in the values of the diagnostic features of exhaled air moisture condensate that allow the studied groups to be treated as non-overlapping classes. Three groups of individuals were studied: healthy people and patients with bronchitis and pneumonia. For each group, the particles that constitute the primary diagnostic data were identified by laser correlation spectroscopy, and the data were then processed by discriminant analysis. The variables that best discriminate the studied groups were selected; a model of variables and the classification functions were built. The results of the main steps of the analysis are presented, namely the set of variables included in the model and the coefficients of the classification functions for the three groups, which form the basis of the algorithm of the developed software product. Keywords: discriminant analysis, respiratory system, classification


Introduction.
The study of exhaled air moisture condensate (EAC) as a material for diagnosing the physiology and pathology of the human respiratory system is a fairly new and promising direction of modern science. These studies are relevant because the methods of collecting material for analysis are safe and gentle. Given these advantages, the search for new methodological approaches to the study of EAC is particularly relevant for improving the accuracy of differential diagnosis of the state of the human bronchopulmonary system [1].
Differentiating the states of the bronchopulmonary system can be treated as a classification problem. A classification system groups objects and singles out classes characterized by a number of common properties. Thus, a classification system is a set of rules for distributing a set of objects into subsets.
Different methods are used for classification, each with its own advantages and areas of use. The main ones are decision trees, Bayesian classification, artificial neural networks, support vector machines, statistical methods, the nearest neighbor method, case-based reasoning (CBR), and genetic algorithms.
The task of diagnosing the state of a patient's respiratory system can be solved using statistical analysis of data [2]. Statistics in medicine is one of the tools for analyzing experimental data and clinical observations, as well as the language in which the mathematical results are reported. In addition, the mathematical apparatus is widely used for diagnostic purposes in solving classification problems.
The purpose of this work is to identify certain patterns in the values of the diagnostic features of exhaled air moisture condensate that allow the studied groups to be treated as non-overlapping classes. A classification system based on these patterns will make it possible to decide which diagnostic class a patient under study belongs to.
Analysis of the possibilities of the statistical approach using the STATISTICA package.
Within the statistical approach, the possibility of applying discriminant analysis was considered. Its main idea is to determine whether populations differ in the mean of some variable (or linear combination of variables), so that such variables can then be used to predict the group membership of new objects [3].
Thus, the a priori classification (the forecast for new objects) is based on data derived from the a posteriori classification (based on available data). In discriminant analysis, the classes (groups) are assumed to be given in advance, and a new object is assigned to one of these classes based on the values of its variables.
To solve the problem posed in this work, it is necessary to establish whether the composition of EAC differs significantly between healthy people and patients with bronchitis or pneumonia. For this purpose, the standard Discriminant Function Analysis module of STATISTICA, a package intended for statistical analysis and data processing, was used initially.
Data from three groups of patients (norm, bronchitis, pneumonia), each of whom had undergone a pulmonary examination beforehand, were taken as the initial data. The intermediate results of the examination of each patient are represented by a vector of 32 features that characterize the state of the respiratory system [4].
Diagnosis was chosen as the dependent (grouping) variable, and the percentage contributions of EAC particles in the range 2...3100 nm were chosen as the independent variables (particles of other radii made no contribution). A scatterplot of canonical scores was obtained, which shows the distribution of the groups (Fig. 1).
A table of the coefficients and constant terms of the linear classification functions was also obtained (Table 1).
Then the classification functions were built: linear functions calculated for each diagnostic class. These functions can be used to classify the states of the respiratory system. Based on the functions, a classification matrix is constructed that contains the number and percentage of correctly classified patients in each group (norm, bronchitis, pneumonia). To illustrate these data, Table 2 shows a fragment of the classification matrix.
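As an illustration of how such a classification matrix can be tallied, the following minimal Python sketch counts, for each true group, how many patients were assigned to each predicted group and the percentage classified correctly. The function name `classification_matrix` and the toy labels are assumptions made for this example; they are neither the study data nor the STATISTICA output.

```python
from collections import Counter

GROUPS = ["norm", "bronchitis", "pneumonia"]

def classification_matrix(true_labels, predicted_labels, groups=GROUPS):
    """Return {true_group: {"counts": {...}, "percent_correct": float}}."""
    pairs = Counter(zip(true_labels, predicted_labels))
    matrix = {}
    for g in groups:
        # Row of the matrix: how patients of true group g were classified.
        counts = {p: pairs[(g, p)] for p in groups}
        total = sum(counts.values())
        percent = 100.0 * counts[g] / total if total else float("nan")
        matrix[g] = {"counts": counts, "percent_correct": percent}
    return matrix

# Toy labels (hypothetical, not the study data).
true_labels = ["norm", "norm", "bronchitis", "bronchitis",
               "pneumonia", "pneumonia"]
predicted = ["norm", "norm", "bronchitis", "pneumonia",
             "pneumonia", "pneumonia"]

matrix = classification_matrix(true_labels, predicted)
for group, row in matrix.items():
    print(group, row["counts"], f'{row["percent_correct"]:.1f}%')
```

Each row corresponds to a known diagnosis and each column to the diagnosis assigned by the classification functions, so the diagonal holds the correctly classified patients.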
For greater convenience, the squared Mahalanobis distances can be used; they show how far the state of each patient is from the centre of each group. When the state of the respiratory system of a new patient is diagnosed, the patient is assigned to the diagnostic group whose centre is closest.
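The nearest-centre rule can be sketched as follows. This is a minimal example that assumes a single pooled within-group covariance matrix shared by all groups; the centroids, the covariance matrix and the function names are hypothetical and serve only to show the calculation.

```python
import numpy as np

def squared_mahalanobis(x, centroid, pooled_cov):
    """Squared Mahalanobis distance from observation x to a group centroid."""
    diff = np.asarray(x, dtype=float) - np.asarray(centroid, dtype=float)
    # Solve a linear system instead of inverting the covariance explicitly.
    return float(diff @ np.linalg.solve(pooled_cov, diff))

def nearest_group(x, centroids, pooled_cov):
    """Return (name of the closest group, dict of squared distances)."""
    distances = {name: squared_mahalanobis(x, c, pooled_cov)
                 for name, c in centroids.items()}
    return min(distances, key=distances.get), distances

# Hypothetical two-variable example (not the study data).
centroids = {
    "norm": [10.0, 5.0],
    "bronchitis": [20.0, 8.0],
    "pneumonia": [30.0, 15.0],
}
pooled_cov = np.array([[4.0, 1.0],
                       [1.0, 2.0]])

group, distances = nearest_group([19.0, 9.0], centroids, pooled_cov)
print(group, distances)
```

Unlike the Euclidean distance, the Mahalanobis distance accounts for the variances of the variables and the correlations between them, which is why it is suited to percentage contributions that vary on different scales.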
The classification of 200 patients whose diagnosis was known in advance gave high accuracy. This supports the effectiveness of using the diagnostic features of EAC to assess the state of the human respiratory system.

Implementation of the discriminant analysis algorithm for automated diagnosis.
To implement our own software product, the following algorithm was applied to check the applicability of discriminant analysis.
1. Check whether the sample was measured on interval or ratio scales and whether the features are normally distributed.
2. Check whether the sample is divided into a finite number (at least two) of disjoint classes, or whether the probability of belonging to each class is known for each object.
3. Check, using the correlation matrix, that the variables are not correlated. If the means are related to the variances or standard deviations, or the variables are multicollinear, there is no single measure of the relative importance of the variables.
4. Check that each class contains at least two objects from the training sample [5].
To obtain the exact probability that the object under analysis belongs to a class, and an exact significance test for the initial data, the distribution in each class should be multivariate normal, that is, each variable should be normally distributed for fixed values of the other variables.
If the assumption of normality is violated, the probability cannot be calculated exactly. Therefore, when the data do not satisfy the condition of normal distribution, another method [6] is used.
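Some of the checks listed above lend themselves to a short sketch. The Python example below covers only the number of classes, the minimum class size and the pairwise correlations; it does not test normality. The function name `check_preconditions`, the correlation threshold of 0.9 and the toy data are assumptions made for illustration and are not taken from the developed software.

```python
import numpy as np

def check_preconditions(X, labels, max_abs_corr=0.9):
    """Return a list of problems found in the training sample.

    X is an (n_objects, n_variables) array; labels gives the class of
    each object. The 0.9 threshold is an illustrative assumption.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    problems = []

    # At least two classes, each with at least two objects.
    classes, counts = np.unique(labels, return_counts=True)
    if len(classes) < 2:
        problems.append("fewer than two classes")
    for cls, n in zip(classes, counts):
        if n < 2:
            problems.append(f"class {cls} has fewer than two objects")

    # Pairwise correlations between variables (columns of X).
    corr = np.corrcoef(X, rowvar=False)
    n_vars = X.shape[1]
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(corr[i, j]) > max_abs_corr:
                problems.append(
                    f"variables {i} and {j} are strongly correlated "
                    f"(r = {corr[i, j]:.2f})"
                )
    return problems

# Toy sample (hypothetical): column 2 is almost twice column 0,
# and the class "pneumonia" contains a single object.
X = [[1.0, 5.0, 2.1],
     [2.0, 3.0, 4.0],
     [3.0, 6.0, 6.2],
     [4.0, 2.0, 7.9],
     [5.0, 4.0, 10.1]]
labels = ["norm", "norm", "bronchitis", "bronchitis", "pneumonia"]

problems = check_preconditions(X, labels)
for problem in problems:
    print(problem)
```

An empty list means that none of these particular checks failed; it says nothing about the normality assumption, which has to be examined separately.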
The training information is formed from the results of examining patients who are characterized by a large number of features and whose membership in one of the groups is reliably established. The reliability of discriminant analysis depends on the reliability of the training information and on the number of objects in the observation matrix, which should range from several tens to several hundreds for each class of states.
The number of features in the observation matrix is not limited. However, to solve a diagnostic problem with the discriminant analysis algorithm, a limited number of the most informative features (usually up to 5...10) is taken. Features included in the observation matrix can be both quantitative and qualitative, but all of them must be quantified or scored according to their severity [7].
In this study, three groups of children were examined: healthy children, patients with bronchitis and patients with pneumonia before treatment. The group of healthy children consisted of 15 patients, the group with bronchitis of 37 patients, and the group with pneumonia of 24 children aged 6 to 10 years. Using laser correlation spectroscopy, the percentage contribution of particles with a radius of 2, 3, 4...18500 nm (32 values in total, on a logarithmic scale) to the composition of EAC was determined.
To begin the analysis, the variables that best discriminate the groups must be selected. One or more variables may turn out to be poor discriminators because the class means differ only slightly in these variables. In addition, two or more variables may carry the same information even though each is a good discriminator on its own. If some of them are used in the analysis, the others are redundant.
Redundant variables contribute nothing to the analysis because they carry too little new information. Variables that carry no new information or are poor discriminators need to be removed from the model, as they complicate the analysis and may even increase the number of incorrect classifications.
To exclude unnecessary variables, stepwise selection of the most useful discriminant variables for inclusion in the model was used.
The selection of variables is based on the results of the tolerance test and on the F-inclusion (F-to-enter) and F-exclusion (F-to-remove) statistics. The tolerance test determines whether a given variable is a linear combination of one or more already selected variables. A variable with low tolerance (below 0.01, the threshold value adopted in this experiment) is undesirable in the analysis because it provides no new information and, in addition, can lead to calculation errors through the rapid accumulation of rounding errors.
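The tolerance test can be illustrated with a short sketch. Here the tolerance of a candidate variable is computed as 1 - R^2, where R^2 comes from regressing the candidate on the already selected variables. Centring the data within each group before the regression is an assumption of this sketch, as are the function name `tolerance` and the synthetic data; the procedure used in the developed software may differ in detail.

```python
import numpy as np

def tolerance(X, labels, candidate, selected):
    """Tolerance of column `candidate` given the columns in `selected`.

    Tolerance = 1 - R^2 of the candidate regressed on the selected
    variables. Data are centred within each group first (an assumption
    of this sketch).
    """
    X = np.asarray(X, dtype=float).copy()
    labels = np.asarray(labels)
    for g in np.unique(labels):
        mask = labels == g
        X[mask] -= X[mask].mean(axis=0)
    y = X[:, candidate]
    if not selected:
        return 1.0  # nothing selected yet, so nothing can be duplicated
    A = X[:, list(selected)]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    # Residual sum of squares over total sum of squares equals 1 - R^2.
    return float(residual @ residual / (y @ y))

# Synthetic data (hypothetical): column 2 is the sum of columns 0 and 1,
# column 3 is generated independently of them.
rng = np.random.default_rng(0)
labels = np.repeat(["norm", "bronchitis", "pneumonia"], 10)
base = rng.normal(size=(30, 3))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + base[:, 1], base[:, 2]])

TOLERANCE_THRESHOLD = 0.01  # threshold quoted in the text
tol_redundant = tolerance(X, labels, candidate=2, selected=[0, 1])
tol_new = tolerance(X, labels, candidate=3, selected=[0, 1])
print(tol_redundant < TOLERANCE_THRESHOLD, tol_new < TOLERANCE_THRESHOLD)
```

A variable that is an exact linear combination of selected variables has a tolerance of essentially zero and is rejected, whereas a variable carrying new information passes the threshold.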
The F-inclusion statistic assesses how much a variable improves the discrimination beyond that already achieved with the previously selected variables. A variable that makes a significant contribution to the analysis should have an F-inclusion value greater than the threshold (in this experiment the threshold value of the F-inclusion statistic was taken to be one).
The F-exclusion statistic evaluates the significance of the deterioration in discrimination after a variable is removed from the list of already selected variables. This check is performed at the beginning of each step to find any variable that no longer contributes enough to the discrimination because later selected variables duplicate its contribution. That is, if the F-exclusion value of a variable is less than the threshold (in this experiment the threshold value of the F-exclusion statistic was taken to be zero), the variable should be excluded from the analysis.
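One common way to compute both statistics is through the partial Wilks' lambda, and the sketch below follows that textbook formulation: F = ((n - g - p) / (g - 1)) * (1 - L) / L, where n is the number of objects, g the number of groups, p the number of other variables in the model, and L the ratio of Wilks' lambda with and without the variable in question. The source does not state which formula the developed software uses, so this choice, the function names and the synthetic data are assumptions.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-group (W) and total (T) sums-of-squares-and-products matrices."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centred_total = X - X.mean(axis=0)
    T = centred_total.T @ centred_total
    W = np.zeros_like(T)
    for g in np.unique(labels):
        Xg = X[labels == g]
        centred = Xg - Xg.mean(axis=0)
        W += centred.T @ centred
    return W, T

def wilks_lambda(W, T, variables):
    """Wilks' lambda det(W)/det(T) restricted to the given variables."""
    if not variables:
        return 1.0
    idx = np.ix_(variables, variables)
    return float(np.linalg.det(W[idx]) / np.linalg.det(T[idx]))

def partial_f(X, labels, variable, others):
    """Partial F of `variable` given the variables in `others`.

    Serves as F-to-enter when `others` is the current model, and as
    F-to-remove when `others` is the current model without `variable`.
    """
    labels = np.asarray(labels)
    n, g, p = len(labels), len(np.unique(labels)), len(others)
    W, T = scatter_matrices(X, labels)
    partial = (wilks_lambda(W, T, list(others) + [variable])
               / wilks_lambda(W, T, list(others)))
    return (n - g - p) / (g - 1) * (1.0 - partial) / partial

# Synthetic data (hypothetical): variable 0 separates the groups,
# variable 1 is pure noise.
rng = np.random.default_rng(1)
labels = np.repeat(["norm", "bronchitis", "pneumonia"], 20)
shift = np.repeat([0.0, 3.0, 6.0], 20)
X = np.column_stack([rng.normal(size=60) + shift, rng.normal(size=60)])

F_TO_ENTER = 1.0  # threshold quoted in the text
f_enter_0 = partial_f(X, labels, variable=0, others=[])
f_enter_1 = partial_f(X, labels, variable=1, others=[0])
print(f_enter_0 > F_TO_ENTER, f_enter_0 > f_enter_1)
```

With no variables in the model, the statistic reduces to the ordinary one-way analysis of variance F for that variable, which is a convenient sanity check on an implementation.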
Table 3 shows the results of variable selection: the tolerance and F-exclusion values of all variables in the model at the last selection step. At the final step, the F-exclusion statistic can be used to rank the discriminating power of the selected variables: the variable with the highest F-exclusion value makes the largest contribution to the discrimination. Table 3 shows that the 2 nm variable contributes most to the discrimination and the 20 nm variable least among those selected.
Thus, the result of choosing the best group discriminators is a model that includes the following variables: 2, 3, 4, 5, 8, 11, 15, 20, 26, 210, 290 nm.
The next step in the analysis is to construct functions for classifying observations. Observations were classified using a linear combination of the selected discriminant variables that maximizes the differences between classes while minimizing the dispersion within classes; this combination is called the classification function. It has the following form:
$$S_j = a_j + b_{1j} x_1 + b_{2j} x_2 + \dots + b_{pj} x_p,$$
where $S_j$ is the classification function for the j-th group; $x_i$ is the value of the i-th variable of the observation being classified; $b_{ij}$ is the coefficient of the i-th variable in the classification function of the j-th group; and $a_j$ is the constant of the classification function of the j-th group.
Table 4 shows the coefficients of the classification functions obtained with the algorithm for further use in the software system. To determine which group an observation should be assigned to, the values of the classification functions are calculated for each group and the observation is assigned to the group with the largest value.
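The assignment rule can be sketched in a few lines of Python. The constants and coefficients below are hypothetical placeholders for two variables and are not the values from Table 4; the function names are likewise assumptions made for the example.

```python
def classification_scores(x, constants, coefficients):
    """Evaluate S_j = a_j + sum_i b_ij * x_i for every group j."""
    return {
        group: constants[group]
        + sum(b * xi for b, xi in zip(coefficients[group], x))
        for group in constants
    }

def classify(x, constants, coefficients):
    """Return (group with the largest score, dict of all scores)."""
    scores = classification_scores(x, constants, coefficients)
    return max(scores, key=scores.get), scores

# Hypothetical constants a_j and coefficients b_ij (not from Table 4).
constants = {"norm": -3.0, "bronchitis": -5.0, "pneumonia": -9.0}
coefficients = {
    "norm": [0.5, 0.1],
    "bronchitis": [0.8, 0.3],
    "pneumonia": [1.0, 0.9],
}

group, scores = classify([4.0, 6.0], constants, coefficients)
print(group, scores)
```

In the model described here, each coefficient list would hold one value per selected variable, in the same order as the percentage contributions in the observation vector.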

Conclusions
Thus, the paper demonstrates the possibility of applying a statistical approach to solving diagnostic problems in medical and biological research. Discriminant analysis can be used to diagnose the condition of the human respiratory system based on the composition of EAC determined by laser correlation spectroscopy. To automate pulmonological diagnosis, a software product has been developed that showed high accuracy and statistical stability. The implemented algorithm has the following limitations: it cannot be used when a posteriori classification data are lacking, and it cannot form new groups automatically. If objects must be classified into groups that were not predefined, other tools, for example cluster analysis, should be used.

Table 1
Parameters of the linear classification functions

Table 2
A fragment of the classification matrix

Table 4
Coefficients of the classification functions