Application of Data Mining Algorithms for Dementia in People with HIV/AIDS

Dementia interferes with the individual's motor, behavioural, and intellectual functions, causing him to be unable to perform instrumental activities of daily living. This study is aimed at identifying the best performing algorithm and the most relevant characteristics to categorise individuals with HIV/AIDS at high risk of dementia from the application of data mining. Principal component analysis (PCA) algorithm was used and tested comparatively between the following machine learning algorithms: logistic regression, decision tree, neural network, KNN, and random forest. The database used for this study was built from the data collection of 270 individuals infected with HIV/AIDS and followed up at the outpatient clinic of a reference hospital for infectious and parasitic diseases in the State of Ceará, Brazil, from January to April 2019. Also, the performance of the algorithms was analysed for the 104 characteristics available in the database; then, with the reduction of dimensionality, there was an improvement in the quality of the machine learning algorithms and identified that during the tests, even losing about 30% of the variation. Besides, when considering only 23 characteristics, the precision of the algorithms was 86% in random forest, 56% logistic regression, 68% decision tree, 60% KNN, and 59% neural network. The random forest algorithm proved to be more effective than the others, obtaining 84% precision and 86% accuracy.


Introduction
Data mining (MD) is one of the data exploration processes capable of predicting and extracting consistent patterns by using strategies such as learning algorithms, such as artificial intelligence (AI), or classification in statistics, which can reveal hidden relationships and accurate data [1,2].
The application of MD is in health information systems, in the public and private spheres, which, through a process of selection, preprocessing, and data transformation, one can discover patterns and generate knowledge through their interpretations. With this method, the health professional will identify, characterise, and guide the patient based on patterns of health problems and care therapies for different diseases [2].
As one of the most prevalent diseases and considered a public health problem, we have neurocognitive disorders associated with the human immunodeficiency virus (HAND) because of the virus compromising the immune system. Also, it has an affinity for the central nervous system (CNS) by destroying macrophages that cross the blood-brain barrier, which cannot be invaded by highly active antiretroviral therapy (HAART) [3,4].
Among the cognitive disorders associated with the human immunodeficiency virus (HIV), dementia is consid-ered its most serious type, as it interferes with the motor, behavioural, and intellectual functions of the infected individual, causing him the inability to perform instrumental activities of daily living, whose symptoms are manifested by irritability, emotional lability, memory loss, dysdiadochokinesia, and spastic gait [5,6].
The health professional, despite carrying out the clinical history, neurological, and cognitive examination, sometimes feels undecided as to the therapeutic approach to be taken in the face of a possible neurological complication, since the onset of HIV dementia is insidious and initial symptoms may be confused with other psychiatric illnesses, such as depression and anxiety [7].
In this context, the use of technological strategies, such as information and communication technologies (ICT), can be considered essential tools for detecting the early diagnosis of dementia and other diseases. Computer science methodologies with ICT may be able to classify the main characteristics of the disease, in the face of the patient's social and clinical changes, in a practical way and serve as an aid for health professionals in promoting information management and quality safe care [8].
Among the computational techniques, there is the use of algorithms that consists of the understanding and structure of existing data and capable of generating predictive rules, whose results, when combined, can bring benefits in reducing costs and in the effectiveness of interventions, both in treatment and in preventive actions [9].
Thus, the following question arises: can a data mining technique identify the most relevant characteristics given the risk of dementia in people with HIV/AIDS? Early recognition and categorisation of the characteristics that influence the risk of dementia will allow the ability to reverse the disease to choose an effective HAART and additional treatments through better social and psychological support to improve the person's life expectancy and quality life with HIV/AIDS. It will also assist the health professional by offering more precise and targeted guidance to family members in the face of behavioural and cognitive changes of the seropositive individual.
For some years, predictive models have been developed to predict outcomes of interest to the population's health, like computerised image analysis based on precision medicine. Researchers believed that the machine could interpret the analysis of complex and straightforward testing [10].
Besides, contributing to the agility in the interpretation of results by reducing the time of action for urgent cases and increasing the degree of confidence in the diagnoses is an advantage of the technologies applied to health, together with the presence of the health professional during the verification and validation of the results based on clinical, epidemiological correlations, and therapeutic decision [10].
In this context, more flexible artificial intelligence algorithms have been used in predictive modelling of outcomes of health interest due to the high amount of data available to carry out this research. As a result, the algorithms show a better outcome prediction potential because they capture complex relationships in the data and their ability to deal with many predictors [11,12].
Regarding sexually transmitted infections to affected people, prognostic information related to the risk of death and other opportunistic diseases has become increasingly important for health professionals. Also, researchers and public policymakers are a tool to assist in decision-making regarding disease tracking targeting preventive programs and offering specialised treatments [10].
In this context, the study seeks to identify the algorithm with the best performance capable of revealing the most relevant characteristics in the high risk of developing dementia in people with HIV/AIDS, based on data mining.

Methodology
In this research, the principal component analysis (PCA) algorithm was applied to assess the impact of reduced dimensionality and the quality of the solution between the following machine learning algorithms: logistic regression, decision tree, neural network, KNN, and forest random.
Machine learning algorithms have specific functions, such as logistic regression, whose statistical technique describes the behaviour between a binary dependent variable and independent metric or nonmetric variables. It investigates the effect of the variables by which individuals are exposed to the probability of a specific event occurring [10].
The decision tree is a nonparametric method applied for data classification and regression, in which a set of attributes is used to predict an output value. The model is formed by elements with information called nodes, which are read in a descending way, and which starts at the root node, with a higher hierarchical level until it reaches the leaf node that consists of the output, that is, the predicted result [10][11][12].
Neural network seeks to simulate the processes of a human neural network through models composed of nonlinear elements that operate in parallel, through the environment and ends when the neural network manages to generalise solutions for a class of problem [13].
The KNN algorithm is a learning based on instances stored. As a new sample is presented to the algorithm, a new classification is generated from a distance closest to neighbouring cases. The decision rule is built directly without estimating the densities conditioned to the classes; relative patterns can be classified in the space of characteristics that probably belong to the same class [13][14][15].
The random forest consists of a set of decision trees generated within the same object, using a classifier aggregation technique built so that its structure is composed at random. Thus, the method combines several decision trees through a vote that classifies the final class to receive the highest votes [2].
The database used for this study was built by collecting data from 270 HIV/AIDS-infected individuals and followed up at the outpatient clinic of a Reference Hospital for Infectious and Parasitic Diseases, in the State of Ceará, Brazil, from January to April 2019. The questionnaire used for the collection was composed of sociodemographic data, clinical health, and risk behaviours; simplified medication adherence questionnaire (SMAQ), Brazilian version to assess adherence to treatment with antiretrovirals; the International Dementia Scale to detect dementia in the HIV patient; the Barthel Index 2 Computational and Mathematical Methods in Medicine and the Lawton and Brody Scale to assess the degree of functional capacity; and finally the Social Support Scale for people living with HIV/AIDS. The International HIV-Associated Dementia Scale (IHDS) is a screening tool to identify the high or low risk of dementia in HIV patients who do not need a specialist and are not influenced by the individual's education. Also, the use of the scale in clinical practice has a sensitivity of approximately 80% and specificity around 57%, with a cut-off score less than or equal to 10 for HAD diagnosis. IHDS can be used to identify probable cases of HAD in situations where formal neuropsychological assessment is impossible. Table 1 shows the classes with the most relevant independent variables used in this research: The Python programming language is versatile and capable of providing good code performance and readability. This language is viable for working with rapid application development methods, reducing waste, and optimising code quality; development techniques "are not repeated" and do not allow ambiguities, and have fast and straightforward software [15].
The study was carried out from the execution of all the machine learning algorithms already mentioned. It sought to analyse the performance of the algorithms from the 104 available characteristics present in the database.
Before executing the PCA, the data were standardised using the mean and standard deviation, according to the following equation: where μ is the mean and σ is the standard deviation applied in the execution of machine learning algorithms. Subsequently, the principal component analysis (PCA) algorithm was performed, whose multivariate statistical technique is widely used to reduce dimensions and extract resources. The processing of this technique can be summarised as a set of orthogonal linear transformations of the original variables so that the transforming variables maintain, as much as possible, the information contained in the actual characteristics [1,16].
The PCA captured 90% of the variation in the data and selected the main characteristic of each of these components without repetition.
In the third moment, the main components were analysed, which captured 80% of the data variation and selected the main characteristic of each one without repetition.
They sorted the components in descending order by the variation of the data they represented. Thus, the first components reflected the most significant relevance and corresponded to most of the data variation.
For each of these executions, we analysed the performance of the algorithms and the respective most representative characteristics.

Results and Discussion
Initially, the logistic regression, decision tree, neural network, KNN, and random forest algorithms were obtained for 104 components, with the following results in Table 2.
The total execution time of the algorithms took a maximum of 03 seconds, which corresponded from reading and processing the database to reading the PCA. It should be noted that each algorithm independently took up to 02 seconds to obtain the results.
Through the PCA, the number of characteristics was reduced from 104 to 47. Generally, there was a reduction in the performance of the algorithms; random forest, for example, started to reach 80.49% accuracy, and therefore, the number of characteristics was reduced from 47 to 33.

Computational and Mathematical Methods in Medicine
Moreover, the performance of the algorithms remained as in the execution already mentioned, except Logistic regression continued to lose performance.
Logistic regression continued to lose performance. The following steps were performed considering the main components that captured 70%, 60%, 50%, and 40% of data variation, respectively, when observed in Table 3. Figure 1 shows that 70% captured from 23 characteristics were analysed; the accuracy of the different algorithms showed the best performance of random forest: From the results obtained in the executions carried out, it is possible to analyse approximately 86% in random forest, 56% logistic regression, 68% decision tree, 60% KNN, and 59% neural network.
Dimensionality reduction could improve the quality of machine learning algorithms. It delimits the less relevant components that will only cause noise in the algorithms or increase the complexity of the problem. However, ignoring elements causes some of the data variation to be lost, leading to reduced performance of the prediction algorithms.
It can be seen in Figure 2 that during the tests, even losing about 30% of the data variation, that is, when considering only 23 characteristics, the random forest algorithm obtained better results regarding the precision and accuracy of 84% and 86%, respectively. In different areas of study, random forest has obtained good results and demonstrated that these algorithms had provided superior results in the prediction capacity compared to other machine learning algorithms. Moreover, it can analyse from samples considered small like this one study until analysing complex data structures [17,18].
It is also noted that when using less than 23 characteristics, the algorithm loses performance, which assumes that essential aspects may be being disregarded. Reducing the dimensionality of PCA has reduced the complexity of the data and filters out the crucial characteristics to classify whether a patient is at risk of developing dementia.
Thus, the main characteristics extracted with this technique are described below, in an increasing order of representativeness of the data: paranoid ideation; walking; adherence to antiretroviral medication; having the impression that Psychopathological symptoms can be assessed by the Brief Symptom Inventory indicator, whose nine dimensions are elements of psychopathology, such as somatisation that reveals complaints from cardiovascular to gastrointestinal; obsession-compulsion, obsession-compulsion is a symptom of the individual's impulsive behaviour. Interpersonal sensitivity, which turns to the feeling of inferiority and shyness; depression; anxiety, which can be assessed through nervous manifestations and apprehension; hostility; phobic anxiety; paranoid ideation, whose behaviour is considered harmful and disturbing; and psychoticism, in which the individual isolates himself and has schizophrenic behaviours, through hallucinations [19,20].
Phobic conditions such as terror or panic attacks may require the individual to overestimate the danger offered by the object or event in question since the discovery of HIV infection, as well as the infection's relationship with certain mental illnesses, given the neuroinflammatory factors and the stress that contribute to this situation [19][20][21].
Depression is one of the most common comorbidities in HIV-positive patients, causing worsening of medication adherence and increased mortality. Clinical findings demonstrate a relationship between depression and increased levels of inflammatory cytokines (TNF-alpha, IL-1beta, and IL-6) in both systemic and CSF terms [19][20][21].
Among the feelings of sadness and loneliness related to a depressive condition, it can be associated with factors other than disease progression. The depressive feeling persists when the subject reveals a lack of hope for the future, thereby affecting adherence to antiretrovirals [21].
The low antiretroviral adhesion directly implies the cognitive damages caused by the high viral load and the degree of immunosuppression of the infected individual, whose determinant most susceptible to HAND corresponds to the nadir values of LTCD4 + <350 cells/mm 3 or LTCD4 + current < 350 cells/mm 3 [22,23].
Also, it is emphasised about opportunistic neurological diseases of HIV, such as neurotoxoplasmosis, tuberculous meningitis, and neurocryptococcosis, which are essential comorbidities for psychiatric damage. Neurotoxoplasmosis is the most frequent opportunistic infection in the CNS in HIV-positive individuals, causing a lesion with a more common mass effect, especially when the patient has CD4 + <200 cells/mm 3 [22][23][24][25][26]. Table 4 presents the demographic characteristics of people with HIV according to the research covered.
Regarding age, scientific evidence reveals that the proportion of older adults living with HIV in the Netherlands in 2030 will reach 73%. In the United States of America (USA) and Brazil, it currently exceeds 50%. Among the statistics, half of HIV infections in older women are acquired at approximately 56 years. Thus, age can be considered significant when associated with HIV, given the high probability of late diagnosis and short-term mortality [27].
In terms of schooling, theoretical studies investigate the relationship between the level of knowledge and its influence on health behaviour in the awareness of self-perceived risk, continued condom use during sexual intercourse, and adherence to antiretroviral drug therapy [28].
Thus, people with a low level of education need social support and continuous monitoring by health professionals regarding the assessment of precession, which is permanently monitored in the evaluation of the information they understand and its impact on the quality of life in the dimensions: social, cultural, psychic, and clinical [28].
In this way, the practical and early diagnosis of these neuropsychological conditions and the correct therapeutic intervention can delay or even reverse dementia associated with HIV and benefit the individual.

Conclusion and Future Works
Because of the database used, it was possible to classify and analyse the main characteristics of dementia in people with HIV/AIDS using techniques such as PCA and MD. Through computational tests between the MD algorithms, the random forest algorithm proved to be more effective concerning the others, obtaining higher precision and accuracy.
In addition to dementia, this research could be insightful about other mental disorders in the face of initial symptoms of depression, panic attacks, and feelings of persecution. Thus, it may be able to sensitise health professionals to a more accurate assessment of factors related to the individual's personality, which may be associated with medication adherence and, consequently, with the value of viral load and TCD4 lymphocytes.
Thus, identifying the main characteristics of prediction of dementia due to HIV can assist the health professional for referral and appropriate treatment to these infected people through interpersonal, cognitive-behavioural, and supportive psychotherapy. In addition, orientation contributing to the family about adherence to antiretroviral therapy monitoring, social life, and instrumental activities by the HIV-positive individual.
Since the results of this application are in a local population, it is suggested that these algorithms be applied to more massive databases to confirm and improve their performance. Although the sample is small, its data is independent and distributed equally about those observed in the future and follow the same distribution of the population from which the sample was taken.