Diabetes Type 2: Poincaré Data Preprocessing for Quantum Machine Learning

Abstract: Quantum Machine Learning (QML) techniques have recently attracted massive interest. However, reported applications usually employ synthetic or well-known datasets. One of these techniques, based on a hybrid approach combining quantum and classical devices, is the Variational Quantum Classifier (VQC), whose development seems promising. Although VQC has been widely studied, implementations for "real-world" datasets are still challenging on Noisy Intermediate Scale Quantum (NISQ) devices. In this paper we propose a preprocessing pipeline based on Stokes parameters for data mapping. This pipeline enhances the prediction rates when applying VQC techniques, improving the feasibility of solving classification problems using NISQ devices. By including feature selection techniques and geometrical transformations, enhanced quantum state preparation is achieved. In addition, a representation based on the Stokes parameters on the Poincaré sphere makes it possible to visualize the data. Our results show that the proposed techniques improve the classification score for the incidence of acute comorbid diseases in Type 2 Diabetes Mellitus patients. We used the implementation of VQC available in IBM's Qiskit framework and obtained accuracies of 70% and 72% with two and three qubits, respectively.


Introduction
Several efforts have been made in recent times to advance quantum software capable of exploiting the power of the available Noisy Intermediate Scale Quantum (NISQ) devices. These devices are being developed on a variety of hardware platforms and technologies, with a number of qubits ranging from fifty to a few hundred [1]. Despite the limitations in the number of qubits and their susceptibility to noise, these devices are leading to the development of more powerful quantum technologies for the future [2]. Each successful application is an important step in the development of Quantum Computing. One of these applications lies in the advancement of Machine Learning (ML) techniques, a technology that is widely used in a multitude of real-world applications [3], motivated by the advances achieved in different knowledge fields and the derivative commercial applications.
Quantum Machine Learning (QML) is one of the most encouraging applications and is being actively studied by several research groups [4]. In general, these groups aim to develop new techniques able to exploit Quantum Computing advantages to improve machine learning [5]. Supervised learning is a specific QML task that has recently attracted massive interest from academia and industry. There are several contributions in this field, including approaches with quantum-inspired neural networks and their applications [6][7][8], hybridized low-depth Variational Quantum Circuits (VQC) [9], optimization algorithms [10] with simple error mitigation [11], preprocessing techniques like PCA showing exponential improvements [12], and experiments for classification [13,14]. Also, multiple implementations of linear regression [15,16] on quantum computers have been proposed.
Several approaches to encode classical data into quantum states have been presented. These describe advantages including a reduction of the experimental overhead in terms of resources and the introduction of non-linearities in the data [14,17], enabling the use of linear classifiers and kernel-based methods [18,19] on near-term quantum processors, with an exponential speed-up when compared to classical algorithms [20]. In references [21,22], independent authors described the advantages of using quantum algorithms in machine learning methods. One example is the polynomial reduction in query complexity for nearest-neighbor classification when compared to classical algorithms [23].
Multiple examples of the use of quantum computing techniques in machine learning applications for well-known datasets have been presented. Datasets such as MNIST, Wine, Cancer and Iris are commonly used to test these approaches [13,18,24,25]. Nonetheless, real-world applications are scarce. Only a handful have been explored, including Reactor Coolant Pump (RCP) state classification at a Nuclear Power Plant [26], Wine recognition [8], Dementia prediction [27] and the partial dynamics of a complex 10-spin system [28]. There is a need to explore real applications of quantum machine learning that motivate further research in this area and make use of quantum properties in real-world settings. It should be noted that quantum machine learning algorithms may not yield an advantage when compared with their classical counterparts, but understanding their scope and limitations is critical in the development of current quantum technologies.
In this paper we present a real case study of Type 2 Diabetes Mellitus (T2DM). T2DM is the fourth leading cause of mortality and, with its rising prevalence, a major public health problem [29]. There are 415 million people with diagnosed diabetes, and it is estimated that around 193 million people suffer from the disease without a diagnosis. In both cases it can lead to micro- and macrovascular complications, causing major distress to both patients and caregivers [30]. We introduce a preprocessing technique to map the data into quantum states to perform quantum classification. In particular, this technique is based on a data representation using the Stokes parameters. It enhances the data encoding techniques proposed by [3], improving by 20% on average the classification score of a Variational Quantum Classifier implemented using IBM's Qiskit framework [31]. We conducted three different experiments using VQC over the same data features and parameters. In the first experiment, we standardized the data to zero mean and unit variance; in the second, we added an ellipsoidal coordinate transform to this normalization; and in the third, we additionally computed the Stokes parameters from the data. This paper is organized as follows. First, we present the proposed pipeline to preprocess the data, describing each of its stages: feature scaling and selection, ellipsoidal coordinate mapping and Stokes parameters data representation. Then we describe the quantum classifier employed and the experiments performed to classify the incidence of acute comorbidities in T2DM patients. Finally, we present the results obtained, followed by a discussion and conclusions.

Materials and Methods
The current limitations of NISQ devices impose restrictions on Quantum Machine Learning techniques [1]. Currently, many proposed QML applications rely on well-known datasets, for which the preprocessing techniques are now standard [13,18,24]. When using real-world datasets, these techniques are not always suitable for adapting the data to be processed by quantum classification models. We propose a data preprocessing pipeline, presented in Fig. 1, to transform datasets before applying current kernel techniques for VQC algorithms. Each component of this pipeline is described below.

Feature Scaling
This step ensures the same scaling for the numerical inputs of the model, enhancing the accuracy and speed of optimization methods during training. In general, this is a required step in the data preprocessing pipeline for most classical Machine Learning techniques [32]. For QML implementations, it is a fundamental step because of the constraints on data when representing it as quantum states. These restrictions result from the properties of quantum mechanics. Accordingly, we standardized the data to zero mean and unit variance. Then, each feature vector was scaled to the range [−1, +1].
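The two scaling steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the population standard deviation is assumed for the standardization, and the rescaling to [−1, +1] is assumed to be a linear min-max map applied per feature.

```python
import statistics


def standardize(values):
    """Return z-scores: zero mean and unit variance (population std, an assumption)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def rescale_to_range(values, low=-1.0, high=1.0):
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [0.5 * (low + high) for _ in values]
    factor = (high - low) / (vmax - vmin)
    return [low + (v - vmin) * factor for v in values]


def scale_feature(values):
    """Standardize one feature column, then rescale it to [-1, +1]."""
    return rescale_to_range(standardize(values))
```

Each feature column would be passed through `scale_feature` independently. Note that after the second step the column generally no longer has exactly zero mean and unit variance.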

Feature Selection
Feature selection techniques are based on the idea of identifying and removing less relevant or redundant features, providing faster and more cost-effective predictors [33]. These techniques are relevant when processing medical datasets, where features could introduce noise into the models and make the classification process more difficult during training. Algorithms like Principal Component Analysis (PCA) have proven to be effective for data preparation, even in QML applications and processing [12,34]. However, this kind of algorithm makes the transformed data hard to interpret. Avoiding such artificial transformations keeps the features interpretable before they are used in the model.
For this reason, we based our methodology on variable ranking, calculating the mean of the scores obtained from the feature importance of four different classical classification methods. Our main goal was to find the subset of features from our dataset that gives the best classification performance with the minimum number of features, considering both the encoding transformation used to define quantum states and the limitations of current quantum devices in terms of quantum volume. Therefore, we selected the top three features according to the calculated score. This dimensionality constraint is imposed by the ellipsoidal coordinate mapping. The chosen methods were Gradient Boosting [35], Random Forest, K-Best and Extra Trees, the last of which reduces overfitting of the data.
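The ranking step can be illustrated with a small sketch. It assumes that each of the four methods has already produced one non-negative importance score per feature, and that the scores of each method are normalized to sum to one before averaging so that methods with different score scales contribute equally. Both the normalization and the function name are assumptions made for illustration, not details stated in the text.

```python
def mean_importance_ranking(scores_by_method, top_k=3):
    """Average per-method feature importances and return the top_k feature names.

    scores_by_method maps a method name to a {feature_name: score} dictionary.
    Every method is expected to score the same set of features.
    """
    features = sorted(next(iter(scores_by_method.values())))
    n_methods = len(scores_by_method)
    mean_score = {name: 0.0 for name in features}
    for scores in scores_by_method.values():
        total = sum(scores.values())
        for name in features:
            # Normalize within the method (assumption), then accumulate the mean.
            share = scores[name] / total if total else 0.0
            mean_score[name] += share / n_methods
    ranked = sorted(features, key=lambda name: mean_score[name], reverse=True)
    return ranked[:top_k], mean_score
```

With `top_k=3` the function returns the three features with the highest mean score, matching the dimensionality required by the ellipsoidal coordinate mapping.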

Ellipsoidal Coordinate Mapping
The selected features were transformed into a coordinate space in which they can be easily represented using the Poincaré sphere. We use an iterative method based on [36] to transform those features from a Cartesian coordinate space (x, y, z) to an ellipsoidal coordinate space (ϕ, λ, h), following Eqs. (1)-(3).
In these equations, N(i) is evaluated at each iteration i, p is a constant, and e is determined by the semi-major axis a and the semi-minor axis b.
We executed the method specifying a dispersion error of 10⁻⁵ for each data point.
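As an illustration of this kind of iterative conversion, the sketch below uses the textbook fixed-point iteration for Cartesian-to-ellipsoidal coordinates. It is an assumption that this matches the method of [36]. The definitions p = sqrt(x² + y²), e² = (a² − b²)/a² and N = a / sqrt(1 − e² sin² ϕ) are the standard ones and are assumed here, as is the interpretation of the 10⁻⁵ tolerance as the change in ϕ between consecutive iterations.

```python
import math


def cartesian_to_ellipsoidal(x, y, z, a, b, tol=1e-5, max_iter=100):
    """Convert (x, y, z) to (phi, lam, h) on an ellipsoid with semi-axes a >= b.

    Standard fixed-point iteration (assumed form):
      N_i       = a / sqrt(1 - e^2 sin^2(phi_i))
      h_i       = p / cos(phi_i) - N_i
      phi_{i+1} = atan( z / (p * (1 - e^2 * N_i / (N_i + h_i))) )
    """
    e2 = (a * a - b * b) / (a * a)   # squared eccentricity
    p = math.hypot(x, y)             # distance from the z-axis
    lam = math.atan2(y, x)           # longitude needs no iteration
    if p == 0.0:
        # Point on the polar axis: latitude is +/- 90 degrees.
        return math.copysign(math.pi / 2.0, z), lam, abs(z) - b

    phi = math.atan2(z, p * (1.0 - e2))  # initial guess (h = 0)
    for _ in range(max_iter):
        n = a / math.sqrt(1.0 - e2 * math.sin(phi) ** 2)
        h = p / math.cos(phi) - n
        phi_new = math.atan2(z, p * (1.0 - e2 * n / (n + h)))
        converged = abs(phi_new - phi) < tol
        phi = phi_new
        if converged:
            break

    # Recompute N and h with the final latitude.
    n = a / math.sqrt(1.0 - e2 * math.sin(phi) ** 2)
    h = p / math.cos(phi) - n
    return phi, lam, h
```

The values of `a` and `b` are free parameters of the sketch; the text does not state which semi-axes were used for the scaled features.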

Stokes Parameters: Poincaré Sphere Representation
A convenient geometrical representation of quantum states is obtained with the Bloch sphere, also known as the Poincaré sphere, which has been used to describe polarization states by means of the Stokes parameters. When the data are defined in terms of ellipsoids, the resulting definitions are mathematically analogous to the Stokes parameters used to describe polarization, although they have no physical relation to them. The ellipse parameters are represented by: where φ is the azimuth angle between the semi-major axis of the ellipse and the x-axis, and ψ is the ellipticity angle, defined as the inverse tangent of the ratio between the lengths of the semi-axes of the ellipse (Fig. 2a). A simple geometric representation of these parameters is obtained by defining a spherical surface of unit radius: On this surface, the variables S1, S2 and S3 can be considered the Cartesian coordinates of the point S on the unit sphere, with 2ψ and 2φ being the angular coordinates of this point.
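The mapping from the two angles to a point on the unit sphere can be sketched as follows. The formulas S1 = cos 2ψ cos 2φ, S2 = cos 2ψ sin 2φ and S3 = sin 2ψ are the usual normalized Stokes parameters (with S0 = 1) and are assumed here to be the ones intended. The function names, and the choice of the ratio b/a for the ellipticity angle, are illustrative assumptions.

```python
import math


def ellipticity_angle(semi_major, semi_minor):
    """Ellipticity angle psi = atan(b / a) from the semi-axes of the ellipse."""
    return math.atan2(semi_minor, semi_major)


def stokes_from_angles(azimuth, ellipticity):
    """Return (S1, S2, S3) on the unit sphere for azimuth phi and ellipticity psi.

    The point has angular coordinates 2*phi (longitude) and 2*psi (latitude).
    """
    s1 = math.cos(2.0 * ellipticity) * math.cos(2.0 * azimuth)
    s2 = math.cos(2.0 * ellipticity) * math.sin(2.0 * azimuth)
    s3 = math.sin(2.0 * ellipticity)
    return s1, s2, s3
```

By construction S1² + S2² + S3² = 1, so every data point lands on the surface of the sphere and can be plotted directly.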

Variational Quantum Classifier
The Variational Quantum Classifier (VQC) is a quantum method for supervised learning that makes it possible to solve classification problems on current NISQ devices. Based on a method proposed by Havlíček et al. [3], this algorithm yields experimental results on NISQ devices without the need for additional error-correction techniques. The calculation of the cost function from iterative measurements on the device serves as error mitigation, since noisy measurements are included in the optimization. It has also been shown that mapping features to quantum states using amplitude encoding is a suitable option for preprocessing data when using VQC, provided that the data is low dimensional or its structure allows for efficient approximate preparation [14,18,37]. The method is a hybrid approach in which the parameters are optimized and updated on a classical computer, so the optimization process does not increase the coherence times needed [3,28].
One of the key components of this method is the feature map definition, which maps data into a potentially vastly higher-dimensional Hilbert space of a quantum system [14]. This makes it possible to perform efficient computations over non-linear basis functions on a possibly intractably large space, the feature space. A similar idea, known as the kernel trick, has been explored in classical machine learning [38]. Nonetheless, performing these operations on classical devices could take exponential resources, so quantum computing allows for creating more complex models that could predict with higher precision [28]. Havlíček et al. [3] proposed a VQC with two main elements, presented in Fig. 3. The first is a feature map that works as a fixed black box encoding classical data $x_i$ into quantum states $|\Psi(x_i)\rangle$ by applying transformations to the ground state $|0\rangle^{\otimes n}$ using products of single- and two-qubit unitary phase gates. Specifically, the authors' experimental implementation results in the unitary gate $\mathcal{U}_{\Phi(x)} = U_{\Phi(x)} H^{\otimes n} U_{\Phi(x)} H^{\otimes n}$, where $H$ represents the Hadamard gate and $U_{\Phi(x)} = \exp\left(i \sum_{S \subseteq [n]} \phi_S(x) \prod_{i \in S} Z_i\right)$ is a diagonal gate in the Pauli-Z basis. The second is a short-depth unitary circuit $U(\theta)$ with $l$ layers of θ-parameters, which are optimized during training by minimizing a cost function on a classical device and tuning θ iteratively. Using parameterized quantum circuits, an approach known as quantum circuit learning (QCL), implies the use of a number of functions that is exponential in the number of qubits of the parameterized circuit. This is intractable on classical computers and therefore allows more complex functions to be represented than with classical counterparts [28].
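To make the structure of the feature map concrete, the following sketch simulates U_Φ(x) H⊗n U_Φ(x) H⊗n acting on |0…0⟩ with a plain state vector. It is not Qiskit code and not the authors' implementation. The default phase functions φ_{i}(x) = x_i and φ_{i,j}(x) = (π − x_i)(π − x_j) are the choice usually associated with Havlíček et al. and are an assumption here, as are all function names.

```python
import cmath
import math
from itertools import combinations


def hadamard_all(state):
    """Apply a Hadamard gate to every qubit of the state vector."""
    n = len(state).bit_length() - 1
    out = list(state)
    for q in range(n):
        step = 1 << q
        for base in range(0, len(out), 2 * step):
            for k in range(base, base + step):
                a, b = out[k], out[k + step]
                out[k] = (a + b) / math.sqrt(2.0)
                out[k + step] = (a - b) / math.sqrt(2.0)
    return out


def default_phases(x):
    """Assumed phi_S(x): single-qubit terms x_i, pair terms (pi - x_i)(pi - x_j)."""
    phases = {(i,): x[i] for i in range(len(x))}
    for i, j in combinations(range(len(x)), 2):
        phases[(i, j)] = (math.pi - x[i]) * (math.pi - x[j])
    return phases


def diagonal_phase(state, phases):
    """Apply exp(i * sum_S phi_S * prod_{q in S} Z_q), diagonal in the Z basis."""
    out = []
    for index, amplitude in enumerate(state):
        total = 0.0
        for subset, phi in phases.items():
            # Z eigenvalue is +1 for bit 0 and -1 for bit 1.
            sign = 1
            for q in subset:
                if (index >> q) & 1:
                    sign = -sign
            total += phi * sign
        out.append(amplitude * cmath.exp(1j * total))
    return out


def feature_map_state(x, phase_fn=default_phases):
    """Return the amplitudes of U_Phi H^n U_Phi H^n |0...0> for data point x."""
    state = [0j] * (1 << len(x))
    state[0] = 1 + 0j
    phases = phase_fn(x)
    for _ in range(2):
        state = hadamard_all(state)
        state = diagonal_phase(state, phases)
    return state
```

For a data point with two or three features, `feature_map_state` returns 4 or 8 complex amplitudes, which is small enough to inspect by hand when checking how a preprocessing step changes the encoded state.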

Case Study: Incidence of Acute Diseases in Diabetic Patients
Type 2 Diabetes is a rising public health problem [29]. Patients with Type 2 Diabetes represent over 90% of all patients with any type of diabetes, and the disease is the seventh cause of death worldwide [29,39]. It leads to a number of micro- and macrovascular events [40], which represent short- and long-term complications such as cardiovascular disease, nephropathy, retinopathy, peripheral neural disease, limb amputation, erectile dysfunction and depression, among others. Given its close relation with lifestyle and obesity, the number of people suffering from this condition and its complications keeps increasing [30]. This steady increase places a huge burden on the healthcare system and raises healthcare costs [30,41,42]. Given the wide range of complications and disabilities that come with it, this disease has a major impact on patients' lives and on the healthcare system supporting them.
Several T2DM-related complications have been studied through different classical Machine Learning, Deep Learning and Data Mining techniques [43,44]. Identifying the risk factors associated with these complications is of great value to the clinical management of individuals with diabetes. Due to the high level of disability and the incremental costs of the disease, it is necessary to investigate the causes involved in the genesis of complications. This would make it possible to address them in the future and to apply medical knowledge not only from a healing perspective but also from a preventive one, sparing the patient suffering and saving the healthcare system money.
We used a dataset containing clinical information from patients diagnosed with Type 2 Diabetes Mellitus. For each subject, a successive 12-month time period was defined. A patient was considered diagnosed with T2DM during this period if the disease onset date was prior to the established cut-off point. Following these criteria, the total study population was 149,015, filtered from a larger database containing Electronic Health Records (EHR) from Osakidetza (Basque Health Service) in Bilbao, Spain. This dataset includes clinical variables such as LDL cholesterol, Body Mass Index and glycated hemoglobin (A1C). Demographic variables including age, gender and socioeconomic status were also considered.
The study protocol was approved by the Clinical Research Ethics Committee of Euskadi (PI2014074), Spain. Informed consent was not obtained because patient health records were made anonymous and de-identified prior to analysis.

Results
In particular, our concern is the prediction of acute conditions. We studied the incidence of acute myocardial infarction, major amputation or avoidable hospitalizations. Following the methodology discussed in Section 2.2 for feature selection, we used gender, LDL cholesterol and Johns Hopkins' Aggregated Diagnosis Groups (ADG). These features were among the highest scores when gradient boosting, random forest, k-best and tree-based techniques were applied as feature selectors.   [45], ADG has been used in the past to assess mortality in diabetic patients [46], and we included gender as one of the selected features because it provides a differential characteristic, meaning that it could enhance separability due to its non-direct correlation with the output. This could be explained by differences in hormones, fatty tissue distribution or simply lifestyle.
Then we randomly selected a balanced set of 250 samples from the data. These were split into two subsets, 200 for training and 50 for testing. After preprocessing the data, it is possible to represent the data points on the Poincaré sphere as depicted in Fig. 4, where the red dots represent patients with acute conditions and the blue dots patients without. Although this step is not needed to process the information, it provides a good visual representation of the data distribution. The experiments were run with the VQC implementation available in IBM's Qiskit framework [31]. Every combination of the experiments was executed with 1024 shots, using the implementation of the COBYLA optimizer [47] provided by the same framework. We conducted tests with two and three qubits. In each case we compared the accuracy, precision, recall and F1-score when applying data normalization between −1 and 1 with zero mean and unit variance, when adding the ellipsoidal coordinate transform to this normalization, and finally when adding the Stokes parameter representation. The results of these experiments are summarized in Tab. 1.
To show the advantage of using the proposed pipeline and its elements, we performed three experiments using VQC over the T2DM dataset. In the first experiment we standardized the data to zero mean and unit variance before passing it to the model. In the second experiment, in addition to this normalization, we transformed the data to ellipsoidal coordinates. Finally, in the third experiment we calculated the Stokes parameters in addition to all the previous steps, which makes it possible to visualize the data points on a Poincaré sphere. By including all the data preprocessing elements, we obtained a pipeline that enhances data preparation for VQC applications. The results of these experiments are presented in Fig. 5, using accuracy, precision, recall and F1-score as metrics to evaluate the model's performance. In these figures it can be seen that using the proposed pipeline to transform the data induces significant improvements in classification performance, in particular in the 2-qubit case. Moreover, with the proposed technique the results obtained with 2 qubits resemble those obtained with 3 qubits, enhancing the classification results even when fewer resources are available.
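The sampling step can be sketched as below. It assumes binary labels (1 for patients with acute conditions, 0 otherwise), an equal number of samples per class (125 + 125), and a simple shuffled 200/50 split. Whether the original split was stratified is not stated, so that detail, the seed and the function name are assumptions.

```python
import random


def balanced_train_test_split(labels, per_class=125, n_train=200, seed=0):
    """Draw per_class indices from each class, shuffle, and split into train/test."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) < per_class or len(negatives) < per_class:
        raise ValueError("not enough samples in one of the classes")
    chosen = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(chosen)
    return chosen[:n_train], chosen[n_train:]
```

The function returns index lists rather than data, so the same split can be reused across the three preprocessing variants to keep the comparison fair.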
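For reference, the four reported metrics can be computed from predicted and true labels as in the sketch below. It uses the standard definitions for binary classification; the function name and the convention of returning 0.0 when a denominator is zero are choices made for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1-score for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2.0 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Applied to the 50 test samples of each experiment, this yields one set of four numbers per preprocessing variant and qubit count.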

Conclusions
Research on Quantum Machine Learning applications is advancing the use of current quantum computers, given the wide range of applications and the industry's interest in machine learning techniques to solve practical problems. We consider that this work contributes new techniques for exploiting NISQ devices in "real-world" applications of QML.
A milestone to pursue is achieving quantum advantage in commercial applications. Machine learning is an area of computer science where statistics, data processing and analytics converge, given the relevance of data across different fields and the breadth of applications. In particular, Quantum Machine Learning is being actively investigated by several research groups, as exploiting the advantages of quantum computing could improve and expand the range of real-world machine learning applications.
In this paper we propose a pipeline to transform and preprocess data, making it feasible to classify it using Quantum Machine Learning techniques. By using this pipeline, we enhanced the quantum state preparation for the VQC algorithm. Our results showed that with the proposed techniques we obtained similar results when classifying the incidence of acute diseases in diabetes patients using a Variational Quantum Classifier with two and three qubits, with 70% and 72% accuracy respectively. We are currently studying and developing unsupervised and supervised machine learning techniques suitable for NISQ devices, given the limitations on coherence times and the number of qubits available on current devices. In particular, we are conducting further research on applying the proposed preprocessing pipeline to improve data suitability for other QML techniques, such as the Quantum Support Vector Machine. We are also evaluating the execution advantages of applying the proposed technique in different environments.
Funding Statement: This project was partially funded by eVIDA Research group IT-905-16 from Basque Government.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.