Formulating multi diseases dataset for identifying, triaging and prioritizing patients to multi medical emergency levels: Simulated dataset accompanied with codes

This paper provides simulated datasets for triaging and prioritizing patients across multiple emergency levels. To this end, four types of input sources are presented, namely electrocardiogram (ECG), blood pressure, oxygen saturation (SpO2), and text inputs, where the first three are signals and the last is text. To obtain the signals, the PhysioNet online library [1] is used, which is considered one of the most reliable and relevant libraries in healthcare services and bioinformatics. In particular, this library contains collections of several databases and signals, some of which relate to ECG, blood pressure, and SpO2 sensors. The simulated datasets presented in this paper are accompanied by codes. The contributions of our work, which relate to the presented dataset, can be summarized as follows. (1) The presented dataset consists of essential features extracted from the signal records. Specifically, the dataset includes vital medical features such as: QRS width, ST elevation, number of peaks, and cycle interval from the ECG signal; SpO2 level from the SpO2 signal; and the high (systolic) and low (diastolic) pressure values from the blood pressure signal. These essential features have been extracted using our machine learning algorithms. In addition, new medical features are added based on medical doctors' recommendations and are given as text inputs, e.g., chest pain, shortness of breath, palpitation, and whether the patient is at rest or not. All these features are considered significant symptoms of many diseases such as: heart attack or stroke; sleep apnea; heart failure; arrhythmia; and chronic blood pressure diseases. (2) The formulated dataset is considered in the doctor's diagnostic procedure for identifying the patient's emergency level. (3) In the PhysioNet online library [1], ECG, blood pressure, and SpO2 are represented as signals. In contrast, we use signal processing techniques to re-present the dataset as numeric values, which enables us to provide the essential features of the dataset in Excel sheets. (4) The dataset is re-organized and re-formatted into a structured, usable format. Specifically, the dataset is re-presented as tables that illustrate the patient's profile and the type of disease. (5) The presented dataset is utilized in the evaluation of medical monitoring and healthcare provisioning systems [2]. (6) Some simulation codes for feature extraction are also provided in this paper.


Value of the Data
The value of the presented dataset can be summarized as follows:
• The presented dataset contains some of the essential features of patients. In particular, these features can be considered significant indicators of many diseases such as: (a) heart attack or stroke; (b) sleep apnea; (c) heart failure; (d) arrhythmia; and (e) chronic blood pressure diseases [3,4]. In addition, from the point of view of the doctor's diagnostic procedure, these features can be considered essential indicators of other sicknesses such as:

Dataset description
The dataset presented in this paper includes ECG, blood pressure and SpO2 records and text inputs. The dataset has been collected from PhysioNet databases [1]. However, the collected dataset is simulated, re-organized, and re-structured into tables that present (1) some essential features from the signals, (2) database type, (3) signal record, (4) type of disease and (5) patients' profiles. All these details are presented in the attached appendixes with the following brief descriptions:
• Table 1 outlines the description of the ECG databases and signal records along with all the patients' profiles. Moreover, a sample of the ECG signal is presented in Fig. 1.
• Table 2 shows the description of the SpO2 database and signal records. In addition, a sample of the SpO2 signal is shown in Fig. 2.
• Table 3 presents the blood pressure signals and database description. A sample of the blood pressure signal is shown in Fig. 3.
These records have been used in our simulation for ECG, blood pressure and SpO2 signals to extract the vital features such as: QRS width; ST elevation; peaks number and cycle interval from ECG signal; SpO2 level from SpO2 signal; high and low blood pressure values from blood pressure signal.
The databases and signal records presented in [1] have been simulated and implemented using our machine learning algorithms. This allows us to extract the essential medical features that are important for healthcare research studies. The outcome of our algorithms is presented in Table 4 as numeric values.
MIT-BIH Arrhythmia database (mitdb): In most records, the upper signal is a modified limb lead II (MLII), obtained by placing the electrodes on the chest. The lower signal is usually a modified lead V1 (occasionally V2 or V5, and in one instance V4); as for the upper signal, the electrodes are also placed on the chest. This configuration is routinely used by the BIH Arrhythmia Laboratory. Normal QRS complexes are usually prominent in the upper signal.
The blood pressure and SpO2 datasets provide different values. According to medical guidelines, there are predefined ranges of values that represent the patient's condition, which is known as the "triage level". This triage level is used to evaluate the performance of a healthcare system that focuses on the patient's medical assessment, e.g., monitoring patients who have chronic heart diseases or chronic blood pressure diseases. Researchers would need to consider all possible combinations of blood pressure and SpO2 values in their simulation and implementation. This would require further analysis of the dataset records in [1], which is time and resource consuming. Hence, the dataset needs to be organized as presented in Tables 1-3 to allow researchers to use simplified numeric values in their work. This task has been carried out in our paper on the researchers' behalf. Furthermore, we provide a dataset with different ranges of values that represent different triage levels. Table 5 presents the dataset with all possible combinations of the low blood pressure value (mmHg), the high blood pressure value (mmHg) and the SpO2 value.
Moreover, we provide new heterogeneous sources, i.e., text sources. The text inputs are provided as medical questions. These questions are formulated based on doctors' recommendations and are also considered in the doctor's diagnostic procedure. The answer to each question is treated as the feature of the corresponding text source. The questions are answered manually, and all possible combinations of the different answers are considered. These questions can be summarized as follows:

1. Chest pain. The answer is either (Yes) or (No).
2. Shortness of breath. The answer is either (Yes) or (No).
3. Palpitation. The answer is either (Yes) or (No).
4. Patient at rest. The answer is either (Yes) or (No).
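Because each of the four questions has two possible answers, the complete set of text-input combinations has 2^4 = 16 rows. The following Java sketch enumerates them under the assumption that each answer is stored as a boolean (true for Yes, false for No). The class and method names are hypothetical and are not taken from the attached codes.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates every Yes/No answer combination for the four text questions. */
final class TextInputCombinations {
    // Question order follows the list above; the label strings are illustrative.
    static final String[] QUESTIONS = {
        "Chest pain", "Shortness of breath", "Palpitation", "Patient at rest"
    };

    /** Returns all 2^4 = 16 answer vectors; true means "Yes", false means "No". */
    static List<boolean[]> allAnswers() {
        int n = QUESTIONS.length;
        List<boolean[]> rows = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] row = new boolean[n];          // one answer per question
            for (int q = 0; q < n; q++) {
                row[q] = ((mask >> q) & 1) == 1;     // bit q holds the answer to question q
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Print each combination as a readable line.
        for (boolean[] row : allAnswers()) {
            StringBuilder sb = new StringBuilder();
            for (int q = 0; q < row.length; q++) {
                sb.append(QUESTIONS[q]).append('=').append(row[q] ? "Yes" : "No");
                if (q < row.length - 1) sb.append(", ");
            }
            System.out.println(sb);
        }
    }
}
```

Each returned row can then be paired with the numeric signal features to form one candidate patient record.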
According to the medical guidelines, four main ECG features, which are related to many chronic heart diseases, should be extracted. These features are as follows:
1. Rhythm, which can indicate sinus bradycardia, sinus tachycardia, atrial tachycardia, atrial flutter, and sick sinus syndrome [9].
2. QRS complex width, which indicates the activity of the bundle branches in the heart [9].
3. Peak-to-peak regularity.
4. ST elevation, which can indicate acute myocardial infarction, Prinzmetal's angina, and left ventricular aneurysm [9].
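As an illustration of the rhythm and peak-to-peak regularity features, the sketch below derives a mean heart rate and a simple regularity flag from a list of cycle intervals in milliseconds. It is not the authors' implementation. The 60 and 100 beats-per-minute limits are the conventional adult resting limits and are an assumption here, as are the tolerance parameter and all names. A rate-based label alone cannot separate, for example, sinus tachycardia from atrial tachycardia.

```java
/** Heart rate and interval regularity from cycle intervals (illustrative sketch). */
final class RhythmSketch {
    // Conventional adult resting limits; assumptions, not values from the dataset.
    static final double BRADYCARDIA_BPM = 60.0;
    static final double TACHYCARDIA_BPM = 100.0;

    /** Mean of the intervals, in ms. */
    static double meanMs(double[] intervalsMs) {
        if (intervalsMs.length == 0) {
            throw new IllegalArgumentException("at least one cycle interval is required");
        }
        double sum = 0.0;
        for (double v : intervalsMs) sum += v;
        return sum / intervalsMs.length;
    }

    /** Mean heart rate in beats per minute (60000 ms per minute). */
    static double meanHeartRateBpm(double[] intervalsMs) {
        return 60000.0 / meanMs(intervalsMs);
    }

    /** Coarse rate label based only on the mean heart rate. */
    static String classifyRate(double bpm) {
        if (bpm < BRADYCARDIA_BPM) return "bradycardia";
        if (bpm > TACHYCARDIA_BPM) return "tachycardia";
        return "normal rate";
    }

    /** True if every interval lies within toleranceFraction of the mean interval. */
    static boolean isRegular(double[] intervalsMs, double toleranceFraction) {
        double mean = meanMs(intervalsMs);
        for (double v : intervalsMs) {
            if (Math.abs(v - mean) > toleranceFraction * mean) return false;
        }
        return true;
    }
}
```

For example, intervals of 1200 ms each correspond to 50 beats per minute, which this sketch labels as bradycardia with a regular rhythm.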
In our evaluation, all the simulated ECG signals represent abnormal ECG signals. Each signal represents a patient with a certain type of heart disease. We have extracted the four main ECG features and organized them as a new ECG dataset. Researchers can directly use this dataset in their future work. Moreover, to enrich our dataset, we have added our new ECG dataset to Table 5, which makes the dataset easy to access for researchers who wish to use all the sources in one place. In addition, we have added our simulation outcomes in terms of triage levels to the table. Our outcomes have already been evaluated by medical doctors. Table 6 presents the outcomes of our simulation of 580 patients, including the formulated 11-feature dataset and a variety of ECG signal records, with the triage level provided as the output. Table 7 presents the dataset used in our paper [7] to provide different packages of healthcare services in the telemedicine environment.
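One row of such a table can be sketched in Java as follows, assuming that the 11 features are the seven signal features and the four text inputs listed earlier in this paper. The field names, the units, and the integer encoding of the triage level are hypothetical and may differ from the attached tables and codes.

```java
/** One dataset row: 11 input features plus the triage level as output (sketch). */
final class PatientRecord {
    static final int FEATURE_COUNT = 11;

    // Seven features extracted from the signals (units are assumptions).
    final double qrsWidthMs;
    final double stElevationMv;
    final int peakCount;
    final double cycleIntervalMs;
    final double spo2Percent;
    final double systolicMmHg;
    final double diastolicMmHg;
    // Four Yes/No text inputs.
    final boolean chestPain;
    final boolean shortnessOfBreath;
    final boolean palpitation;
    final boolean atRest;
    // Output label; the integer encoding is hypothetical.
    final int triageLevel;

    PatientRecord(double qrsWidthMs, double stElevationMv, int peakCount,
                  double cycleIntervalMs, double spo2Percent,
                  double systolicMmHg, double diastolicMmHg,
                  boolean chestPain, boolean shortnessOfBreath,
                  boolean palpitation, boolean atRest, int triageLevel) {
        this.qrsWidthMs = qrsWidthMs;
        this.stElevationMv = stElevationMv;
        this.peakCount = peakCount;
        this.cycleIntervalMs = cycleIntervalMs;
        this.spo2Percent = spo2Percent;
        this.systolicMmHg = systolicMmHg;
        this.diastolicMmHg = diastolicMmHg;
        this.chestPain = chestPain;
        this.shortnessOfBreath = shortnessOfBreath;
        this.palpitation = palpitation;
        this.atRest = atRest;
        this.triageLevel = triageLevel;
    }

    /** Flattens the 11 inputs into a numeric vector; Yes/No become 1/0. */
    double[] toFeatureVector() {
        return new double[] {
            qrsWidthMs, stElevationMv, peakCount, cycleIntervalMs,
            spo2Percent, systolicMmHg, diastolicMmHg,
            flag(chestPain), flag(shortnessOfBreath), flag(palpitation), flag(atRest)
        };
    }

    private static double flag(boolean answer) {
        return answer ? 1.0 : 0.0;
    }
}
```

Keeping the triage level separate from the feature vector mirrors the table layout, where the triage level is the output and the remaining columns are inputs.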

Simulation setup
The software architecture of our algorithms is implemented using the JAVA programming language. This is because JAVA has many benefits, such as: (a) real-time implementation, (b) parallel execution, (c) usage from anywhere by all interested parties, (d) the ability to run JAVA-based applications on different platforms, and (e) compatibility with different operating systems, e.g., Android, Windows, and Linux. These advantages have paved the way for the implementation of our algorithms on different hardware platforms. XAMPP has also been used. Specifically, XAMPP is a small and light Apache distribution that contains the most common web development technologies in a single package. XAMPP is free, open-source software, and its name stands for (X) cross-platform, (A) Apache HTTP Server, (M) MySQL database, (P) PHP scripting language, and (P) Perl programming language. In our paper, the dataset is re-organized and re-formatted into a structured dataset format. The dataset is represented as tables that illustrate the patient's profile and the type of disease.

Computational analytic methods and codes
To extract the dataset mentioned in Tables 4-6 , advanced processing algorithms have been applied to the signals mentioned in Tables 1-3 . To this end, a multi-function data processing algorithm is proposed and implemented [7] in order to extract the essential features from each source individually. Each signal is represented by an array.
According to the extracted dataset, each element in the signal has two values. The first value represents time and the second represents voltage. Accordingly, the array of each signal has two columns, one per value. The number of rows is defined by the number of elements in the signal, indexed from (0) to (n). The array of text features is 1 × 4 because there are four variables that represent the four text features. A real-time data processing algorithm has been utilized to extract the ECG features. The ECG signal is represented by an array of two columns (time in ms and voltage in mV), and these values are used to extract the features. The ECG signal contains many cycles, and each ECG cycle has several ECG features such as: rhythm; QRS; ST; and P-P.
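The two-column representation can be illustrated with a short Java sketch that counts the peaks in a signal and derives the cycle intervals between them. It assumes that a peak is a local maximum whose voltage exceeds a caller-supplied threshold. This detection rule, the threshold parameter, and all names are assumptions made for illustration and are not the algorithm in the attached appendix.

```java
import java.util.ArrayList;
import java.util.List;

/** Peak counting on a two-column {time ms, voltage mV} signal array (sketch). */
final class PeakCounter {
    static final int TIME = 0;      // column index of time in ms
    static final int VOLTAGE = 1;   // column index of voltage in mV

    /** Times (ms) of local maxima whose voltage exceeds thresholdMv. */
    static List<Double> peakTimesMs(double[][] signal, double thresholdMv) {
        List<Double> peaks = new ArrayList<>();
        for (int i = 1; i < signal.length - 1; i++) {
            double v = signal[i][VOLTAGE];
            // Strictly above the previous sample so a flat top is counted once.
            boolean localMax = v > signal[i - 1][VOLTAGE] && v >= signal[i + 1][VOLTAGE];
            if (localMax && v > thresholdMv) {
                peaks.add(signal[i][TIME]);
            }
        }
        return peaks;
    }

    /** Intervals (ms) between successive peaks; empty if fewer than two peaks. */
    static double[] cycleIntervalsMs(List<Double> peakTimes) {
        double[] intervals = new double[Math.max(0, peakTimes.size() - 1)];
        for (int i = 0; i < intervals.length; i++) {
            intervals[i] = peakTimes.get(i + 1) - peakTimes.get(i);
        }
        return intervals;
    }
}
```

The size of the returned list gives the number of peaks, and the interval array gives the cycle interval between consecutive peaks.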
For each cycle, the signal values vary over time around the zero line. These values are used to split the ECG cycle into upper (Up) and lower (Down) halves, after which the upper half is sorted by voltage value. The maximum point found in this way is the R point. Accordingly, the upper half of the ECG cycle can be split into left and right halves. By using certain functions to sort the values of each half (Up_Lift and Up_right) based on the (t) value and the (v) value, the locations of the Q and S points can be found. Moreover, the ST elevation can be determined from the differences of the (t) and (v) values using subtraction functions. The SpO2 and blood pressure values have been calculated as described in [6]. The proposed algorithm is presented as pseudo-code to enable researchers to implement it on any software platform. Moreover, the algorithm is implemented in Java, and the code is provided in the attached appendix.
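The steps above can be approximated by the following Java sketch for a single ECG cycle. It is a simplified reading of the description and not the code from the appendix. R is taken as the highest-voltage sample, Q and S are approximated as the lowest-voltage samples to the left and right of R, the QRS width is approximated as the time from Q to S, and the ST level is measured at an assumed offset after S relative to the first sample of the cycle. All names, the offset parameter, and the baseline choice are assumptions.

```java
/** Locates R, Q and S in one ECG cycle stored as {time ms, voltage mV} rows (sketch). */
final class QrsSketch {
    static final int TIME = 0;
    static final int VOLTAGE = 1;

    /** Index of the highest-voltage sample, taken as the R point. */
    static int rIndex(double[][] cycle) {
        int best = 0;
        for (int i = 1; i < cycle.length; i++) {
            if (cycle[i][VOLTAGE] > cycle[best][VOLTAGE]) best = i;
        }
        return best;
    }

    /** Index of the lowest-voltage sample in the half-open range [from, to). */
    static int minIndex(double[][] cycle, int from, int to) {
        int best = from;
        for (int i = from + 1; i < to; i++) {
            if (cycle[i][VOLTAGE] < cycle[best][VOLTAGE]) best = i;
        }
        return best;
    }

    /** Returns {qIndex, rIndex, sIndex}; Q and S are minima left and right of R. */
    static int[] locateQrs(double[][] cycle) {
        int r = rIndex(cycle);
        if (r == 0 || r == cycle.length - 1) {
            throw new IllegalArgumentException("R must have samples on both sides");
        }
        int q = minIndex(cycle, 0, r);                 // left half of the cycle
        int s = minIndex(cycle, r + 1, cycle.length);  // right half of the cycle
        return new int[] {q, r, s};
    }

    /** Approximate QRS width: time of S minus time of Q, in ms. */
    static double qrsWidthMs(double[][] cycle) {
        int[] p = locateQrs(cycle);
        return cycle[p[2]][TIME] - cycle[p[0]][TIME];
    }

    /** ST level: voltage at offsetMs after S minus the voltage of the first sample. */
    static double stLevelMv(double[][] cycle, double offsetMs) {
        int s = locateQrs(cycle)[2];
        double target = cycle[s][TIME] + offsetMs;
        int st = cycle.length - 1;                     // fall back to the last sample
        for (int i = s; i < cycle.length; i++) {
            if (cycle[i][TIME] >= target) { st = i; break; }
        }
        return cycle[st][VOLTAGE] - cycle[0][VOLTAGE];
    }
}
```

A positive value returned by stLevelMv suggests ST elevation relative to the assumed baseline, while the threshold for clinical significance is left to the medical guidelines.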

Ethics statement
The authors would like to point out that the primary data sources are available in a public repository, the PhysioNet online library [1]. The PhysioNet online library includes many types of raw medical datasets and gives permission to all researchers around the world to download and use them. Our main contribution lies in applying signal processing algorithms to extract the essential vital features from the raw datasets. Consequently, the essential raw dataset and the outcomes of the simulated data are organized, structured, formulated and presented as a multi-disease dataset.
Finally, the authors would like to indicate that neither human subjects nor animal experiments are involved in this paper.

CRediT Author Statement
Omar H. Salman: Responsible for methodology, conceptualization, designing the algorithms, simulation and writing the original draft. Mohammed I. Aal-Nouman: Responsible for visualization and investigation of the state-of-the-art related research works. Zahraa K. Taha: Responsible for software development, data curation, and writing the article. Muntadher Q. Alsabah: Responsible for proofreading the paper and improving the English writing of the manuscript. Yaseein S. Hussein: Responsible for reviewing the paper and providing comments regarding the paper organization and development. Zahraa Adnan: Responsible for reviewing the dataset tables, gathering related information, and providing technical comments regarding the feature extraction.