Prediction and evaluation of healthy and unhealthy status of COVID-19 patients using wearable device prototype data

The COVID-19 pandemic has caused suffering worldwide, in part because of the lack of effective medication and vaccines. The prediction analysis in this article is carried out with a dataset downloaded from an application programming interface (API) designed explicitly for quarantined COVID-19 patients. The measured data are collected from a wearable device used by quarantined healthy and unhealthy patients. The wearable device reports temperature, heart rate, SpO2 (blood oxygen saturation), and blood pressure in a timely manner, so that medical authorities can be alerted and can provide better diagnosis and treatment. The dataset contains 1085 patients with eight features, representing 490 COVID-19 infected cases and 595 normal cases. The work considers different parameters, namely heart rate, temperature, SpO2, bpm parameters, and health status. Furthermore, the real-time data collected can be used to predict the health status of patients as infected or non-infected from the measured parameters. A random forest classifier, together with linear and polynomial regression, is trained and validated on the collected COVID-19 patient data. Google Colab is an integrated development environment with built-in Python and Jupyter Notebook support; the work was tested there on cloud coding tools with scikit-learn version 0.22.1. The dataset is split in an 80%/20% ratio into training and test sets to evaluate accuracy and to avoid overfitting in the model. This analysis could help medical authorities and governmental agencies of every country respond in a timely manner and reduce the spread of the disease. They can restrict virus transmission and take the necessary steps to control, mitigate, and manage the disease.
• The measured data provide a comprehensive mapping of disease symptoms to predict the health status.
• The work benefits scientific research that uses Artificial Intelligence (AI) to tackle the hurdles in analyzing disease diagnosis.
• The diagnosis results of disease symptoms can identify the severity of a patient's condition, which helps to monitor and manage the difficulties caused by the outbreak.


Methodology and data
The data mining classification method used is the Random Forest machine learning algorithm. Generic machine learning is employed to build a diagnosis model for COVID-19 patient symptoms; the steps involve support vector machine, decision tree, random forest, and logistic regression models that process the diagnosis data to detect COVID-19 cases (Fig. 3). The random forest algorithm is a classifier built to diagnose the disease from the signs and symptoms of COVID-19 patients [8]. Fig. 1 shows the design flow of the AI project, which builds a model that gathers the available data and gives the insight needed to analyze the health status of COVID-19 patients.
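As a rough illustration of how the four model families named above can be compared, the following sketch fits each of them on the same 80/20 split. The data are synthetic placeholders generated with scikit-learn, not the study's measurements, and the hyperparameters are library defaults chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wearable data (assumption): 1085 rows,
# 4 numeric features, binary label. These are NOT the study's measurements.
X, y = make_classification(n_samples=1085, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out rows
    print(f"{name}: {scores[name]:.3f}")
```

The scores printed here say nothing about the real dataset; the sketch only shows the comparison procedure.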

Data descriptive and statistics
The dataset contains four measured values taken from a wearable device fitted with individual sensors for temperature, blood pressure, heart rate, and SpO2, as given in Table 1. The dataset includes 1085 patients with eight features; Table 3 gives the proportion of balanced data. The dataset is downloaded through the web platform in .CSV, PDF, and Excel formats and consists of 8 columns and 1085 rows [12]. The source file is a collection of data from the API link ProjectC (c19data.info) (Table 4). The proposed work has also been tested and implemented in the Anaconda tool (AEN 4.1 version) for data analysis [1].
The dataset can be read easily as a supplementary file in .CSV format (Table 4). The data are updated and stored at the API link given above. The random forest algorithm is composed of different decision trees and uses supervised learning to perform both regression and classification (Fig. 4). The algorithm is a diverse model with decision trees, nodes, and leaves to classify unlabeled data [6]. In the proposed work, the numerical data include both relevant and irrelevant attributes: Patient Id, gender, age, heart rate, temperature, SpO2 saturation, and blood pressure monitor [4]. Among these attributes, the informative data values are selected to predict the health status and probability of infection [3]. The algorithm associates a set of training records from the sample dataset of COVID-19 patients with the selected features (Tables 2-10). Data classification is carried out on the real-time measurements collected from different patients [13] to predict the output Y, commonly known as the response, from the input variables X (Table 8). In effect, the model captures the relationship between the response and the predictors [4]. The background classification is carried out with nearest-neighbors classifiers to obtain the linear model classification (Table 6). This work uses supervised learning, with inputs and correct outputs, to model the dataset over time so that the error is sufficiently minimized and the desired outcome is obtained from the diagnostic devices [10]. The model is a random forest classifier, built with scikit-learn version 0.22.1 and Python version 3.7.5 and tested on Google Colab. Multi-class classification gives the best understanding of the measured performance, with one part of the data used as a training set and another as a test set [3]. The following steps explain the performance metric and the splitting strategy, in which the raw data are converted into a sequence for analysis (Table 5).
Pseudo code for RF algorithm
1. From the total K features, select n informative features, where n << K.
2. Among the n selected features, calculate the best split point.
3. Split each node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until the required number of nodes is reached.
5. Repeat steps 1 to 4 to generate n trees, which together build and deploy the Random Forest model.
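The pseudo code above maps fairly directly onto the hyperparameters of scikit-learn's `RandomForestClassifier`. The sketch below is one possible mapping under stated assumptions: the data are synthetic, the values of `n` and `n_trees` are arbitrary, and scikit-learn draws the n candidate features at random at each split rather than picking them by informativeness.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with K = 8 features (placeholder, not the study's dataset).
K = 8
X, y = make_classification(n_samples=500, n_features=K, n_informative=4,
                           random_state=0)

n = 2          # step 1: features examined at each split (n < K), illustrative value
n_trees = 50   # step 5: number of trees in the forest, illustrative value

forest = RandomForestClassifier(
    n_estimators=n_trees,  # step 5: repeat tree building n_trees times
    max_features=n,        # steps 1-2: best split among n candidate features
    max_depth=None,        # step 4: keep splitting until the stopping rule is met
    random_state=0,
)
forest.fit(X, y)

# Each fitted tree is a binary tree of parent and daughter nodes (step 3).
node_counts = [tree.tree_.node_count for tree in forest.estimators_]
print(len(forest.estimators_), min(node_counts), max(node_counts))
```

Setting `max_depth` or `max_leaf_nodes` to a finite value would be one way to express the node limit of step 4.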

Dataset classification
The Random Forest algorithm is chosen as the best among the classifiers because it takes very little time to train and resists overfitting [2]. Its other significant feature is the level of accuracy with which it predicts the class-wise error rate (Figs. 2-5).
• The tree classification of the RF model follows the steps below.
• A binary tree is grown to classify the data.
$p_{KL}$ = proportion of class K in the left node.
$p_{KR}$ = proportion of class K in the right node.
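The splitting formula that uses these proportions is not shown in the text. As an illustration only, the sketch below assumes the common Gini criterion, in which the impurity of a node is one minus the sum of squared class proportions and a split is scored by the size-weighted impurity of its two daughter nodes. The function names and the example labels are hypothetical.

```python
from collections import Counter

def class_proportions(labels):
    """Return {class K: proportion of class K} for the labels in one node."""
    total = len(labels)
    return {k: count / total for k, count in Counter(labels).items()}

def gini(labels):
    """Gini impurity of one node: 1 - sum over K of (proportion of K)^2."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels).values())

def split_impurity(left, right):
    """Size-weighted Gini impurity of a binary split (assumed criterion)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical labels reaching the two daughter nodes (1 = infected).
left = [1, 1, 1, 0]
right = [0, 0, 0, 0]

p_KL = class_proportions(left)   # proportions in the left node
p_KR = class_proportions(right)  # proportions in the right node
print(p_KL, p_KR, split_impurity(left, right))
```

A lower value of `split_impurity` indicates a cleaner separation of the classes, so a tree builder using this criterion would prefer the candidate split with the smallest value.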
Regression
The technique used to estimate the relationship between the independent feature and the dependent features is linear regression, which can easily forecast and predict the impact of the relationship between variables [5].
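A minimal sketch of linear regression, and of the polynomial variant mentioned in the abstract, is given below with scikit-learn. The data are a hypothetical noise-free line, chosen so that the fitted coefficients are easy to check; they are not taken from the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: one independent feature x and a dependent value y
# that follows y = 2x + 1 exactly.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * x.ravel() + 1.0

# Linear regression: estimate slope and intercept, then forecast a new point.
linear = LinearRegression().fit(x, y)
slope, intercept = linear.coef_[0], linear.intercept_
forecast = linear.predict([[12.0]])[0]

# Polynomial regression of degree 2: expand the feature, then fit linearly.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)
poly_r2 = poly.score(x, y)  # coefficient of determination on the same data

print(slope, intercept, forecast, poly_r2)
```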

Algorithm procedure
The Random Forest algorithm extracts subsamples from the given dataset to form the ensemble datasets (Table 7). The dataset contains eight features, four of which are relevant attributes with a meaningful relationship.
The algorithm works in two phases: random bootstrap sampling and decision tree creation. Together, these phases classify the result for the prediction. In the first phase, the bootstrap sampling method draws bootstrap samples that yield the models $f_1(x), f_2(x), \ldots, f_M(x)$, from which $f(x)$ is obtained by model averaging. The second phase defines the criteria for splitting the trees into daughter nodes and implements a simple vote [7].
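The two phases can be sketched by hand with plain decision trees. This is an illustrative reconstruction, not the authors' implementation: the data are synthetic, M = 25 is an arbitrary choice, and averaging 0/1 votes followed by a 0.5 threshold is used here as one simple way to combine model averaging with a majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (placeholder for the patient measurements).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
rng = np.random.default_rng(0)
M = 25  # number of bootstrap models f_1 ... f_M (odd, so votes cannot tie)

# Phase 1: draw M bootstrap samples (rows sampled with replacement).
# Phase 2: grow one decision tree per bootstrap sample.
trees = []
for _ in range(M):
    rows = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[rows], y[rows]))

# Combine the models: average the M predictions, then take a majority vote.
all_votes = np.stack([tree.predict(X) for tree in trees])  # shape (M, n_samples)
f_x = all_votes.mean(axis=0)      # fraction of trees voting for class 1
y_hat = (f_x > 0.5).astype(int)   # simple majority vote

print("training accuracy of the ensemble:", (y_hat == y).mean())
```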
This work applies a mathematical and AI approach to the real-time dataset of COVID-19 patients to determine the current state of infection from SpO2 saturation, temperature, heartbeat, and blood pressure values [9]. The current health state, trained and tested on the dataset, gives a data-driven model to monitor and forecast the health condition of different patients during the pandemic [11].

Illustrative pseudo code with Python programming
    # Import libraries
    import pandas as pd
    from IPython.core.interactiveshell import InteractiveShell
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    # Load the dataset from your local drive
    DATASET_LOC = "/path/downloads/covid-19-26.csv"
    df = pd.read_csv(DATASET_LOC)
    # Correlate all numeric attributes
    correlation = df.select_dtypes("number").corr()
    print(correlation.columns)
    print(df.index)  # RangeIndex(start=0, stop=1085, step=1)
    # InteractiveShell: display the result of every statement in a cell
    InteractiveShell.ast_node_interactivity = "all"
    # Inputs X and outputs y (assumed layout: features in columns 3-6, health status in the last column)
    X = df.iloc[:, 3:7]
    y = df.iloc[:, -1]
    # Create the training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Fit the model
    dcf = RandomForestClassifier()
    dcf.fit(X_train, y_train)
    # Inference on the validation part of the dataset
    y_predict = dcf.predict(X_test)
    # Accuracy check and stats for inference
    print(accuracy_score(y_test, y_predict))
    print(dcf.score(X_train, y_train), dcf.score(X_test, y_test))
1. To implement and understand the work carried out, the following steps are defined.
2. Load the dataset in Google Colab or Visual Studio Code (Table 4).
3. Add the proposed work in the Anaconda tool (AEN 4.1 version) for data analysis (Table 5).
4. Once the dataset is loaded, the first five rows of the data frame are displayed in the software tool used. The command used to display the five rows is df.head() (Table 2). The dataset shape is obtained with a print statement, which gives Dataset Shape: (1085, 8).

Data wrangling, collection, and cleaning
Once cleaned, the raw data can support meaningful analytics and train a machine learning model. The data stored in .CSV (comma-separated values) file format contain the relevant attributes collected for the patients: age and gender, together with the symptoms and signs SpO2 saturation, heart rate, blood pressure, and temperature (Fig. 6). The data cleaning step removes missing values and unwanted characters from the data. In the code, df denotes the data frame; null values are found by automatic cleaning and summed per column before the data manipulation operations are performed.
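The cleaning step can be sketched with pandas as follows. The rows and column names are hypothetical placeholders rather than the study's headers, and dropping incomplete rows is only one of several reasonable ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names are placeholders, not the study's headers.
df = pd.DataFrame({
    "heart_rate": [72, 110, np.nan, 95],
    "temperature": [36.6, 38.9, 37.1, np.nan],
    "spo2": [98, 91, 97, 93],
    "status": ["healthy", "infected ", "healthy", "infected"],
})

null_counts = df.isnull().sum()          # number of missing values per column
df["status"] = df["status"].str.strip()  # remove stray whitespace characters
clean = df.dropna().reset_index(drop=True)  # keep only complete rows

print(null_counts)
print(clean)
```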
1. The correlation coefficient represents the relationship between two variables, here between the dependent and independent variables. The features for each attribute are shown in separate columns to define the variables in the dataset (Table 6). This step avoids false repetition of the values. In Eq. 1, a and b denote the values of the first and second variables, and m is the quantity of information.
2. When a cell contains multiple lines, the interactive shell setting displays the output of each of them. In our dataset, the relevant features from columns 3-7 are considered, with x defining the input and y the predicted output. The head shows the first five rows of x and y (Table 2). The dataset is then split into train and test sets based on these conditions. This step, known as feature transformation, maps the data into an optimal format for selecting a training set and processing the data together.
3. Splitting data into training and testing: the scikit-learn function separates the train and test data from the source dataset by specifying the test size and train size (Table 10).
4. The model is fitted with the parameters assigned to the random forest model, such as features per node, number of trees, maximum tree depth, RF predictor, and confusion matrix. These settings give the best-fit model for the random forest classifier. In this step, the algorithm is trained for evaluation to ensure proper testing. The data are split with 80% for training and 20% for testing to refine and optimize the model over time (Table 9).
5. The fitted model classifies the dataset, and its accuracy is measured as for a binary classifier as follows (Table 8).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are the counts used as metrics for the classifier. The machine learning model assigns each element the class with the highest probability; the overall accuracy is the number of elements correctly assigned to their actual class divided by the total number of elements.
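Steps 1 to 5 above can be sketched end to end as follows. Everything about the data is an assumption made for illustration: the column names, the value ranges, and the toy labelling rule are invented so that the example runs on its own, and the resulting accuracy has no bearing on the study's reported figures.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; column names and value ranges are assumptions.
rng = np.random.default_rng(0)
n_rows = 1085
df = pd.DataFrame({
    "heart_rate": rng.normal(80, 12, n_rows),
    "temperature": rng.normal(37.0, 0.8, n_rows),
    "spo2": rng.normal(95, 3, n_rows),
    "blood_pressure": rng.normal(120, 15, n_rows),
})
# Toy labelling rule, used only to give the classifier something to learn.
df["status"] = ((df["temperature"] > 37.5) | (df["spo2"] < 93)).astype(int)

# Step 1: pairwise correlation coefficients between all columns
correlation = df.corr()

# Steps 2-3: inputs x, outputs y, and an 80/20 train/test split
x = df.drop(columns="status")
y = df["status"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Step 4: fit the random forest on the training part
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
y_true = y_test.to_numpy()

# Step 5: accuracy from the four confusion counts
tp = int(((y_pred == 1) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```

The hand-computed value agrees with `accuracy_score(y_true, y_pred)`, which applies the same ratio.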
1. Model validation: The training and testing data come from the same dataset, which is split so that the model is trained on one part and the final model is tested on the other. Validation checks whether the model overfits or underfits, that is, how well it generalizes. In this work, overfitting applies to the training data, as the value obtained is too close to the outcome (Table 9).
2. A confusion matrix is used to predict the classification and its score. The matrix records the actual and predicted health status in separate columns (Tables 2-4).
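A confusion matrix of the kind described in step 2 can be produced with scikit-learn. The labels below are hypothetical and serve only to show the layout: rows hold the actual classes and columns the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical health status labels: 1 = infected, 0 = non-infected.
y_actual    = [1, 1, 1, 0, 0, 0, 0, 1]
y_predicted = [1, 1, 0, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes (label order 0, 1).
matrix = confusion_matrix(y_actual, y_predicted, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
score = accuracy_score(y_actual, y_predicted)

print(matrix)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy:", score)
```

For model validation, comparing the score on the training part with the score on the test part is a common check: a training score far above the test score suggests overfitting.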

Conclusion
This simulation study has analyzed the risk of COVID-19 disease progression using the random forest classifier algorithm. The eight features intensify the uncertainty in forecasting the disease progression, which has brought a health and financial crisis. The model achieved an accuracy score of 99.26%, with training and testing scores reported separately as required. The 1085 samples used have total volatility to spillover during diversity. The comprehensive open-source framework of Google Colab uses Anaconda AEN 4.1 version with designed efficiency to parameterize many body functions in artificial neural networks. The random forest classifier algorithm associates a set of training records from the sample dataset of COVID-19 patients with the selected features. The Jupyter Notebook software offers a real-time simulation with attributes for informative data values, which are used to predict the health status and probability of infection. The data analysis predicts the classification, with confusion matrix scores of 96.8% and 97.05%. This performance is obtained with a classification process of two classes in the form of the available data matrix. The matrix records the actual and predicted health status in separate columns.