A Predictive Analytics Framework for Blood Donor Classification

India faces numerous challenges in meeting the ever-increasing demand for human blood so as to improve health indicators across its rural and urban population. The gap between demand and supply can be closed by increasing voluntary blood donations. Hence, it becomes important to understand the attitude of the population towards blood donation. In this paper, an effort has been made to identify, in order of importance, the features that affect a person's decision to become a blood donor. This research uses extensive visualization techniques to gain insight into the characteristics of potential blood donors and then applies a classification technique to classify youth of an Indian state university as donor or non-donor. The k-nearest neighbour classification algorithm discovers the relationship between attributes of blood donors and hence predicts the outcome. The important factors that dissuade potential donors from donating blood have been extracted; these can be worked upon to meet the demand for blood and save human lives.


INTRODUCTION
Human blood is a precious constituent of life and there is no substitute for it. There has always been an acute shortage of human blood in developing nations like India, as stated by Verma et al. (2016). Abolghasemi et al. (2009) mention that the rate of blood donation in developing countries is eighteen times lower than that of developed countries. Voluntary blood donations meet a significant portion of the blood requirement in higher-income countries, as explained by Nigatu and Demissie (2015). This non-remunerated donation has been considered the best and safest by Gharebhaghian (2005) and Rahman et al. (2011).
A report on National Estimation of Blood Requirement in India has mentioned that the country faces many challenges in maintaining a sufficient supply of blood and its products. With an ever-increasing Indian population, augmented by advancement in clinical medicine, the demand for blood far outweighs its supply. This is also emphasized by Agrawal et al. (2013) and Benedict et al. (2012) in their research reports. According to the World Health Organization (WHO), every country should be able to provide safe and adequate blood to its needy population. It is also underlined by WHO that a country can meet its blood requirements if just 1 percent of its eligible population donates. According to a report by the Office of the Registrar General & Census Commissioner, India Census, approximately 50 percent of India's population is in the age group 18-65 years, which is the eligible age group for blood donation, yet India fell short by 1.9 million units of blood in the year 2017. Hence, it becomes essential for India as a nation to understand the factors that dissuade people from donating blood. With proper preparation, potential donors can be identified and registered with blood donation banks.
This article, published as an Open Access article on April 23, 2021 in the gold Open Access journal, International Journal of Big Data and Analytics in Healthcare (converted to gold Open Access January 1, 2021), is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and production in any medium, provided the author of the original work and original publication source are properly credited.
The precise aim of this study is to search for realistic and convincing features in the youths' data that could be valuable for estimating the probability of a respondent becoming a blood donor. An effort has been made to categorize candidates into the donor or non-donor class on the basis of their characteristics related to blood donation. This is the first time that real datasets related to students' views, sentiments and myths towards blood donation have been collected from students of a state university of India.
The paper is organized as follows. The literature review is described in the Background section. The Research Methodology section explains Data Collection, Data Pre-processing, Data Analysis, Feature Extraction, Machine Learning Algorithms Used for Study, Model Evaluation and Prediction, and Feature Ranking. At the end, conclusions are presented.

BACKGROUND
Various data mining techniques have been used extensively by researchers for classification, prediction, clustering, association finding and summarization tasks in the healthcare field. An unsupervised data mining technique, k-means clustering, has been used by Ramachandran et al. (2011) to categorize blood donors based on gender, age, weight and blood group, using datasets from the Indian Red Cross Society Blood Bank. A system developed by Chan-Lee and Cheng (2011) uses classification and clustering algorithms to determine the variations in blood donation behaviour amongst present donors and to predict their intent to donate, so as to increase the frequency of voluntary blood donation. The authors applied a clustering technique to create four groups and found that the best accuracy is 0.783.
In order to understand the awareness and attitude of students of Semnan University of Medical Sciences, a descriptive analytical approach has been used by Majdabadi et al. (2018). It was found that a large number of students are not aware of blood donation and possess a negative attitude towards it.
In order to help humanity and save precious lives, a web-based system for maintaining records of blood donors has been created by Khan et al. (2009). The system registers donors and keeps records of their blood groups, addresses for communication, and status of blood donation. This web-enabled system acts as an interface between donors and recipients. Similar web-enabled systems have been developed and deployed by Arif et al. (2012) and Guangpeng et al. (2009). With the widespread usage of mobile communication technologies, a few notification-based systems have also been deployed by Singh et al. (2007), Rahman et al. (2011), Samsudin et al. (2011) and Islam et al. (2013).
In a study by Hamouda et al. (2012), red blood cells have been automatically recognized and counted using image processing. A decision tree has been used to classify red blood cells with an accuracy of 97%. Real datasets collected from the Electronic Data Processing wing of a blood bank have been classified using the J48 algorithm by Sharma and Gupta (2012), which can help the blood bank in-charge make suitable decisions more quickly and accurately. In a study by Mostafa (2009), intelligent data modelling techniques have been used in Egypt to examine the impact of demographic, perceptive and psychological factors on blood donation. The author has observed that five factors are important for understanding blood donors' behaviour, viz. altruistic values, knowledge of blood donation, intent to donate blood, perceived risks of blood donation, and attitude towards blood donation. A framework of predictors for the behaviour of established Australian blood donors has been determined by Masser et al. (2009). Rajput et al. (2009) have stated that it is a great challenge to utilize data mining algorithms in the fields of healthcare and medicine. A report by the Government of India (2007) has suggested that voluntary non-remunerated regular blood donations are the safest. The strategy of the Indian government focuses on motivating non-remunerated blood donors and emphasizes maintaining good epidemiological data on the occurrence of infectious markers in the general population. One of the main hindrances to blood donation is the risk associated with the process, mainly the fear of infection, as explained by Tscheulin and Lindenmeier (2005).
Qualitative studies have been used by Ferguson and Chandler (2005) to show that blood donors describe their behaviour in terms of the Transtheoretical Model. Schlumpf et al. (2007) have carried out an extensive study based on a questionnaire filled in by approximately 8000 active donors. The likelihood of a current donor returning within the next 12 months has been explored using logistic regression.
The prediction of blood donors using age and blood group has been done by Sharma and Gupta (2012). The authors made use of the WEKA tool for data mining. A data mining system based on clustering and classification has been developed by Chan-Lee and Cheng (2011) in order to understand the behaviour of blood donors.
From all these studies, it is clear that it is very important to dispel myths by educating people and also to identify donors so that blood banks and other voluntary organizations can chalk out a strategy for organizing blood donation camps. By applying a classification technique, potential donors can be identified and the important factors that dissuade eligible donors from donating blood can be extracted. This research work is based on data from students of an Indian university, so as to understand their knowledge, attitude and psychology towards blood donation, such as their fears, myths and perceived risks in becoming a blood donor. This paper uses two machine learning algorithms, viz. k-nearest neighbour and logistic regression, to classify potential donors as donor or non-donor. It also makes use of feature extraction to find and rank the important features that play a significant role in a person becoming a blood donor. The ultimate objective is to motivate such eligible people to donate blood regularly so that many human lives can be saved.

RESEARCH METHODOLOGY
The population used for the research belongs to Generation Z (born between 1995 and 2015). These are students of undergraduate programmes of a Delhi state university, India. The system framework, showing all steps of the research performed for predictive analytics of blood donors, is shown in Figure 1.

Online Collection of Data
Data has been collected using a questionnaire developed in Google Forms by students of an undergraduate programme of a Delhi state university as part of their major project. Approximately 500 students of ten different colleges pursuing undergraduate programmes have been surveyed and their responses gathered. A convenience sampling technique has been used, and hence the selection of participants was non-random and voluntary.
The questionnaire is based on personal attributes, intention towards blood donation, myths related to blood donation, risks associated with blood donation and the perceived beliefs of probable donors before deciding on blood donation. There is a total of 20 questions, including one question describing the class of blood donor (i.e. whether the person is willing to become a blood donor or not). The responses to the other nineteen questions are on a Likert scale, in which the respondents were asked to select the choice that suited them the most. The choices are Definitely yes, Probably yes, Maybe yes, Probably no and Definitely no. The complete description of the questionnaire is given in Table 10, in the Appendix.

Data Pre-Processing
Out of 20 questions, nineteen have responses on a Likert scale. One-hot encoding, which is a dummification technique, has been used on these nineteen features. This encoding represents categorical variables as binary vectors. The categorical values are first mapped to integer values. Each integer value is then represented as a binary vector that is all 0s except at the index of the integer, which is marked as 1. This transformation is required to prepare the datasets for feeding to an appropriate classification algorithm in Python. The output attribute is "Willingness to become donor", which can take the value Yes or No. "Yes" has been transformed to "1" and "No" to "0".
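The encoding described above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' code: the names `one_hot` and `encode_target` and the sample answers are hypothetical, and only the five Likert options and the Yes/No mapping come from the text.

```python
# Likert options as listed in the questionnaire, kept in a fixed order.
LIKERT_OPTIONS = ["Definitely yes", "Probably yes", "Maybe yes",
                  "Probably no", "Definitely no"]

# Step 1: map each categorical value to an integer index.
OPTION_TO_INDEX = {option: i for i, option in enumerate(LIKERT_OPTIONS)}


def one_hot(response):
    """Step 2: return a binary vector with a single 1 at the response's index."""
    vector = [0] * len(LIKERT_OPTIONS)
    vector[OPTION_TO_INDEX[response]] = 1
    return vector


def encode_target(willingness):
    """Map the output attribute: "Yes" -> 1, "No" -> 0."""
    return {"Yes": 1, "No": 0}[willingness]


# Hypothetical respondent: answers to two questions plus the class label.
answers = ["Probably yes", "Definitely no"]
features = [bit for answer in answers for bit in one_hot(answer)]
label = encode_target("Yes")
print(features, label)
```

Each question therefore contributes five binary columns, so nineteen Likert questions would expand to 95 input columns under this scheme.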

Data Summarization
A total of 448 participants responded, and their frequency distributions on the basis of willingness to donate blood, blood group and religion are shown in Table 1, Table 2 and Table 3 respectively.

Data Visualization
Various visualization and numerical calculation libraries of Python have been used to understand the attitude of respondents towards blood donation. These libraries are seaborn and matplotlib.pyplot for generating bar plots, and numpy for numerical calculations such as grouping the participants on the basis of their responses on the Likert scale. A bar plot showing the percentage of respondents choosing each of the five options for a particular question has been generated. There was a total of 17 such questions to understand the characteristics of participants, and hence 17 such graphs are generated, as shown in Table 4.
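The percentages behind each bar plot can be computed as in the following sketch. It uses only the standard library and leaves out the plotting call itself; the function name `percentage_per_option` and the sample responses are assumptions made for illustration, not data from the survey.

```python
from collections import Counter

LIKERT_OPTIONS = ["Definitely yes", "Probably yes", "Maybe yes",
                  "Probably no", "Definitely no"]


def percentage_per_option(responses):
    """Percentage of respondents choosing each Likert option, in scale order."""
    counts = Counter(responses)
    total = len(responses)
    # Options nobody chose still appear, with a percentage of 0.
    return {option: 100.0 * counts[option] / total for option in LIKERT_OPTIONS}


# Hypothetical responses to a single question.
responses = ["Definitely yes", "Probably yes", "Probably yes", "Definitely no"]
percentages = percentage_per_option(responses)
for option, pct in percentages.items():
    print(f"{option:15s} {pct:5.1f}%")
```

The resulting option-to-percentage mapping is the kind of input that a bar-plot function such as `seaborn.barplot` or `matplotlib.pyplot.bar` would take, one plot per question.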

Feature Extraction
The bar plots have been interpreted so as to select only informative features as input to the machine learning algorithms for classification. A few of the bar plots do not show much variation, and the distribution of respondents across the five categories on the Likert scale is of a similar nature, so these features were removed before applying the machine learning algorithms. The removed features are: donating blood is purely a personal choice; donating blood would renew the blood of the donor; and donating blood would avoid blood shortage.

Machine Learning Algorithms Used for Study
Two popular machine learning algorithms, viz. a lazy learner classifier (K-Nearest Neighbor) and logistic regression, have been used in Spyder, which is a powerful scientific environment written in Python. K-Nearest Neighbor (K-NN) is a classification algorithm that first memorizes the training data and, when presented with a testing record, looks for similarity to the memorized training records. The class held by the majority of the k training records most similar to the test case is assigned to the test tuple. As explained by Peng et al. (2009), the benefit of using a lazy learning algorithm is that the target function is approximated locally for each query posted to the classifier. This leads to solving many queries in an organized and easy manner. The K-NN classifier applies an incremental approach wherein the input comprises a set of attribute-value pairs, as described by Witten Eibe (2011). There is one attribute that corresponds to the class of the tuple, and the other attributes are used as predictors.
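A minimal sketch of the K-NN idea is given below, assuming Euclidean distance as the similarity measure and a simple majority vote; the paper does not state which distance was used, and the function names and toy records are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training records."""
    # Lazy learning: no model is built; all the work happens at query time.
    order = sorted(range(len(train_X)),
                   key=lambda i: euclidean(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote, the class of the nearest tied neighbour is returned.
    return votes.most_common(1)[0][0]


# Hypothetical encoded records: 1 = donor, 0 = non-donor.
train_X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
train_y = [1, 1, 1, 0, 0]
print(knn_predict(train_X, train_y, [1, 0, 0], k=3))
```

With k=1 this reduces to assigning the class of the single most similar training record.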
Logistic regression is a very popular machine learning technique based on statistics. It is a type of regression analysis that applies when the dependent variable is dichotomous (binary). Logistic regression is a predictive analysis technique that uses a logistic function. As explained by Hosmer and Lemeshow (2000), it is used to describe the relationship between one binary dependent variable and one or more ratio-scaled, nominal, interval or ordinal independent variables.
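To illustrate how the logistic function turns a weighted sum of predictors into a class probability, here is a small sketch fitted by batch gradient descent. The fitting procedure, learning rate, number of epochs and toy data are all assumptions made for the example; the paper does not describe how its logistic regression model was fitted.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            # Difference between predicted probability and the true label.
            error = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


def predict(weights, bias, row):
    """Return 1 (donor) if the estimated probability is at least 0.5, else 0."""
    return int(sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= 0.5)


# Hypothetical one-feature data: the feature is 1 for a favourable response.
X = [[0], [0], [0], [1], [1], [1]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(X, y)
print(predict(weights, bias, [1]), predict(weights, bias, [0]))
```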
For experimentation, K-NN and logistic regression have been used for classifying the participants as blood donor or non-donor. The responses have been split into two parts, viz. training and testing. The training of the machine has been done with 70% of the records, and the remaining 30% have been used for testing.
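The 70/30 split can be sketched as follows. The shuffling, the fixed seed and the choice to round the test share up are assumptions made for illustration; only the 70%/30% proportions and the 448 responses come from the text.

```python
import math
import random


def split_70_30(records, seed=0):
    """Shuffle the records and return (training, testing) in a 70/30 ratio."""
    shuffled = list(records)                  # copy; the caller's list is untouched
    random.Random(seed).shuffle(shuffled)     # seeded so the split is repeatable
    n_test = math.ceil(0.3 * len(shuffled))   # test share rounded up (assumption)
    return shuffled[n_test:], shuffled[:n_test]


# Record identifiers standing in for the 448 survey responses.
records = list(range(448))
train, test = split_70_30(records)
print(len(train), len(test))
```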

Model Evaluation and Prediction
The machine learning algorithm for classification predicts the output class of a student as either Blood donor (Positive class) or Non-donor (Negative class). There are only four categories, given below, that any student X could end up with:
• True positive (TP): Prediction is Donor and X is actually a Donor.
• True negative (TN): Prediction is Non-Donor and X is actually a Non-Donor.
• False positive (FP): Prediction is Donor but X is actually a Non-Donor, so it is a false alarm.
• False negative (FN): Prediction is Non-Donor but X is actually a Donor, again a wrong prediction.
These four cases of the confusion matrix are shown in Table 5. Using the confusion matrix, a number of performance metrics have been calculated in Python. These metrics are explained below.
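The four cases listed above can be counted from paired actual and predicted labels as in this sketch. The function name `confusion_counts` and the label lists are hypothetical and are not taken from the study's tables.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for binary labels; `positive` marks the donor class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted donor, actually donor
        elif p != positive and a != positive:
            tn += 1      # predicted non-donor, actually non-donor
        elif p == positive:
            fp += 1      # predicted donor, actually non-donor (false alarm)
        else:
            fn += 1      # predicted non-donor, actually donor
    return tp, tn, fp, fn


# Hypothetical labels: 1 = donor, 0 = non-donor.
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1]
print(confusion_counts(actual, predicted))
```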

Accuracy
It is the ratio of the correctly classified records to the total number of records:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision
Precision is the ratio of the positive records correctly predicted by the algorithm to all records predicted as positive, including those wrongly labelled as positive:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (Sensitivity)
Recall is the ratio of the positive records correctly predicted by the algorithm to all records that are actually positive:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-Score (F-Measure)
F1-score takes into account both precision and recall. It is the harmonic mean of precision and recall, and it is a good indicator of classifier performance when the class distribution is uneven:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Specificity
Specificity is the ratio of the negative records correctly predicted by the algorithm to all records that are actually negative:

$$\text{Specificity} = \frac{TN}{TN + FP}$$
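The five metrics defined in this section follow directly from the four confusion-matrix counts, as in the sketch below. The counts passed in are illustrative values chosen for the example, not the entries of the study's confusion matrices, and the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five metrics from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. each class occurs
    and at least one record is predicted positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,                       # also called sensitivity
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Illustrative counts only.
metrics = classification_metrics(tp=4, tn=2, fp=1, fn=1)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```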

K-NN Algorithm
The value of k has been varied from 1 to 10 in order to find the value that maximizes the number of correctly classified records. It is found that the best classification accuracy is obtained when k=8. The corresponding confusion matrix is shown in Table 6. Using this confusion matrix, the performance metrics mentioned above have been calculated and are shown in Table 8.
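The search over k described above might look like the following sketch, which tries each k from 1 to 10 and keeps the one with the highest accuracy on the test records. The compact K-NN helper, the Euclidean distance, the tie-breaking rule and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training records closest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def best_k(train_X, train_y, test_X, test_y, k_values=range(1, 11)):
    """Return (k, accuracy) with the highest test accuracy; ties keep the smaller k."""
    best = (None, -1.0)
    for k in k_values:
        correct = sum(knn_predict(train_X, train_y, x, k) == y
                      for x, y in zip(test_X, test_y))
        accuracy = correct / len(test_y)
        if accuracy > best[1]:
            best = (k, accuracy)
    return best


# Hypothetical one-feature data with one noisy training record at 4.
train_X = [[0], [1], [2], [3], [4], [10], [11], [12], [13], [14]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
test_X = [[4.2], [12.5], [1.5]]
test_y = [0, 1, 0]
print(best_k(train_X, train_y, test_X, test_y))
```

In this toy data the noisy record misleads small values of k, which shows why the accuracy can peak at an intermediate k rather than at k=1.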

Logistic Regression
By applying this machine learning algorithm, the predicted class versus actual class data has been obtained and is shown as a confusion matrix in Table 7. Using this confusion matrix, the performance metrics mentioned above have been calculated and are shown in Table 8. As is evident from Table 8, the K-NN algorithm with k=8 has outperformed logistic regression on all the performance metrics.

Feature Ranking
The task of determining the important (independent) features that greatly affect the decision of a student to be a blood donor has also been done. For this purpose, Recursive Feature Elimination [...]
[...] or non-donors. The K-NN algorithm was executed by varying the value of k from 1 to 10, and the best classification accuracy was obtained with k equal to 8. After comparing the various performance metrics, it was found that the K-NN classifier demonstrated convincing results, with an accuracy of 0.7027, precision of 0.7209, sensitivity of 0.7949, F1-score of 0.7561 and specificity of 0.5789. This study has provided the ability to identify important factors that influence the decision of youth to donate blood. These factors are the time taken by the blood donation process, the companionship of a friend or family member while donating blood, the availability of an intimate place for blood donation, and the feeling of being rewarded. With the knowledge of these important determinants, blood donation services can come up with newer and more efficient strategies that would increase the number of donors. This identification of probable donors would help blood banks and voluntary organizations plan in advance for the organization of blood donation camps. Also, the participants predicted as non-donors can be motivated, and the factors that restrict them from donating can be worked upon. Hence, the significant gap between demand and availability of blood in India can be reduced by better management and collection of blood.

Figure 1. System framework for research