Data Mining Techniques in the Diagnosis of Tuberculosis

Data mining is the knowledge discovery process which helps in extracting interesting patterns from large amounts of data. With the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection, and medical and scientific discovery (J. Han & M. Kamber, 2006).


Selection of data mining / knowledge discovery in database
The third step is data mining, which extracts the patterns and models hidden in data. This is an essential process where intelligent methods are applied in order to extract data patterns. In this step we first select the data mining task and then the data mining method. The major classes of data mining methods are predictive modelling, such as classification and regression; segmentation (clustering); and association rules, which are explained in detail in the next section.

Interpretation and evaluation of results
The fourth step is to interpret (post-process) the discovered knowledge, especially interpretation in terms of description and prediction, which are the two primary goals of a discovery system in practice. Experiments show that patterns or models discovered from data are not always of interest or direct use, and the KDD process is necessarily iterative, with judgement of the discovered knowledge. One standard way to evaluate induced rules is to divide the data into two sets, training on the first set and testing on the second. One can repeat this process a number of times with different splits and then average the results to estimate the rules' performance.
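The repeated split-and-average evaluation just described can be sketched in Python. The function names, the trivial majority-class learner and the accuracy measure in the usage below are illustrative assumptions, not from the chapter:

```python
import random

def repeated_holdout(records, labels, induce, evaluate,
                     n_repeats=5, test_frac=0.3, seed=0):
    """Estimate rule performance by averaging over several
    random train/test splits of the data."""
    rng = random.Random(seed)
    idx = list(range(len(records)))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = int(len(idx) * (1 - test_frac))
        train, test = idx[:cut], idx[cut:]
        # train (induce rules) on the first set, test on the second
        model = induce([records[i] for i in train],
                       [labels[i] for i in train])
        scores.append(evaluate(model,
                               [records[i] for i in test],
                               [labels[i] for i in test]))
    # average the per-split results
    return sum(scores) / len(scores)
```

Any rule-induction and evaluation procedure can be plugged in as `induce` and `evaluate`; the average smooths out the luck of any single split.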

Using discovered knowledge
The final step is to put the discovered knowledge to practical use. Putting the results to practical use is certainly the ultimate goal of knowledge discovery. The information obtained can later be used to explain current or historical phenomena, predict the future, and help decision-makers make policy from the existing facts (Ho, n.d.).

Data mining tasks and functionalities
Data mining functionalities fall into two categories: descriptive data mining and predictive data mining. Descriptive methods find human-interpretable patterns that describe the data. Predictive methods perform inference on the current data in order to make predictions (J. Han & M. Kamber, 2006).
The predictive tasks of data mining are:


Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbour, naive Bayesian classification and neural networks.


Regression - Attempts to find a function which models the data with the least error.

The descriptive tasks of data mining are:

Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as "market basket analysis".

Clustering - Is like classification, but the groups are not predefined, so the algorithm will try to group similar items together.
Data mining finds applications in various fields and draws ideas from many disciplines, such as machine learning/artificial intelligence, pattern recognition, statistics and database systems. In recent years, data mining has been widely used in genetics, medicine and bioinformatics, with applications to biomedical data facilitated by domain ontologies and to mining clinical trial data; this is also called medical data mining.
Different types of medical data are now available on the web, where data mining algorithms and applications can be applied, helping in easy diagnosis. Efficient and scalable algorithms can be implemented in both sequential and parallel mode, thus improving performance.

Medical data mining
In recent years, data mining has been widely used in the areas of genetics and medicine; this is called medical data mining. In the past two decades we have witnessed revolutionary changes in biomedical research and biotechnology. There is an explosive growth of biomedical data, ranging from data collected in pharmaceutical studies and cancer therapy investigations to data identified in genomics and proteomics research. The rapid progress of biotechnology and bio-data analysis methods has led to the emergence and fast growth of a promising new field: bioinformatics. On the other hand, recent progress in data mining research has led to the development of numerous efficient and scalable methods for mining interesting patterns and knowledge in large databases, ranging from efficient classification methods to clustering, outlier analysis, frequent, sequential and structured pattern analysis methods, and visualization and spatial/temporal data analysis tools. The question becomes how to bridge the two fields, data mining and bioinformatics, for successful mining of biomedical data. In particular, we should analyze how data mining may help efficient and effective biomedical data analysis and outline some research problems that may motivate the further development of powerful data mining tools for biological or medical data analysis.
Data mining is a process that involves aggregating raw data stored in a database and analyzing them to identify trends, patterns and anomalies. Medical data mining is an active research area under data mining, since medical databases have accumulated large quantities of information about patients and their clinical conditions. Relationships and patterns hidden in these data can provide new medical knowledge, as has been proved in a number of medical data mining applications. A doctor quickly swung into action after a renowned pharmaceutical company in the USA announced in 2001 that it was withdrawing a cholesterol-lowering drug following the deaths of more than 30 people. Using his medical records database, his staff was able to identify all patients taking the cholesterol-lowering drug and notify them within 24 hours of the announcement. What the doctor did is technically known as data mining. Very few doctors, however, were able to act on the situation, because they did not have accessible raw data in electronic format.
Not only does disciplined storage of medical data help physicians and healthcare institutions, it also helps pharmaceutical companies mine the data to see trends in diseases. It also helps prioritize product development and clinical trials based on the accurate demands visible from the mined data.
Various data mining tasks can be applied to different disease data sets. This helps even the doctor to identify hidden associations between various symptoms. Research has been carried out on gene data, proteomic data and disease-related attributes, covering even risk factors. Prediction of diseases has also been done on scanned images, leading to medical imaging, which is the fastest growing area. Much research has been carried out on breast cancer, liver diseases, other types of cancer and heart-related diseases, but there are very few articles related to tuberculosis.

Tuberculosis
Tuberculosis (TB) is a common and often deadly infectious disease caused by mycobacteria; in humans it is mainly Mycobacterium tuberculosis. It usually spreads through the air and attacks people with weakened immune systems, such as patients with Human Immunodeficiency Virus (HIV). It is a disease which can affect virtually all organs, not sparing even relatively inaccessible sites. The microorganisms usually enter the body by inhalation through the lungs. They spread from the initial location in the lungs to other parts of the body via the blood stream. They present a diagnostic dilemma even for physicians with a great deal of experience of this disease. In short, tuberculosis is a contagious bacterial disease caused by mycobacteria which usually affects the lungs and is often co-infected with HIV/AIDS.
It is a great problem for most developing countries because of limited diagnosis and treatment opportunities. Tuberculosis has the highest mortality level among diseases caused by a single type of microorganism. Thus, tuberculosis is a great health concern all over the world, and in India as well (wikipedia.org).
Symptoms of TB depend on where in the body the TB bacteria are growing. TB bacteria usually grow in the lungs. TB in the lungs may cause symptoms such as a bad cough that lasts three weeks or longer, pain in the chest, and coughing up blood or sputum. Other symptoms of active TB disease are weakness or fatigue, weight loss, loss of appetite, chills, fever and sweating at night.
Although common and deadly in the third world, tuberculosis was almost non-existent in the developed world, but it has been making a recent resurgence. Certain drug-resistant strains are emerging, and people with immune suppression, such as those with AIDS or in poor health, are becoming carriers.

Data set description
The medical dataset we are using includes 700 real records of patients suffering from TB, obtained from a city hospital. The entire dataset is put in one file with many records, each corresponding to the most relevant information about one patient. Initial queries by the doctor as symptoms, together with some required test details of the patients, have been considered as the main attributes. In total there are 12 attributes (symptoms), and the last attribute is treated as the class in the case of associative classification. The symptoms of each patient, namely age, chronic cough (weeks), loss of weight, intermittent fever (days), night sweats, Sputum, Bloodcough, chestpain, HIV, radiographic findings, wheezing and TBtype, are considered as attributes.
Table 1 shows the names of the 12 attributes considered, along with their data types (DT). Type N indicates numerical and C categorical.

Association Rule Mining
Association Rule Mining (ARM) is an important problem in the rapidly growing field of data mining and knowledge discovery in databases (KDD). The task of association rule mining is to mine a set of highly correlated attributes/features shared among a large number of records in a given database. For example, consider the sales database of a bookstore, where the records represent customers and the attributes represent books. The mined patterns are the sets of books most frequently bought together by customers. An example could be that 60% of the people who buy Design and Analysis of Algorithms also buy Data Structure. The store can use this knowledge for promotions, shelf placement, etc. There are many application areas for association rule mining techniques, including catalog design, store layout, customer segmentation, telecommunication alarm diagnosis and so on.

Definition of association rule
Here we give the classical definition of association rules. Let {t1, t2, ..., tn} be a set of transactions and let I = {I1, I2, ..., Im} be a set of items. An association rule is an implication of the form X → Y, where X and Y are disjoint subsets of I (X ∩ Y = ∅). X is called the antecedent and Y the consequent of the rule. In general, a set of items such as the antecedent or consequent of a rule is called an itemset. Each itemset has an associated measure of statistical significance called support: support(X) is the fraction of the transactions in the database containing X. The rule has a measure of strength called confidence, defined as the ratio support(X ∪ Y) / support(X) (J. Han & M. Kamber, 2006).
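The support and confidence definitions above can be illustrated with a short Python sketch; the symptom names in the usage are hypothetical, not taken from the TB dataset:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# hypothetical symptom transactions
T = [frozenset(t) for t in [{"cough", "fever"}, {"cough"},
                            {"cough", "fever"}, {"fever"}]]
# support({cough}) = 3/4; confidence(cough -> fever) = (2/4)/(3/4) = 2/3
```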
Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold and confidence ≥ minconf threshold.
Mining association rules is a two-step approach:
- Frequent itemset generation: generate all itemsets whose support ≥ minsup.
- Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
The Apriori algorithm employs two actions, a join step and a prune step, to find frequent itemsets.
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets.
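A minimal Apriori sketch with explicit join and prune steps can look as follows; it assumes transactions are given as sets of items, and the item names in the test data are illustrative:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset mining using the Apriori principle."""
    n = len(transactions)

    def sup(s):
        return sum(1 for t in transactions if s <= t) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    all_frequent = set(frequent)
    k = 1
    while frequent:
        # join step: combine frequent k-itemsets into candidate (k+1)-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune step: drop candidates having an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        frequent = {c for c in candidates if sup(c) >= minsup}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

The prune step is exactly the anti-monotone property at work: a candidate whose subset is infrequent can never itself be frequent, so its support need not even be counted.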

Rule Generation
Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them, where strong association rules satisfy both minimum support and minimum confidence. The confidence of a rule s → (l − s) is calculated as support_count(l) / support_count(s). Based on this, association rules can be generated as follows:


- For each frequent itemset l, generate all nonempty subsets of l.

- For every nonempty subset s of l, output the rule s → (l − s) if support_count(l) / support_count(s) is greater than or equal to min_conf, where min_conf is the minimum confidence threshold.
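The two steps above can be sketched directly in Python, assuming a precomputed table of support counts for all frequent itemsets; the symptom names in the test data are illustrative:

```python
from itertools import chain, combinations

def generate_rules(frequent_itemsets, support_count, min_conf):
    """For each frequent itemset l, emit s -> (l - s) whenever
    support_count(l) / support_count(s) >= min_conf."""
    rules = []
    for l in frequent_itemsets:
        if len(l) < 2:
            continue
        # all nonempty proper subsets of l
        subsets = chain.from_iterable(combinations(l, r)
                                      for r in range(1, len(l)))
        for s in map(frozenset, subsets):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((s, l - s, conf))
    return rules
```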

Tuberculosis association rules
Tuberculosis association rules can be generated by applying the ARM data mining technique with the following steps:
- Pre-process the dataset by discretizing and normalizing it.
- Generate rules by applying Apriori on the preprocessed range data.

Pre-processing
Incomplete, noisy and inconsistent data are common in real-world databases. Hence it is necessary to preprocess such data before using them. The most common topics under data preprocessing are data cleaning, data integration, data transformation, data reduction, data discretization and the automatic generation of concept hierarchies.
Discretization and Normalization are the two data transformation procedures that help in representing the data and their relationships precisely in a tabular format that makes the database easy to understand and operationally efficient.This also reduces data redundancy and enhances performance.
The above TB attributes are normalized and discretized to a suitable binary format. A categorical data field has a value selected from an available list of values; such data items can be normalized by allocating a unique column number to each possible value. Numerical data fields take values within some range defined by minimum and maximum limits; in such cases we can divide the given range into a number of sub-ranges and allocate a unique column number to each sub-range.
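The column-number allocation just described can be sketched as follows; the function names are illustrative, and the test uses the Age cut-off (<25 / >=25) and HIV values mentioned with the schema table:

```python
def build_schema(ranges, categories):
    """Assign a unique column number to each numeric sub-range
    and to each categorical value."""
    schema, col = {}, 1
    for attr, cuts in ranges.items():        # numeric: list of (lo, hi)
        for lo, hi in cuts:
            schema[(attr, lo, hi)] = col
            col += 1
    for attr, values in categories.items():  # categorical: one column per value
        for v in values:
            schema[(attr, v)] = col
            col += 1
    return schema

def normalize_record(record, ranges, schema):
    """Map one patient record to its set of column numbers."""
    cols = []
    for attr, value in record.items():
        if attr in ranges:
            for lo, hi in ranges[attr]:
                if lo <= value < hi:
                    cols.append(schema[(attr, lo, hi)])
        else:
            cols.append(schema[(attr, value)])
    return sorted(cols)
```

Each normalized record is then a set of integers, which is exactly the binary transaction format Apriori expects.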
Here we give a small example of five patients' medical records with five attributes. Table 2 shows the original data. Not all of the generated rules may be interesting to users; only a few rules, like those explained above, give a very good description, and some hidden relationships may also be found.
We can see from the following output that the left side (antecedent) and right side (consequent) of the rules keep interchanging repeatedly; this can be pruned by applying some conditions on both the antecedent and the consequent of a rule.

Associative classification
Association Rule Mining (ARM), as explained in Section 3, is one of the most popular approaches in data mining and, if used in the medical domain, has great potential to improve disease prediction. However, it results in a large number of descriptive rules. ARM can therefore be integrated within the classification task to produce a single system called associative classification (AC), which is a better alternative for predictive analytics.
Classification based on association rules has been proved to be very competitive (Liu et al., 1998). The general idea is to generate a set of association rules with a fixed consequent (involving the class attribute) and then use subsets of these rules to classify new examples. This approach has the advantage of searching a larger portion of the rule version space, since no search heuristics are employed, in contrast to decision trees and traditional classification rule induction. The extra search is done in a controlled manner, enabled by the good computational behaviour of association rule discovery algorithms.
Another advantage is that the produced rich rule set can be used in a variety of ways without relearning, which can be used to improve the classification accuracy ( Jorge and Azevedo, 2005).
The procedure of associative classification rule mining, as shown in Figure 6, is not much different from that of general association rule mining. A typical associative classification system is constructed in two stages: 1) discovering all the event association rules (in which the frequency of occurrence is significant according to some tests); 2) generating classification rules from the association patterns to build a classifier. In the first stage, the learning target is to discover the association rules inherent in a database, but generating frequent itemsets may prove to be quite expensive. The number of rules generated by association rule discovery is quite large, so rule pruning is required; moreover, to avoid the problem of overfitting, a proper rule pruning method must be employed. Ranking of the rules is also important: when a test instance has more than one potentially applicable rule, rule ranking is necessary to prefer one rule over the others. In the second stage, the task is to select a set of the relevant discovered association rules to construct a classifier given the predicting attribute.
For example, given a rule X -> Y, AC will only consider rules having a target class as the consequent. This means the new integration focuses on a subset of association rules whose right-hand sides are restricted to the classification class attribute. This type of rule is called a Class Association Rule (CAR). While a normal association rule allows more than one item as its consequent and any item from X can be the consequent, the CARs generated in AC limit the consequent to one fixed target class per rule, and items from X are forbidden to appear as the class label. In order to perform AC, a classifier first mines CARs from the given transactions and later selects the most predictive rules to form a classifier (Chien and Chen, 2010). AC generates CARs using a frequent itemset generation technique for mining rules. Despite its benefits, AC does pose challenges to classification performance. The most important issues are the approach to mining appropriate CARs for classification and the pruning technology, since AC generates a large number of frequent itemsets; a prominent pitfall is its incapability of handling numerical data. Generally, AC consists of three main phases: rule generation, rule pruning and classification (Do et al., 2009; Tang and Liao, 2007). The performance, however, might differ depending on the algorithm employed in any of these three phases.

CBA
The first AC algorithm, namely CBA, was introduced by Liu et al. (1998). The algorithm is based on the Apriori association rule algorithm for generating CARs. These rules are later pruned, and only the single most suitable rule is used to classify the test set. Essentially, the CBA algorithm performs three tasks: first, it mines all CARs; second, it produces a classifier from the CARs; and finally, it mines normal association rules.

Generation of CARs
In CBA, the classification association rules (CARs) are found iteratively in an Apriori-like fashion. At first, frequent 1-rule itemsets are generated and pruned. Iterating this, other frequent rule itemsets are also found. They are then pruned to obtain the complete set of classification association rules.

Building classifier (Ranking and Pruning Rules)
To prune the rules, CBA uses the pessimistic-error-based pruning method of C4.5. Rule ranking is defined as follows: given two rules r_i and r_j, r_i > r_j (i.e., r_i precedes r_j, or r_i has higher precedence than r_j) if one of the following holds:
1. The confidence of r_i is greater than that of r_j.
2. Their confidences are the same, but the support of r_i is greater than that of r_j.
3. Both the confidences and supports of r_i and r_j are the same, but r_i is generated before r_j.
After rule ranking, each training instance is covered by the rule with the highest precedence among the rules that can cover it. Every rule correctly classifies at least one training instance. Rules that do not cover any training instance are removed. Training instances that do not fall into any of the observed classes are assigned to a default class.
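The three-condition precedence test above maps naturally onto a Python sort key. This is a sketch only; representing each rule as a (confidence, support, generation order) tuple is an assumed encoding, not the chapter's:

```python
def cba_rank_key(rule):
    """CBA precedence: higher confidence first, then higher support,
    then earlier generation order."""
    conf, sup, gen_order = rule
    return (-conf, -sup, gen_order)

# three hypothetical rules as (confidence, support, generation order)
rules = [(0.90, 0.10, 2), (0.95, 0.05, 3), (0.90, 0.20, 1)]
ranked = sorted(rules, key=cba_rank_key)
```

Negating confidence and support makes Python's ascending sort place the highest-precedence rule first, with generation order as the final tie-breaker.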
The multiple capabilities of CBA solve a number of problems in traditional classification systems. Since traditional classifiers only generate a small subset of the rules that exist in the data to form a classifier, the discovered rules may not be interesting. Also, to generate more rules we would need the classification system to load the entire database into main memory. Because CBA generates all rules, the algorithm is more successful in finding interesting rules, and the system also allows the data to reside on disk. However, in CBA the rule generation process might degrade the accuracy of the classifier, due to the randomness in selecting the most suitable rule to form the classifier model. CBA also inherits Apriori's multiple-scan behaviour and generates a large number of rules, which is costly in terms of computational time.

CMAR
CMAR was later introduced as an extension to CBA (Li et al., 2001). The CMAR algorithm implements the FP-growth algorithm instead of Apriori to generate its frequent itemsets.
CPAR

The Local Gain Threshold (LGT) is given by the formula LGT = bestGain × GAIN_SIMILARITY_RATIO, where GAIN_SIMILARITY_RATIO is a constant whose value is 0.99.
CPAR takes as input a (space-separated) binary-valued data set R and produces a set of CARs. The resulting classifier comprises a linked list of rules ordered according to Laplace accuracy. CPAR also uses a dynamic programming approach to avoid repeated calculations in rule generation, which makes it more economical. More importantly, CPAR selects the best k rules for prediction.
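A sketch of the Laplace-accuracy ranking and best-k prediction just mentioned; the function names are illustrative, and the Laplace formula (n_correct + 1) / (n_total + n_classes) follows the standard CPAR description rather than this chapter:

```python
def laplace_accuracy(n_correct, n_total, n_classes):
    """Laplace expected accuracy used to rank CPAR rules:
    (n_correct + 1) / (n_total + n_classes)."""
    return (n_correct + 1) / (n_total + n_classes)

def predict_best_k(rules_per_class, k=5):
    """CPAR-style prediction: average the Laplace accuracy of the best
    k rules of each class and pick the class with the highest average."""
    def avg_top(scores):
        top = sorted(scores, reverse=True)[:k]
        return sum(top) / len(top)
    return max(rules_per_class, key=lambda c: avg_top(rules_per_class[c]))
```

Averaging over the best k rules, rather than trusting a single highest-ranked rule as CBA does, makes the prediction less sensitive to one accidentally strong rule.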

Predictive accuracy and rules of associative classifiers
The difference between ARM and AC with reference to results is that the former generates only a large number of descriptive rules, whereas the latter generates fewer rules along with a performance measure, namely accuracy.
CBA generates around 81 rules; once pruned, only two rules remain, with an accuracy of 81.14%. Comparing ARM and AC rules, it can be seen that AC rules are fewer and better in description, and CPAR provides better rules than all the other algorithms.

Summary
In this chapter, two data mining techniques which help in the diagnosis of tuberculosis have been discussed. Medical databases have accumulated large quantities of information about patients and their clinical conditions, and the digital era has made this information available in abundance. Data mining is a knowledge discovery process that helps in extracting relationships and patterns hidden in these data and can provide new medical knowledge to doctors for their treatment procedures.
Association Rule Mining (ARM) is one of the most popular approaches in data mining and, if used in the medical domain, has great potential to improve disease prediction. It shows the doctor hidden disease symptoms associated with one another. There are many algorithms associated with ARM, and the most popular is Apriori. It works in two phases: the first is frequent itemset generation, in which all itemsets in a database whose support is above some minimum specified threshold are generated. The second is rule generation, which generates, from the frequent sets, association rules of the form X -> Y based on some minimum confidence: whenever X appears, there is a chance that Y also appears along with it, with at least the minimum confidence. These concepts are applied to the TB dataset, revealing important associations between the symptoms. However, this method results in a large number of repetitive rules.
Associative classification (AC) is another data mining approach, one that integrates association rule mining and classification. It uses an association rule mining algorithm, such as Apriori or FP-growth, to generate the complete set of association rules. It then selects a small set of high-quality rules and uses this rule set for prediction. This method results in a smaller number of rules compared to ARM.
Three important AC algorithms, CBA, CMAR and CPAR, have been discussed in this chapter. Almost every algorithm contains two major data mining steps: an association rule (AR) mining stage, whose generated rules are called classification association rules (CARs), and a classification stage which uses the mined rules from the first stage directly. The second stage chooses high-priority rules from the CARs to cover the training set. The difference between the algorithms lies in the priority evaluation of rules, which usually depends on the confidence, support, rule length or a common quality standard of classification rules. CPAR is better in rule generation than the others. TB rules and accuracy have been compared for every associative classification algorithm. Though not all of the rules may help doctors, a few rules may describe the relationship between one symptom and another, and sometimes they can reveal hidden relationships.

Table 1. List of Attributes and their Datatypes

The fact that the support of an itemset never exceeds the support of its subsets is known as the anti-monotone property of support. Since the processing of the Apriori algorithm requires plenty of time, its computational efficiency is a very important issue. In order to improve the efficiency of Apriori, many researchers have proposed modified association-rule algorithms.

Table 2. Original (raw) data

Table 3 contains the schema of how the attributes are mapped to individual column numbers. Table 4 is the final translated or normalized data.

Table 3. Schema table

In the above tables, note that Age is a numerical attribute and its cut-off point is <25 and >=25. Similarly, HIV is a categorical attribute, where the positive value is assigned one number and the negative value another. The value Null for the categorical attribute weightloss is equivalent to No and is assigned a unique number. Using the schema table above, we map each tuple in the original data of Table 2 to the resulting normalized table shown in Table 4. The resulting table has the same number of columns as the original table, but is filled with unique integer values.