Enhanced MultiView Point Non-Negative Matrix Factorization Clustering for Clinical Documents Analysis

Clustering of clinical documents is a major research area in machine learning and artificial intelligence. It aims to uncover associations in the data that help to highlight significant examples and patterns. The rich corpus of clinical notes contains a large amount of unprocessed data, which needs to be mined with an appropriate technique to improve the existing healthcare system. Biomedical information mining is a research strategy to retrieve, break down and analyze clinical information from a collection of medical records. This paper presents a novel approach that uses Non-Negative Matrix Factorization clustering to mine medication names based on the age of the patients. Pharmaceutical data in clinical notes is typically expressed through prescription names and other medication information, which needs to be mined based on the similarity between documents. Although it is a highly effective solution, clustering is not yet deployed in the major search engines. The basic issue is to determine cluster values quickly and accurately even after reducing the complexity of the technique. This paper presents an enhanced multi-viewpoint similarity measure that uses many distinct viewpoints to measure similarity between documents, so that similarity can be extracted more accurately.


INTRODUCTION
Information mining is a well-known source of techniques for retrieving knowledge from databases. It is considered an artificial strategy that allows us to find helpful information residing in important parts of data sources. It has been demonstrated to have good potential to separate valuable data from a comprehensive collection of information. Clustering has been one of the most proficient approaches, giving researchers a way to extract information from grouped clinical notes. Various studies in this domain have made rapid progress in the areas of medical research and clinical practice. Clinical data mining is the process of applying information mining strategies to the acquired textual clinical reports.
The rich content of clinical reports contains data about drugs and syndromes. Extracting this data has proven useful for refining the medical system. Many studies have proposed proficient techniques for diagnosing diseases that extract the right medical information from the unprocessed data. Clinical documents are broadly used for future investigation and diagnosis of illness. Clinical notes are also of great use in pharmacies, to lessen imitation and avoid drug misuse. Record clustering helps to gather documents into pertinent groups. This is done mainly to find significant patterns by placing comparable articles into the same group. Using this method, we can perform grouping on a textual corpus that is in unstructured or semi-structured form. The information extracted from biomedical prescriptions is very useful for building a consolidated summary for patients [1,2]. Pharmaceutical data is an integration of prescription names and other medication information, for example drug name, dosage, course, etc. A significant amount of work has been done in the field of clinical research on clustering pharmaceutical records for mining useful text. In 2008, Doing-Zhang et al. worked on ontology-based learning and proposed nine similarity measures of words for clinical document clustering [3]. The ontology-based work discussed in [4] can help to build an ontology-based clinical retrieval system. In 2011, Patterson et al. clustered 17 biomedical notes with an unsupervised clustering algorithm, using various lexical and semantic patterns for proper extraction [5]. In [6], an ontology-based access control mechanism is used to secure clinical data.
In 2010, Han et al. clustered clinical notes using the latent semantic indexing method and found it effective for measuring similarity [7]. Harris et al. compared the linguistic features of clinical documents from various institutions according to their respective medical specialty by using clustering methods. The non-negative matrix factorization (NMF) framework significantly outperforms classical clustering algorithms such as bisecting k-means and hierarchical clustering [5]. The contributions of this paper are summarized as follows: 1) an enhanced framework for extracting symptom and medication names from clinical documents; 2) application of multi-view NMF to use medication and symptom names effectively to improve the clustering results; 3) extraction based on age; 4) computation of accuracy on the basis of word count and the tf-idf factor.

Related Work
Over the past few years, researchers have carried out various works relating to the clustering of biomedical notes; this section presents some of them. Xiaodi Huang et al. [8] proposed the ensemble non-negative matrix factorization technique for effective clustering of clinical notes. A combination of ensemble non-negative matrix factorization and Hybrid Bipartite Graph Formulation (HBGF) is proposed for clustering the prescriptions. Yuan Ling et al. [9] built a combined approach for extracting drug names and their related symptom names from a collection of clinical documents. Their work compared two approaches, namely non-negative matrix factorization and multi-view NMF, for producing clusters out of a set of pharmaceutical prescriptions based on the different sample-feature matrices generated. Duc Thang Nguyen et al. [10] introduced an efficient multi-viewpoint-based similarity measure for clustering clinical notes. This concept was applied to clustering, which the authors named MVSC. Subsequently, they presented two variants, MVSC-I_R and MVSC-I_V, i.e. MVSC with I_R and I_V as the criterion function respectively. The main goal is to perform document clustering by optimizing I_R and I_V. Hung Chim et al. [12] presented their work on phrase-based record clustering. They computed document similarity using the Suffix Tree Document model. The authors considered three kinds of suffix tree nodes for every document, namely the root node, leaf nodes, and internal nodes. The STC algorithm was then applied to obtain better clustering results than the conventional algorithms. Tsang-Hsiang Cheng et al. [13] proposed a technique named clustering-based Category-Hierarchy Integration (CHI), an improvement on the clustering-based Category Integration approach, abbreviated as CCI. Their category-hierarchy integration achieved better accuracy than non-hierarchical category-integration techniques. William Hsu et al. [14] stated that knowledge from clinical sources can be used to mine useful data and to analyze the clinical data present in a dataset, improving access to various portions of the record. They extracted features from the biomedical records and mapped them to the concepts available in the text data, thereby computing results based on the concepts mentioned in the knowledge bases. Shady Shehata et al. [15] presented a model that computes similarity between the sentences in documents by analyzing their meaning. The proposed concept-based analysis algorithm uses a similarity measure built on values at the sentence level (ctf), document level (tf), and corpus level (df) for a set of documents. Concept-based similarity was applied to various datasets, and the results showed that the proposed work outperformed conventional analysis methods. Jiayue Zhang et al. [16] devised a novel approach to enhance the search results for electronic health records by calculating temporal similarity.
The authors proposed an algorithm that combines textual and temporal relevance, using an adapted hierarchical clustering method, for re-ranking and re-positioning healthcare records. Adil M. Bagirov et al. [17] further modified the existing modified k-means algorithm by computing the clusters incrementally. For this, they generated the starting points with the help of auxiliary functions. The function is then minimized by applying the k-means algorithm to it.
Taxiarchis Botsis et al. [18] applied various clustering algorithms to document networks at the respective threshold values to obtain document clusters. The authors applied three clustering algorithms, namely k-means, visualization of similarities and Louvain, to the obtained networks and calculated the performance to determine cluster values. Arthur, D. et al. [19] applied their technique to achieve a better running time bound for the k-means approach. The authors found that the running time of the k-means clustering algorithm is bounded by a polynomial in 1/σ and n. The authors concluded that they would like to evaluate the quality of the local optimum obtained using the k-means approach and whether the values achieved could be considered a global optimum.
Atanaz Babashzadeh et al. [20] used semantic data to enhance the performance of a clinical IR framework by defining queries in a representative and meaningful manner. The TREC dataset was used to validate their approach. The results demonstrated that the devised approach considerably improved information retrieval performance compared to the conventional keyword-based IR model. Nan Cao et al. [21] presented a technique named FacetAtlas, a multifaceted visualization approach for visually evaluating rich text datasets. In order to extract local as well as global values, the authors devised an integrated approach that applies a searching technique on an advanced visual analytics device. Edward Omiecinski et al. [22] presented an efficient parallel approach for record clustering. This technique was run on a SIMD machine. The authors showed that there is no difference between the results of the SplitMerge algorithm and those of their approach. Honigman et al. [23] focused on the detection of Adverse Drug Events (ADEs) in patients' records using a lexicon device. This device was applied to clinical documents to extract ADEs. Their approach identified various problems in outpatients in an efficient and economical manner. Hripcsak et al. [24] focused on extracting information regarding patients with tuberculosis using an NLP method. They devised a clinical policy to ensure an immediate solution to the problem and to isolate these patients. Clinical protocols produced considerably better results when combined with automated protocols than when used alone. An NLP-based generic framework discussed in [25] can also help to retrieve clinical data efficiently. Emilia Apostolova et al. [26] determined a technique that automatically segments medical documents into semantic sections. They used hand-crafted rules to develop a scalable biomedical document segmentation approach that requires very little user effort for efficient extraction of information from clinical texts. Their approach also involves automatically identifying a high-confidence training set. Renchu Guan et al. [27] proposed an Affinity Propagation (AP) based approach named Seeds Affinity Propagation (SAP) to improve semi-supervised clustering. The Reuters-21578 dataset was selected to compare the results obtained with two clustering algorithms, namely the AP algorithm and the k-means algorithm.

Fig. 1: An example of textual clinical prescription

METHODOLOGY
Clinical Documents
Clinical documents consist of medical data about patients, such as demographics, symptoms, medicines, billing, gender, age and other historical information. This information is in unstructured text form. It contains tags, stem words, etc., which need to be removed to convert the text into a structured format. An example of a clinical document in free-text format is shown in figure 1.

Pre-processing
The clinical prescriptions consist of structured sections that contain unorganized pharmaceutical information. An enormous amount of hidden medical information can be mined from them by applying the right technique. In this research, documents are pre-processed in order to extract the valuable words and the proper sections from the clinical notes. This improves the quality of the prescriptions present in the dataset. An overview of the pre-processing approach for the extraction of symptoms and medication is described in figure 1.
Pre-processing of the clinical prescriptions includes the removal of unnecessary words, such as stem or stop words, and the elimination of unnecessary sections from the clinical notes. For this, the clinical notes were first pre-processed using the Stanford CoreNLP tool (http://stanfordnlp.github.io/CoreNLP/), a Java-based tool that helps to parse the text, identify entities, and perform sentiment analysis.
A number of sections are present in the biomedical documents, and symptom and medication names need to be extracted from these sections. The symptom names are mainly contained in sections such as Chief Complaint, History of Illness, Assessment Plan, Physical Exam, Specimen, and Review of Systems. Likewise, the drug names can be found in the following sections: Medication, Impression, Recommendations, Past Medical History, Assessment Plan, and Medication on Discharge. A section annotator is used to differentiate between the sections present in the textual prescription. The header information for the respective sections is computed, and based on it the necessary sections are retrieved from the document. The sections that provide negated recommendations are excluded using the negation annotator. This is done using NegEx (https://healthinformatics.wikispaces.com/NegEx+Algorithm), a Natural Language Processing (NLP) tool that detects the negation terms present in the clinical text and removes the corresponding medicine name so that the right medication is retrieved. The tool identifies trigger terms and works according to the scope of the terms identified. It recognizes pre-negation words such as keptoff, avoid, away and without, and post-negation words such as was ruled out and free. The clinical note shown in figure 1 has a chief complaint of chest pain. The statements "The naprosyn is not helping that much" and "keptoff Protonix" contain the negation words "not helping much" and "keptoff" respectively, so the medicines these negations refer to are removed. Likewise, negation terms such as 'avoid' and 'allergic' are annotated in the document, and the corresponding medication is discarded for correct medical recommendation.
After pre-processing is complete, the sample matrix is obtained, with the attributes shown in table 1.
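The negation step described above can be illustrated with a minimal sketch. This is not the NegEx implementation: the trigger lists, the word window, and the function name `filter_negated` are assumptions made here for illustration, and treating "is not helping" as a trigger that follows the drug name is likewise an assumption based on the example sentence.

```python
import re

# Hypothetical trigger lists, loosely based on the terms named in the text.
PRE_TRIGGERS = ["keptoff", "avoid", "away", "without", "allergic to"]
POST_TRIGGERS = ["was ruled out", "free", "is not helping"]


def _has_trigger(text, triggers):
    """Return True if any trigger occurs in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in triggers)


def filter_negated(sentences, medications, window=5):
    """Drop medications that appear within `window` words of a negation trigger.

    Only the first mention of each medication in a sentence is checked.
    """
    kept = {m.lower() for m in medications}
    for sentence in sentences:
        text = sentence.lower()
        for med in list(kept):
            match = re.search(r"\b" + re.escape(med) + r"\b", text)
            if not match:
                continue
            # Words just before the mention (pre-negation scope).
            before = " ".join(text[:match.start()].split()[-window:])
            # Words just after the mention (post-negation scope).
            after = " ".join(text[match.end():].split()[:window])
            if _has_trigger(before, PRE_TRIGGERS) or _has_trigger(after, POST_TRIGGERS):
                kept.discard(med)
    return sorted(kept)


notes = [
    "The naprosyn is not helping that much.",
    "Patient was keptoff Protonix.",
    "Continue aspirin daily.",
]
remaining = filter_negated(notes, ["Naprosyn", "Protonix", "Aspirin"])
print(remaining)
```

A fixed word window is the simplest possible notion of scope; a real negation annotator determines the scope of each trigger term more carefully.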

Symptom/Medication Names Extraction
The medication and symptom names are present in different sections of the clinical notes. After pre-processing of the documents, MedEx [28] is used for identifying and extracting the medicine names, and MetaMap [8,27] is used for extracting symptom names from these sections. Figure 2 shows this extraction using MedEx and MetaMap after pre-processing the clinical notes. MedEx (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2995636/figure/fig1/) [30] is a Java-based open source tool that helps to identify medication terms such as dosage, intake time, drug names, duration and amount in the clinical text. It maps the medication information found to UMLS thesaurus concepts to find the most precise match for accurate medication extraction. Symptom extraction is done using MetaMap (https://metamap.nlm.nih.gov/), a configurable program that relates biomedical text to Metathesaurus concepts. It uses a semantic knowledge representation and an NLP-based approach for mapping to semantic types such as "aapp", "clna", "clnd" and "nnon", which stand for amino acid, peptide, or protein; clinical attribute; clinical drug; and nucleic acid, nucleoside, or nucleotide respectively. Other types include bact, bodm, enzy, impo, vita, etc. (https://mmtx.nlm.nih.gov/MMTx/semanticTypes.shtml).

Matrix Factorization Technique (M-NMF)
Non-Negative Matrix Factorization
NMF is an efficient method to factorize a given non-negative matrix V into the product of two non-negative matrices of lower dimension, W and H, where the quality of the approximation can be expressed using the Euclidean distance as in formula 1:
... (1)
NMF aims at finding a low-rank estimate of the matrix V by treating V as the product of two matrices of reduced dimension, W and H. Each column of W represents a basis vector, and each column of H contains an encoding of the linear combination of the basis vectors that approximates the respective column of V. If V has n rows and m columns, the dimensions of W and H are n × k and k × m respectively, where k is the reduced rank [9]. Lee and Seung [7,29] proposed an efficient technique to update W and H multiplicatively, known as the multiplicative method (MM). The algorithm is stated below [9]:
(1) Initialize H and W with non-negative values.
(2) Iterate for each variable c, i, and j until convergence or after l iterations:
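The two steps above can be sketched as follows, assuming the standard Lee and Seung multiplicative updates for the Euclidean objective ||V - WH||^2, namely H ← H ∘ (WᵀV) / (WᵀWH) and W ← W ∘ (VHᵀ) / (WHHᵀ) applied element-wise. The function name `nmf_multiplicative`, the iteration count, the small constant `eps` and the toy matrix are assumptions for illustration, not values taken from this work.

```python
import numpy as np


def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (n x m) as W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Step (1): initialize W and H with non-negative (here strictly positive) values.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    # Step (2): apply the multiplicative updates for a fixed number of iterations.
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # eps guards against division by zero
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy non-negative matrix (illustrative values only).
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [2.0, 0.0, 4.0]])
W, H = nmf_multiplicative(V, k=2)
error = np.linalg.norm(V - W @ H)  # Frobenius norm of the residual
print(W.shape, H.shape, error)
```

Because every update multiplies by a ratio of non-negative terms, W and H stay non-negative without any explicit projection step. The sketch stops after a fixed number of iterations rather than testing for convergence.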

Multi-View Point NMF
NMF has been used effectively in multi-view learning. The multi-view technique helps to identify latent components in several distinct sub-matrices simultaneously. Z. Akata et al. [9] proposed an extension of basic NMF in an optimized form using different views. Figure 3 shows the M-NMF technique, which uses three views, i.e. words, symptoms and medication, to compute the feature matrix. This matrix is used for the multi-viewpoint similarity computation. Because more than one view is used, the accuracy of the computations improves. In addition, the extraction is based on the age of the patients, which raises the accuracy further.
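One simple way to realize a shared factorization over several views is sketched below. It is an assumption made for illustration, not necessarily the formulation used in this work: the three view matrices (documents by words, by symptoms, by medications) are concatenated column-wise and factorized with a single document-by-cluster matrix W shared across views, with all views weighted equally. The function name `multiview_nmf` and the toy counts are hypothetical.

```python
import numpy as np


def multiview_nmf(views, k, n_iter=300, eps=1e-9, seed=0):
    """Jointly factorize several (n_docs x n_features) views with a shared W.

    Stacking the views column-wise gives [V1 | V2 | V3] ~ W @ [H1 | H2 | H3],
    so each view has its own basis H_v while W is common to all of them.
    """
    rng = np.random.default_rng(seed)
    V = np.hstack(views)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    labels = W.argmax(axis=1)  # assign each document to its strongest component
    return W, H, labels


# Toy data for four documents and three views (illustrative counts only).
words = np.array([[2, 1, 0, 0], [3, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]], dtype=float)
symptoms = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
medications = np.array([[1, 0], [2, 0], [0, 1], [0, 2]], dtype=float)

W, H, labels = multiview_nmf([words, symptoms, medications], k=2)
print(labels)
```

In this toy example the first two documents share features in every view, as do the last two, so the shared W is expected to separate them into two groups. Weighting views differently, or factorizing each view separately with a consensus term, are alternative designs not shown here.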

Dataset Result
The results for the extraction of medicines and symptoms are shown in table 2. The idea is to extract the medicines and symptoms based on the age of the patients using the clustering technique described above, i.e. Multi-View Point Non-negative Matrix Factorization. The value of k (the number of clusters) is set to three (k=3), so the documents are grouped into three clusters.

Evaluation Metrics
Accuracy based on the three views is computed in this research work. Accuracy measures the fraction of documents that are labelled correctly, assuming a one-to-one correspondence between the true classes and the assigned clusters. Accuracy is calculated using formula 6.
... (6)
where q ranges over the possible permutations of 1 to k. Two age groups are considered for the extraction of symptoms and medication names. Table 4 shows the results for the accuracy based on patients' age for the age group less than 30. As shown in Table 3, the count based on words in the clinical notes is calculated using the feature matrix, and the accuracy is then calculated using the formula specified in equation 6. The accuracy based on count and on the TF-IDF factor is computed for each respective view.
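The permutation-based accuracy described above can be sketched as follows, assuming it takes the usual form: the best fraction of correctly labelled documents over all one-to-one mappings q of cluster indices to class indices. The function name `clustering_accuracy` and the toy labels are assumptions for illustration, and labels are indexed from 0 here rather than from 1.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels, k):
    """Best fraction of matching labels over all permutations q of range(k)."""
    n = len(true_labels)
    best = 0
    for q in permutations(range(k)):
        # q[c] is the true class that cluster c is mapped to under this permutation.
        correct = sum(1 for t, c in zip(true_labels, cluster_labels) if q[c] == t)
        best = max(best, correct)
    return best / n


# Toy example with k = 3 clusters (illustrative labels only).
true_labels = [0, 0, 1, 1, 2, 2]
cluster_labels = [2, 2, 0, 0, 1, 0]
acc = clustering_accuracy(true_labels, cluster_labels, k=3)
print(acc)
```

Enumerating all k! permutations is practical for k = 3 as used here; for larger k an assignment solver would be the more usual choice.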
Table 4 shows the results for the accuracy based on patients' age for the age group above 30. As for the first age group, the word counts are taken from the feature matrix (Table 3), and the accuracy based on count and on the TF-IDF factor is computed for each respective view using equation 6.
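The two feature weightings mentioned above, raw word count and TF-IDF, can be sketched as follows. The exact TF-IDF variant used in this work is not stated, so this sketch assumes one common form, count × log(N / df), where N is the number of documents and df is the number of documents containing the term. The function names and toy notes are hypothetical.

```python
import math
from collections import Counter


def count_matrix(docs):
    """Build a documents x vocabulary matrix of raw term counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    rows = []
    for d in docs:
        counts = Counter(d.split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows


def tfidf_matrix(counts):
    """Weight each count by log(N / df); every term here has df >= 1."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


# Toy clinical snippets (illustrative only).
docs = ["chest pain aspirin", "chest pain naprosyn", "fever aspirin"]
vocab, counts = count_matrix(docs)
tfidf = tfidf_matrix(counts)
print(vocab)
print(counts)
```

Both matrices are non-negative, so either can serve as the input V of an NMF-based clustering step. TF-IDF down-weights terms that occur in many documents, while the count matrix keeps every term at its raw frequency.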