Biomedical Named Entity Recognition Using the SVM Methodologies and BIO Tagging Schemes

Biomedical Named Entity Recognition (BNER) is the identification of entities such as drugs, genes, and chemicals in biomedical text, which supports information extraction from the domain literature. It allows extracting information such as drug profiles, similar or related drugs, and associations between drugs and their targets. Although many machine learning methods have been applied, this avenue still presents opportunities for improvement. In particular, performance can be improved for biology-related chemical entities, which vary widely in structure and properties. The proposed approach combines two state-of-the-art algorithms and aims to improve performance by applying them to varied sets of features, including linguistic, orthographic, morphological, domain and local context features. It uses the sequence tagging capability of CRF to identify the boundary of an entity and the classification efficiency of SVM to detect subtypes in BNER. The method is tested on two datasets with different types of entities: 1) the GENIA corpus and 2) the CHEMDNER corpus. The results show that the proposed hybrid method improves BNER compared to conventional machine learning algorithms. In addition, SVM and its methodologies are studied in detail. Linear and non-linear text classification are described in section 3. The final section describes the results and the evaluation of the proposed method.


1. Introduction
Named Entity Recognition (NER) refers to identifying and classifying terms belonging to a domain in unstructured text and mapping them to predefined categories. Generally there are three methods for entity recognition [1]: i) rule-based methods, ii) dictionary-based methods and iii) machine learning based methods. Rule-based systems are effective if patterns can be defined, based on orthographic or morphological features, for all types of entities. But identifying, adding and disambiguating patterns is a tiresome task. The rule-based method is neither robust nor portable: it has to be kept updated for precise extraction, and the features vary for every domain. The dictionary-based approach is useful if the vocabulary is complete and up to date, and it also requires pre-processing such as normalization to match the text with the vocabulary. Neither approach extracts unseen entities: if there is variation in the patterns or the term is not present in the vocabulary, the entity is not extracted. The machine learning based approach solves this problem by learning the distinctive features that identify an entity. Primarily, supervised machine learning algorithms are used for extracting named entities, and they require large annotated data to learn the features of entities [2]. Microbiological characteristics were analysed in [3], and a comparative analysis of chemical and microbiological characteristics was carried out for analysing antibacterial activities [4].
Unsupervised algorithms can be used to explore and analyse the text for various segments of entities based on their common properties, but they need discriminating features for classification [5]. Topic models, distributional information or semantic similarity are used to cluster similar entities and classify them. Semi-supervised approaches are useful if a large amount of un-annotated text is available, but they need a proper selection of seeds with selective features to learn the task efficiently. A common approach for extracting entities across domains is not available, as the features and classes differ for every domain [6]. To identify multiple types of entities, the feature set has to be equally diverse, and different machine learning algorithms are necessary to identify them.
In this work NER from biomedical text is considered, as the Biomedical Named Entity Recognition (BM-NER) task has a huge impact on tasks such as information extraction and knowledge discovery in the domain. The availability of large amounts of unstructured text provides an opportunity for knowledge discovery using NLP and machine learning techniques. The biomedical domain has various entities such as proteins, genes such as DNA and RNA sequences, drugs and diseases. BM-NER presents specific challenges when compared with other domain-specific NER: i) entities vary in size, composition and occurrence, ii) there is no standard naming convention to represent chemical structural information, iii) entity boundaries have to be detected with precision, iv) abbreviations are ambiguous, and v) creating and maintaining a domain vocabulary is tedious since new terms keep emerging. NLP can be used to extract features that are effective for NER and also helps in extracting and disambiguating domain-specific entities.
Domain-specific NLP plays an important role in identifying entities by adapting its functionality to the task at hand, since text in the biomedical domain varies greatly from generic text. The creation of gold standard datasets and knowledge bases favours supervised machine learning approaches. NER is naturally a sequence tagging problem rather than a general classification problem and is a two-step process: 1) entity boundary detection and 2) assigning pre-defined classes to the entities. Various machine learning algorithms are used for NER, among them Bayesian approaches, Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Support Vector Machines (SVM), Structured Support Vector Machines (SSVM) and Conditional Random Fields (CRF). CRF is the most used due to its ability to model multivariate outputs and to utilize a large set of features for predicting the labels.
The objective of this work is to improve the BNER task with respect to chemical entities. To overcome the challenges, domain-specific NLP is used to extract features, and discriminative functions are learned from those features using supervised machine learning algorithms. A novel method to extract biomedical domain entities is proposed and tested on two different datasets, the GENIA and CHEMDNER corpora. The approach uses CRF as the tagger and SVM as the classifier, as they are the most effective algorithms for the respective tasks. The model is tested with different sets of features: linguistic, orthographic, morphological, domain and local context features. The rest of the paper is organized as follows. Related works details the existing methods and processes for identifying biomedical named entities, followed by the proposed methodology. The Dataset section describes the corpora used for named entity recognition, and Result and discussions shows the performance of the proposed method. It is followed by a discussion of the results that identifies avenues for improvement, and the paper ends with the conclusion.

Related works
The work in [7] handles NER as a sequence labelling problem and utilizes CRF to tag Chinese clinical text. To adapt the process to Chinese text, word segmentation, clause-level tagging and a modified tag set are used. The article [8] utilizes pool-based active learning with uncertainty sampling to annotate clinical text with named entities. The work in [9] uses different features such as linguistic features, Brown clusters and vector representations with various taggers to tag clinical text. tmChem, proposed in [10], applies an ensemble of two taggers with varied feature sets, each with different pre-processing and post-processing. The work in [11] utilizes features such as bag-of-words, orthographic features, morphological features, part-of-speech (POS) tags, document structure information and domain knowledge, along with different word representations, in CRF and SVM machine learning models to tag chemical entities. The article [12] employs a hybrid method, i.e. a rule-based method using lexicons combined with the CRF machine learning algorithm, to recognize named entities in Arabic text. The works in [13,14] make use of CRF with varied features for medical entity detection. The work proposed in [15] combines SVM, which is used to classify terms as entity or non-entity, with a CRF model that assigns the entity tag. The approach in [16] uses a dictionary to identify candidate entities and uses a neural network and CRF to tag them.
The method in [17] uses the vector similarity between an entity class from the knowledge base and a term in the corpus to classify the entity category. The vectors are formed from the tf-idf values of the terms, and noun phrase chunking is used to select candidate terms. The work in [18] applies a probabilistic generative model to generate features for entity extraction. Word embeddings based on LDA, i.e. a word along with its topic, provide different embeddings for different word-topic pairs, which are used as features for entity identification. The research in [19] uses lexical resources and search results to identify the boundaries of entities and distributional context to classify them. The work in [20] uses a dictionary-based approach to annotate named entities based on direct match, stemmed match and string edit distance match. It uses external resources such as the UMLS Metathesaurus for annotating diseases and NCI, MeSH and USPMG for identifying medication. Korkontzelos et al. [19] use a dictionary along with an aggregate classifier to identify drugs.
Dictionary-based approaches [21] utilize string matching and normalization techniques to match entities in text with a domain dictionary. In this method precision is high but recall is low, since only the terms that are matched are extracted. Due to spelling variations and mistakes, string matching methods cannot extract entities efficiently. The method also suffers from incompleteness: not all entities are covered, and entities evolve over time, which makes the creation and maintenance of a domain vocabulary difficult. Some entity recognition has also been done using ontologies, as mentioned in [22]. Rule-based methods are useful for extracting entities that are systematic and follow a pattern. These methods also struggle to cover all patterns and to update patterns for new entities.
Machine learning algorithms need features that are informative and discriminative so that entities are classified efficiently. As feature selection influences the performance of the algorithms, irrelevant and redundant features are to be discarded and useful features are to be identified using feature selection methods. Features can be classified as generic or domain-dependent. Generic features, which can be applied across domains, are not enough on their own, as domain-specific features are more effective in identifying domain-oriented entities [23,24]. Depending on the algorithm, the selected features have to be represented so that the model learns the discriminative function to classify the entities.

Materials and methods
The objective of this work is to efficiently identify named entities in biomedical literature. The Conditional Random Field is the most used algorithm for named entity recognition, since it combines the capabilities of discriminative classification and graphical modelling into one. It considers the context of the input while predicting the sequence labels, which general classification methods lack, and that makes it a better algorithm for sequence tagging. The Support Vector Machine is a state-of-the-art classifier which can perform both linear and non-linear classification. SVM does not consider the context as CRF does and learns a maximum margin classifier to predict the classes. The proposed method utilizes both algorithms to detect biomedical entities and tag them with their semantic classes.
The drawback of CRF compared to SVM is that it requires more computational space and time. In the proposed method CRF is used to extract entities using the BIO tagging scheme and SVM is used to identify the subtype of each extracted entity. Since SVM is inherently a binary classifier, the multi-class problem is converted into multiple binary classification problems, which are easy to learn and provide the necessary accuracy. Generally a multi-class SVM is constructed from multiple binary SVMs, using either one-vs-rest or one-vs-one (pair-wise) binary classifiers. The one-vs-rest method suffers from the class imbalance problem, as the two classes do not have equal numbers of samples. Hence the one-vs-one pair-wise classifier is used to identify the subtypes.
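As an illustration of the BIO scheme used for the boundary detection step, the following minimal sketch converts a B/I/O tag sequence into entity spans that could then be passed to a subtype classifier. The function name, the example sentence and its tags are hypothetical and are not taken from the experiments; treating a stray "I" tag as the start of an entity is an assumption.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert a B/I/O tag sequence into (start, end) token spans, end exclusive."""
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open entity began
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:  # a new entity starts right after another one
                spans.append((start, i))
            start = i
        elif tag == "I":
            if start is None:  # assumption: a stray "I" opens an entity
                start = i
        else:  # "O" closes any open entity
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:  # entity running to the end of the sentence
        spans.append((start, len(tags)))
    return spans


# Hypothetical example sentence; the tags are illustrative, not model output.
tokens = ["The", "IL-2", "gene", "activates", "NF-kappa", "B", "."]
tags = ["O", "B", "I", "O", "B", "I", "O"]
spans = bio_to_spans(tags)
mentions = [" ".join(tokens[s:e]) for s, e in spans]
print(mentions)
```

Each recovered mention would be the unit handed to the SVM for subtype classification in a two-stage setup of this kind.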
To identify N different classes, N*(N-1)/2 classifiers are constructed and the class is determined by majority voting. The increase in the performance of the system comes with an increase in the time complexity of learning the model. CRF alone is not efficient at both correctly identifying the boundary and detecting the subtype of an entity, since the features used are not sufficiently discriminative. Hence the SVM model is used to detect entity subtypes effectively after the entity term extraction.
https://doi.org /10.37358/Rev.Chim.1949 Rev. Chim., 72 (4)
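The pairwise voting scheme can be sketched as follows. This is only an illustration of the N*(N-1)/2 construction and the majority vote: the binary classifiers are hypothetical stand-ins that compare fixed scores rather than trained SVMs, and breaking ties by class order is an assumption.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

Scores = Dict[str, float]
BinaryClassifier = Callable[[Scores], str]


def one_vs_one_predict(
    x: Scores,
    classes: Sequence[str],
    pairwise: Dict[Tuple[str, str], BinaryClassifier],
) -> str:
    """Run every pairwise classifier and return the class with the most votes."""
    votes: Counter = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    best = max(votes.values())
    # Assumption: ties are broken by the order of `classes`.
    return next(c for c in classes if votes[c] == best)


def make_stub(a: str, b: str) -> BinaryClassifier:
    """Hypothetical stand-in for a trained binary SVM separating class a from b."""
    return lambda scores: a if scores[a] >= scores[b] else b


CLASSES = ["protein", "DNA", "RNA", "cell_line", "cell_type"]
pairwise = {(a, b): make_stub(a, b) for a, b in combinations(CLASSES, 2)}

# Illustrative scores for one extracted entity (not real model output).
example = {"protein": 0.2, "DNA": 0.9, "RNA": 0.4, "cell_line": 0.1, "cell_type": 0.3}
predicted = one_vs_one_predict(example, CLASSES, pairwise)
print(len(pairwise), predicted)
```

With the five GENIA classes this gives ten pairwise classifiers, which shows how the training cost grows quadratically with the number of subtypes.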

Features used
The following sets of features are tested to identify entities by the proposed method:
-Window-based context words: the words occurring to the left and right of the given word, with the window size ranging from 1 to 3.
-Orthographical information: the special constituents of the given word, such as capital letters, symbols and numbers. The numbers of uppercase letters, lowercase letters, symbols and digits are counted and added as features.
-Roman numerals and Greek letters: Boolean feature representing the presence of Roman numerals and Greek letters. It is kept as a separate feature because it is domain specific.
-Morphological features: prefixes and suffixes of the term up to a length of five.
-POS tags: POS tag of the given word along with those of the context words. For each POS tag, a Boolean feature is added to the feature set.
-Dependency relation feature: selected dependency relations are used to represent the context. For each relation, a Boolean feature is added to the feature set.
-Chemical elements: list of elements and their symbols. Boolean feature indicating whether the word matches an element name or symbol.
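A minimal sketch of how several of the listed features could be computed for one token is given below. It covers the context window, orthographic counts, Roman numeral and Greek letter flags, affixes and the chemical element lookup; POS and dependency features are omitted because they require an external parser. The feature names, the `<PAD>` marker and the short Greek and element lists are hypothetical and deliberately incomplete.

```python
import re
from typing import Dict, List

# Illustrative, incomplete lookup lists (assumptions, not the lists used in the paper).
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa", "lambda", "omega"}
ELEMENT_SYMBOLS = {"H", "C", "N", "O", "Na", "Cl", "K", "Fe", "Zn"}
ELEMENT_NAMES = {"hydrogen", "carbon", "nitrogen", "oxygen", "sodium", "chlorine", "iron"}
ROMAN_RE = re.compile(
    r"(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)


def token_features(tokens: List[str], i: int, window: int = 2, max_affix: int = 5) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i]."""
    w = tokens[i]
    parts = re.split(r"[^a-z]+", w.lower())  # alphabetic pieces, e.g. "alpha"
    feats: Dict[str, object] = {
        # Orthographic counts
        "n_upper": sum(c.isupper() for c in w),
        "n_lower": sum(c.islower() for c in w),
        "n_digit": sum(c.isdigit() for c in w),
        "n_symbol": sum(not c.isalnum() for c in w),
        # Domain-specific Boolean features
        "is_roman": ROMAN_RE.fullmatch(w) is not None,
        "has_greek": any(p in GREEK_NAMES for p in parts)
        or any("\u0370" <= c <= "\u03ff" for c in w),
        "is_element": w in ELEMENT_SYMBOLS or w.lower() in ELEMENT_NAMES,
    }
    # Morphological features: prefixes and suffixes up to max_affix characters
    for k in range(1, min(max_affix, len(w)) + 1):
        feats[f"prefix{k}"] = w[:k]
        feats[f"suffix{k}"] = w[-k:]
    # Context words within the window, padded at sentence edges
    for d in range(1, window + 1):
        feats[f"w-{d}"] = tokens[i - d] if i - d >= 0 else "<PAD>"
        feats[f"w+{d}"] = tokens[i + d] if i + d < len(tokens) else "<PAD>"
    return feats


sentence = ["Factor", "II", "binds", "Na", "and", "alpha-tocopherol"]
f_roman = token_features(sentence, 1)
f_element = token_features(sentence, 3)
f_greek = token_features(sentence, 5)
```

Dictionaries of this shape can be fed to a CRF toolkit directly or vectorized for an SVM; which representation the original experiments used is not stated here.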

SVM and its methodologies
Support Vector Machine (SVM) is a classical machine learning algorithm based on a linear model, whose fundamental idea is to transform the input space into a high-dimensional feature space by a nonlinear transformation and to find the optimal linear separating surface in the new space. In general, a higher dimension increases the computational complexity, but the SVM algorithm solves this problem by introducing the kernel function, which not only does not increase the computational complexity but also avoids the "curse of dimensionality".
When the data is linearly inseparable, SVM separates the data linearly by mapping it to a high-dimensional feature space through the kernel function. The SVM algorithm is developed from the optimal separating surface. Let the training samples be $(x_i, y_i)$, $i = 1, \dots, n$, $x_i \in \mathbb{R}^d$, with $y_i \in \{+1, -1\}$ as the class label. The following quadratic programming problem is then solved:

$$\min_{w, b} \ \frac{1}{2}\|w\|^2$$
subject to $y_i[(w \cdot x_i) + b] \ge 1, \quad i = 1, 2, \dots, n$
The optimal classification surface is the hyperplane $w \cdot x + b = 0$. Lagrange optimization is then used to solve the above problem by converting it to its dual problem, which is solved using the Kuhn-Tucker theorem. This gives the optimal classification function $f(x) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^*\right)$, where $\alpha_i^*$ are the optimal Lagrange multipliers and $b^*$ is the corresponding threshold. In the linear classification case shown in Figure 1, the five-pointed flowers and the diamonds represent two distinct kinds of samples, and K is the classification line. K1 and K2 are straight lines that pass through the samples of each kind closest to the classification line and are parallel to K. The distance between line K1 and line K2 is called the class interval. The optimal separating line is the classification line that not only separates the two kinds of samples correctly but also maximizes the class interval [25].
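For reference, the dual problem mentioned above can be written in its standard textbook form. This is a sketch assuming the hard-margin formulation with weight vector $w$, threshold $b$ and Lagrange multipliers $\alpha_i$; it is not reproduced from the original derivation.

```latex
% Lagrangian of the hard-margin problem, with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left\{ y_i\left[(w \cdot x_i) + b\right] - 1 \right\}

% Setting the derivatives with respect to w and b to zero
w = \sum_{i=1}^{n} \alpha_i y_i x_i , \qquad \sum_{i=1}^{n} \alpha_i y_i = 0

% Dual problem obtained by substituting back into L
\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \ \ \sum_{i=1}^{n} \alpha_i y_i = 0

% Class interval between K1 and K2 for the normalized classification line
\text{class interval} = \frac{2}{\|w\|}
```

Because the class interval equals $2/\|w\|$, minimizing $\frac{1}{2}\|w\|^2$ is equivalent to maximizing the class interval. Only samples with $\alpha_i > 0$, the support vectors, contribute to the classification function.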
The classification line equation can be expressed as $w \cdot x + b = 0$, with samples $(x_i, y_i)$, $i = 1, 2, \dots, n$, $x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$. The classification line is normalized so that the linearly separable sample set satisfies the condition $y_i[(w \cdot x_i) + b] \ge 1$, $i = 1, 2, \dots, n$.
Algorithm: Improved supervised SVM algorithm for classifying the samples
Step 1: Input the training data set $S = \{(x_i, y_i)\}$, $i = 1, 2, \dots, n$, $x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$;
Step 2: Train the SVM classifier on the sample set S to obtain the classification model Dm1;
Step 3: Pre-process the data samples using the SVM sampling methods PrS;
Step 4: Categorize the sampled data into the corresponding segments and upload the sample set Qs to the segmented region SGreg;
Step 5: Iterate over the sample set Qs until all sample data are labelled;
Step 6: Reuse the completely labelled training set $(x_i, y_i)$ to obtain an improved classification model Dm2; // In a few cases different SVM methodologies were used to categorize the samples to get a better result. //
Step 7: Input the training set to Dm2;
Step 8: Output the result

Task and Evaluation Protocols
Text extraction is the task of extracting text from scene and web images. It comprises three subtasks: text localization, cropped word recognition and end-to-end recognition. Text localization estimates the regions of the image that contain text, given by their vertices. Cropped word recognition recognizes the word in a region that has been cropped from the image or web content. End-to-end recognition aims to localize and recognize all words in the image in a single step.

Figure 2. Non-linear SVM
In the figure above, the objects are classified into different categories. The mixed objects are separated into distinct segments so that each group can be identified by the non-linear SVM. The real value of the Support Vector Machine lies in solving nonlinear problems [26]. The approach uses a nonlinear mapping to map the sample space to a high-dimensional or even infinite-dimensional feature space, so that the linear SVM method can be applied in the feature space to solve the nonlinear classification problem in the sample space. The nonlinear mapping from the sample space to the feature space is shown in the figure.
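The mapping idea can be illustrated with a small worked example. The sketch below assumes a degree-2 polynomial kernel on two-dimensional inputs, chosen only for illustration (no particular kernel is specified here), and shows that the kernel value equals a dot product in an explicit feature space in which an XOR-like pattern becomes linearly separable.

```python
import math
from typing import Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def poly2_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Kernel computed in the 2-D sample space: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


def phi(x: Sequence[float]) -> Tuple[float, float, float]:
    """Explicit feature map to 3-D for which phi(x) . phi(z) = (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = poly2_kernel(x, z)      # never leaves the sample space
k_explicit = dot(phi(x), phi(z))     # same value via the feature space

# XOR-like data: not linearly separable in 2-D, but the sign of the second
# feature-space coordinate (sqrt(2) * x1 * x2) separates the two classes.
xor_points = {(1.0, 1.0): +1, (-1.0, -1.0): +1, (1.0, -1.0): -1, (-1.0, 1.0): -1}
separable = all((phi(p)[1] > 0) == (label > 0) for p, label in xor_points.items())
print(k_implicit, k_explicit, separable)
```

The kernel evaluates the feature-space dot product without constructing the mapped vectors, which is why the higher dimension does not add to the computational cost.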

Process of text classification
Text classification generally consists of several processing stages. Collecting the data set, pre-processing the data, feature extraction, training the model and classification are shown in Figure 3. Text splitting, stop-word removal, counting the words of the specified domain and mapping are among the tasks performed by the classifier. In the medical scenario the relevant words are classified using various methodologies.

Pre-processing
The abstracts in the GENIA corpus are divided into sentences using LingPipe, and then basic NLP processes such as tokenization, lemmatization, POS tagging and chunking are carried out using GDep, a biomedical domain dependency parser built using the GENIA tagger. The parser produces annotations with BIO tags along with the subtypes for the given abstracts. Similarly, sentence tokenization is carried out using the Stanford NLP tool for the CHEMDNER corpus. Word tokenization is carried out by breaking tokens at white space, punctuation, digits and case changes. The other features are then extracted, and annotations are produced with BIO tags along with the subtypes based on the human annotations.
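The word tokenization rule described above (breaks at white space, punctuation, digits and case changes) can be approximated with a single regular expression. This is a sketch under assumed conventions, not the tokenizer actually used: in particular, an uppercase run followed by a lowercase letter is split before its last capital, and each punctuation mark becomes its own token.

```python
import re
from typing import List

# Alternatives, tried in order:
#   1. a run of capitals not followed by a lowercase letter (e.g. "IL", "RNA")
#   2. an optional capital followed by lowercase letters (e.g. "Na", "binds")
#   3. a run of digits
#   4. any single remaining non-space character (punctuation, symbols)
TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+|[^\sA-Za-z\d]")


def tokenize(text: str) -> List[str]:
    """Split text at white space, punctuation, digits and case changes."""
    return TOKEN_RE.findall(text)


print(tokenize("NaCl binds IL-2 at 37C."))
```

Fine-grained splitting of this kind breaks chemical names into many small tokens, which is one reason the entity boundaries must be reassembled carefully from the BIO tags.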

Post-processing
Post-processing is done to maintain tagging consistency and to tag abbreviations effectively, as discussed by Leaman et al. [12]. If a specific character sequence is tagged more than twice as a chemical mention in an abstract, then all other untagged occurrences of that sequence are also tagged as chemical mentions. In the case of abbreviations, if the full form is tagged by the model then its abbreviation is tagged correspondingly. If the abbreviation is tagged and the full form is not, then the entity tag of the abbreviation is removed. As only space tokenization is used, entities tagged with unbalanced parentheses are rejected and treated as false positives.
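Two of the rules above, the rejection of mentions with unbalanced parentheses and the consistency rule for sequences tagged more than twice, can be sketched as follows. The function names are hypothetical, the example mentions are invented, and applying the bracket filter before counting is an assumption; abbreviation handling is left out.

```python
from collections import Counter
from typing import List, Set, Tuple

CLOSERS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(CLOSERS.values())


def has_balanced_brackets(mention: str) -> bool:
    """Return True if every bracket in the mention is properly matched."""
    stack: List[str] = []
    for ch in mention:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack


def consistent_mentions(tagged: List[str], min_count: int = 3) -> Set[str]:
    """Strings tagged more than twice; their untagged occurrences get tagged too."""
    counts = Counter(tagged)
    return {m for m, c in counts.items() if c >= min_count}


def postprocess(tagged: List[str]) -> Tuple[List[str], Set[str]]:
    """Drop unbalanced mentions, then find strings to propagate in the abstract."""
    kept = [m for m in tagged if has_balanced_brackets(m)]
    return kept, consistent_mentions(kept)


# Invented mentions from one hypothetical abstract.
tagged_in_abstract = ["aspirin", "aspirin", "(4-chlorophenyl", "aspirin", "ibuprofen", "ibuprofen"]
kept, to_propagate = postprocess(tagged_in_abstract)
print(kept, to_propagate)
```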

Dataset
The GENIA corpus is a collection of biomedical literature which is semantically annotated by humans. The compiled Medline abstracts are used for NER and have five major classes, namely protein, DNA, RNA, cell line and cell type, and thirty-three thousand unique terms. 5-fold cross-validation is performed on the GENIA corpus to evaluate the model on BNER. The CHEMDNER corpus is a collection of ten thousand abstracts from various chemistry-related documents that are manually annotated with seven different classes: abbreviation, family, formula, identifier, multiple, systematic and trivial. It has around eighty-four thousand mentions of chemical compound and drug entities and was created to aid the development of named entity recognition tools.
It has three subsets: 1) a training set containing 3500 Medline abstracts annotated with 29478 mentions of chemical entities, 2) a development set composed of 3500 abstracts with 29526 entity mentions and 3) a test set composed of 3000 abstracts containing 25351 mentions. The training and development sets are used to train and improve the model, and the test set is used to test it.
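The 5-fold cross-validation used for the GENIA corpus can be sketched as an index split over abstracts. This is a minimal illustration with contiguous, unshuffled folds; whether the original experiments shuffled or stratified the abstracts is not stated, so that choice is an assumption.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    # Distribute the remainder so fold sizes differ by at most one.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Twelve hypothetical abstracts split into five folds.
folds = list(k_fold_indices(12, k=5))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each abstract appears in exactly one test fold, so the scores averaged over the five folds cover the whole corpus once.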

3. Result and discussions
To evaluate the performance of NER applications, the known information extraction measures are used as given below.

Precision = TP/(TP + FP) (1)
Recall = TP/(TP + FN) (2)
F1 = 2 × Precision × Recall/(Precision + Recall) (3)
where TP refers to true positives, FP to false positives, and FN to false negatives.
Naive Bayes is the standard algorithm among the classification techniques. Text classification methods assign objects to categories using various algorithms. The figure shows that the SVM technique has the highest accuracy when compared with Naive Bayes, SVD and random forest. Figure 6 shows the time taken by each classification method. Compared with the existing techniques, the SVM technique takes the least time for classification.
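The evaluation measures defined above can be computed from entity-level counts as in the sketch below. The gold and predicted annotations are invented for illustration, and counting a prediction as a true positive only when both its span and its type match exactly is an assumption about the matching criterion.

```python
import math
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def count_matches(gold: Set[Entity], pred: Set[Entity]) -> Tuple[int, int, int]:
    """Exact-match counts: TP, FP, FN."""
    tp = len(gold & pred)
    return tp, len(pred) - tp, len(gold) - tp


# Invented annotations for one sentence.
gold = {(1, 3, "protein"), (4, 6, "DNA"), (8, 9, "RNA")}
pred = {(1, 3, "protein"), (4, 6, "protein"), (8, 9, "RNA"), (10, 11, "DNA")}
tp, fp, fn = count_matches(gold, pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, precision, recall, f1)
```

In this example the entity with the correct boundary but the wrong subtype counts as both a false positive and a false negative, which is how subtype errors lower precision and recall together.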

Figure 9. Precision, Recall and F-measure for various entities in CHEMDNER corpus
It is seen from the results that with the SVM classifier recall is low, and with CRF both precision and recall are moderate. The combined use of both algorithms is effective and increases both precision and recall. This increase is also attributed to post-processing after entity extraction, as it helps in resolving ambiguities and reduces false positives. In the GENIA corpus there is a decline in recognizing protein entities, which is due to class imbalance in the training data. Entities of the multiple class are not recognized in the CHEMDNER corpus due to coordination ellipsis.
A common issue in detecting entities is identifying the boundary of an entity mention, and errors here drastically reduce model performance. The root of the issue is the modifiers, both adjective and noun modifiers, that are added to the head word and can be part of the entity mention. Hence, to classify them accurately, the dependency relations between the word and its context words are used. The dependency context seems to provide discriminative features for identifying whether the modifiers are part of the entity. The other issue is how efficiently the learned model tags unseen entities using the context of the word.
For biomedical entities in the GENIA corpus, the context of the word plays a vital role in detecting the entity tag of an unseen word. In the CHEMDNER corpus, however, certain entities are formed following a pattern, and the composition of the word itself can be used to identify the type, so unseen entities can be found using these discriminative features. Hence both kinds of features are combined in this method for identifying entities across the domain. It is further observed that the features are not sufficiently discriminative for classifying entities into subtypes in the CHEMDNER corpus. The trivial and family class entities have unambiguous boundaries, but for other types it is uncertain where one mention ends and a new one starts. The SVM model used cannot fully distinguish between the subtypes based on the features, as some sets of entities closely resemble each other.
As it is crucial to identify entity subtypes completely, a rule-based or dictionary-based method can be incorporated for identifying subtypes in cases of ambiguity. Coordination ellipsis occurs when one or more of the conjuncts is incomplete and is missing part of a constituent. This type of coordination ellipsis is difficult to detect because of the varied structures of ellipsis formed by entity mentions. In this case the proper identification of multiple class entities is hindered by the ellipsis, as shown in the results. The incorporation of dependency context does not provide discriminative features for identifying these entities. The learned model extracts non-elliptical mentions with ease, but to extract mentions with ellipsis, the long-distance context of the word has to be incorporated or domain lexicons have to be used.
Another type of error comes from ambiguous abbreviations and is resolved by the post-processing steps. But the error is not fully eliminated, as there are abbreviations that start with non-alphabetic characters. Hence the detection of abbreviations should also take this problem into account, and the model should be modified to suit the extraction of such types. Finally, errors due to annotations missing because of the annotation guidelines are also found, and they can be rectified using semi-supervised models that can tag missed and evolving entities.

Conclusions
The proposed method combines models with different strengths for identifying entities and subtypes. The performance of the model is tested with varied feature sets and post-processing to examine its efficiency on the BNER task. The results show the efficiency of the model in extracting entities from both corpora. The model's performance is slightly better on the GENIA corpus than on the CHEMDNER corpus. This is because of the boundary detection problem in the latter corpus along with the classification error. The analysis of the results exposed various errors in BNER that provide avenues for improvement which would benefit the task. Due to the significant evolution of biomedical entities and the increasing volume of literature, it would be inefficient to use only supervised methods, as they require gold standard data for learning. Future work can adopt semi-supervised methods for BNER, since they utilize unlabelled corpora and provide a generic solution which will boost the performance of the task.