Development of Novel Patent Classification Framework by Exploiting Semantic Deep Learner

Knowledge documents are growing remarkably in number and serve organizations in information processing and various management tasks. Text mining is a hard but important research topic in knowledge discovery, in which hidden information is extracted from unstructured and semi-structured data. Patents are a rich knowledge source that needs to be organized efficiently and conveniently. Patent documents are used for gathering business intelligence and identifying key trends in technology development. The main focus of this study is to propose an electrical patent classification framework based on a Semantic Deep Learner (SDL). In this framework, the key terms of the patent documents are first extracted and represented using the Vector Space Model (VSM), and the importance of the key terms is weighted based upon their frequencies using TF-IDF. The semantic similarity between the key features is computed using the cosine measure. Terms with higher correlations are synthesized into a smaller set of features. Finally, the semantic deep learner is trained using the correlated features and patents are classified accordingly. The target output identifies the category of a patent document based on the hierarchical classification scheme of the International Patent Classification (IPC) standard. Our approach is new to the patent domain and shows some improvement in classification accuracy when compared to other state-of-the-art classifiers.


INTRODUCTION
Over the last decade, with the increasing availability of powerful computing platforms and high-capacity storage hardware, the number of digital documents has come to exceed the capacity of manual control and management. Document classification is one of the most crucial techniques for organizing documents in a supervised manner. Text classification plays an important role in many application domains. People are increasingly required to handle wide ranges of information from multiple sources. As a result, knowledge management systems are implemented by enterprises and organizations to manage their information and knowledge more efficiently. Knowledge management includes sorting useful knowledge from information, storing knowledge in good order and inferring new knowledge from an existing knowledge base. Turban and Aronson (2001) focus on explicit knowledge management, i.e., management of semi-structured documents such as patent documents.
Patents give the inventor exclusive rights to use and protect his intellectual property. The International Patent Classification (IPC) is a standard taxonomy developed and administered by the World Intellectual Property Organization (WIPO) for classifying patents and patent applications. The IPC covers all areas of technology and is currently used by the industrial property offices of more than 90 countries. The use of patent documents and the IPC for research is interesting for several reasons. The IPC covers a range of topics that spans all human inventions and uses a diverse technical and scientific vocabulary. A large part of it is concerned with chemistry, mechanics, computers, electricity and electronics. Necessarily, the IPC is thus a complex, hierarchical taxonomy, which has been refined for 30 years. Over 40 million documents have been classified in it worldwide. Furthermore, domain experts in national and regional patent offices currently classify patent documents manually. These experts have an intimate knowledge of the IPC and aim to provide excellent and consistent classifications (Tikk et al., 2005). From the economic side, Intellectual Property Rights (IPR) are becoming one of the most important mechanisms for businesses to extract economic value from creativity and to encourage greater investment in innovation.
The objective of this study is to propose a patent classification framework using SDL in the electrical domain. Deep learning has emerged as a new area of machine learning research since 2006. Deep learning (sometimes called feature learning or representation learning) is a set of machine learning algorithms that attempt to learn multi-layered models of inputs and to train complex, deep models on large amounts of data in order to solve a wide range of text mining and Natural Language Processing (NLP) tasks (Glorot et al., 2011; Liu et al., 2012). Restricted Boltzmann Machines (RBM) have been used as generative models of many different types of data, including labeled or unlabeled images and bags of words that represent documents. Feature learning tries to learn, at each level, a new transformation of the previously learned features that is able to reconstruct the original data. Greedy layer-wise unsupervised pre-training is based on training each layer with an unsupervised learning algorithm, taking the features produced at the previous level as input for the next level. Finally, the set of layers with learned weights can be stacked to initialize a deep supervised predictor, such as a neural network classifier, or a deep generative model, such as a Deep Boltzmann Machine (Glorot et al., 2011).
Our approach explores a patent document classification framework using SDL. The automatic document classification methodology consists of the following steps. Initially, the important terms are extracted from the patent documents and represented as feature vectors using VSM, and the feature vectors are weighted using TF-IDF based on the frequency of terms in a patent document. Next, the cosine measure is applied to find the similarities between the features, which are collected in a correlation matrix in order to synthesize the features into a smaller set representing the key features of the patent domain. Finally, the SDL is trained using the consolidated set of key features available in the correlation matrix. The output of the SDL gives the category of the corresponding patent document. Because deep learners can learn richer representations than shallow neural networks and other learners, the classification accuracy is expected to improve to some extent, especially for patent documents. Applying a deep learning technique to document classification is a new approach in the patent domain.
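The first two steps above (TF-IDF weighting of document vectors and cosine similarity) can be sketched as follows. This is a minimal illustration, not the weighting used in this study: it assumes the textbook form tf * log(N / df), whitespace tokenization, and three made-up documents. The function names `tf_idf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenized document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # ASSUMPTION: textbook weighting tf * log(N / df).
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Made-up toy documents, for illustration only.
docs = [
    "copper conductor insulated cable".split(),
    "insulated copper cable conductor sheath".split(),
    "socket outlet wall plug".split(),
]
vocab, vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # documents sharing terms: clearly positive
print(cosine(vecs[0], vecs[2]))  # no shared terms: zero
```

Documents that share vocabulary receive a positive similarity, while documents with disjoint vocabulary receive zero, which is the property the later feature-grouping step relies on.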

Document categorization:
Categorization can be divided into two principal phases. The first phase is document representation and the second phase is classification. Document categorization is the process of assigning a document to one or more pre-defined document classes (Antonie and Zaiane, 2002). Another method is document clustering, which splits a collection of documents into groups according to the similarity between documents. Similarity is measured by evaluating the key representative attributes and features of the documents. Both document categorization and document clustering extract and use the features of a document for group assignment. The main distinction between the two is that document categorization compares document features with predefined class features and selects the most suitable document class, whereas document clustering divides a set of documents into groups without using pre-defined classes.
Traditionally, documents are categorized by experts within a specific domain. Experts are costly and vary in capability, so the resulting classification is not always accurate and reliable. For these reasons, automatic document categorization has become an important research area.
Patent analysis: Patent classification is one of the application areas of text mining. Text classification approaches for patent classification problems have to manage simultaneously a very large hierarchy, large documents, a huge feature set and multi-labeled documents (Karki, 1997). The International Patent Classification (IPC) is a standard taxonomy (Tikk et al., 2005) developed and maintained by the World Intellectual Property Organization (WIPO). The IPC consists of about 80000 categories that cover the whole range of industrial technologies. There are 8 sections at the highest level of the hierarchy, 128 classes and 648 subclasses. The IPC is a complex hierarchical system with layers of increasing order. For example: section G, Physics; class G02, Optics; subclass G02C, Spectacles; main group G02C5, Construction of non-optical parts. Machine learning methods for text classification and their various challenges were surveyed by Sebastiani (2002, 2005). Issues concerning the representation of documents and learning were addressed by Salton et al. (1975). Hierarchies for classifying a large corpus of web content were implemented by Dumais and Chen (2000). There are a large number of statistical classification and machine learning techniques for text classification, including the KNN classifier (Ko and Seo, 2000), the centroid-based technique (Drazic et al., 2013), the Naïve Bayes classifier (McCallum and Nigam, 1998) and the Support Vector Machine classifier (Joachims, 1998). These machine learning techniques are applied to patent analysis (Fig. 1).

Document representation model:
The Vector Space Model (VSM) was proposed by Salton et al. (1975). In this model, each document is represented as a vector of features, and each feature is associated with a weight. Usually these features are simple words. The feature weight can simply be a Boolean indicating the presence or absence of the word in the document, the number of occurrences of the word in the document, or a value calculated by a formula such as the well-known tf*idf method. VSM has been widely used in traditional information retrieval and for automatic document categorization. There are three key steps: terms are first extracted from the document text, then the weights of the indexed terms are derived to improve the document retrieval accuracy, and then the documents are ranked with respect to a similarity measure. In VSM a document is a multi-dimensional vector in which each feature of the document is a dimension. For instance, Term Frequency (TF) and Inverse Document Frequency (IDF) are two features of a text document. After the vector of a text document is derived, a cosine function is applied to measure the similarity between two documents:

$$\mathrm{sim}(X, Y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \quad (1)$$

where $X = \{x_1, x_2, \ldots, x_n\}$, $x_i$ represents the i-th feature of document X, $Y = \{y_1, y_2, \ldots, y_n\}$, $y_i$ represents the i-th feature of document Y, and the similarity between X and Y is calculated using the cosine function.

Naive Bayes approach: Naive Bayes classification does not require observations for all possible combinations of the variables. The variables are assumed to be independent of each other. In other words, Naive Bayes classifiers assume that the influence of a variable is independent of the other variables for a given class, an assumption called class conditional independence (McCallum and Nigam, 1998). This algorithm uses the joint probability of the document features to calculate the probability that a new document belongs to a specific class.
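The class conditional independence assumption can be illustrated with a small multinomial Naive Bayes classifier, which scores a class c for a document with terms t_1..t_n as log P(c) + sum_i log P(t_i | c). This is a hedged sketch rather than the classifier evaluated in this study: the add-one (Laplace) smoothing, the skipping of unseen terms, the function names and the four toy training documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs. Returns the fitted counts."""
    class_counts = Counter(label for _, label in labeled_docs)
    term_counts = defaultdict(Counter)
    for tokens, label in labeled_docs:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in labeled_docs for t in tokens}
    return {"class_counts": class_counts, "term_counts": term_counts,
            "vocab": vocab, "n_docs": len(labeled_docs)}


def predict(model, tokens):
    """Return the class with the highest log posterior score."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_c in model["class_counts"].items():
        total_terms = sum(model["term_counts"][label].values())
        score = math.log(n_c / model["n_docs"])  # log prior P(c)
        for t in tokens:
            if t not in model["vocab"]:
                continue  # ASSUMPTION: terms unseen in training are ignored
            # ASSUMPTION: add-one smoothing of P(t | c).
            count = model["term_counts"][label][t]
            score += math.log((count + 1) / (total_terms + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training set using two of the four domains.
training = [
    ("copper wire cable".split(), "Conductors"),
    ("insulated copper conductor".split(), "Conductors"),
    ("wall socket outlet".split(), "Outlets"),
    ("power outlet plug".split(), "Outlets"),
]
model = train_naive_bayes(training)
print(predict(model, ["copper", "cable"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.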

State-of-art classifiers
Naïve Bayes is used to construct an unstructured text classifier with sufficient accuracy.

K-Nearest Neighbor (KNN):
kNN is an algorithm that uses previously labeled training documents to classify new documents based on a similarity measure (Ko and Seo, 2000). kNN uses the distance between two document vectors as a measure of their similarity. The similarity $z(x, c_i)$ is used as the confidence score indicating that x belongs to a particular category $c_i$ and is calculated as:

$$z(x, c_i) = \sum_{d_j \in kNN(x)} \mathrm{sim}(x, d_j)\, y(d_j, c_i) - b_i$$

where $\mathrm{sim}(x, d_j)$ is the similarity between the test document x and the training document $d_j$, calculated using the Euclidean distance or the cosine value between the two document vectors; $y(d_j, c_i)$ is 1 (or 0) when the training document $d_j$ belongs (or does not belong) to $c_i$; and $b_i$ is the threshold for classifying a document into category $c_i$.
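The confidence score described above can be sketched as follows: the similarities of the k nearest training documents that belong to a category are summed and the category threshold is subtracted. This is an illustrative sketch under assumptions: cosine similarity is used (the text also allows Euclidean distance), thresholds default to zero, and the vectors and the names `knn_scores` and `thresholds` are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def knn_scores(x, training, k=3, thresholds=None):
    """training: list of (vector, category). Returns {category: confidence score}."""
    thresholds = thresholds or {}  # ASSUMPTION: missing thresholds count as 0
    # Keep the k training documents most similar to x.
    neighbours = sorted(training, key=lambda item: cosine(x, item[0]),
                        reverse=True)[:k]
    categories = {label for _, label in training}
    return {
        c: sum(cosine(x, d) for d, label in neighbours if label == c)
           - thresholds.get(c, 0.0)
        for c in categories
    }


# Made-up two-dimensional document vectors.
training = [
    ([1.0, 0.0], "Conductors"),
    ([0.9, 0.1], "Conductors"),
    ([0.0, 1.0], "Outlets"),
    ([0.1, 0.9], "Outlets"),
]
x = [1.0, 0.2]
scores = knn_scores(x, training, k=3)
print(max(scores, key=scores.get))
```

The test document is assigned to the category with the highest score; with per-category thresholds, a document could also be assigned to every category whose score is positive.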

Genetic algorithm:
The genetic algorithm (Holland, 1992) works well on mixed (continuous and discrete) and combinatorial problems. Three operational components (selection, crossover and mutation) are used to generate various models. The GA can provide an optimal solution, but it is possible for the algorithm to get stuck at locally optimal solutions. Another shortcoming of the GA is its computational requirements, which may not be a concern when strong computing power is available.

denotes the number of times the m-th one-length unique term is repeated in the a-th document, where a varies from 1 to N excluding the n-th document. The term frequency-inverse document frequency of the one-length unique term is then calculated from the term frequency and the inverse document frequency of the same one-length unique term. The document vectors are constructed from these values using the standard VSM. Similarly, two-length unique terms (phrases) are identified and weighted.

Correlation matrix:
The document vectors are grouped based on the similarity of their terms using the cosine similarity measure of Eq. (1). A domain term that appears in a document with high frequency is regarded as a significant keyword or key feature. After extracting all high-frequency terms, a correlation matrix of terms is created from their frequencies of occurrence within the same documents. The correlation of two key features ($kp_i$, $kp_j$) that appear in a set of patent documents is determined using:

$$R_{ij} = \frac{\sum_{d=1}^{N} (kp_{i,d} - \overline{kp}_i)(kp_{j,d} - \overline{kp}_j)}{\sqrt{\sum_{d=1}^{N} (kp_{i,d} - \overline{kp}_i)^2}\,\sqrt{\sum_{d=1}^{N} (kp_{j,d} - \overline{kp}_j)^2}}$$

where $R_{ij}$ is the correlation of $kp_i$ and $kp_j$ that appear in a set of patent documents; $kp_{i,d}$ is the frequency with which $kp_i$ appears in document d; $kp_{j,d}$ is the frequency with which $kp_j$ appears in document d; $\overline{kp}_i$ is the average frequency of $kp_i$ over all documents (all d); $\overline{kp}_j$ is the average frequency of $kp_j$ over all documents (all d); and N is the total number of documents. The highly correlated key features are selected and stored as the related features list. Finally, the key features list is completed when the highly correlated features are merged, a necessary step since it is easier to train SDL models using fewer variables.
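The term correlation step can be sketched with a plain Pearson correlation over per-document term frequencies. This is only an illustration: the frequency table, the 0.8 cut-off for "highly correlated" and the function names are assumptions, since the text does not state a threshold.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length frequency lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom else 0.0


def correlation_matrix(freq):
    """freq: {term: [frequency in doc 1, ..., frequency in doc N]}."""
    terms = sorted(freq)
    return terms, [[pearson(freq[s], freq[t]) for t in terms] for s in terms]


def related_terms(terms, matrix, threshold=0.8):
    """ASSUMPTION: terms count as related when their correlation >= threshold."""
    return {
        s: [t for j, t in enumerate(terms) if j != i and matrix[i][j] >= threshold]
        for i, s in enumerate(terms)
    }


# Made-up frequencies of three terms in four documents.
freq = {
    "cable": [3, 2, 0, 0],
    "conductor": [4, 3, 0, 1],
    "outlet": [0, 0, 5, 2],
}
terms, matrix = correlation_matrix(freq)
related = related_terms(terms, matrix)
print(related)
```

In this toy table "cable" and "conductor" rise and fall together across documents and are therefore candidates for merging, while "outlet" is negatively correlated with both.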
When a new document is uploaded into a patent knowledge management system, the key features and their frequencies are extracted from the document, and the frequencies of all terms are derived. Then $kpf_i$ denotes the frequency of the key term $kp_i$ in the document and $kpf_j$ the frequency of the related term $kp_j$. The correlation of $kp_j$ and $kp_i$ is denoted $R_{ij}$, and the final frequency of $kp_i$, the CTF, is computed from these quantities. After calculating the CTF of all key features, the document is represented by the vector of key feature frequencies $(CTF_1, CTF_2, \ldots, CTF_n)$. The correlation matrix is constructed from the vectors so generated, mapping the top features to the corresponding documents. This matrix is used to train the SDL.
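One plausible reading of the final frequency is that each key term's own count is increased by the counts of its related terms, weighted by their correlation. The sketch below implements that reading; the additive form (own frequency plus the sum of correlation times related-term frequency) is an assumption and not a formula taken from this text, and all names and numbers are hypothetical.

```python
def combined_term_frequencies(key_freq, related_freq, correlation):
    """
    key_freq:     {key_term: frequency in the new document}
    related_freq: {related_term: frequency in the new document}
    correlation:  {(key_term, related_term): correlation from the training set}
    """
    ctf = {}
    for key, freq in key_freq.items():
        # ASSUMPTION: related-term counts contribute in proportion to their
        # correlation with the key term.
        bonus = sum(r * related_freq.get(rel, 0)
                    for (k, rel), r in correlation.items() if k == key)
        ctf[key] = freq + bonus
    return ctf


# Made-up frequencies and correlations.
key_freq = {"conductor": 3, "outlet": 1}
related_freq = {"cable": 2, "socket": 4}
correlation = {("conductor", "cable"): 0.9, ("outlet", "socket"): 0.5}

ctf = combined_term_frequencies(key_freq, related_freq, correlation)
vector = [ctf[k] for k in sorted(ctf)]  # fixed term order gives the document vector
print(vector)
```

Keeping the key terms in a fixed order ensures that the same position in every document vector refers to the same feature.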

Document categorization using SDL:
In this section, we describe the document categorization methodology based on the Semantic Deep Learner (SDL). Deep learning is a class of machine learning techniques that exploit many layers of nonlinear information processing for supervised or unsupervised feature extraction and transformation and for pattern analysis and classification. An advantage of SDL is that it can achieve the target output without changing the network structure. Another advantage of using SDL is its rapid execution when a trained network is applied. The learning stage of SDL involves a pre-training phase and a fine tuning phase for reconstruction. The structure of the simple learning model is depicted in Fig. 3. The two passes of SDL are described in the following sections.
The hidden layer provides a recurrent connection to S(t-1) and thus a short-term memory that models the context of a word. The generated feature frequency matrix is given as input to the SDL. A deep architecture is analogous to the multi-layer physical structure of the human cerebral cortex. The neocortex, which is associated with many cognitive abilities, has a complex multi-layer hierarchy. The development of intelligence goes along with this multi-layer structure. From an evolutionary viewpoint, the phylogenetically most recent part of the brain is the neocortex. In humans and other primates, starting from the catarrhines, the multi-layer structure began to appear in the neocortex. A deep architecture can therefore be seen as mirroring the result of the evolution of human intelligence. It thus provides a possible way to achieve the ultimate target of natural language processing, which is to enable the computer to understand human (natural) languages. The deep architecture of our proposed system is depicted in Fig. 4.
Here, the correlation matrix formed from all the documents taken for training is given as input to the semantic deep learner. The deep learner trains the system based on the given target. Training is done by the hidden layer of the deep learner, which exploits the target and the input. The target is the topic (domain) to which each training document belongs, which is known for the documents taken for training. After training the system based on the semantic deep learner, testing is done by supplying a test document. When a test document is given as input to the system, the correlation matrix is formed for the input document using the keywords that formed the correlation matrix in the training process. The semantic deep learner gives a score for the input document, and based on this score the document is assigned to a category.

Fig. 4: Deep architecture of our proposed system

Pre-training phase:
The pre-training phase starts from the input layer. The input of every node is calculated. Then the output of the activation function of each node is derived and passed to the next layer, where processing continues until the final output layer is reached. The net input from the input layer to hidden layer node j is calculated using:

$$net_j = \sum_i w_{ij} x_i + \theta_j$$

where $w_{ij}$ is the weight of the connection between input layer node i and hidden layer node j, $x_i$ is the input of node i and $\theta_j$ is the bias associated with node j. The output of node j is determined using the activation function f(x):

$$h_j = f(net_j)$$

The net input from the hidden layer to output layer node k is computed using:

$$net_k = \sum_j w_{jk} h_j$$

where $w_{jk}$ is the weight of the connection between hidden layer node j and output layer node k. Finally, the output of the SDL is:

$$o_k = g(net_k)$$

where g(x) denotes the activation function of node k. The error of the network is expressed as:

$$E = \frac{1}{2} \sum_k (t_k - o_k)^2$$

where $t_k$ is the real output of the training data.

Fine tuning phase: In this phase, the error is propagated back from the output layer to the previous hidden layer. Fine tuning is used for determining errors and adjusting weights. Since E is a function of $o_k$ and $o_k$ is a function of $w_{jk}$, the weight adjustment between the output layer and the hidden layer, $\Delta w_{jk}$, can be expressed as:

$$\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = \eta\, \delta_k\, h_j$$

where $\eta$ is the learning rate and $\delta_k = (t_k - o_k)\, g'(net_k)$. Similarly, since $o_k$ is a function of $h_j$ and $h_j$ is a function of $w_{ij}$, the weight adjustment between the hidden layer and the input layer is:

$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$$

Therefore, the weight adjustment is:

$$\Delta w_{ij} = \eta\, \delta_j\, x_i, \qquad \delta_j = f'(net_j) \sum_k \delta_k w_{jk}$$

where $\delta_j$ is the output error of layer j and $x_i$ is the input of node i.
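The forward pass and the weight adjustment described in these two phases can be sketched for a network with a single hidden layer. This is a simplified illustration, not the SDL itself: it assumes sigmoid activations for both layers, a made-up three-feature input with two target classes, a learning rate of 0.5, and a bias update that the text does not spell out. All names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_ih, theta, w_ho):
    """Return hidden outputs h_j and network outputs o_k."""
    # Hidden node j: weighted sum of the inputs plus bias, then activation.
    h = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(len(x))) + theta[j])
         for j in range(len(theta))]
    # Output node k: weighted sum of the hidden outputs, then activation.
    o = [sigmoid(sum(w_ho[j][k] * h[j] for j in range(len(h))))
         for k in range(len(w_ho[0]))]
    return h, o


def train_step(x, t, w_ih, theta, w_ho, eta=0.5):
    """One gradient step on one sample; returns the error before the update."""
    h, o = forward(x, w_ih, theta, w_ho)
    err = 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))
    # Output error term; the derivative of the sigmoid is o * (1 - o).
    delta_o = [(t[k] - o[k]) * o[k] * (1 - o[k]) for k in range(len(o))]
    # Hidden error term, computed with the weights before they are updated.
    delta_h = [h[j] * (1 - h[j]) * sum(delta_o[k] * w_ho[j][k] for k in range(len(o)))
               for j in range(len(h))]
    for j in range(len(h)):
        for k in range(len(o)):
            w_ho[j][k] += eta * delta_o[k] * h[j]
    for i in range(len(x)):
        for j in range(len(h)):
            w_ih[i][j] += eta * delta_h[j] * x[i]
    for j in range(len(h)):
        theta[j] += eta * delta_h[j]  # ASSUMPTION: biases follow the same rule
    return err


random.seed(0)
n_in, n_hidden, n_out = 3, 4, 2
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
theta = [0.0] * n_hidden
w_ho = [[random.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]

x, t = [0.2, 0.7, 0.1], [1.0, 0.0]  # made-up normalized features and target
errors = [train_step(x, t, w_ih, theta, w_ho) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on the same sample should drive the error down, which is a quick sanity check that the signs of the two update rules agree with the error definition.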
As described in the section above, the key term frequency (CTF) vector is used to represent a patent document. Before the vector is imported into the SDL, all key phrase frequencies in the vector are normalized to values between 0 and 1 using a transformation function. The output values represent the goodness-of-fit between a test document and all potential classes. In our study, the IPC provides the target classes for the patent documents.
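The transformation function is not specified in the surviving text. As one common choice, min-max scaling maps a frequency vector onto the interval from 0 to 1; the sketch below assumes that choice purely for illustration, and the function name is hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # ASSUMPTION: a constant vector carries no contrast, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([2, 4, 6]))
```

Other transformations, such as dividing by the largest frequency or applying a logistic function, would also keep the inputs within the range in which sigmoid-type units are most sensitive.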

Evaluation metrics:
To analyze the performance of classification, we adopt the following measures. Four cases are considered as the result of applying the classifier to a document.

TP (True Positive):
The number of documents correctly classified to the respective class.

TN (True Negative):
The number of documents correctly rejected from the class.

FP (False Positive):
The number of documents incorrectly classified to the class.

FN (False Negative):
The number of documents incorrectly rejected from the class.

Using these quantities, the performance of the classification is evaluated in terms of precision (pr), recall (re) and the F1 measure. Recall is the number of correct assignments made by the system divided by the total number of correct assignments. Precision is the number of correct assignments made by the system divided by the total number of the system's assignments. The F1 measure combines recall and precision with equal weight.

$$pr(C_i) = \frac{TP(C_i)}{TP(C_i) + FP(C_i)} \quad (21)$$

$$re(C_i) = \frac{TP(C_i)}{TP(C_i) + FN(C_i)}$$

$$F1(C_i) = \frac{2 \cdot pr(C_i) \cdot re(C_i)}{pr(C_i) + re(C_i)}$$
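The three measures can be computed per class from lists of true and predicted labels, as in the sketch below. Here a false positive is a document assigned to the class that does not belong to it, and a false negative is a document of the class that was assigned elsewhere. The label lists are made up for illustration and the function name is hypothetical.

```python
def per_class_metrics(gold, predicted, target):
    """Return (precision, recall, F1) for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == target and g == target)
    fp = sum(1 for g, p in pairs if p == target and g != target)
    fn = sum(1 for g, p in pairs if p != target and g == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up gold and predicted labels for six documents.
gold = ["Conductors", "Conductors", "Connectors", "Outlets", "Conductors", "Devices"]
pred = ["Conductors", "Connectors", "Connectors", "Outlets", "Conductors", "Conductors"]

pr, re, f1 = per_class_metrics(gold, pred, "Conductors")
print(pr, re, f1)
```

For the class "Conductors" in this toy data there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3.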

EXPERIMENTAL RESULTS
The data set taken for analysis consists of 120 patents in the electrical field downloaded from http://www.freepatentsonline.com. These patents were classified into the following four domains: Conductors, Connectors, Devices and Outlets.

CONCLUSION
The approach proposed in this study offers an efficient solution for patent document classification using SDL. There are certain shortcomings in applying SDL: insufficient training data may lead to an unreliable model, and the training procedure requires considerable computing resources. However, since a well-trained semantic model can help companies manage documents better, the cost of the computing resources can be easily justified. The patent documents extracted from WIPO are originally classified using a hierarchical classification scheme. After the extraction of key terms and their frequencies, the resulting key feature base adequately represents the characteristics of the patent documents. The ability of the deep learning approach to discover hidden structures and features at different levels of abstraction is useful for classifying patents efficiently. The accuracy of the trained model is found to be better than that of the other classifiers, although this has been evidenced only with electrical patents. Extensions of this study involve combining the deep learner with other approaches, such as semantic smoothing models and ontology-based feature extraction methods, to improve the flexibility and accuracy of the current method. Besides the classification of patent documents, our work presents a framework that can be used to extract more meaningful data representations for the analysis of other types of documents.

Various document classifiers have been proposed by previous researchers, including the Support Vector Machine (SVM), k-Nearest-Neighbor (KNN), Naive Bayes (NB), Neural Networks (NN) and Genetic Algorithms.

Support vector machine: The Support Vector Machine (SVM), applied to text categorization by Joachims (1998), is a supervised learning algorithm that can be applied to classification. It is a binary linear classifier that separates the positive and negative examples in a training set. The method finds the hyperplane that separates positive examples from negative examples while ensuring that the margin between the nearest positives and negatives is maximal. SVM has been reported to be more effective than other methods of text classification. SVM builds a model that represents the training examples as points in a multi-dimensional space separated by the hyperplane, and it uses this model to predict on which side of the hyperplane a new example lies. Once the hyperplane has been found, the remaining training examples are no longer needed and only the support vectors are used to classify a new case. This makes it a very fast method (Fig. 2).
frequency of the m-th one-length unique term of the n-th document; and the number of times the m-th one-length unique term is repeated in the n-th document. The inverse document frequency (idf) of the one-length unique term is calculated by summing the frequency of the one-length unique term in all documents other than the respective document.

Fig. 5: Sample experimental work

Figure 5 depicts the sample experimental work. Figure 6 depicts the details of the domains, key terms, a sample patent in the electric conductor domain, the testing documents and the correlation matrix [document vs. term] of the training and testing sets.
Figure 7 depicts the output of SDL showing the predicted category with accuracy.Figure 8 represents the classification accuracy of our proposed system based on the claim information.The result reveals that our proposed approach for classification outperforms other classifiers in the patent domain.

Table 1: Comparison of different categorization approaches

Artificial neural network: An Artificial Neural Network (ANN) is an information processing method inspired by biological nervous systems. It consists of interconnected processing nodes computationally linked to solve problems. Neural networks are frequently used for pattern recognition and document classification and learn by using training data to adjust the weights between connecting nodes. Some research has applied artificial neural networks to text classification. Document clusters were generated using a thesaurus and a neural network (Farkas, 1994). Adaptive Resonance Theory (Massey, 2003) was used to cluster documents. A web-page classification system (Selamat and Omatu, 2004) uses a neural network with inputs obtained by principal component analysis and class profile-based features that contain the most regular words in each class. Table 1 shows the advantages and shortcomings of the commonly used categorization models.