Parameter tuning Naïve Bayes for automatic patent classification

https://doi.org/10.1016/j.wpi.2020.101968

Abstract

I present an analysis of feature selection for automatic patent categorization. For a corpus of 7,309 patent applications from the World Patent Information (WPI) Test Collection (Lupu, 2019), I assign International Patent Classification (IPC) section codes using a modified Naïve Bayes classifier. I compare precision, recall, and f-measure for a variety of meta-parameter settings including data smoothing and acceptance threshold. Finally, I apply the optimized model to IPC class and group codes and compare the results of patent categorization to academic literature.

Introduction

In an era of exponential technological growth, business intelligence professionals are more in need than ever of an organized patent landscape in which to conduct technology forecasting and industry positioning. However, the construction of such a system requires time and trained experts, both of which are expensive investments for such a small part of any actual analysis.

A natural solution is to employ machine learning (ML), a branch of artificial intelligence that uses statistical information to find patterns and make inferences. The primary benefit of using ML is that these algorithms do not require explicit instruction. Rather, they require the analyst to choose the features and representation he or she believes to be most informative.

Computational analysis for patent applications, specifically, relies heavily on bibliometric features such as author and reference networks. However, these features are impractical for many real-world applications, in which the analyst is generally working in a single technological space and the network is too dense to convey any meaningful information.

Instead, I focus on finding patterns in the unstructured text of fields such as title, abstract, and claims. In a ML context, this means comparing the distribution of word types between documents, a technique that has found success in related fields including biomedical and academic literature.

Patent text presents unique challenges for this approach, not least of which is the heavy use of jargon. Industry-specific language lowers vocabulary density and results in a sparse search space for the algorithm. Additionally, the intentionally non-standardized language may help an applicant broaden the patent’s scope or reduce the likelihood of infringement, but it results in noise for the machine learner, making it difficult to find clear patterns.

To combat these restrictions, I propose a modified Naïve Bayes classifier,1 described in detail in Section 3.1. I show that my model achieves 34.26% F1 with minimal training data on a classification task assigning section codes from the International Patent Classification (IPC) and compare this performance to the assignment of finer-grained IPC codes and to the classification of academic literature. Though modest in comparison to many automatic classification efforts, this performance is achieved at the hands of a system designed to be completely domain-neutral and to perform above chance with training sets of 500 documents or fewer.
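The modified classifier itself is detailed in Section 3.1; its core mechanics, however, are those of a standard multinomial Naïve Bayes with the two meta-parameters tuned in this paper: add-alpha (Laplace) smoothing and an acceptance threshold on the posterior. The sketch below illustrates those mechanics only — the function names and the normalization of log-scores into posteriors are illustrative assumptions, not the author's implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model with add-alpha (Laplace) smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    Returns log-priors and per-class log-likelihoods.
    """
    vocab = set()
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    priors = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        likelihoods[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
        likelihoods[c]["<UNK>"] = math.log(alpha / denom)  # unseen words
    return priors, likelihoods

def classify(tokens, priors, likelihoods, threshold=0.0):
    """Return the best class, or None if its posterior falls below the threshold."""
    scores = {}
    for c in priors:
        ll = likelihoods[c]
        scores[c] = priors[c] + sum(ll.get(w, ll["<UNK>"]) for w in tokens)
    # Normalize log-scores into posteriors so the acceptance threshold
    # can be expressed as a probability.
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    best = max(exp, key=exp.get)
    return best if exp[best] / z >= threshold else None
```

With `threshold=0.0` the classifier always commits to its best guess; raising the threshold trades recall for precision by abstaining on low-confidence documents.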

Section snippets

Previous work

The earliest attempts to classify documents by machine instead of by hand were rule-based classifiers for which knowledge engineers and domain experts created if-then rules for sorting documents based on their content [1]. These systems performed rather well on the data they were designed to classify, but their rules were expensive to create and not generalizable to new datasets. Due to this constraint, machine learning has been the preferred approach to text classification since the early 1990s.

Algorithm

The choice of algorithm for such a task depends entirely on the end-user, and how we envision him or her using the classification product. As the typical analyst will be working with one small technological sector at a time, we expect the categories to change frequently and vary in size and number, and the data to have quite a bit of overlap between categories as well.

One of the most popular and straightforward algorithms is K-Nearest Neighbors (KNN) [33], in which documents are classified by a majority vote of the k training documents most similar to them.
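For comparison with the Naïve Bayes approach, a bare-bones KNN text classifier can be sketched as follows; the use of cosine similarity over raw term counts, and all names, are illustrative assumptions rather than the cited formulation:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query_tokens, train, k=3):
    """Classify by majority vote of the k most similar training documents.

    train: list of (token_list, label) pairs.
    """
    q = Counter(query_tokens)
    ranked = sorted(train, key=lambda d: cosine(q, Counter(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```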

Optimization

All machine learning models rely on optimized feature selection and meta-parameters. In text categorization, this means we want to choose the most informative text fields and numerical thresholds. In the following section, I detail the decisions that the analyst needs to make when using Naïve Bayes for classification, as well as the values these parameters might take, all of which are tested and compared to produce the optimal setting for IPC code assignment.
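Testing and comparing every parameter setting, as described above, amounts to an exhaustive grid search. A generic sketch follows; the `train_fn`/`eval_fn` interface and the choice of F1 as the selection metric are assumptions for illustration, not the paper's actual harness:

```python
from itertools import product

def grid_search(train_fn, eval_fn, alphas, thresholds):
    """Evaluate every (smoothing, threshold) pair and keep the best-scoring one.

    train_fn(alpha) -> model; eval_fn(model, threshold) -> score (e.g. F1).
    Returns ((best_alpha, best_threshold), best_score).
    """
    best_params, best_score = None, float("-inf")
    for alpha, thresh in product(alphas, thresholds):
        model = train_fn(alpha)
        score = eval_fn(model, thresh)
        if score > best_score:
            best_params, best_score = (alpha, thresh), score
    return best_params, best_score
```

Because the two parameters interact (heavier smoothing flattens the posteriors that the threshold filters), evaluating the full cross product rather than tuning each in isolation is the safer default.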

Corpus

In order to empirically test for the optimal settings of my Naïve Bayes classifier, I sort a corpus of 7,309 patent applications from the World Patent Information (WPI) Test Collection [40]. This selection represents all publications from the week of January 2, 2014 submitted to the United States Patent and Trademark Office. All 8 IPC sections are represented, though the distribution is highly skewed. In particular, almost 65 times as many records are tagged with Section G as Section D.

This model is

Results

I first compare performance for various combinations of text fields in order to determine the most useful for patent classification. As discussed in Section 4.1, I explore the use of Title (T), Abstract (A), Claims (C), and Description (D) by training and testing on every possible combination and taking the average performance of every run utilizing a given field. Thus, the performance metrics given for Title in Fig. 1 are the average from models trained on {T}, {T, A}, {T, C}, {T, D}, {T, A, C}, {T, A, D}, {T, C, D}, and {T, A, C, D}.
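This averaging scheme can be made concrete with a short sketch; `FIELDS` and the score dictionary below are illustrative stand-ins for the paper's actual experimental runs:

```python
from itertools import combinations

FIELDS = ["T", "A", "C", "D"]  # Title, Abstract, Claims, Description

def field_combinations():
    """Every non-empty combination of the four text fields (15 in total)."""
    combos = []
    for r in range(1, len(FIELDS) + 1):
        combos.extend(combinations(FIELDS, r))
    return combos

def average_for_field(field, scores):
    """Mean score over every run whose field set includes `field`.

    scores: dict mapping a field combination (tuple) to its metric, e.g. F1.
    """
    runs = [s for combo, s in scores.items() if field in combo]
    return sum(runs) / len(runs)
```

Each field thus appears in exactly 8 of the 15 runs, so each reported average pools 8 models.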

Conclusions and future work

The results of these parameter tuning experiments clearly show the difficulty in applying traditional machine learning methods to the language of patent applications. While many domains can expect an F1 of 80% or above for very little training data, this model reaches 40% at its peak. This figure is modest compared to many patent classification works described in Section 2, but it is noteworthy that this result is achieved with a training set of just 317 documents as opposed to the 2000 used

CRediT authorship contribution statement

Caitlin Cassidy: Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization.

Declaration of Competing Interest

The author declares that she has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Caitlin Cassidy is a data scientist at Search Technology, Inc. in Norcross, GA. She graduated with an MS in Artificial Intelligence from the University of Georgia in May of 2015 and is working to complete her MA in Linguistics at the University of Illinois at Urbana-Champaign. Her work centers on machine learning for intellectual property and scientific publication language.

References (41)

  • David D. Lewis, Marc Ringuette, A comparison of two learning algorithms for text categorization, in: Third Annual...
  • Yiming Yang, et al., A re-examination of text categorization methods
  • Frank Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev. (1958)
  • Sepp Hochreiter, et al., Long short-term memory, Neural Comput. (1997)
  • Christopher M. Bishop, Neural Networks for Pattern Recognition (1995)
  • Douglas Teodoro, Julien Gobeill, Emilie Pasche, P. Ruch, D. Vishnyakova, Christian Lovis, Automatic IPC encoding and...
  • Daniel Eisinger, et al., Automated patent categorization and guided patent search using IPC as inspired by MeSH and PubMed, J. Biomed. Semant. (2013)
  • Xin Li, et al., Automatic patent classification using citation network information: An experimental study in nanotechnology
  • Milan Agatonovic, et al., Large-scale, parallel automatic patent annotation
  • C.J. Fall, et al., Automated categorization in the international patent classification, SIGIR Forum (2003)

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.