Parameter tuning Naïve Bayes for automatic patent classification

https://doi.org/10.1016/j.wpi.2020.101968

Abstract

I present an analysis of feature selection for automatic patent categorization. For a corpus of 7,309 patent applications from the World Patent Information (WPI) Test Collection (Lupu, 2019), I assign International Patent Classification (IPC) section codes using a modified Naïve Bayes classifier. I compare precision, recall, and f-measure for a variety of meta-parameter settings including data smoothing and acceptance threshold. Finally, I apply the optimized model to IPC class and group codes and compare the results of patent categorization to academic literature.

Introduction

In an era of exponential technological growth, business intelligence professionals are more in need than ever of an organized patent landscape in which to conduct technology forecasting and industry positioning. However, the construction of such a system requires time and trained experts, both of which are expensive investments for such a small part of any actual analysis.

A natural solution is to employ machine learning (ML), a branch of artificial intelligence that uses statistical information to find patterns and make inferences. The primary benefit of using ML is that these algorithms do not require explicit instruction. Rather, they require the analyst to choose the features and representation he or she believes to be most informative.

Computational analysis for patent applications, specifically, relies heavily on bibliometric features such as author and reference networks. However, these features are impractical for many real-world applications, in which the analyst is generally working in a single technological space and the network is too dense to convey any meaningful information.

Instead, I focus on finding patterns in the unstructured text of fields such as title, abstract, and claims. In a ML context, this means comparing the distribution of word types between documents, a technique that has found success in related fields including biomedical and academic literature.

Patent text presents unique challenges for this approach, not least of which is the heavy use of jargon. Industry-specific language lowers vocabulary density and results in a sparse search space for the algorithm. Additionally, the intentionally non-standardized language may help an applicant broaden the patent’s scope or reduce the likelihood of infringement, but it results in noise for the machine learner, making it difficult to find clear patterns.

To combat these restrictions, I propose a modified Naïve Bayes classifier,1 described in detail in Section 3.1. I show that my model achieves 34.26% F1 with minimal training data on a classification task assigning section codes from the International Patent Classification (IPC) and compare this performance to the assignment of finer-grained IPC codes and to the classification of academic literature. Though modest in comparison to many automatic classification efforts, this performance is achieved at the hands of a system designed to be completely domain-neutral and to perform above chance with training sets of 500 documents or fewer.
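The modified classifier itself is detailed in Section 3.1; its core mechanics, however, are those of a standard multinomial Naïve Bayes with the two meta-parameters tuned in this paper: add-alpha (Laplace) smoothing and an acceptance threshold on the posterior. The sketch below illustrates those mechanics only — the function names and the normalization of log-scores into posteriors are illustrative assumptions, not the author's implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial Naive Bayes model with add-alpha (Laplace) smoothing.

    docs: list of token lists; labels: parallel list of class labels.
    Returns log-priors and per-class log-likelihoods.
    """
    vocab = set()
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    priors = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        likelihoods[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
        likelihoods[c]["<UNK>"] = math.log(alpha / denom)  # unseen words
    return priors, likelihoods

def classify(tokens, priors, likelihoods, threshold=0.0):
    """Return the best class, or None if its posterior falls below the threshold."""
    scores = {}
    for c in priors:
        ll = likelihoods[c]
        scores[c] = priors[c] + sum(ll.get(w, ll["<UNK>"]) for w in tokens)
    # Normalize log-scores into posteriors so the acceptance threshold
    # can be expressed as a probability.
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp.values())
    best = max(exp, key=exp.get)
    return best if exp[best] / z >= threshold else None
```

With `threshold=0.0` the classifier always commits to its best guess; raising the threshold trades recall for precision by abstaining on low-confidence documents.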

Section snippets

Previous work

The earliest attempts to classify documents by machine instead of by hand were rule-based classifiers for which knowledge engineers and domain experts created if-then rules for sorting documents based on their content [1]. These systems performed rather well on the data they were designed to classify, but their rules were expensive to create and not generalizable to new datasets. Due to this constraint, machine learning has been the preferred approach to text classification since the early 1990s.

Algorithm

The choice of algorithm for such a task depends entirely on the end-user, and how we envision him or her using the classification product. As the typical analyst will be working with one small technological sector at a time, we expect the categories to change frequently and vary in size and number, and the data to have quite a bit of overlap between categories as well.

One of the most popular and straightforward algorithms is K-Nearest Neighbors (KNN) [33], in which documents are classified by a majority vote of the k training documents most similar to them.
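For comparison with the Naïve Bayes approach, a bare-bones KNN text classifier can be sketched as follows; the use of cosine similarity over raw term counts, and all names, are illustrative assumptions rather than the cited formulation:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query_tokens, train, k=3):
    """Classify by majority vote of the k most similar training documents.

    train: list of (token_list, label) pairs.
    """
    q = Counter(query_tokens)
    ranked = sorted(train, key=lambda d: cosine(q, Counter(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```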

Optimization

All machine learning models rely on optimized feature selection and meta-parameters. In text categorization, this means we want to choose the most informative text fields and numerical thresholds. In the following section, I detail the decisions that the analyst needs to make when using Naïve Bayes for classification, as well as the values these parameters might take, all of which are tested and compared to produce the optimal setting for IPC code assignment.
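Testing and comparing every parameter setting, as described above, amounts to an exhaustive grid search. A generic sketch follows; the `train_fn`/`eval_fn` interface and the choice of F1 as the selection metric are assumptions for illustration, not the paper's actual harness:

```python
from itertools import product

def grid_search(train_fn, eval_fn, alphas, thresholds):
    """Evaluate every (smoothing, threshold) pair and keep the best-scoring one.

    train_fn(alpha) -> model; eval_fn(model, threshold) -> score (e.g. F1).
    Returns ((best_alpha, best_threshold), best_score).
    """
    best_params, best_score = None, float("-inf")
    for alpha, thresh in product(alphas, thresholds):
        model = train_fn(alpha)
        score = eval_fn(model, thresh)
        if score > best_score:
            best_params, best_score = (alpha, thresh), score
    return best_params, best_score
```

Because the two parameters interact (heavier smoothing flattens the posteriors that the threshold filters), evaluating the full cross product rather than tuning each in isolation is the safer default.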

Corpus

In order to empirically test for the optimal settings of my Naïve Bayes classifier, I sort a corpus of 7,309 patent applications from the World Patent Information (WPI) Test Collection [40]. This selection represents all publications from the week of January 2, 2014 submitted to the United States Patent and Trademark Office. All 8 IPC sections are represented, though the distribution is highly skewed. In particular, almost 65 times as many records are tagged with Section G as Section D.

This model is

Results

I first compare performance for various combinations of text fields in order to determine the most useful for patent classification. As discussed in Section 4.1, I explore the use of Title (T), Abstract (A), Claims (C), and Description (D) by training and testing on every possible combination and taking the average performance of every run utilizing a given field. Thus, the performance metrics given for Title in Fig. 1 are the average from models trained on {T}, {T, A}, {T, C}, {T, D}, {T, A, C}, {T, A, D}, {T, C, D}, and {T, A, C, D}.
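This averaging scheme can be made concrete with a short sketch; `FIELDS` and the score dictionary below are illustrative stand-ins for the paper's actual experimental runs:

```python
from itertools import combinations

FIELDS = ["T", "A", "C", "D"]  # Title, Abstract, Claims, Description

def field_combinations():
    """Every non-empty combination of the four text fields (15 in total)."""
    combos = []
    for r in range(1, len(FIELDS) + 1):
        combos.extend(combinations(FIELDS, r))
    return combos

def average_for_field(field, scores):
    """Mean score over every run whose field set includes `field`.

    scores: dict mapping a field combination (tuple) to its metric, e.g. F1.
    """
    runs = [s for combo, s in scores.items() if field in combo]
    return sum(runs) / len(runs)
```

Each field thus appears in exactly 8 of the 15 runs, so each reported average pools 8 models.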

Conclusions and future work

The results of these parameter tuning experiments clearly show the difficulty in applying traditional machine learning methods to the language of patent applications. While many domains can expect an F1 of 80% or above for very little training data, this model reaches 40% at its peak. This figure is modest compared to many patent classification works described in Section 2, but it is noteworthy that this result is achieved with a training set of just 317 documents as opposed to the 2000 used

CRediT authorship contribution statement

Caitlin Cassidy: Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization.

Declaration of Competing Interest

The author declares that she has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Caitlin Cassidy is a data scientist at Search Technology, Inc. in Norcross, GA. She graduated with an MS in Artificial Intelligence from the University of Georgia in May of 2015 and is working to complete her MA in Linguistics at the University of Illinois at Urbana-Champaign. Her work centers on machine learning for intellectual property and scientific publication language.

References (41)

  • David D. Lewis, Marc Ringuette, A comparison of two learning algorithms for text categorization, in: Third Annual...
  • Yiming Yang, et al., A re-examination of text categorization methods
  • Frank Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev. (1958)
  • Sepp Hochreiter, et al., Long short-term memory, Neural Comput. (1997)
  • Christopher M. Bishop, Neural Networks for Pattern Recognition (1995)
  • Douglas Teodoro, Julien Gobeill, Emilie Pasche, P. Ruch, D. Vishnyakova, Christian Lovis, Automatic IPC encoding and...
  • Daniel Eisinger, et al., Automated patent categorization and guided patent search using IPC as inspired by MeSH and PubMed, J. Biomed. Semant. (2013)
  • Xin Li, et al., Automatic patent classification using citation network information: An experimental study in nanotechnology
  • Milan Agatonovic, et al., Large-scale, parallel automatic patent annotation
  • C.J. Fall, et al., Automated categorization in the international patent classification, SIGIR Forum (2003)

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.