Construction and evaluation of gold standards for patent classification—A case study on quantum computing

This article discusses options for the evaluation of patent and/or patent family classification algorithms by means of ''gold standards''. It covers creation criteria and desirable attributes of evaluation mechanisms, then proposes an example gold standard, and discusses the results of applying the evaluation mechanism to the proposed gold standard and an existing commercial implementation.


Introduction
There are a number of problems in the strategic patent decision making and portfolio management domain where artificial intelligence techniques can be applied. One of the more common is that of mapping patent assets to technologies, for example to perform patent landscaping, or for reporting on the contents of one's own, or competitor, portfolios. This is also one of the hardest tasks to perform mechanically, and has been identified as a source of friction in strategic patent decision making [1].
Conventional ''mandrolic'', or semi-automated, solutions typically revolve around performing a Boolean search over the assets to discover a superset of the assets to be identified, then manually reviewing the returned results to determine whether each individual asset falls into the desired class.
There are a number of compromises involved in this approach, predominantly related to the time taken to perform a thorough review of the technology domain, or the cost of outsourcing this work to external experts.
In addition, there is the issue of inconsistency of results from month to month, as the output of manual review by different individuals can be highly variable. In a study conducted by Electrolux [2] across 29 outsourced patent search service providers, it was found that there was a high degree of variability in the results. The requested search was ''LED lighting of handle for refrigerator'', which was believed to be precise enough to make interpretation of scope a minor factor. In total, across the 29 providers, 194 distinct patent families were identified, of which 114 were deemed relevant to the scope of the query by independent review. Within the relevant families, 19 were identified as being highly relevant, and the number of those identified by any single provider varied from one to twelve, with a median of 4 and a mean of 5.2.
Because of these factors, automation of this process would be advantageous to the industry, resulting in more consistent reporting, and freeing up subject matter experts to work on higher value projects. As this article will show, measuring the accuracy of Machine Learning algorithms in a neutral and representative way poses challenges, even for experts in the field. This makes it difficult to answer questions such as ''which operations are viable to automate?'', and ''how does the accuracy of algorithms compare to manual work?''.
This article proposes an approach for generating gold standards for machine classification of patents, and presents one such example. It then describes a methodology to test against that gold standard, and presents the results of evaluating a commercially available system against it. The gold standard is intended to be a representative reflection of a patented technology, such that it includes a number of positively labelled patents that cover the technology, and a number of negatively labelled patents that do not, but are close enough in content that they would be challenging for an algorithm to identify. The technologies selected should be representative of real classification challenges faced by practitioners.

https://doi.org/10.1016/j.wpi.2020.101961. Received 1 July 2019; Received in revised form 9 March 2020; Accepted 12 March 2020.

The term ''gold standard'' is somewhat ambiguous, due to its use in many fields and contexts, but Aroyo and Welty [3] provide a description which covers many cases: ''Gold standards exist in order to train, test, and evaluate algorithms that do empirical analysis. Humans perform the same analysis on small amounts of example data to provide annotations that establish the truth. This truth specifies for each example what the correct output of the analysis should be. Machines can learn (in the machine-learning sense) from these examples, or human programmers can develop algorithms by looking at them, and the correctness of their performance can be measured on annotated examples that were not seen during training''.
In the following text, we use the binary classification convention of denoting the data labelled as positive (examples of in-scope patents) with L⊕, and those labelled as negative (counter-examples) with L⊖, where L is the labelled set.
We will also describe the processes in set notation for brevity, though restrict the use to just ∪ (union), ∩ (intersection), ⧵ (set difference), and |⋅| (set cardinality).
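This notation maps directly onto Python's built-in set type; a toy illustration, using invented family identifiers purely as placeholders:

```python
# Illustration of the set notation used in this article, with hypothetical
# patent family identifiers. L is the labelled set, split into positives
# (L_pos, written L⊕ in the text) and negatives (L_neg, written L⊖).
L_pos = {"EP100", "EP101", "EP102"}   # in-scope examples
L_neg = {"EP200", "EP201"}            # counter-examples
L = L_pos | L_neg                     # ∪ (union)

T = {"EP100", "EP200"}                # a small training set drawn from L
holdout = L - T                       # ⧵ (set difference): L ⧵ T
T_pos = T & L_pos                     # ∩ (intersection): training positives

print(len(L))                         # |⋅| (cardinality): 5
print(sorted(holdout))
```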

Existing gold standards
There exist a number of gold standards and more general test datasets for the evaluation of machine classification, such as those published in the OpenML online database of labelled machine learning test data. These datasets cover a wide range of topics, but are largely numeric in content, and do not include rich patent data labelled with the technologies which they cover.
There also exists a series of gold standard datasets in the patent domain - the IPC classifications from CLEF-IP - however these are optimised for the evaluation of other types of algorithm, chiefly the detection of prior art.

Using class codes for evaluation
There have been attempts to use the examiner class code information in CLEF-IP, or wider patent datasets, to evaluate classification algorithms [4,5], and while the class code labels are plentiful and widely available, there exists a question over their suitability for evaluation of this class of problem - the mapping of patents to industry-relevant technologies. Clearly class codes are a suitable gold standard for automating the process of patent examiners assigning class codes to applications; however, the requirements of practitioners in the industry - mapping their assets against technologies that are relevant to the business - may differ.
In order for class codes to be representative of such real-world problems they should resemble the scope and coverage of technology definitions in use by practitioners in the field. To evaluate this we consider some classes from the Cipher Automotive taxonomy of technologies. This was co-developed with a number of well-known companies in the automotive industry, and is widely used for patent classification in that domain, so can be said to be a reasonable reflection of common practice. Five technologies were selected at random from the Cipher Automotive taxonomy, and their relationship to (CPC) class codes observed. In the following text T denotes the training set manually constructed in order to train a classifier to the given technology topic.
As can be seen in Table 1, there is no single class code that spans every patent in T⊕ in any of these cases, the maximum coverage being only 62.6%, and the mean 40.2%. From Table 2 we can see that a minimum of 25.6%, and a mean of 44.3%, of the class codes in T⊕ also appear in T⊖.
Taken together this indicates that class codes do not discriminate between technology domains at the level expected by practitioners - the class codes are both too narrow in scope, such that many hundreds of codes are required to circumscribe an industry-relevant technology, and too broad, in that many of them span both the T⊕ and T⊖ sets.
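The two measurements behind Tables 1 and 2 can be sketched as follows; the family identifiers and CPC codes below are arbitrary placeholders, not the actual taxonomy data:

```python
# A sketch of the two measurements reported in Tables 1 and 2, on toy data.
# Each training-set member maps to the set of CPC codes assigned to it.
T_pos = {  # T⊕: families labelled positive
    "F1": {"B60Q1/26", "F25D23/02"},
    "F2": {"B60Q1/26"},
    "F3": {"F25D23/02", "H05B45/00"},
}
T_neg = {  # T⊖: families labelled negative
    "F4": {"H05B45/00"},
    "F5": {"G06F16/00"},
}

# Table 1 style: the best coverage of T⊕ achievable by any single class code.
all_codes = set().union(*T_pos.values())
best_cover = max(
    sum(1 for fam in T_pos.values() if code in fam) / len(T_pos)
    for code in all_codes
)

# Table 2 style: the proportion of codes in T⊕ that also appear in T⊖.
neg_codes = set().union(*T_neg.values())
overlap = len(all_codes & neg_codes) / len(all_codes)

print(f"best single-code coverage: {best_cover:.1%}")  # 66.7%
print(f"code overlap with T⊖: {overlap:.1%}")          # 33.3%
```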
It should be noted that the relationship to the older IPC class codes was not evaluated, as IPC is essentially a subset of CPC. From the World Intellectual Property Organization (WIPO) FAQs: ''CPC is the Cooperative Patent Classification scheme used by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO), which was jointly developed by the two Offices based in a large part on the existing European Classification System (ECLA) and on the USPC, respectively. It is based on the IPC, but it is much more detailed.''

Cross validation
Outside of gold standards, another popular technique for measuring classification accuracy is cross-validation [6]. Cross-validation has the benefit that it requires no additional manual effort, and it is a useful technique for evaluating classification accuracy in the absence of external data. However, cross-validation suffers from the problem that the scope of the evaluation is bounded by the training set, which is not guaranteed to reflect the domain as a whole [7].

S. Harris et al.
Even in the cases where it can be determined that the training set is truly representative of the domain, there are well-known issues with the inherent inaccuracies of the various cross-validation methods [8]. While these can be compensated for to a degree, they can be avoided altogether with an independently created gold standard, against which robust information retrieval characteristics can be calculated.
Equally, it would not be reasonable to take an existing training set from an academic or commercial system for use as a gold standard. There will be some inherent bias towards the system under which it was constructed, due to the choices made by the operator to correct identified errors during the training and evaluation cycle, unrepresentatively penalising other systems to which it may be compared.

Desirable characteristics of a patent classification gold standard
A number of challenges are faced in the construction of a gold standard for use in the evaluation of classification algorithms, including those described by Aroyo and Welty [3], and those specific to the patent domain.
To address these, the following criteria are proposed for a robust gold standard in this domain:

Scope
Defining a scope which is both clear enough to offer a reasonable level of agreement between subject matter experts, and also reflective of real-world use cases. An embodiment of this in the patent domain could be a scope which clearly delineates patents covering a particular feature that is relevant to licensing activity.
Agreement
Ideally the gold standard covering each topic would be reviewed by multiple subject matter experts - allowing testing against the consensus, most generous, and most narrow definitions. This requires a definition which is clear enough to allow subject matter experts to independently reach the same conclusion as to membership.

Diversity of technology
Different patented technology areas have quite different characteristics in terms of variety of terminology, density of class codes, and quantity of patents, so it is reasonable to assume that different systems will perform with differing degrees of accuracy against each. A good implementation of this criterion would be multiple gold standards covering different technological areas, such as mechanical engineering, software, business methods, semiconductors, and so on.

Size of dataset
There is a tension between selecting technologies that are precise enough to be representative of real requirements, yet large enough that multiple experiments can be run without substantial overlap, while also withholding enough data for the evaluation to be robust and representative.
Challenging
Classifying against the gold standard should be sufficiently difficult that existing solutions cannot easily achieve 100% accuracy, which would render any comparison impossible. If, for example, a simple classification technique such as naïve Bayes could achieve 100% accurate results, then the test is not sufficient to discriminate high- from low-performing solutions.

Independent
The gold standard should be created independently, without reference to any existing system, and as far as possible through manual research, to avoid systematic bias - such as the preponderance of a small number of class codes. If, during the construction of the gold standard data, the person constructing the set relies upon existing known metadata, then the data set will be compromised by the inclusion of an artificially easy-to-discover feature.
Identification
One of the more trivial, though persistent, problems in patent data is the lack of standardisation of patent serial number formatting. The gold standard should use whatever format is the most widely understood. For example, the string ''US10012345'' may denote the US patent ''US10012345B2'' (Method and apparatus for an icemaker adapter, 2001), or the US application ''US10/012,345'' (Multi-mode print data processing, 2015).

The process of creating the quantum computing gold standard
The quantum computing gold standard was created by Anthony Trippe of Patinformatics. The summary is ''Qubit Generation for Quantum Computing'', and the scope (i.e. the natural language definition of what technologies were included, and what were excluded) was given as: ''Qubit Generation for Quantum Computing refers to patents that discuss the various means of generating qubits for use in a quantum mechanics based computing system. Types of qubits included superconducting loops, topological, quantum dot based and ion-trap methods as well as others. The excluded technologies are applications, algorithms and other auxiliary aspects of quantum computing that do not mention a hardware component, and hardware for other quantum phenomena outside of qubit generation''.
This scope was selected by Patinformatics as representative of a real-world problem, and because they have significant experience of analysing patents in the area [9]. Because of this background knowledge, the belief was that there would be a sufficient quantity of patent families to allow the construction of a large enough gold standard. It was also felt that the difficulty of identifying the patents in scope (as in, those which cover technologies such that they fall into the class of positives) would be challenging for machine processing, based on their experience of manually classifying the technology.
The source data was a mixture of existing known data, and manually directed searching with manual review.During the process of creating the gold standard data Patinformatics did not have access to Cipher, as this could have skewed the selection process in favour of machine processing over human review.
Patinformatics were compensated for their time by Aistemos to allow open publication of the resulting data, however there is no other relationship between the organisations.
The version of Cipher evaluated in this text predates the creation of this gold standard, so there is no optimisation (for example of the text embeddings used to generate the intermediate vectors) specific to this data in the results.
Instructions for obtaining the data produced can be found in Section 10. In order to understand the contents of the gold standard further, we can compare the members to the data presented in Section 2.2. Ideally there should be a degree of similarity in the makeup, coverage, and intersection of the class codes that is in line with real-world classifier training sets. Some differences are expected, as the gold standard is by definition a superset of the data required for accurate training, and the quantum computing domain is likely to differ in terms of class code coverage from automotive technologies, though the characteristics should be broadly similar. Table 3 shows the coverage of the most common class codes in the gold standard data. This is broadly in line with the Overhead cameras technology from the automotive data, though somewhat above the mean, at 63.7%, 34.0%, 10.8% for the gold standard against 62.2%, 31.5%, 15.9% for Overhead cameras.

Analysis of the gold standard data
For the gold standard the number of unique class codes in L⊕ is 596, in L⊖ is 1,403, and the number appearing in both is 130. The intersection is lower than any of the examples from the automotive domain, being almost half of the mean.
Without a more comprehensive study of the distribution of class codes across different training sets in different technologies it is not possible to be confident whether this is reflective of the technology area, or indicative of some minor bias in the construction of the data. The results are similar enough to the automotive technologies not to cause concern about intrinsic problems in the construction of the data - for example, class codes that are uncommonly discriminative.

Quantifying classifier performance
Demšar [10] identifies the most frequently used information retrieval metrics in the analysis of supervised learning classifier performance in the literature as P (precision), R (recall), F1, and A (accuracy), with AUC also being used, though less often.
These metrics are defined in terms of the binary classifier confusion matrix, shown in Table 4, where TP are true positives (correctly identified positives), TN are true negatives (correctly identified negatives), FP are false positives (Type I errors), and FN are false negatives (Type II errors).
The metrics are all presented as numbers in the range 0 to 1, and are defined as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2PR / (P + R)
A = (TP + TN) / (TP + TN + FP + FN)

In plain language these can be thought of in the following terms: P is, of the answers returned, the proportion that are correct; R is, of the correct answers, the proportion that are returned; F1 is the harmonic mean of P and R; and A is the proportion of all classifications that are correct. The A score can be misleading in the presence of unbalanced classes [11] (e.g. more negatives than positives), which is generally the case in patent classification; however, it has been included here for consistency with other work.
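These definitions are straightforward to compute from the confusion matrix counts, and a small worked example shows why accuracy alone can mislead on unbalanced classes:

```python
# The four metrics, computed directly from the confusion matrix counts.
def metrics(tp, tn, fp, fn):
    p = tp / (tp + fp)                   # precision
    r = tp / (tp + fn)                   # recall
    f1 = 2 * p * r / (p + r)             # harmonic mean of P and R
    a = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    return p, r, f1, a

# With heavily unbalanced classes, accuracy can look strong while the
# classifier misses half of the positives:
p, r, f1, a = metrics(tp=10, tn=980, fp=0, fn=10)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f} A={a:.2f}")  # P=1.00 R=0.50 F1=0.67 A=0.99
```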

Naïve training set construction
For less challenging classification tasks, accurate results can be obtained by constructing a training set T from a random subset of L⊕ and L⊖, and evaluating the classifiers built from these training sets against L ⧵ T. However, practical experience in the field has shown that this technique is not effective for patent classification against commercially relevant topics.
Because of this, it is expected that a similarly constructed training set drawn randomly from L will produce relatively low P and R numbers, and this can be used as a test of the appropriate difficulty of the classification challenge.
Anecdotally, P and R scores (and hence F1) in excess of 0.9 have been reported by end-users as being approximately equivalent to results produced by manual search and review of patents by skilled practitioners. Based on this, we would expect a random construction of training sets to produce scores under this threshold for a robust gold standard.
A series of 10 training sets were randomly constructed, such that |T| = 300. Classifiers as described in Section 9.1 were then trained against the random training sets. The confusion matrices from evaluation against L ⧵ T were then calculated, and the results can be seen in Table 5. The value of 300 was chosen as a typical training set size, from practical experience of patent technology classification with manually curated training sets.
From this we can see that, at least for this implementation, randomly generated training sets do not produce a level of P that meets user expectations. This gives a degree of confidence that the task presented by classifying the gold standard is not so trivial as to be unrepresentative of real-world technology classifications.
It is interesting that the mean R is just in excess of the 0.9 threshold, though the micro-average [12] mean F1 score for this test is 0.842, suggesting that the overall perceived accuracy would be insufficient.
The disparity between P and R scores for random training sets is substantial. In order to illustrate possible causes, a UMAP dimensional reduction [13] was performed on the entire gold standard data, using a pre-existing deep learning embedding of CPC class codes. 100 positive and 100 negative families were then selected at random, and plotted in Fig. 1. The parameters used were n_neighbors = 5 and min_dist = 0.5, and the Euclidean metric was used for distance calculations.
This reduction gives some indication of why this may occur. The positive points are mostly densely clustered in a small number of locations in the space, whereas the negatives are more scattered. If this is a meaningful representation of the information space, then it would be hard for the classifier to define the boundaries of the space denoting the positive class, and it will tend to be over-inclusive of positive results. This would be a cause of the high R, but low P.
With further work, it could be established if the process of following the algorithm described in Section 8 also identifies negative points that help to define the boundary in such a space.

Modelling real-world directed training
As we have shown, random construction of training sets does not yield results that are either representative of practical experience, or sufficiently accurate to be useful. Consequently it is necessary to define a representative, repeatable, and fully algorithmic way to model operator-directed construction of training sets, so that classification accuracy can be evaluated in a robust manner.
The domain of interest can be characterised as per Fig. 2, where D is the entire domain (all patents relevant to the subject of the gold standard, whether positive or not), P is the set of patents that are positive to the class, K is the set of patents that are currently known to the operator, or will become known during the training process, and T is the current training set. By definition the training set must be a (non-strict) subset of the known patents.
From this we can define sets such as the training set positives, T⊕ = T ∩ P. The process of training a supervised learning classifier can be characterised as moving patents from K into T in such a way as to increase the extent to which T⊕ and T⊖ represent the characteristic differences between P and D ⧵ P, enabling the classifier to ''learn'' what those differences are.
Hence, during a typical training process the operator follows the following cycle:
1. Identify a small number of members of P to form the initial T⊕ - provided by an end-user, discovered by manual search, or some combination
2. Identify a small number of more-or-less arbitrary members of D ⧵ P to form the initial T⊖
3. Train the classifier on T
4. Apply the classifier to some subset of D
5. Correct the most obvious errors, adding to T
6. Repeat from 3., until the classifier evaluates to some success criterion, such as F1 ≥ 0.9
In real-life situations the operator is a human expert, who is responsible for training the classifier; in the algorithm described in this article it is a software simulation of that operator, modelled as the selection of new members of the training set based on highest log-loss.
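The simulated-operator loop can be sketched as follows; this is a minimal illustration, with a toy stand-in scoring rule in place of the real TRAIN() function, and invented family identifiers - not the actual Cipher implementation:

```python
import math

# A minimal sketch of the operator cycle described above. The simulated
# operator adds the delta highest log-loss members of the pool each round.
def train(T):
    # stand-in for TRAIN(): a toy model that scores families whose identifier
    # starts with "Q" as likely positive (an invented rule, for illustration)
    return lambda fam: 0.9 if fam.startswith("Q") else 0.1

def log_loss(y, p):
    p = min(max(p, 1e-9), 1 - 1e-9)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def directed_training(pool, T, rounds=3, delta=2):
    """pool/T: lists of (family_id, label) pairs; label 1 = positive."""
    for _ in range(rounds):
        clf = train(T)
        # rank remaining pool members by log-loss under the current model
        ranked = sorted(pool, key=lambda fy: log_loss(fy[1], clf(fy[0])),
                        reverse=True)
        worst, pool = ranked[:delta], ranked[delta:]
        T = T + worst                  # "correct the most obvious errors"
    return T

pool = [("Q1", 1), ("Q2", 1), ("X1", 0), ("X2", 0), ("Q9", 0), ("X9", 1)]
T = [("Q0", 1), ("X0", 0)]
print(len(directed_training(pool, T)))  # 8
```

The key design point is the ranking step: the members the current model gets most confidently wrong (highest log-loss) are the ones a human reviewer would notice first.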
The way this was modelled in software was to start with randomly selected initial H and A sets, such that H ∩ A = ∅ and H ∪ A = L, where the gold standard is L, and the holdout set is H (representing D ⧵ K in the user-driven case). T, the training set, starts out as a balanced random subset of A, and progressively acquires members from A ⧵ T to reduce the errors observed in evaluation. The pseudocode is given in algorithm 1.
In this way the data is divided into three portions - a randomly selected withheld set (H) that is purely used for testing, an initial training set (T), and the remainder (A), which is used to augment the training set.

Selection of constants
The constants α, β, γ, and δ were given the values 100, 350, 286, and 5 respectively for this study. The values of α, γ, and δ were chosen largely for reasons of computational efficiency, and to match the scale of the quantum computing gold standard, as described below.
Reducing α, the initial cardinality of T, causes the evaluation scores to be slightly higher in earlier iterations, and increases computational effort, though from some experimentation this effect on evaluation scores did not appear to be substantial.
Increasing β, which governs the maximum cardinality of T, and hence the number of iterations, substantially increases the evaluation time, and can have some negative impact on the selection of patents in T⊕, as discussed in Section 8.2.
γ, the cardinality of the held-out set, was chosen as 0.2|L| - 20% of the data, a typical proportion for this kind of evaluation.
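This held-out cardinality is consistent with the gold standard size reported later in the article (1429 families); a quick arithmetic check:

```python
# The held-out cardinality follows from the gold standard size: 20% of the
# |L| = 1429 families described later in the article.
L_size = 1429
holdout_size = round(0.2 * L_size)
print(holdout_size)  # 286
```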
Increasing δ, the number of families added to T in each iteration, reduces the computational effort of evaluation, at the cost of decreasing the resolution of the analysis of the rate of change of the metrics with respect to training set size seen in Section 9. A value of 10 or 20 would be more representative of human-directed training, though some experiments revealed that it does not materially affect the evaluation results.
The conditional on precision and recall reflects the user choosing to correct for false positives or false negatives, depending on which are more apparent, and the highloss() function reflects a user identifying the most obvious errors, and correcting them. From analysing user behavioural statistics of the Cipher system (see Section 9.1), it has been observed that users tend to briefly focus on identifying runs of positives, then runs of negatives, so this has been reflected in this process.

Limits of β
As |T⊕| approaches |L⊕ ⧵ H| the training set becomes equivalent to a randomly generated one, due to the reduction in the selection freedom of the training model. Consequently, at some point the metrics for the classifier will decrease with increasing training set size.
As we can see in Fig. 3, when |T| = 400, the proportion of the available set of positives (L⊕ ⧵ H) that has been incorporated into T⊕ is around 58%. At |T| = 500 the consumed proportion is 72%, significantly constraining the selection that can be made for T⊕. This is due to the size of the held-out set, and the tendency of the training set to be balanced between ⊕ and ⊖, while the gold standard as a whole is not.
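These consumed proportions follow directly from the cardinalities given elsewhere in the article (|L⊕| = 435, 20% withheld, balanced training set); a quick reconstruction yields figures close to the ~58% and 72% reported:

```python
# Reproducing the consumed-proportion figures: with |L⊕| = 435 positives and
# 20% of the data withheld, roughly 348 positives remain available outside
# the holdout; a balanced training set consumes |T|/2 of them.
available_pos = round(435 * 0.8)      # ≈ 348 positives outside the holdout
for T_size in (400, 500):
    consumed = (T_size / 2) / available_pos
    print(f"|T|={T_size}: consumed {consumed:.1%}")
```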
This effect could be attenuated by reducing |H|; however, this methodology would then be a worse representation of the real-world situation, as the unknown portion of data is typically substantial in comparison to the size of the training set, and it would reduce the accuracy of the evaluation.
Consequently we limit β to 350 for evaluation of this gold standard, to minimise the impact of this effect. Further work would be required to determine the exact significance of this factor. By observation, with |H| = 0.2|L| and β = 350 the evaluation results appear to be unaffected for this gold standard - P and R at |T| = 350 are greater than or equal to those at |T| < 350, see Table 6. For future gold standards other choices of β may be required, to reflect differing cardinalities of L⊕ and L⊖.
It is not clear how much of the reduction in the rate of change of P and R with respect to |T| is due to diminishing returns from the increased data, and how much to the lack of choice in candidate patents to add to T⊕. Further work would be required to determine this.

Alternative algorithms
An obvious alternative would be to evaluate only against the H set, which gives the advantage of a constant evaluation set size. In order to reduce the variance of the scores it was found that large values of γ were required, limiting the values of β (as described in Section 8.2) at which limiting values for P and R could be obtained. For this dataset, and this implementation, the combinations of γ and β that were evaluated also resulted in a high variance of the F1 score.
A much larger gold standard would reduce this effect, though it would introduce other issues. Commercially relevant technologies to be classified tend to be relatively specific, and it is rare for examples with tens of thousands of candidates for L⊕ to exist. Any such training set is likely to be idiosyncratic, and not reflective of many real-world tasks.
Additionally, this variant approach would not reflect what users experience when evaluating classifier results - the results from L ⧵ T seen in the final output are a mixture of results that are not known to the operator (H), and ones that have been observed, though deemed to be correct (A ⧵ T).

Comparison to randomly selected training sets
Fig. 4 shows the difference in P and R for classifiers built from randomly selected training sets (as in Section 7) and from the directed training algorithm described here, for the same classification engine.
The data in Fig. 4 is drawn from Table 5, and from the source data for Table 6. As can be clearly seen, the directed training produces substantially higher P, and higher R, for the same training set size.
The difference in mean R between the methodologies of 6.1% (0.956 and 0.901) may not seem substantial; however, it reflects the False Negative Rate falling from 0.098 to 0.044 - to 44.9% of its former value.

About Cipher
In the results that follow the TRAIN() function in the pseudocode is provided by the July 2019 version of the Cipher classification algorithm, as described in this section.
Cipher (http://cipher.ai/) is a commercially available strategic patent information system, the key feature of which is the ability to use trained AI classifiers to tag patent assets against defined technologies.
The classification system is based on an ensemble of learners, each trained on embeddings generated from patent data.
Textual and metadata embeddings for model training are obtained through a combination of domain-specific normalisation and transformations, and a separately trained patent-specific language model. Model parameters were typically obtained through either random or directed hyperparameter searches.
As the hyperparameters are determined algorithmically, it is necessary to perform a large number of iterations to ensure representative results.
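A random hyperparameter search of the kind described can be sketched generically; the parameter space and objective below are illustrative stand-ins only, not Cipher's actual search space or scoring function:

```python
import random

# A generic random hyperparameter search: sample parameter combinations,
# score each with an objective, and keep the best. The space and the toy
# objective are invented for illustration.
def random_search(objective, space, iterations=50, seed=42):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(iterations):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "embedding_dim": [64, 128, 256],
}
# toy objective: pretend mid-range values validate best (score 0 is optimal)
toy = lambda p: -abs(p["learning_rate"] - 0.01) - abs(p["embedding_dim"] - 128)
params, score = random_search(toy, space)
print(params)
```

A directed search would replace the uniform sampling with a strategy that conditions each draw on the scores of earlier draws.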

Methodology and results
In order to obtain representative results, we executed algorithm 1 a total of 200 times. This required 183 CPU-hours when executed on 4.3 GHz Intel i7-7740X CPUs with NVIDIA GeForce GTX 1080 Ti GPU accelerators.
The results of executing algorithm 1 are shown in Table 6, abbreviated to only show the results of every 5 iterations (5 increments to |T|). Note that the results in Table 6 and Figs. 5-7 are computed as the micro-average of the confusion matrices from multiple runs, rather than the macro-average [12]. This allows direct comparison of F1 and Var(F1).
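The distinction between the two averaging schemes can be made concrete with toy confusion-matrix counts (the run values below are invented for illustration):

```python
# Micro- vs macro-averaging of F1 over repeat runs: micro-averaging sums the
# confusion matrices before computing the metric; macro-averaging computes a
# per-run F1 and then averages those.
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

runs = [(90, 10, 10), (50, 40, 30)]   # (TP, FP, FN) per run, toy values

micro = f1(*[sum(col) for col in zip(*runs)])
macro = sum(f1(*r) for r in runs) / len(runs)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Micro-averaging weights each classified family equally, so runs with larger evaluation sets contribute more, which is what makes the pooled F1 directly comparable with its variance across runs.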
Figs. 5 and 6 show the evolution of P and R, and of F1 and A, as the training set grows.

Observations
We can see that the initial P is much higher than would be expected from a random starting point, as observed in Section 7. From this point P grows steadily, while R reduces quickly with respect to training set size, then increases as T develops towards a more optimal selection. Eventually P and R converge around similar values, due to the balancing action of the training simulation algorithm. This reduction in R causes a corresponding reduction in the F1 score in the range 100 < |T| < 150.
As we can see, the ultimate P, R, and F1 of the Cipher implementation are around 0.96, and these values are reached at |T| = 300, or |T⊕| ≈ |T⊖| ≈ 150. This reflects anecdotal reports from users who have trained the Cipher system against real-world topics (such as the automotive taxonomy), and provides reassurance that both the gold standard data and the training simulation algorithm are a reasonable reflection of real-world situations. The ultimate A is 0.98, though as discussed previously this is not a robust measure for unbalanced sets.
One important aspect of classifier performance that is often under-reported is the variance of results over repeat runs with different datasets [10]. The variance of the F1 score can be seen in Fig. 7, and the distribution of F1 is rendered as a scatter plot in Fig. 8. As can be seen from the plots, the variance is initially high (when the training set is close to random), and converges on stable values as the training set is increasingly optimally selected from the pool of available data.

Conclusion
Though this is early work on the analysis of this class of classification problem, the quantum computing gold standard appears to be representative of real-world experience of the classification of patents. The required training set sizes, and the eventual accuracy of classification, match anecdotal evidence from real users, and statistically the data resembles known-good training sets.
In so far as is possible at this point we have addressed the criteria presented in Section 3, though some remain for future work:

Scope
Appears to be an accurate reflection of commercial practise, though more evaluation would be beneficial.

Agreement
The published data represents a single view; future work includes obtaining blind reviews of the data by other practitioners. This could also be used to establish consensus, maximal, and minimal classification targets.
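One candidate measure for quantifying such blind-review agreement (not prescribed by the paper, but standard for this purpose) is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch over hypothetical binary labels from two reviewers:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two reviewers' labels for the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items both reviewers label the same.
    po = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: from each reviewer's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical positive (1) / negative (0) labels for eight families.
reviewer_a = [1, 1, 0, 1, 0, 0, 1, 0]
reviewer_b = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(reviewer_a, reviewer_b))
```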

Diversity of technology
Currently only one technology is covered. Future work includes adding further gold standards from different industries.

Size of dataset
The size of the dataset appears to be sufficient for robust evaluation, whilst remaining representative of the technology.

Challenging
Analysis of the behaviour of random training sets indicates that the difficulty of analysis of this technology is comparable with real-world classification tasks.

Independent
The gold standard was created in the absence of an algorithmic classification system, by a neutral third party.
Identification
Data is published in the widely-used European Patent Office format, with titles and dates available to aid cross-checking.
The data for the quantum computing gold standard can be found at https://github.com/swh/classification-gold-standard/tree/master/data. It is made available under the BSD 3-Clause License, to allow reuse in other projects in a variety of ways. The site includes documentation for the file and data format in which the gold standard is represented.
In the future, additional gold standard datasets, created by different authors and covering multiple domains, will be published at this location, to allow a more comprehensive analysis of the behaviour of various patent classification algorithms.
The cardinality of the example gold standard created for this study (|G|) is 1429 EPO simple patent families, broken down as |G⊕| = 435 and |G⊖| = 994; these consist of 2282 and 2801 publications, respectively.

Fig. 1. UMAP dimensional reduction of 100 randomly selected positives (orange), and 100 randomly selected negatives (blue). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 2. Euler diagram showing sets of interest to the training process.

S. Harris et al.

Fig. 4. Scatter plot of p and r, for randomly selected training sets and directed training sets, with |T| = 300.

Fig. 5. Means of p and r with training set size, calculated over 200 runs from a random starting point, and with random held-out data.

Fig. 6. Means of F1 and a with training set size, calculated over 200 runs from a random starting point, and with random held-out data.

Fig. 7. Variance of F1 with training set size, calculated over 200 runs from a random starting point, and with random held-out data.

Fig. 8. Scatter plot of F1 against training set size, calculated over 200 runs from a random starting point, and with random held-out data.

Table 1
Most frequent class codes in the T⊕ set (positive labelled training set), and their coverage |T⊕ ∩ c| / |T⊕|, for class code c.

Table 2
The number of distinct class codes appearing in each set, and the intersection of the class codes of the two sets. The percentage indicates the proportion of class codes appearing in T⊕ (positive labelled training set) that also appear in T⊖ (negative labelled training set).

Table 3
Most common class codes appearing in the gold standard, and the proportion of families in each set that include the class code.

Table 4
The confusion matrix for binary classification.

Table 5
Results for randomly generated training sets, with |T| = 300, evaluated against G ∖ T. F1, the harmonic mean of p and r, provides a simple way to combine them into a single value, such that poor performance in either metric is visible in the F1 score. p: of all the answers given, what proportion are correct.
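In terms of the confusion-matrix counts of Table 4 (tp, fp, fn, tn for true/false positives and negatives), the metrics used throughout the results are conventionally defined as:

$$
p = \frac{tp}{tp + fp}, \qquad
r = \frac{tp}{tp + fn}, \qquad
F_1 = \frac{2\,p\,r}{p + r}, \qquad
a = \frac{tp + tn}{tp + fp + fn + tn}
$$

Note that $a$ weights all cells equally, which is why it is optimistic on unbalanced sets, whereas $F_1$ ignores $tn$ entirely.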