A semi-supervised approach using label propagation to support citation screening

Selection of number of nearest neighbours for label propagation
We investigate the performance of the semi-supervised approach as a function of the number of nearest neighbours used to transfer classification codes from manually labelled to unlabelled instances (i.e., the k parameter of the model). To this end, we incrementally increase the k parameter of a semi-supervised certainty-based model (SemiSpectral-AL-C) and record the utility and burden of the model on one clinical review (Proton Beam) and one public health review (Tobacco Packaging). As a baseline, we use a certainty-based active learning model without label propagation (AL-C).
We compare the utility (Figure 1a) and burden (Figure 1b) of five active learning models, namely SemiSpectral-AL-C (for k = {1, 3, 10, 30}) and the baseline AL-C method, when applied to the Proton Beam clinical review. The utility of the SemiSpectral-AL-C method increases as the k parameter is raised above k = 10, and the method achieves its best utility at k = 30. However, at k = 30 the semi-supervised method also shows a substantially higher burden than the baseline AL-C method. This indicates that a large value of k may result in an increased number of false positive predictions. For smaller values of k (e.g., k = 3), the SemiSpectral-AL-C method obtains smaller utility gains, but it maintains a reduced burden.
Figures 1a and 1b illustrate the utility and burden performance, respectively, of the SemiSpectral-AL-C (for k = {1, 3, 10, 30}) and AL-C methods when applied to the Tobacco Packaging public health review. As with the clinical review, SemiSpectral-AL-C obtains superior utility for k = 30 on the public health review, but with a considerably increased screening burden compared to the baseline method.
The results demonstrate that the choice of k has a large impact on utility in the early screening stages for a certainty-based model. Depending on the goals of the screening prioritisation, selecting a higher k may be appropriate.
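To make the role of k concrete, the following is a minimal sketch of nearest-neighbour label propagation. It is not the implementation evaluated here: the function name `propagate_labels`, the Euclidean distance, the toy two-dimensional vectors and the rule for resolving conflicting labels (the closer labelled instance wins) are all assumptions made for illustration.

```python
import math


def propagate_labels(labelled, unlabelled, k):
    """Copy each labelled point's label to its k nearest unlabelled points.

    labelled:   list of (vector, label) pairs
    unlabelled: list of vectors
    Returns a dict {index into unlabelled: propagated label}.
    Assumption of this sketch: if two labelled points claim the same
    unlabelled neighbour, the closer labelled point wins.
    """
    best = {}  # unlabelled index -> (distance, label)
    for vec, label in labelled:
        # Rank all unlabelled points by Euclidean distance to this labelled point.
        ranked = sorted((math.dist(vec, u), i) for i, u in enumerate(unlabelled))
        for dist, i in ranked[:k]:
            if i not in best or dist < best[i][0]:
                best[i] = (dist, label)
    return {i: label for i, (dist, label) in best.items()}


# Toy data: label 1 = include, label 0 = exclude.
labelled = [((0, 0), 1), ((10, 10), 0)]
unlabelled = [(0, 1), (1, 1), (9, 9), (5, 5), (4, 4)]

small_k = propagate_labels(labelled, unlabelled, k=1)
large_k = propagate_labels(labelled, unlabelled, k=3)
print(len(small_k), len(large_k))  # 2 5
```

In this toy example, k = 1 pseudo-labels only the two instances closest to a labelled point, whereas k = 3 pseudo-labels all five, including instances that lie far from any labelled point. This mirrors the trade-off reported above: a larger k spreads labels further, which can raise utility but also risks more false positives.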

Performance graphs
We provide the evaluation results (i.e., yield, burden and utility) of six active learning screening methods, namely: a) active learning with certainty sampling (AL-C) [1], b) active learning with uncertainty sampling (AL-U) [2], c) two semi-supervised active learning models that propagate classification labels using a bag-of-words feature space (SemiBoW-AL-C for certainty sampling and SemiBoW-AL-U for uncertainty sampling), and d) two semi-supervised active learning models that use a spectral embedded space for label propagation (SemiSpectral-AL-C and SemiSpectral-AL-U). The semi-supervised models (SemiBoW-AL and SemiSpectral-AL) are new automatic screening methods proposed in this work, while the two active learning models, AL-C and AL-U, were previously presented by [1] and [2], respectively, and are used in this study as baseline methods. All methods use linear SVMs. Figures 2-7 show the yield, burden and utility achieved by the automatic screening methods when applied to two clinical and four public health reviews. For utility, we also record the performance of a conventional, manually conducted citation screening process (Manual).
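The difference between the certainty (-C) and uncertainty (-U) variants lies in which unlabelled citations are sent to the reviewer next. The sketch below illustrates one common formulation, assuming each unlabelled citation has a signed score from a linear SVM (positive meaning predicted relevant, magnitude meaning distance from the decision boundary). The function names and the example scores are hypothetical and are not taken from [1] or [2].

```python
def select_certainty(scores, n):
    """Pick the n citations the classifier is most confident are relevant
    (largest signed score first)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:n]


def select_uncertainty(scores, n):
    """Pick the n citations closest to the decision boundary
    (smallest absolute score first)."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]))[:n]


# Hypothetical signed SVM scores for five unlabelled citations.
scores = [2.1, -0.1, 0.4, -1.7, 0.05]

certain = select_certainty(scores, 2)      # indices 0 and 2
uncertain = select_uncertainty(scores, 2)  # indices 4 and 1
print(certain, uncertain)
```

Under these assumptions, certainty sampling surfaces likely includes early, whereas uncertainty sampling requests labels for the citations the classifier can least confidently place on either side of the boundary.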