Abstract
In this work, we investigate neural networks as visual relationship classifiers for precision-constrained applications on partially annotated datasets. The classifier is a convolutional neural network, which we benchmark on three visual relationship datasets. We discuss how partial annotation affects precision and why precision-based metrics are inadequate under partial annotation, a topic that has not yet been explored in the context of visual relationship classification. We introduce a threshold tuning method that imposes a soft constraint on precision while being less sensitive to the degree of annotation than a standard precision-recall trade-off method. Performance can then be measured as the recall of predictions computed with thresholds tuned by the proposed method. Our previously introduced negative-sample mining method is extended to partially annotated datasets (namely Visual Relationship Detection, VRD, and Visual Genome, VG) by sampling from unlabeled pairs instead of unrelated pairs. With thresholds tuned by our method, negative-sample mining improves recall from \(24.1\%\) to \(30.6\%\) on VRD and from \(36.7\%\) to \(41.3\%\) on VG. The neural networks also retain the ability to discriminate correctly among predicates: when only ground-truth relationships are considered for threshold tuning, recall decreases only slightly (from \(45.1\%\) to \(43.8\%\) on VRD and from \(60.5\%\) to \(58.7\%\) on VG) compared to neural networks trained only on ground-truth samples.
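To make the evaluation protocol concrete, the following is a minimal sketch of the general idea of tuning a per-predicate score threshold against a precision target and then reporting recall at that threshold. All names, the candidate scores, and the precision target are illustrative assumptions; this is a plain precision-recall trade-off on a validation split, not the paper's soft-constraint method, which is specifically designed to be less sensitive to the degree of annotation.

```python
import numpy as np

def tune_threshold(scores, labels, precision_target=0.7):
    """Pick the lowest score threshold whose precision on the given
    validation scores/labels meets precision_target (illustrative only)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)[::-1]              # candidates by descending score
    sorted_labels = labels[order]
    sorted_scores = scores[order]
    tp = np.cumsum(sorted_labels)                 # true positives if top-k are accepted
    precision = tp / np.arange(1, len(tp) + 1)    # precision of each top-k cut
    ok = np.nonzero(precision >= precision_target)[0]
    if len(ok) == 0:
        return sorted_scores[0]                   # only the top candidate is accepted
    return sorted_scores[ok[-1]]                  # deepest cut still meeting the target

def recall_at(scores, labels, threshold):
    """Recall of the predictions that score at or above the threshold."""
    labels = np.asarray(labels, dtype=int)
    preds = np.asarray(scores, dtype=float) >= threshold
    return (preds & (labels == 1)).sum() / max(labels.sum(), 1)
```

Under partial annotation, the catch discussed in the paper is that unlabeled true relationships count as false positives here, which deflates the measured precision and drives such a tuner toward overly conservative thresholds.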
Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Notes
More information on the challenge is available at https://storage.googleapis.com/openimages/web/challenge2019.html.
Instead of using the conv_5 layers on deeper backbones, we use a randomly initialized convolutional stack with the same architecture as the conv_5 layers from the ResNet-18 backbone.
Funding
This work has been supported in part by Microsoft ATL in Rio de Janeiro, and in part by Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq - Brazil.
Ethics declarations
Conflicts of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
de Moura Estevão Filho, R., Rodríguez Carneiro Gomes, J.G. & Oliveira Nunes, L. Evaluation of visual relationship classifiers with partially annotated datasets. Multimed Tools Appl 83, 18333–18352 (2024). https://doi.org/10.1007/s11042-023-15967-w