Doubling down: sparse grounding with an additional, almost-matching caption for detection-oriented multimodal pretraining | IEEE Conference Publication | IEEE Xplore