Focus and Align: Learning Tube Tokens for Video-Language Pre-Training | IEEE Journals & Magazine | IEEE Xplore