Learning Visual-Audio Representations for Voice-Controlled Robots | IEEE Conference Publication | IEEE Xplore