Multimedia content may be supplemented with time-aligned closed captions for accessibility. Often these captions are created manually by professional editors, an expensive and time-consuming process. In this paper, we present a novel approach to automatically creating a well-formatted, readable transcript for a video from closed captions or ASR output. Our approach uses acoustic and lexical features extracted from the video and from the raw transcription/caption files. We compare our approach against two standard baselines: a) silence-segmented transcripts and b) text-only segmented transcripts. We show that our approach outperforms both baselines on both subjective and objective metrics.
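To illustrate the kind of baseline the abstract refers to, here is a minimal sketch of silence-based segmentation: a stream of time-stamped words is broken into segments wherever the pause between consecutive words exceeds a threshold. The function name, data layout, and threshold value are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a silence-segmented baseline: split time-stamped
# words into segments at pauses longer than a threshold. The 0.5 s value
# is an assumption for illustration, not the paper's setting.

def silence_segment(words, pause_threshold=0.5):
    """words: list of (token, start_sec, end_sec) tuples.
    Returns a list of segments, each a list of tokens."""
    segments, current = [], []
    prev_end = None
    for token, start, end in words:
        # Start a new segment when the inter-word pause is long enough.
        if prev_end is not None and start - prev_end > pause_threshold:
            segments.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        segments.append(current)
    return segments
```

A pause-based rule like this needs no lexical information, which is why the paper pairs it with a text-only baseline for comparison.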
Cite as: Pappu, A., Stent, A. (2015) Automatic formatted transcripts for videos. Proc. Interspeech 2015, 2514-2518, doi: 10.21437/Interspeech.2015-542
@inproceedings{pappu15_interspeech,
  author={Aasish Pappu and Amanda Stent},
  title={{Automatic formatted transcripts for videos}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={2514--2518},
  doi={10.21437/Interspeech.2015-542}
}