Abstract
Word boundary detection is one of the primary components of a speech recognition system. It can be learned jointly as part of the speech model or independently as an extra preprocessing step, which reduces the problem to conditionally independent word prediction. It can also be used to separate out-of-vocabulary (OOV) words in a sentence, thereby avoiding unnecessary computation. On its own, word boundary detection is essential in multimodal corpus collection, where it allows automated and detailed labeling of the dataset at either the sentence or the word level. In this research, we propose a novel approach to word boundary detection that uses only visual information, based on a 3-Dimensional Convolutional Neural Network (3D-CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). This research helps pave the way for better lip reading systems, as well as multimodal speech recognition, because it eases the creation of new datasets and enables conventional word-level visual or multimodal speech recognition systems to work on continuous speech. Training was done on the GRID video corpus for 118 epochs. The proposed model performed well compared to the baseline method, with a considerably lower error rate.
Acknowledgment
This work is supported by the Indexed Thesis Publication Grant funded by the Directorate of Research and Public Services, Universitas Indonesia, under contract No. 411/UN2.R3.1/HKP.05.00/2017, and by a GPU Grant from NVIDIA Inc.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Maulana, M.R.A.R., Larasati, R., Fanany, M.I. (2017). Visual-Only Word Boundary Detection. In: Phon-Amnuaisuk, S., Ang, SP., Lee, SY. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2017. Lecture Notes in Computer Science, vol 10607. Springer, Cham. https://doi.org/10.1007/978-3-319-69456-6_13
DOI: https://doi.org/10.1007/978-3-319-69456-6_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69455-9
Online ISBN: 978-3-319-69456-6
eBook Packages: Computer Science, Computer Science (R0)