Visual-Only Word Boundary Detection

Maulana, Muhammad Rizki Aulia Rahman; Larasati, Retno; Fanany, Mohamad Ivan

doi:10.1007/978-3-319-69456-6_13

Conference paper
First Online: 19 October 2017

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10607)

Abstract

Word boundary detection is one of the primary components of a speech recognition system. It can be learned jointly as part of the speech model or independently as an extra preprocessing step, which reduces the problem to conditionally independent word prediction. It can also be used to separate out-of-vocabulary (OOV) words in a sentence, thereby avoiding unnecessary computation. By itself, word boundary detection is essential in multimodal corpus collection, where it allows automated and detailed labeling of the dataset at the sentence or word level. In this research, we propose a novel approach to word boundary detection that uses only visual information, based on a 3-Dimensional Convolutional Neural Network (3D-CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). This research is important in paving the way for better lip reading systems, as well as multimodal speech recognition, as it allows easier creation of novel datasets and enables conventional word-level visual or multimodal speech recognition systems to work on continuous speech. Training was done on the GRID video corpus for 118 epochs. The proposed model performed well compared to the baseline method, with a considerably lower error rate.
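To illustrate how boundary detection could let a word-level recognizer run on continuous speech, the sketch below splits a frame sequence into word segments wherever a per-frame boundary score crosses a threshold. The abstract does not state the model's output format, so the per-frame probabilities, the 0.5 threshold, and the function name `segments_from_boundaries` are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple


def segments_from_boundaries(
    probs: List[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Split a frame sequence into word segments at predicted boundaries.

    probs[i] is a hypothetical per-frame probability that video frame i
    is a word boundary (an assumed output format, not taken from the paper).
    Returns half-open (start, end) frame ranges lying between boundaries.
    """
    segments: List[Tuple[int, int]] = []
    start = 0
    for i, p in enumerate(probs):
        if p >= threshold:
            # Close the current segment if it contains at least one frame.
            if i > start:
                segments.append((start, i))
            # The next segment starts after this boundary frame.
            start = i + 1
    # Keep any trailing frames after the last boundary.
    if start < len(probs):
        segments.append((start, len(probs)))
    return segments


# Made-up scores for an 8-frame clip: boundaries at frames 2, 5, and 6.
example_probs = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.7, 0.2]
example_segments = segments_from_boundaries(example_probs)
```

Each returned range could then be cropped from the video and passed to a word-level visual or multimodal recognizer. Consecutive boundary frames are collapsed so that no empty segment is produced.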

Acknowledgment

This work is supported by the Indexed Thesis Publication Grant funded by the Directorate of Research and Public Services, Universitas Indonesia, under contract No: 411/UN2.R3.1/HKP.05.00/2017, and by a GPU Grant from NVIDIA Inc.

Author information

Authors and Affiliations

Faculty of Computer Science, Universitas Indonesia, Kampus UI Depok, Depok, Jawa Barat, 16424, Indonesia
Muhammad Rizki Aulia Rahman Maulana, Retno Larasati & Mohamad Ivan Fanany

Corresponding authors

Correspondence to Muhammad Rizki Aulia Rahman Maulana or Retno Larasati.

Editor information

Editors and Affiliations

Universiti Teknologi Brunei, Gadong, Brunei Darussalam
Somnuk Phon-Amnuaisuk
Universiti Teknologi Brunei, Gadong, Brunei Darussalam
Swee-Peng Ang
Korea Advanced Institute of Science and Technology, Daejeon, Korea (Republic of)
Soo-Young Lee

About this paper

Cite this paper

Maulana, M.R.A.R., Larasati, R., Fanany, M.I. (2017). Visual-Only Word Boundary Detection. In: Phon-Amnuaisuk, S., Ang, SP., Lee, SY. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2017. Lecture Notes in Computer Science, vol 10607. Springer, Cham. https://doi.org/10.1007/978-3-319-69456-6_13

DOI: https://doi.org/10.1007/978-3-319-69456-6_13
Published: 19 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69455-9
Online ISBN: 978-3-319-69456-6
eBook Packages: Computer Science, Computer Science (R0)