Visual-Only Word Boundary Detection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10607)

Abstract

Word boundary detection is one of the primary components of a speech recognition system. It can be learned jointly as part of the speech model or independently as an extra preprocessing step, which reduces the problem to conditionally independent word prediction. It can also be used to separate out-of-vocabulary (OOV) words in a sentence, thereby avoiding unnecessary computation. On its own, word boundary detection is essential in multimodal corpus collection, where it allows automated and detailed labeling of the dataset at either the sentence or the word level. In this research, we propose a novel approach to word boundary detection that uses only visual information, based on a 3-Dimensional Convolutional Neural Network (3D-CNN) and a Bidirectional Gated Recurrent Unit (Bi-GRU). This research helps pave the way for better lip reading systems, as well as multimodal speech recognition, because it allows easier creation of novel datasets and enables conventional word-level visual or multimodal speech recognition systems to work on continuous speech. The model was trained on the GRID video corpus for 118 epochs. The proposed model performed well compared to the baseline method, with a considerably lower error rate.
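
To illustrate how detected boundaries could let a word-level recognizer run on continuous speech, the following is a minimal sketch and not the authors' implementation. It assumes a detector that outputs one score per video frame, where a high score means a new word starts at that frame. The function name `split_at_boundaries`, the score format, and the 0.5 threshold are all hypothetical choices made for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_at_boundaries(
    frames: Sequence[T],
    boundary_scores: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[List[T]]:
    """Split a frame sequence into word-level segments.

    A new segment starts at every frame whose score reaches the threshold.
    """
    if len(frames) != len(boundary_scores):
        raise ValueError("frames and boundary_scores must have equal length")

    segments: List[List[T]] = []
    current: List[T] = []
    for frame, score in zip(frames, boundary_scores):
        # Close the running segment when a boundary is detected,
        # unless nothing has been collected yet.
        if score >= threshold and current:
            segments.append(current)
            current = []
        current.append(frame)
    if current:
        segments.append(current)
    return segments


# Hypothetical scores for an 8-frame clip (integers stand in for frames).
frames = list(range(8))
scores = [0.9, 0.1, 0.2, 0.8, 0.1, 0.7, 0.1, 0.0]
segments = split_at_boundaries(frames, scores)
print(segments)
```

Each resulting segment could then be passed independently to a word-level model, which is the sense in which boundary detection reduces continuous recognition to separate word predictions.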

Acknowledgment

This work is supported by the Indexed Thesis Publication Grant funded by the Directorate of Research and Public Services, Universitas Indonesia, under contract No. 411/UN2.R3.1/HKP.05.00/2017, and by a GPU grant from NVIDIA Inc.

Author information

Corresponding authors

Correspondence to Muhammad Rizki Aulia Rahman Maulana or Retno Larasati.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Maulana, M.R.A.R., Larasati, R., Fanany, M.I. (2017). Visual-Only Word Boundary Detection. In: Phon-Amnuaisuk, S., Ang, SP., Lee, SY. (eds) Multi-disciplinary Trends in Artificial Intelligence. MIWAI 2017. Lecture Notes in Computer Science, vol 10607. Springer, Cham. https://doi.org/10.1007/978-3-319-69456-6_13

  • DOI: https://doi.org/10.1007/978-3-319-69456-6_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69455-9

  • Online ISBN: 978-3-319-69456-6

  • eBook Packages: Computer Science, Computer Science (R0)
