LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling

India, Miquel; Fonollosa, José A.R.; Hernando, Javier

doi:10.21437/Interspeech.2017-407

This paper presents a new speaker change detection system based on Long Short-Term Memory (LSTM) neural networks that uses both acoustic data and linguistic content. Language modelling is combined with two different Joint Factor Analysis (JFA) acoustic approaches: i-vectors and speaker factors. Both are compared with a baseline algorithm that uses cosine distance to detect speaker turn changes. LSTM neural networks using both linguistic and acoustic features produce a robust speaker segmentation, and the experimental results show that the proposed system clearly outperforms the baseline.
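
As an illustration of the kind of cosine-distance baseline mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that each analysis window has already been mapped to a fixed-length speaker embedding (for example an i-vector) and flags a speaker turn change wherever the cosine distance between adjacent embeddings exceeds a threshold. The function names, the toy two-dimensional vectors, and the threshold of 0.5 are all hypothetical choices made for this example.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(a, b) for two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def detect_changes(embeddings: Sequence[Sequence[float]],
                   threshold: float = 0.5) -> List[int]:
    """Return indices i where window i appears to start a new speaker turn.

    The threshold is an assumed value; in practice it would be tuned
    on held-out data.
    """
    changes = []
    for i in range(len(embeddings) - 1):
        if cosine_distance(embeddings[i], embeddings[i + 1]) > threshold:
            changes.append(i + 1)
    return changes


# Toy embeddings: two windows from one "speaker", then two from another.
windows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(detect_changes(windows))
```

With these toy vectors, only the boundary between the second and third windows exceeds the threshold, so the sketch reports a single change at index 2.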

Cite as: India, M., Fonollosa, J.A.R., Hernando, J. (2017) LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling. Proc. Interspeech 2017, 2834-2838, doi: 10.21437/Interspeech.2017-407

@inproceedings{india17_interspeech,
  author={Miquel India and José A.R. Fonollosa and Javier Hernando},
  title={{LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2834--2838},
  doi={10.21437/Interspeech.2017-407}
}