Improved Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering

Yang, Gene-Ping; Tuan, Chao-I; Lee, Hung-Yi; Lee, Lin-shan

doi:10.21437/Interspeech.2019-2181

Deep learning techniques have made speech separation very successful. Much of the reported work operates on the magnitude spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals. It is highly correlated with the phonetic structure of speech, or “how the speech sounds” when perceived by humans, but it consists primarily of frequency-domain features carrying temporal behaviour. Very impressive work achieving speech separation in the time domain was reported recently, probably because time-domain waveforms may describe the different realizations of speech more precisely than the magnitude spectrogram, which lacks phase information. In this paper, we propose a framework that integrates these two directions, hoping to achieve both purposes. We construct a time-and-frequency feature map by concatenating a 1-dim convolution encoded feature map (for the time domain) and the magnitude spectrogram (for the frequency domain), which is then processed by an embedding network and clustering approaches very similar to those used in prior time-domain and frequency-domain works. In this way, the information in the time and frequency domains, as well as the interactions between them, can be jointly considered during embedding and clustering. Very encouraging results (state-of-the-art to our knowledge) were obtained on the WSJ0-2mix dataset in preliminary experiments.
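The feature-map construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the window length, hop size, number of basis filters, the ReLU nonlinearity, the Hann window and the random (untrained) filter bank are all assumptions chosen for illustration, and the function names `frame_signal` and `cross_domain_features` are hypothetical. In a real system the 1-dim convolution filters would be learned, and the embedding network and clustering stages are not shown here.

```python
import numpy as np

def frame_signal(x, win, hop):
    """Slice a 1-D signal into overlapping frames of length `win`."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # shape: (n_frames, win)

def cross_domain_features(x, win=256, hop=64, n_basis=128, seed=0):
    """Concatenate a 1-D-convolution feature map with a magnitude spectrogram.

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    frames = frame_signal(x, win, hop)

    # Time-domain branch: a strided 1-D convolution is equivalent to a
    # matrix product between the frames and a filter bank. The filters are
    # random here; they stand in for learned filters. ReLU is an assumption.
    basis = rng.standard_normal((win, n_basis)) / np.sqrt(win)
    time_feat = np.maximum(frames @ basis, 0.0)          # (n_frames, n_basis)

    # Frequency-domain branch: magnitude of the short-time Fourier transform
    # computed on the same frames, so both branches share a time axis.
    window = np.hanning(win)
    mag = np.abs(np.fft.rfft(frames * window, axis=1))   # (n_frames, win//2 + 1)

    # Cross-domain feature map: concatenate along the feature axis.
    return np.concatenate([time_feat, mag], axis=1)

# Usage: one second of noise at an assumed 8 kHz sampling rate.
x = np.random.default_rng(1).standard_normal(8000)
feats = cross_domain_features(x)
print(feats.shape)
```

Framing both branches with the same window and hop keeps their time axes aligned, which is what makes a frame-by-frame concatenation possible in this sketch.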

Cite as: Yang, G.-P., Tuan, C.-I., Lee, H.-Y., Lee, L.-s. (2019) Improved Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering. Proc. Interspeech 2019, 1363-1367, doi: 10.21437/Interspeech.2019-2181

@inproceedings{yang19c_interspeech,
  author={Gene-Ping Yang and Chao-I Tuan and Hung-Yi Lee and Lin-shan Lee},
  title={{Improved Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1363--1367},
  doi={10.21437/Interspeech.2019-2181}
}