It is well-known that the characteristics of L2 speech are highly influenced
by the speakers’ L1. The main objective of this study was to
uncover discriminative speech features to identify the L1 background
of a speaker from their L2 English speech. Traditional phonetic approaches
tend to compare speakers based on a pre-selected set of acoustic features,
which may not be sufficient to capture all the unique traces of the
L1 in the L2 speech for forensic speaker profiling purposes. Convolutional
Neural Networks (CNNs) have the potential to remedy this issue through
the automatic processing of visual spectrograms.
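As a rough illustration of this input representation, the snippet below (a minimal sketch, not the pipeline used in the study) renders a speech recording as a spectrogram image with librosa and matplotlib; the file name, sampling rate, and STFT settings are illustrative assumptions.

# Minimal sketch (not the study's code): render a speech file as a
# spectrogram image suitable for CNN input. Path and STFT parameters
# are illustrative assumptions.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("speaker_utterance.wav", sr=16000)    # hypothetical file
stft = librosa.stft(y, n_fft=512, hop_length=160)           # ~32 ms window, 10 ms hop
spec_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

fig, ax = plt.subplots(figsize=(4, 3))
librosa.display.specshow(spec_db, sr=sr, hop_length=160,
                         x_axis="time", y_axis="hz", ax=ax)
ax.set_axis_off()                                            # keep only the image content
fig.savefig("speaker_utterance.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)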
This paper reports
a series of CNN classification experiments based on spectrogram
images. The classification task was to determine whether
English speech samples were spoken by a native speaker of English, Japanese,
Dutch, French, or Polish. Both phonetically transcribed and untranscribed
speech data were used.
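To make the modelling setup concrete, the following sketch (an assumption-laden illustration, not the author's implementation) defines a small CNN in PyTorch that maps single-channel spectrogram images to the five L1 classes named above; the layer sizes are arbitrary.

import torch
import torch.nn as nn

L1_CLASSES = ["English", "Japanese", "Dutch", "French", "Polish"]

class SpectrogramCNN(nn.Module):
    """Toy CNN for 5-way L1 classification from spectrogram images."""
    def __init__(self, n_classes=len(L1_CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),       # fixed-size output regardless of image size
        )
        self.classifier = nn.Linear(64 * 4 * 4, n_classes)

    def forward(self, x):                        # x: (batch, 1, freq, time)
        return self.classifier(self.features(x).flatten(1))

# Example forward pass on a batch of two 128x128 spectrogram images.
logits = SpectrogramCNN()(torch.randn(2, 1, 128, 128))
print(logits.shape)                              # torch.Size([2, 5])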
Overall, the results showed
that the CNN achieved a high level of accuracy in identifying the speakers’
L1s from spectrogram images without explicit phonetic segmentation.
However, the results also showed that training the classifiers on certain
combinations of phonetically modelled spectrogram images, which would
make the features more transparent, could produce comparable
accuracy rates.
Cite as: Graham, C. (2021) L1 Identification from L2 Speech Using Neural Spectrogram Analysis. Proc. Interspeech 2021, 3959-3963, doi: 10.21437/Interspeech.2021-1545
@inproceedings{graham21_interspeech,
  author={Calbert Graham},
  title={{L1 Identification from L2 Speech Using Neural Spectrogram Analysis}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3959--3963},
  doi={10.21437/Interspeech.2021-1545}
}