High-Resolution, Non-Invasive Imaging of Upper Vocal Tract Articulators Compatible with Human Brain Recordings
Fig 1
Data processing steps for facial, lingual, laryngeal, and acoustic data.
a) The speaker's lips were painted blue, and red dots were painted on the nose and chin. A camera was then placed in front of the speaker's face such that all painted regions were contained within the frame and the lips were approximately centered. Video was captured at 30 frames per second (fps) during speaking (i). Each frame of the video was thresholded based on hue value, resulting in a binary mask. Points were defined based upon the upper, lower, left, and right extents of the lip mask and the centroids of the nose and jaw masks (ii). The X and Y positions of these points were extracted as time-varying signals (iii). Grey lines mark the acoustic onset.

b) The tongue was monitored using an ultrasound transducer held firmly under the speaker's chin such that the tongue was centered in the frame of the ultrasound image. Video output of the ultrasound was captured at 30 fps (i). The tongue contour for each frame was extracted using EdgeTrak, resulting in the X and Y positions of 100 evenly spaced points along the tongue surface (ii). From these 100 points, three equidistant points were extracted, representing the front, middle, and back tongue regions; these comprise our time-varying signal (iii).

c) Instances of glottal closure were measured using an electroglottograph with contacts placed on either side of the speaker's larynx. Glottal closure instants were identified from changes in the impedance between the electrodes using the SIGMA algorithm [28].

d) Speech acoustics were recorded at 22 kHz using a microphone placed in front of the subject's mouth (though not blocking the video camera) (i). We measured the vowel formants F1–F4 as a function of time for each vowel utterance using an inverse filter method. F0 (pitch) was extracted using standard autocorrelation methods.
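To illustrate the lip-tracking step in (a), the sketch below thresholds a single video frame on hue with OpenCV, takes the upper, lower, left, and right extents of the resulting lip mask, and computes the centroid of a dot mask. The hue bounds, the helper names, and the use of OpenCV are assumptions for illustration, not the authors' implementation.

    import cv2
    import numpy as np

    def extract_lip_points(frame_bgr, lower_hsv=(100, 80, 50), upper_hsv=(130, 255, 255)):
        """Threshold one frame on hue and return lip landmark coordinates.

        Hypothetical helper; the HSV bounds here target blue paint and would
        need tuning to the actual recording conditions.
        """
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, np.array(lower_hsv), np.array(upper_hsv))  # binary lip mask
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None  # no lip pixels detected in this frame
        # Upper, lower, left, and right extents of the lip mask
        return {
            "top": (xs[np.argmin(ys)], ys.min()),
            "bottom": (xs[np.argmax(ys)], ys.max()),
            "left": (xs.min(), ys[np.argmin(xs)]),
            "right": (xs.max(), ys[np.argmax(xs)]),
        }

    def mask_centroid(mask):
        """Centroid of a binary mask (e.g., the red nose or chin dot)."""
        m = cv2.moments(mask, binaryImage=True)
        if m["m00"] == 0:
            return None
        return (m["m10"] / m["m00"], m["m01"] / m["m00"])

Applying these helpers to every frame and stacking the returned coordinates yields the time-varying X/Y signals described in (a)(iii).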
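For the tongue step in (b), reducing the 100-point EdgeTrak contour to front, middle, and back points is a simple indexing operation. The sketch below assumes the contour is ordered from tongue front to back and samples it at roughly the quarter, half, and three-quarter positions; the exact indices used in the study are not specified here.

    import numpy as np

    def reduce_contour(contour_xy):
        """contour_xy: (100, 2) array of X/Y points along the tongue surface,
        assumed ordered from front to back. Returns three (x, y) points for
        the front, middle, and back tongue regions."""
        n = contour_xy.shape[0]
        idx = [n // 4, n // 2, (3 * n) // 4]  # assumed equidistant sample positions
        front, middle, back = contour_xy[idx]
        return front, middle, back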
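The glottal-closure step in (c) relies on the SIGMA algorithm [28], which is not reproduced here. As a rough stand-in, a common simpler heuristic marks glottal closure instants at large negative peaks of the differentiated EGG waveform; the sketch below implements that heuristic, with the threshold and F0 range as assumed parameters.

    import numpy as np
    from scipy.signal import find_peaks

    def gci_candidates(egg, fs, max_f0=400.0):
        """Crude glottal-closure-instant detection from an EGG signal.

        NOT the SIGMA algorithm used in the study: closures are taken as
        large negative peaks of the differentiated EGG, with a minimum
        spacing set by the assumed maximum F0."""
        degg = np.diff(egg)                  # differentiated EGG
        min_dist = int(fs / max_f0)          # at most one closure per pitch period
        height = 3.0 * np.std(degg)          # assumed peak-height threshold
        peaks, _ = find_peaks(-degg, height=height, distance=min_dist)
        return peaks / fs                    # GCI times in seconds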
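For the pitch step in (d), a minimal autocorrelation F0 estimator over one analysis frame might look like the sketch below. The search range is an assumption, no voicing decision is made, and the inverse-filter formant analysis (F1–F4) is not reproduced here.

    import numpy as np

    def estimate_f0(frame, fs, min_f0=75.0, max_f0=400.0):
        """Estimate F0 of a short speech frame from the autocorrelation peak.

        Simplified sketch: no voicing check or interpolation between lags."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
        lag_min = int(fs / max_f0)
        lag_max = int(fs / min_f0)
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        return fs / lag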