Synchronized, concurrent optical coherence tomography and videostroboscopy for monitoring vocal fold morphology and kinematics

Voice disorders affect a large number of adults in the United States, and their clinical evaluation relies heavily on laryngeal videostroboscopy, which captures the medial-lateral and anterior-posterior motion of the vocal folds using stroboscopic sampling. However, videostroboscopy does not provide direct visualization of the superior-inferior movement of the vocal folds, which yields important clinical insight. In this paper, we present a novel technology that complements videostroboscopic findings by adding the ability to image the coronal plane and visualize the superior-inferior movement of the vocal folds. The technology is based on optical coherence tomography, which is combined with videostroboscopy within the same endoscopic probe to provide spatially and temporally co-registered images of the mucosal wave motion, as well as of vocal fold subsurface morphology. We demonstrate the capability of the rigid endoscopic probe, in a benchtop setting, to characterize the complex movement and subsurface structure of aerodynamically driven excised larynx models within the 50 to 200 Hz phonation range. Our preliminary results encourage future development of this technology with the goal of its use for in vivo laryngeal imaging. © 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement


Introduction
Voice disorders affect approximately 30% of adults in the United States at some point in their lives [1][2][3] and include a broad spectrum of diagnostic categories, including phonotraumatic vocal hyperfunction (organic vocal fold lesions such as nodules, polyps), non-phonotraumatic vocal hyperfunction (muscle tension dysphonia in the absence of lesions), laryngeal paralysis/paresis, and vocal fold scar.
Laryngologists and speech-language pathologists rely heavily on laryngeal videostroboscopy (VS) for the clinical assessment of voice disorders. VS captures medial-lateral and anterior-posterior motion of the vocal folds using stroboscopic sampling [4][5][6][7], enabling clinicians to observe many salient features in real time, such as left-right symmetry and amplitude, which are difficult to evaluate at standard video rates [8]. In addition to clinical diagnosis, VS is also used to evaluate the surface motion of the vocal folds before and after surgical intervention [9]. These capabilities, combined with affordability, have established VS as the 'gold standard' clinical tool for diagnosing voice disorders. VS has limitations, however. It assesses vocal fold motion only in the medial-lateral dimension, with only an indirect appreciation of superior-inferior, or vertical, tissue motion in the coronal plane, even though accurate measurement of vertical tissue motion is considered very significant for understanding vocal fold function [10]. Furthermore, VS is not capable of subsurface visualization.
A broadband wavelength-sweeping laser source (Model HSL-2000, Santec, Inc.) with a 1310 nm central wavelength, a 3-dB bandwidth of 100 nm, a 20 kHz sweeping frequency, and an average power of 12 mW is used as the illumination source. This low illumination power poses minimal risk, causes no discomfort or pain to subjects, and does not exceed the ANSI standard for safe human use [35]. The near-infrared light is passed through a fiber circulator and then split by a 1 × 2 (90:10) single-mode coupler, such that 90% of the light goes to the probe and 10% to the optical delay line. The light beam going into the endoscopic probe is scanned by two galvanometers (6200H, Cambridge Technology) to generate volumetric OCT images. OCT spectral interferograms generated by the reflected light from both arms of the fiber interferometer are sent to a balanced photodetector.
The signal from the balanced detector is digitized using a 100 MHz digitizer and is fed to a custom-built field-programmable gate array (FPGA) module. A custom-built timing circuit is used to synchronize the OCT data acquisition with the VS unit. The timing sequence in the circuit uses the strobe trigger from the Strobe Unit to control the frame rate of the video camera and the streaming of the OCT data to the FPGA board.
The Data Acquisition & Processing Unit consists of a computer equipped with a data acquisition card, a frame-grabber, and a graphics processing unit (GPU), which enables real-time data processing and display. Depth-resolved structural images are generated in real time and displayed at various rates (between 10 and 40 Hz) depending on the number of pixels per frame.
As shown in Fig. 2, both the OCT and the Data Acquisition & Processing units are enclosed within a custom-made 19" rack. The OCT imaging beam is directed at an angle of 70 degrees from the distal end of the probe, and the imaging plane is formed at a distance of 60 mm from the probe tip to meet the clinical requirements for endoscopic laryngeal imaging.

Benchtop setup for excised larynx models
Five fresh calf larynges, obtained from a research tissue provider, Research87 (Boylston, MA), and two previously frozen human larynx specimens, obtained from autopsies performed in the Massachusetts General Hospital Department of Pathology as approved by the Partners Institutional Review Board, were used in this study. The excised larynx specimens were mounted in a custom holder by placing the trachea over a plastic tube, and the cricoid and thyroid cartilages were secured with four corkscrew-tipped rods that were mounted on a rigid frame surrounding the specimen. Tissues above the true vocal folds were partially resected and/or retracted with stay sutures to provide a clear view of the superior surface of the true vocal folds. A suture through the paired arytenoid cartilages was used to adduct the vocal processes into a phonatory posture. Warm (37 °C) humidified air was generated using a ConchaTherm (model 380-55, Respiratory Care Inc., Arlington Heights, IL) and used to drive phonation. Airflow was controlled with a pressure regulator, and subglottal pressure was monitored using a pressure transducer placed about 10 cm below the vocal folds. The phonatory acoustic signal was captured with a microphone that was placed 10 cm above the glottis. The subglottal pressure signal was passed through a custom zero-crossing detector circuit to generate a synchronization TTL pulse for each glottal cycle.
The instrument was first tested in a morphological scanning (M-scan) mode, where the vocal folds were at rest (non-vibratory state). A raster scan (C-scan) consisting of 512 B-scans was obtained for both human and calf excised larynx specimens. The OCT unit yielded subsurface images of the vocal folds with 10 µm superior-inferior resolution and 40 µm medial-lateral resolution, the latter dictated mainly by the relatively long imaging distance (~60 mm from the probe tip) that is typical of clinical practice.

OCT-VS synchronization
Stroboscopy is commonly used to observe vocal fold vibration at typically 0.5 to 2 periods per second, enabling real-time examination of vocal fold motion in the clinic [36]. A microphone, placed near the larynx, recorded an acoustic signal that was down-sampled to generate a low-frequency strobe illumination trigger per video field (see Fig. 3(A)). The trigger was incremented relative to the phase of the glottal cycle for each video field such that vocal fold oscillation appears slowed down. This is represented as the "reconstructed glottal cycle" in Fig. 3(B). Synchronization of OCT with VS was then achieved by triggering B-scan OCT acquisitions simultaneously with each strobe flash by using the synchronization pulse.
In this way, the VS images and the OCT B-scans were acquired simultaneously during a reduced-speed glottal cycle (see Figs. 3(C)-3(E)). VS acquires top-down two-dimensional (2D) images of the vocal folds, while OCT acquires cross-sectional information in the third, vertical dimension. The plane of the OCT image can be incremented for each new reconstructed glottal cycle along the anterior-posterior axis of the vocal folds to capture multiple slices and generate a 3D image, which can be remapped onto the video image to generate a high-resolution 3D video of vocal fold movement (Fig. 3(E) illustrates one such 3D image). The duration of the B-scan is determined by the number and rate of A-lines captured. A relatively sparse A-line density (16 A-lines per scan using a 20 kHz OCT system) was used in order to capture the essential features of the mucosal wave without motion blur artifacts during a B-scan. Higher-density scans could be performed with faster swept-source OCT systems, at the expense of a significant increase in instrument cost.
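The timing described in this section can be illustrated with a short calculation. The Python sketch below is not the authors' timing circuit. It assumes an idealized, perfectly periodic glottal cycle and one strobe flash per video field, each flash delayed relative to the most recent cycle synchronization pulse. The function names are hypothetical, and the example values (100 Hz phonation, about 60 Hz strobe rate, 0.5 Hz apparent motion, 10 slice positions) are the figures quoted in this paper.

```python
def strobe_schedule(f0_hz, field_rate_hz, apparent_hz, n_fields):
    """Return a list of (target_phase, flash_delay_s), one entry per video field.

    target_phase is the fraction of the glottal cycle (0 to 1) sampled by the flash.
    flash_delay_s is the delay after the latest cycle sync pulse (idealized model).
    """
    period_s = 1.0 / f0_hz
    # Each field advances the sampled phase by this fraction of a glottal cycle.
    phase_step = apparent_hz / field_rate_hz
    schedule = []
    for k in range(n_fields):
        phase = (k * phase_step) % 1.0
        schedule.append((phase, phase * period_s))
    return schedule


def fields_per_reconstructed_cycle(field_rate_hz, apparent_hz):
    """Number of strobe flashes (video fields) in one slowed-down glottal cycle."""
    return round(field_rate_hz / apparent_hz)


def volume_time_s(n_positions, apparent_hz):
    """Time to step the B-scan plane through n_positions, one reconstructed cycle each."""
    return n_positions / apparent_hz
```

Under these assumptions, a 60 Hz strobe and a 0.5 Hz apparent rate give 120 flashes per reconstructed cycle, and stepping the OCT plane through 10 anterior-posterior positions takes 20 s.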

OCT-VS image processing and visualization
The 3D rendering of the glottal phases, similar to the one shown in Fig. 3(E), was made possible by overlaying the OCT images on the VS images. This allowed us to reconstruct a videoendoscopic movie that showed both the medial-lateral and superior-inferior motion of the vocal folds (VFs). An image processing and visualization algorithm was developed and implemented in MATLAB (The MathWorks, Inc.) for this purpose. The main image processing steps are illustrated in the flow chart shown in Fig. 4. Sparse OCT B-scans, corresponding to several phases of a glottal cycle, were acquired at time intervals $t_0$ to $t_n$ (glottal phases) and repeated at several vertical locations in subsequent phonation cycles, forming a 4D raster (C-scans collected at different time intervals). The coarse resolution of the B-scans was dictated by the number of A-lines that can be collected within a glottal cycle for a specific phonation frequency. For a swept-source sweeping frequency of 20 kHz, we were able to collect 10 B-scans associated with 10 phases of a glottal cycle, repeating this subsequently for 10 vertical positions of the VF opening. For each B-scan, the vertical profile of the mucosal wave was obtained by segmenting each image and detecting the VF surface.

Fig. 4. Flow chart for image processing and visualization
The main steps for data processing were: (1) surface detection of the mucosal wave in the 3D OCT volumetric image; (2) interpolation of the OCT surface map to match the number of pixels from the 2D VS image, and (3) overlaying of the OCT surface map on the VS image.
In the first step, the tissue surface formed from the B-scans at each vertical step (e.g., at $t_0$) is detected. The surface of the mucosal wave is detected by calculating the maximum derivative of the slope of each A-line in the OCT B-scan image. Note that each OCT scan is much sparser (16 A-lines × 10 B-scans) than the VS image, which has 400 × 800 pixels spanning 5 mm × 10 mm. An OCT surface map is generated by assembling the surfaces of the OCT B-scan images in a 3D volume. Linear interpolation is then used to smooth the surface of the reconstructed mucosal wave and match it to the number of pixels of the 2D VS images. This routine is repeated for subsequent glottal phases ($t_1$ to $t_n$). The superimposed OCT/VS images were rendered using ImageJ (1.8.0, NIH, Bethesda, MD) for optimal display.
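The first two processing steps can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the MATLAB implementation used in this work. It interprets surface detection as finding, in each A-line, the depth of the steepest rise in intensity, and it performs the linear interpolation separably along the two lateral axes. The function names and array layouts are hypothetical.

```python
import numpy as np


def detect_surface(bscan):
    """Return the surface depth index for each A-line of a B-scan.

    bscan: 2D array of shape (depth, n_alines) holding OCT intensity.
    Assumption: the surface is the first sample after the largest
    depth-wise intensity increase in each A-line.
    """
    rise = np.diff(bscan.astype(float), axis=0)
    return np.argmax(rise, axis=0) + 1


def upsample_surface(surface_map, out_shape):
    """Linearly interpolate a sparse surface map onto the VS pixel grid.

    surface_map: (n_bscans, n_alines), e.g. (10, 16).
    out_shape:   (rows, cols) of the VS image, e.g. (800, 400).
    """
    n_rows, n_cols = surface_map.shape
    out_rows, out_cols = out_shape
    # Interpolate along the medial-lateral axis (within each B-scan).
    src_c = np.linspace(0.0, 1.0, n_cols)
    dst_c = np.linspace(0.0, 1.0, out_cols)
    tmp = np.stack([np.interp(dst_c, src_c, row) for row in surface_map])
    # Interpolate along the anterior-posterior axis (between B-scans).
    src_r = np.linspace(0.0, 1.0, n_rows)
    dst_r = np.linspace(0.0, 1.0, out_rows)
    cols = [np.interp(dst_r, src_r, tmp[:, j]) for j in range(out_cols)]
    return np.stack(cols, axis=1)
```

Applying `detect_surface` to each of the 10 B-scans of one glottal phase and stacking the results gives a 10 × 16 surface map, which `upsample_surface` expands to the 800 × 400 VS grid before overlay. Real B-scans contain speckle noise, so a practical version would probably smooth each A-line before differentiating.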

Resolution
The medial-lateral resolution $l_s$ (number of A-lines per B-scan) is defined by Eq. (1), where $A_r$ is the A-line rate of the swept-source system, $f_0$ is the fundamental frequency of phonation, and $n_\varphi$ is the number of phases desired per strobe cycle, equivalent to the number of strobe flashes utilized. With a 20 kHz A-line rate, a 100 Hz fundamental frequency, and a desired number of 10 phases, we were able to record 16 A-lines per B-scan spanning 5 mm, which provided a medial-lateral resolution of 312 µm per pixel. We collected 10 B-scans along the vertical direction of the VS image, spanning 10 mm, which provided an anterior-posterior resolution of 1 mm. This resolution is easily scalable with longer durations of phonation at a relatively constant fundamental frequency, as each step is based on an independent reconstructed glottal cycle. The superior-inferior (axial) resolution of the OCT system was measured to be ∼12 µm, dictated by the light source bandwidth. We were able to record 10 phases for a phonation frequency of 100 Hz, giving a temporal resolution of 1 millisecond.
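The arithmetic behind these figures can be checked with a small Python sketch. The relation used for the A-line budget, $A_r / (f_0 \, n_\varphi)$, is an assumed form that is consistent with the symbol definitions above. It should be read as an upper bound on the A-lines that fit into one phase, not as a confirmed statement of Eq. (1). The function names are hypothetical.

```python
def alines_budget(a_line_rate_hz, f0_hz, n_phases):
    """Upper bound on A-lines per B-scan if one B-scan must fit in one phase.

    Assumed relation: A_r / (f_0 * n_phi).
    """
    return a_line_rate_hz / (f0_hz * n_phases)


def lateral_pixel_um(span_mm, n_alines):
    """Medial-lateral sampling interval in micrometers."""
    return span_mm * 1000.0 / n_alines


def phase_interval_ms(f0_hz, n_phases):
    """Duration of one phase of the glottal cycle in milliseconds."""
    return 1000.0 / (f0_hz * n_phases)
```

For a 20 kHz A-line rate, 100 Hz phonation, and 10 phases, the assumed bound is 20 A-lines, of which 16 were recorded. The sketch does not model why fewer than 20 were used. The other two helpers reproduce the 312.5 µm lateral sampling (5 mm over 16 A-lines) and the 1 ms temporal resolution quoted above.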

Results
The instrument was used to collect high-resolution images in the morphological M-scan mode with the vocal folds at rest, and lower lateral resolution in the dynamic OCT-VS mode. The results from the excised larynx experiments are reported here.

Morphological M-scan mode
With the vocal folds at rest, high-density cross-sectional OCT scans with 1024 A-lines per B-scan were collected. As shown in Fig. 5, the epithelium, superficial lamina propria, and small blood vessels were well resolved, although specimen quality was not optimal for optical imaging due to being frozen and thawed.

Stroboscopic OCT-VS mode
Synchronized VS-OCT imaging was performed with the vocal folds vibrating at fundamental frequencies ranging from 50 to 200 Hz (80 to 200 Hz for the human larynx specimens and 50 to 160 Hz for the calf specimens). An example of ten phases of vocal fold motion with associated sample OCT depth information is shown in Fig. 6 for a calf specimen phonated at 100 Hz. In our system, the strobe flash frequency determined the temporal and spatial resolution of the OCT images. The strobe triggers, at about 60 Hz, made the cyclical motion appear to occur at 0.5 Hz. Hence, it took 2 seconds to capture one full glottal cycle. During this visualized cycle, OCT captured one axial cross-section (B-scan) per trigger in the medial-lateral direction. For each temporal phase imaged by VS, the number of A-lines across the vocal folds (B-scan) was determined by the fundamental frequency of phonation and the speed of the OCT source, as dictated by Eq. (1). Here, for a 20 kHz A-line rate and a 100 Hz fundamental frequency, we acquired 16 A-lines (one B-scan) to generate 10 phases per reconstructed glottal cycle. Since stroboscopy enabled the examination of multiple phonatory cycles at a reduced speed, it was possible to collect OCT B-scans synchronously with VS images and build up 3D maps of vocal fold kinematics. Figure 7 shows examples of co-registered OCT-VS frames at 3 different phases, each with the corresponding OCT B-scan at the center, the OCT 3D interpolated surface of 10 B-scans, and the reconstructed OCT-VS superimposed images. For simplicity, three distinct temporal phases of a phonatory cycle are shown in the left column, when the vocal folds are closed (A1), half open (B1), and fully open (C1). The second column from the left shows the corresponding OCT B-scans (A2, B2, C2) at the center location of the C-scan, with the red line tracing the vocal fold surface. The third column from the left shows the OCT 3D surface images (A3, B3, C3), which were obtained by interpolating 10 slices of the OCT C-scan taken across the anterior-posterior direction. The fourth column of Fig. 7 shows the VS images remapped based on the OCT 3D surface data, creating a pseudo-3D videostroboscopic image. Due to the low resolution of the OCT scan, an interpolation between OCT scans at various lateral positions was made to generate an image with the same number of pixels as the VS image. As a result, some artifacts can be noticed in the reconstructed image.
The subsurface morphology was recovered using the high-density M-scan mode while the vocal folds were at rest, as shown in Fig. 5. However, subsurface information can also be obtained during phonation at a lower resolution, as shown in Fig. 8 (see Visualization 1). The right vocal fold (left side of image) was injected sub-epithelially with a stiff hyaluronic acid-based gel (Restylane). To improve resolution, the two vocal folds were imaged separately and then combined during post-processing. Each A-line in the OCT image was expanded horizontally to restore the natural aspect ratio. The injected gel can be seen in the OCT image on the left side as a low-reflectance, circular zone, approximately 1.25 mm in diameter. In contrast, the gel injection site is not easily visible in the VS video images. The gel-induced asymmetry of the mucosal wave along the vertical axis is clearly apparent in the OCT video but more challenging to observe in the VS data. In this particular case, the fundamental phonation frequency was ∼50 Hz, which allowed the system to capture 20 phases per cycle and thus 16 A-lines per B-scan on each of the vocal folds.
In addition to the amplitude of the vocal fold vibration, velocity measurements were derived from the recorded data set, as shown in Fig. 9. The plots in Fig. 9(a) show the surface contour of the vocal folds across the medial-lateral dimension. The surface contour was measured manually by placing each B-scan image on a high-resolution grid for five equidistant phases (captured OCT B-scans). With the known phonation frequency of 50 Hz and the 20 phases captured, the elapsed time between consecutive phases was estimated to be 1.05 milliseconds. These data were used to calculate the maximum velocity in the vertical (z) dimension at which each lateral position of the vocal fold surface moved (see Fig. 9(b)). As can be observed, the maximum velocity varied across the vocal folds, since the mucosal wave speed varied across the vocal fold surface. The region at the opening of the vocal folds vibrated at the highest speed (∼100 mm/sec) with over 1 mm amplitude. The speed and amplitude decreased as the mucosal wave traveled laterally from the midline. However, OCT could not detect the position of the surface at the opening, and thus the velocity could not be evaluated at the midline.
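The velocity estimate described above amounts to a finite difference of the surface height between consecutive phases. The Python sketch below illustrates this with NumPy. It is an assumed reading of the procedure, not the authors' analysis. In particular, the inter-phase time is computed as one glottal period divided by the number of intervals between 20 phases, which is one interpretation that reproduces the quoted 1.05 ms. The function names are hypothetical.

```python
import numpy as np


def phase_interval_s(f0_hz, n_phases):
    """Time between consecutive phases, assuming n_phases span one period."""
    return (1.0 / f0_hz) / (n_phases - 1)


def max_vertical_velocity(z_mm, dt_s):
    """Maximum vertical speed at each lateral position, in mm/s.

    z_mm: array of shape (n_phases, n_lateral) with surface height in mm,
          one row per captured phase.
    dt_s: elapsed time between consecutive rows, in seconds.
    """
    dz = np.diff(z_mm, axis=0)          # height change between phases
    return np.max(np.abs(dz), axis=0) / dt_s
```

As a plausibility check with purely illustrative numbers, a surface point that rises 0.105 mm between two phases separated by 1.05 ms moves at 100 mm/s, the same order as the peak value reported here.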
The measured amplitudes and velocities may be compared with those obtained for human vocal folds in vivo (∼1-2 mm and ∼2 m/sec, respectively) [31] and in excised larynges (∼2 mm and ∼1 m/s, respectively) [29]. However, the human excised larynx exhibited an induced reduction in amplitude, which presumably significantly affected the normal kinematics of the contralateral vocal fold that, while mechanically normal, came into contact with a vocal fold of atypical stiffness. As acknowledged in [31], these types of measurements are not widely available in the literature and may be significantly affected by the type of larynx (in vivo, excised, healthy/pathological, etc.), imaging technology (structured illumination, etc.), and specific tissue tracking method (suture flesh points, medial edge tracking, surface edge tracking, etc.). In addition, comprehensive reporting of phonatory characteristics, including subglottal pressure, sound pressure level, and fundamental frequency, is needed for fair cross-study comparisons. The current study presented a proof-of-concept technology that can be used in the future for more detailed studies of vocal fold kinematics.

Discussion and conclusions
This paper demonstrates the feasibility of combined OCT/VS used for vocal folds imaging during phonation, enabling the clinician to quantify the amplitude of the mucosal wave, as well as its temporal and spatial irregularities during a clinical exam. To our knowledge, this is the first demonstration of combined OCT/VS within the same optical path, providing spatial co-registration and temporal synchronization of the recordings.
Previous efforts to image the vocal folds with catheter-based OCT systems (including polarization-sensitive OCT) have shown high-quality images with useful subsurface information [17,19-21]. These studies, however, were performed with the imaging probe in contact with the vocal folds, and thus were not suitable for dynamic imaging. More recent efforts using a high-speed OCT technique [23] have demonstrated dynamic imaging capability, but significant aliasing was still noted.
The presented approach improves on previous efforts in terms of temporal synchronization with VS, which is currently the gold standard for laryngologists. It demonstrated the capability of providing co-registered visualization of the mucosal wave amplitude (using OCT) while maintaining the existing benefits of VS for diagnostic practice.
The system described in this paper has certain limitations, mainly related to the low speed of the OCT swept source used. Increasing the A-line rate of the swept source from 20 kHz to 200 kHz or more would improve either the lateral resolution, as described by Eq. (1), or the phase resolution by a factor of 10 or more.
A higher A-line rate could also reduce image blurring, because each B-scan would occupy a shorter portion of the glottal cycle. During each phase, the vocal folds are still moving as the beam is scanned across them (B-scan), although the motion is limited to 1/n of the laryngeal cycle for a desired number n of phases to be captured. In terms of imaging time, 10 B-scans were acquired in 2 sec (the time for one reconstructed glottal cycle). Ten such axial movement scans, separated by 1 mm in the orthogonal direction, took 20 seconds.
Another limitation is that the strobe flashes are underutilized: they allow a maximum of 120 phases to be captured in the 2 seconds of a single reconstructed glottal cycle, but the OCT system was too slow to capture a B-scan for each strobe frame. To achieve a more accurate temporal snapshot, a parallel OCT approach might allow the capture of one OCT cross-section per strobe flash [37][38][39][40]. Another key parameter requiring improvement is the diameter of the current probe head (∼18 mm), which is almost twice as large as that of the current state-of-the-art rigid laryngoscopes used for in vivo human vocal fold imaging [41].
Imaging range is another key feature for clinical imaging. To employ this approach in a clinical setting, a laser source with longer imaging range (at least 10 mm) is preferred to keep the vocal folds within the imaging range during phonation. A stabilization apparatus compensating unwanted relative motion between the probe head and vocal folds is potentially needed for clinical operation to keep the motion of the vocal folds within the imaging range of the OCT system. Therefore, future efforts will employ the use of a high-speed parallel OCT approach with synchronous video imaging, along with a stabilization apparatus which would enable clinicians to reliably monitor 3D vocal fold kinematics in real time.