Introduction

Rapid vibration of the vocal folds in the larynx generates sound waves, an essential process in human speech. Injury and disease can alter the microstructure and mechanical properties of vocal fold tissues, which can degrade vocal quality and lead to voice loss1. Therefore, there is a great deal of interest in imaging the motion of vocal folds for diagnosis, measuring their mechanical properties in situ and developing novel treatments to restore their function. Laryngeal stroboscopy and high-speed videoscopy have proven useful for visualizing the vocal folds during phonation2. However, these methods are limited to surface views. In the larynx, elastic layers deep beneath the epithelium are key to normal function and to various disease states, but they are neither visible nor readily assessed by current examination methods. Subsurface or cross-sectional imaging may provide new information about the physiological function and pathophysiology of these organs in both clinical and preclinical settings.

Optical coherence tomography (OCT) can image cross-sections of soft tissues up to about 2 mm in depth with a spatial resolution of about 10 μm. The current imaging speed of OCT is, however, insufficient to directly capture the motion of vocal folds in the audio frequency range. In laryngeal imaging, studies with high-speed cameras suggested that frame rates of about 4,000 frames per second (fps) are needed to optimally capture the surface motion of vocal folds2. To achieve this frame rate with 1,000 axial lines (A-lines) per frame, the A-line rate should be about 4 MHz or higher. Current OCT systems operate at A-line rates well below several hundred kHz3,4,5,6. Although higher A-line rates are reachable with advanced laser technology7, the increased speed inevitably involves a reduction of signal-to-noise ratio (SNR), limiting the maximum practical A-line rate. Furthermore, the speed requirement would be even more demanding if one wishes to capture motion in a volume over time, i.e. 4D imaging8.

In the case of periodic or quasi-periodic motion, such as vocal vibration, gated imaging is an established, effective technique. A traditional method called prospective gating acquires data at a specific phase over many motion cycles9, as used in laryngeal stroboscopy and cardiac MRI. Because the data are acquired during only a small fraction of each cycle, the time required to acquire the full data set can be orders of magnitude longer than for simple static anatomical imaging10. In contrast, another technique called retrospective gating employs continuous image acquisition and realignment of the data according to the acquired images or a simultaneously acquired physiological signal11. The application of retrospective gating to OCT has been demonstrated for visualizing cardiac motion of Xenopus laevis12, and was further extended to imaging embryonic hearts of a chicken and a mouse at beating frequencies of about 10 Hz13.

In this paper, we describe a modified scheme of dynamic OCT capable of producing “snapshots” of periodic tissue motion at frequencies over 100 Hz by employing motion-triggered laser scanning. At each transverse location, multiple A-line images are continuously acquired over a single cycle and the probe beam is moved to the next spot at the end of the period of oscillation. Subsequent A-line registration in post processing synthesizes phase-aligned snapshots of tissue oscillation over the entire vibratory cycle. Compared to previous gated imaging techniques, the triggered data acquisition facilitates precise temporal and spatial registration of A-lines, minimizing artifacts associated with asynchrony between the periods of sample motion and A-line acquisition. The frequency range that can be captured with dynamic OCT is determined by the A-line rate of the system rather than its frame rate and could be readily extended to the entire audible frequency range, up to 20 kHz and beyond, with an A-line rate of approximately 200 kHz or higher. Since the data are acquired continuously, the technique offers faster image acquisition than prospective gated techniques and more robust time synchronization than conventional retrospective gating methods. We demonstrate the proof of concept and an initial application of this technique by imaging aerodynamically driven vocal folds in an ex vivo calf larynx model.

Results

Principle of operation

Consider a sample in periodic motion (Fig. 1A). An OCT console acquires A-line images continuously as the optical probe beam is scanned across the sample. Each pixel represents the magnitude of optical backscattering at a specific space-time location labeled with four coordinates: x, y and z for spatial dimensions and t for time. For periodic motion, the time coordinate can be mapped to the phase of motion, φ, with the help of a timing signal obtained either with a motion sensor or, in case of actively driven motion, from a driving signal. In post processing, the acquired A-lines are re-grouped with respect to phase to produce sequential snapshots of tissue oscillation across the vibratory cycle.
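As an illustration of the time-to-phase mapping, the following minimal Python sketch assigns a phase to each A-line from a list of trigger timestamps. It is not the implementation used in this work: the function name `assign_phase`, the assumption of exactly one trigger per motion cycle and the linear interpolation of phase within each cycle are all choices made for this example.

```python
import numpy as np

def assign_phase(aline_times, trigger_times):
    """Map each A-line timestamp to a motion phase in [0, 2*pi).

    Assumption for this sketch: one trigger marks the start of every cycle,
    and phase advances linearly between consecutive triggers.
    A-lines that do not fall inside a complete cycle get NaN.
    """
    aline_times = np.asarray(aline_times, dtype=float)
    trigger_times = np.asarray(trigger_times, dtype=float)
    # Index of the trigger that starts the cycle containing each A-line.
    idx = np.searchsorted(trigger_times, aline_times, side="right") - 1
    valid = (idx >= 0) & (idx < len(trigger_times) - 1)
    phase = np.full(aline_times.shape, np.nan)
    start = trigger_times[idx[valid]]
    period = trigger_times[idx[valid] + 1] - start
    phase[valid] = 2 * np.pi * (aline_times[valid] - start) / period
    return phase
```

Normalizing by the measured length of each individual cycle, rather than by a nominal period, is one way to tolerate small cycle-to-cycle drift in the timing signal.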

Figure 1
Principle of 4D OCT.

(A) System configuration. (B) A-line acquisition and registration. Each column represents an A-line taken at a specific transverse location and time. A set of A-lines corresponding to the same specific time in the motion cycle is grouped together to form a “snapshot” image at the particular phase of motion.

This general concept is implemented in a triggered scanning mode. Here, A-lines are acquired continuously at each lateral coordinate for the duration of one full vibratory cycle. At the end of each cycle, the beam scanner is triggered by the timing signal to increment the OCT sampling beam position by one step. This procedure is repeated until the beam has been scanned over the entire user-defined field of view.

Figure 1B illustrates the data acquisition and registration procedure. At each lateral location, a series of A-lines is recorded during a single full cycle of motion. Once all the lateral locations (x1 to xN) are scanned, A-lines that correspond to the same phase at different lateral locations are grouped together to create a cross-sectional image for each phase (Fig. 1B, bottom). Image acquisition therefore requires as many cycles of vibration as there are lateral locations. For example, with 500 lateral locations and an oscillation frequency of 100 Hz, it takes 500 cycles, or 5 seconds, to acquire the dataset. At an A-line rate of 10 kHz, 100 A-lines are obtained during a single oscillation cycle at each lateral location, giving an A-line-limited temporal resolution of 100 μs between frames in the series of reconstructed motion snapshots.
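The registration step and the accompanying arithmetic can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the processing code used here: it assumes an ideal acquisition in which every lateral position contributes exactly the same number of A-lines per cycle, stored in acquisition order, and the function names `acquisition_plan` and `regroup_snapshots` are hypothetical.

```python
import numpy as np

def acquisition_plan(n_lateral, f_vib_hz, aline_rate_hz):
    """Return (phases per cycle, total acquisition time in s, time step in s)."""
    phases_per_cycle = int(round(aline_rate_hz / f_vib_hz))
    total_time_s = n_lateral / f_vib_hz   # one vibration cycle per lateral position
    dt_s = 1.0 / aline_rate_hz            # spacing between reconstructed snapshots
    return phases_per_cycle, total_time_s, dt_s

def regroup_snapshots(alines, n_lateral, phases_per_cycle):
    """Regroup a triggered A-line stream into one cross-section per phase.

    alines: array of shape (n_lateral * phases_per_cycle, n_depth), ordered as
    all phases at x1, then all phases at x2, and so on (assumed ordering).
    Returns an array of shape (phases_per_cycle, n_depth, n_lateral).
    """
    alines = np.asarray(alines)
    n_depth = alines.shape[1]
    cube = alines.reshape(n_lateral, phases_per_cycle, n_depth)
    return cube.transpose(1, 2, 0)
```

For the example in the text, `acquisition_plan(500, 100, 10_000)` gives 100 phases per cycle, a total acquisition time of 5 s and a 100 μs step between snapshots.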

Validation with samples on an acoustic shaker

We performed a validation experiment with a flat mirror mounted on an acoustic shaker (SF 9324, PASCO Scientific) that was driven with sinusoidal signals from a function generator (Fig. 2A). The shaker was operated at a frequency of 100 Hz that was derived from the DAQ device, with an amplitude of 2 mm peak-to-peak. The DAQ-derived signal was used to eliminate quantization errors during signal digitization in data acquisition. To illustrate the limitation of conventional OCT, we first used beam scanning synchronized with the wavelength-swept laser source; that is, the beam was moved to the next lateral position after each A-line acquisition. Figure 2B shows a typical image taken with this conventional beam scanning. The mirror surface appears as a sinusoidal trace, a motion-induced artifact that misrepresents the flat moving mirror and results from the limited frame rate (20 Hz). Next, we implemented the triggered scanning scheme using a trigger signal from the signal generator that drove the acoustic shaker.

Figure 2
Experimental validation with a flat mirror on an oscillating shaker.

(A) Setup. (B) Image obtained with conventional beam scanning. (C) Synchronized M-mode data of single oscillation cycle at each specific lateral location. (D) Final reconstructed snapshot images. (E) Axial point spread function of the snapshot mirror image (red) in comparison with that of a stationary mirror (black).

Figure 2C shows two M-mode images at different lateral positions of xi and xj respectively, with each image showing a full cycle of mirror oscillation. The acquired A-lines were registered to the correct spatiotemporal locations to generate snapshot images. Figure 2D shows representative snapshot images reconstructed at two different motion phases φl and φm, respectively. These images correctly reveal the flat surface of the mirror at various vertical positions. For the quantitative validation, the mirror was fixed by applying varying constant offset voltages to the shaker. We imaged the steady mirror at various vertical positions with the conventional beam scanning method and found an excellent agreement with the reconstructed snapshot images. Figure 2E shows axial point spread function (PSF) comparison between static and snapshot mirror images acquired by averaging over 150 measurements. The two curves including the noise floors are nearly indistinguishable over a dynamic range of 45 dB, except for a small difference in the peak level due to alignment. This result indicates the accuracy of the timing control and image construction. The measured axial resolution for both images was 12 μm, in correspondence with the theoretical value.

We then replaced the solid mirror with an elastic polymer gel to test whether our technique can accurately capture the structural deformation of a sample during vibration (Fig. 3A). The samples were 1.6 mm thick homogeneous hydrogels (5% gelatin for the amplitude variation experiments or 10% gelatin for the frequency variation experiments). Signals derived from the DAQ device were used to drive the shaker. Figure 3B shows four representative images reconstructed from the dynamic OCT data obtained as the shaker was driven at a frequency of 110 Hz. Figure 3C shows the measured traces of the top and bottom surfaces over one cycle, which agree closely with sinusoidal curves. From these traces, we measured the change in thickness during one full cycle and calculated the maximum strain according to ε = Δl/l, where ε is the strain, l is the thickness of the gel and Δl is the maximum change in thickness. Figure 3D shows the peak strain measured as a function of driving voltage at the fixed frequency of 110 Hz. The data show the expected linear dependence of strain on applied voltage.
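The strain calculation ε = Δl/l can be illustrated with a short Python sketch operating on the two surface traces. The function name `peak_strain` is hypothetical, and the interpretation of Δl as the largest absolute deviation of the instantaneous thickness from the resting thickness is an assumption of this example; a peak-to-peak definition would give a value twice as large for sinusoidal motion.

```python
import numpy as np

def peak_strain(top_um, bottom_um, rest_thickness_um):
    """Peak strain eps = delta_l / l from surface traces over one cycle.

    top_um, bottom_um: depth positions of the two gel surfaces at each phase,
    with depth increasing from the top surface to the bottom surface.
    Assumption: delta_l is the largest absolute deviation of the
    instantaneous thickness from the resting thickness.
    """
    thickness = np.asarray(bottom_um, float) - np.asarray(top_um, float)
    delta_l = np.max(np.abs(thickness - rest_thickness_um))
    return delta_l / rest_thickness_um
```

With synthetic sinusoidal traces whose amplitudes differ by 16 μm on a 1600 μm thick gel, this returns a peak strain of 0.01.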

Figure 3
Hydrogel sample on an oscillating shaker.

(A) Schematic of the setup. (B) Snapshot images of the gel on the shaker. Scale bar, 1 mm. (C) Vertical displacements of the top surface (red) and bottom surface (green) of the gel over the entire cycle. Lines are sinusoidal curve fits. (D) Strain vs. voltage applied to the shaker at the driving frequency of 110 Hz. Dotted line is a linear fit (R² = 0.999). (E) Strain vs. driving frequency of the shaker at the applied voltage of 6 V. Red line is a curve fit with a 1/f² function (R² = 0.998).

We then investigated the gel motion as a function of driving frequency at a fixed input voltage. The electrical impedance of the shaker had a 1/f² dependence, where f denotes the frequency. The vibration amplitude of the shaker therefore followed a 1/f² curve (the measured coefficient of determination R² was >0.99). Figure 3E shows the peak strain measured from the dynamic OCT images. The strain followed the same 1/f² dependence (R² > 0.99) in the frequency range from 60 to 150 Hz, suggesting a linear mechanical response of the gel to the applied oscillating pressure in this frequency range.
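A one-parameter fit of the form strain = a/f² and its R² value can be computed in closed form, as in the sketch below. The fitting procedure actually used for Fig. 3E is not specified in the text, so this ordinary least-squares version, and the name `fit_inverse_square`, should be read as assumptions for illustration only.

```python
import numpy as np

def fit_inverse_square(freq_hz, strain):
    """Least-squares fit of strain = a / f**2; returns (a, R^2)."""
    f = np.asarray(freq_hz, float)
    y = np.asarray(strain, float)
    x = 1.0 / f**2
    # Closed-form least-squares solution for a single linear parameter.
    a = np.dot(x, y) / np.dot(x, x)
    residual = y - a * x
    r2 = 1.0 - np.sum(residual**2) / np.sum((y - y.mean()) ** 2)
    return a, r2
```

Because the model is linear in the single parameter a, no iterative optimizer is needed; measured strains at each driving frequency would be passed in place of synthetic data.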

Visualizing mechanically driven samples using a motion sensor

While the acoustic shaker can be driven by periodic external signals that are synchronized with the data acquisition system, vocal folds are aerodynamically driven and thus a trigger signal must be derived from their motion or by other means correlated with the motion. For example, the trigger may be derived from a microphone at the fundamental frequency of voice. To simulate autonomous periodic motion, we constructed an oscillating platform using a modified electric toothbrush (Fig. 4A). The oscillation frequency of the toothbrush (Crest) was measured to be about 50 Hz. For motion sensing, a small magnet was attached to the motor shaft and a copper wire coil was placed over the magnet to pick up current generated by magnet movement. The signal from the pick-up coil was filtered and converted to a TTL pulse, which then was channeled to a computer to control the galvanometer in the beam scanner. As a result, the probe beam was laterally shifted by about 15 μm for each cycle of vibration. With the A-line acquisition rate of 10 kHz and the sample vibration frequency of 50 Hz, 200 motion phases per cycle were acquired at 500 A-line locations in 10 seconds.

Figure 4
Autonomously driven hydrogel sample.

(A) Sample on top of a vibrating toothbrush head. (B) Bi-layered hydrogel sample as seen with conventional stroboscopy (media). The scan path of OCT beam is shown in red line. (C) Snapshot images of the two-gel sample at various motion phases (media). Scale bar is 1 mm.

As a sample, we used a two-layer hydrogel consisting of a highly elastic gel (a mixture of xanthan gum and glucomannan) placed on top of a stiffer gel (2% gelatin). Cross-sectional imaging was compared to conventional videostroboscopic imaging. In the stroboscopic images (Fig. 4B and Supplementary Movie 1), it was not discernible that the sample had a bilayer structure. However, the OCT sequences revealed the differential motion of the two layers in striking detail (Fig. 4C and Supplementary Movie 2). The more elastic gel on top lagged the bottom layer and the toothbrush head with a whip-like motion. The crisp detail observed in this sequence demonstrates that the triggering works properly and that the motion is sufficiently periodic to reconstruct a single cross-sectional cycle from 500 separate cycles.

Next, we imaged an extirpated sample of vocal fold tissue placed on the same vibrating stage. The OCT probe beam was scanned parallel to the axis of motion from the supraglottic to the subglottic region (Fig. 5A and Supplementary Movie 3). Prior to imaging, a biopolymer (cross-linked polyethylene glycol) was injected into the superficial lamina propria just below the epithelium to simulate surgical vocal fold augmentation14. The movie of reconstructed snapshots shows a dynamic tissue deformation during rapid vibratory motion (Supplementary Movie 4). Figure 5B shows a sequence of representative frames at various motion phases.

Figure 5
Autonomously driven vocal fold tissue sample.

(A) A vocal-fold tissue sample on a toothbrush head (media). Red line indicates OCT beam scan path. (B) Snapshot images of the excised calf vocal fold at various motion phases (media). Scale bar, 1 mm.

Imaging aerodynamically driven vocal folds

Next, we tested the capability of the system to image an aerodynamically driven vocal fold. For this experiment, we used the hemilarynx model15. The close apposition of the vocal fold to the glass window causes it to oscillate in what has been shown to be a remarkably normal pattern. This allows for excellent visualization of the medial surface of the vocal fold16,17. Bisected calf larynges were mounted in a custom-built chamber (Fig. 6A). As in the previous experiment, the OCT probe beam was scanned along the axis of mucosal wave motion, from the supraglottic to the subglottic direction (Fig. 6B and Supplementary Movie 5). Figure 6C shows a sequence of representative frames at various motion phases for a single coronal plane (Supplementary Movie 6). The fundamental vibration frequency was about 150 Hz and the total data acquisition time was 3.45 seconds. The flat line at the top of the movie corresponds to the bottom surface of the glass window. The line that appears in the middle of the image is an artifact of the top surface of the window, which appears due to an inherent but remediable property of depth degeneracy in Fourier-domain OCT8.

Figure 6
Aerodynamically driven vocal fold tissue sample.

(A) Schematic of the hemilarynx preparation. (B) Vocal fold tissue in the hemilarynx chamber (media) seen through the transparent glass window. Red line indicates OCT beam scan path. (C) Snapshot images of hemilaryngeal dynamics at various motion phases (media). Scale bars, 500 µm. (D) Tracked motion of the apex of the mucosal wave. (E) The vertical (Vx, red) and transversal (Vz, blue) components of the velocity of the apex from the closing (c) through propagation (p) to opening (o) phase of the vocal fold motion.

To illustrate the potential for quantitative analysis from such image sequences, the crest of a propagating mucosal wave was identified in one image sequence and its coordinates were tracked from the point when it emerges (0.1π) until it disappears from the field of view (1.6π). A 5-point moving average filter was applied to the data subsequently. In Figs. 6D and E, the position and velocity of the crest over the course of one motion cycle are shown. Three distinctive phases of motion were observed: (i) c-phase: closing of the vocal fold, characterized by rapid uprising motion of the tissue toward the glass window (Fig. 6D), (ii) p-phase: propagation of the mucosal wave during the closed phase and (iii) o-phase: opening of the vocal fold at the end of the propagation. During the closing and opening phases, horizontal (Vz) motions were predominant (Fig. 6E, blue), while during the propagation phase the motion was mainly in the upward (Vx) direction (Fig. 6E, red). The peak speed derived from the images was about 0.6 m/s. While this sequence was fairly typical, a wide variety of mucosal wave patterns were observed by varying the driving air pressure and the gap between the vocal fold and the glass window, suggesting that this method may eventually provide insight into the tissue dynamics associated with different vocal intensities and qualities.
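The smoothing and velocity estimation described above can be sketched as follows. The text does not state whether the moving average was applied to the positions or to the velocities, nor how the velocity was differentiated, so this example assumes smoothing of the tracked positions followed by finite differences; the function name `smooth_and_differentiate` is hypothetical.

```python
import numpy as np

def smooth_and_differentiate(pos_m, dt_s, window=5):
    """Moving-average smoothing followed by finite-difference velocity.

    pos_m: array of shape (n_frames, 2) with the (x, z) crest coordinates in metres.
    dt_s:  time between consecutive snapshots in seconds.
    Returns (smoothed positions, velocities), both trimmed to the frames
    where the full averaging window fits.
    """
    pos_m = np.asarray(pos_m, float)
    kernel = np.ones(window) / window
    smoothed = np.column_stack(
        [np.convolve(pos_m[:, k], kernel, mode="valid") for k in range(pos_m.shape[1])]
    )
    # Central differences in the interior, one-sided differences at the ends.
    velocity = np.gradient(smoothed, dt_s, axis=0)
    return smoothed, velocity
```

The two velocity columns correspond to the Vx and Vz components plotted in Fig. 6E, and the speed at each phase is the Euclidean norm of each row.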

For 4D imaging, thirty-one coronal sections spaced 7 µm apart were acquired from the middle of a vibrating vocal fold in less than 2 minutes. 3D image analysis software (Amira, Visage Imaging) was used to render the 4D video (Fig. 7 and Supplementary Movie 7).

Figure 7
A snapshot of aerodynamically driven vocal fold tissue.

The apex of the tissue is in contact with the glass window (top flat surface). The distance between coronal sections is exaggerated by 10-fold for better visualization of the fine wrinkles in the tissue surface (media).

Discussion

Triggered OCT has distinct advantages over previously developed stroboscopic and gated imaging techniques. First, compared to prospective gated imaging, where data are selectively acquired in a pulsed manner, data acquisition is continuous and therefore more time-efficient. Second, triggering allows the data acquisition to be synchronized with sample motion for accurate timing control. This step is crucial because it ensures that, as long as the motion is periodic, the number of motion points resolved per cycle is constant for all acquired cycles and that these points are in phase with the corresponding points from other cycles. In short, triggering minimizes, and can even eliminate, cycle-to-cycle timing misalignments in the reconstructed snapshots. The previous retrospective gating techniques11,12,13 are vulnerable to such misalignments because image acquisition is asynchronous with the sample motion. Third, motion triggering enables single-cycle registration at each lateral location. This significantly increases the overall acquisition speed by obviating the need for time-consuming M-mode acquisition of multiple cycles at each lateral location12,13. Lastly, when external periodic stimuli are used18, the driving signal can be perfectly synchronized with the data acquisition and beam scanning. Triggered OCT is therefore ideally suited to such cases.

One important limitation of triggered OCT arises when sample motion deviates from perfect periodicity. The fundamental frequency of phonation in patients may vary considerably more than that of an inanimate preparation. Healthy individuals are able to produce sound at a stable fundamental frequency over an extended period of time, with a stability within 1%. However, this ability tends to be compromised during endoscopy and in patients with severe voice disorders. Such variations in frequency and/or amplitude during OCT data acquisition can cause erroneous timing registration, resulting in artifacts in the reconstructed snapshots. While it is difficult to correct for these errors completely, several strategies may mitigate the problem. For example, the A-line rate could be adjusted to follow the drift of the oscillation frequency, which would reduce the phase asynchronization error. Post-acquisition error correction by A-line interpolation based on the recorded timing signal may also be helpful.

The maximum speed of motion that can be captured is fundamentally limited by the speed of A-line acquisition, because sample motion during the A-line acquisition (i.e. during a single sweep of the laser wavelength) can cause motion artifacts, such as the axial image distortion known as the Doppler artifact19. During the closing phase, the vocal fold undergoes rapid, accelerating axial movements that result in noticeable image distortion. In the future, such errors can be suppressed with faster A-line repetition rates. For example, at a motion speed of 0.5 m/s and an increased A-line rate of 200 kHz20, the axial motion during the acquisition time of 5 μs would be 2.5 μm, which is smaller than the axial resolution and would cause negligible motion-induced artifacts.
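The estimate above is simply the product of the tissue speed and the A-line acquisition time. The short Python sketch below reproduces it; the function name is hypothetical, and the acquisition time is taken to be the full A-line period (the reciprocal of the A-line rate), as in the 5 μs figure quoted in the text.

```python
def axial_motion_per_aline_um(speed_m_per_s, aline_rate_hz):
    """Axial distance (in micrometres) travelled during one A-line period.

    Assumption: the wavelength sweep occupies the full A-line period.
    """
    acquisition_time_s = 1.0 / aline_rate_hz
    return speed_m_per_s * acquisition_time_s * 1e6
```

For 0.5 m/s at 200 kHz this gives 2.5 μm. By the same formula, the peak speed of about 0.6 m/s measured here at the present 10 kHz rate corresponds to roughly 60 μm per A-line, several times the 12 μm axial resolution, which is consistent with the distortion noted during the closing phase.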

Artificially phonated excised calf larynges generate fundamental frequencies ranging from roughly 100 to 200 Hz, for which our current imaging system with an A-line rate of 10 kHz captures 50 to 100 phase steps per cycle. Human speech typically involves frequency components up to 1 kHz, so higher A-line rates would be necessary for future clinical applications. For example, a 200 kHz A-line rate system20 would be able to capture 200 motion phases per cycle for a tissue vibration at 1 kHz.

Triggered OCT may be used to measure the frequency-dependent 3D deformation of sample in response to force and, therefore, may be useful for elastic modulus mapping of tissues, tissue engineered materials and constructs. Injecting biomaterials to restore the normal mechanical properties in the damaged vocal folds is an emerging concept for treatment in laryngology14. Triggered OCT may allow noninvasive assessment of the mechanical properties of the implanted biomaterials and their surrounding tissues in vivo and over time. This method may be particularly useful for evaluating sub-surface injection of substances designed to enhance vocal fold motion in patients who are deficient in normal pliable tissue21.

Methods

Imaging system

The OCT system (Fig. 8) was configured for optical frequency domain imaging (OFDI), as described in detail elsewhere4. Briefly, the light source was a wavelength-swept laser with a tuning range from 1220 to 1345 nm (12 µm axial resolution in air) and an average output power of 50 mW. The laser was operated at a repetition rate (A-line acquisition rate) of 10 kHz. The sample arm employed a pair of XY galvanometer-mounted mirror scanners (Cambridge Technologies) and a 35 mm focusing lens, resulting in a lateral resolution of 15 µm. The maximum sensitivity of the system was measured to be over 110 dB. A data acquisition board (DAQ) digitized the interference signal at 10 MS/s, yielding a Nyquist-limited free-space depth range of 3.3 mm. For the initial experiments, samples were mounted on mechanical vibration stages, which were either an acoustic shaker or a modified motorized toothbrush. When the acoustic shaker was employed, a DAQ-based master clock was used to derive both the fundamental frequency of the acoustic shaker and the sweep frequency of the laser source. This ensured that the vibration frequency was an integer fraction of the digitization rate (10 MHz). For samples that vibrate at their own natural fundamental frequencies (i.e. vocal fold and electric toothbrush vibrations), however, this ideal synchronization could not be applied. Sample motion was detected using the AC waveform that drove the mechanical vibrator, a sensor attached to the toothbrush, or a subglottal pressure signal from the phonating larynx. These signals were fed to an oscilloscope or custom-built timing circuitry to produce TTL trigger signals. The TTL signals were then used to generate a stepwise ramp signal to control the galvanometer position.
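The conversion of TTL trigger pulses into a stepwise ramp can be modeled in software as counting rising edges. The sketch below is a simplified simulation on a sampled waveform, not a description of the hardware or control software used here: the function name `staircase_from_ttl`, the 2.5 V logic threshold and the fixed voltage step per trigger are assumptions for illustration.

```python
import numpy as np

def staircase_from_ttl(ttl, step_volts, threshold=2.5):
    """Build a stepwise ramp that increments by step_volts on each rising edge.

    ttl: sampled trigger waveform in volts.
    threshold: logic threshold separating low from high (assumed value).
    """
    high = np.asarray(ttl, float) > threshold
    rising = np.zeros(high.shape, dtype=bool)
    # A rising edge is a high sample immediately preceded by a low sample.
    rising[1:] = high[1:] & ~high[:-1]
    return np.cumsum(rising) * step_volts
```

Each step of the resulting ramp holds the galvanometer at one lateral position for exactly one motion cycle, so the step size sets the lateral sampling interval.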

Figure 8
Schematic of the OCT imaging setup.

Hemilarynx preparation

We obtained fresh calf larynges from a local meatpacker. Larynges were bisected sagittally and placed in a custom holder that allowed air to flow between the vocal fold and a glass window22. Air leaks were sealed with dental alginate. A glass microscope slide was used as the imaging window and the gap between the window and the intact vocal fold was adjusted to be 1 mm or less. A pressure transducer was attached to a side port in the subglottic region of the hemilarynx. Air was warmed to 37 °C and humidified using a ConchaTherm-IV device. The air pressure was finely adjusted with a regulator until a stable phonation of the vocal folds was reached. Subglottal pressure varied with each cycle of vocal fold motion, thus providing a useful trigger signal. The pressure waveform was filtered, amplified and fed to an oscilloscope trigger circuit. The trigger pulses were used to control the scanning of the OCT beam.