In vivo tomographic visualization of intracochlear vibration using a supercontinuum multifrequency-swept optical coherence microscope.

This study combined a previously developed optical system with two additional key elements: a supercontinuum light source characterized by high output power and an analytical technique that effectively extracts interference signals required for improving the detection limit of vibration amplitude. Our system visualized 3D tomographic images and nanometer-scale vibrations in the cochlear sensory epithelium of a live guinea pig. The transverse and axial resolutions were 3.6 and 2.7 µm, respectively. After exposure to acoustic stimuli of 21-25 kHz at a sound pressure level of 70-85 dB, spatial amplitude and phase distributions were quantified on a targeted surface with an area of 522 × 522 μm².


Introduction
The cochlea of the inner ear transduces sound, which is a form of mechanical energy, into electrical signals, which are essential for neurotransmitter release. This process is triggered by nanoscale vibrations induced in the cochlear sensory epithelium, which contains a layer of sensory hair cells and the basilar membrane (BM), i.e., the underlying extracellular matrix. The vibrations in the BM are controlled by the active motion of hair cells [1][2][3][4]. Although this arrangement is thought to critically contribute to the high sensitivity and sharp tuning of hearing, the in vivo behavior of each layer as well as the correlation of the dynamics among multiple layers remain unclear.
To address these issues, various optical measurement systems have been developed. Laser Doppler vibrometer (LDV) techniques can detect epithelial vibrations in the picometer range [5][6][7][8][9][10]. For example, in the cochlea, traveling waves elicited by sound stimulation propagate from the base to apex on the BM thus forming a spatial amplitude distribution. To determine the vibration distribution caused by the motion of a traveling wave, asynchronous measurement has been conducted by means of beam scans in an LDV system [11]. Nonetheless, such methods are inherently unable to simultaneously determine the vibration distribution on a targeted surface and extract tomographic information from a sample. Therefore, they are not applicable to the analysis of each layer in the sensory epithelium. In this regard, stroboscopic detection schemes may be useful for detecting the vibrations [12][13][14][15]; however, systems based on such methods are usually too complicated to be integrated into a microscope for in vivo measurement.
To analyze the motions inside the tissue and overcome the above-mentioned shortcomings, we recently proposed multifrequency-swept optical coherence microscopic vibrometry (MS-OCMV), which is a combination of wide-field heterodyne interferometric vibrometry (WHIV) [16] and multifrequency-swept interferometry [17,18]. This approach can successfully record not only 3D volumetric tomography but also wide-field vibrations on a surface inside a biological tissue [18]. Nevertheless, it cannot accurately measure nanoscale vibrations in the sensory epithelium for the following two reasons. First, the power of the installed superluminescent diode (SLD) is too low to achieve adequate signal reflection from the tissue (the power of the light applied to the sample: 0.9 mW). Second, the method for analyzing the interference signals is unsuitable because one of the frequency components required for quantifying the vibration amplitude overlaps with the direct current (DC) component at a frequency of 0 Hz.
To overcome these problems, in the present study, we employ a supercontinuum (SC) light source, which provides more powerful irradiation than an SLD does. Furthermore, to effectively extract the interference signals that represent vibrations of an object in the nanometer range, we modulate the motion of the reference mirror in the WHIV system.
As for newly devised technologies, optical coherence tomography (OCT) has been widely used in recent years and can be regarded as a competitor of our optical system. Among such methods, spectral-domain OCT and swept-source OCT combined with Doppler techniques have been applied to the measurement of inner-ear vibration [19][20][21]. The scanning in Doppler OCT and in our system is oriented differently. Doppler types of OCT can immediately determine a cross-sectional distribution of vibration in the depth direction along an "A-scan" line. On the other hand, our system can reduce the lag of the lateral beam scan by performing the en face measurement using a CMOS camera and can immediately capture the lateral vibration distribution on a surface. Nonetheless, it requires multifrequency sweeping for the cross-sectional scan in the axial depth direction, as is the case for time domain OCT. Therefore, Doppler types of OCT are specialized for cross-sectional imaging, whereas our technique is useful for en face imaging of a laterally spread vibration distribution on an internal surface such as the BM in the cochlear sensory epithelium.
When we limit the application to en face vibration measurement in a sensory epithelium with low reflectance (0.02%-0.06%) [22], Doppler types of OCT require repeated A-scans to average the data and reduce the noise floor [23]. In addition, the M-scan mode is often used for monitoring temporal changes in the vibrations. Thus, these two configurations result in an asynchronous B-scan with a lag, which makes simultaneous measurement of wide-range motions difficult in a live biological tissue.
To overcome this difficulty, we attempted in vivo en face vibration analysis of a sensory epithelium by MS-OCMV. The improvements enable us to quantitatively visualize the wide-field vibrations on a surface at a desired depth position in the sensory epithelium of a live animal. Moreover, the motion and a three-dimensional (3D) volumetric image of the tissue can be captured without averaging the data.

Instrumentation
The setup of the improved MS-OCMV is shown in Fig. 1(A). This system consists of a multifrequency generation unit, microscopic interferometer, and detection unit.
The multifrequency generation unit consists of an SC light source (SuperK EXR-4; NKT Photonics, Denmark), a Fabry-Pérot filter (FPF), and an optical bandpass filter (OBF). For practical use, we extracted a wavelength band ranging from 600 to 980 nm by means of the OBF. The light beam characterized by discrete multifrequency components was acquired by transmitting the collimated and filtered SC light through the FPF. Figure 1(B) depicts a comparison between the spectrum of the multifrequency light from the SC and that of an SLD (T-850-HPI, Superlum, Ireland), which we used in the original system [17]. These data were obtained through the FPF. The bandwidth of the SC was similar to that of the SLD (~200 nm); nevertheless, maximum irradiation of the sample surface in the case of the former light source was improved to 37 mW, whereas maximum irradiation in the case of the latter light source was 0.9 mW.
The FPF consists of two partially reflecting mirrors with a reflectivity of 0.8, each of which is attached to a different piezoelectric actuator (PA) (MOB-A or MD-140L; MESS-TEC, Japan). Cavity length, d, determines the interval frequency (i.e., the free spectral range), Δν, via the relation Δν = c/(2d), where c is the speed of light in air. The linewidth of the longitudinal mode is determined by the finesse of the plates, which was ~14 in our implementation. In our experiments, d was set to ~20 mm; Δν was estimated to be 7.5 GHz. Thus, considering the finesse, multifrequency components were produced with an estimated linewidth of 535.7 MHz (1.2 pm in terms of wavelength). The cavity length was varied by means of the two aforementioned PAs to perform an axial depth scan by multifrequency-swept interferometry [18]. The PAs were driven by a ramp signal from a function generator (WW5064; Tabor Electronics, Israel). The signal was amplified by a high-power amplifier (TZ-0.5P; Matsusada Precision, Japan), and the total stroke of the PAs was 980 μm.
The generated multifrequency light entered the microscopic interferometer, which was used for full-field tomographic and vibration measurements. The incident light was split using a polarization beam splitter (PBS) to irradiate the sample and reference mirror surfaces. A polarizer was combined with the PBS to adjust the branching ratio between the two split beams. These beams were polarized orthogonally to each other. In each arm, a quarter wave plate (QWP) was inserted to avoid unexpected reflections from the optical elements. The beam that passed through the QWP was incident on either the reference mirror or the sample. The reflected beam again passed through the QWP and was directed to the PBS. The polarization of the reflected beam was thus rotated by 90° relative to the polarization of the incident beam. Consequently, the beam in the reference arm was polarized orthogonally to the beam in the sample arm; these two beams were recombined in the PBS and entered the objective lens. In this process, the polarizations of stray light beams, which were scattered from other optical elements located between the QWPs and the PBS or polarizer, were not rotated; therefore, they could not enter the objective lens through the PBS. Moreover, the system was equipped with a rotatable linear polarizer as an analyzer to attain optimal interference contrast by controlling the polarization extinction ratio between the two beams from the reference and sample arms.
The reference mirror is a well-polished glass plate with a reflectance of approximately 4%. The reference path length is modulated by a piezoelectric transducer (PZT) attached to the mirror; this mechanism plays a key role in the operation of the improved WHIV (see the next subsection).
An en face interferometric image of the sample surface was captured by means of an inverted microscope having an objective lens characterized by an ultralong working distance of 205 mm (UWZ200; Union Optics, Japan). This profile provided us with sufficient space for laying an anesthetized guinea pig under the lens. Optical magnification varied from 0.7 to 9.8. The depth of focus and numerical aperture at maximum magnification were 62 μm and 0.093, respectively. This information is available in the manufacturer's instructions (URL: http://www.union.co.jp/en/union_uwz.php). Imaging resolution was estimated to be 3.6 μm via microscopic examination of a test target (USAF Resolving Power Test Target 1951) as shown in Fig. 1(C). In this figure, we confirmed that 1 pixel of the CMOS camera corresponds to approximately 2.1 μm at maximum magnification.
The axial depth scan can be operated without changing the optical path difference (OPD), although this process requires vertical motion of the microscopic interferometer in conventional time domain full-field OCT systems.
The principle of multifrequency-swept interferometry was established in our earlier study [18]. As mentioned above, the interval of the spectral multifrequency components, Δν, is determined by the cavity length, d [because Δν = c/(2d)]. The interference signal of the multifrequency light manifests repetitive fringe peaks (i.e., high-order interference) with a constant interval of Δ_OPD = c/Δν = 2d. Thus, the OPD that yields the first-order interference peak is 2d. During our measurement, the optical path length of the object arm was set to be identical to the focal length, whereas the reference arm was set to be approximately 20 mm (i.e., half of Δ_OPD) longer than the object arm. Owing to this arrangement, the first-order interference peak overlapped with the focal plane. When d was varied by controlling the FPF, Δ_OPD changed, which resulted in axial depth scanning. Note that in this mode, motion of the reference mirror is unnecessary. The maximum scanning range of the first-order interference peak is equivalent to the maximum stroke of the FPF.
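The relations between the cavity length, the mode spacing, the mode linewidth, and the fringe interval can be checked with a short calculation. The following Python sketch is only illustrative: the function names are introduced here, the vacuum speed of light is used in place of the speed in air, and the central wavelength of 800 nm used for the wavelength conversion is an assumption that is not stated in the text.

```python
# Speed of light in vacuum (m/s); the text uses the speed in air (~0.03% lower).
C_LIGHT = 299_792_458.0


def free_spectral_range(cavity_length_m):
    """Mode spacing of the Fabry-Perot filter: delta_nu = c / (2 d)."""
    return C_LIGHT / (2.0 * cavity_length_m)


def mode_linewidth(fsr_hz, finesse):
    """Linewidth of one longitudinal mode: free spectral range / finesse."""
    return fsr_hz / finesse


def fringe_interval_opd(cavity_length_m):
    """Interval between high-order interference peaks: c / delta_nu = 2 d."""
    return C_LIGHT / free_spectral_range(cavity_length_m)


def linewidth_in_wavelength(linewidth_hz, center_wavelength_m):
    """Convert a frequency linewidth to wavelength: lambda^2 * d_nu / c."""
    return center_wavelength_m ** 2 * linewidth_hz / C_LIGHT


d = 20e-3                      # cavity length from the text (~20 mm)
fsr = free_spectral_range(d)   # close to the 7.5 GHz quoted in the text
lw = mode_linewidth(fsr, 14)   # finesse ~14 from the text
opd = fringe_interval_opd(d)   # 2 d
# 800 nm is an assumed central wavelength, used only for illustration.
lw_lambda = linewidth_in_wavelength(lw, 800e-9)
print(f"FSR = {fsr / 1e9:.2f} GHz, linewidth = {lw / 1e6:.1f} MHz")
print(f"fringe interval = {opd * 1e3:.1f} mm, linewidth = {lw_lambda * 1e12:.2f} pm")
```

With d = 20 mm, the fringe interval is 40 mm, so shifting the cavity length by a given amount moves the first-order interference peak without any motion of the reference mirror.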
Depth imaging of 3D OCT and vibration analysis with the WHIV technique, which are described in detail in the next subsection, are separate procedures. Because d of the FPF corresponds to Δ_OPD of the interferometer, the position on the Z-axis where interference occurs changes according to the cavity length of the FPF. In this manner, we can determine a depth position of interest for the en face vibration measurement and obtain the vibration parameters on the X-Y plane at the depth position.
To rapidly capture en face interferometric images, a high-speed CMOS camera (FASTCAM Mini AX200; Photron, Japan; 1024 × 1024 pixels; pixel size, 20 × 20 μm; bit depth, 12; frame rate, 2000 fps) was employed as a detector. The sensitivity of the sensor is ISO 40,000 as per the ISO standard. The full-well capacity is 16,000 e⁻, and the sensor dynamic range is 54.8 dB. The captured images were transferred to a PC, which also controlled the CMOS camera, SC light source, and function generator. The trigger for initiating a recording was generated by the function generator. In our system, two types of acquisition were performed: volume scans (three spatial dimensions) and vibrometry (two spatial dimensions + one temporal dimension). In both cases, the acquisition speed of the 3D volume data was 2 Gvoxels/s. Transfer of one volume data set of 1.024 GB took approximately 1 min. The maximum data size was 1024 pixels on the X-axis, 1024 pixels on the Y-axis, and 8000 frames along the Z-axis for a typical volumetric data set; the same limits applied to the vibrometry data sets. All the captured images were processed and analyzed in MATLAB.
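The acquisition figures quoted in this section follow from simple arithmetic on the camera parameters. The sketch below, written in Python for illustration (the function names are hypothetical), relates the sensor pixel size and the maximum magnification of 9.8 to the pixel size on the sample, and the frame format to the voxel rate and scan time; the 256-pixel region and the 8000-frame scan are the values used in the in vivo measurements.

```python
def object_pixel_size_um(sensor_pixel_um, magnification):
    """Size of one camera pixel projected onto the sample."""
    return sensor_pixel_um / magnification


def field_of_view_um(n_pixels, sensor_pixel_um, magnification):
    """Lateral extent covered by n_pixels on the sample."""
    return n_pixels * object_pixel_size_um(sensor_pixel_um, magnification)


def voxel_rate(width_px, height_px, frame_rate_fps):
    """Voxels acquired per second for full-frame recording."""
    return width_px * height_px * frame_rate_fps


def scan_time_s(n_frames, frame_rate_fps):
    """Duration of a scan consisting of n_frames."""
    return n_frames / frame_rate_fps


pixel_um = object_pixel_size_um(20.0, 9.8)   # about 2.04 um per pixel
fov_um = field_of_view_um(256, 20.0, 9.8)    # about 522 um for 256 pixels
rate = voxel_rate(1024, 1024, 2000)          # about 2.1e9 voxels/s
duration = scan_time_s(8000, 2000)           # 4 s for 8000 frames
print(f"{pixel_um:.2f} um/pixel, {fov_um:.0f} um, {rate / 1e9:.2f} Gvoxel/s, {duration:.0f} s")
```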

Improvement of the WHIV technique
Our previous WHIV technique requires Fourier domain analysis using two different frequency components to quantify the amplitude and phase of the sample's vibration [16][17][18]. These two components are called a "zeroth-order signal" at a frequency of 0 Hz and a "first-order signal," which corresponds to a difference frequency (i.e., beat frequency) between the frequencies of the sample and reference vibrations. In the original method, the DC component overlapped and added linearly to the zeroth-order signal because the frequency of the zeroth-order signal was 0 Hz. Therefore, a possible disadvantage of the original method is that the DC component interferes with proper extraction of the zeroth-order component required for estimation of the vibration amplitude (see Subsection 4.2). This problem is prominent when the measurement targets are biological samples that yield weak interference signals, such as the cochlear sensory epithelium.
To overcome this drawback, we added low-frequency offset modulation to the modulation applied to the reference mirror. Figure 2 illustrates a comparison of the spectral signal obtained by the improved method with the signal obtained by the original technique. In the former case, the zeroth-order frequency component was clearly separated from the bias noise owing to the offset modulation. The details of the improvement are as follows. Suppose that the sample is stimulated by a pure tone sound at a frequency of f_s and the reference mirror is sinusoidally vibrated at a slightly different frequency, f_r = f_s + Δf. Moreover, offset modulation with a low frequency, f_0, is added to the motion of the reference mirror. The temporal interference signal detected at each pixel is then expressed by Eq. (1).
In Eq. (1), A_xy and B_xy denote the DC component, which does not contribute to the interference, and the interferometric amplitude, respectively. α_xy is the spatial interferometric phase distribution at depth position d', and Z_s, Z_r, and Z_0 are the spatial amplitude distributions of the vibrating sample, reference mirror, and offset modulation, respectively. In addition, Φ_s, Φ_r, and Φ_0 are the initial phase distributions of the acoustic stimulus, reference mirror vibration, and offset modulation, respectively. In general, the frame rate of a standard CMOS camera is much lower than the sampling rate of conventional photodetectors, which ranges from a few hundred kilohertz to several tens of gigahertz. The camera therefore averages out the relatively high-frequency components of the temporal interference signal [see Eq. (1)] but affects the low-frequency components negligibly. Consequently, the components at Δf and f_0, which are sufficiently lower than the frame rate of the camera, are retained in the frequency domain.
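For orientation, one form of the temporal interference signal that is consistent with the symbol definitions above is sketched below. The use of sine functions, the signs of the three modulation terms, and the explicit factor of 2π are assumptions made here for illustration; the amplitudes Z_s, Z_r, and Z_0 are expressed in radians of interferometric phase.

```latex
I_{xy}(t) = A_{xy} + B_{xy}\cos\bigl[\alpha_{xy}
    + Z_s \sin(2\pi f_s t + \Phi_s)
    - Z_r \sin(2\pi f_r t + \Phi_r)
    - Z_0 \sin(2\pi f_0 t + \Phi_0)\bigr],
\qquad f_r = f_s + \Delta f
```

In a signal of this form, only the combinations of the three modulations that oscillate at multiples of Δf and f_0 vary slowly enough to survive the temporal averaging by the camera.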
After application of the Jacobi-Anger expansion [24] and exclusion of the terms associated with higher frequencies, Eq. (1) can be rewritten as Eq. (2), where m and n are positive integers that are indices associated with the harmonics of f_0 and Δf, and J_k denotes the kth-order Bessel function of the first kind. M and N are the limits of the harmonic orders, which are determined by the exposure time and frame rate of the camera. Φ denotes a relative phase, which is expressed as Φ = Φ_r − Φ_s. The absolute value of Φ depends on the timing of the trigger that starts the capture. In our current system, however, the trigger and the modulation signals were not synchronized; thus, the absolute value of Φ changed randomly with each measurement. Because this value can also be shifted arbitrarily by adjusting the start point of the recorded signal during data processing, the relative spatial phase differences are the more important quantity. Therefore, for convenience, we redefine Φ_s as the resulting phase, Φ_s = Φ, which includes the reference phase Φ_r.
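The expansion step can be illustrated with the Jacobi-Anger identity itself. The sketch below assumes that the argument of the cosine contains the phase term Z_s sin θ_s − Z_r sin θ_r − Z_0 sin θ_0, with θ_k = 2π f_k t + Φ_k; this sign convention is an assumption made here and is not a statement of the exact expression used in this study.

```latex
e^{\,i z \sin\theta} = \sum_{k=-\infty}^{\infty} J_k(z)\, e^{\,i k \theta},
\qquad J_{-k}(z) = (-1)^k J_k(z)

\cos\bigl[\alpha + Z_s\sin\theta_s - Z_r\sin\theta_r - Z_0\sin\theta_0\bigr]
  = \operatorname{Re}\Bigl\{ e^{\,i\alpha}
    \sum_{p,q,m=-\infty}^{\infty} J_p(Z_s)\, J_q(Z_r)\, J_m(Z_0)\,
    e^{\,i(p\theta_s - q\theta_r - m\theta_0)} \Bigr\}
```

Terms with p ≠ q oscillate near multiples of the acoustic frequency and are averaged out by the camera, whereas terms with p = q = n oscillate at |m f_0 + n Δf| and carry the weight J_n(Z_s) J_n(Z_r) J_m(Z_0).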
In the frequency domain, the signal represented by Eq. (2) is composed of a carrier frequency of f_0 and neighboring sidebands with a frequency spacing of Δf (Fig. 2(B)). After the Fourier transform of Eq. (2), the complex amplitudes of the high-order components, F_{m,n}, associated with the harmonic frequencies mf_0 + nΔf can be denoted as in Eq. (3). For estimating Z_s and Φ_s, frequency components F_{1,0} and F_{1,1} are extracted from the observed frequency components. Here, an intensity ratio, r_01, is defined as r_01 = |F_{1,0}|/|F_{1,1}|. It can also be described as |J_0(Z_s)J_0(Z_r)|/|J_1(Z_s)J_1(Z_r)|. In this context, we can derive an evaluation function, ε(z) = {r_01 − |J_0(z)J_0(Z_r)|/|J_1(z)J_1(Z_r)|}². Amplitude distribution Z_s(x, y) is obtained as the value of z that minimizes ε(z) at each x-y coordinate. We developed a MATLAB code that evaluated ε(z) and solved for z by comparing the measured value of r_01 with a precomputed data set of |J_0(z)J_0(Z_r)|/|J_1(z)J_1(Z_r)| as a function of z, which was varied from 0.000 rad to 2.404 rad in increments of 0.001 rad. Therefore, the resolution of this solving procedure was 0.001 rad. The identified value of z can be taken as Z_s within the constraint z ≤ 2.404 rad. Preferred values of Z_r and Z_0 are approximately 1.5 and 2.0 rad, respectively; these parameter settings provide an ideal condition.
Spatial phase distribution Φ_s is calculated from the extracted frequency components. Furthermore, interference phase α is obtained using F_{1,0} and F_{2,0}. For this purpose, Z_r and Z_0 should be determined before measurement; the two parameters can be set arbitrarily via the function generator and are assumed to be constant. For the calibration, these parameters were measured in advance using a conventional LDV.
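The lookup-table search for the amplitude can be sketched in a few lines. The following is a minimal illustration in Python rather than the MATLAB code used in this study; the function names (bessel_j, build_ratio_table, estimate_amplitude) and the power-series evaluation of the Bessel functions are assumptions introduced here, and the grid starts at 0.001 rad because the ratio diverges at z = 0.

```python
import math

Z_MAX = 2.404   # upper search limit (rad), just below the first zero of J0
Z_STEP = 0.001  # grid increment (rad)


def bessel_j(order, z, terms=30):
    """Bessel function of the first kind, J_order(z), from its power series."""
    total = 0.0
    for k in range(terms):
        total += ((-1) ** k) * (z / 2.0) ** (2 * k + order) / (
            math.factorial(k) * math.factorial(k + order)
        )
    return total


def build_ratio_table(z_r):
    """Tabulate |J0(z)J0(Z_r)| / |J1(z)J1(Z_r)| on the z grid (z = 0 excluded)."""
    ref = abs(bessel_j(0, z_r) / bessel_j(1, z_r))
    n_points = int(round(Z_MAX / Z_STEP))
    grid = [i * Z_STEP for i in range(1, n_points + 1)]
    ratios = [ref * abs(bessel_j(0, z) / bessel_j(1, z)) for z in grid]
    return grid, ratios


def estimate_amplitude(r01, grid, ratios):
    """Return the grid value z that minimizes eps(z) = (r01 - ratio(z))**2."""
    best = min(range(len(grid)), key=lambda i: (r01 - ratios[i]) ** 2)
    return grid[best]


# Example: synthesize r01 for a known amplitude and recover it.
z_r = 1.5        # preferred reference amplitude from the text (rad)
grid, ratios = build_ratio_table(z_r)
true_z_s = 0.8   # hypothetical sample amplitude (rad)
r01 = abs(bessel_j(0, true_z_s) * bessel_j(0, z_r)) / abs(
    bessel_j(1, true_z_s) * bessel_j(1, z_r)
)
z_s = estimate_amplitude(r01, grid, ratios)
print(f"estimated Z_s = {z_s:.3f} rad")
```

Because |J_0(z)/J_1(z)| decreases monotonically between 0 and the first zero of J_0, the minimum of ε(z) is unique within the search range, which is why the search is restricted to z ≤ 2.404 rad.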
This approach is similar to the sinusoidal phase modulation technique [25] except for one characteristic: the analysis described in this study involves the beat signals resulting from the three different modulations mentioned above.
Note that the modified WHIV technique can determine all the vibration parameters two-dimensionally without lateral scanning. On the other hand, this method has the following disadvantage. When the values of α are in the vicinity of integer multiples of π rad, the intensities of frequency components F_{1,1} and F_{1,0} are hardly detectable owing to the sin(α) dependence of the odd terms in Eqs. (2) and (3). In this so-called "unmeasurable area," the amplitude and phase values cannot be obtained accurately (see Subsection 3.2). We therefore filtered out the unmeasurable area when determining the parameters necessary for characterization of the sample's vibration.
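The origin of the unmeasurable area can be made explicit under an assumed signal model, namely a cosine whose argument is α plus three sinusoidal phase modulations with amplitudes Z_s, Z_r, and Z_0. The proportionalities below are a sketch derived under that assumption (a common prefactor is omitted) and are not quoted from the equations of this study.

```latex
|F_{1,0}| \propto |\sin\alpha|\,\bigl|J_1(Z_0)\,J_0(Z_s)\,J_0(Z_r)\bigr|
|F_{1,1}| \propto |\sin\alpha|\,\bigl|J_1(Z_0)\,J_1(Z_s)\,J_1(Z_r)\bigr|
|F_{2,0}| \propto |\cos\alpha|\,\bigl|J_2(Z_0)\,J_0(Z_s)\,J_0(Z_r)\bigr|

\Rightarrow\quad
|\tan\alpha| = \frac{|F_{1,0}|\,|J_2(Z_0)|}{|F_{2,0}|\,|J_1(Z_0)|}
```

In this picture, F_{1,0} and F_{1,1} both vanish as α approaches an integer multiple of π, whereas their ratio r_01 is independent of α; the ratio of F_{1,0} to F_{2,0} depends only on α and the known amplitude Z_0.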

Animal preparation
In vivo wide-field tomographic and vibration measurements were performed on the cochlear sensory epithelium of a guinea pig as a sample, as follows.
First, a guinea pig was deeply anesthetized with an intraperitoneal injection of urethane (1.5 g/kg). The toe pinch, corneal reflexes, and respiratory rate were examined to evaluate the depth of anesthesia. When anesthesia was insufficient, urethane (0.3 g/kg) was additionally injected into the animal. After tracheotomy, which was conducted for the maintenance of spontaneous breathing, the animal was paralyzed by intravenous injection of vecuronium bromide (3 mg/kg) (Vecuronium for intravenous injection; Fuji Pharma, Japan) [26]. Subsequently, the animal was artificially ventilated with room air using a respirator (SN-408-7; Shinano Manufacturing, Japan) [27]. We stopped the ventilation for 4 s during the measurements to prevent motion artifacts.
A fenestra was surgically opened on the lateral site of the bulla in order to shine the SC light on the sensory epithelium in the basal turn through the transparent round window. In principle, the measurements can be carried out through this hole alone. In our preparations, an additional hole was made on the anterior portion of the bulla. This arrangement allowed us to directly confirm the position of the beam spot during the recording and thereby significantly improved the efficiency of the experiments. Then, the animal's head was fixed on an acrylic plate (50 × 20 × 5 mm). The plate was tightly connected to an articulating base stage (SL20/M; Thorlabs, USA). In this process, the angle and position of the cochlea were manually controlled to enable the laser beam to irradiate the sensory epithelium through the round window membrane as perpendicularly as possible. This method does not require a hole to be artificially made in the cochlear bony wall for the irradiation and is hence noninvasive (Fig. 3).
The irradiation time during the measurements was several minutes in total. In spite of high irradiation power (37 mW), the tissue is likely to be damaged only minimally because the body fluid in the cochlea dissipates the heat generated by the irradiation. Alternatively, because the area irradiated by the light beam is relatively wide, the energy actually received by the tissue might be weaker than expected. To preserve an animal's condition as much as possible, all the procedures in the experiment were completed within 4 hours.
The experimental protocol was in compliance with federal guidelines for the care and handling of small rodents and was approved by the Institutional Animal Care and Use Committee of Niigata University [28].

Validation of OCT imaging
We initially evaluated the axial resolution and sensitivity of the imaging by the improved MS-OCMV. Figure 4(A) depicts a first-order interference fringe detected by a pixel of the CMOS camera when a planar mirror was illuminated in the OCT system. The frames of interference images were captured as a time series and were used to reconstruct 3D volumetric data. As shown in Fig. 4(A), the fringe was obtained after removing the DC component from the raw data. This signal was processed with the Hilbert transform [29] and an adequate bandpass filter to reduce noise and extract the envelope. The signal processing was carried out along the z-axis (axial depth direction) at each x and y coordinate. As depicted in Fig. 4(B), to properly apply the Hilbert transform, at least eight frames are necessary to construct one period of the fringe. Finally, a cross-sectional distribution in the depth direction was obtained as an "A-line" by squaring the extracted envelope, as presented in Fig. 4(C). This process was carried out at all x-y coordinates simultaneously via the 3D fast Fourier transform (FFT) in MATLAB. The axial resolution, determined as the full width at half maximum (FWHM) of the A-line, was approximately 2.7 μm.
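The processing chain for a single pixel (DC removal, Hilbert transform, envelope extraction, and squaring) can be sketched as follows. This Python illustration is not the MATLAB implementation used in this study: it handles one axial trace with a naive Fourier transform, omits the bandpass filter, and is applied to a synthetic Gaussian-windowed fringe whose parameters (128 frames, eight frames per fringe period, width of 10 frames) are hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short sketch."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    result = []
    for k in range(n):
        acc = sum(
            v * cmath.exp(sign * 2j * math.pi * k * t / n)
            for t, v in enumerate(values)
        )
        result.append(acc / n if inverse else acc)
    return result


def envelope(fringe):
    """Envelope of a fringe: remove DC, build the analytic signal, take |.|."""
    n = len(fringe)
    mean = sum(fringe) / n
    spectrum = dft([v - mean for v in fringe])
    # Hilbert transform in the frequency domain: keep the DC and Nyquist bins,
    # double the positive frequencies, and zero the negative frequencies.
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue
        spectrum[k] = 2.0 * spectrum[k] if k < n / 2 else 0.0
    return [abs(v) for v in dft(spectrum, inverse=True)]


def a_line(fringe):
    """A-line: squared envelope along the axial (frame) direction."""
    return [e * e for e in envelope(fringe)]


# Synthetic fringe: DC level 100, amplitude 10, eight frames per period.
N_FRAMES, CENTER = 128, 64
fringe = [
    100.0 + 10.0 * math.exp(-((t - CENTER) / 10.0) ** 2)
    * math.cos(2.0 * math.pi * (t - CENTER) / 8.0)
    for t in range(N_FRAMES)
]
profile = a_line(fringe)
peak_index = max(range(N_FRAMES), key=lambda t: profile[t])
print(f"A-line peak at frame {peak_index}")
```

In practice, the same operation would be applied along the z-axis of the whole volume at once, and the FWHM of the resulting A-line would be converted to micrometers using the axial step per frame.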

Fig. 4. The first-order interference fringe with its envelope.

A pilot experiment
Next, we performed a pilot measurement with the improved WHIV technique on a mirror that was vibrated by a PZT; the data were recorded during 2 s (Fig. 5(A)). In this experiment, the PZT was stimulated with an alternating current (AC) voltage of 5 V. From the recorded data, the frequency-domain signals were obtained by FFT, as shown in Fig. 5(C). We detected two components, i.e., F_{1,1} and F_{1,0}, at 250 and 170 Hz, respectively. These results are consistent with the theoretical considerations mentioned above. We further analyzed all the data points obtained on the surface of the sample with FFT (Fig. 5(A)). Then, from the signals that ranged from 0 to 1 kHz, the F_{1,1} and F_{1,0} components were extracted; they are visualized two-dimensionally in Fig. 5(D). Note that these two series of data were used to determine the amplitude and phase distributions of the vibrations in the sample. As expected, little spatial information was available from the background noise observed at 800 Hz, as shown in Fig. 5(D).

Fig. 6. 2D distributions measured on the mirror, including interference phase α and the unmeasurable area, together with a histogram of the pixel values.

The standard deviations of the measured vibration amplitude and phase, σ_Z and σ_Φ, were also examined (Fig. 7). σ_Φ can be a measure of accuracy in the detected signal. On the other hand, σ_Z was proportional to Z_s, indicating that the rate of measurement error was constant irrespective of the amplitude value in our system.
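The positions of the two spectral components detected in this pilot measurement (170 and 250 Hz, i.e., f_0 = 170 Hz and Δf = 80 Hz) can be reproduced numerically. The Python sketch below assumes a signal of the form cos[α + Z_s sin(2πf_s t) − Z_r sin(2πf_r t) − Z_0 sin(2πf_0 t)], which is an assumption about the model rather than the exact expression of this study; the acoustic frequency of 21 kHz, the value of α, and the amplitude Z_s are hypothetical, and the signal is sampled finely instead of modelling the exposure of the camera.

```python
import cmath
import math


def bessel_j(order, z, terms=30):
    """Bessel function of the first kind, J_order(z), from its power series."""
    return sum(
        ((-1) ** k) * (z / 2.0) ** (2 * k + order)
        / (math.factorial(k) * math.factorial(k + order))
        for k in range(terms)
    )


def simulate_signal(alpha, z_s, z_r, z_0, f_s, delta_f, f_0, duration, n):
    """Sample the assumed interference term on n points over `duration` seconds."""
    f_r = f_s + delta_f
    samples = []
    for k in range(n):
        t = k * duration / n
        phase = (alpha
                 + z_s * math.sin(2.0 * math.pi * f_s * t)
                 - z_r * math.sin(2.0 * math.pi * f_r * t)
                 - z_0 * math.sin(2.0 * math.pi * f_0 * t))
        samples.append(math.cos(phase))
    return samples


def component_amplitude(samples, freq, duration):
    """Amplitude of the spectral component at `freq` (single-bin Fourier sum)."""
    n = len(samples)
    acc = sum(v * cmath.exp(-2j * math.pi * freq * k * duration / n)
              for k, v in enumerate(samples))
    return 2.0 * abs(acc) / n


# Hypothetical parameters; 0.1 s contains an integer number of all periods.
ALPHA, Z_S, Z_R, Z_0 = 1.0, 0.5, 1.5, 2.0
F_S, DELTA_F, F_0 = 21000.0, 80.0, 170.0
DURATION, N = 0.1, 40000

signal = simulate_signal(ALPHA, Z_S, Z_R, Z_0, F_S, DELTA_F, F_0, DURATION, N)
amp_f10 = component_amplitude(signal, F_0, DURATION)            # 170 Hz
amp_f11 = component_amplitude(signal, F_0 + DELTA_F, DURATION)  # 250 Hz
amp_bg = component_amplitude(signal, 800.0, DURATION)           # background
expected = (bessel_j(0, Z_S) * bessel_j(0, Z_R)) / (bessel_j(1, Z_S) * bessel_j(1, Z_R))
print(f"r01 simulated = {amp_f10 / amp_f11:.4f}, from Bessel ratio = {expected:.4f}")
```

In this simulation, the ratio of the two components matches |J_0(Z_s)J_0(Z_r)|/|J_1(Z_s)J_1(Z_r)|, and the component at 800 Hz is negligible, which mirrors the absence of spatial information at that frequency in the measurement.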

In vivo wide-field vibration analysis of the sensory epithelium
Using the MS-OCMV with the improved WHIV technique, we examined the cochlear sensory epithelium of a live guinea pig. First, to perform tomography of the tissue, we carried out an en face OCT measurement in a planar area of 522 × 522 μm at a resolution of 850 × 850 pixels. A total of 8000 en face images were acquired at a frame rate of 2000 fps by scanning within approximately 560 μm in the axial direction. Because of the high output power of the SC light source, the data were obtained in a single scan. The time of acquisition of these volumetric data (i.e., total scanning time) was 4 s. Optical magnification of the objective lens was set to 9.8, resulting in a numerical aperture of 0.093 and a depth of focus of 92 µm.
From the obtained data, we selected an area of 256 × 256 pixels for subsequent analyses. This subset was processed as described in Subsection 3.1. After that, 3D volumetric images were reconstructed. Figure 8(A, B) shows the 3D volumetric images of the sensory epithelium, including the neighboring bony component, from different viewpoints. The contrast of the images was controlled by discarding low-intensity data below a threshold manually determined for each 3D data series. The dynamic range of the visualization was approximately 11 dB. Figure 8(C) illustrates a remeasured result from the sensory epithelium in the portion enclosed by the dotted line in Fig. 8(B). In this measurement, 4000 en face images were acquired by scanning within approximately 350 μm in the axial direction. The acquisition time was 2 s. Figure 8(D-F) depicts the X-Z cross-sectional images sliced at the Y positions corresponding to dotted lines 1, 2, and 3, respectively, in Fig. 8(C). The outline of the cross-sectional images was similar to a well-known view of the guinea pig sensory epithelium displayed in Fig. 8(G) [31]. In particular, the sensory epithelium has multiple regions that lack cells (e.g., the tunnel of Corti and inner sulcus); they are hallmarks for identifying such components of the epithelium as the BM, reticular lamina (RL), and tectorial membrane (TM). Therefore, in the images in Fig. 8(D-F), we could roughly detect the structures of the BM, RL, and TM.
We next intended to visualize and quantify the vibrations of the sensory epithelium in the wide-field mode. Axial OCT scanning was performed near the BM, and this position was chosen as a target. During the measurement with the improved WHIV technique, the animal was exposed to a pure tone sound of 21, 22, 23, 24, or 25 kHz through a Y-shaped waveguide that was connected to the exit of a speaker (EC1; Tucker-Davis Technologies, FL, USA). The intensities of the acoustic stimuli were monitored by an ultrasonic microphone inserted into one output port of the waveguide. The other output port was tightly inserted into the left external ear canal of the animal. All the parameters for operating the system and the algorithm for analyzing the data were the same as those used in the performance evaluation experiment with the mirror, except for the following characteristic. In the animal experiment, the data throughout the image (256 × 256 pixels) were analyzed by FFT, and the 2D distribution and intensity of the F_{1,1} component were monitored by the computer. This preliminary analysis indicated that the target region in the sensory epithelium responded most markedly to 23 kHz among the five frequencies that we tested (see the results described later).

Fig. 8. Results of 3D volumetric imaging of the sensory epithelium. (A, B) The epithelium from different viewpoints. (C) The portion enclosed by the dotted line. (D-F) X-Z cross-sectional images at the positions indicated in (C). (G) Schematic illustration of the sensory epithelium. BM, RL, and TM denote the basilar membrane, reticular lamina, and tectorial membrane, respectively.

Furthermore, to analyze the vibrations over the area of the sensory epithelium, we extracted a region of interest (ROI) from the recorded data. Figure 9 shows an example of this procedure. First, a threshold for the masking procedure was determined on the basis of averaged signal intensities (cf. Fig. 5(C)). Second, a pixel was configured to assume the value of "1" or "white" when the absolute value of the peak of F_{1,1} (|F_{1,1}|) exceeded the threshold; otherwise, the configuration was set to "0" or "black." This process was applied to all the pixels at the same time, and the result was transformed into a black-and-white 2D map. This image is referred to as "mask 1." Third, on the basis of the interference phase α distribution, a pixel that manifested itself as an unmeasurable area was set to "0" or "black," whereas a measurable pixel was set to "1" or "white" (Fig. 6(D)). This digitization was carried out for all the pixels simultaneously, thus affording a 2D image called "mask 2." The final step was the "AND operation," in which a pixel with the value of "1" in both masks was redefined as "1" or "white," and data smoothing was performed by means of a median filter. These procedures produced the "conclusive mask" (Fig. 9).

Fig. 9. The masking procedure. In mask 1, when the absolute value of the peak of F_{1,1} in a pixel exceeded the threshold determined as described in the text, the pixel value was transformed to "1" or "white." Mask 2 served to filter out the unmeasurable area according to the value of interference phase α. In this configuration, the measurable pixel was referred to as "1" or "white." Masks 1 and 2 were merged by the "AND operation," in which the pixel with the value of "1" in both masks was set to "1" or "white." Finally, to the conclusive mask, the median filter was applied for data smoothing. For the detailed procedure, refer to the text. As indicated in the boxed panel, the distributions of (a) Z_s and (b) Φ_s were segmented through the conclusive mask. The ROI of the sensory epithelium in the wide-field mode was determined by this procedure.

Figure 10 shows the vibrations recorded in the sensory epithelium of the live animal exposed to sounds; Z_s and Φ_s in each pixel are displayed within the ROI. After the measurements shown in Fig. 10, the recordings were repeated postmortem: the ROI was again determined as described in Fig. 9, and the results are displayed in Fig. 11. For each type of stimulus, the amplitude quantified in the postmortem animal (Fig. 11(A-D)) was likely to be less than that in the live animal (Fig. 10(A-D)). This observation may be attributable to the loss, after death, of the process that can amplify the motion of the BM. The phase distribution was relatively homogeneous in the postmortem animal (Fig. 11(E-H)), although the spatial deviation of this parameter was apparent under the control conditions (Fig. 10(E-H)). This difference seems to stem from the lack of noise sources, such as breathing and blood flow, after death.

Fig. 10. Vibrations in the sensory epithelium of the live guinea pig exposed to sounds, displayed in the ROI.

Fig. 11. Vibrations in the sensory epithelium of the guinea pig measured postmortem, displayed in the ROI.

Figure 12(A) presents a plot of the average values and standard deviations calculated from the vibration amplitude recorded for all the pixels included in the ROI (approximately 2000 pixels) versus different sound intensities. Note that the number of pixels differed among the trials. Overall, the values for the live guinea pig (control) exceeded those for the postmortem animal, as mentioned above. A comparison of these two series of data revealed that when the stimuli were relatively weak (70 and 75 dB), the response increased under the control conditions [2,31]. This nonlinear amplification supports the idea that the cochlea (including the sensory epithelium) was damaged only minimally during the measurement.

Figure 12(B) illustrates the measurements at 21, 22, 23, 24, and 25 kHz (85 dB SPL). For each type of stimulus, the ROI was determined. The averaged values of the amplitude were obtained with the procedure utilized in Fig. 12(A). The other displayed parameters are the deviations of the phase values (σ_Φ). With the 23 kHz stimulus, the amplitude was maximal and the deviation was minimal. Because a lower σ_Φ value means a higher SNR of the detected signal, as mentioned in Subsection 3.2, we inferred that the characteristic frequency at the point we examined was 23 kHz.
This observation is in agreement with the aforementioned preliminary finding that the characteristic frequency of the epithelium targeted by the laser was 23 kHz.
Fig. 12. (A) The data were collected from the ROI (~2000 pixels) of the sensory epithelium in the control (red curve; live guinea pig) and postmortem (black curve) conditions. The animal was exposed to stimuli of 23 kHz at various SPLs. The average values and standard deviations (SD) are plotted. (B) The tuning curve of the vibration amplitude (blue curve) and phase (orange curve; σ Φ ). In this assay series, the live animal was acoustically stimulated with different frequencies (21−25 kHz; 85 dB SPL). The averages and SD of the data in the ROI individually determined for each stimulus are shown. The phase values were obtained with Eq. (6) (see text).
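As an illustration only, the masking procedure of Fig. 9 can be sketched in Python as follows. The function names, the list-of-lists image representation, the zero padding at the image border, and the 3 × 3 median window are assumptions made for this sketch. The threshold is passed in as a number rather than derived from the data, and the implementation actually used in this study is not specified in the text.

```python
from statistics import median


def threshold_mask(abs_f11, threshold):
    """Mask 1: 1 ("white") where |F_1,1| exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in abs_f11]


def and_masks(mask_a, mask_b):
    """AND operation: a pixel is 1 only if it is 1 in both masks."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


def median_filter(mask, size=3):
    """Median filter; pixels outside the image count as 0 (assumed padding)."""
    height, width = len(mask), len(mask[0])
    radius = size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                mask[y + dy][x + dx]
                if 0 <= y + dy < height and 0 <= x + dx < width else 0
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            out[y][x] = int(median(window))
    return out


def conclusive_mask(abs_f11, measurable, threshold, size=3):
    """Combine mask 1 (amplitude) and mask 2 (measurable phase), then smooth."""
    mask1 = threshold_mask(abs_f11, threshold)
    mask2 = [[1 if flag else 0 for flag in row] for row in measurable]
    return median_filter(and_masks(mask1, mask2), size)
```

In this sketch, an isolated pixel that passes the threshold is removed by the median filter, whereas a compact block of valid pixels survives with slightly eroded corners.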
Theoretical models and experiments at the base of the cochlea corroborate that the BM shows wave propagation behavior, with a phase gradient along the cochlea from the base to the apex in the region of the characteristic frequency [2]. In the region of the sensory epithelium approximately 2-3 mm in length observed through the window, the characteristic frequency is known to be in the range 28-32 kHz on the basal side along the cochlear spiral [33]. On the apical side, it can be predicted to be 19-22 kHz according to Ref. [32]. Furthermore, the experimental results mentioned above suggest that the characteristic frequency varies in the range ~21-25 kHz in the region we examined, with especially significant signals detected at 23 kHz (Fig. 12(B)). Therefore, the phase difference caused by the traveling wave, which is scale invariant, is expected to be detectable in this frequency range.
To confirm this theoretical prediction, we analyzed the same spatial [...] as in the above experiment (Fig. 12(B)) [...] distributions obtained at different frequencies and their averaged phase [...]. Because of the limitation on sensitivity in this system, we could [...] at >80 dB SPL. Overall, more reliable results were obtained from [...] than from a live one. The reason is that noise was decreased because [...] artifacts were reduced in the postmortem animal.
Fig. 13. Analysis of the phase gradient caused by a traveling wave. (A) The phase gradient plotted in terms of radian/mm as [...] frequency. The gradient values were obtained from the spatial [...] stimulated by a pure tone sound at 85 dB SPL. (B, C) [...] gradients on the BM of a live guinea pig at a frequency [...]. [...] phase gradients were obtained by calculating the average [...] (yellow lines on each distribution). [...] gradient corresponding to frequencies of [...] experimental conditions, and data analysis [...].
In the analysis, first, the centroid line [...] direction of the [...] sensory epithelium was determined for quantitative [...] baseline representing the direction of the membrane (yellow line, Fig. 13). This line was calculated by linear approximation using the [...] values with respect to the x-axis. Then, the phase values aligned on a [...] direction were averaged. Thereby, the averaged values were plotted ([...] Fig. 13(B)). Averaged values were discarded where the number of valid points was less than 2.5% of the number of pixels in the vertical line (for example, where most of the data lay outside the ROI). Finally, the slope of a straight line fitted to the averaged phase plot by polynomial regression was taken as the phase gradient. In this analysis, the standard error of the fitted line was ~0.05 rad on average.
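A minimal sketch of the final steps of this analysis is given below, under simplifying assumptions that are not taken from the study: the baseline is assumed to be parallel to the x-axis, so that averaging runs over image columns; the phase is assumed to be already unwrapped; and the regression is first order. The function names and the default pixel pitch (522 μm / 256 pixels, about 2.04 μm) are illustrative.

```python
def column_means(phase, mask, min_valid_fraction=0.025):
    """Average the phase over the valid pixels of each image column.

    Columns with fewer valid pixels than min_valid_fraction of the column
    height are discarded (the 2.5% criterion described in the text).
    """
    height, width = len(phase), len(phase[0])
    columns, means = [], []
    for x in range(width):
        values = [phase[y][x] for y in range(height) if mask[y][x]]
        if values and len(values) >= min_valid_fraction * height:
            columns.append(x)
            means.append(sum(values) / len(values))
    return columns, means


def linear_fit(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def phase_gradient(phase, mask, pixel_mm=0.522 / 256):
    """Phase gradient in rad/mm along x (pixel pitch is an assumed value)."""
    columns, means = column_means(phase, mask)
    positions_mm = [x * pixel_mm for x in columns]
    slope, _ = linear_fit(positions_mm, means)
    return slope
```

A baseline that is tilted with respect to the x-axis, as the yellow lines in Fig. 13 appear to be, would require rotating or resampling the map before the column averaging.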
As a result, the phase gradient varied between approximately 1 and 7 rad/mm. Phase changes were observed at 23, 24, and 25 kHz in the postmortem animal. In the live animal, comparatively small phase gradients were observed. At 23 kHz, which was considered the characteristic frequency, the gradient was estimated to be 1.15 rad/mm in the live animal and 5.39 rad/mm in the postmortem animal. The largest gradient, 6.27 rad/mm, was obtained at 25 kHz postmortem. These values are reasonable when compared with the result reported in Ref. [34]. By analyzing selected distributions with relatively little noise, we confirmed the phase changes predicted theoretically. Further improvements are needed for more sensitive and accurate measurements in live tissue.

Discussion
MS-OCMV has been developed for en face measurement of biological vibrations by the WHIV technique. In comparison with Doppler SD-OCT, however, several disadvantages remain to be addressed. Generally, standard Doppler SD-OCT is superior to our method in terms of real-time performance and sensitivity (SNR of 90-100 dB and picometer accuracy for vibrometry). Our system might be superior in terms of the speed of capturing 3D volumetric data, because it performs spatially simultaneous en face detection with a high-speed CMOS camera. SD-OCT and conventional LDV require a photodetector or a line sensor with a high sampling frequency to avoid aliasing. In our technology, in principle, there is no limitation on the vibration frequency, because the vibration is analyzed from the heterodyne signal produced by two sinusoidal phase modulations, which is slow enough to be detected by a CMOS camera.
Nevertheless, for practical applications, high-speed cameras are preferred in order to avoid the effects of the low-frequency noise mentioned above. In addition, the sensitivity of high-speed CMOS cameras is generally lower than that of standard photodetectors owing to their lower full-well capacity (16,000 e − ). Besides, it is known that SD-OCT methods based on Fourier spectroscopy guarantee a higher SNR than qualitative time domain methods [35,36]. A representative Doppler SD-OCT system (e.g., Ganymede SD-OCT, Thorlabs, USA) provides a sensitivity of ~100 dB for OCT imaging. On the other hand, the OCT sensitivity of the MS-OCMV system was estimated to be approximately 40 dB. Further improvement of the system is needed for practical in vivo measurements. In this section, we discuss current issues from this perspective to clarify the limitations and prospects.

The noise level of OCT
The original MS-OCMV, which was equipped with an SLD light source (center wavelength: 820 nm), could not clearly detect either the 3D volumetric image or the sound-induced vibrations in the sensory epithelium. In this series of experiments, the light source typically delivered an optical power of 0.06 μW to each pixel of the CMOS camera during a usual exposure time of 0.5 ms. Because of the low reflectance of the sensory epithelium, it is estimated that a pixel of the camera with a quantum efficiency of 25% received only 54 e − from the tissue. In such a condition, the noise of the signal can be determined as
[equation lost in extraction]
where N td and N s denote temporal dark noise (i.e., read noise) and shot noise, respectively. In accordance with the native profile of the image sensor [37], temporal dark noise (N td ) was 29 e − . Shot noise (N s ), calculated as the square root of the number of photon fluxes on a one-pixel surface, was [...] e − . These data [...] unsuccessful [...] present study, [...] increased optical [...] SNR was increased [...]. When a [...] around the [...] sensitivity and [...] power spectrum [...] power spectrum [...] spectral envelope [...] light source.
The actual [...] resolution, [...] reference and [...] CMOS camera [...] shape of the [...] and wavelength [...] improvement [...] low-coherence [...].
[...] amplitude voltages applied to the PZTs for Z s , Z r , and Z 0 were set to 1.0, 1.0, and 5.0 V, respectively. The en face images, consisting of 256 × 256 pixels, were extracted to display the distribution set of F(0) and F(Δf) in the original method, or of F 0,0 , F 0,1 , and F 1,1 in the improved method, as illustrated in Fig. 14.
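For orientation, the noise estimate for the SLD-based system described in this subsection can be worked through numerically. The sketch below assumes that temporal dark noise and shot noise are independent and add in quadrature (a common camera noise model; the exact expression used in the study is not reproduced here) and that the shot noise is the square root of the 54 signal electrons.

```latex
\begin{align}
N_s &= \sqrt{54} \approx 7.3\ e^{-}, \\
N &= \sqrt{N_{td}^{2} + N_s^{2}} = \sqrt{29^{2} + 54} \approx 29.9\ e^{-}, \\
\frac{S}{N} &\approx \frac{54}{29.9} \approx 1.8 .
\end{align}
```

Under these assumptions the read noise dominates and the signal only barely exceeds the noise, which is consistent with the unclear images obtained with the SLD light source.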

Comparison of the original and improved methods
The intensity of the frequency component F(0) in the original method can be written as [16]
[equation lost in extraction]
[...] [39] focusing on the spatial interference fringe pattern appearing in the obtained en face distribution (Fig. 14(A)). Nevertheless, this arrangement instead diminished the accuracy of quantifying the vibration amplitude, especially when the analyzed surfaces were complicated, as in biological tissues. In Ref. [17], a substitute method using a second-order component was adopted. Nonetheless, the substitute signal is inherently weaker than the zeroth- and first-order signals. This problem is prominent when we analyze biological samples, such as the cochlear sensory epithelium, that yield a weak interference signal.
To solve these problems, the improved method clearly separates the zeroth-order component from the DC component by means of a third signal (DC offset modulation). As shown in Fig. 14(B), F(0) contained only the DC term A xy , and the zeroth-order component was obtained independently as F 1,0 . In the frequency domain, the principle is that the zeroth-order component is shifted from 0 Hz by the offset modulation frequency f 0 , and the first-order component F 1,1 appears as a sideband around F 1,0 . This feature enables accurate extraction of the signal essential for determining the vibration amplitude, without interference from the DC component.
An advantage of this improvement is that the method can also estimate the interference phase α. In addition, the vibration amplitude in the "unmeasurable area" can be estimated by using the pair of second-order signals, F 2,0 and F 2,1 , instead. A disadvantage is that, comparing the peak intensities of |F(Δf)| and |F 1,1 |, the SNR of the improved method was lower by approximately 10 dB, as shown in Fig. 14.
The effect of low-frequency noise was also confirmed in Fig. 14(A). The low-frequency noise was distributed up to about 50 Hz (Fig. 14(A)), and when WHIV was used, it appeared as sidebands of each frequency component. This noise was mainly due to mechanical disturbances of the interferometer. In the original method, the modulation frequency Δf is set high to avoid the influence of low-frequency noise. Even in the improved method, it was necessary to guarantee a sufficient frequency spacing between the longitudinal modes to separate them from the sidebands of the noise components. Therefore, by introducing a high-speed CMOS camera operating at 2000 fps or more, our system achieved accurate measurement that was less susceptible to low-frequency noise.

The performance limit of the improved WHIV technique
In general, verification of the limit of detection is crucial for the characterization of any analytical instrument or method. For the improved WHIV technique, we chose an in silico approach to evaluate the error in the measurement of vibration amplitude. According to Eq. (1), we set several parameters in the simulation to reconstitute the interferometric heterodyne signals in the time domain as follows: Z r = 1.8 rad, Z 0 = 1.8 rad, f s = 23 kHz, f r = 23080 Hz, and f 0 = 170 Hz. In addition, we reproduced the noise floor of −70 dB as shown in Fig. 7 and found it to be roughly 5 nm (Fig. 15(B)). Further studies are needed to elucidate the details of the actual sensitivity of in vivo measurement.
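To make the simulation setting concrete, the following Python sketch generates a camera-sampled heterodyne signal with the parameters listed above and recovers the vibration phase amplitude Z s from two spectral lines. Several points are assumptions of this sketch rather than statements from the study: the signal model I(t) = 1 + cos(α + Z s sin 2πf s t + Z r sin 2πf r t + Z 0 sin 2πf 0 t) stands in for Eq. (1), which is not reproduced in this section; each frame is taken as the average of the intensity over a full frame period at 2000 fps; F 1,0 and F 1,1 are read at f 0 = 170 Hz and f 0 + (f r − f s ) = 250 Hz; Z s is obtained from the ratio |F 1,1 |/|F 1,0 | = J 1 (Z s )J 1 (Z r )/[J 0 (Z s )J 0 (Z r )], which follows from the Jacobi-Anger expansion of this model; and no noise is added. All function names are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def bessel_j(n, x, steps=400):
    """J_n(x) by midpoint integration of Bessel's integral over [0, pi]."""
    h = math.pi / steps
    return sum(math.cos(n * (k + 0.5) * h - x * math.sin((k + 0.5) * h))
               for k in range(steps)) / steps


def simulate_frames(z_s, z_r=1.8, z_0=1.8, f_s=23000.0, f_r=23080.0,
                    f_0=170.0, alpha=1.0, fps=2000, n_frames=400,
                    oversample=100):
    """Frames of the assumed signal model, averaged over each frame period."""
    dt = 1.0 / (fps * oversample)
    frames = []
    for k in range(n_frames):
        acc = 0.0
        for j in range(oversample):
            t = (k * oversample + j) * dt
            phase = (alpha
                     + z_s * math.sin(TWO_PI * f_s * t)
                     + z_r * math.sin(TWO_PI * f_r * t)
                     + z_0 * math.sin(TWO_PI * f_0 * t))
            acc += 1.0 + math.cos(phase)  # A = B = 1, arbitrary units
        frames.append(acc / oversample)
    return frames


def line_amplitude(frames, freq, fps=2000):
    """Amplitude of one spectral line from a single-frequency DFT."""
    n = len(frames)
    re = sum(v * math.cos(TWO_PI * freq * k / fps) for k, v in enumerate(frames))
    im = sum(v * math.sin(TWO_PI * freq * k / fps) for k, v in enumerate(frames))
    return 2.0 * math.hypot(re, im) / n


def estimate_z_s(frames, z_r=1.8, f_0=170.0, delta_f=80.0, fps=2000):
    """Invert J1(Zs)/J0(Zs) by bisection, using |F_1,1| / |F_1,0|."""
    ratio = (line_amplitude(frames, f_0 + delta_f, fps)
             / line_amplitude(frames, f_0, fps))
    target = ratio * bessel_j(0, z_r) / bessel_j(1, z_r)
    lo, hi = 0.0, 2.0  # J1/J0 is monotonic below the first zero of J0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if bessel_j(1, mid) / bessel_j(0, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `estimate_z_s(simulate_frames(z_s=0.1))` returns a value close to 0.1 rad. The estimate carries a small bias because averaging over the exposure attenuates the 250 Hz line slightly more than the 170 Hz line; a correction for this effect, and the addition of a noise floor, are left out of the sketch.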

Motion artifacts
A motion artifact generally leads to a loss of vibration data and deterioration of the SNR. If the fluctuation of the OPD is limited to approximately several hundred nanometers, its influence can be removed because the fluctuation is detected as low-frequency noise of the signal. Larger changes alter the measurement area. Therefore, it is desirable that, during acquisition of the data, the OPD fluctuate within a few micrometers, which corresponds to the coherence length. To prevent motion artifacts, we controlled the movement due to breathing of the animal by artificial ventilation via a respirator during the measurements, as mentioned in Subsection 2.3. Another solution is to reduce the measurement time by increasing the frame rate of the CMOS camera. The advantage of this approach is that it reduces the accumulation effect of the detector and improves the time resolution for capturing the movement of the animal. Accumulation during the exposure time in a pixel of the image sensor may blur the detected signal, because the signal is very sensitive to the phase change caused by path length variation. The accumulation effect of the detector therefore worsens the contrast of the interference signal.
We investigated the influence of accumulation time and scan speed on the contrast reduction in a previous study [17]. For instance, according to the simulation, the interference contrast can be improved from approximately 78% to 94% when the frame rate increases from 2000 to 4000 fps, if the frequency of the heterodyne signal F 1,1 is 250 Hz. Nevertheless, application of this improvement requires careful consideration of the reduction in overall light intensity caused by the shorter exposure time. Given that the intensity of the accumulated interference signal is also proportional to the exposure time (e.g., see Eq. (6) in Ref. [17]), acquisition at 4000 fps, for instance, lowers the SNR by approximately 3 dB compared with the results of this study. As demonstrated in Fig. 15, deterioration of the SNR increases the error of the measured amplitude value. Therefore, there is a trade-off between the reduction of motion artifacts achieved with a higher frame rate and the increase in measurement error caused by the shorter exposure time.
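The figure of approximately 3 dB can be reproduced with a one-line estimate, under two assumptions that are not stated explicitly in the text: the exposure time equals the frame period (0.5 ms at 2000 fps and 0.25 ms at 4000 fps), and the SNR change is expressed as 10 log10 of the ratio of the accumulated signal intensities.

```latex
\Delta\mathrm{SNR} \approx 10\log_{10}\!\left(\frac{T_{4000}}{T_{2000}}\right)
= 10\log_{10}\!\left(\frac{0.25\ \mathrm{ms}}{0.5\ \mathrm{ms}}\right)
\approx -3.0\ \mathrm{dB}
```

A different noise model or decibel convention would change this number, so the estimate should be read only as a plausibility check.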

Conclusion
In this study, we drastically modified our previously developed MS-OCMV system for the acquisition of 3D tomographic images and for the analysis of the wide-field vibration distribution of objects. The present system was equipped with an SC light source to enhance irradiation [...]. Furthermore, in the WHIV technique, we added offset modulation to the reference signal to accomplish wide-field measurement of ultrafast vibrations in a biological sample. The performance of the improved MS-OCMV system for 3D volumetric imaging is characterized by a transverse resolution of 3.6 μm and an axial depth resolution of 2.7 μm. The vibration amplitude detectable with this system was estimated to be ~1.1 nm, and the measurement accuracy is similar to that of conventional LDVs. These settings enabled us to detect the structure and motion of the acoustically stimulated cochlear sensory epithelium of a live guinea pig, even though this tissue has an extremely low reflectance. With sounds of different intensities or frequencies, the spatial distribution of the vibration amplitude and phase was quantified and mapped onto a 3D volumetric image. The profile changed when the animal was euthanized. The proposed technique can provide a platform for effective analysis of the cochlea as well as other organs, thereby contributing to advances in the life sciences.