Acousto-optic Ptychography

Acousto-optic imaging (AOI) enables optical-contrast imaging deep inside scattering samples via localized ultrasound-modulation of scattered light. While AOI allows optical investigations at depths, its imaging resolution is inherently limited by the ultrasound wavelength, prohibiting microscopic investigations. Here, we propose a novel computational imaging approach that achieves optical diffraction-limited imaging using a conventional AOI system. We achieve this by extracting diffraction-limited imaging information from 'memory-effect' speckle-correlations in the conventionally detected ultrasound-modulated scattered-light fields. Specifically, we identify that since speckle correlations allow estimation of the Fourier-magnitude of the field inside the ultrasound focus, scanning the ultrasound focus enables robust diffraction-limited reconstruction of extended objects using ptychography, i.e. we exploit the ultrasound focus as the scanned spatial-gate 'probe' required for ptychographic phase-retrieval. Moreover, we exploit the short speckle decorrelation-time in dynamic media, which is usually considered a hurdle for wavefront-shaping based approaches, for improved ptychographic reconstruction. We experimentally demonstrate non-invasive imaging of targets that extend well beyond the memory-effect range, with a 40-times resolution improvement over conventional AOI, surpassing the performance of state-of-the-art approaches.


INTRODUCTION
Optical microscopy through scattering media is a long-standing challenge with great implications for biomedicine. Since scattered light limits the penetration depth of diffraction-limited optical imaging techniques approximately to 1 millimeter, the goal of finding a better candidate for high-resolution imaging at depth is at the focus of many recent works [1]. Modern techniques that are based on using only unscattered, 'ballistic' light, such as optical coherence tomography and two-photon microscopy, have proven very useful, but are inherently limited to shallow depths where a measurable amount of unscattered photons is present [2][3][4][5][6][7].
The leading approaches for deep-tissue imaging, where no ballistic components are present, are based on the combination of light and ultrasound [1], such as acousto-optic tomography (AOT) [8][9][10] and photoacoustic tomography (PAT) [8,11]. PAT relies on the generation of ultrasonic waves by absorption of light in a target structure under pulsed optical illumination. In PAT, images of absorbing structures are reconstructed by recording the propagated ultrasonic waves with detectors placed outside the sample. In contrast to PAT, AOT does not require optical absorption but is based on the acousto-optic (AO) effect: in AOT a focused ultrasound spot is used to locally modulate light at chosen positions inside the sample. The ultrasound spot is generated and scanned inside the sample by an external ultrasound transducer. The modulated, frequency-shifted light is detected outside the sample using an interferometry-based approach [8,10]. This enables the reconstruction of the light intensity traversing the localized acoustic focus inside the sample. Light can also be focused back into the ultrasound focus via optical phase-conjugation of the tagged light in "time-reversed ultrasonically encoded" (TRUE) [12] optical focusing, or via iterative optimization, which can be used for fluorescence imaging [13,14]. AOT and PAT combine the advantages of optical contrast with the near scatter-free propagation of ultrasound in soft tissues. However, they suffer from low spatial resolution that is limited by the dimensions of the ultrasound focus, dictated by acoustic diffraction. This resolution is several orders of magnitude lower than the optical diffraction limit. For example, for an ultrasound frequency of 50MHz the acoustic wavelength is 30µm, while the optical diffraction limit is $\lambda/NA$, where $NA$ is the numerical aperture of the system and $\lambda$ is the optical wavelength, i.e. a 100-fold difference in resolution. 
This results in a very significant gap and a great challenge for cellular and sub-cellular imaging at depths.
arXiv:2101.10099v1 [physics.optics] 25 Jan 2021
In recent years, several novel approaches for overcoming the acoustic resolution limit of AOT based on wavefront-shaping have been put forward. These include iterative TRUE (iTRUE) [15,16], time reversal of variance-encoded (TROVE) optical focusing [17] and the measurement of the acousto-optic transmission matrix (AOTM) [18]. Both iTRUE and TROVE rely on a digital optical phase-conjugation (DOPC) system [19], a complex apparatus, which conjugates a high-resolution SLM to a camera. In AOTM, the same resolution increase as in TROVE is obtained without the use of a DOPC system, by measuring the transmission matrix of the ultrasound-modulated light and using its singular value decomposition (SVD) for sub-acoustic optical focusing. A major drawback of these state-of-the-art 'super-resolution' AOT approaches is that they require performing a large number of measurements and wavefront-shaping operations in a time shorter than the sample speckle decorrelation time. In addition, in practice, these techniques do not allow a resolution increase of more than a factor of ×3 to ×6 over the acoustic diffraction limit when sub-micron optical speckle grains are considered [18]. Recently, approaches that do not rely on wavefront shaping, and exploit the dynamic fluctuations to enable improved resolution [20] or fluorescent imaging [21], have been demonstrated, but these do not practically allow a resolution increase of more than a factor of 2-3. Closing the two orders of magnitude gap between the ultrasound resolution and the optical diffraction limit is thus still an open challenge.
Diffraction-limited resolution imaging through highly scattering samples without relying on ballistic light is currently possible only by relying on the optical "memory effect" for speckle correlations [22][23][24]. These techniques retrieve the scene behind a scattering layer by analyzing the correlations within the speckle patterns. Unfortunately, the memory-effect has a very narrow angular range, which limits these techniques to isolated objects that are contained within the memory-effect field-of-view (FoV). For example, at a depth of 1mm the memory-effect range is of the order of tens of microns [25,26] making it inapplicable for imaging extended objects.
Here, we present acousto-optic ptychographic imaging (AOPI), an approach that allows optical diffraction-limited imaging over a wide FoV that is not limited by the memory-effect range, by combining acousto-optic imaging (AOI) with speckle-correlation imaging. Specifically, we utilize the ultrasound focus as a controlled probe that is scanned across the wide imaging FoV, and use speckle correlations to retrieve optical diffraction-limited information from within the ultrasound focus. Importantly, we develop a reliable and robust computational reconstruction framework that is based on ptychography [27][28][29], which exploits the intentional partial overlap between the ultrasound foci. We demonstrate in proof-of-principle experiments a > ×40 increase in resolution over the ultrasound diffraction limit, providing a resolution of 3.65µm using a modulating ultrasound frequency of 25MHz.

A. Principle
The principle of our approach is presented in Figure 1 along with a numerical example. Our approach is based on a conventional pulsed AOI setup, employing camera-based holographic detection of the ultrasound-modulated light [8,18]. In this setup (Fig. 1(a)) the sample is illuminated by a pulsed quasi-monochromatic light beam at a frequency $f_{opt}$. The diffused light is ultrasonically tagged at a chosen position inside the sample by a focused ultrasound pulse at a central frequency $f_{US}$. The acousto-optic modulated (ultrasound-tagged) light field at frequency $f_{AO} = f_{opt} + f_{US}$ is measured by a camera placed outside the sample using a pulsed reference beam that is synchronized with the ultrasound pulses, via off-axis phase-shifting interferometry [20,30] (Supplementary section 1).
In conventional AOI, the ultrasound focus is scanned along the target object (Fig. 1(b,c)), and the AOI image, $I_{AOI}(r)$, is formed by summing the total power of the detected ultrasound-modulated light at each ultrasound focus position $r^{US}_m$:

$$I_{AOI}(r^{US}_m) = \sum_{r_{cam}} |E_m(r_{cam})|^2 \qquad (1)$$

where $E_m(r_{cam})$ is the ultrasound-modulated speckle field that is measured by the camera (Fig. 1(d)). Since the conventional AOI image (Fig. 1(g)) is a convolution between the target object and the ultrasound-focus pressure distribution [20], its resolution is limited by the acoustic diffraction limit.
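The convolution picture of conventional AOI can be illustrated with a short numerical sketch. It is not the authors' processing code: the Gaussian focus profile, the function names `gaussian_probe` and `conventional_aoi`, and the pixel sizes are assumptions chosen for illustration. Two point targets separated by much less than the focus width merge into a single blob, which is the resolution limit described above.

```python
import numpy as np

def gaussian_probe(shape, fwhm_px):
    """Assumed Gaussian stand-in for the ultrasound focus, centred at the array origin."""
    sigma = fwhm_px / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    y = np.fft.fftfreq(shape[0]) * shape[0]   # signed pixel coordinates (0, 1, ..., -1)
    x = np.fft.fftfreq(shape[1]) * shape[1]
    yy, xx = np.meshgrid(y, x, indexing="ij")
    return np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))

def conventional_aoi(target_intensity, probe):
    """Total tagged power per focus position: circular convolution of target and probe."""
    return np.real(np.fft.ifft2(np.fft.fft2(target_intensity) * np.fft.fft2(probe)))

# Two point targets 8 pixels apart, imaged with a 30-pixel-wide (FWHM) focus.
target = np.zeros((64, 64))
target[32, 28] = 1.0
target[32, 36] = 1.0
probe = gaussian_probe(target.shape, fwhm_px=30)
image = conventional_aoi(target, probe)
peak = np.unravel_index(np.argmax(image), image.shape)  # single peak between the two points
```

The FFT-based convolution assumes periodic boundaries, which is harmless here because the focus is much smaller than the simulated field.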
Our approach relies on the same data acquisition scheme as in conventional AOI (Fig. 1(b-d)). However, instead of integrating the total power of the camera-detected ultrasound-modulated field at each ultrasound position, we use the spatial information in the detected field, $E_m(r_{cam})$, to reconstruct the diffraction-limited target features inside the ultrasound focus, via a speckle-correlation computational imaging approach [23,24,31]. Specifically, we estimate the autocorrelations of the hidden target inside each ultrasound focus position (Fig. 1(e)), and then use a ptychography-based algorithm [27,28] to jointly reconstruct the entire target from all estimated autocorrelations (Fig. 1(f)). Thus, our approach exploits the richness of the information in the detected ultrasound-modulated speckle fields, which contain a number of speckle grains limited only by the camera pixel count.
Beyond improving the resolution of AOI by several orders of magnitude, from the ultrasound diffraction limit to the optical diffraction limit, our approach tackles a fundamental requirement of speckle-correlation imaging that is generally very difficult to fulfill: that the entire imaged object area must be contained within the memory-effect correlation range [23,24,32], i.e. that all object points produce correlated speckle patterns [24]. This requirement usually limits speckle-correlation imaging to small and unnaturally isolated objects. Recently, ptychography-based approaches were utilized to overcome the memory-effect FoV [33,34]. However, the implementations of all FoV-extending approaches to date required direct access to the target in order to limit the illuminated area, a requirement that is impossible to fulfill in noninvasive imaging applications. Our approach overcomes this critical obstacle by relying on noninvasive ultrasound tagging to limit the detected light to originate only from a small controlled volume that is determined by the ultrasound focus. The only requirement for speckle-correlation imaging is that the ultrasound focus (Fig. 1(b), dashed yellow circle) is smaller than the memory-effect range (Fig. 1(b), dashed green circle). This allows imaging, through scattering layers, of objects that extend well beyond the memory-effect FoV, without a limit on their total dimensions.
Mathematically, our approach can be described as follows. Consider a target object located inside a scattering sample (Fig. 1(a)). As a simple model, we model the object by a thin 2D amplitude and phase mask, whose complex field transmission is given by $O(r)$. The goal of our work is to reconstruct the object 2D transmission $|O(r)|^2$ by noninvasive measurements of the scattered light distributions outside the sample.

Fig. 1. Acousto-optic ptychographic imaging (AOPI) principle and numerical results. a. Schematic of the experimental setup: An AOI setup is equipped with a rotating diffuser for producing controlled speckle realizations. An object hidden inside a scattering sample is imaged by scanning a focused ultrasound beam (in yellow) over the object, and the acousto-optic modulated (frequency-shifted) light is holographically detected using a high-resolution camera. b. The ultrasound beam scans the target (dashed yellow circle). The US beam is smaller than the "memory effect" range (dashed green circle), allowing the use of speckle-correlation imaging for each scan position as part of our AOPI method. c. For $m = 1...M$ scan positions, the ultrasound beam modulates the light at the target plane. d. The modulated light propagates from the target plane through a diffuser and reaches the camera. For each scan position, $N$ different speckle-realization fields are recorded at the camera, due to different speckle illuminations that are obtained using the rotating diffuser. e. For each scan position, the autocorrelation of the ultrasound-modulated light is estimated via correlography, using the $N$ recorded fields. f. Numerical result for AOPI reconstruction. The $M$ autocorrelations for all scan positions are entered into a ptychography-based phase-retrieval algorithm and a full reconstruction of the target is obtained. g. Conventional AOI reconstruction, obtained by plotting the total modulated power at each ultrasound focus position. Scale bar 100µm.
A monochromatic spatially-coherent laser beam illuminates the object through the scattering sample (Fig. 1(a)). The light propagating through the scattering sample results in a speckle illumination pattern at the object plane. Considering a dynamic scattering sample, such as biological tissue, the illuminating speckle pattern on the object is time-varying. We denote the speckle pattern field illuminating the object at a time $t_n$ by $S_n(r)$. The field distribution of the light that traverses the object at time $t_n$ is thus given by $O_n(r) = O(r)S_n(r)$. This light pattern is modulated by an ultrasound focus whose central position, $r^{US}_m$, is scanned over $m = 1..M$ positions inside the sample. We denote the ultrasound focus pressure distribution at the m-th position by $U(r - r^{US}_m)$. The shift-invariance of the ultrasound focus is assumed here for simplicity of the derivation, and is not a necessary requirement [35]. The ultrasound-modulated light field at the m-th ultrasound focus position is given by the product of $O_n(r)$ and $U(r - r^{US}_m)$: $O_{m,n}(r) = O_n(r)U(r - r^{US}_m)$. The ultrasound-modulated light field $O_{m,n}(r)$ propagates to the camera through the scattering sample, producing a random speckle field at the camera plane, $E_{m,n}(r_{cam})$. When the ultrasound focus dimensions are smaller than the memory-effect range, $D_{US} < \Delta r_{mem} \approx L\Delta\theta_{mem}$, where $L$ is the depth of the object inside the scattering sample from the camera side (Fig. 1(a)) and $\Delta\theta_{mem}$ is the angular range of the memory effect, the scattering sample can be considered as a thin random phase mask with a phase distribution $\phi_{sample}(r_{cam})$. The scattered light field measured by a camera that images the scattering sample facet is thus given by:

$$E_{m,n}(r_{cam}) = P_L\{O_{m,n}(r)\}\, e^{i\phi_{sample}(r_{cam})}$$

where $P_L$ is the propagation operator for propagating the field from the object plane to the scattering sample facet. 
The complex field autocorrelation of the target, illuminated by the n-th speckle pattern and multiplied by the ultrasound field distribution, $O_{m,n}(r) \star O_{m,n}(r)$, can be calculated from a single camera frame [36]. However, multiple ($n = 1..N$) camera frames, captured under different speckle illuminations, can be used to calculate the target intensity autocorrelation:

$$AC_m(r) = |O_m(r)|^2 \star |O_m(r)|^2$$

which is free from speckle artefacts, via correlography [36], in the exact same manner as demonstrated without ultrasound modulation by Edrei et al. [31]. Here $O_m(r) = O(r)U(r - r^{US}_m)$ is the object multiplied by the ultrasound focus at the m-th position, and $\star$ denotes correlation. In correlography [31,34,36] the estimate for the object un-speckled intensity autocorrelation at the m-th ultrasound focus position, $\hat{AC}_m(r)$, is calculated by averaging the Fourier transforms of the intensity distributions of the captured speckle frames, after subtracting their mean value [34,36]:

$$\hat{AC}_m(r) \approx AC_m(r) * \underbrace{\left\langle |S_n(r) \star S_n(r)|^2 \right\rangle_n}_{\text{speckle grain autocorrelation}} \qquad (2)$$
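The correlography estimate can be sketched numerically under a deliberately simplified forward model. The assumptions here, none of which are taken from the paper's processing code, are: delta-correlated (one-pixel) illumination speckle, far-field propagation modelled by a single FFT, and an estimator formed by averaging the squared magnitude of the inverse Fourier transform of each mean-subtracted intensity frame. The names `correlography` and `simulate_frames` are hypothetical.

```python
import numpy as np

def correlography(frames):
    """Estimate the un-speckled intensity autocorrelation from N speckle intensity frames.

    frames: array (N, H, W) of camera-plane intensities for one ultrasound position.
    """
    frames = np.asarray(frames, dtype=float)
    centred = frames - frames.mean(axis=(1, 2), keepdims=True)  # subtract each frame's mean
    field_ac = np.fft.ifft2(centred, axes=(1, 2))               # per-frame field autocorrelation
    return np.mean(np.abs(field_ac) ** 2, axis=0)               # ensemble average over n

def simulate_frames(tagged_amplitude, n_frames, rng):
    """Toy forward model: random speckle illumination S_n, then FFT propagation."""
    shape = (n_frames,) + tagged_amplitude.shape
    speckle = (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2.0)
    return np.abs(np.fft.fft2(tagged_amplitude * speckle, axes=(1, 2))) ** 2

# Tagged object: two bright points 5 pixels apart inside the (implicit) ultrasound focus.
rng = np.random.default_rng(0)
tagged = np.zeros((32, 32))
tagged[10, 10] = 1.0
tagged[10, 15] = 1.0
ac_hat = correlography(simulate_frames(tagged, n_frames=200, rng=rng))
# ac_hat is large only at shifts of +/-5 pixels along the second axis,
# i.e. at the separation of the two points; the zero-shift term is removed
# by the mean subtraction.
```

Because the phase mask of the scattering sample multiplies the field at the camera-imaged facet, it drops out of the intensity and is omitted from the toy model.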
The calculated autocorrelation, $\hat{AC}_m(r)$, is the autocorrelation of the object convolved with the diffraction-limited point-spread function of the imaging aperture on the facet plane. A required condition for estimating the object autocorrelation from the Fourier transform of the captured speckle patterns (Eq. 2) is that the object distance from the measurement plane, $L$, is larger than $2Dr_c/\lambda$, where $D$ is the object dimension, i.e. the ultrasound focus diameter, and $r_c$ is the illumination speckle grain size [31]. In deep tissue imaging, $r_c \approx \lambda/2$, and the condition becomes $L > D$, i.e. the imaging depth should be larger than the dimensions of the ultrasound focus, a naturally fulfilled condition.
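As a rough numerical check of this condition, one can insert the values reported elsewhere in this work (ultrasound focus $D_X = 149\mu m$, target-to-diffuser distance $L = 5$cm in the setup description). The speckle grain size is the deep-tissue assumption $r_c \approx \lambda/2$ quoted above, not a measured value for the experiment; a larger grain scales the required distance linearly.

```latex
\begin{align*}
L_{\min} &= \frac{2 D r_c}{\lambda}
          \;\xrightarrow{\; r_c \approx \lambda/2 \;}\; D \approx 149\,\mu\mathrm{m}, \\
\frac{L}{L_{\min}} &\approx \frac{5\times 10^{4}\,\mu\mathrm{m}}{149\,\mu\mathrm{m}} \approx 3.4\times 10^{2}.
\end{align*}
```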
The use of multiple illumination speckle realizations, $N$, is advantageous both for estimating the un-speckled intensity transmission of the object, $|O_m(r)|^2$, and for improved ensemble averaging of the estimation [24,31] (see Supplementary section 5). For a dynamic sample, the speckle illuminations naturally vary in time, and in the case of a static sample, the speckle realizations can be easily obtained by e.g. a rotating diffuser. According to the Wiener-Khinchin theorem, the Fourier transform of the estimated object autocorrelation, $\hat{AC}_m(r)$, is the squared Fourier magnitude (power spectrum) of the object intensity transmission, so that $|F_m(k)| = \sqrt{|\mathcal{F}\{\hat{AC}_m(r)\}|}$. The object itself can thus be reconstructed from $|F_m(k)|$ via phase retrieval [37].
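The Wiener-Khinchin step can be checked on synthetic data. The sketch below is illustrative only: the random test pattern and the function name `fourier_magnitude_from_autocorrelation` are assumptions, and the autocorrelation is the exact circular one rather than a noisy correlography estimate.

```python
import numpy as np

def fourier_magnitude_from_autocorrelation(ac):
    """Fourier magnitude |F(k)| from an autocorrelation, via the Wiener-Khinchin theorem."""
    return np.sqrt(np.abs(np.fft.fft2(ac)))

# Synthetic non-negative "intensity transmission" pattern.
rng = np.random.default_rng(1)
intensity = rng.random((16, 16))
spectrum = np.fft.fft2(intensity)

# Exact circular autocorrelation: inverse FFT of the power spectrum.
ac = np.real(np.fft.ifft2(np.abs(spectrum) ** 2))

magnitude = fourier_magnitude_from_autocorrelation(ac)
# magnitude matches |FFT(intensity)|; the Fourier phase is what phase retrieval must recover.
```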
If a partial overlap between the scanned ultrasound foci exists, the reconstruction problem can be reliably solved using ptychography, an advanced joint phase-retrieval technique [27][28][29]38], that was recently shown to be extremely successful in stable, high-fidelity, robust reconstruction of complex objects, which is not possible by separately solving the $M$ phase-retrieval problems. A numerical example for using our approach to image an extended object beyond the memory-effect FoV, with diffraction-limited resolution, is shown in Fig. 1(f), side-by-side with the conventional AOI image of the same object using the same measurements (Fig. 1(g)). A resolution increase of > ×25 over conventional AOI is apparent in the high-fidelity reconstruction (Fig. 1(f)). A detailed explanation of the data processing and the implemented ptychographic reconstruction algorithm can be found in Supplementary sections 4 and 7.
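To make the joint reconstruction concrete, the following is a minimal sketch of a ptychographic loop with a known probe, using the object-update rule of the rPIE family (step regularised by a weight `alpha`). It is not the implementation described in the Supplementary material: the Gaussian probes, the flat initial guess, the absence of probe refinement and object-domain constraints, and all names (`shifted_gaussian`, `rpie_known_probe`) are assumptions made for illustration.

```python
import numpy as np

def shifted_gaussian(shape, centre, fwhm):
    """Hypothetical probe: Gaussian of given FWHM (pixels) centred at (row, col)."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    rows, cols = np.indices(shape)
    return np.exp(-((rows - centre[0]) ** 2 + (cols - centre[1]) ** 2) / (2.0 * sigma**2))

def rpie_known_probe(magnitudes, probes, n_iter=50, alpha=0.1, init=None):
    """Reconstruct one object from the Fourier magnitudes of (object * probe_m).

    magnitudes: (M, H, W) measured |FFT{object * probe_m}|
    probes:     (M, H, W) known, already-shifted probes
    alpha:      regularisation weight in the object-update denominator
    """
    obj = np.ones(probes.shape[1:], dtype=complex) if init is None else init.astype(complex)
    for _ in range(n_iter):
        for mag, probe in zip(magnitudes, probes):
            exit_wave = obj * probe
            spectrum = np.fft.fft2(exit_wave)
            # Fourier constraint: keep the current phase, impose the measured magnitude.
            revised = np.fft.ifft2(mag * np.exp(1j * np.angle(spectrum)))
            p2 = np.abs(probe) ** 2
            obj = obj + np.conj(probe) * (revised - exit_wave) / ((1 - alpha) * p2 + alpha * p2.max())
    return obj

# Toy data: 5 x 5 overlapping probe positions over a random non-negative object.
rng = np.random.default_rng(0)
truth = rng.random((32, 32))
centres = [(r, c) for r in range(8, 25, 4) for c in range(8, 25, 4)]
probes = np.array([shifted_gaussian(truth.shape, rc, fwhm=10) for rc in centres])
magnitudes = np.abs(np.fft.fft2(truth * probes, axes=(1, 2)))
estimate = rpie_known_probe(magnitudes, probes, n_iter=20)
```

In the setting of this work the "object" entering such a loop would be the intensity transmission and each measured magnitude would come from the correlography estimate at one ultrasound position; the overlap between neighbouring probes is what couples the $M$ phase-retrieval problems.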

A. Experimental setup
To demonstrate our approach we built a proof-of-principle setup, schematically shown in Fig. 1(a). It is a conventional AOI setup with camera-based holographic detection based on phase-shifting off-axis holography (Supplementary section 1), with the addition of a controlled rotating diffuser before the sample (two 1° light shaping diffusers, Newport), used for generating dynamic random speckle illumination. The illumination is provided by a pulsed long-coherence laser at a wavelength of 532nm (Standa). An ultrasound transducer with a central frequency of $f_{US} = 25$MHz, and ultrasound focus dimensions of $D_X = 149\mu m$, $D_Y = 140\mu m$ full width at half max (FWHM) in the horizontal (transverse) and vertical (axial ultrasound) directions, respectively, is used for acousto-optic modulation. The ultrasound focus position was scanned laterally by a motorized stage, and axially by electronically varying the time delay between the laser and ultrasound pulses. The full setup description is given in Supplementary Section 1. As controlled scattering samples and imaged targets for our proof-of-principle experiments we used a sample comprising a target placed in water between two scattering layers composed of several 5° scattering diffusers that have no ballistic component (see Supplementary Section 6). An sCMOS camera (Andor Zyla 4.2 plus) was used to holographically record the ultrasound-modulated scattered light fields, using a frequency-shifted beam. To minimize distortions in the recorded fields, no optical elements were present between the camera and the diffuser. The field at the diffuser plane, $E_{m,n}(r_{cam}/\lambda L)$, was calculated from the camera-recorded field by digital propagation (Supplementary section 1).

B. Imaging an extended object beyond the memory effect
As a first demonstration, we imaged a transmissive target composed of nine digits (Fig. 2(a)) that extends over 3.5 times beyond the memory-effect range of the scattering sample, which is $\Delta r_{mem} \sim 280\mu m$ (Fig. 2(c(i))). For 2D imaging, the ultrasound focus (Fig. 2(c(ii))) was scanned over the target with a step size of $\Delta X = 44.7\mu m$, $\Delta Y = 37.3\mu m$ in the horizontal and vertical directions, respectively. These steps (along with the ultrasound spot size) define a probe overlap of ∼ 88% between neighboring positions (Fig. 2(b)). A study of the effect of the probe overlap on the reconstruction fidelity is presented in Supplementary section 8. For each ultrasound focus position, $r^{US}_m$ ($m = 1..224$), we recorded $N = 150$ different ultrasound-modulated light fields, $E_{m,n}(r_{cam})$, each with a different (unknown) speckle realization, $S_n(r)$. The target was reconstructed from the $M = 224$ autocorrelations using the rPIE ptychographic algorithm [28] (Supplementary section 7). Fig. 2(c(iii)) presents an example of one of the autocorrelations used as input. The AOPI reconstruction (Fig. 2(e)) provides an image with a resolution well beyond that of a conventional AOI reconstruction (Fig. 2(h)), and also well beyond the improved resolution of recent super-resolution AOI techniques such as AO-SOFI [20] (Fig. 2(i-j)). Importantly, since the target extends beyond the memory-effect range, as is the case in many practical imaging scenarios, conventional speckle-correlation imaging without AO modulation [24,31] fails to reconstruct the object (Fig. 2(d)), as expected. Interestingly, when only part of the measured positions is used (only 6 × 8 positions from the center of the object area, instead of all 28 × 8 positions, Fig. 2(f)) to reconstruct the probed object, a good reconstruction is still obtained using the AOPI method (Fig. 2(g)), confirming that the acoustic probe functions well as an isolation aperture.

C. Imaging resolution verification
To demonstrate the resolution increase of AOPI we performed an additional experiment where the target of Fig. 2 was replaced by elements 3-4 of group 6 of a negative USAF-1951 resolution test chart (Fig. 3(a)). For 2D imaging, the ultrasound focus (Fig. 3(d)) was scanned over the target with a step size of $\Delta X = \Delta Y = 29\mu m$. These steps (along with the ultrasound spot size) define a probe overlap of ∼ 93% between neighbouring positions (Fig. 3(b)). For each ultrasound focus position, $r^{US}_m$, $m = 1..72$, we recorded $N = 150$ different ultrasound-modulated light fields, $E_{m,n}(r_{cam})$ (Fig. 3(c)), each with a different (unknown) speckle realization, $S_n(r)$. The reconstruction of the target from the $M = 72$ autocorrelations using the rPIE ptychography algorithm [28] is presented in Fig. 3(g-h). A study of the effect of probe overlap on the reconstruction is presented in Supplementary section 8. The AOPI reconstructed image resolves resolution-target features of size (separation) of 5.52µm (Fig. 3(g-h)).

Fig. 3 (panels e-h). e. Conventional AOI reconstruction. The acoustic probe limits the imaging resolution to the probe dimensions. f. Speckle-correlation reconstruction using a classic phase-retrieval algorithm (same algorithm as in the ptychography engine). g. AOPI reconstruction using the rPIE algorithm. h. Cross-section of the AOPI reconstruction from (g), resolving features ∼ ×30 smaller than the acoustic focus, with a ∼ ×40 increase in resolution compared to classical AOI methods. The line width in the horizontal cross-section is 6.2µm (orange) and in the vertical cross-section 5.52µm (bright orange). Scale bar 40µm.
The cross-sections of the reconstructed image (Fig. 3(h)) allow the imaging resolution to be estimated by fitting the result to a convolution of the known sample structure with a Gaussian PSF. This results in a resolution of 3.65µm (FWHM), a 40-fold increase in resolution compared to the acoustic resolution of conventional AOI (Fig. 3(e)). Interestingly, although the target dimensions in this experiment are contained in the memory-effect range, conventional speckle-correlation imaging that is based on phase retrieval without AO modulation (Fig. 3(f)) results in a considerably lower reconstruction fidelity than the AOPI reconstruction (Fig. 3(g)). This improvement is attributed to the larger input data set and improved algorithmic stability of ptychographic reconstruction compared to simple phase retrieval [27,38].
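The resolution estimate described above, fitting a cross-section to the known structure blurred by a Gaussian PSF, can be sketched as a one-parameter least-squares search. Everything in this sketch is an assumption made for illustration (noise-free synthetic profile, 0.25µm sampling, three bars of roughly 5.5µm width, a brute-force grid of candidate FWHM values, the names `gaussian_kernel` and `fit_psf_fwhm`); the fitting procedure actually used for Fig. 3(h) is not specified in the text.

```python
import numpy as np

def gaussian_kernel(fwhm, dx, half_width):
    """Normalised 1D Gaussian PSF sampled every dx (same units as fwhm)."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    x = np.arange(-half_width, half_width + 1) * dx
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def fit_psf_fwhm(profile, structure, dx, candidates):
    """Return the candidate FWHM whose blurred structure best matches the profile."""
    errors = []
    for fwhm in candidates:
        model = np.convolve(structure, gaussian_kernel(fwhm, dx, half_width=100), mode="same")
        scale = model @ profile / (model @ model)          # best-fit amplitude
        errors.append(np.sum((profile - scale * model) ** 2))
    return candidates[int(np.argmin(errors))]

dx = 0.25                                   # µm per sample (hypothetical sampling)
structure = np.zeros(400)
for start in (134, 178, 222):               # three bars, 22 samples wide, 22 samples apart
    structure[start:start + 22] = 1.0

# Synthetic "measured" cross-section: the known structure blurred by a 3.65 µm FWHM PSF.
profile = np.convolve(structure, gaussian_kernel(3.65, dx, 100), mode="same")

candidates = np.arange(1.0, 8.0, 0.05)
fitted = fit_psf_fwhm(profile, structure, dx, candidates)
gain = 149.0 / fitted                       # relative to the 149 µm ultrasound focus FWHM
```

With the reported values, the ratio of the 149µm focus width to the 3.65µm fitted resolution is about 41, consistent with the quoted 40-fold gain.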

DISCUSSION AND CONCLUSION
We proposed and demonstrated an approach for diffraction-limited wide-FoV optical imaging that combines ultrasound tagging with speckle-correlation computational imaging [31]. In contrast to previous approaches for super-resolved acousto-optic imaging [15][16][17][18], the resolution of our approach is optically diffraction-limited, independent of the ultrasound probe dimensions, the ratio between the speckle grain size and the ultrasound probe dimensions, or the number of realizations [18]. This allowed us to demonstrate a ×40 improvement in resolution over the acoustic diffraction limit, an order of magnitude larger gain in resolution compared to state-of-the-art approaches such as iTRUE [15,16], TROVE [17], and AOTM [18]. In addition, TROVE and AOTM allow the resolution to increase only when unrealistically large speckle grains are considered [18]. Another important advantage of our approach is that, unlike transmission-matrix and wavefront-shaping based approaches [17,18], it does not require unrealistically long speckle decorrelation times. Similar to recent approaches that utilize random fluctuations [20,34], our approach benefits from the natural speckle decorrelation to generate independent realizations of coherent illumination, improving the estimation of the object autocorrelation [31]. While our approach relies on the memory effect to retrieve the diffraction-limited image, its FoV is not limited by the memory-effect range, as is the case in all other noninvasive memory-effect based techniques [23,24]. The FoV is dictated by the scanning range of the ultrasound focus, which is practically limited only by the allowed acquisition time. Such an extension of the FoV in speckle-correlation based imaging has only been obtained before by invasive access to the target object [33,39][40][41]. The adoption of a ptychographic image reconstruction significantly improves the reconstruction fidelity and stability compared to phase-retrieval reconstruction (Fig. 3(f-g)) [23,24].
Our super-resolution AOPI approach does not rely on wavefront-shaping [15][16][17][18] or nonlinear effects [42], and it can be applied to any AOI system employing camera-based coherent detection.
The main limitation of our approach is the requirement for a memory-effect range which is of the order of the ultrasound probe dimensions, i.e. that the ultrasound-tagged fields have non-negligible correlations. This condition can be satisfied by relying on a small ultrasound focus, achieved by the use of high-frequency ultrasound, and by the use of a long laser wavelength, which increases the memory-effect range [32,43]. Importantly, while at very large imaging depths, deep within the diffusive light propagation regime, the memory effect angular range is [32,43], at millimeter-scale depths, which are of the order of the transport mean free path (TMFP), the memory-effect range has been shown to be orders of magnitude larger [44]. In addition, the requirement for a sufficiently large memory effect can be alleviated by relying on translation correlations [25] or the generalized memory effect [26]. Applying the above improvements (a higher ultrasound frequency, a longer optical wavelength, generalized speckle correlations) is the next step for bringing our proof-of-principle demonstrations to practical biomedical imaging applications. Another necessary technical improvement for biomedical application of our approach is in the demonstrated acquisition speed. Similar to all super-resolution techniques that do not rely on object priors, our approach requires a large number of measurements for reconstructing a single image. The required number of frames is the product of: (the number of probe positions) × (the number of realizations per probe position) × (the number of phase-shifting frames). In our proof-of-principle demonstration system, which was not optimized for acquisition speed, we used 150 realizations per probe position and 16 phase-shifting frames. A diffuser mounted on a slowly rotating motor was used to change the realizations, and a conventional sCMOS camera was used to capture the images, which resulted in an acquisition time of ∼ 1.6 seconds per realization. 
This time can be reduced by orders of magnitude using single-shot, off-axis, fast camera-based detection [18,21,45,46], a fast MEMS-based dynamic wavefront randomizer [47], and a 2D electronically scanned ultrasound array instead of the mechanical scan of the single-element ultrasound transducer [48]. Assuming the acquisition speed is limited by a camera frame-rate of ∼ 7,000 frames-per-second [18], the acquisition of 150 realizations for 72 probe positions (as in Fig. 3) will take of the order of ∼ 1.5sec, excluding off-line data processing. This is expected to be adequate for imaging biological structures since the requirement on the speckle decorrelation time is that of a single frame, i.e. ∼ 0.1msec. Moreover, the number of required realizations is expected to be significantly reduced by using advanced correlation-based reconstruction schemes such as those provided by deep neural networks (DNN). These have been recently shown to be able to significantly improve the estimation of the intensity autocorrelation from only a few coherent realizations [49].
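The frame budget quoted above follows from simple arithmetic, reproduced here as a small sketch. The helper names `total_frames` and `acquisition_time_s` are hypothetical; the numbers are those stated in the text (72 probe positions, 150 realizations, 16 phase-shifting frames, ∼ 1.6 seconds per realization, ∼ 7,000 frames-per-second), and the total time of the demonstrated system is a derived estimate rather than a reported figure.

```python
def total_frames(n_positions, n_realizations, n_phase_frames=1):
    """Number of camera frames: positions x realizations x phase-shifting frames."""
    return n_positions * n_realizations * n_phase_frames

def acquisition_time_s(n_frames, frames_per_second):
    """Acquisition time when the camera frame rate is the only bottleneck."""
    return n_frames / frames_per_second

# Demonstrated system (Fig. 3 scan): 72 positions, 150 realizations, 16 phase-shifting frames.
demo_frames = total_frames(72, 150, 16)
demo_time_s = 72 * 150 * 1.6                       # derived from ~1.6 s per realization

# Projected single-shot off-axis detection: one frame per realization at ~7,000 fps.
fast_frames = total_frames(72, 150)
fast_time_s = acquisition_time_s(fast_frames, 7000)   # about 1.5 s

print(demo_frames, demo_time_s / 3600.0, fast_frames, fast_time_s)
```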
The combination of the state-of-the-art optical, ultrasound, and computational imaging approaches, has the potential to significantly impact imaging deep inside complex samples.

ACKNOWLEDGMENTS
We thank Prof. Hagai Eisensberg for the Q-switched laser and thank the Nanocenter at the Hebrew University, with special thanks to Dr. Itzik Shweky and Galina Chechelinsky, for fabricating the target samples.

DISCLOSURES
The authors declare no conflicts of interest.

EXPERIMENTAL SETUP
The full experimental setup is shown schematically in Supplementary Figure S1. It is a pulsed acousto-optic imaging (AOI) experiment, employing camera-based digital phase-shifting holographic detection [1,2]. The light source is a single longitudinal-mode 532nm passively Q-switched laser (Standa STA-01-SH-4) providing < 1ns duration pulses at a repetition rate of $f_{rep} = 25$kHz. The laser output is split into a reference and an object arm by a polarization beam splitter (PBS), with a half-wave plate (HWP) controlling the splitting ratio.
At the object arm the beam is expanded to a width of ∼ 4mm, passed through an f = 200mm focal length lens and a diffuser (two stacked 1° holographic diffusers, Newport) mounted on a computer-controlled rotating stage, generating a controlled speckle decorrelation time in the illumination beam. The speckled beam was introduced into the water tank containing the target, placed between two scattering layers, in a sandwich configuration. The first scattering layer was a 5° diffuser placed ∼ 1cm before the target, and the second layer was composed of two 5° diffusers stacked together. The distance between the target and the second layer was L = 5cm. The targets were fabricated on 1.5mm thick glass slides coated with Ti and Ag. The Ti layer was 20nm thick, placed above an Ag coating of 100nm thickness, fabricated using E-Beam Lithography.
An sCMOS camera (Andor Zyla 4.2) was placed at a distance of 9cm from the second diffuser. For acousto-optic tagging, an ultrasound transducer with a central frequency of 25MHz (Olympus v324, focal length = 12.7mm, F# = 2) was used. The transducer was driven by 100ns-long sinusoidal pulses at a central frequency of $f_{US} = 25\text{MHz} + f_{rep}/2 + 1/T_{exposure}$, where $T_{exposure}$ is the camera exposure time. The pulses were generated by a function generator (Keysight 33600A) with a peak-to-peak amplitude of 2.4$V_{pp}$, amplified by 40 dB by an RF power amplifier (Amplifier Research 25A250A), resulting in a 240$V_{pp}$ driving amplitude.

Fig. S1. The experimental setup. PBS: polarizing beam splitter. AOM: two acousto-optic modulators. BD: beam dump. FG: function generator. AMP: amplifier. UST: ultrasonic transducer. $f_{opt}$: laser optical frequency ($f_{opt} = c/\lambda$). $f_{US}$: ultrasound driving frequency, $f_{US} = 25\text{MHz} + f_{rep}/2 + 1/T_{exposure}$, where $f_{rep}$ is the laser pulse repetition rate, and $T_{exposure}$ is the camera exposure time.
The ultrasound transducer was mounted on a computer-controlled motorized translation stage (Thorlabs), which allowed transverse horizontal scanning of the acoustic focus. The vertical position of the ultrasound focus was controlled by varying the relative delay between the acoustic pulses and the laser trigger pulses. To ensure accurate temporal synchronization with the < 1 ns laser pulses and to compensate for slow temporal drifts of the passively Q-switched laser pulse timing relative to the laser trigger, a photodiode detecting the previous laser pulse was used to trigger each acoustic pulse.
In the reference arm, two acousto-optic modulators (AA Opto-Electronic MT80-A1) connected to RF amplifiers (Mini-Circuits ZHL-3a) were used to frequency-shift the reference beam by f_AOM = f_US + f_phase shifting/4, where f_phase shifting = f_cam/4 and f_cam is the sCMOS camera frame rate.
The ultrasound frequency was chosen to be f_US = 25 MHz + f_rep/2 + 1/T_exposure in order both to efficiently reject the coherent interference of the unmodulated speckle background, which originates from the use of laser pulses shorter than the ultrasound period and pulse width [3,4], and to average over the sinusoidal spatial modulation of the ultrasound pulse. Efficient rejection of the unmodulated light is obtained by adding a frequency shift of f_rep/2 = 12.5 kHz [3], and the additional 1/T_exposure exploits the travelling ultrasound pulse to average the sinusoidal spatial modulation over a single ultrasound wavelength [2], effectively providing a smooth ultrasound probe in the pulse propagation (vertical) direction (Fig. S2(a,c)). The additional frequency shift of f_phase shifting = 5 Hz allows 4-frame phase-shifting holographic detection.
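The 4-frame phase-shifting detection mentioned above can be illustrated with the standard four-step demodulation formula. The following is a minimal sketch, not the acquisition code used in the experiment: it assumes that the reference phase advances by exactly π/2 between consecutive camera frames and that the reference amplitude is real and uniform. The function name `four_step_field`, the sign convention of the phase step, and the synthetic data are all hypothetical.

```python
import numpy as np

def four_step_field(frames):
    """Recover the complex signal field, up to a real scale factor, from four
    intensity frames whose reference phase advances by pi/2 per frame
    (assumed stepping scheme)."""
    i0, i1, i2, i3 = (np.asarray(f, dtype=float) for f in frames)
    return (i0 - i2) + 1j * (i1 - i3)

# Synthetic check with a hypothetical signal field and a uniform reference.
rng = np.random.default_rng(0)
signal = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
reference = 10.0  # real reference amplitude (assumption)
frames = [np.abs(signal + reference * np.exp(1j * k * np.pi / 2)) ** 2
          for k in range(4)]
recovered = four_step_field(frames)  # equals 4 * reference * signal
```

The background terms |signal|² and |reference|² are identical in all four frames and cancel in the differences, which is why only the interference term survives.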

ULTRASOUND PROBE CHARACTERIZATION
A direct optical characterization of the ultrasound focus, measured by removing the diffusers before the camera and placing an imaging lens, and averaged over 200 different speckle illuminations, is presented in Fig. S2. One-dimensional profiles of the ultrasound focus along the horizontal (x-axis) and vertical (y-axis) dimensions are presented in Fig. S2(b-c). Their measured full widths at half maximum (FWHM) are D_X = 149 µm and D_Y = 140 µm, respectively.
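As an illustration of how an FWHM can be read off a one-dimensional focus profile, the sketch below interpolates linearly at the half-maximum crossings. It is only one possible estimator, assuming a single-peaked profile whose baseline is its minimum value. The function name `fwhm` and the synthetic Gaussian profile are hypothetical and are not the measured data of Fig. S2.

```python
import numpy as np

def fwhm(x, y):
    """FWHM of a single-peaked profile, using linear interpolation at the
    two half-maximum crossings. The baseline is assumed to be min(y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float) - np.min(y)
    half = y.max() / 2.0
    above = np.where(y >= half)[0]
    left, right = above[0], above[-1]

    def crossing(i0, i1):
        # Position between samples i0 and i1 where the profile equals `half`.
        return x[i0] + (half - y[i0]) * (x[i1] - x[i0]) / (y[i1] - y[i0])

    x_left = crossing(left - 1, left) if left > 0 else x[left]
    x_right = crossing(right, right + 1) if right < len(y) - 1 else x[right]
    return x_right - x_left

# Synthetic Gaussian profile with a 149 µm FWHM on a 1 µm grid (hypothetical).
x_um = np.arange(-500.0, 501.0, 1.0)
sigma = 149.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))
profile = np.exp(-x_um ** 2 / (2.0 * sigma ** 2))
width_um = fwhm(x_um, profile)
```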

EXPERIMENTAL CHARACTERIZATION OF THE SCATTERING LAYERS MEMORY-EFFECT RANGE
The memory-effect range of the scattering layer through which imaging was performed in our experiments (two 5° diffusers, Newport) was characterized by comparing two autocorrelations of the extended object (Supplementary Fig. S3(a)) used in Fig. 2 of the main text: the autocorrelation of its direct image, obtained with an imaging lens replacing the scattering layer (Supplementary Fig. S3(b)), and the autocorrelation obtained using correlography through the scattering layer but without acousto-optic tagging, following Edrei et al. [5] (Supplementary Fig. S3(c)). The memory-effect correlation at each shift in the object plane was estimated by calculating the ratio between the two autocorrelations and taking the envelope of the resulting ratio. Supplementary Fig. S3(d) presents the horizontal cross-section of the memory-effect range, along with the acoustic probe cross-section and autocorrelation. The resulting memory-effect FWHM is ∼280 µm at the target distance of ∼5 cm, i.e. an angular memory-effect range of ∼5.6 mrad = 0.32°. The requirement for acousto-optic ptychographic imaging (AOPI) is that the acoustic probe be contained within the memory-effect range. In this way, the object at each probe position can be retrieved using speckle correlations, while the full object can extend beyond the memory-effect range indefinitely. As can be seen from Fig. S3(d), under our experimental conditions the extended object (Fig. 2) indeed extends beyond the memory-effect range, while the ultrasound probe itself and its spatial autocorrelation are smaller than the memory-effect correlation range, as required. In cases where the probe dimensions are similar to the memory-effect range, one can digitally compensate for the reduction in memory-effect correlation by normalizing the autocorrelation calculated through the scattering medium by the memory-effect correlation function at each angle.
This approach relaxes the requirement on the probe size relative to the memory-effect range, which may be significant when imaging biological samples.

Fig. S3. The memory-effect correlation FWHM is 280 µm, which is ∼3.5 times smaller than the object size but ∼2 times larger than the ultrasound focus dimensions. Scale bars: 200 µm.
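The angular memory-effect range quoted above follows directly from the measured FWHM and the target distance, and the digital compensation amounts to a pointwise division. The sketch below reproduces the arithmetic and shows one way the normalization could be written. The function name `compensate_memory_effect` and the `floor` parameter, which avoids amplifying noise where the correlation is close to zero, are assumptions and not part of the described method.

```python
import numpy as np

# Angular memory-effect range from the values quoted in the text.
me_fwhm_um = 280.0      # memory-effect FWHM at the object plane
distance_um = 5.0e4     # target-to-scattering-layer distance (5 cm)
angle_mrad = me_fwhm_um / distance_um * 1e3
angle_deg = np.degrees(angle_mrad * 1e-3)

def compensate_memory_effect(autocorr, me_corr, floor=0.1):
    """Divide an autocorrelation by the memory-effect correlation function.
    Correlation values below `floor` are clipped (hypothetical safeguard)."""
    autocorr = np.asarray(autocorr, dtype=float)
    me_corr = np.asarray(me_corr, dtype=float)
    return autocorr / np.maximum(me_corr, floor)
```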

DATA PROCESSING FOR CALCULATING THE ULTRASOUND-GATED AUTOCORRELATIONS
Here we provide the technical details of our experimental autocorrelation calculation. As discussed in the main text, the complex field autocorrelation of the target can be calculated from a single camera frame obtained for a single speckle illumination [6]. However, this will be the autocorrelation of a speckled target, and will thus contain a sharp diffraction-limited peak representing the diffraction-limited speckle grain size. To overcome this, multiple camera frames captured under different speckle illuminations can be used to estimate, via correlography, the incoherent intensity autocorrelation of the target, which is free from speckle artifacts given sufficient speckle averaging [5][6][7][8]. The use of multiple speckle realizations is also advantageous for improving the SNR of the autocorrelation estimate via ensemble averaging [6]. While for an infinite number of speckle patterns the diffraction-limited speckle autocorrelation peak decreases to zero, when a finite number of realizations is used the autocorrelation still contains speckle features (see next section) and a sharp diffraction-limited peak, which can be significantly larger than the object autocorrelation features. In our experiments, 150 different speckle realizations were used at each scan position, and the speckle-grain autocorrelation peak was reduced by replacing the peak intensity with an intensity that is ∼3 times larger than that of its neighbouring pixels, following [5]. For background noise removal and filtering, the peak-reduced autocorrelation was additionally multiplied by a Tukey window with a width adjusted to the ultrasound focus autocorrelation width. Finally, the windowed autocorrelation was filtered by thresholding its Fourier magnitude, resulting in noise removal and spatial smoothing.
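The processing chain described above (ensemble-averaged autocorrelation, peak reduction, Tukey windowing) can be sketched as follows. This is an illustrative outline under stated assumptions rather than the processing code used for the experimental data: the input frames are assumed to be real-valued intensity images, the autocorrelation is computed circularly via the Wiener-Khinchin theorem, the "neighbouring pixels" are taken to be the eight pixels around the peak, and the final Fourier-magnitude thresholding step is omitted. All function names and the window parameter `alpha` are hypothetical.

```python
import numpy as np

def autocorrelation(img):
    """Circular autocorrelation of a mean-subtracted image (Wiener-Khinchin),
    with zero lag moved to the array centre."""
    img = np.asarray(img, dtype=float)
    img = img - img.mean()
    power = np.abs(np.fft.fft2(img)) ** 2
    return np.fft.fftshift(np.fft.ifft2(power).real)

def suppress_peak(ac, factor=3.0):
    """Replace the zero-lag peak by `factor` times the mean of its eight
    neighbours (the neighbourhood choice is an assumption)."""
    out = ac.copy()
    cy, cx = ac.shape[0] // 2, ac.shape[1] // 2
    patch = ac[cy - 1:cy + 2, cx - 1:cx + 2]
    out[cy, cx] = factor * (patch.sum() - ac[cy, cx]) / 8.0
    return out

def tukey_1d(n, alpha):
    """Tukey (tapered-cosine) window of length n with taper fraction alpha > 0."""
    x = np.linspace(0.0, 1.0, n)
    d = np.minimum(x, 1.0 - x)  # normalized distance to the nearest edge
    taper = 0.5 * (1.0 + np.cos(2.0 * np.pi / alpha * (d - alpha / 2.0)))
    return np.where(d < alpha / 2.0, taper, 1.0)

def process_autocorrelation(frames, alpha=0.5, factor=3.0):
    """Average per-frame autocorrelations, reduce the peak, apply a 2-D window."""
    mean_ac = np.mean([autocorrelation(f) for f in frames], axis=0)
    reduced = suppress_peak(mean_ac, factor)
    window = np.outer(tukey_1d(mean_ac.shape[0], alpha),
                      tukey_1d(mean_ac.shape[1], alpha))
    return reduced * window

# Hypothetical stand-in for 150 speckle realizations at one scan position.
rng = np.random.default_rng(1)
frames = rng.random((150, 32, 32))
result = process_autocorrelation(frames)
```

In practice the window width would be matched to the ultrasound focus autocorrelation width, as stated in the text; the value of `alpha` here is arbitrary.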

EFFECT OF THE NUMBER OF SPECKLE REALIZATIONS ON AUTO-CORRELATION ESTIMATION
To determine the required number of speckle realizations (illuminations), we studied the effect of the number of speckle realizations on the retrieved autocorrelation. The results of this study are given in Fig. S4. As expected, a lower number of speckle realizations (Fig. S4(d-f)) results in speckle artefacts in the autocorrelation estimate, harming the reconstruction fidelity.
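A rough feel for why more realizations help can be obtained from the well-known 1/√N scaling of the contrast of N averaged, independent, fully developed speckle patterns. The simulation below is an idealized sketch and not the study of Fig. S4: it assumes spatially uncorrelated pixels with circular Gaussian field statistics, and the function name `averaged_speckle_contrast` is hypothetical.

```python
import numpy as np

def averaged_speckle_contrast(n_realizations, shape=(64, 64), seed=0):
    """Contrast (std/mean) of the average of n independent, fully developed
    speckle intensity patterns with uncorrelated pixels (idealized model)."""
    rng = np.random.default_rng(seed)
    total = np.zeros(shape)
    for _ in range(n_realizations):
        field = rng.normal(size=shape) + 1j * rng.normal(size=shape)
        total += np.abs(field) ** 2
    mean_intensity = total / n_realizations
    return mean_intensity.std() / mean_intensity.mean()

contrast_1 = averaged_speckle_contrast(1)      # close to 1
contrast_150 = averaged_speckle_contrast(150)  # close to 1/sqrt(150)
```

Under this model, 150 realizations leave a residual speckle contrast of roughly 8%, which is consistent with the residual speckle features mentioned in the previous section.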

CHARACTERIZATION OF THE DIFFUSER USED IN THE EXPERIMENT
When imaging through a scattering medium, it can be suspected that some of the imaging information is carried by ballistic components that pass through the scattering medium along with the diffused components. To verify that there is no significant ballistic component that allows imaging in our experiments, and to study the scattering of the diffusers used, we performed an experiment that directly measures the scattering function of the scattering layer used in our experiments. To this end, we focused a beam from a collimated laser diode at 520 nm (Thorlabs CPS520) onto an sCMOS camera (Andor Zyla) using an f = 100 mm lens, and measured the focused intensity without and with the scattering layer placed in the path between the lens and the camera. The results of this experiment are shown in Fig. S5(a), and show that the transmission through the scattering layer does not contain any significant ballistic component, as expected.

PTYCHOGRAPHIC IMAGE RECONSTRUCTION PIPELINE
An iterative minimization algorithm was used to solve the joint-phase retrieval problem. We compared several algorithms from the ptychographic iterative engine (PIE) family [9,10]. The best results on our experimental data were obtained using the regularized PIE (rPIE) algorithm [10], which was used for all reconstructions displayed in this work. Below, we provide the details of the algorithmic steps for reconstructing the full objects from the set of estimated autocorrelations, which are calculated from the measured scattered patterns, as explained in the previous sections.
In a nutshell, the joint phase-retrieval algorithm follows a scheme similar to Fienup's hybrid input-output (HIO) algorithm for phase retrieval [11], but uses a set of power spectra, which in our case are the Fourier magnitudes of the measured autocorrelations, where the scanned acoustic probe acts as the isolated-object constraint (spatial gate). The ptychographic algorithm starts from an initial guess of the object and an estimate of the acoustic probe. During every iteration the algorithm enforces the measured object Fourier amplitudes and the estimated probe as constraints following the PIE update rules, until convergence. Since our method aims at reconstructing the target intensity pattern, we also enforced a non-negativity constraint on the reconstructed object. rPIE is a slightly modified, more advanced version of the original ePIE, differing in the update rules for the exit wave function, probe, and object [10]. For simplicity we present here the original ePIE algorithm. A full flowchart of our method, including both the autocorrelation estimation and the ptychographic reconstruction, is given in Fig. S6.

At the m-th probe scan position and the j-th iteration of the algorithm, the exit wave function can be written as

$$\psi_{j,m}(\mathbf{r}) = O_j(\mathbf{r})\, U_j(\mathbf{r} - \mathbf{r}^{US}_m), \qquad \text{(S1)}$$

where $O_j(\mathbf{r})$ is the object and $U_j(\mathbf{r} - \mathbf{r}^{US}_m)$ is the probe illumination shifted by $\mathbf{r}^{US}_m$ relative to $\mathbf{r}$, a two-dimensional coordinate along the ultrasound propagation direction and perpendicular to it. The initial guess for the probe is a Gaussian profile along the horizontal axis and a convolution of a Gaussian with a rectangular profile along the vertical axis, based on prior measurements of the acoustic probe (Supplementary section 2). In the next step, the algorithm constrains the Fourier amplitude of the exit wave function to the Fourier amplitude estimated from the object autocorrelation, while keeping the Fourier phase:

$$\psi'_{j,m}(\mathbf{r}) = \mathcal{F}^{-1}\left\{ \sqrt{I_m(\mathbf{k})}\, \frac{\mathcal{F}\{\psi_{j,m}(\mathbf{r})\}}{\left|\mathcal{F}\{\psi_{j,m}(\mathbf{r})\}\right|} \right\}, \qquad \text{(S2)}$$

where $\mathcal{F}$ denotes the Fourier transform and $\sqrt{I_m(\mathbf{k})}$ is the Fourier amplitude estimated from the autocorrelation at the m-th probe position.

The algorithm then uses the updated exit wave function $\psi'_{j,m}$ to update first the object, under the assumption that the probe is fixed, and then the probe, under the assumption that the object is fixed, with the update terms

$$O_{j+1}(\mathbf{r}) = O_j(\mathbf{r}) + \alpha\, \frac{U_j^{*}(\mathbf{r} - \mathbf{r}^{US}_m)}{\left|U_j(\mathbf{r} - \mathbf{r}^{US}_m)\right|^2_{\max}} \left(\psi'_{j,m}(\mathbf{r}) - \psi_{j,m}(\mathbf{r})\right), \qquad \text{(S3)}$$

$$U_{j+1}(\mathbf{r}) = U_j(\mathbf{r}) + \beta\, \frac{O_j^{*}(\mathbf{r} + \mathbf{r}^{US}_m)}{\left|O_j(\mathbf{r} + \mathbf{r}^{US}_m)\right|^2_{\max}} \left(\psi'_{j,m}(\mathbf{r}) - \psi_{j,m}(\mathbf{r})\right), \qquad \text{(S4)}$$

where α and β are weight parameters for the update feedback. α is usually between 0.7 and 0.9, and β ≈ 0.1. The j-th ptychography iteration is completed once all M scan positions have been visited in random order.
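The ePIE steps described above can be sketched in a few lines of NumPy. This is a minimal illustration of the original ePIE scheme with Fourier-magnitude measurements and a non-negativity constraint, not the rPIE implementation used for the reconstructions in this work. Several choices are assumptions made for brevity: probe shifts are integer pixels applied circularly with `np.roll`, the object is initialized to ones, the exit-wave difference is shifted back into the probe's own frame before the probe update, and a small constant guards the normalization. All function names and the synthetic test object are hypothetical.

```python
import numpy as np

def shifted(arr, shift):
    """Circularly shift a 2-D array by (dy, dx) pixels."""
    return np.roll(arr, (int(shift[0]), int(shift[1])), axis=(0, 1))

def fourier_error(obj, probe, amplitudes, shifts):
    """Normalized mismatch between modelled and measured Fourier amplitudes."""
    num = sum(np.sum((np.abs(np.fft.fft2(obj * shifted(probe, s))) - a) ** 2)
              for a, s in zip(amplitudes, shifts))
    return num / sum(np.sum(a ** 2) for a in amplitudes)

def epie(amplitudes, shifts, probe, n_iter=20, alpha=0.8, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    probe = probe.astype(complex)
    obj = np.ones(probe.shape, dtype=complex)       # initial object guess
    for _ in range(n_iter):
        for m in rng.permutation(len(shifts)):      # random scan order
            p = shifted(probe, shifts[m])           # probe at position m
            psi = obj * p                           # exit wave
            spectrum = np.fft.fft2(psi)
            # Impose the measured Fourier amplitude, keep the current phase.
            psi_new = np.fft.ifft2(amplitudes[m] * np.exp(1j * np.angle(spectrum)))
            diff = psi_new - psi
            obj_old = obj
            # Object update with the probe held fixed.
            obj = obj + alpha * np.conj(p) / np.max(np.abs(p) ** 2) * diff
            obj = np.maximum(obj.real, 0.0).astype(complex)  # non-negativity
            # Probe update with the object held fixed, in the probe's frame.
            back = (-shifts[m][0], -shifts[m][1])
            o = shifted(obj_old, back)
            norm = np.max(np.abs(o) ** 2) + 1e-12   # guard against division by zero
            probe = probe + beta * np.conj(o) / norm * shifted(diff, back)
    return obj.real, probe

# Synthetic demonstration with hypothetical values.
n = 32
yy, xx = np.mgrid[:n, :n] - n // 2
true_probe = np.exp(-(xx ** 2 + yy ** 2) / (2 * 4.0 ** 2))
true_obj = np.random.default_rng(2).random((n, n))
shifts = [(dy, dx) for dy in range(-8, 9, 2) for dx in range(-8, 9, 2)]
amplitudes = [np.abs(np.fft.fft2(true_obj * shifted(true_probe, s))) for s in shifts]

recon, probe_est = epie(amplitudes, shifts, true_probe)
err_initial = fourier_error(np.ones((n, n)), true_probe, amplitudes, shifts)
err_final = fourier_error(recon, probe_est, amplitudes, shifts)
```

In the experiment the amplitudes would be the square roots of the Fourier transforms of the processed autocorrelations rather than values computed from a known object, and sub-pixel or non-circular shifts may be required.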

NUMERICAL STUDY OF THE RECONSTRUCTION FIDELITY AS A FUNCTION OF NUMBER OF PROBE POSITIONS
The significant advantage of ptychography over individual reconstruction of the object at each probe position lies in the joint solution of the phase-retrieval problem, which exploits the information contained in the overlap between neighboring probe positions [12]. To determine the required number of probe positions, or equivalently the overlap between neighboring probe positions, we performed a numerical investigation of the influence of the number of probe positions on the reconstruction performance. The results of this study are shown in Figure S7.
For the numerical experiment, a USAF-1951 resolution test chart with a minimal line width of 6.24 µm and an experimentally measured acoustic probe (Fig. S2) were used. The conventional AOI reconstruction from 868 probe positions is presented in Fig. S7(a) and shows a low-resolution reconstruction, without the ability to distinguish even the largest features of the target. An ideal AOPI reconstruction from 868 scans, assuming an infinite number of speckle realizations (i.e. effectively incoherent illumination), is presented in Fig. S7(b). Here each autocorrelation was estimated from a single incoherent illumination, free from speckle patterns, and the result shows very well resolved features with a high-resolution reconstruction of the target. Fig. S7(c-f) show the AOPI reconstructions using 150 speckle realizations at each probe position, for different numbers of probe positions: 868 scans (89% overlap between neighboring positions), 460 scans (80% overlap), 272 scans (67% overlap), and 72 scans (24% overlap), respectively.
To quantify the reconstruction fidelity, we calculated the structural similarity index (SSIM) [13] between each reconstruction and the target object. The graph in Fig. S7(g) plots the SSIM index as a function of the number of probe positions. For the AOPI reconstruction using 150 speckle realizations, the reconstruction fidelity does not degrade until the overlap drops below 67% (i.e. fewer than 272 scans). In our experiments an overlap of approximately 90% was used.
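For readers unfamiliar with the SSIM metric, the sketch below evaluates the SSIM formula once over the whole image. This is a deliberate simplification: the standard index [13] is computed in local windows and then averaged, so the values produced here are not directly comparable with those of Fig. S7(g). The function name `global_ssim`, the `data_range` argument, and the synthetic images are hypothetical; the constants 0.01 and 0.03 are the commonly used defaults.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window (global) SSIM between two images of equal shape.
    Simplified variant: no local sliding window is used."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Hypothetical target and a noisy reconstruction of it.
rng = np.random.default_rng(3)
target = rng.random((64, 64))
noisy = np.clip(target + 0.2 * rng.normal(size=target.shape), 0.0, 1.0)
score_same = global_ssim(target, target)
score_noisy = global_ssim(target, noisy)
```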