Rodent ultrasonic vocal interaction resolved with millimeter precision using hybrid beamforming

Ultrasonic vocalizations (USVs) fulfill an important role in communication and navigation in many species. Because of their social and affective significance, rodent USVs are increasingly used as a behavioral measure in neurodevelopmental and neurolinguistic research. Reliably attributing USVs to their emitter during close interactions has emerged as a difficult, key challenge. If addressed, all subsequent analyses gain substantial confidence. We present a hybrid ultrasonic tracking system, Hybrid Vocalization Localizer (HyVL), that synergistically integrates a high-resolution acoustic camera with high-quality ultrasonic microphones. HyVL is the first to achieve millimeter precision (~3.4–4.8 mm, 91% assigned) in localizing USVs, ~3× better than other systems, approaching the physical limits (mouse snout ~10 mm). We analyze mouse courtship interactions and demonstrate that males and females vocalize in starkly different relative spatial positions, and that the fraction of female vocalizations has likely been overestimated previously due to imprecise localization. Further, we find that when two male mice interact with one female, one of the males takes a dominant role in the interaction both in terms of the vocalization rate and the location relative to the female. HyVL substantially improves the precision with which social communication between rodents can be studied. It is also affordable, open-source, easy to set up, can be integrated with existing setups, and reduces the required number of experiments and animals.

We have developed an advanced localization system for USVs in which a high-resolution 'acoustic camera' consisting of 64 ultrasound microphones is combined with an array of 4 high-quality ultrasound microphones. Both systems can individually localize USVs but exhibit complementary patterns of localization errors. We fuse them into a hybrid system that exploits their respective advantages in sensitivity, detection, and localization accuracy. We achieve a median absolute localization error of 3.4-4.8 mm, translating to an assignment rate of ~91%.
Compared to the previous state of the art 73,75 , the accuracy represents a three-fold improvement that halves the proportion of previously unassigned USVs. Given the physical dimensions of the mouse snout (ø~10 mm), this likely approaches the physical limit of localizability for USVs. We apply the system to dyadic and triadic courtship interactions between male and female mice. We chose to compare dyadic and triadic interactions because courtship in nature is typically competitive; the comparison is therefore scientifically relevant and benefits from highly reliable assignment of USVs. We demonstrate that the fraction of female vocalizations has likely been overestimated in previous analyses, due to a lack of precision in sound localization. Further, in the triadic recordings we find that in competitive male-male-female courtship, one male takes a dominant role, which shows in his emitting most USVs and positioning himself closer to the female abdomen.

Results
We analyzed courtship interactions of mice in dyadic and triadic pairings. The mice interacted on an elevated platform inside an anechoic booth (see Fig. 1A, for details see Materials & Methods: Recording Setup). Each trial consisted of 8 minutes of free interaction while movements were tracked with a high-speed camera (see Fig. 1B), and ultrasonic vocalizations (USVs) were recorded with a hybrid acoustic system composed of 4 high-quality microphones (i.e., USM4) as well as a 64-channel microphone array (Cam64, often referred to as an acoustic camera; see Fig. 1C for raw data samples, green and red dots mark the start and stop times of USVs).
Most USVs were emitted in close proximity in dyadic and triadic pairings (see Fig. 1D).
Reliably assigning most USVs to their emitter therefore requires a highly precise acousto-optical localization system. The presently developed Hybrid Vocalization Localizer (HyVL) system is the first to achieve sub-centimeter precision, i.e. ~3.4-4.8 mm (see Fig. 2 for an overview). This accuracy on the acoustic side is achieved by combining the complementary strengths of the USM4 and Cam64 data. The Cam64 data is processed using acoustic beamforming 82 which delivers highly precise estimates (MAE = ~4-5 mm), but is not sensitive enough for very high-frequency USVs (see Fig. 1-Fig. Suppl. 1). The USM4 data is analyzed using the previously published SLIM algorithm 73 , which delivers accurate (MAE = ~11-14 mm) and less frequency-limited estimates. The accuracy of SLIM, the previously most accurate ultrasonic localization technique (see Discussion for a comparison), is generally lower than that of HyVL, but it makes essential contributions to the overall accuracy of HyVL through the integration of the complementary strength of the two methods/microphone arrays (see Fig. 3A, L-shape of errors).
The methods exhibit a complementary pattern of localization errors, which predestines them for high synergy when combined (see below).
For each USV, a choice is made between the USM4/SLIM and Cam64/Beamforming estimates based on a comparison of each method's USV-specific certainty and the relative position of the mice to the estimates, using an extended, hybrid Mouse Probability Index (MPI 76 ).
HyVL is the first system of its kind that exploits a hybrid microphone array to overcome the limitations of each subarray. The positions of the mice are obtained via manual and automatic video tracking using DeepLabCut 83 , each of which achieves millimeter precision for localizing the snout.
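The beamforming principle applied to the Cam64 data can be illustrated with a minimal frequency-domain delay-and-sum sketch. It is not the published processing pipeline: the array geometry (an 8 x 8 grid with 2 cm spacing, 40 cm above the platform), the three test frequencies, the ideal free-field propagation model, and all function names are assumptions chosen purely for illustration. The sketch simulates the phasor received at each microphone from a point source, steers the array to every candidate position on the platform plane, and reports the position with the highest summed power.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def make_array(n_side=8, spacing=0.02, height=0.40):
    """Hypothetical n_side x n_side planar array centred above the platform."""
    offset = (n_side - 1) * spacing / 2.0
    return [(i * spacing - offset, j * spacing - offset, height)
            for i in range(n_side) for j in range(n_side)]


def simulate_phasors(source, mics, freqs):
    """Ideal free-field phasor at every microphone, one row per frequency."""
    data = []
    for f in freqs:
        k = 2.0 * math.pi * f / SPEED_OF_SOUND  # wavenumber
        # Propagation phase delay plus spherical spreading (1/distance).
        data.append([cmath.exp(-1j * k * math.dist(source, m)) / math.dist(source, m)
                     for m in mics])
    return data


def beamform(data, mics, freqs, grid_x, grid_y):
    """Return ((x, y), power) of the strongest steered response on the plane z = 0."""
    best_pos, best_power = None, -1.0
    for x in grid_x:
        for y in grid_y:
            dists = [math.dist((x, y, 0.0), m) for m in mics]
            power = 0.0
            for f, row in zip(freqs, data):
                k = 2.0 * math.pi * f / SPEED_OF_SOUND
                # Undo the expected propagation phase, then sum coherently.
                steered = sum(p * cmath.exp(1j * k * d) for p, d in zip(row, dists))
                power += abs(steered) ** 2
            if power > best_power:
                best_pos, best_power = (x, y), power
    return best_pos, best_power


mics = make_array()
freqs = [50e3, 60e3, 70e3]            # illustrative USV frequencies in Hz
true_source = (0.030, -0.020, 0.0)     # metres, on the platform plane
grid = [round(-0.04 + 0.002 * i, 3) for i in range(41)]  # 2 mm search grid
estimate, _ = beamform(simulate_phasors(true_source, mics, freqs),
                       mics, freqs, grid, grid)
print("estimated source position (m):", estimate)
```

Summing the steered power over several frequencies is what suppresses the spatial ambiguities that a single narrowband component would produce; real recordings additionally contain noise, reflections and occlusion, none of which are modeled here.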
Overall, 228 recordings were collected from 14 male and 4 female mice (153 dyadic, 67 triadic, and 8 with a single mouse). In 90 recordings USVs were produced and recorded with Cam64 and USM4 simultaneously (55 dyadic, 28 triadic, and 7 single). The single mouse recordings were also used in a previous publication 73 where only the SLIM accuracy was evaluated. 112 recordings were collected in a balanced design (4 dyadic and 4 triadic per male mouse paired with all females), and the remaining recordings were conducted with good vocalizers to maximize the number of USVs for downstream analysis. In all trials combined, 13714 USVs were detected.

Precision of USV Localization
Assigning USVs to individual mice required combining high-speed video imaging with the HyVL location estimates at the times of vocalization. We manually tracked the animal snouts at the temporal midpoint of each USV to obtain near-optimal position estimates (see Fig. 2). We first assessed the relative structure of the localization errors between both methods, USM4/SLIM (Fig. 3A, green) and Cam64/Beamforming (red, each dot is a USV). While most errors were small and clustered close to the origin of the graph (evidenced by the small Median Absolute Errors (MAE), shown as horizontal and vertical lines, respectively), the less frequent, larger errors exhibited an L-shape. This error pattern is an optimal situation for combining estimates from the two methods, so that each compensates for the other's limitations. While the Cam64 data can compensate for single-microphone noise through the large number of microphones, the performance of its microelectromechanical systems (MEMS) microphones deteriorates at very high frequencies (see Fig. 1-Fig. Suppl. 1B). Conversely, the USM4 microphones show an excellent noise level across frequencies (see Fig. 1-Fig. Suppl. 1A) but can produce erroneous estimates if there is noise in a single microphone and have an intrinsic limitation in spatial accuracy due to the physical size of their receptive membrane (ø~20 mm).
We therefore designed an analytical strategy to combine the estimates of both systems to optimize the number of reliably assignable USVs, while evaluating the resulting spatial accuracy alongside. Briefly, the location estimates of both methods each come with an estimate of localization uncertainty. First, we assess for each method's estimate how reliably it can be assigned to one of the mice, taking into account the positions of the other mice. This is quantified using the Mouse Probability Index (MPI), 76 which compares the probability of assignment to a particular mouse to the sum of probabilities for all mice, weighted by the estimate's uncertainty. If the largest MPI exceeds 0.95, it is considered a reliable assignment to the corresponding mouse.
If both methods allowed reliable assignments, the one with smaller residual distance was chosen.
If only one method was reliable for a particular USV, its estimate was used. If neither method allowed for reliable assignment, the USV was not used for further analysis. This typically happens if the snouts are extremely close or the USV is very quiet. This approach outperformed many other combination approaches in accuracy and assignment percentage, e.g. Maximum Likelihood (see Materials & Methods: Assigning USVs and Discussion for details).
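The selection rules described above can be sketched as follows. The 0.95 threshold and the three rules (both reliable: smaller residual distance; one reliable: use it; none reliable: discard) are taken from the text, whereas the isotropic Gaussian error model used to turn distances and uncertainties into probabilities, as well as all function and variable names, are assumptions for illustration. The published MPI computation may differ in detail.

```python
import math

MPI_THRESHOLD = 0.95  # reliability criterion stated in the text


def mouse_probability_index(estimate, sigma, snouts):
    """MPI per mouse, assuming an isotropic Gaussian localization error (assumption)."""
    weights = [math.exp(-0.5 * (math.dist(estimate, s) / sigma) ** 2) for s in snouts]
    total = sum(weights)
    if total == 0.0:  # estimate is far from every mouse
        return [0.0] * len(snouts)
    return [w / total for w in weights]


def assign(estimate, sigma, snouts):
    """Return (mouse index, residual distance) if the assignment is reliable, else None."""
    mpi = mouse_probability_index(estimate, sigma, snouts)
    best = max(range(len(snouts)), key=mpi.__getitem__)
    if mpi[best] <= MPI_THRESHOLD:
        return None
    return best, math.dist(estimate, snouts[best])


def hybrid_assign(slim, cam64, snouts):
    """slim and cam64 are (estimate, sigma) pairs; apply the selection rules from the text."""
    candidates = [a for a in (assign(*slim, snouts), assign(*cam64, snouts))
                  if a is not None]
    if not candidates:
        return None  # USV is excluded from further analysis
    return min(candidates, key=lambda a: a[1])  # smaller residual distance wins


# Hypothetical example: two snouts 30 mm apart, positions and sigmas in mm.
snouts = [(0.0, 0.0), (30.0, 0.0)]
result = hybrid_assign(((12.0, 0.0), 12.0), ((2.0, 1.0), 4.0), snouts)
print(result)
```

In this hypothetical example the broader USM4/SLIM estimate falls between the two animals and fails the threshold, while the Cam64/Beamforming estimate is reliable and is therefore used.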
Analyzing all courtship vocalizations, HyVL performed significantly better than either method alone (see Fig. 3), allowing a total of 91.1% of USVs to be assigned at a spatial accuracy of 4.8 mm (MAE). This constitutes a substantial, 2.9-fold improvement in accuracy over the previous state of the art, the SLIM algorithm. 73 On the full set of USVs where both microphone arrays were recording (N = 7982), HyVL outperformed both USM4/SLIM and Cam64/Beamforming significantly, both in residual error (SLIM: 14.8 mm; Cam64: 5.33 mm; HyVL: 5.08 mm; p<10^-10 for all comparisons, Wilcoxon rank sum test) and percentage of reliably assigned USVs (SLIM: 74.4%; Cam64: 79.8%; HyVL: 91.1%). Cam64/Beamforming performed even more precisely on its reliably assignable subset (4.55 mm), which was, however, smaller than the HyVL set. This difference emphasizes the complementarity of the two methods and thus the synergy through their combination. There was no significant difference between tracking on dyadic and triadic recordings (HyVL: 5.0 mm vs. 5.1 mm, p=0.71, Wilcoxon rank sum test) with correspondingly similar selection percentages (92 vs. 90%, respectively).
The accuracies above are an average over localization performance at any distance. In particular during close interaction, USVs will often be reflected or obstructed, complicating localization. While this constitutes the realistic challenge during mouse social interactions, we also investigated the 'ideal', unobstructed performance of HyVL by comparing the performance on USVs emitted when all animals were 'far' (>100 mm) apart, i.e. more than ~20 times the average accuracy of HyVL, as well as for a single male mouse on the platform. For the far USVs the reliably assignable fraction increased to 97.9%, and the accuracy significantly improved to 3.79 mm (Fig. 3C gray, p=8.6×10^-7, Wilcoxon rank sum test). For the single animal USVs, the accuracy was even better at 3.45 mm with 98.4% reliably assigned (Fig. 3C, blue). In addition, we evaluated HyVL's performance on sounds emitted from a miniature speaker placed in a regular grid of locations (see Fig. 3-Fig. Suppl. 2). In this condition the accuracy was even higher (1.87 mm, or even ~0.5 mm when correcting for experimental factors, see figure caption); however, given the differences in the emitter characteristics, the emitted sounds and the lack of absorption, this should be treated as a lower bound on the error that will be hard to achieve with mice.
Next, we inspected localization separately along the X and Y axes to check for anisotropies (Fig. 3D/E, histograms normalized to maximum). The position of the closest animal aligned precisely with the estimated position in both dimensions, indicated by the high density along the diagonal (Pearson r > 0.99 for both dimensions) and the MAEs along the X and Y directions separately (X = 3.1 mm, Y = 2.8 mm). These one-dimensional accuracies might be of relevance for interactions where movement is restricted.
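For readers who wish to reproduce this type of per-axis summary on their own data, the two statistics can be computed as sketched below. The function names and the five example coordinates are hypothetical and are not taken from the dataset; MAE denotes the median absolute error, as elsewhere in the text.

```python
import math
import statistics


def median_absolute_error(estimated, actual):
    """Median of |estimate - truth| along one axis."""
    return statistics.median(abs(e - a) for e, a in zip(estimated, actual))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical X coordinates in mm: tracked snout vs. acoustic estimate.
snout_x = [10.0, 50.0, 120.0, 200.0, 310.0]
estimate_x = [12.0, 47.0, 121.0, 204.0, 309.0]
mae_x = median_absolute_error(estimate_x, snout_x)
r_x = pearson_r(estimate_x, snout_x)
print(mae_x, r_x)
```

Note that a very high correlation is expected whenever the errors are small relative to the extent of the platform, so the per-axis MAE is the more informative of the two quantities.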
Lastly, we visualized the localization density relative to the mouse that the vocalization was assigned to (Fig. 3F). Combining both dimensions and appropriately rotating them, the estimated position of the USVs is shown relative to the mouth. The density is narrowly centered on the snout of the mouse (circle radius = MAE: green: SLIM method; orange: HyVL; light orange: HyVL assigned USVs, gray: Far assigned USVs).
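The rotation into a snout-centered reference frame can be sketched as follows. The sketch assumes that the head direction is defined by a second tracked marker behind the snout (here called `head_center`, a hypothetical name) and that the egocentric +y axis points along the head direction; the actual marker set and axis convention used for Fig. 3F may differ.

```python
import math


def to_egocentric(point, snout, head_center):
    """Express `point` relative to the snout, with +y along the head direction.

    All arguments are (x, y) tuples in platform coordinates. `head_center` is a
    hypothetical second marker used only to define the heading.
    """
    heading = math.atan2(snout[1] - head_center[1], snout[0] - head_center[0])
    rot = math.pi / 2.0 - heading  # rotation that maps the heading onto +y
    dx, dy = point[0] - snout[0], point[1] - snout[1]
    return (dx * math.cos(rot) - dy * math.sin(rot),
            dx * math.sin(rot) + dy * math.cos(rot))


# A mouse facing +x; a USV estimate 5 mm in front of its snout (units: mm).
ahead = to_egocentric((15.0, 0.0), snout=(10.0, 0.0), head_center=(0.0, 0.0))
print(ahead)
```

Accumulating the transformed estimates of all assigned USVs in a two-dimensional histogram yields a density map of the kind described above; the same transform applied to the markers of the other animals gives relative-position maps.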
In summary, the HyVL system provides a substantial improvement in the localization precision. In comparison to other methods, its precision also allows a larger fraction of vocalizations to be reliably assigned and retained for later analysis, which enables a near complete analysis of vocal communication between mice or other vocal animals (see Discussion for details).

Sex Distribution of Vocalizations During Social Interaction
Courtship interactions between mice lead to high rates of vocal production, but are challenging due to the relative proximity, including facial contact. Previous studies using a single microphone have often assumed that only the male mouse vocalized, [84][85][86][87] while more recent research has concluded that female mice vocalize as well. 76,88 Female vocalizations were typically less frequent, but constituted a substantial fraction of the vocalizations (11-18%) 73,74,76,89 . Below, we demonstrate that the accuracy of the localization system can be an important factor for conclusions about the contribution of different sexes to the vocal interaction.
Over all dyadic and triadic trials combined, females produced the minority of vocalizations.
Beyond the absolute distance between the mouths of the mice, high-accuracy localization of USVs allows one to position the bodies of the animals relative to one another at the times of vocalization by combining acoustic data with multiple concurrently tracked visual markers. This provides an occurrence density of other mice relative to the emitter (Fig. 4E).
Female mice appear to emit vocalizations in very close snout-snout contact, with a small fraction of vocalizations occurring when the male snout is around the hind-paws/ano-genital region (Fig. 4F). Male mice emit vocalizations both in snout-snout contact and at greater distances, which predominantly correspond to a close approach of the male's snout to the female ano-genital region (Fig. 4G). This was verified separately with a corresponding analysis, where the recipient's tail-onset was used instead (not shown).
In summary, the combination of high-precision localization and selection using the MPI indicates that female vocalizations may be even less frequent than previously thought. When females vocalize, the mice appear to be almost exclusively in close snout-snout contact. As this is incidentally also the condition with the highest chance of mis-assignment, even the remaining female vocalizations need to be treated with caution.

Vocalization Rate Analysis
In dyadic trials, one female and one male mouse interacted, whereas in triadic trials either two males and one female or two females and one male mouse interacted. We first address whether, in dyadic trials, there were significant differences in individual vocalization rates between the mice.
For the balanced dataset of 14x4 dyadic interactions (pairing of all males with all females), we did not find a significant effect of individual on vocalization rates for either male or female mice (see In the balanced dyadic and triadic datasets, only 23/112 recordings contained vocalizations. We collected additional dyadic and triadic recordings for the purpose of maximizing the number of USVs, both for assessing HyVL performance and comparing dyadic and triadic interactions. In this enlarged dataset a total of 83 recordings (55 dyadic, 28 triadic) were available which contained USVs. This dataset was still balanced for female mice but unbalanced for male mice, i.e. the same mice participated in both dyadic and triadic recordings, but not with exactly the same number of recordings. While the analysis on the balanced dataset above did not suggest significant differences between individuals, we thus cannot fully exclude that the differences reported below are partially due to individual differences between some male mice.
In the analysis of triadic interactions, we separate competitive and alternative contexts depending on whether a mouse had to compete with another same sex mouse or could interact with two opposite sex mice, respectively. For triadic trials we further separate the same-sex mice into dominant and subordinate, based on who vocalized more.
However, in competitive interactions between males, one male mouse significantly and strongly dominated the 'conversation', with on average 9-fold more vocalizations than the other male mouse (TD vs Ts, Fig. 5A Alternatives). While the present division into dominant and subordinate mouse based on a higher vocalization rate within a recording will always lead to a significant difference, the magnitude of the difference between them is the striking aspect in this comparison. Overall male vocalization rates were similar in competitive and alternative triadic trials. Female vocalization rates were similar across all compared conditions. The mean vocalization energy of dominant males in triadic pairings tended to be higher than that of subordinate males in triadic pairings; however, this result did not reach significance in the present dataset (see Fig. 5C). No effects of vocalization energy were found in females.
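The rate-based division into dominant and subordinate animals, and the resulting fold difference, can be written down in a few lines. The 8-minute trial duration is taken from the recording description; the function name and the example counts are hypothetical.

```python
def split_by_rate(usv_counts, duration_min=8.0):
    """Label two same-sex mice in one recording as dominant or subordinate.

    usv_counts: (count_a, count_b), the number of assigned USVs per mouse.
    Returns (dominant index, subordinate index, fold difference in rate).
    """
    rates = [c / duration_min for c in usv_counts]  # USVs per minute
    dominant = 0 if rates[0] >= rates[1] else 1
    subordinate = 1 - dominant
    if rates[subordinate] > 0:
        fold = rates[dominant] / rates[subordinate]
    else:
        fold = float("inf")  # the subordinate mouse never vocalized
    return dominant, subordinate, fold


# Hypothetical recording: mouse 0 emitted 20 USVs, mouse 1 emitted 180.
labels = split_by_rate((20, 180))
print(labels)
```

Because the labels are assigned after the fact from the same counts, the fold difference is at least 1 by construction, which is why only its magnitude, not its sign, is informative.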
The distance to the closest animal of the opposite sex was found to be even closer during triadic trials (see Fig. 5D), driven purely by male vocalizers (p=0.00046, after Bonferroni correction as above, Wilcoxon Sum of Ranks test): the distance to the closest animal does not change between conditions for vocalizing females (p=0.975, Wilcoxon Sum of Ranks test).
Interestingly, the distance to the closest animal was larger for females at the time of vocalization when they had a same-sex competitor on the interaction platform with them than when they were the only female (Tc vs. Ta, p=0.0068, Wilcoxon Sum of Ranks test).
Lastly, we investigated whether the division into a dominant and subordinate male based on the vocalization rate was also reflected in the spatial behavior of the male mice relative to the female mouse. For this purpose we again constructed relative spatial interaction histograms (see Fig. 6, analogous to Fig. 4), separately for USV-rate dominant and subordinate males. The results are displayed as the relative location between the male snout and the female abdomen.
Dominant males spent more time close to the female abdomen, thus engaging in ano-genital contact (Fig. 6A, center), in comparison with subordinate males (Fig. 6B). This is highlighted in the difference between the spatial interaction histograms (Fig. 6C), where the most salient dominant peak occurs in the center, while the subordinate male spent more time in snout-snout contact, indicated by the blue arc at about one mouse body length from the center. These differences were significant, as were differences at a number of other locations in the spatial interaction histogram. Significance analysis was performed using 100x bootstrapping on the relative spatial positions to estimate p=0.99 confidence bounds around the histograms of the dominant and subordinate males, respectively. Significance at a level of p<0.01 highlights multiple relative spatial positions.
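A percentile bootstrap of the kind described can be sketched as follows. The number of resamples (100) and the confidence level (0.99) follow the text; the uniform square binning, the percentile method, the fixed random seed and all names are assumptions for illustration, and the original analysis may have been implemented differently.

```python
import random


def histogram2d(points, edges):
    """Count (x, y) points per cell of a square grid with uniform bin edges."""
    n = len(edges) - 1
    width = edges[1] - edges[0]
    counts = [[0] * n for _ in range(n)]
    for x, y in points:
        if edges[0] <= x < edges[-1] and edges[0] <= y < edges[-1]:
            i = min(int((x - edges[0]) / width), n - 1)
            j = min(int((y - edges[0]) / width), n - 1)
            counts[i][j] += 1
    return counts


def bootstrap_bounds(points, edges, n_boot=100, level=0.99, seed=0):
    """Per-bin lower and upper percentile bounds from resampling with replacement."""
    rng = random.Random(seed)
    n = len(edges) - 1
    samples = [histogram2d(rng.choices(points, k=len(points)), edges)
               for _ in range(n_boot)]
    # With 100 resamples and level 0.99 these indices select the minimum and maximum.
    lo_idx = int((1.0 - level) / 2.0 * (n_boot - 1))
    hi_idx = (n_boot - 1) - lo_idx
    lower = [[0] * n for _ in range(n)]
    upper = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            values = sorted(s[i][j] for s in samples)
            lower[i][j], upper[i][j] = values[lo_idx], values[hi_idx]
    return lower, upper
```

One possible criterion, again an assumption, is to call a bin significantly different between dominant and subordinate males when the bounds of the two (normalized) histograms do not overlap in that bin.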
In summary, in competitive triadic interactions, one of the male mice took a strongly dominant role, evidenced both in the vocalization rate as well as the more abundant ano-genital interactions with the female throughout the recordings. In triadic interactions, the female mouse was generally approached more closely by a male mouse, in particular in the alternative condition.
The latter could, however, be a consequence of the larger number of male animals on the platform, compared to dyadic and triadic competitive (from the perspective of the female).

Discussion
We have developed and evaluated a novel, hybrid sound localization system (HyVL) for ultrasonic vocalizations (USVs) emitted by mice and other rodents. USVs are innately used by rodents to communicate social and affective information and are increasingly being used in neuroscience as a behavioral measure in neurodevelopmental and neurolinguistic research. In the context of dyadic and triadic social interactions between mice, we demonstrate that HyVL achieves a groundbreaking increase in localization accuracy down to ~3.4-4.8 mm, enabling the reliable assignment of >90% of all USVs to their emitter. Further, we demonstrate that this can be combined with automatic tracking, enabling a near-complete and automated analysis of vocal interaction between rodents. The showcased analyses demonstrate the advantages obtained through more precise localization, further discussed below. HyVL is based on an array of high-quality microphones in combination with a commercially available, affordable acoustic camera.
With our freely available code, this system can be readily reproduced by other researchers and has the potential to revolutionize the study of natural interactions of mice.

Comparison with previous approaches for localizing vocalizations
Localization accuracy was first systematically reported by Neunuebel et al. 76 using a 4-microphone setup and a maximum likelihood approach 90 , who attained an MAE of ~38 mm that conferred an assignment rate of 14.6-18.1% (their Table 1, assigned relative to detected or localized). Originating from the same research group, Warren et al. 75 employed both 4- and 8-microphone setups in a follow-up study, achieving an MAE of ~30 mm for 4 microphones (~52% assignment rate) and of ~20 mm with 8 microphones (~62% assignment rate), both using a jackknife approach to increase robustness of localization. Stahl et al. 73 introduced the SLIM algorithm, reaching an MAE of ~11-14 mm (~80-85% assignment rate depending on the dataset) using 4 microphones. Presently, we advance the state of the art in multiple ways: we use 68 microphones, combining a 64-channel 'acoustic camera' with 4 high-quality ultrasonic microphones. While the acoustic camera has relatively basic micro-electromechanical systems (MEMS) microphones, it is inexpensive, features a high degree of integration and is correspondingly easy to operate. Combining the complementary strengths of the two arrays is the key advantage of the present approach over previous approaches, as it allows for a quantum leap in accuracy (3.4-4.8 mm, 91% assignment rate), while keeping the complexity of the system manageable. A comparable alternative might be a 16-channel array of high-quality microphones, which would, however, be substantially more expensive (~€40,000) as well as cumbersome to build and refine.
A future generation of MEMS microphones might make the use of the high-quality microphones unnecessary and thus further simplify the system setup, allowing for inexpensive, small-form factor deployment (see below).

Expected impact for future research
Mice and rats are social animals, 91,92 and isolated housing 93 or testing 94 can affect subsequent research outcomes. Social isolation also has direct effects on the number and characteristics of USVs, at least in males. 95,96 Sangiamo et al. 88 demonstrated that distinct USV patterns can be linked to specific social actions and the latter that locomotion and USVs influence each other in a context-dependent way. Using HyVL, such analyses could be extended to more close-range behaviors, when a substantial fraction of the vocalizations are emitted (see Fig. 1D). The development of more unrestricted behavioral paradigms, made viable by increased localization precision, will thus also likely prove valuable to the fields of human language impairment and animal behavior. As an added benefit, better USV localization will also likely increase lab animal well-being via (i) more social contact in specific cases where they spend much time with their conspecifics in the testing environment, or when the home environment is the testing environment (e.g., PhenoTyper; Noldus Information Technologies), and (ii) a reduced need for (non-)invasive markers.
Here, we conducted a limited set of showcase analyses on the spatial characteristics of vocalization behavior. As expected, the system was accurate enough to assign vocalizations during many snout-snout interactions as well as other, slightly more distant interactions, e.g. snout contact with the ano-genital region of the dyadic partner. We found the male mice to vocalize most while making snout contact with the abdomen and ano-genital region of the female wildtype. Females vocalized predominantly during snout-snout contact, with the male's snout in front of the female mouse's snout.
This highlights an example of how localization accuracy can shape our understanding of roles in social interaction between mice: a recent, pivotal study 76 demonstrated that female mice vocalize during courtship interactions. Research from our group 73 concluded further that mice primarily vocalize in snout-snout interactions, incidentally the condition that makes assignment the most difficult. While the present results maintain that female mice vocalize, the fraction appears to be lower than previously thought. We, however, emphasize that this conclusion still requires further study under different social contexts, e.g. interaction of more mice as in some of the previous studies. 34,88 The compact form factor of the HyVL microphone arrays, in particular the Cam64, enables studies of social interaction in home cages. There, rodents are less stressed and likely to exhibit more natural behavior, in particular if the home cage includes enrichments. The relatively low hardware costs for HyVL allow deployment of multiple systems to cover larger and more natural environments. Research in animal communication with other species could also benefit from use of HyVL, for example, with different insects or other vocal animals, as there is little reason to suspect that the performance of HyVL would not extend to lower frequencies. Flying animals, such as bats or birds, could also be studied; however, the subsequent data analysis would have to be extended by one dimension.

Current limitations and future improvements of the presented system
The millimeter accuracy by HyVL enables the assignment of USVs even during close interaction, certainly including all snout-ano-genital interactions, and many snout-snout interactions.
However, certain snout-snout interactions are still too close to reliably assign co-occurring USVs.
While the MPI criterion maintains reliability even then, subsequent analysis will be partially biased due to the exclusion of these USVs during the closest interactions. While a further improvement of accuracy may be possible, close inspection of the sound density maps available via beamforming from the Cam64 recordings suggests that the mouse's snout acts as a distributed source: the sound density is rather evenly distributed on it, without a clear internal peak. During free interaction, we noticed that the sound density was co-elongated with the head direction of the mouse and could thus be used as an additional feature to identify the vocalizer. However, this proved unreliable during close interaction, likely due to absorption and reflection of sounds by the mice's bodies. More advanced modeling of the local acoustics or deep learning might be able to resolve these issues by analyzing interactions where one mouse is known to be silent, e.g. by cutting the laryngeal nerves.
The present strategy for combining the estimates from Cam64/Beamforming and USM4/SLIM was chosen as it optimized the reliably assigned percentage of USVs, while minimizing the residual distance. We also tested alternative approaches, e.g. using direct beamforming on the combined data from Cam64 and USM4 (unreliable estimates, due to mismatch of number of microphones, not further pursued), maximum likelihood combination of estimates (MAE=7.1 mm), 97 and making the selection solely depend on the MPI (MAE=5.2 mm).
While each of these approaches has certain theoretically attractive features, the results were worse in each case, likely due to particular idiosyncrasies of the MPI computation, the different microphone characteristics, and the estimation of single-estimate uncertainty.
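To make the contrast with the selection strategy concrete, one common form of maximum likelihood combination, inverse-variance weighting of two independent Gaussian estimates, is sketched below. This formulation is an assumption; the text does not specify how the tested maximum likelihood combination was implemented, and the function name and example values are hypothetical.

```python
def fuse_ml(est_a, sigma_a, est_b, sigma_b):
    """Inverse-variance weighted fusion of two independent Gaussian position estimates.

    est_a and est_b are (x, y) tuples; sigma_a and sigma_b are their isotropic
    standard deviations. Returns the fused position and its standard deviation.
    """
    w_a, w_b = 1.0 / sigma_a ** 2, 1.0 / sigma_b ** 2
    fused = tuple((w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(est_a, est_b))
    fused_sigma = (1.0 / (w_a + w_b)) ** 0.5
    return fused, fused_sigma


# Hypothetical values in mm: a precise estimate at the origin and a broader
# estimate 10 mm away are fused into a position 1 mm from the precise one.
fused_pos, fused_sigma = fuse_ml((0.0, 0.0), 4.0, (10.0, 0.0), 12.0)
print(fused_pos, fused_sigma)
```

Under this formulation every estimate contributes to the result, so a large error in one method shifts the fused position whenever its uncertainty is underestimated, whereas the selection strategy discards that estimate entirely. This is one plausible reading of why fusion performed worse given the L-shaped error pattern, not a conclusion drawn in the text.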
A small set of vocalizations was not assigned solely due to the overall proximity threshold of 50 mm (see Methods, 2.9%). We have previously shown that very quiet or very short USVs are, unsurprisingly, harder to detect and localize. 73 In addition, spectrally narrow and acoustically occluded USVs are likely hard to localize: USVs that are spectrally very narrow (i.e. close to a pure tone) will have phase ambiguity, which will make it hard to assign a single location. USVs that are acoustically occluded (e.g. an animal vocalizing away from a microphone, or a mouse body in the path of the sound) will have a reduced signal-to-noise ratio on one or more microphones. In our experience, the latter two affect the USM4 data more than Cam64, due to their different placement relative to the platform.
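The scale of the phase ambiguity for near-pure tones follows from the acoustic wavelength, as the short calculation below illustrates. A speed of sound of 343 m/s, the 70 kHz example frequency and the function names are assumptions for illustration. For a pure tone, the difference in path length from the source to two microphones is only determined up to integer multiples of the wavelength.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C


def wavelength_mm(freq_hz):
    """Acoustic wavelength in millimeters."""
    return SPEED_OF_SOUND / freq_hz * 1000.0


def ambiguous_path_differences(true_diff_mm, freq_hz, max_diff_mm):
    """Path-length differences that produce the same inter-microphone phase.

    max_diff_mm bounds the physically possible difference (at most the
    distance between the two microphones).
    """
    lam = wavelength_mm(freq_hz)
    n_max = int((max_diff_mm + abs(true_diff_mm)) / lam) + 1
    candidates = [true_diff_mm + n * lam for n in range(-n_max, n_max + 1)]
    return [d for d in candidates if abs(d) <= max_diff_mm]


lam_70k = wavelength_mm(70e3)
candidates = ambiguous_path_differences(0.0, 70e3, max_diff_mm=10.0)
print(lam_70k, candidates)
```

At ultrasonic frequencies the wavelength is comparable to the size of the mouse snout, so even closely spaced microphones admit several candidate delays for a pure tone. Frequency-modulated or broadband USVs resolve this because only the true delay is consistent across all frequencies.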
A very small percentage of vocalizations (<0.1%) contained multiple, differently shaped vocalization traces that, when re-analyzed in shortened time-frequency bins with beamforming, could be assigned to two different males. Such overlapping vocalizations did not form a harmonic stack. Overall, overlaps were surprisingly rare and only occurred when our USV detection algorithm produced a longer interval, affecting the cumulative heatmap because beamforming is separately performed from the onset to the end of each vocalization. Although the identity of the assigned vocalizer could shift in these very rare cases depending on which time bin was reanalyzed, the system's localization performance remained in principle unaffected: as mentioned above, shorter time bins on non-overlapping parts correctly show the origin of the vocalizations in this case, and we think that improved USV detection/separation based on the harmonic structure will partially address this issue. During the beamforming each vocalization can then be separately localized, by restricting the beamforming to the corresponding time and frequency range. Further, the beamforming analysis could be refined so that multiple salient peaks can be detected in the soundfield estimate, e.g. a sequence of soundfield estimates would be computed on shorter segments of data and later fused again. As this uses less data per single estimate, it also increases the possibility of false positives, which in the current situation with very few overlaps in time, would likely reduce the overall accuracy of the system. Lastly, for the present data, if a time window was analyzed such that the intensity map of the sound field contains multiple hotspots of an approximately equal magnitude, the USV would likely remain unassigned, because the within soundfield uncertainty would be higher than for a single peak, and this would reduce the MPI.
However, given the rarity of these cases in our dataset, we do not think that their exclusion would change the results appreciably.
Lastly, for the purpose of online feedback during experiments and to reduce data warehousing, it would be advantageous to perform the localization of USVs in real-time. This would be enabled by streaming the data to a GPU, performing localization immediately and keeping only a single channel, beamformed estimate of each USV. Ideally, the same device could run visual tracking simultaneously, which would remove all temporal limitations on the recordings in terms of data size and enable continuous audiovisual tracking.

Conclusion and Outlook
HyVL delivers breakthrough accuracy and assignment rates, likely approaching the physical limits of assignment. The low system costs (<€10k) in relation to its performance make HyVL an excellent choice for labs studying rodent social interaction. Many recent questions regarding the sequencing of vocalizations during social interactions become addressable with HyVL without intrusive interventions. Its use can refine the precision and reliability of the analysis while reducing the number of animals required to complete the research, owing to a larger fraction of assigned USVs per animal.

The current experiment was performed as an add-on to an existing set of experiments, whose focus included a region-specific knockout of Foxp2 in the cerebellar Purkinje cells of the male mice, denoted as Foxp2^flox/flox;Pcp2^Cre. Neither previous work nor our own work has detected any differences in USV production between WT and KO animals 98 , so, given the mostly methodological focus of the present work, we considered it acceptable to pool them in the current analysis, reducing the number of animals needed, thus treating all males as WT C57Bl/6J, the genotype of the female mice.

Recording Setup
The behavioral setup consisted of an elevated interaction platform in the middle of an anechoic booth together with 4 circumjacent ultrasonic microphones as well as an overhanging 64-channel microphone array and high-speed video camera (see Fig. 1A). Considering the directional receptivity of the microphones (~25 dB attenuation at 45°), the microphones were placed a short distance away from the corners of the platform to maximize sound capture (5 cm in the long direction and 6 cm in the short direction of the platform). Each microphone was rotated to aim at the platform center. The microphones produce a flat (±5 dB) frequency response within 7-150 kHz that was low-pass filtered at 120 kHz to prevent aliasing (using the analog, 16th order filter that is part of the microphone amplifier). Recorded data was digitized using a data acquisition card (PCIe-6351,

Experimental Procedures
The experiment had 3 conditions: dyadic (with 2 mice), triadic (with 3 mice), as well as monadic (single male mouse, one type of ground truth data). For each of the male animals (n=14) we conducted one trial with each female (n=4) in dyadic and triadic conditions, i.e. 112 trials in total, in pseudo-random order. The third animal in triadic conditions was chosen pseudo-randomly.
Afterwards, to maximize the number of USVs for evaluation of the localization system, another 108 trials were run with the best male vocalizers in both dyadic and triadic conditions, leading to a total of 220 trials. In 85/220 trials USVs were emitted by the mice (57 dyadic, 28 triadic), prompting the experimenter to initiate a Cam64 recording (see below). Two dyadic trials were excluded from further analysis due to repeated but required experimenter interference during the recordings leaving 55 dyadic trials. The USVs from the remaining 83 trials formed the basis for the evaluation of the tracking accuracy of HyVL, while we used the 112 balanced-design dyadic and triadic recordings (with and without USVs) in the analysis of differences in dyadic/triadic interactions (Fig. 6). Lastly, 8 trials were recorded with just a single male mouse on the platform.
Each trial consisted of 8 minutes of free interaction between at least 1 female and at least 1 male mouse on the platform. Females were always placed on the platform first, and males were added shortly thereafter. In the monadic case, fresh female urine was placed on the platform instead of a female mouse to prompt the male mouse to vocalize. The high-speed camera and 4 high-quality microphones started recording after all mice had been placed on the platform and continued for 8 minutes. Data points where one mouse had left the platform, or where the hand of the experimenter was visible within 10 seconds before or after (e.g., to pick up a mouse), were discarded (<5% of frames). Due to the rate of data generation of the Cam64 recordings (32 MB/s), their duration and timing were optimized manually. The experimenter had access to the live spectrogram from the USM4 microphones and, upon the start of USVs, triggered a new Cam64 recording (of fixed 2 min duration). If additional USVs occurred after that point, the experimenter could trigger additional recordings.

Data Analysis
The analysis of the raw data involved multiple stages (see Fig. 2): From the audio data, the presence and origin of USVs was estimated automatically. From the video data, mice were carefully tracked by hand at the temporal midpoint of each USV as near-optimal estimates for their acoustically localized origin. To estimate what proportion of our precision would be lost when using a faster and more scalable visual tracking method, we also tracked the mice automatically during dyadic trials. The estimated locations of the mice and USVs were then used to attribute the USVs to their emitter. All these steps are described in detail below.
Audio Preprocessing: Prior to further analysis, acoustic recordings were band-pass filtered, with frequency ranges that differed between the two arrays. USM4 data was band-pass filtered between 30-110 kHz using an infinite impulse response (IIR) filter of order 20 in Matlab (function: designfilt, type: bandpassiir).
Cam64 data was band-pass filtered with a frequency range adapted to the frequency content of each USV. Specifically, first the frequency range of the USV was estimated as the 10th to 90th percentile of the set of most intense frequencies at each time point. Next, this range was broadened by 5 kHz at both ends, and then limited at the top end to 95 kHz. If this range exceeded 50 kHz, the lower end was set to 45 kHz. This ensured that beamforming was conducted over the relevant frequencies for each USV and avoided the high-frequency regions where the Cam64 microphones are dominated by noise (see Fig. 1C, Fig. 1-Fig. Suppl. 1).
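The band selection rule above can be illustrated with a minimal Python sketch. It is not the authors' code (the analysis was performed in MATLAB); the function names `percentile` and `cam64_band` and all parameter names are hypothetical. The sketch also assumes that "this range exceeded 50 kHz" refers to the width of the band, which is one possible reading of the text.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def cam64_band(peak_freqs_khz, margin=5.0, top_cap=95.0,
               max_width=50.0, fallback_low=45.0):
    """Return (low, high) band edges in kHz for one USV.

    peak_freqs_khz: most intense frequency at each time point of the USV.
    """
    # 10th to 90th percentile of the most intense frequencies,
    # broadened by `margin` at both ends.
    low = percentile(peak_freqs_khz, 10) - margin
    high = percentile(peak_freqs_khz, 90) + margin
    # Limit the top end to avoid the noisy high-frequency region.
    high = min(high, top_cap)
    # ASSUMPTION: "range exceeded 50 kHz" is read as band width > 50 kHz.
    if high - low > max_width:
        low = fallback_low
    return low, high
```

For a USV whose most intense frequencies run from 60 to 80 kHz in 2 kHz steps, this sketch yields a band of 57 to 83 kHz; a very broadband USV is clipped to 45 to 95 kHz.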

Video Preprocessing:
The high-speed camera lens failed to produce perfect rectilinear mapping and was placed off-center with respect to the interaction platform, thereby producing a nonlinear radial-tangential visual distortion. We corrected for the radial distortion with:

Detection of Ultrasonic Vocalizations: USVs were detected automatically using a set of custom algorithms described elsewhere 72 . Detection was only performed on the USM4 data, as their sensitivity and frequency range were generally better than for the Cam64 (see Fig. 1C, Fig. 1-Fig. Suppl. 1). A vocalization only had to be detected on 1 of the 4 high-quality microphones to be included in the set. In total, we collected 13406 USVs, out of which 8424 occurred when the Cam64 recordings were active.
Automatic Visual Animal Tracking: To assess whether we could reliably assign USVs to their emitter in a fast and scalable way, we automatically tracked multiple body parts of interacting mice in all frames, most importantly the snout and head center, for all dyadic trials (using DeepLabCut 2 ; see Fig. 2) and a subset of triadic trials (using SLEAP 99 ; see Fig. 6). With this approach, tracking is not temporally restricted to the midpoint of USV production, but can be performed for every frame of the entire recording. These data can be used to establish spatial densities of interaction against which, e.g., the spatial density of vocalizations can be compared 73 . For the dyadic recordings, mice were tracked offline using a combination of DeepLabCut (DLC) 83 and extensive post-processing to maintain animal identity over the entire recording. While the tracking results from DLC were generally quite accurate, we refrained from using them directly because of inaccuracies and identity switches that occurred on many hundreds of occasions in every recording. Instead, we adopted a strategy where DLC generated an overcomplete set of candidate locations, followed by custom synthesis and tracing of these alternatives in space and time (see Fig. 3-Fig. Suppl. 1). In short, improved marker locations were generated from marker estimate clouds produced by DLC. Next, these marker positions were assembled into short spatiotemporal threads with the same, unknown identity based on a combination of spatial and temporal analysis. Finally, the thread ends were connected based on quadratic spatial trajectory estimates for each marker, yielding the complete track for both mice. This strategy resulted in reliable, high-quality tracking for all recordings, with a greatly reduced number of manual corrections needed overall (~10 per trial on average). All resulting tracks were visually verified (for a representative example, see Video 1).
For tracking the triadic interactions with two males, we used the SLEAP 99  For instance, we addressed cases where two instances were detected on a single mouse or when one instance appeared to cover two mice. To further refine the results, we interpolated outlying instances based on velocity jumps.
We compared the accuracy of localization on the basis of manual tracking with that of automatic tracking (N = 5046 USVs, see Fig. 3-Fig. Suppl. 3). Directly comparing the snout positions between the methods shows a median difference of 3.76 mm. The resulting error for localizing USVs was still lower than that of other systems, but was significantly increased by ~0.9 mm (MAE = 5.71 mm) relative to manual tracking. Both manual and automatic tracking appear to have particular patterns of residual errors, indicated by the fact that the error between the tracking methods is much larger than their difference in USV localization error. Interestingly, the percentage of reliably assignable USVs increased to 93.6% (HyVL), compared to 92% with manual tracking, for the dyadic recordings only. We optimized the mouth location on the snout-to-head-center line, finding an optimal location at 15% of the snout-to-head-center distance toward the front of the animal.
This indicated that the automatic tracking tended to place the snout tracking point a bit further into the snout than manual tracking, which might also explain the increase in assignment, due to a slight but erroneous increase in the separation between the snouts. While these results suggest that manual tracking is still advantageous, they highlight that completely automatic analysis of dyadic and possibly n-adic social interaction experiments is feasible at slightly reduced accuracy.

Manual Visual Animal Tracking: The display included a zoom function for optimal accuracy, as tracking was click-based. Users could also freely scroll in time to ensure consistent animal identities. Only the snout and head center (i.e., midpoint between the ears) needed to be annotated because these points define a vector representing the head location and direction, which was all that was required in subsequent behavioral analyses.

Localization of Ultrasonic Vocalizations: USVs were spatially localized using a hybrid approach that integrates SLIM 73 (based on 4 high-quality microphones) and beamforming (based on the 64-channel microphone array), drawing on the complementary strengths of the 2 microphone arrays (see Fig. 1-Fig. Suppl. 1). For example, the Cam64 array provided excellent localization for USVs with energy below ~90 kHz, due to the increasing noise floor of the MEMS (microelectromechanical systems) microphones with sound frequency. Conversely, the 4 high-quality ultrasonic microphones (USM4) have a rather flat noise level as a function of frequency. On the other hand, USM4 will occasionally have glitches in one of the microphones, which can be compensated for in Cam64-based estimates through the number of microphones. As a consequence, the errors of the two methods show an L-shape (see Fig. 3A), which highlights the synergy of a hybrid approach.
Acoustic localization using the Cam64 recordings was performed on the basis of delay-and-sum beamforming 82 . In beamforming, signals from all microphones are combined to estimate a spatial density that correlates with the probability of a given location being the origin of the sound. Specifically, we computed beamforming estimates for a surface situated 1 cm above and co-centered with the interaction platform, extending to 5 cm beyond all edges of the platform (i.e., 50 x 40 cm in total) at a final resolution of 1 mm in both dimensions. We refer to this density of sound origin as D(x, y), where x and y denote spatial coordinates. To prevent noises unrelated to a specific USV from contaminating the location estimate, we limited beamforming to a particular frequency range estimated from the simultaneous data of the USM4 array that enveloped the USV. Spatial density was defined as:

The final beamforming estimate was calculated sequentially in 2 steps: first, a coarse estimate with 1 cm resolution was generated over the entire beamforming surface. Second, a fine-grained estimate with 1 mm resolution was generated over a 30 x 30 mm window centered on the peak location of the coarse estimate (see Fig. 2 for an example). This two-step approach was chosen to optimize performance, as an estimate with 1 mm resolution over the entire beamforming surface would be computationally expensive while failing to produce a better result.
For USVs of sufficient quality (i.e., containing frequency content below ~90 kHz while being sufficiently intense and long), both the coarse and fine estimates of D(x, y) contained a peak whose height was typically very large compared to the surrounding values at distances greater than a few cm. The peak location of the fine-grained estimate was used as the final estimate of the USV's origin. To assess the quality of this location estimate, we computed a signal-to-noise ratio (SNR) per USV as follows:

where D(x, y) is assumed to be calculated for the USV in question. The inverse, 1/SNR 64 , was used as a proxy for the uncertainty of localization for a given USV.
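The two-step (coarse, then fine) search can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it uses a frequency-domain form of delay-and-sum steering on noise-free, simulated single-frequency phasors, and the function names, the microphone geometry passed in, and the speed of sound are all assumptions made for the example.

```python
import cmath
import math

C_SOUND = 343.0  # m/s; assumed speed of sound in air


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def simulate_spectra(mics, source, freqs):
    """Noise-free phasor at each microphone, one row per frequency (toy data)."""
    return [[cmath.exp(-2j * math.pi * f * distance(m, source) / C_SOUND)
             for m in mics] for f in freqs]


def beam_power(point, mics, spectra, freqs):
    """Delay-and-sum power at one candidate point, summed over frequencies."""
    total = 0.0
    for f, row in zip(freqs, spectra):
        acc = 0j
        for m, s in zip(mics, row):
            # Undo the propagation delay expected for this candidate point.
            acc += s * cmath.exp(2j * math.pi * f * distance(m, point) / C_SOUND)
        total += abs(acc) ** 2
    return total


def grid_peak(x0, y0, nx, ny, step, z, mics, spectra, freqs):
    """Location (x, y) of the largest beam power on a regular grid at height z."""
    best, best_xy = -1.0, None
    for i in range(nx):
        for j in range(ny):
            xy = (x0 + i * step, y0 + j * step)
            p = beam_power((xy[0], xy[1], z), mics, spectra, freqs)
            if p > best:
                best, best_xy = p, xy
    return best_xy


def localize(mics, spectra, freqs, x_range, y_range, z,
             coarse=0.01, fine=0.001, half_window=0.015):
    """Coarse 1 cm search over the surface, then 1 mm search in a 30 x 30 mm window."""
    nx = int(round((x_range[1] - x_range[0]) / coarse)) + 1
    ny = int(round((y_range[1] - y_range[0]) / coarse)) + 1
    cx, cy = grid_peak(x_range[0], y_range[0], nx, ny, coarse, z,
                       mics, spectra, freqs)
    n = int(round(2 * half_window / fine)) + 1
    return grid_peak(cx - half_window, cy - half_window, n, n, fine, z,
                     mics, spectra, freqs)
```

The design point illustrated here is the cost saving: the fine grid is evaluated only on 31 x 31 points around the coarse peak rather than over the full surface at 1 mm resolution.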
Localization from the USM4 recordings was performed using the SLIM method 73 . Briefly, SLIM analytically estimates submanifolds (in 3D: surfaces) of a sound's spatial origin for each pair of microphones and combines these into a single estimate by intersecting the manifolds (in 2D: lines). The intersection has an associated uncertainty which scales with the uncertainty of the localization estimate for a given USV. Specifically, the uncertainty was defined as the standard deviation of all locations at which the intersection density of all origin curves exceeded 90% of its maximum.
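As a rough illustration of this uncertainty measure, the following Python sketch thresholds a gridded density at 90% of its maximum and returns the spread of the selected locations. It is not the SLIM implementation: the function name `peak_uncertainty` is hypothetical, and the sketch assumes that the "standard deviation of all locations" is the population root-mean-square distance of the selected points from their centroid.

```python
import math


def peak_uncertainty(density, xs, ys, fraction=0.9):
    """Spread of grid locations whose density exceeds fraction * max.

    density[i][j] is the intersection density at (xs[i], ys[j]).
    ASSUMPTION: spread = RMS distance of selected points from their centroid.
    """
    peak = max(max(row) for row in density)
    pts = [(xs[i], ys[j])
           for i, row in enumerate(density)
           for j, v in enumerate(row)
           if v > fraction * peak]
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    var = sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in pts) / len(pts)
    return math.sqrt(var)
```

A sharply peaked density selects few, tightly clustered points and yields a small value, whereas a broad or elongated intersection region yields a larger one.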
Lastly, for each USV where both Cam64 and SLIM location estimates Ẋ 64 and Ẋ were available, a single estimate Ẋ was computed based on the two estimates, spatial uncertainties and their spatial relation to the mice at the current time (see below).
USV Assignment: The final, hybrid location estimate and assignment to a mouse was performed while taking into account the probability of making a false assignment, as proposed before 76 , through the calculation of the mouse probability index (MPI). While the MPI was previously only used to exclude uncertain assignments (e.g., if two mice are nearly equidistant to the estimated sound location), we also adapted it here to select and combine the location estimates. The MPI for each mouse k was computed from the probability that the USV in question originated from mouse k, which was in turn computed using Ẋ ℎ , an estimate of the acoustic origin. The mouth location of each mouse was assumed to lie on a line connecting the snout and head-center. For manually tracked recordings, the optimal location on this line was close to the snout (~2% towards the head, where % is relative to the snout-to-head-center tracked distance), while in the automatic tracking it was ahead of the snout tracking point (~15% away from the head).

The position density of the recipient mouse was collected in cumulative fashion, with the polar coordinate system translated appropriately for each USV based on its temporal midpoint.
We assumed that the mice had no preference for relative vocalizations to either side of their snout, so all relative spatial positions were agglomerated in the right hemispace for further analysis. All data points were then binned using a polar, raw-count histogram with bins of 10° and 1 cm.
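The folding and binning step can be sketched in Python as follows. The function name `polar_histogram`, the coordinate convention (position of the recipient relative to the emitter's snout, given as distance ahead and distance to the side, in cm), and the 20 cm radial limit are assumptions made for illustration and are not taken from the original analysis.

```python
import math


def polar_histogram(rel_positions_cm, max_radius_cm=20,
                    angle_bin_deg=10, radius_bin_cm=1):
    """Raw-count polar histogram with left/right positions folded together.

    rel_positions_cm: iterable of (ahead, side) offsets in cm, where
    `ahead` points along the emitter's head direction (hypothetical convention).
    Returns counts[angle_bin][radius_bin]; angle 0 deg is straight ahead.
    """
    n_ang = 180 // angle_bin_deg
    n_rad = int(max_radius_cm // radius_bin_cm)
    counts = [[0] * n_rad for _ in range(n_ang)]
    for ahead, side in rel_positions_cm:
        r = math.hypot(ahead, side)
        if r >= max_radius_cm:
            continue  # outside the histogram range
        # abs(side) folds the left hemispace onto the right one.
        angle = math.degrees(math.atan2(abs(side), ahead))  # 0..180
        a = min(int(angle // angle_bin_deg), n_ang - 1)
        counts[a][int(r // radius_bin_cm)] += 1
    return counts
```

With the default bin sizes this gives 18 angular bins of 10° and radial bins of 1 cm, and two positions mirrored about the head direction fall into the same bin.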

Statistical Analysis
To avoid distributional assumptions, all statistical tests were nonparametric, i.e., Wilcoxon rank sum test for two-group comparisons and Kruskal-Wallis for single factor analysis of variance.
Correlations were computed as Spearman's rank-based correlation coefficients. Error bars represent standard errors of the mean (SEM) unless stated otherwise. All statistical analyses were performed in MATLAB v.2018b (The Mathworks, Natick) using functions from the Statistics Toolbox.

Fig. 1C for the effect on the detectability of high-frequency USVs). Since no information was available on the input-referred self-noise level in the technical documentation, we shifted the curve to its minimum. In reality it should be shifted higher to be quantitatively compared with the Avisoft microphone, as the latter's large membrane is expected to outperform the AKU242 at all frequencies. For clarity, the above spectra are not equivalent to the sensitivity of the microphone at different frequencies; however, the baseline noise limits the sensitivity at these frequencies. While in principle a frequency-dependent increase in sensitivity could overcome the baseline noise, this does not seem to be the case (see Fig. 1C, top).

(i) Manual visual tracking: the observer was presented with a combined display of the vocalization spectrogram and the concurrent video image at the temporal midpoint of each USV and annotated the snout and head center (i.e., midpoint between the ears).
(ii) Automatic visual tracking: the procedure started by finding the optimal locations of each marker based on marker estimate clouds produced by DeepLabCut 83 (DLC) for all frames. Next, these marker positions were assembled into spatiotemporal threads with the same, unknown identity based on a combination of spatial and temporal analysis. Finally, the thread ends that were still loose were connected based on quadratic spatial trajectory estimates for each marker, yielding the complete track for both mice (see Materials & Methods: Automatic Visual Animal Tracking and Fig. 3-Fig. Suppl. 1).

presented from a miniature high-quality in-ear driver (Sennheiser IE800, calibrated up to 80 kHz). Due to the physical size of the speaker, its membrane was located ~4 cm above the platform, which was taken into account in the source localization. The video shown was slightly corrected for lens distortions to exhibit orthogonal/parallel lines on the placement grid (see Methods).
B The accuracy of the HyVL estimates (orange) at each location (gray dots) was quite similar, after minor, linear rescaling (~2% in both directions) and residual shifting (4.5 mm in x, error bars show x and y [17,83] percentiles around the median). The remaining shifts in e.g. the lower left corner could partly be due to slight misplacements of the speaker.
C Density of estimates centered on known speaker locations. The errors group around the individual locations, while the variance inside the groups is below a single millimeter. This further suggests that the main source of shifts was imperfect placement/orientation of the speaker.

F Female mice appear to emit vocalizations in very close snout-snout contact, with a small fraction of vocalizations also occurring when the male mouse is around the hind-paws/ano-genital region.
G Male mice emit vocalizations both in snout-snout contact and at greater distances, which predominantly correspond to a close approach of the male's snout to the female ano-genital region. This was verified separately with a corresponding analysis, where the recipient's tail-onset was used instead (not shown).
H Radial distance density of receiver animals, marginalized over directions, shows a significant difference, with females vocalizing mostly when males (blue) are in close proximity of the snout, while males vocalize when the female mouse's snout is very close (corresponding to snout-snout contact), but also when the female's snout is about 1 body length away (snout-ano-genital interaction). Plots show means and SEM confidence bounds.
I Direction density of receiver animals, marginalized over distances, shows that female mice vocalize primarily when the male mouse's snout is very close and in front of them. Note that the overall angle of approach of the male mouse is not from directly ahead (see Fig. 4-Fig. Suppl. 1).

B While the fraction of USVs emitted by males was overall comparable between D and T pairings, the dominant male (TD) emitted a substantially larger fraction than their submissive counterpart (TS), roughly a factor of 9. In competitive pairings, male mice tended to emit an overall larger fraction of all USVs than in alternative pairings (TC vs. TA), but this is unsurprising as both males vocalize. In female mice, the overall fraction of USVs in D and T pairings was also similar (see details in Results for potential caveats of the dominant/subordinate classification).
C In triadic pairings, dominant male mice tended to vocalize more intensely than in dyadic pairings; however, this difference was not significant at the current sample size. No significant differences were found for female mice.

B For the subordinate male, the histogram was less peaked around the proximal snout-abdomen interactions, but showed a more visible arc between 90-180°, pointing to snout-snout interactions.
C The difference between the two histograms (each density-normalized to a sum of one) shows the focused snout-abdominal interactions for the dominant male, and the arc pointing to snout-snout interactions for the subordinate male, in addition to smaller absolute differences in other relative locations.
D Spatial regions of significant difference between the dominant and subordinate male were found both in the regions highlighted in C and in more distant regions. Significance was assessed by bootstrapping confidence bounds on the histograms of the dominant and subordinate males (resampling the relative locations and rebuilding the histogram, 100x). The distance to the most extreme values was taken as the limit for significant deviation at p<0.01, and the difference in C was then compared in both the positive and negative direction against these bounds.