DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila

Studying how neural circuits orchestrate limbed behaviors requires the precise measurement of the positions of each appendage in three-dimensional (3D) space. Deep neural networks can estimate two-dimensional (2D) pose in freely behaving and tethered animals. However, the unique challenges associated with transforming these 2D measurements into reliable and precise 3D poses have not been addressed for small animals, including the fly, Drosophila melanogaster. Here, we present DeepFly3D, software that infers the 3D pose of tethered, adult Drosophila from multiple camera images. DeepFly3D does not require manual calibration, uses pictorial structures to automatically detect and correct pose estimation errors, and uses active learning to iteratively improve performance. We demonstrate that unsupervised behavioral embedding is more accurate when based on 3D joint angles than on commonly used 2D pose data. Thus, DeepFly3D enables the automated acquisition of Drosophila behavioral measurements at an unprecedented level of detail for a variety of biological applications.

Introduction

The precise quantification of movements is critical for understanding how neurons, biomechanics, and the environment influence and give rise to animal behaviors. […] Finally, we demonstrate that unsupervised behavioral embedding of 3D joint-angle data (Figure 4) is robust against problematic artifacts present in embeddings of 2D pose data (Figure 3). In short, DeepFly3D delivers 3D pose estimates reliably, accurately, and with minimal manual intervention, while also providing a critical tool for automated behavioral data analysis.

DeepFly3D

The input to DeepFly3D is video data from seven cameras (Figure 1). These images are used to identify the 3D positions of 38 landmarks per animal: (i) five on each limb (the thorax-coxa, coxa-femur, femur-tibia, and tibia-tarsus joints, as well as the pretarsus), (ii) six on the abdomen (three on each side), and (iii) one on each antenna (useful for measuring head rotations). Our […]

• […] (Simon et al., 2017). We also use pictorial structure corrections to fine-tune the 2D pose deep network. Self-supervision constitutes 85% of our training data.

• Calibration without external targets: Estimating 3D pose from multiple images requires calibrating the cameras to a level of accuracy that is commensurate with the target size, a difficult challenge when measuring leg movements in an animal as small as Drosophila. Therefore, instead of using a typical external calibration grid, DeepFly3D uses the fly itself as a calibration target: it detects arbitrary points on the fly's body and relies on bundle adjustment (Chavdarova et al., 2018) to simultaneously assign 3D locations to these points and to estimate the position and orientation of each camera (Figure 7). To increase robustness, it enforces geometric constraints that apply to tethered flies with respect to limb segment lengths and ranges of motion.

Improving 2D pose using pictorial structures and active learning

We validated our approach using a challenging dataset of 2063 frames manually annotated using the DeepFly3D annotation tool (Figure 6). […]

[…] Unsupervised approaches make it possible to analyze behavioral epochs without user intervention and to automatically distinguish between otherwise similar actions. However, with this sensitivity may come a susceptibility to features unrelated to behavior. These may include changes in image size or perspective resulting from differences in camera angle across experimental systems, variable mounting of tethered animals, and inter-animal morphological variability. In theory, each of these issues can be overcome, providing scale and rotational invariance, by using 3D joint angles rather than 2D pose for unsupervised embedding.

To test this possibility, we performed unsupervised behavioral classification on video data taken during optogenetic stimulation experiments that repeatedly and reliably drove specific actions: backward walking upon activation of MDN>CsChrimson and antennal grooming upon activation of aDN>CsChrimson (Hampel et al., 2015). We also stimulated control animals lacking the UAS-CsChrimson transgene (MDN-GAL4/+ and aDN-GAL4/+) (Figure 4-video 3).

First, we performed unsupervised behavioral classification using 2D pose data from three adjacent cameras containing keypoints for three limbs on one side of the body. Using these data, we generated a behavioral map (Figure 3A). In this map, each individual cluster would ideally represent a single behavior (e.g., backward walking or grooming) and be populated by nearly equal amounts of data from each of the three cameras. This was not the case: data from each camera covered non-overlapping regions and clusters (Figure 3B-D). This effect was most pronounced when comparing regions populated by cameras 1 and 2 versus camera 3. Because the underlying behaviors were otherwise identical (data across cameras were from the same animals and experimental time points), we concluded that unsupervised behavioral classification of 2D pose data is highly sensitive to corruption by viewing angle differences.
Figure 3. Unsupervised behavioral classification of 2D pose data is sensitive to viewing angle. (A) Behavioral map derived using 2D pose data from three adjacent cameras (Cameras 1, 2, and 3) but the same animals and experimental time points. Shown are clusters (black outlines) with enriched (yellow) or sparsely (blue) populated data. Different clusters are enriched for data from either (B) camera 1, (C) camera 2, or (D) camera 3. Behavioral embeddings were derived using 1 million frames acquired during 4 s of optogenetic stimulation of MDN>CsChrimson (n=6 flies, n=29 trials), aDN>CsChrimson (n=6 flies, n=30 trials), and wild-type control animals (MDN-GAL4/+: n=4 flies, n=20 trials; aDN-GAL4/+: n=4 flies, n=23 trials).
By contrast, performing unsupervised behavioral classification using DeepFly3D-derived 3D joint angles resulted in a map (Figure 4) with a clear segregation and enrichment of clusters for different GAL4 driver lines and their associated behaviors (i.e., backward walking (Figure 4-video 4), grooming (Figure 4-video 5), and forward walking (Figure 4-video 6)). Thus, 3D pose overcomes serious issues arising from unsupervised embedding of 2D pose data, enabling more reliable and robust behavioral data analysis.
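The scale and rotational invariance that motivates using joint angles can be illustrated with a short sketch. This is our own illustrative code with hypothetical names, not part of the DeepFly3D codebase: a joint angle computed from three triangulated 3D landmarks is unchanged by rigid rotation, translation, or uniform scaling of the pose, unlike raw 2D keypoint coordinates.

import numpy as np

def joint_angle(p_proximal, p_joint, p_distal):
    # Angle (radians) at p_joint between the two adjoining limb segments.
    # It depends only on segment directions, so it is invariant to rigid
    # rotation, translation, and uniform scaling of the 3D pose.
    u = np.asarray(p_proximal, dtype=float) - np.asarray(p_joint, dtype=float)
    v = np.asarray(p_distal, dtype=float) - np.asarray(p_joint, dtype=float)
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

# Example: femur-tibia flexion from three joints of one leg (toy values).
coxa_femur = [0.10, 0.00, 0.00]
femur_tibia = [0.10, -0.40, 0.05]
tibia_tarsus = [0.25, -0.70, 0.10]
theta = joint_angle(coxa_femur, femur_tibia, tibia_tarsus)
print(f"femur-tibia angle: {np.degrees(theta):.1f} deg")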

With synchronized Drosophila video sequences from seven cameras in hand, the first task for DeepFly3D is to detect the 2D locations of 38 landmarks. These 2D locations of the same landmarks seen across multiple views are then triangulated to produce 3D pose estimates. This pipeline is depicted in Figure 5. First, we will describe our deep learning-based approach to detect landmarks in images. Then, we will explain the triangulation process that yields full 3D trajectories. Finally, we will describe how we identify and correct erroneous 2D detections automatically.

[…] annotation tool that we implemented in JavaScript using Google Firebase (Figure 6). The DeepFly3D annotation tool operates on a simple web server, easing the distribution of annotations across users and making these annotations much easier to inspect and control. We provide a GUI to inspect the raw annotated data and to visualize the network's 2D pose estimates (Figure 9).

[…] expressed in a coordinate system whose origin is at the optical center of the camera and whose z-axis is its optical axis, the 2D image projection $(u, v)$ of a 3D homogeneous point $(x, y, z, 1)$ can be written as

$$ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \mathbf{K} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad \mathbf{K} = \begin{bmatrix} f_x & 0 & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, $$

where the $3 \times 4$ matrix $\mathbf{K}$ is known as the intrinsic parameters matrix (scaling $f_x$ and $f_y$ in the $x$ and $y$ directions, and image coordinates $c_x$ and $c_y$ of the principal point) that characterizes the camera settings.

In practice, the 3D points are not expressed in a coordinate system tied to the camera, especially in our application where we use seven different cameras. Therefore, we use a world coordinate system that is common to all cameras. For each camera, we must therefore convert 3D coordinates expressed in this world coordinate system to camera coordinates. This requires rotating and translating the coordinates to account for the position of the camera's optical center and its orientation. When using homogeneous coordinates, this is accomplished by multiplying the coordinate vector by a $4 \times 4$ extrinsic parameters matrix

$$ \mathbf{M} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix}, $$

where $\mathbf{R}$ is a $3 \times 3$ rotation matrix and $\mathbf{t}$ a $3 \times 1$ translation vector.
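To make these definitions concrete, here is a minimal sketch in Python. It is our own illustrative code with hypothetical names, not the DeepFly3D implementation, and it ignores the lens distortion discussed next: it projects a 3D world point through the extrinsic and intrinsic transformations, and recovers a 3D point from multiple 2D detections with standard linear (DLT) triangulation.

import numpy as np

def project(K, R, t, X_world):
    # World -> camera coordinates (extrinsics), then perspective
    # projection through the intrinsics K (here a 3x3 matrix).
    X_cam = R @ X_world + t
    u, v, w = K @ X_cam
    return np.array([u / w, v / w])

def triangulate(points_2d, cam_matrices):
    # Linear (DLT) triangulation of one landmark from >= 2 views.
    # points_2d: list of (u, v) detections; cam_matrices: 3x4 P = K [R | t].
    A = []
    for (u, v), P in zip(points_2d, cam_matrices):
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    # Homogeneous least-squares solution: the right singular vector
    # associated with the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]
    return X[:3] / X[3]

# Toy example with two cameras observing the same landmark.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R1, t1 = np.eye(3), np.zeros(3)
R2 = np.array([[0.0, 0.0, 1.0],
               [0.0, 1.0, 0.0],
               [-1.0, 0.0, 0.0]])        # 90-degree rotation about y
t2 = np.array([0.0, 0.0, 2.0])
X_true = np.array([0.1, -0.2, 2.0])
p1 = project(K, R1, t1, X_true)
p2 = project(K, R2, t2, X_true)
P1 = K @ np.hstack([R1, t1[:, None]])
P2 = K @ np.hstack([R2, t2[:, None]])
print(triangulate([p1, p2], [P1, P2]))   # recovers ~[0.1, -0.2, 2.0]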
Camera distortion. The pinhole camera model described above is an idealized one. The projections of real cameras deviate from it, and these deviations, referred to as distortions, need to be accounted for. The most significant one is known as radial distortion because the error grows with the distance to the image center. For the cameras we use, radial distortion can be expressed as

$$ \mathbf{p}_d = \left(1 + k_1 r^2 + k_2 r^4\right) \mathbf{p}_{\mathrm{pinhole}}, $$

where $\mathbf{p}_{\mathrm{pinhole}}$ is the ideal pinhole projection, $r$ is its distance from the image center, and $k_1$, $k_2$ are radial distortion coefficients.

[…] to also change. In other words, we look for

$$ \hat{\mathbf{X}}_j, \hat{\mathbf{P}}^c = \operatorname*{argmin}_{\mathbf{X}_j,\, \mathbf{P}^c} \sum_{c} \sum_{j} \rho_\delta\!\left( \left\| \mathbf{x}_j^c - \pi_{\mathbf{P}^c}(\mathbf{X}_j) \right\| \right), \qquad (7) $$

where $\mathbf{X}_j$ and $\mathbf{x}_j^c$ are the 3D locations and 2D projections of the landmarks introduced above, $\pi_{\mathbf{P}^c}$ denotes projection by camera $c$, and $\rho_\delta$ denotes the Huber loss. Equation 7 is known as bundle adjustment (Hartley and Zisserman, 2000). The Huber loss is defined as

$$ \rho_\delta(a) = \begin{cases} \frac{1}{2} a^2 & \text{for } |a| \le \delta, \\ \delta \left( |a| - \frac{1}{2}\delta \right) & \text{otherwise.} \end{cases} $$

Replacing the squared loss by the Huber loss makes our approach more robust to erroneous detections $\mathbf{x}_j^c$. We empirically set $\delta$ to 20 pixels. Note that we perform this minimization with respect to ten degrees of freedom per camera: three translations, three rotations, and four distortions. For this optimization to work properly, we need to initialize these ten parameters, and we need to reduce the number of outliers. To achieve this, the initial distortion parameters are set to zero. We […]

[…] In the end, we aim to find the solution $\hat{L} = \operatorname*{argmax}_L P(L \mid I, X)$. This is known as Maximum a Posteriori (MAP) estimation. Using Bayes' rule, we write

$$ P(L \mid I, X) \propto P(I, X \mid L)\, P(L), $$

where the two terms can be computed separately. We compute $P(I, X \mid L)$ using the probability […] This procedure can also be understood as an MLE of the calibration parameters (Triggs et al., 2000). Therefore, the entire set of parameters for the formulation can be learned using MLE. The pictorial structure formulation can be further expanded using temporal information […]

[…]-dimensional local structure, while sacrificing larger-scale accuracy. We used the Kullback-Leibler (KL) divergence as the distance function in our t-SNE algorithm. KL divergence assesses the difference between the shapes of two distributions, justifying the normalization performed in the preceding step. By analyzing a multitude of plots generated with different perplexity values, we empirically found a perplexity of 35 to best suit the features of our posture-dynamics space.

From this generated discrete space, we created a continuous 2D distribution that we could then segment into behavioral clusters. We started by normalizing the 2D t-SNE projected space into a 1000 × 1000 matrix. Then, we applied a 2D Gaussian convolution with a kernel of size σ = 10 px. Finally, we segmented this space by inverting it and applying a watershed algorithm that separates adjacent basins, yielding a behavioral map.
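The embedding-and-segmentation pipeline described above can be prototyped with standard Python libraries. Below is a minimal sketch, not the DeepFly3D implementation: it assumes a precomputed feature matrix (one row of normalized posture-dynamics features per frame; random stand-in data here), uses a symmetrized KL divergence as the pairwise distance for t-SNE with perplexity 35, rasterizes the embedding into a 1000 × 1000 grid, smooths it with a σ = 10 px Gaussian, and applies watershed segmentation to the inverted density. Seeding the watershed from local density maxima is our own implementation choice.

import numpy as np
from sklearn.manifold import TSNE
from scipy.ndimage import gaussian_filter
from scipy.special import rel_entr
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def kl_distance(p, q, eps=1e-12):
    # Symmetrized KL divergence between two non-negative feature vectors,
    # each normalized to sum to one (hence the normalization step above).
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return (rel_entr(p + eps, q + eps) + rel_entr(q + eps, p + eps)).sum()

# Stand-in posture-dynamics features: one row per frame (assumed input).
rng = np.random.default_rng(0)
features = np.abs(rng.normal(size=(500, 40)))

# 2D t-SNE embedding with perplexity 35 and a KL-based distance.
embedded = TSNE(n_components=2, perplexity=35,
                metric=kl_distance, init='random').fit_transform(features)

# Rasterize the embedding into a 1000 x 1000 occupancy grid.
norm = (embedded - embedded.min(axis=0)) / np.ptp(embedded, axis=0)
idx = np.clip((norm * 999).astype(int), 0, 999)
grid = np.zeros((1000, 1000))
for i, j in idx:
    grid[j, i] += 1.0

# Smooth into a continuous density (2D Gaussian, sigma = 10 px).
density = gaussian_filter(grid, sigma=10)

# Watershed on the inverted density; local maxima serve as basin seeds.
peaks = peak_local_max(density, min_distance=20)
markers = np.zeros_like(density, dtype=int)
markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
labels = watershed(-density, markers)
print(f"{labels.max()} candidate behavioral clusters")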