Attention LSTM U-Net model for Drosophila melanogaster heart tube segmentation in optical coherence microscopy images

Optical coherence microscopy (OCM) imaging of the Drosophila melanogaster (fruit fly) heart tube has enabled the non-invasive characterization of fly heart physiology in vivo. OCM generates large volumes of data, making it necessary to automate image analysis. Deep-learning-based neural network models have been developed to improve the efficiency of fly heart image segmentation. However, image artifacts caused by sample motion or reflections reduce the accuracy of the analysis. To improve the precision and efficiency of image data analysis, we developed an Attention LSTM U-Net model (FlyNet3.0), which incorporates an attention learning mechanism to track the beating fly heart in OCM images. Compared to FlyNet2.0+, the new model improved the intersection over union (IOU) on images with reflection artifacts from 86% to 89% and on images with movement from 81% to 89%. We also extended the capabilities of OCM analysis through the introduction of an automated, in vivo heart wall thickness measurement method, which has been validated on a Drosophila model of cardiac hypertrophy. This work will enable the comprehensive, non-invasive characterization of fly heart physiology in a high-throughput manner.


Introduction
Drosophila melanogaster, the fruit fly, is widely used to study developmental processes and uncover the mechanisms of human diseases. The Drosophila genome is fully sequenced, and methods for manipulating its genes are well established. Drosophila has gene homologs of approximately 75% of the disease-causing genes in humans [1,2]. In cardiac research, the fruit fly has been used to study various genetic, diet- and age-induced models of cardiac diseases as well as for drug testing [3][4][5][6][7].
The first study of cardiology in Drosophila was reported over 30 years ago, showing that the gene tinman is required for its heart to develop [8]. Since then, many methods have been developed both to characterize Drosophila's cardiac activity and to automate that characterization. Fink et al. introduced a platform for semiautomatic heartbeat analysis, which analyzed the repeated motion in brightfield videos of the beating heart, providing parameters such as heart rate, arrhythmicity index, fractional shortening, and systolic and diastolic indices [9]. While this method is simple and provides extensive analysis, it is invasive and requires a high skill level to identify the heart tube in the adult fly body.
Drosophila's cardiac morphology has been analyzed using histology and micro computerized tomography (microCT) measurements. Histological methods produce 2D ex vivo tissue staining and can be used to characterize the heart wall thickness in disease models of cardiac hypertrophy [10][11][12]. However, the histological fixation procedure is complicated and time-consuming, and tissue may also become distorted during the procedure [13]. MicroCT has emerged as a high-resolution 3D imaging modality capable of heart wall measurements [14,15]. In microCT imaging of the Drosophila heart, X-ray images are acquired at incremental rotation angles to reconstruct detailed 3D structures with micron-level resolution. Unlike histology, this technique does not require tissue sectioning. However, the flies must be sacrificed and fixed during the procedure, which prevents it from capturing real-time changes.
To study Drosophila's cardiac structure and function in vivo, optical coherence microscopy (OCM) has been introduced as a label-free, high spatiotemporal resolution imaging platform. OCM is a high-resolution form of optical coherence tomography, which relies on light interference between a sample and reference arm to form images [16]. It is non-invasive, using endogenous contrast provided by light backscattering off tissue in the sample, and it can image up to 1 mm in depth. In Drosophila, the heart tube is located ∼200 µm under the cuticle surface, and the tissue does not greatly attenuate the imaging light beam [17]. OCM offers micron-scale resolution while providing frame rates of over 100 Hz, fast enough to capture heartbeat dynamics [2].
Previous OCM studies of cardiac function in Drosophila have examined both the morphological and functional effects of cardiac disease [18][19][20]. In early studies, this examination was performed manually by human experts and by using various image segmentation techniques, such as pacing pulses, measuring the dimensions of the heart chamber in each frame using custom-written MATLAB codes, and applying the magic wand algorithm [2,20,21,22]. To improve the analysis speed, Lee et al. developed an automated algorithm for Drosophila heartbeat counting based on longitudinal OCT images, but this only provided heart rate assessment and lacked additional analysis capabilities [23]. Overall, these methods do not provide rapid, comprehensive characterization of heart structure and function, motivating the development of automated segmentation methods through deep learning.
Deep neural networks (DNNs) continue to demonstrate success in computer vision tasks, and researchers are now deploying DNN architectures for segmenting biomedical images. Among these, the fully convolutional network stands out for biomedical image segmentation [24,25]. In 2018, Duan et al. introduced a fully convolutional U-Net model for identifying and marking the heart region of Drosophila in cross-sectional images [26]. This model, FlyNet1.0, achieved an accuracy of 86%. In 2020, Dong et al. advanced the model by incorporating convolutional long short-term memory (LSTM) to enable 3D segmentation [27]. FlyNet2.0 utilized both spatial and temporal information to improve segmentation performance, resulting in an accuracy of 92% [28]. Recently, Fishman et al. optimized the FlyNet2.0 model structure with better GPU utilization and provided an open-source dataset [29]. Although these models achieved impressive performance, some cases are still segmented unsatisfactorily, especially in the presence of imaging artifacts originating from normal movements in the larval stage and from light reflection.
Here, we implemented an attention learning model, FlyNet3.0, which integrates attention gates in the skip connections between each level of the LSTM U-Net model to improve segmentation in the presence of common artifacts. The attention gate was proposed by Bahdanau et al. in 2015 as an addition to skip connections, allowing models to dynamically adapt and learn to focus on the region of interest by optimizing attention weight parameters [37]. With the added attention gate in FlyNet3.0, model sensitivity and accuracy are improved, because the attention weight emphasizes the heart region while limiting the contribution of unrelated features. Additionally, we developed an automated in vivo heart wall thickness measurement algorithm based on the FlyNet3.0 segmentation to expand the current characterization of the heart using OCM. Altogether, we aimed to provide an image processing platform that reduces the need for manual human segmentation, allowing for accurate, rapid segmentation regardless of artifacts to facilitate heart morphology and functional characteristic assessments.

Dataset acquisition
The dataset used in this study was obtained and developed from the publication by Fishman et al. [29]. In brief, specimens were mounted on a glass slide, where larval and pupal specimens were mounted with double-sided tape and the wings of adult flies were attached to the slide with rubber cement. The custom-built spectral domain OCM imaging system had a central wavelength of 850 nm and a bandwidth of ∼175 nm. The frame rate was 125 Hz to capture the heartbeat dynamics. The lateral resolution with a 10× lens was ∼2.3 microns and the axial resolution was ∼3.3 microns in air. The A7 segment of the heart in larval and early pupal stages was imaged, which is the most posterior segment of the heart chamber. The A1 segment of the heart in adult flies was imaged, which is the most anterior segment of the heart chamber [2].

Dataset preparation
To train a generalized Drosophila heart segmentation model on OCM images, we prepared a dataset of OCM videos built off the over 600,000 frames of the open-source dataset published by Fishman et al. [29]. Here, a variety of beating patterns were captured, including recordings of all developmental stages and of optogenetic experiments involving the excitatory opsins ReaChR and ChR2 and the inhibitory opsin NpHR2.0 [21,31,32]. An additional set of images with different degrees of image artifacts (reflection, movement) was added to this dataset, and all image sets were manually assigned to three categories: reflection, movement, and normal. Images with reflection artifacts were categorized by the extent to which the artifact passes through or blocks part of the heart region. In the 'movement' category, the heart region moves occasionally during the imaging procedure, which is predominant in the larval stages by their nature. Normal images present a clear and well-defined heart region throughout the whole video.
Overall, we gathered 710,000 frames of OCM images, which included 198,000 reflection frames, 48,000 movement frames, and 464,000 normal frames, as shown in Table 1. To train the neural network model, we randomly split the dataset into three sets: 60% for training, 5% for validation, and 35% for testing. To predict a mask, the full-size 128 × 701 pixel original OCM images were resized to 128 × 128 pixels using a neural network that produced a preliminary segmentation box by identifying the general heart region of the video. The resized and rescaled video was then fed into FlyNet3.0 for heart segmentation.
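The 60% / 5% / 35% random split described above can be illustrated with a minimal sketch. This is not the authors' code: the function name split_indices, the fixed seed, and the choice to split a flat list of item indices (rather than, for example, whole recordings) are assumptions made only for illustration.

```python
import numpy as np

def split_indices(n_items, fractions=(0.60, 0.05, 0.35), seed=0):
    """Randomly partition item indices into train/validation/test sets.

    `fractions` follows the 60/5/35 split stated in the text; the seed and
    the unit being split (frames vs. recordings) are assumptions.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)          # shuffled indices 0..n_items-1
    n_train = int(round(fractions[0] * n_items))
    n_val = int(round(fractions[1] * n_items))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]            # remainder goes to testing
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Assigning the remainder to the test set guarantees that every item is used exactly once even when the fractions do not divide the item count evenly.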
To demonstrate the heart wall thickness measurements, we utilized the OCM dataset of a Drosophila model of cardiac hypertrophy published by Migunova et al. in 2023 [10]. As reported in that study, cardiomyocyte-specific loss of RNase Z leads to heart wall overgrowth and heart dysfunction. We acquired the standard OCM videos of Drosophila lines with CRISPR-mediated heart-specific knockout of RNase Z, tinC-Cas9 > RNZ KO, and the appropriate control, tinC-Cas9 > gZ +. The tinC-Cas9 > RNZ KO larvae were reported to have a profound heart wall hypertrophy compared to the controls [10]. We included n = 16 tinC-Cas9 > gZ + larvae and n = 34 tinC-Cas9 > RNZ KO larvae for in vivo heart wall thickness analysis.

General network structure
Figure 1 presents a complete overview of the proposed heart region segmentation network. Based on the FlyNet2.0+ model [29], which consists of four convolution layers and three skip connections that bridge the corresponding encoder and decoder blocks, we constructed the FlyNet3.0 model by adding an attention gate at each skip connection, as shown in red in Fig. 1(A). Spatial and temporal information is captured by the LSTM 2D convolution layers. Passing through the skip connection, this information remains and contributes to the final prediction. Figure 1(B) shows the newly added attention gates taking this skip connection signal as the horizontal dashed purple inputs and the gating signal from the previous layer as the vertical dashed pink inputs. The attention model adaptively adjusts and automatically learns to focus on the target structure, the heart region. To locate the heart area accurately, the model places more weight on salient features like the heart area or the boundary of the heart, and less weight on irrelevant background regions in the input images. The algorithm is implemented in Python and the code is publicly available on GitHub [33].

Model training and evaluation
To minimize the loss function, we trained FlyNet3.0 using the Adam optimizer with a learning rate of 0.0001. The loss function was chosen as the Log-Cosh Dice Loss function, which was developed from the Dice Coefficient. The Dice Coefficient follows:
$$\mathrm{Dice} = \frac{2\,|\text{predict area} \cap \text{ground area}|}{|\text{predict area}| + |\text{ground area}|}$$
where predict area is the area of the predicted mask from the attention model and ground area is the area of the ground-truth mask. Using the Dice Coefficient optimizes the segmented area to be closer to the ground truth [34]. However, because of its non-convex nature, the Dice Coefficient can potentially fall short of attaining the optimal outcomes. To solve this, researchers developed the Log-Cosh approach based on the Lovász-Softmax loss ($L_{lc\text{-}dce}$), as follows [35]:
$$L_{lc\text{-}dce} = \log\left(\cosh\left(\mathrm{DiceLoss}\right)\right), \qquad \mathrm{DiceLoss} = 1 - \mathrm{Dice}$$
To evaluate whether the attention model can precisely generate a 3D mask of the heart from the resized image, the generated mask was compared with the ground-truth mask using intersection over union (IOU), which is a value calculated as the intersection region of the predicted mask and the ground-truth mask over the union region of the two masks, as shown below:
$$\mathrm{IOU} = \frac{|\text{predict area} \cap \text{ground area}|}{|\text{predict area} \cup \text{ground area}|}$$
The model was trained for 80 epochs, with 193 steps per iteration and a batch size of 32. By monitoring the learning performance throughout all the epochs, the best model was determined from the epoch with the lowest loss value and highest IOU over the full validation dataset, when the loss function reached stability. Model training and prediction were implemented using TensorFlow 2.10.1 and Python 3.7.4 on a workstation equipped with an NVIDIA GeForce RTX 3090 card.
In the OCM cross-sectional view, we assumed that the masks that cover the Drosophila heart are contained and continuous in each frame. To improve the segmentation accuracy, we applied post-processing steps on the predicted mask by keeping the largest connected component and smoothing the boundary. Results for both FlyNet2.0+ and FlyNet3.0 are shown with post-processed images, where the largest connected component of each predicted mask was kept.
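The Dice Coefficient, Log-Cosh Dice loss, and IOU described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the published implementation: the function names, the small smoothing constant eps, and the form log(cosh(1 − Dice)) for the Log-Cosh Dice loss are assumptions made for this example.

```python
import numpy as np

def dice_coefficient(pred, truth, eps=1e-7):
    """Dice = 2|pred ∩ truth| / (|pred| + |truth|) for masks with values in [0, 1].

    eps is an assumed smoothing term that avoids division by zero.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    intersection = np.sum(pred * truth)
    return (2.0 * intersection + eps) / (np.sum(pred) + np.sum(truth) + eps)

def log_cosh_dice_loss(pred, truth):
    """Log-Cosh Dice loss, assumed here to be log(cosh(1 - Dice))."""
    dice_loss = 1.0 - dice_coefficient(pred, truth)
    return np.log(np.cosh(dice_loss))

def iou(pred, truth):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return np.logical_and(pred, truth).sum() / union

# Toy example: a 2x2 ground-truth block and a 2x3 prediction that contains it.
truth_mask = np.zeros((4, 4), dtype=int)
truth_mask[1:3, 1:3] = 1            # 4 pixels
pred_mask = np.zeros((4, 4), dtype=int)
pred_mask[1:3, 1:4] = 1             # 6 pixels, 4 of them overlapping

print(dice_coefficient(pred_mask, truth_mask))   # 2*4 / (6+4)
print(iou(pred_mask, truth_mask))                # 4 / 6
print(log_cosh_dice_loss(pred_mask, truth_mask))
```

In this toy case the Dice Coefficient is higher than the IOU for the same pair of masks, which is why the two values should not be compared directly across papers.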

Fig. 1. (A) Overview of the proposed heart region segmentation network. The last layer is a 3D convolutional layer with a kernel size of 1 × 1 × 1, followed by a sigmoid activation function. (B) Detailed structure of the attention gate. The input g is from the gating signal, x is from the skip connection, and the output s is copied to the input of the corresponding decoder.

Attention gate structure
Figure 1(B) shows the structure of the attention gate, which computes the attention weight for an input feature, x, to improve the identification of the significant image region and to determine the focal area [30]. The attention gate takes two inputs: x from the skip connection, and the gating signal, g, from upsampling of the previous block. The attention weight is represented as a single scalar weight calculated for each pixel vector at each frame. The output of the attention gate is the element-wise multiplication between the input feature map and the calculated attention weight, which ranges from 0 to 1. The first step in calculating the attention weight is to use convolutional layers and batch normalization layers to capture the characteristics of the gating signal and skip connection inputs. Both inputs' convolutional layers have a kernel size of 2 × 2, and the batch normalization layers' kernel sizes are 1 × 1. Next, the result from the gating signal is added to the input from the skip connection to help determine the focal region of the attention weight. Then the ReLU activation function is applied to the pixel-wise summation to introduce non-linearity to the model [35]. After performing a linear transformation by passing the signals through another 1 × 1 convolutional layer and batch normalization layer, we obtain the initial attention weight coefficient of the input as follows:
$$q_{att} = \Phi_q\left(\sigma_1\left(\Phi_x(x) + \Phi_g(g)\right)\right)$$
where $\Phi_x$ and $\Phi_g$ are the convolutional and batch normalization (BN) layers for inputs x and g respectively, $\sigma_1$ is the ReLU activation function, and $\Phi_q$ represents the convolutional and BN layers for the initial model after passing through the first activation function.
To apply the initial attention weight coefficient to the input x, the coefficient needs to be rescaled to between 0 and 1 by applying an activation function as follows:
$$\alpha = \sigma_2\left(q_{att}\right)$$
where $\sigma_2$ is the sigmoid activation function. The input features, x, will then be scaled according to the attention weight, α, as follows:
$$s = \alpha \odot x$$
The result, s, will be used as the input to feed to the next block.
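The attention gate computation described above can be illustrated with a simplified NumPy sketch. It is not the FlyNet3.0 implementation: each convolution is replaced here by a per-pixel linear map (equivalent to a 1 × 1 convolution), batch normalization is omitted, g is assumed to be already upsampled to the spatial size of x, and the names attention_gate, w_x, w_g and w_q are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, w_x, w_g, w_q):
    """Simplified attention gate for a single frame.

    x   : (H, W, Cx) skip-connection features
    g   : (H, W, Cg) gating signal, assumed already upsampled to (H, W)
    w_x : (Cx, Ci), w_g : (Cg, Ci), w_q : (Ci, 1) per-pixel linear maps that
          stand in for the convolution + BN layers (an assumption of this sketch)
    """
    phi_x = x @ w_x                      # per-pixel projection of x -> (H, W, Ci)
    phi_g = g @ w_g                      # per-pixel projection of g -> (H, W, Ci)
    relu_sum = np.maximum(phi_x + phi_g, 0.0)   # sigma_1: ReLU on the sum
    q_att = relu_sum @ w_q               # initial attention coefficient, (H, W, 1)
    alpha = sigmoid(q_att)               # sigma_2: rescale to the range 0..1
    s = x * alpha                        # one scalar weight per pixel, broadcast over channels
    return s, alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
g = rng.normal(size=(8, 8, 6))
w_x = rng.normal(size=(4, 3))
w_g = rng.normal(size=(6, 3))
w_q = rng.normal(size=(3, 1))

s, alpha = attention_gate(x, g, w_x, w_g, w_q)
print(s.shape, alpha.shape)
```

Because α holds one value per pixel, the gate rescales all channels of x at that pixel by the same factor and leaves the spatial layout of the features unchanged.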
To eliminate noisy and unrelated responses from the skip connections, gating is determined based on pertinent information extracted from the previous prediction layer. Furthermore, the attention gate performs merge operations only before filtering neuronal activations and element-wise multiplication, ensuring that all features extracted are retained. Moreover, attention gates are linearly transformed without introducing any spatial modifications, making them a suitable complement to the LSTM model we previously employed.

Statistical analysis
Statistical analysis was performed using a two-sided Student's t-test. Results were deemed significant when p < 0.05.

Attention weight visualization
To better understand the training process in various datasets, we visually represented the attention weight vector obtained from different training epochs, as shown in Fig. 2. The input to the model is a grayscale resized image, and the output is a binary image of the heart region. Figure 2(A) demonstrates a ground truth mask, shown in red, combined with its corresponding grayscale resized image on the left. The rest of the images show the attention weight training process with respect to different epochs. In the beginning, the attention weight has a uniform distribution over the input image and passes features at all locations, as indicated by the broad distribution of the heatmap generated at the 3rd epoch. The weights are then gradually learned and become more focused around the heart region or the heart boundary, depending on the area of the heart. If the heart's area is relatively large along all the frames, as shown in Fig. 2(A), the attention weight will focus on the heart region little by little. Here, the model reaches an optimal focus on the heart region at around epoch 47. On the other hand, if the heart region contains obvious variation between systolic and diastolic phases, as shown in Fig. 2(B) for a systolic phase and Fig. 2(C) for a diastolic phase, the attention weight will also focus on the surrounding boundary created by tissues around the heart region in the larval body in order to locate the heart position. Further training of the model does not refine the model weights to make them more focused on the heart region. When comparing epochs later in training, such as epochs 47 and 63 in Fig. 2, the attention weight has a higher intensity around the heart region at epoch 47. At epoch 63, the model becomes overtrained, producing an attention weight with a larger spread throughout the frame that is less focused around the heart region. When the evaluation criteria reach stability, we consider the training model as reaching the best case. The attention weights at coarser scales provide a rough outline of the heart, which is gradually refined at finer resolution, as shown in the overlay of the resized image and optimized mask in Fig. 2(D). Recordings of the weights at different epochs during heart beating can be found in Visualization 1 and Visualization 2.

Test results
We tested our model on a variety of datasets, including normal examples, images with reflection artifacts (reflection) through the heart region, and images in which the heart moves horizontally or vertically during the frame sequence (movement). Table 2 displays the average IOU results for each category and the overall performances for FlyNet2.0+ and FlyNet3.0, both after post-processing. FlyNet3.0 improves performance in all cases, particularly for the 'movement' group. Figure 3(A) shows an example mask of the heart ground truth from a dataset with no artifacts present, with corresponding prediction results from the FlyNet2.0+ and FlyNet3.0 models and an overlapping comparison image. All the ground truth masks are in red, model prediction masks are in green, and the overlapping area between the ground truth and prediction result is in yellow. Figure 3(B) shows the M-mode ground truth mask overlaid on the M-mode OCT image along with the overlaid comparison between ground truth, FlyNet2.0+, and FlyNet3.0 predicted masks. In both Fig. 3(A)-(B), overlaying the ground truth and prediction masks resulted in mostly yellow, indicating an accurate prediction for both the FlyNet3.0 and the FlyNet2.0+ models. The accurate result for both models is further confirmed by the IOU traces in Fig. 3(C) for FlyNet2.0+ and FlyNet3.0, where IOU remains close to 100%, with some periodic oscillation due to the smaller area in systole. Comparing IOU overall for samples in the 'normal' group, FlyNet3.0 improves segmentation accuracy by 2% over FlyNet2.0+, as shown in Fig. 3(D). 3D cross-sectional videos and M-mode images displayed in Fig. 3 can be found in Visualization 3.
Figure 4 shows an example from the 'reflection' group, where Fig. 4(A) shows predicted masks compared to the ground truth mask. Here, the blue arrow indicates an example of a reflection artifact, which appears as vertical and horizontal white lines in the cross-section. From the overlap image produced by FlyNet2.0+, there exists a red area next to the smaller yellow area, indicating an inaccurate prediction of the mask area due to the presence of an artifact. In the FlyNet3.0 prediction image, the predicted mask covers the entire heart region, and the yellow area from the overlapping image accurately represents the precise prediction result of the FlyNet3.0 model. Figure 4(B) shows the M-mode image for the entire dataset, where the presence of the artifact reduces the size of the FlyNet2.0+ predicted mask, especially in the systolic phases. The FlyNet3.0 overlap M-mode image shows mainly a yellow area, indicating high fidelity between the ground truth and predicted masks. Figure 4(C) further confirms this result by showing the IOU trace for this example dataset, where the FlyNet2.0+ IOU drops down to 0 in some systolic phases due to the reflection artifact causing the model to not recognize any of the heart area, and the IOU performance for FlyNet3.0 remains close to 100%. Figure 4(D) compares the performance of the two models for the entire 'reflection' group, where FlyNet3.0 improves IOU over FlyNet2.0+ by 3%. 3D cross-sectional videos and M-mode images in Fig. 4 can be found in Visualization 4.
Figure 5 shows an example from the 'movement' group, where the cross-section in Fig. 5(A) comes from a frame soon after movement occurs. Here, due to movement, the heart region changes position within the frame. When resizing an image to 128 × 128 pixels for prediction, the resized image must contain the whole heart region for the duration of the recording. In datasets with movement, this results in a resized image that does not focus as tightly on the heart region, as the region the heart occupies during the recording is larger. FlyNet2.0+ performs with lower accuracy on this smaller, less centered heart region. In the overlap image in Fig. 5(A), there is an additional region, indicated by the green area, predicted by the FlyNet2.0+ model that cannot be removed by post-processing because it is connected to the true heart region. This area is less likely to be recognized by the FlyNet3.0 model. In the M-mode overlap images shown in Fig. 5(B), FlyNet2.0+ continues the pattern of recognizing additional areas that are not part of the heart region, indicated by the green areas. The overlapping mask M-mode for FlyNet3.0 shows a mainly yellow area, with a few areas of green, indicating a more accurate prediction. This result is shown graphically in Fig. 5(C), where the IOU plot for FlyNet2.0+ has more discontinuities than the plot for FlyNet3.0. Overall results for the movement group, shown in Fig. 5(D), indicate that FlyNet3.0 significantly improves performance over FlyNet2.0+ by 4%. 3D cross-sectional videos and M-mode images from Fig. 5 can be found in Visualization 5.

In vivo heart wall thickness measurements
In addition to constructing a robust model that accurately characterizes the heart while mitigating the influence of artifacts, we aim to create a more comprehensive package for analyzing both morphological and dynamic features. Our previous FlyNet models have enabled the characterization of heart rate, fractional shortening, and heart areas. Here in our updated model, we propose an additional tool for measuring the heart wall thickness in vivo.
To measure the thickness of the heart wall, we begin with an accurately predicted mask. Our model takes in a 128 × 128 pixel resized image of the heart region, so we first resize the predicted mask to match the size of the original OCM image. We then rescale the mask according to the pixel-to-micron ratio found using a USAF target with our scan settings such that each pixel represents one micron. Examples of the scaled combined images in the systolic and diastolic phases are displayed in Fig. 6(A) and Fig. 6(B), respectively. The greyscale images are from the rescaled OCM image, and the red region is the corresponding predicted mask. Subsequently, we determine the position of the heart wall based on the boundary of the predicted mask. The pink curves outlined in Fig. 6(C) and Fig. 6(D) represent the position of the heart boundary, which is then used to measure the pixel intensity along the heart wall. For each pixel along the smoothed heart boundary, a line is drawn perpendicular to the heart wall to obtain the OCM intensity across the heart wall. As shown in Fig. 6(C)-(D), we use the neighboring points along the heart boundary, shown in red, to determine the slope of the green point of interest. We chose to exclude points on the heart wall if the borders are connected to other tissues in the body, making it hard to accurately differentiate the heart wall computationally. Points are also excluded if the images are affected by artifacts that may reduce the image quality. After obtaining the intensity measurement along the perpendicular direction for all the selected points along the entire heart wall of one frame, we align all the measurement profiles and then average them. The averaging of the distribution of all points in a single frame produces a trace that resembles a Gaussian distribution, where the full width at half-maximum (FWHM) is measured and used to determine the thickness. Therefore, we use the center position of the FWHM measurement as the reference to align all the intensity measurement profiles in a single frame. Figure 6(E)-(F) show examples of averaged intensity measurement at the systolic and diastolic stages shown in Fig. 6(C)-(D). These steps are repeated for all frames, and the averaged FWHM values are recorded separately. End systolic and end diastolic measurement points are identified using peaks and valleys in the heart area over time. In this example shown in Fig. 6(A)-(B), the resulting average heart wall thickness during systole is ∼13.0 µm, and ∼10.5 µm during diastole, which is ∼2.4 µm in difference on average. Thus, the heart wall is thicker during systole than diastole in general.
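The FWHM step of the thickness measurement can be sketched as follows. This is a minimal illustration, not the published algorithm: the function name fwhm, the use of the profile minimum as the baseline, and the linear interpolation of the half-maximum crossings are assumptions. The sampling interval of one micron per pixel follows the rescaling described above.

```python
import numpy as np

def fwhm(profile, spacing_um=1.0):
    """Full width at half-maximum of a single-peaked 1D intensity profile.

    The baseline is taken as the profile minimum (an assumption), and the two
    half-maximum crossings are located by linear interpolation between samples.
    """
    profile = np.asarray(profile, dtype=float)
    baseline = profile.min()
    peak = int(np.argmax(profile))
    half = baseline + 0.5 * (profile[peak] - baseline)

    # Walk outward from the peak while the samples stay at or above half-maximum.
    left = peak
    while left > 0 and profile[left - 1] >= half:
        left -= 1
    right = peak
    while right < len(profile) - 1 and profile[right + 1] >= half:
        right += 1

    def crossing(i_in, i_out):
        # Interpolate between the last sample above half (i_in)
        # and the first sample below half (i_out).
        y_in, y_out = profile[i_in], profile[i_out]
        return i_in + (half - y_in) / (y_out - y_in) * (i_out - i_in)

    left_pos = crossing(left, left - 1) if left > 0 else float(left)
    right_pos = crossing(right, right + 1) if right < len(profile) - 1 else float(right)
    return (right_pos - left_pos) * spacing_um

# Synthetic averaged profile: a Gaussian with sigma = 5 um sampled every 1 um.
positions = np.arange(101)
sigma = 5.0
averaged_profile = np.exp(-((positions - 50.0) ** 2) / (2.0 * sigma ** 2))
thickness_um = fwhm(averaged_profile)
print(thickness_um)
```

For an ideal Gaussian the FWHM equals 2·sqrt(2 ln 2)·σ, or about 2.355 σ, so the synthetic profile offers a quick sanity check of the interpolation before the function is applied to measured profiles.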
Using this method, we characterized the heart wall thickness of tinC-Cas9 > gZ + and tinC-Cas9 > RNZ KO larvae from our previous study, where the heart wall thickness measurements were performed by measuring the transverse histological sections [10]. We averaged the measurements at the end diastolic phase and the end systolic phase separately for the two larvae genotypes, as Fig. 6(G) shows. For both end diastolic and end systolic phases, tinC-Cas9 > RNZ KO larvae exhibited a significant increase in cardiac wall thickness compared to those of tinC-Cas9 > gZ + (p < 0.05). When this dataset was analyzed histologically, the average heart wall thickness for tinC-Cas9 > gZ + larvae was ∼4.3 µm and for tinC-Cas9 > RNZ KO larvae was ∼7.2 µm. Though our measurements are larger, they are still consistent with the findings of the original publication, where histologic treatment of tissue may cause shrinkage, explaining the discrepancy in measurements [10].

Discussion
Our FlyNet3.0 segmentation method leverages the attention mechanism to capture and concentrate on characteristics of the Drosophila heart region in cross-sectional OCM videos, minimizing the need for additional supervision. In contrast to FlyNet2.0+, FlyNet3.0 seamlessly integrates attention gates into the skip connections at every level of the LSTM U-Net model. The attention mechanism enables the autonomous learning of attention gates to focus on target structures without the need for additional supervision. Attention gates effectively enhance the sensitivity and accuracy of dense label predictions by allowing the model to focus on the heart during training without imposing significant computational overhead. This improvement is achieved by suppressing feature activations in irrelevant regions, so only the important features of the heart region are highlighted.
FlyNet3.0 provides two key improvements. First, the incorporation of attention gates raises the model's prediction accuracy without complicated implementation or computation [36]. Second, the attention gates penalize irrelevant features, thereby accentuating the important characteristics of the target region. This property is particularly critical for images containing artifacts, where the heart region demands special attention for accurate localization. Models trained with attention blocks learn to progressively focus on the Drosophila heart region, resulting in enhanced segmentation accuracy for fly heart OCM videos.
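As an illustration of the gating described above, the following is a minimal sketch of one common additive attention gate, applied to single-channel 2D maps. It is not the FlyNet3.0 implementation: the scalar weights `w_x`, `w_g`, and `psi` and the biases `b_int` and `b_psi` are hypothetical stand-ins for learned 1 × 1 convolutions, and the gating signal `g` is assumed to be already resampled to the size of the skip-connection features `x`. The names `g`, `x`, and `s` follow Fig. 1B.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def attention_gate(x, g, w_x, w_g, psi, b_int=0.0, b_psi=0.0):
    """Additive attention gate on single-channel 2D maps (nested lists).

    x     : skip-connection features
    g     : gating signal, assumed to have the same shape as x
    w_x, w_g, psi, b_int, b_psi : hypothetical scalars standing in for
            learned 1x1 convolution weights and biases
    Returns (s, alpha): the gated features and the attention coefficients.
    """
    alpha = [
        [
            # ReLU on the summed projections, then a sigmoid to get a weight in (0, 1)
            sigmoid(psi * max(0.0, w_x * xv + w_g * gv + b_int) + b_psi)
            for xv, gv in zip(x_row, g_row)
        ]
        for x_row, g_row in zip(x, g)
    ]
    # Element-wise scaling suppresses features where the attention weight is low
    s = [
        [a * xv for a, xv in zip(a_row, x_row)]
        for a_row, x_row in zip(alpha, x)
    ]
    return s, alpha


# Two pixels with equal skip features; the gating signal supports only the first.
x = [[1.0, 1.0]]
g = [[2.0, -2.0]]
s, alpha = attention_gate(x, g, w_x=1.0, w_g=1.0, psi=4.0, b_psi=-4.0)
print(alpha)  # first weight is close to 1, second is close to 0
print(s)      # the second feature is strongly suppressed
```

In this toy example the two pixels carry identical skip-connection features, so the difference in the output comes entirely from the gating signal, which is the behavior the attention gate is meant to provide.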
Using its high segmentation accuracy, FlyNet3.0 also adds the ability to measure the heart wall thickness dynamically in vivo. Although our model (Fig. 6) showed the same measurement trend as in the source publication, histological measurements of heart wall thickness between tinC-Cas9 > RNZ KO and tinC-Cas9 > gZ + larvae differed by ∼3 µm, whereas for OCM measurements the difference was ∼1 µm. Additionally, the heart wall thickness measured by OCM was greater on average than the histological measurement. This difference originates from the degree of sample or tissue processing required for each method. For a histological measurement, tissues must be fixed, paraffin embedded, thinly sliced, and stained, which may lead to tissue distortion. The OCM method of heart wall thickness measurement is dynamic and can show how the heart wall thickness changes at each point in the cardiac cycle, providing a comparison between systolic and diastolic heart wall thicknesses. In systolic measurements, the heart wall thickness is greater, as the tissue is contracted. In diastolic measurements, the heart wall thickness is smaller, as the tissue is expanded.
Although FlyNet3.0 has demonstrated improvements compared to FlyNet2.0+, limitations remain. First, for a dataset to be used, the heart region must be identifiable by human experts so that the quality of the segmentation can be verified. Datasets where the heart moves out of frame, most common in the larval stage, or datasets where reflections completely cover the heart region cannot be used. Repeated collections are recommended to reduce the likelihood of a dataset containing movement. Adding the attention gate module also slightly increases the computational cost, as it introduces the attention weight parameters to the model and therefore requires more GPU resources.
The heart wall thickness measurement is limited to cases where the heart wall can be differentiated from surrounding tissues. Frames are excluded if the thickness is outside of a stable range, defined as within one standard deviation of the mean, which helps exclude measurements where the heart is in contact with other tissue and a valid measurement cannot be performed. Additionally, image rescaling from pixels to physical distance in microns introduces a pixelation effect, because all pixelwise measurements must be rounded to an integer number of pixels.
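The frame exclusion rule and the pixel-to-micron conversion described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in this work: the function name `stable_thickness_um`, the use of the population standard deviation, and the pixel size of 1.5 µm in the example are all hypothetical.

```python
import statistics


def stable_thickness_um(thickness_px, um_per_px):
    """Keep per-frame thicknesses within one standard deviation of the mean
    and convert the retained values from pixels to microns.

    thickness_px : per-frame wall thickness in whole pixels
    um_per_px    : pixel size in microns (hypothetical value in the example)
    """
    mean = statistics.mean(thickness_px)
    # Population standard deviation is an assumption; the text does not specify
    sd = statistics.pstdev(thickness_px)
    kept = [t for t in thickness_px if abs(t - mean) <= sd]
    return [t * um_per_px for t in kept]


# The frame measuring 12 px mimics a frame where the heart touches other tissue.
frames_px = [4, 5, 5, 6, 5, 12]
print(stable_thickness_um(frames_px, um_per_px=1.5))
```

Because the inputs are whole pixels, every retained value is a multiple of the pixel size, which is the pixelation effect noted above.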
In future work, to further boost the segmentation accuracy, we plan to add more training datasets of fly OCM videos with more accurate masks and a wider range of image qualities, including artifacts at different developmental stages. In addition, other model structures, such as the transformer U-Net, may provide even faster image processing and utilize long-range dependencies among pixels in the input image, potentially advancing the overall performance of the segmentation model [37][38][39].

Conclusion
We have developed an Attention LSTM U-Net model, FlyNet3.0, designed to segment the fruit fly heart cross-section region more accurately in OCM videos. Evolved from the earlier FlyNet versions, FlyNet3.0 incorporates an attention learning mechanism to focus on image features extracted from the heart region. The new model was tested and validated on a diverse Drosophila heart OCM video dataset, varying in image quality, developmental stage, and heart rhythmicity. FlyNet3.0 demonstrated significant improvements in segmentation as measured by IOU accuracy, particularly in scenarios involving artifacts. It achieved an average IOU accuracy of 89% for images with reflections and 89% for those depicting frequent heart movement. These refined segmentation results enable the accurate and efficient calculation of heart wall thickness. Overall, FlyNet3.0 reduces the need for manual correction of the heart region by producing more accurate masks. This, combined with the new in vivo heart wall thickness measurement, will make cardiac disease analysis in Drosophila more comprehensive, as demonstrated with the RNase Z model of cardiac hypertrophy.

Fig. 1. The FlyNet3.0 Model Architecture. A) There are four encoder and three decoder blocks, followed by a sigmoid layer. Each block contains two groups of 2D convolutional layers with a kernel size of 5 × 5, shown in blue, a batch normalization layer in yellow, and a ReLU activation function in yellow. The numbers shown below each block denote the number of filters for each group. The encoder blocks include a spatiotemporal encoder, in which the 2D convolutional layer is wrapped with an LSTM layer, followed by spatial encoding blocks, in which the 2D convolutional layer is encapsulated in a time-distributed wrapper. Similarly, all the decoders are spatial decoders, except for the last one, which is a spatiotemporal decoder. The gray solid arrow represents a 2D max-pooling layer with a pool size of 2 × 2. The green solid arrow represents a 2D transposed convolutional layer with a kernel size of 2 × 2 and a stride of 2. The purple dashed arrow indicates where a skip connection occurs. The pink dashed arrow stands for the gating signal. The gating signal and the skip connections are copied to the attention gate, shown in the red cylinder, and the output is concatenated to the next layer. The last layer is a 3D convolutional layer with a kernel size of 1 × 1 × 1, followed by a sigmoid activation function. B) Detailed structure of the attention gate. The input g is from the gating signal, x is from the skip connection, and the output s is copied to the input of the corresponding decoder.

Fig. 2. Attention weight visualization. A) Example of the attention weight learning process on adult fly images over a range of epochs. B) and C) Attention weight learning process on the same larval images over the same range of epochs of systolic and diastolic phases, respectively. D) Resized input image and the corresponding optimized attention weight. After overlapping these images and filtering out the low-intensity region, the ideal attention weight is focused on the heart area.

FlyNet3.0 model. Fig. 4B shows the M-mode image for the entire dataset, where the presence of the artifact reduces the size of the FlyNet2.0+ predicted mask, especially in the systolic phases. The FlyNet3.0 overlap M-mode image shows mainly a yellow area, indicating high fidelity between the ground truth and predicted masks. Fig. 4C further confirms this result by showing the IOU trace for this example dataset, where the FlyNet2.0+ IOU drops down to 0 in some systolic phases due to the reflection artifact causing the model to not recognize any of the heart area, and the IOU performance for FlyNet3.0 remains close to 100%. Fig. 4D compares the performance of the two models for the entire 'reflection' group, where FlyNet3.0 improves IOU over FlyNet2.0+ by 3%. 3D cross-sectional videos and M-mode images in Fig. 4 can be found in Visualization 4.
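For reference, a per-frame IOU trace of the kind discussed above can be computed as in the following minimal sketch. The function names `iou` and `iou_trace` are hypothetical, masks are assumed to be equally sized nested lists of 0/1 values, and returning 1.0 when both masks are empty is an assumed convention that the text does not specify.

```python
def iou(truth, pred):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            inter += 1 if (t and p) else 0
            union += 1 if (t or p) else 0
    # Assumed convention: two empty masks agree perfectly
    return inter / union if union else 1.0


def iou_trace(truth_frames, pred_frames):
    """IOU for each frame of a video, in frame order."""
    return [iou(t, p) for t, p in zip(truth_frames, pred_frames)]


truth = [[1, 1], [0, 0]]
half_pred = [[1, 0], [0, 0]]    # covers half of the ground truth
empty_pred = [[0, 0], [0, 0]]   # heart not recognized at all
print(iou_trace([truth, truth, truth], [truth, half_pred, empty_pred]))
# prints [1.0, 0.5, 0.0]
```

The last frame shows how a prediction that misses the heart entirely produces an IOU of 0, as in the FlyNet2.0+ trace described for some systolic phases.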

Fig. 3. FlyNet3.0 performance on a normal group example. A) Cross-sectional comparison of the ground truth mask (red) and prediction results (green) from FlyNet2.0+ and FlyNet3.0. The overlap between the two masks is shown in yellow. B) M-mode image comparison of ground truth masks (red) and prediction results (green) from FlyNet2.0+ and FlyNet3.0. The overlap between the two masks is shown in yellow. C) IOU comparison over time of FlyNet2.0+ and FlyNet3.0 for the example shown in A-B. D) Group IOU comparison between FlyNet2.0+ and FlyNet3.0, *** p < 0.001.

Fig. 4. FlyNet3.0 performance on a reflection group example. A) Cross-sectional comparison of the ground truth mask (red) and prediction results (green) from FlyNet2.0+ and FlyNet3.0. The overlap between the two masks is shown in yellow. Reflection artifacts are indicated in the region by the blue arrow. B) M-mode image comparison of ground truth masks (red) and prediction results (green) from FlyNet2.0+ and FlyNet3.0. The overlap between the two masks is shown in yellow. An example reflection artifact is indicated by the blue arrow. C) IOU comparison over time of FlyNet2.0+ and FlyNet3.0 for the example shown in A-B. D) Group IOU comparison between FlyNet2.0+ and FlyNet3.0, * p < 0.05.

Fig. 5. FlyNet3.0 performance on a movement group example. A) Cross-sectional comparison of the ground truth mask (red) and prediction results (green) from FlyNet2.0+ and FlyNet3.0 during a period of movement. The overlap between the two masks is shown in yellow. B) M-mode image comparison of ground truth masks (red) and prediction results (green) from FlyNet2.0+ and FlyNet3.0. The overlap between the two masks is shown in yellow. Areas with movement are indicated by the white arrow. C) IOU comparison over time of FlyNet2.0+ and FlyNet3.0 for the example shown in A-B. D) Group IOU comparison between FlyNet2.0+ and FlyNet3.0, * p < 0.05.

Fig. 6. Larval heart wall thickness calculation example and heart wall thickness quantification measurement. The left column is the systolic measurement data, and the right column is the diastolic. A) and B) Combined OCM and heart segmentation image; C) and D) Heart wall boundary and perpendicular trace example along the boundary to measure the heart wall thickness; E) and F) Aligned and averaged intensities for all the selected points along the boundary in C) and D), respectively, with the corresponding FWHM values. G) Quantification of heart wall thickness measurements between a cardiac hypertrophy model (tinC-Cas9 > RNZ KO) and its control (tinC-Cas9 > gZ +). Error bars indicate the mean ± s.e.m. * p < 0.05, *** p < 0.001, **** p < 0.0001.
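The FWHM values in panels E) and F) are read from an averaged intensity profile. A minimal sketch of such a calculation is given below; it is not the code used to produce the figure. The function name `fwhm`, the use of the profile minimum as the baseline, the linear interpolation at the half-maximum crossings, and the `spacing` parameter (sample spacing in physical units) are all assumptions made for illustration.

```python
def fwhm(profile, spacing=1.0):
    """Full width at half maximum of a single-peaked 1D intensity profile.

    profile : intensities sampled along a trace perpendicular to the wall
    spacing : distance between samples (hypothetical; e.g. microns per pixel)
    """
    baseline = min(profile)  # assumed baseline: the lowest sample
    peak_idx = max(range(len(profile)), key=profile.__getitem__)
    half = baseline + (profile[peak_idx] - baseline) / 2.0

    # Walk left from the peak to the last sample still above half maximum
    i = peak_idx
    while i > 0 and profile[i - 1] > half:
        i -= 1
    if i == 0:
        raise ValueError("profile does not fall to half maximum on the left")
    # Linear interpolation between samples i-1 and i
    left = (i - 1) + (half - profile[i - 1]) / (profile[i] - profile[i - 1])

    # Walk right from the peak in the same way
    j = peak_idx
    while j < len(profile) - 1 and profile[j + 1] > half:
        j += 1
    if j == len(profile) - 1:
        raise ValueError("profile does not fall to half maximum on the right")
    right = j + (profile[j] - half) / (profile[j] - profile[j + 1])

    return (right - left) * spacing


# Symmetric toy profile: half maximum is crossed at positions 1.5 and 4.5
print(fwhm([0, 1, 3, 4, 3, 1, 0]))  # prints 3.0
```

Interpolating between samples gives a sub-pixel width for the averaged profile, whereas a measurement taken directly on a single pixel grid would be limited to whole pixels.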
