Generation of Musculoskeletal Ultrasound Images with Diffusion Models

Abstract: Recent advances in deep learning have revolutionised computer-aided diagnosis in medical imaging. However, deep learning approaches require significant amounts of data to reach their full potential, and collecting such data can be challenging in some scientific fields. Musculoskeletal ultrasound imaging is one of them: data privacy and security concerns can severely limit the acquisition and distribution of patients' data. For this reason, different generative methods have been introduced to significantly reduce the required amount of real data by generating synthetic images that are almost indistinguishable from the real ones. In this study, diffusion models are used to generate realistic data from a small set of musculoskeletal ultrasound images of four different muscles. Afterwards, the similarity of the generated and real images is assessed with qualitative and quantitative metrics that correspond well with human judgement. In particular, the histograms of pixel intensities of the two sets of images demonstrate that the two distributions are statistically similar. Additionally, the well-established LPIPS, SSIM, FID, and PSNR metrics are used to quantify the similarity of the two sets of images, which achieve extremely high similarity scores in all these metrics. Subsequently, high-level features are extracted from the two types of images and visualized in a two-dimensional space to inspect their structure and identify patterns. In this representation, the two sets of images are hard to distinguish. Finally, we perform a series of experiments to assess the impact of the generated data on training a highly efficient Attention-UNet for the important clinical application of muscle thickness measurement. Our results show that the synthetic data play a significant role in the model's final performance and can improve deep learning systems in musculoskeletal ultrasound.


Introduction
Generative models have attracted much attention in the recent literature for their ability to generate realistic data [1][2][3]. This interest stems from the nature of deep learning models, whose final performance depends on the amount of available data and its inner characteristics. Generative models make it possible to enrich a dataset without the limitations that occur in real-world data collection campaigns. These limitations are particularly severe in the medical imaging domain, where data privacy and security can significantly constrain the acquisition and distribution of patient data. Therefore, obtaining realistic synthetic data can be crucial for improving the computer-aided diagnosis (CAD) systems of musculoskeletal ultrasound (MSK-US), which are heavily based on deep learning.
Generative adversarial networks (GANs) [4] have been proposed for the generation of realistic synthetic images. Briefly, GANs are a class of deep learning models that can generate new data similar to the data on which they were trained. GANs normally consist of two deep neural networks: a generator and a discriminator network. The generator network takes random noise as input and generates a new sample of data similar to the training data. The discriminator network takes the generated and real training data as input and attempts to classify which is which. The generator network is trained to improve its ability to fool the discriminator network, while the discriminator network is trained to distinguish between the generated and real data correctly.
Several applications of GANs have been studied in ultrasound imaging. In [5], different GAN architectures were investigated to generate realistic breast ultrasound (US) images. The generated images were then used to train convolutional neural networks (CNNs) to classify breast ultrasound images into three categories, and the results indicated that the generated images helped to outperform the baseline model. In [6], a GAN architecture was used to produce synthetic B-mode US images of bone data and their corresponding segmented bone surface masks in real time. Ref. [7] presents a pipeline for generating medical thyroid ultrasound images with an auto-encoding generative adversarial network as a data augmentation method for performance improvement. Similarly, in [8], the Pix2Pix GAN architecture [9] is employed for data augmentation in bone surface segmentation in ultrasound images. Finally, the authors of [10] presented SpeckleGAN, a generative adversarial network with a speckle layer that can be incorporated into a neural network to add realistic and domain-dependent speckle.
In [11], a pipeline for generating synthetic 2D echocardiography images is presented using the Cycle-GAN [12]. In [13], a pipeline is proposed that can synthesise realistic B-mode US images with customised texture editing features; the authors also enhance the structural details of the generated images by introducing auxiliary sketch guidance into a conditional GAN. Finally, a study similar to ours is [14], which used Cycle-GAN to generate realistic B-mode musculoskeletal ultrasound images of longitudinal scans of the gastrocnemius medialis muscle. The Cycle-GAN was fed with 100 images and a set of 100 synthetic segmented masks that featured two aponeuroses and a random number of fascicles. The model output was a set of synthetic ultrasound images and an automated segmentation of each real input image. As a second step, the authors used existing software to measure muscle thickness, fascicle length, and pennation angle from the real and the generated images. The downside of that study is that no deep learning model was trained on the synthetically generated images to detect muscle architecture, so the effect of the generated images on the performance of such a model was not evaluated.
A more contemporary deep learning method that has presented exceptional results in generating synthetic images in many different applications is the denoising diffusion probabilistic model (DDPM), or, more simply, the diffusion model [15][16][17][18]. The basic idea of diffusion models is to start with a random noise vector and then gradually transform it to produce a sample of synthetic data. This is done by applying a sequence of invertible transformations to the noise vector over a series of discrete time steps. The noise vector is updated in each time step by adding a random perturbation, which helps introduce stochasticity into the model. Once the diffusion process is complete, the resulting noise vector is transformed back into a sample of synthetic data using a decoder network. Finally, the decoder network is trained to map the noise vector back to the data space, utilising a loss function that encourages the generated data to be as similar as possible to the real data. Diffusion models have several advantages over other generative models, such as GANs. They are more stable during training and do not suffer from the mode collapse problem common with GANs. They can also generate high-quality images with fine details and realistic textures.
Diffusion models have been applied in various medical imaging applications [19][20][21]. In [22], the authors propose a transformer-based UNet architecture to model the interaction between noise and semantic features. Furthermore, in [23], a conditional latent DDPM for medical images is proposed and evaluated on different medical imaging datasets. In addition, in [24], a model which combines a synthetic diffusion-based label generator with a semantic image generator is presented and evaluated on brain magnetic resonance images. Another study worth mentioning is [25], in which the authors achieved image quality superior to the current state-of-the-art generative models in their synthetic data. They performed conditional and unconditional image synthesis and evaluated the quality of their synthetic data on different quantitative metrics.
In this study, the DDPMs are incorporated for the first time in musculoskeletal ultrasound imaging to generate realistic muscle images. We evaluate the similarity of the real and the generated images in different scenarios. Initially, qualitative and quantitative metrics that correspond well with human judgement are used to assess the proximity of the two data types. Later, Attention-UNet [26] is incorporated for the important clinical application of the muscle thickness measurement [27]. In particular, similar to [28], deep learning models are trained in various configurations to delineate the superficial and deep aponeuroses of the examined muscle. Afterwards, the muscle thickness is calculated by taking the average distance between the two aponeuroses at different muscle points.
This study aims to introduce, for the first time, the diffusion models in MSK-US imaging to generate high-quality synthetic images. Afterwards, these synthetic images would be used for training deep learning architectures in extracting muscle thickness in a novel MSK-US database. Therefore, the main contribution of this study is to present a complete methodology for reducing the amount of real data needed to be collected for achieving superior performance in the automation of clinical measurements relevant to the musculoskeletal system.

Database
As mentioned earlier, the DDPMs are trained with a small number of real images to model the inner characteristics of the data distribution. Furthermore, the main aim of this study is to generate high-quality synthetic MSK-US images. For this reason, a relatively new MSK-US database was used in this study. The database consists of ultrasound recordings of four superficial human muscles of 116 young and healthy volunteers (49 males and 67 females with a mean age of 25.33 ± 4.92 y). All the ultrasound recordings were acquired in the Rehabilitation Department of the University Hospital of Patras using a Logiq P9 system (GE Healthcare GmbH, Freiburg, Germany) and an ML6-15 linear array transducer operating at 10 MHz. The same examination protocol was followed for all participants. In particular:

• Ultrasound scans were conducted longitudinally on the tibialis anterior (T.A.) muscle, at one-quarter of the distance from the inferior pole of the patella to the malleolus lateralis.
• Ultrasound scans were conducted longitudinally on the rectus femoris (R.F.) muscle, halfway along the line from the anterior-superior iliac spine to the superior pole of the patella.
• Ultrasound scans were conducted longitudinally on the bulkiest part of the medial head of the gastrocnemius (GCM) muscle.
• Ultrasound scans were conducted longitudinally on the anterior arm muscles (B.B.) at two-thirds of the distance from the acromion to the elbow crease. This section of the scan included the biceps brachii and brachialis anterior muscles.
To prevent any changes to the image properties caused by software processing, all image optimization modes except for harmonic tissue imaging were turned off. The dynamic range was set at 66 dB and the gain to 50 during the examination of all subjects. The imaging depth was set at 4 cm for most muscles, except for the rectus femoris, where it was set at 6 cm. For patients with large muscles, the depth was increased to include the entire muscle in the image. Up to six focal zones were evenly distributed along the depth of the image. To ensure optimal ultrasound beam penetration and prevent soft tissue deformation due to transducer pressure, a sufficient amount of CLEAR ECO Supergel ultrasound gel was used. The beam inclination of the transducer was adjusted to obtain the brightest echo from the muscle fascia, ensuring that the images were obtained uniformly and consistently.
A total of 1223 ultrasound images of four different muscles were analysed: 306 images of the tibialis anterior, 299 of the rectus femoris, 299 of the gastrocnemius medialis, and 308 of the biceps brachii. Parts of this dataset have been previously presented in [28][29][30][31], but the objectives of those studies differed from the current one; this is the first time that this database has been used for synthetic image generation. Table 1 depicts the demographics of the dataset.
Diffusion models are generative models inspired by non-equilibrium thermodynamics. As mentioned earlier, the basic idea of diffusion models is to start with a random noise vector and then gradually transform it to produce a sample of synthetic data. First, they define a latent variable model which maps to a latent space of high dimensionality (same as the original data) using a fixed Markov chain [32]. The diffusion process, which introduces random noise to the data, is carried out in a series of steps. Subsequently, the system learns to reverse this diffusion process, allowing it to generate the desired data samples from noise. The forward diffusion process is defined mathematically as follows.
Given a data point sampled from the real data distribution, $x_0 \sim q(x)$, a small amount of Gaussian noise is added in each of $T$ steps, producing a sequence of noisy samples $x_1, x_2, \ldots, x_T$. The variance of the Gaussian noise added at each time step is controlled by the variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$, where $\beta_t$ is a value between 0 and 1 at time step $t$. The addition of Gaussian noise creates a new latent variable $x_t$ that follows the distribution of Equation (1):
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right) \tag{1}$$
Since the distribution (1) is normal, the input data $x_0$ can be transformed to $x_T$ in a tractable way, which is defined by the posterior probability (2):
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{2}$$
As the time step $t$ becomes larger, the data sample $x_0$ loses its distinguishable features, and in the limit $t \to \infty$ the result is an isotropic Gaussian distribution. A useful property of the above process is that, by using the reparameterisation trick [33], we can sample $x_t$ at any arbitrary time step $t$. Figure 1 depicts the forward process of the diffusion model.
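The property that $x_t$ can be sampled at an arbitrary time step can be made concrete with a short sketch. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, the standard DDPM closed form is $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The code below is only an illustration in plain Python: the linear schedule, its endpoints, and the function names are assumptions, since the variance schedule used in the study is not stated.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Variance schedule {beta_t}; the linear shape and endpoints are assumptions."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng=random):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed, x0 is a flat list of pixels."""
    a_bar = alpha_bars[t - 1]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, noise)]
    return x_t, noise

betas = linear_beta_schedule(300)       # 300 time steps, the number reported for training
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -0.25, 1.0, 0.0]             # toy "image" with pixels scaled to [-1, 1]
x_t, eps = q_sample(x0, t=150, alpha_bars=alpha_bars, rng=random.Random(0))
```

Because $\bar{\alpha}_t$ shrinks monotonically with $t$, the contribution of $x_0$ fades and the sample approaches pure Gaussian noise, which matches the behaviour described for large $t$.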
Since the final distribution is an isotropic Gaussian, the next step is to learn the reverse distribution $q(x_{t-1} \mid x_t)$. This is important because approximating this distribution enables us to sample $x_T$ from $\mathcal{N}(0, I)$, run the reverse process, and acquire a sample from $q(x_0)$, generating a novel data point from the original data distribution. We approximate $q(x_{t-1} \mid x_t)$ with a parametrised model $p_\theta$ that parameterises the mean and variance. Since the only requirement for that model is that its input and output dimensionality are identical, diffusion models are commonly implemented with U-Net-like architectures [34]. Finally, the Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous time step, as described by Equation (3):
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right) \tag{3}$$
Figure 2 depicts the forward and the reverse processes.
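To illustrate how the learned reverse transition is used for generation, the following sketch runs the reverse chain from pure noise. It assumes the common DDPM noise-prediction parameterisation, in which the network predicts the added noise and the mean is derived from it, and it fixes the variance to $\beta_t$ rather than learning it; neither choice is confirmed by the text. `dummy_noise_model` is a hypothetical stand-in for the trained U-Net.

```python
import math
import random

def reverse_step(x_t, t, predicted_noise, betas, alpha_bars, rng=random):
    """One ancestral sampling step x_t -> x_{t-1}; t is 1-indexed.

    Mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t).
    Variance: fixed to beta_t here (an assumption; it can also be learned).
    """
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    coef = beta_t / math.sqrt(1.0 - a_bar_t)
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, predicted_noise)]
    if t == 1:
        return mean  # no noise is added on the final step
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample(noise_model, n_pixels, betas, rng=random):
    """Run the full reverse chain from x_T ~ N(0, I) down to x_0."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_pixels)]
    for t in range(len(betas), 0, -1):
        x = reverse_step(x, t, noise_model(x, t), betas, alpha_bars, rng)
    return x

def dummy_noise_model(x_t, t):
    """Hypothetical placeholder for the trained U-Net: predicts zero noise."""
    return [0.0] * len(x_t)

# Linear schedule with 300 steps; the schedule shape is an assumption.
betas = [1e-4 + i * (0.02 - 1e-4) / 299 for i in range(300)]
generated = sample(dummy_noise_model, n_pixels=4, betas=betas, rng=random.Random(0))
```

With a perfect noise predictor, the last step recovers $x_0$ exactly from $x_1$, which is a convenient sanity check for an implementation of this update.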

Training Strategy
We trained a separate diffusion model for each muscle to generate synthetic images. Regarding the training strategy of these models, a manually curated set of approximately 300 longitudinal images for each muscle was used. As we mentioned earlier, these images were acquired from healthy individuals on the basis that the architectural characteristics of their muscles are optimal, meaning the superficial and deep aponeuroses, as well as muscle fascicles, were present and visible. The input and output size of the diffusion models was chosen to be 256 × 256 pixels. This decision was made mainly for the three following reasons:
1. The qualitative results in this image size were better than the smaller sizes (e.g., 128 × 128 and 64 × 64).
2. The input of the Attention-UNet that will delineate the deep and superficial aponeuroses, as described in the following section, is 256 × 256.
3. Larger image sizes would have required excessive computational power and training time.
The deep learning architecture utilized for the diffusion model was a modified version of the U-Net. It incorporated residual blocks (as opposed to traditional convolutional blocks) and utilized group normalization and Sigmoid Linear Unit activation functions. An extended version of the 2D Convolution layer was employed to standardize weights before the convolution step. Attention mechanisms were also employed to selectively weigh and combine different feature maps, enhancing the importance of relevant features while suppressing the influence of irrelevant or noisy features. During training, the L1 loss function was employed, and the number of timesteps that provided the best fit was determined to be 300. The batch size was set to 4, and the model was trained for 300 epochs.
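Two of the ingredients named above, weight standardisation before the convolution and the L1 training loss, are simple enough to sketch in plain Python. This is an illustration only: the function names are hypothetical, the kernel is shown as a flat list for one output channel, and the loss is assumed to compare the noise added in the forward process with the network's prediction, which is the usual DDPM objective but is not stated in the text.

```python
import math

def standardize_weights(kernel, eps=1e-5):
    """Standardise one output channel's kernel (flat list) to zero mean, unit variance."""
    n = len(kernel)
    mean = sum(kernel) / n
    var = sum((w - mean) ** 2 for w in kernel) / n
    return [(w - mean) / math.sqrt(var + eps) for w in kernel]

def l1_loss(predicted, target):
    """Mean absolute error between two equally sized flat lists."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

kernel = [0.2, -0.1, 0.4, 0.0, 0.3, -0.3, 0.1, 0.5, -0.2]   # a 3x3 kernel, flattened
w_hat = standardize_weights(kernel)                          # used in place of the raw kernel

true_noise = [0.5, -1.0, 0.25, 0.0]        # noise drawn in the forward process (toy values)
predicted_noise = [0.4, -0.8, 0.25, 0.3]   # hypothetical network output
loss = l1_loss(predicted_noise, true_noise)
```

In a full implementation the standardised kernel would be recomputed on every forward pass, so the optimiser still updates the raw weights.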

Muscle Thickness Measurement
For the muscle thickness (MT) extraction, the pipeline described in [28] was followed. This measurement involves drawing a centre line that lies midway between the superficial and deep aponeuroses. Subsequently, in five evenly distributed points along the centreline, a perpendicular chord is plotted, and the length of this chord is calculated. The muscle thickness is then measured by averaging the distances for all perpendicular chords, as is depicted in Figure 3. Finally, the measurement obtained in pixels is converted into millimetres using a scale factor obtained from the DICOM metadata of the recordings. Although this is not the standard procedure for measuring the MT, it is more robust to user dependence and easier to standardise since it eliminates the variability along the longitudinal axis.
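The chord-averaging procedure described above can be sketched as follows. To keep the geometry short, each aponeurosis is modelled as a straight line in pixel coordinates, which is a simplifying assumption rather than the pipeline of [28]; the function name, the line parameters, and the `mm_per_pixel` value are hypothetical.

```python
import math

def muscle_thickness_mm(superficial, deep, x_min, x_max, mm_per_pixel, n_chords=5):
    """Average perpendicular chord length between two aponeuroses, in millimetres.

    superficial, deep: (slope, intercept) of lines y = m*x + c in pixel coordinates.
    """
    (m1, c1), (m2, c2) = superficial, deep
    m, c = (m1 + m2) / 2.0, (c1 + c2) / 2.0        # centre line, midway between the two
    norm = math.sqrt(1.0 + m * m)
    nx, ny = -m / norm, 1.0 / norm                  # unit normal to the centre line
    lengths = []
    for i in range(1, n_chords + 1):
        x0 = x_min + i * (x_max - x_min) / (n_chords + 1)   # evenly spaced interior points
        y0 = m * x0 + c
        # Signed distance along the normal from the centre point to each aponeurosis.
        s1 = (m1 * x0 + c1 - y0) / (ny - m1 * nx)
        s2 = (m2 * x0 + c2 - y0) / (ny - m2 * nx)
        lengths.append(abs(s2 - s1))
    return mm_per_pixel * sum(lengths) / len(lengths)

# Two horizontal aponeuroses 120 px apart, with a hypothetical scale of 0.15 mm per pixel.
thickness = muscle_thickness_mm((0.0, 60.0), (0.0, 180.0), 0, 255, mm_per_pixel=0.15)
```

For curved aponeuroses taken from a segmentation mask, the same idea applies with the line intersections replaced by a search along each normal for the mask boundaries.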

Aponeurosis Delineation
For the MT measurement, it is crucial to delineate the deep and superficial aponeuroses in each muscle correctly. To do so, the state-of-the-art Attention-UNet [26] was selected. The Attention-UNet is a modified version of the original UNet architecture, which incorporates attention gates to enhance the importance of relevant features in the skip connections. The authors claim that these attention gates can filter out responses that are irrelevant in the forward and backward passes of the training process. In particular, during the backward pass, gradients arising from background regions are reduced in weight. This enables the update of the model parameters in shallower layers to be based on the spatial areas that are relevant to the task at hand.
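The gating mechanism can be illustrated with a per-pixel version of the additive attention gate used in Attention-UNet [26]: the skip features and a gating signal are projected, summed, passed through a ReLU, reduced to a scalar, and squashed by a sigmoid to give a coefficient that scales the skip features. The sketch below is a simplification in plain Python; in the actual network the projections are 1 × 1 convolutions and the gating signal is resampled to the skip resolution, and all weights shown are made-up values.

```python
import math

def attention_gate(x, g, w_x, w_g, b, psi, b_psi=0.0):
    """Additive attention gate applied pixel by pixel.

    x, g: lists of per-pixel feature vectors (skip and gating signals, same length).
    w_x, w_g: weight matrices with one row per intermediate channel; psi: a vector.
    Returns the gated skip features and the attention coefficients in (0, 1).
    """
    gated, alphas = [], []
    for x_i, g_i in zip(x, g):
        hidden = []
        for row_x, row_g, b_k in zip(w_x, w_g, b):
            z = (sum(w * v for w, v in zip(row_x, x_i))
                 + sum(w * v for w, v in zip(row_g, g_i)) + b_k)
            hidden.append(max(0.0, z))                    # ReLU
        q = sum(p * h for p, h in zip(psi, hidden)) + b_psi
        alpha = 1.0 / (1.0 + math.exp(-q))                # sigmoid
        alphas.append(alpha)
        gated.append([alpha * v for v in x_i])            # scale the skip features
    return gated, alphas

# Two pixels with two skip channels and one gating channel (toy values).
x = [[1.0, 2.0], [0.5, -1.0]]
g = [[1.0], [-1.0]]
gated, alphas = attention_gate(x, g, w_x=[[1.0, 0.0]], w_g=[[1.0]], b=[0.0], psi=[2.0])
```

In this toy input the first pixel receives a coefficient close to 1 and passes almost unchanged, while the second is halved, which is how background responses are attenuated in the skip connections.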
For training a deep learning model, annotated data pairs are required. Figure 4 depicts ultrasound images of the examined muscles along with their annotations. The annotation of the aponeuroses was performed following the guidelines of a specialised doctor. To improve the final performance, image augmentation techniques were incorporated: rotation, scaling, vertical flipping, and random erasing. Regarding the optimization process, the weighted Dice loss and the Adam optimizer [35] were utilized, with a stepwise decrease in the learning rate as the learning rate policy. The input dimension of the deep learning models was 256 × 256, the batch size was 8, and the models were trained for 300 epochs.
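The optimisation setup above can be sketched with a class-weighted soft Dice loss and a stepwise learning-rate policy. The exact weighting scheme, the smoothing term, the step size, and the decay factor are not given in the text, so every value and name below is an assumption chosen for illustration.

```python
def weighted_dice_loss(pred, target, class_weights, smooth=1.0):
    """Class-weighted soft Dice loss.

    pred, target: dict mapping class name -> flat list of probabilities / binary labels.
    """
    total, weight_sum = 0.0, 0.0
    for name, w in class_weights.items():
        p, g = pred[name], target[name]
        intersection = sum(pi * gi for pi, gi in zip(p, g))
        dice = (2.0 * intersection + smooth) / (sum(p) + sum(g) + smooth)
        total += w * (1.0 - dice)
        weight_sum += w
    return total / weight_sum

def step_lr(base_lr, epoch, step_size=100, gamma=0.1):
    """Stepwise decay: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

target = {"aponeurosis": [1, 1, 0, 0], "background": [0, 0, 1, 1]}
wrong = {"aponeurosis": [0, 0, 1, 1], "background": [1, 1, 0, 0]}
weights = {"aponeurosis": 2.0, "background": 1.0}   # hypothetical class weights

loss_perfect = weighted_dice_loss(target, target, weights)
loss_wrong = weighted_dice_loss(wrong, target, weights)
lr_at_250 = step_lr(1e-3, epoch=250)
```

Giving the thin aponeurosis class a larger weight than the background is one plausible motivation for a weighted Dice loss, since the structures of interest occupy only a small fraction of each image.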

Evaluation Protocol
The dataset was split into three subsets (train, validation, and test sets) for better evaluation. It is important to have an independent validation set because it can be used to monitor the model's performance during training and to detect overfitting. Furthermore, a separate test set ensures that the evaluation of the final model is not biased by the training or validation data. In our case, for each muscle examination, 40% of the images formed the train set, 20% the validation set, and the remaining 40% the test set. Furthermore, 200 synthetic images were generated and annotated for each of the four muscles. The annotation procedure is presented in Figure 4. Later, a series of experiments were performed to evaluate the impact of the generated data on the model performance. The basic intuition behind these experiments is to demonstrate that the synthetic data can be used on their own or as auxiliary data for training high-performance deep learning models. In particular:
1. A model was trained with only real data in the protocol mentioned earlier (40% train set, 20% validation set, 40% test set), and its performance on the test set was recorded (Real Model).
2. A model was trained with only generated data, and its performance on the test set of real images was recorded (Gen Model).
3. A model was trained with the real and all the generated data. Specifically, the generated data were added to the real training set, and the validation and test sets were kept the same. The final performance on the test set was recorded (Real + Gen Model).
4. Finally, a supplementary analysis was performed in which the number of the real training images was intentionally reduced while keeping the size of the testing set constant. Specifically, the entire dataset was divided into training and testing sets and three separate experiments were conducted. In each experiment, the number of real training images was reduced and the best performance of an Attention-UNet with and without synthetic data was reported.
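The split and the three main training configurations listed above can be sketched as follows. The file identifiers are hypothetical, and the sketch splits at the image level with a fixed seed; the text does not state whether the split was made per image or per participant, so that is an assumption.

```python
import random

def split_dataset(items, train_frac=0.4, val_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test (the remainder)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(train_frac * len(items))
    n_val = int(val_frac * len(items))
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

def build_training_sets(real_train, synthetic):
    """Training sets for the three main experiments; validation and test sets stay real."""
    return {
        "Real Model": list(real_train),
        "Gen Model": list(synthetic),
        "Real + Gen Model": list(real_train) + list(synthetic),
    }

real_images = [f"TA_real_{i:03d}" for i in range(306)]      # hypothetical identifiers
synthetic_images = [f"TA_gen_{i:03d}" for i in range(200)]  # 200 generated images per muscle
train, val, test = split_dataset(real_images)
experiments = build_training_sets(train, synthetic_images)
```

Keeping the validation and test sets identical across the three configurations is what makes their test-set scores directly comparable.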
Regarding the evaluation metrics for the deep and superficial aponeuroses delineation tasks, five well-established indexes were incorporated [36]. Specifically, the precision and recall of the segmentation results were reported between the manual and automatic measurements. Furthermore, the Dice coefficient (DSC) and the intersection over union (IoU) were also employed. Both metrics measure the pixel overlap between the predictions of the networks and the ground truth masks. To measure the discrepancy in the muscle thickness measurement, the root mean square error (RMSE) between the manual and automated readings was calculated. Finally, to assess possible bias and systematic error between the two readings, Bland-Altman analysis was also used.
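The segmentation and agreement metrics named in this section follow standard definitions, sketched below in plain Python for flat binary masks and paired thickness readings. The function names and the toy values are illustrative and do not come from the study.

```python
import math

def segmentation_metrics(pred, truth):
    """Precision, recall, Dice and IoU for flat binary masks (lists of 0/1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "dice": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 1.0,
    }

def rmse(manual, automatic):
    """Root mean square error between paired readings."""
    return math.sqrt(sum((m - a) ** 2 for m, a in zip(manual, automatic)) / len(manual))

def bland_altman(manual, automatic):
    """Bias (mean difference) and 95% limits of agreement (bias +/- 1.96 SD)."""
    diffs = [a - m for m, a in zip(manual, automatic)]
    bias = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (len(diffs) - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

scores = segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0])
manual_mm = [20.0, 22.0, 18.0, 25.0]        # toy manual thickness readings
automatic_mm = [21.0, 21.0, 19.0, 24.0]     # toy automated readings
error = rmse(manual_mm, automatic_mm)
bias, lower, upper = bland_altman(manual_mm, automatic_mm)
```

A bias near zero with narrow limits of agreement indicates that the automated readings neither systematically over- nor under-estimate the manual ones.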

Qualitative Analysis
Regarding the evaluation metrics for the deep and superficial aponeuroses delineation tasks, five well-established indexes were incorporated [36]. Specifically, the precision and recall of the segmentation results were reported between the manual and automatic measurements. Furthermore, the Dice coefficient (DSC) and the intersection over union (IoU) were also employed. Both metrics measure the pixels overlapping between the prediction of the networks with the ground truth masks. For measuring the discrepancy in the muscle thickness measurement, the root mean square error (RMSE) between the manual and automated readings was calculated. Finally, for the assessment of possible bias and systematic error between the two readings, the Bland-Altman analysis was also used. Figure 5 demonstrates synthetic images that the diffusion models generated for each muscle in comparison with real images that have been trained. From a qualitative aspect, the results are exceptional since it is difficult for the human eye to differentiate them. Furthermore, it is observable that these images consist of the basic characteristics of a typical longitudinal ultrasound recording. In particular, the muscle aponeuroses formed due to the high reflectivity of the epimysium surrounding the muscle have the exact form and properties as the real recordings. Furthermore, the muscle fascicles in the synthetic images are organised in a linear, pinnate, or triangular fashion, similar to the real ultrasound images. Overall, the generated images seem consistent and realistic from a medical and visual standpoint. The next analysis aims to compare the pixel-level differences between the generated and the real images. Specifically, the pixel intensity distributions of 100 real and synthetic images were extracted to quantify each case's information content. Later, these distributions' shape and entropy values were compared. 
Finally, in Figure 6, the histograms of the real and synthetic images (all resized to 256 × 256) for each muscle are presented.
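As a rough illustration of the histogram analysis, the sketch below computes the skewness of the pixel intensities and the Shannon entropy of a 256-bin histogram for one 8-bit grayscale image. The exact entropy definition used in the study is not stated, so the scale of the values produced here (bits, at most 8) should not be expected to match the reported ones; the function name is hypothetical.

```python
import numpy as np

def intensity_stats(image, bins=256):
    """Return (skewness, entropy in bits) of an 8-bit grayscale image."""
    pixels = np.asarray(image, dtype=float).ravel()
    # Histogram over the 8-bit intensity range, normalised to probabilities.
    hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the entropy
    entropy = float(-(p * np.log2(p)).sum())
    # Skewness as the third standardised moment of the intensities.
    mu, sigma = pixels.mean(), pixels.std()
    skewness = float(np.mean(((pixels - mu) / sigma) ** 3)) if sigma > 0 else 0.0
    return skewness, entropy

# Toy "dark" image: most intensities near zero, hence right-skewed.
rng = np.random.default_rng(0)
dark = rng.exponential(scale=30, size=(256, 256)).clip(0, 255)
skew_dark, entropy_dark = intensity_stats(dark)
```

Averaging these two values over the 100 images of each set would give per-muscle means comparable in spirit to those discussed in the text.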

It is clear from the histograms that the distribution shapes of the two sets of images are similar (statistical similarity). More specifically, every muscle has a skewed distribution, with close mean skewness and entropy between the synthetic and real image types. The biggest differences in the mean skewness are reported in the T.A. (real: 0.95, synthetic: 1.31) and in the R.F. (real: 1.32, synthetic: 1.67), which can be explained by the fact that the real images are darker than the synthetic ones, as seen in the corresponding histograms (more pixel intensities near zero). Regarding the mean entropy values, the results are extremely close in all the muscles: the ranges for the synthetic (10.83-11.03) and the real images (10.64-10.92) are alike, with only a slight offset in the synthetic. Furthermore, the generated images displayed similar variation between samples, as reflected by the different y-axis values and a similar range of peaks on the x-axis.
Another useful analysis is presented in Table 2. This table reports four quality metrics between the two sets of images for each muscle. These metrics help quantify the similarity in information and textural content. In particular, the peak signal-to-noise ratio (PSNR) [37] has been used to measure the similarity between the two sets of images. However, since PSNR does not always correlate well with human perception of image quality, the structural similarity (SSIM) [37] metric is also incorporated in this analysis. SSIM considers the luminance, contrast, and structure of the images by comparing local windows of pixels. Another metric used to assess image similarity is the learned perceptual image patch similarity (LPIPS) [38]. LPIPS has the benefit that it is based on a learned model, which means it has been trained on large datasets of human judgments of image similarity; this allows LPIPS to capture the nuances of human perception more easily than the other metrics. Finally, the last quality assessment metric is the Fréchet inception distance (FID) [39]. The FID score measures the distance between the distributions of real and generated images, with lower scores indicating higher similarity. First, a pretrained convolutional neural network is used to extract feature representations from the generated and real images. These features are then used to compute the mean and covariance of the feature distributions for both sets of images.
The FID score is then calculated as the Fréchet distance between the two Gaussian distributions defined by these means and covariances. From the results in Table 2, the PSNR between the two image types is above 60 for all the muscles and the SSIM is close to 1, indicating a high similarity of the generated images to the real ones. Furthermore, the LPIPS scores are close to zero, which also indicates the high quality of the generated images with no severe artefacts. Finally, the FID scores have small values, another indicator of the similarity in textural content between the two types of images. Regarding a per-muscle analysis, T.A. and R.F. exhibit the best results for PSNR and SSIM, and R.F. and GCM for the LPIPS metric. However, the differences between the results are too small to be considered significant.
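Two of these metrics have simple closed forms that can be sketched directly. PSNR is 10·log10(MAX² / MSE), and the Fréchet distance between two Gaussians is ‖μ_a − μ_b‖² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The sketch below assumes the features have already been extracted (for FID these would come from a pretrained Inception network, which is not reproduced here); the function names are illustrative.

```python
import numpy as np

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10 * np.log10(max_value ** 2 / mse))

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_features).
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; tiny negative values are numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 3))
same = frechet_distance(feats, feats)           # identical sets
shifted = frechet_distance(feats, feats + 2.0)  # same covariance, shifted mean
```

When the two feature sets share the same covariance, the trace term vanishes and only the squared distance between the means remains, which is why the shifted example reduces to 3 × 2² = 12.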
As a supplementary analysis, one more experiment was conducted. In this experiment, the average SSIM over real images from various patients (inter-patient SSIM) was calculated and compared with the average SSIM over the synthetic images (synthetic SSIM). This experiment was designed to provide insight into the level of similarity that should be expected between images of the same type. Specifically, each image set was divided in half and the SSIM index was calculated between the two halves. The results of this analysis are presented in Table 3. Our results indicate that the inter-patient SSIM index is almost identical to the SSIM index of the synthetic images. This finding provides further evidence that the distribution of synthetic data possesses similar textural and informational characteristics to the distribution of real images. We believe that this result supports the validity and utility of our proposed method for generating realistic musculoskeletal ultrasound images and reinforces the potential of synthetic data to supplement real data.
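The split-half comparison can be sketched as follows. Two simplifications are assumed here and are not taken from the study: SSIM is computed over the whole image as a single window (the standard metric averages over local windows), and the i-th image of the first half is paired with the i-th image of the second half, since the pairing scheme is not specified.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Simplified SSIM using one window covering the whole image."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2  # standard stabilising constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def split_half_ssim(images):
    """Average SSIM between paired images from the two halves of a set."""
    half = len(images) // 2
    first, second = images[:half], images[half:2 * half]
    return float(np.mean([global_ssim(a, b) for a, b in zip(first, second)]))

rng = np.random.default_rng(0)
image_set = [rng.integers(0, 256, size=(32, 32)) for _ in range(10)]
within_set = split_half_ssim(image_set)
```

Running the same function on the real set and on the synthetic set gives two averages that can be compared in the manner of Table 3.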
Continuing our analysis, the generated and real data were visualised in a 2D space. Specifically, features were extracted for each sample from the bottleneck of a pre-trained Attention-UNet. This Attention-UNet had been trained to segment the deep and superficial aponeuroses of all four muscles. Afterwards, these features were normalised, and dimensionality reduction was performed with the well-established principal component analysis (PCA). Finally, all the samples were visualised in a two-dimensional (2D) feature space to inspect their structure. Figure 7 depicts the 2D feature space for each muscle. It shows that the features of the real and synthetic images are not easily separable in any of the examined muscles, which suggests that the data points share similar characteristics and are likely to belong to the same class or category. This result is another indicator that the textural representation of the synthetic and real images is similar.
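The normalisation and projection steps can be sketched with plain NumPy, assuming the bottleneck features have already been extracted and flattened to one vector per image. The random arrays below are placeholders for those features, and the 64-dimensional size is an arbitrary choice for illustration.

```python
import numpy as np

def pca_2d(features):
    """Standardise features and project them onto the first two principal axes."""
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    z = (x - x.mean(axis=0)) / std
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:2].T

# Placeholder features standing in for Attention-UNet bottleneck activations.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(100, 64))
synth_feats = rng.normal(size=(100, 64))

# Fit one projection on the pooled data so both sets share the same axes.
coords = pca_2d(np.vstack([real_feats, synth_feats]))
real_2d, synth_2d = coords[:100], coords[100:]
```

Fitting the PCA on the pooled samples, rather than on each set separately, is what makes the two point clouds directly comparable in a single scatter plot.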

Muscle Thickness Analysis
The image segmentation results for the experiments described in Section 2.3.3 are presented in this section. As mentioned earlier, the average performance on the test set of real images is reported for each experiment. The results of the deep and superficial aponeuroses delineation are presented in Table 4. The model trained with the real and generated data (Real + Gen Model) outperforms the other two models in most metrics; this is important because it shows that the generated data positively impact the final performance. Furthermore, the performance of the model trained with only synthetic data (Gen Model) is far from disappointing. Specifically, the reported precision is 0.78, while the corresponding recall is 0.85, indicating the network's ability to accurately locate both the deep and superficial aponeuroses. Additionally, the reported results for DSC and IoU are 0.80 and 0.68, respectively, further demonstrating the quality of the segmented masks on the test set. Taken together, these results indicate that the generated data have a positive role in training deep learning models.
Continuing our analysis, Table 5 presents a comparison between the automated measurements obtained from the aforementioned models and the manual measurements in physical units. The table displays the mean ± standard deviation of the measurements, along with their RMSE discrepancy. All the models exhibit a low RMSE on the evaluated dataset, one more indicator of the applicability of the synthetic data for this task. Additionally, even though the Gen Model slightly underperformed in comparison with the other two, the average discrepancy between the two readings was only 1.05 mm. This provides additional evidence that synthetic data can be independently employed for training deep learning models that exhibit high performance.
Finally, the other two models achieved similar results, making it difficult to determine which of them is superior. Figure 8 presents another informative analysis, namely the Bland-Altman plots of the muscle thickness measurements. All the plots indicate minimal additive bias and no evident systematic error. Moreover, the majority of the differences fall within the 95% limits of agreement, and no patterns are discernible in the plots. Finally, the Bland-Altman plot of the Gen Model (Figure 8b) has a few points that lie far from the mean, highlighting that the predictions fail in those cases. In the other two plots (Figure 8a,c), such behaviour is not present to the same extent.
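The quantities behind a Bland-Altman plot are straightforward to compute: the bias is the mean of the paired differences, and the 95% limits of agreement are the bias ± 1.96 standard deviations of those differences. The following is a minimal sketch with made-up thickness readings in millimetres; the function name is illustrative.

```python
import numpy as np

def bland_altman(manual, automated):
    """Return (bias, lower limit, upper limit) of agreement between two readings."""
    manual = np.asarray(manual, dtype=float)
    automated = np.asarray(automated, dtype=float)
    diff = automated - manual
    bias = diff.mean()       # additive bias between the two methods
    sd = diff.std(ddof=1)    # sample standard deviation of the differences
    return float(bias), float(bias - 1.96 * sd), float(bias + 1.96 * sd)

# Made-up muscle thickness readings (mm), for illustration only.
manual_mm = [20.0, 22.0, 24.0, 26.0]
automated_mm = [21.0, 22.0, 25.0, 26.0]
bias, lower, upper = bland_altman(manual_mm, automated_mm)
```

In the plot itself, each point has the mean of the two readings on the x-axis and their difference on the y-axis, with horizontal lines drawn at the bias and at the two limits.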
Table 4 (partial; columns in the order used in the text: precision, recall, DSC, IoU). Real + Gen Model: 0.84 ± 0.10, 0.88 ± 0.09, 0.86 ± 0.08, 0.76 ± 0.10.
Additionally, in Figure 9, a sample of the predictions of the Gen Model is presented. From the segmented masks, it is clear that the synthetic images can be applied effectively for the delineation of the deep and superficial aponeuroses.
Finally, an additional analysis was conducted to better demonstrate the impact of the generated data in training high-performance deep learning models. Specifically, the entire dataset was divided into training and validation sets and three separate experiments were performed. In each experiment, the number of training images was intentionally reduced, while the size of the validation set was kept constant. In the first experiment, 50% of the whole dataset was used as the training set and the remaining 50% as the validation set. In the second experiment, the validation set remained the same, but the training set was reduced to 30% of the whole dataset. Finally, in the third experiment, the training set was further reduced to 10% of the whole dataset. For each experiment, two Attention-UNet models were trained: one using only the original data, and the other using the original data together with the synthetic data generated by our proposed method. Lastly, the best performance achieved by both models in terms of Dice coefficient and IoU was reported. The results of this analysis are presented in Table 6. The results indicate that artificial data can significantly enhance the performance of deep learning models and can be used to reduce the amount of real data required to achieve high performance.
Notably, the results demonstrate that the model trained with a combination of 30% real images and generated data (30% Train + Gen) yields similar performance to the model trained with 50% real images (50% Train). This indicates that the generated data have significantly boosted the performance of the model in this specific problem. Finally, the experiments revealed that the impact of the generated data was more pronounced when the amount of real training data was limited. This is likely because, when less real data is available, the generated data provide a more valuable supplement to the training set. On the other hand, as the amount of real data increases, the deep learning architecture can generalise better, leaving less room for improvement.
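The data-reduction protocol can be sketched as an index split: half of the images form a fixed validation set, and training subsets of 50%, 30%, and 10% of the whole dataset are drawn from the other half. How the subsets were actually sampled is not stated; the nested subsets, the random seed, and the dataset size of 1000 used below are assumptions for illustration.

```python
import numpy as np

def make_splits(n_images, train_fractions=(0.5, 0.3, 0.1), seed=0):
    """Return a fixed validation index set and one training index set per fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    n_val = n_images // 2
    val_idx = order[:n_val]   # validation set shared by all experiments
    pool = order[n_val:]      # images available for training
    train_sets = {}
    for frac in train_fractions:
        n_train = int(round(frac * n_images))  # fraction of the WHOLE dataset
        train_sets[frac] = pool[:n_train]      # nested subsets (assumption)
    return val_idx, train_sets

val_idx, train_sets = make_splits(1000)
```

For each fraction, one model would then be trained on `train_sets[frac]` alone and a second on the same indices plus the generated images, with both evaluated on `val_idx`.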

Discussion
This study employed state-of-the-art diffusion models to generate realistic MSK-US images of four muscles that are highly informative for investigating neuromuscular disorders [40]. Afterwards, the quality of the synthetic images was assessed both qualitatively and quantitatively against the real data. Specifically, a histogram analysis demonstrated that the pixel intensity distribution is similar in both cases. Additionally, four quality metrics that correspond well with human perception were evaluated between the two types of images. In all these metrics, the results indicated high similarity. In addition, features from a pretrained Attention-UNet were extracted and visualised in a two-dimensional space using PCA. Again, the results showed that there is no clear distinction between the two projections, another indicator of the similarity of the two sets of images. Finally, to evaluate the applicability of the synthetic data in a real-world scenario, an Attention-UNet was trained to automatically delineate the deep and superficial aponeuroses. Our results indicate that the synthetic data can be used autonomously or supplementarily for training high-performance deep learning models for this task.
A significant advancement of this study compared to the recent works presented in [8,14] is the use of diffusion models instead of GANs to address this problem. Generative adversarial networks are hard to train because they involve a complex optimisation process that requires careful tuning of hyperparameters. Additionally, GANs may suffer from mode collapse, meaning that the generator network produces a limited set of output samples and ignores the rest of the distribution, resulting in generated images that lack diversity and quality. GANs are also sensitive to data quality and quantity: they require a large, high-quality dataset to learn meaningful patterns, otherwise the model may not generate accurate samples. Another limitation of GANs is that the mechanism by which new samples are generated is difficult to understand, so they are considered black-box models. In contrast, diffusion models can be trained efficiently without excessive tuning to produce realistic results. Furthermore, unlike other deep generative models, diffusion models have an interpretable structure based on stochastic differential equations, allowing insights into the generative process and the underlying dynamics of the data. Finally, diffusion models can be used for transfer learning by fine-tuning the model on a new dataset, which is useful in scenarios where labelled data are scarce or when the model needs to adapt to new domains.
Several analyses were performed to properly evaluate the quality of the synthetic data. Initially, the histograms of the pixel intensities in 100 randomly picked generated and real images of each muscle were extracted and compared. This analysis showed that each muscle's distribution shape and entropy are statistically similar. In particular, a right-skewed distribution exists in every muscle, with close mean skewness and entropy between the two sets. The most significant differences in the mean skewness were reported in the T.A. (real: 0.95, synthetic: 1.31) and in the R.F. (real: 1.32, synthetic: 1.67), explained by the fact that the real images were darker than the synthetic ones in both muscles. Regarding the mean entropy values, the results were extremely close in all the examined muscles. Four metrics aligned with human judgment were used to quantify the similarity of 100 generated images for each muscle with the real dataset. In all the metrics, the results demonstrated that the quality of the synthetic data is high. PSNR was above 60, and SSIM was close to 1 in all the examined muscles. In addition, the similarity level of the real images was analysed (inter-patient SSIM) and found to be almost identical to the similarity level of the synthetic images. This finding provides further evidence that the distribution of synthetic data possesses similar textural and informational characteristics to the distribution of real images. Furthermore, LPIPS and FID, which also consider textural information, were close to zero in all the muscles, another indicator of the similarity of the two sets of images. Finally, the two sets of images were visualised in a common two-dimensional space. In particular, high-level textural features were extracted from the bottleneck of a pretrained Attention-UNet for each image. Afterwards, the dimensionality of these features was reduced with PCA and visualised in a common space.
The results showed that the data points of the generated and real images do not form separate classes but are intermixed, which is one more indicator that they possess similar textural characteristics. Furthermore, a system that automatically extracts the muscle thickness (MT) measurement was developed to evaluate the applicability of the generated data in a real-world clinical application. Specifically, the deep and superficial aponeuroses were segmented with the state-of-the-art Attention-UNet in a novel database of musculoskeletal ultrasound images. Afterwards, the MT was measured by computing the mean distance between the two aponeuroses at several points across the muscle. Since the main goal is to assess the impact of the generated data on the model's final performance, different experiments were performed. From these, it is clear that the generated data are capable of producing high-performance models with (or without) the use of real images. This is a significant result that can accelerate the integration and improvement of deep learning technology in MSK-US, where the acquisition of real data is very difficult and time-consuming due to privacy restrictions. Notably, the Attention-UNet trained only with synthetic images (Gen Model) achieved over 80% in the Dice coefficient, a performance very close to that of the Real Model (85%), trained with only real images, and of the Real + Gen Model (86%), trained with both types of data. In every case, these results indicate that the generated data can be used autonomously or supplementarily to train high-performance models for the specific task. This is also reflected in the RMSE between the manual and automated measurements in all the different training configurations. Specifically, the average difference between the two readings for the Real Model was only 0.35 mm and for the Real + Gen Model was similar at 0.38 mm.
Similarly, for the model trained only with generated data, the difference is 1.05 mm, larger than before but still deviating by only 4% from the manual MT measurements. Finally, an additional analysis was conducted to better demonstrate the impact of the generated data in training high-performance deep learning models. During this analysis, it was observed that all the models trained with a combination of real and generated data outperformed the baseline models. Additionally, the model that was trained with a combination of 30% of real images and generated data (30% Train + Gen) achieved almost identical performance to the model trained with more real images (50% Train). These findings provide further support for the notion that artificial data can substantially enhance the performance of deep learning models and can serve as a supplement in situations where real data are lacking.
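The muscle thickness measurement mentioned above (the mean distance between the two aponeuroses at several points across the muscle) can be sketched from two binary masks. The details below are assumptions for illustration rather than the study's implementation: the distance is taken vertically along image columns, between the lower edge of the superficial aponeurosis and the upper edge of the deep aponeurosis, and `mm_per_pixel` is a hypothetical calibration factor.

```python
import numpy as np

def muscle_thickness(superficial, deep, mm_per_pixel, n_points=5):
    """Mean vertical distance (mm) between two aponeurosis masks.

    superficial, deep: binary masks of equal shape (rows x columns).
    """
    superficial = np.asarray(superficial, dtype=bool)
    deep = np.asarray(deep, dtype=bool)
    # Only columns where both aponeuroses were segmented can be measured.
    valid = np.flatnonzero(superficial.any(axis=0) & deep.any(axis=0))
    if valid.size == 0:
        raise ValueError("no column contains both aponeuroses")
    # Pick evenly spaced measurement columns across the valid range.
    picks = valid[np.linspace(0, valid.size - 1, n_points).astype(int)]
    distances = []
    for col in picks:
        sup_bottom = np.flatnonzero(superficial[:, col]).max()  # lower edge
        deep_top = np.flatnonzero(deep[:, col]).min()           # upper edge
        distances.append(deep_top - sup_bottom)
    return float(np.mean(distances) * mm_per_pixel)

# Toy masks: superficial aponeurosis on rows 2-3, deep aponeurosis on rows 12-13.
sup_mask = np.zeros((20, 10), dtype=bool)
deep_mask = np.zeros((20, 10), dtype=bool)
sup_mask[2:4, :] = True
deep_mask[12:14, :] = True
thickness_mm = muscle_thickness(sup_mask, deep_mask, mm_per_pixel=0.5)
```

Comparing such automated values with manual readings over the test set yields the RMSE figures discussed in the text.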
This study has some general limitations. Firstly, only four muscles were examined, out of the more than 200 that the human body possesses. Secondly, the number of MSK-US images was 1223 from 116 subjects, which can be considered relatively small. Consequently, further research involving a larger sample size and more muscles would offer a clearer understanding of the diffusion models' capability to generate MSK-US images. Another constraint is that all the real images were obtained from a single ultrasound machine, using the same software and image settings. Hence, we did not investigate multiple configuration setups that can alter the final image. Lastly, all the recordings used in this study were acquired from young and healthy subjects, which can bias our results since the young population usually has muscles with normal echogenicity and better architectural characteristics than the elderly. However, we are confident that these challenges can be overcome with a small amount of real data since, as mentioned before, the diffusion model is scalable and can be trained without excessive hyperparameter tuning.
In future work, we plan to investigate the generation of transverse MSK-US images of these four muscles. Furthermore, we will investigate the applicability of the generated data in other clinical applications, such as the automatic extraction of the cross-sectional area (CSA) or the extraction of the fascicle length and pennation angle. Finally, we will investigate the generation of data acquired from older adults with higher echogenicity, since ageing leads to a reduction in muscle mass and an increase in muscle fat.