Dynamic $\beta$-VAEs for quantifying biodiversity by clustering optically recorded insect signals

While insects are the largest and most diverse group of terrestrial animals, constituting ca. 80% of all known species, they are difficult to study due to their small size and the similarity between species. Conventional monitoring techniques depend on time-consuming trapping methods and tedious microscope-based work by skilled experts in order to identify the caught insect specimens at the species, or even family, level. Due to the rapid decline in insect numbers, researchers and policy makers are in urgent need of a scalable monitoring tool in order to conserve biodiversity and secure human food production. In order to improve upon existing insect clustering methods, we propose an adaptive variant of the variational autoencoder (VAE) which is capable of clustering data by phylogenetic groups. The proposed dynamic beta-VAE dynamically adapts the scaling of the reconstruction and regularization loss terms (the beta value), yielding useful latent representations of the input data. We demonstrate the usefulness of the dynamic beta-VAE on optically recorded insect signals from regions of southern Scandinavia to cluster unlabelled targets into possible species. We also demonstrate improved clustering performance in a semi-supervised setting using a small subset of labelled data. These experimental results, in both unsupervised and semi-supervised settings, with the dynamic beta-VAE are promising and, in the near future, the method can be deployed to monitor insects and help conserve the rapidly declining insect biodiversity.


Introduction
Insects make up the majority of all known animal species, with ca. 1 million described species and an estimated 3-4 million yet to be discovered [1,2]. While insects are numerous and found in almost all habitats, the total insect population is thought to be shrinking at an alarming rate. One influential study recently reported a 70% loss of flying insect biomass over 30 years [3]. These losses have mainly been driven by changes in the agricultural landscape, increased use of pesticides and the spread of disease, but the exact causes and consequences are still unknown [4,5]. In order to accurately measure the biodiversity and health of the insect community across various biotopes (or habitats), researchers, agronomists, policy makers and institutions are in need of insect monitoring capabilities covering multiple areas over long periods of time.
Conventional insect biodiversity monitoring typically involves various trapping methods, each with its own bias towards different species, which makes it difficult to compare results across studies [6]. The collected insect specimens are then identified under microscopes by highly trained experts. These methods provide data with very high specificity, but they are time-consuming and expensive, which severely limits the ability to collect data on a large scale, or over extended time periods, with high spatial and temporal resolution.
In recent years, new technologies have been developed for insect monitoring, such as automated traps [7,8], acoustic methods [9,10] and optical instruments such as the entomological lidar [11,12,13,14]. In general, these methods provide large amounts of data with a high temporal resolution but with lower specificity compared to conventional methods [15,16,8,17]. The introduction of automated and continuous monitoring methods has the potential to greatly improve biodiversity monitoring and, consequently, conservation efforts. In order to utilize the full potential of these new methods, large numbers of unlabelled insect recordings need to be translated into a quantifiable biodiversity index, comparable to conventional estimates.
In this work, we combine the rich data from lidar entomology with the powerful capabilities of variational autoencoders (VAEs) [18]. A central difficulty when training VAEs is the trade-off between their two loss components: the reconstruction loss and the regularization loss. Balancing these two components can yield useful low-dimensional features (representations) of the input data, which can be further analyzed to perform clustering [19,20]. We introduce a dynamically changing formulation of the scaling of the loss terms (β). The proposed β dynamics takes both instantaneous changes to the loss terms and the historical performance during training into account, in order to keep both the reconstruction and regularization performance near their optimum.
The proposed dynamic β-VAE is trained on unlabelled insect recordings collected using a novel optical insect sensor at several sites in southern Scandinavia. We demonstrate its ability to successfully extract features from input data into the regularized latent space, and to cluster the data into an appropriate number of clusters. We experimentally validate improvements compared to more conventional methods such as the hierarchical clustering algorithm (HCA) [21,22] and principal component analysis (PCA). Additionally, we show that the introduction of a semi-supervised data set further improves the clustering performance on the unlabelled data when evaluated on known phylogenetic groups.

Lidar-entomology & clustering
Lidar entomology is an insect monitoring method in which insects are recorded as they enter an infrared laser beam, which can sometimes extend for kilometers. It might be the fastest way to record large amounts of insect data, yielding up to several tens of thousands of optically recorded insect signals per day [21]. The data consist of time series in which the signal intensity varies with the insect cross section and wing beats. As the wing-beat frequency (WBF) varies between insect groups, it can to some degree be used to distinguish between species, alone or together with other extracted features [15,16,8,17,23].
While the ability to identify a number of key species from automated sensors would be greatly beneficial to the entomological community, it is not sufficient to quantify the biodiversity. Instead, the total number of species (species richness) and their relative distributions (species evenness) are the commonly used measurements. Previous efforts to derive these numbers from a large number of optically recorded insect signals have been made using HCA on the WBF power spectra [21,22]. However, in order to cover a broad range of frequencies with sufficient resolution, a high dimensional feature space is required and the distance measures are non-trivial. For reduced model complexity and improved computational and clustering performance, a reduction of the parameter space is desired.
In order to reduce the parameter space while retaining the necessary information, various algorithms for extracting the WBF and other physical properties from insect recordings have been proposed and used [15,23,24,25,26]. Machine learning based methods for feature extraction, such as autoencoders (AEs), have also been used to extract additional features [24] and, very recently, to cluster acoustically recorded bird songs [27]. While an AE is able to generate high-quality features for classification, a known limitation of AEs is the irregularity of the latent feature space, where two similar data inputs might end up with very different latent representations. This makes the features extracted by an autoencoder unsuitable for clustering recordings of similar insect species and quantifying the diversity of the recorded insects.

VAEs and β-Annealing
Variational autoencoders (VAEs) consist of a regularized probabilistic encoder-decoder pair and are some of the most powerful representation learning methods [19,18]. They have seen broad applications in generative modelling and unsupervised learning tasks.
Given unlabelled input data consisting of N samples with F features, x ∈ R^{N×F}, the probabilistic encoder of a VAE maps the input to the posterior density p(z|x) over the latent variable z ∈ R^{N×L}. In practice, L ≪ F, and the encoder neural network approximates the true posterior density p(z|x) with a multivariate Gaussian, q_θ(z|x) ∼ N(µ_θ, σ²_θ). The decoder of a VAE reconstructs the input data from the latent variable and is given by the density function p_φ(x|z). The encoder and decoder neural networks are parameterised by θ and φ, respectively. The optimization objective of a VAE consists of two competing terms and can be shown to be [18]

L(θ, φ) = −E_{q_θ(z|x)}[log p_φ(x|z)] + KL(q_θ(z|x) ‖ p(z)) (1)
        = L_rec + L_reg. (2)

The quality of the auto-encoded reconstructions is controlled by the reconstruction loss, L_rec, which is the first term in Eq. (1). The encoder density is regularized to match the prior over the latent variable, p(z) ∼ N(0, I), enforced by the regularization loss, L_reg, which is the Kullback-Leibler divergence (KLD) term in Eq. (1). At a high level, the regularization term controls the smoothness, or regularity, of the latent space. Well-structured and smooth latent spaces can yield useful representations of the input data.
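For a Gaussian encoder and a standard-normal prior, both terms in Eq. (1) have simple closed forms. The following numpy sketch is illustrative only, not the authors' implementation; mean squared error is used here as a stand-in for the negative log-likelihood:

```python
import numpy as np

def vae_loss_terms(x, x_hat, mu, log_var):
    """Per-batch VAE loss terms for a Gaussian encoder q(z|x) = N(mu, exp(log_var))
    and a standard-normal prior p(z) = N(0, I).

    L_rec: mean squared reconstruction error (a stand-in for -log p(x|z)).
    L_reg: closed-form KL divergence KL(q(z|x) || N(0, I)), averaged over the batch.
    """
    l_rec = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l_reg = np.mean(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1))
    return l_rec, l_reg
```

When the posterior matches the prior exactly (mu = 0, log_var = 0), the regularization term vanishes, which is the posterior-collapse regime discussed below.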
The trade-off between the two loss terms can influence the performance of any VAE. A VAE where the reconstruction term dominates might reconstruct the input data well, but with a latent space that is not interesting for the downstream tasks (such as clustering). To alleviate this, a simple trick of scaling the regularization term L_reg was used in [20], resulting in a modified objective:

L(θ, φ) = L_rec + β L_reg. (3)

Here the role of β ≥ 0 is to balance the reconstruction and regularization losses. Typically, lower β values yield better reconstructions but a less regularized latent space and less disentangled features. On the other hand, a higher β may lead to posterior collapse, where all reconstructions are reduced to the average input and the KLD approaches zero. Various methods have been proposed to overcome this instability in achieving a reasonable trade-off between the loss terms. A common implementation is β-annealing, where β is gradually increased from a very low value up to a fixed point. While this solves the initial stability problems, the task of finding the optimal value of β remains. Recently, it has been shown that repeating the process with a cyclic β can lead to better performance [28]. However, when left unchecked, both implementations face the risk of posterior collapse (vanishing KLD) once β enters a stationary phase. More recently, several approaches have attempted to adapt β instead of using a fixed or scheduled scaling [29,30,31]. In the controlVAE formulation in [29], rather than gradually increasing β to a maximum value (annealing), it is controlled with feedback from a non-linear proportional-integral (PI) controller to keep the KLD at a desired level. This addresses the vanishing KLD problem, but users still have to set the desired value of the KLD, which might not be straightforward for many applications.
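The scheduled alternatives discussed above can be written as simple functions of the epoch. A sketch, with illustrative warm-up and period values that are not taken from any particular implementation:

```python
def beta_annealed(epoch, warmup=500, beta_max=1.0):
    """Monotone beta-annealing: beta grows linearly from 0 to beta_max,
    then stays fixed (and may then risk posterior collapse)."""
    return min(beta_max, beta_max * epoch / warmup)

def beta_cyclic(epoch, period=500, beta_max=1.0):
    """Cyclic schedule in the spirit of [28]: beta is re-annealed from 0
    in the first half of every period, then held at beta_max."""
    phase = epoch % period
    return min(beta_max, 2.0 * beta_max * phase / period)
```

Both schedules are fixed in advance; neither reacts to the actual loss values, which is the gap the dynamic β-VAE below addresses.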

Methods
The primary goal of this work is to obtain low-dimensional latent representations suitable for clustering of the high-dimensional input data. Specifically, the objective is to obtain latent space encodings of the insect spectral data such that similar species of insects are positioned close to each other. To this end, we propose to dynamically adapt the otherwise constant scaling factor β of a standard β-VAE.
In our proposed dynamic β-VAE formulation, the changes in the reconstruction and regularization losses are monitored throughout the training process; these changes are used to adapt the β value in each epoch using a simple control algorithm. If either the reconstruction or the regularization loss increases above a specific level of its historical minimum, then β is adjusted (either increased or decreased) until a new optimum is attained. This dynamic control of β maintains a steady trade-off between the two loss terms while reducing the global loss function, by latching on to the historical minima of the loss components.
In the remainder of this section, we detail the dynamic β-VAE and formulate a semi-supervised variant of the model using a new clustering loss component.

Dynamic β-VAE
The key contribution of this work is an online, adaptive formulation of the β-VAE using dynamic control of β. This is achieved by varying β at each epoch, based on the instantaneous changes in the reconstruction and regularization terms in Eq. (3), with the objective of not letting either of the loss terms dominate the overall model optimization. This results in a trade-off between sufficiently good reconstructions and an adequately regularized latent space, yielding representations of the input data that are useful for the downstream task.
At any given epoch t, the objective for the dynamic β-VAE is given by

L^(t)(θ, φ) = L_rec + β^(t) L_reg. (4)

The dynamically controlled β^(t) is formulated in Eq. (5) using the signum function, ψ[·], which returns the sign of any real number, together with the hyperparameters [a, b, w_1, w_2, w_3, w_4] ∈ R_+. The notation (: t − 1) is used to indicate all epochs up to (t − 1), and (t′) is the epoch when β was last changed. The terms associated with (: t − 1) provide a form of long-term memory of the previous local optima for each of the two loss terms.
The β dynamics in Eq. (5) can be divided into two regimes, aimed at optimizing the reconstruction and regularization terms, corresponding to an increase and a decrease of β, respectively. Reconstruction regime (β ↓): The value of β is decreased due to the −b term in Eq. (5) when ∆_rec is positive, meaning that the reconstruction loss is increasing compared to the historical minimum reconstruction loss, according to Eq. (6). The β decrease rule also checks that the regularization loss is decreasing compared to its historical minimum, through the term 1 − ψ[∆_reg] in Eq. (5). Regularization regime (β ↑): The increase in β happens due to the +a term in Eq. (5) when ∆_reg is positive, meaning that the regularization loss is increasing according to Eq. (7). The increase rule checks that the reconstruction loss has decreased compared to its historical minimum, through the term 1 − ψ[∆_rec] in Eq. (5).
Additionally, ∆L_rec in Eq. (8) nudges a change in β based on the last update to β. This allows β to escape plateaus of either stable reconstruction or stable regularization regimes.
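The control logic described above can be sketched as follows. This is one plausible reading of the update rules, not the authors' code: the exact form of Eqs. (5)-(8), and in particular how the weights w_1-w_4 enter the comparisons against the historical minima, is an assumption here (the `w_rec`/`w_reg` parameters stand in for them).

```python
class DynamicBeta:
    """Illustrative sketch of the dynamic beta controller of Sec. 3.1.

    Beta is decreased by b when the reconstruction loss rises above a weighted
    historical minimum (reconstruction regime), and increased by a when the
    regularization loss does (regularization regime), each move gated on the
    other loss not having worsened as well.
    """

    def __init__(self, a=0.2, b=0.05, w_rec=1.2, w_reg=1.2, beta0=0.0):
        self.a, self.b = a, b
        self.w_rec, self.w_reg = w_rec, w_reg
        self.beta = beta0
        self.min_rec = float("inf")  # historical minimum of L_rec over (: t-1)
        self.min_reg = float("inf")  # historical minimum of L_reg over (: t-1)

    def step(self, l_rec, l_reg):
        # Positive deltas mean a loss has risen above its weighted historical minimum.
        d_rec = l_rec - self.w_rec * self.min_rec if self.min_rec < float("inf") else -1.0
        d_reg = l_reg - self.w_reg * self.min_reg if self.min_reg < float("inf") else -1.0
        if d_reg > 0 and d_rec <= 0:
            self.beta += self.a                       # regularization regime: beta up
        elif d_rec > 0 and d_reg <= 0:
            self.beta = max(0.0, self.beta - self.b)  # reconstruction regime: beta down
        # Update the long-term memory of the optima.
        self.min_rec = min(self.min_rec, l_rec)
        self.min_reg = min(self.min_reg, l_reg)
        return self.beta
```

In the paper's training setup such a decision is only taken every fifth epoch; the plateau-escape nudge of Eq. (8) is omitted from this sketch for brevity.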
In Figure 1, one instance of optimizing the dynamic β-VAE with the objective in Eq. (4) is shown. The value of β increases until about epoch 700, at which point it plateaus and decreases (Figure 1, row 3). At epoch 2000, it has finally stabilized. These changes are correlated with the changes to L_reg and L_rec captured in the second row of Figure 1, estimated according to Eqs. (5) to (8).

Semi-supervised clustering
A small subset of labelled data, used to optimize a relevant auxiliary loss, can steer the model towards representations that are more expressive for the downstream tasks under consideration. One approach to achieving this is to introduce auxiliary tasks based on the labelled data, resulting in a semi-supervised learning setup [32].
As we are interested in clustering insect species based on their latent representations, we enforce clustering of a small subset of labelled examples to improve the overall clustering. An additional loss term, L_cls, based on the auxiliary task is introduced to the model optimization:

L^(t)(θ, φ) = L_rec + β^(t) L_reg + γ^(t) L_cls, (9)

where γ^(t) is the scaling of the clustering loss component. The clustering loss, L_cls, has two components that encourage intra-class cohesion and inter-class repulsion. Intra-class cohesion is captured as the sum of distances between all data points belonging to a particular class and the corresponding cluster centroid. Assuming K clusters with centroids ẑ_k ∈ R^L, k = 1 . . . K, and N_k cluster members with class label k denoted z_k, the intra-class cohesion distance is given by:

D_intra = Σ_{k=1}^{K} Σ_{i=1}^{N_k} d(z_k^(i), ẑ_k). (10)

The inter-class repulsion distance is captured using the sum of all pairwise distances between the cluster centroids:

D_inter = Σ_{k=1}^{K} Σ_{k′=k+1}^{K} d(ẑ_k, ẑ_k′).

In both cases, d(·, ·) is the Euclidean distance. Finally, the clustering loss is computed as the ratio between the two distances which, when minimized, encourages smaller intra-class and larger inter-class distances:

L_cls = D_intra / (D_inter + ε), (11)

where ε is a small constant used for numerical stability.
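The clustering loss translates directly into code; a numpy sketch for illustration (the paper's implementation is in PyTorch, where the same computation would be written on tensors so that it is differentiable):

```python
import numpy as np

def clustering_loss(z, labels, eps=1e-8):
    """Clustering loss L_cls = D_intra / (D_inter + eps) on labelled latent codes.

    z: (N, L) latent representations; labels: (N,) integer class labels.
    D_intra sums the distance of every member to its own class centroid;
    D_inter sums the pairwise Euclidean distances between class centroids.
    """
    classes = np.unique(labels)
    centroids = np.stack([z[labels == k].mean(axis=0) for k in classes])
    # Intra-class cohesion: members should sit close to their centroid.
    d_intra = sum(
        np.linalg.norm(z[labels == k] - centroids[i], axis=1).sum()
        for i, k in enumerate(classes)
    )
    # Inter-class repulsion: centroids should sit far from each other.
    d_inter = sum(
        np.linalg.norm(centroids[i] - centroids[j])
        for i in range(len(classes))
        for j in range(i + 1, len(classes))
    )
    return d_intra / (d_inter + eps)
```

Minimizing the ratio simultaneously tightens each labelled cluster and pushes the clusters apart.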

Training of dynamic β-VAE
The final objective of the dynamic β-VAE with semi-supervision in Eq. (9) has three components, which are introduced during training in three successive stages:

1. Warm-up phase (β = 0, γ = 0): In this phase, the model primarily learns to reconstruct the input data, similar to bottleneck autoencoders.

2. Regularized phase (β > 0, γ = 0): In this phase, the dynamic control of β sets in, which smooths the learned latent space without deterioration in the quality of the reconstructions.

3. Clustering phase (β > 0, γ > 0): In the semi-supervised setting, the clustering loss on the labelled subset is additionally introduced, encouraging latent representations with compact, well-separated clusters.

Experiments & Results
The main objective of the proposed dynamic β-VAE is to cluster unlabelled insect spectra into plausible clusters that could correspond to unique species. To evaluate the performance of the model, we use real data collected from field instruments and compare the model's performance in different settings. Details of the data and experiments are presented in this section.

Data Collection & Pre-processing
The insect data were recorded with a novel instrument from FaunaPhotonics, which uses a similar principle at close and long ranges as the methods described in [15,23,21,11]. In the current implementation, an air volume is illuminated with infrared light in two spectral bands at 808 nm and 975 nm. The back-scatter of any object passing through a 20 l volume within 1 m of the sensor is recorded onto a quadrant photodiode detector. As insects fly past, the optical cross section varies with their WBF. This yields a modulated time series, sampled at 20 kHz with a bandwidth from 0 to 5 kHz. As signals from any non-insect object passing through the volume are also recorded, a CNN trained with manual labels was used to separate insect recordings from rain, dust and other noise. For simplicity, the multi-spectral time series were reduced to one dimension by calculating the average Welch power spectra [33] over both spectral bands in F = 193 bins between 0 and 2 kHz. Finally, the data were log-transformed and individually normalized by the maximum of each spectrum. The unlabelled data were recorded from March to November 2020 in various biotopes in the Öresund region in southern Scandinavia, and N = 40,000 insect recordings were randomly selected for the experiments with F = 193 features after the WBF pre-processing.

Figure 3: Latent representation of optically recorded insect wingbeat frequency spectra. The proposed dynamic β-VAE is able to cluster unlabelled field recordings into compact clusters. Evaluated on labelled data, most species form compact clusters, as shown with different colours for each of the 12 named species groups. Closely related species groups, such as the Lucilia spp. and Muscidae spp., are partially overlapping but well separated from more distant species groups, such as the Bombus spp.
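The pre-processing chain can be sketched with scipy. The sampling rate, bin count and frequency band follow the text; the exact Welch window settings and the order of the normalization and log steps are assumptions, and only a single spectral band is shown (the paper additionally averages the two bands):

```python
import numpy as np
from scipy.signal import welch

FS = 20_000  # sensor sampling rate, Hz

def wbf_spectrum(signal, n_bins=193):
    """Reduce one insect time series to a normalized log power spectrum.

    nperseg=1920 gives ~10.4 Hz frequency bins, so the first 193 bins span
    roughly 0-2 kHz as described in the text.
    """
    freqs, pxx = welch(signal, fs=FS, nperseg=1920)
    spec = pxx[:n_bins] / pxx[:n_bins].max()  # normalize by the spectrum maximum
    return np.log(spec + 1e-12)               # log-transform (offset for stability)
```

Applied to each recording, this yields the fixed-length F = 193 feature vectors that the models consume.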
Additionally, data encompassing 12 different species groups were labelled, one species at a time, in closed cages in Copenhagen, Denmark. From these data, 6000 insect recordings (15% of the unlabelled data) were randomly selected and added to the labelled training set. For each species group, this resulted in 500 labelled recordings to be used in the semi-supervised mode. The average WBF spectra for each labelled species group are shown in Figure 2.
In order to validate the clustering ability of the different models, 8 of the 12 labelled species were included in computing the clustering loss, L_cls in Eq. (11), in the semi-supervised setting. The remaining four labelled species were used as a test set for validating the clustering accuracy.

Experimental set-up
The dynamic β-VAE was evaluated in unsupervised and semi-supervised modes to obtain latent representations, which were clustered using K-means [34]. Their clustering performance was compared with the baseline methods: PCA, Kernel-PCA and HCA, using the standard implementations in sklearn [35], and a conventional VAE on the same data. The encoder neural network q_θ(z|x) consists of 9 fully connected layers with rectified linear unit (ReLU) activations (except for the last layer). The encoder predicts the mean and the variance of the approximate posterior distribution. The decoder neural network p_φ(x|z) is implemented with 10 fully connected layers and ReLU activations (except the last layer, which has a sigmoid activation). The VAE uses a bottleneck of L = 2 to create the latent representation. The model layout was developed on an independent unlabelled dataset recorded at a different location and was gradually expanded until the reconstructions were sufficiently good. In order to visualize the latent representation, the size of the bottleneck of the model (latent dimension) was limited to two. Details of the network architecture are reported in Table 1.
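An abbreviated PyTorch sketch of such an encoder-decoder pair is given below. The F = 193 input size, the L = 2 bottleneck, and the ReLU/sigmoid activations follow the text; the hidden-layer widths and the reduced depth are assumptions, as Table 1 is not reproduced here:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Fully connected encoder predicting the mean and log-variance of q(z|x)."""
    def __init__(self, in_features=193, latent=2, hidden=(128, 64, 32)):
        super().__init__()
        layers, prev = [], in_features
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.body = nn.Sequential(*layers)
        self.mu = nn.Linear(prev, latent)       # no activation on the output heads
        self.log_var = nn.Linear(prev, latent)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    """Fully connected decoder; the sigmoid output matches max-normalized spectra."""
    def __init__(self, out_features=193, latent=2, hidden=(32, 64, 128)):
        super().__init__()
        layers, prev = [], latent
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        layers += [nn.Linear(prev, out_features), nn.Sigmoid()]
        self.body = nn.Sequential(*layers)

    def forward(self, z):
        return self.body(z)
```

With L = 2, the latent codes can be scattered directly for visualization, which is why the bottleneck is kept this small.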
Both the unsupervised and semi-supervised models were run five times on random training and test splits. A random subset of 3000 recordings from the dataset was used as the test set.
After each training run, the latent representation of the unlabelled test set was clustered using the K-means method. The appropriate number of clusters, K, was automatically selected by the maximum average silhouette score [36] from a range of 5-50. As we expect the unlabelled data to consist of at least 5 distinct species, we incorporate this as prior information in choosing the range of clusters. For comparison, the data were also clustered into the same range of clusters using PCA, Kernel-PCA (implemented with sigmoid kernels) and HCA (implemented with complete linkage).
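Selecting K by the maximum average silhouette score can be sketched with scikit-learn (the 5-50 range follows the text; the toy range in the usage below is narrower for speed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def select_k(z, k_range):
    """Cluster latent codes with K-means for each candidate K and keep the K
    with the highest average silhouette score, as described in the text."""
    best_k, best_score = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(z)
        score = silhouette_score(z, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```

For example, on three well-separated synthetic blobs, `select_k(z, range(2, 6))` recovers K = 3.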
The final evaluation was done by comparing the automatically identified clusters with the four labelled test species. The automatically found clusters were compared with the labelled data using the adjusted rand index (ARI) [37] and the adjusted mutual information (AMI) score [38]. These metrics are useful for comparing clusters obtained in unsupervised settings, as they are agnostic to the specific label values and only focus on the similarity between members within the clusters.
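Both scores are available in scikit-learn; a minimal usage sketch with toy labels:

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# True species labels vs. cluster assignments. The cluster IDs themselves are
# arbitrary: both scores are invariant to relabelling the clusters.
true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred = [1, 1, 1, 2, 2, 2, 0, 0, 0]  # a perfect clustering under different names

ari = adjusted_rand_score(true, pred)
ami = adjusted_mutual_info_score(true, pred)
```

Here both scores equal 1.0 despite the permuted cluster IDs, which is exactly the label-agnostic behaviour needed for evaluating unsupervised clusterings.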

Model parameters & hyperparameters
The dynamic β-VAE has several tunable model parameters, as seen in Eq. (3) to Eq. (8). These model parameters were tuned on an independent dataset, collected with identical instrumentation but at a different location. The data had a similar distribution as the data used in this work, and we obtained: a = 0.2, b = 0.05, w_1 = w_2 = 1.2, w_3 = 0.9 and w_4 = 1.1. These parameters were found to be sufficiently robust on the dataset used in this work without any fine-tuning.

Figure 4: While all methods map the high-dimensional input data to a two-dimensional feature space, the dynamic β-VAEs create clusters with less overlap between the species groups. The inclusion of 10% labelled data for training further improves the results, yielding denser clusters with less overlap than the unsupervised β-VAE.
All models were implemented in PyTorch [39] and trained for 5000 epochs using the Adam optimizer [40] with a learning rate of 10^−3. The models were trained on an Nvidia GTX 1050 graphics processing unit with 4 GB of memory, with a batch size of 256. A decision to adapt β was taken every fifth epoch to avoid reacting to random fluctuations. The scaling of the clustering loss, γ, in the semi-supervised mode was cycled between 0.01 and 0.2 every 100 epochs.

Results
The clustering performance on the labelled test set for the unsupervised and semi-supervised instances of the dynamic β-VAE is presented in Table 2.

Table 2: Aggregated results from 5 repetitions of each method. The unsupervised model performs better than the classical models, and adding the labelled data further improves clustering of unlabelled data. The median number of automatically determined clusters (K) is also reported. (ARI and AMI scores: higher is better.)

The dynamic β-VAE performs better than the baselines in the ARI and AMI scores, which quantify intra-class cohesion and inter-class separability. While HCA has previously been used successfully by other groups to identify groups of similar insects [21], it has the lowest ARI score. With semi-supervision, the dynamic β-VAE further improves upon its unsupervised clustering scores, and the improvements compared to the conventional models are more pronounced. The nonlinear kernel-PCA does not show any drastic improvement over conventional PCA. The low-dimensional representations of the test species for each model are shown in Figure 4. While the different species form largely separable and homogeneous clusters in all methods, they are relatively more compact in the semi-supervised implementation.
In the results presented in Table 2, the appropriate number of clusters found in the unlabelled test set data is also reported. The number of clusters, K, was automatically chosen to maximize the average silhouette score [36].
An example of the latent representation of all 12 labelled species groups by the unsupervised instance is shown in Figure 3. All species generate dense, but partly overlapping, clusters except the weevils (Ceutorhynchus spp.) and, to some degree, the fruit flies (Drosophilidae spp.), which form sparser clusters.
The latent space obtained by the semi-supervised β-VAE on the unlabelled test set is shown in Figure 6a. Using K = 15, the data are colour-coded by cluster, and the average spectra from each cluster are shown in Figure 6b.

Discussions
In this work, we introduced a dynamic β-VAE in order to achieve a good trade-off between the reconstruction and regularization loss terms by performing online adjustment of the β term. The proposed β dynamics result in useful latent representations for the downstream clustering task. Our experiments demonstrate the ability of the model to map high-dimensional insect data into a well-regularized latent representation where phylogenetic groups are distinguishable.

Generalization of the β dynamics
The primary objective of using the β dynamics in Eq. (4) is to perform online adjustment of the scales of the reconstruction and regularization terms based on their instantaneous values, while taking the previous optima into account. The specific formulation of these control mechanisms in Eqs. (6)-(8) forces the model optimization to not deviate from the previous optimal solutions. The terms comprising the min over the (: t − 1) epochs in Eqs. (6) and (7) provide a form of memory of the previous local optima. The trade-off between long- and short-term memory of the losses and the corresponding optima helps the model steer towards more global optima. These equations provide a sufficiently general formulation for adjusting β, as they only depend on the two loss components. Further, one can also envision a learnable neural network with long short-term memory (LSTM) performing this dynamic control in a recurrent neural network formulation of a closed-loop control system [41].

Influence of the β and γ parameters
As seen in Figure 1, the initial effect of a dynamic β is similar to β-annealing, where β gradually increases during training in order to prevent posterior collapse [42]. However, the key difference from β-annealing is that the rate of annealing is not predetermined, as the β dynamics described in Sec. 3.1 enables self-regulation of β. This is witnessed during the latter part of training, where β repeatedly adapts by either increasing or decreasing its value depending on the changes in the reconstruction and regularization terms. This behaviour is similar to what is reported with the cyclic β-VAE [28], where the shake-up often allows the model to reach a new global minimum loss. The similarity to the cyclic β-VAE is further enhanced by automatically increasing β when there has been no change for a large number of epochs (500 in our case). However, unlike a cyclic β-VAE, β is only cycled if it has reached a stationary condition and ∆_reg and ∆_rec are within limits. This helps both the unsupervised and semi-supervised instances of the dynamic β-VAE to latch on to the historical minima of both L_rec and L_reg.
The β dynamics introduced in Sec. 3.1 is also similar to the adaptive strategies used in models such as the controlVAE [29]. A controlVAE stabilizes the model by adjusting β to keep the regularization loss (KLD) at a constant level. However, finding an appropriate KLD level can be difficult, as it could vary across datasets and downstream tasks. In contrast, the dynamic β-VAE keeps the model stable by constantly comparing both L_rec and L_reg with their historical minima. A gain in either loss term, at the expense of the other, is counteracted by adjusting β. This self-regulating β dynamics, which does not depend on fixing a KLD value, is an advantage of our formulation.
Including the additional scaled clustering loss term γ^(t) L_cls in Eq. (9) further improved the clustering performance of the model. In this implementation, γ was cycled between 0.01 and 0.2 in order to keep the contribution from L_cls in a similar range as L_rec and L_reg. A logical next step could be to expand Eq. (4) to include a dynamically adjusted γ^(t) L_cls term. This would, however, make the model less general across other tasks.

Performance comparison
Comparing the performance of the models reported in Table 2, the dynamic β-VAE performs better than the baseline models. Adding 15% labelled data from 8 different species to the training set further improves the clustering performance. While HCA on high-dimensional frequency spectra has been used successfully to identify mosquito clusters in field data [21] and for biodiversity evaluation [22], our results show better performance for PCA + K-means. While the Kernel-PCA generally produced more heterogeneous latent distributions, it did not show any significant improvement over the standard PCA.
The non-adaptive β-VAE showed large variation in its performance but was, on average, comparable with the conventional methods. The dynamic β-VAEs kept β < 1 during most of the training, as exemplified in Figure 1; since a higher β favours a well-generalised latent space over good reconstructions, a reduction in clustering performance could be expected. Additionally, the non-adaptive VAE was more cumbersome to train, as the model collapsed frequently during training.

Selection of number of clusters
In this work, the appropriate number of clusters was automatically selected by maximizing the average silhouette score. However, when manually evaluating the average silhouette score and comparing it with commonly used empirical measurements, such as the elbow method [43] and the intra-cluster sum-of-squares, a user might typically identify a higher number of clusters. Having more clusters yields more similar recordings within each cluster. An example is shown in Figures 6a and 6b, where the number of clusters was manually selected. As in previous work by lidar-entomologists [21], some species groups can be identified from these clusters at this level by comparing the average spectra of each cluster with known data. For example, a cluster of possible mosquitoes can be identified by their high wing-beat frequency in the lower right corner of Figure 6b.
With the 3000 randomly chosen insect recordings from multiple sites during summer, we expect the total number of species represented in the test set to be one or several orders of magnitude larger than the range of clusters tested. We tested a range of 5-50 clusters, as even a reasonably coarse clustering will be useful for quantifying biodiversity. Once fully deployed on a network of insect sensors, a dataset recorded in an environment with high biodiversity should yield more clusters than a dataset captured in a biologically poor environment. This would allow the automated and optically recorded insect data to be correlated with conventional monitoring methods and greatly improve the ability to monitor insect biodiversity at scale.

Computation time and inference
Computation time for PCA + K-means is significantly shorter than for HCA on the full spectra [21,22]. While the initial training of the dynamic β-VAE takes a few hours, depending on the number of epochs, once trained, the inference time of the model is comparable with that of PCA + K-means. Once deployed in the field, the dynamic β-VAE model is not expected to be retrained regularly but to be used as a dimensionality reduction method. Therefore, inference time is a more important metric than the initial training time.

Exploring the latent space
Samples generated from the latent space of the semi-supervised model are shown as a latent space cart-wheel in Figure 5. Traversing different lines in the latent space results in samples that smoothly transition between different spectra types. As a side note, we point out that the two latent features do not appear to be entirely disentangled; this manifests as dense islands and sparse regions of spectra in the latent space. For our downstream clustering task, fully disentangled features are not required. However, one could introduce an additional loss component that enforces orthogonality between the different latent dimensions to achieve improved disentanglement.
The latent representation of the unsupervised model can be further validated by comparing Figure 3 with the average spectra of each group, shown in Figure 2. Species groups with similar spectra, such as Aleyrodidae spp., Aphididae spp., Tortricidae spp. and Chrysopidae spp., are positioned in similar areas. Similarly, all dipterans (Lucilia spp., Muscidae spp. and Drosophila spp.) have overlapping clusters, except the mosquitoes (Culicidae spp.), which have a much higher WBF and are more isolated. In Figure 3, the model performs less well on the weevils (Ceutorhynchus spp.) compared to the other species. This is likely due to a larger variation in their WBF than for the other groups.

Conclusions
In this work, we have presented, to our knowledge, the first VAE designed for clustering of optically recorded insect signals. The dynamic β-VAE was developed in order to achieve a stable model, optimizing both the reconstruction and regularization terms. When trained on unlabelled data recorded under field conditions, the model is able to automatically create meaningful clusters. The unsupervised clustering performance was validated with labelled data collected under controlled conditions, showing promising results. By using 15% labelled data during the training process for semi-supervision, the clustering performance is further improved. Already at this stage, it is possible to extract easily identifiable insect groups, such as mosquitoes, from the automatically identified clusters, and we expect this capability to grow as the collection of labelled reference data continues.
The future aim of our work is to further improve automatic identification of the number of clusters. In the near future, the model will be deployed on several sensors in the field, and estimated cluster sizes and distributions will be compared to conventional methods. This will greatly improve monitoring possibilities and decision support tools for entomologists and agronomists. In order to mitigate the trend of declining insect communities, the first step is to ensure adequate data collection. In the near future, we believe methods based on the proposed dynamic β-VAE will be useful to quantify and, thus, help conserve insect biodiversity.