Generative modeling of living cells with SO(3)-equivariant implicit neural representations

Data-driven cell tracking and segmentation methods in biomedical imaging require diverse and information-rich training data. In cases where the number of training samples is limited, synthetic computer-generated data sets can be used to improve these methods. This requires the synthesis of cell shapes as well as corresponding microscopy images using generative models. To synthesize realistic living cell shapes, the shape representation used by the generative model should be able to accurately represent fine details and changes in topology, which are common in cells. These requirements are not met by 3D voxel masks, which are restricted in resolution, and polygon meshes, which do not easily model processes like cell growth and mitosis. In this work, we propose to represent living cell shapes as level sets of signed distance functions (SDFs) which are estimated by neural networks. We optimize a fully-connected neural network to provide an implicit representation of the SDF value at any point in a 3D+time domain, conditioned on a learned latent code that is disentangled from the rotation of the cell shape. We demonstrate the effectiveness of this approach on cells that exhibit rapid deformations (Platynereis dumerilii), cells that grow and divide (C. elegans), and cells that have growing and branching filopodial protrusions (A549 human lung carcinoma cells). A quantitative evaluation using shape features and Dice similarity coefficients of real and synthetic cell shapes shows that our model can generate topologically plausible complex cell shapes in 3D+time with high similarity to real living cell shapes. Finally, we show how microscopy images of living cells that correspond to our generated cell shapes can be synthesized using an image-to-image model.


Introduction
Accurate and reliable segmentation of biomedical optical microscopy images is a challenging task (Meijering, 2012, 2020), which is extremely time-consuming and tedious when performed manually, especially on 3D and 3D+time data (Coutu and Schroeder, 2013; Webb et al., 2003). There is a clear need for automatic segmentation methods, and deep learning methods have shown promising results (Stringer et al., 2021). However, these methods require large sets of information-rich and varied training data (Kozubek, 2016), consisting of pairs of microscopy images and their target segmentation masks.
There have been several efforts to collect large community data sets of images and masks, such as the Broad Bioimage Benchmark Collection or BioImage.IO. In case images are available but manual annotation is infeasible, masks generated by automatic algorithms are sometimes provided as a silver standard (Ulman et al., 2016; Burgos and Svoboda, 2022). For example, this approach was taken for a variety of optical microscopy scenarios in the Cell Tracking Challenge, where silver-standard corpora were used for training new segmentation methods (Arbelle and Raviv, 2019; Löffler and Mikut, 2022).
The synthesis of new data pairs typically has two stages. The first stage consists of synthesizing a completely new shape (in 2D or 3D) or shape sequence (when time is included). In the second stage, these shapes are used to synthesize textured cell images that are visually similar to real microscopy images. GANs excel at this second stage, owing in part to image-to-image translation methods such as pix2pix (Isola et al., 2017), which have found widespread application in microscopy image synthesis. However, the first stage requires fundamentally different models and, most importantly, shape representations that can deal with the variation in cell shapes. As exemplified in Fig. 1, cells exhibit a range of visually observable phenomena including global cell body changes such as cell growth, cell division, deformation in cell tissue, or projections of the leading edge during cell motility, as well as localized body changes, such as blebbing (i.e., randomly growing and retracting blobs on the cell surface) or growing filopodia (i.e., highly motile thin protrusions budding off the main cell body).
In this work, we synthesize living cell shapes in 3D+time, learning from user-provided examples such as segmentation masks of real cells. We propose to model the cell surface as the zero level set of a continuous signed distance function (SDF) in 3D space and time. Following the DeepSDF model proposed by Park et al. (2019), we represent this SDF in an implicit neural representation (INR). Such an INR consists of a multilayer perceptron (MLP) that we jointly optimize with a latent space using a large set of 3D cell shape time-lapse sequences. Once trained, owing to the continuous implicit representation, the optimized model can be used to synthesize completely new cell shape sequences at any given spatial or temporal resolution. In contrast to voxel-based or mesh-based cell shape representations with discretized sampling rates, our approach alleviates the limitations on spatial and temporal resolution and allows us to model and represent cell shapes at an arbitrary level of detail. By parametrizing the cell surface implicitly through differentiable and, therefore, trainable neural networks, the complexity of the resulting model is independent of the spatial and temporal resolution and instead scales with the complexity of the cell surface. Hence, in contrast to voxel-based or mesh-based representations, the resulting network has fixed memory requirements.
We have previously proposed to represent living cell shapes using INRs (Wiesner et al., 2022). This work extends our previous work in several ways. First, we have adapted our model to be independent of spatial rotation of the cell. That is, we disentangle latent code and cell rotation so that two shapes that are rotated versions of each other will always be represented by the same latent code. This is a logical requirement in cell shape synthesis, where there is no canonical orientation, and it allows us to learn more descriptive latent codes. Second, we have extended our data set with additional cell types so that we now optimize our model on three widely different cell shapes: Platynereis dumerilii embryo cells, C. elegans embryo cells, and A549 human lung carcinoma cells. Third, we have substantially extended our quantitative evaluation and included an ablation study in which we evaluate the effect of rotation equivariance on the compactness of the learned latent space. We show how, with these additions, we can generatively model different types of living cells with SO(3)-equivariant INRs and use the synthesized shapes as input to a GAN model to synthesize pairs of time-lapse images and segmentation masks.

Related work
We summarize existing explicit cell shape representation and synthesis methods and provide a brief introduction to related work in implicit neural representations. Moreover, we provide an introduction to equivariance and invariance in geometric deep learning models.

Explicit representations
Cell shape masks organized on pixel or voxel grids are the predominant standard for cell shape representation. The shape and resolution of the masks are chosen so that they match those of the original microscopy image, allowing easy correspondence matching and overlaying of the original image. While this has advantages, the memory requirements of such a representation grow quadratically (in 2D) or cubically (in 3D) with resolution. Hence, to represent fine details, such as a cell body with thin protrusions, memory requirements explode. This is also the case when the shape needs to evolve over time (Svoboda and Ulman, 2017), in which case a denser grid makes it less likely for the shape boundary pixels to fall off the grid and to lose boundary localization precision. Voxel grids have been used in Cellular Potts Models (CPMs), which simulate time-resolved 3D cell shapes within a cell population (Merks and Glazier, 2005; Swat et al., 2012; Starruß et al., 2014; Svoboda and Nečasová, 2020). Deep learning models using GANs for cell shape synthesis also rely on grids (Wiesner et al., 2019a; Fu et al., 2018; Baniukiewicz et al., 2019). These models are able to synthesize static cell shapes in 2D, pseudo-3D, and 3D. In the pseudo-3D approach, individual 2D slices are synthesized and subsequently composed into a 3D volume. One significant drawback of grid representations is that there are no topological guarantees, which might result in disconnected components.
An alternative way to represent cell shapes is to represent them as a combination of (overlapping) spheres (Dufour et al., 2005). Several physics-oriented systems developed to simulate and study cell populations have opted to use this representation of the shape (Van Liedekerke et al., 2015; Ghaffarizadeh et al., 2018). Other lattice-free models use ellipsoids, in particular to represent cell nuclei (Böhland et al., 2019; Dunn et al., 2019; Han et al., 2019). A drawback of such representations is their limited ability to represent cells with protrusions, blebs, and other fine details. Polygonal meshes provide an alternative choice when detailed representation of 2D manifolds in 3D is desired. The approach is very well established for static shapes, but living cell shapes such as we consider in this paper are challenging to model with triangular meshes (Li and Kim, 2016; Sorokin et al., 2018), and often lead to intersecting faces and other mesh artifacts (Li and Kim, 2016). Alternatively, cells can be represented using spherical harmonics (Ducroz et al., 2012).

Implicit neural representations
Implicit neural representations (INRs), also called coordinate networks or neural fields, have recently become a popular choice to represent signals in space and, optionally, time (Xie et al., 2022). INRs are based on the idea that a multilayer perceptron with coordinates as input can universally represent functions on a domain. However, in practice, multilayer perceptrons suffer from spectral bias, which means that they have difficulties representing high-frequency signals. Efforts to overcome this bias have focused on positional encoding of input coordinates (Mildenhall et al., 2020; Tancik et al., 2020) or the use of alternative activation functions (Sitzmann et al., 2020). INRs can be used to represent any function in any space, which has led to a range of applications in medical imaging. For example, INRs can represent sinograms for CT reconstruction (Sun et al., 2021), MRI images obtained from k-space measurements (Shen et al., 2022), deformation fields in image registration (Wolterink et al., 2022), or outputs in image-to-image synthesis (Chen et al., 2023). Of particular interest to the current work are INR representations of manifolds in space, typically via the implicit representation of an SDF. Gropp et al. (2020) showed how, based on a limited number of training points, an INR can represent a continuous manifold in space. Park et al. (2019) demonstrated how an INR can be coupled with a latent space by conditioning the multilayer perceptron on a latent code, a feature that is critical to the current work. Recently, Erkoç et al. (2023) proposed an approach for unconditional generative modeling using INRs. First, a set of MLPs is optimized to implicitly represent given data samples. Subsequently, a diffusion model is trained on the optimized weights of the MLPs to model the underlying distribution. The resulting diffusion model is then used to synthesize new MLP weights that represent an INR to generate new data samples in 3D and 3D+time.

Equivariant and invariant shape learning
Geometric deep learning is a learning paradigm in which neural networks are constructed under consideration of symmetry, i.e., by specifying groups of transformations to which the networks are supposed to be equivariant or invariant. A neural network is called equivariant if transforming its input results in the same transformation applied to its output and invariant if transforming the input has no effect on its output. For instance, when we use a neural network to classify images of cats and dogs, the neural network should be invariant to the rotations of the animals in the picture. Furthermore, when performing segmentation using a neural network, the neural network should output a rotated segmentation when the input image is rotated. In other words, the neural network should be equivariant to rotations. Finally, invariance and equivariance play a role in shape representation. For example, shapes do not intrinsically change when an object is rotated. Consequently, if we have two rotated instances of an otherwise identical shape, we want them to share the same representation that is independent of rotations. Such symmetry can be induced in neural networks in different ways, e.g., by imposing and solving linear constraints on the trainable parameters (Finzi et al., 2021). In the context of conditional shape encoding and decoding, Atzmon et al. (2022) constructed equivariant and invariant layers via so-called frame averaging. Deng et al. (2021) enforce rotational symmetry between latent code and shape by casting latent codes to Euclidean vectors for which rotation is well-defined. In this work, instead of coupling the latent code with shape orientation, we aim to explicitly decouple it, achieving task-specific (approximate) independence of rotation.
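The distinction between equivariance and invariance can be made concrete with a toy point cloud. The following minimal sketch is not part of the proposed method; the two features (centroid and radius of gyration) and all names are chosen here purely for illustration. Rotating the input rotates the centroid in the same way (equivariance) and leaves the radius of gyration unchanged (invariance).

```python
import numpy as np

def centroid(points):
    """Rotation-equivariant feature: rotating the points rotates the centroid."""
    return points.mean(axis=0)

def radius_of_gyration(points):
    """Rotation-invariant feature: unchanged when the points are rotated."""
    centred = points - points.mean(axis=0)
    return np.sqrt((centred ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3)) + np.array([1.0, 2.0, 3.0])

theta = np.pi / 3                      # rotation by 60 degrees about the z axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
rotated = points @ R.T                 # each row x becomes R x

equivariant_ok = np.allclose(centroid(rotated), R @ centroid(points))
invariant_ok = np.isclose(radius_of_gyration(rotated), radius_of_gyration(points))
```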

Method
This work describes a method to model the surface of living cells in space and time. We represent the time-evolving cell surface implicitly as the zero level set of a time-dependent continuous SDF that is parametrized by a neural network. We condition the neural network on a learned latent code that describes the dynamics of the cell in space and time. By sampling new codes from the learned latent distribution and using these as input to the model, we can synthesize new and unseen time-evolving shapes. Furthermore, we disentangle the latent code and the rotation of a shape. This facilitates learning a more compact latent space, as all rotations of a particular shape are represented by a single latent code, with rotation explicitly defined by a separate parameter. This results in a rotation-equivariant implicit representation of time-evolving cell shapes. Figure 2 shows a diagram of the network.

Signed distance function
We propose to represent the evolution of a cell surface as the zero level set of its time-evolving SDF. The SDF is a continuous function that, for any point in space, gives the signed Euclidean distance to the nearest point on the cell surface. By convention, its sign is positive for points outside the shape and negative for points inside the shape. Here we also consider that the SDF of a living cell at a particular point in space evolves over time. More precisely, let $\Omega = [-1, 1]^3$ be a spatial domain, $\tau = [-1, 1]$ a temporal domain, and $M_t$ be a 2D manifold embedded in $\Omega$ at time $t \in \tau$. For any point $\mathbf{x} = (x, y, z) \in \Omega$, the function $\mathrm{SDF}_{M_t} : \Omega \to \mathbb{R}$ is defined as

$$
\mathrm{SDF}_{M_t}(\mathbf{x}) =
\begin{cases}
\min_{\mathbf{u} \in M_t} \lVert \mathbf{x} - \mathbf{u} \rVert_2, & \mathbf{x} \text{ outside } M_t, \\
0, & \mathbf{x} \text{ belonging to } M_t, \\
-\min_{\mathbf{u} \in M_t} \lVert \mathbf{x} - \mathbf{u} \rVert_2, & \mathbf{x} \text{ inside } M_t.
\end{cases}
\tag{1}
$$

The zero level set, and thus the surface of the cell at time $t$, is represented by all points where $\mathrm{SDF}_{M_t}(\cdot) = 0$.
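To illustrate the sign convention, the following minimal sketch evaluates the analytic SDF of a sphere whose radius changes with time. The sphere, its radius schedule, and the function name `sphere_sdf` are assumptions made for illustration; they are a toy stand-in for a growing cell and are not data from this work.

```python
import numpy as np

def sphere_sdf(points, t, r0=0.3, growth=0.1):
    """Signed distance to a sphere centred at the origin whose radius grows with t.

    points: (N, 3) array of coordinates in [-1, 1]^3; t: scalar in [-1, 1].
    The radius schedule r0 + growth * (t + 1) is a toy stand-in for cell growth.
    """
    radius = r0 + growth * (t + 1.0)
    return np.linalg.norm(points, axis=1) - radius

pts = np.array([[0.0, 0.0, 0.0],   # inside the shape
                [0.3, 0.0, 0.0],   # on the surface at t = -1
                [1.0, 0.0, 0.0]])  # outside the shape
sdf_start = sphere_sdf(pts, t=-1.0)  # radius 0.3
sdf_end = sphere_sdf(pts, t=1.0)     # radius 0.5
```

Negative values mark the interior, zero marks the surface, and positive values mark the exterior; the point that lies on the surface at the first time point is inside the shape at the last one.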

Implicit neural representations
Recent works have shown that the function $\mathrm{SDF}_{M_t}(\mathbf{x})$ can be approximated using a multilayer perceptron (MLP) $f_\theta$ with trainable parameters $\theta$ (Sitzmann et al., 2020; Park et al., 2019). Such an MLP, called an INR, takes a coordinate vector $\mathbf{x}$ as input and provides an approximation of $\mathrm{SDF}_{M_t}(\mathbf{x})$ as output. We here propose to also condition the MLP on a time parameter $t \in \tau$ to provide an approximation of the time-evolving SDF of $M_t$ for arbitrary $t \in \tau$. Hence, we regress the SDF values at a position at a certain time. In addition, the MLP can be conditioned on a latent space vector $\mathbf{z}$ drawn from a multivariate Gaussian distribution with a spherical covariance $\sigma^2 I$, where $I$ is the identity matrix. This latent code can be thought of as a low-dimensional encoding of a shape. By conditioning the MLP on a latent code, we are able to optimize a single model for a distribution of time-evolving shapes.
Combining all these terms results in an MLP $f_\theta(\mathbf{x}, t, \mathbf{z})$ that approximates the time-evolving SDF of the manifold $M_t$ for arbitrary $t \in \tau$, given latent space vector $\mathbf{z}$. Now, we describe how we optimize such a model for cell shape sequences $M_t$ using an auto-decoder strategy (Park et al., 2019).

Fig. 2: [...] SO(3)-equivariant extension (in blue). The neural network $f_\theta$ is given a latent code $\mathbf{z}$ sampled from a multivariate normal distribution, coordinates $\mathbf{x} = (x, y, z)$ from a spatial domain $\Omega$, and a temporal coordinate $t$ from temporal domain $\tau$. Moreover, the latent code $\mathbf{z}$ is given to the network not only at the input but also at its fifth and eighth layer. The network is optimized to output the SDF values at given points, whereas the latent codes are jointly optimized to match a given normal distribution. The trained network is able to output SDF values based on a given latent code at any coordinate in the space-time domain. When given new latent codes from the latent space, the trained network is able to infer new spatio-temporal SDFs and thus produce new time-evolving shapes. The equivariant extension modifies the sampling procedure of the spatial coordinates and introduces a rotation matrix $R \in SO(3)$. Rotating the spatial coordinates $\mathbf{x}$ by a rotation matrix $R$ results in new rotated spatial coordinates $\mathbf{x}'$. The rotation matrix $R$ is optimized along with the network weights $\theta$ and the latent codes $\mathbf{z}$. During inference, the network reconstructs an SDF of a time-evolving cell shape according to a randomly sampled latent code $\mathbf{z}$ and given spatial coordinates $\mathbf{x}$, temporal coordinate $t$, and rotation matrix $R$.

Network optimization
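We optimize the auto-decoder given a training set consisting of $N$ cell shape sequences $\{M^i\}_{i=1}^N$. For each cell shape sequence, reference values of its time-evolving SDF are known at a finite and discrete set of points in $\Omega \times \tau$. An important aspect of the auto-decoder model is that not only the parameters $\theta$ of the MLP are optimized during training, but also the latent code $\mathbf{z}_i$ for each training sequence $M^i$. The loss function therefore consists of two components. The first component is the reconstruction loss that computes the $L_1$ distance between reference SDF values and their approximation by the MLP, i.e.,

$$
\mathcal{L}_{\mathrm{rec}}\big(f_\theta(\mathbf{x}, t, \mathbf{z}), \mathrm{SDF}_{M_t}(\mathbf{x})\big) = \big\lVert f_\theta(\mathbf{x}, t, \mathbf{z}) - \mathrm{SDF}_{M_t}(\mathbf{x}) \big\rVert_1.
\tag{2}
$$

The second component is given by

$$
\mathcal{L}_{\mathrm{code}}(\mathbf{z}, \sigma) = \frac{1}{\sigma^2} \lVert \mathbf{z} \rVert_2^2.
\tag{3}
$$

This term, with regularization constant $\frac{1}{\sigma^2}$, ensures that a compact latent space is learned and improves the speed of convergence (Park et al., 2019). The parameter $\sigma^2$ in the regularization term $\mathcal{L}_{\mathrm{code}}$ corresponds to the variance of the Gaussian distribution used for sampling the latent codes. During training of the auto-decoder, we have access to a training set of $N$ cell shape sequences and thus the full loss function is

$$
\mathcal{L}\big(\theta, \{\mathbf{z}_i\}_{i=1}^N, \sigma\big) = \mathbb{E}_{(\mathbf{x}, t)} \left( \sum_{i=1}^N \Big[ \mathcal{L}_{\mathrm{rec}}\big(f_\theta(\mathbf{x}, t, \mathbf{z}_i), \mathrm{SDF}_{M^i_t}(\mathbf{x})\big) + \mathcal{L}_{\mathrm{code}}(\mathbf{z}_i, \sigma) \Big] \right),
\tag{4}
$$

where a latent code $\mathbf{z}_i$ is optimized for each shape sequence $M^i$, $i = 1, \dots, N$. After optimization, we obtain an MLP $f_\theta$ that is able to approximate the SDF of a shape $M^i_t$, given latent vector $\mathbf{z}_i$, spatial coordinates $\mathbf{x}$, and time coordinate $t$, i.e.,

$$
f_\theta(\mathbf{x}, t, \mathbf{z}_i) \approx \mathrm{SDF}_{M^i_t}(\mathbf{x}).
\tag{5}
$$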
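The two loss components can be sketched in a few lines. This is a minimal NumPy illustration rather than the training code used in this work: the expectation over sampled points is approximated by a mean, and the function names and toy numbers are assumptions introduced here.

```python
import numpy as np

def reconstruction_loss(pred_sdf, ref_sdf):
    """Mean L1 distance between predicted and reference SDF values."""
    return np.mean(np.abs(pred_sdf - ref_sdf))

def code_loss(z, sigma):
    """Latent regulariser (1 / sigma^2) * ||z||_2^2."""
    return np.sum(z ** 2) / sigma ** 2

def total_loss(pred_per_seq, ref_per_seq, codes, sigma):
    """Sum over sequences of reconstruction loss plus code regulariser."""
    return sum(
        reconstruction_loss(p, r) + code_loss(z, sigma)
        for p, r, z in zip(pred_per_seq, ref_per_seq, codes)
    )

# Two toy sequences with two sampled points each (values chosen for illustration).
pred = [np.array([0.1, -0.2]), np.array([0.0, 0.5])]
ref = [np.array([0.0, -0.2]), np.array([0.1, 0.3])]
codes = [np.array([0.01, 0.0]), np.array([0.0, 0.02])]
loss = total_loss(pred, ref, codes, sigma=0.1)
```

In an auto-decoder, gradients of this loss would be taken with respect to both the network weights and the per-sequence latent codes; the sketch only evaluates the loss value.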

Network architecture
We represent the function $f_\theta(\mathbf{x}, t, \mathbf{z}_i)$ by an MLP, which can have any finite depth and width. Here, in all experiments, the MLP has 9 hidden layers, each containing 128 units. In the hidden layers, we use a sine periodic activation function, which was shown to be able to better represent finely detailed surfaces of complex shapes compared to the commonly used rectified linear unit (ReLU) (Sitzmann et al., 2020; Mildenhall et al., 2020; Wiesner et al., 2022). The weights of layers using sine activations are initialized with a parameter $\omega_0$, which controls the angular frequency of the sine functions. This parameter directly affects the range of frequencies that the model is able to represent, where low values encourage low frequencies and smooth surfaces, and high values favor high frequencies and finely detailed surfaces. As proposed in (Sitzmann et al., 2020), we initialize the weight matrices of shape $(c_{\mathrm{out}} \times c_{\mathrm{in}})$ in hidden layers by drawing from a uniform distribution

$$
W \sim \mathcal{U}\left(-\frac{\sqrt{6 / c_{\mathrm{in}}}}{\omega_0}, \frac{\sqrt{6 / c_{\mathrm{in}}}}{\omega_0}\right),
$$

while we draw the weights in the input layer from

$$
W \sim \mathcal{U}\left(-\frac{1}{c_{\mathrm{in}}}, \frac{1}{c_{\mathrm{in}}}\right).
$$

We inject latent code vectors $\mathbf{z}_i$ in the first, fifth, and eighth layer of the network to improve reconstruction accuracy (Park et al., 2019). Moreover, we concatenate the coordinates $\mathbf{x}$ and $t$ and inject them into all hidden layers. Preliminary experiments found this to be a requirement for convergence on long spatiotemporal sequences.
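The architecture described above can be sketched as a plain NumPy forward pass. This is a hedged illustration, not the PyTorch implementation used in this work: the exact wiring of the injections, the zero bias initialization, the linear output layer, and the initialization bounds (taken here from the SIREN scheme, $\pm 1/c_{\mathrm{in}}$ for the input layer and $\pm\sqrt{6/c_{\mathrm{in}}}/\omega_0$ elsewhere) are assumptions, so the parameter count of this sketch need not match the reported model.

```python
import numpy as np

LATENT_DIM, HIDDEN, N_HIDDEN, OMEGA_0 = 64, 128, 9, 30.0
LATENT_LAYERS = {1, 5, 8}   # 1-based hidden layers that receive the latent code

def init_siren(rng):
    """Draw weights with a SIREN-style uniform initialisation (biases set to zero)."""
    layers = []
    for k in range(1, N_HIDDEN + 1):
        c_in = 4 if k == 1 else HIDDEN + 4          # (x, y, z, t) is fed to every hidden layer
        if k in LATENT_LAYERS:
            c_in += LATENT_DIM
        bound = 1.0 / c_in if k == 1 else np.sqrt(6.0 / c_in) / OMEGA_0
        layers.append((rng.uniform(-bound, bound, (HIDDEN, c_in)), np.zeros(HIDDEN)))
    bound = np.sqrt(6.0 / HIDDEN) / OMEGA_0
    layers.append((rng.uniform(-bound, bound, (1, HIDDEN)), np.zeros(1)))  # linear output
    return layers

def forward(layers, coords, z):
    """coords: (N, 4) rows of (x, y, z, t); z: (LATENT_DIM,) latent code."""
    z_rep = np.broadcast_to(z, (coords.shape[0], LATENT_DIM))
    h = coords
    for k, (w, b) in enumerate(layers[:-1], start=1):
        parts = [h] if k == 1 else [h, coords]      # re-inject coordinates after layer 1
        if k in LATENT_LAYERS:
            parts.append(z_rep)
        h = np.sin(OMEGA_0 * (np.concatenate(parts, axis=1) @ w.T + b))
    w, b = layers[-1]
    return (h @ w.T + b)[:, 0]                      # one SDF value per input point

rng = np.random.default_rng(0)
layers = init_siren(rng)
sdf_values = forward(layers, rng.uniform(-1, 1, (10, 4)), rng.normal(0.0, 0.01, LATENT_DIM))
```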

Rotation equivariance
Thus far, we have described a model that can be optimized to jointly learn reconstructions and learn a latent space of cell shape sequences. However, during training, we assign each shape a latent code $\mathbf{z}$ and optimize the latent space only for compactness. Therefore, it might happen that a simple rotation of the same cell is described by $f_\theta(\mathbf{x}, t, \hat{\mathbf{z}})$ where $\hat{\mathbf{z}} \neq \mathbf{z}$, or that there is no latent code representing the rotated shape time series at all. In other words, even though a rotated shape time series is, in essence, the same shape time series, the latent code $\mathbf{z}$ does not represent the identity of the shape time series, and the model has to learn the rotated shape time series as well. To make sure that the rotated shape time series are also included in the model and to let $\mathbf{z}$ represent the identity of a shape time series, we propose the following equivariant model.
Let $R \in SO(3)$ be a rotation matrix, where $SO(3)$ is the 3D rotation group representing all rotations about the origin in 3D Euclidean space. Every rotation matrix $R \in \mathbb{R}^{3 \times 3}$ is orthogonal, i.e., its transpose is equal to its inverse, which corresponds to a rotation in the opposite direction. Now, assume we rotate a shape described by the zero level set of $f_\theta(\mathbf{x}, t, \mathbf{z})$ via the rotation matrix $R$. The resulting shape is described by

$$
R \, \{\mathbf{x} \mid f_\theta(\mathbf{x}, t, \mathbf{z}) = 0\} = \{R\mathbf{x} \mid f_\theta(\mathbf{x}, t, \mathbf{z}) = 0\}
$$

(where we let $R$ act on the whole set by abuse of notation). By substituting $\mathbf{x}' = R\mathbf{x}$, so that $\mathbf{x} = R^T \mathbf{x}'$, we then see that the rotated shape is described by

$$
\{\mathbf{x}' \mid f_\theta(R^T \mathbf{x}', t, \mathbf{z}) = 0\}.
$$

Hence, to represent rotated shape time series, we only need to apply a (transposed) rotation matrix to $\mathbf{x}$ in $f_\theta(\mathbf{x}, t, \mathbf{z})$.
As seen in Fig. 2, we achieve (approximate) equivariance by adding a rotation matrix $R$ to our model that transforms our SDF $f_\theta(\mathbf{x}, t, \mathbf{z})$ to the SDF corresponding to the rotated shape time series, $f_\theta(R^T \mathbf{x}, t, \mathbf{z})$. We optimize this model similarly to the non-equivariant model, with some important differences. First, we assign a rotation matrix $R$ to each training time series and optimize this rotation matrix during training. We parametrize $R$ using rotation angles $\alpha_i, \beta_i, \gamma_i$ around axes $x$, $y$, and $z$, respectively. Using $R_{\varphi_i}$ as the rotation matrix induced by the rotation angles $\varphi_i = (\alpha_i, \beta_i, \gamma_i)$, the loss function we obtain is:

$$
\mathcal{L}\big(\theta, \{\mathbf{z}_i\}_{i=1}^N, \{\varphi_i\}_{i=1}^N, \sigma\big) = \mathbb{E}_{(\mathbf{x}, t)} \left( \sum_{i=1}^N \Big[ \mathcal{L}_{\mathrm{rec}}\big(f_\theta(R_{\varphi_i}^T \mathbf{x}, t, \mathbf{z}_i), \mathrm{SDF}_{M^i_t}(\mathbf{x})\big) + \mathcal{L}_{\mathrm{code}}(\mathbf{z}_i, \sigma) \Big] \right).
$$

We optimize angles $\alpha_i, \beta_i, \gamma_i$ along with latent codes $\mathbf{z}_i$ and network weights $\theta$. The angles $\alpha_i, \beta_i, \gamma_i$ are initialized from $\mathcal{N}(0, \frac{\pi^2}{64})$ as suggested in (Bepler et al., 2019). Note that we do not assume a canonical orientation for cell shapes, and angles can be different for each cell sequence.
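The role of the transposed rotation can be checked numerically. The sketch below builds a rotation matrix from three angles and verifies that a point on the rotated surface is mapped back onto the zero level set when the field is evaluated at $R^T \mathbf{x}'$. The composition order of the axis rotations and the toy ellipsoid field (an implicit function, not a true SDF and not a trained network) are assumptions made for illustration.

```python
import numpy as np

def rotation_from_angles(alpha, beta, gamma):
    """Rotation matrix composed from rotations about the x, y, and z axes."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return rz @ ry @ rx   # the composition order is an assumption of this sketch

def ellipsoid_field(points):
    """Toy implicit shape: the zero level set is an ellipsoid elongated along x."""
    return np.sqrt((points[:, 0] / 0.8) ** 2 + (points[:, 1] / 0.4) ** 2
                   + (points[:, 2] / 0.4) ** 2) - 1.0

rng = np.random.default_rng(0)
angles = rng.normal(0.0, np.pi / 8, size=3)   # std pi/8, i.e. variance pi^2 / 64
R = rotation_from_angles(*angles)

tip = np.array([[0.8, 0.0, 0.0]])             # lies on the unrotated surface
rotated_tip = tip @ R.T                       # the same surface point after rotation by R
value = ellipsoid_field(rotated_tip @ R)      # evaluates the field at R^T x'
```

Because points are stored as rows, `points @ R.T` applies $R$ and `points @ R` applies $R^T$.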

Proving equivariance
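So far, we have not discussed the rationale behind calling this new model equivariant. To see this, we note that the latent code and rotation matrix $R$ belonging to a new shape time series $M_t$ are found by minimizing the following loss function over $\varphi = (\alpha, \beta, \gamma)$ and $\mathbf{z}$:

$$
\mathbb{E}_{(\mathbf{x}, t)} \Big[ \mathcal{L}_{\mathrm{rec}}\big(f_\theta(R_\varphi^T \mathbf{x}, t, \mathbf{z}), \mathrm{SDF}_{M_t}(\mathbf{x})\big) \Big] + \mathcal{L}_{\mathrm{code}}(\mathbf{z}, \sigma),
$$

or, more precisely, its estimate using $M$ SDF tuples $\{(\mathbf{x}_i, t_i, s_i)\}_{i=1}^M$:

$$
\frac{1}{M} \sum_{i=1}^M \mathcal{L}_{\mathrm{rec}}\big(f_\theta(R_\varphi^T \mathbf{x}_i, t_i, \mathbf{z}), s_i\big) + \mathcal{L}_{\mathrm{code}}(\mathbf{z}, \sigma).
$$

When minimizing the latter loss function over $(R_\varphi, \mathbf{z})$, we have the following equivariance property:

Theorem 1. Let the reconstruction of a shape time series $M_t$ be obtained by minimizing Equation 10 over $(R_\varphi, \mathbf{z})$. Denote by $M_t^{\tilde{R}}$ the shape time series $M_t$ rotated by applying the rotation matrix $\tilde{R}$. Assume $(R, \mathbf{z})$ is a solution to the optimization problem for $M_t$ and let the data points for $M_t^{\tilde{R}}$ be given by $\{(\tilde{R}\mathbf{x}_i, t_i, s_i)\}_{i=1}^M$. Then, $(\tilde{R}R, \mathbf{z})$ is a solution to the optimization problem for $M_t^{\tilde{R}}$.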
A proof of the above theorem and a proof of an infinite sample version of the theorem can be found in the supplementary material. Theorem 1 tells us that when rotating the time series of a shape and the spatial part of the used data $\{(\mathbf{x}_i, t_i, s_i)\}_{i=1}^M$ by a rotation matrix $\tilde{R}$, a good reconstruction of the rotated time series can be found by changing the obtained rotation matrix in an equivariant way while leaving the latent code $\mathbf{z}$ invariant. In this way, in our equivariant model, the latent code $\mathbf{z}$ truly represents the identity of the shape time series while $R$ captures the rotational part.
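The mechanism behind Theorem 1 can be checked numerically: the empirical reconstruction loss of a pair $(R, \mathbf{z})$ on the original data equals the loss of the pair formed by composing the data rotation with $R$, with the same $\mathbf{z}$, on the rotated data. In the sketch below, the applied data rotation is called `R_tilde`, the decoder is a hypothetical stand-in for $f_\theta$ with a scalar latent code, and the code regularizer is omitted because it depends only on $\mathbf{z}$; all of these are assumptions for illustration.

```python
import numpy as np

def random_rotation(rng):
    """A random element of SO(3) from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]             # flip one column to force determinant +1
    return q

def toy_decoder(points, t, z):
    """Hypothetical stand-in for f_theta: any fixed function of (x, t, z) works."""
    return np.linalg.norm(points * np.array([1.0, 2.0, 3.0]), axis=1) - (0.5 + 0.1 * t + z)

def empirical_loss(R, z, points, t, sdf):
    """Mean L1 error of the decoder evaluated at R^T x (for row vectors: x^T R)."""
    return np.mean(np.abs(toy_decoder(points @ R, t, z) - sdf))

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(200, 3))
t = rng.uniform(-1, 1, size=200)
s = rng.normal(size=200)               # arbitrary reference values
R, R_tilde, z = random_rotation(rng), random_rotation(rng), 0.05

loss_original = empirical_loss(R, z, x, t, s)
loss_rotated = empirical_loss(R_tilde @ R, z, x @ R_tilde.T, t, s)
```

Since the two losses agree for every choice of rotation and code, a minimizer for the original data maps to a minimizer for the rotated data with an unchanged latent code.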

Data
To demonstrate the ability of our proposed method to model different phenomena occurring during the cell cycle, we selected three diverse 3D time-lapse biomedical data sets. First, Platynereis dumerilii cells exhibit rapid non-rigid cell shape deformations over time. Second, C. elegans cells were selected to demonstrate growth of the embryo and clear divisions of mother cells into their daughters. Third, A549 carcinoma cells feature growing and branching protrusions on a blebbing cell main body. Each cell type thus exhibits distinct shape features that we want to model using the proposed method. The 3D+time data sets were acquired in fluorescence microscopy, and the microscopy images were complemented with full segmentation masks. The segmentation masks for Platynereis dumerilii and C. elegans cells were produced using automatic image analysis algorithms. The A549 carcinoma cells are computer generated with the segmentation masks produced jointly with the synthetic microscopy images. Here, we describe each data set in more detail and discuss the data preparation procedure used to produce suitable SDF data for training the models.

Platynereis dumerilii embryo cells
Platynereis dumerilii is a sea worm that lives in tropical coastal areas and reaches a length from 2 to 4 cm when fully grown. It is commonly used in biological evolution studies as a model organism. The fluorescently stained nuclei of a developing Platynereis dumerilii embryo were acquired in its early stage of development. The acquisition was done using a SiMView light sheet microscope with double illumination and double detection objectives (Tomer et al., 2012). Specifically, the illumination objectives were Olympus XLFLUOR 4×/340/0.28 and the detection objectives were Nikon CFI75 LWD 16×/0.8W. The acquisition was done with a time step of 90 seconds for 300 time points, making the overall experiment time 7 hours and 30 minutes. The spatial resolution of the images is 700 × 660 × 113 voxels, with a voxel size of 0.406 × 0.406 × 2.031 µm in the x, y, and z axis, respectively.

C. elegans embryo cells
Caenorhabditis elegans is a transparent worm living in temperate soil environments and reaching approximately 1 mm in length. Its molecular and evolutionary biology was extensively studied (Brenner, 1974) and, to this day, it is a widely used model organism in biological studies. The nuclei of a C. elegans embryo cell population were fluorescently stained in its early stage of development (Murray et al., 2008) and acquired using a Zeiss LSM 510 Meta confocal laser scanning microscope with a Plan-Apochromat 63×/1.4 (oil) objective lens. The images were acquired with a 60 second time step over 250 time points, with the overall experiment time being 4 hours and 10 minutes. The spatial resolution of the acquisition is 708 × 512 × 35 voxels with a voxel size of 0.09 × 0.09 × 1.0 µm. This data set is freely available from the website of the Cell Tracking Challenge (Ulman et al., 2017).

A549 human lung carcinoma cells
The A549 lung carcinoma cell line was cultivated from samples of cancerous human lung tissue. It is commonly used as a model in cancer studies and in the testing and development of drug therapies. This is a synthetic data set of simulated GFP-actin-stained A549 lung cancer cells embedded in a Matrigel matrix (Sorokin et al., 2018). Both the membrane of a cell and its growing and branching filopodial protrusions are fluorescently stained. The acquisition process is simulated using a virtual Zeiss Axiovert 200M inverted fluorescence microscope with a Yokogawa CSU-10 confocal unit and a Zeiss 40×/1.30 (oil) objective. The time step is 20 seconds with 30 time points, resulting in an overall experiment duration of 10 minutes. The spatial resolution is 300 × 300 × 300 voxels with a voxel size of 0.125 × 0.125 × 0.125 µm. This data set was generated using the CytoPacq web interface (Wiesner et al., 2019b). We simulated 33 time-lapse sequences, where each sequence captures one growing A549 filopodial cell.

Data preprocessing
To prepare suitable training data sets, we processed the segmentation masks of the Platynereis dumerilii cells, the C. elegans cells, and the A549 filopodial cells. We partitioned these masks into voxel volumes, where each voxel volume contains a single cell shape. Moreover, for each cell, we prepared these voxel volumes for the first 30 time points from its inception. Cell shapes were centered at each time point according to their centroid and aligned according to their principal axes. For each cell type, we prepared 33 time-lapse sequences with 30 time points. Note that for the C. elegans cells, the prepared voxel volumes contain two daughter cells in the second half of the time-lapse sequence due to the mitosis of a selected mother cell. In this case, each daughter cell was aligned separately.
We subsequently precomputed SDFs for each time point in these time-lapse sequences, i.e., for each sequence we computed 30 three-dimensional SDFs. Each time point was then represented by 256 × 256 × 256 discrete SDF point samples. One time-evolving cell is thus represented by $30 \times 256^3$ samples, defining 30 time points of its 3D shape. These point samples constitute the training data sets, $D^{\mathrm{SDF}}_{\mathrm{Plat}}$ of Platynereis dumerilii cells, $D^{\mathrm{SDF}}_{\mathrm{Cele}}$ of C. elegans cells, and $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ of A549 filopodial cells. 3D renders of time-evolving shapes from the training sets are shown in Fig. 1a. All data preparation and visualization algorithms were implemented in Matlab R2021a.
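The centering and principal-axis alignment step can be illustrated as follows. The original preprocessing was implemented in Matlab; this NumPy sketch is an independent illustration under the assumption that alignment is done via an eigendecomposition of the covariance of the foreground voxel coordinates. The function name and the toy mask are hypothetical, and the sketch does not fix the sign or handedness of the resulting axes.

```python
import numpy as np

def align_to_principal_axes(mask, voxel_size):
    """Return foreground coordinates centred at the centroid and expressed in the
    basis of the shape's principal axes (longest axis first)."""
    coords = np.argwhere(mask) * np.asarray(voxel_size, dtype=float)  # physical units
    centred = coords - coords.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)          # columns sorted by ascending variance
    return centred @ eigvecs[:, ::-1]         # reorder so the longest axis comes first

mask = np.zeros((20, 20, 20), dtype=bool)
mask[8:12, 2:18, 9:11] = True                 # box elongated along the second array axis
aligned = align_to_principal_axes(mask, voxel_size=(1.0, 1.0, 1.0))
variances = aligned.var(axis=0)
```

After alignment, the elongated direction of the toy box lies along the first coordinate axis regardless of how the box was oriented in the input volume.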

Experiments and Results
We present a series of experiments with the proposed auto-decoder and evaluate the results both quantitatively and qualitatively. Specifically, we investigate the reconstruction of cell sequences, synthesis of new shapes with randomly sampled latent codes, temporal interpolation between consecutive time points, and compactness of the learned latent spaces. Finally, we demonstrate how synthetic shapes can be used as input to an image-to-image model that synthesizes corresponding microscopy images.
Training parameters We trained separate models on $D^{\mathrm{SDF}}_{\mathrm{Plat}}$, $D^{\mathrm{SDF}}_{\mathrm{Cele}}$, and $D^{\mathrm{SDF}}_{\mathrm{Filo}}$. All models were trained for 2000 epochs with a batch size of 5. More precisely, we select a time-evolving cell 5 times with repetition and subsequently select an individual time point from each selected time-evolving cell. The weights were initialized with $\omega_0$ set to 30, following the initialization scheme proposed in (Sitzmann et al., 2020), and optimized using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of $10^{-4}$, which was reduced every 350 epochs by a factor of 0.5. The latent codes were initialized randomly from $\mathcal{N}(0, 0.01^2)$, as suggested in (Park et al., 2019), and their dimensionality was set to 64. The given hyperparameters were used in all experiments unless stated otherwise. The networks and the respective training and inference procedures were implemented in Python using PyTorch (Paszke et al., 2019) and PyTorch3D (Ravi et al., 2020).

[...] Filo. The values were obtained by counting the points x where SDF(x) <= 0 at each time point over all sequences in the given data set.

As the cell shape occupies only a fraction of the considered 3D space, we sampled from the training data sets non-uniformly, as suggested in (Park et al., 2019). Specifically, 70% of a training batch is composed of points with a distance to the cell surface less than or equal to 0.6 [µm] (including the points with negative distance representing the cell interior) to ensure that the cell boundary and thus the zero level set is well represented. Conversely, the points with a distance greater than 0.6 [µm] form the remaining 30% of a batch, as they represent only the empty space around the cell. The non-uniform sampling allowed us to reduce the overall number of point samples and thus decrease the time and memory needed to train the models. Additionally, it made the optimization procedure focus more on the important points, as the $L_1$ distance in the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ is computed against these point samples.
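The learning-rate schedule and the 70%/30% batch composition described above can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the function names, the tuple layout `(x, y, z, t, sdf)`, and drawing samples with replacement are assumptions made for the example.

```python
import random

def learning_rate(epoch, base_lr=1e-4, step=350, gamma=0.5):
    """Step decay: multiply the rate by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

def sample_batch(points, n_samples, near_fraction=0.7, threshold=0.6, rng=random):
    """Draw a non-uniform batch from (x, y, z, t, sdf) tuples.

    Points with sdf <= threshold (near the surface or inside the cell) fill
    `near_fraction` of the batch; the rest comes from the empty space.
    Sampling with replacement is an assumption of this sketch.
    """
    near = [p for p in points if p[-1] <= threshold]
    far = [p for p in points if p[-1] > threshold]
    n_near = round(n_samples * near_fraction)
    return rng.choices(near, k=n_near) + rng.choices(far, k=n_samples - n_near)
```

With these defaults, the rate is halved five times over 2000 epochs (at epochs 350, 700, 1050, 1400, and 1750), ending at about $3.1 \times 10^{-6}$.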
Inference parameters Because the trained model is continuous, it can be used to generate point clouds, meshes, or voxel volumes (Park et al., 2019). In this work, we use binary volumes of size 256×256×256 voxels for quantitative evaluation and meshes obtained using marching cubes (Lorensen and Cline, 1987) for visualization. Each voxel volume and mesh represents a shape of a time-evolving cell at a given time point. With a trained model, generating an SDF of a time-evolving shape with 30 × 256 × 256 × 256 point samples took approximately 15 seconds on an NVIDIA A100 and required 3 GB of GPU memory. The stated memory consumption represents a forward pass through the MLP. It is composed of the network parameters and the corresponding computational graph, along with the network inputs (i.e., a latent code and a grid of spatial and temporal coordinates) and the network outputs representing the inferred SDF values at the given spatio-temporal coordinates. The proposed MLP has 168.5k parameters and is thus an order of magnitude smaller than convolutional generative models such as DCGAN (Radford et al., 2015) with 3.5M parameters. For inference, we divide the coordinate grid of $30 \times 256^3$ points into subsets of $1 \times 32^3$ points. These subsets are given to the MLP, and the network outputs are subsequently concatenated to form the resulting SDF of a time-evolving shape. The size of the grid subset can be set depending on the available GPU memory and has no impact on the quality of the resulting SDF, as points are processed individually.
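The chunked evaluation can be illustrated with a small sketch. The sphere SDF below is a hypothetical stand-in for the trained MLP, and the spatial coordinate range of [-1, 1] is an assumption; the point is only that the result does not depend on the chunk size, because every point is evaluated independently.

```python
def sphere_sdf(x, y, z, t, radius=0.5):
    """Stand-in for the trained network: SDF of a sphere (ignores t)."""
    return (x * x + y * y + z * z) ** 0.5 - radius

def grid_coords(n):
    """n linearly spaced coordinates in [-1, 1] (assumed spatial range)."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def evaluate_in_chunks(sdf, n, t, chunk):
    """Evaluate `sdf` on an n^3 grid at time t, one chunk^3 block at a time."""
    c = grid_coords(n)
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    n_chunks = 0
    for i0 in range(0, n, chunk):
        for j0 in range(0, n, chunk):
            for k0 in range(0, n, chunk):
                n_chunks += 1  # one forward pass per block in a real model
                for i in range(i0, min(i0 + chunk, n)):
                    for j in range(j0, min(j0 + chunk, n)):
                        for k in range(k0, min(k0 + chunk, n)):
                            out[i][j][k] = sdf(c[i], c[j], c[k], t)
    return out, n_chunks
```

For the grid sizes stated above, one time point of $256^3$ points splits into $(256/32)^3 = 512$ subsets, so a sequence of 30 time points requires $30 \times 512 = 15360$ forward passes.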
Evaluation parameters For the reconstruction of cell shapes, we compute the Dice similarity coefficient (DSC). To compare distributions of cells, we compute descriptive statistics (i.e., minimum, maximum, mean, median value, and interquartile range) of shape descriptors, namely surface [µm²], volume [µm³], and sphericity (Costa and Cesar Jr, 2000), where sphericity ranges from 0 to 1, with 1 representing an ideal sphere. Please note that we accounted for the cell division present in $D^{\mathrm{SDF}}_{\mathrm{Cele}}$ by splitting each respective voxel volume sequence into two sequences in order to evaluate the shape of each daughter cell separately. Each evaluated sequence of voxel volumes thus represents a single time-evolving cell shape.
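As a reference for the two measures, the sketch below computes DSC on flattened binary masks and sphericity from volume and surface area. The sphericity formula is the common Wadell definition, $\pi^{1/3}(6V)^{2/3}/A$; this is an assumption, as the exact definition used from (Costa and Cesar Jr, 2000) is not restated in the text and may differ.

```python
import math

def dice(a, b):
    """Dice similarity coefficient of two equally sized binary sequences."""
    inter = sum(1 for p, q in zip(a, b) if p and q)
    total = sum(1 for p in a if p) + sum(1 for q in b if q)
    return 2.0 * inter / total if total else 1.0

def sphericity(volume, surface):
    """Surface of a sphere with the given volume divided by `surface`
    (Wadell definition, assumed here); equals 1 for an ideal sphere."""
    return math.pi ** (1.0 / 3.0) * (6.0 * volume) ** (2.0 / 3.0) / surface
```

For example, a unit cube (volume 1, surface 6) has a sphericity of $(\pi/6)^{1/3} \approx 0.806$ under this definition.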

Reconstruction and synthesis of time-evolving shapes
In this experiment, we evaluated the ability of the models to reconstruct a cell shape sequence given its learned latent code z and to synthesize a new cell shape sequence given a randomly sampled latent code. A separate model was optimized on $D^{\mathrm{SDF}}_{\mathrm{Plat}}$, $D^{\mathrm{SDF}}_{\mathrm{Cele}}$, and $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ to learn an implicit representation of Platynereis dumerilii cells, C. elegans cells, and A549 filopodial cells, respectively.
Fig. 4 and Fig. 5 show descriptive statistics for cell surface, volume, and sphericity over all time-lapse sequences for real, reconstructed, and randomly generated synthetic shapes produced by the equivariant model and the non-equivariant model, respectively. The plots for reconstructed Platynereis dumerilii shapes show an excellent overlap of the descriptive statistics. The synthetic shapes fall within the interquartile range of the real data set and generally exhibit lower variability than the real ones, while retaining the mean volume and surface as the cells grow over time.
The graphs for C. elegans show that the descriptive statistics are similar, with a moderate shift toward lower values for the surface of reconstructed shapes. Conversely, we can observe that the sphericity is higher. Note the mitosis that the cells undergo at time point 15. We can observe from the plots that shapes get smaller and less round before growing again in the second part of the time-lapse sequence, and the reconstructed shapes reproduce this behavior very closely. The synthetic shapes similarly show a moderate shift toward lower surface and volume while exhibiting higher sphericity. The models were able to accurately reproduce the cell division occurring in the middle of the time-lapse sequence. The visualization of the mitotic division can be seen in Fig. 1a.
The reconstructed filopodial cells also exhibit excellent similarity. The reconstruction accurately retains the cell volume. We can observe moderately higher sphericity and lower surface area. This is due to the growing and branching filopodial protrusions that are thin and have sharp edges. The reconstructed filopodial cells exhibit slightly rounder edges and thus lose a moderate amount of their surface area. The synthetic shapes exhibit excellent overlap with respect to the cell volume and, similarly to the reconstruction, a moderately lower surface and higher sphericity.
The results show that the models are able to accurately reproduce cell growth of Platynereis dumerilii, mitosis of C. elegans, and growing and branching protrusions of A549 lung cancer cells. Randomly generated synthetic shapes exhibit reasonably similar shape features and high visual similarity to the real ones for all three cell lines, as shown by the visual comparison in Fig. 1b.
Additionally, Table 1 lists the p-values of the two-sample Kolmogorov-Smirnov (KS) test computed on the shape descriptors of the real and reconstructed shapes produced using the equivariant model. The KS test retained the null hypothesis (p > 0.05) that the shape descriptors are from the same distribution at the 5% significance level for all tests except for the sphericity of the synthetic C. elegans and A549 human carcinoma cells, reflecting our observations.
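For readers unfamiliar with the test, the two-sample KS statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only this statistic in plain Python; it is an illustration, not the evaluation code used in the study. In practice, a library routine such as `scipy.stats.ks_2samp` would typically be used to obtain the p-value as well.

```python
def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past all values <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

A statistic near 0 indicates closely matching distributions of a shape descriptor, whereas a value of 1 means the two samples do not overlap at all.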

Random sampling of latent codes
To generate new synthetic cell shapes, we randomly sampled new latent codes within the optimized latent space. For Platynereis dumerilii and C. elegans, we randomly sampled 33 new latent codes z from $\mathcal{N}(0, 0.001)$. For filopodial cells, noise vectors sampled from $\mathcal{N}(0, 0.0005)$ were added to the optimized latent codes obtained after training. The parameters of the normal distribution were obtained by analyzing the latent codes of the models trained on the $D^{\mathrm{SDF}}_{\mathrm{Plat}}$, $D^{\mathrm{SDF}}_{\mathrm{Cele}}$, and $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ data sets. For the first two data sets, the models yielded latent codes with standard deviations of 0.0004 and 0.0008, respectively. We rounded these numbers up to 0.001 to ensure good coverage of the learned latent spaces and used this value to define the normal distribution for sampling new codes. The model trained on $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ yielded latent codes with a standard deviation of 0.0007. In this case, randomly sampled latent codes resulted in cells with incorrectly placed protrusions, regardless of the chosen normal distribution. Each cell in the filopodial data set has distinctly growing protrusions, and the only shared feature is the main body of the cells. By random sampling in the latent space, the model correctly inferred the common features, in this case, the main cell body. However, the model was not able to learn a sufficiently generalized spatial and temporal representation of the protrusions, which vary in placement, branching, and growth direction between all given cells. We experimentally determined that sampling new latent codes within the proximity of already learned latent codes yields correct shapes that exhibit reasonable variability.

Fig. 5: Quantitative evaluation of reconstructed cell shapes produced using the non-equivariant model. The plots show descriptive statistics of shape descriptors, i.e., surface [µm²], volume [µm³], and sphericity, at each time point in the time-lapse sequence for a given cell type (30 time points in total). From the left, the columns represent Platynereis dumerilii cells, C. elegans cells, and A549 filopodial cells. Real shapes from the training sets are denoted as "Real", shapes reconstructed by the model from the optimized latent codes are denoted as "Reconstructed", and shapes generated by random sampling are denoted as "Synthetic". Each plot shows the mean and interquartile range (IQR) of the respective shape descriptor, with the values computed at each given time point over all time-lapse sequences (33 in total). The closer the respective plotted values are together, the more similar the shapes are, where yellow stands for shapes from the real set, and blue for the reconstructed shapes.
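The two sampling strategies can be sketched as follows. The function names are hypothetical, and the second parameter of the normal distributions (0.001 and 0.0005) is interpreted here as a standard deviation, which is an assumption based on the rounding of the observed standard deviations described above.

```python
import random

def sample_new_codes(n_codes, dim=64, std=0.001, rng=random):
    """Draw fresh latent codes from an isotropic zero-mean Gaussian
    (strategy used for Platynereis dumerilii and C. elegans)."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_codes)]

def perturb_codes(codes, std=0.0005, rng=random):
    """Add small Gaussian noise to already optimized codes
    (strategy used for the filopodial cells)."""
    return [[v + rng.gauss(0.0, std) for v in code] for code in codes]
```

The second strategy keeps each new code in the neighborhood of a code seen during training, which matches the observation that protrusions are only placed correctly close to learned latent codes.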

Latent space compactness
In this experiment, we evaluated the impact of the SO(3)-equivariant extension on the compactness of the optimized latent space. In other words, we evaluated the ability of the equivariant model to learn a latent space that is rotation independent. The equivariant extension is depicted in blue in Fig. 2. We trained the non-equivariant model (Wiesner et al., 2022) and the proposed equivariant model on a set of randomly rotated time-evolving cell shapes. For this experiment, we selected the $D^{\mathrm{SDF}}_{\mathrm{Plat}}$ data set. To prepare the training data, we pre-computed 4 random rotations of 25 time-evolving cell shapes from $D^{\mathrm{SDF}}_{\mathrm{Plat}}$ to obtain a training set with 100 time-evolving shapes. Specifically, for each rotation, we uniformly sampled angles α, β, and γ from (−π, π], and rotated the cell shape around the x, y, and z axes, respectively. Please note that the same angles were used at all time points for a given time-evolving shape. The dimensionality of latent codes was set to 64 for both models, as in the other experiments. Furthermore, the number of point samples, sampled randomly from the training data set at each epoch, was decreased to 250,000, and the batch size was increased to 20 to allow the model to observe more shapes at the same time and better optimize their rotations.

After training both models, we evaluated the resulting latent spaces. Each model learned a mapping of 100 time-evolving shapes to 100 locations in the respective latent space. For visual comparison, we computed a low-dimensional representation of the latent spaces using t-SNE and plotted the resulting points in a scatter plot. Fig. 6a shows the latent space of the non-equivariant model, and Fig. 6b shows the latent space of the equivariant model. Rotated variants of each cell are represented by a unique marker. The plots show that the rotated variants of each cell are close together in the latent space of the equivariant model. Furthermore, we computed principal component analysis (PCA) of the latent codes learned by the models and then plotted the variability expressed by the individual principal components in Fig. 7a. In the equivariant model, the first 20 principal components explain 99.9% of the variability, compared to 61.2% in the non-equivariant model. In other words, PCA shows that it may be possible to further reduce the dimensionality of the latent codes for the equivariant model. To evaluate this hypothesis, we trained both models on the same data set of rotated cells with latent code dimensionalities ranging from 4 to 128. Fig. 7b and Table 2 show that the equivariant model retained the reconstruction similarity with a latent dimensionality of 32 or higher (DSC of 0.954 ± 0.009 with dimensionality 32), whereas the non-equivariant model required a latent dimensionality of 64 or higher to achieve comparable results (DSC of 0.959 ± 0.015 with dimensionality 64, and 0.934 ± 0.027 with dimensionality 32). This experiment provides empirical evidence that the equivariant model is able to reconstruct shapes with higher similarity when the latent code dimensionality is reduced and supports the outcome of the PCA.
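The preparation of rotated training shapes relies on composing rotations about the three coordinate axes. The sketch below builds such a rotation matrix from angles α, β, and γ; the composition order (x first, then y, then z) and the helper names are assumptions of this example, not details confirmed by the text.

```python
import math

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def euler_to_matrix(alpha, beta, gamma):
    """R = Rz(gamma) Ry(beta) Rx(alpha): rotate about x, then y, then z."""
    return matmul(rot_z(gamma), matmul(rot_y(beta), rot_x(alpha)))

def rotate(R, p):
    """Apply rotation matrix R to a 3D point p."""
    return [sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]
```

Applying the same matrix to the spatial coordinates at every time point rotates the whole time-evolving shape consistently, as required above. Any matrix produced this way satisfies $R R^\top = I$, i.e., it is an element of SO(3).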

Latent space exploration
In this experiment, we took a closer look at the latent space of the proposed equivariant model in order to assess its properties and determine how different positions in the latent space, represented by the latent codes, affect the resulting cell shape. [...] Plat. The standard deviation of the resulting latent codes was 0.0008. To visualize the latent space and the positions of the latent codes, we computed a low-dimensional representation using t-SNE. Fig. 8a, Fig. 8b, and Fig. 8c show scatter plots containing markers representing the positions of individual cells in the latent space with points colored according to the shape descriptors that were computed on the cells, specifically, surface, volume, and sphericity, respectively. As each point in the latent space represents a time-evolving cell at multiple time points, we colored the points according to the mean descriptor values computed over all time points. The plots show that different parts of the learned latent space correspond to cells with different surface, volume, and sphericity characteristics. In Fig. 8d, we show how different positions in the latent space affect the mean volume of a cell. We sampled latent codes along a linear trajectory, forming a triangle. Here, (1), (3), and (5) correspond to shapes that the model has seen during optimization. (1) represents a cell with the lowest mean volume (586.5 [µm³]), (3) a cell with the highest mean volume (5719.9 [µm³]), and (5) a cell with an average volume (3324.0 [µm³]). (2), (4), and (6) are sampled along a linear trajectory between the other ones. The visual comparison shows that the position in the latent space indeed determines the cell volume and that sampling latent codes along a linear trajectory results in a proper change in the cell volume.
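Sampling along a linear trajectory between two learned latent codes amounts to linear interpolation. A minimal sketch, with a hypothetical helper name and toy two-dimensional codes in place of the 64-dimensional ones:

```python
def lerp_codes(z_a, z_b, n_steps):
    """Return n_steps codes evenly spaced from z_a to z_b, endpoints included."""
    return [[a + (b - a) * s / (n_steps - 1) for a, b in zip(z_a, z_b)]
            for s in range(n_steps)]
```

With `n_steps=3`, the middle element is the midpoint between the two codes, which corresponds to intermediate samples such as (2), (4), and (6) on the sides of the triangle, assuming they were taken halfway along each side.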

Temporal interpolation
As the proposed shape representation is continuous, the model can be used to produce time-evolving shapes at arbitrary spatial and temporal resolution without the need for additional training. In this experiment, we evaluated the ability of the model to interpolate in time. We trained the proposed equivariant model on every fourth time point present in the $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ data set (i.e., 1, 5, 9, ..., 29); each time-evolving cell was thus represented by 8 time points. In other words, the model has only "seen" 8 time points for each of the 33 cells present in the data set. We subsequently used the model to fill in missing time points and compared the result to the real shapes. As time coordinates used for training and inference are linearly spaced in the interval from -1.0 to 1.0, representing the first and last time point, respectively, we shortened all sequences from 30 to 29 frames by discarding the last frame to ensure that the time coordinates are aligned. Fig. 9a shows a visual comparison of real and interpolated cell shapes on a single sequence. As the visual differences are difficult to discern with the naked eye, we computed the similarity of the shapes using DSC, which is shown in Fig. 9b. The plot shows the mean and standard deviation of DSC computed over all sequences at each time point. The DSC shows that the interpolation similarity decreases toward the end of the sequence, where the cell protrusions are fully grown, and the shape is most complex. The interpolation yielded the lowest DSC of 0.858 ± 0.012 at time point 27. The visual comparison shows that the model is able to interpolate between time points, and DSC corroborates that the produced shapes retain reasonable similarity to the real cells even when interpolating multiple consecutive time points.
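The reason for shortening the sequences to 29 frames can be checked with a short calculation. With 29 frames, the retained frames 1, 5, ..., 29 map exactly onto 8 linearly spaced time coordinates in [-1, 1]; with 30 frames, frame 29 would not coincide with the end of the interval. The helper name below is hypothetical.

```python
def time_coords(n_frames):
    """Linearly spaced time coordinates in [-1, 1] for n_frames frames."""
    return [-1.0 + 2.0 * i / (n_frames - 1) for i in range(n_frames)]

full = time_coords(29)             # coordinates of all 29 frames
train_idx = list(range(0, 29, 4))  # frames 1, 5, ..., 29 (zero-based 0, 4, ..., 28)
train_t = [full[i] for i in train_idx]
```

Here `train_t` spans the whole interval from -1.0 to 1.0 in steps of 2/7, so the sparse training grid and the dense inference grid share their endpoints.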

Spatial interpolation and sampling grid
In this experiment, we investigated how the size of the sampling grid affects the shape reconstruction. We trained the proposed equivariant model with shape SDFs represented by three different grids: 192×192×192, 256×256×256, and 384×384×384. As the original microscopy images of A549 filopodial cells are isotropic and exhibit a high spatial resolution of 300×300×300 voxels, we chose shapes of this cell line for this experiment. We already had the $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ data set consisting of 256×256×256 point samples per time point. To prepare suitable training data sets for the other two models, we computed SDFs and created two additional variants of the data set, $D^{\mathrm{SDF}192}_{\mathrm{Filo}}$ and $D^{\mathrm{SDF}384}_{\mathrm{Filo}}$. After training the models, we reconstructed the cell shapes using different grids, ranging from $64^3$ to $384^3$. As the shape representation with the proposed model is continuous, the model can be used to reconstruct shapes using an arbitrary grid, isotropic or anisotropic, in other words, to interpolate in space. Fig. 10a and Fig. 10b show a visual comparison of shapes reconstructed using model 384 and model 192, respectively. As the visual differences may be difficult to spot with the naked eye, we also computed DSC for reconstructed cells, including all three models and grid sizes, in Fig. 10c. DSC was computed over all sequences at time point 30, where the cell protrusions are fully grown. All reconstructed cell shapes at the given time point were compared against real shapes with the grid of $384^3$. Voxel volumes of shapes reconstructed with smaller grids were upsampled using the nearest neighbor algorithm to match the $384^3$ grid for DSC computation.
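Nearest neighbor upsampling of a binary voxel volume can be sketched as an index lookup. The index convention below (destination index `i` maps to source index `i * n_src // n_dst`) is one common choice and an assumption of this example; image libraries may align voxel centers slightly differently.

```python
def upsample_nearest(volume, n_dst):
    """Upsample a cubic volume (nested lists, side n_src) to side n_dst
    by copying the value of the nearest source voxel."""
    n_src = len(volume)
    idx = [min(i * n_src // n_dst, n_src - 1) for i in range(n_dst)]
    return [[[volume[a][b][c] for c in idx] for b in idx] for a in idx]
```

For an integer scale factor such as 192 to 384, every source voxel is simply replicated in a 2×2×2 block, so the foreground volume fraction is preserved; for non-integer factors such as 256 to 384, some voxels are replicated more often than others.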
Reconstruction on the $384^3$ grid yields the best DSC for all models. The results show that it is possible to further increase the reconstruction similarity by training the model on SDFs with denser grids, as demonstrated by model 384 trained on $D^{\mathrm{SDF}384}_{\mathrm{Filo}}$. Using denser training grids does not affect memory consumption, which is determined only by the batch size and the number of points sampled in each epoch. However, the grid density affects the number of epochs that the models need to converge. Specifically, model 192 required 1000 epochs, model 256 required 2000 epochs, and model 384 required 4000 epochs.

Spectral bias
In this experiment, we investigate how the $\omega_0$ parameter affects the spectral bias of the model. The weights of layers using periodic activations are initialized using a parameter $\omega_0$ representing the angular frequency of the periodic functions. This parameter was shown to affect the spectral bias of the neural network in (Sitzmann et al., 2020), and its recommended value was 30, which is the value that we used in our other experiments. To investigate how the $\omega_0$ parameter affects the shape reconstruction of the proposed equivariant model, we trained multiple models on the $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ data set with values of $\omega_0$ ranging from 1 to 70. Fig. 11a shows a visual comparison of a real cell shape from the training data set and the shapes reconstructed with models trained with different $\omega_0$ values. For visualization, we selected a single shape at time point 30 where the cell protrusions are fully grown, and the cell exhibits the lowest sphericity. The differences are most apparent on the tips of the protrusions, where the models trained with $\omega_0$ lower than 30 produce shapes with a rounder and less defined structure. On the other hand, increasing $\omega_0$ beyond 30 did not seem to yield any visually observable improvements over the default recommended value. Fig. 11b shows a plot of sphericity with respect to different $\omega_0$ values computed over all reconstructed cells at time point 30. The results indicate that the $\omega_0$ parameter indeed affects the spectral bias of the model and that increasing this parameter beyond the value recommended by (Sitzmann et al., 2020) does not yield any measurable improvements in sharpness of the reconstructed shapes.
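To make the role of $\omega_0$ concrete, the sketch below shows a sine layer with the uniform initialization bounds commonly used in SIREN implementations following (Sitzmann et al., 2020): $\pm 1/n$ for the first layer and $\pm\sqrt{6/n}/\omega_0$ for subsequent layers, where $n$ is the number of layer inputs. The helper names are hypothetical, biases are omitted for brevity, and whether the models in this study use exactly these bounds is an assumption.

```python
import math
import random

def siren_weight_bound(fan_in, omega_0=30.0, is_first=False):
    """Half-width of the uniform distribution used to initialize weights."""
    if is_first:
        return 1.0 / fan_in
    return math.sqrt(6.0 / fan_in) / omega_0

def init_layer(fan_in, fan_out, omega_0=30.0, is_first=False, rng=random):
    """Weight matrix (fan_out rows of fan_in values) drawn from U(-b, b)."""
    b = siren_weight_bound(fan_in, omega_0, is_first)
    return [[rng.uniform(-b, b) for _ in range(fan_in)] for _ in range(fan_out)]

def sine_layer(weights, x, omega_0=30.0):
    """Periodic activation sin(omega_0 * W x); bias omitted in this sketch."""
    return [math.sin(omega_0 * sum(w * v for w, v in zip(row, x)))
            for row in weights]
```

A larger $\omega_0$ scales the pre-activations and lets the layer represent higher spatial frequencies, which is consistent with the sharper protrusion tips observed for $\omega_0 \geq 30$.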

Comparison with DeepSDF
In this experiment, we compared the proposed equivariant model with DeepSDF (Park et al., 2019). DeepSDF differs from the non-equivariant and the equivariant models presented in this study in its architecture. Specifically, DeepSDF uses ReLU activation functions, and the MLP is composed of 8 layers. In comparison, our models have an MLP with 9 layers and use sine activation functions, where the equivariant model is further extended to be rotation equivariant. Periodic activation functions (sine) proposed by (Sitzmann et al., 2020) are expected to yield better results in comparison to ReLU activations. Specifically, they should allow the model to converge faster and to better fit high-frequency signals, such as shapes with sharp edges. In order to compare the methods, we used a DeepSDF extended to model shapes in 3D+time (Wiesner et al., 2022).
We experimented on the $D^{\mathrm{SDF}}_{\mathrm{Filo}}$ data set, which consists of 33 complex cell shapes with sharp growing and branching protrusions. Furthermore, we extended the number of training epochs to 4000 to give both models enough time to converge and set the latent code dimensionality of DeepSDF to 256 (Park et al., 2019). We evaluated the results of both models at epoch 4000. A visual comparison of the real and resulting reconstructed cell shapes is shown in Fig. 12a. The differences are most apparent on the cell protrusions toward the end of the sequence when the cell is fully grown. A visual inspection of the shapes shows that the reconstruction using DeepSDF yields thicker protrusions with rounder tips compared to the proposed model with periodic activations, which was able to match the real shapes more closely. We show quantitative evaluation using shape descriptors, specifically, surface in Fig. 12b, volume in Fig. 12c, and sphericity in Fig. 12d, and the training convergence in Fig. 12e. The quantitative results support the visual inspection, with DeepSDF producing rounder shapes that exhibit higher volume due to the increased thickness of the protrusions. At time point 30, where the cell is fully grown (and exhibits the lowest sphericity), DeepSDF and the proposed model yielded sphericity of 0.042 ± 0.008 and 0.029 ± 0.009, respectively, compared to real shapes with 0.021 ± 0.004. To show the training convergence, we computed DSC of real and reconstructed shapes every 100th epoch over all sequences at time point 30. [...] proposed equivariant model are measurably improved in comparison to DeepSDF.

Cell classification based on a latent code
In this experiment, we investigated the application of the proposed equivariant model in a downstream classification task.
We optimized a single model on all three data sets, specifically, $D^{\mathrm{SDF}}_{\mathrm{Plat}}$, $D^{\mathrm{SDF}}_{\mathrm{Cele}}$, and $D^{\mathrm{SDF}}_{\mathrm{Filo}}$. Each data set consists of 33 time-evolving cell shapes of a given cell line, making a total of 99 cells. We visualized the latent space of the model using t-SNE in Fig. 13. The shape time series of each cell line form a distinct cluster in the latent space. The model distinguishes between different shape features of each cell line and puts similar shapes close together in the latent space. On data sets where the cell class labels are not known, the cells can be labeled according to their position in the latent space.
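One simple way to label cells by their position in the latent space is a nearest-centroid rule. This is an illustrative choice and not a classifier described in the text; the function names and the toy two-dimensional codes are assumptions.

```python
def centroid(codes):
    """Component-wise mean of a list of latent codes."""
    dim = len(codes[0])
    return [sum(c[d] for c in codes) / len(codes) for d in range(dim)]

def nearest_label(code, centroids):
    """Label of the centroid closest to `code` in squared Euclidean distance.
    `centroids` maps a label to a centroid vector."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(code, centroids[lab])))

# Toy example with 2D codes; real latent codes would be 64-dimensional.
centroids = {
    "Plat": centroid([[0.0, 0.0], [0.2, 0.0]]),
    "Cele": centroid([[1.0, 1.0], [1.2, 1.0]]),
}
label = nearest_label([0.9, 1.1], centroids)
```

Such a rule only works when the clusters are well separated, as they appear to be in Fig. 13; for overlapping clusters, a clustering or classification method fitted on the latent codes would be needed.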

Conditional synthesis of textured cell images
To demonstrate the application of the method, we used synthetic time-evolving shapes (see Sec. 5.1) to generate synthetic data sets with pairs of a cell shape mask and a corresponding microscopy image. A comparison of real and synthetic microscopy images is shown in Fig. 14. Such data sets are used as training data for segmentation networks or for testing and evaluation of image analysis methods. In the latter case, we refer to them as benchmarking data sets.
The segmentation masks were obtained using maximum intensity projections of the voxel volumes of SDFs produced using the proposed method. The textured cell images were generated using a conditional GAN, more specifically pix2pixHD (Wang et al., 2018), which was trained on microscopy images and corresponding cell masks of Platynereis dumerilii cells, C. elegans cells, and A549 human carcinoma cells. The resulting data sets are produced in 2D+time and contain 33 time-lapse sequences with 30 time points for each cell type.
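A maximum intensity projection of a binary voxel volume reduces it to a 2D mask by taking the maximum along one axis. The sketch below projects along the first axis of a volume indexed as `volume[z][y][x]`; the indexing convention and the projection axis are assumptions of this example.

```python
def max_intensity_projection(volume):
    """Project a volume indexed as volume[z][y][x] along z.
    A pixel of the 2D mask is 1 if any voxel above it is 1."""
    n_y, n_x = len(volume[0]), len(volume[0][0])
    return [[max(plane[y][x] for plane in volume) for x in range(n_x)]
            for y in range(n_y)]
```

Applied to each of the 30 time points of a synthetic sequence, this yields the 2D+time masks on which the image-to-image model is conditioned.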

Discussion
In this work, we have proposed a generative model for living cell shapes in 3D+time. We represent evolving cell shapes using the zero-level set of their signed distance function, which is implicitly represented in a fully-connected neural network. This implicit neural representation is fully continuous and thus allows the synthesis of highly detailed shapes in virtually unlimited spatial and temporal resolution. By disentangling shape from rotation, we obtain a compact latent code that allows reconstruction and synthesis of cells with diverse shape and growth characteristics. In a series of experiments, we have shown that this model can be used to accurately reconstruct, synthesize, and interpolate complex and changing cell shapes.
In the proposed model, shape and rotation are disentangled, so that the latent space describes only shape. This has several advantages. First, we showed that this results in learning a more compact latent space, as fewer latent space dimensions might be necessary to describe a data set. Second, by explicitly specifying the rotation, the model is capable of generating cell shapes at any desired angle in 3D space. Third, by allowing the model to focus on the optimization of a latent description of shape, it might be able to learn a latent space in which similar cell shapes are properly clustered. This might allow unsupervised learning based on cell shapes, clustering, and identification of categories of cells. As the latent code describes shape changes over time, it is likely to also capture differences in morphology between cells and provide insights into cell development, which, in turn, can be used for deriving accurate quantitative models, e.g., for embryogenesis. Such flexibility and level of detail would not have been possible with existing voxel-based methods (Svoboda and Ulman, 2017; Fu et al., 2018; Baniukiewicz et al., 2019), nor in our previous work employing implicit neural representations (Wiesner et al., 2022).

[Fig. 12 caption, beginning lost:] Where "real" represents real cells from the training data set, "proposed" represents the shapes reconstructed using the equivariant model, and "DeepSDF" represents the shapes reconstructed using DeepSDF. Additionally, we compare the convergence of the models using Dice similarity coefficient (DSC) (e). DSC of real and reconstructed shapes was computed every 100th epoch over all sequences at the last time point (30), where the cell protrusions are fully grown.

In our experiments, we have included three diverse data sets to demonstrate the versatility of our approach: Platynereis dumerilii embryo cells, C. elegans embryo cells, and A549 lung adenocarcinoma cancer cells. The Platynereis dumerilii and C. elegans embryo cells are commonly used in biological evolution studies as model organisms. The A549 lung adenocarcinoma cancer cells are used in cancer studies and are subject to active research because filopodia and their relationship to cell migration are of great importance to the testing and development of drug therapies and understanding of the formation of cancer metastases. Our experiments showed that the model is able to accurately represent diverse time-evolving cell shapes and the phenomena occurring during the cell cycle, such as growth and mitosis. In general, descriptive statistics of synthesized cells matched those of real cells. In particular, the surface and volume of real and synthetic cells followed the same distribution. However, we found that the sphericity of synthetic cells was generally higher than that of real cells, indicating that it might still be challenging to properly synthesize details on the surface. A possible mitigation strategy for this might be to allow adaptive shape sampling. One can use sparse sampling where the shape is very smooth, and dense sampling where the shape contains fine details, or simply deserves more attention and thus accuracy, like protrusions on a cell surface. Similarly, we can adaptively sample in time, where high temporal resolution has been shown to improve segmentation and tracking results on time-lapse data (Coca-Rodríguez and Lorenzo-Ginori, 2014).

By random sampling in the optimized latent space, we were able to generate new time-evolving cell shapes with visually plausible features. This random sampling produced consistent results when synthesizing new Platynereis dumerilii and C. elegans cells. However, in the case of the A549 filopodial cells, we observed that the fully-grown protrusions of the new synthetic shapes had a tendency to disconnect from the main cell body in the second half of the time-lapse sequence. To mitigate this phenomenon, we randomly sampled new latent codes close to known latent codes of our training data. Additional regularization terms could help with optimizing a latent space that would be more suitable for sampling complex and heterogeneous time-evolving shapes, such as the A549 filopodial cells. It is also possible that the complexity of this data set requires a more complex latent space distribution to sample new shapes. In future work, we will investigate how more structure and guarantees can be obtained for the latent space and explore Gaussian mixture models (Reynolds, 2009), which are able to generate more complex data distributions than a single isotropic Gaussian.
A limitation of the current study is that the A549 human lung carcinoma cells that we used to optimize our synthesis model were themselves synthesized (Sorokin et al., 2018) and thus might be slightly different from real living cells. We argue that this is only a minor limitation, as we use these data sets to demonstrate the efficacy of the model and show that our method can indeed synthesize cells that mimic the distribution of these reference cells. Moreover, we only model individual cells, while in reality, cells form populations. In future work, we wish to explore extensions to arrive at a 3D+time model capable of synthesizing living cell populations, describing not only the time-evolving shapes of cells but also cell trajectories and even cell interactions within the population. This would allow the synthesis of data sets that can be used for the development of instance segmentation or tracking models, which can distinguish between cells in a population.
Implicit neural representations are a versatile tool and, depending on the desired application, the inferred SDFs can be converted to mesh-based, voxel-based, or point cloud representations. To demonstrate one potential application of the proposed method, we used a GAN conditioned on the synthetic cell shapes to prepare data sets with pairs of textured cell images and reference annotation for all three cell types. This is a similar approach to those previously used for synthesis of new data (Osokin et al., 2017; Goldsborough et al., 2017; Böhland et al., 2019; Bailo et al., 2019; Baniukiewicz et al., 2019; Kozlovský et al., 2021), and the acquired data sets could be valuable for training, evaluation, and benchmarking of image analysis algorithms. In this study, we demonstrated the proposed method on evolving cell shapes from optical microscopy, but in principle, this approach can be used for learning shapes and spatio-temporal dynamics of diverse organisms at both micro and macro scales. For example, this method could be adapted to synthesize brain atrophy in patients with Alzheimer's disease or the progression of abdominal aortic aneurysms (Alblas et al., 2023).

Conclusion
We have presented a method that allows accurate spatio-temporal representation and synthesis of highly detailed time-evolving shapes and structures in microscopy imaging. The method uses a neural network to implicitly represent time-evolving shapes and the occurring visual phenomena and deformations. This representation is fully continuous and equivariant with respect to shape rotations and allows for the synthesis of shapes in virtually unlimited spatial and temporal resolution at any given rotation.
In conclusion, conditional rotation equivariant implicit neural representations are a suitable representation for generative modeling of living cells.
Fig. 1: Visual comparison of real, reconstructed (a), and randomly generated synthetic cell shapes (b) in 3D+time. The proposed method is able to synthesize living cell shapes that accurately mimic processes such as cell growth in Platynereis dumerilii cells (top), cell division in C. elegans cells (middle), and growth and branching of filopodial protrusions in A549 lung cancer cells (bottom).

Fig. 2: Conceptual diagram of the proposed network and its SO(3)-equivariant extension (in blue). The neural network f_θ is given a latent code z sampled from a multivariate normal distribution, coordinates x = (x, y, z) from a spatial domain Ω, and a temporal coordinate t from a temporal domain τ. Moreover, the latent code z is given to the network not only at the input but also at its fifth and eighth layers. The network is optimized to output the SDF values at given points, whereas the latent codes are jointly optimized to match a given normal distribution. The trained network is able to output SDF values based on a given latent code at any coordinate in the space-time domain. When given new latent codes from the latent space, the trained network is able to infer new spatio-temporal SDFs and thus produce new time-evolving shapes. The equivariant extension modifies the sampling procedure of the spatial coordinates and introduces a rotation matrix R ∈ SO(3). Rotating the spatial coordinates x by the rotation matrix R results in new rotated spatial coordinates x′. The rotation matrix R is optimized along with the network weights θ and the latent codes z. During inference, the network reconstructs an SDF of a time-evolving cell shape according to a randomly sampled latent code z and given spatial coordinates x, temporal coordinate t, and rotation matrix R.
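The coordinate rotation described in the caption above can be illustrated with a small sketch. Here an analytic capsule SDF stands in for the trained network f_θ (the latent code and the temporal coordinate are omitted), and the query coordinates are rotated by R before the SDF is evaluated. All names, and the choice to apply R rather than its inverse to the query points, are assumptions made for illustration only.

```python
import math


def rotation_z(angle):
    """Rotation matrix in SO(3) for a rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotate(R, x):
    """Matrix-vector product x' = R x for a 3x3 matrix and a 3-vector."""
    return tuple(sum(R[i][j] * x[j] for j in range(3)) for i in range(3))


def capsule_sdf(p, half_length=1.0, radius=0.25):
    """Stand-in for the network: signed distance to a capsule lying along the x axis."""
    cx = max(-half_length, min(half_length, p[0]))  # closest point on the axis segment
    return math.sqrt((p[0] - cx) ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def rotated_sdf(R, x):
    """Evaluate the canonical SDF at the rotated coordinates x' = R x."""
    return capsule_sdf(rotate(R, x))


R = rotation_z(math.pi / 2)
query = (0.0, 1.0, 0.0)

canonical_value = capsule_sdf(query)   # positive: outside the capsule along x
rotated_value = rotated_sdf(R, query)  # negative: inside the rotated capsule
```

In this sketch the shape itself is stored only once in its canonical pose; the zero level set of `rotated_sdf` is the canonical shape rotated by the inverse of R, so any orientation can be produced by changing R alone.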
Fig. 3: Non-uniform SDF sampling. The figure illustrates the non-uniform SDF sampling process used during training. As the cell shape occupies only a fraction of the space, non-uniform sampling reduces the number of point samples by focusing on the points that contain the most important information.
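The idea of non-uniform sampling can be sketched as follows: most sample points are drawn close to the shape surface and only a small share uniformly from the whole domain. The sphere stand-in, the 90/10 split, the noise level `sigma`, and the band width are hypothetical values chosen for this example; the exact sampling scheme used for training is not specified in this caption.

```python
import math
import random


def sphere_sdf(p, radius=0.5):
    """Stand-in SDF: signed distance to a sphere at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def sample_uniform(rng, n, lo=-1.0, hi=1.0):
    """Draw n points uniformly from the cube [lo, hi]^3."""
    return [tuple(rng.uniform(lo, hi) for _ in range(3)) for _ in range(n)]


def sample_near_surface(rng, n, radius=0.5, sigma=0.02):
    """Draw n points on the sphere surface and perturb them with Gaussian noise."""
    points = []
    for _ in range(n):
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in d))
        points.append(tuple(radius * c / norm + rng.gauss(0.0, sigma) for c in d))
    return points


def fraction_near_surface(points, band=0.05):
    """Share of points whose absolute SDF value is below `band`."""
    return sum(abs(sphere_sdf(p)) < band for p in points) / len(points)


rng = random.Random(0)
uniform_points = sample_uniform(rng, 2000)
mixed_points = sample_near_surface(rng, 1800) + sample_uniform(rng, 200)

# Training pairs (point, SDF value) built from the non-uniform sample.
training_samples = [(p, sphere_sdf(p)) for p in mixed_points]
```

With the same budget of 2000 points, the mixed sample places far more points in the thin band around the surface than the purely uniform sample does.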

Fig. 4: Quantitative evaluation of reconstructed cell shapes produced using the proposed equivariant model. The plots show descriptive statistics of shape descriptors, i.e., surface [µm²], volume [µm³], and sphericity, at each time point in the time-lapse sequence for a given cell type (30 time points in total). From the left, the columns represent Platynereis dumerilii cells, C. elegans cells, and A549 filopodial cells. Real shapes from the training sets are denoted as "Real", shapes reconstructed by the model from the optimized latent codes are denoted as "Reconstructed", and shapes generated by random sampling are denoted as "Synthetic". Each plot shows the mean and interquartile range (IQR) of the respective shape descriptor, with the values computed at each given time point over all time-lapse sequences (33 in total). The closer the plotted values of the respective colors are to each other, the more similar the shapes are, where yellow stands for shapes from the real set and blue for the reconstructed shapes.
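For reference, sphericity can be computed from the other two descriptors. The sketch below assumes the standard (Wadell) definition, ψ = π^(1/3) (6V)^(2/3) / A, where V is the volume and A the surface area; whether the evaluation in this work uses exactly this definition is an assumption of the example.

```python
import math


def sphericity(volume, surface_area):
    """Wadell sphericity: ratio of the surface area of a sphere with the
    same volume to the actual surface area (1 for a perfect sphere)."""
    return math.pi ** (1.0 / 3.0) * (6.0 * volume) ** (2.0 / 3.0) / surface_area


# A sphere of radius 5 (units cancel, e.g. volume in µm³ and surface in µm²).
r = 5.0
sphere_value = sphericity(4.0 / 3.0 * math.pi * r ** 3, 4.0 * math.pi * r ** 2)

# A unit cube is less spherical: the closed form is (pi / 6) ** (1 / 3).
cube_value = sphericity(1.0, 6.0)
```

Shapes with long, thin protrusions have a large surface area relative to their volume and therefore a low sphericity.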

Fig. 6: Latent space of the non-equivariant model (a) and latent space of the equivariant model (b). The figure shows a comparison of the latent spaces in a low-dimensional representation computed using t-SNE. There are 25 time-evolving cell shapes, each of which has been randomly rotated four times.
Fig. 8: Latent space exploration. The figure shows a low-dimensional representation of the latent space learned by the equivariant model trained on $D^{SDF}_{Plat}$ and a visualization of the reconstructed cell shapes sampled from selected locations in the latent space. The low-dimensional representation was computed using t-SNE, and the points representing the latent codes were colored according to the mean cell surface (a), volume (b), and sphericity (c). (d) shows a visual comparison of shapes with mean volume determined by the latent code. The latent codes were sampled along linear trajectories, where position (1) represents the lowest mean volume (586.5 [µm³]), position (3) the highest mean volume (5719.9 [µm³]), and position (5) the average volume (3324.0 [µm³]).
Fig. 9: Temporal interpolation. We trained the proposed equivariant model on 33 time-evolving shapes in the $D^{SDF}_{Filo}$ data set, with each sequence limited to every fourth time point. Specifically, we trained on time points 1, 5, 9, ..., 29, making a total of 8 time points per time series. (a) shows a visual comparison of real and reconstructed shapes on a single sequence, with the interpolated time points that the model has not seen explicitly marked as "interpolated". The differences are most apparent on the tips of the protrusions and are marked by a red circle. (b) shows the DSC of real and reconstructed shapes computed over all sequences at each time point, with the interpolated time points distinctly marked.
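The split into training and interpolated time points described in the Fig. 9 caption can be written out explicitly. This is a small bookkeeping sketch; the variable names are hypothetical, and the sequence length of 30 time points is taken from the Fig. 4 caption.

```python
# All time points of a sequence (30 per time-lapse sequence).
all_time_points = list(range(1, 31))

# Every fourth time point, starting at 1, is used for training.
training_time_points = all_time_points[::4]

# The remaining time points are never seen during training and must be
# interpolated by querying the network at the corresponding value of t.
interpolated_time_points = [t for t in all_time_points if t not in training_time_points]
```

Here `training_time_points` contains the 8 values 1, 5, 9, ..., 29, and the other 22 time points are the interpolated ones.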

Filo
192 × 192 × 192, and 384 × 384 × 384 points, respectively. We will refer to the resulting models as model 192, model 256, and model 384, with respect to the grid used for training.
The proposed model was able to converge towards a DSC of 0.950 ± 0.007 in 2000 epochs, whereas DeepSDF yielded a DSC of 0.909 ± 0.008 at epoch 2000 and needed another 2000 epochs to converge towards a DSC of 0.928 ± 0.004. The results show that the reconstruction similarity and the convergence speed of the
Fig. 10: Reconstructed shapes with respect to the training and reconstruction grids. The proposed equivariant model was trained on $D^{SDF}_{Filo}$ with grids of 192³, 256³, and 384³. Subsequently, the model was used to reconstruct cell shapes with different grids, specifically 384³, 320³, 256³, 192³, 128³, and 64³. A visualization of the reconstruction results, showing selected time points of a single time-evolving cell, is given for the model trained with a grid of 384³ (a) and a grid of 192³ (b). The differences are most prominent on the tips of the protrusions, which are marked by a red circle. The Dice similarity coefficient (DSC) of real and reconstructed shapes with respect to the training and reconstruction grid is shown in (c).

Fig. 12: Comparison of the proposed equivariant model with DeepSDF. Both models were trained on the $D^{SDF}_{Filo}$ data set for 4000 epochs. (a) shows a visual comparison of real and reconstructed shapes on a single sequence. The differences between the shapes are most apparent on the protrusions (red circle), where the shapes reconstructed using the proposed model are sharper and more faithful to the real ones. The plots (b), (c), and (d) show the mean and standard deviation of surface, volume, and sphericity, respectively, at each time point over all sequences. Here, "real" represents real cells from the training data set, "proposed" represents the shapes reconstructed using the equivariant model, and "DeepSDF" represents the shapes reconstructed using DeepSDF. Additionally, we compare the convergence of the models using the Dice similarity coefficient (DSC) (e). The DSC of real and reconstructed shapes was computed every 100th epoch over all sequences at the last time point (30), where the cell protrusions are fully grown.

Fig. 13: Cell classification based on a latent code. The figure shows a low-dimensional representation of the learned latent space computed using t-SNE. The model was trained on three cell lines, specifically, Platynereis dumerilii cells, C. elegans cells, and A549 filopodial cells. Each cell line is represented by 33 time-evolving cell shapes, making a total of 99 cells over all three cell lines. Time-evolving cell shapes of each line form separate, distinct clusters in the latent space.

Fig. 14: Comparison of real (A, C, E) and synthetic (B, D, F) images of Platynereis dumerilii (A, B), C. elegans (C, D), and A549 filopodial cells (E, F). The images show one frame from the respective 2D time-lapse data sets. The segmentation masks are represented as white contours. The masks were obtained using the proposed method and the texture was produced using an image-to-image model.

Table 1: p-values of the two-sample Kolmogorov-Smirnov test computed on the shape descriptors of the real and reconstructed shapes produced using the proposed equivariant model.
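To make the test behind these p-values concrete, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, i.e., the largest absolute difference between the empirical distribution functions of two samples. The input values are made up for illustration and are not data from this work. The p-value itself is not computed here; in practice a library routine such as `scipy.stats.ks_2samp` returns both the statistic and the p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(v) - ECDF_b(v)| over all observed v."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:
        # bisect_right counts the observations <= v in a sorted list.
        ecdf_a = bisect_right(a, v) / len(a)
        ecdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Hypothetical descriptor values (e.g. volumes) of real and reconstructed shapes.
real_values = [1.0, 2.0, 3.0, 4.0]
shifted_values = [3.0, 4.0, 5.0, 6.0]

identical = ks_statistic(real_values, real_values)
shifted = ks_statistic(real_values, shifted_values)
disjoint = ks_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A statistic close to 0 (and a correspondingly large p-value) indicates that the two descriptor distributions cannot be distinguished by the test.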

Table 2: Dice similarity coefficient (DSC) (mean ± standard deviation) of shapes reconstructed using the equivariant and the non-equivariant model with respect to latent code dimensionality. Results were computed over reconstructed shapes at time point 30, where the cells are fully grown, over 33 time-lapse sequences. The values show the similarity of the reconstructed cell shapes to the shapes from the training data set. An identical shape would have a DSC equal to 1. Cases in which a specific model yielded better results are highlighted in bold.
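The DSC of two binary shapes A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch on voxel masks, represented here as sets of occupied voxel indices, is given below; the helper `cube_voxels` and the two test shapes are hypothetical and serve only to produce a value that can be verified by hand.

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient of two voxel masks given as index collections."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


def cube_voxels(origin, size):
    """Indices of all voxels in an axis-aligned cube of edge length `size`."""
    ox, oy, oz = origin
    return {
        (ox + i, oy + j, oz + k)
        for i in range(size)
        for j in range(size)
        for k in range(size)
    }


real_shape = cube_voxels((0, 0, 0), 10)
reconstructed_shape = cube_voxels((1, 0, 0), 10)  # shifted by one voxel along x

# The overlap is 9 * 10 * 10 = 900 voxels, so DSC = 2 * 900 / 2000 = 0.9.
dsc = dice(real_shape, reconstructed_shape)
```

Even a one-voxel shift of a 10-voxel cube lowers the DSC to 0.9, which shows how sensitive the measure is to small misalignments of compact shapes.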