Novel tracking approach based on fully-unsupervised disentanglement of the geometrical factors of variation

Efficient tracking algorithms are a crucial part of particle tracking detectors. While a lot of work has been done in designing a plethora of algorithms, these usually require tedious tuning for each use case. (Weakly) supervised Machine Learning-based approaches can leverage the actual raw data for maximal performance. Yet in realistic scenarios, sufficient high-quality labeled data is not available. While training might be performed on simulated data, the reproduction of realistic signal and noise in the detector requires substantial effort, compromising this approach. Here we propose a novel, fully unsupervised, approach to track reconstruction. The introduced model for learning to disentangle the factors of variation in a geometrically meaningful way employs geometrical space invariances. We train it through constraints on the equivariance between the image space and the latent representation in a Deep Convolutional Autoencoder. Using experimental results on synthetic data we show that a combination of different space transformations is required for meaningful disentanglement of factors of variation. We also demonstrate the performance of our model on real data from tracking detectors.


Introduction
Particle tracking detectors allow us to study elementary particle interactions by visualizing particle trajectories. Robust tracking algorithms are nowadays a fundamental component of all tracking detector techniques. Tracking techniques in particle physics have evolved along with technological developments, from implementations on hardware logical elements, computer data processing, GPU-accelerated algorithms, to modern Deep-Learning based approaches [1]- [3]. The advanced implementation of tracking algorithms can be seen for example in emulsion detector data reconstruction.
Nuclear photoemulsion (referred to as emulsion in further text) detectors are tracking detectors that allow the detection of charged particles with high spatial (50 nm) and angular (<1 mrad) resolution. They do not require a power supply during the experimental run. These properties enable fundamental physical experiments searching for short lived particles [4]- [7] and large scale experiments in remote regions [8]. The emulsion gel consists of small silver bromide crystals dispersed in a gelatin frame. When a charged particle passes through the emulsion gel, the crystals along its trajectory create latent image centers, which become visible under optical microscopes after chemical development ( Figure 1). Conventional track reconstruction is performed in three steps. First, 3D tomographic images of the emulsion detector are acquired using automated scanning microscopes. Next, the positions of silver grains ("hits") are located in the 3D image volumes, and finally tracks in the detector volume are reconstructed as a sequence of hits along straight lines [2], [9]. For this detector, the typical track curvature radius is significantly larger than the track length in a single emulsion film, and thus the local curvature is ignored.
Several tracking algorithms were developed during the evolution of the scanning systems, allowing for efficient track reconstruction in real-time during data acquisition [10]-[12], as well as for particle identification and energy measurements [13]-[15]. While satisfying the needs of many experiments, they have several drawbacks. Their adaptation to different experimental conditions, e.g. high track density or high background level, requires tedious calibration ranging from extensive parameter tuning to performing dedicated test runs using e.g. accelerator beams. In addition, when the procedure of extracting the hits is separated from track reconstruction, the tracking algorithm cannot fully exploit the information available in the raw image data, compromising performance, especially in the high background/track density cases.
Incorporating tracking based on classical Deep Learning, where the track parameters are predicted from the raw image data, would naturally address the latter issue. Yet, to train such a model in a supervised manner either one would need to provide massive amounts of labeled 3D raw image data for each experimental case, or training would need to be performed largely on simulated datasets. While suitable for some similar cases [3] this approach requires perfect knowledge of the optical microscope parameters, grain size distributions, detector noise, etc., which are not always available directly.
Similarly to recent works where, for example, the underlying factors of variation in the images of faces, such as eyes or hair color, glasses, or head tilt are disentangled [16], it is possible to identify geometrical factors of variation of tracks. Training such models in an unsupervised manner, i.e. where no track parameters (labels) are provided during the training can address the issues mentioned above simultaneously, by both leveraging raw image data for efficient track reconstruction and allowing simple adaptation to new configurations requiring the raw image dataset only.
In this work we aim at studying an unsupervised learning approach for extracting track parameters solely from the raw image data. Here we introduce a tracking approach based on the Deep Convolutional Autoencoder [17], [18] model that learns to disentangle the geometrical factors of variation (coordinates and angles of each track) in a fully unsupervised manner by imposing equivariance of the space transformation. While the reconstruction constraint alone fails to disentangle the factors of variation in a meaningful way, we show that adding a simple constraint on the translational invariance along the track line also does not lead to the desired disentanglement. We demonstrate that incorporating more sophisticated transformations in the latent representation is necessary to avoid the reference ambiguity.
The remainder of the paper is structured as follows. Section 2 gives the details of the proposed equivariance constraints, the interpretation of the latent representation, and the implementation. Section 3 shows how different constraints affect the performance and presents an in-depth study of the encoder and decoder performance separately to better understand the learned representation. The performance on real emulsion data is also shown there. Finally, Section 4 discusses the applicability of the approach and future prospects.

Equivariance constraint
In this work, we will simplify the problem to the 2D case. We use synthetic image data of the emulsion detector tracks to perform a study of the proposed approach. Also, we demonstrate the performance of the trained model on real emulsion detector data.
We use a Deep Convolutional Autoencoder consisting of an encoder $E$ and a decoder $D$, as illustrated in Fig. 2, which are trained in an end-to-end manner. The encoder $E$ acts on 32×32 pixel images $I_c$, which are obtained from the full images $I$ using the cropping operation $C$: $I_c = C(I)$, producing the latent representation $z$. In our setup, $z$ is used to estimate the geometrical parameters, namely position and angle, of tracks present in the image crop.
We then define a set of geometrical transformations acting both in the image representation space and in the latent representation space, parametrized by the same parameter set $\xi$: in the image space $I' = T_{im}(I|\xi)$, and in the latent space of track parameters $z_t' = T_{tr}(z_t|\xi)$.
Given the encoder and decoder functions, $z = E(I_c)$ and $D(z)$, we then demand equivariance of both encoder and decoder under these transformations, i.e. the commutation of the encoder and decoder functions $E$ and $D$ with the transformations $T$ in the corresponding domain:

$$E(T_{im}(I|\xi)) = T_{tr}(E(I)|\xi), \qquad D(T_{tr}(z|\xi)) = T_{im}(D(z)|\xi).$$

From which, assuming $I = D(E(I))$, it follows that

$$T_{im}(I|\xi) = D(T_{tr}(E(I)|\xi)),$$

where cropping operations are omitted for brevity. This allows us to formulate the optimization problem in an end-to-end manner, primarily through the minimization of the $L_2$ loss between the cropped transformed image $I'_c = C(I')$ and the decoder output $Y$ (Figure 2): $L_2(D(T_{tr}(E(C(I))|\xi)), C(T_{im}(I|\xi)))$.
We will show that with a sufficient set of transformations T , the model is able to learn a geometrically meaningful latent representation E(I).

Interpretable latent representation
We limit the number of tracks potentially detected on each cropped image to $n = 8$, slightly above the maximum possible number of tracks per crop (five) in our data set (see section 2.5 for details). We parametrize a track with $n_p$ parameters. Thus, the encoder is designed to output a vector $z$ of length $n \cdot n_p$. This vector is then partitioned into $n$ chunks of length $n_p$. We further refer to these chunks as "track feature containers" $z_i$, each corresponding to one of the $n$ tracks.

Figure 3: Coordinate range on the image is chosen to be (0, 0) at bottom left to (1, 1) at top right. The track line is parametrized either by a point on the track $(x, y)$ and the cosine and sine of its slope angle $\varphi$ as $c = \kappa\cos\varphi$, $s = \kappa\sin\varphi$, or by the distance $r$ to the track from the origin and the angle $\theta$ as $c = \kappa\cos\theta$, $s = \kappa\sin\theta$.

We attribute an a priori meaning to each of the $n_p$ elements in track features $z_i$. Here we have explored three parametrization options:
1. $n_p = 4$, $z_i = (x_i, y_i, c_i, s_i)$: the track's geometrical parameters, where $(x_i, y_i)$ is a point on the track and $c_i = \kappa\cos\varphi_i$, $s_i = \kappa\sin\varphi_i$ are proportional to the cosine and sine of the slope angle $\varphi_i$ (Figure 3). Such a parametrization is chosen because it is continuous and confined, unlike e.g. $\varphi$ itself or $\tan\varphi$.
2. $n_p = 5$, $z_i = (x_i, y_i, c_i, s_i, a_i)$, where the first 4 elements are the track's geometrical parameters as in 1), and the parameter $a_i \in [-1, 1]$ shows the confidence of the encoder in the track presence. A value of $a_i \le 0$ is defined as a disabled track, and $a_i > 0$ as enabled. By disabling some tracks, degenerate outputs, in which multiple containers predict the same track, can be prevented.
3. $n_p = 3$, $z_i = (r_i, c_i, s_i)$: the track's geometrical parameters in the rho-theta parametrization [19]. $c_i = \kappa\cos\theta_i$ and $s_i = \kappa\sin\theta_i$ are proportional to the cosine and sine of the angle $\theta_i = \varphi_i + \pi/2$, and $r_i$ is the distance from the origin (0, 0) to the track (Fig. 3).
The first two parametrizations are overparametrized yet more naturally occurring in the image representation. Importantly, these enable an explicit implementation of the translation invariance. The last one is the most common parametrization for 2D tracking (e.g. Hough transformation [19]).
Since by implementation the values are $z_i \in [-1, 1]$, the parameter $r_i$ is scaled linearly into the $[0, \sqrt{2}]$ range.
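The container layout described above can be illustrated with a minimal NumPy sketch. The function names (`split_containers`, `decode_track`, `scale_r`) are hypothetical, and the linear map of $r$ assumes that the raw value −1 corresponds to 0 and +1 to $\sqrt{2}$, which the text does not state explicitly.

```python
import numpy as np

def split_containers(z, n_tracks=8, n_params=5):
    """Partition the flat encoder output into per-track feature containers."""
    z = np.asarray(z, dtype=float)
    assert z.shape[-1] == n_tracks * n_params
    return z.reshape(z.shape[:-1] + (n_tracks, n_params))

def decode_track(container):
    """Interpret one (x, y, c, s, a) container as position, slope angle, and on/off flag."""
    x, y, c, s, a = container
    phi = np.arctan2(s, c)   # slope angle; independent of the scale kappa > 0
    return x, y, phi, a > 0  # a <= 0 marks a disabled track

def scale_r(r_raw):
    """Assumed linear map of the raw output in [-1, 1] to a distance in [0, sqrt(2)]."""
    return (r_raw + 1.0) * np.sqrt(2.0) / 2.0
```

Because only the ratio of $s$ to $c$ enters the angle, the scale $\kappa$ does not need to be known to read off $\varphi$.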

Representation transformations
Image transformations are often used for image augmentation during model training [20], and in some cases models are trained to recover original images from transformed ones [21]. Instead, here we explicitly apply transformations coherently in the image and latent representation spaces, as required for the equivariance constraint. As space transformations we employ affine transformations composed of rotation, scaling, skew, and translation. Under these affine transformations, straight lines are transformed to straight lines. Details on the implementation of the transformations are given in Appendix A.
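As an illustration of what a coherent latent-space transformation can look like, the sketch below applies an affine map to a track in the $(x, y, c, s)$ parametrization. It is not the implementation from Appendix A: the helper names, the composition order of rotation, skew, and scaling, and the renormalization of the direction to length $\kappa$ are all assumptions.

```python
import numpy as np

def affine_matrix(angle=0.0, scale=(1.0, 1.0), skew=0.0):
    """Compose rotation, shear along x, and anisotropic scaling into a 2x2 matrix."""
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    shear = np.array([[1.0, skew],
                      [0.0, 1.0]])
    return rot @ shear @ np.diag(scale)

def transform_track(track, A, t, kappa=1.0):
    """Map a track (x, y, c, s) under the point transformation p -> A p + t.

    The point on the track moves like any image point; the direction is
    mapped by the linear part only and rescaled to length kappa.
    """
    x, y, c, s = track
    p = A @ np.array([x, y]) + t
    d = A @ np.array([c, s])
    d = kappa * d / np.linalg.norm(d)
    return np.array([p[0], p[1], d[0], d[1]])
```

Any other point of the original line, mapped as an image point, lands on the transformed line, which is the property the equivariance constraint relies on.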
The main property of a line is the translational invariance along it. In the present work we test whether this translation transformation can be sufficient for disentangling the geometrical parameters of a line in the latent representation. We also study the effect of such a transformation when incorporated in addition to the affine transformations.
In this work we consider five model configurations of constraints based on the equivariance between the image space and the latent representation of the track line parameters (Table 1).
In the models, which output the track activation parameter a, when track container z i is marked by the encoder as not active (a i ≤ 0), we reset the geometrical parameters of this track to random values. This operation forces the decoder to learn to ignore the disabled track.
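The reset of disabled containers can be sketched as follows. The sampling distribution is an assumption (uniform over the $[-1, 1]$ output range), as are the function name and the array layout.

```python
import numpy as np

def reset_disabled(z, rng):
    """Randomize the geometry of disabled tracks.

    z has shape (batch, n_tracks, 5) with columns (x, y, c, s, a).
    Containers with a <= 0 get their (x, y, c, s) replaced by random values
    (assumed uniform in [-1, 1]); the activation a itself is left untouched.
    """
    z = np.array(z, dtype=float, copy=True)
    disabled = z[..., 4] <= 0
    random_geom = rng.uniform(-1.0, 1.0, size=z[..., :4].shape)
    z[..., :4] = np.where(disabled[..., None], random_geom, z[..., :4])
    return z
```

Since the decoder then sees uninformative geometry whenever $a_i \le 0$, the only consistent behaviour it can learn for such containers is to draw nothing.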

Loss function
The model training is performed by minimizing the loss function. In the models without the track activation parameter $a$, the loss function contains only the image term $L_{im}$, which describes the dissimilarity between the decoder output $Y$ and the transformed image $I'_c$ with the $L_2$ measure:

$$L_{im} = \mathbb{E}\left[\left(\lambda\,\alpha_{sig}\,(c_0 + I'_c)^2 + (1-\lambda)\,\alpha_{L2}\right)(Y - I'_c)^2\right].$$

Here $\mathbb{E}$ denotes averaging over all image pixels. The $(\lambda\,\alpha_{sig}\,(c_0 + I'_c)^2 + (1-\lambda)\,\alpha_{L2})$ term scales the loss in the image regions with high signal intensity $I'_c$ in the beginning of training ($\lambda = 1$):

$$L_{im}\big|_{\lambda=1} = \alpha_{sig}\,\mathbb{E}\left[(c_0 + I'_c)^2\,(Y - I'_c)^2\right].$$

The coefficient $c_0$ prevents the loss from dropping to zero in low intensity image regions. As the training progresses, $\lambda$ exponentially decreases, and the loss is relaxed to pure $L_2$:

$$L_{im}\big|_{\lambda=0} = \alpha_{L2}\,\mathbb{E}\left[(Y - I'_c)^2\right].$$

In the models with the track activation parameter $a$, the loss function contains three terms. The first term describes the $L_2$ pixel value measure as described above. The next two terms address the information flow problem (also referred to as the shortcut problem in ref. [22]), in which the geometrical parameters $(x, y, c, s)_i$ of multiple tracks describe the same track or some of them are ignored. This is achieved in two steps. First, we demand that each track $t_i$ is found (and marked as active by the parameter $a_i > 0$) on average at the same rate as the others, by comparing the mean activation of a particular track $\bar{a}_i = \mathbb{E}_{mb}\,a_i$ with the mean activation of all tracks $\bar{a} = \mathbb{E}_{mb,i}\,a_i$, both calculated over the minibatch. Second, the final term $L_{bin\_act} = \alpha_{bin}\,\mathbb{E}(1 - a^2)$ forces the activation parameter $a$ to cluster at values −1 or 1, enforcing the encoder decision on whether a track is enabled or disabled.
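The loss terms described above can be sketched in NumPy as follows. The image and binarization terms follow the weighting factor and the expression for $L_{bin\_act}$ given in the text. The exact form of the uniform-activation term is not given, so the squared deviation used here is an assumption, as are the function names. Coefficient defaults are the values quoted in the training procedure.

```python
import numpy as np

def image_loss(y, target, lam, c0=0.3, alpha_sig=1e4, alpha_l2=300.0):
    """Intensity-weighted squared error, averaged over all pixels."""
    weight = lam * alpha_sig * (c0 + target) ** 2 + (1.0 - lam) * alpha_l2
    return np.mean(weight * (y - target) ** 2)

def uniform_activation_loss(a, alpha_unif=2.5e5):
    """ASSUMED form: squared deviation of each track's mean activation
    (over the minibatch) from the mean activation of all tracks.

    a has shape (minibatch, n_tracks).
    """
    a_i = a.mean(axis=0)   # mean activation per track container
    a_bar = a.mean()       # mean activation over all tracks
    return alpha_unif * np.mean((a_i - a_bar) ** 2)

def binary_activation_loss(a, alpha_bin=55.0):
    """Pushes activations towards -1 or +1."""
    return alpha_bin * np.mean(1.0 - a ** 2)
```

With `lam=0` the image term reduces to `alpha_l2` times the plain mean squared error, matching the relaxed form of the loss.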

Training data
For model training and evaluation, we generate synthetic images resembling noisy emulsion data. They contain two types of objects: "particle tracks" and noise, so-called "fog". Tracks are chains of bright Gaussian spots, with the spot density per unit length sampled from a Poisson distribution with mean $\mu$, located randomly along straight lines with deviation $d \sim N(0, \sigma_d)$. Fog is represented by Gaussian spots uniformly distributed in the area with density $\rho$. The track density as well as the $\mu$, $\sigma_d$, and $\rho$ parameters approximately correspond to usual experimental conditions [9] and remain fixed throughout this study. While the generated dataset resolution matches the usual imaging resolution, we downscale the images by a factor of four to facilitate this study (Figure 4A,B).
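A generator of this kind can be sketched as below. All numerical defaults (`mu`, `sigma_d`, `rho`, `spot_sigma`) are placeholders rather than the values used in the study, and drawing the fog count from a Poisson distribution and clipping to $[0, 1]$ are assumptions.

```python
import numpy as np

def add_spots(img, xs, ys, sigma):
    """Add unit-amplitude Gaussian spots centred at pixel coordinates (xs, ys)."""
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for x0, y0 in zip(xs, ys):
        img += np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2.0 * sigma ** 2))
    return img

def synth_image(size=128, n_tracks=3, mu=0.3, sigma_d=0.5, rho=0.002,
                spot_sigma=1.0, rng=None):
    """Return a synthetic image and the ground-truth (x0, y0, phi) of its tracks."""
    rng = np.random.default_rng() if rng is None else rng
    img = np.zeros((size, size))
    tracks = []
    for _ in range(n_tracks):
        x0, y0 = rng.uniform(0, size, 2)        # a point on the line
        phi = rng.uniform(0, np.pi)             # slope angle
        length = 2.0 * size                     # long enough to cross the image
        n_spots = rng.poisson(mu * length)      # mu spots per unit length on average
        along = rng.uniform(-length / 2, length / 2, n_spots)
        across = rng.normal(0.0, sigma_d, n_spots)  # deviation from the line
        xs = x0 + along * np.cos(phi) - across * np.sin(phi)
        ys = y0 + along * np.sin(phi) + across * np.cos(phi)
        add_spots(img, xs, ys, spot_sigma)
        tracks.append((x0, y0, phi))
    n_fog = rng.poisson(rho * size * size)      # fog: uniform in the image area
    add_spots(img, rng.uniform(0, size, n_fog), rng.uniform(0, size, n_fog), spot_sigma)
    return np.clip(img, 0.0, 1.0), tracks
```

The returned track list is only needed for evaluation; the unsupervised training itself uses the images alone.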

Model implementation and training procedure
In both the encoder and decoder, we incorporated the CoordConv approach [23] in the first layer, which conceptually fits in our study. In this approach, two additional channels, containing x and y coordinates of pixels in the range of [0, 1] correspondingly, are concatenated with input data channels before the first convolutional layer. In practice, we observe that CoordConv improves the performance.
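The coordinate-channel concatenation can be sketched in a framework-agnostic way with NumPy. The function name, the channels-last layout, and the choice of $y = 0$ at the bottom row (following the image coordinate convention of Figure 3) are assumptions.

```python
import numpy as np

def add_coord_channels(batch, lo=0.0, hi=1.0):
    """Append x and y coordinate channels to a batch of shape (N, H, W, C).

    Coordinates run from lo to hi; x increases to the right and y increases
    upwards, so the bottom-left pixel is (lo, lo) and the top-right is (hi, hi).
    """
    n, h, w, _ = batch.shape
    xs = np.linspace(lo, hi, w)
    ys = np.linspace(hi, lo, h)          # row 0 is the top of the image
    xx, yy = np.meshgrid(xs, ys)         # both of shape (H, W)
    coords = np.broadcast_to(np.stack([xx, yy], axis=-1), (n, h, w, 2))
    return np.concatenate([batch, coords], axis=-1)
```

Passing a different `lo` and `hi` changes only the coordinate channels, which is what makes it possible to evaluate a CoordConv decoder outside the original crop range.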
Since we deal with several objects of the same nature, we found it reasonable to apply the decoder $D$ to the track feature containers $z_i$ of each $i$-th track separately and then merge the outputs. To this end we first process each of the containers $z_i$ with the same decoder network, which outputs a single-channel map $Y_i = D(z_i)$ corresponding to the track $t_i$. These maps are then merged pixel by pixel into the final output $Y$, where the sigmoid activation function $\sigma(x) = (1 + e^{-x})^{-1}$ ensures that the output pixel values are in the range [0, 1]. This not only reduces the number of parameters in the decoder, but also simplifies the study of encoder performance, as each $z_i$ has the same structure. Alternatively, shuffling the containers $z_i$ within each sample could be employed. Details on the encoder and decoder architectures are given in Table 2.

Table 2: Description of the encoder and decoder used in our models. conv(kernel size, dilation, # of channels): convolution; c_conv: CoordConv, concatenation with 2 channels of x and y coordinates and convolution; AP: average pooling; FC: fully connected layer; c_tconv: transposed CoordConv, i.e. tiling input up to the target size, concatenation with 2 coordinate channels, and convolution. All convolutions and FC layers are followed by batch normalization [24] and ReLU [25] activation, unless otherwise stated.

The coefficients in the loss function were chosen empirically to balance the values of its terms: $c_0 = 0.3$, $\alpha_{sig} = 10{,}000$, $\alpha_{L2} = 300$, $\alpha_{unif} = 2.5 \times 10^5$, and $\alpha_{bin} = 55$. Training starts with $\lambda = 1$, and after $l = 100$k iterations $\lambda$ is exponentially decreased every 5k iterations by a factor of 0.9, so that $L_{im}$ is relaxed to pure $L_2$ after about 200k iterations. We perform the training on 32×32 random crops from 40,000 images of 128×128 pixels until convergence, for 500,000 iterations with a minibatch of 128 images. All experiments were carried out using TensorFlow 1.12 [26].
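One possible reading of the $\lambda$ schedule described above is a step-wise decay, sketched below. Whether the first reduction happens at 100k or at 105k iterations is not specified; this sketch assumes the latter, and the function name is hypothetical.

```python
def lambda_schedule(iteration, start=100_000, step=5_000, factor=0.9):
    """Step-wise exponential decay of the loss-mixing coefficient lambda.

    lambda stays at 1 for the first `start` iterations and is then multiplied
    by `factor` once per completed block of `step` iterations (assumed reading).
    """
    if iteration < start:
        return 1.0
    return factor ** ((iteration - start) // step)
```

The value returned here would be passed as the mixing coefficient of the image loss at each training iteration.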
Models were trained using the Adam optimizer [27] with an initial learning rate of $6 \times 10^{-5}$ to allow for stabilization, rising to $1 \times 10^{-3}$ after 2k iterations. Afterwards the rate is decreased by a factor of 0.88 every 90,000 iterations. The loss function over the course of training for, e.g., the AT+TI model is shown in Figure 5. The loss function evaluated on the test dataset (green curve in Fig. 5; see section 3.3 for test dataset details) at training checkpoints confirms that the model did not overfit to the training data. The training of each model took about 50 hours on a single GeForce GTX 1080 GPU.

Autoencoder performance
First, we evaluate whether our models have learned to properly capture the content of presented images in both latent and image spaces. Figure 6 compares the outputs of the five models with the lines drawn according to the latent representation $z$ predicted for the image, interpreted as described above. It is clear that both the AT+A and AT+TI+A models were able to build a geometrically meaningful latent representation $z$ in most cases. For the AT+A model, which does not employ the translational invariance, the output contains more false detections both in the image output and in the drawn track lines. Also, the image output is significantly less sharp in the beginning of the training for this model. The RT+TI+A model, on the other hand, did not manage to separate the factors of variation in the desired way, and it took much longer to converge, even just to mimic the desired image output. One can see inconsistency between the image output of the autoencoder and the track lines obtained from the latent representation, meaning that overall it did not grasp the concept of the geometrical space in the desired manner. None of the models properly learned the ability to "switch off" the tracks using the confidence parameter $a_i$. While this parameter is not completely ignored (blue lines in the tracks column in Fig. 6), in most cases the models have learned other ways to disable track parameter containers $z_i$ that are not used to encode lines in the image. These containers simply have geometrical parameters corresponding to lines outside of the image crop range, or have $x, y$ coordinates far away from the image crop center (e.g. AT+A and AT+TI+A in Fig. 6A). Performance of the AT+TI model was comparable to or sometimes even better than that of the AT+TI+A model. On the downside, without the parameter $a_i$ acting as a regularizer, this model tends to attribute close parameters to several lines (e.g. AT+TI in Fig. 6E).
The overparametrized models used the $x, y$ position to encode confidence in the track presence by placing it closer to or further from the image along the track line (Figure 6E,F). Performance of the AT, rcs model is slightly worse than that of the AT+TI model. The performance of the models clearly degrades when the number of tracks in the image crop is ≥4. We assume that the main reason for this is that these cases were rather underrepresented in the training set.

Disentanglement of the geometrical variational factors
To better understand the learned representation, we carefully dissect both the decoder and the encoder in this and the following sections, which is possible since the latent representation was designed to be fully interpretable.
We start with the visual analysis of the learned representation by verifying the output of the decoder for given values of $z_i$. To this end, we have run the decoder on the entire range of meaningful values of $z_i$. In addition, for this study we performed the prediction on the image coordinate area (−1, −1) − (2, 2), i.e. 9 times bigger than the range of the original cropped image (0, 0) − (1, 1). This way we can empirically see how well the decoder generalizes to a wider coordinate range. This is possible thanks to the CoordConv nature of the decoder: by changing the values in the coordinate channels we can perform the prediction at any position. The decoder outputs are shown in Figure 7 and Supplementary Figure 1. It is clear that all models that employ the activation parameter have learned to suppress the output when the values of the parameters $a_i$ are small. The output of the RT+TI+A model does not correlate with the expectation at all, and while the output resembles lines, the learned representation is clearly not the desired one. It would be interesting to investigate which representation was found, but this is outside the scope of this paper. Models employing translational invariance produced more elongated lines that fade out more slowly compared to the AT+A model (compare, for example, AT+A vs AT+TI+A and AT+TI in Fig. 7, rows 2-5). AT+TI shows even more pronounced and fine lines. All overparametrized models (i.e. all except AT, rcs) also suppress tracks with an $x, y$ position lying further from the image range (Fig. 7, top row).

Performance of the track parameters' measurement
To study the performance of the model for tracking, we evaluate the distribution and resolution of the encoder outputs $z_i$. In Figure 8, the distributions of the predicted $x_i$, $y_i$ positions and the angle $\varphi_i$ obtained from the $c_i$, $s_i$ parameters for each of the eight track feature containers $z_i$ are shown for the AT+A, AT+TI+A, AT+TI, and AT, rcs models. Since the latter has a different representation, for consistency we obtain the values of $x_i$, $y_i$, $\varphi_i$ from its $r_i$, $c_i$, $s_i$ outputs using the two-argument arctangent function. We skip further studies of the RT+TI+A model, since its latent variables do not have the desired meaning, rendering the geometrical analysis meaningless. We show these distributions separately for "enabled" and "disabled" tracks, according to the latent activation parameter $a_i$, where applicable. One can clearly see again that the models did not learn to use the parameter $a_i$, and e.g. the AT+A model assigned almost all tracks the "enabled" value of the activation parameter. Instead, the reconstructed parameters for containers that do not correspond to any tracks in the image have rather localized $x, y$ positions and angle (see the peaks in the angular distribution, present for each of the eight track containers). The positions for existing tracks lie within or close to the coordinate region of the image (0, 0) − (1, 1). While they tend to cluster for each track container, they rather uniformly cover a wide band in the $x, y$ space. Notably, combined with a wide angular distribution, this localization does not limit the sensitivity region of the model (see, for example, the parameter distributions of the 8th container for the enabled tracks in the AT+A model in Fig. 8).
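The conversion from the rho-theta outputs to a point and slope angle can be sketched as follows. It assumes, consistently with $\theta_i = \varphi_i + \pi/2$ and with $r_i$ being the distance from the origin to the track, that the reported point is the foot of the perpendicular from the origin. The function name is hypothetical and `r` is taken to be already scaled to $[0, \sqrt{2}]$.

```python
import numpy as np

def rcs_to_xyphi(r, c, s):
    """Convert (r, c, s) of the rho-theta parametrization to (x, y, phi).

    theta is the direction of the normal from the origin to the line;
    (x, y) is the point of the line closest to the origin (assumed choice),
    and phi = theta - pi/2 is the slope angle of the line.
    """
    theta = np.arctan2(s, c)   # two-argument arctangent, independent of kappa
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    phi = theta - np.pi / 2
    return x, y, phi
```

For example, `rcs_to_xyphi(0.5, 0.0, 0.7)` describes the horizontal line $y = 0.5$: the normal points along the $y$ axis and the slope angle is zero.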
In the angular space all directions are covered, leaving no blind spots. Each of the eight parameter containers covers a subspace with some overlap for models AT+A and AT+TI+A. This means that only a fraction of the track containers is sensitive to any chosen direction. E.g. the AT+TI+A model would fail to detect >3 parallel lines with 70° inclination. Angular overlap in the AT+A model is very poor, leading to poor detection of several parallel tracks in an image crop.
In fact, three out of eight output containers have geometrical parameters corresponding to lines outside of the image (Supplementary Figure 2). In AT+TI, on the other hand, each container covers almost π in angular space, and the overall distribution is rather uniform (see Supplementary Figure 2). This would allow the detection of several parallel lines in a view (e.g. in Fig. 6F several tracks have similar angles).
The AT, rcs model did not learn to utilize most of the containers. Practically only the containers 2 and 6 learned to encode track lines, as seen in Figure 8. Nevertheless, even these two containers do not cover the whole angular range. For the remaining containers, the tracks lie outside of the image crop region, as seen on the x, y distribution. This distribution is easy to interpret for this model since the track angle can be observed directly from the coordinates. For a circle with center at the origin and passing through some point x, y, the track line would be the tangent to the circle at this point. Arguably, the lack of flexibility due to minimal parametrization did not allow this model to efficiently switch off the tracks, leaving 2 almost always enabled and 6 always disabled.
To quantitatively evaluate the performance of the models we have processed the test dataset. This dataset was generated similarly to the training dataset with additional information on ground truth (GT) track positions and angles. It consists of 30,000 images with 5,000 sample images for each of 0, 1, ..., 5 tracks/image conditions.
First, we assign the reconstructed tracks (only the active ones, i.e. $a > 0$, for models with the activation parameter $a$) to the GT ones or mark them as fake. We evaluate the distance $\Delta r$, measured from the image center, between a predicted track and a GT track, and the difference in angle $\Delta\varphi$. Then we build the $\chi^2$ as $\chi^2 = (\Delta r/\sigma_r)^2 + (\Delta\varphi/\sigma_\varphi)^2$. Here we use the theoretical position resolution, which is defined by the pixel size, $\sigma_r = 1$ px. The assignment is then performed sequentially, by selecting available prediction-GT pairs according to the minimum value of $\chi^2$, if $\chi^2 \le 11.83$ (3σ statistical significance, number of degrees of freedom ndf = 2). The remaining predicted tracks are split into two categories, fakes and duplicates. A track is considered a duplicate if its $\chi^2$ to any of the used GT tracks or assigned prediction tracks is $\chi^2 < 2.3$ (i.e. within 1σ), and a fake otherwise. For the assigned tracks, we then evaluate the offsets $\Delta r$, $\Delta\varphi$ as a function of the number of tracks on the original image, as well as the fraction of assigned tracks (efficiency), number of fake tracks, and number of duplicate tracks (Figure 9). The actual coordinate and angular resolutions $\sigma_{r,mod}$, $\sigma_{\varphi,mod}$ of the models can be estimated from these data as mean values of $\Delta r$ and $\Delta\varphi$. For example, for the AT+TI model $\sigma_{r,mod} \approx 0.013 \approx 0.42$ px and $\sigma_{\varphi,mod} \approx 15$ mrad (for $l = 32$ px).
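The sequential assignment described above can be sketched as a greedy matching on the sorted list of $\chi^2$ values. In this sketch each track is represented as a pair $(r, \varphi)$, with $r$ the signed distance of the line from the image center. This representation, the function names, and the handling of the $\pi$ ambiguity of undirected lines are assumptions, and $\sigma_\varphi$ is left as an input.

```python
import numpy as np

CHI2_MATCH = 11.83  # 3 sigma for ndf = 2
CHI2_DUP = 2.3      # 1 sigma for ndf = 2

def chi2(pred, ref, sigma_r, sigma_phi):
    """chi^2 between two lines given as (signed distance r, angle phi).

    (r, phi) and (-r, phi + pi) describe the same undirected line, so the
    angle difference is folded into [-pi/2, pi/2] and r is flipped accordingly.
    """
    k = int(np.round((pred[1] - ref[1]) / np.pi))
    dphi = pred[1] - ref[1] - k * np.pi
    sign = -1.0 if k % 2 else 1.0
    dr = pred[0] - sign * ref[0]
    return (dr / sigma_r) ** 2 + (dphi / sigma_phi) ** 2

def assign_tracks(preds, gts, sigma_r, sigma_phi):
    """Return ({pred index: GT index}, duplicate indices, fake indices)."""
    pairs = sorted((chi2(p, g, sigma_r, sigma_phi), i, j)
                   for i, p in enumerate(preds) for j, g in enumerate(gts))
    matched, used_p, used_g = {}, set(), set()
    for value, i, j in pairs:
        if value > CHI2_MATCH:
            break                      # all remaining pairs are worse
        if i in used_p or j in used_g:
            continue                   # one of the two is no longer available
        matched[i] = j
        used_p.add(i)
        used_g.add(j)
    refs = [gts[j] for j in used_g] + [preds[i] for i in used_p]
    duplicates, fakes = [], []
    for i, p in enumerate(preds):
        if i in used_p:
            continue
        if any(chi2(p, ref, sigma_r, sigma_phi) < CHI2_DUP for ref in refs):
            duplicates.append(i)
        else:
            fakes.append(i)
    return matched, duplicates, fakes
```

Efficiency then follows as the number of matched GT tracks divided by the number of GT tracks in the image.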
We then use mean resolutions for each model to show the $\chi^2 = (\Delta r/\sigma_{r,mod})^2 + (\Delta\varphi/\sigma_{\varphi,mod})^2$ distribution for these models for different numbers of tracks per image crop in Supplementary Figure 3. While for the AT+A, AT+TI, and AT, rcs models the distributions are consistent with a $\chi^2$ distribution with ndf = 2, for the AT+TI+A model the peak is smeared and shifted towards higher values, consistent with higher resolution variance, especially in the high track density region.
The resolution of the models is stable as the number of tracks grows. The AT+TI model has consistently higher resolution, as well as higher efficiency. Efficiency significantly decreases in all models with increasing number of tracks per image. We argue that this is caused by the fact that images with a high track number were underrepresented in the training set (Figure 4C), and that the efficiency would improve if a training set with high track multiplicity images were used. To support this claim, we have generated a training dataset of 60,000 images with a uniform distribution of track density, i.e. 10,000 images for each of 0, 1, ..., 5 tracks/image crop. We have then retrained the AT+TI model on this dataset. This has significantly (>20%) improved the efficiency for high track density (AT+TI/U model in Fig. 9). While the number of fake tracks is similar in the AT+TI model, since it lacks the regularization based on the latent parameter $a$, it tends to assign all of the available containers to tracks. This leads in turn to a larger number of duplicates. Nevertheless, this effect is suppressed for the AT+TI/U model, trained on the dataset with uniform track number representation. Another model without the activation parameter $a$, the AT, rcs model, does not produce many duplicates, most likely since each track is sensitive only to a narrow angular range, as shown above.
Figure 9: From top to bottom: distance between GT track and the reconstructed one (less is better); angular difference (less is better); fraction of reconstructed tracks (more is better); number of reconstructed tracks that have no corresponding GT track, per image (less is better); number of reconstructed tracks that duplicate another reconstructed track, per image (less is better). Gray fill shows error range of 1 standard deviation of distribution for resolution parameters and of mean value for efficiency, fake tracks, and duplicates.

Finally, we processed a real emulsion dataset to qualitatively observe the performance of the AT+TI/U model (trained on synthetic data) in processing real experimental data. We used a single image out of a 3D tomographic image stack of size 640 × 512 pixels, corresponding to 190 × 150 µm of emulsion detector area, irradiated with 400 GeV protons at different angles at the SPS accelerator beam at CERN [28]. We preprocessed the image (Figure 10A) by downscaling it by a factor of four, inverting the image, and normalizing the color scale to match the training data properties (Figure 10B). We then divided the image into 5×4 non-overlapping 32×32 pixel crops and processed them independently. The resulting tracks are then assembled into the full image size and shown as 32 pixel long segments with highlighted $x, y$ position (Figure 10B, overlay). Even though the model was trained on synthetic data with signal and noise distributions different from those of the experimental data, one can appreciate the agreement between real tracks and predicted ones (e.g. tracks pointed to by arrowheads), confirming the effectiveness of our method. For real-life applications, the model must be trained on the raw experimental data, to learn the true signal and noise distributions from it.
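The preprocessing and tiling steps can be sketched as below. Downscaling by block averaging and min-max normalization are assumptions (the text does not specify the resampling or normalization method), and all function names are hypothetical.

```python
import numpy as np

def preprocess(img, factor=4):
    """Downscale by block averaging (assumed), invert, and scale to [0, 1]."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor
    small = img[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    small = small.max() - small            # dark grains become bright spots
    span = small.max() - small.min()
    return (small - small.min()) / span if span > 0 else np.zeros_like(small)

def tile(img, size=32):
    """Split an image into non-overlapping size x size crops keyed by (row, col)."""
    rows, cols = img.shape[0] // size, img.shape[1] // size
    return {(r, c): img[r * size:(r + 1) * size, c * size:(c + 1) * size]
            for r in range(rows) for c in range(cols)}

def to_full_image(x, y, row, col, size=32):
    """Map crop coordinates (origin bottom-left, range 0..1) to full-image
    pixel coordinates (column, row index counted from the top)."""
    return (col + x) * size, (row + 1 - y) * size
```

A 640 × 512 pixel image stored as an array of shape (512, 640) becomes 160 × 128 pixels after downscaling and yields the 5×4 grid of crops mentioned above.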

Discussion and outlook
Disentangling factors of variation has remained a hot topic in representation learning research for several years. Moreover, developing models that are capable of abstracting high-level concepts from raw data can lead to plenty of direct practical applications. Many works approached this problem by employing variational autoencoders with regularizations in the latent representation that enforce disentangling [16] or autoencoders combined with adversarial training [22]. In most cases, after disentangling the factors of variation, a few labeled samples can be used to associate the factors with interpretable measures in a quantitative way. Previously it has been shown [29] how a few labeled samples can improve disentanglement itself. In another work [30] it was shown that applying an equivariance constraint, i.e. requiring that a change of one factor of variation corresponds to a predictable change of one dimension of the disentangled representation, leads to disentangled variables. Nevertheless, to the best of our knowledge no previous works have tried to extract meaningful quantitative information in a fully unsupervised manner. In this work, we have demonstrated that imposing equivariance constraints on the autoencoder under geometrical transformations in the image and latent representation domains enables the model to "discover" the existence of multiple lines in the presented images in a fully unsupervised manner. Incorporating simple affine transformations such as translation, rotation, scaling and skew as equivariances between the image and latent spaces allows the models to successfully disentangle the factors of variation in the image data into geometrically meaningful parameters (coordinates and angles of lines). Adding the possibility to "switch off" a predicted track with an additional activation parameter $a$ does not drastically change the results (models AT+A and AT+TI+A). While it can help to prevent the shortcut problem and reduce the number of track duplicates, these models did not learn to exploit it.
Incorporating the translation along the track line in addition to the whole set of affine transformations enforces the line detection. However, employing only the translational invariance, even together with rotational transformations (RT+TI+A model), leads to reference ambiguity, so that the latent parameters do not correspond to the desired geometrical variables. We believe that with a few calibration measurements it would be possible to find a mapping from these latent parameters to the desired geometrical variables, but this lies outside the scope of this work. As we have shown, a larger set of transformations allows the model to immediately learn the latent representation in an unambiguous, geometrically meaningful way. The minimal subset of affine transformations sufficient for disentangling the factors of variation without reference ambiguity will be explored in future work.
In addition to the coordinate-angle parametrization, which gives the models more freedom in sample exploration, we have studied the classical rho-theta parametrization. Under the set of affine transformations, this model is also able to learn a meaningful parametrization in a fully unsupervised manner. Yet its performance is slightly worse than that of, for example, the AT+TI model, most likely because the incorporated CoordConv approach favors a natural x − y coordinate representation.
The main weak point of our current implementation is that neither the background nor the grain distribution along the lines is represented in any way by the current models, which may impair line detection at a high background rate. In addition, the employed transformations in the image domain affect the image parameter distributions, such as brightness (corresponding to the dE/dx energy loss in the emulsion detector) and sharpness. Models with additional global and per-track latent parameters without an a priori assigned meaning would naturally overcome these hurdles. Training them would require dealing with the shortcut problem in these parameters, and would thus benefit from employing the adversarial framework [22], [31]. Since the aim of this work was to carefully study the proposed approach in general, we leave this aspect to further studies.
We expect this approach to have a large potential in the analysis and extraction of geometrical properties from image data. In further work we plan to adapt this technique to the location of tracks in full-resolution 3D tomographic microscopy data, or in data from Liquid Argon Time Projection Chamber detectors [32], [33], which would be a direct extension of this approach. In addition, adding more samples with a higher number of tracks to the training dataset is expected to improve the efficiency and resolution in cases with high track density.
While designed to detect simple line structures, this technique has the potential to be used for locating and parametrizing other objects, such as splines. This would enable the tracking of particles in magnetic fields and pave the way to novel automated image vectorization techniques. Being fully unsupervised, this approach can leverage all available raw data with no extra work required.

Translation:
$T_{\mathrm{trans}}(z_t \mid \xi_{tx}, \xi_{ty}) = (x + \xi_{tx},\; y + \xi_{ty},\; c,\; s), \quad \xi_{tx}, \xi_{ty} \in (-0.4, 0.4)$
Rotation is performed around the coordinate origin; scaling x and scaling y leave the y and x coordinates, respectively, intact; skew x and skew y preserve points on the x and y axes, respectively. The employed range of transformations is a trade-off between urging the model to learn the desired representation and preserving most of the original tracks in the image crop after the transformation.
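As an illustration of how such a transformation acts on the latent track parameters z_t = (x, y, c, s), the translation can be sketched in a few lines of Python. The helper names are hypothetical. The rotation helper additionally assumes that (c, s) are the cosine and sine of the track angle, so that the direction rotates together with the point (x, y); it is an illustrative sketch rather than the exact form used in the model.

```python
import math
import random

def translate(z_t, xi_tx, xi_ty):
    """T_trans: shift the track point, leave the direction (c, s) unchanged."""
    x, y, c, s = z_t
    return (x + xi_tx, y + xi_ty, c, s)

def rotate(z_t, xi_r):
    """Rotation by the angle xi_r (radians) around the coordinate origin.

    Assumption: both the point (x, y) and the direction (c, s) rotate.
    """
    x, y, c, s = z_t
    cr, sr = math.cos(xi_r), math.sin(xi_r)
    return (cr * x - sr * y, sr * x + cr * y,
            cr * c - sr * s, sr * c + cr * s)

def sample_translation(rng=random):
    """Draw (xi_tx, xi_ty) uniformly from the (-0.4, 0.4) range."""
    return rng.uniform(-0.4, 0.4), rng.uniform(-0.4, 0.4)

z_t = (0.1, -0.2, 1.0, 0.0)               # track through (0.1, -0.2) along x
shifted = translate(z_t, 0.3, 0.1)        # point moves, direction is kept
turned = rotate(z_t, math.pi / 2)         # quarter turn around the origin
```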
We apply these transformations only to the enabled tracks, according to the value of a. For the disabled tracks, the parameters are set to random values in the (−1, 1) range, forcing the decoder to learn to ignore disabled tracks:
$$z = \begin{cases} (z_t,\; \sigma(\gamma a)), & a > 0 \\ (r_1, r_2, r_3, r_4,\; \sigma(\gamma a)), \quad r_i = \mathrm{rand}(-1, 1), & a \le 0 \end{cases}$$
where γ = 20 and the sigmoid function $\sigma(a) = \frac{1}{1 + e^{-a}}$ is applied to the activation parameter a for implementation reasons.
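The gating rule above can be sketched as a small Python function. This is a hedged illustration: the function name gate_track and the numerically stable form of the sigmoid are choices made here, not details taken from the text.

```python
import math
import random

GAMMA = 20.0  # steepness factor gamma, as given in the text

def sigmoid(v):
    """Numerically stable logistic function 1 / (1 + exp(-v))."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def gate_track(z_t, a, rng=random):
    """Return the 5-tuple passed on for one track.

    Enabled tracks (a > 0) keep their parameters z_t = (x, y, c, s);
    disabled tracks (a <= 0) get four random values from (-1, 1).
    The last entry is always sigmoid(GAMMA * a).
    """
    if a > 0:
        params = tuple(z_t)
    else:
        params = tuple(rng.uniform(-1.0, 1.0) for _ in range(4))
    return params + (sigmoid(GAMMA * a),)

enabled = gate_track((0.1, 0.2, 0.3, 0.4), a=0.5)
disabled = gate_track((0.1, 0.2, 0.3, 0.4), a=-0.5, rng=random.Random(0))
```

With γ = 20 the sigmoid is close to a step function, so the last entry is near 1 for enabled tracks and near 0 for disabled ones, even for moderate values of a.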
For the model employing the (r, c, s) representation, we first transform the parameters to the (x, y, c, s) representation as: θ = arctan2(s, c), φ = θ − π/2, x = r cos θ; y = r sin θ; c = cos φ; s = sin φ, and then apply the transformations shown above. Afterwards, the inverse transformation back to the (r, c, s) representation is applied.
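The change of representation can be illustrated with the following Python sketch. The forward mapping follows the formulas above. The inverse mapping is not spelled out in the text, so the version below is an assumption: it takes θ = φ + π/2 and computes r as the signed distance of the line from the origin along the normal direction (cos θ, sin θ). The function names are hypothetical.

```python
import math

def rcs_to_xycs(r, c, s):
    """(r, c, s) -> (x, y, c, s), following the formulas in the text."""
    theta = math.atan2(s, c)           # angle of the normal to the line
    phi = theta - math.pi / 2          # angle of the line itself
    return (r * math.cos(theta), r * math.sin(theta),
            math.cos(phi), math.sin(phi))

def xycs_to_rcs(x, y, c, s):
    """(x, y, c, s) -> (r, c, s); an assumed inverse, not given in the text."""
    phi = math.atan2(s, c)
    theta = phi + math.pi / 2
    # Signed distance from the origin to the line through (x, y).
    r = x * math.cos(theta) + y * math.sin(theta)
    return (r, math.cos(theta), math.sin(theta))

start = (0.5, math.cos(0.3), math.sin(0.3))
x, y, c, s = rcs_to_xycs(*start)
round_trip = xycs_to_rcs(x, y, c, s)

# Sliding the point along the line direction does not change (r, c, s).
slid = xycs_to_rcs(x + 0.25 * c, y + 0.25 * s, c, s)
```

Under this assumed inverse, a shift of the point (x, y) along the track direction leaves (r, c, s) unchanged, which reflects the fact that the rho-theta parametrization has no coordinate along the line.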
The parameter set ξ is drawn from a uniform random distribution for each training sample on each iteration and is fed into the network along with the images. The input images $I$ of size 96×96 pixels are cropped to 32×32 as shown in Figure 2 and fed as the network input $I_c$. The same input images $I$ are then elastically transformed with the transformation function $I' = T_{\mathrm{im}}(I \mid \xi)$ using the same parameter set ξ. For $T_{\mathrm{im}}$ we have employed the tf.contrib.image.transform function from the TensorFlow library [26]. The origin of the transformations corresponds to pixel coordinates (48, 48) in the input image, i.e. the lower bottom corner of the crop. After being cropped to 32×32 pixels, the transformed images are used as the network output target $I'_c$. We use larger 96×96 input images to ensure that the