Robust Image Stitching and Reconstruction of Rolling Stocks Using a Novel Kalman Filter With a Multiple-Hypothesis Measurement Model

This work introduces a novel algorithm for the reconstruction of rolling stocks from a sequence of images. The research aims at producing an accurate and wide image model that can be used as a Digital Twin (DT) for diagnosis, fault prediction, maintenance, and other monitoring operations. When observing large surfaces with nearly constant textures, metallic reflections, and repetitive patterns, motion estimation algorithms based on whole-image error minimization and feature pairing with Random Sample Consensus (RANSAC) or Least Median of Squares (LMedS) fail to provide appropriate associations. To overcome this issue, we propose a custom Kalman Filter (KF) modified by adding multiple input-noise sources represented as a Gaussian mixture (GM) distribution, together with specific algorithms to select the appropriate data and variance to use for state prediction and correction. The proposed algorithm has been tested on images of train vessels featuring a high number of windows and large metallic paintings with constant or repetitive patterns. The presented approach proved robust in the presence of high environmental disturbances and a reduced number of features. A large set of rolling-stock captures was collected during a six-month campaign and employed to demonstrate the validity of the proposed algorithm by comparing the reconstructed twin against known data. The system showed an overall accuracy in length estimation above 99%.


I. INTRODUCTION
Even if the national report on railway security [1] assesses Italy as one of the safest railway systems in Europe, in 2018 there was on average one significant railway accident every 3.3 Mln Tr-km (millions of train-kilometers) and one death in train accidents every 5.133 Mln Tr-km. Compared to the cumulative yearly train mileage (384 Mln Tr-km), this amounts to roughly one hundred accidents per year. The major causes of train accidents are attributable to improper human behavior and, second, to maintenance issues.
Train maintenance is performed in two ways [2]: corrective maintenance and predictive maintenance. While corrective maintenance cannot be avoided, allocating economic resources to predictive maintenance can reduce mechanical failures and, in turn, the amount of corrective maintenance.
In this work, we propose to reduce errors at the end of maintenance cycles and to detect when trains need anticipated maintenance by setting up fault detection systems that exploit a reliable Digital Twin (DT). The goal is to move from the current plan-based maintenance to condition-based maintenance that uses monitoring tools to assess the health status of the train.
The DT concept was introduced in 2002 [3] as a method for Product Life-cycle Management [4]. However, the term DT was coined only in 2011 by NASA [5] as a conceptual basis for astronautics and aerospace procedures. In modern maintenance paradigms, the DT is considered one of the most important tools of industry digitization [6].

FIGURE 1. A concept of the wayside capture scenario using one camera and one laser.

The Industry 4.0 approach in [7] imagines predictive maintenance as a series of IoT networks that collect large amounts of data, followed by data-fusion algorithms that produce a DT, and finally Artificial Intelligence techniques for decision making [8].
Boschert et al. [9], [10] introduced novel methods for the generation of railway-related twins, both based on the identification of failures by comparing simulations on models with DTs extracted from physical measurements. DT research commonly models trains through laser/vision sensors: laser data are typically applied to wheels [11], carbon stripes [12], [13], pantographs, and axles, while vision data are used in classical text detection as well as combined with deep learning [14]. Cha et al. [15] proposed a vision-based system to detect loosened bolts from images, while Li et al. [16] provided a mechanism for serial-number recognition from a fixed camera. Vision systems for rolling-stock analysis are applied to pantographs [17], [18] or to the entire train from different views [19], [20].
In this work, we aim to generate DTs of complete rolling-stock vessels for human or automated inspections [21]. This procedure produces large train bitmaps by composing multiple images through highly accurate mosaicing and stitching algorithms.

II. BACKGROUND
Extracting an accurate model of vehicle motion from vision information poses several problems, such as data extraction, time-space alignment, sensor fusion, and uneven space-time measurement distributions [22]. Algorithms to analyze object motion in video or to align and combine images have been explored in computer vision for almost three decades. An early example is optical flow estimation, proposed by Lucas et al. [23]. Their approach detects object motion between frames by computing the relative spatial gradient. This method works on a local neighborhood of the moving object and successfully recovers the motion when the object moves slowly and shows a pattern that is non-uniform and distinguishable from the background. The technique becomes unstable in the presence of constant or repetitive patterns, rapid motions, and/or backgrounds with similar patterns.
When an object occupies a large part of the scene, the same result may be achieved through image stitching. Image stitching (or mosaicing) works through an appropriate alignment and reprojection algorithm that can be estimated using two different approaches:
• Direct Methods: exploit a full hierarchical image analysis to evaluate pixel coherence and shift or warp the images accordingly. These tools are used for both 2D and 3D reconstruction from camera images [24];
• Feature-Based Methods: use a subset of the image pixels (features or keypoints) to evaluate the correspondence between frames.
Considering direct methods, given two image frames (I_i, I_j) and a motion vector (s), two elements should be introduced. A transformation function that maps similar points between the two frames, with coordinates (px_i, py_i) and (px_j, py_j) respectively:

(px_j, py_j) = T_f(px_i, py_i, s)

and an error metric (E(T_f, I_i, I_j, s)) that computes how well the obtained transformation represents a good fit for the projection between frames. Therefore, the optimal fit can be found through a minimum search procedure:

s* = argmin_s E(T_f, I_i, I_j, s)

For instance, in the case of a pure translation along the x-axis, the motion vector reduces to a scalar and simple metric functions may be employed, like the Sum of Squared Differences:

E_SSD(s) = Σ_{x,y} (I_i(x + s, y) − I_j(x, y))²    (1)

or the Normalized Zero Mean Cross Correlation [25]:

E_ZNCC(s) = Σ_{x,y} (I_i(x + s, y) − Î_i)(I_j(x, y) − Î_j) / (N σ_i σ_j)    (2)

where i and j are the frame indices, σ_i and σ_j represent the image pixel standard deviations, and Î_i, Î_j represent the average pixel intensities computed in the respective frames. These metrics will be employed in Section IV-A to evaluate the performance of different methods. Even in the simple case of a single translation, direct methods are usually computationally heavy and generally inaccurate when reflections and illumination properties, typical of real environments, alter the objects' appearance.
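As an illustration, the two metrics above can be implemented in a few lines of Python. This is a minimal sketch assuming grayscale frames stored as NumPy arrays and a pure horizontal translation; the function names are ours.

```python
import numpy as np

def ssd_shift(img_i, img_j, max_shift=150):
    """Return the horizontal shift minimizing the Sum of Squared Differences."""
    _, w = img_i.shape
    errors = []
    for s in range(max_shift + 1):           # assumes max_shift << image width
        a = img_i[:, s:].astype(float)       # frame i shifted left by s
        b = img_j[:, :w - s].astype(float)   # overlapping part of frame j
        errors.append(np.mean((a - b) ** 2))
    return int(np.argmin(errors))

def zncc_shift(img_i, img_j, max_shift=150):
    """Return the shift maximizing zero-mean normalized cross-correlation."""
    _, w = img_i.shape
    scores = []
    for s in range(max_shift + 1):
        a = img_i[:, s:].astype(float)
        b = img_j[:, :w - s].astype(float)
        a = a - a.mean()
        b = b - b.mean()
        denom = a.std() * b.std() * a.size
        scores.append((a * b).sum() / denom if denom > 0 else 0.0)
    return int(np.argmax(scores))
```

Both searches scan the whole shift interval, which is what makes direct methods computationally heavy on full-resolution frames.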
An alternative approach requires the extraction of a set of features from the images. Feature points identify image coordinates with very strong peculiarities; hence, these methods show higher robustness to illumination properties. Feature-based methods are faster than direct methods since they focus on a reduced number of matches. To cope with feature-analysis errors during matching, these methods usually adopt outlier-rejection tools such as RANSAC [26]. Sometimes RANSAC is applied twice: first to identify the keypoints belonging to the overlapped area, then to estimate the optimal homography matrix between frames. The idea of feature matching was first exploited by Capel [27], who proposed a way to achieve super-resolution images by mosaicing several images aligned and undistorted by a pre-computed homography map. The required homographies were estimated through a set of proposed point matches and a RANSAC procedure.

Algorithm 1 Frame Stitching and Retrieval of Inter-Frame Motion
procedure TrainStitch(frames[])
    T_f ← motion model (e.g. pure translation)
    for each consecutive pair (I_{k−1}, I_k) in frames[] do
        F_{k−1}, F_k ← DetectFeatures(I_{k−1}), DetectFeatures(I_k)
        M ← MatchFeatures(F_{k−1}, F_k)
        s[k] ← RANSAC(M, T_f)
        Img ← Stitch(Img, I_k, s[k])
    end for
    return s[], Img
end procedure

An example of the pipeline is shown in Algorithm 1: first we run a feature detector on two consecutive frames, then we determine the optimal pixel-shift vector (s[]) using a RANSAC estimator, and finally we collate the images together by translating the achieved super-resolution image by an offset of the estimated shift. The motion shift can also be converted to a metric coordinate (s_x) if the camera matrix K [28] and the nominal train distance (d) are known:

s_x = s · d / f_x    (3)

where f_x is the focal length in pixels taken from K. Brown and Lowe [29] refined this algorithm for mosaicing a panorama from a sparse set of images where the camera undergoes pure rotations and all image points are far from the camera. Instead of using a full rigid transform, their approach simplified the homography (H_{i,j}) between frames as:

H_{i,j} = K_j R_i^j K_i^{−1}    (4)

where R_i^j represents the camera rotation between frames i and j. The homography was estimated by matching SIFT features [30] with RANSAC. H_{i,j} was further refined using a tuning process known as bundle adjustment [31], which minimizes the point re-projection error in the target image coordinates. With this approach, Brown and Lowe simplified the homography estimation problem (eight degrees of freedom) to only three parameters (the relative rotations) and showed the efficacy of the matching search using only relevant features.
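The RANSAC shift estimation at the core of the pipeline can be sketched as follows. This is a simplified, NumPy-only illustration that assumes feature detection and matching have already produced two aligned coordinate arrays; the function name and parameters are ours, and a real pipeline would refit the model on the inliers and validate the consensus size.

```python
import numpy as np

def ransac_shift(pts_i, pts_j, n_iter=200, tol=2.0, seed=0):
    """Estimate a pure x-translation from matched keypoints with RANSAC.

    pts_i, pts_j: (N, 2) arrays of matched (x, y) feature coordinates.
    Returns (shift, inlier_mask).
    """
    rng = np.random.default_rng(seed)
    dx = pts_j[:, 0] - pts_i[:, 0]          # candidate per-match x shifts
    best_inliers = np.zeros(len(dx), dtype=bool)
    for _ in range(n_iter):
        s = dx[rng.integers(len(dx))]       # minimal sample: one match
        inliers = np.abs(dx - s) < tol      # consensus set for this hypothesis
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return float(dx[best_inliers].mean()), best_inliers
```

For a pure translation the minimal sample is a single match, which keeps the loop inexpensive; the consensus criterion is what rejects the incoherent displacements.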
While the Capel and Lowe approaches worked well on contrasted and opaque objects, constant patterns and metallic reflections diminish the number of valid features and create several (reflection-based) false matches that should be eliminated. Industry 4.0 scenarios offer plenty of plant situations where the ideal conditions required by these approaches cannot be fulfilled.
Pure motion reconstruction from images does not handle constraints such as those imposed by the target dynamics. Two common estimation tools help with this type of analysis: the Kalman Filter (KF) and the Particle Filter (PF). These filters combine the information from noisy measurements with an internal dynamic model and the prediction estimate from the previous model state. At each step, they predict an estimation of the model-state distribution using a model that also includes a noisy input (uncertainty). The output, whenever available, may be used to correct this estimation. In the classical KF the inner model is linear and the noise/state models are represented by Normal distributions [32], [33]:

x_{k+1} = A_k x_k + B_k u_k + w_k
z_k = H_k x_k + v_k    (5)

In our case, we adopted a simple vehicle motion model, and we associated the filter state (x_k) with the estimated vehicle velocity (represented by s or s_x). Consequently, we have a unity process matrix A_k = 1, and z_k = x_k is a direct observation of the state (H_k = 1). Since we have no information on the user input (B_k = 0), we assume that the input noise w_k also includes this information, while the output noise (v_k) models the velocity measurements (as they could be estimated from feature matching). In the KF approach, both errors are uncorrelated, zero-mean, and with known covariances, respectively equal to Q_k and R_k.
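The scalar filter just described (A_k = H_k = 1, B_k = 0, direct observation of the velocity state) reduces to a few lines. A minimal sketch, with names of our choosing:

```python
class ScalarKF:
    """Scalar Kalman filter with A = H = 1 and B = 0, as in the shift-tracking model."""

    def __init__(self, x0, p0, q):
        self.x, self.p, self.q = x0, p0, q   # state, state variance, input-noise variance

    def predict(self):
        # x_{k+1} = x_k (unity process matrix); the variance grows by Q
        self.p += self.q
        return self.x, self.p

    def correct(self, z, r):
        # standard Kalman gain for a direct observation (H = 1)
        k = self.p / (self.p + r)
        self.x += k * (z - self.x)
        self.p *= (1 - k)
        return self.x, self.p
```

Skipping `correct` when no valid measurement is available is exactly how the filter coasts through frames without usable matches.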
In the PF instead, the stochastic process is simulated [34]. At each step, a large set of input-state combinations (namely particles) gets propagated through the system dynamics to identify how the population will evolve. The evolution is then refined whenever a measurement is available. Indeed PFs offer the possibility to model a wider set of input, noise, and state information, but even in modern implementations [35], [36], it comes with a heavy computational cost that is proportional to the number of particles.
During the past decades, there have been different efforts to extend the KF in different directions and particularly to cope with multivariate I/O data or systems with variable dynamics [37]- [39]. Ensemble Kalman filters are tradeoffs that make use of a reduced number of particle set distributions represented by Gaussians and handled as in KF.
To properly work, KFs assume the input and measurement data being affected by regular disturbances whose models can be reduced to a form of Normal distribution. However, in our case, the output signal derived from feature analysis is blended with reflections and background noise whose size and statistical properties profoundly alter the possibility of filtering out disturbances.
To overcome these weaknesses, we decided to classify the input into clusters that represent the type of detected signal. For this purpose, we adopted a robust fit through a Gaussian Mixture Model (GMM [40]). The GMM is a common tool employed in data clustering to find the relevant components of multivariate distributions, i.e. a model where data are approximated by a weighted sum of Normal distributions:

p(θ) = Σ_{i=1}^{N} φ_i · N(θ; µ_i, Σ_i)    (6)

where the probability density function (p(θ)) is represented as the sum of a finite number (N) of Gaussians, each with its own mean (µ_i), covariance matrix (Σ_i), and mixture weight (φ_i). As we will show, GMMs may act as a natural companion to KFs, since each tool can feed the other with appropriate statistical information.
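A minimal one-dimensional Expectation-Maximization fit of such a mixture can be sketched as follows. The quantile-based initialization is our choice; library implementations add regularization, restarts, and convergence checks.

```python
import numpy as np

def gmm_em_1d(x, n_comp=3, n_iter=100):
    """Fit a 1-D Gaussian mixture with Expectation-Maximization.

    Returns (phi, mu, sigma2): mixture weights, means, and variances.
    """
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, np.linspace(0.1, 0.9, n_comp))  # spread the initial means
    sigma2 = np.full(n_comp, x.var())
    phi = np.full(n_comp, 1.0 / n_comp)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        d2 = (x[:, None] - mu[None, :]) ** 2
        pdf = np.exp(-0.5 * d2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        resp = phi * pdf + 1e-300                       # guard against all-zero rows
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0)
        phi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma2 = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return phi, mu, sigma2
```

Each fitted component directly provides the triplet (φ_i, µ_i, σ_i²) that the filter will later consume as candidate measurement statistics.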

III. PROBLEM AND MODEL DEFINITION
This section presents the mathematical approach alongside a typical scenario that exemplifies the data reconstruction problem. A set of cameras is placed at the wayside of the railway to capture images of a train while it is passing. Capture framerate, illumination, exposure time, focal distance, and camera parameters have been calibrated to prevent distortion and blurring effects, to maintain an appropriate color range, and to ensure that the same details appear in a sequence of two or more frames. For the sake of simplicity, the camera x-axes are assumed to be parallel to the navigation direction, while the z-axes are considered orthogonal to the principal train surface (see figure 1). Hence, the following assumptions can be made: 1) the motion model is a pure translation; 2) train features move on lines at a constant distance from the camera; 3) the presence of glass and metallic paintings will generate a large number of reflections interpreted as fixed features (outliers); 4) some objects, such as doors, seats, and internal illumination, can generate a set of false linear motions. Figure 2 shows a typical scenario: only the green squares can be associated with effective train translations; the red areas represent reflection effects, and the yellow boxes include features taken on different distance planes. In such a case, the use of typical feature-matching methods to compute a homography transformation fails due to missing data [41] or an excess of false data that do not represent the appropriate train motion.

A. IMAGE ACQUISITION AND CAMERA MODEL
The goal of the motion estimation is to provide a shift vector (s(k)) that describes the translation in pixels between frames. Since in our case the motion is considered to fall along the x-axis of the camera, the train speed can be computed from the shift vector, the camera sensitivity (S_xy), the calibrated train distance (d), and the camera framerate (f_r):

v(k) = s(k) · S_xy · d · f_r    (7)

A common approach is to find relevant points between frames (features), assuming a regular pre-calibrated perspective projection:

λ · (px, py, 1)^T = P_a · (X, Y, Z, 1)^T

where P_a = K_a · [R_a | 0_{3×1}] is the projection matrix, R_a represents the rotation of the camera with respect to an absolute reference frame (in our case the identity matrix), and K_a is the camera intrinsics matrix. Features can be detected using any of the common feature detection algorithms, such as SIFT, BRISK [42], and ORB [43], to name a few. Once the features have been detected, a homography can be estimated by finding a correspondence between at least two matched features (eight for a generic homography), as shown in figure 3, or by using statistical tools that estimate the best homography matching while rejecting outliers.
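The shift-to-speed conversion can be illustrated with a small helper. This sketch expresses the camera sensitivity through the focal length in pixels (K[0, 0]), so that the pixel shift is first mapped to meters at the calibrated train distance; names and values are ours.

```python
def shift_to_speed(shift_px, fx_px, distance_m, framerate_hz):
    """Convert an inter-frame pixel shift into train speed (m/s).

    Assumes a pure x-translation at the calibrated distance `distance_m`
    and a focal length `fx_px` expressed in pixels (K[0, 0]).
    """
    meters_per_frame = shift_px * distance_m / fx_px   # pinhole back-projection
    return meters_per_frame * framerate_hz             # meters/frame -> m/s
```

For example, a 78-pixel shift with a 1950-pixel focal length, a 5 m train distance, and a 20 Hz capture corresponds to 4 m/s.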

B. DATA AND DISTURBANCES
A density plot of the motion histograms, useful for assessing the data distribution and for evaluating the impact of reflections on the motion reconstruction, is reported in Figure 4. For each frame number (on the x-axis), the plot shows the density of all found matches. The y-axis represents the detected motion shift in pixels per frame, which can be converted into train speed (m/s) through an appropriate conversion constant. The more intense the plot, the more features have been found indicating a similar train speed. Figure 4 shows three typical issues: first, the background noise generates several false matches; second, the reflections generate a high number of zero-translation matches that, for certain frames, can outnumber the effective matches; third, other illumination and motion artifacts can create regular distortions of the image patterns, which map into a considerable number of regular echoes on the histogram map. There are many situations in which the number of correct features is small and the most frequent value is not the target (expected) motion speed. This happens when the moving surface has large constant paintings and does not present enough recognizable features to match. Under these conditions, a motion detector relying on the sole hypothesis of a prevalence of good matches cannot always estimate the proper motion. When the number of valid matches is small, a RANSAC procedure will fail to converge to the precise result, with detrimental consequences on the overall stitching.
However, during the analysis, the null-motion hypothesis (zero translation) cannot be rejected, since it is not uncommon for a train to stop and restart within the recording. Hence, the reconstruction algorithm should privilege the estimation of a relevant non-zero motion without rejecting a priori the null-motion hypothesis as one of the possible solutions.
Looking more attentively at Figure 4, we may devise a strategy to select the most appropriate estimation of the trajectory followed by the train by obeying a set of simple guidelines: 1. random peak matches are produced by sensing disturbances; 2. the rolling stock is a physical object and, as such, its motion is constrained by a set of differential equations; 3. motion peaks incoherent with the maximum train acceleration should be rejected; 4. a frame range with a coherent non-zero motion histogram sequence will be associated with the train motion even if it is less likely than the null-motion hypothesis.
In particular, once we run the GMM to cluster the motion histogram of Figure 8, we obtain a set of mean values (µ_i), representative weights (φ_i), and noise covariances (Σ_i) that are well suited to be used as Kalman candidates for the measurement (µ_i) and the noise covariance (R_i ∝ Σ_i).
In this case, a mixture of three different components is sufficient to represent: 1. the motion model; 2. the reflection model; and 3. all other noise measures. The GMM clustering was performed through the Expectation-Maximization (EM) algorithm [40], which splits all motion pairs into J groups such that

N_k = Σ_{i=1}^{J} N_{i,k}

is the total number of features matched in consecutive frames. Each group has its own size (N_{i,k}), average (µ_{i,k}), and variance (σ²_{i,k}), the latter estimated from the group samples.
However, sometimes EM may fail due to different random initializations or a scarcity of valid features. A decision algorithm should therefore determine whether one or none of the current predictions can be used for a correction step. Several ''robustification'' approaches for KFs are available (e.g. [39], [44]-[46]). However, these approaches focus mainly on the tail shapes rather than on the presence of multiple strong (false) peaks in the observational model, so a new, dedicated approach is more appropriate. Our proposed model is based on the predicted likelihood, thus requiring only one hyper-parameter, and more appropriately identifies the effects of disappearing relevant matches while rejecting inappropriate measurements (false matches).

C. FILTER MODEL
The adopted filter model can be represented by a Kalman Filter employing the dynamical system represented in eq. 5, where the state x_k = s_x(kT) embeds the velocity between frames as described in eq. 7 and is characterized by a process covariance P_k = σ²_{x,k}. The input noise w_k is zero-mean, and its variance encodes the maximum acceleration between frames that the train driver can apply.
We also assumed the output of the model to be estimated as the most likely mixture average (µ_i) described in eq. 6, while the associated measurement variance (Σ_i) was used to estimate the variance of the output-related noise (v_k). Both µ_i and Σ_i were selected from the Gaussian Mixture using a specific algorithm (see Algorithm 2) that takes as input a shift population (δ_k), produced by feature pair-matching between two consecutive frames and eventually converted to metric distances through eq. 3.
The selection algorithm works as follows: first, it determines which group is most coherent with the data predicted by the KF. This operation is performed by estimating the combined likelihood (L_{i,k}) of each group mean (µ_{i,k}) with

Algorithm 2 Selection of Kalman Best (KB) From a Motion-Shift Population
procedure KalmanBest(δ_k, x_k, P_k, z_max)
    {N_{i,k}, µ_{i,k}, σ²_{i,k}} ← GMM-EM(δ_k)
    i* ← argmax_i L_{i,k}                          ▷ most likely group w.r.t. N(x_k, P_k)
    z ← |µ_{i*,k} − x_k| / sqrt(P_k + σ²_{i*,k} / N_{i*,k})
    if erfc(z/√2) < erfc(z_max/√2) then
        return none                                ▷ RejectCondition: prediction only
    else
        return z_k ← µ_{i*,k}, R_k ← σ²_{i*,k} / N_{i*,k}   ▷ CandidateCorrection
    end if
end procedure

respect to the current Kalman prediction (N(x_k, P_k)). Only the most relevant group is taken into consideration, and a new data estimation is accepted only if the Z score of the candidate (computed using the Gaussian complementary error function, erfc) is significant (e.g. z_max = 3 → P_{z_max} > 99.85%). Hence we use the prediction variance (P_k) as a means to estimate the overall likelihood of a candidate observation. If the new data are rejected (the return value is none), the algorithm proceeds with the Kalman prediction; otherwise, it provides a candidate observation (z_k) and a candidate noise (R_k) for a Kalman correction step.
In the covariance estimation, we corrected the data covariance into a mean covariance using the number of matched features belonging to the group, taking into account the effective degrees of freedom. The relevance threshold (z_max) is a parameter that should be selected experimentally.
Choosing larger values allows diverging shift estimations to be accepted as correct; setting it too low could make the filter discard proper measurements.
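A possible sketch of the selection logic of Algorithm 2 follows, assuming each GMM component is summarized by its size, mean, and variance. The size-weighted likelihood and the erfc-based gate mirror the description above, but the exact weighting used in the paper may differ; names are ours.

```python
import math

def select_kalman_best(groups, x_pred, p_pred, z_max=3.0):
    """Pick the GMM component to use as a Kalman measurement.

    `groups` is a list of (n, mu, sigma2) tuples from the EM fit;
    returns (z, r) or None when every candidate must be rejected.
    """
    best, best_like = None, -1.0
    for n, mu, sigma2 in groups:
        var = p_pred + sigma2 / n                # predictive variance of the group mean
        like = (n * math.exp(-0.5 * (mu - x_pred) ** 2 / var)
                / math.sqrt(2 * math.pi * var))  # size-weighted likelihood
        if like > best_like:
            best, best_like = (n, mu, sigma2), like
    n, mu, sigma2 = best
    z_score = abs(mu - x_pred) / math.sqrt(p_pred + sigma2 / n)
    if math.erfc(z_score / math.sqrt(2)) < math.erfc(z_max / math.sqrt(2)):
        return None                              # candidate too far from the prediction
    return mu, sigma2 / n                        # measurement and mean covariance
```

Note that the returned noise is the covariance of the group mean (σ²/N), reflecting the degrees-of-freedom correction discussed above.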

D. ROBUSTNESS ANALYSIS
To assess the capability of the system to acquire and maintain proper tracking, two cases are considered: (1) a complete lack of correct matches, and (2) a reduced number of good matches compared to the wrong ones. In what follows we will denote by µ_G, σ_G, N_G respectively the mean, the standard deviation, and the population size of the mixture component associated with the good matches, and by µ_W, σ_W, N_W the same data for the most likely wrong component in the GMM decomposition. Without loss of generality, we may consider only mismatches against the dominant component of the GMM decomposition.
When no matches are available, the algorithm will properly work with the KF prediction as long as the RejectCondition remains true. The number of correct prediction steps is therefore limited to:

k_max ≈ (µ_W − x_k)² / (z²_max · Q_0)

where µ_W is the mean of the closest wrong component and where we assumed that the input-noise process remains constant (Q_k = Q_0).

Application example
Considering a train capture with an average train arrival speed equal to 2 m/s, a capture framerate set to 20 Hz, and smooth accelerations (≈ 0.4 m/s²), it is possible to compute Q_0 ≈ (0.4/20)² = 4 · 10⁻⁴. The typical mismatch errors are of the order of the average train pixel-shift (for reflections), or 10% of it when features belonging to different planes are matched (usually quite a small number). Computing k from these data, for instance with z_max = 2, yields k_R = 2500 frames (reached after 125 s) of robustness to reflections, and k_FP = 25 frames (1.25 s) for features on false planes.
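These figures can be checked with a short computation, assuming the prediction horizon scales as ∆µ² / (z_max² · Q_0), a formula reconstructed so as to reproduce the numbers quoted above:

```python
def prediction_horizon(delta_mu, z_max, q0):
    """Prediction-only steps before a wrong component at offset `delta_mu`
    enters the acceptance gate: k = delta_mu^2 / (z_max^2 * q0)."""
    return delta_mu ** 2 / (z_max ** 2 * q0)

q0 = (0.4 / 20) ** 2                               # smooth acceleration, 20 Hz capture
k_reflections = prediction_horizon(2.0, 2.0, q0)   # reflections at zero shift (2 m/s off)
k_false_planes = prediction_horizon(0.2, 2.0, q0)  # features on wrong planes (10% error)
```

With these inputs the horizon evaluates to 2500 frames for reflections and 25 frames for false-plane features, matching the application example.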
When only a small number of good features is available, the KF-GMM algorithm will perform a good correction as long as the KFLikelihood condition selects the good matches. A conservative limit threshold can be derived through the following theorem:

Theorem 1 (Distribution Switch): In steady-state condition, with constant input, given a KF-GMM tracking as defined by eq. 5, with D_G = {N_G, σ_G, µ_G} describing the distribution associated with the current correction-tracking, and D_W = {N_W, σ_W, µ_W} an alternate tracking distribution isolated by the GMM analysis, the filter will proceed correcting using D_G if the following switch limit condition is respected:

ln(N_G / N_W) > 1 + ln(σ_G / σ_W) − (µ_W − µ_G)² / (2 σ_W²)

Proof: First we consider the likelihood-ratio logarithm to check when D_G is the most likely condition, i.e. ln(L_G / L_W) > 1, where each likelihood term may be expressed as:

L_i = (N_i / N) · (1 / (√(2π) σ_i)) · exp(−(µ_i − x_k)² / (2 σ_i²))

which, with a few algebraic operations, leads us to:

ln(N_G / N_W) > 1 + ln(σ_G / σ_W) + (µ_G − x_k)² / (2 σ_G²) − (µ_W − x_k)² / (2 σ_W²)    (12)

If we introduce the hypothesis of a steady-state condition with constant input, we have x_k → µ_G, which, reintroduced in eq. 12, leads to the result.

Distribution switch example
In the case a wrong value is selected from the GMM correction, we may over-estimate the typical capture noise as 5 · 10⁻¹ m; even in such a case, the switch condition of Theorem 1 remains satisfied as long as the good matches exceed a small fraction of the wrong ones. The presented KF-GMM algorithm thus greatly improves tracking in the presence of large amounts of disturbance by rejecting false inputs that typically mislead other robust fitting methods (e.g. RANSAC and LMedS). This practical example shows how even a small number of correct matches can dominate the tracking behavior and ensure that the filter proceeds with the correct estimation. Compared with particle filters, the data structure is much simpler to manage, and the computational load remains almost identical to that of the traditional KF.

IV. RESULTS COMPARISON
This section presents the results obtained from an experimental acquisition campaign. A comparison with existing methods is performed on data captured at a maintenance facility in Osmannoro (Firenze, IT) during a period of 2 months, which provided almost two train captures per day. Comparing the stitching results obtained with different approaches requires particular care. When the largest percentage of matches corresponds to good features, whichever algorithm is used leads to the same good result. However, when rebuilding a complete DT, even a few misaligned frames lead to an inappropriate reconstruction of the whole train. Hence, to expose the limitations of the alternative algorithms, the analysis has been applied to long real frame sequences (>1500 frames) like those shown in figure 4.

A. DIRECT METHODS
To evaluate direct methods, the error metrics defined in equations 1 and 2 have been used. The motion shift vector (u) was computed by minimizing the error metric while restricting the search to the interval u ∈ [0, 150] (pixels per frame). Figure 5 shows an example of using eq. 1 in a real-case scenario. The figure shows the value of the metric computed in four different frames, less than 20 frames (1 s) apart from each other. While in the first frame (#167) it could be argued that the train is moving at 78 pixels per frame, as soon as more reflections (caused by a window) appear in the scene a new global minimum appears (#180, #197), and the previous minimum, now local, tends to disappear (#207).
Estimating the train moving speed using only the absolute or relative minimum policy leads to reconstructing only a portion of the correct motion shift: whenever the image pattern is constant, under-exposed, over-exposed, or repetitive, the reflections dominate over the detail alignment and many minimum motion vectors become equally probable.
In this case, the reflections (see figure 2) generate several false null-motion estimates for about one half of the whole capture sequence. Moreover, while for some frames (#180, #197) this problem could be avoided by searching for a local minimum different from zero, this minimum may completely disappear in subsequent frames.
A similar result was obtained when using the E_ZNCC scoring function. Table 1 (discussed later) summarizes the success rate in the reconstruction achieved by both direct and feature-based methods.
As a result of one, or a few consecutive, wrong motion shifts, the whole carriage reconstruction becomes distorted.

B. FEATURE-BASED METHOD
One of the limitations of direct methods lies in the fact that all image pixels are considered equal, regardless of whether they belong to a dark or saturated area or have a uniform texture. We know, instead, that when a constant image moves in front of a camera, the information should be extracted only from a few relevant points that are clearly distinguishable from the background. Feature-based methods help in this regard, as they greatly reduce the number of points used for estimating the motion and introduce complex descriptors that facilitate the matching between frames.
The additional use of the RANSAC tool helps to eliminate incoherent motion shifts that are not confirmed by a large percentage of the matches. However, even when introducing these two tools, the artifacts of figure 2 cannot be distinguished from the true motion, since both are coherent and feasible solutions to the problem.
Accuracy outcomes achieved with different stitching algorithms while reconstructing critical train parts such as windows, doors, and the whole train have been computed. Table 1 compares the results of direct and indirect methods. While the percentage of mismatched motion shifts is greatly diminished, it still remains a considerable percentage of the overall number of frames.
None of the tested algorithms succeeded in reconstructing a complete train image, or even a single carriage, since the number of reflection outliers was so high that it frequently exceeded the number of correctly matched features. The introduction of the RANSAC procedure slightly improves the results, but success is still limited to a small fraction (about 20%) of the elements on the carriage surface.
A simple sample-and-hold (SH) strategy can be introduced to improve the reconstruction quality: maintain the last known speed when newer acquisitions are incoherent with the previous ones (|u_k − u_{k−1}| > threshold). This procedure greatly improves the quality of the reconstruction, but still leaves two issues open: • maintaining the same speed while the train is accelerating introduces length reconstruction errors on some train details; • especially at low speeds, it is particularly difficult to determine which threshold to apply to discriminate incoherent motions.
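The SH baseline can be sketched in a few lines; a simplified illustration in which the threshold handling is ours:

```python
def sample_and_hold(shifts, threshold):
    """Baseline sample-and-hold filtering of a shift sequence: keep the last
    accepted value whenever the new one jumps by more than `threshold`."""
    out = [shifts[0]]
    for u in shifts[1:]:
        # hold the previous value on incoherent jumps, accept it otherwise
        out.append(u if abs(u - out[-1]) <= threshold else out[-1])
    return out
```

The example makes the two open issues visible: during an acceleration the held value lags behind, and near zero speed the fixed threshold cannot separate a genuine stop from a reflection-induced zero shift.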

C. PROPOSED SOLUTION
To explain why the proposed algorithm outperforms the traditional ones, it is useful to analyze the results of the GMM algorithm shown in Figure 9. The graph displays, for each frame, the three average values of the GMM decomposition (µ_i) in different colors (green, red, and blue), ordered by relevance (φ_i). The alternation of red and green dots in the graph illustrates that the population-size criterion is not adequate to discriminate between motion and reflections.
Here the two improvements of Algorithm 2 come into play. First, we re-score the relevance into a likelihood that also depends on the previous Kalman statistics; second, we ignore any estimation that significantly deviates from the prediction. While the first rule properly reorders the measurements coherently with the current Kalman estimation, the second rule intervenes to reject all measurements when there are no relevant features to estimate the motion.
When a measurement is considered valid, its extracted information (mean, standard deviation, and population size) is fed to the Kalman correction step. In this way, the measurement-noise description is continuously adapted to the current capturing conditions. In Figure 10, a contour plot of the motion-sample density is overlaid with the black line reconstructed by the proposed multi-hypothesis Kalman filter.
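For illustration, a one-dimensional correction step with a measurement variance adapted from the selected component could look like the sketch below. Shrinking the variance by the population size n is an assumption made for this example, not the paper's exact adaptation rule.

```python
def kalman_correct(x_pred, p_pred, z_mean, z_std, n):
    """Scalar Kalman correction with an adaptive measurement noise R
    derived from the selected GMM component's std and population size."""
    r = (z_std ** 2) / max(n, 1)   # noise adapted to capture conditions
    k = p_pred / (p_pred + r)      # Kalman gain
    x = x_pred + k * (z_mean - x_pred)
    p = (1.0 - k) * p_pred
    return x, p
```

A well-populated, tight component thus yields a small R and pulls the estimate strongly toward the measurement, while a sparse or wide component leaves the prediction almost untouched.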
The reconstructed trajectory proved stable in all captured scenarios, and the algorithm was capable of rejecting the disturbances coming both from reflection features and from objects/features moving on different planes. Thanks to the Kalman ''memory'', the algorithm properly follows the train acceleration and deceleration phases even when motion features are missing.
Surprisingly, the algorithm even detects correctly when the train motion restarts. This can be explained by the fact that the train stopped in a particularly favorable position, with a considerable number of proper features detected.
The accuracy of the proposed algorithm was validated on about 250 carriages from 40 different train captures, each composed of 1500–3000 frames depending on the speed profile and the length of the rolling stock. Two types of numerical evaluation were performed: focusing on door and window details, and estimating the whole carriage length.
The first analysis benefits from the fact that these elements can also be captured in a single picture frame, which can then be compared with the stitched sequence made of tens of frame stripes. The elements rebuilt with this type of analysis are shown in Figure 11. We focused on four element types: two doors and two windows. In this analysis the single picture serves as ground truth to estimate the percentage error committed by the stitching algorithm; knowledge of the exact physical dimension of the element is not required. Table 2 shows the results: for each element it reports the name taken from Figure 11, the number of elements analyzed, the width measured on the single frames, and, in the last three columns, the mean, the standard deviation in pixels, and the relative error (in percentage) of the width detected after stitching. The measurements were taken manually from the resulting pictures and may be affected by a ±1 px inaccuracy. In all cases we obtained a correct reconstruction (100% success), the error between the measured size in pixels and the average detected size was below 1%, and the standard deviation was lower than 8 pixels (1%) in the worst case (window type B).
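The per-element statistics reported in Table 2 can be computed in a few lines; the sketch below uses made-up placeholder widths, not the paper's data, and assumes the sample standard deviation.

```python
import statistics

def width_stats(gt_width_px, stitched_widths_px):
    """Ground-truth width from a single frame vs. widths measured on the
    stitched output: returns (mean, sample std, relative error in %)."""
    mean = statistics.mean(stitched_widths_px)
    std = statistics.stdev(stitched_widths_px)
    rel_err = abs(mean - gt_width_px) / gt_width_px * 100.0
    return mean, std, rel_err

# Placeholder example: three stitched measurements of a 100 px element.
mean, std, rel_err = width_stats(100.0, [99.0, 101.0, 100.0])
```

Because the manual measurements carry a ±1 px inaccuracy, relative errors below 1/width are not meaningful and should be read as "within measurement noise".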
When analyzing the carriages and whole trains, since no effective ground truth was available for comparison with the reconstructed image, the lengths of carriages of the same type were compared, using the resulting standard deviation as an indicator of the algorithm's robustness. The achieved results are shown in Table 3. For each carriage typology, the table presents the carriage type as an image, the number of carriages of that type, the estimated length, and the standard deviation both in pixels and as a percentage of the whole length. Here too the algorithm showed excellent capabilities, even better than those obtained on individual elements; we attribute this behavior to the averaging of quantization errors between frames.

V. CONCLUSION
We presented a novel algorithm that combines two relevant tools for robust estimation: Gaussian Mixture Models and Kalman Filters. These tools were applied to a tracking application aimed at stitching images of a train taken by a fixed camera. We highlighted how the two tools can be combined into a tracking system that is much lighter than a particle filter yet still copes with data mixed with different noise types. In particular, GMMs provide time-variant statistical measurement models whose means and covariances can be used in the correction phase, while KFs provide methods to discriminate against or reject the GMM analysis in a highly disturbed environment.
The selection algorithm in the data filter model allowed us to reject outliers whenever the proposed sensor measurements were incoherent with, or only weakly related to, the projected model. Additionally, we derived an estimate of the robustness in two limit conditions and showed that it was solid enough for vision tracking tools.
The resulting filter was applied to a stitching case study in a real field environment, under different lighting conditions and in the presence of several measurement disturbances. The results showed the high reliability of the proposed approach, with 100% of the trains fully reconstructed without evident errors and with an accuracy on the estimated geometries greater than 99%.
The algorithm is currently employed by the national train company to rebuild train information using a single array of cameras that observes a train during its passage. Once the whole image of the train has been rebuilt, the facility also detects particular elements on the train (serial numbers, windows, boxes, grids, etc.) and checks their integrity by comparing them with original images taken just after a maintenance procedure.