Covariance tracking: architecture optimizations for embedded systems

Covariance matching techniques have recently grown in interest due to their good performances for object retrieval, detection, and tracking. By mixing color and texture information in a compact representation, it can be applied to various kinds of objects (textured or not, rigid or not). Unfortunately, the original version requires heavy computations and is difficult to execute in real time on embedded systems. This article presents a review on different versions of the algorithm and its various applications; our aim is to describe the most crucial challenges and particularities that appeared when implementing and optimizing the covariance matching algorithm on a variety of desktop processors and on low-power processors suitable for embedded systems. An application of texture classification is used to compare different versions of the region descriptor. Then a comprehensive study is made to reach a higher level of performance on multi-core CPU architectures by comparing different ways to structure the information, using single instruction, multiple data (SIMD) instructions and advanced loop transformations. The execution time is reduced significantly on two dual-core CPU architectures for embedded computing: ARM Cortex-A9 and Cortex-A15 and Intel Penryn-M U9300 and Haswell-M 4650U. According to our experiments on covariance tracking, it is possible to reach a speedup greater than ×2 on both ARM and Intel architectures, when compared to the original algorithm, leading to real-time execution.


Introduction
Tracking consists in estimating the evolution in state (e.g., location, size, orientation) of a moving target over time. This process is often subdivided into two other subproblems: detection and matching. Detection deals with the difficulties of generic object recognition, i.e., finding instances from a particular object class or semantic category (e.g., humans, faces, vehicles) registered in digital images and videos. On the other hand, matching methods provide the location which maximizes the similarity with the objects previously detected in the sequence. Generic object recognition requires models that cope with the diversity of instances' appearances and shapes. This is generally made by learning techniques and classification. Conversely, matching algorithms analyze particular information and construct discriminative models that allow to disambiguate different instances from the same category and avoid confusions.
The main difficulty of tracking is to trace target trajectories and adapt to changes of appearance, pose, orientation, scale, and shape. Since the beginnings of computer vision, a diversity of tracking methods have been proposed, some of them construct path and state evolution estimations using a Bayesian framework (e.g., particle filters, hidden Markov models), others measure the perceived optical flow in order to determine object displacements and scale changes (median flow) [1]. Exhaustive appearancebased methods compare a dense set of overlapping candidate locations to detect the one that fits best with some kind of template or model. When a priori information about the target location and its dynamics (e.g., speed and acceleration) is available, the number of comparisons can be reduced enormously by giving preference to the more likely target regions. Other accelerations can be achieved using local searches that are based on gradient-descent algorithms able to handle small target displacements and geometrical changes. Among these approaches, feature points tracking techniques are very popular [2] since points can be extracted in most scenes, contrary to lines or other geometric features. Because they represent very local patterns, their motion models can be assumed as rigid and be estimated in a very efficient way. This method, as well as block matching, are raw-pixel methods, since the target is directly represented by its pixels matrix.
In order to deal with non-rigid motion, kernel-based methods such as mean-shift (MS) [3] and [4] use a representation based on color or texture distribution.
Covariance tracking (CT) [5] is a very interesting and elegant alternative which offers a compact target representation based on the spatial correlation of different features computed at each pixel in the target bounding box. Very satisfying tracking performances have been observed for diverse kinds of objects (e.g., with rigid motion or not, with texture or not). CT has been studied extensively, and many feature configurations and arrays of covariance descriptors have been proposed to improve its discrimination power [6][7][8][9][10][11] and. Smoother trajectories can be obtained by considering target dynamics; therefore, they increase tracking accuracy and reduce the search space [12,13]. Genetic algorithms [14] can also be used to accelerate the convergence towards the optimal solution of the best candidate position, considering a search in a large image. But, to our knowledge, little work has been done to analyze the computational demands of CT and its portability to embedded systems [15]. The goal of this article is to fill this gap, analyze the algorithm's computational behavior for different implementations, and measure their load on embedded architectures. A study is also made to compare different sizes and configurations of the descriptors in terms of discrimination power through a texture classification application.
The article is structured as follows. The first section introduces some of the basic principles of the CT algorithm and provides a brief description of the different searching and matching methods that can be associated with C. Then various configurations of the covariance matrix are evaluated. Finally, we provide an in-depth description of implementation details and suitable acceleration techniques proposed to achieve a higher level of performance. Experiments and details about the algorithm implementation are presented in the final section that comes followed by our conclusions.

Covariance matrices as image region descriptors
Let I represent a luminance (grayscale) or a color image with three channels and consider a rectangular region of size n = W × H (it can be the bounding box of the target to be tracked for example). Let F be the W × H × n F dimensional feature image extracted from I where φ is any n F -dimensional mapping forming a feature vector for each pixel of the bounding box. The features can be spatial coordinates p uv , intensity, color (in any color space), gradients, filter responses, or any possible set of images obtained from I. Now, let {z k } k=1···n be a set of n F -dimensional feature vectors inside the rectangular region R ⊂ F of n pixels. Concerning notations, p uv stands for the pixel at uth row and vth column.
The region R is represented with the n F × n F covariance matrix where µ is the mean feature vector computed on the n points.
The covariance matrix is a n F × n F matrix which fuses multiple features naturally by measuring their correlations. The diagonal terms represent the variance of each feature, while elements outside this diagonal are the correlations. Thanks to the averaging in the covariance computation, noisy pixels are largely filtered out, which is an interesting advantage when compared to raw-pixel methods. Covariance matrices are more compact than most classical object descriptors. Indeed, due to symmetry, C R has only (n 2 F + n F )/2 different values whatever the size of the target. To some extent, it is robust against scale changes, because all values are normalized by the size of the object, and against rotation when the locations coordinates p uv are replaced by the distance to the center of the bounding box.
The covariance descriptor ceases to be rotationally invariant when orientation information is introduced in the feature vector such as the norm of gradients with respect to x and y directions. The information considered by the covariance descriptor should be adapted to the problem at hand, because they depend on the application, as described in the next paragraph.

Covariance descriptor feature spaces
Covariance descriptors have been used in computer vision for object detection [16], reidentification [10,11] and tracking [5]. The recommended set of features to use depends significantly on the application and the nature of the object: tracking faces is different than tracking pedestrians because faces are somehow more rigid than pedestrians which have more articulations. Color is an important hint for pedestrian or vehicle tracking/reidentification because of their clothes or bodywork color. But color is less significant for reidentification or tracking faces because the set of colors they exhibit is relatively limited. Table 1 displays a summary of the more common feature combinations used by covariance descriptors in computer vision. The most obvious ones are the components from different color spaces such as RGB and HSV. Pixel brightness in the grayscale image I and its local directional gradients as absolute values |I x | and |I y |, gradient magnitude I 2 x + I 2 y , and its angle calculated as arctan |Ix| |Iy| . Foreground images G resulting from background subtraction methods and its gradients G x and G y . Features g 00 (x, y) to g 74 (x, y) represent the 2D Gabor kernel as a product of an elliptical Gaussian and a complex plane wave [9]. Face tracking and recognition [9] x y |Ix| |Iy| |Ixx| |Iyy| x y I |Ix| |Iy| |Ixx| |Iyy| θ(x, y) x y I g00(x, y) g01(x, y) · · · g74(x, y) Pedestrian detection [16,17] x y |Ix| |Iy| I 2 x + I 2 y |Ixx| |Iyy| arctan |Ix| |Iy | x y |Ix| |Iy| Pedestrian tracking [5,10,11,16] and [18] x y R G B |Ix| |Iy| x y R G B |Ix| |Iy| |Ixx| |Iyy| x y H S V |Ix| |Iy| x y R G B VarLBP Some texture analysis and tracking methods use local binary patterns (LBP) in the place of Gabor filters and the reason is that LBP operators are much more simple and economical. Values Var LBP , LBP θ 0 , and LBP θ 1 in Table 1 represent local binary pattern variance (which is a classical property of the LBP operator [19]) and the angles defined by them, as detailed in [18]. This version of the feature vector has shown very good performances for tracking, both in terms of robustness and computation times, and requires a far shorter vector (n F = 7) when compared to Gabor filters (n F = 43). In the rest of the paper, for the algorithmic optimization, a vector of five to nine features is considered, but note that the proposed optimizations can be applied to any matrix size.
Now, let us detail the computation of the covariance descriptor.

Covariance descriptor computation
After some term expansions and rearrangements on Equation 2, the (i, j)-th element of the covariance matrix can be expressed as Therefore, the covariance in a given region depends on the sum of each feature dimension z(i) i=1···n , as well as the sum of the multiplications of any pair of features z(i)z(j) i,j=1···n , requiring in total n F + n 2 F /2 integral images, one for each feature dimension z(i) and one for the multiplication of any pair of feature dimensions z(i)z(j) (the covariance matrix is symmetric).
Let A be a W × H × n F tensor of the integral images of each feature dimension A uv (i) = p∈R (11,uv) where R (11, uv) is the region bounded by the top-left image corner p 11 = (1, 1) and any other point in the image p uv = (x u , y v ). In a general way, let R(uv, u ′ v ′ ) be the rectangular region defined by the topleft point p uv and the right-bottom point p u ′ v ′ . Similarly, the tensor containing the feature product-pair integral images is denoted as (11,uv) F uv (i)F uv (j)for i, j = i · · · n F .
Now, for any point p uv , let A uv be a n F -dimensional vector and B a n F × n F dimensional matrix such as The covariance of the region bounded by (1, 1) and p uv is where n is the number of pixels in the R under investigation. Similarly, and after some algebraic manipulations, the covariance of the region R(uv, u ′ v ′ ) as it was presented in [20] is Once the integral images have been calculated, the covariance of any rectangular region can be computed in O(n 2 F ) time regardless of the size of the region R(uv, u ′ v ′ ). The complete process is represented graphically in Figure 1, where different image-processing operators are applied to the initial image (top left) to calculate the set of feature images (top right). Each feature component i is used to generate the integral image A uv (i) (bottom left) and the crossed product between features i and j is used to calculate the second order integral images B uv (i, j).

Figure 1 Covariance descriptor computation.
The image is first decomposed into an array of feature images (feature image tensor) applying the feature map F uv = φ(I, p uv ). Then the crossed-products of these features are computed; using these arrays, the tensor integral images A u ′ v ′ (i) and the second order integral images tensor B u ′ v ′ (i, j) are computed.
Next section explains the covariance matching process.

Searching and matching
Covariance models and instances can be compared and matched using a simple nearest neighbor approach, i.e., by finding the covariance descriptors that best resemble a model. The problem is that covariance matrices and symmetric positive definite (SPD) matrices in general is that they do not lie on the Euclidean space and many common and widely known operations in Euclidean spaces are not applicable or require to be adapted (e.g., a SPD matrix multiplied by a negative scalar is no longer a valid SPD matrix). A n F × n F SPD matrix only has n F × (n F + 1)/2 different elements; while it is possible to vectorize them and perform element-by-element subtraction, this approach provides very poor results as it fails to analyze the correlations between variables and the patterns stored in them. A solution to this problem is proposed in [21] where a dissimilarity measure between two covariance matrices is given as where {λ i (C 1 , C 2 )} i=1,··· ,n F are the generalized eigenvalues of C 1 and C 2 computed from The tracking starts in the first frame of the sequence, by computing the covariance matrix C 1 in the bounding box of the target under consideration (i.e., the model). The initial detection is not detailed in this paper since it can be made in various ways, by object recognition or background subtraction for example. The tracking procedure consists in determining the new target positions for the successive frames by comparing the covariance matrix C 2 (i.e., the candidate position) and minimizing the Riemannian distance (9). . Steepest descent methods look for the position which maximizes the appearance similarity when compared to the target model. Gradient-based methods do not require a large number of matrix comparisons; however, they do require to run iteratively until convergence, causing their computation time to be very unpredictable. Another limitation (and probably the most important one) is that contrary to exhaustive search approaches, local search may fail for targets undergoing brutal motion or target occlusions. The reason behind this problem is that local search methods tend to fall into local minima. Due to these limitations, the exhaustive search method was preferred for the tracking application implemented for this research.

Feature vector evaluation
The objective of this section is to determine the most discriminative vector combination and the ideal number of features (n F ) to use. Multiple feature combinations were tested using a texture classification method. The KTH-Tips dataset [22] is composed of ten different texture classes each one represented by 81 different samples of size 200 × 200 taken at different scales, illuminations, and poses.
There are different approaches for texture classification with covariance matrices. Most methods subdivide the image in small overlapping subregions and compute a descriptor associated to each one. The drawback with this approach is that it increases the number of matrix comparisons and the storage required. To avoid this problem, the local log-Euclidean covariance matrix (L2ECM) [6] computes a single covariance matrix from the log-Euclidean transformations of other (simpler) covariance matrices calculated at every pixel neighborhood (typical sizes are 15 × 15 or 30 × 30). While L2ECM provides high texture reidentification scores, its main drawback is that it considerably increases the number of computations and the memory space that is required during the computation phase. Therefore, L2ECM is clearly not appropriate for embedded platforms; hence, for the sake of simplicity, we were much more inclined to use a very simple approach and compute a single covariance descriptor for every sample and feature combination.
Ten random images were selected for the training of each texture class (from the set of 81 samples that represents each one); the remaining samples were used during the classification evaluation. The descriptor obtained from each test image is compared against all the covariance matrices inside the different training sets (ten samples for each texture class) using the Riemannian metric proposed in [20]. A label is assigned to each class using the KNN a algorithm counting the number of votes of each texture class for the closest k = 5 samples. The same procedure was repeated ten times to summarize and avoid unstable or misguiding results.
To evaluate the quality of our classification results, we counted the number of true/false positives and negatives and calculated their associated F 1 score (this represents the weighted average of the precision and recall) defined as where precision = #True positives #True positives+#False positives recall = #True positives #True positives+#False negatives .
Multiple feature combinations were evaluated based on the spatial coordinates (x and y), the luminance (I) and color channels (R, G, and B), the first and second order gradient magnitudes (|I x |, |I y |, |I xx |, |I xy | and |I yy |), and the enhanced local binary covariance matrices (ELBCM) features proposed in [18].  Three observations can be made from Figure 3: (1) that ELBCM-based combinations tend to have slightly higher scores than gradient-based ones (i.e., their F 1 scores are higher and more concentrated), (2) that color plays a crucial role to improve the discriminative power (purely gradient and ELBCMbased configurations both improve their scores when color information is included), and (3) the relevance of the spatial coordinates (x and y) seems to be small for the texture recognition problem. According to Figure 3, the ideal number of features (among the set of feature combinations evaluated here) is between n F = 7 and n F = 9 given that most of the smaller configurations produce less accurate results. However, for such as size of covariance matrix, real-time execution (40 ms for 25 frames per second) as required for visual tracking is impossible without optimizations.
Three strategies are studied to optimize the CT on multi-core CPUs. The first one is based on the structure of arrays (SoA) towards array of structures (AoS) transformation: SoA→AoS. The second one consists in architectural optimizations: either multi-threading the SoA version with open multiprocessing (OpenMP) middleware or using single instruction, multiple data (SIMD) instructions (SSE and AVX for Intel, Neon on ARM) for the AoS version. The third and final strategy consists of using loop-fusion transformations. In-depth information about the transformations employed in this article can be found in [23].
Let us introduce a set of notations for describing the algorithms and theirs optimizations ( Table 2): Number SIMD registers to hold n P products: vn P = ⌈n P /card⌉ (see Table 3) In the baseline version of the algorithm, the complete set of feature images F is stored separately using a cube data structure (referred to as fmat[k][i][j]) which can be regarded as an instance of a SoA data structure. The index k is used here to select one of the n F feature images while the pair (i,j) is used to select the spatial coordinates. Image cubes are straightforward to implement, the required arithmetic to compute the memory of an address using a table of 3D pointers only demands three integer additions; still, the latency time of a memory access is extremely dependent on the data access pattern.

SoA to AoS transform
The goal of SoA→AoS transform consists of transforming a set of independent arrays into one array, where each cell is a structure combining the elements of each independent array. The contribution of such a transform is to leverage the cache performance by enforcing spatial and temporal cache locality. Table 2 introduces the notations we will use from now on.
The first aspect we want to optimize is the locality of the features for a given point of coordinates (i, j).
In the SoA version, we have two cubes: one that stores all the pixel features F SoA (fmat) which size is n F × h × w and a different cube P SoA (prmat) of size n P × h × w that stores the feature crossedproducts. In the AoS data layout, these cubes are transformed into two 2D arrays F AoS and P AoS of size h × (w · n F ) and h × (w · n P ).

The SoA→AoS transform swaps the loop nests and changes the addressing computations from a 3Dform cube[k][i][j] into a 2D-form like matrix[i][j ×n+k]
, where n is the structure cardinal (here n F or n P ). The lack of spatial locality within the features in the SoA representation is illustrated in Figure 4; here, the SoA layout (on the left) stores pixels features in discontiguous 2D arrays (distant in memory) while for the AoS representation (on the right) features belonging to the same pixel are gathered together in contiguous memory addresses.  Concerning the index k of Algorithms 1 and 2, the increment k ← k + 1 can be replaced by k = k 1 n F − k 1 (k 1 + 1)/n + k 2 for direct access to the product of feature k 1 by feature k 2 .

SIMD or OpenMP parallelization?
Once this transform is done, one can also apply SIMD to the different parts of the algorithm. For the product part, the two internal loops on k 1 and k 2 are fully unrolled in order to show the list of all multiplications and the list of vectors to construct through permutation instructions (e.g., _mm_shuffle_ps in streaming SIMD extensions (SSE)). For example, for a typical value of n F = 7, there are n P = 28 products. The associated vectors are (the numbers are the feature indexes) as follows: In that case, the 7th vector is 100% filled, but it will become suboptimal if n P is not divisible by the cardinal of the SIMD register (4 with SSE and Neon). In SSE, some permutations can be achieved using only one _mm_shuffle_ps instruction while others need a maximum of two. Because some permutations can be reused to perform other permutations, it is possible to achieve a factorization over all the required permutations. For example with n F = 7, 15 shuffles are required.
In advanced vector extensions (AVX)2, there is a new instruction (compared to AVX) that greatly simplifies permutations : _mm256_permutevar8x32_ps. This instruction implements a full cross-bar, so we need exactly one AVX2 permutation per register that is a total 8 (for n F = 7).
In Neon it is more complex. If some permutations can be done into 128-bits registers -that is with a parallelism of 4 -other permutations require instructions only available with 64-bit registers, like the lookup table instruction named vtbl. So in Neon, 128-bit float registers should be: 1) split into 64-bit registers with vget_low_f32 and vget_high_f32 instructions, 2) type-casted into 64-bit integer registers with vreinterpret_u8_f32 -no latency, just for the compiler -, 3) permuted with vtbl1_u8 and vtbl2_u8 instructions, 4) type-casted into 64-bit float registers with vreinterpret_f32_u8, and 5) combined into 128-bit float registers with vcombine_f32. Finally it requires 31 SIMD Neon instructions to create the seven pairs of products (and 17 extra instructions for the castings). Table 3 gives the values of vn F and vn P depending on n F ∈ {7, 8} and card, the number of block within an SIMD register. For the same values of vn F , Table 4 provides the number of permutations for SSE, AVX and Neon.  The first part of Table 5 provides the algorithmic complexity and the amount of memory accesses for scalar version. Just replace n F and n P with vn F and vn P from Table 3 to get the SIMD value. This table also provides the arithmetic intensity (AI) -popularized by Nvidia -that is the ratio between the number of operations and the number of memory accesses. Table 6 provides numerical results from Table 5 for scalar, SSE, AVX, and Neon version; for 3-loop version; and for the 1-loop version with loop-fusion transform.

AoS version + loop fusion with 1 loop
Integral of features 0 2n F 2n F n F -Integral product of features n P 2n P n P n P -Total n P 2(n P + n F ) n P + 2n F n P + n F -Total with n P = n F (n F + 1)/2 1.5n 2 F + 3.5n F n 2 F + 4n F ∼ 1.5 Concerning OpenMP, the point is to evaluate SOA + OpenMP versus AoS + SIMD. Indeed, for a common 4-core General Purpose Processor (GPP), the degree of parallelism with a multi-threaded version and with a SIMDized version is the same, i.e., four. Results are provided in cycles per point (cpp) versus the data amount (image size). The cpp is a normalized metric thats help to detect cache overflow (when data do not fit in the cache): the curve of cpp increases significantly.
The three versions (SoA + OpenMP, AoS, AoS + SIMD) have been benchmarked on three generations of Intel processors: Penryn, Nehalem, and SandyBridge for image sizes varying from 128 × 128 up to 1024 × 1024. It appears ( Figure 5) that a 4-threaded version is always slower than a 1-threaded SIMD version. Eight threads are required on the Nehalem to be faster. The reason is the low AI inducing a high stress on the architecture's buses and also because manipulating SoA requires n P = 28 active references in the cache; that is more than the usual L2 or L3 associativity (24 on the Intel processor). In the next steps of this article, SIMDization is the only architectural optimization being considered as realistic.

Loop fusion
We have tested three versions with loop-fusion in order to increase the AI ratio by reducing the amount of memory accesses. But for that, we first have to rewrite the integral image computation. As integral image computation is known as being memory bound, but also a very simple algorithm (3 LOADs, 1 STORE, and 3 ADDs), it is quite impossible to reduce its complexity. Nevertheless, one can remove 2 LOADs by using a register that holds the accumulation along a line. Algorithm 5 implements this optimization for basic integral image computation. The first one is a scalar parametric version (with n F ) that fuses the external i-loop and keeps the three j-loops unchanged. The second one is a specialized version with n F = 7 where the three internal loops are fused together. The third one is the SIMDized version of the second one. The internal loop fusion allows to save the LOAD/STORE instructions in order to write a product of features into memory and to read it afterwards to compute the integral image of products. The loop-fusion has been done by hand, but some tools like PIPS [24] can do such a kind of transformation automatically [25]. The complexity of scalar and SIMD versions are provided in Table 5. The numerical value of these expressions is given in Table 6.
To be efficient loop-fusion is combined to full loop-unwinding (on k 1 and k 2 ) and scalarisation (to store temporary results within a register instead of a memory cell of an array). The behavior of the code is the following, for a given pixel (i, j): • all the features associated to point (i, j) are loaded into n F registers: f 0 , f 1 , · · · f n F −1 , • the integral image computation of features is done on the fly and in place with Algorithm 5 with n F accumulators sf 0 , sf 1 · · · sf n F −1 . The point fmat(i,j,k) that previously holds n F features is replaced by the sums stored in the n F accumulator, • the n P products are then calculated in n P registers: p 00 , p 01 , ...p k 1 k2 , · · · p n F −1nF −1 • the integral image computation of the product of features is done in the same way, with n P accumulators. The point prmat(i,j,k) is filled with the n P accumulators of products.
The code is quite big (as internal loops are unwound) but very efficient (see next section), but it can be easily generated automatically by a C program, as it is very systematic: load features do accumulation of features, store accumulations, and compute products and do accumulation of products and store accumulations. It is relevant for a bigger value of n F to avoid bugs. The complexity of these new versions are given in the second part of Tables 5 and 6.
We can observe that without loop-fusion has the lowest AI of 0.5. We can notice that, for a given version, loop-fusion divides the complexity by a factor 1.2 (by rewriting image integral steps) and memory accesses by a factor 2.5 by avoiding LOADs and STOREs of temporary results.

Embedded systems
Let us now focus on more embedded processors like the Intel ULV (ultra low voltage) family and ARM processors. In order to observe the performance evolution for each family, two processors were bench-   Firstly, for all processors, the SoA version is very inefficient compared to the best one (AoS+T+SIMD). The SIMDization alone is also inefficient: around ×1.5 instead of ×4 the ideal speedup for 128-bit SIMD and ×2.5 instead of ×8 for 256-bit SIMD. The reason is that SIMD is really efficient only (a speedup close to the SIMD register cardinal) when data fit in the cache [23]. Secondly, there is one big difference between them: the cpp values. The Intel cpp's are up to ×4.5 smaller than ARM ones that comes from higher latency instructions.
Cortex-A15 is faster than A9 for two reasons: a faster cache and a faster external memory (same for Haswell-M versus Penryn-M) and because A15 can execute one Neon instruction every cycle instead of every two cycles: the SIMD throughput has been multiplied by two.
Regarding the impact of loop-fusion, Table 7 shows that the speedup with AoS version is from ×1.6 up to ×1.7 for ARM and Intel processor and for scalar and SIMD version, respectively. So the loop-fusion is as efficient as SIMDization. The total speedup is from ×3.8 up to ×4.9 for ARM and Penryn-M processor, respectively, but reaches ×7.9 for Haswell-M (with SSE instructions). Concerning execution time, the Penryn-M and the Haswell-M are, respectively, ×4.3 and ×2.2 faster than the A9 and the A15. If we compare the estimation energy consumption (based on approximative TDP as previously said), the A9 and the A15 are, respectively, ×3.8 and ×2.1 more efficient than the Penryn-M and the Haswell-M. ARM embedded processors are still more efficient than Intel ones.

Impact of other parameters: SIMD width and n F value
The impact of a twice wider SIMD -256 bits for AVX2 instead of 128 for SSE -has been evaluated on a Haswell-M processor. It appears that there is quite no difference in performance between SSE and AVX2. First, AVX (and AVX2) processors can pair two SSE instructions within one AVX instruction, thanks to the out-of-order capabilities of these processors. Once the SIMD are fetched and decoded into the pipeline, they are put in the 'instructions ready' window before being dispatched to an execute unit (named port in Intel vocabulary). If the processor can find two SSE data-independent instructions that are ready to be executed, it pairs them together and sends the new instruction to an execute unit.
The impact of n F has been also evaluated for the four processors. The two specialized scalar and SIMD versions AoS + T and AoS + T + SIMD have been instanced for n F = 8 SSE, AVX, and Neon. It makes sense for AVX architecture as eight features 100% fills one AVX register (see Table 3). The cpp ratio (cpp(n F = 8)/cpp(n F = 7)) varies from 1.11 up to 1.35 for ARM processors and 1.21 up to 1.27 for Intel processors. These values are very close to the theoretical ratios (1.27 and 1.25) of the complexity and memory access amounts of Table 5. It means that the execution time of that part of the global algorithm is predictable until we run out of register and generate spill code.
Two sequences have been used to evaluate the global performance on the four processors. Panda and Pedxing for which the robustness of the algorithm have been evaluated in [11] and [18]. For both of them, the execution times are given in cpp for each version of the algorithm: SoA is the basic version, and AoS++ stands for AoS transform + SIMDization + loop-fusion transform.
Two counter-intuitive results can be noticed. The first one is the features computation cpp: it is lower for SoA. The reason is obviously the memory layout of SoA (versus AoS) when computing the features and storing them into a cube or a matrix. The second counter-intuitive result is even more interesting: it concerns the tracking part of the algorithm which is based on the computation of a similarity criterion that requires the computation of the generalized eigenvalues, inversions, and matrix logarithms (9). In order to have the same behavior, we use GNU Scientific Library to perform these computations on both platforms, but we can also use Intel MKL or Eigen libraries. The future position is chosen by evaluating 40 (in our case, but it is parameterizable) random positions in a research window, so matrix operations represent a high percentage of the tracking part. It appears that the features used for tracking lead to a 'more' ill-conditioned matrix requiring more computations for Panda than for the Pedxing-3 sequence.
Concerning the acceleration, Tables 8 and 9 show that the optimization of the kernel provides a speedup of ×2.8 to ×2.9 for Intel processors and ×2.0 to ×2.6 for ARM ones that assets the need of all the optimizations.  As both processors have two cores, all the processing parts can be done either on one core (the execution time is the sum of all parts) or on two cores (the biggest part is on one core and the two other parts are on the second core). With such a coarse grain thread distribution, the Penryn-M and the Haswell-M can track targets in real time for 640 × 480 images. The Haswell-M is even real time with only one core. The Cortex-A9 can do it for image sizes up to 320 × 240 and the Cortex-A15 is close to real time for 640 × 480 images. Once the kernel computation has been optimized, the biggest processing part becomes the features computation. With the optimization of this part, the Cortex-A15 should be able to reach real-time execution.
The performance ratio of the whole algorithm is close to the performance ratio of the kernel: the Penryn-M and the Haswell-M are, respectively, ×4.0 and ×3.3 faster than the Cortex-A9 and the Cortex-A15. We can also observe that the image size has quite no impact on the performance ratio. From an energy point of view, the Cortex-A9 and the Cortex-A15 are, respectively, ×2.1 and ×2.7 more energy efficient than the Penryn-M and the Haswell-M.

Conclusions
We have presented the implementation of a robust covariance tracking algorithm, with a parameterizable complexity that can be adapted to trade-off between robustness and execution time. A study has been made to qualitatively compare different covariance matrices in terms of number and nature of visual features. Classical software and hardware optimizations have been applied: SIMDization and loopfusion transform combined with AoS-SoA transform to accelerate the kernel of the algorithm. These optimizations allow a real-time execution (25 fps or about 40 ms per image) for 320 × 240 image size on ARM Cortex-A9 and for 640 × 480 on Intel Penryn-M and Haswell. ARM Cortex-A15 should also reach real-time execution for such image size, once the other parts of the algorithm will be optimized.
Our future work will focus on (1) the optimization of the features computation and (2) the multithreading of the tracking in order to perform multi-target tracking with load balancing on the available core. A more thorough study should also be made concerning the impact of the ill-conditioning of the matrix on the execution time.
To the best of our knowledge, our implementation of the covariance tracking algorithm is the first realtime implementation for embedded systems, while perfectly maintaining the quality of the tracking.
Endnotes a K-Nearest Neighbours