Functional Additive Models on Manifolds of Planar Shapes and Forms

Abstract The “shape” of a planar curve and/or landmark configuration is considered its equivalence class under translation, rotation, and scaling; its “form” is its equivalence class under translation and rotation, with scale preserved. We extend generalized additive regression to models for such shapes/forms as responses, respecting the resulting quotient geometry by employing the squared geodesic distance as loss function and a geodesic response function to map the additive predictor to the shape/form space. For fitting the model, we propose a Riemannian L2-Boosting algorithm well suited for a potentially large number of possibly parameter-intensive model terms, which also yields automated model selection. We provide novel, intuitively interpretable visualizations for (even nonlinear) covariate effects in the shape/form space via suitable tensor-product factorization. The usefulness of the proposed framework is illustrated in an analysis of (a) astragalus shapes of wild and domesticated sheep and (b) cell forms generated in a biophysical model, as well as (c) in a realistic simulation study with response shapes and forms motivated from a dataset on bottle outlines. Supplementary materials for this article are available online.


Introduction
In many imaging data problems, the coordinate system of recorded objects is arbitrary or explicitly not of interest. Statistical shape analysis (Dryden and Mardia, 2016) addresses this point by identifying the ultimate object of analysis as the shape of an observation, reflecting its geometric properties invariant under translation, rotation and re-scaling, or as its form (or size-and-shape), invariant under translation and rotation. This paper establishes a flexible additive regression framework for modeling the shape or form of planar (potentially irregularly sampled) curves and/or landmark configurations in dependence on scalar covariates. The proposed component-wise Riemannian L2-Boosting algorithm allows estimation of a large number of parameter-intensive model terms with inherent model selection. We further introduce a novel visualization of functional additive effects using a tensor-product factorization, which we consider essential for practical interpretation.
A rich literature on statistical shape analysis has been developed for 2D or 3D landmark configurations (representing, for instance, selected points of a bone or face), which are considered elements of Kendall's shape space (see, e.g., Dryden and Mardia, 2016). In many 2D scenarios, however, observed points describe a curve reflecting the outline of an object rather than dedicated landmarks (Adams et al., 2013). Considering outlines as images of (parameterized) curves shows a direct link to functional data analysis (FDA, Ramsay and Silverman, 2005) and, in this context, we speak of functional shape/form data analysis. As in FDA, functional shape/form data can be observed on a common and often dense grid (regular/dense design) or on curve-specific, often sparse grids (irregular/sparse design). While in the regular case analysis often simplifies by treating curve evaluations as multivariate data, more general irregular designs gave rise to further developments in sparse FDA (e.g. Yao et al., 2005; Greven and Scheipl, 2017), explicitly considering irregular measurements instead of pre-smoothing curves. To the best of our knowledge, we are the first to consider irregular/sparse designs in the context of functional shape/form analysis.
Shapes and forms are examples of manifold data. Starting from geodesic regression (Fletcher, 2013), which extends linear regression to curved spaces, several generalizations of standard regression models to manifold responses have been developed, such as MANOVA, polynomial regression (Hinkle et al., 2014), regression along geodesic paths with non-constant speed (Hong et al., 2014), or kernel regression (Davis et al., 2010). While the formulations are more general, landmark shape spaces often serve as an example. However, only one metric covariate or categorical covariates are considered. In FDA, there is a much wider range of developed regression methods (see overviews in Morris, 2015; Greven and Scheipl, 2017). Among the most flexible models for (univariate) functional responses are the functional additive (mixed) models (FA(M)Ms) of, i.a., Morris and Carroll (2006) and Meyer et al. (2015) for dense functional data and the flexible FAMM model class also covering sparse irregular functional data, summarized in Greven and Scheipl (2017). There is less work on regression for bivariate or multivariate functional responses, which are closest to functional shapes/forms but without invariances. Rosen and Thompson (2009), Zhu et al. (2012) and Olsen et al. (2018) consider linear fixed effects of scalar covariates, the latter also allowing for warping. Zhu et al. (2017) and Backenroth et al. (2018) consider one or more random effects for one grouping variable, linear fixed effects, and common dense grids for all functions. Volkmann et al. (2021) combine the FAMM model class of Greven and Scheipl (2017) with multivariate functional principal component analysis (Happ and Greven, 2018) to obtain FAMMs for multivariate (sparse) functional responses.
For estimation of FAMMs, Brockhaus et al. (2015) employ a (component-wise) model-based gradient boosting algorithm. Model-based boosting (Hothorn et al., 2010) is a stage-wise fitting procedure for estimating additive models with inherent model selection and a slow over-fitting behavior, allowing efficient estimation of models with a very large number of coefficients. Regularization is applied both in each step and via early stopping of the algorithm based, e.g., on cross-validation. The combination of high auto-correlation within functional responses and complex tensor-product covariate effects typical for FAMMs makes the double regularization particularly beneficial in this context (Stöcker et al., 2021). Gradient boosting with respect to (w.r.t.) a least-squares loss is typically referred to as L2-Boosting (Bühlmann and Yu, 2003). We introduce a Riemannian L2-Boosting algorithm for estimating interpretable additive regression models for response shapes/forms of planar curves.
The three major contributions of our regression framework are: 1. we introduce additive regression with shapes/forms of planar curves and/or landmarks as response, extending FAMs to non-linear response spaces or, vice versa, extending GLM-type regression on manifolds for landmark shapes both to functional shape manifolds and to include (non-linear) additive model effects; 2. we propose a novel Riemannian L2-Boosting algorithm for estimating regression models for this type of manifold response; and 3. we provide a visualization technique based on tensor-product factorization, yielding intuitive interpretations even of multi-dimensional smooth covariate effects for practitioners. Although related tensor-product model transformations based on higher-order SVD have been used, i.a., in control engineering (Baranyi et al., 2013), we are not aware of any comparable application for visualization in FAMs or other statistical models for object data. Despite our focus on shapes and forms, transfer of the model, Riemannian L2-Boosting, and factorized visualization to other Riemannian manifold responses is intended in the generality of the formulation and the design of the provided R package manifoldboost (developer version on github.com/Almond-S/manifoldboost). Moreover, while considering potential invariance w.r.t. re-parameterization ("warping") of curves is beyond the scope of this paper, the presented framework is intended to be combined with an "elastic" approach in the square-root velocity framework (Srivastava and Klassen, 2016) in the future. The versatile applicability of the approach is illustrated in three different scenarios: an analysis of the shape of sheep astragali (ankle bones), represented by both regularly sampled curves and landmarks, in dependence on categorical "demographic" variables; an analysis of the effects of different metric biophysical model parameters (including smooth interactions) on the form of (irregularly sampled) cell outlines generated from a cellular Potts model; and a simulation study with irregularly sampled functional shape and form responses generated from a dataset of different bottle outlines and including metric and categorical covariates.
In Section 2, we introduce the manifold geometry of irregular curves modulo translation, rotation and potentially re-scaling, which underlies the intrinsic additive regression model formulated in Section 3. The Riemannian L2-Boosting algorithm is introduced in Section 4.
Section 5 analyzes different data problems, modeling sheep bone shape responses (Section 5.1) and cell outlines (Section 5.2). Section 5.3 summarizes the results of simulation studies with functional shape and form responses. We conclude with a discussion in Section 6.

Geometry of functional forms and shapes
Riemannian manifolds of planar shapes (and forms) are discussed in various textbooks at different levels of generality, in finite (Dryden and Mardia, 2016;Kendall et al., 1999) or potentially infinite dimensions (Srivastava and Klassen, 2016;Klingenberg, 1995). Starting from the Hilbert space Y of curve representatives y of a single shape or form observation, we successively characterize its quotient space geometry under translation, rotation and rescaling including the respective tangent spaces. Building on that, we introduce Riemannian exponential and logarithmic maps and parallel transports needed for model formulation and fitting, and the sample space of (irregularly observed) functional shapes/forms.
To make use of complex arithmetic, we identify the two-dimensional plane with the complex numbers, $\mathbb{R}^2 \cong \mathbb{C}$, and consider a planar curve to be a function $y : \mathbb{R} \supset \mathcal{T} \to \mathbb{C}$, element of a separable complex Hilbert space $\mathcal{Y}$ with a complex inner product $\langle\cdot,\cdot\rangle$ and corresponding norm $\lVert\cdot\rVert$. This allows simple scalar expressions for the group actions of translation $\mathrm{Trl} = \{y \overset{\mathrm{Trl}_\gamma}{\longmapsto} y + \gamma\mathbf{1} : \gamma \in \mathbb{C}\}$, with $\mathbf{1} \in \mathcal{Y}$ the real constant function of unit norm; re-scaling $\mathrm{Scl} = \{y \overset{\mathrm{Scl}_\lambda}{\longmapsto} \lambda y : \lambda \in \mathbb{R}^+\}$; and rotation $\mathrm{Rot} = \{y \overset{\mathrm{Rot}_u}{\longmapsto} u y : u \in S^1\}$, where multiplication with $u = \exp(\imath\omega) \in S^1 = \{z \in \mathbb{C} : |z| = 1\}$ rotates by $\omega$ in radian measure. Concatenation yields combined group actions $G$ as direct products, such as the rigid motions $G = \mathrm{Trl} \times \mathrm{Rot} = \{\mathrm{Trl}_\gamma \circ \mathrm{Rot}_u : \gamma \in \mathbb{C}, u \in S^1\} \cong \mathbb{C} \times S^1$ (see Supplement A.1.1 for more details). The two real-valued component functions of $y$ are identified with the real part $\mathrm{Re}(y) : \mathcal{T} \to \mathbb{R}$ and imaginary part $\mathrm{Im}(y) : \mathcal{T} \to \mathbb{R}$ of $y = \mathrm{Re}(y) + \mathrm{Im}(y)\,\imath$. While the complex setup is used for convenience, the real part of $\langle\cdot,\cdot\rangle$ constitutes an inner product $\mathrm{Re}(\langle y_1, y_2\rangle) = \langle \mathrm{Re}(y_1), \mathrm{Re}(y_2)\rangle + \langle \mathrm{Im}(y_1), \mathrm{Im}(y_2)\rangle$ for $y_1, y_2 \in \mathcal{Y}$ on the underlying real vector space of planar curves. Typically, $\mathrm{Re}(y)$, $\mathrm{Im}(y)$ are assumed square-integrable with respect to a measure $\nu$ and we consider the canonical inner product $\langle y_1, y_2\rangle = \int y_1^\dagger y_2 \, d\nu$, where $y^\dagger$ denotes the conjugate transpose of $y$, i.e. $y^\dagger(t) = \mathrm{Re}(y)(t) - \mathrm{Im}(y)(t)\,\imath$ is simply the complex conjugate, but for vectors $y \in \mathbb{C}^k$, the vector $y^\dagger$ is also transposed. For curves, we typically assume $\nu$ to be the Lebesgue measure on $\mathcal{T} = [0, 1]$; for landmarks, a standard choice is the counting measure on $\mathcal{T} = \{1, \dots, k\}$.
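As a concrete illustration, the following minimal R sketch (helper names are ours, not the manifoldboost interface) evaluates this inner product on a grid with trapezoidal integration weights:

```r
# Complex inner product <y1, y2> = int Conj(y1(t)) y2(t) dnu(t),
# approximated on a grid t via trapezoidal integration weights.
trapezoid_weights <- function(t) {
  k <- length(t)
  w <- numeric(k)
  w[1] <- (t[2] - t[1]) / 2
  w[k] <- (t[k] - t[k - 1]) / 2
  if (k > 2) w[2:(k - 1)] <- (t[3:k] - t[1:(k - 2)]) / 2
  w
}
inner <- function(y1, y2, w) sum(Conj(y1) * w * y2)

t <- seq(0, 1, length.out = 100)
y <- complex(real = cos(2 * pi * t), imaginary = sin(2 * pi * t))  # unit circle
w <- trapezoid_weights(t)
Re(inner(y, y, w))  # squared norm ||y||^2, here approximately 1
```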
The ultimate response object is given by the orbit $[y]_G = \{g(y) : g \in G\}$ (or short $[y]$) of $y \in \mathcal{Y}$, the equivalence class under the respective combined group actions $G$: $[y]_{\mathrm{Trl}\times\mathrm{Rot}\times\mathrm{Scl}} = \{\lambda u y + \gamma\mathbf{1} : \lambda \in \mathbb{R}^+, u \in S^1, \gamma \in \mathbb{C}\}$ is referred to as the shape of $y$ and $[y]_{\mathrm{Trl}\times\mathrm{Rot}} = \{u y + \gamma\mathbf{1} : u \in S^1, \gamma \in \mathbb{C}\}$ as its form or size-and-shape. $\mathcal{Y}/G = \{[y]_G : y \in \mathcal{Y}\}$ denotes the quotient space of $\mathcal{Y}$ with respect to $G$. The description of the Riemannian geometry of $\mathcal{Y}/G$ involves, in particular, a description of the tangent spaces $T_{[y]}\mathcal{Y}/G$ at points $[y] \in \mathcal{Y}/G$, which can be considered local vector space approximations to $\mathcal{Y}/G$ in a neighborhood of $[y]$. For a point $q$ in a manifold $\mathcal{M}$, the tangent vectors $\beta \in T_q\mathcal{M}$ can, i.a., be thought of as gradients $\dot c(0)$ of paths $c : \mathbb{R} \supset (-\delta, \delta) \to \mathcal{M}$ at 0 where they pass through $c(0) = q$. Besides their geometric meaning, they will also play an important role in the regression model, as additive model effects are formulated on tangent space level. Choosing suitable representatives $y_G \in [y]_G \subset \mathcal{Y}$ (or short $y$) of orbits $[y]_G$, we use an identification of tangent spaces with suitable linear subspaces of $\mathcal{Y}$.

Form geometry: Starting with translation as the simplest invariance, an orbit $[y]_{\mathrm{Trl}}$ can be one-to-one identified with its centered representative $y_{\mathrm{Trl}} = y - \langle y, \mathbf{1}\rangle\mathbf{1}$, yielding an identification $\mathcal{Y}/\mathrm{Trl} \cong \{y \in \mathcal{Y} : \langle y, \mathbf{1}\rangle = 0\}$ with a linear subspace of $\mathcal{Y}$. Hence, the tangent spaces are also identified with this subspace. Accounting for rotation invariance on top of translation, a form $[y]_{\mathrm{Trl}\times\mathrm{Rot}}$ is identified with the representative $y_{\mathrm{Trl}\times\mathrm{Rot}}$ obtained by rotation aligning $y_{\mathrm{Trl}}$ to the centered representative $p_{\mathrm{Trl}}$ of a chosen pole $[p]$ (Fig. 1). While $y_{\mathrm{Trl}\times\mathrm{Rot}}$ depends on $p$, we omit this in the notation for simplicity.
All $y_{\mathrm{Trl}}$ rotation aligned to $p_{\mathrm{Trl}}$ lie on the hyper-plane determined by $\mathrm{Im}(\langle y_{\mathrm{Trl}}, p_{\mathrm{Trl}}\rangle) = 0$ (Figure 1), which yields $T_{[p]}\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot} = \{y \in \mathcal{Y} : \langle y, \mathbf{1}\rangle = 0,\ \mathrm{Im}(\langle y, p\rangle) = 0\}$. Note that, despite the use of complex arithmetic, this is a real vector space. The form distance, given by the norm distance of aligned representatives, reflects both the length of the shortest path (i.e. the geodesic) between the forms and the minimum distance between the orbits as sets.
Shape geometry: To account for scale invariance in shapes $[y]_{\mathrm{Trl}\times\mathrm{Rot}\times\mathrm{Scl}}$, they are identified with normalized representatives $y_{\mathrm{Trl}\times\mathrm{Rot}\times\mathrm{Scl}} = y_{\mathrm{Trl}\times\mathrm{Rot}} / \lVert y_{\mathrm{Trl}\times\mathrm{Rot}}\rVert$. Motivated by the normalization, we borrow the well-known geometry of the sphere $\mathcal{S} = \{y \in \mathcal{Y} : \lVert y\rVert = 1\}$, where $T_p\mathcal{S} = \{y \in \mathcal{Y} : \mathrm{Re}(\langle y, p\rangle) = 0\}$ is the tangent space at a point $p \in \mathcal{S}$ and geodesics are great circles. Together with translation and rotation invariance, the shape tangent space is then given by $T_{[p]}\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}\times\mathrm{Scl} = \{y \in \mathcal{Y} : \langle y, \mathbf{1}\rangle = 0,\ \langle y, p\rangle = 0\}$, and the geodesic distance between two shapes corresponds to the arc-length between the representatives. This distance is often referred to as Procrustes distance in statistical shape analysis.
We may now define the maps needed for the regression model formulation. Let $y$, $p$ and $p'$ be shape/form representatives of $[y]$, $[p]$ and $[p']$, rotation aligned to the shape/form pole representative $p$. Generalizing straight lines to a Riemannian manifold $\mathcal{M}$, geodesics $c : (-\delta, \delta) \to \mathcal{M}$ can be characterized by their "intercept" $c(0) \in \mathcal{M}$ and "slope" $\dot c(0) \in T_{c(0)}\mathcal{M}$. The exponential map $\mathrm{Exp}_q : T_q\mathcal{M} \to \mathcal{M}$ at a point $q \in \mathcal{M}$ is defined to map $\beta \mapsto c(1)$ for $c$ the geodesic with $q = c(0)$ and $\beta = \dot c(0)$. It maps $\beta \in T_q\mathcal{M}$ to a point $\mathrm{Exp}_q(\beta) \in \mathcal{M}$ located $d(q, \mathrm{Exp}_q(\beta)) = \lVert\beta\rVert$ apart of the pole $q$ in the direction of $\beta$. On the form space $\mathcal{Y}/\mathrm{Trl}\times\mathrm{Rot}$, the exponential map is simply given by $\mathrm{Exp}_{[p]}(\varepsilon) = [p + \varepsilon]$. On the shape space $\mathcal{Y}/\mathrm{Trl}\times\mathrm{Rot}\times\mathrm{Scl}$, identification with exponential maps on the sphere yields $\mathrm{Exp}_{[p]}(\varepsilon) = [\cos(\lVert\varepsilon\rVert)\, p + \sin(\lVert\varepsilon\rVert)\, \varepsilon/\lVert\varepsilon\rVert]$. In an open neighborhood $\mathcal{U}$, $q \in \mathcal{U} \subset \mathcal{M}$, $\mathrm{Exp}_q$ is invertible, yielding the $\mathrm{Log}_q : \mathcal{U} \to T_q\mathcal{M}$ map from the manifold to the tangent space at $q$. For forms, it is given by $\mathrm{Log}_{[p]}([y]) = y - p$; for shapes, $\mathrm{Log}_{[p]}([y]) = \frac{d([y],[p])}{\sin(d([y],[p]))}\left(y - \cos(d([y],[p]))\, p\right)$ with aligned representative $y$. The parallel transport for shapes takes the form of the parallel transport on a sphere, replacing the real inner product with its complex analogue. For forms, it changes only the $\mathrm{Im}(\langle\varepsilon, p\rangle)$ coordinate orthogonal to the real $p$-$p'$-plane as in the shape case, while the remainder of $\varepsilon$ is left unchanged as in a linear space. This yields $\mathrm{Transp}_{[p],[p']}(\varepsilon) = \varepsilon - \mathrm{Im}\left(\left\langle \tfrac{p'}{\lVert p'\rVert}, \varepsilon\right\rangle\right) \frac{p/\lVert p\rVert + p'/\lVert p'\rVert}{1 + \langle p/\lVert p\rVert,\, p'/\lVert p'\rVert\rangle}\,\imath$ for form tangent vectors. While this or equivalent expressions for the parallel transport in the shape case can be found, e.g., in Dryden and Mardia (2016), a corresponding derivation for the form case is given in the Supplement Section A.1.2. This also involves a more detailed discussion of the quotient space geometry in differential geometric terms.
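In the same toy setting (reusing the illustrative inner() helper above; all names hypothetical, assuming centered representatives with the second pole rotation aligned to the first), the form-space maps can be sketched as:

```r
align <- function(y, p, w) {           # optimal rotation alignment of y to p
  u <- inner(p, y, w)
  (Conj(u) / Mod(u)) * y               # afterwards Im(<p, y>) = 0, Re(<p, y>) >= 0
}
exp_form <- function(p, eps) p + eps                  # form geodesics: straight lines
log_form <- function(p, y, w) align(y, p, w) - p      # inverse of exp_form
transp_form <- function(p, p2, eps, w) {              # parallel transport [p] -> [p2]
  pn  <- p  / sqrt(Re(inner(p,  p,  w)))
  p2n <- p2 / sqrt(Re(inner(p2, p2, w)))
  eps - Im(inner(p2n, eps, w)) *
    (pn + p2n) / (1 + Re(inner(pn, p2n, w))) * 1i
}
```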
Based on this understanding of the response space, we may now proceed to consider a sample of curves $y_1, \dots, y_n \in \mathcal{Y}$ representing orbits $[y_1], \dots, [y_n]$ with respect to group actions $G$. In the functional case, with the domain $\mathcal{T} = [0, 1]$, these curves are usually observed as evaluations $\mathbf{y}_i = (y_i(t_{i1}), \dots, y_i(t_{ik_i}))^\top$ on a finite grid $t_{i1} < \dots < t_{ik_i} \in \mathcal{T}$, which may differ between observations. In contrast to the regular case with common grids, this more general data structure is referred to as irregular functional shape/form data. To handle this setting, we replace the original inner product $\langle\cdot,\cdot\rangle$ on $\mathcal{Y}$ by individual $\langle \mathbf{y}_i, \mathbf{y}_i'\rangle_i = \mathbf{y}_i^\dagger \mathbf{W}_i \mathbf{y}_i'$, providing inner products on the $k_i$-dimensional space $\mathcal{Y}_i = \mathbb{C}^{k_i}$ of evaluations $\mathbf{y}_i, \mathbf{y}_i'$ on the same grid. The symmetric positive-definite weight matrix $\mathbf{W}_i$ can be chosen to implement an approximation to integration w.r.t. the original measure $\nu$ with a numerical integration measure $\nu_i$, such as given by the trapezoidal rule. Alternatively, $\mathbf{W}_i = \frac{1}{k_i}\mathbf{I}_{k_i}$ with $k_i \times k_i$ identity matrix $\mathbf{I}_{k_i}$ presents a canonical choice that is analogous to the landmark case for $k_i \equiv k$. With the inner products given for $i = 1, \dots, n$, the sample space naturally arises as the Riemannian product $\mathcal{Y}^*_1/G \times \dots \times \mathcal{Y}^*_n/G$ of the orbit spaces, with the individual geometries constructed as described above.

Additive Regression on Riemannian Manifolds
Consider a data scenario with n observations of a random response covariate tuple (Y, X), where the realizations of Y are planar curves y i : T → C, i = 1, . . . , n, belonging to a Hilbert space Y defined as above and potentially irregularly measured on individual grids t i1 < · · · < t ik i ∈ T . The response object [Y ] is the equivalence class of Y with respect to translation, rotation and possibly scale and the sample [y 1 ], . . . , [y n ] is equipped with the respective Riemannian manifold geometry introduced in the previous section. For i = 1, . . . , n, realizations x i ∈ X of a covariate vector X in a covariate space X are observed.
X can contain several categorical and/or metric covariates.
For regressing the mean of $[Y]$ on $X = x$, we model the shape/form $[\mu]$ of a curve $\mu \in \mathcal{Y}$ as
$$[\mu] = \mathrm{Exp}_{[p]}\left(h(x)\right), \qquad h(x) = \sum_{j=1}^J h_j(x),$$
with an additive predictor $h : \mathcal{X} \to T_{[p]}\mathcal{Y}^*/G$ acting in the tangent space at an "intercept" $[p] \in \mathcal{Y}^*/G$. Generalizing an additive model "$Y = \mu + \epsilon = p + h(x) + \epsilon$" in a linear space, we implicitly define $[\mu]$ as the conditional mean of $[Y]$ given $X = x$ by assuming zero-mean "residuals" $\epsilon$. In their definition, we follow Cornea et al. (2017) but extend to the functional shape/form and additive case. We assume local linearized residuals $\varepsilon_{[\mu]} = \mathrm{Log}_{[\mu]}([Y]) \in T_{[\mu]}\mathcal{Y}^*/G$. Here, we assume $[Y]$ is sufficiently close to $[\mu]$ with probability 1 such that $\mathrm{Log}_{[\mu]}$ is well-defined, which is the case whenever $\langle Y, \mu\rangle \neq 0$ for centered shape/form representatives $Y$ and $\mu$, an un-restrictive and common assumption (compare also Cornea et al., 2017). The local residuals $\varepsilon_{[\mu]}$, however, live in covariate-dependent tangent spaces. To obtain a formulation in a common linear space instead, local residuals are mapped to residuals $\epsilon = \mathrm{Transp}_{[\mu],[p]}(\varepsilon_{[\mu]})$ by parallel transporting them from $[\mu]$ to the common covariate-independent pole $[p]$. After this isometric mapping into $T_{[p]}\mathcal{Y}^*/G$, we can equivalently define the conditional mean $[\mu]$ via $E(\epsilon) = 0$ for the transported residuals $\epsilon$.

Figure 1: Left: Quotient space geometry: assuming $p$ and $y$ centered, translation invariance is not further considered in the plot; given pole representative $p$, we express $y = \frac{\mathrm{Re}(\langle p, y\rangle)}{\lVert p\rVert^2}\, p + \frac{\mathrm{Im}(\langle p, y\rangle)}{\lVert p\rVert^2}\, \imath p + \left(y - \frac{\langle p, y\rangle}{\lVert p\rVert^2}\, p\right) \in \mathcal{Y}$ in its coordinates in $p$ and $\imath p$ direction, subsuming all orthogonal directions in the third dimension. In this coordinate system, the rotation orbit $[y]_{\mathrm{Rot}}$ corresponds to the dotted horizontal circle, and is identified with the aligned $y := y_{\mathrm{Rot}}$ in the half-plane of $p$; $[y]_{\mathrm{Rot}\times\mathrm{Scl}}$ is identified with the unit vector $y_{\mathrm{Rot}\times\mathrm{Scl}} = \frac{y}{\lVert y\rVert}$ projecting $y$ onto the hemisphere depicted by the vertical semicircle. Form and shape distances between $[p]$ and $[y]$ correspond to the length of the geodesics $c(\tau)$ (thick lines) on the plane and sphere, respectively. Right: Geodesic line $c(\tau)$ between $p = c(0)$ and $p' = c(1)$, Log-map projecting $y$ to $\varepsilon \in T_p\mathcal{M}$, parallel transport $\mathrm{Transp}_{p,p'}$ forwarding $\varepsilon$ to $\varepsilon' \in T_{p'}\mathcal{M}$, and Exp-map projecting $\varepsilon'$ onto $\mathcal{M}$, visualized for a sphere. Tangent spaces, identified with subspaces of the ambient space, are depicted as gray planes above the respective poles. The parallel transport preserves all angles between tangent vectors and identifies $\dot c(0) \cong \dot c(1)$.
The exponential map $\mathrm{Exp}_{[p]}$ is analogous to a response function in GLMs but depends on $[p]$. While other response functions could be used, we restrict ourselves to the exponential map here, such that the model contains a geodesic model (Fletcher, 2013), the direct generalization of simple linear regression, as a special case for $h(x) = \beta x_1$ with a single covariate $x_1$ and tangent vector $\beta$. Typically, it is assumed that $h$ is centered such that $E(h(X)) = 0$, and the pole $[p]$ corresponds to the overall mean of $[Y]$ defined, like the conditional mean, via residuals of mean zero.
3.1 Tensor-product effect functions h_j

Scheipl et al. (2015) and other authors employ tensor-product (TP) bases for functional additive model terms. This naturally extends to tangent space effects, which we model as
$$h_j(x) = \sum_{r=1}^{m}\sum_{l=1}^{m_j} \theta_j^{(r,l)}\, b_j^{(l)}(x)\, \partial_r,$$
with the TP basis given by the pair-wise products of $m$ linearly independent tangent vectors $\partial_r \in T_{[p]}\mathcal{Y}^*/G$, $r = 1, \dots, m$, and $m_j$ basis functions $b_j^{(l)} : \mathcal{X} \to \mathbb{R}$, $l = 1, \dots, m_j$, for the $j$-th covariate effect depending on one or more covariates. The real basis coefficients can be arranged as a matrix $\Theta_j = \{\theta_j^{(r,l)}\}_{r,l} \in \mathbb{R}^{m \times m_j}$. While, in principle, the basis $\{\partial_r\}_r$ could also vary across effects $j = 1, \dots, J$, we assume a common basis for notational simplicity, which presents the typical choice. Due to the identification of $T_{[p]}\mathcal{Y}^*/G$ with a subspace of the function space $\mathcal{Y}$, the $\{\partial_r\}_r$ may be specified using a function basis commonly used in additive models: let $b_0^{(1)}, \dots, b_0^{(m_0)} : \mathcal{T} \to \mathbb{R}$ be real basis functions, such as B-splines; we specify $(\partial_1, \dots, \partial_m) = (b_0^{(1)}, \dots, b_0^{(m_0)}, \imath b_0^{(1)}, \dots, \imath b_0^{(m_0)})\, \mathbf{Z}_p$, employing the same basis for the 1- and $\imath$-dimension before transforming it with a suitable basis transformation matrix $\mathbf{Z}_p = \{z_p^{(l,r)}\}_{l,r} \in \mathbb{R}^{2m_0 \times m}$ with $m < 2m_0$, implementing the linear tangent space constraints as given in Section 2. Practically, $\mathbf{Z}_p$ can be obtained as a null space basis matrix of the matrix $(\mathrm{Re}(\mathbf{C}), \mathrm{Im}(\mathbf{C}))$ with $\mathbf{C} = \{\langle b_0^{(l)}, \zeta^{(r)}\rangle\}_{r,l}$ (or with the empirical inner product on the product space of irregular curves instead), constructed from the normal vectors $\zeta^{(r)} \in \mathcal{Y}$ of the tangent space, i.e. $\zeta^{(1)} = \mathbf{1}$, $\zeta^{(2)} = \imath\mathbf{1}$ and $\zeta^{(3)} = \imath p$, with additionally $\zeta^{(4)} = p$ in the shape case. For closed curves, we additionally choose $\mathbf{Z}_p$ to enforce periodicity, i.e. $\partial_r(t) = \partial_r(t + t_0)$ for some $t_0 \in \mathbb{R}$ (compare Hofner et al., 2016).
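A hedged sketch of this null-space construction for the form case (three real constraints from the normals $\mathbf{1}$, $\imath\mathbf{1}$ and $\imath p$; illustrative names, equal inner-product weights) could look as follows:

```r
library(splines)
t  <- seq(0, 1, length.out = 60)
B0 <- bs(t, df = 8)                          # k x m0 real basis evaluations
m0 <- ncol(B0)
p  <- complex(real = cos(2 * pi * t), imaginary = sin(2 * pi * t))
w  <- rep(1 / length(t), length(t))          # inner-product weights
zeta <- cbind(rep(1 + 0i, length(t)),        # normal vectors 1, i*1, i*p
              1i * rep(1, length(t)), 1i * p)
C  <- t(B0) %*% (w * zeta)                   # <b_0^(l), zeta^(r)>, m0 x 3 complex
K  <- cbind(t(Re(C)), t(Im(C)))              # real constraints on stacked coefficients
Zp <- MASS::Null(t(K))                       # 2*m0 x m null-space basis, m = 2*m0 - 3
tangent_basis <- cbind(B0, 1i * B0) %*% Zp   # evaluated tangent vectors
max(abs(Re(t(Conj(zeta)) %*% (w * tangent_basis))))  # constraints hold: ~ 0
```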
Given the tangent space basis, we may now modularly specify the usual additive model basis functions $b_j^{(l)} : \mathcal{X} \to \mathbb{R}$, $l = 1, \dots, m_j$, for the $j$-th covariate effect to obtain the full functional additive model 'tool box' offered by, e.g., Brockhaus et al. (2015). A 'linear effect' (linear in the tangent space) of the form $h_j(x) = \beta z$, with a scalar (typically centered) covariate $z$ in $x$ and $\beta \in T_{[p]}\mathcal{Y}^*/G$, is simply implemented by a single function $b_j^{(1)}(x) = z$. For a categorical covariate $\kappa$ with $K$ levels, $b_j : \kappa \mapsto e_\kappa \in \mathbb{R}^{K-1}$ maps category $\kappa$ to a usual contrast vector just as in standard linear models. Here, we typically use effect-encoding to obtain centered effects. Moreover, TP interactions of the model terms described above, as well as group-specific effects and smooth effects with additional constraints (Hofner et al., 2016), can be specified in the model formula, relying on the mboost framework introduced by Hothorn et al. (2010), which also allows defining custom effect designs. For identification of an overall mean intercept $[p]$, sum-to-zero constraints yielding $\sum_{i=1}^n h_j(x_i) = 0$ for observed covariates $x_i$, $i = 1, \dots, n$, can be specified, and similar constraints can be used to distinguish linear from non-linear effects and interactions from their marginal effects (Kneib et al., 2009). Different quadratic penalties can be specified for the coefficients $\Theta_j$, allowing to regularize high-dimensional effect bases and to balance effects of different complexity in the model fit (see also Section 4).

Tensor-product factorization
The multidimensional structure of the response objects makes it challenging to graphically illustrate and interpret additive model terms, in particular when it comes to non-linear (interaction) effects, or when effect sizes are visually small. To solve this problem, we suggest to re-write the TP effects $h_j$ for a given fixed coefficient matrix $\Theta_j$ as
$$h_j(x) = \sum_{r} \xi_j^{(r)}\, h_j^{(r)}(x),$$
with orthonormal tangent vectors $\xi_j^{(r)} \in T_{[p]}\mathcal{Y}^*/G$ and scalar effect functions $h_j^{(r)} : \mathcal{X} \to \mathbb{R}$ expressed in the basis $b_j^{(l)}$, $l = 1, \dots, m_j$, for the underlying effect basis. The $h_j^{(r)}$ are specified to achieve decreasing component variances $v_j^{(r)} = E\left(h_j^{(r)}(X)^2\right)$. In practice, the expectation over the covariates $X$ and the inner product $\langle\cdot,\cdot\rangle$ are replaced by empirical analogs (compare Supplement Corollary 3). Due to orthonormality of the $\xi_j^{(r)}$, the component variances add up to the total predictor variance $v_j = E\left(\lVert h_j(X)\rVert^2\right) = \sum_r v_j^{(r)}$. Moreover, the TP factorization is optimally concentrated in the first components in the sense that for any $l \leq m_j$ there is no expansion into $l$ orthonormal directions with scalar effects explaining more of the predictor variance: the series of the first $l$ components yields the best rank $l$ approximation of $h_j$. The factorization relies on SVD of (a transformed version of) the coefficient matrix $\Theta_j$, and the fact that it is well-defined is a variant of the Eckart-Young-Mirsky theorem (proof in Supplement A.2).
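In coordinates, with orthonormalized bases (Gram matrices equal to the identity), the factorization reduces to a plain SVD of the coefficient matrix, as this toy R sketch (illustrative names) shows:

```r
set.seed(1)
m <- 6; mj <- 5
Theta <- matrix(rnorm(m * mj), m, mj)    # coefficient matrix of one effect
s <- svd(Theta)                          # Theta = U D V'
# direction coefficients: s$u[, r]; scalar effect coefficients: s$d[r] * s$v[, r]
var_share <- s$d^2 / sum(s$d^2)          # components' shares of the effect variance
cumsum(var_share)                        # concentration in the leading components
```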
Particularly when large shares of the predictor variance are explained by the first component(s), the decomposition facilitates graphical illustration and interpretation: choosing a suitable constant $\tau \neq 0$, an effect direction $\xi_j^{(r)}$ can be visualized via its impact $\mathrm{Exp}_{[p]}(\tau\,\xi_j^{(r)})$ on the mean shape/form, accompanied by a plot of the scalar effect function $h_j^{(r)}(x)$. When based on the same $\tau$, different covariate effects can be compared across the plots sharing the same scale. We suggest $\tau = \max_j \sqrt{v_j}$, the maximum total predictor standard deviation of an effect, as a good first choice.
Besides factorizing effects separately, it can also be helpful to apply TP factorization to the joint additive predictor, yielding, e.g., a leading component $\xi^{(1)} h^{(1)}(x)$, modeled by a scalar additive predictor $h^{(1)}(x)$ composed of covariate effects analogous to the original model predictor. In Section 5, we illustrate its potential in three different scenarios.
Riemannian L2-Boosting

In linear spaces, component-wise L2-Boosting repeatedly fits base-learners to current model residuals; combined with early stopping, it handles (typically highly autocorrelated) functional responses with parameter-intensive additive model base-learners and, thus, leads to good estimates even in challenging scenarios.
This analogy to L2-Boosting motivates the presented generalization, where local residuals are further transported to residuals in a common linear space.
Consider the pole $[p]$ known and fixed for now. Assuming its existence, we aim to minimize the population mean loss $\sigma^2(h) = E\left(d^2\left([Y], \mathrm{Exp}_{[p]}(h(X))\right)\right)$, thereby minimizing the conditional expected squared distance. Fixing a covariate constellation $x \in \mathcal{X}$, the prediction $[\mu] = \mathrm{Exp}_{[p]}(h(x))$ corresponds to the Fréchet mean (Karcher, 1977) of $[Y]$ conditional on $X = x$. In a finite-dimensional context, Pennec (2006) shows that $E(\varepsilon_{[\mu]}) = 0$ for a Fréchet mean $[\mu]$ if residuals $\varepsilon_{[\mu]}$ are uniquely defined with probability one. This indicates the connection to our residual based model formulation in Section 3. We fit the model by reducing the empirical mean loss $\hat\sigma^2(h) = \frac{1}{n}\sum_{i=1}^n d_i^2\left([y_i], \mathrm{Exp}_{[p]}(h(x_i))\right)$, where we replace the population mean by the sample mean and compute the geodesic distances $d_i$ with respect to the inner products $\langle\cdot,\cdot\rangle_i$ defined for the respective evaluations of $y_i$.
A base-learner corresponds to a covariate effect $h_j(x) = \sum_{r,l} \theta_j^{(r,l)}\, b_j^{(l)}(x)\, \partial_r$ with coefficient matrix $\Theta_j = \{\theta_j^{(r,l)}\}_{r,l}$, which is repeatedly fit to the transported residuals $\epsilon_1, \dots, \epsilon_n$ by penalized least-squares (PLS). Via the penalty parameters $\lambda_j, \lambda \geq 0$, the effective degrees of freedom of the base-learners are controlled (Hofner et al., 2011) to achieve a balanced "fair" base-learner selection despite the typically large and varying number of coefficients involved in the TP effects. The symmetric penalty matrices $\mathbf{P}_j \in \mathbb{R}^{m_j \times m_j}$ and $\mathbf{P} \in \mathbb{R}^{m \times m}$ (imposing, e.g., a second-order difference penalty for B-splines in either direction) can equivalently be arranged as an $m_j m \times m_j m$ penalty matrix $\mathbf{R}_j = \lambda_j(\mathbf{P}_j \otimes \mathbf{I}_m) + \lambda(\mathbf{I}_{m_j} \otimes \mathbf{P})$ for the vectorized coefficients $\mathrm{vec}(\Theta_j) = (\theta_j^{(1,1)}, \dots, \theta_j^{(m,1)}, \dots, \theta_j^{(m,m_j)})^\top$, where $\otimes$ denotes the Kronecker product. The standard PLS estimator of $\mathrm{vec}(\Theta_j)$ then takes the usual ridge-type form built from the TP design matrix of basis evaluations, the inner product weights, and the penalty matrix $\mathbf{R}_j$. In a regular design, using the functional linear array model (Brockhaus et al., 2015) can save memory and computation time by avoiding construction of the complete matrices. The basis construction of $\{\partial_r\}_r$ via a transformation matrix $\mathbf{Z}_p$ (Section 3.1) is reflected in the penalty by setting $\mathbf{P} = \mathbf{Z}_p^\top(\mathbf{I}_2 \otimes \mathbf{P}_0)\mathbf{Z}_p$ with $\mathbf{P}_0$ the penalty matrix for the un-transformed basis $\{b_0^{(r)}\}_r$. In each iteration of the proposed Algorithm 1, the best-performing base-learner is added to the current ensemble additive predictor $h(x)$ after multiplying it with a step-length parameter $\eta \in (0, 1]$. Due to the additive model structure, this corresponds to a coefficient update of the selected covariate effect. Accordingly, after repeated selection, the effective degrees of freedom of a covariate effect in general exceed the degrees specified for the base-learner; they are successively adjusted to the data. To avoid over-fitting, the algorithm is typically stopped early, before reaching a minimum of the empirical mean loss. The stopping iteration is determined, e.g., by re-sampling strategies such as bootstrapping or cross-validation on the level of shapes/forms.
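The mechanics of one boosting iteration can be sketched in a few lines of plain linear algebra (a simplification, not the manifoldboost implementation; residuals are assumed stacked as real coordinate vectors with weight matrix W, all names illustrative):

```r
boost_step <- function(eps, B_list, R_list, W, coefs, eta = 0.1) {
  best <- 1; best_rss <- Inf
  for (j in seq_along(B_list)) {
    Bj <- B_list[[j]]                                      # design of base-learner j
    theta <- solve(crossprod(Bj, W %*% Bj) + R_list[[j]],  # penalized least squares
                   crossprod(Bj, W %*% eps))
    rss <- sum((eps - Bj %*% theta)^2)                     # fit to transported residuals
    if (rss < best_rss) { best_rss <- rss; best <- j; best_theta <- theta }
  }
  coefs[[best]] <- coefs[[best]] + eta * best_theta        # step-length update
  coefs
}
```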
The pole $[p]$ is, in fact, usually not a priori available. Instead, we typically assume that $[p] = \arg\min_{[p'] \in \mathcal{Y}^*/G} E\left(d^2([Y], [p'])\right)$ is the overall Fréchet mean, also often referred to as Riemannian center of mass for Riemannian manifolds or as Procrustes mean in shape analysis (Dryden and Mardia, 2016). Here, we estimate it as the intercept $[\hat p] \in \mathcal{Y}^*/G$ in the intercept-only special case of our model, estimated with Algorithm 1 based on a preliminary pole $[p_0] \in \mathcal{Y}^*/G$. For shapes and forms, a good candidate for $p_0$ can be obtained as the standard functional mean of a reasonably well aligned sample $y_1, \dots, y_n \in \mathcal{Y}$ of representatives.
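In the intercept-only case, the fit essentially reproduces the classical Karcher fixed-point iteration for the Fréchet mean, sketched here with the hypothetical form-space helpers from Section 2 (align/log_form/exp_form/inner):

```r
frechet_mean_form <- function(ys, w, p0, tol = 1e-8, maxit = 100) {
  p <- p0                                    # preliminary pole, e.g. functional mean
  for (it in seq_len(maxit)) {
    eps_bar <- Reduce(`+`, lapply(ys, function(y) log_form(p, y, w))) / length(ys)
    p <- exp_form(p, eps_bar)                # move the pole towards the mean
    if (sqrt(Re(inner(eps_bar, eps_bar, w))) < tol) break
  }
  p
}
```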
The proposed Riemannian L 2 -Boosting algorithm is available in the R (R Core Team, 2018) package manifoldboost (github.com/Almond-S/manifoldboost). The implementation is based on the package FDboost (Brockhaus et al., 2020), which is in turn based on the model-based boosting package mboost (Hothorn et al., 2010).

Algorithm 1 (Riemannian L2-Boosting, overview): given the specified geometry (shape/form) and a pole representative p, with hyper-parameters step-length η ∈ (0, 1] and the number of boosting iterations, each iteration computes transported residuals, performs a PLS fit of every base-learner j = 1, . . . , J to them, selects the best-performing base-learner, and updates its coefficients with step-length η.

Shape analysis of sheep astragali

We analyze astragalus (ankle bone) shapes of wild, feral and domesticated sheep from different breeds and husbandry conditions (Schafberg and Wussow, 2010). Table S1 in Supplement A.3 shows the distribution of available covariates within the breeds. Each sheep astragalus shape, $i = 1, \dots, n$, is represented by a configuration composed of 11 selected landmarks in a vector $y_i^{\mathrm{lm}} \in \mathbb{C}^{11}$ and two vectors of semi-landmarks $y_i^{\mathrm{c1}} \in \mathbb{C}^{14}$ and $y_i^{\mathrm{c2}} \in \mathbb{C}^{18}$ evaluated along two outline curve segments, marked on a 2D image of the bone (dorsal perspective). Several example configurations are displayed in Supplement Figure S1. In general, we could separately specify smooth function bases for the outline segments $y_i^{\mathrm{c1}}$ and $y_i^{\mathrm{c2}}$, respectively. Due to their systematic recording, we assume, however, that not only landmarks but also semi-landmarks are regularly observed on a fixed grid, and refrain from using smooth function bases for simplicity. Accordingly, shape configurations can directly be identified with their evaluation vectors $y_i \in \mathbb{C}^{43}$, and the geometry of the response space $\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}\times\mathrm{Scl}$ widely corresponds to the classic Kendall's shape space geometry, with the difference that, considering landmarks more descriptive than single semi-landmarks, we choose a weighted inner product $\langle y_i, y_i'\rangle = y_i^\dagger \mathbf{W} y_i'$ with diagonal weight matrix $\mathbf{W}$ with diagonal $(\mathbf{1}_{11}^\top, \frac{3}{14}\mathbf{1}_{14}^\top, \frac{3}{18}\mathbf{1}_{18}^\top)$, assigning the weight of three landmarks to each outline segment. We model the astragalus shapes $[y_i]$ in dependence on the categorical covariates status (wild/feral/domestic), breed and mobility, as well as age and sex. For identifiability, the breed and mobility effects are centered around the status effect, as we only have data on different breeds/mobility levels for domesticated sheep. All base-learners are regularized to one degree of freedom by employing ridge penalties for the coefficients of the covariate bases $\{b_j^{(l)}\}_l$, while the coefficients of the response basis (the standard basis for $\mathbb{C}^{43}$) are left un-penalized. With a step-length of $\eta = 0.1$, 10-fold shape-wise cross-validation suggests early stopping after 89 boosting iterations. Due to the regular design, we can make use of the functional linear array model (Brockhaus et al., 2015) for saving computation time and memory, which leads to 8 seconds for the initial model fit followed by 47 seconds of cross-validation. To interpret the categorical covariate effects, we rely on TP factorization (Figure 2). The first component of the status effect explains about 2/3 of the variance of the status effect and over 50% of the cumulative effect variance in the model. In that main direction, the effect of feral is not located between wild and domestic, as might be naively expected. By contrast, the second component of the effect seems to reflect the expected order and still explains a considerable amount of variance. Similar to Pöllath et al. (2019), we find little influence of age, sex and mobility on the astragalus shape. Yet, all covariates were selected by the boosting algorithm.
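As a quick sanity check on this weighting (a sketch, not package code), each outline segment indeed receives the total weight of three landmarks:

```r
w_diag <- c(rep(1, 11), rep(3/14, 14), rep(3/18, 18))  # landmarks, segment 1, segment 2
W <- diag(w_diag)                                      # weight matrix on C^43
c(sum(w_diag[12:25]), sum(w_diag[26:43]))              # each segment: total weight 3
```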
Visually, differences in estimated mean shapes are rather small, which is, in our experience, quite usual for shape data. With differences in size, rotation and translation excluded by definition, only comparably small variance remains in the observed shapes.
Nonetheless, TP factorization provides accessible visualization of the effect directions and allows a partial ordering of the effect levels in each direction.

Cellular Potts model parameter effects on cell form
We model cell outlines generated from a cellular Potts model (CPM) in dependence on four metric biophysical model parameters, where the conditional form mean $[\mu_i]$ is modeled in dependence on tangent-space linear effects with coefficients $\beta_j \in T_{[p]}\mathcal{Y}/\mathrm{Trl}\times\mathrm{Rot}$ and non-linear smooth effects $f_j$ for covariates $j = 1, \dots, 4$, as well as smooth interaction effects $f_{jj'}$ for each pair of covariates $j \neq j'$.
All involved (effect) functions are modeled via a cyclic cubic P-spline basis $\{b_0^{(r)}\}_r$ with 7 (inner) knots and a ridge penalty, and quadratic P-splines with 4 knots for the covariates $x_{ij}$, equipped with a second-order difference penalty for the $f_j$ and ridge penalties for interactions. Covariate effects are mean centered, and interaction effects $f_{jj'}(x_j, x_{j'})$ are centered around their marginal effects $f_j(x_j)$, $f_{j'}(x_{j'})$, which are in turn centered around the linear effects $\beta_j x_j$ and $\beta_{j'} x_{j'}$, respectively. Resulting predictor terms involve 69 (linear effect) to 1173 (interaction) basis coefficients but are penalized to a common degree of freedom of 2 to ensure a fair base-learner selection. We fit the model with a step-size of $\eta = 0.25$ and stop after 2000 boosting iterations, observing no further meaningful risk reduction, since no need for early-stopping is indicated by 10-fold form-wise cross-validation. Due to the increased number of data points and coefficients, the irregular design, and the increased number of iterations, the model fit takes considerably longer than in Section 5.1, with an initial 47 minutes followed by 8 hours 46 minutes of cross-validation. However, as usual in boosting, model updates are large in the beginning and only marginal in later iterations, such that fits after 1000 or 500 iterations would already yield very similar results. (Estimated interaction surfaces $f_{12}(x_{i1}, x_{i2})$ illustrate deviations from the marginal effects, which are of particular interest for CPM calibration.)
Observing that the most relevant components point into similar directions, we jointly factorize the additive model predictor as $h(x_i) = \sum_r \xi^{(r)} h^{(r)}(x_i)$ with TP factorization. The first component explains about 93% of the total predictor variance (Supplement Fig. S3), indicating that, post-hoc, a good share of the model can be reduced to the geodesic determined by the first effect direction $\xi^{(1)}$, visualized in Figure 3. A positive effect in the direction $\xi^{(1)}$ makes cells larger and more keratocyte-like / croissant shaped; a negative effect, pointing into the opposite direction, makes them smaller and more mesenchymal-like / elongated. The bulk stiffness $x_{i1}$ turns out to be the most important driving factor behind the cell form, explaining over 75% of the cumulative variance of the effects (Supplement Fig. S2). Around 80% of its effect is explained by the linear term, reflecting gradual shrinkage at the side of the cells with increasing bulk stiffness.

Realistic shape and form simulation studies
To evaluate the proposed approach, we conduct a series of simulation studies for both form and shape regression for irregular curves. We compare sample sizes $n \in \{54, 162\}$ and average grid sizes $\bar k = \frac{1}{n}\sum_{i=1}^n k_i$, with the true model explaining a substantial share of the total variability (measured around $[p]$) in the form scenario and around 65% in the shape scenario. All simulation settings were repeated 100 times, fitting the generated data with models including the model components specified above and three additional nuisance effects: a linear effect $\beta z_1$ (orthogonal to $f_1(z_1)$), an effect $f_2$ of the same structure as $f_1$ but depending on an independently uniformly drawn variable $z_2$, and a constant effect $h_0 \in T_{[p]}\mathcal{Y}^*/G$ to test centering around $[p]$. For the model fit, all base-learners are regularized to 4 degrees of freedom. We specify a step-length of $\eta = 0.1$, and early-stopping is based on 10-fold cross-validation.

Form scenario: The smooth effect $f_1(z_1)$ is captured well (Fig. 4). For this effect, we observe a median relative mean squared error (rMSE) of 7% of the total predictor variance for small data settings with $n = 54$ and $\bar k = 100$ (5.9% with $\bar k = 40$), which reduces to 1.5% for $n = 162$ (for both $\bar k = 40$ and $\bar k = 100$). It is typical for functional data that, from a certain point, adding more (highly correlated) evaluations per curve leads to distinctly less improvement in the model fit than adding further observations (compare, e.g., also Stöcker et al., 2021). In the extreme $k_i = 3$ scenario, we obtain an rMSE of around 15%, which is, not surprisingly, considerably higher than for the moderate settings above. Even in this extreme setting (Fig. 4), the effect directions are captured well, while the size of the effect is underestimated. Rotation alignment based on only three points (which are randomly distributed along the curves) might considerably differ from the full curve alignment, and averaging over these sub-optimal alignments masks the full extent of the effect. Still, results are very good given the sparsity of information in this case. Having a simpler form, the binary effect $\beta_\kappa$ is also estimated more accurately, with an rMSE of around 1.5% for $n = 54$, $\bar k = 100$ (1.9% for $\bar k = 40$) and less than 0.8% for $n = 162$ (for both $\bar k = 40$ and $\bar k = 100$). The pole estimation accuracy varies on a similar scale.

Shape scenario: Qualitatively, the shape scenario shows a similar picture. For $\bar k = 40$, we observe median rMSEs of 2.8% ($n = 54$) and 2.2% ($n = 162$) for $f_1(z_1)$, and 1.5% and 0.6% for the binary effect $\beta_\kappa$. For $\bar k = 100$, accuracy is again slightly higher.

Nuisance effects and integration weights:
Nuisance effects in the model were generally rarely selected and, if selected at all, only led to a marginal loss in accuracy. The constant effect is only selected sometimes in the extreme 'triangle' ($k_i = 3$) scenarios, when pole estimation is difficult. We refer to Brockhaus et al. (2017), who perform gradient boosting with functional responses and a large number of covariate effects with stability selection, for simulations with larger numbers of nuisance effects and further discussion in a related context, as variable selection is not our main focus here. Finally, simulations indicate that inner product weights implementing a trapezoidal rule for numerical integration are slightly preferable for typical grid sizes ($\bar k = 40, 100$), whereas weights of $1/k_i$, equal over all grid points within a curve, gave slightly better results in the extreme $k_i = 3$ settings.
All in all, the simulations show that Riemannian L 2 -Boosting can adequately fit both shape and form models in a realistic scenario and captures effects reasonably well even for a comparably small number of sampled outlines or evaluations per outline.

Discussion and Outlook
Compared to existing (landmark) shape regression models, the presented approach extends linear predictors to more general additive predictors including also, e.g., smooth nonlinear model terms and interactions, and yields the first regression approach for functional shape as well as form responses. Moreover, we propose novel visualizations based on TP factorization that, similar to functional principal component analysis, enable a systematic decomposition of the variability explained by an additive effect on tangent space level.
Yielding meaningful coordinates for model effects, the factorization will also be useful for visualization in functional additive models in linear spaces.
Instead of operating on the original evaluations $y_i \in \mathbb{C}^{k_i}$ of response curves $y_i$ as in all applications above, another frequently used approach expands $y_i$, $i = 1, \dots, n$, in a common basis first, before carrying out statistical analysis on coefficient vectors (compare, e.g., Ramsay and Silverman, 2005, with analogous approaches used in particular for shapes/forms).

A.1.1 Group actions and invariances

Translation, rotation and re-scaling could be defined with respect to different reference points. Keeping other possible combinations of invariances in mind, we explicitly refer to an individual reference point and suggest the centroid $0_y = \langle \mathbf{1}, y\rangle\mathbf{1}$ or, more generally, $0_y = a(y)\mathbf{1}$ for some linear functional $a : \mathcal{Y} \to \mathbb{R}$. Assuming $\mathrm{Trl}_\gamma(0_y) = 0_{\mathrm{Trl}_\gamma(y)}$, as for the centroid, the definition of re-scaling and rotation around $0_y$ ensures that $\mathrm{Trl}_\gamma$, $\mathrm{Scl}_\lambda$ and $\mathrm{Rot}_u$ commute, and that $\mathrm{Trl}$, $\mathrm{Rot}$ and $\mathrm{Scl}$ present normal subgroups of the combined group actions $\{y \mapsto \lambda u y + \gamma : \gamma \in \mathbb{C}, \lambda \in \mathbb{R}^+, u \in S^1\}$ of shape invariances. Thus, the combined group actions can be written as the direct product (or direct sum), and invariances with respect to $\mathrm{Trl}$, $\mathrm{Scl}$, $\mathrm{Rot}$ can be modularly accounted for in arbitrary order. $\mathrm{Trl} \times \mathrm{Rot}$, for instance, describes rigid motions. The ultimate response object is then given by the orbit $[y]_G = \{g(y) : g \in G\}$ (or short $[y]$), i.e. the equivalence class with respect to the direct sum $G$ generated by the chosen combination of $\mathrm{Trl}$, $\mathrm{Scl}$ and $\mathrm{Rot}$.
$[y]_{\mathrm{Trl}\times\mathrm{Scl}\times\mathrm{Rot}}$ is referred to as the shape of $y$ and $[y]_{\mathrm{Trl}\times\mathrm{Rot}}$ as its form or size-and-shape (compare Dryden and Mardia, 2016); studying $[y]_{\mathrm{Scl}}$ is closely related to directional data analysis (Mardia and Jupp, 2009), where the direction of $y$ is analyzed independent of its size $\lVert y\rVert$.
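A toy numerical check (a sketch assuming a discrete uniform measure, so that the centroid is the plain mean) confirms that rotation and re-scaling defined around the centroid commute:

```r
set.seed(2)
y <- complex(real = rnorm(5), imaginary = rnorm(5))
ctr <- function(y) mean(y)                            # centroid 0_y
rot <- function(y, u) u * (y - ctr(y)) + ctr(y)       # rotation around 0_y
scl <- function(y, l) l * (y - ctr(y)) + ctr(y)       # re-scaling around 0_y
u <- exp(1i * 0.7); l <- 2.5
max(Mod(rot(scl(y, l), u) - scl(rot(y, u), l)))       # ~ 0: actions commute
```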

A.1.2 Parallel transport of form tangent vectors
To confirm that the parallel transport in the form space $\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}$ can be carried out via representatives in $\mathcal{Y}$ as described in the main manuscript, we closely follow the derivation of the shape parallel transport in the literature. Necessary differential geometric notions and statements are briefly introduced in the following before stating the main result in Lemma 1. For a more profound introduction, we recommend Lee (2018, in particular, p. 21, 43, 93, 124, 337, 402), as well as Tu (2011) for an illustrative introduction into some of the concepts, and Klingenberg (1995, in particular, p. 103-107). By construction, the quotient map $\Phi : \mathcal{Y}^* \to \mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}$, $y \mapsto [y]$, presents a Riemannian submersion: since $[p] = \{up + \gamma\mathbf{1} : u \in S^1, \gamma \in \mathbb{C}\}$ embeds $S^1 \times \mathbb{R}^2$ in $\mathcal{Y}$, and since the tangent spaces of $S^1$ and $\mathbb{R}^2$ are well-known, the vertical space is given by $T_p[p] \cong \{\lambda p\imath + \gamma\mathbf{1} : \lambda \in \mathbb{R}, \gamma \in \mathbb{C}\} \subset \mathcal{Y}$, with orthogonal complement $H_p\mathcal{Y}^* \cong \{y \in \mathcal{Y} : \langle y, \mathbf{1}\rangle = 0, \mathrm{Im}(\langle y, p\rangle) = 0\}$ (see also Figure 1 in the main manuscript for an illustration). While $\Phi$ is obviously surjective, surjectivity and isometry of $d\Phi|_{H_p\mathcal{Y}^*}$ can be seen by expressing $\Phi$ in terms of the charts for $\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}$: for a given $p \in [p] \in \mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}$, the map $\overline{(\cdot)} : [y] \mapsto y_{\mathrm{Trl}\times\mathrm{Rot}}$ provides a chart $\mathcal{U}_{[p]} \to V_p$, i.e. an isomorphism from $\mathcal{U}_{[p]} = \{[y] \in \mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot} : \langle p_{\mathrm{Trl}}, y_{\mathrm{Trl}}\rangle \neq 0\}$ to $V_p = \{y \in \mathcal{Y} : \mathrm{Im}(\langle p, y\rangle) = 0, \mathrm{Re}(\langle p, y\rangle) > 0, \langle \mathbf{1}, y\rangle = 0\}$, used to establish the differential structure on $\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}$. Expressed in this chart, $\bar\Phi(y) = \overline{(\cdot)} \circ \Phi(y) = y_{\mathrm{Trl}\times\mathrm{Rot}}$ is the identity for all $y \in V_p \subset \Phi^{-1}(\mathcal{U}_{[p]})$. Thus, since $T_pV_p = H_p\mathcal{Y}^*$, also $d\bar\Phi|_{H_p\mathcal{Y}^*}$ is the identity, which is obviously an isometric isomorphism. The latter carries over to $d\Phi|_{H_p\mathcal{Y}^*}$ independent of the given chart.
This yields the identification $T_{[y]}\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot} \cong H_y\mathcal{Y}^* = \{\tilde y \in \mathcal{Y} : \langle \tilde y, \mathbf{1}\rangle = 0, \mathrm{Im}(\langle \tilde y, y\rangle) = 0\}$, which we rely on in the main manuscript. Unlike there, we denote the inverse $d\Phi^{-1}|_{H_y\mathcal{Y}^*} : \xi \mapsto \tilde\xi$ also for tangent vectors in the following, to make the identification of $\xi = d\Phi(\tilde\xi) \in T_{[y]}\mathcal{Y}^*/\mathrm{Trl}\times\mathrm{Rot}$ with the corresponding $\tilde\xi \in H_y\mathcal{Y}^*$, usually referred to as horizontal lift, explicit in the notation.
In analogy to straight lines, geodesic curves $c(\tau)$ are characterized by a vanishing covariant derivative of their velocity, i.e. curves with zero 'second derivative'. More generally, a vector-field $W$ along $c$ is called parallel if
$$\frac{D}{d\tau} W(\tau) = 0. \qquad (3)$$
Accordingly, the parallel transport $\mathrm{Transp}^c_{q,q'} : T_q\mathcal{M} \to T_{q'}\mathcal{M}$ along a curve $c : [\tau_0, \tau_1] \to \mathcal{M}$ between $c(\tau_0) = q$ and $c(\tau_1) = q' \in \mathcal{M}$ is defined to map tangent vectors $\varepsilon = W(\tau_0) \mapsto \varepsilon' = W(\tau_1)$ for some vector field $W$ parallel along $c$ (fulfilling Equation 3).
If the curve $c$ is clear from context, we omit it in the notation. This is especially the case in the following, where $c$ can be chosen as the unique geodesic between two forms $[p]$ and $[p']$ with $\langle p, p'\rangle \neq 0$, yielding a canonical connection (in this case, $c$ corresponds to the line between $p$ and the aligned $p'$; for $\langle p, p'\rangle = 0$, by contrast, it is easy to see that for each $u \in S^1$ the line between $p$ and $up'$ corresponds to a different geodesic; the second case can, however, be neglected).
The possibility to effectively carry out the parallel transport between forms $[p]$, $[p']$ on suitable representatives $p, p' \in \mathcal{Y}^*$ stems from the following theorem and subsequent corollary (compare, e.g., Klingenberg, 1995, p. 103-105).
i) A vector-field $W$ along $c$ is parallel if and only if the parallelity condition (3) holds, in the horizontal sense, for the horizontal vector-field $\widetilde W \in T\mathcal{M}$ along $\tilde c$.
ii) $c$ is a geodesic if and only if $\tilde c$ is a geodesic.
The vector-fields $W(\tau)$ constructed for the transport are horizontal along $\tilde c(\tau)$, i.e. $W(\tau) \in H_{\tilde c(\tau)}\mathcal{Y}^*$ for each $\tau$, if $\tilde\varepsilon$ is horizontal, i.e. if $\mathrm{Im}(\langle p, \tilde\varepsilon\rangle) = \langle \mathbf{1}, \tilde\varepsilon\rangle = 0$; obviously, also $\langle \mathbf{1}, W(\tau)\rangle = 0$, as this is the case for all involved vectors. Moreover, $W$ is smooth and $\tilde\varepsilon \mapsto W$ linear. On the right-hand side, the orthogonal projection of a vector-field $V(\tau) := V_{\tilde c(\tau)} \in T_{\tilde c(\tau)}\mathcal{Y}^*$ along $\tilde c(\tau)$ into the vertical spaces (of which $\{\gamma(\tau), \mathbf{1}, \imath\mathbf{1}\}$ constitute an orthonormal basis) is expressed with the 1-forms $\omega_{\mathrm{Rot}}$ and $\omega_{\mathrm{Trl}}$, with $\omega_{\mathrm{Trl}}(V_p) = \langle \mathbf{1}, V_p\rangle$ for $p \in \mathcal{Y}^*$.
Proof of Lemma 2. i) See, e.g., Lee (2018), Proposition B.12 on page 402. This is a standard result. Note that, based on an alternative (yet also common) definition of the wedge product and, hence, the exterior derivative, the expressions may differ by a constant factor. In that case, we also have $d\omega_{\mathrm{Rot}}(V, W) = \mathrm{Im}(\langle V, W\rangle)$ in ii), compensating for the different factor in the proof of Lemma 1.
ii) Let $\{e_r\}_r$ be an orthonormal $\mathbb{C}$-basis of $\mathcal{Y}$ (a complete orthonormal system existing since $\mathcal{Y}$ is separable) and $\{\vartheta^{(r)}(y)\}_r = \{\langle e_r, y\rangle\}_r$ the corresponding dual basis. The tangent vectors $\partial^{\mathrm{Re},r}_p \cong e_r$ and $\partial^{\mathrm{Im},r}_p \cong \imath e_r$, $p \in \mathcal{Y}$, together form an $\mathbb{R}$-basis of the tangent space; in the corresponding coordinates, $\omega_{\mathrm{Rot}}$ involves terms $\mathrm{Re}(\langle e_r, p\rangle)\,\mathrm{Im}(\langle e_r, V_p\rangle) - \mathrm{Im}(\langle e_r, p\rangle)\,\mathrm{Re}(\langle e_r, V_p\rangle)$, and thus, expressing the exterior derivative in terms of wedge products yields the claim, where the last equation follows from a computation analogous to (7).

A.2 Tensor-product factorization
The optimality of the proposed tensor-product factorization follows from the Eckart-Young-Mirsky theorem (EYM), which can be found, e.g., in Gentle (2007, page 139) for matrices and, in more general terms, in Hsing and Eubank (2015, page 111) for Hilbert-Schmidt operators. In the following, we present a tensor-product version of EYM designed for our needs. The optimality of the tensor-product factorization is then illustrated in two corollaries: first in a theoretical model setting and second for the empirical decomposition on evaluations, which can be practically conducted on given data. Consider two real vector spaces $B_j$, $j \in \{0, 1\}$, with positive semi-definite bilinear forms $\langle\cdot,\cdot\rangle_j : B_j \times B_j \to \mathbb{R}$ inducing semi-norms $\lVert\cdot\rVert_j$. Assuming $B_1$ to be, in fact, a function space of functions $f : \mathcal{X} \to \mathbb{R}$ on some set $\mathcal{X}$, the (vector space) tensor product $B_1 \otimes B_0$ of $B_0$ and $B_1$ is the vector space spanned by all $f \otimes y : \mathcal{X} \to B_0$, $x \mapsto f(x)\, y$ with $f \in B_1$ and $y \in B_0$. By linear extension, a symmetric positive semi-definite bilinear form on $B_1 \otimes B_0$ is defined by $\langle f \otimes y, f' \otimes y'\rangle_{B_1 \otimes B_0} = \langle f, f'\rangle_1\, \langle y, y'\rangle_0$ for all $f, f' \in B_1$, $y, y' \in B_0$. It induces a semi-norm $\lVert\cdot\rVert_{B_1 \otimes B_0}$ on the tensor product space.
Theorem 2 (Eckart-Young-Mirsky for finite-dimensional tensor-products). Let $B_0$, $B_1$ be semi-normed vector spaces as defined above and $h = \sum_{r=1}^{m_0}\sum_{l=1}^{m_1} \theta^{(l,r)}\, b_1^{(l)} \otimes b_0^{(r)} \in B_1 \otimes B_0$ with coefficient matrix $\Theta = \{\theta^{(l,r)}\}_{l,r}$. Then there is a decomposition $h = \sum_{r=1}^{m} d^{(r)}\, \xi_1^{(r)} \otimes \xi_0^{(r)}$ with $d^{(1)} \geq \dots \geq d^{(m)} \geq 0$ and orthonormal $\xi_j^{(r)} \in B_j$ such that, for any $L \leq m$, the first $L$ components yield a best rank-$L$ approximation of $h$ in $\lVert\cdot\rVert_{B_1 \otimes B_0}$ among all expansions $\sum_{r=1}^{L} d'^{(r)}\, \xi'^{(r)}_1 \otimes \xi'^{(r)}_0$ for all $d'^{(r)} \in \mathbb{R}$ and $\xi'^{(r)}_j \in B_j$, $j \in \{0, 1\}$, $r = 1, \dots, L$. Arranging $D = \mathrm{diag}(d^{(1)}, \dots, d^{(m)})$ and expressing $\xi_j^{(r)} = \sum_l u_j^{(l,r)} b_j^{(l)}$, $r = 1, \dots, m$, with the coefficient matrices $\{u_j^{(l,r)}\}_{l,r} = U_j \in \mathbb{R}^{m_j \times m}$, $j \in \{0, 1\}$, an optimal decomposition is obtained as follows: (a) if the Gram matrices $G_j = \{\langle b_j^{(r)}, b_j^{(l)}\rangle_j\}_{r,l}$ are the identity, $G_j = I_{m_j}$, the matrices $D$ and $U_j$, $j \in \{0, 1\}$, are directly determined via SVD of the coefficient matrix $\Theta$; (b) in general, suitable transformation matrices $M_j$ with $M_j^\top M_j = G_j$ reduce the problem to case (a), applying the SVD to the transformed coefficient matrix and setting $U_j = M_j^{-1} V_j$ with the resulting singular vector matrices $V_j$. If $G_j = \{\langle b_j^{(r)}, b_j^{(l)}\rangle_j\}_{r,l}$ arises from evaluations $B_j$, with $r, l = 1, \dots, m_j$, for a symmetric positive definite weight matrix $W_j$, we may equivalently set $M_j = R_j$ based on the QR-decomposition $W_j^{\frac{1}{2}} B_j = Q_j R_j$. In this case, design matrices $E_j$ of vector representatives $\xi_j^{(r)}$ for the $\xi_j^{(r)}$ can, alternatively, be obtained as $E_j = W_j^{-\frac{1}{2}} Q_j V_j$ (where $W_j$ is typically diagonal and, hence, $W_j^{-\frac{1}{2}}$ fast to compute).
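A small numerical sketch of case (b) (toy data, illustrative names) verifies that the resulting directions are orthonormal with respect to the weighted inner product:

```r
set.seed(3)
k <- 50; m0 <- 6; m1 <- 4
B  <- matrix(rnorm(k * m0), k, m0)         # basis evaluations, k x m0
Wd <- rep(1 / k, k)                        # diagonal weights
qrB <- qr(sqrt(Wd) * B)                    # QR of W^(1/2) B
Q <- qr.Q(qrB); R <- qr.R(qrB)
Theta <- matrix(rnorm(m1 * m0), m1, m0)    # coefficients; other basis orthonormal
s <- svd(Theta %*% t(R))                   # SVD in orthonormalized coordinates
E <- (1 / sqrt(Wd)) * (Q %*% s$v)          # evaluated directions xi_0^(r)
round(crossprod(sqrt(Wd) * E), 10)         # identity: orthonormal w.r.t. <.,.>_W
```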
The EYM for matrices (e.g. Gentle, 2007, page 139) provides the corresponding best rank-$L$ approximation property for the Frobenius norm. To apply the theorem, we point out that, provided the Gram matrices $G_j = I_{m_j}$, the basis representation w.r.t. $\{b_j^{(l)}\}_l$ presents an isometric isomorphism from $A_j$ to $\mathbb{R}^{m_j}$. Accordingly, the basis representation $A_1 \otimes A_0 \to \mathbb{R}^{m_0 \times m_1}$, $h \mapsto \Theta$, presents an isometric isomorphism with $\langle h_1, h_2\rangle_{A_1 \otimes A_0} = \mathrm{tr}\left(\Theta_1^\top \Theta_2\right)$ for basis representations $h_1 \mapsto \Theta_1$ and $h_2 \mapsto \Theta_2$. This lets us carry over the EYM for matrices to $A_1 \otimes A_0$, yielding the desired inequality (8) restricted to $\xi_j^{(r)} \in A_j$. The property $d^{(1)} \geq \dots \geq d^{(m)} \geq 0$ and orthonormality of the $\xi_j^{(r)}$ are also inherited from the SVD.
Proof. After applying Theorem 2, it remains to check that $\lVert h\rVert^2_{L_2(\mathcal{X}) \otimes \mathcal{Y}} = E\left(\lVert h(X)\rVert^2\right)$. Indeed, this holds for all simple $h = f \otimes y$, since $\langle f \otimes y, f' \otimes y'\rangle_{L_2(\mathcal{X}) \otimes \mathcal{Y}} = E\left(f(X) f'(X)\right) \langle y, y'\rangle$ for any $y, y' \in \mathcal{Y}$ and $f, f' \in L_2(\mathcal{X})$, and, therefore, carries over to all $h \in L_2(\mathcal{X}) \otimes \mathcal{Y}$ in the vector space.

A.5.1 Sampling of response observations
Response curves are generated separately for the shape and form scenario as follows: we obtain the true underlying models by fitting original beer and whisky bottles and 3D rotated versions of them, four successively rotated towards the viewer and four away from the viewer, and compute transported residuals $\epsilon_i$ of a total of $N = 360$ bottle outlines $y_1, \dots, y_N$ (20 whisky and 20 beer brands, each from 9 different angles $z_1$). For each simulated dataset, a sample of the desired size $n$ is randomly drawn (with replacement) from the model residuals $\epsilon_1, \dots, \epsilon_N$. To obtain irregular data with an average grid length $\bar k = \frac{1}{n}\sum_{i=1}^n k_i$, we subsample the original evaluations $\epsilon_i(t_{i1}), \dots, \epsilon_i(t_{iK_i})$, with original grid sizes $K_i \geq 123$, in two steps: first, we randomly pick three evaluations as minimal sample size; then, we draw evaluations independently with probability $\frac{\bar k - 3}{K_i - 3}$ to enter the dataset. To preserve the original covariate distribution of the data, covariates are not randomly picked, but we select batches of 9 beer and 9 whisky bottles with $z_1 \in [-60, 60]$ as in the original dataset. Sample sizes $n$ are, therefore, multiples of 18. With the conditional means $[\mu_i]$ determined by the covariates, the evaluated residuals $\epsilon_i$ (on $k_i$ points) are parallel transported to $\varepsilon_{[\mu_i],i} \in T_{[\mu_i]}\mathcal{Y}^*_i/G$, into the tangent space of the true conditional mean, to generate the simulated shape/form dataset $[y_i] = \mathrm{Exp}_{[\mu_i]}(\varepsilon_{[\mu_i],i})$, $i = 1, \dots, n$.
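The two-step subsampling is straightforward to mirror in code (a hedged sketch with illustrative names):

```r
subsample_grid <- function(Ki, kbar) {
  keep <- sample(Ki, 3)                           # minimal sample of three evaluations
  rest <- setdiff(seq_len(Ki), keep)
  keep <- c(keep, rest[runif(length(rest)) < (kbar - 3) / (Ki - 3)])
  sort(keep)
}
set.seed(4)
length(subsample_grid(Ki = 123, kbar = 40))       # ~ 40 on average
```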

A.5.2 Simulation results
In order to systematically and efficiently assess model behavior, we vary key aspects of the model setup and compare fitting performance in selected settings. Here, we list the different aspects and how they are referred to in subsequent graphical visualizations:
• Scenario: shape or form responses.
• Sample size n of curves and mean grid size $\bar k$ that curves are evaluated on.
• Setting: simulations adjusted in an additional aspect compared to a default setup:
equal weight: constant inner product weights $w_{i\iota} = \frac{1}{k_i}$, $\iota = 1, \dots, k_i$, are utilized for curve evaluations $y_i(t_{i1}), \dots, y_i(t_{ik_i})$ instead of trapezoidal rule weights (default).
no nuisance: no constant and smooth nuisance effects $h_0$ and $f_2(z_2)$ are included in the model; by default, they are included.
pre-aligned: this setting concerns the pre-alignment of the curves $y_1, \dots, y_n$ representing the forms/shapes in the simulated data. Note, however, that due to alignment to the pole $p$ in the very beginning of the Riemannian L2-Boosting algorithm, all of this only affects the preliminary pole $p_0$ used for estimation of $p$. In the models fit in the paper, we estimated $p_0$ by using a functional L2-Boosting algorithm (without any alignment), which makes sense for typical data where the curves occur roughly aligned.
Consequently, this aspect translates to a "good or worse starting point $p_0$", which is then replaced by $p$ in the actual model fit. In pre-aligned settings, simulated response curves $\tilde y_i = \mathrm{Exp}_{\mu_i}(\varepsilon_{\mu_i,i})$ are directly used for fitting. In the default, by contrast, the model is fit on random representatives of $[\tilde y_i]$ to mimic realistic scenarios, where $y_i = \lambda u \tilde y_i + \gamma \in [\tilde y_i]$ with $u = \exp(\imath\omega)$, $\omega \sim \mathcal{N}(0, \frac{\pi}{20})$, with $\gamma = \sigma_1\gamma_1 + \sigma_2\gamma_2\,\imath$, $\gamma_1, \gamma_2 \sim \mathcal{N}(0, 1)$, $(\sigma_1^2, \sigma_2^2) = \frac{1}{n\bar k}\sum_{i=1}^n\sum_{\iota=1}^{k_i}\left(\mathrm{Re}(\tilde y_i(t_\iota))^2, \mathrm{Im}(\tilde y_i(t_\iota))^2\right)$, and with $\lambda = 1$ for forms and $\lambda \sim \mathrm{Gamma}(10^2, 10^{-2})$ for shapes.

Finally, an inner product computed on coefficient vectors, after expanding the curves in a common basis $\{b_0^{(r)}\}_r$, presents an alternative empirical substitute for the inner product $\langle y_i, y_i'\rangle$ of curves $y_i, y_i' \in \mathcal{Y}$, computed on the coefficients instead of $\langle \mathbf{y}_i, \mathbf{y}_i'\rangle_i = \mathbf{y}_i^\dagger \mathbf{W}_i \mathbf{y}_i'$ computed on evaluation vectors $\mathbf{y}_i = (y_i(t_{i1}), \dots, y_i(t_{ik_i}))^\top$, $\mathbf{y}_i' = (y_i'(t_{i1}), \dots, y_i'(t_{ik_i}))^\top$ as suggested in Section 2. When, for dense grids, it can be assumed that both variants approximate the inner product on the level of curves well, the approach based on the coefficients may be computationally preferable, guaranteeing regular and typically more sparse representations that necessitate operations on smaller design matrices (in particular when utilizing the linear array framework, Brockhaus et al., 2015). By contrast, in comparably sparse irregular scenarios, expanding single observed $y_i$ in a basis in a first step might involve unwanted pre-smoothing. To give a consistent presentation on the original level of curves, we rely on an evaluation based approach in all applications presented in the main manuscript.
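Drawing such random representatives is simple to sketch (hypothetical function; the N(0, π/20) rotation variance and the Gamma(10², 10⁻²) scale distribution follow the description above):

```r
random_representative <- function(y, sig1, sig2, shape = FALSE) {
  u <- exp(1i * rnorm(1, sd = sqrt(pi / 20)))                  # random rotation
  g <- sig1 * rnorm(1) + sig2 * rnorm(1) * 1i                  # random translation
  l <- if (shape) rgamma(1, shape = 100, scale = 0.01) else 1  # random scale (shapes)
  l * u * y + g
}
```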