Generalized Functional Additive Mixed Models

We propose a comprehensive framework for additive regression models for non-Gaussian functional responses, allowing for multiple (partially) nested or crossed functional random effects with flexible correlation structures for, e.g., spatial, temporal, or longitudinal functional data as well as linear and nonlinear effects of functional and scalar covariates that may vary smoothly over the index of the functional response. Our implementation handles functional responses from any exponential family distribution as well as many others like Beta- or scaled non-central $t$-distributions. Development is motivated by and evaluated on an application to large-scale longitudinal feeding records of pigs. Results in extensive simulation studies as well as replications of two previously published simulation studies for generalized functional mixed models demonstrate the good performance of our proposal. The approach is implemented in well-documented open source software in the pffr() function in the R package refund.


Introduction
Data sets in which measurements consist of curves or images instead of scalars, i.e., functional data, are becoming ever more common in many areas of application. This is due to the increasing affordability and deployment of sensors like accelerometers or spectroscopes, high-throughput imaging technologies and automated logging equipment that continuously records conditions over time. Recent methodological development in this area has been rapid and intense; see Morris [13] for a review of the state of the art for regression models for functional data.
In this work, we extend the general framework for functional additive mixed models for potentially correlated functional Gaussian responses described in Scheipl et al. [19] to non-Gaussian functional responses. The development is motivated by and evaluated on an animal husbandry dataset in which the feeding behavior of growing-finishing pigs was monitored continuously over 3 months [11, 3]. These are non-Gaussian functional data in the sense that the underlying probability of feeding is assumed to be a continuous function over time, while the available data are sequences of binary indicators ("feeding: yes/no") evaluated at the temporal resolution of the sensors. Aggregating these binary indicators over given time intervals, we get time series of counts or proportions for which (truncated) Poisson, Negative Binomial, Binomial or Beta distributional assumptions could be appropriate. Another example of non-Gaussian functional data with large practical relevance would be continuously valued functional responses with heavy-tailed measurement errors, for which a scaled t-distribution could be appropriate.
We briefly summarize the most relevant prior work on regression models for non-Gaussian functional responses. The fundamental work of Hall et al. [6] relates observed functional binary or count data to a latent Gaussian process (GP) through a link function. This is the underlying idea of almost all the works that follow. They differ primarily in 1) how the latent functional processes are represented (i.e., either spline, wavelet or functional principal component (FPC) representations or full GP models), 2) to what extent additional covariate information can be included, 3) whether they allow the modelling of dependencies between and along the functional responses, 4) which distributions are available for the responses, 5) whether the functional data has to be available on a joint regular grid, and 6) in the availability of documented and performant software implementations. In Hall et al. [6], the latent GP is represented in terms of its FPCs and neither correlated responses nor covariates are accommodated. No implementation is publicly available. A Bayesian variant of Hall et al. [6] is provided by van der Linde [25]. Zhu et al. [32] describe a robustified wavelet-based functional mixed model with scalar covariates for error-contaminated continuous functional responses on regular grids. No implementation was publicly available at the time of writing. Serban et al. [20] extend the approach of Hall et al. [6] to multilevel binary data without covariates and provide a rudimentary problem-specific implementation. Wang and Shi [26] describe an empirical Bayesian approach for latent Gaussian process regression models for data from exponential families with scalar and concurrent functional covariates. The available implementation does not accommodate covariate effects and is limited to binary data with curve-specific i. i. d. functional random effects, see Section 4.3 for a systematic comparison with our proposal based on a replication of their simulation study. Li et al. [9] present a model for concurrent binary and continuous functional observations. Association between the two is modeled via cross-correlated (latent) FPC scores and cannot take into account any other covariate effects or dependency structures. A similar approach is described in Tidemann-Miller [24, ch. 3]. Goldsmith et al. [4] develop a fully Bayesian approach which uses latent FPCs in a spline basis representation to represent multilevel functional random effects and linear functional effects of scalar covariates. They provide an implementation for binary outcomes with a logit link function, see Section 4.4 for a systematic comparison with our proposal based on a replication of their simulation study. Gertheiss et al. [3] describe a marginal GEE-type approach for correlated binary functional responses with concurrent functional covariates.
Brockhaus et al. [2] develop a framework for estimating flexible functional regression models via boosting with similar flexibility as ours, implemented in R [15] package FDboost [1]. The package additionally implements functional quantile regression models, which are not included in the framework we present here. Since their implementation is based on a component-wise gradient boosting algorithm, they cannot provide hypothesis tests or resampling-free construction of confidence intervals, however, and also have to rely on computationally intensive resampling methods for hyperparameter tuning.
Compared to previous work, the novel contribution of this work is the development, implementation and evaluation of a comprehensive maximum likelihood-based inferential framework for generalized functional additive mixed models (GFAMMs) for potentially correlated functional responses. Our proposal accommodates diverse latent-scale correlation structures as well as flexible modeling of the conditional mean structure with multiple linear and non-linear effects of both scalar and functional covariates and their interactions. Our proposal is implemented in the full generality described here for both regular grid data and sparse or irregularly observed functional responses in the pffr function in R package refund [7]. Available response distributions include all exponential family distributions as well as Beta, scaled and shifted t-, Negative Binomial, Tweedie and zero-inflated Poisson distributions, each with various link functions, as well as cumulative threshold models for ordered categorical responses. With the exception of Brockhaus et al. [2], none of the previous proposals in this area achieve anything close to this level of generality, not to mention offer publicly available, widely applicable open-source implementations. Since our framework is a natural extension of previous work done on generalized additive mixed models for scalar data and builds on the high-performing, flexible implementations available for them, we can directly make use of many results from this literature, such as improved confidence intervals and tests for smooth effects [10, 29].
The remainder of this work is structured as follows: Section 2 introduces notation and the theoretical and inferential framework for our model class. Section 3 presents results for our application. Section 4 summarizes results of our extensive validation on synthetic data and of our partial replication of the simulation studies of Wang and Shi [26] and Goldsmith et al. [4]. Section 5 concludes.

Model Structure
In what follows, we discuss structured additive regression models of the general form

g(µ_i(t)) = η_i(t) = Σ_{r=1}^{R} f_r(X_ri, t),   (1)

where, for each t, y_i(t), i = 1, . . . , n, is a random variable from some distribution F with conditional expectation E(y_i(t)|X_i, t, ν) = µ_i(t), observed over a domain T, with ν an optional vector of nuisance parameters. We use g(·) to denote the known link function. Our implementation allows analysts to choose F from the exponential family distributions as well as Tweedie, Negative Binomial, Beta, ordered categorical, zero-inflated Poisson and scaled and shifted t-distributions. Note that, in the special case of ordered categorical responses, µ_i(t) is not the conditional mean of the response itself, but that of a latent variable whose value determines the response category. Each term in the additive predictor is a function of a) the index t of the response and b) a subset X_r of the complete covariate set X, potentially including scalar and functional covariates and (partially) nested or crossed grouping factors. Note that the definition also includes functional random effects b_g(t) for a grouping variable g with M levels. These are modeled as realizations of a mean-zero Gaussian random process on {1, . . . , M} × T with a general covariance function K^b(m, m′, t, t′) = Cov(b_gm(t), b_gm′(t′)), where m, m′ denote different levels of g. Table 1 shows how a selection of the most frequently used effect types fit into this framework.
We approximate each term f_r(X_r, t) by a linear combination of basis functions given by the tensor product of marginal bases for X_r and t. Since the basis has to be rich enough to ensure sufficient flexibility, Section 2.4 describes a penalized likelihood approach that stabilizes estimates by suppressing variability of the effects that is not strongly supported by the data and finds a data-driven compromise between goodness of fit and simplicity of the fitted effects.

Table 1: A selection of possible model terms f_r(X_r, t) for model (1). All effects can be constant in t as well.

Data and Notation
In practice, functional responses y_i(t) are observed on a grid of T_i points t_i = (t_i1, . . . , t_iT_i)′, which can be irregular and/or sparse. Let y_il = y_i(t_il) and y_i = (y_i1, . . . , y_iT_i)′. To fit the model, we form y = (y_1′, . . . , y_n′)′ and t = (t_1′, . . . , t_n′)′, two N = Σ_{i=1}^{n} T_i-vectors that contain the concatenated observed responses and their argument values, respectively. Let X_ri contain the observed values of X_r associated with a given y_i. Model (1) can then be expressed as

g(µ_i(t_il)) = Σ_{r=1}^{R} f_r(X_ri, t_il)   (2)

for i = 1, . . . , n and l = 1, . . . , T_i. Let f(t) denote the vector of function evaluations of f for each entry in the vector t and let f(x, t) denote the vector of evaluations of f for each combination of rows in the vectors or matrices x, t. In the following, we let T_i ≡ T to simplify notation, but our approach is equally suited to data on irregular grids. For regular grids, each observed value in any X_r that is constant over t is simply repeated T times to match up with the corresponding entry in the N = nT-vector y.
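This stacking can be sketched numerically. The following Python snippet is purely illustrative (our implementation is in R; all grids and covariate values here are made-up toy data):

```python
import numpy as np

# Hypothetical toy example: n = 3 curves on irregular grids (T_i = 4, 2, 3).
t_grids = [np.array([0.0, 0.1, 0.2, 0.4]),
           np.array([0.1, 0.3]),
           np.array([0.0, 0.2, 0.5])]
y_curves = [np.array([1.0, 0.0, 2.0, 1.0]),
            np.array([0.0, 3.0]),
            np.array([1.0, 1.0, 2.0])]
z = np.array([0.5, -1.2, 2.0])   # one scalar covariate value per curve

# Concatenate responses and argument values into N = sum_i T_i vectors.
y = np.concatenate(y_curves)
t = np.concatenate(t_grids)

# A covariate that is constant over t is repeated T_i times per curve
# so that it lines up with the corresponding entries of y.
T_i = np.array([len(g) for g in t_grids])
z_long = np.repeat(z, T_i)
```

For regular grids all T_i are equal and the repetition reduces to the "⊗ 1_T" stacking used in the next section.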

Tensor product representation of effects
We approximate each term f_r(X_r, t) by a linear combination of basis functions defined on the product of two spaces: one for the covariates in X_r and one over T, where each marginal basis is associated with a corresponding marginal penalty. A very versatile method to construct basis function evaluations on such a joint space is given by the row tensor product of marginal bases evaluated on X_r and t [e.g. 27, ch. 4.1.8]. Let 1_d = (1, . . . , 1)′ denote a d-vector of ones. The row tensor product of an m × a matrix A and an m × b matrix B is defined as the m × ab matrix A □ B = (A ⊗ 1_b′) · (1_a′ ⊗ B), where ⊗ denotes the Kronecker product and · denotes element-wise multiplication. Specifically, for each of the terms,

f_r(X_r, t) ≈ (Φ_xr □ Φ_tr) θ_r,   (3)

where Φ_xr contains the evaluations of a suitable marginal basis for the covariate(s) in X_r and Φ_tr contains the evaluations of a marginal basis in t, with K_xr and K_tr basis functions, respectively. The shape of the function is determined by the vector of coefficients θ_r. A corresponding penalty term can be defined using the Kronecker sum of the marginal penalty matrices P_xr and P_tr associated with each basis [27, ch. 4.1], i.e.
pen(θ_r | λ_xr, λ_tr) = θ_r′ (λ_xr P_xr ⊗ I_{K_tr} + λ_tr I_{K_xr} ⊗ P_tr) θ_r.   (4)

P_xr and P_tr are known and fixed positive semi-definite penalty matrices and λ_tr and λ_xr are positive smoothing parameters controlling the trade-off between goodness of fit and the smoothness of f_r(X_r, t) in X_r and t, respectively. This approach is extremely versatile and powerful as it allows analysts to pick and choose bases and penalties best suited to the problem at hand. Any basis and penalty over T (for example, incorporating monotonicity or periodicity constraints) can be combined with any basis and penalty over X_r. We provide some concrete examples for frequently encountered effects: For a functional intercept β_0(t), Φ_xr = 1_N and P_xr = 0. For a linear functional effect of a scalar covariate zβ(t), Φ_xr = z ⊗ 1_T and P_xr = 0, where z = (z_1, . . . , z_n)′ contains the observed values of a scalar covariate z. For a smooth functional effect of a scalar covariate f(z, t), Φ_xr = [φ_h(z_i)]_{1≤i≤n, 1≤h≤K_xr} ⊗ 1_T and P_xr is the penalty associated with the basis functions φ_h(z).
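As a concrete numerical illustration of the row tensor product and the Kronecker-sum penalty described above, the following Python sketch builds a joint design matrix and penalty from second-order difference penalties. The helper names are hypothetical and the snippet is illustrative only (the actual implementation is in R/mgcv):

```python
import numpy as np

def row_tensor(A, B):
    """Row tensor product A (box) B = (A kron 1_b') * (1_a' kron B):
    row i of the result is the Kronecker product of row i of A and row i of B."""
    (m, a), (m2, b) = A.shape, B.shape
    assert m == m2, "A and B must have the same number of rows"
    return np.kron(A, np.ones((1, b))) * np.kron(np.ones((1, a)), B)

def second_diff_penalty(K):
    """P = D'D with second-order differences, a standard P-spline roughness penalty."""
    D = np.diff(np.eye(K), n=2, axis=0)
    return D.T @ D

def tensor_penalty(P_x, P_t, lam_x, lam_t):
    """Kronecker-sum penalty lam_x * (P_x kron I_Kt) + lam_t * (I_Kx kron P_t)."""
    K_x, K_t = P_x.shape[0], P_t.shape[0]
    return lam_x * np.kron(P_x, np.eye(K_t)) + lam_t * np.kron(np.eye(K_x), P_t)

# Toy marginal design matrices with 2 rows, K_x = 2 and K_t = 3 basis functions:
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0, 7.0],
              [8.0, 9.0, 10.0]])
AB = row_tensor(A, B)  # 2 x 6 joint design matrix

# Joint penalty for K_x = 8 and K_t = 10 marginal B-spline bases:
P = tensor_penalty(second_diff_penalty(8), second_diff_penalty(10), 1.0, 10.0)
```

Each row of the joint design matrix is the Kronecker product of the corresponding marginal rows, and the resulting penalty matrix is symmetric positive semi-definite, as required.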
For a linear effect of a functional covariate ∫_{l(t)}^{u(t)} z(s)β(s, t)ds, Φ_xr = Z W_s Φ_s, where Z contains the evaluations z_i(s_k) of the functional covariate over its grid, W_s contains suitable quadrature weights for the numerical integration and zeroes for combinations of s and t not inside the integration range [l(t), u(t)], and Φ_s = [φ_h(s_k)]_{1≤h≤K_xr} contains the evaluations of basis functions φ_h(s) for the basis expansion of β(s, t) over s. P_xr is again simply the penalty matrix associated with the basis functions φ_h(s). For a functional random slope zb_g(t) for a grouping factor g with K_xr = M levels, Φ_xr = (diag(z)∆_g) ⊗ 1_T, where ∆_g = [δ(g_i = m)]_{1≤i≤n, 1≤m≤M} is an incidence matrix mapping each observation to its group level. P_xr is typically I_M for i. i. d. random effects, but can also be any inverse correlation or covariance matrix defining the expected similarity of levels of g with each other, defined for example via spatial or temporal correlation functions or other similarity measures defined between the levels of g. In general, Φ_tr can be any matrix of basis function evaluations Φ_tr = [φ_k(t)]_{1≤k≤K_tr} over t with a corresponding penalty matrix P_tr. For effects that are constant over t, we simply use Φ_tr = 1_N and P_tr = 0. Note that for data on irregular t-grids, the stacking via "⊗ 1_T" in the expressions above simply has to be replaced by a different suitable repetition pattern. Scheipl et al. [19] discuss tensor product representations of many additional effects available in our implementation.
As for other penalized spline models, fits are usually robust against increases in K tr and K xr once a sufficiently large number of basis functions has been chosen since it is the penalty parameters λ tr , λ xr that control the effective degrees of freedom of the fit, c.f. Ruppert [18], Wood [27,Ch. 4.1.7]. Pya and Wood [14] describe and implement a test procedure to determine sufficiently large basis dimensions that is also applicable to our model class.
Recent results indicate that REML-based inference for this class of penalized regression models can be preferable to GCV-based optimization [16, 28, Section 4]. Wood [30] describes a numerically stable implementation for directly optimizing a Laplace-approximate marginal likelihood of the smoothing parameters λ, which we summarize below. Note that this is equivalent to REML optimization in the Gaussian response case. The idea is to iteratively perform optimization over λ and then estimate basis coefficients θ given λ via standard penalized likelihood methods. Up to additive constants, the criterion is

ℓ_LA(λ, ν|y) = ℓ(θ̂_λ, ν|y) − ½ θ̂_λ′ P_λ θ̂_λ + ½ log|P_λ|_+ − ½ log|H_λ + P_λ|,   (6)

where P_λ denotes the combined penalty matrix implied by λ, θ̂_λ the maximizer of the penalized log-likelihood for the given λ, H_λ the negative Hessian of the log-likelihood at θ̂_λ, and |·|_+ a generalized determinant.
The R package mgcv [27] provides an efficient and numerically stable implementation for the iterative optimization of (6) that 1) finds θ̂ for each candidate λ via penalized iteratively re-weighted least squares, 2) computes the gradient and Hessian of ℓ_LA(λ, ν|y) w.r.t. log(λ) via implicit differentiation, 3) updates log(λ) via Newton's method and 4) estimates ν either as part of the maximization for λ or based on the Pearson statistic of the fitted model. See Wood [28, 30], Wood et al. [31] for technical details and numerical considerations.
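Step 1) of this loop, penalized iteratively re-weighted least squares for a fixed smoothing parameter, can be sketched as follows. This Python snippet is a schematic illustration only, not mgcv's implementation, and it omits the Newton update of log(λ) entirely:

```python
import numpy as np

def pirls_binomial(X, y, P, lam, trials=1, n_iter=100, tol=1e-8):
    """Penalized IRLS for a binomial logit model with FIXED smoothing
    parameter lam (illustrative sketch; mgcv additionally optimizes lam
    on the Laplace-approximate marginal likelihood)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ theta
        mu = trials / (1.0 + np.exp(-eta))               # conditional mean
        w = np.maximum(mu * (1.0 - mu / trials), 1e-10)  # IRLS weights
        z = eta + (y - mu) / w                           # working response
        XtW = X.T * w
        # Penalized normal equations: (X'WX + lam * P) theta = X'Wz
        theta_new = np.linalg.solve(XtW @ X + lam * P, XtW @ z)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```

With lam close to zero and an identity penalty, this reduces to ordinary maximum likelihood for a logistic regression, which provides a simple sanity check of the update equations.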
Confidence intervals [10] and tests [29] for the estimated effects are based on the asymptotic normality of ML estimates. Confidence intervals use the variance of the distance between the estimators and true effects. Similar to the idea behind the MSE, this accounts for both the variance and the penalization induced bias (shrinkage effect) of the spline estimators.
We provide an implementation of the full generality of the proposed models in the pffr() function in the R package refund [7] and use the mgcv [28] implementation of (scalar) generalized additive mixed models as computational backend. The pffr() function provides a formula-based wrapper for the definition of a broad variety of functional effects as in Section 2.3 as well as convenience functions for summarizing and visualizing model fits and generating predictions.

Data
We use data on pigs' feeding behavior collected in the ICT-AGRI era-net project "PIGWISE", funded by the European Union. In this project, high frequency radio frequency identification (HF RFID) antennas were installed above the troughs of pig barns to register the feeding times of pigs equipped with passive RFID tags [11]. We are interested in modelling the binary functional data that arises from measuring the proximity of the pig to the trough (yes-no) every 10 seconds over the 102 days available for each pig. The raw data for one such pig, called 'pig 57', whose behavior we analyse in depth, is shown in Figure B.1 in Appendix B. Such models of (a proxy of) individual feeding behavior can be useful for ethology research as well as for monitoring individual pigs' health status and/or the quality of the available feed stock, c.f. Gertheiss et al. [3]. Available covariates include the time of day, the age of the pig (i.e., the day of observation), as well as the barn's temperature and humidity.

Model
Due to pronounced differences in feeding behavior between individual pigs [c.f. 3, Figure 4], we focus on pig-wise models and model the observed feeding rate for a single pig for each day i = 1, . . . , 102 as a smooth function over the daytime t. For our model, we aggregate the originally observed binary feeding indicators, which are measured every 10 seconds, over 10-minute intervals into counts y_i(t). This temporal resolution is sufficient for the purposes of our modeling effort. We then model

y_i(t) | X_i ∼ Binomial(60, π_i(t)),   logit(π_i(t)) = Σ_{r=1}^{R} f_r(X_ri, t).

As is typical for applied problems such as this, many different distributional assumptions for y_i(t)|X_i and specifications of Σ_{r=1}^{R} f_r(X_ri, t) are conceivable.
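The aggregation step can be sketched as follows; this Python snippet uses simulated indicators purely for illustration, not the actual PIGWISE data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical raw day of data: one binary feeding indicator every 10 s,
# i.e. 8640 indicators per day (real indicators come from the RFID antennas).
raw = rng.binomial(1, 0.1, size=8640)

# Aggregating over 10-minute intervals (60 ten-second slots each) yields
# 144 counts out of 60 per day, the counts entering the Binomial model.
counts = raw.reshape(144, 60).sum(axis=1)
rate = counts / 60.0  # observed feeding rate per 10-minute interval
```

The reshape relies on the slots being stored in temporal order, so each row of the reshaped array covers one contiguous 10-minute interval.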
In this case, both Beta and Negative Binomial distributions yielded less plausible and more unstable fits than the Binomial distribution. With respect to the additive predictor, the effect of temperature or humidity (i.e., X_ri = hum_i(t)), for example, could be modeled as a linear functional effect ∫_{l(t)}^{u(t)} hum_i(s)β(s, t)ds, which represents the cumulative effect of humidity exposure over the time window l(t) ≤ s ≤ u(t), or as the concurrent effect of humidity, either linearly via hum_i(t)β_h(t) or non-linearly via f(hum_i(t), t). Auto-regressive and lagged effects of previous feeding behavior (i.e., X_ri = y_i(t)) could be modeled similarly, either as the cumulative effect of previous feeding over a certain time window or as concurrent effects of the lagged responses with a pre-specified time lag δ. Finally, our implementation also offers diverse possibilities for accounting for aging effects and day-to-day variations in behavior (i.e., X_ri = i). Aging effects that result in a gradual change of daily feeding rates over time could be represented as iβ(t) for a gradual linear change or as f(i, t) for a nonlinear smooth change over days i. Less systematic day-to-day variations can be modeled as daily functional random intercepts b_i(t), potentially auto-correlated over i to encourage similar shapes for neighboring days. In this application, the performances of most of these models on the validation set were fairly similar and the subsequent section presents results for a methodologically interesting model that was among the most accurate models on the validation set.

Results
In what follows, we present detailed results for an auto-regressive binomial logit model with smoothly varying day effects for pig 57 (see Figure B.1, Appendix B), whose additive predictor consists of a functional intercept, a smoothly varying day effect f(i, t) and a cumulative auto-regressive effect of previous feeding. This model assumes that feeding behavior during the previous 3 hours affects the current feeding rate. We use periodic P-spline bases over t to enforce similar f_r(X_ri, t) for t = 00:00h and t = 23:59h for all r. The term f(i, t) can be interpreted as a non-linear aging effect on the feeding rates or as a functional random day effect whose shapes vary smoothly across days i. This model contains dim(λ) = 5 smoothing parameters (1 for the functional intercept, 2 each for the daily and auto-regressive effects), dim(θ) = 106 spline coefficients (24 for the intercept, 25 for the auto-regressive effect, 56 for the day effect) and N = nT = 9648 observations in the training data. We leave out every third day to serve as external validation data for evaluating the generalizability of the model. The fit takes about 1 minute on a modern desktop PC. Appendix B contains detailed results for an alternative model specification; Section 4.1 describes results for synthetic data with similar structure and size.
Beside giving insight into pigs' feeding patterns and how these patterns change as the pigs age, the model reported here enables short-term predictions of feeding probabilities for a given pig based on its previous feeding behavior. This could be very helpful when using the RFID system for surveillance and early identification of pigs showing unusual feeding behavior. Unusual feeding behavior is then indicated by model predictions that are consistently wrong; i.e., the pig is not behaving as expected. Such discrepancies can then indicate problems such as disease or low-quality feed stock. For the auto-regressive model discussed here, only very short-term predictions 10 minutes into the future can be generated easily, as y_i(t − 10min) is required as a covariate value. Other model formulations without such auto-regressive terms or with larger lead times of the auto-regressive terms will allow more long-term forecasting. More long-term forecasts for auto-regressive models could also be achieved by using predictions ŷ_i(t) instead of observed values of y_i(t) as inputs for the forecast. These can be generated sequentially for time points farther and farther in the future, but errors and uncertainties will accumulate correspondingly. A detailed analysis of this issue is beyond the scope of this paper; Tashman [22] gives an overview. In addition to the lagged response values, the aging effect f(i, t) is also needed.
Here, we could either use the value from the preceding day i − 1, if it can be assumed that the pig's (expected) behavior does not change substantially from one day to the next; or do some extrapolation. Figure 1 shows fitted and observed values for 12 selected days from the training data (top) as well as estimated and observed values for 12 selected days from the validation data (bottom). The model is able to reproduce many of the feeding episodes, in the sense that peaks in the estimated probability curves mostly line up well with observed spikes of y i (t). This model explains about 24% of the deviance and, taking the mean over all days and time-points, achieves a Brier score of about 0.024 on both the training data and the validation data, i.e., we see no evidence for overfitting with this model specification. This model performs somewhat better on the validation data than an alternative model specification with i. i. d. random functional day effects b i (t), c.f. Appendix B, Figure B.2. Figure 2 shows the estimated components of the additive predictor for the model with smoothly varying day effects. The functional intercept in the left top panel of Figure 2 shows two clear peaks of the feeding rate around 10h and 18h, and very low feeding activity from 19h to 23h and from 2h to 6h. A two-peak profile has been identified previously as the typical feeding behavior of pigs [12,5]. Our analysis exploiting the rich information provided by the new RFID antennas used [11], however, gives pig-specific results, whereas previous studies typically reported on group characteristics; see, e.g., Hyun et al. [8] and Guillemet et al. [5]. Furthermore, existing results usually aggregated over time periods.
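The Brier score used here is simply the mean squared difference between predicted probabilities and observed outcomes, averaged over all days and time points. A minimal Python sketch (illustrative toy values, not our data):

```python
import numpy as np

def brier_score(y, p):
    """Mean squared difference between predicted feeding probabilities and
    observed outcomes (binary indicators or observed rates)."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((y - p) ** 2))

y_obs = np.array([0.0, 1.0, 1.0, 0.0])
brier_score(y_obs, y_obs)            # perfect forecasts give 0
brier_score(y_obs, np.full(4, 0.5))  # an uninformative constant forecast
```

Lower values are better; a constant forecast of 0.5 yields 0.25 for binary outcomes, so scores near 0.024 indicate forecasts far more informative than chance for data with low baseline feeding rates.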
The estimated smoothly varying day effect (top right panel) shows increased feeding rates in the early morning in the first few days and a corresponding strong reduction in feeding rates in the early morning towards the end of the fattening period, as well as a tendency for increased feeding activity to concentrate more strongly in two periods around 9h and 21h towards the end of the observation period, a pattern also visible in the raw data shown in Figure B.1 (Appendix B). The cumulative auto-regressive effect of feeding during the previous 3 hours is displayed in the bottom row of Figure 2. Unsurprisingly, feeding behavior in the immediate past is associated positively with current feeding (c.f. the blue region for s − t ≈ 0), especially during off-peak times in the early morning where a higher propensity for feeding is not already modeled by the global functional intercept. The model also finds a negative association between prior feeding during the night and feeding in the very early hours of the morning that takes place 1 to 3 hours later (c.f. the red region around t = 0). Confidence intervals for all three effects show that they can be estimated fairly precisely.

Simulation Study
This section describes results on binomial (Section 4.1) and Beta-, t- and negative binomial-distributed data (Section 4.2). Sections 4.3 and 4.4 present replications of (parts of) the simulation studies on functional random intercept models in Wang and Shi [26] and Goldsmith et al. [4], respectively. Note that neither of these competing approaches is able to fit nonlinear effects of scalar covariates, effects of functional covariates, functional random slopes or correlated functional random effects in their current implementations, while pffr does. Both approaches are implemented only for binary data, not the broad class of distributions available for pffr-models. All results presented in this section can be reproduced with the code supplement available for this work at http://stat.uni-muenchen.de/~scheipl/downloads/gfamm-code.zip.
To evaluate results on artificial data, we use the relative root integrated mean square error rRIMSE of the estimates evaluated on equidistant grid points over T:

rRIMSE(f̂(X_r, t)) = sqrt( (1/n) Σ_{i=1}^{n} [ (1/T) Σ_{l=1}^{T} (f̂(X_ri, t_l) − f(X_ri, t_l))² ] / sd(f(X_ri, t))² ),

where sd(f(X_ri, t)) is the empirical standard deviation of [f(X_ri, t_l)]_{l=1,...,T}, i.e., the variability of the estimand values over t. We use this relative RIMSE rather than the unscaled root mean squared error in order to make results more comparable between estimands with different numerical ranges and to give a more intuitively accessible measure of the estimation error, i.e., the relative size of the estimation error compared to the variability of the estimand across T.
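A minimal Python sketch of this error measure (one plausible reading of the definition; our actual computations were done in R):

```python
import numpy as np

def rrimse(f_hat, f_true):
    """Relative RIMSE: root mean squared error over the t-grid, scaled by
    the empirical standard deviation of the estimand over t, averaged over
    curves."""
    f_hat = np.atleast_2d(np.asarray(f_hat, dtype=float))
    f_true = np.atleast_2d(np.asarray(f_true, dtype=float))
    mse = np.mean((f_hat - f_true) ** 2, axis=1)  # per-curve MSE over t
    sd = np.std(f_true, axis=1, ddof=1)           # variability over t
    return float(np.sqrt(np.mean(mse / sd ** 2)))
```

By construction, a constant estimation error of size c on a single curve yields rRIMSE = c / sd(f), so values around 0.2 mean the typical error is about a fifth of the estimand's variability over t.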

Synthetic Data: Binomial Additive Mixed Model
We describe an extensive simulation study to evaluate the quality of estimation for various potential model specifications for the PIGWISE data presented in Section 3.1. We simulate Binomial data y_i(t) ∼ B(trials, π = logit⁻¹(η_i(t))) with trials = 10, 60, 200 for n = 100 observations on T = 150 equidistant grid points over [0, 1] for a variety of true additive predictors η_i(t) = Σ_{r=1}^{R} f_r(X_ri, t). The subsequent paragraph describes the various f_r(X_ri, t) we investigate.
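The response-generating step can be sketched as follows; this Python snippet uses a hypothetical additive predictor for illustration only, not one of the settings investigated in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

n, T, trials = 100, 150, 60
t = np.linspace(0.0, 1.0, T)

# Hypothetical additive predictor: a functional intercept plus a linear
# effect of a scalar covariate z that varies smoothly over t.
z = rng.normal(size=n)
eta = -1.0 + np.sin(2 * np.pi * t) + z[:, None] * np.cos(2 * np.pi * t)

pi = 1.0 / (1.0 + np.exp(-eta))  # logit^{-1}(eta)
y = rng.binomial(trials, pi)     # n x T matrix of binomial counts
```

Varying `trials` changes how informative each observation is about the latent probability surface, which is the mechanism behind the performance trends reported below.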
The additive predictor always includes a functional intercept β_0(t). For modelling day-to-day differences in feeding rates, the additive predictor includes an i. i. d. functional random intercept b_i(t) (setting: ri) or a smoothly varying day effect f(i, t) (setting: day). Note that the "true" functional random intercepts were generated with Laplace-distributed spline coefficients in order to mimic the spiky nature of the application data. For modeling the auto-regressive effect of past feeding behavior, the additive predictor can include, among others, a cumulative linear term over the previous 0.3 time units, ∫_{W(t)} ỹ_i(s)β(t, s)ds, where ỹ_i(s) = y_i(s) − ȳ(s) denotes the centered responses and ȳ(s) the mean response function. The simulated data also contains humidity and temperature profiles generated by drawing random FPC score vectors from the respective empirical FPC score distributions estimated from the (centered) humidity and temperature data provided in the PIGWISE data. For modeling the effect of these functional covariates, the additive predictor includes either a nonlinear concurrent interaction effect f(hum_i(t), temp_i(t), t) (ht) or a nonlinear time-constant concurrent interaction effect f(hum_i(t), temp_i(t)) (ht.c). We simulate data based on additive predictors containing each of the terms described above on its own, as well as additive predictors containing either functional random intercepts (ri) or a smoothly varying day effect (day) and one of the remaining terms, for a total of 21 different combinations. For all of these settings, we vary the amplitude of the global functional intercept between settings with smaller and larger baseline feeding rates. For the settings with just a single additional term we also varied the absolute effect size of these terms: for small effect sizes, the respective exp(f_r(X_ri, t)) ∈ [0.5, 2] for all t; for intermediate effect sizes it lies in [0.2, 5]; and for large effect sizes in [0.1, 10].
We ran 30 replicates for each of the resulting 333 combinations of simulation parameters, for a total of 9990 fitted models. Since the settings' effects are large and systematic and we use a fully crossed experimental design, 30 replicates are more than enough to draw reliable qualitative conclusions from the results here.
We focus on the results for trials = 60 with intermediate amplitude of the global intercept as these settings are most similar to the application. Results for the other settings are qualitatively similar, with the expected trends: performance improves for more informative data (i.e., more "trials") and larger effect sizes and worsens for less informative data with fewer "trials" and smaller effect sizes. Figure 3 shows rRIMSE(η(t)) (left panel) and average point-wise coverages (right panel) for the various settings; it is easy to see that the concurrent functional effects (humtemp, humtemp.c) and especially the daily random intercepts (ri) are the hardest to estimate. Further study shows that concurrent interaction effects are very sensitive to the rank of the functional covariates involved: both humidity and temperature are practically constant in this data and show little variability, so effects are hard to estimate reliably in this special case due to a lack of variability in the covariate. The daily random intercepts are generated from a Laplace distribution of the coefficients to yield "peaky" functions as seen in the application, not the Gaussian distribution our model assumes. Figure A.3 in Appendix A.1 displays a typical fit for the setting with functional random intercepts and a cumulative auto-regressive effect (ri+ff.6) and shows that rRIMSEs around 0.2 correspond to useful and sensible fits in this setting. Also note that performance improves dramatically if the Gaussianity assumption for the random intercept functions is met, c.f. the results in Sections 4.3 and 4.4 for Gaussian random effects in some much simpler settings.
Note that we are trying to estimate a latent smooth function for each day based only on the feeding episodes for that day for ri and, through smoothing, based on each day and its neighboring days in the case of day. The results show that this strategy can work rather well, albeit with a high computational cost in the case of ri: On our hardware, models with such functional random intercepts usually take between 5 minutes and 2 hours to fit, while models with a smooth day effect usually take between 15 seconds and 15 minutes, depending on which covariate effects are present in the model. See Figure A.1 in Appendix A.1 for details on computation times.
As seen in the replication studies in Sections 4.3 and 4.4, estimating random intercepts with our approach also works as well as or better than competing approaches in more conventional settings with multiple replicates per group level, rather than the single curve per group level considered here. The results here still represent quite successful fits: Figure 4 shows results for a typical day+ff.6 model that yields about the median RIMSE for that setting. Most estimates are very close to the "truth".
In terms of CI coverage, we find that CIs for η̂_i(t) (Figure 3, right panel) are usually close to their nominal 95% level.
Appendix A.1 shows typical fits for some of the remaining settings. In summary, the proposed approach yields useful point estimates for most terms under consideration and confidence intervals with coverages consistently close to the nominal level on data with a structure similar to that of the application.
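To fix ideas, a relative root integrated mean squared error of the kind reported above can be sketched as follows. This is only an illustration on an equidistant grid, and the exact normalization used for rRIMSE in the paper is not reproduced here, so the denominator below is an assumption:

```python
import numpy as np

def rimse(f_hat, f_true):
    """Root integrated MSE; on an equidistant grid the integral is
    proportional to the mean over grid points."""
    return np.sqrt(np.mean((f_hat - f_true) ** 2))

def rrimse(f_hat, f_true):
    """RIMSE relative to the root mean square of the true function
    (assumed normalization, for illustration only)."""
    return rimse(f_hat, f_true) / np.sqrt(np.mean(f_true ** 2))

t = np.linspace(0, 1, 60)
f_true = np.sin(2 * np.pi * t)
f_hat = f_true + 0.1 * np.cos(2 * np.pi * t)   # estimate with small error
```

Here rrimse(f_hat, f_true) is close to 0.1, i.e., the estimation error is about 10% of the size of the true function.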

Data Generating Process
We generate data with scaled t(3)-distributed errors, Beta(α, β)-distributed responses, and Negative Binomial NB(µ, θ)-distributed responses, with n = 100, 300 observations each and signal-to-noise ratios SNR = 1, 5. The SNR is not easily adjustable for Negative Binomial data since variance and mean cannot be specified separately; in our simulation, we use g(·) = log(·) with θ = 0.5, so that Var(y(t)) = E(y(t)) + 2 E(y(t))². See Appendix A.2 for details. Functional responses are evaluated on T = 60 equidistant grid-points over [0, 1]. We investigate performance for 4 different additive predictors:

• int: E(y(t)) = g⁻¹(β_0(t))
• smoo: index-varying non-linear effect: E(y(t)) = g⁻¹(β_0(t) + f(x, t))
• te: non-linear interaction: E(y(t)) = g⁻¹(β_0(t) + f(x_1, x_2)) via tensor product spline
• ff: functional covariate: E(y(t)) = g⁻¹(β_0(t) + ∫ x(s)β(t, s)ds)

50 replicates were run for each of the 32 settings for Beta and t(3) and each of the 16 settings for NB.

In the intercept-only setting, the point estimates are so close to the true effects and the estimation uncertainty becomes so small that the CIs almost vanish; in any case, observed coverages mostly remain above 90% for a nominal level of 95%. Figure 6 shows relative RIMSEs (left) and coverages (right) for the estimated effects in the three settings that include more than an intercept. Again, we observe the expected patterns of increasing precision with bigger data sets and decreasing noise levels. Estimates are generally fairly precise and most coverages are close to or greater than the nominal 95% level. Systematic under-coverage occurs for the scalar interaction effect (middle column) for the Beta and t(3) distributions, but given the high precision of the point estimates, the fact that the CIs tend to be too narrow is of little practical relevance. It could also indicate a basis specification too small to fit the true effect without approximation bias.
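To make the Negative Binomial setting concrete, here is a minimal sketch of such a data generating process in Python; the shapes of β_0(t) and f(x, t) below are arbitrary placeholders, not the functions used in the study:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, theta = 300, 60, 0.5          # theta = 0.5 gives Var(y) = mu + 2*mu^2
t = np.linspace(0, 1, T)

# placeholder additive predictor in the style of the "smoo" setting:
# eta_i(t) = beta_0(t) + f(x_i, t), both chosen arbitrarily here
x = rng.uniform(size=n)
beta0 = np.sin(2 * np.pi * t)
eta = beta0[None, :] + np.outer(x, np.cos(np.pi * t))

mu = np.exp(eta)                    # log link: E(y(t)) = g^-1(eta) = exp(eta)
# NB(mu, theta) via the Gamma-Poisson mixture:
# y | lambda ~ Poisson(lambda), lambda ~ Gamma(shape=theta, scale=mu/theta)
lam = rng.gamma(shape=theta, scale=mu / theta)
y = rng.poisson(lam)
```

With θ = 0.5 the Gamma-Poisson mixture used here reproduces the stated variance function Var(y(t)) = µ(t) + 2µ(t)².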
We also observed a single replicate of setting smoo with t(3)-distributed errors with much larger rRIMSE and coverage around 20% (bottom row, leftmost panels, rightmost boxplots), indicating that robustification by specifying a t-distribution may not always be successful.

Results
Median computation times for these fairly easy settings were between 3 and 49 seconds. Estimating the bivariate non-linear effect for setting smoo took the longest, with some fits taking up to 8 minutes. We observed no relevant or systematic differences in computation times between the three response distributions we used.
Appendix A.2 shows some typical results for these settings and gives additional details on the data generating process as well as tabular representations of the performance measures in Figures 5 and 6.

Replication of Simulation Study in Wang and Shi [26]
This section describes our results of a replication of the simulation study described in Section 4.1 of Wang and Shi [26]. We ran 20 replicates per setting. The data comes from a logit model for binary data with just a functional intercept and observation-specific smooth functional random effects. The observation-specific smooth functional random effects b_i(t) are realizations of a Gaussian process with "squared exponential" covariance structure, and the functional intercept β_0(t) is a cubed sine function; see Wang and Shi [26, their Section 4.1] for details. The model and data generating process are thus P(y_i(t_l) = 1) = logit⁻¹(η_i(t_l)) with η_i(t_l) = β_0(t_l) + b_i(t_l); i = 1, …, n; l = 1, …, T.
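This data generating process can be sketched as follows; the kernel hyperparameters (length-scale, variance) are placeholder assumptions, as Wang and Shi's exact values are not repeated here:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 30, 21
t = np.linspace(0, 1, T)

# squared-exponential covariance; length-scale and variance are
# placeholder assumptions, not the values used by Wang & Shi
ell, sig2 = 0.2, 1.0
K = sig2 * np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ell**2)
b = rng.multivariate_normal(np.zeros(T), K + 1e-8 * np.eye(T), size=n)

beta0 = np.sin(2 * np.pi * t) ** 3     # cubed-sine functional intercept
eta = beta0[None, :] + b               # eta_i(t_l) = beta_0(t_l) + b_i(t_l)
p = 1 / (1 + np.exp(-eta))             # logit^-1
y = rng.binomial(1, p)                 # binary functional responses
```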
Note that Wang & Shi's MATLAB [23] implementation of their proposal (denoted by ggpfr) uses the same "squared exponential" covariance structure for fitting as that used for generating the data, while our pffr implementation uses a B-spline basis (8 basis functions per subject) to represent the b_i(t). We used the same basis dimension for β_0(t) for ggpfr as Wang & Shi.
Boxplots in Figure 7 show RIMSEs for the estimated effects and the additive predictor for each of the 20 replicates per combination of settings. While ggpfr tends to yield more precise estimates for b_i(t), it often does not do as well as pffr in estimating the functional intercept. As a consequence, the two approaches yield very similar errors for η_i(t), even though the data generating process is tailored towards ggpfr and more adversarial for pffr. Note that we were not able to reproduce the mean RIMSE values for logit(η(t)) reported by Wang and Shi [26] despite the authors' generous support: our mean RIMSE results for ggpfr were 0.37, 0.33, 0.30, while Wang and Shi [26] reported 0.32, 0.26, 0.24 for n = 60; T = 21, 41, 61, respectively. Figure 8 shows observed CI coverage for nominal 90% CIs for the two approaches evaluated over η̂_i(t). Both approaches mostly do not achieve nominal coverage for these very small data sets, but coverages for pffr converge towards the nominal level much faster as n and T increase than those for ggpfr, and show fewer and smaller incidences of severe under-coverage. Figure 9 shows computation times for the two approaches. The median computation time for ggpfr is about three times that of pffr for T = 21, which increases to a factor of 13-19 for n = 60; T = 41, 61 and a factor of 40-48 for n = 30; T = 41, 61. Note that computation times for ggpfr are given for a single run using a pre-specified number of basis functions for estimating β_0(t). Since the BIC-based selection of the basis dimension proposed by Wang and Shi [26] requires k runs for selecting between k candidate numbers of basis functions, actual computation times for ggpfr in practical applications will increase roughly k-fold.
Our replication shows that pffr achieves mostly similar estimation accuracy with better CI coverage for a data generating process tailored specifically to ggpfr's strengths. Depending on the MATLAB version used, pffr does so in similar or substantially shorter time.

Replication of Simulation Study in Goldsmith et al. [4]
This section describes our results of a replication of the simulation study described in Web Appendix 2 of Goldsmith et al. [4]. We ran 20 replicates per setting. The data comes from a logit model for binary data with a functional intercept, a functional linear effect of a scalar covariate, and observation-specific smooth functional random effects. The observation-specific smooth functional random effects b_i(t) are drawn using 2 trigonometric functional principal component functions, the functional intercept β_0(t) is a trigonometric function as well, and the functional coefficient function β_1(t) associated with a N(0, 25)-distributed scalar covariate x is a scaled normal density function; see Goldsmith et al. [4, Web Appendix, Section 2] for details. Note that genfpca, the rstan [21] implementation provided by Goldsmith et al., uses the same number of FPCs (i.e., two) for the fit as used for generating the data, while our pffr implementation uses 10 cubic B-spline basis functions to represent each b_i(t). The model and data generating process are thus P(y_i(t_l) = 1) = logit⁻¹(η_i(t_l)) with η_i(t_l) = β_0(t_l) + β_1(t_l)x_i + b_i(t_l); i = 1, …, n; l = 1, …, T. Figure 10 shows RIMSEs for (top to bottom) the estimated functional intercept, the estimated functional coefficient, the functional random effect and the combined additive predictor. pffr yields somewhat more precise estimates for β_0(t) in this setting, while the estimation performance of the two approaches for the functional coefficient and the random effects is broadly similar, although the results for pffr for β_1(t) show more variability. pffr yields slightly better estimates of the total additive predictor η_i(t) in this setting. Note that Goldsmith et al. [4] report the mean integrated square error (MISE, in their notation) while we report the RIMSE. Taking this into account, the genfpca results reported here correspond very closely to theirs.
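A minimal sketch of the data generating process described above; the specific shapes of β_0(t), β_1(t) and the FPC score variances are placeholder assumptions, not Goldsmith et al.'s exact choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 100, 50
t = np.linspace(0, 1, T)

# two trigonometric FPC basis functions and their score standard
# deviations (shapes and variances are placeholder assumptions)
psi = np.stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                np.sqrt(2) * np.cos(2 * np.pi * t)])
scores = rng.normal(scale=[2.0, 1.0], size=(n, 2))
b = scores @ psi                                  # random effects b_i(t)

beta0 = np.sin(np.pi * t)                          # placeholder intercept
beta1 = np.exp(-0.5 * ((t - 0.5) / 0.1) ** 2)      # scaled normal density
x = rng.normal(scale=5.0, size=n)                  # x ~ N(0, 25)
eta = beta0[None, :] + np.outer(x, beta1) + b
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))        # binary responses
```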
Figure 11 shows observed confidence [credibility] interval (CI) coverage for nominal 95% CIs for the two approaches evaluated over η̂_i(t). Both approaches mostly achieve close to nominal coverage in this setting, but while coverages for pffr are mostly greater than the nominal 95% for T = 50, with some rare problematic fits with coverages below 75%, genfpca shows small but systematic under-coverage for T = 100. Figure 12 shows computation times for the two approaches. The median computation time for genfpca is about 4 times that of pffr for n = 50, and 0.80 to 0.94 times that of pffr for n = 100.
In summary, our replication shows that pffr achieves mostly similar estimation accuracy and coverage in roughly the same or much shorter time for a data generating process tailored specifically to genfpca's strengths.

Summary of Simulation results
We achieve useful and mostly reliable estimates in acceptable time for many complex settings in data structured like the PIGWISE data. Simulation results for less challenging settings and for response distributions other than the Binomial are good to excellent as well. Our partial replications of the simulation studies in Wang and Shi [26] and Goldsmith et al. [4] in Sections 4.3 and 4.4, respectively, show that our implementation achieves highly competitive results in similar or much shorter computation time, even though the settings are tailored towards the competing earlier approaches, which are also considerably less flexible than our approach.

Discussion
This work introduces a comprehensive framework for generalized additive mixed models (GAMM) for non-Gaussian functional responses. Our implementation extends all the flexibility of GAMMs for dependent scalar responses to dependent functional responses and functional covariates, even for response distributions outside the exponential family such as robust models based on the t-distribution. Dependency structures can be spatial, temporal or hierarchical. Simulation and application results show that our approach provides reliable and precise inference about the underlying latent processes. The work presented here opens up promising new avenues of inquiry. One challenge that we have already begun to work on is to improve the speed and memory efficiency of the underlying computational engine in order to fit the huge data sets increasingly common in functional data analysis. Along these lines, more efficient computation of simultaneous instead of pointwise confidence intervals for functional effects, as well as a more detailed investigation of the performance of the available approximate pointwise confidence intervals for noisy and small non-Gaussian data, is another important field of inquiry. An extension of the approach presented here to models with multiple additive predictors controlling different aspects of the responses' distribution - like the generalized additive models for location, scale and shape introduced for scalar responses by Rigby and Stasinopoulos [17], or zero-inflation and hurdle models - is yet another promising generalization of the ideas we have presented here.

Appendix A: Additional Simulation Details
Section A.2 provides some more details on the data generating process used in Section 4.2 and shows some typical effect estimates in these settings.

A.1. Simulation 1: Computation times, Details and Examples
A.1.1. Computation times

Figure A.1 shows computation times for the various settings. Note that the times are plotted on binary log-scale. One fit for a random intercept model (ri) did not converge within 200 iterations and ran for more than 16 hours.

The coefficient surfaces for ff.3 and ff.6 are estimated with 30 bivariate cubic thin plate splines over T × T. The functional intercept (int) uses 40 cubic cyclic P-splines with first order difference penalty over T, while the functional random intercepts (ri) use 9 of those per curve. For all other terms, we use 8 cubic P-splines with first order difference penalty for Φ_t. Figures A.2 to A.4 show graphical summaries of typical model fits for three difficult settings of the simulated binomial data discussed in Section 4.1. It is easy to see that estimating unstructured daily functional random intercepts is a harder task than estimating a smoothly varying aging effect (compare the accuracy of f̂(t, i) in Figures A.2 and A.4 to that of b̂_i(t) in Figure A.3), and that concurrent functional non-linear interaction effects cannot be recovered very reliably in this setting. Also note that we can model very rough latent response processes µ_i(t) for these data by including, e.g., a (non-linear) auto-regressive term as in the top left panel of Figure A.

A.2.1. Data Generating Process for Simulation Study 2
We use different signal-to-noise ratios (SNR) to control the difficulty of the simulation settings. For Beta-distributed responses with response function g⁻¹(·) = logit⁻¹(·), the distribution parameters α, β are determined by the approximate SNR via φ = SNR · avg(µ(1 − µ))/v_µ − 1, where avg(µ(1 − µ)) is the sample mean of µ_it(1 − µ_it), v_µ is the sample variance of the simulated µ_it, and µ = logit⁻¹(η). Since Var(y | µ) = µ(1 − µ)/(φ + 1), this choice of φ makes the ratio of the signal variance v_µ to the average conditional variance approximately equal to the SNR. Beta parameters for generating y are then given by α = φµ and β = φ − α. η is first scaled linearly to the interval [−1.5, 1.5].
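A sketch of this calibration in Python; note that the way φ is backed out from the SNR below (treating SNR as the ratio of v_µ to the average conditional variance µ(1 − µ)/(φ + 1)) is our reading of the construction, not code from the paper, and the additive predictor η is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(4)
SNR = 5.0
eta = rng.normal(size=(100, 60))                 # placeholder predictor
# scale eta linearly to [-1.5, 1.5] as described in the text
eta = -1.5 + 3 * (eta - eta.min()) / (eta.max() - eta.min())
mu = 1 / (1 + np.exp(-eta))                      # mu = logit^-1(eta)

# back out phi so that v_mu / mean conditional variance ~ SNR,
# using Var(y | mu) = mu (1 - mu) / (phi + 1) -- an assumption on
# how the "approximate SNR" is defined, not a quote from the paper
v_mu = mu.var()
phi = SNR * (mu * (1 - mu)).mean() / v_mu - 1

alpha = phi * mu                                 # alpha = phi * mu
beta = phi - alpha                               # beta  = phi - alpha
y = rng.beta(alpha, beta)                        # E(y) = mu
```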

A.2.2. Computational details
We use the following spline bases to construct Φ_x for the different effects:

• smoo: 8 cubic thin plate splines over the range of x
• te: 45 bivariate cubic thin plate splines over the joint space of x_1 and x_2
• ff: 5 cubic P-splines with first order difference penalty over S

The functional intercept (int) uses 40 cubic cyclic P-splines with first order difference penalty over T. For all other terms, we use 5 cubic P-splines with first order difference penalty for Φ_t.

A.2.3. Typical Fits for Simulation Study 2
Figures A.5 to A.7 show some typical fits for these data generating processes. Tables 2 and 3 show median rRIMSEs and coverages for simulation study 4.2 for the entire additive predictor and the estimated covariate effects, respectively.

A.3. Computational Details
Code to reproduce the results in Section 4 is included in a code supplement.

Beyond the smoothly varying functional day effects used in the analysis in Section 3.2, we considered two alternative random effect specifications: auto-correlated functional random day effects with a marginal AR(1)-structure with auto-correlation 0.8 over the days, as well as i. i. d. random day effects b_i(t). We also considered models with no day effects at all.
Beyond the 3h-cumulative auto-regressive effect of previous feeding used in the analysis in Section 3.2, we also fit models with a 6h-cumulative auto-regressive effect (∫_{t−6h}^{t−10min} y_i(s)β(t, s)ds) as well as non-linear time-constant (f(y_i(t − 10min))) and time-varying (f(y_i(t − 10min), t)) auto-regressive effects. We also considered models with no auto-regressive effects at all.
To model possible effects of humidity and temperature, we considered additive non-linear time-varying (f(hum(t), t) + f(temp(t), t)) and time-constant (f(hum(t)) + f(temp(t))) concurrent effects, as well as the corresponding concurrent interaction effects f(hum(t), temp(t), t) and f(hum(t), temp(t)), respectively. We also considered models with no effects of humidity and/or temperature at all.
We estimated models for all 100 combinations (4 day effects, 5 auto-regressive effects, 5 humidity/temperature effects) of the different effect specifications given above. As in the main analysis, we excluded every third day from the training sample to serve as a validation set. Analysis of the mean Brier scores achieved on the training data showed that all 100 models fit the training data similarly well, with slightly better fits for models with i. i. d. or AR(1) functional random effects compared to models with no day effects or a smooth day effect. Analysis of the mean Brier scores on the validation data, however, showed only very small differences in predictive accuracy between models with a smooth day effect and models with no day effects at all, and also revealed a strong decrease in predictive performance for models with humidity-temperature effects, especially in combination with cumulative auto-regressive effects. All in all, the majority of models performed worse in terms of predictive Brier score than a simple functional intercept model y_i(t) ~ B(60, logit⁻¹(β_0(t))), which had a mean predictive Brier score of 0.026, compared to 0.029 for the i. i. d. functional random intercept model discussed below and 0.024 for a model like the one discussed in Section 3.3 with a smoothly varying day effect, which is better suited for generating interpolating predictions for missing days than functional random day effects, which are constant 0 for days in the test set. We conclude that overfitting of complex effects could be a serious concern for the model class we propose, as shown by the large differences in Brier scores between training and validation data for many of the models with complicated effect structures. Note, however, that even though the i. i. d.
random effect model yields inferior predictions, it may be better suited for estimating explanatory and interpretable models than the one discussed in Section 3.2, as the larger flexibility of the i. i. d. functional random day effects is presumably more effective at modeling the peaky and irregular temporal dependencies in the observations than a smoothly varying aging effect. We present detailed results for an alternative model specification with i. i. d. functional random day effects b_i(t) instead of a smooth aging effect f(i, t) below. Figure B.2 shows fitted and observed values for 12 selected days from the training data (top) as well as predicted and observed values for 12 selected days from the validation data (bottom). The model is able to reproduce many of the feeding episodes in the training data, in the sense that peaks in the estimated probability curves mostly line up well with observed spikes of y_i(t). This model explains about 41% of the deviance and achieves a Brier score of about 0.019, taking the mean over all days and time-points. Prediction (lower panel) with this model is more challenging (Brier score: 0.029) and succeeds only partially. For example, while the predictions for day 73 are mostly very good, predictions for the evening hours of day 11 or around noon on day 82 are far off the mark. Note, however, that b̂_i(t) ≡ 0 for the validation data, as E(b_i(t)) ≡ 0 for all t, i. Models that instead include a smooth day effect f(i, t), which can be interpolated for the days in the validation data, are more successful at prediction; see Section 3.3. Figure B.3 shows the estimated components of the additive predictor for this alternative model specification. The functional intercept in the top left panel of Figure B.3 is fairly similar in shape to Figure 2, but overall the base rate is estimated to be much higher and the peak before 6h is much less pronounced.

Estimated random day effects are shown in the top right panel of Figure B.3. In terms of absolute size, these are much larger than the values of the smooth aging effect depicted in Figure 2. Effect sizes for b_i(t) are often unrealistically large, with some estimates going as low as −20 (−20 on the logit scale corresponds to a reduction of the probability by a factor of about 2 · 10⁻⁹), presumably an artifact of the very low feeding activity between midnight and early morning leading to (quasi-)complete separability of the data. The vertical axis in Figure B.3 is cut off at −10 in order to showcase the typical "peaky" structure of the b̂_i(t), which would otherwise be hidden by the larger vertical axis required to show the 12 random intercept functions reaching minimal values below −10 in their entirety. The cumulative auto-regressive effect of feeding during the previous 3 hours displayed in the bottom row of Figure B.3 is quite similar to the estimate for the original model shown in Figure 2, but with a weaker positive association between feeding in the immediate past and current feeding and a stronger negative association for feeding episodes in the early morning hours.

Table 2: Tabular display of results in Figure 5. Median and 25% and 75% quantiles for relative RIMSE and pointwise coverages for the additive predictor η̂(t).

Table 3: Tabular display of results in Figure 6. Median and 25% and 75% quantiles for relative RIMSE and pointwise coverages for the estimated effects f̂(X_r, t).
Estimated random day effects are shown in the top right panel of Figure B.3. In terms of absolute size, these are much larger than the values of the smooth aging effect depicted in Figure 2. Effect sizes for b i (t) are often unrealistically large, with some estimates going as low as -20 (-20 on the logit scale corresponds to a reduction of the probability by a factor of about 2 · 10 −9 ), presumably an artifact of the very low feeding activity between midnight and early morning leading to (quasi-)complete separability of the data. The vertical axis in Figure  B.3 is cut off at −10 in order to showcase the typical "peaky" structure of thê b i (t) which would otherwise be hidden by the larger vertical axis required to show the 12 random intercept functions reaching minimal values below −10 in their entirety. The cumulative auto-regressive effect of feeding during the previous 3 hours displayed in the bottom row of Figure B.3 is quite similar to the estimate for the original model shown in Figure 2 but with a weaker positive association between feeding in the immediate past and current feeding and a stronger negative association for feeding episodes in the early morning hours. Scheipl Table 2 Tabular display of results in Figure 5. Median and 25% and 75% quantiles for relative RIMSE and pointwise coverages for the additive predictorη(t).  Table 3 Tabular display of results in Figure 6. Median and 25% and 75% quantiles for relative RIMSE and pointwise coverages for the estimated effectsf (Xr, t).