VPNet: Variable Projection Networks

We introduce VPNet, a novel model-driven neural network architecture based on variable projection (VP). Applying VP operators to neural networks results in learnable features, interpretable parameters, and compact network structures. This paper discusses the motivation and mathematical background of VPNet and presents experiments. The VPNet approach was evaluated in the context of signal processing, where we classified a synthetic dataset and real electrocardiogram (ECG) signals. Compared to fully connected and one-dimensional convolutional networks, VPNet offers fast learning ability and good accuracy at a low computational cost of both training and inference. Based on these advantages and the promising results obtained, we anticipate a profound impact on the broader field of signal processing, in particular on classification, regression and clustering problems.


Introduction
Until recently, signal processing was dominated by conventional model-based algorithms, which rely on mathematical and physical models of the real world. They are inherently interpretable and often incorporate domain knowledge, such as statistical assumptions, smoothness, structure of the model space, and origin of the noise. However, this approach can become mathematically intractable if problems are complex. Machine learning (ML) provides an alternative approach to this challenge by building data-driven mathematical models. Neural networks (NNs) and supervised learning in particular offer a proper framework for various signal-processing problems. 1 Below, we briefly review a few recent trends that served as motivation for developing the proposed variable projection network (VPNet).
A traditional ML approach is to decompose the problem into separate feature extraction and learning steps. 2 In this case, the data is preprocessed in order to extract static features based on the given domain knowledge. These features are inputs to conventional ML algorithms. Although the dimension of the original data is significantly reduced in the first step, these handcrafted features are usually suboptimal with respect to the whole learning process. 3 Deep learning provides alternatives to the traditional approach, overcoming some of its drawbacks. 1 Learned features of NNs can also be used as input for non-NN methods, such as discriminant correlation filters. 4 Ref. 5 combined traditional kernel-based Support Vector Machines (SVMs) with deep learning approaches. Another common method is to use the features as input for one or more other NNs for multi-target prediction. [6][7][8][9]
Using more hidden layers in deep neural networks (DNNs) has increased the learning abilities of NNs. 10 This enables DNNs to use the first layers for feature extraction and further layers for performing operations on the features learned. Convolutional neural networks (CNNs) are special, optionally deep architectures and are the leading ML approaches in 2D and 3D image processing and computer vision. [11][12][13][14][15][16] Here, the built-in feature extraction layers perform multiple convolutional filtering and dimension-reduction (pooling) steps. Despite their advantages, DNNs and CNNs continue to raise several concerns. Their improved efficiency comes at the cost of higher computational complexity and numerical difficulties in the training process (see, e.g., overfitting and divergence). Due to the large number of nonlinear connections between the model parameters, DNN and CNN approaches can be considered black-box methods, whose parameters have little or no physical meaning and are difficult or impossible to interpret. Additionally, training these networks requires vast amounts of labeled data, which is problematic to collect in many applications, such as telecommunications 17 and biomedical engineering. 18,19 Although data augmentation, transfer learning, outlier removal, and ensemble methods can mitigate this problem, reducing the data hunger of deep learning approaches is still a major challenge in this field.
Despite the popularity of deep learning, traditional ML algorithms continue to dominate in many 1D signal-processing tasks, 20 especially in biomedical signal classification, for example, of electroencephalograms (EEGs), electromyograms (EMGs), and ECGs. The main reason for this lies in the nature of clinical applications, where both accuracy and explainability are important. These cannot be guaranteed by the previously mentioned NN approaches, since they do not extract medically interpretable features. VPNet, however, breaks this impasse by harnessing the theory of variable projection (VP) to provide a framework for solving nonlinear least-squares problems, whose parameters can be separated into linear and nonlinear ones. In many fields of signal processing, there is a large number of linear parameters, which are driven by a smaller number of nonlinear variables (see Eq. (3)). For example, signal compression, representation, and feature-extraction algorithms are often based on linear coefficients of some transformation, such as Fourier and wavelet transforms, which can be parameterized via properties of the window function, mother wavelet, etc.
The VPNet was designed to merge the expert knowledge used by traditional model-based approaches with the learning abilities of NNs. The proposed architecture is inspired by the so-called model-driven NN concept, which is an emerging trend in signal processing. In Section 2, we review the existing literature on incorporating model-based information into machine learning. The theoretical background, the general formulation of VPNet, and the corresponding forward and backpropagation algorithms are discussed in Section 3. Section 4 describes multiple experiments we performed to evaluate and compare the performance of VPNet to that of other NNs. Finally, Section 5 presents conclusions and the expected broader impact of our research.

Related works
Approximation theory gives a general framework for approaching the fundamental task in machine learning, which is to learn a good representation of the data. 3 Classical methods in approximation theory build up complicated functions by using linear combinations of elementary functions, whereas neural networks use compositions of simple functions. The structure of these compositions constrains the feasible region where we search for the solution of the corresponding ML task. The model-driven NN concept implements these constraints such that the design of the NN architecture resembles the solution of well-understood mathematical problems, such as ordinary or partial differential equations 21,22 (ODEs, PDEs), signal [23][24][25] and image [26][27][28] processing, optimization, 29 and control. [30][31][32] ODE- and PDE-constrained learning strategies belong to a family of model-driven ML techniques that relates the rigorous mathematical background of differential equations to deep learning problems. On the one hand, numerical solvers provide various ways to derive and to interpret the output of NN architectures, such as residual neural networks 21 and Hamiltonian networks, 22 based on the discretization scheme of the corresponding ODE or PDE. On the other hand, deep learning can automatically incorporate domain knowledge that would otherwise require significant human effort, 4,26,28 e.g., good insight into the problem and the mathematical formulation of a priori information. Although this approach does not necessarily reduce the number of trainable weights, it helps to design reversible architectures that allow for memory-efficient implementations. 33 Another branch of model-driven NNs, such as deep unfolding 23 or Wiener- 32 and Hammerstein-type 25 NNs, originates from signal-processing problems. The former approach unfolds the iterations of classical model-based algorithms into layer-wise NN structures whose parameters are optimized based on the training data.
This way the resulting NN retains the powerful learning ability of DNNs, inherits expert knowledge, and reduces the size of the training data. 17 Wiener- 32 and Hammerstein-type 25 NNs are alternatives that combine the advantages of model-based methods and deep learning techniques. These networks comprise cascades of static nonlinear elements and dynamic linear blocks that represent NNs and linear time-invariant (LTI) systems, respectively. Recently, these methods have shown great potential in many fields, for instance, in system identification, 25 control engineering, 32 sparse approximation theory, 34,35 and telecommunication. 36,37 The motivation behind integrating optimization problems into DNN architectures is similar to that of ODE/PDE-driven networks: designing optimization problems for real-world processes is labor-intensive work that also requires expert knowledge. To date, several new NN architectures have been proposed in order to learn these optimization problems automatically from data. Solving ill-posed inverse problems is a typical example for such neural networks. In this case, each layer is constrained by a penalized linear least-squares problem where the parameters of the regularization term, such as threshold values, linear kernels, and weights of the shrinkage functions, constitute the trainable weights. 27,38,39 OptNet 29 gives the most general framework in this family, where the layers encode convex quadratic programming (QP) problems. The Hessian matrix of the QP's objective function, along with its equality and inequality constraints, are learnable parameters. The representation power of an OptNet layer is higher than that of two-layer ReLU networks, which can reduce the overall depth of DNN architectures (see Theorems 2 and 3 in Ref. 29). Despite its advantages, the forward/backward passes of an OptNet layer are much more computationally expensive than those of a linear or convolutional layer.
This is because constrained QP problems generally have no closed-form solution; thus, the forward pass requires iterative numerical solvers in each layer for each update. We acknowledge that there are many other model-based 40 and model-free approaches. [41][42][43][44][45] In particular, for time-series data there are methods based on spiking neural networks, 46,47 including their variants, 48,49 which are beyond the scope of this paper.
To the best of our knowledge, this is the first time that VP operators have been exploited in the context of learning end-to-end systems. However, we note that the proposed VPNet can be considered a special case of OptNet. Indeed, a VP layer forwards the solution of an unconstrained separable nonlinear least-squares (SNLLS) problem to the next layer (cf. Eq. (1) in Ref. 29). The corresponding nonlinear parameters are the trainable weights of the VP layer, and the linear ones are the extracted features, which are forwarded to the next layer. In contrast to a general OptNet layer, both the solution and the gradients of a VP layer can be calculated analytically, as provided by the theoretical framework of variable projection. 50 This speeds up the training and the inference, which can be further improved by the use of orthogonal and discrete orthogonal function systems (see, e.g., Section 3.3).

Variable projections
Variable Projection (VP) 50 provides a framework for addressing nonlinear modeling problems of the form

x ≈ Φ(θ)c, (1)

where x ∈ R^m denotes the input data to be approximated, and Φ(θ) is a parametric function system whose members Φ_k ∈ R^m form the columns of a matrix of size m×n; the symbol Φ(θ) refers to both the function system itself and this matrix. The linear parameters c ∈ R^n and the nonlinear parameters θ ∈ R^p of the function system Φ are separated. The least-squares fit of this problem means minimization of the nonlinear functional

r(c, θ) = ‖x − Φ(θ)c‖₂².

Without nonlinear parameters (i.e., if θ is fixed), the model is linear in the coefficients c. The minimization of r with respect to c leads to the well-known linear least-squares approximation. Note that this is in fact the best approximation problem in Hilbert spaces. The optimal solution can be expressed by means of Fourier coefficients and the orthogonal projection operator P_Φ(θ):

ĉ = Φ⁺(θ)x,  x̂ = P_Φ(θ)x = Φ(θ)Φ⁺(θ)x, (2)

where Φ⁺(θ) denotes the Moore-Penrose pseudoinverse of the matrix Φ(θ). The concept is closely related to mathematical transformation methods, such as Fourier and wavelet transforms, which can be interpreted as orthogonal projections by a given function system with a predefined θ. From a practical point of view, the coefficients ĉ can be interpreted as features extracted by VP, and x̂ is the result of low-pass filtering and dimension reduction. The minimization of r in the general case can be decomposed into minimization over the nonlinear parameters θ, while the linear parameters c are computed by the orthogonal projection. Thus, according to the work of Golub and Pereyra, 50 minimizing r is equivalent to minimizing the following VP functional:

r₂(θ) = ‖x − Φ(θ)Φ⁺(θ)x‖₂². (3)

In Ref. 51, a robust gradient-based Matlab implementation was provided for the numerical optimization of r₂. Mathematically, VP is a formalization of adaptive orthogonal transformations that allows filtering and feature extraction by means of parametric function systems.
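The projection and the VP functional in Eqs. (2)-(3) can be sketched in a few lines of NumPy; the cosine dictionary below is a toy stand-in for Φ(θ), not a system used in the paper:

```python
import numpy as np

def vp_project(Phi, x):
    """Orthogonal projection of x onto the column space of Phi.

    Returns the linear coefficients c = Phi^+ x (the extracted features)
    and the reconstruction x_hat = Phi c, as in Eq. (2)."""
    c = np.linalg.pinv(Phi) @ x
    return c, Phi @ c

def vp_functional(Phi, x):
    """Value of the VP functional r2(theta) = ||x - Phi Phi^+ x||^2, Eq. (3)."""
    _, x_hat = vp_project(Phi, x)
    return np.sum((x - x_hat) ** 2)

# toy example: a two-column cosine dictionary standing in for Phi(theta)
t = np.linspace(0.0, 1.0, 200)
Phi = np.stack([np.cos(2 * np.pi * 3 * t), np.cos(2 * np.pi * 5 * t)], axis=1)
x = 2.0 * Phi[:, 0] - 0.5 * Phi[:, 1]   # a signal that lies in the span of Phi

c, x_hat = vp_project(Phi, x)
r2 = vp_functional(Phi, x)
```

Since the toy signal lies exactly in the span of the dictionary, the recovered coefficients equal the generating ones and the functional value is (numerically) zero.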
If a nonlinear optimization problem can be separated into linear and nonlinear parameters, VP may also act as a solver, which opens up other possible applications. 52,53 In the ML context, VP can be used as a feature extraction method and as a modeling technique for the training procedure. 54 Pereyra et al. proposed VP as an optimization method for a given class of feedforward NNs. They modelled the whole network with VP and used the VP optimization method from Ref. 50 as an alternative to stochastic gradient methods. This methodology is, however, limited to NNs with only one hidden layer. Approaching VP from a different and novel direction, based on its feature extraction ability, we introduce VPNet.
Previous results have shown that several biomedical signal-processing problems can be addressed efficiently with variable projection by means of adaptive rational and Hermite functions as well as B-splines. 55, 56 VP features have been used in particular for ECG and EEG representation, compression, classification, and segmentation. [57][58][59][60][61][62][63][64][65] The results show that VP provides a very compact, yet morphologically accurate, representation of signals with respect to the target problem. Additionally, the nonlinear parameters themselves carry direct morphological information about the signals, and they are usually human-interpretable.

VPNet architecture
The key idea of this architecture is to create a network that combines the representation abilities of VP and the prediction abilities of NNs in the form of a composite model. The basic VPNet architecture is a feedforward NN, where the first layer(s) applies a VP operator whose output is forwarded to a fully connected, potentially deep NN (see Fig. 1). The construction is similar to that of CNNs in the sense that the first layer(s) of the network can be interpreted as a built-in feature extraction method. Note that more complex VPNet architectures are also possible, for instance, based on the models of U-Net 14 and AutoEncoder, 66 which will be investigated as part of our future work. Depending on its target application, the VP layer we propose has two possible behaviours. It either performs a filtering of the form

x ↦ Φ(θ)Φ⁺(θ)x (4)

or a feature extraction of the form

x ↦ Φ⁺(θ)x, (5)

where θ ∈ R^p denotes the nonlinear system parameters of the given function system Φ, as defined above. These VP operators refer to the orthogonal projection and the general Fourier coefficients of the input x by means of the parametric system Φ(θ), as in Eq. (2). The filter method may be better suited to regression problems, while the feature extraction is suitable for classification problems.
The nonlinear system parameter vector θ comprises the learnable parameters of the VP layer. Note that many inverse problems 52 can be viewed as SNLLS data-fitting problems including a small set of adjustable nonlinear parameters θ with direct physical interpretations. For instance, the function system Φ k (t; τ k , λ k ) = cos(λ k t + τ k ) can be used in frequency estimation and in EEG, where the network would learn dominant frequencies λ k and phases τ k that characterize a certain class of signals, such as seizures in EEG recordings. 67-69 MRI is another setting, 70 where Φ k (t; λ k ) = exp(−λ k t) with λ k ∈ R + yields information about the tissue type. The previously mentioned properties and advantages of the VP operator are implicitly built into VPNet:
• Role: A novel model-driven network architecture for 1D signal-processing problems.
• Generality: VPNet can be built from arbitrary parameterized function systems, which allows the direct incorporation of domain knowledge into the network.
• Interpretability: The VP layer can be explained as a built-in feature-extraction method. Further, the layer parameters are the nonlinear VP system parameters, which have an interpretable meaning. They are usually directly connected to morphological properties of the input data (see, e.g., Section 4.2).
• Simplicity: Since the VP layer is usually driven by only a few system parameters, VPNet may provide a compact alternative to CNNs and DNNs. In fact, the VP layer can significantly decrease the number of parameters in a DNN.

VP forward propagation
In order to calculate the forward pass of the VP layer, a linear least-squares (LLS) problem has to be solved for the current value of θ in each training iteration (see Eqs. (4)-(5)). Several numerical methods exist to solve such problems, among which QR factorization and singular value decomposition (SVD) are the most common techniques. The QR method (requires ∼ 2mn² − 2n³/3 flops) is fast and reliable for well-conditioned problems, but may fail when Φ(θ) ∈ R^{m×n} is nearly rank-deficient. Therefore, in our implementation, we utilize the SVD (requires ∼ 2mn² + 11n³ flops), which is the most stable way to solve unconstrained LLS problems. 71 Although it is computationally more demanding than the QR factorization when m ∼ n, their complexity is approximately the same if m ≫ n. Note that the latter inequality usually holds in practice, since in VPNet m stands for the length of the input signal, which is much greater than the number of extracted features n.
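A minimal sketch of an SVD-based LLS solve of the kind used in the forward pass (the truncation threshold rcond is an assumption of this sketch, not a value from the paper):

```python
import numpy as np

def lls_svd(Phi, x, rcond=1e-12):
    """Solve min_c ||x - Phi c||_2 via the SVD.

    Singular values below rcond * s_max are truncated, which keeps the
    solve stable even when Phi(theta) is nearly rank-deficient."""
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ x))

rng = np.random.default_rng(0)
m, n = 1000, 3                       # m >> n: signal length vs. number of features
Phi = rng.standard_normal((m, n))
x = rng.standard_normal(m)
c = lls_svd(Phi, x)
```

For a well-conditioned system this agrees with NumPy's built-in least-squares solver, which is also SVD-based.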
The low computational complexity is based on the fact that the nonlinearity is precomputed and stored in the Φ-matrix. As a consequence, during evaluation, the VP layer just performs a matrix multiplication. Further, since the number of features computed by the VP layer is typically very low, the following layers can have lower complexity as well. The weight matrix of a fully connected layer following the VP layer is an element of R^{n×l} instead of R^{m×l} without the VP layer, where n is the number of coefficients, m is the length of the input signal, and l is the number of neurons in the fully connected layer. Since n is usually far smaller than m, the weight matrix is significantly smaller for a fixed number of neurons l.
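The parameter saving can be illustrated with a back-of-the-envelope count; the sizes m, n, and l below are hypothetical, not values from the paper:

```python
# hypothetical layer sizes: m input samples, n VP features, l hidden neurons
m, n, l = 1000, 8, 16

fc_without_vp = m * l + l   # weights + biases of an FC layer on the raw signal
fc_with_vp = n * l + l      # the same FC layer applied to the n VP coefficients
vp_params = 2               # e.g. (tau, lambda) for an adaptive Hermite system
```

Even after accounting for the handful of nonlinear VP parameters, the FC layer that follows a VP layer is two orders of magnitude smaller in this example.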
For shallow neural networks, when only a few hidden layers are connected to the VP layer, solving the corresponding LLS problem in each training iteration is obviously the bottleneck of VPNet, which influences both the computational complexity and the numerical accuracy. In the following, we provide a realization of the VP layer with Hermite functions, and we demonstrate how the choice of the function system and its parametrization influence the conditioning of Φ(θ).

Adaptive Hermite system
In order to alleviate the computational burden of the VP layer, a straightforward option is to parametrize orthogonal function systems. As a case study, let us consider Hermite polynomials, 72 which are defined by the three-term recurrence relation

H_{n+1}(t) = 2t H_n(t) − 2n H_{n−1}(t) (n ≥ 1),

where H_0(t) = 1 and H_1(t) = 2t. These classical orthogonal polynomials can be parametrized via dilation and translation:

Φ_k(t; τ, λ) = √λ h_{k−1}(λ(t − τ)) (k = 1, …, n), (6)

where h_k(t) = H_k(t) e^{−t²/2} / √(2^k k! √π) denotes the k-th Hermite function. The functions Φ_k(t; τ, λ) are the translated and dilated variations of the well-known Hermite functions; thus, we refer to them as "adaptive Hermite functions". The forward propagation of the corresponding Hermite-VP layer can be defined by the matrix Φ(θ) in Eq. (3). For a given parameter value θ = (τ, λ), the k-th column of Φ(θ) is equal to the values of the k-th adaptive Hermite function evaluated at uniform sampling points t_j = jT (j = 0, …, m − 1), where T stands for the sampling interval. In the case of proper discretization, 73 the columns of Φ(θ) are pairwise orthogonal unit vectors for all θ; therefore, Φ⁺(θ) = Φᵀ(θ), which speeds up the computation of both the forward and the backward passes.
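A sketch of the construction, assuming the standard normalization of the Hermite functions and unit-step sampling (both assumptions of this sketch), together with a numerical check of the discrete orthogonality:

```python
import numpy as np
from math import factorial

def adaptive_hermite(m, n, tau, lam):
    """Matrix Phi(theta) whose k-th column samples the k-th adaptive Hermite
    function at t = 0, 1, ..., m-1, i.e. the Hermite functions translated by
    tau and dilated by lam (a sketch of the construction described above)."""
    t = lam * (np.arange(m) - tau)
    H = np.zeros((m, n))
    H[:, 0] = 1.0
    if n > 1:
        H[:, 1] = 2.0 * t
    for k in range(1, n - 1):
        # three-term recurrence H_{k+1}(t) = 2 t H_k(t) - 2 k H_{k-1}(t)
        H[:, k + 1] = 2.0 * t * H[:, k] - 2.0 * k * H[:, k - 1]
    norms = np.array([np.sqrt(2.0 ** k * factorial(k) * np.sqrt(np.pi))
                      for k in range(n)])
    # Hermite functions h_k = H_k exp(-t^2/2) / norm; sqrt(lam) keeps unit norm
    return np.sqrt(lam) * (H * np.exp(-t ** 2 / 2.0)[:, None]) / norms

Phi = adaptive_hermite(m=1000, n=3, tau=500.0, lam=0.05)
G = Phi.T @ Phi   # close to the identity for feasible (tau, lambda)
```

With this (assumed feasible) parameter choice the Gram matrix is the identity to high precision, so Φ⁺(θ) = Φᵀ(θ) holds in practice.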
There are two strategies for choosing the discretization points: nonuniform and uniform sampling. The former relies on the Gauss-Hermite quadrature rules, which associate the points {t_0, t_1, …, t_{m−1}} ⊆ [a, b] with the roots of Hermite polynomials. 74 This approach is the most accurate way to define discrete orthogonal systems, but it requires both the precomputation of the roots and the resampling of the input signals at these nonequidistant points. Therefore, we consider the computationally simpler uniform discretization instead. This sampling scheme, although less accurate, satisfies discrete orthogonality, and thus the identity Φ⁺(θ) = Φᵀ(θ) holds, provided that the number of sampling points m is large enough and θ lies in a feasible region Γ of the parameter space. If θ ∉ Γ, the adaptive Hermite functions Φ_k(t; τ, λ) may no longer be discrete orthogonal. In the worst-case scenario, they can be linearly dependent, which results in a rank-deficient matrix Φ(θ). In Fig. 2, we demonstrate this phenomenon by evaluating the condition number of Φ(θ) ∈ R^{m×n} for m = 1000, n = 3, and for a range of parameters θ = (τ, λ) ∈ [500, 1100] × [0.012, 0.05]. It can be seen that the condition number diverges from the ideal case (green dashed line) as τ and λ leave the feasible region Γ. This can be avoided if we choose the parameters from Γ. The rationale behind this behaviour is given in Appendix A.
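The dependence of the conditioning on the nonlinear parameters can be reproduced with any parametric system whose columns collapse onto each other for infeasible parameter values; the Gaussian-bump system below is an illustrative stand-in, not the Hermite setup of Fig. 2:

```python
import numpy as np

t = np.arange(1000)

def bump_system(lam, centers):
    """Toy parametric system of Gaussian bumps sharing a width parameter lam
    (a stand-in illustrating how the nonlinear parameter drives conditioning)."""
    return np.stack([np.exp(-lam * (t - c) ** 2) for c in centers], axis=1)

centers = [400, 500, 600]
cond_good = np.linalg.cond(bump_system(1e-3, centers))  # narrow, well-separated bumps
cond_bad = np.linalg.cond(bump_system(1e-7, centers))   # wide, nearly identical columns
```

Narrow, well-separated columns give a condition number close to 1, while very wide bumps make the columns nearly collinear and the condition number blows up, mirroring the rank-deficiency discussed above.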

VP backpropagation
Let us discuss the training of a general feedforward NN in a supervised manner. Let

{(x_i, y_i) : i = 1, …, N}

be the annotated input-target pairs of the training data, where x_i ∈ R^m is the input vector and y_i ∈ R^s is the target vector (in the case of regression), or y_i ∈ N is the target label or y_i ∈ [0, 1]^c the target probabilities (in the case of classification). A general feedforward NN can be expressed as a composition of layer functions:

f(x; θ) = (f^(L)(·; θ^(L)) ∘ ⋯ ∘ f^(1)(·; θ^(1)))(x),

where x ∈ R^m stands for the input samples, and f^(ℓ) and θ^(ℓ) denote the function and the parameters of layer ℓ, respectively. The symbol θ refers to the set of all parameters θ^(ℓ). The layer functions f^(ℓ) may refer to linear mappings, convolutional filters, nonlinear activations, pooling, VP operators, etc. Let ŷ_i = f(x_i; θ) denote the predicted value for each input. The training of the network can be addressed as a minimization problem, involving a proper loss (i.e., cost) function J that evaluates the error between predicted and target values. Common loss functions are the Mean Squared Error (MSE), that is, the least-squares cost function (regression problems, y_i ∈ R^s),

J_MSE(θ) = (1/N) Σ_{i=1}^N ‖ŷ_i − y_i‖₂²,

and the Binary Cross Entropy (BCE) loss (binary classification, y_i ∈ {0, 1}):

J_BCE(θ) = −(1/N) Σ_{i=1}^N [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)].

In our experiments, we used the Cross Entropy loss J_CE, which is the multi-class extension of BCE (classification, y_i ∈ N); see also Section 4. The state-of-the-art method for training feedforward networks is backpropagation, 75 where J is minimized by means of stochastic gradient-descent optimization (see, e.g., Adam, 76 Adagrad, 77 RMSprop 78 ). There are multiple implementations of the backpropagation algorithm for different programming languages, target hardware platforms, and machine learning frameworks. [79][80][81][82] The gradient-descent update formula for each layer parameter is

θ^(ℓ) ← θ^(ℓ) − η ∂J/∂θ^(ℓ),

where η > 0 is called the learning rate.
Briefly, backpropagation provides a recursive way of computing the gradients above based on the chain rule:

∂J/∂θ^(ℓ) = ∂J/∂f^(L) · ∂f^(L)/∂f^(L−1) ⋯ ∂f^(ℓ+1)/∂f^(ℓ) · ∂f^(ℓ)/∂θ^(ℓ).

This way, only the partial derivatives of the layer function f^(ℓ) with respect to its input (∂f^(ℓ)/∂f^(ℓ−1)) and to its parameters (∂f^(ℓ)/∂θ^(ℓ)) must be calculated. These derivatives are usually well known for the common layer types and can also be directly calculated for the VP layers. Based on Ref. 50, the partial derivatives of the VP operators with respect to their input and nonlinear parameters can be expressed as follows. In the case of a filtering-type VP layer (see Eq. (4)):

∂(Φ(θ)Φ⁺(θ)x)/∂x = Φ(θ)Φ⁺(θ),  ∂(Φ(θ)Φ⁺(θ)x)/∂θ = ∂(Φ(θ)Φ⁺(θ)) x,

where ∂(Φ(θ)Φ⁺(θ)) = (I − ΦΦ⁺)∂Φ Φ⁺ + ((I − ΦΦ⁺)∂Φ Φ⁺)ᵀ.
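The derivative of the projector Φ(θ)Φ⁺(θ) above can be checked against finite differences on a toy one-parameter family (the linear family Φ(θ) = A + θB is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 30, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))

def projector(theta):
    """Orthogonal projector Phi(theta) Phi(theta)^+ for the toy family
    Phi(theta) = A + theta * B (so dPhi/dtheta = B)."""
    P = A + theta * B
    return P @ np.linalg.pinv(P)

theta0 = 0.3
P = A + theta0 * B
Pp = np.linalg.pinv(P)
R = np.eye(m) - P @ Pp                  # residual projector I - Phi Phi^+
analytic = R @ B @ Pp + (R @ B @ Pp).T  # Golub-Pereyra derivative of Phi Phi^+

eps = 1e-6
numeric = (projector(theta0 + eps) - projector(theta0 - eps)) / (2 * eps)
err = np.max(np.abs(analytic - numeric))
```

The analytic and central-difference derivatives agree to several digits, which is a quick sanity check before wiring the formula into backpropagation.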
In the case of a feature-extraction-type VP layer (see Eq. (5)):

∂(Φ⁺(θ)x)/∂x = Φ⁺(θ),  ∂(Φ⁺(θ)x)/∂θ = ∂Φ⁺(θ) x,

where ∂Φ⁺ = −Φ⁺∂Φ Φ⁺ + (ΦᵀΦ)⁻¹∂Φᵀ(I − ΦΦ⁺). The naive implementation of backpropagation, particularly in the case of DNNs, can lead to numerical issues, such as divergence and overfitting. In order to avoid this, a regularization term in the form of an ℓ2 penalty on the weight parameters is commonly added to the loss. 66 Here, we introduce a percent root-mean-square difference (PRD) regularization that can be applied to a single feature-extraction VP layer in the case of a classification problem. The modified loss function we propose is

J̃(θ) = J(θ) + α Σ_{i=1}^N r₂(θ; x_i),

where r₂(θ; x_i) = ‖x_i − Φ(θ)Φ⁺(θ)x_i‖₂² and α ≥ 0 controls the penalty effect. The motivation behind this regularization is twofold: First, it is based on the previous results that incorporate VP as feature extraction, which show that a precise VP approximation may lead to 'good' features and therefore to high classification accuracy. Second, we expect that the optimal VPNet classifier extracts the main characteristics of the input signals, which means that we presume 'good' approximation. This penalty term seemingly breaks the formulation of the backpropagation, but the original method can easily be extended by a bypass step that is applied to the VP layer only. The gradient with respect to the VP parameters is modified as follows:

∂J̃/∂θ = ∂J/∂θ + α Σ_{i=1}^N ∂r₂(θ; x_i),

where ∂r₂ = −2x_iᵀ(I − ΦΦ⁺)∂Φ Φ⁺ x_i. We have now derived the formulas for attaining the necessary gradient information for training VPNet via backpropagation. This allows us to train VPNets in the same way as convolutional and fully connected NNs.
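The gradient formula for r₂ used in the penalty term can likewise be verified against finite differences on a toy one-parameter family (Φ(θ) = A + θB is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
x = rng.standard_normal(m)

def r2(theta):
    """VP functional r2(theta) = ||x - Phi Phi^+ x||^2 for Phi = A + theta * B."""
    P = A + theta * B
    res = x - P @ (np.linalg.pinv(P) @ x)
    return res @ res

theta0 = 0.5
P = A + theta0 * B
Pp = np.linalg.pinv(P)
R = np.eye(m) - P @ Pp
analytic = -2.0 * x @ (R @ B @ (Pp @ x))  # dr2 = -2 x^T (I - Phi Phi^+) dPhi Phi^+ x

eps = 1e-6
numeric = (r2(theta0 + eps) - r2(theta0 - eps)) / (2 * eps)
```

Agreement between the analytic and numeric values confirms that the bypass gradient can be added to the backpropagated loss gradient without further machinery.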

Experiments
Using supervised classification problems inspired by particular biomedical signal-processing applications, we evaluated VPNet and compared it to fully connected and 1D convolutional networks. We present the details of the experiments, specifically the network architectures, the VP system of choice, and the synthetic and real datasets.

Network architecture
Here we provide details about the networks we compared, the learning methods, and the network parameters. The networks were feedforward, consisting of the following layers:
• VPNet: a VP layer, a fully connected (FC) layer with ReLU activation, and an FC layer with SoftMax activation;
• Fully connected NN: one or two FC layers with ReLU, and an FC layer with SoftMax;
• CNN: a 1D convolutional and pooling layer, an FC layer with ReLU, and an FC layer with SoftMax.
For signal-classification tasks, the inputs were R^m samples, and the outputs were interpreted as a probability distribution over predicted output classes. The FC layers performed linear mappings with nonlinear activation (ReLU or SoftMax). The VP layer was of the feature-extraction type (see Eq. (5)), and the CNN implemented 1D convolution and mean or maximum pooling as in Ref. 18. Based on the cross entropy loss with VP regularization (see Section 3.4), offline backpropagation with the Adam optimizer 76 was applied for learning. The hyperparameters and the parameter selection strategies were as follows:
• Learning parameters: learning rate, VP penalty (VPNet only), batch size, and the number of epochs. The last two were fixed (512 and 10-100, respectively). The optimal learning rate and penalty can be found by a grid search.
• Network parameters: number of layers, number of neurons, VP dimension n (VPNet only), convolutional and pooling kernel sizes (CNN only).
Here we either used fixed dimensions so that the three architectures are comparable or evaluated possible configurations by a grid search.
• Layer parameters: linear weights and biases, nonlinear VP parameters (VPNet only), kernel weights and biases (CNN only). These parameters were optimized by backpropagation. Initialization was random for the linear and kernel parameters. However, the VP parameters have an interpretable meaning, which may call for special initialization. We investigated two options: a grid search on the intervals of possible values, and initialization by means of pretraining the VP layer to reconstruct the input data (i.e., minimizing r₂ in Eq. (3)). The latter approach is especially useful in the case of complex waveforms, which may need more VP parameters to learn.
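A non-authoritative forward-pass sketch of the VPNet variant above (feature-extraction VP layer, FC + ReLU, FC + SoftMax); all sizes and the fixed orthonormal dictionary are illustrative assumptions, since in the real network the VP parameters θ (and hence Φ) are also trained:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, l, n_classes = 1000, 8, 16, 3   # hypothetical sizes, not values from the paper

# stand-in for Phi(theta): any matrix with orthonormal columns, so Phi^+ = Phi^T;
# in the real network theta (and hence Phi) is updated by backpropagation
Phi = np.linalg.qr(rng.standard_normal((m, n)))[0]
W1, b1 = 0.1 * rng.standard_normal((l, n)), np.zeros(l)
W2, b2 = 0.1 * rng.standard_normal((n_classes, l)), np.zeros(n_classes)

def vpnet_forward(x):
    """Feature-extraction VP layer -> FC + ReLU -> FC + SoftMax."""
    c = Phi.T @ x                     # VP features (Eq. (5) with Phi^+ = Phi^T)
    h = np.maximum(W1 @ c + b1, 0.0)  # fully connected layer with ReLU
    z = W2 @ h + b2
    e = np.exp(z - z.max())           # numerically stable SoftMax
    return e / e.sum()

p = vpnet_forward(rng.standard_normal(m))
```

The output is a valid class-probability vector, so the cross entropy loss of Section 3.4 can be applied to it directly.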

VP system of choice
Although Hermite functions have shown great potential in many fields, such as molecular biology, 83 computed tomography, 84 radar, 85 and physical optics, 86 their main application area is 1D biomedical signal processing. The shape features of Hermite functions are well suited to producing models of compactly supported waveforms such as spikes, 87-91 which is why we used them in ECG heartbeat classification. The nonlinear parameters τ and λ in Eq. (6) represent the time shift and the width of the modeled waveforms, respectively. Thus, the network learns the positions and the shapes of those waves/spikes which separate one class from another. For instance, in electrocardiography, a heartbeat signal comprises three individual waveforms (i.e., the QRS, T, and P waves), which represent different phases of the cardiac cycle, and their properties are directly used by medical experts for diagnosis. These features are learned by the VP layer: The amplitude and shape information is extracted by the linear coefficients c k , while the position and width of the waves are represented by τ and λ (see Fig. 3). This approach is essentially different from CNN-based methods, where no direct connections exist between learned and medical descriptors.

Synthetic data
Our goal was, on the one hand, to synthesize a dataset where we know the actual structure of the data depending on the generator parameters. On the other hand, the dataset had to have practical relevance (i.e., be related to actual signal-processing problems). The generator system of choice was the adaptive Hermite system, which seemed to fulfill these expectations due to its applications in signal processing (see Section 4.2). The principles we followed to generate the dataset are discussed below.
Let us consider a general signal model by means of a linear combination of adaptive Hermite functions of the form

x_i(t) = Σ_{k=1}^n c_k^(i) Φ_k(t; τ_i, λ_i) (i = 1, 2, …, M),

where (τ_i, λ_i) and c^(i) refer to the sample-specific nonlinear parameters and coefficients, respectively. Based on the completeness of the Hermite system in L²(R), this formula provides a general approximation for arbitrary signals. However, the signal-processing applications of VP and the Hermite system show that a proper selection of the nonlinear parameters may lead to accurate low-order approximations. Further investigation into this topic revealed that the nonlinear parameters correspond to coarse changes in the signal morphologies, while the coefficients reflect fine details. 92 For instance, we refer to Ref. 56, where the nonlinear parameters were utilized as global, patient-specific descriptors and the coefficients as heartbeat-specific descriptors. Motivated by these aspects, we sought to construct a dataset where the nonlinear parameters are close to each other and the coefficients form noticeably separable classes.
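One simple way to obtain noticeably separable coefficient classes is to draw them from concentric spherical shells; the sketch below only illustrates this sampling idea, and all numerical settings (radii, noise level, dimensions) are assumptions rather than the paper's exact generator:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_shell(radius, n_samples, dim=3, sigma=0.05):
    """Coefficient vectors scattered around a sphere of the given radius:
    random directions scaled by radius plus small radial noise."""
    v = rng.standard_normal((n_samples, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    r = radius + sigma * rng.standard_normal((n_samples, 1))
    return r * v

# three classes of coefficients living on well-separated shells
shells = [sample_shell(radius, 5000) for radius in (1.0, 2.0, 3.0)]
```

Each class clusters tightly around its nominal radius, so the shells do not overlap and the class structure is easy to interpret yet nontrivial for a linear classifier.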
More precisely, we considered five coefficients (i.e., c^(i) ∈ R⁵) such that the points (c_1^(i), c_2^(i), c_3^(i)) formed three separable spherical shells that correspond to the target class labels (see Fig. 4 (a)). The motivation behind spherical shells was twofold: they are simple enough for human interpretation, but sufficiently complex to require complex networks. The last two coefficients, c_4^(i) and c_5^(i), served as random factors and for amplitude normalization. Their effect is to mislead the classifier, but at the same time to decrease the chance of overfitting. The nonlinear parameters τ_i and λ_i are similar for each sample up to a random factor, and the sample-specific parameter values are generated randomly with a given mean and variance (see Fig. 4 (b)). This random factor simulates the nonlinear noise in the measurement. Fig. 4 (c) presents the samples. We conclude that the simulation met our expectations: the resulting samples were hard to separate, but the underlying structure was easy to interpret. Note that this is a standard process for generating synthetic data, which has been utilized by other authors as well. 93 In the actual implementation, 5000 samples per class were generated for both the training and test sets. We evaluated a total of more than 8000 possible hyperparameter configurations of the three network architectures. A range of numbers of neurons in the hidden layer, various VP dimensions, CNN kernel and pooling sizes, learning rates, and VP initializations were considered. The VP penalty was initially fixed to 0.1. The simulations showed that the VP regularization can not only increase the learning speed, but also ensure convergence of an otherwise divergent configuration. In this regard, 0.1 was found to be a good choice. The aggregated results are presented in Fig. 5 (a) and (b). There, the configurations are grouped into six categories: VPNets of dimension n = 7 and n = 9 in Eq.
(1), fully connected NNs (FCNN), and CNNs with kernel sizes of 5, 15, and 25. Fig. 5 (a) shows the training accuracy curves corresponding to the best hyperparameter combination in each category. In Fig. 5 (b) and (c), the best test accuracies are plotted against the number of neurons in the hidden layer and the total number of learnable network parameters, respectively, for each category. We note that the y-axis of Fig. 5 (b) is restricted to the interval between 95% and 100% for better visual interpretability. In the following, we compare the performance of VPNet with respect to different network complexities.
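The synthetic data generation described above can be sketched as follows. This is a hedged illustration: the shell radii, the noise level, and the distributions of c_4, c_5 and of the nonlinear parameters are our own example choices, not the exact values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_shell(radius, dim=3):
    """Uniform point on a sphere of the given radius in R^dim."""
    v = rng.standard_normal(dim)
    return radius * v / np.linalg.norm(v)

def make_dataset(n_per_class, radii=(1.0, 2.0, 3.0), noise=0.1):
    """Coefficient vectors on three noisy spherical shells; class = shell index."""
    X, y = [], []
    for label, r in enumerate(radii):
        for _ in range(n_per_class):
            c123 = sample_shell(r + noise * rng.standard_normal())
            c45 = rng.standard_normal(2)   # misleading random / normalization factors
            X.append(np.concatenate([c123, c45]))
            y.append(label)
    return np.array(X), np.array(y)

X, y = make_dataset(5000)   # 5000 samples per class, as in the paper

# Sample-specific nonlinear parameters, similar up to a random factor
# (mean and variance here are illustrative); each row of X would then be
# turned into a signal via the adaptive Hermite model.
tau = rng.normal(0.0, 0.05, size=len(y))
lam = rng.normal(1.0, 0.05, size=len(y))
```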
The results demonstrate the efficiency and potential capabilities of VPNet. Fig. 5 (a) indicates its fast learning ability; in fact, VPNet may converge faster than the other network architectures. Fig. 5 (b) and (c) show that VPNet can potentially outperform FCNNs and CNNs in terms of the best accuracies on the test set. Although all architectures achieved accuracies close to 100%, VPNet did so with low structural complexity, which refers not only to the number of neurons, but also to the total number of network parameters (see Fig. 5 (c)). In this regard, VPNet is superior, because with FCNNs and CNNs of the same effective receptive field, the number of parameters (i.e., the linear and kernel weights and biases) grows linearly with the sample size and the number of neurons. With VPNet, in contrast, the number of nonlinear parameters (p = 2) is independent of the sample size and the output dimension. For the sake of clarity, we remark that the kernel size or the number of convolutional layers in a CNN does not necessarily depend on the input size; however, in order to detect the global morphological behavior of signals, the effective receptive field, and hence the kernel sizes or the network depth, must grow with the input length. In addition to Fig. 5 (c), the best test accuracies depending on the number of learnable parameters are given in Table 1. Here, the numbers of parameters are grouped into bins for easier interpretation. The results show that VPNet outperforms the CNNs and FCNNs in each bin, and reaches peak performance earlier than the other two. Besides the numerical comparison, statistical hypothesis testing was also performed for each bin, where applicable. The differences between the best-performing VPNets and CNNs are statistically significant by both paired-sample t-tests and McNemar's tests at the 5% significance level.
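The parameter-count argument can be made explicit with a small calculation. The layer shapes below are simplified sketches of the first layers being compared, with an illustrative 100-sample input and 8 output features:

```python
def vp_layer_params():
    """A Hermite-VP layer learns only the two nonlinear parameters (tau, lambda),
    independent of the input length and the number of output coefficients."""
    return 2

def fc_layer_params(input_len, n_out):
    """Weights + biases of a fully connected layer grow linearly with input length."""
    return input_len * n_out + n_out

def conv_layer_params(kernel_size, n_channels):
    """Kernel weights + biases of a 1D convolutional layer."""
    return kernel_size * n_channels + n_channels

counts = {
    "VP": vp_layer_params(),                   # 2
    "FC": fc_layer_params(100, 8),             # 808
    "Conv (kernel 25)": conv_layer_params(25, 8),  # 208
}
```

Doubling the input length leaves the VP count at 2, doubles the FC count, and leaves the per-layer convolution count unchanged but forces a larger receptive field (wider kernels or more layers) to capture global morphology.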

Real ECG data
We sought to prove the relevance of VPNet not only in simulation, but also using real signal-processing data. We chose a particular ECG signal-processing problem: classification of heartbeat arrhythmia (see Ref. 95). The state of the art is supervised ML by traditional approaches (see Refs. 19,96,97 and Section 2), including VP-based static feature extraction. 56,63,65 Here we focused on a related subproblem where we could compare the performance of the selected network configurations.
In more detail, we investigated the separation of the two largest arrhythmia classes: normal beats and ventricular ectopic beats (VEBs). The source of the data is the benchmark MIT-BIH Arrhythmia Database, 98 available from PhysioNet. 99 The database is split into sets DS1 and DS2 according to Ref. 100, for training and inference, respectively. The whole database contains more than 100 000 annotated heartbeats, but it is heavily biased towards the normal class, which usually distorts the performance evaluation. Here, we investigated two cases for data acquisition. First, a balanced subset was extracted: all VEBs and the same number of normal beats from each record. This yielded 4260 plus 4260 heartbeat signals for training (set DS1), and 3220 plus 3220 signals for testing (set DS2). This balanced subset is expected to provide an undistorted evaluation and a fair comparison of the NN architectures. The second, unbalanced subset consists of all normal beats and VEBs of the whole database, yielding around 50 000 heartbeats for both training and testing. This unbalanced subset represents a more realistic scenario, and supports partial comparability with the state of the art. Note that the DS1 and DS2 heartbeats come from different patients, which means that there is no data leakage in either case. We used the preprocessing and heartbeat extraction methods discussed in Ref. 63, but chose a window size of 100 samples (∼ 0.28 s) around the R-peak annotations. This window was expected to cover the whole QRS complex and potentially the PR and ST segments of each heartbeat. Example heartbeats of the two classes are displayed in Fig. 6. To demonstrate the interpretability of the results, we depicted the response of a trained VP layer to three input QRS complexes in Fig. 7. It can be seen that the Hermite-VP layer indeed learned the position τ and the width λ of the QRS complexes such that it gives an approximation (red) to the meaningful part of the original (blue) curves.
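The heartbeat windowing step can be sketched as below. This is a simplified illustration, not the full preprocessing of Ref. 63: the MIT-BIH records are sampled at 360 Hz, so 100 samples correspond to roughly 0.28 s around each annotated R peak.

```python
import numpy as np

def extract_beats(ecg, r_peaks, window=100):
    """Cut fixed-length windows centred on the annotated R peaks.

    Beats whose window would run past either end of the record are skipped,
    so every returned row has exactly `window` samples.
    """
    half = window // 2
    beats = []
    for r in r_peaks:
        if half <= r <= len(ecg) - (window - half):
            beats.append(ecg[r - half : r + (window - half)])
    return np.array(beats)
```

Each returned row is then a fixed-size input vector for the VP layer, which fits τ and λ to the QRS complex inside the window.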
In addition to the QRS complex, the input data window may include irrelevant information, such as baseline wander, noise, and parts of the P and T waves. However, this irrelevant information is discarded due to the optimization of τ and λ, and thus only the meaningful part of the input signal is approximated at the end of the training. Consequently, the VP layer is likely to be more tolerant to noise as well. In fact, the Hermite-VP representation of ECG recordings can simultaneously cope with various noise sources, such as baseline wander and power-line interference. 92 The layer can also retain diagnostically important morphological information via the extracted coefficients. In Fig. 7, the red curve is equal to the linear combination of the Hermite functions whose coefficients are the output of the VP layer. The magnitudes of these coefficients indicate the presence of the elementary components in the signal. For instance, Fig. 7 (b) shows an asymmetric QRS complex, which is reflected in a high coefficient c_2 that corresponds to an odd Hermite function. In contrast, Fig. 7 (c) plots a highly symmetric QRS complex, which resembles a Gaussian function, as indicated by the high value of c_1. Therefore, both the parameters τ, λ and the outputs c_i of the VP layer are interpretable. Note that the level of interpretability tends to decrease as more and more hidden layers are connected to the network. The reason is that the whole network does not seek to reconstruct the parameters with which the data were constructed, but rather searches for the parameters that maximize the distinctness of the classes. Since the term presented in Section 3.4 penalizes the model for not reconstructing the original signal, a larger value of α mitigates this decrease in interpretability. However, the VP layer provides a fully transparent feature extractor, which directly influences the output of the network due to the least-squares penalty in the modified loss function J_VP.
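The transparency of the VP layer comes from its closed-form forward pass: the output coefficients are the least-squares projection of the input onto the adaptive Hermite subspace, and the same projection yields the reconstruction used by the penalty term. A minimal NumPy sketch follows; the penalty weight α mirrors the text, while the classification loss term is left abstract.

```python
import numpy as np

def hermite_functions(t, n):
    """Orthonormal Hermite functions phi_0..phi_{n-1} via the standard recurrence."""
    phi = np.zeros((n, t.size))
    phi[0] = np.pi ** -0.25 * np.exp(-t ** 2 / 2)
    if n > 1:
        phi[1] = np.sqrt(2.0) * t * phi[0]
    for k in range(1, n - 1):
        phi[k + 1] = (np.sqrt(2.0 / (k + 1)) * t * phi[k]
                      - np.sqrt(k / (k + 1)) * phi[k - 1])
    return phi

def vp_forward(x, t, tau, lam, n):
    """VP layer: coefficients c = Phi^+ x and the least-squares reconstruction Phi c."""
    Phi = (np.sqrt(lam) * hermite_functions(lam * (t - tau), n)).T  # (len(t), n)
    c, *_ = np.linalg.lstsq(Phi, x, rcond=None)
    return c, Phi @ c

def vp_loss(cls_loss, x, x_hat, alpha=0.1):
    """J_VP = classification loss + alpha * ||x - x_hat||^2 (least-squares penalty)."""
    return cls_loss + alpha * np.sum((x - x_hat) ** 2)
```

Because the coefficients come from an explicit projection rather than opaque weights, a high c_1 or c_2 can be read off directly as a symmetric or asymmetric component, as in Fig. 7.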
Therefore, a trained VP layer can be used to improve the generalization properties of DNNs by synthesizing more realistic data samples in the learned feature space. 101,102 The performance of VPNet was measured in a similar way as in the synthetic case, with more than 3500 possible hyperparameter configurations examined. The aggregated results are presented in Fig. 8 (a) and (b), for the balanced and unbalanced case, respectively. Here, the FCNN and CNN cases were restricted so that the output dimensions of the first layer were similar to the VP dimensions, and only the number of neurons in the hidden layer was varied. Note that VPNet again required a remarkably low number of network parameters. We also evaluated another, larger FCNN configuration (FCNN++), where the number of neurons in the first layer was not restricted to the VP dimension n, but was the same as in the second, hidden layer. The structure and distribution of the training and test data were more complex than in the synthetic case, which clearly made the classification task more difficult for all network architectures. Again, we conclude that VPNet can outperform FCNNs and CNNs for low-complexity networks. Note that VPNet reaches peak performance at low network complexity (i.e., at a low number of hidden neurons and thus a low number of network parameters), and the performance starts to decrease early as the complexity increases. This behaviour is slightly different for CNNs and FCNNs. A possible reason is that the first layer of VPNet acts as model-based feature extraction, i.e., it provides a low-dimensional sparse representation of the input (4 or 8 features for 100 samples). Increasing the complexity of the fully connected layers of VPNet without increasing the number of VP parameters or features will lead to over-parametrization and overfitting. In addition to the total accuracies, the usual performance metrics are also provided in Table 2.
Namely, sensitivity/recall (Se) and positive predictivity/precision (+P) were evaluated for each class as Se = TP / (TP + FN) and +P = TP / (TP + FP), where TP, FP, and FN are the numbers of true positive, false positive, and false negative matches, respectively. Reference intervals of the state of the art are also given according to the survey in Ref. 19. We note that a direct comparison is not always possible, since most of these results refer to a 3- or 5-class classification of the database.
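These metrics follow directly from the confusion counts; a minimal sketch (the class labels below are illustrative):

```python
def se_pp(y_true, y_pred, positive):
    """Sensitivity Se = TP/(TP+FN) and positive predictivity +P = TP/(TP+FP)
    for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    se = tp / (tp + fn) if tp + fn else 0.0
    pp = tp / (tp + fp) if tp + fp else 0.0
    return se, pp

# e.g. "V" = ventricular ectopic beat, "N" = normal
se, pp = se_pp(["V", "V", "N", "N"], ["V", "N", "N", "V"], positive="V")
```

On a heavily unbalanced set, reporting Se and +P per class is what keeps the dominant normal class from masking errors on the VEB class.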

Conclusion
We developed a novel model-driven NN that incorporates expert knowledge via variable projections. The VP layer is a generic, learnable feature extractor or filtering method that can be adjusted to various 1D signal-processing problems by choosing an application-specific function system. The proposed architecture is simple in the sense that it has only a few, interpretable parameters. Our case studies showed that VPNet can achieve similar or slightly better classification accuracy than its fully connected and CNN counterparts while using a smaller number of parameters. In our tests, the convergence of VPNet was slightly better than that of the CNN and FCNN counterparts. Moreover, the VP layer required only two learnable parameters in all cases, whereas the number of weights and biases of the FCNN and CNN grew linearly with the length of the input signals. These results show that VPNet can be applied effectively to various problems in 1D signal processing, including classification, regression, and clustering, which we will investigate as part of future work.

Broader Impact
We have proposed a new compact and interpretable neural network architecture that can have a broader impact mainly in two fields: machine learning and signal processing. The key idea is to create a network that combines the representation abilities of variable projections and the prediction abilities of NNs in the form of a composite model. This concept can be generalized to other machine learning algorithms. For instance, VP-SVM and other combined VP methods, such as VP-K-means and VP-C-means, can extend the potential areas of application, including classification, regression, and clustering problems. Since the nonlinear parameters of the VP layer are interpretable, they can also be used in feature-space augmentation, where new data are generated from existing ones in order to improve the generalization properties of DNNs. 101,102 Signal-processing aspects of VPNet were discussed in the ECG heartbeat classification case study. Additionally, VPNet may have great potential in a wide range of applications, especially where VP has proven to be an efficient estimation method (cf. Section 4.2). Note that many existing adaptive signal models have been reformulated as VP problems; 52,53 however, parameterized wavelets 103 have not yet been studied in this context. Therefore, we encourage researchers to study this class of wavelets in the framework of VPNet.
Model-driven neural network solutions can have a great impact in biomedical engineering and healthcare informatics, where medical data classification alone is usually not enough, as physiological interpretation and explainability of the results are also important. However, special care should be taken to avoid automation bias when these approaches are applied to real-world problems. 104 Such clinical decision-support systems are difficult to validate, since this requires medical expertise and vast amounts of data. The latter is naturally unbalanced in the sense that one class of signals (e.g., from healthy patients) is overrepresented compared to the others. In order to address these potential biases, VPNet should be tested in various scenarios that include, for instance, noisy and incomplete measurements, or unbalanced data.

Acknowledgments
This work has been supported by the "University SAL Labs" initiative of Silicon Austria Labs (SAL) and its Austrian partner universities for applied fundamental research for electronic based systems; and the COMET-K2 "Center for Symbiotic Mechatronics" of the Linz Center of Mechatronics (LCM), funded by the Austrian federal government and the federal state of Upper Austria; and EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle Control Technologies -The Project is supported by the Hungarian Government and co-financed by the European Social Fund. This paper was supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences.

Data Availability
The data and code that support the findings of this study are available at Ref. 105.