1 Introduction

In image data, similar objects may occur at highly varying scales. Examples are cars or pedestrians at different distances from the camera, cracks in concrete of varying thickness or imaged at different resolutions, or blood vessels in biomedical applications (see Fig. 1). It is natural to assume that the same object or structure at different scales should be treated equally, i.e., should have equal or at least similar features. This property is called scale or dilation invariance and has been investigated in detail in classical image processing [1,2,3].

Neural networks have proven to segment and classify robustly and accurately in many computer vision tasks. Nowadays, the most popular and successful neural networks are convolutional neural networks (CNNs). It would be desirable that neural networks share typical properties of human vision such as translation, rotation, or scale invariance. While this is true for translation invariance, CNNs are not scale or rotation invariant by default. This is due to the extensive use of convolutions, which are local operators. Moreover, training sets often contain a very limited number of scales. To overcome this problem, CNNs are often trained with rescaled images through data augmentation. However, when a CNN is given input whose scale lies outside the range covered by the training set, it will not be able to generalize [4, 5]. As a remedy, a CNN trained at a fixed scale can be applied to several rescaled versions of the input image and the results can be combined. This, however, requires multiple runs of the network.

One application example where the challenges just described naturally occur is the task of segmenting cracks in 2d or 3d gray scale images of concrete. Crack segmentation in 2d has been an actively researched topic in civil engineering, see [6] for an overview. Cracks are naturally multiscale structures (Fig. 1, left) and hence require multiscale treatment. Nevertheless, adaptation to scale (crack thickness) has not been treated explicitly so far.

Recently, crack segmentation in 3d images obtained by computed tomography (CT) has become a subject of interest [6, 7]. Here, the effect of varying scales is even more pronounced [8]: crack thicknesses can vary from a single pixel to more than 100 pixels. Hence, the aim is to design and evaluate crack segmentation methods that work equally well on all possible crack widths without complicated adjustment by the user.

Fig. 1

Examples of similar objects appearing on different scales: section of a CT image of concrete showing a crack of locally varying thickness (left) and pedestrians at different distances from the camera (right, taken from [9])

In this work, we focus on 2d multiscale crack segmentation in images of concrete samples. We design the Riesz network which replaces the popular 2d convolutions by first- and second-order Riesz transforms to allow for a scale-invariant spatial operation. The resulting neural network is provably scale-invariant in a single forward pass. It is sufficient to train the Riesz network on a single scale or crack thickness only. The network then generalizes automatically, without any adjustments or rescaling, to completely unseen scales. We validate the network performance using images with simulated cracks of constant and varying widths generated as described in [6, 10]. Our network is compared with competing methods for multiscale segmentation and finally applied to real multiscale cracks observed in 2d slices of tomographic images.

There is just one publicly available dataset which allows for testing scale equivariance—MNIST Large Scale [5]. Additional experiments with the Riesz network on this dataset are reported in “Appendix A”.

1.1 Related Work

1.1.1 The Riesz Transform

The Riesz transform is a generalization of the Hilbert transform to higher dimensional spaces, see, e.g., [11]. First practical applications of the Riesz transform arose in signal processing through the definition of the monogenic signal [12], which enables a decomposition of higher dimensional signals into local phase and local amplitude. First, a bandpass filter is applied to the signal to isolate a band of frequencies. Using the Riesz transform, the local phase and amplitude can then be calculated for the selected range of frequencies. For more details, we refer to [12, 13].

As images are 2d or 3d signals, applications of the Riesz transform naturally extend to the fields of image processing and computer vision through the Poisson scale space [14, 15] which is an alternative to the well-known Gaussian scale space. Köthe [16] compared the Riesz transform with the structure tensor from a signal processing perspective. Unser and van de Ville [17] related higher-order Riesz transforms and derivatives. Furthermore, they give a reason for preferring the Riesz transform over the standard derivative operator: The Riesz transform does not amplify high frequencies.

Higher-order Riesz transforms were also used for the analysis of local image structures using ideas from differential geometry [18, 19]. Benefits of using the first- and second-order Riesz transforms as low-level features have also been shown in measuring similarity [20], analysis and classification of textures [11, 21], and orientation estimation [22, 23]. The Riesz transform can be used to create steerable wavelet frames, so-called Riesz-Laplace wavelets [17, 24], which were the first to exploit the scale equivariance property of the Riesz transform and which inspired the design of quasi-monogenic shearlets [25].

Interestingly, in early works on the Riesz transform in signal and image processing [12, 13, 15], scale equivariance was not recognized as a feature of the Riesz transform and hence remained sidelined. The benefits of scale equivariance were shown later in [17, 19].

Recently, the Riesz transform found its way into the field of deep learning: Riesz transform features are used as supplementary features in classical CNNs to improve robustness [26]. In our work, we use the Riesz transform to extract low-level features from images and use them as basis functions which replace trainable convolutional filters in CNNs or the Gaussian derivatives in [27].

1.1.2 Scale-Invariant Deep Learning Methods

Deep learning methods with mechanisms to handle variations in scale effectively can be split into two groups based on their scale generalization ability.

Scale-invariant deep learning methods for a limited range of scales

The first group can handle multiscale data but is limited to the scales represented either by the training set or by the neural network architecture. The simplest approach to learn multiscale features is to apply the convolutions to several rescaled versions of the images or feature maps in every layer and to combine the results by maximum pooling [4] or by keeping the scale with maximal activation [28] before passing them to the next layer. In [29, 30], several approaches based on downscaling images or feature maps with the goal of designing robust multiscale object detectors are summarized. However, scaling is included in the network architecture such that scales have to be selected a priori. Therefore, this approach only yields local scale invariance, i.e., an adaptation to the scale observed in a given input image is not possible after training.

Another intuitive approach is to rescale trainable filters, i.e., convolutions, by interpolation [31]. In [29], a new multiscale strategy was derived which uses convolution blocks of varying sizes sequenced in several downscaling layers creating a pyramidal shape. The pyramidal structure is utilized for learning scale dependent features and making predictions in every downsampling layer. Layers can be trained according to object size. That is, only the part of the network relevant for the object size is optimized. This guarantees robustness to a large range of object scales. Similarly, in [30], a network consisting of a downsampling pyramid followed by an upsampling pyramid is proposed. Here, connections between pyramid levels are devised for combining low- and high-resolution features and predictions are also made independently on every pyramid level. However, in both cases, scale generalization properties of the networks are restricted by their architecture, i.e., by the depth of the network (number of levels in the image pyramid), the size of convolutions as spatial operators as well as the size of the input image.

Spatial transformer networks [32] focus on invariance to affine transformations including scale. This is achieved by using a so-called localization network which learns transformation parameters. Finally, using these transformation parameters, a new sampling grid can be created and feature maps are resampled to it. These parts form a trainable module which is able to handle and correct the effect of the affine transformations. However, spatial transformer networks do not necessarily achieve invariant recognition [33]. Also, it is not clear how this type of network would generalize to scales not represented in the training set.

In [34], so-called structured receptive fields are introduced. Linear combinations (\(1\times 1\) convolutions) of basis functions (in this case Gaussian derivatives up to fourth order) are used to learn complex features and to replace convolutions (e.g., of size \(3\times 3\) or \(5\times 5\)). As a consequence, the number of parameters is reduced, while the expressiveness of the neural network is preserved. This type of network works better than classical CNNs in the case where little training data is available. However, the standard deviation parameters of the Gaussian kernels are manually selected and kept fixed. Hence, the scale generalization ability remains limited.

Making use of the semi-group property of scale spaces, scale-equivariant neural networks employ dilated convolutions [35] to define scale-equivariant convolutions on the Gaussian scale space [36] or on morphological scale spaces [37]. Unfortunately, these neural networks are unable to generalize to scales outside those determined by their architecture and are only suitable for downscale factors which are powers of 2, i.e., \(\{2,4,8,16,\ldots \}\). Furthermore, scale-equivariant steerable networks [38] show how to design scale-invariant networks on the scale-translation group without using standard or dilated convolutions. Following an idea from [34], convolutions are replaced by linear combinations of basis filters (Hermite polynomials with a Gaussian envelope). While this allows for non-integer scales, scales are still limited to powers of a positive scaling factor a. The scale space is again discretized and sampled. Hence, a generalization to arbitrary scales is not guaranteed.

Scale-invariant deep learning methods for arbitrary scales

The second group of methods can generalize to arbitrary scales, i.e., any scale in the range bounded from below by the image resolution and from above by the image size, but not necessarily contained in the training set. Our Riesz network also belongs to this second group of methods.

An intuitive approach is to train standard CNNs on a fixed range of scales and enhance their scale generalization ability by the following three-step procedure based on image pyramids: downsample the image by a factor \(a>1\), apply a forward pass of the CNN, and upsample the result by \(\frac{1}{a}\) to the original image size [5, 8]. Finally, forward passes of the CNN for several downscaling factors \(a_1, \ldots , a_n > 0\), \(n\in {\mathbb {N}}\), are aggregated by using the maximum or average operator across the scale dimension. This approach indeed guarantees generalization to unseen scales as the scales can be adapted to the input image and share the weights of the network [5]. However, it requires multiple forward passes of the CNN and the downsampling factors have to be selected by the user.
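As an illustration, such a rescale-and-aggregate procedure could be sketched in PyTorch as follows; `model` stands for any fixed-scale segmentation network and the downscaling factors are user-chosen placeholders of ours, not part of [5, 8].

```python
# Hedged sketch of multiscale inference by rescaling: downsample, run the
# fixed-scale network, upsample, and aggregate by the pixel-wise maximum.
import torch
import torch.nn.functional as F

def multiscale_inference(model, image, factors=(1, 2, 4, 8)):
    """image: tensor of shape (1, 1, H, W); returns the maximum over scale channels."""
    h, w = image.shape[-2:]
    outputs = []
    for a in factors:
        small = image if a == 1 else F.interpolate(
            image, scale_factor=1.0 / a, mode="bilinear", align_corners=False)
        pred = model(small)                               # one forward pass per scale
        pred = F.interpolate(pred, size=(h, w),
                             mode="bilinear", align_corners=False)
        outputs.append(pred)
    return torch.stack(outputs, dim=0).max(dim=0).values  # aggregate across scales
```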

Inspired by scattering networks [39, 40], normalized differential operators based on first- and second-order Gaussian derivatives stacked in layers or in a cascade can be used to extract more complex features [41]. Subsequently, these features serve as input for a classifier such as a support vector machine. By varying the standard deviation parameter \(\sigma \) of the Gaussian kernel, generalization to new scales can be achieved. However, this type of network only creates handcrafted complex scale-invariant features and hence is not trainable.

Its expansion to trainable networks by creating so-called Gaussian derivative networks [27] is one of the main inspirations for our work. For combining spatial information, \(\gamma \)-normalized Gaussian derivatives are used as scale-equivariant operators (\(\gamma = 1\)). Similarly to [34], linear combinations of normalized derivatives are used to learn more complex features in the spirit of deep learning. However, the Gaussian derivative network is based on ideas from scale space theory and requires the specification of a sufficiently wide range of scales covering all the scales present in the training and test sets. Hence, the scale dimension needs to be discretized and sampled densely. These networks have a separate channel for every scale and the network weights are shared between channels. Scale invariance is achieved by maximum pooling over the multiple scale channels.

2 The Riesz Transform

Let \(L_2({\mathbb {R}}^d) = \{ f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\text { } | \text { } \int _{{\mathbb {R}}^d}{|f(x)|^2dx < \infty }\}\) be the set of square integrable functions.

Formally, for a d-dimensional signal \(f\in L_2({\mathbb {R}}^d)\) (i.e., an image or a feature map), the first-order Riesz transform \({\mathcal {R}}=({\mathcal {R}}_1,\ldots ,{\mathcal {R}}_d)\) with components \({\mathcal {R}}_{j}: L_2({\mathbb {R}}^d) \rightarrow L_2({\mathbb {R}}^d)\) is defined in the spatial domain as

$$\begin{aligned} {\mathcal {R}}_{j}(f)(x) = C_d \lim _{\varepsilon \rightarrow 0}{\int _{{\mathbb {R}}^d \setminus B_{\varepsilon }}{\frac{y_jf(x-y)}{|y|^{d+1}}{\textrm{d}}y}}, \end{aligned}$$
(1)

where \(C_d = \Gamma ((d+1)/2)/\pi ^{(d+1)/2}\) is a normalizing constant and \(B_{\varepsilon }\) is the ball of radius \(\varepsilon \) centered at the origin. Alternatively, the Riesz transform can be defined in the frequency domain via the Fourier transform \({\mathcal {F}}\)

$$\begin{aligned} {\mathcal {F}}({\mathcal {R}}_j(f))(u) = -i\frac{u_j}{|u|}{\mathcal {F}}(f)(u) = \frac{1}{|u|}{\mathcal {F}}(\partial _j f)(u), \end{aligned}$$
(2)

for \(j \in \{1,\ldots ,d\}\). Higher-order Riesz transforms are defined by applying a sequence of first-order Riesz transforms. That is, for \(k_1,k_2,...,k_d\in {\mathbb {N}}\cup \{0\}\) we set

$$\begin{aligned} {\mathcal {R}}^{(k_1,k_2,...,k_d)} (f)(x) := {\mathcal {R}}^{k_1}_1( {\mathcal {R}}^{k_2}_2( \cdots ({\mathcal {R}}^{k_d}_d(f(x)))), \end{aligned}$$
(3)

where \({\mathcal {R}}_j^{k_j}\) refers to applying the Riesz transform \({\mathcal {R}}_j\) \(k_j\) times in a sequence.
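For concreteness, the transforms can be computed directly in the frequency domain via Eq. (2). The following NumPy sketch (ours; the discretization and the function name are not from the original work) returns the first- and second-order Riesz transforms of a 2d image.

```python
# Minimal NumPy sketch of the first- and second-order Riesz transforms of a
# 2d image, computed in the frequency domain according to Eq. (2).
import numpy as np

def riesz_transforms_2d(f):
    """Return R1, R2, R^(2,0), R^(1,1), R^(0,2) of a 2d array f."""
    h, w = f.shape
    u = np.fft.fftfreq(h).reshape(-1, 1)     # frequencies along axis 0
    v = np.fft.fftfreq(w).reshape(1, -1)     # frequencies along axis 1
    norm = np.sqrt(u ** 2 + v ** 2)
    norm[0, 0] = 1.0                         # avoid division by zero at zero frequency
    H1, H2 = -1j * u / norm, -1j * v / norm  # transfer functions of R_1 and R_2
    F = np.fft.fft2(f)
    R1 = np.real(np.fft.ifft2(H1 * F))
    R2 = np.real(np.fft.ifft2(H2 * F))
    R20 = np.real(np.fft.ifft2(H1 * H1 * F))
    R11 = np.real(np.fft.ifft2(H1 * H2 * F))
    R02 = np.real(np.fft.ifft2(H2 * H2 * F))
    return R1, R2, R20, R11, R02
```

Note that the transfer functions depend only on the direction \(u/|u|\) and not on the magnitude \(|u|\), so no scale parameter appears; this is the frequency-domain view of the scale equivariance discussed below.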

Fig. 2

Visualizations of Riesz transform kernels of first and second orders. First row (from left to right): \({\mathcal {R}}_1\) and \({\mathcal {R}}_2\). Second row (from left to right): \({\mathcal {R}}^{(2,0)}\), \({\mathcal {R}}^{(1,1)}\), and \({\mathcal {R}}^{(0,2)}\)

The Riesz transform kernels of first- and second-order resemble those of the corresponding derivatives of smoothing filters such as Gaussian or Poisson filters (Fig. 2). This can be explained by the following relations

$$\begin{aligned}&{\mathcal {R}}(f) =(-1)(-\triangle )^{-1/2} \nabla f \end{aligned}$$
(4)
$$\begin{aligned}&{\mathcal {R}}^{(k_1,k_2,...,k_d)} (f)(x)= (-1)^N(- \triangle )^{-N/2} \frac{\partial ^N f(x)}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}}, \end{aligned}$$
(5)

for \(k_1 + \cdots + k_d=N\) and \(N\in {\mathbb {N}}\). The fractional Laplace operator \((-\triangle )^{-N/2}\) acts as an isotropic low-pass filter. The main properties of the Riesz transform can be summarized in the following way [17]:

  • translation equivariance: For \(x_0 \in {\mathbb {R}}^d\), define the translation operator \({\mathcal {T}}_{x_0}:L_2({\mathbb {R}}^d) \rightarrow L_2({\mathbb {R}}^d)\) by \({\mathcal {T}}_{x_0}(f)(x) = f(x-x_0)\). It holds that

    $$\begin{aligned} {\mathcal {R}}_j({\mathcal {T}}_{x_0}(f))(x) = {\mathcal {T}}_{x_0}({\mathcal {R}}_j(f))(x), \end{aligned}$$
    (6)

    where \(j\in \{1,\ldots ,d\}\). This property reflects the fact that the Riesz transform commutes with the translation operator.

  • steerability: The directional Hilbert transform \({\mathcal {H}}_v:L_2({\mathbb {R}}^d) \rightarrow L_2({\mathbb {R}}^d)\) in direction \(v\in {\mathbb {R}}^d\), \(||v||=1\), is defined via \({\mathcal {F}}({\mathcal {H}}_v(f))(u) = i \text { } {{\,\textrm{sign}\,}}(\langle u,v\rangle ){\mathcal {F}}(f)(u)\). \({\mathcal {H}}_v\) is steerable in terms of the Riesz transform, that is, it can be written as a linear combination of the Riesz transforms

    $$\begin{aligned} {\mathcal {H}}_v(f)(x) = \sum _{j=1}^d v_j {\mathcal {R}}_j(f)(x) = \langle {\mathcal {R}} (f)(x), v\rangle . \end{aligned}$$
    (7)

    Note that in 2d, for a unit vector \(v=(\cos \theta , \sin \theta ),\ \theta \in \left[ 0,2\pi \right] \), the directional Hilbert transform becomes \({\mathcal {H}}_v(f)(x) = \cos (\theta ){\mathcal {R}}_1(f)(x) + \sin (\theta ){\mathcal {R}}_2(f)(x)\). This is equivalent to the link between gradient and directional derivatives [17] and a very useful property for learning oriented features.

  • all-pass filter [12]: Let \(H = (H_1, \cdots , H_d)\) be the Fourier transform of the Riesz kernel, i.e., \({\mathcal {F}}({\mathcal {R}}_j(f))(u) = i\frac{u_j}{|u|}{\mathcal {F}}(f)(u) = H_j(u){\mathcal {F}}(f)(u)\). The energy of the Riesz transform for frequency \(u \in {\mathbb {R}}^d\) is defined as the norm of the d-dimensional vector H(u) and has value 1 for all non-zero frequencies \(u\ne 0\), i.e.,

    $$\begin{aligned} ||H(u)|| = 1, \quad u \ne 0. \end{aligned}$$
    (8)

    The all-pass filter property reflects the fact that the Riesz transform is a non-local operator and that every frequency is treated fairly and equally. Combined with scale equivariance, this eliminates the need for multiscale analysis or multiscale feature extraction.

  • scale (dilation) equivariance: For \(a>0\) define a dilation or rescaling operator \(L_{a}:L_2({\mathbb {R}}^d) \rightarrow L_2({\mathbb {R}}^d)\) as \(L_{a}(f)(x) = f(\frac{x}{a})\). Then

    $$\begin{aligned} {\mathcal {R}}_j(L_{a}(f))(x) = L_{a}({\mathcal {R}}_j(f))(x), \end{aligned}$$
    (9)

    for \(j\in \{1,\ldots ,d\}\). That is, the Riesz transform does not only commute with translations but also with scaling.

Scale equivariance enables an equal treatment of the same objects at different scales. As this is the key property of the Riesz transform for our application, we briefly present a proof. We restrict ourselves to the first-order transform and work in the Fourier domain. The proof for higher orders follows directly from the one for the first order.

Lemma 1

The Riesz transform is scale-equivariant, i.e.,

$$\begin{aligned} {\mathcal {R}}_{j}\Big (f\Big (\frac{\cdot }{a}\Big )\Big ) = [{\mathcal {R}}_{j}f]\Big (\frac{\cdot }{a}\Big ), \end{aligned}$$
(10)

for all \(f \in L_2({\mathbb {R}}^d)\) and \(a>0\).

Proof

Remember that the Fourier transform of a dilated function is given by \({\mathcal {F}}(f(\alpha \cdot ))(u) = \frac{1}{\alpha ^d} {\mathcal {F}}(f)(\frac{u}{\alpha })\). Setting \(g(x) = f(\frac{x}{a})\), we have \({\mathcal {F}}(g)(u) = a^d {\mathcal {F}}(f)(au)\). This yields

$$\begin{aligned} {\mathcal {F}}\Bigg ({\mathcal {R}}_j\Big (f\big (\frac{\cdot }{a}\big )\Big )\Bigg )(u)&= {\mathcal {F}}\Big ({\mathcal {R}}_j(g)\Big )(u) = i \frac{u_j}{|u|} {\mathcal {F}}(g)(u) \\&= i \frac{u_j}{|u|} a^d {\mathcal {F}}(f)(au) = a^d \Big (i\frac{au_j}{a|u|}\Big ){\mathcal {F}}(f)(au) \\&= a^d {\mathcal {F}}\Big ({\mathcal {R}}_j(f)\Big )(au) = {\mathcal {F}}\Bigg ({\mathcal {R}}_j(f)\Big (\frac{\cdot }{a}\Big )\Bigg )(u). \end{aligned}$$

\(\square \)

Fig. 3

Illustration of the Riesz transform on a mock example of \(550\times 550\) pixels: aligned rectangles with equal aspect ratio and constant gray value 255 (left) and response of the second-order Riesz transform \({\mathcal {R}}^{(2,0)}\) of the left image sampled horizontally through the centers of the rectangles (right)

Figure 3 provides an illustration of the scale equivariance. It shows four rectangles with length-to-width ratio 20 and varying width (3, 5, 7, and 11 pixels) together with the gray value profile of the second-order Riesz transform \({\mathcal {R}}^{(2,0)}\) along a linear section through the centers of the rectangles. In spite of the different widths, the Riesz transform yields equal filter responses for each rectangle (up to rescaling). In contrast, to achieve the same behavior in Gaussian scale space, the scale space has to be sampled (i.e., a subset of scales has to be selected), the \(\gamma \)-normalized derivative [1] has to be calculated for every scale, and finally, the scale yielding the maximal absolute value has to be selected. In comparison, the Riesz transform achieves the same with a single transform, without sampling the scale space and without the need for a scale parameter.

3 Riesz Transform Neural Networks

In the spirit of structured receptive fields [34] and Gaussian derivative networks [27], we use the Riesz transforms of first and second orders instead of standard convolutions to define Riesz layers. As a result, Riesz layers are scale-equivariant in a single forward pass. Replacing standard derivatives with the Riesz transform has been motivated by [16], while using a linear combination of Riesz transforms of several orders follows [21].

3.1 Riesz Layers

The base layer of the Riesz networks is defined as a linear combination of Riesz transforms of several orders implemented as 1d convolution across feature channels (Fig. 4). Here, we limit ourselves to first- and second-order Riesz transforms. Thus, the linear combination reads as

$$\begin{aligned} J_{{\mathcal {R}}}(f)&= C_0 + \sum _{k=1}^{d}{C_k\cdot {\mathcal {R}}_k(f)} + \nonumber \\&\quad +\ \mathop {\sum }_{k,l\in {\mathbb {N}}_0, k+l=2}{C_{k,l}\cdot {\mathcal {R}}^{(k,l)}(f)}, \end{aligned}$$
(11)

where \(\{C_0, C_k \vert k\in \{1,\ldots ,d\} \}\cup \{C_{k,l} \vert l,k \in {\mathbb {N}}_0, l+k=2\}\) are parameters that are learned during training.

Now we can define the general layer of the network (Fig. 4). Let us assume that the Kth network layer takes input \(F^{(K)} = ( F^{(K)}_1,\ldots , F^{(K)}_{c^{(K)}}) \in \big (L_2({\mathbb {R}}^d)\big )^{c^{(K)}}\) with \(c^{(K)}\) feature channels and has output \(F^{(K+1)}= (F^{(K+1)}_1,\ldots ,F^{(K+1)}_{c^{(K+1)}})\in \big (L_2({\mathbb {R}}^d)\big )^{c^{(K+1)}}\) with \(c^{(K+1)}\) channels. Then the output in channel \(j \in \{1,\ldots ,c^{(K+1)} \}\) is given by

$$\begin{aligned} F^{(K+1)}_j = \sum _{i=1}^{c^{(K)}} J_{K}^{(j,i)}(F^{(K)}_i). \end{aligned}$$
(12)

Here, \(J_{K}^{(j,i)}\) is defined in the same way as \(J_{{\mathcal {R}}}(f)\) from Eq. (11), but trainable parameters may vary for different input channels i and output channels j, i.e.,

$$\begin{aligned} J^{(j,i)}_K(f)&= C_0^{(j,i,K)} + \sum _{k=1}^{d}{C_k^{(j,i,K)}\cdot {\mathcal {R}}_k(f)} + \nonumber \\&\quad + \mathop {\sum }_{k,l\in {\mathbb {N}}_0, k+l=2}{C_{k,l}^{(j,i,K)}\cdot {\mathcal {R}}^{(k,l)}(f)}. \end{aligned}$$
(13)

In practice, the offset parameters \(C_0^{(j,i,K)}\), \(i=1,\ldots ,c^{(K)}\), are replaced by a single parameter defined as \(C_0^{(j,K)}:= \sum _{i=1}^{c^{(K)}}C_0^{(j,i,K)}\).
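A possible PyTorch sketch of such a layer is given below: the five Riesz responses (two first-order and three second-order) of each input channel are stacked and mixed by a trainable \(1\times 1\) convolution, whose weights play the role of the coefficients \(C_k^{(j,i,K)}\), \(C_{k,l}^{(j,i,K)}\) and whose bias realizes the single offset per output channel. The names and the FFT-based discretization are ours; the authors' implementation may differ.

```python
# Hedged sketch of the Riesz layer from Eqs. (11)-(13).
import torch
import torch.nn as nn

def riesz_features(x):
    """x: real tensor of shape (B, C, H, W) -> (B, 5*C, H, W) holding
    R1, R2, R^(2,0), R^(1,1), R^(0,2) for every input channel."""
    _, _, height, width = x.shape
    u = torch.fft.fftfreq(height, device=x.device).view(1, 1, height, 1)
    v = torch.fft.fftfreq(width, device=x.device).view(1, 1, 1, width)
    norm = torch.sqrt(u ** 2 + v ** 2)
    norm[..., 0, 0] = 1.0                        # avoid division by zero at zero frequency
    H1, H2 = -1j * u / norm, -1j * v / norm      # transfer functions, cf. Eq. (2)
    X = torch.fft.fft2(x)
    responses = [H1 * X, H2 * X, H1 * H1 * X, H1 * H2 * X, H2 * H2 * X]
    return torch.cat([torch.fft.ifft2(r).real for r in responses], dim=1)

class RieszLayer(nn.Module):
    """Linear combination of Riesz responses: a 1x1 convolution with
    5 * c_in * c_out weights and one offset (bias) per output channel."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.mix = nn.Conv2d(5 * in_channels, out_channels, kernel_size=1, bias=True)

    def forward(self, x):
        return self.mix(riesz_features(x))
```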

Fig. 4

Building blocks of Riesz networks: the base Riesz layer from Eq. (11) (left) and the full Riesz layer from Eq. (12) (right)

3.2 Proof of Scale Equivariance

We prove the scale equivariance for \(J_{{\mathcal {R}}}(f)\). That implies scale equivariance for \(J^{(j,i)}_K(f)\) and consequently for \(F^{(K+1)}_j\) for arbitrary layers of the network. By construction (see Sect. 3.3), this will result in provable scale equivariance for the whole network. Formally, we show that \(J_{{\mathcal {R}}}(f)\) from Eq. (11) is scale-equivariant, i.e.,

$$\begin{aligned} J_{{\mathcal {R}}}\left( f(\frac{\cdot }{a})\right) = J_{{\mathcal {R}}}(f) (\frac{\cdot }{a}), \end{aligned}$$
(14)

for \(f \in L_2({\mathbb {R}}^d)\).

Proof

For any scaling parameter \(a > 0\) and \(x\in {\mathbb {R}}^d\), we have

$$\begin{aligned} J_{{\mathcal {R}}}\left( f(\frac{\cdot }{a})\right) (x)&= C_0 + \sum _{k=1}^{d}{C_k\cdot {\mathcal {R}}_k\left( f(\frac{\cdot }{a})\right) (x)} \\&\quad +\mathop {\sum }_{k,l\in {\mathbb {N}}_0, k+l=2}{C_{k,l}\cdot {\mathcal {R}}^{(k,l)}\left( f(\frac{\cdot }{a})\right) (x)} \\&\overset{\mathrm {(10)}}{=}C_0 + \sum _{k=1}^{d}{C_k\cdot {\mathcal {R}}_k(f)(\frac{x}{a})} \\&\quad +\mathop {\sum }_{k,l\in {\mathbb {N}}_0, k+l=2}{C_{k,l}\cdot {\mathcal {R}}^{(k,l)}(f)(\frac{x}{a})} \\&= J_{{\mathcal {R}}}(f) (\frac{x}{a}). \end{aligned}$$

\(\square \)

3.3 Network Design

A basic building block of modern CNNs is a sequence of the following operations: batch normalization, spatial convolution (e.g., \(3\times 3\) or \(5\times 5\)), the rectified linear unit (ReLU) activation function, and max pooling. Spatial convolutions have by default a limited size of the receptive field and max pooling is a downsampling operation performed on a window of fixed size. For this reason, these two operations are not scale-equivariant and consequently CNNs are sensitive to variations in scale. Hence, among the classical operations, only batch normalization [42] and ReLU activation preserve scale equivariance. To build our neural network from scale-equivariant transformations only, we restrict ourselves to batch normalization, ReLUs, and Riesz layers, which serve as a replacement for spatial convolutions. In our setting, max pooling can be avoided completely since its main purpose is to increase the size of the receptive field of cascaded spatial convolutions while reducing the number of parameters.

Generally, a layer consists of the following sequence of transformations: batch normalization, Riesz layer, and ReLU. Batch normalization improves the training capabilities and avoids overfitting, ReLUs introduce non-linearity, and the Riesz layers extract scale-equivariant spatial features. For every layer, the number of feature channels has to be selected. Hence, our network with \(K \in {\mathbb {N}}\) layers can simply be defined by a \((K+2)\)-tuple specifying the channel sizes, e.g., \((c^{(0)}, c^{(1)}, \ldots , c^{(K)},c^{(K+1)})\). The final layer is defined as a linear combination of the features from the previous layer followed by a sigmoid function yielding the desired probability map as output.

The four-layer Riesz network we apply here can be schematically written as \(1 \rightarrow 16 \rightarrow 32 \rightarrow 40 \rightarrow 48 \rightarrow 1\) and has \((1\cdot 5\cdot 16+16)+(16\cdot 5\cdot 32+32)+(32\cdot 5\cdot 40+40)+(40\cdot 5\cdot 48+48)+(48\cdot 1+1)=18\,825\) trainable parameters.
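The quoted figure can be reproduced from the channel sizes: each Riesz layer contributes five coefficients (two first-order and three second-order Riesz transforms) per pair of input and output channels plus one offset per output channel. The short check below (ours) mirrors the formula above and therefore omits the batch normalization parameters.

```python
# Verifying the parameter count of the 1 -> 16 -> 32 -> 40 -> 48 -> 1 Riesz network.
channels = [1, 16, 32, 40, 48]
riesz_layers = sum(c_in * 5 * c_out + c_out
                   for c_in, c_out in zip(channels[:-1], channels[1:]))
final_layer = 48 * 1 + 1            # final linear combination + offset, before the sigmoid
print(riesz_layers + final_layer)   # 18825
```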

Fig. 5

Cracks of width 3 used for training: before (first row) and after cropping (second row). Image sizes are \(256\times 256\) (first row) and \(64 \times 64\) (second row)

4 Experiments and Applications

In this section we evaluate the four-layer Riesz network defined above on the task of segmenting cracks in 2d slices from CT images of concrete. Particular emphasis is put on the network’s ability to segment multiscale cracks and to generalize to crack scales unseen during training. To quantify these properties, we use images with simulated cracks. Being accompanied by an unambiguous ground truth, they allow for an objective evaluation of the segmentation results.

Additionally, in “Appendix A” scale equivariance of the Riesz network is experimentally validated on the MNIST Large Scale data set [5].

Data generation: Cracks are generated by fractional Brownian motion (Experiment 1) or as minimal surfaces induced by the facet system of a Voronoi tessellation (Experiment 2). Dilated cracks are then integrated into CT images of concrete without cracks. As pores and cracks are both air-filled, their gray value distributions should be similar. Hence, the gray value distribution of crack pixels is estimated from the gray value distribution observed in air pores. The crack thickness is kept fixed (Experiment 1) or varies (Experiment 2) depending on the objective of the experiment. As a result, realistic semi-synthetic images can be generated (see Fig. 5). For more details on the simulation procedure, we refer to [6, 10]. Details on the number and size of the images can be found below. Finally, we show the applicability of the Riesz network to real data containing cracks generated by tensile and pull-out tests.

Quality metrics: As metrics for the evaluation of the segmentation results we use precision (P), recall (R), F1-score (or Dice coefficient, Dice), and intersection over union (IoU). The first three quality metrics are based on the numbers of true positives tp (pixels correctly predicted as crack), true negatives tn (pixels correctly predicted as background), false positives fp (pixels wrongly predicted as crack), and false negatives fn (pixels wrongly predicted as background). Precision, recall, and Dice coefficient are then defined via

$$\begin{aligned} P = tp/(tp+fp),\quad R = tp/(tp+fn),\quad \text {Dice} = 2PR/(P + R). \end{aligned}$$

IoU compares union and intersection of the foregrounds X and Y in the segmented image and the corresponding ground truth, respectively. That is

$$\begin{aligned} IoU(X,Y) = \frac{|X\cap Y|}{|X \cup Y|}. \end{aligned}$$

All these metrics have values in the range [0, 1] with values closer to 1 indicating a better performance.
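For binary masks, these metrics can be computed as in the following NumPy helper (a sketch of ours; edge cases such as empty masks are ignored).

```python
# Precision, recall, Dice, and IoU for binary 0/1 masks pred (segmentation)
# and gt (ground truth) of equal shape.
import numpy as np

def segmentation_metrics(pred, gt):
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    dice = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)        # equals |X ∩ Y| / |X ∪ Y| for binary masks
    return precision, recall, dice, iou
```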

Training parameters: If not specified otherwise, all models are trained on cracks of fixed width of 3 pixels. Cracks for the training are generated in the same way as for Experiment 1 on \(256\times 256\) sections of CT images of concrete. Then, 16 images of size \(64\times 64\) are cropped without overlap from each of the generated images. In this way, images without cracks are present in the training set. After data augmentation by flipping and rotation, the training set consists of 1 947 images of cracks. Some examples are shown in Fig. 5. For validation, another set of images with cracks of width 3 is used. The validation data set’s size is a third of the size of the training set.

All models are trained for 50 epochs with initial learning rate 0.001 which is halved every 20 epochs. ADAM optimization [43] is used, while the cost function is set to binary cross entropy loss.

Crack pixels are labeled with 1, while background is labeled with 0. As there are far more background than crack pixels, we deal with a highly imbalanced data set. Therefore, crack and pore pixels are given a weight of 40 to compensate for the class imbalance and to help distinguish between these two types of structures, which hardly differ in their gray values.
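One way to realize this weighting in PyTorch is a pixel-wise weight map passed to the binary cross entropy, as sketched below; the precomputed pore mask, the function name, and the API usage are our assumptions rather than the authors' implementation.

```python
# Weighted binary cross entropy: crack and pore pixels receive weight 40.
import torch
import torch.nn.functional as F

def weighted_bce(prediction, target, pore_mask, weight_value=40.0):
    """prediction, target, pore_mask: tensors of shape (B, 1, H, W), values in [0, 1]."""
    weights = torch.ones_like(target)
    weights[(target > 0.5) | (pore_mask > 0.5)] = weight_value   # cracks and pores
    return F.binary_cross_entropy(prediction, target, weight=weights)
```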

Fig. 6

Measure of scale equivariance \(\Delta _a\) for the four-layer Riesz network with randomly initialized parameters w.r.t. the downscaling factor a. Mean (black), minimum, and maximum (gray) of 20 repetitions. Points on the line correspond to \(a~\in ~\{1,2,4,8,16,32,64\}\)

Table 1 Experiment 1. Ablation study: scale generalization ability of Riesz networks

4.1 Measuring Scale Equivariance

Measures for assessing scale equivariance have been introduced in [36, 38]. For an image or feature map f, a mapping function \(\Phi \) (e.g., a neural network or a subpart of a network), and a scaling function \(L_a\) we define

$$\begin{aligned} \Delta _a(\Phi ) := \frac{|| L_a(\Phi (f)) - \Phi (L_a(f))||_2}{||L_a(\Phi (f)) ||_2}. \end{aligned}$$
(15)

Ideally, this measure should be 0 for perfect scale equivariance. In practice, due to scaling and discretization errors we expect it to be positive yet very small.
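The measure can be computed directly from Eq. (15), e.g., as in the following PyTorch sketch, where bilinear interpolation serves as the scaling operator \(L_a\) (an implementation choice on our part).

```python
# Hedged sketch of the scale equivariance measure Delta_a from Eq. (15).
import torch
import torch.nn.functional as F

def delta_a(phi, f, a):
    """phi: network or layer; f: tensor of shape (1, C, H, W); a: downscaling factor."""
    def L_a(x):
        return F.interpolate(x, scale_factor=1.0 / a,
                             mode="bilinear", align_corners=False)
    scaled_output = L_a(phi(f))          # L_a(Phi(f))
    output_of_scaled = phi(L_a(f))       # Phi(L_a(f))
    return (torch.norm(scaled_output - output_of_scaled)
            / torch.norm(scaled_output)).item()
```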

To measure the scale equivariance of the full Riesz network with randomly initialized weights, we use a data set consisting of 85 images of size \(512\times 512\) pixels with crack width 11 and downscaling factors \(a\in \{2,4,8,16,32,64\}\). The evaluation was repeated for 20 randomly initialized Riesz networks. The resulting values of \(\Delta _a\) are given in Fig. 6.

The measure \(\Delta _a\) was used to validate the scale equivariance of deep scale spaces (DSS) in [36] and scale-equivariant steerable networks in [38]. In both works, a steep increase in \(\Delta _a\) is observed for downscaling factors larger than 16, while for very small downscaling factors, \(\Delta _a\) is reported to be below 0.01. In [38], \(\Delta _a\) reaches 1 for downscaling factor 45. The application scenario studied here differs from those of [36, 38]. Results are thus not directly comparable but can be considered only as an approximate baseline. For small downscaling factors, we find \(\Delta _a\) to be higher than in [38] (up to 0.075). However, for larger downscaling factors \((a>32)\), \(\Delta _a\) increases more slowly, e.g., \(\Delta _{64} = 0.169\). This proves the resilience of Riesz networks to very high downscaling factors, i.e., large changes in scale.

4.2 Experiment 1: Generalization to Unseen Scales

Our models are trained on images of fixed crack width 3. To investigate their behavior on crack widths outside of the training set, we generate cracks of widths \(\{1, 3, 5, 7, 9, 11\}\) pixels in images of size \(512\times 512\), see Fig. 9. Each class contains 85 images. Besides scale generalization properties of the Riesz network, we check how well it generalizes to random variations in crack topology or shapes, too. For this experiment, we will assume that the fixed width is known. This means that the competing methods will use a single scale which is adjusted to this width.

4.2.1 Ablation Study on the Riesz Network

We investigate how the network parameters and the composition of the training set affect the quality of the results, in order to learn how to design this type of neural network efficiently.

Fig. 7

Experiment 1. Effect of the training set size (left), the crack width in the training set (center), and the network depth (right) on generalization to unseen scales. The baseline Riesz network is marked with 1947 (left), w3 (center), and layer 4 (right) and with square symbol \(\square \). Quality metric: IoU

Table 2 Experiment 1. Comparison with competing methods: Dice coefficients for segmentation of cracks of differing width
Fig. 8

Experiment 1. Comparison of the competing methods. Results of the simulation study with respect to crack width. Training on crack width 3. Quality metrics (from left to right): precision, recall, and IoU

Fig. 9

Experiment 1. Columns (from left to right): crack of widths 3, 5, 7, and 11. Rows (from top to bottom): input image, ground truth, Riesz network, plain U-net, U-net with scale adjustment, and Gaussian derivative network. All images have size \(512\times 512\) pixels

Size of training set: First, we investigate the robustness of the Riesz network to the size of the training set. The literature [34] suggests that neural networks based on structured receptive fields are less data hungry, i.e., their performance with respect to the size of the training set is more stable than that of conventional CNNs. Since the Riesz network uses the Riesz transform instead of the Gaussian derivatives as in [34], it is expected that the same holds here, too.

The use of smaller training sets has two main benefits. First, obviously, smaller data sets reduce the effort for data collection, i.e., annotation or simulation. Second, smaller data sets reduce the training time for the network if we do not increase the number of epochs during training.

We constrain ourselves to three sizes of training sets: \(1\,947\), 975, and 489. These numbers refer to the sets after data augmentation by flipping and rotation. Hence, the number of original images is three times smaller. In all three cases we train the Riesz network for 50 epochs and with similar batch sizes (11, 13,  and 11, respectively). Results on unseen scales with respect to data set size are shown in Table 1 and Fig. 7 (left). We observe that the Riesz network trained on the smallest data set is competitive with counterparts trained on larger data sets albeit featuring generally 1–2% lower Dice and IoU.

Choice of crack width for training: There are two interesting questions with respect to crack width. Which crack width is suitable for training of the Riesz network? Do varying crack thicknesses in the training set improve performance significantly?

To investigate these questions, we choose three training data sets with cracks of fixed widths 1, 3, or 5. A fourth data set combines crack widths 1, 3, and 5. We train the Riesz network with these sets and evaluate its generalization performance across scales. Results are summarized in Fig. 7 (center) and Table 1. Crack widths 3 and 5 yield similar results, while crack width 1 seems not to be suitable, except when trying to segment cracks of width 1. Cracks of width 1 are very thin, subtle, and in some cases even disconnected. Hence, they differ significantly from thicker cracks which are 8-connected and have a better contrast to the concrete background. This indicates that very thin cracks should be considered a special case which requires somewhat different treatment. Rather surprisingly, using the mixed training data set does not improve the metrics. Diversity with respect to scale in the training set seems not to be a decisive factor when designing Riesz networks.

Number of layers: Finally, we investigate the explanatory power of the Riesz network depending on network depth and the number of parameters. We train four networks with \(2-5\) layers and \(2\,721\), \(9\,169\), \(18\,825\), and \(34\,265\) parameters, respectively, on the same data set for 50 epochs. The network with 5 layers has structure \(16\rightarrow 32\rightarrow 40\rightarrow 48\rightarrow 64\), and every other network is constructed from this one by removing the required number of layers at the end. Results are shown in Table 1 and in Fig. 7 (right). The differences between the networks with 3, 4, and 5 layers are rather subtle. For the Riesz network with only 2 layers, performance deteriorates considerably (3–5% in Dice and IoU).

In general, Riesz networks appear to be robust with respect to training set size, depth of network, and number of parameters. Hence, it is not necessary to tune many parameters or to collect thousands of images to achieve good performance, in particular for generalization to unseen scales. For the choice of crack width, 3 and 5 seem appropriate while crack width 1 should be avoided.

4.2.2 Comparison with Competing Methods

Competing methods: The four-layer Riesz network is compared to two other methods: Gaussian derivative networks [27] and U-net [44] on either rescaled images [5] or an image pyramid [8]. The Gaussian derivative network uses scale space theory based on the Gaussian kernel and the diffusion equation. Using the \(\gamma \)-normalized Gaussian derivatives from [1], layers of first- and second-order Gaussian derivatives are constructed [27]. We briefly state the differences between our reimplementation and the original work [27]. In order to reduce the computation time, we use a version of the Gaussian network that has a single scale channel corresponding to the training thickness during training, while additional channels with shared weights are added for inference, i.e., for testing the scale generalization. This version of the Gaussian network differs from the original one [27], which has more scale channels during training. Using multiple scale channels in the training step has been found to result in better scale generalization properties if the different scale channels are allowed to compete against each other during the training stage. However, this increases the computational burden compared to a single channel network. In this section, we use a single channel version of the network but with \(\sigma \) adjusted to the crack width. The sparser scale sampling ratio of 2 used in our reimplementation in Sect. 4.3 is, however, expected to lead to lower performance compared to the scale sampling ratio of \(\sqrt{2}\) used in the original work.

U-net has around 2.7 million parameters, while the Gaussian derivative network has the same architecture as the Riesz network and hence the same number of parameters (18k).

We design an experiment for detailed analysis and comparison of the ability of the methods to generalize to scales unseen during training. In typical applications, the thickness of the cracks would not be known. Here, crack thickness is kept fixed such that the correct scale of cracks is a priori known. This allows for a selection of an optimal scale (or range of scales) such that we have a best case comparison. For the Gaussian derivative network, scale is controlled by the standard deviation parameter \(\sigma \) which is set to the half width of the crack. Here, we have avoided the inference of the Gaussian network with multiple scale channels and have used the assumption that the scale is a priori known. For the U-net, scale is adjusted by downscaling the image to match the crack width used in the training data. Here, we restrict the downscaling to discrete factors in the set \(\{2,4,8,... \}\) that were determined during validation. For widths 1 and 3, no downscaling is needed. For width 5, the images are downscaled by 2, for width 7 by 4, and for widths 9 and 11 by 8. For completeness, we include results for the U-net without downscaling denoted by "U-net plain". Table 2 yields the prediction quality measured by the Dice coefficient, while the other quality measures are shown in Fig. 8. Exemplary segmentation results are shown in Fig. 9.

As expected, the performance of the plain U-net decreases with increasing scale. Scale adjustment stabilizes the U-net’s performance but requires manual selection of scales. Moreover, the interpolation in upsampling and downsampling might induce additional errors. The decrease in performance with growing scale is still apparent (10–15%) but significantly reduced compared to the plain U-net \((55\%)\). To get more insight into the performance and characteristics of the U-net, we add an experiment similar to the one from [5]: We train the U-net on crack widths 1, 3, and 5 on the same number of images as for one single crack width. This case is referred to as "U-net-mix scale adj." in Table 2. Scales are adjusted similarly: w5 and w7 are downscaled by factor 2, w9 and w11 are downscaled by factor 4. The results are significantly better than those obtained by the U-net trained on a single width (10–15% in Dice and IoU on unseen scales), but still remain worse than those of the Riesz network trained on a single scale (around \(7\%\) in Dice and IoU on unseen scales).

Table 3 Experiment 2. Performance on simulated multiscale cracks

The Gaussian derivative network is able to generalize steadily across the scales (Dice and IoU \(74\%\)) but nevertheless performs worse than the scale-adjusted U-net (around \(10\%\) in IoU). Moreover, it is very sensitive to noise and typical CT imaging artifacts (Fig. 9).

On the other hand, the Riesz network’s performance is very steady with growing scale. We even observe improving performance in IoU and Dice with increase in crack thickness. This is due to pixels at the edge of the crack influencing the performance metrics less and less the thicker the crack gets. The Riesz network is unable to precisely localize cracks of width 1 as, due to the partial volume effect, such thin cracks appear to be discontinuous. With the exception of the thinnest crack, the Riesz network has Dice coefficients above \(94\%\) and IoU over \(88\%\) for completely unseen scales. This even holds for the cases when the crack is more than 3 times thicker than the one used for training.

4.3 Experiment 2: Performance on Multiscale Data

Since cracks are naturally multiscale structures, i.e., crack thickness varies as the crack propagates, the performance of the considered methods on multiscale data is analyzed as well. On the one hand, we want to test on data with an underlying ground truth without relying on manual annotation prone to errors and subjectivity. On the other hand, the experiment should be conducted in a more realistic and less controlled setting than the previous one, with cracks as similar as possible to real ones.

We therefore use again simulated cracks, this time however with varying width. The thickness is modeled by an adaptive dilation. See Fig. 10 for realization examples. What renders this experiment more realistic than the first one is that no prior information about the scale is exploited. The Riesz network does not have to be adjusted for this experiment, while the competing methods require scale selection as described in Sect. 4.2.2. Without knowing the scale, testing several configurations is the only option. See “Appendix B” for examples. Note that in this experiment we use a different crack simulation technique [10] than in Experiment 1. In principle, we cannot claim that either of the two techniques generates more realistic cracks. However, this change serves as an additional goodness check for the methods since the two simulation techniques can be seen as independent.

Fig. 10

Experiment 2. Cracks with varying width. From left to right: input image, results of the Riesz network and the U-net with four pyramid levels. Image size \(400 \times 400\) pixels

Fig. 11

Experiment 3. Real cracks in concrete: slice from input CT image, results of the Riesz network and of U-net with two pyramid levels. Image sizes are \(832 \times 1088\) (1st row) and \(544 \times 992\) (2nd row)

We adjust the U-net as follows: We downscale the image by several factors from \(\{2,4,8,16,\ldots \}\). The forward pass of the U-net is applied to the original and every downscaled image. Subsequently, the downscaled images are upscaled back to the original size. All predictions are joined by the maximum operator. We report results for several downscaling factor combinations specified by a number N, which is the number of consecutive downscaling factors used, starting at the smallest factor 2. As in Experiment 1, we report results of two U-net models: the first model is trained on cracks of width 3 like the other models in the comparison. The second model is trained on cracks with mixed widths. Including more crack widths in the training set has proven to improve the scale generalization ability in Experiment 1. Hence, the second model represents a more realistic setting that would be used in practice where the crack width is typically unknown. We denote the respective networks as "U-net pyramid" N and "U-net-mix pyramid" N.

For the Gaussian network, we vary the standard deviation parameter \(\sigma \) in the set \(\{1.5,3,6,12\}\). This selection of scales is motivated by the network having been trained on crack width 3 with \(\sigma = 1.5\). We start with the original \(\sigma \) and double it in each step. Note that in the related study on the scale sampling for scale-channel deep networks [5], using a scale sampling factor of \(\sqrt{2}\) was found to lead to better performance than using a scale sampling factor of 2. Hence, additional experimentation with hyperparameters of the method might improve the results. We decided to keep the sampling scheme as similar as possible to the one from U-net. The reason is that U-net with downscaling was more extensively tested on the crack segmentation task [8, 45]. As for the U-net, we test several configurations, now specified by the number N of consecutive \(\sigma \) values used, starting at the smallest (1.5). We denote the respective network "Gaussian network" N.

Results are reported in Table 3 and Fig. 10. We observe a clear weakness of the Riesz network in segmenting thin cracks (Fig. 10, first and last row). Despite this, the recall is still quite high (\(90\%\)). However, this could be due to thicker cracks, which are handled very well, contributing more strongly to these statistics as they occupy more pixels. Nevertheless, the Riesz network deals with the problem of the wide range of scales without sampling the scale dimension, in just a single forward pass of the network.

The performance of the U-net improves when including more levels in the pyramid, too. However, this applies only up to a certain number of levels after which the additional gain becomes minimal. Moreover, applying the U-net on downscaled images seems to induce oversegmentation of the cracks (Fig. 10, second and third row). Including a variety of crack widths in the training set improves the overall performance of the U-net in all metrics. This confirms the hypothesis that the U-net significantly benefits from variations in the training set. However, this model of the U-net is still outperformed by the Riesz network trained on a single crack width. The Gaussian network behaves similarly to the U-net, with slightly worse performance (according to Dice or IoU) but better crack coverage (recall). As the number of \(\sigma \) values grows, the recall increases but at the same time artifacts accumulate across scales, reducing precision. The best balance on this data set is achieved with three scales.

4.4 Experiment 3: Application to Cracks in CT Images of Concrete

Finally, we check the methods’ performance on real data: cracks in concrete samples generated by tensile and pull-out tests. In these examples, the crack thickness varies from 1 or 2 pixels to more than 20 pixels (Fig. 11). This motivates the need for methods that automatically generalize to completely unseen scales. Here, we can assess the segmentation results only qualitatively, as no ground truth is available. Manual segmentation of cracks in high-resolution images is time-consuming and prone to individual biases. Additional experiments on real cracks in different types of concrete are shown in “Appendix C”.

The first sample (Fig. 11, first row) is a concrete cylinder with a glass fiber-reinforced composite bar embedded along the center line. A force is applied to this bar to pull it out of the sample and thus initiate cracking. Cracking starts around the bar and branches in three directions: left, right diagonal, and down (very subtle, thin crack). Crack thicknesses and thus scales vary depending on the crack branch. As before, our Riesz network is able to handle all but the finest crack thicknesses efficiently in a single forward pass without specifying the scale range. The U-net on the image pyramid requires a selection of downsampling steps (Appendix B), accumulates artifacts from all levels of the pyramid, and slightly oversegments thin cracks (left branch).

The second sample (Fig. 11, second row) features a horizontal crack induced by a tensile test. Here we observe permanently changing scales, similar to our simulated multiscale data. The crack thickness varies from a few to more than 20 pixels. Once more, the Riesz network handles the scale variation well and segments almost all cracks with minimal artifacts. In this example, U-net covers the cracks well, too, even the very subtle ones. However, it accumulates more false positives in the areas of concrete without any cracks than the Riesz network.

5 Conclusion

In this paper, we introduced a new type of scale-invariant neural network based on the Riesz transform as filter basis instead of standard convolutions. Our Riesz neural network is scale-invariant in one forward pass without specifying scales or discretizing and sampling the scale dimension. Its ability to generalize to scales differing from those trained on is tested and validated in segmenting cracks in 2d slices from CT images of concrete. The usefulness of the method becomes manifest in the fact that only one fixed scale is needed for training, while generalization to completely unseen scales is preserved. This reduces the effort for data collection, generation, or simulation. Furthermore, our network has relatively few parameters (around 18k) which reduces the danger of overfitting.

Experiments on simulated yet realistic multiscale cracks as well as on real cracks corroborate the Riesz network’s potential. Compared to other deep learning methods that can generalize to unseen scales, the Riesz network yields improved, more robust, and more stable results.

A detailed ablation study on the network parameters reveals several interesting features: This type of network requires relatively little data to generalize well. The Riesz network proves to perform well on a data set of approximately 200 images before augmentation. This is particularly useful for deep learning tasks where data acquisition is exceptionally complex or expensive. The performance was analyzed as a function of the depth of the network and the number of parameters. Only three layers of the network suffice to achieve good performance on cracks in 2d slices of CT images. Furthermore, the choice of crack thickness in the training set is found to be not decisive for the performance. Training sets with crack widths 3 and 5 yield very similar results.

The two main weaknesses of our approach in the crack segmentation task are undersegmentation of thin cracks and edge effects around pores. In CT images, thin cracks appear brighter than thicker cracks due to the partial volume effect reducing the contrast between crack and concrete. For the same reason, thin cracks appear discontinuous. Thin cracks might therefore require special treatment. In some situations, pore edge regions get erroneously segmented as crack. These can, however, be removed by a post-processing step and are not a serious problem.

To unlock the full potential of the Riesz transform, validation on other types of problems is needed. Furthermore, scaling the Riesz network to larger, wider, and deeper models remains an open topic. Our study as well as previous ones [34, 46, 47] imply that small models based on linear combinations of convolutions with fixed filters can yield results comparable to those of large CNN models. However, in order to state this reliably, training convergence, expressiveness, and run-times of these types of networks have to be compared systematically to those of CNNs.

In the future, the method should be applied in 3d since CT data is originally 3d. In this case, memory issues might occur during the discretization of the Riesz kernel in frequency space.

An interesting topic for further research is to combine translation and scale invariance with rotation invariance to design a new generation of neural networks with encoded basic computer vision properties [40]. This type of neural network could be very efficient because it would have even fewer parameters and hence would require less training data, too.