Neural Knitworks: Patched Neural Implicit Representation Networks

Coordinate-based Multilayer Perceptron (MLP) networks, despite being capable of learning neural implicit representations, are not performant for internal image synthesis applications. Convolutional Neural Networks (CNNs) are typically used instead for a variety of internal generative tasks, at the cost of a larger model. We propose Neural Knitwork, an architecture for neural implicit representation learning of natural images that achieves image synthesis by optimizing the distribution of image patches in an adversarial manner and by enforcing consistency between the patch predictions. To the best of our knowledge, this is the first implementation of a coordinate-based MLP tailored for synthesis tasks such as image inpainting, super-resolution, and denoising. We demonstrate the utility of the proposed technique by training on these three tasks. The results show that modeling natural images using patches, rather than pixels, produces outputs of higher fidelity. The resulting model requires 80% fewer parameters than alternative CNN-based solutions while achieving comparable performance and training time.


Introduction
The research on utilizing coordinate-based Multilayer Perceptron (MLP) networks for image synthesis has developed significantly, yielding a range of impressive results [1,2,3,4,5,6]. However, most published works propose architectures with no capability to directly model the spatial relationships within the represented signal; instead, they independently fit the network output to a set of input coordinates. In this work, we propose a coordinate-based model that gains spatial awareness by fitting patches, rather than isolated values, to input coordinates.
The idea is inspired by the advancements made using models that focus on patch distributions, such as InGAN [7], SinGAN [8], and the Swapping Autoencoder [9]. The proposed framework is an improvement over conventional coordinate MLP architectures, in which the network predicts a color patch (or a multi-scale stack thereof) with additional constraints imposed. The purpose of these constraints is to match the distributions of predicted and reference patches and to encourage spatial consistency between the predictions. The resulting method constitutes a framework that can be applied to several image synthesis tasks, such as image inpainting, super-resolution, and denoising, as shown in Figure 1.
The proposed approach combines the advantages of MLPs as neural implicit representation networks while providing versatility and robustness at levels similar to a Convolutional Neural Network (CNN). The network can be significantly smaller than equivalent CNN architectures and faithfully encode a target signal, exhibiting a compressive capability [10]. Furthermore, fitting a single image by the network has two significant advantages: i) there is no requirement for a dataset or pretraining; ii) it requires fewer iterations to converge compared to solutions trained on a dataset of images. Effectively, a flexible internal learning framework is introduced that performs well on a diverse range of computer vision tasks with low memory and computational requirements.

Related Work
The potential of applying an MLP network as an encoding of a signal has recently been explored in a number of works [1,2,3,4,6,10,11,12,13,14,15]. The learned signals can be of any dimensionality; however, MLP encoding of spatial coordinates is a particularly popular theme, involving a network that learns to produce given scalar values based on the input coordinates. This allows for considerable flexibility and leads to applications such as self-supervised learning of natural images or videos.
Coordinate-Based MLP Networks. The interest in using fully connected networks to represent signals in an implicit manner has grown over the last few years, which can be attributed to the potential of such methods for 3D shape representations [3,4,6,13,14,15]. An important issue for learning coordinate-based representations is the tendency of neural networks to interpolate and attenuate high-frequency changes in the output [1,2,16]. Two effective solutions to this problem are to either map the input coordinates (known as positional encoding) [1] or to use sinusoidal activation functions [2]. However, neither approach addresses the challenge of synthesizing new regions. As we demonstrate in a subsequent section (Figure 3), a standard MLP encoding its input with random Fourier features does not synthesize new outputs in a convincing manner. The technique of random Fourier feature encoding of spatial coordinates gave rise to Neural Radiance Field (NeRF) networks, which can synthesize high-fidelity novel views of 3D scenes in an efficient manner [3]. This contribution was soon followed by further works focusing on aspects such as unbounded 3D scenes [4], synthesizing based on few (or only one) images [5], or taking advantage of the compositionality of 3D scenes [6]. There have been some works where coordinate-based MLP networks are used as the core of a generative model, using techniques such as a hypernetwork predicting the weights of a sample coordinate MLP [11], or modulating the weights of a base coordinate MLP [12]. These approaches are fundamentally different, as they attempt to create a wide generative model based on a large-scale dataset, while our approach focuses on data-agnostic internal learning tasks and uses a disparate architecture. Finally, Local Implicit Image Functions introduced in [17] are trained in a self-supervised manner and are based on latent feature maps used to synthesize an image at different resolutions. However, the architecture relies on a convolutional feature encoder, applies a fixed downsampling operation, and is trained to generate images based on a selected dataset. Our architecture is purely based on MLP networks, requires no pretraining, and directly maximizes self-similarity between the synthesized and known patches.

Internal Generative Frameworks. Patches have been identified as crucial representation features of images in various works [8,18,19,20,21,22,23,24,25,26,27,28,29]. The introduction of Generative Adversarial Networks (GANs) [30] made it possible to learn patch distributions of images in an adversarial manner [8,9]. Additionally, internal learning approaches relying on the priors contained in convolutional architectures have been proposed [27,31]. To the best of our knowledge, no attempt to introduce these techniques to coordinate-based MLP networks has been made until now.

Method
The core structure of the proposed network is presented in Figure 2. It consists of three small networks: (i) the Patch MLP, translating from the original coordinate domain to the patch domain; (ii) the discriminator, responsible for assessing patch likelihoods; and (iii) the MLP Reconstructor, mapping the patch domain to individual pixel colors.
The resulting architecture performs an operation equivalent to a conventional coordinate-based MLP, since the network ultimately predicts a single pixel value. However, the intermediate patch-based representation of the proposed architecture forces the model to establish the natural relationship between the encoded coordinates. This property can also be used as a useful prior for internal learning scenarios, similar to using convolutional kernels in CNN architectures. Further, the patch representation allows our model to be trained as a GAN and match the internal patch distribution with that of the reference image.

Patch Synthesis
The Patch MLP is a network of 4 ReLU layers with 256 units, identical to the one used in [1]. The role of this component is to map each coordinate vector to an appropriate pixel patch. The coordinate input is mapped using random Fourier features before being passed to the network. This processing step is known as positional encoding and has been described in detail in [1].
The output of this network approximates the implicit representation function φ(x) for a query coordinate vector x along with the values of neighbouring coordinates. The required receptive field depends on the spectral content of the image and can be adjusted by either increasing the patch size to provide more spatial bandwidth or using multi-scale patches. We apply the latter approach, as it is more efficient for large spatial spans, allowing for an easily configurable scope covered by the output patches at low computational cost. We use patches of a fixed size of 3×3 for all experiments. For the extraction of patches with scales larger than one, a Gaussian filter is applied to the image to reduce aliasing.
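A minimal PyTorch sketch of this component is given below. The 4 ReLU layers, 256 units, and 3×3 patch size follow the text; the number of Fourier features, the frequency scale `sigma`, and the three-scale output are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchMLP(nn.Module):
    """Maps Fourier-encoded 2-D coordinates to a multi-scale stack of 3x3 RGB patches."""

    def __init__(self, n_freqs=256, n_scales=3, patch=3, width=256, sigma=10.0):
        super().__init__()
        self.n_scales, self.patch = n_scales, patch
        # Fixed random Fourier feature matrix for positional encoding, as in [1].
        self.register_buffer("B", torch.randn(2, n_freqs) * sigma)
        self.net = nn.Sequential(
            nn.Linear(2 * n_freqs, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_scales * patch * patch * 3),
        )

    def encode(self, x):
        # Random Fourier feature mapping of the raw (normalized) coordinates.
        proj = 2 * torch.pi * x @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

    def forward(self, x):
        # x: (N, 2) coordinates -> (N, n_scales, 3, patch, patch) patch stacks
        out = self.net(self.encode(x))
        return out.view(-1, self.n_scales, 3, self.patch, self.patch)
```

The buffer `B` is fixed after initialization, so only the linear layers are trained, matching the standard positional-encoding setup.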
Patch Reconstruction Loss. Since our core module is an MLP with multi-scale patch output, a direct way of computing the error is taking the difference between the predicted patches φ̂(x) and the ground-truth reference φ(x). For inpainting tasks, not all pixel values of the patch stack are known and, hence, we apply an appropriate mask m(x) to this loss. For other tasks, the mask is an all-ones tensor. We refer to this loss as the patch reconstruction loss L_Recon, which is effectively a masked Mean Squared Error (MSE) computed for patches at N sampling coordinates.
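The masked patch MSE described above can be written directly; the tensor shapes below are illustrative.

```python
import torch

def patch_reconstruction_loss(pred, target, mask):
    """Masked MSE between predicted and reference multi-scale patch stacks.

    pred, target: (N, S, C, P, P) patch stacks at N sampled coordinates.
    mask: same shape; 1 where ground truth exists (all ones outside inpainting).
    """
    diff = mask * (pred - target) ** 2
    # Normalize by the number of known elements so the loss magnitude
    # does not scale with mask sparsity.
    return diff.sum() / mask.sum().clamp(min=1)
```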
The effect of learning a patch-based representation rather than direct pixel values is illustrated in Figure 3 as part of the ablation study included in the experiments. It becomes quite clear that the patch-based representation alone (third column), while helpful, may not yield satisfactory results for challenging synthesis tasks. Instead, we must apply additional constraints to control the relationships between the synthesized values.
Cross-Patch Consistency Loss. The ability to produce likely pixels or patches does not necessarily lead to consistent network output when the entire learned image is considered. By default, all patches for which ground truth is available are optimized to be close to that reference, but this does not guarantee that all patches contribute to a single coherent image for coordinates with no ground truth. For newly synthesized regions, the output patches may be convincing on their own (due to the bias component learned by the network from the known region) but display limited coherence with each other.
To encourage consistency, we design a cross-patch consistency loss that computes the difference between the predictions for each pixel from all patches, over the entire image scope. In practice, a way to enforce this is to use the prediction from the central element of the lowest-scale patch as a reference. The following notation is defined: φ̂(x)[i] represents the value of patch element i predicted for coordinate x, where i belongs to the set of I elements across all scales. In a similar fashion, φ̂(x)[o] represents the value of the central element (constant index o) of the lowest-scale patch predicted from coordinate x. The central reference φ̂(x)[o] is compared with the element φ̂(x+s)[i] that corresponds to the same pixel of the output image evaluated at coordinate x+s, where s indicates the appropriate shift, dependent on i. Terms with values of x+s outside the image bounds are naturally excluded from the summation.

Reconstructed Pixel Loss. The transition from predicting isolated pixel colors to patches introduces a new trade-off between imposing spatial relationships of the pixel colors and obtaining a high-fidelity image with accurate detail. In practice, there will be some disagreement between the predictions for the same pixel from different patches and scales. The naive approach of averaging all predictions for a given coordinate leads to blurring. To avoid this, a separate MLP Reconstructor network is used to translate from the multi-scale patch representation to a single color value, by approximating the color extraction function ρ(φ̂(x)), as shown in Figure 2. The error made by this final output network constitutes the reconstructed pixel loss, encouraging the entire model to produce accurate pixel colors based on a stack of patches.
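As an illustration, the cross-patch consistency loss can be sketched for the simplified case of a single-scale 3×3 patch predicted at every pixel of a dense grid; the multi-scale version extends the element set I accordingly, and the shapes and names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_patch_consistency(patches):
    """Cross-patch consistency for single-scale 3x3 patches on a dense grid.

    patches: (H, W, 3, 3, 3) -- a 3x3 RGB patch predicted at every pixel.
    The central prediction patches[..., 1, 1] serves as the reference; every
    other patch element covering the same image pixel is pulled toward it.
    """
    H, W = patches.shape[:2]
    center = patches[..., 1, 1]  # (H, W, 3): the reference image
    # Pad the reference so shifted lookups at the border can be masked out.
    ref = F.pad(center.permute(2, 0, 1), (1, 1, 1, 1), value=0.0)    # (3, H+2, W+2)
    valid = F.pad(torch.ones(1, H, W), (1, 1, 1, 1), value=0.0)      # (1, H+2, W+2)
    loss, count = 0.0, 0.0
    for di in range(3):
        for dj in range(3):
            if di == 1 and dj == 1:
                continue  # the central element is the reference itself
            # Patch element (di, dj) at (y, x) predicts pixel (y+di-1, x+dj-1).
            tgt = ref[:, di:di + H, dj:dj + W].permute(1, 2, 0)      # (H, W, 3)
            m = valid[:, di:di + H, dj:dj + W].permute(1, 2, 0)      # (H, W, 1)
            loss = loss + (m * (patches[..., di, dj] - tgt) ** 2).sum()
            count = count + m.sum() * 3  # 3 color channels per position
    return loss / count
```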
The pixel reconstruction loss is computed as an ℓ1 loss between the network pixel color output ρ(φ̂(x)) and the color ground truth c(x).
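A sketch of the MLP Reconstructor and the ℓ1 pixel loss follows. The input dimensionality assumes a three-scale stack of 3×3 RGB patches, and the hidden width and depth are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MLPReconstructor(nn.Module):
    """Maps a flattened multi-scale patch stack to a single RGB color."""

    def __init__(self, n_scales=3, patch=3, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_scales * patch * patch * 3, width), nn.ReLU(),
            nn.Linear(width, 3),
        )

    def forward(self, patches):
        # patches: (N, S, C, P, P) -> colors: (N, 3)
        return self.net(patches.flatten(1))

def pixel_reconstruction_loss(recon, color_gt):
    # L1 distance between the reconstructed color and the ground truth c(x).
    return (recon - color_gt).abs().mean()
```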

Patch Discriminator
Another important property to enforce, especially when some parts of the signal need to be synthesized, is for all predicted patches to come from a distribution of likely patches, derived from the available information in the source image. This is achieved with the aid of a discriminator tasked to predict which patches come from the original distribution and which do not. The approach is partly inspired by a number of existing works that take advantage of self-similarity between patches in natural images [8,9,21,26,27]. In our case, the discriminator is another MLP, consisting of 3 Leaky ReLU layers and taking a flattened patch representation as input.
Discriminator Loss. The discriminator network takes a single multi-scale patch and outputs a confidence score. At each training step of the discriminator, we feed it all real and all synthesized patches and compute the output confidence for them. Furthermore, we apply one-sided label smoothing [32] to the real labels with a factor of 0.1 when computing the discriminator loss, in order to penalize over-confidence of this network module. We use a standard binary cross-entropy loss on the discrimination scores.
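A sketch of the discriminator and its loss, assuming a three-scale 3×3 RGB patch flattened to 81 values; the hidden width, the Leaky ReLU slope, and the use of raw logits are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """3-layer Leaky ReLU MLP scoring flattened multi-scale patch stacks."""

    def __init__(self, in_dim=3 * 3 * 3 * 3, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, width), nn.LeakyReLU(0.2),
            nn.Linear(width, width), nn.LeakyReLU(0.2),
            nn.Linear(width, 1),  # raw confidence logit
        )

    def forward(self, p):
        return self.net(p.flatten(1))

def discriminator_loss(d_real, d_fake, smoothing=0.1):
    """BCE with one-sided label smoothing of the real labels [32]."""
    bce = nn.BCEWithLogitsLoss()
    real_labels = torch.full_like(d_real, 1.0 - smoothing)  # 0.9 instead of 1.0
    fake_labels = torch.zeros_like(d_fake)
    return bce(d_real, real_labels) + bce(d_fake, fake_labels)
```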

Complete Objective Function
The objective function is a minimax loss where the generator loss term is composed of the four losses described above, parameterized by weights α, β, and γ.
The discriminator term only includes a single binary cross-entropy loss. Further details about the implementation and the hyperparameters can be found in the supplementary material.
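Assembled, the generator objective is a weighted sum of the four terms. Which term is left unweighted, and the default weight values below, are illustrative assumptions; the actual hyperparameters are given in the supplementary material.

```python
def generator_objective(l_recon, l_consist, l_pixel, l_adv,
                        alpha=1.0, beta=1.0, gamma=0.1):
    """Weighted sum of the four generator loss terms.

    l_recon:   masked patch reconstruction loss (MSE)
    l_consist: cross-patch consistency loss
    l_pixel:   reconstructed pixel loss (L1)
    l_adv:     adversarial (generator) loss from the patch discriminator
    """
    return l_recon + alpha * l_consist + beta * l_pixel + gamma * l_adv
```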

Experiments
We demonstrate the capabilities of Neural Knitworks by utilizing a similar model with only minor adjustments for several tasks commonly investigated in the field of computer vision: 1) image inpainting, 2) super-resolution, and 3) denoising. The following section describes the key implementation details for each task and presents corresponding qualitative results. Furthermore, quantitative measures are provided by applying each method to Set5 [33] and Set14 [34].

Ablation Study
We begin our analysis with an ablation study of the proposed architecture to demonstrate the utility of each introduced loss component. Figure 3 illustrates the effect of the following adjustments to the conventional coordinate MLP network (second column): i) patch output (third column), ii) cross-patch consistency loss (fourth column), iii) patch discrimination (fifth column). We observe that the introduction of the patch output alone can lead to a more convincing synthesis. However, some distortion can be observed in the synthesized region, which is reduced when the cross-patch consistency loss is used. Finally, the addition of a GAN loss leads to improved region consistency.

Image Inpainting
For the image inpainting task, we cut out a rectangular section from the source image to be used as the inpainted region.The coordinates of the cutout are used for producing a mask indicating whether the source signal exists for a given pixel.The mask is used to backpropagate the reconstruction losses only from the pixels outside the inpainted region.
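The mask construction can be sketched as follows; the helper name and signature are hypothetical. A value of 1 marks known source pixels, 0 the rectangular cutout, so multiplying the reconstruction losses by this mask suppresses gradients from the inpainted region.

```python
import torch

def rectangular_mask(h, w, top, left, cut_h, cut_w):
    """Binary mask for inpainting: 1 = known source pixel, 0 = cutout."""
    m = torch.ones(h, w)
    m[top:top + cut_h, left:left + cut_w] = 0.0
    return m
```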
We compare the inpainting results of the Neural Knitwork to a conventional coordinate MLP model and to DIP [31], a CNN-based internal learning approach. Figure 4 contains the resulting output for the three tested models. The reconstruction quality of the whole image is comparable for the three tested methods. However, where the inpainted region is concerned, we observe a significant improvement of over 4 dB for the Neural Knitwork compared to the conventional coordinate MLP, and a gap of only 2 dB to the CNN-based technique. For some of the results, the Neural Knitwork was, in fact, able to outperform DIP. Table 1 contains the evaluation across the entire datasets for different fill ratios, confirming that the Neural Knitwork outperforms the conventional approach and achieves performance comparable to DIP with approximately 80% fewer parameters. More examples can be found in the supplementary material.

Super-Resolution
To perform super-resolution, a Neural Knitwork has to translate the information contained in the patches of the original scale to a domain of patches of a finer scale. This can be done by matching the patch distribution across scales [8,25,26,29]. For blind super-resolution, the Neural Knitwork core module is utilized with adjusted losses, as illustrated in Figure 5. The queried coordinates for the Patch MLP network include all super-resolved coordinates, which means that it is not possible to compute the patch reconstruction loss in this mode. However, it is possible to compute the cross-patch consistency loss as well as to discriminate the patches to match the source image distribution. This alone could yield an output image resembling the low-resolution source without guaranteed structural coherence. To enforce coherence, we apply spatially-aware supervision by downsampling the super-resolved image and computing the downsampling loss with reference to the low-resolution source image.

Figure 5: The blind super-resolution framework utilizes the core module with the addition of a linear network to blindly infer the downsampling kernel. In this case the patch reconstruction loss cannot be computed.
The downsampling operation can be implemented in several ways. If the downsampling kernel is known, then the best approach is to simply backpropagate through that kernel (assuming it is differentiable). Otherwise, we can create a trainable downsampling module representing the kernel and optimize its weights in an end-to-end manner. We revisit the technique introduced in [29] by using an identical deep linear network to approximate the kernel. Their method relies on the assumption that a satisfactory kernel should preserve the distribution of patches in the image. For Neural Knitworks, there is no need to introduce a new loss term accommodating this, since the core module objective imposes matching patch distributions by default.
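A minimal sketch of such a trainable downsampling module, in the spirit of the deep linear network of [29]: stacked convolutions with no nonlinearity collapse to a single effective linear kernel, which is learned end-to-end. The kernel sizes, layer count, and strided subsampling here are illustrative, not the exact configuration of [29].

```python
import torch
import torch.nn as nn

class LinearDownsampler(nn.Module):
    """Deep linear network approximating an unknown downsampling kernel.

    With no activations between layers, the composition of the depthwise
    convolutions is itself a single linear kernel applied per channel.
    """

    def __init__(self, scale=4, channels=3):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 7, padding=3, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 5, padding=2, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
        )

    def forward(self, img):
        # img: (B, C, H, W) -> (B, C, H // scale, W // scale)
        blurred = self.net(img)  # learned anti-aliasing / blur kernel
        return blurred[..., ::self.scale, ::self.scale]
```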
In Figure 6, we demonstrate the downsampling effect of two non-standard kernels: i) a delta function (leading to aliasing) and ii) a diagonal Gaussian kernel. Different types of artifacts can be observed depending on the kernel. During training, the Neural Knitwork blindly approximates the downsampling kernel based on the image content. The true and learned kernels are illustrated in the figure.
Figure 7 contains results for a diagonal kernel and an upscaling factor of 4, for the proposed Neural Knitwork, the conventional MLP, and SinGAN, another image super-resolution method based on internal learning. The results show that SinGAN has the lowest performance in terms of PSNR and also creates distinguishable artifacts. Table 2 shows how the Neural Knitwork compares to its counterparts, along with the model sizes. Interpolation with a conventional MLP implicitly assumes a delta kernel; hence, it performs best in that instance. For other kernels, a Neural Knitwork can boost the performance in some instances by adjusting to the kernel.

Denoising
As we demonstrate in Figure 8, a standard MLP network has limited denoising capability because it attempts to fit all pixel colors with no additional constraints. In contrast, a Neural Knitwork ensures that both patches and pixel colors are reliably reconstructed while imposing an additional consistency constraint on the derived solution. In the illustrated result, with a severe noise level of σ = 40, we achieve a PSNR approximately 4 dB higher than in the case of a conventional coordinate MLP. Further, Table 3 confirms that the Neural Knitwork model outperforms both other methods for high noise levels.

Conclusion
Neural Knitworks constitute a hybrid architectural approach for internal learning applications, based on three shallow MLP networks. The approach enhances conventional coordinate-based MLP networks by adding synthesis capabilities for tasks such as inpainting, super-resolution, and denoising, at levels comparable to or better than the considered alternatives. Furthermore, the Neural Knitwork used in our experiments is 5x smaller than its CNN internal learning counterparts, with the additional benefit of being fully parallelizable; that is, all coordinate outputs could be computed independently. Apart from the significant potential for speed-up, Neural Knitworks have the advantage of precise control over the output image size by adjusting the set of input coordinates. Our experimentation shows that Neural Knitworks can be sensitive to hyperparameters such as individual loss weights, patch sizes, and learning rates; however, the configuration used in our experiments has been shown to offer stable performance.
Figure 8: Neural Knitwork demonstrates superior performance for severe levels of noise, in this case σ = 40.

Figure 1 :
Figure 1: The introduced model trained on a single sample can perform a number of different image synthesis tasks with very low memory requirements.

Figure 2 :
Figure 2: The Neural Knitwork architecture consists of 3 shallow MLPs. The network knits patches for related coordinates by enforcing consistency of predictions and optimizing likelihoods of individual patches. Each patch stack is translated back to a single color by the MLP Reconstructor.

Figure 3 :
Figure 3: Ablation study of Neural Knitwork components. The conventional MLP does not produce a coherent inpainted region, and this is improved with the introduction of patches. Further, imposing the cross-patch consistency constraint increases the quality of the synthesized region, while employing a GAN approach ensures patches of high likelihood.

Figure 4 :
Figure 4: Image inpainting results for a fill ratio of 2%. For the inpainted region, Neural Knitworks and DIP perform comparably, and both outperform the conventional MLP.

Figure 6 :
Figure 6: Our method approximates the downsampling kernel depending on the source image.

Figure 7 :
Figure 7: Comparison of blind image super-resolution for a diagonal Gaussian kernel and an upscaling factor of 4x. The Neural Knitwork can outperform the conventional coordinate MLP network and achieve higher PSNR. SinGAN, while generating a considerable amount of high-frequency detail, produces significant artifacts.

Table 1 :
Comparison of inpainting performance for different fill ratios. The three approaches achieve comparable PSNR (↑) and SSIM (↑) for whole images. For the inpainted region, the Neural Knitwork comes close to the level of performance of DIP, while the conventional MLP is inferior.

Table 2 :
We compare the blind super-resolution performance achieved by a conventional coordinate MLP, the CNN-based internal learning framework SinGAN, and our method. We compute PSNR (↑) and SSIM (↑) for a number of upscaling factors and downsampling kernels.

Table 3 :
Comparison of achieved denoising performance. For higher noise levels, the Neural Knitwork achieves higher PSNR (↑) and SSIM (↑) than a conventional MLP and DIP.