FEW SHOT PHOTOGRAMMETRY: A COMPARISON BETWEEN NERF AND MVS-SFM FOR THE DOCUMENTATION OF CULTURAL HERITAGE

3D documentation for the Digital Cultural Heritage (DCH) domain is a field that is becoming increasingly interdisciplinary, breaking down boundaries that have long separated experts from different domains. In the past, there has been an ambiguous claim for ownership of skills, methodologies, and expertise in the heritage sciences. This study aims to contribute to the dialogue between these disciplines by presenting a novel approach for the 3D documentation of an ancient statue. The method combines a Terrestrial Laser Scanner (TLS) acquisition and a multi-view stereo (MVS) pipeline using images from a DJI Mavic 2 drone. Additionally, the study compares the accuracy and final product of the Digital Photogrammetry (DP) and Neural Radiance Fields (NeRF) methods, using the TLS acquisition as validation ground truth. Firstly, a TLS acquisition was performed on an ancient statue using a Faro Focus 2 scanner. Next, an MVS pipeline was adopted using 2D images captured by a Mini-2 DJI Mavic 2 drone from a distance of approximately 1 meter around the statue. Finally, the same images were used to train and run the NeRF network after being reduced by 90%. The main contribution of this paper is to improve our understanding of this method and to compare the accuracy and final product of the two approaches, DP and NeRF, by exploiting a TLS acquisition as the validation ground truth. Results show that the NeRF approach outperforms DP in terms of accuracy and produces a more realistic final product. This paper has important implications for the field of CH preservation, as it offers a new and effective method for generating 3D models of ancient statues. This technology can help to document and preserve important cultural artifacts for future generations, while also providing new insights into the history and culture of different civilizations.
Overall, the results of this study demonstrate the potential of combining TLS and NeRF for generating accurate and realistic 3D models of ancient statues.


INTRODUCTION
3D documentation methods for the Digital Cultural Heritage (DCH) domain are spreading across different disciplines. The well-known boundaries that for years prevented a proper dialogue between experts from different domains are being removed. In particular, the heritage sciences have witnessed, over the years, an ambiguous claim for ownership of skills, methodologies, and expertise (Remondino and Rizzi, 2010). Artificial Intelligence (AI) has helped new generations of researchers to investigate, regardless of disciplinary compartments, whether a technology can be applied and where it can be useful to solve long-standing issues with impressive results. Embracing the philosophy of the so-called Digital Humanities (DH), the main contribution that AI-driven approaches can give is to preserve the high-quality standards of data collection, processing and validation while reducing human intervention and, thus, time-consuming and costly operations (Rowley, 2021). This has happened, for instance, in the semantic segmentation of unstructured point clouds, where Deep Neural Networks (DNNs) are now able to discriminate between architectural elements. Moreover, DNNs are now helping archaeologists to automate the vectorization of 2D orthophotos with segmentation tasks. Finally, generative approaches have been able to partially solve the occlusion problem, besides providing unprecedented opportunities to create datasets to be shared with the communities (Pierdicca and Paolanti, 2022). However, the path towards a full exploitation of AI in the field of DCH is still tortuous and uncertain, requiring huge efforts and, as said before, a sound interdisciplinary collaboration among heterogeneous backgrounds. This is the philosophy underlying this research work, where NeRF (Neural Radiance Field) networks are exploited in the CH domain and compared with Geomatic approaches like MVS-SfM and Terrestrial Laser Scanner.
* r.pierdicca@staff.univpm.it
An important aspect of NeRFs is their training and inference speed: the original NeRF took about 1-2 days to train a single scene. Many improvements have been made in this regard, mainly by improving the sampling strategy or by reducing the number of MLP parameters, resulting in smaller MLPs and, thus, faster training, at the cost of a higher memory consumption. One of the first reimplementations of the original NeRF oriented toward speed was JaxNeRF (Deng et al., 2020), which used Google JAX to create a slightly faster NeRF implementation that is better suited to distributed computing. Many works followed, such as Neural Sparse Voxel Fields (NSVF), which developed a voxel-based NeRF that models the scene as a set of radiance fields bounded by voxels. This approach was faster than the original implementation but was very memory intensive. A speed-up of about ten times over the original NeRF was achieved by Lindell et al. (2021), who approximated the volume rendering step, allowing the use of far fewer samples and resulting in a much faster network, although with a slight decrease in quality. Another work, Deterministic Integration for Volume Rendering (DIVeR) (Wu et al., 2022), approached the task from a different angle, reversing the order of volume sampling and MLP evaluation to obtain results that, in terms of quality, outperform methods such as PlenOctrees (Yu et al., 2021), Kilo-NeRF (Reiser et al., 2021) and FastNeRF (Garbin et al., 2021), while maintaining a comparable speed. Finally, the most recent improvement concerning speed was made in Instant Neural Graphics Primitives (Instant-NGP) (Müller et al., 2022), in which the authors use a trainable multiresolution hash encoding to reduce the MLP size; this results in much faster training and rendering, achieving in a matter of seconds the same results as the previous NeRF models.
The flexibility and speed of Instant-NGP are the fundamental reason why it was chosen to perform the experiments, as such a speed improvement can be groundbreaking in the DCH field.
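To make the idea behind the multiresolution hash encoding more concrete, the following minimal Python sketch shows how grid vertices at several resolutions can be mapped to slots of a fixed-size feature table. It follows the scheme described by Müller et al. (2022), that is, a geometric progression of grid resolutions and an XOR-based spatial hash. The function names and the parameter values in the usage note are illustrative assumptions; the sketch does not reproduce the actual Instant-NGP code nor the configuration used in the experiments.

```python
import math

# Primes of the spatial hash described by Müller et al. (2022).
PRIMES = (1, 2654435761, 805459861)


def level_resolutions(n_min, n_max, n_levels):
    """Grid resolution per level, growing geometrically from n_min towards n_max."""
    if n_levels == 1:
        return [n_min]
    growth = math.exp((math.log(n_max) - math.log(n_min)) / (n_levels - 1))
    return [int(math.floor(n_min * growth ** level)) for level in range(n_levels)]


def hash_vertex(ix, iy, iz, table_size):
    """Map an integer grid vertex to a slot of a table with table_size entries."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size


def vertex_slots(point, resolution, table_size):
    """Table slots of the 8 voxel corners surrounding a point in [0, 1]^3."""
    base = [int(math.floor(c * resolution)) for c in point]
    slots = []
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                slots.append(
                    hash_vertex(base[0] + dx, base[1] + dy, base[2] + dz, table_size)
                )
    return slots
```

For instance, `level_resolutions(16, 512, 16)` yields 16 grid resolutions starting at 16, and `vertex_slots((0.5, 0.5, 0.5), 16, 2**19)` returns the eight table indices whose feature vectors would be interpolated at that point before being passed to the small MLP.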
The motivation behind our experiment stems from the following research question: can AI replace standard protocols of data acquisition and processing without reducing the quality? Or better, to what extent? It is well known that the digitization of cultural goods (leaving aside, for a moment, the scale of representation, fundamental as it is) is nowadays entrusted to TLS, which is accurate but time-consuming and extremely expensive, and to Multi-View Stereo matching (namely Digital Photogrammetry, or DP), which is less accurate but extremely productive and, above all, low-cost. The main reason that drives archaeologists, historians and experts in general to use such technologies is ancestral: never forget the past. The 3D point cloud is the most accurate replica of a cultural good, the sole method discovered so far to recreate an artefact virtually, making it de facto immortal. The resulting 3D model is the instrument to analyse it, to mould it, and to perform further processing such as sectioning, representing and understanding. Finally, cutting-edge visualization tools like Virtual Reality make it shareable with all of mankind. All in all, we are at the stage of striving to discover new methods to, as said before, reduce human intervention.
A NeRF is a fully-connected neural network that can generate novel views of complex 3D scenes from a partial set of 2D images. It is trained with a rendering loss to reproduce the input views of a scene. It works by taking input images representing a scene and interpolating between them to render one complete scene. NeRF is a highly effective way to generate images for synthetic data. In the literature, the benefit of this new technique in the CH panorama is still partially unexplored, apart from some recent works (Condorelli et al., 2021). First, a TLS acquisition of an ancient statue was performed with a Faro Focus 2.
Then, the MVS pipeline was adopted, using 2D images from a Mini-2 DJI Mavic 2 shooting pictures all around the statue from a distance of about 1 m. Finally, the same pictures, reduced by 90%, were used to train and run the NeRF network.
Considering the existing literature, the main contribution of this paper is to improve the knowledge of this method by comparing DP and NeRF in terms of accuracy and final product as a whole, exploiting a TLS acquisition as the validation ground truth.
The paper is organized as follows: Section 2 gives details on the state of the art of AI techniques applied to the DCH domain; Section 3 presents the methods chosen and explains the rationale behind this approach; Section 4 presents the results, the performance comparison of the algorithms used and some discussion; Section 5 is devoted to the conclusions and our future work in this direction.

STATE-OF-THE-ART
This section briefly reviews some relevant background works concerning AI techniques for the analysis and processing of digital representations of CH. As stated in the introduction, AI algorithms can support and speed up a variety of activities linked to architecture, building or civil engineering. Currently, AI and, more specifically, its subsets Machine Learning (ML) and Deep Learning (DL) are applied in areas close to the architectural scale and to CH (Wysocki et al., 2023), even if their use is still limited, since most of the CH literature shows a tendency to rely on statistical toolboxes, which are commonly applied as a "black box" on small CH datasets that are generally not publicly available (Fiorucci et al., 2020). It is also worth noting that in recent years several initiatives have been undertaken to promote CH, and AI techniques are being applied to improve the visiting experience.
Examples of this promotion are the application of AI methods and NLP to support CH institutions and organizations. In (Machidon et al., 2020), an approach based on natural language processing (NLP) has been proposed to retrieve CH resources from Europeana. In particular, the authors designed a solution to improve the accessibility and search accuracy of DCH resources from Europeana through a system that integrates AI, NLP, web services, and APIs. Dou et al. have proposed a knowledge graph for Intangible Cultural Heritage (ICH), which aimed to extract information from ICH text data using NLP techniques to support their organization, management and protection (Dou et al., 2018).
AI has also been used to analyze large amounts of historical data and identify patterns that can help preserve and better understand CH. Image processing techniques have been used in the DCH domain for several purposes. For example, Felicetti et al. (2021) developed Mo.Se. (Mosaic Segmentation), an algorithm that exploits DL and image segmentation techniques to overcome the labour-intensive procedure of extracting information and labelled tesserae from ancient mosaics. The proposed methodology combined the U-Net 3 network with the Watershed algorithm. The approach was tested on the pavement of St. Stephen's Church in Umm ar-Rasas, an archaeological site located 30 km southeast of the city of Madaba (Jordan). Hurtut et al. proposed a method for the analysis of the pictorial content of line drawings that uses the geometrical information of stroke contours. The authors showed that the developed method could be successfully applied to the indexing of line drawings in a retrieval framework (Hurtut et al., 2011).
Focusing more on architectural heritage, it is well established that the safeguarding and maintenance of historic architecture have become crucial to protect it from warfare, environmental impacts, calamities, and man-made disasters. The likelihood of these dangers is magnified by the continuous chemical alteration that affects all monuments. Surveying technologies like photogrammetry and laser scanning are currently employed for data collection, and are established as conventional techniques for the three-dimensional documentation of heritage properties. Shalunts et al. designed an approach based on clustering and learning of local features to classify the architectural style of facade windows (Bebis et al., 2011). In this context, the association of semantic information with point clouds leads to a description of CH, expediting the phase of data interpretation and management. DL algorithms have shown great potential in this regard. DL techniques can directly handle the raw point cloud data, without an intermediate processing step that converts them into a more regular representation. An example is the work of Malinverni et al. (2019), who exploited PointNet++ (Qi et al., 2017) to semantically segment 3D point clouds of a CH dataset. A new dataset, the ArCH dataset, was specifically collected to deal with CH data and manually labelled by domain experts. The specific goal of that paper was to demonstrate the effectiveness of a DL framework specialized in point cloud semantic segmentation in tackling CH-related point clouds.
Inspired by the results obtained in (Wang et al., 2019), the same authors extended their previous work by exploiting the novelties offered by the DGCNN (Dynamic Graph Convolutional Neural Network). Wang et al. introduced a module called EdgeConv, which constructs a local neighborhood graph and applies convolution-like operations on it; the resulting DL model, DGCNN, dynamically updates the graph, changing the set of k-nearest neighbors of a point from layer to layer of the network. Thus, they proposed a modified version of DGCNN that adds relevant features such as normals and HSV-encoded color. This improved version aimed at facilitating the management of DCH assets that have complex, extremely variable geometries defined with a high level of detail.
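As an illustration of the two ingredients mentioned above, the following Python sketch builds EdgeConv-style edge features from a k-nearest-neighbour graph and assembles a per-point descriptor with HSV-encoded colour and normals. It is only a sketch under stated assumptions: the brute-force neighbour search, the function names and the ordering of the nine descriptor values are hypothetical and are not taken from the cited implementations. In DGCNN the neighbour graph is recomputed in feature space at every layer, whereas here it is computed once on the input coordinates.

```python
import colorsys
import math


def k_nearest(points, index, k):
    """Indices of the k nearest neighbours of points[index] (brute force)."""
    distances = [
        (math.dist(points[index], other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    distances.sort()
    return [j for _, j in distances[:k]]


def edge_features(points, index, k):
    """For each neighbour j of point i, concatenate x_i and (x_j - x_i)."""
    center = points[index]
    features = []
    for j in k_nearest(points, index, k):
        offset = [b - a for a, b in zip(center, points[j])]
        features.append(list(center) + offset)
    return features


def point_descriptor(xyz, rgb, normal):
    """Per-point input: coordinates, HSV colour and normal (9 values, assumed order)."""
    r, g, b = (channel / 255.0 for channel in rgb)
    return list(xyz) + list(colorsys.rgb_to_hsv(r, g, b)) + list(normal)
```

A network layer would then apply a shared MLP to each edge feature and aggregate the results over the neighbours of every point, for example with a maximum.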
More recently, an ambitious task of applying AI techniques in the DCH domain has been tackled: 3D reconstruction starting from images. This task involves several challenges, such as the identification of the starting images within the vast archive material and the ability of photogrammetric algorithms to work with few, low-quality images. Some approaches, such as (Vicini et al., 2022), use Signed Distance Functions to represent 3D objects. More recently, the Neural Radiance Field approach emerged; Neural Radiance Fields have proven capable of achieving photorealistic results in complex scenes and have thus gained a lot of research interest. In the span of 2 years, more than 200 preprints have been registered on arXiv (Gao et al., 2022), with the aim of improving the original architecture in different aspects. Some significant works include Mip-NeRF, which used cone tracing instead of the ray tracing of the standard NeRF, and Ref-NeRF (Verbin et al., 2021), which was built starting from Mip-NeRF to model reflective surfaces more accurately.
This approach has already been adopted in (Condorelli et al., 2021). The authors ran their experiments on two different datasets that they specifically collected: the flower and tower datasets. In particular, the latter includes images acquired by the same authors during an on-site survey. They compared the results obtained with NeRF against the traditional photogrammetry pipelines based on Structure-from-Motion (SfM) and Multi-View-Stereo (MVS) open-source algorithms (Schonberger and Frahm, 2016a).
With respect to the above-mentioned state-of-the-art works, our approach consists in creating a novel pipeline that leverages the NeRF architecture and, in particular, Instant-NGP for the creation of 3D scene representations and subsequent mesh generation.

MATERIALS AND METHODS
The comparison between NeRF and MVS-SfM reconstruction methods was performed by using a specific case study to conduct our tests. A statue was chosen for the experiments, namely the Alberico Gentili statue, located in San Ginesio (Macerata, Italy). From this monument, 2 datasets were acquired, the first one composed of images taken from UAV, the second composed of spherical panoramas taken from TLS. As the dataset used for the NeRF pipeline (UAV dataset) was not intended to be used with NeRF initially, this work can be also useful to prove how a dataset not originally acquired to be used with a NeRF approach can be suitable anyway. A detailed description of the created datasets follows.

Dataset
In the presented case study, data acquisition was carried out using two different techniques: photogrammetry and laser scanning. For photogrammetry, a Mini-2 DJI Mavic 2 drone was used as the UAV system. The images were then processed using SfM-MVS techniques, which estimate the 3D structure from 2D image sequences. For laser scanning, a terrestrial laser scanner (TLS), specifically the Faro Focus 2, was used to acquire four spherical panoramas. Each panorama had a high resolution of 10240 × 5120 pixels, enabling the generation of highly accurate and detailed point clouds. The use of both techniques allowed for a more comprehensive analysis of the site and an assessment of the accuracy and reliability of the data obtained using each method, resulting in a more complete and reliable dataset for research. Table 1 lists the datasets collected and the acquisition methods used, including the data for the NeRF methodology, as the starting dataset was the same. In particular, the UAV dataset used in the experiments comprises 224 images of the Alberico Gentili statue, taken with a Mini-2 DJI Mavic 2, with a size of 4000 × 3000 pixels in JPG format. The TLS dataset comprises 4 spherical panoramas, taken with a Faro Focus 2, with an image size of 10240 × 5120 pixels. Figure 1 shows some examples extracted from the UAV dataset (after the relative orientation, thus providing the sparse point cloud).
Concerning the NeRF experiments, the UAV dataset has been used, as its images were found to be suitable for NeRF scene representation; to prove the effectiveness of the chosen network, the UAV dataset has been reduced to 38 images (about 17% of the total), with the aim of achieving comparable results.
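The criterion used to select the 38 images is not detailed here. As a purely illustrative assumption, the sketch below picks images at evenly spaced positions along the acquisition sequence, which is one simple way to keep coverage all around the statue; the file names are hypothetical.

```python
def pick_evenly_spaced(items, count):
    """Return `count` items spread as evenly as possible over the sequence."""
    if count <= 0 or not items:
        return []
    if count >= len(items):
        return list(items)
    step = len(items) / count
    return [items[int(i * step)] for i in range(count)]


# Hypothetical file names: the real naming of the UAV images is not known.
image_names = [f"frame_{i:04d}.jpg" for i in range(224)]
subset = pick_evenly_spaced(image_names, 38)
print(len(subset), f"{len(subset) / len(image_names):.1%}")
```

Other strategies, such as keeping the images with the largest number of matched features, would be equally plausible.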

Photogrammetric pipeline
The documentation and modeling of cultural objects and sites is often carried out using two different methods: photogrammetry and laser scanning. These methods differ significantly in the way data is captured. Photogrammetry uses Structure from Motion (SfM) technology, which involves acquiring images from different positions and angles to reconstruct a three-dimensional model of the scene or object. Laser scanning, on the other hand, directly acquires three-dimensional point clouds using a laser, without using SfM technology. Both methods are suitable for different purposes and situations. Photogrammetry is particularly useful when covering a large area or when accessing difficult-to-reach objects, since images can be taken from different angles and positions. Image-based 3D reconstruction techniques are considered cost-effective and efficient for producing high-quality 3D digital models of real-world objects in terms of hardware requirements, knowledge background, and man-hours. Typically, at least two images with common features are required, and 3D data accompanied by texture information can be derived through perspective or projective geometry formulations. These methods, mainly computer vision (Remondino and El-Hakim, 2006), are generally preferred for lost objects, monuments, or architectures with regular or complex geometric shapes, small objects with free-form shape, mapping applications, deformation analysis, and time or location constraints for data acquisition and processing. Laser scanning, on the other hand, is particularly useful when a large amount of three-dimensional data must be obtained in a very precise and detailed way. In the context of cultural heritage, generating complete digital representations of the captured scene, often in the form of 3D surfaces, requires great attention to the quality of the surface models or meshes. These models must be highly detailed and sufficiently accurate, especially for metric applications.
Surface generation can be seen as an integrated problem in the complete 3D reconstruction pipeline, and thus visibility information (pixel similarity and image orientation) is exploited in the meshing procedure, contributing to an optimal photo-consistent mesh.

NeRF workflow
A much faster version of the original NeRF network has been used, namely Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (Instant-NGP, see footnote 2) (Müller et al., 2022); this network allows for accurate training within a short period of time, besides enriching the bulk of knowledge in the field. In particular, the choice of Instant-NGP was made due to its training and inference speed, which is a key factor in the CH field, given that, in a classic MVS-SfM approach, this can be a really time-consuming task. Also, NeRF allows for the generation of novel views, so a dataset with a low number of examples can be suitable.
2 https://github.com/NVlabs/instant-ngp
An Anaconda environment running Python 3.9.15 has been created, installing all the necessary packages to run the network. CUDA Toolkit version 11.8 has been used, as Instant-NGP leverages the tiny-cuda-nn framework for the multiresolution hash input encoding.
In order to prepare the dataset for training, camera parameters must be extracted from the images, as both camera positions and images are needed for the network to create the scene representation. To achieve this goal, the COLMAP software has been used, as it implements SfM techniques (Schönberger and Frahm, 2016b) for estimating three-dimensional structures from two-dimensional image sequences. For usage with COLMAP, the dataset images have been resized to 2400 × 1800 pixels, as this proved to speed up the generation of the camera parameters while maintaining a high image resolution. Both the camera positions and the image set are then fed to the network to start the training process. Then, the 5D coordinates derived for each image (x, y, z spatial location and θ, ϕ viewing direction) are given as input to the network, which outputs the view-dependent emitted radiance (RGB) and a volume density σ. Finally, through classic volume rendering techniques, the output is projected onto an image, which enables the computation of a loss to train the network effectively. The result is the 3D scene representation. After the scene is trained, a mesh can be extracted by using the marching cubes algorithm (Lorensen and Cline, 1987). By increasing the number of steps, the network is capable of representing the monument with good quality and accuracy. These results show the suitability of this method for either a faster yet accurate training or a longer and extremely accurate training. A demo video showcasing the results obtained from the 35,000-step training (see footnote 7) has also been created to better exhibit the effectiveness of the network used.
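The volume rendering step described above can be sketched as follows. The function composites the densities and colours predicted at the samples of a single camera ray with the standard NeRF quadrature, C = Σ T_i (1 − exp(−σ_i δ_i)) c_i, where T_i is the accumulated transmittance and δ_i the spacing between samples; the squared error against the pixel of the input image is the rendering loss. This is a minimal pure-Python illustration with hypothetical function names, not the CUDA implementation used by Instant-NGP.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite the samples along one ray with the NeRF quadrature rule.

    densities: volume density sigma_i at each sample
    colors:    (r, g, b) radiance at each sample
    deltas:    distance between consecutive samples
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


def squared_error(rendered, target):
    """Photometric loss for one pixel: squared difference to the input image."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target))
```

Because every operation is differentiable with respect to the predicted densities and colours, the loss can be back-propagated to the network parameters during training.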

RESULTS AND DISCUSSION
Starting from the 3D scene representation, a polygon mesh of the monument has been generated, by cropping the relevant portion. This can be useful as the mesh can be imported in 3D modeling tools. Figure 4 left shows some qualitative results. At the same time, the photogrammetric model was created, in order to analyse and understand in detail the pros and cons of each method (Figure 4 right).
Given the outcomes of our computation, the pros and cons of the proposed method need to be discussed. First of all, when dealing with cultural heritage documentation, in terms of quality the methods based on TLS and SfM still outperform the AI-based and generative ones. This is clearly demonstrated in Table 3, considering the roughness values achieved. Surface reconstruction methods in photogrammetric applications were evaluated using the work of (Nocerino et al., 2020) as a reference. The results of the approaches considered were evaluated using various metrics, including accuracy and roughness. In contrast, calculation time was not considered a key factor in this investigation (Table 3). Moreover, in the cloud-to-mesh (C2M) comparison, the accuracy values seem to be comparable between the SfM and NeRF methods. This is misleading, since the relative orientation is performed in the same way for the two methods, thus leading to similar results; the main issue emerges with the creation of the dense cloud, since the NeRF method does not produce the dense cloud that is a by-product of the photogrammetric algorithms. Further investigations are needed in order to extract a comparable point cloud from the NeRF method. However, insiders know that during acquisition campaigns several problems (especially in the architectural and archaeological fields) prevent surveyors from collecting complete datasets. Missing images, occlusions, weather conditions and so on are often "enemies" of good-quality models; in these cases, the NeRF method can be a winning solution to achieve a result that, despite being noisy and not accurate, can be used for further processing.
Finally, it is fair to say that the photogrammetric pipeline is very time-consuming in both acquisition and processing, whilst the TLS is still a very costly instrument; when the survey campaign has the above-mentioned limitations, the NeRF method can be a valuable alternative.
Table 3. Evaluation metrics for quantitative analyses. Values are in mm.
7 https://www.youtube.com/watch?v=wOiX6uEur0Q
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-M-2-2023 29th CIPA Symposium "Documenting, Understanding, Preserving Cultural Heritage: Humanities and Digital Technologies for Shaping the Future", 25-30 June 2023, Florence, Italy
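Once the signed cloud-to-mesh distances between a reconstruction and the TLS reference have been exported from a point cloud processing tool, accuracy figures of the kind discussed above can be summarised with a few lines of Python. The sketch below is an assumption about how such a summary could be computed (mean, standard deviation and RMSE, in mm); the exact definitions behind the values of Table 3 are not given in the text, and the numbers in the usage note are made up for illustration.

```python
import math
import statistics


def c2m_summary(signed_distances_mm):
    """Mean, standard deviation and RMSE of signed cloud-to-mesh distances (mm)."""
    values = list(signed_distances_mm)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    rmse = math.sqrt(statistics.fmean(d * d for d in values))
    return {"mean": mean, "std": std, "rmse": rmse}
```

For example, `c2m_summary([-1.0, 1.0, -1.0, 1.0])` describes a reconstruction with no systematic offset (mean 0 mm) but a 1 mm spread around the reference surface. A mean far from zero would instead point to a residual misalignment between the two models.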

CONCLUSION AND FUTURE WORKS
In conclusion, this study presents a novel approach for 3D documentation of an ancient statue, which combines TLS acquisition and MVS pipeline using images from a Mini-2 DJI Mavic 2 drone. The results of this study demonstrate the potential of using these methods for accurate and detailed 3D documentation of CH objects. Additionally, the comparison between DP and NeRF methods highlights the importance of carefully choosing the appropriate approach based on the specific object and research question. Finally, the interdisciplinary nature of this study underscores the need for continued collaboration and communication between experts from different domains to advance the field of 3D documentation for DCH. This research contributes to ongoing efforts to improve the accuracy and efficiency of 3D documentation methods, which has important implications for CH preservation and research.
Some further works could involve the usage of different algorithms for mesh extraction in the NeRF pipeline, such as Deep Marching Tetrahedra (Shen et al., 2021), that could improve mesh generation. Other additions for NeRFs could see the usage of RGB-D images as input, as providing depth information in input can result in reduced artifacts and other minor issues currently present in the generated meshes.