Constraining the Geometry of NeRFs for Accurate DSM Generation from Multi-View Satellite Images

Abstract: Neural Radiance Fields (NeRFs) are an emerging approach to 3D reconstruction that uses neural networks to reconstruct scenes. However, their applications to multi-view satellite photogrammetry, which aims to reconstruct the Earth's surface, struggle to produce accurate digital surface models (DSMs). To address this issue, a novel framework, Geometric Constrained Neural Radiance Field (GC-NeRF), tailored for multi-view satellite photogrammetry, is proposed. GC-NeRF achieves higher DSM accuracy from multi-view satellite images. The key point of this approach is a geometric loss term, which constrains the scene geometry by making the scene surface thinner. The geometric loss term, alongside z-axis scene stretching and multi-view DSM fusion strategies, greatly improves the accuracy of the generated DSMs. During training, bundle-adjustment-refined satellite camera models are used to cast rays through the scene. To avoid the additional input of altitude bounds required by previous works, the sparse point cloud resulting from the bundle adjustment is converted to an occupancy grid to guide the ray sampling. Experiments on WorldView-3 images indicate GC-NeRF's superiority in accurate DSM generation from multi-view satellite images.

Multi-view satellite photogrammetry pipelines predominantly focus on multi-view stereo (MVS) methods [14][15][16][17][18], while some recent works integrating Neural Radiance Fields (NeRFs) [19] have been shown to achieve better results [20,21]. Using neural networks and volume rendering [22] techniques, NeRFs synthesize new photorealistic images from 2D images and are an emerging approach to 3D reconstruction. Due to challenges presented by satellite camera characteristics, their application to multi-view satellite photogrammetry is limited by the difficulty of acquiring accurate digital surface models (DSMs). For satellite scenes, the closely distributed viewpoints of satellite cameras make it hard to reconstruct fine geometry accurately, which is the main cause of poor DSM accuracy. Moreover, during training and rendering, NeRFs use camera models to cast rays through the scene and project them onto known image pixels. However, satellites are far from Earth, so the large distance between the cameras and the scene poses challenges for ray sampling. Additionally, common inconsistencies in scene appearance, such as shadow movement and radiometric variation among satellite images, can easily lead to a lack of convergence during network training. Some works [20,21,23] have explored NeRFs for multi-view satellite images, but the accuracy of their DSM outputs must be improved because NeRFs lack a geometric constraint. Moreover, these methods require the altitude bounds of the scene as an extra input for ray sampling guidance. This extra input reduces their feasibility, since the goal is to obtain a DSM that itself represents the altitude of the scene. Additionally, the network architecture used in these methods for solving inconsistency issues is too complex.
To address these issues, a novel framework, Geometric Constrained Neural Radiance Fields (GC-NeRFs), specifically designed for multi-view satellite photogrammetry, is proposed. The key difference between GC-NeRF and traditional NeRF approaches is a geometric loss term, which constrains the scene geometry by making the scene surface thinner, greatly improving the accuracy of generated DSMs. Moreover, z-axis scene stretching is conducted for finer reconstruction granularity in the z-direction, which plays a constructive role in improving DSM accuracy. In addition, a strategy for fusing multi-view DSMs reduces errors in obstructed areas. During training, bundle-adjustment-refined satellite camera models [20] are used to cast rays through the scene. An occupancy grid converted from sparse point clouds generated by bundle adjustment is used to guide ray sampling, which removes the dependence on altitude bounds. Additionally, GC-NeRF integrates several advanced techniques from NeRF variants [24,25] for a concise network architecture. Experiments with WorldView-3 images indicate that GC-NeRF achieves higher DSM accuracy while minimizing additional input requirements.

Related Work
To achieve accurate 3D reconstructions, multi-view satellite photogrammetry faces several challenges, including limited and closely distributed viewpoints, which lead to ambiguous elevation estimates, and inconsistent illumination conditions, which affect scene appearance. In this section, relevant research on multi-view stereo and NeRF-based methods is reviewed, focusing on approaches aimed at improving reconstruction accuracy and addressing challenges specific to multi-view satellite photogrammetry.

Multi-View Stereo for Satellite Images
Multi-View Stereo (MVS) was first approached as an extension of stereo pair algorithms by aggregating information from multiple stereo pairs. Subsequently, true MVS algorithms were mainly developed to reconstruct objects from images photographed at a close distance while considering all the images in the scene [26].
MVS approaches are widely employed for scene reconstruction from aerial and satellite images. However, satellite images are characterized by an extremely small ratio between the depth range and the distance from the camera to the scene, as well as by inconsistent illumination, which discourages the use of true MVS methods for satellite images [14]. In the case of satellite images, MVS has traditionally employed pairwise approaches, treating multiple views in pairs using traditional two-view stereo methods and subsequently combining pairwise reconstructions to obtain the final results [15,27]. These methods typically involve pair selection, stereo rectification, dense stereo matching, triangulation, and depth fusion [28]. Semi-Global Matching (SGM) [29] is a popular choice for the stereo matching step, and its variants such as MGM [30], tSGM [31], and semi-global block matching [32] have been used to enhance efficiency and accuracy. However, these methods usually rely on manually selected stereo pairs and manually designed matching strategies, which may not ensure optimality [33]. Recently, deep learning methods such as GA-Net [34], PSM [35], and HSM [36] have been progressively applied to satellite stereo pipelines [14,37] with some progress. Unlike classical algorithms, deep learning methods are prone to failure or reduced accuracy when encountering unseen scenarios [21]. Moreover, learning-based methods often involve extensive training periods, lasting days or even weeks. Though popular, deep learning methods are still not the preferred option in satellite stereo pipelines, but rather a step in classic MVS methods [14]. In general, MVS methods mainly focus on pairwise matching and do not fully exploit the benefits of multi-view data, leading to challenges in addressing seasonal and dynamic variations in satellite images [28].

Neural Radiance Field
Neural Radiance Fields (NeRFs) [19] represent static scenes through a continuous volumetric function F, learned as a fully-connected neural network. This function predicts the emitted RGB color c_X = (r, g, b) and a non-negative scalar volume density σ_X at a 3D point X = (x, y, z) from a given viewing direction d_view = (θ, φ):

$$F : (X, d_{view}) \mapsto (c_X, \sigma_X) \quad (1)$$

Using a collection of input images and their camera poses, rays passing through the scene are projected onto the known pixels. Each ray r is defined by a point of origin o and the viewing direction; in general, o is located at the camera's projection center. To render the ray's color, r is discretized into N 3D points X_i, i.e., ray samples with X_i = o + d_view t_i, where t_i is the distance between o and X_i. The color c(r) of a ray r is computed using volume rendering [22] as:

$$c(r) = \sum_{i=1}^{N} w_i c_i \quad (2)$$

The rendered color c(r) results from integrating the colors c_i predicted by F at different points of the ray r. The weight w_i that each point X_i in r contributes to the rendered color depends on the opacity α_i and the transmittance T_i:

$$\alpha_i = 1 - \exp(-\sigma_i \delta_i) \quad (3)$$

$$T_i = \prod_{j=1}^{i-1} (1 - \alpha_j) \quad (4)$$

$$w_i = T_i \alpha_i \quad (5)$$

where δ_i is the distance between consecutive points along the ray. Additionally, the depth along a ray is rendered using a similar weighted integration approach:

$$depth(r) = \sum_{i=1}^{N} w_i t_i \quad (6)$$

NeRFs are trained by minimizing the image loss Loss_c, which is computed as the mean squared error (MSE) between the rendered color and the observed color of input images along the rays' projections:

$$Loss_c = \frac{1}{|R|} \sum_{r \in R} \left\| c(r) - c_{gt}(r) \right\|_2^2 \quad (7)$$

where c_gt(r) is the observed color, and R is the set of rays in each input batch. During training, F is gradually optimized until it predicts accurate colors and densities at different 3D points in the scene.
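The compositing step described above can be illustrated with a minimal sketch in plain Python. It assumes the standard NeRF discretization (opacity from density and step size, transmittance as the running product of one minus opacity); the function name `render_ray` and the handling of the last interval are illustrative choices, not part of the original method.

```python
import math

def render_ray(sigmas, colors, ts):
    """Composite per-sample densities and colors along one ray.

    sigmas: non-negative densities, one per sample
    colors: (r, g, b) tuples, one per sample
    ts:     increasing distances t_i from the ray origin
    Returns (rgb, depth, weights).
    """
    n = len(ts)
    # delta_i is the distance to the next sample; the last interval
    # reuses the previous one (an assumption of this sketch).
    deltas = [ts[i + 1] - ts[i] for i in range(n - 1)]
    deltas.append(deltas[-1] if deltas else 1.0)

    weights = []
    transmittance = 1.0  # nothing occludes the first sample
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weights.append(transmittance * alpha)    # contribution to the pixel
        transmittance *= 1.0 - alpha             # light left for later samples

    rgb = tuple(sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3))
    depth = sum(w * t for w, t in zip(weights, ts))
    return rgb, depth, weights
```

A ray whose second sample is fully opaque receives that sample's color and distance, because the transmittance drops to zero for everything behind it.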

NeRF Variants for Multi-View Satellite Photogrammetry
Recent efforts have explored NeRFs' potential for satellite imaging. S-NeRF [23] pioneered the application of NeRFs in multi-view satellite photogrammetry, leveraging the sun direction d_sun for precise geometry and building shadow rendering. A complex remote sensing irradiance model is adopted in S-NeRF:

$$F : (X, d_{sun}) \mapsto (\sigma_X, a_X, s_X, sky_X) \quad (8)$$

Physically, a_X represents albedo color, and s_X represents the ratio of incoming solar light with respect to the diffuse sky light sky_X. The color of X is computed as:

$$c_X = a_X \cdot \left( s_X + (1 - s_X) \cdot sky_X \right) \quad (9)$$

Subsequent works typically follow the irradiance model of S-NeRF [20,21], though it is very complex. Building on S-NeRF, Sat-NeRF [20] incorporates transient object modeling similar to NeRF-W [54] and satellite-adapted camera representations, enhancing accuracy by integrating RPC camera models and applying bundle adjustment. EO-NeRF [21] focuses on shadow modeling to align building shadows with scene geometry, resulting in highly accurate and detailed DSM reconstruction. In addition, SatelliteNeRFs [55] directly apply NeRFs to satellite images and can extract meshes, but do not adapt to the special characteristics of satellite images.
Despite progress in simplifying satellite sensors' imaging process, the irradiance model is not entirely accurate. Moreover, these methods require altitude bounds as an additional input for sampling guidance, reducing their feasibility. Their training speed is also very low. Various strategies have been adopted to accelerate NeRFs since their proposal in 2020, including optimizing sampling [56,57], scene decomposition [58,59], and combining explicit models [60,61]. Instant-NGP [24] adopts a different approach, with a network architecture that uses a multi-resolution hash grid for position encoding and combines it with an occupancy grid to guide sampling, achieving a 3000-fold acceleration while significantly reducing memory usage. Integrating these methods into multi-view satellite photogrammetry greatly improves efficiency. Table 1 shows the disadvantages of previous NeRF-based methods for multi-view satellite photogrammetry, which are solved with GC-NeRFs. Overall, NeRFs present a promising solution to these challenges in multi-view satellite photogrammetry. However, NeRFs are still immature at processing satellite images, and adapting them to multi-view satellite photogrammetry raises unique challenges owing to satellite camera characteristics. This paper aims to address these issues by proposing a novel approach, GC-NeRFs, specifically designed for multi-view satellite photogrammetry to improve reconstruction accuracy.

Methods
GC-NeRFs aim to enhance geometric reconstruction accuracy for satellite scenes and generate accurate DSMs. The overview of GC-NeRFs is shown in Figure 1. Its key contributions include z-axis scene stretching, an occupancy grid converted from sparse point clouds, DSM fusion, and a geometric loss term for network training, which are shown in bold. After applying a bundle adjustment to the satellite images, refined camera models and a point cloud are obtained (Figure 1a,b). Then, the scene is stretched along the z-axis to enlarge the scale in the vertical direction (Figure 1c). Afterward, satellite camera models are used to cast rays through the scene, and an occupancy grid converted from the sparse point cloud generated by the bundle adjustment is used to guide sampling along the rays (Figure 1d). These sample points are input into the GC-NeRF network to query colors and densities. GC-NeRF is trained with the image loss and geometric loss terms proposed in Section 3.3 (Figure 1e). After optimization, GC-NeRF renders multi-view DSMs (Figure 1g) using volume rendering (Figure 1f). Additionally, the fusion of multi-view DSMs contributes to accuracy improvement (Figure 1h).

Z-axis Stretched Radiance Model
GC-NeRF represents the scene as a static surface. However, minimizing Loss_c in Formula (7) can lead to a lack of convergence due to inconsistent appearances between satellite images. Therefore, the sun direction d_sun was used for appearance encoding, turning Formula (1) into:

$$F : (X, d_{view}, d_{sun}) \mapsto (c_X, \sigma_X) \quad (10)$$

For appearance encoding, d_sun efficiently handles different weather conditions or seasonal variations in satellite images, whereas the experiments fail without d_sun. Compared to previous works [20], which used an extra appearance module, GC-NeRF's network architecture is very concise. Furthermore, GC-NeRF avoids using the complex remote sensing optical model of Formulas (8) and (9) proposed by S-NeRF [23] to ensure network simplicity.
Fast convergence is achieved by integrating the multi-resolution hash encoding (MH) proposed in Instant-NGP [24]. MH divides the scene into multi-resolution voxel grids and uses hash tables to store optimizable space features at each grid cell vertex (Figure 2a). Each input 3D coordinate X is encoded into a 32-length vector.
Unlike typical scenes, satellite scenes are usually flat, with a significant difference in horizontal and vertical scales as their horizontal range is large and vertical range is small (Figure 2a). This study aimed to improve the accuracy of output DSMs in the z-direction. To make the reconstruction granularity in the z-direction finer, z-axis scene stretching was conducted so that X = (x, y, z_stretched):

$$z_{stretched} = \frac{z}{s_z} \quad (11)$$

where s_z denotes a stretching scale factor between 0 and 1. Stretching only on the z-axis makes the scene scale more coordinated and increases the sampling density in the z-direction, thus improving the utilization rate of the hash grid and DSM accuracy. However, over-stretching (Figure 2c) leads to excessive hash conflicts in the multi-resolution hash encodings, reducing model performance. Therefore, s_z should not be too small. Setting s_z to 0.80 optimizes model performance, as discussed in Section 5. Because stretching the z-axis deforms the 3D model, the elevations are squeezed back correspondingly when DSMs are output.
The architecture of the GC-NeRF network is shown in Figure 3. The network receives the 3D spatial coordinate X, sun direction d_sun, and viewing direction d_view as inputs to predict the volume density σ_X and color c_X at X. F_σ and F_c are small fully-connected networks. The former has only one hidden layer, whereas the latter has two hidden layers, with each hidden layer containing 64 neurons and activated by the ReLU function. Furthermore, the output σ_X is activated by an exponential function and the output c_X is activated by a sigmoid function. Spherical harmonic encodings [61] were used to encode the viewing direction and the sun direction, each into a 16-length vector.
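The z-axis stretching and the matching squeeze at DSM output can be sketched as a pair of inverse mappings. This is only an illustration under the assumption that stretching divides z by the scale factor s_z (so a smaller s_z stretches more, in line with the remark that s_z should not be too small); the function names are hypothetical.

```python
def stretch_z(point, s_z=0.8):
    """Map a scene point (x, y, z) to the stretched space used for training.

    Assumes z_stretched = z / s_z with 0 < s_z <= 1, so the vertical
    extent grows while x and y stay unchanged.
    """
    if not 0.0 < s_z <= 1.0:
        raise ValueError("s_z must lie in (0, 1]")
    x, y, z = point
    return (x, y, z / s_z)

def squeeze_elevation(z_stretched, s_z=0.8):
    """Undo the stretch when writing elevations into the output DSM."""
    return z_stretched * s_z
```

Applying `squeeze_elevation` to the z component returned by `stretch_z` with the same s_z recovers the original elevation, which is why the deformation does not bias the final DSM.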

Occupancy Grid Converted from Sparse Point Cloud
In this paper, satellite camera models are parameterized with rational polynomial coefficients. GC-NeRF is trained by casting rays from bundle-adjustment-refined satellite camera models [20,62] into known image pixels. To render a ray's color, NeRFs typically require multiple samples along the ray, which is challenging for satellite images because the cameras are so far from Earth. Using the same 128-point sampling approach as NeRFs would yield poor results. Previous methods (such as Sat-NeRFs [20], EO-NeRFs [21], S-NeRFs [23], and RS-NeRFs [63]) addressed this problem by sampling only within the scene's altitude bounds, which are difficult to acquire. In fact, obtaining the altitude of the scene is the goal of multi-view satellite photogrammetry.
Bundle adjustment additionally produces a sparse point cloud that indicates the scene geometry. Therefore, this paper proposes a method of converting the sparse point cloud to an occupancy grid (Figure 4a,b) for ray sampling guidance, eliminating the need to input altitude bounds.

The occupancy grid divides the entire scene into 128^3 cells, with each cell storing a bit to indicate whether an object occupies the area (Figure 4b). Empty areas are skipped during ray sampling, reducing the number of samples to increase training and rendering efficiency. When converting point clouds to occupancy grids, a cell of the occupancy tensor is set to positive if a point falls within it (Figure 4a,b). However, due to sparsity and errors in the point cloud, training cannot solely depend on the initial occupancy grid. To mitigate this, volume density is re-evaluated every 16 training iterations to update the occupancy grid following previous works [24,64]. Cells in the occupancy grid are classified based on their counterparts, float grid cells, which store cell density as float values (Figure 4b,c). During updates, each float grid cell decays its old density by a factor of 0.95 and samples an alternative density value at a random point in the cell; the maximum of the decayed and alternative densities is retained (Figure 4c,d). A cell is classified as non-empty when its opacity α, derived from density and step size using Formula (3), exceeds the threshold (α > 0.01).
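The conversion and update rules above can be sketched as follows. This is a simplified illustration, not the authors' implementation: occupied cells are kept in a Python set instead of a dense bit tensor, and the scene bounds `bounds_min` and `bounds_max` are hypothetical inputs standing in for whatever normalization the pipeline uses.

```python
import math

def point_cloud_to_occupancy(points, bounds_min, bounds_max, resolution=128):
    """Return the set of (i, j, k) grid cells that contain at least one point."""
    occupied = set()
    for p in points:
        idx = []
        for axis in range(3):
            extent = bounds_max[axis] - bounds_min[axis]
            u = (p[axis] - bounds_min[axis]) / extent  # normalized coordinate
            if not 0.0 <= u <= 1.0:
                break  # point lies outside the scene box, ignore it
            idx.append(min(int(u * resolution), resolution - 1))
        else:
            occupied.add(tuple(idx))
    return occupied

def update_cell_density(old_density, sampled_density, decay=0.95):
    """Decay the stored density, then keep the larger of old and new values."""
    return max(old_density * decay, sampled_density)

def is_occupied(density, step_size, threshold=0.01):
    """Classify a cell by the opacity implied by its density and the step size."""
    alpha = 1.0 - math.exp(-density * step_size)
    return alpha > threshold
```

The decay lets cells that were wrongly marked by noisy bundle-adjustment points fade out over successive updates, while the maximum keeps any cell where the network currently predicts density.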

Geometric Loss Term
A geometric loss term is designed to constrain scene geometry. Multi-view satellite photogrammetry mainly focuses on geometric accuracy, while original NeRFs only constrain the appearance without explicitly constraining the geometry. Geometric consistency is implicitly constrained by the intersection of rays. However, camera parameters contain inevitable errors, creating inconsistent depths for different rays intersecting at the same point. Because satellite cameras observe the scene from closely distributed viewpoints, the same camera parameter error, i.e., an angular error θ, produces a greater geometric error in the satellite scene (Figure 5a).
The large geometric error leads to a wide range of depths, with sample weights along rays dispersed around the true depth (left graph in Figure 5b). Obtaining the true depth from a large depth range is difficult and decreases DSM accuracy. To reduce depth estimation error, a geometric loss term is introduced for GC-NeRFs to constrain scene geometry as follows:

$$Loss_g = \sum_{r \in R} \sum_{i \in r} w_i \left| t_i - depth(r) \right| \quad (12)$$

Minimizing Loss_g reduces the weights of samples far from the rendered depth on a ray r, creating a compact weight distribution (right graph in Figure 5b) and a thinner surface. This improves the geometric precision of the surface position and, in turn, the accuracy of the output DSMs. Meanwhile, Loss_g is applied to a batch of rays, R, reducing depth inconsistency between different rays. However, using Loss_g directly makes the density σ of the entire scene tend toward zero as training progresses, meaning that the reconstructed scene is empty and the reconstruction has failed. To avoid this, Loss_g is rewritten with an additional item in Formula (13), which ensures that the scene does not become empty during training. The main term of the GC-NeRF loss function is the Loss_c defined in Formula (7), which is complemented by Loss_g. The complete loss function can be expressed as:

$$Loss = Loss_c + \lambda_g Loss_g \quad (14)$$

where λ_g is a weight given to Loss_g and is empirically set to 0.02. The model is trained following the ray casting strategy for NeRFs, and Loss_g significantly contributes to the improved accuracy of output DSMs.
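A rough sketch of the geometric term may help make the thinning effect concrete. It assumes the loss penalizes each sample weight by its absolute distance from the rendered depth of the same ray; that exact form is an assumption of this sketch, and the additional item that keeps the scene from becoming empty is deliberately left out because its form is not reproduced here. The function names are illustrative.

```python
def geometric_loss(batch_weights, batch_ts):
    """Sum over rays of sum_i w_i * |t_i - depth(r)| (assumed form).

    batch_weights: per-ray lists of sample weights w_i
    batch_ts:      per-ray lists of sample distances t_i
    """
    total = 0.0
    for weights, ts in zip(batch_weights, batch_ts):
        # rendered depth of this ray: weighted sum of sample distances
        depth = sum(w * t for w, t in zip(weights, ts))
        total += sum(w * abs(t - depth) for w, t in zip(weights, ts))
    return total

def total_loss(loss_c, loss_g, lambda_g=0.02):
    """Image loss complemented by the weighted geometric term."""
    return loss_c + lambda_g * loss_g
```

A ray with all of its weight on one sample gives zero loss, whereas weight split between two distant samples is penalized. A ray with all-zero weights also gives zero, which mirrors the collapse toward an empty scene that the text says the unmodified term can cause.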

Multi-View DSMs Fusion
To generate a DSM, a depth map was rendered according to Formula (6). Subsequently, the corresponding camera parameters were used to convert the depth map into a 3D point cloud, which was flattened into a DSM [20]. Due to the obstruction caused by tall buildings and trees, the surface depth in some areas was inconsistent with the rendered depth. In other words, the generated point cloud did not cover these areas, resulting in inaccurate DSM information at the edges of these tall objects.
Therefore, a multi-view DSM fusion strategy was proposed to improve accuracy. Multi-view DSMs were generated using all viewpoints in each dataset. DSMs from a single viewpoint may be hindered by tall objects, but DSMs from multiple viewpoints can complement each other. A simple approach is to merge point clouds from multiple viewpoints. However, flattening the merged point cloud does not improve DSM accuracy significantly.
ISPRS Int. J. Geo-Inf. 2024, 13, 243
By analyzing DSMs from multiple viewpoints, we found large root mean square error (RMSE) values for the elevation between different viewpoints in occluded areas (Figure 6b). Therefore, merging these point clouds directly cannot distinguish noisy points. Furthermore, errors in these areas were mostly positive (Figure 6a), indicating that the predicted elevation was greater than the actual elevation in most areas with a large RMSE value.
Therefore, RMSE can be used to improve DSM accuracy:

$$elevation = \begin{cases} avg, & RMSE < Thres_{RMSE} \\ min, & RMSE \geq Thres_{RMSE} \end{cases} \quad (15)$$

In this formula, elevation denotes an elevation value in a DSM pixel and RMSE denotes the standard deviation of elevations at a given pixel across viewpoints. avg applies to areas where the RMSE is small: outliers are removed using the 3-sigma rule, and the average of the remaining elevation values is taken as the final elevation at the corresponding pixel. min applies to areas with a large RMSE, where the minimum elevation is taken. Thres_RMSE, the threshold for determining whether the RMSE is large or small, is set to 0.1 times the maximum RMSE in the entire scene. After multi-view DSM fusion, the elevation accuracy of some obstructed areas improved.
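The fusion rule can be sketched per pixel as below. This is a minimal illustration in plain Python, assuming each pixel carries one elevation per viewpoint and that the per-pixel spread is the population standard deviation; the function names and the flat list-of-pixels layout are choices made for this sketch.

```python
import math

def pixel_std(values):
    """Population standard deviation of the elevations at one pixel."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def fuse_pixel(elevations, thres_rmse):
    """Minimum where viewpoints disagree strongly, filtered mean elsewhere."""
    spread = pixel_std(elevations)
    if spread >= thres_rmse:
        return min(elevations)  # occluded area: errors are mostly positive
    mean = sum(elevations) / len(elevations)
    # 3-sigma rule: drop values further than three deviations from the mean
    kept = [e for e in elevations if abs(e - mean) <= 3.0 * spread]
    return sum(kept) / len(kept)

def fuse_dsms(stacks, ratio=0.1):
    """stacks: list of per-pixel elevation lists, one value per viewpoint."""
    spreads = [pixel_std(s) for s in stacks]
    thres_rmse = ratio * max(spreads)  # 0.1 times the scene-wide maximum
    return [fuse_pixel(s, thres_rmse) for s in stacks]
```

Taking the minimum in high-spread pixels follows from the observation that the errors there are mostly positive, so the lowest estimate is usually the closest to the true surface.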

Experiments and Results
GC-NeRFs were assessed in four areas of interest (AOI), each spanning 256 × 256 m, using approximately 10-20 crops from the WorldView-3 optical sensor with a pixel resolution of 0.3 m. The four AOI are the same as those of S-NeRFs and Sat-NeRFs for comparison purposes. The four AOI are from different locations, with one from rural areas and three from urban areas. The images were sourced from publicly available data from the 2019 IEEE GRSS Data Fusion Contest (DFC2019) [65,66]. Detailed information for each AOI is shown in Table 2.

Implementation Details
GC-NeRFs were trained with an Adam optimizer starting with a learning rate of 0.01, which was decreased every 4k iterations by a factor of 0.33 according to a step scheduler. Each batch of rays contained 2^18 samples, and convergence required about 15k iterations and 6 min on an NVIDIA GeForce RTX 3080 GPU with 12 GB RAM. DSMs were assessed using lidar DSMs (from DFC2019 datasets) with a resolution of 0.5 m per pixel. The peak signal-to-noise ratio (PSNR) [67] of the image renderings and the elevation mean absolute error (MAE) [68], with respect to lidar data, were used as evaluation metrics.
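For reference, the step schedule and the two evaluation metrics can be written out in a few lines. This is a sketch under common conventions, not the authors' code: it assumes the learning rate is multiplied by 0.33 at every 4000th iteration, that colors lie in [0, 1] for PSNR, and that predicted and lidar elevations are already aligned pixel by pixel.

```python
import math

def learning_rate(iteration, base_lr=0.01, step=4000, gamma=0.33):
    """Step schedule: multiply the rate by gamma once per completed step."""
    return base_lr * gamma ** (iteration // step)

def psnr(rendered, observed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat lists of values."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, observed)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

def elevation_mae(predicted, reference):
    """Mean absolute elevation error between a DSM and the reference DSM."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)
```

With these defaults the rate has been reduced three times by iteration 15,000, the approximate point of convergence reported above.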

Result Analysis
The experimental results indicate that GC-NeRF is superior to previous methods, including S-NeRFs, Sat-NeRFs, and SatelliteRFs [69], both qualitatively and quantitatively. Table 3 displays the detailed experimental data. The 3D models in Figure 7 show that GC-NeRFs can acquire fine scene geometry.
NeRFs were originally used for novel view synthesis, and reconstruction quality is mainly measured by the quality of the synthesized views. Therefore, PSNR was first used to evaluate GC-NeRF reconstruction quality. Figure 8 compares the quality of rendered images between S-NeRFs, Sat-NeRFs, and GC-NeRFs. Table 3 shows that GC-NeRFs produce superior image quality scores, as measured by PSNR. However, Sat-NeRFs remove transient objects such as cars (Figure 8a,b), which may decrease their PSNR. In fact, quantitative appearance comparisons are not meaningful because satellite images have inconsistent lighting conditions. Qualitative comparison found that GC-NeRF rendered images were clearer, as shown in Figure 8c,d.
Figure 9 visualizes the DSMs generated by lidar, S-NeRFs, Sat-NeRFs, and GC-NeRFs. Compared to the lidar DSM, GC-NeRFs generate unexpectedly uneven surfaces, such as in road areas (Figure 9a,b). Sharper building edges were obtained by the multi-view DSM fusion strategy (Figure 9c,d). Thanks to the geometric loss term proposed in Section 3.3, the quality of DSMs generated by GC-NeRFs is superior to that of S-NeRFs (Figure 9e,f). Table 3 shows that GC-NeRFs produce superior DSM quality scores, as measured by the MAE.
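The two metrics used in this comparison can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the function names `psnr` and `dsm_mae`, the `max_value` image scale, and the `valid_mask` argument (standing in for the masking of water and building changes mentioned for Figure 9) are assumptions introduced here.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two images on the same scale.

    max_value is the largest possible pixel value (assumed 1.0 here).
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def dsm_mae(predicted, reference, valid_mask=None):
    """Mean absolute elevation error between two DSM rasters.

    Cells that are NaN in either raster are skipped. valid_mask is an
    optional boolean raster (hypothetical) that is False for cells to be
    excluded, e.g. water or changed buildings.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    valid = np.isfinite(predicted) & np.isfinite(reference)
    if valid_mask is not None:
        valid &= np.asarray(valid_mask, dtype=bool)
    if not valid.any():
        raise ValueError("no valid DSM cells to compare")
    return float(np.mean(np.abs(predicted[valid] - reference[valid])))
```

A higher PSNR indicates a rendered image closer to the reference, whereas a lower MAE indicates a DSM closer to the lidar reference.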

Ablation Analysis
To verify the effectiveness of the proposed method, ablation experiments were conducted. This paper focuses on geometric accuracy; therefore, the ablation experiments mainly analyze the impact of each proposed measure on DSM accuracy.
GC-NeRFs were evaluated with all three measures (scene stretching, geometric constraint, and DSM fusion) and with each one removed in turn. The results are shown in Table 4. Without the geometric constraint, the accuracy of the DSMs generated by GC-NeRFs was almost the same as without any of the improvement strategies, emphasizing the importance of the geometric constraint. Scene stretching and DSM fusion also improve accuracy. Comparing different scenes, the accuracy improvement from scene stretching and DSM fusion for area 004 was minimal. In rural areas, the small altitude range and low terrain complexity may already result in better reconstruction. Overall, each measure in the GC-NeRF framework contributed to the accuracy of the final DSM, and the geometric constraint contributed the most.
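The leave-one-out design of this ablation can be written down as a small configuration generator. This is only a sketch: the flag names, the `ablation_configs` helper, and the optional baseline with every measure disabled are hypothetical and are not taken from the GC-NeRF implementation.

```python
# Hypothetical flag names for the three measures evaluated in Table 4.
MEASURES = ("scene_stretching", "geometric_constraint", "dsm_fusion")


def ablation_configs(measures=MEASURES, include_baseline=False):
    """Build the full configuration plus one variant per removed measure.

    Each configuration maps a measure name to True (enabled) or False.
    With include_baseline=True, a variant with every measure disabled is
    added as well.
    """
    configs = {"full": {m: True for m in measures}}
    for dropped in measures:
        configs[f"without_{dropped}"] = {m: m != dropped for m in measures}
    if include_baseline:
        configs["none"] = {m: False for m in measures}
    return configs
```

Each returned dictionary could be passed to one training run, so that the DSM MAE of every variant is measured under otherwise identical settings.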

Discussion
Despite improvements in DSM accuracy, some limitations of the proposed method must be acknowledged. Firstly, the hyperparameter $s_z$ must be adjusted manually. Appropriate values of $s_z$ were obtained through extensive experiments (Figure 10), which may reduce the universality of GC-NeRF.

Secondly, using the sun direction as appearance encoding, GC-NeRF implicitly addresses shadow issues in a concise manner. However, the limited number of input images may result in some areas always being shadowed. The reconstruction accuracy of these areas is low, so more effort is needed to eliminate the influence of shadows on geometry.
Thirdly, GC-NeRF is only applicable to high-resolution satellite images, and the results are poor when low-resolution satellite images are used as input. Moreover, in areas with excessive terrain undulations, some hyperparameters must be adjusted. Overall, the robustness of GC-NeRFs to different types of satellite images and terrain still needs improvement.
Additionally, a common phenomenon with GC-NeRFs and other NeRF-based multi-view satellite photogrammetry approaches is that they produce uneven road and building surfaces, which is far from expected. Other geometric constraints could be designed to overcome this weakness. RegNeRFs [70] and CL-NeRFs [71] add geometric constraints in the form of local plane or slanted plane regularization to make the surface more uniform. However, some areas in the real world, such as vegetation, should not be regular planes. Satellite images encompass multiple bands beyond RGB, which can be used to calculate vegetation indices; therefore, subsequent work will involve applying plane regularization only to areas with low vegetation index values, which may help generate a more uniform surface and improve DSM accuracy.
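The proposed selection of areas for plane regularization could look like the sketch below. It assumes the vegetation index is the NDVI, computed from near-infrared and red reflectance, and uses an illustrative threshold of 0.3; the text above does not fix either choice, and the function names are hypothetical.

```python
import numpy as np


def ndvi(nir, red, eps=1e-8):
    """Normalized difference vegetation index, (NIR - Red) / (NIR + Red).

    eps avoids division by zero where both bands are zero.
    """
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)


def plane_regularization_mask(nir, red, ndvi_threshold=0.3):
    """Boolean mask that is True where plane regularization would apply.

    Pixels with NDVI below ndvi_threshold (an assumed value) are treated
    as non-vegetated surfaces such as roads and roofs.
    """
    return ndvi(nir, red) < ndvi_threshold
```

Rays cast through pixels where the mask is False (likely vegetation) would then be excluded from the plane regularization term, so that tree canopies are not flattened.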

Conclusions
The main contribution of this paper is the proposal of a GC-NeRF framework for multi-view satellite photogrammetry, which generates highly accurate DSMs without extra input. Through a combination of geometric constraint, scene stretching, and multi-view DSM fusion techniques, GC-NeRFs achieve notable improvements in output DSM accuracy. Occupancy grids converted from sparse point clouds avoid the extra input of altitude bounds for training. Additionally, the integration of advanced techniques from NeRF variants greatly enhances efficiency and conciseness in processing satellite images. Experimental results demonstrated the effectiveness of GC-NeRFs. Overall, GC-NeRFs offer a promising solution for reconstructing accurate 3D scenes from multi-view satellite images. Scaling GC-NeRFs to larger datasets and addressing computational efficiency issues may greatly benefit their practical deployment. Future research can focus on handling building changes. GC-NeRFs treat the scene as static, but the construction and demolition of buildings can lead to significant geometric changes within the scene, which are difficult to manage. Currently, there are some NeRF extensions for dynamic scenes, but these methods deal with continuously changing scenes, where the input data are continuous video. Satellite images are captured at widely separated times, and modeling such discrete dynamic scenes is challenging. However, solving this problem would be very beneficial for change detection.
Author Contributions: Conceptualization, Qifeng Wan; methodology, Qifeng Wan and Yuzheng Guan; software, Qifeng Wan and Qiang Zhao; validation, Qifeng Wan and Qiang Zhao; visualization, Qifeng Wan and Yuzheng Guan; writing-original draft, Qifeng Wan; writing-review and editing, Qifeng Wan, Xiang Wen and Jiangfeng She; project administration, Jiangfeng She. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number 41871293.

Figure 1. Overview of GC-NeRFs. GC-NeRFs use satellite images and corresponding satellite camera models to reconstruct scenes and generate accurate DSMs. Their key contributions include z-axis scene stretching, an occupancy grid converted from sparse point clouds, DSM fusion, and a geometric loss term for network training, which are shown in bold.

Figure 2. Z-axis scene stretching. (a) The original satellite scene is flat. (b) The suitably stretched scene makes full use of multi-resolution hash encodings. (c) The over-stretched scene presents excessive hash conflicts.

Figure 3. The architecture of the GC-NeRF network. The model receives the 3D spatial coordinate $X$, sun direction $d_{sun}$, and viewing direction $d_{view}$ as inputs to predict the volume density $\sigma_X$ and color $c_X$ at $X$.

Figure 4. The converting and updating of the occupancy grid. (a) The point cloud is freely obtained from the bundle adjustment. (a,b) The point cloud is converted into a bit occupancy grid. (b,c) The bit grid cells are classified by the float grid cells. (c,d) The float grid cells are updated by the $\alpha$ value predicted from the network.

Figure 5. (a) At the same camera parameter error angle $\theta$, the geometric error is greater in the satellite scene. (b) Without $\mathrm{Loss}_g$, the sample weight distribution is scattered around the true depth, resulting in significant errors in depth estimation. By contrast, with $\mathrm{Loss}_g$, the weight distribution is compact around the true depth.

Figure 6. The relationship between positive DSM errors in merged point clouds and the root mean square error (RMSE) of multi-view DSMs. The predicted elevation is greater than the actual elevation in most areas with a large elevation standard deviation.

Figure 7. Visualization of 3D models derived by superimposing DSMs onto images. The DSMs and images are generated by GC-NeRFs.

Figure 9. Visualization of lidar, S-NeRF, Sat-NeRF, and GC-NeRF DSMs. Areas marked by water and building changes are masked. (a,b) The DSM rendered by a GC-NeRF shows that the road is uneven compared to the lidar DSM. (c,d) The GC-NeRF DSM displays sharper building edges than the Sat-NeRF DSM. (e,f) The DSM quality rendered by the GC-NeRF is superior to that of the S-NeRF.

Figure 10. The MAE of DSMs under different $s_z$. Low MAE indicates high DSM accuracy, and the best $s_z$ value is around 0.8.

Table 2. Detailed information for each AOI.

Table 3. Numerical results of reconstruction. The best PSNR and MAE values are shown in bold. Overall, GC-NeRFs provide the best image quality measured by PSNR and DSM accuracy measured by MAE.

Table 4. The MAE of DSMs rendered by GC-NeRFs with all three measures (scene stretching, geometric constraint, and DSM fusion) or with one of them removed.