Enhancing endoscopic scene reconstruction with color-aware inverse rendering through neural SDF and radiance fields

Virtual surgical training is crucial for enhancing minimally invasive surgical skills. Traditional geometric reconstruction methods based on medical CT/MRI images often fall short in providing color information, which is typically generated through pseudo-coloring or artistic rendering. To reconstruct both the geometric shape and the appearance of organs simultaneously, we propose a novel organ model reconstruction network called Endoscope-NeSRF. This network jointly leverages neural radiance fields and a Signed Distance Function (SDF) to reconstruct a textured geometric model of the organ of interest from multi-view photometric images acquired by an endoscope. The prior knowledge that radiance falls off with the square of the distance from the light source to the object improves the physical fidelity of the reconstructed organ. A dilated mask further refines the appearance and geometry at the organ's edges. We also propose a highlight-adaptive optimization strategy that removes highlights caused by the light source during acquisition, preventing regions previously affected by highlights from being reconstructed as white. Finally, real-time realistic rendering of the organ model is achieved by combining inverse rendering with Bidirectional Reflectance Distribution Function (BRDF) rendering. Experimental results show that our method closely matches Instant-NGP in appearance reconstruction, outperforms other state-of-the-art methods, and is the best-performing method in geometric reconstruction. Our method obtains a detailed geometric model with realistic appearance, providing the realistic visual feedback needed for virtual surgical simulation, which is important for medical training.


Introduction
Virtual surgical simulators aim to enhance the surgical skills of resident physicians [1][2][3]. The key to virtual surgery is precise modeling (geometry and appearance) and realistic simulation (such as deformation and cauterization of soft tissue) of human organs [4,5]. Our research focuses primarily on reconstructing the geometry and appearance of the organ. Traditional methods reconstruct geometric models of organs from grayscale CT/MR image sequences using the Marching Cubes algorithm [6]. CT images, formed from the X-ray absorption coefficients of different tissues, and MRI images, formed from their signal intensities, are grayscale images without real color information [7], so the appearance attributes of the models are typically obtained by pseudo-coloring or artistically drawn texture maps. Moreover, the geometric accuracy is limited by the resolution of the original voxel data, which cannot accurately capture small or complex structural details.
Reconstructing the geometry and appearance of a scene or object from 2D images is a significant research problem in computer graphics and machine vision. To address imprecise geometry and unrealistic appearance, given a set of multi-view photometric images of a scene or organ collected by an endoscope during minimally invasive surgery, multi-view reconstruction methods are employed to recover the geometric model and appearance attributes. The reconstructed results can then be processed by soft-tissue deformation and cutting algorithms, providing a core theoretical foundation for virtual surgical simulation. Traditional multi-view reconstruction methods mainly include image-based rendering (IBR) methods [8], Structure from Motion (SfM) [9], and Signed Distance Functions (SDF) [10]. However, these methods struggle to achieve fine geometry and realistic appearance simultaneously.
Currently, neural implicit field reconstruction methods have become an alternative to traditional multi-view reconstruction. These methods optimize neural radiance fields through deep learning models to synthesize high-resolution, high-quality novel views, i.e., the appearance attributes of the object. Building on this, we introduce an SDF to represent the object's surface and use a new volume rendering method that combines the radiance and SDF fields to synthesize 2D images under the camera poses of the source views. The loss between the synthesized views and the source views supervises the training of our network, yielding precise geometric models and appearance attributes. However, the appearance attributes comprise only images corresponding to the source camera poses, so no single image captures the complete appearance of the scene or object. Furthermore, the geometric model and the appearance attributes are separate, without a clear correspondence, making it impossible to color the mesh model during rendering. We therefore decompose the texture map and normal map corresponding to the geometric model through inverse rendering [11] and perform real-time rendering of the organ model under new endoscope poses. Our method provides a more realistic visual experience for resident physicians in virtual surgical simulators.

SDF-based geometric reconstruction from multiple views
Erik Bylow et al. [12] proposed a novel method for real-time tracking and 3D reconstruction of static indoor environments using an RGB-D sensor. Their work highlights the efficiency of representing geometric shapes with an SDF, which allows camera positions to be estimated rapidly by directly minimizing depth-image errors on the SDF. DeepSDF [10] introduces a method for representing 3D shapes with deep neural networks that learn continuous SDFs, encoding various complex shapes within a compact latent space. The learned SDFs can accurately represent intricate details and topologies, offering a more effective and flexible deep learning alternative to traditional voxel-based [13], point cloud-based [14], or mesh-based [15] methods. Differentiable volumetric rendering [16] presents a method for learning implicit 3D representations from 2D images without direct 3D supervision; this is accomplished with a differentiable volumetric renderer that enables the model to infer 3D shapes and structures from 2D data. MetaSDF [17] introduces gradient-based meta-learning to neural implicit shape representations, effectively tackling the challenge of generalizing across different shapes and facilitating geometric reconstruction from incomplete or noisy data.

Neural implicit field reconstruction and novel view synthesis
Neural implicit representation [18], a technique that uses neural networks to implicitly represent 3D geometry and appearance attributes, has significantly advanced the field of 3D modeling and rendering. Unlike traditional 3D models based on meshes or point clouds, this approach employs continuous mathematical functions, typically parameterized by neural networks, allowing detailed representation at any chosen resolution.
As the pioneering work in neural implicit representations, Neural Radiance Fields (NeRF) [19] uses a simple MLP to represent scenes as implicit radiance fields with color and volume density, enabling the synthesis of high-quality novel views through volume rendering [20]. Instant-NGP [21] introduces an efficient method for training and rendering neural graphics primitives using a small neural network together with a multiresolution hash table, greatly accelerating the process while maintaining high quality. NeuS [22] presents a multi-view neural surface reconstruction approach that uses an SDF for surface representation and a new volume rendering scheme for learning the neural SDF representation, enabling accurate 3D reconstruction from 2D images, especially for objects and scenes with complex structures and self-occlusion. IRON [23] proposes a neural inverse rendering method that infers the geometric model, material textures, and lighting properties from a set of 2D multi-view photometric images by optimizing a neural SDF and appearance attributes; it leverages existing graphics rendering pipelines to generate high-fidelity 3D models with detailed textures.

Light transport
The distribution and transport of light in a scene can be described by the rendering equation [24]:

$$L_o(p, \omega_o) = L_e(p, \omega_o) + \int_{\Omega} f_r(p, \omega_i, \omega_o)\, L_i(p, \omega_i)\, (n \cdot \omega_i)\, d\omega_i \quad (1)$$

where $n \cdot \omega_i$ is the dot product between the surface normal $n$ and the incident light direction $\omega_i$ at point $p$; $L_o(p, \omega_o)$ and $L_e(p, \omega_o)$ denote the outgoing radiance and the self-emitted radiance along the viewing direction $\omega_o$ at point $p$, respectively; and $L_i(p, \omega_i)$ denotes the incident radiance reaching $p$ from direction $\omega_i$. The Bidirectional Reflectance Distribution Function (BRDF) [25] $f_r(p, \omega_i, \omega_o)$ is an intrinsic property of the object that describes how light arriving from direction $\omega_i$ scatters into the viewing direction $\omega_o$ at point $p$. The BRDF is defined as:

$$f_r(p, \omega_i, \omega_o) = \frac{dL_o(p, \omega_o)}{L_i(p, \omega_i)\,(n \cdot \omega_i)\, d\omega_i} \quad (2)$$

In common multi-view acquisition setups, the camera and light source are separate. As illustrated in the top left of Fig. 1, the light emitted by a point source is distributed uniformly over the surface of a sphere centered at the source with radius $r$. The radiant flux per unit area defines the irradiance $E = \Phi / (4\pi r^2)$. The irradiance is equal at every point of a given sphere, and the total radiant flux $\Phi$ is the same across spheres ($\Phi_1 = \Phi_2$), i.e., $E_1 \cdot 4\pi r_1^2 = E_2 \cdot 4\pi r_2^2$, from which it can be deduced:

$$\frac{E_1}{E_2} = \frac{r_2^2}{r_1^2} \quad (3)$$

This indicates that the irradiance on an object's surface is inversely proportional to the square of the distance between the light source and the object [26,27]. As shown in the lower left of Fig. 1, under a fixed light source the radiance $I'$ of the object is independent of the distance $r$ from the camera to the object [28], i.e.:

$$I'(r) = \text{const} \quad (4)$$

Different from the acquisition setup on the left side of Fig. 1, where the camera moves while the light source remains stationary, in endoscopic scenes the light source and camera are fixed together at the tip of the endoscope. The light source follows the camera in motion and can be regarded as a point source, so we assume $\omega = \omega_i = \omega_o$. Additionally, the self-emission term $L_e(p, \omega_o)$ is negligible in endoscopic scenes, allowing Eq. (1) to be simplified as follows:

$$L_o(p, \omega) = f_r(p, \omega, \omega)\, L_i(p, \omega)\, (n \cdot \omega) \quad (5)$$

Since the BRDF value at point $p$ is an intrinsic property, and using the light source characteristics from the left side of Fig. 1, the radiance of the scene collected by the endoscope can be written as:

$$L_o(p, \omega) = f_r(p, \omega, \omega)\, (n \cdot \omega)\, E \quad (6)$$

Combining Eq. (3) and (5), we obtain:

$$L_o(p, \omega) \propto \frac{1}{r^2} \quad (7)$$

This equation indicates that the radiance of a scene captured by an endoscope is inversely proportional to the square of the distance from the light source to the scene's surface. Incorporating this prior knowledge into our network yields more realistic rendering results.
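As an illustration of Eqs. (6) and (7), the following minimal NumPy sketch evaluates the simplified co-located light/camera shading model with inverse-square falloff; the function and argument names are ours, chosen for clarity, not taken from the paper's implementation.

```python
import numpy as np

def endoscope_radiance(f_r, normal, omega, flux, r):
    """Toy evaluation of the simplified endoscope shading model:
    L_o = f_r * (n . w) * Phi / (4 * pi * r^2).
    f_r:    scalar BRDF value at the surface point (assumed constant here)
    normal: unit surface normal n
    omega:  unit light/view direction w (co-located light and camera)
    flux:   total radiant flux Phi of the point source
    r:      distance from the endoscope tip to the surface point
    """
    cos_term = max(np.dot(normal, omega), 0.0)    # n . w, clamped to front-facing
    irradiance = flux / (4.0 * np.pi * r**2)      # inverse-square falloff, Eq. (3)
    return f_r * cos_term * irradiance

# Doubling the distance quarters the observed radiance:
n = np.array([0.0, 0.0, 1.0])
w = np.array([0.0, 0.0, 1.0])
print(endoscope_radiance(0.8, n, w, flux=1.0, r=1.0))  # ~0.0637
print(endoscope_radiance(0.8, n, w, flux=1.0, r=2.0))  # ~0.0159 (one quarter)
```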

Methods
Given a set of photometric images captured by an endoscopic device from multiple viewpoints, we propose the Endoscope-NeSRF network to optimize the SDF and radiance fields of the organ of interest in endoscopic scenes, obtaining a precise geometric shape and realistic appearance information of the organ. The pipeline, illustrated in Fig. 2, comprises: (1) normalization of the scene point cloud and camera poses; (2) reconstruction of the SDF and radiance fields of the organ of interest by the Endoscope-NeSRF network; and (3) synthesis of novel views through volume rendering. Additionally, the Endoscope-NeSRF network integrates inverse rendering to obtain textured geometric models of organs, enabling real-time rendering of organ models in virtual surgery through the BRDF algorithm.

Normalization of point cloud and camera poses
The SfM algorithm [9] is used for sparse reconstruction of the endoscopic scene, estimating an initial sparse point cloud along with the intrinsic and extrinsic camera parameters. To reconstruct the geometry and appearance of the organ of interest more precisely, the background point cloud is removed from the sparse point cloud. The remaining organ point cloud is enclosed within a spherical bounding box and, after normalization and centering, lies within a unit sphere $V_P$ centered at the origin of the world coordinate system. Schematic diagrams of the organ point cloud, camera poses, and unit sphere are shown in the top left corner of Fig. 2. The camera extrinsics consist of a rotation matrix $R \in \mathbb{R}^{3\times3}$ and a translation vector $t \in \mathbb{R}^{3\times1}$, which transform the camera coordinate system to the world coordinate system. In the world coordinate system, the camera origin is $o = t$, and the direction of a ray emitted by the camera is $v = R v_c$, where $v_c$ denotes the ray direction in the camera coordinate system.
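A minimal sketch of this preprocessing, assuming the point cloud and camera centers are given as NumPy arrays (all names are ours):

```python
import numpy as np

def normalize_scene(points, cam_centers, radius=1.0):
    """Center the organ point cloud at the world origin and scale it into a
    sphere of the given radius; apply the same similarity transform to the
    camera centers so the poses stay consistent with the point cloud."""
    center = points.mean(axis=0)
    scale = radius / np.linalg.norm(points - center, axis=1).max()
    return (points - center) * scale, (cam_centers - center) * scale

def camera_ray(R, t, v_c):
    """World-space ray for a pixel: origin o = t, direction v = R @ v_c."""
    v = R @ v_c
    return t, v / np.linalg.norm(v)
```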

SDF and radiance fields representation
A point on a ray emitted from the camera can be represented as $p(l) = o + lv$, where $l$ denotes the distance from the camera origin to the point. After positional encoding and directional encoding, the point $p(l)$ and direction $v$ yield encoded vectors $p(l)'$ and $v'$, respectively. The tuples $\{p(l), p(l)', v, v', l\}$ of all points on the ray are fed as input to the Endoscope-NeSRF network, which outputs the SDF and RGB values of each point. The Endoscope-NeSRF network comprises two subnetworks constructed from Multi-Layer Perceptrons (MLPs), which optimize the SDF field and the radiance field of the organ, respectively, to achieve more precise geometry and appearance. The details of these two networks are as follows:
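The positional and directional encodings are presumably the NeRF-style frequency encodings; a minimal sketch follows (the number of frequency bands is our illustrative choice, not taken from the paper):

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    """NeRF-style frequency encoding gamma(x) = (x, sin(2^k x), cos(2^k x))
    applied per coordinate, for k = 0 .. num_freqs-1."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * x))
        feats.append(np.cos((2.0 ** k) * x))
    return np.concatenate(feats, axis=-1)

p = np.array([0.1, -0.3, 0.8])
print(positional_encoding(p, num_freqs=6).shape)  # (3 + 2*6*3,) = (39,)
```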

SDF prediction network MLP_s
The network MLP_s predicts the SDF value and a feature vector $f$ for each point $p$ in the reconstructed region $V = \{p \in \mathbb{R}^3 \mid \|p\|_2 \le 1.2\}$. The distribution of the predicted SDF field is shown in the middle grey rectangular region of Fig. 2: $SDF(p) > 0$ indicates that point $p$ is outside the organ model, and $SDF(p) < 0$ indicates that it is inside. The surface $S$ of the reconstructed organ model is represented by the zero-level set of the SDF [10], that is,

$$S = \{p \in V \mid SDF(p) = 0\} \quad (8)$$

Almost all real objects have concave and convex surfaces, so objects occlude themselves during light propagation. As light travels through an object, its energy is continuously absorbed, which is reflected in volume rendering as a decreasing contribution to the color and other attributes. Figure 3 shows the propagation of light within an object, including a schematic of a ray with multiple intersection points, the probability density function (PDF) [29], the predicted SDF gradients, the true SDF, and the weights of points along the ray. Near an intersection, the SDF of points on the ray can be linearly approximated as

$$SDF(p(l)) = -|\cos\theta|\,(l - l_0),$$

where $\theta$ denotes the angle between the predicted gradient $g$ of a point on the ray and the ray direction $v$, and $l_0$ denotes the distance from the intersection of the ray with the object to the camera.
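A toy example with an analytic sphere SDF confirms the linear model above near an intersection (all names and values here are ours, for illustration only):

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """SDF of a sphere: positive outside, negative inside, zero on the surface."""
    return np.linalg.norm(p) - radius

o = np.array([0.0, 0.0, -2.0])           # camera origin
v = np.array([0.0, 0.0, 1.0])            # unit ray direction
l0 = 1.5                                 # the ray hits the sphere at l0 = 2 - 0.5
for l in (1.3, 1.5, 1.7):
    p = o + l * v
    g = p / np.linalg.norm(p)            # analytic SDF gradient (unit normal)
    cos_theta = np.dot(g, v)
    approx = -abs(cos_theta) * (l - l0)  # linear model of the SDF near the hit
    print(l, sphere_sdf(p), approx)      # exact and linearized SDF agree here
```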
A uniform coarse sampling is performed along the ray, followed by multiple rounds of fine sampling near the surface based on the predicted SDF values; the fine sampling follows a logistic distribution. The logistic density function

$$f_\gamma(x) = \frac{e^{-x/\gamma}}{\gamma\,(1 + e^{-x/\gamma})^2}$$

is chosen as the PDF $f(s(x))$ of the SDF $s(x)$, where the standard deviation $\gamma$, a trainable hyperparameter, approaches 0 as the network converges. This enables finer resampling near the intersection point, leading to a more precise reconstruction of the surface $S$. Since $f_\gamma(x)$ is the derivative of the sigmoid function $F_\gamma(x) = (1 + e^{-x/\gamma})^{-1}$, $F_\gamma$ can be considered the cumulative distribution function (CDF) [30] of the SDF.
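The density and its CDF are straightforward to implement; note how the peak $f_\gamma(0) = 1/(4\gamma)$ grows as $\gamma \to 0$, which is what concentrates the fine samples near the surface (a sketch with our own function names):

```python
import numpy as np

def logistic_pdf(x, gamma):
    """f_gamma(x) = e^(-x/gamma) / (gamma * (1 + e^(-x/gamma))^2),
    the density used to weight samples by their SDF value."""
    e = np.exp(-x / gamma)
    return e / (gamma * (1.0 + e) ** 2)

def logistic_cdf(x, gamma):
    """Sigmoid F_gamma(x) = 1 / (1 + e^(-x/gamma)), the CDF of f_gamma."""
    return 1.0 / (1.0 + np.exp(-x / gamma))

for gamma in (0.5, 0.1, 0.02):
    print(gamma, logistic_pdf(0.0, gamma))  # peak 1/(4*gamma) grows as gamma -> 0
```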

Color prediction network MLP_c
The SDF establishes the connection between the two networks. The SDF gradient $g$ and feature vector $f$ of a sampling point $p$ predicted by MLP_s, together with the point $p$, its direction $v$, and its distance $l$ from the camera, are fed into MLP_c, which outputs the color $c$ of the sampling point. The loss between the 2D image synthesized under a camera pose by volume rendering and the corresponding original 2D image serves as the primary supervision. Additionally, a geometric loss is incorporated to optimize the SDF field, yielding accurate geometry and appearance of the organ.

Color synthesis
To synthesize the rendered image, the color $C(o, v)$ of any pixel is composited by volume rendering:

$$C(o, v) = \int_{l_n}^{l_f} w(l)\, c(p(l), v)\, dl \quad (9)$$

where $p(l) = o + lv$ denotes the point at distance $l$ along the ray with origin $o$ and unit direction $v$ emitted from the pixel, $c(p(l), v)$ denotes the color of point $p(l)$ viewed along direction $v$, $w(l)$ denotes the contribution weight of the point, $T(l)$ denotes the cumulative transmittance along the ray from the near bound $l_n$ to $l$, and $\rho(l)$ denotes the opaque density of point $p(l)$.

Under the standard volume rendering formulation of NeRF [19], the weight is $w(l) = T(l)\,\sigma(l)$. We introduce the method from NeuS [22] to ensure that the intersection points of the ray with the object (i.e., the zero-level set of the SDF) contribute maximally to the synthesized color: the normalized PDF $f_\gamma$ of the SDF is used as the weight, making Eq. (9) unbiased:

$$w(l) = \frac{f_\gamma\big(SDF(p(l))\big)}{\int_{l_n}^{l_f} f_\gamma\big(SDF(p(u))\big)\, du} \quad (10)$$

For the case where the ray passes through multiple surfaces, we introduce the opaque density $\rho(l)$ so that Eq. (9) acquires occlusion-aware properties:

$$\rho(l) = \max\!\left(\frac{-\tfrac{d}{dl} F_\gamma\big(SDF(p(l))\big)}{F_\gamma\big(SDF(p(l))\big)},\; 0\right), \qquad T(l) = \exp\!\left(-\int_{l_n}^{l} \rho(u)\, du\right) \quad (11)$$

Discretizing Eqs. (9), (10), and (11) over $M$ samples $\{p_i\}_{i=1}^{M}$ along the ray yields the discrete color

$$C = \sum_{i=1}^{M} T_i\, \alpha_i\, c_i \quad (12)$$

where the discrete opacity of sampling point $i$ follows from the discrete CDF of the SDF,

$$\alpha_i = \max\!\left(\frac{F_\gamma\big(SDF(p_i)\big) - F_\gamma\big(SDF(p_{i+1})\big)}{F_\gamma\big(SDF(p_i)\big)},\; 0\right),$$

and the discrete cumulative transmittance is

$$T_i = \prod_{j=1}^{i-1} \big(1 - \alpha_j\big).$$

In Section 3.2, the SDF and radiance fields of the scene are obtained. The SDF of uniform sampling points along a ray is predicted by the trained Endoscope-NeSRF network; based on the density function of the SDF, finer resampling is conducted near $SDF = 0$, and the network is then used again to predict the SDF and RGB values of these fine samples. The final color and normal of the pixel are synthesized by accumulating the colors and gradients of all sampling points along the ray via volume rendering.
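The discrete compositing above can be sketched in a few lines of NumPy. This is a minimal illustration of Eq. (12) under our notation (function and variable names are ours), not the full training pipeline; the surface normal is accumulated the same way, substituting the SDF gradients $g_i$ for the colors.

```python
import numpy as np

def neus_render(sdf, colors, gamma):
    """Discrete NeuS-style volume rendering along one ray.
    sdf:    (M+1,) SDF values at consecutive sample points
    colors: (M, 3) colors of the M samples
    Returns the composited pixel color."""
    F = 1.0 / (1.0 + np.exp(-sdf / gamma))                          # CDF per sample
    alpha = np.clip((F[:-1] - F[1:]) / (F[:-1] + 1e-8), 0.0, 1.0)   # discrete opacity
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))       # transmittance
    w = T * alpha                                                    # per-sample weights
    return (w[:, None] * colors).sum(axis=0)

sdf = np.linspace(0.3, -0.3, 9)              # a ray crossing the surface once
colors = np.tile([0.8, 0.2, 0.2], (8, 1))
print(neus_render(sdf, colors, gamma=0.05))  # ~ (0.8, 0.2, 0.2) scaled by total weight
```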

Normal synthesis
Similar to the process of color synthesis, the surface normal of the object can also be synthesized through volume rendering:

$$N = \sum_{i=1}^{M} T_i\, \alpha_i\, g_i$$

where $\{g_i\}_{i=1}^{M}$ represents the SDF gradients of all sampling points on the ray predicted by the Endoscope-NeSRF network.

Training and loss
Given a set of multi-view photometric images of an endoscopic scene, we optimize the SDF and radiance fields of the organ of interest with the following total loss:

$$L = \lambda_1 L_C + \lambda_2 L_G + \lambda_3 L_M$$

where the photometric loss (also known as color loss) $L_C$ measures the color difference between the predicted image and the corresponding real image, and $L_G$ and $L_M$ represent the geometric loss and mask loss, respectively. $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 1.0, 0.1, and 0.1. The color loss $L_C$ optimizes the color $c$ and opaque density $\rho$ of the sampling points on each ray and accounts for the major part of the total loss:

$$L_C = \frac{1}{|R_1|}\sum_{r \in R_1} \big\| \hat{C}(r) - C(r) \big\|_1$$

where the $t$-th sampling point on the ray with unit direction $d$ emitted from the camera origin $o$ is represented as $r(t) = o + td$. Differently from previous methods, we only consider the color loss of pixels in the non-specular region $R_1$, ignoring the specular region $R_2$. This avoids rendering pixels as white in regions of the source images that occasionally contained specular highlights (originally other colors, such as red), which would otherwise cause errors when relighting the organ model in these areas.

The geometric loss $L_G$ optimizes the SDF field of the organ, i.e., the geometric information, and is composed as $L_G = L_E + L_N$. Here $L_E$ is the eikonal loss over the sampling points on rays corresponding to pixels in the non-specular region $R_1$; it improves the validity of the SDF by regularizing the predicted SDF gradients of the sampling points toward unit norm:

$$L_E = \frac{1}{NM}\sum_{i,j} \big( \|g_{i,j}\|_2 - 1 \big)^2$$

$L_N$ is the normal loss for pixels in the specular region $R_2$. Because the camera and light source of the endoscope are approximately co-located, the surface normal $N_h(r)$ in the specular region $R_2$ is nearly opposite to the ray direction, i.e., $N_h(r) = -d(r)$. In $R_2$, the L1 loss between the synthesized normal $N_p(r)$ and the prior normal $N_h(r)$ is used as the normal loss to optimize the geometry, which remedies the failure of previous methods to reconstruct geometry in highlight regions:

$$L_N = \frac{1}{|R_2|}\sum_{r \in R_2} \big\| N_p(r) - N_h(r) \big\|_1$$

The mask loss $L_M$, computed as a binary cross-entropy between the accumulated ray opacity and the organ mask, addresses the blurred boundaries of the reconstructed geometry.
The geometric loss and color loss complement each other's shortcomings, enabling more refined geometry and realistic textures.
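As a hedged sketch, the total loss might be assembled as below in PyTorch; the tensor shapes, helper names, and the exact forms of $L_C$ and $L_M$ are our assumptions based on the description above (the method builds on NeuS-style losses), not the authors' released code.

```python
import torch

def total_loss(pred_rgb, gt_rgb, spec_mask, grads, pred_normal, ray_dir,
               acc_weights, gt_mask, l1=1.0, l2=0.1, l3=0.1):
    """Sketch of L = l1*L_C + l2*(L_E + L_N) + l3*L_M for one ray batch.
    spec_mask: (N,) bool, True for pixels in the specular region R2
    grads:     (N, S, 3) predicted SDF gradients of the samples on each ray
    acc_weights: (N,) accumulated per-ray weights (rendered opacity)"""
    non_spec = ~spec_mask
    # Color loss only over non-specular pixels R1
    L_C = (pred_rgb[non_spec] - gt_rgb[non_spec]).abs().mean()
    # Eikonal term: regularize sample gradient norms to 1 (rays in R1)
    L_E = ((grads[non_spec].norm(dim=-1) - 1.0) ** 2).mean()
    # Normal prior in R2: synthesized normal should oppose the ray direction
    L_N = (pred_normal[spec_mask] - (-ray_dir[spec_mask])).abs().mean()
    # Mask loss: BCE between accumulated ray opacity and the organ mask
    L_M = torch.nn.functional.binary_cross_entropy(
        acc_weights.clamp(1e-4, 1.0 - 1e-4), gt_mask.float())
    return l1 * L_C + l2 * (L_E + L_N) + l3 * L_M
```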

Inverse rendering and relighting
Selecting a geometric model with an appropriate number of mesh faces reduces the computational load during interaction between instruments and organ models in the virtual simulator, avoiding lag. The organ model is re-illuminated under a new endoscope pose by the BRDF algorithm, and the texture and normal maps enrich the rendered image with more accurate texture details and surface bumpiness. The process of texture acquisition and relighting is shown in Fig. 4. The geometric model is unwrapped using the algorithm of [31], yielding a square UV map whose pixels store the indices, coordinates, and normals of the corresponding triangle vertices of the geometric model, as well as UV coordinates.
Figure 5 depicts how the ray $r$ (with origin $O$ and direction $v$) corresponding to pixel $p$ in the texture map is obtained. The normal $N_\perp$ at the intersection point $P$ of the ray with the object, perpendicular to the triangular plane containing the point, can be expressed as $-\mathrm{normalize}(N_1 + N_2 + N_3)$. Rotating $N_\perp$ onto the vector $[0, 0, -1]$ (i.e., pointing vertically out of the screen) yields the rotation matrix $R$ that transforms point $P$ from the world coordinate system to the tangent coordinate system, as sketched below.
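A small Rodrigues-formula helper suffices to build the rotation that aligns $N_\perp$ with $[0, 0, -1]$. This is a generic sketch (names ours), with the degenerate antiparallel case left unhandled for brevity:

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix mapping unit vector a onto unit vector b
    (Rodrigues' formula): R = I + K + K^2 / (1 + a.b)."""
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        raise ValueError("antiparallel case: pick any 180-degree rotation")
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

n_perp = np.array([0.0, 1.0, 0.0])
R = rotation_between(n_perp, np.array([0.0, 0.0, -1.0]))
print(R @ n_perp)   # -> [0, 0, -1]: world normals rotate into tangent space
```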
The sampling points on the ray are represented as $r(t) = O + tLv$, where $L$ is the length of the ray, taken as the average length of all triangle edges in the geometric mesh. The parameter $t \in [0, 1]$ follows a normal distribution, allowing finer sampling near the zero-level set of the SDF. Each sampling point $r(t)$ is fed into the trained Endoscope-NeSRF network, which outputs its SDF gradient and color. The normal map $N_w$ in the world coordinate system and the texture map are synthesized along the ray by volume rendering; rotating $N_w$ by the rotation matrix $R$ yields the normal map $N_t$ in the tangent coordinate system.

Acquisition device and datasets

The right-hand side of Fig. 1 shows our experimental equipment: an endoscope (with a camera and light source), a laptop, and a rotation table for placing organs. This setup was used to capture multi-view images of organs. The camera and light source are co-located; the camera resolution is 1920 × 1080, and the light source can be regarded as a point source. The experiment was conducted in a dark room, with the only illumination coming from the endoscope. The rotation frequency of the rotation table is 1 FPS. Four organ scenes of the lung, liver, kidney, and heart of the pig were captured. For each scene, the rotation table rotated 2-3 times while the endoscope randomly captured 60-80 images. Every scene exhibited highlight attributes.

Baselines
We compared our method with state-of-the-art methods in terms of geometric and appearance reconstruction. For synthesized novel views, we compared our method qualitatively and quantitatively with four methods (NeRF, Instant-NGP, NeuS, and IRON), using PSNR, SSIM, and LPIPS as the quantitative image evaluation metrics. Since ground-truth geometric models of the organs are unavailable, we compared our method only qualitatively with four methods (Colmap, Instant-NGP, NeuS, and IRON) in geometric reconstruction.
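For reference, PSNR and SSIM can be computed with scikit-image as below (a sketch with a hypothetical helper name; LPIPS additionally requires the lpips package and a pretrained network):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred, gt):
    """PSNR and SSIM between a synthesized view and its source image.
    Both images are float arrays in [0, 1] of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```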

Implementation details
The Endoscope-NeSRF network optimizes a separate SDF field and radiance field representation for each organ scene. The network's inputs are the RGB images of an organ scene, the corresponding masks, and the camera intrinsic and extrinsic parameters (camera poses after organ centering and normalization). These camera parameters are estimated with the Colmap SfM package and the MeshLab tool. In each optimization iteration, 512 rays are randomly selected from the bounding rectangle of the organ mask dilated by n pixels, and sampling then follows the scheme in Section 3.2.
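A minimal sketch of this ray selection step, assuming a boolean organ mask and using SciPy's binary dilation (the function name and defaults are ours):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def sample_rays_in_dilated_mask(mask, n_dilate=6, batch_size=512, rng=None):
    """Pick ray pixels uniformly from the bounding rectangle of the organ
    mask after dilating it by n_dilate pixels (n_dilate=6 follows the
    ablation discussed later)."""
    rng = rng or np.random.default_rng()
    dilated = binary_dilation(mask, iterations=n_dilate)
    ys, xs = np.nonzero(dilated)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    py = rng.integers(y0, y1 + 1, size=batch_size)
    px = rng.integers(x0, x1 + 1, size=batch_size)
    return np.stack([py, px], axis=1)   # pixel coordinates of the sampled rays
```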
For each ray, 64 coarse sampling points are taken, and 64 fine sampling points are added over 4 rounds of upsampling, for a total of 128 sampling points. The geometric resolution is set to 64 × 64 × 64. The SDF and color c of the sampling points are predicted by the Endoscope-NeSRF network. The network architecture is detailed in Fig. 6 and comprises an SDF function represented by an MLP with 8 hidden layers and a color function represented by an MLP with 4 hidden layers. We use the Adam optimizer [32] with a learning rate that increases linearly from 0 to 5 × 10^-4 over the first 5000 iterations, a strategy known as 'warm-up'. After warm-up, from the 5000th iteration onward, the learning rate follows a cosine decay, gradually decreasing from 5 × 10^-4 to 5 × 10^-5. Optimizing a single scene typically takes around 80,000 iterations and completes in roughly 4 hours on an NVIDIA GTX 2060Ti GPU.
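The warm-up plus cosine-decay schedule can be reproduced as follows (a sketch; the iteration counts and learning rates match those stated above):

```python
import math

def learning_rate(it, warmup=5000, max_it=80000, lr_max=5e-4, lr_min=5e-5):
    """Linear warm-up from 0 to lr_max, then cosine decay down to lr_min."""
    if it < warmup:
        return lr_max * it / warmup                 # linear warm-up from 0
    t = (it - warmup) / (max_it - warmup)           # decay progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

for it in (0, 2500, 5000, 42500, 80000):
    print(it, f"{learning_rate(it):.2e}")           # 0 -> 5e-4 -> 5e-5
```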

Results
We qualitatively and quantitatively assessed the geometry and appearance of organ scenes reconstructed by our method against the baseline methods. Additionally, we evaluated the results of inverse rendering and relighting.

Evaluation of novel view synthesis
Table 1 presents a quantitative comparison between our method and state-of-the-art novel view synthesis methods. Our method, along with Instant-NGP, outperforms the other methods in image metrics and in time consumed for 80,000 iterations, although our method is slightly inferior to Instant-NGP. Figure 7 shows the corresponding qualitative comparison. In terms of texture details (area within the blue oval), our method is slightly worse than Instant-NGP, NeuS is the blurriest, and NeRF exhibits horizontal and vertical stripes. A possible reason is that, unlike other methods that randomly select batch_size pixels from the entire image in each iteration, our method randomly chooses batch_size pixels within a mask bounding box dilated by 6 pixels, so more pixels from the mask region participate in training. Additionally, pixels from the specular region participate only in the geometric loss, not the color loss, preventing overexposure in the specular region from affecting appearance reconstruction. These factors lead to faster convergence of the color loss in the mask region. Our method and Instant-NGP also synthesize more precise and clear object boundaries (areas within the red oval); NeuS and IRON produce blurry boundaries, and NeRF produces jagged ones. Likely reasons are that our method and NeuS follow the logistic-distribution upsampling strategy of subsection 3.2, which concentrates the sampling points near the zero-level set of the SDF, where contribution weights are large, while the weights of samples far from the zero-level set tend to zero. Additionally, the dilated mask bounding box ensures that more pixels near object boundaries are trained, leading to clearer boundaries in the synthesized images.

Our method successfully removes specular highlights (area within the green oval) and reconstructs more accurate textures in areas previously affected by highlights. Instant-NGP and NeRF fail to remove highlights and reconstruct these areas inaccurately. IRON and NeuS can remove highlights, but their reconstructions in these areas are blurry. This is because our method adds a highlight-adaptive optimization strategy (see subsection 3.4).

Figure 8 illustrates the effect of the number of dilation pixels of the organ mask on appearance reconstruction. The results show that appearance quality is highest when the dilation is set to 6 pixels, decreasing toward both ends. A possible explanation is that when n is smaller, training at the boundaries is less effective, whereas when n is larger, fewer pixels from the mask region participate in training, decreasing the quality of appearance reconstruction within the mask area.

Evaluation of geometric reconstruction
Since our dataset contains only multi-view images without ground-truth geometric models, two source images near the comparison view were selected as references to assess the quality of geometric reconstruction by observing the surface convexities and concavities of the object. The qualitative results in Fig. 9 show that our method outperforms the others in geometric reconstruction, achieving more accurate geometric models, while the other four methods exhibit varying degrees of abnormal surface undulation. Among them, the geometry reconstructed by IRON shows concavities and convexities at a large contour scale, and its geometric reconstruction fails in areas seen from fewer viewpoints during data collection (area within the blue oval). Instant-NGP fails to remove background voxels, resulting in reconstructed geometry with cluttered triangular faces and incorrect synthesized colors.

Our method employs an adaptive highlight-guided optimization strategy. In highlight areas, the normal loss on the object surface and the gradient regularization (eikonal) loss on the sampling points are used to optimize the geometric model (refer to Section 4.3). This compensates for missing colors through global consistency across views, preventing the failures in reconstructing the original color and geometry in highlight areas that the color loss would otherwise cause. In non-highlight areas, the color loss on the object surface and the gradient regularization loss on the sampling points are used to optimize the reconstruction of color and geometry.

Inverse rendering and relighting
Figure 10 shows the results of inverse rendering and relighting of the organ scenes. The results indicate that our Endoscope-NeSRF network, combined with the inverse rendering method, successfully extracts the geometry, UV texture map, and normal map of the object from multi-view images. This enables real-time rendering of organ models under new endoscopic poses (with light source and camera remaining co-located), providing realistic visual perception for virtual surgery. While preserving the basic shape of the organ, a geometry at 64 × 64 × 64 resolution is selected to reduce the computational load during interaction between instruments and organ models in virtual surgery, addressing lag issues. Texture and normal maps at 1024 × 1024 resolution are chosen to provide the base color and to increase the apparent bumpiness of the geometric model's surface, respectively, making the rendering of the organ model more realistic. The rendering results of Relight 1 and 2 show that the model re-rendered by the BRDF algorithm accurately restores the original color and highlights of the organ, but there is a large error in the appearance rendering of the heart model. A possible reason is that roughly half of the heart surface is white tissue, some of which is incorrectly identified as highlight during the adaptive highlight-guided optimization and thus excluded from the color loss. This leads to an unrealistic appearance in the final reconstruction, where the brightness of the white tissue and the R-channel of the red regions in the reconstructed organ model are both lower than in the source images.

Ablations and analysis
To validate the effectiveness of our method, we performed three ablation experiments on the pig lung and heart scenes in our dataset: (1) removing the factor of the distance from the light source to the object; (2) removing the highlight-adaptive optimization strategy; (3) removing the dilated mask.

The quantitative results in Table 2 show that removing any of the three components decreases the quality of the synthesized view (excluding highlight areas), as measured by three image evaluation metrics (PSNR, SSIM, and LPIPS). In particular, removing the "dilated mask" component causes the most significant drop in synthesized image quality. The corresponding qualitative results, including normal visualizations, are illustrated in Fig. 11. Removing the "factor of the distance from the light source to the object" reduces the quality of the synthesized image: the radiance of the object is inversely proportional to the square of the distance from the light source, and this prior knowledge better captures the appearance attributes of the object. Removing the "highlight adaptive optimization strategy" leaves highlight areas in the synthesized image, indicating that this strategy keeps the rays corresponding to highlight pixels from participating in training and thus from learning the highlight attributes; since a highlight region (appearing white) may show its original color in adjacent views, its color attributes can be restored through globally consistent textures across views. Removing the "dilated mask" degrades the overall reconstruction quality of both geometry and appearance, because the dilated mask effectively disregards unnecessary surrounding areas while ensuring that more masked areas and edge pixels of the object participate in training.

Conclusion
In this paper, we propose Endoscope-NeSRF, a method for reconstructing textured geometry from multi-view photometric images captured by an endoscope. Endoscope-NeSRF represents the geometry and appearance of the organ as a neural SDF and a radiance field, respectively. Our method successfully reconstructs fine geometry and realistic appearance of organs and renders organ models in real time under new endoscopic poses. Our future goal is to tetrahedralize the reconstructed textured geometric models and add corresponding algorithms to simulate the deformation and cutting of soft tissue in virtual surgery. As soft tissues deform or change topology during interaction with surgical instruments, the texture attributes at the corresponding positions should also change. However, our method can only reconstruct static scenes, not dynamic ones, which would leave the textures of deformed organ models unrealistic. In future work, we plan to incorporate a temporal component into our network to reconstruct textured geometric models of dynamic scenes, providing more realistic organ models for virtual surgical simulations.

Fig. 1. Diagram of the derivation of endoscopic scene capture. The left side shows the light source and camera properties when a camera captures conventional objects under a fixed light source. The right side shows the light source and camera properties when an endoscope (where the light source and the camera are fixed at the same position) captures an endoscopic scene.

Fig. 2. The pipeline workflow. The blue flowchart symbols describe the process of multi-view endoscopic image acquisition and sparse reconstruction via SfM to obtain the scene point cloud and camera poses; the positions p and directions d of the sampled points on the corresponding camera rays are fed as inputs to the Endoscope-NeSRF network, which outputs the color c and SDF of the sampling points. The green flowchart symbols describe the geometric optimization process, including fine sampling along rays, geometry generation, geometric loss computation, and the generation of opacity o and accumulated transmittance T for the sampling points. The purple flowchart symbols describe the use of volume rendering to accumulate the colors of all sampling points along each ray, synthesizing the color of the corresponding pixel; the total loss during network training comprises the geometric loss and the color loss.

Fig. 3. The case of multiple intersections of a ray with the object and the values of relevant variables. (1) Schematic diagram of the intersection points, where $V_p$, $V_r$, and $C$ respectively represent the bounding sphere of the normalized point cloud of the organ of interest, the bounding sphere enlarged by a factor of 1.2, and the center of the bounding sphere; $l_m$, $l_n$, $l_f$ respectively represent the sets of mid-points, near-camera points, and far-camera points on all rays corresponding to each image. (2) Distribution and PDF of the sampling points. (3) Predicted SDF gradients of the sampling points. (4) True SDF of the sampling points. (5) Weights of the sampling points.

Fig. 4. Acquisition of the texture map and normal map of the organ model, and relighting of the textured geometric model. The reconstructed geometry is flattened into UV space to obtain a UV map consisting of pixels with positions and gradients. The positions and directions of the sampling points on the ray corresponding to each pixel are fed into the Endoscope-NeSRF network to predict the color and SDF value of each sampling point. The predicted SDF gradient is used as the normal in tangent space and is transformed by the rotation matrix R to obtain the normal map. The colors of all sampling points on the ray corresponding to each pixel of the UV map are synthesized by volume rendering into the color of that pixel, ultimately yielding the appearance map. The organ model together with its appearance and normal maps constitutes a textured geometric model of the organ, which is relit using the BRDF algorithm.
Fig. 5.

Fig. 6. Endoscope-NeSRF network framework. Our proposed network architecture comprises two modules: a front-end module (in black) known as the SDF function and a back-end module (in purple) referred to as the color function.

Fig. 7. Qualitative comparison of novel view synthesis. Rows 1-5 show source images and novel view synthesis results for the lung, heart, liver, kidney, and heart2 of the pig. Columns 1-2 show the source images from the same viewpoint as the rendered images and zoomed-in views of the two rectangular box regions, respectively. Columns 3-7 show zoomed-in novel views synthesized by our method, IRON, NeuS, Instant-NGP, and NeRF, respectively. Refer to Visualization 1 for the synthesized video.

Fig. 8. The effect of the number of dilation pixels of the organ mask on appearance reconstruction.

Fig. 9. Qualitative comparison of geometric reconstruction. The five vertical blocks show the pure geometric model (odd rows) and the colorized geometry (even rows) reconstructed for the lung, heart, liver, kidney, and heart2 of the pig, respectively. Column 1 shows two reference source images near the compared viewpoints for each organ. Columns 2-6 respectively show the reconstruction results of our method, IRON, NeuS, Instant-NGP, and Colmap.

Fig. 10. Results of inverse rendering and relighting. Rows 1-5 show five organs and their corresponding renderings. Column 1 shows one of the reference views of the organ. Columns 2-4 show the geometry, UV map, and normal map generated by the inverse rendering method, respectively. Columns 5-7 show the geometric model with base colors, relighting under the same view as the reference, and relighting under a new endoscopic pose.

Fig. 11. Qualitative results of the three ablation experiments of our model on the lung (left) and heart (right) of a pig. Rows 1-4 show the appearance and geometry reconstructed by our full model and by the models after removing the "factor of distance from the light source to object", "highlight adaptive optimization strategy", and "dilated mask" components, respectively. Columns 1-4 on each side show the ground-truth images, synthetic images, error images, and normal maps, respectively.