Application of Photogrammetric Computer Vision and Deep Learning in High-Resolution Underwater Mapping: A Case Study of Shallow-Water Coral Reefs

Underwater mapping is vital for engineering applications and scientific research in ocean environments, with coral reefs being a primary focus. Unlike more uniform and predictable terrestrial environments, coral reefs present a unique challenge for 3D reconstruction due to their intricate and irregular structures. Traditional 3D reconstruction methods struggle to capture the nuances of coral reefs accurately, primarily because reefs exhibit a high degree of spatial heterogeneity, featuring diverse shapes, sizes, and textures. Additionally, the dynamic nature of underwater conditions, such as varying light, water clarity, and movement, further complicates accurate geometric estimation of these ecosystems. With the rapid advancement of photogrammetric computer vision and deep learning, emerging methods have the potential to enhance the quality of coral reef 3D reconstruction. In this context, this study formulates a coral reef reconstruction workflow that incorporates these cutting-edge technologies. The workflow is divided into two core stages: sparse reconstruction and dense reconstruction. We summarize the relevant research efforts in each stage and outline the available methods. To assess their specific capabilities, we apply these methods to real-world coral reef images and conduct a comprehensive evaluation. Additionally, we analyze the strengths and weaknesses of the different methods and identify areas for improvement. We believe this study offers valuable references for future research in underwater mapping.


Introduction
More than 70% of the Earth's surface is covered by water, predominantly oceans, presenting considerable scope for technologies dedicated to water observation. Affected by climate change and human activities, marine ecosystems, especially coral reef ecosystems, face significant challenges (Hughes et al., 2017). Coral reefs are among the most remarkable ecosystems in warm tropical and subtropical oceans. Although they cover less than 0.1% of the ocean floor, their fish communities encompass approximately one-third of recognized marine species (Bowen et al., 2013). To enhance the understanding, monitoring, and protection of coral reefs, it is essential to use advanced technology to map, monitor, and model coral reef habitats.
In coral reef observation, various approaches and platforms are employed, including satellite remote sensing, aerial remote sensing, vessel-based sonar and LiDAR, underwater vehicle-based imaging, and manual in-situ underwater surveys (Collin et al., 2018, Price et al., 2019, Rossi et al., 2020, Character et al., 2021). Satellite and aerial remote sensing techniques offer a swift method for acquiring information in large-scale coral monitoring applications (Casella et al., 2017). However, they fall short in capturing detailed and accurate observations of the intricate structures within coral reefs. In contrast, manual measurements demand substantial time investment, imposing constraints on the achievable spatial and temporal scales. In terms of sensors, while sonar and LiDAR improve the acquisition of geometric information of benthic habitats, they face challenges in acquiring color information. The rise of vision-based underwater imaging enables the collection of higher-resolution data, unaffected by refraction at the water surface, facilitating precise 3D reconstruction of real seabed coral reefs at low cost (Rossi et al., 2020, Zhong et al., 2023). It provides a foundation of high-precision, high-resolution information crucial for subsequent research, and is becoming the centerpiece among the various sensors (Zhong et al., 2023).
Over the past decade, thanks to rapid advancements in computer vision technologies such as Structure-from-Motion (SfM) and Multi-View Stereo (MVS), underwater mapping based on photogrammetric computer vision has been extensively studied for coral reef observation. Utilizing high-resolution images captured by vehicles or divers, these techniques enable 3D observations with centimeter- or even millimeter-level precision (Guo et al., 2016). These approaches provide automated image processing tools, which facilitate the generation of fine 3D models that accurately represent the intricate spatial structure of coral reefs (Zhong et al., 2023). However, due to the unique characteristics of coral reef environments and the limitations of current algorithms, there is still a need for improvements in the precision, robustness, and efficiency of underwater mapping, particularly in light of the rapid advancements in learning-based image processing.
The data for this paper were collected from the shallow-water coral reefs around Moorea Island in French Polynesia. The island is surrounded by approximately 10 enclosed coral reefs, making it an ideal location for coral monitoring in the South Pacific. This research utilizes seabed images captured by an underwater camera system. To achieve high-resolution underwater mapping, we explore advanced deep learning and computer vision technologies. Specifically, we establish a workflow for coral reef 3D reconstruction based on underwater images, focusing on both sparse and dense reconstruction. We review and summarize the currently available methods, evaluating them qualitatively and quantitatively through comparative experiments. Additionally, we discuss their suitable application scenarios in light of their characteristics. We believe this study provides essential insights for future research in underwater mapping.

3D Reconstruction Workflow with Deep Learning
Our workflow for high-resolution 3D reconstruction of coral reefs is illustrated in Figure 1 and comprises four main stages: data collection and preparation, sparse reconstruction, dense reconstruction, and further processing. The first stage mainly involves the acquisition of high-resolution underwater images and auxiliary data. The images should be clear and exhibit overlap between different perspectives, as this forms the basis for 3D reconstruction. The auxiliary data, while not obligatory, may encompass measurements such as Ground Control Points (GCPs) used for georeferencing or camera poses derived from an Inertial Measurement Unit (IMU). Sparse reconstruction applies photogrammetric computer vision techniques to extract a set of sparse 3D points from the input images. These 3D points correspond to feature points or keypoints within the scene. Through sparse reconstruction, image poses and the structural information of the scene can be estimated simultaneously and accurately, even when dealing with unordered and uninformative images. The 3D points obtained from sparse reconstruction are, however, too sparse to reflect the detailed structure of the scene. Therefore, it is necessary to perform dense reconstruction to generate denser 3D points. This process ultimately results in a dense point cloud or mesh model, facilitating a dense representation of the 3D information within the scene, such as at millimetre-level resolution. Finally, based on specific requirements, different 3D products can be generated. For instance, texture models can be created through texture mapping, or orthomosaics can be generated through orthorectification. In this workflow, sparse and dense reconstruction play a crucial role in the accuracy, robustness, reliability, and visual quality of 3D reconstruction. They constitute the core steps of the entire workflow and are also the focal areas of photogrammetric computer vision and deep learning. Therefore, the following sections describe these two components in detail.

Sparse Reconstruction
At present, one of the most widely used frameworks for sparse reconstruction is Structure-from-Motion (SfM), which offers fast, low-cost, and easy 3D surveys and has been applied particularly successfully in high-resolution topography for geoscience applications. Various SfM strategies exist, among which incremental SfM stands out as one of the most popular, demonstrating suitable robustness, accuracy, and efficiency. Incremental SfM initiates processing with two images and gradually incorporates new images while continually optimizing (Schonberger and Frahm, 2016). This paper adopts the incremental SfM framework, combined with advanced photogrammetric computer vision and deep learning technologies, to establish a sparse reconstruction method for coral reefs. As illustrated in Figure 2, the method primarily comprises two stages: correspondence search and incremental reconstruction. The first stage, correspondence search, identifies projections of the same scene points in overlapping images. For each coral reef image, the first step is to extract local features, which are designed to be invariant under radiometric and geometric changes, ensuring their unique recognition across different images. Subsequently, feature matching is applied to discern images capturing the same area and establish feature correspondences across the images. The second stage is incremental reconstruction, which utilizes the feature correspondences to estimate the 3D relationships between 2D images. Building upon the outcomes of image matching, the inter-image overlap relationships are generated, and two adjacent images are selected to initialize the reconstruction. Then, the remaining images are registered to the current model by using feature correspondences to triangulated points from previously registered images. The newly registered images not only cover points already observed but also contribute new points through triangulation. However, because image registration and triangulation are separate procedures, errors inevitably exist, leading to continuous error propagation and accumulation, which could potentially result in drift and failure of the SfM reconstruction. Therefore, it is essential to optimize these observations. Bundle adjustment (Triggs et al., 2000) is consequently used to minimize reprojection errors by jointly refining camera and point parameters through a non-linear optimization process. Through iterative computation of the above process and outlier filtering, the scene structure and the poses of registered images can be estimated accurately.
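To make the optimization objective concrete, the following numpy sketch computes the reprojection error that bundle adjustment minimizes, together with a toy Gauss-Newton refinement of the camera translation only. This is a heavily simplified stand-in for full bundle adjustment, which jointly refines rotations, translations, intrinsics, and 3D points; all function names here are illustrative, not from any particular library.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of world points X (N, 3) into pixel coordinates."""
    Xc = X @ R.T + t                       # world -> camera frame
    uv = Xc[:, :2] / Xc[:, 2:3]            # perspective divide
    return uv * np.diag(K)[:2] + K[:2, 2]  # apply focal lengths and principal point

def mean_reprojection_error(K, R, t, X, obs):
    """Mean pixel distance between observed keypoints and their reprojections."""
    return float(np.linalg.norm(project(K, R, t, X) - obs, axis=1).mean())

def refine_translation(K, R, t0, X, obs, iters=10, eps=1e-6):
    """Toy Gauss-Newton refinement of the translation alone, using a
    numerical Jacobian -- a minimal illustration of the normal equations
    solved (at much larger scale) inside bundle adjustment."""
    t = t0.astype(float).copy()
    for _ in range(iters):
        r = (project(K, R, t, X) - obs).ravel()  # residual vector
        J = np.zeros((r.size, 3))
        for j in range(3):
            dt = np.zeros(3); dt[j] = eps
            J[:, j] = ((project(K, R, t + dt, X) - obs).ravel() - r) / eps
        t -= np.linalg.solve(J.T @ J, J.T @ r)   # Gauss-Newton step
    return t
```

Real systems solve the same least-squares problem with analytic Jacobians and sparse solvers, but the residual being driven to zero is exactly this pixel-space reprojection error.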
In the above sparse reconstruction process, the procedures of pose estimation, triangulation, and bundle adjustment are mature. The current major challenge lies in image matching, specifically in obtaining a sufficient quantity of accurate and reliable corresponding features. Due to the intricate and complex structure of coral reefs, the texture in captured images is more disorderly than in typical images, which presents a considerable challenge for image matching. The process can be further divided into feature extraction and matching, as shown in Figure 2. For feature extraction, traditional hand-crafted local feature methods follow a two-stage pipeline: first detecting keypoints and then generating a local descriptor for each keypoint. Scale-Invariant Feature Transform (SIFT) (Lowe, 2004) is the most representative and widely applied method, capable of extracting keypoints with scale and rotation invariance from images. There are also methods such as SURF (Bay et al., 2006) and KAZE (Alcantarilla et al., 2012). These methods utilize predefined criteria to extract points with certain characteristics from images. However, as these criteria may not be applicable to every scenario, adjustments to algorithm parameters are often required. With the advancement of deep learning, many studies have attempted to overcome these limitations using learning-based approaches. Early methods such as LIFT (Yi et al., 2016) utilize keypoint labels obtained by existing hand-crafted methods and enhance the repeatability of keypoints by optimizing objective functions. Convolutional neural network-based approaches emerged next, such as SuperPoint (DeTone et al., 2018), which uses a fully convolutional model to extract pixel-level interest point locations and associated descriptors from the input image and applies self-supervised learning to improve the generalizability of the model. R2D2 (Revaud et al., 2019) uses a Siamese decoding structure to generate repeatable and reliable features. DISK (Tyszkiewicz et al., 2020) employs reinforcement learning to optimize the model for more correct matches. ALIKED (Zhao et al., 2023) adopts a deformable descriptor head that learns the deformable positions of supporting features for each keypoint, thereby outputting robust and accurate descriptors. There are also detector-free local feature matching methods such as LoFTR (Sun et al., 2021), but they cannot be directly used in modern SfM systems because they do not explicitly extract keypoints and descriptors.
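For intuition on what the hand-crafted detection stage does, the difference-of-Gaussian (DoG) extremum criterion used by SIFT can be sketched in a few lines of numpy. This toy version omits SIFT's orientation assignment, sub-pixel refinement, edge rejection, and descriptor computation, and the scale list is an arbitrary illustrative choice:

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur (rows then columns) with a truncated kernel."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2)); k /= k.sum()
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, tmp)

def dog_keypoints(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.01):
    """Keypoints as local extrema of a difference-of-Gaussian stack --
    the detection criterion of SIFT (descriptors omitted)."""
    blurred = [blur(img, s) for s in sigmas]
    dogs = np.stack([b - a for a, b in zip(blurred[:-1], blurred[1:])])
    kps = []
    for s in range(1, len(dogs) - 1):              # interior scales only
        d = dogs[s]
        for y in range(1, d.shape[0] - 1):
            for x in range(1, d.shape[1] - 1):
                v = d[y, x]
                cube = dogs[s-1:s+2, y-1:y+2, x-1:x+2]  # 3x3x3 neighborhood
                if abs(v) > thresh and (v == cube.max() or v == cube.min()):
                    kps.append((x, y, sigmas[s]))
    return kps
```

A blob-like structure produces a strong extremum at the scale matching its size, which is why such detectors respond well to texture-rich coral surfaces but can be fooled by repetitive patterns.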
For feature matching, the difficulty lies in accurately matching the features while minimizing the number of mismatches. The classic ratio-test (Lowe, 2004) matches features based on the similarity between descriptors. While this method is simple and effective, it often produces a large number of outliers, which can lead to registration failure. To achieve robust feature matching, researchers have studied various strategies. On the one hand, robust outlier filters are designed to eliminate outliers while retaining more correct matches. For example, AdaLAM (Cavalli et al., 2020) takes keypoint positions and corresponding descriptors as input and achieves robust matching using an adaptive strategy. On the other hand, deep learning-based methods have been utilized, such as SuperGlue (Sarlin et al., 2020), which takes images and features as input and employs graph neural networks and attention mechanisms to obtain accurate matches.

The commonly used traditional MVS adopts a depth-map-based strategy, obtaining a depth map for each image through multi-view matching and then fusing all depth maps to generate a dense 3D point cloud. This approach is flexible, concise, and suitable for reconstructing the 3D structure of large-scale scenes; it is well studied and has been widely applied. Taking COLMAP (Schonberger and Frahm, 2016) as an example, a key step is the cost computation for multi-view matching: it uses bilaterally weighted Normalized Cross-Correlation (NCC) to calculate the similarity between image patches, thus obtaining matching costs. Optimization methods are then applied to minimize the matching cost for depth map generation.
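Stripped of COLMAP's bilateral weighting, the core of this photometric cost is plain normalized cross-correlation between patches; a minimal sketch, assuming equal-size grayscale patches:

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between two equal-size grayscale
    patches, in [-1, 1]; invariant to affine intensity changes."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def matching_cost(a, b):
    """Cost in [0, 2]: 0 for a perfect match, larger for dissimilar patches."""
    return 1.0 - ncc(a, b)
```

The invariance to per-patch gain and offset is what makes NCC attractive underwater, where illumination and attenuation vary strongly between views; the bilateral weights COLMAP adds on top down-weight pixels that differ in color or distance from the patch center.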

Dense Reconstruction
Driven by deep learning, learning-based MVS methods have emerged. In this category, MVSNet (Yao et al., 2018) is a pioneering approach that takes multi-view images and the corresponding camera parameters as input to predict a depth map for each image. Specifically, it designates one image as the reference and the others as source images. A differentiable homography warping operation is employed to build 3D cost volumes from 2D feature maps, enabling the integration of camera parameters into network training. Ultimately, a high-quality dense depth map of the reference image is generated. Several derivative algorithms build on MVSNet, among which Vis-MVSNet (Zhang et al., 2023) is one of the best in overall performance, using a coarse-to-fine strategy to achieve multi-view depth map estimation. An innovative aspect lies in its consideration of pixel visibility. To mitigate the impact of unmatched pixels, it generates an uncertainty map to estimate per-pixel visibility. The uncertainty is used as a weighting guidance to fuse the latent volume, which is further regularized into a probability volume and regressed to the final depth estimate.
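The homography warping at the heart of such cost volumes maps reference pixels into each source view under a hypothesized fronto-parallel depth plane. A numpy sketch of the geometry (single pixel, non-differentiable), under the assumed convention X_src = R · X_ref + t:

```python
import numpy as np

def plane_sweep_homography(K_ref, K_src, R, t, depth):
    """Homography sending reference pixels to source pixels, valid for
    points on the plane z = depth in the reference camera frame."""
    n = np.array([0.0, 0.0, 1.0])  # fronto-parallel plane normal
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)

def warp(H, u, v):
    """Apply a homography to one pixel and dehomogenize."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]
```

Sweeping `depth` over a set of hypotheses and comparing the warped source features against the reference features at each hypothesis is exactly what fills the cost volume; the depth with the lowest cost wins.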
Another category of methods that has emerged in recent years is based on Neural Radiance Fields (NeRF) (Mildenhall et al., 2021). Unlike traditional 3D reconstruction methods, NeRF can represent complex real-world geometry and appearance using a neural network, storing the 3D scene information in the parameters of the network. The typical input comprises images with known poses, and the output is the weights of the network. These methods are theoretically able to realize a finer representation of the continuous scene. To address the slow speed of NeRF, Instant-NGP (Müller et al., 2022) applies multiresolution hash encoding to reduce computational complexity while maintaining accuracy. This approach also facilitates parallel implementation on GPUs, thereby significantly improving efficiency. Nerfacto (Tancik et al., 2023) integrates improvements from multiple previous methods, allowing the model to balance accuracy and efficiency. With a modular design, it facilitates easy improvements in subsequent developments. Additionally, it introduces the Python framework Nerfstudio, supporting the output of results in the form of point clouds or mesh models. There are also methods specifically designed for multi-view 3D reconstruction, such as Neuralangelo (Li et al., 2023), which utilizes Instant-NGP as a neural Signed Distance Function (SDF) representation of the underlying 3D scene and is optimized from multi-view image observations via neural surface rendering. To enhance the effectiveness of multi-resolution hash encoding, it uses numerical gradients to compute higher-order derivatives, and a progressive optimization schedule is adopted to recover structures at different levels of detail, ultimately achieving high-quality surface reconstruction.
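What these NeRF variants share is the volume rendering step that composites the network's predicted colors and densities along each camera ray. A minimal numpy sketch of the quadrature from Mildenhall et al. (2021), for one ray of pre-sampled points:

```python
import numpy as np

def render_ray(rgb, sigma, t_vals):
    """Composite per-sample colors c_i and densities sigma_i along one ray:
    alpha_i = 1 - exp(-sigma_i * delta_i), T_i = prod_{j<i} (1 - alpha_j),
    C = sum_i T_i * alpha_i * c_i."""
    delta = np.diff(t_vals, append=t_vals[-1] + 1e10)          # sample spacing
    alpha = 1.0 - np.exp(-sigma * delta)                       # per-sample opacity
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = T * alpha
    return weights @ rgb, weights
```

During training, the rendered color is compared against the observed pixel, and the photometric loss is backpropagated through exactly this compositing into the network; the per-sample weights are also what the Nerfstudio-style exporters threshold when turning a trained field into a point cloud.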

Research Data
The data used in this research are supported by the Moorea Island Digital Ecosystem Avatar (IDEA) project and consist of high-resolution underwater coral reef images captured in the same area in August 2018 and August 2019. Specifically, 523 images were captured in 2018 and 323 in 2019. The images were acquired along pre-planned routes, with overlap rates between adjacent images mostly ranging from 70% to 85%, enabling multi-view 3D reconstruction. The camera system comprises a PANASONIC LUMIX GH5S camera body (resolution of 3680×2760 pixels) and a Lumix G 14 mm f/2.5 wide-angle lens.
For image matching, each feature extraction method is applied to extract 8000 features per coral reef image. Subsequently, ratio-test and AdaLAM are employed for feature matching. Additionally, considering that SuperGlue works particularly well with SuperPoint (Sarlin et al., 2020), SuperPoint features are also matched using SuperGlue. SuperGlue offers two pre-trained weight models, one tailored for indoor environments (referred to as SG (in)) and the other designed for outdoor settings (referred to as SG (out)). The ratio-test is applied with a mutual nearest neighbor check, and the ratio is set to 0.9. After preliminary experimental analysis, we found that when there is substantial overlap and only minor translation or rotation between a pair of images, all methods can generally obtain a sufficient number of correct matches. However, significant differences arise when the overlap is low or the rotation is large. Figures 4 and 5 illustrate the image matching results under these two challenging scenarios.
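The ratio-test with a mutual nearest-neighbor check, as configured above, can be sketched in numpy as follows (brute-force L2 distances for clarity; production matchers use optimized nearest-neighbor search):

```python
import numpy as np

def ratio_test_match(desc_a, desc_b, ratio=0.9):
    """Lowe's ratio test with a mutual nearest-neighbour check on L2
    distances between descriptor sets of shape (N, D) and (M, D)."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = d.argmin(axis=1)  # best match in b for each a
    nn_ba = d.argmin(axis=0)  # best match in a for each b
    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] != i:                      # mutual check
            continue
        second = np.partition(d[i], 1)[1]      # second-best distance
        if d[i, j] < ratio * second:           # ratio test
            matches.append((i, j))
    return matches
```

The ratio test rejects descriptors whose best and second-best candidates are nearly equidistant, which is precisely the ambiguity that repetitive coral texture produces in abundance; the mismatches that survive are what AdaLAM's geometric filtering is then meant to remove.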
As shown in Figure 4, when the overlap between two images is low, image matching becomes challenging. When using the ratio-test for feature matching, some correct matches can be obtained, but there are also many mismatches. While a limited number of mismatches can be filtered out during reconstruction, an excessive amount is problematic. In particular, the matching results of SIFT and SuperPoint are not suitable for reconstruction, whereas the performance of the other methods is slightly better. When using AdaLAM, almost no mismatches remain. SIFT features, due to a high error rate, struggle to obtain correct matches, while the four deep learning-based local features, with the support of AdaLAM, successfully achieve correspondence search, indicating the effectiveness of the AdaLAM algorithm in outlier filtering.
The scenario depicted in Figure 5 is markedly different, involving approximately a 90-degree rotation between the two images. It is evident that the deep learning-based features fail to yield correct matches, primarily because regular CNNs lack equivariance to rotation. Despite ALIKED's specific design to enhance rotational invariance, it proves insufficient for the current data. In contrast, SIFT demonstrates significantly better results, generating a majority of correct matches and only a small number of mismatches when using the ratio-test. When employing AdaLAM, nearly no mismatches remain, and many correct matches are obtained. Therefore, in such cases, SIFT is the more suitable choice. The SIFT descriptor achieves rotation invariance by assigning a consistent orientation to each keypoint based on local image properties; this is a capability lacking in current deep learning-based features and represents a crucial area for future research. As for SuperGlue, in both sets of data, the indoor model is almost unable to generate any matches, and the outdoor model only produces mismatches. This is likely because SuperGlue is a supervised learning method whose training data lacks scenes similar to these inherently challenging scenarios, so its performance is poor.
For SfM reconstruction, we utilized the open-source software COLMAP (Schonberger and Frahm, 2016) to implement incremental reconstruction. The configuration of the feature extraction and matching methods remains the same, and the number of features per image is still limited to 8000. We perform reconstruction using the images collected from Plot18 in 2018 and 2019, respectively. Figure 6 shows the SfM reconstruction results using SIFT features with AdaLAM, including the point cloud of the scene and the camera poses. To compare and evaluate the effectiveness of SfM reconstruction, we calculated five metrics: nImage, Feat, nPoint, Track, and Error. nImage represents the number of aligned images; Feat is the average number of features successfully used for triangulation per image; nPoint is the number of reconstructed 3D points (1k = 1000); Track is the mean number of repeated observations per 3D point; and Error denotes the average reprojection error of the keypoints. The quantitative results are shown in Table 1. Overall, the comprehensive performance of SIFT is excellent. Regardless of the matching method used, all images can be successfully aligned, and the reprojection error is only slightly larger than that of ALIKED. When using AdaLAM, both Feat and nPoint decrease, but Track increases, becoming the highest among the methods. This indicates that AdaLAM can connect more features across different images, making the reconstruction more stable. On the contrary, R2D2 performs the worst. It has the fewest successfully aligned images in both datasets, with the lowest values for Feat and nPoint. Even when using AdaLAM, only about 2000 features are successfully matched, and the reprojection error is also high. Among SuperPoint, DISK, and ALIKED, ALIKED has a relatively good overall performance. It not only has the highest number of successfully aligned images but also the smallest reprojection error. This suggests that among deep learning-based feature methods, ALIKED is more suitable for high-precision SfM reconstruction. DISK and SuperPoint are close in terms of nImage, but DISK features have higher Feat and Track values, whereas the reprojection error of SuperPoint is slightly lower. As for SuperGlue, there is a notable distinction between the indoor and outdoor models. Compared to non-learning methods, the indoor model performs poorly and struggles to align all images. The outdoor model shows improvement, but its reprojection error is relatively high, indicating less accurate feature matching. It is noteworthy that, compared to the ratio-test, AdaLAM is relatively less accurate while generally improving matching robustness and increasing the number of repeated observations. It may match keypoints that are not the most accurate correspondences but rather nearby keypoints. This is related to the geometric assumptions within the algorithm, which sacrifice a little accuracy for better stability.
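Given per-point track data (here a hypothetical list of observation lists; COLMAP exposes equivalents through its reconstruction files), most of these statistics reduce to simple averages. A small sketch:

```python
import numpy as np

def reconstruction_stats(tracks, n_images):
    """tracks: for each reconstructed 3D point, the list of images that
    observe it. Returns nPoint, Track, and Feat as defined in the text."""
    obs = np.array([len(t) for t in tracks])
    return {
        "nPoint": len(tracks),                # number of 3D points
        "Track": float(obs.mean()),           # mean observations per 3D point
        "Feat": float(obs.sum() / n_images),  # triangulated features per image
    }
```

This also makes the AdaLAM trade-off visible in the numbers: fewer but longer tracks (lower nPoint, higher Track) mean each 3D point is constrained by more images, which stabilizes bundle adjustment even if individual correspondences are slightly less precise.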
Summarizing the above experimental results, SIFT stands out as the most practical local feature method, showcasing excellent precision and reliability. Deep learning-based methods, on the other hand, currently face challenges in effectively handling coral reef image matching. Despite this, the progression from early methods like SuperPoint to the recent ALIKED indicates a continuous enhancement in reliability and accuracy, so it is reasonable to anticipate even better methods in the future. As for feature matching, AdaLAM exhibits remarkable robustness, significantly addressing the issue of excessive outliers and enhancing the stability of SfM reconstruction.

Dense Reconstruction
Based on the accurate camera poses derived from sparse reconstruction, the intricate fine structure of coral reefs can be estimated through dense reconstruction. This section conducts comparative experiments on the three categories of dense reconstruction methods mentioned in Section 2.2. The first is traditional MVS, for which we use the dense reconstruction functionality in COLMAP (Schonberger and Frahm, 2016).
The second is deep learning-based MVS, for which we employ Vis-MVSNet (Zhang et al., 2023). The third is the recently popular NeRF-based category, for which we use Instant-NGP (Müller et al., 2022), Nerfacto (Tancik et al., 2023), and Neuralangelo (Li et al., 2023). It should be noted that the direct outputs of COLMAP and Vis-MVSNet are dense point clouds. Instant-NGP and Nerfacto are not designed for generating point clouds or mesh models, but point clouds can still be obtained by some means (Tancik et al., 2023). Neuralangelo, on the other hand, can output mesh models directly. For visualization, we use Poisson surface reconstruction (Kazhdan et al., 2006) to transform point clouds into mesh models. Since the three NeRF-based methods applied here are not suitable for large-scale scenes, we select a subset of images (42 images) from a specific region for the experiments. We adjust the parameters of each method to obtain the best possible results, and a selection of the dense reconstruction results is visualized in Figure 7.
The results of COLMAP and Vis-MVSNet are similar, but Vis-MVSNet produces relatively denser results, albeit with slightly more noise, showcasing its commendable generalization, whereas COLMAP's result appears overly smoothed. The results of Instant-NGP and Neuralangelo are the least satisfactory overall. The point cloud obtained by Instant-NGP contains many outliers, making it difficult to reconstruct a correct mesh model. The results of Neuralangelo, on the other hand, are excessively smooth, failing to capture the fine structure of the coral reefs, and the contrast of its mesh texture is abnormally high. In comparison, Nerfacto's result is significantly better, generally reconstructing the intricate details of the coral reefs, especially the tentacles of the corals. However, there are still quite a few outliers in the point cloud, leading to surface irregularities in the mesh model. NeRF-based methods often exhibit noise or over-smoothing in their results, indicating a current limitation in handling data noise and a need for enhanced reliability; nevertheless, they already achieve acceptable results, and with ongoing enhancements, satisfying outcomes can be anticipated in the near future. Additionally, a major factor contributing to poor performance is that the image viewpoints are mainly downward-looking and lack side-view data.
NeRF captures 3D scene information by modeling the volumetric scene as a continuous function that predicts the color and opacity of any given 3D point.Lower image overlap implies fewer corresponding 2D projections of 3D points across different images.This results in fewer constraints in the NeRF optimization process, and inadequate information for NeRF to accurately model the geometry and appearance of the underlying scene.In practice, improving the quality of 3D reconstruction can be achieved by increasing image overlap and coverage.
In 3D reconstruction, dense reconstruction has always been the most time-consuming task. Therefore, we also test the execution time of the different methods, where COLMAP is implemented in C++ and the other methods in Python. Instant-NGP and Nerfacto are implemented using Nerfstudio (Tancik et al., 2023). All experiments are conducted on an NVIDIA GeForce RTX 3090 GPU. We test the time taken by the different dense reconstruction methods used in Figure 7, and the results are shown in Table 2. In addition, we also test the runtime of COLMAP and Vis-MVSNet in reconstructing large-scale scenes, applying both methods to the dense reconstruction of the entire area. As they are both depth map-based methods, the entire process can be divided into three steps: I. Data preprocessing, II. Depth map generation, and III. Fusion of depth maps to obtain point clouds.
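Step III can be illustrated by back-projecting each depth map into world space and concatenating the resulting clouds. The sketch below is a naive version, assuming the convention X_cam = R · X_world + t; real fusion pipelines additionally check photometric and geometric consistency across views before accepting a point:

```python
import numpy as np

def backproject(depth, K, R, t):
    """Lift a depth map (H, W) to world-space points:
    X_world = R^T (depth * K^{-1} [u, v, 1]^T - t), per pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays = pix @ np.linalg.inv(K).T   # camera-space viewing rays
    Xc = rays * depth.reshape(-1, 1)  # scale rays by per-pixel depth
    return (Xc - t) @ R               # camera -> world frame

def fuse(depth_maps, cameras):
    """Naive fusion: stack the back-projected clouds of all views."""
    return np.concatenate([backproject(d, *cam) for d, cam in zip(depth_maps, cameras)])
```

The consistency checks omitted here (e.g. requiring a depth estimate to reproject to a similar depth in neighboring views) are what remove the floating outliers that otherwise plague the fused cloud.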
The specific execution times are presented in Table 3. The total time taken by Vis-MVSNet is significantly less than that of COLMAP, mainly because the MVS process within COLMAP involves time-consuming tasks such as iterative computations. The most time-consuming step in Vis-MVSNet is likewise depth map generation. Its preprocessing stage is also resource-intensive, mainly due to the conversion of the sparse reconstruction results from COLMAP into a format suitable for network input. Overall, Vis-MVSNet achieves appropriate results with excellent operational efficiency, while NeRF-based methods demonstrate outstanding potential in fineness but come with longer processing times.

In future practical applications, the reconstruction approach may not rely on a single method alone. Instead, it may be beneficial to combine the strengths of the various methods in accordance with the specific requirements of the task, achieving a balance between effectiveness and efficiency. For example, a coarse-to-fine strategy can be adopted for large-scale underwater mapping. Specifically, after sparse reconstruction, an initial dense reconstruction of the terrain could be rapidly achieved using deep learning-based MVS methods, resulting in a preliminary dense model. Subsequently, based on this model and the task requirements, more densely sampled and higher-resolution data could be collected in areas of interest, such as coral reefs. Finally, fine-grained dense reconstruction could be carried out using NeRF-based methods.
In summary, deep learning-based dense reconstruction methods are not inferior to, and in some aspects even surpass, traditional MVS in fineness and efficiency.This indicates that emerging computer vision and deep learning technologies have achieved remarkable advancements, with substantial room for improvement.Looking ahead, these advancements hold the potential to significantly advance the field of high-resolution underwater mapping, leading to more in-depth and comprehensive outcomes.

Conclusions
In this paper, we take the coral reefs of Moorea Island as an example and elaborate in detail on how current emerging photogrammetric computer vision and deep learning technologies can be applied in high-resolution underwater mapping, in response to the limitations of traditional methods.Combining the current research, we establish an improved workflow for 3D reconstruction of coral reefs.Delving into both sparse and dense reconstruction, this paper conducts an analysis and summary of classical and state-of-the-art methods, elucidating how to apply them concretely.Through experiments on actual coral reef images, qualitative and quantitative evaluations of these methods are performed in terms of accuracy, reliability, efficiency, etc. Building upon this foundation, we analyze their strengths and limitations, confirm the promising prospects of cutting-edge methods, and propose feasible directions for improvement, providing outlooks for future research and applications.
Based on the camera poses estimated by sparse reconstruction, dense reconstruction techniques can be applied to generate a dense point cloud or mesh model of the scene, as shown in Figure 3. For coral reefs, the primary significance of dense reconstruction is to recover their fine structure. Due to the presence of structures like tentacles, coral reef images often suffer from occlusion and texture repetition, imposing high demands on dense reconstruction. Over the past two decades, many excellent algorithms have emerged, ranging from traditional Multi-View Stereo (MVS) to deep learning-based MVS and, more recently, rapidly developing methods based on Neural Radiance Fields (NeRF). These methods vary in accuracy, robustness, and efficiency. The ongoing advancements in this field exemplify the continuous evolution and innovation of 3D reconstruction techniques, opening up new possibilities for underwater mapping.

Figure 4. Qualitative visual inspection of underwater image matching with low overlap.

Figure 5. Qualitative visual inspection of underwater image matching with a large rotation.

Figure 6. Visualization of our SfM reconstruction results.

Figure 7. Comparison of dense reconstruction results.

Table 1. Various metrics of the reconstruction results of different methods. "RT" represents ratio-test, and "Ada" represents AdaLAM.

Table 2. The runtime in seconds for small-scale scenes.

Table 3. The runtime in seconds for large-scale scenes.