SGformer: Boosting Transformers for Indoor Lighting Estimation from a Single Image

Predicting lighting from standard images can effectively circumvent the need for resource-intensive High Dynamic Range (HDR) lighting acquisition. However, this task is often ill-posed and challenging, particularly within indoor scenes, due to the intricacy and ambiguity inherent in various indoor illumination sources. We propose an innovative transformer-based method called SGformer for lighting estimation through the modeling of Spherical Gaussian (SG) distributions—a compact yet expressive lighting representation. Diverging from previous approaches, we explore underlying local and global dependencies in lighting features, which are crucial for reliable lighting estimation. Additionally, we investigate the structural relationships spanning various resolutions of SG distributions, ranging from sparse to dense, aiming to enhance structural consistency and curtail potential stochastic noise stemming from independent SG component regressions. By harnessing the synergy of local-global lighting representation learning and incorporating consistency constraints from various SG resolutions, the proposed method yields more accurate lighting prediction results, which allow for more realistic lighting effects in object relighting and composition. The code for the implementation of our work will be publicly available online.


Introduction
In today's world of gaming, digital effects, and the surging popularity of augmented and mixed reality (AR/MR) applications, there is a growing demand for more realistic lighting. Achieving this realism is essential to ensure consistent shading and shadow alignment between virtual and real objects. Traditionally, capturing a scene's lighting involves using light probes or omnidirectional 360° capturing devices. However, the use of specialized devices is time-consuming and often cost-prohibitive, which hinders their widespread adoption. To overcome these limitations, recent advancements in deep learning techniques [15,17,7] and the availability of extensive lighting-related datasets have prompted the development of methods that predict global illumination from standard partial field-of-view images, providing a more accessible and cost-effective approach to approximating lighting [34,21,28].
Estimating indoor scene lighting, where light source quantities, distribution, and intrinsic properties may vary significantly between scenes, is a widely recognized challenge. Various deep learning methods tackle it by generating indoor lighting conditions using different lighting representations, either predicting environment maps or regressing lighting parameters such as Spherical Harmonics (SH) and Spherical Gaussians (SG). In contrast to environment maps, which provide dense pixel-level representations, parametric lighting models offer a condensed lighting representation focusing on the distribution of key light sources, making them favored for real-time rendering and relighting applications [25,36,30]. Among these parametric lighting models, the SG model is notable for its compactness and efficiency [14,8]. It excels at capturing intricate high-frequency lighting details, enabling robust rendering of specular reflections and highlights in images, and has gained considerable attention recently [1,32,33].
Gardner et al. [8] introduced a set of SG parameters representing light sources, considering their direction, position, color, and size. However, directly regressing such light-source-dependent SG parameters often leads to unstable model training and inference, due to the unconstrained number of light sources and their floating positions, thereby limiting prediction accuracy. Alternatively, Li et al. [14] and Zhan et al. [33] employed a different SG representation in which multiple SG components are evenly distributed over a unit sphere [27]. In this representation, each SG component encodes the local light direction and intensity, as well as the ambient lighting. This Gaussian map representation effectively enhances inference stability and enables more effective optimization. Subsequent works [32,1,31] have built upon this SG map representation to predict lighting conditions for direct object rendering [14] or as a concise prior for improved environment map prediction [32,31]. However, while Gaussian map predictions have shown promise, increasing the number of Gaussian components for better high-frequency approximation often results in noisy predictions with more missing or superfluous Gaussian components.
In this paper, we introduce an innovative deep architecture aimed at enhancing Gaussian map predictions for improved indoor lighting estimation. Specifically, we leverage the Conformer architecture [18] to extract lighting features from low dynamic range (LDR) input images by modeling both local and global lighting features, as well as their intricate relationships. Given that the input covers considerably limited information compared to the target panorama, typically less than 10% of the full scene in a standard image [13], a comprehensive understanding of both local lighting cues (e.g., small specular highlights) and global lighting cues (e.g., ambient lighting, shadows) is essential for the network to infer reliable lighting conditions. Furthermore, to improve the structural distribution of predicted Spherical Gaussian components, we propose a multi-head transformer decoder, accompanied by a distribution consistency loss across multi-resolution SG distributions for better lighting structure learning. These innovations effectively mitigate potential noise across all spectra of SG map predictions and enhance the overall spatial structure of the predicted lighting. Our experimental results demonstrate that our boosted transformer-based framework effectively enhances Spherical Gaussian map predictions, leading to more realistic object rendering and more accurate guidance for environment map predictions. Our contributions are summarized as follows:
• We propose SGformer, a novel transformer-based network that combines a Conformer encoder with a multi-head transformer decoder to enhance SG predictions.
• We design a novel SG consistency loss to improve lighting structure predictions by exploring the spatial relationships across different SG resolution levels.To the best of our knowledge, this is a pioneering work to harness multiple SG resolutions for lighting prediction.
• Our SGformer model can effectively serve as a tool to enhance environment map generation and enable more realistic object rendering.

Related work
Lighting Representations: Extensive research efforts have been dedicated to devising methods for representing environmental lighting conditions. One widely adopted representation is the environment map [19,20], which characterizes lighting using dense 2D images. Typically, an environment map is derived from the projection of a high-dynamic-range (HDR) spherical image, employing techniques such as equirectangular projection or cube mapping. Environment maps are extensively employed in image-based rendering pipelines. However, their high dimensionality poses significant challenges when predicting individual pixels. Additionally, the non-uniform sampling on a spherical surface often introduces distortions or irregular shapes in the image, differentiating them from traditional images and presenting estimation challenges.
Alternatively, parametric lighting models provide compact representations commonly used as prior lighting information for real-time rendering. Various parametric models, such as the SG [29,26] and SH [11,2] lighting models, have been introduced. The SG model characterizes environmental lighting using Gaussian lobes, each defined by several parameters such as size, central direction, and fatness/sharpness. Utilizing more SG components/functions often leads to a more precise description of the lighting conditions. In comparison to SH models with a predefined set of orthogonal polynomial functions [16], SG provides greater flexibility in configuring the shape, number, and distribution of basis functions. It effectively mitigates the potential ring artifacts introduced by high-order SH functions while approximating full-frequency lighting. With a similar number of parameters, SG excels at capturing specular reflections and highlights. In this paper, we primarily concentrate on estimating parametric lighting from a single standard image. We make the first attempt at investigating the spatial relationships among spherical Gaussian components across various SG resolutions.
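As a concrete illustration of the SG lobe formulation described above, the following sketch evaluates a set of lobes G(u) = A · exp(λ(d · u − 1)) and rasterizes their sum into a small equirectangular map. This is an illustrative NumPy sketch; the function names and the exact lobe parameterization are our own, not from any released codebase.

```python
import numpy as np

def sg_eval(u, amplitude, axis, sharpness):
    """Evaluate one spherical Gaussian lobe G(u) = A * exp(lambda * (d.u - 1))
    at unit direction(s) u; 'axis' is the lobe's central direction d."""
    cos = np.asarray(np.asarray(u) @ np.asarray(axis))      # d . u
    return np.asarray(amplitude) * np.exp(sharpness * (cos - 1.0))[..., None]

def render_env_map(lobes, width=64, height=32):
    """Rasterize a sum of SG lobes into an equirectangular H x W x 3 map."""
    theta = (np.arange(height) + 0.5) / height * np.pi       # polar angle
    phi = (np.arange(width) + 0.5) / width * 2 * np.pi       # azimuth
    t, p = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(t) * np.cos(p),
                     np.sin(t) * np.sin(p),
                     np.cos(t)], axis=-1)                    # H x W x 3
    out = np.zeros((height, width, 3))
    for amp, axis, lam in lobes:
        out += sg_eval(dirs, amp, axis, lam)
    return out
```

A larger sharpness λ concentrates the lobe around its axis, which is what lets SG models represent high-frequency highlights with few parameters.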
Learning-based Lighting Parameter Regression: Several works focus on regressing lighting parameters from partial-view images using deep learning methods, primarily targeting real-time rendering and relighting applications. Garon et al. [10] and Cheng et al. [6] introduced deep learning models for predicting scene illumination by regressing spherical harmonic (SH) coefficients. Gardner et al. [8] employed a parameterization scheme where each light source is represented by a single spherical Gaussian (SG) function. They developed a deep learning model to regress key lighting attributes, including the direction, intensity, and color of each individual light source. EMlight [33] introduced a method for predicting Spherical Gaussian (SG) maps that comprise a fixed number of lighting components, referred to as anchors, uniformly distributed on a unit sphere. A spherical mover's loss was introduced to precisely regularize the distribution of SG components within the SG maps. These predictions are utilized as initial lighting structure guidance for the synthesis of panoramic illumination maps. GMlight [32] extended this work to incorporate depth information. It regularizes Gaussian map learning in a geometric space using a depth-guided geometric mover's loss, enabling spatially varying lighting estimation. DSGlight introduced a graph-based framework to enhance SG map estimation. It employs a graph convolutional network (GCN) module to refine the color and depth of each SG component at a semantically structural level. Finally, Xu et al. [31] proposed a transformer-based model with self-attention mechanisms to improve contextual modeling of SG distributions, serving as a pre-processing step for further environment map estimation.
Prior methods have mainly concentrated on enhancing the stability of regression models during training [33], and they have attempted to introduce additional regularization techniques to improve the regression of lighting distributions for better preservation of full-frequency information [33,32]. However, these efforts have primarily centered on improving the regression decoder [1,31], often neglecting the feature extraction module, which has a more critical impact on the final results. Additionally, regularization has typically been applied only at the highest resolution to maintain high-frequency details, which can be particularly challenging when working with very limited input information.
To enhance the accuracy of our predictions, we propose improving the extraction of lighting features by considering both local and global lighting characteristics and their interrelationships through the Conformer network [18]. Furthermore, we introduce a multi-head transformer decoder accompanied by an SG consistency loss that regularizes SG distributions using multi-resolution information, spanning from sparse to dense, to promote a deeper understanding of their structural aspects.
Learning-based Environment Map Generation: Several deep learning methods have been introduced to estimate environment maps from standard images [34]. The origins of these studies can be traced back to the pioneering work of Gardner et al. [9], which has since inspired a series of subsequent efforts [4,22,24]. These efforts address various critical aspects of this field, including the ability to handle spatial variations in indoor scenes [23,24], lightweight solutions for mobile applications [13], and the capacity to generalize to a wide range of input variations [35]. Notably, some recent work [33,32,31] has explored the use of sparse lighting representations as guidance for generating dense, pixel-wise environment maps. In such cases, SG maps and/or SH diffuse maps serve as instructive priors for directing the generation of lighting sources in the final environment map.

Method
We employ the following equation to represent illumination in the form of the SG map [33], denoted as D:

D(u) = Σ_{i=1..N} A_i exp((d_i · u − 1) / α),    (1)

The original HDR environment map of the scene is partitioned into two components: the light source component (L^{hdr>Is}) and the ambient component (L^{hdr<Is}), based on an intensity threshold (I_s = I_max * 0.05 in our case), which depends on the maximal pixel value I_max of the environment map. The light source regions are approximated using multiple SG functions [8], evenly distributed across a sphere, where N represents the number of SG functions or anchor points. d_i represents the direction of anchor point i, predefined using the method proposed by Vogel [27]. The symbol u denotes an arbitrary direction vector on a unit sphere, and α stands for the inverse of the angular size, which we set to the constant 1. Each light source is associated with neighboring anchor points based on a minimum radial distance criterion. The RGB value of each anchor point A_i is calculated as follows [33]:

A_i = (1 / |p_i|) Σ_{p ∈ p_i} L^{hdr>Is}(p),    (2)

where p_i represents the collection of pixels nearest to the anchor point with direction d_i.
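The anchor construction above can be sketched as follows. This is a simplified illustration under stated assumptions: Vogel's golden-angle spiral for the anchor directions, the per-pixel maximum channel as the intensity used for thresholding, and unweighted averaging over each anchor's assigned light-source pixels. All function names are hypothetical.

```python
import numpy as np

GOLDEN_ANGLE = np.pi * (3.0 - np.sqrt(5.0))

def vogel_directions(n):
    """N roughly uniform unit directions via Vogel's spiral (golden-angle) method."""
    i = np.arange(n)
    z = 1.0 - 2.0 * (i + 0.5) / n            # heights uniformly spaced in [-1, 1]
    r = np.sqrt(1.0 - z * z)
    phi = i * GOLDEN_ANGLE
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)

def fit_anchors(hdr, n_anchors=128, thresh_ratio=0.05):
    """Assign each light-source pixel of an equirectangular HDR map to its
    nearest anchor direction and average, giving the anchor colors A_i."""
    h, w, _ = hdr.shape
    theta = (np.arange(h) + 0.5) / h * np.pi
    phi = (np.arange(w) + 0.5) / w * 2 * np.pi
    t, p = np.meshgrid(theta, phi, indexing="ij")
    dirs = np.stack([np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)], -1)
    anchors = vogel_directions(n_anchors)
    # keep only light-source pixels: intensity above I_s = thresh_ratio * I_max
    inten = hdr.max(-1)
    mask = inten > thresh_ratio * inten.max()
    idx = np.argmax(dirs[mask] @ anchors.T, axis=-1)   # nearest anchor by angle
    colors = np.zeros((n_anchors, 3))
    for i in range(n_anchors):
        sel = hdr[mask][idx == i]
        if len(sel):
            colors[i] = sel.mean(0)
    return anchors, colors
```

Note that maximizing the dot product with each anchor direction is equivalent to minimizing the angular distance, matching the minimum-radial-distance association described above.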
In the following sections, we start by analyzing the capability of SG maps at different resolutions to represent lighting features. This analysis demonstrates the importance of our SG learning framework's design. Following that, we introduce our network structure and its associated learning scheme.

Multi-Resolution SG Analysis
The resolution of the SG map, crucial for accurately capturing lighting information from a scene, is determined by the number of SG functions, denoted as N. Figure 1 illustrates examples of scenes and their SG maps at different resolutions. Lower-resolution SG representations, characterized by fewer anchor points, provide a more abstract depiction of lighting conditions. They offer a rough but still informative overview of the primary light sources' positions and their central intensities. Higher-resolution SG representations, with a greater number of anchor points, excel at conveying intricate light source details. They effectively capture nuances such as shape, intensity variations, and directional shifts. Across different resolutions, SG representations maintain consistent distribution patterns. As the resolution increases from sparse to dense, a core spatial distribution is preserved, while each single light source can encompass multiple anchor points in light shaping, creating complex semantic relationships between different anchors.
To assess the accuracy of different SG map resolutions in representing light sources, we compare the angular and intensity deviations between the ground-truth light sources in the HDR environment map and those extracted from different SG maps. Following the approach of Gardner et al. [8], we detect the ground-truth light source regions using a region expansion method applied to the HDR environment map. This process begins with light peaks as initial seeds and incrementally expands until the intensity falls below one-third of the peak value. Subsequently, region merging is carried out based on overlapping regions. To extract light sources from SG maps, we align anchor points with their closest ground-truth light source regions. We rank the top five light source regions by peak intensity, from highest to lowest. For each of these regions, we conduct a statistical analysis of all anchor points within that region; the anchor point with the highest lighting intensity is designated as the extracted light source center. The outcomes, presented in Table 1, demonstrate that as SG resolution increases, the intensity and angular deviations stemming from the lighting representation diminish. This highlights the significance of high-resolution SG maps in achieving precise lighting representation.
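A minimal sketch of the region-expansion step (seeds at intensity peaks, growth until intensity falls below one-third of the peak). It omits the subsequent merging of overlapping regions and treats the equirectangular grid as flat apart from azimuthal wrap-around, so it is an approximation of the procedure of Gardner et al. [8] rather than a faithful reimplementation.

```python
import numpy as np
from collections import deque

def detect_light_regions(intensity, n_peaks=5, stop_ratio=1.0 / 3.0):
    """Grow a labeled region around each intensity peak (brightest first) until
    pixel intensity drops below stop_ratio * peak. Returns an integer label map."""
    h, w = intensity.shape
    labels = np.zeros((h, w), dtype=int)
    flat_order = np.argsort(intensity, axis=None)[::-1]  # brightest pixels first
    label = 0
    for flat in flat_order:
        y, x = divmod(int(flat), w)
        if labels[y, x] or label >= n_peaks:
            continue
        peak = intensity[y, x]
        if peak <= 0:
            break                                        # no meaningful seeds left
        label += 1
        labels[y, x] = label
        q = deque([(y, x)])
        while q:                                         # 4-connected BFS growth
            cy, cx = q.popleft()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                nx %= w                                  # wrap around in azimuth
                if 0 <= ny < h and not labels[ny, nx] \
                        and intensity[ny, nx] >= stop_ratio * peak:
                    labels[ny, nx] = label
                    q.append((ny, nx))
    return labels
```

Processing seeds in decreasing peak order ensures that each detected region is anchored at a local maximum not already claimed by a brighter light source.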
The task becomes more challenging when aiming to predict high-resolution SG maps, primarily because it involves a substantially larger number of parameters. Unlike low-resolution predictions, which focus on estimating average positions and intensities, high-resolution predictions must capture more intricate details, including the shapes of light sources and subtle intensity variations. Therefore, a more comprehensive set of representative lighting features becomes essential. Additionally, as the number of SG components increases, preserving the structural semantics of numerous discrete points becomes crucial for creating a meaningful representation of light sources. Ensuring structural consistency across resolutions can serve as a valuable regularization technique in this context, effectively reducing noise in high-resolution predictions and addressing sparsity issues in low-resolution predictions.

Network Architecture
Given a partial-view image captured in a scene, we introduce SGformer for predicting the scene's lighting conditions; Figure 2 presents an overview of the overall architecture. We utilize a Conformer encoder [18] to extract lighting features, which are then input into a multi-head transformer-based decoder. Subsequently, a fully connected layer converts these lighting features into SG lighting parameters for each resolution. This includes estimating the anchor point distributions, along with the associated global average lighting intensity and RGB ratios [33,31]. The network is trained holistically, and we augment it with a structural consistency loss for additional regularization.
Conformer Encoder: Considering the limited scene information available from the input image and the wide variations in indoor illumination, learning the discriminative lighting-specific features concealed within the input photos is crucial for a meaningful understanding and inference of lighting. Real-world lighting distributions exhibit intricate properties, with lighting features sometimes confined to small areas, like a specular highlight on an object's surface, and at other times extending widely, such as large window light or shadows. Local and global lighting features often interact with each other; for instance, a specular highlight can be generated by a light source located far from the viewer's perspective.
Convolutional networks excel at capturing local features but tend to struggle when learning long-distance global relationships. The vision transformer, on the other hand, is proficient at learning global representations but often overlooks finer local feature details. In our preliminary experiments, we observed that relying solely on a Transformer as the feature extractor is susceptible to inaccuracies in predicting lighting intensity. To tackle this challenge, we draw inspiration from the recent trend of combining these two technologies in various visual and non-visual tasks [18,5,37,12]. In the context of lighting estimation, Xu et al. [31] leveraged the DETR method [3,38], which blends convolutional layers and Transformers, to extract lighting features. While this approach provided some relief, it still exhibited limitations in effectively modeling the interaction between local and global features within a cascade paradigm.
We propose a new lighting feature extraction module based on the Conformer [18]. Multi-Head Transformer Decoder: Our decoder builds on the DETR decoder adopted in [31]. In contrast to previous lighting prediction methods [31], which primarily focused on a single SG resolution, our approach introduces a multi-head decoder that simultaneously estimates SG parameters across different resolutions. This architectural choice enhances feature learning, enabling the encoder to generalize to a variety of tasks with differing levels of complexity. The DETR decoder consists of multiple transformer blocks that include self-attention and cross-attention mechanisms, along with a learnable anchor query system for the simultaneous prediction of multiple targets. In our multi-head decoder structure, distinct anchor queries handle the different SG embeddings, in addition to a global query for obtaining global embeddings. Subsequent fully connected layers serve as the prediction head, transforming these feature embeddings into the respective SG parameters.

Loss Functions
SG Consistency Loss: Within the multi-head SG decoder, separate transformer decoders handle the decoding of lighting features for different SG resolutions. Consequently, there is no inherent guarantee of consistency among them. To enhance structural uniformity in predictions and simultaneously optimize both lighting encoding and decoding in a holistic manner, we introduce the SG consistency loss. This loss aligns SG maps across different resolutions via a spherical SG downsampling technique. It computes a loss between the downsampled SG map and the real SG map at the current resolution, serving as a regularization term for model optimization.
The SG map downsampling is performed by searching for neighboring anchor points on the spherical surface between two adjacent SG map resolutions. We define the downsampling procedure as follows:

A_{i,dn} = max{ A_{j,rup} : ||d_{j,rup} − d_{i,rc}|| < R },    (3)

where the downsampled value for anchor point i at the current resolution, A_{i,dn}, is determined by selecting the maximum intensity value among all neighboring anchor points j at the higher resolution located within a radian range R of it. In our configuration, we set the resolutions to N = [8, 16, 32, 64, 128], and R is adjusted to [0.65, 0.4, 0.3, 0.3] accordingly to identify these neighboring points. The downsampling is applied dynamically to the predicted SG parameters during training to compute the consistency loss. Figure 3 provides an example of SG downsampling between different SG resolutions and comparisons with the actual SG maps. The SG consistency loss is calculated by applying both the Earth Mover's distance and the L2 norm to the predicted SG A_pd, the ground truth A_gt, and the downsampled SG A_dn, as outlined below:

L_sgc = Σ_n [ EM(A^n_dn, A^n_gt) + ||A^n_dn − A^n_gt||_2 ],    (4)

where A^n_dn is downsampled from the predicted map at the next-higher resolution. Here, EM(·) denotes the spherical Earth Mover's distance [31]. It measures the minimum amount of probability mass required to move points from one distribution to another, based on a cost matrix T_{ij} determined by the predefined anchor positions and their radian distances along the sphere, Dist_{ij} [27]. Unlike L2 or cross-entropy metrics, the Earth Mover's distance effectively leverages the spatial information between distributed points when assessing the dissimilarity between two distributions.
Multiple Resolution Loss: To individually supervise the generation of each SG resolution, we propose the following multiple-resolution loss function:

L_mr = Σ_n (α_1 L^n_em + α_2 L^n_l2 + α_3 L^n_log) + α_4 L_IG + α_5 L_rgb,    (5)

For each resolution level n, L_em, L_l2, and L_log are the loss terms that control the intensity distribution of SG resolution n. L_log is a log-transformed version of the root mean square error of the intensity distribution, designed to mitigate extreme values [31]. L_em is the Earth Mover's loss. L_IG and L_rgb are losses related to the global intensity and RGB ratios. The weights α_1 to α_5 are empirically set to [10^3, 10^3, 10^-6, 10^2, 10^-1]. Additionally, for the high-resolution SG map prediction with N = 128, we incorporate the render loss proposed by [31], which proves beneficial for modeling high-frequency lighting features.
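The max-over-neighbors downsampling between two adjacent SG resolutions can be sketched as below. This illustration assumes the angular distance between anchor directions as the radian-range test and the channel sum as the intensity used to pick the maximum; both are assumptions of the sketch, and the function name is our own.

```python
import numpy as np

def downsample_sg(colors_up, dirs_up, dirs_coarse, radius):
    """For each coarse anchor, take the color of the maximum-intensity anchor
    among the higher-resolution anchors within a radian distance 'radius'."""
    out = np.zeros((len(dirs_coarse), 3))
    # angular distance between every coarse/fine anchor direction pair
    ang = np.arccos(np.clip(dirs_coarse @ dirs_up.T, -1.0, 1.0))
    for i, row in enumerate(ang):
        nbrs = np.where(row < radius)[0]
        if len(nbrs):
            # channel sum serves as the scalar intensity of an RGB anchor
            out[i] = colors_up[nbrs[np.argmax(colors_up[nbrs].sum(-1))]]
    return out
```

Taking the maximum (rather than the mean) over the neighborhood keeps a bright light source visible at the coarse resolution even when it covers only a few high-resolution anchors.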
During training, the multiple-resolution loss is first employed to supervise each individual branch jointly. Once the model's training reaches a relatively stable state, the SG consistency loss across different resolutions is applied to further refine the model holistically.

Experiments
We evaluate our proposed method on the Laval Indoor HDR Dataset [9]. Each panoramic image in the dataset is cropped into eight images and tone-mapped into standard partial-view images as our inputs. To account for the spatially varying indoor lighting at a fixed position, we follow the approach outlined in [4,35,9,31]. This involves applying spatial warping and re-centralization operations to the panoramic images to produce the final ground truth. This transformation adjusts the lighting representation to better reflect the illumination conditions at the position where the object would be composited, rather than relying solely on the 360° camera's position, which could be situated at a varying distance from the composition place. Our training dataset consists of 1200 randomly selected scenes, providing a total of 1200 × 8 training pairs. The remaining 512 scenes are used for testing. The input crop size is set to (128, 128), and SG parameters at different resolution levels are extracted from the HDR environment map with dimensions (256, 128).
We conduct a comprehensive evaluation, encompassing both qualitative and quantitative assessments, to thoroughly assess the performance of SG predictions and their impact on environment map generation. Our evaluation includes comparisons with state-of-the-art techniques and in-depth ablation studies. For a quantitative assessment of SG parameter predictions, we utilize the Root Mean Square Error (RMSE), the ranked matching error (RME) [31], and the L2 error of the global intensity and RGB ratio. To measure the quality of environment map generation, we employ well-established metrics such as RMSE, scale-invariant RMSE (si-RMSE), and the angular lighting error.
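For reference, one common formulation of the si-RMSE metric applies the least-squares optimal scalar to the prediction before computing the RMSE, making the score invariant to a global intensity scale. This is a sketch under that assumption; the exact formulation used in prior works may differ (e.g., operating in log space).

```python
import numpy as np

def rmse(pred, gt):
    """Plain root mean square error between two arrays."""
    return np.sqrt(np.mean((pred - gt) ** 2))

def si_rmse(pred, gt):
    """Scale-invariant RMSE: apply the least-squares optimal scalar
    alpha = <pred, gt> / <pred, pred> to the prediction, then take RMSE."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return rmse(alpha * pred, gt)
```

For example, a prediction that is exactly twice the ground truth has a nonzero RMSE but a zero si-RMSE, isolating errors in lighting structure from errors in overall exposure.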

Comparisons
To illustrate the impact of our improved SG predictions on environment map generation and rendering, we utilize the neural projector model proposed by Xu et al. [Xu22] [31] for environment map generation. This model is distinguished among GAN-based neural projectors for its capability to generate high-frequency environments, attributed to its integration of both high-frequency SG and low-frequency SH as lighting priors. We conduct comparisons by generating environment maps based on our SG predictions and contrasting them with state-of-the-art approaches, including those introduced by Gardner et al. Figure 4 displays the predicted SG maps (N=128) generated by SGformer alongside the synthesized environment maps. These environment maps are produced using the predicted SG maps as a prior input to the GAN-based environment map neural projector. Our results indicate that SGformer can generate authentic SG maps that are close to the ground truth. These predictions play a dominant role in shaping the final environment map synthesis, particularly in influencing the lighting structure under the SPADE paradigm within the neural projector module. Precise SG map priors yield high-fidelity environment maps featuring realistic lighting representations.
In comparative evaluation against various environment map estimation methods, our approach excels in both qualitative and quantitative evaluations. Figure 5 presents the visual results, showcasing both the generated environment maps and their rendering effects on a virtual object (a virtual ball inserted into the input crop, with roughness levels of 0.2 for rows 2 and 4 and 0.5 for rows 6 and 8, rendered using the predicted environment map in each case; additional environment map synthesis results can be found in the supplementary material). It is observed that our method outperforms competing approaches in several aspects, including lighting distribution, color tones, and intensity variations. While the methods proposed by [Gardner17], [LeGendre19], [Chalmers20], and [Zhao21] have gradually improved the approximation of lighting distributions, their generated environment maps often lack detail and clarity when rendering objects with low-roughness materials.
The introduction of a GAN loss, as seen in [Zhan21] and [Xu22], enhances the fidelity of generated environment maps. However, these methods still struggle with accurate lighting distribution due to limitations in their feature encoding and lighting decoding capabilities. Artifacts are apparent in the specular and texture reflections on composite object surfaces with roughness = 0.2 (Figure 5, rows 2 and 4, columns 5 and 6). Moreover, discrepancies are observed in the color tones, intensity, and highlight areas on object surfaces with roughness = 0.5 (Figure 5, rows 6 and 8, columns 5 and 6). Our results demonstrate a distinct advantage in lighting predictions, yielding the most authentic environment maps and realistic object rendering effects. The overall structure of the environment map is improved, and virtual objects are seamlessly integrated into the scene, with preferable detail and accurate highlight distribution.
These findings are consistent with the quantitative outcomes presented in Table 2, where we compile the average RMSE, si-RMSE, and angular error metrics [8] in comparison to prior works. The results indicate the advancements achieved by our approach in terms of both image quality and lighting direction within the generated environment maps. Our method achieves lower RMSE, si-RMSE, and angular error values associated with the main light source compared with others. It is worth noting that our method and [Xu22] utilize the same environment map generator, but our unique strength lies in the ability to produce more accurate environmental details and precise lighting directions. These improvements are primarily attributed to the advanced SG predictions generated by SGformer, which are essential in the lighting prediction paradigm.

Ablation Study
To understand the individual contributions of each component within SGformer towards SG predictions, we conducted an ablation study encompassing the encoder variants and the decoder structure alongside the consistency loss design. Both quantitative and qualitative evaluations were performed. The quantitative assessments involve RMSE, L1 error, and the Ranked Matching Error (RME) [31]. Additionally, we analyze the L2 shift in the intensity and RGB ratio values within the SG parameters. It is worth noting that, in contrast to the approach taken by Xu et al. [31], our RME calculations focus specifically on the first half of the sorted anchor points. This adjustment places more emphasis on evaluating the primary lighting aspects.

Encoder Variants
To investigate the impact of different encoders on the extraction of lighting features and their subsequent influence on lighting predictions, we conducted an ablation study comparing the performance of DenseNet and DETR against the Conformer when used as the encoder. For consistency, we employed the same DETR decoder with anchor points set at N = 128 for all encoder variants to generate the SG parameters. The DenseNet encoder is CNN-based and served as the primary feature encoder in previous works by Zhan et al. [33,32]. The DETR encoder is a hybrid of CNN and Transformer, introduced by Xu et al. [31] in their deep learning architecture for lighting prediction. In this configuration, a specially designed CNN module, DLA-SK, is used for local feature extraction, while a Transformer module handles global feature extraction. However, this concatenated structure overlooks the nuanced interplay between local and global features. In contrast, the Conformer adopts a concurrent structure that enables a more interactive combination of local lighting features from the CNN branch and long-context global lighting features from the Transformer branch, better addressing the conflict and mutual enhancement between these features. Both the output of the CNN branch and that of the Transformer branch can be used as the lighting features.
Here we used CNN output and fed them into the following transformer-based decoder to estimate the SG parameter.
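The SG parameters regressed by the decoder can be rasterised into an environment (SG) map using the standard Spherical Gaussian evaluation, L(v) = Σ_k c_k · exp(λ_k (v · ξ_k − 1)). The sketch below illustrates this with assumed parameter names (axis `xi`, sharpness `lam`, RGB amplitude `color`); it is a generic SG rasteriser, not the paper's exact rendering code.

```python
import numpy as np

def render_sg_map(xi, lam, color, height=64, width=128):
    """Rasterise a set of Spherical Gaussians into an equirectangular map.

    Each SG component k contributes color[k] * exp(lam[k] * (v . xi[k] - 1))
    at every direction v on the sphere: xi holds unit axis directions,
    lam the sharpness (bandwidth) parameters, color the RGB amplitudes.
    """
    # Unit direction for every pixel of the equirectangular grid.
    theta = (np.arange(height) + 0.5) / height * np.pi       # polar angle
    phi = (np.arange(width) + 0.5) / width * 2 * np.pi       # azimuth
    t, p = np.meshgrid(theta, phi, indexing="ij")
    v = np.stack([np.sin(t) * np.cos(p),
                  np.sin(t) * np.sin(p),
                  np.cos(t)], axis=-1)                       # (H, W, 3)
    env = np.zeros((height, width, 3))
    for k in range(len(lam)):
        env += color[k] * np.exp(lam[k] * (v @ xi[k] - 1.0))[..., None]
    return env
```

For a single SG aimed at the +z pole, the rendered lobe concentrates in the top rows of the map and falls off exponentially with angular distance, which is what makes the representation compact yet expressive for dominant indoor light sources.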
Figure 6 visually illustrates the significant advantages of the Conformer over the DETR encoder in inferring lighting cues and generating highly accurate lighting tones, as exemplified in rows 2 and 3, where the expected bluish lighting predictions can be observed in the Conformer results. This distinction becomes even more evident in challenging scenarios, such as row 4, where only minimal lighting cues (the faint lighting reflections on the door and wall) are present in the input image. Quantitative results further validate the Conformer's effectiveness (Table 3), as it achieves consistent improvements across all metrics. This indicates the essential role of the encoder in deducing lighting cues and underscores the Conformer's capability to capture both local and global lighting features while enhancing their synergy for superior lighting predictions.

Regression on Multi-Resolution SG
To progressively investigate the impact of each key component of SGformer, including the encoder, the decoder structure, and the loss function design, we conducted ablation studies using models trained under three distinct setups: 1) training our backbone network with the Conformer encoder and a single decoder separately for each SG resolution (referred to as "Separate" in Figure 7, 4th row); 2) training the network with the Conformer encoder and multiple decoders, as illustrated in Figure 2 (referred to as "Multihead" in Figure 7, 3rd row); and 3) training with the Conformer encoder and multiple decoders, augmented by our proposed consistency loss (referred to as "Ours" in Figure 7, 2nd row). These setups were evaluated across a spectrum of SG resolutions, with anchor points N ranging from 8 to 128. As illustrated in Figure 7, the results obtained from our full model exhibit the closest alignment of anchor distributions with the ground truth across all SG resolutions (as seen in the "Ours" rows). In contrast, the separately trained model tends to exhibit random shifting or fading of anchors across different SG resolutions due to the isolation of their learning. Additionally, it struggles to precisely predict lighting shapes and locate lighting sources in high-resolution SG prediction (observed in the "Separate" row, "N=128" column). The incorporation of a multi-head decoder, which shares a common encoder and facilitates joint learning across the diverse SG regression tasks, enhances the consistency of anchor distributions across SG resolutions. The introduction of the consistency loss further regulates the variations in anchor distributions spanning the spectrum of SG resolutions, resulting in improved lighting shapes (as seen in the "Ours" row, "N=128" column) and spatial distributions. We assessed the mean similarity between the estimated SG maps and their corresponding ground truth using the Structural Similarity Index (SSIM) metric; our estimated SG maps exhibit the highest similarity to the ground truth among all methods.
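One plausible form of such a cross-resolution consistency term, in the spirit of the downsampling comparison of Figure 3, is to pool each finer SG map down to the next coarser grid and penalise the discrepancy. The sketch below is an assumed formulation for illustration, not the paper's exact loss; the block-average downsampling and L1 penalty are our assumptions.

```python
import numpy as np

def downsample_map(m, h, w):
    """Block-average an (H, W, 3) map down to (h, w, 3); assumes H % h == 0, W % w == 0."""
    H, W, _ = m.shape
    return m.reshape(h, H // h, w, W // w, 3).mean(axis=(1, 3))

def sg_consistency_loss(sg_maps):
    """Cross-resolution consistency over SG maps ordered coarse to fine.

    Each finer map is block-averaged down to the next coarser grid and
    compared with an L1 penalty, discouraging stochastic drift between
    the independent per-resolution SG regressions.
    """
    loss = 0.0
    for coarse, fine in zip(sg_maps[:-1], sg_maps[1:]):
        down = downsample_map(fine, *coarse.shape[:2])
        loss += float(np.abs(down - coarse).mean())
    return loss
```

When adjacent resolutions agree (the fine map pools exactly onto the coarse one), the term vanishes; any anchor that shifts or fades at one resolution but not its neighbours is penalised.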
The quantitative results are provided in Table 4, encompassing various metrics including RMSE, RME, and the L2 shift of intensity and RGB ratio values. The baseline network, introduced by Xu et al. [31] with a DETR encoder, is trained across the different SG resolutions and serves as the reference point. The outcomes illustrate that introducing the Conformer encoder yields notable improvements over the baseline, particularly on the RMSE, RME, and global intensity metrics. Integrating multi-head decoding and the consistency loss further enhances the model's capabilities, resulting in consistent improvements across all metrics. This enhancement is particularly evident in the reduction of RME, indicating better-optimized main lighting source distributions.
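For completeness, a minimal sketch of how the intensity and RGB-ratio L2 shifts could be computed from per-anchor SG colour amplitudes. The definitions here (intensity as the mean channel magnitude, ratio as colour normalised by intensity) are our assumptions for illustration.

```python
import numpy as np

def intensity_rgb_shift(pred_rgb, gt_rgb):
    """L2 shift of intensity and RGB ratio between two SG parameter sets.

    pred_rgb, gt_rgb: (N, 3) per-anchor RGB amplitudes. Intensity is the
    per-anchor mean over channels; the RGB ratio normalises each anchor's
    colour by its own intensity, isolating colour tone from brightness.
    """
    eps = 1e-8
    pred_i = pred_rgb.mean(axis=1, keepdims=True)
    gt_i = gt_rgb.mean(axis=1, keepdims=True)
    intensity_shift = float(np.linalg.norm(pred_i - gt_i))
    ratio_shift = float(np.linalg.norm(pred_rgb / (pred_i + eps)
                                       - gt_rgb / (gt_i + eps)))
    return intensity_shift, ratio_shift
```

Separating the two shifts lets the evaluation distinguish a prediction that is too bright overall from one whose colour tone is off.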

Discussions
Our proposed method estimates environmental lighting from a standard image, avoiding the need to capture panoramic images directly with expensive devices to obtain global illumination. While our supervised deep-learning method requires ground truth data during the training stage, once the model is trained, it can be applied to any arbitrary standard image captured by lightweight cameras, such as mobile phone cameras. The ground-truth training data in the Laval Dataset is publicly available, and we make full use of it in a one-off manner, consistent with the approach taken in most machine/deep learning developments.

Conclusions and future work
In summary, this paper introduced an advanced transformer-based approach that enhances indoor lighting estimation from single standard images. Our novel network architecture combines a Conformer model for global and local lighting feature extraction with a multi-resolution transformer-based decoder for simultaneous SG parameter prediction across various resolutions. We are the first to explore the interplay of spatial distributions across multiple SG resolutions and to utilize it to enhance the spatial distribution of lighting sources. To improve lighting structure modeling, we introduced an SG consistency loss designed to ensure consistency in spatial distributions across different SG resolutions. Our comprehensive experiments have demonstrated significant improvements in lighting estimation, enhancing predictions of lighting source shapes, color tones, and lighting directions. As a powerful tool, SGformer effectively enhances the realism of environment map estimation, providing precise guidance for highly realistic environment map synthesis and enabling seamless object rendering. Looking ahead, our future research will focus on methods that gain a deeper understanding of the visual context and semantics within scenes, with the goal of achieving even more accurate and context-aware illumination predictions.

Figure 1. Illustrations of scenes and their SG maps at various resolutions.

Figure 2. The overall architecture of SGformer takes a standard LDR image as input and produces SG parameters at various resolutions as output. It comprises a Conformer encoder and a multi-head transformer decoder. Additionally, we introduce a novel SG consistency loss to enhance the regularization of lighting structure learning.
to enhance local and global feature extraction at various resolutions in an interactive manner. It comprises both a CNN branch and a transformer branch, with each layer featuring a specially designed feature coupling unit (FCU) to manage conflicts and complementarities among lighting features at different levels. When combined with our multi-head decoder design, the Conformer encoder's feature extraction capabilities are bolstered by considering data from different SG resolutions. Multi-head Transformer-Based Decoder: We propose a new multi-head decoder based on the aforementioned transformer-based DETR

Figure 3. Comparison between downsampled SG maps from higher resolutions (top row) and the corresponding ground-truth SG maps (bottom row). The top-left corner shows the corresponding environment map, while the bottom-left corner displays the ground-truth SG map with the highest resolution of N=128.

Figure 4. Visualization of SG maps generated by SGformer and the corresponding synthesized environment maps guided by them through a GAN-based environment map neural projector.

Figure 5. Visual comparison of environment map generation and the corresponding rendering results alongside state-of-the-art methods. Four examples are presented; in each, the upper part displays the generated environment map and the lower part shows the input crop containing an inserted virtual ball (with roughness levels of 0.2 for rows 2 and 4, and 0.5 for rows 6 and 8) rendered using the predicted environment map. Additional environment map synthesis results can be found in the supplementary material.

Figure 6. Comparison of SG maps with anchor number N = 128 using the DLA-SK and Conformer encoders. The same transformer-based decoder is employed for testing.

Figure 7. An ablation study on SG regression conducted across various resolutions. Two exemplars are provided; in each case, the top row exhibits the ground-truth SGs and the second row presents our results. The third row showcases the results of the "Multihead" model, and the fourth row displays the results of the "Separate" model. For additional results, please refer to the supplementary material.

Table 1. The intensity and angular errors of various SG resolutions in representing the primary lighting sources. The table lists the top 5 lighting sources, ranked by intensity.

Table 2. Quantitative comparisons of the quality of estimated environment maps, as evaluated through the RMSE, si-RMSE, and angular error metrics.

Table 3. Comparison of SG predictions using different encoders.

Table 4. Ablation study on SG regression across different SG resolutions.