HOLISMOKES -- IX. Neural network inference of strong-lens parameters and uncertainties from ground-based images

Modeling of strong gravitational lenses is a necessary step for many applications in astrophysics and cosmology. Especially given the large number of detections in current and upcoming surveys such as the Rubin Legacy Survey of Space and Time (LSST), it is timely to investigate automated and fast analysis techniques beyond the traditional and time-consuming Markov chain Monte Carlo sampling methods. Building upon our convolutional neural network (CNN) presented in Schuldt et al. (2021b), we present here another CNN, specifically a residual neural network (ResNet), that predicts the five mass parameters of a singular isothermal ellipsoid (SIE) profile (lens center $x$ and $y$, ellipticity $e_x$ and $e_y$, Einstein radius $\theta_E$) and the external shear ($\gamma_{ext,1}$, $\gamma_{ext,2}$) from ground-based imaging data. In contrast to our CNN, this ResNet further predicts a 1$\sigma$ uncertainty for each parameter. To train our network, we use our improved pipeline from Schuldt et al. (2021b) to simulate lens images using real images of galaxies from the Hyper Suprime-Cam Survey (HSC) and from the Hubble Ultra Deep Field as lens galaxies and background sources, respectively. We find overall very good recoveries of the SIE parameters, while differences remain in predicting the external shear. From our tests, the low image resolution is most likely the limiting factor for predicting the external shear. Given the run time of milliseconds per system, our network is ideally suited to predict, in time, the next appearing image and the time delays of lensed transients. Therefore, we also present the performance of the network on these quantities in comparison to our simulations. Our ResNet is able to predict the SIE and shear parameter values in fractions of a second on a single CPU, allowing us to efficiently process the large number of galaxy-scale lenses expected in the near future.


Introduction
Gravitational lensing has become a very powerful tool in astrophysics, especially in combination with others, such as lens velocity dispersion measurements (e.g., Barnabè et al. 2011, 2012; Yıldırım et al. 2020) and galaxy rotation curves (e.g., Strigari 2013; Hashim et al. 2014), which both help to probe the mass structure of galaxies. In particular, gravitational lensing allows us to measure the total mass of the lens (Dye & Warren 2005; Treu 2010) and thus to study the fraction and distribution of dark matter (DM; e.g., Schuldt et al. 2019; Baes & Camps 2021; Shajib et al. 2021; Wang et al. 2022), as well as its nature (e.g., Basak et al. 2022; Gilman et al. 2021). Moreover, one can use lensing to study high-redshift sources thanks to the lensing magnification (e.g., Lemon et al. 2018; McGreer et al. 2018; Rubin et al. 2018; Salmon et al. 2018; Shu et al. 2018) by reconstructing the surface brightness of the source, which can be used to reveal information about the evolution of galaxies at higher redshifts (e.g., Warren & Dye 2003; Suyu et al. 2006; Nightingale et al. 2018; Rizzo et al. 2018; Chirivì et al. 2020). The network code is available at https://github.com/shsuyu/HOLISMOKES-public/tree/main/HOLISMOKES_IX.
In special cases where a variable source, such as a quasar (e.g., Lemon et al. 2017, 2018, 2019; Ducourant et al. 2019; Khramtsov et al. 2019; Chan et al. 2020; Chao et al. 2020, 2021) or a supernova (SN; Kelly et al. 2015; Goobar et al. 2017; Rodney et al. 2021), is present as a background object, one can measure the time delays (e.g., Millon et al. 2020a,b; Huber et al. 2022) and then constrain the Hubble constant $H_0$ (e.g., Refsdal 1964; Chen et al. 2019; Rusu et al. 2020; Wong et al. 2020; Shajib et al. 2020, 2022), helping to clarify the current discrepancy between the cosmic microwave background (CMB) measurements (Planck Collaboration et al. 2020) and measurements using the local distance ladder (e.g., the SH0ES project, Riess et al. 2019, 2021). Furthermore, if one detects the first image of a lensed SN and predicts the location and time of the next appearing image(s), the SN can be observed at earlier phases than without lensing, which would help to shed light on open questions regarding the SN progenitor system(s).
For this reason, and also because such lensing systems with time-variable background sources are very rare, great effort has been made in recent years to run several large and dedicated surveys, such as the Sloan Lens ACS (SLACS) survey (Bolton et al. 2006; Shu et al. 2017), the CFHTLS Strong Lensing Legacy Survey (SL2S; Cabanac et al. 2007; Sonnenfeld et al. 2015), the Sloan WFC Edge-on Late-type Lens Survey (SWELLS; Treu et al. 2011), the BOSS Emission-Line Lens Survey (BELLS; Brownstein et al. 2012; Shu et al. 2016; Cornachione et al. 2018), and the Survey of Gravitationally-lensed Objects in HSC Imaging (SuGOHI; Sonnenfeld et al. 2018; Wong et al. 2018; Chan et al. 2020; Jaelani et al. 2020a). In addition, several teams have used the imaging data from large surveys, such as the Dark Energy Survey (DES; e.g., Jacobs et al. 2019; Rojas et al. 2022), the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; e.g., Lemon et al. 2018; Cañameras et al. 2020), the Kilo Degree Survey (KiDS; e.g., Petrillo et al. 2017, 2019; Li et al. 2020, 2021), and the surveys with the Hyper Suprime-Cam (HSC; e.g., Cañameras et al. 2021; Shu et al. 2022; Jaelani et al. in prep.), to identify additional lens systems. In total, a few hundred spectroscopically confirmed lenses and many more promising lens candidates have been found so far, but the sample of lens candidates is expected to grow by a factor of around 20 (Collett 2015) with upcoming surveys, such as the Rubin Observatory Legacy Survey of Space and Time (LSST, Ivezic et al. 2008), which will observe around 20,000 deg² of the southern hemisphere in six different filters (u, g, r, i, z, y), or the Euclid imaging survey operated by the European Space Agency (ESA; Laureijs et al. 2011).
After detecting the lens candidates, a model describing their total mass distribution is required for nearly all applications; this also helps to reject some false-positive candidates (Marshall et al. 2009; Sonnenfeld et al. 2013; Chan et al. 2015; Taubenberger et al. in prep.). As the sample of known lenses is increasing rapidly, current Markov chain Monte Carlo (MCMC) based techniques (e.g., Jullo et al. 2007; Suyu & Halkola 2010; Sciortino et al. 2020; Fowlie et al. 2020) are no longer sufficient to model all of them, because the MCMC sampling is very time consuming and resource dependent, and requires a lot of human input. Hezaveh et al. (2017) therefore proposed and demonstrated the feasibility of using convolutional neural networks (CNNs) to predict the mass model parameter values from high-resolution, pre-processed, lens-light-subtracted images. This was further improved and explored by the same team (Morningstar et al. 2018, 2019), and Pearson et al. (2019, 2021) also presented a CNN for modeling high-resolution lens images. For the LSST Dark Energy Science Collaboration, Wagner-Carena et al. (2021) developed so-called Bayesian neural networks to model HST-like lenses after lens light subtraction, in analogy to Hezaveh et al. (2017), while in their follow-up work they extended their approach to HST-like images without prior lens light subtraction. To complement this, as part of our ongoing Highly Optimized Lensing Investigations of Supernovae, Microlensing Objects, and Kinematics of Ellipticals and Spirals (HOLISMOKES; Suyu et al. 2020) programme, we presented in Schuldt et al. (2021b, hereafter S21b) a CNN designed to model strongly lensed galaxy images. This CNN predicts the five singular isothermal ellipsoid (SIE) mass model parameter values (lens center x and y, complex ellipticity e_x and e_y, and the Einstein radius θ_E) for the lens galaxy mass distribution using ground-based HSC images.
In the present work, which builds upon S21b, we adopt an SIE profile plus external shear component and include an uncertainty prediction for each parameter, which has already been demonstrated on high-resolution HST-like images (e.g., Park et al. 2021; Wagner-Carena et al. 2021). As this task is much more complex than that in S21b, it requires a deeper network to cope with the small distortions of the external shear, such that we now rely on residual blocks in the network architecture (ResNet, He et al. 2016). Moreover, we improved our simulation pipeline, which uses real observed galaxy images as well as corresponding redshift and velocity measurements (S21b), to obtain even more realistic training data. In detail, we add Poisson noise to the simulated arcs, add the external shear component, and improve the automated computation of the first and second brightness moments to obtain the lens light center and ellipticity. Our mock images of lenses also contain other galaxies in the image cutout, allowing our ResNet to cope with realistic lens systems that often have other nearby objects along the line of sight. While we use only simulated data to test the network performance in this work, we apply this network to a smaller sample of 31 real HSC lenses in a companion paper and compare the obtained models to the traditional ones obtained through MCMC sampling. The good agreement found there shows that our mock images are realistic and that our demonstrated network performance is trustworthy.
As one of the main advantages of our ResNet is its computational speed, it is well suited to predicting the lens mass model parameters of a lensing system with a transient host galaxy as the background source; these parameters can then be used to predict the next appearing image(s) and time delay(s) of the transient. Therefore, in analogy to S21b, we compare the predicted image positions and time delays using (1) the ground-truth mass model from our simulation pipeline and (2) the predicted model from the network.
The outline of the paper is as follows. We summarize in Sect. 2 the simulation of our training data and changes compared to S21b. In Sect. 3, we describe our network architectures and in Sect. 4 we present our results. The dedicated tests for the network are summarized in Sect. 5. Our comparisons of the image positions and time delays are reported in Sect. 6, before we present our conclusions in Sect. 7.

Simulation of mock lenses
Training a supervised neural network requires a good-quality, representative, and sufficiently large sample of input data together with the corresponding output, the so-called ground truth. For our purpose, we require a sample containing around 100,000 lens systems, each in four filters, which means that we must rely on simulations given that the number of real known lenses is below 10⁴ and the corresponding mass models need to be known. In order for our network to achieve optimum performance on real data, it is important that the mock images in our training sample are as realistic and as representative of real data as possible. We therefore use real observed images of galaxies as lenses and as background sources, instead of producing a completely mock sample as is often done (e.g., Hezaveh et al. 2017; Perreault Levasseur et al. 2017; Pearson et al. 2019, 2021). We improve on the pipeline described in S21b for simulating images of mock lenses and briefly summarize here the procedure, highlighting the new features we adopt in this work. A diagram of the pipeline is shown in Fig. 1. As lenses, we again use the HSC images (Aihara et al. 2019) of luminous red galaxies (LRGs) and available spectroscopic redshifts and velocity dispersion measurements from the Sloan Digital Sky Survey DR14 (SDSS, Abolfathi et al. 2018). In addition to our criteria in S21b, we now set a lower limit of 100 km s⁻¹ for the velocity dispersion, as very low-mass galaxies do not yield a strong lensing configuration with spatially resolved multiple images (compare Fig. 1 in S21b). This speeds up the selection process of suitable lens-source pairs. As background sources, we again use images from the Hubble Ultra Deep Field (HUDF; Beckwith et al. 2006; Inami et al. 2017). Each background galaxy is then lensed with the software Glee (Suyu & Halkola 2010; Suyu et al. 2012), convolved with the (subsampled) HSC PSF, and binned to the HSC pixel size of 0.168″.
The obtained arcs are then added to the lens image. With this procedure, we include real line-of-sight objects and light distributions, as shown in Fig. 2.
To calculate the deflection angles, we assume a singular isothermal ellipsoid (SIE) profile and, in contrast to S21b, an external shear, described in complex notation by γ_ext,1 and γ_ext,2, to account for additional mass outside of the cutout. The convergence (also called dimensionless surface mass density) of the adopted SIE profile (Barkana 1998) can be expressed as

κ(x, y) = θ_E / (2 r),   (1)

with the elliptical radius

r = √(q x² + y²/q)   (2)

and the axis ratio

q = √((1 − e)/(1 + e)),  with e = √(e_x² + e_y²),   (3)

where

e_x = (1 − q²)/(1 + q²) cos(2φ)   (4)

and

e_y = (1 − q²)/(1 + q²) sin(2φ).   (5)

The simulation pipeline will be made available upon request, as will the required nonpublic software Glee (Suyu & Halkola 2010; Suyu et al. 2012). The SIE mass profile introduced by Barkana (1998) allows for an additional core radius, which we set to 10⁻⁴, yielding effectively a singular mass distribution without numerical issues at the lens center.
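The relations between axis ratio, position angle, and complex ellipticity above can be made concrete in a few lines of Python (a minimal sketch with illustrative function names; the actual pipeline computes deflections with the Glee software):

```python
import math

def ellipticity_from_q_phi(q, phi):
    """Complex ellipticity (e_x, e_y) from axis ratio q and position angle phi [rad],
    using e = (1 - q^2) / (1 + q^2) decomposed along the angle 2*phi."""
    e = (1.0 - q**2) / (1.0 + q**2)
    return e * math.cos(2.0 * phi), e * math.sin(2.0 * phi)

def q_phi_from_ellipticity(ex, ey):
    """Invert back to (q, phi): q = sqrt((1 - e)/(1 + e)), phi = 0.5 * atan2(ey, ex)."""
    e = math.hypot(ex, ey)
    q = math.sqrt((1.0 - e) / (1.0 + e))
    phi = 0.5 * math.atan2(ey, ex)
    return q, phi

def sie_convergence(x, y, theta_e, q):
    """SIE convergence kappa = theta_E / (2 r), with elliptical radius
    r = sqrt(q x^2 + y^2 / q)."""
    r = math.sqrt(q * x**2 + y**2 / q)
    return theta_e / (2.0 * r)
```

The two ellipticity conversions are exact inverses of each other, which is what makes (e_x, e_y) a convenient, bounded network output.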
Given our results in S21b, we again use a flat distribution of the Einstein radii up to 2″ in order to allow the network to better learn the full parameter space. In analogy, we use a flat distribution for γ_ext in the range between 0 and 0.1, but also consider a realistic distribution (Faure et al. 2011; Wong et al. 2011). Moreover, we improve our simulation code by including Poisson noise for the arcs, approximated as

σ_arc = √(α I_arc),   (6)

where I_arc is the lensed source image (arcs) and α the Poisson scaling factor. To compute the scaling factor α, we start from the HSC variance map with value v_i corresponding to the i-th lens image pixel and define

σ_i = √(v_i).   (7)

We then compute the lens background noise σ_bkgr, defined as the minimal root-mean-square (rms) of the four corners of the lens image, to exclude contributions from line-of-sight objects. With those two quantities, we then approximate the Poisson noise map of the lens as

σ_lens,i = √(σ_i² − σ_bkgr²) = √(α_i I_lens,i)   (8)

in order to obtain the Poisson scaling factor map

α_i = (σ_i² − σ_bkgr²) / I_lens,i,   (9)

where I_lens is the lens image from HSC. As this scaling factor α should be a constant, we approximate it as the median of the 10 × 10 pixel central region, because the lens intensity is highest in this region and the mapping is therefore more precise than in the outer parts. From this map, we draw Gaussian variations representing the additional Poisson noise, which are added on top of the simulated images.

Additionally, we improve the lens centering and ellipticity estimation in the simulation code. For this, the code now accepts a mask of the lens, obtained for example with Source Extractor (Bertin & Arnouts 1996), resulting in a more accurate masking of the lens. As that mask might exclude parts of the lens due to overlapping line-of-sight objects, we additionally apply a circular mask with a radius of 20 pixels centered at the image center when determining the lens center. From the resulting masked image, we estimate the lens center through the first brightness moments and re-center the image on the pixel closest to the lens center.
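The scaling-factor estimate and the arc-noise draw described above can be sketched as follows (a NumPy illustration; the corner patch size and function names are our assumptions, not taken from the pipeline):

```python
import numpy as np

def poisson_scaling_factor(variance_map, lens_image, center=10):
    """Estimate the Poisson scaling factor alpha from the HSC variance map and
    lens image: sigma_i^2 = v_i, sigma_bkgr = minimal rms of the four corner
    patches, alpha_i = (sigma_i^2 - sigma_bkgr^2) / I_lens,i, with alpha taken
    as the median over the central 10x10 pixel region."""
    n = 8  # corner patch size in pixels (illustrative choice)
    corners = [lens_image[:n, :n], lens_image[:n, -n:],
               lens_image[-n:, :n], lens_image[-n:, -n:]]
    sigma_bkgr = min(np.sqrt(np.mean(c**2)) for c in corners)
    alpha_map = (variance_map - sigma_bkgr**2) / lens_image
    cy, cx = np.array(lens_image.shape) // 2
    h = center // 2
    return np.median(alpha_map[cy - h:cy + h, cx - h:cx + h])

def add_arc_poisson_noise(arc_image, alpha, rng=None):
    """Draw Gaussian variations with std sqrt(alpha * I_arc) to approximate the
    Poisson noise of the arcs, and add them to the arc image."""
    rng = rng or np.random.default_rng(0)
    sigma = np.sqrt(np.clip(alpha * arc_image, 0.0, None))
    return arc_image + rng.normal(0.0, 1.0, arc_image.shape) * sigma
```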
The remaining fractional pixel offset of the lens center is taken into account in the simulation as the lens light center. The axis ratio q_ll and position angle φ_ll of the lens light are determined through the second brightness moments using the provided mask.

As we now re-center the lens cutout, we shift the final mock image randomly by up to ±3 pixels in the x and y direction such that the lens light and mass center, which are different, are not coincident with the image cutout center. This allows the network to learn the lens center instead of the cutout center and ensures that the network can handle images that are not perfectly centered on the lens mass.
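The random offset can be sketched in a few lines (a simplified illustration; np.roll wraps pixels around the edges, whereas the pipeline would instead crop the shifted cutout from a larger stamp):

```python
import numpy as np

def random_offset_cutout(image, max_shift=3, rng=None):
    """Shift a mock cutout by up to +/- max_shift pixels in x and y, so the
    lens mass center no longer coincides with the cutout center."""
    rng = rng or np.random.default_rng()
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, (dy, dx), axis=(0, 1)), (dx, dy)
```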
Similar to S21b, we set some criteria on the brightness of the arcs to ensure visibility. Specifically, we require that the brightest pixel of the arcs is above 5σ_bkgr in either the g or i band, choosing the band in which the source is brightest. In the same filter, we further require that pixel to be a factor of 1.5 brighter than the lens image at the same position. This helps to avoid drastic blending with the lens. To help systems pass these criteria, we allow a slight boost of the source magnitude in up to six steps of −0.5 mag each. This does not affect the shape of the arcs, the lens, or the line-of-sight objects, but helps to obtain detectable lenses.
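A minimal sketch of these two visibility criteria (array names and layout are our illustrative choices):

```python
import numpy as np

def arcs_visible(arc_g, arc_i, lens_g, lens_i, sigma_bkgr_g, sigma_bkgr_i):
    """Check the two brightness criteria on the mock arcs:
    1) the brightest arc pixel exceeds 5 * sigma_bkgr in the band (g or i)
       where the source is brighter, and
    2) that same pixel is at least 1.5x brighter than the lens there."""
    if arc_g.max() >= arc_i.max():  # pick the band with the brighter source
        arc, lens, sigma = arc_g, lens_g, sigma_bkgr_g
    else:
        arc, lens, sigma = arc_i, lens_i, sigma_bkgr_i
    iy, ix = np.unravel_index(np.argmax(arc), arc.shape)
    return bool(arc[iy, ix] > 5.0 * sigma and arc[iy, ix] > 1.5 * lens[iy, ix])
```

A system failing the check would then be retried with the source boosted by −0.5 mag, up to six times.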
With this method, we generate around 165,000 mocks for training, validating, and testing the network, where we limit the Einstein radius to the range between 0.5″ and 5″, and additionally to 3,000 systems per 0.05″ bin to obtain a flat distribution at least up to θ_E ∼ 2″. As we use real measurements of the velocity dispersion and redshifts, lensing systems with an Einstein radius above ∼2.5″ are very rare, even when increasing the number of iterations for testing different lens-source alignments or lens-source pairs. As a result, the number of systems drops by two orders of magnitude towards θ_E ∼ 3″ compared to the plateau. Example images generated with our upgraded simulation pipeline are displayed in Fig. 2; the left panels show the final mock images and the right panels show the lensed sources alone, which were added to the HSC lens images. This figure demonstrates the variety of the mocks and how realistic they are.

Neural networks and their architecture
Convolutional neural networks are very powerful tools for image recognition tasks, especially when an autonomous and fast method is required to cope with huge numbers of images. This property has already led to many different applications in astrophysics (e.g., Paillassa et al. 2020; Tohill et al. 2021; Wu 2020; Schuldt et al. 2021a). As there are already thousands of known lens candidates in HSC (e.g., Wong et al. 2018; Sonnenfeld et al. 2019, 2020; Jaelani et al. 2020a,b; Jaelani et al. in prep.; Cañameras et al. 2021; Shu et al. 2022), and we expect hundreds of thousands more to be observed by LSST and Euclid (Collett 2015), CNNs are well suited to analyzing this amount of data in an acceptable amount of time.
While in S21b we used a CNN based on the LeNet (Lecun et al. 1998) architecture to predict the five SIE parameters, deeper networks help to capture the small features of the more complex mocks used in this work. Therefore, we now make use of residual neural networks (He et al. 2016), which are a specific type of CNN. In residual neural networks, the architecture includes so-called residual blocks of typically two convolutional (conv.) layers with a kernel size of 3×3, which are connected via a skip connection (or shortcut) for the backpropagation. Therefore, the back-propagating gradient does not vanish easily across the multiple convolutional layers, allowing the neurons and weights of the first layers to be properly updated even if the network architecture is very deep. A sketch of our network architecture is given in Fig. 3.

Error estimation and loss function
In analogy to S21b, we started with a network predicting one point estimate per parameter, which amounts to seven values: five for the SIE parameters plus two for the external shear. Here we again used a mean-square-error (MSE) loss function. We then modified the network such that it predicts two values per parameter, that is, 14 values in total. Providing not only a point estimate but rather a median-like value with a 1σ uncertainty makes the network much more powerful and helps to exclude insecure parameter estimations. From the 14 predicted values, we now interpret one value per parameter as the median (like previously the point estimate) and the other value as the standard deviation of a Gaussian function describing the error on that parameter value for that specific image. Here, we follow Perreault Levasseur et al. (2017) and use a loss L given by

L = (1/N) Σ_k Σ_l w_l ( −ln P_{k,l} + ε_l σ_{k,l}² ),   (10)

with a regularization term ε_l σ² to minimize the errors and a log-probability term that is defined in PyTorch for a Gaussian distribution as

ln P = −(η_pred − η_tr)² / (2σ²) − ln(σ √(2π)).   (11)

The loss for a given system is the sum over the p different parameters η = (x, y, e_x, e_y, θ_E, γ_1, γ_2) and corresponding errors σ = (σ_x, σ_y, σ_{e_x}, σ_{e_y}, σ_{θ_E}, σ_{γ_1}, σ_{γ_2}) with index l ∈ [1, p]. For the total loss L, we additionally sum over each image k in the batch or sample of size N. Similar to the weighting factors w_l introduced already in S21b and used to improve on the Einstein radius, we have also introduced weighting factors in our new loss function, as well as a regularization constant ε_l that can be different for each parameter. With these factors, we can control the contribution of the different parameters to the loss, and can therefore choose which ones to optimize more strongly. For our CNNs in S21b, we found it helpful to increase the contribution of the Einstein radius to the loss, as this is the key quantity in the SIE profile.
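A NumPy sketch of this weighted Gaussian negative-log-likelihood loss (the log-probability term mirrors what PyTorch's Normal(...).log_prob returns; function names are ours):

```python
import numpy as np

def gaussian_nll(eta_true, eta_pred, sigma):
    """-ln P for a Gaussian: 0.5*ln(2*pi*sigma^2) + (eta_true - eta_pred)^2 / (2*sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * sigma**2) \
        + (eta_true - eta_pred)**2 / (2.0 * sigma**2)

def total_loss(eta_true, eta_pred, sigma, w, eps):
    """Weighted sum over the p parameters of (-ln P + eps_l * sigma^2),
    averaged over the batch. Arrays eta_*, sigma have shape (N, p); w, eps (p,)."""
    per_param = w * (gaussian_nll(eta_true, eta_pred, sigma) + eps * sigma**2)
    return per_param.sum(axis=1).mean()
```

Minimizing the −ln P term alone would drive σ toward the actual scatter; the ε σ² term additionally penalizes inflated uncertainties.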
Although it remains the key parameter in an SIE+γ ext setup, we also tested the possibility of up-weighting the external shear, as these two parameters are the most problematic ones. However, we discarded this option from all further tests, as we find only a minor improvement on the external shear but notably lose performance on the other parameters.
In addition to the weighting factors, we modified the loss function by introducing the uncertainty prediction such that σ also contributes to the optimization of the weights and neurons during the training process. Here, we tested different possible regularization terms, such as using the absolute value of σ instead of the squared term, or even leaving out the additional regularization term completely. Moreover, we varied the regularization constant ε. As changing the loss function changes the loss value for a given network, we cannot compare the obtained loss values directly. Based on a more quantitative comparison, we found no notable difference in the performance on the median predictions η. Given that we use a scaling to match the expected 68.3% confidence intervals (CIs) instead of tuning dropout for this, we finally adopted the regularization function proposed by Perreault Levasseur et al. (2017), which uses a squared term and a regularization constant of 0.5.
For the predicted σ values to be interpretable as 1σ Gaussian widths, Perreault Levasseur et al. (2017) suggest tuning the network through the dropout rate (Hinton et al. 2012; Srivastava et al. 2014) such that the predicted errors match the expected CIs of 68.3% for 1σ, 95.4% for 2σ, and 99.7% for 3σ. Here, the bar notation indicates that the standard deviation is computed from the medians of a full sample, for example the test set, and does not correspond to the uncertainty σ predicted by the network for an individual lens system and parameter. This idea was also adopted by Pearson et al. (2021) for their CNNs. As there is only one dropout rate, both teams average over their three or four parameters when trying to match the expected percentile, such that the individual errors still differ from the expected CIs (compare Fig. 2 of Pearson et al. 2021). This would be even more difficult for us with seven parameters. Instead of using dropout with the same rate for both convolutional layers and fully connected (FC) layers, we test only the effect of dropout on the FC layers, as the dropout rate typically differs significantly between convolutional and FC layers (Hinton et al. 2012; Srivastava et al. 2014). For these reasons, we finally do not use dropout for the error scaling and instead incorporate a direct scaling of each parameter uncertainty such that the uncertainties match the CIs on the test set as closely as possible.
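The direct scaling can be illustrated as follows: for each parameter, a factor s is chosen such that the interval η ± s·σ covers 68.3% of the true values on the test set (a sketch of the idea, not the exact implementation):

```python
import numpy as np

def sigma_scale_factor(eta_true, eta_pred, sigma_pred, target=0.683):
    """Per-parameter factor s such that eta_pred +/- s * sigma_pred covers the
    target fraction (68.3%) of true values. s is the target quantile of the
    normalized residuals |eta_true - eta_pred| / sigma_pred; arrays (N, p)."""
    z = np.abs(eta_true - eta_pred) / sigma_pred
    return np.quantile(z, target, axis=0)
```

If the network already predicted perfectly calibrated Gaussian errors, s would come out close to 1 for every parameter; s > 1 flags underestimated uncertainties.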

Network architecture
In general, to find the best network architecture and the best set of hyper-parameters, defined as those yielding the minimal mean validation loss, we carried out extensive tests including the hyper-parameter search discussed in Sect. 3.3. For the architecture, we varied the number of residual blocks between 2 and 6, the number of FC layers between 1 and 3, the number of feature maps and strides in the convolutional layers, and the number of neurons within the different FC layers. We also tested different kernel sizes for the convolutional layers, but obtained the smallest average validation loss with the standard 3 × 3 kernel, which is understandable given the small size of our ground-based images.
As shown in Fig. 3, the final network architecture contains one convolutional layer with kernel size 3, followed by a batch normalization, and three residual blocks of two convolutional layers each, again with a kernel size of 3. The three residual blocks have strides of two, one, and one, and 24, 32, and 64 feature maps, respectively. The first layer before the residual blocks has 16 feature maps, while the input has 4 channels corresponding to the four different filters. After flattening, the output of the convolutional sequence is passed through one FC layer connecting 1024 to 14 neurons.
During our tests of the network architecture, we also adjusted the pooling layer before flattening for the FC layers (compare Fig. 3). Here, we tried an average pooling layer, a max pooling layer, or no pooling at all. For the two pooling layers, we tested kernel sizes of 8 × 8, 4 × 4, and 2 × 2. We found the best performance using an average pooling layer with a kernel size of 8 × 8, as indicated in Fig. 3.
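A PyTorch sketch of this architecture, assuming 64 × 64 pixel cutouts so that the flattened feature vector has 64 × 4 × 4 = 1024 entries; layer details beyond those quoted in the text (e.g., placement of batch norms and shortcut projections) are our assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers with batch norm and a skip connection (He et al. 2016)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection on the shortcut whenever the tensor shape changes
        self.skip = (nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                                   nn.BatchNorm2d(c_out))
                     if (stride != 1 or c_in != c_out) else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.skip(x))

class LensModelResNet(nn.Module):
    """Sketch: 4-band input, one conv layer (16 maps) + batch norm, three residual
    blocks (strides 2/1/1, 24/32/64 maps), 8x8 average pooling, one FC layer
    1024 -> 14, and a sigmoid so all outputs lie in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(16), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(ResidualBlock(16, 24, stride=2),
                                    ResidualBlock(24, 32, stride=1),
                                    ResidualBlock(32, 64, stride=1))
        self.pool = nn.AvgPool2d(8)
        self.fc = nn.Linear(1024, 14)  # 64 maps * 4 * 4 pixels after pooling

    def forward(self, x):
        x = self.pool(self.blocks(self.stem(x)))
        return torch.sigmoid(self.fc(x.flatten(1)))
```

The 14 outputs correspond to the seven median-like parameter values and their seven uncertainties, still in the normalized [0, 1] range.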

Hyper-parameter search
In addition to testing different architecture layouts, the values of the hyper-parameters also need to be optimized. Throughout these tests, we adopted a weight decay of 0.0005, a momentum of 0.9, and, after some short tests regarding the effect on the performance, a batch size of 32. Apart from those, the learning rate is one of the key quantities to test. Here, we typically tested learning rates r_learn ∈ {10⁻², 10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} when changing any other hyper-parameter or layer, which covers a good range of plausible values. As the changes to the weights are expected to decrease over the training, we also tested the option of a "decreasing learning rate", which means that we divide the learning rate by a factor of two every twentieth epoch. Because the best epoch was typically below 100, we had only a few relevant decreasing steps during training, such that we finally adopted a constant learning rate for our further tests. For the presented network, we finally adopted a learning rate of r_learn = 10⁻⁶, a regularization constant of ε = 0.5 for all parameters, and a weighting factor w of 5 for the Einstein radius in analogy to S21b, and 1 otherwise.
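The quoted optimizer settings, together with the tested halving schedule, would look as follows in PyTorch (the Linear module stands in for the ResNet; StepLR is our way of expressing the every-20-epochs halving):

```python
import torch

# Hyper-parameters quoted in the text: SGD with momentum 0.9, weight decay 5e-4,
# constant learning rate 1e-6 (batch size 32 would be set in the DataLoader).
model = torch.nn.Linear(1024, 14)  # placeholder for the ResNet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6,
                            momentum=0.9, weight_decay=5e-4)
# The tested alternative: halve the learning rate every twentieth epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
```

With the best epoch typically below 100, this schedule triggers only a handful of halvings, which is why the constant rate was kept in the end.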
Although the initialization is not a hyper-parameter that typically gets tuned, we tested the effect of different network initializations for a few given setups. This shows how independent the final, trained network is of the initial values. We find no preference for a specific seed, and the overall performance is unaffected, although each network yields a slightly different loss. These slight changes are due to the stochastic learning process and are therefore commonly observed. In a few instances, the best hyper-parameter value, for example the learning rate, changed with the seed, indicating the importance of optimizing the hyper-parameters for a given network.
To mitigate these changes, so-called ensemble learning methods can be used, where essentially the same network, that is, with fixed architecture and hyper-parameters, is trained multiple times with different initializations and the predictions are combined afterwards. We performed such tests for a few given setups, predicted the 14 output values with each network, and compared their average to the ground truth on the test set. In our case, this does not help to eliminate outliers or decrease the scatter, which confirms the similarity of the networks regardless of the initialization.

Parameter normalization
Because different parameters cover different ranges, we include a scaling to map them consistently to the range [0, 1]. This ensures an equal contribution from the different parameters to the loss, resulting in a better optimization of all parameters. We assume the following input ranges [a_l, b_l]: lens center x ∈ [−0.6″, 0.6″] and y ∈ [−0.6″, 0.6″], complex ellipticity e_x ∈ [−1, 1] and e_y ∈ [−1, 1], Einstein radius θ_E ∈ [0.5″, 5″] as already adopted in the simulation procedure, and complex external shear γ_ext,1 ∈ [−0.1, 0.1] and γ_ext,2 ∈ [−0.1, 0.1]. This means the ground truth is scaled componentwise as

η_scaled,l = (η_l − a_l) / (b_l − a_l),   (12)

and the output of the network is scaled back to the original ranges through

η_l = η_network,l (b_l − a_l) + a_l   (13)

and, for the uncertainty,

σ_l = σ_network,l (b_l − a_l).   (14)

The uncertainties are not shifted by a_l, as they are considered with respect to the predicted median values η. To ensure that all predicted values are between 0 and 1, we include a sigmoid layer in the network architecture before splitting the output into seven values for the median η and seven values for the uncertainty σ.
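A sketch of this componentwise min-max scaling with the quoted ranges, vectorized over the seven parameters (x, y, e_x, e_y, θ_E, γ_ext,1, γ_ext,2):

```python
import numpy as np

# Per-parameter ranges [a, b] quoted in the text (arcsec for x, y, theta_E):
A = np.array([-0.6, -0.6, -1.0, -1.0, 0.5, -0.1, -0.1])
B = np.array([0.6, 0.6, 1.0, 1.0, 5.0, 0.1, 0.1])

def scale_truth(eta):
    """Map ground-truth parameters componentwise into [0, 1]."""
    return (eta - A) / (B - A)

def unscale_output(eta_net, sigma_net):
    """Map network outputs back: medians are shifted by a and stretched by (b - a),
    while uncertainties are only stretched (no shift, as they are relative)."""
    return eta_net * (B - A) + A, sigma_net * (B - A)
```

For example, a normalized uncertainty of 0.1 on the Einstein radius translates back to 0.1 × (5″ − 0.5″) = 0.45″.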
The network parameter optimization is performed with a ReLU (Nair & Hinton 2010) activation function and a stochastic gradient descent algorithm, adjusting the weights to minimize the loss.

Cross-validation
To avoid an unbalanced optimization of the network, we follow S21b and use a five-fold cross-validation, splitting our set of 165,374 mocks into roughly 56% training, 14% validation, and 30% testing (with small rounding effects from the batch size), and train each run over 500 epochs. This also allows us to better determine the hyper-parameters (e.g., learning rate, number of neurons, or feature maps) and the final stopping epoch, which is defined through the minimal average validation loss. The lowest mean loss is −593.71, obtained in epoch 230, such that the final run is trained over 230 epochs. The loss curve reveals a relatively large generalization gap, meaning that after a certain number of epochs the network performs significantly better on the training data than on the validation set. Therefore, we tested the effect of dropout on the FC layers, even with our relatively small number of FC layers. This helped to reduce the generalization gap, but also resulted in a higher average validation loss, which is the quantity used to select the best network. We therefore use no dropout for the final network and instead rely on the uncertainty scaling, suggested by Perreault Levasseur et al. (2017) and also used by Pearson et al. (2021), such that the uncertainty distribution matches the expected percentiles for a Gaussian distribution.
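The splitting scheme can be sketched as follows (the exact fold assignment in the pipeline may differ; the fractions match the quoted 56%/14%/30%):

```python
import numpy as np

def five_fold_splits(n_total, test_frac=0.30, seed=42):
    """Hold out 30% of the mocks for testing, then rotate the remaining 70%
    through five folds: four folds train (56%), one fold validates (14%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_total)
    n_test = int(test_frac * n_total)
    test, rest = idx[:n_test], idx[n_test:]
    folds = np.array_split(rest, 5)
    for k in range(5):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, val, test
```

Each of the five runs is then trained on a different train/validation assignment, and the validation losses are averaged to pick the hyper-parameters and stopping epoch.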

Network results and performance
In this section, we present the SIE+γ_ext parameters and corresponding uncertainties σ predicted by our residual neural network. This network was trained, validated, and tested on 165,374 realistic mock images created with our upgraded simulation procedure as described in Sect. 2. The optimized network architecture is shown in Fig. 3. In Fig. 4 we show a comparison between the ground truth and the predicted values on the test set, that is, images that the network has not seen before, either during training or in the cross-validation procedure. Specifically, for every parameter, we show the distribution of values as a histogram as well as a direct comparison between predicted and true values. Figure 4 also shows the median predicted value per bin (red line) and the 1σ and 2σ ranges (gray shaded regions). The bar notation indicates quantities computed from the whole sample, to distinguish them from the uncertainties σ that the network predicts for individual systems.
The network performs very well on the lens center, especially in contrast to the CNNs presented in S21b, which had difficulty in recovering the mass center. Here, we clearly see the improvement on the lens centering resulting from the re-centering of the lens galaxy, taking the fractional pixel shift in the simulation into account (which was neglected in S21b), and from the random shift of the final mock image by up to ±3 pixels. The median with 1σ range for $x^{\rm pred} - x^{\rm tr}$ and $y^{\rm pred} - y^{\rm tr}$ are, respectively, $0.04^{+0.50}_{-0.40}$ and $0.04^{+0.46}_{-0.41}$ pixels. The complex ellipticity components $e_x$ and $e_y$ are also well recovered, with $0.02^{+0.12}_{-0.09}$ and $0.01^{+0.11}_{-0.10}$, respectively, although the predictions still have a tendency to underestimate the absolute values. In other words, the network predicts the galaxies to be rounder than they are. One reason for this is the much higher number of training systems with values around zero, which means that the network might be biased towards this value.
The Einstein radius $\theta_{\rm E}$ is very well recovered, with a median and 1σ value of $\theta_{\rm E}^{\rm pred} - \theta_{\rm E}^{\rm tr} = 0.003^{+0.21}_{-0.24}$″. This can also be seen from Fig. 4, where the median line closely follows the 1:1 line between our lower limit of 0.5″ and around 1.5″, dropping for systems with very large image separations. The 1σ and 2σ ranges indicate very precise predictions between ∼1″ and ∼2″, while beyond this range the performance is not as good, which is partly due to the low number of such systems in the training set. The lower precision on the low end is most likely due to blending issues with the lens: because of the small image separations, the counter images are more strongly blended with the lenses, such that the network cannot sufficiently deblend the arcs from the lens, which is important for the prediction of the Einstein radius. This is confirmed by a test using the arcs alone for the network training (see Sect. 5.4 for details). On the other side of the range (i.e., high $\theta_{\rm E}^{\rm tr}$), the performance drops significantly due to the underrepresentation of such systems in the data set. As shown with CNNs in S21b, the distribution of the Einstein radii is crucial for the performance, and a uniformly distributed sample yields the best performance over the full considered range. As mentioned already in Sect. 2, we therefore set a maximum number of mocks per Einstein radius bin. As we use real measurements of the velocity dispersion and redshifts, lensing systems with an Einstein radius above ∼2″ are very rare, such that the number of systems drops by more than one order of magnitude towards $\theta_{\rm E} \sim 3$″ compared to the plateau at $\theta_{\rm E} \leq 1.5$″, which explains the decreasing performance. Given the very low number of systems within each bin for $\theta_{\rm E} > 3$″, we do not show them in Fig. 4 but keep our limit of 5″ as the largest image separation accepted by our simulation pipeline.
Even though the network shows a lower performance in this range, it has seen some of these systems, and because of our introduced parameter scaling, it is in principle able to predict such large Einstein radii. As demonstrated in S21b, we can also train a dedicated network on a smaller sample, for example on systems with $\theta_{\rm E} \geq 2$″, to achieve a better performance for systems with such large image separations.
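The per-bin median line and 1σ band shown in comparisons like Fig. 4 can be computed as in this minimal sketch. The function names, the percentile interpolation, and the example bin edges are illustrative choices, not the paper's actual plotting code:

```python
# Minimal sketch of a per-bin diagnostic: for each bin of the true parameter,
# compute the median and the 16th/84th percentiles (~1 sigma) of the
# predicted values.

def percentile(sorted_vals, q):
    """Linear-interpolation percentile, q in [0, 100], on pre-sorted values."""
    if not sorted_vals:
        raise ValueError("empty bin")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def binned_stats(true_vals, pred_vals, edges):
    """(16th, 50th, 84th) percentiles of predictions per true-value bin."""
    stats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = sorted(p for t, p in zip(true_vals, pred_vals) if lo <= t < hi)
        if in_bin:
            stats.append((percentile(in_bin, 16),
                          percentile(in_bin, 50),
                          percentile(in_bin, 84)))
        else:
            stats.append(None)
    return stats
```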
As we see from Fig. 4, the network is so far not able to accurately predict the external shear components $\gamma_{\rm ext,1}$ and $\gamma_{\rm ext,2}$, although the mean with 1σ values for the whole test set are $0.002^{+0.04}_{-0.04}$ and $-0.001^{+0.04}_{-0.04}$, respectively. It tends to predict values closer to zero, resulting in a lower predicted shear strength than the true one. We tested many different possibilities to improve on the external shear, as described in Sect. 5, and found that blending with the lens is not the main reason for this issue. The current network apparently cannot sufficiently generalize to new systems for these very minor distortions of the arcs. This might be because of the variable PSF from system to system or because of the image resolution, given that shear prediction works relatively well with more idealized, lens-light subtracted, and high-resolution images (Morningstar et al. 2018). Further investigation beyond our tests summarized in Sect. 5 is therefore necessary for a better estimate of the external shear. Whether a precise estimate of the external shear is crucial depends on the science case behind the modeling. For statistical studies of the lenses, for instance regarding the lens mass, the external shear is expected to have only a negligible influence. Figure 5 shows the difference between the predicted values and ground truths for the seven parameters, as well as correlations between them. We find no strong correlations, not even between typically degenerate quantities such as ellipticity and external shear. By comparing to Fig. 7 of S21b, where the plotting ranges of the SIE parameters are kept the same for ease of comparison, we find a generally better performance on the Einstein radius but with the same kind of diamond-shaped 2σ contour. On the other hand, the scatter on the lens mass center is slightly larger with the presented ResNet.
This is most likely because of our newly introduced random ±3 pixel shift of the final mocks, which was implemented to ensure that the network predicts the lens mass center instead of the image center, but which means that we cannot directly compare the performance on the lens center here. Instead, Fig. 4 reveals the much improved performance on the lens center of the ResNet compared to the CNNs in S21b (their Figs. 8 and 10). Given the numerous changes between these networks, namely the new parameters introduced through the external shear, the uncertainty prediction, the change in the network architecture, and the addition of Poisson noise in the simulations, it is difficult to interpret the differences in performance. As the ResNet has the much more complex task of additionally predicting the external shear and uncertainties for each parameter, which it does well overall, it is definitely the more powerful network.
Besides the median value, the network also predicts an uncertainty σ for each parameter. To interpret this as the width of a Gaussian distribution, it has to match the statistical expectations of 1σ corresponding to a CI of 68.3%, 2σ to 95.4%, and 3σ to 99.7%. As explained in Sect. 3, we do not use dropout, as in Perreault Levasseur et al. (2017) and Pearson et al. (2021), and instead incorporate a direct scaling per parameter. As our predictions do not perfectly match a Gaussian distribution, we scale the predicted values for the different parameters $\sigma = (\sigma_x, \sigma_y, \sigma_{e_x}, \sigma_{e_y}, \sigma_{\theta_{\rm E}}, \sigma_{\gamma_{\rm ext,1}}, \sigma_{\gamma_{\rm ext,2}})$ by $s = (1.25, 1.32, 1.19, 1.20, 1.08, 1.21, 1.21)$. For each parameter $\eta_j$, the scale $s_j$ minimizes the quadratic sum of the coverage differences for the three σ intervals, that is, it minimizes
$$ \sum_{k=1}^{3} \left( f_{j,k} - C_k \right)^2 \quad {\rm with} \quad f_{j,k} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\!\left( \left| \eta_{j,t}^{\rm tr} - \eta_{j,t}^{\rm pred} \right| < k\, s_j\, \sigma_{j,t} \right), $$
where $T$ denotes the size of the test set over which we sum, and $C = (0.683, 0.954, 0.997)$ are the expected Gaussian coverage fractions. This is a good compromise between matching the commonly used 68.3% CI and also the 95.4% and 99.7% CIs. This is visualized in Fig. 6, where we show the coverage of the scaled uncertainty values for each parameter (gray bars) as absolute values (top) and as the difference from the expectations (bottom). The top panel demonstrates the close match between the scaled uncertainties and the expected CI levels (blue dashed), especially for the mean over all seven parameters (red dotted), and can be directly compared to the results of Perreault Levasseur et al. (2017, Fig. 2) and Pearson et al. (2021, Fig. 4). The bottom panel shows the difference between the obtained coverages and the expected CIs; the good match of the 1σ values for all individual parameters is achieved through the scaling, at the cost of visible deviations for the 2σ and 3σ lines. In particular, the distribution for the Einstein radius is sharper than a Gaussian distribution.
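The coverage matching described above can be sketched in a few lines. This is a hedged illustration: the coverage definition follows the text, but the grid search over the scale factor is our own stand-in, since the paper does not specify its optimizer:

```python
# Per-parameter uncertainty scaling: find the factor s that makes the
# fractions of true values inside 1/2/3 scaled sigma best match the Gaussian
# expectations (68.3%, 95.4%, 99.7%) in a least-squares sense.

TARGETS = (0.683, 0.954, 0.997)

def coverage(residuals, sigmas, s, k):
    """Fraction of systems with |eta_tr - eta_pred| < k * s * sigma."""
    inside = sum(abs(r) < k * s * sig for r, sig in zip(residuals, sigmas))
    return inside / len(residuals)

def best_scale(residuals, sigmas, grid=None):
    """Scale factor minimizing the quadratic sum of coverage mismatches."""
    grid = grid or [0.80 + 0.01 * i for i in range(81)]  # 0.80 .. 1.60
    def cost(s):
        return sum((coverage(residuals, sigmas, s, k) - t) ** 2
                   for k, t in zip((1, 2, 3), TARGETS))
    return min(grid, key=cost)
```

Applied per parameter over the test set, this kind of search yields the scale factors $s = (1.25, 1.32, \ldots)$ quoted above.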
A comparison of the performance to other modeling networks is difficult given the discrepancy in assumptions. Hezaveh et al. (2017), who originally proposed this novel idea, presented a network to predict $e_x$, $e_y$, and $\theta_{\rm E}$ of an SIE profile for HST-like images after lens-light subtraction to demonstrate the feasibility. Perreault Levasseur et al. (2017) further included uncertainty predictions and an external shear component. Similarly, Pearson et al. (2021) presented a network to predict $e_x$, $e_y$, and $\theta_{\rm E}$ of an SIE profile for mock images with 0.1″ resolution in preparation for the Euclid space mission. These authors also included error estimations inspired by Perreault Levasseur et al. (2017) and explored the opportunity of a hybrid code by combining their network with PyAutoLens (Nightingale et al. 2018), a fully automated non-machine-learning-based modeling software, for further refinement of the parameter predictions. The differences in image resolution, number of filters, and the quality of the training and test data make a comparison difficult. Moreover, the different numbers of predicted parameters complicate the comparison, given the degeneracies between the different parameters of a given lensing system. The closest work in terms of image quality and number of filters was presented by Pearson et al. (2019), who considered CNNs to predict $e_x$, $e_y$, and $\theta_{\rm E}$ of an SIE profile for Euclid, LSST r-band, and LSST gri-band data. The latter is the best match to our networks, and is comparable in performance, as mentioned in S21b. There is also currently a lot of work going into automated modeling without machine learning (e.g., Nightingale et al. 2018, 2021; Rojas et al. 2022; Savary et al. 2022; Ertl et al. 2022; Etherington et al. 2022; Gu et al. 2022; Schmidt et al. 2023), which typically performs better than neural networks but requires significantly longer run times of hours to days. We refer to Schuldt et al. (2022) for a direct comparison between the network presented here and traditionally obtained models for real HSC lenses.

Fig. 6 caption: The network-predicted uncertainties σ of the test set were scaled such that roughly 68.3%, 95.4%, and 99.7% of the true values $\eta^{\rm tr}$ are contained in the 1σ (dark gray bar), 2σ (medium gray bar), and 3σ (light gray bar) ranges of a Gaussian distribution centered at the predicted median values $\eta^{\rm pred}$. With this scaling, we can indeed interpret the predicted uncertainties of each individual parameter as the width of a Gaussian distribution. We additionally show the mean of all seven σ values, which are, respectively, 68.05%, 95.72%, and 99.46%, in red for comparison.

Additional tests on the network
In this section, we summarize additional tests carried out mostly to address the remaining difficulty in the prediction of the external shear.

Tests on the network architecture
Beyond our general tests on the network architecture described in Sect. 3, we further tested the possibility of splitting the network into multiple branches after flattening and before the FC layers, as sketched in Fig. 7. Each branch then consists of n FC layers (possibly just one) and predicts only specific parameters. This allows us to optimize the weights of the different branches for the specific parameters of each branch. The input of the first FC layer in each branch is the full flattened data cube obtained after the pooling, which is the same for each branch. Here we considered the three versions visualized in Fig. 7, although others are possible. In version 1, we split the network into two branches, each predicting seven values: one branch for the median values η and one branch for the uncertainties σ. In version 2, we split the network into seven branches, each predicting the median and uncertainty of one parameter. The third considered option includes 14 branches, where each branch predicts just one value. As we found no improvement through these tests, we do not further explore the splitting into multiple branches. This might be because the parameters share information and are degenerate, such that a single branch is helpful, especially for the uncertainty prediction. However, even though the multiple branches did not improve the performance compared to an architecture with just a single branch of FC layers, they could be helpful when trying to tune the errors through the dropout rate as suggested by Perreault Levasseur et al. (2017), as one can adopt a specific dropout rate for each branch. This would allow an individual tuning for each parameter.

Over-fitting tests
To test whether a particular network architecture is promising, we performed so-called over-fitting tests, which means that we trained and evaluated the network on a very small sample with around 1,000 mock images. This shows whether the network is able to "memorize" the task perfectly, including predicting the external shear. As these were only short tests, we performed no cross-validation. Given our difficulties with the external shear, it is important to remark that our network is indeed able to learn the external shear for that very small sample. This shows that the network is in general able to extract features of the images and to connect them to all seven parameters, albeit imperfectly. On the shear in particular we see some scatter, which indicates that the network does not just remember the exact images and output the stored values. This demonstrates that the baseline network architecture is likely not the main reason for the failure in the external shear prediction. However, these tests have no implications as to whether the network performs well on new, unseen data from the test set. We further performed such over-fitting tests with networks that just predict the external shear. This helped us to significantly improve on the training data.

Test with fixed lens-source pairs
As our over-fitting tests presented in Sect. 5.2 only show that the network is able to predict the external shear well from images in a small training set, we further tested whether it can predict the shear on new images if we simplify the task. For this, we considered three different stages of simplification and created samples with 1,000 mocks each. First, we always used the same lens, but different background sources from HUDF placed randomly behind the lens. In the second scenario, the lens light and source light distributions as well as the redshifts were kept the same, while the position of the background source relative to the lens and the mass-to-light offsets of the lens were varied, as in our general training data set. The third option was to keep everything fixed, including the source position, and to vary only the external shear. This means that, in the third option, the arcs would always appear identical in the absence of external shear, and any distortions arose solely from the external shear. As the lens did not vary in these tests, we excluded the SIE parameters from the prediction and trained the networks to only predict the external shear.
In the first and second scenarios, the network is able to predict the external shear better but not perfectly. In the third scenario, the network yields a nearly perfect prediction of the external shear on the test data (Fig. 8). This demonstrates the ability of the network to transfer the shear extraction to completely new images. As the lens and background source are always the same in the last scenario, we can exclude the possibility that the network connects other features of the image with the external shear parameters. On the other hand, we also exclude possible degeneracies between different parameters, such as the ellipticity and the external shear, that might explain the difficulties with the external shear. One reason for the good performance in these tests might be the matching PSF for all systems, given that we are using the same lens. Normally, different lenses have slightly different PSFs, meaning that the arcs appear different after the convolution. These variations can introduce difficulties for the network, as it does not know the PSF of a given lens system. Particularly for the external shear, detecting small distortions on the arcs is necessary, which can be highly influenced by different PSF shapes. However, passing the PSFs together with the images to the network did not help (see Sect. 5.4).

Variations of the input data
As we achieved a better performance in our lens search projects by applying a square-root stretching (Cañameras et al. 2021; Cañameras et al. in prep.; Shu et al. 2022), we also tested this for our modeling network. Here, we no longer passed the images themselves to the network, but rather the square root of the images after setting all negative background pixels to zero. This helps to enhance the faint arcs compared to the brighter lens in the center, but it also changes the flux ratios between pixels on the arcs in any given filter as well as the ratios between the different filters, where the color information is encoded. However, we find no improvement for the modeling network.
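The stretching test described above amounts to a simple per-pixel transform, sketched here for intuition (function name and nested-list image representation are our own):

```python
# Square-root stretch: clip negative background pixels to zero and take the
# square root of each pixel. Note how this changes flux ratios (e.g., a 4:1
# pixel ratio becomes 2:1), which also alters the colour information encoded
# in the ratios between filters.
import math

def sqrt_stretch(image):
    """Apply sqrt(max(pixel, 0)) to a 2D image given as nested lists."""
    return [[math.sqrt(max(pix, 0.0)) for pix in row] for row in image]
```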
Another test of this kind was to subsample the images by linear interpolation to increase the number of pixels, as our 64 × 64 pixel images are very small compared to images from, for example, ImageNet. Even though no information is added in this process, the hope was that a higher pixel count might help the network to predict small features and would allow for deeper networks with larger kernel sizes or strides in the convolutional layers. We tested subsampling factors of 2, 3, and 4, but the lowest mean validation loss was obtained without subsampling.
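The subsampling by linear interpolation can be illustrated as follows. This is a generic bilinear upsampler in pure Python for intuition only, not the pipeline's actual resampling routine:

```python
# Upsample a 2D image by an integer factor with bilinear (linear in each
# axis) interpolation. No new information is added; the pixel grid is simply
# made finer, as in the subsampling test described in the text.

def upsample(image, factor):
    """Bilinearly upsample a nested-list image by an integer factor."""
    ny, nx = len(image), len(image[0])
    out_ny, out_nx = ny * factor, nx * factor
    out = []
    for j in range(out_ny):
        y = j * (ny - 1) / (out_ny - 1) if out_ny > 1 else 0.0
        y0 = min(int(y), ny - 2) if ny > 1 else 0
        fy = y - y0
        y1 = min(y0 + 1, ny - 1)
        row = []
        for i in range(out_nx):
            x = i * (nx - 1) / (out_nx - 1) if out_nx > 1 else 0.0
            x0 = min(int(x), nx - 2) if nx > 1 else 0
            fx = x - x0
            x1 = min(x0 + 1, nx - 1)
            # Weighted sum of the four neighboring input pixels.
            val = (image[y0][x0] * (1 - fy) * (1 - fx)
                   + image[y0][x1] * (1 - fy) * fx
                   + image[y1][x0] * fy * (1 - fx)
                   + image[y1][x1] * fy * fx)
            row.append(val)
        out.append(row)
    return out
```

A factor of 2 turns the 64 × 64 pixel cutouts into 128 × 128 pixel images; the corner pixel values are preserved exactly.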
Given our difficulty in recovering the external shear, we tried to help the network with some additional information. First, we provided the full-width at half-maximum (FWHM) values of the point spread function (PSF) frames in addition to the normal set of images. These values were added to the flattened output of the convolutional layers and processed through the FC layers. As this led to no improvement, we considered networks accepting eight frames instead of four, and added the PSF images directly as input images. To this end, we subsampled the PSF with a linear interpolation to 64 × 64 pixels, matching the size of the images, and passed them independently of the images through a mirrored branch of convolutional layers, combining the intermediate outputs directly before the FC layers, as suggested by Maresca et al. (2021) and Li et al. (2022). Again, no improvement is seen with this option, possibly because such networks, with their relatively small kernel sizes, perform well in pattern recognition but perhaps not as well in analyzing very similar and completely smooth images like a PSF. Another possible reason for gaining no improvement by adding the PSF images is that these PSFs are likely imperfect because of the complex stacking procedure of ∼70 candidate stars per CCD (Bertin 2011; Aihara et al. 2018). These small errors are estimated to reach the 1% level at maximum (Aihara et al. 2018, 2019), but are possibly relevant for the relatively small effects of the external shear on the arcs. To further investigate the external shear, we trained networks on images containing only the arcs. This has proven helpful in other modeling networks (e.g., Hezaveh et al. 2017; Perreault Levasseur et al. 2017; Morningstar et al. 2018, 2019; Pearson et al. 2021). By removing the lens light, the remaining information from the arcs becomes more easily accessible.
We assumed perfect lens-light subtraction, which is not achievable in reality, but represents the best-case scenario when assessing whether the network can then better predict the external shear, which is only encoded in the arcs. As expected, the new network predicts the Einstein radius with nearly no scatter over the full range, and also performs well on the lens center. The good performance on the systems with small image separations confirms that the lower performance of our normal network in that regime is due to blending issues with the lens. In this test, we note a loss of performance on the lens mass ellipticity, which is to some extent expected, as the network normally retrieves this information partially from the lens light, which can differ slightly from the lens mass distribution. However, even this network cannot accurately predict the external shear, which suggests that the information is very hard to access from ground-based imaging and a generalization over the whole sample is currently impossible.
To answer the question of the minimum number of filters required, we trained networks on fewer filters: either only the g band, the g and r bands, or the g and i bands. Although we performed no real optimization of the network architecture and hyper-parameters, we found notably poorer performance with one or two filters than with four. However, even with just one filter, the network is able to predict the SIE parameters reasonably well, with the greatest difficulty being the ellipticity. This confirms that it is possible to train CNNs or ResNets on single-band images, for example from Euclid, where the much better resolution will compensate for the missing color information to some extent. As expected, the external shear is not predictable at all with just one filter of HSC image quality within our few tests in this direction.

Image positions and time-delay comparison
Given the extremely low computational time needed by the trained network to predict the SIE+γ_ext parameters, the network would be ideal for predicting the mass model of a lensing system with a lensed transient. This mass model could then be used to predict, based on the first detection, the next appearing images with corresponding time delays, and therefore help to plan follow-up observations. To test the precision and accuracy of the network, we performed a test similar to that in S21b on our test set. To this end, we computed the image positions $(\theta_x^{\rm tr}, \theta_y^{\rm tr})$ and time delays $\Delta t^{\rm tr}$ of the background source center given the true mass model from our simulation pipeline. Assuming that the first appearing true image position $(\theta_{x,A}^{\rm tr}, \theta_{y,A}^{\rm tr})$ is the first observed image and is therefore coincident with $(\theta_{x,A}^{\rm pred}, \theta_{y,A}^{\rm pred})$, we computed the source position using the mass model predicted by the network. From this predicted source position, we then obtained all other image positions $(\theta_x^{\rm pred}, \theta_y^{\rm pred})$ and corresponding time delays $\Delta t^{\rm pred}$. We finally compared the true and predicted values and directly computed the differences for lens systems with a matching number of images.
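For intuition, the recipe of tracing the first image back to the source and predicting the remaining image and delay can be illustrated with a 1D singular isothermal sphere (SIS) toy model. This is a deliberate simplification: the paper uses the full SIE+γ_ext model in 2D, so everything below is a hedged stand-in with our own function names:

```python
# 1D SIS toy model: the lens equation beta = theta - theta_E * sign(theta)
# gives two images when |beta| < theta_E, and the relative Fermat potential
# sets the relative time delay up to the (known) time-delay distance
# prefactor. For an SIS the analytic result is delta_tau = 2 * theta_E * beta.

def sis_images(beta, theta_E):
    """Image positions for a 1D SIS lens with source offset |beta| < theta_E."""
    assert abs(beta) < theta_E, "two images only for a strongly lensed source"
    return beta + theta_E, beta - theta_E

def fermat_potential(theta, beta, theta_E):
    """tau(theta) = (theta - beta)^2 / 2 - psi(theta), with psi_SIS = theta_E*|theta|."""
    return 0.5 * (theta - beta) ** 2 - theta_E * abs(theta)

def relative_delay(beta, theta_E):
    """Fermat-potential difference between the two images (second minus first)."""
    th_a, th_b = sis_images(beta, theta_E)
    return fermat_potential(th_b, beta, theta_E) - fermat_potential(th_a, beta, theta_E)
```

In the paper's setup, `beta` is not known but inferred by applying the predicted mass model to the first observed image position; the remaining images and delays then follow as above.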
To visualize the results, we show the differences of the x- and y-coordinates of the multiple image positions in Fig. 9 as a function of the Einstein radius. We find essentially no bias and a similar scatter as in S21b, which increases with the Einstein radius, as expected from Fig. 4. The small bias and fluctuations in the last bins ($\theta_{\rm E} \geq 2$″) are due to low-number statistics (compare also the histogram of the Einstein radius distribution in Fig. 4). Figure 10 shows the time-delay differences as a histogram (left panel) and as a function of the predicted Einstein radius (right panel). We find a small bias in $\Delta t$, while the observed scatter increases as well, which is expected given the performance on the image positions. As we train our ResNet on a sample with equally distributed Einstein radii up to ∼1.5″, we can compare the bias and scatter to the similarly constructed CNN of S21b (see Sect. 4.3 in S21b). These networks show an overall comparable bias towards longer time delays, from which we can infer that the external shear has only a minor effect on the time-delay predictions and that the bias mainly originates from the Einstein radius offsets. While the ResNet equally under- and overpredicts many time delays with a typical scatter of σ ∼ 20 days, the CNN has a slight tendency to make overpredictions. To correct the ResNet predictions for the bias with the Einstein radius, we fit a linear function of the form $\Delta t^{\rm pred} - \Delta t^{\rm tr} = a\, \theta_{\rm E}^{\rm pred} + b$ to the individual differences, restricting the fitting range to [0.5″, 2″]. We obtain $a = 0.909$ and $b = 1.928$. The corrected plot is shown in the left panel of Fig. 11, along with the corrected time-delay difference $\Delta t^{\rm pred} - \Delta t^{\rm tr}$ as a function of the difference in Einstein radius (right panel). For systems with $\theta_{\rm E}^{\rm pred} \lesssim 2$″, we can control the bias within ∼3 days, although the scatter remains large, limiting the usefulness of the network in predicting precise time delays.
This performance is clearly not as good as that of traditional MCMC sampling on ground-based images with an SIE+γ_ext mass distribution, but given the extreme difference in computational time, it may still serve as a reasonable first estimate for systems with $\theta_{\rm E} \leq 2$″.
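The linear bias correction described above can be sketched as an ordinary least-squares fit followed by a subtraction. The helper names are illustrative; the fitted values quoted in the text are $a = 0.909$ and $b = 1.928$:

```python
# Fit dt_offset = a * theta_E + b by ordinary least squares over the
# restricted Einstein-radius range, then subtract the fitted bias from the
# predicted time delays.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def correct_delays(dt_pred, theta_E, a, b):
    """Remove the fitted Einstein-radius-dependent bias from predicted delays."""
    return [dt - (a * th + b) for dt, th in zip(dt_pred, theta_E)]
```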

Summary and conclusion
S21b demonstrated the possibility of modeling ground-based images of strongly lensed galaxy-scale systems with a relatively simple CNN inspired by the LeNet architecture. Building upon these results, we developed a way of modeling such systems when including the external shear component in addition to the SIE mass distribution of the lens. Moreover, we now predict a 1σ uncertainty for each parameter and lensing system. To this end, we make use of a residual neural network, which is a specific type of CNN that includes so-called residual blocks with a skip connection. A diagram of our final network architecture is shown in Fig. 3. Because of the included error prediction, we changed the loss function from an MSE loss to a log-probability function with a regression term inspired by Perreault Levasseur et al. (2017). To ensure that all parameters contribute equally to the loss, we introduced a scaling of each parameter to the range [0, 1] and therefore included the sigmoid function as the last layer. The network was trained on mock images created from real observed data by only simulating the lensing effect, resulting in realistic mocks, as shown in Fig. 2. We used images of LRGs observed with HSC as lenses, together with velocity dispersion and redshift measurements from SDSS. The background sources were images from HUDF with known redshifts.
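The loss change summarized above can be sketched as a Gaussian negative log-likelihood in the scaled parameters. This is a hedged illustration: the per-parameter [0, 1] scaling and Gaussian form follow the text, while the exact regularization details of the paper's loss are not reproduced here:

```python
# Gaussian negative log-probability loss over parameters mapped to [0, 1],
# so that all seven parameters contribute on an equal footing.
import math

def scale_to_unit(value, lo, hi):
    """Map a parameter from its physical range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

def gaussian_nll(pred, sigma, truth):
    """Per-parameter negative log-likelihood for a Gaussian prediction;
    the log(sigma) term penalizes arbitrarily inflated uncertainties."""
    return 0.5 * ((pred - truth) / sigma) ** 2 + math.log(sigma)

def total_loss(preds, sigmas, truths):
    """Sum over the seven parameters; equal weight thanks to the [0,1] scaling."""
    return sum(gaussian_nll(p, s, t) for p, s, t in zip(preds, sigmas, truths))
```

Minimizing this loss drives the network both to predict accurate medians and to report a σ consistent with its actual scatter.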
The training data were created with a flat distribution in the Einstein radius between 0.5″ and ∼2″ and a flat distribution in the external shear strength γ_ext between 0 and 0.1. Moreover, we included Poisson noise on the arcs, which we approximated from the variance map provided by HSC, and improved the lens center and ellipticity estimation using a dedicated mask for each individual lens. To make sure that the network predicts the lens mass center, we applied random shifts to the final mock images of up to three pixels in both the x and y directions.
With this procedure, we created ∼165,000 mock systems in four filters, which we then used to train our neural network. Through extensive tests, we found a network that is able to accurately predict the SIE parameters of the lens mass profile (Figs. 4 and 5). However, the external shear is very hard to predict accurately. We carried out many tests on the external shear, such as up-weighting its loss contribution, only predicting the external shear, adding further information such as the FWHM values or PSF images, or applying subsampling. Only in the case of training on a very small sample with always the same lens and source pair, as shown in Fig. 8, is the network also able to predict the external shear well for new images in the test set. This demonstrates that the network indeed obtained its external shear prediction from the arcs, which differ only because of the variable external shear. In short, the network is able to predict the external shear from the distortions it introduces on the arcs. It therefore seems that the network is in general able to extract the external shear information, but cannot generalize well to other systems, which is probably a result of a combination of the complexity of the lensing system, the image resolution, inaccurate masking of the sources affecting the mocks, the PSF being unknown to the network, correlations between the external shear and other parameters, and/or other reasons. This is supported by the fact that the external shear can be relatively well predicted by CNNs trained on more idealized, lens-light subtracted, and high-resolution images (Morningstar et al. 2018).
In analogy to S21b, we used the predicted SIE+γ_ext mass parameters to predict the next appearing image(s) and corresponding time delay(s) given the first appearing image of the true, simulated mass model. Although we observed stronger discrepancies in the external shear, the observed scatter on the image positions and time delays is comparable to that obtained with our CNNs of S21b. We find a bias in the predicted time delays as a function of the predicted Einstein radius, which can be compensated up to $\theta_{\rm E}^{\rm pred} \sim 2$″ by applying a linear correction function. Through a comparison of the performance with that of the CNN presented in S21b, we see that the external shear has only a minor effect on the time-delay prediction.
The greatest strength of our network is certainly its reduced requirement for computational time and its greater degree of automation. Once trained, it predicts these parameters in fractions of a second, while state-of-the-art methods like Glee and Glad require at least days and a lot of user input for the same task. With this network, we are able to predict the SIE+γ_ext values with uncertainties for all known HSC lenses or lens candidates, which already number a few thousand, within minutes. Given the good match of the HSC images to the expected quality of LSST, the performance of our network is expected to hold for LSST as well. Here, we propose to generate dedicated mocks and train a separate network as soon as data from the first LSST data release are available, in order to avoid a possible degradation in performance because of slightly different image characteristics.

Fig. 11 caption: Difference between the predicted and true time delays (in days) after correction of the linear bias, as a function of the predicted Einstein radius (left panel) and of the difference between the predicted and true Einstein radius (right panel).
The SDSS web site is www.sdss.org.