Improving the predictions of black carbon (BC) optical properties at various aging stages using a machine-learning-based approach

It is necessary to accurately determine the optical properties of highly absorbing black carbon (BC) aerosols to estimate their climate impact. In the past, there has been hesitation about using realistic fractal morphologies when simulating BC optical properties due to the complexity involved in the simulations and the cost of the computations. In this work, we demonstrate that the predictions of optical properties like single-scattering albedo (ω) and mass absorption cross-section (MAC) can be improved compared to the conventional Mie-based predictions using a highly accurate benchmark machine learning algorithm. Unlike the computationally intensive simulations of complex scattering models, the ML-based approach accurately predicts optical properties in a fraction of a second. This study includes an extensive evaluation procedure. Comparisons with laboratory measurements demonstrated that incorporating realistic morphologies of BC significantly improved the predictions of their optical properties. The results indicate that it is possible to generate optical properties in the visible spectrum using BC fractal aggregates with any desired physicochemical properties, such as size, morphology, or organic coating. Based on these findings, climate models can improve their radiative forcing estimates using such comprehensive parameterizations for the optical properties of BC based on


Introduction
Black carbon (BC) aerosols are strong absorbers of solar radiation formed from incomplete combustion of fossil fuels, biofuels, and biomass (Ramanathan and Carmichael, 2008; Bond et al., 2013). In the atmosphere, BC is usually found together with other types of aerosols, which form a coating around it (Sun et al., 2022; Sedlacek et al., 2022; Romshoo et al., 2023a). To understand the impact of BC on the environment, global climate models require information about its light-scattering and absorption properties (Jacobson, 2001). The most common morphology assumed for such BC-containing aerosols in light-scattering codes is a spherical core-shell shape (Bond et al., 2013). The Lorenz-Mie theory (Mie, 1908) is often used to calculate the optical properties of such spherical BC particles (Bohren and Huffman, 2008). However, studies have shown significant discrepancies in the results of the Lorenz-Mie theory when compared with ambient measurements (Romshoo et al., 2024; Adachi et al., 2010; Wu et al., 2018).
High-resolution transmission electron microscopy (TEM) images showed that the BC particles have a fractal structure composed of numerous spherules known as primary particles (Chakrabarty et al., 2006). This led to an advanced mathematical description of BC as fractal aggregates, known as fractal law (Mishchenko et al., 2002):

$$N_{pp} = k_f \left(\frac{R_g}{a}\right)^{D_f}, \tag{1}$$

where $a$ is the radius of the primary particle, $N_{pp}$ is the number of primary particles, $k_f$ is the fractal prefactor, and $D_f$ is the fractal dimension. $R_g$ is the radius of gyration, which characterizes the spatial size of the aggregate. The shortcomings of the simplified spherical assumption of BC have caused the scientific community to move towards the use of such realistic fractal aggregate morphology for computing the optical properties of BC (e.g., Kahnert and Kanngießer, 2020; Romshoo et al., 2021; Kahnert, 2010a; Wu et al., 2018; Liu and Mishchenko, 2018). Romshoo et al. (2022) showed that the discrepancy between modeled and measured optical properties could be reduced to 10 % when an aggregate morphology is used. To simulate the optical properties of BC as fractal aggregates, the most commonly used methods are the Rayleigh-Debye-Gans (RDG) approximation (Sorensen, 2001), the discrete dipole approximation (DDA) (Purcell and Pennypacker, 1973), the generalized multi-particle Mie (GMM) method (Xu and Gustafson, 2001), and the T-matrix method (Mishchenko et al., 1996). The multi-sphere T-matrix (MSTM) method has found widespread applications in the research field because of its high computational speed and accuracy in comparison to other methods like the DDA (Kahnert and Kanngießer, 2020; Yurkin and Kahnert, 2013). Although the MSTM has lower computational costs when compared to other numerical methods, a single simulation can still take more than 24 h, depending on the properties of the aggregate.
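As a rough illustration of how the fractal scaling law N pp = k f (R g / a)^(D f) links morphology to aggregate size, the following sketch inverts it for the radius of gyration. The prefactor value k f = 1.2 and the function names are assumptions chosen for illustration only; they are not taken from the text.

```python
import math


def radius_of_gyration(n_pp: float, a_nm: float, d_f: float, k_f: float = 1.2) -> float:
    """Solve N_pp = k_f * (R_g / a)**D_f for R_g (returned in the unit of a).

    k_f = 1.2 is an illustrative default, not a value stated in the paper.
    """
    if n_pp <= 0 or a_nm <= 0 or d_f <= 0 or k_f <= 0:
        raise ValueError("all arguments must be positive")
    return a_nm * (n_pp / k_f) ** (1.0 / d_f)


def number_of_primaries(r_g_nm: float, a_nm: float, d_f: float, k_f: float = 1.2) -> float:
    """Evaluate the fractal law directly: N_pp = k_f * (R_g / a)**D_f."""
    return k_f * (r_g_nm / a_nm) ** d_f


# Same N_pp and primary particle radius, two different morphologies.
r_g_open = radius_of_gyration(200, 15.0, 1.5)     # chain-like aggregate
r_g_compact = radius_of_gyration(200, 15.0, 2.9)  # nearly spherical aggregate
```

For a fixed number of primary particles, the more compact aggregate (larger D f) has the smaller radius of gyration, which is the behavior the fractal dimension is meant to capture.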
Consequently, pre-calculated databases have been developed for aggregate properties to avoid the time spent constructing detailed aggregates and running time-consuming optical simulations (Liu et al., 2019; Romshoo et al., 2021). Using these databases as look-up tables mitigates high computational overhead in large-scale applications. Still, this approach is limited by the range and step size of parameters chosen during the database creation. Previous work has trained machine learning (ML) models on such databases (Luo et al., 2018a; Lamb and Gentine, 2023) to overcome those limitations. Once trained, those ML models provide predictions for BC optical properties in a fraction of a second. Luo et al. (2018a) trained a support vector regressor on a database generated using MSTM simulations (N pp from 8 to 3000; D f from 1.8 to 2.2). However, they did not consider coating and used pure BC aggregates in their experiments. Their results also suggest that their model has considerable difficulties when attempting to predict optical properties for physicochemical properties not in the range of the training data. Lamb and Gentine (2023) predicted optical properties of uncoated BC fractal aggregates using a graph neural network (N pp from 8 to 960; D f from 1.8 to 2.3). The input graph contains one node for each primary particle and an edge between two nodes if the distance between the corresponding primary particles is less than some threshold. The authors generate their ground truth database using the MSTM algorithm, but, like Luo et al. (2018a), they do not consider any coating in their experiments. The machine learning methods, training parameters, performance metrics, and other details of Luo et al. (2018a) and Lamb and Gentine (2023) are compared to this study in Table B1.
This study demonstrates the use of a machine-learning-based approach to predict the optical properties of BC aggregates at various aging stages, including coating, which is highly relevant for atmospheric aerosols. Combining this ML-based approach with a laboratory dataset showed that optical properties like single-scattering albedo (ω) and mass absorption cross-section (MAC) can be predicted more accurately than with conventional Mie-based methods. A database of the optical and physicochemical properties of BC has been built for this study, which is an extension of the previous work by Romshoo et al. (2021). We trained two ML methods on this database: kernel ridge regression (KRR) and artificial neural networks (ANNs). Experiments show that these models predict the optical properties of BC aggregates regardless of their size, morphology, or composition at low computational costs and with high accuracy. The dataset used to train our ML models is freely available at Zenodo. Furthermore, we published our ML models on GitHub together with an easy-to-use wrapper script to allow integration into higher-level applications. Our approach contributes to improving global climate model radiative forcing estimates by parameterizing BC optical properties using realistic fractal aggregate morphology.
The paper is structured as follows: Sect. 2 provides an overview of the physical, chemical, and optical properties of BC used in this study. Section 3 describes the machine learning techniques, including the data processing, machine learning algorithms, and evaluation procedures. In Sect. 4, the results demonstrate that realistic morphologies of BC can be used to accurately predict optical properties at various stages of aging. Section 5 discusses how the results compare to laboratory measurements of BC and examines the atmospheric processing in detail. Potential limitations and challenges of this work are presented in Sect. 6, and we end with the main conclusions in Sect. 7.

Database of physicochemical and optical properties of black carbon fractal aggregates
The database for the physicochemical and optical properties of BC fractal aggregates has been designed to consider all the possible aging stages of BC. The optical properties of BC fractal aggregates are most sensitive to the change in particle size as they age (Matsui et al., 2018). The particle size is reported as dependent parameters of the number of primary particles (N pp), volume equivalent radii (r i and r o), and mobility diameter (D m). Furthermore, the chemical composition and morphology also influence their optical properties. There are constants related to the particle's chemical composition, such as density and refractive index. The optical properties have been reported as efficiencies and cross-sections. Further dependent optical properties have also been included. The mass and volume of the BC particles were used for conversion between various optical parameters. Furthermore, some parameters, such as the wavelength, were related to the optical model. The database was created using 6192 particles of varying sizes, morphologies, and coating fractions. There are 35 features in the database, which are categorized into 15 physicochemical features, 13 optical features, and 7 constants. Sect. 1 contains an overview of all the features of the database. In Table A1, the upper and lower bounds of the main features are provided.

Physicochemical features of the database
The BC fractal aggregate's physicochemical features include size, mass, volume, morphology, and composition. Figure 2 gives some examples of the various BC aggregate particles generated in this study. All the relevant properties provided in the study are discussed below, and their formulas are given in Sect. A1.

Size
Primary particle size (a). The primary particle size of a BC fractal aggregate is sensitive to the emission source or flame condition. Biomass burning produces black carbon aggregates with comparatively large primary particles, ranging from 15 to 25 nm in radius (Chakrabarty et al., 2006). Diesel engines produce aggregates whose primary particle radii range between 10 and 12 nm (Guarieiro et al., 2017). On the other hand, emissions from aircraft engines consist of particles with radii as small as 5 nm (Liati et al., 2014).
There has also been research indicating that the size distribution of primary particles is largely polydisperse (Bescond et al., 2014). Liu et al. (2015) pointed out that the resultant radiative properties differ between a monodisperse and a polydisperse distribution of the primary particle radius. However, Kahnert (2010b) showed that particle light absorption is insensitive to the radii of primary particles when they are between 10 and 25 nm. The black carbon fractal aggregates in this study have a monodisperse distribution of the radius of the primary particle. BC aggregates were simulated with the inner radius of the primary particle (a i) fixed at 15 nm. In contrast, the outer radius of the primary particle (a o), consisting of the organics, varied between 15.1 and 30 nm with the fraction of coating (f coating) following Eq. (A3) in Appendix A. The a o was 15, 15.1, 15.3, 15.5, 15.8, 16.2, 16.5, 16.9, 17.8, 18.9, 20.4, 22.4, 25.6, and 29 nm according to the value of the f coating given in Table A1.
Number of primary particles (N pp). The number of primary particles determines the overall size of the particle. The BC fractal aggregates were simulated by varying N pp by 5 %, starting from 1 up to 1000.
Volume equivalent radius (r).The volume equivalent radius is defined as the radius of a sphere having the same volume as the BC fractal aggregate, described in Eq. (A1) in the Appendix.The outer volume equivalent radius (r o ) was calculated for the whole BC aggregate and for the coating using a o .The inner volume equivalent radius (r i ) was calculated using a i for the BC aggregate without the coating, i.e., pure BC.
Mobility diameter (D m ).The mobility diameter is the diameter of a sphere with the same migration velocity in a constant electric field as that of the BC fractal aggregate (Flagan, 2001).Mobility size spectrometers can measure D m , which is interesting for ambient and laboratory studies.We derived D m for the entire range of N pp using the conversion given by Sorensen (2011); see Eq. (A2) in Appendix A.
Geometric cross-section (C geo). The geometric cross-section is the area of the cross-section of a volume equivalent sphere given as Eq. (A4) in Appendix A.
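The definitions of the volume equivalent radius and the geometric cross-section can be sketched as follows. This is a minimal example that assumes monodisperse primary particles in point contact, so that the aggregate volume is simply N pp times the volume of one sphere; the function names are hypothetical, and the choice of r o for the cross-section is an assumption, since Eqs. (A1) and (A4) are not reproduced in this section.

```python
import math


def volume_equivalent_radius(a_nm: float, n_pp: int) -> float:
    """Radius of the sphere with the same volume as n_pp spheres of radius a.

    From n_pp * (4/3) * pi * a**3 = (4/3) * pi * r**3 it follows that
    r = a * n_pp**(1/3).
    """
    return a_nm * n_pp ** (1.0 / 3.0)


def geometric_cross_section(r_nm: float) -> float:
    """Projected area (nm^2) of a sphere of radius r."""
    return math.pi * r_nm ** 2


# Inner (pure BC) and outer (BC plus coating) radii for one aggregate.
r_i = volume_equivalent_radius(15.0, 100)   # a_i = 15 nm, as in the text
r_o = volume_equivalent_radius(18.9, 100)   # a_o = 18.9 nm, one of the listed values
c_geo = geometric_cross_section(r_o)        # assumption: C_geo is based on r_o
```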

Mixing state
Along with BC, a complex mixture of gas-phase organic compounds is co-emitted during incomplete combustion, forming a coating around the BC aggregates (Gentner et al., 2017). As the BC aggregates stay in the atmosphere, they transform from being hydrophobic to hydrophilic due to water deposition attracting other foreign coatings (Bhandari et al., 2019). The result is that BC particles undergo complex changes in their morphology throughout atmospheric aging, transforming from bare to partially coated aggregates and finally forming compact spherical structures embedded within external coatings (Coz and Leck, 2011; Corbin et al., 2023). Therefore, regarding BC as fractal aggregates is necessary to represent all the different stages during their atmospheric aging process. The two parameters describing the mixing state are as follows.
Fractal dimension (D f). The fractal dimension is a parameter for morphology that quantifies the folding of BC fractal aggregates into spherical structures with increasing residence time. The value of D f increases as an aggregate grows into a more spherical frame. A D f of 3 is the maximum value describing a complete sphere, whereas a D f of 1 represents an early-stage open-chain-like aggregate. In the early stages of the BC aging cycle, D f is usually between 1.5 and 1.9 (Wentzel et al., 2003). With increasing residence time in the atmosphere, aggregates become more compact with a fractal dimension of up to 2.2 (Wang et al., 2017). A humid environment or foreign coatings may further reshape the BC fractal aggregates into more compact structures with a fractal dimension of up to 2.6 (Bambha et al., 2013). In this study, the range of fractal dimensions was taken from 1.5 to 2.9 with a step size of 0.2.
Fraction of coating (f coating). The fraction of coating is the percentage of coating volume compared to the total volume of the BC fractal aggregate. To cover all aging stages, the coating fraction was taken from 1 % to 90 % in increments of 5 %. Note that the coating composition was constrained to non-absorbing organics in this study. f coating is dependent on the a o and a i, described by Eq. (A3) in Appendix A.
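The dependence of f coating on a o and a i can be illustrated with a short sketch. It assumes that every primary particle is a BC sphere of radius a i inside a concentric coated sphere of radius a o, so that the coating volume fraction is 1 − (a i / a o)³. Eq. (A3) is not reproduced in this section, so this form and the function names are assumptions rather than the paper's own expression.

```python
def coating_fraction(a_i_nm: float, a_o_nm: float) -> float:
    """Coating volume fraction (0..1) of a concentrically coated primary particle.

    Assumed form: f = 1 - (a_i / a_o)**3, i.e. coating volume over total volume.
    """
    if a_i_nm <= 0 or a_o_nm < a_i_nm:
        raise ValueError("require 0 < a_i <= a_o")
    return 1.0 - (a_i_nm / a_o_nm) ** 3


def outer_radius(a_i_nm: float, f_coating: float) -> float:
    """Invert the relation above: a_o = a_i / (1 - f)**(1/3)."""
    if not 0.0 <= f_coating < 1.0:
        raise ValueError("f_coating must lie in [0, 1)")
    return a_i_nm / (1.0 - f_coating) ** (1.0 / 3.0)


# With a_i = 15 nm, an outer radius of 18.9 nm corresponds to roughly half coating.
f_half = coating_fraction(15.0, 18.9)
```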

Others
Volume. Three features in our database describe the volume of a BC aggregate: (1) the total volume of the particle (V total), (2) the volume of the BC (V BC), and (3) the volume of the organic coating (V coating).
Mass. Similarly, we include five features related to the mass of the BC aggregate: (1) the total mass of the particle (m total), (2) the mass of the BC (m BC), (3) the mass of the coating (m coating), (4) the mass ratio of total mass to BC mass (m total / m BC), and (5) the mass ratio of coating mass to BC mass (m coating / m BC). We computed those values fixing the density of BC as ρ BC = 1.8 g cm⁻³ (Park et al., 2004) and the density of the organic coating as ρ OC = 1.1 g cm⁻³ (Schkolnik et al., 2007).
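A minimal sketch of how the volume and mass features could be derived from N pp, a i, and a o is given below. The densities are the ones quoted in the text; the assumption of non-overlapping, concentrically coated spheres, the unit handling, and all names are illustrative and not taken from the authors' code.

```python
import math

RHO_BC = 1.8         # g cm^-3, BC density quoted in the text
RHO_OC = 1.1         # g cm^-3, organic coating density quoted in the text
NM3_TO_CM3 = 1e-21   # 1 nm^3 = 1e-21 cm^3


def sphere_volume(r_nm: float) -> float:
    """Volume of a sphere in nm^3."""
    return 4.0 / 3.0 * math.pi * r_nm ** 3


def volume_and_mass(n_pp: int, a_i_nm: float, a_o_nm: float) -> dict:
    """Volumes (nm^3) and masses (g) of a coated aggregate of n_pp primary particles."""
    v_bc = n_pp * sphere_volume(a_i_nm)
    v_total = n_pp * sphere_volume(a_o_nm)
    v_coating = v_total - v_bc
    m_bc = RHO_BC * v_bc * NM3_TO_CM3
    m_coating = RHO_OC * v_coating * NM3_TO_CM3
    m_total = m_bc + m_coating
    return {
        "v_total": v_total, "v_bc": v_bc, "v_coating": v_coating,
        "m_total": m_total, "m_bc": m_bc, "m_coating": m_coating,
        "m_total_over_m_bc": m_total / m_bc,
        "m_coating_over_m_bc": m_coating / m_bc,
    }


bare = volume_and_mass(1, 15.0, 15.0)      # uncoated primary particle
coated = volume_and_mass(100, 15.0, 18.9)  # aggregate with roughly 50 % coating by volume
```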

Optical model and the optical features of the database
The tunable diffusion-limited aggregation (DLA) software (Wozniak et al., 2012) was used to simulate bare BC fractal aggregates of various physicochemical properties. BC can exhibit a range of coating thicknesses and fractal dimensions at any point in the atmosphere, as evidenced by images from transmission electron microscopy (TEM) analyzed from different locations (Fu et al., 2012). Detailed information and images from TEM analysis of BC particles have been provided in the Supplement. The coating model used in this study is called the "closed-cell model"; the results showed good comparability with the realistic coating model (Kahnert, 2017). The MSTM calculates the electromagnetic properties of a system that consists of a set of spheres (Mishchenko et al., 2004; Mackowski and Mishchenko, 2011). In this study, we use MSTM version 3.0 (Mackowski, 2013) written in Fortran to compute the electromagnetic properties for fixed and random orientations. For every BC fractal aggregate, the MSTM algorithm presents an orientational average of the combined spherical expansions of each primary particle. The MSTM code is well suited to calculating the optical properties of coated BC fractal aggregates, since these aggregates consist of nested spheres. However, a limiting condition in the MSTM is that primary particles cannot overlap. It was necessary to use this closed-cell coating model due to the non-overlapping sphere limitation of the MSTM code. A more sophisticated coating model would be preferable, but it requires more complex scattering models, such as the discrete dipole approximation (DDA), which is computationally expensive. The optical features of the database are given below.
The real (n) and imaginary (k) parts of the refractive indices for BC and coating (non-absorbing organics) at different wavelengths (Kim et al., 2015) used in this study are summarized in Table A2.
Optical efficiencies (Q ext/abs/sca). The MSTM directly calculates the extinction efficiency (Q ext), absorption efficiency (Q abs), and scattering efficiency (Q sca) of the BC aggregate.
Optical cross-sections (C ext/abs/sca). The optical cross-section is the product of efficiency and geometric cross-section; see Eq. (A5) in Appendix A.
Asymmetry parameter (g). The asymmetry parameter is directly obtained from the MSTM, defined as the intensity-weighted average of the cosine of the scattering angle (Eq. A6 in Appendix A).
Single-scattering albedo (ω). The single-scattering albedo is the ratio of scattering efficiency (Q sca) to extinction efficiency (Q ext), given as Eq. (A7) in Appendix A.
Mass absorption cross-section (MAC). The mass absorption cross-section is calculated from the ratio of absorption cross-section (C abs) and mass (m) as detailed in Eq. (A8) in Appendix A. The three kinds of MAC calculated in this study are total mass absorption cross-section (MAC total), BC mass absorption cross-section (MAC BC), and coating mass absorption cross-section (MAC coating).
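The relations between the optical features can be sketched as follows, starting from Q abs and Q sca. The sketch uses Q ext = Q abs + Q sca (extinction is the sum of absorption and scattering) together with the definitions given above; the function name, the use of r o for the geometric cross-section, and the unit conventions (nm for lengths, g for masses, m² g⁻¹ for MAC) are assumptions made for illustration.

```python
import math

NM2_TO_M2 = 1e-18  # 1 nm^2 = 1e-18 m^2


def derived_optics(q_abs: float, q_sca: float, r_o_nm: float,
                   m_total_g: float, m_bc_g: float) -> dict:
    """Derive cross-sections, single-scattering albedo, and MAC from efficiencies."""
    q_ext = q_abs + q_sca                 # extinction = absorption + scattering
    c_geo = math.pi * r_o_nm ** 2         # geometric cross-section in nm^2
    c_abs = q_abs * c_geo                 # cross-section = efficiency * C_geo
    c_sca = q_sca * c_geo
    c_ext = q_ext * c_geo
    ssa = q_sca / q_ext                   # single-scattering albedo
    mac_total = c_abs * NM2_TO_M2 / m_total_g   # m^2 g^-1, per total mass
    mac_bc = c_abs * NM2_TO_M2 / m_bc_g         # m^2 g^-1, per BC mass
    return {"q_ext": q_ext, "c_abs": c_abs, "c_sca": c_sca, "c_ext": c_ext,
            "ssa": ssa, "mac_total": mac_total, "mac_bc": mac_bc}


example = derived_optics(q_abs=0.3, q_sca=0.1, r_o_nm=100.0,
                         m_total_g=1.0e-14, m_bc_g=0.5e-14)
```

Because the coating is non-absorbing and only adds mass, MAC BC is never smaller than MAC total in this formulation.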

Machine learning method for predicting optical properties of BC fractal aggregates
As mentioned in Sect. 1, several high-impact applications, such as climate modeling (Jacobson, 2001), depend on accurate optical properties for specific BC particles. Hence, we propose to train an ML model on a pre-computed database containing physicochemical and corresponding optical properties of BC fractal aggregates at several life cycle stages. This model will learn patterns and structures within the data and should generalize to unseen data values when used in applications, as evidenced by the success of ML in several domains (Radford et al., 2021; Ramesh et al., 2022). In this work, we train kernel ridge regression and a multi-layer perceptron on the database introduced in Sect. 2. The following sections detail our data processing routines, models, and evaluation procedures.

Data preprocessing
The subset of the database used as input was designed to include the critical parameters that influence the BC optical properties. As mentioned in Sect. 2.1, not all physical properties in the database are independent, as some can be derived from others using simple formulae. Including all properties as inputs for the ML model would thus present it with redundant information, increasing its computational overhead and possibly even harming its performance. The first criterion to narrow down the input parameters was broadly choosing the independent physicochemical parameters representing particle size and mixing state. The fractal dimension (D f) was used to represent the morphology of the BC fractal particles. The chemical mixing state is represented by the fraction of coating (f coating). The wavelength (λ) is also an input parameter. We made an exception for particle size, where we decided to keep four dependent parameters: outer primary particle size (a o), number of primary particles (N pp), outer volume equivalent radius (r o), and mobility diameter (D m). The reason for including all four size parameters is that, depending upon the focus of a study, the user may have more than one parameter representing the size. In this way, we could provide a more user-friendly prediction script in which the user has a choice to enter one or more of the four size parameters. Therefore, the subset of the database's properties used as input for our ML models is λ, D f, f coating, a o, N pp, r o, and D m. The range of each input parameter used for designing the prediction algorithm is summarized in Table A1. The input parameters needed when running the prediction script are λ; D f; f coating; and at least one of N pp, r o, and D m. Similarly, a BC fractal aggregate's optical properties are also not independent. Thus, we make the ML model predict only the following three properties and compute the rest using the formulae in Sect. A1: absorption efficiency (Q abs), scattering efficiency (Q sca), and asymmetry parameter (g).
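The input requirement described above (wavelength, D f, f coating, and at least one size parameter among N pp, r o, and D m) could be checked along the following lines. This is a hypothetical helper written for illustration; it is not the published wrapper script, and the only range check included is the D f range of 1.5 to 2.9 stated in Sect. 2.

```python
from typing import Optional


def check_inputs(wavelength_nm: float, d_f: float, f_coating: float,
                 n_pp: Optional[float] = None,
                 r_o: Optional[float] = None,
                 d_m: Optional[float] = None) -> dict:
    """Validate a prediction request and return it as a dictionary.

    Hypothetical helper: at least one of n_pp, r_o, d_m must be supplied.
    """
    sizes = {name: value
             for name, value in (("n_pp", n_pp), ("r_o", r_o), ("d_m", d_m))
             if value is not None}
    if not sizes:
        raise ValueError("provide at least one of n_pp, r_o, d_m")
    if not 1.5 <= d_f <= 2.9:
        raise ValueError("d_f lies outside the range covered by the database")
    request = {"wavelength_nm": wavelength_nm, "d_f": d_f, "f_coating": f_coating}
    request.update(sizes)
    return request


request = check_inputs(660.0, 1.9, 50.0, d_m=200.0)
```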
After feature selection, we transform input features using the Box-Cox transformation (Box and Cox, 1964), where we choose the transformation parameter by maximum-likelihood estimation. We also tried to apply the Box-Cox transformation to the target features, but, since this did not improve results, we decided not to use any transformation on the target features for the experiments that we report in Sect. 4. To find a suitable regression model, we conducted experiments with multiple ML-based models for regression, including support vector regression (SVR), ridge regression (RR), kernel ridge regression (KRR), and artificial neural networks (ANNs). Each model was evaluated using mean absolute error (MAE) on the sample dataset. The results showed that kernel ridge regression and neural networks demonstrated better performance, especially in capturing the non-linear relationships within the dataset. Hence, we used KRR and neural networks for further analysis.

Kernel ridge regression
Let $\{(x_n, y_n)\}_{n=1}^{N}$ denote the training data, with $x_n \in \mathbb{R}^{D}$ and $y_n \in \mathbb{R}^{D'}$ for all $n \in \{1, \ldots, N\}$. Kernel ridge regression (KRR) (Shawe-Taylor and Cristianini, 2004) learns a function of the form

$$f(x) = \sum_{n=1}^{N} \alpha^{*}_{n}\, k(x_n, x),$$

where $k$ is a so-called kernel function (Cortes and Vapnik, 1995) and $\alpha^{*} \in \mathbb{R}^{N \times D'}$ is a solution of the following convex optimization problem:

$$\min_{\alpha \in \mathbb{R}^{N \times D'}} \; \|K\alpha - Y\|_F^2 + \lambda\, \mathrm{tr}\!\left(\alpha^{\top} K \alpha\right), \tag{2}$$

where $Y \in \mathbb{R}^{N \times D'}$ contains the targets $y_n$ as rows and $K \in \mathbb{R}^{N \times N}$ is the so-called kernel matrix defined by $K_{ij} := k(x_i, x_j)$. A popular choice for the kernel function is the Gaussian or radial basis function (RBF) kernel

$$k(x, x') := \exp\!\left(-\gamma\, \|x - x'\|_2^2\right),$$

where $\gamma \in \mathbb{R}_{+}$ is a parameter called bandwidth and $\|x\|_2 := \sqrt{\sum_{d=1}^{D} x_d^2}$ denotes the $L_2$-norm. We use scikit-learn's KRR implementation with the RBF kernel for our experiments. This method has two hyperparameters that need tuning: the RBF kernel's $\gamma \in \mathbb{R}_{+}$ and $\lambda \in \mathbb{R}_{+}$ (see Eq. 2). We optimize hyperparameters using grid search; please see Table B2 for the grid and Sect. 3.4 for more detailed information on our evaluation procedure.
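For illustration, KRR with an RBF kernel has a closed-form solution, α* = (K + λI)⁻¹ Y, which the following NumPy sketch implements on synthetic data. It is not the scikit-learn implementation used in the study; the data, the hyperparameter values, and the function names are assumptions chosen only to make the example self-contained.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, gamma: float) -> np.ndarray:
    """Pairwise RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def krr_fit(x: np.ndarray, y: np.ndarray, gamma: float, lam: float) -> np.ndarray:
    """Return alpha* = (K + lam * I)^-1 Y, one column per output dimension."""
    k = rbf_kernel(x, x, gamma)
    return np.linalg.solve(k + lam * np.eye(len(x)), y)


def krr_predict(x_train: np.ndarray, alpha: np.ndarray,
                x_new: np.ndarray, gamma: float) -> np.ndarray:
    """Evaluate f(x) = sum_n alpha_n * k(x_n, x) for each row of x_new."""
    return rbf_kernel(x_new, x_train, gamma) @ alpha


# Synthetic stand-in for the database: one input feature, two target features.
GAMMA, LAM = 10.0, 1e-4                      # illustrative hyperparameters
x_train = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y_train = np.hstack([np.sin(2 * np.pi * x_train), np.cos(2 * np.pi * x_train)])
alpha = krr_fit(x_train, y_train, GAMMA, LAM)
y_new = krr_predict(x_train, alpha, np.array([[0.25]]), GAMMA)
```

In scikit-learn, the corresponding estimator is `sklearn.kernel_ridge.KernelRidge(kernel="rbf", alpha=..., gamma=...)`, where the `alpha` argument plays the role of the regularization constant λ.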

Artificial neural networks
Artificial neural networks (ANNs) constitute one of the founding pillars of ML's success during the last 10 years. Originally, their design was inspired by the structure of neurons inside the nervous system of several organisms (Rosenblatt, 1958). Most designs used in practice nowadays abandoned that idea, but the name remains.
In our experiments, we use a feed-forward ANN, sometimes also called a multi-layer perceptron (MLP). It consists of an arbitrary number (L ≥ 2) of layers, of which the first is called the input layer, the last is called the output layer, and all layers in between are called hidden layers. Each layer consists of a certain number of neurons, which are connected to the neurons in the previous and following layers.
Formally, we can define an MLP as a function $f: \mathbb{R}^{D} \to \mathbb{R}^{D'}$ with

$$f := f^{(L-1)} \circ \cdots \circ f^{(1)},$$

where each $f^{(l)}: \mathbb{R}^{D^{(l)}} \to \mathbb{R}^{D^{(l+1)}}$ represents a connection between two layers. They are defined as

$$f^{(l)}(x) := \sigma^{(l)}\!\left(W^{(l)} x + b^{(l)}\right),$$

where $W^{(l)} \in \mathbb{R}^{D^{(l+1)} \times D^{(l)}}$ and $b^{(l)} \in \mathbb{R}^{D^{(l+1)}}$ are learnable parameters and $\sigma^{(l)}$ is a so-called activation function that is applied separately to each element of its input vector. Common choices for $\sigma^{(l)}$ include the rectified linear unit (ReLU) $\sigma^{(l)}(x) = \max(x, 0)$ or the tanh function. We use the same activation function for each layer except the last, where we always use the identity function, i.e., $\sigma^{(L-1)}(x) := x$. Finally, $D^{(l)} \in \mathbb{N}$ denotes the number of neurons in layer $l$, with $D^{(1)} = D$ and $D^{(L)} = D'$. The number of hidden layers, the number of neurons in those hidden layers, and the activation function are usually chosen by a human before training a neural network. Together, they define the architecture of the MLP. We can learn values for the parameters $W := (W^{(1)}, \ldots, W^{(L-1)})$ and $b := (b^{(1)}, \ldots, b^{(L-1)})$ by minimizing a so-called loss function $\mathcal{L}$ over the training data:

$$\min_{W,\, b} \; \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}\!\left(f(x_n), y_n\right). \tag{5}$$

When solving a regression problem, the most common choice for $\mathcal{L}$ is the squared loss $\mathcal{L}(\hat{y}, y) := \|y - \hat{y}\|_2^2$, but practitioners sometimes use other loss functions as well, for example, the Huber loss (Huber, 1964):

$$\mathcal{L}_{\delta}(\hat{y}, y) := \begin{cases} \tfrac{1}{2}\,(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta, \\ \delta \left(|y - \hat{y}| - \tfrac{1}{2}\,\delta\right) & \text{otherwise,} \end{cases}$$

where $\delta \in \mathbb{R}_{+}$ determines the cut-off point between squared and absolute loss and is usually chosen as $\delta = 1$. The entire procedure of adapting an ANN's parameters using a given dataset is called training in the ANN literature. Note that, in general, Eq. (5) is not convex and does not have a closed-form solution. Hence, practitioners use gradient-based optimization methods, i.e., variants of minibatch stochastic gradient descent (SGD) (Bottou et al., 2018), to find a local minimum of Eq. (5).
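The layer-wise definition of the MLP and the Huber loss can be made concrete with a short NumPy sketch of the forward pass. It is not the Keras model used in the study: the layer sizes (7 inputs, two hidden layers of 32 neurons, 3 outputs), the ReLU activation, and the random initialization are assumptions for illustration, and no training step is shown.

```python
import numpy as np


def init_mlp(layer_sizes, seed=0):
    """Create one (W, b) pair per connection between consecutive layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in)),
             np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def mlp_forward(params, x):
    """Apply f = f^(L-1) o ... o f^(1); ReLU on hidden layers, identity on the last."""
    h = x
    for index, (w, b) in enumerate(params):
        z = h @ w.T + b
        is_last = index == len(params) - 1
        h = z if is_last else np.maximum(z, 0.0)
    return h


def huber(y_pred, y_true, delta=1.0):
    """Element-wise Huber loss: quadratic below delta, linear above."""
    r = np.abs(y_true - y_pred)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))


# 7 inputs (wavelength, D_f, f_coating, a_o, N_pp, r_o, D_m), 3 outputs (Q_abs, Q_sca, g).
params = init_mlp([7, 32, 32, 3])
batch = np.zeros((5, 7))
outputs = mlp_forward(params, batch)
```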
For our experiments, we implemented an MLP using Keras. Section B3 contains the hyperparameter grid for the MLP's architecture and training procedure.

Evaluation procedure
In the case of kernel ridge regression, regularization is carried out by the regularization constant λ with a chosen optimal value of 0.0001. For neural networks, we tested the dropout technique to prevent overfitting. However, dropout regularization did not show notable improvements in the model's generalization. After preprocessing, we split the database into a training set and a test set. Models perform their training procedures and hyperparameter tuning on the training set only, and we then evaluate the model's performance exclusively on the test set. We consider three different methods of performing this split; each one is intended to measure a different aspect of the model's performance. We use the mean absolute error (MAE) as our primary performance metric: given a dataset $\mathcal{D} \subset \mathbb{R}^{D} \times \mathbb{R}^{D'}$ and our prediction model $f: \mathbb{R}^{D} \to \mathbb{R}^{D'}$, we can compute the MAE as follows:

$$\mathrm{MAE}(f, \mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{(x,\, y) \in \mathcal{D}} \|f(x) - y\|_1,$$

where $\|z\|_1 := \sum_{d=1}^{D'} |z_d|$ is the $L_1$-norm. Regardless of the split strategy, we split the training set once more into a train and a validation set using the random-split method during the training phase. Here, we again use 30 % of the data for validation and the remaining 70 % for training. Our models then train on the train set for all possible hyperparameter configurations defined in the grid, and we record the MAE on the validation set for each combination. Finally, we choose the combination with the lowest MAE and evaluate the corresponding model's MAE on the test set.
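The selection procedure described above (70/30 random split of the training set, MAE on the validation part, lowest MAE wins) can be sketched in plain Python. The `fit` callable, the candidate list, and the toy data are hypothetical placeholders; the MAE here averages the L1-norm of the prediction error over all samples, which is an assumed reading of the metric.

```python
import random


def mae(predict, data):
    """Mean over samples of the L1-norm of (prediction - target)."""
    total = 0.0
    for x, y in data:
        y_hat = predict(x)
        total += sum(abs(p - t) for p, t in zip(y_hat, y))
    return total / len(data)


def random_split(data, holdout=0.3, seed=0):
    """Shuffle and split into (kept, held-out) parts; 30 % held out by default."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_hold = int(round(holdout * len(items)))
    return items[n_hold:], items[:n_hold]


def select_by_validation(candidates, train, fit):
    """Train every candidate on 70 % of `train`; return the one with lowest validation MAE."""
    sub_train, validation = random_split(train)
    scored = [(mae(fit(config, sub_train), validation), config) for config in candidates]
    return min(scored, key=lambda item: item[0])[1]


# Toy stand-in: targets are inputs shifted by 0.5; a "model" is just a constant shift.
toy_data = [((float(i),), (float(i) + 0.5,)) for i in range(10)]
toy_train, toy_test = random_split(toy_data)


def toy_fit(shift, _train):
    return lambda x: (x[0] + shift,)


best_shift = select_by_validation([0.0, 0.5, 1.0], toy_train, toy_fit)
test_mae = mae(toy_fit(best_shift, toy_train), toy_test)
```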

Performance of the machine learning models
The error distributions for the ML methods are presented in Fig. 3 for different experimental scenarios of the data splitting with respect to the parameter fractal dimension. The median error is close to zero for the random and interpolation splits, meaning our models do not generally over- or underestimate any optical value. The distribution of errors (excluding outliers) for the random and interpolation splits is relatively narrow, indicating that most test points have minor errors. In the extrapolation case, both ML models exhibit bias, such as overestimation of Q sca by ANN and overestimation of g by KRR. However, the mean absolute error, even for the extrapolation split, is 1.5 % to 8 %, which is still within reasonable limits. Luo et al. (2018a) showed that their model has considerable difficulties when attempting to predict optical properties for parameters not in the range of the training data. However, adding a few data points to extend any parameter range significantly improved the prediction ability of the ML algorithm. The interpolation and extrapolation results are similar if training data and test data are split according to the parameters of the coating f coating and particle size D m. The Appendix provides a more detailed discussion about the interpolation and extrapolation results for parameters of f coating and D m in Figs. C1 and C2, respectively. Overall, the narrow boxplots of the errors in the random split demonstrate the effectiveness of the ML algorithms in predicting the optical properties of coated BC fractal aggregates.
The MAEs for our experiments are reported in Table 1. In the case of the random split, both ML models are highly accurate, with the percentage of MAEs ranging from 0.1 % to 0.4 % when compared to the average feature range. Lamb and Gentine (2023) reported mean absolute percentage errors (MAPEs) between 2 % and 9 % for their optical predictions, whereas Luo et al. (2018a) reported relative errors between 1 % and 5 %. The MAPE is biased by the magnitude of the true value in its denominator: the same absolute error can result in a significantly different MAPE depending on the magnitude of the true value it is divided by. In our view, the prediction error should be weighted equally for all points; therefore, we chose the MAE as our error metric. Lamb and Gentine (2023) also discussed how the bias of MAPEs resulted in higher values of nearly 70 % for smaller particles. Error distributions for the ML methods shown in Fig. 3 are presented in terms of MAPE in the Supplement. The comparison of the two ML methods for random split in Table 1 showed that KRR generally results in a lower MAE for predictions of Q abs and Q sca. In contrast, the ANN predicted g with a lower MAE. In line with expectations, the MAE for the splits based on interpolation and extrapolation is somewhat higher. The errors, however, are still regarded as relatively minor compared to the features' range. The extrapolation and interpolation experiments were used to test the performance of the ML algorithm under various scenarios of data available for training. The ML models we publish for use in applications were trained on the entire dataset using the best parameters from the random-split experiments. As a result, the errors should be similar to those we report for the random split here.
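The difference between MAE and MAPE can be seen in a two-point example: both predictions below miss the true value by the same absolute amount, yet their percentage errors differ by a factor of 100. The numbers are made up for illustration and are not taken from the database.

```python
def mae(true_values, predictions):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(true_values, predictions)) / len(true_values)


def percentage_errors(true_values, predictions):
    """Absolute percentage error of each point (true values must be non-zero)."""
    return [100.0 * abs(t - p) / abs(t) for t, p in zip(true_values, predictions)]


def mape(true_values, predictions):
    """Mean absolute percentage error."""
    errors = percentage_errors(true_values, predictions)
    return sum(errors) / len(errors)


# A small and a large true value, both predicted with an absolute error of 0.01.
true_values = [0.02, 2.0]
predictions = [0.03, 2.01]

per_point = percentage_errors(true_values, predictions)  # about 50 % and 0.5 %
overall_mae = mae(true_values, predictions)              # about 0.01
overall_mape = mape(true_values, predictions)            # dominated by the small value
```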
A one-to-one comparison was performed between the estimates and true values to understand better how the ML methods predict optical properties.Figure 4 compares the estimated and true values for the wavelength of 660 nm when the training and test data are randomly split.The values of Qabs , Qsca , and ĝ obtained from the KRR and ANN methods are compared to the true values derived from the MSTM method.The performance of both ML methods was studied for BC fractal aggregates with three representative morphologies and coating fractions (D f = 1.5 and f coating = 0 %; D f = 2.1 and f coating = 50 %; D f = 2.7 and f coating = 90 %).There was reasonable agreement between KRR and ANN for all sub-cases.Therefore, the machine learning models appear applicable in a broader context.The model does not overfit with different coating fractions and complex morphologies.The one-to-one comparison results agree with the results from Lamb and Gentine (2023), which also showed reasonable predictions of Qext , Qsca , Qabs , and ĝ across the entire range of size parameters.During their lifetime, BC fractal aggregates undergo complex changes in size, composition, and morphology due to atmospheric processing.Figure 5 shows a visualization of how the ML predictions compare to the MSTM reference for different aging scenarios for BC fractal aggregates.It compares the estimated and true values of the optical properties for the random split.The models trained using a random split of training data generally show a good agreement with the ground truth data over the entire range of D m .Overall, the KRR predictions are very close to the true values throughout the entire range of D m for all nine cases in Fig. 
5. The ANN predictions slightly deviate from the true value for cases with larger f coating. For example, in the case of f coating = 90 % and D f = 1.5, ANN underestimates Qabs. Lamb and Gentine (2023) showed comparatively more deviation in the predictions for larger pure BC fractal particles than for smaller particles. In this study, KRR and ANN predictions were consistently good for pure BC fractal particles (first row in Fig. 5), although we could observe deviations from the true values for large, aged, coated particles (last row in Fig. 5). Appendix C3 contains plots similar to Fig. 5 for the interpolation and extrapolation splits. In general, errors increase with increasing aggregate size for the interpolation and extrapolation splits. The ML models we publish are based upon the random-split experiments, and Fig. 5 shows how well both ML methods estimate the optical properties of BC fractal aggregates at each aging stage.
https://doi.org/10.5194/acp-24-8821-2024 Atmos. Chem. Phys., 24, 8821-8846, 2024
Apart from making accurate predictions, our ML models should also be fast to provide a benefit over time-consuming simulations. Hence, we recorded the time needed to train on the entire training dataset and the time to make a single prediction in Table 2. The prediction time of both algorithms is less than 1 ms, which is a drastic improvement compared to the MSTM method, which can take up to 24 h, depending on the particle. It should be noted that the prediction time for ANN does not depend on the input data. Training the models takes comparatively longer, but it is usually done offline. Therefore, it is irrelevant for users who apply the pretrained models we provide (see section "Code availability").
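The split between an expensive offline training step and a cheap prediction step can be sketched for kernel ridge regression as follows. This is only an illustration under stated assumptions: the data are synthetic random numbers, the RBF kernel, its width `gamma`, and the regularization strength `reg` are arbitrary choices, and the textbook closed form alpha = (K + reg * I)^(-1) Y is used, whose scaling may differ from the formulation in this paper. It is not the published prediction script.

```python
import time
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-gamma * np.clip(sq_dist, 0.0, None))

rng = np.random.default_rng(0)
x_train = rng.uniform(size=(500, 4))  # synthetic stand-in for four input features
y_train = rng.uniform(size=(500, 3))  # synthetic stand-in for (Q_abs, Q_sca, g)
reg = 1e-3                            # assumed regularization strength

# Training: solve the regularized linear system once (done offline).
t0 = time.perf_counter()
gram = rbf_kernel(x_train, x_train) + reg * np.eye(len(x_train))
alpha = np.linalg.solve(gram, y_train)
train_seconds = time.perf_counter() - t0

# Prediction: one kernel row times the stored coefficients.
x_new = rng.uniform(size=(1, 4))
t0 = time.perf_counter()
y_new = rbf_kernel(x_new, x_train) @ alpha
predict_seconds = time.perf_counter() - t0

print(f"training: {train_seconds:.4f} s, single prediction: {predict_seconds:.6f} s")
```

The timings depend on the machine and the dataset size, so they are not comparable to the values in Table 2.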

Comparison to black carbon laboratory measurements
Incorporating the fractal morphology of BC in global model calculations is essential, as the BC radiative forcing can increase up to 61 % compared to a more compact and aged particle (Romshoo et al., 2021).In the atmosphere, BC fractal aggregates are primarily found in conjunction with other aerosol types, such as organic carbon.It is therefore more relevant to predict the optical properties of BC fractal aggregates with organic coatings for atmospheric applications.To give an example of applying the ML algorithm to real-world atmospheric research, we predicted the optical properties of laboratory-generated soot for experiments described in Table 1 of our previous study (Romshoo et al., 2022).
The ML-based predictions were compared to the averages of each experimental case, represented by one data point in Fig. 6. The ML results correspond to KRR, the default algorithm used in the prediction script. The details of the laboratory experiments and instrumentation are given in Appendix D. Figure 6a compares the single-scattering albedo (ωML) predicted by the ML algorithm with the measured ω from the laboratory experiment. The ωML predictions are in good agreement with the measured results for a range of f organics going up to 55 %. The uncertainty of nearly 10 % in the measured SSA (Weber et al., 2022) is well represented within the 95 % confidence band of the ML-based predictions. In contrast, Fig. 6b demonstrates that, if the conventionally used Mie core-shell theory is applied, the predictions are overestimated by a large margin. The ML predictions of MAC are also compared to the measured MAC and the Mie-based predictions, whose results are given in Fig. D1 of Appendix D. The predictions MAC ML were found to be less sensitive to the change in D mob. Due to a lack of monodisperse mass measurements, comparing the predictions and measured values is not straightforward. However, one can see that the discrepancies in the ML-based predictions of MAC are comparatively lower than those of the Mie-derived MAC values. The sensitivity of the predicted MAC and SSA to changes in input parameters, such as D mob, D f, f coating, and a, has been extensively discussed by Romshoo et al. (2021, 2023b) and Smith and Grainger (2014). The recommendations given by the above studies have been adapted for obtaining the results in Figs. 6 and D1 and are discussed in detail in Appendix D. For future applications, it is recommended that ambient or laboratory datasets with a time resolution of 30 min or longer are used to minimize the interference of instrumental uncertainty due to noisy data. Similarly, for ambient or laboratory closure studies, it is recommended that the model output be compared with averaged optical observations.
Based on the success of the ML-based approach in predicting the optical properties of coated BC particles, it has great potential for future development to predict the optical properties of mixtures of BC and other aerosols. Because such a study would be exhaustive, we initially tested this approach on BC fractal aggregates and organic coatings to determine its effectiveness. Further research is necessary to develop an ML algorithm with features representing different morphological shapes and other chemical compositions, such as inorganics. In the long run, the goal should be to develop an ML algorithm that can be used to integrate all atmospheric aerosols into global climate models. To develop such a universal algorithm for all atmospheric aerosols, we must incorporate the conventional spherically shaped particles into the current prediction algorithm to represent the fraction of aged aerosols. In this study, due to the experimental design of Romshoo et al. (2022), we could only test the ML-based prediction algorithm for particles with f organics of less than 65 %. The extension of the current algorithm to include more parameters also demands closure studies using more datasets of laboratory and ambient measurements.

Limitations and future challenges
The experiments conducted for this study show that our ML methods predict the optical properties of BC fractal aggregates with high accuracy as long as they are trained on sufficient data.However, the interpolation and extrapolation experiments show that the performance of both KRR and ANN significantly deteriorates when entirely removing certain ranges from the training data.This suggests that our models possess only limited generalization capabilities.Still, it should be noted that we train the models for practical use on the entire physically feasible range of D f and f coating .
Hence, those models will not have to extrapolate for any reasonable inputs.
Our models treat the wavelength λ as a continuous variable, meaning they should support computing optical properties at wavelengths that are not part of the training data. The prediction script can predict the optical properties well for the range between 467 and 660 nm and for points close to the upper and lower limits. However, we did not test the models' generalization capabilities with respect to the wavelength, since omitting just one wavelength from the training data would reduce the dataset size by one-third. Generating more ground truth data for other wavelengths requires refractive indices of BC and organics for that specific wavelength, which are unavailable in the literature. Even if they were available, generating the data would be time-consuming, as MSTM simulations can take a long time to compute. Nevertheless, examining the models' generalization capabilities on other wavelengths in the future would be interesting.
In this study, the ML-based prediction algorithm is developed using training data with N pp up to 1000, which corresponds to particles with a maximum D mob of 1561 nm, depending on f coating. This range of particle sizes was chosen while designing the database, considering the realistic size of BC-containing particles in the atmosphere. TEM analysis has shown a high probability that BC-containing particles smaller than 1500 nm will be fractal (Adachi et al., 2016; Wang et al., 2017). The ML algorithm developed in this study, which is based on a closed-cell coating model, is suitable for such particles smaller than 1500 nm. However, when aerosol particles grow larger, the mass of BC decreases significantly compared to the mass of coating (Adachi et al., 2016). For such cases of aged BC, using the conventional core-shell-based spherical morphology is appropriate. This is why we limited our training data range for particle size to 1561 nm. However, as demonstrated by Luo et al. (2018a), adding a few points to the training data significantly improves the extrapolation efficiency of machine learning models. Furthermore, some studies show that the optical properties are not sensitive to the change in the primary particle size a. Therefore, we fixed a i to 15 nm and varied a o from 15.1 to 29 nm depending on f coating. Similarly to the parameters related to particle size, such as N pp, r o, and D m, adding a few data points for a i or a o can help optimize the extrapolation ability of the ML-based prediction algorithm. Although future studies can extend the model's extrapolation ability, the particle size range of the current prediction algorithm covers the physically feasible cases for BC fractal aggregates.
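The quoted upper size limit can be checked with a short calculation. The sketch below assumes a Sorensen-type scaling of the form D_m = 2 a_o 10^(-2x + 0.92) N_pp^x with the mobility mass scaling exponent fixed at x = 0.51; both the functional form and the fixed exponent are assumptions made here for illustration (in Appendix A, x depends on the Knudsen number), and the function name is hypothetical.

```python
def mobility_diameter_nm(n_pp, a_o_nm, x=0.51):
    """Mobility diameter in nm, assuming D_m = 2 * a_o * 10**(-2x + 0.92) * N_pp**x.

    n_pp   : number of primary particles
    a_o_nm : outer (coated) primary particle radius in nm
    x      : mobility mass scaling exponent (assumed constant here)
    """
    return 2.0 * a_o_nm * 10.0 ** (-2.0 * x + 0.92) * n_pp ** x

# Largest aggregate in the training data: N_pp = 1000 with the thickest coating.
d_m_coated = mobility_diameter_nm(n_pp=1000, a_o_nm=29.0)
# Same aggregate without coating (a_o equals a_i = 15 nm).
d_m_bare = mobility_diameter_nm(n_pp=1000, a_o_nm=15.0)

print(f"coated: {d_m_coated:.0f} nm, bare: {d_m_bare:.0f} nm")
```

Under these assumptions, the coated case reproduces the maximum D mob of about 1561 nm stated above, which also shows why the upper size limit depends on f coating.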
Both KRR and ANN provide only a single-point prediction for each input.In particular, their estimate does not quantify any uncertainty in the prediction.Bayesian ML methods such as Gaussian process regression (Rasmussen and Williams, 2005) can provide information about the uncertainty of a prediction via credible intervals as they return an entire probability distribution instead of a single-point estimate.Thus, it would be interesting to examine Bayesian ML for the prediction of BC fractal aggregates' optical properties.This method could be further developed for reporting the predictions for an ensemble of BC-containing aerosols with various physicochemical properties.However, applying them directly to our problem is not trivial, since the assumptions made by their statistical model (e.g., target variables follow a multivariate Gaussian distribution) are often violated in practice.Therefore, we leave the application of Bayesian ML to the BC aerosol problem to future work.
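To make concrete what a Bayesian method returns, the following minimal sketch computes the posterior mean and a pointwise 95 % credible interval of a zero-mean Gaussian process with an RBF kernel. The one-dimensional toy data, the kernel length scale, and the noise level are assumptions for illustration only; nothing here is fitted to the BC database.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    solved = np.linalg.solve(k_tt, k_ts)       # K^{-1} K_*
    mean = solved.T @ y_train
    cov = k_ss - k_ts.T @ solved
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Synthetic toy data on [0, 1].
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2.0 * np.pi * x_train)
x_test = np.array([0.5, 2.0])  # inside vs. far outside the training range

mean, std = gp_posterior(x_train, y_train, x_test)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approx. 95 % credible interval

for x, m, lo_, hi_ in zip(x_test, mean, lower, upper):
    print(f"x = {x:.1f}: mean = {m:+.3f}, 95 % interval = [{lo_:+.3f}, {hi_:+.3f}]")
```

The interval is narrow inside the training range and widens far outside it, which is the kind of information a single-point estimate from KRR or ANN does not provide.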
Atmospheric BC can exhibit a wide range of morphologies, showing diversity at different locations (Sedlacek et al., 2022). It was observed that aged transported soot can retain its fractal morphology 500 to 1000 km downwind of emission sources (Sun et al., 2020). The current state of the art for representing atmospheric soot particles focuses on spherical morphology (Aquila et al., 2011; Stier et al., 2005; Bauer et al., 2008). The model provided in this study was designed to simulate the optical properties for the entire BC life cycle, capturing the transition between fresh fractal and aged spherical particles. Furthermore, the calibration of light-absorption measurement devices is mostly done with fresh soot. We can link to atmospherically relevant absorption by simulating mass absorption cross-sections and light-absorption enhancement factors. The coating model used in this study is called the "closed-cell model", and the results showed good comparability with the realistic coating model (Kahnert, 2017). A more sophisticated coating model would be a good choice, but it requires more complex scattering models such as the discrete dipole approximation (DDA), which is computationally expensive. With the DDA method, generating elaborate datasets for training ML algorithms is not feasible. We provide a method that predicts the optical properties of a wide range of ambient soot particles with high accuracy. Therefore, the results of this study are valuable for the simulation of realistic scenarios, despite the model limitations. There is scope for future studies to extend such an ML-based approach using other morphological models of BC and coating positions.

Conclusions
The present study demonstrated that the predictions of BC optical properties can be improved by incorporating their realistic morphologies.Unlike the computationally intensive simulations of complex scattering models, the ML-based approach accurately predicts optical properties in fractions of a second.In conjunction with a laboratory dataset, it was shown that optical properties like single-scattering albedo ω and mass absorption cross-section (MAC) can be predicted with greater accuracy than with a Mie-based approach.Using an extensive database for the physicochemical and optical properties of BC fractal aggregates, we trained two ML models -KRR and ANN -that can be used to predict the optical properties of coated BC aggregates at all aging stages.In particular, we could accurately predict the optical properties in the visible spectrum for BC fractal aggregates of any desired size, shape, and fraction of organic coating.Thus, this work illustrates the use of this realistic approach in real-world atmospheric research applications.
We summarize the key conclusions of the study as follows.
-Active investigation area.BC is a highly relevant and active field of research, as it affects the climate system and human health.Global climate models require information about the optical properties of BC to simulate their radiative forcing.BC research will benefit from using this ML algorithm to generate the optical properties of BC based on more realistic fractal aggregates.
-Broader application. The ML algorithm can predict the optical properties (absorption efficiency, scattering efficiency, and asymmetry parameter) for a wide range of BC fractal aggregates with physicochemical properties specified by particle size, morphology, and coating fraction. Previous studies did not consider the critical parameter of coating fraction in their ML models. Moreover, even though we discuss the results in terms of the number of primary particles (N pp), the user is additionally able to specify the particle size in terms of volume equivalent diameter (R v) or mobility diameter (D m), depending on the numerical or in situ-based nature of the study. We tested the use of the ML algorithm for predicting the scattering properties of laboratory-generated soot particles and found that it agreed well with the measured values.
-User-friendly. We published a simple Python script on GitHub that allows users to predict optical properties for BC fractal aggregates using our pre-trained models. The user must specify the physicochemical properties of a BC fractal aggregate as a .csv file, from which the prediction script generates the corresponding optical properties using either KRR or ANN.
-Low computational and energy costs.Our ML models have a low computational cost, taking fractions of a second to provide the predictions on a run-of-the-mill desktop PC.The same optical properties could take more than 24 h to be generated when using a T-matrix optical model.Using such ML algorithms will thus reduce the energy expenditures associated with running optical models on supercomputers.
-Citability and reproducibility. The dataset used for developing the ML algorithm is available for download at Zenodo (Romshoo et al., 2023b). Furthermore, the baseline experiments can be reproduced with the code that is openly available on GitHub.
In summary, we demonstrated the feasibility of incorporating the realistic morphology of BC to improve the predictions of optical properties using a first-of-its-kind machine learning approach.This ML-based approach constitutes a significant step forward in BC aerosol research in two ways: firstly, it is the first attempt to provide optical properties of coated BC fractal aggregates at different stages of atmospheric aging using realistic representations.Secondly, this approach significantly reduces the heavy computational costs of using previous complex scattering models.Previous studies of BC avoid using complex scattering theories because of the high computational costs and prefer the more simplistic Mie theory.This research will be further developed in the future with the final goal of accurately predicting the optical properties of any mixture of atmospheric aerosols.We will investigate if the spherical core-shell model can be combined with the fractal aggregate-based ML model to distribute the weightage of light-absorption predictions for an ensemble of atmospheric BC aerosols with variable aging stages.

A1 Formulae
The volume equivalent radius (r) is defined as the radius of a sphere having the same volume as the BC fractal aggregate, given as

$$r = a \, N_{pp}^{1/3}, \qquad \text{(A1)}$$

where $N_{pp}$ is the number of primary particles and $a$ is the radius of a single primary particle. The outer volume equivalent radius ($r_o$) was calculated for the whole BC aggregate and for the coating using $a_o$. The inner volume equivalent radius ($r_i$) was calculated using $a_i$ for the BC aggregate without the coating, i.e., pure BC.

The mobility diameter of a sphere ($D_m$) was defined by Sorensen (2001) as

$$D_m = 2 a_o \, 10^{(-2x + 0.92)} \, N_{pp}^{x}, \qquad \text{(A2)}$$

where $N_{pp}$ is the number of primary particles; $a_o$ is the radius of a primary particle with coating; and $x$ is the mobility mass scaling exponent given by $x = 0.51\,Kn^{0.043}$, $0.46 < x < 0.56$. $Kn$ is the Knudsen number, which is the ratio of the molecular free path to the agglomerate mobility radius. The error estimated in the mobility mass scaling exponent ($x$) is ±0.02.

The relationship between the outer radius of the primary particle ($a_o$), the inner radius of the primary particle ($a_i$), and the fraction of organics ($f_{organics}$) is given as [...] (A3)

The geometric cross-section ($C_{geo}$) is the area of the cross-section of the volume equivalent sphere, given as

$$C_{geo} = \pi r^2. \qquad \text{(A4)}$$

The optical cross-sections ($C_{ext/abs/sca}$) are defined as the product of efficiency ($Q_{ext/abs/sca}$) and geometric cross-section ($C_{geo}$) as

$$C_{ext/abs/sca} = Q_{ext/abs/sca} \, C_{geo}. \qquad \text{(A5)}$$

The asymmetry parameter (or asymmetry factor) $g$ is defined as the average cosine of the scattering angle $\theta$:

$$g = \langle \cos\theta \rangle. \qquad \text{(A6)}$$

The single-scattering albedo ($\omega$) is derived from the ratio of the scattering efficiency ($Q_{sca}$) to the extinction efficiency ($Q_{ext}$) as

$$\omega = \frac{Q_{sca}}{Q_{ext}}. \qquad \text{(A7)}$$

The total mass absorption cross-section ($MAC_{total}$), BC mass absorption cross-section ($MAC_{BC}$), and coating mass absorption cross-section ($MAC_{coating}$) were calculated from the ratio of $C_{abs}$ to the total mass ($m_{total}$), BC mass ($m_{BC}$), and coating mass ($m_{coating}$), respectively, as

$$MAC_{total/BC/coating} = \frac{C_{abs}}{m_{total/BC/coating}}. \qquad \text{(A8)}$$

Figure C1 shows the residuals for the machine learning methods for the three splits related to the feature f coating: random, extrapolation (training data f coating = [0, 75)), and interpolation (training data f coating = [0, 35) ∪ (50, 90]). When the training and testing data are randomly split, we see that residual errors are concentrated near zero for all intervals of f coating, similar to Fig.
3. The errors from KRR and ANN are comparable in the random split. For the interpolation split, the errors from both the ANN and KRR models are comparatively higher for all three optical properties, i.e., Q abs, Q sca, and g. In the interpolation split, KRR performs better in predicting Q abs, whereas ANN performs better in predicting g. The errors in Q abs, Q sca, and g from the extrapolation split were the highest. The error is largest for the predictions when f coating = 90 %, which is the case farthest away from the training data during an extrapolation split. The relative performance of ANN and KRR is comparable to that observed in the interpolation split. It was observed that the predictions Qabs fitted well with the true values, especially for the KRR method. However, the predictions Qsca fluctuate from the true value Q sca as they approach maximum values above 1. For the predictions ĝ, the ML methods ANN and KRR perform slightly differently. In the case of the extrapolation split, as shown in Fig. 
C4, the predictions deviated from their true values for D f = 2.7, 2.9, since the ML models did not see the data.However, we can see that, for D f = 2.5 (first row), all the predictions are in better agreement with their true values, since it was present in the training data.The predictions Qabs and Qsca showed reasonable agreement in the case of D f = 2.7.The predictions Qsca for the unseen D f features were observed to be smaller than their true values.The predictions Qabs , Qsca , and ĝ are most inconsistent with their true values when D f = 2.9, which is the case farthest away from the training data.Therefore, it is demonstrated that there is comparatively higher uncertainty for predicting optical properties for features out of the range of the training data.Furthermore, the performance of KRR and ANN varied for different optical properties in such cases of interpolation and extrapolation split.The interpolation split performed better for predicting the optical properties out of the range of the training data.Therefore, adding more data in the training set for boundary values to let them interpolate would result in better predictions.

C3 Line plots showing performance as aggregate size changes
Figure C5 compares the machine learning predictions to their true values for interpolation split.The predictions for the case D f = 2.3 (middle row) showed the highest deviations from the true values, since it is the farthest point in the training data for the interpolation split.From the Qabs results, the KRR predictions were reasonable for the entire size range.The predictions for Qsca were also reasonable for KRR.However, after the particle size increased to larger than 500 nm, the prediction of Qsca using KRR was underpredicted.The prediction of Qsca using ANN showed a size-dependent behavior, under-predicting the results for certain particle sizes, after which there is an over-prediction.Similar size-dependent behavior was observed in the predictions ĝ from ANN and KRR.The ĝ predictions showed deviations from their true values as the particle size increased.In the case of interpolation split, the overfitting or underfitting is generally more pronounced in the larger particle size (> 500 nm).The explanation for this could be the lower resolution of the training data for particle size > 500 nm, which was a limitation of large computation time for larger particles and more coating fraction.
Similarly, Fig. C6 shows the machine learning predictions compared to the true values for the extrapolation split. To study the performance of KRR and ANN, the results for D f = 2.9 are interesting, since they are the farthest from the training data. The deviations of Qabs from the true values are larger in the case of KRR, which showed better performance in the interpolation split. However, the results for D f = 2.5 and D f = 2.7 are reasonable, since they are closer to the training dataset. The predictions Qsca were lower than the true values for ANN, especially as the particle size increased. The prediction ĝ was larger than its true value in the case of the extrapolation split. However, the predictions of ĝ from KRR showed an interesting size dependence unique to this split. When particle sizes were smaller, ĝ was higher than the true value, then decreased, and returned to higher levels once a certain threshold was reached. In general, when f coating is 90 %, which is the upper limit of the feature, the results for Qabs, Qsca, and ĝ showed an expected higher deviation from their true values for both the interpolation split and the extrapolation split.

Appendix D: Laboratory measurements of black carbon
The data from the laboratory experiments by Romshoo et al. (2022) are compared to the ML-based prediction model in Figs. 6 and D1. A mobility particle size spectrometer (MPSS; designed by the Leibniz Institute for Tropospheric Research (TROPOS)) measured the particle number size distribution of the black carbon particles. A cavity-attenuated phase-shift extinction monitor (CAPS PMex 630, Aerodyne Res. Inc., USA) measured the light extinction coefficient, σ ext, at a λ of 630 nm. The particle light-scattering coefficient, σ sca, was measured using a nephelometer (Aurora 4000, Ecotech, Melbourne, Australia) at a λ of 635 nm. A multi-angle absorption photometer (MAAP; Model 5012, Thermo Scientific, Franklin, MA) measured the particle light-absorption coefficient, σ abs, at a λ of 637 nm. The aerosol mass concentration for selected experiments was determined using the tapered element oscillating microbalance (1405 TEOM, Thermo Scientific, Franklin, MA). Aerosols were collected on quartz fiber filters and were analyzed by an EC-OC analyzer (Sunset Laboratory Inc., Hillsborough, USA). The input parameters used while running the prediction script are λ, D f, f coating, and D m. The parameter D m was chosen for particle size due to the MPSS measurements available in the experiment. A D f value of 1.7 was taken, as it represents laboratory-generated soot (Wentzel et al., 2003). The default a i value of 15 nm was used. Numerical studies have also investigated the sensitivity of modeled optical properties to input parameters like a, D f, and f coating (Romshoo et al., 2022; Luo et al., 2018b; Smith and Grainger, 2014). For example, Romshoo et al. 
(2022) recommended D f from 1.7 to 1.9 and a between 10 and 14 nm for laboratory-generated soot.The values of f coating for each experiment were derived from the EC-OC analysis results of the quartz fiber filters.The mean of the number size distribution measured by the MPSS was used as the input value for D m .There were 11 sub-cases of the laboratory experiment for which the means of D m and f coating were taken as input.The input parameters for the Mie core-shell theory were λ, f coating , and D m .The output parameters compared to the observations were SSA and MAC.The observational SSA was calculated from the ratio of σ sca and σ ext .The observational MAC was calculated from the σ abs and mass using Eq.(A8).The predicted SSA is compared to all 11 experimental cases for which the observational SSA was available (Table 1 in Romshoo et al., 2022).The uncertainty in the measured SSA is nearly 10 % (Weber et al., 2022).The uncertainties in the SSA are included in the 95 % confidence band of the ML-based predictions.The predicted MAC is compared to the 6 experimental cases of coated soot for which the observational MAC was available (last six rows in Table 1 in Romshoo et al., 2022).
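The derivation of the observational quantities can be sketched as follows. The function names, the chosen units, and the numerical values are hypothetical and serve only as an illustration; the sketch also ignores the small differences between the instrument wavelengths (630, 635, and 637 nm).

```python
def single_scattering_albedo(sigma_sca, sigma_ext):
    """Observational SSA as the ratio of scattering to extinction coefficient.

    Both coefficients must share the same unit (e.g., Mm^-1).
    """
    return sigma_sca / sigma_ext

def mass_absorption_cross_section(sigma_abs_per_mm, mass_conc_ug_m3):
    """Observational MAC in m^2 g^-1.

    sigma_abs_per_mm : absorption coefficient in Mm^-1 (1e-6 m^-1)
    mass_conc_ug_m3  : aerosol mass concentration in ug m^-3 (1e-6 g m^-3)
    The two factors of 1e-6 cancel, so the plain ratio is already in m^2 g^-1.
    """
    return sigma_abs_per_mm / mass_conc_ug_m3

# Hypothetical averaged measurements for one experimental case.
ssa_obs = single_scattering_albedo(sigma_sca=20.0, sigma_ext=100.0)
mac_obs = mass_absorption_cross_section(sigma_abs_per_mm=80.0, mass_conc_ug_m3=10.0)

print(f"SSA = {ssa_obs:.2f}, MAC = {mac_obs:.1f} m^2 g^-1")
```

Averaging the coefficients over each experimental case before taking the ratios follows the recommendation to compare model output with averaged observations.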
Code availability. A Python script that predicts the optical properties of BC fractal aggregates using the trained ML-based models is available in a GitHub repository at https://doi.org/10.5281/zenodo.8060206 (Romshoo et al., 2023d). To run the prediction script, the physicochemical properties need to be provided as a .csv file that contains the fractal dimension D f, the fraction of coating f coating, and the wavelength (λ) at which the optical properties should be calculated. Depending on the relevance, users may specify the particle size by giving the values of one among the number of primary particles (N pp), the mobility diameter (D m), or the outer volume equivalent radius (r o). If the input parameters are obtained from instrumental measurements, taking hourly or half-hourly averages is recommended to cancel the effect of noisy input parameters. The prediction script will generate a .csv file with the corresponding optical properties for the provided physicochemical properties. Please check the README file inside the repository for more detailed information on using the script.

Figure 1.Overview of the various features of the database for physicochemical and optical properties of black carbon fractal aggregates.The features are arranged based on the three steps of constructing this database.As the legend at the bottom indicates, the features are further divided into physicochemical properties, optical properties, and others.

Figure 2. Visualization of the various BC aggregate particles generated in this study.Fresh BC aggregates with no external coating are shown in panels (a) to (c).Semi-aged BC aggregates with 50 % coating are shown in panels (d) to (f).Aged BC aggregates with 90 % coating are shown in panels (g) to (i).
parameter that controls the influence of the regularization term; $Y = \left[y^{(1)}, \dots, y^{(N)}\right]^{T} \in \mathbb{R}^{N \times D}$; and $\|Z\|_{\mathrm{Fro}} := \sqrt{\sum_{n=1}^{N} \sum_{d=1}^{D} |z_{nd}|^{2}}$ denotes the Frobenius norm. Note that Eq. (2) has a closed-form solution: α

Figure 3. Boxplots summarizing the error between the predicted value ( Qabs , Qsca , ĝ) and the true value for three optical properties.The training data for the interpolation split consist of fractal dimensions in D f = [1.5, 2.1) ∪ (2.5, 2.9], whereas the extrapolation split uses D f = [1.5, 2.5).The lower and upper hinges of the boxplot represent the 25 % and 75 % quantile of the observations, respectively.Note that the outliers significantly reduced the visualization of the boxplots and were therefore omitted from the figures.However, all the outliers are considered in the training data and error evaluation.

Table 1. Mean absolute errors of the predicted optical properties for different experiments. The training data for the interpolation split consist of fractal dimensions in D f = [1.5, 2.1) ∪ (2.5, 2.9], whereas the extrapolation split uses D f = [1.5, 2.5).

Figure 4. Comparison of the predicted optical properties with their true values when the ML models are trained on a random subset of data.The data points for predicted optical properties correspond to KRR and ANN, as shown by the legend on the top right.The blue line in each panel of the figure corresponds to the one-to-one line between the x axis and the y axis.

Figure 5. Absorption efficiency (Q abs ) at a wavelength of 660 nm predicted using KRR and ANN for nine representative BC aggregates with a variety of morphologies (represented by D f ) and coatings (represented by f coating ).Both models were trained on a random split of training data.


Figure 6.Single-scattering albedo ω of coated BC particles at varying f organics , generated in a laboratory study using different miniCAST set points (Romshoo et al., 2022).Panel (a) compares the ωML with the measured ω from the laboratory experiment.The colored dots in the figure show the results from the MSTM-based database used for training the ML algorithm.Panel (b) compares the ωMie with the measured ω.The ML results correspond to KRR, the default algorithm used in the prediction script.Error bars along the x axis show the uncertainty in the measured ω.The colored dots are the ω from MSTM simulations.The black line represents a linear regression equation shown in the upper-left corner, with the coefficient of determination (R 2 ) in the upper-right corner of each panel.The gray area represents the 95 % confidence level interval for predictions.

Figure C2.Error between the predicted value ( Qabs , Qsca , ĝ) and the true value for three optical properties for various cases of mobility diameter (D mob ).The lower and upper hinges of the boxplot represent the 25 % and 75 % quantile of the observations, respectively.Note that the outliers significantly reduced the visualization of the boxplots and were therefore omitted from the figures.However, all the outliers are considered in the training data and error evaluation.
Figure C3. Comparison of the predicted optical properties with their true values for the interpolation split, in which the ML models are trained on data with boundary fractal dimensions (D_f = 1.5, 1.7, 1.9, 2.7, 2.9) and tested on data with inner fractal dimensions (D_f = 2.1, 2.3, 2.5).

Figure C5. Optical properties of BC fractal aggregates predicted using the machine learning methods KRR and ANN for the interpolation split, in which the models are trained on data with boundary fractal dimensions (D_f = 1.5, 1.7, 1.9, 2.7, 2.9) and tested on the intermediate fractal dimensions (D_f = 2.1, 2.3, 2.5). The three columns show the predicted values of absorption efficiency (Q_abs), scattering efficiency (Q_sca), and asymmetry parameter (g). Each row corresponds to the predictions for one of the intermediate fractal dimensions (D_f = 2.1, 2.3, 2.5).

Figure D1. Mass absorption cross-section (MAC) for coated BC particles generated in a laboratory study at different D_mob (Romshoo et al., 2022). Panel (a) compares MAC_ML with the measured MAC from the laboratory experiment. Panel (b) compares MAC_Mie with the measured MAC. The number of points used in this figure is smaller than in Fig. 6, as some of the data were excluded due to the uncertainties associated with the tapered element oscillating microbalance (TEOM) instrument.
1. Random split. We randomly assign each point in the database to either the training set or the test set. We use 30 % of the data for the test set and the rest for the training set. With this split, the feature distributions of the training set and the test set should be similar.
2. Interpolation split. This split measures the model's capability of interpolating predictions for data points it has not seen during training. Table B4 shows the features and ranges used for the two interpolation splits. The split was tested for D_f using training data of D_f = [1.5, 2.1) ∪ (2.5, 2.9], whereas training data of [0, 35) ∪ (50, 90] were used for testing f_coating.
3. Extrapolation split. Similarly to the interpolation splits, we also consider choosing a test set at the boundaries of certain features. This measures the model's extrapolation capabilities. Table B5 shows the features and ranges used for the four different extrapolation splits.
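The random and interpolation splits can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: each database entry is assumed to be a dictionary keyed by feature name (`"D_f"`, `"f_coating"`), and the function names and the seed are hypothetical. The extrapolation split follows the same pattern with the test set taken from the feature boundaries; its ranges are listed in Table B5 and are not reproduced here.

```python
import random


def random_split(samples, test_fraction=0.3, seed=0):
    """Randomly assign samples to training and test sets.

    Returns (train, test); test holds about test_fraction of the data.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def interpolation_split(samples, feature, lo, hi):
    """Hold out the inner range lo <= value <= hi of one feature.

    Training keeps everything outside [lo, hi], e.g. for D_f the
    training range [1.5, 2.1) U (2.5, 2.9] leaves [2.1, 2.5] for testing.
    """
    train = [s for s in samples if not (lo <= s[feature] <= hi)]
    test = [s for s in samples if lo <= s[feature] <= hi]
    return train, test
```

For example, `interpolation_split(samples, "D_f", 2.1, 2.5)` reproduces the D_f split described above, and `interpolation_split(samples, "f_coating", 35, 50)` the f_coating split.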

Table 2. Training time for 18 526 samples in the dataset and prediction time per sample in seconds. Values were recorded on a machine with an Intel(R) Core(TM) i7-9750H CPU, 8 GB RAM, and an NVIDIA GeForce GTX 1650 GPU.

Table B6. Maximum errors of different splits for their test sets.