Proof of Concept for Fast Equation of State Development Using an Integrated Experimental–Computational Approach

A multitude of industries, including energy and process engineering, as well as academia are researching and utilizing new fluid substances to further the aim of sustainability. Knowledge of the thermodynamic properties of these substances is a prerequisite if they are to be utilized to their fullest potential. To date, the way to acquire reliable knowledge of the thermodynamic behavior is through measurements. The ensuing experimental data are then used to develop equations of state, which efficiently embody the gained knowledge of the behavior of the fluid substance and allow for interpolation and, to some extent, extrapolation. However, the acquisition of low-uncertainty experimental data, and thus the development of accurate equations of state, is often time-consuming and expensive. For substances for which suitable force field models exist, molecular modeling and simulation are well suited to generate thermodynamic data or to augment experimental data, albeit at the expense of larger uncertainties. The major goal of this work is to present a new approach for the development of equations of state using (1) symbolic regression, a machine-learning-based model development approach, (2) optimal experimental design, and (3) efficient data acquisition.
We demonstrate this approach using the example of density data of an air-like binary mixture (0.2094 O2 + 0.7906 N2) over the temperature range from 100 K to 300 K at pressures of up to 8 MPa, which covers the gaseous, liquid, and supercritical regions. For this purpose, an experimental data set published by von Preetzmann et al. (Int. J. Thermophys. 42, 2021) and molecular simulation data sampled in this work are used. The two data sets are compared in terms of acquisition time, cost, and uncertainty, showing that an optimized combination of experimental and simulation data leads to lower cost while maintaining low uncertainties.


Introduction
For many tasks in chemical and energy engineering, the precise knowledge of thermodynamic properties and the phase behavior of the involved fluids plays a key role. In process design and simulation, such properties are calculated using equations of state (EOS). However, the quality of such calculations using EOS largely depends on the availability and accuracy of the underlying data. In addition to experimental data, other sources, such as molecular simulations (MS), can accelerate the development of EOS, as described by Span and coworkers [1]. One focus of the present work is the comparison between MS and experimental data for an air-like binary mixture over the temperature range from 100 K to 300 K at pressures of up to 8 MPa in terms of acquisition time, financial expenditure, and uncertainty.
Combining efficient data acquisition with modeling thermodynamic properties is challenging for several reasons. Moreover, at least two perspectives (model developer, experimenter) come into play:
• Model developer:
  - Which data source should be chosen for modeling?
  - Which functional form best fits the problem?
• Experimenter:
  - What does the model developer need?
  - Is there a preliminary model that can be used to optimize the experimental design?
To overcome the tedious modeling and measurement processes, Wilhelmsen et al. [2] also support the use of hybrid data sets that include MS or quantum chemical data [3] for new modeling approaches, as suggested by Rutkai et al. [4]. Since none of the currently existing EOS seem to have the perfect functional form [5], we propose a machine learning approach, namely symbolic regression (SR), for a fast development of simple thermodynamic models that are suitable as a basis for optimal experimental design (OED). In preceding publications, we used the SR software Eureqa [6] to create models for liquid densities (ρ = f(T, p)) of ethylene glycol [7] and methanol [8]. Furthermore, the SR software DataModeler [9] was used to fit a model for the refrigerant R-1243zf [10]. Eureqa and DataModeler, however, are not open source tools, and thus impose restrictions on the exposed parameters, extensibility, and automation capacity. As a consequence, we developed our own SR tool, which we refer to as thermodynamics-informed symbolic regression (TiSR) (Sect. 4). Although SR has been used in thermophysical property modeling before [11,12], it still does not appear to be widely used today. Many other scientific fields, however, apply SR methods frequently, so that a variety of open-source algorithms has been developed in recent years [13]. In addition to a new modeling approach, a further objective of this paper is to demonstrate the utility of OED for thermodynamic property measurements.
OED has the potential to reduce the experimental effort by selecting experiments containing the highest amount of information about the model parameters. Bardow and colleagues [14][15][16] successfully applied OED in the context of thermophysical property measurements. Moreover, we have recently shown how OED can decrease the experimental effort and still lead to satisfactory equations when compared to conventional experimental designs [17]. In the present work, a new method for efficient data acquisition by combining SR based modeling, OED, and hybrid data sets based on MS and experiments is proposed.
The following section (Sect. 2) provides a short overview of the new, synergistic procedure, some associated challenges, and the suggested solutions to those. Sect. 3 deals with the MS data: it describes how the data were sampled and compares them to the experimental data in terms of uncertainty as well as financial and temporal expenditure. In Sect. 4, our new SR tool TiSR is briefly described and classified, and it is shown how to use it to develop a preliminary model as a basis for OED. Sect. 5 specifies how OED is applied and constrained in the present work, while Sect. 6 details the results, and Sect. 7 includes a summary of our approach and an outline for future work.

Equation of State Development Strategy
Within the scope of our ongoing research, we aim to devise an efficient procedure for data acquisition and EOS development using a machine learning (ML) and OED based approach. Figure 1 gives a schematic overview of the proposed acquisition and modeling process. OED requires a model, which leads to a "chicken-and-egg problem". As shown in Fig. 1, this issue can be solved by choosing initial values or initial models, albeit preliminary ones. Existing data or models may be found in the literature or, as in the present case, can be created by MS. The complete procedure is as follows:
1. initial data acquisition (here: based on MS)
2. development of an EOS based on these data (here: utilizing TiSR)
3. calculation of the next most informative set of measurements (here: OED for the next best (T, p) state points)
4. further data acquisition at the proposed state points (here: conducting measurements)
5. refitting the parameters of the current EOS
   • If the fit is satisfactory, keep the EOS and continue with step 3.
   • If not, continue with step 2.
The iterative procedure of modeling and experiment is continued until a defined termination criterion is reached (e.g., data reproducibility within the experimental uncertainty).
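The iterative loop can be sketched schematically as follows. All function names and return values are illustrative stand-ins for the MS, TiSR, OED, and measurement steps, not an actual API:

```python
# Schematic sketch of the iterative acquisition/modeling loop (steps 1-5).
# Every function below is a hypothetical stand-in, not the TiSR/OED interface.

def acquire_initial_data():
    # step 1: e.g., molecular simulation results as (T, p, rho) tuples
    return [(300.0, 1.0, 11.5), (300.0, 2.0, 23.4), (250.0, 1.0, 13.9)]

def develop_eos(data):
    # step 2: stand-in for symbolic regression (TiSR)
    return {"data_used": len(data)}

def next_best_state_points(model, data):
    # step 3: stand-in for OED; proposes one new (T, p) point
    return [(200.0, 1.0)]

def measure(points):
    # step 4: stand-in for the experiment
    return [(T, p, 0.0) for (T, p) in points]

def refit_ok(model, data):
    # step 5: stand-in for the refit-quality check
    return len(data) >= 5

data = acquire_initial_data()
model = develop_eos(data)
for _ in range(3):                      # termination criterion: fixed budget here
    points = next_best_state_points(model, data)
    data += measure(points)
    if not refit_ok(model, data):       # unsatisfactory fit -> regenerate model
        model = develop_eos(data)
print(len(data))
```

In practice, the termination criterion would be a statistical one, such as data reproducibility within the experimental uncertainty, rather than a fixed iteration budget.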
Of course, the application of methods that are not yet widely established, such as SR and OED, comes with some challenges (Fig. 1). OED in particular is so far not well established in thermophysical property modeling. In fact, experiments in this domain impose a number of challenges for classical OED formulations. For instance, adjusting the temperature between experiments is significantly more time-consuming than changing the pressure. As a result, thermodynamic property measurements are often conducted along isotherms.
Thus, the next best set of (T, p) state points constrained to the same temperature is calculated. Future work will break down the true effort, add further constraints, and incorporate a more differentiated cost function representative of the involved effort into the OED procedure.
There are also some reservations about ML and SR due to the fact that existing knowledge is difficult to integrate. This is one of the major reasons for developing our own tool, which allows us to constrain the developed equation to adhere to a certain behavior. Note that points 3, 4 and 5 in Fig. 1 are relevant for developing multiparameter equations of state and are not considered within the scope of this paper.
In this study, we utilized already published experimental data to develop and refine the procedure. Consequently, OED was used to select from the available measurements only, rather than allowing it to select state points for measurements freely. However, as a showcase, we also calculated six isotherms using a free state point selection and contrast this design with the experimental design of von Preetzmann et al. [18].

Molecular Dynamics Simulation vs. Experiment
In order to use hybrid data sets consisting of simulated and experimental data, it is important to be aware of their different characteristics regarding uncertainty, as well as financial and temporal expenditure. For this reason, data for an air-like binary mixture (0.2094 O2 + 0.7906 N2) over the temperature range from 100 K to 300 K at pressures of up to 8 MPa were sampled in the present study, which is consistent with the published experimental data by von Preetzmann et al. [18]. Moreover, values calculated with REFPROP 10.0 based on the GERG-2008 EOS with a generalized mixture model [19,20] were considered for additional comparisons. Density values retrieved from the three sources mentioned are listed in Table 4 of the supplementary material and depicted in Fig. 2. Details regarding the experimental data are reported in the paper of von Preetzmann et al. [18].
All molecular simulations of the mixture O2 + N2 were conducted with the massively parallel molecular simulation software ms2 [21][22][23][24]. The applied potential models are given in Ref. [25], and the unlike interactions were determined with the modified Lorentz-Berthelot combining rules σ_ab = (σ_aa + σ_bb)/2 and ε_ab = ξ (ε_aa ε_bb)^(1/2) with ξ = 1.007 [26]. Monte Carlo simulations in the isobaric-isothermal (NpT) ensemble were carried out to calculate the density ρ. The equilibration was set to 2.5 · 10^4 canonical (NVT) and 5 · 10^4 NpT cycles, followed by a production run of 5 · 10^5 cycles. Based on these densities, molecular dynamics simulations were performed in the NVT ensemble with 2 · 10^5 time steps for the equilibration and 1.5 · 10^6 time steps for production. Newton's equations of motion were solved numerically by utilizing the Gear predictor-corrector integrator [27] with a time step of Δt = 2 fs, and specified temperatures were maintained through isokinetically rescaled velocities. The formalism of Lustig [28,29] was applied to sample the residual Helmholtz energy derivatives A^r_mn, and the residual Helmholtz energy A^r_00 itself was determined from chemical potential data obtained with Widom's test particle insertion method [30]. All simulations were conducted with N = 2048 molecules. The cutoff radius was set to 15.75 Å, and long-range interactions were considered analytically by angle averaging [31]. Results from the NVT ensemble simulations are listed in the supporting information.
Uncertainties for the given data sets were calculated based on different methods. In the case of the experimental data, the expanded combined uncertainty U(ρ_exp) (k = 2, i.e., approximately 95 % confidence interval) was calculated according to the "Guide to the Expression of Uncertainty in Measurement" [32]. The statistical uncertainty of the simulation data set, d(ρ_sim), was obtained from block averaging [33].
When comparing the reported uncertainties (U(ρ_exp), d(ρ_sim)) and the deviations between the different data sets in Table 1, a discrepancy can be observed. First, it is noticeable that the uncertainties given for the simulation data are much smaller than those for the experimental data and the values calculated with the GERG-2008 EOS. This is a result of the different ways in which the uncertainties are calculated.
For measurements, the uncertainties are based on the individual uncertainties, e.g., of the temperature setting, the pressure setting, and the density measurement, which are propagated and combined into the resulting uncertainty for the measured quantity. This is possible because a measurement is ultimately an observation of the ground truth. The uncertainty of MS, on the other hand, cannot be determined on that basis. It is approximated by calculating the standard deviation during the block averaging of the production cycles and applying the error propagation law [24,33]. Thus, the simulation data still contain errors due to model simplifications and numerics that are not reflected in this statistical uncertainty.
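The block-averaging estimate of the statistical uncertainty can be illustrated with a minimal sketch; the function name, block count, and the synthetic density series are illustrative assumptions, not the ms2 post-processing:

```python
import numpy as np

# Minimal block-averaging estimate of the statistical uncertainty of a
# simulation time series (the method cited above for d(rho_sim)).
def block_average_error(x, n_blocks=10):
    x = np.asarray(x, dtype=float)
    usable = (len(x) // n_blocks) * n_blocks        # drop the remainder
    blocks = x[:usable].reshape(n_blocks, -1).mean(axis=1)
    # standard error of the mean over the (approximately independent) blocks
    return blocks.std(ddof=1) / np.sqrt(n_blocks)

rng = np.random.default_rng(0)
series = 800.0 + rng.normal(0.0, 5.0, size=100_000)  # synthetic density series
err = block_average_error(series)
```

For truly uncorrelated samples this reproduces the standard error of the mean; for correlated molecular simulation output, the blocks must be long enough that their means decorrelate, which is exactly why this estimate can understate the true uncertainty.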
Moreover, it is also evident from Fig. 2 that, especially at the phase boundary, the deviations between the GERG-2008, simulation, and experimental data are quite large. In comparison, in the gaseous and liquid phases, all data sets are in good agreement, while the experimental and GERG-2008 data show even smaller deviations than the sim/exp or sim/GERG pairs. This leads to the following conclusions: (1) Based on the commonly used method (block averaging) for the uncertainty of the simulated data, it is difficult to make reliable decisions about data quality. Based on the comparisons with the experimental and GERG-2008 data, we estimate the true uncertainty of the simulated data to be ≈ 1 % to 2 %. (2) At the phase boundary, experimental and simulation data agree quite well. However, the GERG-2008 data show a larger deviation from the others at the phase boundary.
(3) The GERG-2008 model for the binary N2 + O2 mixture thus appears to have a larger uncertainty here than indicated by the authors.
In addition to the uncertainties, a comparison of temporal and financial expenditures has to be conducted. The assumptions underlying the expenditure calculations are based on discussions with the authors of Ref. [18]. For the temporal and financial expenditures, the preparation time, the execution time, the post-processing time, as well as variable material and energy costs were considered. For the variable personnel cost, we assume an hourly rate of 52 €/h (approximated based on the typical salary of an experimenter/model developer in Germany in 2022). Tasks that do not require constant supervision are multiplied by an attention factor, which we assume to be 0.1 for a running simulation and 0.33 in all experimental contexts. We do not consider the depreciation costs of equipment and assume that the measurement apparatus is available, set up, and calibrated. For both simulation and experiment, the calculations are conducted for 58 state points along six isotherms. The resulting expenditures are listed in Table 2.

Table 1: Mean uncertainties of the experimental data [18] and the sampled simulation data reported in Table 4 of the supplementary material. For the uncertainty of the data calculated with REFPROP using the GERG-2008 generalized mixture model [20], the reported upper bounds of uncertainty for standard dry air are listed. Moreover, the table also lists the mean relative deviations of the three data sets with respect to each other. Four of the points are close to the phase boundary and have a larger uncertainty than the remaining points. Therefore, the uncertainties and deviations are additionally provided for all but these four state points. The overbars refer to the means of the respective quantities.
For the experiments, there is a large discrepancy between the calculated and the actual expenditures. Most of this discrepancy stems from unpredictable delays and the fact that a majority of the tasks cannot be interrupted and resumed the next day. All assumptions, a brief discussion of the expenditure calculations, and the discrepancies between the calculated and realistic expenditures for experimental work can be found in Sect. 1 of the supplementary material.
The question arises whether the relatively small difference in uncertainty between simulation and experimental data is worth the effort. On the one hand, our answer is yes, because without accurate data, uncertainty estimation for simulation data is difficult. On the other hand, we believe that the number of measurements can be reduced based on a given simulation data set. Consequently, we seek to promote an efficient data acquisition process which reduces the experimental data quantity, resulting in significantly lower costs, while still achieving low uncertainties. It is important to keep in mind that fast data acquisition is an important step to enable a rapid industrial utilization of substances, while low uncertainties of data are important for EOS development and ultimately for process simulation.

Modeling with Symbolic Regression
In this section, a brief overview of our new SR tool TiSR is provided. It employs the multi-objective genetic programming algorithm NSGA-II [34]. Models are selected in a Pareto-optimal sense using the quality of fit to the data, the model complexity (number of operators, variables, and model parameters), and the model age (number of generations in the algorithm since the model was generated and has been selected into the next generation). Schmidt and Lipson have shown that the model age is a critical addition to the selection criteria, which helps counter premature convergence [35]. The tool was implemented in the programming language Julia, and it adopts large parts of SymbolicRegression.jl [36,37]. However, we adapted and rewrote other parts, sacrificing general performance to gain flexibility and extensibility, thus creating a more customized tool. In contrast to many other SR implementations, parameter identification is performed using a Levenberg-Marquardt algorithm. The parameter identification minimizes the weighted sum of squared relative errors. The weights are determined by the inverse of the variances, i.e., the squared uncertainties, associated with the data to be fitted.

Table 2: Resulting temporal and financial expenditures for the data acquisition of 58 state points along six isotherms. For the experimental expenditures, the net time was approximated based on discussions with the authors of Ref. [18]. The actual experimental expenditures associated with these measurements are listed separately. A more detailed itemization and a brief discussion is provided in Sect. 1 of the supplementary material.
In order to prevent overfitting and to improve the generalization performance of the fitted models, the parameter identification step is regularized using early stopping. In early stopping, the input data are split into a training and validation data set, and the parameter identification is only performed using the training data set. During each iteration of the Levenberg-Marquardt algorithm, the model is additionally evaluated on the validation data set using the current parameters. As soon as the residual norm of the validation data set increases monotonically for a predefined number of iterations, the parameter identification is stopped, and the parameters associated with the lowest sum of residual norms of the training and validation data set are selected. TiSR is still under development, and the details will be published later.
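A minimal sketch of this fitting scheme is given below: an inverse-variance-weighted relative-error Levenberg-Marquardt loop with early stopping on a validation split. The two-parameter toy model, the data, and the split are illustrative assumptions, not TiSR's actual implementation:

```python
import numpy as np

def residuals(c, T, rho, p, u):
    # weighted relative errors; the model p = c0*rho*T + c1*rho^2 is a toy stand-in
    pred = c[0] * rho * T + c[1] * rho**2
    return (pred - p) / (p * u)

def jac(T, rho, p, u):
    # Jacobian of the residuals; constant here because the toy model is linear in c
    return np.stack([rho * T, rho**2], axis=1) / (p * u)[:, None]

def fit(T, rho, p, u, patience=3, lam=1e-3, iters=50):
    tr, va = slice(5, None), slice(0, 5)          # train/validation split
    c = np.zeros(2)
    best_cost, best_c = np.inf, c.copy()
    worse, prev_val = 0, np.inf
    for _ in range(iters):
        r = residuals(c, T[tr], rho[tr], p[tr], u[tr])
        J = jac(T[tr], rho[tr], p[tr], u[tr])
        # one damped (Levenberg-Marquardt) Gauss-Newton step
        c = c + np.linalg.solve(J.T @ J + lam * np.eye(2), -J.T @ r)
        train = float(np.sum(residuals(c, T[tr], rho[tr], p[tr], u[tr]) ** 2))
        val = float(np.sum(residuals(c, T[va], rho[va], p[va], u[va]) ** 2))
        if train + val < best_cost:               # keep lowest train+val parameters
            best_cost, best_c = train + val, c.copy()
        worse = worse + 1 if val > prev_val else 0
        prev_val = val
        if worse >= patience:                     # early stopping
            break
    return best_c

rng = np.random.default_rng(1)
T = rng.uniform(100.0, 300.0, 40)
rho = rng.uniform(1.0, 30.0, 40)
p = (0.3 * rho * T + 0.05 * rho**2) * (1.0 + rng.normal(0.0, 0.01, 40))
u = np.full(40, 0.01)                             # 1 % relative uncertainties
c = fit(T, rho, p, u)
```

With 1 % noise, the recovered parameters land close to the generating values (0.3, 0.05); the early-stopping rule only matters for models expressive enough to overfit the training split.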
For the development of an initial model as a basis for OED, we utilized MS, as described in Sect. 3, to create a data grid over twelve temperatures (100, 120, 140, 160, 180, 200, 210, 220, 240, 260, 280, and 300) K and seven pressures (0.1, 0.5, 1, 2, 4, 6, and 8) MPa. The simulation results are listed in Table 5 of the supplementary material. Before finding the model, the data were normalized to ensure that all data points have the same order of magnitude (Eq. 1). The normalization parameters in Eq. 1 were T_0 = 150 K, ρ_0 = 1000 kg/m³, and p_0 = 10 MPa. TiSR was used to find an equation in the form of p = f(T, ρ) to ensure a satisfactory fit over both the gaseous and the liquid regions. Figure 3 shows the Pareto front of the resulting equations upon termination of TiSR.

Fig. 3: Cutout of the Pareto front of the equations found using TiSR. While the vertical axis represents the mean squared error, the color bar indicates the maximum relative deviation to the simulation data.
Considering that multiple Pareto-optimal models were created, we chose the model with the lowest complexity that still fits the simulation data within a relative deviation of ±2 %. The limit of 2 % was chosen according to the approximated uncertainty of the simulation data, as discussed in Sect. 3. This results in Eq. 2, which is also highlighted in Fig. 3 (see Table 3 in the supplementary material for the parameters). In Fig. 4, the predictions of this equation are compared to the simulation data. The relative deviations range within −2 % to 2 % and are thus satisfactorily low. Thus, Eq. 2 was used as a basis for OED in this work.
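The selection rule on the Pareto front can be stated compactly; the candidate list below (complexity, maximum relative deviation in %) is hypothetical, chosen only to illustrate the rule:

```python
# Among Pareto-front candidates, take the least complex model whose
# worst-case relative deviation to the simulation data stays within 2 %.
candidates = [            # (complexity, max |relative deviation| in %), illustrative
    (5, 7.4), (9, 3.1), (12, 1.8), (15, 1.7), (21, 0.9),
]
chosen = min((m for m in candidates if m[1] <= 2.0), key=lambda m: m[0])
print(chosen)             # the (12, 1.8) candidate satisfies both criteria
```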

Optimal Experimental Design
Optimal experimental design is a technique to select experiments which are most informative with regard to the parameters of a given model. We refer the interested reader to Pazman [38], Uciński [39], and Atkinson et al. [40] for a comprehensive background. As shown by Frotscher et al. [17] and Bardow et al. [14,16], OED can be used to minimize the amount of data required and still develop accurate thermodynamic models. In the present study, OED was used to reduce the number of data points required to identify the parameters c 1 , … , c 7 of Eq. 2.
The thermodynamic model under consideration has the form

p = f(T, ρ; c),   (3)

where p and T are the independent variables, ρ is the measured (dependent) quantity, and c = (c_1, …, c_7) are the model parameters.

In OED, the information content of a single measurement is expressed in terms of the elementary Fisher information matrix (FIM)

I_i = (1/σ_i^2) j_i^T j_i,   (4)

where σ_i^2 is the variance associated with the dependent (measured) coordinate of the i-th data point, and the row vector j_i is the elementary derivative of the dependent variable with respect to the model parameters at this point. Since Eq. 3 is solved for p, but the dependent quantity is ρ, we reformulate our model as an implicit one,

F(T, p, ρ; c) = f(T, ρ; c) − p = 0.   (5)

The required elementary derivative is then evaluated using the implicit function theorem,

j_i = ∂ρ/∂c = −(∂F/∂c) / (∂F/∂ρ),   (6)

evaluated at the i-th state point. Each elementary FIM I_i is a symmetric, positive semi-definite rank-1 matrix. Since we can assume individual measurements to be statistically uncorrelated, the FIM associated with a series of experiments is obtained as I = Σ_i I_i. The selection of the best experiments is usually conducted by minimizing a scalar objective function of the FIM. Here, we utilized the A-criterion

Ψ_A(I) = Σ_j 1/λ_j,   (7)

where λ_j is the j-th eigenvalue of I. To fit the parameters of the model, we need at least as many data points as we have parameters c. This is ensured in our case, as we use the MS data as a basis for the OED calculations. Consequently, we seek to minimize Ψ_A(I) in order to maximize the information content of the selected collection of experiments and simultaneously minimize the parameter variation in the face of measurement errors. The resulting state point selection is optimal in the sense that it carries the most information considering the current model, the current parameters, and the data already acquired.
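The assembly of the total FIM and the evaluation of the A-criterion described above can be illustrated numerically; the sensitivity rows below are random stand-ins for the actual model derivatives:

```python
import numpy as np

# Elementary FIMs I_i = j_i^T j_i / sigma_i^2 from per-point sensitivity rows
# (random stand-ins here), summed to the total FIM and scored with the
# A-criterion, Psi_A(I) = sum_j 1/lambda_j = trace(I^-1).
rng = np.random.default_rng(0)
n_points, n_params = 12, 7
J = rng.normal(size=(n_points, n_params))        # rows j_i = d(rho_i)/dc (stand-in)
sigma2 = np.full(n_points, 0.02**2)              # variances of the measured rho

I = sum(np.outer(J[i], J[i]) / sigma2[i] for i in range(n_points))  # total FIM
eigvals = np.linalg.eigvalsh(I)                  # I is symmetric positive definite
psi_A = float(np.sum(1.0 / eigvals))             # A-criterion

# sanity check: sum of reciprocal eigenvalues equals trace of the inverse
assert np.isclose(psi_A, np.trace(np.linalg.inv(I)))
```

Note that the total FIM is only invertible once at least as many informative points as parameters have been accumulated, which is why the MS data serve as the starting basis.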
As mentioned in Sect. 2, we conduct OED in two different ways, which we refer to as fixed and free state point selection. In both approaches, we use as a starting point the model shown in Eq. 2 with initial parameters from Table 3 in the supplementary material, which were identified using TiSR and the entire simulation data set. Both approaches use OED to determine an optimized set of additional isotherms to be measured in order to further increase the model accuracy.
In the case of the fixed state point selection, we iteratively choose the next best isotherm, add its measurements, and refit the parameters of the model. If the fit is no longer satisfactory, the current model is replaced by a newly generated one using TiSR. In this study, as we only use data from six previously measured isotherms, we are bound to choose from a limited set of temperature and pressure values. Subsequent studies will be accompanied by measurement campaigns and thus will not be subject to such limitations.
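The fixed selection amounts to a greedy loop over the candidate isotherms. The sketch below uses random stand-in sensitivities per isotherm and a stand-in prior information matrix for the MS data; only the selection logic mirrors the procedure:

```python
import numpy as np

def a_crit(I):
    # A-criterion: sum of reciprocal eigenvalues of the FIM
    return float(np.sum(1.0 / np.linalg.eigvalsh(I)))

rng = np.random.default_rng(3)
n_params = 4
# candidate isotherms, each with stand-in sensitivity rows (8 pressures x params)
isotherms = {T: rng.normal(size=(8, n_params)) for T in (100, 145, 200, 260)}
I = 1e3 * np.eye(n_params)        # stand-in information already held via the MS data

order, remaining = [], set(isotherms)
while remaining:
    # greedily pick the isotherm whose addition minimizes the A-criterion
    best = min(remaining, key=lambda T: a_crit(I + isotherms[T].T @ isotherms[T]))
    I += isotherms[best].T @ isotherms[best]
    order.append(best)
    remaining.remove(best)
print(order)                       # selection order of the isotherms
```

In the actual procedure, each selection step is followed by measurements along the chosen isotherm and a refit, so the sensitivities (and hence the ranking) change between iterations.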
The free state point selection is a demonstration of how the procedure would look without such limitations. However, in the absence of an accompanying measurement campaign in this work, no further data were actually added. Consequently, we keep the initial model and its parameters, and determine an optimal combination of six isotherms simultaneously.
For the free state point selection, we use gradient-based optimization to minimize the A-criterion, Eq. 7. More specifically, we use the Newton interior-point algorithm from the Julia package Optim.jl [41,42]. All derivatives are calculated using the algorithmic differentiation capabilities of the ForwardDiff.jl package [43]. We impose the same bounds for the temperature and pressure values as given by the simulation data, i.e., temperatures from 100 K to 300 K and pressures from 0.1 MPa to 8 MPa. Note that the number of pressure values along each isotherm is fixed, but their values are subject to optimization. There is no constraint prohibiting multiple isotherms from being at the same temperature.
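A reduced sketch of this continuous optimization is given below, with SciPy's bounded L-BFGS-B as a stand-in for the Newton interior-point algorithm of Optim.jl, a single fixed-temperature isotherm, and a hypothetical two-parameter model ρ = c1·p/T + c2·p² supplying the sensitivities:

```python
import numpy as np
from scipy.optimize import minimize

# Free state point selection sketch: optimize the pressure values along one
# isotherm (T fixed) so that the A-criterion of the resulting FIM is minimal.
T = 150.0

def a_crit(p_vals):
    # sensitivity rows d(rho)/dc for the toy model rho = c1*p/T + c2*p^2
    J = np.stack([p_vals / T, p_vals**2], axis=1)
    I = J.T @ J + 1e-9 * np.eye(2)          # FIM with tiny jitter for stability
    return float(np.trace(np.linalg.inv(I)))

p0 = np.full(5, 4.0) + np.linspace(-0.5, 0.5, 5)   # initial pressure guesses
res = minimize(a_crit, p0, method="L-BFGS-B", bounds=[(0.1, 8.0)] * 5)
print(np.round(res.x, 2))                   # optimized pressures within bounds
```

The optimizer spreads the pressures so that the sensitivity rows span the parameter space well; in the paper's setting, the temperatures of six isotherms and their pressure values are optimized simultaneously under the same box bounds.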

Results
Based on the TiSR model Eq. 2 and the simulation data, we utilized OED as explained in Sect. 5 to select the best order of isotherms from the given experimental data (fixed state point selection), one isotherm at a time. Once an isotherm had been selected, we refitted the model parameters to the augmented data set. In this process, we weighted the MS data lower (based on their assumed uncertainty of 2% ) than the experimental data (based on their expanded combined uncertainty). Figure 5(a) shows the order in which isotherms were selected. In addition, Fig. 5(b) depicts the mean relative deviation over data sets. After measurements along three isotherms at (115, 130, and 145) K , we observed that the mean relative deviation no longer changed significantly.
This indicates that only little valuable information is added after three isotherms. In fact, given the fixed selection of isotherms, the fit to those "unnecessary" isotherms leads to higher relative deviations at state points where fewer data or data with lower weight are available. This can also be seen in Fig. 6, which shows the data from simulation and experiment as well as the predictions and relative deviations of Eq. 2 for different OED iteration steps.
According to the procedure proposed in this work, after measuring four isotherms and refitting Eq. 2 after each one, TiSR was utilized again to find an improved model, Eq. 8, using the additional data from the four isotherms (for coefficients c_1, …, c_7 see Table 3 in the supplementary material). Incidentally, the new model again contains seven parameters.

Fig. 5: The order of selection of the next best isotherm to measure according to OED based on Eq. 2, the simulation data, and the already measured isotherms is shown in Panel (a). Starting with Eq. 2 fitted to MS data only (no. of isotherms = 0) to determine the first isotherm, the equation was fitted again after each isotherm using the additionally available data. Panel (b) shows the mean relative deviations between the model predictions and the experimental as well as the MS data.
Finally, we generated a model using all data from MS and experiment, which yielded Eq. 9 (for coefficients c_1, …, c_7 see Table 3 in the supplementary material). For both the preliminary model Eq. 2 and the model Eq. 8, there is no significant benefit from fitting more than the four best isotherms and the simulation data. The simulation data and the four most informative, experimentally acquired isotherms are sufficient. This shows that, with an optimized selection based on OED, models can be developed using fewer measurements while achieving uncertainties similar to a classical experimental design. Selected statistics of the discussed equations are listed in Table 3.
As we are bound to the previously measured data, we can only select a subset of the data rather than freely selecting optimal state points. Therefore, we cannot reduce the relative deviations of the predictions using fewer data, and thus, we cannot show the full potential of OED and the proposed procedure in this study. To demonstrate how a free selection might look in this case, in Fig. 7, the state points selected by von Preetzmann et al. [18] are compared with a free state point selection based on Eq. 2 using only the simulation data. However, the OED state point selection in Fig. 7 differs from the one proposed in the work of von Preetzmann et al., as the six best isotherms are calculated simultaneously based on the simulation data, rather than iteratively calculating the next best one and refitting.

Table 3: Statistics of the relative absolute deviations of the equation predictions with regard to the simulation and experimental data for Eq. 2 fitted to the simulation data, Eq. 2 fitted to the simulation data and the four best isotherms, Eq. 8 fitted to the simulation data and the four best isotherms, and Eq. 9 fitted to the simulation and experimental data.
In the future, we aim to utilize the free state point selection to determine the next best isotherms during measurement campaigns for the sake of improving the model predictions and decreasing the experimental effort.

Conclusion and Outlook
A combined measurement and modeling procedure based on a hybrid data set, OED, and symbolic regression is presented.
Using molecular simulations (MS), we sampled an initial data set for an air-like binary mixture (0.2094 O2 + 0.7906 N2) over the temperature range from 100 K to 300 K and the pressure range from 0.1 MPa up to 8 MPa. The data sets were compared in terms of uncertainty, as well as financial and temporal expenditure. While MS yield data with larger uncertainty (we estimate 2 %), these data are significantly cheaper and faster to generate compared to highly accurate measurements. This is mainly due to the parallelization of MS, which is infeasible in high-accuracy measurements. For such measurements, a large contribution to the financial and temporal expenditure comes from the long time spans required to ensure thermal equilibrium in the measuring cell.
The MS data were used to generate a preliminary model with the newly developed thermodynamics-informed symbolic regression (TiSR) tool. This model was then used as a basis for OED to iteratively determine the next most informative isotherm out of the ones measured by von Preetzmann et al. [18]. During this process, we iteratively added the next best isotherm to the fitting data set and weighted the data points according to their associated uncertainties. Thereby, we ranked the isotherms by their information content with respect to the parameters of the preliminary model. We demonstrated that after adding the four (out of six) most informative isotherms, the benefit of adding further isotherms decreases significantly. We conclude that more data do not necessarily add more information.
Finally, we generated new equations with the hybrid data set using TiSR. Models using the complete hybrid data set and using the experimental data set with only the four most informative isotherms were created. Both versions displayed a minimal improvement compared to the preliminary model generated using the MS data set only. In summary, we have shown that models with similar uncertainty can be delivered using fewer data.
In the future, we will use our iterative process of modeling and measurement based on hybrid data sets to further decrease the financial and temporal effort, providing models with comparatively low uncertainty. Both our OED and TiSR implementations are still under development and will be described in more detail in future publications. Upcoming developments include:
• symbolic regression:
  - development of implicit equations
  - addition of further thermophysical constraints
  - development of fundamental EOS (e.g., in the form of the Helmholtz energy)
• optimal experimental design:
  - incorporation of different criteria for the selection of state points
  - extension of the cost functions to more closely reflect the cost of measurements
  - use for experiments that measure multiple properties simultaneously (e.g., parallel measurements of density and speed of sound in a vibrating tube densimeter)
Our next steps are to tackle these issues and to make our software available in open source format. This will be done in combination with measurements of new substances to validate the concept in real applications.