Introduction

Results from laboratory tests on rock samples are critical for the derivation and substantiation of constitutive models to be used in modeling beyond the spatial and temporal scales of laboratory and field tests (Kolditz et al. 2021). The synergies between experimental and numerical approaches (e.g., Esterhuizen 2014) range from hazard prevention, in the context of volcano activity (Heap and Violay 2021), rockbursts (Li et al. 2019; Wang et al. 2021), and waste repositories (e.g., Bossart 2007), to initiatives to build virtual rock physics laboratories for educational purposes (Zhu et al. 2012; Vanorio et al. 2014). The comparability of results obtained in standardized experiments forms the basis for the credibility of laboratory work. The demands on the experimental procedure are particularly high in geosciences and geotechnical engineering, because the investigated rock material is often heterogeneous, anisotropic, and limited in its quantity.

In economic applications, subsurface characterization rests on standardized preliminary surveys to plan processes and costs based on results gained under comparable conditions. Examples of regulations serving this purpose are standards published by the American Society for Testing and Materials (e.g., ASTM 2017), suggested methods published by the International Society for Rock Mechanics (ISRM; e.g., Kovari et al. 1983), or national standards and recommendations. In a scientific context, the investigated problems are usually highly specialized and require deviations from such standards. Intermediate and deep (core) drilling operations certainly represent an endmember among geoscientific projects, because costs are extremely high and the resulting sample material is severely limited. Such drilling operations have become increasingly important during the past decades, for example, regarding nuclear waste disposal (e.g., Almén 1994; Delay et al. 2007), mitigation of geohazards (e.g., Prior and Doyle 1984), or geothermal energy provision (e.g., Fridleifsson and Elders 2005). These endeavors may benefit from a process understanding that cannot be gained from material and structure characterization based on field surveys and laboratory tests alone, but requires a combination of field testing and large-scale modeling. The complexity of the modeling, both in terms of structures and relevant processes, often mandates the use of numerical codes that have to be verified, validated, and benchmarked using independent constraints from experiments and observational evidence spanning scales from hand samples to rock masses (e.g., Jing 2003; Diehl et al. 2019; Birkholzer and Bond 2022).

It is not uncommon that individual studies combine dedicated experimental work and numerical modeling of rock failure behavior in general (e.g., Holt et al. 2005) or during engineering operations, such as hydraulic fracturing (Deb et al. 2021) and tunneling (Zhang et al. 2018). To tap the large pool of results of independent experimental studies, a rigorous assessment of the significance of their outcomes may lead to improved understanding of fundamental questions related to the role of methodological peculiarities vs. that of sample-to-sample variability. Comparative studies differ regarding the number of involved laboratories, considered rock varieties, and applied methods (Appendix A), with a good fraction dedicated to the specific and difficult task of determining hydraulic properties of close to impermeable shales (e.g., Ghanizadeh et al. 2015), for which a qualitative method comparison is provided by Sander et al. (2017). Often, different methods for determination of a particular property are compared by tests in a single laboratory on a single sample, at times even in a single device (e.g., Winhausen et al. 2021; Schepp and Renner 2021; Zhang et al. 2022). Efforts regarding interlaboratory validation tests are documented from the 1980s onward, but partly in reports to funding agencies (e.g., Rasilainen et al. 1996; Sandström 2006) or in conference papers (e.g., McPhee and Arthur 1994; Davy et al. 2019), which makes details difficult to track. True round robins, in principle possible for non-destructive testing (e.g., Rasilainen et al. 1996; Profice et al. 2016), eliminate sample-to-sample variability and thus allow for assessing the role of protocol deviations and method principles, but pose organizational challenges and raise questions regarding history dependence of measurement results.
These challenges are probably the reason why the largest comparative study to date, involving 24 laboratories, refrained from attempting a round robin for hydraulic permeability testing of Grimsel granodiorite (David et al. 2018a, b). For destructive strength testing (e.g., Pincus 1993, 1994, 1996; Minardi et al. 2021), however, one has to resort to the selection of to-be-distributed sample suites based on their a-priori characterization (e.g., Minardi et al. 2021), accompanied by the challenge to minimize the uncertainty of the role of sample-to-sample variability, for example, by centralized sample preparation and characterization. In some cases, previous studies tended to focus on statistical analyses of results, omitting a rigorous uncertainty analysis of the individual measurements (e.g., David et al. 2018a, b), which hampers the assessment of the significance of observed differences. For the present study, rock mechanics and rock physics laboratories worldwide were invited to participate in an interlaboratory comparison in the context of the San Andreas Fault Observatory at Depth (SAFOD) deep drilling project (Lockner et al. 2009; Logan et al. 2010; Zoback et al. 2010). Test conditions and aspects of procedures were specified before laboratories received sample blocks of five different rock types. The five rock types were selected because they (i) occur in deposits with sizes motivating commercial quarrying and thus promise future availability, (ii) have been the subject of a range of previous studies and accordingly were expected to span a wide range in the physical properties to be investigated, and (iii) promised to minimize the influence of anisotropy and to ensure homogeneity at the decimeter scale to allow for preparation of comparable samples. Owing to the destructive nature of strength tests and potential irreversible interactions between fluid and samples, we refrained from a round robin procedure, but the group at the U.S. Geological Survey, Menlo Park (USGS), organized selection, purchase, and shipment of blocks of the rock types, from which the participating institutions prepared samples locally. The specific objectives of this study were:

(1) To compare the experimental approaches, including sample preparation, and results from different laboratories to determine causes for potential deviations among results,
(2) To establish tractable standards required for research objectives associated with deep drilling projects,
(3) To establish the significance of results of laboratory tests in the light of verification and validation efforts for numerical models, and
(4) To build confidence in the laboratories’ procedures.

We provide results for Young’s modulus and compressive strength derived from uniaxial and triaxial deformation experiments of intact rock samples (U.S. Geological Survey—USGS, Ruhr-Universität Bochum—RUB) and for hydraulic permeability (USGS, RUB, The Pennsylvania State University—PSU), the central physical properties for hydromechanical modeling whose importance for fundamental research and industrial applications is increasingly appreciated (e.g., Neuzil 2003; Ghassemi 2012).

Materials and methods

Materials

Five rock types were studied (Fig. 1, Table 1):

Fig. 1

Optical micrograph images (crossed polarized light) of a Crab Orchard sandstone, b Berea sandstone, c Wilkeson sandstone, d Carrara marble, and e Sierra White granite. Images are taken from thin sections prepared perpendicular to the drilling directions for the samples

Table 1 Mineralogical compositions from X-ray diffraction in weight percentage (wt%) with uncertainties of ± 2–5% depending on the mineral

(1) Crab Orchard sandstone is a fine-grained, low-porosity sandstone mainly composed of quartz (91.0 wt%), minor amounts of plagioclase (0.5 wt%) and orthoclase (1.5 wt%), and clay minerals (smectite: 3.0 wt%, illite/muscovite: 4.0 wt%). The nominal porosity is about 5% (Benson et al. 2005) and grain size as inferred from thin sections is < 300 µm with an average and a standard deviation of 79 ± 11 µm (line-intercept).

(2) Berea sandstone represents a lightly banded sandstone with about 20% nominal porosity (e.g., Churcher et al. 1991) composed of 88.0 wt% quartz, 5.0 wt% orthoclase, 2.0 wt% plagioclase, 2.5 wt% dolomite, and 2.5 wt% kaolinite. The grain size distribution is similar to that of Crab Orchard sandstone with an average and standard deviation of 98 ± 12 µm and quartz grains not exceeding 300 µm.

(3) Wilkeson sandstone represents a medium-grained sandstone with 10% nominal porosity (e.g., Duda and Renner 2013). It is composed of 50 wt% quartz, 10 wt% orthoclase, 26 wt% plagioclase, 4 wt% dolomite, 2 wt% siderite, and 8 wt% mica. The maximum grain size of individual quartz grains reaches up to 2 mm with an average and standard deviation of 172 ± 27 µm.

(4) Carrara marble is composed of medium-grained calcite with minor additional constituents and a low porosity (e.g., Schmid et al. 1980). A minor grain size anisotropy barely exceeding the standard deviation was determined with 150 ± 20 µm and 173 ± 22 µm in two orthogonal directions.

(5) Sierra White granite (Knowles granodiorite) is a low-porosity granite (e.g., Miller and Florence 1991) composed of quartz (38 wt%), orthoclase (10 wt%), plagioclase (43 wt%), and micas (9 wt%), including muscovite and biotite. It exhibits a wide range of grain sizes from a few tens of µm to several mm with an average and standard deviation of 649 ± 257 µm.

Apart from color gradients in Berea sandstone, the investigated blocks showed no macroscopic signs of heavy weathering, anisotropy, or heterogeneity.

Sample preparation

Uniaxial and triaxial deformation tests and permeability tests were performed on cylindrical samples prepared by the individual groups, who were provided with blocks of the various rock types whose faces were labelled by the group at USGS as T–B, N–S, and E–W. Specimens were drilled with water-cooled diamond drill bits. All samples intended for comparative measurements were cored in the T–B orientation, uniformly defined for all participating institutions, except for the samples of Wilkeson sandstone for permeability measurements at PSU, which were drilled in the E–W direction, i.e., orthogonal to the “standard direction”. At PSU, additional samples for permeability measurements were drilled from Berea sandstone and Crab Orchard sandstone in the E–W and N–S directions.

For strength tests, right cylinders were prepared (USGS: 25.4 mm diameter × 63.5 mm length and RUB: 30 mm diameter × 75 mm length), providing an aspect ratio of about 2.5:1, chosen to ensure a homogeneous stress distribution in the center of samples when subjected to conventional compression (Paterson and Wong 2005). Samples for permeability tests had nominal dimensions of 25.4 mm diameter × 50 mm length (USGS, PSU) and 30 mm diameter × 50 mm length (RUB). For both tests, end faces were ground square to within 0.1% parallelism. At the USGS, samples were additionally cylindrically ground to achieve a uniform diameter (within ± 0.01 mm) and consistent surface finish, after which they were cleaned with acetone. Samples prepared at RUB by drilling only exhibited diameter variations of less than ± 0.03 mm and were devoid of drilling-score marks. Diameters were measured by calipers with a resolution and an accuracy of better than 0.01 and 0.1 mm, respectively. Finished samples were vacuum-dried at ~ 60 °C for approximately 24 h.

Except for Sierra White granite, the diameter of the specimens exceeded the largest grains in the rock by at least a factor of six, in agreement with ISRM’s suggested methods (Bieniawski and Bernede 1979). In the light of this favorable size vs. grain size ratio, deviations of sample size from recommendations for deformation tests (e.g., ASTM 2017) were allowed on purpose—all samples were smaller than the recommended 40 to 50 mm in diameter—to account for requirements of testing apparatus and to simulate typical material limitations associated with scientific drilling projects.

Sample-to-sample variability deduced from basic rock physical properties

Prepared samples were investigated for their basic physical properties at RUB to exemplarily assess sample-to-sample variability. The differences in basic physical properties of samples originating from a specific block and determined at ambient conditions were not significant, as standard deviations were generally smaller than the experimental uncertainty determined by error propagation (Table 2). Thus, the five rock types were considered sufficiently homogeneous for the planned experiment series and the comparison among laboratories.

Table 2 Average values (avg), standard deviations (std), and experimental uncertainty (Δ) of density ρ, P-wave and S-wave velocities (vP and vS) of dry and saturated (sat) samples, and connected porosity ϕ for each rock type. The number of investigated samples is indicated in parentheses

Experimental procedures

All tests were to be performed according to instructions concerning sample treatment, number of repeat tests, and applied pressures and their sequences (Tables 3, 4). We refer the reader to Lockner (1998), Duda and Renner (2013), and Ahrens et al. (2017) for technical details of the apparatuses used for deformation tests. The testing procedures did not fully comply with ISRM’s suggested methods (Kovari et al. 1983): (a) spherical seats were not employed; (b) tests were run in displacement control, with piston velocities selected in the two laboratories to yield the prescribed strain rate of ~ 1 × 10⁻⁵ s⁻¹ for the samples with different lengths (Table 3), and with controlled confining pressure. Owing to the system deformation, the true strain rates vary by up to a factor of about 2, both over the course of a single test, between the phase of initial steep stress increase and the near-constant stress conditions at maximum stress, and between the stiffest (Carrara marble) and most compliant (Berea and Wilkeson sandstone) samples (please see the data availability statement for links to test records).
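The relation between the prescribed strain rate and the piston velocity can be illustrated with a minimal sketch. It assumes a perfectly rigid loading system (which is exactly why true strain rates deviate in practice) and uses the nominal sample lengths quoted in the sample preparation section; the function name is hypothetical.

```python
def piston_velocity_mm_per_s(sample_length_mm: float,
                             strain_rate_per_s: float = 1e-5) -> float:
    """Piston velocity giving the nominal axial strain rate.

    Assumption: the loading system is rigid, so all piston displacement
    shortens the sample. System compliance makes the true rate lower.
    """
    return strain_rate_per_s * sample_length_mm


# Nominal sample lengths for strength tests (USGS: 63.5 mm, RUB: 75 mm)
v_usgs = piston_velocity_mm_per_s(63.5)
v_rub = piston_velocity_mm_per_s(75.0)
print(f"USGS: {v_usgs:.2e} mm/s, RUB: {v_rub:.2e} mm/s")
```

Under this idealization, the longer RUB samples require a proportionally higher piston velocity to reach the same nominal strain rate.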

Table 3 Specifications for deformation experiments (after preparation)
Table 4 Specifications for permeability tests (after preparation)

Four methods were used at the participating institutions to obtain permeability: constant-flow, constant-head, and pulse tests at PSU, constant-head tests at USGS, and oscillatory pore-pressure tests at RUB. For theoretical background and experimental setup of permeability tests, we refer to Bernabé et al. (2006), Song and Renner (2007), Song et al. (2013), and David et al. (2018a, b).

The necessary steps for the evaluation of the mechanical and hydraulic tests are detailed in Appendix B, including a comprehensive discussion of involved uncertainties. Specifically, the conversions of recorded displacements to strains and recorded loads to stresses and stress differences, the difference between axial stress and confining pressure also referred to as deviatoric or differential stress (see Paterson and Wong 2005), need to account for the (current) sample dimensions and system stiffness. The compliances of the assemblies used at USGS and RUB are about 0.002 mm/MPa and 0.001 mm/MPa, respectively, and thus the corrections involved in strain determination amount to up to 70% of the total recorded displacement for tests at USGS on the stiffest rock type, Carrara marble. The different applied hydraulic methods essentially rest on fitting analytical functions to observed pressure transients or spectral analyses of the periodic pressure signals.
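The weight of the stiffness correction can be made tangible with a rough sketch. It assumes a linear machine compliance referred to the stress difference and a purely elastic sample; the Young’s modulus of 70 GPa used for the marble is an illustrative assumption, not a result of this study, and both function names are hypothetical.

```python
def machine_fraction(compliance_mm_per_mpa: float,
                     sample_length_mm: float,
                     youngs_modulus_mpa: float) -> float:
    """Fraction of the recorded displacement caused by the loading system.

    Assumption: linear elastic sample, shortening per MPa = L / E.
    """
    sample_mm_per_mpa = sample_length_mm / youngs_modulus_mpa
    return compliance_mm_per_mpa / (compliance_mm_per_mpa + sample_mm_per_mpa)


def corrected_strain(recorded_displacement_mm: float,
                     stress_difference_mpa: float,
                     compliance_mm_per_mpa: float,
                     sample_length_mm: float) -> float:
    """Axial sample strain after removing the system deformation."""
    system_mm = compliance_mm_per_mpa * stress_difference_mpa
    return (recorded_displacement_mm - system_mm) / sample_length_mm


E_ASSUMED_MPA = 70_000.0  # illustrative modulus for a stiff marble (assumption)
frac_usgs = machine_fraction(0.002, 63.5, E_ASSUMED_MPA)
frac_rub = machine_fraction(0.001, 75.0, E_ASSUMED_MPA)
print(f"system share of displacement: USGS {frac_usgs:.0%}, RUB {frac_rub:.0%}")
```

With these assumed numbers, the system accounts for roughly two thirds of the displacement recorded at USGS, of the same order as the correction of up to 70% stated above.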

Uncertainty analysis

The principles of the estimation of uncertainties of reported quantities relying on Gaussian error propagation of the accuracies of sensors and parameters are documented in Appendix B. Commercial sensors in the United States (US) are traceable to the National Institute of Standards and Technology (NIST). The European providers of the sensors used at RUB guarantee conformity with DIN EN ISO/IEC 17025 (ISO/IEC 2017), i.e., the regulation for calibration services. We used the sensitivities provided by suppliers when transforming electrical signals to physical quantities. Furthermore, displacement transducers are calibrated on a regular basis against calipers; pressure gauges are referenced to analog Heise gauges; the readings of load cells are checked in relation to pressures recorded during hydrostatic loading of the triaxial rigs, measurements that also constrain the friction on the loading piston, and, at RUB, are also tested against a force ring.

Electronic noise in the digitized signal is small compared to the uncertainty of stress difference as determined by the error analysis. The uncertainty of stress difference of 0.4% calculated for peak and residual strengths (indicated for RUB data in the corresponding figures) includes accuracy of the external load cell and the uncertainty in initial sample diameter, i.e., the uncertainty related to the accuracy of the used caliper and shape imperfections but not the change in cross section due to pressurization or axial shortening. Using only initial cross section ensures the direct comparability of the results from the two laboratories, but leads to an increasing overestimation of stress difference with increasing axial strain (see Appendix B).
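How the load-cell accuracy and the diameter uncertainty combine can be sketched with Gaussian error propagation for a stress computed as force over initial cross section, σ = 4F/(πd²). The relative load-cell accuracy of 0.3% is a hypothetical value chosen for illustration, and the diameter uncertainty of 0.03 mm is borrowed from the diameter variation reported for the RUB samples; the actual budget is the one in Appendix B.

```python
import math


def relative_stress_uncertainty(rel_force: float,
                                diameter_mm: float,
                                delta_diameter_mm: float) -> float:
    """Gaussian propagation for sigma = 4 F / (pi d^2).

    The diameter enters squared, so its relative uncertainty counts twice.
    """
    rel_diameter = delta_diameter_mm / diameter_mm
    return math.sqrt(rel_force ** 2 + (2.0 * rel_diameter) ** 2)


# Assumed inputs: 0.3 % load-cell accuracy (hypothetical), d = 30 +/- 0.03 mm
rel_sigma = relative_stress_uncertainty(0.003, 30.0, 0.03)
print(f"relative uncertainty of stress difference: {rel_sigma:.2%}")
```

Under these assumptions the result is of the order of the 0.4% quoted above, with the diameter term contributing about as much as the assumed load-cell term.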

Stress difference is calculated relative to axial stress on the moving piston before it contacts the specimen (hit-point); this procedure eliminates seal friction as a source of uncertainty in axial load but for its potential variability with piston deformation. Yet, results of calibration experiments at hydrostatic conditions and deviatoric loading suggest that the friction on the deformation piston is controlled by the confining pressure and does not change with increasing axial load. Nevertheless, friction on the loading piston constitutes an example of methodological uncertainties that are difficult to constrain precisely and that are also encountered for the other physical property determinations (for details see Appendix B). For permeability determination, a likewise critical methodological issue is, for example, to what extent the combination of sample length and used end-plugs actually approximate the condition of one-dimensional flow underlying the evaluation of pressure transients. We propose that an accuracy in permeability of half an order of magnitude appears a realistic, in cases possibly conservative, rule of thumb. Smaller uncertainties have been reported for permeability (e.g., Benson et al. 2005; David et al. 2018a), but it seems that the full cumulative effect of the various sources of uncertainty was not appreciated in these cases. The partial consideration of uncertainty is potentially acceptable when the objective is to resolve the effect of a specific parameter, such as pressure on permeability, in a single study but not for an interlaboratory comparison.
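The proposed rule of thumb translates into a simple check on the logarithmic ratio of two permeability values. The sketch below is only an illustration; the permeability values are hypothetical and the function names are not taken from any of the laboratories’ evaluation routines.

```python
import math


def log10_ratio(k_a_m2: float, k_b_m2: float) -> float:
    """Difference between two permeabilities in orders of magnitude."""
    return math.log10(k_a_m2 / k_b_m2)


def agree_within(k_a_m2: float, k_b_m2: float,
                 tolerance_decades: float = 0.5) -> bool:
    """True if the two values differ by no more than the given tolerance."""
    return abs(log10_ratio(k_a_m2, k_b_m2)) <= tolerance_decades


# Hypothetical values in m^2: a factor of 2 passes, a factor of 5 does not
print(agree_within(1e-18, 2e-18))
print(agree_within(1e-18, 5e-18))
```

Half an order of magnitude corresponds to a factor of about 3.2, so two results differing by less than this factor would not be distinguishable under the proposed accuracy.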

Results

Mechanical parameters

Apparent Young’s modulus

The recorded stress–strain curves exhibit various degrees of non-linearity complicating determination of Young’s moduli (Fig. 2). The values reported here, labeled “apparent” to indicate that they might differ from intrinsic Young’s moduli, represent the maximum slope of the tangent to a polynomial fit to the pre-peak stress–strain curve. For about half of the tests, the apparent moduli determined by the two institutions agree within 15% (Fig. 2). However, the moduli determined at RUB tend to be larger than the ones determined at USGS. We do not find systematics in the dependencies of the moduli on confining pressure of the deformation tests; for example, the moduli measured at RUB for Carrara marble and Wilkeson sandstone exhibit a much less and a much more pronounced pressure dependence, respectively, than the ones determined at USGS. Nor do we observe a clear trend in the discrepancies between the moduli from the two laboratories with their absolute values or between tests on dry and saturated samples, which require different assemblies.
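The determination of the maximum tangent modulus can be sketched as follows. This is a minimal illustration using NumPy, not the evaluation code of either laboratory; the synthetic stress–strain curve, the evaluation grid, and the function name are assumptions, and only the polynomial degree of 5 follows the value stated for USGS in the caption of Fig. 2.

```python
import numpy as np
from numpy.polynomial import Polynomial


def max_tangent_modulus(strain, stress_mpa, degree=5, n_eval=1001):
    """Maximum slope of a polynomial fit to a pre-peak stress-strain curve.

    Returns the modulus (MPa) and the strain at which it occurs.
    """
    strain = np.asarray(strain, dtype=float)
    stress_mpa = np.asarray(stress_mpa, dtype=float)
    fit = Polynomial.fit(strain, stress_mpa, degree)  # least-squares fit
    slope = fit.deriv()                               # tangent modulus
    grid = np.linspace(strain.min(), strain.max(), n_eval)
    tangent = slope(grid)
    i = int(np.argmax(tangent))
    return float(tangent[i]), float(grid[i])


# Synthetic pre-peak curve (assumption): a cubic whose slope peaks at
# 50 GPa halfway to the peak strain of 1 %.
eps_peak = 0.01
c = 50_000.0 / (1.5 * eps_peak ** 2)
eps = np.linspace(0.0, eps_peak, 200)
sigma = c * (3.0 * eps_peak * eps ** 2 - 2.0 * eps ** 3)

e_max, eps_at_max = max_tangent_modulus(eps, sigma)
print(f"apparent modulus: {e_max / 1000:.1f} GPa at strain {eps_at_max:.4f}")
```

For real records, the result depends on the polynomial degree and on the stiffness correction applied beforehand, which is one reason the moduli are labeled “apparent”.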

Fig. 2

a Examples of stress–strain curves (blue colors) for dry samples of Sierra White granite deformed at 20 MPa confining pressure; the tangent moduli (brown colors) are gained from polynomial fits to the stress–strain curves (USGS: degree 5, RUB: degree 10), and their maxima are used as static Young’s moduli. The dashed sections with markers represent the phases of rapid failure, during which the elastic energy stored in the pistons included between the measuring points of the external displacement transducers unloads into the weakening sample. b Comparison between maximum tangent modulus, here used as constraint on apparent (static) Young’s modulus, determined by RUB and USGS. Error bars indicate experimental uncertainty. The long-dashed lines indicate one-to-one correspondence and the short-dashed lines indicate 15% boundaries. The data for Berea sandstone and Crab Orchard sandstone, and the corresponding correlation lines are shifted along the x-axis for presentational purposes. (acronyms USGS and RUB denote data gained at U.S. Geological Survey, Menlo Park, and Ruhr-Universität Bochum, respectively; labels “dry” and “sat” distinguish tests on dry and saturated samples, respectively)

Peak and residual strength

The repeat tests reveal good reproducibility for the characteristics of the stress–strain curves recorded at the two institutions, further documenting the homogeneity of the blocks (Table 5). Yet, the standard deviation of repeat tests exceeds the experimental uncertainty for stress difference, suggesting some influence of sample-to-sample variability regarding the distribution of micro-flaws not resolved by bulk properties, such as density or ultrasonic velocity (Table 2).

Table 5 Relative standard deviations (%) of peak strength, residual strength, and Young’s modulus obtained from repeated tests (for conditions see Table 3). The number of tested samples is indicated in parentheses

Peak strengths reported by the two institutions for the suite of rocks span an order of magnitude, with a “weaker” group comprising Carrara marble, Berea and Wilkeson sandstone, and a “stronger” group comprising Crab Orchard sandstone and Sierra White granite, and are generally in close agreement within < 10% (Fig. 3a), but some systematics in the small deviations are evidenced by the correlation details (Table 6). For all rock types except for Carrara marble, samples tested at USGS appear slightly stronger (< 12%) than those tested at RUB (see also Fig. 4). This observation also applies to Sierra White granite, for which results cannot be fully represented in the cross plots because of differences in the confining pressures applied at the two institutions, judging from a comparison of the trends of strength with pressure (Fig. 5). Unconstrained linear regression between the data sets of the two laboratories leads to intercepts of a magnitude (Table 6) that we find difficult to plausibly explain by systematic shifts in load measurements or stress determination but attribute to sample-to-sample variability.

Fig. 3

Comparison between a peak strength and b residual strength measured at RUB and USGS. In a, error bars for RUB indicate the total uncertainty of ± 0.4%. In b, error bars for USGS-data exemplify the strain dependence of residual strength. The long-dashed lines indicate one-to-one identity; the short-dashed lines indicate 10% and 20% deviations for peak and residual strengths, respectively. In the two plots, the data are split into two groups with results for one shifted along the x-axis for presentational purposes. (acronyms USGS and RUB denote data gained at U.S. Geological Survey, Menlo Park, and Ruhr-Universität Bochum, respectively; labels “dry” and “sat” distinguish tests on dry and saturated samples, respectively)

Table 6 Correlation of results for peak strength (USGS vs. RUB, see Fig. 3a) and its uncertainty estimated accounting for the experimental uncertainty of RUB data. (acronyms USGS and RUB denote data gained at U.S. Geological Survey, Menlo Park, and Ruhr-Universität Bochum, respectively)
Fig. 4

Deviation of peak-strength values from the identity line (Fig. 3a) in comparison to sample-to-sample variability as derived from standard deviations of the results of repeat tests (dashed lines: blue USGS, orange RUB). A positive deviation indicates that the strength measured by USGS exceeds that measured by RUB. Symbols are plotted in order of increasing confining pressure (see Table 3) from left to right; error bars indicate experimental uncertainty of ± 0.4%. (acronyms USGS and RUB denote data gained at U.S. Geological Survey, Menlo Park, and Ruhr-Universität Bochum, respectively)

Fig. 5

Peak strength as a function of confining pressure for dry Sierra White granite. (acronyms USGS and RUB denote data gained at U.S. Geological Survey, Menlo Park, and Ruhr-Universität Bochum, respectively)

The residual strengths determined by USGS tend to be less than the ones determined at RUB for nominally equivalent tests, most notably for Crab Orchard sandstone (Fig. 3b) but also for Sierra White granite (Fig. 5), the two strongest rocks. The effect becomes more significant at higher confining pressures, and is probably partly related to the difference in the extent of overshoot during unstable brittle fracture controlled by the difference in system compliance (Fig. 2a).

Hydraulic permeability

Measured permeability values span approximately six orders of magnitude (Fig. 6). The observed order of magnitude agreement in permeability between the participating laboratories is good considering that four different methods were used. The examination of samples of Berea sandstone and Crab Orchard sandstone drilled in three orthogonal directions by the group at PSU revealed hydraulic anisotropy with the measurement directions of USGS and RUB constituting the least permeable one and the two other directions being up to a factor of two more permeable.

Fig. 6

Comparison between permeability measured at RUB, USGS, and PSU, and by Song et al. (2013), label “Song”. Two calculation methods were used at PSU: average of single tests (avg) and linear approximation (lin). Error bars indicate experimental uncertainty. The dashed line indicates identity. Only results from the first pressurization are plotted, since the cycling procedure conducted at PSU after the initial pressurization did not fully comply with the recommended test sequence (see Appendix C for the documentation of cycle dependence). (acronyms USGS, PSU, and RUB denote data gained at U.S. Geological Survey, Menlo Park, Penn State University, and Ruhr-Universität Bochum, respectively)

For a single rock, permeability varied up to two orders of magnitude over the explored range in confining pressure. The pressure dependence of permeability differs significantly in two cases. The pressure dependence of Crab Orchard sandstone observed by USGS exceeds that reflected by data from PSU and RUB (Fig. 6b). Berea sandstone did not exhibit a pressure dependence in permeability for the investigated range when tested by the oscillatory method at RUB, while it did for pulse and constant-flux tests performed at PSU (Fig. 6a), albeit with considerable variation during the three loading–unloading cycles (see Appendix C).

Discussion

As a whole, the results for strength measures confirm that (a) the chosen rocks were suitable for a comparative study, and (b) the accuracies reached by the experimental setups and procedures do not limit the significance of the determined strength measures, in agreement with the conclusions of Pincus (1996). The situation is quite different for the results of the permeability determinations. The consistency between the order of magnitude of results may be considered satisfactory but discrepancies in detail of the results, in particular regarding the pressure dependence of permeability, suggest methodological issues.

Factors affecting deformation characteristics

The slope of a stress–strain curve resulting from a conventional triaxial compression test with a single loading cycle may deviate from the intrinsic static Young’s modulus of the tested material for a number of reasons (Fjær 2019), among them a notable physical one, the irreversible closure of microfractures (David et al. 2020). The accuracy of the transformation of external displacement measurements into sample strain is not only affected by the uncertainty of stiffness calibrations but also by potential tilting owing to non-parallelism of sample and/or piston end faces. The presented apparent moduli provide a way to evaluate the accuracy in strain, relevant, for example, in the light of the determination of characteristic strain values employed as rock-failure criteria (e.g., Aydan et al. 1993; Fujii et al. 1998) and also for discussions of the mismatch between static and dynamic elastic parameters (e.g., Fjær 2009, 2019).

The values for the apparent static Young’s moduli from the two laboratories fall within the limits expected from the composition of the tested rocks, but only half of them match within 15% with the values determined from tests performed at RUB tending to exceed the ones from tests at USGS. The good correspondence of maximum stress difference (Fig. 3) between the two laboratories suggests that neither uncertainty in stress determination nor imperfect sample geometry can account for the observed trend between the two moduli data sets. The compliances of the assemblies used at USGS and RUB are about 0.002 mm/MPa and 0.001 mm/MPa, respectively, and thus the corrections involved in strain determination amount to up to 70% of the total recorded displacement for tests at USGS on the stiffest rock type, Carrara marble. The compliance calibrations in the two laboratories follow the accepted procedure of testing a steel dummy with supposedly known elastic properties. The discrepancy between the two data sets for static Young’s moduli could well be the result of the successive approximations underlying its determination, i.e., (i) the approximation of the machine compliance by an analytical function used in the correction calculation (USGS: linear, RUB: non-linear) that prominently affects the details of the resulting stress–strain curves in particular during the initial steep increase, and (ii) the degree of the polynomial fit to the pre-peak section of the stress–strain curves. Apart from an overlooked methodological issue, which likely can only be resolved by a round robin, size dependence may play a role. Observations on size dependence of elastic moduli are not only disparate but also restricted to tests at ambient pressure (e.g., Zhai et al. 2020; Li et al. 2021) and thus may not apply to our set of data from tests at elevated pressure, at which the large microcracks that presumably dominate behavior at ambient pressure are closed.

The compressive strength of brittle materials critically depends on their inventory of microdefects, such as pores and cracks. The suite of tested sandstones serves as an illustrative example for the inverse correlation of strength and porosity. The role of microdefects introduces a random component to strength owing to the variability in the actual realizations of micro-flaw distributions beyond directionally independent bulk properties, such as density. Thus, it is not surprising that strength exhibits a variability beyond measurement accuracy. On average, however, the differences in strength observed for the two institutions are qualitatively and quantitatively in accord with the size-effect of higher strength for smaller samples, commonly considered a consequence of microdefect statistics (e.g., Bernaix 1969; Lockner 1995; Paterson and Wong 2005). For example, a typical strength loss of ∆σ/σ ~ (∆L/L)^(−1/2) (Lockner 1995) predicts the larger RUB samples to be approximately 8% weaker than the smaller USGS samples. Our results imply that sample size may affect interlaboratory strength comparisons or use of strength data as input in numerical codes. However, we cannot exclude that the differences in preparation contribute to the systematic difference in measured strength. For example, the absence of cylindrical grinding at RUB may facilitate fault nucleation at surface flaws and absolute differences in end-face parallelism between RUB and USGS may cause slight deviations to the stress distribution.
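The quoted size effect of about 8% can be checked with a short calculation. The sketch reads the scaling as strength proportional to the inverse square root of the sample dimension, which is an interpretation of the relation cited from Lockner (1995), and uses the nominal sample lengths of the two laboratories; the function name is hypothetical.

```python
def strength_ratio(length_large_mm: float, length_small_mm: float,
                   exponent: float = -0.5) -> float:
    """Strength of the larger sample relative to the smaller one.

    Assumption: strength scales as (sample dimension) ** exponent.
    """
    return (length_large_mm / length_small_mm) ** exponent


# Nominal sample lengths: RUB 75 mm, USGS 63.5 mm (the diameters 30 mm
# and 25.4 mm give the same ratio because the aspect ratios agree)
ratio = strength_ratio(75.0, 63.5)
deficit = 1.0 - ratio
print(f"RUB / USGS strength ratio: {ratio:.3f} (about {deficit:.0%} weaker)")
```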

According to volumetric strain measurements and the constraints on hydraulic diffusivity (Ahrens et al. 2017), the tests on saturated samples of Wilkeson sandstone were likely not fully drained. Insufficient internal drainage may increase or decrease the effective stress during deformation (or do both consecutively, depending on the evolution of hydraulic properties), thereby affecting strength. The shorter samples used by USGS in principle favor effective internal drainage over the longer ones used by RUB. The absence of a substantial difference between the strengths observed in the two laboratories for tests on saturated samples may indicate that the modest length difference does not critically affect internal drainage conditions in this case and/or be related to the generally low dilatancy-hardening potential (Brace and Martin 1968; Duda and Renner 2013) of the experiments performed at a fluid pressure of only 2 MPa. A low dilatancy-hardening potential would also suppress possible contributions of differences in the design of the interface between sample and piston, i.e., the realization of technical drainage, and in loading details, e.g., waiting time to reach equilibration after hydrostatic pressurization, and deviatoric loading with constant piston velocity vs. constant strain rate.
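
The influence of sample length on internal drainage can be gauged with an order-of-magnitude sketch based on the characteristic diffusion time t ~ L²/D. This is a generic scaling estimate, not the analysis of Ahrens et al. (2017). The function name, the choice of the full sample length as drainage path, the two lengths, and the diffusivity are hypothetical values chosen for illustration.

```python
def drainage_time_s(path_length_m, diffusivity_m2_per_s):
    """Characteristic pore-pressure diffusion time t ~ L**2 / D (order of magnitude only)."""
    return path_length_m ** 2 / diffusivity_m2_per_s

# Hypothetical values: two sample lengths and one hydraulic diffusivity.
DIFFUSIVITY_M2_PER_S = 1e-5
time_short_s = drainage_time_s(0.05, DIFFUSIVITY_M2_PER_S)
time_long_s = drainage_time_s(0.10, DIFFUSIVITY_M2_PER_S)
time_ratio = time_long_s / time_short_s
```

Because the time scales with the square of the drainage path, doubling the length quadruples the characteristic time in this sketch. Whether a test is drained then depends on how this time compares with the duration of loading.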

Residual strength, in contrast to peak strength, is hard to uniquely determine, because the post-failure section of stress–strain curves typically does not reach a well-defined stress plateau (Fig. 2a). Ideally, residual strength in brittle faulting represents a constant frictional stress, independent of continued sliding, attained after a fault is fully developed. In practice, sample failure may produce fractures that intersect the loading pistons in contact with the samples or produce fractures with varying fault angles. As a result, reproducibility of residual strength is expected to be worse than for peak strength. Furthermore, the actual contact area of the fracture plane decreases with continued sliding, leading to a decrease in residual stress with increasing axial strain (Fig. 2a), even for a constant friction coefficient. Thus, the difference in the absolute strain at which residual stress was determined may account for the difference in residual stress values between the two laboratories. This strain is partly controlled by machine stiffness, which governs the uncontrolled release of elastic energy stored in the loading pistons when a sample fails rapidly. The role of machine stiffness for post-failure characteristics has been noted before (e.g., Hudson et al. 1972; Mansurov 1994); the jacketing procedure and material as well as sample size may also have some effect. Combined with measurements of the shear fracture orientation determined on samples retrieved from the vessel after the conventional triaxial testing, the residual strengths determined at RUB are in general agreement with Byerlee’s rule (Fig. 7) up to about 150 MPa normal stress. Wilkeson sandstone exhibits the lowest friction coefficient, as previously observed for other porous sandstones (Costamagna et al. 2007), in this case possibly related to its fairly large content of phyllosilicates (Table 1; Tembe et al. 2010). 
The deviations from Byerlee’s rule observed for the sandstone samples at normal stresses above about 150 MPa may indicate the increasing contribution of cataclastic flow by pore collapse to their deformation.
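
The conversion of residual strength and fault angle into the normal and shear stresses compared with Byerlee’s rule can be sketched as follows. The sketch assumes that the fault angle is measured between the fault plane and the axis of maximum compression and that effective principal stresses are supplied. Both conventions, the function names, and the example values are assumptions made here. The bilinear rule uses the commonly quoted form (0.85 σn up to 200 MPa; 50 MPa + 0.6 σn above).

```python
import math

def fault_stresses(sigma1_mpa, sigma3_mpa, fault_angle_deg):
    """Normal and shear stress (MPa) on a plane inclined at fault_angle_deg to the sigma1 axis."""
    theta = math.radians(fault_angle_deg)
    mean = 0.5 * (sigma1_mpa + sigma3_mpa)
    half_diff = 0.5 * (sigma1_mpa - sigma3_mpa)
    sigma_n = mean - half_diff * math.cos(2.0 * theta)
    tau = half_diff * math.sin(2.0 * theta)
    return sigma_n, tau

def byerlee_shear_mpa(sigma_n_mpa):
    """Bilinear Byerlee rule in its commonly quoted form (stresses in MPa)."""
    if sigma_n_mpa <= 200.0:
        return 0.85 * sigma_n_mpa
    return 50.0 + 0.6 * sigma_n_mpa

# Hypothetical example: residual axial stress 300 MPa, confining pressure 50 MPa, 30 degree fault.
sigma_n, tau = fault_stresses(300.0, 50.0, 30.0)
friction_coefficient = tau / sigma_n
byerlee_tau = byerlee_shear_mpa(sigma_n)
```

Comparing `tau` with `byerlee_tau` at the same normal stress gives the deviation from the rule for a single test. The result is sensitive to the measured fault angle, which is one reason why variable fault angles limit reproducibility.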

Fig. 7
Normal and shear stresses derived from residual strengths and failure angles observed by Ruhr-Universität Bochum (labels “dry” and “sat” distinguish tests on dry and saturated samples, respectively). The dashed line indicates Byerlee’s bilinear rule (Byerlee 1978)

Issues related to the determination of hydraulic permeability

Constant-flow experiments correspond to a direct implementation of Darcy’s law, and their results thus have benchmark character for the permeability of a specific sample. The analysis procedures of all transient methods assume that samples represent homogeneous and isotropic continua on length scales much smaller than the sample scale, an assumption whose general applicability appears rather debatable in the light of the complexity of the conduit networks of rocks. Nevertheless, Schepp and Renner (2021) showed that constant-flow experiments and oscillatory pore-pressure tests (harmonic pressure-interference) agree within experimental uncertainty for Wilkeson sandstone and Westerly granite, the latter probably a good match for Sierra White granite, when performed on the same sample.
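
The evaluation of a constant-flow test with Darcy’s law can be sketched in a few lines. The sketch assumes steady, one-dimensional flow of an incompressible fluid through a cylindrical sample, and all input values and the function name are hypothetical. It is not the evaluation procedure of any of the participating laboratories.

```python
import math

def darcy_permeability_m2(flow_rate_m3_per_s, viscosity_pa_s, length_m,
                          diameter_m, pressure_drop_pa):
    """Permeability k = Q * mu * L / (A * dp) for steady axial flow through a cylinder."""
    area_m2 = math.pi * (0.5 * diameter_m) ** 2
    return flow_rate_m3_per_s * viscosity_pa_s * length_m / (area_m2 * pressure_drop_pa)

# Hypothetical test: water-like viscosity, 50 mm long and 25 mm wide sample, 1 MPa pressure drop.
permeability = darcy_permeability_m2(
    flow_rate_m3_per_s=1e-8,
    viscosity_pa_s=1e-3,
    length_m=0.05,
    diameter_m=0.025,
    pressure_drop_pa=1e6,
)
```

For these inputs, the permeability is on the order of 10^-15 m². Only directly measured quantities enter the calculation, which is why constant-flow results serve as a reference for the transient methods.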

Testing different samples in different laboratories fundamentally cannot resolve whether the differences in permeability results obtained using different methods reflect sample-to-sample variability or methodological characteristics, a limitation that also applies to the recent comparative study of the permeability of Grimsel granodiorite (David et al. 2018a,b). The sample of Crab Orchard sandstone tested by Song et al. (2013) originated from the block used by PSU in this study and has a reported connected porosity of 3.5 ± 0.1%, i.e., almost 2% lower than that of the samples tested at RUB (Table 2), pointing to differences between samples from different blocks due to natural variability of the rocks. Yet, the deduced relation in porosity is opposite to the relation in permeability values gained at PSU and RUB (Fig. 6b). Heterogeneity has been demonstrated to be a crucial factor for the outcome of permeability measurements with transient methods, in some cases causing a considerable effect of sample size (Song and Renner 2006) that may contribute to the observed differences here, too, owing to the difference between the sample diameter used by RUB and that used by PSU and USGS.

Besides inhomogeneity, anisotropy constitutes an important and yet unresolved issue for permeability determination with transient methods. Judging from the first measurements at the lowest effective pressures performed at PSU in three perpendicular directions, the difference between the most and least permeable direction is less than a factor of 3 for Berea sandstone. The constant-flow tests on samples of Berea sandstone constitute benchmarks for the degree of anisotropy in permeability, possibly including some sample-to-sample variability though. The significance of the anisotropy constraints from constant-head tests on samples of Crab Orchard sandstone, i.e., a ratio of about 2 between least and most permeable direction, however, remains compromised by the unresolved effect of anisotropy on the evaluation strategy. Analytical and/or numerical modeling may facilitate progress in resolving this fundamental problem of the determination of hydraulic properties.

The most significant and suspicious differences in the results for permeability from the three institutions arise from their pressure dependence (Fig. 6), which is unlikely to result from either heterogeneity or anisotropy of the tested samples. The partial convolution of the differences in pressure dependence with significant cycle dependences (Appendix B) may indicate protocol biases involving the actual achievement of pore-pressure equilibration between the various pressure steps, with the oscillatory method nominally depending less on equilibration. The systematic inverse correlation of compressive strength of the tested rocks with the differences in pressure dependence and the occurrence of cycle dependence may, however, also indicate a contribution of local failure at sample end-faces in contact with the permeable end-plugs. Dedicated microstructural investigations and design variations could in principle clarify this issue. Finally, differences in the total duration of permeability tests may play a role when the samples contain clay minerals with the potential for swelling, as might be true for Berea sandstone (Table 1).

Conclusions

The sample-to-sample variability inherent to a natural material and the potential size dependence affect the quantitative significance of experimental data from laboratory tests on rock samples for validation of numerical codes. Constraining the actual sample-to-sample variability by basic physical characterization of samples and repeat tests may improve the understanding of the significance of results. Our interlaboratory comparison suggests that unresolved methodological uncertainties remain for permeability tests, and to a much lesser degree for triaxial compression tests, that by far exceed the uncertainties from error propagation calculations based on the typical accuracy of high-quality sensors used in laboratories.

Static Young’s moduli were not included in the “official” work program of the interlaboratory comparison, but we reported results, because the documentation of differences appears instructive regarding the significance of numerical values for this parameter and highlights the importance of clarifying calculation procedures as well as paying attention to machine details, such as the number of external displacement transducers used and the stiffness correction employed. Post-failure behavior, more so than failure behavior, appears to be an issue of conventional triaxial testing that needs to be addressed further regarding its relation to system stiffness. The interpretation of testing at elevated pore pressure may benefit from a thorough validation of effective drainage conditions.

The results for the various commonly applied methods to determine hydraulic permeability may be affected differently by heterogeneity at the sample scale, and by anisotropy. However, the observed differences in the dependence of permeability on pressure and pressurization history point to the potential benefits of confirming the suitability of the design of apparatus components and of the test procedures. Validation of permeability determinations in the context of digital rock physics (e.g., Mehmani et al. 2020) may have to account for the different boundary conditions used in experiments.

The extensive data set is provided in repositories (Cheng et al. 2023; Lockner et al. 2023) to serve future “benchmarking” intentions, be it to check the performance of new laboratory equipment or of numerical modeling approaches. In particular, the complete records of the deformation tests performed at elevated fluid pressure may allow testing hydro-mechanical codes. A great opportunity to make progress in understanding the role of heterogeneity and anisotropy for laboratory-based constraints on physical properties of rocks lies in the bidirectional exploitation of the synergies between modeling and experimental approaches.