Statistical and geostatistical study of Rn and hydrocarbon components of a soil gas monitoring system : an application to surface hydrocarbon exploration

This work addresses three topics: (1) the study of the joint areal distribution of the Rn and hydrocarbon components of soil gases over a large region overlying some known hydrocarbon reservoirs in the southern part of Hungary; (2) the relationships between the positive anomalies of Rn and hydrocarbon components of soil gases to the existing reservoirs; (3) suggestions for new targets for surface hydrocarbon exploration based on the results. Given the very low correlation coefficients between the Rn and hydrocarbon components of the soil gases, factor analysis was used to reveal a background process controlling the common migration of hydrocarbon and Rn components. The lateral distribution of the factor scores were studied using sequential Gaussian distribution. The E-type grid generated from 100 realizations indicated several positive anomalies at the surface. Indications with a larger than 0.7 probability were kept for further analysis. Seismic sections of a 3D survey support the comparison of the surface locations of these anomalies and the surface projections of the known reservoirs. The results proved the connection between the known reservoirs and the Rn and HC components of soil gases. From the positive verification, regions with a high probability positive anomaly of factor scores, but without any reservoir counterparts may be suggested as targets for further surface hydrocarbon exploration. However, most of the existing applications rely only on de­ tecting either direct or indirect indicators. On the ground, not sur prisingly, the geostatistical analysis deals with these sources of information separately. CIOTOLI et al. (2007) studied the spa­ tial character of a soil gas monitoring system; OLIVIER & KHARYAT (1999), BUTTAFUOCO et al. (2007) dealt with the geostatistics of Rn­monitoring systems; MALDONATO & CAMPBELL (1992) or, recently, CASTILLO et al. (2015) used geostatistical tools in the spatial analysis of the hydrocarbon com­ ponents of soil gases. Only a few papers have dealt with a geosta­ tistically supported integrated interpretation of the Rn and hy­ drocarbon components of soil gases. The aims of this study were to (1) determine the common areal distribution of the Rn and hydrocarbon components of soil gases in a large region above some known hydrocarbon reservoirs in the southern part of Hungary (Fig.1); (2) relate positive anom­ alies of Rn and hydrocarbon components of soil gases to the ex­ isting reservoirs, and (3) to suggest new targets for the surface hydrocarbon exploration on the basis of the results. The studied area is located in the Southwestern part of Hun­ gary, close to the Croatian border. In this region, six known hy­ drocarbon reservoirs occur, including from North to South: Berzence, Somogyudvarhely, Vízvár­Nord, Vízvár, Heresznye and Görgeteg­Babócsa (Fig. 1). They were developed in the Szol­ nok, Algyő and Újfalu Formations of the Miocene sequences. The studied area was also surveyed by 3D seismic survey, the results of which were used in the latter part of this paper to compare and validate the findings of the geostatistical analyses. Article history: Manuscript received October 31, 2015 Revised manuscript accepted April 17, 2016 Available online June 28, 2016

However, most of the existing applications rely only on de tecting either direct or indirect indicators.On the ground, not sur prisingly, the geostatistical analysis deals with these sources of information separately.CIOTOLI et al. (2007) studied the spa tial character of a soil gas monitoring system; OLIVIER & KHARYAT (1999), BUTTAFUOCO et al. (2007) dealt with the geostatistics of Rnmonitoring systems; MALDONATO & CAMPBELL (1992) or, recently, CASTILLO et al. (2015) used geostatistical tools in the spatial analysis of the hydrocarbon com ponents of soil gases.Only a few papers have dealt with a geosta tistically supported integrated interpretation of the Rn and hy drocarbon components of soil gases.
The aims of this study were to (1) determine the common areal distribution of the Rn and hydrocarbon components of soil gases in a large region above some known hydrocarbon reservoirs in the southern part of Hungary (Fig. 1); (2) relate positive anom alies of Rn and hydrocarbon components of soil gases to the ex isting reservoirs, and (3) to suggest new targets for the surface hydrocarbon exploration on the basis of the results.
The studied area is located in the Southwestern part of Hun gary, close to the Croatian border.In this region, six known hy drocarbon reservoirs occur, including from North to South: Berzence, Somogyudvarhely, VízvárNord, Vízvár, Heresznye and GörgetegBabócsa (Fig. 1).They were developed in the Szol nok, Algyő and Újfalu Formations of the Miocene sequences.The studied area was also surveyed by 3D seismic survey, the results of which were used in the latter part of this paper to compare and validate the findings of the geostatistical analyses.

The basic geochemical theory
The parentelements of radon are supposed to migrate from the subsurface HCreservoir.Essentially, they are 238 U, 230 Th and 226 Ra.Radon (Rn) can appear in two different decay series: (1) 238 U ® 226 Ra ® 222 Rn (2) 232 Th ® 224 Ra ® 220 Rn Although radium (Ra) appears in both decay series, the two isotopes ( 222 Ra and 220 Ra) are not identical: their difference lies in their halflives.The half live of 222 Ra (from 238 U) is 38 days, while it is only 55 seconds for 220 Ra (from 232 Th).When equal quantities of each isotope are available under the same physical environment, the average travel distance of 222 Ra is around 100 cm, whereas that of 220 Ra is only 1 cm.That is why 222 Ra is ex pected to make much more of a contribution to the gamma flux.
The literature explains that the geochemical background of radiometric signals measured above hydrocarbon reservoirs is very wide.Some authors (e.g.ARMSTRONG & HEEMSTRA, 1973, SAUNDERS et al., 1993) proposed that deep seated hydro carbon pools provided the sources of radionuclides and their pro geny.They thought that radionuclides were transported by forma tion fluids from the deep pool, along some hypothetical pathways, up to the surface.In their view, relatively high gamma emissions form the "halo", while low emissions exist within the boundary of the halo.MORSE & ZINKE (1995) provided an elegant interpretation for the formation of the radiological "halo".Their theory was bas ed on Laubmeyer's concept of vertical gas migration.In this process, with the passage of time, low molecular weight hydro carbons migrating upward induce a chemically reducing environ ment in zones overlying the hydrocarbon (HC) accumulations.When oxidized uranium enters into this zone, its upward move ment stops.Consequently, the uranium concentration becomes lower than it would have been with no hydrocarbon present.The uranium attempting to transit into the reducing zone will preci pitate at the zone boundary.This precipitation appears as a halo as reported by PIRSON (1969).
The theories say that the radiation anomalies are not directly connected to the deep HC reservoirs.Rather, "geochemical cells", developed above the HC reservoirs, initialize such geochemical changes which cause the adsorption of the uranium content of the formation fluids.Radon, as the most mobile member of uraniumdecay, can give effective information about these anomalies.KRISTIANSSON & MALMQVIST (1982) explained the non diffusive transport of Rn by carriergas transport.One of the most complete theories for the relationship between the subsur face reservoirs and the surface radiological measurements are given by TEDESCO (1995).In Tedesco's theory, a geochemical cell with the cloud of inverse electric charge develops between the (subsurface) reservoir and the surface.It initializes the migra tion of ionic geochemical constituents.It is very important to note that only the longlived parent elements of Rn take part in this process.Radon is only a tracing element of this process, which can migrate via microseepage in a watersaturated permeable agent.VÁRHEGYI et al. (2008) supposed that vertical migrations of formation fluid and gas can be connected to the subsurface HC accumulations which carry radionuclides from the formations be tween the reservoirs and the surface.In this case the source of radioactive elements is not the HC reservoir, but the rock bodies lying above the HC layers.The reservoir only initializes the fol lowing simplified transport mechanism: HC reservoir ® geo chemical cell ® change of redox conditions near the surface ® selective adsorption of the dissolved uranium at the redoxfronts ® uraniumradium and radiumradon transformation ® upward migration of radon.Figure 2. shows this combination of the hy pothesized geochemical cell and the microseepage transport model of Rn migration (VÁRHEGYI et al., 2008).In this imple mentation the geochemical cell and the connected microseepage transport model of Rn assume high water saturation.The migra tion processes slows down above the surface of underground water.There is no chance of detecting Rn indication if the water surface is below 10 m (VÁRHEGYI et al., 2008).
In summary, it can be concluded that the radiation anomalies are not directly connected to the deep HC reservoirs.Rather, "geo chemical cells" developed above the HC reservoirs initialize such geochemical changes which cause the adsorption of the ura nium content onto the formation fluids.Radon, as the most mo bile member of uraniumdecay, can give effective information about these anomalies.The critical point in the applicability of the researchmethod based on radon detection is the differentiation of the deep origin and the background radiation.VÁRHEGYI et al. ( 2008) have de veloped a method for measuring this.

Soil gas measurements
During the radon-monitoring field measurements, a specified mass (12 g on average) of soil was taken for the headspace gas method of soil gas analysis.The samples were placed into a glass container of 20 ml volume, and sealed by a rubber membrane.This glassware was then heated at 85 °C for 20 minutes prior to gaschromatographic measurement.The headspace of the glass container was sampled by syringe through the membrane and the gas sample in the syringe was injected into a gas-chromatograph with an FID detector.With this apparatus, simple hydrocarbons C1, C2, C3, nC4, iC4, nC5 and iC5 (methane, ethane, propane, normal butane and isobutane, normal pentane and isopentane) were measured.To assess the reliability of the measurements, how the concentrations of hydrocarbon components decreased in "natural" and "artificial" soil samples were analyzed in the labo ratory.From the measured hydrocarbon components (HC), only C2, C3, nC4 and nC5 are involved in the further statistical and geostatistical analysis.

Analysis of the 222 Rn concentration of soil gas
The samples were taken in a regular surface grid where the nodes were 250 m apart.The change of background Ra radiation was followed by the so called basic registrations process.In the moni toring locations, three other properties were measured and cal culated besides the 222 Rn concentration.They were the uranium equivalent of the 222 Rn concentration (UeRn), the difference be tween 222 Rn and 226 Ra (RnRa), and the so called soilcorrected 222 Rn (RnAi) content.
The uranium equivalent of the measured Rn concentration (UeRn) was defined as the concentration of the parent element of Rn in the soil.It was directly measured using gammaspectro metry.Under general conditions, the 226 Ra concentration of the soil environment is the origin of the 222 Rn concentration.How ever, migration processes can increase or decrease this "regular" 222 Rn concentration.When the migration adds Rn to the "regu lar" concentration, this RnRa difference is positive, otherwise it is zero or negative.The local characteristics of soil geochemistry and its physical properties can significantly affect the Rn concen tration.The effects of these features on the local Rn concentra tion can even be much larger than those of the migration proce sses.That is why correction of the measured concentrations to the soilvariability was of primary importance.
The samples were grouped in K sets with homogeneous soil characteristics, , 1, 2, , = … k S k K. Then everages and variances were calculated for each set, , 1, 2, , Rn Ai was determined for each sample as a "standardized" value: where k j Rn is the Rn content of the sample, ( ) Mean is the average, and ( ) s k is the variance of the Rn content in ( ) Rn Ai can have positive and negative values.

FIELD DATA COLLECTION AND ITS DESCRIPTIVE STATISTICS
The geochemistry suggested that only the C2 (ethane), C3 (pro pane), nC4 (normal butane) and nC5 (normal pentane) compo nents of the HC gases were worth keeping in the statisticalgeo statistical modeling, since they could be expected to be of deep origin.All these components were expressed in ppb.In contrast, all the parameters measured or calculated in the radiological mo nitoring were involved in the analyses.They were as follows: Rn (radon content of soil gas, in kBq/m 3 ), UeRn (uranium equivalent of the Rn content, in ppm), RnRa (the difference between Rn and Ra components, in kBq/m 3 ), and RnAi (Rn measurements corrected by the soil environments, dimensionless).
In the sampling process of soil gas and radonmonitoring, 2441 samples altogether were analyzed.However, among them there were only 538 in which all the parameters could have been measured (Fig. 1).They were the target of the geomathematical analyses.Table 1.shows their general statistical characteristics.Since the hydrocarbon components had very skewed probability distributions, their original values were transformed by a log10 function before calculation of the statistics in Table 1.
The minimum value of RnRa, in Table 1, has a negative sign indicating that there were sampling locations from where the Rn migrated into the surroundings.However, the high positive value of the maximum RnRa property indicated that there were sam pling sites where the migrated Rn content were added to those generated "in situ" (Table 1).
The correlation coefficients between the radon and hydro carbon components in Table 2 (coloured light grey) were signifi cant, but the deterministic coefficients (r 2 ) of the highest correla tions between the Rn and HC components were, at best, 0.04.
simulation in the analysis of the spatial distribution.This ap proach does give the best estimation, but tends to honour the sta tistical characteristics while emphasizing the small scale hetero geneities (MUCSI et al., 2013).

Methods used in the statistical and geostatistical modeling
The workflow of statistical and geostatistical analysis shows the steps in which field and surface seismic data are integrated and their spatial connections revealed (Fig. 3).
The statistical and geostatistical analyses were based on three principles.(1) The factors were regarded as compact sum marizations of the background processes affecting the Rnand HCcomponents.Because factors represented continuous spatial processes in the joint migration of Rn and hydrocarbon gases, they could be regarded as continuous spatial variables and ana lyzed by geostatistical tools.(2) Because of the scattered sam pling, the spatial uncertainty of the geostatistical model was also important to study.(3) The final goal was to indicate the proba bility of those regions where the covariability of Rnand HC components could suggest HC structures at depth.

Factor analysis
Factor analyses are very effective tools of linear data integration.The main applications of factor analytical techniques are to re duce the number of variables and to detect structure in the rela tionships between variables, that is, to reduce data dimensiona lity.Here, we used this technique in the latter sense, as a structure detection method.In this so called Exploratory Factor Analysis (EFA) we tried to uncover complex patterns by exploring the da  These results were the consequences of very heterogeneous soil types, and the temperature and humidity differences between the sampling periods.This fact (among others) had three important consequences: (1) it was not possible to make a direct comparison of Rn and HC components; (2) instead of forcing the analysis of their pairwise relations, the pattern of their intercorrelations was targeted in the analysis; (3) quite high uncertainty could be ex pected in the measured data.Point (2) meant that a factor analysis as a data integration tool could be expected to reveal the connections between the HC and Rn components (JÖRESKOG et al., 1976;OLIVIER & KHARYAT, 1999).Point (3) suggested the use of a stochastic Figure 3. Workflow of analysis strategy.
taset and testing predictions (CHILD, 2006).EFA is used when the aim is to discover the number of factors influencing variables and to analyze which variables 'go together'.A basic hypothesis of EFA is that there are m common 'latent' factors ( 1 2 , , , … m F F F ) to be discovered in the dataset, and the goal is to find the small est number of common factors, p , that will account for the cor relations of the variables, 1 2 , , , … p X X X (MCDONALD, 1985).The model assumes that each observed variable is a linear function of these factors together with a residual variance.This model intends to reproduce the maximum correlations.
where ( ) ji a i m are the factor loadings, and j e is the specific or unique factor.
Generally, a factor analysis decomposes the modified corre lation matrix.The modification is that the original diagonal ele ments (they equal to 1) of the correlation matrix are replaced by the prior communality estimates.It means that this technique does not decompose the whole variance of each variable since it may also have error variance.The estimated proportion of vari ance of the variable (communality) is assumed to be free of error variance and is shared with other variables in the matrix.It is the variance of a variable in common with all others together (com munalities are smaller than 1).In this way factor analysis gives enough room to take measurement error and measuring uncer tainty into consideration.
One of the most convenient ways to interpret the results of factor analysis is the interpretation of factor loadings.The factor loadings are the correlation coefficients between the variables and factors.They represent to what extent a variable is explained by a factor.Analogous to Pearson's r , the squared factor loading is the percent of variance in the particular indicator variable explained by the factor.Loadings can range from (1) to 1. Loadings close to (1) or 1 indicate that the factor strongly affects the variable.In the interpretation of factor loadings the first important decision is to identify the high loadings.There is a rule of thumb saying that loadings should be 0.7 or higher.The rationale of this statement is that the 0.7 level corresponds to about one half ( 2 0.7 0.49) = of the variance in the indicator being explained by the factor.How ever, some researchers, particularly for exploratory purposes, use a lower level such as 0.4 for the central factor and 0.25 for other factors.In any event, factor loadings must be interpreted in the light of theory, not by arbitrary cutoff levels (STEVENS, 2002).Factor analysis can also be viewed as a transformation of the original sample space (defined by the original variables) into an orthogonal space defined by the factors.The coordinates of this factor space are called factor scores.Computing factor scores fa cilitates the search for factor outliers.Also, factor scores may be used as variables in subsequent modeling.
Here the main goal was to reveal such a factor in which both the Rn and all the hydrocarbon components appear with high positive factor loadings.After several trials a special factoring algorithm was discovered in which these requirements have been realized.In this algorithm the communality was approached by the multiple correlation coefficients, and we used the centroid method (THURSTONE, 1931) to extract the factors.The eigen values of the extracted factors were larger than 1.The rationale of this criterion was that the variance of the defined factor should be at least as large as that of each standardized variable.The cut off for the definition of a high factor score was 0.5.For each sam ple the factor scores were also calculated.Factor scores represent the amount of the effect of the actual factor exercised in each ob ject (sample).Consequently, the areal distribution of factor scores could be expected to show the areal variability of the 'hidden' process controlling the joint migration of Rn and the hydrocarbon components of soil gases.

Variography
In the geostatistical literature, spatial patterns are usually de scrib ed in terms of the dissimilarity (instead of similarity) be tween observations as a function of the separation distance.The average dissimilarity between data separated by a vector h is measured by the experimental semivariogram ( ) g  h , which is computed as half the average of the squared difference between the components of every data pair: (3) where ( ) N h is the number of data pairs for a given distance.The main purpose of the variogram calculation is partly to analyze the directional heterogeneity, and partly to transfer this information to kriging or simulation procedures.Therefore, a continuous function must be fitted to the values of the experimen tal variograms so as to deduce semivariogram values for any pos sible lag h required by prediction algorithms and also to smooth out sample fluctuations.The difficulty is that only conditionally negative definite functions can be considered as semivariogram models, in order to ensure the nonnegativity of the variance of the prediction error.In practice, only a few models are known to be permissible (e.g.DEUTSCH & JOURNEL, 1998, DEUTSCH 2002or GOOVAERTS, 1997).
Typically, two or more permissible models must be combined to fit the shape of the experimental semivariogram.Combinations of permissible models are permissible as long as the contribution of each basic model is positive, that is the nested model is writ ten as: where The way in which the permissible models are chosen and their parameters (range, sill) are estimated is still controversial (MCBRATNEY & WEBSTER, 1986;GOOVAERTS, 1997).There are several ways to check the goodness-of-fit of the per missible model.It can be done per view, or using some criteria based on minimizing the least square differences between the experimental and fitted model.However, the tool of variogram map can also be applied for this purpose.By comparison of the variogram map coming from original data and that of variogram models, the goodness of the variogram models can be evaluated.

Cell declustering
The sample points in this study show neither regular nor random spatial distribution over the region (Fig. 1).This kind of 'sam pling' is said to be preferential.This situation means that the available samples are quite far from statistical representativity.Consequently, neither the summary statistics of the data set nor the shape of the histograms of the measured attributes may be representative for the entire volume of interest.
Although most contouring or mapping algorithms adjust this preferential clustering by default, simulation algorithms fail in the process.These methods rely on the "intrinsic" assumption of the sample distribution being representative of the entire volume of interest (DEUTSCH, 2002).Since the "veryleaky" spatial ar rangements needed the application of stochastic simulations, the situation amounts for data"declustering".There are two com monly used solutions: the polygonal method (ISAAKS & SRIVASTAVA, 1989;GOOVAERTS, 1997;GEIGER, 2012) and the so called celldeclustering approach (JOURNEL, 1983;DEUTSCH, 1989;GOOVAERTS, 1997;GEIGER, 2012).In this study we used the latter method.
The cell-declustering approach calls for definition of a grid system over the A area of interest and counting the number B of cells that contain at least one datum and the number b n of data falling within each cell b.Then the weights are defined as: (5) These weights are inversely proportional to the number of data in each cell.Hence, these weights give more importance to isolated locations.A weight greater than 1 implies that the sam ple space is being over weighted, and a weight less than 1 shows that the corresponding location is being under weighted.In the latter case it is clustered with other locations.
In the implementations the "best" cell size for declustering the sample points can be recognized by plotting the declustered mean versus cell size for a range of cell sizes.From this plot the size with the lowest declustered mean should be chosen when data are clustered in highvalued areas or the size with largest declustered mean in the inverse situation (DEUTSCH, 1989).

Sequential Gaussian Simulation
Consider the simulation of the continuous attribute z at N grid nodes , then sampling it at each of the grid nodes visited along a random sequence.To ensure reproduction of the z semivariogram mo del, each ccdf is made conditional not only to the original n data but also to all values simulated at previously visited locations.
Other realizations are obtained by repeating the entire sequential drawing process.Sequential Gaussian simulation (sGs) typically involves a prior transform of the z data into normal score data.
The simulation is then performed in the normal space and the simulated normal scores are backtransformed in order to iden tify the original z histogram.

Goodness of the probabilistic model (simulation)
The goodness of a probabilistic model may be checked by its ac curacy and precision.In general, accuracy refers to the ultimate excellence of the data or computed results, e.g.conformity to truth or to the standard.Precision refers to the repeatability or refine ment (significant figures) of a measurement or computed result.DEUTSCH (1997) proposed specific definitions for these terms.In his view accuracy and precision were based on the ac tual fraction of true values falling within symmetric probability intervals of varying with p.A probability distribution is accurate if the fraction of true values falling in the p interval exceeds p for all p in [0,1].The precision of an accurate probability distri bution is measured by the closeness of the fraction of true values to p for all p in [ ] 0,1 .Stochastic simulation leads to L stochastic realizations at each data location.These realizations provide a model for the conditional cumulative distribution function (ccdf): where ( ) i n u is the set of n data at location i u .These local models may be derived from a set of L realizations, calculated from in dicator kriging or defined by the Gaussian mean, variance and transformation.
The probabilities associated with the true values ( ), 1, 2, , The average of ( ) u p over the n location i u : ( ) ( ) is the proportion of locations where the true value falls within the symmetric − p PI .According to the earlier definition of accuracy, the simula tion algorithm used to generate the ccdfs is accurate when ( ) , x ≥ ∀ p p p. A grapcal way to check the assessment of accu racy is to plot ( ) x p versus p and to see if all of points fall above or on the 45° line.This plot is referred to as an accuracy plot (DEUTSCH, 1997).

Analyses of uncertainty
The advantage of stochastic simulation over interpolation (kri ging) is that it allows the reproduction of statistics (histogram, semivariogram, scattergram) inferred from the data, hence the model or realization looks more "realistic" than a smooth esti mated map.Also, one can generate multiple realizations that all reasonably match the same sample statistics and exactly identify the conditioning data.The set of alternative realizations provides a visual and quantitative measure (actually a model) of spatial uncertainty.Spatial features are deemed certain if seen on most of the simulated maps.Conversely, a feature is deemed uncertain if seen only on a few simulated maps.Simulation can be accom plished using a growing variety of techniques, which differ in the underlying random function model (multiGaussian or nonpara metric), the amount and type of information that can be accounted for, and the computer requirements (GOTWAY & RUTHER FORD, 1994;SRIVASTAVA, 1996;DEUTSCH & JOURNEL, 1998).
According to SRIVASTAVA (1996), geostatistical realiza tions can be used for three main purposes: (1) assessing the im pact of uncertainty; (2) reproducing the spatial variation; (3) per forming MonteCarlo risk analysis.The concept of space of uncertainty, and the associated issue of equiprobability of reali zation were discussed in detail in the 1990s (e.g.JOURNEL, 1994;SRIVASTAVA, 1994).Some authors believed that the space of uncertainty must have been theoretically defined outside the use of a particular algorithm.Others stated that the space of un certainty could only be defined through the algorithm and con sists of all possible realizations that could be generated by the algorithm.
There is currently no theory that allows us to determine the set of all possible outcomes if fairly sampled.The characteriza tion of the space of uncertainty is rendered difficult by the fact that only a limited number of realizations is usually generated.
The uncertainty can be quantified by measures such as en tropy, interquartile range, or variance.In this paper, the variance was preferred because of its simplicity and wide acceptance as a measure of spread.The uncertainty of a probabilistic model was defined as the average conditional variance of all locations in the area of interest: where there are N locations of interest , 1, 2, , = … i u i N with each variance ( ) ) F u z n u .

Probability maps
SGS results in the generation of conditional cumulative probabi lity distributions (ccdf) at each grid node via the calculation of stochastic realizations.Hence, calculation of any probability in the sense of ( 7) is quite straightforward at grid nodes if we have large enough number of realizations.The 100 equally probable values simulated in grid nodes provided the possibility of this derivation.The accepted geochemical model said that the migra tion process of interest could be made probable in the region of "high" factor scores.If the cutoff of these high factor scores were known, the probability belonging to this cutoff in each grid node would characterize the possibility of the migration process in grid nodes.This thought needed two practical steps: (1) finding a cutoff to define "high" scores of the factor describing the migration process; (2) contouring the probability of this cutoff.
For the definition of cut-off of "high" scores, , two approaches were considered.The first was the box and whisker plot technique (TUKEY, 1977), the second being the ( ) s + mean approach, where s was the standard deviation.The latter pro vides an inflection point of the normal distribution.From these two definitions the latter was used in further analysis, since this gave a larger number of regions on the map to be compared with the seismic data.After choosing the proper cutoff, we calculated the ( ) probabilities in each grid node from the 100 simulated values.In this probability map, regions outlined by high probabilities were expected to show the surface projections of the subsurface.

Comparing geostatistical results and 3D seismic
The 3D seismic covers almost the whole sampling area.In the seismic data sets, analysis of average amplitude and the interpre tation of some seismic attributes have occurred.As a result, all the known subsurface reservoirs and the most important faults have been identified.Three sections were cut from the 3D seis mic data set.Their traces are shown in Fig. 4. Along these poly lines three transects were created from the probability map.These transects were compared with the corresponding seismic sections.This facilitated the checking of whether the structures appearing in the probability map coincided with the subsurface reservoirs or that such a fault system could be a potential pathway for the common geochemical migration of hydrocarbon and pa rent elements of Rn.

RESULTS
The result of factor analysis showed only two factors meeting the initial criterion (let the eigenvalue be larger than 1).The two (un rotated) factors shown in Table 3. described 65% of the total vari ance.They defined almost equal proportions of the total variance: 31.5% and 34.1% respectively.The first factor positively correlated all the other variables (Table 3, Factor 1).Therefore we supposed it reflected a complex process increasing both the HC and the Rn components of soil gases.Consequently this factor was called a "Factor of migra tion", which was exactly what we were looking for.Its low con tribution to the total variance could be explained by the high de pendence of Rn components on the very variable physical en vironment: humidity, soil type, season of measurements, etc.On the basis of the applied geochemical model this factor was expected to outline regions, surrounding the surface projections    of the existing subsurface reservoirs ("Positive radon anomaly" in Fig. 2).
The second factor showed a high positive correlation with all the Rn and high negative correlations with hydrocarbon compo nents, which was inconsistent with the assumed geochemical process.Even the information content of this factor was very low: it was able to explain only 3% of the total variability by itself (Ta ble 3, lowermost row) and is why Factor 2 was ignored at the next stage of modelling.
The lateral distribution of Factor 1 scores appeared with high positive values in the southern part of the studied area (Fig. 5, A).The clustered appearance of the sample locations required a cell declustering approach to get more representative descriptive sta tistics of the frequency distribution.The frequency distribution derived from the declustered data is skewed toward the high pos itive figures (Fig. 5, B).The normal score transformation of original declustered Factor 1 scores is shown in Fig. 5, C.
The variogram maps calculated for the normal score trans formations of the Factor 1 scores showed a definite NE-SW con tinuity direction (Fig. 6 A where the directions and the anisotropy ratios were 118°, 40°, 35°, and 0.76, 40 and 50, respectively. In this model, the short small scale heterogeneity had 1400 m of range, while 2400 m of range belonged to the large scale hete rogeneity.The nugget effect was very large (0.63 in ( 13)) expressing that about 60% of the total spatial variability cannot be characterized by any linear geostatistical model.
Variogram maps calculated from the models honoured the main anisotropic-characters of Factor 1 (Fig. 6, C).It can also be seen that the models did not cover the whole of the initial hetero geneity.However, the comparison of model and empirical vari ograms proved that this loss of information was beyond the range (Fig. 6 A, C).
Using the variogram model and declustering weights, 200 equally probable stochastic realizations were generated by se quential Gaussian simulation.The simulated normal scores were backtransformed to the original scale of the Factor 1 score.We checked the simulation results by the accuracy plot (Fig. 7) and a visual comparison of the distributions of original and simulated scores (Fig. 8).Since points in the accuracy plot of the simulation results lie above the 45° line, the result of the simulation was ac curate, but not exact (Fig. 7).
Figure 8. shows both the original and simulated Factor 1 scores.The spatial dependence of the Factor 1 scores did not al low the use of any statistical test.By visual examination we could say that the simulation decreased the minimum of the input data set.Fortunately, this did not affect the acceptance of the simula tion, since we were looking for the high positive and not for the low negative factor scores.
In grid nodes, the Etype estimation for Factor 1, was calcu lated from the backtransformed normal scores (Fig. 9).
The gridaverage of the Factor 1 scores and their variance were -0.072 and 0.192, respectively.The 'high' scores were de fined as those being larger than the expected value plus variance.In fact, this definition produced a larger set than that of the Boxplot (Fig. 10).Consequently it resulted in higher possibilities in the lateral outlining of the area with high Factor 1 scores.
These laterally continuous regions of high Factor 1 scores were the targets which we were looking for.The map of conditional variance was used to characterize the uncertainty of the stochastic model from which the Etype esti mation was derived (Fig. 11, A).This map suggested that although the general uncertainty was low in most of the area, there was a very definite subregion with high uncertainty.Fig. 11B shows the stack map of the Etype estimation (3D surface in Fig. 11B) and the contours of conditional variance (contour map above the 3D surface in Fig. 11B).This processing clearly shows that the high est uncertainty coincided with a subregion of high Factor 1 scores.
The probability map showing the probability of locations where Factor 1 scores were larger than the lower boundary of the high scores, 0.120 (mean+variance) was derived from the stochas tic realizations (Fig. 12, A).On this map, at least six larger conti nuous subregions of high probabilities could be easily recognized.These were expected to reflect the subsurface hydrocarbon reser voirs.Three transects were cut out from the probability grid along the surface projections of the seismic sections (Fig. 12, B,C and D).
Along the first transect (I-II in Fig. 13) we identified an al most typical halo geometry of high Factor 1 scores (Fig. 13, I). Figure 12, II demonstrates the seismic section along this transect.The lateral extension of Vízvár and Vízvár-North fields could be identified in the probability transect.The southern anomaly of Factor 1 scores (Heresznyeanomaly in Fig. 13) appeared with more than 0.7 probability and matched with the subsurface posi tion of the Vízvár Field.The northern anomaly was a typical halo anomaly corresponding to the subsurface location of the Vízvár North Field.
The second transect (III to IV in Fig. 12) went through three characteristic anomalies of Factor 1 scores appearing with more than 0.7 probability.They were, from NW to SE, the Berzence, Somogyudvarhely and Heresznye anomalies (Fig. 14).Among them Somogyudvarhely and Heresznye developed above the Víz vár hydrocarbon field, but there was no known equivalent hydro carbon accumulation below the Berzence anomaly (Fig. 14).
The third transect also crossed three spots of high Factor 1 scores: Somogyudvarhely, VízvárNorth and the Komlósd anoma lies.The probability of their appearance was more than 0.7 (Fig. 15).The subsurface counterparts of VízvárNorth and Komlósd were  the VízvárNorth and GörgetegBabócsa Fields.The Miocene fault appearing below the Somogyudvarhely anomaly was re garded as being the genetic background of this anomaly.

DISCUSSION AND CONCLUSION
The presented statistical and geostatistical analysis of the Rn and HC components of soil gas samples was based on two expecta tions.Firstly, that the process controlling the common migration of these components could be recognized by a particular factor in the factor matrix of the studied components.The consequence of this hypothesis was that the strength of this process could be characterized by the scores of this factor.
Secondly, that the regional appearance of this migration process could be analysed by the tools of stochastic simulations.
The choice of stochastic simulation was supported by two facts: (1) after filtering out some 'uncertain' results of the HC gas analyses, the remaining samples showed a very scattered pattern; (2) the factor scores were not direct measurements of the migration process, consequently we had to use probabilistic approaches.
The basis of the factor analysis approach was the lack of any strong correlation between the HC and Rn components.In fact, the application of factor analysis did not follow the traditional way, in which the first few factors are used in the interpretation.In this study we searched for "the" factor correlating with both the Rn and CH components.In this strategy we accept, that the radiological and HC properties of soil gases could be affected by several processes.It was also admitted that the targeted phenom enon (factor expressing the joint migration of Rn and HC gases)    might be not necessarily the dominant process.For the purpose of this study, the only important thing was the existence of such a factor.This was the main reason for (1) the application of only the unrotated factor matrix and (2) the spatial analysis of Factor 1 even if it did not have high communality.
The loadings in Factor 1 showed exactly what we were look ing for: the higher the Rn content, the higher the CH content is (Table 3).This result was in absolute harmony with the basic ge ochemical theory (Fig. 2).In Factor 2, the signs of factor loadings belonging to the Rn and HC component were contradictory.This process could be characteristic above such regions, where the vertical migration of HC gases slowed down while the migration of the Rn components did not change significantly.These phe nomena may be characteristic above the water phase of the rese rvoirs.Unfortunately this situation could not be checked on the basis of the available seismic survey.
The spatial analysis of the factor scores highlighted some spatial outliers.For example, in Figure 5A, a high score appeared (red dot on the map) among average scores (green and blue dots).Under regular circumstances, these outliers should be filtered out.We did not follow this rule, since the factor scores were regarded as measures of the strength of the background process.Also, the existence of such outliers calls for the application of a nonpara metric simulation approach.Fortunately, in this case the spatial outliers did not form any chainlike pattern which would have necessitated an indicator approach.
The 200 equally probable stochastic realizations honoured well the statistical characters of the input set (Figs. 7 and 8).The simulation result was accurate, although the simulated set showed slightly larger variability than that of the input.Since a Gaussian algorithm resulted in a unimodal distribution, the Etype grid could be accepted to show the average or general lateral distribu tion of the results of the migration process.The areal distributions of regions with high Factor 1 scores of the Etype estimation roughly matched the surface projections of the known fields (Figs. 1 and 9).The uncertainty of this rough matching was char acterized by the conditional variance of the probability distribu tions generated at the grid nodes.Unfortunately, the highest un certainty values are concentrated in the larger area of the high Factor 1 scores (Fig. 11B).Perhaps the heterogeneity of Factor 1 scores in this area and its surrounding were the reason for the less accurate estimation results.
The probability maps showed the possibility of locating such regions, where Factor 1 scores are large in the sense of the defi nition given above.The results showed that the regions where this event had a large probability (in general the probability was > 0.7) could be clearly correlated with the subsurface hydrocarbon res ervoirs identified in seismic profiles .This correla tion produced a positive result even in the region where the un certainty was very high (Heresznye anomaly in Fig. 1, Fig. 13 and Fig. 14).The probability counterparts of the other known reser voirs on the map of the Etype estimation of Factor 1 scores could be outlined with a significantly smaller uncertainty (Fig. 9, Fig. 11 and Fig. 15).
These results suggested that the research achieved its origi nal purpose: the connection between the known reservoirs and the Rn and HC components of soil gases have been proven in the studied area.This verification seems to be more precise than those of the relevant literature, since it used not only the Rn com ponents alone, but also the C2-C5 hydrocarbon components of soil gases.Because of the positive verification, regions with a high probability of a positive anomaly of Factor 1 scores, but without any reservoir counterparts, may be suggested as targets for surface hydrocarbon exploration.
However, it can also be said that the presented modelling strategy demonstrated a correlation of the measured signal to re servoirs rather than more widespread petroleum system sources.

Figure 2 .
Figure 2. A geochemical model of the radiation anomaly above hydrocarbon reservoirs VÁRHEGYI et al., 2008.
i b is the positive sill or slope of the corresponding basic semivariogram model ( ) i g h .The variogram map (sometimes called variogram surface) takes the idea of calculating the variogram in a number of direc tions.It can be created by posting the variogram values on a map, where the center of the map is the lag distance of zero (e.g.ISAAKS & SRIVASTAVA, 1989 or DEUTSCH, 2002).

Figure 4 .
Figure 4.The five traces of seismic sections.

Figure 6 .
Figure 6.Empirical variogram map of the normal scores of Factor 1 scores (A), Variogram models of the normal scores of Factor 1 (B), Variogram map of the variogram model of (C).

Figure 7 .
Figure 7. Accuracy plot of the simulation.

Figure 10 .
Figure 10.Comparison of the sets of "high Factor 1 scores defined by the Boxplot and (e + s) techniques.e is the expected value of the distribution; s is the standard deviation of the distribution.

Figure 11 .
Figure 11.A: The uncertainty of the stochastic model by which the E-type estimation was derived (A); B: Stacked map of the E-type estimation (3D surface) and the conditional variance (contour map above the 3D surface).

Figure 12 .
Figure 12.Map showing the lateral distribution of Prob(Factor 1 > 0.120).A: probability contours with the surface projections of seismic sections (I to VI).B, C and D are the transects created along the surface polygons of the seismic sections.

Figure 13 .
Figure 13.Correlation of the probability map of high Factor 1 scores and the subsurface hydrocarbon fields along the first transect.

Figure 14 .
Figure 14.Correlation of the probability map of high Factor 1 scores and the subsurface hydrocarbon fields along the second transect.

Figure 15 .
Figure 15.Correlation of the probability map of high Factor 1 scores and the subsurface hydrocarbon fields along the third transect.

Table 1 .
Descriptive statistics of the variables analyzed.

Table 2 .
Correlation coefficients between the variables.

Table 3 .
The result of factor analysis (high factor loadings are colored by light grey).
).The empirical continuity expressed by the variogram map was modeled by a nested theoretical vario gram model: