Application Research of Benford’s Law in Testing Agrometeorological Data

Benford’s law is an effective tool for testing the quality of data sets. To strengthen research on the quality evaluation of agrometeorological data, this paper empirically analyzes China’s precipitation data sets from 1951 to 2015 along the time, space and staggered time-space dimensions. The overall and local quality of the precipitation data sets is found to be high. Further improvement and extension of the application of Benford’s law can help to detect and control the quality of agricultural data efficiently and to improve the utilization rate of agricultural data.


Introduction
Data is an important force driving agricultural modernization and the key link between agricultural production, management, consumption, markets and trade [1]. On the one hand, the effective application of agricultural data can reflect the whole agricultural process in a holographic, three-dimensional way and promote the connections between the relevant elements. On the other hand, the correlations in the data can be used to predict future developments, so that responses to industry changes can be prepared in advance [2][3]. However, the quality of agricultural data in China is not high, and the dilemma of rich data but poor information persists [4][5]. In the course of agricultural modernization, a large amount of data has accumulated at every stage of production, circulation and consumption. Agricultural data are also characterized by lagging information technology, wide coverage, complex sources, close dependence on time and space, and long production cycles, all of which lower the quality of agricultural data in China. Many abnormal data exist, and low-quality data sets are not conducive to scientific decision-making [2,6].
This paper introduces Benford's law, which is widely applicable and well suited to testing large samples, into the field of agricultural data testing. Taking agrometeorological data as an example, the research aims to evaluate the quality of agrometeorological data in China and to make the exploration and analysis of agricultural data more efficient, in an effort to promote agricultural development.
Benford's law, also known as the "first figure law" or the "first-digit law", states that the first digits 1-9 of naturally occurring data sets do not each appear with probability 1/9, as intuition would suggest, but follow a logarithmic distribution [7]. Here, the first digit is the first non-zero significant digit of a number. It is a simple numerical rule, but it is very effective for detecting anomalies in data sets [8,9]. The first-digit frequency distribution given by Benford's law is shown in Table 1:
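The logarithmic distribution referred to above has the closed form P(d) = log10(1 + 1/d) for d = 1, ..., 9. As a minimal sketch (the names `benford_probability` and `BENFORD_PERCENT` are illustrative and not taken from the paper), the expected first-digit frequencies can be computed as follows:

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of values whose first significant digit is `digit` (1-9)."""
    if not 1 <= digit <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Expected first-digit frequencies in percent, keyed by digit.
BENFORD_PERCENT = {d: round(100 * benford_probability(d), 3) for d in range(1, 10)}

if __name__ == "__main__":
    for d, p in BENFORD_PERCENT.items():
        print(f"{d}: {p:.3f}%")
```

For the digits 1 and 2 this gives 30.103% and 17.609%, the reference values against which the observed precipitation frequencies are compared later in the paper.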

Benford's Law Application
In the natural sciences, scholars have carried out extensive research in mathematics, biology, physics and other fields and shown that many data sets conform to Benford's law. In mathematics, Pinkham [10] pointed out that a data set conforming to Benford's law still conforms to it after being multiplied by any non-zero constant. Berger [11,12] proved that Benford's law is the only digit law that is scale-invariant. These studies of the properties of Benford's law laid a foundation for studying data sets with correlated indicators and greatly broadened the breadth and depth of its application. Leemis's [13] statistical results for several common survival distributions in biology and Burke's [14] statistics on the first digits of a large number of physical constants are both consistent with the Benford frequencies.
In the social sciences, where data are influenced by the environment and other factors, it is more common for data sets to deviate from Benford's law within an acceptable range. The law has been applied to the detection of statistical data in taxation and accounting [15,16] and in social security and finance [17], showing that Benford's law is effective in assessing the reliability of social science data.
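The scale-invariance property attributed to Pinkham can be illustrated numerically. The sketch below uses the powers of two, a sequence known to follow Benford's law, and the multiplier 3; both are illustrative choices and not data from the cited studies.

```python
from collections import Counter


def first_digit_frequencies(values):
    """Relative frequency of each leading digit 1-9 among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# The first 1000 powers of two serve as a Benford-conforming sample.
powers = [2 ** n for n in range(1, 1001)]
# Rescale every value by a non-zero constant (3 is an arbitrary choice).
scaled = [3 * v for v in powers]

original_freq = first_digit_frequencies(powers)
scaled_freq = first_digit_frequencies(scaled)

if __name__ == "__main__":
    for d in range(1, 10):
        print(d, round(original_freq[d], 3), round(scaled_freq[d], 3))
```

In both samples the share of values with first digit 1 stays close to the Benford value of about 0.301, which is what scale invariance predicts.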

Data Conditions Applying Benford's Law
Benford's law has been applied in many unrelated fields. Nigrini [16] thoroughly analyzed and collated a large body of data-detection research based on Benford's law and put forward four necessary conditions that data must satisfy: the sample is large enough to represent the data as a whole; the data values have no imposed upper or lower limits; the data are naturally formed and little or not at all affected by human factors; and there is no high coupling or cohesion between the data.

Application of Agrometeorological Data Quality Detection
First, the original data set is preprocessed to deal with missing, redundant and unusable data. Benford's law is then applied to test data quality. If the first-digit statistics in the time, space and staggered time-space dimensions all pass the significance test and obey Benford's law, the quality of the data set is high, and effective information can be extracted from it. If a test fails, the data set violates Benford's law, and the anomalous data are collected into a pool for further analysis. The choice of test statistic depends on the sample size: when the sample is small, χ², VN*, m* and d* can be used for the significance test; when the sample is large, the correlation coefficient r is used to judge the result [18]. The statistics used in this paper are as follows.
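As a minimal sketch of how such statistics could be computed from first-digit counts, the code below assumes that m is the largest absolute deviation between observed and Benford frequencies, that d is the Euclidean distance between the two frequency vectors, and that r is the Pearson correlation coefficient. These definitions are assumptions, chosen because they are common in the Benford literature; the function name `benford_statistics` is hypothetical.

```python
import math

# Benford frequencies for the first digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def benford_statistics(counts):
    """Compare observed first-digit counts (digits 1..9) with Benford's law.

    Returns chi-square, m (assumed: largest absolute deviation),
    d (assumed: Euclidean distance) and the Pearson correlation r.
    """
    if len(counts) != 9:
        raise ValueError("expected nine counts, one per first digit 1-9")
    n = sum(counts)
    if n == 0:
        raise ValueError("at least one observation is required")
    observed = [c / n for c in counts]
    diffs = [o - b for o, b in zip(observed, BENFORD)]

    # Pearson chi-square: sum of (observed - expected)^2 / expected over counts.
    chi_square = n * sum(diff ** 2 / b for diff, b in zip(diffs, BENFORD))
    m = max(abs(diff) for diff in diffs)
    d = math.sqrt(sum(diff ** 2 for diff in diffs))

    mean_o = sum(observed) / 9
    mean_b = sum(BENFORD) / 9
    cov = sum((o - mean_o) * (b - mean_b) for o, b in zip(observed, BENFORD))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_b = sum((b - mean_b) ** 2 for b in BENFORD)
    # r is undefined when all observed frequencies are equal.
    r = cov / math.sqrt(var_o * var_b) if var_o > 0 else float("nan")

    return {"chi_square": chi_square, "m": m, "d": d, "r": r}
```

A data set close to Benford's law yields m and d near 0 and r near 1, which is the pattern the large-sample tests in this paper look for.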

Data Sources
The agrometeorological data used in this paper were obtained from the China Meteorological Statistical Bureau. The data set covers 65 years of meteorological data, from 1951 to 2015, for 2,481 stations in China, and mainly includes the maximum temperature, minimum temperature, sunshine duration, daily precipitation and other indicators of each station. In agrometeorological data, precipitation, temperature, sunshine and similar variables reflect the weather conditions of different regions and the severity of natural disasters such as drought and flood, so studying them is important for agricultural production. According to the four data conditions of Benford's law, highly cohesive data sets are not suitable for the Benford's law test. Therefore, this paper takes the precipitation data of 31 provinces as its main research object.

Data Preprocessing
Cleaning the selected data reduces the interference of invalid records and improves the efficiency of the Benford's law test. Inspection of the agrometeorological data shows that some precipitation records lie between 3,000 mm and 3,070 mm. However, under China's grading of precipitation levels, a daily precipitation of more than 100 mm is already classified as a heavy rainstorm. These records deviate markedly from normal precipitation records and are excluded from the statistics. In addition, Benford's law counts first digits from 1 to 9 and does not include "0", so records with a daily precipitation of 0 are also excluded. Because of this property of Benford's law, excluding these records does not adversely affect the test results.
From Table 4, it can be seen that the first-digit frequency distributions of the seven groups of data sets decrease monotonically, which is in good agreement with Benford's law. The frequencies of the first digits "1" and "2" in China's precipitation data sets are about 32.7% and 18.6%, which differ noticeably from the corresponding Benford frequencies. A significance test is therefore needed to determine whether these seven groups obey Benford's law. Because the seven groups are large-sample data sets, the distances m and d and the correlation coefficient r are selected as the test statistics, in line with the applicable conditions of the significance test methods for Benford's law.
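The cleaning rules and the grouping by period described in this section can be sketched as follows. The record layout (pairs of year and daily precipitation in mm), the 3,000 mm cut-off parameter and all function names are assumptions made for illustration; the paper does not describe its implementation.

```python
from collections import Counter, defaultdict
from decimal import Decimal


def first_digit(value):
    """First non-zero significant digit of a number, or None for zero."""
    digits = Decimal(str(value)).as_tuple().digits
    return digits[0] if digits[0] != 0 else None


def clean(records, upper_limit_mm=3000.0):
    """Drop zero readings and records at or above `upper_limit_mm`."""
    return [(year, mm) for year, mm in records if 0 < mm < upper_limit_mm]


def first_digit_counts_by_decade(records):
    """Tally first digits per decade (the key 1970 stands for 1970-1979)."""
    groups = defaultdict(Counter)
    for year, mm in clean(records):
        groups[year // 10 * 10][first_digit(mm)] += 1
    return groups
```

Working on the decimal string of each value, rather than repeatedly multiplying or dividing by 10, avoids floating-point rounding when readings such as 0.3 mm are reduced to their first digit.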

Applying Benford
Significance tests on the seven groups show that the correlation coefficient r of the precipitation data from 1970 to 1979 is the lowest, at 0.9997, and that of the other six groups is 0.9998. According to the grading criteria of the correlation coefficient, the precipitation data set is considered reliable as a whole. In addition, among the seven data sets, the distance m ranges from 0.025 to 0.03 and the distance d from 0.027 to 0.033, so the difference between the observed frequencies and the Benford frequencies is very small. These results show that the precipitation data sets in China conform to Benford's law and that the quality of each data set is high.
(2) Taking five years as the group interval, the precipitation data sets from 1951 to 2015 are divided into 13 groups, and Benford's law is tested on each. The statistical results are shown in Table 5.
Table 5 shows that the frequencies of the first digits "1" and "2" in China's precipitation data sets are about 32.7% and 18.5%, which differ noticeably from the Benford frequencies of 30.103% and 17.609%. Whether this deviation is acceptable must be judged by a significance test. Since the precipitation data sets are large samples, the distances m and d and the correlation coefficient r are selected as the test statistics, in line with the applicability of the Benford's law test methods. The results show that the correlation coefficient r of the precipitation data in 1970-1979 is the lowest, at 0.9996, and that in 2000-2009 is the highest, at 0.9998. According to the grading criteria of the correlation coefficient, the 13 groups of precipitation data conform to Benford's law, and the overall quality of the data sets is high.
(3) Testing data sets with relatively large first-digit fluctuations in the time dimension: to show the differences and characteristics of the data in different periods more clearly, error maps of each group's first-digit frequencies against the Benford frequencies are drawn. Because the errors of the digits "3" to "9" are small and differ little from one another, the digit "3" is chosen as the representative.
Figure 1 and Figure 2 show the errors between the first-digit frequencies of each group and the Benford frequencies. Overall, the error of the first digit "1" fluctuates around 2.5%, and that of the first digit "2" around 1.0%. On the assumption that data sets with large error fluctuations may contain anomalies, the periods at the sharpest peaks and extreme points in the figures are selected for detailed analysis. Figure 1 shows that the first digit "1" is lowest in 1970-1979 and highest in 1990-1999, while the first digit "2" is lowest in 1951-1954 and highest in 1970-1979. Figure 2 shows that the first digit "1" is lowest in 1975-1979, highest in 1990-1994, and has a local maximum in 1965-1969; the first digit "2" is lowest in 1951-1955 and highest in 1970-1974. To ensure the accuracy of the test, and taking the periods of major events as the standard, the precipitation data of 1951-1954, 1970-1979 and 1990-1999 are selected for year-by-year analysis.
(4) Taking one year as the group interval, Benford's law is tested on the precipitation data sets of 1951-1954, 1970-1979 and 1990-1999. The results are shown in Tables 6, 7 and 8.
Table 6 shows that the first-digit distribution of the national precipitation data from 1951 to 1954 is consistent with Benford's law on the whole, with slight fluctuations. Whether the difference is significant requires a significance test. The results show that the correlation coefficient r of the precipitation data in 1951 is the lowest, at 0.999, and that in 1952 is the highest, at 0.9998. According to the grading criteria of the correlation coefficient, these five data sets are not doubtful.
Table 7 shows that the first-digit distribution of the national precipitation data from 1970 to 1979 is consistent with Benford's law on the whole, with slight fluctuations. Whether the difference is significant requires a significance test. The results show that the correlation coefficient r of the precipitation data in 1993 is the lowest, at 0.9995, and that in 1995 is the highest, at 0.9997. The data sets pass the significance test, and these 10 data sets are not doubtful.
Table 8 shows that the first-digit distribution of the national precipitation data from 1990 to 1999 is consistent with Benford's law on the whole, with slight fluctuations. Whether the difference is significant requires a significance test. The results show that the correlation coefficient r of the precipitation data in 1993 is the lowest, at 0.9995, and that in 1995 is the highest, at 0.9998. The data sets pass the significance test, and these 10 data sets are not doubtful.
The above tests show that the precipitation data sets in China conform to Benford's law in the time dimension and that the quality of the data sets is high.
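The screening step used in the time dimension, which picks the groups whose first-digit error against the Benford frequency is most extreme, can be sketched as below. The helper names `benford_error` and `extreme_groups` are hypothetical, and frequencies are assumed to be given as fractions between 0 and 1.

```python
import math


def benford_error(observed_frequency, digit):
    """Observed first-digit frequency minus the Benford frequency."""
    return observed_frequency - math.log10(1 + 1 / digit)


def extreme_groups(frequency_by_group, digit):
    """Return (group with smallest error, group with largest error).

    `frequency_by_group` maps a group label, such as a period, to the
    observed frequency of `digit` in that group.
    """
    errors = {g: benford_error(f, digit) for g, f in frequency_by_group.items()}
    return min(errors, key=errors.get), max(errors, key=errors.get)
```

With the frequency of about 32.7% reported for the first digit "1", `benford_error(0.327, 1)` is roughly 0.026, in line with the error level of about 2.5% read from the figures.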

Spatial Dimension Test
(1) Following the division of China into eastern, central and western regions, the precipitation data sets of 31 provinces from 1951 to 2015 are divided into three groups, as shown in Table 9. The first-digit frequency distribution of each group is computed, and the statistical results are shown in Table 10.
From Table 10, we can see that the first-digit frequency distributions of the data sets grouped by geography also decrease monotonically, in good agreement with Benford's law. This shows that the quality of precipitation data is good in all regions of China and is not affected by the level of regional development. The frequencies of the first digits "1" and "2" are about 32.7% and 18.5%, which closely match those obtained in the time dimension but still differ noticeably from the Benford frequencies of 30.103% and 17.609%, so a significance test on the data sets is necessary.
In line with the applicability of the Benford's law test methods, the correlation coefficient r is selected as the statistic for the significance test. The results show that, among the data sets grouped by geography, the correlation coefficient r is lowest for the eastern region, at 0.9983, is 0.9986 for the central region, and is highest for the western region, at 0.9988. According to the grading criteria of the correlation coefficient, the overall quality of the precipitation data sets is very high.
(2) Following the geographical division into East China, South China, Central China, North China, Northwest China, Southwest China and Northeast China, the precipitation data of 31 provinces from 1951 to 2015 are further subdivided into seven groups, as shown in Table 11. Benford's law is tested on the seven data sets, and the statistical results are shown in Table 12.
From Table 12, it can be seen that the first-digit frequency distributions of the data sets are in good agreement with Benford's law and also decrease monotonically. The frequencies of the first digits "1" and "2" in the precipitation data sets of the seven major regions are about 32.6% and 18.5%, which differ noticeably from the Benford frequencies of 30.103% and 17.609%. Whether this difference is significant still needs to be tested.
In line with the applicability of the Benford's law test methods, the correlation coefficient r is selected as the statistic for the significance test. The results show that the correlation coefficient r of the precipitation data in South China is the lowest, at 0.9981, and that in Southwest China is the highest, at 0.9988, which is very close to 1. According to the grading criteria of the correlation coefficient, the correlation between the precipitation data sets and Benford's law is very high.
(3) Testing data sets with relatively large first-digit fluctuations in the spatial dimension: to show the differences and characteristics of the data in different regions more clearly, error maps of each group's first-digit frequencies against the Benford frequencies are drawn. Because the errors of the digits "3" to "9" are small and differ little from one another, the digit "3" is chosen as the representative. The error maps of the first-digit frequency distribution of precipitation in the seven major regions from 1951 to 2015 are shown in Figure 3.
Figure 3 shows the errors between the first-digit frequencies of each group and the Benford frequencies. Overall, the error of the first digit "1" fluctuates around 2.5% and that of the first digit "2" around 1.0%, and the range of fluctuation is very small. On the assumption that data sets with large error fluctuations may contain anomalies, the regions at the sharpest peaks and extreme points in the figure are selected for detailed analysis. The first digit "1" reaches its maximum in Central and Southwest China, and the first digit "2" reaches its minimum in Northwest China. These three regions are therefore analyzed further, and the results are shown in Table 13.
From Table 13, it can be seen that the first-digit distribution of the precipitation data of all provinces in 1951-2015 is consistent with Benford's law on the whole, with slight fluctuations. The significance test shows that the correlation coefficient r of the precipitation data in Hunan, Tibet and some other provinces is the lowest, at 0.9995, and that of Xinjiang is the highest, at 0.9999. These results show that the quality of the precipitation data sets in the different provinces is high and that there is no significant regional difference in the quality of precipitation data in China.

Time and Space Staggered Dimension Test
(1) Table 13 shows that the frequencies of the first digits "1" and "2" for Hunan Province are significantly higher than the Benford frequencies. To narrow the scope of the study, the Hunan data for 2010-2015 are divided into six groups and tested against Benford's law; the first-digit frequency distribution is shown in Table 14.
From Table 14, we can see that the first-digit distribution of the precipitation data in Hunan Province in 2010-2015 is consistent with Benford's law on the whole, but the fluctuation is larger than in the other data sets, so it is necessary to test whether the difference is significant. The significance test shows that the correlation coefficient r of the precipitation data in Hunan is lowest in 2010, at 0.9984, and highest in 2011, at 0.9998. Therefore, by the criterion for the correlation coefficient r, the data set conforms to Benford's law.
(2) To further narrow the scope of the study, the 2010 data of 12 areas in Hunan are selected and divided into three groups: Anhua, Anren, Anxiang and Baoqing; Chaling, Changning, Changde and Binzhou; and Chenxi, Chengbu, Cili and Dao County. Benford's law is tested on each group, and the results are shown in Table 15.
Table 15 shows that the precipitation data sets of some areas of Hunan differ slightly from Benford's law, and whether the difference is significant still needs to be tested. Because the sample sizes of the groups are 665, 644 and 602 respectively, which are small samples, the statistics χ², VN*, m* and d* can be used for the significance test. For the Anhua group they are 11.91, 1.29, 0.025 and 0.03 respectively; for the Chaling group, 18.813, 1.951, 0.038 and 0.06; and for the Chenxi group, 15.334, 2.526, 0.05 and 0.061. All are below the critical value of the corresponding statistic at the 0.01 significance level, and the correlation coefficient r is greater than 0.99. Therefore, at the 0.01 significance level, these three groups of data sets also conform to Benford's law.
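The χ² part of this decision can be checked with a short sketch. It assumes the usual 8 degrees of freedom for a first-digit test (nine digit classes minus one) and uses the tabulated upper 1% critical value of 20.090; the group labels and the function name `passes_chi_square` are illustrative.

```python
# Upper 1% critical value of the chi-square distribution with 8 degrees of
# freedom (nine first-digit classes minus one), from standard tables.
CHI_SQUARE_CRITICAL_0_01_DF8 = 20.090


def passes_chi_square(chi_square, critical=CHI_SQUARE_CRITICAL_0_01_DF8):
    """True when the statistic stays below the critical value."""
    return chi_square < critical


# Chi-square statistics reported above for the three Hunan groups.
reported = {"Anhua group": 11.91, "Chaling group": 18.813, "Chenxi group": 15.334}
results = {name: passes_chi_square(value) for name, value in reported.items()}

if __name__ == "__main__":
    for name, ok in results.items():
        print(name, "conforms" if ok else "deviates")
```

Under these assumptions all three reported χ² values lie below 20.090, which agrees with the conclusion stated above for the 0.01 level.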

Conclusion
This paper studied the first-digit statistics of 65 years of precipitation data sets in China along the time, space and staggered time-space dimensions, applying Benford's law to test data quality. To rule out the possibility that a small amount of abnormal data is hidden in large data sets, the scope of testing was narrowed step by step, finally down to 12 areas of Hunan Province in 2010. The first-digit frequency distributions of the data sets pass the significance tests and obey Benford's law well. The empirical results show that, from the point of view of Benford's law, the quality of the national precipitation data sets is good and there are essentially no suspicious data. This part of the agrometeorological data can effectively guide agricultural production.