Survey of public data sources on the Internet usage and other Internet statistics

Internet research is mainly driven by data. Obtaining such data by planning and launching measurement campaigns is time consuming and costly. A much more efficient, and in many cases sufficient, data acquisition strategy is to exploit the existing datasets available in public databases, repositories, and documents. Hence, the public data sources related to Internet usage and other Internet statistics are systematically surveyed and categorized to make the search for Internet data easier and faster. The data sources were identified by extensive online searches and by exploring the websites of key organizations. Each data source was then carefully explored to describe its characteristics and contents. The data are usually aggregated over certain time periods and regions, and are often indexed by age, gender, application, website, activity and other attributes. Some data sources also support various data visualization options and offer data export in multiple formats.


Survey of public datasets on WWW
Data format: URL with brief description
Experimental factors: Each dataset was explored for its contents and then categorized.
Experimental features: Guided searches on WWW. Search of WWW of relevant organizations.
Data source location: N/A
Data accessibility: Included with this data brief.

Value of the data
There is no other centralized collection of public datasets on Internet usage and other Internet statistics, and searching for such datasets is a very time-consuming process.
The identified data sources can be used to identify and understand the current and future trends in Internet usage.
Internet usage data and statistics are important for understanding the rapid, ongoing digitalization of many sectors of the economy, including manufacturing, healthcare, education and entertainment.
It is vital that data sources on Internet usage are regularly updated and made available to researchers as well as policy and decision makers, so that the Internet keeps evolving for the maximum benefit of society.

Data
There are different types of data associated with the Internet. Internet measurements are used for real-time network management and long-term network planning, as well as to evaluate how the Internet is used and how it impacts the economy and society. The measurements used for network management are obtained directly from the Internet traffic flows. Network planning for 1 or 2 years ahead of the current demand is based on traffic trend forecasting by extrapolating historical data. For longer-term planning beyond a 2-year horizon, and to evaluate the long-term and large-scale impacts of the Internet, user questionnaires and other market research methods are necessary. Internet data are regularly published by government institutions, regulatory bodies, communication service providers, and private organizations. Data with a low level of aggregation are usually accessible only for a fee. The published Internet usage data are highly aggregated, which reduces the resolution, but also the variations in the data. In order to obtain measurements for high-volume and high-throughput traffic, or to efficiently represent the Internet topology at multiple time scales, statistical sampling strategies are required to process smaller but statistically relevant traffic traces.
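The short-term planning step mentioned above, forecasting 1 or 2 years ahead by extrapolating historical data, can be illustrated with a minimal sketch. It assumes that traffic grows roughly exponentially, so a straight line is fitted to the logarithm of the yearly volumes by ordinary least squares. The growth model, the function names and the sample volumes are illustrative assumptions; they are not taken from any of the surveyed datasets.

```python
import math


def fit_exponential_trend(years, volumes):
    """Least-squares fit of log(volume) = a + b * year; returns (a, b)."""
    n = len(years)
    logs = [math.log(v) for v in volumes]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def extrapolate(a, b, year):
    """Evaluate the fitted exponential trend at the given year."""
    return math.exp(a + b * year)


# Hypothetical yearly traffic volumes in arbitrary units (25% growth per year).
history_years = [2013, 2014, 2015, 2016, 2017]
history_volumes = [100.0, 125.0, 156.25, 195.3125, 244.140625]

a, b = fit_exponential_trend(history_years, history_volumes)
annual_growth = math.exp(b) - 1.0

# Extrapolate only 1 or 2 years beyond the last observation.
forecast = {year: extrapolate(a, b, year) for year in (2018, 2019)}

print(f"estimated annual growth: {annual_growth:.1%}")
for year, volume in forecast.items():
    print(year, round(volume, 2))
```

Because such a fit only continues the historical trend, it is suited to the 1 to 2 year horizon; for longer horizons the text points to questionnaires and market research instead.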
The collected datasets were somewhat liberally organized in the following 5 tables:
Table 1. Internet measurements: These datasets mostly contain data on Internet usage in a given region, such as the number of subscribers, cellular coverage, security issues and the quality of service, including access speeds and response times.
Table 2. Internet applications: These datasets can be used to evaluate how users spend their time on the Internet, often focusing on specific regions of the world and on specific websites such as social networking applications or YouTube.
Table 3. Various Internet statistics: These datasets provide information about Internet usage and usage analytics in various parts of the world.
Table 4. Internet availability: These datasets are concerned with Internet access in different countries and world regions.
Table 5. More comprehensive datasets related to the Internet: These datasets often contain various other data in addition to the Internet data.

Experimental design, materials and methods
The most important data sources on Internet usage, containing the most comprehensive datasets, are summarized in Fig. 1. Two main strategies were used to find the relevant datasets. First, the World Wide Web (WWW) was searched using various combinations of keywords; this strategy found about 80% of the collected datasets. The keywords combined the word 'Internet' with one or more of the following search terms: traffic, database, data, access, usage, users, and statistics. The keyword search can be further modified, for example, by specifying a country or a region. The second strategy was to first identify the institutions which are known to report various Internet-related statistics, and then to examine their websites. Overall, over 400 candidate websites were explored to select the relevant datasets.
Unsurprisingly, the high-quality datasets (i.e., those which are reliable and regularly kept up to date) are often, but fortunately not always, available only to paying subscribers. These pay-for-access datasets have been excluded from the survey. The main attributes checked when screening the datasets were: whether the data are provided for free, who maintains the dataset, how often the data are updated, in what format the data are available, and, especially, what data are specifically provided and how they are indexed or categorized. Moreover, many international institutions have a policy of publishing datasets with descriptions in multiple languages. The total volume of data in each dataset varies significantly among the data providers, including the time coverage and the availability of historical data. Most data appear to be accessible in smaller chunks, usually organized around a specific reporting purpose or issue.
References [1][2][3][4][5] are recommended for performing longer-term forecasting of Internet traffic and applications.
Internet measurement procedures are described in [6][7][8][9], whereas the measurement of the digital economy is the focus of [10][11][12]. Digitalization of the economy is explained in [13], while many technical and business aspects of the digitalization of Internet services are studied in [14]. The value of open access data and the associated data regulation are discussed in [15] and [16].
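The first search strategy described in this section, combining the word 'Internet' with one or more of the listed search terms, can be sketched as follows. This is only an illustration of how such keyword combinations could be enumerated. The function name `build_queries`, the space-separated query format and the example region are assumptions, not part of the original survey procedure.

```python
from itertools import combinations

BASE_WORD = "Internet"
# Search terms listed in the text.
SEARCH_TERMS = ["traffic", "database", "data", "access",
                "usage", "users", "statistics"]


def build_queries(base=BASE_WORD, terms=SEARCH_TERMS, region=None):
    """Return every query that combines `base` with one or more of `terms`.

    If `region` is given, it is appended to each query to narrow the search.
    """
    queries = []
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            parts = [base, *combo]
            if region:
                parts.append(region)
            queries.append(" ".join(parts))
    return queries


queries = build_queries()
# "Europe" is a hypothetical example of a regional refinement.
regional_queries = build_queries(region="Europe")

print(len(queries), "queries, for example:", queries[:3])
```

With 7 search terms there are 2^7 - 1 = 127 non-empty combinations, so in practice only a subset of these queries would likely be worth running by hand.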

Funding sources
This work is part of the Ph.D. research of Murooj Nadhom, who received a full scholarship from the Iraqi Government. The grant is provided by the Iraqi Ministry of Higher Education and Scientific Research. The grant reference number is 1873.

Transparency document. Supplementary material
The transparency document associated with this article can be found in the online version at https://doi.org/10.1016/j.dib.2018.04.107.