Data describing the relationship between world news and sustainable development goals

This data article presents a dataset and a tool for news-based monitoring of the sustainable development goals defined by the United Nations. The dataset was created by structured queries of the GDELT database based on the categories of the World Bank taxonomy matched to the sustainable development goals. The Google BigQuery SQL scripts and the results of the related network analysis are attached to the data to provide a toolset for the strategic management of sustainability issues. The article demonstrates the dataset on the 6th sustainable development goal (Clean Water and Sanitation). The network formed based on how countries appear in the same news can be used to explore potential international cooperation. The network formed based on how topics of the World Bank taxonomy appear in the same news can be used to explore how the problems and solutions of sustainability issues are interlinked.


Value of the Data
• The pre-processed and categorised data of the number and tone of the news can be used for strategic management and monitoring of sustainability issues.
• The dataset supports the comparison of the sensitivity of the media to sustainability issues in different countries.
• The network formed based on how countries appear in the same news can be used to explore the potential international cooperation.
• The network formed based on how topics of World Bank taxonomy appear in the same news can be used to explore how the problems and solutions of sustainability issues are interlinked.

Data Description
Sustainable development goals (SDGs) are the basic principles for achieving economic and industrial development goals while preserving the natural systems on which our civilizations depend. Therefore, monitoring and analyzing multiple indices in order to intervene is critical. News plays a significant part in conveying the objectives and major focus areas of both governmental and public interests [1]. News analysis can play an active role in defining the status of the roadmap for implementing the SDGs [2]. The localization of the SDGs is critical, as social spaces are vital factors in the successful implementation and preservation of the goals [3]. GDELT uses natural language processing, data mining, and deep learning algorithms to extract and monitor news of the world. The Global Knowledge Graph (GKG) is a part of GDELT that records the people, organizations, locations, themes and taxonomies, sources, tone and events of news in a network storage.
Included tables will be listed as: name of the [4].
• WB_SHORT ( string ) -World Bank taxonomy element identifier.
• WB_NAME ( string ) -Full name of the topic based on World Bank classification.
• WB_LVL ( int ) -Represents the ontological level of the World Bank identifier.
-country_topic.csv : Contains the country-wise topics, average tone, total news count and the news percent of the topics. This dataset makes it possible to create and analyze country profiles with regard to news related to sustainable development.
• TOTAL_NEWS_COUNT ( int ) -Number of total news in the country.
• LEVEL ( int ) -Level of taxonomy element.
-country_topics_pivot.xlsx : The pivot table of the news at different levels of the World Bank ontology. This dataset shows the number of news stories that appeared in a country and also measures their average tone.
The "PIVOT" contains the following information: • The first column represents the topics.
• The last column represents the average tone of a topic across all countries.
• The other columns are the enumerated countries; each cell holds the average tone of the topic in that country.
• The last row represents the average tone of the country across the selected topics.
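The layout of such a pivot can be sketched in a few lines of Python. This is only an illustration of the structure described above, not the procedure used to produce country_topics_pivot.xlsx: the function name `build_tone_pivot`, the input row format and the sample values are hypothetical, and the sketch assumes that the summary column and row are unweighted means of the available cells.

```python
from collections import defaultdict
from statistics import mean


def build_tone_pivot(rows):
    """Arrange (topic, country, average tone) rows as a pivot table.

    Returns (header, table). Missing topic/country combinations are None.
    Assumption: summary values are unweighted means of the existing cells.
    """
    tones = defaultdict(dict)
    for topic, country, tone in rows:
        tones[topic][country] = tone

    countries = sorted({c for cells in tones.values() for c in cells})
    header = ["TOPIC"] + countries + ["AVG_TONE"]

    table = []
    for topic in sorted(tones):
        cells = [tones[topic].get(c) for c in countries]
        present = [v for v in cells if v is not None]
        # Last column: average tone of the topic across all countries.
        table.append([topic] + cells + [mean(present)])

    # Last row: average tone of each country across the listed topics.
    footer = ["AVG_TONE"]
    for c in countries:
        footer.append(mean(tones[t][c] for t in tones if c in tones[t]))
    footer.append(None)
    table.append(footer)
    return header, table


# Made-up sample values, for illustration only.
sample_rows = [
    ("Water Treatment", "HUN", -1.0),
    ("Water Treatment", "USA", 1.0),
    ("Sanitation", "HUN", 3.0),
]
header, table = build_tone_pivot(sample_rows)
```

If the published pivot weights the averages by the number of articles, the two `mean` calls would need to be replaced by weighted means.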
-news_analysis_countries_topics.csv : The file contains the numeric aggregation of the news by countries. This dataset defines country profiles in regard to their attitude towards the sustainable development goals (tone of news -average, very positive, very negative) in counts and in percentage.
• NEWS_COUNT ( int ) -Count of the articles matching the category.
• TOTAL_NEWS_COUNT ( int ) -Total number of articles in the country.
• TONE ( float ) -Average tone of the articles in the country, which are matching the topic.
• NEWS_PERCENT ( float ) -Percent of the news matching the topic in the selected country.
• NEWS_POS_COUNT ( int ) -Very positive news count, based on the overall tone, where it is ≥ 10
• NEWS_NEG_COUNT ( int ) -Very negative news count, based on the overall tone, where it is ≤ −10
• NEWS_POS_PERCENT ( float ) -Very positive news in percent (NEWS_POS_COUNT / NEWS_COUNT)
• NEWS_NEG_PRECENT ( float ) -Very negative news in percent (NEWS_NEG_COUNT / NEWS_COUNT)
-country_network.csv : News category-based co-occurrences of the countries in sustainability news. This dataset is the base of a multilayered network representing the co-occurrence of countries with regard to a news category and its tone. The data table contains the fields described below:
• Source ( string ) -A node, representing the source country encoded in ISO_A3.
• Target ( string ) -A node, representing the target country encoded in ISO_A3.
• Layer ( string ) -World Bank category in numeric format.
• Layer Name ( string ) -World Bank category name in readable format.
• Weight ( int ) -The weight of the described edge, representing the number of co-occurrences of the countries in the articles.
• Weight_POS ( int ) -Alternative edge description, representing the number of very positive co-occurrences of the countries, based on overall tone ≥ 10.
• Weight_NEG ( int ) -Alternative edge description, representing the number of very negative co-occurrences of the countries, based on overall tone ≤ −10.
-country_network_SDG.csv : SDG-based co-occurrences of countries in sustainability news.
This dataset is the base of a multilayered network representing the co-occurrence of countries with regard to the sustainable development goals and their tone. The data table has the fields described below:
• Source ( string ) -A node, representing the source country encoded in ISO_A3.
• Target ( string ) -A node, representing the target country encoded in ISO_A3.
• Layer ( int ) -Represents a sustainable development goal as a layer in the network.
• Weight ( int ) -The weight of the described edge, representing the number of co-occurrences of the countries in the articles.
-world_bank_taxonomy.csv: The file contains the whole World Bank taxonomy [6]. This dataset defines the categories of the news and their hierarchy based on the World Bank taxonomy. The
• FIPS ( string ) -FIPS encoding of the country.
• LATITUDE ( float ) -Latitude of the middle of the country.
• LONGITUDE ( float ) -Longitude of the middle of the country.
The data was generated by the following Google BigQuery SQL scripts, which can be easily modified to study specific time periods. The SQL scripts are commented to highlight how the code can be tailored for a specific analysis. All scripts have the following setting options:
• TIMEFRAME_START ( datetime ) -Starting time
• TIMEFRAME_END ( datetime ) -Ending time
• TOPICS ( list ) -List of the categories enumerated in this filter. The filter is an OR-separated list for the WHERE clause.
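How the three settings might translate into a filter can be sketched as follows. This is a hedged illustration and not an excerpt from the attached scripts: the function `build_where_clause` is hypothetical, the column names `DATE` (an integer in YYYYMMDDHHMMSS form) and `V2Themes` are assumptions based on the public GDELT GKG table in BigQuery, the end of the time frame is assumed to be exclusive, and the topic identifiers are placeholders.

```python
import re
from datetime import datetime

# Only upper-case letters, digits and underscores are accepted in topic
# identifiers, so that nothing unexpected is pasted into the SQL text.
_TOPIC_PATTERN = re.compile(r"^[A-Z0-9_]+$")


def build_where_clause(timeframe_start, timeframe_end, topics):
    """Build a WHERE clause from TIMEFRAME_START, TIMEFRAME_END and TOPICS."""
    if not topics:
        raise ValueError("TOPICS must contain at least one identifier")
    for topic in topics:
        if not _TOPIC_PATTERN.match(topic):
            raise ValueError(f"unexpected topic identifier: {topic!r}")

    # Assumed format of the GKG DATE column: YYYYMMDDHHMMSS as an integer.
    fmt = "%Y%m%d%H%M%S"
    date_filter = (
        f"DATE >= {timeframe_start.strftime(fmt)} "
        f"AND DATE < {timeframe_end.strftime(fmt)}"
    )
    # OR-separated list of topic conditions. Note that "_" is a
    # single-character wildcard in LIKE, so the match is slightly permissive.
    topic_filter = " OR ".join(f"V2Themes LIKE '%{t}%'" for t in topics)
    return f"WHERE {date_filter} AND ({topic_filter})"


# Placeholder topic identifiers, not taken from the World Bank taxonomy.
clause = build_where_clause(
    datetime(2019, 1, 1), datetime(2020, 1, 1), ["WB_TOPIC_A", "WB_TOPIC_B"]
)
```

The parentheses around the OR-separated list matter: without them, the date restriction would bind only to the first topic condition.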
-1_query_fulldatabase.sql: To query raw data, we recommend this Google BigQuery SQL script, which returns the articles together with the related countries, topics and tone of the news. It returns the following data:
• GKGRECORDID ( string ) -Unique identifier of an article.
• V2SOURCECOMMONNAME ( string ) -URL of the original medium that published the article.
• V2DOCUMENTIDENTIFIER ( string ) -Full URL to the article.
• COUNTRYCODE ( string ) -Countries mentioned in the article.
• THEME ( string ) -World Bank identifiers for the categories of the article.
-3_country_network_incl_stat.sql: News articles often refer to multiple countries. To display the links between countries, we recommend using this Google BigQuery SQL script, which creates the network of countries and returns the following values:
• Source ( string ) -Country representing the source node in the network.
• Target ( string ) -Country representing the target node in the network.
• AVG_TONE ( float ) -Average tone between the two countries, based on the co-occurring documents.
Included Gephi files that contain the networks generated from the presented data:
-topic_net_sdg6.gephi -Contains the network of World Bank taxonomy elements matched to SDG6. It defines the connections (co-occurrence) of the SDG6-related WB taxonomy elements and their tone.
-topic_net_all_sdg.gephi -Contains the network of World Bank taxonomy elements matched to all SDGs. It defines the connections (co-occurrence) of the SDG-related WB taxonomy elements and their tone.
-countries_net_sdg13.gephi -Contains the network of the countries interconnected by mentions of SDG13. It defines the connections (co-occurrence) of countries with regard to the SDG13-related news and their tone.

Experimental Design, Materials and Methods
Governmental policies operate on a long-term planning horizon and reflect visions focused on socioeconomic as well as environmental development. We aim to combine the advantages of news analysis through the GDELT Project with network analysis models to reveal interconnections.
The following steps of the news-centered network analysis of sustainability issues can be distinguished:
-Determination of search words connected to SDGs: The search words are based on the World Bank Group's Topical Taxonomy [5] and the My World 2015 survey [6], which contributed to the formulation of Agenda 2030. The selection of search words and their association with the sustainable development goals (see table world_bank_to_sdg.csv ) was done manually by the authors, who are experts in the field; however, the methodology allows these associations to be validated through the joint occurrences of the topics.

-Development of the related SQL (Structured Query Language) queries: The GDELT Global Knowledge Graph (GKG) database records the people, organizations, locations, themes, taxonomies, sources, tone and events of news and makes this huge amount of data available as a queryable dataset in Google BigQuery (GBQ). Therefore, the systematic queries are based on a schematic SELECT query that captures the main details of the GKG database, namely the location, date, topics and tone of news.

-Generation of networks: A labelled multilayer network is created, which makes it possible to identify the profiles of countries or regions and to examine their overarching national relationships based solely on the news that appears. The network is generated using GDELT geolocation, topic recognition and sentiment analysis. Based on the locations mentioned in an article, GDELT geolocates the article to countries and cities. The multidimensional network can be defined by its nodes (V), edges (E) and dimensions (D). An edge represents the connection between two nodes (u and v) in a dimension (d). It is expressed as follows:
$G = (V, E, D)$, $E \subseteq \{(u, v, d) : u, v \in V,\ d \in D\}$
Furthermore, the categories of the article define which layer the previously mentioned edge appears in. An article is characterized by its id (i), its publication date (t_i), the set of identified locations mentioned within it (L_i), the dimensions and tags of the article (D_i) and its sentiment (s_i).
Two types of networks are built. In the first, the nodes are the topics, the edges are the news, the dimension of the edges is the country or group of countries, and the weight of the edges is the number of articles and/or their tone. In the second, the nodes represent the countries, the edges are the news of a topic, and the weight of the edges is determined from the number of articles.
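Both network types can be produced by the same co-occurrence counting, with the roles of nodes and layers swapped. The following Python sketch illustrates this under stated assumptions; it is not the attached SQL. The function `build_network` and the record layout are hypothetical, edges are treated as undirected with the two endpoints stored in alphabetical order, and the thresholds of 10 and −10 follow the Weight_POS and Weight_NEG field descriptions.

```python
from collections import defaultdict
from itertools import combinations

# Thresholds taken from the Weight_POS / Weight_NEG field descriptions.
VERY_POSITIVE = 10.0
VERY_NEGATIVE = -10.0


def build_network(articles, node_field, layer_field):
    """Count co-occurrences of nodes per layer.

    articles: iterable of dicts holding set-valued `node_field` and
    `layer_field` entries and a numeric "tone" (hypothetical layout).
    Returns {(source, target, layer): {"Weight", "Weight_POS",
    "Weight_NEG", "AVG_TONE"}}.
    """
    acc = defaultdict(lambda: {"Weight": 0, "Weight_POS": 0,
                               "Weight_NEG": 0, "tone_sum": 0.0})
    for article in articles:
        tone = article["tone"]
        # Every unordered pair of nodes in the article is one edge (u, v).
        for u, v in combinations(sorted(set(article[node_field])), 2):
            # The pair is recorded once in every layer d of the article.
            for d in set(article[layer_field]):
                edge = acc[(u, v, d)]
                edge["Weight"] += 1
                edge["tone_sum"] += tone
                if tone >= VERY_POSITIVE:
                    edge["Weight_POS"] += 1
                elif tone <= VERY_NEGATIVE:
                    edge["Weight_NEG"] += 1

    network = {}
    for key, edge in acc.items():
        network[key] = {
            "Weight": edge["Weight"],
            "Weight_POS": edge["Weight_POS"],
            "Weight_NEG": edge["Weight_NEG"],
            "AVG_TONE": edge["tone_sum"] / edge["Weight"],
        }
    return network


# Made-up articles: country codes, topic labels and tones are illustrative.
articles = [
    {"locations": {"HUN", "AUT"}, "topics": {"TOPIC_A", "TOPIC_B"}, "tone": -12.0},
    {"locations": {"HUN", "AUT", "SVK"}, "topics": {"TOPIC_A"}, "tone": 2.0},
]

# Countries as nodes, topics as layers.
country_net = build_network(articles, "locations", "topics")
# Topics as nodes, countries as layers.
topic_net = build_network(articles, "topics", "locations")
```

An article that mentions a single country (or a single topic) contributes no edge, because a co-occurrence needs at least two nodes.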
This methodological development allows continuous monitoring throughout the world through online queries, therefore measuring the social acceptance of SDGs and encouraging participation in terms of their implementation, as well as helping countries around the world to share experiences concerning their problems and successes, which is essential for the implementation of the Agenda [7] .

Validation of the applicability of the data
We categorized the retrieved news according to the 17 sustainable development goals. In the following example analysis, we consider only the 6th UN sustainable development goal, "Clean Water and Sanitation". This covers 5 624 192 articles from 226 countries.
The overview of the 10 countries most mentioned in the news already suggests that the coverage of this topic in these countries is fairly neutral, with a tone of around 0; however, there are certain outliers in the number of positive and negative articles. An article is positive if its overall tone is above 10 and negative if it is below −10.
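The country-level indicators behind such an overview can be sketched as below. This is a minimal illustration rather than the authors' aggregation: the function `country_topic_profile` is hypothetical, the output keys reuse the field names of news_analysis_countries_topics.csv as they are spelled there, percentages are assumed to be on a 0 to 100 scale, and the thresholds are treated as inclusive (≥ 10, ≤ −10), following the field definitions in the Data Description.

```python
def country_topic_profile(topic_tones, total_news_count):
    """Summarize the tone of one topic in one country.

    topic_tones: overall tone of each article matching the topic.
    total_news_count: number of all articles in the country.
    """
    news_count = len(topic_tones)
    if news_count == 0:
        raise ValueError("at least one matching article is required")
    if total_news_count < news_count:
        raise ValueError("total_news_count cannot be below the topic count")

    # Inclusive thresholds, as in the NEWS_POS_COUNT / NEWS_NEG_COUNT fields.
    pos = sum(1 for tone in topic_tones if tone >= 10)
    neg = sum(1 for tone in topic_tones if tone <= -10)

    return {
        "NEWS_COUNT": news_count,
        "TOTAL_NEWS_COUNT": total_news_count,
        "TONE": sum(topic_tones) / news_count,
        "NEWS_PERCENT": 100 * news_count / total_news_count,
        "NEWS_POS_COUNT": pos,
        "NEWS_NEG_COUNT": neg,
        "NEWS_POS_PERCENT": 100 * pos / news_count,
        # Key spelled as in the dataset description.
        "NEWS_NEG_PRECENT": 100 * neg / news_count,
    }


# Made-up tones of four articles in a country with 400 articles in total.
profile = country_topic_profile([12.0, -11.0, 0.5, -1.5], 400)
```

In this made-up case the average tone is 0 although half of the articles are outliers, which is the kind of pattern that the positive and negative counts are meant to expose.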
The distribution of global news related to SDG6 (Ensure availability and sustainable management of water and sanitation for all) is shown in Fig. 1. The map cutouts were selected based on the Global Burden of Disease Study by Fullman et al. [8]. In Fig. 1, the darker countries are those where the Clean Water and Sanitation topic is discussed more extensively. It can be seen that in Norfolk Island (NFK), 0.8% of all news is related to SDG6. In the Pitcairn Islands (PCN), 0.75% of the news is about Clean Water and Sanitation.
Fig. 2 shows the connections between the topics of SDG6. An edge is colored red if the news connecting two topics has a bad tone and green if it has a good tone. The thickness represents the number of connections between the topics. It can be seen that "Water Treatment" and "Water, Sanitation and Hygiene" are the topics that most commonly appear together in the news, with an overall bad tone, while "Water, Sanitation and Hygiene" is connected to "Ecosystems" with an overall good tone.
Ever-increasing social participation and environmental awareness can be achieved primarily through news; therefore, monitoring the news is essential. We presented a dataset that supports the news-based analysis of sustainable development goals. We have demonstrated the applicability of the data by presenting how the news related to the 6th SDG, "Clean Water and Sanitation", is distributed among the countries and how the topics of the World Bank taxonomy are connected to this critical issue.
Thanks to the intelligent functions behind the studied GDELT Knowledge Graph the proposed tool overcomes the limitations of geographical coverage and language barriers. With the proposed Google BigQuery SQL scripts the global media is discoverable and thematically analyzable in terms of the SDGs, which was not possible before.

Ethics Statement
Not applicable.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.