Bridging the Gap in ESG Measurement: Using NLP to Quantify Environmental, Social, and Governance Communication

Environmental, social, and governance (ESG) criteria play a central role in fostering sustainable development in economies. However, a remarkable gap exists in the precise and transparent measurement of ESG. To address this problem, we propose and validate a new set of NLP models that assess textual disclosures for all three subdomains of ESG in three steps. First, we use our corpus of over 13.8 million text samples from corporate reports and news to pre-train new domain-specific E, S, and G models. Second, we build three datasets of 2,000 samples each to train classifiers that detect E, S, and G texts in corporate disclosures. Third, we validate our models by showing that the communication patterns they detect can effectively explain variations in ESG ratings. Thus, this paper introduces a novel and robust method to bridge the significant gap in assessing corporate disclosures on sustainability.


Introduction
Motivation. Since the inception of environmental, social, and governance (ESG) criteria in 2004 (UN, 2004), the importance of ESG has constantly risen. At the same time, there is a lively debate about how to measure companies' ESG efforts. Strikingly, the most prominent measurement approaches in the market, ESG ratings, arrive at inconsistent conclusions about companies' ESG integration (Berg et al., 2022). One remedy for this problem is the analysis of companies' disclosures using Natural Language Processing (NLP). While prior research has developed NLP methods specifically for the climate change domain (see, e.g., Webersinke et al., 2022; Sautner et al., 2023), the general environmental domain as well as the social and governance domains are partly or entirely neglected.
Contribution. This project aims to fill this gap with three main contributions to quantifying ESG texts using NLP methods. First, we complement the existing literature by pre-training new language models for the environmental, social, and governance domains using our corpus of over 13.8 million text samples. Second, we introduce three expert-annotated datasets with 2,000 text samples for each of the three pillars of ESG. This enables the training of classification models for all subdomains. Third, we validate the models by assessing over 2,500 annual reports of the largest European enterprises from 2016-2021 and investigating the relation between the communication patterns of companies and their ESG ratings. We publish both datasets and models to enable various stakeholders to analyze corporate disclosures.
Results. We demonstrate that the pre-trained ESG models achieve state-of-the-art performance for classifying texts into their respective categories. We further find evidence for the effectiveness of the models in our real-world use case. Controlling for various other variables, we find a strong positive and statistically significant association between ESG communication and ESG ratings.
Implications. Our findings have broad implications for both the academic and professional sectors. As the amount of disclosed textual information increases drastically, the need for accurate and actionable ESG metrics becomes increasingly urgent. The models we have developed are critical for investment professionals, ESG analysts, and corporate strategists, enabling them to efficiently search through large data sources and obtain insights. At the same time, companies may be encouraged to align their operations more closely with ESG principles, knowing that their disclosures will be subject to rigorous review. Finally, the academic community will benefit as these datasets and models serve as a foundation for future ESG-focused studies.

Background
Rising Importance of ESG. The integration of sustainability efforts is profoundly changing the way business is conducted. In this realm, a company's sustainability efforts are closely associated with environmental, social, and governance criteria. Throughout the last few years, these ESG criteria have steadily increased in importance, which is reflected in multiple channels.
First, investments in ESG-compliant companies have risen rapidly throughout the last couple of years. For instance, the assets under management of investors committing to incorporate ESG through the UN Principles of Responsible Investment have risen from US$21 trillion in 2010 to US$123 trillion in 2023 (see //www.unpri.org/pri). Second, the quantification of ESG plays an increasingly important role in research and industry. In line with the rise in ESG investments, a growing number of ESG rating agencies has emerged. However, these rating agencies often arrive at differing assessment results (Berg et al., 2022). Third, and also as a response to these uncertainties, the increased importance of ESG is reflected in the rise of new reporting frameworks and obligations for companies (see, e.g., Financial Stability Board, 2017; GFANZ, 2022). For instance, the European Union requires companies to rigorously disclose risks and opportunities arising from ESG issues from 2025 on (EU, 2023). However, rising incentives for enterprises to disclose ESG matters open up a space for greenwashing practices, i.e., companies claiming to be sustainable while they are actually not. Therefore, there is a strong need for transparent and holistic assessments of companies' ESG claims and actions.
NLP in the ESG domain. To better assess the communication aspect of ESG integration, previous literature has already employed NLP methods. However, previous efforts are mainly centered around the climate change domain and therefore cover only a subdomain of the environmental pillar. In earlier and more applied finance research projects, keyword-based approaches are frequently employed (Cody et al., 2015; Sautner et al., 2023). It has to be noted, though, that these approaches suffer from a lack of context sensitivity (Varini et al., 2021). Thus, more recent research improved performance by using context-sensitive machine learning models like BERT. There exists a variety of datasets for BERT-based models solving various downstream tasks like classifying climate content (Webersinke et al., 2022; Kölbel et al., 2022; Bingler et al., 2022, 2023; Callaghan et al., 2021), topic detection (Varini et al., 2021), Q&A approaches (Luccioni et al., 2020), or claim detection and verification (Stammbach et al., 2022; Wang et al., 2021). Despite this large number of research projects in the climate domain, the analysis of communication in the general environmental pillar and in the social and governance pillars of ESG is widely disregarded.
Pretraining BERT models. The lack of general environmental as well as social and governance models is also reflected in domain-specific pretraining. General pretraining describes the language learning process of NLP models in a semi-supervised fashion. In this process, the model is provided with a large corpus of textual data. Typically, the language model learns by predicting masked words in a sentence (Devlin et al., 2019). In this process, general domain data like books or Wikipedia articles is commonly used (Liu et al., 2019). While these general pretraining datasets equip the model with good overall language understanding capabilities, prior research found that domain-specific, niche language can pose problems for these general-purpose models (Araci, 2019). This led to the creation of various text corpora to further pretrain general models for specific subdomains. The resulting models outperform their general counterparts in downstream tasks like classification or claim verification (Rasmy et al., 2021; Chalkidis et al., 2020; Araci, 2019). In the climate domain, ClimateBERT was further pre-trained on a climate-related text corpus (Webersinke et al., 2022).
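To make the masked-word objective concrete, the following is a minimal sketch of how one training example could be corrupted. It assumes the 15% selection rate and the 80/10/10 replacement scheme described by Devlin et al. (2019); the mask string, the toy vocabulary, and the function name `mask_tokens` are illustrative assumptions, not the pretraining code used in this paper.

```python
import random

MASK_TOKEN = "<mask>"  # illustrative mask symbol (assumption)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token where the model must make a
    prediction and None at all other positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model has to recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)         # 80%: replace with the mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the token unchanged
        else:
            labels.append(None)                      # position is not predicted
            corrupted.append(token)
    return corrupted, labels
```

The loss is computed only at positions where `labels` is not `None`, so further pretraining on domain text pushes the model to predict domain vocabulary from its context.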
Collectively, ESG is of ever-growing importance, and there is a significant demand for assessing companies' communication in the ESG domain. However, prior research has not addressed the detection of communication patterns or the creation of domain-specific pretrained models in the broader environmental, social, and governance domains. This research aims to fill this gap by creating holistic models and datasets to assess all three subdomains of ESG.
[...] enhance the language model's understanding of the subdomains. We create these specialized datasets for our tasks by following a two-step procedure.
First, we compile a base dataset of relevant underlying sources. As relevant sources, we define corporate news, annual reports, and sustainability reports. This decision aims to strengthen the models' ability to specialize in corporate jargon. We split each source into its sentences (see Appendix A for an overview of the employed datasets). Second, we apply a keyword search to find relevant text passages for all three subdomains of ESG (see Appendix B for all keywords). Thus, we create a specific dataset to train each model on a particular task. The sentence characteristics of the base sources, as well as the datasets, can be viewed in Table 1.
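The two-step procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword lists are hypothetical placeholders (the actual lists are given in Appendix B), the sentence splitter is a naive regular expression, and all function names are invented for this example.

```python
import re

# Hypothetical keyword lists for illustration only; see Appendix B for the real ones.
KEYWORDS = {
    "environmental": ["emission", "waste", "biodiversity"],
    "social": ["employee", "human rights", "diversity"],
    "governance": ["board", "audit", "compliance"],
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_by_keywords(sentences, keywords):
    """Keep sentences containing at least one keyword (case-insensitive substring match)."""
    lowered = [k.lower() for k in keywords]
    return [s for s in sentences if any(k in s.lower() for k in lowered)]


def build_domain_datasets(documents, keyword_map=KEYWORDS):
    """Step 1: split all documents into sentences. Step 2: filter per domain."""
    sentences = [s for doc in documents for s in split_sentences(doc)]
    return {
        domain: filter_by_keywords(sentences, kws)
        for domain, kws in keyword_map.items()
    }
```

Substring matching deliberately over-selects (for example, "board" also matches "onboard"), which is in line with the tolerance for false positives discussed in Appendix B.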
After creating the datasets for each subdomain of ESG, we pre-train the RoBERTa (Liu et al., 2019) and DistilRoBERTa (Sanh et al., 2020) [...]
[...] very high inter-annotator agreement of more than 86% on each task (for more details, see Appendix F). The resulting label distribution is outlined in Figure 1. In the second step, we use the created datasets to fine-tune and evaluate several classification models. To evaluate the model performance with the created datasets, we perform a five-fold cross-validation. This allows us to test the model performance on the entirety of the dataset. As Table 2 shows, the further pre-trained models consistently outperform their base models. Even the smaller, further pre-trained DistilRoBERTa models achieve on-par or superior performance compared to the larger base RoBERTa models. The overall results indicate strong model performance with over 93% accuracy for the social and environmental domains and over 89% in the governance domain. These results remain consistent when using different sets of hyperparameters, which supports the superiority of our further pre-trained models (see Appendix J).
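The five-fold cross-validation described above can be sketched in plain Python. The sketch assumes simple shuffled (not stratified) folds, and `train_and_score` is a hypothetical callback standing in for fine-tuning and evaluating a classifier; neither is confirmed as the exact setup behind Table 2.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(samples, labels, train_and_score, k=5):
    """Average the score returned by a user-supplied callback over k folds.

    train_and_score(train_x, train_y, test_x, test_y) -> float is a
    hypothetical callback that fine-tunes a model and returns, e.g., accuracy.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        scores.append(train_and_score(
            [samples[j] for j in train_idx], [labels[j] for j in train_idx],
            [samples[j] for j in test_idx], [labels[j] for j in test_idx],
        ))
    return sum(scores) / len(scores)
```

With a 2,000-sentence dataset, each fold trains on 1,600 sentences and tests on the remaining 400, so every sentence is used for testing exactly once.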
Electronic copy available at: https://ssrn.com/abstract=4622514
To find evidence for the validity of the models, we analyze whether the communication patterns that the models detect in companies' annual reports can explain variations in their respective ESG ratings. We hypothesize that companies are highly incentivized to disclose their ESG activities for various interconnected reasons. There is increasing societal and corresponding regulatory pressure to disclose ESG activities. On the one hand, there are protests for more climate action around the world (see, for example, Fridays for Future) and increased consumer awareness of overall ESG implementation (see, for instance, a PwC market study). On the other hand, legislators recognize the need for action in the ESG domain by introducing new regulatory frameworks, particularly in Europe (examples include the EU Green Deal). Thus, we argue that ESG communication is associated with higher ESG ratings. At the same time, we acknowledge that purely general ESG communication can be prone to greenwashing. Therefore, this relationship might represent a general average rather than the truth for every individual company.
In this study, we analyze the annual reports of the companies in the EuroStoxx600 index from 2017-2021. In total, we sample 2,758 annual reports of 600 enterprises from 22 countries. To mitigate concerns about the divergence of individual ratings, we consider the ESG ratings of three major data providers: Bloomberg, Refinitiv Asset4, and RobecoSAM. Furthermore, we complement the ESG data with fundamental data from Compustat. To quantify the ESG communication of a company, we apply the E, S, and G models to every sentence of the firm's annual report. If a sentence qualifies for any subcategory of ESG, it is automatically an ESG sentence. Furthermore, a sentence with more than one label among E, S, and G is assigned the multilabel label. To build a company score, we divide the number of ESG sentences by the number of all sentences in the annual report (ESG_com) (for more details, see Appendix G). To further investigate the relationship between ESG communication and ESG ratings, we regress ESG ratings on ESG communication, company fundamentals, and fixed effects. In this model, ESG_rat denotes the ESG (or E, S, G) rating of a company. We both investigate the relationship with individual ratings and build a combined rating. ϵ are company fundamentals, and ρ denotes fixed effects (see Appendix H for an overview of all variables in the regression).
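The regression equation itself is not reproduced in the text above. One plausible form, consistent with the symbols defined there, is sketched below. The intercept \alpha, the coefficients \beta and \gamma, and the residual u_{i,t} are assumptions introduced for illustration and are not the authors' notation; i indexes the company and t the year, as in Appendix G.

```latex
\[
  \mathit{ESG\_rat}_{i,t}
    = \alpha
    + \beta \, \mathit{ESG\_com}_{i,t}
    + \gamma^{\top} \epsilon_{i,t}
    + \rho
    + u_{i,t}
\]
```

Under this reading, \beta is the coefficient of interest: it captures the association between the share of ESG sentences and the rating after controlling for fundamentals and fixed effects.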
The regression results support the assumption that ESG communication possesses explanatory power for combined ESG ratings. As Table 3 shows, all ESG communication coefficients indicate a strong and significant relationship. Put simply, a 1% increase in ESG (E) communication is associated with an increase of the ESG (E) rating by 0.6% (0.88%). These findings are largely consistent when performing the regressions with the individual Refinitiv Asset4, Bloomberg, and RobecoSAM ratings as dependent variables (see Appendix I).

Conclusion
In conclusion, this paper demonstrates the development of robust pre-trained and fine-tuned models to detect ESG communication in textual disclosures. The publication of datasets and models will help a variety of stakeholders to rigorously and transparently assess companies' ESG communication.

Limitations
While we include a large number of textual samples, the models are likely limited to written disclosures and may not be equally effective on transcripts of verbal communication. Furthermore, the broad nature of the governance label might entail generalizability problems. Finally, although the regression analysis results suggest that the models are indeed working in the intended manner, further investigations are needed to uncover patterns beyond correlations.

Appendix

A Descriptive Statistics of the Pre-Training Datasets
The following tables display the descriptive statistics of the pretraining datasets. While Table A.1 shows the general population of sentences, Table A.2 gives an overview of the keyword-filtered datasets for all three domains.

B ESG Keywords
The following list presents the keywords used to identify ESG sentences for creating the pretraining data for the environmental, social, and governance models. While we acknowledge that keywords will likely result in many false positives, i.e., non-ESG sentences, we argue that this is beneficial in the context of pretraining. We want the model to learn the context of sustainability topics. Thus, the models must be exposed to true and false positive samples during the additional pretraining. This procedure also explains why we create environmental models that complement the closely related ClimateBERT project (Webersinke et al., 2022). First, our new pretraining dataset is sentence-level, while ClimateBERT is based on a paragraph-level dataset. Second, we refine the keyword approach to reflect the differences between the more specific climate topics and the more general environmental topics.
Furthermore, these keywords were used to create a subset of the labeling data for the 2k expert-annotated datasets.

C Pre-training Details
For the pretraining process, we perform a 90-10 train-eval split to create training and evaluation datasets. We use the evaluation dataset to continuously monitor the loss during training and determine a reasonable stopping point for the pretraining. As Table C.3 shows, we stop the training after 15 epochs. Furthermore, we select the training arguments in line with prior research (Webersinke et al., 2022).
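The split and the loss monitoring described above can be sketched as follows. The shuffling seed and both function names are illustrative assumptions, and picking the epoch with the lowest evaluation loss is only one possible way to choose a stopping point; the text does not specify the exact rule.

```python
import random


def train_eval_split(sentences, eval_fraction=0.10, seed=0):
    """Shuffle the sentences and hold out a fraction for evaluation (90-10 by default)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train, eval)


def best_epoch(eval_losses):
    """Return the 1-based epoch with the lowest evaluation loss."""
    return min(range(len(eval_losses)), key=eval_losses.__getitem__) + 1
```

For example, `best_epoch([2.0, 1.5, 1.2, 1.3])` returns 3, because the loss rises again in the fourth epoch.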

D Detailed Description of the Labeling Data Creation
The construction of the labeling data follows three considerations. First, we want to include sentences that specifically address all three subdomains as well as the differences between the subdomains. Thus, we sample 1,000 sentences of the keyword-filtered datasets (see Table A.2). These 1,000 sentences are labeled with all three labels of environmental, social, and governance. This allows us and the trained models to identify intersections and clear decision boundaries between the labels. Second, we complement this dataset with sentences that specifically address one subdomain. We design three datasets that contain 500 sentences per subdomain with keywords (see datasets in Table A.2) and 250 sentences with no specified keywords (see datasets in Table A.1). Thus, we sample and label 2,250 sentences, 750 sentences per subdomain. This composition aims to create a dataset that contains samples with task-specific differences as well as general language.
To further enhance the dataset with this third consideration, exposing the model to general language sentences, we label another 250 sentences from the general dataset (see Table A.1). This should help the models better understand the semantic space the three domains occupy in terms of general sentences.
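Read together with Appendix F, the three sampling steps add up to the 2,000 sentences available per domain. The grouping below is one reading of the description above, not a breakdown stated explicitly in the text.

```latex
\[
  \underbrace{1{,}000}_{\text{shared, labeled for E, S, and G}}
  + \underbrace{500 + 250}_{\text{domain-specific (with and without keywords)}}
  + \underbrace{250}_{\text{additional general sentences}}
  = 2{,}000 \ \text{sentences per domain}
\]
```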

E ESG Labelling Guidelines
The aim of the guidelines presented in Table E.4 is to define a clear and consistent system for labeling content across the entire spectrum of the environmental, social, and governance domains within ESG across three datasets, each focusing on its respective domain. Labels are assigned as "1" for Yes, indicating a sentence is relevant to that domain, and "0" for No, indicating it is not.

F Inter-Annotator Agreement
Ultimately, we create a 2k dataset for every domain of ESG. Table F.5 displays the inter-annotator agreement rates. While the agreement is generally very high, the governance domain stands out with the lowest agreement rates. This mirrors the more generic nature of the governance domain.
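As an illustration of how such agreement rates can be computed for two annotators with binary labels, consider the sketch below. It assumes simple percent agreement; Cohen's kappa is added as a common chance-corrected alternative. Whether Table F.5 reports either of these statistics is an assumption, and both function names are illustrative.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labeled independently at their own base rates.
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

Percent agreement is easy to interpret but can look high when one label dominates, which is why a chance-corrected measure is often reported alongside it.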

G Building Environmental, Social, Governance and Multilabel Scores
To build a company score, we divide the number of ESG (or E, S, G) sentences by the number of all sentences in the annual report:
\[ \mathit{ESG\_com}_{i,t} = \frac{\#\,\text{ESG sentences}_{i,t}}{\#\,\text{sentences}_{i,t}} \]
where i indexes the company and t denotes the respective year. Besides the E, S, G, and ESG communication scores, we also calculate a multilabel score. This captures sentences that are assigned to more than one label of E, S, and G.
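The score construction can be sketched for a single annual report as follows. The input format (one dictionary of binary E, S, and G predictions per sentence) and the function name are assumptions made for this illustration.

```python
DOMAINS = ("E", "S", "G")


def communication_scores(sentence_labels):
    """Compute E, S, G, ESG, and multilabel shares for one annual report.

    sentence_labels: list with one dict per sentence, e.g. {"E": 1, "S": 0, "G": 0}
    (hypothetical format for the per-sentence classifier outputs).
    """
    n = len(sentence_labels)
    if n == 0:
        raise ValueError("report contains no sentences")
    counts = {"E": 0, "S": 0, "G": 0, "ESG": 0, "multilabel": 0}
    for labels in sentence_labels:
        hits = [d for d in DOMAINS if labels[d]]
        for d in hits:
            counts[d] += 1
        if len(hits) >= 1:  # any E, S, or G label makes it an ESG sentence
            counts["ESG"] += 1
        if len(hits) >= 2:  # more than one label: multilabel sentence
            counts["multilabel"] += 1
    return {name: count / n for name, count in counts.items()}
```

Because a sentence counts once toward the ESG score regardless of how many labels it carries, the ESG share can be smaller than the sum of the E, S, and G shares.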

H Descriptive Statistics of the Regression Variables
Table H.6 provides an overview of the variables used in the regression analysis. The rating scores are min-max scaled to the range 0-1 to enable a comparison of the subsequent regression results.
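Min-max scaling maps each rating x to (x - min) / (max - min). A minimal sketch is shown below; applying the scaling separately to each rating provider's scores, and mapping a constant series to 0.0, are assumptions of this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so that the minimum is 0.0 and the maximum is 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant series: no spread to rescale (assumption)
    return [(v - lo) / (hi - lo) for v in values]
```

For instance, ratings of 20, 50, and 80 on a provider's original scale become 0.0, 0.5, and 1.0, which makes coefficients comparable across providers with different rating scales.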

I Regressions with Individual Ratings Agencies
The regression analysis investigating correlations between ESG communication and ESG ratings was performed on the aggregated ESG ratings as well as on the individual ones. Tables I.7, I.8, and I.9 show the results for the regressions with RobecoSAM, Refinitiv Asset4, and Bloomberg ratings as dependent variables. All ratings are min-max scaled to the range 0-1 to enable a comparison of the results. As the tables show, the findings are largely consistent when performing the regressions with the individual Refinitiv Asset4 and RobecoSAM ratings as dependent variables. In contrast, for the Bloomberg rating, only the ESG score correlates significantly with ESG communication. The single dimensions display no significant relation. These findings align with the general debate about heterogeneity in rating agency results. Overall, however, ESG communication seems to possess explanatory power for ESG ratings. Therefore, the models deliver the desired outcome.

J Models Results
Table J.10 gives an overview of the detailed results of the five-fold cross-validation with the standard hyperparameter setup. The results indicate that the further pre-trained models outperform their base model counterparts in the vast majority of evaluation dimensions. It becomes apparent that social and environmental sentences are comparatively easy for the models to distinguish. This is in line with the insights on the inter-annotator agreement. Governance sentences are often more vague and broad. Furthermore, Table J.11 displays the same evaluation with a different set of hyperparameters. The further pre-trained models consistently outperform their base counterparts, which supports the overall results.

Figure 1: Label distribution in the 2k expert-annotated datasets
Figure B.1: Bar charts representing the number of sentences filtered based on the top 10 keywords for the Environment, Social, and Governance domains respectively
Figure C.5: Pre-Training Log Loss of SocRoBERTa across epochs

Table 2: Classification results of five-fold cross-validation

Table 3: Regression results

Table E.4: ESG labeling guidelines
Environmental. Environmental criteria comprise a company's energy use, waste management (contaminated land or disposal of hazardous/toxic emissions), pollution, natural resource conservation, and treatment of the biosphere, as well as compliance with governmental regulations. Special areas of interest are climate change and environmental sustainability (e.g., handling diminishing raw materials).
A. It reconciles its participants' greenhouse gas emissions reduction targets (typically varying by 10% to 75% on scope 1 and 2) with the science-based climate data. (Yes)
B. Through the course of 2021, M&G Investments continued to participate in CA100+ collective engagement groups. (Yes, CA100+ is a climate initiative)
C. Individual investments can have both a positive and negative impact on society and the environment. [...]
Social. [...] December 2017, the Company set up two free shares plans for the benefit of employees of the Company and related companies or corporate groups to share with them the success of the Group since its creation and, in particular to take into account its exceptional growth during the 2016 and 2017 financial years. (Yes, sole financial benefits for stakeholders are considered social)
Governance. [...]
A. The board of management serves as an oversight for our ESG implementation. (Yes)
B. An ethical code has been issued to all Group employees. (Yes, generic but ethically relevant)
C. The members of the Supervisory Board shall also have sufficient time to carry out their respective responsibilities, taking into account all personal and professional commitments. (No, sole mentioning of the board is not enough)