On the information content of sovereign credit rating reports: Improving the predictability of rating transitions

Journal of International Financial Markets, Institutions & Money

In order to identify novel qualitative determinants of transitions in sovereign credit ratings, we construct six different textual sentiment and subjectivity measures using dictionary-based and machine learning approaches on sovereign credit rating reports issued by Moody's and Fitch in the period from 2002 to 2017. After controlling for macroeconomic and fiscal strength, soft information, as well as known sources of proximity biases, we find that, on average, these novel text-based measures improve the classification accuracy of downgrades and upgrades. The improvement is more notable for sentiment than subjectivity measures, and for downgrades compared to upgrades. Next, we find evidence that credit rating agencies seem to follow the through-the-cycle rating philosophy by taking a longer horizon into account. Finally, to the best of our knowledge, we offer the most comprehensive analysis of textual sentiment measures and their effect on sovereign credit ratings thus far.


Introduction
The formation of sovereign credit ratings has puzzled researchers for over two decades. Cantor and Packer (1996) were among the first to delve into the determinants of sovereign credit ratings, spurring ample subsequent research (see e.g. Afonso, 2003; Butler and Fauver, 2006; Afonso et al., 2009; Özturk, 2014). Nevertheless, a relatively large part of ratings has always remained unexplained by hard (quantitative macroeconomic) data (see e.g. De Moor et al., 2018; Özturk, 2014). Credit rating agencies (Moody's, 2016; Fitch, 2017; Standard and Poor's, 2017) claim that part of the rating represents the qualitative knowledge of the rating committee.
Despite being aware of this, countries or government issuers rely on credit rating agencies and sovereign credit ratings, as they give a clear and relatively reliable signal of their creditworthiness to the international capital markets. Higher ratings translate into lower borrowing costs and vice versa. Consequently, changes in sovereign credit ratings are equally, if not more, important, as they lead to a deterioration or improvement of borrowing costs in the future (Eijffinger, 2012; Alsakka and ap Gwilym, 2013). Of particular importance is the transition between investment and speculative grade: Kiff et al. (2010) find evidence of this transition break-point having significant effects on CDS spreads. Being able to understand the complex determinants of rating transitions beyond the well-established quantitative and known qualitative elements would have meaningful benefits for issuers and investors. Both would be better able to assess the potential impact of changes in sentiment and subjectivity when it comes to rating transitions. A few studies try to estimate sovereign transition matrices (see e.g. Hill et al., 2010; Hu et al., 2002), but face limitations mainly due to data shortage. However, the underlying problem remains, namely that the qualitative judgment of the rating committee is not captured by the traditional determinants of sovereign credit ratings.
The main objective of this paper is thus to explore additional qualitative determinants of sovereign credit ratings. Specifically, we apply a different approach to identifying the qualitative component, as proposed by Slapnik and Loncarski (2019). They hypothesize that the qualitative judgement or interpretation of the rating committee is to a large extent expressed in the reports, which have been largely neglected in the past. They use dictionary-based textual sentiment methods to extract sentiment and subjectivity scores from the reports and find the latter to additionally explain sovereign credit ratings, even after political risk, institutional strength and potential bias are controlled for.
We aim to determine whether textual sentiment and subjectivity measures improve the performance of models predicting rating transitions, using a sample of 98 countries rated by Moody's in the period from 1996 to 2017 and 76 countries rated by Fitch in the period from 2002 to 2017. Using logistic regression, we focus on the classification accuracy of downgrades and upgrades. We define our binary dependent variable as equal to one if country i is downgraded/upgraded at time t, and zero otherwise.
We construct six different textual sentiment and subjectivity measures, using both dictionary-based and machine learning approaches on sovereign credit rating reports issued by Moody's and Fitch. Sentiment measures, namely net sentiment and polarity, are based on detecting negative and positive words or sentences, while subjectivity measures focus on detecting opinion. We compare these measures by including them in the models both separately and simultaneously, and explore their informativeness (i.e. a significant effect on rating upgrades/downgrades).
We find that, on average, sentiment measures perform better than subjectivity measures. Specifically, we observe that the correct classification of true positives, i.e. sensitivity, increases when we include sentiment measures in the models. This improvement is more pronounced for downgrades than for upgrades. The increase in performance is also more distinct for Moody's than for Fitch. This might be due to the fact that Moody's sample predominantly consists of emerging markets, while Fitch's sample is dominated by advanced economies. The former may require a stronger qualitative judgement component by the rating committee due to greater uncertainty and limited data (see e.g. Cantor and Packer, 1996; Luitel et al., 2016).
On the other hand, subjectivity scores, on average, offer relatively poor results compared to sentiment scores. There is also no clear winner between the textual analysis approaches: with sentiment measures, the dictionary-based techniques appear to outperform machine learning, while the machine learning approach surpasses dictionary-based applications for subjectivity measures.
Finally, credit rating agencies can comply with two different rating philosophies: under the through-the-cycle rating philosophy, they take a longer horizon into account, while under the point-in-time approach, they focus on current information (Basel Committee on Banking Supervision, 2005). Credit rating agencies generally employ the through-the-cycle approach but have been criticised in the past for their failure to do so (Kiff et al., 2012; Kaminsky and Schmukler, 2002; Ferri et al., 1999). We thus estimate our models in both frameworks. Our findings suggest that, while credit rating agencies may, from time to time, "fall off the wagon" 1 , they, on average, follow the through-the-cycle rating philosophy, since taking past and future values into account leads to better model performance.
We believe that this paper provides a significant methodological contribution. Although textual sentiment analysis is relatively widely used in corporate finance (see e.g. Loughran and McDonald, 2016; Kearney and Liu, 2014), its application to sovereigns is limited, especially in the field of sovereign credit ratings. To the best of our knowledge, only two prior studies examine sovereign credit rating reports: the already mentioned study by Slapnik and Loncarski (2019), and Agarwal et al. (2019), who use a machine learning approach and also find evidence of sentiment or tone having additional explanatory value. However, no study has explored more than one approach, and no comparative analysis exists on the performance of such measures. This paper thus offers the most comprehensive analysis of textual sentiment measures and their effect on sovereign credit ratings thus far.
Additionally, the practical implications are considerable, since sovereign credit ratings and changes in ratings have significant effects on both the international debt markets and governments' borrowing costs. A better understanding of qualitative determinants, in particular those driven by sentiment and subjectivity, provides more insight into the discussion regarding soft information and bias in credit ratings, thus yielding an exogenous perspective on (interpretation of) credit rating reports. Hence, our findings provide important additional insight when it comes to the interpretation of credit rating reports and their impact on rating transitions.
The rest of the paper is structured as follows. In section two, we review the existing literature on sovereign rating transitions and credit rating philosophies. We describe the data and methodology in section three, where we specifically outline the textual analysis framework. Next, in section four, we comment on the results in three subsections: in the first part, we focus on the point-in-time analysis, while we examine the through-the-cycle philosophy in part two. In the last subsection, we perform robustness checks. We conclude in section five.
1 For a discussion regarding the inconsistency and the so-called "great migration" in ratings, see, for example, Forest et al. (2015) and Zenner et al. (2013).

Literature review
Sovereign credit ratings reflect the creditworthiness of governments, or the probability that they will repay their debts. Credit rating agencies usually assign ratings and decide on rating changes by taking into account various quantitative and qualitative factors. A large body of literature tries to identify the determinants of sovereign credit ratings (Cantor and Packer, 1996; Afonso, 2003; Afonso et al., 2009), as well as the proxies for qualitative factors, i.e. the credit rating committee's opinion (Özturk, 2014; De Moor et al., 2018; Slapnik and Loncarski, 2019). However, the extent to which ratings rest on quantitative versus qualitative reasoning remains unclear. Credit rating agencies all note that their ratings are merely an opinion (Moody's, 2016; Fitch, 2017; Standard and Poor's, 2017). Despite increased demand for transparency of the credit rating process, discrepancies and a lack of clarity remain (Eijffinger, 2012; Reusens and Croux, 2017; Kiff et al., 2010; Forest et al., 2015).
Given that traditional approaches are relatively unsuccessful in determining the importance of the credit rating committee's opinion, we adopt a different approach. Slapnik and Loncarski (2019) use sovereign credit rating reports as a source of additional information not taken into account by previous research, shedding new light on the qualitative judgement of the rating committee. They argue that textual sentiment and subjectivity extracted from the reports using a dictionary-based textual analysis approach help in explaining sovereign credit ratings. We build on the textual sentiment approach but extend the analysis with different measures, obtained from both dictionary-based and machine learning methods. We focus specifically on downgrades and upgrades, as rating changes have important economic consequences (Eijffinger, 2012; Alsakka and ap Gwilym, 2013). Since the rating change is explained in the sovereign credit rating report, we hypothesize that textual sentiment measures capture this and will thus help in classifying downgrades and upgrades.
Little additional evidence exists on the use of textual sentiment analysis on sovereign credit rating reports. To the best of our knowledge, only one other study tackles this problem. Agarwal et al. (2019) find that negative tone significantly affected the CDS spreads and helped in predicting future downgrades. These studies shed new light on the formation of sovereign credit ratings and highlight the importance of alternative sources of information for investors, issuers, and other users apart from sovereign credit ratings alone.
More studies exist on transitions between sovereign credit ratings. Previous research predominantly tries to estimate transition matrices, specifically the default probabilities for each rating class and the probabilities of transition between them (e.g. Hill et al., 2010; Fuertes and Kalotychou, 2007; Hu et al., 2002). However, as Hill et al. (2010) argue, the availability of data due to short time series, especially for emerging markets, poses a limitation when conditioning transitions between ratings at the sovereign level. Some researchers address this problem by constructing rating histories to augment the dataset (Hu et al., 2002; Fuertes and Kalotychou, 2007). We believe this approach is potentially problematic, because it assumes that the underlying model predicting the missing ratings is the true model, while existing evidence shows that most models have limited accuracy in predicting ratings correctly (Reusens and Croux, 2017; Özturk, 2014). Fuertes and Kalotychou (2007) test three alternative estimators of sovereign transition matrices and identify biases. Others avoid this by focusing only on upgrades, downgrades and no change in ratings (Purda, 2007). We take the latter approach, partly because of the above-mentioned concerns, but also because we want to focus attention on textual sentiment measures and their comparison, not on transition matrices themselves. A potential drawback of this approach is that, on average, the probability of a rating change is higher for lower rating classes (Hill et al., 2010). We control for this by including credit ratings in the model. Additionally, Hill et al. (2010) find that credit watch and outlook have, on average, relatively strong predictive power for rating changes. We thus include the outlook variable in the second part of the analysis. Purda (2007) finds that upgrades are relatively more difficult to predict than downgrades. We thus expect to achieve higher classification accuracy for downgrades than for upgrades.
Credit rating agencies have been criticised extensively in the past and accused of being procyclical (Forest et al., 2015). Ferri et al. (1999) argue that rating changes were delayed during the East Asian crisis, i.e. countries were downgraded when it was already too late, causing a deepening of the crisis. On the other hand, the ratings did not increase sufficiently after the crisis, i.e. countries were upgraded too late. Additionally, Mora (2006) finds evidence of ratings lagging behind financial markets, further indicating that credit rating agencies are (or at least were) not as forward-looking as they claim to be. Similarly, Kaminsky and Schmukler (2002) find that downgrades occurred after the markets had started crashing. The East Asian crisis failure is not an isolated event, as Kiff et al. (2012) argue that this was also the case in the last financial crisis, especially the downgrades of European sovereigns.
Credit rating agencies can follow different rating philosophies for the incorporation of macroeconomic effects in credit ratings: through-the-cycle (TTC) and point-in-time (PIT), where the former looks over the whole economic cycle (i.e., a longer horizon) and the latter reflects currently available information (i.e., a shorter horizon) (Basel Committee on Banking Supervision, 2005). As Kiff et al. (2012) point out, ratings are based on the probability that the issuer will withstand potential turmoil and should not be changed unless fundamentals change (through-the-cycle). Taking this into consideration, a recession or tightening should not cause a downgrade. They note that susceptibility to cycles affects the rating, but the current cyclical situation (the point-in-time view) should not. One could argue that in the above examples, the credit rating agencies reacted more in line with the point-in-time philosophy than with through-the-cycle. Kiff et al. (2012) argue that in light of these excessive downgrades, credit rating agencies established new methodologies that extended the TTC criteria to what they call 'through-crisis criteria', using different hypothetical stress scenarios corresponding to different rating categories. Credit rating agencies use these to determine how much stress governments are able to endure before defaulting.
Nevertheless, Kiff et al. (2012) argue that credit rating agencies provide added value with respect to already available public information and therefore have a significant role in international markets. Specifically, they find significant effects on CDS spreads of upgrades into and downgrades out of the investment grade category. Additionally, rating stability is important at a systemic level, since rating downgrades (especially from investment to speculative grade) can be related to liquidation and price falls (Eijffinger, 2012). More problems arise due to spillovers across markets (Alsakka and ap Gwilym, 2013). Consequently, predicting future downgrades or upgrades is important for a country as well as its economic and financial partners. This is especially important for emerging markets, which, in general, have relatively low ratings.
Given the above-mentioned ambiguity over whether credit rating agencies assign ratings in accordance with the point-in-time (PIT) or through-the-cycle (TTC) philosophy, we propose two approaches: one representing the point-in-time concept, where we take into account current values only, and one representing the through-the-cycle concept, where we also consider past and future values. If credit rating agencies follow the through-the-cycle approach, the latter specification should deliver higher classification accuracy than the point-in-time specification.

Data
Sovereign credit ratings are predominantly assigned by the three biggest, US-based credit rating agencies, namely Standard and Poor's, Moody's, and Fitch. We obtain the historical sovereign credit ratings from Thomson Reuters Eikon. We examine long-term foreign currency sovereign ratings assigned (i) to 98 countries from 1996 to 2017 by Moody's, and (ii) to 76 countries from 2002 to 2017 by Fitch. Our sample is limited by the availability of sovereign credit rating reports, which we use for the textual sentiment analysis. Specifically, we focus on rating transitions, i.e. downgrades and upgrades. Credit rating agencies periodically review the assigned ratings, at which point the ratings are either changed or affirmed. A more frequent review of ratings is possible under special circumstances, such as in the examples given above (the East Asian crisis, the European sovereign debt crisis, etc.). There are 35 advanced and 63 emerging countries in Moody's sample, and 48 advanced and 28 emerging countries in the Fitch sample 2 . The list of countries included in the analysis is provided in Appendix A (Table A.11).
Countries are rated both as investment grade (ratings Aaa 3 /AAA 4 through Baa3/BBB-) and speculative grade (ratings Ba1/BB+ through C/D), with Aaa/AAA being the highest possible rating and C/D the lowest. We transform the ratings to a numerical ordinal scale ranging from 1 to 21, where 21 corresponds to an Aaa/AAA rating and 1 to a C/D rating.
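As an illustration, the following is a minimal sketch of this transformation. The paper pins down only the endpoints (21 = Aaa/AAA, 1 = C/D), so the intermediate notch alignment, taken from the standard Moody's and Fitch scales, is an assumption.

```python
# A minimal sketch of the 21-point numerical transformation described above.
# Only the endpoints (21 = Aaa/AAA, 1 = C/D) are given in the text; the
# intermediate notch alignment follows the standard Moody's/Fitch scales
# and should be read as an assumption.

MOODYS = ["Aaa", "Aa1", "Aa2", "Aa3", "A1", "A2", "A3",
          "Baa1", "Baa2", "Baa3",                      # investment grade ends here
          "Ba1", "Ba2", "Ba3", "B1", "B2", "B3",
          "Caa1", "Caa2", "Caa3", "Ca", "C"]

FITCH = ["AAA", "AA+", "AA", "AA-", "A+", "A", "A-",
         "BBB+", "BBB", "BBB-",
         "BB+", "BB", "BB-", "B+", "B", "B-",
         "CCC+", "CCC", "CCC-", "CC", "C/D"]

# Map each rating symbol to the ordinal scale: 21 for Aaa/AAA down to 1 for C/D.
moodys_to_num = {r: 21 - i for i, r in enumerate(MOODYS)}
fitch_to_num = {r: 21 - i for i, r in enumerate(FITCH)}

def is_investment_grade(numeric_rating: int) -> bool:
    """Baa3/BBB- (12 on the ordinal scale) and above count as investment grade."""
    return numeric_rating >= moodys_to_num["Baa3"]

print(moodys_to_num["Aaa"], fitch_to_num["BBB-"], is_investment_grade(12))
```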
The dataset for the main analysis consists of four groups of variables, described in Table 1. In addition to the traditional macroeconomic explanatory variables, we include country risk indicators and introduce our key variable(s) obtained from textual sentiment analysis, i.e. textual sentiment and subjectivity measures, as potential proxies for the qualitative judgement of the rating committee. We discuss these in more detail in the next section. The full summary statistics are provided in Appendix A in Tables A.12 and A.13.
We take into account an accumulating body of literature that points to the existence of a bias in sovereign credit ratings, especially a downward bias towards emerging markets (Fuchs and Gehring, 2017; Zheng, 2012; De Moor et al., 2018; Gültekin-Karakaş et al., 2011). To control for potential bias, in line with De Moor et al. (2018) and Slapnik and Loncarski (2019), we add proxies for economic and cultural proximity, namely: (i) trade proximity, reflecting the trade intensity of a country with the US; (ii) common language with the US; (iii) religious proximity, reflecting the probability that two randomly chosen individuals in the US and a particular country share the same religion; and (iv) geographical distance, based on the latitude and longitude from Washington DC (US) to the capital city of a particular country. 5
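For the geographical distance proxy, the paper specifies only that it is based on latitude and longitude from Washington DC to each capital city; the sketch below assumes a great-circle (haversine) computation, with illustrative coordinates.

```python
# A sketch of the geographical distance proxy, assuming a great-circle
# (haversine) computation; the formula choice and the example coordinates
# are assumptions, as the paper specifies only "based on latitude and
# longitude from Washington DC to the capital city".
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius ~ 6371 km

WASHINGTON_DC = (38.9072, -77.0369)

# Example: distance from Washington DC to Ljubljana (illustrative coordinates).
print(round(haversine_km(*WASHINGTON_DC, 46.0569, 14.5058)), "km")
```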

Textual sentiment analysis
Kearney and Liu (2014) describe sentiment as the degree of positivity or negativity in texts, which can contain both subjective judgement or opinion and an objective view of the economic environment. Credit rating agencies generally issue credit rating reports together with the assigned ratings, explaining their decision. Our objective is to apply a textual sentiment analysis approach to these reports in order to examine how various textual sentiment measures affect the performance of the model and its classification accuracy.

Stone et al. (1966) are pioneers in the field of textual analysis, widely known as natural language processing. They define it as any technique that enables inference by objectively and systematically identifying specified characteristics within the text. Two main textual analysis approaches exist, namely dictionary-based and machine learning, where the latter is further split into supervised and unsupervised learning. The dictionary-based approach is relatively straightforward: a computer processes the text and classifies words or sentences into categories based on a previously defined dictionary (Li, 2010). It is commonly known as the 'bag-of-words' approach, as it ignores any structure or connections between words (Manning et al., 1999). Machine learning is more complex, as it relies on statistical techniques to infer the content of texts and to classify them based on statistical inference (Li, 2010).

2 Based on the IMF classification.
3 Moody's credit rating scale.
4 Fitch credit rating scale.
5 Given that we focus on the effect of sentiment and subjectivity of credit rating reports on rating transitions, we do not report detailed results for all of the control variables, as the tables would be too complex and uninformative. Detailed results are available from the authors upon request.
The dictionary approach and the machine learning approach each have advantages and disadvantages. Loughran and McDonald (2016) list several important advantages of a dictionary-based approach. By selecting a dictionary, researchers' subjectivity is avoided. Usually, large samples are generated, since computer programs tabulate the frequency counts of words. As Kearney and Liu (2014) argue, the dictionary approach is likely the easiest for economists and financiers to employ. Additionally, the dictionary-based approach will, on average, be less time-consuming and less costly than the machine learning approach, since the text in the latter's 'training set' has to be manually categorized. However, as Li (2010) argues, it is highly likely that no existing dictionary fits a particular type of text at hand, as is the case with credit rating reports. Even if such a dictionary exists, the dictionary-based approach does not take the context of a sentence or text into consideration. Additionally, the accuracy rate of machine learning is usually higher than that of the dictionary-based approach. Loughran and McDonald (2016) focus on Naïve Bayes, but their arguments can be generalized. Since machines process the text, large corpora can be included in the analysis. After the classification rules are established, the measuring of sentiment is not exposed to any additional subjectivity of the researcher.

We collected Rating Action reports by Moody's, available between 1996 and 2017, and Full Rating reports by Fitch, available between 2002 and 2017, which form the corpus for the various textual analysis techniques. We begin with the dictionary-based approach, using the LM financial dictionary by Loughran and McDonald (2011) and focusing on the positive and negative sentiment categories. This is in line with Kearney and Liu (2014) and Loughran and McDonald (2011), who warn that general language dictionaries (such as General Inquirer (GI) or DICTION) are not appropriate for analysing financial texts and note that using finance-specific dictionaries leads to more accurate sentiment scores. Regarding the weighting of words and sentences, we apply proportional weighting, which assumes an equal weight for every word or sentence. We calculate the percentage of the words/sentences in a given sentiment category in the total number of words/sentences in the text. We make two assumptions, illustrated in the sketch below: (i) if more than one report is published in a calendar year, we take the sentiment from the last report in that year; and (ii) if no reports are published in a calendar year, we assume there was no change in the prevailing sentiment/perception and take the value from the previous year.
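A minimal sketch of these two aggregation assumptions follows; the column names and example values are hypothetical.

```python
# A sketch of the two aggregation assumptions above: keep the last report's
# score per country-year, then carry the previous year's value forward when
# no report was published; column names and values are hypothetical.
import pandas as pd

reports = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2005-03-01", "2005-11-15", "2007-06-30", "2005-05-01"]),
    "sentiment": [0.02, -0.01, 0.03, 0.00],
})
reports["year"] = reports["date"].dt.year

# (i) if several reports appear in a calendar year, keep the last one
annual = (reports.sort_values("date")
                 .groupby(["country", "year"], as_index=False).last())

# (ii) if no report appears in a calendar year, carry the previous value forward
full_years = (annual.set_index("year")
                    .groupby("country")["sentiment"]
                    .apply(lambda s: s.reindex(range(s.index.min(),
                                                     s.index.max() + 1)).ffill())
                    .reset_index())
print(full_years)
```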
We construct two different sentiment measures resulting from the dictionary approach. (Net) sentiment is the difference between positive and negative sentiment, where negative/positive sentiment is calculated as the ratio between the number of negative/positive words in the text and the total number of words. Next, we define polarity as:

$$ polarity_{i,t} = \frac{pos_{i,t} - neg_{i,t}}{pos_{i,t} + neg_{i,t}}, $$

where $pos_{i,t}$ is the count of positive words and $neg_{i,t}$ is the count of negative words in the text.
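To illustrate, the following is a minimal sketch of the two dictionary-based measures with proportional weighting; the tiny word lists are placeholders standing in for the Loughran-McDonald positive and negative categories.

```python
# A minimal sketch of the dictionary-based measures with proportional
# weighting; the word lists below are placeholders standing in for the
# Loughran-McDonald positive and negative categories.
import re

LM_POSITIVE = {"improvement", "strong", "stable", "gain"}   # placeholder subset
LM_NEGATIVE = {"deterioration", "weak", "default", "loss"}  # placeholder subset

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def net_sentiment(text: str) -> float:
    """Share of positive words minus share of negative words."""
    tokens = tokenize(text)
    pos = sum(t in LM_POSITIVE for t in tokens)
    neg = sum(t in LM_NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens) if tokens else 0.0

def polarity(text: str) -> float:
    """(pos - neg) / (pos + neg), as defined above."""
    tokens = tokenize(text)
    pos = sum(t in LM_POSITIVE for t in tokens)
    neg = sum(t in LM_NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

report = "The outlook reflects a strong fiscal improvement despite weak external demand."
print(net_sentiment(report), polarity(report))
```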
In addition to the simple negative/positive dichotomy, the LM dictionary also offers categories for 'uncertainty' (terms expressing imprecision rather than exclusively focusing on risk) and for 'strong modal' and 'weak modal' words (terms expressing levels of confidence). Slapnik and Loncarski (2019) argue that subjective statements typically express personal opinion, emotion or judgement, while objective statements mainly consist of facts. They propose an alternative measure to net sentiment and polarity, namely the subjectivity score. To construct it, they define a new 'subjectivity' category, consisting of the three abovementioned categories, and expand it with words that are not originally included in any of them but express subjectivity, such as expect or forecast. They note that qualitative judgement represents a relatively important part of sovereign credit ratings, especially for emerging markets. Rating committees reviewing such countries are faced with data shortages or unreliable information, and consequently have to rely more on their expert knowledge. They hypothesise that the subjectivity score may be better equipped to detect this than sentiment. We build on this idea and first repeat the analysis using the 'subjectivity' category to retrieve the subjectivity score, measured as the percentage of subjective words in the total number of words in the text. To ensure comparability with the machine learning approach, we also construct the subjectivity score at the sentence level, calculated as the ratio between the number of subjective sentences in the text and the total number of sentences, where subjective sentences are defined as sentences containing at least one word from the 'subjectivity' category.
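A minimal sketch of the word- and sentence-level subjectivity scores, assuming a small placeholder word list in place of the expanded 'subjectivity' category:

```python
# A sketch of the word- and sentence-level subjectivity scores; the word list
# is a placeholder standing in for the expanded 'subjectivity' category
# (uncertainty, strong/weak modal words, plus additions such as 'expect').
import re

SUBJECTIVE = {"may", "could", "uncertain", "expect", "forecast", "believe"}

def word_subjectivity(text: str) -> float:
    """Share of subjective words in the total number of words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in SUBJECTIVE for t in tokens) / len(tokens) if tokens else 0.0

def sentence_subjectivity(text: str) -> float:
    """Share of sentences containing at least one subjective word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    subjective = sum(
        any(t in SUBJECTIVE for t in re.findall(r"[a-z]+", s.lower()))
        for s in sentences
    )
    return subjective / len(sentences) if sentences else 0.0

report = "Debt is high. We expect growth to recover. Reserves remain adequate."
print(word_subjectivity(report), sentence_subjectivity(report))
```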
We obtain the final two sentiment and subjectivity measures using the machine learning approach. The main steps of the machine learning approach to extracting sentiment/subjectivity are as follows. First, a part of the complete corpus of text is specified as the training set. Each sentence in this set is manually categorized, where the categories or classifications may simply be positive or negative, or any other sentiment. We define positive, negative, and neutral categories for sentiment, and objective and subjective categories for subjectivity. After preprocessing 6 , an assortment of sentiment analysis algorithms can be trained on the training set. Our algorithm of choice is Naïve Bayes, one of the oldest and most established algorithms for text analysis. The algorithm learns the sentiment/subjectivity classification rules (or grammar) from the pre-classified data set and applies these rules out-of-sample to the whole corpus. After all the sentences in the complete corpus are classified, we construct sentiment and subjectivity measures using the initial classifications or combinations of them. The first is the polarity index, as defined above, where $pos_{i,t}$ is the count of positive sentences and $neg_{i,t}$ is the count of negative sentences in the text. The second is subjectivity, measured as the ratio between the number of subjective sentences in the text and the total number of sentences. Finally, these measures, together with other variables, are used for further analysis.

The summary statistics for the textual analysis measures are provided in Table 2. We report the correlations between measures in Table 3. Sentiment/polarity are, on average, negatively correlated with subjectivity measures. We also report correlations with outlook. The correlations suggest that textual sentiment measures will have information value beyond the outlook variable.
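The following is a minimal sketch of this supervised step, assuming scikit-learn's bag-of-words features and multinomial Naïve Bayes as a stand-in for the paper's pipeline; the hand-labelled training sentences are invented for illustration and do not come from the actual reports.

```python
# A sketch of the supervised classification step, using scikit-learn's
# bag-of-words features and multinomial Naive Bayes as a stand-in for the
# paper's pipeline; the hand-labelled training sentences are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "Fiscal consolidation has strengthened the government's balance sheet.",
    "External buffers deteriorated sharply amid capital outflows.",
    "The rating was affirmed at the scheduled review.",
]
train_labels = ["positive", "negative", "neutral"]  # manual annotations

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_sentences, train_labels)

# Apply the learned rules out-of-sample to the rest of the corpus, then
# aggregate sentence labels into a per-report polarity as defined above.
corpus = ["Buffers deteriorated further.", "The balance sheet strengthened."]
labels = model.predict(corpus)
pos = sum(l == "positive" for l in labels)
neg = sum(l == "negative" for l in labels)
print(labels, (pos - neg) / (pos + neg) if pos + neg else 0.0)
```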

Downgrades and upgrades
The dependent variables in our models are changes in ratings, so we look at these in more detail. In Fig. 1, we show the frequency of downgrades and upgrades per year since 1995. The frequency increases as we move towards the end of the period. This is because in the 1990s, fewer (predominantly advanced) countries were rated; most emerging markets received their first rating in the 2000s. In 1995, only 20 countries were rated by Moody's and 33 by Fitch. These numbers increased to 128 and 111 7 in 2017. We observe a relatively similar pattern for both agencies. A detailed comparison of the timing of rating actions by Moody's and Fitch is beyond the scope of this paper.
In Table 4, we present the frequency of total downgrades and upgrades. We make a distinction between changes in ratings by one notch (+/-1) and by more than one notch. In absolute terms, Moody's downgraded countries by more than one notch 49 times, while Fitch did so only 37 times; in relative terms, however, these numbers are comparable. For example, downgrades of more than one notch happened to Greece in 2012. Similarly, for upgrades, there were 29 upgrades of more than one notch by Moody's and 22 by Fitch. Overall, the samples are relatively balanced. We report 116 downgrades and 149 upgrades by Moody's, and 113 downgrades and 163 upgrades by Fitch.

7 Note that we do not include all these countries in our analysis due to data limitations.

The framework
We define the dependent variable as:

$$ Y_{it} = \begin{cases} 1 & \text{if country } i \text{ is downgraded (upgraded) at time } t, \\ 0 & \text{otherwise.} \end{cases} $$

The probability that the binary dependent variable $Y_{it}$ equals one given the covariates is modelled using the following specification:

$$ P(Y_{it} = 1 \mid X_{it}) = \Lambda(\alpha + \beta' X_{it}), $$

where $\alpha$ and $\beta$ are parameters to be estimated, and $X_{it}$ is a vector of country-specific time-varying and time-invariant explanatory variables described in Table 1, depending on the model specification. $\Lambda$ is the logistic distribution function, corresponding to the logit model.

Similar to Slapnik and Loncarski (2019), we explore three baseline models. The first model (Model 1) contains only the macroeconomic and fiscal strength variables defined in Table 1. This corresponds to early studies of the determinants of sovereign credit ratings (e.g. Cantor and Packer, 1996). Next, we extend Model 1 with proxies for institutional strength and political risk. This is based on previous findings arguing that the credit rating committee takes soft information into account when assigning sovereign credit ratings (e.g. Özturk, 2014; Butler and Fauver, 2006). Finally, we include proxies for cultural and economic proximity in Model 3 to control for a potential bias identified in the existing literature (Fuchs and Gehring, 2017; Luitel et al., 2016; Zheng, 2012). Next, we separately and simultaneously add the indicators for (textual) sentiment and subjectivity to each of the three models. In order to examine whether credit rating agencies employ the through-the-cycle or point-in-time approach, we extend the models by adding one-period lags and leads of the time-varying explanatory variables. Finally, since one could argue that textual sentiment is just a proxy for outlook, we include outlook in all our models as a robustness check.

Our aim is to compare the classification accuracy of all models. We focus on sensitivity, i.e. the correct classification of true positives. We also report the overall classification accuracy and the area under the ROC curve.
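To make the framework concrete, below is a minimal sketch under assumed data: the column names (gdp_growth, debt_to_gdp, sentiment, downgrade) are hypothetical placeholders for the Table 1 variables, statsmodels and scikit-learn merely stand in for whatever software the authors used, and the synthetic data exists only to make the example runnable.

```python
# A sketch of the estimation and evaluation framework under assumed,
# hypothetical column names; statsmodels and scikit-learn stand in for
# whatever software the authors actually used.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": np.repeat(["A", "B", "C"], 20),
    "gdp_growth": rng.normal(2, 1, 60),
    "debt_to_gdp": rng.normal(60, 15, 60),
    "sentiment": rng.normal(0, 0.05, 60),
    "downgrade": rng.integers(0, 2, 60),  # 1 if country i is downgraded at time t
})

# Through-the-cycle variant: add one-period lags and leads of the
# time-varying covariates within each country.
pit_cols = ["gdp_growth", "debt_to_gdp", "sentiment"]
for col in pit_cols:
    df[f"{col}_lag"] = df.groupby("country")[col].shift(1)
    df[f"{col}_lead"] = df.groupby("country")[col].shift(-1)
df = df.dropna()

ttc_cols = pit_cols + [f"{c}_{s}" for c in pit_cols for s in ("lag", "lead")]
X = sm.add_constant(df[ttc_cols])
fit = sm.Logit(df["downgrade"], X).fit(disp=0)

# Classify with a cut-off (0.5 here; the paper lowers it to 0.3 for upgrades)
# and report sensitivity, overall accuracy and the area under the ROC curve.
prob = fit.predict(X)
pred = (prob > 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(df["downgrade"], pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("accuracy:", (tp + tn) / len(df))
print("AUC:", roc_auc_score(df["downgrade"], prob))
```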

Point-in-time approach
We begin our main analysis with baseline models assessed with current values, corresponding to the point-in-time rating philosophy. Table 5 shows that Model 1 correctly classifies 21.6% of downgrades by Moody's and 24.8% of downgrades by Fitch. The overall classification accuracy of both agencies is also comparable, at 92.4% and 91.5%, respectively. Once we extend the model to Models 2 and 3, we observe an improvement in the case of Moody's, specifically a rise to 25.0% and 25.9%, respectively. However, classification accuracy for downgrades deteriorates, albeit marginally, for Fitch, as it decreases to 23.9% and 22.1%, respectively.
Moving on to upgrades, we observe that both the classification accuracy and the areas under the ROC curves are significantly worse than for downgrades. This may be due to the credit rating agencies' reluctance to upgrade sovereigns after they have failed to downgrade them during a downturn. Ferri et al. (1999) give the example of the East Asian crisis and argue that credit rating agencies are motivated to remain cautious after failing to predict or even detect crises or downturns in order to restore their credibility. Since using the cut-off of 0.5 gives negligible results (see Table A.14 in the Appendix), we lower the cut-off to 0.3. Table 6 shows that the classification accuracy of Moody's upgrades is still relatively low, ranging between 5.0% and 7.5% for the three models. Differently from downgrades, the classification accuracy of upgrades for Fitch is significantly higher than that of Moody's, increasing from 16.3% in Model 1 to 19.2% in Model 3. This discrepancy between Moody's and Fitch may be explained by Alsakka and ap Gwilym (2010), who find that Moody's, on average, upgrades sovereigns before the other agencies do. This means that Moody's potentially upgrades sovereigns when the rationale for the upgrade is not yet reflected in the macroeconomic and political environment. Fitch lags behind and upgrades issuers when such action is supported by the data, resulting in higher classification accuracy for upgrades in our models.
Next, we add the textual sentiment and subjectivity measures. The results for downgrades are reported in Table 5. They show that including textual sentiment measures, namely net sentiment and polarity, significantly improves the classification of downgrades by Moody's. Depending on the model, classification accuracy increases to between 33.6% (Model 1 with polarity from the dictionary-based approach, at the word level) and 39.7% (Model 3 with sentiment from the dictionary-based approach, at the word level). Sensitivity increases for Fitch as well, but the improvement is not as substantial, and in some cases sensitivity actually deteriorates. It increases to between 25.7% (Model 1 with polarity from the machine learning approach, at the sentence level) and 28.3% (Model 3 with sentiment from the dictionary-based approach, at the word level). On the other hand, the inclusion of subjectivity measures, on average, lowers the classification accuracy of downgrades for both Moody's and Fitch. The most notable improvement is in the case of Fitch, Model 3 with subjectivity from the machine learning approach, where sensitivity increases from 22.1% without subjectivity to 26.6% with subjectivity. Overall, the single measure with the highest performance appears to be sentiment (dictionary-based approach). We also include all six measures simultaneously in the models and observe even higher sensitivity scores, with the highest resulting from Model 2 with all measures, at 47.4% for Moody's and 29.2% for Fitch.

The observed discrepancy between the two agencies when taking textual sentiment measures into account suggests that Moody's expresses its opinion or qualitative judgement on downgrades in sovereign credit rating reports more efficiently than Fitch. One explanation could simply be the structure of the analysed samples. Moody's sample consists of approximately one third advanced economies and two thirds emerging markets, while the opposite is true for Fitch. Given that analysts are faced with less reliable information and missing data when assessing emerging markets, they may have to rely more on their qualitative judgement than for advanced economies (Cantor and Packer, 1996; Luitel et al., 2016). This, in turn, may be reflected in the sovereign credit rating reports. Since Moody's sample is more heavily weighted towards emerging markets than Fitch's, we speculate this is why discrepancies between the agencies arise.

The results for upgrades are presented in Table 6. Differently than in the case of downgrades, we detect significant improvements in the classification accuracy of upgrades when taking sentiment measures into account for both Moody's and Fitch. Including sentiment measures, in most cases, more than doubles sensitivity for Moody's and significantly increases sensitivity for Fitch in all three models. In the case of Moody's, classification accuracy rises to between 9.2% (Model 1 with polarity from the machine learning approach, at the sentence level) and 15.8% (Model 3 with polarity from the machine learning approach, at the sentence level). In the case of Fitch, sensitivity improves to between 24.1% (Models 1 and 2 with polarity from the machine learning approach, at the sentence level) and 32.6% (Model 3 with polarity from the machine learning approach, at the sentence level). As mentioned above, Moody's generally upgrades sovereigns before the motivation for such action is reflected in the data.
Therefore, the credit rating committee has to justify its decision more extensively in the reports. This is reflected in the sentiment scores and leads to larger increases in model performance compared to downgrades, and also compared to Fitch. Regarding subjectivity measures, we notice relatively minor improvements compared to sentiment measures. The increase in classification accuracy is almost negligible for Moody's but notable for Fitch. The best performer seems to be the subjectivity measure from the machine learning approach. Overall, the underlying models delivering the highest sensitivity on average are models with polarity scores from the dictionary-based approach, at the word level.

Through-the-cycle approach
Next, we continue with models assessed with lagged, current and forward values, corresponding to the through-the-cycle rating philosophy. Compared to the models following the point-in-time philosophy, we notice a substantial improvement in the classification accuracy of downgrades, especially for Models 2 and 3, where sensitivity more than doubles, as evident from Table 7. Sensitivity for Moody's rises to 51.4% and 53.9%, respectively. The analysis for Fitch yields slightly lower performance, at 44.7% and 45.6%, respectively. This is evidence supporting the notion that credit rating agencies follow the through-the-cycle rating philosophy rather than the point-in-time philosophy. The overall performance of the model and the area under the ROC curve increase as well, although not as markedly. The improvement in sensitivity is even more pronounced for upgrades (see Tables 8 and A.15 in the Appendix). This is especially true for Moody's, where sensitivity in Models 1, 2, and 3 rises to 32.7%, 36.5% and 38.7%, respectively. This increase is substantial and supports the argument that Moody's is more forward-looking than Fitch when it comes to upgrades. The performance for Fitch improves as well, as the classification accuracy for upgrades increases to 35.2% in Model 1, 37.5% in Model 2, and 46.1% in Model 3. The results are comparable to Moody's.

Similarly to the point-in-time approach, adding textual sentiment measures improves the results further (see Table 7). This is again more noticeable for sentiment measures compared to subjectivity measures, especially for Moody's. The classification accuracy for downgrades by Moody's climbs to between 47.7% (Model 1 with polarity from the dictionary-based approach, at the word level) and 63.4% (Model 3 with polarity from the dictionary-based approach, at the sentence level). As before, the improvement is less pronounced for Fitch, where sensitivity stands between 36.6% (Model 1 with polarity from the dictionary-based approach, at the word level) and 45.6% (Model 2 with sentiment from the dictionary-based approach, at the word level, and Model 3). The addition of subjectivity measures only marginally helps to increase sensitivity, and in some cases even decreases it.

Table 8 shows that sensitivity for upgrades, when adding sentiment measures, increases substantially for Fitch, but not for Moody's. The classification accuracy of upgrades for Moody's increases to between 36.5% (all Models 1 with sentiment measures) and 45.2% (Model 3 with polarity from the machine learning approach, at the sentence level). On the other hand, the sensitivity of upgrades for Fitch, when including sentiment measures, ranges between 41.2% (Model 1 with polarity from the machine learning approach, at the sentence level) and 55.5% (Model 2 with polarity from the dictionary-based approach, at the word level). This result is puzzling, as it contradicts our findings for downgrades and suggests that Fitch, on average, explains its rationale for upgrades more clearly than Moody's. Extending the models with subjectivity measures, on average, offers a more noticeable rise than in previous cases, which especially holds for Fitch. The predominantly leading models are those that include subjectivity measures from the machine learning approach, which are also the overall best performers for Moody's. In the case of Fitch, sentiment measures outperform subjectivity measures.
Overall, we find that sentiment measures perform better than subjectivity measures. Within the sentiment category, net sentiment appears to be the winner for downgrades, whereas polarity from the dictionary-based approach works best for upgrades, albeit marginally. Within the subjectivity category, subjectivity from the machine learning approach outperforms the dictionary-based approach for both downgrades and upgrades.

Robustness check
Credit rating agencies generally also publish an outlook for issuers, which can be stable, negative, or positive. Since outlook may be perceived as a substitute for textual sentiment and subjectivity, we perform a robustness check by including outlook in all models. Given that our previous findings suggest credit rating agencies follow the through-the-cycle rating philosophy, we only comment on the TTC results, while the PIT results are reported in the Appendix in Tables A.16 and A.17.
Starting with the baseline models, we find that adding outlook further improves sensitivity (see Tables 9 and 10). The increase is comparable for Moody's (53.1% for Model 1) and Fitch (47.5% for Model 1). Furthermore, when adding sentiment and subjectivity measures, the improvements observed in the models without outlook persist.
Specifically, when analysing downgrades and adding sentiment measures, sensitivity increases to between 61.5% (Model 1 with polarity from the machine learning approach, at the sentence level) and 77.1% (Model 2 with sentiment from the dictionary-based approach, at the word level) for Moody's. As before, the increase is not as notable for Fitch, ranging between 49.5% (Model 1 with polarity from the machine learning approach, at the sentence level, and with sentiment from the dictionary-based approach, at the word level) and 57.6% (Model 2 with polarity from the dictionary-based approach, at the word level). The results for subjectivity measures support our previous findings.
Finally, in the case of upgrades, the initial increase for Fitch in Model 1 is substantial but fades away as we move on to Models 2 and 3. This resolves the previously mentioned puzzle, as the discrepancy disappears once we include outlook in our models. Among sentiment measures, polarity from the machine learning approach appears to offer notable additional information not captured by Moody's outlook alone. As with previous results, subjectivity measures, on average, do not add significant value to the performance of the models.
Overall, evidence suggests that sentiment and, to some extent, subjectivity measures offer important insights into the changes of sovereign credit ratings that go beyond the simple outlook of credit rating agencies.

Conclusion
This paper offers unique insights into sovereign rating transitions. We utilize textual analysis techniques to analyse sovereign credit rating reports in order to extract sentiment and subjectivity scores. To the best of our knowledge, this has not been done on such a comprehensive scale before. We apply both dictionary-based and machine learning methods and construct six different measures. We compare them in terms of the classification accuracy of both downgrades and upgrades. We find that, on average, sentiment measures deliver higher sensitivity than subjectivity measures. In addition, downgrades appear to be more predictable than upgrades. With respect to the textual analysis approach, dictionary-based methods seem to work best for sentiment measures, while machine learning techniques lead the "horse race" when it comes to subjectivity measures. We estimate all models with the two rating philosophies in mind, namely point-in-time and through-the-cycle. Our results show that credit rating agencies follow the latter when assigning sovereign credit ratings. Finally, we perform robustness checks by adding outlook to all models, which confirms our previous results.
We acknowledge the potential drawbacks of our approach, namely the neglect of different probabilities of rating changes for particular rating classes. We believe our approach is a necessary step in order to identify the most informative measures and best models, but mainly to highlight the importance of credit rating reports in the first place. The outbreak of a global exogenous disaster, such as the COVID-19 pandemic, further emphasizes the need to better capture, incorporate and communicate qualitative aspects of the readiness and capacity of sovereigns to deal with the type of risks that are difficult to peg quantitatively but can have significant material effects on sentiment and subjectivity when it comes to credit ratings. The next step is to estimate transition matrices for those models and textual analysis measures that are the most promising.