Assessing News Credibility: Misinformation Content Indicators

The development of explainable news credibility prediction models is critical both for fighting the viral propagation of misinformation and for improving media literacy. This work investigates a variety of content indicators addressing different semantic and discourse dimensions, such as title representativeness, reasoning errors, and sentiment intensity. These indicators were inspired by a previous study conducted on English news, aimed at reaching a collective consensus on which indicators could be widely used for predicting news credibility. This new study, performed by a multidisciplinary team, relies on a corpus of 80 news articles from Portuguese mainstream and alternative news media, which were annotated by junior and senior journalists. The assessment of the corpus annotations provides insight into the prevalence of different indicators in each type of news source. The results obtained for Portuguese correlate in most cases with those reported for English, which motivates the adoption of common standards for supporting the collaborative development of interoperable automatic misinformation detection approaches.


Introduction
The negative societal impact of misinformation has attracted growing interest from researchers in both the social and computer sciences, the latter particularly interested in identifying computational features and models to automatically detect false or misleading information presented as news, commonly called fake news [23]. Despite the promising outcomes of misinformation detection (MID), particularly those derived from deep learning techniques, most MID approaches do not directly contribute to improving media literacy, since the final consumer is not able to understand how a learning classifier arrives at a particular decision on content credibility.
Rather than attempting to label or filter out specific content as misleading, we aim to develop explainable models tailored to help readers analyze, critically evaluate, and ethically share information on the web, contributing both to the development of information literacy and to fighting misinformation. To that end, it is crucial to first understand which credibility indicators can be identified in news content and which show an association with misleading information (e.g. sentiment, title representativeness, sources of information).
In this paper, we describe a thorough annotation study that allowed us to identify and assess a set of credibility indicators in Portuguese news texts, based on a corpus of 80 news articles collected from both Portuguese mainstream and alternative media. Alternative news media seek to address the gaps left by mainstream news media, offering complementary, corrective, and subjective perspectives on the topics or events reported in such news sources [10]. Since alternative media are typically not subject to any type of verification or control, they are a potential source of misinformation [17].
The corpus annotation is based on a set of content indicators introduced by Zhang et al. [25] for assessing news articles' credibility, which were extensively discussed and tested in a collection of English news articles. Those indicators were then refined, based on a preliminary annotation study we conducted, which involved a team of media experts, linguists, and data scientists.
The annotation process was performed by three trained annotators with a background in media studies, and by an experienced journalist. The results provide an understanding of which characteristics are particularly productive and contribute the most to the perception of a news story as credible or misleading. We also investigate how the content indicators explored in our study interconnect with each other, and which ones are particularly informative for assessing credibility in news text.
To assess the reliability of the annotations in the corpus, we conducted an inter-annotator agreement (IAA) study, which helped us understand the consistency of the different indicators across annotators and how they relate to the expert annotations. Moreover, the IAA study allowed us to evaluate the task complexity and hence to identify the indicators that may require specific knowledge from annotators or access to information that is not provided in the news text.
This work extends and consolidates the research previously carried out by Zhang et al. [25], namely by i) investigating new credibility indicators, ii) making adjustments to the content indicators that proved to be more difficult to recognize in the above-mentioned study, and by iii) applying the proposed methodology in a language other than English. Furthermore, our research demonstrates that the proposed indicators are a useful and powerful tool to distinguish credible from misleading news, on the one hand, and mainstream from alternative news, on the other hand.
The remainder of the paper is structured as follows: Section 2 presents related work; Section 3 describes the data collection used in this study and details the annotation process; Section 4 reports the results of the inter-annotator agreement study, provides descriptive statistics on the annotated corpus, and presents the correlation results for the set of credibility indicators considered in our research. Finally, Section 5 highlights the main findings, and Section 6 discusses the research limitations and future research directions.

Related Work
Social media and alternative information sources allow users to produce a variety of news content that is not subject to any type of verification or control, often violating journalistic codes of ethics. These new sources of information enable the real-time viralization of news stories, becoming a vehicle for the diffusion of misinformation to a global audience [23].
Most approaches for coping with misinformation employ journalists for fact-checking, aiming to assess whether claims or statements are true. However, fact-checking is hard, expensive, and not feasible at scale, given the amount of constantly changing information disseminated through media channels [3]. Moreover, readers often access publishers directly rather than being mediated by third parties (e.g. fact-checking agencies). In addition, misinformation is often more viral and generally spreads faster than credible information, namely because of its engaging narrative [16,20]. Furthermore, users are usually time-pressed news readers lacking media literacy skills, which makes them "susceptible hosts" to misinformation [18].
Finding evidence that supports the reader's decision would be beneficial both for identifying potentially biased or false information and for preventing its further spreading. To this end, explanatory detection has become a trending research topic in misinformation detection [8]. Explainable machine learning models have been adopted to offer interpretable predictions on misinformation (e.g. Yang et al. [24]), reporting why and how the systems arrived at a given prediction. In summary, such models are not trained to infer the veracity of news content, but to help readers make decisions about the content they are confronted with, by providing them with empirical evidence that may act as quality control indicators (e.g. Fuhr et al. [5]).
Within this context, Fuhr et al. [5] proposed a model that adapts the concept of nutritional information to news content. Information nutrition labels would thus be analogous to the nutrition fact labels on food packages, allowing publishers and consumers to communicate and understand the underlying factors that contribute to news credibility [25]. Among others, the label provides information on the following computable criteria: factuality, virality, opinion, controversy, authority, technicality, and topicality [5]. Aker et al. [2] specifically addressed the sentiment indicator in news articles. They created a corpus of 250 fake and non-fake news articles with article-level sentiment indicators assigned by different paid annotators. The resulting annotations suggest that fake articles are significantly more sentimental than non-fake ones.
Following this idea, Kevin et al. [13] developed a browser plugin to assist online news readers in evaluating the quality of news content, based on source popularity, article popularity, ease of reading, sentiment, objectivity, and political bias, considering both the news content and the respective social network virality metadata. However, this study lacks an evaluation of how the indicators would help users perceive misinformation. Moreover, the authors do not provide any information about the relative relevance of the analyzed indicators, so it is impossible to know which ones would be most informative to users.
Gollub, Potthast, and Stein [7] worked on presenting information quality indicators in an intuitive, unambiguous, and intelligible way. The authors designed textual metrics, organized into five categories or dimensions composing the information quality indicators. Then, a small crowdsourcing inquiry was conducted to assign interval ranges to each indicator. As an ideation study, no concrete implementation of the indicators was carried out, and, again, no analysis of their usefulness was conducted.
To demonstrate that information quality indicators can also support automated fact-checking tasks, Agez et al. [1] proposed a model to handle the check-worthiness task (i.e. predicting which statements in a debate should be fact-checked). The authors considered a set of information attributes initially proposed by Fuhr et al. [5] as input features to their models. This work, rather than applying indicators as in the original conceptualization of the nutritional labels, employs them as explainable input features of a predictive model.
Information quality indicators have also been used for inferring the credibility of news articles. In an exploratory study, Zhang et al. [25] defined and tested a set of context and content indicators, involving a team of media experts. Although the annotation study involved a relatively small collection of news articles (40), it demonstrates that the proposed indicators lead to a critical reflection on the overall credibility of the reported events.
Despite the current scientific efforts to define and implement new information quality indicators, resources specifically tailored to deal with misinformation (namely, annotated corpora) are still scarce or even nonexistent in languages such as Portuguese. Moreover, it is difficult to assess which credibility indicators reported in the literature are the most informative or efficient, since researchers typically adopt different methodological approaches and sets of indicators, hindering a thorough comparative analysis.

Methods
This section describes the data collection, the annotation guidelines, the platform adopted for conducting the corpus annotation, and the profile of the annotators.

Data Collection
The corpus created for this study is composed of 80 news articles collected from Portuguese news sites: 15 mainstream and 13 alternative news sources. The former includes authoritative news stories that follow journalistic codes of ethics, norms, and practices; the latter corresponds to news stories published in the blogosphere, where commitment to media standards is typically neither required nor observed, making them a potential source of misinformation. Table 1 presents the distribution of news articles across the sources crawled for this study. The collected news stories cover current hot topics at the national and international levels (e.g. the Coronavirus vaccine; the 2020 United States presidential election) explored and disseminated by both mainstream and alternative media sources. When collecting data, we selected pairs of news articles that cover similar topics and have approximately the same length, in order to obtain a comparable corpus. Some studies state that credible news texts are significantly longer than non-credible (or fake) news [11], but we avoided using this as a primary differentiating feature.
External information that might influence the annotators' assessments, including information about the publication source and the website aesthetics [22], was removed. Then, the articles were randomly distributed into four subsets of 20 documents each. Each annotator was assigned a subset of 20 different news articles, plus a subset of 20 news articles common to all the annotators. The latter corresponds to our golden collection and is used to assess the reliability of the annotations assigned to the entire corpus.

Table 2 presents some statistics on the corpus, derived from a set of metrics often used in computational linguistics to obtain common textual features that help characterize the (mainstream and alternative) news texts included in our corpus. In particular, we considered simple quantitative metrics related to style and text complexity, such as the title length and body text length, measured as the average number of words they comprise. We also calculated the average number of sentences per document and the average number of words per sentence.

Table 3 summarizes a set of linguistic metrics often used to measure lexical richness, specifically (i) lexical diversity (i.e. the ratio of unique words to the total number of words), for both the title and the body content; (ii) content diversity (i.e. the ratio of unique content words to the total number of content words); and (iii) lexical redundancy (i.e. the ratio of function or grammatical words to the total number of words (Zhou et al. [27])). We also include attributes specifically related to text credibility [28], namely the prominence of modifiers in text (i.e. the ratio of adjectives and adverbs to the total number of content words), lexical expressivity (i.e. the ratio of adjectives and adverbs to the content words they usually modify, respectively nouns and verbs), and pausality (i.e. the number of punctuation marks over the total number of sentences).
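As a rough illustration, the surface metrics above reduce to simple ratios over tokens and sentences. The sketch below is a simplified, PoS-free approximation in Python: the study itself relied on CitiusTagger for PoS-aware counts, so the naive tokenizer, function names, and example title here are our own assumptions, not the authors' implementation.

```python
import re

def tokenize(text):
    """Naive word tokenizer; the study used CitiusTagger for PoS-aware counts."""
    return re.findall(r"\w+", text.lower())

def lexical_diversity(text):
    """Ratio of unique words (types) to the total number of words (tokens)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def pausality(text):
    """Number of punctuation marks over the total number of sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    marks = re.findall(r"[.,;:!?()\"'-]", text)
    return len(marks) / len(sentences) if sentences else 0.0

title = "Vacina contra o coronavírus chega a Portugal"  # invented example title
print(lexical_diversity(title))   # all seven words are unique -> 1.0
```

Content diversity, redundancy, and expressivity follow the same pattern but require PoS tags to separate content words (nouns, verbs, adjectives, adverbs) from grammatical words.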

Corpus Characterization
Finally, Table 4 presents information on sentiment, which has proven to be a powerful feature for detecting misleading news [26]. In particular, we calculated the sentiment distribution (i.e. the ratio of potential sentiment words to the total number of words) and the sentiment polarity distribution (i.e. the ratio of positive and negative words to the total number of sentiment words in the text).
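Both sentiment ratios can be sketched as a simple lexicon lookup over the token stream. The toy lexicon below is invented purely for illustration (the study used the SentiLex and OpLexicon resources), and no inflection handling is attempted:

```python
# Toy polarity lexicon, invented for illustration; the study used the
# SentiLex and OpLexicon lexicons for Portuguese.
TOY_LEXICON = {"terrível": -1, "fraude": -1, "excelente": 1, "seguro": 1}

def sentiment_stats(tokens):
    """Sentiment distribution (sentiment words / all words) and polarity
    distribution (positive and negative shares of the sentiment words)."""
    hits = [TOY_LEXICON[t] for t in tokens if t in TOY_LEXICON]
    dist = len(hits) / len(tokens) if tokens else 0.0
    pos = sum(1 for h in hits if h > 0) / len(hits) if hits else 0.0
    neg = sum(1 for h in hits if h < 0) / len(hits) if hits else 0.0
    return dist, pos, neg

print(sentiment_stats(["uma", "fraude", "terrível"]))  # two of three words are negative
```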
To obtain some of these statistics, all documents were initially tagged with PoS (part-of-speech) and sentiment information, using CitiusTagger [6] for the former and SentiLex [19] and OpLexicon [21] for the latter. The results presented in Table 2 show that the mainstream and alternative news included in our corpus are similar in terms of document length and sentence complexity.
Regarding lexical diversity, alternative articles use a wider range of words in the news body (Table 3). On the other hand, mainstream texts use more grammatical words (redundancy) and punctuation marks (pausality). Concerning punctuation in particular, it is important to stress that non-declarative sentences ending with a question mark or an exclamation point are more common in alternative than in mainstream news texts. In fact, both marks are commonly identified in the literature as features for assessing news credibility [9].
Regarding PoS distribution, adjectives and adverbs are slightly more frequent in alternative news, which is also reflected in the expressivity and modifier ratios. Sentiment words are also more common in alternative news, particularly in news titles (Table 4).
Although the statistics presented above show some differences between the two types of news texts included in our corpus, no valid and supported conclusion can be drawn from these data. In fact, our sample is too small, and the differences observed are in most cases not statistically significant. This reinforces the importance of pursuing a more in-depth study based on content indicators.

Annotation Guidelines
The annotation guidelines established for this study were inspired by those defined by Zhang et al. [25], who propose a set of indicators for assessing article credibility, refined by a team of media experts. Those indicators explore both text content and external sources or metadata information. In our study, we only consider content indicators, since we are particularly interested in investigating how the information in the text can signal (lack of) credibility, and in understanding to what extent those indicators are language or topic dependent.
First, all content indicators described in the above-mentioned study were extensively discussed by our team of media experts, linguists, and data scientists, and assessed on a collection of 10 misleading news stories. This pilot annotation study allowed us to refine the guidelines by (i) removing specific categories and subcategories (e.g. scientific inference) that did not seem relevant in our collection, (ii) adding categories and subcategories that were not previously included (e.g. quotes or citations from protagonists or witnesses; some fallacies, such as personal attack), (iii) distinguishing specific categories potentially pertinent in the context of our project (e.g. the distinction between acknowledged (intentional) and undisclosed (unintentional) omission of the sources of information), and finally (iv) merging categories, such as quotes or citations from outside experts and organizations.
The final annotation guidelines describe two different but complementary annotation tasks, namely (i) the assignment of semantic categories to text segments, and (ii) the answering of a set of closed-ended questions about the overall perception of the article's credibility. Henceforth, the former are referred to as fine-grained indicators, and the latter as coarse-indicators.

Fine-grained indicators
In total, the code system includes 44 fine-grained indicators, organized into five different semantic categories, which were then classified into several subcategories, as described in Table 5.
Title. The main purpose of news titles and headlines is to draw the reader's attention to the story they briefly present. However, some titles can be misleading and unrepresentative of the body text, which may point to a lack of credibility of the news story being presented. Following [25], annotators were asked to identify cases where the title: is off-topic (i.e. it explores a topic different from the one developed in the body); has a different emphasis (i.e. emphasizes information that does not correspond to the main issue addressed in the body, or takes a different stance than the body); carries little information about the body; presents a different point of view; or overstates or understates claims presented in the body. Annotators were also asked to identify potential clickbait tactics or strategies used to tempt the reader to click on the link to the story, in particular those commonly found in viral journalism (cf. Table 5).
Information Sources. The inclusion of citations is vital to ethical reporting, supporting the statements, claims, and conclusions presented in news stories. In addition, the citation of news sources provides accurate information about what happened or what was said, contributing to the story's reliability [12,15]. Therefore, annotators were asked to identify both primary and secondary sources, which can be used to investigate a news story's credibility. The former include citations of scientific studies and reports, quotes from outside experts or organizations, and quotes from protagonists or witnesses (i.e. individuals who are at the center of the event or issue, or who have personal knowledge of something and are able to testify to it). Secondary sources, on the other hand, typically provide second-hand information, including, among others, cited news articles, reviews, and news comments. Finally, annotators were asked to identify (i) cases where the author acknowledges that the information presented requires a quote or citation of a specific source to attest to its reliability and accuracy, but intentionally omits it (declared omission of the information sources), and (ii) cases where this acknowledgement does not exist (undisclosed omission of the information sources).
Fallacies. Fallacies are reasoning errors that can be used deliberately or unintentionally by authors in argumentation to convince the readers of the validity of their arguments. In addition to the fallacies considered in the reference study (namely, appeal to fear, appeal to nature, straw man, false dilemma, and slippery slope), we have also taken into account appeal to action, appeal to authority, appeal to emotion, appeal to ignorance, personal attack, and hasty generalization fallacies, which proved to be relevant in the corpus explored in our pilot study.
Tone. Unlike the objective tone sought by most mainstream journalists, creators of alternative news typically resort to subjectivity, seeking to strike an emotional connection with readers in order to elicit an emotional reaction from them. Hence, annotators were asked to identify text segments where authors employ a strong emotional tone or subjectivity (emotionally charged tone) and where they overstate or exaggerate their claims or points of view (exaggerated claims).
Acknowledgement of Uncertainty. The author acknowledges the possibility that the information presented could be unreliable somehow or analyzed from a different perspective.

Coarse-indicators
Additionally, annotators were asked to answer a set of 10 closed-ended questions (Table 6) to assess their overall perception of a set of properties related to news text credibility, which correspond to the coarse-indicators. These include dichotomous (Yes/No) questions and five-point Likert-scale questions.
In addition to questions 1-6, adopted from Zhang et al. [25], we added questions 7 and 8 to assess the overall sentiment and sentiment intensity conveyed in the text, and questions 9 and 10 to evaluate the overall compliance with linguistic and journalistic rules and standards, respectively.

Corpus Annotation
Three paid annotators were recruited for this task. All of them graduated in either Communication Sciences or Media Studies, and two are currently attending a master's program, one in Journalism and the other in News Media Management. Furthermore, all annotators completed at least one professional internship in a leading Portuguese news media company. The average age of the annotators is 23.6, and two of them are male. In addition to the annotations provided by the recruited annotators, the corpus was also annotated by an experienced journalist. This enabled us to evaluate how the recruited annotators' judgements agreed with the expert's, and thus to assess the reliability of the annotated corpus. The expert has been a journalist for over 32 years. He was the editor of several newspaper sections (e.g. Society, Science, Politics) at a leading Portuguese daily newspaper and is currently editor-in-chief at the Portuguese News Agency (Lusa).

Table 6 Coarse-indicators and corresponding closed-ended questions. The Type column shows the answers' format and scale. (Sample row: journalistic standards: "How rigorous is the news text regarding the journalistic rules and standards?"; Type: 1-5.)
The corpus annotation was performed using MAXQDA, whose interface is illustrated in Figure 1. A training session involving all annotators was conducted by an expert, who introduced them to the software and demonstrated its main functionalities, namely those related to text codification (i.e. the assignment of one or more fine-grained indicators to a selected text segment) and to answering the list of closed-ended questions related to the news article's credibility (i.e. the overall assessment of the text based on the coarse-indicators previously described). In MAXQDA, the fine-grained indicators correspond to codes, and the coarse-indicators are handled as variables.
To perform the annotation task described above, the recruited annotators spent on average 15 minutes per news article, while the expert dedicated about one hour to the analysis of each article. This time difference reflects the expert's meticulous work: he provided justifications for almost every decision made and proposed additional features to be explored in future work.
Most annotators reported consulting the guidelines more than five times during the annotation process. It should be stressed that the guidelines include at least one illustrative example for each described code.

Corpus Analysis
This section reports the results of the inter-annotator agreement study, provides some descriptive statistics on the annotated corpus, and finally presents the correlation results for the set of credibility indicators considered in our research.

Inter-Annotator Agreement
To assess the reliability of the annotations in the corpus, we measured inter-annotator agreement (IAA) using Krippendorff's alpha, so as to compare our results with those reported in related studies using similar indicators [25]. In addition, the Krippendorff's alpha coefficient is suitable for comparing different types of data annotations (e.g. nominal, ordinal), including missing data, from any number of coders [14]. The IAA study relies on a data sample of 20 mainstream and alternative news articles, which were manually annotated by both the recruited annotators and the experienced journalist (expert). To measure the degree to which the judgments produced by the annotators agree with the expert's assessments, we aggregated the annotators' judgments as follows: (i) for ordinal data, we calculated the average across annotators, and (ii) for nominal data, we used the value chosen by the majority of annotators. Table 7 shows the agreement rates for the coarse-indicators described in Section 3.2.2, and Table 8 shows the annotators' agreement on the explicit identification in text of the fine-grained indicators described in Section 3.2.1. In both cases, we present (i) the agreement obtained among the three annotators involved in this task (IAA), and (ii) the average agreement between the recruited annotators and the expert.
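As a minimal sketch of this procedure, the snippet below implements Krippendorff's alpha for nominal data from scratch, together with the aggregation scheme just described (average for ordinal answers, majority value for nominal ones). The data values are invented for illustration, and the study may well have used existing IAA tooling rather than a hand-rolled implementation:

```python
from collections import Counter
from itertools import permutations
from statistics import mean, mode

def nominal_alpha(units):
    """Krippendorff's alpha for nominal data.
    `units` is a list of lists: the values assigned by the coders to one
    unit (coders who skipped a unit are simply left out)."""
    o = Counter()                              # coincidence matrix o[(c, k)]
    for values in units:
        m = len(values)
        if m < 2:                              # unpairable unit, skip
            continue
        for c, k in permutations(values, 2):
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()                            # marginal totals per category
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_o = sum(w for (c, k), w in o.items() if c != k)  # observed disagreement
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 if d_e == 0 else 1 - d_o / d_e  # degenerate: a single category

# Aggregating the recruited annotators before comparing with the expert:
likert_answers = [4, 5, 4]                     # ordinal -> average
yes_no_answers = ["yes", "yes", "no"]          # nominal -> majority value
print(nominal_alpha([[1, 1], [0, 0], [1, 1], [0, 0]]))  # perfect agreement -> 1.0
print(mean(likert_answers), mode(yes_no_answers))
```

The coefficient generalizes to ordinal data by replacing the nominal disagreement (0 for equal categories, 1 otherwise) with a rank-based distance metric.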
Table 7 IAA for coarse-indicators. IAA-RA refers to the agreement among the three recruited annotators, while IAA-Expert refers to the agreement between the aggregated annotations from the recruited annotators and the annotations made by the expert.

The agreement is substantial (0.6-0.8) for the majority of coarse-indicators considered in this study. In fact, when asked to make an overall judgement of the text, the annotators mostly agree in distinguishing, for example, credible from misleading news (overall credibility), accurate from inaccurate news titles (title's representativeness), convincing from unconvincing evidence (convincing evidences), and subjective from objective content (sentiment and opinions; sentiment intensity). Moreover, and probably due to their identical background in communication studies, annotators demonstrate a high degree of agreement in their judgement of compliance with journalistic standards. However, the same is not observed for their overall perception of the linguistic accuracy of news articles, where the agreement is surprisingly slight (< 0.2). This low agreement suggests that annotators have different levels of linguistic demand, and reinforces the idea that the construct's dimensions (e.g. grammar, vocabulary, and discourse) should be considered separately in future studies. When comparing the judgments of the recruited annotators to the expert's, we verify that, except for clickbait strategies and sentiment intensity, which reached moderate agreement (0.5-0.6), the agreement on the remaining coarse-indicators ranges from substantial (0.6-0.8) to almost perfect (> 0.8). Again, the highest agreement concerns the overall perception of compliance with journalistic standards. With regard to linguistic accuracy, the agreement increases significantly when considering the average rate given by all the annotators.
Overall, the IAA results regarding the identification of generic clues related to text credibility are quite promising. Nevertheless, when we asked annotators to apply the fine-grained indicators, by explicitly identifying the text excerpts to which they apply, the agreement decreases significantly (cf. Table 8). In this case, most indicators had moderate agreement (0.4-0.6), such as those involving the identification of both primary and secondary sources of information. However, annotators tend to disagree on the identification of text segments where the source was (un)intentionally omitted. Still regarding the sources of information, the IAA suggests that the annotators and the expert interpret the concept of protagonist and/or witness differently in this context.

Table 8 IAA for fine-grained indicators. IAA-RA refers to IAA among the three recruited annotators, while IAA-Expert refers to the agreement between the aggregated annotations from the recruited annotators and the annotations made by the expert. Only the fine-grained indicators recognized both by the expert and by at least one annotator are reported.
Regarding the news title, although annotators substantially agree in distinguishing representative from unrepresentative titles (cf. Table 7), they often disagree in identifying the potential issues underlying the title's representativeness (cf. Table 8). Yet, particularly regarding the emphasis indicator, a strong agreement is observed between annotators and the expert.
Concerning fallacies, the agreement achieved among annotators in our study is quite encouraging, given the task's subjectivity and complexity. For example, appeal to fear reached an agreement of 0.634, against 0.314 in the reference study (cf. Table 9). In addition, the agreement on the recognition of false dilemma clearly surpasses the low agreement reported in Zhang et al. [25]; yet this rate drops significantly if we compare the annotations assigned by the annotators to those assigned by the expert. This can be explained by the discrepancy between the annotations provided by the annotators and those provided by the expert, who assigned this fine-grained indicator to only three segments in the entire golden collection. An inverse situation is observed, for example, for acknowledgement of uncertainty, whose negative agreement between the annotators and the expert is related to the fact that the latter identified many more cases in the text. Regarding the news article's tone, we obtained fair to moderate agreement; again, the results are higher than those reported in Zhang's study, especially concerning the recognition of an emotionally charged tone (0.433 against 0.098).
Globally, the inter-annotator agreement results reported in our study and those reported by Zhang et al. [25] have a moderate positive relationship (r = 0.564, measured using the Pearson correlation coefficient). Although the data collections cannot be directly compared, the lowest agreement in both studies concerns the straw man fallacy, which reinforces the difficulty of clearly identifying this fallacy in news articles. Conversely, the highest agreement achieved relates to the identification of a particular scientific study underlying the news text.

Statistics on the Annotated Corpus
This section describes some statistics on the entire annotated corpus, considering both coarse-indicators and fine-grained indicators. Table 10 shows that annotators are capable of distinguishing mainstream from alternative news and that, based on their assessments, the former is significantly more credible than the latter. On a five-point rating scale, the overall credibility is 4.45 for mainstream news articles and only 1.65 for alternative news articles. This is also supported by the dissimilar values assigned to the recognition of supported claims and convincing evidences in text, which reached more than 4 points in mainstream news and less than 2 points in alternative news. The opposite is observed with regard to sentiment, which seems to be much more expressive in alternative news. Regarding the overall compliance with linguistic and journalistic rules, we observe that, according to the annotators, these standards are highly met in mainstream news and fairly or poorly met in alternative news. Finally, concerning specifically the title, annotators considered it highly accurate in mainstream news (average rate = 4.36) and inaccurate in alternative news (average rate = 2.18).

Table 11 Absolute and relative distribution of fine-grained indicators (main categories) in the annotated corpus, contrasting mainstream and alternative news articles.

Table 11 provides information on the overall distribution of the main categories assigned by annotators to news articles from mainstream and alternative media sources. Globally, annotators assigned 854 codes to the mainstream news articles and 1,594 codes to the alternative news articles. As expected, the indicators potentially associated with a lack of credibility or misinformation were mostly assigned to alternative news articles. For example, most fallacies considered in this study were found in alternative news, particularly slippery slope, appeal to fear, and hasty generalization. Similarly, an emotionally charged tone was chiefly recognized in alternative news, where annotators also recognized an expressive number of cases involving overstatement and exaggerated claims (tone). The representativeness of such fine-grained indicators in mainstream news is almost insignificant.

Fine-grained indicators statistics
Concerning the mention in text of information sources, the opposite is observed. Primary and secondary news sources are particularly productive in mainstream news, reinforcing news credibility, while the omission of the source of information is more significant in alternative news. Nevertheless, annotators identified a considerable number of cases in mainstream news where a required mention of a specific source to support the information provided is also omitted. Table 12 describes how the variables of this study correlate. For the ordinal indicators, we estimated the correlation using Spearman's rank correlation coefficient (ρ). For computing the correlation between each ordinal indicator and the news source type (ST), since the source type can be seen as a dichotomous variable, we employed the point biserial correlation coefficient (r_pb), assuming two possible values for ST, namely mainstream (ST = 1) and alternative (ST = 0).
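As an illustration of the two coefficients used here, the following sketch computes Spearman's ρ (ordinal vs. ordinal) and the point biserial r_pb (dichotomous vs. ordinal) over a toy set of five hypothetical article ratings; the indicator names and values are invented for the example:

```python
import numpy as np
from scipy import stats

# Hypothetical toy ratings: five articles scored on two ordinal indicators,
# plus the dichotomous source type (mainstream = 1, alternative = 0).
overall_cred = np.array([5, 4, 4, 2, 1])   # overall credibility (1-5)
supported = np.array([5, 5, 4, 2, 1])      # supported claims (1-5)
source_type = np.array([1, 1, 1, 0, 0])    # ST

# Ordinal indicator vs. ordinal indicator
rho, p_rho = stats.spearmanr(overall_cred, supported)
# Dichotomous source type vs. ordinal indicator
r_pb, p_pb = stats.pointbiserialr(source_type, overall_cred)

print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
print(f"point biserial r_pb = {r_pb:.3f} (p = {p_pb:.3f})")
```

Note that the point biserial coefficient is numerically equivalent to the Pearson correlation computed with the dichotomous variable coded as 0/1.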

Relationship between credibility indicators
The first conclusion to be drawn from our data is that mainstream news articles seem more credible, given the strong correlation between source type and overall credibility (r_pb = 0.805). As observed, the news article's overall credibility is strongly dependent on the presentation of supported claims (ρ = 0.883) and convincing evidence (ρ = 0.874), which seem critical in mainstream news. Moreover, these aspects are positively related to compliance with journalistic standards.
Conversely, a strong negative relationship is observed between journalistic standards and sentiment intensity (ρ = −0.816), meaning that the more subjective and emotional the news text is, the less credible it seems to be. Regarding linguistic accuracy, the correlation results suggest that alternative news tend to be more inaccurate from a linguistic point of view.
Regarding the title, the correlation results show that representative titles are more common in mainstream news (r_pb = 0.688), and that clickbaitiness is a strategy mostly used in news that does not meet journalistic standards (ρ = −0.584).

Table 13 presents the results on the potential relationship between article credibility and the information sources mentioned in text. Although the correlations observed in our corpus for source mentions are generally weak, two different trends can be noted: undisclosed omission of the source is negatively associated with overall news credibility (ρ = −0.312), while the explicit quotation or citation of the source of information (particularly primary sources, such as quotations from an expert or organization) is positively associated with the article's overall credibility (ρ = 0.607).

Table 14 Spearman correlation between the perception of the news article's overall credibility (OC) and the most frequent fallacies in the corpus: appeal to fear (AF), personal attack (PA), straw man (SM), slippery slope (SS) and hasty generalization (HG). *p < 0.001, **p < 0.01.

Table 14 presents the correlation between the news article's overall credibility and the most frequent fallacies in our annotated corpus. A negative relationship can be observed between all fallacies and the news article's credibility. In our corpus, the highest correlation coefficient applies to slippery slope, which shows a moderate to high negative correlation with the article's overall credibility (ρ = −0.519).

Relevance of Credibility Indicators
We evaluated whether the coarse-indicators and the main categories associated with the fine-grained indicators could be used to predict the article's credibility, through a backward stepwise linear regression, using the overall credibility as the dependent variable and the remaining indicators as independent variables. We carried out this analysis over the golden collection, by incrementally removing the variables whose elimination caused the least statistically significant deterioration of the model. Regarding the coarse-indicators, 4 variables remained: title representativeness, clickbait title, sentiment intensity, and linguistic accuracy. Regarding the fine-grained indicators, the resulting variables are fallacies and primary sources. After model convergence, we obtained, for coarse-indicators and fine-grained indicators respectively: R² = 0.857 and R² = 0.504; adjusted R² = 0.853 and adjusted R² = 0.491; and F-statistic = 230.0 and F-statistic = 39.05 (p < 0.001).
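A minimal sketch of the backward elimination loop described above, run on synthetic data. The variable names (title_repr, sentiment, noise) are hypothetical, and the elimination criterion here is the coefficient p-value, a common choice that approximates (but may not exactly match) the criterion used in the study:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return two-sided coefficient p-values
    (intercept excluded)."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    dof = n - Xd.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
    t = beta / se
    return 2 * stats.t.sf(np.abs(t), dof)[1:]    # drop the intercept's p-value

def backward_stepwise(X, names, y, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value above alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_pvalues(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break                                # all remaining predictors significant
        keep.pop(worst)
    return [names[i] for i in keep]

# Toy demo: 80 "articles" with three made-up indicators; only the first two
# actually drive the outcome, so the third is a candidate for elimination.
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=80)
names = ["title_repr", "sentiment", "noise"]
selected = backward_stepwise(X, names, y)
print(selected)
```

In the study itself the dependent variable would be the overall credibility rating and the predictors the remaining coarse or fine-grained indicators.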

Discussion and Main Conclusions
This annotation study was based on a focused set of already tested credibility indicators [25], which allowed assessing their pertinence and adequacy in a different collection of news articles from a different language, and refining some categories based on previous experiments. Having similar guidelines and assessing instruments made it possible to compare the results achieved in both studies, and to define the next research steps regarding the detection of misinformation in news articles in a solid manner.
Overall, our research shows that the guidelines adopted from Zhang et al. [25] apply to other news topics and languages, specifically Portuguese. It also demonstrates that both coarse- and fine-grained indicators allow distinguishing mainstream from alternative news. In fact, we found several indicators that are closely related (in terms of agreement) to the ones reported in the above-mentioned study, which reinforces the reliability of the annotated corpus and provides thorough information on the task complexity. Moreover, the results suggest that the alternative news included in our corpus are untrustworthy, presenting a set of characteristics that are not usually found in news collected from mainstream media. The corpus annotation made it possible to determine which indicators are more frequent in mainstream and alternative news, and to understand which ones are easier to recognize, based on the inter-annotator agreement study we performed. Regarding frequency, the mention of the sources of information in news articles is mostly found in mainstream news, while alternative news sources typically omit them. In addition, the latter often make use of an emotionally charged tone, overstate the conclusions and claims presented in text, and frequently present different sorts of fallacies.
IAA results show that annotators are capable of globally distinguishing credible from misleading news. However, when asked to identify specific credibility indicators in text, the task becomes harder and more complex. On a five-point Likert scale (where 1 represents "not difficult" and 5 "extremely difficult"), most annotators selected 4 to characterize the task complexity, in an anonymous online survey they responded to after completing the annotation process. Additionally, they mentioned that the fine-grained indicators most difficult to recognize in news articles were fallacies, in particular appeal to nature, slippery slope and straw man. This is corroborated by the low agreement achieved regarding the identification of some fallacies, with particular emphasis on straw man.
When taking into account the annotations provided by our expert, we also found specific cases that may be out of reach for non-expert annotators, or that require access to information that goes beyond text, such as the acknowledgement of uncertainty, some fallacies (e.g. straw man and false dilemma), and specific sources of information (e.g. protagonists and witnesses).
However, as mentioned before, the annotation results also show that some of these fine-grained indicators, namely the ones related to reasoning errors, are prevalent in misleading news articles and could be used as features to assess text credibility. Hence, further investment in a deeper analysis and formalization of such discourse strategies is needed.
According to the annotators' feedback, the easiest fine-grained indicators to recognize in news texts are the sources of information, particularly primary sources. In fact, annotators are globally capable of identifying different types of mentions of explicit sources of information in text. However, in cases where a source of information is required in text but was omitted by the author, annotators often disagree. Disagreement among annotators, on the one hand, and between annotators and the expert, on the other, reinforces the importance of discussing and clarifying such indicators in future research, particularly undisclosed omission, given its representativeness in the corpus.
IAA results also show that annotators are highly sensitive to journalistic standards, which is certainly linked to their specific background. In further experiments, it would be important to extend the annotation study to common news readers, to ascertain if this feature is also perceptible by non-experts.
In contrast, the annotators' judgments on the linguistic accuracy of text are extremely divergent, despite the linguistic differences underlying mainstream and alternative news articles. This reinforces the importance of refining the linguistic assessment considering more specific dimensions (e.g. grammar, vocabulary, punctuation), namely because it appears to be a relevant indicator for misinformation detection, together with title representativeness, title clickbaitiness and sentiment intensity.
In line with the research results reported in the literature [2], our study shows that sentiment plays an important role in the identification of credible news. In fact, the degree of expressiveness, sentiment and subjectivity in text (supported, among other evidence, by the usage of expressive punctuation, sentiment polarity and intensity) could be explored in future research to help distinguish not only credible from misleading news, but also to identify news styles, namely the ones that make use of sensationalism in their reporting.

Limitations and Future Work
Our research shows that the content indicators developed by Zhang et al. [25] are applicable to other languages, in particular Portuguese, addressing a potential limitation reported in their study. Nevertheless, it should be stressed that the comparison with the above-mentioned study is only indicative and approximate, namely because the data collections are different and the content indicators are not completely coincident.
Furthermore, the generalizability of the results is limited by the corpus size and the news sources considered. Since we were interested in identifying clues associated with misinformation, we intentionally selected mainstream news sites that typically comply with journalistic standards, and alternative news sites that do the opposite. Hence, although the alternative news in our corpus seem untrustworthy, this conclusion cannot be generalized to all alternative news media. In fact, there is a variety of news and media outlets that are not covered in our study.
Further research involves the enlargement of the annotated corpus, focusing on the indicators that proved most relevant in the present research. Additional news sources will be automatically crawled from the Portuguese Web Archive, and the annotation will involve the participation of common news readers, possibly adopting strategies like the ones reported in [4].
Our ultimate goal is to use the enlarged corpus to train an explainable model for news credibility prediction and later embed it in a new tool to be used for improving media literacy and assisting news consumers in coping with disinformation.