Critical discourse analysis guided topic modeling: the case of Al-Jazeera Arabic

ABSTRACT This paper analyzes all the articles and corresponding comments of Al-Jazeera Arabic's coverage of the Syrian war from 2011 to 2017. I propose a multilayered Critical Discourse Analysis Guided Topic Modeling method that includes context of social structures and processes in the analysis of topics. This article shows that the employment of topic modeling without Critical Discourse Analysis does not unravel the power relations embedded within the platform. Two different applications of this method are used to demonstrate how a guided topic modeling method can lead to more nuanced results that can unravel important social dynamics which would have remained unperceived if applying traditional computational methods.


Introduction
Since the beginning of the 'Arab Spring' in late 2010, Al-Jazeera has played a significant role in covering, reporting, calling for regime changes, and encouraging people to take to the streets in countries where uprisings have taken place, including Tunisia, Egypt, Libya, and Syria (Khondker, 2011;Sultan, 2013). 1 Most studies analyze Al-Jazeera's role as a satellite television channel that acted as a catalyst in producing a virtual public political sphere, alongside Social Network Sites (SNSs) such as Facebook and Twitter (Aouragh & Alexander, 2011;Cherribi, 2017;Khondker, 2011;Lynch, 2015). Even when studies mention the Al-Jazeera websites, both 'aljazeera.com' and 'Aljazeera.net,' they examine them as streaming platforms for the station (Aouragh & Alexander, 2011;Cherribi, 2017).
This article contributes to the growing literature that combines Topic Modeling (TM) and Critical Discourse Analysis (CDA) methods (Brinkmann, 2019; Jacobs & Tschötschel, 2019; Törnberg & Törnberg, 2016a, 2016b). I propose and showcase a multilayered TM method guided by a CDA framework to examine the Al-Jazeera Arabic environment and to study a manifestation of sect habitus online on this platform. 2 I define sect habitus as the socially constructed dispositions of sect identities that are practiced in a specific community as well as the normative ways of feeling and expressing these identities at the individual as well as the group/sect levels online and offline (Rouhana, 2021). Because of the platform's textual nature, the analysis of sect-habitus on Aljazeera.net is based on the expression and manifestation of sect-based discourse through text. Since Aljazeera.net as a platform sets the rules and regularities (i.e., the medium of text) of how sect-identities are practiced and expressed by both its authors and the readers/commenters, the sect-based dispositions and practices that are embodied in one's sect-habitus are partially expressed through the textual nature of the conversation that takes place on this platform.
I define sect-based discourse loosely to be able to capture the range of violent, discriminatory, peaceful, implicit, and explicit discourse. Sect-based discourse is any discourse that uses categories that differentiate between groups of people based on their religious identities. This discourse could be about intragroup or intergroup categories of people, for example, difference between Muslim and Christian counts as intergroup sect-based discourse; versus, Sunni and Shi'a would be intragroup sect-based discourse.
I examine whether Al-Jazeera's materials have any impact on Syrians during the uprisings. There are two types of content to analyze: (1) the news articles (including written reports, opinion pieces, blog posts, and live news header coverage) that Al-Jazeera and its editorial team have published since 2011 about the Syrian uprising-turned-war; and (2) the comments that readers post on these articles. I propose, develop, and use an extended exploratory method that combines CDA and TM to analyze Al-Jazeera Arabic's coverage of the Syrian war in relation to its readers' comments on these articles. I show how this proposed method captures changes in the language used in Al-Jazeera articles over time (see section Method 1: Applying TM on the full dataset), which opens new avenues for questions that neither TM nor CDA on its own can answer. Then I show that the language changes in the articles do not map onto the users' comments over time. First, I use Topic Modeling (TM), an unsupervised Machine Learning (ML) technique used to analyze large datasets of unstructured documents, as the initial step in examining the data. Next, I show the limitations of prior applications of TM methods and propose to use TM in a Critical Discourse Analysis (CDA) framework, which I call CDA-guided-TM. This method allows answering many questions that TM or CDA alone cannot. My findings show that Al-Jazeera used sect-based language in its content, and that articles reporting violence receive the most sect-based comments. Finally, I use the proposed method in two different ways to analyze how a sect habitus was expressed in the users' comments in relation to the published articles. The two CDA-guided-TM examples presented in this paper lead to two separate but complementary results, which illustrates the level of nuance that this approach can capture.

Data collection
Using Python, I developed a scraper that extracted data starting with the landing page of Aljazeera.net and that scraped through all the articles that Aljazeera.net had from 2010 to 2017. As the scraper collected articles, I analyzed the articles with a word clustering technique and weighed each article's relevance to the Syrian case study using a list of keywords such as Syria, Damascus, Aleppo, etc. (See link for complete list). Then, if an article included more than three top-ten weighted words that matched the reference list, I collected the article's text and saved it in an unstructured database using MongoDB and PyMongo, a Python library to communicate with the database. Additionally, I collected all user comments on each of these chosen articles, as well as all the replies to the comments.
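The relevance check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scraper used in the study: raw term frequency stands in for the word weighting (the word clustering technique is not specified here), and the short keyword set and the helper names `top_weighted_words` and `is_relevant` are hypothetical stand-ins for the full reference list and the actual code.

```python
from collections import Counter

# Hypothetical excerpt of the reference keyword list (the full list is linked in the text).
REFERENCE_KEYWORDS = {"syria", "damascus", "aleppo", "homs", "idlib"}


def top_weighted_words(text, n=10):
    """Return the n most frequent tokens; raw frequency stands in for the weighting."""
    tokens = text.lower().split()
    return [word for word, _ in Counter(tokens).most_common(n)]


def is_relevant(text, keywords=REFERENCE_KEYWORDS, min_matches=4):
    """Keep an article if more than three of its top-ten words are reference keywords."""
    matches = sum(1 for word in top_weighted_words(text) if word in keywords)
    return matches >= min_matches
```

In such a sketch, only the articles for which `is_relevant` returns `True` would be stored, together with their comments and replies.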
Out of the 926,936 articles inspected, 44,707 matched the reference list with relevance to the case of Syria, including articles about Iran, Russia, Lebanon, Turkey, UN, USA, Saudi, and many others. These articles included 218,394 user comments and 26,273 replies to comments (users commenting on each other's comments). Out of the total number of articles, 23,457 explicitly concern Syria and have 125,501 corresponding user comments. I was careful not to overload the Al-Jazeera servers or affect their website in any way, which is why the initial scraping took more than eight months to complete. After this initial round of collection, subsequent scraping was quicker because it only checked for new articles and selected those that were relevant to the Syrian case study. This scraping process ended when Al-Jazeera discontinued their comments section and moved their commentary features to Facebook in August 2017 (Al Jazeera English, 2017; Fletcher, 2017).
The goal of such a broad collection is to be able to analyze the regional and international conditions that potentially influenced the war in Syria and users' comments. For example, another potential study could be how Al-Jazeera's coverage of Syria was impacted by the Iran and P5 + 1 nuclear agreement in 2015, as well as its impact on the users' reactions to that agreement in the context of the ongoing war in Syria, if any. However, in this article, I only analyze the articles that explicitly include Syria. Table 1 shows the total number of these articles and their respective comments.

Topic modeling
Why topic modeling?
I explore whether there is any correlation between the articles published by Al-Jazeera and the corresponding users' comments on those articles. I use Topic Modeling (TM), an unsupervised machine learning technique that is used to identify topics in large volumes of unstructured texts. 3 These techniques use inductive models to extract topics from a collection of distinct texts and subsequently group these collections of texts by topic, itself defined by a list of terms from the texts. Some of these models, such as Latent Dirichlet Allocation (LDA), 4 allow overlapping topics and overlapping terms, which opens up the possibility that a document contains more than one topic (Silge & Robinson, 2017). Using TM allows me to identify what topics are discussed in each of the articles and in the user comments to these articles.

Latent Dirichlet allocation (LDA)
LDA is a topic modeling technique that is used in computational content analysis to extract unknown thematic structures in a cluster of large groups of text documents. LDA is mainly used for exploratory and descriptive analyses (Elgesem et al., 2015; Koltsova & Shcherbak, 2015; Valdez et al., 2018). Developed by Blei, Ng, and Jordan (2003), LDA uses a Bayesian statistical model to generate the latent topics on which the texts in question are based. LDA starts with the assumption that the group of documents to be analyzed form a corpus. This corpus is supposed to represent a number of topics predetermined by the researcher. I discuss my choice of the number of topics in a later section. Each document of the corpus is based on a specific probabilistic distribution of words; this distribution renders the document relevant for one or many topics. By running through the corpus while assuming that the topics are infinitely exchangeable in a document, a probability of a sequence of words and a topic is then calculated. Based on this probability, a document is clustered under one or many topics alongside other documents that probably cover the same topic. These topics are represented with a sequence of words. That sequence becomes the basis to infer what this topic is concerned with. For example, the set of words extracted from Topic 10 generated from all Al-Jazeera articles from the year 2017, translated here: City, Raqqa, Syria, Country, Group, Forces, Army, Democratic, Kurdish, Alliance, 5 leads me to infer that the documents pertaining to this topic discuss the US-led international alliance's support for the Syrian Democratic Forces (Kurdish forces in Northern Syria) in their successful attempt to take over the city of Raqqa and defeat ISIS there. By randomly extracting and manually reading a sample of articles pertaining to this cluster, I was able to confirm that these articles revolve around the topic inferred above.
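For reference, the probabilistic model behind this description can be written compactly, following the basic formulation in Blei et al. (2003). The notation is theirs and is not used elsewhere in this article: for a single document with N words, θ denotes the document's topic proportions with Dirichlet parameter α, z_n is the topic assigned to the n-th word w_n, and β holds the per-topic word probabilities.

```latex
\begin{aligned}
\theta &\sim \mathrm{Dirichlet}(\alpha), \\
z_n \mid \theta &\sim \mathrm{Multinomial}(\theta), \qquad n = 1, \dots, N, \\
w_n \mid z_n, \beta &\sim \mathrm{Multinomial}(\beta_{z_n}), \\
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta).
\end{aligned}
```

Because θ is a distribution rather than a single label, a document can carry non-negligible weight on several topics at once, which is the overlap property mentioned above.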
Next, I will showcase how TM is usually applied by researchers and how I propose to use it differently.

Preprocessing text
Most text-based ML techniques and Natural Language Processing algorithms employ preprocessing techniques to efficiently analyze and represent the data in question. This process includes removing stop words (such as that, to, on, which, what, etc.), which constitute common words that give little to no value to the text, especially when attempting to look for patterns in order to match a set of texts (Manning et al., 2008). These lists have been built into many programming language libraries but there is not a consensus around a standardized list, not even for the English language. 6 For Arabic, there are even fewer lists of stop words developed for Modern Standard Arabic (MSA). I employ them as part of a customized stop words list I developed for this research (see link for full list).
Moreover, Arabic attaches the definite article 'al' (ال), equivalent to 'the,' to the beginning of words. The software treats a word with the definite article as different from the same word without it, which would lead to skewed results with TM. For example, the word 'سوري,' which translates to 'Syrian,' becomes 'السوري' when used in a sentence with 'the Syrian.' There are some exceptions, such as the word for 'God,' 'Allah,' in which 'al' is part of the base word. 7 So, before removing the stop words, it is important to remove the 'al' articles while paying attention to exceptions to this rule. Returning to the previous example, if we remove the 'al' article from the word 'Allah,' the word changes meaning and becomes 'lah,' which translates to 'his' and is a stop word that gets removed before starting the TM analysis. As we will see later, the word 'Allah' is the most frequently used word in the comment dataset. Moreover, Classical Arabic (CA) and MSA are written with diacritics (ḥarakāt). For example, the word for supplementary diacritics, تشكيل, can be written with diacritics as تَشْكِيل; this does not change the meaning of the word, but the diacritics represent missing vowels and consonant length. Dialectal Arabic (DA) usually does not abide by the linguistic rules used in CA and MSA, but in online texts (depending on the platform, tools, and sometimes the user), all three, DA, CA, and MSA, sometimes include ḥarakāt and sometimes do not. The problem is that the software reads the same word with ḥarakāt differently than without ḥarakāt regardless of the meaning, which leads to skewed percentages, as the variants are divided into different topics rather than understood to carry the same meaning. So, an additional layer of 'cleaning' was applied in order to remove all the diacritics (ḥarakāt) that render the same word differently.
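The cleaning steps described in this section can be illustrated with a short sketch. It is a simplified approximation and not the preprocessing code used for this study: the exception set and stop word set are tiny illustrative samples, the length guard is an assumption added to avoid mangling very short words, and the function names `normalize_token` and `preprocess` are hypothetical.

```python
import re

# Arabic diacritics from fathatan (U+064B) through sukun (U+0652).
HARAKAT = re.compile("[\u064B-\u0652]")

# Words whose leading "ال" belongs to the base word and must be kept (illustrative).
AL_EXCEPTIONS = {"الله"}

# Tiny illustrative stop word set; the study relies on a much longer custom list.
STOP_WORDS = {"في", "من", "على", "له"}


def normalize_token(token):
    """Strip diacritics, then drop the definite article unless the word is an exception."""
    token = HARAKAT.sub("", token)
    if token in AL_EXCEPTIONS:
        return token
    # Assumption: only strip the article from words longer than three letters.
    if token.startswith("ال") and len(token) > 3:
        return token[2:]
    return token


def preprocess(text):
    """Normalize every token first, then remove stop words."""
    tokens = [normalize_token(t) for t in text.split()]
    return [t for t in tokens if t not in STOP_WORDS]
```

Diacritics are removed before the exception check so that a vocalized spelling of an exception word is still recognized, and the article is removed before stop word filtering, in the order described above.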

Extracting topics
One of the main issues of Topic Modeling is the question of the predetermined number of topics that researchers are required to manually set prior to running the model. There are many suggestions and methods developed to help researchers make an educated guess to choose the number of topics, including calculating the model coherence. In my experience, the best way is by trial and error: even when the outcome seems good enough, it is important to proceed with multiple trials. Roberts et al. (2018) propose between 5 and 50 topics for smaller datasets and between 60 and 100 for larger datasets. In this study, with considerably large datasets, I found that having 60 or more topics is not useful, as the topics became difficult to differentiate based on the terms associated with them. After multiple trials, I decided to use twenty topics, which was also confirmed by the model coherence value. 8 In Figures 1 and 2, I present the topics extracted from the articles dataset and from the comments dataset, respectively, sorted from the most to the least probable and represented by the top 10 most probable terms in each topic. Figure 1 represents the topics extracted from the articles, sorted by the proportion of the topic in relation to the full corpus. For example, Topic 5 in Figure 2 is the most probable topic in the dataset. As mentioned above, LDA allows overlaps between topics, so this visualization does not show mutually exclusive topics. This means that documents included in Topic 17 might also be included in other topics (for an interactive visualization, follow this link to explore the topics with the most probable terms and the topic overlaps). At this point in the TM process, researchers typically deduce the topics based on the most probable sequence of words that the model might include. Researchers set the number of terms; in this case, ten terms were enough to deduce the topics.
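The ordering of topics by corpus proportion and the selection of the most probable terms can be sketched as follows. This is a minimal illustration with plain Python lists and not the code behind the figures: `doc_topic` (one row of topic proportions per document) and `topic_word` (one topic's word probabilities, aligned with `vocabulary`) are assumed outputs of a fitted topic model, and all names are hypothetical.

```python
def topic_proportions(doc_topic):
    """Average each topic's proportion over all documents in the corpus."""
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    return [sum(doc[k] for doc in doc_topic) / n_docs for k in range(n_topics)]


def rank_topics(doc_topic):
    """Return topic indices sorted from the most to the least prevalent."""
    props = topic_proportions(doc_topic)
    return sorted(range(len(props)), key=lambda k: props[k], reverse=True)


def top_terms(topic_word, vocabulary, n=10):
    """Return the n most probable terms of one topic."""
    ranked = sorted(zip(topic_word, vocabulary), reverse=True)
    return [term for _, term in ranked[:n]]
```

Topic indices in this sketch are zero-based, whereas the topics in the figures and tables are numbered from 1.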
For example, Topic 7 in Figure 1, which is the most probable topic out of all 20 topics, is about active military operations in Aleppo and Damascus' suburbs. When visualizing the model using the LDAvis library (the interactive model accessible via this link), it is noticeable that the topic also includes the terms Idlib and Homs and most probably other regions that witnessed active military operations.

Deducing topics
The topics I deduced from the articles are represented in Table 2 and those deduced from the comments in Table 3.
Topic 1 is potentially clustering articles that focus on the use of chemical weapons and the UN security council's resolution on Syria's chemical weapons. Topic 3 is potentially clustering articles about ISIS, Nusra Front, the Kurdish forces, and the US-led global alliance against ISIS.
I have not found any attempts to use topic modeling to correlate two separate corpora. Technically, the corpora I am using here are not totally separate and I already know that they revolve around the Syrian situation. Previous studies that use sub-corpora aim at either speeding the processing time of large datasets (Sbalchiero & Eder, 2020) or at exploring the model itself (Murakami et al., 2017).
As for the topics extracted from the comments dataset and listed in Figure 2, Topic 5 is the most probable topic. It includes a combination of the words god, great, victory, sham, Islam, Muslims and potentially reflects comments that include religious idioms, including 'Allah is Greater,' that discuss the ongoing war events. But there might be a misdeduction of the topic: the Lebanese political party Hezbollah is one of the warring factions that supported the Syrian government, and its leader's name is Hassan Nasrallah, written in Arabic as حسن نصر الله, which can be interpreted as three words instead of two, in which case the name translates to Hassan Victory God. In order to investigate the outcome, I randomly selected ten articles that are included in Topic 5 and read them to assess the potential impact of such a coincidence on the topics. I found no mention of Hezbollah's leader and found the use of Quranic verses in them. But, in order to be sure, I also searched for the frequency of the sequence of words Hassan Nasr Allah and found that out of 32,222 comments forming Topic 5, only 33 included Hassan Nasrallah, which led me to conclude that this did not have an impact on the topic deduction. Topic 7 represents the comments that include the opposition, the Free Syrian Army, and the Syrian regime, which Al-Jazeera also called the Assad regime.
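A frequency check of this kind can be sketched in a few lines. The pattern below is an assumption about how the name may be spelled in the comments (with or without a space before the last word), and the helper names are hypothetical; the sketch only shows the kind of count and ratio involved, not the exact query used.

```python
import re

# The name written with or without a space between the last two words.
NAME = re.compile(r"حسن\s+نصر\s*الله")


def count_matching(comments, pattern=NAME):
    """Count the comments in which the pattern occurs at least once."""
    return sum(1 for text in comments if pattern.search(text))


def share(matches, total):
    """Fraction of comments that contain the pattern."""
    return matches / total
```

With the figures reported above, `share(33, 32222)` is roughly 0.001, that is, about 0.1% of the comments in Topic 5.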
Most research usually stops at this point to assess the quality of the topics deduced and draw conclusions about the corpus. In my case, the topic model implemented above does not substantially inform my questions about sect-based discourse and the differences, if any, between the online versus offline sect habitus. Structural Topic Modeling (STM), which I employ in this study, usually adds more nuance to the analysis by including covariates such as changes of topics over time (Lindstedt, 2019). Implementing STM and testing the topic prevalence by published date gives us a better understanding of the topic proportion by time. 9

Table 2. The top 20 topics deduced from the Al-Jazeera articles.

Topics      Deduced topic
Topic 1     Chemical weapons attacks and UN security council reactions
Topic 2     The opposition and government negotiations Geneva and Astana
Topic 3     ISIS, Nusra Front
Topic 4     Kidnapping the two Christian Bishops and Syrian security forces
Topic 5     Al Jazeera on the ground coverage
Topic 6     Iran nuclear deal
Topic 7     Opposition and army active military operation in Aleppo
Topic 8     Hezbollah, Israel and Syria
Topic 9     Syrian revolution and regime reactions
Topic 10    Sanctions and economic crisis
Topic 11    Muslim Brotherhood
Topic 12    Turkish and Kurdish armed struggle in northern Syria
Topic 13    Damascus and the Syrian army military operations
Topic 14    The revolution and political developments
Topic 15    Refugee crisis
Topic 16    Russian military operations
Topic 17    Regional Arab players (Jordan and Gulf states)
Topic 18    United States of America's role (including Obama and Trump)
Topic 19    Humanitarian aid, UN, and Madaya, Kefraya, and Faoua
Topic 20    The opposition abroad and Syrian government interactions

I will detail some of the limitations of this method in the next section, as even STM falls short in my case, before I propose a Critical Discourse Analysis approach to develop STM further.
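As a rough illustration of what topic proportion by time means, the sketch below averages one topic's proportion per publication period. It is a post-hoc approximation and not STM itself, which estimates prevalence through covariates inside the model; the input format (pairs of a period label such as 'YYYY-MM' and a list of topic proportions) and the function name are assumptions made for this example.

```python
from collections import defaultdict


def prevalence_by_period(documents, topic):
    """Average one topic's proportion per period.

    documents: iterable of (period, topic_proportions) pairs.
    topic: zero-based index of the topic of interest.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for period, proportions in documents:
        totals[period] += proportions[topic]
        counts[period] += 1
    # Sort by period label so the result reads chronologically.
    return {period: totals[period] / counts[period] for period in sorted(totals)}
```

Plotting the returned values against their period labels gives a simple prevalence curve for the chosen topic.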

Limitations of topic modeling
As I show above, the TM approach clustered documents relatively well but, when applied to the full datasets, the topics are general for the most part and do not reflect the critical developments that took place between 2011 and 2017, such as the shift from revolution to war, the use of chemical weapons on civilians, the rise and fall of ISIS, etc. In order to extract more nuanced topics, I split the articles and their comments by year into separate sub-corpora. Each sub-corpus returned a list of 20 topics from the articles and comments datasets (see link for the 2013 topics). The topics extracted are more granular, but not to the extent of allowing consequential deductions that answer questions such as: Did Al-Jazeera employ a sect-based discourse at all times? Did its commenters mirror the discourses of editorial policies? What triggers sect-based discourses? Next, I propose a Critical Discourse Analysis-guided Topic Modeling.

Critical discourse analysis
Why critical discourse analysis?
Critical Discourse Analysis (CDA) employs a critical linguistic approach to study text. This approach regards 'language as social practice,' (Fairclough & Wodak, 1997) where the context of language use is important to the analysis (Wodak & Meyer, 2001). CDA is interested in 'the semiotic dimensions of power, injustice, abuse, and political-economic or cultural change in society' regardless of the theories or methods employed to achieve it. I use CDA because of its unique fit for my case study. CDA focuses on struggles, conflicts, discrimination, and ideology. My case is an example of one of the most violent struggles of the century to be broadcast in the media, and for the most part live. The warring factions in Syria used differences between sect identities to justify violence, recruit fighters, and deepen the divisions and fears of the 'other' on all sides of the conflict. CDA focuses on 'institutional, political, gender, and media discourses (in the broader sense) which testify to more or less overt relations of struggle and conflict' (Wodak & Meyer, 2016). Because Al-Jazeera had played a major role in the Arab Spring in general, and Syria in particular, I chose to use the Al-Jazeera Arabic website, where discourse about Syria took place and the relation between language and power can be studied using the articles published on Al-Jazeera and the respective comments on these articles. CDA also fits with my approach of focusing on ordinary people's meaning making of the unfolding events, because they are the ones paying the highest price in the ongoing war in Syria.
Additionally, CDA provides a framework to study prevailing social problems and carves a space for 'those who suffer the most' with a critical focus on the role of social, economic, cultural, and political structures that influence how ordinary people make meaning of their situations. Then, there is also the fact that CDA 'require[s] a theorization and description of both social processes and structures which give rise to the production of a text, and of the social structures and processes within which individuals or groups as social historical subjects, create meanings in their interaction with texts.' 10 CDA 'critically analyzes those in power, those who are responsible, and those who have the means and the opportunity to solve such problems.' 11 In this case, what started as calls for social, economic, and human rights reforms escalated into a war that has been ongoing since then and where ordinary Syrians have been paying the highest price socially, economically, politically, and most terribly, with their lives.
For Wodak, CDA is particularly interested in the ways language mediates ideology. In Syria since 2011, multiple opposing ideologies circulated both online and offline, including: the pan-Arab, anti-Israeli, secular, authoritarian Baath ideology of the regime; an extreme leftist ideology of the communist party and many other leftist groups; as well as the ideology of extreme Islamic fundamentalist groups, with ISIS as one of its extremes and circulating as opposition. This is not to say that many parts of these ideologies do not overlap with each other, such as fighting Israel and the pan-Arab ideology of both the communist and the Baath parties, or fighting the 'imperial West,' which overlaps with almost all the factions in Syria with a few exceptions.
I use CDA to determine how the different symbolic forms circulating on Aljazeera.net construct and convey ideological meanings that establish or sustain relations of domination and produce dominant narratives on SNSs (Thompson, 1991).

Why a guided approach to topic modeling?
This article builds on important attempts at using computer-assisted techniques with CDA, including primarily the corpus-based approaches and, in more recent years, the use of topic modeling with discourse analysis. While imperative to the advancement of CDA as a field of study, especially with the massive amounts of text data produced online in need of critical analysis, these research methods implemented Topic Modeling as a tool to assist in the discourse analysis of a large corpus (Jacobs & Tschötschel, 2019). For example, Törnberg and Törnberg (2016b) used TM to extract topics about Muslims and Islam as discussed on one of the leading forums in Sweden. This outstanding study was the first to use TM and CDA together. It revealed that users depict Islam and Muslims as a 'homogeneous outgroup, embroiled in conflict, violence and extremism: characteristics that are described as emanating from Islam as a religion' (Törnberg & Törnberg, 2016b, p. 133). Then, they also analyzed the evolution of the discourse over time, which revealed some changes in the importance of specific topics and shifts within topics (Törnberg & Törnberg, 2016b).
Fabian Brinkmann (2019) presented a very convincing argument for why TM and CDA are technically and theoretically compatible, specifically when analyzing discourse strands or the entanglement of many discourse strands. Brinkmann argues that TM's usefulness in identifying and clustering documents within one or multiple topics is useful in identifying structures of discourse. And because TM and CDA share a similar position about studying discourse as semantic macrostructures, themes and topics, TM can be used to analyze the semantic macrostructures in text. Brinkmann does not offer a case study but an argument that TM can provide a useful complementary tool for CDA in the analysis of large datasets. Jacobs and Tschötschel (2019) argue that Topic Modeling can help discourse analysis in areas that the latter could not address such as 'scaling, repetition and systematization.' They make a convincing case for TM in supplementing CDA in their study of hegemony in texts. They argued that rather than just focusing on the ruptures and breakdown instances of hegemony which is what traditional CDA does because of its inability to scale, TM would support the assumption that there was hegemony and the instances chosen to be studied closely are these ruptures. Then, they also made the case for TM use with CDA's study of language as TM provides a way to connect the documents as instances of language and the words within the documents as instances of languages. This could be achieved because TM not only assigns documents to topics but also assigns words in every document to topics.
CDA considers discourse as social practice. Moreover, CDA assumes the existence of a dialectical relationship between discourse and the material conditions or the situation(s), institution(s), and social structure(s) framing it (Van Dijk, 2011;Wodak & Meyer, 2001). These conditions shape the discourse in question and in turn are shaped by it. And because CDA does not start with a fixed theoretical and methodological stance, it can first provide great guidance to TM as an exploratory text analysis technique to extract more representative topics out of the corpus of text.
I propose to use TM as a CDA method by extending the application of TM at the technical level and in terms of the CDA exploratory process, as I detail next. By starting with a general research topic such as a sect-based discourse around the Syrian war, I first look for the existence of such discourse in the Al-Jazeera articles and then in the comments dataset.

CDA-guided topic modeling
Is there a sect-based discourse on Aljazeera.net?
In what follows, I show two different ways to proceed with the CDA-Guided-TM method. In the first method, I look at the topics extracted from the full datasets and conduct an analysis starting there. To narrow that down to a more granular level, I proceed to identify topics that potentially include sect-based discourse before running a topic model only on the documents included in these topics for both the article and comment datasets. The second method starts by assuming that the TM applied on the full datasets will give macro-topics. To capture more specific topics, I run Topic Modeling by year. For example, I run topic modeling on the articles from the year 2010 and their corresponding comments, then 2011, and so on until 2017, so that each year gets its own topics for the articles and the comments, and the analysis would start there.
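The by-year route of the second method can be sketched as a simple grouping step followed by one model per year. The sketch assumes that each document is a dictionary with a `published` field holding an ISO date string; that field name, the function names, and the `fit_model` callable (any function that fits a topic model on a list of documents) are hypothetical and not taken from the study's code.

```python
from collections import defaultdict


def split_by_year(documents):
    """Group documents into yearly sub-corpora keyed by publication year."""
    by_year = defaultdict(list)
    for doc in documents:
        # Assumption: 'published' looks like '2013-08-21', so the year is the first four characters.
        by_year[doc["published"][:4]].append(doc)
    return dict(by_year)


def topics_per_year(documents, fit_model):
    """Apply fit_model to each yearly sub-corpus, in chronological order."""
    sub_corpora = split_by_year(documents)
    return {year: fit_model(sub_corpora[year]) for year in sorted(sub_corpora)}
```

The same grouping would be applied separately to the articles and to the comments so that each year yields two sets of topics.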

Method 1: applying TM on the full dataset
Based on the topics extracted from both the articles and comments datasets (shown in Tables 2 and 3), I proceed to identify whether sect-based discourse exists in the articles and then in the comments.
The list of topics extracted from the articles and shown in Table 2 does not clearly reveal an explicit sect-based discourse as a mainstream topic published by Al-Jazeera. But CDA requires a systematic contextualization of all background information in the analysis of discourse. CDA depends on the researcher's knowledge and context of the conditions around the production of a specific text. That is why a CDA-guided-TM opens up new avenues of inquiry that TM as a method does not usually investigate. For example, in Topic 19, the terms from the TM above were 'Humanitarian aid, UN, and Madaya, Kefraya, and Faoua.' These terms alone do not indicate the possibility of a sect-based discourse; rather, one might infer that the topic is about humanitarian aid to these towns. However, the context is very important in this case. At the time, these towns were being reported on using two conflicting discourses. On the one hand, opposition anti-regime media used the case of the long siege of the towns of Kefraya and Faoua, whose residents were predominantly Shi'a Muslims, to advance the argument that the Syrian government is supporting Shi'a Muslims against Sunni Muslims, and that the Lebanese Hezbollah was defending Shi'a Syrians against Sunni Syrians. On the other hand, pro-regime media used the case to advance the argument that the Sunni-based takfiri oppositions were brutally besieging, bombing, and shelling the two small Shi'a villages and the 'civilian villagers.' With this additional context, it is clear that this topic will probably contain sect-based discourse. In order to pursue this possibility, I ran a TM on the articles that pertain to Topic 19 because I suspected they include either explicit or implicit sect-based discourse. The results show, as expected, that one of the extracted sub-topics contains the words Islam, Sunna, Shi'a, Sect, and fundamentalism.
For further confirmation, Figure 3 shows the evolution of Topic 19, which confirms the progression of the events as they unfolded. First, there is a dramatic increase in Topic 19 articles when the siege on the two towns took place in March 2015, and when it became a humanitarian issue and a bargaining chip between the opposition and the government while there were intermittent attacks taking place. Then, it was no longer mentioned in the media, which is reflected in the decline in the graph until a deal to evacuate the towns started to take place in mid-2017 (which does not show in my data because the data collection stopped at around that period). Next, I ran a TM on the articles within Topic 19 only. One of these subtopics contained the terms Sunni, Shia, Alaoui, Taifi, which indicates the presence of a sect-based discourse. Figure 4 is a visualization of the evolution of this latter subtopic of Topic 19, which contained sect-based discourse. It shows that a sect-based discourse started showing up in the Al-Jazeera articles around mid-2010. That sect-based discourse continued to increase until early 2011. At that point, the discourse was characterized by a short stable period before decreasing between mid-2011 and mid-2012. After that, the sect-based discourse increased again to its highest peak in mid-2013. It also shows that between mid-2013 and mid-2014, it was absent, and that it started increasing again until about mid-2015, but then fluctuated throughout 2016 and 2017. Two conclusions can be drawn from this observation. First, despite the prevalence of this discourse, it is unknown whether Al-Jazeera used sect-based discourse as defined above in reporting or whether it was more ubiquitous in opinion pieces, but the prevalence of this discourse was higher in earlier years. Even when the Kefraya and Faoua situation unfolded starting in 2015-2017, this discourse was lower than it was in 2011 and 2013.
For future work, I will analyze the ratio of opinion pieces to reporting that include this discourse in order to settle this question. Second, the difference in saliency of the sect-based discourse between the early and the later years could be attributed to the fact that by 2015 a hegemonic, divisive sectarian discourse between Sunni and Shi'a had been set in place, so the use of Sunni and Shi'a categories to describe Kefraya and Faoua in reporting was no longer needed at that point. And because CDA questions the workings of power, it would be imperative to conduct a close analysis of the texts in the earlier years and in 2015. This conclusion would need further investigation in order to determine whether the use of sect-based discourse by Al-Jazeera was intentional or not.
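The kind of evolution curve described for Figures 3 and 4 can be approximated by counting, per month, the documents assigned to the topic of interest. The following is a minimal sketch, not the pipeline used in this study: it assumes each document carries a publication date and a single dominant-topic label, and the record layout and the function name `monthly_topic_counts` are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_topic_counts(docs, topic_id):
    """Count documents per (year, month) whose dominant topic is topic_id."""
    counts = Counter()
    for doc in docs:
        if doc["topic"] == topic_id:
            counts[(doc["date"].year, doc["date"].month)] += 1
    # Sort chronologically so the result can be plotted as a time series.
    return dict(sorted(counts.items()))


# Toy records invented for illustration; they are not drawn from the corpus.
docs = [
    {"date": date(2015, 3, 2), "topic": 19},
    {"date": date(2015, 3, 20), "topic": 19},
    {"date": date(2015, 4, 1), "topic": 4},
    {"date": date(2016, 1, 5), "topic": 19},
]
series = monthly_topic_counts(docs, 19)
```

Counting by dominant topic is one simplification among several; averaging per-document topic proportions by month would be an alternative under the same assumptions.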
The list of topics extracted from the comment dataset contains many more topics that potentially include sect-based discourse (including Topics 5, 7, 8, 13, and 18). To identify which of these topics, if any, include sect-based discourse, I ran a TM on the combined comments in Topics 5, 7, 8, 13, and 18 altogether, and then a TM on the documents in each of the topics separately to reveal subtopics within these documents (see link for the extracted topics).
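The selection step described above, one combined sub-corpus plus one sub-corpus per candidate topic, can be sketched as follows. This is an illustration under stated assumptions rather than the code used here: it assumes each comment has a single dominant-topic label, and `build_subcorpora` is a hypothetical helper. The topic model itself is left out; each returned list would be passed to whichever TM implementation is in use.

```python
def build_subcorpora(docs, topic_ids):
    """Return (combined, per_topic) text lists for the given dominant topics."""
    per_topic = {t: [] for t in topic_ids}
    for doc in docs:
        if doc["topic"] in per_topic:
            per_topic[doc["topic"]].append(doc["text"])
    # The combined corpus concatenates the per-topic lists in topic order.
    combined = [text for t in topic_ids for text in per_topic[t]]
    return combined, per_topic


# Toy comments invented for illustration only.
comments = [
    {"text": "comment a", "topic": 5},
    {"text": "comment b", "topic": 7},
    {"text": "comment c", "topic": 2},
    {"text": "comment d", "topic": 18},
    {"text": "comment e", "topic": 5},
]
combined, per_topic = build_subcorpora(comments, [5, 7, 8, 13, 18])
```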
The topics extracted from the combined TM are shown in Table 4. Topics 4, 11, and 12 might include sect-based discourse, given their references to Islam, Sunnis, and Shi'a. However, Topic 1 is clearly about divisive sect-based discourse, with its use of the terms Shi'a, Sunnis, Sectarian, Zionist, Arab, Muslims, and Islam. The frequent and exclusive terms for this topic also include Christians, Safavid, and Crusaders. Figure 5 shows that divisive sect-based discourse increased drastically in 2011, fluctuated until 2015, and plateaued afterwards. This could be due to the intensification of the fight, with a violent ruling minority using military force against a Sunni majority, until the rise of ISIS and its theatrical killings (this conclusion needs further investigation).
Before moving forward with the analysis of one topic that clearly contains sect-based discourse (Topic 1) and another that might contain sect-based discourse (Topic 11), it is important to note that the sect-based discourses from Topics 1, 4, 11, and 12 appear alongside discourses about Zionism, the Muslim Brotherhood, regime violence, opposition violence, ISIS, Nusra, and the regional and international players' roles in the situation. This means that sect-based discourse circulates in tandem with other embedded ideological stereotypes and sectarian narratives. In order to test whether there is a correlation between the articles and the sect-based comments, I now analyze Topic 1 and Topic 11 and apply TM on the articles where the comments from these topics were posted (see link). The TM on Topic 1 produced the subtopics shown on this link along with their corresponding automated Google translation to English. The topics extracted out of Topic 1 are all about active military events, including war, bombings, killings, refugees, jihadists, violence involving regional and international players, and the geopolitics involving those players. A singular exception, Topic 6, is about the Muslim Brotherhood. All these topics overwhelmingly include sect-based discourse.
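Moving from a comment topic back to the articles on which those comments were posted amounts to a join on an article identifier. A minimal sketch follows, assuming each comment record stores the identifier of its article and a dominant-topic label; the field names and the function `articles_for_comment_topic` are hypothetical and are not taken from the dataset.

```python
def articles_for_comment_topic(comments, articles, topic_id):
    """Return articles that received at least one comment assigned to topic_id."""
    wanted = {c["article_id"] for c in comments if c["topic"] == topic_id}
    # Keep the original article order and include each article only once.
    return [a for a in articles if a["id"] in wanted]


# Toy records invented for illustration only.
comments = [
    {"article_id": "a1", "topic": 1},
    {"article_id": "a2", "topic": 11},
    {"article_id": "a1", "topic": 1},
    {"article_id": "a3", "topic": 4},
]
articles = [
    {"id": "a1", "text": "article one"},
    {"id": "a2", "text": "article two"},
    {"id": "a3", "text": "article three"},
]
topic1_articles = articles_for_comment_topic(comments, articles, 1)
topic11_articles = articles_for_comment_topic(comments, articles, 11)
```

The texts of the returned articles would then form the corpus for the second-level TM.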
The TM on Topic 11 produced the subtopics shown on this link along with their corresponding automated Google translation to English. As with Topic 1, the majority of the topics extracted from the articles connected to the comments of Topic 11 (17 topics in total, all except Topics 3, 14, and 16) deal with active military events, including war, bombings, killings, refugees, jihadists, violence involving regional and international players, and the geopolitics involving those players. One topic (Topic 3) is about the Muslim Brotherhood, and two other topics use sect-based discourse (Topics 14 and 16). Again, all these topics overwhelmingly include sect-based discourse.

Method 2: applying TM on dataset partitioned by year
In this method, I start with the assumption that the topics from Tables 2 and 3 offer only a general overview of the articles and the comments and are not enough to capture the nuances of the unfolding war events and the shifts in fields that might impact sect habitus. Thus, there is a need to zoom in on the data, and one way to do this is to run topic modeling by year in order to extract topics pertaining to the articles and the comments of each year. This link includes all the topics in Arabic and their translation via Google Translate into English. Out of these topics, I will focus solely on the ones that include explicit sect-based discourse and analyze them. Tables 4 and 5 show the topics that potentially include sect-based discourse extracted from the article and comment datasets, with the number of documents included in these topics.
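Partitioning by year, as described above, can be sketched in a few lines. This is an illustrative sketch only: it assumes each document has a publication date, and the helper name `partition_by_year` is hypothetical. Each per-year list would then be modeled separately with the chosen TM implementation.

```python
from collections import defaultdict
from datetime import date


def partition_by_year(docs):
    """Group document texts into one sub-corpus per publication year."""
    by_year = defaultdict(list)
    for doc in docs:
        by_year[doc["date"].year].append(doc["text"])
    return dict(sorted(by_year.items()))


# Toy records invented for illustration only.
docs = [
    {"date": date(2013, 5, 1), "text": "doc a"},
    {"date": date(2014, 2, 9), "text": "doc b"},
    {"date": date(2013, 11, 30), "text": "doc c"},
]
yearly = partition_by_year(docs)
# Number of documents per year, comparable in kind to per-year document counts.
sizes = {year: len(texts) for year, texts in yearly.items()}
```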
Before getting into the analysis of the topics themselves, a few notes are in order. Except for 2013, the number of Al-Jazeera articles that potentially include sect-based discourse is roughly the same from year to year. A close analysis of 2013 is worth conducting, especially because of the military losses of the Syrian Army, the peace talks, and the rise of the Free Syrian Army, and because it came just before the meteoric rise of ISIS and Al Nusra Front.
Returning to the averages, the number of sect-based comments does not hold steady from year to year: in 2013 there were 3267 comments that fit the sect-based discourse category, while in 2014 such comments amounted to one third of the 2013 figure. This runs counter to the instrumentalist assumption that elites (in this case, the Al-Jazeera media conglomerate) inflame people's sectarianism at will. It leads to the conclusion that sect habitus online is dynamic and changes with the unfolding of events, and that elites are not in full control of popular sect habitus. I am not excluding the role that the media and political elites play in shaping sect habitus, but I am arguing that they are not in control of it. I ran TM on the 2013 comments that fit the sect-based discourse category to show the topics relevant to these comments, and then extracted the respective articles and their topics to deduce the relationship between these topics and the articles.
The topics extracted from the 3267 comments from the year 2013 reveal a combination of sect-based violence, sectarian strife, war activities, killings, and the local, regional, and international players active in Syria. These comments were posted on 994 articles, and the topics covered by these articles (shown here) include active military activities, reporting on violence, killings, shelling, bombings, sniping, Al Qaeda, Israel, Hezbollah, and some international players.
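The figures reported above (3267 comments posted on 994 articles) imply an average of roughly 3.3 sect-based comments per article. The sketch below shows how such a per-article tally could be computed; it assumes each comment record stores an article identifier, and the toy identifiers and the helper name `comments_per_article` are hypothetical.

```python
from collections import Counter


def comments_per_article(comments):
    """Tally how many comments each article received."""
    return Counter(c["article_id"] for c in comments)


# Toy comments invented for illustration only.
comments = [{"article_id": a} for a in ["a1", "a1", "a2", "a3", "a3", "a3"]]
per_article = comments_per_article(comments)

n_articles = len(per_article)                          # distinct articles
mean_comments = sum(per_article.values()) / n_articles
multi = sum(1 for n in per_article.values() if n > 1)  # articles with 2+ comments

# Average implied by the totals reported in the text.
reported_mean = 3267 / 994
```

The distribution behind such an average matters: a mean above one does not by itself show how many articles received more than one sect-based comment, which is why the tally per article is kept.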
Sect habitus online is dynamic: it interacts with media representations of sect-based discourse but is never actually determined by them. It is less a question of causation or determination ('Al-Jazeera makes them do it') and more that the media's attempt to represent sect habitus barely scratches the surface ('Al-Jazeera is like the top of the sectarian iceberg we can see with our current methods of analysis').
It is worth linking CDA's concept of 'contexts' to Bourdieu's concept of 'fields' (Bourdieu, 1990, pp. 66-68) that sect habitus enters (in this case online), where the rules of these fields include the platforms' environments, for example, user agreements. In other words, these rules are different for Al-Jazeera comments because these comments are 'responses' to specific articles, while Twitter, for example, allows much more limited text and is not set in the context of replying to an article. In CDA's terms, the context of Al-Jazeera differs from that of SNSs. Sect habitus online is influenced by and is influencing these fields. The latter are both online and offline, even when it comes to the manifestations of sect habitus online. For example, sect-based comments were produced in 2013 regardless of the absence of sect-based discourse in Al-Jazeera's articles. This tells us that sect habitus is not controlled by Al-Jazeera's editorial policies alone. This relationality of sect habitus within multiple fields dismantles the top-down theories of 'sectarianism as controlled by elites.' Also, the articles that include sect-based comments tend to include more than one sect-based comment, which means that sect-based discourse leads to more sect-based discourse. This CDA-guided-TM showcases how to identify what Brinkmann calls 'structures of discourse' in two different but connected datasets: Al-Jazeera articles and users' comments. These two applications showcase how CDA can bring context into a TM analysis, which opens up the possibility of analyzing and questioning the role of structural and cultural conditions, including power structures and ideology, in the unfolding of events.
A clear limitation of CDA-guided-TM, despite its exploratory nature, is that it requires a deep understanding of the social and structural context within which the texts are produced. This matters especially for studies concerned with historical records in contexts of contested histories, such as the case of Lebanon. In cases where that deep understanding is not available, this method cannot be applied.

Conclusion
I have presented a method, CDA-guided-TM, and used it to examine the specific question of sect-based discourse. I showed two different approaches to using this method in analyzing Al-Jazeera articles and the respective comments. I found that Al-Jazeera used sect-based language in its content, and that articles reporting violence receive the most sect-based comments. I also showed that splitting the corpus into sub-corpora by year not only reveals more specific topics but also exposes fluctuation in the sect-based discourse of the comments, which reveals the dynamic nature of sect habitus.
This method could also be used to analyze other social media datasets. It could even be extended with a Social Network Analysis of the top commenters to reveal their positions regarding topics of interest. This method could also be used to investigate the role of the international community, the question of the refugees, the humanitarian NGO work, the ongoing war events, violence, peace talks, the use of chemical weapons, etc. There are many other ways to use this method, including answering different questions about the data in these datasets and investigating the most salient topics covered by Aljazeera.net by running topic modeling on the articles that received the highest number of comments. It is also possible to measure the likes and dislikes on the comments by topic.
The significance of this method lies in its ability to analyze large datasets of online discourse while simultaneously questioning the social structures and processes that form the contexts within which these discourses take place. Additionally, this method critically analyzes power beyond the instrumentalist approach, which renders people (in this paper, commenters) instruments controlled by political elites. Instead, the method enacts a critical discourse analysis at scale, while accounting for the dynamics of power relations in creating meaning through texts.