Specificity in English for Academic Purposes (EAP): A Corpus Analysis of Lexical Bundles in Academic Writing

The issue of specificity in English for Academic Purposes (EAP) settings has always challenged linguists and instructors in the field to take a stance on how language should be perceived, that is, whether language forms and features are transferable across different academic disciplines or are specific to particular disciplines. This study takes this debate a step further by employing a corpus-driven method to identify a type of phraseological sequence, namely lexical bundles, in a corpus of journal articles in the field of International Business Management (IBM). The lexical bundles were compared with those compiled by Simpson-Vlach and Ellis (2010) in their Academic Formulas List (AFL) to determine the specificity of the lexical bundles identified in this study. Following a frequency-based approach, the corpus tool Collocate 1.0 was used to extract three- to five-word sequences. These word sequences were manually filtered to exclude irrelevant and meaningless combinations. The qualified lexical bundles were compiled and compared with the lexical bundles in the AFL (Simpson-Vlach and Ellis 2010) using the log-likelihood test. The findings show that three-word lexical bundles are the most common type of lexical bundle in the IBM corpus. The comparison reveals that lexical bundles in the IBM corpus are relatively specific as compared with lexical bundles in the AFL. A discipline-specific approach to the teaching and learning of lexical bundles in EAP settings is therefore advocated to enhance EAP syllabuses and instruction.


INTRODUCTION
Studies on phraseology in various genres and disciplines have been flourishing in recent years with the advancement of computer-mediated research methodology. Phraseology has been studied under the rubrics of, for instance, chunks, phraseological sequences, formulaic language, lexical bundles, collocations, multi-word items, recurrent sequences, n-grams and lexical phrases. Previous studies on phraseology have shown that the knowledge of phraseology is essential in ensuring fluency and natural use of language (Pawley & Syder 1983, Sinclair 1991, Hill 2000, Hyland 2012, Ang et al. 2017). Also, the appropriate use of phraseological sequences is a determining factor in pragmatic competence, given the prevalence of these recurring sequences in both spoken and written discourse (Paquot & Granger 2012). This prevalence indicates that meaning creation and understanding depend essentially upon the stock of phraseological sequences in language users' lexicon. In academic discourse, the mastery of the relevant phraseological sequences is particularly important to learners so that they can gain access to the relevant "academic community" (Coxhead 2008, p. 151). Nevertheless, the formal conventions of academic discourse, which are markedly different from those of other genres such as conversation, pose difficulties for learners in processing information and interacting within the academic community to which they belong. Attention has thus been [...]
In response, several justifications have been made to defend the ESAP position. First, to counter the position that discipline-specific language should be taught by lecturers in the relevant disciplines, Hyland (2002, 2006) argues that subject specialists usually do not emphasise generic and language skills in lectures for two main reasons. Firstly, subject specialists are not trained to teach language and they generally "lack both the expertise and desire to teach literacy skills" (Hyland 2002, p. 388). Secondly, it appears that many lecturers in various disciplines consider academic discourse conventions as "largely self-evident and universal" (ibid.). Subject lecturers often assess students' work without paying much attention to how the language conventions and forms are used (Braine 1988, Lea & Street 1999, Hyland 2002, 2006). It is worth noting that the responsibility of teaching language conventions and skills lies ultimately with EAP teachers as they are trained to handle language classrooms. To cope with the diverse requirements and needs of learners from various academic disciplines, EAP instructors should receive further professional training to teach the specialised language used in different academic disciplines or domains.
Second, the claim that EAP courses mainly focus on generic skills such as summarising, paraphrasing and making presentations, which do not vary much across disciplines, deserves a second thought. It should be borne in mind that the main goal of setting up EAP courses is to equip learners with specific language skills relevant to their respective disciplines (Hyland 2002). EAP teachers should primarily concentrate on the teaching of language forms that carry distinctive and "clear disciplinary values" (Hyland 2006, p. 12) and that are frequent and important to the relevant discourse community. The teaching of the relevant phraseological expressions deserves to be prioritised in EAP classrooms, as phraseological expressions such as lexical bundles are the "basic building block of discourse" in academic writing (Biber et al. 2004, p. 371).
Lastly, it is disputable that there is a common core of language items. Hyland and Tse (2007, p. 238) doubt that there is "a single inventory [that] can represent the vocabulary of academic discourse and so be valuable to all students irrespective of their field of study". With the development of corpus-based studies in recent years, studies on vocabulary and phraseological sequences have been able to inform the teaching of vocabulary and phrases in EAP. These studies show that there are significant variations between disciplines (Cortes 2004, Hyland & Tse 2007, Hyland 2008a, Durrant 2014). In addition, the variations between genres and registers have also been studied and shown to exist in academic settings (Biber et al. 1999, Biber et al. 2004, Hyland 2008b). Also, any language form may have a number of different meanings and functions depending on the context in which the language is used. It is therefore sensible to claim that vocabulary behaves differently across disciplines and contexts (Hyland 2002, 2006). In a more assertive tone, Hyland and Tse (2007, p. 240) state that "all disciplines shape words for their own uses" and thus defend the discipline-specific approach to EAP.
The debate concerning which approach should be adopted in EAP continues as the rapid development of corpus linguistics continues to inform language teaching in EAP. The issue of specificity can affect the way EAP practitioners see the field and how they carry out their teaching. More studies need to be carried out to ascertain whether the issue of specificity applies to the teaching of useful phrases in EAP classrooms. This study takes this debate a step further by comparing two lists of phraseological sequences which were compiled for the purposes of EGAP and ESAP, respectively.

PURPOSE OF THE STUDY
In order to see how language should be perceived and taught in EAP settings, this study compares lists of phraseological sequences derived from two approaches (ESAP and EGAP). Specifically, this study attempts to identify a type of phraseological sequence, i.e. lexical bundles, from a specialised corpus of journal articles in the field of International Business Management (henceforth IBM). The lexical bundles identified are compared with the lexical bundles in the Academic Formulas List (henceforth AFL) (Simpson-Vlach & Ellis 2010) to determine the specificity of the lexical bundles in this study. Following a common-core approach, the AFL (Simpson-Vlach & Ellis 2010) is a list of EGAP lexical bundles retrieved from a corpus of academic writing sampled across four academic disciplines: Humanities and Arts, Social Sciences, Natural Sciences/Medicine, and Technology and Engineering. The lexical bundles identified in this study, in contrast, represent ESAP lexical bundles extracted from a specialised corpus which contains only research articles relevant to the field of IBM.

METHOD
The corpus and methods used to identify the discipline-specific lexical bundles are described in the following sub-sections.

THE CORPUS
The corpus for this study consists of academic research articles in the field of IBM. The journal articles were selected and compiled electronically. The selection of journals was based on the impact factor of the journals as recognised by Thomson Reuters Web of Science. A total of two international journals were chosen. These journals were selected because they specialise in publishing research articles pertaining to the field of IBM. The corpus consists of 1 million word tokens and includes 138 original research articles.

THE CORPUS TOOL
The corpus tool Collocate 1.0 (Barlow 2004) was used to extract lexical bundles automatically by setting the span options. This corpus tool reads plain text files with the .txt extension. Collocate 1.0 extracts lists of n-grams (lexical bundles) using two statistical measures: frequency and Mutual Information.
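The frequency side of this extraction can be illustrated with a minimal Python sketch. It is not Collocate 1.0's implementation: the tokenisation rule and the function names (tokenize, count_ngrams, per_million) are assumptions made for illustration only, and the sketch lets n-grams run across sentence boundaries, which a real tool may or may not do.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenisation: lowercase alphabetic word forms, allowing
    # internal hyphens or apostrophes (e.g. "anti-bribery").
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def count_ngrams(tokens, n_min=3, n_max=5):
    # Count every contiguous sequence of n_min to n_max tokens.
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def per_million(raw_count, corpus_size):
    # Normalised frequency: occurrences per million word tokens.
    return raw_count * 1_000_000 / corpus_size


sample = "Firms are more likely to invest. Firms are more likely to export."
tokens = tokenize(sample)
counts = count_ngrams(tokens)
for ngram, raw in counts.most_common(3):
    print(" ".join(ngram), raw, per_million(raw, len(tokens)))
```

Normalising to a per-million rate is what allows frequencies from corpora of different sizes to be placed side by side.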

STEPS IN IDENTIFYING LEXICAL BUNDLES
The first step of the analysis was to create a list of the most frequent lexical bundles of IBM. In accordance with Biber et al. (1999), a lexical bundle is defined in this study as a frequently recurring sequence of words. As lexical bundles are a type of phraseological sequence, the terms lexical bundles and phraseological sequences are used interchangeably in this study. Following Biber et al. (1999), this study focuses on three- to five-word lexical bundles. The steps taken in identifying and determining the eligibility of phraseological sequences as lexical bundles are shown in Figure 1.

Manual inspection of dispersions in corpus: items must occur in at least 10% of texts in the corpus
FIGURE 1. Steps in identifying lexical bundles
The lexical bundles were identified using the frequency-based approach. There was a minimum cut-off point for retrieving the lexical bundles (Biber et al. 1999), set at 20 occurrences per million words in this study. Another important statistic used to create the list of lexical bundles is the Mutual Information (MI) score. MI is a measure of the strength of association between words. A higher MI score means a stronger association and thus a more coherent relationship between words (Simpson-Vlach & Ellis 2010, Salazar 2014). This metric was applied in order to eliminate word sequences that have no meaning or function of their own but occur often because of the high frequency of the words that they contain. It was also used to avoid discounting useful but less frequent phrases that tend to end up at the bottom of frequency-based lists (Simpson-Vlach & Ellis 2010). Also, the dispersion criterion is necessary to avoid individual writers' idiosyncrasies (Hyland 2008b).
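To make the role of MI concrete, the following Python sketch computes one common generalisation of pointwise mutual information to multi-word sequences: the base-2 logarithm of the observed probability of the sequence divided by the probability expected if its words were independent. Whether Collocate 1.0 computes MI in exactly this way is an assumption; the function names are hypothetical, while the default thresholds (20 occurrences per million words, MI of 3.00) are the ones reported in Figure 1.

```python
import math


def mutual_information(ngram, ngram_count, word_counts, corpus_size):
    # Observed probability of the whole sequence.
    observed = ngram_count / corpus_size
    # Probability expected if the words occurred independently.
    expected = 1.0
    for word in ngram:
        expected *= word_counts[word] / corpus_size
    return math.log2(observed / expected)


def passes_thresholds(freq_per_million, mi, min_freq=20.0, min_mi=3.0):
    # A sequence is kept only if it clears both cut-off points.
    return freq_per_million >= min_freq and mi >= min_mi


# Toy corpus of 8 tokens: "x y z x y z q r"
word_counts = {"x": 2, "y": 2, "z": 2, "q": 1, "r": 1}
mi = mutual_information(("x", "y", "z"), 2, word_counts, 8)
print(mi)
```

A sequence made up of very frequent words has a large expected probability, so its MI stays low even when its raw frequency is high. This is why MI can filter out incidental strings of common words.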
Collocate 1.0 extracted a total of 1714 three-word sequences, 270 four-word sequences and 25 five-word sequences. After the extraction by Collocate 1.0, the next step was to check the dispersion of the phraseological sequences in the corpus. A phraseological sequence has to occur in at least 10% of texts to avoid the idiosyncrasies of particular writers (Hyland 2008b). It was discovered that not every phraseological sequence on the list was of phraseological relevance, and further sifting was therefore necessary in order to produce a more refined list of lexical bundles.
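The dispersion check was carried out manually in this study, but its logic can be sketched in Python as follows. The function names are hypothetical and each text is assumed to be available as a list of tokens. With the 138 articles in the corpus, a 10% threshold corresponds to at least 14 texts.

```python
def texts_containing(ngram, texts_tokens):
    # Number of texts in which the sequence occurs at least once.
    n = len(ngram)
    return sum(
        any(tuple(tokens[i:i + n]) == ngram for i in range(len(tokens) - n + 1))
        for tokens in texts_tokens
    )


def min_texts_required(n_texts, min_percent=10):
    # Integer ceiling of min_percent % of n_texts (avoids float rounding).
    return -(-n_texts * min_percent // 100)


def is_well_dispersed(ngram, texts_tokens, min_percent=10):
    needed = min_texts_required(len(texts_tokens), min_percent)
    return texts_containing(ngram, texts_tokens) >= needed


print(min_texts_required(138))
```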
Following Salazar (2014), some exclusion criteria were adapted in order to weed out irrelevant word combinations. The modified criteria and some instances of excluded bundles are shown in Table 1 below.
TABLE 1. Exclusion criteria for irrelevant word combinations
1) Fragments of other bundles: on the basis (on the basis of), in the case (in the case of)
2) Bundles containing acronyms: gdp per capita, OECD anti-bribery convention
3) Bundles composed exclusively of function words: have also been, as it is
4) Bundles with random numbers: at least one, for the first
5) Random section titles: fig 1 b, table 2 in
6) Meaningless bundles: it that is, studies e g
7) In-text citations: Beck et al., Gatignon Anderson 1988
After excluding the irrelevant word combinations, the remaining lexical bundles were identified and arranged in order of normalised frequency (per million words). The most frequent lexical bundles in this study were compared with those of Simpson-Vlach and Ellis's (2010) study to determine the specificity of the lexical bundles in this study. A statistical measure, the log-likelihood test, was performed on the lexical bundles found in both studies. The results of the log-likelihood test are used to determine the degree of confidence in the statistical significance of the results of the analysis (Dunning 1993). By conducting this statistical test, researchers are able to move beyond simple descriptions of the data in the corpus.
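As an illustration of the log-likelihood test, the Python sketch below uses the two-corpus formulation that is widely used in corpus linguistics for comparing the frequency of an item in two corpora. That this exact variant was applied in the present study is an assumption, and the frequency of 100 used for the second corpus is a made-up figure for demonstration. Values above 3.84 correspond to p < 0.05 and values above 6.63 to p < 0.01 (one degree of freedom).

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    # Expected frequencies if the item were equally likely in both corpora.
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll


def more_typical_of_a(freq_a, size_a, freq_b, size_b):
    # Direction of the difference: higher relative frequency in corpus A.
    return freq_a / size_a > freq_b / size_b


# 452 per million in one corpus against a hypothetical 100 per million in another
score = log_likelihood(452, 1_000_000, 100, 1_000_000)
print(score > 6.63, more_typical_of_a(452, 1_000_000, 100, 1_000_000))
```

The statistic itself only measures how far the two frequencies diverge, so the direction of the difference has to be read from the relative frequencies.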

RESULTS AND DISCUSSION
The following sub-sections present the results of analysis and the discussion of the findings.

THE LEXICAL BUNDLE LIST
A total of 1055 lexical bundles of varying lengths remained on the list after the application of the exclusion criteria. These 1055 bundles amount to a total of 48220 individual cases, which make up 2.19% of one million words in the corpus of this study. As can be expected, the lexical bundle list is largely composed of three-word strings, which account for 85% or 898 of the 1055 target bundles. They are followed by 147 four-word lexical bundles, or 14% of the total. There are only 10 different five-word lexical bundles in the corpus, representing 0.9% of all bundles. Tables 2, 3 and 4 display the normalised frequencies (per million words) and MI scores of the most frequent three-word, four-word and five-word lexical bundles found in the IBM corpus. It is apparent that the frequency and the length of lexical bundles are inversely related. This observation is in line with a general characteristic of lexical bundles, namely that the longer the lexical bundle, the lower its frequency (Biber et al. 1999; Hyland 2008b; Salazar 2014). As can be seen, the most frequent three-, four- and five-word lexical bundles are 'more likely to', 'are more likely to' and 'are more likely to be', respectively. The three-word lexical bundle 'more likely to' is an independent bundle which may arguably be subsumed into the four-word bundle 'are more likely to' and the five-word bundle 'are more likely to be'. Similarly, the four-word bundle 'are more likely to' could also be part of the longer bundle 'are more likely to be'. Nevertheless, the shorter three-word bundle 'more likely to', which seems to be a fragment of the longer four- and five-word bundles, was retained in this study. This is because 'more likely to' occurs 452 times per million words, much more frequently than the four- and five-word bundles of which it forms part (which occur 306 times and 55 times per million words, respectively). This shows that the three-word lexical bundle 'more likely to' has more collocates in its collocational environment. It not only overlaps with the longer bundles 'are more likely to' and 'are more likely to be', but also collocates with other words to form other longer bundles. For instance, 'more likely to' is part of other longer bundles such as 'is more likely to' (44 times per million words), 'more likely to have' (56 times per million words), 'are more likely to have' (48 times per million words), and 'firms are more likely to' (42 times per million words).
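The proportions and the fragment argument above can be checked with a few lines of Python. The figures are those reported in this section and the function name is hypothetical. Because every occurrence of 'are more likely to' contains 'more likely to', subtracting the two normalised frequencies gives the rate at which the shorter bundle occurs outside the longer one.

```python
counts_by_length = {3: 898, 4: 147, 5: 10}
total = sum(counts_by_length.values())

# Share of each bundle length in the final list, in percent.
shares = {n: round(100 * c / total, 1) for n, c in counts_by_length.items()}


def frequency_outside(shorter_pm, longer_pm):
    # Occurrences (per million words) of the shorter bundle that are
    # not part of the longer bundle containing it.
    return shorter_pm - longer_pm


print(total, shares)
print(frequency_outside(452, 306))
```

On these figures, 'more likely to' occurs 146 times per million words without a preceding 'are', which supports treating it as a bundle in its own right.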

COMPARISON WITH SIMPSON-VLACH AND ELLIS'S (2010) ACADEMIC FORMULAS LIST
To reiterate, a total of 1055 types of three- to five-word lexical bundles were found in the IBM corpus. The top 50 types of lexical bundles of different lengths, with their normalised frequencies (per million words) and MI scores, are presented in Table 5. It can be seen that all lexical bundles in the top 50 occur more than 100 times per million words. Most of the frequent lexical bundles are three-word strings, with only 8% of them being four-word strings. The distinctive four-word bundles are 'the extent to which', 'are more likely to', 'on the other hand' and 'in the context of'.
Table 6 compares the top 50 lexical bundles in the IBM corpus with the frequent core academic formulas proposed by Simpson-Vlach and Ellis (2010). The comparison of the results of this study with those of Simpson-Vlach and Ellis (2010) was necessary to determine the specificity of the lexical bundles in this study. To reiterate, Simpson-Vlach and Ellis's list of academic formulas is a cross-disciplinary list which uses a common-core approach to compile lexical bundles common in various academic disciplines. In contrast, the list of lexical bundles retrieved from the IBM corpus is a discipline-specific list, representing phraseological sequences which are regarded as specific and significant in the field of IBM. The comparison between these two lists is methodologically justifiable as both were retrieved using statistically driven methods.
Table 7 presents the list of lexical bundles common to the IBM corpus and the AFL. Of all the frequent lexical bundles in the IBM corpus, 36% are also found in the AFL. This means that 64% of the lexical bundles in IBM are not found in the AFL. Also, the log-likelihood test was performed to study the keyness of the lexical bundles in IBM and the AFL. As keyness is an indicator of specificity, the results of the log-likelihood test show that more than 70% of the shared lexical bundles are more specific to the IBM corpus. This indicates that the lexical bundles in the IBM corpus are relatively specific as compared with the AFL. Also, the AFL does not contain enough formulas to cater to the needs of learners in the field of IBM. A discipline-specific approach to the teaching and learning of lexical bundles for EAP is therefore seen as necessary. This finding is consistent with Hyland (2008a), who demonstrates that there is considerable variation in disciplinary preferences in terms of the types of lexical bundles found in four different academic domains. Over half the lexical bundles in each list did not occur at all in any other discipline in Hyland's (2008a) study, while in this study, more than 60% of the lexical bundles were not found in the AFL. Hyland (2008a) proposes that lists of academic lexical bundles should be discipline-specific, as the use of lexical bundles differs by discipline. For instance, Hyland (2008a) reveals that many lexical bundles used in electrical engineering were not found in other academic disciplines, including business studies, applied linguistics and biology. Moreover, electrical engineers were found to use the widest range of different bundles, while biologists employed the fewest bundle types in academic writing. However, Simpson-Vlach and Ellis (2010) argue that they were able to identify lists of lexical bundles that are commonly used in various academic disciplines. The results of this study are in line with those of Hyland (2008a), but contrast with those of Simpson-Vlach and Ellis (2010). Nonetheless, it should be noted that there are methodological differences between the previous studies and this study.
First, Hyland (2008a) investigates lexical bundles using a frequency cut-off threshold only, while in Simpson-Vlach and Ellis's (2010) study, both frequency and Mutual Information (MI) cut-off thresholds are set in the corpus tool during the data extraction process. Similar to Simpson-Vlach and Ellis (2010), this study uses both the frequency and MI statistics to retrieve the relevant lexical bundles. The use of frequency and MI statistics in both the present study and Simpson-Vlach and Ellis's study supports the comparability of the two lists of lexical bundles. It is worth noting that the use of MI is necessary as MI is widely recognised as a good indicator of the association between words. Besides, sole reliance on frequency counts, as in Hyland (2008a), would most probably overlook some useful expressions with lower frequency counts. A better alternative for the extraction of lexical bundles is to combine frequency and MI statistics, as afforded by corpus tools such as Collocate 1.0.
Second, in Hyland's (2008a) study, only four-word lexical bundles were analysed, while Simpson-Vlach and Ellis (2010) included three-, four- and five-word bundles in their data set. It is therefore apparent that the results of this study and those of Simpson-Vlach and Ellis (2010) are relatively more comparable. In view of Simpson-Vlach and Ellis's claim for a common-core approach to the identification and use of lexical bundles for pedagogical purposes, there is a need to verify whether there are enough common lexical bundles to serve learners with different disciplinary backgrounds. This study is an attempt to explore the issue of specificity with regard to the use of lexical bundles in a specific academic field. The findings of this study indicate that the compilation of academic phraseological sequences needs to accord with specific academic needs and purposes.
In sum, in relation to the teaching of academic phrases and expressions, the findings provide clear evidence that EAP is better approached in a more specific manner. Practitioners in EAP should be provided with additional professional training in order to handle a "disciplinary-sensitive repertoire of bundles" efficiently (Hyland 2008a, p. 8). EAP instructors are also encouraged to work closely with subject specialists in order to gain a better understanding of subject-related discourse (Hyland 2006).
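The list comparison reported in this section, namely the share of discipline-specific bundles that also appear in a common-core list, amounts to a set intersection. The Python sketch below shows the idea. The function name is hypothetical and the two short lists are placeholders rather than actual entries from the IBM list or the AFL.

```python
def overlap_share(target_bundles, reference_bundles):
    # Compare case-insensitively and ignore duplicate entries.
    target = {b.lower() for b in target_bundles}
    reference = {b.lower() for b in reference_bundles}
    shared = target & reference
    return shared, 100 * len(shared) / len(target)


# Placeholder lists for illustration only
target_list = ["bundle one here", "bundle two here", "bundle three here", "bundle four here"]
reference_list = ["Bundle One Here", "some other bundle"]

shared, percent = overlap_share(target_list, reference_list)
print(sorted(shared), percent)
```

The percentage is calculated over the target list, so it answers the question of how much of the discipline-specific list is covered by the reference list, not the reverse.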

CONCLUSION
The most frequent lexical bundles in the IBM corpus are three-word bundles, including 'more likely to', 'in order to', 'as well as', 'in terms of' and 'the number of'. The comparison of the lexical bundles in this study with those of Simpson-Vlach and Ellis (2010) indicates that lexical bundles are discipline-specific. The findings of this study have implications for how EAP should be perceived and approached in the language classroom. The issue of specificity in EAP teaching is currently under debate and influences both teachers and researchers. Based on the outcome of the analysis, it is suggested that teaching and learning in EAP should follow a subject- or discipline-specific approach, as phraseological sequences such as lexical bundles are highly likely to be markers of disciplines. It is nevertheless never easy to put specificity into practice in EAP classrooms. EAP teachers need to work closely with subject specialists to gain a better understanding of the specific language conventions in the respective courses. The collaboration can take various forms, including regular discussions with subject experts. To sum up, there are differing views with regard to the approaches to EAP and this issue remains debatable in the field. It is thus necessary for researchers to continue exploring the various types of phraseological sequences in academic discourse in order to further enhance EAP instruction and syllabuses. The enhancement of EAP syllabuses is crucially important to learners so that they are equipped with the ability to participate in the relevant "academic community" as espoused by Coxhead (2008).
Figure 1 (other steps): automated extraction by Collocate (minimum frequency: 20 times per million words; Mutual Information (MI): 3.00 and above); exclusion criteria (items which fall into the exclusion criteria group were discarded); final list of eligible lexical bundles of IBM

TABLE 2. Top 50 three-word lexical bundles in order of normalised frequency

TABLE 3. Top 50 four-word lexical bundles in order of normalised frequency

TABLE 5. Top 50 lexical bundles in IBM in order of normalised frequency

TABLE 6. Top 50 lexical bundles in IBM corpus compared with the frequent core academic formulas proposed by Simpson-Vlach and Ellis (2010)

TABLE 7. Lexical bundles common in IBM corpus and AFL