Issues in defining/extracting collocations in Japanese and Korean: Empirical implications for building a collocation database

Collocations in Japanese and Korean have been studied extensively based on statistical tools. The criteria for collocations in these languages, however, have not been fully established in the literature, and it is not obvious whether all statistically significant combinations of words can be regarded as collocations. In this article, we point out empirical problems in extracting collocations in Japanese and Korean, and provide a standard for identifying collocations (to be separated from “free combination” and “idiom”) in these languages. We concentrate on the discussion of empirical aspects of collocation research, rather than the statistical analyses of collocational patterns. As a preliminary to developing a database for Japanese-Korean contrastive work on collocations, the present study focuses on ten “Chinese-words” (漢語). We show that (i) the co-occurring verbs for eight Chinese-words in Korean all carry over to Japanese (but not vice versa); (ii) in the other two cases, Korean exhibits co-occurring verbs not found in Japanese; (iii) language-specific patterns of verb co-occurrence are also observed in some instances. Overall, the results indicate that a much wider variety of co-occurring verbs is found in Japanese than in Korean.


Introduction
Collocation has been widely utilised in language teaching; it refers to related phrases or clauses which co-occur in a statistically significant way (Strafella and Maekawa, 2015). For instance, according to The Oxford Collocations Dictionary for Students of English (2nd edition, 2009), the noun influence frequently co-occurs with the verbs in (1):
(1) have, gain, exercise, exert, retain, lose, expand, extend, increase, spread
With the development of corpus linguistics, there has been a growing body of work which makes use of statistical tools to extract collocations from corpora. The data thus collected, however, include "free combinations" and "idioms," which raises the issues of (i) whether these are to be regarded as collocations and (ii) if not, what criteria are to be set out to constrain the range of collocations. These issues, in our view, have not been seriously addressed, especially for Japanese and Korean. For instance, as will be pointed out in Section 3.6, not all statistically significant combinations of words are regarded as collocations (at least, with respect to the standard assumed in the present article). Therefore, whilst statistical analyses constitute important approaches to the study of collocations, the empirical issues involved in the extraction of collocations are in need of thorough discussion, and a standard for collocations that can supplement such statistical approaches must be established.
In this article, we aim to provide an empirical foundation to construct a collocation database to be employed for contrastive analyses of Japanese and Korean. Thus, rather than presenting a rigid statistical account, we focus on (i) the discussion of empirical issues in detecting collocational units in these languages and (ii) the presentation of a case study for a specific type of syntagmatic sequence which involves "Chinese-word" (kango, 漢語), a type of expression observed in both Japanese and Korean.
More specifically, we will provide the standard for collocations by applying Im's (2006) criteria for Korean collocations to Japanese data. This standard will serve as a ground on which we analyse the syntagmatic sequence "Chinese-word + case particle + verb." Through this case study, it will be revealed that, in spite of
• Sejong Corpus (The National Institute of Korean Language, 2010), a 200 million-word corpus of written Korean
• Kkokkoma Korean Analyzer (Seoul National University, 2009)
The organisation of this article is as follows: Section 2 first surveys the notion of collocation and the issues in extracting collocations from corpora. Section 3 then offers the standard for identifying collocations in Japanese and Korean based on Im's (2006) proposal. With respect to this standard, Section 4 reports a case study of collocational units involving "Chinese-words." Finally, Section 5 summarises our main findings and points out some future directions.
From the perspective of this approach, a collocation refers to a polar binary relation of lexical dependence, where a dominating lexical unit α (called "base") selects a dominated lexical unit β (called "collocate"). In a collocation "noun + verb" in Japanese and Korean, a noun is a "base," while a verb is a "collocate" (Im and Kim, particle, DECL = declarative clause marker, NOM = nominative case particle, PAST = past tense marker, TOP = topic particle.)
In the Japanese tradition of teaching the Japanese language, collocations are often seen as a fixed association of words, a unit comparable to "rengo-teki kanyôku" (associated idiom) (Miyaji, 1985) and "rengo" (associated words) (Kunihiro, 1997). From the viewpoint of language education, it may be reasonable to conceive of free combinations as collocations in order to, e.g., prevent language transfer (Ooso, 2005). For instance, 'take a medicine' is expressed as (5) in Korean, where mekta is used. Korean learners of Japanese may thus produce the ungrammatical phrase (6), where taberu is disallowed (cf. (4)
medicine-ACC eat
Int. 'take a medicine'
Since our main objective is to provide a foundation for building a database for linguistic purposes (rather than for educational purposes), we do not treat free combinations as collocations in our research.
Turning to the issue of identifying collocations from corpora, it is essential to clarify which statistical indicators are used. In the present work, we will adopt three statistical indicators. These will be explicated in turn below.
Firstly, the "t-score" is utilised to compare the frequency of a base with that of a collocate with respect to the total number of words in a corpus. The t-score is suitable for evaluating collocations that are frequently used in speech and discourse.
Thirdly, the "Dice Coefficient" is adopted to measure the strength of a collocation solely based on the frequency relation between a base and a collocate. Chujo and Uchiyama (2004) report that the Dice Coefficient, although the calculation is simple, is a useful measure to extract "genre-specific words" (see also Ishikawa (2008)).
Our analysis makes use of these three indicators, taking into consideration (i) the co-occurrence frequency, (ii) the frequency of a base, (iii) the frequency of a collocate, and (iv) the total number of words in a corpus. In (7), fA stands for the frequency of a base A, fB the frequency of a collocate B, and w the total number of words in a corpus.

(7) Dice Coefficient, MI-score, t-score

Table 1 presents various statistical information relating to the association between the verb yomu 'read' as a base and its co-occurring lexeme in Japanese. Table 1 uncovers several difficulties in identifying collocations, as we will point out below.
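As a minimal sketch of how the three indicators named in (7) may be computed from the four quantities listed above, the following Python function assumes the definitions commonly used in corpus linguistics: Dice = 2·fAB / (fA + fB), MI = log2(fAB·w / (fA·fB)), and t = (fAB − fA·fB / w) / √fAB, where fAB is the co-occurrence frequency. These formulas, the function name, and the example figures are illustrative assumptions; they are not taken from Table 1.

```python
import math


def association_scores(f_ab, f_a, f_b, w):
    """Return (dice, mi, t) for a base A and a collocate B.

    f_ab: co-occurrence frequency of A and B
    f_a:  frequency of the base A
    f_b:  frequency of the collocate B
    w:    total number of words in the corpus
    The formulas are the commonly used textbook definitions (an assumption).
    """
    # Co-occurrence frequency expected if A and B were independent.
    expected = f_a * f_b / w
    dice = 2 * f_ab / (f_a + f_b)
    mi = math.log2(f_ab / expected)
    t = (f_ab - expected) / math.sqrt(f_ab)
    return dice, mi, t


# Hypothetical figures, for illustration only.
dice, mi, t = association_scores(f_ab=50, f_a=200, f_b=400, w=1_000_000)
print(f"Dice = {dice:.4f}, MI = {mi:.2f}, t = {t:.2f}")
```

Note that the Dice Coefficient in this sketch does not depend on w, which is in line with the description of it as a measure based solely on the frequency relation between a base and a collocate.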

Issues in Identifying Collocations
Firstly, it might appear that the verb yomu 'read' selects a noun such as hon 'book' and kiji 'article.' In fact, it is widely assumed in the syntax literature that (i) a verb is the head of a Verb Phrase and (ii) if a verb is transitive, it selects an object NP as an (internal) argument (e.g. Carnie (2012)). Here, an argument is assumed to be an NP (not a noun); this is because in (8), what yomu selects is not the noun hon 'book' but the whole NP omosiroi hon 'interesting book.'
(8) interesting book-ACC read.PAST
'I read an interesting book.'
In the case of collocation, however, a noun selects a verb. Consider (9).
(9) hon-o yukkurito yonda [Japanese]
book-ACC slowly read.PAST
'I read a book slowly.'
In (9), the noun hon 'book' is in a collocational relation to the verb yonda 'read' (not the VP yukkurito yonda 'read slowly'). Thus, if we specify a noun as a base (and a verb as a collocate), collocational units such as hon-o yonda 'read a book' can be extracted. If we specify a verb as a base (and a noun as a collocate), however, such collocational pairs cannot be fully detected since yonda 'read' potentially selects as an argument, e.g., omosiroi hon 'interesting book,' tinpuna hon 'banal book,' kinô katta hon 'book which I bought yesterday,' and so on.
Secondly, in extracting a sequence "noun + verb," the choice of a noun or a verb as a base may affect results considerably. Consider (10).
(10) line.space-ACC read
'read between the lines'
This expression consists of the verb yomu 'read' and the noun gyôkan '(actual) space between lines,' but the meaning of the whole expression cannot be determined on the basis of the meanings encoded in each element. Although a lexical meaning of the verb yomu somehow persists in (10), this is not the case with the noun gyôkan. As will be argued in Section 3, a base in collocations must be "semantically transparent." Thus, if we select a noun as a base, expressions such as (10) do not count as collocations.
Thirdly, a lexeme combination that is identified statistically as a collocation may turn out to be an idiom. Consider the following examples:
[Japanese]
mackerel-ACC read
'provide disguised information (e.g. age)'
In (11)-(12), neither the lexical meaning of a noun nor that of a verb persists. They are thus viewed as idioms, even if they may be statistically regarded as collocations. (See Section 3.6 for a more detailed discussion and illustration of the present issue.)
Article No~e00189
Finally, according to Table 1, nouns which denote something to read are strongly associated with the verb yomu 'read.' It is not quite obvious, however, whether these associations are cases of collocation or free combination in virtue of statistical results alone. We thus need the criteria for distinguishing collocations from free combinations.
Based on the above considerations, we specify a noun as a base in identifying a collocation "noun + verb." In the next section, we will establish the Japanese-Korean criteria for extracting collocational units of the form "noun + verb," to be distinguished from free combinations and idioms.
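The choice of a noun as a base can be illustrated with a small extraction sketch. The code below assumes a hypothetical input format (a sentence as a list of (surface form, part-of-speech) pairs with the tags NOUN, PART, ADV, ADJ, and VERB); neither this format nor the function name comes from the tools used in the present study. Starting from a noun followed by the accusative particle, it looks for the next verb while skipping intervening adverbs, so that hon-o yukkurito yonda in (9) and omosiroi hon-o yonda both yield the pair (hon, yonda).

```python
def extract_noun_verb_pairs(tokens, particle="o", max_gap=2):
    """Collect (noun, verb) pairs of the form 'noun + particle (+ adverbs) + verb'.

    tokens:  list of (surface, pos) tuples; the tag set is a hypothetical one.
    max_gap: maximum number of adverbs allowed between the particle and the verb.
    """
    pairs = []
    for i in range(len(tokens) - 2):
        noun, noun_pos = tokens[i]
        part, part_pos = tokens[i + 1]
        if noun_pos != "NOUN" or part_pos != "PART" or part != particle:
            continue
        # Scan forward from the token after the particle.
        for surface, pos in tokens[i + 2 : i + 3 + max_gap]:
            if pos == "VERB":
                pairs.append((noun, surface))
                break
            if pos != "ADV":
                # Anything other than an adverb ends the search for this noun.
                break
    return pairs


# Romanised toy sentences based on the examples discussed in the text.
sentence_9 = [("hon", "NOUN"), ("o", "PART"), ("yukkurito", "ADV"), ("yonda", "VERB")]
sentence_np = [("omosiroi", "ADJ"), ("hon", "NOUN"), ("o", "PART"), ("yonda", "VERB")]

print(extract_noun_verb_pairs(sentence_9))
print(extract_noun_verb_pairs(sentence_np))
```

The sketch works on surface forms; a realistic pipeline would presumably operate on lemmas supplied by a morphological analyser, so that yonda and yomu are counted as the same collocate.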

Hypothesis
The overall aim of this article is to offer the standard for collocations which may be employed for contrastive analyses of Japanese and Korean collocations. As stated at the outset, one of the empirical challenges encountered by the statistical analyses is how to identify collocations, to be separated from free combinations and idioms. This issue has largely been untouched for Japanese and Korean (Im, 2006; Lim, 2015; Yoo, 2012). Of special note is Im (2006), who provides the classification of clustered expressions in Korean and applies it to various collocational data (e.g. synonymous and antonymous paradigmatic relations in collocational clusters). In this section, we will demonstrate that Im's (2006) criteria for Korean data (with slight amendments) are also applicable to Japanese. (Their empirical coverage will be further expanded in Section 4, where data that involve "Chinese-word" (kango, 漢語) will be analysed.)

Im (2006)
In Table 2, we present Im's (2006: 174) classification of collocations, free combinations, and idioms (with slight amendments to be clarified shortly).
This classification is based on two factors:
• Either a lexeme in a syntagmatic relation is replaceable with a synonymous lexeme or such replacement is (highly) constrained.
• A lexeme in a syntagmatic relation is semantically transparent or semi-transparent or non-transparent.
As displayed in Table 2, the former factor yields three types (A, B, C), and the latter factor yields five types (a, b, c, d, e). Each category is characterised by the combination of these two factors, as in "Aa," which characterises free combinations.
In Table 2, the definitions of types "a, b, c, d, e" use the term "semi-transparent," while Im (2006) employs the term "non-transparent" in place of "semi-transparent." We contend that the term "semi-transparent" is more appropriate. Consider (13).
(13) deal-ACC open
'enter into business relation' (Im, 2006: 171)
The original meaning of thuta is 'make a way by removing a stacking thing.' This meaning is to some extent related to the verbal part of 'enter into business relation,' and thuta is thus semantically "semi-transparent" in (13). This clearly contrasts with "non-transparent" cases like (14), where the original meaning of a constituent element is completely absent from the meaning of the whole.
(14) mackerel-ACC read
'provide disguised information (e.g. age)'
In (14), the meaning of the whole expression 'provide disguised information' cannot be traced to the meanings of saba 'mackerel' and yomu 'read' in any sense.
We thus use the term "non-transparent" for such cases as (14), and use the term "semi-transparent" for such cases as (13). Im (2006) deals with only Korean data, but in the following subsections, we argue that Table 2 is also useful for capturing Japanese data.

Table 2. Classification of free combinations, collocations, and idioms (Im, 2006: 174, revised).

                                               Free combinations   Collocations   Idioms
A. lexeme-replacement is free.                 Aa
B. lexeme-replacement is constrained.                              Ba, Bb
C. lexeme-replacement is highly constrained.                       Cb             Cc, Cd, Ce

Free Combinations
In Table 2, free combinations are characterised by the type "Aa." That is to say, in a syntagmatic relation "α + β," a co-occurring item β (verb) may be replaced with a synonymous lexeme. Furthermore, both α and β are semantically transparent.
In (15), the verb arawasu 'show' is replaceable with other verbs such as simesu 'show.' Note also that the two constituent lexemes are semantically transparent; thus, kêi and arawasu here contribute their lexical meanings, 'respect' and 'show' respectively, to the overall meaning of the combined expression 'show respect.'

Collocations
Let us turn to collocations, which are divided into three subtypes: "Ba," "Bb," and "Cb." These subtypes will be illustrated in turn.
In collocations of type Ba, the constituent lexemes are both semantically transparent, as in the case of type Aa (Section 3.2). There are constraints, however, on the possibility of replacing a collocate (verb) with a synonymous lexeme.
In (16), both bôsi 'hat' and kaburu 'put on' are semantically transparent. But the collocate kaburu cannot be replaced with tukeru 'attach,' as shown in (17), although they are similar in terms of lexical meaning in Japanese.
In type Bb, it is likewise difficult to replace a collocate with a synonym. But unlike type Ba, a collocate is semantically semi-transparent.

The original meaning of ukeru is 'receive' but it means 'take' in (18), although some semantic similarity is still detected between 'receive (an order)' and 'take (an order).' In this sense, ukeru in (18) is semantically semi-transparent. In addition, ukeru cannot be replaced with the synonym morau 'receive,' as illustrated in (19).
(19) order-ACC receive
Int. 'take an order (at a restaurant)'
In type Cb, a base is semantically transparent, whereas a collocate is semantically semi-transparent. Unlike type Bb, however, a collocate of type Cb is subject to heavier constraints on its replacement with a synonymous lexeme. Type Cb, thus, blurs the distinction between collocations and idioms. (Examples of idioms will be given in the next subsection.) To pinpoint the problem, consider (20).
(20) year-ACC take
'get old'
In some dictionaries, (20) is registered as an idiom. This would be reasonable if we held that the base tosi 'age' were semantically non-transparent. As shown in (21)

Idioms
Let us finally examine idioms, which have three subcategories: "Cc," "Cd," and "Ce." As mentioned in the preceding subsections, in the cases of free combinations and collocations, a base is always semantically transparent. This is in sharp contrast with idioms, where a base is never transparent (namely, always either "semi-transparent" or "non-transparent"). Another distinguishing property of idioms is that the possibility of replacing a collocate with a synonymous lexeme is always highly restricted (or perhaps impossible).
In type Cc, although a noun is semantically semi-transparent, a collocate (verb) is transparent.
mouth-ACC match
'make their (inconsistent) stories appear to agree in front of third parties'
The noun kuti, which literally means a mouth, is used semi-transparently to denote a story, whereas the verb awaseru 'match' (or more precisely 'make two things agree') is semantically transparent.
In type Cd, the constituent items are all semantically semi-transparent.
(24) hand-ACC cut
'break off the relationship with someone'
In (24), te 'hand' means 'relationship,' and kiru 'cut' means 'break off.' Thus, though their lexical meanings are somehow related to the meaning of the whole expression (24), they are not identical to it and are hence semantically semi-transparent.
Finally, type Ce represents the typical cases of idioms. In this category, the meaning of an idiomatic expression cannot, in any way, be related to the meanings of its composing elements. Example (25) is repeated from (14) in Section 3.1.
(25) mackerel-ACC read
'provide disguised information (e.g. age)'
The meaning of this idiomatic expression cannot be traced to the encoded meanings of the noun saba 'mackerel' and the verb yomu 'read.' Therefore, example (25) is a case of semantic non-transparency.
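The two-factor classification described in Sections 3.1 to 3.4 can be summarised as a small lookup. The following Python sketch is only an illustration: the label strings ("free," "constrained," "highly constrained," "transparent," "semi," "non") and the function name are hypothetical, and the mapping of type "e" to the case where both elements are non-transparent is an assumption based on the discussion of saba-o yomu. The judgements themselves (replaceability and transparency) must still be supplied by the analyst.

```python
from enum import Enum


class Category(Enum):
    FREE_COMBINATION = "free combination"
    COLLOCATION = "collocation"
    IDIOM = "idiom"


# Factor 1: replaceability of the collocate with a synonym (types A, B, C).
REPLACEMENT = {"free": "A", "constrained": "B", "highly constrained": "C"}

# Factor 2: transparency of (base, collocate) (types a-e).
# The definition of "e" as (non, non) is an assumption.
TRANSPARENCY = {
    ("transparent", "transparent"): "a",
    ("transparent", "semi"): "b",
    ("semi", "transparent"): "c",
    ("semi", "semi"): "d",
    ("non", "non"): "e",
}

# Combined types named in the text and the category each one belongs to.
CATEGORY = {
    "Aa": Category.FREE_COMBINATION,
    "Ba": Category.COLLOCATION,
    "Bb": Category.COLLOCATION,
    "Cb": Category.COLLOCATION,
    "Cc": Category.IDIOM,
    "Cd": Category.IDIOM,
    "Ce": Category.IDIOM,
}


def classify(replacement, base, collocate):
    """Return (type code, category); category is None for unlisted combinations."""
    code = REPLACEMENT[replacement] + TRANSPARENCY.get((base, collocate), "?")
    return code, CATEGORY.get(code)


# Judgements taken from the examples discussed in the text.
print(classify("constrained", "transparent", "transparent"))  # bôsi-o kaburu
print(classify("highly constrained", "semi", "semi"))         # te-o kiru
print(classify("highly constrained", "non", "non"))           # saba-o yomu
```

Combinations that are not named in the text (for instance, free replacement with a semi-transparent base) are deliberately left unclassified rather than guessed.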

Ambiguous Cases
So far, the modified version of Im's (2006: 174) criteria (Table 2) has been applied to Japanese data. Although this issue is neither noticed nor discussed in Im (2006), Table 2 can be applied to ambiguous cases. In Section 3.4, example (24) was presented as an idiom. This noun-verb pairing also possesses a free-combination reading.
'cut a hand'
The "a"-line specifies the idiomatic interpretation, and the "b"-line the free-combination interpretation. The standard in Table 2 is fully compatible with data like (26). First, (26) under the "a"-reading is of type Cd, a case of idiom. Second, the same string of words under the "b"-reading is of type Aa, a case of free combination (see Section 3.2).
Another example is provided in (27).
'open a curtain (e.g. in a theatre)'
In the "a"-line, (27) would be of type Cd, a case of idiom. This "a"-reading is illustrated in (28). In the "b"-line, (27) is of type Aa, a case of free combination. This "b"-reading is illustrated in (29).
(28) atarasii-jidai-ga maku-o ake-ta
These ambiguous cases are naturally expected in our classification, since each type in Table 2 is independent of the other types. Im (2006), who only targets Korean data, does not consider ambiguous cases, but comparable data are found in Korean, too.
In the "a"-line, (30) would be of type Aa, a case of free combination. This "a"-reading is illustrated in (31). In the "b"-line, (30) would be of type Cd, a case of idiom. This "b"-reading is illustrated in (32)

Summary
In this section, we have presented Im's (2006) criteria with slight amendments and have demonstrated that they are applicable not only to Korean but also to Japanese data.
As mentioned in Section 1, one of the challenges posed for statistical approaches to collocations is the empirical issue of whether it is reliable to identify collocational units purely in terms of statistical results. Consider (33)-(34), together with Table 3 and Table 4 below.
According to Hunston (2002), a syntagmatic relation may be statistically construed as a collocation if Freq. is 10 or more, each of fA and fB is 100 or more, the MI-score is 3.0 or more, and the t-score is 2.0 or more. As for (33), the MI-score and the t-score in Table 3 are high (in particular, the t-score); this suggests that katudô-o suru constitutes a frequently used collocational unit. As for (34), consider Table 4. The Dice Coefficient, which calculates the collocational strength based on the frequency relation between hara 'stomach' and tateru 'evoke,' is high. Furthermore, the MI-score and the t-score are also high. This suggests that hara-o tateru constitutes a collocational unit. (For comparison purposes, Table 4 also presents other data involving hara 'stomach': hara-o kukuru 'make up one's mind.' As in (34), the Dice Coefficient, the MI-score, and the t-score for this sequence are high, and it would be statistically treated as a collocation.)
However, with respect to our criteria introduced in this section (Table 2), (33)-(34) are not viewed as collocations. In (33), katudô 'activity' and suru 'do' are semantically transparent, and this sequence of words is characterised as type Aa, an instance of free combinations. In (34), the contribution of hara 'stomach' to the overall meaning of the sequence is semi-transparent. On the other hand, tateru is polysemous and it exhibits the meaning of 'evoke (an emotion)' in this example; tateru is thus semantically transparent. With respect to Table 2, then, (34) is classified as type Cc, a case of idioms. (Further, the semantic contributions of hara 'stomach' and kukuru 'tie up' to the overall meaning of hara-o kukuru 'make up one's mind' are not transparent. Thus, according to Table 2, it is also regarded as an idiom, of type Cd.)
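The statistical thresholds attributed to Hunston (2002) above can be written as a simple filter. The sketch below is illustrative only: the function name is hypothetical, and the figures in the usage lines are invented, apart from the frequency of 74 reported for dentatu in Section 4. It makes explicit that passing the filter is a purely statistical decision, which is why a sequence such as katudô-o suru may pass it while still being a free combination under Table 2.

```python
def meets_statistical_thresholds(freq, f_a, f_b, mi, t):
    """Check the thresholds cited from Hunston (2002).

    freq: co-occurrence frequency (at least 10)
    f_a:  frequency of the base (at least 100)
    f_b:  frequency of the collocate (at least 100)
    mi:   MI-score (at least 3.0)
    t:    t-score (at least 2.0)
    """
    return (
        freq >= 10
        and f_a >= 100
        and f_b >= 100
        and mi >= 3.0
        and t >= 2.0
    )


# Hypothetical scores for a frequent pairing: passes the statistical filter.
print(meets_statistical_thresholds(freq=25, f_a=500, f_b=800, mi=4.2, t=4.9))

# A base with a frequency of 74 fails on f_a alone, whatever the other values are.
print(meets_statistical_thresholds(freq=12, f_a=74, f_b=800, mi=5.0, t=3.4))
```

A filter of this kind narrows down the candidates; the distinction between free combinations, collocations, and idioms is then made on the basis of lexeme-replacement and semantic transparency.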
To sum up, the identification of collocations is not entirely achieved if statistical results alone are taken into account, and it is important to establish the standard against which collocations in Japanese and Korean are appropriately characterised and are also properly distinguished from free combinations and idioms. In the next section, we will further argue that our criteria are also useful for identifying collocational sequences that involve "Chinese-words" (kango, 漢語) in the two languages.

Analysis
The last section has presented the Japanese-Korean criteria for identifying collocations by making slight modifications to Im's (2006) proposal. In the present section, we will further confirm the usefulness of the criteria for Japanese and Korean by exploring the syntagmatic units which contain expressions mutually observed in the two languages: "Chinese-words" (kango, 漢語).
Chinese-words account for a large proportion of the vocabulary in Japanese and Korean. (Yamaguchi et al. (2004: 115) report that Chinese-words amount to 45.89% of the Japanese vocabulary.) They thus serve as a reasonable starting point for a Japanese-Korean contrastive analysis. Furthermore, Chinese-words differ from other types of words such as "wago" (Japanese-native words) and loanwords in that a number of Chinese-words denote an action and are often combined with the light verbs (Muraki, 1991: 203): suru 'do' in Japanese (35)
We concentrate on the syntagmatic relation "Chinese-word + case particle + verb." In this sequential pattern, a Chinese-word (together with a case particle) is treated as a base, whereas a verb is treated as a collocate.
The Chinese-words used in our survey are listed in (37) (for Japanese) and in (38)
For statistical types of research, it may not be quantitatively sufficient to target only these Chinese-words with the specified schema "Chinese-word + case particle + verb." In this article, however, we are not engaged in presenting a rigid statistical account, but attempt to (i) point out empirical problems for extracting collocations in Japanese and Korean, (ii) set out the standard for identifying collocations in these languages, and (iii) test this standard against specific data involving Chinese-words. A larger-scale exploration is thus left for future research.
Next, the case particles exploited are presented in (39)
We first calculate the co-occurrence frequencies of the Chinese-words and a case particle in our corpora. The result is summarised in Table 5. (For reasons of space, only the t-scores are provided here.)
With the exception of the Chinese-word meaning 'supply' in Korean, the Chinese-words are most frequently marked with the accusative case particle: o in Japanese and (l)ul in Korean. Akimoto (1993) reports similar results for the case-marking of Chinese-words in Japanese. As for the Chinese-word for 'supply' in Korean, it is marked with the nominative particle i/ka or the accusative particle (l)ul to a similar degree. Given these results, we restrict our attention to the combinations of the Chinese-words with the accusative case particles (o in Japanese and (l)ul in Korean).
The sequences "Chinese-word + accusative case particle" being specified, we then identify collocations among them by assuming that a Chinese-word is a base and a verb is a collocate. This process relies on the criteria for collocations introduced in Table 2. The result is shown in Table 6. (Different notations for a lexeme count as one and the same lexeme. For instance, okonau 'do' may be expressed as "行う" or "行なう," but these are treated as different manifestations of the single lexeme okonau.) Of special note is that a much wider range of verbs is allowed as a collocate in Japanese, as will be discussed below.
First, the set of verbs (as collocates) in Korean forms a proper subset of the set of verbs in Japanese for eight Chinese-words meaning 'behaviour,' 'sympathy,' 'attention,' 'inspection,' 'supply,' 'moving,' 'plan,' and 'judgement,' respectively. In particular, the following Chinese-words in Korean co-occur with only a couple of verbs: hayngtong 'behaviour,' kongkam 'sympathy,' cwumok 'attention,' kemsa 'inspection,' kongkup 'supply,' kamtong 'moving,' and phankyel 'judgement.' Further, free combinations are included in this list, such as hayngtong hata (Lit. 'do behaviour'), where hata is a light verb 'do.' Moreover, in the combinations of the verb patta 'receive' with kongkup 'supply' and kamtong 'moving,' the Chinese-word is not marked with a case particle, and these combinations thus do not meet our standard for collocations.
Note that a low frequency of a noun does not necessarily indicate that the variation of co-occurring verbs is also small. For instance, consider the Chinese-word dentatu 'transmission.' The fA of dentatu is 74, but it co-occurs with a wide range of verbs, as illustrated in (41)
In these (and other) combinations involving dentatu, the MI-score is more than 3.0, though the t-score for (42) and that for (43) are below 2.0 due to the low frequency of the Chinese-word (see Section 3.6). When the frequency of a Chinese-word is low, the type of combination (i.e. free combination, collocation, idiom) can be identified based on our criteria presented in Table 2. For instance, okonau in (42) can be replaced with suru (see (41)), and dentatu-o okonau is thus classified as a free combination. In this way, the present article raises the empirical issues for the (purely) statistical treatments of collocations, and provides the means of identifying collocations in terms of lexeme-replacement and semantic transparency.
Second, collocations with 'change' vary across the two languages. Whilst Japanese exhibits 22 patterns, Korean exhibits only 14 patterns. Notably, although collocational patterns in Korean generally do not display much variation, the Chinese-word meaning 'change' in Korean co-occurs with several verbs which are absent from the list of verbs allowed in Japanese. The relevant examples include pyenhwa-lul kyekkta 'experience a change,' pyenhwa-lul ikkulta 'attract a change,' pyenhwa-lul kkoyhata 'attempt a change,' and so forth.
Finally, a discrepancy lies in the Chinese-words for 'enquiry' in the two languages. In Japanese, only the verb okonau 'conduct' is identified as a collocate of this Chinese-word. In Korean, pel'ita 'make it happen' is identified as a collocate; the corresponding Japanese verb never co-occurs with the Chinese-word meaning 'enquiry.'
Overall, the results indicate that, whilst some idiosyncratic differences can be observed, a much wider variety of co-occurring verbs is found in Japanese than in Korean.

Conclusion
The study of collocations raises various empirical problems. In particular, in order to extract collocations from corpora, it is important to establish the criteria for determining collocations (to be distinguished from free combinations and idioms).
In this article, we have concentrated on these empirical issues (rather than rigid statistical analyses), and have shown that the criteria proposed for Korean collocations by Im (2006), with slight modifications, are extendable to Japanese in a uniform way. Based on these criteria, we have presented a case study of the syntagmatic sequence "Chinese-word + case particle + verb" in the two languages, hoping that our methodological/empirical discussions will be fruitfully combined with statistical approaches to collocations. One of our future prospects is to refine the criteria for collocations by addressing aspects of the phenomena other than "substitution of composing units" and "semantic transparency" (cf. Im (2006)) and by covering a wider spectrum of data within/beyond the "Chinese-word + case particle + verb" pattern. Further work in this direction, we hope, will shed light on lexical and grammatical facets of Japanese and Korean, and it will also contribute to the building of a Japanese-Korean collocation database for contrastive linguistic research.

Declarations
Author contribution statement
Jong Seung Park, Tohru Seraku: Conceived and designed the experiments; Analyzed and interpreted the data; Wrote the paper.
Jieun Kiaer: Conceived and designed the experiments; Analyzed and interpreted the data.