Semantic Annotation of Anaphoric Links in Language I

The main purpose of this paper is to make some linguistic contributions to an ISO initiative to formulate a standard on the semantic annotation framework of reference, coreference, and other types of anaphoric phenomena in natural language. For this purpose, we first briefly review some of the existing coreference annotation (CA) schemes. We then formulate an abstract syntax ASyn_ana for anaphoric annotation, on which a variety of concrete syntaxes, such as an XML-based concrete syntax CSyn_anaX, can be developed to provide an interoperable representation format for the annotation. To satisfy the semantic adequacy of the proposed abstract syntax at least partially, we check the possibility of developing a formal semantics based on it. Such a semantics may be accepted as a valid application of the proposed annotation scheme. We finally consider the multilingual applicability of the proposed ASyn_ana by applying it to Korean, a non-inflectional agglutinating language with pro-drop properties.


Introduction
Datasets are important for any linguistic analysis in at least two respects.
(1) a. Datasets make the analysis data-oriented with its focus on real data.
b. Datasets provide a basis for validating theoretical claims. 1
This paper thus presents two sorts of datasets of language fragments: one in English and another in Korean. The English datasets are used to carry the general discussion of anaphoric links in language, whereas the Korean datasets are used to check how that discussion applies to non-European languages like Korean, a non-inflectional agglutinating language with pro-drop properties.
1 This paper focuses on the design aspect of semantic annotation schemes based on the notion of abstract syntax validated with some consistent semantic interpretation of annotations. It does not present any corpus-based statistical analysis, nor does it follow any big data approach to check the efficiency or other computational properties of the proposed annotation scheme.
While the Korean datasets are introduced in Section 7 Application to Korean, here is a fragment of the English dataset as in (2), a somewhat bizarre story from an undisclosed, but freely available source: 2
(2) Not long ago {a retired professor_mt1 of a small university_mt2 in Essex_mt3}_mt4 lost {his_mt5 wealthy wife_mt6}_mt7, who_mt8 had been ill for {over ten years}_mt9. After {her_mt10 death_mt11}_mt12, {the professor}_mt13 and {their_mt14 only daughter_mt15}_mt16 found {a large sum of money}_mt17 left for them_mt18 to share.
Billy Jones_mt19, 84, wanted to remarry and found {a charming young Vietnamese woman_mt20}_mt21, aged 24, 40 years younger than his_mt22 own daughter_mt23. He_mt24 told her_mt25 that, if she_mt26 agreed to marry him_mt27, she_mt28 would receive {eighty million pounds}_mt29 - {a million pounds}_mt30 a year_mt31 - right after their_mt32 marriage_mt33 and also inherit the rest_mt34 after his_mt35 death_mt36. The lady_mt37 refused his_mt38 tempting offer_mt39, instead marrying {a young British bandmaster_mt40}_mt41 with {no promising career_mt42}_mt43.
This short story contains at least 43 noun phrases (NPs) of various forms that mention or refer to something, called a discourse entity, in the universe of discourse, a non-empty set of discourse referents (see [38] and [39]). Each of these phrases, technically called mentions, is uniquely identified with an integer and the prefix mt, which stands for mention in the text. 3 Most of the mentions are referentially grounded, namely referring expressions, and some of them are also identified as being coreferentially or, in more general terms, anaphorically related to others. To understand the whole story told by this short item, one should recognize these coreferential relations among the so-called mentions. The noun phrase a retired professor_mt1, or the larger phrase a retired professor of a small university in Essex_mt4, and the name Billy Jones_mt19, for instance, refer to the same person, and so do the noun phrase a charming young Vietnamese woman_mt21 and the noun phrase The lady_mt37.
Each such pair of mentions that refer to the same entity in the set of discourse referents consists of two mentions, one called 'anaphor' and the other its 'antecedent'. They are also said to corefer or be coreferential.
The antecedent of an anaphor may be split into more than one mention. The pronoun their_mt14 in the first paragraph of the story, for instance, has its antecedent split into two mentions, a retired professor_mt1 or the professor_mt13 and his wealthy wife_mt7. 4 There are half a dozen occurrences of pronouns such as his, her, he, she, their, and them that refer to either Billy Jones_mt19, his wife_mt6, their daughter or the Vietnamese lady. The two pronouns their_mt14 and them_mt18 have two different sets of antecedents: the first has professor_mt1 and wife_mt6 and the second professor_mt1 and daughter_mt15 as their respective antecedents.
The scope of mentions can be taken as being either wide or narrow. In ordinary practice, this difference does not matter. The antecedent of the pronoun his_mt5 may be taken narrowly as being professor_mt1 or widely as being a retired professor of a small university in Essex_mt4. This difference can be marked up only if necessary, as is done by [30].
The study of reference, coreference, and other anaphoric relations has a long history with a very long list of literature. In the past several decades alone, a tremendous amount of work has been done in the area of syntax and semantics: generative grammar, for instance, has proposed various structural conditions and constraints on pronominalization and reflexivization, while formal semantics such as Montagovian semantics adopted model-theoretic denotational approaches, especially to the treatment of general quantification and variable binding. To cite a few, [77] started discussing constraints on variables that appear in transformational rules in syntax, and [49] introduced the structural notion of command between a referring expression and an anaphoric expression, which played an important role in determining coreference and other related relations. [60] opened a new arena where both linguists and philosophers engaged in analyzing and interpreting various types of quantifiers and pronouns or variables in language, together with their scope and binding phenomena, in the framework of higher-order formal logic. This paper, however, has no intention of surveying all this linguistic history, which extends from syntax to semantics to pragmatics with discourse analysis, but focuses on a small area of the most recent work by computational linguists on so-called coreference annotation (CA).
Reference or coreference resolution is a big issue in computational linguistics, especially in the area of information extraction (IE). There have been several important publications that deal with that issue, to cite some: [30], [15], [23], [24], [72], [83], [84], [73], and [86]. We do refer to them at times, although coreference resolution (CR) differs technically from coreference annotation (CA), on which this paper focuses.
While referring to these works and others to be cited, this paper aims at constructing a semantic annotation scheme (AS) for coreference and other anaphoric link phenomena in a language (English) that may be proposed as an ISO standard for language resources management. 5 The rest of the paper develops into seven sections. Section 2 reviews the five existing ASs, but in a non-critical way: 1. [5],
Section 3 shows how this paper adopts a two-level AS for anaphoric annotation, as followed by most of the anaphoric ASs. Given an input text, the first level simply identifies the whole set of referring or anaphoric expressions, called 'mentions', as possible markables. The second level is complex, consisting of three steps:
(3) a. Selecting possible anaphors from the set of possible markables.
b. Pairing each of the selected anaphors with its antecedent(s).
c. Specifying the type of each anaphor-antecedent relation, called 'anaphoric' link.
Each of these three steps listed in (3a,b,c) is constrained and triggered by a small set of attribute-value specifications for each anaphor-antecedent pair of the mentions and each type of the anaphoric link. Section 4 Formal description forms the main part of the paper, proposing (1) an integrated annotation scheme ASyn_ana for coreference and other anaphoric relations as an abstract syntax and (2) an XML-based concrete syntax CSyn_anaX derived from it with concrete attribute-value specifications. It is followed by Section 5 Illustrations.
Section 6 shows how semantic forms are compositionally derived from each of the XML-represented annotations (XML-elements). This semantic section is an attempt to prove the semantic adequacy of the proposed ASyn_ana at least in informal terms. This task is required by the general semantic principles on semantic annotation frameworks which are laid down in [36].
Section 7 Application to Korean is added to the original paper in order to test the multilingual extensibility of ASyn_ana with a possible modification of attribute-value specifications for a concrete syntax based on it. No corpus-based analysis is, however, presented, for that task alone requires a series of reports.
The abstract syntax of ASyn_ana with its concrete syntax CSyn_anaX presented in this paper is an extensively revised version of that presented by [54], with more formal rigor. Section 7 is a totally new addition to [54]. Section 8 concludes the paper with some summarizing remarks.

Review of Existing Annotation Schemes

2.1 Preliminaries
There are several types of anaphoric relation. The best known type is coreference, an equivalence (symmetric, transitive, and reflexive) relation that holds between two terms, called 'mentions', if and only if their denotations are identical.
(4) Definition-1: Coreference
Two terms $t_1$ and $t_2$ are coreferential if and only if $[\![t_1]\!]^M = [\![t_2]\!]^M$, where $[\![t_i]\!]^M$ is the denotation or referent of a term $t_i$ with respect to a model $M$.
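Because coreference as defined in (4) is an equivalence relation, pairwise links between mentions partition the set of mentions into coreference chains. The following Python sketch illustrates this with a small union-find structure. The function name `coreference_chains` is an illustrative assumption, and the identifiers are borrowed from the mt labels of example (2); none of this belongs to any of the schemes reviewed here.

```python
from collections import defaultdict


def coreference_chains(mentions, links):
    """Group mentions into the equivalence classes induced by pairwise links.

    mentions: iterable of mention identifiers (hypothetical labels).
    links: iterable of (mention, mention) pairs asserted to corefer.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the class representative,
        # shortening the path along the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    chains = defaultdict(set)
    for m in parent:
        chains[find(m)].add(m)
    # Sort for a deterministic, readable result.
    return sorted((sorted(c) for c in chains.values()), key=lambda c: c[0])


# Labels from example (2): the professor, Billy Jones, the Vietnamese woman.
MENTIONS = ["mt1", "mt13", "mt19", "mt21", "mt37"]
LINKS = [("mt1", "mt13"), ("mt13", "mt19"), ("mt21", "mt37")]

if __name__ == "__main__":
    print(coreference_chains(MENTIONS, LINKS))
```

Transitivity is what lets mt1 and mt19 end up in one chain although no link relates them directly. Bound-variable readings such as (5b) below would not be captured by such a partition, which is one reason to treat them as a separate link type.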
Coreference and binding are two different, but related linguistic phenomena, often discussed together. Consider examples (5a,b):
(5) a. John loves his mother. [coreference]
b. Everyone loves his mother. [binding]
In (5a), the pronoun his may be understood as coreferring with the name John as its antecedent. In (5b), on the other hand, the pronoun his does not corefer with the quantified noun phrase everyone, but is treated in formal semantics as a variable bound by the universal quantifier. This paper attempts to accommodate both the type of coreference, as defined in (4), and some other types of anaphoric phenomena, such as the one illustrated in (5b), into the proposed ASyn_ana with an XML-based concrete syntax CSyn_anaX.

Coindexing
In linguistics, coreference and binding are both annotated in the same manner by coindexing, as shown below.
(6) a. John_i loves his_i mother. [coreference]
b. Everyone_i loves his_i mother. [binding]
Split antecedents can also be represented by coindexing with a set index such as {x, y}. Here are some examples, (7a,b,c), taken from [78]:
(7)

2.3 The TEI-based Annotation Schemes

Overview
There are two almost identical ASs for coreference: [5] and [85]. They are both based on XML and also on the TEI Guidelines. They differ from each other mainly because [5] followed a much earlier version of the TEI Guidelines, while [85] is the most recent version, updated in December 2016.

[5]
The AS proposed by [5] consists of two (XML) elements: <rs> for referring strings and <link> for coreference. For each of the two elements, we can specify its associated attributes and possible values: 6
(9) a. Attributes for the element <rs>:
attributes = type, key;
type = "object";
key = ID;
b. Attributes for the element <link>:
attributes = type, targets;
type = "coref";
targets = IDREF IDREF;

[85]
[85] (16.5.1) discusses correspondence between textual segments:
(11) <title xml:id="SHIRLEY">Shirley</title>, which made its Friday night debut only a month ago, was not listed on <name xml:id="NBC">NBC</name>'s new schedule, although <seg corresp="#NBC" xml:id="NETWORK">the network</seg> says <seg corresp="#SHIRLEY" xml:id="SHOW">the show</seg> still is being considered. 9
In this textual fragment, the name Shirley is annotated as the title of a show being broadcast over NBC, a television network. The text contains no pronominal forms, but the two nominal forms, the show and the network, are understood as corresponding to the two names Shirley and NBC, respectively. The annotation of the two segments above, namely those nominal forms, introduces the attribute @corresp to indicate such a coreferential relation for each of the two.
As shown by (12), the use of the elements <linkGrp> and <link> makes correspondence relations more explicit:
(12) <linkGrp type="anaphoric link" targFunc="antecedent anaphor">
<link target="#Shirley #show"/>
<link target="#NBC #network"/>
</linkGrp>
The annotation (12) represents two instances of the anaphoric link involving anaphors and their antecedents. The attribute @target has two arguments, as specified by the attribute @targFunc: the first argument is the antecedent and the second the anaphor. The element <linkGrp> allows several instances of the anaphoric link to be grouped together, while simplifying the specification of the element <link> with a single attribute @target.
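To make the role of @targFunc concrete, here is a minimal Python sketch that reads the <linkGrp> of (12) and pairs each pointer in @target with the role named at the same position in @targFunc. The function name `read_links` is an assumption made for illustration, and the sketch uses only the standard-library XML parser rather than any TEI-specific tooling.

```python
import xml.etree.ElementTree as ET

# The <linkGrp> of (12), reproduced as given in the text.
LINK_GROUP = """
<linkGrp type="anaphoric link" targFunc="antecedent anaphor">
  <link target="#Shirley #show"/>
  <link target="#NBC #network"/>
</linkGrp>
"""


def read_links(xml_text):
    """Return one {role: identifier} dictionary per <link> in a <linkGrp>."""
    group = ET.fromstring(xml_text.strip())
    roles = group.get("targFunc", "").split()
    pairs = []
    for link in group.findall("link"):
        # Strip the leading '#' of each pointer.
        targets = [t.lstrip("#") for t in link.get("target", "").split()]
        if len(targets) != len(roles):
            raise ValueError(f"expected {len(roles)} targets, got {targets}")
        pairs.append(dict(zip(roles, targets)))
    return pairs


if __name__ == "__main__":
    for pair in read_links(LINK_GROUP):
        print(pair["anaphor"], "->", pair["antecedent"])
```

Declaring the roles once on the group keeps each <link> down to a single attribute, at the cost that a reader must consult the enclosing <linkGrp> to interpret the order of the pointers.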

Overview
[30]'s CTD (Coreference Task Definition) lists four purposes of constructing an AS in the order of their importance. The first two are: (1) to support the MUC (Message Understanding Conference) information extraction tasks and (2) to be able to achieve good (ca. 90%) inter-annotator agreement.
Creation of a corpus for research on coreference and discourse phenomena is the last goal.
The MUC projects were taken up by [17] of NIST (National Institute of Standards and Technology) and other organizations working on linguistic data. [57] then produced detailed guidelines on the annotation of entities and locations with the specification of their types. [1] more recently compared the ACE, ERE, TAC-KBP, and FrameNet annotation standards, focusing on event annotation. Besides the annotation of entities, the research interest in event annotation was kept up by works such as [2]. The report by [67] introduces the CoNLL2011 project with OntoNotes. In this paper, we just focus on [30] to see how anaphoric annotation started in its earlier stages.
In [30], verbs and other verbal forms such as gerunds (Slowing the economy) are not markables. 11 Implicit pronouns, that is, null anaphora (Bill_i called John and e_i spoke with him for an hour.) and resumptive or intrusive pronouns (the movie_i which I saw t_i) 12 as well as relative pronouns (complementizers) are not treated as markables.

Extents
The extent of a markable is a maximal string, while its head is marked with an attribute MIN (minimal string).The maximal noun phrases thus include their modifiers, appositional phrases, non-restrictive relative clauses, and prepositional phrases (Fred Frosty, the ice cream king of Tyson's Corner, MIN="Fred Frosty").

Coreference Links
Coreference in [30] is not restricted to referential identity. Here is the general principle for the annotation of coreference that they proposed:
(13) Definition 2: Coreferential
Two markables are coreferential if they both refer to sets, and the sets are identical, or they both refer to types, and the types are identical.
The principle (13) thus allows the possible coreferentiality between bound anaphora and quantified NPs that are their antecedents.

Examples (14) are given by [30] to illustrate bound anaphoric relations. 13
(14) a. {Most computational linguists}_i prefer their_i own parsers.
b. {Every TV network}_i reported its_i profits yesterday. They_i plan to release full quarterly statements tomorrow.

An SGML Serialization
[30] represents its coreference annotation in SGML. It introduces only one element, <COREF>, for the annotation of markables and also of their coreferential link type IDENT, with the following specification of attribute-values:
(15)
In example (17), the annotator is uncertain about the IDENT relation of the Board of Education and Livingstone Street, although they are locally identical. That is why STATUS="OPT" is introduced into the annotation (17).
In example (19) there are two occurrences of president, but they are not annotated as being coreferential.<COREF ID="7"> has neither REF nor TYPE specified as being IDENT.

The MMAX2 Multi-level Annotation Scheme

2.6 Overview
[79] use [61]'s GUI-based MMAX2 annotation tool for coreference resolution to build a fully coreference-annotated large corpus of 266 scholarly papers from the ACL anthology.Here we briefly introduce the MMAX2 coreference AS.

Anaphoric Links
Following [86] and their two other related works, [47,48], this AS differentiates coreference from other types of the anaphoric link.It suggests that the annotation of coreference proper be separated from other tasks such as annotation of bound anaphors and of the relation between a subject and a predicative NP.It calls for a division of labor that achieves better inter-annotator agreement.

Overview
The MATE annotation scheme for anaphoric relations has been developed mainly by [64], [65], and [66]. It has accommodated several annotation schemes, including the TEI-based AS by [5] discussed in 2.3 and the MUC-7 CTD by [30] discussed in 2.4. Its application domain extends from ordinary text to dialogues to discourses with the GNOME corpus. Its semantic theory, especially as related to discourse modeling, is solidly based on [87] and [39]. The MATE website has, however, been shut down, although [82] attempted to restore it by introducing a short summary of the MATE annotation scheme.

Example
Here is an example, taken from [64] and also from [65], illustrating how the MATE annotation scheme works.
(21) a. 5.12 : we're gonna take the engine E3
The XML element named <de> stands for discourse entities that are referred to by referring and anaphoric expressions. These expressions are markables. In example (21), the links, <link>, that capture anaphoric relations are represented standoff as in (b), while the markables are annotated inline as in (a).

Attribute-Value Specifications
As illustrated in 2.7.2, the MATE annotation scheme in XML has two elements: <de> and <link>. The element <de> annotates expressions that refer to discourse entities, specifying its attribute @ID with a unique value referring to each of the markables such as noun phrases and pronouns. The element <link> specifies anaphoric relations, containing one or more <anchor> elements.
Both <link> and <anchor> have an attribute @href. The value of the attribute @href for <link> refers to an anaphoric expression, whereas the value of @href for <anchor> refers to an antecedent. As shown in the last <link>, an anaphoric expression may have more than one antecedent or an ambiguous antecedent. 14 The element <link> has an attribute @type. The list of possible values includes ident for coreference and subset, denoting a subset of the set denoted by a discourse entity. This list can be extended to include values like part by following the DRAMA scheme by [63], as mentioned in [66], Extended Relations.
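The division of labor between <link> and <anchor> can be sketched in Python as follows. The XML fragment is invented for illustration: the wrapper element <links>, the identifiers de_01 to de_07, and the plain-identifier form of the @href values are all assumptions, and actual MATE files may encode pointers differently.

```python
import xml.etree.ElementTree as ET

# Invented standoff links; identifiers and the <links> wrapper are hypothetical.
STANDOFF = """
<links>
  <link href="de_03" type="ident">
    <anchor href="de_01"/>
  </link>
  <link href="de_07" type="subset">
    <anchor href="de_04"/>
    <anchor href="de_05"/>
  </link>
</links>
"""


def read_mate_links(xml_text):
    """Return (anaphor, link type, antecedents) triples from standoff links."""
    triples = []
    for link in ET.fromstring(xml_text.strip()).iter("link"):
        # @href on <link> points at the anaphoric expression;
        # @href on each <anchor> points at one antecedent.
        antecedents = [anchor.get("href") for anchor in link.findall("anchor")]
        triples.append((link.get("href"), link.get("type"), antecedents))
    return triples


if __name__ == "__main__":
    for anaphor, link_type, antecedents in read_mate_links(STANDOFF):
        print(anaphor, link_type, antecedents)
```

A list of anchors does not by itself distinguish a split antecedent from an ambiguous one, so an application reading such links would need further information to tell the two cases apart.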

The Brandeis ISO-Space Annotation Guidelines
The ISO-Space Working Group at Brandeis University produced a manual of annotation guidelines for spatial information [70] and proposed it as an annex of an earlier CD (committee draft) version of [35], an ISO international standard for spatial annotation. First, the annex of the CD version annotates spatial entities (I am sitting in the car_se.) 15 as referring expressions. Second, it introduces an element <metaLink> to annotate the three different types of coreference between these spatial entities:
b. spatialEntity(id=se1, extent=two cars, countable=yes)
spatialEntity(id=se2, extent=one, countable=yes)
spatialEntity(id=se3, extent=them, countable=yes)
metaLink(id=meta1, objectID1=se1, objectID2=se3, relType=coref)
metaLink(id=meta2, objectID1=se1, objectID2=se2, relType=subCoref)
The list of annotations in (24) can be interpreted as stating that the referents of the two spatial entities cars_se1 and them_se3 are identical, while one_se2 partially corefers with the spatial entity type expression cars_se1.
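The interpretation just given for (24) can be spelled out in a short Python sketch. The class names, the helper `describe`, and the English glosses it produces are illustrative assumptions; only the attribute names and values are taken from the listing above, and only the two relation types that occur there are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialEntity:
    id: str
    extent: str
    countable: bool


@dataclass(frozen=True)
class MetaLink:
    id: str
    objectID1: str
    objectID2: str
    relType: str  # "coref" or "subCoref" in the listing above


def describe(links, entities):
    """Gloss each metaLink in terms of the extents it relates."""
    by_id = {e.id: e for e in entities}
    readings = []
    for link in links:
        first = by_id[link.objectID1].extent
        second = by_id[link.objectID2].extent
        if link.relType == "coref":
            readings.append(f"'{second}' corefers with '{first}'")
        elif link.relType == "subCoref":
            readings.append(f"'{second}' refers to part of what '{first}' refers to")
        else:
            raise ValueError(f"unknown relType: {link.relType}")
    return readings


ENTITIES = [
    SpatialEntity("se1", "two cars", True),
    SpatialEntity("se2", "one", True),
    SpatialEntity("se3", "them", True),
]
LINKS = [
    MetaLink("meta1", "se1", "se3", "coref"),
    MetaLink("meta2", "se1", "se2", "subCoref"),
]

if __name__ == "__main__":
    for reading in describe(LINKS, ENTITIES):
        print(reading)
```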
3 Steps of Annotating Coreference and Other Anaphoric Relations

Overview
Given an input text, the task of coreference or other anaphoric link annotation is three-fold:
(26) Step A: Identification of a set of mentions in the text that refer to something in the domain of discourse referents as its markables,
Step B-1: Identification of a set of anaphor-antecedent pairs of the mentions that are anaphorically related, and
Step B-2: Specification of the type of such a relation.
To trigger and constrain these annotation steps, the entity type of mentions and the anaphoric relation are assigned a set of required or implied attribute-value specifications. Step B-1 and Step B-2 constitute two sub-processes unified into one, for they depend on each other. The input text can be of any size. It can range from a short sentence to a very large corpus.
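As a rough illustration of how attribute-value specifications can trigger and constrain Steps B-1 and B-2, the Python sketch below pairs each pronoun with the nearest preceding non-pronominal mention that agrees with it in gender and number. This heuristic is deliberately naive and is not the procedure proposed in this paper; the class `Mention`, its value sets, and the link type "ident" are assumptions made for the example. Step A is taken as already done, with the mentions of the sentence Bob loves Jane, but she doesn't love him.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    id: str
    text: str
    form: str    # e.g. "name" or "pronoun" (hypothetical value set)
    gender: str  # "m", "f", or "n"
    number: str  # "sg" or "pl"


def pair_anaphors(mentions):
    """Steps B-1 and B-2 with a naive nearest-compatible-mention heuristic."""
    links = []
    for position, mention in enumerate(mentions):
        if mention.form != "pronoun":
            continue  # only pronouns are treated as anaphors here
        # Scan leftwards for the closest non-pronominal mention
        # that agrees in gender and number.
        for candidate in reversed(mentions[:position]):
            if (candidate.form != "pronoun"
                    and candidate.gender == mention.gender
                    and candidate.number == mention.number):
                # Step B-2: every link found this way is typed "ident".
                links.append((mention.id, candidate.id, "ident"))
                break
    return links


# Output of Step A, supplied by hand for this sketch.
MENTIONS = [
    Mention("m1", "Bob", "name", "m", "sg"),
    Mention("m2", "Jane", "name", "f", "sg"),
    Mention("m3", "she", "pronoun", "f", "sg"),
    Mention("m4", "him", "pronoun", "m", "sg"),
]

if __name__ == "__main__":
    print(pair_anaphors(MENTIONS))
```

Even this toy version shows why Steps B-1 and B-2 are hard to separate: the decision to pair two mentions already presupposes a relation type under which they are compatible.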

Identifying Markables and Extents
The set of possible markables consists of terms or mentions, which comprise both referring and non-referring expressions in a text. As attested quantitatively by various reference resolution experiments such as [15], [24], [83], and [71], these mentions are mostly noun phrases (NPs). Noun phrases are then categorized into four forms, as shown in (27).
Definite and indefinite pronouns listed in (27d) are divided into four sub-classes, as shown in (28):
(28) a. Personal pronouns: e.g., he, she, they, it,
b. Reflexives: e.g., himself, herself, themselves,
c. Reciprocals: e.g., each other, one another, and
d. Demonstratives: e.g., this, these, those. 16
The list of features such as 'gender' and 'number' specifies what morphosyntactic features are required or implied for the identification of mentions. The annotation of these features could be done at earlier stages of annotating raw data, such as tokenization and morphosyntactic annotation. The process, whether manual or automatic, of marking up these mentions as markables should be straightforward at this basic level.

Anaphoric Links
The main task of annotating coreference and other types of the anaphoric link is to recognize antecedent-anaphor pairs among the set of markables and also to identify the type of their anaphoric link.

Anaphor-antecedent Pairs
Anaphors are part of the set of mentions, being mostly pronouns and other pronominal forms (see [42]). They are thus easily identified.
(29) a. Bob loves Jane, but she doesn't love him.
b. Bob was tired, and so was I.
16 Interrogatives are excluded.
Some definite noun phrases can be anaphors, too. Here are some examples:
(30) a. {The project leader}_i is refusing to help. {The jerk}_i thinks only of himself.
b. {Hilary Clinton}_i, {Bill's wife}_i.
Among the list of pronouns, we may also include the use of it referring to propositions, facts, actions, etc., or the use of so that may involve so-called sloppy identities, as shown below:
(31) a. John said {he has been to heaven}_i, but I don't believe {it}_i.
b. John {loves his wife}_i and so {does}_i? Bob.
Examples such as these are often discussed in the linguistic literature, but have seldom been treated in computational work.
The so-called expletive it and there, the complementizer that, and the impersonal use of the pronoun it, as shown in (32), are excluded from the list of possible anaphors as well as from the list of possible markables.
(32) a. It_exp's impossible to go out now, for it_imp's raining cats and dogs.
b. It_exp is also reported that_comp there_exp is a storm approaching from the south.
The identification of anaphors as well as mentions can also be triggered by the morphosyntactic features of markables.

Types of the Anaphoric Link
Unlike anaphors, antecedents can be of any class of word, phrase, or clause. They should, however, form a subset of the markables specified by Step A of annotation, listed in (26). If verbal forms are excluded from the set of markables for some practical reasons, then they would not be in the set of possible anaphors or antecedents.
The extent of antecedents is not restricted to a single word or phrase, but may extend to larger phrases such as conjoined phrases:
(33) {The boys_i and the girls_j}_k met at a party and they_k danced all night.
Antecedents need not be contiguous, either, but may be split into two or more phrases, as in:
(34) I_i met {a farmer}_j and {his_j dog}_k and we_{i,j,k} all walked together.
There are at least two uses of pronouns:
(35) a. anaphoric: John_i loves his_i wife.
b. indexical or deictic: Look at him_1. He_1 is naked.
Context: the speaker pointing to a person over there.
In the anaphoric use, the pronoun his_i finds its antecedent John_i in the given text. In the indexical use, the antecedent of the pronoun him_1 is not found in the text, but provided contextually.
Pronouns can be antecedents as well as anaphors in the chain of an anaphoric link.
(36) John_i1 loves {his_i2 wife}_j and she_j also loves him_i3.
Anaphoric links may be forward or backward. The term that corefers with a pronoun normally precedes it, thus being called 'antecedent'. This so-called antecedent may also come after its related anaphor, as in:
(37) When she_i returned home, Sue_i was surprised to find her dog gone.
In such a case, the pronoun is often called 'cataphor'. Sometimes it is difficult to decide which is an anaphor and which is its antecedent, especially in appositive cases (Seoul_i, {the capital of South Korea}_j, where i and j corefer). In such cases, we simply have to state that they correspond to or corefer with each other.
The antecedent-anaphor relation is normally a one-to-one relation, but there are cases in which the antecedent of an anaphor is split into many. Besides this case of split coreference, [70] lists subCoreference as another type of coreference:
(38) I have {two cars}_i, but one_j of them_i broke down.
Here one_j is a member of the set of two cars_i.

Overview
[7] provides a formal description of the annotation structure, consisting of two levels of syntax:
(39) a. One is an abstract level of an annotation, called 'abstract syntax';
b. Another level, a concrete level of representing annotations, called 'concrete syntax'.
Every abstract syntax for semantic annotations must be supported by an explicit (formal) semantics, as sketched in section 6. An XML-serialization of an abstract syntax is an instance of a concrete syntax. The semantics of a concrete syntax is defined as the semantics of the abstract syntax for which it defines a concrete representation. (Different representations of the same abstract syntax thus have the same semantics.) Following [7], we first sketch a general view of 'abstract syntax' in 4.2. We then follow a modified version of [55] to present a general abstract syntax ASyn in 4.3 and an abstract syntax for anaphoric links ASyn_ana as a particular instantiation of the general abstract syntax ASyn in 4.4.

A Sketch of Abstract Syntax
According to [7] and two of his subsequent works, [8] and [10], the abstract syntax of an annotation scheme consists of two parts:
(40) a. A conceptual inventory that specifies the basic concepts from which annotation structures are built up;
b. A specification of the possible ways of combining elements of the conceptual inventory into annotation structures.
An annotation structure is a set, consisting of two kinds of elements: entity structures and link structures.Entity structures provide linguistic information about a region of primary data; link structures provide information about the semantic relation between regions of primary data.
In the case of annotating coreference and other anaphoric link types, entity structures correspond to the entities that are related by anaphoric links, and link structures to the linkings of anaphoric expressions to their antecedents.
An entity structure is a pair <m, a>, where m is a markable that identifies a region of primary data, and a is the specification of the semantic information that the annotation provides about that region of primary data. In the abstract syntax ASyn_ana for anaphoric annotation, the a component of an entity structure is an n-tuple, 3 ≤ n ≤ 6, consisting maximally of a semantic type t, a definiteness d, a morphosyntactic form f, a natural gender g, a plurality p, and a collectiveness c (more about these elements below). The fact that the length n of these n-tuples may vary reflects the optionality of some of the elements.
A link structure is a triplet <ε, η, ρ>, where
(41) a. ε is an entity structure and η a set of entity structures, corresponding to an 'anaphor' and a set of its 'antecedents', respectively; 17
b. ρ is a relation between a pair of entity structures or between an entity structure and a set of entity structures.
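The two kinds of structure can be mirrored by two small Python data classes, shown here as a sketch under stated assumptions. The text does not say which components of the n-tuple are optional, so treating t, d, and f as required and g, p, and c as optional is an assumption, as are the class names, the method `is_split`, and all attribute values in the example, which encodes the split antecedent of we in I met a farmer and his dog and we all walked together.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class EntityStructure:
    markable: str                         # m: region of primary data
    semantic_type: str                    # t
    definiteness: str                     # d
    form: str                             # f: morphosyntactic form
    gender: Optional[str] = None          # g (assumed optional)
    plurality: Optional[str] = None       # p (assumed optional)
    collectiveness: Optional[str] = None  # c (assumed optional)


@dataclass(frozen=True)
class LinkStructure:
    anaphor: EntityStructure                 # epsilon
    antecedents: FrozenSet[EntityStructure]  # eta
    relation: str                            # rho

    def is_split(self) -> bool:
        """True when the anaphor has more than one antecedent."""
        return len(self.antecedents) > 1


# Hypothetical attribute values for the four mentions of the example.
SPEAKER = EntityStructure("I", "person", "definite", "pronoun")
FARMER = EntityStructure("a farmer", "person", "indefinite", "nominal")
DOG = EntityStructure("his dog", "animal", "definite", "nominal")
WE = EntityStructure("we", "person", "definite", "pronoun", plurality="plural")

SPLIT_LINK = LinkStructure(WE, frozenset({SPEAKER, FARMER, DOG}), "ident")

if __name__ == "__main__":
    print(SPLIT_LINK.is_split())
```

Making η a set rather than a single entity structure lets the one-antecedent case and the split case share one representation.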
For the abstract syntax ASyn_ana for anaphoric annotation, the conceptual inventory is a 9-tuple <M, T, D, F, G, P, C, Q, R>, where
e. G is a set of natural genders;
f. P is a set of singular/plural values;
g. C is a set of 'collectivity values';
h. Q is a set of generalized quantifiers; and
i. R is a set of binary relations over the set of entity structures, corresponding to the various types of anaphoric links.
17 Split coreference has more than one antecedent.
The annotation structures are defined by an assignment @ that specifies the semantic components of entity structures.
For each markable m in M, @(m) is an n-tuple, 3 ≤ n ≤ 7, of elements from T × D × F × G × P × C × Q. 18 We do, however, follow [52,53] and his more recent work [55] in formalizing an abstract syntax in algebraic terms that are often used in defining formal grammars. We also use the extended BNF of [37] as a meta-language to specify various features of data types because BNF is expressively more powerful than simple set-theoretic listing.

General Abstract Syntax
Given a fragment L_i of a language as primary data for annotation, the general structure of an abstract syntax ASyn for an annotation structure can be formally defined as a triple:
(43) <M, T, @>, where
1. M is a set of (possibly null or non-contiguous) strings of character segments, called 'markables', in L_i, delimited by E_t;
2. T is a finite set of semantic types, consisting of E_t and L_t, where E_t is a finite set of 'entity types' and L_t is a finite set of the types of link over E_t;
3. @ is a set of assignment functions from T to linguistic features.
For semantic annotation, its markables in M are strings of character segments which are identified as tokens, words or phrases in a fragment of a language, given as its primary data. 19This is so because semantic annotation normally presupposes that its input data has been preprocessed by word segmentation or morpho-syntactic analysis.
Empty strings are allowed as markables in M. They represent so-called non-consuming tags with their use licensed in [35]. 20Consider example (44), (44) a. John i got up late and / 0 i had to skip breakfast.
b.We camped / 0 pl1 near the river.
The missing subject in the right conjunct of example (44a) can be represented as a non-consuming empty tag, if necessary.The / 0 pl1 in (44b) refers to some place pl1 near the river and the event of camping is anchored to that place pl1 .
T is a very small set, consisting of a very small number of entity types E t and link types L t .As will be shown in 4.4, the sets M, E t , and L t are interrelated by the set @ of feature assignment functions associated with those sets. 21A particular function @ i in @ assigns an entity type t in E t and other associated features F i to each markable m in M, generating an annotation structure <m, @ i > such that @ i (m) = t ⊕ F i .
For each link type in L t , @ also defines a set of (binary) relations ρ over the set <M, E t > of entity structures. Each of the specifications can be represented in extended BNF [37] as a function @ i over T that assigns a value type such as CDATA (character data) or IDREF (identifier reference) associated with XML to each of the features associated with each markable m in M. The particular names of features and their value types mentioned in the specifications are not fixed, but may vary for each concrete syntax or dialect.

Abstract Syntax for Anaphoric Annotation

General
The abstract syntax ASyn ana for anaphoric annotation is a particular instantiation of the general abstract syntax ASyn, <M, T, @>, introduced in Section 4.3 (General Abstract Syntax).
(45) ASyn ana consists of:
1. M, a set of (possibly null or non-contiguous) sequences of tokens or words that refer to objects of the entity types specified in E t ;
2. T, a set of semantic types consisting of two types, a single basic entity type and a single anaphoric link type, such that each link type is associated with a triple <ε, η, ρ>; and
3. @, a set of feature assignments to be specified for each of the types in T in extended BNF separately.
Among the elements of the 9-tuple <M, T, D, F, G, P, C, Q, R> introduced in (42), the first two elements M and T exactly correspond to the M and T in ASyn ana . The other seven elements, D, F, G, P, C, Q, and R, are treated by the feature assignment function @ which is associated with each of the types in T.

Feature Assignments @
Associated with ASyn ana , the feature assignment @ entity assigns to each entity, which is referred to by a markable, a value for each of the attributes specified by D, F, G, P, C, Q, and R, as shown in (46).
The feature assignment function associated with the type link is specified as in (47):
(47) Feature Assignment @ link :
attributes = identifier, anaphor, antecedents, relType;
identifier = unique ID;
anaphor = IDREF;
antecedents = IDREFS;
relType = CDATA;
The feature assignments (46) and (47) specify only the data type of a value for each attribute, which is either required or implied, without providing any specific values. Specific values are given by concrete syntaxes based on the abstract syntax for anaphoric annotation.
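As a rough illustration of what the data types in (47) require, the following Python sketch checks one link structure against them. It is only a sketch under stated assumptions: the function `check_link`, the dictionary encoding of a link, and the ID pattern (a letter prefix followed by an integer) are hypothetical and not part of the proposed scheme.

```python
import re

# Assumption: IDs consist of a letter prefix plus an integer, e.g. "x3", "anaL1".
ID_PATTERN = re.compile(r"^[A-Za-z]+\d+$")

# Data types of the four attributes listed in (47).
LINK_SPEC = {
    "identifier": "ID",
    "anaphor": "IDREF",
    "antecedents": "IDREFS",
    "relType": "CDATA",
}


def check_link(link, known_ids):
    """Return a list of problems found in one link structure."""
    problems = []
    for attribute, data_type in LINK_SPEC.items():
        if attribute not in link:
            problems.append(f"missing attribute: {attribute}")
            continue
        value = link[attribute]
        if data_type == "ID":
            # An ID must be well-formed and not already in use.
            if not ID_PATTERN.match(value) or value in known_ids:
                problems.append(f"{attribute} is not a fresh ID: {value}")
        elif data_type == "IDREF":
            if value not in known_ids:
                problems.append(f"{attribute} refers to unknown ID: {value}")
        elif data_type == "IDREFS":
            # IDREFS is a whitespace-separated list of references.
            for ref in value.split():
                if ref not in known_ids:
                    problems.append(f"{attribute} refers to unknown ID: {ref}")
        # CDATA: any character data is accepted.
    return problems


entity_ids = {"x1", "x2"}
good = {"identifier": "anaL1", "anaphor": "x2",
        "antecedents": "x1", "relType": "ident"}
bad = {"identifier": "anaL2", "anaphor": "x9",
       "antecedents": "x1 x2", "relType": "ident"}
```

Here `check_link(good, entity_ids)` reports no problems, while `check_link(bad, entity_ids)` flags the anaphor because "x9" does not identify any entity structure.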

XML-based Concrete Syntax
We introduce an XML-based concrete syntax CSyn anaX , corresponding to the abstract syntax ASyn ana for anaphoric annotation. First, the annotation of each of the entity or link structures is represented as an XML element. CSyn anaX introduces two elements, <entity> and <anaLink>,²² that correspond to the entity structures and the link structures, respectively, as defined in ASyn ana . Both of these XML elements have an @identifier attribute in order to allow references from within the representation of a link structure to the representations of specific entity structures or other link structures. Moreover, <entity> structures have a @target attribute for representing the markables that they associate linguistic information with.
The feature assignment function for <anaLink> is specified as in (49):
[...] *}
ground = IDREFS; {* Standing for the entity type antecedent. With its value being IDREFS, the attribute @ground may have more than one antecedent as its value. The indexical use of a pronoun may not have an antecedent in the element <entity>. *}
relType = "ident", "partIdent", "splitIdent",³⁰ "qBound", "recip";
comment = CDATA;
The attribute name @figure stands for an anaphor, while the attribute name @ground stands for the entity type antecedent.³¹ The attribute @relType introduces values other than ident for referential identity. These values allow for types of anaphoric link other than coreference proper. The use of each of the values of the attribute @relType is illustrated as in (50).
We introduce an XML element <anaX> with its ID prefix ax as the root element for XML documents in the concrete XML annotation scheme CSyn anaX for coreference and other types of anaphoric link. Every annotation based on CSyn anaX is enclosed by a pair of the root element <anaX> and its closing element </anaX>.
23 The identifier is tagged xml:id for XML documents, otherwise id. Examples of ID values are: "x3", "x20".
24 The attribute @target has an extent ID in a tokenized source text or the extent itself as its value. This value can be a (possibly null or non-contiguous) sequence of tokens or their IDs.
25 See [57] for the list of entity types.
26 Verbal forms including sentential or adjectival forms are excluded.
27 Optional attributes have a value "unspecified" as default.
28 CDATA allows any other values.
29 The "universal" quant is a logical quantifier, while CDATA allows other generalized quantifiers that include a variety of existential quantifiers.
30 "splitIdent" is the inverse of "partIdent".
31 The MATE annotation scheme names it "anchor".
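The way a CSyn anaX document is put together can be sketched with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative only: the sentence (John hates himself), the token IDs used as @target values, and the helper names `build_annotation` and `dangling_references` are assumptions, and only the attributes named in the text (xml:id, target, figure, ground, relType) are used.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name with the reserved prefix "xml".
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_annotation():
    """Build <anaX> with two <entity> elements and one <anaLink>."""
    root = ET.Element("anaX", {XML_ID: "ax1"})
    # Entity structures; the target token IDs w1 and w3 are assumed.
    for ident, target in (("x1", "w1"), ("x2", "w3")):
        ET.SubElement(root, "entity", {XML_ID: ident, "target": target})
    # The reflexive (figure) is linked to its antecedent (ground).
    ET.SubElement(root, "anaLink", {
        XML_ID: "anaL1", "figure": "x2", "ground": "x1", "relType": "ident",
    })
    return root


def dangling_references(root):
    """List figure/ground references that match no xml:id in the document."""
    ids = {element.get(XML_ID) for element in root.iter()}
    missing = []
    for link in root.iter("anaLink"):
        refs = [link.get("figure")] + link.get("ground", "").split()
        missing.extend(ref for ref in refs if ref not in ids)
    return missing


root = build_annotation()
xml_text = ET.tostring(root, encoding="unicode")
```

Because @ground is of type IDREFS, the check splits its value on whitespace, so a split antecedent such as ground="x25 x28" would be handled in the same way.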

Semantic Interpretations: A Sketch
As stated earlier in the introductory section, every semantic annotation must be supported by an explicitly defined semantics. This is not merely an interesting or optional exercise. Possibly unlike POS-tagging or syntactic parsing, the construction of a semantics for the abstract syntax of semantic annotation is a task required for any valid semantic annotation scheme, as explicitly stipulated in [36]. An abstract syntax, or a concrete syntax based on it, cannot be justified without a supporting semantics.
The use of the lambda calculus in the line of [60] or that of the discourse representation structures (DRSs), proposed by [38] and [39], can, for instance, be linked to the abstract syntax to provide such a semantics for semantic annotations. Attempts have been made by [41], [68], [6,8,9], and [51] to develop an annotation-based semantics with the use of lambda calculus or by [7] with the use of DRSs. The use of lambda abstraction has run into the problem of complexity, especially in dealing with multiple quantification and embedded adjunct structures. The same problem should arise in the treatment of various anaphoric phenomena. There are at least two interesting works that try to overcome this complexity problem. One is an earlier work by [62], which proposed a way of combining Montague semantics with DRSs. The other is the most recent work by [9], which directly addresses the treatment of anaphoric phenomena by combining underspecified representation (USR), which arises because of the presence of context-dependent expressions such as pronouns, with annotation information representation (AIR). In constructing these representation structures, [9] shows how useful and necessary it is to combine the introduction of discourse referents in DRSs with markables in the annotation into USR and AIR, especially when there are multiple occurrences of identical anaphoric expressions, that is, pronouns, in a text.
Here is a very simple illustration showing how a list of annotations is converted into a semantic representation:
(58) a. Text: John hates himself.
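The conversion step can be sketched in Python as follows. This is a hypothetical simplification, not the derivation given in the paper: the dictionary encoding of the annotations, the function `interpret`, and the flat output of discourse referents plus conditions (loosely modeled on DRSs) are all assumptions. Only the referential part is covered; the predicate contributed by the verb is outside the annotation and is therefore left out.

```python
def interpret(entities, links):
    """Turn entity and link annotations into (referents, conditions)."""
    # Each entity structure introduces one discourse referent.
    referents = [entity["id"] for entity in entities]
    conditions = []
    for entity in entities:
        # Assumption: proper names contribute a naming condition.
        if entity.get("form") == "name":
            conditions.append(f"named({entity['id']}, {entity['text']})")
    for link in links:
        # A link of type "ident" equates the figure with each ground.
        if link["relType"] == "ident":
            for ground in link["ground"].split():
                conditions.append(f"{link['figure']} = {ground}")
    return referents, conditions


entities = [
    {"id": "x1", "text": "John", "form": "name"},
    {"id": "x2", "text": "himself", "form": "pronoun"},
]
links = [{"id": "anaL1", "figure": "x2", "ground": "x1", "relType": "ident"}]

referents, conditions = interpret(entities, links)
```

Under these assumptions the annotation of John hates himself yields the referents x1 and x2 together with the conditions named(x1, John) and x2 = x1.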
Second, as proposed and discussed in formal model-theoretic semantics (see [60], [3], [58], and [45]), proper names, definite descriptions, indefinite singular (a dog) or bare plural (donkeys) noun phrases as well as quantified noun phrases (three students, every man) are also interpreted as referring to sets of sets, or properties, in the world. In our treatment, universally quantified expressions are differentiated from other types of generalized, but existentially quantified, expressions.
Third, pronouns, on the other hand, do not refer directly to any entities in the world, but only through being coreferential with some other terms in the text (anaphoric use) or by referring to some entities that are provided contextually in a discourse situation (indexical use) (see [44]). Nevertheless, pronouns are also marked up as referring expressions or mentions in coreference annotation (see [16], [19], [20]). There have been two trends in formal semantics that allow both of them to be treated as the same type of binding phenomena. One trend was initiated by [60] and further developed by [3], [58], [45], and a host of others. According to this trend, all noun phrases, including proper names (e.g., George Washington) and definite descriptions (e.g., the king of France), are treated as (generalized) quantifiers just like the universal and the existential quantifiers in logic, and are all interpreted as denoting sets of sets, as illustrated below:
Here, [|α|]^M stands for the denotation of an expression α with respect to a model M.
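The idea that noun phrases denote sets of sets can be made concrete with a small Python sketch over a toy model. The domain, the sets `MAN` and `WALK`, and the function names are invented for illustration; the definitions follow the standard generalized-quantifier treatment, in which a proper name denotes the set of all sets containing its bearer.

```python
from itertools import chain, combinations

# A toy model: three individuals and two properties (all assumed).
DOMAIN = {"john", "bill", "fido"}
MAN = frozenset({"john", "bill"})
WALK = frozenset({"john", "fido"})


def powerset(domain):
    """All subsets of the domain, as frozensets."""
    items = sorted(domain)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(subset) for subset in subsets]


def proper_name(individual):
    """Denotation of a name: every set that contains the individual."""
    return {X for X in powerset(DOMAIN) if individual in X}


def every(noun):
    """Denotation of 'every N': every set that includes N."""
    return {X for X in powerset(DOMAIN) if noun <= X}


def some(noun):
    """Denotation of 'a/some N': every set that overlaps with N."""
    return {X for X in powerset(DOMAIN) if noun & X}


john_walks = WALK in proper_name("john")
every_man_walks = WALK in every(MAN)
some_man_walks = WALK in some(MAN)
```

A sentence is true in this model when the denotation of the verb phrase is a member of the denotation of the subject noun phrase, so "John walks" and "some man walks" hold here while "every man walks" does not.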
Possibly initiated by [40] among the linguists, another trend has been to treat pronouns as bound variables. As a variable, a pronoun x has its semantic value σ(x) determined not by a model M, but by an assignment g that ranges over the domain D of a model M such that σ(x) = g(x), also written [|x|]^g. Since [40] there have been discussions, either in the framework of formal semantics or in the binding theory of generative grammar, on the status of pronouns among linguists and logicians such as [16], [29], [19], [20], [38], and [78]. As a result, various types of pronouns have come to be known, including the so-called 'resumptive', 'intrusive',³⁴ and 'lazy/paycheck' pronouns. Here are some examples:³⁵
(62) Resumptive or intrusive pronoun:
a. This is the girl_i that whenever it rains she_i cries.
b. This is the camel_i that I think he_i likes Oscar.
(63) Lazy/paycheck pronoun:
c. John spent his paycheck_i on beer. Everyone else put it_i? in the bank.
Another question has to do with the extension of markable mentions. Linguists such as [40], [29], [28], [42], [43], [88], [89] and many others have discussed resumptive and lazy pronouns, while [74], [46], and others have discussed the occurrences of pronouns in modal contexts. Hence, to produce semantically adequate ASs, it may be necessary to accommodate all their ideas into a unified AS for anaphoric links.
34 [80] claims that English has no genuine resumptive pronouns, but that they should be 'intrusive' pronouns.
35 Examples (a) and (b) were copied from Wikipedia, the free encyclopedia; see also [?], cited in Wikipedia. Example (c) was recopied from [88].
The interpretation of indexical pronouns such as I, we and you is more complex than their surface form suggests. It requires a clear understanding of dialog or discourse situations in which these indexicals interact with each other, especially in a language like Korean that has a complex honorific system. Little attention has, however, been given to the hearer-speaker mode exchange phenomena in communication except for [27] and many of his earlier works [25,26]. [27] introduces the notion of anchoring that grounds the content of indexicals to real communication situations. In section 7, we introduce elements such as <metaData> and <metaLink> to deal, if only in a sketchy way, with the speaker and hearer references that are often omitted in a pro-drop language like Korean, so that utterances such as the following can be understood:
(64) a. from [27]
Hanul-ey keysin wuli apeci, ilum-i keluk.hi yekim-ul patusimye nala-i imhaopsimye ttus-i hanul-eyse-wa kath-i ttang-eyse-to ilwu.e.ci.ita.³⁶
'Our Father, who art in Heaven, hallowed be Thy name, Thy Kingdom come, Thy will be done on earth as it is in Heaven.'
(66) Lord's Prayer: Catholic New Version
Hanul-ey keysin wuli apeci, apeci-ui ilum-i keluk.hi picnasimye apeci-ui nala-ka osimye apeci-ui ttus-i hanul-eyse-wa kath-i ttang-eyse-to ilwu.e.cisose.
Both (65) and (66) show the rare use of pronouns in Korean: neither of them has any pronoun other than 우리 (wuli), corresponding to the pronoun our. The Protestant version of the Lord's Prayer has no word corresponding to thy or your, while the Catholic version has 아버지 (apeci) 'father' for the addressee, thy in English. These datasets show the characteristics of Korean as a typical pro-drop language that allows the dropping not only of subject pronouns, but also of any other contextually recoverable pronouns.
(67)
The pro-dropping is often compensated by the repetition of the nouns referred to, as shown by (83). Here are a couple of examples from [56]³⁷ that show how noun phrases are repeated:
na-nun kamca-ka manh.i nanun kocang eyse thaye nase kamca-lul muchek coh.a hanta. kuliko kamca-lul coh.a hanun salam-ul mannamyen hay-mata kamca-lul senmul-lo ponayko siphecitna.
'I love (my) mother. I can't even imagine a world without her. Mothers are really like the earth to me.'
[56] claim that there are no third-person pronouns in Korean. As an agglutinating language, Korean can, however, produce a host of third-person pronominal expressions, based on the three basic demonstrative prenouns, as listed in (71):

Production of Pronouns in Korean
(71) a. 이 (i) 'this': pointing to some entity in close proximity
b. 그 (ku) 'that': simply referring to some entity that has been mentioned or pointing to some entity in near proximity
c. 저 (ce) 'that (over there)': pointing to some entity at a distance
[14] calls 이 (i) 'this' proximal, 그 (ku) 'that' medial, and 저 (ce) 'that over there' distal.

Pragmatic Features Associated with Pronouns in Korean
The first and second person pronouns are seldom used in spoken language. If they are used, their use is much constrained by pragmatic principles involving the speaker, the addressee(s), and the persons or things referred to in an utterance. 너 (ne), 자네 (caney), 당신 (tangsin), 너희 (nehui), and 여러분 (yelepun) are second-person pronouns, referring to an addressee or a group of addressees. The first two forms, 너 (ne) and 자네 (caney), and the two plural forms, 너희 (nehui) and 여러분 (yelepun), are plain speech style forms, whereas the remaining form, 당신 (tangsin), is an addressee-honoring form or plain form. The use of these forms is very much restricted, too. In prayers or some writings, 당신 (tangsin) is accepted as a humble form, but in ordinary dialog situations it is often misunderstood as failing to convey respect to the addressee. There are many other addressee-referring forms such as 어르신 (elusin) 'elder', 부인 (puin) 'lady', 선생님 (sensayngnim) 'teacher', 사모님 (samonim) 'teacher's wife', 강아지 (kang.aci) 'puppy', 삼춘 (samchwun) 'uncle', 이모 (imo) 'aunt', 오빠 (oppa) 'brother', and 자기 (caki) 'self', but their use is very much pragmatically constrained, reflecting various social-level relations between the speaker and the addressee. 어르신 (elusin) is used only when an elderly person is addressed, but older people may not want to be addressed as such. 자기 (caki) is a third-person reflexive pronoun, but is often used among young married couples addressing each other.
When followed by a nominative (subject) marker, 나 (na), 저 (ce), and 너 (ne) undergo a vowel-raising process, thus changing their forms to 내 (nay), 제 (cey), and 네 (ney), respectively. These forms are also used as genitive case forms.
(74)
This is the prayer that Jesus taught his disciples.³⁹ The annotation of this prayer undergoes four annotation processes: (a) word segmentation, (b) metadata, (c) identification of markables, and (d) the annotation of links.
The last link <anaLink xml:id="anaL6"> in (88) can be interpreted as in (89).
The interpretation (91) is a case of symmetric reciprocity. It is a particular case of (90) in which the cardinality of A is 2, |A| = 2, such that z = y.
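The relation between the general and the symmetric case can be illustrated with a short Python sketch. It rests on an assumption that may differ in detail from (90): reciprocity is read here as strong reciprocity, under which every member of the group A stands in the relation R to every other member. The function name and the sample data are hypothetical.

```python
from itertools import permutations


def strongly_reciprocal(group, relation):
    """True if R(x, y) holds for every ordered pair of distinct members."""
    return all((x, y) in relation for x, y in permutations(group, 2))


# With |A| = 2 the condition reduces to symmetry: R(a, b) and R(b, a).
couple = {"a", "b"}
mutual_love = {("a", "b"), ("b", "a")}
one_sided_love = {("a", "b")}

# With three members, a chain a -> b -> c -> a is not enough.
trio = {"a", "b", "c"}
chain_love = {("a", "b"), ("b", "c"), ("c", "a")}

symmetric_case = strongly_reciprocal(couple, mutual_love)
one_sided_case = strongly_reciprocal(couple, one_sided_love)
chain_case = strongly_reciprocal(trio, chain_love)
```

For a two-member group the check succeeds exactly when the relation holds in both directions, which is the symmetric situation described for (91).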

Concluding Remarks: A Summary
This paper is an extended and revised version of [54]. It mainly aimed to make some linguistic contributions as necessary groundwork to an ISO initiative to produce an international standard on anaphoric annotation as part of a series of semantic annotation schemes. It has thus reviewed several existing annotation schemes for anaphoric links, including TEI-based [5], [85], [30], [79] based on [61], [64], [65], and [70], in efforts to normalize them into an interoperable annotation scheme implementing common denominators as an ISO international standard on semantic annotation.
Section 4 through section 6 form the core of this paper. Following the initial proposal of [7] and his subsequent works including [10], a basic work for [36], this paper adopts a formal approach to the construction of a semantic annotation scheme for anaphoric annotation, which consists of: 1. an abstract syntax ASyn ana , 2. a set of possible concrete syntaxes CSyn ana , and 3. a semantics supporting the abstract syntax.
As designed by [55], the proposed abstract syntax ASyn ana for anaphoric annotation is derivable from the general abstract syntax ASyn, providing an anaphoric annotation scheme structured with a two-level specification language in abstract terms: 1. the level of identifying entities, called 'discourse entities', as markables and 2. the level of linking each anaphor to one or more antecedents, called 'anchors' or 'grounds'.
ASyn ana allows a variety of types of anaphoric linking for coreference and other types of anaphoric relations by specifying only their data type.
ASyn ana allows the derivation of a class of concrete syntaxes CSyn ana that are semantically equivalent, but each with a different representation format. The proposed concrete syntax CSyn anaX is based on XML, providing an XML-based representation format. The basic entity type for discourse entities that are referred to by either referring or anaphoric expressions is tagged <entity> with its ID prefix being "x" which is followed by an integer for unique identification. The attribute-value specification associated with <entity> has a long list of optional or implied attributes such as @definite, @naturalGender and @plurality other than a small number of required attributes, @xml:id, @target, @type and @form, depending on the language to which the annotation scheme is applied.
The XML-based concrete syntax CSyn anaX for English has a single linking tag <anaLink> for anaphoric relations. Besides its ID with its prefix "anaL", which is needed for doing semantics as a handle for combinatorial operations, the element <anaLink> has three required attributes: @figure, @ground,⁴¹ and @relType. The attribute @ground may have more than one antecedent as its value. The list of values for the attribute @relType includes: "ident" for coreference, "partIdent" for partial identity, "splitIdent",⁴² "qBound" for quantifier binding, and "recip" for reciprocals.
The formulation of a semantics for an abstract syntax is required if the abstract syntax is for semantic annotation. Section 6 thus showed how semantic forms can be derived from the list of annotations or XML elements in CSyn anaX in a compositional manner in order to validate the proposed ASyn ana even partially. The task of constructing a model-theoretic semantics for ASyn ana is left as a future agenda item.
By applying ASyn ana to Korean, we have shown how adequately a concrete syntax CSyn anaX can be modified and extended. One such example is the treatment of the reciprocal expression 서로 (selo) 'each other' in Korean. Given an adequate semantics of reciprocity in language, reciprocal expressions in Korean as well as in English are shown to be adequately annotated and interpreted.
In treating Korean, we have also extended ASyn ana to accommodate the notion of discourse situations as metadata, tagged <metaData>, that specifies various discourse-involving entities such as the speaker, the addressee(s), the utterance time and the utterance place as well as a link that relates them with each other. These entities are then tagged in XML: <speaker>, <addressee>, <uTime>, and <uPlace>, with the link being tagged <metaLink>. The element tagged <metaData> is the root element for the annotation of discourse situations.
The current versions of the proposed ASyn ana and CSyn anaX both need to be evaluated against inter-annotator agreed annotations with large data, especially as applied to Korean. Such a task is, however, left as a future project.
41 These attribute names stand for an anaphor and a set of antecedents.
42 "splitIdent" is the inverse of "partIdent".

(42)
a. M is a non-empty set of markables;
b. T is a set of semantic types;
c. D is a set of definiteness values;
d. F is a set of morphosyntactic forms;
List of Attributes and Possible Values for <COREF>
John owns {two cars}_i. One_i1 of them_i broke down.
c. splitIdent: antecedents are split into more than one;
I_i met {a farmer}_j and {his_j dog}_k and we_{i,j,k} all walked together.
d. qBound: case of bound anaphors;³²
{Every farmer}_x owns a donkey_y. They_x beat it_y.
Every_x farmer loves his_x wife.
e. recip: We_x5 love [one another]_x5. (reciprocal)
a. Step 1: Identification of Markables
Not long ago {a retired professor_x1 of a small university_x2 in Essex_x3}_x4 lost {his_x5 wealthy wife_x6}_x7, who_x8 had been ill for {over ten years}_x9. After {her_x10 death_x11}_x12, {the professor}_x13 and {their_x14 only daughter_x15}_x16 found {a large sum of money}_x17 left for them_x18 to share.
b. Step 2a: Identification of Possible Anaphors and Step 2b: Links
<anaX xml:id="ax1"> <entity xml:id="x1"
Step 1: Identification of Markables:
b. Text: Billy Jones_x25 took {his_x26 family_x27}_x28 to {a manor}_x29 in {the country}_x30 near Edinburgh_x31. There_x32 they_x33 spent most of {a year}_x34, but stayed in {Cannes, France}_x35, in the winter season_x36.
The antecedent of the pronoun they_x33 is split into Billy Jones_x25 and family_x16. The antecedent of the locative pronoun There_x32 is ambiguous: it can be either a manor_x29 or the country_x30. This can be represented as in (54), for both of the situations hold:
Example (55b) can be annotated as below:
(57) a. If Pedro_x1 owns {a donkey}_x2, he_x3 beats it_x4.

Table 1. Constraints on the Use of First and Second Person Pronouns
나 (na), 저 (ce), 우리 (wuli), and 저희 (cehui) are first-person pronouns, each referring to a speaker or speakers. 나 (na) is a plain speech style form, while 저 (ce) is called a humble or self-lowering style form, meaning that the speaker is considered as paying due respect to the addressee(s). 우리 (wuli) and 저희 (cehui) are plural forms, corresponding to the singular forms 나 and 저, respectively.³⁸ 저희 (cehui) is a humble form.