Grammar-Based Recognition of Documentary Forms and Extraction of Metadata

William Underwood

Metadata extraction is a critical aspect of ingestion of collections into digital archives and libraries. A method for automatically recognizing document types and extracting metadata from digital records has been developed. The method is based on a method for automatically annotating semantic categories such as person’s names, job titles, dates, and postal addresses that may occur in a record. It extends this method by using the semantic annotations to identify the intellectual elements of a document’s form, parsing these elements using context-free grammars that define documentary forms, and interpreting the elements of the form of the document to identify metadata such as the chronological date, author(s), addressee(s), and topic. Context-free grammars were developed for fourteen of the documentary forms occurring in Presidential records. In an experiment, the document type recognizer successfully recognized the documentary form and extracted the metadata of two-thirds of the records in a series of Presidential e-records containing twenty-one document types.

This paper is based on the paper given by the author at the 5th International Digital Curation Conference, December 2009; received November 2009, published June 2010.

The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. ISSN: 1746-8256. The IJDC is published by UKOLN at the University of Bath and is a publication of the Digital Curation Centre. The International Journal of Digital Curation, Issue 1, Volume 5 | 2010.


Introduction
The increasing volume of digital records being acquired by archives and libraries poses significant challenges to archivists' manual procedures for processing records. Archivists traditionally describe records at record group (or collection), series, file unit and item levels. This provides the archive's intellectual control over its holdings and supports access to the records. Archival descriptions (summaries or metadata) include the names of the types of records that occur in a record series, for example, correspondence, memoranda or agenda. Record descriptions also include authors' and addressees' names as well as the topics of records. Archivists cannot completely describe a collection until the collection has been manually read and reviewed. With increasing volumes of electronic records, it may be decades or even centuries before new acquisitions are described. An automated method of metadata extraction and description is needed.
The next section of this paper reviews the concept of documentary form and related concepts. Related research in document type (or genre) identification is then summarized. The method for recognizing documentary forms and extracting document metadata is described next, followed by an implementation and experimental evaluation of the method. Finally, the results of the research are summarized with a discussion of open research issues.

Documentary Form, Record Types and Document Type
The International Council of Archivists (1999) in its standard for archival description defines a (documentary) form as "A class of documents distinguished on the basis of common physical (e.g. water colour, drawing) and/or intellectual (e.g. diary, journal, day book, minute book) characteristics of a document". The standard also specifies that the names of forms be used in describing record series and titling records.
The National Archives and Records Administration's guideline for cataloging archival materials defines specific records type as "the intellectual format of the archival materials" (NARA, 2008). The purpose of the specific records type is that it "Enables users to search for archival materials by the types of document represented in the archival materials". The guidelines also specify that specific records types be used in describing record series.
The science of diplomatics defines documentary form as "the rules of representation used to convey a message, that is, the characteristics of a document which can be separated from the determination of the particular subjects, or places it concerns. Documentary form is both physical and intellectual" (Duranti, 1998). The intellectual form of a document is "the sum of a record's formal attributes that represent and communicate the elements of the action in which the record is involved and of its immediate context, both documentary and administrative". The physical form of a document is "the overall appearance, configuration, or shape, derived from its material characteristics and independent of its intellectual content" (Duranti, 1998).

The Standard Generalized Markup Language (SGML) uses a Document Type Definition (DTD) to define document form (International Standards Organization, 1986). A DTD specifies a set of elements, their relationships, and the tag set used to mark up the document. The Extensible Markup Language (XML) is a simpler subset of SGML (World Wide Web Consortium, 2006). The concept of document structure as defined by an XML DTD is a formal model of the concept of the intellectual form of a document.
The concept of genre is similar to that of documentary form but includes classes of documents that are characterized not by their intellectual or physical form, but by pragmatic or rhetorical features. Examples of written genres include academic prose, biography, instructional material and newspaper reports. See Santini (2004b) for a discussion.
Figure 1 shows examples of the names of some of the specific documentary forms (record types) discovered in Presidential e-records.

Related Research
The reader is referred to Santini (2004b) for a survey of state-of-the-art approaches to genre identification of digital documents. Santini (2004a) also describes a method based on part-of-speech trigrams for classifying ten genres including conversations, interviews, public debate, biography and reportage. The objective of the research of Kim and Ross (2007a, 2007b) is the recognition of genre for the purpose of metadata extraction from digital records ingested into digital archives or libraries. Their approach is to identify features of documents that will allow them to automatically classify documents by genre. The features they have identified include image features, syntactic features, stylistic features, semantic structure, and domain knowledge features. These features are used with an image classifier, an n-gram model classifier and a stylometric classifier.
Our research differs from that of Kim and Ross, and of other researchers in genre identification, in that our objective is to recognize a document's form by parsing its intellectual elements using grammars characterizing document types. However, there are document types for which it is necessary to use pragmatic features to recognize the genre, for example, white papers and biography.

A Method for Recognizing Documentary Forms and Extracting Document Metadata
Legacy and current Presidential e-records are not XML documents, but e-records in proprietary file formats. However, it will be shown that it is possible to define, recognize and annotate the intellectual elements of a textual e-record, and that the structure of the intellectual elements of a particular documentary form can be defined with rules similar to those of an XML document type definition. This will enable the recognition of documentary forms and extraction of document metadata.
The process of automatically recognizing the document types of documents in proprietary file formats is outlined in Figure 2. The italicized phrases to the right of the downward pointing arrows indicate inputs and outputs of the numbered processing steps (Underwood & Laib, 2008).
Steps one through six constitute a previously implemented method for automatically annotating semantic categories in text such as persons' names, job titles, dates, location names, postal addresses and organization names (Underwood & Isbell, 2008). The input to the method is an e-record in a proprietary file format. The first step converts that record to a plain text or HTML file format. The third step, Wordlist Lookup, matches the terms (tokens) in the document against approximately 170,000 terms in 181 wordlists for such classes as person first names, surnames, city names, country names, months, and organizational nouns. If there is a match, the text is annotated with a tag for the name of that class. The sixth step, Semantic Tagger, applies rules to the previously annotated text to produce additional annotations, for example, persons' full names and locations made up of city and state or country names.
The seventh step, Intellectual Element Annotator, recognizes and annotates the intellectual elements occurring in a document. Currently, there are about 100 intellectual element rules. They apply to the annotated document and identify text strings such as FROM:, SUBJECT:, Attachment, or previously annotated semantic categories such as date, address and person's name as intellectual elements. Figure 4 shows the document in Figure 3 after the annotation of the intellectual elements.
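The wordlist lookup, semantic tagging and intellectual element annotation steps described above can be illustrated with a minimal Python sketch. It is not the implemented system: the three tiny wordlists, the single first-name-plus-surname rule, the keyword table and the function names are all hypothetical stand-ins for the 181 wordlists and roughly 100 intellectual element rules.

```python
import re

# Hypothetical miniature wordlists (the described system uses about
# 170,000 terms in 181 lists).
WORDLISTS = {
    "first_name": {"sam", "jane"},
    "surname": {"skinner", "doe"},
    "month": {"january", "march"},
}

# Hypothetical keyword rules mapping literal strings to intellectual elements.
ELEMENT_KEYWORDS = {"MEMORANDUM FOR": "for", "FROM:": "from", "SUBJECT:": "subj"}


def wordlist_lookup(text):
    """Tokenize text and pair each token with the wordlist classes it matches."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)
    return [
        (tok, sorted(c for c, words in WORDLISTS.items() if tok.lower() in words))
        for tok in tokens
    ]


def tag_full_names(annotated):
    """Semantic-tagger-style rule: a first name followed by a surname is a person."""
    return [
        f"{tok1} {tok2}"
        for (tok1, cls1), (tok2, cls2) in zip(annotated, annotated[1:])
        if "first_name" in cls1 and "surname" in cls2
    ]


def annotate_line(line):
    """Return the intellectual elements found on one line, in order."""
    elements = []
    for keyword, element in ELEMENT_KEYWORDS.items():
        if line.upper().startswith(keyword):
            elements.append(element)
            line = line[len(keyword):]  # continue with the text after the keyword
            break
    elements.extend("person" for _ in tag_full_names(wordlist_lookup(line)))
    return elements


header = ["MEMORANDUM FOR SAM SKINNER", "FROM: JANE DOE"]
element_sequence = [e for line in header for e in annotate_line(line)]
```

The resulting sequence of element names is the kind of input a grammar over intellectual elements would consume; the real annotator also carries the matched text along as the value of each element.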

The names of the intellectual elements shown in Figure 4 are chron(ological)date, for, person, from, subj, topic, para and attachment.
The eighth step, SUPPLE Parser/Interpreter (Gaizauskas, Hepple, Saggion, Greenwood, & Humphreys, 2005), recognizes the document type using a parser/interpreter with a context-free grammar that characterizes the intellectual form of a document type. A context-free grammar is a 4-tuple <N, T, R, S> where N is a set of non-terminal symbols, T is a set of terminal symbols, R is a set of rules of the form A→w (A is a member of N and w is a string of symbols drawn from N and T), and S is a member of N called the initial symbol. Linguists use context-free grammars to define the structure of sentences in a natural language, and computer scientists use them to define programming languages.
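As a small illustration of the 4-tuple definition, the following Python sketch represents a grammar as a data structure and checks the conditions stated above. The class name, field names and the one-rule example grammar are assumptions made for illustration; they are not part of SUPPLE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    """A context-free grammar <N, T, R, S>."""
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    rules: tuple             # R: (lhs, rhs) pairs, rhs being a tuple of symbols
    start: str               # S

    def is_well_formed(self):
        """Check S is in N, N and T are disjoint, and every rule is A -> w."""
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            and not (self.nonterminals & self.terminals)
            and all(
                lhs in self.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in self.rules
            )
        )


# Hypothetical one-rule grammar: SUBJLINE -> subj topic
subjline_grammar = CFG(
    nonterminals=frozenset({"SUBJLINE"}),
    terminals=frozenset({"subj", "topic"}),
    rules=(("SUBJLINE", ("subj", "topic")),),
    start="SUBJLINE",
)
```

In this representation the terminals correspond to intellectual element names and the non-terminals to structural units of a documentary form.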
Figure 5 shows some of the rules of a context-free grammar for the intellectual form of a memorandum. MEMO is the initial symbol of the grammar. The first rule defines a MEMO as consisting of a MEMOHEAD followed by a BODY. The BODY may be followed by OPTIONAL elements. A MEMOHEAD consists of an intellectual element DATE followed by an ADDRLINE followed by a SNDRLINE followed by a SUBJLINE. Optionally, there may be a THRULINE between the ADDRLINE and SUBJLINE. An ADDRLINE consists of an intellectual element FOR followed by ENTITIES. The SNDRLINE consists of an intellectual element FROM followed by ENTITIES. The SUBJLINE consists of an intellectual element SUBJ followed by an intellectual element TOPIC. ENTITIES consist of a sequence of one or more intellectual elements PERSON, JOBTITLE, or PERSON JOBTITLE. The BODY consists of a sequence of intellectual elements PARA. An OPTIONAL element consists of an intellectual element ATTACHMENT or a CCLIST or a BCCLIST, or combinations of these. A CCLIST consists of an intellectual element CC followed by ENTITIES. A BCCLIST is defined similarly.
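The rules just described can be written down and exercised with a short recognizer. The Python sketch below is an illustration only: it transcribes the prose description rather than Figure 5 itself, omits the optional THRULINE, uses lowercase element names as terminals, introduces hypothetical helper symbols (ENTITY, OPTITEM), and uses a simple memoized top-down recognizer in place of the SUPPLE parser.

```python
from functools import lru_cache

# Non-terminals map to alternative right-hand sides; any symbol that is not
# a key (date, for, from, subj, topic, person, jobtitle, para, attachment,
# cc, bcc) is treated as a terminal intellectual element.
GRAMMAR = {
    "MEMO": [["MEMOHEAD", "BODY"], ["MEMOHEAD", "BODY", "OPTIONAL"]],
    "MEMOHEAD": [["date", "ADDRLINE", "SNDRLINE", "SUBJLINE"]],
    "ADDRLINE": [["for", "ENTITIES"]],
    "SNDRLINE": [["from", "ENTITIES"]],
    "SUBJLINE": [["subj", "topic"]],
    "ENTITIES": [["ENTITY"], ["ENTITY", "ENTITIES"]],
    "ENTITY": [["person"], ["jobtitle"], ["person", "jobtitle"]],
    "BODY": [["para"], ["para", "BODY"]],
    "OPTIONAL": [["OPTITEM"], ["OPTITEM", "OPTIONAL"]],
    "OPTITEM": [["attachment"], ["CCLIST"], ["BCCLIST"]],
    "CCLIST": [["cc", "ENTITIES"]],
    "BCCLIST": [["bcc", "ENTITIES"]],
}


def recognizes(start, elements):
    """Return True if the element sequence can be derived from `start`."""
    elements = tuple(elements)

    @lru_cache(maxsize=None)
    def ends(symbol, i):
        """All positions j such that `symbol` derives elements[i:j]."""
        if symbol not in GRAMMAR:  # terminal symbol
            matched = i < len(elements) and elements[i] == symbol
            return frozenset({i + 1}) if matched else frozenset()
        result = set()
        for rhs in GRAMMAR[symbol]:
            positions = {i}
            for sym in rhs:
                positions = {j for p in positions for j in ends(sym, p)}
                if not positions:
                    break
            result |= positions
        return frozenset(result)

    return len(elements) in ends(start, 0)


memo = ["date", "for", "person", "from", "person", "jobtitle",
        "subj", "topic", "para", "para", "attachment"]
is_memo = recognizes("MEMO", memo)  # True for this sequence
```

The recognizer terminates because every recursive rule in this grammar consumes at least one element before recursing; a left-recursive grammar would need a different parsing strategy.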

Figure 6 shows the grammar shown in Figure 5 augmented with semantic rules that create an interpretation of the meaning of the documentary form, that is, a representation of the name of the document type, its date, author, addressee, and topic. The Intellectual Element Annotator assigns a value to each of the intellectual elements in the grammar. For example, for the annotated document in Figure 4, the intellectual element PERSON after the intellectual element MEMORANDUM FOR will get the value 'SAM SKINNER'.
In Figure 6, the two percent symbols (%%) indicate a comment. A grammar rule such as A → B1, …, Bn is represented to the parser by a rule of the form rule(A [B1, …, Bn]). The grammar rules are augmented with semantics by the notation included in parentheses after the symbols in the rules, e.g. rule(A( ) [B1( ), …, Bn( )]). For instance, the rule shown at the bottom of Figure 6 is used to recognize that a PERSON's name is an ENTITY. The value of the intellectual element PERSON is passed to the left-hand side of the rule, ENTITY, and a list [name, E, PERSON] is created whose semantic value is associated with ENTITY. When the rule ENTITIES → ENTITY is used to recognize an ENTITY as ENTITIES, the semantic value of ENTITY is passed to ENTITIES. When the intellectual element FOR followed by ENTITIES is recognized, the semantic value of ENTITIES is passed to ADDRLINE where it is made the value of ADDRList. When CHRONDATE, ADDRLINE, SNDRLINE and SUBJLINE are recognized, the semantic value of each of these elements is passed to the variables DATE, ADDRList, SNDRList, and TOPIC, and these become the semantic values of MEMOHEAD. When MEMOHEAD and BODY are recognized, the semantic values of MEMOHEAD become the semantic values of MEMO.
The White House casual letters that were not recognized failed because the semantic category annotator did not recognize a postal address, a person's name or a job title. This problem can be addressed by improving the performance of the semantic annotator.
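Returning to the semantic rules of Figure 6, the upward passing of semantic values from ENTITY through ADDRLINE and SNDRLINE to MEMOHEAD can be sketched in Python. This is a hand-written recursive-descent interpreter, not SUPPLE's rule notation; the function names, the dictionary layout of the result, and all example values other than 'SAM SKINNER' are hypothetical. It also assumes that a PERSON immediately followed by a JOBTITLE forms one entity.

```python
def expect(elements, i, kind):
    """Consume one (kind, value) element at position i and return its value."""
    if i >= len(elements) or elements[i][0] != kind:
        raise ValueError(f"expected {kind} at position {i}")
    return elements[i][1], i + 1


def parse_entities(elements, i):
    """ENTITIES: one or more of PERSON, JOBTITLE, or PERSON JOBTITLE."""
    entities = []
    while i < len(elements) and elements[i][0] in ("person", "jobtitle"):
        kind, value = elements[i]
        entity = {"name": value} if kind == "person" else {"title": value}
        i += 1
        # Assumption: a job title directly after a person belongs to that person.
        if kind == "person" and i < len(elements) and elements[i][0] == "jobtitle":
            entity["title"] = elements[i][1]
            i += 1
        entities.append(entity)
    if not entities:
        raise ValueError(f"expected at least one entity at position {i}")
    return entities, i


def interpret_memohead(elements):
    """MEMOHEAD -> DATE ADDRLINE SNDRLINE SUBJLINE, collecting semantic values."""
    date, i = expect(elements, 0, "date")
    _, i = expect(elements, i, "for")
    addressees, i = parse_entities(elements, i)  # becomes the ADDRList value
    _, i = expect(elements, i, "from")
    senders, i = parse_entities(elements, i)     # becomes the SNDRList value
    _, i = expect(elements, i, "subj")
    topic, i = expect(elements, i, "topic")
    return {"type": "memorandum", "date": date,
            "addressees": addressees, "senders": senders, "topic": topic}


# Annotated elements as (name, value) pairs; values other than 'SAM SKINNER'
# are made up for the example.
annotated = [
    ("date", "1 January 1992"), ("for", "MEMORANDUM FOR"),
    ("person", "SAM SKINNER"), ("from", "FROM:"), ("person", "JANE DOE"),
    ("jobtitle", "Staff Secretary"), ("subj", "SUBJECT:"),
    ("topic", "Example topic"),
]
metadata = interpret_memohead(annotated)
```

Each function returns the value it has built to its caller, which mirrors how a semantic value attached to the right-hand side of a rule becomes the value of the symbol on the left-hand side.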

Conclusions
The results of this research are that: (1) the intellectual elements of documentary forms can be defined in terms of the keywords and semantic categories in a document, (2) documentary forms (record or document types) can be defined using context-free grammars, and (3) grammars for documentary forms can be used with a parser/interpreter for context-free grammars to automatically recognize the documentary form of textual records while simultaneously identifying document metadata including date, author, addressee, and topic.
Context-free grammars have been constructed for fourteen of the documentary forms that occur in Presidential e-records. Rules were constructed for recognizing the intellectual elements of these documentary forms. These grammars were translated into context-free attribute grammars that were used with a parser to parse and interpret the intellectual elements of Presidential e-records. The resulting semantic representation can be used to extract metadata needed for archival description and for record search and retrieval.
The intellectual elements of a documentary form were identified either by reference to a style manual for the form or by comparing examples of a document type to identify those elements of the examples that did not change from one example to another. The question arises: can the intellectual elements of a particular documentary form be learned from examples without a teacher?
The question also arises: could grammatical induction be used with samples of a particular documentary form to induce a grammar automatically rather than manually? This would eliminate the manual effort needed to construct grammars from large samples, and could provide a method for automatically refining the grammar when examples of a documentary form were encountered that did not fit the current grammatical model. It would also facilitate the extension of documentary form recognition to a larger number of documentary forms. Underwood and Harris (2006) demonstrated that it is possible to induce a grammar for the documentary form of White House memoranda and correspondence from sequences of intellectual elements occurring in samples of these document types. One of the obstacles to progress in this research was that samples of document types were created from OCRed paper documents and the intellectual element recognizer had not been created. Now, the intellectual element recognizer and document type recognizer have been interfaced to PERPOS. There are hundreds of thousands of e-records in the PERPOS repository that can be used in grammatical induction experiments. The intellectual element recognizer is being modified to output the intellectual elements of a record for use in grammatical induction rather than in recognition. This research has not progressed to the point that there are experimental results to report.

The research described in this paper addressed only the intellectual form of documents. In further research, rules will be formulated for recognizing the physical elements of the physical form of a document. These are elements such as the fonts, font sizes, underlining, horizontal bars, bold and italics. These features are important for recognizing the layout and appearance of a document and for defining additional intellectual elements such as headings.

Figure 1. Documentary Forms in Presidential Records.

Figure 2. The Process of Document Type Recognition and Metadata Extraction.

Figure 3. Document with Annotated Paragraphs and Semantic Categories. Figure 3 shows a document whose paragraphs, dates, times, and person, location and organization names have been annotated by the first six steps of the method.

Figure 5. Grammar for the Intellectual Form of a Memorandum.

Figure 6. Part of the Grammar for the Intellectual Form of a Memorandum Augmented with Semantic Rules.