“The naming of cats”: Automated genre classification

This paper builds on the work presented at the ECDL 2006 in automated genre classification as a step toward automating metadata extraction from digital documents for ingest into digital repositories such as those run by archives, libraries and eprint services (Kim & Ross, 2006b). We have previously proposed dividing features of a document into five types (features for visual layout, language model features, stylometric features, features for semantic structure, and contextual features as an object linked to previously classified objects and other external sources) and have examined visual and language model features. The current paper compares results from testing classifiers based on image and stylometric features in a binary classification to show that certain genres have strong image features which enable effective separation of documents belonging to the genre from a large pool of other documents.


Background and Objective
In [29], we summarised the valuable role of automated metadata extraction in the cost-effective and efficient management of digital collections: metadata play a key role in management processes ([43], [23]) and the manual creation of metadata is expensive ([15], [23], [40]). As we pointed out in [29], ERPANET's ([18]) Packaged Object Ingest Project ([19]) identified automatic extraction tools for technical metadata (e.g. [33], [35]), and substantial work on descriptive metadata extraction within specific domains has been conducted (e.g. [32], [13], [2], [50], [21], [22], [6], [26], [47], [51]) along with other work in information extraction from text (e.g. [3], [9], [49], [48]). However, a general tool has yet to be developed to extract metadata from digital objects of varied types and genres. This paper further develops the concepts of genre classification introduced in [29]: the automatic grouping of documents into distinctive document types, followed by focused metadata extraction from single document types, as a means of creating a tool capable of extracting metadata across many domains at different semantic levels. To reiterate the argument in [29], identifying the genre first provides a mechanism to limit the scope of document forms from which to extract other metadata. Within a single genre, metadata such as author, title, keywords, identification numbers or references can be expected to appear in a specific style and region, and independent methods have been developed for genre-specific extraction of such metadata for some classes of documents (e.g. Scientific Papers). Note also that different institutions focus on collecting and managing digital materials in different genres; genre classification will support automating the identification, selection, and acquisition of materials in keeping with local collecting policies.
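A review of Biber ([7]), Karlgren et al. ([25]), Kessler et al. ([27]), Rauber et al. ([44]), Bagdanov et al. ([4]), Boese ([8]), Finn et al. ([20]) and Santini ([45]) exemplifies the lack of consensus on the definition of genre. Biber's analysis of document genres employed five functional dimensions (information, narration, elaboration, persuasion, abstraction) to characterise text, while Karlgren et al. and Boese concentrated on more popularly accepted genre classes such as FAQ, Job Description, Editorial or Reportage. Kessler et al. tried to address both types, while Finn et al. studied binary classifications (fact versus opinion, positive versus negative reviews). Santini discussed general genre facets, while Bagdanov limited his task to detecting specific journals and brochures. Others ([44], [5]) attempted the clustering of documents rather than classification. An overview of the various efforts in genre analysis can be found in a technical report by Santini ([46]). A broader review of metadata extraction and genre classification is also being prepared by the DELOS NoE Digital Preservation Cluster and is expected to be completed before the publication of this paper.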
The variety of definitions adopted by these researchers illustrates a confused interplay of two notions: one of structure and one of function. Structure is defined by the visual layout and is expected to be distinguishable mostly by measurable features such as the amount of white space; the length of the document, sentences, or words; and the presence or absence and location of headers, delimiters, images, or links. Function, on the other hand, is defined by the intended role of the document and is expected to be characterised mostly by linguistic models and semantic analyses of the documents. The two notions are closely linked together by medium, process or event. For example, a scientific research article is usually structured so that a title is present on the first page followed by author, affiliation, a body of text consisting of sections, and finally a list of references. It has the function of communicating, arguing or describing research. The interrelationship of structure and function is represented by the formatting requirements of journals or conventions in the community or event for which the document was created. The requirements and conventions evolve to optimise the communicative intentionality within the context; other communities or events may find different structures of documents to optimise the same function. Just as biologists study DNA as the building blocks of living organisms to understand the classes into which they have evolved within their environment, it seems important to identify documents by their structure and their function separately as building blocks to infer their genre class within a standardised schema. We seek to achieve this by a full analysis of five document feature types: image features, syntactic features, stylistic features, semantic structure, and domain knowledge features. We aim to build a system which models the five feature sets for a schema of approximately seventy genres (Table 1).
The genres in Table 1 are not meant to be static: the schema has been evolving as we develop and incorporate well-structured classification standards and as we become aware of digital genres we had not encountered before or which have just emerged in the digital domain. The experiments in this paper have initially limited the study to the image and stylistic feature sets on the nineteen most prolific genres in our experimental data set. Along with the results in [29], the results here are intended to be another step towards a full analysis.
The experimental data in this paper is from the pool of 570 PDF ([37]) files that were sampled randomly from the Internet as described in [29]. As explained in [29], by confining the work to studying PDF files, we hope to put a boundary on the problem space, while working with a widely used portable format for digital objects ingested into digital repositories. This paper, along with [29] and [30], is intended to show the promise of combining separate classifiers trained on different types of features for genre classification. Also note that the bottom-up approach of starting from genre-specific extraction may result in several tools which are overly dependent on the structures of the documents in the domain, with no obvious means of interoperability; the top-down approach of creating a tool which looks across genres, to be refined further within the domain, will enable us to avoid this problem.

Classifiers
The experiments described in this paper involve the use of two classifiers:
- Image classifier: this classifier depends on features extracted from the PDF document when handled as an image. It uses the module pdftoppm from XPDF ([36]) to extract the first page of the document as an image. The resulting image is divided into a sixty-six by sixty-six grid. The Python Imaging Library (PIL) ([41], [39]) is then employed to extract pixel values in each region. Each region is given a value of 0 or 1 depending on the amount of non-white pixel values it contains. The result is modelled using Naïve Bayes, available with the Weka ([52]) machine learning toolkit.
- Stylo-metric classifier: this classifier looks at the frequency of selected words, the number of font changes, the difference between the largest and smallest font sizes, the length of the document, the average length of words, and the number of words on the front page of the document. The font information was extracted at the level of words using a modified version of pdftohtml ([38]), developed by Volker Heydegger at the University of Cologne. The modified version converts a PDF document to an XML file with all the font information for each word in the document. A word list was automatically constructed containing all words which appear in more than half of the files in any one genre. For each file, the frequency of each word was recorded as a vector, then augmented by length and font information. The result was modelled using Naïve Bayes in the Weka ([52]) machine learning toolkit.
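The grid binarisation step of the image classifier can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes the first page has already been rasterised (for example with pdftoppm) and loaded as a list of rows of greyscale values between 0 and 255. The function name `grid_features`, the white threshold of 250, and the rule that a region receives 1 when more than 5% of its pixels are non-white are all assumptions, since the cut-off used in the experiments is not stated.

```python
def grid_features(pixels, grid=66, white_threshold=250, min_fraction=0.05):
    """Return a flat list of grid*grid binary features for a greyscale page.

    pixels: list of rows, each a list of greyscale values (0 = black, 255 = white).
    A region gets 1 when its share of non-white pixels exceeds min_fraction.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    height, width = len(pixels), len(pixels[0])
    features = []
    for gy in range(grid):
        # Integer bounds of this grid row; each pixel falls in exactly one region.
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            total = (y1 - y0) * (x1 - x0)
            dark = sum(
                1
                for y in range(y0, y1)
                for x in range(x0, x1)
                if pixels[y][x] < white_threshold
            )
            # Empty regions (page smaller than the grid) are treated as white.
            features.append(1 if total and dark / total > min_fraction else 0)
    return features


# Toy 4x4 "page" with a dark block in the top-left corner, cut into a 2x2 grid.
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
toy_features = grid_features(page, grid=2)
```

With the default sixty-six by sixty-six grid, each page yields 4,356 binary attributes, which is the kind of vector a Naïve Bayes learner can consume directly.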
This paper expresses the view that the image along with the stylistic features will capture the structural elements of genres while the language model combined with the stylistic and semantic features will help to separate documents of distinct functional categories. Involving the image of a document in the classification also enables the management of documents without violating security, maximises the viability of a language-independent tool, frees the process from being solely dependent on text processing tools with encoding requirements and problems relating to special characters, and makes the method applicable to paper documents digitally imaged (i.e. scanned).
There are three main experiments described in this paper:
- Clustering experiment: this experiment compared the cluster resolution for two sets of features: the image features and the stylo-metric features. We grouped the data in nineteen genres into two clusters using the Expectation-Maximisation algorithm of the Weka Machine Learning Toolkit ([52]). The purpose was to see how well the files in each genre group into one cluster. The result is expressed in terms of the percentage of files within each genre which have been grouped into one cluster.
- Periodicals versus Thesis: in this experiment, we took documents in the genres Periodicals and Thesis. We used the image classifier to classify the documents using 10-fold cross validation.
- Periodicals versus Non-periodicals: we expanded on the experiment above to group four more genres with the genre Thesis as one group labelled Non-periodicals. The four additional classes are Business or Operational Report, Minutes, Fictional Book, and Academic Book. They were chosen from the genres that were grouped in the same image cluster as Thesis.
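The measure reported for the clustering experiment can be illustrated with a short sketch. It assumes that "the percentage of files grouped into one cluster" means the share of a genre's files that fall into that genre's most common cluster, so that an even split over two clusters scores 50%. The function name `majority_cluster_share` and the sample assignments are hypothetical; the cluster assignments themselves would come from the clustering tool.

```python
from collections import Counter, defaultdict


def majority_cluster_share(genres, clusters):
    """Map each genre to the percentage of its files in its most common cluster."""
    per_genre = defaultdict(Counter)
    for genre, cluster in zip(genres, clusters):
        per_genre[genre][cluster] += 1
    return {
        genre: 100.0 * max(counts.values()) / sum(counts.values())
        for genre, counts in per_genre.items()
    }


# Hypothetical assignments of six files to two clusters (0 and 1).
genres = ["Thesis", "Thesis", "Thesis", "Thesis", "Periodicals", "Periodicals"]
clusters = [0, 0, 0, 1, 1, 0]
shares = majority_cluster_share(genres, clusters)
```

Under this reading, a value close to 50% indicates that the feature set shows no preference between the two clusters for that genre, while a value close to 100% indicates that the genre clusters well.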

Results
Table 2 shows the results of the clustering experiment. The key finding in this experiment is that the genres for which image features fail to cluster are the genres for which stylo-metric features cluster very well. For instance, note that Scientific Research Articles divide half and half into each cluster with no preference when using the visual features, while ninety-two percent group into one cluster when using stylo-metric features. The opposite is true of Periodicals.
The results described in Tables 3, 4 and 5 use three standard indices in classification tasks: accuracy, precision and recall. Let $N$ be the total number of documents in the data set, $N_c$ the number of documents in the data set which are in class $C$, $T$ the total number of correctly labelled documents in the data set independent of the class, $T_c$ the number of true positives for class $C$, and $F_c$ the number of false positives for class $C$. Accuracy is defined to be $A = \frac{T}{N}$; precision and recall for class $C$ are defined to be $P_c = \frac{T_c}{T_c + F_c}$ and $R_c = \frac{T_c}{N_c}$, respectively.
Table 3 gives the result when the data set was confined to Periodicals and Theses. The accuracy was surprisingly high. To check whether the results actually reflect the distinctiveness of image features in periodicals, the experiment was repeated with four more classes of non-periodical documents added to Thesis to form a group of Non-periodicals (results in Tables 4 and 5). A slight decrease in performance is visible in Table 4 (cf. Table 3), but the accuracy is still quite high. On the other hand, the results for the stylistic classifier in Table 5 show that stylistic features do not fare as well in distinguishing Periodicals. For a proper evaluation of the performance, a significance test is in order (pending), but a difference of 17.6 percentage points in overall accuracy cannot be ignored by even the strictest of observers, and a decrease in precision on Periodicals from 73.9% to 47.8% (a difference of 26.1 percentage points) inspires the belief that the visual features are better equipped to distinguish periodicals from the other five genres.
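The three indices can be computed directly from paired lists of true and predicted labels. The sketch below follows the definitions of accuracy, precision and recall given above; the function name `evaluate` and the toy labels are illustrative assumptions and do not reproduce any figure from the tables.

```python
def evaluate(true_labels, predicted_labels, target):
    """Return (accuracy, precision, recall), with precision and recall for `target`."""
    pairs = list(zip(true_labels, predicted_labels))
    n = len(pairs)                                    # N: all documents
    t = sum(1 for actual, pred in pairs if actual == pred)  # T: correctly labelled
    n_c = sum(1 for actual, _ in pairs if actual == target)  # N_c: documents in class C
    t_c = sum(1 for actual, pred in pairs if actual == target and pred == target)  # T_c
    f_c = sum(1 for actual, pred in pairs if actual != target and pred == target)  # F_c
    accuracy = t / n
    # Guard against empty denominators when the class is never predicted or never present.
    precision = t_c / (t_c + f_c) if (t_c + f_c) else 0.0
    recall = t_c / n_c if n_c else 0.0
    return accuracy, precision, recall


# Toy example: "P" = Periodicals, "N" = Non-periodicals (made-up labels).
truth = ["P", "P", "P", "N", "N", "N", "N", "N"]
guess = ["P", "P", "N", "P", "N", "N", "N", "N"]
accuracy, precision, recall = evaluate(truth, guess, target="P")
```

In this toy case six of the eight documents are labelled correctly, two of the three documents predicted as Periodicals are correct, and two of the three true Periodicals are recovered.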

Conclusion and Further Research
The results in [29] and the results in this paper indicate the promise of using a multi-layered decision tree on many different sets of features to classify genres. The results in Table 2 show definite divisions between genres which have strong image features and genres that have strong stylistic features. The results in Tables 2 and 3 indicate that Periodicals have more clearly distinguishing image features than stylo-metric features, while Table 4 suggests that Thesis shares its image features with four other genres. Previous reports ([29], [30]) indicated that Scientific Research Articles are not easily distinguishable by image features from Product Descriptions but are better distinguished when using syntactic features. If all these features are processed in one classifier, the statistical model can be misled by non-distinguishing features. If we were to train on sufficient data, this would not be a problem; the non-distinguishing features would be filtered out as noise. It is, however, very difficult to have sufficient data when constructing a tool which is intended to have dynamic and domain-independent properties. In [28] and [31], the CANDC part-of-speech tagger ([10]), reputed to have performed well elsewhere, was employed to tag words in Astronomy research articles. In Astronomy there is frequent usage of the term He to refer to the chemical element Helium. The tagger, which was trained on Wall Street Journal articles, tagged He as a pronoun in all instances, propagating further errors to subsequent words. Separating features into smaller groups will minimise the impact of such artefacts by trying to exclude the noise from the start, making the most of the differing feature strengths for each genre type. The key seems to lie in identifying which genres belong to which type and combining the classifiers in a reasonable way to build a general classifier.
Further improvement can also be envisioned by integrating more classifiers into the decision process. In [29] we suggested the following classifiers:
- Extended image classifier which looks at more than the first page of the document: we could process the image of pages other than the first page of the document or several pages of the document in parallel. This would, however, involve several decisions: the optimal number of pages to be used and the best way to combine the information from different pages need to be determined (e.g. will several pages be considered to be one image; if not, how will the classification of synchronised pages be statistically combined to give a global classification).
- Language model classifier on the level of POS and phrases, built on the part-of-speech tags (tags which denote whether a word is a verb, noun or preposition) of the underlying words and also on partial chunk tags (tags indicating noun phrases, verb phrases or prepositional phrases).
- Semantic classifier modelling subjective or objective noun phrases (e.g. using [42]) and latent semantic analysis may be necessary for finer distinctions in document
- Contextual classifier built on source information of the document such as the name of the journal or address of the web page, and anchor texts or subject or domain information.
There are two obvious ways of gauging the performance of a genre classifier: comparing it against human performance and measuring the stability of its performance as it is transferred across domains. We are undertaking an experiment to examine human performance. A significant amount of disagreement is expected in labelling genre even between human labellers; we intend to cross-check the labelled data in two ways:
1. Document Retrieval Exercise (DRE): We plan to employ a cohort of postgraduates in information science who will be assigned genres from Table 1. They will retrieve one hundred PDF documents for each of the genres they have been assigned, and give a brief description of the source of the document and the reasons for including the document in their collection.
2. Re-labelling Experiment: We will anonymise the file names of the documents collected in the DRE and randomise the document sequence. This corpus will be presented to two new groups of labellers drawn from different backgrounds for re-classifying. They will not have access to the initial genre classification information.
The first experiment will create a pool of PDF files which have already been classified into genres by established organisations and users; this will serve as a reference point, and help us to index the performance on well-designed classification standards. The re-labelling experiment will enable us to compare the disagreement of the three classes of labellers over the same data set: this will help to determine the maximum level of accuracy at which the automated system can be expected to perform, and to determine which genres are better defined by looking at the percentage of files in agreement within each genre. The longer-term aim, once a genre classifier with performance comparable to an average human labeller has been developed, will be to integrate the method with other tools which extract author, title, date, identifier, keywords, language, summarisations and other compositional properties of files within a single genre, and to combine the tool with ingest models developed elsewhere.
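The per-genre agreement figure mentioned for the re-labelling experiment could be computed along the following lines. This is only a sketch under stated assumptions: each document is represented as a tuple holding the genre assigned in the DRE followed by the labels given by the two re-labelling groups, and a document counts as "in agreement" only when every re-labeller reproduces the DRE genre. The function name `agreement_by_genre` and the sample labels are hypothetical, and other definitions of agreement are equally plausible.

```python
from collections import defaultdict


def agreement_by_genre(labelled_documents):
    """Percentage of documents per original genre on which all labellers agree.

    labelled_documents: iterable of tuples (original_genre, relabel_1, relabel_2, ...).
    """
    totals = defaultdict(int)
    agreed = defaultdict(int)
    for original, *relabels in labelled_documents:
        totals[original] += 1
        if all(label == original for label in relabels):
            agreed[original] += 1
    return {genre: 100.0 * agreed[genre] / totals[genre] for genre in totals}


# Made-up labels: (DRE genre, first re-labelling group, second re-labelling group).
documents = [
    ("Thesis", "Thesis", "Thesis"),
    ("Thesis", "Thesis", "Technical report"),
    ("Memo", "Memo", "Memo"),
]
agreement = agreement_by_genre(documents)
```

Genres with a low agreement percentage under such a measure would be candidates for being less well defined, and the overall agreement gives a rough ceiling for what an automated classifier can be expected to reach.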
al. and Boese concentrated on more popularly accepted genre classes such as FAQ, Job Description, Editorial or Reportage.Kessler et al. tried to address both types, while Finn et al. studied binary classifications (fact versus opinion, positive versus negative reviews).Santini discussed general genre facets, while Bagdanov limited his task to detecting specific journals and brochures.Others ([44],

Table 1. Scope of genres

Groups                  Genres
Book                    Academic book, Fiction, Poetry, Handbook, Other book
Article                 Abstract, Scientific research article, Other research article, Magazine article, News report
Short Composition       Fictional Piece, Poems, Dramatic Script, Essay, Short Biographical Sketch, Review
Serial                  Periodicals (Newspaper, Magazine), Journals, Conference Proceedings, Newsletter
Correspondence          Email, Letter, Memo, Telegram
Treatise                Thesis, Business/Operational report, Technical report, Misc report
Information Structure   List, Catalogue, Raw Data, Table, Calendar, Menu, Form, Programme, Questionnaire, FAQ

Table 2. A Comparison of Visual and Stylo-metric Clusters (percentage of files in one cluster)

Table 3. Distinguishing Periodicals from Thesis using image features

Table 4. Distinguishing Periodicals from Non-periodicals using image features (10-fold cross validation with the image classifier; overall accuracy: 92.23%)

Table 5. Distinguishing Periodicals from Non-periodicals using stylistic features

Scientific Research Articles are not easily distinguishable by image features from Product Descriptions but better distinguishable when using syntactic features. If all these features are processed in one classifier, the statistical model can be misled by non-distinguishing features. If we were to train on sufficient data, this is not a problem; the non-distinguishing features will be filtered out as noise.