The textcat Package for n-Gram Based Text Categorization in R

Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful. This paper presents the R extension package textcat for n-gram based text categorization which implements both the Cavnar and Trenkle approach as well as a reduced n-gram approach designed to remove redundancies of the original approach. A multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics is used to illustrate the functionality of the package and the performance of the provided language identification methods.


Introduction
Working with written text usually requires knowledge about the language used. For example, typical text mining workflows remove stop words from texts or transform words into their stems, which clearly cannot be performed without knowing the underlying language. Therefore, modern text processing tools heavily rely on highly effective algorithms for language identification.
The first language identification methods developed were very simple: one had to know a language in order to recognize it. Humans read documents and classified only the ones they understood. To improve the process, methods were developed which enabled a reader to identify the language of a document without actually knowing the language. This was done using lists (see Ingle 1976; Keesan 1987; Newman 1987) with peculiarities of the candidate languages, such as unique letters or combinations thereof and unique words or combinations thereof. This approach, however, had many shortcomings. One would often have to work through a considerable amount of text before finding one of the unique characteristics. In addition, the method was quite vulnerable to the presence of foreign or misspelled words, or the use of highly specialized vocabulary. Thus, these methods performed rather poorly, both in terms of speed and precision.
Then came the age of information technology and with it the pioneers in the field of automated language identification, such as Mustonen (1965), Beesley (1988), Henrich (1989), and Souter, Churcher, Hayes, Hughes, and Johnson (1994). The main idea was to create distributions of specific "elements" for a number of languages and, subsequently, to compare these to the distribution of the same elements obtained from a given text. Over the years, the elements chosen to represent a certain language were altered and the methods to create and compare the distributions improved. The key contribution was certainly the Cavnar and Trenkle (1994) paper on "n-Gram-Based Text Categorization", which suggested character n-gram frequencies as elements and the so-called out-of-place measure for comparing these.
Language identification has remained a subject of high interest to this day, and many further approaches have been suggested. Batchelder (1992) built a neural net, Dunning (1994) classified using a Markov chain model, Sibun and Reynar (1996) and Singh (2006) worked with mutual cross entropy, and Ahmed, Cha, and Tappert (2004) used cumulative frequency addition to increase accuracy and efficiency. Murray (2002) employed hidden Markov models that allow for segmentation of text into unknown languages and the extraction of foreign words in known languages from English text. Combinations of different approaches were tested in, e.g., Ljubešić, Mikelić, and Boras (2007).
Clearly, all general-purpose methods for text classification can also be applied to the specific problem of language identification. In this paper, however, we will focus on n-gram based approaches as these have been remarkably successful. In Section 2, we present the original Cavnar and Trenkle (1994) approach and a reduced version designed to eliminate redundancies of the original approach. The implementation in the R (R Core Team 2012) extension package textcat (Hornik, Rauch, Buchta, and Feinerer 2013) is discussed in Section 3. In Section 4, we use the Wikipedia_multi multi-lingual corpus obtained from the Wikipedia pages available on a selection of topics to illustrate the functionality of the package and the performance of the provided language identification methods. Section 5 concludes.

The Cavnar and Trenkle approach
An n-gram as defined by Cavnar and Trenkle (1994, p. 163), hereafter CT, is a contiguous n-"character" slice of a longer word string. (In general, n-grams are subsequences of n items computed from a given sequence, e.g., Wikipedia (2013a). n-grams of lengths 1, 2 and 3 are typically referred to as unigrams, bigrams and trigrams, respectively; for lengths n ≥ 4, one simply uses the generic term "n-gram".) Depending on whether word strings are taken as sequences of characters or sequences of bytes (where the distinction matters if multi-byte character systems are used for encoding text), CT n-grams are thus character n-grams or byte n-grams, and to be distinguished from word n-grams (sequences of consecutive words) commonly employed in a variety of natural language processing tasks. In this section, we will follow common practice to refer to the word tokens as "characters".
Typically, underscores are used to identify whether an n-gram includes the first character (e.g., '_fi'), characters from the middle (e.g., 'dd', no underscore here), or the end (e.g., 'nd_') of a word, i.e., to indicate word boundaries. Cavnar and Trenkle (1994) use the word 'text' to show how n-grams (n varying from 1 to 5) should be generated. In Table 1, this "classical" method is applied to the word 'corpus'. Note that words of length k yield k + 1 n-grams for each length n. (A useful tool for visualizing n-grams is provided by the WolframAlpha computational knowledge engine and can be found at http://www.wolframalpha.com/input/?i=n-gram when selecting "n-gram" as a general topic.)
The CT n-gram based approach to language identification uses two steps. First, one collects training corpora of texts all written in the same known language, and builds language profiles from these. For every corpus, one computes the frequency distribution of the n-grams of the texts in the corpus for $n = 1, \ldots, n_{\max}$. Typically, to improve performance, only a maximal number of words is used from each text. One then sorts the n-grams from the most to the least frequent, and retains the s most frequent ones, for a prescribed profile size s. Cavnar and Trenkle (1994, p. 162) argue that n-gram distributions follow Zipf's law (Zipf 1949, "The nth most common word in a human language text occurs with a frequency inverse proportional to n.") and hence a rather small value of s can be taken, recommending s = 300. Note, however, that the quoted empirical law really refers to words and not to character n-grams: we will take a closer look at frequency distributions of the latter in Section 4.1, using the Wikipedia_multi multilingual corpus.
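The classical n-gram computation and the construction of a profile can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code used in textcat: the helper names classical_ngrams() and ngram_profile() are hypothetical, and we assume that each word is padded with one leading and n - 1 trailing underscores, which is consistent with the k + 1 count noted above.

```r
# Sketch of the "classical" CT n-gram computation for a single word.
# Assumption: one leading and n - 1 trailing boundary markers per word.
classical_ngrams <- function(word, n = 1:5, marker = "_") {
    unlist(lapply(n, function(m) {
        padded <- paste0(marker, word, strrep(marker, m - 1L))
        starts <- seq_len(nchar(padded) - m + 1L)
        substring(padded, starts, starts + m - 1L)
    }))
}

# Profile: the s most frequent n-grams of a set of words,
# sorted from the most to the least frequent.
ngram_profile <- function(words, s = 300L, n = 1:5) {
    counts <- table(unlist(lapply(words, classical_ngrams, n = n)))
    head(sort(counts, decreasing = TRUE), s)
}

grams <- classical_ngrams("corpus")
length(grams)    # 5 lengths times (6 + 1) n-grams each
c("corp", "_corp", "pus_", "pus__") %in% grams

# Toy profile from three words; real profiles are built from whole corpora.
prof <- ngram_profile(c("the", "text", "then"), s = 10L)
```

Representing the profile as a sorted frequency table keeps both the ranks (positions) and the counts available for later comparisons.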
In a second step, to identify the (unknown) language of a given text "document", one computes a document profile from this text in the same manner as the language profiles were computed, and classifies the text according to the language of the (language) profile which best fits the document profile, in the sense of minimizing a suitable distance measure for n-gram profiles. (Optionally, if the fit is not good enough, one classifies as "unknown".) Cavnar and Trenkle (1994) suggest the so-called "out-of-place" distance measure, which counts the number of rank inversions between the profiles (how to handle n-grams not present in both profiles is not precisely specified).
We note that the CT approach is applicable to arbitrary text classification tasks, employing document profiles and category profiles obtained from corpora of texts belonging to the same known category. See, e.g., Cavnar and Trenkle (1994, Section 5) for an illustration of n-gram based text categorization on a computer newsgroup classification task, and Khreisat (2009) for an application to identifying topics in Arabic newspaper articles. However, the superb performance of the approach is typically only achieved for language identification tasks.
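To make the second step concrete, the following base R sketch computes an out-of-place distance by summing, over the n-grams of the document profile, how far each n-gram's rank is from its rank in the language profile. Since the treatment of n-grams missing from a profile is not precisely specified, the fixed penalty used here (the length of the language profile) is our assumption, and out_of_place() and classify() are hypothetical helper names.

```r
# Out-of-place measure between two rank-ordered n-gram profiles, given as
# character vectors sorted from the most to the least frequent n-gram.
# Assumption: n-grams missing from the language profile get a fixed penalty.
out_of_place <- function(doc, lang, penalty = length(lang)) {
    pos <- match(doc, lang)              # rank of each n-gram in 'lang'
    d <- abs(seq_along(doc) - pos)       # rank displacement
    d[is.na(pos)] <- penalty             # n-gram absent from 'lang'
    sum(as.numeric(d))
}

# Assign the category whose profile minimizes the distance.
classify <- function(doc, profiles) {
    dists <- vapply(profiles, function(lang) out_of_place(doc, lang),
                    numeric(1))
    names(which.min(dists))
}

# Made-up toy profiles, for illustration only.
lang_a <- c("_", "e", "th", "he", "t")
lang_b <- c("_", "n", "en", "e", "er")
doc <- c("e", "_", "th", "x")
out_of_place(doc, lang_a)
classify(doc, list(a = lang_a, b = lang_b))
```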

A reduced n-gram approach
The basic CT approach can be customized via several "options", such as the size s of the profiles, the number of words in a text used for computing the profiles, or the distance measure employed when comparing profiles. Interestingly, whereas many authors have explored how the performance of the approach varies with the choices for these options (e.g., Grefenstette 1995; Ahmed et al. 2004; Singh 2006), it seems that the method used for computing n-grams was never questioned. Our initial language identification task was based on very short (SMS-style) texts, prompting us to explore the possibilities of deriving more efficient n-gram representations.
The following observations can be made for the "classical" CT method for computing n-grams.
The unigram '_' indicating the beginning of a word is always included. If a fixed number of words is taken to generate n-grams, this will lead to precisely the same number of '_' for each document.
Both n-grams with and without the extra information about their position (i.e., the leading and trailing underscores) are included. In our example in Table 1, one can find 'corp' as well as '_corp'. Using the former would mean that the n-gram 'corp' can be found in the middle of the word 'corpus', which, of course, is not the case. It thus should be preferable to use n-grams which include the first or last character of a word only if the additional positional information is part of the n-gram.
Including 'pus_' as well as 'pus__' is redundant. In fact, only the information that there is a 'pus' at the end of the word 'corpus' is of importance.
Short words result in n-grams with omitted position information. E.g., the word 'is' yields the trigrams '_is' and 'is_' (and 's__'). Similar to the above, it might be preferable to drop these and only use the n-grams which include the position indicators (i.e., '_i', 's_', and '_is_').
Our new "reduced" method for computing n-grams thus adds the following set of rules to the classical method, in order to possibly improve performance.
n-grams containing the first or the last character of a word without the additional information about this position (i.e., without the underscores) are not used.
n-grams containing more than one word boundary indicator at the end are excluded.
Words with k > 1 characters only yield n-grams of lengths n ≤ k or n = k + 2 (the word plus both word boundary indicators). Words with a single character c only yield the trigram '_c_'.
The "reduced" n-gram representation of 'corpus' obtained by applying these rules is shown in Table 2.
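One possible reading of these rules can be sketched in base R as follows. The function reduced_ngrams() is a hypothetical helper, not the textcat implementation, and two points are our assumptions: the bare unigram '_' is dropped, and n-gram lengths count the boundary indicators. We make no claim that the output reproduces Table 2 exactly; it does reproduce the n-grams listed above for the word 'is'.

```r
# Sketch of a "reduced" n-gram computation for a single word.
reduced_ngrams <- function(word, n = 1:5, marker = "_") {
    k <- nchar(word)
    if (k == 1L) {
        # Single-character words only yield the trigram '_c_'.
        return(if (3L %in% n) paste0(marker, word, marker) else character())
    }
    padded <- paste0(marker, word, marker)   # exactly one marker per side
    len <- nchar(padded)                     # equals k + 2
    out <- character()
    for (m in n) {
        # Only lengths n <= k or n = k + 2 are allowed.
        if (!(m <= k || m == k + 2L)) next
        starts <- seq_len(len - m + 1L)
        ends <- starts + m - 1L
        grams <- substring(padded, starts, ends)
        has_first <- starts <= 2L & ends >= 2L              # covers first char
        has_last <- starts <= len - 1L & ends >= len - 1L   # covers last char
        ok <- (!has_first | starts == 1L) &  # first char needs leading marker
              (!has_last | ends == len) &    # last char needs trailing marker
              grams != marker                # assumption: drop the bare '_'
        out <- c(out, grams[ok])
    }
    out
}

reduced_ngrams("is")
reduced_ngrams("a")
length(reduced_ngrams("corpus"))
```

Padding with a single marker on each side rules out n-grams with more than one trailing boundary indicator by construction.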
These rules are aimed at improving the "information" quantity and quality within the language profiles. As already mentioned, only the 300 most frequent n-grams were used by Cavnar and Trenkle (1994) to form a language profile. In this case, if a language consists of many words ending with, e.g., 's', then with $n_{\max} = 5$, 4 out of 300 places within the language profile will be used to record this same piece of information (by including 's_', 's__', 's___' and 's____'), so that 1% of the places (3 out of 300) are wasted. Similarly, excluding n-grams containing the first and/or last character of a word without the additional word boundary indicators avoids redundancies and improves consistency of the representation employed.

Implementation
The first implementation of the Cavnar and Trenkle approach which was made publicly available is Gertjan van Noord's Perl-based TextCat (van Noord 1997), which also provides language profiles for 74 "languages" (more precisely, language/encoding combinations). The code was subsequently integrated into the SpamAssassin spam filter software (Apache 2010); TextCat itself is no longer actively maintained. Van Noord provides a web page listing "competitors" (http://odur.let.rug.nl/~vannoord/TextCat/competitors.html), pointing in particular to TextCat-style implementations in Java and Python as well as libTextCat (WiseGuys 2003), a lightweight C library re-implementation, which is included in most Linux distributions (e.g., libtextcat0 on Debian-based systems).
The implementation in the R extension package textcat aims at both flexibility and convenience. An n-gram profile really is a frequency table, so that it seems natural to represent the profile as a numeric vector of frequencies named by the corresponding n-grams. As names in R should really be "valid" character strings, when using character n-grams, texts containing non-ASCII characters must declare their encoding, and will be re-encoded to UTF-8. For byte n-grams, we take advantage of the 'bytes' encoding for character strings added in R 2.13.0, motivated by our work on textcat. This new encoding allows representing a sequence of bytes as a single character string, rather than a sequence of individual raw bytes (which would result in a substantially more complex representation of byte n-gram profiles). Where necessary, functions readBytes() and readChars() from package tau (Buchta, Hornik, Feinerer, and Meyer 2012) can be used to read texts in files into byte strings and UTF-8 encoded character strings, respectively.
The basic data structure in package textcat is the S3 class "textcat_profile_db" for categorized collections of n-gram profiles, implemented as lists of frequency tables as discussed above, with the category IDs as names and the options employed for creating the profile data base (DB) as attributes. Provided that they use the same options, such profile DBs can be combined via c().
Profile DBs can be created using function textcat_profile_db(), with synopsis
    textcat_profile_db(x, id = NULL, ...)
where id gives the category IDs (suitable language IDs in the case of language identification) and x the corresponding texts, either directly as character vectors or as R objects from which texts can be extracted using as.character(), such as text corpora obtained via function Corpus() in package tm (Feinerer, Hornik, and Meyer 2008). The further '...' arguments allow specifying the options to employ for creating the n-gram profiles, including:
n: A vector containing the numbers of characters or bytes in the n-gram profiles (default: 1:5).
size: The maximal number of n-grams used for a profile (default: 1000L).
reduce: A logical indicating whether reduced n-gram representations as discussed in Section 2.2 should be employed (default: TRUE).
useBytes: A logical indicating whether to use byte n-grams rather than character n-grams (default: FALSE).
In textcat_profile_db(), texts are split according to the given categories (the default corresponds to taking each text separately), and n-gram profiles are computed via efficient C code for counting n-gram frequencies provided by function textcnt() in package tau (Buchta et al. 2012).
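As a hedged usage sketch of the interface just described, the following builds a tiny profile DB from two made-up training texts and classifies a new text against it. It assumes that package textcat is installed and that the options n, size and reduce are accepted through '...' as listed above; the texts are toy data, and realistic profiles require far more training material.

```r
library("textcat")

# Two tiny made-up training texts; real profiles need far more text.
texts <- c("this is a short english text about language identification",
           "dies ist ein kurzer deutscher text zur erkennung von sprachen")
ids <- c("english", "german")

# Options passed via '...': n-gram lengths, profile size, reduced n-grams.
db <- textcat_profile_db(texts, ids, n = 1:3, size = 200L, reduce = TRUE)
names(db)

# Classify a new text against the custom profile DB.
textcat("ein weiterer kurzer text", p = db)
```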
Categorization is performed by function textcat(), with synopsis
    textcat(x, p = TC_char_profiles, method = "CT")
where x is a character vector of texts (or coercible to such using as.character()), p is a profile DB, and method is a character string specifying a built-in method, or a user-defined function for computing distances between n-gram profiles. By default, categorization uses the TextCat character profiles and the Cavnar-Trenkle out-of-place measure. To provide a simple example:
R> library("textcat")
R> textcat(c(
+    "This is an English sentence.",
+    "Das ist ein deutscher Satz.",
+    "Esta es una frase en español."))
[1] "english" "german"  "spanish"
As we see, all three sentences are classified correctly.
The TC_char_profiles DB provides a subset of 56 character profiles obtained by converting the TextCat byte profiles to UTF-8 strings where possible. (Actually, the byte profiles are taken from libTextCat rather than TextCat, which contains one additional non-empty profile.)
The full set of byte profiles is available as TC_byte_profiles. Both profile DBs use a size of 400 and the classical method for computing n-grams.
Alternatively, textcat provides the ECIMCI_profiles DB for 26 mostly European languages built by one of us (JR) from the European Corpus Initiative Multilingual Corpus I (Armstrong-Warwick, Thompson, McKelvie, and Petitpierre 1994), using a size of 1000 and reduced n-grams. Traditionally, high-quality low-cost large-scale multilingual text corpora were rather scarce, with ECI's MC I a major step forward. In Section 4, we will show how nowadays Wikipedia can very conveniently be used for building domain-specific multilingual corpora.
We are planning to make additional textcat profile data packages available at http://datacube.wu.ac.at/ (currently, this provides the character trigram profiles from the "An Crúbadán" project, Scannell 2007).
Pairwise distances between collections of n-gram profiles or text documents can be computed via textcat_xdist(). Currently, the following distance methods for n-gram profiles are available and can be specified through the method argument to textcat_xdist() (and textcat()):
"CT": The out-of-place measure of Cavnar and Trenkle (1994) (default).
"ranks": A variant of the Cavnar-Trenkle measure based on the aggregated absolute difference of the ranks of the combined n-grams in the two profiles.
"ALPD": The sum of the absolute differences in n-gram log frequencies.
"KLI": The Kullback-Leibler I-divergence $I(p, q) = \sum_i p_i \log(p_i / q_i)$ of the n-gram frequency distributions p and q of the two profiles.
"KLJ": The Kullback-Leibler J-divergence $J(p, q) = \sum_i (p_i - q_i) \log(p_i / q_i)$, the symmetrized variant $I(p, q) + I(q, p)$ of the I-divergences.
"JS": The Jensen-Shannon divergence between the n-gram frequency distributions as a symmetrized and smoothed version of the I-divergence.
"cosine": The cosine dissimilarity between the profiles, i.e., one minus the inner product of the frequency vectors normalized to Euclidean length one (and filled with zeros for entries missing in one of the vectors).
"Dice": The Dice dissimilarity, i.e., the fraction of n-grams present in one of the profiles only.
For the measures based on distances of frequency distributions, n-grams of the two profiles are combined, and missing n-grams are given a small positive absolute frequency which can be controlled by option eps (default: 1e-6).
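The frequency-based measures can be illustrated with a short base R sketch operating on named frequency vectors. This is not the package code: the helper names are hypothetical, the Jensen-Shannon divergence follows the standard definition (the scaling used in textcat may differ), and the Dice dissimilarity is computed as one minus twice the number of shared n-grams over the total number of n-grams in both profiles.

```r
# Combine two profiles (named frequency vectors), fill missing n-grams
# with 'fill', and optionally normalize to relative frequencies.
align_profiles <- function(p, q, fill = 1e-6, normalize = TRUE) {
    grams <- union(names(p), names(q))
    P <- p[grams]; Q <- q[grams]
    P[is.na(P)] <- fill; Q[is.na(Q)] <- fill
    names(P) <- names(Q) <- grams
    if (normalize) list(p = P / sum(P), q = Q / sum(Q)) else list(p = P, q = Q)
}

kl_i <- function(p, q) sum(p * log(p / q))          # I-divergence
kl_j <- function(p, q) sum((p - q) * log(p / q))    # J-divergence
js <- function(p, q) {                              # Jensen-Shannon
    m <- (p + q) / 2
    (kl_i(p, m) + kl_i(q, m)) / 2
}
cosine_dissim <- function(p, q) {
    a <- align_profiles(p, q, fill = 0, normalize = FALSE)
    1 - sum(a$p * a$q) / sqrt(sum(a$p^2) * sum(a$q^2))
}
dice_dissim <- function(p, q) {
    shared <- length(intersect(names(p), names(q)))
    1 - 2 * shared / (length(p) + length(q))
}

# Made-up toy profiles.
p <- c(a = 5, b = 3, c = 2)
q <- c(b = 4, c = 4, d = 2)
a <- align_profiles(p, q)     # eps-style filling for the divergences
kl_j(a$p, a$q)
js(a$p, a$q)
dice_dissim(p, q)
```

Note that the divergences require strictly positive frequencies, which is why missing n-grams are filled with a small positive value, whereas the cosine dissimilarity can simply use zeros.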
The options used for building n-gram profiles ('...' arguments to textcat_profile_db()) and categorization based on these ('method' argument to textcat()) can also be manipulated as dynamic variables via textcat_options().

Applications
In order to study the performance of the classical and the reduced n-gram approach we scraped Wikipedia entries for Philosophy, Mathematics, Statistics, France, USA, Religion, Wikipedia, Internet, Medicine, and Rice in all available languages. Technically, we start with the English pages, use the MediaWiki API (action=query&prop=langlinks) to get the language links of these, and XPath (e.g., Wikipedia 2013c) to extract the page "texts" as their content inside <p> tags. As of 2010-10-17, this leads to a collection of 1641 text documents in 254 different languages. One should note how easily the same method can be used to build large-scale, possibly domain-specific multi-lingual corpora.
R> langs <- meta(Wikipedia_multi, "Language", type = "local")
R> langs <- unlist(langs)
R> texts <- lapply(Wikipedia_multi, paste, collapse = "\n")
R> texts <- unlist(texts)
The languages and the texts are now in a structure suited for further analyses. First, we start with an examination of the Zipf approximation for several languages. Second, we carry out simulation experiments where we study the classification performance of the classical and the reduced n-gram approaches for various scenarios (i.e., different numbers of words and different languages).

Examining the n-gram distributions
As mentioned above, Cavnar and Trenkle (1994) imply that character n-gram distributions follow Zipf's law. Recent work on word n-grams points out corresponding systematic deviations. Baayen (2008, p. 226) elaborates the problem of sample independence of Zipf's law. In fact, Ha, Hanna, Ming, and Smith (2009) propose an extension of Zipf's law for large corpora. Egghe (2000) shows that the rank-frequency distribution follows Zipf's law with an additional exponent.
In order to visualize the Zipf approximation for different languages, out of the 254 languages we pick the ones with the highest numbers of n-grams. These nine languages with corresponding language ID in parentheses are German (de), English (en), French (fr), Italian (it), Spanish (es), Russian (ru), Portuguese (pt), Catalan (ca), and Tamil (ta). We create the n-gram profiles for these languages using the textcat_profile_db() function (size = NA requests that all n-grams be included in the profiles).

For each language we create a Zipf plot with the logarithm of the n-gram ranks on the x-axis and the log-frequencies of the n-grams on the y-axis. A regression line is added that reflects the expected trajectory under the Zipf distribution.
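The construction of such a plot can be sketched in base R as follows. We assume here that the line is an ordinary least-squares fit of log-frequency on log-rank; the line shown in Figure 1 may have been constructed differently. The helper zipf_fit() is hypothetical, and the frequencies are synthetic values that follow Zipf's law exactly, so that the fitted slope is -1.

```r
# Zipf plot coordinates and least-squares line for a frequency vector.
zipf_fit <- function(freq) {
    freq <- sort(freq[freq > 0], decreasing = TRUE)
    d <- data.frame(log_rank = log(seq_along(freq)),
                    log_freq = log(freq))
    list(data = d, fit = lm(log_freq ~ log_rank, data = d))
}

# Synthetic frequencies proportional to 1 / rank; real n-gram counts
# would be taken from a profile instead.
freq <- 1000 / (1:200)
z <- zipf_fit(freq)
coef(z$fit)

# To draw the plot: plot(z$data); abline(z$fit)
```

For real character n-gram counts, systematic curvature of the points around the fitted line indicates a deviation from Zipf's law.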
The results in Figure 1 show rather marked deviations from the Zipf distribution, indicating that unlike for word n-grams, Zipf's law does not hold for character n-grams, which seem to yield frequency distributions with heavier tails. Table 3 shows the densities and cumulative frequencies of the s most frequent n-grams, for sizes s of 300 (the Cavnar-Trenkle suggestion), 400 (used for the TextCat profiles), and 1000 (the default profile size used by textcat). Using s = 1000 substantially increases coverage, suggesting that employing larger profile sizes than originally suggested might be more appropriate.
Character and byte n-gram distributions for many texts (e.g., obtained from Project Gutenberg, http://www.gutenberg.org/) typically look "rather similar" to the ones displayed in Figure 1. It should be both interesting and useful to find simple parametric families for representing such distributions.
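The notion of coverage used here, i.e., the cumulative relative frequency of the s most frequent n-grams, can be computed as in the following minimal sketch. The function coverage() is a hypothetical helper and the counts are made up for illustration; they are not taken from Table 3.

```r
# Share of all n-gram occurrences covered by the s most frequent n-grams.
coverage <- function(freq, s) {
    freq <- sort(freq, decreasing = TRUE)
    sum(head(freq, s)) / sum(freq)
}

freq <- c(50, 30, 10, 5, 3, 2)    # made-up counts
cov <- sapply(c(2, 4, 6), function(s) coverage(freq, s))
cov
```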

Simulation study: Classical versus reduced n-gram approach
In order to study the behavior of both n-gram approaches for different languages and different numbers of words considered for n-gram generation, we carry out an extensive simulation experiment. First, let us select the languages that have Wikipedia entries for each of the 10 search terms mentioned at the beginning of Section 4. Then we extract the corresponding texts.
R> tab <-
Note that we eliminate texts from "Simple English Wikipedia", a Wikipedia platform for people whose first language is not English (including children and adults who are learning English), see http://simple.wikipedia.org/. Not excluding Simple English in our simulation study would drastically affect the misclassification rate for English texts. After these pre-selections, 63 languages are left for our simulation experiment.
Now we create the n-gram profiles for the reduced approach (JR) and the classical Cavnar-Trenkle approach (CT). According to the suggestions by Cavnar and Trenkle (1994), we set the maximal number of n-grams to 300.
R> TC_Wiki_profiles_db_a_la_JR <- textcat_profile_db(texts, langs,
+    reduce = TRUE, size = 300)
R> TC_Wiki_profiles_db_a_la_CT <- textcat_profile_db(texts, langs,
+    reduce = FALSE, size = 300)
For each language we build a pool of the words used. Then, for every w from 1 to 20, we generate 1000 texts by randomly drawing w words from the word pool, and perform text categorization using textcat(). Finally, for each language/number-of-words scenario we count the number of correctly classified texts. The R code for reproducing the whole simulation experiment as well as our simulation results, stored in the R data file loaded below, are given in the supplementary materials.
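The structure of one cell of this experiment can be sketched in base R as follows. This is not the code from the supplementary materials: classification_rate() is a hypothetical helper, drawing words with replacement is our assumption, and a toy stand-in classifier replaces textcat() so that the sketch runs without any profile DB.

```r
# Share of correctly classified random texts of w words for one language.
# Assumptions: words are drawn with replacement, and 'classify' is any
# function mapping a text to a language ID (e.g., a wrapper around textcat()).
classification_rate <- function(pool, lang, classify, w, reps = 1000L) {
    hits <- vapply(seq_len(reps), function(i) {
        text <- paste(sample(pool, w, replace = TRUE), collapse = " ")
        identical(classify(text), lang)
    }, logical(1))
    mean(hits)
}

# Toy stand-in classifier (hypothetical): "en" if the text contains "the".
toy_classify <- function(text) {
    if (grepl("the", text, fixed = TRUE)) "en" else "de"
}

set.seed(1)
pool <- c("the", "of", "and", "to", "in")
rates <- vapply(1:20, function(w) {
    classification_rate(pool, "en", toy_classify, w, reps = 200L)
}, numeric(1))
```

Repeating this for every language and both profile DBs yields one classification rate per language, number of words, and approach.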
R> load("simTCWiki.rda")
As a first tool to explore the results we create trajectory plots. The panels in Figure 2 show the classification trajectories for both approaches. A single trajectory refers to a particular language. The black trajectory displays the median of the classification rates. Those languages that are classified badly are examined below in more detail. For the moment let us focus on the performance differences between the two approaches. To do so, we create the same type of plot, except that this time we put the differences in the classification rates between the classical and the reduced approach on the y-axis.
In Figure 3, trajectories below the zero line indicate the cases for which the reduced n-gram approach performs better than the classical approach. Especially for short texts (i.e., numbers of words smaller than five) we see that the reduced approach outperforms the classical one.
Let us have a closer look at languages that are classified badly. A table with the "worst" 10 languages (in terms of classification rates using texts of w = 20 words) for the reduced n-gram approach is given in Table 4.
The number of misclassifications between Spanish and Galician, spoken in Galicia, an autonomous community located in northwestern Spain, is rather low. Note that Galician is actually more similar to the Portuguese language than to Spanish.
Considering Scandinavian languages, the situation is quite interesting. There are two official forms of written Norwegian: one is Bokmål (Book Norwegian; no), the other one Nynorsk (New Norwegian; nn). The misclassification rates between these two languages are around 10% (in both directions). Furthermore, in 12.05% of the cases Book Norwegian texts are classified as Danish, whereas in 7.50% Danish texts are classified as Norwegian. This is not surprising since together with Swedish and Danish, Norwegian forms the family of Scandinavian languages which are more or less mutually intelligible.
Finally, with respect to the language similarities addressed above, we visualize the n-gram distances between various languages by means of hierarchical clustering. The resulting dendrogram is given in Figure 4. On the one hand, it substantiates the misclassifications from Table 4; on the other hand, it reveals interesting similarities between languages that are not necessarily related to misclassifications.
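The clustering step itself can be sketched with base R's hclust(). The dissimilarities below are made-up values for four hypothetical profiles labeled A to D (in practice they could come from textcat_xdist()), and the default complete linkage is our assumption, as the linkage used for Figure 4 is not stated here.

```r
# Made-up symmetric dissimilarities between four hypothetical profiles.
d <- matrix(c(0, 1, 6, 7,
              1, 0, 5, 6,
              6, 5, 0, 2,
              7, 6, 2, 0),
            nrow = 4,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

hc <- hclust(as.dist(d))      # assumption: default complete linkage
groups <- cutree(hc, k = 2)   # cut the tree into two clusters
groups

# plot(hc) draws the dendrogram.
```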
Croatian, Serbo-Croatian and Bosnian are merged into a cluster at a very early stage. Subsequently, they are merged with Czech, Slovak, and Slovenian such that this cluster represents the group of Slavic languages (Latin script). The other cluster of Slavic languages (Cyrillic script), placed on the left hand side of the dendrogram, is formed by Russian, Bulgarian, Serbian, and Ukrainian.
Malay and Indonesian on the one hand, and Norwegian, Nynorsk, and Danish (and Swedish) on the other hand, are also clustered at an early stage. The latter languages belong to the family of Germanic languages and are subsequently clustered with German and Dutch. In the middle of the dendrogram we have the cluster with Romance (or Latin) languages such as Spanish, Portuguese, French, Italian, Galician, Romanian, and Catalan.

Conclusion
The n-gram based approach to text categorization introduced in Cavnar and Trenkle (1994) provides a popular, high-performance methodology for language identification. We discuss encodings are not necessarily known in applications. Thus, developing collections of byte profiles with more language/encoding combinations than currently provided in the TextCat byte profile data base would certainly be very useful. Hopefully, such collections can be made available in the future.

Figure 1: Zipf plots of character n-gram frequencies for nine different languages.

Figure 3: Trajectories plots for classification differences between classical and reduced n-gram approach.

Table 1: Classical n-gram representation of the word 'corpus'.

Table 2: Reduced n-gram representation of the word 'corpus'.

Table 3: Densities and cumulative frequencies of the s most frequent n-grams, for sizes s of 300, 400 and 1000.

The percentage values in Table 4 are conditional on the row margins. For Bosnian (bs) we see that in 14.63% of the cases it is classified as Croatian (hr) and in 12.16% as Serbo-Croatian (sh). This is not surprising since Bosnian is one of the three "Serbo-Croatian" standards, with the latter term resulting from the time before the dissolution of former SFR Yugoslavia (see also Wikipedia 2013b, for how Wikipedia handles this rather delicate matter). In fact, Standard Croatian, Serbian and Bosnian are almost completely mutually intelligible. This also explains the bad classification rate for Croatian, which in 23.66% of the cases is "misclassified" as Serbo-Croatian. Note that Croatian is not confounded with Serbian since Serbian uses the Cyrillic script.
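Percentages that are conditional on the row margins can be obtained from a cross-tabulation of true and predicted categories as in the following base R sketch. The labels "x" and "y" and the toy vectors are made up for illustration and do not correspond to the entries of Table 4.

```r
# Toy true and predicted category labels (made up).
truth <- c("x", "x", "x", "x", "y", "y", "y", "y")
pred  <- c("x", "x", "x", "y", "y", "y", "x", "x")

conf <- table(truth, pred)                   # rows: truth, columns: predicted
pct <- 100 * prop.table(conf, margin = 1)    # each row sums to 100
round(pct, 2)
```

With this normalization, each row shows how the texts of one true category are distributed over the predicted categories.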

Table 4: Confusion matrix (in %) for the "worst" languages. The rows represent the correct languages, the columns represent the classified languages.