Using Maximum Entropy Models to Discriminate between Similar Languages and Varieties

DSLRAE is a hierarchical classifier for similar written languages and varieties based on maximum-entropy (maxent) classifiers. At the first level, the text is classified into a language group using a simple token-based maxent classifier. At the second level, a group-specific maxent classifier is applied to classify the text as one of the languages or varieties within the previously identified group. For each group of languages, the classifier uses a different kind and combination of knowledge-poor features: token or character n-grams and ‘white lists’ of tokens. Features were selected according to the results of ten-fold cross-validation over the training dataset. The system presented in this article was ranked second in the Discriminating between Similar Languages (DSL) shared task held within the VarDial Workshop at COLING 2014 (Zampieri et al., 2014).


Introduction
Language identification (LI) can be defined as the task of determining the language of a written text. LI is also a cross-cutting technology supporting many other text analysis tasks, such as sentiment analysis, political tendency detection or topic classification. Some interesting problems around written language identification have attracted attention recently, such as native language identification (NLI, Tetreault et al., 2013), the identification of the country of origin, or the discrimination between similar or closely related languages (DSL, Tiedemann and Ljubešić, 2012).
LI has been very successful in discriminating between languages with unique character sets and between languages that belong to different language groups or are typologically distant. However, according to Zampieri (2013), multilingualism, noisy or non-standard features in text, and discrimination between similar languages, varieties or dialects remain the major known bottlenecks in language identification. For this reason, DSL can be considered a sub-task of language identification. Interestingly enough, LI seems to work well with what Kloss (1967) called Abstandsprache or language by distance (because Basque is an isolate, it is generally regarded as a distant language) but fails in dealing with Ausbausprache or language by development (a standard variety together with all varieties heteronomous with respect to it, e.g. Basque Batua koiné and the various vernacular dialects).
Mass media, educational centres, administrations and communications favour standard languages over other varieties. Standard varieties of languages are thus seen by sociolinguists and dialectologists as political and cultural constructs (Trudgill, 2004). However, languages and varieties are not just systems for communication between individuals; they are also used by groups and are a crucial part of their identity and culture. Language variation is systematic, both inter- and intra-personal. It can be related to political, social, geographical, situational, communicative or instrumental factors. Variation within a language can be found at different levels: alphabet, orthography (diacritics), word structure (syllable composition, morphology), lexical choice or even syntax. Similar or closely related languages often reflect a common origin and are members of a dialect continuum (Bloomfield, 1935).
Solutions to language identification are often based on either generative or discriminative character n-gram language models. While character-based methods provide a means to distinguish between different languages on the basis of coarse-grained statistics on n-grams, discriminating between similar languages seems to need more fine-grained distinctions that are not always reflected by character n-gram distributions. According to Tiedemann and Ljubešić (2012), character-based n-gram methods fail for languages with a high lexical overlap, since the more words two languages share, the more similar their character n-gram frequency profiles will be.

Model codes: each model has a letter code indicating the kind of elements considered: C (characters), T (tokens), L (tokens from the list of the 10,000 most frequent tokens), and a number indicating how many consecutive elements are taken in a feature: 1 (unigrams), 1-2 (unigrams and bigrams), 1-5 (sequences of length one to five).

Previous Approaches
Although focused on formal languages, Gold (1967) is usually credited as the first to attempt computational language identification. In particular, two common LI approaches, namely n-gram language models and white (or black) lists, echo Gold's information presentation methods. In the 1990s, language identification was formulated as a sub-task of text categorization and varied approaches were explored. Beesley (1988) pioneered the use of character n-gram models, which were also used by Dunning (1994) and Cavnar and Trenkle (1994). Grefenstette (1995) compared this approach to that of Ingle (1978), based on the frequency of short words. The interested reader is referred to Zampieri (2013) for a review of some statistical and machine learning proposals and to both Baldwin and Lui (2010) and Lui and Baldwin (2011) for an overview of some linguistically motivated models.
As Baldwin and Lui (2010) or Tiedemann and Ljubešić (2012) point out, language identification is erroneously considered an easy and solved problem, in part because some general purpose systems are available, notably TextCat, Xerox Language Identifier and, more recently, langid.py (Lui and Baldwin, 2012). While it is true that excellent results can be obtained for a small number of languages (Baldwin and Lui, 2010) or for typologically distant languages (Zampieri et al., 2013), accurately discriminating among closely related languages or varieties of the same language has been repeatedly reported as a bottleneck for language identification systems, in particular for those based on n-grams.
Back in 2004, Padró and Padró concluded that "since the tested systems tend to fail when distinguishing similar languages (e.g. Spanish and Catalan), further research could be done to solve these cases." Martins and Silva (2005) report similar difficulties in discriminating between European and Brazilian Portuguese. Ranaivo-Malançon (2006) motivates her work by the unsatisfactory performance of the language identifiers then available when dealing with close languages such as Malay and Indonesian. Ljubešić et al. (2007) do not even attempt to distinguish Bosnian from Croatian when developing a Croatian identifier because of their closeness. Trieschnigg et al. (2012) are an exception, as they report satisfactory results in identifying sixteen varieties of Dutch with TextCat.
Ranaivo-Malançon (2006) presents a cascaded language identifier for Malay and Indonesian. It first distinguishes Malay or Indonesian from four other European languages using trigrams extracted from the most frequent words of each language. Texts classified as Malay or Indonesian are subsequently scanned for some linguistic features (format of numbers and exclusive words), yielding a more precise performance than TextCat. Ljubešić et al. (2007) also propose a cascaded identifier that relies on 'black lists' to discard non-Balkan languages and a second-order Markov model on n-grams to discriminate among the remaining ones, augmented with a 'black list' component that raises accuracy up to 0.99 when dealing with the most difficult pair (Croatian and Serbian). This work is followed up in Tiedemann and Ljubešić (2012), where a 9% improvement over standard approaches is reported and where support for Bosnian discrimination is included.
Huang and Lee (2008) use a bag of the most frequent words to build a voting identifier for three Chinese varieties with a top accuracy of 0.929. More recently, Zampieri (2013) compares the performance of n-gram based models to machine learning methods using bags of words when discriminating similar languages and varieties, obtaining comparable performance with both approaches. Grouin et al. (2010) present the shared task DEFT 2010. Participants were challenged to identify the decade, country (France and Canada) and newspaper for a set of journalistic texts. As far as the country labeling is concerned, they report a top F1-measure of 0.964 and an average of 0.767. Very brief descriptions of the systems are also offered.
Zampieri and Gebre (2012) present a log-likelihood estimation method for language models built on orthographical (character n-grams), lexical (word unigrams) and lexico-syntactic (word bigrams) features. They report an accuracy of 0.998 in distinguishing European and Brazilian Portuguese with a language model based on character 4-grams. This approach is adapted in Zampieri et al. (2013) to deal with Spanish varieties, where the role of knowledge-rich features (POS tags) is also explored. They report an accuracy of 0.99 in the binary distinction between Argentinean and Mexican Spanish with single words or bigrams. Trieschnigg et al. (2012) compare the performance of TextCat to nearest neighbour and nearest prototype classifiers in combination with a cosine distance when distinguishing among sixteen varieties of Dutch. They report a micro-average F1-score of 0.799 (and a macro-average F1-score of 0.527) with a top F1-score of 0.987 when dealing with Frisian. Lui and Cook (2013) report experiments with different classifiers to map English documents to their country of origin. An SVM classifier with bag of words is top ranked with a macro-average F1-score of 0.911 in a cross-domain setting and 0.975 in an in-domain setting.
All these previous works (with the sole exception of Trieschnigg et al. (2012), where a general purpose LI system yields a satisfactory performance) agree on the specificity of DSL with respect to LI. Maybe because of that, two-level approaches are not uncommon. Features used to discriminate seem to be language-group specific, although word features rather than character features seem to perform better (Zampieri and Gebre (2012) report best results for character 4-grams, however, given that European and Brazilian Portuguese do not completely share orthography).

Maximum Entropy Models and Feature Engineering
Maximum Entropy modelling is a general purpose machine learning framework that has proven to be highly expressive and powerful in many areas. Maximum Entropy (maxent) was first introduced into natural language processing by Berger et al. (1996) and Della Pietra et al. (1997). Since its introduction, Maximum Entropy techniques and the more general framework of Random Fields have been applied extensively to natural language processing problems, where maxent classifiers are commonly used as an alternative to Naïve Bayes classifiers. In maxent modelling, the probability that an example x is in a class c is estimated from its bag of words (or n-grams) as:

$$p(c \mid x) = \frac{1}{Z} \exp\Big( \sum_{y \in x} \sum_{i} w_{ci} \, f_i(c, y) \Big)$$

where f_i(c, y) are indicator functions, w_ci is the weight assigned to feature i in class c, and Z is a normalization factor that makes the probabilities of all classes sum to one. Features are modelled by indicator functions f_i(c, y), which evaluate to one when the feature i for a particular class c is true for a word y and to zero otherwise. The following is an example of an indicator function modelling the presence of a particular word in a class, where c_i and y_i denote the class and the word associated with feature i:

$$f_i(c, y) = \begin{cases} 1 & \text{if } c = c_i \text{ and } y = y_i \\ 0 & \text{otherwise} \end{cases}$$

The class assigned to an example x is the most probable one:

$$\hat{c} = \arg\max_{c} \, p(c \mid x)$$

The maxent classifiers are implemented with the toolkit of Zhang Le (2004), and the parameters of the model are estimated using Generalized Iterative Scaling (Darroch and Ratcliff, 1972).
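The scoring and decision rules described above can be illustrated with a minimal Python sketch. It is not the toolkit used in this work: the function names and the toy weights are assumptions made only for illustration, and the weights are invented rather than estimated. Each weight is stored under a (class, feature) key, which plays the role of w_ci paired with its indicator function.

```python
import math


def maxent_probabilities(bag, weights, classes):
    """Return p(c | x) for every class c, given the bag of features of x.

    `weights` maps (class, feature) pairs to real-valued weights; a missing
    pair behaves like an indicator function that is never active.
    """
    # Sum the weights of the features that are active for each class.
    scores = {c: sum(weights.get((c, y), 0.0) for y in bag) for c in classes}
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())  # normalization factor Z
    return {c: e / z for c, e in exps.items()}


def classify(bag, weights, classes):
    """Return the most probable class for the given bag of features."""
    probs = maxent_probabilities(bag, weights, classes)
    return max(probs, key=probs.get)


# Invented toy weights, for illustration only (not taken from the trained models).
TOY_CLASSES = ["bs", "hr", "sr"]
TOY_WEIGHTS = {("hr", "tko"): 1.5, ("sr", "ko"): 1.5, ("bs", "ko"): 0.5}
```

With these toy weights, `classify(["tko", "je"], TOY_WEIGHTS, TOY_CLASSES)` selects `"hr"`, because only that class has an active feature with a positive weight.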
Having chosen a closed approach to the DSL shared task, no resources other than the text samples given as training and development datasets have been used in feature design. In this knowledge-poor approach to the problem, the maxent classifier has been trained with token and character n-gram features. Character-based features are obtained with a simple character tokenizer. For token-based features, however, texts are tokenized using an orthographic tokenizer which splits punctuation from words. Several bags of features have been considered during the experiments: single tokens (T1), single words from the list of the 10,000 most frequent tokens (L1), token bigrams (T2), and character n-grams of length from one to five (C1-5). We will also refer to the lists of the 10,000 most frequent words as 'white lists', which play a role complementary to the 'black lists' of Tiedemann and Ljubešić (2012).
To determine which features are best suited to each group, we measured their performance using ten-fold cross-validation on the training dataset and using the development dataset for testing. For group A, best results were obtained using a bag of features consisting of variable length character n-grams ranging from one to five (C1-5). For group B, token bigrams (T2) compared more favourably with the 'white list' of tokens (L1) on the development set than on the training set, which seems to indicate a better generalisation of the former to unseen examples. Results for group C were similar for all features considered. Regarding groups D and E, token-based features obtained similar results, with slightly better results for token bigrams. Finally, for English (group F) results were generally poor, with the 'white list' achieving the best results. Group F is known to contain more than a few misclassifications due to news cross-citing between the American and British press. Results for each group's best model using ten-fold cross-validation on the training dataset are shown in Table 1. All figures have been macro-averaged, i.e., they have been computed by averaging over the ten folds.
Because the best results for each group are obtained with different feature sets, a new classifier is introduced. This classifier determines the language/variety group of each example before applying its particular group classifier. As can be seen in Table 2, the degree of token overlap between languages and varieties of different groups is rather low compared with the degree of overlap within the same group. Using only tokens, perfect accuracy is reached on the training dataset using cross-validation. A classifier applying several classifiers in the way we propose is known as a hierarchical two-level classifier.
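The bags of features listed above can be sketched as follows. This is only an illustration under stated assumptions: the regular expression stands in for the orthographic tokenizer, T2 is taken to contain bigrams only, and a single 'white list' is built from whatever texts are passed in. All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed orthographic tokenizer: splits punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_ngrams(tokens, n):
    """Return the token n-grams of exactly length n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, lo=1, hi=5):
    """Return all character n-grams with lengths from lo to hi."""
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]


def build_white_list(texts, size=10000):
    """Return the set of the `size` most frequent tokens in `texts`."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {tok for tok, _ in counts.most_common(size)}


def extract(text, kind, white_list=None):
    """Return the bag of features of `text` for one of the model codes."""
    tokens = tokenize(text)
    if kind == "T1":
        return tokens
    if kind == "T2":
        return token_ngrams(tokens, 2)
    if kind == "L1":
        return [tok for tok in tokens if tok in (white_list or set())]
    if kind == "C1-5":
        return char_ngrams(text, 1, 5)
    raise ValueError(f"unknown feature kind: {kind}")
```

For example, `extract("Hello, world!", "T2")` yields the bigrams `"Hello ,"`, `", world"` and `"world !"`, since punctuation marks are kept as separate tokens.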
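The selection procedure relies on ten-fold cross-validation with scores averaged over the folds. A minimal sketch of that loop is given below; the fold assignment (every k-th example) and the `train_fn` and `score_fn` callables are assumptions introduced for illustration, not a description of the actual experimental setup.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs for k folds.

    Assumption: example i is assigned to fold i mod k.
    """
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_examples) if i not in held]
        yield train, held_out


def cross_validate(examples, labels, train_fn, score_fn, k=10):
    """Train and score on each fold, then return the average score.

    `train_fn(examples, labels)` returns a model;
    `score_fn(model, examples, labels)` returns a number such as an F1-score.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / len(scores)
```

Running `cross_validate` once per candidate feature set and keeping the set with the highest average mirrors the comparison described above.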
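The two-level arrangement can be sketched as a small dispatcher: a first classifier picks the group, and the group's own feature extractor and classifier pick the language or variety. The class below is a hypothetical illustration; the classifiers are passed in as plain callables, and the toy rules and labels in the usage example are invented stand-ins for trained maxent models.

```python
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[str], List[str]]
Classifier = Callable[[List[str]], str]


class HierarchicalClassifier:
    """Two-level classifier: language group first, then language or variety."""

    def __init__(self,
                 group_extractor: Extractor,
                 group_classifier: Classifier,
                 group_models: Dict[str, Tuple[Extractor, Classifier]]):
        self.group_extractor = group_extractor
        self.group_classifier = group_classifier
        # One (feature extractor, classifier) pair per language group.
        self.group_models = group_models

    def predict(self, text: str) -> Tuple[str, str]:
        """Return the (group, language or variety) pair predicted for `text`."""
        group = self.group_classifier(self.group_extractor(text))
        extractor, classifier = self.group_models[group]
        return group, classifier(extractor(text))


def split_tokens(text: str) -> List[str]:
    """Toy whitespace tokenizer used only in this example."""
    return text.lower().split()


# Invented stand-ins for trained classifiers, for illustration only.
toy = HierarchicalClassifier(
    group_extractor=split_tokens,
    group_classifier=lambda toks: "E" if "el" in toks else "F",
    group_models={
        "E": (split_tokens, lambda toks: "es-AR" if "vos" in toks else "es-ES"),
        "F": (split_tokens, lambda toks: "en-GB" if "colour" else "en-US")
        if False else (split_tokens,
                       lambda toks: "en-GB" if "colour" in toks else "en-US"),
    },
)
```

Keeping the extractor next to each group classifier lets every group use its own feature set, which is the motivation given above for the hierarchical design.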

Evaluation and Error Analysis
To assess the performance of the hierarchical maxent classifier on the DSL task dataset, models were trained using all the examples provided in the training and development datasets.
Table 4 shows the confusion matrix for the classifier on the test dataset and Table 3 the results in terms of precision, recall and F1-score for each language and variety. As can be seen in Table 4, no example has been classified into a wrong group. Tan et al. (2014) provide a baseline using a Naïve Bayes classifier on character 5-grams. As can be seen if Table 3 is compared with Table 4 of Tan et al. (2014), figures for group A are slightly below the baseline, groups B and C achieve the same results, groups D and E get slightly better results with the maxent classifier, and the biggest difference is found in group F, where Naïve Bayes obtains better results. The overall result without group F is similar: an F1-score of 0.947 for maxent and 0.942 for Naïve Bayes.
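Precision, recall and F1-score per class can be derived from a confusion matrix as in the following sketch. The function names are hypothetical, rows are assumed to hold gold labels and columns predicted labels, and the macro-average here is assumed to be the unweighted mean over classes. The only figures taken from this paper are the Bosnian counts from the caption of Table 4.

```python
def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} from a confusion matrix.

    Assumption: matrix[g][p] counts examples with gold class g predicted as p.
    """
    scores = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        gold_total = sum(matrix[k])                      # row sum
        predicted_total = sum(row[k] for row in matrix)  # column sum
        precision = true_pos / predicted_total if predicted_total else 0.0
        recall = true_pos / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Return the unweighted mean of precision, recall and F1 over classes."""
    n = len(scores)
    return tuple(sum(s[j] for s in scores.values()) / n for j in range(3))


# Bosnian row as reported in the caption of Table 4:
# 875 classified as Bosnian, 61 as Croatian and 64 as Serbian.
BOSNIAN_ROW = [875, 61, 64]
bosnian_recall = BOSNIAN_ROW[0] / sum(BOSNIAN_ROW)
```

Since 875 of the 1,000 Bosnian texts are classified as Bosnian, the recall for Bosnian is 0.875; precision additionally requires the Bosnian column, which depends on the rows of the other languages in the group.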
The DSL Corpus is composed of comparable journalistic texts to make the corpus suitable for discriminating similar languages and language varieties but not text types or genres. Tiedemann and Ljubešić (2012) avoid biases towards topic and domain by experimenting with parallel texts, reaching an overall accuracy of 90.3% for group A (bs, hr, sr) using a 'black list' classifier and comparing its results with a Naïve Bayes approach. They found that the 'black list' classifier generalises better than the Naïve Bayes approach when moving from parallel to comparable corpora, since the former classifier is based on more informative features than the latter.
Results of ten-fold cross-validation on the training dataset for different feature settings for group E (Spanish) were consistent with those of Zampieri et al. (2013), where word bigrams are reported to outperform character n-grams. Given that datasets are not identical, it is difficult to draw any conclusion from the 1.2% difference in accuracy between DSLRAE and Zampieri et al. (2013). Manual inspection of misclassified news suggests some textual properties that are especially challenging: a) a high density of foreign proper names (Russian, Baby, Pony, Jack, ...) may dilute the evidence provided by vernacular words; b) conversely, a low density of features specific to any variant (such as place or family names, demonyms, lexical choices) may be insufficient to drive the text to the right class; this is also the case of some perfectly neutral sentences where a trained linguist could not spot any clue about their origin; c) certain syntactical idiosyncrasies (for example the Argentinian idioms la pasas bien, tal como muchas veces, en exceso de) are not captured by bigrams; d) there are instances of cross-information, e.g., Argentinian news about Spain and vice versa, where maybe a topic rather than a variety is being detected (e.g., news about Urdangarín or Fernández de Kirchner); e) there are some typos and misspellings (carabanas, dosco) whose role remains unclear; f) finally, there is at least one text misclassified in the gold standard: it is labeled as Argentinian but it was written by the Spanish EFE news agency. Some of these difficulties cross-cut all language groups and are not specific to Spanish but rather to DSL as a task.
In contrast to what Zampieri and Gebre (2012) found, ten-fold cross-validation on the training dataset for different feature settings on the DSL dataset did not find character n-grams to outperform word n-grams for group D (Portuguese). It could be hypothesized that they used a single source (newspaper) for each variety and therefore rigid editorial conventions could be at play; moreover, the collections were three years apart, so topic consistency could also be compromised. Manual inspection of mislabeled sentences shows some already known categories: evidence diluted by foreign words (Red Brick Warehouse, Mészáros, Fat Duck), poor evidence (Valongo, Sao Paulo) or cross-information (TAP, Brasília). There is, however, a Portuguese-specific issue: some texts obey the 1990 Orthographic Agreement, which blurs the orthographic distinctions regarding diacritics or consonant clusters; in fact, one sentence contains words following both standards (perspectiva and reprodução). It remains unexplained why word bigrams did not capture the Brazilian preference for passive voice (foram rebaixados), auxiliary + gerund chunks (estamos utilizando) or clitic dropping (lembro).
Despite the findings of Tiedemann and Ljubešić (2012), character n-grams performed better during ten-fold cross-validation on the training dataset for different feature settings on the DSL dataset for group A (Bosnian, Croatian and Serbian). Misclassified sentences involve failing to capture adapted place names (Belgiji, Švedskoj) or derivational choices (organiziranog).
Results of ten-fold cross-validation on the training dataset for different feature settings for group B (Indonesian and Malay) ranked word unigrams first. Ranaivo-Malançon (2006) uses number formatting and exclusive word lists. It can be hypothesized that lexical overlap is low (see Table 2) and/or frequency distributions are dissimilar, thus allowing word unigrams to perform as well as 'white lists'.
Languages of group C (Czech and Slovak) are dissimilar both orthographically and lexically.These dissimilarities are surprisingly well captured by the top 10,000 most frequent words.

Conclusions and Future Work
In this paper, we have shown that a hierarchical classifier is well suited to discriminate among different language groups and the languages or varieties therein. Different features are shown to better suit the typological traits of the supported languages. A comparison to previous approaches is provided, when available.
In a multilingual setting, the effect of adding Galician to group D could be investigated. Focusing on the Spanish language, we plan to geographically expand the classifier to deal with all national varieties, a much harder task as both Baldwin and Lui (2010) and Zampieri et al. (2013) remark. Moreover, the classifier could be used, as Tiedemann and Ljubešić (2012) suggest, to learn variety discriminators to label texts beyond national classes (e.g. both Caribbean and Andean Spanish cross-cut national borders and, conversely, the nations involved are known not to be dialectally uniform). Given that error analysis showed that word bigrams fail to capture certain syntactical idiosyncrasies, a model with longer n-grams and/or knowledge-richer features such as POS sequences could also be explored, although Zampieri et al. (2013) report lower performance for such features than for knowledge-poor ones. Finally, classification techniques such as those described in Gyawali et al. (2013) may be used to discard translations when building monolingual, vernacular corpora.
A diachronic expansion, along the lines of Trieschnigg et al. (2012), is also planned. Medieval Castilian coexisted with other Romance varieties such as Leonese or Aragonese whose features permeated Castilian texts. Researchers are in need of a tool to properly classify diachronic texts in order to accurately describe older stages of Spanish. Following the suggestion of Tiedemann and Ljubešić (2012), we envisage the use of parallel texts such as versions of the Bible from different areas to learn the differences among varieties.

Table 1: Macro-averaged Precision, Recall and F1-score on the DSL training dataset resulting from ten-fold cross-validation using the best model for each group of languages or varieties.

Table 2: Lexical overlap between pairs of languages as a percentage. Only orthographic forms and punctuation signs appearing more than once in the training dataset have been considered.

Table 3: Macro-averaged Precision, Recall and F1-score on the DSL test dataset. Models are described in Table 1.

Table 4: Confusion matrix for the hierarchical maxent classifier on languages and varieties in the DSL test dataset. The 1,000 Bosnian texts have been classified as Bosnian (875), Croatian (61) and Serbian (64).

Table 5: Language and variety groups and codes.