Report of NEWS 2016 Machine Transliteration Shared Task

This report presents the results from the Machine Transliteration Shared Task conducted as part of the Sixth Named Entities Workshop (NEWS 2016) held at ACL 2016 in Berlin, Germany. Similar to previous editions of the NEWS Workshop, the Shared Task featured machine transliteration of proper names over 14 different language pairs, including 12 different languages and two different Japanese scripts. A total of 5 teams participated in the evaluation, submitting 255 standard and 19 non-standard runs and employing a diverse variety of transliteration methodologies. Four performance metrics were used to report the evaluation results. Once again, the NEWS shared task on machine transliteration has successfully achieved its objectives by providing a common ground for the research community to conduct comparative evaluations of state-of-the-art technologies that will benefit future research and development in this area.


Introduction
Names play an important role in the performance of most Natural Language Processing (NLP) and Information Retrieval (IR) applications. They are also critical in cross-lingual applications such as Machine Translation (MT) and Cross-Language Information Retrieval (CLIR), as it has been shown that system performance correlates positively with the quality of name conversion across languages (Demner-Fushman and Oard 2002, Mandl and Womser-Hacker 2005, Hermjakob et al. 2008, Udupa et al. 2009). Bilingual dictionaries constitute the traditional source of information for name conversion across languages; however, they offer only limited support because, in most languages, names are continuously emerging and evolving.
All of the above points to the critical need for robust Machine Transliteration methods and systems. During the last decade, significant effort has been devoted by the research community to the problem of machine transliteration (Knight and Graehl 1998, Meng et al. 2001, Li et al. 2004, Zelenko and Aone 2006, Sproat et al. 2006, Sherif and Kondrak 2007, Hermjakob et al. 2008, Al-Onaizan and Knight 2002, Goldwasser and Roth 2008, Goldberg and Elhadad 2008, Klementiev and Roth 2006, Oh and Choi 2002, Virga and Khudanpur 2003, Wan and Verspoor 1998, Kang and Choi 2000, Gao et al. 2004, Li et al. 2009a, Li et al. 2009b). These previous works fall into three main categories: grapheme-based, phoneme-based and hybrid methods. Grapheme-based methods (Li et al. 2004) treat transliteration as a direct orthographic mapping and use only orthography-related features, while phoneme-based methods (Knight and Graehl 1998) make use of phonetic correspondences to generate the transliteration. The hybrid approach refers to the combination of several different models or knowledge sources to support the transliteration generation process.
The first machine transliteration shared task (Li et al. 2009b, Li et al. 2009a) was organized and conducted as part of NEWS 2009 at ACL-IJCNLP 2009. It was the first time that common benchmarking data in diverse language pairs was provided for evaluating state-of-the-art machine transliteration. While the focus of the 2009 shared task was on establishing quality metrics and on setting up a baseline for transliteration quality based on those metrics, the 2010 shared task (Li et al. 2010a, Li et al. 2010b) focused on expanding the scope of the transliteration generation task to about a dozen languages and on exploring how quality depends on the direction of transliteration. In NEWS 2011 (Zhang et al. 2011a, Zhang et al. 2011b), the focus was on significantly increasing the hand-crafted parallel corpora of named entities to include 14 different language pairs from 11 language families, and on making them available as the common dataset for the shared task. The NEWS 2016 Shared Task on Transliteration continues this effort of evaluating machine transliteration performance over such a common dataset, following the NEWS 2015 (Banchs et al. 2015), NEWS 2012 and NEWS 2011 shared tasks.
In this paper, we present in full detail the results of the NEWS 2016 Machine Transliteration Shared Task. The rest of the paper is structured as follows. Section 2 provides a short review of the main characteristics of the machine transliteration task and the corpora used for it. Section 3 reviews the four metrics used for the evaluations. Section 4 reports specific details about participation in the 2016 edition of the shared task, and Section 5 presents and discusses the evaluation results. Finally, Section 6 presents our main conclusions and future plans.

Shared Task on Transliteration
Transliteration, sometimes also called Romanization, especially if Latin scripts are used for target strings (Halpern 2007), deals with the conversion of names between two languages and/or script systems. Within the context of the Transliteration Shared Task, we aim not only at addressing the name conversion process but also at its practical utility for downstream applications, such as MT and CLIR.
In this sense, we adopt the same definition of transliteration as proposed during the NEWS 2009 workshop (Li et al. 2009a). According to it, transliteration is understood as the "conversion of a given name in the source language (a text string in the source writing system or orthography) to a name in the target language (another text string in the target writing system or orthography)", subject to the following specific requirements regarding the name representation in the target language:

• it is phonetically equivalent to the source name,
• it conforms to the phonology of the target language, and
• it matches the user intuition on its equivalence with respect to the source language name.

Following NEWS 2011, NEWS 2012 and NEWS 2015, the three back-transliteration tasks are maintained. Back-transliteration attempts to restore transliterated names back into their original source language. For instance, the tasks of converting Western names written in Chinese and Thai back into their original English spellings are considered. Similarly, a task for back-transliterating Romanized Japanese names into their original Kanji strings is considered too.

Shared Task Description
Following the tradition of the NEWS workshop series, the shared task in NEWS 2016 consists of developing machine transliteration systems for one or more of the specified language pairs. Each language pair of the shared task consists of a source and a target language, implicitly specifying the transliteration direction. Training and development data in each of the language pairs was made available to all registered participants for developing their transliteration systems.
At evaluation time, a standard hand-crafted test set consisting of between 500 and 3,000 source names (approximately 5-10% of the training data size) was released, on which the participants were required to produce a ranked list of transliteration candidates in the target language for each source name. The system output was tested against a reference set (which may include multiple correct transliterations for some source names), and the performance of a system is captured by multiple metrics (defined in Section 3), each designed to capture a specific performance dimension.
For every language pair, each participant was required to submit at least one run (designated as a "standard" run) that uses only the data provided by the NEWS workshop organizers for that language pair; i.e., no other data or linguistic resources were allowed for standard runs. This ensures parity between systems and enables meaningful comparison of the performance of various algorithmic approaches in a given language pair. Participants were allowed to submit one or more standard runs for each task they participated in. If more than one standard run was submitted, participants were required to name one of them as a "primary" run, which was the one used to compare results across different systems.
In addition, one or more "non-standard" runs could be submitted for every language pair using data beyond that provided by the shared task organizers, any other available linguistic resources for the specific language pair, or both. This essentially enabled participants to demonstrate the performance limits of their systems in a given language pair.

Shared Task Corpora
Two specific constraints were considered when selecting languages for the shared task: language diversity and data availability. To make the shared task interesting and to attract wider participation, it is important to ensure a reasonable variety among the languages in terms of linguistic diversity, orthography and geography. Clearly, the ability to procure and distribute reasonably large (approximately 10K paired names for training and testing together) hand-crafted corpora consisting primarily of paired names is critical for this process. Following NEWS 2015, the 14 tasks shown in Tables 1.a-e were used (Li et al. 2004, Kumaran and Kellner 2007, MSRI 2009, CJKI 2010). The names given in the training sets for the Chinese, Japanese, Korean, Thai, Persian and Hebrew languages are Western names and their respective transliterations; the Japanese Name (in English) → Japanese Kanji data set consists only of native Japanese names; the Arabic data set consists only of native Arabic names. The Indic data sets (Hindi, Tamil, Kannada, Bangla) consist of a mix of Indian and Western names.
For all of the chosen tasks, we were able to procure paired-name data between the source and the target scripts and to make it available to the participants. For some language pairs, such as English-Chinese and English-Thai, there are both transliteration and back-transliteration tasks. Most of the tasks are just one-way transliteration, although the Indian data sets contain a mixture of names of both Indian and Western origin.

Evaluation Metrics and Rationale
The participants were asked to submit standard and, optionally, non-standard runs. One of the standard runs must be named as the primary submission, which was the one used for the performance summary. Each run must contain a ranked list of up to ten candidate transliterations for each source name. The submitted results are compared to the ground truth (reference transliterations) using four evaluation metrics capturing different aspects of transliteration performance:

• Word Accuracy in Top-1 (ACC),
• Fuzziness in Top-1 (Mean F-score),
• Mean Reciprocal Rank (MRR), and
• Mean Average Precision ($\mathrm{MAP}_{ref}$).

In the next subsections, we present a brief description of each of these metrics. The following notation is assumed:

• $N$: total number of names (source words) in the test set,
• $n_i$: number of reference transliterations for the $i$-th name in the test set ($n_i \geq 1$),
• $r_{i,j}$: $j$-th reference transliteration for the $i$-th name in the test set,
• $c_{i,k}$: $k$-th candidate transliteration (system output) for the $i$-th name in the test set ($1 \leq k \leq 10$),
• $K_i$: number of candidate transliterations produced by a transliteration system.

Word Accuracy in Top-1 (ACC)
Also known as Word Error Rate, this metric measures the correctness of the first transliteration candidate in the candidate list produced by a transliteration system. ACC = 1 means that all top candidates are correct transliterations, i.e., they match one of the references, and ACC = 0 means that none of the top candidates is correct.
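Formally, following the definition used in previous editions of the shared task, ACC can be written as:

$$\mathrm{ACC} = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } \exists\, j : c_{i,1} = r_{i,j} \\ 0 & \text{otherwise} \end{cases}$$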

Fuzziness in Top-1 (Mean F-score)
The Mean F-score measures how different, on average, the top transliteration candidate is from its closest reference. The F-score for each source word is a function of Precision and Recall, and equals 1 when the top candidate matches one of the references, and 0 when there are no common characters between the candidate and any of the references. Precision and Recall are calculated based on the length of the Longest Common Subsequence (LCS) between a candidate and a reference:

$$\mathrm{LCS}(c, r) = \frac{1}{2}\left(|c| + |r| - \mathrm{ED}(c, r)\right)$$

where ED is the edit distance and $|x|$ is the length of $x$. For example, the longest common subsequence between "abcd" and "afcde" is "acd", and its length is 3. The best matching reference, i.e., the reference for which the edit distance is minimal, is used for the calculation:

$$r_{i,m} = \arg\min_{j} \mathrm{ED}(c_{i,1}, r_{i,j})$$

If the best matching reference is $r_{i,m}$, then Recall, Precision and F-score for the $i$-th word are calculated as:

$$R_i = \frac{\mathrm{LCS}(c_{i,1}, r_{i,m})}{|r_{i,m}|}, \quad P_i = \frac{\mathrm{LCS}(c_{i,1}, r_{i,m})}{|c_{i,1}|}, \quad F_i = \frac{2\, R_i P_i}{R_i + P_i}$$

The lengths are computed with respect to distinct Unicode characters, and no distinctions are made for different character types of a language (e.g., vowel vs. consonant vs. combining diaeresis).
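As an illustration, the following minimal Python sketch (illustrative only, not the official evaluation script) computes the LCS length directly by dynamic programming and reproduces the "abcd"/"afcde" example above. Note that the identity LCS(c, r) = (|c| + |r| - ED(c, r))/2 holds when ED counts only insertions and deletions.

```python
def lcs_length(c: str, r: str) -> int:
    """Length of the Longest Common Subsequence of two strings,
    computed by standard dynamic programming over Unicode characters."""
    m, n = len(c), len(r)
    # table[i][j] holds the LCS length of c[:i] and r[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if c[i - 1] == r[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

# The worked example from the text: LCS("abcd", "afcde") = "acd", length 3.
assert lcs_length("abcd", "afcde") == 3
```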

Mean Reciprocal Rank (MRR)
This metric measures the traditional Mean Reciprocal Rank for any right answer produced by the system from among the candidates. 1/MRR tells approximately the average rank of the correct transliteration. An MRR closer to 1 implies that the correct answer is mostly produced close to the top of the n-best lists.
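Formally, following the definition used in previous editions of the shared task, the reciprocal rank of the $i$-th name is:

$$\mathrm{RR}_i = \begin{cases} 1 \,/\, \min\{j : \exists\, k,\ c_{i,j} = r_{i,k}\} & \text{if some candidate matches a reference} \\ 0 & \text{otherwise} \end{cases}$$

and the overall score is $\mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{RR}_i$.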

Mean Average Precision (MAP ref )
This metric tightly measures the precision in the n-best candidates for the $i$-th source name, for which $n_i$ reference transliterations are available. If all of the references are produced, then the MAP is 1. If we denote the number of correct candidates for the $i$-th source word in the $k$-best list as $\mathit{num}(i,k)$, then $\mathrm{MAP}_{ref}$ is given by:

$$\mathrm{MAP}_{ref} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n_i} \left( \sum_{k=1}^{n_i} \mathit{num}(i,k) \right)$$

Participation in the Shared Task
A total of five teams from five different institutions participated in the NEWS 2016 Shared Task. More specifically, the participating teams were from the National Institute of Information and Communications Technology (NICT), Qazvin Islamic Azad University (QIAU), the University of Helsinki (UOH), Uppsala University (UPPS), and the Institute for Infocomm Research (I2R). Teams were required to submit at least one standard run for every task they participated in, over the NEWS 2012/2015 test sets, which were set as the official NEWS 2016 evaluation sets. We received 31 standard and 2 non-standard submissions across all test sets; i.e., 255 standard and 19 non-standard runs in total. Table 2 summarizes the number of standard runs, non-standard runs and teams participating per task. As seen from the table, the most popular task continues to be transliteration from English to Chinese and from Chinese to English (Zhang et al. 2012), followed by English to Hindi. Non-standard runs were submitted for only 3 of the 14 tasks.

Shared Task on CodaLab
Different from previous years, in NEWS 2016 the Shared Task evaluation was run online using the CodaLab platform (http://codalab.org/). CodaLab is a powerful online platform aimed at accelerating reproducible computational research. Two main functionalities are available on the CodaLab platform: worksheets, which allow for running reproducible experiments and creating executable papers; and competitions, which allow for participating in and/or hosting competitions.
CodaLab competitions can involve either code submissions or data submissions. For the NEWS 2016 Shared Task on transliteration, two CodaLab competitions in the data submission modality were created: NEWS 2016 Standard submissions (https://competitions.codalab.org/competitions/8991) and NEWS 2016 Non-standard submissions (https://competitions.codalab.org/competitions/9021). In the standard submissions competition, participants were required to use only the training and development data provided by the Shared Task, while in the non-standard submissions competition, in addition to the training and development data provided by the Shared Task, participants were welcome to use external data, either parallel or monolingual. A total of 12 and 4 participants registered for the standard and non-standard submissions competitions, respectively, but in the end only five teams submitted results to the competitions.
Each competition was composed of 14 phases, each corresponding to one of the 14 transliteration tasks available in the Shared Task. All phases were run in parallel, meaning that each participant was able to submit results to any of the phases at any moment during the evaluation campaign, which ran from April 25th to May 3rd. During this period, participants were allowed to submit to each of the two competitions up to 3 results per day and per task, with an overall maximum of 15 submissions per task during the complete evaluation period. For each task they participated in, participants were allowed to post only one result on the corresponding leaderboard. The leaderboards for the standard and non-standard submissions competitions are available at https://competitions.codalab.org/competitions/8991#results and https://competitions.codalab.org/competitions/9021#results, respectively.

Baseline System Results
Also different from previous years, in NEWS 2016 a baseline system was set up and baseline results were computed for all 14 transliteration tasks available in the Shared Task. The baseline results were produced by a simple character-level MT implementation using Moses. The baseline system was generously provided by UPC, Barcelona (Costa-jussà 2016).
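To illustrate the general idea behind such a baseline, the following minimal Python sketch (an illustrative assumption, not the actual UPC implementation) shows the usual preprocessing step: each name is split into space-separated characters, so that a word-based phrase-based MT system like Moses learns character-level mappings instead of word-level ones. File names and the example pair are hypothetical.

```python
# Sketch of data preparation for a character-level Moses baseline.

def to_char_tokens(name: str) -> str:
    """Turn a name into a space-separated sequence of characters."""
    return " ".join(name.strip())

def prepare_parallel(pairs, src_path, tgt_path):
    """Write character-tokenized parallel files for Moses training."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for source_name, target_name in pairs:
            src.write(to_char_tokens(source_name) + "\n")
            tgt.write(to_char_tokens(target_name) + "\n")

if __name__ == "__main__":
    # A single hypothetical English -> Hindi training pair.
    prepare_parallel([("ANDERSON", "एंडरसन")], "train.en", "train.hi")
```

The standard Moses training, tuning and decoding pipeline can then be run on these files unchanged, treating each character as a "word".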
A summary of the NEWS 2016 Shared Task results, including the Moses-based baseline results, is available on the workshop's website at http://workshop.colips.org/news2016/results.html. As seen from the figure, with the exception of English to Japanese Katakana, only the transliteration tasks involving Arabic, Persian and the four considered Indian languages consistently score above 80%. For the rest of the languages, with the exception of Japanese Katakana and Hebrew, scores are consistently in the range from 60% to 80%. Notice also that, regardless of the availability of training data, the English to Chinese transliteration task seems to be the most demanding one for state-of-the-art systems with respect to the considered metric.

Task Results and Analysis
Another interesting observation that can be derived from the figure, when looking at the language pairs English-Chinese and English-Thai, is that systems tend to perform slightly better on the back-transliteration tasks.
A much more comprehensive presentation of the NEWS 2016 Shared Task results is provided in the Appendix at the end of this paper. There, the resulting scores are reported for all received submissions, both standard and non-standard, under the four considered evaluation metrics. All results are presented in 14 tables, each of which reports the scores for one transliteration task over one test set. In the tables, all primary standard runs are highlighted in bold italic font.
Regarding the systems participating in this year's evaluation, the two highest-performing systems of the five participants, those of NICT and UPPS, submitted system description papers. The NICT system (Finch et al. 2016) applied ensembles of neural networks, each of which exploits the agreement of target-bidirectional sequence-to-sequence neural network models. The ensembles show great improvements over their NEWS 2015 results, which relied on a rescoring/reranking function to combine attention-based neural network and traditional machine translation models.
The UPPS system (Shao et al. 2016) implemented a neural network trained on unsupervised sub-unit alignments. They used a convolutional neural network to encode character-level transliteration information, with a recurrent neural network stacked on top. Their decoding performance demonstrates that the proposed neural network significantly outperforms the baseline, a character-level system trained with Moses.
As seen from the previous system descriptions, neural networks are becoming more and more predominant in state-of-the-art machine transliteration. Significant improvements are achieved by neural network ensembles, while single neural networks also obtain better performance than traditional phrase-based machine translation systems. The simple ensemble method achieved the best performance across all 14 phases. As seen from the figure, in most of the considered transliteration tasks, incremental improvements can be observed between the 2015 and 2016 shared tasks. The most significant improvements are in the tasks involving Japanese Katakana, Tamil, Kannada, and Thai.
Regarding the observed drops in performance, the most significant one is for the JnJk task. It is mainly due to the fact that NICT applied a totally different methodology from the one it used for JnJk in 2015. As their system description paper points out, the drop is caused by the large vocabulary on the target side, which the neural network can hardly handle.

Conclusions
The Shared Task on Machine Transliteration in NEWS 2016 has shown, once again, that the research community has a continued interest in this area. This report summarizes the results of the NEWS 2016 Shared Task.
We are pleased to report a comprehensive set of machine transliteration approaches and their evaluation results over the evaluation test sets, under two conditions: standard runs and non-standard runs. While the standard runs allow for conducting meaningful comparisons across different algorithms, the non-standard runs open up more opportunities for exploiting a variety of additional linguistic resources.
Five teams from five different institutions participated in the shared task. In total, we received 31 standard and 2 non-standard submissions across the test sets; i.e., 255 standard and 19 non-standard runs overall. Most of the current state of the art in machine transliteration is represented among the systems that participated in the shared task. Encouraged by the continued success of the NEWS workshop series, we plan to continue this event in the future to further promote machine transliteration research and development.

Acknowledgments
The organizers of the NEWS 2016 Shared Task would like to thank the Institute for Infocomm Research (Singapore), Microsoft Research India, the CJK Institute (Japan), the National Electronics and Computer Technology Center (Thailand) and Sarvnaz Karimi / RMIT for providing the corpora and technical support. Without them, the Shared Task would not have been possible. We also want to thank all program committee members for their valuable comments, which improved the quality of the shared task papers. Finally, we wish to thank all the participants for their active participation, which has once again made the NEWS Machine Transliteration Shared Task a successful one.