Cross-lingual similar document retrieval methods

Zubarev D.V., Sochenkov I.V. Trudy ISP RAN/Proc. ISP RAS, vol. 31, issue 5, 2019, pp. 127-136.

Abstract. In this paper, we compare different methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach based on an inverted index, with an extra step that maps the top keywords of one language to the other with the help of cross-lingual word embeddings. We evaluate all approaches on aligned Russian-English Wikipedia articles. The conducted experiments show that the inverted-index approach achieves better recall and MAP than the other methods.


Introduction
Document retrieval from a large collection of texts is an important information retrieval problem. It has been extensively studied for short queries, such as user queries to search engines. Document retrieval with whole texts as queries poses additional difficulties, among them the inability to capture the main ideas and topics of a long text. The problem becomes even harder in the cross-lingual setting. Some tasks require using a (possibly long) text as a query to retrieve documents that are similar to it. One such task is plagiarism detection, which is divided into two stages: source retrieval and text alignment.
• At the source retrieval stage, for a given suspicious document, we need to find all sources of probable text reuse in a large collection of texts. Here a source is a whole text, without details of which parts of the document were plagiarized. Typically, this stage yields a large set of documents (around 500 or more).
• At the text alignment stage, we compare the suspicious document to each candidate to detect all reused fragments and identify their boundaries [1][2][3][4].
In this work, we study only the first task. The same stages apply to cross-lingual plagiarism detection: given a query document in one language, the goal is to find the most similar documents in a collection in another language.

Related work
Several recent works are devoted to monolingual document retrieval for long texts. In [5], the authors introduce a siamese multi-depth attention-based hierarchical recurrent neural network that learns long-text semantics. They conducted multiple experiments, including retrieval of similar Wikipedia articles. In [6], the authors employ standard approximate nearest neighbor (ANN) search instead of the usual discrete inverted index for retrieving documents. They learned a similarity function and showed that it can improve performance on two similar-question retrieval tasks. However, using a custom similarity function makes it impossible to employ existing ANN frameworks, so they used exact search in their experiments. In [7], a framework is introduced for monolingual and cross-lingual information retrieval based on cross-lingual word embeddings. The authors represent user queries and documents as averaged word embeddings and employ exact search to find similar documents for a given query.
An overview of different approaches to cross-lingual source retrieval is presented in [8] and [9], along with an evaluation and a detailed comparison of some featured methods. In [10], neural machine translation (NMT) is used to translate a query document into the other language. The authors solve the source retrieval task with the shingles (overlapping word N-grams) method, using word-class shingles instead of word shingles, where each word is substituted by the label of the class it belongs to. To obtain word classes, they apply agglomerative clustering to word embeddings learned from the English Wikipedia. The work [11] describes training word embeddings on comparable monolingual corpora (Russian and Ukrainian academic texts) and learning an optimal linear transformation of vectors from one language to the other; it also discusses the use of those embeddings in the source retrieval and text alignment subtasks. This work focuses on the comparison of retrieval-based approaches with an ANN approach for a distant language pair.

Document retrieval methods
In this section, we describe various methods that we used for document retrieval.

Preprocessing
At the preprocessing stage, we split each sentence into tokens, lemmatize the tokens, and parse the texts. We use AOT for the Russian language and UDPipe [12] (the english-ewt-ud-2.4-190531 model) for English. In addition, we remove words with unimportant parts of speech (conjunctions, pronouns, prepositions, etc.) and common stop words (be, являться (to be), etc.).
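The filtering step above can be sketched as follows. This is a minimal illustration, not the actual pipeline: lemmatization is done by external tools (AOT, UDPipe), so here we assume their output is already a list of (lemma, POS) pairs; the POS tags (Universal Dependencies style) and the stop-word entries are assumptions for the example.

```python
# Sketch of the preprocessing filter: drop auxiliary-POS words and stop words.
# The POS inventory and stop-word list below are illustrative, not the exact
# ones used in the paper.

STOP_WORDS = {"be", "являться"}                               # sample entries
DROP_POS = {"CCONJ", "SCONJ", "PRON", "ADP", "PART", "DET"}   # assumed UD tags

def filter_tokens(tagged_lemmas):
    """Keep only content-word lemmas: drop stop words and auxiliary POS."""
    return [lemma for lemma, pos in tagged_lemmas
            if pos not in DROP_POS and lemma.lower() not in STOP_WORDS]

tokens = [("mother", "NOUN"), ("be", "AUX"), ("wash", "VERB"),
          ("and", "CCONJ"), ("frame", "NOUN")]
print(filter_tokens(tokens))  # ['mother', 'wash', 'frame']
```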

Cross-lingual embeddings
We train cross-lingual word embeddings for the Russian-English pair on parallel sentences available on the Opus site [13]. We extend this corpus with sentences from the Yandex Parallel corpus [14] (https://translate.yandex.ru/corpus?lang=en). All parallel sentences are preprocessed. After that, all pairs that differ in size by more than 10 words are filtered out. To enrich the vocabulary, we use syntactic phrases up to 4 words in length. We take only those phrases (noun phrases and some prepositional phrases for the English language) that are common in the corpus (>10 occurrences). If a sentence contains overlapping phrases, we duplicate it several times. For example, from a sentence containing the phrase «Russian presidential election ...», three variations with different phrases will be generated:
• «Russian_presidential_election ... »;
• «Russian_election presidential_election ... »;
• «Russian presidential election ... ».
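The variation step can be sketched as below. This is one plausible reading of the description: each sentence copy joins one set of non-overlapping phrase spans with underscores, and overlapping phrases therefore end up in different copies. The span-grouping policy itself is an assumption; the paper only states that overlapping phrases cause duplication.

```python
# Sketch: produce one sentence variant per compatible set of phrase spans.
# Each (start, end) token span is joined with underscores.

def apply_phrases(tokens, spans):
    """Join each (start, end) token span with '_'; spans must not overlap."""
    out, i = [], 0
    for start, end in sorted(spans):
        out.extend(tokens[i:start])
        out.append("_".join(tokens[start:end]))
        i = end
    out.extend(tokens[i:])
    return out

tokens = ["Russian", "presidential", "election"]
print(apply_phrases(tokens, [(0, 3)]))  # ['Russian_presidential_election']
print(apply_phrases(tokens, [(1, 3)]))  # ['Russian', 'presidential_election']
print(apply_phrases(tokens, []))        # ['Russian', 'presidential', 'election']
```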
Finally, we assembled a corpus of more than 5.1 million sentences (more than 10 million sentences with phrase variations). The dictionary size was around 680 000 words/phrases. We apply two different methods for learning cross-lingual embeddings [15].
First, we learn monolingual embeddings for each language. We use the word2vec skip-gram model [16] with the following parameters: embedding dimensionality of 300, a window size of 10 words, a minimal corpus frequency of 10, negative sampling with 10 samples, no downsampling, and 20 iterations over the corpus. Then we use the vecmap framework [17, 18] to learn a transformation matrix that maps representations in one language to representations in the other. We use 20 000 random word pairs from the bilingual dictionary of the MUSE project [19] as the training data.
Second, we apply the method proposed in [20], which is designed for learning bilingual word embeddings from a non-parallel document-aligned corpus but can be used for learning on parallel sentences as well. We assume that the structures of the two sentences are similar. Words are inserted into a pseudo-bilingual sentence relying on the order in which they appear in their monolingual sentences and on the length ratio of the two sentences. For example, given the two sentences «Мама мыла раму» (Mother washed the frame) and «Mother washed beautiful frame», the result of their merging is «мама mother мыла washed раму beautiful frame». Since we removed auxiliary words from the sentences, we assume that corresponding Russian and English words fall into the same context window. This would not be the case under a different word order, so we experimented with different window sizes and chose a size of 10. After that, the word2vec skip-gram model is trained on the resulting bilingual corpus. We use the gensim word2vec implementation with the same parameters as above: embedding dimensionality of 300, a window size of 10 words, a minimal corpus frequency of 10, negative sampling with 10 samples, no downsampling, and 20 iterations over the corpus.
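The merging described above can be sketched as follows. The exact insertion rule is not spelled out in the paper, so this is one plausible reading that reproduces the given example: target-language words are interleaved after source words proportionally to the length ratio of the two sentences.

```python
# Sketch of pseudo-bilingual sentence merging: interleave tgt words among
# src words according to the length ratio of the two sentences.

def merge_sentences(src, tgt):
    """Interleave tgt tokens among src tokens proportionally to length ratio."""
    ratio = len(tgt) / len(src)
    merged = []
    for i, word in enumerate(src):
        merged.append(word)
        # tgt positions that "belong" to source position i
        merged.extend(tgt[int(i * ratio):int((i + 1) * ratio)])
    return merged

src = ["мама", "мыла", "раму"]
tgt = ["mother", "washed", "beautiful", "frame"]
print(" ".join(merge_sentences(src, tgt)))
# мама mother мыла washed раму beautiful frame
```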

Retrieval-based approach
We use a custom implementation of an inverted index [21], which maps each word to a list of documents in which it appears, along with a weight (e.g. TF) that represents the strength of association between the word and a document. Along with words, we index syntactic phrases of up to 4 words that occur in a document more than once. At query time, we extract the top words/phrases from the query document according to some weighting scheme. Then we map each keyword to N keywords of the other language using cross-lingual embeddings. To speed up this operation, we precompute the most similar words for each word in our vocabulary. The mapped keywords preserve the weights of the original top keywords. The searcher iterates over the top keywords, retrieves the corresponding documents from the inverted index, and merges them into weighted keyword vectors that represent the other documents. Then we compare the query vector with all other vectors. It should be noted that the comparison is asymmetrical, since the vectors of other documents consist only of words from the query vector. Although this is not the most accurate representation of those documents, the comparison is very efficient, and retrieval performance (recall) is not affected much. To compute the similarity score between vectors, we employ some similarity measure (e.g. cosine similarity).
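The query-side control flow described above can be sketched as follows. The inverted index, the precomputed nearest-neighbour table, and the weights are stubbed with toy dictionaries; only the structure of the steps (keyword mapping, posting retrieval, asymmetric cosine comparison) mirrors the description.

```python
# Schematic sketch of the retrieval-based approach at query time.
import math
from collections import defaultdict

neighbours = {"выборы": ["election"], "президент": ["president"]}  # RU -> top-N EN words
inverted_index = {                                                 # EN word -> {doc: weight}
    "election": {"doc1": 0.8, "doc2": 0.3},
    "president": {"doc1": 0.5},
}

def search(query_top):
    """query_top: {RU keyword: weight}. Return docs ranked by cosine similarity."""
    # 1. Map query keywords to the other language, preserving their weights.
    query_vec = {}
    for word, weight in query_top.items():
        for translation in neighbours.get(word, []):
            query_vec[translation] = weight
    # 2. Collect candidates as weighted vectors over query keywords only
    #    (the asymmetric comparison noted above).
    doc_vecs = defaultdict(dict)
    for word in query_vec:
        for doc, weight in inverted_index.get(word, {}).items():
            doc_vecs[doc][word] = weight
    # 3. Score candidates with cosine similarity.
    qn = math.sqrt(sum(w * w for w in query_vec.values()))
    scores = {}
    for doc, vec in doc_vecs.items():
        dot = sum(query_vec[w] * vec[w] for w in vec)
        dn = math.sqrt(sum(w * w for w in vec.values()))
        scores[doc] = dot / (qn * dn)
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({"выборы": 0.9, "президент": 0.6}))  # doc1 ranked first
```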

Approximate nearest neighbor search (ANN)
In this approach, we represent each document as a dense vector obtained by averaging the vectors of the document's top K keywords. After that, we index all vectors with an ANN index. At query time, the given document is transformed into its vector representation, and approximate nearest neighbor search is employed to retrieve the most similar documents.
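The representation step can be sketched as below. The paper uses a Faiss IVF index for the search; to keep the snippet self-contained, the index is replaced here by exact brute-force cosine search over toy 2-dimensional embeddings (the real ones are 300-dimensional word2vec vectors).

```python
# Sketch: document vector = normalized average of top-K keyword embeddings;
# search is exact cosine here, standing in for the Faiss ANN index.
import numpy as np

emb = {  # toy 2-d embeddings for illustration
    "election": np.array([1.0, 0.0]),
    "president": np.array([0.0, 1.0]),
    "frame": np.array([-1.0, 0.0]),
}

def doc_vector(top_keywords):
    """Average the embeddings of the document's top keywords, then normalize."""
    v = np.stack([emb[w] for w in top_keywords]).mean(axis=0)
    return v / np.linalg.norm(v)

docs = {"doc1": ["election", "president"], "doc2": ["frame"]}
index = {d: doc_vector(ws) for d, ws in docs.items()}

def nearest(query_keywords, k=1):
    q = doc_vector(query_keywords)
    scored = sorted(index.items(), key=lambda kv: -float(q @ kv[1]))
    return [d for d, _ in scored[:k]]

print(nearest(["election"]))  # ['doc1']
```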

Explicit semantic analysis (ESA)
We implemented the CL-ESA method described in [9], first introduced for solving the monolingual semantic relatedness task in [22]. This method represents a document as a weighted vector of concepts, where concepts are defined by Wikipedia articles. In the original work, the authors used all English Wikipedia articles as concepts. We selected around 800 000 English articles that are aligned with Russian Wikipedia articles (articles identified as comparable across languages by the Wikipedia community). For a given document d, the weight of a concept c is defined as the cosine similarity between the top M keywords of d and the matched keywords of the article a_c linked with the concept c:
weight(c, d) = cos(v_d, v_{a_c}) = ( Σ_{w ∈ top_M(d)} wt(w, d) · wt(w, a_c) ) / ( ||v_d|| · ||v_{a_c}|| ),
where wt(w, d) is the weight of the word w for the document d (e.g. TF-IDF) and wt(w, a_c) is the weight of w for the Wikipedia article a_c linked with the concept c (e.g. TF-IDF). We precomputed the vector of concepts for each document in the text collection and stored these vectors in the same inverted index implementation that was used for the retrieval-based approach. At query time, the query document is converted into a vector of weighted concepts, i.e., identifiers of Wikipedia articles. Those identifiers are then mapped to articles in the other language, and similar documents are retrieved via search in the inverted index.
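The concept weighting of CL-ESA can be sketched as follows: the weight of each concept is the cosine similarity between the document's top-keyword TF-IDF vector and the TF-IDF vector of the Wikipedia article linked to that concept. The article vectors are toy values here.

```python
# Sketch of CL-ESA concept-vector computation over toy article vectors.
import math

articles = {                      # concept id -> {word: TF-IDF weight}
    "c_elections": {"election": 0.7, "vote": 0.4},
    "c_cinema": {"film": 0.9},
}

def concept_vector(doc_top, max_concepts=1200):
    """doc_top: {word: weight} of the document's top-M keywords."""
    dn = math.sqrt(sum(w * w for w in doc_top.values()))
    scores = {}
    for concept, art in articles.items():
        dot = sum(w * art[t] for t, w in doc_top.items() if t in art)
        if dot > 0:
            an = math.sqrt(sum(w * w for w in art.values()))
            scores[concept] = dot / (dn * an)
    # keep only the max_concepts concepts with the largest weights
    return dict(sorted(scores.items(), key=lambda kv: -kv[1])[:max_concepts])

print(concept_vector({"election": 0.9}))  # only c_elections gets a weight
```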

Dataset
We use aligned Russian-English Wikipedia articles as the dataset for evaluating the retrieval methods (Wikipedia dump of June 2019). We exclude all articles whose title starts with "List of", whose length is less than 800 characters, or which contain fewer than 10 sentences. Then we divide all remaining pairs of articles into two groups, and each group into five bins by the size of the Russian article in sentences:
• comparable by size: articles that satisfy the following requirement:
• non-comparable by size: articles that satisfy the following requirement:
Then we sampled 100 documents from each bin. That gives us a dataset that contains 1000 document pairs.

Indexing of Wikipedia
We indexed all articles from English (5.8M) and Russian (1.5M) Wikipedia dumps (June 2019).

Retrieval-based approach
We use a TF-IDF weighting scheme with log(tf(w, d) + 1) as the TF weight for a word w from a document d, and max(0, log((N − df(w) + 0.5)/(df(w) + 0.5))) as the IDF weight, where N is the total number of documents in the collection and df(w) is the number of documents containing w.
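The two factors can be written out as functions. Note that the original formula was partially garbled in extraction; the shapes below (logarithmic TF damping and a BM25-style IDF clipped at zero) are a reconstruction consistent with the surrounding text.

```python
# TF-IDF weighting as reconstructed above.
# tf: in-document frequency of the word; df: its document frequency;
# N: total number of documents in the collection.
import math

def tf_weight(tf):
    return math.log(tf + 1)

def idf_weight(df, N):
    return max(0.0, math.log((N - df + 0.5) / (df + 0.5)))

def tf_idf(tf, df, N):
    return tf_weight(tf) * idf_weight(df, N)

print(tf_weight(0))            # 0.0
print(idf_weight(500, 1000))   # 0.0  (a word in half the collection carries no weight)
```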

Approximate nearest neighbor search
We take the top keywords with weight > 0.05 and average the embeddings of those words. We use the Faiss IVFFlat index [23] for indexing document embeddings with the following parameters: the number of centroids is 4·√|V|, where V is the set of all vectors to be indexed; the training set size is the number of centroids multiplied by min_points_per_centroid, which equals 39 by default; nprobe is 16; compression is SQfp16. Our experiments showed that these parameters result in efficient search time and a search precision greater than 90%.

ESA
When precomputing concept vectors for the ESA method, we used the top 200 keywords of a document (with weight > 0.05) to compute the weights of concepts. We kept at most 1200 concepts with the largest weights per document. Since we build vectors of Wikipedia articles using Wikipedia articles as concepts, we excluded from each article's vector the concept that represents the same article.

Evaluation Results
We used grid search for parameter tuning on 400 documents that were sampled independently of the testing data. We performed a search for all 1000 query documents using the various methods, retrieved the 600 most similar documents, and measured standard metrics: Recall and MAP. We use the following abbreviations in table 3 and below:
• RBA - retrieval-based approach;
• EMB - embeddings that were used: BIL - embeddings built on the bilingual corpus, MAP - embeddings mapped via the vecmap framework;
• MP - maximum phrase size (1-4); if 1, the keywords may contain only single words;
• N - number of similar words in the other language taken for each word when mapping keywords (1 if not specified explicitly);
• MTS - number of keywords in the other language (mapped top size) (100 if not specified explicitly);
• SK - similarity score: cosine (cos) or Hamming (ham) similarity measure (cos if not specified explicitly);
• ANN - approximate nearest neighbor search;
• DIM - dimensionality of embeddings (300 if not specified explicitly);
• K - the document is an average of the vectors of its top K keywords;
• ESA - explicit semantic analysis;
• CTOP - number of concepts to use for retrieval.
Table 3 displays the evaluation results obtained on the wiki dataset. The results show that the retrieval-based approach is better in terms of Recall and MAP than the other methods. The embeddings built on the bilingual corpus (EMB=BIL) give better results for this task than the embeddings obtained via mapping (EMB=MAP). The experiments also show that syntactic phrases give no significant boost in performance for the RBA and ANN approaches. Doubling the dimensionality of the embeddings from 300 to 600 results in better ranking for the ANN approach but has almost no effect on RBA. ESA shows good recall, but its ranking of the found documents is worse than that of the RBA and ANN methods.
It should be pointed out that the performance of the methods differs significantly depending on the size of the documents (table 4).
Denis Vladimirovich ZUBAREV - engineer at FRC CSC RAS. Research interests: information retrieval, natural language processing, plagiarism detection, big data.
Ilya Vladimirovich SOCHENKOV - PhD, Head of the Department of Intelligent Technologies and Systems of FRC CSC RAS. Research interests: information retrieval, natural language processing, machine learning, big data.

Table 1 :
Statistics of articles comparable by size

Table 2 :
Statistics of articles non-comparable by size

Table 3 .
Evaluation results