Detecting Cross-Language Plagiarism using Open Knowledge Graphs

preprint
posted on 2021-12-16, 12:44 authored by Johannes Stegmüller, Fabian Bauer-Marquart, Norman Meuschke, Terry Lima Ruas, Moritz Schubotz, Bela Gipp
Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs such as Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
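
The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. This is not the CL-OSA implementation: the toy lexicon, the placeholder entity IDs (`E1` to `E3`, which are not real Wikidata identifiers), and the helper names `entity_vector` and `cosine` are assumptions made for illustration only, and the sketch performs no homonym disambiguation. It maps terms in two languages to shared entity IDs and compares the resulting count vectors with cosine similarity, so no machine translation step is involved.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon: surface terms per language mapped to shared,
# language-independent entity IDs. The IDs are placeholders, not Wikidata QIDs.
TOY_LEXICON = {
    "en": {"plagiarism": "E1", "translation": "E2", "graph": "E3"},
    "de": {"plagiat": "E1", "übersetzung": "E2", "graph": "E3"},
}


def entity_vector(text, lang):
    """Count the entities whose terms occur in the text (no disambiguation)."""
    lexicon = TOY_LEXICON[lang]
    tokens = text.lower().split()
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def cosine(a, b):
    """Cosine similarity of two sparse entity-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


english = entity_vector("plagiarism translation graph", "en")
german = entity_vector("Plagiat Übersetzung Graph", "de")
similarity = cosine(english, german)
```

Because both texts map to the same three entities, their vectors coincide and the similarity is close to 1, even though the documents share only one surface token. A realistic system would replace the toy lexicon with entity linking against a knowledge graph and would have to resolve ambiguous terms.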
