Detecting Cross-Language Plagiarism using Open Knowledge Graphs

@article{Stegmller2021DetectingCP,
  title={Detecting Cross-Language Plagiarism using Open Knowledge Graphs},
  author={Johannes Stegm{\"u}ller and Fabian Bauer-Marquart and Norman Meuschke and Terry Ruas and Moritz Schubotz and Bela Gipp},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.09749}
}
Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce a new multilingual retrieval model, Cross-Language Ontology-Based Similarity Analysis (CL-OSA), for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It…
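
The idea of comparing documents through language-independent entity vectors can be illustrated with a minimal sketch. It is not the CL-OSA implementation: the abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the entity identifiers are hypothetical placeholders standing in for the output of an entity-linking step that is not shown.

```python
from collections import Counter
from math import sqrt


def entity_vector(entity_ids):
    """Build a sparse count vector from a list of knowledge-graph entity IDs."""
    return Counter(entity_ids)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical entity IDs, as if produced by linking each document's
# terms to knowledge-graph entities. The IDs are placeholders only.
doc_en = entity_vector(["Q900001", "Q900002", "Q900002", "Q900003"])
doc_de = entity_vector(["Q900002", "Q900001", "Q900003", "Q900002"])
doc_unrelated = entity_vector(["Q900004", "Q900005"])

sim_translation = cosine_similarity(doc_en, doc_de)
sim_unrelated = cosine_similarity(doc_en, doc_unrelated)
```

Because both documents map to the same entities regardless of surface language, the first pair scores high and the unrelated pair scores zero, without any translation step.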
