Cross-lingual document similarity

Abstract

In this paper we investigated how to compute similarities between documents written in different languages, using a weakly aligned multilingual collection of documents. The cross-lingual similarities are computed from an aligned set of basis vectors, obtained by applying either latent semantic indexing or the k-means algorithm to the aligned multilingual corpus. We evaluated the methods on two data sets: Wikipedia and the European Parliament Proceedings Parallel Corpus.
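
The abstract does not spell out the computation, so the following is only a minimal sketch of one common way to obtain aligned basis vectors with latent semantic indexing, not necessarily the authors' procedure. It assumes two term-document matrices whose columns are aligned documents, stacks them, takes a truncated SVD, and splits the left singular vectors into one basis per language. The function names (aligned_lsi_bases, project, cross_lingual_similarity), the least-squares projection, the cosine measure, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def aligned_lsi_bases(X_a, X_b, k):
    """Return one k-dimensional basis per language from aligned corpora.

    X_a: (terms_a, n_docs) and X_b: (terms_b, n_docs) term-document
    matrices; column j of each describes the same aligned document.
    """
    stacked = np.vstack([X_a, X_b])
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U_k = U[:, :k]                      # top-k left singular vectors
    n_a = X_a.shape[0]
    return U_k[:n_a], U_k[n_a:]         # rows for language A, rows for language B


def project(basis, doc):
    """Least-squares coordinates of a document in a language-specific basis."""
    coords, *_ = np.linalg.lstsq(basis, doc, rcond=None)
    return coords


def cross_lingual_similarity(basis_a, doc_a, basis_b, doc_b):
    """Cosine similarity of two documents in the shared k-dimensional space."""
    u = project(basis_a, doc_a)
    v = project(basis_b, doc_b)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Toy aligned corpus (made-up counts): documents 0-1 share one topic,
# documents 2-3 share another, in both languages.
X_a = np.array([[2.0, 1.0, 0.0, 0.0],
                [1.0, 2.0, 0.0, 0.0],
                [0.0, 0.0, 3.0, 1.0]])
X_b = np.array([[3.0, 2.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 2.0],
                [0.0, 0.0, 2.0, 1.0]])

basis_a, basis_b = aligned_lsi_bases(X_a, X_b, k=2)

doc_a = np.array([1.0, 1.0, 0.0])          # language A, first topic
doc_b_same = np.array([2.0, 0.0, 0.0])     # language B, first topic
doc_b_other = np.array([0.0, 1.0, 1.0])    # language B, second topic

sim_same = cross_lingual_similarity(basis_a, doc_a, basis_b, doc_b_same)
sim_other = cross_lingual_similarity(basis_a, doc_a, basis_b, doc_b_other)
```

In this toy setting the two topics use disjoint vocabularies, so the same-topic pair should score close to 1 and the different-topic pair close to 0. A k-means variant would replace the singular vectors with cluster centroids computed on the same stacked matrix.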

DOI: 10.2498/iti.2012.0467

Cite this paper

@article{Muhic2012CrosslingualDS,
  title={Cross-lingual document similarity},
  author={Andrej Muhic and Jan Rupnik and Primoz Skraba},
  journal={Proceedings of the ITI 2012 34th International Conference on Information Technology Interfaces},
  year={2012},
  pages={387-392}
}