Set-Theoretic Alignment for Comparable Corpora

Thierry Etchegoyhen and Andoni Azpeitia
We describe and evaluate a simple method to extract parallel sentences from comparable corpora. The approach, termed STACC, is based on expanded lexical sets and the Jaccard similarity coefficient. We evaluate our system against state-of-the-art methods on a wide range of datasets in different domains, for ten language pairs, showing that it either matches or outperforms current methods across the board and gives significantly better results on the noisiest datasets. STACC is a portable method…
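
To make the idea of scoring sentence pairs with lexical sets and the Jaccard coefficient concrete, here is a minimal sketch. It is not the STACC implementation: the abstract does not specify how the lexical sets are expanded, so the `lexicon` dictionary, the `expand` helper, and the toy English-Spanish data are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def expand(tokens, lexicon):
    """Map source tokens to a set of candidate target-language tokens.

    Hypothetical expansion step: each token is replaced by its entries
    in a bilingual lexicon; tokens without an entry are kept unchanged
    (useful for numbers or names shared by both languages).
    """
    expanded = set()
    for tok in tokens:
        expanded.update(lexicon.get(tok, {tok}))
    return expanded


# Toy lexicon and sentence pair (illustrative data only).
lexicon = {"the": {"el", "la"}, "cat": {"gato"}, "sleeps": {"duerme"}}
source = ["the", "cat", "sleeps"]
target = ["el", "gato", "duerme"]

score = jaccard(expand(source, lexicon), target)
print(score)  # 3 shared tokens out of 4 in the union
```

A sentence pair would then be kept as a parallel candidate when its score exceeds some threshold; the threshold value and any further refinements are left open here because the abstract does not state them.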
