Exploiting Sentence Order in Document Alignment

Brian Thompson and Philipp Koehn
In this work, we exploit the simple idea that a document and its translation should contain approximately the same information, in approximately the same order. We propose methods for both document pair candidate generation and candidate re-scoring which incorporate high-level order information. Our method results in a 61% relative reduction in error versus the best previously published result on the WMT16 document alignment shared task. We also apply our method to web-scraped Sinhala-English…

Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment

The WMT Shared Task on Parallel Corpus Filtering again posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting the highest-quality data to be used to train machine translation systems.

Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

Comprehensive experiments on four cross-lingual retrieval tasks show MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of this approach.

Dubbing in Practice: A Large Scale Study of Human Localization With Insights for Automatic Dubbing

How humans perform the task of dubbing video content from one language into another is investigated, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles, challenging a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automatic dubbing.

Domain Adaptation of Machine Translation with Crowdworkers

This work proposes a framework that quickly and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers and can collect target-domain parallel data over a few days at a reasonable cost.

Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric into a Document-Level Metric

The method is applied to four popular metrics (BERTScore, Prism, COMET, and the reference-free metric COMET-QE) and dramatically improves accuracy on discourse phenomena tasks, supporting the hypothesis that the document-level metrics are resolving ambiguities in the reference sentence by using additional context.



First Steps Towards Coverage-Based Document Alignment

A method for selecting pairs of parallel documents from a large collection of documents obtained from the web based on a coverage score that reflects the number of distinct bilingual phrase pairs found in each pair of documents, normalized by the total number of unique phrases found in them.
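
The coverage score described in this summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `coverage_score`, the representation of documents as sets of phrases, and the choice to normalize by the sum of unique source and target phrases are all assumptions made for the example.

```python
def coverage_score(src_phrases, tgt_phrases, phrase_table):
    """Hypothetical coverage score for one document pair.

    src_phrases / tgt_phrases: iterables of phrases found in each document.
    phrase_table: iterable of (source_phrase, target_phrase) translation pairs.
    """
    src, tgt = set(src_phrases), set(tgt_phrases)
    # Distinct bilingual phrase pairs whose two sides both occur in the documents.
    matched = {(s, t) for (s, t) in phrase_table if s in src and t in tgt}
    # Assumed normalization: total number of unique phrases in both documents.
    total = len(src) + len(tgt)
    return len(matched) / total if total else 0.0


score = coverage_score(
    {"maison", "chat"},
    {"house", "dog"},
    {("maison", "house"), ("chat", "cat")},
)
```

Here only the pair ("maison", "house") is covered by both documents, so one matched pair is divided by four unique phrases.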

A Portable Method for Parallel and Comparable Document Alignment

This work presents a document alignment method based on expanded lexical translation sets and document-level Jaccard similarity that outperforms alternative methods in most scenarios for both parallel and comparable corpora.
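
A rough sketch of document-level Jaccard similarity over expanded translation sets is given below. The helper names `expand` and `jaccard`, the toy lexicon, and the rule that unknown tokens are kept unchanged are assumptions for illustration; the summary does not specify how the paper builds or applies its translation sets.

```python
def expand(tokens, lexicon):
    """Map each source token to its set of candidate translations.

    Tokens missing from the lexicon are kept as-is (an assumption of this sketch).
    """
    expanded = set()
    for tok in tokens:
        expanded.update(lexicon.get(tok, {tok}))
    return expanded


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical French-to-English lexicon.
lexicon = {"maison": {"house", "home"}, "chat": {"cat"}}
src_expanded = expand(["maison", "chat"], lexicon)
similarity = jaccard(src_expanded, {"house", "cat", "dog"})
```

In this toy case the expanded source set shares two of four distinct words with the target document.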

Using Term Position Similarity and Language Modeling for Bilingual Document Alignment

Four methods are presented to overcome some of the challenges posed by the nature of the corpus, including considering the string similarity of the source URL and candidate URL, and combining the first two approaches.

MT-based Sentence Alignment for OCR-generated Parallel Texts

This work describes an alternative alignment algorithm that uses machine translations of a text, with BLEU as a similarity score, to find reliable alignments that serve as anchor points. It shows that this approach outperforms state-of-the-art algorithms in this alignment task and translates into better SMT performance.

YODA System for WMT16 Shared Task: Bilingual Document Alignment

This paper addresses the task of automatically detecting and aligning bilingual documents that are translations of each other within a single web domain, as part of WMT 2016, using an n-gram based approach and an IR-based approach that uses both the content and the metadata of each web page URL.

Vecalign: Improved Sentence Alignment in Linear Time and Space

We introduce Vecalign, a novel bilingual sentence alignment method which is linear in time and space with respect to the number of sentences being aligned and which requires only bilingual sentence

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

This paper proposes a new method for this task based on multilingual sentence embeddings, which relies on nearest neighbor retrieval with a hard threshold over cosine similarity, and accounts for the scale inconsistencies of this measure.
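
One way to account for the scale inconsistencies of cosine similarity is a ratio margin: the cosine of a candidate pair is divided by the average cosine of each side to its k nearest neighbors. The sketch below is pure Python on toy vectors. The function names, the default `k`, and the exact ratio form are assumptions for illustration, since the summary does not state the formula.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_mean(values, k):
    """Mean of the k largest values (fewer if the list is shorter)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def margin_scores(src_vecs, tgt_vecs, k=2):
    """Ratio-margin score for every (source, target) pair (assumed form)."""
    sims = [[cosine(x, y) for y in tgt_vecs] for x in src_vecs]
    src_avg = [top_k_mean(row, k) for row in sims]
    tgt_avg = [top_k_mean(col, k) for col in zip(*sims)]
    scores = []
    for i, row in enumerate(sims):
        out = []
        for j, sim in enumerate(row):
            denom = (src_avg[i] + tgt_avg[j]) / 2
            out.append(sim / denom if denom else 0.0)
        scores.append(out)
    return scores


scores = margin_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], k=2)
```

A pair scores above 1 when it is more similar than the typical neighbor of either side, which makes a single threshold more comparable across sentences than raw cosine.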

Quick and Reliable Document Alignment via TF/IDF-weighted Cosine Distance

A greedy algorithm is introduced that runs quicker and performs better in practice than the optimal solution to bipartite graph matching and can be improved even further through combination with URL based pair matching.
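
A greedy one-to-one matching of the kind this summary describes might look like the sketch below. It assumes pair scores (for example TF/IDF-weighted cosine similarities) have already been computed; the function name `greedy_align` and the dictionary input format are hypothetical.

```python
def greedy_align(scores):
    """Greedily pick the highest-scoring pairs, using each document at most once.

    scores: dict mapping (source_id, target_id) -> similarity score.
    """
    used_src, used_tgt, pairs = set(), set(), []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (src, tgt), _score in ranked:
        if src not in used_src and tgt not in used_tgt:
            pairs.append((src, tgt))
            used_src.add(src)
            used_tgt.add(tgt)
    return pairs


# Toy similarity scores between two source and two target documents.
toy_scores = {
    ("a", "x"): 0.9,
    ("a", "y"): 0.8,
    ("b", "x"): 0.85,
    ("b", "y"): 0.1,
}
aligned = greedy_align(toy_scores)
```

On these toy scores the greedy choice commits to ("a", "x") first and is left with ("b", "y"), whereas maximizing the total score would pair "a" with "y" and "b" with "x". The example only shows how the two strategies can differ; it does not reproduce the paper's finding about which performs better in practice.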

An Expectation Maximization Algorithm for Textual Unit Alignment

An Expectation Maximization (EM) algorithm for automatic generation of parallel and quasi-parallel data from any degree of comparable corpora ranging from parallel to weakly comparable.

When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion

This work performs a human study on an English-Russian subtitles dataset and identifies deixis, ellipsis and lexical cohesion as three main sources of inconsistency as well as introducing a model suitable for this scenario and demonstrating major gains over a context-agnostic baseline on new benchmarks without sacrificing performance as measured with BLEU.