Real-Time Statistical Speech Translation

@inproceedings{Wok2014RealTimeSS,
  title={Real-Time Statistical Speech Translation},
  author={Krzysztof Wołk and Krzysztof Marasek},
  booktitle={WorldCIST},
  year={2014}
}
This research investigates Statistical Machine Translation approaches to automatic real-time speech translation. […] TED, Europarl, and OPUS parallel text corpora were used as the basis for training of language models, and for development, tuning and testing of the translation system. We also conducted experiments involving part-of-speech tagging, compound splitting, linear language model interpolation, TrueCasing and morphosyntactic analysis. We evaluated the effects of a variety of data…

Polish-English speech statistical machine translation systems for the IWSLT 2014

TLDR
Various elements of the TED parallel text corpora for the IWSLT 2013 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system.

Enhancements in Statistical Spoken Language Translation by De-normalization of ASR Results

TLDR
The problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems in the Polish language is explored and reverse normalization of the recognized speech samples is experimentally tested.

Spoken Language Translation for Polish

TLDR
PJIIT's experience in SLT, gained from the Eu-Bridge 7th framework project and the U-Star consortium activities for the Polish/English language pair, is presented, along with progress in the IWSLT TED task (MT only).

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

TLDR
Improvements to current comparable corpora mining methodologies are presented by re-implementation of the comparison algorithms (using the Needleman-Wunsch algorithm), introduction of a tuning script and computation time improvement by GPU acceleration.

Unsupervised Construction of Quasi-comparable Corpora and Probing for Parallel Textual Data

TLDR
Improvements to current quasi-comparable corpora mining methodologies are presented by re-implementing the comparison algorithms, introducing a tuning script and improving performance using GPU acceleration.

Exploration for Polish-* bi-lingual translation equivalents from comparable and quasi-comparable corpora

TLDR
The purpose of this research is to improve contemporary comparable corpora mining methodologies by re-implementing analogous algorithms with the Needleman-Wunsch algorithm, introducing a tuning script, and reducing calculation time through GPU acceleration.

Automatic Parallel Data Mining After Bilingual Document Alignment

TLDR
The research presented here describes a method that can help close this lingual gap by extending certain aspects of the alignment task for WMT16 by utilizing different classifiers and algorithms and by use of advanced computation.

Big data language model of contemporary Polish

TLDR
This research details exactly how the corpus was obtained and pre-processed, with an emphasis on issues that surface when working with information on this scale.

Harvesting Comparable Corpora and Mining Them for Equivalent Bilingual Sentences Using Statistical Classification and Analogy-Based Heuristics

TLDR
A web crawling method for building subject-aligned comparable corpora from sources such as Wikipedia dumps and the Euronews web page is proposed, and improvements in machine translation are shown on the Polish-English language pair for various text domains.

References

The Best Lexical Metric for Phrase-Based Statistical MT System Optimization

TLDR
It is shown that people tend to prefer BLEU and NIST trained models to those trained on edit-distance-based metrics like TER or WER, and that using BLEU or NIST produces models that are more robust to evaluation by other metrics and perform well in human judgments.

Moses: Open Source Toolkit for Statistical Machine Translation

We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c)

Factored Translation Models

TLDR
In a number of experiments, it is shown that factored translation models lead to better translation performance, both in terms of automatic scores, as well as more grammatical coherence.

Using Linear Interpolation and Weighted Reordering Hypotheses in the Moses System

TLDR
This paper proposes to introduce a novel reordering model in the open-source Moses toolkit and describes a domain adaptation technique which is based on a linear combination of a specific in-domain and an extra out-domain translation model.

SRILM - an extensible language modeling toolkit

TLDR
The functionality of the SRILM toolkit is summarized and its design and implementation are discussed, highlighting ease of rapid prototyping, reusability, and combinability of tools.

TED Polish-to-English translation system for the IWSLT 2012

TLDR
Efforts in preparation of the Polish-to-English SMT system for the TED lectures domain that is to be evaluated during the IWSLT 2012 Conference are presented.

Unsupervised and Knowledge-Free Learning of Compound Splits and Periphrases

TLDR
An approach for knowledge-free and unsupervised recognition of compound nouns for languages that use one-word compounds such as Germanic and Scandinavian languages is presented, showing promising results above 80% precision for the splits and about half of the compounds periphrased correctly.

Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger

TLDR
This paper presents results for a maximum-entropy-based part of speech tagger, which achieves superior performance principally by enriching the information sources used for tagging by incorporating these features: more extensive treatment of capitalization for unknown words, and features for the disambiguation of the tense forms of verbs.

KenLM: Faster and Smaller Language Model Queries

TLDR
KenLM is a library that implements two data structures for efficient language model queries, reducing both time and memory costs, and is integrated into the Moses, cdec, and Joshua translation systems.

Parallel Implementations of Word Alignment Tool

TLDR
Two parallel implementations of GIZA++ are presented that accelerate the word alignment process, showing a near-linear speed-up with the number of CPUs used while preserving alignment quality.