Corpus ID: 1819712

Extracting bilingual terminologies from comparable corpora

@inproceedings{Aker2013ExtractingBT,
  title={Extracting bilingual terminologies from comparable corpora},
  author={Ahmet Aker and M. Paramita and R. Gaizauskas},
  booktitle={ACL},
  year={2013}
}
In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. The performance of our classifier reaches the 100% precision level for…
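As a rough illustration of the evaluation measures named in the abstract, the following Python sketch computes precision, recall and F-measure for a binary classifier that labels candidate term pairs as translations or non-translations. The function name and the toy labels are hypothetical and are not taken from the paper; the paper's own classifier, features and data are not reproduced here.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary labels.

    gold, predicted: equal-length sequences of booleans, where True means
    "this candidate term pair is a translation pair".
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # correctly accepted pairs
    fp = sum(1 for g, p in pairs if not g and p)    # wrongly accepted pairs
    fn = sum(1 for g, p in pairs if g and not p)    # missed translation pairs

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels (hypothetical): three true translation pairs, two non-pairs.
gold = [True, True, True, False, False]
predicted = [True, True, False, False, False]
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case the classifier accepts no wrong pairs, so precision is perfect while recall is lower because one true pair is missed. This is why a precision figure is normally read together with recall.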
Classification and Selection of Translation Candidates for Parallel Corpora Alignment
TLDR
A labelled lexicon with entries tagged for correctness enables bilingual learning by incorporating human feedback in parallel corpora alignment and term translation extraction tasks, and by using all human validated term translation pairs that have been marked as correct.
Extracting bilingual terms from the Web
TLDR
Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
Towards producing bilingual lexica from monolingual corpora
TLDR
This work describes an approach to automatically learn bilingual lexica by training a supervised classifier using word embedding-based vectors of only a few hundred translation equivalent word pairs, obtained from source and target monolingual corpora, which are not necessarily related.
Two approaches to compilation of bilingual multi-word terminology lists from lexical resources
TLDR
A binary classifier is developed that decides whether a candidate pair, composed of aligned source and target terms, is valid; different classifiers are trained and evaluated on a list of manually labeled candidate pairs obtained after the implementation of the extraction system.
Automatic compilation of bilingual terminologies from comparable corpora
TLDR
This work focuses on bilingual terminology induction from freely available comparable corpora, i.e. thematically related documents in two or more languages, and integrates automatically compiled bilingual terminologies with Statistical Machine Translation systems to more accurately translate unknown terms.
In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
TLDR
A new approach is presented for both monolingual and multilingual term annotation in comparable corpora with detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains.
TermEnsembler: An ensemble learning approach to bilingual term extraction and alignment
TLDR
TermEnsembler is a bilingual term extraction and alignment system utilizing a novel ensemble learning approach to bilingual term alignment using an ensemble of seven bilingual alignment methods which are first executed separately and then merged using the weights learned with an evolutionary algorithm.
Evaluating Features for Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families
TLDR
This paper uses the phrase table of a state-of-the-art phrase-based statistical machine translation model to collect candidates of synonymous translation equivalent pairs from parallel patent sentences and identifies the minimum number of features that perform as comparatively well as the optimal set of features.
Bootstrapping Term Extractors for Multiple Languages
TLDR
A low-cost method for creating terminology extraction resources for 21 non-English EU languages is reported, and a General POS Tagger is created for these languages using parallel corpora and a projection method.
Augmenting Translation Lexica by Learning Generalised Translation Patterns
TLDR
An approach to automatically induce segmentation and learn bilingual morph-like terms is explored as a phase prior to the suggestion of out-of-vocabulary bilingual lexicon entries, thereby saving the time involved and progressively improving alignment and extraction quality.

References

SHOWING 1-10 OF 27 REFERENCES
Automatic Bilingual Phrase Extraction from Comparable Corpora
In this work we present an approach for extracting parallel phrases from comparable news articles to improve statistical machine translation. This is particularly useful for under-resourced languages…
Automatic extraction of bilingual terms from a Chinese-Japanese parallel corpus
This paper proposes a new approach for the automatic extraction of bilingual terms from a domain-specific bilingual parallel corpus. We combine existing monolingual term extractor and a word…
Bilingual lexicon extraction from comparable corpora using in-domain terms
TLDR
The proposed method is based on the notion of in-domain terms which can be thought of as the most important contextually relevant words and can learn highly accurate bilingual lexicons without using orthographic features or a large initial seed dictionary.
Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages
TLDR
Methods for term extraction, term tagging in documents, and bilingual term mapping from comparable corpora for four under-resourced languages: Croatian, Latvian, Lithuanian, and Romanian are presented.
Towards Automatic Extraction of Monolingual and Bilingual Terminology
TLDR
This paper makes use of linguistic knowledge to identify certain noun phrases, both in English and French, which are likely to be terms, and proposes a statistical method to build correspondences of multi-word units across languages.
Mining named entity transliteration equivalents from comparable corpora
TLDR
A novel method is introduced, called MINT (MIning Named-entity Transliteration equivalents), with the following innovations for effective mining of NETEs from comparable corpora: MINT relies on few linguistic resources, requiring a Named Entity Recognizer (NER) in only one language; hence NETEs from even a resource-poor language may be mined, when paired with a language where an NER is available.
Finding Terminology Translations from Non-parallel Corpora
We present a statistical word feature, the Word Relation Matrix, which can be used to find translated pairs of words and terms from non-parallel corpora, across language groups. Online dictionary…
Bilingual Terminology Mining - Using Brain, not brawn comparable corpora
TLDR
It is shown that the type of discourse is an important characteristic of the comparable corpus and ensures the quality of the acquired terminological resources.
Learning Translations of Named-Entity Phrases from Parallel Corpora
We develop a new approach to learning phrase translations from parallel corpora, and show that it performs with very high coverage and accuracy in choosing French translations of English named-entity…
Identifying bilingual Multi-Word Expressions for Statistical Machine Translation
TLDR
A strategy for detecting translation pairs of MWEs in a French-English parallel corpus and three methods aiming to integrate extracted bilingual MWEs in MOSES, a phrase-based Statistical Machine Translation (SMT) system, are described.