Automatic Detection of Multilingual Dictionaries on the Web

@inproceedings{Grigonyte2014AutomaticDO,
  title={Automatic Detection of Multilingual Dictionaries on the Web},
  author={Gintare Grigonyte and Timothy Baldwin},
  booktitle={ACL},
  year={2014}
}
This paper presents an approach to query construction for detecting multilingual dictionaries for predetermined language combinations on the web, based on identifying terms that are likely to occur in bilingual dictionaries but not in general web documents. We use eight target languages for our case study, and train our method on pre-identified multilingual dictionaries and the Wikipedia dump for each of our languages.
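As a rough illustration of this idea (a sketch, not the paper's exact formulation), the snippet below scores terms by a smoothed frequency ratio between a corpus of known bilingual dictionaries and general web text (e.g. a Wikipedia dump), then folds the highest-scoring terms into a web search query. The function names, the add-one smoothing, and the query template are all assumptions introduced here for illustration.

```python
from collections import Counter

def dictionary_specific_terms(dict_tokens, web_tokens, top_k=10):
    """Rank terms that are frequent in known bilingual dictionaries
    but rare in general web text (here: a Wikipedia dump).

    dict_tokens, web_tokens: iterables of tokens from each corpus.
    Returns the top_k terms by a smoothed frequency ratio.
    """
    dict_freq = Counter(dict_tokens)
    web_freq = Counter(web_tokens)
    dict_total = sum(dict_freq.values())
    web_total = sum(web_freq.values())

    def score(term):
        # Relative frequency in dictionaries vs. the web, with
        # add-one smoothing so unseen web terms don't divide by zero.
        p_dict = dict_freq[term] / dict_total
        p_web = (web_freq[term] + 1) / (web_total + 1)
        return p_dict / p_web

    return sorted(dict_freq, key=score, reverse=True)[:top_k]

def build_query(terms, lang_pair=("de", "fr")):
    # Combine dictionary-indicative terms into a single search query
    # targeting one predetermined language combination (hypothetical
    # query template; the paper's actual construction may differ).
    return " ".join(terms) + f" {lang_pair[0]}-{lang_pair[1]} dictionary"
```

For example, `build_query(dictionary_specific_terms(dict_corpus, wiki_corpus), ("de", "fr"))` would yield a query biased toward pages that look like German-French dictionaries rather than ordinary web documents.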
