BootCaT: Bootstrapping Corpora and Terms from the Web
@inproceedings{Baroni2004BootCaTBC, title={BootCaT: Bootstrapping Corpora and Terms from the Web}, author={Marco Baroni and Silvia Bernardini}, booktitle={International Conference on Language Resources and Evaluation}, year={2004} }
This paper introduces the BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web. The seeds are used to build a corpus via automated Google queries, and more terms are extracted from this corpus. In turn, these new terms are used as seeds to build a larger corpus via automated queries, and so forth. The corpus and the unigram terms are then used to extract multi-word terms. We conducted an evaluation of the tools by…
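The iterative loop described in the abstract can be illustrated with a minimal Python sketch. This is not the BootCaT Perl code: the names `bootstrap`, `extract_terms`, and `search` are hypothetical, the search engine is replaced by a lookup over a tiny in-memory document list, and term extraction is reduced to raw word frequency, which is an assumption made only to keep the example short. Multi-word term extraction is omitted.

```python
import re
from collections import Counter
from itertools import combinations, islice


def extract_terms(corpus, known, top_n):
    """Return the top_n most frequent words not already known.

    Simplification (assumption): raw frequency with a length filter
    stands in for a real term-extraction method.
    """
    counts = Counter(
        word
        for doc in corpus
        for word in re.findall(r"[a-z]+", doc.lower())
        if len(word) > 3 and word not in known
    )
    return [word for word, _ in counts.most_common(top_n)]


def bootstrap(seeds, search, rounds=2, tuple_size=2, max_queries=10, top_n=2):
    """Grow a corpus and a term list from a few seed terms.

    `search` is a hypothetical callable: it takes a tuple of query words
    and returns a list of document strings.
    """
    terms = list(dict.fromkeys(seeds))  # all terms found so far, in order
    current = list(terms)               # seeds for the current round
    corpus, seen = [], set()
    for _ in range(rounds):
        # Combine the current seeds into queries and collect new documents.
        for query in islice(combinations(current, tuple_size), max_queries):
            for doc in search(query):
                if doc not in seen:
                    seen.add(doc)
                    corpus.append(doc)
        # Extract new terms; they become the seeds of the next round.
        current = extract_terms(corpus, set(terms), top_n)
        terms.extend(current)
    return corpus, terms


# Toy stand-in for a web search engine (illustrative data only).
DOCS = [
    "corpus tools build concordance annotation",
    "corpus tools need concordance annotation",
    "concordance annotation schemes explained",
    "football match results",
]


def search(query):
    """Return the documents that contain every word of the query."""
    return [d for d in DOCS if all(q in d.split() for q in query)]


corpus, terms = bootstrap(["corpus", "tools"], search)
```

In this toy run the second round retrieves a document that the original seeds alone would not have matched, while the unrelated document is never collected.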
396 Citations
Retrieving Japanese specialized terms and corpora from the World Wide Web
- Computer Science
- 2004
It is reported that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which the procedure was tested originally.
A novel approach to build Kannada web Corpus
- Computer Science
- 2012 International Conference on Computer Communication and Informatics
- 2012
An evaluation of the Kannada Corpus tool is conducted by applying it to the construction of Kannataka corpora from domains such as Recent Discussions, Articles, Recent Activities, Proverbs, Recent Feedback, Poems, Fifteen Books, Novels, Newspaper, Dictionary, Blogs and Informal Chats.
CorpoMate: A framework for building linguistic corpora from the web
- Computer Science
- 2016 19th International Conference on Computer and Information Technology (ICCIT)
- 2016
CorpoMate is introduced, an extensible framework with a pipeline-inspired, modular architecture for automating the creation of linguistic corpora from web resources by crawling websites or parsing feeds; it can export aggregated data into widely accepted formats.
Building Large Corpora from the Web Using a New Efficient Tool Chain
- Computer Science
- LREC
- 2012
A software toolkit for web corpus construction is presented, along with a set of significantly larger corpora built using this software, which performs basic cleanups as well as boilerplate removal, simple connected text detection, and shingling to remove duplicates from the corpora.
WebBootCaT: a Web Tool for Instant Corpora
- Computer Science
- 2006
A web service for quickly producing corpora for specialist areas, in any of a range of languages, from the web, which is easy for non-technical people to use as all they need do is fill in a web form.
A Suite to Compile and Analyze an LSP Corpus
- Computer Science
- LREC
- 2008
This paper presents a series of tools for the extraction of specialized corpora from the web and its subsequent analysis mainly with statistical techniques. It is an integrated system of original as…
Open-source Corpora: Using the net to fish for linguistic data
- Computer Science
- 2006
Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora that are rare and expensive to build.
Exploiting the Internet to build language resources for less resourced languages
- Computer Science
- 2010
A general view of the Elhuyar Foundation’s strategy to build several types of language resources for Basque out of the web in a cost-efficient way is presented; the strategy is also attractive for other less resourced languages, provided they have enough presence on the web.
Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
- Computer Science
- 2010
GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the web, sentence-level alignment and its quality analysis, bilingual dictionary and terminology extraction, and construction of off-line dictionaries.
Common Crawled Web Corpora: Constructing corpora from large amounts of web data
- Computer Science
- 2017
This thesis develops a new very large English corpus with more than 135 billion tokens, evaluates it by training word embeddings, and shows that the trained model largely outperforms models trained on other corpora on word analogy and word similarity tasks.
References
Mining the web to create minority language corpora
- Computer Science
- CIKM '01
- 2001
The Web is a valuable source of language specific resources but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for…
A Statistical Corpus-Based Term Extractor
- Computer Science
- Canadian Conference on AI
- 2001
A language-independent, statistical, corpus-based term extraction algorithm is proposed; the extractor is evaluated by assessing its predictiveness on an unseen corpus using perplexity, precision and recall.
Comparing Corpora using Frequency Profiling
- Linguistics
- Proceedings of the workshop on Comparing corpora
- 2000
The method can be used to discover key words in the corpora which differentiate one corpus from another and is shown to have applications in the study of social differentiation in the use of English vocabulary, profiling of learner English and document analysis in the software engineering process.
Spidering Hacks
- Computer Science
- 2003
Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course…
Introduction to the Special Issue on the Web as Corpus
- Linguistics
- CL
- 2003
This special issue of Computational Linguistics explores ways in which the dream of freely available language data in vast quantity is being pursued.
Comparative study of trauma-related phenomena in subjects with pseudoseizures and subjects with epilepsy.
- Psychology, Medicine
- The American journal of psychiatry
- 2002
Subjects with pseudoseizures exhibited trauma-related profiles that differed significantly from those of epileptic comparison subjects and closely resembled those of individuals with a history of traumatic experiences.
Web mining in the translation classroom
- 2004