• Corpus ID: 15701997

BootCaT: Bootstrapping Corpora and Terms from the Web

Marco Baroni and Silvia Bernardini
International Conference on Language Resources and Evaluation
This paper introduces the BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web. […] The seeds are used to build a corpus via automated Google queries, and more terms are extracted from this corpus. In turn, these new terms are used as seeds to build a larger corpus via automated queries, and so forth. The corpus and the unigram terms are then used to extract multi-word terms. We conducted an evaluation of the tools by…
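The iterative procedure summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the toolkit's Perl implementation: `search` stands in for the automated search-engine queries, `toy_search` and `DOCS` are a hypothetical in-memory stand-in for the web, and the term scoring (relative frequency against a reference word list) is an assumed, simplified criterion.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase, whitespace-split tokenizer that keeps alphabetic tokens only."""
    return [w.lower() for w in text.split() if w.isalpha()]


def build_queries(seeds, tuple_size=2, max_queries=10):
    """Combine seed terms into query tuples (sizes are illustrative assumptions)."""
    return list(combinations(sorted(seeds), tuple_size))[:max_queries]


def extract_terms(corpus, reference_counts, top_n=5):
    """Rank corpus words by relative frequency against a reference word list.

    Assumed scoring: corpus relative frequency divided by smoothed
    reference relative frequency. Ties are broken alphabetically.
    """
    counts = Counter(w for doc in corpus for w in tokenize(doc))
    total = sum(counts.values())
    ref_total = sum(reference_counts.values())

    def score(word):
        ref_freq = (reference_counts.get(word, 0) + 1) / (ref_total + 1)
        return (counts[word] / total) / ref_freq

    return sorted(counts, key=lambda w: (-score(w), w))[:top_n]


def bootstrap(seeds, search, reference_counts, iterations=2):
    """Alternate between corpus building and term extraction."""
    corpus = set()
    terms = set(seeds)
    for _ in range(iterations):
        for query in build_queries(terms):
            corpus.update(search(query))  # grow the corpus from query hits
        # the newly extracted terms become the seeds of the next round
        terms = set(extract_terms(corpus, reference_counts))
    return corpus, terms


# Hypothetical in-memory "web" used in place of real search-engine queries.
DOCS = [
    "epilepsy seizure treatment seizure drug",
    "seizure drug dosage epilepsy",
    "the weather is nice the sun",
    "drug dosage trial seizure",
]


def toy_search(query):
    """Return every toy document that contains all words of the query."""
    return [d for d in DOCS if all(q in tokenize(d) for q in query)]


if __name__ == "__main__":
    reference = Counter({"the": 10, "is": 5})
    corpus, terms = bootstrap({"epilepsy", "seizure"}, toy_search, reference)
    print(sorted(corpus))
    print(sorted(terms))
```

In this toy run the second iteration retrieves a document that the original two seeds alone did not match, which is the point of feeding extracted terms back in as seeds. The multi-word term extraction step mentioned in the abstract is not covered by this sketch.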

Retrieving Japanese specialized terms and corpora from the World Wide Web

It is reported that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which the procedure was tested originally.

A novel approach to build Kannada web Corpus

An evaluation of the Kannada Corpus tool is conducted by applying it to the construction of Kannada corpora from domains such as Recent Discussions, Articles, Recent Activities, Proverbs, Recent Feedback, Poems, Fifteen Books, Novels, Newspapers, Dictionary, Blogs and Informal Chats.

CorpoMate: A framework for building linguistic corpora from the web

CorpoMate is introduced, an extensible framework with a pipeline-inspired, modular architecture for automating the creation of linguistic corpora from web resources by crawling websites or parsing feeds; it can export aggregated data into widely accepted formats.

Building Large Corpora from the Web Using a New Efficient Tool Chain

A software toolkit for web corpus construction is presented, along with a set of significantly larger corpora built using this software, which performs basic cleanups, boilerplate removal, simple connected text detection, and shingling to remove duplicates from the corpora.
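As a rough illustration of shingling for duplicate removal, the sketch below compares documents by the overlap of their word n-gram sets (Jaccard similarity). It is a minimal sketch under assumptions, not the tool chain's actual algorithm: the shingle width `w`, the `threshold` value, and all function names are hypothetical, and the pairwise comparison is only practical for small collections.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (word n-grams) of a text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # texts shorter than one shingle are represented by a single tuple
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, w=3, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, w)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog today",
        "completely different text about web corpora",
    ]
    print(deduplicate(docs))
```

Here the second document shares 7 of 8 distinct shingles with the first (similarity 0.875), so it is dropped at the assumed threshold of 0.8.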

WebBootCaT: a Web Tool for Instant Corpora

A web service for quickly producing corpora for specialist areas, in any of a range of languages, from the web, which is easy for non-technical people to use as all they need do is fill in a web form.

A Suite to Compile and Analyze an LSP Corpus

This paper presents a series of tools for the extraction of specialized corpora from the web and its subsequent analysis mainly with statistical techniques. It is an integrated system of original as

Open-source Corpora: Using the net to fish for linguistic data

Experiments with a variety of languages show that Internet-derived corpora can be successfully used in the absence of large representative corpora that are rare and expensive to build.

Exploiting the Internet to build language resources for less resourced languages

A general view is presented of the Elhuyar Foundation’s strategy for building several types of language resources for Basque from the web in a cost-efficient way, an approach that is also attractive for other less-resourced languages, provided they have enough presence on the web.

Automatic parallel corpora and bilingual terminology extraction from parallel WebSites

GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the web, sentence-level alignment and its quality analysis, extraction of bilingual dictionaries and terminology, and construction of off-line dictionaries.

Common Crawled Web Corpora: Constructing corpora from large amounts of web data

This thesis develops a new very large English corpus with more than 135 billion tokens and evaluates the corpus by training word embeddings and shows that the trained model largely outperforms models trained on other corpora in a word analogy and word similarity task.



Mining the web to create minority language corpora

The Web is a valuable source of language specific resources but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for

A Statistical Corpus-Based Term Extractor

A language independent statistical corpus-based term extraction algorithm is proposed and the quality and recall of the extractor are evaluated by assessing its predictiveness on an unseen corpus using perplexity and precision and recall.

Comparing Corpora using Frequency Profiling

The method can be used to discover key words in the corpora which differentiate one corpus from another and is shown to have applications in the study of social differentiation in the use of English vocabulary, profiling of learner English and document analysis in the software engineering process.

Spidering Hacks

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course

Introduction to the Special Issue on the Web as Corpus

This special issue of Computational Linguistics explores ways in which this dream of freely available language data in vast quantity is being pursued.

Comparative study of trauma-related phenomena in subjects with pseudoseizures and subjects with epilepsy.

Subjects with pseudoseizures exhibited trauma-related profiles that differed significantly from those of epileptic comparison subjects and closely resembled those of individuals with a history of traumatic experiences.

Analysis of Contingency Tables

Automatic Natural Acquisition of a Terminology

Web mining in the translation classroom (2004)