Corpus ID: 15701997

BootCaT: Bootstrapping Corpora and Terms from the Web

  • M. Baroni, S. Bernardini
  • Published in LREC 2004
  • Computer Science
  • This paper introduces the BootCaT toolkit, a suite of Perl programs implementing an iterative procedure to bootstrap specialized corpora and terms from the web. [...] The seeds are used to build a corpus via automated Google queries, and more terms are extracted from this corpus. In turn, these new terms are used as seeds to build a larger corpus via automated queries, and so forth. The corpus and the unigram terms are then used to extract multi-word terms. We conducted an evaluation of the tools by…
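
The iterative procedure summarized in the abstract (seeds, queries, corpus, term extraction, new seeds) can be sketched roughly as follows. This is a minimal illustration in Python, not the toolkit's Perl code. Everything named here is an assumption made for the example: the `search` callable stands in for the automated search-engine queries, and the frequency-ratio scoring against a reference word list is a simple placeholder rather than the scoring the authors use. Multi-word term extraction is omitted.

```python
import random
from collections import Counter
from itertools import combinations


def build_queries(seeds, tuple_size=2, max_queries=5, rng=None):
    """Combine seed terms into small tuples; each tuple becomes one query string."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    combos = list(combinations(sorted(set(seeds)), tuple_size))
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_queries]]


def extract_terms(corpus, reference_counts, top_n=5):
    """Rank unigrams by relative frequency in the corpus versus a reference list.

    Assumption: a plain frequency ratio with add-one smoothing, chosen only
    to keep the example short.
    """
    counts = Counter(word for doc in corpus for word in doc.lower().split())
    total = sum(counts.values())
    ref_total = sum(reference_counts.values())

    def score(word):
        ref_freq = (reference_counts.get(word, 0) + 1) / (ref_total + 1)
        return (counts[word] / total) / ref_freq

    # Sort by descending score, breaking ties alphabetically.
    return sorted(counts, key=lambda w: (-score(w), w))[:top_n]


def bootstrap(seeds, search, reference_counts, iterations=2):
    """Alternate between corpus building and term extraction.

    `search` is a hypothetical callable: it takes a query string and
    returns an iterable of document texts.
    """
    corpus, seen = [], set()
    terms = list(seeds)
    for _ in range(iterations):
        for query in build_queries(terms):
            for doc in search(query):
                if doc not in seen:  # skip documents already collected
                    seen.add(doc)
                    corpus.append(doc)
        # The extracted terms become the seeds of the next round.
        terms = extract_terms(corpus, reference_counts)
    return corpus, terms
```

Any function that maps a query string to a list of page texts can be passed as `search`, so the loop can be tried on a small in-memory document collection before it is connected to a real retrieval backend.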
    362 Citations


    Retrieving Japanese specialized terms and corpora from the World Wide Web
    • 9
    Comparable corpora BootCaT
    • 5
    Building Large Corpora from the Web Using a New Efficient Tool Chain
    • 164
    WebBootCaT: a Web Tool for Instant Corpora
    • 45
    A novel approach to build Kannada web Corpus
    • 6
    CorpoMate: A framework for building linguistic corpora from the web
    A Suite to Compile and Analyze an LSP Corpus
    • 11
    Open-source Corpora: Using the net to fish for linguistic data
    • 87
    Babouk: Focused Web Crawling for Corpus Compilation and Automatic Terminology Extraction
    • Clément de Groc
    • Computer Science
    • 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology
    • 2011
    • 54


    Mining the web to create minority language corpora
    • 76
    Comparing Corpora using Frequency Profiling
    • 486
    A Statistical Corpus-Based Term Extractor
    • 128
    Spidering Hacks
    • 17
    Introduction to the Special Issue on the Web as Corpus
    • 649
    Comparative study of trauma-related phenomena in subjects with pseudoseizures and subjects with epilepsy.
    • 78
    Translators and disposable corpora
    • 57
    Analysis of Contingency Tables
    • 566
    Automatic Natural Acquisition of a Terminology
    • 89
    Automatic natural acquisition of bilingual terminology
    • Journal of Quantitative Linguistics
    • 1995