Miloš Jakubíček

This work describes the process of creating a 70-billion-word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing…
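To make the de-duplication step concrete, here is a minimal sketch of exact paragraph-level de-duplication by hashing. This is an illustrative assumption, not the pipeline actually used on ClueWeb09; real web-corpus tools also catch near-duplicates (e.g. via shingling), and the `dedup_paragraphs` helper is hypothetical.

```python
import hashlib

def dedup_paragraphs(docs):
    """Drop paragraphs whose exact (normalised) text has been seen before."""
    seen = set()      # SHA-1 hashes of paragraphs kept so far
    cleaned = []
    for doc in docs:
        kept = []
        for para in doc.split("\n\n"):
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n\n".join(kept))
    return cleaned

# The boilerplate footer survives only in the first document.
docs = ["Unique article text.\n\nShared site footer.",
        "Another article.\n\nShared site footer."]
print(dedup_paragraphs(docs))
```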
For many linguistic investigations, the first step is to find examples. In the 21st century, these should all be found, not invented, so linguists need flexible tools for finding even quite rare phenomena. To support linguists well, such tools must be fast even where corpora are very large and queries are complex. We present extensions to the CQL 'Corpus Query Language'…
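As an illustration of the kind of query at stake, here is a naive Python scan for the simple CQL pattern [tag="JJ"] [tag="NN"] (an adjective immediately followed by a noun) over a toy POS-tagged corpus. The data and helper are invented; the point is that a linear scan like this is exactly what stops scaling once corpora reach billions of tokens, which is what motivates indexed query evaluation.

```python
# Toy corpus of (word, POS tag) pairs; invented data.
corpus = [
    ("a", "DT"), ("rare", "JJ"), ("phenomenon", "NN"),
    ("was", "VBD"), ("quite", "RB"), ("rare", "JJ"), ("indeed", "RB"),
]

def match_adj_noun(tokens):
    """Yield positions where a JJ token is immediately followed by an NN token."""
    for i in range(len(tokens) - 1):
        if tokens[i][1] == "JJ" and tokens[i + 1][1] == "NN":
            yield i

for i in match_adj_noun(corpus):
    print(corpus[i][0], corpus[i + 1][0])  # -> rare phenomenon
```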
In this paper, we present an application-driven, low-cost concept for building a multi-purpose language resource for Czech, based on currently available results of previous work by various research teams active in the area of natural language processing. We particularly focus on the first phase, which consists of extracting noun phrases from a…
1 Overview
Term candidates for a domain, in a language, can be found by (a minimal sketch of these steps follows the list):
• taking a corpus for the domain, and a reference corpus for the language
• identifying the grammatical shape of a term in the language
• tokenising, lemmatising and POS-tagging both corpora
• identifying (and counting) the items in each corpus which match the grammatical shape
• for…
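A minimal sketch of the counting-and-comparison steps above, assuming both corpora arrive as (lemma, POS-tag) pairs and taking adjective + noun as the grammatical shape. The add-n smoothed frequency-ratio scoring is one common choice for the truncated final step, not necessarily the formula the text goes on to give; all names and toy data here are invented.

```python
from collections import Counter

def count_terms(tagged, shape=("JJ", "NN")):
    """Count lemma sequences whose POS tags match the given grammatical shape."""
    counts = Counter()
    for i in range(len(tagged) - len(shape) + 1):
        window = tagged[i:i + len(shape)]
        if tuple(tag for _, tag in window) == shape:
            counts[" ".join(lem for lem, _ in window)] += 1
    return counts

def term_scores(domain, reference, n=1.0):
    """Rank candidates by ratio of per-million frequencies, add-n smoothed."""
    d_counts, r_counts = count_terms(domain), count_terms(reference)
    d_size, r_size = len(domain), len(reference)
    scores = {}
    for term, c in d_counts.items():
        fpm_d = 1e6 * c / d_size
        fpm_r = 1e6 * r_counts.get(term, 0) / r_size
        scores[term] = (fpm_d + n) / (fpm_r + n)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy (lemma, tag) corpora: the domain-only term ranks highest.
domain = [("neural", "JJ"), ("network", "NN"), ("deep", "JJ"), ("model", "NN"),
          ("neural", "JJ"), ("network", "NN")]
reference = [("old", "JJ"), ("house", "NN"), ("neural", "JJ"), ("network", "NN")] * 5
print(term_scores(domain, reference)[:3])
```

Candidates frequent in the domain corpus but rare in the reference corpus get the highest scores, which is the intuition behind comparing the two corpora at all.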
We present bilingual word sketches: automatic, corpus-based summaries of the grammatical and collocational behaviour of a word in one language and its translation equivalent in another. We explore, with examples, various ways that this can be done, using parallel corpora, comparable corpora and bilingual dictionaries. We present the formalism for…