Milos Jakubícek

Learn More
For many linguistic investigations, the first step is to find examples. In the 21st century, they should all be found, not invented. Thus linguists need flexible tools for finding even quite rare phenomena. To support linguists well, they need to be fast even where corpora are very large and queries are complex. We present extensions to the CQL 'Corpus(More)
In this paper, we present an application-driven low-cost concept of building a multipurpose language resource for Czech which is based on currently available results of previous work by various research teams active in the area of natural language processing. We particularly focus on the first phase which consists in extracting noun phrases from a(More)
This work describes the process of creation of a 70 billion word text corpus of English. We used an existing language resource, namely the ClueWeb09 dataset, as source for the corpus data. Processing such a vast amount of data presented several challenges, mainly associated with pre-processing (boilerplate cleaning, text de-duplication) and post-processing(More)
1 Overview Term candidates for a domain, in a language, can be found by • taking a corpus for the domain, and a reference corpus for the language • identifying the grammatical shape of a term in the language • tokenising, lemmatising and POS-tagging both corpora • identifying (and counting) the items in each corpus which match the grammatical shape • for(More)
In this paper we present our approach to the Bilingual Document Alignment Task (WMT16), where the main goal was to reach the best recall on extracting aligned pages within the provided data. Our approach consists of tree main parts: data preprocessing, keyword extraction and text pairs scoring based on keyword matching. For text preprocessing we use the(More)
The NLP researcher or application-builder often wonders " what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job? " Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which(More)