{bs, hr, sr}WaC - Web Corpora of Bosnian, Croatian and Serbian

@inproceedings{Ljubesic2014bsHS,
  title={\{bs, hr, sr\}WaC - Web Corpora of Bosnian, Croatian and Serbian},
  author={Nikola Ljubesic and Filip Klubicka},
  booktitle={WaC@EACL},
  year={2014}
}
In this paper we present the construction process of top-level-domain web corpora of Bosnian, Croatian and Serbian. For constructing the corpora we use the SpiderLing crawler with its associated tools adapted for simultaneous crawling and processing of text written in two scripts, Latin and Cyrillic. In addition to the modified collection process we focus on two sources of noise in the resulting corpora: 1. they contain documents written in the other, closely related languages that cannot be…
C4Corpus: Multilingual Web-size Corpus with Free License
TLDR
This article presents the construction of a 12-million-page Web corpus (over 10 billion tokens), licensed under the Creative Commons license family, in 50+ languages, extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Crawl and crowd to bring machine translation to under-resourced languages
TLDR
A widely applicable methodology to bring machine translation to under-resourced languages in a cost-effective and rapid manner relies on web crawling to automatically acquire parallel data to train statistical MT systems if any such data can be found for the language pair and domain of interest.
Semi-Automatic Construction of Comparable Genre-Oriented Corpora of Serbian in Cyrillic and Latin Scripts
This article deals with methods for the semi-automatic construction of genre-oriented corpora from the web, drawing on the BootCaT toolkit. In particular, it reports the results of two parallel…
The slWaC Corpus of the Slovene Web
The availability of large collections of text (language corpora) is crucial for empirically supported linguistic investigations of various languages; however, such corpora are complicated and…
Web corpora - the best possible solution for tracking rare phenomena in underresourced languages: clitics in Bosnian, Croatian and Serbian
Complex linguistic phenomena, such as Clitic Climbing in Bosnian, Croatian and Serbian, are often described intuitively, only from the perspective of the main tendency. In this paper, we argue that…
Very Large Russian Corpora: New Opportunities and New Challenges
Our paper deals with the rapidly developing area of corpus linguistics referred to as Web as Corpus (WaC), i.e., creation of very large corpora composed of texts downloaded from the web. Some…
Corpus-Based Diacritic Restoration for South Slavic Languages
TLDR
This paper presents diacritic restoration models that are trained on easy-to-acquire corpora and considerably outperform charlifter, so far the only open source tool available for this task.
Neural Machine Translation between similar South-Slavic languages
TLDR
Automatic evaluation shows that multilingual systems with joint Serbian and Croatian data are better than bilingual ones, that character-based cleaning leads to improved scores while using less data, and that adding back-translated data further improves the performance.
*MWELex - MWE Lexica of Croatian, Slovene and Serbian Extracted from Parsed Corpora
The paper presents *MWELex, a multilingual lexical repository of Croatian, Slovene and Serbian multiword expressions that were extracted from parsed corpora. The lexica were built with the…

References

hrWaC and slWac: Compiling Web Corpora for Croatian and Slovene
TLDR
Two new annotated web corpora are introduced, built using a modified standard "Web as Corpus" pipeline having in mind the limited amount of available web data, focusing on the content extraction from HTML pages, which combines high precision of extracted language content with a decent recall.
Building Large Corpora from the Web Using a New Efficient Tool Chain
TLDR
A software toolkit for web corpus construction and a set of significantly larger corpora built using this software, which performs basic cleanups as well as boilerplate removal, simple connected text detection as well as shingling to remove duplicates from the corpora.
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
TLDR
UkWaC, deWaC and itWaC are introduced, three very large corpora of English, German, and Italian built by web crawling, and the methodology and tools used in their construction are described.
Lemmatization and Morphosyntactic Tagging of Croatian and Serbian
TLDR
Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required on such dataset sizes for these particular tasks.
The SETimes.HR Linguistically Annotated Corpus of Croatian
TLDR
This work builds and evaluates statistical models for lemmatization, morphosyntactic tagging, named entity recognition and dependency parsing on top of SETimes.HR and the test sets, providing the state of the art in all the tasks.
Efficient Web Crawling for Large Text Corpora
TLDR
How to deal with inefficient data downloading and how to focus crawling on text-rich web domains is described, and efficiency figures from crawling texts in American Spanish, Czech, Japanese, Russian, Tajik Persian and Turkish, together with the sizes of the resulting corpora, are presented.
The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction
TLDR
This paper examines notions of text quality in the context of web corpus construction, and describes the general approach to the construction of carefully cleansed and non-destructively normalized web corpora.
Efficient Discrimination Between Closely Related Languages
TLDR
This paper proposes and compares methods based on simple document classification techniques trained on parallel corpora of closely related languages and methods that emphasize discriminating features in terms of blacklisted words, and demonstrates that these techniques are highly accurate for the difficult task of discriminating between Bosnian, Croatian and Serbian.
HunPos: an open source trigram tagger
TLDR
HunPos is presented, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.
Parsing Croatian and Serbian by Using Croatian Dependency Treebanks
TLDR
This work makes use of the two available dependency treebanks of Croatian to produce state-of-the-art parsing models for both languages, giving insight into overall parser performance for Croatian and Serbian, the impact of preprocessing for lemmas and morphosyntactic tags, and the influence of selected morphosyntactic features on parsing accuracy.