• Corpus ID: 15138302

Towards the National Corpus of Polish

@inproceedings{Przepirkowski2008TowardsTN,
  title={Towards the National Corpus of Polish},
  author={Adam Przepi{\'o}rkowski and Rafa{\l} L. G{\'o}rski and Barbara Lewandowska-Tomaszyk and Marek {\L}azi{\'n}ski},
  booktitle={LREC},
  year={2008}
}
This paper presents a new corpus project, aiming at building a national corpus of Polish. What makes it different from a typical YACP (Yet Another Corpus Project) is 1) the fact that all four partners in the project have in the past constructed corpora of Polish, sometimes in the spirit of collaboration, at other times - in the spirit of competition, 2) the partners bring into the project varying areas of expertise and experience, so the synergy effect is anticipated, 3) the corpus will be… 

Recent Developments in the National Corpus of Polish

TLDR
A number of recent developments in the NKJP project are outlined, including the design of text encoding XML schemata for various levels of linguistic information, a new tool for manual annotation at various levels, and numerous improvements in search tools.

Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS

TLDR
The corpus structure and fieldwork experiences with recording, labelling system, and two levels of transcription (pronunciation-based and standardized) are described, as well as the main characteristics of the corpus interface (web concordancer) and the availability of the original corpus files.

Bulgarian National Corpus Project

TLDR
The paper presents the Bulgarian National Corpus project (BulNC): a large-scale, representative corpus of Bulgarian, available online, that plays a significant role in natural language processing of Bulgarian and contributes to scientific advances in spelling and grammar checking, word sense disambiguation, speech recognition, text categorisation, topic extraction and machine translation.

Towards a Bank of Constituent Parse Trees for Polish

TLDR
The overall shape of the parse trees, including the extent of encoded grammatical information, is discussed, and the problem of syntactic disambiguation as a challenge for the job is delved into.

Annotation tools for syntax and named entities in the National Corpus of Polish

TLDR
The technical environment and methodological background developed for the three upper annotation levels (syntactic words, syntactic groups and named entities) are presented, along with the first results of a CRF classifier trained on these data.

Creating a Coreference Resolution System for Polish

TLDR
The results of the first attempt at coreference resolution for Polish using statistical methods are presented and the plans for the future usage of the tool are described.

Tools and methodologies for annotating syntax and named entities in the National Corpus of Polish

TLDR
This work presents the technical environment and methodological background developed for the three upper annotation levels: the level of syntactic words and groups, and the level of named entities, and shows how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus.

ISOcat Definition of the National Corpus of Polish Tagset

TLDR
This paper describes the first definition of a complete morphosyntactic tagset, The National Corpus of Polish Tagset, in the ISOcat Data Category Registry and presents certain limitations of ISOcat and offers some suggestions for its further development.

TEI P5 as a Text Encoding Standard for Multilevel Corpus Annotation

TLDR
Standards are also needed for the interoperability of tools and for the facilitation of data exchange within projects, especially where multiple partners and multiple levels of linguistic data are involved.

Towards Word Sense Disambiguation of Polish

TLDR
The goal was to analyse the applicability and limitations of known methods in relation to Polish and Polish language resources and tools; the achieved accuracy of sense disambiguation greatly exceeds the baseline of the most frequent sense.
...

References

SHOWING 1-10 OF 35 REFERENCES

An efficient implementation of a large grammar of Polish

TLDR
The paper presents a parser implementing Marek Świdziński's formal grammar of the Polish language, named Świgra, which goes far beyond a toy parser due to the use of a morphological analyser and a broad range of linguistic phenomena covered by Świdziński's grammar.

Manatee, Bonito and Word Sketches for Czech

This paper deals with a newly designed and developed system Manatee that can be employed to manage corpora, especially extremely large ones with billions of words, and enables the efficient

Layering and Merging Linguistic Annotations

The American National Corpus and its annotations are represented in a stand-off XML format compliant with the specifications of ISO TC37 SC4 WG1's Linguistic Annotation Framework. Because few systems

A Search Tool for Corpora with Positional Tagsets and Ambiguities

TLDR
A corpus indexing and query tool, which understands positional tagsets and which does not assume that word forms are annotated with unique morphosyntactic tags.

Poliqarp: An open source corpus indexer and search engine with syntactic extensions

TLDR
Recent extensions to Poliqarp are presented, which turn it into a tool for indexing and searching certain kinds of treebanks, complementary to existing treebank search engines.

Reductionistic, Tree and Rule Based Tagger for Polish

TLDR
The paper presents an approach to tagging of Polish based on the combination of handmade reduction rules and selecting rules acquired by Induction of Decision Trees, where the overall problem is reduced to subproblems of ambiguity classes.

XCES: An XML-based Encoding Standard for Linguistic Corpora

TLDR
This paper instantiated the CES as an XML application called XCES, based on the same data architecture comprised of a primary encoded text and "standoff" annotation in separate documents, and demonstrated how XML mechanisms can be used to select from and manipulate annotated corpora encoded according to XCES specifications.

Trigram morphosyntactic tagger for Polish

  • L. Debowski
  • Computer Science
    Intelligent Information Systems
  • 2004
TLDR
An implementation of a plain trigram part-of-speech tagger which appears to work well on Polish texts and achieves a 9.4% error rate, which makes it significantly better than the previous stochastic disambiguator.

Extraction of Polish Named-Entities

Named entities (NE) constitute a significant part of natural language texts and are widely exploited in various NLP applications. Although considerable work on named entity recognition (NER) for few major

Poliqarp 1.0: Some technical aspects of a linguistic search engine for large corpora

  • 2007