XCES

XCES is an XML-based standard for encoding text corpora, such as those used by linguists and natural language researchers. XCES is highly based on the…

Wikipedia

Papers overview

2012

A Procedural DTD Project for Dictionary Entry Parsing Described with Parameterized Grammars

N. Curteanu, M. Moruz
2012
Corpus ID: 61753093

The present paper continues the successful parsing experiments with the method of Segmentation-Cohesion-Dependency (SCD…

Review

2009

A Common XML-based Framework for Syntactic Annotations

It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and…

2008

Informationsstrukturierung für die syntaktische Annotation eines diachronen Korpus des Deutschen

Michael Heilemann
2008
Corpus ID: 189526107

This thesis describes, for the project Diachrone Syntax Deutsch (DiSynDe), the information structuring for a diachronic…

2007

Taming the Tiger Topic: An XCES Compliant Corpus Portal to Generate Subcorpora Based on Automatic Text-Topic Identification

Large-corpus projects generally use a rich header to describe their texts allowing several types of text searching to create…

2007

Converting SUC2.0 to XCES with stand-off annotation

Beáta Megyesi, B. Dahlqvist
2007
Corpus ID: 60345140

2006

On Heads and Coordination in a Partial Treebank

A. Przepiórkowski
2006
Corpus ID: 62080620

The aim of this paper is to present the design of a partial syntactic annotation of the IPI PAN Corpus of Polish [13] and the…

2006

Corpus Construction Tools

R. Garabík
2006
Corpus ID: 18711188

Modern advances in computing allow us to take part in directions of scientific research on natural language that were previously impossible. The basic, necessary database is language corpora, including large representative (national) corpora. General software tools that make it possible to process large quantities of text efficiently are already widely available, as are tools for searching corpora. Nevertheless, creating a corpus with a large amount of data requires a definite plan for organizing text processing, together with a software structure. The paper presents a general system that allows the language-specific features of data processing to be applied quickly. The necessary aspects of a national corpus are discussed from both the linguistic and the computational points of view. The system mainly uses the modern object-oriented programming language Python, which has excellent capabilities for processing text data. Text annotation consists of two parts: linguistic (internal) annotation, which is an intrinsic property of the linguistic units (words) in the text, and general data about the documents (metatextual, external annotation). Internal annotation is embedded directly in the format of the processed texts, as a result of using existing standards for representing text data, such as XML (XCES). External annotation is stored in plain text files, with a relational database built on top of this structure.

Introduction

There exists a reasonably extensive literature concerning principles of corpora structure and end-user interaction [1, 2, 3, 4 and many others]. However, technical details of corpora construction are usually left out as uninteresting or too closely tied up with a specific corpus, and therefore not applicable in general. As with every big project, creating and maintaining an extensive (i.e. “national”) corpus of written language requires a carefully thought-out design of data structure and of data manipulation. Consequently, each newly created big corpus ends up reinventing the wheel and implementing the data workflow and manipulation from scratch. During the Slovak National Corpus construction, we did basically the same thing, but we tried to make our design general and clean, in order to serve as an inspiration for other big corpora yet to be created. This does not include end-user information searching by a corpus manager – there are several (though not many)

2005

Procesamiento y aplicaciones de los corpus paralelos

Francisco Javier Gómez Guinovart
2005
Corpus ID: 187794563

A parallel corpus is a collection of bitexts, a bitext being the text made up of a text and its translation. In…

2003

Constitution de banques de textes multilingues : un mécanisme fondé sur le standard XML

Andrei Popescu-Belis
2003
Corpus ID: 192174472

In this article we present a methodology for building reusable linguistic resources, namely…

Review

2000

XML support for Annotated Language Resources

N. Ide, Laurent Romary
2000
Corpus ID: 1998931

The XML Corpus Encoding Standard (XCES) is a part of the EAGLES Guidelines developed by the Expert Advisory Group on Language…
