XML is among the preferred formats for storing the structure of documents such as scientific articles, manuals, documentation, and literary works. Sometimes publishers adopt established, well-known vocabularies such as DocBook and TEI; at other times they create partially or entirely new ones that better deal with the particular requirements of their…
Recognising textual structures (paragraphs, sections, etc.) provides abstract and more general mechanisms for describing documents independent of the particular semantics of specific markup schemas, tools and presentation stylesheets. In this paper we propose an algorithm that allows us to identify the structural role of each element in a set of homogeneous…
Evaluating collections of XML documents without paying attention to the schema they were written in may give interesting insights about the expected characteristics of a markup language, as well as about any regularities that may span across vocabularies and languages and that are more fundamental and frequent than plain content models. In this paper we…
There is still a gap between models for external annotations of markup documents and their applications. In this paper we present the EARMARK API, a Java framework that allows users to combine embedded markup with stand-off markup. We discuss a few relevant issues on adding annotations to TEI documents that refer to external entities and we show how to use…
This paper introduces the RASH Framework, i.e., a set of specifications and tools for writing academic articles in RASH, a simplified version of HTML. RASH focuses strictly on writing the content of the paper, leaving all the issues about its validation, visualisation, conversion, and data extraction to the tools developed within the framework.
In order to make semantic assertions about the text content of a document we need a mechanism to identify and organize the text structures of the document itself. Such a mechanism would closely resemble a document-oriented markup language and would be free of the classical constraints of an embedded markup language, having no limitations given by…
In this paper we introduce ODMiner, an automatic tool that enhances open datasets (provided in heterogeneous structured formats) in order to extract named entities and relations between the open dataset elements. ODMiner is designed as a modular and extensible software architecture and its process can be customised in order to address specific needs of final…