Stijn Vansummeren

Information describing the origin of data, generally referred to as <i>provenance</i>, is important in scientific and curated databases where it is the basis for the trust one puts in their contents. Since such databases are constructed using operations of both query and update languages, it is of paramount importance to describe the effect of these …
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning <i>deterministic</i> regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we …
We present statistics on real-world SPARQL queries that may be of interest for building SPARQL query processing engines and benchmarks. In particular, we analyze the syntactical structure of queries in a log of about 3 million queries, harvested from the DBpedia SPARQL endpoint. Although a sizable portion of the log is shown to consist of so-called …
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that basically reduces to learning <i>concise</i> regular expressions from positive example strings. We identify two classes of concise regular expressions: the single occurrence regular expressions (SOREs) and the chain regular …
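The abstract above is cut off before it defines its two classes. As a minimal sketch, assuming the definition commonly given in the literature (a SORE is a regular expression in which every element name occurs at most once), the following Python check tests that property on a plain-text expression. The function name `is_single_occurrence` and the convention that element names are identifier-like tokens are assumptions made for illustration, not something stated in the abstract.

```python
import re
from collections import Counter


def is_single_occurrence(expr: str) -> bool:
    """Return True if every element name in expr occurs at most once.

    Assumption: element names are identifier-like tokens; operators such
    as |, ?, +, * and parentheses are ignored by the tokenizer.
    """
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr)
    return all(count == 1 for count in Counter(symbols).values())


sore_result = is_single_occurrence("((b?(a|c))+d)+e")   # each name appears once
non_sore_result = is_single_occurrence("a(a|b)*")       # 'a' appears twice
```

This only checks the syntactic single-occurrence property; it does not parse the expression or attempt any inference from example strings.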
Science, industry, and society are being revolutionized by radical new capabilities for information sharing, distributed computation, and collaboration offered by the World Wide Web. This revolution promises dramatic benefits but also poses serious risks due to the fluid nature of digital information. One important cross-cutting issue is managing and …
The diversity and large volumes of data processed in the Natural Sciences today have led to a proliferation of highly specialized and autonomous scientific databases with inherent and often intricate relationships. As a user-friendly method for querying this complex, ever-expanding network of sources for correlations, we propose exploratory queries. …
Regular expression patterns provide a natural, declarative way to express constraints on semistructured data and to extract relevant information from it. Indeed, they are a core feature of the programming language Perl, surface in various UNIX tools such as sed and awk, and have recently been proposed in the context of the XML programming language XDuce. Since …
An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a <i>document spanner</i> (or just <i>spanner</i> for short). A spanner maps an input string into a relation over the spans (intervals specified by …
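To make the idea of a string-to-relation mapping concrete, here is a minimal sketch in Python, assuming a spanner defined by a regular expression with capture groups. The helper name `regex_spanner` is hypothetical, the spans use Python's 0-based half-open offsets (which may differ from the article's convention), and `re.finditer` returns only non-overlapping matches, so this is a simplification of the general notion rather than the article's formal definition.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) of character offsets


def regex_spanner(pattern: str, text: str) -> List[Tuple[Span, ...]]:
    """Map text to a relation: one row per match, one span per capture group."""
    compiled = re.compile(pattern)
    return [
        tuple(match.span(group) for group in range(1, compiled.groups + 1))
        for match in compiled.finditer(text)
    ]


text = "ann@site bob@lab"
rows = regex_spanner(r"(\w+)@(\w+)", text)
# Each row holds two spans: the part before '@' and the part after it.
names = [text[start:end] for (start, end), _ in rows]
```

Returning spans rather than substrings keeps the positions in the input, so two occurrences of the same substring remain distinct rows in the relation.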