Learn More
Querying data from presentation formats like HTML, for purposes such as information extraction, requires the consideration of tree structures as well as the consideration of spatial relationships between laid out elements. The underlying rationale is that frequently the rendering of tree structures is very involved and undergoing more frequent updates than(More)
This paper presents PDF-TREX, an heuristic approach for table recognition and extraction from PDF documents. The heuristics starts from an initial set of basic content elements and aligns and groups them, in bottom-up way by considering only their spatial features, in order to identify tabular arrangements of information. The scope of the approach is to(More)
Information extraction is of paramount importance in several real world applications in the areas of business intelligence, competitive and military intelligence. Although several sophisticated and indeed complex approaches were proposed, they are still limited in many aspects. In this paper the novel ontology-based system named XONTO, that allows the(More)
Deep Web pages convey very relevant information for different application domains like e-government, e-commerce, social networking. For this reason there is a constant high interest in efficiently, effectively and automatically extracting data from Deep Web data sources. In this paper we present SILA, a novel Spatial Instance Learning Approach, that allows(More)
Ontologies enable to directly encode domain knowledge in software applications, so ontology-based systems can exploit the meaning of information for providing advanced and intelligent functionalities. One of the most interesting and promising application of ontologies is information extraction from unstructured documents. In this area the extraction of(More)
Information extraction is of paramount importance in several real world applications in the areas of business, competitive and military intelligence because it enables to acquire information contained in unstructured documents and store them in structured forms. Unstructured documents have different internal encodings, one of the most diffused encoding is(More)
If smartly utilized, Big Data locked in unstructured sources, such as PDF documents, can yield unprecedented insights in solving tough business issues, optimizing business processes and improving customer relations. The challenge addressed in this paper is to unlock the value held in data plunged in unstructured document. We describe how a contextual(More)