Automatic web news extraction using tree edit distance
tl;dr
In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites.
  • 366
  • 26
  • Open Access
A brief survey of web data extraction tools
tl;dr
In the last few years, several works in the literature have addressed the problem of data extraction from Web pages.
  • 751
  • 19
  • Open Access
A source independent framework for research paper recommendation
tl;dr
We propose a novel source independent framework for research paper recommendation.
  • 112
  • 12
DEByE - Data Extraction By Example
tl;dr
In this paper we present DEByE (Data Extraction By Example), an approach to extracting data from Web sources, based on a small set of examples specified by the user.
  • 180
  • 11
Automatic generation of agents for collecting hidden Web pages for data extraction
tl;dr
We present a method for automatically generating agents to collect hidden Web pages for sites with common navigational characteristics.
  • 97
  • 11
A fast and robust method for web page template detection and removal
tl;dr
We present a new method that efficiently and accurately removes templates found in collections of web pages.
  • 112
  • 11
  • Open Access
A Genetic Programming Approach to Record Deduplication
tl;dr
In this paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or not.
  • 101
  • 8
Organizing Hidden-Web Databases by Clustering Visible Web Documents
tl;dr
We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context - both within and in the neighborhood of forms - as the basis for similarity comparison.
  • 56
  • 8
  • Open Access
FLUX-CIM: flexible unsupervised extraction of citation metadata
tl;dr
In this paper we propose a knowledge-base approach to help extract the correct components of citations in any given format.
  • 77
  • 6
  • Open Access
Active Learning Genetic programming for record deduplication
tl;dr
We introduced an active learning genetic programming algorithm, named AGP, and instantiated it for the task of record deduplication.
  • 36
  • 5
  • Open Access