Web Harvesting

Wolfgang Gatterbauer, "Web Harvesting," in Encyclopedia of Database Systems.
DEFINITION Web harvesting describes the process of gathering and integrating data from various heterogeneous web sources. The necessary input is an appropriate knowledge representation of the domain of interest (e.g., an ontology), together with example instances of concepts or relationships (seed knowledge). The output is structured data (e.g., in the form of a relational database) gathered from the Web. The term harvesting implies that, while passing over a large body of available information…
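The definition above (a knowledge representation plus seed instances in, structured data out) can be sketched as a minimal harvesting loop. Everything below is an illustrative assumption rather than a description of any particular system: the seed set, the toy corpus standing in for fetched pages, and the single enumeration pattern are all made up for the sketch.

```python
import re

# Seed knowledge: known example instances of the concept "city".
SEED_CITIES = {"Vienna", "Berlin"}

# A toy "web" corpus standing in for fetched pages (illustrative only).
CORPUS = [
    "Popular destinations include cities such as Vienna, Prague and Berlin.",
    "The conference rotates between cities such as Lisbon and Vienna.",
]

def harvest(corpus, seeds):
    """Extract new instances from contexts in which seed instances occur."""
    found = set(seeds)
    # A generic enumeration pattern ("cities such as X, Y and Z"),
    # standing in for patterns derived from the domain ontology.
    pattern = re.compile(r"cities such as ([A-Z][\w, and]*)")
    for page in corpus:
        for match in pattern.finditer(page):
            for name in re.split(r",\s*|\s+and\s+", match.group(1)):
                if name:
                    found.add(name.strip())
    return found

cities = harvest(CORPUS, SEED_CITIES)
# cities now also contains "Prague" and "Lisbon", harvested via the seeds.
```

A real harvester would replace the toy corpus with crawled pages and the single regex with patterns grounded in the ontology, but the input/output contract is the same.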
Harvesting Information from Heterogeneous Sources
An information harvesting tool (HeteroHarvest) is presented that addresses these problems by filtering the useful information and then normalizing it into a single non-hypertext format.
Web Data Extraction, Applications and Techniques: A Survey
Aggregation of Categorized Internet Web Content in the Social Circle of a User
An application is presented that not only tracks the internet consumption of individuals but also aggregates it across a social group and displays the summary in a categorical format, providing a crisp overview with practical advantages for various sections of society.
Collection of U.S. Extremist Online Forums: A Web Mining Approach
This study proposes a systematic Web mining approach to collecting and monitoring extremist forums, and creates a collection of 110 U.S. domestic extremist forums containing more than 640,000 documents, which could serve as an invaluable data source to enable a better understanding of the extremists' movements.
Web Scraping Online Newspaper Death Notices for the Estimation of the Local Number of Deaths
Local online death notices and print-media obituaries are compared to administrative mortality data; the resulting estimates of death rates and demographic characteristics of the deceased differ statistically from known population values.
Methodology for the Automated Extraction and Classification of Semi-structured Product and Address Data from Web Pages
This work presents a new methodology for the automated extraction and classification of data from web pages. The methodology EH ("Extraction Heuristics") is designed for the domains of product and…
Learning to Harvest Information for the Semantic Web
A methodology for harvesting information from large distributed repositories (e.g., large Web sites) with minimal user intervention is described, together with its implementation in the Armadillo system.
Web-scale information extraction in KnowItAll (preliminary results)
KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner, is introduced.
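One published ingredient of KnowItAll's design is assessing candidate facts with pointwise-mutual-information statistics derived from search-engine hit counts. The sketch below shows only that scoring idea in isolation; the hit counts are made-up stand-ins for real queries, and the names `HITS` and `pmi` are hypothetical.

```python
# Made-up hit counts standing in for search-engine queries (assumption).
HITS = {
    '"Paris"': 1_000_000,
    '"Paris is a city"': 20_000,
    '"Banana"': 800_000,
    '"Banana is a city"': 4,
}

def pmi(instance, discriminator="is a city"):
    """Score a candidate instance by the ratio of co-occurrence hits
    (instance + discriminator phrase) to hits for the instance alone;
    a higher ratio suggests the instance really belongs to the class."""
    together = HITS.get(f'"{instance} {discriminator}"', 0)
    alone = HITS.get(f'"{instance}"', 1)
    return together / alone

# pmi("Paris") is orders of magnitude above pmi("Banana"),
# so "Paris" would be accepted as a city and "Banana" rejected.
```

In the full system such scores feed a probabilistic assessor; the point here is only that unsupervised hit-count statistics can separate good extractions from bad ones.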
Automatic information extraction from large websites
A novel approach to information extraction from websites is presented that reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference; it shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction.
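As a rough illustration of the template-alignment idea behind such grammar-inference approaches (not the paper's actual algorithm), two pages generated from the same HTML template can be aligned so that matching tokens are treated as template text and mismatches as data fields. The tokenizer and the two example pages below are assumptions for the sketch.

```python
import difflib
import re

def tokenize(page):
    # Split into alternating tag/text tokens; drop empty strings.
    return [t for t in re.split(r"(<[^>]+>)", page) if t]

def infer_fields(page_a, page_b):
    """Align two pages from one template; return the mismatching
    (i.e., data) tokens of each page, in document order."""
    ta, tb = tokenize(page_a), tokenize(page_b)
    sm = difflib.SequenceMatcher(a=ta, b=tb, autojunk=False)
    fields_a, fields_b = [], []
    prev_a = prev_b = 0
    for start_a, start_b, size in sm.get_matching_blocks():
        # Tokens skipped over between matching blocks are data fields.
        fields_a.extend(ta[prev_a:start_a])
        fields_b.extend(tb[prev_b:start_b])
        prev_a, prev_b = start_a + size, start_b + size
    return fields_a, fields_b

fields = infer_fields("<li><b>Dune</b><i>1965</i></li>",
                      "<li><b>Neuromancer</b><i>1984</i></li>")
# → (['Dune', '1965'], ['Neuromancer', '1984'])
```

The shared tags survive as template; only the varying title and year tokens are reported, which is the intuition wrapper-induction systems generalize to whole sites.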