A manually annotated HTML corpus for a novel scientific trend analysis

Richárd Farkas, Róbert Ormándi, Márk Jelasity, János Csirik

Here we present a manually annotated corpus of web pages and an annotation tool for Web Content Mining. The corpus is extensively annotated, has a hierarchical label structure, and is freely available for research purposes. The annotation tool is a Firefox extension that lets the annotator work with the pages in their original appearance. The tool handles the annotation hierarchy independently of the DOM tree of the web pages, and it allows annotations to overlap across HTML tags.
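
One way to picture an annotation hierarchy that is independent of the DOM tree is stand-off annotation: each label is stored as a character span over the page text, with parent links between labels rather than between HTML elements. The sketch below is only an illustration under that assumption and does not describe the tool's actual data model. The names `Annotation`, `TextWithTagSpans`, and `crosses_tag_boundary`, the labels `person` and `name`, and the sample page are all hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: a labelled character span."""
    label: str
    start: int  # inclusive offset into the extracted page text
    end: int    # exclusive offset
    parent: Optional["Annotation"] = None  # label hierarchy, not DOM hierarchy

    def path(self) -> str:
        """Return the hierarchical label, e.g. 'person/name'."""
        if self.parent is None:
            return self.label
        return f"{self.parent.path()}/{self.label}"


class TextWithTagSpans(HTMLParser):
    """Collects the page text and the text span covered by each element."""

    def __init__(self) -> None:
        super().__init__()
        self.text = ""
        self.spans = []  # (tag, start, end) for every closed element
        self._open = []  # stack of (tag, start) for elements still open

    def handle_starttag(self, tag, attrs):
        self._open.append((tag, len(self.text)))

    def handle_endtag(self, tag):
        # Assumes well-formed nesting, which holds for the sample page below.
        open_tag, start = self._open.pop()
        self.spans.append((open_tag, start, len(self.text)))

    def handle_data(self, data):
        self.text += data


def crosses_tag_boundary(ann, spans):
    """True if the annotation partially overlaps some element's text span."""
    return any(
        ann.start < end and start < ann.end              # the spans intersect
        and not (start <= ann.start and ann.end <= end)  # annotation not inside element
        and not (ann.start <= start and end <= ann.end)  # element not inside annotation
        for _, start, end in spans
    )


# Sample page: the <b> element covers "Dr. Jane", but the name is "Jane Doe".
page = "<p><b>Dr. Jane</b> Doe works in <i>Szeged</i>.</p>"
parser = TextWithTagSpans()
parser.feed(page)
parser.close()

person = Annotation("person", 0, 12)             # "Dr. Jane Doe"
name = Annotation("name", 4, 12, parent=person)  # "Jane Doe"

print(name.path(), repr(parser.text[name.start:name.end]))
print(crosses_tag_boundary(name, parser.spans))
```

Here the `name` span starts inside the `<b>` element and ends outside it, so it cannot be expressed as a single DOM subtree, while its place in the label hierarchy (`person/name`) is recorded separately from the markup.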

