Le Phong Bao Vuong

This paper introduces an approach to the use of clustering for data extraction from semi-structured Web pages. A variant of the hierarchical agglomerative clustering (HAC) algorithm, K-neighbours-HAC, is developed; it uses the similarities of the data format (HTML tags) and the data content (text string values) to group similar text tokens into clusters. Using…
This paper introduces an approach that achieves automated data extraction for semi-structured Web pages by using clustering to group text tokens and data tuples into clusters. This approach uses both HTML and text features of text tokens to detect the similarities between them. After clustering, similar text tokens are expected to be in the same text…
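
The abstracts above describe grouping text tokens by combining format similarity (HTML tags) with content similarity (text values). The following is a minimal sketch of that general idea only, not the K-neighbours-HAC algorithm itself, whose details are not given here. It assumes each token is a hypothetical `(tag_path, text)` pair, uses Jaccard overlap for tags and `difflib.SequenceMatcher` for text, and substitutes plain average-linkage agglomerative clustering with an illustrative weight and threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations


def tag_similarity(path_a, path_b):
    """Jaccard overlap of the HTML tag sets enclosing two tokens (assumed measure)."""
    a, b = set(path_a), set(path_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def text_similarity(s, t):
    """Character-level similarity of two text values, in [0, 1]."""
    return SequenceMatcher(None, s, t).ratio()


def token_similarity(tok_a, tok_b, w_format=0.5):
    """Weighted mix of format and content similarity; w_format is an assumption."""
    return (w_format * tag_similarity(tok_a[0], tok_b[0])
            + (1 - w_format) * text_similarity(tok_a[1], tok_b[1]))


def cluster_tokens(tokens, threshold=0.6):
    """Naive average-linkage agglomerative clustering over (tag_path, text) tokens.

    Repeatedly merges the most similar pair of clusters until the best
    available similarity falls below `threshold` (an illustrative value).
    """
    clusters = [[i] for i in range(len(tokens))]

    def linkage(c1, c2):
        sims = [token_similarity(tokens[i], tokens[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (a, b), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]

    return [[tokens[i] for i in c] for c in clusters]


if __name__ == "__main__":
    sample = [
        (("table", "tr", "td", "b"), "Price: $10.00"),
        (("table", "tr", "td", "b"), "Price: $12.50"),
        (("div", "p", "i"), "In stock today"),
        (("div", "p", "i"), "In stock tomorrow"),
    ]
    for group in cluster_tokens(sample):
        print([text for _, text in group])
```

With equal weights, tokens that share no enclosing tags can score at most 0.5, so a threshold of 0.6 keeps differently formatted fields apart even when their text happens to look alike. This quadratic-time loop is meant for illustration on a handful of tokens rather than for whole pages.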