Thanh Hoang Nguyen

Learn More
In this paper we describe a new approach to extract element labels from Web form interfaces. Having these labels is a requirement for several techniques that attempt to retrieve and integrate information that is hidden behind form interfaces, such as hidden Web crawlers and metasearchers. However, given the wide variation in form layout, even within a(More)
Recent research has taken advantage of Wikipedia’s multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in(More)
We present DeepPeep (http://www.deeppeep.org), a new system for discovering, organizing and analyzing Web forms. DeepPeep allows users to explore the entry points to hidden-Web sites whose contents are out of reach for traditional search engines. Besides demonstrating important features of DeepPeep and describing the infrastructure we used to build the(More)
There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify the access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A(More)
Wikipedia has become an important source of information which is growing very rapidly. However, the existing infrastructure for querying this information is limited and often ignores the inherent structure in the information and links across documents. In this paper, we present a new approach for querying Wikipedia content that supports a simple, yet(More)
Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate(More)
  • 1