Learn More
Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By “record” we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely(More)
Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich,(More)
Web documents that are either partially or completely duplicated in content are easily found on the Internet these days. Not only these documents create redundant information on the Web, which take longer to filter unique information and cause additional storage space, but also they degrade the efficiency of Web information retrieval. In this paper, we(More)
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an(More)
The academic performance of students is affected by their reading ability, which explains why reading is one of the most important aspects of school curriculums. Promoting good reading habits among K-12 students is essential, given the enormous influence of reading on students' development as learners and members of society. In doing so, it is indispensable(More)
Tens of thousands of news articles are posted on-line each day, covering topics from politics to science to current events. In order to better cope with this overwhelming volume of information, RSS (news) feeds are used to categorize newly posted articles. Nonetheless, most RSS users must filter through many articles within the same or different RSS feeds(More)
Traditional phrase matching approaches, which can discover documents containing exactly the same phrases, fail to detect documents including phrases that are semantically relevant, but not exact matches. We propose a correlation-based phrase matching (CPM) model that can detect RSS news articles which contain not only phrases that are exactly the same but(More)
Among the HTML elements, HTML tables [RHJ98] encapsulate hierarchically structured data (hierarchical data in short) in a tabular structure. HTML tables do not come with a rigid schema and almost any forms of two-dimensional tables are acceptable according to the HTML grammar. This relaxation complicates the process of retrieving hierarchical data from HTML(More)