Fast webpage classification using URL features

@inproceedings{Kan2005FastWC,
  title={Fast webpage classification using URL features},
  author={Min-Yen Kan and Hoang-Lan Nguyen Thi},
  booktitle={CIKM '05},
  year={2005}
}
We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness on two standardized domains. Our results show that in certain scenarios, URL-based methods approach the performance of current state-of…
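
The feature extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `segment` and `url_features`, the feature string formats, and the choice to split only on non-alphanumeric characters are all assumptions made for illustration. The resulting feature sets could then be passed to any maximum entropy (multinomial logistic regression) learner, which is omitted here.

```python
import re
from urllib.parse import urlparse


def segment(text):
    """Split a URL component into lowercase alphanumeric chunks.

    Assumption: chunks are delimited by any non-alphanumeric character.
    """
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def url_features(url):
    """Return a set of string features for one URL (hypothetical format)."""
    parsed = urlparse(url)
    parts = {
        "host": segment(parsed.netloc),
        "path": segment(parsed.path),
        "query": segment(parsed.query),
    }

    feats = set()
    all_tokens = []
    for name, tokens in parts.items():
        for tok in tokens:
            feats.add(f"tok={tok}")          # plain token feature
            feats.add(f"{name}:tok={tok}")   # component feature: where the token occurs
            all_tokens.append(tok)

    # Sequential features: adjacent token pairs in left-to-right order.
    for a, b in zip(all_tokens, all_tokens[1:]):
        feats.add(f"bigram={a}_{b}")

    # Orthographic features: simple surface properties of the URL.
    feats.add(f"n_tokens={len(all_tokens)}")
    if any(ch.isdigit() for ch in url):
        feats.add("has_digit")

    return feats


if __name__ == "__main__":
    for f in sorted(url_features("http://www.example.edu/courses/cs101/index.html")):
        print(f)
```

Representing each feature as a string keeps the sketch independent of any particular learner; a dictionary vectorizer or feature hashing step would typically sit between this output and the classifier.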

URL-Based Web Page Classification: With n-Gram Language Models
TLDR
Methods proposed for this task, for example, the all-grams approach which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets.
Web page classification using n-gram based URL features
TLDR
A URL-based web page classification method that needs neither the web page content nor its link structure and is implemented with Support Vector Machines and Maximum Entropy Classifiers.
A statistical approach to URL-based web page clustering
TLDR
This work proposes a technique to cluster web pages by means of their URLs exclusively, which is unsupervised, requires little intervention from the user, and does not need to crawl a site extensively to build a classifier for that site, but only a small subset of pages.
Web page classification: Features and algorithms
TLDR
As work in Web page classification is reviewed, the importance of these Web-specific features and algorithms is noted, state-of-the-art practices are described, and the underlying assumptions behind the use of information from neighboring pages are tracked.
Combining content-based and context-based methods for Persian web page classification
TLDR
This paper analyzes content-based and context-based web page features of Persian Wikipedia and tries to exploit a combination of features to improve the accuracy of Persian web page classification.
The Role of URLs in Objectionable Web Content Categorization
TLDR
A novel URL-based objectionable content categorization approach is described and it is demonstrated that the optimum Web filtering results could be achieved when it was used with a content-based approach in a production environment.
URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models
TLDR
This paper proposes a new solution based on the use of an n-gram language model that shows good classification performance and is scalable to larger datasets, and allows for the problem of classifying new URLs with unseen sub-sequences.
Weblog Classification for Fast Splog Filtering: A URL Language Model Segmentation Approach
This paper shows that in the context of statistical weblog classification for splog filtering based on n-grams of tokens in the URL, further segmenting the URLs beyond the standard punctuation is
Purely URL-based topic classification
TLDR
A machine learning approach is applied to the topic identification task and its performance is evaluated in extensive experiments on categorized web pages from the Open Directory Project (ODP).

References

Using urls and table layout for web classification tasks
TLDR
It is shown that the automated classification of Web pages can be much improved if, instead of looking at their textual content, one considers each link's URL and the visual placement of those links on a referring page.
Web page classification without the web page
TLDR
This paper explores the use of URLs for webpage categorization via a two-phase pipeline of word segmentation/expansion and classification, and quantifies its performance against document-based methods, which require the retrieval of the source document.
Web classification using support vector machine
TLDR
The use of Support Vector Machine (SVM) classifiers to classify web pages using both their text and context feature sets is proposed and it is shown that the use of context features especially hyperlinks can improve the classification performance significantly.
Web-page classification through summarization
TLDR
This paper gives empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms and proposes a new Web summarization-based classification algorithm that achieves an approximately 8.8% improvement over pure-text based methods.
Using web structure for classifying and describing web pages
TLDR
By ranking words and phrases in the citing documents according to expected entropy loss, this work is able to accurately name clusters of web pages, even with very few positive examples.
Overview of the TREC-2001 Web track
TREC-2001 saw the falling into abeyance of the Large Web Task but a strengthening and broadening of activities based on the 1.69 million page WT10g corpus. There were two tasks. The topic relevance
Combining Statistical and Relational Methods for Learning in Hypertext Domains
TLDR
This work presents a new approach to learning hypertext classifiers that combines a statistical text-learning method with a relational rule learner and demonstrates that this new approach is able to learn more accurate classifiers than either of its constituent methods alone.
Hierarchical classification of Web content
TLDR
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content using support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification, but not previously explored in the context of hierarchical classification.
A Study of Approaches to Hypertext Categorization
TLDR
This paper examines five hypertext regularities which may (or may not) hold in a particular application domain, and whose presence (or absence) may significantly influence the optimal design of a classifier.
The Anatomy of a Large-Scale Hypertextual Web Search Engine