Corpus ID: 11222009

Web Spam Taxonomy

Zoltán Gyöngyi and Hector Garcia-Molina
Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures. 


A Survey of Major Techniques for Combating Link Spamming
An overview of present status of link spamming technique, and a summary of several combating techniques are presented, including a classification of those techniques.
Identifying Web Spam by Densely Connected Sites and its Statistics in a Japanese Web Snapshot
This paper analyzes distributions of link spam in the archive of Japanese web pages using link analysis techniques and concludes that link spam is a major factor in the degradation of search results.
Classifying Spam using URLs
  Computer Science, 2018
An algorithm able to identify a spam web page before crawl time would not only save precious crawl resources by allowing the crawler to bypass spam URLs, but would also prevent those URLs from later being served in response to user queries, thereby improving the quality of search results.
Detecting Arabic Web Spam
Discusses current spamming techniques and ranking algorithms for Web pages, applies three algorithms for detecting Arabic spam pages, and compares their results, showing that k-nearest neighbour outperforms the other algorithms used.
Web spam
This paper discusses what web spam is, with examples and a discussion of the overlap with 'legitimate' marketing material, and presents some ideas about how to identify it automatically in order to filter it out of the authors' web corpora.
User behavior oriented web spam detection
Preliminary experiments on Web access data collected by a commercial Web site show the effectiveness of the proposed spam page detection algorithm based on Bayes learning.
Spam and popularity ratings for combating link spam
A new approach for propagating spam scores in web graphs is presented in order to combat link spam; the resulting spam rating is then used when propagating popularity scores such as PageRank.
Counter measures against evolving search engine spamming techniques
A new way to counter black-hat techniques is proposed, combining link-based spam detection with the PageRank algorithm, which helps to discover the target page and trace the entire graph responsible for spreading spam.
Detecting Content Spam on the Web through Text Diversity Analysis
This paper proposes a set of content diversity features based on frequency rank distributions for terms and topics that combine with a wide range of other content features to produce a content spam classifier that outperforms existing results.
Towards Evaluating Web Spam Threats and Countermeasures
The results indicate that online real time tools are highly recommended solutions against web spam threats.


Spam, Damn Spam, and Statistics
This paper proposes that some spam web pages can be identified through statistical analysis, and examines a variety of properties, including linkage structure, page content, and page evolution, and finds that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.
Combating Web Spam with TrustRank
Link Spam Alliances
This paper studies how web pages can be interconnected in a spam farm in order to optimize rankings and shows that alliances can be synergistic and improve the rankings of all participants.
Challenges in web search engines
This article presents a high-level discussion of some problems in information retrieval that are unique to web search engines. The goal is to raise awareness and stimulate research in these areas.
The PageRank Citation Ranking: Bringing Order to the Web
This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
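The power-iteration computation that this entry alludes to can be sketched as follows; the graph, damping factor, and iteration count below are illustrative assumptions, not taken from the paper:

```python
# Minimal PageRank power-iteration sketch (hypothetical example graph).
def pagerank(graph, damping=0.85, iters=50):
    """graph: dict mapping node -> list of outlink targets."""
    nodes = set(graph) | {w for outs in graph.values() for w in outs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # every page gets the base (teleport) share
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            outlinks = graph.get(v, [])
            if outlinks:
                # split this page's rank evenly among its outlinks
                share = damping * rank[v] / len(outlinks)
                for w in outlinks:
                    new[w] += share
            else:
                # dangling page: spread its rank uniformly over all pages
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# Tiny example: B and C both link to A, A links back to B.
g = {"A": ["B"], "B": ["A"], "C": ["A"]}
print(pagerank(g))
```

Because A receives links from both B and C, it ends up with the highest score; the scores always sum to 1 since rank is only redistributed, never created.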
The Happy Searcher: Challenges in Web Information Retrieval
It is shown that by leveraging the vast amounts of data on the web, it is possible to successfully address problems in innovative ways that vastly improve on standard, but often data-impoverished, methods.
Inside PageRank
A circuit analysis is introduced that makes it possible to understand the distribution of the page score, the way different Web communities interact with each other, the role of dangling pages (pages with no outlinks), and the secrets of promoting Web pages.
Deeper Inside PageRank
A comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, and suggested alternatives to the traditional solution methods.
Authoritative sources in a hyperlinked environment
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set ...
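The mutually reinforcing hub/authority iteration this entry describes (commonly known as HITS) can be sketched roughly as follows; the example graph is an illustrative assumption:

```python
# Minimal HITS (hubs and authorities) sketch on a hypothetical link graph.
def hits(graph, iters=50):
    """graph: dict node -> list of outlink targets. Returns (hubs, authorities)."""
    nodes = set(graph) | {w for outs in graph.values() for w in outs}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iters):
        # authority score: sum of hub scores of the pages linking to it
        auth = {v: sum(hub[u] for u in graph if v in graph.get(u, []))
                for v in nodes}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {v: a / norm for v, a in auth.items()}
        # hub score: sum of authority scores of the pages it links to
        hub = {v: sum(auth[w] for w in graph.get(v, [])) for v in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {v: h / norm for v, h in hub.items()}
    return hub, auth

# H1 links to A; H2 links to both A and B.
hubs, auths = hits({"H1": ["A"], "H2": ["A", "B"]})
```

Here A, pointed to by both hubs, gets the highest authority score, while H2, which links to both authorities, gets the highest hub score.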