Adversarial Information Retrieval on the Web (AIRWeb 2007)

By C. Castillo, K. Chellapilla, and Brian D. Davison. Published in SIGIR Forum.
The ubiquitous use of search engines to discover and access Web content shows clearly the success of information retrieval algorithms. However, unlike controlled collections, the vast majority of Web pages lack an authority asserting their quality. This openness of the Web has been the key to its rapid growth and success, but this openness is also a major source of new adversarial challenges for information retrieval methods. 
Spam detection through link authorization from neighboring nodes
The paper proposes a Link Authorization Model to detect link-spam propagation onto neighboring pages: it predicts whether a page's outgoing links carry true or false authorization, and for every false authorization detected, the linking page is penalized by a pre-determined threshold.
Web spam detection using trust and distrust-based ant colony optimization learning
A machine learning approach to Web spam detection based on ant colony optimization (ACO) is proposed; it constructs rule-based classifiers that distinguish spam hosts from non-spam hosts.
Adversarial Web Search
It is shown that search engine spammers create false content and misleading links to lure unsuspecting visitors to pages filled with advertisements or malware, and work over the past decade or so that aims to discover such spamming activities is examined, demonstrating that this conflict is far from over.


Building bridges for web query classification
A novel approach for QC is presented that outperforms the winning solution of the ACM KDDCUP 2005 competition and introduces category selection as a new method for narrowing down the scope of the intermediate taxonomy based on which the authors classify the queries.
Detecting spam web pages through content analysis
Some previously-undescribed techniques for automatically detecting spam pages are considered, and the effectiveness of these techniques in isolation and when aggregated using classification algorithms is examined.
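The flavor of content analysis described above can be sketched in a few lines. The feature names below are illustrative, not the paper's exact feature set; the compressibility signal reflects the general observation that keyword-stuffed spam text is far more repetitive than ordinary prose.

```python
import zlib

def content_features(text: str) -> dict:
    """Compute a few illustrative content features of the kind used in
    spam-page classification (hypothetical names, not the paper's exact set)."""
    words = text.split()
    n = len(words)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    raw = text.encode("utf-8")
    # Ratio of raw size to compressed size: repetitive spam compresses well.
    ratio = len(raw) / len(zlib.compress(raw)) if raw else 0.0
    return {"word_count": n, "avg_word_len": avg_len, "compression_ratio": ratio}

spam = "cheap pills " * 200                      # keyword-stuffed text
ham = "However, unlike controlled collections, most Web pages lack an authority asserting their quality."
assert content_features(spam)["compression_ratio"] > content_features(ham)["compression_ratio"]
```

In practice such features would be fed to a standard classifier (decision tree, boosting, etc.) rather than thresholded by hand.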
Improving web search ranking by incorporating user behavior information
It is shown that incorporating user behavior data can significantly improve the ordering of top results in a real web search setting, improving the accuracy of a competitive web search ranking algorithm by as much as 31% relative to its original performance.
SpamRank -- Fully Automatic Link Spam Detection
A novel method based on the concept of personalized PageRank is proposed that detects pages with an undeservedly high PageRank value without the need for whitelists, blacklists, or other human intervention.
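The core primitive here, personalized PageRank, biases the random surfer's teleport step toward a chosen seed distribution. A minimal power-iteration sketch (not the SpamRank algorithm itself, just the underlying primitive) might look like:

```python
def personalized_pagerank(adj, teleport, alpha=0.85, iters=50):
    """Power iteration for PageRank with a personalization (teleport) vector.
    adj: {node: [out-neighbors]}; teleport: {node: prob}, summing to 1.
    Illustrative sketch; real implementations use sparse matrices."""
    nodes = list(adj)
    pr = {u: teleport.get(u, 0.0) for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - alpha) * teleport.get(u, 0.0) for u in nodes}
        for u in nodes:
            outs = adj[u]
            if not outs:
                # Dangling node: redistribute its mass via the teleport vector.
                for v in nodes:
                    nxt[v] += alpha * pr[u] * teleport.get(v, 0.0)
            else:
                share = alpha * pr[u] / len(outs)
                for v in outs:
                    nxt[v] += share
        pr = nxt
    return pr

# Teleporting only to "a" concentrates score around it.
scores = personalized_pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}, {"a": 1.0})
assert scores["a"] > scores["b"] > scores["c"]
```

Spam-detection methods in this family compare such personalized scores (seeded on trusted or suspicious pages) against the global PageRank to spot undeserved rank.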
Hashcash - a denial of service countermeasure
Technical report, 2002
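The hashcash idea is a proof-of-work stamp: the sender must find a counter whose hash has a prescribed number of leading zero bits, which is cheap to verify but costly to mint. A minimal sketch (the real stamp format has more fields, such as a version and date):

```python
import hashlib
import itertools

def mint(resource: str, bits: int = 20) -> str:
    """Find a counter such that SHA-1(resource:counter) starts with
    `bits` zero bits. Minimal sketch of the hashcash proof-of-work idea."""
    for counter in itertools.count():
        stamp = f"{resource}:{counter}"
        digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
        if digest >> (160 - bits) == 0:   # SHA-1 digests are 160 bits wide
            return stamp

def verify(stamp: str, bits: int = 20) -> bool:
    """Verification costs a single hash, regardless of minting cost."""
    digest = int.from_bytes(hashlib.sha1(stamp.encode()).digest(), "big")
    return digest >> (160 - bits) == 0

stamp = mint("alice@example.com", bits=12)   # ~2**12 hashes expected
assert verify(stamp, bits=12)
```

Each extra bit of difficulty doubles the expected minting work, which is what makes bulk spamming uneconomical while leaving legitimate single sends cheap.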
Mining clickthrough data for collaborative web search
This paper proposes a Collaborative Web Search (CWS) framework based on the probabilistic modeling of the co-occurrence relationship among the heterogeneous web objects: users, queries, and Web pages, and experiments validate the effectiveness of the CWS approach.
Detecting semantic cloaking on the web
An automated two-step method is proposed to detect semantic cloaking, based on comparing copies of the same page downloaded by a web crawler and by a web browser; the authors estimate that more than 50,000 pages in their dataset employ semantic cloaking.
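The comparison step rests on a simple idea: content the server shows only to the crawler's user agent is suspicious. A crude filtering heuristic in that spirit (the threshold and function name are hypothetical, not the paper's classifier) can be sketched as:

```python
def cloaking_signal(crawler_copy: str, browser_copy: str, threshold: int = 3) -> bool:
    """Flag a page when the number of terms visible only in the crawler's
    copy meets a threshold. Crude first-pass filter; a real system would
    re-download both copies and then apply a trained classifier."""
    crawler_terms = set(crawler_copy.lower().split())
    browser_terms = set(browser_copy.lower().split())
    crawler_only = crawler_terms - browser_terms
    return len(crawler_only) >= threshold

# Page stuffed with extra keywords for the crawler only:
assert cloaking_signal("buy cheap pills casino poker now", "welcome to my page")
# Identical copies raise no signal:
assert not cloaking_signal("hello world", "hello world")
```

Fetching the same URL twice with different `User-Agent` headers would supply the two copies in practice; the pure-string comparison above keeps the sketch self-contained.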
Using rank propagation and Probabilistic counting for Link-Based Spam Detection
This paper proposes spam detection techniques that consider only the link structure of the Web, regardless of page contents, and computes statistics of the links in the vicinity of every Web page by applying rank propagation and probabilistic counting over the Web graph.
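Probabilistic counting lets such methods estimate the number of distinct "supporters" of a page without storing them all. A simplified single-hash Flajolet-Martin sketch illustrates the principle (real systems average many hash functions to reduce variance; the function names here are illustrative):

```python
import hashlib

def lowest_set_bit(x: int) -> int:
    """0-based index of the least-significant 1 bit (x must be > 0)."""
    return (x & -x).bit_length() - 1

def fm_estimate(items) -> float:
    """Flajolet-Martin style estimate of the number of distinct items:
    track the maximum lowest-set-bit position over hashed items.
    Single hash function, so the estimate is noisy."""
    max_rho = 0
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        if h:
            max_rho = max(max_rho, lowest_set_bit(h) + 1)
    return 2 ** max_rho / 0.77351   # standard Flajolet-Martin correction factor

# Duplicates do not change the estimate -- only distinct items matter,
# which is exactly why it suits counting distinct supporters in a graph.
assert fm_estimate(list(range(100)) * 5) == fm_estimate(range(100))
```

The sketch uses constant memory per page, which is what makes vicinity statistics feasible over a Web-scale graph.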
External-memory algorithms and data structures
A large-scale study of link spam detection by graph algorithms
This paper studies the overall structure and distribution of link farms in a large-scale graph of the Japanese Web with 5.8 million sites and 283 million links, and expands these link farms into a reliable spam seed set using a minimum-cut technique that separates spam sites from non-spam sites.
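A minimum cut between known-spam and known-good seeds can be computed with any max-flow routine. The Edmonds-Karp sketch below is a generic illustration of the primitive, not the paper's exact construction; in their setting the source and sink sides would correspond to super-nodes attached to spam and non-spam seed sites.

```python
from collections import defaultdict, deque

def min_cut_source_side(capacity, s, t):
    """Edmonds-Karp max-flow; returns the set of nodes on the source side
    of a minimum s-t cut. capacity: {(u, v): cap} for directed edges."""
    graph = defaultdict(dict)
    for (u, v), c in capacity.items():
        graph[u][v] = graph[u].get(v, 0) + c
        graph[v].setdefault(u, 0)               # residual back-edge
    def bfs():
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in graph[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if v == t:
                        return parent
                    queue.append(v)
        return None                             # no augmenting path left
    while (parent := bfs()):
        # Find the bottleneck capacity along the augmenting path.
        v, bottleneck = t, float("inf")
        while parent[v] is not None:
            bottleneck = min(bottleneck, graph[parent[v]][v])
            v = parent[v]
        # Push flow: decrease forward capacities, increase residuals.
        v = t
        while parent[v] is not None:
            graph[parent[v]][v] -= bottleneck
            graph[v][parent[v]] += bottleneck
            v = parent[v]
    # Source side of the cut = nodes still reachable in the residual graph.
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v, c in graph[u].items():
            if c > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

cap = {("s", "a"): 3, ("a", "t"): 2, ("s", "b"): 2, ("b", "t"): 3}
assert min_cut_source_side(cap, "s", "t") == {"s", "a"}
```

Nodes landing on the source (spam-seed) side of the cut become the expanded seed set; edge capacities would encode link weights between sites.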