Efficient and effective spam filtering and re-ranking for large web datasets

@article{Cormack2011EfficientAE,
  title={Efficient and effective spam filtering and re-ranking for large web datasets},
  author={Gordon V. Cormack and Mark D. Smucker and Charles L. A. Clarke},
  journal={Information Retrieval},
  year={2011},
  volume={14},
  pages={441--465}
}
The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam—pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad hoc and relevance feedback tasks, which used the ClueWeb09 dataset. We show that a simple content-based… 
TREC 2010 Web Track Notebook: Term Dependence, Spam Filtering and Quality Bias
TLDR
It is found that using Wikipedia as a high-quality document collection for query expansion can ameliorate some of the negative effects of performing pseudo-relevance feedback from a noisy web collection such as ClueWeb09.
An intrinsic evaluation of the Waterloo spam rankings of the ClueWeb09 and ClueWeb12 datasets
TLDR
A ground truth for binary classification (spam vs non-spam) is constructed from Web pages that are judged as spam or relevant under the assumption that a Web page judged as relevant for any query cannot be spam.
Using Anchor Text, Spam Filtering and Wikipedia for Web Search and Entity Ranking
TLDR
It is found that documents in ClueWeb09 category B have a higher probability of being retrieved than other documents in category A, and that following the external links on Wikipedia pages to find the homepages of the entities in the ClueWeb09 collection works better than searching an anchor text index.
Multi-View Learning for Web Spam Detection
TLDR
It is shown that each web page can be classified with satisfactory accuracy using only its own HTML content and that multi-view learning significantly improves the classification performance, namely AUC by 22%, while providing linear speedup for parallel execution.
Indexing without spam
TLDR
To remove spam pages at indexing time, therefore obtaining a pruned index that is virtually “spam-free” is suggested, and it is found that the strategy decreases both the time required by the indexing process and the space required for storing the index.
The Classification Power of Web Features
TLDR
It is concluded that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all previous results published so far on this dataset and can be further improved only slightly by computationally expensive features.
Revisiting Spam Filtering in Web Search
TLDR
Through a detailed failure analysis, it is shown that simple spam filtering is a high risk practice that should be avoided in future work, particularly when working with the ClueWeb12 test collection.
The classification power of Web features, Version 1.0
TLDR
A comprehensive comparison of the best performing classification techniques based on [9, 37, 36, 38] and new experiments concludes that, with appropriate learning techniques, a simple and computationally inexpensive feature subset outperforms all previous results published so far on this data set.
...

References

SHOWING 1-10 OF 36 REFERENCES
Link analysis for Web spam detection
TLDR
After tenfold cross-validation, the best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks
TLDR
For TREC 2009, this approach was used exclusively for the ad hoc web, diversity and relevance feedback tasks, as well as for the batch legal task: the ClueWeb09 and Tobacco collections were processed end-to-end and never indexed.
Nullification test collections for web spam and SEO
TLDR
A need is identified for an adversarial IR collection which is not domain-restricted and which is supported by a set of appropriate query sets and (optimistically) user-behaviour data, and the term nullification is introduced.
Overview of the TREC 2010 Web Track
TLDR
A preliminary spam ranking of the pages in the corpus is provided, as an aid to groups who wish to reduce the number of low-quality pages in their results, and a new assessment structure includes a spam/junk level, which assisted in the evaluation of the spam task.
A Framework for Measuring the Impact of Web Spam
TLDR
A framework for measuring the degradation in quality of search results caused by the presence of web spam is presented and it is demonstrated that simple removal of spam pages from result sets can increase result quality.
TREC 2006 Spam Track Overview
TLDR
TREC’s Spam Track uses a standard testing framework that presents a set of chronologically ordered email messages to a spam filter for classification; four different forms of user feedback are modeled, intended to reflect a user reading email from time to time and perhaps not diligently reporting the filter's errors.
Heuristic Ranking and Diversification of Web Documents
TLDR
It is found that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost retrieval performance, which is assumed to reduce spam in the top-ranked results.
On-line spam filter fusion
TLDR
It is shown that it is possible to select a priori small subsets of the filters that, when combined, still outperform the best individual filter by a substantial margin.
On information retrieval metrics designed for evaluation with incomplete relevance assessments
TLDR
This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task.
Online Discriminative Spam Filter Training
TLDR
A very simple technique is presented for discriminatively training a spam filter, using a simple feature extractor and gradient descent on a logistic regression model.
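The recipe this entry describes (a simple feature extractor feeding a logistic regression model trained online by gradient descent) can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the class name, the hashed bag-of-words features, and the learning rate are all assumptions made here for the example (published filters in this line of work often use hashed byte n-grams instead of word tokens).

```python
import math
import re


class OnlineLRSpamFilter:
    """Illustrative online spam filter: hashed bag-of-words features,
    logistic regression, one stochastic-gradient step per labelled message."""

    def __init__(self, n_buckets=2**18, lr=0.1):
        self.n = n_buckets
        self.lr = lr
        self.w = [0.0] * n_buckets  # one weight per hash bucket

    def _features(self, text):
        # Hash each lowercased word token into a fixed number of buckets.
        return {hash(tok) % self.n for tok in re.findall(r"\w+", text.lower())}

    def score(self, text):
        # Logistic (sigmoid) of the sum of active weights = estimated P(spam).
        z = sum(self.w[i] for i in self._features(text))
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, text, is_spam):
        # Gradient step on the logistic loss: move each active weight
        # toward the label by (label - prediction) * learning rate.
        err = (1.0 if is_spam else 0.0) - self.score(text)
        for i in self._features(text):
            self.w[i] += self.lr * err
```

After a handful of labelled messages, `score` separates spam-like from ham-like text; because training is one pass of constant work per message, the filter suits the chronological, online evaluation setting used in the TREC Spam Track.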
...