Detecting spam web pages through content analysis

@inproceedings{Ntoulas2006DetectingSW,
  title={Detecting spam web pages through content analysis},
  author={A. Ntoulas and Marc Najork and M. Manasse and Dennis Fetterly},
  booktitle={WWW '06},
  year={2006}
}
In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, and examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2…
Identifying Spam Web Pages Based on Content Similarity
TLDR
This paper presents a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content, and develops a spam-detection tool that outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
Web Spam: A Study of the Page Language Effect on the Spam Detection Features
  • A. Alarifi, M. Alsaleh
  • Computer Science
  • 2012 11th International Conference on Machine Learning and Applications
  • 2012
TLDR
The analysis results show that selecting suitable features for a classifier that segregates spam pages depends heavily on the language of the examined Web page, due in part to the different set of Web spam mechanisms used by each type of spammer.
A Survey of Web Spam Detection Techniques
TLDR
This paper explains different kinds of web spam and describes some methods used to combat this problem, which involves commercial, political, and economic applications.
Web spam detection using SVM classifier
  • Rahul C. Patil, D. Patil
  • Computer Science
  • 2015 IEEE 9th International Conference on Intelligent Systems and Control (ISCO)
  • 2015
TLDR
This paper implements a spam detection system based on an SVM classifier that combines new link features with content and qualified link analysis, and uses the Kullback-Leibler divergence to characterize the relationship between two linked pages.
Link analysis for Web spam detection
TLDR
After tenfold cross-validation, the best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.
Web Mining Techniques to Block Spam Web Sites
TLDR
The aim of this paper is to introduce a system based on web mining techniques to prevent spamming web pages using Decision Tree (DT) rules, which is the best classifier to detect Web spam content.
Characterizing Web Spam Using Content and HTTP Session Analysis
TLDR
The first large-scale characterization of web spam using content and HTTP session analysis techniques on the Webb Spam Corpus is performed, showing significant concentration of hosting IP addresses in two narrow ranges as well as significant overlaps among session header values.
User behavior oriented web spam detection
TLDR
Preliminary experiments on Web access data collected by a commercial Web site show the effectiveness of the proposed spam page detection algorithm based on Bayes learning.
Using rank propagation and Probabilistic counting for Link-Based Spam Detection
TLDR
This paper proposes spam detection techniques that only consider the link structure of the Web, regardless of page contents, and compute statistics of the links in the vicinity of every Web page by applying rank propagation and probabilistic counting over the Web graph.
Removing web spam links from search engine results
TLDR
A classification technique is developed that uses important features to successfully distinguish spam sites from legitimate entries and the threat posed by malicious web sites can be mitigated, reducing the risk for users to get infected by malicious code that spreads via drive-by attacks.

References

SHOWING 1-10 OF 34 REFERENCES
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages
TLDR
This paper proposes that some spam web pages can be identified through statistical analysis, and examines a variety of properties, including linkage structure, page content, and page evolution, and finds that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.
Combating Web Spam with TrustRank
TLDR
This paper proposes techniques to semi-automatically separate reputable, good pages from spam, and shows that they can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
Identifying link farm spam pages
TLDR
Algorithms for detecting link farms automatically are presented by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it, providing a modified web graph to use in ranking page importance.
SpamRank -- Fully Automatic Link Spam Detection
TLDR
A novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of white or blacklists or other means of human intervention is proposed.
Web Spam, Propaganda and Trust
TLDR
This paper analyzes the influence that web spam has on the evolution of the search engines and identifies the strong relationship of spamming methods to propagandistic techniques in society, which can lead to browser-level web spam filters that work in synergy with the powerful search engines to deliver personalized, trusted web results.
Web Spam Taxonomy
TLDR
This paper presents a comprehensive taxonomy of current spamming techniques, which it is believed can help in developing appropriate countermeasures.
Detecting phrase-level duplication on the world wide web
TLDR
The algorithms used to discover a number of other instances of large-scale phrase-level replication within the two data sets collected in December 2002 and June 2004 are described.
Link Spam Alliances
TLDR
This paper studies how web pages can be interconnected in a spam farm in order to optimize rankings and shows that alliances can be synergistic and improve the rankings of all participants.
Blocking Blog Spam with Language Model Disagreement
TLDR
An approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments, which requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity.
An Analysis of Web Documents Retrieved and Viewed
  • B. Jansen, A. Spink
  • Computer Science
  • International Conference on Internet Computing
  • 2003
TLDR
This work presents findings from large-scale research into the page viewing patterns of users of the FAST commercial Web search engine, and examines common patterns concerning the number of pages of results viewed, the number of pages viewed and the relationship between the number and time between multiple site visits.