Corpus ID: 5977119

Machine Learning Methods for Spamdexing Detection

Tiago A. Almeida, Renato Moraes Silva, Akebo Yamakami
In this paper, we present recent contributions to the battle against one of the main problems faced by search engines: spamdexing, or web spamming. These are malicious techniques used in web pages to circumvent search engines and achieve good visibility in search results. To better understand the problem and find the best setup and methods to avoid this virtual plague, we present a comprehensive performance evaluation of several established…
Towards Web Spam Filtering Using a Classifier Based on the Minimum Description Length Principle
MDLClass, a classifier based on the minimum description length principle, is applied to web spam filtering, and a new approach to detecting web spam that combines the predictions of classifiers using content-based, link-based, and transformed link-based features is evaluated.
WSF2: A Novel Framework for Filtering Web Spam
Applying the WSF2 framework to the publicly available WEBSPAM-UK2007 corpus demonstrates that a simple combination of different techniques can improve on the accuracy of single classifiers for web spam detection, and that the framework is a powerful tool for boosting applied research in this area.
A dynamic model for integrating simple web spam classification techniques
The present study introduces WSF2, a novel web spam filtering framework specifically designed to take advantage of multiple classification schemes and algorithms, and demonstrates its effectiveness by conducting a set of experiments involving a publicly available corpus, as well as different simple well-known classifiers and ensemble approaches.
Spam host classification using PSO-SVM
  • A. Enache, V. Sgârciu
  • Engineering
  • 2014 IEEE International Conference on Automation, Quality and Testing, Robotics
  • 2014
Search engines have become a de facto place to start information acquisition on the Internet. Sabotaging the quality of the results retrieved by search engines can lead users to doubt the search…
WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora
The design and implementation of WARCProcessor are presented. It is a novel multiplatform integrative tool for building scientific datasets to facilitate experimentation in web spam research, and it supports the automatic and concurrent download of web sites from the Internet.
Towards automatic filtering of fake reviews
A comprehensive analysis of content-based classification methods for fake review detection using multiple settings, employing different types of learning and datasets, provides sufficient evidence to respond appropriately to the open questions.
Survey of Challenges in Sentiment Analysis
This paper provides a review of sentiment analysis, its challenges, issues and also a survey of different approaches and techniques to handle those issues with respective advantages and disadvantages.
Protection Against Semantic Social Engineering Attacks
Three high-level defense approaches are discussed: adopting the semantic attack killchain concept, which simplifies targeted defense; principles for preemptive and proactive protection against passive threats; and a platform-based defense-in-depth lifecycle designed to harness the technical and non-technical defense capabilities of platform providers and their user base.
Application of Data Mining Technology on Surveillance Report Data of HIV/AIDS High-Risk Group in Urumqi from 2009 to 2015
Data mining technology, as a new method of assisting disease screening and diagnosis, can help medical personnel to screen and diagnose AIDS rapidly from a large amount of information.
Versatile Cybersecurity
Covert channels circumvent security measures to steal sensitive data undetectable to an onlooker. Traditionally, covert channels utilize global system resources or settings to send hidden messages.


An Analysis of Machine Learning Methods for Spam Host Detection
A comprehensive performance evaluation of several established machine learning techniques used to automatically detect and filter hosts that disseminate web spam indicates that bagging of decision trees, multilayer perceptron neural networks, random forest and adaptive boosting of decision trees are promising in the task of web spam classification.
Towards Web Spam Filtering with Neural-Based Approaches
A performance evaluation of different artificial neural network models for automatically classifying web spam is presented, and the results indicate that the evaluated approaches outperform state-of-the-art web spam filters.
Artificial Neural Networks For Content-based Web Spam Detection
A performance evaluation of different models of artificial neural networks used to automatically classify and filter real samples of web spam based on their contents indicates that some of the evaluated approaches have great potential, since they are well suited to the problem and clearly outperform the state-of-the-art techniques.
Using rank propagation and Probabilistic counting for Link-Based Spam Detection
This paper proposes spam detection techniques that consider only the link structure of the Web, regardless of page contents, computing statistics of the links in the vicinity of every Web page by applying rank propagation and probabilistic counting over the Web graph.
Improving web spam classification using rank-time features
It is shown that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only, and this paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection.
Survey on web spam detection: principles and algorithms
This paper presents a systematic review of web spam detection techniques with the focus on algorithms and underlying principles, and categorizes all existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data.
Removing web spam links from search engine results
A classification technique is developed that uses important features to successfully distinguish spam sites from legitimate entries, and the threat posed by malicious web sites can be mitigated, reducing the risk for users to get infected by malicious code that spreads via drive-by attacks.
Detecting Link Spam Using Temporal Information
This paper defines temporal features such as in-link growth rate (IGR) and in-link death rate (IDR) for use in a spam classification model (i.e., SVM) and shows that link spam can be successfully detected with the proposed method.
Know your neighbors: web spam detection using the web topology
A spam detection system is presented that combines link-based and content-based features and uses the topology of the Web graph by exploiting the link dependencies among Web pages; linked hosts are found to tend to belong to the same class.
Web Spam Detection
  • Marc Najork
  • Computer Science
  • Encyclopedia of Database Systems
  • 2009
Combating web spam consists of identifying spam content with high probability and – depending on policy – downgrading it during ranking, eliminating it from the index, no longer crawling it, and tainting affiliated content.