• Corpus ID: 7842019

A Comparison of Outlier Detection Algorithms for Machine Learning

Hugo Jair Escalante
This paper presents a comparison of outlier detection algorithms, comprising an overview of outlier detection methods and experimental results for six implemented methods. We applied these methods to the prediction of stellar population parameters as well as to machine learning benchmark data, inserting artificial noise and outliers. We used kernel principal component analysis to reduce the dimensionality of the spectral data. Experiments on noisy and noiseless data were…


An Outlier Detection Algorithm Based on Spectral Clustering

  • Peng Yang, Biao Huang
  • Computer Science
    2008 IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application
  • 2008
The experimental results show that the outlier detection algorithm outperforms the K-means based algorithm with high precision and low false alarm rate as well as desirable coverage ratio.

Detection and visualisation of outliers using kernel principal components

  • Alissar Nasser, D. Hamad
  • Computer Science
    2015 Fifth International Conference on Digital Information and Communication Technology and its Applications (DICTAP)
  • 2015
A new method for identifying outliers in a dataset applies the K-means clustering algorithm to the smallest principal components provided by kernel principal component analysis.

A Comparative Evaluation of Supervised and Unsupervised Methods for Detecting Outliers

Light is shed on the layout and performance analysis of supervised and unsupervised methods for detecting outliers, using data mining tools such as RapidMiner and R.

Comparative Study of Outlier Detection Algorithms for Machine Learning

A comparison of the effects of multivariate outlier detection algorithms on machine learning problems is performed, together with a comparative review of the advantages and disadvantages of each algorithm and their respective effects on the accuracy of SVM classifiers.

A Modified Density Based Outlier Mining Algorithm for Large Dataset

  • Peng Yang, Biao Huang
  • Computer Science
    2008 International Seminar on Future Information Technology and Management Engineering
  • 2008
A modified density-based detection algorithm that uses data partitioning and speedup strategies, such as the introduction of module information, to avoid a large number of unnecessary computations while finding outliers.

KNN Based Outlier Detection Algorithm in Large Dataset

  • Peng Yang, Biao Huang
  • Computer Science
    2008 International Workshop on Education Technology and Training & 2008 International Workshop on Geoscience and Remote Sensing
  • 2008
A KNN-based outlier detection algorithm consisting of two phases: it first partitions the dataset into several clusters and then, within each cluster, calculates the Kth nearest neighborhood of each object to find outliers.

An Efficient Outlier Mining Algorithm for Large Dataset

  • Peng Yang, Biao Huang
  • Computer Science
    2008 International Conference on Information Management, Innovation Management and Industrial Engineering
  • 2008
An efficient outlier mining algorithm based on KNN is proposed; it finds outliers more accurately by defining a correlation matrix that considers the importance of, and correlation between, attributes.

A Comparison of Outlier Detection Algorithm for Wireless Sensor Network

Experiments show that the proposed classification approach that provides outlier detection and data classification simultaneously outperforms other techniques in both effectiveness & efficiency.

Outlier Detection: Applications And Techniques

This paper attempts to bring together various outlier detection techniques, in a structured and generic description, to attain a better understanding of the different directions of research on outlier analysis for ourselves as well as for beginners in this research field.



Algorithms for Mining Distance-Based Outliers in Large Datasets

This paper provides formal and empirical evidence showing the usefulness of DB-outliers and presents two simple algorithms for computing such outliers, both having a complexity of O(k N^2), k being the dimensionality and N being the number of objects in the dataset.
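
The summary above does not spell out what a DB-outlier is. A minimal sketch, assuming the usual DB(p, D) definition (an object is an outlier if at least a fraction p of the objects in the dataset lie farther than distance D from it), is a plain nested loop: each distance costs time proportional to the dimensionality k, and all pairs of the N objects are checked. The function name and the toy data are illustrative and are not taken from the paper.

```python
import math

def db_outliers(points, p, D):
    """Return indices of DB(p, D)-outliers using a naive nested loop.

    Assumed definition (not stated in the summary): a point is an outlier
    if at least a fraction p of all points lie farther than D from it.
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the points farther than D from a; a itself is at distance 0,
        # so it never counts as "far".
        far = sum(1 for b in points if math.dist(a, b) > D)
        if far >= p * n:
            outliers.append(i)
    return outliers

# Toy data: four points in a tight cluster and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = db_outliers(points, p=0.75, D=1.0)
```

On this toy data only the isolated point at index 4 has enough distant neighbours to be flagged; the cluster members each have a single point farther than D.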

A Unified Notion of Outliers: Properties and Computation

A unified outlier detection system can replace a whole spectrum of statistical discordancy tests with a single module detecting only the kinds of outliers proposed.

Robust Decision Trees: Removing Outliers from Databases

This paper examines C4.5, a decision tree algorithm that is already quite robust (few algorithms have been shown to consistently achieve higher accuracy), and extends its pruning method to fully remove the effect of outliers, which results in improvement on many databases.

Efficient algorithms for mining outliers from large data sets

A novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor is proposed and the top n points in this ranking are declared to be outliers.
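
The ranking described above can be sketched in a few lines. This is a brute-force illustration only, assuming Euclidean distance and a small in-memory dataset; it does not reproduce the efficient algorithms the paper's title refers to, and the function name and toy data are hypothetical.

```python
import math

def knn_distance_outliers(points, k, n):
    """Rank points by the distance to their k-th nearest neighbor and
    return the indices of the top n (largest distance first)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    # Largest k-th neighbor distance first; ties broken by index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:n]]

# Toy data: four points in a tight cluster and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
top = knn_distance_outliers(points, k=2, n=1)
```

Unlike a fixed distance threshold, this formulation only needs the neighbor count k and the number n of outliers to report, which is why the top of the ranking is taken directly.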

An introduction to kernel-based learning algorithms

This paper provides an introduction to support vector machines, kernel Fisher discriminant analysis, and kernel principal component analysis, as examples of successful kernel-based learning methods.

Noise Clustering with a Fixed Fraction of Noise

The so-called noise clustering technique is modified to make it more robust against a wrong choice of its main control parameter, the noise distance; a computationally efficient algorithm is included.

Discovering Informative Patterns and Data Cleaning

A method for discovering informative patterns from data is presented; with it, data can be reduced to only a few representative data entries, making it an attractive candidate for new applications in knowledge discovery.

Probabilistic noise identification and data cleaning

  • J. Kubica, A. Moore
  • Computer Science
    Third IEEE International Conference on Data Mining
  • 2003
This work presents LENS, an approach for identifying corrupted fields and using the remaining noncorrupted fields for subsequent modeling and analysis, and provides an algorithm for the unsupervised discovery of such models.

Identifying and Eliminating Mislabeled Training Instances

Empirical results suggest that the ensemble filter approach is an effective method for identifying labeling errors, and further, that the approach will significantly benefit ongoing research to develop accurate and robust remote sensing-based methods to map land cover at global scales.

Nonlinear Component Analysis as a Kernel Eigenvalue Problem

A new method for performing a nonlinear form of principal component analysis by the use of integral operator kernel functions is proposed and experimental results on polynomial feature extraction for pattern recognition are presented.