Distance-based outliers: algorithms and applications

@article{Knorr2000DistancebasedOA,
  title={Distance-based outliers: algorithms and applications},
  author={Edwin M. Knorr and Raymond T. Ng and Vladimir Tucakov},
  journal={The VLDB Journal},
  year={2000},
  volume={8},
  pages={237-253}
}
Abstract. This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance…
Example-Based Outlier Detection for High Dimensional Datasets
TLDR
A novel solution is presented to the problem of detecting outliers in high-dimensional datasets based on user examples: it discovers the hidden view of the outliers and picks out further objects that are outstanding in the projection where the examples stand out greatly.
Class Outliers Mining: Distance-Based Approach
TLDR
This research poses the problem of Class Outliers Mining, presents a method for finding such outliers, and proposes the Class Outlier Factor (COF), which measures the degree to which a data object is a class outlier.
Outliers Detection in Multi-label Datasets
TLDR
This paper proposes a method that measures the degree of anomaly of an object in a multi-label dataset and quantifies the level of irregularity of that object with respect to the dataset.
Mining class outliers: concepts, algorithms and applications in CRM
TLDR
The notion of class outlier is developed, practical solutions are proposed by extending existing outlier detection algorithms to this case, and potential applications in CRM (customer relationship management) are discussed.
Outlier detection by example
TLDR
This OBE (Outlier By Example) system is the first that allows users to provide examples of outliers in low-dimensional datasets and can discover values that a user would consider outliers.
A Scalable and Efficient Outlier Detection Strategy for Categorical Data
TLDR
Attribute Value Frequency (AVF) is introduced, a fast and scalable outlier detection strategy for categorical data that scales linearly with the number of data points and attributes, and relies on a single data scan.
Detection of outliers and outliers clustering on large datasets with distributed computing
TLDR
This work presents several distributed computing algorithms for outlier detection, starting from a distributed version of an existing algorithm, CURIO, and introducing a series of optimizations and variants that lead to a new method, Curio3XD, which addresses the two issues typical of this problem: the constraints imposed by the size and by the dimensionality of the datasets.
A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database
  • Fahed Yoseph, M. Heikkilä
  • Computer Science
  • 2019 International Conference on Machine Learning and Data Engineering (iCMLDE)
  • 2019
TLDR
A clustering-based approach to identifying outliers in a retail point-of-sales dataset is proposed; the experimental results show that the K-means algorithm outperforms the Fuzzy C-means (FCM) algorithm in terms of outlier detection efficiency and is an effective outlier detection solution.
Outlier mining in large high-dimensional data sets
  • F. Angiulli, C. Pizzuti
  • Mathematics, Computer Science
  • IEEE Transactions on Knowledge and Data Engineering
  • 2005
TLDR
An in-memory and disk-based implementation of the HilOut algorithm and a thorough scaling analysis for real and synthetic data sets showing that the algorithm scales well in both cases are presented.

References

SHOWING 1-10 OF 47 REFERENCES
Algorithms for Mining Distance-Based Outliers in Large Datasets
TLDR
This paper provides formal and empirical evidence showing the usefulness of DB-outliers and presents two simple algorithms for computing such outliers, both having a complexity of O(k N^2), k being the dimensionality and N being the number of objects in the dataset.
A unified approach for mining outliers
TLDR
The proposed, intuitive notion of outliers can unify or generalize many of the existing notions of outliers provided by discordancy tests for standard statistical distributions, so that when mining large datasets containing many attributes, a unified approach can replace many statistical discordancy tests, regardless of any knowledge about the underlying distribution of the attributes.
A Unified Notion of Outliers: Properties and Computation
TLDR
A unified outlier detection system can replace a whole spectrum of statistical discordancy tests with a single module detecting only the kinds of outliers proposed.
BIRCH: an efficient data clustering method for very large databases
TLDR
A data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is presented, and it is demonstrated that it is especially suitable for very large databases.
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
TLDR
DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
Fast Computation of 2-Dimensional Depth Contours
TLDR
A fast algorithm is given, FDC, which computes the first k 2-D depth contours by restricting the computation to a small selected subset of data points, instead of examining all data points.
A Linear Method for Deviation Detection in Large Databases
TLDR
The problem of finding deviations in large databases is described, a formal description of the problem is given, and a linear algorithm for detecting deviations is presented, using the implicit redundancy of the data.
Efficient and Effective Clustering Methods for Spatial Data Mining
Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. In this paper, we explore whether clustering methods have a role…
Efficient and Effective Clustering Methods for Spatial Data Mining
TLDR
The analysis and experiments show that with the assistance of CLARANS, these two algorithms are very effective and can lead to discoveries that are difficult to find with current spatial data mining algorithms.
Fast Spatio-Temporal Data Mining of Large Geophysical Datasets
TLDR
Early experiences are presented with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to such massive scientific datasets, and several associated feature extraction algorithms implemented on MPP platforms.
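
Illustrative sketch (not part of the original listing). The first reference above reports two simple algorithms for computing DB-outliers with complexity quadratic in the number of objects N. As a rough illustration only, the Python sketch below applies the DB(p, D) definition from Knorr and Ng's work (an object is an outlier if at least a fraction p of the objects in the dataset lie at a distance greater than D from it) by brute-force pairwise comparison. The function name db_outliers, the choice of Euclidean distance, and the sample points are assumptions made for this sketch; it is not the authors' implementation.

```python
import math


def db_outliers(points, p, d):
    """Return indices of DB(p, d)-outliers using naive pairwise comparison.

    points: list of equal-length numeric tuples (k dimensions each)
    p:      minimum fraction of the dataset that must lie farther than d
    d:      distance threshold (Euclidean distance is assumed here)
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the objects lying strictly farther than d from object i.
        # The object itself is at distance 0 and so never counts as far.
        far = sum(1 for b in points if math.dist(a, b) > d)
        if far >= p * n:
            outliers.append(i)
    return outliers


# Hypothetical sample data: four clustered points and one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(sample, p=0.75, d=2.0))  # [4]
```

Each distance computation touches k coordinates and every pair of objects is compared, so this sketch costs on the order of k N^2 operations, which is why it is only practical for small inputs.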