Corpus ID: 8558306

Outlier Detection for Text Data : An Extended Version

@article{Kannan2017OutlierDF,
  title={Outlier Detection for Text Data : An Extended Version},
  author={Ramakrishnan Kannan and Hyenkyun Woo and Charu C. Aggarwal and Haesun Park},
  journal={ArXiv},
  year={2017},
  volume={abs/1701.01325}
}
The problem of outlier detection is extremely challenging in many domains such as text, in which the attribute values are typically non-negative, and most values are zero. [...] Our iterative algorithm TONMF is based on the block coordinate descent (BCD) framework. We define blocks over the term-document matrix such that the function becomes solvable. Given the most recently updated values of the other matrix blocks, we always update one block at a time to its optimal value.
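A minimal Python sketch of block coordinate descent in this spirit follows. The objective used here (non-negative low-rank factors W and H plus a column-sparse outlier matrix Z with an l2,1 penalty), the function name `bcd_lowrank_outliers`, and the parameters `rank` and `alpha` are assumptions made for illustration; they are not taken from the excerpt above, and this is not the authors' TONMF implementation. Each block (one column of W, one row of H, or all of Z) has a closed-form minimizer when the other blocks are held fixed.

```python
import numpy as np

def bcd_lowrank_outliers(A, rank=2, alpha=1.0, n_iter=50, seed=0):
    """Illustrative BCD for the ASSUMED objective
        ||A - Z - W H||_F^2 + alpha * sum_j ||Z[:, j]||_2,  W >= 0, H >= 0,
    where columns of A are documents and Z holds per-document outlier parts.
    """
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    Z = np.zeros_like(A)
    eps = 1e-12
    history = []
    for _ in range(n_iter):
        B = A - Z  # part of the data the low-rank factors should explain
        for k in range(rank):
            # Residual with component k removed; it does not depend on
            # W[:, k] or H[k], so it serves both block updates below.
            Rk = B - W @ H + np.outer(W[:, k], H[k])
            hk = H[k]
            if hk @ hk > eps:
                # Exact minimizer over the block W[:, k] >= 0.
                W[:, k] = np.maximum(0.0, Rk @ hk / (hk @ hk))
            wk = W[:, k]
            if wk @ wk > eps:
                # Exact minimizer over the block H[k] >= 0.
                H[k] = np.maximum(0.0, wk @ Rk / (wk @ wk))
        # Exact minimizer over the block Z: column-wise shrinkage.
        R = A - W @ H
        norms = np.linalg.norm(R, axis=0)
        Z = R * np.maximum(0.0, 1.0 - alpha / (2.0 * np.maximum(norms, eps)))
        obj = np.linalg.norm(A - Z - W @ H) ** 2 + alpha * np.linalg.norm(Z, axis=0).sum()
        history.append(obj)
    scores = np.linalg.norm(Z, axis=0)  # larger norm = more outlying document
    return W, H, Z, scores, history
```

Because every block update is an exact minimization, the recorded objective values in `history` should not increase from one sweep to the next, which is a useful sanity check when experimenting with such a scheme.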

Outlier Detection for Text Data

This paper presents a matrix factorization method, which is naturally able to distinguish the anomalies with the use of low rank approximations of the underlying data, and has significant advantages over traditional methods for text outlier detection.

A Lightweight Yet Robust Approach to Textual Anomaly Detection

A new approach based on an alternative regularization of the NMF objective is introduced, which surpasses other linear AD models and is on par with deep models, performing comparably well even at very small outlier concentrations.

A Study on Different Methods of Outlier Detection Algorithms in Data Mining

This survey provides an overview of outliers and existing outliers by classifying them into different dimensions and proves the performance of the rough set based entropy measure with weighted density value over existing methods.

Anomaly Detection Between Judicial Text-Based Documents

Two methods for searching for anomalies in judicial practice are presented, and a comparative analysis of the effectiveness of both methods is given.

Progress in Outlier Detection Techniques: A Survey

This survey presents a comprehensive and organized review of the progress of outlier detection methods from 2000 to 2019 and categorizes them into diverse outlier detection techniques, such as distance-, clustering-, density-, ensemble-, and learning-based methods.

Convolutional Neural Networks for Unsupervised Anomaly Detection in Text Data

A specific CNN architecture is proposed that consists of one convolutional layer and one subsampling layer and uses an RBF activation function and a logarithmic loss function on the final layer; minimization of the corresponding objective function helps to calculate the location parameter of the features' weights discovered on the last network layer.

GMOTE: Gaussian based minority oversampling technique for imbalanced classification adapting tail probability of outliers

This paper proposes a Gaussian based minority oversampling technique (GMOTE) with a statistical perspective for imbalanced datasets that generates instances by the Gaussian Mixture Model and adapts the tail probability of instances through the Mahalanobis distance to consider local outliers.

Anomaly Detection in Text Documents using HTM Networks

This work combines multiple algorithms, including a non-traditional neural network model, the Hierarchical Temporal Memory (HTM) network, to find anomalies in texts, using semantic folding to represent the text inputs for the HTM algorithm.

DATE: Detecting Anomalies in Text via Self-Supervision of Transformers

This work learns the DATE model end-to-end, enforcing two independent and complementary self-supervision signals, one at the token-level and one at the sequence-level, and shows strong quantitative and qualitative results on the 20Newsgroups and AG News datasets.

Topic modeling for sequential documents based on hybrid inter-document topic dependency

Two new topic modeling methods for sequential documents based on hybrid inter-document topic dependency are proposed, which outperform state-of-the-art models in terms of the accuracy of topic modeling, the quality of topic clustering, and the effectiveness of outlier detection.

References

Outlier detection for high dimensional data

New techniques for outlier detection which find the outliers by studying the behavior of projections from the data set are discussed.

LOF: identifying density-based local outliers

This paper contends that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier, called the local outlier factor (LOF), and gives a detailed formal analysis showing that LOF enjoys many desirable properties.
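The idea of a per-object degree of outlierness can be sketched in a few lines of Python. This is a simplified illustration of the standard LOF definitions (k-distance, reachability distance, local reachability density); the function name `local_outlier_factor` and the default `k` are chosen here for illustration, and ties at the k-th neighbor distance are ignored, which the full definition handles by enlarging the neighborhood.

```python
import numpy as np

def local_outlier_factor(X, k=3):
    """Simplified LOF: values near 1 suggest inliers, larger values outliers."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is never its own neighbor.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]          # k nearest neighbors of each point
    k_dist = D[np.arange(n), nn[:, -1]]        # distance to the k-th neighbor
    # reach_dist(p, o) = max(k_dist(o), d(p, o)) for each neighbor o of p.
    reach = np.maximum(k_dist[nn], D[np.arange(n)[:, None], nn])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)   # local reachability density
    # LOF(p) = mean density of p's neighbors divided by p's own density.
    return lrd[nn].mean(axis=1) / lrd
```

Points deep inside a uniformly dense cluster get scores close to 1, while an isolated point gets a score well above 1 because its neighbors are much denser than it is.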

Algorithms for Mining Distance-Based Outliers in Large Datasets

This paper provides formal and empirical evidence showing the usefulness of DB-outliers and presents two simple algorithms for computing such outliers, both having a complexity of O(k N^2), k being the dimensionality and N being the number of objects in the dataset.
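A quadratic-time check of this kind can be sketched as a nested loop in Python. The definition assumed here is the usual DB(p, D) one: an object is an outlier if at least a fraction p of the objects lie farther than distance D from it. The function name `db_outliers` is hypothetical, and counting the fraction over the other N - 1 objects (rather than all N) is a simplification made for this sketch.

```python
import numpy as np

def db_outliers(X, p=0.9, D=2.0):
    """Return indices of objects with at least a fraction p of the
    other objects farther away than D (naive nested-loop check)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    outliers = []
    for i in range(n):
        far = 0
        for j in range(n):
            # Each distance costs O(k) for k dimensions, giving O(k N^2) overall.
            if i != j and np.linalg.norm(X[i] - X[j]) > D:
                far += 1
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers
```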

Feature bagging for outlier detection

A novel feature bagging approach for detecting outliers in very large, high dimensional and noisy databases is proposed, which combines results from multiple outlier detection algorithms that are applied using different set of features.
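Combining detectors run on different feature subsets can be sketched as follows. The base detector (distance to the k-th nearest neighbor), the subset-size rule, the averaging of scores, and the names `knn_score` and `feature_bagging_scores` are all assumptions chosen for illustration; the summary above does not specify them, and the paper's own combination schemes may differ.

```python
import numpy as np

def knn_score(X, k=2):
    """Base detector (assumed): distance to the k-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return np.sort(D, axis=1)[:, k - 1]

def feature_bagging_scores(X, n_rounds=10, k=2, seed=0):
    """Average base-detector scores over random feature subsets."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    total = np.zeros(n)
    for _ in range(n_rounds):
        # Subset size drawn from [d // 2, d - 1] (at least 1 feature).
        size = int(rng.integers(max(1, d // 2), d)) if d > 1 else 1
        feats = rng.choice(d, size=size, replace=False)
        total += knn_score(X[:, feats], k)
    return total / n_rounds
```

Averaging is only one way to merge the per-round results; rank-based merging is another option when the base scores are not on comparable scales.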

Outlier Ranking via Subspace Analysis in Multiple Views of the Data

This work proposes OutRank, a novel outlier ranking concept that exploits subspace analysis to determine the degree of outlierness, and outperforms state-of-the-art outlierness measures.

HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

A novel subspace search method is presented that selects high contrast subspaces for density-based outlier ranking, together with a first measure for the contrast of subspace dimensions, to enhance the quality of traditional outlier rankings.

Outlier Detection in Arbitrarily Oriented Subspaces

In this paper, we propose a novel outlier detection model to find outliers that deviate from the generating mechanisms of normal instances by considering combinations of different subsets of

Efficient algorithms for mining outliers from large data sets

A novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor is proposed and the top n points in this ranking are declared to be outliers.
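The ranking described above can be sketched directly in Python: score each point by the distance to its k-th nearest neighbor and report the n highest-scoring points. The function name `top_n_knn_outliers` and the brute-force distance computation are choices made for this sketch, not the efficient algorithms the paper refers to.

```python
import numpy as np

def top_n_knn_outliers(X, k=2, n=1):
    """Return (indices of the n points with the largest k-th
    nearest-neighbor distance, all k-th neighbor distances)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # exclude each point from its own neighbors
    kth = np.sort(D, axis=1)[:, k - 1]     # distance to the k-th nearest neighbor
    order = np.argsort(-kth)               # largest distance first
    return order[:n], kth
```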

Outlier Analysis

Outlier Analysis is a comprehensive exposition, as understood by data mining experts, statisticians and computer scientists, and emphasis was placed on simplifying the content, so that students and practitioners can also benefit.

Robust PCA via Outlier Pursuit

This work presents an efficient convex optimization-based algorithm that it calls outlier pursuit, which under some mild assumptions on the uncorrupted points recovers the exact optimal low-dimensional subspace and identifies the corrupted points.