Cube Sampled K-Prototype Clustering for Featured Data

Seemandhar Jain, Aditya A. Shastri, Kapil Ahuja, Yann Busnel, Navneet Pratap Singh
2021 IEEE 18th India Council International Conference (INDICON)
Clustering large amounts of data is becoming increasingly important. Because the data sets are so large, clustering algorithms often take too much time. Sampling the data before clustering is commonly used to reduce this time. In this work, we propose a probabilistic sampling technique called cube sampling along with K-Prototype clustering. Cube sampling is used because of its accurate sample selection. K-Prototype is the most frequently used clustering algorithm when the data is…
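
The sample-then-cluster idea in the abstract can be illustrated with a minimal sketch. It assumes the mixed dissimilarity usually associated with k-prototypes (squared Euclidean distance on numeric features plus a weight gamma times the number of categorical mismatches). The sampling step below is plain simple random sampling, used only as a placeholder: it is not the cube method, and none of the function names or parameters come from the paper.

```python
import random
from collections import Counter

# Each record is a pair: (tuple of numeric features, tuple of categorical features).

def dissimilarity(point, proto, gamma):
    """Squared Euclidean distance on numeric parts + gamma * categorical mismatches."""
    (p_num, p_cat), (q_num, q_cat) = point, proto
    numeric = sum((a - b) ** 2 for a, b in zip(p_num, q_num))
    mismatches = sum(a != b for a, b in zip(p_cat, q_cat))
    return numeric + gamma * mismatches

def nearest(point, prototypes, gamma):
    """Index of the prototype with the smallest mixed dissimilarity."""
    return min(range(len(prototypes)),
               key=lambda j: dissimilarity(point, prototypes[j], gamma))

def update_prototype(members):
    """Mean of each numeric feature, mode of each categorical feature."""
    n = len(members)
    means = tuple(sum(m[0][d] for m in members) / n
                  for d in range(len(members[0][0])))
    modes = tuple(Counter(m[1][d] for m in members).most_common(1)[0][0]
                  for d in range(len(members[0][1])))
    return (means, modes)

def k_prototypes(data, k, gamma=1.0, iters=10, seed=0):
    """Lloyd-style alternation between assignment and prototype update."""
    rng = random.Random(seed)
    prototypes = rng.sample(data, k)  # initial prototypes drawn from the data
    for _ in range(iters):
        labels = [nearest(p, prototypes, gamma) for p in data]
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # keep the old prototype if a cluster is empty
                prototypes[j] = update_prototype(members)
    return prototypes

def sample_then_cluster(data, sample_size, k, gamma=1.0, seed=0):
    """Cluster a sample, then assign every record to its nearest prototype."""
    rng = random.Random(seed)
    # ASSUMPTION: simple random sampling stands in for cube sampling here.
    sample = rng.sample(data, sample_size)
    prototypes = k_prototypes(sample, k, gamma, seed=seed)
    labels = [nearest(p, prototypes, gamma) for p in data]
    return prototypes, labels
```

Only the sample is passed through the iterative clustering loop; the full data set is touched once, in the final assignment pass, which is where the time saving from sampling would come from.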
1 Citation

An Efficient Anomaly Detection Approach using Cube Sampling with Streaming Data
The novelty of this paper is in applying Cube sampling in iForest and calculating inclusion probability, which proves that the proposed approach is equally successful at detecting anomalies as existing state-of-the-art approaches while requiring significantly less storage and time.


Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
  • J. Huang
  • Computer Science
    Data Mining and Knowledge Discovery
  • 2004
Two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values are presented and are shown to be efficient when clustering large data sets, which is critical to data mining applications.
Scaled and Projected Spectral Clustering with Vector Quantization for Handling Big Data
This work proposes a modified version of spectral clustering, called Projected Spectral Clustering (PSC), and implements it on Apache Spark using two approaches for computing the Gaussian kernel matrix.
A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis
This survey introduces concepts and algorithms related to clustering, gives a concise overview of existing clustering algorithms, and compares them from both a theoretical and an empirical perspective.
A fast algorithm for balanced sampling
This paper proposes a very fast implementation of the cube method, where the execution time no longer depends on the square of the population size, but only on the population size.
Least squares quantization in PCM
  • S. P. Lloyd
  • Computer Science
    IEEE Trans. Inf. Theory
  • 1982
The corresponding result for any finite number of quanta is derived; that is, necessary conditions are found that the quanta and associated quantization intervals of an optimum finite quantization scheme must satisfy.
An overview of principal component analysis
Principal component analysis is a kind of algorithm used in biometrics that draws on standard deviation, covariance, and eigenvectors, and is a tool for reducing multidimensional data to lower dimensions while retaining most of the information.
Sampling Algorithms
  • Yves Tillé
  • Economics
    International Encyclopedia of Statistical Science
  • 2011
UCI Machine Learning Repository
  • Accessed: February 2021