NoCS2: Topic-Based Clustering of Big Data Text Corpus in the Cloud

S. M. Zobaed, Md. Enamul Haque, Shahidullah Kaiser, Razin Farhan Hussain
2018 21st International Conference of Computer and Information Technology (ICCIT)
Cloud services are widely deployed to store and process big data. Organizations that deal with big data, especially large document sets, prefer cloud services for their storage and computational efficiency. However, inefficient processing of a large text corpus is computationally expensive for real-time systems. In addition, efficient memory utilization is important when clustering big data, including large text corpora. Clustering of the large text corpus is an important…


ClustCrypt: Privacy-Preserving Clustering of Unstructured Big Data in the Cloud

  • S. Zobaed, Sahan Ahmad, Raju N. Gottumukkala, M. Salehi
  • Computer Science
    2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2019
This paper presents an approach named ClustCrypt for efficient topic-based clustering of encrypted unstructured big data in the cloud that dynamically estimates the optimal number of clusters based on the statistical characteristics of encrypted data.

Adaptive and Concurrent Garbage Collection for Virtual Machines

An adaptive and concurrent garbage collection technique that can predict the optimal GC algorithm for a program without going through all the GC algorithms and is helpful in finding better heap size settings for improved program execution.

Performance Analysis of Cryptographic Algorithms for Selecting Better Utilization on Resource Constraint Devices

A comprehensive performance evaluation of popular symmetric and asymmetric key encryption algorithms, with DSA and ElGamal among the asymmetric algorithms, aimed at selecting the ones best suited to resource-constrained and mobile devices.

Community of practice: converting IT graduate students into specialists via professional knowledge sharing

Purpose: The paper aims to highlight how an applied learning framework or “community of practice” (CoP) combined with a traditional theoretical course of study enables the identification of



Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases

The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance and improvements in retrieval effectiveness.

S3BD: Secure semantic search over encrypted big data in the cloud

To keep real‐time response on big data, S3BD proactively prunes the search space to a subset of the whole dataset, and proposes a method to cluster the encrypted data.

Cluster-based language models for distributed retrieval

A new approach to distributed retrieval based on document clustering and language modeling is proposed and it is shown that all three methods improve the effectiveness of distributed retrieval.

S3C: An architecture for space-efficient semantic search over encrypted data in the cloud

S3C is presented, a system that provides a semantic search functionality over encrypted data in the cloud that combines approaches from traditional keyword-based searchable encryption and semantic web searching and is suitable for large scale datasets.

Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections

Looking at clustering as an information access tool in its own right obviates these objections and provides a powerful new access paradigm; fast (linear-time) clustering algorithms are also presented.

Efficiency trade-offs in two-tier web search systems

This paper analyzes the standard two-tier architecture for Web search with the difference that the corpus to be searched for a given query is predicted in advance, and shows that any predictor better than random yields time savings, but this decrease in the processing time yields an increase in the infrastructure cost.

Practical solutions to the problem of diagonal dominance in kernel document clustering

A selection of strategies for addressing the implications of diagonal dominance for unsupervised kernel methods in the task of document clustering are proposed, and their effectiveness in producing more accurate and stable clusterings is evaluated.

X-means: Extending K-means with Efficient Estimation of the Number of Clusters

A new algorithm is introduced that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure.

Learning Feature Representations with K-Means

This chapter will summarize recent results and technical tricks that are needed to make effective use of K-means clustering for learning large-scale representations of images and connect these results to other well-known algorithms to make clear when K-Means can be most useful.
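
Several entries above (ClustCrypt, X-means) turn on estimating the number of clusters rather than fixing it in advance. As a rough illustration of the BIC idea mentioned in the X-means summary, the sketch below runs plain k-means for each candidate k and keeps the k with the highest BIC. It is not the X-means algorithm itself (there is no recursive centroid splitting) and it is not taken from any of the listed papers. The function names (`init_centers`, `kmeans`, `bic`, `choose_k`, `make_blobs`), the farthest-point initialization, and the shared spherical Gaussian variance are all assumptions made for this example.

```python
import math
import random


def init_centers(points, k):
    """Deterministic farthest-point initialization (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest chosen center is farthest away.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means over a list of equal-length tuples."""
    centers = init_centers(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(m) for axis in zip(*m)) if m else centers[j]
            for j, m in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return centers, clusters


def bic(points, centers, clusters):
    """BIC (higher is better) assuming spherical Gaussians with one shared variance."""
    n, dim, k = len(points), len(points[0]), len(centers)
    sse = sum(
        math.dist(p, c) ** 2
        for c, members in zip(centers, clusters)
        for p in members
    )
    var = max(sse / (dim * (n - k)), 1e-12)  # per-dimension variance estimate
    loglik = sum(len(m) * math.log(len(m) / n) for m in clusters if m)
    loglik -= 0.5 * n * dim * math.log(2 * math.pi * var)
    loglik -= sse / (2 * var)
    n_params = (k - 1) + k * dim + 1  # mixing weights, centroids, variance
    return loglik - 0.5 * n_params * math.log(n)


def choose_k(points, k_max):
    """Return the k in 1..k_max whose k-means fit has the highest BIC."""
    scores = {k: bic(points, *kmeans(points, k)) for k in range(1, k_max + 1)}
    return max(scores, key=scores.get)


def make_blobs(seed=42, per_blob=40, spread=0.5):
    """Synthetic 2-D data: three well-separated Gaussian blobs (toy data only)."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
    return [
        (cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
        for cx, cy in blob_centers
        for _ in range(per_blob)
    ]


if __name__ == "__main__":
    print(choose_k(make_blobs(), k_max=6))
```

The penalty term grows with the number of parameters, so adding a cluster only pays off when it reduces the within-cluster error enough to offset it. Real text corpora would need a document representation (for example term or topic vectors) before any such sweep applies.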