Taming Wild High Dimensional Text Data with a Fuzzy Lash

  • Amir Karami
  • Published 1 November 2017
  • Computer Science, Mathematics
  • 2017 IEEE International Conference on Data Mining Workshops (ICDMW)
The bag of words (BOW) model represents a corpus as a matrix whose elements are word frequencies. However, each row of the matrix is a very high-dimensional, sparse vector. Dimension reduction (DR) is a popular way to address sparsity and high dimensionality. Among the strategies for developing DR methods, Unsupervised Feature Transformation (UFT) is a popular one that maps all words onto a new basis to represent the BOW. The recent increase of text data and its challenges imply that DR… 
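To make the sparsity issue described above concrete, here is a minimal sketch of a BOW matrix built with the Python standard library. The function names `bag_of_words` and `sparsity`, the whitespace tokenization, and the toy documents are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocab, matrix) where matrix[i][j] counts word vocab[j] in docs[i].

    Tokenization is a plain lowercase whitespace split (an assumption made
    for illustration only).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix


def sparsity(matrix):
    """Fraction of matrix cells that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Toy corpus (hypothetical): each document uses only a few of the words.
docs = [
    "fuzzy clustering reduces dimensions",
    "text data is sparse",
    "fuzzy text data",
]
vocab, matrix = bag_of_words(docs)
print(len(vocab), "columns; sparsity =", round(sparsity(matrix), 3))
```

Even in this tiny corpus more than half of the cells are zero; with a realistic vocabulary the number of columns grows with every new word, which is the situation dimension reduction is meant to address.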
Application of Fuzzy Clustering for Text Data Dimensionality Reduction
  • A. Karami
  • Computer Science, Mathematics
    International Journal of Knowledge Engineering and Data Mining
  • 2019
This research explores fuzzy clustering as a new UFT-based approach to create a lower-dimensional representation of documents to solve the problem of sparsity and high dimensionality in large textual corpora.
Practical Analysis of Representative Models in Classifier: A Review
Widely used supervised machine learning techniques employing representative models are reviewed, and a practical analysis and comparison of various techniques for document representation is presented.
Twitter and Research: A Systematic Literature Review Through Text Mining
This study systematically mines a large number of Twitter-based studies to characterize the relevant literature with an efficient and effective approach. It finds that 23.7% of topics did not show a significant trend, and that the hot and cold topics represent three categories: application, methodology, and technology.
What do the US West Coast public libraries post on Twitter?
This paper proposes a computational approach to collecting and analyzing tweets using Twitter Application Programming Interfaces (API) and investigates more than 138,000 tweets from 48 US West Coast libraries using topic modeling, finding 20 topics and assigning them to five categories including public relations, book, event, training, and social good.
Political Popularity Analysis in Social Media
This study collected and examined 4.5 million tweets related to a US politician, Senator Bernie Sanders, and investigated eight economic reasons behind the senator’s popularity on Twitter.
Computational Analysis of Insurance Complaints: GEICO Case Study
A computational approach is proposed to characterize the major topics of a large number of online complaints, using topic modeling to disclose the latent semantics of the complaints.
Unwanted Advances in Higher Education: Uncovering Sexual Harassment Experiences in Academia with Text Mining
Text mining was utilized to disclose hidden topics and explore their weight across three variables: harasser gender, institution type, and victim's field of study, and it was found that more than 50% of the topics were assigned to the unwanted sexual attention theme.
An Exploratory Study of (#)Exercise in the Twittersphere
The results from this experiment indicate that the exploratory data analysis is a practical approach to summarizing the various characteristics of text data for different health and medical applications.
Characterizing transgender health issues in Twitter
This research employs a computational framework to collect tweets from self‐identified transgender users, detect those that are health‐related, and identify their information needs, and finds both linguistic and topical differences in the health-related information shared by transgender men (TM) as compared to transgender women (TW).
Figure: taxonomy of dimension reduction. Supervised: feature selection, feature transformation. Unsupervised: feature selection, feature transformation.
Large textual corpora are often represented by the document-term frequency matrix whose elements are the frequency of terms; however, this matrix has two problems: sparsity and high dimensionality.
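The fuzzy clustering idea mentioned in the entries above can be sketched as follows: cluster the rows of the document-term matrix with fuzzy c-means and use each document's membership degrees as its lower-dimensional representation. This is a minimal sketch of standard fuzzy c-means under stated assumptions, not the author's implementation; the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the iteration count, and the toy matrix are all hypothetical choices.

```python
import random


def fuzzy_c_means(rows, n_clusters, m=2.0, n_iter=50, seed=0):
    """Standard fuzzy c-means on a list of equal-length numeric rows.

    Returns (memberships, centers). memberships[i][k] is the degree to which
    row i belongs to cluster k; each membership row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])

    # Random initial memberships, normalized so each row sums to 1.
    u = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(raw)
        u.append([value / total for value in raw])

    centers = []
    for _ in range(n_iter):
        # Centers are means of the rows weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            weight_sum = sum(weights)
            centers.append([
                sum(weights[i] * rows[i][d] for i in range(n)) / weight_sum
                for d in range(dim)
            ])

        # Memberships are proportional to distance ** (-2 / (m - 1)).
        for i in range(n):
            dists = [
                max(sum((rows[i][d] - c[d]) ** 2 for d in range(dim)) ** 0.5, 1e-12)
                for c in centers
            ]
            inv = [dist ** (-2.0 / (m - 1.0)) for dist in dists]
            total = sum(inv)
            u[i] = [value / total for value in inv]
    return u, centers


# Toy document-term matrix (hypothetical): 4 documents, 4 terms.
doc_term = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
]
memberships, centers = fuzzy_c_means(doc_term, n_clusters=2)
for row in memberships:
    print([round(value, 3) for value in row])
```

Each document is now described by `n_clusters` membership values instead of one count per vocabulary term, so the number of columns no longer depends on vocabulary size. The choice of `n_clusters` and `m` is left open here and would need to be tuned for a real corpus.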


Concept Decompositions for Large Sparse Text Data Using Clustering
The concept vectors produced by the spherical k-means algorithm constitute a powerful “basis” for text data sets: they are localized in the word space, are sparse, and tend towards orthonormality.
Semantics-preserving dimensionality reduction: rough and fuzzy-rough-based approaches
This paper reviews those techniques that preserve the underlying semantics of the data, using crisp and fuzzy rough set-based methodologies, and several approaches to feature selection based on rough set theory are experimentally compared.
A Comparative Study on Feature Selection in Text Categorization
This paper finds strong correlations between the DF, IG, and CHI values of a term and suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when computing those measures is too expensive.
Iterative clustering of high dimensional text data augmented by local search
A local search procedure is presented that refines a given clustering by incrementally moving data points between clusters, thus achieving a higher objective function value, together with a powerful "ping-pong" strategy that often qualitatively improves k-means clustering and is computationally efficient.
Learning in high-dimensional multimedia data: the state of the art
This survey covers feature transformation, feature selection and feature encoding, three approaches fighting the consequences of the curse of dimensionality of multimedia data analysis.
Unsupervised fuzzy-rough set-based dimensionality reduction
This paper presents several unsupervised FS approaches based on fuzzy-rough sets that require no thresholding information, are domain-independent, and can operate on real-valued data without the need for discretisation.
A survey of fuzzy clustering algorithms for pattern recognition. I
An equivalence between the concepts of fuzzy clustering and soft competitive learning in clustering algorithms is proposed as a unifying framework in the comparison of clustering systems.
FFTM: A Fuzzy Feature Transformation Method for Medical Documents
This paper presents FFTM, a novel feature transformation method that helps reduce the dimensionality of data and improve the performance of machine learning algorithms, and shows that the quality of text analysis in medical text documents can be improved.
A survey of fuzzy clustering algorithms for pattern recognition. II
In this paper, five clustering algorithms taken from the literature are reviewed, assessed and compared on the basis of the selected properties of interest, and a set of functional attributes is selected for use as dictionary entries in the comparison of clustering algorithms.
A Fuzzy Approach Model for Uncovering Hidden Latent Semantic Structure in Medical Text Collections
This is the first study in the medical domain to use fuzzy set theory to express semantic properties of words and documents in terms of topics, and the experimental results showed major improvements.