Information-theoretic software clustering
TLDR
LIMBO, a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system, is introduced and a method that can assess the usefulness of any nonstructural attribute in a software clustering context is presented.
LIMBO: Scalable Clustering of Categorical Data
TLDR
This work introduces LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering, and shows how the LIMBO algorithm can be used to cluster both tuples and values.
Clean Answers over Dirty Databases: A Probabilistic Approach
TLDR
This work rewrites queries over a database containing duplicates to return each answer with the probability that the answer is in the clean database, and experimentally studies the performance of the rewritten queries.
Software clustering based on information loss minimization
TLDR
LIMBO is a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system and can be used to evaluate the usefulness of various types of non-structural information to the software clustering process.
Overview and semantic issues of text mining
TLDR
This survey discusses semantic issues arising from natural language particularities, syntactic matters, and tokenization concerns, and focuses on the different text representation techniques, categorisation tasks, and similarity measures suggested.
A Process Mining Based Model for Customer Journey Mapping
TLDR
The proposed CJM model brings data scientists and customer journey planners closer together, the first step in gaining a better understanding of customer behavior, and highlights the prospective value of process mining for CJM analysis.
Limbo: A scalable algorithm to cluster categorical data
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent ...
Scalable clustering of categorical data and applications
TLDR
This thesis introduces LIMBO, a scalable hierarchical categorical clustering algorithm based on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering, and proposes a set of tools based on LIMBO for finding structural summaries that are useful in characterizing the information content of the data.
Making Open Data Transparent: Data Discovery on Open Data
TLDR
Open Data poses interesting new challenges for data integration research, and one of those challenges is data discovery: how can the authors find new data sets within this ever-expanding sea of Open Data?