Similarity encoding for learning with dirty categorical variables

Patricio Cerda, Gaël Varoquaux, Balázs Kégl. Machine Learning.

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant…

Encoding High-Cardinality String Categorical Variables

This work introduces two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and a min-hash encoder for fast approximation of string similarities; it shows that min-hashing turns set inclusions into inequality relations that are easier to learn.
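The min-hash property mentioned here can be shown directly: viewing a string as its set of character 3-grams, each output dimension is the minimum of a seeded hash over that set, so if one string's n-gram set contains another's, every coordinate of the superset's encoding is less than or equal to the subset's. The md5-based hash family below is an illustrative choice.

```python
import hashlib

# Sketch of a min-hash encoder for string categories: a string is its set
# of character 3-grams, and dimension d of the encoding is the minimum of
# a hash function seeded with d over that set.  Set inclusion between
# n-gram sets then becomes a componentwise inequality.

def char_ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}

def seeded_hash(seed, gram):
    digest = hashlib.md5(f"{seed}:{gram}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_encode(s, dims=8):
    grams = char_ngrams(s)
    return [min(seeded_hash(d, g) for g in grams) for d in range(dims)]
```

Since "senior engineer" contains every 3-gram of "engineer", its encoding is componentwise below that of "engineer", the kind of relation a tree-based learner splits on easily.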

Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

In this study, regularized versions of target encoding (i.e., using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results, while traditional, widely used encodings that make unreasonable assumptions to map levels to integers or to reduce the number of levels were less effective in comparison.
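One regularized variant can be sketched as out-of-fold target encoding: each row's level is replaced by the target mean computed on the other folds, which limits target leakage. The fold count and seed below are illustrative choices, not recommendations from the study.

```python
import random

# Sketch of out-of-fold (cross-validated) target encoding: a row's level
# is replaced by the target mean of that level computed on the *other*
# folds.  Levels unseen outside the fold fall back to the global prior.

def oof_target_encode(levels, targets, n_folds=5, seed=0):
    idx = list(range(len(levels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    prior = sum(targets) / len(targets)
    encoded = [prior] * len(levels)
    for fold in folds:
        held_out = set(fold)
        sums, counts = {}, {}
        for i, lvl in enumerate(levels):
            if i not in held_out:
                sums[lvl] = sums.get(lvl, 0.0) + targets[i]
                counts[lvl] = counts.get(lvl, 0) + 1
        for i in fold:
            if counts.get(levels[i]):
                encoded[i] = sums[levels[i]] / counts[levels[i]]
    return encoded
```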

Statistical learning with high-cardinality string categorical variables. (Apprentissage statistique à partir de variables catégorielles non-uniformisées)

This work studies a series of categorical encodings that remove the need for preprocessing steps on high-cardinality string categorical variables and are adapted to large-scale settings, while creating feature vectors that are easily interpretable.

Encoding Categorical Variables with Ambiguity

This paper extends existing one-hot encoding methods to handle ambiguous categorical variables explicitly and proposes two encoding methods based on missing-value imputation algorithms: Ambiguous Forests and a naive extension of the MissForest algorithm.

Complex Encoding

Empirical results show that not only does complex encoding avoid the ill-conditioning problem of one-hot and thermometer encodings, it generally leads to comparable or higher classification accuracy than the others, at the expense of only about a two-fold increase in memory usage with respect to ordinal encoding.
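One plausible reading consistent with the memory figure above: each of the n ordered levels is placed on the unit circle and stored as a single complex number, i.e., two real features (twice the footprint of ordinal encoding, far less than one-hot). The equal-angle placement in this sketch is an assumption for illustration, not necessarily the paper's exact construction.

```python
import math

# Hypothetical sketch of complex encoding: level k of n is mapped to the
# point e^{2*pi*i*k/n} on the unit circle, stored as (cos, sin).  The
# equal-angle placement is an assumption for illustration.

def complex_encode(level, levels):
    k, n = levels.index(level), len(levels)
    theta = 2 * math.pi * k / n
    return (math.cos(theta), math.sin(theta))
```

Unlike plain ordinal codes, every encoded level has the same norm, which helps conditioning in downstream linear models.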

Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables

This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data, and proposes two embedding schemes for single-valued and multi-valued categorical data.

Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine

A Bayesian encoding technique developed for WeWork's lead scoring engine is described, which outputs the probability of a person touring one of WeWork's office spaces based on interaction, enrichment, and geospatial data.
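For a binary outcome such as "toured / did not tour", the conjugate-Bayesian idea reduces to replacing each categorical level with the posterior mean of a Beta-Bernoulli model, a smoothed conversion rate. The Beta(1, 1) prior in this sketch is an illustrative choice.

```python
# Sketch of conjugate-Bayesian encoding for a binary target: a level with
# s successes in n trials under a Beta(alpha, beta) prior has posterior
# Beta(alpha + s, beta + n - s), whose mean is the encoded value.

def beta_posterior_mean(successes, trials, alpha=1.0, beta=1.0):
    return (alpha + successes) / (alpha + beta + trials)
```

Rare levels shrink toward the prior mean, while frequent levels converge to their observed rate, all in closed form thanks to conjugacy.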

Analytics on Non-Normalized Data Sources: More Learning, Rather Than More Cleaning

This study suggests that using machine learning directly for analysis is beneficial because it captures ambiguities that are hard to represent during curation, and improves the validity of results more than manual cleaning does, with considerably less human labor.

Search Filter Ranking with Language-Aware Label Embeddings

This work learns from customers’ clicks and purchases which subset of filters is most relevant to their queries, treating the relevant/not-relevant signal as binary labels, and shows that classification performance for rare classes can be improved by accounting for the language structure of the class labels.

Entity Embeddings of Categorical Variables

It is demonstrated in this paper that entity embedding helps the neural network to generalize better when the data is sparse and statistics is unknown, and is especially useful for datasets with lots of high cardinality features, where other methods tend to overfit.
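The mechanics can be sketched without a full network: a category id indexes a row of a trainable matrix, yielding a dense vector in place of a sparse one-hot, and only the looked-up row receives a gradient. The dimensionality, learning rate, and linear readout below are illustrative, not the paper's architecture.

```python
import numpy as np

# Minimal sketch of an entity-embedding layer: E[cat_id] is a dense,
# trainable representation of the category.  Here a squared-error loss on
# a linear readout is minimized by SGD; only the indexed row is updated,
# as with an embedding layer in a neural network.

rng = np.random.default_rng(0)
n_categories, dim = 1000, 4
E = rng.normal(scale=0.1, size=(n_categories, dim))  # embedding table
w = rng.normal(scale=0.1, size=dim)                  # linear readout

def predict(cat_id):
    return float(E[cat_id] @ w)

def sgd_step(cat_id, target, lr=0.1):
    err = predict(cat_id) - target
    row = E[cat_id].copy()
    E[cat_id] -= lr * err * w      # gradient hits only this row
    w[:] = w - lr * err * row

for _ in range(300):
    sgd_step(7, 1.0)               # embedding of category 7 is learned
```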

Interactive deduplication using active learning

This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.

A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems

A simple preprocessing scheme is presented for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression.
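The heart of the scheme, in its empirical-Bayes form, blends each category's observed target mean with the global prior, weighted by a sigmoid of the category's sample size, so rare levels shrink toward the prior. The parameters k (minimum sample size) and f (smoothing slope) below are illustrative settings.

```python
import math

# Sketch of the blending scheme: S = lam(n) * level_mean + (1 - lam(n)) * prior,
# with lam(n) = 1 / (1 + exp(-(n - k) / f)), so small-sample levels are
# pulled toward the global prior.  k and f are illustrative.

def encode_level(level_targets, prior, k=5, f=2.0):
    n = len(level_targets)
    lam = 1.0 / (1.0 + math.exp(-(n - k) / f))
    mean = sum(level_targets) / n if n else prior
    return lam * mean + (1.0 - lam) * prior
```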

Enriching Word Vectors with Subword Information

A new approach is described based on the skip-gram model, where each word is represented as a bag of character n-grams and word vectors are the sum of these n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
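The subword idea can be sketched directly: a word's vector is the sum of vectors for its character n-grams (with boundary markers), so related forms share most of their representation. Hashing n-grams into a fixed table mirrors the paper's memory trick; all sizes are illustrative, and the vectors here are random rather than trained.

```python
import zlib
import numpy as np

# Sketch of subword representations: a word is decomposed into character
# n-grams (3 to 6, with "<" and ">" boundary markers), each n-gram is
# hashed into a fixed bucket table, and the word vector is the sum of
# the bucket vectors.  Table size and dimension are illustrative.

buckets, dim = 4096, 128
table = np.random.default_rng(0).normal(size=(buckets, dim))

def subword_ngrams(word, nmin=3, nmax=6):
    w = f"<{word}>"
    return [w[i:i + n] for n in range(nmin, nmax + 1) for i in range(len(w) - n + 1)]

def word_vector(word):
    ids = [zlib.crc32(g.encode()) % buckets for g in subword_ngrams(word)]
    return table[ids].sum(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Even with random bucket vectors, "learning" and "learner" land closer together than "learning" and an unrelated string, because they share many n-grams.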

Integration of heterogeneous databases without common domains using queries based on textual similarity

This paper rejects the assumption that global domains can be easily constructed, assuming instead that names are given in natural language text, and proposes a logic called WHIRL that reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval.
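The vector-space measure WHIRL builds on can be sketched as TF-IDF cosine similarity between names, so "ACME Inc" and "ACME Incorporated" match softly rather than requiring exact equality. The toy corpus below is illustrative.

```python
import math
from collections import Counter

# Sketch of vector-space name similarity: each name is a TF-IDF-weighted
# bag of tokens, and names are joined on cosine similarity rather than
# exact string equality.

def tfidf_vectors(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.lower().split()))
    vecs = []
    for d in docs:
        tf = Counter(d.lower().split())
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```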

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and these vectors are shown to provide state-of-the-art performance on a test set measuring syntactic and semantic word similarities.

ActiveClean: Interactive Data Cleaning For Statistical Modeling

This work proposes ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees, and returns more accurate models than uniform sampling and Active Learning.

Partitioning Nominal Attributes in Decision Trees

A new heuristic search algorithm is presented based on ordering the attribute's values according to their principal component scores in the class probability space; it is linear in n, unlike exhaustive subset search, which is impractical when n and k are large.
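The ordering step can be sketched as follows: project each value's class-probability vector onto the first principal component and sort, so only splits along that ordering (n-1 candidates instead of 2^(n-1)-1 subsets) need to be evaluated. The toy class-probability matrix is illustrative.

```python
import numpy as np

# Sketch of the ordering heuristic: rows of class_probs are per-value
# class probability vectors; values are sorted by their score on the
# first principal axis, and a tree need only consider contiguous splits
# of this ordering.

def pc_order(class_probs):
    X = class_probs - class_probs.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)  # vt[0] = first PC axis
    return np.argsort(X @ vt[0])

probs = np.array([[0.9, 0.1],   # values 0 and 2 favor class 0,
                  [0.2, 0.8],   # values 1 and 3 favor class 1
                  [0.8, 0.2],
                  [0.1, 0.9]])
order = pc_order(probs)  # values with similar class profiles end up adjacent
```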

A Comparison of String Metrics for Matching Names and Records

An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.
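One of the classic metrics covered by such toolkits is Levenshtein edit distance, sketched here in plain dynamic-programming form (the toolkit itself offers many more, including Jaro-Winkler and hybrid TF-IDF variants).

```python
# Levenshtein edit distance: the minimum number of single-character
# insertions, deletions, and substitutions turning one string into the
# other, computed row by row to keep memory linear in len(b).

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For name matching, the raw distance is usually normalized by the longer string's length before thresholding.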

Duplicate Record Detection: A Survey

This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.