Similarity encoding for learning with dirty categorical variables
@article{Cerda2018SimilarityEF,
  title   = {Similarity encoding for learning with dirty categorical variables},
  author  = {Patricio Cerda and Ga{\"e}l Varoquaux and Bal{\'a}zs K{\'e}gl},
  journal = {Machine Learning},
  year    = {2018},
  volume  = {107},
  pages   = {1477--1494}
}
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant…
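The core idea admits a compact illustration. The sketch below is a minimal Python example, not the authors' implementation: the 3-gram Jaccard similarity and the hand-picked reference categories are assumptions for illustration, whereas the paper studies several string similarities. Each category is encoded as its vector of similarities to the reference categories, so near-duplicate strings receive near-identical feature vectors.

def ngrams(s, n=3):
    s = f" {s} "                       # pad so short strings still yield n-grams
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def similarity_encode(values, reference_categories):
    # One column per reference category; one-hot encoding is the special
    # case where the similarity is exact string equality.
    return [[ngram_similarity(v, ref) for ref in reference_categories]
            for v in values]

refs = ["police officer", "fire fighter", "nurse"]
print(similarity_encode(["police offcer", "firefighter"], refs))

With exact equality as the similarity this reduces to one-hot encoding; a continuous similarity instead exposes the redundancy among dirty categories directly to the learning algorithm.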
123 Citations
Encoding High-Cardinality String Categorical Variables
- Computer Science, IEEE Transactions on Knowledge and Data Engineering
- 2022
This work introduces two encoding approaches for string categories: a Gamma-Poisson matrix factorization on substring counts, and a min-hash encoder for fast approximation of string similarities; it shows that min-hash turns set inclusions into inequality relations that are easier to learn.
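A rough sketch of the min-hash idea (illustrative only; the output dimension, the 3-gram choice, and salted crc32 as a stand-in for independent hash functions are assumptions, not the paper's implementation): each string is reduced to its character n-gram set, and each output component is the minimum hash over that set, so set inclusion shows up as a componentwise inequality.

import zlib

def minhash_encode(s, dim=8, n=3):
    grams = {s[i:i + n] for i in range(len(s) - n + 1)}  # assumes len(s) >= n
    # Salting crc32 with the component index d mimics dim independent hashes.
    return [min(zlib.crc32(f"{d}:{g}".encode()) for g in grams)
            for d in range(dim)]

a = minhash_encode("fire fighter")
b = minhash_encode("senior fire fighter")      # its n-grams include all of a's
print(all(b_d <= a_d for a_d, b_d in zip(a, b)))  # inclusion -> inequality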
Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features
- Computer Science, Computational Statistics
- 2022
In this study, regularized versions of target encoding (i.e., using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results, whereas traditional, widely used encodings that make unreasonable assumptions in order to map levels to integers or to reduce the number of levels were not as effective.
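One common regularization of target encoding, sketched below, computes each level's target mean out-of-fold so that a row never sees its own label (a minimal illustration; the fold count and the fallback to the global mean are assumed choices, not necessarily the study's exact protocol).

import numpy as np

def cv_target_encode(categories, y, n_folds=5, seed=0):
    categories = np.asarray(categories)
    y = np.asarray(y, dtype=float)
    fold = np.random.default_rng(seed).integers(0, n_folds, size=len(y))
    encoded = np.full(len(y), y.mean())        # unseen levels -> global mean
    for f in range(n_folds):
        held_out = fold == f
        for level in np.unique(categories[held_out]):
            train_rows = ~held_out & (categories == level)
            if train_rows.any():               # level seen outside the fold
                encoded[held_out & (categories == level)] = y[train_rows].mean()
    return encoded

y = [1, 0, 1, 1, 0, 1, 0, 0]
print(cv_target_encode(["a", "a", "b", "b", "a", "b", "c", "c"], y, n_folds=2))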
Statistical learning with high-cardinality string categorical variables. (Apprentissage statistique à partir de variables catégorielles non-uniformisées)
- Computer Science
- 2019
This work studies a series of categorical encodings that remove the need for preprocessing steps on high-cardinality string categorical variables, are adapted to large-scale settings, and create easily interpretable feature vectors.
Encoding Categorical Variables with Ambiguity
- Computer Science
- 2019
This paper extends existing one-hot encoding methods to handle ambiguous categorical variables explicitly and proposes two encoding methods based on missing-value imputation algorithms: Ambiguous Forests and a naive extension of the MissForest algorithm.
Complex Encoding
- Computer Science, 2021 International Joint Conference on Neural Networks (IJCNN)
- 2021
Empirical results show that complex encoding not only avoids the ill-conditioning problem of one-hot and thermometer encodings, but also generally leads to comparable or higher classification accuracy than the alternatives, at the expense of only about a two-fold increase in memory usage with respect to ordinal encoding.
Hierarchy-based semantic embeddings for single-valued & multi-valued categorical variables
- Computer Science, Journal of Intelligent Information Systems
- 2022
This paper presents a method that uses prior knowledge of the application domain to support machine learning in cases with insufficient data, and proposes two embedding schemes for single-valued and multi-valued categorical data.
Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine
- Computer Science, ArXiv
- 2019
A Bayesian encoding technique developed for WeWork's lead scoring engine is described, which outputs the probability of a person touring one of WeWork's office spaces based on interaction, enrichment, and geospatial data.
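For a binary target, the conjugate-Bayesian idea can be sketched as follows (a simplified illustration, not WeWork's engine: the Beta(1, 1) prior and the use of the posterior mean and variance as features are assumptions).

from collections import defaultdict

def beta_encode(categories, y, alpha0=1.0, beta0=1.0):
    # Count successes and trials per level.
    succ, tot = defaultdict(int), defaultdict(int)
    for c, t in zip(categories, y):
        succ[c] += t
        tot[c] += 1
    feats = {}
    for c in tot:
        a = alpha0 + succ[c]                         # posterior Beta parameters
        b = beta0 + tot[c] - succ[c]
        mean = a / (a + b)                           # posterior mean of the rate
        var = a * b / ((a + b) ** 2 * (a + b + 1))   # posterior variance
        feats[c] = (mean, var)
    return feats

print(beta_encode(["a", "a", "b", "b", "b"], [1, 0, 1, 1, 0]))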
Analytics on Non-Normalized Data Sources: More Learning, Rather Than More Cleaning
- Computer Science, IEEE Access
- 2022
This study suggests that using machine learning directly for analysis is beneficial because it captures ambiguities that are hard to represent during curation, and improves the validity of results more than manual cleaning does, with considerably less human labor.
Search Filter Ranking with Language-Aware Label Embeddings
- Computer Science, WWW
- 2022
This work learns from customers’ clicks and purchases which subset of filters is most relevant to their queries, treating the relevant/not-relevant signal as binary labels, and shows that classification performance for rare classes can be improved by accounting for the language structure of the class labels.
References
Showing 1-10 of 47 references
Entity Embeddings of Categorical Variables
- Computer Science, ArXiv
- 2016
It is demonstrated in this paper that entity embedding helps the neural network to generalize better when data are sparse and statistics are unknown, and is especially useful for datasets with many high-cardinality features, where other methods tend to overfit.
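A minimal entity-embedding sketch in PyTorch (illustrative; the embedding dimension and the linear head are assumptions, not the paper's architecture): each categorical level gets a learned dense vector instead of a one-hot column, trained jointly with the task.

import torch
import torch.nn as nn

class EntityEmbedding(nn.Module):
    def __init__(self, n_levels, emb_dim=4):
        super().__init__()
        self.emb = nn.Embedding(n_levels, emb_dim)   # lookup table of vectors
        self.head = nn.Linear(emb_dim, 1)            # toy downstream task head

    def forward(self, level_ids):
        return self.head(self.emb(level_ids)).squeeze(-1)

model = EntityEmbedding(n_levels=1000)
print(model(torch.tensor([3, 17, 999])).shape)       # -> torch.Size([3])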
Interactive deduplication using active learning
- Computer Science, KDD
- 2002
This work presents the design of a learning-based deduplication system that interactively discovers challenging training pairs using active learning, and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.
A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems
- Computer Science, SIGKDD Explorations
- 2001
A simple preprocessing scheme is presented for high-cardinality categorical data that allows this class of attributes to be used in predictive models such as neural networks and linear and logistic regression.
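The scheme blends each level's observed target mean with the global mean, weighting by the level's count. A minimal sketch with the simple weight lambda = n / (n + m) follows (the paper allows a more general monotonic blending function; the smoothing constant m here is an assumed choice).

from collections import defaultdict

def shrunk_means(categories, y, m=10.0):
    s, n = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, y):
        s[c] += t
        n[c] += 1
    g = sum(s.values()) / sum(n.values())    # global target mean
    # lambda = n / (n + m): more observations -> less shrinkage to the prior.
    return {c: (n[c] / (n[c] + m)) * (s[c] / n[c]) + (m / (n[c] + m)) * g
            for c in n}

print(shrunk_means(["a"] * 2 + ["b"] * 50, [1, 1] + [1] * 10 + [0] * 40))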
Enriching Word Vectors with Subword Information
- Computer Science, TACL
- 2017
A new approach based on the skip-gram model is proposed, in which each word is represented as a bag of character n-grams and word vectors are the sum of these n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
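A toy sketch of the subword idea (illustrative; the random hashing table and its dimensions are assumptions, not fastText's trained parameters): a word vector is the sum of vectors for its character n-grams plus the word itself, so a misspelling that shares most n-grams lands close to the original word.

import numpy as np

def word_vector(word, table, n=3):
    buckets = table.shape[0]
    padded = f"<{word}>"                     # boundary markers, fastText-style
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)] + [padded]
    # hash() is stable within one process, so the comparison below is valid.
    return sum(table[hash(g) % buckets] for g in grams)

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 16))          # shared n-gram vector table
v1, v2 = word_vector("officer", table), word_vector("offcer", table)
print(round(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))), 2))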
Integration of heterogeneous databases without common domains using queries based on textual similarity
- Computer Science, SIGMOD '98
- 1998
This paper rejects the assumption that global domains can be easily constructed, assuming instead that names are given in natural-language text, and proposes a logic called WHIRL that reasons explicitly about the similarity of local names, measured using the vector-space model commonly adopted in statistical information retrieval.
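In the same vector-space spirit (a toy sketch, not WHIRL itself; the character 3-gram TF-IDF features and the example names are assumed choices), names can be soft-joined on cosine similarity instead of exact equality.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

left = ["International Business Machines", "Microsoft Corp"]
right = ["IBM Intl. Business Machines Corp.", "Microsooft Corporation"]

# Character 3-gram TF-IDF vectors; similar names share many n-grams.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3)).fit(left + right)
print(cosine_similarity(vec.transform(left), vec.transform(right)).round(2))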
Efficient Estimation of Word Representations in Vector Space
- Computer Science, ICLR
- 2013
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed, and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
ActiveClean: Interactive Data Cleaning For Statistical Modeling
- Computer Science, Proc. VLDB Endow.
- 2016
This work proposes ActiveClean, which allows for progressive and iterative cleaning in statistical modeling problems while preserving convergence guarantees, and returns more accurate models than uniform sampling and Active Learning.
Partitioning Nominal Attributes in Decision Trees
- Computer Science, Data Mining and Knowledge Discovery
- 2004
A new heuristic search algorithm is presented, based on ordering the attribute's values according to their principal-component scores in the class probability space; it is linear in n, unlike exhaustive subset search, which is impractical when n and k are large.
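A compact sketch of the ordering heuristic (illustrative; the SVD-based principal component and the toy probabilities are assumed details): levels are sorted by their first-principal-component score in class-probability space, and only the n-1 contiguous splits of that ordering are evaluated, instead of all 2^(n-1)-1 subsets.

import numpy as np

def ordered_splits(class_probs):
    # Rows: attribute levels; columns: class probabilities.
    X = class_probs - class_probs.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    order = np.argsort(X @ vt[0])   # score on the first principal component
    # Candidate splits: each prefix of the ordering versus the rest.
    return [(order[:i + 1].tolist(), order[i + 1:].tolist())
            for i in range(len(order) - 1)]

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.1, 0.9]])
for left, right in ordered_splits(probs):
    print(left, right)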
A Comparison of String Metrics for Matching Names and Records
- Computer Science
- 2003
An open-source Java toolkit of methods for matching names and records is described and results obtained from using various string distance metrics on the task of matching entity names are summarized.
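One of the classic metrics such toolkits compare is Levenshtein edit distance; below is a minimal dynamic-programming version (an illustration in Python, not the toolkit's code).

def levenshtein(a, b):
    # Classic dynamic program over prefixes, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("jon smith", "john smyth"))  # -> 2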
Duplicate Record Detection: A Survey
- Computer Science, IEEE Transactions on Knowledge and Data Engineering
- 2007
This paper presents an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database and covers similarity metrics that are commonly used to detect similar field entries.