Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering

Raymond Leung, Mehala Balamurali, Arman Melkumyan. Mathematical Geosciences.
Abstract The presence of outliers in geochemical data can impact the accuracy of grade models and influence the interpretation of mine assay data. Removal of outliers is therefore an important consideration in grade estimation work. This paper presents two sample truncation strategies which have been devised to reject outliers in multivariate geochemical data. In essence, a data-dependent threshold is applied to the robust distances of sorted samples to discard outliers within a given class… 

Fractal Modeling of Geochemical Mineralization Prospectivity Index based on Centered Log-Ratio Transformed Data for Geochemical Targeting: a Case Study of Cu Porphyry Mineralization

This work aims to investigate the geochemical signatures of the Cu porphyry deposit in the Dalli area using geochemical soil samples. In the first step, the geochemical data was opened using the…

Empirical observations on the effects of data transformation in machine learning classification of geological domains

The results reveal that different ML classifiers vary in their sensitivity to these transformations: some transformations are clearly more advantageous or deleterious than others, and the best-performing candidate is ILR, which is unsurprising given the compositional nature of the data.

Surface Warping Incorporating Machine Learning Assisted Domain Likelihood Estimation: A New Paradigm in Mine Geology Modeling and Automation

Large-scale validation experiments are performed to assess the overall efficacy of ML assisted surface warping as a fully integrated component within an ore grade estimation system where the posterior mean is obtained via Gaussian Process (GP) inference with a Matérn 3/2 kernel.
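For readers unfamiliar with the kernel named above, the following is a minimal sketch of the standard Matérn 3/2 covariance function. The function name `matern32` and the parameters `length_scale` and `variance` are assumptions chosen for illustration; the snippet does not say how the kernel is parameterised in that work.

```python
import math

def matern32(r, length_scale=1.0, variance=1.0):
    """Matern 3/2 covariance for two points separated by distance r >= 0.

    k(r) = variance * (1 + sqrt(3) * r / length_scale) * exp(-sqrt(3) * r / length_scale)
    Parameter names are illustrative assumptions, not taken from the cited work.
    """
    s = math.sqrt(3.0) * r / length_scale
    return variance * (1.0 + s) * math.exp(-s)

# Covariance equals the variance at zero separation and decays with distance.
k_at_zero = matern32(0.0)
k_at_one = matern32(1.0)
k_at_two = matern32(2.0)
```

In a GP, a larger `length_scale` makes samples at a given separation more strongly correlated, so the posterior mean varies more smoothly.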

Statistical Outliers

t-Distributed Stochastic Neighbor Embedding

M. Balamurali. Encyclopedia of Mathematical Geosciences, 2021.

Unmasking Multivariate Outliers and Leverage Points

This work proposes to compute distances based on very robust estimates of location and covariance, better suited to expose the outliers in a multivariate point cloud, to avoid the masking effect.
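The masking effect can be illustrated with a small synthetic experiment. The sketch below is not the procedure from that work: it assumes scikit-learn's `MinCovDet` (an MCD estimator) as one example of a very robust estimator, uses a chi-square 97.5% quantile as a conventional cutoff, and runs on made-up Gaussian data with a planted cluster of outliers.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# Synthetic data (an assumption for illustration): 140 inliers plus a tight
# cluster of 60 outliers, enough contamination to distort classical estimates.
rng = np.random.default_rng(0)
n_in, n_out, p = 140, 60, 2
inliers = rng.normal(0.0, 1.0, size=(n_in, p))
outliers = rng.normal(8.0, 0.3, size=(n_out, p))
X = np.vstack([inliers, outliers])

# Conventional cutoff on squared distances (assumed, not from the source).
cutoff = chi2.ppf(0.975, df=p)

# Classical squared Mahalanobis distances from the sample mean and covariance.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
classical_d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Robust squared distances from the MCD location and covariance estimates.
mcd = MinCovDet(random_state=0).fit(X)
robust_d2 = mcd.mahalanobis(X)  # scikit-learn returns squared distances

classical_flags = classical_d2 > cutoff
robust_flags = robust_d2 > cutoff
print("planted outliers flagged (classical):", int(classical_flags[n_in:].sum()))
print("planted outliers flagged (robust):   ", int(robust_flags[n_in:].sum()))
```

Because the outlier cluster drags the sample mean towards itself and inflates the sample covariance, the classical distances of the planted points shrink and most of them escape detection, whereas the robust distances remain large.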

Outlier Detection for Compositional Data Using Robust Methods

It is shown that the MDs based on classical estimates are invariant to the family of logratio transformations, and that the MDs based on affine equivariant estimators of location and covariance are the same for additive and isometric logratio transformations.
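A quick numerical check of the classical-estimate case can be sketched as follows. It covers only one instance of the statement: two additive logratio (alr) transformations with different denominator parts, which differ by an invertible linear map. The helper names `alr` and `classical_md` and the Dirichlet test data are assumptions made for this illustration.

```python
import numpy as np

def alr(X, denom):
    """Additive logratio: log of every other part divided by the chosen part."""
    X = np.asarray(X, dtype=float)
    others = [j for j in range(X.shape[1]) if j != denom]
    return np.log(X[:, others] / X[:, [denom]])

def classical_md(Z):
    """Mahalanobis distances based on the sample mean and sample covariance."""
    diff = Z - Z.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(Z, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Hypothetical 4-part compositions (rows sum to 1, all parts positive).
rng = np.random.default_rng(0)
X = rng.dirichlet([2.0, 3.0, 4.0, 5.0], size=50)

md_last = classical_md(alr(X, denom=3))   # last part as denominator
md_first = classical_md(alr(X, denom=0))  # first part as denominator
print("max difference:", float(np.max(np.abs(md_last - md_first))))
```

The two sets of distances agree up to floating-point error because the Mahalanobis distance is unchanged by nonsingular linear transformations of the coordinates.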

t-SNE Based Visualisation and Clustering of Geological Domain

This work compares PCA and some other linear and non-linear methods with a newer method, t-Distributed Stochastic Neighbor Embedding (t-SNE), for the visualization of large geochemical assay datasets and finds significant differences between the nonlinear method t-SNE and the state-of-the-art methods in two-dimensional target spaces.

Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data

A new methodology of class discovery and clustering validation tailored to the task of analyzing gene expression data is presented, and in conjunction with resampling techniques, it provides for a method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.
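The resampling idea can be sketched in a few lines. This is a simplified illustration rather than the published procedure: it assumes k-means (scikit-learn's `KMeans`) as the base clustering algorithm, subsamples 80% of the points without replacement on each run, and uses two synthetic blobs as data. The function name `consensus_matrix` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, n_clusters=2, n_resamples=20, frac=0.8, seed=0):
    """Fraction of runs in which each pair of points was clustered together,
    counted only over the runs in which both points were sampled."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for b in range(n_resamples):
        # Draw a subsample without replacement and cluster it.
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=b).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # Pairs that were never sampled together are left at zero.
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

# Two well-separated synthetic blobs of 30 points each (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(5.0, 0.3, size=(30, 2))])
C = consensus_matrix(X)
```

Entries of `C` near 1 or 0 indicate pairs whose assignment is stable across runs; intermediate values point to unstable cluster membership, which is the signal used to assess cluster stability.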

A Fast Algorithm for the Minimum Covariance Determinant Estimator

For small datasets, FAST-MCD typically finds the exact MCD, whereas for larger datasets it gives more accurate results than existing algorithms and is faster by orders of magnitude.

Geostatistics for Compositional Data: An Overview

The main result is that multivariate geostatistical techniques can and should be performed on log-ratio scores, in which case the system data-variograms-cokriging/cosimulation is intrinsically consistent, delivering the same results regardless of which log-ratio transformation was used to represent them.

Robust Estimation of Dispersion Matrices and Principal Components

Abstract This paper uses Monte Carlo methods to compare the performances of several robust procedures for estimating a correlation matrix and its principal components. The estimators are formed