Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance

  • Antonio Irpino, Rosanna Verde
  • Published 7 February 2012
  • Computer Science, Mathematics
  • Advances in Data Analysis and Classification
In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. In order to measure the error between the observed and the predicted distributions, the $\ell_2$ Wasserstein distance is proposed. Some properties of such a metric are… 
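As a rough illustration of the error measure named in the abstract: for one-dimensional distributions, the squared $\ell_2$ Wasserstein distance equals the integral over (0, 1) of the squared difference between the two quantile functions. The sketch below approximates that integral on a grid of quantile levels. The function name `wasserstein2_squared`, the grid size, and the sample data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def wasserstein2_squared(x, y, n_levels=1000):
    """Approximate the squared L2 Wasserstein distance between two 1-D samples.

    Hypothetical helper: averages the squared gap between the empirical
    quantile functions over a midpoint grid of levels in (0, 1).
    """
    t = (np.arange(n_levels) + 0.5) / n_levels  # quantile levels
    qx = np.quantile(x, t)                      # empirical quantiles of x
    qy = np.quantile(y, t)                      # empirical quantiles of y
    return float(np.mean((qx - qy) ** 2))

# Illustrative data: the second sample is the first one shifted by 2,
# so every quantile differs by 2 and the squared distance is about 4.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=5000)
b = a + 2.0
d2_shift = wasserstein2_squared(a, b)
d2_same = wasserstein2_squared(a, a)
```

A pure shift changes only the location of the distribution, which is why the result here reduces to the squared difference of the means.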
Linear regression model with histogram‐valued variables
A new linear regression model for histogram‐valued variables is proposed. Its parameters are obtained by solving a quadratic optimization problem subject to non‐negativity constraints on the unknowns, and the error between the predicted and observed distributions is measured with the Mallows distance.
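The kind of problem described above, a least squares fit with non-negative unknowns, can be sketched on quantile functions sampled over a common grid. This is only a minimal illustration, not the estimation procedure of the cited work: it swaps in plain projected gradient descent, and the function name `nnls_projected_gradient` and the toy quantile curves are assumptions. One reason non-negativity is natural in this setting is that a non-negative combination of non-decreasing functions is itself non-decreasing, so the fitted curve remains a valid quantile function.

```python
import numpy as np

def nnls_projected_gradient(A, b, n_iter=5000):
    """Minimize 0.5 * ||A x - b||^2 subject to x >= 0 (hypothetical helper)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)            # gradient of the quadratic loss
        x = np.maximum(x - step * grad, 0)  # project onto the constraint x >= 0
    return x

# Toy quantile functions on a grid of levels (illustrative, not real data).
t = (np.arange(200) + 0.5) / 200
q1 = t          # quantile function of a uniform distribution on [0, 1]
q2 = t ** 2     # another non-decreasing quantile function
A = np.column_stack([q1, q2])

# Response built from non-negative weights: the fit recovers them.
coef = nnls_projected_gradient(A, 1.5 * q1 + 0.5 * q2)

# Response that would need negative weights: the constraint clamps them to 0.
coef_clamped = nnls_projected_gradient(A, -q1)
```

With several observations, the per-observation grids would be stacked row-wise into `A` and `b`, so that the squared Euclidean error approximates a sum of squared Mallows ($\ell_2$ Wasserstein) distances.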
Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables
This work proposes a new linear regression model for histogram-valued variables, named the Distribution and Symmetric Distribution Regression Model, together with a goodness-of-fit measure whose values range between 0 and 1.
Factor Analysis of Interval Data
This paper presents a factor analysis model for symbolic data, focusing on the particular case of interval-valued variables. The proposed method describes the correlation structure among the measured
Linear regression models for data with variability
Symbolic Data Analysis is concerned with data tables where the values in each cell are not single values but elements that express the variability of the records, e.g., intervals or histograms.
Linear regression with empirical distributions
In the classical data framework one numerical value or one category is associated with each individual (microdata). However, the interest of many studies lies in groups of records gathered according
Artificial Neural Network with Histogram Data Time Series Forecasting: A Least Squares Approach Based on Wasserstein Distance
The empirical results demonstrate that the AR-ANN model based on the Irpino-Verde approach performs better than the other models.
New models for symbolic data analysis
This work introduces a new general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries.
Trajectories from Distribution-Valued Functional Curves: A Unified Wasserstein Framework
A novel, comprehensive framework that models the temporal evolution trajectories of distribution-valued curves under the unifying scheme of the Wasserstein distance metric. It preserves the functional characteristics of the curve, models the temporal change in distribution profiles, and forces the estimated distributions to be valid.
On the use of Wasserstein metric in topological clustering of distributional data
This paper deals with a clustering algorithm for histogram data based on Self-Organizing Map (SOM) learning. It combines a dimension reduction by SOM and the clustering of the data in a reduced


Univariate and Multivariate Linear Regression Methods to Predict Interval-Valued Features
Two new approaches to fitting a linear regression model on interval-valued data are introduced. The proposed prediction methods are evaluated by the average behavior of the root mean squared error and the coefficient of determination in a Monte Carlo experiment.
A new linear regression model for histogram-valued variables
  • Mathematics
  • 2011
In classical data analysis, each individual takes one single “value” on each descriptive variable. Symbolic Data Analysis ([Bock and Diday (2000)], [Billard and Diday (2007)]) generalizes this
Ordinary Least Squares for Histogram Data Based on Wasserstein Distance
A linear regression model for histogram variables is introduced, and a new Ordinary Least Squares approach to estimating the linear model, using the Wasserstein metric between histograms, is presented, assuming that the regression coefficients are scalar values.
A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data
A new distance, based on the Wasserstein metric, is presented for clustering a set of data described by distributions with finite continuous support, called “histograms” in SDA. It yields a measure of inertia of the data with respect to a barycenter that satisfies the Huygens theorem of decomposition of inertia.
Descriptive Statistics for Symbolic Data
The intention of this chapter is to extend the concept of frequency distribution, and the standard definitions of descriptive statistics for real-valued data, such as the empirical mean, the empirical
Dynamic clustering of interval data using a Wasserstein-based distance
Symbolic Data Analysis: Conceptual Statistics and Data Mining (Wiley Series in Computational Statistics)
This chapter discusses Descriptive Statistics: Two or More Variates, which focuses on the part of the model concerned with Hierarchy-Divisive Clustering and Cluster Analysis.
Regression Shrinkage and Selection via the Lasso
A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
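The constrained problem described above has an equivalent penalized form, minimizing half the residual sum of squares plus a multiple of the sum of absolute coefficients. The sketch below solves that penalized form by cyclic coordinate descent with soft-thresholding, which is a commonly used solver swapped in here for illustration and not necessarily the algorithm of the cited paper. The names `soft_threshold` and `lasso_coordinate_descent`, the penalty value, and the toy data are assumptions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, setting small values exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 (hypothetical helper)."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)  # squared norm of each column
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

# With an orthonormal design the solution is the soft-thresholded
# least squares estimate, which makes the shrinkage easy to see.
X = np.eye(3)
y = np.array([3.0, -0.5, 1.5])
beta = lasso_coordinate_descent(X, y, lam=1.0)
```

In this toy case the coefficient whose least squares value is smaller than the penalty in absolute value is set exactly to zero, which is the selection behavior the lasso is known for.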