# Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance

@article{Irpino2015LinearRF, title={Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance}, author={Antonio Irpino and Rosanna Verde}, journal={Advances in Data Analysis and Classification}, year={2015}, volume={9}, pages={81-106} }

In this paper we present a new linear regression technique for distributional symbolic variables, i.e., variables whose realizations can be histograms, empirical distributions or empirical estimates of parametric distributions. Such data are known as numerical modal data according to the Symbolic Data Analysis definitions. In order to measure the error between the observed and the predicted distributions, the $$\ell_2$$ Wasserstein distance is proposed. Some properties of such a metric are…
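As a rough illustration of the error measure named in the abstract: for one-dimensional distributions, the squared $$\ell_2$$ Wasserstein distance equals the integrated squared difference between the two quantile functions. The sketch below assumes the simplest possible case, two empirical distributions with the same number of equally weighted observations, where this reduces to the mean squared difference between sorted values. The function name `wasserstein2_squared` is hypothetical and this is not the authors' implementation, which works with histograms and other distributional data.

```python
def wasserstein2_squared(xs, ys):
    """Squared l2 Wasserstein distance between two empirical distributions.

    Assumption: both samples have the same number of observations and
    every observation carries the same weight, so the empirical quantile
    functions are step functions on the same grid.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # Sorting pairs the i-th smallest value of xs with the i-th smallest
    # value of ys, i.e. it matches the two quantile functions level by level.
    a, b = sorted(xs), sorted(ys)
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


observed = [0.0, 1.0, 2.0]
predicted = [1.0, 2.0, 3.0]   # the observed sample shifted by 1
print(wasserstein2_squared(observed, predicted))
```

In this toy case a pure shift by a constant `c` gives a squared distance of `c ** 2`, while two samples with equal means but different spreads still have a positive distance, so the measure reacts to differences in both location and shape.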

## 16 Citations

Linear regression model with histogram‐valued variables

- Mathematics, Computer Science
- Stat. Anal. Data Min.
- 2015

A new linear regression model for histogram‐valued variables is proposed that solves the quadratic optimization problem, subject to non‐negativity constraints on the unknowns; the error measure between the predicted and observed distributions uses the Mallows distance.

Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables

- Mathematics, Computer Science
- 2013

This work proposes a new linear regression model for histogram-valued variables that solves this problem. Named the Distribution and Symmetric Distribution Regression Model, it is associated with a goodness-of-fit measure whose values range between 0 and 1.

Factor Analysis of Interval Data

- Mathematics
- 2017

This paper presents a factor analysis model for symbolic data, focusing on the particular case of interval-valued variables. The proposed method describes the correlation structure among the measured…

Linear regression models for data with variability

- Mathematics
- 2013

Symbolic Data Analysis is concerned with data tables where the values in each cell are not single values but elements that express the variability of the records, e.g., intervals or histograms.…

Linear regression with empirical distributions

- Mathematics
- 2014

In the classical data framework one numerical value or one category is associated with each individual (microdata). However, the interest of many studies lies in groups of records gathered according…

Artificial Neural Network with Histogram Data Time Series Forecasting: A Least Squares Approach Based on Wasserstein Distance

- Computer Science
- 2020

The empirical results demonstrate that the AR-ANN model based on the Irpino-Verde approach performs better than other models.

New models for symbolic data analysis

- Mathematics, Computer Science
- 2018

This work introduces a new general method for constructing likelihood functions for symbolic data based on a desired probability model for the underlying measurement-level data, while only observing the distributional summaries.

Batch SOM algorithms for interval-valued data with automatic weighting of the variables

- Computer Science
- Neurocomputing
- 2016

Trajectories from Distribution-Valued Functional Curves: A Unified Wasserstein Framework

- Computer Science
- MICCAI
- 2020

A novel, comprehensive framework that models the temporal evolution trajectories of distribution-valued curves under the unifying scheme of the Wasserstein distance metric, preserves the functional characteristics of the curve, models the temporal change in distribution profiles, and forces the estimated distributions to be valid.

On the use of Wasserstein metric in topological clustering of distributional data

- Computer Science
- ArXiv
- 2021

This paper deals with a clustering algorithm for histogram data based on a Self-Organizing Map (SOM) learning. It combines a dimension reduction by SOM and the clustering of the data in a reduced…

## References

Constrained linear regression models for symbolic interval-valued variables

- Mathematics
- Comput. Stat. Data Anal.
- 2010

Centre and Range method for fitting a linear regression model to symbolic interval data

- Mathematics, Computer Science
- Comput. Stat. Data Anal.
- 2008

Univariate and Multivariate Linear Regression Methods to Predict Interval-Valued Features

- Mathematics
- Australian Conference on Artificial Intelligence
- 2004

Two new approaches to fit a linear regression model on interval-valued data are introduced and the evaluation of the proposed prediction methods is based on the average behavior of the root mean squared error and the determination coefficient in the framework of a Monte Carlo experiment.

A new linear regression model for histogram-valued variables

- Mathematics
- 2011

In classical data analysis, each individual takes one single “value” on each descriptive variable. Symbolic Data Analysis ([Bock and Diday (2000)], [Billard and Diday (2007)]) generalizes this…

Ordinary Least Squares for Histogram Data Based on Wasserstein Distance

- Computer Science, Mathematics
- COMPSTAT
- 2010

A linear regression model for histogram variables is introduced, and a new Ordinary Least Squares approach to estimating the linear model, using the Wasserstein metric between histograms, is presented, assuming that the regression coefficients are scalar values.

A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data

- Computer Science
- Data Science and Classification
- 2006

A new distance, based on the Wasserstein metric, is presented for clustering a set of data described by distributions with finite continuous support, called “histograms” in SDA. It yields a measure of inertia of the data with respect to a barycenter that satisfies the Huygens theorem of decomposition of inertia.

Descriptive Statistics for Symbolic Data

- Mathematics
- 2000

The intention of this chapter is to extend the concept of frequency distribution, and the standard definitions of descriptive statistics for real-valued data, such as the empirical mean the empirical…

Dynamic clustering of interval data using a Wasserstein-based distance

- Computer Science
- Pattern Recognit. Lett.
- 2008

Symbolic Data Analysis: Conceptual Statistics and Data Mining (Wiley Series in Computational Statistics)

- Computer Science
- 2007

This chapter discusses Descriptive Statistics: Two or More Variates, which focuses on the part of the model concerned with Hierarchy-Divisive Clustering and Cluster Analysis.

Regression Shrinkage and Selection via the Lasso

- Computer Science
- 1996

A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.