- Published 2016

A common set of statistical metrics has been used to summarize the performance of models or measurements, the most widely used ones being bias, mean square error, and linear correlation coefficient. They assume linear, additive, Gaussian errors, and they are interdependent, incomplete, and incapable of directly quantifying uncertainty. The authors demonstrate that these metrics can be directly derived from the parameters of the simple linear error model. Since a correct error model captures the full error information, it is argued that the specification of a parametric error model should be an alternative to the metrics-based approach. The error-modeling methodology is applicable to both linear and nonlinear errors, while the metrics are only meaningful for linear errors. In addition, the error model expresses the error structure more naturally, and directly quantifies uncertainty. This argument is further explained by highlighting the intrinsic connections between the performance metrics, the error model, and the joint distribution between the data and the reference.

1. Limitations of performance metrics

One of the primary objectives of measurement or model evaluation is to quantify the uncertainty in the data. This is because the uncertainty directly determines the information content of the data (e.g., Jaynes 2003), and dictates our rational use of the information, be it for data assimilation, hypothesis testing, or decision-making. Further, by appropriately quantifying the uncertainty, one gains insight into the error characteristics of the measurement or model, especially via efforts to separate the systematic error and random error (e.g., Barnston and Thomas 1983; Ebert and McBride 2000). Currently the common practice of measurement or model verification is to compute a common set of performance metrics. These performance metrics are statistical measures to summarize the similarity and difference between two datasets.
These metrics are based on direct comparison of datum pairs at their corresponding spatial/temporal locations. The most commonly used ones are bias, mean square error (MSE), and correlation coefficient (CC) (e.g., Fisher 1958; Wilks 2011), but many variants or derivatives, such as (un)conditional bias (e.g., Stewart 1990), MSE (Murphy and Winkler 1987), unbiased root-mean-square error (ubRMSE), anomaly correlation coefficient, coefficient of determination (CoD), and skill score (SS; e.g., Murphy and Epstein 1989), also fall into this category. Table 1 lists some of these metrics and their definitions. Among them, the "big three" (bias, MSE, and CC) are the most widely used in diverse disciplines, exemplified by the popular "Taylor diagram" (Taylor 2001).

Corresponding author address: Yudong Tian, NASA Goddard Space Flight Center, Mail Code 617, Greenbelt, MD 20771-5808. E-mail: yudong.tian@nasa.gov. February 2016. DOI: 10.1175/MWR-D-15-0087.1. 2016 American Meteorological Society.

These metrics do, however, have several limitations:

1) Interdependence. Most of these conventional performance metrics are not independent; they have been demonstrated to relate to each other in complex ways. For example, the MSE can be decomposed in many ways to link it with other metrics, such as bias and correlation coefficient (e.g., Murphy 1988; Barnston 1992; Taylor 2001; Gupta et al. 2009; Entekhabi et al. 2010). These relations indicate both redundancy among these metrics and the metrics' indirect connection to independent error characteristics. This leads to ambiguity in the interpretation and intercomparison of these metrics.

2) Underdetermination. It is easy to verify that these metrics do not describe unique error characteristics, even when many of them are used collectively. In fact, many different combinations of error characteristics can produce the same values of these metrics. This is illustrated in Fig. 1.
A monthly time series of land surface temperature anomaly data, extracted from satellite-based observations (Wan et al. 2004) over a location in the United States (35°N, 95°W), is used as the reference (black curves) for validating two separate hypothetical sets of predictions (Figs. 1a and 1b, blue curves). Their respective scatterplots are also shown (Figs. 1c and 1d), with values of five major conventional metrics listed (bias, MSE, CC, CoD, and SS). When seen from either the time series plots or the scatterplots, the two measurements exhibit apparently very different error characteristics. However, all the metrics, except bias, give nearly identical values (Figs. 1c and 1d). In fact, there is an infinite number of ways to construct measurements that can produce identical values for many of the metrics. Therefore, when given a set of these metrics values, one will have fundamental difficulty in inferring and communicating the error characteristics of the predictions.

3) Incompleteness. There are no well-accepted guidelines on how many of these metrics are sufficient. Many inexperienced users follow a "the more the better" philosophy, and it is not rare to see works

TABLE 1. Examples of conventional performance metrics.* The observations and forecasts are denoted as x and y, respectively.
Name    Definition    Ideal value
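
Limitations 1 and 2 above (interdependence and underdetermination) can be illustrated numerically. The following is a minimal sketch on synthetic data, not the paper's Fig. 1 data or the authors' code. The function names (`metrics`, `mse_from_parts`, `orthogonal_error`) and the synthetic reference series are assumptions made for illustration. The decomposition used, MSE = bias² + σx² + σy² − 2·σx·σy·CC with population standard deviations, is one standard identity and not necessarily the form used in the works cited above. Unlike Fig. 1, where bias differs between the two panels, the two predictions here are constructed so that all three metrics agree.

```python
import numpy as np

def metrics(x, y):
    """Return (bias, MSE, CC) for reference x and prediction y."""
    bias = np.mean(y - x)
    mse = np.mean((y - x) ** 2)
    cc = np.corrcoef(x, y)[0, 1]
    return bias, mse, cc

def mse_from_parts(x, y):
    """Rebuild MSE from bias, the two standard deviations, and CC."""
    bias, _, cc = metrics(x, y)
    sx, sy = np.std(x), np.std(y)  # population std (ddof=0)
    return bias**2 + sx**2 + sy**2 - 2.0 * sx * sy * cc

def orthogonal_error(x, raw, scale):
    """Center `raw`, remove its projection on centered x, rescale to std `scale`."""
    xc = x - x.mean()
    e = raw - raw.mean()
    e = e - (e @ xc) / (xc @ xc) * xc  # now uncorrelated with x
    return scale * e / np.std(e)

# Hypothetical reference: a noisy annual cycle sampled monthly for 10 years.
rng = np.random.default_rng(0)
t = np.arange(120)
x = np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(t.size)

# Two predictions with very different error structure but equal error variance:
# white-noise error versus a slow linear drift.
y_noise = x + orthogonal_error(x, rng.standard_normal(t.size), 0.5)
y_drift = x + orthogonal_error(x, t.astype(float), 0.5)

for name, y in [("noise", y_noise), ("drift", y_drift)]:
    bias, mse, cc = metrics(x, y)
    print(f"{name}: bias={bias:.3f} MSE={mse:.3f} CC={cc:.3f} "
          f"MSE rebuilt={mse_from_parts(x, y):.3f}")
```

Because both error series are centered, uncorrelated with the reference, and scaled to the same variance, the two predictions share bias, MSE, and CC even though one error is random and the other is a systematic drift. The rebuilt MSE matching the direct MSE shows that, given the two standard deviations, any one of the big three is determined by the other two.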

@inproceedings{Tian2016PerformanceME,
title={Performance Metrics, Error Modeling, and Uncertainty Quantification},
author={Yudong Tian and G. Nearing and Christa D. Peters-Lidard and Kenneth W. Harrison and Ling Tang},
year={2016}
}