Corpus ID: 237532482

Humanly Certifying Superhuman Classifiers

@article{Xu2021HumanlyCS,
  title={Humanly Certifying Superhuman Classifiers},
  author={Qiongkai Xu and Christian J. Walder and Chenchen Xu},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.07867}
}
Estimating the performance of a machine learning system is a longstanding challenge in artificial intelligence research. Today, this challenge is especially relevant given the emergence of systems which appear to increasingly outperform human beings. In some cases, this “superhuman” performance is readily demonstrated, for example, by defeating legendary human players in traditional two-player games. On the other hand, it can be challenging to evaluate classification models that potentially…

References

The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation
TLDR
This article shows how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario.
A large annotated corpus for learning natural language inference
TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
Improving Language Understanding by Generative Pre-Training
TLDR
The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied.
Artificial intelligence: Learning to play Go from scratch
TLDR
An artificial-intelligence program called AlphaGo Zero has mastered the game of Go without any human data or guidance, and the work suggests that the same fundamental principles of the game have some universal character, beyond human bias.
Axiomatic analysis of aggregation methods for collective annotation
TLDR
An axiomatic framework for collective annotation is developed, focusing amongst other things on the notion of an annotator's bias, to efficiently label large amounts of data using nonexpert annotators.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation
TLDR
Majority voting, applied to generate one annotation set out of several opinions, is able to filter noisy judgments of non-experts to some extent, and the resulting annotation set is of comparable quality to the annotations of experts.
An Empirical Study Into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation
TLDR
A methodology is applied to four image-processing problems to quantify the interannotator variance and to offer insight into the mechanisms behind agreement and the use of ground truth, finding that when detecting linear structures, annotator agreement is very low.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
TLDR
A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
TLDR
A Sentiment Treebank is introduced that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, along with the Recursive Neural Tensor Network.