• Corpus ID: 245906184

Fantastic Data and How to Query Them

  title={Fantastic Data and How to Query Them},
  author={Trung-Kien Tran and Anh Le-Tuan and Manh Nguyen-Duc and Jicheng Yuan and Danh Le-Phuoc},
It is commonly acknowledged that the availability of the huge amount of (training) data is one of the most important factors for many recent advances in Artificial Intelligence (AI). However, datasets are often designed for specific tasks in narrow AI sub areas and there is no unified way to manage and access them. This not only creates unnecessary overheads when training or deploying Machine Learning models but also limits the understanding of the data, which is very important for data-centric… 

Figures from this paper


Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
It is found that the performance on vision tasks increases logarithmically based on volume of training data size, and it is shown that representation learning (or pre-training) still holds a lot of promise.
MOS: Towards Scaling Out-of-distribution Detection for Large Semantic Space
  • Rui Huang, Yixuan Li
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
This paper proposes a group-based OOD detection framework, along with a novel OOD scoring function termed MOS, to decompose the large semantic space into smaller groups with similar concepts, which allows simplifying the decision boundaries between invs.
Software 2.0 and Snorkel: Beyond Hand-Labeled Data
  • C. Ré
  • Computer Science
  • 2018
Snorkel is described, a system that enables users to help shape, create, and manage training data for Software 2.0 stacks and shows that estimating and accounting for the quality of the labeling functions in this way can lead to improved training set labels and boost downstream application quality-potentially by large margins.
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
ImageNet: A large-scale hierarchical image database
A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Overton: A Data System for Monitoring and Improving Machine-Learned Products
Overton automates the life cycle of model construction, deployment, and monitoring by providing a set of novel high-level, declarative abstractions that shift developers to these higher-level tasks instead of lower-level machine learning tasks.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene
Taskonomy: Disentangling Task Transfer Learning
This work proposes a fully computational approach for modeling the structure of space of visual tasks via finding (first and higher-order) transfer learning dependencies across a dictionary of twenty six 2D, 2.5D, 3D, and semantic tasks in a latent space and provides a computational taxonomic map for task transfer learning.
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge
A new version of the linked open data resource ConceptNet is presented that is particularly well suited to be used with modern NLP techniques such as word embeddings, with state-of-the-art results on intrinsic evaluations of word relatedness that translate into improvements on applications of word vectors, including solving SAT-style analogies.
YouTube-8M: A Large-Scale Video Classification Benchmark
YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video), annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.