Corpus ID: 53479905

Toddler-Inspired Visual Object Learning

@inproceedings{bambach2018toddler,
  title={Toddler-Inspired Visual Object Learning},
  author={Sven Bambach and David J. Crandall and Linda B. Smith and Chen Yu},
  booktitle={Neural Information Processing Systems},
  year={2018}
}
Real-world learning systems have practical limitations on the size of the training datasets that they can collect and consider. […] Using head-mounted cameras, eye gaze trackers, and a model of foveated vision, we collected first-person (egocentric) imagery that closely approximates the "training data" that toddlers' visual systems collect in everyday, naturalistic learning contexts.


Embodied vision for learning object representations

Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto nearby internal representations.

Decoding Attention from Gaze: A Benchmark Dataset and End-to-End Models

This paper studies the use of computer vision tools for "attention decoding", the task of assessing the locus of a participant's overt visual attention over time; it proposes two end-to-end deep learning models for attention decoding and compares them to state-of-the-art heuristic methods.

Learning to Associate Spoken Words and Visual Objects from Egocentric Video of Parent-infant Social Interaction

As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved using actual visual data perceived by infant learners.

Active Object Manipulation Facilitates Visual Object Learning: An Egocentric Vision Study

The experimental results suggest that supervision with hand manipulation outperforms supervision without hands, and that this trend holds even when only a small number of images is available.

A Computational Model of Early Word Learning from the Infant's Point of View

This study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents to simulate infant word learning, and provides a proof of principle that the problem of early word learning can be solved using actual visual data perceived by infant learners.

Learning task-agnostic representation via toddler-inspired learning

This work designs an interactive agent that learns and stores a task-agnostic visual representation while exploring and interacting with objects in a virtual environment, and shows that the resulting representation transfers to various vision tasks such as image classification, object localization, and distance estimation.

Contrastive Learning Through Time

This paper considers several state-of-the-art contrastive learning methods and demonstrates that CLTT allows linear classification performance that approaches that of the fully supervised setting if subsequent views are sufficiently likely to stem from the same object.

Modeling joint attention from egocentric vision

Numerous studies in cognitive development have provided converging evidence that Joint Attention (JA) is crucial for children to learn about the world together with their parents. However, a closer […]

Self-supervised learning through the eyes of a child

The results demonstrate the emergence of powerful, high-level visual representations from developmentally realistic natural videos using generic self-supervised learning objectives.

The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks

Comparing the core object recognition performance of 146 children against adults and against DNNs suggests that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely to be a mere accumulation of experience with distorted visual input.

Active Viewing in Toddlers Facilitates Visual Object Learning: An Egocentric Vision Approach

The work in this paper is based on the hypothesis that toddlers' active viewing and exploration creates high-quality training data for object recognition, and that CNNs can take advantage of these differences to learn toddler-based object models that outperform their parent counterparts in a series of controlled simulations.

A Developmental Approach to Machine Learning?

It is proposed that the skewed, ordered, biased visual experiences of infants and toddlers are the training data that allow human learners to develop a way to recognize everything, both the pervasively present entities and the rarely encountered ones.

Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study

This work proposes to address the interpretability problem in modern DNNs using the rich history of problem descriptions, theories, and experimental methods developed by cognitive psychologists to study the human mind. It demonstrates the capability of tools from cognitive psychology to expose hidden computational properties of DNNs while concurrently providing a computational model of human word learning.

Real-world visual statistics and infants' first-learned object names

We offer a new solution to the unsolved problem of how infants break into word learning based on the visual statistics of everyday infant-perspective scenes. Images from head camera video captured by […]

Exploring the Limits of Weakly Supervised Pretraining

This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images and shows improvements on several image classification and object detection tasks, and reports the highest ImageNet-1k single-crop, top-1 accuracy to date.

Embodied attention and word learning by toddlers

Adapting Deep Network Features to Capture Psychological Representations

It is found that deep features learned in service of object classification account for a significant amount of the variance in human similarity judgments for a set of animal images, but these features do not appear to capture some key qualitative aspects of human representations.

You Only Look Once: Unified, Real-Time Object Detection

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Visual scenes are categorized by function.

The hypothesis that scene categories reflect functions, or the possibilities for actions within a scene, is tested; the results suggest that a scene's category may be determined by the scene's function.