Corpus ID: 53479905

Toddler-Inspired Visual Object Learning

@inproceedings{Bambach2018ToddlerInspiredVO,
  title={Toddler-Inspired Visual Object Learning},
  author={Sven Bambach and David J. Crandall and Linda B. Smith and Chen Yu},
  booktitle={NeurIPS},
  year={2018}
}
Real-world learning systems have practical limitations on the size of the training datasets that they can collect and consider. Using head-mounted cameras, eye-gaze trackers, and a model of foveated vision, we collected first-person (egocentric) imagery that closely approximates the "training data" that toddlers' visual systems collect in everyday, naturalistic learning contexts.
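
That foveation step lends itself to a compact simulation: crop the frame around the tracked gaze point and blur the periphery. The sketch below is a minimal illustration, not the authors' released pipeline; the normalized gaze coordinates, crop fraction, and blur radius are all assumed parameters.

# Minimal sketch of gaze-centered foveation (illustrative only, not the
# authors' released pipeline). Assumes a PIL frame and gaze coordinates
# normalized to [0, 1]; crop fraction and blur radius are guesses.
from PIL import Image, ImageFilter

def foveate(frame: Image.Image, gaze_xy, crop_frac=0.5, blur_radius=8):
    """Keep a sharp crop around the gaze point; blur everything else."""
    w, h = frame.size
    cx, cy = int(gaze_xy[0] * w), int(gaze_xy[1] * h)
    half = int(min(w, h) * crop_frac / 2)
    # Clamp the foveal box to the frame boundaries.
    box = (max(cx - half, 0), max(cy - half, 0),
           min(cx + half, w), min(cy + half, h))
    out = frame.filter(ImageFilter.GaussianBlur(blur_radius))
    out.paste(frame.crop(box), box[:2])  # sharp fovea over blurred periphery
    return out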

Citations

Embodied vision for learning object representations

Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations.

Learning to Associate Spoken Words and Visual Objects from Egocentric Video of Parent-infant Social Interaction

As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners.

Active Object Manipulation Facilitates Visual Object Learning: An Egocentric Vision Study

The experimental results suggest that supervision with hand manipulation is better than supervision without hands, and that the trend is consistent even when only a small number of images is available.

A Computational Model of Early Word Learning from the Infant's Point of View

This study uses egocentric video and gaze data collected from infant learners during natural toy play with their parents to simulate infant word learning, and provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners.

Learning task-agnostic representation via toddler-inspired learning

This work designs an interactive agent that learns and stores task-agnostic visual representations while exploring and interacting with objects in a virtual environment, and shows that the resulting representations transfer to various vision tasks such as image classification, object localization, and distance estimation.

Contrastive Learning Through Time

This paper considers several state-of-the-art contrastive learning methods and demonstrates that contrastive learning through time (CLTT) achieves linear-classification performance approaching that of the fully supervised setting, provided that subsequent views are sufficiently likely to stem from the same object.
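
The core idea behind such time-contrastive objectives is compact enough to sketch: treat temporally adjacent frames as positive pairs in an InfoNCE-style loss, so successive views of the same object are pulled together in embedding space. The snippet below is a minimal PyTorch illustration, not the CLTT reference implementation; the encoder, temperature, and batching scheme are assumptions.

# Minimal time-contrastive loss sketch (not the CLTT reference code).
# Treats each frame's temporal successor as its positive; all other
# frames in the batch serve as negatives.
import torch
import torch.nn.functional as F

def time_contrastive_loss(encoder, x_t, x_tp1, temperature=0.1):
    z1 = F.normalize(encoder(x_t), dim=1)    # (B, D) embeddings at time t
    z2 = F.normalize(encoder(x_tp1), dim=1)  # (B, D) embeddings at time t+1
    logits = z1 @ z2.T / temperature         # all pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)   # diagonal pairs are positives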

Modeling joint attention from egocentric vision

Numerous studies in cognitive development have provided converging evidence that Joint Attention (JA) is crucial for children to learn about the world together with their parents.

Self-supervised learning through the eyes of a child

The results demonstrate the emergence of powerful, high-level visual representations from developmentally realistic natural videos using generic self-supervised learning objectives.

The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks

Comparing the core object recognition performance of 146 children against adults and against DNNs suggests that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely a mere accumulation of experience with distorted visual input.

Embodied Amodal Recognition: Learning to Move to Perceive Objects

Experimental results show that agents with embodiment (movement) achieve better visual recognition performance than passive ones, and that, in order to improve their recognition abilities, agents learn strategic paths that differ from the shortest paths.

References

Showing 1-10 of 35 references

Active Viewing in Toddlers Facilitates Visual Object Learning: An Egocentric Vision Approach

The work in this paper is based on the hypothesis that toddlers' active viewing and exploration actually create high-quality training data for object recognition, and that CNNs can take advantage of these differences to learn toddler-based object models that outperform their parent counterparts in a series of controlled simulations.
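
Such a comparison amounts to fine-tuning the same backbone separately on toddler-view and parent-view crops and comparing held-out accuracy. The sketch below illustrates that setup; the directory layout, backbone choice, class count, and hyperparameters are placeholders, not the paper's exact protocol.

# Sketch of a toddler-vs-parent comparison: fine-tune the same ImageNet
# backbone on each viewer's egocentric crops, then compare test accuracy.
# Paths, class count, and hyperparameters below are placeholders.
import torch
from torchvision import datasets, models, transforms

def finetune(train_dir, num_classes, epochs=5):
    tf = transforms.Compose([transforms.Resize((224, 224)),
                             transforms.ToTensor()])
    data = datasets.ImageFolder(train_dir, transform=tf)
    loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)
    net = models.vgg16(weights="IMAGENET1K_V1")
    net.classifier[6] = torch.nn.Linear(4096, num_classes)  # replace head
    opt = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            torch.nn.functional.cross_entropy(net(x), y).backward()
            opt.step()
    return net

# e.g., compare finetune("crops/toddler", 24) against finetune("crops/parent", 24)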

A Developmental Approach to Machine Learning?

It is proposed that the skewed, ordered, biased visual experiences of infants and toddlers are the training data that allow human learners to develop a way to recognize everything, both the pervasively present entities and the rarely encountered ones.

Cognitive Psychology for Deep Neural Networks: A Shape Bias Case Study

This work proposes to address the interpretability problem in modern DNNs using the rich history of problem descriptions, theories, and experimental methods developed by cognitive psychologists to study the human mind. It demonstrates that tools from cognitive psychology can expose hidden computational properties of DNNs while concurrently providing a computational model of human word learning.
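
One such borrowed method, the shape-bias probe, is straightforward to approximate: present a probe image together with a shape match and a color match, and count how often the model embeds the probe closer to the shape match. The sketch below illustrates that protocol under assumptions (cosine similarity on some embedding function); it is not the paper's exact setup.

# Sketch of a shape-bias probe: for each (probe, shape_match, color_match)
# triad, check whether the model embeds the probe closer to the shape match.
# The embedding function and similarity measure are assumptions.
import torch.nn.functional as F

def shape_bias(embed, triads):
    """triads: iterable of (probe, shape_match, color_match) image tensors."""
    wins = 0
    for probe, shape_m, color_m in triads:
        z = lambda img: F.normalize(embed(img.unsqueeze(0)), dim=1)
        wins += int((z(probe) @ z(shape_m).T) > (z(probe) @ z(color_m).T))
    return wins / len(triads)  # fraction of shape-based matches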

Real-world visual statistics and infants' first-learned object names

We offer a new solution to the unsolved problem of how infants break into word learning, based on the visual statistics of everyday infant-perspective scenes.

Exploring the Limits of Weakly Supervised Pretraining

This paper presents a unique study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images. It shows improvements on several image classification and object detection tasks and reports the highest ImageNet-1k single-crop top-1 accuracy to date.
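
The weak-supervision setup can be framed as classification over a hashtag vocabulary with multi-hot targets. The toy sketch below shows that framing with a sigmoid loss; the actual work trains at billion-image scale with its own loss and vocabulary handling, so every detail here is an assumption.

# Toy sketch of hashtag supervision as multi-label classification.
# logits: (B, vocab_size) model outputs; hashtag_ids: per-image tag indices.
import torch
import torch.nn.functional as F

def hashtag_loss(logits, hashtag_ids):
    targets = torch.zeros_like(logits)
    for i, tags in enumerate(hashtag_ids):
        targets[i, list(tags)] = 1.0  # multi-hot over the hashtag vocabulary
    return F.binary_cross_entropy_with_logits(logits, targets)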

Embodied attention and word learning by toddlers

Adapting Deep Network Features to Capture Psychological Representations

It is found that deep features learned in service of object classification account for a significant amount of the variance in human similarity judgments for a set of animal images, but these features do not appear to capture some key qualitative aspects of human representations.

You Only Look Once: Unified, Real-Time Object Detection

Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background, and outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3×3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
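
The design rule is simple to state in code: stacks of 3×3 convolutions in place of larger filters, with pooling after each stack. The sketch below builds only the convolutional trunk of the 16-layer configuration; it follows the published channel widths but omits the three fully connected layers of the classifier head.

# Sketch of the VGG design rule: stacked 3x3 convolutions, pooling after
# each stack. Builds the 13-conv trunk of VGG-16 (configuration D); the
# fully connected classifier head is omitted.
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))  # halve spatial resolution per block
    return nn.Sequential(*layers)

trunk = nn.Sequential(*[vgg_block(i, o, n) for i, o, n in
                        [(3, 64, 2), (64, 128, 2), (128, 256, 3),
                         (256, 512, 3), (512, 512, 3)]])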

Visual scenes are categorized by function.

The hypothesis that scene categories reflect functions, or the possibilities for actions within a scene, is tested; the results suggest that a scene's category may be determined by the scene's function rather than by its visual appearance alone.