A novel program smoothing technique is proposed in which surrogate neural network models incrementally learn smooth approximations of a complex, real-world program's branching behavior; used together with gradient-guided input generation schemes, it significantly increases the efficiency of the fuzzing process.
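A minimal sketch of the gradient-guided generation idea, assuming a PyTorch surrogate network that maps raw input bytes to a predicted edge-coverage bitmap; the names (`Surrogate`, `mutate_by_gradient`) and the flip-the-top-k-bytes heuristic are illustrative assumptions, not the system's actual implementation:

```python
import torch
import torch.nn as nn

class Surrogate(nn.Module):
    """Feed-forward net approximating input-bytes -> edge-coverage bitmap.
    (Illustrative stand-in for the paper's surrogate model.)"""
    def __init__(self, input_len: int, num_edges: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_len, 512), nn.ReLU(),
            nn.Linear(512, num_edges), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def mutate_by_gradient(model, seed_bytes, edge_idx, k=8):
    """Mutate the k input bytes with the largest gradient magnitude
    for a target (e.g., currently unreached) edge neuron."""
    x = seed_bytes.clone().float().div_(255).requires_grad_(True)
    score = model(x.unsqueeze(0))[0, edge_idx]
    score.backward()
    hot = x.grad.abs().topk(k).indices          # most influential byte positions
    mutant = seed_bytes.clone()
    # push each critical byte in the direction that increases the edge score
    mutant[hot] = (mutant[hot] + 255 * x.grad[hot].sign().long()).clamp(0, 255)
    return mutant

model = Surrogate(input_len=1024, num_edges=4096)
seed = torch.randint(0, 256, (1024,))
new_input = mutate_by_gradient(model, seed, edge_idx=7)
```

Because the surrogate is differentiable, byte-level influence on any target edge falls out of a single backward pass; in a scheme like this, that signal replaces the blind random mutation of an evolutionary fuzzer.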
A dataset of in-the-wild videos of unintentional action is introduced, along with a suite of tasks for recognizing, localizing, and anticipating its onset; a supervised neural network is trained as a baseline, and its performance is compared to human consistency on these tasks.
NEUZZ, an efficient fuzzer that guides the input generation process using deep neural networks, is designed, implemented, and evaluated; it consistently outperforms AFL, a state-of-the-art evolutionary fuzzer, both at finding new bugs and at achieving higher edge coverage.
A framework is proposed that uses the visual modality to align multiple languages, with images as the bridge between them: it estimates the cross-modal alignment between language and images and uses this estimate to guide the learning of cross-lingual representations.
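One plausible instantiation of that idea, sketched below under assumed names (`bridged_alignment_loss` is hypothetical, not the paper's objective): each sentence pair's agreement with its shared image weights a standard contrastive cross-lingual loss.

```python
import torch
import torch.nn.functional as F

def bridged_alignment_loss(txt_a, txt_b, img, tau=0.07):
    """txt_a/txt_b: sentence embeddings in two languages; img: embeddings
    of the images both sentences describe. All shapes (B, D)."""
    txt_a, txt_b, img = (F.normalize(t, dim=-1) for t in (txt_a, txt_b, img))
    # cross-modal alignment estimate: how well each sentence matches its image
    w = ((txt_a * img).sum(-1) + (txt_b * img).sum(-1)) / 2
    # contrastive cross-lingual loss over the batch, guided by that estimate
    logits = txt_a @ txt_b.t() / tau
    targets = torch.arange(len(txt_a))
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    return (w.detach().clamp(min=0) * per_pair).mean()

B, D = 32, 256
loss = bridged_alignment_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

The weighting means poorly grounded sentence pairs contribute little to the cross-lingual objective, so the image acts as a filter as well as a bridge.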
A meta-learning framework is introduced that learns how to learn word representations from unconstrained scenes, using the natural compositional structure of language to create training episodes that teach a meta-learner strong policies for language acquisition.
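A sketch of how such episodes might be built, assuming paired (scene, utterance) data; the held-out-word construction and the function `make_episode` are hypothetical, shown only to illustrate the episodic setup:

```python
import random

def make_episode(pairs, vocab, n_support=2):
    """Build one episode by treating a sampled word as 'new': the
    meta-learner acquires it from the support set and must handle it
    in a novel composition in the query."""
    new_word = random.choice(vocab)
    uses = [p for p in pairs if new_word in p["utterance"].split()]
    if len(uses) <= n_support:
        return None                      # not enough usages of this word
    random.shuffle(uses)
    support = uses[:n_support]           # word is acquired here
    query = uses[n_support]              # ...and tested in a new context
    return {"word": new_word, "support": support, "query": query}

pairs = [{"scene": f"img_{i}.jpg", "utterance": u} for i, u in enumerate(
    ["a red cube", "a red ball", "a blue cube", "the ball rolls",
     "a red cube falls", "the blue ball"])]
# may be None when the sampled word has too few usages
episode = make_episode(pairs, vocab=["red", "blue", "cube", "ball"])
```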
An unsupervised, mid-level blob representation for a generative model of scenes is developed that allows effortless rearrangement, removal, cloning, and restyling of objects; to empirically verify the associations the model discovers, the correlation between blob presence and semantic categories predicted by an off-the-shelf network is measured.
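To make the editing claim concrete, here is a simplified, hypothetical blob layout using isotropic Gaussian opacity (the actual model's blobs carry richer shape parameters; `splat_blobs` is an assumed name): objects are edited by editing rows of the blob parameters, never pixels.

```python
import torch

def splat_blobs(centers, scales, feats, H=16, W=16):
    """Render per-blob features onto a spatial grid with Gaussian opacity.
    centers: (K, 2) in [0, 1]^2, scales: (K,), feats: (K, C)."""
    ys = torch.linspace(0, 1, H).view(H, 1).expand(H, W)
    xs = torch.linspace(0, 1, W).view(1, W).expand(H, W)
    grid = torch.stack([xs, ys], dim=-1)                       # (H, W, 2)
    d2 = ((grid[None] - centers[:, None, None]) ** 2).sum(-1)  # (K, H, W)
    alpha = torch.exp(-d2 / (2 * scales[:, None, None] ** 2))  # blob opacity maps
    # opacity-weighted mix of blob features at each location -> generator input
    return torch.einsum("khw,kc->chw", alpha, feats) / (alpha.sum(0) + 1e-6)

centers = torch.tensor([[0.3, 0.5], [0.7, 0.4]])
scales = torch.tensor([0.15, 0.10])
feats = torch.randn(2, 8)
feature_grid = splat_blobs(centers, scales, feats)
# editing: shifting a row of `centers` moves that object; dropping it removes it;
# duplicating a row with new features clones and restyles it
```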
A framework that learns how to learn text representations from visual context is proposed; it significantly outperforms the state of the art in visual language modeling at acquiring new words and predicting new compositions.
Although the model is trained with minimal supervision, it is competitive with or outperforms baselines trained on large (supervised) datasets of successfully executed goals, showing that observing unintentional action is crucial to learning about goals in video.
This work proposes a self-supervised approach to temporal cycle consistency jointly in vision and language, trained on narrated video: it learns modality-agnostic functions that predict forward and backward in time and must undo each other when composed.
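A minimal sketch of that composition constraint, under assumed names (`CrossModalCycle` and the two-layer predictors are illustrative): the same forward and backward predictors operate on either modality's embeddings, and chaining them should return to the starting state.

```python
import torch
import torch.nn as nn

class CrossModalCycle(nn.Module):
    """Modality-agnostic forward/backward-in-time predictors (sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.forward_step = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.backward_step = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def cycle_loss(self, z_t):
        """z_t: (B, dim) embeddings of video clips OR narration at time t."""
        z_next = self.forward_step(z_t)       # predict the future state
        z_back = self.backward_step(z_next)   # predict back into the past
        return ((z_back - z_t) ** 2).mean()   # the composition must undo itself

model = CrossModalCycle()
video_z = torch.randn(8, 256)   # clip embeddings
text_z = torch.randn(8, 256)    # narration embeddings in the same space
loss = model.cycle_loss(video_z) + model.cycle_loss(text_z)
```

Since the predictors never see which modality produced an embedding, the cycle objective is what pushes video and narration into a shared temporal representation.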
A large-scale analysis is performed to quantitatively understand the differences between the representations learned by self-supervised and supervised learning; it suggests that two key differences lie in the self-supervised model's representations of 3D geometry and deformable objects, which also contribute substantially to its failures.