While machine learning systems have recently achieved impressive, (super)human-level performance on several tasks, they have often relied on unnatural amounts of supervision – e.g., large numbers of labeled images or continuous scores in video games. In contrast, human learning is largely unsupervised, driven by observation of and interaction with the world. Emulating this type of learning in machines is an open challenge, and one that is critical for general artificial intelligence. Here, we explore prediction of future frames in video sequences as an unsupervised learning rule. The key insight is that, to predict how the visual world will change over time, an agent must have at least some implicit model of object structure and of the transformations objects can undergo. To this end, we have designed several models capable of accurate prediction in complex sequences. Our first model is a recurrent extension of the standard autoencoder framework. When trained end-to-end to predict the movement of synthetic stimuli, the model learns a representation of the underlying latent parameters of the 3D objects themselves. Importantly, this representation is naturally tolerant to object transformations and generalizes well to new tasks, such as classification of static images. Similar models trained solely with a reconstruction loss fail to generalize as effectively. In addition, we explore the use of an adversarial loss, as in a Generative Adversarial Network, and illustrate how it complements traditional pixel losses for the task of next-frame prediction.
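
To make the contrast between a next-frame prediction objective and a reconstruction objective concrete, the following is a minimal sketch on a toy sequence. Everything in it is an assumption made for illustration and is not taken from the models described here: the function names, the use of mean squared error as the pixel loss, the flattened-list frame format, and the trivial "copy the last frame" stand-in for a model are all hypothetical.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # assumption: a frame flattened to a list of pixel intensities
Model = Callable[[Sequence[Frame]], Frame]  # maps the frames seen so far to one output frame


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def prediction_loss(model: Model, frames: Sequence[Frame]) -> float:
    """Average pixel loss when the model sees frames[0..t] and must output frame t+1."""
    if len(frames) < 2:
        raise ValueError("need at least two frames to predict a next frame")
    losses = [mse(model(frames[: t + 1]), frames[t + 1]) for t in range(len(frames) - 1)]
    return sum(losses) / len(losses)


def reconstruction_loss(model: Model, frames: Sequence[Frame]) -> float:
    """Average pixel loss when the model sees frames[0..t] and must output frame t itself."""
    if not frames:
        raise ValueError("need at least one frame")
    losses = [mse(model(frames[: t + 1]), frames[t]) for t in range(len(frames))]
    return sum(losses) / len(losses)


def copy_last_frame(history: Sequence[Frame]) -> Frame:
    """Hypothetical stand-in model: returns the most recent frame unchanged."""
    return list(history[-1])


# Toy sequence (assumption): one bright pixel moving right across a 4-pixel strip.
toy_frames: List[Frame] = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    print("reconstruction loss:", reconstruction_loss(copy_last_frame, toy_frames))
    print("prediction loss:", prediction_loss(copy_last_frame, toy_frames))
```

On this toy sequence the copying model scores a reconstruction loss of 0.0 but a prediction loss of 0.5: reproducing the input is enough to satisfy the reconstruction objective, whereas the prediction objective can only be lowered by accounting for how the pixel moves. The sketch is meant only to make the distinction between the two objectives concrete.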