Self-Supervised Multi-Task Pretraining Improves Image Aesthetic Assessment

Jan Pfister, Konstantin Kobs, Andreas Hotho
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Neural networks for Image Aesthetic Assessment are usually initialized with the weights of pretrained ImageNet models and then trained on a labeled image aesthetics dataset. We argue that the ImageNet classification task is not well suited for pretraining, since content-based classification is designed to make the model invariant to features that strongly influence the image's aesthetics, e.g., style-based features such as brightness or contrast. We propose to use self-supervised aesthetic-aware…
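The (truncated) proposal above, keeping the network sensitive to style edits instead of invariant to them, can be illustrated as a pretext-labeling step. The sketch below is a hypothetical NumPy illustration, not the authors' actual pipeline; the distortion set and `make_pretext_pair` are my own stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical style-based distortions (values assumed, not from the paper).
# The pretext task is to predict which distortion was applied, which forces
# the network to stay sensitive to brightness and contrast.
DISTORTIONS = {
    0: lambda img: img,                                         # identity
    1: lambda img: np.clip(img + 0.2, 0.0, 1.0),                # brighten
    2: lambda img: np.clip(img - 0.2, 0.0, 1.0),                # darken
    3: lambda img: np.clip((img - 0.5) * 1.5 + 0.5, 0.0, 1.0),  # more contrast
}

def make_pretext_pair(img):
    """Return (distorted image, distortion label) for self-supervised training."""
    label = int(rng.integers(len(DISTORTIONS)))
    return DISTORTIONS[label](img), label
```

A classifier trained on such (image, label) pairs must encode style information, which is exactly what the abstract argues ImageNet classification discards.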


CLIP knows image aesthetics

A comparison of features extracted by CLIP with features from the last layer of a comparable ImageNet classification model suggests that CLIP is better suited than ImageNet-pretrained networks as a base model for IAA methods.

Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2

A new version of the Aesthetic Multi-Attributes Network (AMANv2), based on the BUTD and VLPSA models, is proposed for aesthetic attribute captioning, i.e., assessing aesthetic attributes such as composition, lighting usage, and color arrangement.

Considering User Agreement in Learning to Predict the Aesthetic Quality

A re-adapted multi-task attention network is proposed to predict both the mean opinion score and its standard deviation in an end-to-end manner, together with a new confidence-interval ranking loss that encourages the model to focus on image pairs whose difference in aesthetic score is less certain.
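The two regression targets in this summary, mean opinion score and its standard deviation, come directly from a per-image rating histogram. A minimal NumPy sketch (the function name `mos_and_std` and the bin convention are my own assumptions):

```python
import numpy as np

def mos_and_std(hist):
    """Mean opinion score and standard deviation from a rating histogram.

    hist[i] counts votes for score i + 1 (e.g. 10 bins for scores 1-10).
    """
    p = np.asarray(hist, dtype=float)
    p = p / p.sum()                        # normalize counts to probabilities
    scores = np.arange(1, len(p) + 1)
    mean = float(scores @ p)               # expected score
    std = float(np.sqrt(((scores - mean) ** 2) @ p))
    return mean, std
```

A low standard deviation indicates strong user agreement on the score, which is what the confidence-aware loss above exploits.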

Revisiting Image Aesthetic Assessment via Self-Supervised Feature Learning

This paper designs two novel pretext tasks, identifying the types and the parameters of editing operations applied to synthetic instances, and evaluates the learned features with a one-layer linear classifier in terms of binary aesthetic classification.

Attention-based Multi-Patch Aggregation for Image Aesthetic Assessment

A novel multi-patch aggregation method for image aesthetic assessment is proposed, using an attention-based mechanism that adaptively adjusts the weight of each patch during training to improve learning efficiency; it outperforms existing methods by a large margin.
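The aggregation step described above can be sketched as softmax attention over per-patch features. The single scoring vector `w` below is an assumed stand-in for the paper's learned attention module:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def aggregate_patches(patch_feats, w):
    """Attention-weighted aggregation of patch features.

    patch_feats: (n_patches, d) array; w: (d,) scoring vector (a stand-in
    assumption for a learned attention module).
    """
    scores = patch_feats @ w         # one relevance score per patch
    alpha = softmax(scores)          # attention weights, summing to 1
    return alpha @ patch_feats       # (d,) aggregated feature
```

With `w = 0` this reduces to plain mean pooling; training the attention parameters lets the model upweight the informative patches.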

A-Lamp: Adaptive Layout-Aware Multi-patch Deep Convolutional Neural Network for Photo Aesthetic Assessment

An Adaptive Layout-Aware Multi-Patch Convolutional Neural Network (A-Lamp CNN) architecture for photo aesthetic assessment that accepts arbitrarily sized images and learns from both fine-grained details and the holistic image layout simultaneously.

NIMA: Neural Image Assessment

The proposed approach relies on the success (and retraining) of proven, state-of-the-art deep object recognition networks and can be used to not only score images reliably and with high correlation to human perception, but also to assist with adaptation and optimization of photo editing/enhancement algorithms in a photographic pipeline.
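NIMA predicts a full distribution over the ordered 1-10 rating bins and trains with an earth mover's distance style loss between the predicted and ground-truth histograms. A NumPy sketch of that loss, assuming normalized histograms and the r = 2 variant:

```python
import numpy as np

def emd_loss(p, q, r=2):
    """Earth mover's distance between two normalized score histograms
    over ordered rating bins (r = 2 gives the squared-EMD variant)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cdf_diff = np.cumsum(p - q)                # difference of the two CDFs
    return float(np.mean(np.abs(cdf_diff) ** r) ** (1.0 / r))
```

Because the bins are ordered, penalizing the CDF difference makes predicting a score of 4 instead of 5 much cheaper than predicting 1 instead of 5, unlike plain cross-entropy.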

Revisiting Self-Supervised Visual Representation Learning

This study revisits numerous previously proposed self-supervised models in a thorough large-scale study and uncovers multiple crucial insights, e.g., that standard recipes for CNN design do not always translate to self-supervised representation learning.

NICER: Aesthetic Image Enhancement with Humans in the Loop

This work proposes the Neural Image Correction & Enhancement Routine (NICER), a neural-network-based approach to no-reference image enhancement in a fully automatic, semi-automatic, or fully manual process that is interactive and user-centered; it shows that NICER can improve image aesthetics even without user interaction.

Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey

An extensive review of deep-learning-based self-supervised methods, a subset of unsupervised learning, for learning general visual features from large-scale unlabeled images and videos without any human-annotated labels.

AVA: A large-scale database for aesthetic visual analysis

A new large-scale database for conducting Aesthetic Visual Analysis (AVA) is introduced; it contains over 250,000 images with a rich variety of meta-data, including a large number of aesthetic scores for each image, semantic labels for over 60 categories, and labels related to photographic style.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that composition of data augmentations plays a critical role in defining effective predictive tasks, and introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
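The contrastive objective summarized above (projections of two augmented views of the same image pulled together, all other pairs pushed apart) is the NT-Xent loss. A small NumPy sketch, assuming the projection-head outputs are already computed:

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent contrastive loss over two batches of view embeddings.

    z1, z2: (n, d) projections of two augmented views of the same n images.
    """
    z = np.concatenate([np.asarray(z1, float), np.asarray(z2, float)])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau                                # temperature-scaled
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(-(sim[np.arange(2 * n), pos] - log_denom).mean())
```

With a single perfectly aligned pair and no negatives the loss is zero; it grows as negative pairs become similar to the anchor, which is why larger batches (more negatives) help.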

Scaling and Benchmarking Self-Supervised Visual Representation Learning

It is shown that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation and visual navigation using reinforcement learning.