Corpus ID: 234338162

MuseMorphose: Full-Song and Fine-Grained Music Style Transfer with Just One Transformer VAE

Shih-Lun Wu and Yi-Hsuan Yang
Transformers and variational autoencoders (VAE) have been extensively employed for symbolic (e.g., MIDI) domain music generation. While the former boast an impressive capability in modeling long sequences, the latter allow users to willingly exert control over different parts (e.g., bars) of the music to be generated. In this paper, we are interested in bringing the two together to construct a single model that exhibits both strengths. The task is split into two steps. First, we equip… 
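The combination the abstract describes — a long-sequence model whose output can be steered bar by bar — hinges on giving every bar its own latent code. Below is a minimal, hypothetical numpy sketch of per-bar latent sampling via the VAE reparameterization trick; the shapes, dimensions, and names are illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Hypothetical song of 4 bars, each bar encoded to an 8-dim latent.
n_bars, d_latent = 4, 8
mu = rng.standard_normal((n_bars, d_latent))      # encoder means, one per bar
logvar = rng.standard_normal((n_bars, d_latent))  # encoder log-variances, one per bar

# One latent per bar gives the decoder fine-grained, bar-level control:
z = np.stack([reparameterize(mu[i], logvar[i], rng) for i in range(n_bars)])
print(z.shape)  # (4, 8): a separate, user-editable code for every bar
```

Editing a single row of `z` before decoding would then change only the corresponding bar, which is the kind of fine-grained control the paper aims for.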


CPS: Full-Song and Style-Conditioned Music Generation with Linear Transformer

CPS (Compound word with style), a model that can specify a target style and generate a complete musical composition from scratch, is introduced; it performs better in terms of basic music metrics as well as metrics for evaluating controllability.

FIGARO: Generating Symbolic Music with Fine-Grained Artistic Control

This work trains FIGARO (FIne-grained music Generation via Attention-based, RObust control) by applying description-to-sequence modelling to symbolic music and achieves state-of-the-art results in controllable symbolic music generation and generalizes well beyond the training distribution.

Compose & Embellish: Well-Structured Piano Performance Generation via A Two-Stage Approach

The authors' objective and subjective experiments show that Compose & Embellish shrinks the gap in structureness between the state of the art and real performances by half, and improves other musical aspects such as richness and coherence as well.

MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding

An attempt to employ the masked language modeling approach of BERT to pre-train a 12-layer Transformer model for tackling a number of symbolic-domain discriminative music understanding tasks, finding that, given a pretrained Transformer, the models outperform recurrent neural network based baselines with fewer than 10 epochs of fine-tuning.
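For reference, masked-language-modeling pre-training corrupts a token sequence and asks the model to recover the original tokens. Here is a generic BERT-style masking sketch in plain Python; the 80/10/10 corruption split is standard BERT practice, and the event names are illustrative, not taken from MidiBERT-Piano itself:

```python
import random

MASK = "[MASK]"

def mlm_mask(tokens, p=0.15, rng=None):
    """BERT-style masking: select ~p of positions as prediction targets;
    replace 80% of them with [MASK], 10% with a random token, keep 10%."""
    rng = rng or random.Random(0)
    vocab = sorted(set(tokens))
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            targets[i] = tok          # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets

events = ["Bar", "Pos_1", "Pitch_60", "Dur_4"] * 8  # toy symbolic-music sequence
x, y = mlm_mask(events)
print(len(y), "positions selected as prediction targets")
```

The model is trained to predict `y[i]` from the corrupted sequence `x`, which is what makes the pre-trained representations useful for downstream discriminative tasks.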

Compositional Steering of Music Transformers

This paper builds on lightweight fine-tuning methods, such as prefix tuning and bias tuning, to propose a novel contrastive loss that enables us to steer music transformers over arbitrary combinations of logical features, with a relatively small number of extra parameters.

Variable-Length Music Score Infilling via XLNet and Musically Specialized Positional Encoding

A new self-attention based model for music score infilling, i.e., to generate a polyphonic music sequence that fills in the gap between given past and future contexts, that can infill a variable number of notes for different time spans is proposed.
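An infilling model is conditioned on both the past and the future context and generates the gap between them. A generic sketch of how such an input can be arranged for an autoregressive model is shown below; the token names are hypothetical and this is the common fill-in-the-middle arrangement, not the paper's XLNet-based permutation scheme:

```python
def build_infilling_input(past, future, sep="<SEP>", ans="<ANS>"):
    """Arrange contexts so an autoregressive model predicts the gap last:
    past context, then future context, then a marker after which the
    missing (variable-length) span is generated."""
    return past + [sep] + future + [ans]

prompt = build_infilling_input(["C4", "E4"], ["G4", "C5"])
print(prompt)  # ['C4', 'E4', '<SEP>', 'G4', 'C5', '<ANS>']
```

Because the infilled span comes last, its length is not fixed in advance, which is the variable-length property the summary highlights.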

Structure-Enhanced Pop Music Generation via Harmony-Aware Learning

Experimental results reveal that, compared to existing methods, HAT exhibits a much better understanding of structure and can also improve the quality of the generated music, especially in form and texture.

DadaGP: A Dataset of Tokenized GuitarPro Songs for Sequence Models

DadaGP opens up the possibility to train GuitarPro score generators, fine-tune models on custom data, create new styles of music, build AI-powered songwriting apps, and support human-AI improvisation.

Sketching the Expression: Flexible Rendering of Expressive Piano Performance with Self-Supervised Learning

This work aims to disentangle the entire musical expression and structural attribute of piano performance using a conditional VAE framework and employs self-supervised approaches that force the latent variables to represent target attributes.

SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias

SongDriver, a real-time music accompaniment generation system without logical latency or exposure bias, is proposed; it outperforms existing state-of-the-art (SOTA) models on both objective and subjective metrics while significantly reducing physical latency.

LakhNES: Improving Multi-instrumental Music Generation with Cross-domain Pre-training

To improve the performance of the Transformer architecture, this work proposes a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset, and finds that this transfer learning procedure improves both quantitative and qualitative performance for the primary task.

Encoding Musical Style with Transformer Autoencoders

This work presents the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance, and shows it is possible to combine this global representation with other temporally distributed embeddings, enabling improved control over the separate aspects of performance style and melody.

Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

This paper presents a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types, and proposes a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types.

Music Transformer: Generating Music with Long-Term Structure

It is demonstrated that a Transformer with the modified relative attention mechanism can generate minute-long compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and, in a seq2seq setup, generate accompaniments conditioned on melodies.
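The core of the modified attention is a learned bias that depends on the relative distance i − j between query and key positions, so patterns can repeat regardless of absolute position. A toy numpy sketch of that idea (not the memory-efficient "skewing" implementation used in the actual Music Transformer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_attention(q, k, v, rel_bias):
    """Scaled dot-product attention with a bias b[i-j] added to each
    logit, so scores depend on relative distance, not absolute index."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    # map distances -(n-1)..(n-1) into indices 0..2n-2 of rel_bias
    logits += rel_bias[idx[:, None] - idx[None, :] + n - 1]
    return softmax(logits) @ v

rng = np.random.default_rng(1)
n, d = 6, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
rel_bias = rng.standard_normal(2 * n - 1)  # one learned bias per relative distance
out = relative_attention(q, k, v, rel_bias)
print(out.shape)  # (6, 4)
```

Sharing one bias per distance, rather than per absolute position pair, is what lets the model capture periodic, repeating structure in long pieces.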

MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment

Three models for symbolic multi-track music generation under the framework of generative adversarial networks (GANs) are proposed; they differ in their underlying assumptions and, accordingly, their network architectures, and are referred to as the jamming model, the composer model, and the hybrid model.

MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer

We introduce MIDI-VAE, a neural network model based on variational autoencoders that is capable of handling polyphonic music with multiple instrument tracks, as well as modeling the dynamics of music.

Music SketchNet: Controllable Music Generation via Factorized Representations of Pitch and Rhythm

Music SketchNet, a neural network framework that allows users to specify partial musical ideas guiding automatic music generation, is proposed, and it is demonstrated that the model can successfully incorporate user-specified snippets during the generation process.

PIANOTREE VAE: Structured Representation Learning for Polyphonic Music

The experiments prove the validity of PianoTree VAE via semantically meaningful latent codes for polyphonic segments, more satisfactory reconstruction alongside the decent geometry learned in the latent space, and the model's benefits to a variety of downstream music generation tasks.

PopMAG: Pop Music Accompaniment Generation

A novel MUlti-track MIDI representation (MuMIDI) is proposed, which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of notes across different tracks; the resulting system, called PopMAG, largely outperforms other state-of-the-art music accompaniment generation models and multi-track MIDI representations in terms of subjective and objective metrics.

Pop Music Transformer: Generating Music with Rhythm and Harmony

This paper builds a Pop Music Transformer that composes pop piano music with a more plausible rhythmic structure than prior art does, and introduces a new event set, dubbed "REMI" (REvamped MIDI-derived events), which provides sequence models with a metric context for modeling the rhythmic patterns of music.
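A REMI-style event stream makes bars and beat positions explicit tokens rather than leaving timing implicit in note-on/note-off deltas. The following is a simplified, hypothetical tokenizer illustrating that idea; the paper's actual event set also covers aspects such as tempo and chords, which are omitted here:

```python
def to_remi(notes, ticks_per_bar=16):
    """Convert (onset_tick, pitch, duration) notes into a REMI-like event
    stream with explicit Bar and Position tokens for metric context."""
    events, current_bar = [], -1
    for onset, pitch, dur in sorted(notes):
        bar = onset // ticks_per_bar
        while current_bar < bar:          # emit a Bar token for each new bar
            events.append("Bar")
            current_bar += 1
        events.append(f"Position_{onset % ticks_per_bar}")  # beat within the bar
        events.append(f"Pitch_{pitch}")
        events.append(f"Duration_{dur}")
    return events

notes = [(0, 60, 4), (4, 64, 4), (16, 67, 8)]  # the third note starts bar 2
print(to_remi(notes))
# ['Bar', 'Position_0', 'Pitch_60', 'Duration_4', 'Position_4', 'Pitch_64',
#  'Duration_4', 'Bar', 'Position_0', 'Pitch_67', 'Duration_8']
```

Because `Bar` and `Position` tokens recur at metrically meaningful places, a sequence model trained on such streams can pick up rhythmic regularities directly from the token statistics.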