Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
Forward citations
Cited by 10 Pith papers
- DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing
  DisciplineGen-1M is a million-scale multidisciplinary dataset for text-to-image generation and editing, paired with a discipline-informed model that improves results on discipline-specific benchmarks.
- SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning
  HRM adapters via Hankel reduced-order modeling outperform LoRA on long-context tasks in Mistral-7B when used as SSM residual modules with FFT-based parallel scan.
- ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
  ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
- Latent Fourier Transform
  LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.
- Self-Supervised Test-Time Tuning for Packet Loss Concealment
  TTT-PLC adapts existing PLC models at test time via self-supervised synthetic masking of received audio packets, improving concealment on the same lossy signal in both file and streaming settings.
- PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention
  PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.
- Rubato: Transcribing Piano Music with Timestamps
  Rubato model with InterMo representation outperforms cascade methods in generating timestamped piano sheet music from audio, even when cascades receive ground-truth MIDI.
- Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music
  Introduces the first large-scale Persian music dataset and shows fine-tuned MusicGen produces compositions more aligned with Persian stylistic conventions via tag-based evaluation.
- Music Transcription with (Almost) No Supervision
  Cycle-consistent translation enables competitive music transcription performance with mostly unpaired audio and scores plus minimal paired supervision.
- A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
  A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.