Kyungsu Kim¹, Yejin Kim¹ and Kyogu Lee¹²³

¹Music and Audio Research Group (MARG), Department of Intelligence and Information, Seoul National University, Seoul, Republic of Korea
²Interdisciplinary Program in Artificial Intelligence, Seoul National University
³Artificial Intelligence Institute, Seoul National University

Overall architecture and alternating optimization steps of Sori. (1) The discriminator is trained to distinguish between musical and general audio. (2) The encoder and decoder are jointly trained: the encoder learns domain-invariant representations, while the decoder is optimized to predict the note events of the next timestep.


Sori is a real-time audio-to-MIDI transformation system that accepts general (non-musical) audio as input. It uses classifier-free guidance (CFG) to control the trade-off between input accordance and musical quality at inference time.

1. Domain Adversarial Training

Since general audio lacks MIDI annotations, we apply domain adversarial training to the encoder to align representations between general and musical audio, encouraging the model to produce musical output regardless of the input domain.
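The alternating optimization can be sketched with a toy linear "encoder" and logistic "discriminator" (the actual Sori modules are neural networks; all shapes, learning rates, and variable names below are illustrative assumptions, not the paper's configuration). Step (1) descends the domain-classification loss on the discriminator; step (2) applies gradient reversal so the encoder ascends that same loss, pushing its features toward domain invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # toy encoder weights: z = x @ W
v = rng.normal(size=(4,))     # toy discriminator weights: p = sigmoid(z @ v)

x = rng.normal(size=(32, 8))                # batch of audio features
y = (rng.random(32) > 0.5).astype(float)    # domain labels (1 = musical, 0 = general)

def domain_loss(W, v):
    # numerically stable binary cross-entropy with logits t = x @ W @ v
    t = x @ W @ v
    return np.mean(np.logaddexp(0.0, t) - y * t)

def grads(W, v):
    z = x @ W
    t = z @ v
    p = 0.5 * (1.0 + np.tanh(0.5 * t))      # stable sigmoid
    d = (p - y) / len(y)                    # dL/d(logit)
    return z.T @ d, x.T @ np.outer(d, v)    # (grad wrt v, grad wrt W)

lr, lam = 0.05, 1.0
for _ in range(50):
    g_v, _ = grads(W, v)
    v -= lr * g_v                 # (1) discriminator: minimize domain loss
    _, g_W = grads(W, v)
    W += lr * lam * g_W           # (2) encoder: reversed gradient, maximize it

# With v held fixed, a tiny reversed-gradient step on the encoder increases
# the domain loss, i.e., makes the domains harder to tell apart.
loss_before = domain_loss(W, v)
_, g_W = grads(W, v)
loss_after = domain_loss(W + 1e-3 * g_W, v)
```

In a deep-learning framework this sign flip is usually implemented as a gradient reversal layer (identity in the forward pass, negated gradient in the backward pass) rather than by hand-written updates as above.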

| Input | Without Domain Adversarial Training | With Domain Adversarial Training (Proposed) |
|---|---|---|
| memi_1 | memi_1.0_6 | memi_1.0_2 |
| new_cough_1 | new_cough_1.0_1 | new_cough_1.0_6 |
| moonspeech_1_1 | moonspeech_1_1.0_2 | moonspeech_1_1.0_12 |
| real_music_1 | real_music_1_1.0_8_no_adv | real_music_1_1.0_12 |

(Each cell in the original demo page pairs an image (.png) with its audio clip (.wav).)

2. Controllability of the Input Accordance-Musical Quality Trade-off

At inference time, we adjust the CFG guidance scale $\gamma$ to control the trade-off between input accordance and musical quality: lower values ($\gamma < 1.0$) shift the output distribution toward the unconditional prior, favoring musical fluency, while higher values ($\gamma > 1.0$) strengthen the influence of the conditioning signal, increasing input accordance.
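The standard classifier-free guidance combination can be written as a one-line interpolation/extrapolation between the conditional and unconditional predictions. A minimal sketch, assuming the model emits next-event logits (the function and variable names here are hypothetical, not Sori's API):

```python
import numpy as np

def cfg_logits(cond, uncond, gamma):
    # gamma = 1.0 recovers the purely conditional model;
    # gamma < 1.0 leans toward the unconditional (musical) prior;
    # gamma > 1.0 extrapolates past it, amplifying the input audio's influence.
    return uncond + gamma * (cond - uncond)

# toy next-event logits for illustration
cond = np.array([2.0, 0.5, -1.0])     # model conditioned on the input audio
uncond = np.array([0.0, 1.0, 0.0])    # same model with conditioning dropped
guided = cfg_logits(cond, uncond, gamma=1.5)
```

Because the same network produces both terms (with the conditioning randomly dropped during training), the scale $\gamma$ can be chosen freely at inference time without retraining.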