Kyungsu Kim¹, Yejin Kim¹ and Kyogu Lee¹²³
Music and Audio Research Group (MARG), Department of Intelligence and Information, Seoul National University, Seoul, Republic of Korea¹
Interdisciplinary Program in Artificial Intelligence, Seoul National University²
Artificial Intelligence Institute, Seoul National University³

Overall architecture and alternating optimization steps of Sori. (1) The discriminator is trained to distinguish musical from general audio. (2) The encoder and decoder are trained jointly: the encoder learns domain-invariant representations, while the decoder is optimized to predict the note events of the next timestep.
Sori is a real-time system that transforms general audio into MIDI. At inference time, it uses classifier-free guidance (CFG) to control the trade-off between input accordance and musical quality.
Since general audio lacks MIDI annotations, we apply domain adversarial training to the encoder to align representations between general and musical audio, encouraging the model to produce musical output regardless of the input domain.
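The alternating scheme above can be sketched with a toy linear encoder and logistic domain discriminator. Everything here (the linear models, the synthetic features, the hyperparameters) is an illustrative assumption, not the paper's actual networks; the point is the sign flip: the discriminator descends its domain-classification loss, while the encoder ascends it (a gradient reversal), pushing the features toward domain invariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: synthetic "musical" vs. "general" audio features
# whose means differ, so the domain is initially easy to classify.
DIM_IN, DIM_Z, LAMBDA, LR = 8, 4, 1.0, 0.1
x_music = rng.normal(+1.0, 1.0, size=(256, DIM_IN))
x_general = rng.normal(-1.0, 1.0, size=(256, DIM_IN))
x = np.vstack([x_music, x_general])
y_dom = np.concatenate([np.ones(256), np.zeros(256)])  # 1 = musical domain

W = rng.normal(0.0, 0.1, size=(DIM_IN, DIM_Z))  # toy linear encoder
v = rng.normal(0.0, 0.1, size=DIM_Z)            # toy logistic discriminator

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(300):
    z = x @ W                    # encoder forward: features
    p = sigmoid(z @ v)           # discriminator: P(domain = musical | z)
    err = (p - y_dom) / len(x)   # dBCE/dlogit, averaged over the batch

    # (1) Discriminator step: descend the domain-classification loss.
    v = v - LR * (z.T @ err)

    # (2) Encoder step with gradient reversal: *ascend* the discriminator's
    # loss (note the +), so the features stop encoding the input domain.
    grad_W = x.T @ np.outer(err, v)
    W = W + LR * LAMBDA * grad_W

# If the adversarial game works, the discriminator ends up near chance.
acc = (((x @ W) @ v > 0) == y_dom).mean()
print(f"domain-classification accuracy after adversarial training: {acc:.2f}")
```

In the full system the decoder's note-event prediction loss is optimized jointly with this adversarial term, so the encoder must stay informative for transcription while discarding domain cues.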

At inference time, we adjust the guidance scale $\gamma$ to control the trade-off between input accordance and musical quality: lower values ($\gamma < 1.0$) shift the output distribution toward the unconditional prior, favoring musical fluency, while higher values ($\gamma > 1.0$) amplify the influence of the conditioning signal, increasing input accordance.
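A minimal sketch of this trade-off, assuming the standard logit-space CFG rule $\ell_\gamma = \ell_\varnothing + \gamma(\ell_c - \ell_\varnothing)$ (the function name and the toy 3-event vocabulary are illustrative, not from the paper's implementation):

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, gamma):
    # Classifier-free guidance: gamma = 1 recovers the conditional model,
    # gamma < 1 pulls the output toward the unconditional (musical) prior,
    # gamma > 1 extrapolates past the conditional model, sharpening the
    # influence of the input audio.
    return uncond_logits + gamma * (cond_logits - uncond_logits)

# Hypothetical next-note-event logits for a 3-event vocabulary.
cond = np.array([2.0, 0.0, -1.0])    # conditioned on the input audio
uncond = np.array([0.5, 0.5, 0.0])   # unconditional musical prior

guided_low = cfg_logits(cond, uncond, 0.5)   # favors musical fluency
guided_high = cfg_logits(cond, uncond, 1.5)  # favors input accordance
```

Because the rule is a per-timestep affine combination of two logit vectors, $\gamma$ can be changed on the fly during real-time decoding without retraining.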