Kyungsu Kim¹, Junghyun Koo¹, Sungho Lee¹, Haesun Jung¹, Kyogu Lee¹²³

Music and Audio Research Group (MARG), Department of Intelligence and Information, Seoul National University, Seoul, Republic of Korea¹; Interdisciplinary Program in Artificial Intelligence, Seoul National University²; Artificial Intelligence Institute, Seoul National University³

Overall pipeline of TokenSynth. (a) TokenSynth takes a timbre embedding and MIDI tokens as input, where the timbre embedding is extracted from reference audio using a pre-trained CLAP encoder. The reference and target audio are synthesized with the same instrument but different musical notes. The training objective is to predict the audio tokens obtained by encoding the target audio with a pre-trained DAC encoder. (b) During inference, the timbre embedding can be extracted from reference audio, a text description, or a combination of both. The predicted audio tokens are then converted back into an audio signal using a pre-trained DAC decoder.

TokenSynth is a token-based neural synthesizer that generates polyphonic single-instrument musical audio from MIDI and timbre embeddings, enabling instrument cloning, text-to-instrument synthesis, and timbre manipulation. It uses a decoder-only transformer trained on neural audio tokens with CLAP-based timbre conditioning, allowing for flexible sound design without fine-tuning.
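The sketch below illustrates the conditioning scheme described above: a CLAP timbre embedding is projected to a single prefix token and prepended to the MIDI token sequence, and a decoder-only transformer autoregressively predicts DAC audio tokens. This is a minimal illustration; all class and parameter names, vocabulary sizes, and dimensions are assumptions for the example, not the exact configuration used by TokenSynth.

```python
# Minimal sketch of timbre-conditioned, decoder-only token prediction.
# Vocabulary sizes, dimensions, and layer counts are illustrative assumptions.
import torch
import torch.nn as nn

class TimbreConditionedDecoder(nn.Module):
    def __init__(self, midi_vocab=1000, audio_vocab=1024, d_model=512,
                 n_layers=8, n_heads=8, clap_dim=512):
        super().__init__()
        self.timbre_proj = nn.Linear(clap_dim, d_model)    # CLAP embedding -> one prefix token
        self.midi_embed = nn.Embedding(midi_vocab, d_model)
        self.audio_embed = nn.Embedding(audio_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)  # causal mask below makes it decoder-only
        self.head = nn.Linear(d_model, audio_vocab)

    def forward(self, clap_emb, midi_tokens, audio_tokens):
        # Sequence layout: [timbre prefix, MIDI tokens, DAC audio tokens].
        # Positional embeddings and DAC's multiple RVQ codebooks are omitted
        # for brevity; a real system must handle both.
        prefix = self.timbre_proj(clap_emb).unsqueeze(1)
        x = torch.cat([prefix,
                       self.midi_embed(midi_tokens),
                       self.audio_embed(audio_tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        return self.head(h)  # next-token logits over the DAC audio vocabulary
```

During training, only the positions corresponding to audio tokens would contribute to the cross-entropy loss; the prefix and MIDI positions serve purely as conditioning context.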

1. Zero-shot Instrument Timbre Cloning

TokenSynth is capable of performing zero-shot instrument timbre cloning. Given a short (5-second) single-instrument reference audio clip and arbitrary MIDI notes, TokenSynth generates audio that plays the given MIDI notes with the timbre of the reference audio. Note that the instruments in the following examples were unseen during training.

The reference audio is used to extract the CLAP embedding. The ground-truth audio plays the same MIDI notes given to TokenSynth as input, using the same instrument as the reference audio.
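A hypothetical inference loop for this cloning setup is sketched below. The helpers extract_clap_audio_embedding, extract_clap_text_embedding, tokenize_midi, and dac_decode are placeholders standing in for the pre-trained CLAP encoder, the MIDI tokenizer, and the DAC decoder; the blending weight alpha and greedy decoding are illustrative choices, not the paper's exact procedure.

```python
# Hypothetical zero-shot timbre cloning loop; helper functions are placeholders.
import torch

@torch.no_grad()
def clone_instrument(model, reference_wav, midi_path, text=None, alpha=0.5,
                     max_tokens=2000, bos_id=0, eos_id=1):
    timbre = extract_clap_audio_embedding(reference_wav)          # (1, clap_dim)
    if text is not None:
        # (b) Optionally blend the audio embedding with a text-derived one.
        timbre = alpha * timbre + (1 - alpha) * extract_clap_text_embedding(text)
    midi_tokens = tokenize_midi(midi_path)                        # (1, T_midi)
    audio_tokens = torch.tensor([[bos_id]])
    for _ in range(max_tokens):
        logits = model(timbre, midi_tokens, audio_tokens)         # see sketch above
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        if next_token.item() == eos_id:
            break
        audio_tokens = torch.cat([audio_tokens, next_token], dim=1)
    return dac_decode(audio_tokens[:, 1:])                        # tokens -> waveform
```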

1.1 Dry Audio

Each cell lists the audio sample (.wav) and its spectrogram (.png).

Reference | Ground-Truth | TokenSynth | TokenSynth-Aug
01_03.wav (01_03.png) | 01_04.wav (01_04.png) | 01_00004.wav (01_00004.png) | 01_00004.wav (01_00004.png)
11_20.wav (11_20.png) | 11_19.wav (11_19.png) | 11_00019.wav (11_00019.png) | 11_00019.wav (11_00019.png)
21_07.wav (21_07.png) | 21_08.wav (21_08.png) | 21_00008.wav (21_00008.png) | 21_00008.wav (21_00008.png)