MVSep Bowed Strings (strings, other)
The MVSep Bowed Strings model provides high-quality separation of music into bowed string instruments and everything else. List of instruments: Fiddle, Violin, Viola, Cello, Double Bass.

The MVSep Wind model produces high-quality separation of music into a wind part and everything else. The MVSep Wind model exists in 2 variants based on the following architectures: MelRoformer and SCNet Large. Wind includes 2 categories of instruments: brass and woodwind. More specifically, the wind stem includes: flute, saxophone, trumpet, trombone, horn, clarinet, oboe, harmonica, bagpipes, bassoon, tuba, kazoo, piccolo, flugelhorn, ocarina, shakuhachi, melodica, reeds, didgeridoo, musette, gaida.
Quality metrics (Wind dataset)
| Algorithm name | SDR Wind | SDR Other |
| --- | --- | --- |
| MelBand Roformer | 6.73 | 16.10 |
| SCNet Large | 6.76 | 16.13 |
| MelBand + SCNet Ensemble | 7.22 | 16.59 |
| MelBand + SCNet Ensemble (+extract from Instrumental) | --- | --- |
| BS Roformer | 9.82 | 19.19 |
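
MVSep does not publish the exact ensembling procedure, but a common way to combine two separation models is to average their output waveforms (optionally weighted by each model's validation SDR). A minimal sketch in Python, assuming hypothetical output file names for the two wind stems:

```python
import soundfile as sf

# Hypothetical file names for the wind stems produced by the two models.
mel, sr = sf.read("wind_melroformer.wav")
scnet, _ = sf.read("wind_scnet.wav")

# Align lengths and average the two estimates (equal weights here;
# weights proportional to validation SDR are another common choice).
n = min(len(mel), len(scnet))
ensemble = 0.5 * mel[:n] + 0.5 * scnet[:n]

sf.write("wind_ensemble.wav", ensemble, sr)
```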

The MVSep Brass model provides high-quality separation of music into brass wind instruments and everything else. List of instruments: trumpet, trombone, horn, tuba, flugelhorn, untagged brass.

The MVSep Woodwind model provides high-quality separation of music into woodwind instruments and everything else. List of instruments: oboe, saxophone, flute, bassoon, clarinet, piccolo, english horn, untagged woodwind.

The bagpipe (Bagpipes) is a traditional wind musical instrument known for its characteristic piercing and continuous sound.
How it is constructed:
- The bag (reservoir): usually made of animal skin or modern synthetic materials; it stores the air.
- Blowpipe: through this, the musician fills the bag with air using their mouth (in some variations, small bellows pumped by the elbow are used instead).
- Melody pipe (chanter): a pipe with finger holes on which the musician plays the main melody.
- Drone pipes (drones): one or more pipes, each producing a constant, sustained background note.
The principle of playing is that the musician inflates the bag and then presses on it with their arm, evenly pushing air into the sound pipes. Thanks to this reservoir, the music does not stop, even when the performer takes a breath.
Although the bagpipe is most often associated with Scotland (Great Highland Bagpipe) and Celtic culture, various historical variations of it exist throughout Europe, North Africa, and the Middle East.

The MVSep Percussion model provides high-quality separation of music into percussion instruments and everything else. List of instruments: bells, tubular bell, cow bell, congas, celeste, marimba, glockenspiel, tambourine, timpani, triangle, wind chimes, bongos, clap, xylophone, mallets, metal bars, wooden bars.

BandIt Plus is a model for separating tracks into speech, music and effects. It can be useful for television or film clips. The model was prepared by the authors of the paper "A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation" in their GitHub repository. It was trained on the Divide and Remaster (DnR) dataset and currently has the best quality metrics among similar models.
Quality table (DnR dataset, test split)
| Algorithm name | SDR Speech | SDR Music | SDR Effects |
| --- | --- | --- | --- |
| BandIt Plus | 15.62 | 9.21 | 9.69 |
Bandit v2 is a model for cinematic audio source separation into 3 stems: speech, music and effects/sfx. It was trained on the DnR v3 dataset.
More information in official repository: https://github.com/kwatcharasupat/bandit-v2
Paper: https://arxiv.org/pdf/2407.07275
MVSep DnR v3 is a cinematic model for splitting tracks into 3 stems: music, sfx and speech. It is trained on the huge multilingual DnR v3 dataset. Its quality metrics on the test data turned out to be better than those of the similar multilingual model Bandit v2. The model is available in 3 variants: based on the SCNet and MelBand Roformer architectures, and an ensemble of these two models. See the table below:
SDR metrics on the DnR v3 leaderboard:
| Algorithm name | music (SDR) | sfx (SDR) | speech (SDR) |
| --- | --- | --- | --- |
| SCNet Large | 9.94 | 11.35 | 12.59 |
| Mel Band Roformer | 9.45 | 11.24 | 12.27 |
| Ensemble (Mel + SCNet) | 10.15 | 11.67 | 12.81 |
| Bandit v2 (for reference) | 9.06 | 10.82 | 12.29 |
Braam is not a traditional physical instrument, but a powerful cinematic sound effect (virtual instrument) that has become an absolute standard in modern film and trailer music.
Main features:
Sound: It is a massive, low-frequency, rumbling, and often aggressive sound. It resembles an apocalyptic blast of a huge ship's horn, heavy metallic scraping, or an alarm signal.
Origin: This sound gained massive popularity after the release of the movie "Inception" (2010) with music by Hans Zimmer, which is why it is often called the Inception Horn.
How it is created: As a rule, it is the result of complex sound design. The base is formed by powerful low brass instruments (trombones, tubas, French horns). Then they are layered over heavy synthesizer basses and heavily processed with effects: distortion, saturation, and deep reverberation.
Today, Braam exists in the form of ready-made samples and libraries for virtual synthesizers (VST plugins), which composers use to instantly give a track scale, tension, or an epic feel.

The algorithm restores the quality of audio. The model was proposed in this paper and published on GitHub.
There are 3 models available:
1) MP3 Enhancer (by JusperLee) - restores MP3 files compressed at a 32 kbps bitrate up to 128 kbps quality. It will not work for files with a higher bitrate.
2) Universal Super Resolution (by Lew) - restores the higher frequencies of any music.
3) Vocals Super Resolution (by Lew) - restores the higher frequencies and overall quality of any vocals.
A set of different models to remove the reverberation effect from music/vocals.
| Author | Architecture | Works with | SDR (no independent testing yet) | Link |
| FoxJoy | MDX-B | Full track | ~6.50 | |
| anvuew | MelRoformer | Only vocals | 7.56 | |
| anvuew | BSRoformer | Only vocals | 8.07 | |
| anvuew v2 | MelRoformer | Only vocals | --- | |
| Sucial | MelRoformer | Only vocals | 10.01 | |
| anvuew | BSRoformer | Only vocals (Room) | 13.74 | HF Link |
| anvuew | BSRoformer | Only vocals (Stereo) | 22.50 | HF Link |
Reverberation (Reverb) is the physical process of gradual sound decay in an enclosed space after the sound source has stopped. If a regular echo consists of distinct, separate copies of a sound (like shouting in the mountains: "Hello... hello... hello"), then reverberation is a dense, continuous humming cloud of thousands of blended reflections from walls, the floor, the ceiling, and other surfaces (like the sound of a clap in an empty cathedral or a stairwell).
In audio engineering, the reverb effect is used to place a dry (studio-recorded) sound into a virtual space and give it volume and depth.
Acoustically, this process can be divided into three stages:
Direct Sound: The sound wave that reaches the listener or microphone in a straight line, without any reflections. This is the loudest and clearest signal.
Early Reflections: The first echoes that bounce off the nearest surfaces and reach the ears a few milliseconds after the direct sound. They are what give our brain information about the size and shape of the room we are in.
Reverb Tail (Late Reflections): A multitude of chaotic, intertwining reflections that bounce off surfaces again and again. They merge into a continuous hum and gradually lose energy (decay).
When you open a reverb plugin in a DAW (Digital Audio Workstation), you control the physical properties of this virtual room:
Size / Room Size: Sets the volume of the virtual space (from a tiny vocal booth to a massive stadium).
Decay / Reverb Time / RT60: The time (usually in seconds) it takes for the reverb tail to decay by 60 decibels, meaning it practically disappears.
Pre-Delay: A very important parameter that sets the pause (in milliseconds) between the direct sound and the onset of reverberation. Increasing Pre-Delay helps separate the vocal or instrument from the "tail", preserving their clarity while maintaining the sense of a large space.
Damping: Simulates sound absorption. In real life, soft surfaces (carpets, people, curtains) quickly absorb high frequencies, so a long reverb tail usually sounds more muffled than the direct signal.
Mix / Dry/Wet: The ratio between the original dry signal (Dry) and the processed signal (Wet).
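
To make these controls concrete, below is a deliberately simplified feedback-delay reverb in Python. It is only an illustrative sketch of how the parameters interact, not how any real plugin is implemented; the function name and delay times are arbitrary choices, and the per-sample loop is intentionally naive rather than efficient.

```python
import numpy as np

def toy_reverb(dry, sr, room_size=0.7, decay_s=2.0, pre_delay_ms=20.0,
               damping=0.3, mix=0.25):
    """Toy reverb for a mono signal: comb filters with feedback, pre-delay,
    damping and a dry/wet mix, mirroring the plugin controls described above."""
    pre = int(sr * pre_delay_ms / 1000.0)          # Pre-Delay in samples
    tail = int(sr * decay_s)                       # room for the reverb tail
    wet = np.zeros(len(dry) + pre + tail)

    # A few staggered delay times (ms), scaled by Room Size, stand in for
    # early reflections whose repeats merge into the late tail.
    for d_ms in np.array([29.7, 37.1, 41.1, 43.7]) * (0.5 + room_size):
        n = max(1, int(sr * d_ms / 1000.0))
        # Feedback gain chosen so the tail drops ~60 dB over decay_s seconds (RT60).
        g = 10.0 ** (-3.0 * (d_ms / 1000.0) / decay_s)
        y = np.zeros_like(wet)
        lp = 0.0
        for i in range(len(wet)):
            x = dry[i - pre] if pre <= i < pre + len(dry) else 0.0
            fb = y[i - n] if i >= n else 0.0
            # Damping: a one-pole low-pass in the feedback path dulls the tail over time.
            lp = (1.0 - damping) * fb + damping * lp
            y[i] = x + g * lp
        wet += y / 4.0

    out = wet * mix                                # Wet
    out[:len(dry)] += dry * (1.0 - mix)            # Dry
    return out
```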
Creating depth (staging): Reverb acts as the Z-axis (depth) in a mix. A loud and dry sound appears close to the listener (right in their face), while a quiet sound with a lot of reverb seems distant.
Gluing the mix: If all instruments are recorded in different deadened studios, the mix can sound disjointed. Sending them to a common reverb bus (even in small amounts) places them into a single acoustic space.
Artistic effect: Creating an ethereal, ambient, or epic atmosphere (for example, the Shimmer effect, where the reverb tail is also pitched up an octave).
Removing reverberation (or dereverberation) is the process of cleaning an audio signal from acoustic room reflections to obtain the original dry sound. Although reverb makes a sound beautiful and spacious, in many professional scenarios this effect turns into unwanted noise or a serious obstacle. Here are the main reasons why there is a need to "dry out" the sound:
Music Source Separation: When extracting vocals or individual instruments from a mixed stereo track, reverb tails create a serious problem — they "eat into" the useful signal. Effective dereverberation allows you to get a truly clean acapella or instrument stem that sounds as if it was just recorded in a studio, rather than extracted from a concert hall.
Automatic Speech Recognition (ASR) systems: Room echo and hum are the worst enemies of acoustic models. Reflections "smear" short consonant sounds and phonemes. In complex machine learning tasks, such as creating children's speech recognition models, where articulation is often already unclear, the presence of reverberation catastrophically reduces transcription accuracy. Therefore, dereverberation is a critical preprocessing step for audio datasets.
Sampling and remixing: If you take a vocal sample or a drum loop from an old recording, it already contains the space of the original mix. If you add this sample to your track and apply your own new reverb on top of it, it will result in acoustic "mud" (the effect of reverb on reverb). To integrate someone else's sound into your mix architecture, it must first be cleaned.
Video and film post-production (ADR & Location Sound): Actors' speech is often recorded with shotgun microphones right on the set (for example, in an echoey empty room or a stairwell). For the dialogue to sound tight, intelligible, and studio-quality, the sound engineer needs to suppress the natural reflections of the location.
Restoration and forensics: Recordings from surveillance cameras, hidden microphones, or dictaphones often contain so much room hum that the words become unintelligible. Suppressing the reverberation helps restore speech intelligibility.
How does it work technologically? In the past, sound engineers tried to combat the room using Noise Gates and Transient Shapers, which simply cut off the quiet tails of the sounds. This was a crude method and often distorted the useful signal itself. Today, the task of dereverberation is solved using AI and neural networks that are trained to analyze the spectrogram, distinguish direct signal patterns from reflection patterns, and mathematically subtract the latter without damaging the original.

AudioSR algorithm: Versatile Audio Super-resolution at Scale. The algorithm restores high frequencies. It works on all types of audio (e.g., music, speech, a dog barking, rain, ...). It was initially trained on mono audio, so results on stereo can be less stable.
Metric on Super Resolution Checker for Music Leaderboard (Restored): 25.3195
Authors' paper: https://arxiv.org/pdf/2309.07314
Original repository: https://github.com/haoheliu/versatile_audio_super_resolution
Original inference script prepared by @jarredou: https://github.com/jarredou/AudioSR-Colab-Fork

FlashSR - audio super resolution algorithm for restoring high frequencies. It's based on paper FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation.
Metric on Super Resolution Checker for Music Leaderboard (Restored): 22.1397
Original repository: https://github.com/jakeoneijk/FlashSR_Inference
Inference script by @jarredou: https://github.com/jarredou/FlashSR-Colab-Inference
Audio generation based on a given text prompt. The generation uses the Stable Audio Open 1.0 model. Audio is generated in Stereo format with a sample rate of 44.1 kHz and duration up to 47 seconds. The quality is quite high. It's better to make prompts in English.
Example prompts:
1) Sound effects generation: cats meow, lion roar, dog bark
2) Sample generation: 128 BPM tech house drum loop
3) Specific instrument generation: A Coltrane-style jazz solo: fast, chaotic passages (200 BPM), with piercing saxophone screams and sharp dynamic changes
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It has several versions; on MVSep we use the largest and most precise one: "Whisper large-v3". The Whisper large-v3 model was trained on several million hours of audio. It is a multilingual model and detects the language automatically. To apply the model to your audio you have 2 options:
1) "Apply to original file" - it means that whisper model will be applied directly to file you submit
2) "Extract vocals first" - in this case before using whisper, BS Roformer model is applied to extract vocals first. It can remove unnecessary noise to make output of Whisper better.
The original model has some problems with transcription timings; this was fixed by @linto-ai, whose transcription is used by default (option: "New timestamped"). You can return to the original timings by choosing the option "Old by whisper".
More info on model can be found here: https://huggingface.co/openai/whisper-large-v3 and here: https://github.com/openai/whisper
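For reference, the open-source whisper Python package exposes the same large-v3 checkpoint. A minimal local sketch (not necessarily how MVSep invokes it; the timestamps here are Whisper's original ones, not the @linto-ai corrected version):

```python
import whisper

model = whisper.load_model("large-v3")      # downloads the checkpoint on first use
result = model.transcribe("vocals.wav")     # language is detected automatically

print(result["text"])                       # full transcription
for seg in result["segments"]:              # per-segment timestamps
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text']}")
```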
Parakeet is a family of state-of-the-art Automatic Speech Recognition (ASR) models developed by NVIDIA in collaboration with Suno.ai. These models are built on the Fast Conformer architecture, designed to deliver a balance of high transcription accuracy and exceptional inference speed. They are widely recognized for outperforming much larger models (like OpenAI's Whisper) in efficiency while maintaining competitive or superior Word Error Rates (WER). Quality metric WER: 6.03 on Huggingface Open ASR Leaderboard.
MVSep provides two versions of the model (v2 and v3):
Model page v2: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
Model page v3: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
Released as a highly efficient English-focused model, v2 established Parakeet as a leader in speed-to-accuracy ratio.
The v3 release marked the expansion of the efficient Parakeet architecture from English-only to a multilingual domain without increasing the model size.
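Both checkpoints can also be loaded locally through NVIDIA's NeMo toolkit; the sketch below follows the usage example from the model cards (the exact return type of transcribe can vary between NeMo versions, and the file name is a placeholder):

```python
import nemo.collections.asr as nemo_asr

# Load the v2 (English) checkpoint; swap in parakeet-tdt-0.6b-v3 for the multilingual model.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")

# Transcribe a list of 16 kHz mono WAV files.
output = asr_model.transcribe(["speech.wav"])
print(output[0].text)
```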

VibeVoice is a model for generating natural conversational dialogues from text, with the ability to use a reference voice for cloning purposes.
A good reference recording needs phonetic diversity (all the sounds of the language) and lively intonation. A text of about 35–40 words, read calmly, takes just ~15 seconds.
Here are three options in English for different tasks:
1) The best choice for general use. Contains complex sound combinations to tune clarity.
"To create a perfect voice clone, the AI needs to hear a full range of phonetic sounds. I am speaking clearly, taking small pauses, and asking: can you hear every detail? This short sample captures the unique texture and tone of my voice."
2) For voiceovers in videos, YouTube, or blogs. Read vividly, with a smile, changing the pitch of your voice.
"Hey! I’m recording this clip to test how well the new technology works. The secret is to relax and speak exactly like I would to a friend. Do you think the AI can really copy my style and energy in just fifteen seconds?"
3) For presentations, audiobooks, or official announcements. Read confidently, slightly slower, emphasizing word endings.
"Voice synthesis technology is rapidly changing how we communicate in the digital age. It is essential to speak with confidence and precision to ensure high-quality output. This brief recording provides all the necessary data for a professional and accurate digital clone."
Pronunciation: Try to articulate word endings clearly (especially t, d, s, ing). Models "love" clear articulation.
Flow: Don't read like a robot. In English, intonation (the melody of the voice) matters: the voice should "float" up and down a bit rather than staying on a single note.
Breathing: If you pause at a comma or period, don't be afraid to take an audible breath. This will add realism to the clone.
VibeVoice (TTS) is a model for generating natural conversational dialogues from text, capable of creating dialogues with up to 4 speakers and durations of up to 90 minutes.
The text must be in English or Chinese; quality is not guaranteed for other languages. The maximum text length is 5000 characters. Avoid special characters. The text must be formatted specifically to indicate speakers:
Speaker 1: Hello! How are you today?
Speaker 2: I'm doing great, thanks for asking!
Speaker 1: That's wonderful to hear.
Speaker 3: Hey everyone, sorry I'm late!
Hello! How are you today?
I'm doing great!
Important:
- Use the format Speaker N: (where N is a number from 1 to 4).
- The label is case-insensitive: Speaker 1: = speaker 1: = SPEAKER 1.
- If you need a monologue, you do not need to specify a speaker.
A small helper for producing this format is sketched after the examples below.
Monologue (1 speaker):
Speaker 1: Today I want to talk about artificial intelligence.
Speaker 1: It's changing our world in incredible ways.
Speaker 1: From healthcare to entertainment, AI is everywhere.
Dialogue (2 speakers):
Speaker 1: Have you tried the new restaurant downtown?
Speaker 2: Not yet, but I've heard great things about it!
Speaker 1: We should go there this weekend.
Speaker 2: That sounds like a perfect plan!
Group conversation (3-4 speakers):
Speaker 1: Welcome to our podcast, everyone!
Speaker 2: Thanks for having us!
Speaker 3: It's great to be here.
Speaker 4: I'm excited to share our thoughts today.
Speaker 1: Let's start with introductions.
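The "Speaker N:" format is easy to generate programmatically. Below is a small hypothetical helper (not part of VibeVoice or MVSep) that assembles a script and enforces the limits described above:

```python
def format_vibevoice_script(turns, max_speakers=4, max_chars=5000):
    """turns: list of (speaker_number, text) tuples; returns a 'Speaker N:' formatted script."""
    lines = []
    for speaker, text in turns:
        if not 1 <= speaker <= max_speakers:
            raise ValueError(f"Speaker number must be between 1 and {max_speakers}, got {speaker}")
        lines.append(f"Speaker {speaker}: {text.strip()}")
    script = "\n".join(lines)
    if len(script) > max_chars:
        raise ValueError(f"Script is {len(script)} characters, limit is {max_chars}")
    return script


print(format_vibevoice_script([
    (1, "Have you tried the new restaurant downtown?"),
    (2, "Not yet, but I've heard great things about it!"),
]))
```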
Qwen3-TTS is a powerful speech generation model offering comprehensive support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It provides developers and users with the most extensive set of speech generation features available. At MVSep, we use the largest 1.7 billion parameter model.
Original model page: https://github.com/QwenLM/Qwen3-TTS
Qwen3-TTS (Custom Voice) offers a set of 9 pre-defined speakers. Optionally, you can specify a "Voice description" to include emotions like "happy voice" or "sad voice". You can also choose the language for this model or leave it as "auto".
Qwen3-TTS is a powerful speech generation model offering support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It provides developers and users with the most extensive set of speech generation features available. At MVSep, we use the largest 1.7 billion parameter model.
Original model page: https://github.com/QwenLM/Qwen3-TTS
Qwen3-TTS (Voice Design) allows you to generate speech with a custom voice that can be described in detail in the "Voice description" field. You can specify the speaker's gender and age, and add emotions, such as "happy voice" or "sad voice". You can also choose the language for this model or leave it as "auto".
Qwen3-TTS is a powerful speech generation model offering support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control. It provides developers and users with the most extensive set of speech generation features available. At MVSep, we use the largest 1.7 billion parameter model.
Original model page: https://github.com/QwenLM/Qwen3-TTS
Qwen3-TTS (Voice Cloning) allows you to upload a reference audio file to generate the target text using the sample voice. To improve cloning quality, you can optionally provide the audio transcript in the "Reference text in audio" field. You can also choose the language for this model or leave it as "auto".
Bark is a transformer-based model created by Suno, representing not just a traditional text-to-speech tool, but a fully generative "text-to-audio" system. Its capabilities go far beyond ordinary voicing: besides creating highly realistic speech in multiple languages, Bark can generate music, background noises, and simple sound effects. A unique feature of the model is its ability to reproduce subtle non-verbal cues, such as laughter, sighs, and crying, making the resulting sound maximally alive and natural.
Striving to support the community, the developers have opened access to pre-trained checkpoints that are ready to use, even for commercial purposes. However, it is important to keep in mind that Bark was created primarily for research tasks. Being a fully generative model, it can behave unpredictably and sometimes deviate from the provided text prompts.
Official model repository: https://github.com/suno-ai/bark
Unlike classic TTS systems, Bark does not use SSML markup. Instead, it is trained to recognize specific text inserts (tags) as instructions for generating sounds.
All control commands are written in square brackets. Important: The tags themselves must be written in English, even if the main text you are generating is in Russian, Spanish, or any other language.
Syntax:
Text before effect [effect_tag] text after effect.
Bark officially recognizes the following set of tokens for non-verbal sounds:
| Tag | Description | Usage Example |
| --- | --- | --- |
| [laughter] | Loud, distinct laugh | Hello! [laughter] That was so funny. |
| [laughs] | Short chuckle, giggling | Well yes, of course [laughs]. |
| [sighs] | Heavy sigh (fatigue, relief) | [sighs] I'm so tired of this work. |
| [music] | Instrumental music insertion | [music] (background music playing) |
| [gasps] | Sharp breath (fright, surprise) | [gasps] I didn't expect to see you here! |
| [clears throat] | Throat clearing (attracting attention) | [clears throat] Gentlemen, may I have your attention. |
Note: Variations like [man laughs] and [woman laughs] also exist, but they work most stably if the speaker's gender (Speaker History) matches the tag.
To make the model "sing" the text rather than read it, musical notes are used.
Method: Wrap the text in musical note symbols ♪ (Shift + Alt + V on Mac or Alt+13 on Win, or just copy).
Example: ♪ In the jungle, the mighty jungle, the lion sleeps tonight ♪
Tip: This works best if you use English, as the training dataset contained many English songs, but results can be achieved in other languages too.
Although there are no special tags for pauses, Bark is sensitive to punctuation and special characters, as it treats the text as a structured input.
Ellipsis and dash (..., —): Use an ellipsis or an em dash to create pauses, hesitations, or hitches in speech.
Example: I... I'm not sure that's right.
CAPS LOCK: Sometimes (not guaranteed) writing a word in CAPITAL LETTERS can add emphasis or increase volume.
Probabilistic nature: Bark is a GPT for audio. If you write [laughter], the model will with high probability generate laughter, but sometimes it may ignore the tag or generate a strange sound.
Context matters: The tag [laughter] will work more naturally after a joke than in the middle of a tragic sentence. The model "understands" the semantics of the text.
Whispering: There is no official [whisper] tag. However, the community has noticed that adding words like "quietly" or using specific speakers (Speaker Prompts) sometimes helps, but this is a trial and error method.
Site limitations: currently, all submitted texts are trimmed to 1000 characters.
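For local experiments, the quick-start from the official bark repository boils down to a few lines. A minimal sketch showing the square-bracket tags in a prompt (remember that, as noted above, the tags are probabilistic and may occasionally be ignored):

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # download and cache the Bark checkpoints on first run

# Square-bracket tags are interpreted as non-verbal sounds rather than read aloud.
text = "Well... [clears throat] I have some news. [laughter] No, really, it's good news!"
audio_array = generate_audio(text)

write_wav("bark_output.wav", SAMPLE_RATE, audio_array)
```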
MVSep MultiSpeaker (MDX23C) - this model tries to isolate the loudest voice from all other voices. It uses the MDX23C architecture. Still under development.
The algorithm adds a "whispering" effect to vocals. The model was created by SUC-DriverOld. More details here.
The Aspiration model separates out:
1) Audible breaths
2) Hissing and buzzing of fricative consonants ('s' and 'f')
3) Plosives: the voiceless burst of air produced while singing a consonant (like /p/, /t/, /k/).
When listening to music in stereo, we often clearly hear the vocals sounding right in front of us. But if you look at your speaker system, you will only see two speakers on the sides. There is no sound source in the middle. What you are hearing is the phantom center.
How does it work? The phantom center is a psychoacoustic illusion. It occurs when the left and right channels reproduce the exact same mono signal at identical volume levels and in perfect phase. Our brain processes the sound arriving at both ears simultaneously and creates a virtual sound source right in the center.
The foundation of the mix: Traditionally, the most important and high-energy elements of a track are placed in the center. The lead vocal, bass guitar, kick drum, and snare drum are panned dead center. This ensures that the low-frequency energy is evenly distributed across both monitors, making the mix punchy and tight.
Contrast and width: The phantom center serves as a reference point for the rest of the mix. Wide stereo effects, double-tracked guitars, or spatial synthesizers sound wide precisely because of the contrast with a dense and narrow phantom center.
Mid/Side processing: In modern mastering and stem extraction (source separation) technologies, the phantom center is often isolated into a separate channel — Mid, which is calculated as the sum of the left and right channels. This allows for EQing or isolating the vocals and rhythm section without affecting the instruments playing on the sides (Side).
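The mid/side decomposition mentioned here is a one-line calculation. A minimal sketch with a placeholder file name (note that the neural phantom-center models below do much more than this naive sum, since anything equally loud in both channels also lands in the mid signal):

```python
import numpy as np
import soundfile as sf

stereo, sr = sf.read("song.wav")          # shape: (num_samples, 2)
left, right = stereo[:, 0], stereo[:, 1]

mid = (left + right) / 2.0                # phantom center (sum signal)
side = (left - right) / 2.0               # everything panned to the sides

sf.write("mid.wav", mid, sr)
sf.write("side.wav", side, sr)

# The decomposition is lossless: L = mid + side, R = mid - side.
assert np.allclose(left, mid + side) and np.allclose(right, mid - side)
```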
The fragility of the phantom center: This illusion is very fragile. If the signal in one of the channels is delayed by even a millisecond or its phase is disrupted, the center will "drift" or disappear entirely due to phase cancellation. Therefore, when working with stereo wideners, it is always important to check the track for mono compatibility.
We currently have two algorithms available for phantom center extraction. Below are their quality metrics obtained on the validation dataset:
| Model | Center SDR | Center L1Freq | Center fullness | Center bleedless |
| Phantom Centre by wesleyr36 (mdx23c) | 8.25 | 27.52 | 19.44 | 38.92 |
| Phantom Centre by gilliaan (BSRoformer) | 16.45 | 44.00 | 37.17 | 48.76 |
| Phantom Centre by gilliaan (mdx23c) | 18.93 | 49.20 | 45.85 | 45.54 |

Matchering is a novel tool for audio matching and mastering. It follows a simple idea - you take TWO audio files and feed them into Matchering:
- TARGET: the track you want to master,
- REFERENCE: the track whose sound you want to match.
The algorithm matches these two tracks and provides you with the mastered TARGET track, having the same RMS, frequency response (FR), peak amplitude and stereo width as the REFERENCE track.
It is based on code by @sergree.
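The underlying open-source matchering package by @sergree can be run locally in a few lines; this sketch follows the package README, and the file names are placeholders (MVSep's hosted version may use different defaults):

```python
import matchering as mg

mg.process(
    target="my_mix.wav",                        # TARGET: the track you want to master
    reference="reference_track.wav",            # REFERENCE: the track whose sound to match
    results=[mg.pcm16("my_mix_mastered.wav")],  # write the result as a 16-bit PCM WAV
)
```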
SOME (Singing-Oriented MIDI Extractor) is a MIDI extractor that can convert a singing voice into a MIDI sequence. The model was trained only on Chinese vocals, so it might not work as well for other languages.
Original page: https://github.com/openvpi/SOME
Transkun is a modern open-source model for automatic piano music transcription (Audio-to-MIDI). The official page of the model is here. It is considered one of the best (SOTA, State of the Art) in its class. The model can recognize not only the notes themselves but also their duration, loudness (velocity), and pedal usage.
Unlike many older models that analyze music frame-by-frame (frame-based), Transkun uses the Neural Semi-CRF (semi-Markov Conditional Random Field) approach. Instead of asking "is a note sounding at this millisecond?", the model treats events as whole intervals (from the start to the end of a note). The latest versions use a non-hierarchical Transformer, which estimates the probability that a specific time segment is a note. Decoding: the Viterbi algorithm is used to find the most probable sequence of non-overlapping intervals. The model demonstrates excellent results on the MAESTRO dataset (the industry standard).

Basic Pitch is a modern neural network from Spotify’s Audio Intelligence Lab that converts melodic audio recordings into notes (MIDI format). Unlike outdated converters, this model can "hear" not only individual notes but also chords, along with the finest nuances of a performance. Official page: https://github.com/spotify/basic-pitch
Basic Pitch is an "instrument-agnostic" model. This means it handles different timbres equally well:
- Vocals: Hum a melody into a microphone, and the neural network will turn your voice into a synthesizer part.
- Strings: Acoustic and electric guitar, violin, cello.
- Keyboards: Pianos, organs, and synthesizers.
- Winds: Flute, saxophone, trumpet, and others.
Important: The model is designed for melodic instruments. It is not suitable for drums or percussion, as it focuses on pitch rather than rhythmic noise.
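For local use, the basic-pitch Python package exposes a simple predict function (as documented in the repository above). A minimal sketch with a placeholder file name:

```python
from basic_pitch import ICASSP_2022_MODEL_PATH
from basic_pitch.inference import predict

# Returns the raw model output, a PrettyMIDI object, and a list of note events.
model_output, midi_data, note_events = predict("melody.wav", ICASSP_2022_MODEL_PATH)

midi_data.write("melody.mid")   # save the transcription as a MIDI file
for event in note_events[:5]:   # (start_s, end_s, midi_pitch, amplitude, pitch_bends)
    print(event)
```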
