SoVitsSvc is not the only project in the field of Singing Voice Conversion; there are many others, which will not be listed here. This project has been officially discontinued and archived. However, other enthusiasts have created their own branches of SoVitsSvc and continue to maintain it (still unrelated to SvcDevelopTeam and the maintainers of this repository), and they have made some significant changes that you can explore on your own.
This project is an open-source, offline project, and all members of SvcDevelopTeam as well as all developers and maintainers of this project (hereinafter referred to as contributors) have no control over it. The contributors of this project have never provided any organization or individual with any form of assistance, including but not limited to dataset extraction, dataset processing, computing support, training support, and inference. Contributors to the project do not and cannot know what users are using the project for. Therefore, all AI models and synthesized audio based on the training of this project have nothing to do with the contributors of this project. All problems arising therefrom shall be borne by the user.
This project runs completely offline and cannot collect any user information or obtain user input data. Therefore, contributors to this project are not aware of any user inputs or models and are therefore not responsible for them.
This project is only a framework and does not itself provide speech synthesis functionality; all such functionality requires the user to train models themselves. No model is bundled with this project, and any secondarily distributed model has nothing to do with the contributors of this project.
Updated the 4.0-v2 model; the entire process is the same as 4.0. Compared to 4.0, it improves in certain scenarios but also regresses in some cases. Please refer to the 4.0-v2 branch for details.
| Branch | Feature | Compatible with the main-branch model |
| --- | --- | --- |
| 4.0 | Main branch | - |
| 4.0v2 | Uses the VISinger2 model | Incompatible |
| 4.0-Vec768-Layer12 | Feature input is the Layer 12 Transformer output of ContentVec | Incompatible |
The singing voice conversion model uses the SoftVC content encoder to extract speech features from the source audio. These feature vectors are fed directly into VITS instead of being converted to a text-based intermediate representation, so pitch and intonation are preserved. Additionally, the vocoder has been changed to NSF-HiFiGAN to solve the problem of sound interruption.
After conducting tests, we believe that the project runs stably on Python 3.8.9.
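As a minimal setup sketch (assuming the repository ships a standard `requirements.txt`), an environment can be prepared like this:

```shell
# Create and activate a Python 3.8 virtual environment, then install dependencies
python3.8 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```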
Download the ContentVec model and place it in the `hubert` directory:

```shell
# contentvec
wget -P hubert/ http://obs.cstcloud.cn/share/obs/sankagenkeshi/checkpoint_best_legacy_500.pt
# Alternatively, you can manually download and place it in the hubert directory
```
Pre-trained base model files `G_0.pth` and `D_0.pth`: place them in the `logs/44k` directory. Get them from svc-develop-team (TBD) or anywhere else.
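For reference, a minimal sketch of where the downloaded files end up (the source paths are illustrative):

```shell
# Place the pre-trained base model files in the model directory
mkdir -p logs/44k
mv /path/to/G_0.pth logs/44k/
mv /path/to/D_0.pth logs/44k/
```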
Although the pretrained base models generally do not cause copyright problems, please still pay attention to them: for example, ask the author in advance, or check that the author has clearly stated the permitted uses in the description.
If you are using the NSF-HIFIGAN enhancer, you will need to download the pre-trained NSF-HIFIGAN model; if you do not need it, you can skip this step. Place it in the `pretrain/nsf_hifigan` directory:

```shell
# nsf_hifigan
wget -P pretrain/ https://github.com/openvpi/vocoders/releases/download/nsf-hifigan-v1/nsf_hifigan_20221211.zip
unzip -od pretrain/nsf_hifigan pretrain/nsf_hifigan_20221211.zip
# Alternatively, you can manually download and place it in the pretrain/nsf_hifigan directory
# URL: https://github.com/openvpi/vocoders/releases/tag/nsf-hifigan-v1
```
Simply place the dataset in the `dataset_raw` directory with the following file structure.
```
dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav
```
You can customize the speaker name:

```
dataset_raw
└───suijiSUI
    ├───1.wav
    ├───...
    └───25788785-20221210-200143-856_01_(Vocals)_0_0.wav
```
Slice audio clips to 5–15 seconds; slightly longer is fine. Clips that are too long may cause `torch.cuda.OutOfMemoryError` during training or even during pre-processing.

You can slice with audio-slicer-GUI or audio-slicer-CLI. In general, only the Minimum Interval needs to be adjusted: for speech it can usually remain at the default, while for singing it can be lowered to 100 or even 50.

After slicing, delete audio that is too long or too short.
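As a hedged helper (not part of the project), the sketch below uses `ffprobe` to list clips whose duration falls outside an example range so they can be reviewed before deletion; the 2 s and 20 s thresholds are arbitrary:

```shell
# List wav files in dataset_raw that are shorter than 2 s or longer than 20 s
for f in dataset_raw/*/*.wav; do
  d=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$f")
  awk -v d="$d" 'BEGIN { exit !(d < 2 || d > 20) }' && echo "$f (${d}s)"
done
```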
```shell
# Resample to 44100 Hz mono
python resample.py
# Automatically split the dataset into training and validation sets, and generate configuration files
python preprocess_flist_config.py
# Generate hubert and f0 features
python preprocess_hubert_f0.py
```
After completing the above steps, the dataset directory will contain the preprocessed data, and the dataset_raw folder can be deleted.
- `keep_ckpts`: Keep only the last `keep_ckpts` checkpoints during training. Setting it to `0` keeps them all. Default is `3`.
- `all_in_mem`: Load the entire dataset into RAM. This can be enabled when disk I/O on your platform is too slow and system memory is much larger than your dataset.
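To double-check the current values before training, a quick lookup in the generated configuration file (assuming the fields appear verbatim in `configs/config.json`):

```shell
# Show the keep_ckpts and all_in_mem entries in configs/config.json
grep -E '"(keep_ckpts|all_in_mem)"' configs/config.json
```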
```shell
python train.py -c configs/config.json -m 44k
```
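If training writes TensorBoard event files under `logs/44k` (typical for VITS-based trainers, though an assumption here), progress can be monitored with:

```shell
# Requires TensorBoard to be installed
tensorboard --logdir logs/44k
```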
```shell
# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0
```
Required parameters:
- `-m` | `--model_path`: Path to the model.
- `-c` | `--config_path`: Path to the configuration file.
- `-s` | `--spk_list`: Target speaker name for conversion.
- `-n` | `--clean_names`: A list of wav file names located in the `raw` folder.
- `-t` | `--trans`: Pitch adjustment, supports positive and negative (semitone) values.

Optional parameters: see the next section.
- `-a` | `--auto_predict_f0`: Automatic pitch prediction for voice conversion. Do not enable this when converting songs, as it can cause serious pitch issues.
- `-cl` | `--clip`: Forced voice slicing, duration in seconds. Set to 0 to turn it off (default).
- `-lg` | `--linear_gradient`: The crossfade length between two audio slices, in seconds. If the voice sounds discontinuous after forced slicing, adjust this value; otherwise the default of 0 is recommended.
- `-cm` | `--cluster_model_path`: Path to the clustering model. Fill in any value if clustering has not been trained.
- `-cr` | `--cluster_infer_ratio`: Proportion of the clustering scheme, range 0-1. Set to 0 if the clustering model has not been trained.
- `-fmp` | `--f0_mean_pooling`: Apply a mean filter (pooling) to f0, which may improve some hoarse sounds. Enabling this option will reduce inference speed.
- `-eh` | `--enhance`: Whether to use the NSF-HIFIGAN enhancer. It can improve sound quality for models trained on small datasets, but has a negative effect on well-trained models, so it is turned off by default.
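For illustration only, a command combining some of the optional flags above with arbitrary values (check `python inference_main.py -h` for the exact behavior of each flag):

```shell
# Example: forced slicing every 30 s with a 1 s crossfade between slices
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0 -cl 30 -lg 1
```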
If the results from the previous section are satisfactory, or if you didn't understand what is discussed in the following sections, you can skip them; it won't affect model usage. (These optional settings have a relatively small impact; they may help with certain specific data, but in most cases the difference may not be noticeable.)
During 4.0 model training, an f0 predictor is also trained, which can be used for automatic pitch prediction during voice conversion. If the results are not good, manual pitch adjustment can be used instead. Please do not enable this feature when converting singing voice, as it may cause serious pitch shifting!
Set `auto_predict_f0` to `true` in `inference_main`.

Introduction: The clustering scheme can reduce timbre leakage and make the trained model sound closer to the target's timbre (although the effect is not very obvious). However, using clustering alone lowers the model's clarity (it may sound unclear). Therefore, this model adopts a fusion approach that linearly controls the proportion of the clustering and non-clustering schemes. In other words, you can manually adjust the ratio between "sounding like the target's timbre" and "being clear and articulate" to find a suitable trade-off point.
The existing steps before clustering do not need to be changed. All that is required is to train an additional clustering model, which has a relatively low training cost.

- Training: run `python cluster/train_cluster.py`. The output model will be saved in `logs/44k/kmeans_10000.pt`.
- Inference:
  - Specify `cluster_model_path` in `inference_main.py`.
  - Specify `cluster_infer_ratio` in `inference_main.py`, where `0` means not using clustering at all, `1` means using only clustering, and `0.5` is usually sufficient.
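Putting the two clustering options together, a hedged example (file names taken from the steps above; the ratio is simply the suggested 0.5):

```shell
# Example: inference with the trained clustering model at a 0.5 fusion ratio
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -s "nen" -n "君の知らない物語-src.wav" -t 0 -cm "logs/44k/kmeans_10000.pt" -cr 0.5
```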
Introduction: Mean filtering of f0 can effectively reduce the hoarseness caused by fluctuations in the predicted pitch (hoarseness caused by reverb or harmony cannot be eliminated for now). This feature noticeably improves some songs, but on others it causes the pitch to go out of tune. If the output sounds hoarse after inference, consider enabling it.
Set `f0_mean_pooling` to `true` in `inference_main.py`.
[23/03/16] Hubert no longer needs to be downloaded manually.
[23/04/14] Added support for the NSF_HIFIGAN enhancer.
- Use `onnx_export.py`.
- Create a new folder named `checkpoints` and open it, then create a project folder inside `checkpoints`, named after your project, for example `aziplayer`.
- Rename your model to `model.pth` and the configuration file to `config.json`, and place them in the `aziplayer` folder you just created.
- Change `"NyaruTaffy"` in `path = "NyaruTaffy"` in `onnx_export.py` to your project name, i.e. `path = "aziplayer"`.
- Run `onnx_export.py` and wait for it to finish; a `model.onnx` will be generated in your project folder, which is the exported model.
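A sketch of the file layout for the steps above, assuming the project is named `aziplayer` and `G_30400.pth` stands in for your trained checkpoint:

```shell
mkdir -p checkpoints/aziplayer
cp logs/44k/G_30400.pth checkpoints/aziplayer/model.pth     # your trained generator checkpoint
cp configs/config.json checkpoints/aziplayer/config.json
# After changing path = "NyaruTaffy" to path = "aziplayer" in onnx_export.py:
python onnx_export.py
```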
Note: For Hubert ONNX models, please use the models provided by MoeSS. They currently cannot be exported on your own (Hubert in fairseq has many unsupported operators and things involving constants that cause errors or produce problems with the input/output shapes and results when exported).
CppDataProcess contains some functions used to preprocess data in MoeSS.
If the original project is the Roman Empire, then this project is the Eastern Roman Empire (the Byzantine Empire), and so-vits-svc-5.0 is the Kingdom of Romania.
For some reason, the author deleted the original repository. Due to the negligence of the organization members, all files were directly re-uploaded to this repository at the start of its reconstruction, which cleared the contributor list. A list of previous contributors has now been added to README.md.

Some members are not listed, in accordance with their personal wishes.
- MistEO
- XiaoMiku01
- しぐれ
- TomoGaSukunai
- Plachtaa
- zd小达
- 凍聲響世
No organization or individual may infringe upon another person's portrait rights by defacement, defamation, or forgery by means of information technology. Without the consent of the portrait right holder, no one may produce, use, or publish the portrait of the right holder, except as otherwise provided by law. Without the consent of the portrait right holder, the holder of rights in a portrait work may not use or publish the portrait by means of publication, reproduction, distribution, rental, exhibition, or otherwise. The protection of a natural person's voice shall be governed by reference to the relevant provisions on the protection of portrait rights.
[Right of reputation] Civil subjects enjoy the right of reputation. No organization or individual may infringe upon another person's right of reputation by insult, defamation, or other means.
[Works infringing the right of reputation] Where a literary or artistic work published by a person describes real people and real events, or a specific person, and contains insulting or defamatory content that infringes another person's right of reputation, the injured party has the right to request that the person bear civil liability in accordance with the law. Where the literary or artistic work published does not describe a specific person, and merely some of its plot resembles that person's circumstances, the person shall not bear civil liability.