Our recent experience working with text-to-speech models has taught us a great deal about adapting TTS to downstream tasks in other languages via fine-tuning.
Text-to-Speech (TTS) technology is transforming how we engage with digital content, bridging written and spoken communication. At the forefront is Facebook's MMS (Massively Multilingual Speech) model, capable of handling TTS, Speech-to-Text (STT), and Language Identification (LID) tasks across 1,100 languages.
We've been hard at work leveraging this powerful model to develop a highly accurate TTS system, with a particular focus on Arabic.
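For context, here is a minimal sketch of how MMS-TTS can be queried through Hugging Face's transformers library, following its documented usage; facebook/mms-tts-ara is the public Arabic checkpoint, and the sample sentence is our own illustrative choice:

```python
# Minimal inference sketch using the public Arabic MMS-TTS checkpoint.
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

model = VitsModel.from_pretrained("facebook/mms-tts-ara")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-ara")

inputs = tokenizer("مرحبا بكم", return_tensors="pt")  # illustrative text
with torch.no_grad():
    waveform = model(**inputs).waveform  # shape: (batch, samples)

scipy.io.wavfile.write(
    "mms_arabic.wav",
    rate=model.config.sampling_rate,  # 16 kHz for MMS-TTS
    data=waveform.squeeze().numpy(),
)
```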
As we began evaluating the MMS model's output, we noticed a few accuracy issues related to how numbers are spoken.
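A common stopgap, before any retraining, is to normalize number-heavy input so the model never sees raw digits. The sketch below uses the num2words package for this; that choice is our assumption for illustration, not part of the MMS pipeline itself:

```python
# Hypothetical pre-processing step (our assumption, not part of MMS):
# spell out digit runs before synthesis so the model never sees "25".
import re
from num2words import num2words  # pip install num2words

def normalize_numbers(text: str, lang: str = "ar") -> str:
    """Replace each run of ASCII digits with its spoken-word form."""
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)

print(normalize_numbers("لدي 25 كتابا"))  # digits replaced by Arabic number words
```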
Ultimately, though, we understood that we would need to fine-tune Facebook's MMS model.
These are a few of the issues we faced when fine-tuning:
Lack of Official Fine-Tuning Framework: Facebook hasn't provided an official framework, but there are collaborative efforts on GitHub by individuals that yield promising results.
Voice Cloning Accuracy: Achieving high accuracy in voice cloning is challenging, especially with multi-speaker models.
We've utilized the VITS TTS model, fine-tuned on data prepared in the LJ Speech dataset format (sketched below), to improve clarity and consistency.
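For readers unfamiliar with the LJ Speech layout: it is a folder of WAV clips plus a pipe-delimited metadata.csv mapping each clip ID to its raw and normalized transcription. Here is a minimal sketch of assembling that structure; the file names and texts are placeholders:

```python
# Sketch of the LJ Speech dataset layout used when preparing training data:
#   dataset/metadata.csv -> "<clip_id>|<raw text>|<normalized text>"
#   dataset/wavs/<clip_id>.wav (LJ Speech's convention is mono 22.05 kHz)
import csv
from pathlib import Path

def write_metadata(rows, root="dataset"):
    """rows: iterable of (clip_id, raw_text, normalized_text) tuples."""
    root = Path(root)
    (root / "wavs").mkdir(parents=True, exist_ok=True)  # WAV clips live here
    with open(root / "metadata.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerows(rows)

# Placeholder entry; real rows pair each clip with its transcriptions.
write_metadata([("clip_0001", "نص تجريبي", "نص تجريبي")])
```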
For these reasons, we also opted for other open-source models that can be retrained more easily than Facebook's MMS. We accomplished this via VITS.
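Concretely, community toolkits such as Coqui TTS ship a VITS recipe that trains directly on LJ Speech-formatted data; Coqui is our example of such a toolkit rather than something the original MMS release prescribes, and the checkpoint paths below are hypothetical stand-ins:

```python
# Hedged sketch: loading a fine-tuned VITS checkpoint with Coqui TTS
# (one community toolkit with a VITS recipe; paths are hypothetical).
from TTS.api import TTS  # pip install TTS

tts = TTS(model_path="run/best_model.pth",
          config_path="run/config.json",
          progress_bar=False, gpu=False)
tts.tts_to_file(text="مرحبا بالعالم", file_path="sample.wav")
```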