I have an idea to create a Text-to-Speech (TTS) model, but I haven’t worked on such a project before. I’m seeking advice on tutorials, articles, books, and papers to understand how to fine-tune existing models, build a model from scratch, and manage aspects such as training time and data handling. Any guidance is appreciated. Thank you!
Hey there and welcome to the forum!
So, to start, fine-tuning an existing TTS model is a doable approach for an individual. Building a model from scratch, which means training on massive amounts of data, is going to be a lot of work and quite expensive. And by expensive, I mean potentially millions of dollars’ worth of compute. So realistically speaking, you would be looking at fine-tuning a pre-existing model.
The next question becomes: what are you trying to fine-tune for? Essentially, what are you trying to improve or change about the model? While I haven’t personally fine-tuned a TTS model before, I suspect there would be a bit more work involved in pre-processing the data than with other kinds of fine-tuning. There’s also not as much material on TTS fine-tuning as there is for other models. Also, iirc, OpenAI’s TTS model is still fairly new, and I don’t think they have released a fine-tunable version of it yet (not to be confused with Whisper, which is STT).
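To give a rough idea of what that pre-processing might look like, here is a minimal sketch, assuming your data is a set of audio clips with matching transcripts. It cleans each transcript and writes one `clip_id|transcript` line per clip, a pipe-delimited layout loosely modelled on the LJSpeech metadata file. The function names and clip IDs are hypothetical, and a real toolkit may expect a different format, so treat this as an illustration rather than a recipe.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def build_manifest(pairs):
    """Turn (clip_id, transcript) pairs into 'clip_id|transcript' lines.

    Hypothetical helper: a literal '|' in the text is replaced with a space
    so it cannot break the delimiter, and clips whose transcript is empty
    after cleaning are skipped.
    """
    lines = []
    for clip_id, transcript in pairs:
        cleaned = normalize_transcript(transcript.replace("|", " "))
        if cleaned:
            lines.append(f"{clip_id}|{cleaned}")
    return lines


if __name__ == "__main__":
    # Made-up example data; real clips would come from your own recordings.
    sample = [
        ("clip_001", "  Hello   there,\nwelcome to the forum! "),
        ("clip_002", "   "),
    ]
    for line in build_manifest(sample):
        print(line)
```

In practice you would probably also need to resample the audio, trim silence, and drop clips that are too long or too short, but those steps depend on the model you pick.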
Perhaps to get started, check out ElevenLabs?
Just a quick comment because I’m really frustrated. ElevenLabs’ demo quality is excellent, but their enterprise customer support is a 1/10. You get passed from one person to the next, likely because they are still small and overwhelmed with inquiries. I’m simply trying to get some standard documents to evaluate whether we can work with them, but it seems like they aren’t even reading the emails. I hope OpenAI will develop its own fine-tunable TTS solution.