Domestic AI voice cloning: a complete strategy from breaking through difficulties to building serverless services

Posted by: XinYe 2 weeks, 1 day ago

Short videos fall roughly into two categories: those where the creator appears on camera and those where they do not. On-camera videos are less predictable, so for knowledge and tutorial videos, recording a voice-over and pairing it with matching footage is easier to control. The usual workflow is to write the script first, then record the audio, and finally overlay the matching footage in editing software.

In practice, though, this workflow has a weak point: when the blogger's throat is sore, the mood is off, or the environment is noisy, recording is impossible, uploads stall, and followers drift away. This led the blogger to the idea of having AI clone his own voice, so that supplying the script alone is enough to generate the audio. The blogger has already put this idea into practice, and the voice in the video was produced this way.

Choosing a model for the service

To implement AI voice cloning, the first step is choosing a suitable model. Among the many TTS models, CosyVoice has the best reviews and gave the best results in the blogger's own tests. Initially, the blogger deployed CosyVoice on a local computer, but generation was extremely slow. Running large-model applications locally consumes a lot of GPU resources, and dedicating a 4090 machine to this one task is hard to justify.

Using a GPU compute rental platform

By chance, the blogger came across the Zhiling GPU compute rental platform while researching stall setting AI. The platform supports quick instance startup and on-demand use of compute resources, and it can also expose services externally in Serverless form, which fits this need well.

Build a speech-to-text service

This build requires two Serverless services, the first of which is a speech-to-text service. CosyVoice needs the text of the sample audio when cloning a voice, so to avoid typing it by hand, the blogger chose Whisper to transcribe it. The Zhiling platform provides an official Whisper template and also supports custom templates. Anyone interested can follow the platform's official tutorials to create their own templates for custom AIGC needs.

The construction steps are as follows. Add a Serverless service and give it a name. In the graphics card configuration, set Active Worker to zero, meaning there is no permanently running Worker and no fees are incurred while the service is not being called. Set the scaling policy to queue, so that tasks wait in a queue when too many arrive at once, and keep the remaining settings at their defaults. For the template configuration, choose Whisper. Because the service returns the transcribed text synchronously, no storage needs to be mounted. Click Add, and once the service has started successfully, a sample curl request appears. Replace the key in it with your own, which you can generate as a permanent key by clicking API Key. Following the official sample project, the input audio is passed in base64 format and the other parameters stay unchanged. In this test, the response came back quickly and the transcript was accurate.
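The base64 step above can be sketched in Python as follows. This is only an illustration, not the platform's documented API: the URL, the `input` / `audio_base64` field names, and the `Bearer` authorization header are all assumptions, and the real values should be copied from the curl example the platform displays. The function builds the request without sending it.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: copy the real URL from the platform's curl example.
WHISPER_URL = "https://example.invalid/serverless/whisper-service-id/sync"


def build_whisper_request(audio_bytes: bytes, api_key: str,
                          url: str = WHISPER_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying base64-encoded audio."""
    # Assumed payload shape; the official sample project defines the real one.
    payload = {
        "input": {
            "audio_base64": base64.b64encode(audio_bytes).decode("ascii"),
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; use whatever header the curl example shows.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending the request would then be a matter of passing the returned object to `urllib.request.urlopen` and parsing the JSON body of the response.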

Build a complete cloning service

Next, build the CosyVoice service, again using the official template and keeping the other configuration the same as for Whisper. Once the service has started, generate an API key and substitute it into the curl command. After setting the input parameters, paste the edited curl command into a terminal and run it. At this point, the voice cloning service is complete. It is worth mentioning that Serverless services on the Zhiling platform are very flexible, so anyone can turn an AIGC product they are interested in into a service. For more information, the blogger recommends the official tutorial videos and GitHub repository.
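As a rough illustration of what "setting the input parameters" involves, the sketch below assembles a JSON body from the three pieces of information the cloning step needs: the sample audio, the text spoken in that sample, and the new text to synthesize. The field names (`prompt_audio_base64`, `prompt_text`, `text`) are hypothetical; the official CosyVoice template defines the real ones.

```python
import base64
import json


def build_clone_payload(sample_audio: bytes, sample_text: str,
                        target_text: str) -> str:
    """Return a JSON request body for a voice cloning call (assumed shape)."""
    if not sample_text.strip():
        raise ValueError("sample_text (the transcript of the sample) is empty")
    if not target_text.strip():
        raise ValueError("target_text (the copy to synthesize) is empty")
    payload = {
        "input": {
            # Hypothetical field names, for illustration only.
            "prompt_audio_base64": base64.b64encode(sample_audio).decode("ascii"),
            "prompt_text": sample_text.strip(),
            "text": target_text.strip(),
        }
    }
    # ensure_ascii=False keeps Chinese text readable in the body.
    return json.dumps(payload, ensure_ascii=False)
```

Validating the two text fields before sending avoids paying for a GPU call that cannot succeed.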

Build the client

With the services in place, the next step is writing the client. This time the blogger chose to build a single-page application and used Cursor to write it. Anyone unfamiliar with Cursor can watch the related videos the blogger posted earlier. The application is a form with four configuration parameters (Whisper id, Whisper API key, CosyVoice id, and CosyVoice API key) and two business parameters (the sample audio and the copy to be cloned). After the user clicks Execute, the JS code base64-encodes the sample audio, calls the Whisper service to obtain the text of the sample audio, and then calls CosyVoice to clone the voice and generate the audio.
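The Execute flow described above can be sketched as a two-step pipeline. The blogger's client is written in JS; the version below uses Python purely for illustration and is not the blogger's code. The `post` callable stands in for the HTTP call, and the payload field names and the `text` key in the Whisper response are assumptions.

```python
import base64
from dataclasses import dataclass
from typing import Callable

# (service_id, api_key, payload) -> parsed JSON response
Post = Callable[[str, str, dict], dict]


@dataclass
class Config:
    """The four configuration parameters from the form."""
    whisper_id: str
    whisper_api_key: str
    cosyvoice_id: str
    cosyvoice_api_key: str


def clone_voice(cfg: Config, sample_audio: bytes, clone_text: str,
                post: Post) -> dict:
    """Transcribe the sample with Whisper, then ask CosyVoice to clone."""
    audio_b64 = base64.b64encode(sample_audio).decode("ascii")

    # Step 1: speech-to-text on the sample audio (assumed request shape).
    whisper_reply = post(cfg.whisper_id, cfg.whisper_api_key,
                         {"input": {"audio_base64": audio_b64}})
    sample_text = whisper_reply["text"]  # assumed response key

    # Step 2: voice cloning with sample audio, its text, and the new copy.
    return post(cfg.cosyvoice_id, cfg.cosyvoice_api_key,
                {"input": {"prompt_audio_base64": audio_b64,
                           "prompt_text": sample_text,
                           "text": clone_text}})
```

Passing `post` in as a parameter keeps the pipeline independent of any particular HTTP library and makes it easy to exercise with a stub.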

During testing, the blogger found that Whisper did not always return an accurate transcript and sometimes produced typos. A separate speech-to-text button was therefore added to the interface. If the user clicks this button, the speech-to-text result is displayed so that the user can correct any wrong text and keep the voice cloning accurate. If the user clicks Voice Clone directly, the program performs the speech-to-text step internally and does not display the result.
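The two button paths reduce to one decision: use the user's corrected transcript if there is one, otherwise transcribe internally. A minimal sketch of that decision, with a hypothetical `transcribe` callable standing in for the Whisper call:

```python
from typing import Callable, Optional


def resolve_sample_text(audio_b64: str,
                        transcribe: Callable[[str], str],
                        edited_text: Optional[str] = None) -> str:
    """Pick the transcript to send along with the sample audio.

    edited_text is what the user left in the text box after clicking the
    speech-to-text button; None or blank means the button was skipped.
    """
    if edited_text is not None and edited_text.strip():
        # The user reviewed the transcript: trust it, skip the Whisper call.
        return edited_text.strip()
    # Direct Voice Clone: transcribe internally without showing the result.
    return transcribe(audio_b64)
```

Skipping the second Whisper call when a reviewed transcript exists also saves one Serverless invocation per clone.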

Overall, the project worked well. The blogger will upload the entire project to GitHub, including the code and the prompts. After downloading it locally, replace the API keys and ids with your own and it is ready to use. All links and resources mentioned in the video will be placed in the pinned comment for anyone who needs them.
