Amazing! An AI tool that can speak in a learned voice from just a 3-second voice sample.

【All it takes is “three seconds of voice”. Microsoft is developing an AI with an amazing ability to mimic the human voice.】
https://www.lifehacker.jp/article/2301-gizmodo-microsoft-ai-voice-mimic-deepfake-natural-copy-audio/

 

・Microsoft’s VALL-E is a neural codec language model built on Meta’s EnCodec speech compression technology

・Meta’s technology uses AI to compress CD-quality (or better) audio to a data rate ten times smaller than an MP3 file without compromising sound quality

・Meta developed the technology to improve voice quality during calls and to save bandwidth for music streaming services, but Microsoft is using it to build a highly accurate voice-cloning AI tool

・Microsoft’s VALL-E was trained on 60,000 hours of audio data from over 7,000 English speakers

・With just three seconds of voice data, Microsoft’s VALL-E can learn a voice and then speak in it

 

 

The points above are quoted from the article.

Imagine a world in which your voice can be imitated from just three seconds of audio.

 

I was also surprised that Meta’s technology can compress CD-quality (or better) audio to a data rate ten times smaller than an MP3 file without losing sound quality. (Roughly 10:1 compression relative to MP3, I suppose?)
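To put rough numbers on that claim, here is some back-of-the-envelope arithmetic (the CD and MP3 bitrates are typical values I am assuming, not figures from the article):

```python
# Rough bitrate arithmetic with assumed typical values (not from the article).

# CD audio: 44.1 kHz sample rate, 16-bit samples, 2 channels.
cd_kbps = 44_100 * 16 * 2 / 1000      # 1411.2 kbps
mp3_kbps = 128                        # a common MP3 bitrate
codec_kbps = mp3_kbps / 10            # "10x smaller than MP3" -> 12.8 kbps

print(f"CD audio:     {cd_kbps:.1f} kbps")
print(f"Neural codec: {codec_kbps:.1f} kbps")
print(f"Ratio vs CD:  {cd_kbps / codec_kbps:.0f}:1")
```

On those assumptions the codec would be running at around 13 kbps, roughly a 110:1 reduction relative to raw CD audio, which is what makes the claim so striking.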

Microsoft’s VALL-E, built on that technology, can apparently speak in a learned voice after hearing just 3 seconds of it. (Though the AI tool was pre-trained on 60,000 hours of audio beforehand.)

 

【VALL-E Sample】
https://valle-demo.github.io/

 

You can listen to the actual audio here.

・“Speaker Prompt” is a sample sound file (approx. 3 seconds?) of the person whose voice is being mimicked.

・“Ground Truth” is a reference recording of the person actually speaking the text on the left, for comparison.

・“Baseline” is the voice produced by a conventional text-to-speech system

・“VALL-E” is the voice generated by VALL-E

 

In fact, when you listen, you can hardly tell the difference between the human voice and the voice created by VALL-E. (Although there does seem to be some slight difficulty with the English accent.)

 

Still, cloning a voice from just 3 seconds of audio is truly amazing.

 

AI singers and AI voices have been developed before, but bringing them to life required training on huge amounts of data.

Think about it:

Take Japanese, for example. The basic syllabary has 46 characters, so a mere 3 seconds of audio cannot possibly contain every sound. It is amazing that the system can generate words whose sounds do not appear in those 3 seconds of data. (Generating a myriad of words and phrases is an amazing feat in itself.)
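A quick calculation makes the point concrete (the speaking rate here is an assumption on my part; Japanese is often cited at around 6-8 morae per second):

```python
# Rough illustration with an assumed speaking rate (not from the article).

morae_per_second = 7        # assumed typical Japanese speaking rate
sample_seconds = 3
heard = morae_per_second * sample_seconds   # at most ~21 morae in the prompt
basic_kana = 46                             # basic Japanese syllabary

# Even if every mora in the 3-second prompt were distinct, it could not
# cover the full syllabary, so the model must generalize to unheard sounds.
print(f"Morae heard: at most {heard} of {basic_kana} basic kana")
```

On those assumptions, more than half the syllabary never appears in the prompt, yet VALL-E still has to pronounce it in the speaker's voice.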

 

As noted at the end of the article above, this technology has not been released to the public because of the potential for abuse.

 

Will it really continue to be developed without being made public?

What is Microsoft’s purpose in developing this technology in the first place?

 

As a musician who works with voices, I was very concerned about that.

 

Personally, I imagine that as this technology develops, it will become possible for the great singers of history to keep singing forever.

 

If faces, figures, and voices can all be created in the digital world, then in the metaverse it would be possible to recreate human beings themselves, whether or not they exist in this world.

See you then.

 

 

 
