Hello,
I would like to know how vocoders and/or text readers (which use Microsoft's text-to-speech engine for example) work.
Where can I find some information about the sound of the human voice (pitches, frequencies, intensities, ADSR, ...), especially for every letter of the alphabet?
I posted this here (and not in the General off-topic) because the aim of it is programming and not just learning information.
Thank you in advance for your help.
Sound, vocoders and human voice
Re:Sound, vocoders and human voice
Most text-to-speech applications use recorded diphones rather than letters. A diphone is a segment containing a combination of two sounds (not letters), which makes the transitions between sounds seem more natural. The text is interpreted first, to work out which sounds it actually spells, since one letter can be read differently in different words. In Estonian, my native language, you'd also need to know whether a sound is palatalized, for example.
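To make the diphone idea concrete, here is a minimal sketch of how a sequence of sounds could be turned into the list of diphones to look up. The phoneme labels, the "_" silence marker and the function name to_diphones are all made up for illustration; a real engine would use its own phoneme set and recordings.

```python
def to_diphones(phonemes):
    """Pair each sound with its neighbour, padding with silence at both ends.

    Every returned entry names one recorded segment that covers the
    transition from the first sound into the second.
    """
    silence = "_"  # hypothetical marker for "no sound"
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


# "cat" written as a hypothetical phoneme sequence
print(to_diphones(["k", "ae", "t"]))  # ['_-k', 'k-ae', 'ae-t', 't-_']
```

Playing the word is then a matter of looking up each diphone's recording and concatenating the samples in order.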
Re:Sound, vocoders and human voice
If it were my project, I'd record the pronunciation of each syllable, plus each syllable-to-syllable transition, then chain them together and play them.
For most languages I think that'd work; for some you'd need to rewrite the plain text into a syllable version before playing, as the mapping might be nontrivial (thinking mainly of Dutch now :S).
Re:Sound, vocoders and human voice
Yes, but with this method you need recorded human voice samples, and those can take up a lot of disk space.
Maybe I wasn't clear enough when I said letters; I really meant "formants". I'm not sure if that word exists in English, but what I mean by it is the frequency ranges of each diphone, three to six of these "formants" being enough for any diphone.
So what I need is how to synthesise voice. I know this method doesn't make very natural voices but it's better than having to deal with megabytes of recorded sounds using very complicated algorithms.
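As a rough illustration of the formant approach asked about here, the sketch below feeds a pulse train (standing in for the vocal cords) through a few two-pole resonators, one per formant, and sums the results. It is only one simple way to do it, not how any particular engine works. The sample rate, pitch, bandwidths and the AH_FORMANTS values are assumptions (approximate textbook figures for an "ah" vowel), and all function names are hypothetical.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second (assumed)

# (centre frequency in Hz, bandwidth in Hz); rough values for an "ah" vowel
AH_FORMANTS = [(730, 90), (1090, 110), (2440, 170)]


def pulse_train(pitch_hz, seconds, rate=SAMPLE_RATE):
    """One unit impulse per pitch period: a crude glottal source."""
    period = int(rate / pitch_hz)
    count = round(rate * seconds)
    return [1.0 if i % period == 0 else 0.0 for i in range(count)]


def resonator(signal, freq, bandwidth, rate=SAMPLE_RATE):
    """Two-pole filter that emphasises frequencies near `freq`."""
    r = math.exp(-math.pi * bandwidth / rate)
    b = 2.0 * r * math.cos(2.0 * math.pi * freq / rate)
    c = -r * r
    a = 1.0 - b - c  # scales the filter so its gain at 0 Hz is 1
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(formants, pitch_hz=120.0, seconds=0.25):
    """Sum one resonator per formant, then normalise the peak to 1.0."""
    source = pulse_train(pitch_hz, seconds)
    mixed = [0.0] * len(source)
    for freq, bandwidth in formants:
        for i, y in enumerate(resonator(source, freq, bandwidth)):
            mixed[i] += y
    peak = max(abs(s) for s in mixed) or 1.0
    return [s / peak for s in mixed]


def write_wav(path, samples, rate=SAMPLE_RATE):
    """Save samples in the range -1.0..1.0 as 16-bit mono PCM."""
    frames = b"".join(struct.pack("<h", int(s * 32767)) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(frames)
```

Calling write_wav("ah.wav", synth_vowel(AH_FORMANTS)) should give a short, buzzy vowel. Different vowels would need different formant tables, and consonants need more than this (noise sources, changing formants over time), so treat it as a starting point only.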
- Pype.Clicker
Re:Sound, vocoders and human voice
I used to have a 4KB text-to-speech demo on my disk (not my own, though). Once I'm home, I'll try to locate it ...