Re: Simple audio output

Posted: Mon Nov 16, 2015 1:17 am
by onlyonemac
The diagram sounds interesting - so essentially you're comparing the frequencies present in different vowels? It would also be interesting to compare the waveforms. Something makes me think that these harmonic patterns are related to the times when I was playing around with music synthesisers and suddenly it sounded like someone going "ooh" or "aah" (depending on the setting). If we could figure out a consistent pattern, I guess it would be quite easy to harmonically synthesise the English vowels, and maybe even to apply a similar process to the consonants (although I imagine that synthesising the consonants would be somewhat harder)? It's also interesting that espeak mixes recorded sounds with synthesised samples, which I was not aware of. However, I wonder why they thought it too hard to synthesise the sampled sounds, and whether they're thinking of replacing those to get a fully synthesised voice. That would be both cool and useful for the endless customisation of the voice that it would permit: I can just imagine setting things like exactly how much separation I want between phonemes, how long I want particular phonemes to sound for, and many other things that would be great for fine-tuning the speech synth for high-speed listening by a blind person.
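To make the "harmonic synthesis" idea concrete, here is a rough Python sketch of summing harmonics of one fundamental with different relative strengths. Everything in it is an assumption for illustration: the function name `harmonic_tone`, the 8000 Hz sample rate, the 120 Hz fundamental, and especially the weight lists, which are placeholders and not measured from any real vowel.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def harmonic_tone(f0, weights, duration, sample_rate=SAMPLE_RATE):
    """Sum harmonics of f0; weights[k] is the amplitude of harmonic k + 1."""
    count = round(duration * sample_rate)
    total = sum(abs(w) for w in weights) or 1.0
    samples = []
    for i in range(count):
        t = i / sample_rate
        value = sum(w * math.sin(2 * math.pi * f0 * (k + 1) * t)
                    for k, w in enumerate(weights))
        samples.append(value / total)  # scale so the result stays in [-1, 1]
    return samples

# Placeholder harmonic weights: hypothetical, NOT taken from real vowel data.
tone_a = harmonic_tone(120.0, [1.0, 0.6, 0.3, 0.1], 0.25)
tone_b = harmonic_tone(120.0, [1.0, 0.1, 0.5, 0.4], 0.25)
```

Both tones share the same pitch and differ only in how strong each harmonic is, which is the kind of difference the diagram seems to be comparing. Finding weights that actually sound like "ooh" or "aah" is the part that would need real measurements.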

Heck, now we need to start a "SpeechSynthDev" forum!

Re: Simple audio output

Posted: Mon Nov 16, 2015 11:31 am
by DavidCooper
Consonants are generally made from white noise, with each hiss covering a different range of frequencies, and it's that range that makes them sound different. What makes it white noise rather than tones is that there is random variation in the cycle length to the point that there is no constant frequency in there. Some of the consonants involve a blockage of the flow, so that's a momentary lack of sound, and when the sound starts up again you get a very short blast of white noise. With the sound "t", for example, there's a silence followed by the same white noise as the sound "s", but it's so short that you don't normally recognise it as "s". With "p", the silence ends with white noise similar to that of the sound "f" (though it's actually a bilabial "ph" kind of "f" that uses both lips instead of involving the teeth). With "k", the white noise after the silence is a short burst of the "ch" in the Scottish word "loch" (which many people will know from German [acht], Spanish [junto], Arabic [khubz]).

I'd have thought it would be easy enough to generate all these hisses without needing to store recorded samples - they could then be generated when the program first runs, but it wouldn't save a vast amount of storage space so there's little need to bother unless you want the whole OS to fit on a floppy disk (which I ideally would want to do).
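As a rough illustration of both points (a stop consonant as silence plus a very short blast of a hiss, and generating the hisses once at startup instead of storing recordings), here is a minimal Python sketch. All names (`hiss`, `stop_burst`, `build_hiss_table`, `HISS_SETTINGS`), the 8000 Hz sample rate, the durations and the smoothing values are assumptions made for the example. The one-pole low-pass is just one simple way to change which frequencies dominate the hiss, and its settings are not tuned to real speech.

```python
import random

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def hiss(duration, smoothing, seed, sample_rate=SAMPLE_RATE):
    """Random noise; smoothing in [0, 1) is a one-pole low-pass that
    removes more of the high frequencies as it approaches 1."""
    rng = random.Random(seed)
    prev = 0.0
    out = []
    for _ in range(round(duration * sample_rate)):
        raw = rng.uniform(-1.0, 1.0)
        prev = smoothing * prev + (1.0 - smoothing) * raw
        out.append(prev)
    return out

def stop_burst(noise, silence, burst, sample_rate=SAMPLE_RATE):
    """Blocked airflow (silence) followed by a very short blast of a hiss."""
    gap = [0.0] * round(silence * sample_rate)
    return gap + noise[:round(burst * sample_rate)]

# Placeholder smoothing per hiss: hypothetical values, not tuned to speech.
HISS_SETTINGS = {"s": 0.0, "f": 0.5, "ch": 0.8}

def build_hiss_table(duration=0.2):
    """Generate every hiss once at startup instead of loading samples."""
    return {name: hiss(duration, smoothing, seed=index)
            for index, (name, smoothing) in enumerate(HISS_SETTINGS.items())}

table = build_hiss_table()
t_like = stop_burst(table["s"], silence=0.05, burst=0.01)   # silence + brief "s"
p_like = stop_burst(table["f"], silence=0.05, burst=0.01)   # silence + brief "f"
k_like = stop_burst(table["ch"], silence=0.05, burst=0.01)  # silence + brief "ch"
```

The whole table is rebuilt from three numbers and a seed, so nothing needs to be stored beyond the code itself. Whether the results sound anything like the real consonants depends entirely on choosing better filters than the placeholder one used here.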