Simple audio output

onlyonemac
Member
Posts: 1146
Joined: Sat Mar 01, 2014 2:59 pm

Re: Simple audio output

Post by onlyonemac »

iamnoob wrote:My suggestion would be to be up front, specific about your problem, and show what you have already tried.
I must show what I've already tried??? How can I do that? I've got no idea where to start. And I'm sure I've been specific enough: I have a PC with AC97 audio and I want to know the simplest way of getting sound out of it.

Anyway, I think I've got my answer, and I'll look at using the PC speaker for now as those are pretty universal. It's not a problem that it uses all the CPU time, as it's just for my debugger routine, which currently runs after whatever test I've set up but will soon be invoked by a keypress.
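For anyone else going the PC speaker route, the usual minimal approach is to program PIT channel 2 as a square-wave generator and gate it onto the speaker through port 0x61. A rough sketch for a freestanding x86 kernel (the outb/inb helpers are included for completeness; frequencies and names are illustrative):

Code: Select all

#include <stdint.h>

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t val;
    __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
    return val;
}

/* Program PIT channel 2 as a square-wave generator and un-gate the speaker. */
static void speaker_on(uint32_t freq_hz)
{
    uint16_t divisor = (uint16_t)(1193182u / freq_hz);

    outb(0x43, 0xB6);                   /* channel 2, lobyte/hibyte, mode 3 (square wave) */
    outb(0x42, divisor & 0xFF);         /* low byte of the divisor */
    outb(0x42, (divisor >> 8) & 0xFF);  /* high byte of the divisor */

    outb(0x61, inb(0x61) | 0x03);       /* enable timer 2 gate (bit 0) and speaker data (bit 1) */
}

static void speaker_off(void)
{
    outb(0x61, inb(0x61) & ~0x03);      /* silence the speaker */
}

Calling speaker_on()/speaker_off() with short delays in between is enough for Morse-style beeping.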
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: Simple audio output

Post by Schol-R-LEA »

OK, in furtherance of the eventual AC97 driver dev, could you post the model of your sound card (or mobo, if it is built-in - I am guessing it is, since AC97 most often was), along with any specs you may have for it? We might be able to help you track down the details needed to write the driver, and even if we can't, having those might help you organize things a bit more.

If you don't mind me asking, is there a specific reason you are working with this particular hardware, and if not, is there any chance of getting a target system that supports HDaudio? I don't know your plans or your finances, so I'm not certain why this particular system is your primary target.

It may also help us overall - and this holds not just for you, but for any of the posters coming here for help - if you could write out an overview of your planned OS design, with any specific requirements or details you feel are notable or which could clarify what you need help with. Not only will this make it easier for others to help you, it will also help organize and stabilize your plans, much like writing an outline for a story or paper.

Hold on a minute, I think I'd better go add that to the wiki before something else comes up... EDIT: well, I've added a rather long section to the Getting Started page, that ended up being a good deal more comprehensive than what I set out to write, and TL;DR may be a problem, but I think that for those who do read it, it ought to be helpful.
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Simple audio output

Post by SpyderTL »

Just as a follow-up, I just got my first "noise" playing from my HDAudio device in VirtualBox. I'm headed over to the AWW YEAH! thread to brag about it now.
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
onlyonemac
Member
Posts: 1146
Joined: Sat Mar 01, 2014 2:59 pm

Re: Simple audio output

Post by onlyonemac »

As another follow-up, a prerecorded test of the espeak synth played through the PC speaker was completely unintelligible, so I'm not going to bother porting my synth just yet. I'll have to stick with Morse code for now.

*returns to memorising Morse code chart*
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Simple audio output

Post by DavidCooper »

If you're sticking with working on that machine, it may be that someone who knows how to use AC97 can help you cobble together a simple machine-specific driver to give you easy access to sound output, so that's still a route worth exploring. So far as I can remember from looking at AC97 in the past (though I may have misunderstood the whole thing), the codec arrangement is the same on all of them, which should simplify the task. You would then have something that could be developed later into a full driver for any other machine with AC97.

If you're going to use Morse code and don't already know it well, I came up with something that makes it a bit easier to learn. Each letter is associated with a word or phrase which has the same pattern of long/short syllables (or stressed/unstressed):-

A .- alarm
B -... BIG sausages
C -.-. calculator
D -.. dangerous
E . egg
F ..-. fabrication
G --. gold bracelet
H .... helicopter
I .. in it
J .--- Japan's blue sky
K -.- K's for king
L .-.. laborious
M -- May Day
N -. nature
O --- old blind dogs (which is what we call football referees in Scotland)
P .--. police car chase
Q --.- Q stands for queen
R .-. remind me
S ... sausages
T - tie
U ..- undergo
V ...- varicose veins
W .-- we can now
X -..- X is a cross
Y -.-- you're a daft fool
Z --.. zig zag tacking (as in sailing)

They're not perfect (so if you can think of better ones, use them - "C you later" might be better than calculator, for example), but it's very easy to learn to pronounce these words/phrases the right way to go with the Morse.

However, you can change the pitch of the PC beep (and it can even play more than one note at a time) so it can be used for faster information output. I noticed something about the alphabet many years ago which can be helpful: if you cut it up and start a new line on every vowel you get this:-

ABCD.
EFGH.
IJKLMN.
OPQRST.
UVWXYZ.

This leads to a slower alternative to Morse code which is easier to learn: you can use a series of up to five dots to select a vowel, then follow it with up to five more dots to select a consonant on that row (or silence if you want the vowel itself). There are spare codes in the first two rows which can be used as prefixes to introduce the numbers 0-4, 5-9, and ten other symbols, or further prefixes.

If you then use five different pitches of note (dough ray me fah sew), you can speed things up dramatically and signal any letter in just one or two peeps and any single-digit number in just three beeps. If you're planning to use the PC beep for output for a long time, it would be well worth using such a system instead of Morse.
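For the curious, the row-and-offset part of that scheme is easy to express in code. A quick sketch (hosted C purely for illustration; it only handles the 26 letters, not the spare prefix codes):

Code: Select all

#include <stdio.h>
#include <string.h>
#include <ctype.h>

static const char *rows[5] = { "ABCD", "EFGH", "IJKLMN", "OPQRST", "UVWXYZ" };

/* Fills in the two dot counts: 1-5 dots pick the row by its leading vowel,
 * 0-5 dots pick how far along the row the letter sits (0 = the vowel itself,
 * i.e. silence in the second burst). Returns 0 on success. */
static int encode_letter(char c, int *row_dots, int *offset_dots)
{
    if (!isalpha((unsigned char)c))
        return -1;                        /* not a letter this scheme covers */
    c = (char)toupper((unsigned char)c);
    for (int row = 0; row < 5; row++) {
        const char *p = strchr(rows[row], c);
        if (p) {
            *row_dots = row + 1;
            *offset_dots = (int)(p - rows[row]);
            return 0;
        }
    }
    return -1;
}

int main(void)
{
    const char *msg = "OSDEV";
    for (const char *p = msg; *p; p++) {
        int row, offset;
        if (encode_letter(*p, &row, &offset) == 0)
            printf("%c -> %d dot(s), then %d dot(s)\n", *p, row, offset);
    }
    return 0;
}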
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Simple audio output

Post by SpyderTL »

I've always wanted to do my own speech synthesis code, from scratch. And since all OSes eventually add text-to-speech support, I think it may deserve its own osdev wiki page.

Let me know if anyone wants to tackle this one, as I'd love to help. It'll probably be a while before I can do it all myself.

Several years ago, I recorded myself saying a few different vowel sounds, and then analyzed the visual WAV pattern in an audio editor. I noticed that vocal patterns are similar to a string instrument being repeatedly plucked very quickly. The wave pattern would oscillate at a specific frequency, while the amplitude would repeatedly fade out, then spike, then fade out, then spike, etc. This should be pretty easy to reproduce, and perhaps you could record someone making a specific "neutral" vocal sound, like AHHHHHH, and then use that data to generate your initial wave pattern. Then it would just be a matter of applying different filters to modify that wave based on mouth, teeth, tongue, and throat position, and breath levels.
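In case it helps anyone experiment, here's a rough sketch of that "repeatedly plucked" idea: each pluck starts a burst that decays until the next one, and the burst rings at a higher, formant-like frequency. The numbers (pitch, ringing frequency, decay) are illustrative guesses, not measured data, and the output is headerless 16-bit mono PCM at 22050 Hz:

Code: Select all

#include <math.h>
#include <stdio.h>
#include <stdint.h>

#define PI          3.14159265358979323846
#define SAMPLE_RATE 22050
#define PITCH_HZ    120.0    /* rate of the "plucks" (the fundamental) */
#define FORMANT_HZ  700.0    /* frequency the burst rings at */
#define DECAY       200.0    /* how quickly each burst fades out */

int main(void)
{
    FILE *f = fopen("vowel_raw_s16le.pcm", "wb");       /* headerless PCM for simplicity */
    if (!f) return 1;

    for (int i = 0; i < SAMPLE_RATE; i++) {             /* one second of sound */
        double t = (double)i / SAMPLE_RATE;
        double since_pluck = fmod(t, 1.0 / PITCH_HZ);   /* time since the last "pluck" */
        double envelope = exp(-DECAY * since_pluck);    /* spike, then fade out */
        double sample = envelope * sin(2.0 * PI * FORMANT_HZ * t);
        int16_t s = (int16_t)(sample * 30000.0);
        fwrite(&s, sizeof s, 1, f);
    }
    fclose(f);
    return 0;
}

Any tool that accepts raw signed 16-bit mono PCM at 22050 Hz will play the result.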

This is all stuff that I've thought about, but I haven't done any real research on how existing speech engines work, so this may be very similar to what we currently have, or it may be completely different. I assume it's pretty similar. But I (obviously) prefer to re-invent the wheel rather than copy existing patterns on the unlikely chance that there is a better approach.

Anyway, just let me know if anyone is working on any of this, or is planning to in the near future.
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
carstenkuckuk
Posts: 7
Joined: Fri Aug 08, 2014 9:36 am

Re: Simple audio output

Post by carstenkuckuk »

You should have a look at Terrence J. Sejnowski and Charles R. Rosenberg, "NETtalk: a parallel network that learns to read aloud" from 1986.

https://papers.cnl.salk.edu/PDFs/NETtal ... 8-3562.pdf
onlyonemac
Member
Posts: 1146
Joined: Sat Mar 01, 2014 2:59 pm

Re: Simple audio output

Post by onlyonemac »

@DavidCooper: thanks for the tips - I was quite quick at learning Braille so Morse shouldn't be too hard and is probably a Good Thing for a blind person to know anyway, so I might as well just get to it and learn it now :-) .

@SpyderTL: yes there is research being done (and possibly even a few synths implemented?) around the principle of using a basic tone and then modulating it to produce sound - if it was possible to change the amplitude of the PC speaker it might even be possible to generate speech that way, but it isn't possible to change the amplitude of the PC speaker unfortunately. But yes, that is one approach; however, the more common approach is to break words down into phonemes (there are standard algorithms for that) and then play a sound for each phoneme or, in the case of more advanced and better-sounding synths, for each group of phonemes. Overall it's actually pretty simple to implement a speech synth, as you just need to implement a standard word-to-phoneme algorithm, get a set of standard phonemes in a suitable format, and then put all the words through the word-to-phoneme algorithm and play out the phonemes. It might also be worth mentioning that the best synths for blind people aren't actually the most natural-sounding ones but rather the more monotonous kind which pronounce each phoneme consistently and clearly and can thus be easily understood at speeds upwards of 500 WPM - it drives mom crazy, but I usually use headphones, and I can fly round the Linux console faster than I ever did as a sighted person :-P .
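To make the phoneme-playback half of that concrete, here's a minimal sketch. The tiny phoneme inventory, the sample paths and play_sample() are placeholders, and the input is assumed to have already been through a word-to-phoneme step such as the NRL rules:

Code: Select all

#include <stdio.h>
#include <string.h>

/* Hypothetical phoneme inventory: each phoneme maps to one recorded sample. */
struct phoneme { const char *name; const char *wav_path; };

static const struct phoneme inventory[] = {
    { "HH", "phonemes/hh.wav" },
    { "AH", "phonemes/ah.wav" },
    { "L",  "phonemes/l.wav"  },
    { "OW", "phonemes/ow.wav" },
};

/* Placeholder for whatever the OS uses to push PCM data at the hardware. */
static void play_sample(const char *path)
{
    printf("playing %s\n", path);
}

/* Play a space-separated phoneme string produced by the word-to-phoneme step. */
static void speak_phonemes(const char *phonemes)
{
    char buf[256];
    strncpy(buf, phonemes, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        for (size_t i = 0; i < sizeof inventory / sizeof inventory[0]; i++) {
            if (strcmp(tok, inventory[i].name) == 0) {
                play_sample(inventory[i].wav_path);
                break;
            }
        }
    }
}

int main(void)
{
    speak_phonemes("HH AH L OW");   /* roughly "hello" */
    return 0;
}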
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Simple audio output

Post by Brendan »

Hi,
onlyonemac wrote:@SpyderTL: yes there is research being done (and possibly even a few synths implemented?) around the principle of using a basic tone and then modulating it to produce sound - if it was possible to change the amplitude of the PC speaker it might even be possible to generate speech that way, but it isn't possible to change the amplitude of the PC speaker unfortunately.
It's possible to change the amplitude of the PC speaker. The general technique is called "Pulse Width Modulation".

Imagine a high frequency square wave (maybe 110 KHz) with a 50% duty cycle (half the time "on" and half the time "off"). This is too fast for the speaker to cope with, so the speaker ends up half on and half off. By adjusting the duty cycle you adjust the position of the speaker.

Now imagine you've got a source of digitised sound with a 22 KHz sample rate and 8-bit samples. In this case your 110 KHz high frequency square wave does 5 "pulses" for each sample. If the sample is 255 you set the duty cycle to 100% for 5 high frequency pulses, if the sample is 128 you set the duty cycle to 50% for 5 high frequency pulses, etc.

Now imagine if you scale those 8-bit samples by some value. For example, you could halve the amplitude - if the sample is 255 you set the duty cycle to 50% for 5 high frequency pulses, if the sample is 128 you set the duty cycle to 25% for 5 high frequency pulses, etc.

Basically; you can get (mono, not necessarily awesome quality) digitised sound with volume control out of the PC speaker.
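The per-pulse arithmetic is simple enough. A sketch (names are illustrative; a real driver would fold this into the PIT IRQ handler):

Code: Select all

#include <stdint.h>

/* pulse_ticks: length of one carrier pulse in PIT ticks (about 11 for a
 *              ~110 KHz carrier, since the PIT input clock is 1193182 Hz)
 * sample:      unsigned 8-bit PCM sample (128 = midpoint)
 * volume:      0..255, where 255 is full volume                          */
static uint32_t on_ticks(uint32_t pulse_ticks, uint8_t sample, uint8_t volume)
{
    uint32_t scaled = ((uint32_t)sample * volume) / 255u;   /* apply volume */
    return (pulse_ticks * scaled) / 255u;                    /* duty cycle in ticks */
}

Note that at roughly 11 ticks per pulse the duty cycle granularity is quite coarse, which is part of why the quality is "not necessarily awesome".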

However; this will cost CPU time and take a lot of care.

To control the high frequency sound accurately you'd want to use both PIT channel 0 and PIT channel 2 in "one shot" mode ("mode 0 - interrupt on terminal count"). Within the IRQ handler for PIT channel 0; you set PIT channel 2's count to match the time until the rising edge of the pulse; and set PIT channel 0's count to generate an IRQ when you want the falling edge of the high frequency pulse. For 110 KHz this adds up to 110000 IRQs per second, and (assuming you're using "low byte only" mode) it'll cost about 3 us per IRQ, or about 330 ms of time per second, or about a third of one CPU's time. Of course for a 100% duty cycle and a 0% duty cycle you can optimise by setting both timer counts to cover the duration of 5 whole pulses at once, so that "a third of one CPU's time" is a slightly pessimistic estimate.
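A sketch of what that IRQ handler might look like, assuming outb() and the sample-to-duty-cycle step live elsewhere and that port 0x61 bits 0 and 1 were enabled once at init. In mode 0, reprogramming channel 2 drives its output (and therefore the speaker) low immediately, and the output rises again at terminal count, so each IRQ marks a falling edge:

Code: Select all

#include <stdint.h>

#define PULSE_TICKS 11u             /* roughly 1193182 / 110000: one carrier pulse in PIT ticks */

extern void outb(uint16_t port, uint8_t val);
extern uint8_t next_on_ticks(void); /* "on" time for the upcoming pulse (duty cycle) */

void pit_irq0_handler(void)
{
    uint8_t on  = next_on_ticks();  /* 0% and 100% duty cycles need special-casing, as noted above */
    uint8_t off = (uint8_t)(PULSE_TICKS - on);

    /* Channel 2: output low now, rising edge after 'off' ticks. */
    outb(0x43, 0x90);               /* channel 2, lobyte only, mode 0, binary */
    outb(0x42, off);

    /* Channel 0: interrupt again at the next falling edge, one full pulse later. */
    outb(0x43, 0x10);               /* channel 0, lobyte only, mode 0, binary */
    outb(0x40, PULSE_TICKS);

    /* (Sending EOI to the PIC is omitted for brevity.) */
}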

You also have to worry about "jitter" caused by anything that disables IRQs (possibly including other IRQ handlers), because this can cause audible distortion if it's bad enough.

Finally; some computers use a small traditional "moving coil" speaker and for these it shouldn't be too hard to get reasonable digitized sound playback because the speaker's diaphragm takes time to move; but some computers use smaller piezoelectric speakers that move much faster and for these you might not be able to get acceptable quality digitized sound playback using timers because the high frequency square wave needs to be higher frequency than the CPU, PIT and/or PIC chips can handle. In that case you'd probably need to skip the timers and dedicate an entire CPU to constantly pounding the speaker's on/off gate (which isn't necessarily insane if you've got a 4-core chip and want it bad enough).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Simple audio output

Post by DavidCooper »

I've been planning to have a go at speech synthesis for a while, but it'll need an extensive dictionary to translate words into phonetic form and I don't know if there are any free off-the-shelf ones available to use for that. I plan to build my own over time and to use an adapted version of my own phonetic writing system for it.

(Here's that paragraph again in phonetic form: ajv bin plxnik tw hxv a go xt spytt sinthusis fqr a hwajl, bat it nydz a dicshunurj tw trxnsljt wurdz intw fqnetyc fqrm xnd aj dont no iv dher ar enj fry qf-dhu-shelf wanz avjlubul tw ywz fqr djxt. aj plxn tw bild maj on ovur tajm xnd tw ywz xn xdxptud vurshun qv maj on fqnetic rajtik sistum fqr it.)

When they get people to supply their voice for speech synthesis, they give them a script to read out which contains all the sounds and clusters of sounds required, but I haven't managed to find an example of such a script. You need to work with clusters of phonemes a lot of the time because what you think of as the same phoneme can be quite different in different contexts: the "n" in "can't" doesn't release like the "n" in "not", and in the word "often" it's more like a vowel with no release at all. I don't have a comprehensive list of all the possible variations, but it shouldn't be hard to work them all out. You could then do the chopping in Audacity, cutting out each sound or cluster of sounds and saving each as a mono wav file.
I suspect the audio data would fit into a couple of megabytes.

I experimented a year ago with cutting vowels down in length and repeating the cycle many times to reduce the size of the data, but it always introduced extra sounds, perhaps because the repetition was too regular. They also sounded rather empty, making different vowels hard to tell apart - again I think it's because they are too regular. It's probably best just to stick with using samples of real speech sound to begin with, and to speak like Andy Murray so as to keep all the sounds on the same note. The ideal way to do it would still be to produce the sounds mathematically, perhaps starting with data from sound analysis which shows which frequencies are involved in each sound, but it would be a lot more work. I wrote a program to analyse sounds in that way and I've attached a file showing eight vowels (oo, oh, aw, ah, a, e, ay and ee), but to go in the opposite direction you'd need to be able to tell the difference between a tone and white noise, and it's hard to read the data and make proper sense of it. I was trying to do speech recognition, but it took too long doing experiments that kept not working out and I had to put it aside to get on with more important things. Speech synthesis should be fairly quick to implement though, and I'd be happy to compare notes with anyone else who's trying to add it to their OS.
Attachment: 8-vowels.JPG
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Simple audio output

Post by Brendan »

Hi,
DavidCooper wrote:When they get people to supply their voice for speech synthesis, they give them a script to read out which contains all the sounds and clusters of sounds required, but I haven't managed to find an example of such a script. You need to work with clusters of phonemes a lot of the time because what you think of as the same phoneme can be quite different in different contexts: the "n" in "can't" doesn't release like the "n" in "not", and in the word "often" it's more like a vowel with no release at all. I don't have a comprehensive list of all the possible variations, but it shouldn't be hard to work them all out. You could then do the chopping in Audacity, cutting out each sound or cluster of sounds and saving each as a mono wav file.
I suspect the audio data would fit into a couple of megabytes.
I'd expect that the best method would be to begin with the physics of sound waves, and build a parameterised mathematical model of the human vocal system (vocal cords, air speed for inhale/exhale, various chamber sizes, lip separation, etc); such that phonemes can be defined as a set of control points, and "speech" is the result of applying the mathematical model while making "smooth curve" transitions between control points. Not only would this avoid problems with the transitions between phonemes; you could control speed (e.g. go faster to give a sense of urgency) and use different "base vocal cord and chamber size/s" characteristics to change the voice (e.g. male/female, small/large person) without changing the "sets of control points" data itself; and maybe also use the same system for things that aren't speech at all (screams, humming, whistling, burps, groans, etc) or possibly things that aren't human at all (animals).

EDIT: Apparently what I'm describing here is called articulatory synthesis and is the method used by Gnuspeech.
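In code, the "control points with smooth transitions" part might look something like the following; the parameter names are invented placeholders rather than a real vocal-tract model:

Code: Select all

#include <math.h>

#define PI 3.14159265358979323846

struct frame {
    double cord_tension;     /* hypothetical vocal-cord parameter */
    double jaw_opening;      /* hypothetical chamber-size parameter */
    double lip_separation;   /* hypothetical lip parameter */
};

/* Blend between two phoneme frames; t runs from 0.0 (all 'a') to 1.0 (all 'b'). */
static struct frame blend(const struct frame *a, const struct frame *b, double t)
{
    double s = 0.5 - 0.5 * cos(PI * t);   /* smooth ease-in/ease-out curve */
    struct frame out = {
        .cord_tension   = a->cord_tension   + s * (b->cord_tension   - a->cord_tension),
        .jaw_opening    = a->jaw_opening    + s * (b->jaw_opening    - a->jaw_opening),
        .lip_separation = a->lip_separation + s * (b->lip_separation - a->lip_separation),
    };
    return out;
}

The synthesiser would call blend() once per output frame while walking from one phoneme's control points to the next, then feed the result into the vocal-tract model; speeding speech up or slowing it down is just a matter of how quickly t advances.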


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Simple audio output

Post by DavidCooper »

Hi,
Brendan wrote:Hi,

I'd expect that the best method would be to begin with the physics of sound waves, and build a parameterised mathematical model of the human vocal system (vocal cords, air speed for inhale/exhale, various chamber sizes, lip separation, etc); such that phonemes can be defined as a set of control points, and "speech" is the result of applying the mathematical model while making "smooth curve" transitions between control points. Not only would this avoid problems with the transitions between phonemes; you could control speed (e.g. go faster to give a sense of urgency) and use different "base vocal cord and chamber size/s" characteristics to change the voice (e.g. male/female, small/large person) without changing the "sets of control points" data itself; and maybe also use the same system for things that aren't speech at all (screams, humming, whistling, burps, groans, etc) or possibly things that aren't human at all (animals).

EDIT: Apparently what I'm describing here is called articulatory synthesis and is the method used by Gnuspeech.


Cheers,

Brendan
For generating a wider range of sounds, it would be worth looking at the lyrebird, which can imitate almost anything, and it does it all without duplicating the same pipework. It's that ability to mimic things without simulating the throat geometry that leads me to think you could do a reasonable job just by looking at sounds and their components and trying to build rules based on that alone. It would be time-consuming either way though if you want high accuracy - there's no harm in generating something that sounds artificial but which is intelligible and hopefully not too unattractive, but I still think building a system using recorded phoneme samples would be the best way to begin - getting something that works up and running is the important bit, so the bells and whistles can be tackled later while higher-priority things on the to-do list are worked on first. There may be advantages in having a machine sound like a machine in any case - if you make these things too human they'll really freak out many people with bipolar disorder who already think they're being spied on by an army of people.
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
onlyonemac
Member
Posts: 1146
Joined: Sat Mar 01, 2014 2:59 pm

Re: Simple audio output

Post by onlyonemac »

@DavidCooper: You don't need to devise your own algorithm; there's a pretty common one called the NRL algorithm, a variation of which is used by the excellent espeak speech synthesiser and which is easy to find documentation on. There's also a table defining what each of the phonemes sounds like in terms of words that they occur in, and I believe (although I have not tested it) that by recording those words and cropping them to just the relevant phoneme you'll get a pretty good set of phonemes to work with (if you want to use diphones or triphones then things get a little more tricky, but you'll probably not want to use those for initial experimentation as there are plenty of good synths that use just single phonemes).
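To give a flavour of the letter-to-sound side, here's a toy sketch of rule-based rewriting in the spirit of the NRL rules. The handful of rules below is illustrative only; the real rule set runs to several hundred entries and also checks left/right context using pattern classes (vowel, consonant and so on), which is omitted here for brevity:

Code: Select all

#include <stdio.h>
#include <string.h>

struct rule {
    const char *fragment;   /* letters consumed if the rule fires */
    const char *phonemes;   /* phoneme string to emit             */
};

/* First match wins, so longer fragments come before their prefixes. */
static const struct rule rules[] = {
    { "CH", "tS" },
    { "C",  "k"  },
    { "A",  "ae" },
    { "T",  "t"  },
    { "H",  "h"  },   /* fallback when "CH" did not match */
};

static void word_to_phonemes(const char *word, char *out, size_t out_len)
{
    out[0] = '\0';
    for (const char *p = word; *p; ) {
        size_t advance = 1;                   /* skip the letter if nothing matches */
        for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
            size_t n = strlen(rules[i].fragment);
            if (strncmp(p, rules[i].fragment, n) == 0) {
                strncat(out, rules[i].phonemes, out_len - strlen(out) - 1);
                strncat(out, " ", out_len - strlen(out) - 1);
                advance = n;
                break;
            }
        }
        p += advance;
    }
}

int main(void)
{
    char buf[64];
    word_to_phonemes("CHAT", buf, sizeof buf);
    printf("CHAT -> %s\n", buf);   /* "tS ae t " with the toy rules above */
    return 0;
}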

Also I would appreciate a description of that picture that you posted a few posts ago.
When you start writing an OS you do the minimum possible to get the x86 processor in a usable state, then you try to get as far away from it as possible.

Syntax checkup:
Wrong: OS's, IRQ's, zero'ing
Right: OSes, IRQs, zeroing
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Simple audio output

Post by Brendan »

Hi,
onlyonemac wrote:Also I would appreciate a description of that picture that you posted a few posts ago.
The picture looks like someone was knitting a set of 8 scarves, but kept running out of wool and had to keep changing colours, and ran out of wool for different scarves at different times; and then put the resulting half-finished multi-coloured scarves on a dark background in order of length (shortest scarf on the left, longest on the right) and took a photo of their messed up scarves.

Note: I have absolutely no idea what that picture is supposed to be. :)


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Simple audio output

Post by DavidCooper »

Interesting that they use a mixture of methods:-

"The eSpeak synthesizer creates voiced speech sounds such as vowels and sonorant consonants by adding together sine waves to make the formant peaks. Unvoiced consonants such as /s/ are made by playing recorded sounds. Voiced consonants such as /z/ are made by mixing a synthesized voiced sound with a recorded unvoiced sound.

"The Klatt synthesizer mostly uses the same formant data as the eSpeak synthesizer. It produces voiced sounds by starting with a waveform which is rich in harmonics (simulating the vibration of the vocal cords) and then applying digital filters in order to produce speech sounds."

Sorry about the lack of description on the photo I posted. It shows eight vowels being analysed for all the frequencies they contain, and I programmed it to run through the colours magenta, blue, cyan, green, yellow and red with rising pitch for each octave to make it easier to identify harmonics. There are two main bright bands low down on the screen which are the same for all eight vowels - these show the main note of my voice, with one of them an octave higher than the other. Both these bands are the same colour (bright yellow in this case). The first vowel, "oo", is shown as a column to the left of the picture, and it has very little going on above the two main bands - this vowel kills all the harmonics bar the first one, so it results in a very pure sounding vowel, not unlike a low beep generated by a machine. The second vowel is "oh", and now we see a cyan band appear which represents another harmonic, this time a fifth higher (in musical terms, so it's seven semitones higher), plus another cyan band an octave above that. The third vowel is "aw", and three more prominent bands have appeared: another yellow one an octave up from the higher of the original two yellow bands, plus a magenta one below the cyan band and a green one above the cyan band - more harmonics are being allowed to resonate. The fourth vowel, "ah", looks identical apart from a fourth yellow band a little above the green one, though some of the other bands have brightened a bit.

I'll describe the rest in a moment, but they get messier. What I want to say first though is that while this makes it sound easy to tell the first four vowels apart, it isn't that simple - different recordings of vowels look different and there is a lot of overlap between them visually, even though it's easy to hear which are which when you play them. The patterns are also different when you speak at different pitches, so it's hard trying to write code that can identify these sounds, even though you can hear the difference between them with ease. It takes ages writing experimental code and trying it out only to find that it doesn't work well enough in most cases.

Onward then: the fifth vowel is "a" as in the word "cat", and now the fourth yellow band has faded away while a tight pack of bands has appeared higher up. The sixth vowel is "e" as in "bed", and we're now seeing a weakening of some of the bands below the fourth yellow band which had already disappeared. The seventh and eighth vowels are "ay" and "ee", and now the high pack of bands that appeared with "a" is moving higher up while the other bands which were weakening before are weakening further, introducing a larger empty space between the pack of high bands and the lower ones - this empty space is a dark triangle coming in from the right of the picture and running through the "ee", "ay" and "e" columns, looking like a triangular flag tied to a pole at the right-hand side of the picture and flying out sideways across the image. The same triangular hole can be seen in similar pictures made by other people using FFT (fast Fourier transform), but my software uses a different method for the maths where it just looks at wav file data (or the equivalent brought in from a microphone and written to memory by an HDA DMA engine) and counts up how much the line wanders up and down at a multitude of different frequencies.
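For anyone wanting a similar per-frequency picture without an FFT, one straightforward way is to correlate the samples against a sine and a cosine at each test frequency and take the magnitude - effectively a plain DFT evaluated at whatever frequencies you like. This isn't necessarily the same maths as described above; it's just a sketch that produces comparable "how strong is this frequency" data:

Code: Select all

#include <math.h>
#include <stddef.h>

#define PI 3.14159265358979323846

/* samples: signed 16-bit PCM; returns the relative strength of freq_hz. */
double strength_at(const short *samples, size_t count, double sample_rate, double freq_hz)
{
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < count; i++) {
        double phase = 2.0 * PI * freq_hz * (double)i / sample_rate;
        s += samples[i] * sin(phase);
        c += samples[i] * cos(phase);
    }
    return sqrt(s * s + c * c) / (double)count;
}

Sweeping freq_hz in small steps and colouring each result by octave would give columns of bands much like the attached picture.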
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming