Hi,
First; do it as 2 steps - horizontal scaling, then vertical scaling.
To scale horizontally, think of each bit in the source as a tiny horizontal line. If the source data is 24 bits wide and the destination data is M pixels wide, then the first bit represents a tiny line from 0 to M/24, the second bit represents a tiny line from M/24 to 2*M/24, the third bit represents a line from 2*M/24 to 3*M/24, and so on. With fixed point maths you should be able to generate "8-bit alpha per pixel" relatively easily. For example, with M=66 and "8.8 fixed point", 66/24 = 0x42.00/0x18 = 0x02.C0, which is 2 whole pixels (alpha = 0xFF) and then a partial pixel (alpha = 0xC0). Note that you probably want to detect runs - e.g. if the source data is 1101111011b then you'd detect 3 runs (one from 0 to 2*M/24, one from 3*M/24 to 7*M/24 and one from 8*M/24 to 10*M/24) and draw 3 small lines (and not 8 tiny lines).
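A rough sketch of that horizontal step in C might look like this (just a sketch - it assumes the source row fits in a uint32_t with bit 0 as the left-most pixel, that the destination alpha buffer starts zeroed, and names like draw_aa_hline() and add_alpha() are made up for illustration):

    #include <stdint.h>

    /* Saturating add, so overlapping partial pixels can't overflow past 0xFF */
    static void add_alpha(uint8_t *p, uint32_t alpha) {
        uint32_t v = *p + alpha;
        *p = (v > 0xFF) ? 0xFF : (uint8_t)v;
    }

    /* Draw one anti-aliased horizontal line; x0 and x1 are "8.8 fixed point" pixel positions */
    static void draw_aa_hline(uint8_t *dest, uint32_t x0, uint32_t x1) {
        uint32_t first = x0 >> 8;
        uint32_t last = x1 >> 8;

        if (first == last) {                                /* Run starts and ends in the same pixel */
            add_alpha(&dest[first], x1 - x0);
            return;
        }
        add_alpha(&dest[first], 0x100 - (x0 & 0xFF));       /* Partial pixel on the left */
        for (uint32_t x = first + 1; x < last; x++) {
            dest[x] = 0xFF;                                 /* Whole pixels */
        }
        if (x1 & 0xFF) {
            add_alpha(&dest[last], x1 & 0xFF);              /* Partial pixel on the right */
        }
    }

    /* Scale one row of source bits into a row of "8-bit alpha per pixel" */
    void scale_row_horizontally(uint32_t srcBits, int srcWidth, uint8_t *dest, int destWidth) {
        uint32_t step = ((uint32_t)destWidth << 8) / srcWidth;   /* Dest pixels per source bit, 8.8 */
        int bit = 0;

        while (bit < srcWidth) {
            while (bit < srcWidth && !(srcBits & (1u << bit))) bit++;   /* Skip clear bits */
            if (bit >= srcWidth) break;
            int runStart = bit;
            while (bit < srcWidth && (srcBits & (1u << bit))) bit++;    /* Find the end of the run */
            draw_aa_hline(dest, runStart * step, bit * step);           /* One small line per run */
        }
    }

The saturating add means the partial pixel where one run ends and an adjacent run begins can't overflow.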
To stretch vertically, calculate how much each "source row of alpha" affects each destination row. For example, if the source data is 32 rows high and the destination is N rows high, then the first source row covers the area from 0 to N/32, the second from N/32 to 2*N/32, the third from 2*N/32 to 3*N/32. With fixed point maths this gives you a weighting factor for each source row. For example; with N = 99 and "8.8 fixed point", 99/32 = 0x63.00/0x20 = 0x03.18, which means the first source row fully affects the first 3 rows of the destination data (weighting factor = 0x01.00) and partially affects the fourth row of destination data (weighting factor = 0x00.18). You'd multiply the alpha values from the source row by this weighting factor and add it to the destination data. For example, if the "source row of alpha" is 0x20, 0xFF, 0xFF, 0x60, 0x00 and the weighting factor is 0x00.18 then you'd multiply to get the values 0x03, 0x18, 0x18, 0x09, 0x00 (rounding to nearest) and then add them to the destination row.
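A sketch of the vertical step, continuing from the code above (it assumes the horizontally scaled source rows were stored as a "srcHeight * width" buffer of alpha values, and that the destination buffer starts zeroed):

    /* Stretch rows of alpha vertically, accumulating weighted source rows into dest */
    void scale_rows_vertically(const uint8_t *src, int srcHeight,
                               uint8_t *dest, int destHeight, int width) {
        uint32_t step = ((uint32_t)destHeight << 8) / srcHeight;  /* Dest rows per source row, 8.8 */
        uint32_t y0 = 0;

        for (int row = 0; row < srcHeight; row++) {
            uint32_t y1 = y0 + step;               /* This source row covers dest area y0 to y1 */
            uint32_t y = y0;
            while (y < y1) {
                uint32_t rowEnd = (y & ~(uint32_t)0xFF) + 0x100;    /* End of current dest row */
                uint32_t next = (y1 < rowEnd) ? y1 : rowEnd;
                uint32_t weight = next - y;                         /* 0x00.00 to 0x01.00 */
                const uint8_t *s = &src[row * width];
                uint8_t *d = &dest[(y >> 8) * width];
                for (int x = 0; x < width; x++) {
                    uint32_t v = d[x] + ((s[x] * weight) >> 8);     /* Weighted accumulate */
                    d[x] = (v > 0xFF) ? 0xFF : (uint8_t)v;
                }
                y = next;
            }
            y0 = y1;
        }
    }

For whole destination rows the weight works out to 0x01.00 so the source alpha is copied unchanged, and for the partial row at the bottom of each source row's area you get the fractional weighting factor described above.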
Note that a lot of characters share the same "row patterns". For example; the upper halves of 'O' and 'Q' are the same (and 'B' and 'P', and 'E' and 'F'); a letter like 'I' might be a single bit pattern that's duplicated on most rows; for a letter like 'B' (and 'O', 'E', 'H', etc) the top half is a mirror of the bottom half; etc.
You could store the font data as 2 parts - one part containing "bit pattern for unique row" data, and another part containing the characters, where each row of a character is stored as a "reference to unique bit pattern". For example, the character 'O' might be stored as "pattern11, pattern12, pattern13, ..., pattern13, pattern12, pattern11" (because its top half is a mirror of its bottom half) and the character 'Q' might be stored as "pattern11, pattern12, pattern13, ..." (because its top half is the same as the top half of 'O'). With 128 characters that are 32 pixels high; you might only have 200 of these "unique row patterns" (instead of 4096 rows); and you'd scale these 200 "unique row patterns" horizontally (instead of scaling 4096 rows); which saves a whole lot of processing. Also; it can reduce file sizes (if the "reference to unique bit pattern" takes up fewer bits than the "bit pattern for unique row" data, and there are enough characters).
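A rough sketch of that layout (the sizes here - 128 characters, 32 rows, up to 256 unique patterns - are made up so that each reference fits in one byte, and scale_row_horizontally() is the hypothetical function from the earlier sketch):

    #include <stdint.h>
    #include <string.h>

    #define MAX_DEST_WIDTH 256

    typedef struct {
        int      numPatterns;
        uint32_t pattern[256];     /* "Bit pattern for unique row" data (24 bits used) */
        uint8_t  row[128][32];     /* Each character row is a reference into pattern[] */
    } Font;

    /* Scale each unique row pattern horizontally once, instead of once per character row */
    void prescale_patterns(const Font *font, uint8_t scaled[][MAX_DEST_WIDTH], int destWidth) {
        for (int i = 0; i < font->numPatterns; i++) {
            memset(scaled[i], 0, destWidth);
            scale_row_horizontally(font->pattern[i], 24, scaled[i], destWidth);
        }
    }

    /* Assemble one character from the pre-scaled patterns, ready for the vertical step */
    void get_char_rows(const Font *font, int c, const uint8_t scaled[][MAX_DEST_WIDTH],
                       uint8_t *out, int destWidth) {
        for (int r = 0; r < 32; r++) {
            memcpy(&out[r * destWidth], scaled[font->row[c][r]], destWidth);
        }
    }

This way the expensive horizontal scaling happens at most 200 times (once per unique pattern) and assembling a character is just copying rows.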
I have done all of this before (but with larger 64*64 fonts). The result (scaled to 2 different sizes) looked like this:
Finally; there's no reason why you can't store the "bit pattern for unique row" data as run-length encoded data. For example, you could have characters that are 254 pixels wide; then have "off width, on width, off width, ..."; where the pattern 000000111111....1111100000000 might be stored as the widths "6, 240, 8, 0xFF" (or just three bytes "6, 240, 0xFF") where the "0xFF" marks the end of the row (reserving 0xFF as the end-of-row marker is why the maximum width is 254 pixels and not 255). This would make it easier/faster to detect runs during the horizontal scaling step (and would also reduce file sizes for wide/high resolution fonts).
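A sketch of scaling a row directly from that RLE format (reusing the hypothetical draw_aa_hline() from the horizontal scaling sketch, so each "on" run becomes one small anti-aliased line and the run detection comes for free):

    /* Scale one RLE-encoded row horizontally; returns a pointer to the next row's data */
    const uint8_t *scale_rle_row(const uint8_t *rle, int srcWidth,
                                 uint8_t *dest, int destWidth) {
        uint32_t step = ((uint32_t)destWidth << 8) / srcWidth;   /* 8.8 fixed point */
        uint32_t x = 0;

        while (*rle != 0xFF) {
            x += *rle++;                      /* "Off" width */
            if (*rle == 0xFF) break;          /* Row may end after an "off" width */
            uint32_t runStart = x;
            x += *rle++;                      /* "On" width */
            draw_aa_hline(dest, runStart * step, x * step);
        }
        return rle + 1;                       /* Skip the 0xFF end-of-row marker */
    }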
Cheers,
Brendan