Re: How to use graphic adapter's bit blitting?
Posted: Fri Feb 21, 2014 2:57 am
by Binarus
Combuster wrote:Which doesn't surprise me. There's more extensive documentation ...
Thank you very much for the link. I had found the source code already, but didn't know about the documentation. By the way, reading hard-to-understand manuals and datasheets and learning about new techniques is one of my daily tasks. I suppose that's true for most people in this forum.
Combuster wrote:VBE/AF
That standard was dead on arrival because of DirectX. Don't expect to find anything meaningful with that.
I agree. I have read about some BIOS calls, but there was no information about the parameters the respective interrupt needs. I won't do any more research on this subject.
Combuster wrote:Binarus wrote:VESA modes with 1 bit per pixel
The only reason they don't exist is because nobody wants them
Well, I think everybody who wants to output text as fast as possible in graphics mode would want them.
Combuster wrote:and much hardware including the VGA is at least 4-bit colours under the hood. Even regular text mode as you know it has 16 colours and I would think you tried using it already, no?
Yes, I know about that and about the famous four planes and their chaining modes. But in text mode, this does not play any role, since the hardware itself generates the pixel data for a character once the character (one byte, or two including the attribute byte) has been put into the screen buffer. Putting a byte there is very fast, basically one machine instruction, while putting the individual pixels of a char (8x16) into the screen buffer at 8-bit color depth takes several hundred cycles because of all the inconvenient conversions.
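To illustrate, here is a minimal sketch of the naive inner loop for one scan line (NASM syntax; the register usage and the color values are only an example, not my actual code):
Code: Select all
; Naive expansion of one 8-pixel font scan line (1 bpp) to 8 bytes (8 bpp).
; In: AL = font byte (bit 7 = leftmost pixel), EDI -> screen buffer position.
        mov     cl, 8           ; 8 pixels per scan line
.pixel:
        xor     bl, bl          ; background color 0x00
        shl     al, 1           ; shift the leftmost remaining bit into CF
        jnc     .store          ; bit clear -> keep background
        mov     bl, 0xFF        ; bit set -> foreground color (white)
.store:
        mov     [edi], bl       ; write one pixel (one byte)
        inc     edi
        dec     cl
        jnz     .pixel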
I have even looked into MMX (which the CPU in question has) and SSE (which the CPU in question does not have anyway) to find out whether there are instructions which can distribute 8 bits onto a 64-bit register in the way I need (such an instruction would speed things up dramatically), but there isn't one (or I didn't find it). Well, after having read some pages of documentation, it turns out that this sort of distribution is one of the key features of a (guess what) blitter.
Combuster wrote:1: "need"? You do? I don't believe you.
I really need them; otherwise, I would have to use two screen lines per table row (see my previous post) and could fit only half of the table onto the screen at once, which is a very bad idea.
Combuster wrote:2: You can do 720x480 with a VGA driver on any compatible VGA hardware (including the VGA fixed frequency monitors, including alphanumeric mode, and I never heard of an i855 being incompatible). Pulling it off however is your homework - if only to teach you how graphics cards work (and make more sense out of more modern hardware).
Looks like that is the option I will choose, despite its heavy performance disadvantages. I have already done my homework (at least most of it - see my previous post).
Thank you very much,
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Fri Feb 21, 2014 3:20 am
by Combuster
I have even looked into MMX (which the CPU in question has) and SSE (which the CPU in question does not have anyway) to find out whether there are instructions which can distribute 8 bits onto a 64-bit register in the way I need (such an instruction would speed things up dramatically),
The instruction is MOV. In reality it's called a lookup table.
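A minimal sketch of the idea (NASM syntax; the table name and layout are just an example):
Code: Select all
; lut: 256 entries of 8 bytes each (2 KiB); entry i is the precomputed
; 8 bpp expansion of font byte i for one foreground/background pair.
        movzx   eax, byte [esi]         ; fetch the 8-pixel scan line
        mov     ebx, [lut + eax*8]      ; first 4 pixels
        mov     ecx, [lut + eax*8 + 4]  ; last 4 pixels
        mov     [edi], ebx              ; 8 pixels written with two MOVs
        mov     [edi+4], ecx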
Re: How to use graphic adapter's bit blitting?
Posted: Fri Feb 21, 2014 3:40 am
by Binarus
Combuster wrote:Which suggests two things:
1) The output is not needed in production - because if you're not there to watch the screen nobody gets to interpret stuff anyway.
2) In debugging conditions, there's often no particular need to reach production speed.
Unfortunately, in our case, these assumptions are not true.
First, the main application works together with custom-made hardware and software on other devices, and the interpolators are driven directly from the millisecond loop cycle, so extending the cycle is not an option.
Secondly, I could tell some interesting stories about how problems are analyzed when a device is in China, India, or some other location thousands of km away from us, and when every hour the device does not run costs about $20,000. To make a long story short: these are high-end devices, and at every location where such a device is installed there is a technician who has received special training for it and who will use every debugging possibility the device offers. Working together with such technicians, we have even analyzed problems with the help of video streams (the technician filmed the debug information on the screen). Instead of keeping the debug ports secret (as most companies do), we make them public (at least on these devices), because a device that is not working costs so much money that the problem must be solved as fast as possible.
Of course, screen output is not the only debug method, but believe it or not, it is the one which in many cases has helped fastest and most easily. You probably wouldn't believe me if I told you that in many cases it doesn't seem possible to provide technicians who take care of devices in the $1,000,000 range with a $500 laptop with a CAN adapter or other specialized hardware. Technicians get special training in how to disassemble the devices, how to change parts, and so on, but as soon as they are asked to connect a laptop with a CAN adapter (or other hardware), they either have no laptop or no CAN adapter, or they don't know how to use it. On the other hand, until now, no technician has been unable to find some old VGA monitor and connect it to the device ...
Third, and perhaps most important, the rules for the approval of a device for medical use are extremely rigid, notably in Germany. Among other things, this means that a device's software must not be altered in any way after it has been tested. Therefore, the debug screen will be kept in this software in unaltered form. (Off-topic, for those who are interested: it is even forbidden to recompile the same source code with another compiler, or with the same compiler using other flags, and to run that build in the device instead of the one it has been tested / validated / approved with.)
Combuster wrote:Also,
In graphics mode you have 640 time coordinates and 480 y coordinates. Quit the text and plot simple line graphs and you have nearly a second of history to watch.
We really have to see thousandths of a millimeter in all data, which would not be possible using graphs; furthermore, we have about 60 data sets, each consisting of one value per axis, and we have 8 axes / 8 coordinates in the device. So we would have 8 vertical pixels per data set and 80 horizontal pixels per axis; this just won't work. For history and recording, there will be other debug ports (later). Nevertheless, this is an interesting suggestion. Perhaps I will implement some graphs on a second screen page.
Thank you very much,
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Fri Feb 21, 2014 4:01 am
by Combuster
And finally it feels like we're getting to the real problem.
And in the meantime this is quite the confirmation fuel for all sorts of prejudices regarding the (American) medical system:
Secondly, I could tell some interesting stories about how problems are analyzed when a device is in China, India, or some other location thousands of km away from us, and when every hour the device does not run costs about $20,000.
Can I have my share of saved money now?
You probably wouldn't believe me if I told you that in many cases it doesn't seem possible to provide technicians who take care of devices in the $1,000,000 range with a $500 laptop with a CAN adapter or other specialized hardware.
Then why don't you ask your hardware department to stick a flatpanel and a keyboard/keypad onto the device itself so the receiving end doesn't have to end up looking for a console? That way if live statistics reading isn't sufficient, you can switch from live to browse mode and access whatever logs have been stored in the past hour.
Re: How to use graphic adapter's bit blitting?
Posted: Fri Feb 21, 2014 4:32 am
by h0bby1
Binarus wrote:I have even looked into MMX (which the CPU in question has) and SSE (which the CPU in question does not have anyway) to find out whether there are instructions which can distribute 8 bits onto a 64-bit register in the way I need (such an instruction would speed things up dramatically), but there isn't one (or I didn't find it). Well, after having read some pages of documentation, it turns out that this sort of distribution is one of the key features of a (guess what) blitter.
Wouldn't this do it?
http://www.gladir.com/LEXIQUE/ASM/pshufb.htm
With regular MMX you can only shuffle words, but I guess that in conjunction with the packing instructions, you could unpack bytes into words, do the shuffle there, and then repack to bytes. Since most color operations happen on 4-component vectors, a pixel still fits into a register, and if you want to do operations on pixels, it can be a good thing to have them in words anyway.
Re: How to use graphic adapter's bit blitting?
Posted: Sat Feb 22, 2014 2:12 am
by Binarus
Combuster wrote:Why don't you ask your hardware department to stick a flatpanel and a keyboard/keypad onto the device itself so the receiving end doesn't have to end up looking for a console?
When I proposed this, I was told that it isn't possible for a bunch of reasons, some of them rather idiotic, among them:
- The end customers (hospitals) forbid their employees to use laptops other than the hospital's own (you know, the ones where the employee doesn't have administrator rights, so that he can't install a CAN driver or even a USB-to-serial driver without the help of the IT department, which of course is mostly not available when a device fails).
- It would make the price even higher (yes, hospital administrations are really proud of saving 0.05% of the device's price; they have no clue how the devices work, they are too shortsighted to compare the cost of an outage to the cost of the extra hardware in advance, and they start thinking no earlier than when a device actually goes dead - and when that happens, they go crazy, become insolent, and are outraged about the costs the failure has caused).
- From the hospital administration's point of view, the technicians already have laptops (the ones which can't be used) ...
- From our point of view, we would have to be very careful to make sure that the extra hardware is not considered part of the device by the organization which approves it (no, you really, really, really do NOT want to get a standard Windows laptop approved as part of a medical device).
Furthermore, from my experience, the live debug view is far more valuable than the logs. We could solve (or at least analyze) most problems very efficiently with the help of the debug output (via serial port, screen, or other means).
Whatever the case may be, I need the screen output in the form I have described, and I will implement it via graphics mode and optimized assembly for printing characters. That way I can even put additional information onto the screen (a width of 1280 pixels allows for 160 characters per line, which is still readable on an LCD monitor).
Of course, the graphics screen won't fit into 64K, and I can't tell from memory right now how to switch the memory window, but I am quite sure that this is not too difficult. I have some documents about this subject.
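If it is the standard VBE window mechanism, a minimal real-mode sketch would be (the bank variable is a placeholder; the granularity unit comes from the mode info block):
Code: Select all
; Switch the 64K memory window (VBE function 05h, Display Window Control).
        mov     ax, 4F05h       ; VBE Display Window Control
        xor     bh, bh          ; BH = 00h: set window
        xor     bl, bl          ; BL = 0: window A
        mov     dx, [bank]      ; window position in granularity units
        int     10h             ; returns AX = 004Fh on success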
Thank you very much again,
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Sat Feb 22, 2014 2:58 am
by Binarus
h0bby1 wrote:
wouldn't this do it ?
http://www.gladir.com/LEXIQUE/ASM/pshufb.htm
with regular mmx, you can only shuffle word, but i guess in conjunction with the packing function, you could get bytes into words, do the shuffle from there, and then repack to byte, as most color operation happen on mostly 4 component vectors, a pixel can still fit into a register, and if you want to do some operation on pixels, it can be good thing to have them in words anyway
Unfortunately, the CPU in question only has SSE2, but the instruction you have proposed is part of SSSE3. Sorry for not mentioning this restriction.
Furthermore, when printing a char, you need to distribute 8 *bits* into an 8-byte register (given a bitmap font with a width of 8 pixels per char), and I don't yet understand how I could use pshufb and the packing instructions to achieve this.
As an example, let's assume a certain "scan line" of a char in a certain font is described by the following byte (8 pixels char width, i.e. 8 bits, i.e. 1 byte): 0x55 (0b01010101). I have to convert this to 8-bit color depth, i.e. one byte per pixel. For the sake of simplicity, let's assume that the color 0xFF is white and that I want the respective character to be printed in white. Then the bit pattern mentioned above must be converted like this:
0b01010101 => 0xFF 0x00 0xFF 0x00 0xFF 0x00 0xFF 0x00
I still think that neither MMX nor SSE has an instruction for that. If there is one, could you please explain?
Thank you very much,
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Sat Feb 22, 2014 3:13 am
by Binarus
Combuster wrote:The instruction is MOV. In reality it's called a lookup table.
Yes, to be honest, when doing my homework, I already used lookup tables. Currently, I have code which works on nibbles and thus uses a LUT of 16 entries of 4 bytes each. But I am afraid that I could pollute the first-level cache with the LUT data if I used the full size (that would be 256 entries of 8 bytes each, making a total of 2 kB; the CPU in question is some sort of old embedded Celeron with small caches, from 2006 / 2007). On the other hand, if there are means to avoid cache pollution (and I guess there are, although I can't tell for sure at the moment), the memory accesses will still make it slow ...
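For illustration, the inner step of my current nibble-based variant looks roughly like this (NASM syntax; the names are placeholders, not my actual code):
Code: Select all
; lut4: 16 entries of 4 bytes each (64 bytes); entry i is the 8 bpp
; expansion of the 4-bit pattern i for one foreground/background pair.
        movzx   eax, byte [esi]         ; fetch the 8-pixel scan line
        mov     edx, eax
        shr     eax, 4                  ; high nibble = left 4 pixels
        and     edx, 0x0F               ; low nibble = right 4 pixels
        mov     ebx, [lut4 + eax*4]
        mov     ecx, [lut4 + edx*4]
        mov     [edi], ebx              ; store left 4 pixels
        mov     [edi+4], ecx            ; store right 4 pixels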
Well, I will just try it, and since I am measuring CPU time at many places in the application, I will see the consequences quite well. But generally, I would still like to have access to the blitter or some other means to do the job without too many memory accesses.
Thank you very much,
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Sat Feb 22, 2014 4:02 am
by Brendan
Binarus wrote:As an example, let's assume a certain "scan line" of a char in a certain font is described by the following byte (8 pixels char width, i.e. 8 bits, i.e. 1 byte): 0x55 (0b01010101). I have to convert this to 8-bit color depth, i.e. one byte per pixel. For the sake of simplicity, let's assume that the color 0xFF is white and that I want the respective character to be printed in white. Then the bit pattern mentioned above must be converted like this:
0b01010101 => 0xFF 0x00 0xFF 0x00 0xFF 0x00 0xFF 0x00
I still think that neither MMX nor SSE has an instruction for that. If there is one, could you please explain?
I haven't used SSE for ages, but maybe something like this:
Code: Select all
const1: dq 0x8040201008040201
mov eax,0x00000055
movd xmm0,eax ;xmm0 = 0x0000000000000055
punpcklbw xmm0,xmm0 ;xmm0 = 0x0000000000005555
punpcklbw xmm0,xmm0 ;xmm0 = 0x0000000055555555
punpcklbw xmm0,xmm0 ;xmm0 = 0x5555555555555555
movq xmm1,[const1] ;xmm1 = 0x8040201008040201 (8-byte load, no 16-byte alignment needed)
pandn xmm0,xmm1 ;xmm0 = 0x8000200008000200
pxor xmm1,xmm1 ;xmm1 = 0x0000000000000000
pcmpeqb xmm0,xmm1 ;xmm0 = 0x00FF00FF00FF00FF
I'm not sure if there's a faster way (other than using a lookup table).
Binarus wrote:But I am afraid that I could pollute the first-level cache with the LUT data if I used the full size (that would be 256 entries of 8 bytes each, making a total of 2 kB; the CPU in question is some sort of old embedded Celeron with small caches, from 2006 / 2007). On the other hand, if there are means to avoid cache pollution (and I guess there are, although I can't tell for sure at the moment), the memory accesses will still make it slow ...
If the CPU really is extremely bad (most 2006 Celerons have 512 KiB of L2 cache, so a 2 KiB lookup table is tiny), you could split the byte into a pair of 4-bit pieces and do two lookups, with a 64 byte lookup table:
Code: Select all
result = table[byte & 0x0F] | ((uint64_t)table[byte >> 4] << 32);
Cheers,
Brendan
Re: How to use graphic adapter's bit blitting?
Posted: Sat Feb 22, 2014 6:02 am
by h0bby1
Binarus wrote:h0bby1 wrote:
Wouldn't this do it?
http://www.gladir.com/LEXIQUE/ASM/pshufb.htm
With regular MMX you can only shuffle words, but I guess that in conjunction with the packing instructions, you could unpack bytes into words, do the shuffle there, and then repack to bytes. Since most color operations happen on 4-component vectors, a pixel still fits into a register, and if you want to do operations on pixels, it can be a good thing to have them in words anyway.
Unfortunately, the CPU in question only has SSE2, but the instruction you have proposed is part of SSSE3. Sorry for not mentioning this restriction.
Furthermore, when printing a char, you need to distribute 8 *bits* into an 8-byte register (given a bitmap font with a width of 8 pixels per char), and I don't yet understand how I could use pshufb and the packing instructions to achieve this.
As an example, let's assume a certain "scan line" of a char in a certain font is described by the following byte (8 pixels char width, i.e. 8 bits, i.e. 1 byte): 0x55 (0b01010101). I have to convert this to 8-bit color depth, i.e. one byte per pixel. For the sake of simplicity, let's assume that the color 0xFF is white and that I want the respective character to be printed in white. Then the bit pattern mentioned above must be converted like this:
0b01010101 => 0xFF 0x00 0xFF 0x00 0xFF 0x00 0xFF 0x00
I still think that neither MMX nor SSE has an instruction for that. If there is one, could you please explain?
Thank you very much,
Binarus
MMX has an equivalent of the shuffle instruction, but if I'm getting what you say, it is more like converting from a 1 bpp image; I thought it was about converting from 8 bpp and shuffling bytes within a register.
Otherwise, something like this could do it:
Code: Select all
const1: dq 0x8040201008040201
movd mm0, eax         ; mm0 = 0x0000000000000055 for AL = 0x55
punpcklbw mm0, mm0    ; mm0 = 0x0000000000005555
pshufw mm0, mm0, 0x00 ; broadcast word 0: mm0 = 0x5555555555555555 (PSHUFW needs SSE)
pand mm0, [const1]    ; keep one font bit per byte
pcmpeqb mm0, [const1] ; byte equals its mask -> 0xFF where the bit was set
The reverse can be done in one instruction with PMOVMSKB.
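For example (assuming the 0xFF/0x00 byte mask is in mm0):
Code: Select all
pmovmskb eax, mm0    ; collect the top bit of each of the 8 bytes into AL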
But for a hardware blitter, you'll have to initialize the graphics card with card-specific code; VESA never really handled hardware blitting. It may be possible to copy one area of the video memory to another as a 'screen to screen blit', since VESA modes have a memory area bigger than what is used for display; you can change the window access, or an offset into the memory at which the frame buffer starts, and sort of blit memory areas from one point to another inside the VESA memory, but I'm really unsure how well this is supported by cards.
There are supposedly 'bitblt' and 'putmonoimage' functions in VESA 2.0 which can copy from an area of the offscreen VGA memory to the framebuffer, but again I'm not sure how many cards support them; it can be worth a check - I remember trying to use them a long time ago under DOS.
There is probably not a lot of offscreen memory available anyway, especially in high-res modes, which makes it unsuitable for games or real multimedia applications; but to store a font, or even a single fullscreen background image, it can be useful, or to get fast double buffering, because you can generally assume at least twice the size of the framebuffer at some resolutions.
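For instance, flipping between two buffers is a standard VBE call (function 07h); a minimal real-mode sketch, assuming the back buffer begins at scan line 480 in offscreen memory:
Code: Select all
; Flip to a back buffer (VBE function 07h, Set/Get Display Start).
        mov     ax, 4F07h       ; VBE Display Start Control
        xor     bx, bx          ; BH = 00h, BL = 00h: set display start
        xor     cx, cx          ; CX = first displayed pixel in scan line
        mov     dx, 480         ; DX = first displayed scan line
        int     10h             ; returns AX = 004Fh on success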
Re: How to use graphic adapter's bit blitting?
Posted: Thu Mar 06, 2014 2:50 am
by Binarus
Code: Select all
const1: dq 0x8040201008040201
mov eax,0x00000055
movd xmm0,eax ;xmm0 = 0x0000000000000055
punpcklbw xmm0,xmm0 ;xmm0 = 0x0000000000005555
punpcklbw xmm0,xmm0 ;xmm0 = 0x0000000055555555
punpcklbw xmm0,xmm0 ;xmm0 = 0x5555555555555555
movq xmm1,[const1] ;xmm1 = 0x8040201008040201 (8-byte load, no 16-byte alignment needed)
pandn xmm0,xmm1 ;xmm0 = 0x8000200008000200
pxor xmm1,xmm1 ;xmm1 = 0x0000000000000000
pcmpeqb xmm0,xmm1 ;xmm0 = 0x00FF00FF00FF00FF
Brendan, thank you very much for this excellent proposition. Since I have never used SSE or MMX, since I only had a vague idea that it could help in this case, and since I haven't had the time yet to study the complete SSE instruction set in detail, your code is very helpful. During the weekend, I will study the SSE instructions it uses. I am quite sure that your code will be much faster than my current solution which uses LUTs.
Brendan wrote:If the CPU really is extremely bad (most 2006 Celerons have 512 KiB of L2 cache, so a 2 KiB lookup table is tiny), you could split the byte into a pair of 4-bit pieces and do two lookups, with a 64 byte lookup table:
I was actually talking about the first-level cache (which is by far faster than the second-level cache), and for that reason, I am currently doing exactly what you propose (a nibble-based LUT) - see my previous post.
Thank you very much again for this excellent and helpful answer.
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Thu Mar 06, 2014 3:10 am
by Binarus
h0bby1 wrote:MMX has an equivalent of the shuffle instruction, but if I'm getting what you say, it is more like converting from a 1 bpp image; I thought it was about converting from 8 bpp and shuffling bytes within a register.
Otherwise, something like this could do it:
const1: dq 0x8040201008040201
movd mm0, eax         ; mm0 = 0x0000000000000055 for AL = 0x55
punpcklbw mm0, mm0    ; mm0 = 0x0000000000005555
pshufw mm0, mm0, 0x00 ; broadcast word 0: mm0 = 0x5555555555555555 (PSHUFW needs SSE)
pand mm0, [const1]    ; keep one font bit per byte
pcmpeqb mm0, [const1] ; byte equals its mask -> 0xFF where the bit was set
The reverse can be done in one instruction with PMOVMSKB.
h0bby1, thank you very much for this code. Unfortunately, I can't use MMX, since the MMX registers are shared with the FPU, which is heavily used throughout my application, and switching between MMX and FPU code seems to be very costly. But your algorithm is interesting (it seems like the one Brendan proposed) and appears portable to SSE. I will look into the respective instructions during the weekend.
h0bby1 wrote:But for a hardware blitter, you'll have to initialize the graphics card with card-specific code
This is what I was initially after, but there is no documentation for the i855 (other than the Linux drivers).
h0bby1 wrote:VESA never really handled hardware blitting. It may be possible to copy one area of the video memory to another as a 'screen to screen blit', since VESA modes have a memory area bigger than what is used for display; you can change the window access, or an offset into the memory at which the frame buffer starts, and sort of blit memory areas from one point to another inside the VESA memory, but I'm really unsure how well this is supported by cards.
I have made the same experience. There seem to be some BIOS calls, but again there either is no usable documentation, or it just didn't work when I played around with it. Furthermore, the BIOS generally seems to be extremely slow. If the BIOS needs 170 µs to put a char onto the screen in graphics mode, while I need 1.7 µs with my nibble-based lookup algorithm, the BIOS's blitting function will probably still be much slower than my algorithm, let alone Brendan's SSE-based one or your MMX-based one.
h0bby1 wrote:There are supposedly 'bitblt' and 'putmonoimage' functions in VESA 2.0 which can copy from an area of the offscreen VGA memory to the framebuffer, but again I'm not sure how many cards support them; it can be worth a check - I remember trying to use them a long time ago under DOS.
I think I will rather use the algorithm which Brendan and you have proposed; I probably want to be able to run the application (well, actually, the code which does the screen output) on other hardware platforms as well, and what I found when researching and trying the BIOS VESA functions is by far too vague, too unreliable, and non-portable.
Thank you very much again,
Binarus
Re: How to use graphic adapter's bit blitting?
Posted: Thu Mar 06, 2014 4:30 pm
by h0bby1
For speed, with VESA 2.0 you can fetch function pointers to the BIOS routines to avoid an interrupt call, but BIOS functions are coded for 16-bit real mode, so it can be a bit bothersome unless you use some trick, or something like UniVBE for protected mode; and it's not really documented much anymore, or used much at all. They are not supposed to be very fast, but normally still faster than copying from memory to the screen with the CPU, especially if you need to blit large areas in high res. If you only want to display text, it can be better to use text mode directly.
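A minimal sketch of fetching those pointers (VBE 2.0 function 0Ah, Return Protected Mode Interface; the table offsets are per the VBE 2.0 spec):
Code: Select all
; Get the VBE 2.0 protected mode interface (real-mode call).
        mov     ax, 4F0Ah       ; VBE Return Protected Mode Interface
        xor     bl, bl          ; BL = 00h: return the PM function table
        int     10h             ; returns AX = 004Fh on success
        ; ES:DI -> table, CX = table length, with word offsets at:
        ; [ES:DI+0] Set Window, [ES:DI+2] Set Display Start,
        ; [ES:DI+4] Set Primary Palette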
But anyway, high-res graphics in software will always be slow, and I don't think I saw many video games or graphics applications, even in the DOS days, using the VESA functions for more than fast double buffering, which can already be quite a gain if you just switch the memory window of the framebuffer between two areas. And again, if you are in protected mode you can't use BIOS functions without tricks, and I'm not sure what the state of the VBE things is for modern cards.
Intel has released documentation about their graphics adapters, as has ATI; I looked quickly into it, and it seems to contain enough information to write a working driver. You can also look at the X11 DDX drivers for the 2D part (surface blitting/conversion); there are open source drivers in X11. It's something I'm going to get into as soon as my OS is advanced enough to have a code editor/compiler and is bootable on real hardware, so I can test some graphics driver code there.
The code I posted is the Intel way to do a reverse PMOVMSKB, which is what you need to convert monochrome pixels to bytes; my code is the same as Brendan's except that it replaces two unpacks with a word shuffle. I'm not sure which is faster - the one I posted is from Intel, but pshuf has a reputation of being somewhat slow. To optimize MMX well (and the same goes for assembly in general), you also need to take into account all the micro-ops and how the CPU deals with them internally.
And yes, MMX can't be used at the same time as the FPU, so to take advantage of it you need to either not use the FPU, or cleanly separate the parts that use the FPU from the ones that use MMX. It's not THAT costly to switch with the 'emms' instruction; it will be costly if you need to interleave MMX with FPU code in the same loop or routine, but otherwise it's still manageable if you don't have big loops that use both, and it can still be worthwhile if you have long loops with a lot of integer manipulation code.
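For example, a sketch of that separation (the routine names are placeholders):
Code: Select all
        call    expand_font_mmx   ; MMX-only section, no x87 code inside
        emms                      ; reset the x87 tag word before FPU code runs
        call    run_interpolators ; x87 FPU section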
In absolute terms, even using MMX or SSE won't give a huge speed boost, unless you really have to do a lot of operations - say, for something like the "Pod" game engine, with lots of 3D, alpha blending, and bilinear interpolation. It can still improve performance, because operations like saturation or min/max can remove branching from the code and do some things a bit faster; but if you have a lot of FPU code interleaved with it, or the whole thing is not designed from the start to take advantage of MMX - including memory alignment, and having large parts of the code that can use MMX without switching - the performance gain will not be very big.
SSE can be useful for dealing with large amounts of floating-point operations, especially if you want to keep using MMX at the same time, and the FPU on Intel has never been that great anyway, so SSE can be worthwhile. There are also instructions to deal with the cache: you can tell the CPU to prefetch memory, and there is a non-temporal move instruction if you don't want to pollute the cache with temporary data. On x64, all code seems to use SSE by default instead of the FPU, and there is a header with the SSE intrinsics so you can use them directly in C; it comes in pretty handy for doing a lot of floating-point vector math.
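For instance, a sketch of writing expanded pixels with non-temporal stores (assumes MM0/MM1 already hold the pixel data and EDI points to an 8-byte-aligned framebuffer address):
Code: Select all
        prefetchnta [esi+64]    ; prefetch upcoming font data without caching it
        movntq  [edi], mm0      ; store 8 pixels, bypassing the cache
        movntq  [edi+8], mm1    ; store 8 more pixels
        sfence                  ; make the non-temporal stores globally visible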