gui scroll is super slow

lama · Post by **lama** » Tue Jul 19, 2011 3:09 pm

Hello, I have a question about scrolling in a graphics mode where the graphics card gives no help with it: plain VESA mode.
I'm using 1280 x 1024 with 16-bit colour, and the entire screen is occupied by one terminal window where you can do whatever you want, but when it comes to scrolling it is insanely slow! I have optimised it as much as I could, but I just can't see how to speed it up. Here is the code snippet that does the scrolling:

Code: Select all

vesaScroll:
cli
xchg bx, bx
call shlx_convdevstr                  
call vfs_getm
mov ebx, dword [eax+123]         ; get video memory
add ebx, 1280*100		; skip terminal title

mov edi, ebx			; destination.
add ebx, 1280*20
mov esi, ebx

;mov ecx, ((1280*960)-(1280*100))*2
mov ecx, (((1280*960)-(1280*100))*2)/4
cld
;rep movsb
rep movsd

;mov edi, dword [eax+123]
;add edi, ((1280*960)-(1280*100))*2

push eax

mov ax, 0x2945
mov ecx, (1280*20)/2
rep stosw

pop eax

;call shlx_convdevstr
;call vfs_getm

;mov cx, word [eax+135]
;sub cx, 150
;mov word [VESA_YPOS], cx
mov word [VESA_YPOS], 910
mov word [eax+129], 910
mov word [VESA_XPOS], 4

sti
ret

Don't worry, shlx_convdevstr and vfs_getm are very fast. The super slow part of the code is, of course, writing to memory.
At 1280 x 1024 with 2 bytes per pixel I get something around 2 MB to transfer, and this memory really does need to be moved to scroll down.
Interrupts are off, and movsd is the fastest copy instruction available on x86, as far as I know.
I can't believe that it takes SO MUCH time to write only 2 MB of data to the frame buffer using just the CPU!
There must be a better solution. Any ideas what I can do about it?

bluemoon · Post by **bluemoon** » Tue Jul 19, 2011 3:29 pm

You are reading from video memory. Try implementing an off-screen buffer, or render using 3D and alter the texture UV coordinates.

guyfawkes · Post by **guyfawkes** » Tue Jul 19, 2011 3:39 pm

Have you set "MTRR: write-combining"?

lama · Post by **lama** » Tue Jul 19, 2011 3:47 pm

You mean that if I read the screen contents from real RAM (instead of from the frame buffer) and then write them to the frame buffer, it should be much faster, right? So, if I understand it correctly, accessing the non-existent memory assigned to the frame buffer is much slower than accessing regular, existent RAM, right?

lama · Post by **lama** » Tue Jul 19, 2011 3:52 pm

No, I have not. So, before scrolling I enable write-combining in the MTRRs, then do some scrolling, and then restore the MTRRs to how they were before? I have never changed the cache settings, so I needed to ask.

Gigasoft · Post by **Gigasoft** » Tue Jul 19, 2011 4:02 pm

You have an interesting understanding of the concept of nonexistence, that's for sure.

bluemoon · Post by **bluemoon** » Tue Jul 19, 2011 4:52 pm

If you have an off-screen buffer, you don't need to "scroll" at all; you just copy a different region of it to video RAM.
And yes, reading from video RAM is slower than reading from main memory, and it is non-standard, which may cause unexpected issues.

FYI (just random search data for a rough idea, actual performance varies on many factors):
Nvidia AGP8X video ram bandwidth ~8Gb/s, latency ~ order of bus frequency
random search some Core i7 report: 10.6Gb/s, latency 9ns

I also don't think that scrolling video RAM, or keeping a single big off-screen buffer for all text, is a good way to implement "terminal scroll".
Since the terminal history can get really big, I would just cache the displayable region in an off-screen buffer/texture,
and on scroll render the new text into that region (which should be pretty fast and not happen often) and set a dirty flag to tell the display driver to flush it to video RAM.
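
A rough C sketch of the "copy a different region" idea, assuming a 16-bit mode and an off-screen buffer organised as a ring of pixel rows. The names term_view, view_scroll and view_blit are hypothetical and not from the thread. Scrolling only advances the index of the top row and clears the rows that wrap around; the blit then copies rows out in display order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical off-screen view: `rows` pixel rows of `width` 16-bit pixels,
 * stored as a ring so that scrolling never moves pixel data. */
typedef struct {
    uint16_t *pixels;   /* rows * width pixels in ordinary RAM */
    uint32_t  width;    /* visible pixels per row */
    uint32_t  rows;     /* number of rows in the ring */
    uint32_t  top;      /* ring index of the row shown at the top */
} term_view;

/* Scroll up by `n` rows: clear the rows that leave the top (they become the
 * new bottom rows) and advance the top index. No bulk copy is needed. */
static void view_scroll(term_view *v, uint32_t n, uint16_t background)
{
    assert(n <= v->rows);
    for (uint32_t i = 0; i < n; i++) {
        uint16_t *row = v->pixels + (size_t)((v->top + i) % v->rows) * v->width;
        for (uint32_t x = 0; x < v->width; x++)
            row[x] = background;
    }
    v->top = (v->top + n) % v->rows;
}

/* Copy the view to the frame buffer, top row first. `pitch` is the frame
 * buffer's bytes per scan line, which may exceed width * 2. */
static void view_blit(const term_view *v, uint8_t *framebuffer, uint32_t pitch)
{
    for (uint32_t y = 0; y < v->rows; y++) {
        const uint16_t *row = v->pixels + (size_t)((v->top + y) % v->rows) * v->width;
        memcpy(framebuffer + (size_t)y * pitch, row, (size_t)v->width * 2);
    }
}
```

With this layout the only large copy per scroll is the single write pass to video RAM; nothing is ever read back from it.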

guyfawkes · Post by **guyfawkes** » Tue Jul 19, 2011 6:44 pm

lama wrote:No, I have not. So, before scrolling I enable write-combining in the MTRRs, then do some scrolling, and then restore the MTRRs to how they were before? I have never changed the cache settings, so I needed to ask.

This may answer some of your questions:
http://forum.osdev.org/viewtopic.php?f=1&t=13536

lama · Post by **lama** » Wed Jul 20, 2011 1:56 pm

So I'm a bit confused here.

I tried the off-screen buffer: no reading from the frame buffer, just writing to it. I can't help it, but it seems to me that RAM and frame buffer access speeds are equal, so it is still insanely slow.
I looked at the page you posted, but I can't get that cache mode working. Is there a specification that explains how to handle the MTRRs? It would be awesome to have write-combining mode only for the frame buffer memory.

Code: Select all

vesaScroll:
cli
mov eax, dword [THIS_SHLX]     ; current shell that is opened
mov ebx, dword [eax+256]         ; off-screen buffer
push ebx
add ebx, 1280*100
mov edi, ebx
add ebx, 1280*20
mov esi, ebx
mov ecx, (((1280*960)-(1280*100))*2)/4
cld
rep movsd
mov ax, 0x2945
mov ecx, (1280*14)/2
rep stosw
call shlx_convdevstr
call vfs_getm
mov ebx, dword [eax+123]
add ebx, 1280*100		; skip terminal title
mov edi, ebx			; destination. (this is frame buffer)
pop ebx
add ebx, 1280*100
mov esi, ebx
mov ecx, (((1280*960)-(1280*100))*2)/4
rep movsd
push eax
mov ax, 0x2945
mov ecx, (1280*14)/2
rep stosw
pop eax
mov word [VESA_YPOS], 910
mov word [eax+129], 910
mov word [VESA_XPOS], 4
sti
ret

As you can see, I do the actual scrolling inside the off-screen buffer, which should be a lot faster than reading the frame buffer. Then I write the finished screen to the frame buffer. As I said before, the speed does not differ much.

araxestroy · Post by **araxestroy** » Wed Jul 20, 2011 4:15 pm

EDIT: Ignore this, I misread your code.

guyfawkes · Post by **guyfawkes** » Wed Jul 20, 2011 6:40 pm

You would do something like this:
First, test the processor.

Code: Select all

TestProcessor:
 ;====================================================;
 ; Test What processor                                ;
 ;====================================================;
        pushfd                                        
        pop     eax
        mov     ecx,eax
        xor     eax,0x00200000                        
        push    eax
        popfd
        pushfd                                        
        pop     eax
        push    ecx                                  
        popfd
        and     eax,0x00200000                        
        and     ecx,0x00200000                        
        cmp     eax,ecx
        jz      NotaPentium
 ;----------------------------------------------------;
 ;  Pentium or later                                  ;
 ;----------------------------------------------------;
        mov eax,0
        cpuid
        mov [VendorId],ebx
        mov [VendorId+4],edx
        mov [VendorId+8],ecx
 ;----------------------------------------------------;
 ;  Get Version & Features info                       ;
 ;----------------------------------------------------;
        mov eax,1
        cpuid
        mov [Version],eax
        mov [Veatures],edx
 ;----------------------------------------------------;
 ; Cpuid done                                         ;
 ;----------------------------------------------------;
Cpuiddone:
        clc                                           ; success: clear carry (the cmp above may have left it set)
        ret

NotaPentium:  
        stc      
        ret

Then, if the processor type is OK:

Code: Select all

 ;====================================================;
 ; MtrrSetUp                     (for  write combine) ;
 ;====================================================;
MtrrSetUp:
        pushad
        mov   edx,[Veatures]
        test  edx,1000000000000b
        jz    NoMtrr
        call  FindEmptyMtrr
        jc    NoMtrr
        mov   edx,0x0                     
        mov   eax,[ModeInfo_PhysBasePtr]               ; NOTE: This is vesa2 LFB address
        or    eax,1
        wrmsr
        inc   ecx
        mov   edx,0xf
        mov   eax,0xff800800
        wrmsr
        mov   ecx,0x2ff                   
        rdmsr
        or    eax,100000000000b           
        wrmsr
        popad
        ret

NoMtrr:
        popad
        stc      
        ret

 ;====================================================;
 ; FindEmptyMtrr.                                     ;
 ;====================================================;
FindEmptyMtrr:
       mov    ecx,0x201-2
@@:
        add    ecx,2
        cmp    ecx,0x200+8*2
        jge    ErrorNoFreeMtrr
        rdmsr
        test   eax,0x0800
        jnz    @b
        dec    ecx
        ret
ErrorNoFreeMtrr:      
        stc  
        ret

Note: This is without any paging etc.
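
For reference, the constants written to the mask register in MtrrSetUp (EDX:EAX = 0xf:0xff800800) correspond to an 8 MiB range with 36 physical address bits and the valid bit (bit 11) set. The C sketch below shows how such values can be derived for another power-of-two size; the helper name mtrr_write_combine and the mtrr_pair type are hypothetical. It only computes the values and does not execute wrmsr, and Intel's manuals describe a longer update procedure (disabling and flushing caches first) that is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define MTRR_TYPE_WC     0x01u          /* write-combining memory type */
#define MTRR_MASK_VALID  (1ull << 11)   /* "valid" bit in the PHYSMASK register */

/* Values to load into one variable-range MTRR pair (hypothetical helper type). */
typedef struct {
    uint64_t physbase;  /* for the PHYSBASE MSR (EDX:EAX) */
    uint64_t physmask;  /* for the PHYSMASK MSR (EDX:EAX) */
} mtrr_pair;

/* `size` must be a power of two and `base` must be aligned to it.
 * `phys_bits` is the physical address width; the assembly above assumes 36.
 * CPUs that support CPUID leaf 0x80000008 report the real width there. */
static mtrr_pair mtrr_write_combine(uint64_t base, uint64_t size, unsigned phys_bits)
{
    assert(size >= 4096 && (size & (size - 1)) == 0);
    assert((base & (size - 1)) == 0);
    assert(phys_bits >= 32 && phys_bits < 64);

    uint64_t addr_mask = (1ull << phys_bits) - 1;
    mtrr_pair p;
    p.physbase = base | MTRR_TYPE_WC;
    p.physmask = (~(size - 1) & addr_mask) | MTRR_MASK_VALID;
    return p;
}
```

For a linear frame buffer at 0xE0000000 (an assumed address, used only as an example) and an 8 MiB range, this gives a base of 0xE0000001 and a mask of 0xF:0xFF800800, the same mask the assembly hard-codes.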

Brendan · Post by **Brendan** » Wed Jul 20, 2011 9:38 pm

Hi,

lama wrote:So I'm a bit confused here. I tried the off-screen buffer: no reading from the frame buffer, just writing to it. I can't help it, but it seems to me that RAM and frame buffer access speeds are equal, so it is still insanely slow.

First, nothing guarantees that display memory is contiguous. For example, there might be 2048 pixels per line where 1280 of them are visible and the remaining 768 pixels are just padding. VBE gives you a "bytes_between_lines" value for this reason, and you should be using it - for example:

Code: Select all

    dest_address = address for start of display memory in video card
    src_address = address of your buffer in RAM

    for(line = 0; line < lastLine; line++) {
        memcpy(dest_address, src_address, 1280*2);   // copy one visible line
        src_address += 1280*2;                       // buffer lines are packed
        dest_address += bytes_between_lines;         // display lines may be padded
    }

The alternative (to make sure your code works on all video cards, rather than just yours, while only doing one copy where possible) is to do something like:

Code: Select all

    dest_address = address for start of display memory in video card
    src_address = address of your buffer in RAM
    if(bytes_between_lines == 1280*2) {              // "==", not "=" (corrected later in the thread)
            memcpy(dest_address, src_address, 1280*2 * lastLine);
    } else {
        for(line = 0; line < lastLine; line++) {
            memcpy(dest_address, src_address, 1280*2);
            src_address += 1280*2;
            dest_address += bytes_between_lines;
        }
    }
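
The same copy written out as compilable C, as a sketch only: the function name blit_to_display is hypothetical, and `pitch` stands for the bytes-per-scan-line value that VBE reports in its mode information (the BytesPerScanLine field). The destination is display memory and the source is the tightly packed buffer in RAM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy `lines` rows of `row_bytes` bytes from a tightly packed RAM buffer
 * to display memory whose rows start `pitch` bytes apart. */
static void blit_to_display(uint8_t *display, uint32_t pitch,
                            const uint8_t *buffer, uint32_t row_bytes,
                            uint32_t lines)
{
    assert(pitch >= row_bytes);

    if (pitch == row_bytes) {
        /* No padding between lines: one big copy is enough. */
        memcpy(display, buffer, (size_t)row_bytes * lines);
        return;
    }
    for (uint32_t y = 0; y < lines; y++) {
        memcpy(display, buffer, row_bytes);
        buffer  += row_bytes;   /* next packed line in RAM */
        display += pitch;       /* next scan line, skipping the padding */
    }
}
```

For the mode in this thread the call would look like blit_to_display(lfb, pitch, buffer, 1280 * 2, lastLine), with lfb and pitch taken from the VBE mode information.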

Once that's fixed, the next thing I'd do is find out which pieces are causing the biggest problems. There's no point worrying about the code that copies pixels to display memory if 90% of the time is spent drawing characters in an insanely slow "for(y2 = 0; y2 < 16; y2++) { for(x2 = 0; x2 < 8; x2++) { if(something) put_pixel(x1 + x2, y1 + y2, colour); } }" nested loop. One trick I use is to set a group of pixels in the top left corner of the screen to different colours depending on what my code is doing. For example, you might set the top left pixels to white when drawing characters in the buffer, set them to red when scrolling the buffer, and then set them to yellow when blitting from the buffer to display memory. By watching those top left pixels you can get a fairly good idea of how much time is being spent where - if they're white most of the time then....
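
A minimal sketch of the coloured-marker trick, assuming a 16-bit RGB 5:6:5 mode. The function name show_stage, the colour constants and the 8x8 marker size are all made up for illustration. The marker is written straight to display memory so that it shows up immediately, independent of any off-screen buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stage colours in RGB 5:6:5. */
enum {
    STAGE_DRAW   = 0xFFFF,  /* white: drawing characters into the buffer */
    STAGE_SCROLL = 0xF800,  /* red: scrolling the buffer */
    STAGE_BLIT   = 0xFFE0   /* yellow: copying the buffer to display memory */
};

#define MARKER_SIZE 8u      /* the marker is an 8x8 block of pixels */

/* Paint the marker into the top left corner of display memory.
 * `pitch_bytes` is the number of bytes between scan lines. */
static void show_stage(volatile uint16_t *display, uint32_t pitch_bytes, uint16_t colour)
{
    assert(pitch_bytes % 2 == 0 && pitch_bytes >= MARKER_SIZE * 2);
    for (uint32_t y = 0; y < MARKER_SIZE; y++) {
        volatile uint16_t *row = display + (size_t)y * (pitch_bytes / 2);
        for (uint32_t x = 0; x < MARKER_SIZE; x++)
            row[x] = colour;
    }
}
```

Calling show_stage(lfb, pitch, STAGE_SCROLL) at the start of the scroll routine and the other colours at the start of the other stages gives a crude visual profiler without needing a timer.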

Note: The fastest way to draw characters in graphics modes is with bitmasks - e.g. "pixelMaskForRow = lookupTable[fontDataForRow]; newPixelData = (oldPixelData & ~pixelMaskForRow) | (colour & pixelMaskForRow);" (note the bitwise OR). You want to do as many pixels at the same time as possible - e.g. in 32-bit code, you could use EAX and EDX together as a 64-bit mask and do 4 pixels at a time.
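
One way the bitmask approach might look in C for 16-bit pixels, handling 4 pixels per 64-bit word. This is a sketch under assumptions: an 8-pixel-wide font whose most significant bit is the leftmost pixel, a little-endian CPU (as on x86), and hypothetical names nibble_mask, init_nibble_masks and draw_glyph_row.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Lookup table: 4 font bits -> 64-bit mask covering four 16-bit pixels. */
static uint64_t nibble_mask[16];

/* Bit 3 of a nibble is the leftmost pixel, which occupies the lowest
 * 16 bits of a little-endian 64-bit word. */
static void init_nibble_masks(void)
{
    for (unsigned n = 0; n < 16; n++) {
        uint64_t m = 0;
        for (unsigned px = 0; px < 4; px++)
            if (n & (8u >> px))
                m |= 0xFFFFull << (16 * px);
        nibble_mask[n] = m;
    }
}

/* Draw one 8-pixel row of a glyph at `dst`. Foreground pixels take
 * `colour`; background pixels keep their old value. */
static void draw_glyph_row(uint16_t *dst, uint8_t font_bits, uint16_t colour)
{
    assert(nibble_mask[15] != 0);   /* init_nibble_masks() must have run */

    uint64_t fill = colour * 0x0001000100010001ull;  /* colour in all 4 lanes */
    for (unsigned half = 0; half < 2; half++) {
        unsigned nibble = (half == 0) ? (unsigned)(font_bits >> 4)
                                      : (unsigned)(font_bits & 0x0F);
        uint64_t mask = nibble_mask[nibble];
        uint64_t old;
        memcpy(&old, dst + 4 * half, sizeof old);
        uint64_t new_pixels = (old & ~mask) | (fill & mask);  /* bitwise OR */
        memcpy(dst + 4 * half, &new_pixels, sizeof new_pixels);
    }
}
```

Drawing into the off-screen buffer this way replaces eight conditional put_pixel calls per glyph row with two masked 64-bit read-modify-write steps.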

Next; if someone does "printf("Hello\nThis is nice\nThird line!\nHehee\n");" then you should only blit the buffer to display memory once after all lines have been added to the buffer (rather than doing it 4 times, once for each line); because that avoids lots of pointless writes to display memory.

You can extend this idea further. Your code to print stuff into the buffer should never copy the buffer to display memory at all; and you should have a separate routine (e.g. a "flushBuffer();" routine) that copies from the buffer to display memory. In this case, someone could print 20 things to the buffer and then call "flushBuffer();" once when they're finished printing everything. That avoids lots more pointless writes to display memory.
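
A small sketch of that split, with hypothetical names (console, console_draw) around the flushBuffer routine mentioned above. Drawing only touches the buffer in RAM and sets a dirty flag; flushBuffer is the only place that writes to display memory, and it does nothing if nothing changed. Display memory is assumed contiguous here to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *buffer;    /* off-screen buffer in RAM */
    uint8_t *display;   /* display memory (assumed contiguous in this sketch) */
    size_t   size;      /* size of each, in bytes */
    bool     dirty;     /* buffer changed since the last flush */
    unsigned flushes;   /* how many times display memory was written */
} console;

/* Stand-in for the real text renderer: changes the buffer, never the display. */
static void console_draw(console *c, size_t offset, uint8_t value)
{
    assert(offset < c->size);
    c->buffer[offset] = value;
    c->dirty = true;
}

/* The only routine that writes to display memory. */
static void flushBuffer(console *c)
{
    if (!c->dirty)
        return;
    memcpy(c->display, c->buffer, c->size);
    c->dirty = false;
    c->flushes++;
}
```

Twenty calls to the drawing code followed by one flushBuffer(&c) then cost a single write pass over display memory.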

The next step is to keep a log as a big zero terminated string in memory. When someone prints something the characters are just appended to the end of the big zero terminated string in memory (and not converted to pixels and drawn in the buffer at all). Then when someone calls "flushBuffer();" you'd calculate how many lines were added to the big zero terminated string in memory, scroll the buffer once (e.g. if 22 lines were added to the big zero terminated string in memory you'd scroll the buffer once by 22 lines), and then copy from the buffer to display memory once. This avoids a lot of scrolling. For example, if they add 2000 lines to the big zero terminated string in memory then you could fill the entire buffer with the background colour once and then only draw the last 60 lines (or whatever actually fits on the screen) and avoid scrolling 2000 times and also avoid drawing thousands of characters that get scrolled off the top before they're seen.
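
The decision made at flush time can be sketched as a pure function. The names flush_plan, count_new_lines and plan_flush are hypothetical, and the sketch assumes the screen was already full before the new text arrived and that every line ends with '\n' (no wrapping of long lines).

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t scroll_lines;   /* text lines to scroll the buffer by (0 if cleared) */
    size_t draw_lines;     /* trailing log lines that must be rendered */
    int    clear_first;    /* nonzero: fill with background instead of scrolling */
} flush_plan;

/* Count the '\n' characters in the text appended since the last flush. */
static size_t count_new_lines(const char *appended)
{
    size_t n = 0;
    for (; *appended; appended++)
        if (*appended == '\n')
            n++;
    return n;
}

/* Decide how to bring the buffer up to date with one scroll at most. */
static flush_plan plan_flush(size_t new_lines, size_t screen_lines)
{
    assert(screen_lines > 0);
    flush_plan p;
    if (new_lines >= screen_lines) {
        /* Everything on screen is replaced: clear once, draw the tail. */
        p.clear_first  = 1;
        p.scroll_lines = 0;
        p.draw_lines   = screen_lines;
    } else {
        /* Scroll once by the number of new lines, then draw only those. */
        p.clear_first  = 0;
        p.scroll_lines = new_lines;
        p.draw_lines   = new_lines;
    }
    return p;
}
```

With 22 new lines on a 60-line screen the plan is one scroll by 22 lines and 22 lines of drawing; with 2000 new lines it is one clear and 60 lines of drawing.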

Finally, often a lot of pixels don't change at all (especially for things like displaying text during boot, where you're only using 2 colours and the background colour is used for about 80% of the pixels). For example you might replace a space character with a full stop and only change 6 pixels, but copy 2.34 MiB of data to display memory anyway.

If you scroll a screen full of text, it's likely that there's plenty of white space (at the end of lines of text, etc). Then there's the gaps between characters, and things like changing a "P" to a "B", etc (where only about 10 pixels change). With some care, you can remove the majority of these writes. Most people are familiar with "dirty rectangles", but that really doesn't help for things like scrolling the screen. What I tend to do is have one buffer in RAM that contains the new contents of the screen plus another buffer in RAM that contains the current contents of the screen; and then (when blitting to display memory) I compare the data in both of these buffers and only write to display memory (and the second buffer) if the data was different. This tends to remove around 70% of display memory writes, and (despite the extra reads/writes to RAM) tends to make copying from your buffer to display memory twice as fast.
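
The two-buffer comparison could be sketched like this, working on 32-bit words (two 16-bit pixels at a time). The name blit_changed is hypothetical, and display memory is assumed contiguous; with padded scan lines the same loop would run once per line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* `wanted` holds the new screen contents, `shown` holds what display memory
 * currently contains (both in ordinary RAM). Only words that differ are
 * written to `display`; display memory itself is never read.
 * Returns the number of words written. */
static size_t blit_changed(volatile uint32_t *display, uint32_t *shown,
                           const uint32_t *wanted, size_t words)
{
    assert(display && shown && wanted);
    size_t written = 0;
    for (size_t i = 0; i < words; i++) {
        if (wanted[i] != shown[i]) {
            display[i] = wanted[i];   /* slow write, only when needed */
            shown[i]   = wanted[i];   /* keep the shadow copy in sync */
            written++;
        }
    }
    return written;
}
```

The return value also gives a cheap way to measure how many display memory writes are actually being avoided on a given workload.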

lama wrote:I looked at the page you posted, but I can't get that cache mode working. Is there a specification that explains how to handle the MTRRs? It would be awesome to have write-combining mode only for the frame buffer memory.

Don't rely on write-combining to hide the fact that your code is slow - it'll only make it harder to find/fix the cause of the problem.

Fix your code so that it's fast, and when your code is as fast as possible then start thinking about write-combining (not before).

Cheers,

Brendan

DavidCooper · Post by **DavidCooper** » Tue Jul 26, 2011 12:41 pm

berkus wrote:I think using vertical screen offset can also be used if your video RAM buffer has enough memory, by adjusting vertical offset to scroll up, but this gets more complicated when you need to wrap.

If you've got twice as much video memory as is needed for the display it's easy - you just write everything to two locations and you can scroll smoothly and at high speed indefinitely. The second-highest-resolution VGA graphics mode lets you do that, but I don't know if the same trick is possible with any good VESA modes.

Edit:-

It is possible, and indeed easy, with VESA. Hardware scrolling both vertically and horizontally at high speed can be done by using VBE function 07h (Set/Get Display Start, AX = 4F07h) to set which part of video memory is displayed in the top left corner of the screen. See this thread.
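
A sketch of the bookkeeping for this approach, assuming video memory holds two screens' worth of scan lines and every row is written twice as described above. The helper names are hypothetical; the register layout follows the VBE definition of function 07h (BL = 00h to set the display start, CX = first displayed pixel in the scan line, DX = first displayed scan line). Actually issuing the call requires a way to reach the video BIOS (real mode, virtual 8086 mode or the VBE protected mode interface), which is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Register values for an int 10h call (hypothetical helper type). */
typedef struct {
    uint16_t ax, bx, cx, dx;
} vbe_regs;

/* Build the registers for VBE function 07h, subfunction 00h. */
static vbe_regs vbe_set_display_start(uint16_t first_pixel, uint16_t first_line)
{
    vbe_regs r = { 0x4F07, 0x0000, first_pixel, first_line };
    return r;
}

/* New first displayed scan line after scrolling down by `n` lines.
 * Lines 0..height-1 mirror lines height..2*height-1, so the start
 * line can wrap back to 0 without any visible jump. */
static uint32_t scroll_start_line(uint32_t start, uint32_t n, uint32_t height)
{
    assert(height > 0 && start < height);
    return (start + n) % height;
}

/* The second scan line that must receive the same pixels as `row`. */
static uint32_t mirror_row(uint32_t row, uint32_t height)
{
    assert(row < 2 * height);
    return row < height ? row + height : row - height;
}
```

For 1280 x 1024 at 16 bits per pixel this needs at least 5 MiB of video memory for the doubled screen, so it is worth checking the memory size reported by VBE before relying on it.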

jal · Post by **jal** » Wed Jul 27, 2011 2:39 am

Brendan wrote:
Code: Select all
if(byte_between_lines = 1280*2) {

Or rather, make that a double =, or burn in hell forever :).

JAL
