Scrolling terminal in software-emulated text mode

austanss · Post by **austanss** » Wed Dec 02, 2020 2:27 pm

Using QEMU monitor, I use the dump-guest-memory command.

Also, I fixed* the issue.

*The rendering is so incredibly slow that I could tally the characters on a piece of paper as they appear on the screen. No, I'm not being hyperbolic. Literally.

austanss · Post by **austanss** » Wed Dec 02, 2020 2:33 pm

OK, fixed that easily by individually rendering entries when I put an entry into the buffer, rather than re-rendering the whole buffer.
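
A minimal C sketch of the change described above, written as hosted C so it can be tried outside a kernel. Every name here (`struct cell`, `text_buffer`, `draw_glyph`, `put_entry_at`, `redraw_all`) is hypothetical and not taken from the poster's code; `draw_glyph` is a stand-in that only counts calls instead of painting pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { COLS = 80, ROWS = 25 };

/* Hypothetical text cell: one character plus a colour attribute. */
struct cell { char ch; uint8_t attr; };

static struct cell text_buffer[ROWS * COLS];
static unsigned long glyphs_drawn;   /* counts draw calls, for illustration only */

/* Stand-in for the real glyph blitter, which would paint one character
 * cell into the framebuffer. */
static void draw_glyph(size_t col, size_t row, struct cell c)
{
    (void)col; (void)row; (void)c;
    glyphs_drawn++;
}

/* Slow path: repaint every cell; only needed after a scroll or a clear. */
void redraw_all(void)
{
    for (size_t row = 0; row < ROWS; row++)
        for (size_t col = 0; col < COLS; col++)
            draw_glyph(col, row, text_buffer[row * COLS + col]);
}

/* Fast path: store the entry and repaint just that one cell. */
void put_entry_at(size_t col, size_t row, char ch, uint8_t attr)
{
    assert(col < COLS && row < ROWS);
    struct cell c = { ch, attr };
    text_buffer[row * COLS + col] = c;
    draw_glyph(col, row, c);
}
```

With an 80x25 grid, the fast path issues one glyph draw per character written, where a full repaint issues 2000.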

Octocontrabass · Post by **Octocontrabass** » Wed Dec 02, 2020 2:43 pm

rizxt wrote:Using QEMU monitor, I use the dump-guest-memory command.

How do you examine the memory dump?

rizxt wrote:The rendering is so incredibly slow

Your code has a lot of room for optimization. I see you already found one, but there are still others.

austanss · Post by **austanss** » Wed Dec 02, 2020 2:46 pm

Now my scrolling function has some screen tearing...

I'm going to attempt using a framebuffer buffer...
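
One way the "framebuffer buffer" (a back buffer in ordinary RAM) could look, as a hedged sketch: scrolling happens in the back buffer and the finished frame is then copied to the visible framebuffer. The `struct surface` layout and the function names are assumptions for illustration, and a 32-bit pixel format is assumed. Whether this removes the tearing entirely depends on timing relative to the display refresh, which the sketch does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical surface description; every field name is an assumption. */
struct surface {
    uint32_t *pixels;   /* 32 bits per pixel */
    size_t    width;    /* visible pixels per row */
    size_t    height;   /* number of rows */
    size_t    pitch;    /* uint32_t elements per row, >= width */
};

/* Scroll the back buffer up by `lines` pixel rows and clear the gap. */
void scroll_up(struct surface *back, size_t lines, uint32_t bg)
{
    assert(lines <= back->height);
    size_t kept = back->height - lines;

    /* Source and destination overlap, so memmove rather than memcpy. */
    memmove(back->pixels, back->pixels + lines * back->pitch,
            kept * back->pitch * sizeof(uint32_t));

    for (size_t y = kept; y < back->height; y++)
        for (size_t x = 0; x < back->width; x++)
            back->pixels[y * back->pitch + x] = bg;
}

/* Copy the finished frame to the visible framebuffer, row by row,
 * because the two surfaces may have different pitches. */
void present(const struct surface *back, struct surface *front)
{
    assert(back->width == front->width && back->height == front->height);
    for (size_t y = 0; y < back->height; y++)
        memcpy(front->pixels + y * front->pitch,
               back->pixels + y * back->pitch,
               back->width * sizeof(uint32_t));
}
```

Keeping the scroll in RAM also avoids reading back from video memory, which is usually much slower than reading normal RAM.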

austanss · Post by **austanss** » Wed Dec 02, 2020 2:53 pm

After seeing the need for a framebuffer buffer, and recognizing that with my current codebase, I would need to inflate my kernel size to >100MB to create a framebuffer buffer, I now recognize that the need for memory management is dire.

austanss · Post by **austanss** » Wed Dec 02, 2020 2:54 pm

Can anyone refer me to some resources on implementing a memory management system?

bzt · Post by **bzt** » Wed Dec 02, 2020 2:55 pm

Looks like you're starting to realize what I was saying here.

Some people can only learn from their own mistakes, there's nothing wrong with that. Good luck with the optimizations! As I have said, try not to recalculate the offset for each and every pixel. That will help a lot.
(In general don't use any function calls within the loop, because that clears the instruction cache and slows down the execution considerably. It's best to use as few variables as possible, so that the compiler can optimize for register-only code. Also avoid multiplication; use addition and shifting instead.)

Cheers,
bzt
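
The advice above about not recalculating the offset for every pixel can be sketched as two versions of a rectangle fill. The function names are hypothetical, pixels are assumed to be 32 bits wide, and `pitch` is given in pixels here (real framebuffers usually report it in bytes). An optimizing compiler may turn the first version into the second on its own, so the difference is only worth trusting after measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: recomputes the full offset for every pixel. */
void fill_rect_naive(uint32_t *fb, size_t pitch, size_t x0, size_t y0,
                     size_t w, size_t h, uint32_t colour)
{
    assert(x0 + w <= pitch);
    for (size_t y = 0; y < h; y++)
        for (size_t x = 0; x < w; x++)
            fb[(y0 + y) * pitch + (x0 + x)] = colour;
}

/* Computes the top-left corner once, then only adds the pitch per row. */
void fill_rect_stepped(uint32_t *fb, size_t pitch, size_t x0, size_t y0,
                       size_t w, size_t h, uint32_t colour)
{
    assert(x0 + w <= pitch);
    uint32_t *row = fb + y0 * pitch + x0;   /* the only multiplication */
    for (size_t y = 0; y < h; y++, row += pitch)
        for (size_t x = 0; x < w; x++)
            row[x] = colour;
}
```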

bzt · Post by **bzt** » Wed Dec 02, 2020 2:56 pm

rizxt wrote:After seeing the need for a framebuffer buffer, and recognizing that with my current codebase, I would need to inflate my kernel size to >100MB to create a framebuffer buffer, I now recognize that the need for memory management is dire.

That makes no sense. What do you have in mind? Your codebase's size has nothing to do with the video card's linear framebuffer.

Cheers,
bzt

austanss · Post by **austanss** » Wed Dec 02, 2020 2:58 pm

bzt wrote:
rizxt wrote:After seeing the need for a framebuffer buffer, and recognizing that with my current codebase, I would need to inflate my kernel size to >100MB to create a framebuffer buffer, I now recognize that the need for memory management is dire.
That makes no sense. What do you have in mind? Your codebase's size has nothing to do with the video card's linear framebuffer.

Cheers,
bzt

I meant that the only way I can properly reserve memory is by inflating my kernel through the use of resb. Otherwise I have to use arbitrary memory addresses that might overwrite something important or get overwritten by something else.

bzt · Post by **bzt** » Wed Dec 02, 2020 3:08 pm

rizxt wrote:I meant that the only way I can properly reserve memory is by inflating my kernel through the use of resb. Otherwise I have to use arbitrary memory addresses that might overwrite something important or get overwritten by something else.

That still makes no sense. Using resb will not inflate your kernel's size. Plus since framebuffer is in MMIO, usually above the last available RAM address, you can't overwrite it by accident. Not to mention that the framebuffer's address changes, it's different on each machine.
You'll need to query the memory map to see which regions of the RAM are free to use. You must not overwrite anything that's not listed as free. The framebuffer might be included in the list as used memory, but not necessarily, and not on all machines.

Cheers,
bzt
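
A hedged sketch of what "query the memory map" can lead to in code: once the bootloader's map has been translated into a simple array, the kernel can look for a free region large enough for its needs. The `struct mem_region` layout, the `region_type` values, and `find_free_region` are all assumptions for illustration; real formats (Multiboot, UEFI, and so on) differ and must be converted first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, bootloader-neutral memory map entry. */
enum region_type { REGION_FREE = 1, REGION_RESERVED = 2 };

struct mem_region {
    uint64_t base;
    uint64_t length;
    enum region_type type;
};

/* Find the first free region that can hold `size` bytes at or above
 * `min_addr` (for example, the end of the kernel image). On success,
 * store the start address in *out and return true. */
bool find_free_region(const struct mem_region *map, size_t count,
                      uint64_t size, uint64_t min_addr, uint64_t *out)
{
    assert(out != NULL && (map != NULL || count == 0));
    for (size_t i = 0; i < count; i++) {
        if (map[i].type != REGION_FREE)
            continue;                   /* never touch non-free memory */
        uint64_t start = map[i].base;
        uint64_t end   = map[i].base + map[i].length;
        if (start < min_addr)
            start = min_addr;           /* skip the part below min_addr */
        if (start < end && end - start >= size) {
            *out = start;
            return true;
        }
    }
    return false;
}
```

Anything not marked free, including a framebuffer that happens to appear in the map, is skipped rather than guessed at.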

Octocontrabass · Post by **Octocontrabass** » Wed Dec 02, 2020 3:17 pm

It sounds like it's time to dive into memory management. By allocating memory dynamically from your kernel heap, you can ensure your kernel only allocates memory it's actually going to use.

But if resb is making your kernel binary larger, perhaps you're not putting it in the .bss section like you're supposed to. (It still reserves memory no matter where you put it!)

bloodline · Post by **bloodline** » Wed Dec 02, 2020 3:47 pm

rizxt wrote:
bzt wrote:
rizxt wrote:After seeing the need for a framebuffer buffer, and recognizing that with my current codebase, I would need to inflate my kernel size to >100MB to create a framebuffer buffer, I now recognize that the need for memory management is dire.
That makes no sense. What do you have in mind? Your codebase's size has nothing to do with the video card's linear framebuffer.

Cheers,
bzt
I meant that the only way I can properly reserve memory is by inflating my kernel through the use of resb. Otherwise I have to use arbitrary memory addresses that might overwrite something important or get overwritten by something else.

Just write a memory allocator! The memory allocator is the first part of the kernel you should write, that needs to be well tested and robust enough to build everything else on top of!
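
A first allocator can be very small. The sketch below is a bump allocator over a static arena; it is one possible starting point, not the allocator anyone in this thread used. `ARENA_SIZE`, `arena`, and `bump_alloc` are hypothetical names, and the 4 MiB size is arbitrary. Because the arena is zero-initialised static storage, toolchains normally place it in .bss, so it should not add 4 MiB to the kernel image on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arena size is an arbitrary choice for this sketch. */
#define ARENA_SIZE (4u * 1024u * 1024u)

static uint8_t arena[ARENA_SIZE];   /* zero-initialised: normally in .bss */
static size_t  arena_used;          /* bytes handed out so far */

/* Bump allocator: hands out consecutive chunks and never frees.
 * `align` must be a power of two. Returns NULL when the arena is full. */
void *bump_alloc(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Align the real address, not just the offset into the arena. */
    uintptr_t base = (uintptr_t)arena;
    uintptr_t addr = (base + arena_used + align - 1) & ~((uintptr_t)align - 1);
    size_t start = (size_t)(addr - base);

    if (start > ARENA_SIZE || size > ARENA_SIZE - start)
        return NULL;

    arena_used = start + size;
    return &arena[start];
}
```

For scale, a 1024x768 back buffer at 32 bits per pixel needs 3 MiB, which `bump_alloc(1024 * 768 * 4, 16)` would take from this arena.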

eekee · Post by **eekee** » Fri Dec 11, 2020 10:31 am

bzt wrote:In general don't use any function calls within the loop, because that clears the instruction cache and slows down the execution considerably.

This makes no sense. Is the CPU going to throw away 128KiB of cached instructions on every function call? I think you're thinking of the instruction pipeline, but even there, CPUs became very good at branch prediction years ago. Note the "prediction" part - this is for conditional branches; unconditional calls must surely have been optimized long before.

bzt · Post by **bzt** » Fri Dec 11, 2020 2:37 pm

eekee wrote:
bzt wrote:In general don't use any function calls within the loop, because that clears the instruction cache and slows down the execution considerably.
This makes no sense. Is the CPU going to throw away 128KiB of cached instructions on every function call?

There are multiple levels of caches in a CPU, furthermore the cache size differs for every level in every CPU family. You can't state that every CPU has 128K instruction cache, that's not true. At which level? For which CPU model?
https://en.wikipedia.org/wiki/CPU_cache ... _processor

Plus, using a function call would cripple the compiler's optimizer by not allowing it to use all registers as it otherwise could (not all registers are preserved across a call, so it assumes they will change, and arguments must be passed in dedicated registers as per the ABI, etc.), not to mention the additional stack usage and unnecessary stack frame creation/deletion, etc. If you doubt this, just use "objdump" to disassemble the generated code!
https://en.wikipedia.org/wiki/Register_allocation

eekee wrote:I think you're thinking of the instruction pipeline, but even there, CPUs became very good at branch prediction years ago.

"CALL" instruction does not use branch predictions as conditional near jumps like "JE", "JNE" etc. The distance of the call also matters, see Intel Software Developer Manual Vol 2A page 3-126. See section "Operation" with the microcode. Also read about cache handling with and without LFENCE for both near, normal and far calls (hint: there's a difference). However not using "CALL" in the first place will reduce the overhead of these to zero for sure.

eekee wrote:Note the "prediction" part - this is for conditional branches; unconditional calls must surely have been optimized long before.

If that were true, then the compiler wouldn't inline certain functions or unroll loops for speed optimization. But it does (long before execution, so long that those are done at compile time).

Good read on the topic: http://www.ece.uah.edu/~milenka/docs/mi ... WDDD02.pdf (explains why jump distance matters, and some other things as well)
And of course read: https://software.intel.com/content/www/ ... anual.html (the official optimization manual)

Finally, if this isn't enough, or you don't want to read through all the docs, then just measure it! Do a simple loop with a function call and one with an inlined function, then measure how long each takes to run! The same goes for the address calculation: write a loop where you calculate the address in every iteration, and another one where you calculate the upper left corner once before the loop and only add the scanline length to it inside the loop. Multiplication is a much more expensive instruction than a single add even on modern computers. For loops that run many, many times and often, every CPU cycle counts.

Cheers,
bzt
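
The "just measure it" suggestion above could be set up roughly like this. It is a sketch, not a rigorous benchmark: all function names are hypothetical, `__attribute__((noinline))` is a GCC/Clang extension used to stop the compiler from inlining the called version, and `clock()` measures CPU time at coarse resolution. No result is claimed here; the numbers depend on the compiler, the flags, and the CPU.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Kept out of line on purpose; without the attribute both loops may
 * compile to the same code. */
__attribute__((noinline))
static uint32_t mix_call(uint32_t acc, uint32_t i)
{
    return acc * 31u + i;
}

static inline uint32_t mix_inline(uint32_t acc, uint32_t i)
{
    return acc * 31u + i;
}

uint32_t loop_with_call(uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc = mix_call(acc, i);
    return acc;
}

uint32_t loop_inlined(uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc = mix_inline(acc, i);
    return acc;
}

/* Time one variant in seconds of CPU time. The result is printed so the
 * compiler cannot discard the loop as dead code. */
double time_variant(const char *name, uint32_t (*fn)(uint32_t), uint32_t n)
{
    assert(fn != NULL);
    clock_t t0 = clock();
    uint32_t result = fn(n);
    clock_t t1 = clock();
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("%-8s result=%u time=%.3fs\n", name, (unsigned)result, secs);
    return secs;
}
```

A run would call `time_variant("call", loop_with_call, n)` and `time_variant("inlined", loop_inlined, n)` with a large `n`, built with the same optimization flags as the kernel, and would ideally be repeated several times. Checking the disassembly confirms whether the compiler really kept the call.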

eekee · Post by **eekee** » Fri Dec 11, 2020 5:55 pm

@bzt: I put myself in a difficult position by challenging you here because I don't exactly want to argue with you right now, but... I don't know, the things you say are just too weird. Some of your points are good -- I'm almost sure you're right about multiplication -- others I could challenge further. If reducing the number of variables results in more complex expressions, then there's no improvement, because intermediate results in the expressions need to be kept somewhere. I'm sure this somewhere is not any more or less registerizable than local variables.

As for cache, you have to admit that it's amusing to see you ask me "which level?" when you made the claim "that clears the instruction cache" in the first place! (Emphasis mine.) I admit I made a wild assumption: that the level 1 instruction cache will typically be 128KiB in size. High-end processors had this amount when I was last paying attention many years ago (if I remember right, as always), so I assume most or all processors now have 128 KiB or more. The Wikipedia page you linked tells me it's divided into blocks, which makes your claim of one function call dumping the entire cache even more... entertaining! (I knew it was divided into cache lines anyway, as I'm sure you did.) Is this a language issue? Did you not mean to imply it flushes the entire thing? (Let's call "the instruction cache" the level 1 instruction cache for the sake of argument.)

Anyway, I should test as you told me to, although I think I'll have to be careful to avoid the results telling me more about the compiler than the CPU -- it may very well inline function calls in the loop. The bit about call distance making a difference is certainly good info.

Unrolling loops I'm not so sure about again. It was clearly a bad idea when I was using source-based Linux distributions because unrolled loops make bad use of caching, but I was using cheap CPUs with limited cache. I realized this long before the Gentoo "ricers" of that era.
Here's something I tested back then: I got good results from optimizing all the userspace for size. I think I optimized my kernel for speed (which would include framebuffer scrolling) but I'm almost sure I never used -O3. I used -O2 which didn't unroll loops. (This was all with Gcc 3, maybe 4.0.)

As for scrolling in userspace, I set my xterms to "jumpscroll" which gets a very fast overall rate regardless of the rate of scrolling individual lines. I'm not sure what the algorithm is, but it works. I might look it up if I find myself struggling with scroll speed, but my terminal/console plans are odd anyway.

OSDev.org
