OSDev.org

Posted: **Mon Jun 13, 2011 7:18 pm**

Have you actually read the source for glibc's memcpy?
rep movsb (and all of the rep family) are simple, yes, but they execute in order, and hence don't benefit from instruction pipelining (because each instruction depends on the result of the previous).
For small memory blocks, this is not an issue, but once you get to larger blocks (>32 bytes, say) the slowdown will be significant.

Anyway, this is getting off topic.

You can write an OS in purely assembler (as Menuet has shown), but unless you are a VERY good programmer, C will be far easier, with minimal speed penalty.

Posted: **Mon Jun 13, 2011 7:39 pm**

REP MOVS only makes sense for Nehalem and Sandy Bridge. On other CPUs, other techniques are recommended (e.g., SSE2, SSE3-Suppl, AMD SSE5, or AVX). I personally wrote optimal assembly language implementations for various functions of the C standard library. Someone on #osdev provided benchmarks for my memset implementation which uses the techniques I was talking about: http://slexy.org/view/s2BwwHctgI --- in other words, if you're going to use assembly language, be careful what you do.

As for writing an OS in assembly, I think it's a terrible idea due to the fact that (a) assembly is not portable, productive, or as easily maintainable as higher-level languages, and (b) it can't be optimized for a specific CPU model unless you add a lot of extraneous code.

I strongly recommend higher-level languages such as C/C++ and even managed languages for the courageous.

Posted: **Mon Jun 13, 2011 8:04 pm**

Hi,

Imagine you do "memcpy(src, dest, 8 )". In this case the compiler can inline the memcpy, detect that "8" is a constant, unroll the loop, and generate about 4 instructions:

Code: Select all

    mov eax,[src]
    mov ebx,[src+4]
    mov [dest],eax
    mov [dest+4],ebx

That could easily be 10 times faster than "rep movsb".

For modern CPUs, "rep movsb/w/d/q" is optimised to work on a cache lines, but only in specific cases - e.g. if the destination address is aligned on a certain boundary (and maybe the source address too), if the count is large enough, if the source and destination areas don't overlap too much, etc. This means that a "rep movsb/w/d/q" has some initial overhead before it starts (while the CPU is determining if it can work on cache lines instead) which makes it slow when the count is tiny.

In general, for suitably aligned source and dest:

for tiny memory amounts (maybe less than 100 bytes) with a fixed count you want to unroll and generate a series of "mov" instructions
for tiny memory amounts (maybe less than 50 bytes) with a variable count you want to do a manual copy and avoid "rep movsb/w/d/q"
for medium memory amounts (maybe between 100 bytes and 10 KiB) you want to use "rep movsb/w/d/q"
for larger memory amounts (maybe between 20 KiB and 200 KiB) you start getting into trouble with cache pollution, and want to use CLFLUSH. In this case "rep movsb/w/d/q" is a bad idea because you end up doing tiny memory amounts between cache line flushes. You'd want to break it up into multiple "cache lines sized" chunks and generate a series of "mov" instructions to do each small/fixed size chunk. It's also probably a good idea to use MMX or SSE (if present) in this case; and you probably want to consider using prefetch instructions too.
for huge memory amounts (maybe above 100 KiB) you want to find the programmer who "needs" this and crush their hands with a hammer, so that they can't type (or program) ever again . In this case you should focus on avoiding the need to copy huge amounts of memory rather than optimising the memcpy(). For a very simple example, imagine a text editor where the text file is stored in memory as one big string, and every time a character is inserted or deleted you need to copy a (potentially huge) amount of memory - simple solution is to store the text file in memory as a linked list of lines so that you only ever need to move from the cursor to the end of the line (maybe 200 bytes worst case, rather than 500 KiB worst case).

Misaligned source and/or destination addresses are a major problem - it's tricky to get right, and it will kill the CPU's "rep movsb/w/d/q" optimisations. For cases with misalignment, you want to read an aligned value, shift it left or right, then write an aligned value. Of course this is also a good candidate for the "smash the programmer's fingers with a hammer" approach

.

Note: Far too many people benchmark their "memcpy()" implementations on large/huge memory sizes. This is stupid as people shouldn't be copying large/huge amounts of memory anyway. Most of the time memcpy() is used to copy strings and structures (between 10 and 200 bytes of data - possibly worse for C++ structures that represent objects). Also, for the people that benchmark their memcpy() with large amounts of data, I've never seen anyone benchmark how their memcpy() effects the speed of surrounding code (e.g. to figure out how badly they've trashed "more likely to be needed soon" data in caches).

Anyway, for an assembly language programmer it can be done, but you'd need a group of macros (one for "small" memory amounts, one for medium memory amounts, etc) where you use the preprocessor to detect if the count is a constant (and potentially if source and dest are constant and aligned), and not one generic (optimised for no specific case) "memcpy()" routine. You'd also need to test on a wide variety of different CPUs (and maybe have different versions for different CPUs), and if you want the best possible results in all cases then you can probably spend several months getting it right.

Cheers,

Brendan

Posted: **Tue Jun 14, 2011 1:50 am**

Brendan wrote:For modern CPUs, "rep movsb/w/d/q" is optimised to work on a cache lines, but only in specific cases - e.g. if the destination address is aligned on a certain boundary (and maybe the source address too), if the count is large enough, if the source and destination areas don't overlap too much, etc. This means that a "rep movsb/w/d/q" has some initial overhead before it starts (while the CPU is determining if it can work on cache lines instead) which makes it slow when the count is tiny.

Depending on the CPU model, the conditions under which REP MOVS moves an entire cache line per iteration (except maybe for the last one) are:

The counter must be high (as I often am ).
Both the source and destination must be properly aligned.
The direction must be forward.
The distance between the source and destination must be at least the size of a cache line.
The memory type for both the source and destination must be either write-back or write-combining.

REP MOVS takes a single cycle to perform both a read and a write. However, on P2 and P3, there is an extra cycle penalty if ESI + 1/2/4/8 - EDI is divisible by 32 (i.e., if bits 2--4 are the same for the source and destination addresses) due to a cache bank conflict. The simplest way to avoid this is to align the source and destination on a QWORD boundary.

Brendan wrote:
for tiny memory amounts (maybe less than 100 bytes) with a fixed count you want to unroll and generate a series of "mov" instructions

Yes, but there are two issues here:

Unrolling loops on Sandy Bridge CPUs is not recommended because of the micro-op cache.
It's not just MOV we're talking about; it's moves using the largest available registers (except for YMM as there are currently no x86(-64) CPUs which can perform 256-bit writes but expect that to change).

REP MOVS typically has some initial overhead for determining which method to use, which is why it's not recommended for small blocks. But it doesn't stop here; this method is typically the fastest even for larger block sizes (but not always, see below). Its downsides are that the block size and alignment must be known beforehand (unless self-modifying code is used but the only justification for that would be using the exact same routine many times). More on alignment later.

Brendan wrote:
for tiny memory amounts (maybe less than 50 bytes) with a variable count you want to do a manual copy and avoid "rep movsb/w/d/q"

I'm not sure I understand what you mean by "manually copying". I presume you're describing the same technique as above, except using a loop in order to save space. If that's the case, I would recommend combining the two methods into a partially unrolled loop, where for Sandy Bridge the number of MOV instructions inside the loop is much higher. By the way, using MOVS without the REP prefix, just like with all string instructions, is a performance waste.

Brendan wrote:
for medium memory amounts (maybe between 100 bytes and 10 KiB) you want to use "rep movsb/w/d/q"

Using REP MOVSB or MOVSW should never be used if performance is the goal, so they're not really worth mentioning. Anyway, REP MOVS is always the technique you want to use on Nehalem and Sandy Bridge for blocks ranging from anywhere from not very small to extremely big.

Brendan wrote:
for larger memory amounts (maybe between 20 KiB and 200 KiB) you start getting into trouble with cache pollution, and want to use CLFLUSH. In this case "rep movsb/w/d/q" is a bad idea because you end up doing tiny memory amounts between cache line flushes. You'd want to break it up into multiple "cache lines sized" chunks and generate a series of "mov" instructions to do each small/fixed size chunk. It's also probably a good idea to use MMX or SSE (if present) in this case; and you probably want to consider using prefetch instructions too.

I think you're overcomplicating it. Non-temporal writes can be used to bypass the cache. On many newer CPUs this is not recommended for blocks smaller than half the L2 cache, so that's probably when you should start worrying.

Brendan wrote:
for huge memory amounts (maybe above 100 KiB) you want to find the programmer who "needs" this and crush their hands with a hammer, so that they can't type (or program) ever again . In this case you should focus on avoiding the need to copy huge amounts of memory rather than optimising the memcpy(). For a very simple example, imagine a text editor where the text file is stored in memory as one big string, and every time a character is inserted or deleted you need to copy a (potentially huge) amount of memory - simple solution is to store the text file in memory as a linked list of lines so that you only ever need to move from the cursor to the end of the line (maybe 200 bytes worst case, rather than 500 KiB worst case).

That's an overgeneralization. There are scenarios in which copying huge blocks is imperative---although, none that I can think of for OS development.

Brendan wrote:Misaligned source and/or destination addresses are a major problem - it's tricky to get right, and it will kill the CPU's "rep movsb/w/d/q" optimisations. For cases with misalignment, you want to read an aligned value, shift it left or right, then write an aligned value. Of course this is also a good candidate for the "smash the programmer's fingers with a hammer" approach .

Things that come in handy are SSE2 (PSRLDQ/PSLLDQ/POR), and supplementary SSE3 (PALIGNR---not recommended on AMD's Bobcat as it doesn't perform well). If you have a sufficiently new CPU, add AMD's SSE5, and AVX to the list. You need 16 branches (one per shift count) unless you have XOP (VPPERM).

Evidently, it is preferrable to simply align your blocks so you don't have this problem to begin with. However, misalignment isn't as big a problem as it used to because newer CPUs have better cache interfaces.

Brendan wrote:Note: Far too many people benchmark their "memcpy()" implementations on large/huge memory sizes. This is stupid as people shouldn't be copying large/huge amounts of memory anyway.

As I've said earlier, I don't think this is true for every scenario. However, I think you are fully right about most situations.

Brendan wrote:Anyway, for an assembly language programmer it can be done, but you'd need a group of macros (one for "small" memory amounts, one for medium memory amounts, etc) where you use the preprocessor to detect if the count is a constant (and potentially if source and dest are constant and aligned), and not one generic (optimised for no specific case) "memcpy()" routine. You'd also need to test on a wide variety of different CPUs (and maybe have different versions for different CPUs), and if you want the best possible results in all cases then you can probably spend several months getting it right.

Optimizing compilers typically take this burden off the programmer's shoulders by using the appropriate thing, at least in theory (vendors don't always care about implementing a bajillion edge cases). E.g., clang will even optimize something like

Code: Select all

// Using a pointer is not required, but I just can't stop being a good programmer :)
for (ptr = array; ptr != &a[sizeof array / sizeof array[0]]; *ptr++ = 0);

into

Code: Select all

memset(a, 0, sizeof a);

given non-floating-point types, which is far more complicated to implement.

Posted: **Tue Jun 14, 2011 2:29 pm**

In answer to the original post, there is nothing wrong with writing your OS using whichever programming language you feel most comfortable with, because that is likely to be fastest for you. Any part of your OS that runs slow because it hasn't been well optimised can be improved later on when you go through everything carefully and try to speed it up (though it certainly helps if you have some idea about how code optimisation works so that you can recognise that you're writing slow code which will need to be looked at again later). What matters most to begin with is getting your code working and eliminating all the damaging bugs. Once you've done that you'll have created something worth of the effort of optimising it. When you do get to that point you'll find that most of the possible gains will come from focusing on a few critical innermost loops, and anything else you do probably won't make a measurable difference (unless you can improve the actual algorithms).

One big advantage of using assembler is that your code will fit into a fraction of the space that compiled code takes up, and that means you can typically have ten times as much functionality sitting in machine memory without the delays you get when page faults lead to bloated code being switched in and out from/to the hard drive. Of course, this advantage will be wiped out if you then run lots of bloated applications on top of it which weren't written with the same care.

As for porting your OS to another platform, if it's such good software that it's worth porting, you really aren't going to care about the extra work involved. In addition, there will in the future be software capable of porting it for you, and of optimising it to run on a specific machine. Of course, it will go further than that too: operating systems will be able to test their own performance and rewrite their own code to speed it up. They'll also be able to change the way they do just about everything if the user doesn't like the way they look or behave, and they will be able to rewrite applications in the same way. They will download recipes for programs and then build them themselves in minutes rather than running anything ready-made that might have a virus in it, so all the protection mechanisms in the machine will be made completely redundant. Human programmers will work in natural languages, simply telling the machine what they want it to do and allowing it to do all the hard work for them, so any Tom, **** or Harry will be able to produce first-rate programs and to tell the OS to behave the way they want it to. The sad/happy truth of it is, we're all going to be pushed aside by machine intelligence before long and will have to find something else to do with our time. There are machines out there which are on the verge of being able to hold rational conversations, and it's only a tiny step further on from there to them becoming expert programmers, loaded up with more knowledge than Brendan and Love4Gannets combined, and able to apply all of it many times a second within every machine on the planet.

So, it really doesn't matter how you write your OS, just so long as you enjoy doing it. It will be a magnificent monument to one of the last real human programmers.

[Edit: I didn't type the **** bit in - the autocensor did that!]

Posted: **Tue Jun 14, 2011 2:46 pm**

DavidCooper wrote:In answer to the original post, there is nothing wrong with writing your OS using whichever programming language you feel most comfortable with, because that is likely to be fastest for you. Any part of your OS that runs slow because it hasn't been well optimised can be improved later on when you go through everything carefully and try to speed it up (though it certainly helps if you have some idea about how code optimisation works so that you can recognise that you're writing slow code which will need to be looked at again later). What matters most to begin with is getting your code working and eliminating all the damaging bugs. Once you've done that you'll have created something worth of the effort of optimising it. When you do get to that point you'll find that most of the possible gains will come from focusing on a few critical innermost loops, and anything else you do probably won't make a measurable difference (unless you can improve the actual algorithms).

It seems to me that nobody is likely to notice if you fail to shave the last nano second off of a keyboard handler. On the other hand, they might notice if your graphics were slow. So you optimise where you need it, and nowhere else. I wonder if anybody remembers the early days of the PC, when there was a single core processor running at a little under 5MHz?

As for porting your OS to another platform, if it's such good software that it's worth porting, you really aren't going to care about the extra work involved.

The lack of portability bugged me slightly, but it is perhaps possible to take the writing of a hobby operating system too seriously. The reality is that it will probably never run on more than a handful of computers apart from my own. In any case, we are probably locked in to the x86 architecture. It has been around for thirty years now, and the installed user base is simply huge. It is difficult to imagine how it could be unseated , short of some kind of revolution in computer technology.

Posted: **Tue Jun 14, 2011 2:56 pm**

DavidCooper wrote:There are machines out there which are on the verge of being able to hold rational conversations, and it's only a tiny step further on from there to them becoming expert programmers, loaded up with more knowledge than Brendan and Love4Gannets combined, and able to apply all of it many times a second within every machine on the planet.

That is not a tiny step. People have thought it to be a tiny step since the dawn of AI. But, I believe we've already had this argument...

DavidCooper wrote: As for porting your OS to another platform, if it's such good software that it's worth porting, you really aren't going to care about the extra work involved.

Why did UNIX get so big? It was the first OS written entirely in a portable language. If your statement were true, people would simply have ported the other, more powerful systems of the time, but they didn't.

Posted: **Tue Jun 14, 2011 3:56 pm**

Come on guys, we've had this argument before.

Writing in pure asm is totally unmaintainable and is the realm of purist hobbyist programmers only.

... which do comprise a substantial proportion of this forum's userbase.

There. I TL;DR'd it. Can we stop this debate now?

Posted: **Tue Jun 14, 2011 4:20 pm**

JamesM wrote:Come on guys, we've had this argument before.

Writing in pure asm is totally unmaintainable and is the realm of purist hobbyist programmers only.

... which do comprise a substantial proportion of this forum's userbase.

There. I TL;DR'd it. Can we stop this debate now?

Back in the day, I wrote a command interpreter for MS-DOS in pure asm. I also wrote a popup debugger, which could single step through DOS itself without crashing - again in pure asm.

Posted: **Tue Jun 14, 2011 4:21 pm**

Casm wrote:In any case, we are probably locked in to the x86 architecture. It has been around for thirty years now, and the installed user base is simply huge. It is difficult to imagine how it could be unseated , short of some kind of revolution in computer technology.

The future of x86 isn't guaranteed - there's going to be plenty of room for innovations in hardware once it's no longer limited by any difficulties in getting existing software to run on it.

Posted: **Tue Jun 14, 2011 4:27 pm**

Casm wrote:
JamesM wrote:Come on guys, we've had this argument before.

Writing in pure asm is totally unmaintainable and is the realm of purist hobbyist programmers only.

... which do comprise a substantial proportion of this forum's userbase.

There. I TL;DR'd it. Can we stop this debate now?
Back in the day, I wrote a command interpreter for MS-DOS in pure asm. I also wrote a popup debugger, which could single step through DOS itself without crashing - again in pure asm.

Very clever, but proves my point precisely. Locking the thread now...

OSDev.org

Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language

Re: Writing an operating system in assembly language