OSDev.org

Posted: **Sun May 17, 2015 11:23 pm**

no92 wrote: That sort of a conclusion is dangerous, as you're comparing different algorithms in different languages. Even though the argument is somewhat flawed, your point is correct: hand-crafted assembly is often not better than code in a language like C, due to the much simpler maintenance.

The only routines that I would write in assembly are the ones you have to write in assembly (LGDT/LIDT/LTR, ISRs etc.) plus some libc functions, where optimizing can be extremely profitable (I'd say that for some functions a speedup of 100x is possible). For a correctly written and thoroughly tested libc function, maintenance shouldn't be an issue.

Talking about such a topic is relatively useless, as it is highly biased and heavily depends on the ability of the programmer to write/maintain assembly.

I wasn't being particularly serious. I agree this is a useless topic.

Posted: **Mon May 18, 2015 8:43 am**

The output of both ASM and C is x86 machine code, so one language can't technically be "faster" than the other. The comparison is between the code generated by the assembler and by the C compiler, plus any "libraries" that you decide to use.

For any scenario, one combination of bytes will be the fastest, in wall-clock time. Writing those exact bytes may not even be possible in C, whereas it would be possible using ASM, assuming your assembler is up to date with the specific CPU you are running on.

However, it is unlikely that you will find those exact bytes for any given scenario, let alone multiple scenarios (different CPUs, memory, motherboard, etc.)

So then the comparison becomes comparing wall-clock time between your hand-written ASM code vs. your hand-written C code run through a particular compiler (plus any included libraries you decide to use).

You can probably find a few scenarios where your hand-written ASM code is faster than your hand-written C code with a specific compiler, but it probably won't be noticeable to the end user, and it is almost surely not worth the increased development time.
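A wall-clock comparison of this kind can be sketched as follows. This is only an illustration, assuming a C11 library that provides `timespec_get`; the names `time_calls` and `bump_counter` are hypothetical and not from any post in this thread.

```c
#include <time.h>

/* Returns the wall-clock seconds taken by `iterations` calls of fn(arg).
 * timespec_get(TIME_UTC) is standard C11; it is not guaranteed to be
 * monotonic, so treat single measurements with suspicion. */
double time_calls(void (*fn)(void *), void *arg, unsigned long iterations)
{
    struct timespec start, end;

    timespec_get(&start, TIME_UTC);
    for (unsigned long i = 0; i < iterations; i++)
        fn(arg);
    timespec_get(&end, TIME_UTC);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Placeholder workload: increments the counter that arg points to.
 * The volatile access keeps the compiler from removing the work. */
void bump_counter(void *arg)
{
    volatile unsigned long *counter = arg;
    (*counter)++;
}
```

To compare two implementations, the same harness would be run over each (for example a C routine and a wrapper around an assembly routine), repeated several times on the same machine, since a single run says little.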

Of course, nobody writes their own OS with development time in mind...

I will say that I have noticed one area where ASM has an advantage over (optimized) C code: I/O port reads and writes. Rather than use a large number of I/O ports, most hardware uses a pair of Address/Value ports, which greatly expands the number of "registers" available to the CPU. In C, this is usually handled by writing a device_read() function and a device_write() function that take the Address and Value as parameters and internally use inb() and outb() functions, which ultimately use the IN and OUT CPU instructions. Since the Address and Value ports are usually next to each other, ASM has the advantage of being able to use INC and DEC instructions to flip between the two, whereas C code must explicitly load the DX register with a specific value, or a value stored in a variable, for both the Address and Value ports before any data can be read or written. I can't see how this can possibly be "faster" in C, regardless of how "smart" the compiler is.
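For reference, the C pattern described above looks roughly like this. It is a sketch only: `inb()` and `outb()` here are mock stand-ins that model an Address/Value port pair in memory so the code runs in user space, and `DEV_BASE` and the mock arrays are assumptions made for illustration. On real hardware the two accessors would wrap the IN and OUT instructions.

```c
#include <stdint.h>

#define DEV_BASE 0x3D4u          /* assumed base port, illustration only */

static uint8_t mock_regs[256];   /* the device's internal registers */
static uint8_t mock_index;       /* last value written to the Address port */

/* Mock OUT: port DEV_BASE selects a register, DEV_BASE + 1 writes it. */
static void outb(uint16_t port, uint8_t value)
{
    if (port == DEV_BASE)
        mock_index = value;
    else if (port == DEV_BASE + 1)
        mock_regs[mock_index] = value;
}

/* Mock IN: DEV_BASE + 1 reads the currently selected register. */
static uint8_t inb(uint16_t port)
{
    if (port == DEV_BASE)
        return mock_index;
    if (port == DEV_BASE + 1)
        return mock_regs[mock_index];
    return 0xFF;                 /* unclaimed port */
}

/* Select register `reg` through the Address port, then read the Value port. */
uint8_t device_read(uint16_t base, uint8_t reg)
{
    outb(base, reg);
    return inb((uint16_t)(base + 1));
}

/* Select register `reg` through the Address port, then write the Value port. */
void device_write(uint16_t base, uint8_t reg, uint8_t value)
{
    outb(base, reg);
    outb((uint16_t)(base + 1), value);
}
```

Whether a given compiler reloads DX for the second access or adjusts it in place depends on the compiler, the optimisation level and whether the accessors are inlined; the sketch shows the call structure, not the generated code.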

Still, probably not noticeable to the end user...

Posted: **Mon May 18, 2015 2:37 pm**

If you tried hard enough you could probably write a program in assembly that would run slower than something written in a higher-level language.
Most of the time, the compiler knows best.

Posted: **Mon May 18, 2015 3:20 pm**

Most people won't have to try very hard to produce slower code in assembly than in C. It's more or less the natural outcome if you don't know exactly what you're doing. (And unfortunately, most of the proponents of asm-only coding don't.)

Posted: **Mon May 18, 2015 5:11 pm**

I, for one, have no particular bias, and if there is a way of doing something better, I want to learn it. Space is my primary concern; grandiose as it may seem, I do aspire to have my entire system fit on a high density floppy. Without knowing the specifics of my design intent, that statement doesn't mean much, but I do know I have to avoid any sort of bloat.

One thing I've learned with "C", maybe because I'm not doing it right, is that the linker will page-align each object module. It would be helpful to see an example of how this routine would be coded in "C" to match or closely approximate it. I believe it would be declared something like this:

char *I2H(long Value, char *Position, int Flags) {
    /* ... whatever needs to happen here ... */
}

; ============================================================================================
; Converts 32/16 bit value in E(AX) to an ASCIIZ string.

;       ENTER:  ARG3: Flags & pad character
;               ARG2: Pointer to last digit position in ASCIIZ buffer
;               ARG1: Value to be converted, must be 32 bit even if only converting 16

;       LEAVE:  AX = Pointer to beginning of ASCIIZ buffer
;               Only CX & DX altered

;       FLAGS:  Unchanged

; 2015.05.05                                                                    4FH - 79 Bytes
; --------------------------------------------------------------------------------------------

    ; Arguments passed by caller
        
  %define   VALUE     BP +  4           ; Arg 1: 16/32 bit value to be converted
  %define    BUFF     BP +  8           ; Arg 2: Pointer last position ASCIIZ buffer
  %define FILL_CH     BP + 10           ; Arg 3: Flags & fill character
  %define   FLAGS     BP + 11           
  
    ; Bit positions in FLAGS
    
  %define  PADD   0                     ; Leading chars will be filled with FILL_CH
  %define  WIDE   1                     ; 32 bit conversion if ON
  %define  CASE   5                     ; A-F will be converted to lower case
  
   I2H: push    bp
        mov     bp, sp                  ; Empty proc frame so BP can address args.

        push    di                      ; Preserve
        
    ; This guarantees DF will be returned to its original state even if it's already set.
    
        pushf
        std                             ; Auto decrement STOSB
        
    ; Initialize registers for conversion loop
    
        mov     eax, [VALUE]            ; Get 32 bit value, only 16 might be significant
        mov      di, [BUFF]             ; Get pointer to ASCIIZ string. ES already set
        mov      dl, [FLAGS]
        mov      cx, 8                  ; Assume 32 bit conversion: 8 digits
        bt       dx, WIDE
        jc      .Next                   ; WIDE flag ON, doing 32 bit conversion
        
    ; Flag OFF, doing 16 bit conversion. In order to guarantee padding works correctly,
    ; the upper word of EAX needs to be stripped, otherwise output will be padded
    ; with zeros regardless.
    
        shr      cx, 1                  ; 16 bit conversion: 4 digits
        and     eax, 0xffff             ; Strip bits 31-16 just in case

    ; Cycle until either EAX or AX is zero or CX is zero. 
        
 .Next: push    ax
        and     al, 15                  ; Strip bits 7-4
        or      al, '0'                 ; Convert to digit
        cmp     al, '9'                 ; Is AL a letter
        jbe     .J0
        
    ; Convert to alpha and optionally lower case
    
        add     al, 7                   ; Convert to alphabetic character
        bt      dx, CASE
        jnc     .J0
        or      al, 32                  ; Convert to lower case

    ; Keep writing characters until E(AX) is null or CX is null
        
   .J0: stosb                           ; Write to buffer       
        pop     ax
        shr     eax, 4                  ; Shift next digit position into AL
        loopnz  .Next                   ; Continue until either ZF=1 or CX=0
        
    ; Determine if caller wants to pad remainder of buffer
        
        bt      dx, PADD
        jnc     .Done
        mov     al, [FILL_CH]           ; Get fill character from parameters
        rep     stosb                   ; Fill remainder of buffer if CX > 0
        
 .Done: popf                            ; DF is restored to its original state
        mov     ax, di                  ; Return pointer to beginning of buffer in AX
        pop     di      
        leave                           ; Kill procedure frame
        ret     8                       ; Waste parameters

If "C" can do this within about 100 bytes, then I'll seriously consider changing what I do, but with all that needs to be done and fit in 1.4 meg, I need to be thrifty. Just one caveat: no in-line assembly. That is one of the reasons I went to assembly to begin with.

Let's also dispense with any rhetoric. To paraphrase @SpyderTL slightly, from the operator's perspective, with a CPU running at 3.8 GHz, the time difference is negligible. I believe there are many proficient "C" and "C++" programmers, but if an example is not forthcoming, I think it will be safe to assume it can't be done.

Posted: **Mon May 18, 2015 6:07 pm**

Just ran your code with nasm. test.asm -> test came to 79 bytes, but the same source with nasm -f elf came to 576 bytes. This is just a factor that shows up when using object formats with symbols, headers and the like; those seem large when the code they are attached to is small. That difference becomes much smaller after you have linked everything together, and afterwards you can strip away many things from the resulting binary. Take from this what you will.

Posted: **Mon May 18, 2015 8:24 pm**

Merlin wrote: afterwards you can strip away many things from the resulting binary.

Good to point out, as that is what I'm doing: producing a flat raw binary file and then writing it directly to media with HexEdit. Ultimately, no matter how the example is constructed, the object code can be isolated and then compared with the previous example, probably optimized with whatever switches "C" uses and stripped of any extraneous information.

Posted: **Tue May 19, 2015 1:48 am**

Hi,

TightCoderEx wrote:

If "C" can do this within about 100 bytes, then I'll seriously consider changing what I do, but with all that needs to be done and fit in 1.4 meg, I need to be thrifty. Just one caveat: no in-line assembly. That is one of the reasons I went to assembly to begin with.

Few compilers will generate 16-bit code. If padding is requested, there's no way to determine how much padding from the arguments (e.g. if the buffer is 123 bytes and you want 115 bytes of padding followed by an 8 character dword).

In general, software should choose a sane/standard format and stick to it, rather than having poor consistency (e.g. numbers displayed in different ways in different places). For example, I'd go with "all digits always upper case, never suppress leading zeros, all dwords always 8 characters regardless of value, all words always 4 characters regardless of value, all hex numbers always shown with an "0x" prefix". Assuming a sane/standardised format is used; if this code is written as "static" or link time optimisation is used I'd expect a C compiler to notice that the "flags" argument is always the same and for most of the pointless bloat (e.g. the "flags" argument itself, all branches that depend on it, and support for lower case) to be removed.
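The fixed format proposed above (upper case, no suppressed zeros, "0x" prefix, 8 characters per dword and 4 per word) can be sketched in C like this. The function names `hex32` and `hex16` and the caller-supplied buffer are assumptions made for the illustration, not anything taken from the posts.

```c
#include <stdint.h>

/* Writes "0x" plus exactly 8 upper-case hex digits and a terminating NUL
 * into out, which must hold at least 11 bytes. Returns out. */
char *hex32(char out[11], uint32_t value)
{
    static const char digits[] = "0123456789ABCDEF";

    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 8; i++)
        out[2 + i] = digits[(value >> (28 - 4 * i)) & 0xF];
    out[10] = '\0';
    return out;
}

/* Same format for a 16-bit word: "0x" plus exactly 4 digits.
 * out must hold at least 7 bytes. */
char *hex16(char out[7], uint16_t value)
{
    static const char digits[] = "0123456789ABCDEF";

    out[0] = '0';
    out[1] = 'x';
    for (int i = 0; i < 4; i++)
        out[2 + i] = digits[(value >> (12 - 4 * i)) & 0xF];
    out[6] = '\0';
    return out;
}
```

With no flags argument there is nothing for the caller to get inconsistent, and nothing flag-dependent left for the compiler to remove.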

Also note that your assembly isn't really very good. For example, "and eax, 0xffff" could be "movzx eax,ax", the "bt dx, CASE" should be "test dl,(1 << 5)"; using the lower 16-bits of 32-bit registers causes massive "partial dependency" problems, etc; and these types of things could've been fixed by a peephole optimiser. The code to convert a digit into a character has an "if lower case" branch that should have been moved out of the loop, and also has an "if (char > '9')" that is likely to cause frequent branch mispredictions that could've been avoided with a lookup table; and both these problems could've been fixed by an optimising compiler. Also, by passing parameters in registers you could've avoided most of the push/pop and the slow/micro-coded enter/leave; and an optimising compiler could've fixed all of that for you (assuming "static" or link time optimisation); up to and including inlining it so there's no call/ret if there's only one caller (possibly followed by constant propagation, constant folding and dead code elimination, potentially reducing the entire thing to a "rep movsd" that does nothing more than copy a string that was generated at compile time).
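Two of the points above, the lookup table in place of the "> '9'" branch and the case decision moved out of the loop, can be sketched in C while keeping the flag-driven interface of the original routine. This is an illustration under assumptions: the name `i2h`, the separate `fill` parameter and the `I2H_*` constants are invented here (the bit positions follow the PADD, WIDE and CASE defines in the listing). No claim is made about the size of the code any particular compiler generates for it.

```c
#include <stdint.h>

enum {
    I2H_PAD  = 1 << 0,   /* fill unused leading positions with `fill` */
    I2H_WIDE = 1 << 1,   /* convert 32 bits (8 digits) instead of 16 (4) */
    I2H_CASE = 1 << 5    /* emit a-f instead of A-F */
};

/* Writes hex digits right to left, ending at `last`, and returns a pointer
 * to the first character written. The caller must provide room for 4 or 8
 * characters below and including `last`, and place the NUL itself. */
char *i2h(uint32_t value, char *last, unsigned flags, char fill)
{
    /* Case is decided once, outside the loop, by choosing a table. */
    const char *digits = (flags & I2H_CASE) ? "0123456789abcdef"
                                            : "0123456789ABCDEF";
    int width = 8;
    char *p = last + 1;

    if (!(flags & I2H_WIDE)) {
        width = 4;
        value &= 0xFFFFu;            /* only the low word is significant */
    }

    /* Always emits at least one digit, like the assembly loop. */
    do {
        *--p = digits[value & 0xF];  /* table lookup, no compare-and-branch */
        value >>= 4;
        width--;
    } while (value != 0 && width > 0);

    if (flags & I2H_PAD) {
        while (width-- > 0)
            *--p = fill;
    }
    return p;
}
```

A call such as `i2h(0x1AF, buf + 7, I2H_WIDE | I2H_PAD, '0')` on a 9-byte buffer with `buf[8] = '\0'` fills all eight positions; without `I2H_PAD` only the three significant digits are written and the returned pointer marks where they start.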

Finally; the code that would be generated by an adequate/optimising compiler will depend on how the code is called in how many places (including which arguments are constant, etc); so it's impossible to compare the code generated by a compiler unless the compiler wasn't allowed to optimise (and therefore impossible to have a fair comparison). For example, for "best case" this code isn't called at all and the compiler generates zero bytes while the assembler wastes ~100 bytes for nothing.

Cheers,

Brendan

Posted: **Tue May 19, 2015 9:57 am**

I'm not sure how my own perceptions will be relevant to others, because the assembler I am developing is far from conventional, and while I do intend to write an unusually large percentage of Kether with it, much of that code will be more abstract than most C code due to the extensive use of macros. However, I will in general agree that most of the time, even a skilled assembly language programmer will be hard-pressed to write code that beats a really good optimizing compiler (neither GCC nor the Microsoft compilers count as this, IMAO, but the Intel C compiler and some experimental compilers such as Stalingrad certainly do) on anything longer than a few pages of code. The compiler will never get tired, or forgetful, or careless, while a human assembly coder can, and any faults in the compiled code (resulting from the compilation process, that is) can be laid at the feet of the compiler writers, not the compiler itself.

Perhaps more importantly (from my POV - remember, I usually run Gentoo as my desktop system, and the design I have for the Thelema compiler and Kether OS in general relies on a 'slim binary' JIT compilation approach), a compiler, if well-written, can optimize the object code it produces for the specific systems it runs on, whereas the assembly code will be optimized only for the specific system it was written for.

Posted: **Tue May 19, 2015 11:45 am**

Brendan wrote: Also note that your assembly isn't really very good.

I agree, but the reality is that this real mode code has but one purpose: to be informational. My focus is studying X86 architecture to a depth I never have before, and in order to get some sense of what can be expected from the 3 actual systems (real hardware) I have, I need to be able to visualize it. In the process, algo design is improving over earlier versions.

Once I get past this hardware thing, the next step will be an in-depth study of the Intel instruction set, MMX, SSE, VMX etc., and I'm sure Agner Fog's writings on code optimization will come in real handy too. I'm optimistic that by that time my protected and long mode creations will be significantly better. Logic dictates that in order for "C" to live up to its reputation, there must be a proficient assembly programmer behind it.

Has "C" insulated programmers from the minutiae of processor design? Yes, it has, and there is no doubt in my mind that is why it has become so popular. Case in point: when I used it in the early 80's, not only did I not have to worry about instruction sets, but it was also cross platform, so my apps could be run on Zilog, Intel and Motorola with just a few modifications.

I would propose there is no way under the sun any HLL can be as efficient as assembly code, but the caveat would be that it takes an assembly programmer of the same calibre as the compiler implementer. My choices are to study X86 and IA32/64 architecture or this. I think the reason I like NASM is that it is as clean and slick as the original K&R.

I know someday soon I will be creating code that will rival anything any compiler produces, but the reality is, soon may be the beginning of the next decade. There is something to be said for goals though. If, however, my mandate was to produce something that could actually be used, even a limited knowledge of "C"/"C++" is better than assembly.

Posted: **Tue May 19, 2015 9:20 pm**

The assembly vs high level debate reminded me of this thread:
http://forum.osdev.org/viewtopic.php?f= ... =copy+movs
which referenced memcpy:
http://www.eglibc.org/cgi-bin/viewvc.cg ... hrev=13698

The actual, optimal memcpy is enormously complex -- the compiler gets it right, and the assembly programmer almost inevitably does not. That really ends the debate.

Posted: **Wed May 20, 2015 1:54 am**

Hi,

azblue wrote:The assembly vs high level debate reminded me of this thread:
http://forum.osdev.org/viewtopic.php?f= ... =copy+movs
which referenced memcpy:
http://www.eglibc.org/cgi-bin/viewvc.cg ... hrev=13698

The actual, optimal memcpy is enormously complex -- the compiler gets it right, and the assembly programmer almost inevitably does not. That really ends the debate.

I wouldn't assume the compiler gets it right - the compiler almost always gets it "good enough" and almost never gets "optimal"; and there are multiple severe problems (for portability and usability, but also for optimisation) in both the tools (e.g. GCC) and the languages (e.g. C or C++); plus additional problems that affect both high level languages and assembly, caused by the way software is delivered (e.g. executables built on the programmer's machine, where it's impossible to know exactly what the end user's CPU is and impossible to optimise specifically for it).

The funny thing is, the main reason I'm using assembly is portability. Not portability to different architectures; but portability to different tools. I know that regardless of how I design my IDE, high level language and native tool-chain, it will have to support assembly. If I write my code in assembly now then it will mostly be "cut and paste" to port it to my native tools, and if I write my code in C now then it'll have to be completely rewritten.

Cheers,

Brendan

Posted: **Wed May 20, 2015 6:52 am**

I was more productive when I wrote C, but I was not comfortable with the toolchain dependency and the hidden tendency to end up with a "culture-associated" design direction. Composing "binary images" by using an assembler (with my own various helper tools) gives me a low-level base to build things on. It is the base level I chose and am comfortable with. Using e.g. C and the available compilers would raise the base level too much. I would like to have a high-level language but I have not decided what it should be.

However, I need at least an assembler, and I went through every option, ranging from writing my own to trying all the common assemblers available. I chose NASM, but not without serious consideration and thorough study of its source code. At some point I will port NASM to my system, but I decided that I am not going to build it on my system. This is a little drawback because at first I thought my system would be fully self-hosted. However, there are things like ACPICA that I probably want later on, so I accept the fact that there will be an external "connection" to the outside world. Those "externally built modules" are not a core part of the system, though. In particular, the assembler is just an ordinary application.

Brendan wrote: If I write my code in assembly now then it will mostly be "cut and paste" to port it to my native tools

Just out of interest: are you writing the assembler module yourself?

Posted: **Wed May 20, 2015 7:56 am**

Hi,

Antti wrote:
Brendan wrote: If I write my code in assembly now then it will mostly be "cut and paste" to port it to my native tools
Just out of interest: are you writing the assembler module yourself?

I will be; but it's not going to work the same as typical assemblers. Mostly, you'd type (or paste) text into the IDE and the IDE will immediately convert it into either tokens or ASL (the source code); then this "tokens or ASL" source code gets fed into the first compiler, which optimises as much as possible (and inserts hints for the second compiler) to generate an "intermediate representation" executable file that users download. Then, when the end user is installing the executable, a second compiler converts the IR from "mixture of higher level language and assembly for multiple targets" into native code for the user's specific CPU (while doing more optimisation).

For me, the hard part here is determining the file format the IDE uses (what "source code" actually looks like). I've experimented with both tokens and ASL and haven't really been able to decide which is better. I also want to cache "pre-compiled to IR" snippets in the same file (to speed up compiling larger things when very little was modified), to pre-compile those snippets in the background (while the programmer is typing), and to quickly generate diagrams representing the structure of software (e.g. call graphs, etc), which will probably involve storing/maintaining meta-data of some kind in the source file. Basically, I'll probably need to write a few prototype IDEs before I've really decided what the file format for source code looks like, and I'd need to do that before I can start writing the compiler/s.

Cheers,

Brendan

Posted: **Wed May 20, 2015 7:36 pm**

These were my findings from back in 2012 on a 16-core AMD system:

https://twitter.com/ReturnInfinity/stat ... 2381677569

Same application code, different OS.

Application code is here: https://github.com/ReturnInfinity/BareM ... primesmp.c

Speed : Assembly OS vs Linux