
A question about the PE/COFF and ELF formats

Posted: Thu Apr 07, 2022 9:15 am
by Ethin
So in my compilers class we've entered the code generation phase. I won't get too hung up on it if you guys think I shouldn't do this, but my professor gave the go-ahead and I thought it would be a good exercise to learn the PE/COFF and ELF binary formats.

My professor is having us generate a C/C++ file containing inline assembly using the MSVC __asm block statement (__asm { ... }). But I'm on Linux and don't have access to MSVC, so I'd need to either (1) use weird GCC hacks, (2) generate the assembly in a separate .S/.asm file and assemble it with an external assembler, or (3) generate the full binary myself. I'm opting for the third option to (as I noted above) learn the binary formats and how building a binary actually works under the hood (without all the extra stuff a compiler does, like debug information generation). I've done some digging and settled on asmjit for instruction generation (there's no way I'm pulling in LLVM, but if anyone knows a better library than asmjit, please do tell) and COFFI/ELFIO for PE/COFF/ELF binary generation. My question is: what makes a legal PE/COFF or ELF binary? What sections do I absolutely require, and what can I leave out?
I'm not looking to add a bunch of fancy stuff to this (though the code we're supposed to create is fully relocatable, yay, so I'd love to learn how to add that in), just the basics. The compiler can't call out to external libraries/call over the FFI boundary, so the ABI doesn't really matter (at least, I'm pretty sure I can forget the ABI since all you can call are functions you've specifically declared and defined); I just want to know what I absolutely need to add (excluding instructions obviously) to create a fully working binary that, unless I don't write an instruction properly, won't throw any signals or cause problems. I need to learn both PE/COFF and ELF because I need to be able to debug the generated code to ensure the assembly is correct and doesn't misbehave, and I'm not very skilled with LLDB (I'm more experienced with GDB). If you think I shouldn't do this and should just opt for the (much simpler) option of generating inline assembly and letting GCC/Clang do all the heavy lifting, or if you have any other advice, I'm definitely all ears. I'm mainly doing the standalone binary generation for the hell of it and as a major learning opportunity that I thought I might as well grab with both hands, particularly since my professor is encouraging it and thinks it would be a good way of earning extra credit (though I don't know if he'll actually give me extra credit for it; as far as I know I'm the only one who's considered doing this, so we'll see).

Re: A question about the PE/COFF and ELF formats

Posted: Thu Apr 07, 2022 10:03 am
by nullplan
I think you are trying to take too many steps at once. Typically the compiler generates assembly, the assembler generates an object file, and only then does a linker generate an executable file. That is three transformation processes and four file formats (if you count the source file), and each one is a completely different job.

So the compiler generates assembly code. That is, it generates the directives and instructions necessary to get the assembler to produce a valid object file. For the most part, the assembly file is just a textual representation of the object file, but a few things still make the split worthwhile. For one, outputting text makes your output far easier to debug; for two, you can leave the assembler's work to the assembler and concentrate on getting the compiler right.

The assembler converts the assembly code into object code. For the most part this means it generates a file in the applicable object code format. Most such formats (and particularly the two mentioned here) allow multiple sections, and it is the assembler's job to concatenate possibly multiple declarations of each section into a single one. Then there are address calculations. The assembler first has to pass over the code just to determine how large each directive and instruction will be, and thus where the symbols land; only then can it go about encoding the instructions. Another important thing is relocations. The assembler annotates certain bytes in the object file as having to contain specific addresses, and getting those right is probably the most important part of the assembler. For example:

Code: Select all

.section ".rodata","a",@progbits
.LC1: .asciz "Hello World!\n"

.text
  movq $.LC1, %rdi
  callq printf
The assembler cannot know what the address of .LC1 will be after linking, nor that of printf. So it generates the code for the move, fills the field that will become the address of .LC1 with zeroes, and emits a relocation entry telling the linker to place the address of the .rodata section there. For the call instruction, it likewise fills the destination with zeroes, then creates a relocation entry saying those bytes should become the difference between the address of printf and the end of the call instruction, because the destination of a call is encoded relative to the end of the instruction.

Finally, the linker. The linker must read back the object file, and possibly multiple of those, concatenate like sections, and generate an executable file, while processing relocations. Since you mentioned Windows, you are going to have to deal with dynamic linking at least a little bit.
Ethin wrote:so the ABI doesn't really matter (at least, I'm pretty sure I can forget the ABI since all you can call are functions you've specifically declared and defined);
Well, you are going to have to make at least a few system calls: at minimum exit() on Linux, or ExitProcess() on Windows. So the ABI is probably still a necessity. Since the compiler knows the target platform, this ought not to be a big problem, however.
Ethin wrote: I just want to know what I absolutely need to add (excluding instructions obviously) to create a fully working binary that, unless I don't write an instruction properly, won't throw any signals or cause problems.
You must have the right headers and the right exit code somewhere after the entry symbol, and you must execute that exit code; the details are OS-specific.

Re: A question about the PE/COFF and ELF formats

Posted: Thu Apr 07, 2022 10:10 am
by neon
Hi,

I'm not sure what the second paragraph has to do with learning PE/COFF... what is the intent here? COFF objects consist of an IMAGE_FILE_HEADER, a section table, relocation tables, a symbol table, and the section data. PE files have several headers, including a non-optional "optional" header and an optional set of directories that point to more structures; the image then contains its own symbol table, section table, relocation tables, etc.

You can learn PE/COFF (the format) by writing a reader for it, referencing the standards, and using editors to verify your results (e.g. PEView).

Re: A question about the PE/COFF and ELF formats

Posted: Thu Apr 07, 2022 10:38 am
by Ethin
nullplan wrote:I think you are trying to take too many steps at once. [...] You must have the right headers and the right exit code at some place following the entry symbol, and you must execute that exit code.
This is how the "typical" compiler works, right? I know that freepascal (as an example) integrates the compiler/linker/interpreter stages into one, at least I'm pretty sure that it does, but maybe it does call out to external tools?
Are you saying that I should just focus on generating assembly? The template that my professor is having us use looks like:

Code: Select all

#include <iostream>

using namespace std;

char DataSegment[65536];

int main() {
 _asm{
	 push eax  // store registers
	 push ebp
	 push edi
	 push esp
	 push ecx
	 push edx

	 lea ebp, DataSegment  // put starting address of data segment into ebp
	 jmp kmain    // jump around all of the procedures
// ...
	 pop edx   // restore the registers
	 pop ecx
	 pop esp
	 pop edi
	 pop ebp
	 pop eax
 }
	return 0;
}

Re: A question about the PE/COFF and ELF formats

Posted: Thu Apr 07, 2022 1:09 pm
by nullplan
Ethin wrote:This is how the "typical" compiler works, right? I know that freepascal (as an example) integrates the compiler/linker/interpreter stages into one, at least I'm pretty sure that it does, but maybe it does call out to external tools?
Honestly, I haven't looked at freepascal in a while, but I am reasonably sure that compiler, assembler, and linker are all part of the distribution. Maybe all packed into the same binary, but different identifiable parts of the program nonetheless. And while you don't necessarily need separate programs for these parts, you need to do all of the things mentioned in addition to compiling, and will likely end up generating a program structure that looks very similar to what I stated above.

The "typical" compiler will then also include a "compiler driver", which will have the sole task of configuring the other parts of the compiler correctly to yield the correct outputs. For example, GCC has the program "gcc", which is the compiler driver, then it has "cc1", which is the compiler, and then "as" and "ld" as assembler and linker, respectively, though the latter two are in the binutils package.
Ethin wrote:Are you saying that I should just focus on generating assembly?
No, I was merely pointing out that creating a valid executable is more work than hammering out the correct file header followed by object code, and trying to go from a system that uses all of the above components to rolling them all yourself may be a bigger jump than you can make inside a semester. None of the tools mentioned is particularly simple to write; this is not exactly a weekend project.

And the template is a funny one. By using C++'s main() function, you already sidestep a big part of the problem. That alone means you are using the C runtime, which contains the ExitProcess() call mentioned previously; it also splits the command line for you and does a fair bit of setup before entering main().

What I'm getting at is that the process of turning that template into an EXE file is pretty complicated and contains a lot of hidden complexity.

Re: A question about the PE/COFF and ELF formats

Posted: Thu Apr 07, 2022 1:30 pm
by Octocontrabass
Ethin wrote:The template that my professor is having us use looks like:
Here, I've translated it to work with GCC/Clang and an external assembly file instead of MSVC and inline assembly:

Code: Select all

void kmain( void );

char DataSegment[65536];

int main( int argc, char ** argv )
{
    (void)argc;
    (void)argv;
    int unused;
    asm volatile( "xchg %2, %%ebp\n\tcall %P1"
                  : "=a"(unused)
                  : "i"(kmain), "a"(DataSegment)
                  : "ebx","ecx","edx","esi","edi","ebp","memory","cc" );
    return 0;
}
You should be able to use this with MinGW to get a Windows binary too, although you might have to make some small adjustments to get your assembler to produce a compatible object file.