text vs. binary formats

Posted: Wed Apr 28, 2010 7:24 am
by NickJohnson
I'm currently designing my VFS, as a userspace server, and the format for communicating with it. However, my question is more general: should formats for communicating with servers in a microkernel system be binary, or completely text-based, as in systems like MySQL and UNIX, where most communication is done using text streams/queries? And how about other things, like executable formats and filesystem formats; is representation of data as text a practical possibility for those as well?

One of the things that I see as a major obstacle is representation of numbers in text. Octal works because its digits are contiguous in ASCII, but it does not map well to bytes and takes up a lot of space; conversely, hexadecimal maps well to bytes, but its digits (0-9, then A-F) are not contiguous in ASCII. Possibly a hexadecimal system of sixteen contiguous letters, like 'A-P' or 'G-V', or quaternary?
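
To make the "contiguous hexadecimal" idea concrete, here is a minimal sketch, assuming the sixteen digits are the capital letters starting at 'A' (which run through 'P'). The function names are made up for illustration. Each byte becomes two letters, high nibble first, so decoding a digit is a single subtraction with no lookup table or range check for a second digit block.

```python
def encode_letters(data):
    """Encode bytes as letters, two per byte: 'A' = 0 ... 'P' = 15."""
    base = ord("A")
    return "".join(
        chr(base + (byte >> 4)) + chr(base + (byte & 0x0F)) for byte in data
    )


def decode_letters(text):
    """Inverse of encode_letters; rejects anything outside 'A'..'P'."""
    if len(text) % 2 != 0:
        raise ValueError("expected an even number of letters")
    base = ord("A")
    out = bytearray()
    for i in range(0, len(text), 2):
        high = ord(text[i]) - base
        low = ord(text[i + 1]) - base
        if not (0 <= high < 16 and 0 <= low < 16):
            raise ValueError("digit outside 'A'..'P'")
        out.append((high << 4) | low)
    return bytes(out)
```

The cost is the same as ordinary hexadecimal (two characters per byte); the only thing gained is the simpler digit decoding.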

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 7:53 am
by gerryg400
NickJohnson,
I was also, until recently, designing a VFS in user space (I've briefly paused to port my kernel to 64-bit). It hadn't even occurred to me to use a text format for the messaging until right now.

Pros for text:
Easy to read the messages for debugging
Messages can be typed on the command line for testing
- both these would be nice in my current situation

Pros for binary:
My implementation uses switch statements to parse messages
I use (numerical) handles to index open files/memory regions etc. This makes security a little easier to implement in the kernel because it translates the client side handle to a server side verified handle.
It ought to be faster in most implementations.

I'm a long way down the binary path and won't change, but it's an interesting concept.

- gerryg400

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 8:09 am
by NickJohnson
I was more thinking about the portability than the debugging benefits, but that is a good point too. If you want to switch to 64 bit, the format never changes; it just is longer.

Idk why it would be more secure when using binary though: numbers can be represented in text equally well, albeit with more overhead. The real key to lowering that overhead is the representation of the numbers, but that is tricky, as I previously mentioned.

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 8:20 am
by gerryg400
You're right, neither is inherently more secure. Perhaps the only pro for binary is that it is more compact. The messages will be smaller and the parser will be slightly simpler.

Maybe give an example of what you envisage.

- gerryg400

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 3:28 pm
by NickJohnson
I don't know exactly what I'm envisioning at the moment: even within the idea of representing formats as text, there are a few ways to do it.

The simplest way is the one currently used in tarballs: specific indexes in the format are reserved for specific fields, with strings encoded as ASCII and numbers as octal digits. The issue with this is that you do not get size flexibility, only endianness flexibility, and the format cannot be easily extended unless extra indexes were reserved in advance. The format of the numbers is also fixed, so, for example, if the format starts out with octal, it must forever use octal. The benefit of this way is that the total size of the format is completely fixed, so it could theoretically be used for something like a filesystem or a fixed-size microkernel message.
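
A rough sketch of that fixed-index approach, loosely modelled on how tar stores numbers (zero-padded octal digits with a NUL terminator). The three-field layout, the field widths, and the function names are invented for this example and are not taken from tar or from any real VFS protocol.

```python
# Hypothetical layout: (field name, width in bytes, kind)
FIELDS = [("name", 16, "str"), ("mode", 8, "oct"), ("size", 12, "oct")]
RECORD_SIZE = sum(width for _, width, _ in FIELDS)  # always 36 bytes


def pack(record):
    """Build a fixed-size record; raises ValueError if a field overflows."""
    out = b""
    for name, width, kind in FIELDS:
        if kind == "oct":
            # width - 1 octal digits, zero-padded, then a NUL terminator
            digits = format(record[name], "0%do" % (width - 1))
            text = digits.encode("ascii") + b"\0"
        else:
            text = record[name].encode("ascii").ljust(width, b"\0")
        if len(text) != width:
            raise ValueError("field %r does not fit in %d bytes" % (name, width))
        out += text
    return out


def unpack(data):
    """Read the fields back from their reserved offsets."""
    record, offset = {}, 0
    for name, width, kind in FIELDS:
        raw = data[offset:offset + width].rstrip(b"\0").decode("ascii")
        record[name] = int(raw, 8) if kind == "oct" else raw
        offset += width
    return record
```

Every record is exactly RECORD_SIZE bytes regardless of content, which is the property that makes this style usable for fixed-size messages, and also the reason a value that outgrows its field cannot be represented at all.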

Another way would be to have data encoded as a simple string with whitespace for separation, and have a parser that reads in fields to a specified format. This is more complex for parsing, but allows variable length fields with potentially different number formats: prefixes can be used like in C, with "0x" denoting hexadecimal, a leading "0" denoting octal, and no prefix denoting decimal. The other benefit is that numbers which do not require more than a couple of digits, even if their fields could, do not take up extra space. Combined with the compressibility of text formats, that could actually save a lot of space (e.g. null fields take only one byte *uncompressed*). Just like the simpler idea, there is also no way of reordering fields, so extensions are exclusively at one end.
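
As a sketch of the whitespace-separated variant, assuming every field is a non-negative integer: C's strtol() with base 0 already understands these prefixes, but Python's int(s, 0) rejects a bare leading "0" as an octal marker, so the helper below spells the rule out by hand. Both function names are hypothetical.

```python
def parse_c_int(token):
    """Parse a non-negative integer with C-style prefixes.

    "0x..." is hexadecimal, a leading "0" is octal, anything else is decimal.
    """
    lowered = token.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered[1:], 8)
    return int(lowered, 10)


def parse_message(line):
    """Split a message on whitespace and parse each field as a number."""
    return [parse_c_int(token) for token in line.split()]
```

For example, parse_message("0x1F 017 42 0") gives [31, 15, 42, 0], and the null field at the end costs a single character plus its separator.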

The most flexible but most complex way would be to have a sort of key-value pair system for every field. The program would have to express the layout of some internal binary format to the parser, which would fill in the needed fields. Essentially, the format would be a property list. With this way and the last, there would have to be some complex way of marshalling and delimiting strings, so the contents cannot disrupt the formatting, so that would also add overhead. Maybe each field could be on a separate line, or a "line" delimited with a null character?
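
One possible shape for the property-list idea, as a sketch only: each field is a "key=value" entry terminated by a null character, so values can contain spaces, newlines, or '=' without any escaping. The only restrictions are that keys cannot contain '=' and nothing can contain a null. The function names and the choice of '=' as the separator are assumptions made for this example.

```python
def encode_props(props):
    """Serialize a dict of str -> str as NUL-terminated key=value entries."""
    parts = []
    for key, value in props.items():
        if "\0" in key or "\0" in value or "=" in key:
            raise ValueError("key or value cannot be represented: %r" % key)
        parts.append(key + "=" + value + "\0")
    return "".join(parts).encode("utf-8")


def decode_props(data):
    """Parse entries in any order; the first '=' separates key from value."""
    props = {}
    for entry in data.split(b"\0"):
        if not entry:
            continue  # empty piece left after the final terminator
        key, sep, value = entry.decode("utf-8").partition("=")
        if not sep:
            raise ValueError("entry without '=': %r" % entry)
        props[key] = value
    return props
```

Because fields are looked up by key, they can be reordered or extended freely, at the price of carrying the key names in every message.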

Idk, I'm just throwing out ideas: I really just wanted to see what people thought about the use of text formats.

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 4:11 pm
by Owen
While I would never recommend it for a system-local protocol, I have a great suggestion for files and for low-bandwidth internet links: Bencode. Encoders and decoders are trivial in pretty much any language, the format is easy to parse, and it's flexible.

The only thing I would do is decide on a uniform encoding to use for the contained strings. Good options would be UTF-8, or my favourite, SCSU.
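
To show how small a Bencode codec can be, here is a minimal sketch covering the four Bencode types: integers (i42e), byte strings (4:spam), lists (l...e), and dictionaries (d...e, keys in sorted order). It does little validation, and the choice to encode Python str as UTF-8 is an assumption of this example, following the UTF-8 option above.

```python
def bencode(value):
    """Encode ints, bytes/str, lists, and dicts as Bencode bytes."""
    if isinstance(value, bool):
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        items = [
            (k.encode("utf-8") if isinstance(k, str) else k, v)
            for k, v in value.items()
        ]
        items.sort(key=lambda pair: pair[0])  # keys are emitted sorted
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def bdecode(data, pos=0):
    """Decode one value starting at pos; returns (value, next position)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            item, pos = bdecode(data, pos)
            result[key] = item
        return result, pos + 1
    if lead.isdigit():
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError("unexpected byte at position %d" % pos)
```

Note that strings come back from bdecode as bytes, since the format itself says nothing about their encoding.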

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 4:53 pm
by NickJohnson
That seems like a good, simple format. The real sticking point, it seems, is once again speed. Are there any published benchmarks about how fast bencode can be encoded and decoded in relation to normal data? The amount of data being encoded in my situation is quite small; any large amount of data transfer has no format other than a small binary header on all of my IPC messages. The main problem with bencode for me is that strings seem to need to be copied out of the format to be manipulated, which would have a lot of overhead.

I'm currently formulating a format that might work better and more generally - I'll post back in a bit.

Re: text vs. binary formats

Posted: Wed Apr 28, 2010 6:08 pm
by Owen
As said - bencode isn't suitable for machine-local or high-bandwidth use. It's great for files and such, however.