File formats, "human readable", etc. [Split from PE vs ELF]

Posted: Sun Aug 10, 2008 2:01 am
by bewing
Hi Brendan,

is your NFF header human readable? I am strongly contemplating using a non-ELF non-PE format, myself -- but I really like the concept of having headers be readable, rather than being all in binary.

Re: PE vs ELF

Posted: Mon Aug 11, 2008 4:35 am
by Brendan
Hi,
bewing wrote:is your NFF header human readable? I am strongly contemplating using a non-ELF non-PE format, myself -- but I really like the concept of having headers be readable, rather than being all in binary.
My NFF header is definitely *not* human readable. Partly because "human readable" typically means readable by a minority of humans only (e.g. English speakers only), and partly because it's inefficient (it takes up more space and needs to be parsed). A binary format is more efficient, and you can easily write utilities to display the header information in any language you like, so that it looks right in English, French, Japanese, Arabic, etc.
thepowersgang wrote:I was designing a format based on the HTA (HTML Application) idea in Windows XP and Vista.
Correct me if I'm wrong, but as far as I can tell "HTA" is Microsoft's attempt at a ("Windows only", due to dependencies on the registry, browser, etc) replacement for (primarily open source, platform independent) Java. I'd refuse to go anywhere near it.
thepowersgang wrote:Would composing an app in XML be an idea? XML header only maybe?
[OFF-TOPIC RANT]I think it's rather insane that the slowest-bandwidth connection we use (the internet) carries the most bloated and inefficient data formats (e.g. HTML). I'd be tempted to go the opposite way - replace HTML with an equivalent binary format designed to pack as much information into as few bytes as possible (and then compressed by some mandatory compression algorithm) to minimize download times and bandwidth-related costs (remembering that we all pay good money for internet bandwidth in some way or another). The original HTML code would be the source code for an "HTML to binary" compiler, and web browsers would just use the binary (instead of parsing the HTML into their own internal binary format). You could also have a "binary to HTML" de-compiler for people who still use text editors to edit HTML, so you wouldn't actually lose any functionality.

Worse, HTML itself gives very poor control over caching. You can control how long the page itself is cached (sort of - it's crude/dodgy), but things like embedded images have no cache controls at all. It'd make more sense to me if the web browser could cache everything and then ask the server for a list containing MD5 checksums for the page itself and for every file referenced by that page, so that the web browser only ever downloads lists of MD5 checksums, things that have changed, and things that aren't in the cache (see the sketch after this rant). Of course then there's the modern "anti-cache" technologies (PHP, ASP, etc) that are severely over-used. It's as if none of it was designed for the internet at all.
[/OFF-TOPIC RANT]
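Purely as an illustration of that checksum-list idea (the wire format below is entirely made up), such a manifest could be little more than a sequence of entries like this:

Code:

#include <stdint.h>

/* Entirely hypothetical format for the checksum-list idea: the
   browser fetches this manifest first, then downloads only the
   resources whose MD5 doesn't match its cached copy. */
struct manifest_entry {
    uint8_t  md5[16];   /* MD5 of the resource's content   */
    uint16_t path_len;  /* length of the path that follows */
    /* ...followed by path_len bytes of resource path... */
};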

Um, yeah - XML and XML headers. Here's a tiny bitmap image in XML:

Code:

<image>
  <row>
    <pixel>
      <red>5%</red>
      <blue>59%</blue>
      <green>90%</green>
    </pixel>
    <pixel>
      <red>15%</red>
      <blue>55%</blue>
      <green>84%</green>
    </pixel>
    <pixel>
      <red>5%</red>
      <blue>59%</blue>
      <green>90%</green>
    </pixel>
  </row>
  <row>
    <pixel>
      <red>8%</red>
      <blue>57%</blue>
      <green>79%</green>
    </pixel>
    <pixel>
      <red>9%</red>
      <blue>53%</blue>
      <green>88%</green>
    </pixel>
    <pixel>
      <red>11%</red>
      <blue>61%</blue>
      <green>92%</green>
    </pixel>
  </row>
</image>
That must be better than raw binary (like a BMP), simply because it's XML, right? I mean it doesn't matter that it takes over 500 bytes to describe 6 pixels because those bytes would've been having lunch or doing something equally useless if we didn't use them, and spending thousands of cycles of CPU time parsing it will help to keep me warm on cold winter nights. As an added bonus it's human readable! This means I can look at the file in a text editor if I need to know exactly what color a pixel is! It's amazing! Can you imagine how bad a binary format would be if your life depended on knowing the color of a pixel? And people actually use binary formats for bitmap images? OMG what are they thinking! :roll:
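For contrast, here's a sketch (in C, with a layout invented purely for illustration) of the same six pixels as raw binary - around 22 bytes instead of 500+:

Code:

#include <stdint.h>

/* Hypothetical raw format: width, height, then one (R,G,B)
   byte triple per pixel. */
struct tiny_image {
    uint16_t width;
    uint16_t height;
    uint8_t  pixels[6][3];  /* 6 pixels x (R,G,B) = 18 bytes */
};

/* The same pixels as the XML above, with percentages scaled
   to 0..255 (e.g. 5% -> 13, 90% -> 230, 59% -> 150). */
static const struct tiny_image img = {
    .width = 3, .height = 2,
    .pixels = {
        {13, 230, 150}, {38, 214, 140}, {13, 230, 150},
        {20, 201, 145}, {23, 224, 135}, {28, 235, 156},
    },
};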

Um, yeah - XML and XML headers....


Cheers,

Brendan

Re: PE vs ELF

Posted: Mon Aug 11, 2008 5:53 am
by Solar
Hehehe... Brendan on a rant, it's always a sight!

I agree, btw, and those "let's use XML, it's great" calls put me in a killing rage, too. I had such a run-in at the office recently, and would love to tell the story, but I guess it would get me fired... :roll:

But regarding HTML and bandwidth - you know about mod_deflate, don't you?

Re: PE vs ELF

Posted: Mon Aug 11, 2008 12:36 pm
by bewing
I enjoy Brendan's rants, too. :D I agree with him on the HTML stuff. But system headers are different.

(Edit: of course, the major bandwidth hogs of the internet are 1. spam/hacking, 2. video, 3. audio -- so the inefficiency of HTML tags is irrelevant in reality.)
Brendan wrote:and partly because it's inefficient (takes up more space and needs to be parsed).
It's not that bad. :wink:
If you have a "header variable" that tends to be small, <100 for example, you can fit it nicely in a human-readable ASCII decimal format in 5 bytes, and the length is flexible. If you store it in binary, you need 4 bytes ... unless the value has the potential to grow over 4G, in which case you have to .... #-o

And parsing decimal is almost as efficient as parsing non-aligned binary on an x86 chip. In a real-world system the difference will be unnoticeable - it's only a 7-opcode loop for a clean parse, as I recall.
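For what it's worth, a minimal C version of such a loop (a sketch, not bewing's actual code) really is tiny:

Code:

#include <stdint.h>

/* Parse an unsigned ASCII decimal field; stops at the first
   non-digit. Compiles to a very short loop on x86. */
static uint32_t parse_dec(const char *s, const char **end)
{
    uint32_t v = 0;
    while (*s >= '0' && *s <= '9')
        v = v * 10 + (uint32_t)(*s++ - '0');
    if (end)
        *end = s;
    return v;
}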

Using ASCII decimal also eliminates all endian portability issues, of course.

Re: PE vs ELF

Posted: Mon Aug 11, 2008 1:23 pm
by Brynet-Inc
Fortunately, most HTTP servers these days compress data with gzip if the browser supports it.. I've even seen bzip2 used for this purpose. (XKCD, correct me if I'm wrong..).

I don't know if the "World Wide Web" should embrace binary formats, mostly because the current ones (Flash/Java/Silverlight) are either proprietary, bloated, or lame.

That's my opinion, 0.02 cents CAD. 8)

Re: PE vs ELF

Posted: Mon Aug 11, 2008 1:41 pm
by tetsujin
bewing wrote:Hi Brendan,

is your NFF header human readable? I am strongly contemplating using a non-ELF non-PE format, myself -- but I really like the concept of having headers be readable, rather than being all in binary.
My take on this:

1: Consider your "audience". Who is going to read this header more often - the OS or a human? Isn't it more useful to prioritize the ease of writing code that handles the file format, rather than prioritizing the ease with which a naive human can negotiate the file format without any specialized tools?

2: "Human-readable" is a fallacy. A format is only "human-readable" in that you already have the tools you need to read what's in there. Without such a tool, even ASCII is no more "human-readable" than any other format. So all it takes to make another format "human-readable" is a tool that interprets the data you put in the header. ASCII means these tools don't need to be specialized, but how useful is a non-specialized tool for this job, anyway?

3: Consider automated tools that may want to be able to identify the type of this file. (Even if your OS makes the type of a file explicit through metadata or something, you might want to consider cases like recovering after data loss. If you've lost the filename and all metadata, it's handy to be able to get the file type from the file itself. But if the file structure leads with ASCII, then it might be mis-identified as ASCII text.)

Re: PE vs ELF

Posted: Mon Aug 11, 2008 2:07 pm
by bewing
1. Me.
2. If I can read it conveniently, it's good enough.
3. Not if ALL system/binary files start with a 4-byte ASCII text "magic number" that indicates their filetype.

Having a header be "human readable" (whatever you choose that to specifically mean) is an "additional feature" of a format that I find useful - and a useful additional feature at almost no cost is always a bargain.
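As a sketch of how cheap that check can be (the magic values below are invented for illustration):

Code:

#include <string.h>

/* Identify a file from its leading 4-byte ASCII magic number.
   The tag values are hypothetical. */
static const char *identify(const unsigned char hdr[4])
{
    if (memcmp(hdr, "EXEC", 4) == 0) return "executable";
    if (memcmp(hdr, "LIBR", 4) == 0) return "shared library";
    if (memcmp(hdr, "CONF", 4) == 0) return "configuration";
    return "unknown";
}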

Re: PE vs ELF

Posted: Mon Aug 11, 2008 5:34 pm
by Brendan
Hi,
Solar wrote:Hehehe... Brendan on a rant, it's always a sight!

I agree, btw, and those "let's use XML, it's great" calls put me in a killing rage, too. I had such a run-in at the office recently, and would love to tell the story, but I guess it would get me fired... :roll:
You'll probably agree until I mention that the same arguments apply to text-based configuration files (you know, those configuration files lazy programmers expect you to mess with when they couldn't be stuffed writing a decent multi-lingual configuration interface). :D
Solar wrote:But regarding HTML and bandwidth - you know about mod_deflate, don't you?
I know Apache is capable of sending gzip'ed files if the browser supports it, but I don't know how many browsers support it. I do know that compression isn't part of the official HTML standard, and that a compressed elephant is still larger than a compressed mouse, and that compression won't help with decent caching...


Cheers,

Brendan

Re: PE vs ELF

Posted: Mon Aug 11, 2008 8:15 pm
by Brynet-Inc
Brendan wrote:I know Apache is capable of sending gzip'ed files if the browser supports it, but I don't know how many browsers support it. I do know that compression isn't part of the official HTML standard, and that a compressed elephant is still larger than a compressed mouse, and that compression won't help with decent caching...
It's in the HTTP/1.1 specification, RFC 2616 - pages are compressed on-the-fly.

Search for Content-Encoding / Content-Codings... as for client support, most clients handle it - all the major browsers do, from Mozilla and Microsoft to the KHTML/WebKit browsers. ;)
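For example, the negotiation is just a pair of headers - the client advertises what it accepts and the server labels what it sent (a minimal sketch of the exchange):

Code:

GET /index.html HTTP/1.1
Host: www.example.com
Accept-Encoding: gzip, deflate

HTTP/1.1 200 OK
Content-Type: text/html
Content-Encoding: gzip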

Re: PE vs ELF

Posted: Mon Aug 11, 2008 11:12 pm
by Brendan
Hi,
bewing wrote:It's not that bad. :wink:
If you have a "header variable" that tends to be small, <100 for example, you can fit it nicely in a human-readable ASCII decimal format in 5 bytes, and the length is flexible. If you store it in binary, you need 4 bytes ... unless the value has the potential to grow over 4G, in which case you have to ....
Ok, my generic header consists of a 64-bit file size, an 8-byte compliance string ("BCOS_NFF" in ASCII/UTF-8), a 32-bit CRC, the 32-bit file type, and 8 bytes that are reserved (in case I need them for something later).
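As a C sketch of that layout (the field names are mine and the ordering simply follows the list above; only the sizes come from the description) - 32 bytes total:

Code:

#include <stdint.h>

/* Sketch of the generic NFF header as described; all values
   little-endian per the spec. Field names are guesses. */
struct nff_header {
    uint64_t file_size;      /* 64-bit file size          */
    char     compliance[8];  /* "BCOS_NFF" in ASCII/UTF-8 */
    uint32_t crc;            /* 32-bit CRC                */
    uint32_t file_type;      /* 32-bit file type          */
    uint8_t  reserved[8];    /* reserved for later use    */
};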

For the file size and the CRC you could encode them as hexadecimal ASCII, but that would double the number of bytes needed, assuming there's no prefix and no string terminator (e.g. the string "12345678" is 8 characters while the value 0x12345678 is 4 bytes). You could use something like base 32 (digits '0' to '9' and 'A' to 'V'), but then a 32-bit value still costs 7 bytes (6 fully-used bytes and one partially-used byte), it causes lots of shifting, etc, and you'd want an eighth unused byte to align the next field. The same applies to the file size (and to case-sensitive base 64, if you can find a few more characters/digits). Both of these values could be variable-length strings, but that sucks worse, because I want the size of the generic header to be constant - so you don't need to parse the generic header to find out where the extended header or file data is, and so I don't need a "header_size" field in the header.

The 8-byte compliance string is in ASCII/UTF-8 already (without a string terminator).

The 32-bit file type is used by the file system itself, so that (for example) if you download a file via FTP, the FTP client can get the file type from the header (after checking the compliance string and CRC to verify that it's a native file format) and tell the file system what type of file it is. However, only a subset of possible values is used for native file formats - the rest are used for non-native file formats. For example, file type 0x80000000 might be an image file using the native file format, while 0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc. Basically, a 4-character identifier string isn't enough for all (native and non-native) file formats, especially if you're limited to meaningful strings (rather than random characters). Instead I'd probably want at least 8 characters (but even then I'd expect the 3 and 4 letter identifiers to run out fast).

None of these fields would ever be seen by a human (e.g. someone using a text editor). If they want to know the file size or the file type, they'd get this information from similar fields in the directory entry on the file system (not from the file itself), and (like all good OSs) there'd be ways of doing this *nicely* (e.g. right-click on the file's icon and select "file properties"), where numbers are displayed using locale-specific formatting (e.g. "1,234,567.89 KiB" or "1.234.567,89 KiB") and file types are displayed in the current language. For example, file type 0x00010002 might be displayed as "texto llano" to a Spanish user, "texte clair" to a French user, "testo normale" to an Italian user, "plain text" to an English user, etc.

Of course this is relatively easy (it's just look-up tables), while the opposite isn't true - e.g. if a Spanish user wrote "texto llano" for their human-readable file type then I'd have to do string comparisons to figure out that the file is plain text, and then I'd still need to convert that into "testo normale" for Italian users.
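A sketch of what such a look-up table could look like in C (names and table contents are illustrative only):

Code:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical table mapping a binary file type to per-language
   display names; a real OS would load these from locale files. */
enum lang { LANG_EN, LANG_ES, LANG_FR, LANG_IT };

struct type_name {
    uint32_t    file_type;
    const char *name[4];     /* indexed by enum lang */
};

static const struct type_name type_names[] = {
    { 0x00010002, { "plain text", "texto llano",
                    "texte clair", "testo normale" } },
};

static const char *type_display_name(uint32_t type, enum lang l)
{
    for (size_t i = 0; i < sizeof type_names / sizeof *type_names; i++)
        if (type_names[i].file_type == type)
            return type_names[i].name[l];
    return "unknown";
}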

There is another alternative though - force all users to learn keywords (that probably aren't part of their native language), so that (for example) they're all expected to know what "plain text" means in English. IMHO this is almost as silly as expecting people to learn that the binary value 0x00010002 means "plain text".
bewing wrote:Using ASCII decimal also eliminates all endian portability issues, of course.
A specification that clearly states "all values in the generic header are in little-endian format" is enough IMHO.


Cheers,

Brendan

Re: PE vs ELF

Posted: Tue Aug 12, 2008 1:17 am
by cyr1x
Brendan wrote:Hi,
For example, file type 0x80000000 might be an image file using the native file format, while 0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc.
Though I like this idea, I see one problem with it. Some formats require a string at the very beginning of the file (GIF, for example). While you could make it work on your own OS, you'll have problems when you move files to other OSes.

Re: PE vs ELF

Posted: Tue Aug 12, 2008 4:15 am
by Brendan
Hi,
cyr1x wrote:
Brendan wrote:For example, file type 0x80000000 might be an image file using the native file format, while 0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc.
Though I like this idea, I see one problem with it. Some formats require a string at the very beginning of the file (GIF, for example). While you could make it work on your own OS, you'll have problems when you move files to other OSes.
I'm not planning to add my header to existing file formats - I'm planning to invent entirely new "native" file formats and leave existing file formats alone... :)


Cheers,

Brendan

Re: File formats, "human readable", etc. [Split from PE vs ELF]

Posted: Tue Aug 12, 2008 6:10 am
by cyr1x
Ahh! Now I get it :oops: .

Re: PE vs ELF

Posted: Tue Aug 12, 2008 7:58 am
by tetsujin
bewing wrote:I enjoy Brendan's rants, too. :D I agree with him on the HTML stuff. But system headers are different.

(Edit: of course, the major bandwidth hogs of the internet are 1. spam/hacking, 2. video, 3. audio -- so the inefficiency of HTML tags is irrelevant in reality.)
Brendan wrote:and partly because it's inefficient (takes up more space and needs to be parsed).
It's not that bad. :wink:
If you have a "header variable" that tends to be small, <100 for example, you can fit it nicely in a human-readable ASCII decimal format in 5 bytes, and the length is flexible. If you store it in binary, you need 4 bytes ... unless the value has the potential to grow over 4G, in which case you have to .... #-o
It's no more difficult to make a variable-length binary numeric field than to make a variable-length ASCII decimal field... The difference is that the binary version is more compact and easier for the computer to process (processing can be done via bit shifting - no multiply or divide needed)

For instance, take the format used by Google Protocol Buffers, which works roughly like this:
  • A variable-length integer is stored as a stream of bytes, with 7 bits per byte used to store numeric data and one bit per byte used as a continuation flag.
  • least-significant bits are stored in earlier bytes
  • the top bit of each byte is the continuation flag - it is set to 1 on every byte in the varint except the last, so a clear top bit marks the end
  • for efficient encoding of signed numbers, there's a "zig-zag" encoding that can be used (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, etc.) - zig-zag(x) where x : int32 = (x << 1) XOR (x >> 31)
Google Protocol Buffers use varints extensively in the message structure itself, in addition to varints being a possible encoding for field values... A 32-bit value could require as much as 5 bytes encoded this way - as opposed to up to 10 digits in ASCII decimal...
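A minimal C sketch of that encoding, following the description above (note that x >> 31 relies on an arithmetic shift):

Code:

#include <stdint.h>
#include <stddef.h>

/* Zig-zag: maps signed to unsigned so small magnitudes stay
   small (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...). */
static uint32_t zigzag32(int32_t x)
{
    return ((uint32_t)x << 1) ^ (uint32_t)(x >> 31);
}

/* Encode v as a varint: 7 data bits per byte, least-significant
   group first, top bit set on every byte except the last.
   Returns the byte count (1..5 for 32-bit values). */
static size_t varint_encode(uint32_t v, uint8_t *out)
{
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v & 0x7F) | 0x80;
        v >>= 7;
    }
    out[n++] = (uint8_t)v;
    return n;
}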


The problem of how to tag files with type information (both with "file magic" and with metadata outside of the file contents) interests me... I came up with some ideas for how I would handle it...

First, as others have mentioned, there can be a set of enumerated values defined by the system, which can be used to identify a limited set of types efficiently. This isn't a particularly scalable solution, however - a problem with a dense enumeration system like this is that it tends to become disorganized as it grows (abandoned IDs become "holes" in the list, and new IDs are assigned in the order they're adopted, so the list gets messier over time) - so I feel the set of types identified this way should be pretty limited.

A more general level of type-identification in addition to that set can be done with a sparse ID system - this is an instance in which I do feel that ASCII data in a binary file can be useful - in the vein of the 4-byte chunk tags used in formats like IFF, RIFF/WAV, PNG, etc. It's generally reasonably easy to continue assigning new tags within that space as the set of defined types grows.

However, I don't feel like either enumeration system can be expected to account for every major file type on the system. In particular, as new formats are invented, provision of an ID for the type shouldn't depend on someone adding an enum to the list - in this case it may make sense to allow for an even more sparse ID space for these upcoming types - perhaps based on the "reversed DNS name" system that's pretty common these days (though you could also use 128-bit GUIDs and allow the people defining the formats to establish these on their own...)


I figure that, in order to play nice, the "magic numbers" in file contents that determine file type should still be sufficiently distinct from the "magic" used to identify other file types - even on my own system, for my own datatypes... Because, after all, even on such a system I won't likely be imposing my scheme on all files present... Existing files in established formats can be left alone. PNG's file header is a good example of a robust header - the signature starts with a non-ASCII byte (so it can't be mistaken for text) and contains \r\n and \n sequences (to detect file mangling due to ASCII-mode transport over FTP) - add a type ID field to a header like that and I think you'd have a good basis for safely identifying a collection of file types... And if there's additionally metadata that makes file type identification more efficient, all the better...
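For reference, here are those eight PNG signature bytes and what each part guards against:

Code:

#include <stdint.h>

/* The 8-byte PNG signature. */
static const uint8_t png_sig[8] = {
    0x89,              /* non-ASCII: can't be mistaken for text */
    0x50, 0x4E, 0x47,  /* "PNG": human-spottable in a hex dump  */
    0x0D, 0x0A,        /* \r\n: catches CRLF -> LF conversion   */
    0x1A,              /* Ctrl-Z: stops DOS "type" mid-file     */
    0x0A,              /* \n: catches LF -> CRLF conversion     */
};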

Re: PE vs ELF

Posted: Tue Aug 12, 2008 8:59 pm
by Brendan
Hi,
tetsujin wrote:However, I don't feel like either enumeration system can be expected to account for every major file type on the system. In particular, as new formats are invented, provision of an ID for the type shouldn't depend on someone adding an enum to the list - in this case it may make sense to allow for an even more sparse ID space for these upcoming types - perhaps based on the "reversed DNS name" system that's pretty common these days (though you could also use 128-bit GUIDs and allow the people defining the formats to establish these on their own...)
I'm planning a 32-bit file type that is split into a pair of 16-bit fields - a major file type and a minor file type. The major file type describes a group of related file formats, where the number is arbitrarily selected (sparse). The minor file type determines which file format within the group, and is an enumeration starting at 0x0000 (the native file format).

For example, major file type 0x8000 might be for all file formats for 2D graphic data. This makes 0x80000000 the file type for my native 2D graphics format, and file type numbers for non-native file formats would be assigned sequentially from there (0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc).

Basically this gives me 65536 groups of file formats, with 65536 specific file formats per group. For some purposes (e.g. selecting an icon to represent files in a GUI) I'd ignore the minor file type (e.g. all file formats for 2D graphic data would be represented by the same icon).
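In C the split is trivial (the macro names are mine):

Code:

#include <stdint.h>

/* 32-bit file type = 16-bit major (sparse group number) and
   16-bit minor (enumerated format in the group, 0 = native). */
#define FT_MAJOR(t)        ((uint16_t)((t) >> 16))
#define FT_MINOR(t)        ((uint16_t)((t) & 0xFFFF))
#define FT_MAKE(maj, min)  (((uint32_t)(maj) << 16) | (uint16_t)(min))

#define FT_MAJOR_IMAGE2D   0x8000   /* 2D graphic data group */
#define FT_IMAGE2D_NATIVE  FT_MAKE(FT_MAJOR_IMAGE2D, 0x0000)
#define FT_IMAGE2D_PNG     FT_MAKE(FT_MAJOR_IMAGE2D, 0x0001)
#define FT_IMAGE2D_BMP     FT_MAKE(FT_MAJOR_IMAGE2D, 0x0002)
#define FT_IMAGE2D_JPG     FT_MAKE(FT_MAJOR_IMAGE2D, 0x0003)

/* e.g. a GUI picking an icon ignores the minor type:
   icon = icon_for_major(FT_MAJOR(file_type)); */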

There is a purpose behind all of this that I haven't mentioned yet...

First, in my file system, files have a file name, a file type and a file version; and different files with the same file name can co-exist in the same directory (as long as the file type or file version is different).

Second, the virtual file system keeps track of a file's type so that when an application tries to open a file the VFS can automatically select a file converter and convert the file into the corresponding native file format (if necessary) without the application knowing/caring.

For example, a directory might contain "foo.jpg (type 0x80000002, version 0)", "foo.jpg (type 0x80000002, version 1)" and "foo.jpg (type 0x80000000, version 0)", even though the file names are all the same. If an application opens the file "foo.jpg", the VFS will find the most recent version of the file, which would be "foo.jpg (type 0x80000002, version 1)". Then the VFS would realise that the file type isn't a native file type, and would automatically start a file converter that creates the new file "foo.jpg (type 0x80000000, version 1)". Finally, the VFS would open this new file for the application.
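A hypothetical sketch of that open path in C - every name below is invented, and it only illustrates the control flow:

Code:

#include <stdint.h>
#include <stddef.h>

/* Opaque types and helpers; all hypothetical. */
typedef struct dirent_s    dirent_t;
typedef struct file_s      file_t;
typedef struct converter_s converter_t;

dirent_t    *vfs_lookup_newest(const char *name);
uint32_t     dirent_type(const dirent_t *e);
int          is_native_type(uint32_t type);
converter_t *find_converter(uint32_t type);
dirent_t    *run_converter(converter_t *c, dirent_t *src);
file_t      *vfs_open_entry(dirent_t *e);

file_t *vfs_open(const char *name)
{
    dirent_t *e = vfs_lookup_newest(name);  /* newest version */
    if (e == NULL)
        return NULL;

    if (!is_native_type(dirent_type(e))) {
        converter_t *c = find_converter(dirent_type(e));
        if (c == NULL)
            return NULL;          /* no converter: unsupported */
        e = run_converter(c, e);  /* creates a native sibling  */
    }
    return vfs_open_entry(e);     /* app only sees native data */
}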

It sounds messy, but there are several advantages:
  • all applications for my OS will only ever need to work on native file formats.
  • you'd be able to create/install a new file converter for a different file format, and all existing applications would immediately support the new file format.
  • it's impossible for an application to use "vendor lock-in", where it's the only application that can support some proprietary file format.
Also, I don't really need to assign file types to all file formats - I only need to assign file types to files that can be converted into native file formats. If there's no file converter then no application will be able to support the file format and it won't matter that the file type is set to "unknown".


Cheers,

Brendan