Hi,
bewing wrote:It's not that bad.
If you have a "header variable" that tends to be small, <100 for example, you can fit it nicely in a human-readable ASCII decimal format in 5 bytes, and the length is flexible. If you store it in binary, you need 4 bytes ... unless the value has the potential to grow over 4G, in which case you have to ....
Ok, my generic header consists of a 64-bit file size, an 8-byte compliance string ("BCOS_NFF" in ASCII/UTF-8), a 32-bit CRC, a 32-bit file type, and 8 bytes that are reserved (in case I need them for something later).
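As a rough sketch, that header could be written as a C struct like the one below. The struct name, the field names, and the field order (taken straight from the order the fields are listed in) are assumptions for illustration, not part of any specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: names and field order are assumptions. */
struct bcos_generic_header {
    uint64_t file_size;      /* 64-bit file size */
    char     compliance[8];  /* "BCOS_NFF", no string terminator */
    uint32_t crc;            /* 32-bit CRC */
    uint32_t file_type;      /* 32-bit file type */
    uint8_t  reserved[8];    /* reserved for later use */
};

/* With this order every field is naturally aligned, so there is no padding
 * and the header is a fixed 8 + 8 + 4 + 4 + 8 = 32 bytes. */
static_assert(sizeof(struct bcos_generic_header) == 32,
              "generic header must be exactly 32 bytes");
static_assert(offsetof(struct bcos_generic_header, file_type) == 20,
              "file type is expected at byte offset 20");
```

The static assertions make the compiler reject any change that would alter the header's size.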
For the file size and the CRC you could encode them as "hexadecimal ASCII", but that would double the number of bytes needed, assuming there's no prefix and no string terminator (e.g. the string "12345678" is 8 characters while the value 0x12345678 is 4 bytes). You could use something like "base 32" (digits '0' to '9' and 'A' to 'V'), but then a 32-bit value would still cost 7 bytes (6 fully used bytes and one partially used byte), it would cause lots of shifting, etc., and you'd want an eighth unused byte to align the next field. The same applies to the file size (and to case-sensitive "base 64", if you can find a few more characters/digits). Both of these values could be variable-length strings, but that sucks worse because I want the size of the generic header to be constant, so that you don't need to parse the generic header to find out where the extended header or file data is, and so that I don't need a "header_size" field in the header.
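To put numbers on that, here is a small sketch that computes how many text digits a fixed-width value needs when each digit carries 4 bits (hex), 5 bits (base 32) or 6 bits (base 64). The function name digits_needed is made up for this example.

```c
#include <assert.h>

/* Digits needed to hold any value of value_bits bits when each text digit
 * carries bits_per_digit bits (4 = hex, 5 = base 32, 6 = base 64). */
unsigned digits_needed(unsigned value_bits, unsigned bits_per_digit)
{
    assert(bits_per_digit != 0);
    /* Ceiling division: a partially used digit still costs a whole byte. */
    return (value_bits + bits_per_digit - 1) / bits_per_digit;
}
```

For the 32-bit CRC this gives 8 digits in hex, 7 in base 32 and 6 in base 64, against 4 bytes in binary. For the 64-bit file size it gives 16, 13 and 11 digits, against 8 bytes in binary.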
The 8-byte compliance string is in ASCII/UTF-8 already (without a string terminator).
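Because there is no terminator, the field has to be compared as 8 raw bytes rather than as a C string. A minimal sketch, assuming the compliance string sits at byte offset 8 (right after the 64-bit file size); the function name and the offset are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define COMPLIANCE_LEN    8
#define COMPLIANCE_OFFSET 8  /* assumption: follows the 64-bit file size */

/* Returns true if the buffer is long enough and carries "BCOS_NFF". */
bool has_compliance_string(const unsigned char *header, size_t len)
{
    assert(header != NULL);
    if (len < COMPLIANCE_OFFSET + COMPLIANCE_LEN)
        return false;
    /* memcmp, not strcmp: the field is exactly 8 bytes with no terminator. */
    return memcmp(header + COMPLIANCE_OFFSET, "BCOS_NFF", COMPLIANCE_LEN) == 0;
}
```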
The 32-bit file type is used by the file system itself, so that (e.g.) if you download a file via FTP, the FTP client can get the file type from the header (after checking the compliance string and CRC to verify that it's a native file format) and tell the file system what type of file it is. However, only a subset of the possible values is used for native file formats - the rest are used for non-native file formats. For example, file type 0x80000000 might be an image file using the native file format, while 0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc. Basically, a 4-character identifier string isn't enough for all (native and non-native) file formats, especially if you're limited to meaningful strings (rather than random characters). Instead I'd probably want at least 8 characters (but even then I'd expect the 3 and 4 letter identifiers to run out fast).
For all of these fields a human (e.g. using a text editor) wouldn't ever see any of them. If they want to know the file size or the file type, then they'd get this information from similar fields in the directory entry on the file system (not from the file itself) and (like all good OSs) there'd be ways of doing this *nicely* (e.g. right click on the file's icon and select "file properties"), where numbers are displayed using locale-specific formatting (e.g. "1,234,567.89 KiB" or "1.234.567,89 KiB") and file types are displayed in the current language. For example, file type 0x00010002 might be displayed as "texto llano" for a Spanish user, "texte clair" for a French user, "testo normale" for an Italian user, "plain text" for an English user, etc.
Of course this is relatively easy (it's just look-up tables), but the opposite direction isn't: e.g. if a Spanish user wrote "texto llano" as their human-readable file type, then I'd have to do string comparisons to figure out that the file is plain text, and then I'd still need to convert that into "testo normale" for Italian users.
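The look-up table direction could look something like the sketch below. The enum, the table, and the function name are all hypothetical; the only values taken from this post are 0x00010002 and the four example strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum language { LANG_ENGLISH, LANG_SPANISH, LANG_FRENCH, LANG_ITALIAN, LANG_COUNT };

static_assert(LANG_COUNT == 4, "each table row lists one name per language");

struct type_name {
    uint32_t    file_type;
    const char *name[LANG_COUNT];
};

/* Hypothetical table with a single entry, using the example from the post. */
static const struct type_name type_names[] = {
    { 0x00010002u, { "plain text", "texto llano", "texte clair", "testo normale" } },
};

/* Returns the display name for a file type, or NULL if it is unknown. */
const char *file_type_name(uint32_t file_type, enum language lang)
{
    if ((unsigned)lang >= (unsigned)LANG_COUNT)
        return NULL;
    for (size_t i = 0; i < sizeof type_names / sizeof type_names[0]; i++) {
        if (type_names[i].file_type == file_type)
            return type_names[i].name[lang];
    }
    return NULL;
}
```

The binary value is the key and the display strings are only ever output, so no string comparison is needed anywhere.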
There is another alternative though - force all users to learn keywords (that probably aren't part of their native language), so that (e.g.) they are all expected to know what "plain text" means in English. IMHO this is almost as silly as expecting people to learn that the binary value 0x00010002 means "plain text".
bewing wrote:Using ASCII decimal also eliminates all endian portability issues, of course.
A specification that clearly states "all values in the generic header are in little-endian format" is enough IMHO.
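Once the byte order is fixed by the specification, portable code only needs to assemble the values byte by byte instead of casting pointers. A minimal sketch (the helper names read_le32 and read_le64 are made up here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value; works on any host byte order. */
uint32_t read_le32(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Read a 64-bit little-endian value as two 32-bit halves, low half first. */
uint64_t read_le64(const unsigned char *p)
{
    return (uint64_t)read_le32(p) | (uint64_t)read_le32(p + 4) << 32;
}
```

On a little-endian machine a compiler will typically reduce each of these to a plain load, so the portability costs little.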
Cheers,
Brendan