File formats, "human readable", etc [Split from PE vs ELF]

tetsujin
Posts: 14
Joined: Mon Aug 11, 2008 10:49 am
Location: Massachusetts

File tagging and conversion

Post by tetsujin »

Brendan wrote:Hi,
tetsujin wrote:However, I don't feel like either enumeration system can be expected to account for every major file type on the system. In particular, as new formats are invented, provision of an ID for the type shouldn't depend on someone adding an enum to the list - in this case it may make sense to allow for an even more sparse ID space for these upcoming types - perhaps based on the "reversed DNS name" system that's pretty common these days (though you could also use 128-bit GUIDs and allow the people defining the formats to establish these on their own...)
I'm planning for a 32-bit file type that is split into a pair of 16-bit fields - a major file type and a minor file type. The major file type describes a group of related file formats, where the number is arbitrarily selected (sparse). The minor file type determines which file format within the group; these are enumerated starting at 0x0000 (the native file format).

For example, major file type 0x8000 might be for all file formats for 2D graphic data. This makes 0x80000000 the file type for my native 2D graphics format, and file type numbers for non-native file formats would be assigned sequentially from there (0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc).
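In code, the quoted major/minor split might look like the sketch below. The constants (0x8000 for 2D graphics, the PNG/BMP assignments) come from the post itself; the function names are purely illustrative.

```python
# Sketch of the proposed 32-bit file type: high 16 bits identify a major
# group of related formats, low 16 bits pick a format within the group
# (minor 0x0000 is reserved for the OS's native format).

def make_type(major: int, minor: int) -> int:
    return (major << 16) | minor

def split_type(file_type: int) -> tuple[int, int]:
    return file_type >> 16, file_type & 0xFFFF

GRAPHICS_2D = 0x8000                          # major group: 2D graphic data
NATIVE_2D = make_type(GRAPHICS_2D, 0x0000)    # 0x80000000, native format
PNG_2D = make_type(GRAPHICS_2D, 0x0001)       # 0x80000001
BMP_2D = make_type(GRAPHICS_2D, 0x0002)       # 0x80000002
```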
For my OS shell project I've been thinking along similar lines for some of the features you describe - things like file conversion service, multiple revisions or representations of the same thing, and so on... So I think I can appreciate the direction you're trying to go here.

Yeah, breaking up into at least a two-level hierarchy like this can help... But it still relies on some central authority assigning those file type IDs. My suggestion was that there should be provisions for dealing with the file (in a somewhat less efficient manner) for the period from when the new file format starts being used to the point where the central authority decides to canonize the format. For instance:

1: Provide a baseline (least efficient, most comprehensive) method for ID-ing file types - something based on a sparse ID space (I suggested either binary GUIDs or some textual identification)
2: In addition to the primary two-level hierarchy, allow for alternate directories, so other "authorities" would also have the ability to declare enumerations for individual datatypes as necessary. (For instance, imagine all this were happening on Linux - there might be some group, maybe kernel devs or maybe a group like Linux Standard Base deciding the "main" enumeration - and meanwhile individual groups, like the distro maintainer or even individual sys admins, might want to provide their own directories for file types not yet centrally registered)
3: For space-efficient encoding of the data type a file could be tagged with a compact-form tag (an enum in which both levels are defined in one of the available directories) - then for exchange with another host, if the ID isn't one of the centrally-registered ones it'd have to be replaced with the less efficient version...
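The three provisions above could be sketched roughly as follows. This is a hypothetical toy model, not anything from the posts: the registry names, the local-id numbering, and the tuple-based tag encoding are all made up for illustration.

```python
import uuid

# Two-tier tagging sketch: a format's universal id is a 128-bit GUID
# (provision 1); a compact 32-bit id exists only if some directory -
# central or local (provision 2) - has enumerated the format.

CENTRAL = {}  # GUID -> centrally-registered compact id
LOCAL = {}    # GUID -> compact id assigned on this host/network only

def register_local(guid):
    # A local admin or distro maintainer assigns a compact id from a
    # (made-up) local-only major group.
    LOCAL[guid] = 0x80010000 + len(LOCAL)
    return LOCAL[guid]

def tag_on_disk(guid):
    # Provision 3: prefer the compact form when any directory knows it.
    cid = CENTRAL.get(guid)
    if cid is None:
        cid = LOCAL.get(guid)
    return ("compact", cid) if cid is not None else ("guid", guid.bytes)

def tag_for_exchange(guid):
    # Locally-assigned ids mean nothing to another host, so fall back to
    # the 16-byte GUID unless the format is centrally registered.
    cid = CENTRAL.get(guid)
    return ("compact", cid) if cid is not None else ("guid", guid.bytes)
```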

Basically, suppose I invent a new 2-D raster image format... I want people to use it so I create conversion libs for use on your platform. But how do I get an ID allocated to identify the file type? I ask you, I guess. But then from your perspective - your system is becoming more popular at this point, and you've got all kinds of people asking you for IDs. Half the ones you've assigned in the last month went to people who abandoned their projects soon after, so maybe you're a bit hesitant to allocate an ID to an unproven new project like mine... With the provisions I describe above, if nothing else I could just create a binary GUID and use that as my file type ID. And then if somebody's using my code on your system, and they want to make the tags on those files take less space (not like 16 bytes is that bad for a file tag these days - unless there's a huge number of very small files) they can have a type registry local to their own computer or network which includes an enumeration for my file type. Or maybe my OS distro maintainer already did it for me when they made that library into a package...

Of course, for a design in the planning stages, this isn't an immediate concern... But if you're thinking of this as something people could actually use, I think it's a consideration that must be made. To be useful the system has to keep up with other people's software - but no central group can keep up with all the software development out there, so there should be ways for people to work outside of the central registry.
Brendan wrote:For example, a directory might contain "foo.jpg (type 0x80000002, version 0)", "foo.jpg (type 0x80000002, version 1)" and "foo.jpg (type 0x80000000, version 0)" even though the file names are all the same. If an application opens the file "foo.jpg" then the VFS will find the most recent version of the file, which would be "foo.jpg (type 0x80000002, version 1)". Then the VFS would realise that the file type isn't a native file type, and would automatically start a file converter that creates the new file "foo.jpg (type 0x80000000, version 1)". Finally, the VFS would open this new file for the application.

It sounds messy, but there are several advantages:
  - all applications for my OS will only ever need to work on native file formats.
  - you'd be able to create/install a new file converter for a different file format and all existing applications would immediately support the new file format.
  - it's impossible for an application to use "vendor lock-in", where it's the only application that can support some proprietary file format.
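A minimal sketch of the quoted open-then-convert flow, with directory entries modeled as (name, type, version, data) tuples. The converter table, type constants, and function names are hypothetical, invented here purely to illustrate the lookup order.

```python
NATIVE_2D = 0x80000000
JPG_2D = 0x80000002
# Converter registry: maps a non-native type to a function producing
# native-format data (the "native:" prefix is a stand-in for real work).
CONVERTERS = {JPG_2D: lambda data: b"native:" + data}

def vfs_open(directory, name):
    # Find the most recent version of the named file, regardless of type.
    entries = [e for e in directory if e[0] == name]
    latest = max(entries, key=lambda e: e[2])
    if latest[1] == NATIVE_2D:
        return latest
    # Non-native: convert, keep the converted copy, and open that instead.
    converted = (name, NATIVE_2D, latest[2], CONVERTERS[latest[1]](latest[3]))
    directory.append(converted)
    return converted
```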
The fact that you not only provide the conversion but actually keep the converted file around seems a bit problematic - assuming it's an uncompressed representation of lossy-compressed data, for instance, it'll be pretty space-wasteful, and managing that means there would need to be decent UI support for browsing the available revisions of a file and cleaning up the unneeded ones (though if you've planned for versioning then I expect you planned for this, too...)

I feel like there are various cases where auto-conversion is problematic, as well - for instance where it's lossy or doesn't represent the original file's true nature. (SVG might be considered a 2-D image format, for instance, but it's not a raster format... Or there are various music formats which can be readily converted to waveforms for playback - but this representation is neither efficient for editing nor representative of the file's real nature... Those are the best examples I can come up with at the moment, unfortunately...)

I don't see how any design can prevent vendor lock-in, though. If they can establish their own file type you can't force them to provide conversion methods outside of their own application... And there's no guarantee that their file format will really map to one of your provided native formats anyway - at best a design like this encourages app writers not to attempt lock-in, but it doesn't make lock-in impossible. (I considered the same problem for my shell and came to the conclusion that I can't control people's behavior. I can't force them to provide a translation of, or an API to, their data format, and even if I could there's no way I could force them to make it easy for people to work with that format... It's really only at the point of standardization that you can attempt to open up and stabilize a format - but I can't force my users to only use standardized file types (or apps which are cooperative enough to provide translations to/from common formats), and I don't think they'd benefit from being so compelled, either...)
---GEC
Progress means holding on to the good and replacing the bad. Be a fan if you like, but don't let it blind you!
I want to write a truly new command-line OS shell. Design is tough...
AJ
Member
Posts: 2646
Joined: Sun Oct 22, 2006 7:01 am
Location: Devon, UK

Re: File formats, "human readable", etc [Split from PE vs ELF]

Post by AJ »

I like some of the ideas being discussed, but see some other potential hurdles:

1) Plain text files. It is often useful to distribute ASCII files, such as a README, which can easily be read cross-platform. Are you going to add an ID to the start of every text file, and how do you detect that it is a plain text file in the first place?

2) (really 1b!) Source files. Existing utilities which would be on my "to port" list are things like gcc, fasm and so on. They expect a plain source file without a 4-byte ID at the beginning. The way I see this, you either have to leave these files as they are, or convert them but then strip the file header each time before passing it to the compiler/assembler.

3) Further to the rant (good reading, by the way!) about HTML. HTML is a format which is likely to be around for some time to come, whether we like it or not. It has size overhead associated with it because of the nature of the language. Are you really going to add more overhead by requiring a browser to convert it into a native format before rendering it, each time a web page is downloaded? The same goes for web page graphics, scripts and so on.

If you get around 1 and 2 by allowing plain ASCII files on the system, the easiest way for me as a software vendor to introduce lock-in to your system would be to tell your system that I was storing a plain text file (or whatever other format), but actually store my own content. How do you begin to get around that?

Cheers,
Adam
tetsujin
Posts: 14
Joined: Mon Aug 11, 2008 10:49 am
Location: Massachusetts

Re: File formats, "human readable", etc [Split from PE vs ELF]

Post by tetsujin »

AJ wrote:I like some of the ideas being discussed, but see some other potential hurdles:

1) Plain text files. It is often useful to distribute ASCII files, such as a README, which can easily be read cross-platform. Are you going to add an ID to the start of every text file, and how do you detect that it is a plain text file in the first place?
My approach to this problem (and a great many others) is a bit different from Brendan's - though my concept is for a shell to run on a Linux kernel, rather than for a whole new OS...

Basically, I feel I can't mandate people do anything they don't want to with their own file formats - can't expect them to put my kind of type IDing into the file contents, can't expect them to register enumerations for their file type in an ID registry, etc. So existing formats have to be left alone. Files without type-tagging have to have their format identified by file magic, or type tags can be stored in filesystem extended attributes. (If I were designing the whole OS I could reserve space in the file contents for a type tag, or space in the filesystem structure - but for my purposes having xattrs is sufficient.)

File magic on my system would auto-detect only a limited set of really prominent formats (but it would be fairly aggressive in making sure that the file really is what it seems to be - so for instance a text file beginning with FORM would not be mis-identified as an IFF file...) - anything not in that list would get categorized as text if it is mostly in the ASCII range, or "bytestream" if it's not. If the type determined by file magic is wrong, then it'd be up to the user to correct that... Then once a file is type-tagged, file magic needn't be used.

It's not a perfect system (non-ASCII character sets or ASCII text with IBM graphics characters in it are potential limitations to the approach) but the intent is to have a system that will work reasonably well in an "imperfect world" - one in which people may be getting their files from a shared volume somewhere where the files aren't (and can't be) type-tagged, etc. Any formats I define just have to be distinct enough that they won't be mistaken for anything else on my "file magic" list...
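The described fallback classifier might be sketched like this. The signature list, the 95% printable-ASCII threshold, and the function names are assumptions made for illustration; the IFF sanity check (don't trust the "FORM" bytes alone) follows the example in the post.

```python
# Rough sketch: check a few "prominent" magic signatures with an extra
# plausibility test, else call the file "text" if it is mostly printable
# ASCII, otherwise "bytestream".

def looks_like_iff(data: bytes) -> bool:
    # A real IFF file is "FORM", a big-endian chunk size, then a type id;
    # require the declared size to match, not just the magic bytes, so a
    # text file that happens to start with "FORM" isn't misidentified.
    if len(data) < 12 or data[:4] != b"FORM":
        return False
    size = int.from_bytes(data[4:8], "big")
    return size == len(data) - 8

def classify(data: bytes) -> str:
    if looks_like_iff(data):
        return "iff"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "png"
    printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in data)
    return "text" if data and printable / len(data) > 0.95 else "bytestream"
```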

In my design I figured there's just no practical way to prevent people from making their own data formats that aren't interoperable. Basically, they have to want to cooperate or it ain't gonna happen.
---GEC
Progress means holding on to the good and replacing the bad. Be a fan if you like, but don't let it blind you!
I want to write a truly new command-line OS shell. Design is tough...