For my OS shell project I've been thinking along similar lines for some of the features you describe - things like file conversion service, multiple revisions or representations of the same thing, and so on... So I think I can appreciate the direction you're trying to go here.

Brendan wrote:
Hi,
tetsujin wrote:
However, I don't feel like either enumeration system can be expected to account for every major file type on the system. In particular, as new formats are invented, provision of an ID for the type shouldn't depend on someone adding an enum to the list - in this case it may make sense to allow for an even more sparse ID space for these upcoming types - perhaps based on the "reversed DNS name" system that's pretty common these days (though you could also use 128-bit GUIDs and allow the people defining the formats to establish these on their own...)

I'm planning for a 32-bit file type that is split into a pair of 16-bit fields - a major file type and a minor file type. The major file type describes a group of related file formats, where the number is arbitrarily selected (sparse). The minor file type determines which file format within the group is meant, and is an enumeration starting with 0x0000 (the native file format).
For example, major file type 0x8000 might be for all file formats for 2D graphic data. This makes 0x80000000 the file type for my native 2D graphics format, and file type numbers for non-native file formats would be assigned sequentially from there (0x80000001 might be PNG, 0x80000002 might be BMP, 0x80000003 might be JPG, etc).
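To make the major/minor split concrete, here's a rough Python sketch of the packing described above. The function names (pack_file_type, unpack_file_type, is_native) are made up for illustration and aren't from any real API; the only things taken from the post are the 16/16 split, the example major type 0x8000, and minor 0x0000 meaning "native".

```python
from typing import Tuple

MAJOR_2D_GRAPHICS = 0x8000  # example major type from the post
NATIVE_MINOR = 0x0000       # minor 0x0000 is the native format of a group


def pack_file_type(major: int, minor: int) -> int:
    """Combine a 16-bit major and 16-bit minor type into one 32-bit value."""
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def unpack_file_type(file_type: int) -> Tuple[int, int]:
    """Split a 32-bit file type back into (major, minor)."""
    return (file_type >> 16) & 0xFFFF, file_type & 0xFFFF


def is_native(file_type: int) -> bool:
    """True when the minor field says this is the group's native format."""
    return (file_type & 0xFFFF) == NATIVE_MINOR


print(hex(pack_file_type(MAJOR_2D_GRAPHICS, 0x0001)))  # 0x80000001
print(is_native(0x80000000))                            # True
```

Everything in the same major group shares the upper 16 bits, so finding the native type for any member is just a mask of the lower half.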
Yeah, breaking up into at least a two-level hierarchy like this can help... But it still relies on some central authority assigning those file type IDs. My suggestion was that there should be provisions for dealing with the file (in a somewhat less efficient manner) for the period from when the new file format starts being used to the point where the central authority decides to canonize the format. For instance:
1: Provide a baseline (least efficient, most comprehensive) method for ID-ing file types - something based on a sparse ID space (I suggested either binary GUIDs or some textual identification)
2: In addition to the primary two-level hierarchy, allow for alternate directories, so other "authorities" would also have the ability to declare enumerations for individual datatypes as necessary. (For instance, imagine all this were happening on Linux - there might be some group, maybe kernel devs or maybe a group like Linux Standard Base deciding the "main" enumeration - and meanwhile individual groups, like the distro maintainer or even individual sys admins, might want to provide their own directories for file types not yet centrally registered)
3: For space-efficient encoding of the data type a file could be tagged with a compact-form tag (an enum in which both levels are defined in one of the available directories) - then for exchange with another host, if the ID isn't one of the centrally-registered ones it'd have to be replaced with the less efficient version...
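The three points above could fit together roughly like this. It's only a sketch under my own assumptions: TypeDirectory, tag_for_storage and id_for_exchange are hypothetical names, the GUIDs are random, and the tag 0x80000100 is an arbitrary locally chosen value. The idea is that the GUID is the baseline identity, any directory can map a GUID to a compact tag, and only centrally registered tags survive an exchange with another host.

```python
import uuid
from typing import Dict, List, Optional, Tuple, Union


class TypeDirectory:
    """One authority's enumeration, mapping compact 32-bit tags to GUIDs."""

    def __init__(self, name: str, central: bool = False) -> None:
        self.name = name
        self.central = central  # True only for the main, shared registry
        self._guid_by_tag: Dict[int, uuid.UUID] = {}
        self._tag_by_guid: Dict[uuid.UUID, int] = {}

    def register(self, tag: int, guid: uuid.UUID) -> None:
        self._guid_by_tag[tag] = guid
        self._tag_by_guid[guid] = tag

    def tag_for(self, guid: uuid.UUID) -> Optional[int]:
        return self._tag_by_guid.get(guid)

    def guid_for(self, tag: int) -> uuid.UUID:
        return self._guid_by_tag[tag]


def tag_for_storage(
    guid: uuid.UUID, directories: List[TypeDirectory]
) -> Optional[Tuple[TypeDirectory, int]]:
    """Find a compact tag for local storage; None means 'store the GUID'."""
    for directory in directories:
        tag = directory.tag_for(guid)
        if tag is not None:
            return directory, tag
    return None


def id_for_exchange(directory: TypeDirectory, tag: int) -> Union[int, uuid.UUID]:
    """Central tags travel as-is; anything else falls back to the GUID."""
    if directory.central:
        return tag
    return directory.guid_for(tag)


central = TypeDirectory("central", central=True)
local = TypeDirectory("site-local")
png_guid = uuid.uuid4()         # stand-in GUID for an already-registered type
new_format_guid = uuid.uuid4()  # GUID the format's inventor made up themselves
central.register(0x80000001, png_guid)
local.register(0x80000100, new_format_guid)

directories = [central, local]  # searched in order, central first
where, tag = tag_for_storage(new_format_guid, directories)
print(where.name, hex(tag))                            # site-local 0x80000100
print(id_for_exchange(where, tag) == new_format_guid)  # True
```

A real design would also have to say how a stored tag records which directory it came from; here that's sidestepped by passing the directory object around.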
Basically, suppose I invent a new 2-D raster image format... I want people to use it so I create conversion libs for use on your platform. But how do I get an ID allocated to identify the file type? I ask you, I guess. But then from your perspective - your system is becoming more popular at this point, and you've got all kinds of people asking you for IDs. Half the ones you've assigned in the last month went to people who abandoned their projects soon after, so maybe you're a bit hesitant to allocate an ID to an unproven new project like mine... With the provisions I describe above, if nothing else I could just create a binary GUID and use that as my file type ID. And then if somebody's using my code on your system, and they want to make the tags on those files take less space (not like 16 bytes is that bad for a file tag these days - unless there's a huge number of very small files) they can have a type registry local to their own computer or network which includes an enumeration for my file type. Or maybe my OS distro maintainer already did it for me when they made that library into a package...
Of course, for a design in the planning stages, this isn't an immediate concern... But if you're thinking of this as something people could actually use, I think it's a consideration that must be made. To be useful the system has to keep up with other people's software - but no central group can keep up with all the software development out there, so there should be ways for people to work outside of the central registry.
The fact that you not only provide the conversion, but actually keep the converted file around seems a bit problematic - assuming it's an uncompressed representation of lossy-compressed data, for instance, it'll be pretty space-wasteful, and managing that means there would need to be decent UI support for managing the available revisions of a file and cleaning up the unneeded ones (though if you've planned for versioning then I expect you planned for this, too...)

Brendan wrote:
For example, a directory might contain "foo.jpg (type 0x80000002, version 0)", "foo.jpg (type 0x80000002, version 1)" and "foo.jpg (type 0x80000000, version 0)" even though the file names are all the same. If an application opens the file "foo.jpg" then the VFS will find the most recent version of the file which would be "foo.jpg (type 0x80000002, version 1)". Then the VFS would realise that the file type isn't a native file type, and would automatically start a file converter that creates the new file "foo.jpg (type 0x80000000, version 1)". Finally, the VFS would open this new file for the application.
It sounds messy, but there are several advantages:
- all applications for my OS will only ever need to work on native file formats.
- you'd be able to create/install a new file converter for a different file format and all existing applications would immediately support the new file format.
- it's impossible for an application to use "vendor lock-in", where it's the only application that can support some proprietary file format.
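For what it's worth, the open-and-convert flow in the quoted example could be sketched like this in Python. The names (Entry, open_native, the converters table) are hypothetical, and the converter is a placeholder that just prefixes the bytes. Reusing an existing native entry of the same version instead of converting again is my own assumption; the quoted description doesn't say either way.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Entry:
    name: str
    file_type: int  # 32-bit type: upper 16 bits major, lower 16 bits minor
    version: int
    data: bytes


Converter = Callable[[bytes], bytes]


def native_type(file_type: int) -> int:
    """Native type of the same major group (minor field cleared to 0x0000)."""
    return file_type & 0xFFFF0000


def open_native(
    directory: List[Entry], name: str, converters: Dict[int, Converter]
) -> Entry:
    """Return the newest version of `name` in native form, converting if needed."""
    candidates = [e for e in directory if e.name == name]
    if not candidates:
        raise FileNotFoundError(name)
    newest = max(e.version for e in candidates)
    latest = [e for e in candidates if e.version == newest]
    # Assumption: if a native copy of the newest version exists, reuse it.
    for entry in latest:
        if entry.file_type == native_type(entry.file_type):
            return entry
    source = latest[0]
    convert = converters.get(source.file_type)
    if convert is None:
        raise LookupError(f"no converter for type {source.file_type:#010x}")
    converted = Entry(
        name, native_type(source.file_type), source.version, convert(source.data)
    )
    directory.append(converted)  # the converted file is kept, as in the post
    return converted


# The three entries from the quoted example.
directory = [
    Entry("foo.jpg", 0x80000002, 0, b"old"),
    Entry("foo.jpg", 0x80000002, 1, b"new"),
    Entry("foo.jpg", 0x80000000, 0, b"native-old"),
]
converters = {0x80000002: lambda data: b"native:" + data}  # placeholder converter

opened = open_native(directory, "foo.jpg", converters)
print(hex(opened.file_type), opened.version)  # 0x80000000 1
print(len(directory))                         # 4
```

Note that the directory grows by one entry on the first open, which is exactly the space and cleanup cost raised above.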
I feel like there are various cases where auto-conversion is problematic, as well - for instance where it's lossy or doesn't represent the original file's true nature. (SVG might be considered a 2-D image format, for instance, but it's not a raster format... Or there are various music formats which can be readily converted to waveforms for playback - but this representation is neither efficient for editing nor representative of the file's real nature... Those are the best examples I can come up with at the moment, unfortunately...)
I don't see how any design can prevent vendor lock-in, though. If they can establish their own file type you can't force them to provide conversion methods outside of their own application... And there's no guarantee that their file format will really map to one of your provided native formats anyway - at best a design like this encourages app writers to not attempt lock-in, but it doesn't make lock-in impossible. (I considered the same problem for my shell - I came to the conclusion that I can't control people's behavior. I can't force them to provide a translation of or an API to their data format, and even if I could there's no way I could force them to make it easy for people to work with that data format... It's really only at the point of standardization that you can attempt to open up and stabilize a format - but I can't force my users to only use standardized file types (or apps which are cooperative enough to provide translations to/from common formats), and I don't think they'd benefit from being so compelled, either...)