Compress ASM files?

SpyderTL · Post by **SpyderTL** » Tue Oct 18, 2016 2:52 pm

NunoLava1998 wrote:true, but not everyone has a decoder and i haven't seen anyone that knows machine code

Yes, we all have a decoder. It's called a disassembler.

Roman · Post by **Roman** » Tue Oct 18, 2016 2:56 pm

This is quite amazing, dude! You must be a computer science prodigy!

SpyderTL · Post by **SpyderTL** » Tue Oct 18, 2016 3:03 pm

I'm not really sure that it matters at this point... but you can probably remove this line...

Code:

times 0-($-$$) db 0

iansjack · Post by **iansjack** » Tue Oct 18, 2016 3:07 pm

If the aim is to reduce the size of the file (say for network transmission), a compressor such as bzip is going to be far more effective for a text file.

If you want to protect your source code, encrypt it.

Both will preserve the file and achieve the desired effect efficiently.
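
A minimal Python sketch of the lossless route, using the standard `bz2` and `lzma` modules. The assembly text is made up for illustration and is far more repetitive than a real source file, so the sizes it prints say nothing about real-world ratios; the point is only that the round trip returns every byte, comments included.

```python
import bz2
import lzma

# Hypothetical assembly source: a few lines repeated to stand in for a real file.
source = (
    "mov ax, 0x07C0    ; set up the data segment\n"
    "mov ds, ax\n"
    "call print_string ; print the message\n"
) * 200
raw = source.encode("ascii")

bz = bz2.compress(raw)
xz = lzma.compress(raw)

# Both are lossless: decompressing gives back every byte, comments and all.
assert bz2.decompress(bz) == raw
assert lzma.decompress(xz) == raw

print(len(raw), len(bz), len(xz))
```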

Anyway, this has nothing to do with OS Development. Move it to another forum (preferably the auto-delete one).

Schol-R-LEA · Post by **Schol-R-LEA** » Tue Oct 18, 2016 3:57 pm

NunoLava1998 wrote:Also promotes security and demotes 7 year olds that try to copy the code but then ask their local doctor why their kiddie code isn't working. Basically, pretty useful for source code if you want security to your code.

That you, of all the people here, should say this, implies a level of hypocrisy so extreme, a lack of self-awareness so outrageous, that I would have expected time and space to be rent asunder when you wrote it.

OK, enough hyperbole. Seriously, though, you want to prevent people from copying the code you yourself took from some post on Sewage Overflow - which, from what we've seen so far, appears to be all of the code you have - because of IP rights concerns? That's... ah, different. Right.

As for the issue at hand, consider this: you are taking an indigestible bolus of assembled machine code and converting it into a representation that at least doubles its stored size (it takes two characters to represent an eight-bit byte in hex, so you use two bytes of character data for every single byte of code). So I want you to explain a) how this is in any way compressed relative to the BIN format file, and/or b) how representing already-assembled code as source code gives any advantage over distributing the BIN file itself, given that it is not only larger than the BIN file, but also requires anyone who downloads it to assemble it again.

Or to put it another way: given that the two main advantages of distributing source code are readability and system-specific optimization, why distribute source in a way that negates those advantages while keeping the disadvantage of needing to assemble locally?
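
The size argument can be checked with a few lines of Python. This is only a sketch: the 16 byte values are arbitrary stand-ins for assembled code, and the `db` layout is one assumed way of writing the bytes out as source.

```python
# Hypothetical 16 bytes of assembled machine code (values are arbitrary).
binary = bytes([0xB8, 0xC0, 0x07, 0x8E, 0xD8, 0xBE, 0x00, 0x00,
                0xE8, 0x02, 0x00, 0xEB, 0xFE, 0xB4, 0x0E, 0xC3])

# Bare hex digits: two characters per byte, the lower bound for a hex dump.
bare_hex = binary.hex()

# The same bytes as an assembler "db" line, which adds prefixes and commas.
db_line = "db " + ",".join(f"0x{b:02X}" for b in binary) + "\n"

print(len(binary))    # 16
print(len(bare_hex))  # 32
print(len(db_line))   # 83
```

Even the bare hex form is exactly twice the size of the binary, and the `db` form that an assembler would actually accept is larger still.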

MichaelFarthing · Post by **MichaelFarthing** » Tue Oct 18, 2016 4:57 pm

Why's anyone discussing this nonsense or taking it seriously?

Brendan · Post by **Brendan** » Tue Oct 18, 2016 7:50 pm

Hi,

NunoLava1998 wrote:Disclaimer: This only compresses the .asm file. The binary result is not affected.

The problem with assembly language is that (unlike higher level languages) you need lots of comments to make it maintainable; partly because you're using register names instead of descriptive variable names (e.g. "ebx" instead of "dollarsPerFortnight"), partly because often you're using instructions in "not-so-obvious" ways (e.g. "lea eax,[ebx*5]" when you're not loading an effective address at all but are multiplying by 5 instead), partly because there's lots of intermediate results (e.g. "x = (a*y + b*z + d) / c" gets split into 5 or more instructions with 4 or more intermediate results - "a*y" and "b*z", then "a*y + b*z", then "a*y + b*z + d").

The problem with your "compression" is that it's very lossy - everything that matters (all the comments, plus labels, whitespace/formatting, etc) gets destroyed. This makes it completely useless for any practical purpose, given that code is read far more often than it's written.

Note that, in my experience, storing source code in tokenised form and/or as a pre-parsed "abstract syntax tree" roughly halves the size it consumes without losing any important information, while also speeding up compile/assemble times (as parsing is partially or completely done already and there's less file IO), while also allowing you to add things like indexes to the file to speed things up even more (in both the compiler/assembler and in the IDE/editor), while also making it easier for the IDE to "auto-format to suit the user's preferences" (avoiding stupid "tabs vs. spaces" and "which curly brace style this week" arguments, and avoiding the "~10% on average" of programmers' time wasted diddling with whitespace).

Cheers,

Brendan

Love4Boobies · Post by **Love4Boobies** » Tue Oct 18, 2016 10:14 pm

araxestroy · Post by **araxestroy** » Wed Oct 19, 2016 12:14 am

I for one find LZMA to be more effective.

kzinti · Post by **kzinti** » Wed Oct 19, 2016 1:08 am

If you wanted to compress your code further, you could assemble it to a binary file! Magic!

SpyderTL · Post by **SpyderTL** » Wed Oct 19, 2016 5:26 am

Nice job making this thread interesting.

Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?

abcdef4bfd · Post by **abcdef4bfd** » Wed Oct 19, 2016 5:58 am

I don't know whether this violates the forum rules, so if it does, please delete it.

NunoLava1998, sorry, but you post pure sh*t every time. And again, sorry; I just wanted to say it all.

Brendan · Post by **Brendan** » Wed Oct 19, 2016 6:13 am

Hi,

SpyderTL wrote:Nice job making this thread interesting.

Thanks

SpyderTL wrote:Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?

For a traditional "plain text" compiler the first step (ignoring pre-processor, if any) is to convert the source into tokens. Maybe you have an enumeration of token IDs, and a structure containing "token ID and token data". Something like "x = y * 2 + 3" might be converted into a "symbol" token where the token's data is the symbol table entry for the "x" variable, an assignment operator token (no data), a "symbol" token where the token's data is the symbol table entry for the "y" variable, a "multiply" token (no data), a "numerical literal" token where the data is 2, etc. The end result of this is a linear sequence of tokens, like:

Code:

SYMBOL(x), ASSIGN(),  SYMBOL(y), MULTIPLY(), NUMBER(2), ADD(), NUMBER(3)
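
As a rough illustration of that first step, here is a small Python tokeniser that produces the same sequence. It is a sketch under assumptions: the `TokenId`, `Token` and `tokenise` names are made up, only the three operators in the example are handled, and a symbol token carries the variable's name where the description above uses a symbol table entry.

```python
import re
from enum import Enum, auto
from typing import NamedTuple, Optional, Union

class TokenId(Enum):
    SYMBOL = auto()
    ASSIGN = auto()
    MULTIPLY = auto()
    ADD = auto()
    NUMBER = auto()

class Token(NamedTuple):
    id: TokenId
    data: Optional[Union[str, int]] = None  # None for tokens with no data

OPERATORS = {"=": TokenId.ASSIGN, "*": TokenId.MULTIPLY, "+": TokenId.ADD}
# One alternative per token kind: number, identifier, operator.
PATTERN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|([=*+]))")

def tokenise(line: str) -> list[Token]:
    """Convert one line of source text into a linear sequence of tokens."""
    tokens = []
    line = line.rstrip()
    pos = 0
    while pos < len(line):
        match = PATTERN.match(line, pos)
        if match is None:
            raise ValueError(f"unexpected character at column {pos}")
        number, name, op = match.groups()
        if number is not None:
            tokens.append(Token(TokenId.NUMBER, int(number)))
        elif name is not None:
            # Assumption: the name stands in for a symbol table entry.
            tokens.append(Token(TokenId.SYMBOL, name))
        else:
            tokens.append(Token(OPERATORS[op]))
        pos = match.end()
    return tokens

tokens = tokenise("x = y * 2 + 3")
for token in tokens:
    print(token.id.name, token.data)
```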

The second step for a traditional "plain text" compiler is to convert the linear sequence of tokens into a tree. Typically, you extend the "token ID and token data" structure you were already using with a "pointer to first child" field and a "pointer to next sibling" field. Then you start building a tree using the language's rules as a guide. For the "x = y * 2 + 3" example, you might end up with:

Code:

           SYMBOL(x)
              |
           ASSIGN()
              |
            ADD()
             / \
            /  NUMBER(3)
           /
        MULTIPLY()
        /    \
  SYMBOL(y)  NUMBER(2)

This is your abstract syntax tree. Of course for a normal program (rather than a single line) the abstract syntax tree would be much larger.
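
A Python sketch of the "first child" / "next sibling" node layout, building the same shape as the diagram by hand. The `Node` class, its field names and the `dump` helper are illustrative assumptions, and a real compiler would build the tree from the token stream using the grammar.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Node:
    token_id: str
    data: Optional[Union[str, int]] = None
    first_child: Optional["Node"] = None
    next_sibling: Optional["Node"] = None

    def add_child(self, child: "Node") -> "Node":
        """Append child after any existing children and return it."""
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.next_sibling is not None:
                last = last.next_sibling
            last.next_sibling = child
        return child

def dump(node: Node, depth: int = 0) -> list[str]:
    """Return one indented line per node, in pre-order."""
    label = node.token_id + ("()" if node.data is None else f"({node.data})")
    lines = ["  " * depth + label]
    child = node.first_child
    while child is not None:
        lines.extend(dump(child, depth + 1))
        child = child.next_sibling
    return lines

# Same shape as the diagram: SYMBOL(x) at the top, ASSIGN() below it.
root = Node("SYMBOL", "x")
assign = root.add_child(Node("ASSIGN"))
add = assign.add_child(Node("ADD"))
mul = add.add_child(Node("MULTIPLY"))
mul.add_child(Node("SYMBOL", "y"))
mul.add_child(Node("NUMBER", 2))
add.add_child(Node("NUMBER", 3))

print("\n".join(dump(root)))
```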

Now imagine an IDE designed to work on the abstract syntax tree itself. In this case when the user enters a line of text you'd tokenise the line and build a tiny abstract syntax tree (like the example above), then insert that (at the right place) into the larger tree for the program's source code.

For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.
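
To see why two flags are enough, here is a Python sketch that flattens a tree into pre-order records and rebuilds it from the flags alone. The tuple representation, the flag constants and the choice to mark the root as a "last child" are assumptions for the example, not the actual file format.

```python
HAS_CHILD = 0b10   # the next token is this token's first child
LAST_CHILD = 0b01  # this token is its parent's last child

def serialise(tree, last_child=True):
    """Yield (label, flags) records in pre-order; tree is (label, children)."""
    label, children = tree
    flags = (HAS_CHILD if children else 0) | (LAST_CHILD if last_child else 0)
    yield (label, flags)
    for index, child in enumerate(children):
        yield from serialise(child, index == len(children) - 1)

def deserialise(records):
    """Rebuild the (label, children) tree from the flat record stream."""
    root = None
    open_parents = []  # (node, node_is_last_child) for nodes awaiting children
    for label, flags in records:
        node = (label, [])
        if open_parents:
            open_parents[-1][0][1].append(node)
        else:
            root = node
        last_child = bool(flags & LAST_CHILD)
        if flags & HAS_CHILD:
            open_parents.append((node, last_child))
        else:
            # A leaf that is a last child closes its parent; if that parent
            # was itself a last child, its own parent is closed too, and so on.
            while last_child and open_parents:
                _, last_child = open_parents.pop()
    return root

tree = ("SYMBOL(x)", [("ASSIGN()", [("ADD()", [
    ("MULTIPLY()", [("SYMBOL(y)", []), ("NUMBER(2)", [])]),
    ("NUMBER(3)", []),
])])])

records = list(serialise(tree))
for label, flags in records:
    print(f"{flags:02b} {label}")
assert deserialise(records) == tree
```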

What I did was have a variable-length encoding scheme for token IDs (where more commonly used tokens took up fewer bits; and where tokens in a certain range had "data length" and "data" after them); where this encoding scheme included the 2 extra bits needed for the "has child" and "last child" flags. Of course I also had a symbol table and a few other things (type table, section table) in the file (partly for speed/indexing, but partly so I could use "symbol table entry number" instead of the symbol's name string as the data for some tokens).

I also had tokens for "full line comment" and "end of line comment"; and also "compressed full line comment" and "compressed end of line comment". The compressed comments were a cheap hack - mostly, subtract 32 from each character and pack the result into 5 bits, so that (e.g.) 20 characters get packed into 13 bytes (and if a comment string couldn't be compressed like this then just use the "uncompressed" token type).
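
The arithmetic (20 characters × 5 bits = 100 bits, rounded up to 13 bytes) can be sketched in Python. This takes the description literally, which is an assumption: with an offset of 32, only character codes 32 to 63 (space, digits, most punctuation) fit in 5 bits, so anything else takes the "uncompressed" fallback. The function names are made up, and the character count is assumed to be stored separately (e.g. in the token's "data length").

```python
from typing import Optional

def pack5(text: str, offset: int = 32) -> Optional[bytes]:
    """Pack each character into 5 bits, or return None if one does not fit."""
    bits = 0
    for ch in text:
        code = ord(ch) - offset
        if not 0 <= code < 32:
            return None  # caller would emit the uncompressed comment token
        bits = (bits << 5) | code
    nbits = 5 * len(text)
    nbytes = (nbits + 7) // 8
    # Left-align the bit string so the padding zeros sit at the end.
    bits <<= nbytes * 8 - nbits
    return bits.to_bytes(nbytes, "big")

def unpack5(data: bytes, length: int, offset: int = 32) -> str:
    """Recover `length` characters from data produced by pack5."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    chars = []
    for i in range(length):
        shift = total - 5 * (i + 1)
        chars.append(chr(((bits >> shift) & 0x1F) + offset))
    return "".join(chars)

comment = "(1+2)*3 = 9; 4/2 = 2"   # 20 characters, all in the range 32..63
packed = pack5(comment)
print(len(comment), len(packed))   # 20 13
assert unpack5(packed, len(comment)) == comment
assert pack5("Hello") is None      # letters do not fit under this reading
```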

This, plus a little minor trickery, is where I get the "tokens or AST halve file sizes" figure (from actual experimental results).

Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).

Cheers,

Brendan

Schol-R-LEA · Post by **Schol-R-LEA** » Wed Oct 19, 2016 9:43 am

@Brendan: have you read Modern Compiler Design by Grune et al.? While I haven't finished it myself, from what I have seen it does an admirable job of discussing compression techniques for symbol tables and ASTs. I have both the hardbound print version and the e-book, and it is well worth the price; however, you could probably also find it in most university libraries, and the publisher, Springer, sells e-books of specific chapters for particular topics separately as well.

SpyderTL · Post by **SpyderTL** » Wed Oct 19, 2016 9:59 am

Brendan wrote:For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.

Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).

Hey, you know what file format handles Trees really well, and has great IDE support?

XML.

Actually, the original reason that I decided to go down the XML route was because a co-worker and I were discussing how cool it would be if, instead of storing text source files in source control, you could store pure "data" instead -- the idea being that the data could be converted back to text using your own personal formatting preferences and naming conventions, and comparing one changeset version to another would not be cluttered up by things like changing tabs to spaces and stuff like that.

But I just picked XML because the only other options I could think of would be CSV, JSON or SQL Server tables. Out of those options, at the time, XML seemed like the logical choice. But binary encoded trees would also be an option.
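
As a rough idea of what "code as data" in XML could look like, here is a Python sketch using the standard `xml.etree.ElementTree` module. The element and attribute names are invented for the example (no schema is given here), the tree is rooted at the assignment for simplicity, and the renderer ignores operator precedence, so it is only correct for trees like this one that need no parentheses.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names for "x = y * 2 + 3".
assign = ET.Element("assign")
ET.SubElement(assign, "symbol", name="x")
add = ET.SubElement(assign, "add")
mul = ET.SubElement(add, "multiply")
ET.SubElement(mul, "symbol", name="y")
ET.SubElement(mul, "number", value="2")
ET.SubElement(add, "number", value="3")

xml_text = ET.tostring(assign, encoding="unicode")
print(xml_text)

def to_text(node: ET.Element, spaced: bool = True) -> str:
    """Render the tree back to source text; `spaced` is a formatting preference."""
    if node.tag == "symbol":
        return node.get("name")
    if node.tag == "number":
        return node.get("value")
    op = {"assign": "=", "add": "+", "multiply": "*"}[node.tag]
    sep = " " if spaced else ""
    left, right = (to_text(child, spaced) for child in node)
    return f"{left}{sep}{op}{sep}{right}"

# The stored data is the same; only the rendering preference differs.
stored = ET.fromstring(xml_text)
print(to_text(stored))                # x = y * 2 + 3
print(to_text(stored, spaced=False))  # x=y*2+3
```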
