Compress ASM files?

SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Compress ASM files?

Post by SpyderTL »

NunoLava1998 wrote:true, but not everyone has a decoder and i haven't seen anyone that knows machine code
Yes, we all have a decoder. It's called a disassembler.
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Roman
Member
Posts: 568
Joined: Thu Mar 27, 2014 3:57 am
Location: Moscow, Russia

Re: Compress ASM files?

Post by Roman »

This is quite amazing, dude! You must be a computer science prodigy!
"If you don't fail at least 90 percent of the time, you're not aiming high enough."
- Alan Kay
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Compress ASM files?

Post by SpyderTL »

I'm not really sure that it matters at this point... but you can probably remove this line...

Code:

times 0-($-$$) db 0
iansjack
Member
Posts: 4706
Joined: Sat Mar 31, 2012 3:07 am
Location: Chichester, UK

Re: Compress ASM files?

Post by iansjack »

If the aim is to reduce the size of the file (say for network transmission), a compressor such as bzip is going to be far more effective for a text file.
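
A minimal sketch of that point, using Python's standard `bz2` module. The sample text is a hypothetical stand-in for an assembly source file, not anything from this thread, and real files will compress less dramatically than this deliberately repetitive one.

```python
import bz2

# Hypothetical stand-in for a small assembly source file (assumption:
# repetitive text, which is what makes source code compress well).
source = ("mov eax, ebx    ; copy the loop counter\n"
          "add eax, 1      ; advance to the next entry\n") * 200
raw = source.encode("ascii")
packed = bz2.compress(raw)

print(len(raw), "bytes as plain text")
print(len(packed), "bytes after bzip2")

# Lossless: decompressing gives back exactly the original text,
# comments and labels included.
restored = bz2.decompress(packed)
print(restored == raw)
```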

If you want to protect your source code, encrypt it.

Both will preserve the file and achieve the desired effect efficiently.

Anyway, this has nothing to do with OS Development. Move it to another forum (preferably the auto-delete one).
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: Compress ASM files?

Post by Schol-R-LEA »

NunoLava1998 wrote:Also promotes security and demotes 7 year olds that try to copy the code but then ask their local doctor why their kiddie code isn't working. Basically, pretty useful for source code if you want security to your code.
That you, of all the people here, should say this, implies a level of hypocrisy so extreme, a lack of self-awareness so outrageous, that I would have expected time and space to be rent asunder when you wrote it.

OK, enough hyperbole. Seriously, though, you want to prevent people from copying the code you yourself took from some post on Sewage Overflow - which, from what we've seen so far, appears to be all of the code you have - because of IP rights concerns? That's... ah, different. Right.

As for the issue at hand, consider this: given that you are simply taking an indigestible bolus of assembled machine code and converting it into a representation that will at least double its stored size (because it takes two characters to represent an eight-bit byte in hex, meaning that you use two bytes of character data for every single byte of the code), I want you to explain a) how this is in any way compressed relative to the BIN format file, and/or b) how representing code that is already assembled as source code gives any advantage over distributing the BIN file itself, given that it is not only larger than the BIN file, but also requires anyone who downloads it to assemble it again.
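
The size arithmetic can be checked with a short sketch. The byte values below are arbitrary placeholders (an assumption for illustration, not the code under discussion); the point is only the two-characters-per-byte ratio.

```python
# Hypothetical 16 bytes of assembled machine code (values are arbitrary).
binary = bytes([0xB8, 0x01, 0x00, 0x00, 0x00, 0xBB, 0x02, 0x00,
                0x00, 0x00, 0x01, 0xD8, 0xC3, 0x90, 0x90, 0x90])

# Plain hex text: two characters for every byte of the binary.
hex_text = binary.hex()

print(len(binary), "bytes in the BIN file")
print(len(hex_text), "bytes as bare hex digits")

# Converting back recovers the same bytes, so nothing was gained
# except the extra step.
print(bytes.fromhex(hex_text) == binary)
```

Any separators, `0x` prefixes or `db` directives wrapped around the digits only push the text further past the 2x mark.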

Or to put it another way: given that the two main advantages of distributing source code are readability and system-specific optimization, why distribute source in a way that negates those advantages while keeping the disadvantage of needing to assemble locally?
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
MichaelFarthing
Member
Posts: 167
Joined: Thu Mar 10, 2016 7:35 am
Location: Lancaster, England, Disunited Kingdom

Re: Compress ASM files?

Post by MichaelFarthing »

Why's anyone discussing this nonsense or taking it seriously? :roll:
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Compress ASM files?

Post by Brendan »

Hi,
NunoLava1998 wrote:Disclaimer: This only compresses the .asm file. The binary result is not affected.
The problem with assembly language is that (unlike higher level languages) you need lots of comments to make it maintainable; partly because you're using register names instead of descriptive variable names (e.g. "ebx" instead of "dollarsPerFortnight"), partly because often you're using instructions in "not-so-obvious" ways (e.g. "lea eax,[ebx*5]" when you're not loading an effective address at all but are multiplying by 5 instead), partly because there are lots of intermediate results (e.g. "x = (a*y + b*z + d) / c" gets split into 5 or more instructions with 4 or more intermediate results - "a*y" and "b*z", then "a*y + b*z", then "a*y + b*z + d").

The problem with your "compression" is that it's very lossy - everything that matters (all the comments, plus labels, whitespace/formatting, etc) gets destroyed. This makes it completely useless for any practical purpose, given that code is read far more often than it's written.

Note that, in my experience, storing source code in tokenised form and/or as a pre-parsed "abstract syntax tree" roughly halves the size it consumes without losing any important information, while also speeding up compile/assemble times (as parsing is partially or completely done already and there's less file IO), while also allowing you to add things like indexes to the file to speed things up even more (in both the compiler/assembler and in the IDE/editor), while also making it easier for an IDE to "auto-format to suit the user's preferences" (and avoiding stupid "tabs vs. spaces" and "which curly brace style this week" style arguments, and avoiding the "~10% on average" of programmers' time wasted diddling with whitespace).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
Love4Boobies
Member
Posts: 2111
Joined: Fri Mar 07, 2008 5:36 pm
Location: Bucharest, Romania

Re: Compress ASM files?

Post by Love4Boobies »

wat
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
Kazinsal
Member
Posts: 559
Joined: Wed Jul 13, 2011 7:38 pm
Libera.chat IRC: Kazinsal
Location: Vancouver

Re: Compress ASM files?

Post by Kazinsal »

I for one find LZMA to be more effective.
kzinti
Member
Posts: 898
Joined: Mon Feb 02, 2015 7:11 pm

Re: Compress ASM files?

Post by kzinti »

If you wanted to compress your code further, you could assemble it to a binary file! Magic!
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Compress ASM files?

Post by SpyderTL »

Brendan wrote:Note that, in my experience, storing source code in tokenised form and/or as a pre-parsed "abstract syntax tree" roughly halves the size it consumes without losing any important information [...]
Nice job making this thread interesting.

Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?
osdever
Member
Posts: 492
Joined: Fri Apr 03, 2015 9:41 am

Re: Compress ASM files?

Post by osdever »

I don't know whether this violates the forum rules, so if it does, please delete this.

NunoLava1998, sorry, but you're doing pure sh*t every time. And again, sorry, I just wanted to say everything.
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Compress ASM files?

Post by Brendan »

Hi,
SpyderTL wrote:
Nice job making this thread interesting.
Thanks :)
SpyderTL wrote:Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?
For a traditional "plain text" compiler the first step (ignoring the pre-processor, if any) is to convert the source into tokens. Maybe you have an enumeration of token IDs, and a structure containing "token ID and token data". Something like "x = y * 2 + 3" might be converted into a "symbol" token where the token's data is the symbol table entry for the "x" variable, an assignment operator token (no data), a "symbol" token where the token's data is the symbol table entry for the "y" variable, a "multiply" token (no data), a "numerical literal" token where the data is 2, etc. The end result of this is a linear sequence of tokens, like:

Code:

SYMBOL(x), ASSIGN(), SYMBOL(y), MULTIPLY(), NUMBER(2), ADD(), NUMBER(3)
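
A minimal sketch of such a tokeniser, for this one-line example only. The token names follow the listing above; the regular expressions and the `tokenise` function are assumptions for illustration, and the token data here is the symbol's name rather than a symbol table entry.

```python
import re

# Hypothetical token definitions, named to match the listing above.
TOKEN_SPEC = [
    ("NUMBER",   r"\d+"),
    ("SYMBOL",   r"[A-Za-z_]\w*"),
    ("ASSIGN",   r"="),
    ("MULTIPLY", r"\*"),
    ("ADD",      r"\+"),
    ("SKIP",     r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenise(text):
    """Return a list of (token_id, data) pairs; data is None for operators."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind = match.lastgroup
        if kind == "NUMBER":
            tokens.append((kind, int(match.group())))
        elif kind == "SYMBOL":
            tokens.append((kind, match.group()))
        elif kind != "SKIP":          # whitespace produces no token
            tokens.append((kind, None))
        pos = match.end()
    return tokens

for kind, data in tokenise("x = y * 2 + 3"):
    print(kind, data)
```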
The second step for a traditional "plain text" compiler is to convert the linear sequence of tokens into a tree. Mostly, you take the structure containing "token ID and token data" that you were using and add a "pointer to first child" field and a "pointer to next sibling" field. Then you start building a tree using the language's rules as a guide. For the "x = y * 2 + 3" example, you might end up with:

Code:

           SYMBOL(x)
              |
           ASSIGN()
              |
            ADD()
             / \
            /  NUMBER(3)
           /
        MULTIPLY()
        /    \
  SYMBOL(y)  NUMBER(2)
This is your abstract syntax tree. Of course for a normal program (rather than a single line) the abstract syntax tree would be much larger.
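
A rough sketch of that tree-building step, assuming a tiny grammar (one assignment, with "*" binding tighter than "+") and mirroring the tree shape drawn above. The `Node` class, `parse` and `dump` are hypothetical names; only the "first child" and "next sibling" fields come from the description.

```python
class Node:
    """AST node using the first-child / next-sibling links described above."""
    def __init__(self, kind, data=None):
        self.kind = kind
        self.data = data
        self.first_child = None
        self.next_sibling = None

    def add_child(self, child):
        if self.first_child is None:
            self.first_child = child
        else:
            last = self.first_child
            while last.next_sibling is not None:
                last = last.next_sibling
            last.next_sibling = child
        return self

def parse(tokens):
    """Build a tree for 'SYMBOL ASSIGN expression' from (kind, data) pairs."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def operand():
        nonlocal pos
        if peek() not in ("SYMBOL", "NUMBER"):
            raise SyntaxError(f"expected operand at token {pos}")
        kind, data = tokens[pos]
        pos += 1
        return Node(kind, data)

    def binary(operator, next_level):
        # Left-associative chain: a op b op c becomes ((a op b) op c).
        nonlocal pos
        left = next_level()
        while peek() == operator:
            pos += 1
            left = Node(operator).add_child(left).add_child(next_level())
        return left

    def term():
        return binary("MULTIPLY", operand)

    def expression():
        return binary("ADD", term)

    target = operand()
    if target.kind != "SYMBOL" or peek() != "ASSIGN":
        raise SyntaxError("expected SYMBOL followed by ASSIGN")
    pos += 1
    target.add_child(Node("ASSIGN").add_child(expression()))
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at {pos}")
    return target

def dump(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    label = node.kind if node.data is None else f"{node.kind}({node.data})"
    lines = ["  " * depth + label]
    child = node.first_child
    while child is not None:
        lines.extend(dump(child, depth + 1))
        child = child.next_sibling
    return lines

TOKENS = [("SYMBOL", "x"), ("ASSIGN", None), ("SYMBOL", "y"),
          ("MULTIPLY", None), ("NUMBER", 2), ("ADD", None), ("NUMBER", 3)]
print("\n".join(dump(parse(TOKENS))))
```

Many compilers would instead put the assignment at the root with the target and the expression as its two children; the shape here simply follows the drawing.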

Now imagine an IDE designed to work on the abstract syntax tree itself. In this case when the user enters a line of text you'd tokenise the line and build a tiny abstract syntax tree (like the example above), then insert that (at the right place) into the larger tree for the program's source code.

For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.
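
A minimal sketch of the two-flag idea, assuming a pre-order walk and using plain `(label, children)` tuples as a hypothetical in-memory tree. It keeps the flags as booleans rather than packing them into bits, and is not the file format described here.

```python
def serialise(node, is_last=True, out=None):
    """Flatten a (label, children) tree into (label, has_child, last_child)."""
    if out is None:
        out = []
    label, children = node
    out.append((label, bool(children), is_last))
    for index, child in enumerate(children):
        serialise(child, index == len(children) - 1, out)
    return out

def deserialise(records):
    """Rebuild the tree from the flat records using only the two flags."""
    root = None
    stack = []                      # open parents: (node, parent_was_last)
    for label, has_child, is_last in records:
        node = (label, [])
        if stack:
            stack[-1][0][1].append(node)
        else:
            root = node
        if has_child:
            stack.append((node, is_last))   # the next record is its child
        else:
            # A leaf that is the last child closes its parent, and keeps
            # closing ancestors for as long as they were last children too.
            last = is_last
            while last and stack:
                _, last = stack.pop()
    return root

tree = ("SYMBOL(x)", [("ASSIGN", [("ADD", [
    ("MULTIPLY", [("SYMBOL(y)", []), ("NUMBER(2)", [])]),
    ("NUMBER(3)", []),
])])])

records = serialise(tree)
for label, has_child, is_last in records:
    print(int(has_child), int(is_last), label)
print(deserialise(records) == tree)
```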

What I did was have a variable length encoding scheme for token IDs (where more commonly used tokens took up fewer bits, and where tokens in a certain range had "data length" and "data" after them), where this encoding scheme included the 2 extra bits needed for the "has child" and "last child" flags. Of course I also had a symbol table and a few other things (type table, section table) in the file (partly for speed/indexing, but partly so I could use "symbol table entry number" instead of the symbol's name string as the data for some tokens).

I also had tokens for "full line comment" and "end of line comment"; and also "compressed full line comment" and "compressed end of line comment". The compressed comments were a cheap hack - mostly, subtract 32 from each character and pack the result into 5 bits, so that (e.g.) 20 characters get packed into 13 bytes (and if a comment string couldn't be compressed like this then just use the "uncompressed" token type).
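
A sketch of the packing arithmetic, reading "subtract 32 and pack into 5 bits" literally. On that reading only character codes 32 to 63 (space, digits and most punctuation) fit, so anything else takes the uncompressed fallback; the real mapping may have differed. The function names are hypothetical, and the character count is assumed to be stored separately (for example in the token's "data length").

```python
def pack_comment(text):
    """Pack text at 5 bits per character, or return None if it cannot fit."""
    codes = [ord(ch) - 32 for ch in text]
    if any(code < 0 or code > 31 for code in codes):
        return None                 # caller falls back to the uncompressed token
    bits = 0
    for code in codes:
        bits = (bits << 5) | code
    nbits = 5 * len(codes)
    nbytes = (nbits + 7) // 8       # round up to whole bytes
    bits <<= nbytes * 8 - nbits     # pad the final byte with zero bits
    return bits.to_bytes(nbytes, "big")

def unpack_comment(data, length):
    """Reverse pack_comment; length is the original character count."""
    bits = int.from_bytes(data, "big")
    bits >>= len(data) * 8 - 5 * length      # drop the padding bits
    chars = []
    for index in range(length):
        shift = 5 * (length - 1 - index)
        chars.append(chr(((bits >> shift) & 31) + 32))
    return "".join(chars)

sample = "1 + 2 * (3 - 4) / 56"     # 20 characters, all in the 32..63 range
packed = pack_comment(sample)
print(len(sample), "characters ->", len(packed), "bytes")
print(unpack_comment(packed, len(sample)) == sample)
print(pack_comment("mov eax"))      # letters do not fit, so this is None
```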

This, plus a little minor trickery, is where I get the "tokens or AST halve file sizes" figure (from actual experimental results).

Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).


Cheers,

Brendan
Schol-R-LEA
Member
Posts: 1925
Joined: Fri Oct 27, 2006 9:42 am
Location: Athens, GA, USA

Re: Compress ASM files?

Post by Schol-R-LEA »

@Brendan: have you read Modern Compiler Design by Grune et al.? While I haven't finished it myself, from what I have seen it does an admirable job of discussing compression techniques for symbol tables and ASTs. I have both the hardbound print version and the e-book, and it is well worth the price; however, you could probably also find it in most university libraries, and the publisher, Springer, sells e-books of specific chapters for particular topics separately as well.
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Compress ASM files?

Post by SpyderTL »

Brendan wrote:For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.

Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).
Hey, you know what file format handles Trees really well, and has great IDE support?

XML.

:)


Actually, the original reason that I decided to go down the XML route was because a co-worker and I were discussing how cool it would be if, instead of storing text source files in source control, you could store pure "data" -- the idea being that the data could be converted back to text using your own personal formatting preferences and naming conventions, and comparing one changeset version to another would not be cluttered up by things like changing tabs to spaces and stuff like that.

But I just picked XML because the only other options I could think of would be CSV, JSON or SQL Server tables. Out of those options, at the time, XML seemed like the logical choice. But binary encoded trees would also be an option.
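
As a rough illustration of the XML option, here is the "x = y * 2 + 3" statement from earlier in the thread built with Python's standard `xml.etree.ElementTree`. The element and attribute names are made up for this sketch and are not taken from any real project's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the nesting carries the tree structure,
# so no formatting or whitespace choices are stored at all.
statement = ET.Element("assign")
ET.SubElement(statement, "symbol", name="x")
add = ET.SubElement(statement, "add")
multiply = ET.SubElement(add, "multiply")
ET.SubElement(multiply, "symbol", name="y")
ET.SubElement(multiply, "number", value="2")
ET.SubElement(add, "number", value="3")

xml_text = ET.tostring(statement, encoding="unicode")
print(xml_text)

# Reading it back gives the same structure, ready to be rendered as text
# in whatever layout the reader prefers.
parsed = ET.fromstring(xml_text)
print([child.tag for child in parsed])
```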