Re: Compress ASM files?
Posted: Tue Oct 18, 2016 2:52 pm
NunoLava1998 wrote: true, but not everyone has a decoder and i haven't seen anyone that knows machine code
Yes, we all have a decoder. It's called a disassembler.
Code:
times 0-($-$$) db 0
NunoLava1998 wrote: Also promotes security and demotes 7 year olds that try to copy the code but then ask their local doctor why their kiddie code isn't working. Basically, pretty useful for source code if you want security to your code.
That you, of all the people here, should say this, implies a level of hypocrisy so extreme, a lack of self-awareness so outrageous, that I would have expected time and space to be rent asunder when you wrote it.
Hi,
NunoLava1998 wrote: Disclaimer: This only compresses the .asm file. The binary result is not affected.
The problem with assembly language is that (unlike higher level languages) you need lots of comments to make it maintainable; partly because you're using register names instead of descriptive variable names (e.g. "ebx" instead of "dollarsPerFortnight"), partly because often you're using instructions in "not-so-obvious" ways (e.g. "lea eax,[ebx*5]" when you're not loading an effective address at all but are multiplying by 5 instead), partly because there are lots of intermediate results (e.g. "x = (a*y + b*z + d) / c" gets split into 5 or more instructions with 4 or more intermediate results - "a*y" and "b*z", then "a*y + b*z", then "a*y + b*z + d").
The problem with your "compression" is that it's very lossy: everything that matters (all the comments, plus labels, whitespace/formatting, etc.) gets destroyed. This makes it completely useless for any practical purpose, given that code is read far more often than it's written.
Note that, in my experience, storing source code in tokenised form and/or as a pre-parsed "abstract syntax tree" roughly halves the size it consumes without losing any important information, while also speeding up compile/assemble times (as parsing is partially or completely done already and there's less file IO), while also allowing you to add things like indexes to the file to speed things up (in both the compiler/assembler and in the IDE/editor) even more, while also making it easier for an IDE to "auto-format to suit the user's preferences" (and avoiding stupid "tabs vs. spaces" and "which curly brace style this week" style arguments, and avoiding the "~10% on average" of programmers' time wasted diddling with whitespace).
Cheers,
Brendan
SpyderTL wrote: Nice job making this thread interesting.
Thanks
SpyderTL wrote: Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?
For a traditional "plain text" compiler the first step (ignoring the pre-processor, if any) is to convert the source into tokens. Maybe you have an enumeration of token IDs, and a structure containing "token ID and token data". Something like "x = y * 2 + 3" might be converted into a "symbol" token where the token's data is the symbol table entry for the "x" variable, an assignment operator token (no data), a "symbol" token where the token's data is the symbol table entry for the "y" variable, a "multiply" token (no data), a "numerical literal" token where the data is 2, etc. The end result of this is a linear sequence of tokens, like:
Code:
SYMBOL(x), ASSIGN(), SYMBOL(y), MULTIPLY(), NUMBER(2), ADD(), NUMBER(3)
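As a rough illustration of that first step, here is a minimal Python sketch of a tokeniser for this one expression form. The `Token` structure, the `OPERATORS` table and the `tokenise` function are names made up for this example, and the bare variable name stands in for a real symbol table entry.

```python
import re
from typing import List, NamedTuple, Optional, Union


class Token(NamedTuple):
    kind: str                                  # token ID, e.g. "SYMBOL" or "ADD"
    data: Optional[Union[str, int]] = None     # token data, if the token has any

    def __str__(self) -> str:
        return f"{self.kind}({'' if self.data is None else self.data})"


# Hypothetical token IDs for the three operators used in the example.
OPERATORS = {"=": "ASSIGN", "*": "MULTIPLY", "+": "ADD"}

# Skip leading whitespace, then match a number, a name or an operator.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<number>\d+)|(?P<symbol>[A-Za-z_]\w*)|(?P<op>[=*+]))"
)


def tokenise(source: str) -> List[Token]:
    """Convert plain text into a linear sequence of tokens."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.group("number"):
            tokens.append(Token("NUMBER", int(m.group("number"))))
        elif m.group("symbol"):
            # Assumption: a real compiler would store a symbol table
            # entry here; the name itself stands in for it.
            tokens.append(Token("SYMBOL", m.group("symbol")))
        else:
            tokens.append(Token(OPERATORS[m.group("op")]))
        pos = m.end()
    return tokens


print(", ".join(str(t) for t in tokenise("x = y * 2 + 3")))
```

Run on "x = y * 2 + 3" this prints the same token sequence as the listing above.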
Code:
      SYMBOL(x)
          |
      ASSIGN()
          |
        ADD()
        /   \
       /     NUMBER(3)
      /
  MULTIPLY()
    /    \
SYMBOL(y)  NUMBER(2)
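One way to get from the linear token sequence to a tree of that shape is a small precedence-climbing parser. The Python sketch below is only an illustration under my own assumptions: `Node`, `PRECEDENCE`, `parse_statement` and `parse_expression` are hypothetical names, tokens are plain `(kind, data)` pairs, and only "symbol = expression" with add and multiply is handled.

```python
class Node:
    """One tree node: a token plus its child nodes."""

    def __init__(self, kind, data=None, children=None):
        self.kind = kind
        self.data = data
        self.children = children or []

    def __repr__(self):
        label = f"{self.kind}({'' if self.data is None else self.data})"
        if not self.children:
            return label
        return f"{label}[{', '.join(map(repr, self.children))}]"


# Assumed binding strengths: multiply binds tighter than add.
PRECEDENCE = {"ADD": 1, "MULTIPLY": 2}


def parse_expression(tokens, pos, min_prec):
    """Parse operands and operators starting at tokens[pos].

    Returns (subtree, position of the first unconsumed token).
    """
    if pos >= len(tokens) or tokens[pos][0] not in ("SYMBOL", "NUMBER"):
        raise SyntaxError(f"operand expected at token {pos}")
    left = Node(*tokens[pos])
    pos += 1
    # Keep folding operators that bind at least as tightly as min_prec.
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos][0], 0) >= min_prec:
        op = tokens[pos][0]
        right, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
        left = Node(op, None, [left, right])
    return left, pos


def parse_statement(tokens):
    """Parse 'SYMBOL ASSIGN expression' into the tree shape drawn above."""
    if len(tokens) < 3 or tokens[0][0] != "SYMBOL" or tokens[1][0] != "ASSIGN":
        raise SyntaxError("expected 'symbol = expression'")
    expr, pos = parse_expression(tokens, 2, 1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {pos}")
    return Node("SYMBOL", tokens[0][1], [Node("ASSIGN", None, [expr])])


tokens = [("SYMBOL", "x"), ("ASSIGN", None), ("SYMBOL", "y"),
          ("MULTIPLY", None), ("NUMBER", 2), ("ADD", None), ("NUMBER", 3)]
print(parse_statement(tokens))
```

The printed form nests children in square brackets, so MULTIPLY() ends up below ADD() because it binds tighter, matching the drawing.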
Brendan wrote: For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.
Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).
Hey, you know what file format handles trees really well, and has great IDE support?
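For what it's worth, the two-flag serialisation Brendan describes can be sketched in a few lines of Python. This is my own reading of the idea, not his file format: nodes are written in pre-order, each with a "next token is my child" flag and an "I am my parent's last child" flag; trees are plain `(label, children)` tuples; and the root is treated as a "last child" by assumption. `serialise` and `deserialise` are hypothetical names.

```python
def serialise(node, is_last=True, out=None):
    """Flatten a (label, children) tree into pre-order records.

    Each record is (label, has_child, is_last): the two flags are the
    2 bits per node, the label stands in for the token itself.
    """
    if out is None:
        out = []
    label, children = node
    out.append((label, bool(children), is_last))
    for i, child in enumerate(children):
        serialise(child, i == len(children) - 1, out)
    return out


def deserialise(records):
    """Rebuild the tree from the records using only the two flags."""
    root = None
    stack = []  # (node, is_last) for parents still waiting for children
    for label, has_child, is_last in records:
        node = (label, [])
        if stack:
            parent, _ = stack[-1]
            parent[1].append(node)
        else:
            root = node
        if has_child:
            stack.append((node, is_last))
        else:
            # A leaf that is a last child completes its parent; if that
            # parent was itself a last child, its parent is complete too.
            while is_last and stack:
                _, is_last = stack.pop()
    return root


tree = ("SYMBOL(x)", [
    ("ASSIGN()", [
        ("ADD()", [
            ("MULTIPLY()", [("SYMBOL(y)", []), ("NUMBER(2)", [])]),
            ("NUMBER(3)", []),
        ]),
    ]),
])

records = serialise(tree)
flags = " ".join(f"{int(has_child)}{int(is_last)}" for _, has_child, is_last in records)
print(flags)                          # 11 11 11 10 00 01 01
print(deserialise(records) == tree)   # True
```

So the seven-node example tree needs 14 bits of structure on top of the tokens themselves, and the round trip gives back the same tree.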