NunoLava1998 wrote:true, but not everyone has a decoder and i haven't seen anyone that knows machine code
Yes, we all have a decoder. It's called a disassembler.
Compress ASM files?
Re: Compress ASM files?
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Re: Compress ASM files?
This is quite amazing, dude! You must be a computer science prodigy!
"If you don't fail at least 90 percent of the time, you're not aiming high enough."
- Alan Kay
Re: Compress ASM files?
I'm not really sure that it matters at this point... but you can probably remove this line...
Code: Select all
times 0-($-$$) db 0
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Re: Compress ASM files?
If the aim is to reduce the size of the file (say for network transmission), a compressor such as bzip is going to be far more effective for a text file.
If you want to protect your source code, encrypt it.
Both will preserve the file and achieve the desired effect efficiently.
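To put a rough number on the compression point, here is a minimal sketch using Python's standard `bz2` module (bzip2, which I'm assuming is close enough to the "bzip" mentioned above). The assembly text is made up for the demonstration; real source is less repetitive and will compress less dramatically.

```python
import bz2

# Made-up, highly repetitive assembly text standing in for a real .asm file.
source = (
    "mov eax, [ebx]   ; load value\n"
    "add eax, 1       ; increment\n"
    "mov [ebx], eax   ; store it back\n"
) * 200

raw = source.encode("ascii")
packed = bz2.compress(raw)
restored = bz2.decompress(packed)

# Lossless: comments, labels and whitespace all survive the round trip.
print(len(raw), "bytes of text ->", len(packed), "bytes compressed")
print("round trip intact:", restored == raw)
```

Unlike the scheme under discussion, nothing is lost: decompressing gives back the exact source, comments included.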
Anyway, this has nothing to do with OS Development. Move it to another forum (preferably the auto-delete one).
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: Compress ASM files?
NunoLava1998 wrote:Also promotes security and demotes 7 year olds that try to copy the code but then ask their local doctor why their kiddie code isn't working. Basically, pretty useful for source code if you want security to your code.
That you, of all the people here, should say this, implies a level of hypocrisy so extreme, a lack of self-awareness so outrageous, that I would have expected time and space to be rent asunder when you wrote it.
OK, enough hyperbole. Seriously, though, you want to prevent people from copying the code you yourself took from some post on Sewage Overflow - which, from what we've seen so far, appears to be all of the code you have - because of IP rights concerns? That's... ah, different. Right.
As for the issue at hand, consider this: given that you are simply taking an indigestible bolus of assembled machine code and converting it into a representation that will at least double the size of how it is stored (because it takes two characters to represent an eight-bit byte in hex, meaning that you use two bytes of character data for every single byte of the code), I want you to explain a) how this is in any way compressed relative to the BIN format file, and/or b) how representing code that is already assembled as source code gives any advantage over distributing the BIN file itself, given that it is not only larger than the BIN file, but also requires anyone who downloads it to assemble it again.
Or to put it another way: given that the two main advantages of distributing source code are readability and system-specific optimization, why distribute source in a way that negates those advantages while keeping the disadvantage of needing to assemble locally?
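The size argument is easy to check. The sketch below uses a made-up 512-byte blob as a stand-in for an assembled BIN file. The exact layout of the hex listing in the original post isn't shown in this thread, so the `db 0x..` layout here is an assumption.

```python
# Hypothetical stand-in for an assembled 512-byte BIN file.
binary = bytes(range(256)) * 2

# Bare hex digits: two characters of text for every byte of code.
hex_text = binary.hex()

# The same bytes as assembler "db" lines, 16 bytes per line (assumed layout).
db_source = "\n".join(
    "db " + ", ".join(f"0x{b:02X}" for b in binary[i:i + 16])
    for i in range(0, len(binary), 16)
)

print("binary:   ", len(binary), "bytes")
print("bare hex: ", len(hex_text), "bytes")
print("db source:", len(db_source), "bytes")
```

Bare hex is exactly twice the size of the binary, and anything an assembler would accept (prefixes, separators, directives) only adds to that.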
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
- MichaelFarthing
- Member
- Posts: 167
- Joined: Thu Mar 10, 2016 7:35 am
- Location: Lancaster, England, Disunited Kingdom
Re: Compress ASM files?
Why's anyone discussing this nonsense or taking it seriously?
Re: Compress ASM files?
Hi,
NunoLava1998 wrote:Disclaimer: This only compresses the .asm file. The binary result is not affected.
The problem with assembly language is that (unlike higher level languages) you need lots of comments to make it maintainable; partly because you're using register names instead of descriptive variable names (e.g. "ebx" instead of "dollarsPerFortnight"), partly because often you're using instructions in "not-so-obvious" ways (e.g. "lea eax,[ebx*5]" when you're not loading an effective address at all but are multiplying by 5 instead), partly because there's lots of intermediate results (e.g. "x = (a*y + b*z + d) / c" gets split into 5 or more instructions with 4 or more intermediate results - "a*y" and "b*z", then "a*y + b*z", then "a*y + b*z + d").
The problem with your "compression" is that it's very lossy - everything that matters (all the comments, plus labels, whitespace/formatting, etc) gets destroyed. This makes it completely useless for any practical purpose; given that code is read far more often than it's written.
Note that; in my experience; storing source code in tokenised form and/or as pre-parsed "abstract syntax tree" roughly halves the size it consumes without losing any important information, while also speeding up compile/assemble times (as parsing is partially or completely done already and there's less file IO), while also allowing you to add things like indexes to the file to speed things up (in both the compiler/assembler and in the IDE/editor) even more, while also making it easier for an IDE to "auto-format to suit the user's preferences" (and avoiding stupid "tabs vs. spaces" and "which curly brace style this week" style arguments, and avoiding the "~10% on average" of programmers' time wasted diddling with whitespace).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Love4Boobies
- Member
- Posts: 2111
- Joined: Fri Mar 07, 2008 5:36 pm
- Location: Bucharest, Romania
Re: Compress ASM files?
wat
"Computers in the future may weigh no more than 1.5 tons.", Popular Mechanics (1949)
[ Project UDI ]
- Kazinsal
- Member
- Posts: 559
- Joined: Wed Jul 13, 2011 7:38 pm
- Libera.chat IRC: Kazinsal
- Location: Vancouver
Re: Compress ASM files?
I for one find LZMA to be more effective.
Re: Compress ASM files?
If you wanted to compress your code further, you could assemble it to a binary file! Magic!
Re: Compress ASM files?
Brendan wrote:Hi,
NunoLava1998 wrote:Disclaimer: This only compresses the .asm file. The binary result is not affected.
The problem with assembly language is that (unlike higher level languages) you need lots of comments to make it maintainable; partly because you're using register names instead of descriptive variable names (e.g. "ebx" instead of "dollarsPerFortnight"), partly because often you're using instructions in "not-so-obvious" ways (e.g. "lea eax,[ebx*5]" when you're not loading an effective address at all but are multiplying by 5 instead), partly because there's lots of intermediate results (e.g. "x = (a*y + b*z + d) / c" gets split into 5 or more instructions with 4 or more intermediate results - "a*y" and "b*z", then "a*y + b*z", then "a*y + b*z + d").
The problem with your "compression" is that it's very lossy - everything that matters (all the comments, plus labels, whitespace/formatting, etc) gets destroyed. This makes it completely useless for any practical purpose; given that code is read far more often than it's written.
Note that; in my experience; storing source code in tokenised form and/or as pre-parsed "abstract syntax tree" roughly halves the size it consumes without losing any important information, while also speeding up compile/assemble times (as parsing is partially or completely done already and there's less file IO), while also allowing you to add things like indexes to the file to speed things up (in both the compiler/assembler and in the IDE/editor) even more, while also making it easier for an IDE to "auto-format to suit the user's preferences" (and avoiding stupid "tabs vs. spaces" and "which curly brace style this week" style arguments, and avoiding the "~10% on average" of programmers' time wasted diddling with whitespace).
Cheers,
Brendan
Nice job making this thread interesting.
Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
Re: Compress ASM files?
I don't know if this violates the forum rules, so if it does please delete this.
NunoLava1998, sorry, but you're producing pure sh*t every time. And again, sorry, I just wanted to say it.
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing
OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.
Re: Compress ASM files?
Hi,
SpyderTL wrote:Nice job making this thread interesting.
Thanks
SpyderTL wrote:Can you post a quick example of what this "abstract syntax tree" code would look like? Or is it just straight up binary data?
For a traditional "plain text" compiler the first step (ignoring pre-processor, if any) is to convert the source into tokens. Maybe you have an enumeration of token IDs, and a structure containing "token ID and token data". Something like "x = y * 2 + 3" might be converted into a "symbol" token where the token's data is the symbol table entry for the "x" variable, an assignment operator token (no data), a "symbol" token where the token's data is the symbol table entry for the "y" variable, a "multiply" token (no data), a "numerical literal" token where the data is 2, etc. The end result of this is a linear sequence of tokens, like:
Code: Select all
SYMBOL(x), ASSIGN(), SYMBOL(y), MULTIPLY(), NUMBER(2), ADD(), NUMBER(3)
The second step for a traditional "plain text" compiler is to convert the linear sequence of tokens into a tree. Mostly, for the structure containing "token ID and token data" you were using you add a "pointer to first child" field and "pointer to next sibling" field. Then you start building a tree using the language's rules as a guide. For the "x = y * 2 + 3" example, you might end up with:
Code: Select all
         SYMBOL(x)
             |
          ASSIGN()
             |
           ADD()
          /     \
         /    NUMBER(3)
        /
   MULTIPLY()
    /      \
SYMBOL(y)  NUMBER(2)
This is your abstract syntax tree. Of course for a normal program (rather than a single line) the abstract syntax tree would be much larger.
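The two steps from this post (tokenise, then build a tree with "first child" and "next sibling" pointers) can be sketched in a few lines of Python. This is only an illustration under assumptions, not Brendan's actual tool: the names `tokenise`, `parse`, `make`, `dump` and `Node` are made up here, only `=`, `*` and `+` are handled, and the tree shape simply mirrors the drawing above.

```python
import re
from dataclasses import dataclass
from typing import Optional

OPS = {"=": "ASSIGN", "*": "MULTIPLY", "+": "ADD"}

def tokenise(text):
    """Step 1: turn source text into a flat list of (token ID, data) pairs."""
    tokens = []
    for piece in re.findall(r"\d+|[A-Za-z_]\w*|\S", text):
        if piece.isdigit():
            tokens.append(("NUMBER", int(piece)))
        elif piece[0].isalpha() or piece[0] == "_":
            tokens.append(("SYMBOL", piece))
        elif piece in OPS:
            tokens.append((OPS[piece], None))
        else:
            raise SyntaxError(f"unexpected character {piece!r}")
    return tokens

@dataclass
class Node:
    token_id: str
    data: object = None
    first_child: Optional["Node"] = None
    next_sibling: Optional["Node"] = None

def make(token_id, data=None, *children):
    """Create a node and chain its children through next_sibling."""
    node, prev = Node(token_id, data), None
    for child in children:
        if prev is None:
            node.first_child = child
        else:
            prev.next_sibling = child
        prev = child
    return node

def parse(tokens):
    """Step 2: build the tree for 'symbol = expression' ('*' binds tighter than '+')."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():
        token_id, data = take()
        if token_id not in ("SYMBOL", "NUMBER"):
            raise SyntaxError(f"unexpected {token_id}")
        return make(token_id, data)

    def term():
        left = factor()
        while peek() == "MULTIPLY":
            take()
            left = make("MULTIPLY", None, left, factor())
        return left

    def expr():
        left = term()
        while peek() == "ADD":
            take()
            left = make("ADD", None, left, term())
        return left

    target_id, name = take()
    if target_id != "SYMBOL" or peek() != "ASSIGN":
        raise SyntaxError("expected 'symbol = expression'")
    take()
    root = make("SYMBOL", name, make("ASSIGN", None, expr()))
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return root

def dump(node, depth=0):
    """Return the tree as indented lines, following child and sibling pointers."""
    lines = []
    while node is not None:
        label = f"{node.token_id}({'' if node.data is None else node.data})"
        lines.append("  " * depth + label)
        lines.extend(dump(node.first_child, depth + 1))
        node = node.next_sibling
    return lines

print("\n".join(dump(parse(tokenise("x = y * 2 + 3")))))
```

A real front end would store symbol table entry numbers rather than name strings in the SYMBOL tokens, as described in the post; the strings here just keep the sketch short.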
Now imagine an IDE designed to work on the abstract syntax tree itself. In this case when the user enters a line of text you'd tokenise the line and build a tiny abstract syntax tree (like the example above), then insert that (at the right place) into the larger tree for the program's source code.
For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.
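The two-flag serialisation can be sketched as follows. This is a minimal illustration, not the actual file format: a node is modelled as a `(label, children)` pair, the flags are kept as a small integer next to each label instead of being packed into the token encoding, and the names `serialise`, `deserialise`, `HAS_CHILD` and `LAST_CHILD` are invented for the example.

```python
HAS_CHILD = 0b10   # the next record is this node's first child
LAST_CHILD = 0b01  # this node is its parent's last child

def serialise(node, is_last=True, out=None):
    """Pre-order walk that emits one (label, flags) record per node."""
    if out is None:
        out = []
    label, children = node
    flags = (HAS_CHILD if children else 0) | (LAST_CHILD if is_last else 0)
    out.append((label, flags))
    for index, child in enumerate(children):
        serialise(child, index == len(children) - 1, out)
    return out

def deserialise(records):
    """Rebuild a single-rooted tree from the flat record list."""
    root = None
    open_parents = []  # nodes that are still waiting for more children
    for label, flags in records:
        node = (label, [])
        if open_parents:
            open_parents[-1][1].append(node)
        else:
            root = node
        if flags & LAST_CHILD and open_parents:
            open_parents.pop()         # that parent is now complete
        if flags & HAS_CHILD:
            open_parents.append(node)  # following records belong to this node
    return root

tree = ("SYMBOL(x)", [
    ("ASSIGN()", [
        ("ADD()", [
            ("MULTIPLY()", [("SYMBOL(y)", []), ("NUMBER(2)", [])]),
            ("NUMBER(3)", []),
        ]),
    ]),
])

records = serialise(tree)
for label, flags in records:
    print(f"{flags:02b} {label}")
print("round trip intact:", deserialise(records) == tree)
```

No pointers or child counts are stored: the order of the records plus two bits each is enough to recover the shape.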
What I did was have a variable-length encoding scheme for token IDs (where more commonly used tokens took up fewer bits; and where tokens in a certain range had "data length" and "data" after them); where this encoding scheme included the 2 extra bits needed for the "has child" and "last child" flags. Of course I also had a symbol table and a few other things (type table, section table) in the file (partly for speed/indexing, but partly so I could use "symbol table entry number" instead of the symbol's name string as the data for some tokens).
I also had tokens for "full line comment" and "end of line comment"; and also "compressed full line comment" and "compressed end of line comment". The compressed comments were a cheap hack - mostly, subtract 32 from each character and pack the result into 5 bits, so that (e.g.) 20 characters get packed into 13 bytes (and if a comment string couldn't be compressed like this then just use the "uncompressed" token type).
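A rough sketch of the 5-bit packing, taking the description literally: subtract a base of 32 and keep 5 bits per character. Read that way, only the 32 characters from space to "?" fit, and anything else triggers the fall-back to the uncompressed token. The real scheme is described as "mostly" this, so it may well have differed; `pack5` and `unpack5` are names invented for this example.

```python
def pack5(text, base=32):
    """Pack each character into 5 bits; return None if any character does not fit."""
    bits = 0
    nbits = 0
    for ch in text:
        value = ord(ch) - base
        if not 0 <= value < 32:
            return None  # caller would use the uncompressed comment token instead
        bits = (bits << 5) | value
        nbits += 5
    pad = -nbits % 8  # zero bits added so the result is a whole number of bytes
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")

def unpack5(data, count, base=32):
    """Recover `count` characters from bytes produced by pack5."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    chars = []
    for index in range(count):
        shift = total - 5 * (index + 1)
        chars.append(chr(((bits >> shift) & 0x1F) + base))
    return "".join(chars)

comment = "1 + 2 * 3 = 7 (100%)"  # 20 characters, all between ' ' and '?'
packed = pack5(comment)
print(len(comment), "characters ->", len(packed), "bytes")
print("round trip intact:", unpack5(packed, len(comment)) == comment)
print("letters fit:", pack5("mov eax") is not None)
```

The character count has to be stored somewhere as well (here it is passed to `unpack5`), which matches the "data length" field mentioned for tokens that carry data.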
This, plus a little minor trickery, is where I get the "tokens or AST halve file sizes" (from actual experimental results).
Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).
Cheers,
Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
- Schol-R-LEA
- Member
- Posts: 1925
- Joined: Fri Oct 27, 2006 9:42 am
- Location: Athens, GA, USA
Re: Compress ASM files?
@Brendan: have you read Modern Compiler Design by Grune et al.? While I haven't finished it myself, from what I have seen it does an admirable job of discussing compression techniques for symbol tables and ASTs. I have both the hardbound print version and the e-book, and it is well worth the price; however, you could probably also find it in most university libraries, and the publisher, Springer, sells e-books of specific chapters for particular topics separately as well.
Rev. First Speaker Schol-R-LEA;2 LCF ELF JAM POEE KoR KCO PPWMTF
Ordo OS Project
Lisp programmers tend to seem very odd to outsiders, just like anyone else who has had a religious experience they can't quite explain to others.
Re: Compress ASM files?
Brendan wrote:For storage as a file, you need to serialise the tree. Fortunately this can be done with 2 bits per node/token, one flag that indicates if the next token is the current token's child, and one flag that indicates if the current token is its parent's last child.
Sadly, I did this work starting with a "plain text to my tokenised source code file format" utility, and I didn't get much further than this (I started having ideas about caching pre-compiled pieces of code in the source file, and storing unit test results in the source file, etc; and realised I needed to start researching/building IDE prototypes before I could really figure out the full extent of what I wanted from the source file format).
Hey, you know what file format handles trees really well, and has great IDE support?
XML.
Actually, the original reason that I decided to go down the XML route was because a co-worker and I were discussing how cool it would be if, instead of storing text source files in source control, you could store pure "data" instead -- the idea being that the data could be converted back to text using your own personal formatting preferences and naming conventions, and comparing one changeset version to another would not be cluttered up by things like changing tabs to spaces and stuff like that.
But I just picked XML because the only other options I could think of would be CSV, JSON or SQL Server tables. Out of those options, at the time, XML seemed like the logical choice. But binary encoded trees would also be an option.
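For comparison, here is roughly what the "x = y * 2 + 3" example from earlier in the thread could look like as XML, built with Python's standard `xml.etree.ElementTree`. The element and attribute names are invented for this sketch and are not OZone's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; any real schema would define its own.
assign = ET.Element("assign", target="x")
add = ET.SubElement(assign, "add")
multiply = ET.SubElement(add, "multiply")
ET.SubElement(multiply, "symbol", name="y")
ET.SubElement(multiply, "number", value="2")
ET.SubElement(add, "number", value="3")

xml_text = ET.tostring(assign, encoding="unicode")
print(xml_text)

# Reading it back yields the same structure, independent of any text formatting.
reloaded = ET.fromstring(xml_text)
print([element.tag for element in reloaded.iter()])
```

The trade-off against a binary encoded tree is size: the structure is the same, but every node pays for its tag name in text.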
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott