Packing binary data and compressing at the bit level, etc...

~ · Post by ~ » Sat Jan 30, 2016 10:14 am

>> You can see the sample code for this topic running here <<

I have been working on using binary data for low-level tasks in the HTML5/JavaScript environment, such as implementing a CPU emulator that runs nice and compact code in the web browser.

While working with a lot of binary data and code, I have found that it would be better to store such data compressed and decompress it from JavaScript, and then have the application save a copy of itself together with the state of its data, so that any changes to our work are kept.

I have seen that I can use Base64 with the raw binary character values 0-63 as the alphabet (the printable part of that range is  !"#$%&'()*+,-./0123456789:;<=>?), so each output byte effectively takes strictly 6 bits, unlike with the regular Base64 alphabet (ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=).

The trick is that if the original data mixes 8-bit characters with regular 7-bit ASCII characters (e.g., UTF-8 Unicode text), encoding it in binary-mode Base64 forces every byte to contain only 6 used bits. We can then use all 8 bits again by packing the 6 bits of the current value together with 2 bits of the next one, and so on (packing 6-bit values into 8-bit bytes).

In this way we force all bytes to contain the same number of bits temporarily and then safely take back the space used by 7 and 8-bit characters by packing.

Here is the code of the Base64 encoding and decoding functions, wrapped in an extremely simple JavaScript class:

<title>Binary Base64 Encoding/Decoding (with 6-bit ASCII values 0-63)</title>





<body bgcolor="#bcbcbc" style="font-size:19px"><script>
function Base64(alphabet_to_use=0)
{
 this.nalpha=alphabet_to_use;
 this.alphabet64=[
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=", //0
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_="  //1
                ];

 //Generate a purely binary string (used to effectively pack 6 bits into 8-bit bytes
 //and normalize the 8 bits):
 ///
  this.alphabet64[2]="";
  for(var x=0; x<64; x++)
  {
   this.alphabet64[2]+=String.fromCharCode(x);
  }


 this.skip64=[
              "= \t\r\n",   //0
              "= \t\r\n",   //1
              ""            //2
             ];



 //Converts a raw "binary" data string into Base64 data:
 ///
  this.btoa_DataView=function(binary_data)
  {
   var Base64_TBL=this.alphabet64[this.nalpha].split("");
   var pad64=this.alphabet64[this.nalpha][64];
   if(pad64==undefined)pad64="";



   //This will contain our resulting converted data
   ///
    var base64="";

   //Check that the data is a DataView (anything exposing byteLength and getUint8) and that its length is not 0:
   ///
    if(!(binary_data.byteLength>0))return "";


   //Temporary 32-bit DWORD (a number, since it is used with bitwise operators):
   ///
    var tmp=0;


   //4-byte temporary Base64 chunk for each 3 bytes of data, and/or plus padding if necessary:
   ///
    var tm2="";

   //Number of '=' padding characters needed when the data size is not exactly divisible by 3.
   //It is tracked to avoid generating Base64 characters where padding belongs:
   ///
    var padcount=0;


   //This loop advances in groups of 3 because 3 "binary" letters
   //produce 4 Base64 letters:
   ///
    for(var x=0;x<binary_data.byteLength;x+=3)
    {
     //INIT: Read a DWORD safely byte by byte, with "memory boundary checking"
     ///
      tmp=binary_data.getUint8(x)<<24;      //bits 31-24

      if(x+1<binary_data.byteLength)
      tmp|=binary_data.getUint8(x+1)<<16;   //bits 23-16
      else padcount++;                      //If there's no data, increase padding and this bit group is 0 automatically

      if(x+2<binary_data.byteLength)
      tmp|=binary_data.getUint8(x+2)<<8;    //bits 15-8
      else padcount++;                      //If there's no data, increase padding and this bit group is 0 automatically

                                            //bits 7-0 are 0 always


     //END:  Read a DWORD safely byte by byte, with "memory boundary checking"


    //Shift 8 bits right so that the 24 data bits sit in bits 23-0,
    //aligned for extracting four 6-bit Base64 groups.
    //
    //The arithmetic shift may fill bits 31-24 with copies of the sign bit,
    //but the loop below only reads the low 24 bits (4 groups of 6 bits),
    //so those upper bits are never used:
    ///
     tmp>>=8;   //tmp is a 32-bit DWORD here


    //"Flush" past, now useless data, before using the buffer again (might not be necessary
    //in C or assembly since the data in those languages will always be overwritten either by
    //data or padding; but it is required in JavaScript because the string cannot be handled
    //so freely in such language).
    ///
     tm2="";


     var sshl=6;  //I thought that this bit shifting was going to be dynamic, but after the adjusting above
                  //and the bit masking below inside the loop, it isn't necessary at all.


     for(var y=0;y<4;y++)
     {
      //Take the lowest 6 bits of tmp on each pass and use them as the index of a
      //Base64 character, so the 4 characters of this chunk are produced from last to first:
      ///

      //Save the corresponding Base64 character, backwards, or if we are in a range in which
      //we previously detected that there was no data available for bytes 2 of 3 and/or 3 of 3,
      //just add padding.
      //
      //In other words, if the count of required padding characters is 0 (3 original bytes were
      //present for this loop run), or if the count of padding characters is not 0 and y is
      //in a range above/outside that of the padding characters to generate, then save a Base64
      //indexed character.
      //
      //Otherwise, we are in a byte range for padding, and we must generate and save a padding character
      //(it could and should only happen at the very end of the whole data buffer):
      ///
       if(padcount==0 || (padcount!=0 && y>padcount-1))
       {
        tm2=Base64_TBL[tmp&63]+tm2;
       }
        else tm2=tm2+pad64;


      //Keep shifting bits. We have saved backwards because in this way we
      //reduce the amount of bit shifting and bit masking required to get
      //the 6 bits required for each Base64 character, and still we can get
      //each Base64 character as soon as possible, as soon as its offset
      //is available to us.
      ///
       tmp>>=6;
     }

    //Save this chunk of Base64 characters:
    ///
     base64+=tm2;
    }

   return base64;
  };






  this.StrPad=function(str,padchar,paddedstrlen,direction)
  {
   //If no direction was specified, the default action is
   //to pad leftmost:
   ///
    if(!direction)direction="l";

 
   //Don't allow empty padding character variable or a bad final padded length
   //because it would cause an infinite loop:
   ///
    if(typeof(paddedstrlen)!="number" || typeof(padchar)!="string")return str;
    if(!(padchar.length>0) || paddedstrlen<=0)return str;


   if(direction.toLowerCase()=="r")
   {
     while(str.length<paddedstrlen)
     {
      str=str+padchar.charAt(0);
     }
   }
    else
    {
      while(str.length<paddedstrlen)
      {
       str=padchar.charAt(0)+str;
      }
    }

   return str;
  };





  //Converts a raw "binary" data string into Base64 data (the string is re-encoded as UTF-8 first):
  ///
   this.btoa=function(binary_data)
   {
    var pad64=this.alphabet64[this.nalpha][64];
    if(pad64==undefined)pad64="";

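    //Validate the input before converting it:
    ///
     if(typeof(binary_data)!="string")return "";
     if(!(binary_data.length>0))return "";

    //Re-encode the string as UTF-8, one byte per character, so that
    //every character code fits in 8 bits for str2bin():
    ///
     binary_data=unescape(encodeURI(binary_data));

    const Base64_TBL=this.alphabet64[this.nalpha];
    var base64="";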


    var tmp="";
    var tm2="";
    for(var x=0;x<binary_data.length;x+=3)
    {
     tmp=this.str2bin(binary_data.substring(x,x+3));

      for(var y=0;y<tmp.length;y+=6)
      {
       if(tmp.substring(y,y+6).length<6)
       {
        //Plain positional arguments: writing str=..., padchar=... here would create implicit globals.
        tm2=this.StrPad(tmp.substring(y,y+6),"0",6,"r");
        base64+=Base64_TBL[parseInt(tm2,2)];
        }
        else
        {
         base64+=Base64_TBL[parseInt(tmp.substring(y,y+6),2)];
        }



       if(tmp.length==8)
       {
        if(y==6)
        {
         base64+=pad64+pad64;
        }
       }
        else if(tmp.length==16)
        {
         if(y==12)
         {
          base64+=pad64;
         }
        }
      }
    }

    return base64;
   };



   this.atob=function(base64_data)
   {
    var binary64="";
    var binary="";

    var pad64=this.alphabet64[this.nalpha][64];
    if(pad64==undefined)pad64="";

    if(typeof(base64_data)!="string")return "";
    if(!(base64_data.length>0))return "";


    var tmp="";
    var c="";
    for(var x=0;x<base64_data.length;x++)
    {
     c=base64_data.charAt(x);

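     //Skip padding/whitespace, and also skip any character that is not in the alphabet
     //(indexOf returns -1 for it, which would otherwise corrupt the bit string):
     if(this.skip64[this.nalpha].indexOf(c)>=0)continue;
     c=this.alphabet64[this.nalpha].indexOf(c);
     if(c<0)continue;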


     binary64+=this.StrPad(c.toString(2),"0",6,"l");
  
     if(binary64.length>=8)
     {
      binary+=String.fromCharCode(parseInt(binary64.substring(0,8),2));

      binary64=binary64.substring(8,binary64.length);
     }
    }

    return binary;
   };



   //This function takes any string, including special characters, treats them
   //as a binary 8-bit data string and returns a string with the binary representation
   //of those bits.
   ///
    this.str2bin=function(ASCII_str)
    {
     if(typeof(ASCII_str)!="string")return "";

     var rret="";

      for(var x=0;x<ASCII_str.length;x++)
      {
       //Left-pad each character code to exactly 8 binary digits:
       rret+=this.StrPad(ASCII_str.charCodeAt(x).toString(2),"0",8,"l");
      }

     return rret;
    };


};

</script>



<script>
//Use the binary Base64 alphabet at index 2 of our class so that we can use 6-bit characters:
///
 var base64=new Base64(2);


//This is the string to encode/decode:
///
 var s=" 	This article's lead section may not adequately summarize key points of its contents. Please consider expanding the lead to provide an accessible overview of all important aspects of the article. Please discuss this issue on the article's talk page. (February 2013)";


//Then let's use base64.btoa(original_data) and base64.atob(base64_code) to encode and decode, respectively:
///
 document.write(
       "<pre><b>Binary Base64 alphabet:</b><br />"+base64.alphabet64[base64.nalpha]+"<br />-------<br /><br /><br /><br />"+
       "<b>Coded with binary-mode Base64:</b><br />"+base64.btoa(s)+"<br /><br /><b>Coded with regular Base64:</b><br />"+btoa(s)+"<br /><br /><br />"+
       "<b>Decoded string from binary Base64:</b><br />"+base64.atob(base64.btoa(s))+"<br /><b>Decoded string from text Base64:</b><br />"+atob(btoa(s))+"<br /><br />"+
       "<b>Are decoded strings identical?</b> "+(base64.atob(base64.btoa(s))==atob(btoa(s)))
      );



</script>

FallenAvatar · Post by **FallenAvatar** » Sat Jan 30, 2016 10:24 am

~ wrote:>> You can see the sample code for this topic running here <<

I have been working on using binary data for low-level tasks in the HTML5/JavaScript environment, such as implementing a CPU emulator that runs nice and compact code in the web browser.

Have you taken a look at asm.js?

~ wrote: While working with a lot of binary data and code, I have found that it would be better to store such data compressed and decompress it from JavaScript, and then have the application save a copy of itself together with the state of its data, so that any changes to our work are kept.

I have seen that I can use Base64 with the raw binary character values 0-63 as the alphabet (the printable part of that range is  !"#$%&'()*+,-./0123456789:;<=>?), so each output byte effectively takes strictly 6 bits, unlike with the regular Base64 alphabet (ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=).

Base64 does not generally compress anything; it inflates it (every 3 input bytes become 4 output characters). The general use of Base64 is to safely encode binary data into a form that will not cause JavaScript parsing errors (namely, it does not include unprintable characters, whitespace, quotes, semicolons, etc.).
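To put a number on the inflation, here is a minimal sketch of the size arithmetic for standard padded Base64. The helper names `base64Length` and `base64Overhead` are made up for this illustration and are not part of the code posted above.

```javascript
// Length of standard padded Base64 output for a given number of input bytes:
// every group of 3 input bytes (or final partial group) becomes 4 characters.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Size of the Base64 text relative to the input size, as a ratio.
function base64Overhead(byteCount) {
  if (byteCount === 0) return 0;
  return base64Length(byteCount) / byteCount;
}

console.log(base64Length(3));    // 4
console.log(base64Length(4));    // 8 (one full group plus one padded group)
console.log(base64Length(3000)); // 4000
```

So 3000 bytes of input become 4000 characters of Base64, roughly a one-third increase before any packing is attempted.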

- Monk

Nutterts · Post by **Nutterts** » Sat Jan 30, 2016 2:22 pm

~ wrote: The trick is that if the original data mixes 8-bit characters with regular 7-bit ASCII characters (e.g., UTF-8 Unicode text), encoding it in binary-mode Base64 forces every byte to contain only 6 used bits. We can then use all 8 bits again by packing the 6 bits of the current value together with 2 bits of the next one, and so on (packing 6-bit values into 8-bit bytes).

I can see your line of thinking, but it's understandably flawed. What you're overlooking is that you have three types of information: 6-bit, 7-bit and 8-bit in length. Right now you can separate them because they are all stored as 8 bits. If you remove the unused bits, you no longer know where the next value starts.

To add a type prefix you'd need to attach a 2-bit identifier to each piece of information, which will increase the overall size.

For compression you'd (for example) want to find a series of bits that occurs more often than a certain short series of bits, and create an index in which you switch the two around if and only if the amount of space saved minus the overhead of adding the index entry is bigger than zero. Search for run-length encoding as a simple example, or Huffman coding. The first works reasonably well on text files, the latter works much better and faster on arbitrary binary data.
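As a rough illustration of run-length encoding, the simplest of the schemes named above, here is a minimal sketch. The names `rleEncode` and `rleDecode` and the pair-based output format are assumptions made for this example only.

```javascript
// Run-length encode a string into [character, runLength] pairs.
function rleEncode(text) {
  const runs = [];
  for (const ch of text) {
    const last = runs[runs.length - 1];
    if (last && last[0] === ch) {
      last[1] += 1;        // extend the current run
    } else {
      runs.push([ch, 1]);  // start a new run
    }
  }
  return runs;
}

// Rebuild the original string from the pairs.
function rleDecode(runs) {
  return runs.map(([ch, count]) => ch.repeat(count)).join("");
}

const rleSample = "aaaabbbcca";
const rleRuns = rleEncode(rleSample);
console.log(JSON.stringify(rleRuns));          // [["a",4],["b",3],["c",2],["a",1]]
console.log(rleDecode(rleRuns) === rleSample); // true
```

Note that this only saves space when the input contains long runs; for data with few repeats, storing a count next to every symbol makes the result larger than the input.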

Edit: I misread part of your post but what I said is still accurate. Only you'd need to add a 1bit prefix to your 7bit & 8bit pieces of information which will increase your total size, not decrease it. Using Base64 leads you to believe something like that would work. But run a few tests and you'll see that on average your data will increase 0.5 bits for every byte. Maybe even worse.

FallenAvatar · Post by **FallenAvatar** » Sat Jan 30, 2016 4:26 pm

Nutterts wrote:
tjmonk15 wrote:...
...

Umm, wrong quote? (At least I didn't write that, ~ did :-p)

- Monk

Nutterts · Post by **Nutterts** » Sat Jan 30, 2016 7:20 pm

tjmonk15 wrote: Umm, wrong quote? (At least I didn't write that, ~ did :-p)

Haha, sorry monk, didn't see that weird mix-up. I could have sworn I just selected that text & clicked quote.

~ · Post by ~ » Mon Feb 01, 2016 5:40 pm

>> See the example code here <<

I have added functions to pack and compress the data by first converting the data to 6-bit Base64 values (actual 0-63 ASCII) and then packing those 6 bits into 8 bits filling all of the unused bits.

Since all values are 6-bit in size, there is no need to keep track of the size of each value, just the total count of bytes of the original data to decompress.
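The packing step described here can be sketched as follows. This is not the code from the linked example page; `pack6` and `unpack6` are hypothetical names, and the sketch assumes the input is already an array of values in the range 0-63 and that the value count is stored separately.

```javascript
// Pack an array of 6-bit values (0-63) into bytes: 4 values fit in 3 bytes.
function pack6(values) {
  const out = new Uint8Array(Math.ceil(values.length * 6 / 8));
  let acc = 0;      // bit accumulator
  let accBits = 0;  // number of valid bits currently held in acc
  let pos = 0;
  for (const v of values) {
    acc = (acc << 6) | (v & 63);
    accBits += 6;
    if (accBits >= 8) {
      accBits -= 8;
      out[pos++] = (acc >> accBits) & 255; // emit the top 8 bits
      acc &= (1 << accBits) - 1;           // keep only the leftover bits
    }
  }
  if (accBits > 0) out[pos++] = (acc << (8 - accBits)) & 255; // zero-fill the tail
  return out;
}

// Unpack `count` 6-bit values from the packed bytes.
function unpack6(bytes, count) {
  const values = [];
  let acc = 0, accBits = 0, pos = 0;
  while (values.length < count) {
    if (accBits < 6) {
      acc = (acc << 8) | bytes[pos++];
      accBits += 8;
    }
    accBits -= 6;
    values.push((acc >> accBits) & 63);
    acc &= (1 << accBits) - 1;
  }
  return values;
}

console.log(Array.from(pack6([1, 2, 3, 4]))); // [ 4, 32, 196 ]
console.log(unpack6(pack6([1, 2, 3, 4]), 4)); // [ 1, 2, 3, 4 ]
```

Four 6-bit values occupy exactly three bytes, which is the same ratio by which Base64 expanded the data in the first place.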

I have run several tests and I have proven that with this method the packing/compression ratio of the data always reaches around 25%, at least for regular or UTF-8 Unicode text data.

With this sample webpage you can input any text and in return you will get a version of the same text with the bits packed to save space.

It can actually be used to compress, for example, 1 Megabyte of text, down to 750 Kilobytes.

Brendan · Post by **Brendan** » Mon Feb 01, 2016 7:35 pm

Hi,

~ wrote:I have run several tests and I have proven that with this method the packing/compression ratio of the data always reaches around 25%, at least for regular or UTF-8 Unicode text data.

No. It only supports a limited sub-set of ASCII and trashes everything else. If you're going to do that, then have a "limited sub-set of only one character" (e.g. an alphabet of 'A' and nothing else) so you only need to store the length and can pack 4 GiB of text into 4 bytes.

For full support of ASCII you need all 128 characters. For most HTML data (but not all - e.g. not pre-formatted text) you could collapse white-space (e.g. replace sequences of tabs, 2 or more spaces, newlines, etc with a single space character) and assume no control characters are present; so that you're only handling the 95 characters from 32 to 126. Then you'd do something like "value = value * 95 + (character - 32);" (using "big integers") to pack it into the smallest possible size (which works out to approximately 6.57 bits per character).
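The base-95 idea can be sketched with JavaScript's built-in `BigInt`. The function names `packBase95` and `unpackBase95` are invented for this illustration, and the accumulation uses addition because 95 is not a power of two, so a bitwise OR would let digits overlap.

```javascript
// Pack printable ASCII (codes 32-126) into one big integer, treating each
// character as a base-95 digit.
function packBase95(text) {
  let value = 0n;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 32 || code > 126) {
      throw new RangeError("not printable ASCII at index " + i);
    }
    value = value * 95n + BigInt(code - 32);
  }
  return value;
}

// Unpack `length` characters. The length has to be stored separately,
// because leading spaces (digit 0) would otherwise be lost.
function unpackBase95(value, length) {
  const chars = new Array(length);
  for (let i = length - 1; i >= 0; i--) {
    chars[i] = String.fromCharCode(Number(value % 95n) + 32);
    value /= 95n;
  }
  return chars.join("");
}

const packed95 = packBase95("Hello, world!");
console.log(packed95.toString(2).length); // number of bits actually used
console.log(unpackBase95(packed95, 13));  // Hello, world!
console.log(Math.log2(95));               // about 6.57 bits per character
```

In practice the big integer still has to be serialized into bytes and the length stored alongside it, so the real saving is a little smaller than the per-character figure suggests.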

Of course as soon as you go anywhere near UTF-8 you need the 8th bit and it all turns to mush.

On top of that, you shouldn't be ignoring the cost of the decoder. E.g. if you compress 3 KiB of text down to 2 KiB, but have to add 2 KiB of javascript to decode it; then you've "compressed" 3 KiB of data up to a total of 4 KiB and made things worse.

Finally; the HTTP protocol already has compression built in; so even if the file ends up smaller it probably cripples HTTP's compression ratio and ends up costing more bandwidth than the original would have.

Basically; it'd make more sense to forget about compression (leave it to HTTP's built in compression) and focus on "minification" (strip out all characters that aren't strictly necessary) instead. I'd start by stripping out all the javascript, partly to re-assure people that it isn't a bloated pile of tracking/advertising puke in disguise.

Cheers,

Brendan

Nutterts · Post by **Nutterts** » Mon Feb 01, 2016 8:57 pm

If you're limiting yourself to only 64 legal symbols then you might want to forget about Base64 and implement it as a lookup table and a state machine. Simpler and faster.

But I still think that's not your goal, and that you're misleading yourself by measuring the space saved against the Base64 string, not against the original data which was Base64 encoded. Remember that Base64 adds to the original data size; you need to subtract that from the 25%.

Text itself does compress quite nicely; the compression HTTP uses (like combuster already suggested) compresses text by up to 90%.

Octocontrabass · Post by **Octocontrabass** » Mon Feb 01, 2016 9:13 pm

~ wrote:I have added functions to pack and compress the data by first converting the data to 6-bit Base64 values (actual 0-63 ASCII) and then packing those 6 bits into 8 bits filling all of the unused bits.

You take a stream of 8-bit bytes and split those into groups of 6 bits, encoded using Base64. Then, you take those 6-bit units and pack them back into 8-bit bytes.

Am I missing something here?

~ · Post by ~ » Mon Feb 01, 2016 9:48 pm

Octocontrabass wrote:
~ wrote:I have added functions to pack and compress the data by first converting the data to 6-bit Base64 values (actual 0-63 ASCII) and then packing those 6 bits into 8 bits filling all of the unused bits.
You take a stream of 8-bit bytes and split those into groups of 6 bits, encoded using Base64. Then, you take those 6-bit units and pack them back into 8-bit bytes.

Am I missing something here?

Yes: we have values with 7 used bits and values with 8 used bits intermixed, and we use a binary version of Base64 with byte values 0-63, which take 6 used bits each. This regularizes the contents into a stream of 6-bit values. It will of course inflate the data just like regular Base64, but by packing it all back into full 8-bit values we should end up with solid 8-bit bytes and at most 1 extra byte.

So our smallest data size is 8 bits trying to save all possible unused bits.

It gives us around 25% saving relative to the Base64 code, which is literally the same size as the original data, so the technique is more useful for smaller values, such as packing several 3-bit values into the same byte.

It's more a proof of concept, but after being able to pack arbitrary-sized bit data, we can implement algorithms such as LZW and get to use variable-sized compression codes, with much better compression.

You can try it with things like Chinese characters, and it should work.

I have used Base64 just because there are 7 and 8-bit values present, but if we could detect that there are ONLY 7-used-bit values or smaller in a regular way in a stream, we should achieve more actual compression, although it is probably better to implement this directly into the usual LZW algorithms, etc...

Octocontrabass · Post by **Octocontrabass** » Mon Feb 01, 2016 10:07 pm

~ wrote:literally the same size as the original data

So what's the benefit of using this encoding instead of transmitting the original data?

~ · Post by ~ » Mon Feb 01, 2016 11:22 pm

Octocontrabass wrote:
~ wrote:literally the same size as the original data
So what's the benefit of using this encoding instead of transmitting the original data?

The next thing to implement with the bit packing code would be something like LZW or any algorithm that spans one normally atomic value across more than 1 byte or machine word. The Base64 code would not be needed to store or transmit in binary mode.

In general, packing the bits is better when all of the data has a maximum bit size smaller than 7, so that several values fit in the same byte, taking into account the byte offset and the bit offset. Since text and source code contain mostly 7-bit data, there are two options. One is to produce a version of the stream in which all values are 6-bit (with binary Base64) and then repack it, checking that all bytes contain a uniform maximum number of bits per byte or per word. The other is to assume that all data is 8-bit or word-sized and then use an algorithm like LZW, which stores only the pages of a "repeated strings dictionary" (1 string per page, of varying length, taken from a substring of the original data that potentially occurs more than once). The dictionary regenerates itself from the values of those pages, whose codes have variable sizes: first filling all 3 bits, then all 4 bits, and so on up to all 12 bits.

It would help if the values were smaller, like 3, 4 or 5 bits, so that several whole or partial values can be packed into the same byte.
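For values narrower than 6 bits, the same idea can be sketched with an explicit byte offset and bit offset. The functions `packBits` and `unpackBits` are hypothetical helpers written for this post, assuming unsigned values, a fixed width of 1 to 16 bits, and a separately stored value count.

```javascript
// Pack unsigned integers of `width` bits each into a Uint8Array,
// most significant bit first.
function packBits(values, width) {
  const out = new Uint8Array(Math.ceil(values.length * width / 8));
  let bitPos = 0; // absolute bit position in the output
  for (const v of values) {
    for (let b = width - 1; b >= 0; b--) {
      // bitPos >> 3 is the byte offset, bitPos & 7 is the bit offset in that byte
      if ((v >> b) & 1) out[bitPos >> 3] |= 128 >> (bitPos & 7);
      bitPos++;
    }
  }
  return out;
}

// Read back `count` values of `width` bits each.
function unpackBits(bytes, width, count) {
  const values = [];
  let bitPos = 0;
  for (let i = 0; i < count; i++) {
    let v = 0;
    for (let b = 0; b < width; b++) {
      const bit = (bytes[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
      v = (v << 1) | bit;
      bitPos++;
    }
    values.push(v);
  }
  return values;
}

const threeBit = [0, 1, 2, 3, 4, 5, 6, 7];
console.log(Array.from(packBits(threeBit, 3)));  // [ 5, 57, 119 ]
console.log(unpackBits(packBits(threeBit, 3), 3, 8));
```

Eight 3-bit values fit in 3 bytes instead of 8, which is where bit packing actually pays off: the values must genuinely need fewer than 8 bits each.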

The Base64 is used only to ensure that all data becomes 6-bit in size.

By packing back to 8 bits we get the same size and sometimes one additional byte.

We could then compare if an algorithm like LZW, LZ77 or Huffman coding yields a smaller compressed result with the original binary or with the binary-Base64-and-bit-packed version.

Octocontrabass · Post by **Octocontrabass** » Tue Feb 02, 2016 1:12 am

~ wrote:We could then compare if an algorithm like LZW, LZ77 or Huffman coding yields a smaller compressed result with the original binary or with the binary-Base64-and-bit-packed version.

If the compressor assumes each byte is a symbol, the original data will compress better. Your bit packed format breaks the correspondence between byte value and symbol value.

If the compressor is aware of each symbol, they will both compress the same. Each byte in the original data is a symbol, so it doesn't matter how you manipulate the symbols.

kzinti · Post by **kzinti** » Tue Feb 02, 2016 1:47 pm

~ wrote:We could then compare if an algorithm like LZW, LZ77 or Huffman coding yields a smaller compressed result with the original binary or with the binary-Base64-and-bit-packed version.

So basically, you have no idea how compression works. Have fun exploring!

OSDev.org
