# Lzma - wow!

## devsk

So, I was comparing various compression tools and I found it amazing that the real entropy of a 210MB+ file was close to just 800KB. Look at this:

The file is a text file created by cat'ing /var/log/messages over and over.

```
# lt /var/tmp/mytestfile*

-rw-r--r-- 1 root root 217453638 2010-01-12 00:48 /var/tmp/mytestfile2

-rw-r--r-- 1 root root  20901610 2010-01-12 00:49 /var/tmp/mytestfile2.gz

-rw-r--r-- 1 root root  20831025 2010-01-12 00:49 /var/tmp/mytestfile2.pigz

-rw-r--r-- 1 root root  15097046 2010-01-12 17:24 /var/tmp/mytestfile2.bz2

-rw-r--r-- 1 root root    816164 2010-01-12 17:27 /var/tmp/mytestfile2.lzma

```

Here are the times:

```

# time lzma -z -M max -T 8 -c -9 /var/tmp/mytestfile2 > /var/tmp/mytestfile2.lzma

real    0m41.331s

# time bzip2 -c /var/tmp/mytestfile2 > /var/tmp/mytestfile2.bz2

real    0m27.679s

# time gzip -c /var/tmp/mytestfile2 > /var/tmp/mytestfile2.gz

real    0m2.951s

# time pigz -c /var/tmp/mytestfile2 > /var/tmp/mytestfile2.pigz

real    0m0.737s

```

pigz wins on time, but lzma massacres the competition on size: less than 1MB for LZMA while the rest are close to 20MB. I could reduce the time by using a lower compression level for LZMA, but then it's only as good as bzip2.

Look at the decompression time:

```

# time lzma -d -c /var/tmp/mytestfile2.lzma > /var/tmp/mytestfile3

real    0m0.186s

# time bzip2 -d -c /var/tmp/mytestfile2.bz2 > /var/tmp/mytestfile3

real    0m2.893s

# time gzip -d -c /var/tmp/mytestfile2.gz > /var/tmp/mytestfile3

real    0m0.868s

```

That's a massacre by lzma. Decompression is REALLY fast.

So, LZMA seems like the best choice for situations where you compress once and use forever. A livecd is the ideal candidate for this. I don't mind if it takes 5 minutes to create; as long as it decompresses fast and takes a fraction of the space, I am fine with it.

Now, I am patiently waiting for the squashfs with LZMA support to land in the kernel.

Notes:

1. The real entropy of the file is essentially just that of the 8MB /var/log/messages I started with and cat'ed over and over to create this 217MB file. LZMA came close to noticing that.

2. All tests were done in RAM to avoid I/O delays (note the folder /var/tmp). This is a pure test of the compression algo.
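For anyone who wants to reproduce the effect in miniature, here is a sketch with a small synthetic log (the log line and sizes are made up; exact numbers will vary with tool versions, but the ordering should hold):

```shell
# Build a highly repetitive ~900KB "log" the same way as above, just smaller.
f=$(mktemp)
yes "Jan 12 00:00:00 host kernel: something happened" | head -n 20000 > "$f"

orig=$(wc -c < "$f")
gz=$(gzip -9 -c "$f" | wc -c)   # gzip: good, but per-block overhead remains
xz=$(xz -9 -c "$f" | wc -c)     # lzma/xz: collapses the repetition much harder

echo "original: $orig  gzip: $gz  xz: $xz"
rm -f "$f"
```

On repetitive input like this, xz's output should be a small fraction of gzip's, mirroring the 800KB-vs-20MB gap above.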

----------

## ppurka

Very interesting set of tests. Does this have any implication to the usage of lzma to compress the kernel? 

Personally, I have the compression in the kernel set to gzip because "apparently" gzip takes less time to decompress. Now, this "apparently" is in question   :Confused: 

Also, does the compression of the 8MB original file by lzma also lead to same final size of ~800kB? This would mean that lzma is really close to the actual entropy of the source  :Smile: 

----------

## Mike Hunt

I needed to cat /var/log/messages over 12 million times, and boy are my fingers sore! 

```
 # xz -z -M max -T 8 -c -9 /var/tmp/mytestfile2 > /var/tmp/mytestfile2.xz

 # ls -l /var/tmp/mytestfile2*

-rw-r--r-- 1 root root 217946214 Jan 12 21:25 /var/tmp/mytestfile2

-rw-r--r-- 1 root root     31852 Jan 12 21:28 /var/tmp/mytestfile2.xz
```

Compressing it with xz was about 9 times faster than bzip2; decompression of xz was almost instantaneous.

app-arch/xz-utils is keyword masked and is blocked by app-arch/lzma-utils because of stable app-portage/eix 

everything else is fine though because of: 

```
DEPEND="${DEPEND}

        || ( app-arch/xz-utils app-arch/lzma-utils )"
```

----------

## cach0rr0

meaningless as my test is, since this isn't real data, but rather hugely redundant

```

$ dd if=/dev/zero of=/home/meat/zeros

2421999+0 records in                               

2421999+0 records out                                

1240063488 bytes (1.2 GB) copied, 22.5843 s, 54.9 MB/s

```

I wasn't interested in time so much as overall compression.

```
lzma -z -M max -T 8 -c -9 zeros > zeros.lzma

bzip2 zeros
```

yielded:

```

-rw-r--r-- 1 meat meat  909 Jan 12 21:04 zeros.bz2

-rw-r--r-- 1 meat meat 171K Jan 12 21:07 zeros.lzma

```

I don't know how useful this is in determining the extent to which lzma compresses massively redundant data vs. bz2, but I thought it worth sharing.

----------

## LesCoke

Compression algorithms use various techniques to reduce size by replacing redundancy with shorthand.  I suspect that a single copy of your original log file will compress to a size very near that of the file containing multiple copies.

Text compresses very well because each word / phrase can be replaced with a number.  Frequent words get smaller numbers than less frequent words.

Files containing long sequences of identical bytes can be compressed to a shorthand form: duplicate value XX, N times, ...
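That "duplicate value, N times" idea is easy to see directly (a quick sketch; the exact byte counts depend on the gzip version, but the orders of magnitude are the point):

```shell
# A megabyte of identical bytes collapses to almost nothing,
# while a megabyte of random bytes barely compresses at all.
zeros=$(head -c 1000000 /dev/zero    | gzip -9 | wc -c)
random=$(head -c 1000000 /dev/urandom | gzip -9 | wc -c)

echo "1MB of zeros gzips to $zeros bytes; 1MB of random data to $random bytes"
```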

I'd be more interested in the results of compressing a large e-book.

Les

----------

## devsk

 *ppurka wrote:*   

> Very interesting set of tests. Does this have any implication to the usage of lzma to compress the kernel? 
> 
> Personally, I have the compression in the kernel set to gzip because "apparently" gzip takes less time to decompress. Now, this "apparently" is in question  
> 
> Also, does the compression of the 8MB original file by lzma also lead to same final size of ~800kB? This would mean that lzma is really close to the actual entropy of the source 

Yeah. I had an intermediate file of 36MB, which I had cat'ed six times. That 36MB file compressed to 790KB... :Smile:  So, LZMA is pushing almost to the limit of the source entropy with -9.

----------

## ppurka

```
/var/tmp/portage> time lzma -z -c -9 a > a.lzma

lzma -z -c -9 a > a.lzma  422.27s user 1.12s system 98% cpu 7:11.23 total

/var/tmp/portage> time gzip -c -9 a > a.gz; time bzip2  -c -9 a > a.bz2    

gzip -c -9 a > a.gz  24.46s user 0.13s system 98% cpu 24.896 total

bzip2 -c -9 a > a.bz2  54.13s user 0.16s system 98% cpu 55.103 total

/var/tmp/portage> ll

total 253M

-rw-r--r-- 1 root root 215M Jan 13 00:00 a

-rw-r--r-- 1 root root  16M Jan 13 00:14 a.bz2

-rw-r--r-- 1 root root  22M Jan 13 00:13 a.gz

-rw-r--r-- 1 root root 253K Jan 13 00:12 a.lzma

/var/tmp/portage> cp /var/log/emerge.log a-orig

/var/tmp/portage> time lzma -z -c -9 a-orig > a-orig.lzma; time gzip -c -9 a-orig > a-orig.gz; time bzip2 -c -9 a-orig > a-orig.bz2 

lzma -z -c -9 a-orig > a-orig.lzma  6.53s user 0.03s system 98% cpu 6.657 total

gzip -c -9 a-orig > a-orig.gz  0.38s user 0.00s system 98% cpu 0.393 total

bzip2 -c -9 a-orig > a-orig.bz2  0.87s user 0.01s system 98% cpu 0.883 total

/var/tmp/portage> ll

total 257M

-rw-r--r-- 1 root root 215M Jan 13 00:00 a

-rw-r----- 1 root root 3.4M Jan 13 00:15 a-orig

-rw-r--r-- 1 root root 260K Jan 13 00:16 a-orig.bz2

-rw-r--r-- 1 root root 349K Jan 13 00:16 a-orig.gz

-rw-r--r-- 1 root root 222K Jan 13 00:16 a-orig.lzma

-rw-r--r-- 1 root root  16M Jan 13 00:14 a.bz2

-rw-r--r-- 1 root root  22M Jan 13 00:13 a.gz

-rw-r--r-- 1 root root 253K Jan 13 00:12 a.lzma

```

This is the real comparison (with -9 for both gzip and bzip2)  :Smile: 

By the way, you guys have some different version of lzma. My version (lzma-utils is installed) doesn't have support for -T or -M.  Secondly, you have a really good system there, devsk! My lzma took over 7 minutes!

*Mike Hunt wrote:*   

> I needed to cat /var/log/messages over 12 million times, and boy are my fingers sore!

 For loops to the rescue for me  :Smile: 

```
for ((i=0;i<=60;i++)); do cat /var/log/emerge.log >> /var/tmp/portage/a; done
```

----------

## Mike Hunt

Actually, I used a loop like this:

```
for i in $(seq 1 1000000); do cat /var/log/messages >> /var/tmp/mytestfile2; done
```

Otherwise, I would probably be cat'ing for a couple of months!   :Laughing: 

----------

## devsk

My main box has an i7 920 OCed to 4.4GHz... :Smile:  So, it just tears through stuff!

I have xz-utils. The -T option currently doesn't do anything. Once that gets implemented, I think the lzma compression will just fly. Think 8 threads with HT on...yeah baby! Parallel mksquashfs creates a 600MB livecd in under 40 seconds.

----------

## devsk

I found this little utility called freearc. I am not sure why it is not in portage. It is extremely fast and extremely efficient at compressing. Get a load of this:

```
$ time ./arc create -mt8 -m9 /var/tmp/mytestfile2.arclzma /var/tmp/mytestfile2

FreeArc 0.60 creating archive: /var/tmp/mytestfile2.arclzma                   

Compressed 1 file, 217,453,638 => 703,401 bytes. Ratio 0.3%                   

Compression time: cpu 9.43 secs, real 5.66 secs. Speed 38,409 kB/s            

All OK                                                                        

real    0m5.710s

$ cd /

$ \rm  /var/tmp/mytestfile2

$ time unarc x /var/tmp/mytestfile2.arclzma

FreeArc 0.60 unpacker. Extracting archive: mytestfile2.arclzma

Extracting var/tmp/mytestfile2 (217453638 bytes)

All OK

real    0m0.692s

$ md5sum /var/tmp/mytestfile2.org /var/tmp/mytestfile2

24696247c934b7d581c156f001f362b6  /var/tmp/mytestfile2.org

24696247c934b7d581c156f001f362b6  /var/tmp/mytestfile2

```

So, not only did this program create a file that is just 703KB (12% smaller than xz-utils), it did it in 5.71 seconds compared to 40+ seconds for xz-utils. It decompresses slower than xz-utils, but it is still sub-second, so no big deal. Besides, it's faster than gzip at decompression.

Now that's what I call compression!

----------

## mv

 *devsk wrote:*   

> The file is a text file created by cat'ing /var/log/messages over and over.

 

Such tests are bogus: they essentially measure only the size of the compressor's dictionary. If one copy of the file is longer than the dictionary (which is probably the case for most of the compressors you used, except perhaps lzma-utils/xz-utils with -9), those compressors are of course worse by some factor, since a compressor with a sufficiently large dictionary essentially stores only "now repeat the last thing x times". If the original file (the one being repeated) gets larger, then lzma-utils/xz-utils will also "suddenly" produce results that are larger by some factor.
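The dictionary effect is easy to demonstrate (a sketch with arbitrary temp files; it assumes GNU head and a default-level xz, whose 8MB dictionary easily spans the 1MB gap while gzip's 32KB window cannot):

```shell
# One megabyte of random data, then the same megabyte repeated once.
one=$(mktemp); two=$(mktemp)
head -c 1000000 /dev/urandom > "$one"
cat "$one" "$one" > "$two"            # second half is an exact repeat of the first

gz=$(gzip -9 -c "$two" | wc -c)       # ~2MB: the repeat is outside gzip's 32KB window
xz=$(xz -c "$two" | wc -c)            # ~1MB: the repeat fits in xz's dictionary

echo "gzip: $gz  xz: $xz"
rm -f "$one" "$two"
```

gzip pays full price for the second copy; xz stores it as, effectively, "repeat the last megabyte". That is exactly why a cat'ed-together test file flatters large-dictionary compressors.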

----------

## d2_racing

 *ppurka wrote:*   

> Very interesting set of tests. Does this have any implication to the usage of lzma to compress the kernel? 

 

In fact, I know that lzma can be used for the kernel, for the record, but I never tested it.

Has anyone actually tested that?

----------

## mv

 *d2_racing wrote:*   

>  *ppurka wrote:*   Very interesting set of tests. Does this have any implication to the usage of lzma to compress the kernel?  
> 
> In fact, I know that we can use lzma for the record, but I never tested it.

 

On x86 systems with 512MB RAM, it usually gives an out-of-memory error when booting with grub. On amd64 it works fine. The size difference is, as expected, a few percent; I do not remember the exact figure at the moment.

----------

## d2_racing

Out of memory, well that's weird  :Razz: 

----------

## mikegpitt

LZMA compression time is looong.  I was playing around with it on a livecd I was building, and the compression time took about 1.5+ hours.  This was up from around 15-20 mins for bzip2.  Annoying if you are editing and need to rebuild something many times.

----------

## eccerr0r

Gzip still has its uses but lzma is pretty nice...

```

doujima:/tmp# time lzma < vmlinux > vmlinux.lzma

real    0m6.763s

user    0m6.596s

sys     0m0.120s

doujima:/tmp# time bzip2 < vmlinux > vmlinux.bz

real    0m4.372s

user    0m4.280s

sys     0m0.043s

doujima:/tmp# time gzip -9 < vmlinux > vmlinux.gz

real    0m0.857s

user    0m0.827s

sys     0m0.030s

doujima:/tmp# ls -l vmlinux*

-rwxr-xr-x 1 root root 3494717 Jan 13 11:19 vmlinux*

-rw-r--r-- 1 root root 1653809 Jan 13 11:20 vmlinux.bz

-rw-r--r-- 1 root root 1685150 Jan 13 11:20 vmlinux.gz

-rw-r--r-- 1 root root 1399599 Jan 13 11:19 vmlinux.lzma

```

----------

