Use multiple CPU threads/cores to make tar compression faster

On many Unix-like systems, tar is a widely used tool for packaging and compressing files, and it ships with almost every common Linux and BSD distribution. However, tar often spends a lot of time on compression, because the program itself doesn't do multi-threaded compression. Fortunately, tar can hand compression off to an external program of your choice, which means we can use a compressor that supports multi-threading and get much higher speed!

From the tar manual (man tar), we can see:

-I, --use-compress-program PROG
    filter through PROG (must accept -d)

With the -I or --use-compress-program option, we can choose the external compressor program we'd like to use.

The three parallel compression tools I will use today can all be installed easily with apt install on Debian/Ubuntu-based GNU/Linux distributions. Here are the formats and the corresponding apt package names. Note that newer versions of Ubuntu and Debian no longer ship the pxz package, but pixz does much the same job:

  • gz:   pigz
  • bz2: pbzip2
  • xz:   pxz, pixz

Originally, the tar commands with compression look like this:

  • gz:   tar -czf tarball.tgz files
  • bz2: tar -cjf tarball.tbz files
  • xz:   tar -cJf tarball.txz files

The multi-threaded versions:

  • gz:   tar -I pigz -cf tarball.tgz files
  • bz2: tar -I pbzip2 -cf tarball.tbz files
  • xz:   tar -I pixz -cf tarball.txz files
  • xz:   tar -I pxz -cf tarball.txz files
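Since the parallel tools are not always installed, a small wrapper can fall back to the regular compressor. This is only a sketch: the function names pick_compressor and partar are made up for illustration, and it assumes GNU tar (for -I) and a POSIX shell.

```shell
# Print the parallel compressor for a format if it is installed,
# otherwise the usual single-threaded tool.
pick_compressor() {
    case "$1" in
        gz)  parallel=pigz;   fallback=gzip  ;;
        bz2) parallel=pbzip2; fallback=bzip2 ;;
        xz)  parallel=pixz;   fallback=xz    ;;
        *)   echo "unknown format: $1" >&2; return 1 ;;
    esac
    if command -v "$parallel" >/dev/null 2>&1; then
        echo "$parallel"
    else
        echo "$fallback"
    fi
}

# Hypothetical wrapper: partar FORMAT ARCHIVE FILES...
partar() {
    format=$1; archive=$2; shift 2
    prog=$(pick_compressor "$format") || return 1
    # -I hands compression to the chosen external program (GNU tar).
    tar -I "$prog" -cf "$archive" "$@"
}
```

Usage would look like: partar gz tarball.tgz files. Extraction goes through the same option, e.g. tar -I pigz -xf tarball.tgz, because tar calls the program with -d to decompress (hence the "must accept -d" note in the manual).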

I am going to use the Linux kernel v3.18.6 source tree as the compression example: I put the whole directory on a ramdisk, compressed it there, and then compared the difference!
(PS: the CPU is an Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz, 4 cores, 4 threads, with 16GB RAM)

Result comparison:

[Image: tarCompressComparison1]

Time spent:
                  gzip       bzip2      xz
Single-thread     17.466s    50.004s    3m54.735s
Multi-thread      4.623s     13.818s    1m10.181s
Speedup           3.78x      3.62x      3.34x
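The speedup figures are simply the single-thread time divided by the multi-thread time. As a quick check, here is a sketch that redoes the arithmetic from the timings above; the helper names to_seconds and speedup are my own, and it assumes durations in the "3m54.735s" style that the shell's time prints.

```shell
# Convert a duration such as "3m54.735s" or "17.466s" to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk '{
        s = $0
        sub(/s$/, "", s)            # drop the trailing "s"
        n = split(s, parts, "m")    # separate minutes from seconds, if any
        if (n == 2) printf "%.3f\n", parts[1] * 60 + parts[2]
        else        printf "%.3f\n", parts[1]
    }'
}

# speedup SINGLE MULTI: how many times faster the multi-thread run was.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2fx\n", a / b }'
}

speedup 17.466s   4.623s     # gzip  vs pigz
speedup 50.004s   13.818s    # bzip2 vs pbzip2
speedup 3m54.735s 1m10.181s  # xz    vs pixz/pxz
```

On a 4-core machine the ideal would be 4x, so all three tools get fairly close to linear scaling.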

Because I didn't pass any options to the compressors and just let them use their default compression levels, the resulting file sizes may differ a little, but they are quite close. We can still pass options to the external compression program like this: tar -I "pixz -9" -cf tarball.txz files. Just quote the command together with its arguments, which is also pretty easy.
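The same quoting trick works for other options, such as limiting the number of threads. The sketch below builds the quoted string for each tool; compressor_cmd is a hypothetical helper, and the thread flags (pigz -p N, pbzip2 -pN, pixz -p N, xz -T N) are what I understand from each tool's documentation, so check the man pages of your installed versions.

```shell
# compressor_cmd PROG LEVEL THREADS
# Print a command string suitable for tar -I "...".
compressor_cmd() {
    prog=$1; level=$2; threads=$3
    case "$prog" in
        pigz|pixz) printf '%s -%s -p %s\n' "$prog" "$level" "$threads" ;;
        pbzip2)    printf '%s -%s -p%s\n'  "$prog" "$level" "$threads" ;;  # no space after -p
        xz)        printf '%s -%s -T %s\n' "$prog" "$level" "$threads" ;;
        *)         printf '%s -%s\n'       "$prog" "$level" ;;             # unknown tool: level only
    esac
}

compressor_cmd pigz 9 4
compressor_cmd pbzip2 9 4
```

The result is then passed as one quoted argument, for example: tar -I "$(compressor_cmd pigz 9 4)" -cf tarball.tgz files.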

[Image: tarCompressComparison2]

With -9 to raise the compression level (which may need more memory while compressing), the result shrinks from 84479960 bytes to 81020940 bytes, so we save an additional 3.3 megabytes! (It also took about 40 more seconds, so it's your call!)
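To put the saving in proportion, here is the arithmetic on the two sizes quoted above, as a small sketch (the variable names are just for illustration):

```shell
default_size=84479960   # bytes, default level (from the post)
level9_size=81020940    # bytes, with -9 (from the post)

saved=$((default_size - level9_size))
saved_mib=$(awk -v b="$saved" 'BEGIN { printf "%.2f", b / 1048576 }')
saved_pct=$(awk -v b="$saved" -v t="$default_size" 'BEGIN { printf "%.1f", b * 100 / t }')

echo "saved $saved bytes ($saved_mib MiB, $saved_pct% of the default-level archive)"
# prints: saved 3459020 bytes (3.30 MiB, 4.1% of the default-level archive)
```

So -9 buys roughly 4% in size for about 40 extra seconds on this data set.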

This is very useful for me!!!

Choosing a zip compression algorithm

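Over the past half year I have been using zip compression a bit more often. To save space, and to compare the compression ratios of different algorithms along the way, I wrote up these notes…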

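The common archive formats (on the Windows platform, at least) are mainly zip, 7z and rar. rar is arguably the most cost-effective of the three in terms of time spent versus compression ratio. However, the format itself is patented and its compression feature requires a paid license, while Windows has shipped built-in zip compression and extraction since Win ME (bare-bones as it is), so no extra software is needed. As a result, zip is still a very common format for publicly exchanging files on Windows.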

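zip is, after all, a rather old format (an improved version, zipx, came later but never caught on). Its most criticized weaknesses are probably its very poor compression ratio and its lack of Unicode file name support. Compressing a random selection of my own files for comparison, I found that zip easily takes about 50% more space than rar or 7z (though of course this depends on the types of files inside).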

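Most archivers only let the user choose the archive format when compressing, not the compression algorithm. 7-zip happens to offer that option, so I used 7-zip to create zip archives for a comparison.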

[Image: 7-zip_zip_algorithm]
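The same choice is available from the 7-zip command line through the -mm switch. The sketch below only prints the commands rather than running them; zip_cmd, the archive names and the files argument are placeholders of mine, and the method list is the set of zip methods I know 7-zip to document.

```shell
# zip methods 7-zip can write into a .zip archive (per the 7-zip documentation)
zip_methods="Deflate Deflate64 BZip2 LZMA PPMd"

# zip_cmd METHOD ARCHIVE FILES: print the 7z command for one method.
zip_cmd() {
    printf '7z a -tzip -mm=%s %s %s\n' "$1" "$2" "$3"
}

for m in $zip_methods; do
    zip_cmd "$m" "test-$m.zip" files
done
```

Running each printed command on the same input and comparing the resulting file sizes gives the kind of comparison described here. Keep in mind that methods other than Deflate may not open in every zip tool.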
