Interesting usage of tar... but what is happening?

I saw the following interesting usage of tar in a co-worker's Bash scripts:

`tar cf - * | (cd <dest> ; tar xf - )`

Apparently it works much like rsync -av does, but faster. The question arises, how?

-m


EDIT: Can anyone explain why this solution should be preferable to the following?

cp -rfp * dest

Is the former faster?

Answers


On the difference between cp and tar to copy the directory hierarchies, a simple experiment can be conducted to show the difference:

alastair box:~/hack/cptest [1134]% mkdir src
alastair box:~/hack/cptest [1135]% cd src
alastair box:~/hack/cptest/src [1136]% touch foo
alastair box:~/hack/cptest/src [1137]% ln -s foo foo-s
alastair box:~/hack/cptest/src [1138]% ln foo foo-h
alastair box:~/hack/cptest/src [1139]% ls -l
total 0
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo-h
lrwxrwxrwx  1 alastair alastair    3 Nov 25 14:59 foo-s -> foo
alastair box:~/hack/cptest/src [1142]% mkdir ../cpdest
alastair box:~/hack/cptest/src [1143]% cp -rfp * ../cpdest
alastair box:~/hack/cptest/src [1144]% mkdir ../tardest
alastair box:~/hack/cptest/src [1145]% tar cf - * | (cd ../tardest ; tar xf - )
alastair box:~/hack/cptest/src [1146]% cd ..
alastair box:~/hack/cptest [1147]% ls -l cpdest
total 0
-rw-r--r--  1 alastair alastair    0 Nov 25 14:59 foo
-rw-r--r--  1 alastair alastair    0 Nov 25 14:59 foo-h
lrwxrwxrwx  1 alastair alastair    3 Nov 25 15:00 foo-s -> foo
alastair box:~/hack/cptest [1148]% ls -l tardest
total 0
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo-h
lrwxrwxrwx  1 alastair alastair    3 Nov 25 15:00 foo-s -> foo

The difference is in the hard-linked files. Notice how cp copied the two hard-linked files as separate files (link count 1), while tar kept them linked together (link count 2). To make the difference more obvious, have a look at the inodes for each:

alastair box:~/hack/cptest [1149]% ls -i cpdest
24690722 foo  24690723 foo-h  24690724 foo-s
alastair box:~/hack/cptest [1150]% ls -i tardest
24690801 foo  24690801 foo-h  24690802 foo-s

There are probably other reasons to prefer tar, but this is one big one, at least if you have extensively hard-linked files.


It writes the archive to standard output, then pipes it to a subshell (the part wrapped in parentheses) that changes to a different directory and extracts the archive from standard input. The dash after the f option is what tells tar to use standard output or standard input instead of a file. It's basically copying all the visible files and subdirectories of the current directory to another directory.
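
The same copy can be sketched without the subshell, assuming a tar that accepts -C (GNU tar and bsdtar both do). The scratch directory and the names src, dest, and file.txt are made up for illustration:

```shell
# Scratch area; src and dest are placeholder names for this sketch.
work=$(mktemp -d)
mkdir -p "$work/src/sub" "$work/dest"
echo hello > "$work/src/sub/file.txt"

# "f -" on the creating tar means "write the archive to stdout";
# "f -" on the extracting tar means "read the archive from stdin".
# -C makes tar change directory itself, replacing the cd in a subshell.
tar -C "$work/src" -cf - . | tar -C "$work/dest" -xf -

cat "$work/dest/sub/file.txt"
# rm -rf "$work"   # uncomment to clean up
```

Whether you prefer -C or the (cd ...; tar ...) form is mostly a matter of taste; the subshell form also works with tar implementations that lack -C.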


For a directory with 25,000 empty files:

$ time { tar -cf - * | (cd ../bar; tar -xf - ); }
real    0m4.209s
user    0m0.724s
sys     0m3.380s

$ time { cp * ../baz/; }
real    0m18.727s
user    0m0.644s
sys     0m7.127s

For a directory with 4 files of 1073741824 bytes (1GB) each:

$ time { tar -cf - * | (cd ../bar; tar -xf - ); }
real    3m44.007s
user    0m3.390s
sys     0m25.644s

$ time { cp * ../baz/; }
real    3m11.197s
user    0m0.023s
sys     0m9.576s

My guess is this phenomenon is highly filesystem-dependent. If I'm right you will see a drastic difference between a filesystem that specializes in numerous small files, such as reiserfs 3.6, and a filesystem that is better at handling large files.

(I ran the above tests on HFS+.)


This is a neat use of pipes. Normally the first tar would write to a file, but here it writes to stdout (the -), which is piped to the second tar, which reads from stdin rather than from a file. It is the same thing as tarring to a file and untarring it later, except without the file in between.


The PowerTools book has the copy as:

tar cf - * | (cd <dest> && tar xvBf - )

The '&&' is a conditional that checks the return code of the preceding command. That is, if the "cd <dest>" failed, the "tar xf -" would not be executed. I always throw in a -v (verbose) and a -B (reblock input).

I use tar all the time. It is especially useful for copying to a remote system, such as:

tar cvf - . | ssh someone@somemachine '(cd somewhere && tar xBf -)'
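
The difference between ';' and '&&' inside the subshell can be seen without copying anything. This is only a sketch: the directory name nonexistent is deliberately missing, and echo stands in for the extracting tar.

```shell
work=$(mktemp -d)
cd "$work"

# With ';' the command after cd runs even though cd failed,
# so it runs in the directory the subshell started in.
with_semicolon=$( (cd nonexistent 2>/dev/null ; echo "would extract in: $(pwd)") )

# With '&&' the command after cd is skipped when cd fails.
with_and=$( (cd nonexistent 2>/dev/null && echo "would extract in: $(pwd)") || true )

echo "; : ${with_semicolon:-<nothing ran>}"
echo "&&: ${with_and:-<nothing ran>}"
# rm -rf "$work"   # uncomment to clean up
```

With a real tar in place of echo, the ';' variant would unpack the archive on top of the source directory whenever <dest> is mistyped or missing.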


tar cf - * | (cd <dest> ; tar xf - )

is going to tar all non-hidden files and directories of the current directory to stdout, then pipe that into a new subshell's stdin. That subshell first changes its current working directory to <dest>, and then untars the archive into that directory.
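
Hidden files are skipped because the shell expands * before tar ever runs. A small sketch of the difference between * and . follows; the directory names src, star, and dot are invented for the example:

```shell
work=$(mktemp -d)
mkdir "$work/src" "$work/star" "$work/dot"
touch "$work/src/visible" "$work/src/.hidden"
cd "$work/src"

# '*' is expanded by the shell and skips names starting with a dot.
tar cf - * | (cd ../star && tar xf -)

# '.' hands the whole directory to tar, dotfiles included.
tar cf - . | (cd ../dot && tar xf -)

ls -A ../star
ls -A ../dot
# rm -rf "$work"   # uncomment to clean up
```

Using . also avoids "argument list too long" errors in directories with very many entries, since the shell no longer has to expand every name onto the command line.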


Some old versions of cp didn't have -f / -p (and similar) options for preserving permissions, so this tar trick did the job.


I believe the tar will do a Windows style 'merge' operation with deeply nested directories, whereas the cp will overwrite sub-directories.

For example if you have the layout:

dir/subdir/file1

and you copy it to a destination that contains:

dir/subdir/file2

Then with copy you will be left with:

dir/subdir/file1

But with the tar command, your destination will contain:

dir/subdir/file1
dir/subdir/file2
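
This is easy to check on your own system. The sketch below rebuilds the layout from the example in a scratch directory (tardest and cpdest are names chosen here, not part of the original example) and lists what each destination holds afterwards. It makes no claim about what every cp implementation does, so compare the two listings yourself:

```shell
work=$(mktemp -d)
mkdir -p "$work/src/dir/subdir" "$work/tardest/dir/subdir" "$work/cpdest/dir/subdir"
touch "$work/src/dir/subdir/file1"
touch "$work/tardest/dir/subdir/file2" "$work/cpdest/dir/subdir/file2"
cd "$work/src"

# Copy the same tree into each pre-populated destination.
tar cf - * | (cd ../tardest && tar xf -)
cp -rp * ../cpdest

# List what each destination holds after the copy.
echo "tar:"; (cd ../tardest && find . -type f | sort)
echo "cp:";  (cd ../cpdest && find . -type f | sort)
# rm -rf "$work"   # uncomment to clean up
```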

tar cf - *

This uses tar to send * to stdout

|

This does the obvious redirect of stdout to...

(cd <dest> ; tar xf - )

This, which changes PWD to the appropriate location and then extracts from stdin

I do not know why this would be faster than rsync, as there is no compression involved.


The tar solution will preserve symbolic links, whereas cp will just make copies and destroy the links.

tar has been a standard Unix utility a lot longer than rsync. You're more likely to find it available when a directory hierarchy needs to be copied to another location (even another computer). rsync is probably easier to use these days, but it can be slower because it compares the source and the destination before syncing them. tar just copies in one direction.


If you have GNU cp (which most Linux-based systems do), then cp --archive will work, even on hard-linked files, and tar is not needed.
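
A minimal sketch of that, assuming GNU cp (other cp implementations may accept -a but treat hard links differently). The file names mirror the foo / foo-h / foo-s experiment from the earlier answer; the scratch directory is an assumption of this example:

```shell
work=$(mktemp -d)
mkdir "$work/src" "$work/dest"
touch "$work/src/foo"
ln "$work/src/foo" "$work/src/foo-h"   # hard link
ln -s foo "$work/src/foo-s"            # symbolic link

# -a is the short form of --archive; "src/." copies the directory's
# contents (dotfiles included) rather than the directory itself.
cp -a "$work/src/." "$work/dest/"

# foo and foo-h should show the same inode number in the first column.
ls -li "$work/dest"
# rm -rf "$work"   # uncomment to clean up
```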


As it happens, a co-worker wrote a nearly identical command into one of our scripts. After I spent some time puzzling over it, I asked why he had used that rather than cp. His answer, as I recall it, was that cp is slow when making a copy from one file system to another.

Whether or not this is true would require more testing than I care to spend on the question, but it makes a certain amount of sense. The first tar process reads from the source device as quickly as possible only waiting for that device to read. Meanwhile, the second tar process reads from its input pipe and writes as quickly as possible. It might have to wait for input, but if writes on the destination device are slower than reads on the source device it will only wait on the destination device. A single cp command will have to wait on both the source and the destination devices.

On the other hand, modern operating systems do a pretty good job of pre-caching IO operations. It's entirely possible cp will spend most of its time waiting on writes and getting reads from memory rather than from the device itself. It seems like one would need really solid data to choose two tar commands over the more straightforward cp command.

