...making Linux just a little more fun!
By Jim Dennis, Ben Okopnik, Dan Wilder, Breen, Chris, and the Editors of Linux Gazette... and You!
We have guidelines for asking and answering questions. Linux questions only, please.
We make no guarantees about answers, but you can be anonymous on request.
See also: The Answer Gang's Knowledge Base and the LG Search Engine.
From Kapil Hari Paranjape
Answered By: Kapil Hari Paranjape, Ben A. Okopnik, Thomas Adam, Jay R. Ashworth
Hello,
I had always thought that filtering files "in-place" was not possible from the command line... until today---one lives and learns.
dd if=file bs=4k | filter | dd of=file bs=4k conv=notrunc
Where "file" is the file you want to filter and "filter" is the filtering program you want to apply.
For examples, see http://www.itworld.com/nl/unix_sys_adm/09252002 which has a Perl solution.
[Thomas] Perl/sed/ruby all honour the '-i' flag, which is a start; then you just apply regexps for whatever filtering you want.
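As a concrete illustration (the substitution itself is made up), the sed variant of that switch looks like this; note that '-i' is a GNU extension, so a classical sed will not have it:

sed -i 's/colour/color/g' file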
[Ben] The "buffer" program does exactly the same as the above; the process is called "reblocking".
buffer < foo | filter > foo
[Ben] If the file is bigger than 1MB, you'll need to specify a larger queue with the "-m" option, but that's usually not an issue.
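For illustration, assuming a build of buffer(1) that accepts the usual k/m size suffixes, a larger queue might be requested like so:

buffer -m 8m < foo | filter > foo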
Conversely, as Thomas mentioned, you could use the "in-place edit" switch of Perl and friends:
# rot13
perl -i -wpe'y/a-zA-Z/n-za-mN-ZA-M/' file

# lc everything
perl -i -wpe'$_=lc' file
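A refinement worth mentioning: Perl's '-i' takes an optional backup suffix, so the original survives the edit. For example:

# rot13, keeping the original as "file.bak"
perl -i.bak -wpe'y/a-zA-Z/n-za-mN-ZA-M/' file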
[Ben] buffer < foo | filter > foo
[Jay] Oh, because buffer reads the entire file before the '>' can stomp it?
Well, that's not exactly the same...
Doesn't that still depend on order of evaluation by the shell? Is that defined?
[Thomas] Well, yes....
- buffer < foo
- | filter is acted upon
- Resultant output to file
[Jay] Well, not necessarily.
[Ben] Well, yeah - just about as definitively as anything in Bash is. Otherwise Kapil's method wouldn't work either. Neither would piping anything through "sort". The left side of the pipe has to terminate before the right side can do anything with the output; in many cases, there is no output until just before the left side terminates.
[Jay] In fact I think that's wrong: I don't think the dd method does depend on order of eval; the writing copy of dd can't try to write a block until it has it, so I believe that that method is guaranteed never to stomp data.
[Jay] A shell could (un)reasonably decide to evaluate the output redirection (ie: stomp on the file) before the buffer program can read it. At best, it might be a race condition between the two sides of the pipe.
I don't think, intuitively, that it's at all reliable, whereas I think the dd approach probably is.
[Ben] Uh, not any shell that contains a working implementation of IPC. One that's broken, certainly. Chances are that if time ran backwards, it probably wouldn't work too well either...
Please state the mean and the average probabilities and the relevant confidence levels for the accuracy of your intuitive approach. The data generated in the course of your study may or may not be used as grounds for questioning during your exam.
[Jay] Every shell programming book I've read in 20 years warns against that construct, precisely because most shells will set up the redirect first and stomp the output file. As for the pipeline, I believe that most shells exec the last component first. Maybe bash has changed that; I remember a warning about it in the Bourne book.
The nature of the thread changes slightly -- Thomas Adam
[Kapil]
Hi,
Just a few additional remarks:
(a) perl, python and vi/ex do offer alternate solutions ... but see below.
(b) I couldn't locate "buffer"---where do I find it?
[Thomas] Oddly enough, under Debian it is in the 'buffer' package.
[Kapil] (c) Just to defend the "dd" solution a bit:
When the "dd" command-line given in the earlier mail is terminated (for any reason like a Control-C), it outputs the number of blocks read/written. Thus, the intrepid user can restart the process by modifying it with suitable "seek" and "skip" commands. Of course, this assumes that the filter operates on data sizes less than 4k independently.
[Thomas] See the "-S", "-s", and "-z" to buffer(1)
[Kapil] I became aware of this "dd" procedure while trying to (yes I'm crazy) encrypt one entire disk partition in-place. The problem with the other solutions is that they require a lot of memory to run.
As far as reading and writing to pipes is concerned, here is how I understand it---please correct me if I am wrong. The kernel has its own internal settings for how much data is buffered before a writing process is put at the back of the queue and the reading process is woken up. Thus, killing any one process in the "dd" pipeline could only result in less data being written than was read---an error from which one can recover as described above.
[Ben] Since the source and the target file are the same, wouldn't you end up with some truncated version of your data (i.e., the source being gone no matter what)? It seems to me that the difference between complete destruction of data and truncation of it at some random point can only matter theoretically, except in a vanishingly small number of situations.
[Jay] No, you wouldn't.
The target side dd is doing random access.
It writes the blocks sequentially, but it writes them into the standing file, one at a time, without touching the blocks around them. Likewise on the read side. The killer is the redirection, which his approach does not use at all. Not the pipe.
[Ben] Ah. I hadn't realized that. In that case, I agree; there's a large difference. I've just tried it on a 100MB file I've made up for the purpose, and it seems that you're right.
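For anyone who wants to see conv=notrunc's in-place behaviour on something smaller than 100MB, here is a throwaway demonstration (the file name "demo" is arbitrary):

printf 'AAAAAAAAAA' > demo
printf 'BBB' | dd of=demo bs=1 conv=notrunc 2>/dev/null
cat demo
# prints BBBAAAAAAA: the first three bytes were overwritten, the rest left untouched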