Never write it in 'C' if you can do it in 'awk';
Never do it in 'awk' if 'sed' can handle it;
Never use 'sed' when 'tr' can do the job;
Never invoke 'tr' when 'cat' is sufficient;
Avoid using 'cat' whenever possible.
--Taylor's Laws of Programming
Last month, we looked at loops and conditional execution. This time around, we'll look at a few of the simpler "external" tools (i.e., GNU utilities) that are commonly used in shell scripts.
Something to keep in mind as you read this article: the tools available to you as a script writer, as you might have guessed from the above quote, are arranged in a rough sort of a "power hierarchy". It's important to remember this - if you find yourself continually being frustrated by the limitations of a specific tool, it may not have enough "juice" to do the job.
Some time ago, while writing a script that processed Clipper database files, I found myself pushed up against the wall by the limitations of arrays in "bash"; after a day and a half of fighting it, I swore a bitter oath, glued a "screw it" label over the original attempt, and rewrote it in "awk".
It took 15 minutes.
I didn't tell anyone at the time; even my good friends would have taken a "clue-by-4" to my head to make sure that the lesson stuck...
Don't be stubborn about changing tools when the original one proves under-powered.
Strange as it may seem, 'cat' - which you've probably used on innumerable occasions - can do a number of useful things beyond simple catenation. As an example, 'cat -v file.txt' will print the contents of "file.txt" to the screen - and will also show you all the non-text characters that might normally be invisible (this excludes the standard textfile characters such as `end-of-line' and `tab'), in "'^' notation". This can be very useful when you've got something that is supposed to be a text file, but various utilities keep failing to process it and give errors like "This is a binary file!". This capability can also come in handy when converting files from one type to another (see the section on 'tr'). If you decide you'd like to see all the characters in the file, the `-A' switch will fill the bill - `$' signs will show the end-of-lines (the buck stops here?), and `^I' will show the tabs.
'-n' is another useful option. This one will number all the lines (you can use `-b' to number only the non-blank lines) of a file - very useful when you want to create a `line selector', i.e., whenever you want to have a "handle" for a specific line which you would then pass to another utility, say, 'sed' (which is very happy with line numbers).
'cat' can also serve as a "mini-editor", if you need to insert more than a line or two into a file during the execution of your script. In most cases, the built-in 'read' function of 'bash' will take care of that sort of thing - but it is designed as more of a "question/reply" mechanism; 'cat' is a bit more useful for file input.
Last, but not least, 'cat' is very useful for displaying formatted text,
e.g., the error messages at the beginning of a shell script.
Here are two script "snippets" for comparison:
-f Force mode:
if no mental activity is detected,
take a Scientific Wild-Ass Guess (SWAG) and execute.
-n Read your neighbor's mind; commonly used to retrieve
the URLs of really good porno sites.
-r Reboot brain via TCP (Telepathic Control Protocol) - for
those times when you're drawing a complete blank.
-s Read the supervisor's mind; implies the '-f' option.
I tend to think of 'cat' as an "initial processor" for text that will be further worked on with other tools. That's not to say that it's unimportant - in some cases, it's almost irreplaceable. Indeed, your 'cat' can do tricks that are not only entertaining but useful... and you don't even need a litter box.
When it comes to "one character at a time" processing, this utility, despite it's oddities in certain respects (e.g., characters specified by their ASCII value have to be entered in octal), is one of the most useful ones in our toolbox. Here's a script using it that replaces those "DOS-text-to-Unix" conversion utilities:
cat $1|tr -d '\015'
<grin> I guess I'd better take time to explain; I can already hear the screams of rage from all those folks who just learned about 'if' constructs in last month's column.
"What happened to that nice `if' statement you said we needed at the beginning of the script? and what's that `&&' thing?"
Believe it or not, it's all still there - at least the mechanism that makes the "right stuff" happen. Now, though, instead of using the structure of the statement and fitting our commands into the "slots" in the syntax, we use the return value of the commands, and make the logic do the work. Let's take a look at this very important concept. henever you use a command, it returns a code on exit - typically 0 for success, and 1 for failure (exceptions are things like the 'length' function, which returns a value). Some programs return a variety of numbers for specific types of exits, which is why you'd normally want to test for zero versus non-zero, rather than testing for `1' specifically. You can implement the same mechanism in your scripts (this is a good coding policy): if your script generates a variety of messages on different exit conditions, use 'exit n' as the last statement, where `n' is the code to be returned. The plain 'exit' statement returns 0. These codes, by the way, are invisible - they're internal "flags"; there's nothing printed on the screen, so don't bother looking.
To test for them, 'bash' provides a simple mechanism - the reserved
words `&&' (logical AND) and `||' (logical OR). In the script above,
the statement basically says "if $1 has a length of zero, then the following
statements (echo... echo... exit)
should be executed". If you're not familiar with binary logic, this may
be confusing, so here's a quick rundown that will suffice for our purposes:
in an 'A && B' statement,
if 'A' is true, then 'B' will also be true (i.e., if 'B' is a command,
it will be executed). In an
'A || B' statement, if 'A' is false, then 'B' will be true (i.e., executed). The converse of either statement is obvious (a.k.a. "is left as an exercise for the student".) <grin>
As a comparison, here are two script fragments that do much the same thing:
You have to be a bit cautious about using the second version for anything
more complex than "echo" statements: if you use a command in the part after
the `&&' which returns a failure code, both it and the statements
after `||' will be executed! This in itself can be useful, if that's what
you need - but you have to be aware of how the mechanism works.
Back to the original "d2u" script - note the use of the `\' character at the end of the second line: this `escape' character "cancels" the `end-of-line' character, making the following line a continuation of the current one. This is a neat trick for enhancing readability in scripts with long lines, allowing you to visually break them while maintaining program continuity. I put it between the "echo" statement and the text string for a reason: whitespace (here, spaces and tabs) makes no difference to a command string but stands out like a beacon in text, creating ugly formatting problems. Make sure your line breaks happen in reasonable places in the command string - i.e., not in the middle of text or quoted syntax.
The "active" part of the script, "cat $1|tr -d '\015'", pipes the original text into 'tr', which deletes DOS's "CR/Carriage Return" character (0x0D), shown here in octal (\015). That's the bit... err, _byte_ that makes DOS text different from Unix text - we use just the "LF/Newline" character (0x0A), while DOS uses both (CR/LF). This is why Unix text looks like
This is line one*This is line two*This is line three*
in DOS, and DOS text like
This is line one^M
This is line two^M
This is line three^M
"A word to the wise" applicable to any budding shell-script writer: close study of the "tr" man page will pay off handsomely. This is a tool that you will find yourself using again and again.
A very useful pair of tools, with mostly identical syntax. By default they print, respectively, the first/last 10 lines of a given file; the number and the units are easily changed via syntax. Here's a snippet that shows how to read a specific line in a file, using its line number as a "handle" (you may recall this from the discussion on "cat"):
These programs can also be used to "identify" very large files without the necessity of reading the whole thing; if you know that one of a number of very large databases contains a unique field name that identifies it as the one you want, you can do something like this:
So - the above case is simple enough; we take the first 10k bytes (you'd adjust it to whatever size chunk is necessary to capture all the field names) off the top of each database by using 'head', then use 'grep' to look for the string. If it's found, we print the name of the file. Those of you who have to deal with large numbers of multi-gigabyte databases can really appreciate this capability.
'tail' is interesting in its own way; one of the syntax differences is the '+' switch, which answers the question of "how do I read everything after the first X characters/lines?" Believe it or not, that can be a very important question - and a very difficult one to answer in any other way... (Also sprach The Voice of Bitter Experience.)
In my experience, 'cut' comes in for a lot more usage than 'paste' - it's very good at dealing with fields in formatted data, allowing you to separate out the info you need. As an example, let's say that you have a directory where you need to get a list of all the files that are 100k or more in size, once a week (logfiles over a size limit, perhaps). You can set up a "cron" job to e-mail you:
'ls -r --sort=size $dir' gives us a listing of `$dir' sorted by size in `reverse' order (smallest to largest). We pipe that through "tr -s ' '" to collapse all repeated spaces to a single space, then use "cut" with space as a delimiter (now that the spaces are singular, we can actually use them to separate the fields) to return fields 5 and 9 (size and filename). We then use 'grep' to look at the very beginning of the line (where the size is listed) and print every line that starts with a digit between 1 and 9, repeats that match 5 times, and is followed by a space. The lines that match are piped into 'mail' and sent off to the recipient.
'paste' can be useful at times. The simplest way of describing it that
I can think of is a "vertical 'cat'" - it merges files line by line,
instead of "head to tail". As an example, I had a long list of songs followed by the names of the groups that performed them, and I wanted the song names to be in quotes. The songs were separated from the names by tabs. Here was the solution:
cut -f 1 $1 > groups
# 'Tab' is the default separator
cut -f 2- $1 > songs
for n in $(seq $(grep -c $ songs))
paste -d "" quotes songs quotes
paste list1 groups > list
rm quotes songs groups list1
The "Vise-Grips" of Unix. :) This utility, as well as its more specialized relatives 'fgrep' and 'egrep', is used primarily for searching files for matching text strings, using the 'regexp' (Regular Expression) mechanism. (There are actually two of these, the 'basic' and the 'extended', either one of which can be used; the 'basic' is the default for 'grep'.)
"Let's see now; I know the quote that I want is in of these 400+ text files in this directory - something about "Who hath desired the Sea". What was it, again?..."
Odin:~$ grep -iA 12 "who hath desired the sea" *
Poems.txt-Who hath desired the
Sea? - the sight of salt water unbounded -
Poems.txt-The heave and the halt and the hurl and the crash of the comber
Poems.txt-The sleek-barrelled swell before storm, grey, foamless, enormous,
Poems.txt- and growing -
Poems.txt:Stark calm on the lap of the Line or the crazy-eyed hurricane
Poems.txt- blowing -
Poems.txt-His Sea in no showing the same - his Sea and the same 'neath each
Poems.txt- His Sea as she slackens or thrills?
Poems.txt-So and no otherwise - so and no otherwise - hillmen desire their
"Yep, that was the one; so, it's in `Poems.txt'..."
'grep' has a wide variety of switches (the "-A <n>" switch that I used above determines the number of lines of context after the matched line that will be printed; the "-i" switch means "ignore case") that allow precise searches within a single file or a group of files, as well as specifying the type of output when a match is found (or conversely, when no match is found). I've used 'grep' in several of the "example" scripts so far, and use it, on the average, about a dozen times a day, command line and script usage together: the search for the above Kipling quote (including my muttered comments) happened just a few minutes before I sharpened my cursor and scribbled this paragraph.
You can also use it to search binary files too, with 'strings' (a utility that prints only the text strings found in binary files) as a useful companion: an occasionally useful "last-ditch" procedure for those programs where the author has hidden the help/syntax info behind some obscure switch, and 'man', 'info', and the '/usr/doc/' directory come up empty.
Often, there is a requirement for performing some task the same number of times as there are lines in a given file, e.g., reading in each line of a configuration file and parsing it. 'grep' helps us here, too:
This is a snippet from a scheduling program I wrote some time ago; whenever I log in, it reminds me of appointments, etc. for that day. 'grep', in this instance, numbers the lines (this is used in further processing - not shown) and polls every line for the "end-of-line" metacharacter ('$') which matches every line in the file. The result is then parsed into the date and text variables, and the script executes an "alarm and display" routine if the appointment date matches today's date.
In order to produce good shell scripts, you need to be very familiar
with how all of these tools work - at the very least, have a good idea
what a given tool can and cannot do (you can always look up the exact syntax via 'man'). There are many other, more complex tools that are available to us - but these six programs will get you started and keep you going for a long time, as well as giving you a broad field of possibilities for script experimentation of your own.
Until next month -
I used to program my IBM PC to make hideous noises to wake me up. I also made the conscious decision to hard-code the alarm time into the program, so as to make it more difficult for me to reset it. After I realised that I was routinely getting up, editing the source file, recompiling the program and rerunning it for 15 minutes extra sleep, before going back to bed, I gave up and made the alarm time a command-line option.