LINUX GAZETTE
...making Linux just a little more fun!
Perl One-Liner of the Month: The Case of the Evil Spambots
By Ben Okopnik

A REPORTER'S NOTE

To forestall some sure-to-happen complaints, I'd like to underscore the necessity of having the current version of Perl (at least 5.8.0, as of this writing) in order to play with the scripts presented in these articles. One-liners, to a far greater degree than proper scripts, rely on new and unusual language features, and languages tend to "grow" new features and drop old, outdated ones as version numbers rise. Perl, heading for its 17th year of growth and development, is no exception.

One of a number of possible problems with one-liners is fragility, especially in those (many of them) which are dependent on cryptocontext, side effects, and undocumented features, which are likely - in fact, are certain - to change without notice. One-liners are hacks which often demonstrate some clever twist or feature, which encourages the use of all of the above. Remember - these are fun toys which (hopefully) lead to a better understanding of Perl; trying to use them as you would robust, solid code would be a serious error. If you don't understand the basics of Perl, this is not the place to start.
 

Debugging is twice as hard as writing the code in the first place. 
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it.
 -- Brian W. Kernighan


Caveat Lector (Let the reader beware).

Ben Okopnik
On board S/V "Ulysses", Saint Augustine, Florida


Frink Ooblick had fallen asleep at the keyboard. He had been alternately playing and trying to puzzle out the number-guessing game that Woomert had written (the first had proven easy, but the second still eluded him); in fact, his last unfinished game was still visible on the screen:


perl -wlne'BEGIN{$b=rand$=}$a=qw/Up exit Down/[($_<=>int$b)+1];print eval$a'
50
Down
25
Up
37
Up
44
Up
What was the secret? How did it work? [1] Frink's dreams were full of floating bits of code which spiraled off into the distance or mutated into monstrous shapes, threatening to consume the world. The hand shaking his shoulder, waking him, was therefore a welcome relief. Woomert stood at his side, looking impatient.

 - "Wake up, Frink, wake up! The game's afoot, you slug-a-bed; let's go!"

 - "Uh... Erm... I'm, uh, awake. What's up?"

 - "In the living room. Come on, come on, there's not a moment to lose!"

Frink's first sight of their visitor brought him to a stop. Used to dealing with the working crowd - sysadmins, techs, etc. - he had expected the usual scruffy-and-competent look, perhaps complete with hiking boots; what greeted his eyes was a fellow in a pinstripe suit, crisp white shirt, a red "power" tie, and lacquered black shoes. He had been impatiently pacing the floor, and brightened up considerably at the sight of Frink.

 - "Ah, this must be the second team member in your organizational hierarchy! Excellent; now, we can get into actualizing the power strategies that will reorganize this, erm, unpredicted opportunity into the profit slot on the balance sheet. All right, here's how we wind-tunnel this: the securitization of the computing resources is predicated on leveraging..."

Keeping a cautious eye on their visitor, Frink prison-whispered to Woomert: "What's he saying? And what language is it in?"

 - "It's Marketroid. You need to learn at least the basics of it; not that it's spoken by the people who sign the checks - they don't have much time for that sort of thing - but you're going to run into it in the business world, and it's best to be prepared. Usually, though, most of these people can still speak English; let's see if this fellow remembers how. Oh, Mr. Wibbley!"

Their visitor had just finished what he obviously considered an explanation of the problem, had switched off the overhead LCD projector, put away his laser pointer, and was looking at them in an expectant manner. Clearly, he had heard of Woomert's reputation and was relying on the famous Hard-Nosed Computer Detective to deal with... well, whatever it was.

 - "Mr. Wibbley - that was an excellent presentation, but I wonder if you could restate the problem in more basic terms for my assistant here. I'm afraid he's not up on proper business terminology, and has missed the more subtle points."


Their visitor heaved a sigh, and dropped into the nearby easy chair.

 - "Oh, sure. You know, they were going to send one of the system administrators to talk to you, but of course I insisted on doing the presentation myself as soon as I heard about it. After all, one of them wouldn't have even thought of using that textured salmon-and-peach background on the slides, and that's all the rage these days! Anyway, I did get a note from him that explains it in his own words; it's crude and unsophisticated, not at all proper marketing technique, but I suppose you fellows will understand it..."

The crumpled and coffee-stained napkin, most of which was covered with calculations, reminders, and something that looked like firewall rules, contained a short note framed with a red marker pen:
 

Woomert, spambots are harvesting the e-mail addresses on our website (we've tagged them with the "plus hack", [2] so we know where it's coming from); the amount of spam we're getting is growing by leaps and bounds. We need to have the addresses out there - it's our contact info, site problem reports, etc. - but we've got to stop the 'bots somehow! I've already written the CGI to handle the hot links, but we need to have the actual addresses displayed on the pages, and the 'bots are getting those. Any ideas? The page is at http://xxxxxxxxxxxx.xxx. I've created an account for you; just go to ssh://xxxxxxxxx.xxx/xxx, password 'xxxxxxxxxx'. Thanks! 
 - Int Main

After Woomert had ushered out their visitor (and reassured him that, indeed, the salmon-and-peach background was delightful), he returned to the living room where Frink awaited him.

 - "What are you going to do, Woomert? Any plans?"

 - "Yes; let's take a peek at their website, then get out there and look around. It's a mistake to make decisions ahead of your facts, and we have few facts at hand."

...

Once again, Woomert and Frink found themselves surrounded by the familiar sights and sounds of a working web site. They could see the Web server easily spawning off threads without significantly affecting CPU load; clearly, the local sysadmin had installed mod_perl [3]. Here and there, data streams whisked by, and everything moved like a smoothly-oiled machine.

A sudden shadow made Frink look up. "What the..." Before he could go any further, a horrifying creature, all tentacles, lenses, and evil intent [4] leaped upon the scene, sucked up a copy of every HTML file at once, and was gone in a blink.

 - "What was that, Woomert - a spambot?"

 - "Yep. These things traverse the Net, collecting e-mail addresses and reporting them to their scummy spammer masters. Given the nature of the Net, you can't stop them - but you can make them much less effective. Spammers are stupid, their bots even more so, and that's what we're going to rely on. Mind you, whatever we do is only going to be a temporary solution; eventually, spammers (or at least their hired techie help) will catch on to this particular method - but by then, we'll implement other solutions."

Walking up to a convenient terminal, Woomert slipped on his favorite typing gloves and fired off a rapid volley.


perl -MRFC::RFC822::Address=valid -wne'/[\w-]+@[\w.-]+/||next;print valid$&' *html
A line of '1's appeared on the screen; Woomert smiled and his fingers again flew over the keyboard.

perl -i -wpe's=[\w-]+@[\w.-]+=join"",map{sprintf"&#%s;",ord}split//,$&=e' *html
This time, there was no output; however, Woomert looked satisfied. He quickly shot off an email to the local sysadmin that contained some instructions and included a shorter version of the last one-liner -

perl -we'map{printf"&#%s;",ord}split//,pop' user@host.com
 - "All right-o, Frink; our work here is done. Home, here we come!"

...

The old-fashioned coal-fired samovar [6] was gently perking; the zavarka (tea concentrate), made with excellent Georgian tea, gave off a marvelous smell. A plate of canapés, ranging from the best Russian butter and wild blackberry jam on freshly-baked fluffy white bread to beluga caviar on a heavy, dark rye rubbed with just a touch of garlic, was set close at hand, and both Woomert and Frink were merrily foraging in the gourmet field thus presented. Eventually they settled back, replete with good food, and Frink's curiosity could be contained no longer.

 - "Woomert, when I try to puzzle out your one-liners, I can only get so far; then I run out of steam. Can you tell me about what you did?"

Lying back in his favorite armchair, Woomert smiled.

 - "Instead, why don't you start by telling me what part you understood? I like to see how far you've advanced, Frink; it's been a pleasure to me to see you picking up some of the finer points. I'll take it from there."

 - "All right, then... Let's start with the first one:


perl -MRFC::RFC822::Address=valid -wne'/[\w-]+@[\w.-]+/||next;print valid$&' *html
I recognized all the command-line switches:

-Mmodule    Use the specified module
-w          Enable warnings
-n          Non-printing loop
-e          Execute the following commands

However, I couldn't quite puzzle out the '-MRFC::RFC822::Address=valid' syntax - what was that?"

 - "Ah. As 'perldoc perlrun' tells us, in the entry for '-M', it's a bit of syntactic sugar; '-MBar=foo' is a shortcut for 'use Bar qw/foo/', which imports the specified function 'foo' from module 'Bar'. Go on, you're doing well."
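The '-M' shortcut is easy to try with a module that ships with Perl. What follows is a minimal sketch, assuming the core module List::Util and its 'sum' function; both are picked purely for illustration and play no part in Woomert's one-liners.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Script equivalent of the command-line form
#   perl -MList::Util=sum -we'print sum(1..4), "\n"'
# where '-MList::Util=sum' stands in for the 'use' line below.
use List::Util qw/sum/;

my $total = sum(1 .. 4);
print "$total\n";
```

Either form imports only the named function, so nothing else from the module lands in the script's namespace.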

Frink cleared his throat.

 - "In that case, I think I have it figured out... almost. Let me take a quick look at 'perldoc perlvar' and 'perldoc RFC::RFC822::Address'... Yes, that's what I thought - I've got it! The regex at the beginning -

/[\w-]+@[\w.-]+/

tries to match e-mail addresses - it's not perfect, but should do reasonably well. What it says is "match any character in [a-zA-Z0-9_-] repeated one or more times, followed by '@', followed by any character in [a-zA-Z0-9_.-] repeated one or more times". If the match does not succeed - the '||' logical-or operator handles that - go to the next line."

 - "Brilliant, Frink! What happens then?"

 - "If it does succeed, 'next' is skipped over, and 'print valid$&' is invoked. The module documentation tells me that the 'valid' function tests an e-mail address for RFC822 (e-mail specification) conformance, and returns true or false based on validity. '$&', according to 'perldoc perlvar', is the string matched by the last successful pattern match - in other words, whatever was matched by the regex. Since you saw all '1's and no errors - any matches that weren't RFC822-valid would have returned something like "Use of uninitialized value in print at -e line 1" - what you matched was all valid. What you were doing here is checking to see that your regex only matched actual addresses. How did I do?"

 - "Excellent, my dear Frink; you're coming along well! As a side note, it's generally best to avoid the use of $&, $`, and $' as well as 'use English' in scripts; there's a rather large performance penalty associated with them (see 'perldoc perlvar'). However, here we had a very small list of matches, and so I went ahead with it. Go on, see what you can make of the next one."
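In a longer script, the usual way to follow this advice is to capture the match in parentheses and read it from $1. The sketch below assumes a few made-up sample lines in place of the client's HTML files; the pattern is the one from the one-liner, and the addresses are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; in the story the lines come from *html files.
my @lines = (
    '<p>Contact: webmaster@example.com</p>',
    '<p>No address on this line.</p>',
    '<a href="/cgi-bin/mail">support@example.org</a>',
);

my @found;
for my $line (@lines) {
    # Same pattern as the one-liner, wrapped in parentheses so that the
    # matched text lands in $1 and $& is never mentioned.
    next unless $line =~ /([\w-]+\@[\w.-]+)/;
    push @found, $1;
}

print "$_\n" for @found;
```

The '@' is escaped here only to make it obvious that no array is being interpolated; the match itself is unchanged.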

 - "Um... the next one, right. Well, I've got part of it -


perl -i -wpe's=[\w-]+@[\w.-]+=join"",map{sprintf"&#%s;",ord}split//,$&=e' *html
-i    In-place edit (modify the specified file[s])
-w    Enable warnings
-p    Printing loop
-e    Execute the following commands

Mmmm... I got sorta lost here, Woomert. I see that regex that you'd used before, but what's that 's=' bit?"

 - "It's one of those convenient tweaks that Perl provides - although, admittedly, the basic idea was stolen from 'sed'. It's simply an alternate delimiter used with the 's' (substitute) operator; there are times when using the default delimiter ("/") is highly inconvenient and leads to "toothpick Hell" - as, for example, in matching a directory name:

s/\/path\/to\/my\/directory/my home directory/

Far better to use an alternate delimiter, one that is not contained in the text of either the pattern or the replacement:

s#/path/to/my/directory#my home directory#

As long as it's non-alphanumeric and non-whitespace, it'll work fine. There are some special cases, but they're all sensible ones; using a single quote disables interpolation in both the pattern and the replacement (see the rules in 'perldoc perlop'), and using braces or brackets as delimiters requires rather obvious syntax:

s{a}{b}
s(a)(b)
s[a][b]

Many people like '#' as a delimiter; I prefer '=', since '#' tends to come up in HTML and comments. Can you make sense of any of the rest?"
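The delimiter rules can be checked in a few lines. This is a small sketch with made-up data (the path, the word HOME, and the variable names are all illustrative): three spellings of one substitution, followed by the single-quote special case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $path = '/path/to/my/directory/file.txt';

# Three spellings of the same substitution.
(my $slashes = $path) =~ s/\/path\/to\/my\/directory/HOME/;
(my $equals  = $path) =~ s=/path/to/my/directory=HOME=;
(my $braces  = $path) =~ s{/path/to/my/directory}{HOME};

print "$slashes\n$equals\n$braces\n";

# Single quotes as delimiters switch off interpolation, so '$name' in the
# replacement stays literal text instead of being read as a variable.
my $name = 'Frink';
(my $literal = 'Hello, NAME') =~ s'NAME'$name';
print "$literal\n";
```

All three of the first group should give the same result; the last line should still contain the text '$name' rather than 'Frink'.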

 - "I'm afraid not. You're matching the email addresses as previously, and replacing them with something, but I can't figure out what."

 - "All right; it is rather involved. The replacement part of the substitution is actual Perl code; we can do that thanks to the 'e' (evaluate) modifier on the end of the 's' operator. Let's parse the relevant code from right to left:

join"",map{sprintf"&#%s;",ord}split//,$&
We know that '$&' contains an email address; the next thing we do is use the 'split' function, which converts a scalar to a list, splitting it on whatever pattern is specified between the slashes. In this case, however, the pattern is empty, a null - so the returned list has each character of the address as a separate element in the list. We now pass this list to the 'map' function, which will evaluate the code specified in the {block} for each element of the supplied list and return the result - as another list.

Within the block itself, each character is used as an argument to the 'ord' function, which returns the ASCII value of that character; this, in turn, is used as the argument for the 'sprintf' function which returns the following formatted string:

&#<ASCII_value>;

for each value so specified. After all the characters in the list have been processed, we use the 'join' function to convert the list back to a scalar - which the substitute operator will now use as a replacement string for the original email address. What used to be "foo@bar.com" now looks like

&#102;&#111;&#111;&#64;&#98;&#97;&#114;&#46;&#99;&#111;&#109;

This, you must admit, looks nothing like an e-mail address - so spambots will not be able to read it!"
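For readers who would rather see the right-to-left walk spelled out one step per line, here is a sketch of the same split/map/sprintf/join chain as an ordinary subroutine. The name 'encode_address' is made up for this illustration; the one-liner does all of it inline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper name; the one-liner has no such subroutine.
sub encode_address {
    my ($address) = @_;
    my @chars    = split //, $address;                  # one element per character
    my @entities = map { sprintf '&#%s;', ord } @chars; # 'ord' defaults to $_
    return join '', @entities;                          # list back to one string
}

my $encoded = encode_address('foo@bar.com');
print "$encoded\n";
```

Run on "foo@bar.com", this should print the same string of character codes shown above.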

Frink looked troubled.

 - "Woomert, I hate to tell you... but human beings won't be able to read it either!"

Woomert took another sip of his tea and smiled.

 - "You're forgetting one thing, Frink. Humans aren't going to be reading this; since it's part of the HTML files, it's going to be read by browsers. As it happens, the HTML syntax for showing a character by its numeric value (a 'numeric character reference') is

&#<ASCII_value>;

which is exactly what we've produced. Try this yourself: save the following text as "text.html" and view it in a browser.


<html><head><title></title></head><body> &#87;&#111;&#111;&#109;&#101;&#114;&#116;&#32;&#70;&#111;&#111;&#110;&#108;&#121; </body></html>
Do you see what I mean?"
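What the browser does with those codes can be imitated in Perl. The sketch below is an illustration only: 'decode_numeric_refs' is a made-up name, and it handles only the decimal form that Woomert's one-liner produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn decimal references such as '&#87;' back into
# the characters they stand for. Hex forms ('&#x57;') are not handled.
sub decode_numeric_refs {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr $1/ge;
    return $html;
}

my $encoded = '&#87;&#111;&#111;&#109;&#101;&#114;&#116;&#32;'
            . '&#70;&#111;&#111;&#110;&#108;&#121;';
my $decoded = decode_numeric_refs($encoded);
print "$decoded\n";
```

That a single substitution can undo the encoding is presumably why Woomert called the fix temporary: it holds only for as long as the 'bots don't bother to decode what they collect.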

A few moments later, Frink looked up from the keyboard.

 - "Woomert, what a great solution! Your client will be able to display the addresses without them being harvested, and the Web page will still look the same as it did before. I can tell by comparison that the last bit of code:


perl -we'map{printf"&#%s;",ord}split//,pop' user@host.com
simply enables the sysadmin to convert any new addresses before popping them into the HTML. Wonderful!"

 - "A large part of the complete solution, of course, was the CGI that the local admin had written - that takes a bit more than a one-liner, although not very much more, given the power of the CGI module. Remember, Frink: as your powers grow, make certain to align yourself with the side of Good rather than Evil. Not only is it the right thing to do; the people around you are far more likely to have brains!"
 
 



[1] Oddly enough, my mysterious correspondent did not include the solution to this, perhaps deeming it simple enough (!) for the public to figure out - or (and I suspect this to be the more likely scenario) he has not yet figured it out himself. Readers are welcome to write in with their ideas... but for now, the workings of Woomert's game remain a puzzle.

[2] A number of commonly-used Mail Transfer Agents will ignore anything that follows a plus sign in the username part of the address, e.g. <smith+yahoo@joe.com> will be routed exactly the same as <smith@joe.com>. This can be a very useful mechanism for tracing and reducing spam: a "plus-hacked" address that becomes too spam-loaded can be directed to "/dev/null" and replaced by a newly generated one (say, <smith+yahoo1@joe.com> - which would also go to <smith@joe.com>.)
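Taking a "plus-hacked" address apart, to see which tag a piece of spam arrived on, needs only one regex. This is a rough sketch, assuming the simple user+tag@domain shape shown above; 'split_plus_tag' is a made-up helper name, and the pattern makes no attempt at full RFC822 parsing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a "plus-hacked" address into the address the
# mail is really delivered to and the tag that shows where it was harvested.
sub split_plus_tag {
    my ($address) = @_;
    my ($user, $tag, $domain) =
        $address =~ /^([^+\@]+)(?:\+([^\@]*))?\@(.+)$/
        or return;
    return ("$user\@$domain", $tag);
}

my ($base, $tag) = split_plus_tag('smith+yahoo@joe.com');
print "delivered to: $base\n";
print "tagged as:    $tag\n";
```

For an address with no plus sign the tag comes back undefined, so a caller would want to check for that before printing it.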

[3] A.K.A. "Apache On Steroids". From the mod_perl documentation:

The Apache/Perl integration project brings together the full power of
the Perl programming language and the Apache HTTP server. This is
achieved by linking the Perl runtime library into the server and
providing an object oriented Perl interface to the server's C language
API.

These pieces are seamlessly glued together by the `mod_perl' server
plugin, making it is possible to write Apache modules entirely in
Perl. In addition, the persistent interpreter embedded in the server
avoids the overhead of starting an external interpreter program and
the additional Perl start-up (compile) time.

There are many major benefits to using mod_perl; if you use Apache in any serious fashion without it, you're almost certainly throwing away some of your time and effort.

[4] If you've seen "The Matrix", just picture the Sentinels. If you haven't seen it, hey, you've got only yourself to blame. :)

[5] Gibberish is the written form of the Marketroid language. It was formerly spoken by the Gibbers, who all died out as a result of their complete inability to do anything (as opposed to talking about it.) It is exactly as comprehensible as its spoken counterpart, although many people confuse the two: "it's all marketroid gibberish!" is a highly redundant statement.

[6] See the "Russian Tea HOWTO", by Dániel Nagy, for the proper way to make and serve Russian tea. The man knows what he's talking about.

 

Ben is a Contributing Editor for Linux Gazette and a member of The Answer Gang.

Ben was born in Moscow, Russia in 1962. He became interested in electricity at age six--promptly demonstrating it by sticking a fork into a socket and starting a fire--and has been falling down technological mineshafts ever since. He has been working with computers since the Elder Days, when they had to be built by soldering parts onto printed circuit boards and programs had to fit into 4k of memory. He would gladly pay good money to any psychologist who can cure him of the resulting nightmares.

Ben's subsequent experiences include creating software in nearly a dozen languages, network and database maintenance during the approach of a hurricane, and writing articles for publications ranging from sailing magazines to technological journals. Having recently completed a seven-year Atlantic/Caribbean cruise under sail, he is currently docked in Baltimore, MD, where he works as a technical instructor for Sun Microsystems.

Ben has been working with Linux since 1997, and credits it with his complete loss of interest in waging nuclear warfare on parts of the Pacific Northwest.


Copyright © 2003, Ben Okopnik. Copying license http://www.linuxgazette.net/copying.html
Published in Issue 86 of Linux Gazette, January 2003
