

Screen scraping with Perl

By Jimmy O'Regan

Screen scraping is a relatively well-known idea, but for those who are not familiar with it, the term refers to the process of extracting data from a website. This may involve sending form information, navigating through the site, etc., but the part I'm most interested in is processing the HTML to extract the information I'm looking for.

As I mentioned in my article about outliners, I've been organising myself recently, and as part of that process of organisation I've been writing several screen scrapers to reduce the amount of browsing I do: repeatedly visiting news sites to see if they have been updated is a waste of anyone's time, and in these times of feed readers, it's even less tolerable.

Liferea, my feed reader of choice, has a facility to read a feed generated by a command, and I have been taking advantage of this facility. As well as reducing the amount of time I spend reading the news from various sources, this also allows me to keep track of websites I wouldn't normally remember to read.

Perl

In my article about feed readers I mentioned RSSscraper, a Ruby-based framework for writing screen scrapers. As much as I like RSSscraper, I've been writing my screen scrapers in Perl. Ruby looks like a nice language, but I find Perl's regexes easier to use, and CPAN is filled with convenient modules to do just about everything you can think of (and many more things you'd probably never think of).

Most of my screen scrapers use regexes, mainly because Perl's regexes were haunting me: there was something I just wasn't grasping, and I wanted to push past it (and I have, and now I can't remember what the block was :). There are much better ways to write screen scrapers: Perl has modules like WWW::Mechanize, HTML::TokeParser, etc., that make screen scraping easier.
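For the curious, here's a minimal sketch of what fetching a page with WWW::Mechanize looks like. None of the scrapers below actually use it, and the URL is just the diary page used later in this article:

#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;

# A minimal WWW::Mechanize sketch: fetch a page and keep the HTML
# around for later parsing. Not one of the scrapers discussed below.
my $mech = WWW::Mechanize->new();
$mech->get("http://www.linux.org.uk/~telsa/Diary/diary.html");
die "Couldn't fetch the page" unless $mech->success;

print $mech->title, "\n";   # the page's <title>
my $html = $mech->content;  # the raw HTML, ready to be scraped

WWW::Mechanize handles cookies, redirects and form filling for you, which is why it turns up so often in articles about screen scraping.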

Using regexes

First of all, here's a list of scrapers:

Most of the scrapers work in exactly the same way: fetch the page using LWP::Simple, split the page into sections, and extract the blog entry from each section. sun-pic.pl is a dirty, hackish attempt to work around both the popups and The Sun's horrible site's tendency to crash Mozilla. It's called with the address of the page, grabs the images from the popups, and puts them in a specific directory. It's not meant to be useful to anyone else, other than as an example of a quick and dirty script that's different from the other examples here. If you're interested, read the comments in the script.

I'll use Telsa's diary as an example, because the page layout is clear, and the script I wrote is one of the better examples (I'd learned to use the /x modifier for clarity in regexes by then).

   <dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
   <dd>
    <p>
      [Content]
    </p>
   </dd>

... and so on.

Each entry starts with <dt>, so I use that as the point at which to split. From each entry, I want to grab the anchor name, the title (between the <strong> tags), and everything that follows, until the </dd> tag.

The script looks like this:

#!/usr/bin/perl -w

use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;

my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);

$rss->channel(title       => "The more accurate diary. Really.",
              link        => $url,
              description => "Telsa's diary of life with a hacker:" 
	      		     . " the current ramblings");

foreach (split ('<dt>', $page))
{
	if (/<a\sname="
             ([^"]*)     # Anchor name
             ">
             <strong>
             ([^>]*)     # Post title
             <\/strong><\/a><\/dt>\s*<dd>
             (.*)        # Body of post
             <\/dd>/six)
	{
		$rss->add_item(title       => $2,
			       link        => "$url#$1",
		       	       description => encode_entities($3));
	}
}

print $rss->as_string;

Most of the scrapers follow this general recipe, but the Michael Moore and Terry Pratchett scrapers each have an important difference.

Michael Moore's blog, unlike most blogs, has the links for each item on a separate part of the page from the content that's being scraped, so I have a function to scrape the content again for the link:

sub findurl ($$)
{
	my $title = shift;
	my $pagein = shift;
	# \Q...\E stops any regex metacharacters in the title from being
	# treated as part of the pattern
	if ($pagein =~ /<a href="(index.php\?id=[^"]*)">\Q$title\E<\/a>/i)
	{
		return "http://www.michaelmoore.com/words/diary/$1";
	}
}
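The function is called with the title scraped from each entry and the full page. A hypothetical call site (the variable names here are only for illustration) looks something like this:

# $title and $content come from the entry just matched; $page is the
# full HTML fetched with LWP::Simple's get()
my $link = findurl($title, $page);

$rss->add_item(title       => $title,
               link        => $link,
               description => encode_entities($content));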

It's important to have a unique URL for each item in a feed, because most feed readers use the link as a key, and will only display one entry for each link.

The Terry Pratchett scraper is also different, in that instead of using LWP::Simple, it uses LWP::UserAgent. Google wouldn't accept a request from my script, so I used LWP::UserAgent to masquerade as a browser:

use LWP::UserAgent;

my $ie="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";

my $ua = LWP::UserAgent->new;
$ua->agent($ie);
my $url = "http://groups.google.com/groups?safe=images&as_uauthors=Terry%20Pratchett&lr=lang_en&hl=en";
my $response = $ua->get ($url);

if ($response->is_success) 
{
	[scrape as usual]
}
else 
{
	die $response->status_line;
}

(The content of the page is held in $response->content).
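So where the LWP::Simple scrapers do my $page = get($url), the Pratchett scraper does something like this before carrying on with the usual splitting and matching (a sketch, not the script verbatim):

# the page's HTML, equivalent to what LWP::Simple's get($url) returns
my $page = $response->content;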

One thing I'm still looking at is getting the news from The Sun. The problem with this page is that it has some of the worst abuses of HTML I've ever seen. This snippet uses HTML::TableExtract to extract most of the headlines.

use LWP::Simple;
use HTML::TableExtract;

my $html_string = get ("http://www.thesun.co.uk/section/0,,2,00.html");
my $te = new HTML::TableExtract(depth => 6);
$te->parse($html_string);
foreach my $ts ($te->table_states) 
{
	print "Table found at ", join(',', $ts->coords), ":\n";
	foreach my $row ($ts->rows) 
	{
		print join(',', @$row), "\n";
	}
}

HTML::TableExtract is a nice module that lets you extract the text content of any table. The "depth" option allows you to select a depth of tables within other tables (the page grabbed by this script has most of its headlines at a depth of 6 tables within tables, but there are others at a depth of 7 -- I think I'll come back to that one). You can also specify a "count" option to tell it which table to extract from, or a "headers" option, which makes the module look for columns with those headers.
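For example, a rough sketch of the "headers" style of use looks like this (the URL and column names are made up for illustration; The Sun's page has nothing so tidy):

use LWP::Simple;
use HTML::TableExtract;

# Hypothetical example: pick out tables by their column headers rather
# than by depth or count. The headers and the URL are invented.
my $html_string = get("http://www.example.com/headlines.html");
my $te = new HTML::TableExtract(headers => ['Date', 'Headline']);
$te->parse($html_string);

foreach my $ts ($te->table_states)
{
	foreach my $row ($ts->rows)
	{
		print join(',', map { defined $_ ? $_ : '' } @$row), "\n";
	}
}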

Lastly, I'd like to take a look at HTML::TokeParser::Simple. If I had known about this module when I started writing screen scrapers, they would be a lot easier to understand, and more resilient to change. The scraper for Telsa's diary, for example, will break if the <a> tag has an href attribute as well as a name attribute.

HTML::TokeParser::Simple is, as the name implies, a simplified version of HTML::TokeParser, which allows you to look for certain tags within a file. It provides a number of methods prefixed with "is_" or "return_", which respectively tell you whether a token is of a certain type, or return part of it. HTML::TokeParser::Simple also inherits from HTML::TokeParser, so it has full access to HTML::TokeParser's methods.

The Telsa scraper using HTML::TokeParser::Simple looks like this:

#!/usr/bin/perl -w

use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
use HTML::TokeParser::Simple;

my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
my $stream = HTML::TokeParser::Simple->new(\$page);
my $tag;

$rss->channel(title       => "The more accurate diary. Really.",
              link        => $url,
              description => "Telsa's diary of life with a hacker:" 
	      		     . " the current ramblings");

while ($tag = $stream->get_token)
{
	next unless $tag->is_start_tag ('a');
	next unless $tag->return_attr("name") ne "";
	my $link = $tag->return_attr("name");
	$tag = $stream->get_token;
	next unless $tag->is_start_tag ('strong');
	$tag = $stream->get_token;
	my $title = $tag->as_is;
	$tag = $stream->get_token;
	next unless $tag->is_end_tag ('/strong');
	$tag = $stream->get_token;
	next unless $tag->is_end_tag ('/a');
	$tag = $stream->get_token;
	next unless $tag->is_end_tag ('/dt');
	$tag = $stream->get_token;
	#We've got whitespace; on to the next tag
	$tag = $stream->get_token;
	next unless $tag->is_start_tag ('dd');
	my $content = "";
	$tag = $stream->get_token;
	until ($tag->is_end_tag('/dd'))
	{
		$content .= $tag->as_is;
		$tag = $stream->get_token;
		next;
	}
	$rss->add_item(title       => $title,
		       link        => "$url#$link",
	       	       description => encode_entities($content));
}

print $rss->as_string;

This is more verbose than necessary, but does the same thing as the regex version. A better version would use HTML::TokeParser's get_tag method:

#!/usr/bin/perl -w

use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
use HTML::TokeParser::Simple;

my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
my $stream = HTML::TokeParser::Simple->new(\$page);
my $tag;

$rss->channel(title       => "The more accurate diary. Really.",
              link        => $url,
              description => "Telsa's diary of life with a hacker:" 
	      		     . " the current ramblings");

while ($tag = $stream->get_tag('a'))
{
	next unless $tag->return_attr("name") ne "";
	my $link = $tag->return_attr("name");
	$tag = $stream->get_tag ('strong');
	$tag = $stream->get_token;
	my $title = $tag->as_is;
	$tag = $stream->get_tag ('dd');
	my $content = "";
	$tag = $stream->get_token;
	until ($tag->is_end_tag('/dd'))
	{
		$content .= $tag->as_is;
		$tag = $stream->get_token;
		next;
	}
	$rss->add_item(title       => $title,
		       link        => "$url#$link",
	       	       description => encode_entities($content));
}

print $rss->as_string;

There are plenty of other modules for manipulating HTML: a CPAN search gave me 7417 results!

If you're hungry for more, I recommend reading these articles from Perl.com: Create RSS channels from HTML news sites and Screen-scraping with WWW::Mechanize. As a parting shot, I've also included a script that generates del.icio.us-like XML from a Mozilla bookmark file: watch out for next month's Linux Gazette to find out what it's for!

 


Jimmy is a single father of one, who enjoys long walks... Oh, right.

Jimmy has been using computers from the tender age of seven, when his father inherited an Amstrad PCW8256. After a few brief flirtations with an Atari ST and numerous versions of DOS and Windows, Jimmy was introduced to Linux in 1998 and hasn't looked back.

In his spare time, Jimmy likes to play guitar and read: not at the same time, but the picks make handy bookmarks.

Copyright © 2004, Jimmy O'Regan. Released under the Open Publication license unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 108 of Linux Gazette, November 2004
