...making Linux just a little more fun!


The Answer Gang

By Jim Dennis, Jason Creighton, Chris G, Karl-Heinz, and... (meet the Gang) ... the Editors of Linux Gazette... and You!



(?) Transcoding UTF to ISO8859-1

From Riza Aziz

Answered By: Jimmy O'Regan, Ben Okopnik

Dear Answer Gang,

I am having some problems with reading converted UTF8-encoded web pages on my Palm PDA. I can't figure out how to transcode UTF, which the Palm doesn't understand, into ISO8859-1, which the Palm displays properly.

Some background: I have a script that downloads the latest news from a few sites. It's an ugly RSS look-a-like for sites that don't support RSS. I then use "htmlconv", a Perl script from the txt2pdbdoc package, to convert the downloaded pages into a text-only format which I can then upload to the Palm. The script also converts character entity references (egrave, quot etc.) into ISO8859-1 characters. On the Palm, I use Cspotrun to read the PDB files.

I downloaded the txt2pdbdoc package a long time ago and it worked fine with Redhat 6. When I upgraded to Redhat 9, Perl's UTF handling broke the script because it assumed I wanted the converted web page in UTF. This poses a problem because the Palm doesn't understand UTF characters; accented letters and certain punctuation marks become strange symbols. By adding

	use encoding 'latin1';
	use open ':encoding(iso-8859-1)';

everything worked again.

Now, I've come across a website (http://www.zmag.org/recent_featured__links.cfm) that uses UTF directly. Instead of using "egrave" for an accented E, it uses the UTF character directly. The converter script doesn't know what to do with the character and I get all sorts of strange symbols when viewing the file on my Palm.

Is there any way to convert the UTF characters directly into ISO8859-1? And how do I get rid of any characters that don't map directly, so strange symbols don't show up on my Palm? I've messed around with the encoding pragmas but I can't get anything to work.

Thanks!

(!) [Jimmy] Righto. I have a silly perl script that prints out the Polish alphabet (so I don't have to trawl through the iso-8859-2 man page for the long names of the odd characters) that looks like this:

See attached alfabet.pl.txt

To get iso8859-1 output, I could replace the last two lines with:

	use Encode;
	$alfa = encode ("iso-8859-1", $Alfabet);
	print "$alfa\n";
	print lc "$alfa\n";

(or perl alfabet.pl | recode 'utf-8..iso-8859-1')
To get rid of the extra characters, you'd probably be better off converting to ASCII rather than ISO-8859-1 -- Perl will print a question mark instead. (recode will too, if you use the -f option to force an irreversible change. Otherwise, it'll stop as soon as it finds a character that it can't convert.)
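
For what it's worth, a minimal sketch (not from the thread; the sample string is invented) of that question-mark behaviour with the Encode module:

#!/usr/bin/perl
# Not from the thread: Encode's default fallback substitutes '?' for
# anything the target encoding (here ASCII) can't represent.
use strict;
use warnings;
use Encode;

my $text  = "na\x{EF}ve \x{2019}quoted\x{2019}";   # made-up sample string
my $ascii = encode( 'ascii', $text );              # unmappable chars become '?'
print "$ascii\n";                                  # prints: na?ve ?quoted?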

(?) I looked everywhere in my system and I can't find recode. Does it belong to any particular package?

(!) [Jimmy] It's normally in its own package. http://packages.debian.org/stable/text/recode

(?) I did try substituting all the UTF characters with their common ASCII equivalents, e.g. opening and closing quotes with ". I created a hash as above and used s/// but nothing happened.

One strange thing: the single closing quote character under UTF is \x{2019}, which I tried substituting for. However, running hexdump on the file shows the character is actually E28099... what gives? What can I do to get a straight ASCII dump of the file?

(!) [Jimmy] From http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

...............

The following byte sequences are used to represent a character. The sequence to be used depends on the Unicode number of the character:
U-00000000 - U-0000007F:  0xxxxxxx
U-00000080 - U-000007FF:  110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
Examples: The Unicode character U+00A9 = 1010 1001 (copyright sign) is encoded in UTF-8 as

11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 0010 0110 0000 (not equal to) is encoded as:

11100010 10001001 10100000 = 0xE2 0x89 0xA0

...............

If you want to look at the raw text, just use a text editor that isn't unicode aware.
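
To connect that to the hexdump observation above, here is a tiny sketch (not from the thread) showing that the code point U+2019 really is stored as the three UTF-8 bytes E2 80 99:

#!/usr/bin/perl
# Not from the thread: print the UTF-8 bytes of U+2019 (RIGHT SINGLE QUOTATION MARK).
use strict;
use warnings;
use Encode;

my $bytes = encode( 'utf8', "\x{2019}" );
printf '%02X ', ord for split //, $bytes;   # prints: E2 80 99
print "\n";
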
(!) [Jimmy] One thing to be aware of when dealing with files created on Windows (as the page you pointed to was) is that Windows usually uses UTF-16LE rather than UTF-8.

(?) Yeah, created with MS Frontpage. Blech  :) All this time I thought that ISO8859-1 was the standard encoding for web pages, with other encodings used for Chinese, Japanese script etc. Is mixing and matching allowed?

(!) [Jimmy] I don't know, to be honest. I'd assume that if you want to do that, you'd be strongly encouraged to use Unicode. (Umm... actually, these days, you're encouraged to use XHTML rather than HTML. Since XHTML is based on XML, it's UTF-8 unless stated otherwise).

(?) Thanks for the link to an excellent website on most things Unicode. It cleared my misconception that Unicode sequences correspond to actual byte sequences, when they don't e.g. \x{2019} is actually E28099, not 2019.

(!) [Jimmy] Erm... be careful with your phrasing there. The part that's written to disk is an encoded version of the Unicode sequence. UTF-7 is a good example of how it works: IIRC, it's basically UTF-16 run through a modified Base64.

(?) I think I have the problem mostly solved. I added the following pragmas:

       use encoding 'utf8';
       open( OUTPUT, '>:encoding(iso-8859-1)', "$txt_file" )

So, the script processes everything in Unicode but spits out the results in ISO-8859-1.

The hard bit for me is the substitution. The following snippet is supposed to do the substitution, but it doesn't work:

    %utf_entity = (
	"\x{2019}",	'"',
	"\x{201c}",	'"',
	"\x{201d}",	'"',
	);
    s/(\X+);/exists $utf_entity{$1} ? $utf_entity{$1} : $1/eg;

Instead, I get an error for each non-matching Unicode character:

	"\x{2019}" does not map to iso-8859-1 at
/home/riza/bin/htmlconv-utf line 302

However, using s/\x{2019}/"'"/eg, s/\x{201c}/"'"/eg and so on for every non-matching character works. It's a really clunky way of doing things but the resulting file displays perfectly on the Palm. How do I match hex sequences for non-matching Unicode characters in a regex, without wiping out all other characters?

(!) [Jimmy] OK, so you have something like this:

See attached utf-1.pl.txt

With some functions from the Encode module, you get the right output:

See attached utf-2.pl.txt

The regex still isn't working though. Let's break it down:

s/(\X+);/
I'll assume the semi-colon was a typo. I think the pattern should really be (\X) though; you're using it to match individual characters against a hash, so you don't want to get more than one character. If you want to see what you're matching, you could use something like this:

s/(\X)(?{print "Matched: $^N\n"})/
and shouldn't it be $utf_entity{"$1"} instead of $utf_entity{$1} ?
But do you really need to do that stuff with the hash when you could use tr instead?

tr [\x{2019}\x{201c}\x{201d}] ["];
Jimmy O'Regan wrote:
> and shouldn't it be $utf_entity{"$1"} instead of  $utf_entity{$1} ?
OK, no. Here's a working version:

See attached utf-3.pl.txt
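
The attachment isn't reproduced here, but a minimal sketch of the tr-based idea (my own code, with an invented sample line, mapping the right single quote to an apostrophe rather than a double quote) might look like this:

#!/usr/bin/perl
# Not the attached utf-3.pl: map curly quotes to ASCII with tr, then print
# through an ISO-8859-1 output layer.
use strict;
use warnings;
use open ':std', ':encoding(iso-8859-1)';

my $line = "\x{201c}Hello,\x{201d} she said.";
$line =~ tr/\x{2019}\x{201c}\x{201d}/'""/;   # map to ' " " respectively
print "$line\n";                             # prints: "Hello," she said.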

(?) I think the problem was with the encoding of the file handle and a bit with the regex itself. Below are snippets of my version of the converter script. However, I'm following the original author's method of slurping up all the input and putting it into $_, whereas your script loops through the input. Which method is better?

(!) [Jimmy] How long is a piece of string?  :) (No, really - if you know your file will fit into memory, your way is better, otherwise my way is better (I think :)).
(!) [Ben] That being the big caveat - although HTML files are not likely to be so huge that they'd cause an OOM on a modern machine.

(?)

open( INPUT, '<:encoding(utf8)', "$html_file" )
    or die "$me: can not open $html_file for input\n";
$_ = join( '', <INPUT> );		# slurp up all of HTML
(!) [Ben] This is not a good idea. You're reading in <INPUT> as a list (which takes ~5x the memory for the amount of data), then "join"ing the list - seems rather wasteful, particularly since you don't need to do any of the above (entities are not going to be broken across lines.) For future reference, try this:
open Fh, "foo" or die "foo: $!\n";

{
	local $/;	# Undef the EOL character
	$in = <Fh>;	# Slurp the content in scalar context
}
close Fh;

(?)

close( INPUT );

if ( $txt_file ) {
    open( OUTPUT, '>:encoding(iso-8859-1)', "$txt_file" ) or
	  die "$me: can not open $txt_file for output\n";
    select OUTPUT;
}

### various HTML-stripping bits here

%utf_entity = (
	"\x{2019}",	"'",
	"\x{201c}",	'"',
	"\x{201d}",	'"',
	"\x{2026}",	"...",
	"\x{fffd}",	"",
);
s/(\X)/ exists $utf_entity{$1} ? $utf_entity{$1} : $1 /eg;

print "$_\n";

The above regex works. I found that I didn't have to use $_ = encode_utf8($_) to get it running, as long as the non-matching UTF characters were stripped out before output. If a character was left in, its Unicode sequence in plain text was shown in the output file e.g. \x{fffd}.

(!) [Jimmy] It dawned on me this morning that I only needed to use the open() stuff or the Encode stuff, but I'll just stare at the ground, shuffle my feet, and mutter something about being doubly sure  :) That stuff was a hangover from pasting a line in the wrong place, before I noticed you were trying to match too much.
(!) [Ben] As to the script itself, well -

See attached utf2iso-8859-1.pl.txt

Use redirection to write your output to a file - or pipe it into something else for further processing.

(?) I think your way (of looping through the file) is actually better for a large range of file sizes.

I usually "cat" a bunch of HTML files together before converting them with the script. Using the slurp method, a cobbled-together file of 1 mb or more pretty much kills the computer  :) It just sits there, not processing but taking up a lot of memory. After 10 minutes I have to kill the process.

OTOH if you loop over the file, that would allow it to better allocate memory, I guess?

(!) [Ben] Depends on your meaning of "loop over" - if you load the entire file into memory and loop over it (as both slurping and the 'for' loop will), then no. This is one of the things I constantly emphasize to my Perl students: you can easily bring down your machine by slurping files - do not do it unless you're very confident that the maximum file size will be no more than a tiny fraction of the memory.
# Wrong ways for arbitrary file sizes, OK for small files:
### Slurp into array
@file = <Foo>;
### Load <Foo> into memory as an array
for $line ( <Foo> ){ do_stuff( $line ); }
### Load Foo into memory as a string
{ local $/; $file = <Foo>; }

# Right ways when in doubt:
### Read the filehandle one line at a time
while ( $line = <Foo> ){ do_stuff( $line ); }
### Read a paragraph at a time, in case there are continuation lines
### (e.g., mail headers)
{ local $/ = "\n\n"; while ( $line = <Foo> ){ do_stuff( $line ); }

(?) Many thanks for the help! I'm attaching the whole script, in case someone might have use for it.

See attached htmlconv-utf.pl.txt

(!) [Ben] Both Jimmy and I mentioned the reasons why slurping can be dangerous, but there are times when you can't avoid it - although constructs like this one tend to handle most of the "line-continuation" scenarios:
while ( $line = <Fh> ){
	while ( test_for_incomplete_line( $line ) ){
		$line .= <Fh>;
	}

	# Process $line further
	...
}
In other words, if there's some metric you can use for distinguishing an incomplete line, then you don't need to slurp. Conversely, if you're looking at formatted text, you can also avoid slurping by processing a paragraph at a time:
$File = "/foo/bar/gribble.qux";
open File or die "$File: $!\n";
{
	local $/ = "\n\n";	# Define EOL
	while ( <File> ){
		process_content();
	}
}
Asking which method is "better" makes no sense until you consider the data that you're processing. In your case, since entities don't break across lines, slurping is unnecessary - so processing it a line at a time is quite sensible.
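
For completeness, a line-at-a-time version of the conversion (my own sketch: the filenames are placeholders, and the hash is the one from earlier in the thread) might look like this:

#!/usr/bin/perl
# A sketch, not the actual htmlconv-utf script: decode UTF-8 on input,
# map a few troublesome code points, write ISO-8859-1 output line by line.
use strict;
use warnings;

my %utf_entity = (
    "\x{2019}" => "'",
    "\x{201c}" => '"',
    "\x{201d}" => '"',
    "\x{2026}" => '...',
    "\x{fffd}" => '',
);

open my $in,  '<:encoding(utf8)',       'input.html' or die "input.html: $!\n";
open my $out, '>:encoding(iso-8859-1)', 'output.txt' or die "output.txt: $!\n";

while ( my $line = <$in> ) {
    # Anything still outside Latin-1 here will trigger the
    # "does not map to iso-8859-1" warning seen earlier.
    $line =~ s/(\X)/ exists $utf_entity{$1} ? $utf_entity{$1} : $1 /eg;
    print {$out} $line;
}

close $in;
close $out;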

(?) Thanks! I guess the script's original author wrote it only to convert small, single HTML files instead of huge HTML lumps of multiple files.

(!) [Ben] That would be my guess. Either that, or he didn't even consider the issue. Either way, handling the problem in the script would have been trivial:
# After processing all the command-line options, loop over files
for ( @ARGV ){
       if ( -s > $MAX_SIZE ){
               warn "File $_ rejected: TOO LARGE!\n";
               next;
       }
       process_files( $_ );
}

(?) Just curious, what kind of data would have entities spread across multiple lines, i.e. binary data? Even plain text would be terminated with CR or CR/LF, correct?

(!) [Ben] Well, what makes them entities is that they're atomic - i.e., irreducible units. That means that they _can't_ be broken, across lines or in any other way - otherwise they become, erm, non-entities. :)
(!) [Jimmy] Erm... not quite. Let me veer off-topic a little...
In SGML and XML you can define your own entities, which can contain pretty much anything you want -- a multi-line disclaimer, for instance. Since the trend in browsers has moved towards generic XML browsers that render using CSS or XSL stylesheets (but with a fallback mode to handle the mangled mess that is HTML), defining your own entities is possible, though not advisable.
You can define binary data as an entity, but anything other than plain text will at best be ignored, at worst cause an error. If you want to output a CR, for example, you would have to use XSL-FO (though there is a way to preserve whitespace from XSL, including CR or CR/LF, you just can't do it as flexibly as plain text).
(Defining your own tags is OK, though - browsers ignore tags they don't understand. Most HTML browsers can handle XHTML[1] because of this).
[1] That is, as long as the XHTML namespace isn't prefixed, and some browsers have trouble with <br/>, <hr/>, etc. (Though they can manage <br />, etc)
(!) [Ben] Jimmy, correct me if I'm wrong but - we were speaking of HTML character entities, right? Otherwise, the methane-based entities from the Dravidian cluster are going to complain about discrimination. If we're going to bring in every other kind of entity and ignore them, it'll look like solid grounds for a lawsuit. :)
To restate my point with a bit more precision, though: HTML character entities cannot be broken - otherwise, they'll be, well, - broken.
(!) [Jimmy] Sorry. Thought the original question was something completely different.

(?) Thanks for all the help in solving this UTF problem and giving me some insights into the murky world of Unicode, at the same time cleaning up my atrocious Perl  :) You're all a real blessing for the Linux community. Keep up the excellent work!

(!) [Ben] You're certainly welcome; glad we could help!

This page edited and maintained by the Editors of Linux Gazette
HTML script maintained by Heather Stern of Starshine Technical Services, http://www.starshine.org/


Each TAG thread Copyright © its authors, 2005

Published in issue 117 of Linux Gazette August 2005
