2-cent Tip: Unicode conversion

Benjamin A. Okopnik [ben at linuxgazette.net]
Mon, 4 Sep 2006 15:27:16 -0400

A couple of years ago, I decided to stop wrestling with what I call "encoding craziness" for various bits of non-English text that I have scattered around my file system. Russian, for example, has at least four different encodings that I've run into - and guessing which one a given text file was written in was like a game of darts played in the dark. At 300 yards. With your hands tied behind you, so you had to use your toes. Oh, and while you were severely drunk on Stoli vodka. :) UTF-8 (Unicode) allowed me to, well, unify all of that into one single encoding that was readable without scrambling for whichever character set I needed (and may or may not have installed.) Better yet, Unicode usually displays just fine in HTML browsers - no special entity encoding is required.

For some reason, though, good converters appear to be something of a black art - and finding one that works, as opposed to all those that claim to work, was rather frustrating. Therefore, I decided to write one in my favorite language, Perl - only to find that the job has already been done for me, via the 'encoding' pragma. In other words, conversion from, say, KOI8-R to UTF-8 is no more complex than this:

# Convert and write to a file
perl -Mencoding=koi8r,STDOUT,utf8 -pe0 < file.koi8r > file.utf8
# Or just display it in a pager:
perl -Mencoding=koi8r,STDOUT,utf8 -pe0 < file.koi8r | less

It is literally that simple. Pretty much every encoding you can imagine is available (see 'perldoc Encode::Supported' for the naming conventions and charsets). The conversion does not have to be to UTF-8 - it'll do any of the listed charsets - but why would you care? :)
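
For readers whose Perl no longer accepts the 'encoding' pragma (it was deprecated in Perl 5.18, and its default mode is no longer supported as of 5.26), roughly the same conversion can be sketched with iconv(1). This is only an illustration, not the method from the tip: 'to_utf8' is a made-up helper name, and it assumes an iconv that recognizes the charset names you hand it.

```shell
# Hypothetical helper: convert stdin from the charset named in $1 to UTF-8.
# Assumes iconv(1) is installed and knows that charset name.
to_utf8() {
    iconv -f "$1" -t UTF-8
}

# Typical use would be:  to_utf8 KOI8-R < file.koi8r > file.utf8
# Self-contained demo: the single iso-8859-1 octet 0xE9 (e-acute),
# converted to UTF-8 and dumped as hex.
demo_hex=$(printf '\351' | to_utf8 ISO-8859-1 | od -An -tx1 | tr -d ' \n')
echo "$demo_hex"
```

The demo prints the two octets that UTF-8 uses for that one character; 'iconv -l' lists the charset names your installation accepts.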

# Print the Kanji for 'Rakuda' (Camel) from multibyte strings:
perl -Mencoding=euc-jp,STDOUT,utf-8 -wle'print "Follow the \xF1\xD1\xF1\xCC!"'
Follow the 駱駝!
# Or you can do it in Hiragana, but using Unicode values instead:
perl -Mencoding=shift-jis,STDOUT,utf8 -wle'print "Follow the \x{3089}\x{304F}\x{3060}!"'
Follow the らくだ!

* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Kapil Hari Paranjape [kapil at imsc.res.in]
Tue, 5 Sep 2006 08:36:25 +0530

On Mon, 04 Sep 2006, Benjamin A. Okopnik wrote:

> For some reason, though, good converters appear to be something of a
> black art - and finding one that works, as opposed to all those that
> claim to work, was rather frustrating.

Doesn't 'tcs' work properly? This program is in a Debian package by the same name. I don't use multiple languages on the computer (which may be surprising given where I am) so I haven't had a chance to check.

Regards,

Kapil. --




Benjamin A. Okopnik [ben at linuxgazette.net]
Mon, 4 Sep 2006 23:46:53 -0400

On Tue, Sep 05, 2006 at 08:36:25AM +0530, Kapil Hari Paranjape wrote:

> 
> On Mon, 04 Sep 2006, Benjamin A. Okopnik wrote:
> > For some reason, though, good converters appear to be something of a
> > black art - and finding one that works, as opposed to all those that
> > claim to work, was rather frustrating.
> 
> Doesn't 'tcs' work properly? This program is in a Debian package by
> the same name. I don't use multiple languages on the computer (which
> may be surprising given where I am) so I haven't had a chance to
> check.

I hadn't run across it before, but it seems a bit limited:

ben@Fenrir:~$ apt-cache show tcs|grep '^ '
 tcs translates character sets from one encoding to another.
 .
 Supported encodings include utf (ISO utf-8), ascii, ISO 8859-[123456789],
 koi8, jis-kanji, ujis, ms-kanji, jis, gb, big5, unicode, tis, msdos, and
 atari.

Just looking at the ISO8859* list, it's already missing the Nordic languages, Thai, the Baltics, the Celtics, and Latin 9 and 10. The solution that I suggested can handle all of those and a much broader range besides - it includes 9 variants of just Unicode alone.

* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




John Karns [johnkarns at gmail.com]
Wed, 6 Sep 2006 19:53:12 -0500 (COT)

On Mon, 4 Sep 2006, Benjamin A. Okopnik wrote:

> A couple of years ago, I decided to stop wrestling with what I call
> "encoding craziness" for various bits of non-English text that I have
> scattered around my file system.

[snipped some entertaining imagery of a drunk Russian throwing darts with his toes] ...

> UTF-8 (Unicode) allowed me to, well, unify all of that into one single
> encoding that was readable without scrambling for whichever character
> set I needed (and may or may not have installed.) Better yet, Unicode
> usually displays just fine in HTML browsers - no special entity encoding
> is required.

This leads me to some questions about character sets. Sorry about the over-long text, but I didn't want to omit any important details, and the subject of fonts / charsets can get rather deep very quickly.

In the earlier thread here, "Copyright Notice", you quoted a phrase in German which had an umlauted "u" (hope that's the right terminology) in one of the words. My system (Ubuntu 5.10 [1]* with lots of added dev libs and other packages) is set up for utf-8, and the following env vars are set:

    LESSCHARSET=utf-8
    LANG=en_US.UTF-8
    LANGUAGE=en_US.UTF-8
---

Running locale on the system yields:

    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=

The mutt pager correctly displays the non-ASCII character, as does vim. But my pager "less" (ver 382) showed it as a blank space. I thought perhaps the installed version of "less" was compiled w/o Unicode support, so I DL'd the most recent source I could find (ver 394) and compiled it. I looked for a Unicode-related configuration option, but auto-make showed no special options at all, so nothing there. However, it seems to come a little closer by displaying the character as "<FC>" => "FC" in angle brackets, so it's at least recognizing it as a two-byte character. Also, looking at the package's source file, "charset.c", utf-8 is listed as an element of the array "charset", so it seems that it should support utf-8.

Your message contained the following two headers:

    Content-Type: text/plain; charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

So my questions are:

1) What's the difference between iso-8859-1 and utf-8?

After Googling for an answer to that, I came across some info that led me to understand that the difference is 8-bit vs. 16-bit. Does utf / Unicode then stand alone as 16-bit vs the rest of the character sets like 8859, latin[123], etc?

1a) What's the main advantage of choosing Unicode over the others?

Intuition says that a 16-bit charset covers more ground than one of 8-bits, which might lead to being a better bridge of the gap between more languages, in many cases alleviating the need to switch between different charsets when focusing on various different language character sets. Am I close? Any other reasons?

2) Is the bracketed "FC" an indication that it's attempting to display as utf-8, and simply not finding the character set? Or is it indicative of something else, such as displaying it as some other character set?

3) Does it seem more likely that I'm overlooking a step in the compilation configuration or missing some other needed env var for the run time env? Or is this a known issue with GNU less?

[1]* works wonderfully on this aging Dell I8100, best I've ever had a laptop run under Linux - at the end of the day I just suspend to RAM - haven't done a shutdown or reboot now in 14 days! Probably mostly thanks to the improvements in the kernel suspend code, but they seem to have the ACPI scripting functioning very well too. Hibernate hasn't proven to be quite as smooth though.

-- 
John Karns




Benjamin A. Okopnik [ben at linuxgazette.net]
Fri, 8 Sep 2006 22:47:59 -0400

On Wed, Sep 06, 2006 at 07:53:12PM -0500, John Karns wrote:

> On Mon, 4 Sep 2006, Benjamin A. Okopnik wrote:
> 
> > A couple of years ago, I decided to stop wrestling with what I call
> > "encoding craziness" for various bits of non-English text that I have
> > scattered around my file system.
> 
> [snipped some entertaining imagery of a drunk Russian throwing darts with
> his toes] ...
> 
> > UTF-8 (Unicode) allowed me to, well, unify all of that into one single
> > encoding that was readable without scrambling for whichever character
> > set I needed (and may or may not have installed.) Better yet, Unicode
> > usually displays just fine in HTML browsers - no special entity encoding
> > is required.
> 
> This leads me to some questions about character sets.  Sorry about the
> over-long text, but I didn't want to omit any important details, and the
> subject of fonts / charsets can get rather deep very quickly.

Yes indeed. Now, I'd like to note right up front that I am anything but an expert in fonts: I'm much more of a one-trick pony (although it's a hell of a good trick.) The major reason for my "success" with the font system is that I kept throwing myself against it until I understood enough of what was going on to start adjusting the process. I'm not quite up to that point with console fonts, though... one of these days, I'll go leave some more forehead prints on those bricks.

> In the earlier thread here, "Copyright Notice", you quoted a phrase in 
> German which had an umlauted "u" (hope that's the right terminology) in one 
> of the words.  My system (Ubuntu 5.10 [1]* with lots of added dev libs and 
> other packages) is set up for utf-8, and the following env vars are set:
> 
>     LESSCHARSET=utf-8
>     LANG=en_US.UTF-8
>     LANGUAGE=en_US.UTF-8
> 
> ---
> 
> Running locale on the system yields:
> 
>     LANG=en_US.UTF-8
>     LC_CTYPE="en_US.UTF-8"
>     LC_NUMERIC="en_US.UTF-8"
>     LC_TIME="en_US.UTF-8"
>     LC_COLLATE="en_US.UTF-8"
>     LC_MONETARY="en_US.UTF-8"
>     LC_MESSAGES="en_US.UTF-8"
>     LC_PAPER="en_US.UTF-8"
>     LC_NAME="en_US.UTF-8"
>     LC_ADDRESS="en_US.UTF-8"
>     LC_TELEPHONE="en_US.UTF-8"
>     LC_MEASUREMENT="en_US.UTF-8"
>     LC_IDENTIFICATION="en_US.UTF-8"
>     LC_ALL=

That's exactly what mine reports.

> The mutt pager correctly displays the non-ASCII character, as does vim.
> But my pager "less" (ver 382), showed it as a blank space.  I thought
> perhaps the installed version of "less"  was compiled w/o Unicode support,
> so I DL'd the most recent source I could find (ver 394) and compiled it.
> I looked for a Unicode related configuration option, but auto-make showed
> no special options at all, so nothing there.  However it seems to come a
> little closer by displaying the character as "<FC>" => "FC" in angle
> brackets, so it's at least recognizing it as a two byte character.

Have you checked it against Markus Kuhn's UTF-8 test file? It's available at http://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html (among other places), and is my first step in troubleshooting UTF-8 problems. The second step is to make sure that I have some good UTF-8 (iso10646) fonts installed (although these days, you get them as part of the 'xfonts-base' kit), and that my xterms use them as the default fonts. This may still be of use to anyone still using XFree86, though:

ben@Fenrir:~$ egrep '^xterm.*(utf8|font)' .Xresources 
xterm*font: -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
xterm*utf8:1
ben@Fenrir:~$ grep -- '-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1' /usr/share/fonts/X11/misc/fonts.dir 
9x18.pcf.gz -misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1
ben@Fenrir:~$ dlocate 9x18.pcf.gz
xfonts-base: /usr/share/fonts/X11/misc/9x18.pcf.gz

> Also, looking at the package's source file, "charset.c", utf-8 is listed
> as an element of the array "charset", so it seems that it should
> support utf-8.
> 
> Your message contained the following two headers:
> 
>     Content-Type: text/plain; charset="iso-8859-1"
>     Content-Transfer-Encoding: quoted-printable
> 
> 
> So my questions are:
> 
> 1) What's the difference between iso-8859-1 and utf-8?

The first one is the default charset for Latin-1; the interesting bit in comparing the two is that the first 256 chars of both ("codepoints", in Unicode parlance) are identical.

>     After Googling for an answer to that, I came across some info that led
>     me to understand that the difference is 8-bit vs. 16-bit.  Does utf /
>     Unicode then stand alone as 16-bit vs the rest of the character sets
>     like 8859, latin[123], etc?

There's a lot more to it; it's a big, complex issue. You can find a good explanation of it under 'perldoc perlunicode', in the "Unicode Regular Expression Support Level" section. Just as a starting point, UTF-8 is variable-length: the original design allowed 1 to 6 bytes per character, and all currently-allocated characters fit in at most 4. Also, take a look at 'perldoc Encode::Supported', which delineates the differences between the UTF* charsets.
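
To make the variable-length point concrete, here is a small sketch that feeds single code points to iconv(1) and counts the UTF-8 bytes that come out. 'utf8_len' is a made-up helper, the four sample code points are my own picks, and it assumes an iconv that supports UTF-32BE.

```shell
# Hypothetical helper: $1 is one code point written as four big-endian
# octal escapes (i.e. UTF-32BE); prints how many bytes its UTF-8 form needs.
utf8_len() {
    printf "$1" | iconv -f UTF-32BE -t UTF-8 | wc -c | tr -d ' '
}

len_a=$(utf8_len '\000\000\000\101')      # U+0041  LATIN CAPITAL LETTER A
len_uuml=$(utf8_len '\000\000\000\374')   # U+00FC  u-umlaut
len_euro=$(utf8_len '\000\000\040\254')   # U+20AC  EURO SIGN
len_clef=$(utf8_len '\000\001\321\036')   # U+1D11E MUSICAL SYMBOL G CLEF

echo "A: $len_a  u-umlaut: $len_uuml  euro: $len_euro  G clef: $len_clef"
```

The byte count grows with the code point value, which is why "8-bit vs. 16-bit" does not really describe the difference.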

> 1a) What's the main advantage of choosing Unicode over the others?
> 
>     Intuition says that a 16-bit charset covers more ground than one of
>     8-bits, which might lead to being a better bridge of the gap between
>     more languages, in many cases alleviating the need to switch between
>     different charsets when focusing on various different language
>     character sets.  Am I close?  Any other reasons?

That's the main reason I use it. There are others - e.g., one single charset that everybody can use for a reference would eliminate all the conversion errors and related communication problems in one swell foop (note that this does not mean all communication problems by far - but certainly a large class of them.) That, all by itself, is an invaluable benefit.

> 2) Is the bracketed "FC" an indication that it's attempting to display as
> utf-8, and simply not finding the character set?  Or is it indicative of
> something else, such as displaying it as some other character set?

I suspect it means that you're not actually using UTF-8. 2-byte Unicode characters, when read by something that's not Unicode-aware, look like - surprise! - two characters. Just to check:

ben@Fenrir:~$ perl -Mencoding=latin2,STDOUT,utf8 -wle 'print "\xFC"'
ü

Yep, that's the dude. At least in UTF-8. :) In ASCII, he's just plain old 'FC'.

> 3) Does it seem more likely that I'm overlooking a step in the compilation
> configuration or missing some other needed env var for the run time env?
> Or is this a known issue with GNU less?

Not in my case; it displays the above umlaut just fine when I hang it off the right side of a pipe.

ben@Fenrir:~$ less --version
less 394
Copyright (C) 1984-2005 Mark Nudelman

As I recall from way back when I started this "font adventure", there was some oddity in X that, despite the proper settings for LC_CTYPE, etc., it just would not do the right thing - until I put

export LANG=en_US.UTF-8

into my '~/.xinitrc'. I don't know if that's still applicable, but I've still got it in there.

> [1]* works wonderfully on this aging Dell I8100, best I've ever had a
> laptop run under Linux - at the end of the day I just suspend to RAM -
> haven't done a shutdown or reboot now in 14 days!  Probably mostly thanks
> to the improvements in the kernel suspend code, but they seem to have the
> ACPI scripting functioning very well too.  Hibernate hasn't proven to be
> quite as smooth though.

Mine just don't work, period. Bleagh. :((( To the best of my ability to figure it out, the ACPI on this Acer 2012 is so horrendously broken that it's not even worth trying to fix (although I'd downloaded Intel's ACPI compiler/decompiler, dutifully fixed all the errors, and shoved it all back in, it didn't seem to make any difference.) Well, this laptop is getting toward the end of its useful life... we'll see how the next one goes.

* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]
Sat, 9 Sep 2006 15:57:56 +0100

On Fri, Sep 08, 2006 at 10:47:59PM -0400, Benjamin A. Okopnik wrote:

> On Wed, Sep 06, 2006 at 07:53:12PM -0500, John Karns wrote:
> > On Mon, 4 Sep 2006, Benjamin A. Okopnik wrote:

Hi there,

Minor correction, or possibly "enhanced explanation"...

> Now, I'd like to note right up front that I am anything but
> an expert in fonts: 

...from someone in a similar position.

> > Running locale on the system yields:
> > 
> >     LANG=en_US.UTF-8
> >     LC_CTYPE="en_US.UTF-8"
> >     LC_NUMERIC="en_US.UTF-8"
> >     LC_TIME="en_US.UTF-8"
> >     LC_COLLATE="en_US.UTF-8"
> >     LC_MONETARY="en_US.UTF-8"
> >     LC_MESSAGES="en_US.UTF-8"
> >     LC_PAPER="en_US.UTF-8"
> >     LC_NAME="en_US.UTF-8"
> >     LC_ADDRESS="en_US.UTF-8"
> >     LC_TELEPHONE="en_US.UTF-8"
> >     LC_MEASUREMENT="en_US.UTF-8"
> >     LC_IDENTIFICATION="en_US.UTF-8"
> >     LC_ALL=
>  
> That's exactly what mine reports.

I have something analogous, except that I set LC_COLLATE to C because I think ABC looks better than AaB, and I like [a-z] to mean what I think it means.

> > The mutt pager correctly displays the non-ASCII character, as does vim.
> > But my pager "less" (ver 382), showed it as a blank space.  I thought

The mutt pager has (in theory at least -- I haven't checked for real) access to the Content-Type: of the text, so it can know to interpret it as iso-8859-1. vim probably auto-guesses the content encoding (which is not unreasonable for it to do, absent other instructions). less heeds the other instructions from the environment, and prints what it sees -- the single octet 0xFC. That's invalid UTF-8, and so it may have chosen to display it as broken or unknown -- "od -xc" or "cat -v" might have shown an alternate view of the same character.

> > no special options at all, so nothing there.  However it seems to come a
> > little closer by displaying the character as "<FC>" => "FC" in angle
> > brackets, so it's at least recognizing it as a two byte character.

The single octet 0xFC is not a valid UTF-8 character -- it is part of a multibyte sequence, which is almost certainly not correctly completed in your test file. The single octet 0xFC is a valid iso-8859-1 character, and it is u-umlaut. "man ascii" for the first-bit-zero characters (which, by not-quite-luck, are the same in ASCII, iso-8859-1, and UTF-8); "man iso_8859_1" for (most of) the first-bit-one characters in that encoding; and "man utf-8" for a description of how the first-bit-one characters are encoded in two octets in that encoding.

Your test above doesn't show it reading a two-byte character, by the way -- it read one byte, and wasn't able to display it properly, so showed you the hex-encoding of it.

("wasn't able to display" is probably because it was invalid UTF-8. But it could have been because your less didn't have access to a font that had the right thing for character 252. Unlikely, I'd guess.)

> > So my questions are:
> > 
> > 1) What's the difference between iso-8859-1 and utf-8?
> 
> The first one is the default charset for Latin-1; the interesting bit in
> comparing the two is that the first 256 chars of both ("codepoints", in
> Unicode parlance) are identical.

Just to clarify (hopefully): the characters 0 - 127 are the single octets 0x00 - 0x7F and are identical in each; the characters 128 - 255 are the same in each (not worrying about the ones below 160), but they are a single octet in iso-8859-1 and two octets in UTF-8.

So: Latin-1 says "character number 252 is u-umlaut". Unicode says "character number 252 is u-umlaut". iso-8859-1 says "character number 252 is the octet with decimal value 252, or 0xfc". UTF-8 says "character number 252 is the two octets 0xc3 0xbc".

If you're given an octet stream comprising 0xc3 0xbc, and you know by some other means that the stream is UTF-8, you can print u-umlaut. If you know by some other means that the stream is iso-8859-1, you can print A-tilde one-quarter (which is how those two characters would be encoded). If you don't know how the stream is encoded, you can flag that fact, or you can try guessing.

This is kind of the reverse of what you reported, where the octet stream contained 0xfc -- them that knew it was the right encoding could display it; them that guessed right could display it; them that knew it was the wrong encoding, or guessed wrong, or didn't guess, couldn't display it "correctly", and so made do.

Hopefully that explains why things looked the way they did.
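
The two cases above can be reproduced with a quick sketch, assuming iconv(1) and od(1) are available; the variable names are just for illustration.

```shell
# The single octet 0xFC, correctly read as iso-8859-1 and re-encoded as UTF-8.
as_utf8=$(printf '\374' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# The UTF-8 octets 0xC3 0xBC, wrongly read as iso-8859-1: each octet becomes
# its own character (A-tilde, one-quarter), and each of those needs two
# octets once re-encoded as UTF-8.
misread=$(printf '\303\274' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "0xfc read as iso-8859-1, written as UTF-8:      $as_utf8"
echo "0xc3 0xbc read as iso-8859-1, written as UTF-8: $misread"
```

The second line of output is the classic "double encoding" mess: four octets where two were intended.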

> There's a lot more to it; it's a big, complex issue. 

Oh so true.

> > 2) Is the bracketed "FC" an indication that it's attempting to display as
> > utf-8, and simply not finding the character set?  Or is it indicative of
> > something else, such as displaying it as some other character set?
> 
> I suspect it means that you're not actually using UTF-8. 2-byte Unicode
> characters, when read by something that's not Unicode-aware, look like -
> surprise! - two characters. Just to check:

My guess, as above, is that the content was encoded as iso-8859-1, but that less was told that it had been encoded as UTF-8.

> ben@Fenrir:~$ perl -Mencoding=latin2,STDOUT,utf8 -wle 'print "\xFC"'
> ü
> 
> Yep, that's the dude. At least in UTF-8. :) In ASCII, he's just plain
> old 'FC'.

ASCII stops at 7F. In iso-8859-1 (and probably some others too) he's FC.

> > 3) Does it seem more likely that I'm overlooking a step in the compilation
> > configuration or missing some other needed env var for the run time env?
> > Or is this a known issue with GNU less?

od can show you what the octets are.

Visual examination of the octets can make a fair guess at what the encoding really is. Tell less that that is the charset (man less for the various ways of doing that; LESSCHARSET is probably the easiest), and it should display it correctly.
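
If eyeballing the octets gets tedious, a rough first cut can be automated: ask iconv(1) whether the data decodes cleanly as UTF-8. This is a sketch only; 'looks_like' is a made-up name, and "not-utf-8" just means "some 8-bit charset, go look at the octets". Note that pure ASCII also passes as UTF-8.

```shell
# Hypothetical helper: reads stdin, prints "utf-8" if it decodes cleanly
# as UTF-8, otherwise "not-utf-8". Relies on iconv exiting non-zero when
# it meets an invalid or incomplete sequence.
looks_like() {
    if iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo utf-8
    else
        echo not-utf-8
    fi
}

guess_a=$(printf 'f\303\274r\n' | looks_like)   # u-umlaut as 0xC3 0xBC
guess_b=$(printf 'f\374r\n' | looks_like)       # u-umlaut as a lone 0xFC

echo "c3 bc version: $guess_a"
echo "fc version:    $guess_b"
```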

Good luck,

f

-- 
Francis Daly        francis at daoine.org




John Karns [johnkarns at gmail.com]
Sun, 10 Sep 2006 22:30:32 -0500 (COT)

On Sat, 9 Sep 2006, Francis Daly wrote:

> On Fri, Sep 08, 2006 at 10:47:59PM -0400, Benjamin A. Okopnik wrote:
>> On Wed, Sep 06, 2006 at 07:53:12PM -0500, John Karns wrote:
>>> On Mon, 4 Sep 2006, Benjamin A. Okopnik wrote:
>
> Hi there,

Hello!

> I have something analogous, except that I set LC_COLLATE to C because I
> think ABC looks better than AaB, and I like [a-z] to mean what I think
> it means.

Nice tip!

>>> The mutt pager correctly displays the non-ASCII character, as does vim.
>>> But my pager "less" (ver 382), showed it as a blank space.  I thought
>
> The mutt pager has (in theory at least -- I haven't checked for real)
> access to the Content-Type: of the text, so it can know to interpret it
> as iso-8859-1. vim probably auto-guesses the content encoding (which is
> not unreasonable for it to do, absent other instructions). less heeds
> the other instructions from the environment, and prints what it sees --
> the single octet 0xFC. That's invalid UTF-8, and so it may have chosen
> to display it as broken or unknown -- "od -xc" or "cat -v" might have
> shown an alternate view of the same character.
>
>>> no special options at all, so nothing there.  However it seems to come a
>>> little closer by displaying the character as "<FC>" => "FC" in angle
>>> brackets, so it's at least recognizing it as a two byte character.
>
> The single octet 0xFC is not a valid UTF-8 character -- it is part of a
> multibyte sequence, which is almost certainly not correctly completed in
> your test file. The single octet 0xFC is a valid iso-8859-1 character,
> and it is u-umlaut. "man ascii" for the first-bit-zero characters (which,
> by not-quite-luck, are the same in ASCII, iso-8859-1, and UTF-8); "man
> iso_8859_1" for (most of) the first-bit-one characters in that encoding;

This does indeed seem to be the case.

> and "man utf-8" for a description of how the first-bit-one characters
> are encoded in two octets in that encoding.

So in essence you're saying that text of the original message was in fact not utf-8, but 8859-1, as was specified in the "content type" header of that message. Makes sense.

> Your test above doesn't show it reading a two-byte character, by the way
> -- it read one byte, and wasn't able to display it properly, so showed
> you the hex-encoding of it.

Yeah, I was getting hung up on thinking in terms of a sequence of two ASCII chars rather than a hex representation of a single byte.

> ("wasn't able to display" is probably because it was invalid UTF-8. But
> it could have been because your less didn't have access to a font that
> had the right thing for character 252. Unlikely, I'd guess.)

As you suggest, it appears not to be the case, since the same less binary can display the character correctly in the man page.

> So: Latin-1 says "character number 252 is u-umlaut". Unicode says
> "character number 252 is u-umlaut". iso-8859-1 says "character number
> 252 is the octet with decimal value 252, or 0xfc". UTF-8 says "character
> number 252 is the two octets 0xc3 0xbc".
>
> If you're given an octet stream comprising 0xc3 0xbc, and you know by
> some other means that the stream is UTF-8, you can print u-umlaut. If
> you know by some other means that the stream is iso-8859-1, you can
> print A-tilde one-quarter (which is how those two characters would be
> encoded). If you don't know how the stream is encoded, you can flag that
> fact, or you can try guessing.
>
> This is kind of the reverse of what you reported, where the octet stream
> contained 0xfc -- them that knew it was the right encoding could display
> it; them that guessed right could display it; them that knew it was the
> wrong encoding, or guessed wrong, or didn't guess, couldn't display it
> "correctly", and so made do.
>
> Hopefully that explains why things looked the way they did.

[snip]

Your explanation seems to be accurate, at least as far as agreeing with the results of my experiments. I'll take your word for it about the text arriving in my mailbox as 8859 encoding.

I tried different methods of "exporting" the text from both pine and mutt. Pine offers an "export" option, as well as piping the text "raw text" or "free output". Depending on how you do it, the results vary for what I'll call the "extended character". The 1st method translates the char to "FC" as I noted in my original post. The 2nd ("raw text") translates it to a byte sequence, "0x3d 0x46 0x43" - as viewed within the hex display mode of the mc viewer.

Viewing the resultant files with less does not render the character as u-umlaut in any instance, including piping the text to "cat -v filename" (written as the same three-octet sequence as in the 2nd case above). Piping to "od -xc" agrees with the cat -v output. However, after editing / re-writing the text file created from Pine's export option using vim, the character became "0xc3 0xbc", which "less" correctly renders as u-umlaut, and oddly enough (to me), irrespective of the presence of the var LESSCHARSET. And if set, the result is the same whether it's set to iso8859 or utf-8.

Thanks to both Ben and Francis for helping to make this issue a little bit clearer.

-- 
John Karns




Benjamin A. Okopnik [ben at linuxgazette.net]
Mon, 11 Sep 2006 00:26:28 -0400

On Sun, Sep 10, 2006 at 10:30:32PM -0500, John Karns wrote:

> 
> Your explanation seems to be accurate, at least as far as agreeing with
> the results of my experiments.  I'll take your word for it about the text
> arriving in my mailbox as 8859 encoding.
> 
> I tried different methods of "exporting" the text from both pine and mutt.
> Pine offers an "export" option, as well as piping the text "raw text" or
> "free output".  Depending on how you do it, the results vary for what I'll
> call the "extended character".  The 1st method translates the char to "FC"
> as I noted in my original post.  The 2nd ("raw text") translates it to
> a byte sequence, "0x3d 0x46 0x43" - as viewed within the hex display mode
> of the mc viewer.

In Mutt, you can edit the Content-Type with '^E'. In my case, when I look at that email, I see the 'ü' just fine - but when I hit the 'v' key, the listing for that attachment looks like this:

 I     1 <no description>  			[text/plain, 8bit, iso-8859-1, 1.4K]

* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]
Mon, 11 Sep 2006 11:34:38 +0100

On Sun, Sep 10, 2006 at 10:30:32PM -0500, John Karns wrote:

> I tried different methods of "exporting" the text from both pine and mutt.
> Pine offers an "export" option, as well as piping the text "raw text" or
> "free output".  Depending on how you do it, the results vary for what I'll
> call the "extended character".  The 1st method translates the char to "FC"
> as I noted in my original post.  The 2nd ("raw text") translates it to
> a byte sequence, "0x3d 0x46 0x43" - as viewed within the hex display mode
> of the mc viewer.

The difference between those two is related to the joy of SMTP.

RFC 821 from 1982 said "SMTP data is 7-bit ASCII characters". Although that RFC has been obsoleted, SMTP cannot reliably be considered 8-bit clean.

RFC 2821 from 2001 adds "Service extensions may modify this rule to permit transmission of full 8-bit data bytes as part of the message body", and the 8BITMIME extension seems to be from 1995 or earlier, but all of that requires someone else's machine to be sane.

So rather than ship high-bit-set data and trust the SMTP servers to do the right[*] thing, the mail creator can encode the content to 7-bit data using one of a variety of methods.

[*] FSVO.

uuencode is a handy way -- I often use it for transferring small files from my machine -- "uuencode filename filename | xsel" on my machine, "uudecode" in a shell on remote and x-paste the selection. (C-a [ and C-a ] within screen can be used too, of course.)

base64-encoding is also used, and shares with uuencode the fact that all of the content is unreadable without decoding.
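
As a quick illustration of "unreadable without decoding", here is the same troublesome octet pushed through base64 and back. This assumes the coreutils-style 'base64' command with its '-d' option; the variable names are only for the demo.

```shell
# Encode the single octet 0xFC; the result is pure 7-bit ASCII.
b64=$(printf '\374' | base64)

# Decode it again and dump the octet as hex to confirm the round trip.
roundtrip=$(printf '%s' "$b64" | base64 -d | od -An -tx1 | tr -d ' \n')

echo "base64 form: $b64"
echo "decoded hex: $roundtrip"
```

One octet turns into four characters, none of which hints at a u-umlaut until it is decoded.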

A more common current way is officially called quoted-printable, but is also known by less complimentary terms. That was the one used here, according to headers sent earlier:

    Content-Type: text/plain; charset="iso-8859-1"
    Content-Transfer-Encoding: quoted-printable

The details aren't too important, but it turns one high-bit-set octet into three high-bit-unset octets, the first one of which is "=", which happens to be the 0x3d that you saw.

Url-encoding is another way of turning 8-bit into 7-bit characters, which is similar to quoted-printable, except that it uses % not = as its "here comes something different" character.

Those two share the feature that all ASCII content can be read without decoding, and if you only expect a handful of accented characters, you can probably get used to recognising them in their encoded form.
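
The escaping step that both schemes share can be sketched by hand for a single octet. This is not a general quoted-printable or URL encoder (real ones also deal with line lengths, "=" itself, spaces, and so on); 'escape_octet' is a made-up helper for illustration.

```shell
# Hypothetical helper: $1 is the escape prefix ("=" or "%"), $2 is one octet
# written as an octal printf escape. Prints the prefix plus two hex digits.
escape_octet() {
    hex=$(printf "$2" | od -An -tx1 | tr -d ' \n' | tr 'abcdef' 'ABCDEF')
    printf '%s%s' "$1" "$hex"
}

qp=$(escape_octet '=' '\374')    # quoted-printable style
url=$(escape_octet '%' '\374')   # url-encoding style

# The octets of the quoted-printable form, as a hex viewer would show them.
qp_octets=$(printf '%s' "$qp" | od -An -tx1 | tr -d ' \n')

echo "quoted-printable: $qp (octets: $qp_octets)"
echo "url-encoded:      $url"
```

The octets of the quoted-printable form match the three-byte sequence reported from the mc hex viewer.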

So apparently "raw text" is "the octets that were sent via SMTP", and "free output" is "the content-transfer-decoded version of same" (possibly with some extra MIME-decoding too -- but you'll only see that clearly if more than one attachment was included originally).

> Viewing the resultant files with less does not render the character as
> u-umlaut in any instance, including piping the text to "cat -v filename"
> (written as the same three octet sequence as the 2nd case above).

The "raw text" should contain three characters "=FC", while "free output" should contain a single octet 0xfc which is invalid UTF-8 unless followed by five octets with the high bits 10 (and that's beyond the range of Unicode). So that's to be expected -- there's no u-umlaut there, unless the viewer knows that it is iso-8859-1 encoded.

> Piping
> to "od -xc" agrees with the cat -v output.  However after editing /
> re-writing the text file created from Pine's export option using vim, the
> character became "0xc3 0xbc", which "less" correctly renders as u-umlaut,

So far so good -- vim has written the output as UTF-8 whatever its input was.

> and oddly enough (to me), irrespective of the presence of the var
> LESSCHARSET.  And if set, the result is the same whether it's set to
> iso8859 or utf-8.

That bit is a bit odd. With LESSCHARSET unset, the manpage says it will check some other envariables including LANG, which for you makes it UTF-8. But setting LESSCHARSET to iso8859 should cause it to display as A-tilde one-quarter, if I read the manpage right.

Doesn't work for me either, by the way. Perhaps I'm reading the manpage wrong, or have odd fonts here...

f

-- 
Francis Daly        francis at daoine.org

