Alien Tongues

umit [umit at aim4media.com]

Fri, 7 Mar 2008 10:28:59 +0100

[[[ This originally had a subject line that allegedly referred to an old LG (issue 86!) tips. It didn't bear much resemblance in the Linux-relevant portions of the thread, and even less here in Launderette. -- Kat ]]]

Selamlar, Ben Umit , Linuxgazettde tipslerde sizin isminizi gordum ve bir soru soruyum dedim ariyordum da bi neticelik. Siz nasil Debian ayarlarina yapacagini biliyormusunuz? Birde baska bir sorum olacak. Server usb dvd romu gormuyor. Nasil ayarlamam gerekir?

-- 
Aim4Media BV | Achter 't Veer 34 | 4191 AD | Geldermalsen | the Netherlands
| T.: +31 3456 222 71 | 
F.: +31 3456 222 81 | MSN.:umitkaya@live.nl | 
WWW.: http://www.aim4media.com | @.: umit@aim4media.com


Ben Okopnik [ben at linuxgazette.net]

Fri, 7 Mar 2008 07:54:03 -0500

On Fri, Mar 07, 2008 at 10:28:59AM +0100, umit wrote:

> 
>    Selamlar,
> 
>    Ben Umit , Linuxgazettde tipslerde sizin isminizi gordum ve bir soru
>    soruyum dedim ariyordum da bi neticelik.
> 
>    Siz nasil Debian ayarlarina yapacagini biliyormusunuz?
> 
>    Birde baska bir sorum olacak. Server usb dvd romu gormuyor. Nasil
>    ayarlamam gerekir?

Well, this adds another one to the TAG "non-English posts" collection. I don't think we've ever had Turkish here before - and as far as I know, we don't have anyone here who can help translate it. Although Rick may surprise me.

Umit: the Linux Gazette has a world-wide distribution [1], but we don't have a staff of translators that can handle any language. If you could rephrase your (obviously Linux-related) question in English, we might be able to help you.

[1] Last year, I got very curious about who's reading us and where they're coming from, and did some serious parsing on the LG Web server log files. Millions of readers, from literally every country (figured by source IP) I could think of. Nifty fun.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Neil Youngman [ny at youngman.org.uk]

Fri, 7 Mar 2008 14:54:46 +0000

On Friday 07 March 2008 09:28, umit wrote:

> Selamlar,
>
> Ben Umit , Linuxgazettde tipslerde sizin isminizi gordum ve bir soru
> soruyum dedim ariyordum da bi neticelik.
> Siz nasil Debian ayarlarina yapacagini biliyormusunuz?
> Birde baska bir sorum olacak. Server usb dvd romu gormuyor. Nasil ayarlamam
> gerekir?

http://www.stars21.com/translator/turkish_to_english.html offers us

"The translation of your request is:

Fish over it, sizes terminology Linuxgazett Fish tipsler Fish Brew Dum And the question closes one question Of reconstruction Iyor Dum There bi neticelik.

Siz Nasil produces flow moment ayarlarina Acagini biliyormu is your water ?

My one question in the unit baska will be. Videodisk usb romu brews doesn't announce waiters. Nasil My adjustment gerek ?"

Also http://www.tranexp.com:2000/Translate/result.shtml gives

"Mole Umit Linuxgazettde tipslerde your isminizi gordum and one each question question dedim ariyordum even if bi consequence. You advice Debian ayarlarina yapacagini biliyormusunuz? Suddenly basket one each responsibility which will happen. Wealth usb dvd romu gormuyor. Advice regulation kitty"

I hope that's clear?

Neil


Rick Moen [rick at linuxmafia.com]

Fri, 7 Mar 2008 08:44:21 -0800

Neil Youngman (ny@youngman.org.uk) was attempting to translate "umit's" post:

> My one question in the unit baska will be.

Aye, that was bothering me, too.

Neil also commented:

> Although Rick may surprise me.

Alas, honestly, I'm as mystified as the rest of us lot. The Turks are a charming people, and I always enjoy visiting their country, and I'd love to learn their language some day, but that day is not today.

-- 
Cheers,                                             "Reality is not optional."
Rick ("But speaks passable Glaswegian") Moen              -- Thomas Sowell
rick@linuxmafia.com


Ben Okopnik [ben at linuxgazette.net]

Fri, 7 Mar 2008 19:20:42 -0500

On Fri, Mar 07, 2008 at 08:44:21AM -0800, Rick Moen wrote:

> Neil Youngman (ny@youngman.org.uk) was attempting to translate "umit's"
> post:
> 
> > My one question in the unit baska will be.
> 
> Aye, that was bothering me, too.
> 
> 
> Neil also commented:
> 
> > Although Rick may surprise me.

That was me, actually. I just figured: you spent, what, a couple of days there at least? Learning the language (either before or after you've revised the government, made the streets safer, and raised the standard of living by some large factor[1], natch) seems like the obvious next thing.

> Alas, honestly, I'm as mystified as the rest of us lot.  The Turks are a
> charming people, and I always enjoy visiting their country, and I'd love to
> learn their language some day, but that day is not today.
> 
> -- 
> Cheers,                                             "Reality is not optional."
> Rick ("But speaks passable Glaswegian") Moen              -- Thomas Sowell

^^^^^^^^^^

Well, that shouldn't surprise anybody. It's 2/3 Norwegian but a little more transparent, right?

[1] Admittedly, doing it after would have added a little zing to the other tasks.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Ben Okopnik [ben at linuxgazette.net]

Fri, 7 Mar 2008 12:10:14 -0500

On Fri, Mar 07, 2008 at 02:54:46PM +0000, Neil Youngman wrote:

> 
> http://www.stars21.com/translator/turkish_to_english.html offers us
> 
> "The translation of your request is:
> 
> Fish over it, sizes terminology Linuxgazett Fish tipsler Fish Brew Dum And the 
> question closes one question Of
> reconstruction
> Iyor Dum There bi neticelik.
> 
> Siz Nasil produces flow moment ayarlarina Acagini biliyormu is your water ?
> 
> My one question in the unit baska will be. Videodisk usb romu brews doesn't 
> announce waiters. Nasil My adjustment
> gerek
> ?"

I'm sorry - that was Turkish to *what*??? Even Engrish doesn't come close to this.

http://www.engrish.com/

> Also http://www.tranexp.com:2000/Translate/result.shtml gives
> 
> "Mole Umit Linuxgazettde tipslerde your isminizi gordum and one each question 
> question dedim ariyordum even if bi consequence. You advice Debian ayarlarina 
> yapacagini biliyormusunuz? Suddenly basket one each responsibility which will 
> happen. Wealth usb dvd romu gormuyor. Advice regulation kitty"

Ah, much better - particularly the "Advice regulation kitty" bit. When combined with the "Fish over it" and "is your water?", it makes for a panorama of breathtaking scope - something to gladden the heart of anyone with a severe drug problem.

> I hope that's clear?

Transparent as, uh, a very transparent thing. The question of what it's being clear about is one that we'll politely leave aside for another time (i.e., when the drinks at the local watering hole are 2-for-1.)

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Rick Moen [rick at linuxmafia.com]

Fri, 7 Mar 2008 10:47:49 -0800

Quoting Ben Okopnik (ben@linuxgazette.net):

> Ah, much better - particularly the "Advice regulation kitty" bit. When
> combined with the "Fish over it" and "is your water?", it makes for a
> panorama of breathtaking scope - something to gladden the heart of
> anyone with a severe drug problem.

You know, the original post looks Turkish, and it's the most frequently encountered language of that family, but there are quite a number of other, related tongues. Do you happen to have any old school chums who speak Chuvash or Dolgan? ;->

-- 
Cheers,                            "To summarize the summary of the summary:
Rick Moen                           People are a problem."
rick@linuxmafia.com                                       -- Douglas Adams


Ben Okopnik [ben at linuxgazette.net]

Fri, 7 Mar 2008 19:30:41 -0500

On Fri, Mar 07, 2008 at 10:47:49AM -0800, Rick Moen wrote:

> Quoting Ben Okopnik (ben@linuxgazette.net):
> 
> > Ah, much better - particularly the "Advice regulation kitty" bit. When
> > combined with the "Fish over it" and "is your water?", it makes for a
> > panorama of breathtaking scope - something to gladden the heart of
> > anyone with a severe drug problem.
> 
> You know, the original post looks Turkish, and it's the most
> frequently encountered language of that family, but there are quite a
> number of other, related tongues.  Do you happen to have any old school
> chums who speak Chuvash or Dolgan?  ;->

I've heard Chuvash spoken a couple of times - that was about 40 years ago. That's as close as I can get. Never even heard of the Dolgans, even though Wikipedia tells me they're in geographical proximity - that's not surprising, though, since there are only 5000 speakers of it (Wikipedia again.) But those folks would have used (modified) Cyrillic rather than the Latin alphabet.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Sat, 8 Mar 2008 15:15:36 +0000

On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:

> On Fri, Mar 07, 2008 at 02:54:46PM +0000, Neil Youngman wrote:
>  >
>  > http://www.stars21.com/translator/turkish_to_english.html offers us
>
> >
>  > "The translation of your request is:
>  >
>  > Fish over it, sizes terminology Linuxgazett Fish tipsler Fish Brew Dum And the
>  > question closes one question Of
>  > reconstruction
>  > Iyor Dum There bi neticelik.
>  >
>  > Siz Nasil produces flow moment ayarlarina Acagini biliyormu is your water ?
>  >
>  > My one question in the unit baska will be. Videodisk usb romu brews doesn't
>  > announce waiters. Nasil My adjustment
>  > gerek
>  > ?"
>
>
> I'm sorry - that was Turkish to *what*??? Even Engrish doesn't come
>  close to this.
>

Ooh... that question demonstrates at least one major problem in machine translation[1]: missing diacritics[2] -- at Apertium, we're trying to get involved in GSoC, and diacritic restoration is one of the projects we want someone to work on.

Also, there's at least one word ('Linuxgazettde') that's not spelled correctly[3], and (as a named entity) not likely to be recognised - which means nothing in the sentence is likely to be correctly translated.

A Turkish to English translator is a pretty ambitious thing to try to make, given the distance between the languages, even when dealing with text that's well written.

[1] There are lots...

[2] Assuming it is Turkish.

[3] The 'de' is the locative case suffix, according to Wikipedia; but I'm referring to the fact that 'Linux Gazette' is two words.


Ben Okopnik [ben at linuxgazette.net]

Sun, 9 Mar 2008 18:09:29 -0400

On Sat, Mar 08, 2008 at 03:15:36PM +0000, Jimmy O'Regan wrote:

> On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
> >
> > I'm sorry - that was Turkish to *what*??? Even Engrish doesn't come
> >  close to this.
> 
> Ooh... that question demonstrates at least one major problem in
> machine translation[1]: missing diacritics[2] -- at Apertium, we're
> trying to get involved in GSoC, and diacritic restoration is one of
> the projects we want someone to work on.

That would be pretty tough for Russian. Even in as short of a span as the last 30 years or so, I've seen the 'ë' ('yo') almost completely elided in favor of the 'e' ('ye') in written Russian, at least in any writing that is not explicitly formal. Even magazines and newspapers, for the most part, have done so. Mind you, the pronunciation of the words has remained exactly the same - a pain in the ass when you're learning Russian, as Kat is doing nowadays (I have to point out every instance where this applies as she reads; amazingly enough, she seems to be making some kind of sense out of that mess.)

I don't even know if you could do it as a dictionary lookup, either. I can't think of an example off the top of my head, but I'm sure there are words that you can spell with either one and produce a real word in both cases - but with different meanings.
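To make the dictionary-lookup worry concrete, here is a minimal sketch of restoring 'ё' from a word list. The five-word lexicon and the function names are assumptions made up for illustration, not anything from Apertium. It uses the pair все/всё ("all" versus "everything") as a case where both spellings are real words, so a plain lookup can only offer candidates and something else has to pick between them.

```python
from collections import defaultdict

# Tiny illustrative lexicon (an assumption, not a real word list).
LEXICON = ["всё", "все", "ёлка", "мёд", "лес"]

def strip_yo(word):
    """Replace 'ё' with 'е', as informal written Russian usually does."""
    return word.replace("ё", "е").replace("Ё", "Е")

def build_index(lexicon):
    """Map each stripped spelling to every lexicon word that collapses onto it."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_yo(word)].add(word)
    return index

def restore(word, index):
    """Return sorted candidate spellings; unknown words come back unchanged."""
    return sorted(index.get(strip_yo(word), {word}))

INDEX = build_index(LEXICON)
print(restore("елка", INDEX))  # a single candidate: safe to restore
print(restore("все", INDEX))   # two candidates: the lookup alone cannot decide
```

With one candidate the restoration is mechanical; with two, the choice needs context, which is where rules or statistics would have to come in.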

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Mon, 10 Mar 2008 15:55:14 +0000

On 09/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:

> On Sat, Mar 08, 2008 at 03:15:36PM +0000, Jimmy O'Regan wrote:
>  > On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
>  > >
>
> > > I'm sorry - that was Turkish to *what*??? Even Engrish doesn't come
>  > >  close to this.
>  >
>  > Ooh... that question demonstrates at least one major problem in
>  > machine translation[1]: missing diacritics[2] -- at Apertium, we're
>  > trying to get involved in GSoC, and diacritic restoration is one of
>  > the projects we want someone to work on.
>
>
> That would be pretty tough for Russian. Even in as short of a span as
>  the last 30 years or so, I've seen the 'ë' ('yo') almost completely
>  elided in favor of the 'e' ('ye') in written Russian, at least in any
>  writing that is not explicitly formal. Even magazines and newspapers,
>  for the most part, have done so. Mind you, the pronunciation of the
>  words has remained exactly the same - a pain in the ass when you're
>  learning Russian, as Kat is doing nowadays (I have to point out every
>  instance where this applies as she reads; amazingly enough, she seems to
>  be making some kind of sense out of that mess.)
>

Actually, Russian is much easier in this regard than most languages, because it's just one character.

(And let me sympathise with Kat, because my Russian is still at the 'reading like a five-year-old' stage.)

In Apertium, we use mini paradigms to handle that (for example, my English-Polish dictionary uses British English internally, but the English morphological dictionary has a 'compile time option'[1] to generate American forms).

>  I don't even know if you could do it as a dictionary lookup, either.
>  I can't think of an example off the top of my head, but I'm sure there
>  are words that you can spell with either one and produce a real word in
>  both cases - but with different meanings.
>

Well, the same thing happens in most languages; our part of speech disambiguation module is primarily rule-based (if we encounter "I'd", it'll select 'I would' if the next word is 'go', and 'I had' if the next word is 'gone', because those sequences are enforced, for example), but it also supports statistical modelling, so it attempts to pick the correct word in ambiguous cases based on collocations and frequency, given a bilingual corpus.
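As a rough illustration of the "I'd" rule described above (a sketch only; the word lists and the function name are hypothetical, and a real system would draw on its morphological dictionaries rather than hard-coded sets):

```python
# Hypothetical word lists standing in for real dictionary lookups.
BASE_FORMS = {"go", "like", "see", "say"}
PAST_PARTICIPLES = {"gone", "liked", "seen", "said"}

def expand_id(tokens):
    """Expand "I'd" using the following word; leave it alone when no rule fires."""
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if tok == "I'd" and nxt in BASE_FORMS:
            out += ["I", "would"]      # "I'd go"   -> "I would go"
        elif tok == "I'd" and nxt in PAST_PARTICIPLES:
            out += ["I", "had"]        # "I'd gone" -> "I had gone"
        else:
            out.append(tok)            # no rule applies: pass through
    return out

print(" ".join(expand_id("I'd go home".split())))
print(" ".join(expand_id("I'd gone home".split())))
```

Anything the rules do not cover falls through untouched, which is the gap a statistical model would be expected to fill.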

Of course, a few of us don't like that particular method of working: we're hoping to get a GSoC student to extend it so we can manually specify collocations etc. Statistical translation is great for language pairs that have huge corpora available, but most don't[2]; and better for people like Google[3] who have access to massive corpora.

But ambiguity problems are unavoidable. Oh heck, let me just throw one word at you - стул[4]. Or the sentence 'I read'.

[1] It's really an XSLT stylesheet that chooses one of the modes, but for want of a better term...

[2] Also, the available corpora may not be particularly representative of normal language; the largest English-Polish (-pretty much any other EU language) corpus is the JRC Acquis, which is based on EU legal proceedings, which means anything translated by a system trained on it will come out as legalese :/

[3] There are also a few inherent problems in statistical translation that can't be helped: http://www.google.com/translate?u=http%3A%2F%2Fel[...] will say 'The English edition of the Encyclopaedia currently includes 32,538 articles.' for 'Η ελληνική έκδοση της εγκυκλοπαίδειας περιλαμβάνει αυτή τη στιγμή 32.537 άρθρα.' (That should, of course, be 'the Greek edition...') You can use their 'Suggest a better translation' feature, but you'd have to do it a few hundred (if not thousand) times for it to have any effect.

[4] means 'chair' or 'stool' (as in 'stool sample'); different plural forms, identical singular forms.


Ben Okopnik [ben at linuxgazette.net]

Mon, 10 Mar 2008 19:38:34 -0400

On Mon, Mar 10, 2008 at 03:55:14PM +0000, Jimmy O'Regan wrote:

> On 09/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
> > On Sat, Mar 08, 2008 at 03:15:36PM +0000, Jimmy O'Regan wrote:
> >  > On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
> >
> > > > I'm sorry - that was Turkish to *what*??? Even Engrish doesn't come
> >  > >  close to this.
> >  >
> >  > Ooh... that question demonstrates at least one major problem in
> >  > machine translation[1]: missing diacritics[2] -- at Apertium, we're
> >  > trying to get involved in GSoC, and diacritic restoration is one of
> >  > the projects we want someone to work on.
> >
> > That would be pretty tough for Russian. Even in as short of a span as
> >  the last 30 years or so, I've seen the 'ë' ('yo') almost completely
> >  elided in favor of the 'e' ('ye') in written Russian, at least in any
> >  writing that is not explicitly formal. Even magazines and newspapers,
> >  for the most part, have done so. Mind you, the pronunciation of the
> >  words has remained exactly the same - a pain in the ass when you're
> >  learning Russian, as Kat is doing nowadays (I have to point out every
> >  instance where this applies as she reads; amazingly enough, she seems to
> >  be making some kind of sense out of that mess.)
> 
> Actually, Russian is much easier in this regard than most languages,
> because it's just one character

Nope. There's 'ë', 'й' (i-kratkaya), and 'щ' (sche), diacrits all.

> (And let me sympathise with Kat, because my Russian is still at the
> 'reading like a five-year-old stage

Hey, that's not bad! Some five-year-olds are very capable readers.

> >  I don't even know if you could do it as a dictionary lookup, either.
> >  I can't think of an example off the top of my head, but I'm sure there
> >  are words that you can spell with either one and produce a real word in
> >  both cases - but with different meanings.
> 
> Well, the same thing happens in most languages; our part of speech
> disambiguation module is primarily rule-based (if we encounter "I'd",
> it'll select 'I would' if the next word is 'go', and 'I had' if the
> next word is 'gone', because those sequences are enforced, for
> example), but it also supports statistical modelling, so it attempts
> to pick the correct word in ambiguous cases based on collocations and
> frequency, given a bilingual corpus.

Oh - nice. I hadn't realized that it was that sophisticated.

> Of course, a few of us don't like that particular method of working:
> we're hoping to get a GSoC student to extend it so we can manually
> specify collocations etc. Statistical translation is great for
> language pairs that have huge corpora available, but most don't[2];
> and better for people like Google[3] who have access to massive
> corpora.
> 
> But ambiguity problems are unavoidable. Oh heck, let me just throw one
> word at you - стул[4]. Or the sentence 'I read'.

Oh, heck - the classic "The spirit is willing, but the flesh is weak"/"The vodka is good but the meat is rotten" problem is going to be with us for a long, long time.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Tue, 11 Mar 2008 02:18:30 +0000

On 10/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:

> On Mon, Mar 10, 2008 at 03:55:14PM +0000, Jimmy O'Regan wrote:
>  > On 09/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
>  > > On Sat, Mar 08, 2008 at 03:15:36PM +0000, Jimmy O'Regan wrote:
>  > >  > On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
>  > >
>  > > That would be pretty tough for Russian. Even in as short of a span as
>  > >  the last 30 years or so, I've seen the 'ë' ('yo') almost completely
>  > >  elided in favor of the 'e' ('ye') in written Russian, at least in any
>  > >  writing that is not explicitly formal. Even magazines and newspapers,
>  > >  for the most part, have done so. Mind you, the pronunciation of the
>  > >  words has remained exactly the same - a pain in the ass when you're
>  > >  learning Russian, as Kat is doing nowadays (I have to point out every
>  > >  instance where this applies as she reads; amazingly enough, she seems to
>  > >  be making some kind of sense out of that mess.)
>  >
>  > Actually, Russian is much easier in this regard than most languages,
>  > because it's just one character 
>
>
> Nope. There's 'ë', 'й' (i-kratkaya), and 'щ' (sche), diacrits all.
>

Dang

Eh... it's still not too bad. In Polish, verbs ending in -ować take the endings -uję for first person and -uje for third person in the present tense - that's roughly 5000 verbs in my morphological analyser. That would be ok if the personal pronoun was there to disambiguate, but in Polish it tends not to be.

>
>  > (And let me sympathise with Kat, because my Russian is still at the
>  > 'reading like a five-year-old stage 
>
>
> Hey, that's not bad! Some five-year-olds are very capable readers. 
>

Heh. True. Put me in the class with the average five-year-olds: I can follow things fairly well, but reading aloud is... a bit beyond me. It got a lot easier after I started to spend more time around some native and near-native speakers. In my last few weeks, one of the Latvian girls took to telling me things in Russian before she told me what it was in English.

>
>  > >  I don't even know if you could do it as a dictionary lookup, either.
>  > >  I can't think of an example off the top of my head, but I'm sure there
>  > >  are words that you can spell with either one and produce a real word in
>  > >  both cases - but with different meanings.
>  >
>  > Well, the same thing happens in most languages; our part of speech
>  > disambiguation module is primarily rule-based (if we encounter "I'd",
>  > it'll select 'I would' if the next word is 'go', and 'I had' if the
>  > next word is 'gone', because those sequences are enforced, for
>  > example), but it also supports statistical modelling, so it attempts
>  > to pick the correct word in ambiguous cases based on collocations and
>  > frequency, given a bilingual corpus.
>
>
> Oh - nice. I hadn't realized that it was that sophisticated.
>

Well, partly. We don't have a real parser: our transfer rules are basically converted into regexes, which sucks from one point of view - that any rule that has to take into account adjectives or adverbs has to exist N times for N possible amounts[1] - but it's bloody fast, and... well, full parse systems I've seen have some oddities, like taking words from within parentheses and placing them outside(???).
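To picture the "N rules for N adjectives" cost, here is a toy sketch that generates one fixed-length pattern per adjective count. It assumes a space-separated tag sequence with made-up tag names (DET, ADJ, NOUN); it is not Apertium's actual rule format, just the general shape of the problem.

```python
import re

def make_rules(max_adjectives):
    """Build one fixed-length pattern per adjective count (0..max_adjectives)."""
    rules = []
    for n in range(max_adjectives + 1):
        tags = ["DET"] + ["ADJ"] * n + ["NOUN"]
        rules.append(re.compile(" ".join(tags)))
    return rules

def matching_rule(tag_sequence, rules):
    """Return the adjective count of the rule matching the whole sequence, or None."""
    for n, rule in enumerate(rules):
        if rule.fullmatch(tag_sequence):
            return n
    return None

RULES = make_rules(3)                              # four separate rules for 0..3 adjectives
print(matching_rule("DET ADJ ADJ NOUN", RULES))
print(matching_rule("DET ADJ ADJ ADJ ADJ NOUN", RULES))
```

A phrase with more adjectives than the largest generated rule simply matches nothing, which is the practical limit of fixed-length patterns.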

The same tools can also guess the likely part of speech of an unknown word once trained; it's not integrated (and I'd prefer if it did stem checking: knowing about the ending -ing in the sentence 'he is qrbvzing' to trigger the rule for a verb rather than an adjective, for example) but it could really make a difference.

We also have the capability to use statistical phrase alignment tools like Giza++ to generate transfer rules from a corpus; the advantage is that not only can we fine tune them afterwards, but we can also take a small set of sentences that have been manually disambiguated, and generate the rules from them.

But... I can't deny that Statistical MT is cool. There's a great open source system called Moses (http://www.statmt.org/moses/); if anyone ever finds the need for, say, a Maltese to Greek translator, and can spare the 16-odd hours of processing time, they can get a decent enough translation without needing the extras.

>
>  > Of course, a few of us don't like that particular method of working:
>  > we're hoping to get a GSoC student to extend it so we can manually
>  > specify collocations etc. Statistical translation is great for
>  > language pairs that have huge corpora available, but most don't[2];
>  > and better for people like Google[3] who have access to massive
>  > corpora.
>  >
>  > But ambiguity problems are unavoidable. Oh heck, let me just throw one
>  > word at you - стул[4]. Or the sentence 'I read'.
>
>
> Oh, heck - the classic "The spirit is willing, but the flesh is
>  weak"/"The vodka is good but the meat is rotten" problem is going to be
>  with us for a long, long time.
>

I was talking to a professional translator last week, who had a whole list of those. (http://en.wikipedia.org/wiki/Moses#Horned_Moses was one) The best one he gave was of an American president (Carter, maybe) who visited Poland with a translator who was not a native speaker, and mistranslated his opening sentence from 'Yesterday, when I left America' to something equivalent to 'Yesterday, when I defected from America'.

[1] Of course, that could probably be fixed easily enough with some XSL or Perl...


Samuel Bisbee-vonKaufmann [sbisbee at computervip.com]

Sun, 09 Mar 2008 23:04:24 +0000

>On Sat, Mar 08, 2008 at 03:15:36PM +0000, Jimmy O'Regan wrote:
>> On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
>> >
>> > I'm sorry - that was Turkish to *what*??? Even Engrish doesn't come
>> >  close to this.
>> 
>> Ooh... that question demonstrates at least one major problem in
>> machine translation[1]: missing diacritics[2] -- at Apertium, we're
>> trying to get involved in GSoC, and diacritic restoration is one of
>> the projects we want someone to work on.
>
>That would be pretty tough for Russian. Even in as short of a span as
>the last 30 years or so, I've seen the 'ë' ('yo') almost completely
>elided in favor of the 'e' ('ye') in written Russian, at least in any
>writing that is not explicitly formal. Even magazines and newspapers,
>for the most part, have done so. Mind you, the pronunciation of the
>words has remained exactly the same - a pain in the ass when you're
>learning Russian, as Kat is doing nowadays (I have to point out every
>instance where this applies as she reads; amazingly enough, she seems to
>be making some kind of sense out of that mess.)
>
>I don't even know if you could do it as a dictionary lookup, either.
>I can't think of an example off the top of my head, but I'm sure there
>are words that you can spell with either one and produce a real word in
>both cases - but with different meanings.
>

One might solve this problem with a UI solution: allowing the user to choose which translation makes more sense. Hopefully the other words in the sentence had been translated well enough that someone could fill in the blank sensibly from a few choices. Common spell checking implementations would provide a good template.

Of course the choices provided may be too close to one another, but that takes us back to some of the common problems with natural language processing and translation. For example, cultures like to imbue words and their combinations with special, non-obvious meanings that cannot yet be taught to machines and may vary across geographic regions of a few miles (villages, etc.).

That or there is probably a character set and multinationalization argument in there somewhere, 'cause software developers and their orgs/companies are so well known for following standards. ahem

-- 
Sam Bisbee


Jimmy O'Regan [joregan at gmail.com]

Mon, 10 Mar 2008 16:34:40 +0000

On 09/03/2008, Samuel Bisbee-vonKaufmann <sbisbee@computervip.com> wrote:

> >On Sat, Mar 08, 2008 at 03:15:36PM +0000, Jimmy O'Regan wrote:
>  >> On 07/03/2008, Ben Okopnik <ben@linuxgazette.net> wrote:
>  >> >
>  >> > I'm sorry - that was Turkish to *what*??? Even Engrish doesn't
>  >> > come close to this.
>  >>
>  >> Ooh... that question demonstrates at least one major problem in
>  >> machine translation[1]: missing diacritics[2] -- at Apertium,
>  >> we're trying to get involved in GSoC, and diacritic restoration is
>  >> one of the projects we want someone to work on.
>  >
>  >That would be pretty tough for Russian. Even in as short of a span
>  >as the last 30 years or so, I've seen the 'ë' ('yo') almost
>  >completely elided in favor of the 'e' ('ye') in written Russian, at
>  >least in any writing that is not explicitly formal. Even magazines
>  >and newspapers, for the most part, have done so. Mind you, the
>  >pronunciation of the words has remained exactly the same - a pain in
>  >the ass when you're learning Russian, as Kat is doing nowadays (I
>  >have to point out every instance where this applies as she reads;
>  >amazingly enough, she seems to be making some kind of sense out of
>  >that mess.)
>  >
>  >I don't even know if you could do it as a dictionary lookup, either.
>  >I can't think of an example off the top of my head, but I'm sure
>  >there are words that you can spell with either one and produce a
>  >real word in both cases - but with different meanings.
>  >
>
>
> One might solve this problem with a UI solution: allowing the user to
> choose which translation makes more sense. Hopefully the other words
> in the sentence had been translated well enough that someone could
> fill in the blank sensibly from a few choices. Common spell checking
> implementations would provide a good template.
>

In the unambiguous cases, having a spell checker in the UI can probably help; in the ambiguous cases, it can be a hindrance: given that the usual users of machine translators are translating from an unfamiliar language, they may just pick the wrong gloss, whereas the machine translator at least has a chance of getting the correct meaning if the sentence is long enough to provide disambiguation.

One of the enhancements I hope to get around to making to our GUI (apt-get install apertium-tolk) is a version of the DicLookUp feature (http://xixona.dlsi.ua.es/apertium-www/?id=lookup) on the Apertium site, which will pop up a list of possible glosses when you click on a word.

>  Of course the choices provided may be too close to one another, but
>  that takes us back to some of the common problems with natural
>  language processing and translation. Example, that cultures like to
>  imbue words and their combinations with special, non-obvious meanings
>  that cannot yet be taught to machines and may vary depending on
>  geographic regions of a few miles (villages, etc.).
>

Kinda. But that gets humans too. Once, a Polish girl I know told me, talking about our former boss[1] 'I have him in my ass'.

Eh???

Then I translated it back to Polish in my head: 'mam go w dupie' -> 'I don't give a shit about him'.

Ah.

Particular combinations of words aren't really a problem. Take it from someone who makes bilingual dictionaries ;) They are tedious, though.

>  That or there is probably a character set and multinationalization
>  argument in there somewhere, 'cause software developers and their
>  orgs/companies are so well known for following standards. ahem

Eh... Standards help, sometimes. Sometimes they don't. There are something like 20 standard ways of transliterating Russian Cyrillic to Latin (and that's just for an English audience) - and transliterating unknown words is necessary (hopefully, it'll be a person's name). My solution - when I have time to work on Russian - will be to provide a set of transliterators for the major standards and allow the user to choose, but the current ISO scheme is my personal favourite, so that'll be the default.
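A small sketch of what "a set of transliterators with a default" might look like. The ISO table follows ISO 9:1995 for the 33 modern Russian letters as best I know it; the second table is an informal English-style scheme invented here for contrast, not any one published standard. The function and table names are assumptions.

```python
CYRILLIC = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# ISO 9:1995: one Latin character per Cyrillic letter.
ISO9 = dict(zip(CYRILLIC,
                "a b v g d e ë ž z i j k l m n o p r s t u f h c č š ŝ ʺ y ʹ è û â".split()))

# Informal English-style digraphs (illustrative only, not a published standard).
INFORMAL = {**ISO9, "ё": "yo", "ж": "zh", "й": "y", "х": "kh", "ц": "ts",
            "ч": "ch", "ш": "sh", "щ": "shch", "ъ": "", "ь": "",
            "э": "e", "ю": "yu", "я": "ya"}

SCHEMES = {"iso9": ISO9, "informal": INFORMAL}

def transliterate(text, scheme="iso9"):
    """Transliterate Russian Cyrillic to Latin; ISO 9 is the default scheme."""
    table = SCHEMES[scheme]
    out = []
    for ch in text:
        low = ch.lower()
        if low not in table:
            out.append(ch)                      # pass non-Cyrillic through unchanged
            continue
        latin = table[low]
        out.append(latin.capitalize() if ch != low else latin)
    return "".join(out)

print(transliterate("Пушкин"))               # ISO 9 default
print(transliterate("Пушкин", "informal"))   # user-selected alternative
```

Keeping the schemes as plain tables means adding another standard is a matter of adding one more dictionary.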

[1] Oh... I quit the hellhole


Jimmy O'Regan [joregan at gmail.com]

Tue, 11 Mar 2008 22:20:40 +0100

On 10/03/2008, Jimmy O'Regan <joregan@gmail.com> wrote:

> On 09/03/2008, Samuel Bisbee-vonKaufmann <sbisbee@computervip.com> wrote:
>  > One might solve this problem with a UI solution: allowing the user
>  > to choose which translation makes more sense. Hopefully the other
>  > words in the sentence had been translated well enough that someone
>  > could fill in the blank sensibly from a few choices. Common spell
>  > checking implementations would provide a good template.
>
> In the unambiguous cases, having a spell checker in the UI can
>  probably help; in the ambiguous cases, it can be a hindrance: given

Oh... it occurs to me to say that we do use spell-checking techniques - we have a tool to deduce cognates in word lists, based on low edit distances (though our system also applies standard transliterations between languages, so regular differences between languages don't factor into it). In April, I'm going to take two 250,000-ish word lists (Polish and Upper Sorbian -- extremely closely related languages) and see how it works out.
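For anyone curious what "low edit distance" means in practice, here is a minimal sketch using plain Levenshtein distance. The German/English word lists are stand-ins chosen for illustration, and the threshold is an arbitrary assumption; this is not the actual Apertium tool, which also normalises regular spelling differences first.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]

def cognate_candidates(words_a, words_b, max_distance=2):
    """Pair up words from two lists whose edit distance is within the threshold."""
    return [(a, b) for a in words_a for b in words_b
            if edit_distance(a, b) <= max_distance]

# Illustrative lists only; real input would be two full-size word lists.
print(cognate_candidates(["nacht", "haus", "wasser"], ["night", "house", "water"]))
```

Comparing every word against every other is quadratic, so on lists of 250,000-ish words some pruning (by length or first letter, say) would presumably be needed.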
