Playing With Sid : Apertium: Open source machine translation

Jimmy O'Regan [joregan at gmail.com]

Thu, 3 Jul 2008 10:57:57 +0100

---------- Forwarded message ----------

From: Jimmy O'Regan <joregan@gmail.com>
Date: 2008/7/3
Subject: Re: Playing With Sid : Apertium: Open source machine translation
To: Arky <rakesh_ambati@yahoo.com>

2008/7/3 Arky <rakesh_ambati@yahoo.com>:

> Arky has sent you a link to a blog:
>
> Accept my kudos
>

Thanks, and thanks for the link.

Would you mind if I forwarded your e-mail to Linux Gazette? LG has a lot of Indian readers, and maybe we can recruit some others who would be interested in helping with an Indian language pair.

Also, I note from your blog that you have lttoolbox installed, and that you can read Devanagari - we have a Sanskrit analyser that uses Latin transliteration; I tried to write a transliterator to test it out, but, alas, it doesn't work, and I can't tell why.

It's in our 'incubator' module in SVN, here: http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sa-XX/apertium-sa-XX.ucs-wx.xml

> Blog: Playing With Sid
> Post: Apertium: Open source machine translation
> Link:
> http://playingwithsid.blogspot.com/2008/07/apertium-open-source-machine.html
>
> --
> Powered by Blogger
> http://www.blogger.com/
>


Jimmy O'Regan [joregan at gmail.com]

Thu, 3 Jul 2008 11:39:05 +0100

2008/7/3 Jimmy O'Regan <joregan@gmail.com>:

> 2008/7/3 Arky <rakesh_ambati@yahoo.com>:
>> Arky has sent you a link to a blog:
>>
>> Accept my kudos
>>
>
> Thanks, and thanks for the link 
>
> Would you mind if I forwarded your e-mail to Linux Gazette? LG has a
> lot of Indian readers, and maybe we can recruit some others who would
> be interested in helping with an Indian language pair.
>

So, basically: Apertium has analysers for Sanskrit, Hindi and Urdu that look to be relatively "complete". Alone, they could be useful as, for example, stemmers for Lucene - Felipe, one of our developers (who received his doctorate this week - congratulations!) wrote code to allow the output of our analysers to be used by Lucene, as well as adding other linguistically-oriented features to Lucene - code available here: https://issues.apache.org/jira/browse/LUCENE-1284
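To give a rough idea of what "using an analyser as a stemmer" means, here is a minimal Python sketch that pulls lemmas out of analyser output. It assumes the output is in the usual Apertium stream format (`^surface/lemma<tags>$`, with unknown words marked by `*`); the function name `lemmas` and the sample input are made up for illustration, escaped characters are ignored, and this is not the code attached to the Lucene issue.

```python
import re

# One lexical unit looks like: ^surface/analysis1/analysis2$
# (simplified: escaped characters inside a unit are not handled)
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")


def lemmas(analysed):
    """Return (surface, [lemmas]) pairs from analyser output."""
    result = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed):
        found = []
        for analysis in analyses.split("/"):
            if analysis.startswith("*"):
                # Unknown word: fall back to the surface form.
                lemma = surface
            else:
                # The lemma is everything before the first <tag>.
                lemma = analysis.split("<", 1)[0]
            if lemma not in found:
                found.append(lemma)
        result.append((surface, found))
    return result


if __name__ == "__main__":
    print(lemmas("^cats/cat<n><pl>$ ^xyzzy/*xyzzy$"))
```

An indexer could then store the lemma in place of (or next to) the surface form, which is roughly the job a stemmer does.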

What would be really great would be if we could find either GPLd bilingual dictionaries, or, better, volunteers to help work on translators. We can help to write translation rules[1], but we need native speakers to tell us which rules need to be written.

[1] Francis Tyers, our 'community coordinator', and I (99% him, but still...) have been working on an early alpha-level Welsh-English translator, despite neither of us being able to speak a word of Welsh. It's very basic, but already gives better results than the only other Welsh-English translator we can find:

Roedd y Comisiwn yn ymchwilio i'r honiadau bod yr AS wedi methu datgan £103,000 o roddion.

the Commission Was investigating the allegations that the MP has failed to declare £103,000 of gifts. (Apertium)

"He was the Commission crookedly ymchwiliad I ' group claims be he drives ACE has failed declare he gifts." (InterTran)

(Our translation uses two of my rules


Jimmy O'Regan [joregan at gmail.com]

Thu, 3 Jul 2008 11:41:39 +0100

2008/7/3 Jimmy O'Regan <joregan@gmail.com>:

>    the Commission Was investigating the allegations that the MP has
> failed to declare £103,000 of gifts. (Apertium)

Oh, and I fixed that capitalisation problem: The Commission was investigating the allegations that the MP has failed to declare £103,000 of gifts.

Welsh-English can be taken for a test run on our 'alpha testing' interface: http://wwww.apertium.org/testing/


Kat Tanaka Okopnik [kat at linuxgazette.net]

Thu, 3 Jul 2008 09:25:21 -0700

On Thu, Jul 03, 2008 at 11:41:39AM +0100, Jimmy O'Regan wrote:

> 2008/7/3 Jimmy O'Regan <joregan@gmail.com>:
> >    the Commission Was investigating the allegations that the MP has
> > failed to declare £103,000 of gifts. (Apertium)
> 
> Oh, and I fixed that capitalisation problem: The Commission was
> investigating the allegations that the MP has failed to declare
> £103,000 of gifts.
> 
> Welsh-English can be taken for a test run on our 'alpha testing'
> interface: http://wwww.apertium.org/testing/

Oh, nifty! I have a friend else-net who's learning Welsh now, I'll pass it on to her and see what she thinks.

-- 
Kat Tanaka Okopnik
Linux Gazette Mailbag Editor
kat@linuxgazette.net


Jimmy O'Regan [joregan at gmail.com]

Thu, 3 Jul 2008 18:58:45 +0100

2008/7/3 Kat Tanaka Okopnik <kat@linuxgazette.net>:

> On Thu, Jul 03, 2008 at 11:41:39AM +0100, Jimmy O'Regan wrote:
>> 2008/7/3 Jimmy O'Regan <joregan@gmail.com>:
>> >    the Commission Was investigating the allegations that the MP has
>> > failed to declare £103,000 of gifts. (Apertium)
>>
>> Oh, and I fixed that capitalisation problem: The Commission was
>> investigating the allegations that the MP has failed to declare
>> £103,000 of gifts.
>>
>> Welsh-English can be taken for a test run on our 'alpha testing'
>> interface: http://wwww.apertium.org/testing/
>
> Oh, nifty! I have a friend else-net who's learning Welsh now, I'll pass
> it on to her and see what she thinks.
>

Two of the best resources for Welsh on the Internet are Kevin Donnelly's Eurfa (http://www.eurfa.org.uk/) and Mark Nodine's dictionary (http://www.cs.cf.ac.uk/fun/welsh/LexiconForms.html) - and both of them have granted us permission to use their data under the GPL, as well as providing us with comments and feedback (in addition, Kevin did the export of his database himself, saving us some trouble).

There are quite a lot of GPLd resources out there, for a lot of languages. Unfortunately, a lot of dictionaries, etc. are basically useless to us - MT lexicons are quite different from standard dictionaries in a number of respects. We have tools that can, for example, induce cognates using the same techniques as spell checkers, but this can give as many misses as hits. Because these lists need to be manually checked and edited afterwards, I've found that it's actually quicker to just run the L2 spell checker on a list of words from L1.

(And even when this does work, it can give less than optimal results: the Polish 'pies' would be added as 'пёс' in Russian, but 'собака' is the preferred translation, for example)
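As a rough illustration of the spell-checker-style approach, here is a minimal Python sketch. It uses the standard library's `difflib.get_close_matches` as a stand-in for a real spell checker; the function name `propose_cognates`, the 0.8 cutoff and the tiny Spanish/Portuguese word lists are assumptions for the example, not Apertium's actual tools or data.

```python
import difflib


def propose_cognates(l1_words, l2_words, cutoff=0.8):
    """Map each L1 word to the L2 words that look most similar to it.

    The result is only a list of candidates: every entry still has to be
    checked by hand, since similar spelling does not guarantee that the
    L2 word is the right (or preferred) translation.
    """
    candidates = {}
    for word in l1_words:
        matches = difflib.get_close_matches(word, l2_words, n=3, cutoff=cutoff)
        if matches:
            candidates[word] = matches
    return candidates


if __name__ == "__main__":
    spanish = ["animal", "perro"]            # illustrative L1 list
    portuguese = ["animal", "casa", "cão"]   # illustrative L2 list
    print(propose_cognates(spanish, portuguese))
```

Words with no similar-looking counterpart (like 'perro' here) simply get no suggestion, which is one source of the misses mentioned above.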


Jimmy O'Regan [joregan at gmail.com]

Thu, 3 Jul 2008 11:06:32 +0100

2008/7/3 Rakesh 'arky' Ambati <rakesh_ambati@yahoo.com>:

>
>
> --- On Thu, 7/3/08, Jimmy O'Regan <joregan@gmail.com> wrote:
>
>> Would you mind if I forwarded your e-mail to Linux
>> Gazette? LG has a
>> lot of Indian readers, and maybe we can recruit some others
>> who would
>> be interested in helping with an Indian language pair.
>
> Sure, Please do.

Cool. CC'd.

>
>>
>> Also, I note from your blog that you have lttoolbox
>> installed, and
>> that you can read Devanagari - we have a Sanskrit analyser
>> that uses
>> Latin transliteration; I tried to write a transliterator to
>> test it
>> out, but, alas, it doesn't work, and I can't tell
>> why.
>>
>> It's in our 'incubator' module in SVN, here:
>> http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sa-XX/apertium-sa-XX.ucs-wx.xml
>>
>
> I was introduced to Apertium as part of an NLP summer school last month; let me grab that module and look at it.
>
> Could you kindly describe what did and didn't work with the transliterator you created?

We have a Sanskrit analyser that uses WX notation. I made a transliterator that converts UTF-8 Devanagari to WX notation, as input (so we could benefit from FSTs, rather than using Python or something similar to perform the conversion); however, some of the combined ligatures aren't being processed properly - it's transliterating in part, but some Unicode characters are still coming through untransliterated. I can't read Devanagari, so I can't tell if I'm missing something, or what the problem might be.
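For comparison while debugging, here is a minimal Python sketch of the same conversion. It is not the FST in SVN, only a hypothetical reference to check output against: the names `to_wx` and `leftovers` are made up, the table covers just the basic Sanskrit letters as I understand the WX scheme, and anything unmapped (nukta forms, digits, danda, etc.) is passed through unchanged. The point it tries to show is that a ligature is stored as consonant + virama + consonant, so the virama has to suppress the inherent 'a' rather than be transliterated itself.

```python
CONSONANTS = {
    "क": "k", "ख": "K", "ग": "g", "घ": "G", "ङ": "f",
    "च": "c", "छ": "C", "ज": "j", "झ": "J", "ञ": "F",
    "ट": "t", "ठ": "T", "ड": "d", "ढ": "D", "ण": "N",
    "त": "w", "थ": "W", "द": "x", "ध": "X", "न": "n",
    "प": "p", "फ": "P", "ब": "b", "भ": "B", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v",
    "श": "S", "ष": "R", "स": "s", "ह": "h",
}
VOWELS = {
    "अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u", "ऊ": "U",
    "ऋ": "q", "ए": "e", "ऐ": "E", "ओ": "o", "औ": "O",
}
# Dependent vowel signs (matras), written as escapes to avoid display issues.
MATRAS = {
    "\u093e": "A", "\u093f": "i", "\u0940": "I", "\u0941": "u",
    "\u0942": "U", "\u0943": "q", "\u0947": "e", "\u0948": "E",
    "\u094b": "o", "\u094c": "O",
}
# Candrabindu, anusvara, visarga.
SIGNS = {"\u0901": "z", "\u0902": "M", "\u0903": "H"}
VIRAMA = "\u094d"


def to_wx(text):
    """Transliterate Devanagari text to WX notation (basic letters only)."""
    out = []
    pending_a = False  # True after a consonant whose inherent 'a' is undecided
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in MATRAS:
            out.append(MATRAS[ch])  # the matra replaces the inherent 'a'
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False       # the virama only suppresses the inherent 'a'
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(VOWELS.get(ch) or SIGNS.get(ch) or ch)
    if pending_a:
        out.append("a")
    return "".join(out)


def leftovers(wx_text):
    """List Devanagari-block characters (U+0900 to U+097F) that were not converted."""
    return sorted({ch for ch in wx_text if "\u0900" <= ch <= "\u097f"})


if __name__ == "__main__":
    print(to_wx("संस्कृत"))  # saMskqwa
    print(leftovers(to_wx("ॐ")))
```

Running `leftovers` over the FST's output on a test corpus would show exactly which code points are slipping through, without needing to read the script.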
