...making Linux just a little more fun!

Mailbag 2

This month's answers created by:

[ Rick Moen ]
...and you, our readers!

Editor's Note

Jimmy O'Regan forwarded and crosslinked several followup discussions on his article on Apertium from LG 152. For your reading ease and pleasure, I've gathered those threads into this separate page. -- Kat

Cuneiform OCR source available for linux

Jimmy O'Regan [joregan at gmail.com]

Tue, 1 Jul 2008 11:22:57 +0100

Cognitive released the source of the kernel of their OCR system (http://www.cuneiform.ru/eng/index.html), and the linux port (https://launchpad.net/cuneiform-linux) has reported their first success: https://launchpad.net/cuneiform-linux/+announcement/561

Hopefully, they'll be able to get the layout engine working - I've used the Windows version, and it was very good at analysing a complicated mixed language document.

[ Thread continues here (3 messages/1.29kB) ]

Playing With Sid : Apertium: Open source machine translation

Jimmy O'Regan [joregan at gmail.com]

Thu, 3 Jul 2008 10:57:57 +0100

---------- Forwarded message ----------

From: Jimmy O'Regan <joregan@gmail.com>
Date: 2008/7/3
Subject: Re: Playing With Sid : Apertium: Open source machine translation
To: Arky <rakesh_ambati@yahoo.com>

2008/7/3 Arky <rakesh_ambati@yahoo.com>:

> Arky has sent you a link to a blog:
> Accept my kudos

Thanks, and thanks for the link :)

Would you mind if I forwarded your e-mail to Linux Gazette? LG has a lot of Indian readers, and maybe we can recruit some others who would be interested in helping with an Indian language pair.

Also, I note from your blog that you have lttoolbox installed, and that you can read Devanagari - we have a Sanskrit analyser that uses Latin transliteration; I tried to write a transliterator to test it out, but, alas, it doesn't work, and I can't tell why.

It's in our 'incubator' module in SVN, here: http://apertium.svn.sourceforge.net/svnroot/apertium/trunk/apertium-sa-XX/apertium-sa-XX.ucs-wx.xml

> Blog: Playing With Sid
> Post: Apertium: Open source machine translation
> Link:
> http://playingwithsid.blogspot.com/2008/07/apertium-open-source-machine.html
> --
> Powered by Blogger
> http://www.blogger.com/

[ Thread continues here (6 messages/9.61kB) ]

Apertium-transfer-tool info request

Jimmy O'Regan [joregan at gmail.com]

Fri, 11 Jul 2008 17:47:01 +0100

[I'm assuming Arky's previous permission grant still stands, also cc'ing the apertium list, for further comment]

---------- Forwarded message ----------

From: Rakesh 'arky' Ambati <rakesh_ambati@yahoo.com>
Date: 2008/7/11
Subject: Apertium-transfer-tool info request
To: joregan@gmail.com


I am trying to use apertium-transfer-tools on Ubuntu Hardy. Could you kindly point me to a working example or tutorial in which transfer rules are generated from alignment templates?



Rakesh 'arky' Ambati Blog [ http://playingwithsid.blogspot.com ]

[ Thread continues here (3 messages/13.54kB) ]

[Apertium-stuff] Apertium-transfer-tool info request

Felipe Sanchez Martinez [fsanchez at dlsi.ua.es]

Fri, 11 Jul 2008 23:45:29 +0200


Jimmy, very good explanation, :D

> Also; the rules that a-t-t generates are for the 'transfer only' mode
> of apertium-transfer: this example uses the chunk mode - most language
> pairs, unless the languages are very closely related, would really
> be best served with chunk mode. Converting a-t-t to support this is on
> my todo list, and though doing it properly may take a while, I can
> probably get a crufty, hacked version together fairly quickly. With a
> couple of sed scripts and an extra run of GIZA++ etc., we can also
> generate rules for the interchunk module.

We could exchange some ideas about that, and future improvements such as the use of context-dependent lexicalized categories. This would give a-t-t better generalization capabilities and make the set of inferred rules smaller.

> The need for the bilingual dictionary seemed a little strange to me at
> first, but Mikel, Apertium's BDFL, explained that it really helps to
> reduce bad alignments. This probably means that a-t-t can't generate
> rules for things like the Polish to English 'coraz piękniejsza' ->
> 'prettier and prettier', but I haven't checked that yet.

The bilingual dictionary is used to derive a set of restrictions that prevent an alignment template (AT) from being applied in conditions under which it would generate a wrong translation. Restrictions refer to the target language (TL) inflection information of the non-lexicalized words in the AT. For example, suppose that you want to translate the following phrase from English into Spanish:

"the narrow street", with the following morphological analysis (after tagging): "^the<det><def><sp>$ ^narrow<adj><sint>$ ^street<n><sg>$"

The bilingual dictionary says:

^narrow<adj><sint>$ -------> estrecho<adj><f><ND>$
^street<n><sg>$ -------> calle<n><f><sg>$

Suppose that you want to apply this AT:

SL:   the<det><def><sp>   <adj><sint>  <n><sg>
TL:   el<det><def><f><sg> <n><f><sg>   <adj><f><sg>
Alignment: 1:1  2:3 3:2
Restrictions (indexes refer to the TL part of the AT):
       w_2 = n.f.*  w_3 = adj.*

* Note: "the" and "el" are lexicalized words

This AT generalizes:

[ ... ]
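
The restriction check described above can be illustrated with a small Python sketch. This is not the apertium-transfer-tools implementation: the data layout, the function names (apply_at, show), the dot-separated tag notation and the use of shell-style wildcards for the restrictions are all assumptions made for illustration, and the "car" dictionary entry is hypothetical. Only the AT and the "narrow"/"street" entries are copied from the message.

```python
from fnmatch import fnmatchcase

# Toy bilingual dictionary: (SL lemma, SL tags) -> (TL lemma, TL tags).
# The first two entries are copied from the example; "car" is hypothetical.
BIDIX = {
    ("narrow", "adj.sint"): ("estrecho", "adj.f.ND"),
    ("street", "n.sg"): ("calle", "n.f.sg"),
    ("car", "n.sg"): ("coche", "n.m.sg"),
}

# The AT from the example. A lemma of None marks a non-lexicalized word.
AT = {
    "sl": [("the", "det.def.sp"), (None, "adj.sint"), (None, "n.sg")],
    "tl": [("el", "det.def.f.sg"), (None, "n.f.sg"), (None, "adj.f.sg")],
    "alignment": [(1, 1), (2, 3), (3, 2)],       # (SL index, TL index), 1-based
    "restrictions": {2: "n.f.*", 3: "adj.*"},    # keyed by TL index
}


def apply_at(at, sl_words, bidix=BIDIX):
    """Return the TL words produced by the AT, or None if it does not apply."""
    if len(sl_words) != len(at["sl"]):
        return None
    # The SL side must match: same tags, and same lemma for lexicalized words.
    for (lemma, tags), (at_lemma, at_tags) in zip(sl_words, at["sl"]):
        if tags != at_tags or (at_lemma is not None and lemma != at_lemma):
            return None
    tl_to_sl = {tl: sl for sl, tl in at["alignment"]}
    output = []
    for pos, (tl_lemma, tl_tags) in enumerate(at["tl"], start=1):
        if tl_lemma is None:
            # Non-lexicalized word: translate the aligned SL word with the
            # dictionary, then check the restriction against the TL tags
            # that the dictionary provides.
            entry = bidix.get(sl_words[tl_to_sl[pos] - 1])
            if entry is None:
                return None
            tl_lemma, dict_tags = entry
            pattern = at["restrictions"].get(pos)
            if pattern is not None and not fnmatchcase(dict_tags, pattern):
                return None
        output.append((tl_lemma, tl_tags))
    return output


def show(words):
    """Format (lemma, tags) pairs in the ^lemma<tag>...$ notation used above."""
    return " ".join(
        "^%s<%s>$" % (lemma, tags.replace(".", "><")) for lemma, tags in words
    )


the = ("the", "det.def.sp")
narrow = ("narrow", "adj.sint")

print(show(apply_at(AT, [the, narrow, ("street", "n.sg")])))
# ^el<det><def><f><sg>$ ^calle<n><f><sg>$ ^estrecho<adj><f><sg>$

print(apply_at(AT, [the, narrow, ("car", "n.sg")]))
# None: "coche" is n.m.sg in the toy dictionary, so w_2 = n.f.* fails
```

In this sketch the restriction is what stops the feminine AT from being applied to a noun that the dictionary translates as masculine, which would otherwise yield a wrongly inflected determiner and adjective.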

[ Thread continues here (6 messages/15.77kB) ]


Copyright © 2008, . Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 153 of Linux Gazette, August 2008