
Two-cent tip: Download whole directory as zip file

Silas S. Brown [ssb22 at cam.ac.uk]


Fri, 12 Sep 2008 15:05:13 +0100

A quick "download whole directory as zip file" CGI

If you have a large collection of files, and you put them on your webserver without any special index, then it's likely that the server will generate its own index HTML for you. This is all very well, but I recently had the tedious experience of downloading 46 separate small files from my webserver, using somebody's Windows box with Internet Explorer and a "download manager" that took me through 3 dialog boxes per click in a foreign language. Wouldn't it be nice if I could tell the web server to zip them all up and send me the zip file?

You can do this because the Unix "zip" utility (package "zip" on most distributions) is capable of writing to standard output. At a minimum, you can create a CGI script like this:

#!/bin/bash
# HTTP headers first, then a blank line, then the zip stream itself
echo "Content-Type: application/zip"
echo "Content-Disposition: attachment; filename=files.zip"
echo
zip -9r - *

This zips the contents of the current directory, sending the result to standard output (that's what the dash "-" is for) and telling the Web browser that it's a zip file called files.zip.

But we can go one up on that - the following short script will list the contents of the directory, with an optional "download as zip" link that sets the filename appropriately. If you're using the small Mathopd webserver, you can edit /etc/mathopd.conf and set AutoIndexCommand to the path of this script:

#!/bin/bash
# Name the zip file after the current directory
export Filename="$(pwd | sed -e 's,.*/,,').zip"
if test "$QUERY_STRING" == zip; then
  echo "Content-Type: application/zip"
  echo "Content-Disposition: attachment; filename=$Filename"
  echo
  zip -9r - *
else
  # No ?zip query string: emit an HTML index of the directory instead
  echo "Content-Type: text/html; charset=utf-8"
  echo
  echo "<HTML><BODY><A HREF=\"..\">Parent directory</A> |"
  echo "<A HREF=\"./?zip\">Download $Filename</A>"
  echo "<h2>Contents of $Filename</h2><UL>"
  for N in *; do
    echo "<LI><A HREF=\"$N\">$N</A> ($(du -h "$N" | cut -f1))</LI>"
  done
  echo "</UL></BODY></HTML>"
fi
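
If you're trying the Mathopd route, the AutoIndexCommand setting goes in a Control section of mathopd.conf; a rough sketch, with an illustrative script path (check the Mathopd documentation for where it fits in your existing configuration):

Control {
	AutoIndexCommand /usr/local/lib/zipindex.cgi
}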

This assumes that any non-ASCII filenames will be listed in UTF-8 (otherwise change the charset).

-- 
Silas S Brown http://people.pwf.cam.ac.uk/ssb22




Silas S. Brown [ssb22 at cam.ac.uk]


Fri, 12 Sep 2008 15:26:20 +0100

Further to my previous email, if you want the resulting zip file to unpack into a subdirectory (rather than having everything at top level), then instead of

zip -9r - *

you can have

cd .. ; zip -9r - "$(echo "$Filename" | sed -e 's/\.zip$//')"

where Filename has been set earlier (i.e. in the longer version of the script).

Silas




Martin J Hooper [martinjh at blueyonder.co.uk]


Fri, 12 Sep 2008 15:49:51 +0100

Silas S. Brown wrote:

> But we can go one up on that - the following short script will
> list the contents of the directory, with an optional "download
> as zip" link that sets the filename appropriately.  If you're
> using the small Mathopd webserver, you can edit
> /etc/mathopd.conf and set AutoIndexCommand to the path of this
> script:

Can you do this on Apache2? I'm presuming that you can somehow... Not really an expert on configuring web servers!

Plus I'd need to get my server to run CGI scripts, too.




Ben Okopnik [ben at linuxgazette.net]


Fri, 12 Sep 2008 11:24:09 -0400

On Fri, Sep 12, 2008 at 03:49:51PM +0100, Martin J Hooper wrote:

> Silas S. Brown wrote:
> > But we can go one up on that - the following short script will
> > list the contents of the directory, with an optional "download
> > as zip" link that sets the filename appropriately.  If you're
> > using the small Mathopd webserver, you can edit
> > /etc/mathopd.conf and set AutoIndexCommand to the path of this
> > script:
> 
> Can you do this on Apache2?  I'm presuming that you can
> somehow...  Not really an expert on configuring web servers!

Of course. It's not really server-dependent - any server that can interpret CGI and has Bash installed should be able to handle it.

> Plus getting my server to run CGI scripts too.

You need to read up on the "Options" directive and the 'ExecCGI' setting.

http://httpd.apache.org/docs/2.0/mod/core.html#options
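
For Apache 2, a minimal sketch of what that might look like (the directory path is illustrative; the same lines can also go in a .htaccess file if AllowOverride permits Options and FileInfo):

<Directory /var/www/files>
    Options +ExecCGI
    AddHandler cgi-script .cgi
</Directory>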

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Fri, 12 Sep 2008 10:52:16 -0400

On Fri, Sep 12, 2008 at 03:05:13PM +0100, Silas S. Brown wrote:

> A quick "download whole directory as zip file" CGI

[snip]

> export Filename="$(pwd|sed -e 's,.*/,,').zip"
> if test "$QUERY_STRING" == zip; then
>   echo Content-type: application/zip
>   echo "Content-Disposition: attachment; filename=$Filename"
>   echo
>   zip -9r - *
> else
>   echo "Content-type: text/html; charset=utf-8"
>   echo
>   echo "<HTML><BODY><A HREF=\"..\">Parent directory</A> |"
>   echo "<A HREF=\"./?zip\">Download $Filename</A>"

I'm afraid that's not going to work: the above link will produce a URL that follows the current directory name with the query string, whereas what you need is the current scriptname followed by "?zip". So, slight modification:

echo "<A HREF=\"./${0##*/}?zip\">Download $Filename</A>"

Other than that - great tip, Silas! In the past, I've just downloaded the index page and used the "-i" switch of "wget" to read/retrieve all the links (when on someone else's server), or just made up a tarball when it's on my own - but this makes for a nice option.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Fri, 12 Sep 2008 13:01:04 -0400

On Fri, Sep 12, 2008 at 05:33:05PM +0100, Silas S. Brown wrote:

> Hi Ben,
> 
> On Fri, Sep 12, 2008 at 10:52:16AM -0400, Ben Okopnik wrote:
> > >   echo "<A HREF=\"./?zip\">Download $Filename</A>"
> > I'm afraid that's not going to work: the above link will produce a URL
> > that follows the current directory name with the query string, whereas
> > what you need is the current scriptname followed by "?zip".
> 
> That's correct if you're running it as a normal CGI script, but not
> if you're running it as a Mathopd AutoIndexCommand as I suggested.
> With AutoIndexCommand, the script is run for ANY directory that
> doesn't have an index.html, so you don't have to copy the script
> into every subdirectory.  In this case, you DON'T want to point the
> URL to the script itself, because the script itself is probably in
> a different directory from the one being listed.  You want to point
> the URL back to the directory, so that, next time the script is
> called by AutoIndexCommand, it's in the right directory to make the
> zip file.  Otherwise, it would be in the script directory instead,
> which may be different.
> 
> I have tested this, honest :)
Oh, I believe you. :) I've never used Mathopd myself, but there are directives you can use in, say, Apache that would do something like that. On the other hand, making the script server-independent has its attractions too.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Sun, 14 Sep 2008 20:47:39 -0400

[ Silas, do me a favor and CC TAG on any replies; that way, this exchange can be published along with the tip. Thanks! ]

On Sat, Sep 13, 2008 at 10:10:45AM +0100, Silas S. Brown wrote:

> Hi Ben,
> 
> On Fri, Sep 12, 2008 at 01:01:04PM -0400, Ben Okopnik wrote:
> > On the other hand, making the script server-independent has
> > its attractions too.
> 
> Yes, but if the URL points to the script not
> the directory, you'd then have to put the
> script in every directory and subdirectory that
> you want to be browsable in this way.  

Well, you could simply create a symlink to the script in every directory where you wanted it. After all, how many directories do most of us want to make available that way? A couple at most.

> Unless
> we add a parameter specifying the directory to
> be zipped, in which case we need to implement
> security mechanisms to stop people using it to
> access directories they shouldn't, and then it
> would no longer be short enough to be a
> two-cent tip :-)

How would that be different from your suggestion for Mathopd? In fact, it seems that the Mathopd solution would be less controllable: if you wanted to stop someone from zipping up some subdirectory, it seems to me that you'd have a difficult time of it - you'd have to make that specific directory non-browseable by, e.g., creating an empty 'index.html' in it.

> Maybe we should just say that it needs a Web
> server that can run an arbitrary script to make
> the default index of a directory. I know Mathopd
> is such a server (AutoIndexCommand) and I expect
> there are others, but I don't have a complete
> list of which servers can and can't.

This is why I said that making it server-dependent is not the best idea. It's sort of like writing a script that will only run on unpredictably arbitrary versions of Unix.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Silas S. Brown [ssb22 at cam.ac.uk]


Mon, 15 Sep 2008 09:49:39 +0100

Hi Ben,

On Sun, Sep 14, 2008 at 08:47:39PM -0400, Ben Okopnik wrote:

> Well, you could simply create a symlink to the script in every directory
> where you wanted it. After all, how many directories do most of us want
> to make available that way? A couple at most.

Well, I'm exporting my collection of audio recordings for my language-learning software Gradint. (Some of the recordings are copyrighted, so it's only for personal access and the URL is not public.) The collection is in a hierarchy with a total of 328 directories.

I could make 328 symlinks using

for N in $(find . -type d); do ln -s /path/to/target $N/index.cgi; done

but I'd have to re-do that whenever any new directory is created (manually or via cron). (And also, having symlinks means I wouldn't be able to put that directory on a filesystem that doesn't support symlinks, such as the VFAT filesystem on the extra USB disk of my slug; I'd have to make sure there's an e2fs image so that the symlinks work. But that's a small issue.)

I'd prefer the Mathopd option, as it saves a lot of symlink maintenance. And yes, Mathopd does give you control over which directories it allows this for: AutoIndexCommand can be set for "all subdirectories of a given directory" and nothing else.

Lots of people run Mathopd (it's a nice light-weight server, especially good for resource-limited situations), so I think it's worth noting the Mathopd option. The server-independent option might be useful for others, but it does need more work (linking the CGI to all subdirectories). So perhaps both should be included. But make sure that, if you're using the Mathopd option, the script does not refer to itself (only to the directory).

Incidentally, in Apache if you set

DirectoryIndex index.html index.cgi

then index.cgi will be run by default if index.html is not present, and that means the script should be able to run unchanged (it does not need to refer to itself, it can just refer to the directory), but you still need to link it to every directory.

If you do put the CGI in all directories, it might be a good idea to add

-x index.cgi

to the zip command line, to avoid cluttering the zip file with multiple copies of index.cgi.
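
Note that zip matches exclude patterns against the whole stored path, so if copies of the script also exist in subdirectories, a wildcard pattern may be needed as well, e.g.:

zip -9r - * -x index.cgi -x "*/index.cgi"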

> > we add a parameter specifying the directory to
> > be zipped, ..
> How would that be different from your suggestion for Mathopd?

It would be more complex and take longer to implement :) - especially the security aspects.

I wonder if anyone out there knows of a featureful and robust script that's already been written?

Meanwhile, I just wanted to write a short one and show it can be done.

Silas




Ben Okopnik [ben at linuxgazette.net]


Mon, 15 Sep 2008 09:27:24 -0400

On Mon, Sep 15, 2008 at 09:49:39AM +0100, Silas S. Brown wrote:

> Hi Ben,
> 
> On Sun, Sep 14, 2008 at 08:47:39PM -0400, Ben Okopnik wrote:
> > Well, you could simply create a symlink to the script in every directory
> > where you wanted it. After all, how many directories do most of us want
> > to make available that way? A couple at most.
> 
> Well I'm exporting my collection of audio recordings
> for my language-learning software Gradint. 
> (Some of the recordings are copyright, so it's only
> for personal acces and the URL is not public.)
> The collection is in a hierarchy with a total
> of 328 directories.

Obviously, in your case, it's different - radically so - and it makes sense to do it that way.

By the way, Silas, I hope you're not taking this wrong: I'm trying to provide useful critique rather than criticism here. Your idea is indeed a very useful one.

> I could make 328 symlinks using
> 
> for N in $(find . -type d); do ln -s /path/to/target $N/index.cgi; done

Or just

find . -type d -exec ln -s /path/to/target {}/index.cgi \;

> but I'd have to re-do that whenever any new
> directory is created (manually or via cron). 

Seems like using 'cron' would be reasonable, but remembering to "undo" that crontab entry if and when you get rid of that structure could be a problem.

> (And also, having symlinks means  I wouldn't be
> able to put that directory on a filesystem that
> doesn't support symlinks, such as the VFAT
> filesystem on the extra USB disk of my slug;
> I'd have to make sure there's an e2fs image so
> that the symlinks work.  But that's a small issue.)

Or you could just copy rather than linking - it's a tiny script.

> I'd rather the mathopd option as it saves a lot
> of symlink maintenance.  And yes mathopd does
> give you control over which directories it
> allows this for: the AutoIndexCommand can be
> set for "all subdirectories of a given
> directory" and nothing else.
> 
> Lots of people run Mathopd (it's a nice
> light-weight server, especially good for
> resource-limited situations), so I think it's
> worth noting the Mathopd option.  The
> server-independent option might be useful for
> others, but it does need more work (linking the
> CGI to all subdirectories).  So perhaps both
> should be included.  

That makes sense.

I just realized: if you have a hierarchy, and you run this script in the top directory, you'll get every file in the structure, recursively. That could be a problem - or a feature, depending on what the user wanted. It might make sense to add a 'Download this directory only' option, with 'zip -9D - *' as the executable part.
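
A minimal sketch of how that might slot into Silas's script (the 'ziponly' query name is my own invention, purely for illustration):

elif test "$QUERY_STRING" == ziponly; then
  echo "Content-Type: application/zip"
  echo "Content-Disposition: attachment; filename=$Filename"
  echo
  zip -9D - *

plus a matching <A HREF="./?ziponly"> link in the HTML branch.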

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jimmy O'Regan [joregan at gmail.com]


Mon, 15 Sep 2008 14:40:54 +0100

2008/9/15 Silas S. Brown <ssb22@cam.ac.uk>:

> Hi Ben,
>
> On Sun, Sep 14, 2008 at 08:47:39PM -0400, Ben Okopnik wrote:
>> Well, you could simply create a symlink to the script in every directory
>> where you wanted it. After all, how many directories do most of us want
>> to make available that way? A couple at most.
>
> Well I'm exporting my collection of audio recordings
> for my language-learning software Gradint.

Well heck, why not advertise a little? :)

http://people.pwf.cam.ac.uk/ssb22/gradint/ seems to be the link.




Ben Okopnik [ben at linuxgazette.net]


Mon, 15 Sep 2008 10:06:09 -0400

On Mon, Sep 15, 2008 at 02:40:54PM +0100, Jimmy O'Regan wrote:

> 2008/9/15 Silas S. Brown <ssb22@cam.ac.uk>:
> > Hi Ben,
> >
> > On Sun, Sep 14, 2008 at 08:47:39PM -0400, Ben Okopnik wrote:
> >> Well, you could simply create a symlink to the script in every directory
> >> where you wanted it. After all, how many directories do most of us want
> >> to make available that way? A couple at most.
> >
> > Well I'm exporting my collection of audio recordings
> > for my language-learning software Gradint.
> 
> Well heck, why not advertise a little? :)
> 
> http://people.pwf.cam.ac.uk/ssb22/gradint/ seems to be the link.

Oh yeah - speaking of which - thanks for your "charlearn" program as well, Silas! I've played around with learning the Japanese syllabary, and found it quite useful for that.

http://people.pwf.cam.ac.uk/ssb22/gradint/charlearn.html

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Silas S. Brown [ssb22 at cam.ac.uk]


Tue, 16 Sep 2008 09:44:07 +0100

Hi Ben,

> I hope you're not taking this wrong

Not at all :)

> I've played around with learning the Japanese syllabary,
> and found it quite useful for that.

よろしく ("yoroshiku" - best regards), glad it's useful.

You should be able to use Gradint as well if you can find someone to record the words and phrases for you. Unfortunately there doesn't seem to be a free Japanese speech synthesizer yet, although we could put together a simple one by concatenation, if we can find a set of good-quality GPL'd recordings of all possible Japanese syllables (including the ones made from more than one kana), spoken at a reasonable speaking pace (not drawn out artificially), all of similar length, and ideally all the same pitch (although Praat can compensate for that if necessary). But I don't know where to get that from; all the Japanese people I know are too nervous to do such a thing.

Silas




Jimmy O'Regan [joregan at gmail.com]


Tue, 16 Sep 2008 10:25:19 +0100

2008/9/16 Silas S. Brown <ssb22@cam.ac.uk>:

> Hi Ben,
>
>> I hope you're not taking this wrong
>
> Not at all :)
>
>> I've played around with learning the Japanese syllabary,
>> and found it quite useful for that.
>
> よろしく, glad it's useful.
>
> You should be able to use Gradint as well if you
> can find someone to record the words and
> phrases for you.  Unfortunately there doesn't
> seem to be a free Japanese speech synthesizer

There is GalateaTalk (http://sourceforge.jp/projects/galateatalk), which is part of the Galatea project (http://hil.t.u-tokyo.ac.jp/~galatea/); the project also includes a speech recognition system and an avatar system. The licence looks relatively inoffensive:

"Open-source Toolkit for Anthropomorphic Spoken Dialogue Agent" is a result of the Information Technology Development Projects Specific Development Areas of Information-technology Promotion Agency (IPA), Japan. ASTEM is the author of "Open-source Toolkit for Anthropomorphic Spoken Dialogue Agent" and retains the copyright thereto. However, as long as you accept and remain in strict compliance with the terms and conditions of the license set forth herein, you are hereby granted a royalty-free license to use "Open-source Toolkit for Anthropomorphic Spoken Dialogue Agent" including the source code thereof and the documentation thereto (collectively referred to herein as the "Software"). Use by you of the Software shall constitute acceptance by you of all terms and conditions of the license set forth herein.

Terms and Conditions of License

1. So long as you accept and strictly comply with the terms and conditions of the license set forth herein, ASTEM will not enforce the copyright or moral rights owned by ASTEM in respect of the Software, in connection with the use, copying, duplication, adaptation, modification, preparation of a derivative work, aggregation with another program, or insertion into another program of the Software or the distribution or transmission of the Software. However, in the event you or any other user of the Software revises all or any portion of the Software, and such revision is distributed, then, in addition to the notice required to be affixed pursuant to paragraph 2 below, a notice shall be affixed indicating that the Software has been revised, and indicating the date of such revision and the name of the person or entity that made the revision.

2. In the event you provide to any third party all or any portion of the Software, whether for copying, duplication, adaptation, modification, preparation of a derivative work, aggregation with another program, insertion into another program, or other user, you shall affix the following copyright notice and all terms and conditions of this license (both the English and Japanese language versions) as set forth herein, without any revision or change whatsoever.

Form of copyright notice:

Copyright (C) 2000-2003 Information-technology Promotion Agency, Japan Copyright (C) 2000-2003 ASTEM RI/Kyoto, Japan

3. ASTEM is licensing the Software, which is the trial product of research and development, on an as is and royalty-free basis, and makes no warranty or guaranty whatsoever with respect to the Software, whether express or implied, irrespective of the nation where used, and whether or not arising out of statute or otherwise, including but not limited to any warranty or guaranty with respect to quality, performance, merchantability, fitness for a particular purpose, absence of defects, or absence of infringement of copyright, patent rights, trademark rights or other intellectual property rights, trade secrets or proprietary rights of any third party. You and every other user of the Software hereby acknowledge that the Software is licensed without any warranty or guaranty, and assume all risks arising out of the absence of any warranty or guaranty. In the event the terms and conditions of this license are inconsistent with the obligations imposed upon you by judgment of a court or for any other reason, you may not use the Software. ASTEM shall not have any liability to you or to any third party for damages or liabilities of any nature whatsoever arising out of your use of or inability to use the Software, whether of an ordinary, special, direct, indirect, consequential or incidental nature (including without limitation lost profits) or otherwise, and whether arising out of contract, negligence, tortuous conduct, product liability or any other legal theory or reason whatsoever of any nation or jurisdiction.

4. When you use the Software for purposes associated with any system for nuclear power related matters, air traffic control or other traffic control, medical, emergency or security related uses, or any other system which may pose a significant risk of loss of life, bodily injury or damage to property, you shall take responsibility, and the developer or supplier of the Software cannot be asked for responsibility at all.

5. ASTEM will not conduct any support or maintenance of the Software.

6. This license of use of the Software shall be governed by the laws of Japan, and the Kyoto District Court shall have exclusive primary jurisdiction with respect to all disputes arising with respect thereto.




Silas S. Brown [ssb22 at cam.ac.uk]


Sat, 27 Sep 2008 08:54:47 +0100

Hi, further to my previous emails, I've realised that

du -h

would be better done as

du -h --apparent-size -s

(it's always worth checking to see if there are better parameters to the command). It may also be useful to put a

"($(du -h --apparent-size -s|cut -f1))"

after the "Download $Filename" link.

Silas




Silas S. Brown [ssb22 at cam.ac.uk]


Sat, 27 Sep 2008 12:19:06 +0100

On Sat, Sep 27, 2008 at 08:54:47AM +0100, Silas S. Brown wrote:

> "($(du -h --apparent-size -s|cut -f1))"
> 
> after the "Download $Filename" link.

Note that this does of course give the UNcompressed size (not that it makes much difference if the files don't compress well anyway).

Silas




Deividson Okopnik [deivid.okop at gmail.com]


Mon, 15 Sep 2008 13:52:40 -0300

On Mon, Sep 15, 2008 at 10:40 AM, Jimmy O'Regan <joregan@gmail.com> wrote:

> 2008/9/15 Silas S. Brown <ssb22@cam.ac.uk>:
>> Hi Ben,
>>
>> On Sun, Sep 14, 2008 at 08:47:39PM -0400, Ben Okopnik wrote:
>>
>> Well I'm exporting my collection of audio recordings
>> for my language-learning software Gradint.
>
> Well heck, why not advertise a little? :)
>
> http://people.pwf.cam.ac.uk/ssb22/gradint/ seems to be the link.
>

Looks like great software - I guess I should start learning my 3rd language :)

Is there any place to look for pre-made courses? Trying to decide between Spanish and Italian here.




Ben Okopnik [ben at linuxgazette.net]


Mon, 15 Sep 2008 14:37:54 -0400

On Mon, Sep 15, 2008 at 01:52:40PM -0300, Deividson Okopnik wrote:

> On Mon, Sep 15, 2008 at 10:40 AM, Jimmy O'Regan <joregan@gmail.com> wrote:
> > 2008/9/15 Silas S. Brown <ssb22@cam.ac.uk>:
> >> Hi Ben,
> >>
> >> On Sun, Sep 14, 2008 at 08:47:39PM -0400, Ben Okopnik wrote:
> >>
> >> Well I'm exporting my collection of audio recordings
> >> for my language-learning software Gradint.
> >
> > Well heck, why not advertise a little? :)
> >
> > http://people.pwf.cam.ac.uk/ssb22/gradint/ seems to be the link.
> >
> 
> Looks like a great software - I guess i should start learning my 3rd language :)
> 
> Is there any place to look for pre-made courses? Trying to decide
> between spanish and italian here.

It just so happens that I've been testing something you might find useful.

A couple of weeks ago, I downloaded and tried out 'latrine' (great name, eh? It's supposed to be a LAnguage TRaINEr...) - which turned out to have a major bug in it (I've already reported it.) 'latrine' uses the dictionary files in /usr/share/dictd/ - but since the lookup terms contain not just the words but also the IPA pronunciation keys, 'latrine' requires you to enter not only the word but also the pronunciation key - otherwise, your answer is marked "wrong". Whoops...

Since it's only a text-parsing app, it didn't take me long to hack up a Perl program that did what 'latrine' should have been doing. Kat has been using it since then to practice her Russian, and I have to say that her comprehension has been improving by huge leaps.

Example session (one in Spanish, one in Russian):

ben@Tyr:/tmp$ ./ltrain freedict-eng-spa.dict
* Type '!exit' to quit *
 
Word: bomba
Translation: pump
Congratulations, you got it! (1 right, 0 wrong so far)
[bomb|pump] : [bomba]
 
Word: hiedra
Translation: 
Nope, you missed it. (1 right, 1 wrong so far)
[ivy] : [hiedra|yedra]
 
Word: azul
Translation: blue
Congratulations, you got it! (2 right, 1 wrong so far)
[blue|sky blue] : [azul]
 
Word: más
Translation: more
Congratulations, you got it! (3 right, 1 wrong so far)
[more] : [más]
 
Word: costear
Translation: cost
Nope, you missed it. (3 right, 2 wrong so far)
[bear the cost of|defray the cost of|pay the expenses of] : [costear]
 
Word: entre
Translation: between
Congratulations, you got it! (4 right, 2 wrong so far)
[among|between] : [entre]
 
Word: plano
Translation: !exit
Your score is: 4 right, 2 wrong.

ben@Tyr:/tmp$ ./ltrain freedict-eng-ru.dict
* Type '!exit' to quit *
 
Word: артистический
Translation: artistic
Congratulations, you got it! (1 right, 0 wrong so far)
[artistic] : [артистический]
 
Word: агрессор
Translation: aggressor
Congratulations, you got it! (2 right, 0 wrong so far)
[aggressor|assailant|attacker] : [агрессор]
 
Word: Иран
Translation: Iran
Congratulations, you got it! (3 right, 0 wrong so far)
[Iran|Persia] : [Иран]
 
Word: одевать
Translation: dress
Congratulations, you got it! (4 right, 0 wrong so far)
[clothe|dress] : [одевать|одеть]
 
Word: базар
Translation: bazaar
Congratulations, you got it! (5 right, 0 wrong so far)
[bazaar|fair|market|marketplace] : [базар]
 
Word: цель
Translation: target
Congratulations, you got it! (6 right, 0 wrong so far)
[aim|goal|purpose|target] : [цель]
 
Word: бухгалтер
Translation: !exit
Your score is: 6 right, 0 wrong.

I make a point of keeping separate lexicon files for this rather than using the ones from "/usr/share/dictd" since I find that they require a significant number of corrections. Eventually, after I've polished them up, I'll send them to the 'dictd' folks.

Meanwhile, 'ltrain' (which is self-documenting; just run it without any arguments to see the manpage) is available at 'http://okopnik.com/misc/ltrain'. Comments and suggestions are always welcome.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jimmy O'Regan [joregan at gmail.com]


Tue, 16 Sep 2008 10:47:24 +0100

2008/9/15 Ben Okopnik <ben@linuxgazette.net>:

> I make a point of keeping separate lexicon files for this rather than
> using the ones from "/usr/share/dictd" since I find that they require a
> significant number of corrections. Eventually, after I've polished them
> up, I'll send them to the 'dictd' folks.

Those files originate with the freedict project, which sadly seems to be defunct. I sent some fixes to their Irish-English dictionary around a year ago and subscribed to the mailing list - the fixes have gone untouched, and the list has seen little other than spam. In any case, the dict files are generated from TEI-encoded XML files (TEI is also used internally at Distributed Proofreaders for new Project Gutenberg etexts), and those would be the best place to make the changes (I'm sure the Debian maintainer would be happy to integrate your fixes).

Most of the data in Freedict came from a Windows program called Ergane (http://download.travlang.com/Ergane//) which allowed its dictionaries to be used under public domain terms (though I'm not sure if that was true of later versions). The program used Esperanto as a pivot language, which is probably the reason why there are a few questionable entries in the dictionaries (IIRC, the Russian-English dictionary had 'агент' matched to 'g-man' rather than 'agent' - this is one reason why we have direction restrictions in Apertium).

If you want extra words, Wikipedia is a nice place to look. This is a version of Francis Tyers' Wikipedia script[1], modified to output to dict format (original here: http://wiki.apertium.org/wiki/Building_dictionaries):

#!/bin/sh
 
#language to translate from
LANGF=$2
#language to translate to
LANGT=$3
#filename of wordlist
LIST=$1
 
for LWORD in `cat $LIST`; do
        TEXT=`wget -q http://$LANGF.wikipedia.org/wiki/$LWORD -O - | grep 'interwiki-'$LANGT`;
        if [ $? -eq '0' ]; then
                RWORD=`echo $TEXT |
                cut -f4 -d'"' | cut -f5 -d'/' |
                python -c 'import urllib, sys; print urllib.unquote(sys.stdin.read());' |
                sed 's/(\w*)//g'`;
                # echo '<e><p><l>'$LWORD'<s n="n"/></l><r>'$RWORD'<s n="n"/></r></p></e>';
                echo $LWORD;
                echo "     "$RWORD;
                echo;
        fi;
        sleep 8; # don't put undue strain on the Wikimedia servers
done

$ cat in-txt
cat
house
city
 
$ ./wikipedia.sh in-txt en pt
cat
     Gato_doméstico
 
house
     Casa
 
city
     Cidade

[1] Yes, we can do that: interwiki links are not expressive, so not covered by copyright.




Ben Okopnik [ben at linuxgazette.net]


Tue, 16 Sep 2008 18:36:20 -0400

On Tue, Sep 16, 2008 at 10:47:24AM +0100, Jimmy O'Regan wrote:

> 2008/9/15 Ben Okopnik <ben@linuxgazette.net>:
> > I make a point of keeping separate lexicon files for this rather than
> > using the ones from "/usr/share/dictd" since I find that they require a
> > significant number of corrections. Eventually, after I've polished them
> > up, I'll send them to the 'dictd' folks.
> 
> Those files originate with the freedict project, which sadly seems to
> be defunct. 

:((((((((((((((((((((((((((((((

...and a few more '(((('s for good measure. That REALLY sucks!

> I sent some fixes to their Irish-English dictionary around
> a year ago and subscribed to the mailing list - the fixes have gone
> untouched, the list has seen little other than spam. In any case, the
> dict files are generated from TEI-encoded XML files (which is also
> used internally at Distributed Proofreaders for new Project Gutenberg
> etexts), which would be the best place to make the changes (I'm sure
> the Debian maintainer would be happy to integrate your fixes).

I'll look into that in a while.

> Most of the data in Freedict came from a Windows program called Ergane
> (http://download.travlang.com/Ergane//) which allowed it's
> dictionaries to be used under public domain terms (though I'm not sure
> if that was true of later versions). The program used Esperanto as a
> pivot language, which is probably the reason why there are a few
> questionable entries in the dictionaries (IIRC, the Russian-English
> dictionary had 'агент' matched to 'g-man' rather than 'agent' - this
> is one reason why we have direction restrictions in Apertium).

Oh, it's not just a question of direction; there were plenty of simple errors as well. Truncated words, misspellings, and - for some odd reason - words in which a part of the correct word would be repeated, something like "correctorrect" or "wordrdord". There were a number of these. There were also a number of quite silly literal translations, like "White Russia" for "Belorussia" (that's like translating "childbearing" as "kid demeanor".)

> If you want extra words, wikipedia is a nice place to look. This is a
> version of Francis Tyers' wikipedia script[1] to output to dict format
> (original here: http://wiki.apertium.org/wiki/Building_dictionaries):

[snip]

>         sleep 8; # don't put undue strain on the Wikimedia servers

[laugh] Given that I have pretty close to 100k entries in my '/usr/share/dict/words' (I use the standard Scrabble word list), and each translation will take 8 seconds + translation time + network latency (call it 10 seconds total), that would take more than 11 days to download. I wonder if there's an easier way?

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Jimmy O'Regan [joregan at gmail.com]


Wed, 17 Sep 2008 01:13:11 +0100

2008/9/16 Ben Okopnik <ben@linuxgazette.net>:

> On Tue, Sep 16, 2008 at 10:47:24AM +0100, Jimmy O'Regan wrote:
>> 2008/9/15 Ben Okopnik <ben@linuxgazette.net>:
>> > I make a point of keeping separate lexicon files for this rather than
>> > using the ones from "/usr/share/dictd" since I find that they require a
>> > significant number of corrections. Eventually, after I've polished them
>> > up, I'll send them to the 'dictd' folks.
>>
>> Those files originate with the freedict project, which sadly seems to
>> be defunct.
>
> :((((((((((((((((((((((((((((((
>
> ...and a few more '(((('s for good measure. That REALLY sucks!
>

Well, yeah. But... given what my hobbies are, I've quite a collection of links to dictionaries under free licences, if anyone's interested (as well as collections for which we've politely asked, and were granted rights to distribute under the GPL, but haven't yet had time to convert). Apertium is the Eclipse of open source MT :)

>> I sent some fixes to their Irish-English dictionary around
>> a year ago and subscribed to the mailing list - the fixes have gone
>> untouched, the list has seen little other than spam. In any case, the
>> dict files are generated from TEI-encoded XML files (which is also
>> used internally at Distributed Proofreaders for new Project Gutenberg
>> etexts), which would be the best place to make the changes (I'm sure
>> the Debian maintainer would be happy to integrate your fixes).
>
> I'll look into that in a while.
>
>> Most of the data in Freedict came from a Windows program called Ergane
>> (http://download.travlang.com/Ergane//) which allowed it's
>> dictionaries to be used under public domain terms (though I'm not sure
>> if that was true of later versions). The program used Esperanto as a
>> pivot language, which is probably the reason why there are a few
>> questionable entries in the dictionaries (IIRC, the Russian-English
>> dictionary had 'агент' matched to 'g-man' rather than 'agent' - this
>> is one reason why we have direction restrictions in Apertium).
>
> Oh, it's not just a question of direction; there were plenty of simple
> errors as well. Truncated words, misspellings, and - for some odd reason
> - words in which a part of the correct word would be repeated, something
> like "correctorrect" or "wordrdord". There were a number of these. There
> were also a number of quite silly literal translations, like "White
> Russia" for "Belorussia" (that's like translating "childbearing" as "kid
> demeanor".)
>

Yes; but the 'White Russia' entry exemplifies what I'm talking about - Ergane was built from an Esperantist's point of view, but applied across other languages, without considering whether a translation should be restricted to one direction: translating from 'White Russia' to the native equivalent of Belorussia is the right thing to do; the opposite is not.

(My preferred method, where possible, is to find a similarly archaic term that needs no restrictions, but that can't always be done, and there are probably better uses of the time.)

I have seen various of the other kinds of errors, though; the Irish examples I fixed were, for the most part, character-encoding screw-ups, and even after fixing those there were issues of missing spaces, etc.

>> If you want extra words, wikipedia is a nice place to look. This is a
>> version of Francis Tyers' wikipedia script[1] to output to dict format
>> (original here: http://wiki.apertium.org/wiki/Building_dictionaries):
>
> [snip]
>
>>         sleep 8; # don't put undue strain on the Wikimedia servers
>
> [laugh] Given that I have pretty close to 100k entries in my
> '/usr/share/dict/words' (I use the standard Scrabble word list), and
> each translation will take 8 seconds + translation time + network
> latency (call it 10 seconds total), that would take more than 11 days to
> download. I wonder if there's an easier way?
>

Actually, there are plenty of them. I constructed an ~8k Portuguese-English dictionary for my parents[1] using mostly 'dictionary crossing' (though I had to do a bit of manual searching, to ensure my father had the terms he needed for his dialysis - so he can avoid taking the kinds of drugs that'd kill him, etc.). I have ~10k French-English from the same method (for my sister's birthday[2] today :)

There are no end of open source tools available to assist in dictionary construction - but at the end of the day, they all still need a knowledgeable human to check them, which is very much the bottleneck in our process.

[1] They went to Portugal to celebrate their 30th anniversary[3]. They had a bit of a false start when the airline their flight was booked with announced, the day before their flight, that it had gone broke, but the travel agent came through.

[2] She's honest enough to only use it as a study aid, rather than as a crutch.

[3] Yes, my sister was born the day after my parents' anniversary. No, my mother still hasn't fully forgiven her - all the more so because that particular year was the first in which my father had a job after several years of unemployment, and could afford to take my mother to a restaurant.




Jimmy O'Regan [joregan at gmail.com]


Wed, 17 Sep 2008 01:24:43 +0100

2008/9/17 Jimmy O'Regan <joregan@gmail.com>:

> [2] She's honest enough to only use it as a study aid, rather than as a crutch.
>
> [3] Yes, my sister was born the day after my parent's anniversary. No,
> my mother still hasn't fully forgiven her - all the more so, because

Oh; said sister - Catherine - has already won her 'Apertium hero' award by translating more than 500 words (in Irish and Scots Gaelic, simultaneously) to English, in less than 3 hours. (She described it as the best crossword she ever did :) And with fewer mistakes than automated methods, too (she got a bit confused by the same word in different parts of speech; something that catches even the most expert of our contributors).

