2 cent tip: reading Freelang dictionaries

Jimmy O'Regan [joregan at gmail.com]

Sun, 5 Sep 2010 14:58:30 +0100

Freelang has a lot of (usually small) dictionaries, for Windows. They have quite a few languages that aren't easy to find dictionaries for, so though the coverage and quality are usually quite low, they're sometimes all that's there.

So, an example: http://www.freelang.net/dictionary/albanian.php

Leads to a file, dic_albanian.exe

This runs quite well in Wine (I haven't found any other way of extracting the contents). On my system, the 'C:\users\jim\Local Settings\Application Data\Freelang Dictionary' translates to '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\ Dictionary/'. The dictionary files are inside the 'language' directory.

Saving this as wb2dict.c:

#include <stdlib.h>
#include <stdio.h>
 
int main (int argc, char** argv)
{
	char src[31];
        char trg[53];
	FILE* f=fopen(argv[1], "r");
	if (f==NULL) {
		fprintf (stderr, "Error reading file: %s\n", argv[1]);
		exit(1);
	}
 
	while (!feof(f)) {
		fread(&src, sizeof(char), 31, f);
		fread(&trg, sizeof(char), 53, f);
		printf ("%s\n   %s\n\n", src, trg);
	}
	
	fclose(f);
	exit(0);
}

The next step depends on the contents... Albanian on Windows uses Codepage 1250, so in this case:

./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f --utf8 albanian-english
dictzip albanian-english.dict

(as root)
cp albanian-english.* /usr/share/dictd/

add these lines to /var/lib/dictd/db.list :

database albanian-english
{
  data  /usr/share/dictd/albanian-english.dict.dz
  index /usr/share/dictd/albanian-english.index
}

/etc/init.d/dictd restart

and now it's available:

dict agim
1 definition found

From unknown [albanian-english]:

  agim
     dawn

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.


Ben Okopnik [ben at linuxgazette.net]

Sun, 5 Sep 2010 12:30:10 -0400

On Sun, Sep 05, 2010 at 02:58:30PM +0100, Jimmy O'Regan wrote:

> Freelang has a lot of (usually small) dictionaries, for Windows. They
> have quite a few languages that aren't easy to find dictionaries for,
> so though the coverage and quality are usually quite low, they're
> sometimes all that's there.
> 
> So, an example: http://www.freelang.net/dictionary/albanian.php
> 
> Leads to a file, dic_albanian.exe

Sweet. Thanks, Jimmy - I can use that!

> This runs quite well in Wine (I haven't found any other way of
> extracting the contents). On my system, the 'C:\users\jim\Local
> Settings\Application Data\Freelang Dictionary' translates to
> '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\
> Dictionary/'. The dictionary files are inside the 'language'
> directory.

Oh, right - reminds me: for stuff like this, I've got a special directory I use so I don't have to hunt through the WINE structure. I created a symlink at ".wine/drive_c/temp/to_unix" that points to my /tmp directory, so if I just install the program to that directory, it shows up in my /tmp, all ready to be played with.

> Saving this as wb2dict.c:

[snip]

Whoops - that double-prints the last entry in the dictionary. Not a big deal, though.
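
The double print comes from the `while (!feof(f))` loop: feof() only turns true after a read has already hit end-of-file, so the loop runs once more after the last record, both fread() calls return 0, and the stale buffers get printed again. A minimal sketch of one way around it, assuming the 31-byte and 53-byte fields from wb2dict.c; `read_entry` and `dump_entries` are hypothetical names, not from the thread:

```c
#include <assert.h>
#include <stdio.h>

#define SRC_LEN 31  /* source-word field width, as in wb2dict.c */
#define TRG_LEN 53  /* translation field width, as in wb2dict.c */

/* Read one record into NUL-terminated buffers.
 * Returns 1 on success, 0 on end-of-file or a short read. */
static int read_entry(FILE *f, char src[SRC_LEN + 1], char trg[TRG_LEN + 1])
{
    if (fread(src, 1, SRC_LEN, f) != SRC_LEN) return 0;
    if (fread(trg, 1, TRG_LEN, f) != TRG_LEN) return 0;
    src[SRC_LEN] = '\0';  /* guard against a field with no NUL in it */
    trg[TRG_LEN] = '\0';
    return 1;
}

/* Write every complete record from in to out in the layout the
 * original program prints; returns the number of records written. */
static long dump_entries(FILE *in, FILE *out)
{
    char src[SRC_LEN + 1], trg[TRG_LEN + 1];
    long n = 0;

    assert(in != NULL && out != NULL);
    while (read_entry(in, src, trg)) {
        fprintf(out, "%s\n   %s\n\n", src, trg);
        n++;
    }
    return n;
}
```

Called as `dump_entries(f, stdout)`, this stops at the first short read, so a truncated trailing record is dropped rather than printed twice.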

> The next step depends on the contents... Albanian on Windows uses
> Codepage 1250, so in this case:
> 
> ./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f
> --utf8 albanian-english
> dictzip albanian-english.dict

Or, all of the above in one step:

#!/usr/bin/perl -w
# Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
use strict;
 
die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
    unless @ARGV;
 
use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
    OUT => ":utf8";
 
(my $dct = $ARGV[0]) =~ s/\.wb$//;
$dct =~ tr/_ A-Z/-_a-z/;
open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
    or die "Pipe failure: $!\n";
 
{
    my $ret1 = read $in, my $src, 31;
    my $ret2 = read $in, my $tgt, 53;
    last unless $ret1 & $ret2;
    s/\0.*// for $src, $tgt;
    printf $out "%s\n   %s\n\n", $src, $tgt;
    redo;
}
close $in;
system ('dictzip', "$dct.dict");
 
print <<"+EOT+"
database $dct.dict.dz
{
	data  /usr/share/dictd/$dct.dict.dz
	index /usr/share/dictd/$dct.index
}
+EOT+

Just specify the '.wb' file as the first argument and its encoding as the second.

> (as root
> cp albanian-english.* /usr/share/dictd/
> 
> add these lines to /var/lib/dictd/db.list :
> database albanian-english
>  {
>   data  /usr/share/dictd/albanian-english.dict.dz
>   index /usr/share/dictd/albanian-english.index
> }

For convenience, the script actually spits that out so it can be copied and pasted.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Sun, 5 Sep 2010 17:36:45 +0100

On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:

> On Sun, Sep 05, 2010 at 02:58:30PM +0100, Jimmy O'Regan wrote:
>> Freelang has a lot of (usually small) dictionaries, for Windows. They
>> have quite a few languages that aren't easy to find dictionaries for,
>> so though the coverage and quality are usually quite low, they're
>> sometimes all that's there.
>>
>> So, an example: http://www.freelang.net/dictionary/albanian.php
>>
>> Leads to a file, dic_albanian.exe
>
> Sweet. Thanks, Jimmy - I can use that!
>
>> This runs quite well in Wine (I haven't found any other way of
>> extracting the contents). On my system, the 'C:\users\jim\Local
>> Settings\Application Data\Freelang Dictionary' translates to
>> '~/.wine/drive_c/users/jim/Local\ Settings/Application\ Data/Freelang\
>> Dictionary/'. The dictionary files are inside the 'language'
>> directory.
>
> Oh, right - reminds me: for stuff like this, I've got a special
> directory I use so I don't have to hunt through the WINE structure. I
> created a symlink at ".wine/drive_c/temp/to_unix" that points to my /tmp
> directory, so if I just install the program to that directory, it shows
> up in my /tmp, all ready to be played with.
>
>> Saving this as wb2dict.c:
>
> [snip]
>
> Whoops - that double-prints the last entry in the dictionary.  Not a
> big deal, though.
>

Ah well... I spent more time on the dict stuff than looking at the raw files/writing the C

It also loses the first entry (I think) because of the way dictfmt adds its initial entries.

>> The next step depends on the contents... Albanian on Windows uses
>> Codepage 1250, so in this case:
>>
>> ./wb2dict Albanian_English.wb|recode 'windows1250..utf8' |dictfmt -f
>> --utf8 albanian-english
>> dictzip albanian-english.dict
>
> Or, all of the above in one step:
>
> ```
> #!/usr/bin/perl -w
> # Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
> use strict;
>
> die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
>     unless @ARGV;
>
> use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
>     OUT => ":utf8";
>
> (my $dct = $ARGV[0]) =~ s/\.wb$//;
> $dct =~ tr/_ A-Z/-_a-z/;
> open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
> open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
>     or die "Pipe failure: $!\n";
>
> {
>     my $ret1 = read $in, my $src, 31;
>     my $ret2 = read $in, my $tgt, 53;
>     last unless $ret1 & $ret2;
>     s/\0.*// for $src, $tgt;

Not quite. The reason I used C was because the data showed some evidence of C string reuse:

schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
"devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0

... so you'd at least need to split both strings on \0

>     printf $out "%s\n   %s\n\n", $src, $tgt;
>     redo;
> }
> close $in;
> system ('dictzip', "$dct.dict");
>
> print <<"+EOT+"
> database $dct.dict.dz
> {
>         data  /usr/share/dictd/$dct.dict.dz
>         index /usr/share/dictd/$dct.index
> }
> +EOT+
> '''
>
> Just specify the '.wb' file as the first argument and its encoding as
> the second.
>
>> (as root
>> cp albanian-english.* /usr/share/dictd/
>>
>> add these lines to /var/lib/dictd/db.list :
>> database albanian-english
>>  {
>>   data  /usr/share/dictd/albanian-english.dict.dz
>>   index /usr/share/dictd/albanian-english.index
>> }
>
> For convenience, the script actually spits that out so it can be copied
> and pasted. 

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.


Jimmy O'Regan [joregan at gmail.com]

Sun, 5 Sep 2010 18:12:12 +0100

On 5 September 2010 17:36, Jimmy O'Regan <joregan at gmail.com> wrote:

>>> Saving this as wb2dict.c:
>>
>> [snip]
>>
>> Whoops - that double-prints the last entry in the dictionary.  Not a
>> big deal, though.
>>
>
> Ah well... I spent more time on the dict stuff than looking at the raw
> files/writing the C 
>
> It also loses the first entry (I think) because of the way dictfmt
> adds its initial entries.
>

This version fixes both problems:

#include <stdlib.h>
#include <stdio.h>

int main (int argc, char** argv)
{
	char src[31];
	char trg[53];
	int c;
	FILE* f=fopen(argv[1], "r");
	if (f==NULL) {
		fprintf (stderr, "Error reading file: %s\n", argv[1]);
		exit(1);
	}

	printf ("00-database-info\n   Converted from %s\n\n", argv[1]);
	printf ("00-dummy-entry\n   For dictfmt\n\n");

	while ((c = (int) fgetc(f)) != EOF) {
		ungetc(c, f);
		fread(&src, sizeof(char), 31, f);
		fread(&trg, sizeof(char), 53, f);
		printf ("%s\n   %s\n\n", src, trg);
	}
	fclose(f);
	exit(0);
}

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.


Ben Okopnik [ben at linuxgazette.net]

Sun, 5 Sep 2010 13:13:59 -0400

On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:

> On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
> 
> > {
> >     my $ret1 = read $in, my $src, 31;
> >     my $ret2 = read $in, my $tgt, 53;
> >     last unless $ret1 & $ret2;
> >     s/\0.*// for $src, $tgt;
> 
> Not quite. The reason I used C was because the data showed some
> evidence of C string reuse:
> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> 
> ... so you'd at least need to split both strings on \0

Actually, except for the double-printed entry, it produces precisely the same output as your program - so that seems to work just fine.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Ben Okopnik [ben at linuxgazette.net]

Sun, 5 Sep 2010 13:32:23 -0400

On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:

> 
> Not quite. The reason I used C was because the data showed some
> evidence of C string reuse:
> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
> 
> ... so you'd at least need to split both strings on \0

Just recalled: C strings are null-terminated, right? That means the assignment to the string will terminate at that first null, regardless of the content after it. I'm just doing that manually.

#include <stdlib.h>
#include <stdio.h>
 
int main()
{
    char *str = "abc\0def";
    printf("%s\n", str);
    exit(0);
}

This will only print the first three characters of the string.
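
The same rule is what makes the stale bytes in the .wb fields harmless: whatever follows the first NUL is never looked at. A minimal sketch of doing that explicitly for a fixed-width field, which also copes with a field that has no NUL at all (an assumption; the samples quoted above all contain one). `field_text` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text before the first NUL of a fixed-width field into dst
 * and NUL-terminate it. If the field holds no NUL, the whole field is
 * copied. dst must have room for width + 1 bytes. Returns the length. */
static size_t field_text(const char *field, size_t width,
                         char *dst, size_t dst_size)
{
    const char *nul;
    size_t len;

    assert(dst_size > width);
    nul = memchr(field, '\0', width);
    len = nul ? (size_t)(nul - field) : width;
    memcpy(dst, field, len);
    dst[len] = '\0';
    return len;
}
```

Given a 31-byte field holding `factotum\0\0\0ch\0...`, this yields "factotum" and ignores the leftover "ch", which is the same result `s/\0.*//` gives on the Perl side.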

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Jimmy O'Regan [joregan at gmail.com]

Sun, 5 Sep 2010 18:39:40 +0100

On 5 September 2010 18:13, Ben Okopnik <ben at linuxgazette.net> wrote:

> On Sun, Sep 05, 2010 at 05:36:45PM +0100, Jimmy O'Regan wrote:
>> On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:
>>
>> > {
>> >     my $ret1 = read $in, my $src, 31;
>> >     my $ret2 = read $in, my $tgt, 53;
>> >     last unless $ret1 & $ret2;
>> >     s/\0.*// for $src, $tgt;
>>
>> Not quite. The reason I used C was because the data showed some
>> evidence of C string reuse:
>> schmal(t)z\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>> "devojka za s\0"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>> factotum\0\0\0ch\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
>>
>> ... so you'd at least need to split both strings on \0
>
> Actually, except for the double-printed entry, it produces precisely the
> same output as your program - so that seems to work just fine.
>

Sorry, misread "s/\0.*//". I need 1) new glasses, and 2) to clean my monitor

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.


Jimmy O'Regan [joregan at gmail.com]

Sun, 5 Sep 2010 18:55:24 +0100

On 5 September 2010 17:30, Ben Okopnik <ben at linuxgazette.net> wrote:

> ```
> #!/usr/bin/perl -w
> # Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
> use strict;
>
> die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
>     unless @ARGV;
>
> use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
>     OUT => ":utf8";
>
> (my $dct = $ARGV[0]) =~ s/\.wb$//;
> $dct =~ tr/_ A-Z/-_a-z/;
> open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
> open my $out, "|/usr/bin/dictfmt -f --utf8 $dct"
>     or die "Pipe failure: $!\n";
>

print $out "00-dummy-entry\n   For dictfmt\n\n";

here will get rid of the second bug I had

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.


Ben Okopnik [ben at linuxgazette.net]

Sun, 5 Sep 2010 14:28:53 -0400

On Sun, Sep 05, 2010 at 06:55:24PM +0100, Jimmy O'Regan wrote:

> 
> print $out "00-dummy-entry\n   For dictfmt\n\n";
> 
> here will get rid of the second bug I had

OK, so the "improved" version looks like this (I was trying to remember what in Perl handles C strings... 'pack/unpack', of course):

#!/usr/bin/perl -w
# Created by Ben Okopnik on Sun Sep  5 12:11:02 EDT 2010
use strict;
 
die "Usage: ", $0 =~ /([^\/]+)$/, " <dict_file> [encoding]\n"
    unless @ARGV;
 
use open IN => ":encoding(" . (defined $ARGV[1]?$ARGV[1]:'utf8') . ")",
    OUT => ":utf8";
 
(my $dct = $ARGV[0]) =~ s/\.wb$//;
$dct =~ tr/_ A-Z/-_a-z/;
open my $in, $ARGV[0] or die "$ARGV[0]: $!\n";
open my $out, "|/usr/bin/dictfmt -f --utf8 $dct" or die "Pipe failure: $!\n";
 
my $src;
print $out "00-dummy-entry\n\tFor dictfmt\n\n";
printf "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;
close $in;
 
system ('dictzip', "$dct.dict");
print <<"+EOT+"
 
database $dct.dict.dz
{
    data  /usr/share/dictd/$dct.dict.dz
    index /usr/share/dictd/$dct.index
}
+EOT+

The amusing part is the amount of work done by that "printf" line. Real workhorse, that thing.
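
For comparison, a rough C counterpart of the `read $in, $src, 84` plus `unpack("Z31 Z53", ...)` pair, pulling a whole record into a struct in one read. This is only a sketch: `wb_record` and `wb_read` are made-up names, the 31/53 layout is taken from the programs in this thread, and unlike `Z31` it cuts off the last byte of a field that is completely full with no NUL.

```c
#include <assert.h>
#include <stdio.h>

/* One on-disk record as used in this thread: 31 + 53 = 84 bytes.
 * Both members are char arrays, so compilers normally add no padding;
 * the assert in wb_read() checks that assumption at run time. */
struct wb_record {
    char src[31];  /* source word, NUL-terminated, stale bytes after */
    char trg[53];  /* translation, same layout */
};

/* Read one whole record. Returns 1 on success, 0 on end-of-file or a
 * short read. The last byte of each field is forced to NUL so both
 * members are safe to print with %s. */
static int wb_read(FILE *f, struct wb_record *r)
{
    assert(sizeof *r == 84);
    if (fread(r, sizeof *r, 1, f) != 1)
        return 0;
    r->src[sizeof r->src - 1] = '\0';
    r->trg[sizeof r->trg - 1] = '\0';
    return 1;
}
```

A caller would loop with `while (wb_read(f, &r)) printf("%s\n\t%s\n\n", r.src, r.trg);`, which mirrors the one-line Perl loop.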

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Ben Okopnik [ben at linuxgazette.net]

Sun, 5 Sep 2010 14:47:04 -0400

On Sun, Sep 05, 2010 at 02:28:53PM -0400, Benjamin Okopnik wrote:

Whoops, one mistake there:

> printf "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;

Should be

printf $out "%s\n\t%s\n\n", unpack("Z31 Z53", $src) while read $in, $src, 84;

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *
