Tux

...making Linux just a little more fun!

A couple of Perl questions...

Jimmy ORegan [joregan at gmail.com]


Tue, 9 Oct 2007 19:20:41 +0100

I have a couple of scripts that almost work, and I was wondering if anyone (Ben? :) could tell me why...

First, I want to convert a list of tags in the IPA PAN's corpus format (subst:pl:dat:f) to Apertium's tag format (n.f.pl.dat). I have this:

#!/usr/bin/perl
 
use warnings;
use strict;
 
# tags to replace
my %terms = qw(n nt pri p1 sec p2 ter p3 subst n);
 
while (<>)
{
	my @in = split/:/;
	my @out = map { ($terms{$_} ne "") ? $terms{$_} : $_ } @in;
	if ($#out > 3) {
		my $type = $out[3];
		$out[3] = $out[2];
		$out[2] = $out[1];
		$out[1] = $type;
	}
	print join '.', @out;
}

That's broken, because it only works for tag sets which have more than 4 entries, but changing the if to "($#out >= 3)" gives me this: ".sg.nomxxs.m3" from "xxs:sg:nom:m3". I also get a lot of warnings:

Use of uninitialized value in string ne at foo.pl line 11, <> line 1085.
Use of uninitialized value in string ne at foo.pl line 11, <> line 1086.

Next, I have a list of names extracted from a Polish morphology dictionary[1] that I'm trying to convert to a list of word stems and endings. I have this, which works (aside from a couple of errors):

#!/usr/bin/perl
 
use warnings;
use strict;
 
use String::Diff qw/diff_fully/;
use Data::Dumper;
 
#test();
while(<>)
{
	s/,\W+$//;
	my $endings = $_;
	my @a = split/, /;
	my $stem = find_stem(@a);
	$endings =~ s/$stem//g;
	print $stem;
	if ($endings =~ /?/) {print ":n.f:";}
	elsif ($endings =~ /owie/) {print ":n.m1:";}
	else {print ":n.??:";}
	print $endings . "\n";
}
 
sub test()
{
	my $test = "Adam, Adama, Adaemie, Adamowi, Adamem, Adamach, Adamami, Adamom";
	my @t = split/, /, $test;
	print find_stem(@t);
	print "\n";
}
 
sub find_stem()
{
	my @in = @_;
	my ($r, $l, $cur, $last);
	my $i=0;
 
	while ($i<($#in))
	{
		($r, $l) = diff_fully($in[$i], $in[$i+1]);
	
		$cur = $r->[0]->[1];
		$last = $cur if (!$last);
		if ($cur ne $last) {
			($r, $l) = diff_fully($last, $cur);
			$last = $r->[0]->[1];
		}
	$i++;
	}
	return $last;
}

but if I change the end of the while() to this:

	else {print ":n.??:";}
	my @ends = split/, /, $endings;
	sort(@ends);
	$endings = join(',', at ends);
	print $endings . "\n";

to sort the endings, it... doesn't. What am I missing?

[1] "S?ownik alternatywny", under the GPL: http://www.kurnik.pl/slownik/odmiany/


Top    Back


Ben Okopnik [ben at linuxgazette.net]


Tue, 9 Oct 2007 16:55:55 -0400

On Tue, Oct 09, 2007 at 07:20:41PM +0100, Jimmy O'Regan wrote:

> I have a couple of scripts that almost work, and I was wondering if
> anyone (Ben? :) could tell me why...

I'll give it a shot. The only problem is, your script is doing more than what you describe here - so I'm going to have to guess about a few things. Worse yet, since your code isn't doing what it's supposed to do, I'm guessing based on wrong data. But hey, for a friend...

> First, I want to convert a list of tags in the IPA PAN's corpus format
> (subst:pl:dat:f) to Apertium's tag format (n.f.pl.dat). I have this:

Something like this, maybe? Again, I'm just guessing.

perl _F: -wlne'shift @F; print "n. at F[2,0,1]"' input_file
> ``
> #!/usr/bin/perl
> 
> use warnings;
> use strict;
> 
> # tags to replace
> my %terms = qw(n nt pri p1 sec p2 ter p3 subst n);
Just a personal reaction here - BLECH. I hate having to count terms to figure out what's a key and what's a value.

my %terms = (	n		=> 'nt',
			 	pri		=> 'p1',
				sec		=> 'p2',
				ter		=> 'p3',
				subst 	=> 'n'
);
> while (<>)
> {

chomp; # If you don't handle that "\n", you'll be sorry...

> 	my @in = split/:/;
> 	my @out = map { ($terms{$_} ne "") ? $terms{$_} : $_ } @in;

What happens when $terms{$_} is undefined? Bad news, that's what. I suspect that this is where your errors are coming from - perhaps with help from the absence of that 'chomp'.

> 	if ($#out > 3) {
> 		my $type = $out[3];
> 		$out[3] = $out[2];
> 		$out[2] = $out[1];
> 		$out[1] = $type;

What happens to your $out[0]? Is it just supposed to be ignored? In any case, you could just use a list slice instead of all the manual swapping:

@out[1 .. 3] = @out[3, 1, 2];

However, I strongly suspect that the 'map' statement is the source of your problems.

> 	}
> 	print join '.', @out;
> }
> ''
> 
> That's broken, because it only works for tag sets which have more than
> 4 entries, but changing the if to "($#out >= 3)" gives me this:
> ".sg.nomxxs.m3" from "xxs:sg:nom:m3". I also get a lot of warnings:

The best thing you could do to help me help you is by providing a bunch of example inputs and expected outputs. It sounds like it should be trivially simple to mung this stuff; this is the kind of thing that Perl is really good at.

> Next, I have a list of names extracted from a Polish morphology
> dictionary[1] that I'm trying to convert to a list of word stems and
> endings. I have this, which works (aside from a couple of errors):
> 
> ``
> #!/usr/bin/perl
> 
> use warnings;
> use strict;
> 
> use String::Diff qw/diff_fully/;
> use Data::Dumper;
> 
> #test();
> while(<>)
> {
> 	s/,\W+$//;
> 	my $endings = $_;
> 	my @a = split/, /;
> 	my $stem = find_stem(@a);
> 	$endings =~ s/$stem//g;
> 	print $stem;
> 	if ($endings =~ /??/) {print ":n.f:";}
> 	elsif ($endings =~ /owie/) {print ":n.m1:";}
> 	else {print ":n.??:";}
> 	print $endings . "\n";
> }
> 
> sub test()
> {
> 	my $test = "Adam, Adama, Adaemie, Adamowi, Adamem, Adamach, Adamami, Adamom";
> 	my @t = split/, /, $test;
> 	print find_stem(@t);
> 	print "\n";
> }
> 
> sub find_stem()
> {
> 	my @in = @_;
> 	my ($r, $l, $cur, $last);
> 	my $i=0;
> 
> 	while ($i<($#in))
> 	{
> 		($r, $l) = diff_fully($in[$i], $in[$i+1]);
> 	
> 		$cur = $r->[0]->[1];
> 		$last = $cur if (!$last);
> 		if ($cur ne $last) {
> 			($r, $l) = diff_fully($last, $cur);
> 			$last = $r->[0]->[1];
> 		}
> 	$i++;
> 	}
> 	return $last;
> }
> ''
> 
> but if I change the end of the while() to this:
> 
> ``
> 	else {print ":n.??:";}
> 	my @ends = split/, /, $endings;
> 	sort(@ends);
> 	$endings = join(',', at ends);
> 	print $endings . "\n";
> ''
> 
> to sort the endings, it... doesn't. What am I missing?

Like most Perl functions, "sort" doesn't modify the specified variable - it just returns a modified value. Your 'use warnings;' line should definitely have generated a warning about that.

ben at Tyr:/tmp$ perl -wle'@a = qw/3 2 1/; sort @a; print "@a"'
Useless use of sort in void context at -e line 1.
3 2 1

You can try this instead:

 	else {
		print ":n.??:";
	}
 
	# If you're only going to use a variable once, don't bother.
	print join(',', sort split/, /, $endings), "\n";
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Top    Back


Ben Okopnik [ben at linuxgazette.net]


Tue, 9 Oct 2007 17:03:44 -0400

On Tue, Oct 09, 2007 at 04:55:55PM -0400, Benjamin Okopnik wrote:

> 
> Something like this, maybe? Again, I'm just guessing.
> 
> ``
> perl _F: -wlne'shift @F; print "n. at F[2,0,1]"' input_file
> ''

Arrgh. Must be the welding fumes getting to me. That should be

perl -F: -walne'shift @F; print "n. at F[2,0,1]"' input_file
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Top    Back


Jimmy ORegan [joregan at gmail.com]


Wed, 10 Oct 2007 00:16:10 +0100

On 09/10/2007, Ben Okopnik <ben at linuxgazette.net> wrote:

> On Tue, Oct 09, 2007 at 07:20:41PM +0100, Jimmy O'Regan wrote:
> > I have a couple of scripts that almost work, and I was wondering if
> > anyone (Ben? :) could tell me why...
>
> I'll give it a shot. The only problem is, your script is doing more than
> what you describe here - so I'm going to have to guess about a few
> things. Worse yet, since your code isn't doing what it's supposed to do,
> I'm guessing based on wrong data. But hey, for a friend...
>

Oh, OK. Input:

subst:pl:acc:f
subst:pl:acc:m1
subst:pl:acc:m2
subst:pl:acc:m3
subst:pl:acc:n
subst:pl:dat:f
subst:pl:dat:m1
subst:pl:dat:m2
adj:sg:nom:n:comp
adj:sg:nom:n:pos
adj:sg:nom:n:sup
adj:sg:voc:f:pos
adj:sg:voc:m1:comp

expected output:

n.f.pl.acc
n.m1.pl.acc
n.m2.pl.acc
n.m3.pl.acc
n.nt.pl.acc
n.f.pl.dat
n.m1.pl.dat
n.m2.pl.dat
adj.nt.sg.nom.comp
adj.nt.sg.nom.pos
adj.nt.sg.nom.sup
adj.f.sg.voc.pos
adj.m1.sg.voc.comp
> > First, I want to convert a list of tags in the IPA PAN's corpus format
> > (subst:pl:dat:f) to Apertium's tag format (n.f.pl.dat). I have this:
>
> Something like this, maybe? Again, I'm just guessing.
>
> ``
> perl _F: -wlne'shift @F; print "n. at F[2,0,1]"' input_file
> ''
>

Kind of. Everything except for 1..3 passes through, except maybe with a change from that hash.

> > ``
> > #!/usr/bin/perl
> >
> > use warnings;
> > use strict;
> >
> > # tags to replace
> > my %terms = qw(n nt pri p1 sec p2 ter p3 subst n);
>
> Just a personal reaction here - BLECH. I hate having to count
> terms to figure out what's a key and what's a value.
>

Oh... yeah. I can see a few more terms that'll have to be swapped, and your way is definitely less confusing.

> ``
> my %terms = (   n               => 'nt',
>                                 pri             => 'p1',
>                                 sec             => 'p2',
>                                 ter             => 'p3',
>                                 subst   => 'n'
> );
> ''
>
> > while (<>)
> > {
>
> chomp;          # If you don't handle that "\n", you'll be sorry...
>

I had one in there at one stage; I don't remember why I took it out.

> >       my @in = split/:/;
> >       my @out = map { ($terms{$_} ne "") ? $terms{$_} : $_ } @in;
>
> What happens when $terms{$_} is undefined? Bad news, that's what. I
> suspect that this is where your errors are coming from - perhaps with
> help from the absence of that 'chomp'.
>
> >       if ($#out > 3) {
> >               my $type = $out[3];
> >               $out[3] = $out[2];
> >               $out[2] = $out[1];
> >               $out[1] = $type;
>
> What happens to your $out[0]? Is it just supposed to be ignored?
> In any case, you could just use a list slice instead of all the manual
> swapping:
>

It's ignored because it's in the right place.

> ``
> @out[1 .. 3] = @out[3, 1, 2];
> ''
>
> However, I strongly suspect that the 'map' statement is the source of
> your problems.
>

Yeah... using that was kind of wishful thinking, because I don't really get it. (Yet!)

> >       }
> >       print join '.', @out;
> > }
> > ''
> >
> > That's broken, because it only works for tag sets which have more than
> > 4 entries, but changing the if to "($#out >= 3)" gives me this:
> > ".sg.nomxxs.m3" from "xxs:sg:nom:m3". I also get a lot of warnings:
>
> The best thing you could do to help me help you is by providing a bunch
> of example inputs and expected outputs. It sounds like it should be
> trivially simple to mung this stuff; this is the kind of thing that Perl
> is really good at.
>
> > ``
> >       else {print ":n.??:";}
> >       my @ends = split/, /, $endings;
> >       sort(@ends);
> >       $endings = join(',', at ends);
> >       print $endings . "\n";
> > ''
> >
> > to sort the endings, it... doesn't. What am I missing?
>
> Like most Perl functions, "sort" doesn't modify the specified variable -
> it just returns a modified value. Your 'use warnings;' line should
> definitely have generated a warning about that.
>

Ah. So it does.

> ``
> ben at Tyr:/tmp$ perl -wle'@a = qw/3 2 1/; sort @a; print "@a"'
> Useless use of sort in void context at -e line 1.
> 3 2 1
> ''
>
> You can try this instead:
>
> ``
>         else {
>                 print ":n.??:";
>         }
>
>         # If you're only going to use a variable once, don't bother.
>         print join(',', sort split/, /, $endings), "\n";
> ''

Oh, cool. Thanks.


Top    Back


Ben Okopnik [ben at linuxgazette.net]


Wed, 10 Oct 2007 10:00:03 -0400

On Wed, Oct 10, 2007 at 12:16:10AM +0100, Jimmy O'Regan wrote:

> On 09/10/2007, Ben Okopnik <ben at linuxgazette.net> wrote:
> > On Tue, Oct 09, 2007 at 07:20:41PM +0100, Jimmy O'Regan wrote:
> > > I have a couple of scripts that almost work, and I was wondering if
> > > anyone (Ben? :) could tell me why...
> >
> > I'll give it a shot. The only problem is, your script is doing more than
> > what you describe here - so I'm going to have to guess about a few
> > things. Worse yet, since your code isn't doing what it's supposed to do,
> > I'm guessing based on wrong data. But hey, for a friend...
> >
> 
> Oh, OK. Input:
> 
> ``
> subst:pl:acc:f
> subst:pl:acc:m1
> subst:pl:acc:m2
> subst:pl:acc:m3
> subst:pl:acc:n
> subst:pl:dat:f
> subst:pl:dat:m1
> subst:pl:dat:m2
> adj:sg:nom:n:comp
> adj:sg:nom:n:pos
> adj:sg:nom:n:sup
> adj:sg:voc:f:pos
> adj:sg:voc:m1:comp
> ''
> 
> expected output:
> 
> ``
> n.f.pl.acc
> n.m1.pl.acc
> n.m2.pl.acc
> n.m3.pl.acc
> n.nt.pl.acc
> n.f.pl.dat
> n.m1.pl.dat
> n.m2.pl.dat
> adj.nt.sg.nom.comp
> adj.nt.sg.nom.pos
> adj.nt.sg.nom.sup
> adj.f.sg.voc.pos
> adj.m1.sg.voc.comp
> ''

Ah. OK, got it.

#!/usr/bin/perl -w
# Created by Ben Okopnik on Wed Oct 10 09:21:10 EDT 2007
 
%repl = (
    subst   => 'n',
    n       => 'nt',
);
 
while (<>){
    next unless /:/;
    chomp;
 
    my @in = split /:/;
 
    if (@in > 4){
        @in = @in[0, 3, 1, 2, 4];
        $in[1] =~ s/(.*)/$repl{$1}||$1/e;
    }
    else {
        @in = @in[0, 3, 1, 2];
        $in[0] =~ s/(.*)/$repl{$1}||$1/e;
        $in[1] =~ s/(.*)/$repl{$1}||$1/e;
    }
    print join(".", @in), "\n";
}

It's a little clunky, but... I've got a lot on my mind this morning.

> > However, I strongly suspect that the 'map' statement is the source of
> > your problems.
> 
> Yeah... using that was kind of wishful thinking, because I don't
> really get it. (Yet!)

[grin] You're not the only one. 'map' can get a little complex - especially since, based on its semantics, it can modify a list "in place" - or return a modified list, leaving the original alone.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *


Top    Back