Parsing Edict XML with Perl and XML::LibXML
Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is an XML dump of the English-only data.
It might help someone sometime, so I’ve posted a short Perl snippet showing how to parse the format.
A single entry looks like:
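As a rough sketch, assuming entries follow the JMdict-style layout (`entry`, `ent_seq`, `k_ele/keb`, `r_ele/reb`, `sense/gloss`), parsing with XML::LibXML might look something like the following. The sample entry, its sequence number, and the `parse_entries` helper are made up for illustration and are not taken from the real dump.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample entry; element names assume the JMdict-style layout.
# Kanji and kana are written as character references to keep the source ASCII.
my $xml = <<'XML';
<JMdict>
  <entry>
    <ent_seq>9999999</ent_seq>
    <k_ele><keb>&#x732B;</keb></k_ele>
    <r_ele><reb>&#x306D;&#x3053;</reb></r_ele>
    <sense>
      <gloss>cat</gloss>
    </sense>
  </entry>
</JMdict>
XML

# Hypothetical helper: turn every <entry> into a plain hash reference.
sub parse_entries {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml_string );

    my @entries;
    for my $entry ( $doc->findnodes('//entry') ) {
        push @entries, {
            seq      => $entry->findvalue('./ent_seq'),
            kanji    => [ map { $_->textContent } $entry->findnodes('./k_ele/keb') ],
            readings => [ map { $_->textContent } $entry->findnodes('./r_ele/reb') ],
            glosses  => [ map { $_->textContent } $entry->findnodes('./sense/gloss') ],
        };
    }
    return \@entries;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $e ( @{ parse_entries($xml) } ) {
    printf "%s: %s [%s] = %s\n",
        $e->{seq},
        join( ', ', @{ $e->{kanji} } ),
        join( ', ', @{ $e->{readings} } ),
        join( '; ', @{ $e->{glosses} } );
}
```

Loading the whole document like this is fine for a small sample. For the full dump, a streaming parser such as XML::LibXML::Reader is probably a better fit.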
IBM Model 1 in Perl
A first attempt, based on the pseudocode on page 91 of Statistical Machine Translation by Philipp Koehn.
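A minimal sketch of the EM training loop for IBM Model 1 could look like this. It is not the original attempt: the toy corpus, the `train_ibm1` name, and the decision to leave out the NULL word are all assumptions made to keep the example short.

```perl
use strict;
use warnings;

# Toy parallel corpus (illustrative data): [ foreign words, English words ]
my @corpus = (
    [ [qw(das Haus)], [qw(the house)] ],
    [ [qw(das Buch)], [qw(the book)]  ],
    [ [qw(ein Buch)], [qw(a book)]    ],
);

# Hypothetical helper: returns a hash ref where $t->{$e}{$f} is t(e|f).
sub train_ibm1 {
    my ( $corpus, $iterations ) = @_;

    # Start from a uniform t(e|f) over the English vocabulary.
    my %e_vocab;
    for my $pair (@$corpus) {
        $e_vocab{$_} = 1 for @{ $pair->[1] };
    }
    my $uniform = 1 / scalar( keys %e_vocab );

    my %t;
    for my $pair (@$corpus) {
        my ( $f_words, $e_words ) = @$pair;
        for my $e (@$e_words) {
            $t{$e}{$_} = $uniform for @$f_words;
        }
    }

    for my $iteration ( 1 .. $iterations ) {
        my ( %count, %total );

        # E-step: collect expected (fractional) counts.
        for my $pair (@$corpus) {
            my ( $f_words, $e_words ) = @$pair;
            for my $e (@$e_words) {
                my $s_total = 0;
                $s_total += $t{$e}{$_} for @$f_words;
                for my $f (@$f_words) {
                    my $c = $t{$e}{$f} / $s_total;
                    $count{$e}{$f} += $c;
                    $total{$f}     += $c;
                }
            }
        }

        # M-step: re-estimate t(e|f) from the counts.
        for my $e ( keys %count ) {
            for my $f ( keys %{ $count{$e} } ) {
                $t{$e}{$f} = $count{$e}{$f} / $total{$f};
            }
        }
    }
    return \%t;
}

my $t = train_ibm1( \@corpus, 20 );
printf "t(%s|%s) = %.4f\n", 'the',   'das',  $t->{the}{das};
printf "t(%s|%s) = %.4f\n", 'book',  'Buch', $t->{book}{Buch};
printf "t(%s|%s) = %.4f\n", 'house', 'Haus', $t->{house}{Haus};
```

On this toy corpus the probabilities for the intuitive pairs (the/das, book/Buch, house/Haus) should climb toward 1 as the iterations go on.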
It’s a sad day when it’s easier to write it in Perl than in Ruby. Next prototype will be in Ruby.
MooseX::Getopt, YAML and properties
I was trying to let users specify a set of hierarchical properties from the command line.
This is how you run it from the command line:
    foo tmp --yaml_properties="cake: [ bar, cake ]"
Output:
    YAML STRING: cake: [ bar, cake ]
    $VAR1 = {
              'cake' => [
                          'bar',
                          'cake'
                        ]
            };
Code:
package Foo::Command::tmp;
use Moose;
use Moose::Util::TypeConstraints;
extends qw( MooseX::App::Cmd::Command );

use Data::Dumper;
use YAML::Syck;

# Don't know what kind of data structure the YAML will create
subtype 'YAMLProperties' => as 'Ref';

# Want to force the data structure to load from a simple Str
coerce 'YAMLProperties'
    => from 'Str'
    => via {
        my $yaml_string = $_;
        print "YAML STRING: $yaml_string\n";
        return Load( $yaml_string );
    };

# Let Getopt know that YAMLProperties is passed as a string
MooseX::Getopt::OptionTypeMap->add_option_type_to_map(
    'YAMLProperties' => '=s'
);

has yaml_properties => (
    traits        => [ qw( Getopt ) ],
    isa           => 'YAMLProperties',
    is            => 'ro',
    cmd_aliases   => 'yp',
    coerce        => 1,
    documentation => 'YAML-based properties. Gives you access to any data structure. Great!',
);

# Dump the data structure that the coercion produced
sub run {
    my ($self, $opt, $args) = @_;
    print Dumper($self->yaml_properties);
    return 1;
}

1;
Perl HTML::TableExtract, nbsp and ASCII 160
I was trying to match table cells that contained what looked like spaces, using HTML::TableExtract, but for some reason neither an explicit $foo eq ' ' nor $foo =~ /\A\s*\z/ matched.
After 30 minutes of banging my head against the wall, I found out that the spaces were actually &nbsp; entities in the source, and HTML::TableExtract unhelpfully decodes them to the corresponding character. That is not the standard ASCII space (code 32), but freaky bizarro space 160, the non-breaking space, which is not really ASCII at all.
What is even worse is that Perl’s regex \s doesn’t match it either, at least on a plain byte string without Unicode semantics.
So I was forced to do something disgusting like:
if (ord($foo) == 160) ...
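A slightly less disgusting option, sketched here under the assumption that the cell text arrives as a plain Perl string, is to name character 160 explicitly in the regex next to \s. The `is_blank_cell` helper is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a cell holds only whitespace and/or
# non-breaking spaces (character 160, \xA0), or nothing at all.
sub is_blank_cell {
    my ($text) = @_;
    return 0 unless defined $text;
    return $text =~ /\A[\s\xA0]*\z/ ? 1 : 0;
}

for my $cell ( "\xA0", " \xA0 ", "", "Ben" ) {
    printf "%-7s %s\n",
        is_blank_cell($cell) ? 'blank' : 'content',
        join( ',', map { ord } split //, $cell );
}
```

Unlike ord($foo) == 160, which only looks at the first character, this also covers cells that mix ordinary spaces with non-breaking ones.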
Perl variables and apostrophes
This came up recently. I’d been using #2 for a while but had never hit #1, which was a surprise.
    my $person = "Ben";
    print "Check out $person's cat\n";    #1
    print "Say hi to '$person'\n";        #2
    print "What about $person\'s hat\n";  #3
Output:
    Name "person::s" used only once: possible typo at test.pl line 6.
    Use of uninitialized value in concatenation (.) or string at test.pl line 6.
    Check out  cat
    Say hi to 'Ben'
    What about Ben's hat
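The warning hints at the cause: on Perls that still accept the apostrophe as an old-style package separator, "$person's" is read as the package variable $person::s, which is empty. A small sketch of three ways to end the variable name before the apostrophe (the variable names here are made up for the example):

```perl
use strict;
use warnings;

my $person = "Ben";

# Braces mark exactly where the variable name stops.
my $braces = "Check out ${person}'s cat\n";

# Escaping the apostrophe also ends the name, as in #3 above.
my $escaped = "Check out $person\'s cat\n";

# Plain concatenation avoids interpolation altogether.
my $concat = "Check out " . $person . "'s cat\n";

print $braces, $escaped, $concat;
```

All three strings come out as "Check out Ben's cat".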
Super-hack for turning off Validation in Moose
Thanks to Sartak in #moose for this little gem of hack-tastic meta-fantastic goodness.
{
    package Moose;
    around has => sub {
        my $orig = shift;
        my $name = shift;
        my %args = @_;
        delete $args{isa};
        $orig->($name, %args);
    };
}