Ben Humphreys

Parsing Edict XML with Perl and XML::LibXML

Edict is a Japanese-English dictionary that is free to use for research (as far as I know). It’s available in a few formats, the most useful of which is an XML dump of the English-only data.

It might help someone sometime, so I’ve posted a short Perl snippet showing how to parse the format.

A single entry looks like:
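
As a rough illustration only, the sketch below uses a made-up entry and assumes JMdict-style element names (`entry`, `ent_seq`, `keb`, `reb`, `gloss`). The sample values and the helper name `parse_entries` are hypothetical. It loads the XML with XML::LibXML and pulls out the fields with XPath.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample entry; element names assume a JMdict-style layout.
my $xml = <<'XML';
<JMdict>
  <entry>
    <ent_seq>1234567</ent_seq>
    <k_ele><keb>猫</keb></k_ele>
    <r_ele><reb>ねこ</reb></r_ele>
    <sense>
      <gloss>cat</gloss>
    </sense>
  </entry>
</JMdict>
XML

# Return an array ref with one hash per <entry> element.
sub parse_entries {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml_string );

    my @entries;
    for my $entry ( $doc->findnodes('//entry') ) {
        my @kanji    = map { $_->textContent } $entry->findnodes('./k_ele/keb');
        my @readings = map { $_->textContent } $entry->findnodes('./r_ele/reb');
        my @glosses  = map { $_->textContent } $entry->findnodes('./sense/gloss');
        push @entries, {
            seq      => $entry->findvalue('./ent_seq'),
            kanji    => \@kanji,
            readings => \@readings,
            glosses  => \@glosses,
        };
    }
    return \@entries;
}

# The parser returns decoded character strings, so encode them on output.
binmode STDOUT, ':encoding(UTF-8)';
for my $e ( @{ parse_entries($xml) } ) {
    print "$e->{seq}: @{ $e->{kanji} } (@{ $e->{readings} }) = @{ $e->{glosses} }\n";
}
```

Loading the whole document at once is fine for a sample like this. For the full dump a streaming reader such as XML::LibXML::Reader is probably a better fit.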

    • #programming
    • #phd
    • #japanese
    • #perl
  • 4 weeks ago

IBM Model 1 in Perl

A first attempt based on the pseudocode on page 91 of Statistical Machine Translation by Philipp Koehn.

It’s a sad day when it’s easier to write it in Perl than in Ruby. Next prototype will be in Ruby.
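
For reference, a minimal sketch of the usual Model 1 EM loop might look like the following. It is not the original listing: the function name `train_ibm1` and the data layout are assumptions made for illustration, and the NULL word is left out to keep it short. It starts from uniform t(e|f), collects fractional counts per sentence pair, then renormalises.

```perl
use strict;
use warnings;

# Estimate t(e|f) with EM. $pairs is an array ref of
# [ \@e_words, \@f_words ] sentence pairs (hypothetical layout).
sub train_ibm1 {
    my ( $pairs, $iterations ) = @_;

    # Uniform initialisation over the e vocabulary.
    my %e_vocab;
    for my $pair (@$pairs) {
        $e_vocab{$_} = 1 for @{ $pair->[0] };
    }
    my $uniform = 1 / scalar keys %e_vocab;

    my %t;    # $t{$e}{$f} holds t(e|f)
    for my $pair (@$pairs) {
        my ( $e_words, $f_words ) = @$pair;
        for my $e (@$e_words) {
            $t{$e}{$_} = $uniform for @$f_words;
        }
    }

    for ( 1 .. $iterations ) {
        my ( %count, %total );

        # E-step: collect expected counts from every sentence pair.
        for my $pair (@$pairs) {
            my ( $e_words, $f_words ) = @$pair;
            for my $e (@$e_words) {
                # Normaliser for $e over the f words of this sentence.
                my $s_total = 0;
                $s_total += $t{$e}{$_} for @$f_words;

                for my $f (@$f_words) {
                    my $c = $t{$e}{$f} / $s_total;
                    $count{$e}{$f} += $c;
                    $total{$f}     += $c;
                }
            }
        }

        # M-step: re-estimate t(e|f) from the counts.
        for my $e ( keys %count ) {
            for my $f ( keys %{ $count{$e} } ) {
                $t{$e}{$f} = $count{$e}{$f} / $total{$f};
            }
        }
    }
    return \%t;
}

# Toy corpus: three two-word sentence pairs.
my @corpus = (
    [ [qw(the house)], [qw(das Haus)] ],
    [ [qw(the book)],  [qw(das Buch)] ],
    [ [qw(a book)],    [qw(ein Buch)] ],
);
my $t = train_ibm1( \@corpus, 10 );
printf "t(the|das)   = %.4f\n", $t->{the}{das};
printf "t(house|das) = %.4f\n", $t->{house}{das};
```

On a toy corpus like this, t(the|das) should climb with each iteration while the competing entries for das shrink.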

    • #nlp
    • #code
    • #perl
    • #programming
  • 1 year ago

MooseX::Getopt, YAML and properties

I was trying to let users specify a set of hierarchical properties from the command line.

This is how you run it from the command line:

foo tmp --yaml_properties="cake: [ bar, cake ]"

Output:

YAML STRING: cake: [ bar, cake ]
$VAR1 = {
          'cake' => [
                      'bar',
                      'cake'
                    ]
        };

Code:

package Foo::Command::tmp;

use Moose;
use Moose::Util::TypeConstraints;

extends qw( MooseX::App::Cmd::Command );

use Data::Dumper;
use YAML::Syck;

# Don't know what kind of data structure the YAML will create
subtype 'YAMLProperties' => as 'Ref';

# Want to force the data structure to load from simple Str
coerce 'YAMLProperties'
    => from 'Str'
        => via { 
            my $yaml_string = $_;
            print "YAML STRING: $yaml_string\n";
            return Load( $yaml_string );
        };

# Let Getopt know that YAMLProperties is a string
MooseX::Getopt::OptionTypeMap->add_option_type_to_map(
    'YAMLProperties' => '=s'
);


has yaml_properties => (
    traits        => [ qw( Getopt ) ],
    isa           => 'YAMLProperties',
    is            => 'ro',
    cmd_aliases   => 'yp',
    coerce        => 1,
    documentation => 'YAML-based properties. Gives you access to any data structure. Great!',
);


# Verify parameters
sub run {
    my ($self, $opt, $args) = @_;
    
    print Dumper($self->yaml_properties);
    
    return 1;
}

1;
    • #programming
    • #perl
    • #moose
  • 2 years ago

Perl HTML::TableExtract, nbsp and ASCII 160

I was trying to match table cells that looked like they contained spaces, using HTML::TableExtract, but for some reason neither an explicit $foo eq ' ' nor $foo =~ /\A\s*\z/ matched.

After 30 minutes of banging my head against the wall, I found out that the spaces were actually &nbsp; in the source, and HTML::TableExtract unhelpfully decodes them to character 160. This is not the standard ASCII code for space (which is 32), and in fact not ASCII at all, but freaky bizarro space 160: the non-breaking space.

What is even worse is that Perl’s regex \s doesn’t even cover it, at least when the string is a plain byte string matched without Unicode rules.

So I was forced to do something disgusting like:

if (ord($foo) == 160) ...
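
A slightly less disgusting option, sketched here under the assumption that the cell text really contains character 160 (and not its UTF-8 encoded bytes), is to add \xA0 to the character class. The helper name `is_blank_cell` is made up for this example. Unlike ord(), which only looks at the first character, this checks the whole string.

```perl
use strict;
use warnings;

# True if the string is empty or holds only whitespace and/or
# non-breaking spaces (code point 160, written \xA0 here).
sub is_blank_cell {
    my ($text) = @_;
    return $text =~ /\A[\s\xA0]*\z/ ? 1 : 0;
}

for my $cell ( chr(160), " \t", chr(160) . " " . chr(160), "x" ) {
    printf "%-8s blank? %d\n", join( ',', map { ord } split //, $cell ), is_blank_cell($cell);
}
```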
    • #perl
    • #programming
  • 2 years ago

Perl variables and apostrophes

This came up recently. I’d been using #2 for a while but had never hit #1, which came as a surprise.

my $person = "Ben";
print "Check out $person's cat\n"; #1
print "Say hi to '$person'\n"; #2
print "What about $person\'s hat\n"; #3

Output:
Name "person::s" used only once: possible typo at test.pl line 6.
Use of uninitialized value in concatenation (.) or string at test.pl line 6.
Check out  cat
Say hi to 'Ben'
What about Ben's hat
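
The warning about person::s hints at the cause: the apostrophe is an old alternative to :: as a package separator, so $person's is read as $person::s. Besides escaping the apostrophe as in #3, two other ways around it are sketched below. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $person = "Ben";

# Braces end the variable name explicitly, so the apostrophe is literal text.
my $with_braces = "Check out ${person}'s cat";

# Building the string without interpolating next to the apostrophe also works.
my $with_sprintf = sprintf "Check out %s's cat", $person;

print "$with_braces\n";
print "$with_sprintf\n";
```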
    • #perl
    • #programming
  • 2 years ago

Super-hack for turning off Validation in Moose

Thanks to Sartak in #moose for this little gem of hack-tastic meta-fantastic goodness.

{
    package Moose;
    around has => sub {
        my $orig = shift;
        my $name = shift;
        my %args = @_;
        delete $args{isa};
        $orig->($name, %args);
    };
}
    • #programming
    • #moose
    • #perl
  • 2 years ago

About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.

Me, Elsewhere

  • @benhumphreys on Twitter
  • benhumphreys on github