So, there was some fascinating discussion on my first post that got onto the subject of some of the magic tricks that people play under the hood, notably touching on Smalltalk and Ruby. Piers Cawley summed up my real views on the subject for me, noting
- "With so many of these things it's not the capability itself that's bad, it's the of the bad uses people put it to."
and I use a hell of a lot of magic in perl sometimes, albeit carefully encapsulated magic - the principle reason I'm still using perl is that python and ruby just aren't as flexible as perl. Which means that this article is going to be perl all the way down, because of course more flexible means more ways to shoot yourself in the face - and I'm going to run through three that I've run into in production code and been driven insane by.
Headshot One: sub UNIVERSAL::foo { }
This is probably the least horrible of the things I'm going to discuss - as Piers points out, smalltalk does something fairly equivalent to this a lot to great effect. The idea of this is that you can provide a method on all objects in your program with a single definition, so for example
sub UNIVERSAL::debug_print { require Data::Dumper; warn Data::Dumper::Dumper($_[0]); }
would mean that for any object in your program, you're guaranteed to be able to call
$obj->debug_print
and get the contents of your object dumped to STDERR.
Or, at least, that's the theory. Where this can become a headshot is in its interactions with other things - for a start, now you're messing with a namespace that's shared interpreter-wide so you can get name clashes. But, of course, people are all careful about injecting things into UNIVERSAL:: because they know that, so you're pretty much safe, right?
Wrong.
Just because you don't get a clash at that level, you're effectively making the assumption that no class your program will load, ever, will define a method of that name with a different meaning! So if you have a pretty printer class -
package PrettyPrinter;
sub debug_print {
my ($self, @to_print) = @_;
die "no arguments passed to debug_print" unless @to_print;
...
then as soon as you try
$obj->debug_print
on your pretty printer ... exception time!
The place I encountered this one personally was with UNIVERSAL::moniker combined with Template::Plugin::Class - the latter returns a proxy object that catches all unknown method calls directed to it with an AUTOLOAD (which I considered mentioning for this article but is only really a footshot ... unless somebody else already put one in the global namespace). However, this proxy object only accounts for the standard UNIVERSAL methods, so moniker gets called on the proxy object and fails spectacularly to be any use. The solution, of course, was to inject an explicit moniker method into the proxy object's class, but that's an evil workaround in and of itself. However, provided you know what you're doing a sparse few extra global methods can be very powerful, so it's arguably the least scary of the techniques
Headshot rating: May take an ear off.
Headshot Two: Wrapping or replacing other people's methods
This technique, also known as monkey patching, is a common means of in-place modification of other people's code to allow tweaking without requiring everything deals with a factory. This -can- in theory be a very useful tool, although you should always try to subclass - but, sometimes (for e.g. with my Template::Plugin problem above) you either don't have time or don't have an option to integrate patches upstream or ...
So, anyway, a first attempt might be
sub MonkeyPatch::TargetClass::target_method { ... }
which is ok in theory except ... (1) your sub is declared in the current package so your class globals won't be right,and (2) if the other class ever declared a target_method method itself it'll all go boom. Now, neither matched for my problem above, and if they do perl will give you a handy 'method redefined' warning so you know you did something silly, so this is only really a shooting yourself in the foot moment (I don't think every language has this warning though, it was added because of perl programmers shooting themselves in the head due to its absence years back ... sometimes being old and unfashionable is an advantage :)
So. Let's assume for the moment there -is- a target method and you want to wrap it. Well, so, I guess the easiest way is just to rename the old one and replace it, right?
package MonkeyPatch::TargetClass;
*old_target_method = \&target_method; # set symbol to subref
{
no warnings 'redefine'; # only turn the warning off within the current block
*target_method = sub {
my $self = shift;
...
$self->old_target_method(@_);
}
}
Ok, that's fine. But what if somebody -else- tries to wrap this the same way? They'll overwrite old_target_method with your wrapped version and ... BOOM. Alright, so can we wrap it in place? Well, in perl, yes we can -
package MonkeyPatch::TargetClass;
my $old = \&target_method;
{
no warnings 'redefine';
*target_method = sub {
my $self = shift;
...
$self->$old(@_);
}
So now if somebody else does the same thing, they'll grab our redefined version and wrap that and there are no conflicts - well, so long as our wrappers aren't order-dependent, but at that point you're beyond what this approach can ever handle and it's time to break out multiple inheritance and C3 mro support.
But ... what if we don't know if the method's defined in TargetClass or one of its superclasses? Worse still, what if it -is- currently in TargetClass but an update to the library that comes from factors it out into a base class? You guessed it: BOOM.
So, one last tweak -
becomesmy $old = &target_method;
my $old = __PACKAGE__->can('target_method'); # look up along MRO
and now we are, so far as this sort of hack ever can be, safe and robust - to the point where I was happy to use this trick to tweak bits of CPAN for my Catalyst installer.
So, this one gets a vote of: bloody hard to get right, still dangerous even then, but sometimes useful.
Headshot rating: All fun and games until somebody loses an eye
Headshot Three: Source filtering
And finally, the most strange and fascinating of the bunch, and the one which a client of ours got shot in the face by in the wild inspiring me to write this article in the first place. Source filtering is ... well, for those of you unfamiliar with perl the implementation you're likely to have encountered is C preprocessor macros; I don't -think- any other dynamic languages have been foolish enough to copy this particular feature and if the author of your preferred one proposes doing so, please break his fingers and confiscate his keyboard until the madness passes.
So, right, source filters allow you to provide a textual preprocessor to attach to your source code which runs before the file is fed to the interpreter's parser. Which theoretically allows you to extend the language to do -anything-, for example add a completely pointless switch statement (you can produce something almost identical with other syntactic constructs - if anybody cares comment and I'll elaborate).
It also allows you to shoot yourself in the head. Spectacularly. Observe -
package ShootMeInTheHead;
use strict;
use warnings;
use Switch;=head1 ShootMeInTheHead
=cut
sub foo {
my ($variable) = @_;
return $variable + 1;
}sub bar {
my ($variable) = @_;
return $variab1e + 1;
}
A very simple piece of code, and one with a visible typo - the last return line has a 1 in 'variab1e' which should be an l. And lo, asking perl to compile it yields
Global symbol "$variab1e" requires explicit package name at line 18.
and thus informed of our silly error we may swiftly correct it. Except. Consider the following identically buggy but subtly different file -
package ShootMeInTheHead;
use strict;
use warnings;
use Switch;
# load switch statement=head1 ShootMeInTheHead
=cut
sub foo {
my ($variable) = @_;
return $variable + 1;
}sub bar {
my ($variable) = @_;
return $variab1e + 1;
}
which, while -functionally- identical (in theory), now contains the magic word 'switch'. Now, since we added a line, the compilation error should now be line nine, right?
Global symbol "$variab1e" requires explicit package name at line 14.
Erm. But that's the "return $variable + 1;" line further up. Which is -correct-.
Which is where we originally came in, with one of Shadowcat's clients contacting us in a not very happy state of mind because their core app wouldn't start and their devs swore blind the line the compiler error was coming from was correct. And, indeed, they were right - that line was. A quick comment-out of the 'use Switch' and dependent code revealed the correct line (some 200 lines off, their code was a tad more involved than this), I corrected a typo in a variable name, and lo, the app was working again.
I'll confess to not having tracked down -exactly- how Switch's tweaks to line numbering manage to mangle the error reporting, and I'll also confess to not particularly caring - my fundamental advice to anybody considering using a source filter that alters things beyond a single line is, and will ever be, don't.
Headshot rating: Right between the eyes.
So, having scared myself and perhaps a number of my readers too, I think I'm going to go for a beer and some cat macros while I recover. Adieu!
So, weeks and weeks on from when I planned, I'm finally getting down to sorting my e-mail out. First thing is to pick an IMAP client library. A quick poke for IMAP client libraries on CPAN reveals a few, notably IMAP::Client, Net::IMAP and Net::IMAP::Simple. Of the three, IMAP::Client looks like the most 'engineered' but I don't really feel the need for anything complex just yet, so I'm going to go with Net::IMAP::Simple.
Since I currently have a -lot- of crap in my inbox, I'm going to grab the unread mail and shuffle it into a mailbox called 'scratch' that I can back up, screw up and generally mess with to my heart's content. I'm eliding how I do that because (a) it's boring cp -a stuff at a shell prompt (b) it's specific to the slightly odd qmail setup I built several years ago, which means it's likely useless to anybody reading this.
INBOX/scratch created and populate, it's time to fire up ye olde repl and see what we get (NB: throughout this transcript I'm using ' 8 ' instead of '@' in order to throw off spambots.
$ use Net::IMAP::Simple;
$ my $imap = Net::IMAP::Simple->new('imap.scsys.co.uk');
Net::IMAP::Simple=HASH(0x8256c1c)
$ $imap->login('matt 8 trout.me.uk', 'notmyrealpw');
1;
$ $imap->select('INBOX/scratch');
16202
Riiight. Well, so far so good but that's a lot of e-mail to deal with. Looking at the Net::IMAP::Simple synopsis, my best route to poke at it is to make an Email::Simple object from the headers, so we'll load that and build a quick function that handles doing that for a given msgid.
$ use Email::Simple;
Net::IMAP::Simple=HASH(0x8256c1c)
$ my $mk_hdr = sub { Email::Simple->new(join '', @{$_[0]->top($_[1])}); };
CODE(0x85ac270)
$ my $msg = $mk_hdr->($imap, 1);
Email::Simple=HASH(0x86a1b0c)
$ $msg->header('Delivered-To');
matt 8 trout.me.uk infrastructures 8 trout.me.uk
$ join(',',$msg->header('Delivered-To'));
matt 8 trout.me.uk,infrastructures 8 trout.me.uk
Top. Not only does it work, but we get a list back. Now, my plan is to redirect based on the mail address delivered to for the moment - while yes, I could easily do this by rejigging my mailbox layout I can't be bothered pre-processing old mail and I know there's a mail server migration in my near future so I want my filtering approach to be independent of any of that. Hence why I'm doing things this way - what I'm trying to end up with is a portable version of the set of mail rules I had last time I attempted to use Thunderbird. Anyway. Delivered-To headers are the easiest way of getting hold of this info, but if it went through a remote network it'll already have one from there and I have no desire to classify based on other people's mail addresses, so we need the last one on my own domain. Time for another quick function, taking advantage of the wonderful perl built-in 'reverse' -
$ my $last_local = sub { foreach(reverse $_[0]->header('Delivered-To')) { return $_ if /\@trout.me.uk$/; } };
CODE(0x8548e90)
$ $last_local->($msg);
infrastructures 8 trout.me.uk
$ my @first_ten = map { $mk_hdr->($imap, $_) } 1..10;
Email::Simple=HASH(0x86a2730) Email::Simple=HASH(0x86a66d0) Email::Simple=HASH(0
x86aad00) Email::Simple=HASH(0x86ae078) Email::Simple=HASH(0x86ae72c) Email::Sim
ple=HASH(0x86b98dc) Email::Simple=HASH(0x86cb79c) Email::Simple=HASH(0x86d2660)
Email::Simple=HASH(0x86d2cb4) Email::Simple=HASH(0x86d3440)
$ join("\n", map { $last_local->($_) } @first_ten);
infrastructures 8 trout.me.uk
perl-stuff 8 trout.me.uk
dbix-class 8 trout.me.uk
dbix-class 8 trout.me.uk
utp 8 trout.me.uk
dbix-class 8 trout.me.uk
perl-stuff 8 trout.me.uk
dbix-class 8 trout.me.uk
wiqi 8 trout.me.uk
matt 8 trout.me.uk
Ok, that seems to be giving us the information I needed. So, now to build a simple classifier per-delivered-to. Declaring two hashes, one to map mail parts to targets and one to store unknown results, and build a classifier subroutine that pulls off the part before the @ and checks the target hash for an entry -
$ my %message_targets; my %unknown;
$ @message_targets{qw/infrastructures dbix-class perl-stuff matt utp wiqi/} = qw/lists lists default personal junk junk/;
$ my $classify = sub { $_[0] =~ /(.*?)\@/; return $_[1]->{$1}; };
CODE(0x86a1c5c)
$ $classify->('perl-stuff 8 trout.me.uk',\%message_targets);
default
$ join("\n", map { $classify->($last_local->($_), \%message_targets) } @first_ten);
lists
default
lists
lists
junk
lists
default
lists
junk
personal
Now we build a function that actually puts it all together (somewhat reformatted for clarity albeit I typed it straight in on a single line when I first did it) - this takes a range of message indexes, pulls the headers, finds the local address and does the classify run. If it gets a result it calls a $do_cl function to do something with the classification, if not increments the unknown count for the original address. The reason for keeping the $do_cl part separate is I can initially define it to be something for debugging only -
$ my $do_cl;
my $cl_range = sub {
foreach my $i ($_[0] .. $_[1]) {
my $m = $mk_hdr->($imap, $i);
my $l = $last_local->($m);
my $c = $classify->($l, \%message_targets);
if (defined $c) { $do_cl->($i, $c); } else { $unknown{$l}++; }
}
};
CODE(0x868a4a0)
$ $do_cl = sub { print join(' => ', @_)."\n"; };
CODE(0x86a6034)(check on 1-20 elided)
$ $cl_range->(21, 30);
21 => lists
23 => lists
24 => lists
25 => lists
26 => default
28 => lists
29 => lists
30 => lists
$ %unknown;
lxxyj 8 trout.me.uk 1 ide-hard-disk-compswap 8 trout.me.uk 1 aauzf 8 trout.me.uk 1
So, the classifier seems to work. So now all we need to do is create corresponding folders, modify $do_cl, and we're away -
$ $do_cl = sub {
print join(' => ', @_)."\n";
$imap->copy($_[0], "INBOX/".$_[1]) || warn $imap->errstr;
$imap->delete($_[0]) || warn $imap->errstr;
};
$ my $max_id = $imap->select('INBOX/scratch');
$ $cl_range->(1, $max_id); $imap->expunge_mailbox('INBOX/scratch');
Letting this process finish gets me about 12k messages classified and 500 or so addresses in %unknown, so at this point I resort to actually turning it into a real script - something I was able to do pretty much exactly by copy and pasting from the repl history. The only addition is a more compact form for specifying the targets -
my %target_conf = (
# MAILING LISTS
lists => [ qw(infrastructures dbix-class ...
) ],# PERSONAL ADDRESSES
personal => [ qw(matt matthew paypal ...
) ],# additional target lists elided
);
%message_targets = (
map { my $v = $_; (
map { ($_ => $v) } @{$target_conf{$_}}
) } keys %target_conf
);
where the outer map saves off the key as $v so the inner map can use the arrayref stashed in the conf hash to produce key: arrayref entry, value: $v pairs for the target list the code operates on. I considered externalising thse into a conf file but since the qw() format allows me to use arbitrary whitespace for formatting so the config's only about 60 lines, which for software that's basically single-user (and developed in a couple hours) it really isn't worth the hassle.
It processes about 2.5k messages per hour over my DSL connection and seems to keep both the client machine and the IMAP server box reasonably busy while it's doing it. I could probably improve this but again for my purposes, who cares. The big point for me was going from "mailbox is screwed" to "mailbox is usable and can be cleaed out easily" in the space of an afternoon :)
And with that, I'm off to bed. I hope this was of some vague interest to those of you who got all the way through it!
