Using Devel::REPL to develop a mail cleanout script
So, weeks and weeks on from when I planned, I'm finally getting down to sorting my e-mail out. First thing is to pick an IMAP client library. A quick poke for IMAP client libraries on CPAN reveals a few, notably IMAP::Client, Net::IMAP and Net::IMAP::Simple. Of the three, IMAP::Client looks like the most 'engineered' but I don't really feel the need for anything complex just yet, so I'm going to go with Net::IMAP::Simple.
Since I currently have a -lot- of crap in my inbox, I'm going to grab the unread mail and shuffle it into a mailbox called 'scratch' that I can back up, screw up and generally mess with to my heart's content. I'm eliding how I do that because (a) it's boring cp -a stuff at a shell prompt (b) it's specific to the slightly odd qmail setup I built several years ago, which means it's likely useless to anybody reading this.
INBOX/scratch created and populate, it's time to fire up ye olde repl and see what we get (NB: throughout this transcript I'm using ' 8 ' instead of '@' in order to throw off spambots.
$ use Net::IMAP::Simple;
$ my $imap = Net::IMAP::Simple->new('imap.scsys.co.uk');
Net::IMAP::Simple=HASH(0x8256c1c)
$ $imap->login('matt 8 trout.me.uk', 'notmyrealpw');
1;
$ $imap->select('INBOX/scratch');
16202
Riiight. Well, so far so good but that's a lot of e-mail to deal with. Looking at the Net::IMAP::Simple synopsis, my best route to poke at it is to make an Email::Simple object from the headers, so we'll load that and build a quick function that handles doing that for a given msgid.
$ use Email::Simple;
Net::IMAP::Simple=HASH(0x8256c1c)
$ my $mk_hdr = sub { Email::Simple->new(join '', @{$_[0]->top($_[1])}); };
CODE(0x85ac270)
$ my $msg = $mk_hdr->($imap, 1);
Email::Simple=HASH(0x86a1b0c)
$ $msg->header('Delivered-To');
matt 8 trout.me.uk infrastructures 8 trout.me.uk
$ join(',',$msg->header('Delivered-To'));
matt 8 trout.me.uk,infrastructures 8 trout.me.uk
Top. Not only does it work, but we get a list back. Now, my plan is to redirect based on the mail address delivered to for the moment - while yes, I could easily do this by rejigging my mailbox layout I can't be bothered pre-processing old mail and I know there's a mail server migration in my near future so I want my filtering approach to be independent of any of that. Hence why I'm doing things this way - what I'm trying to end up with is a portable version of the set of mail rules I had last time I attempted to use Thunderbird. Anyway. Delivered-To headers are the easiest way of getting hold of this info, but if it went through a remote network it'll already have one from there and I have no desire to classify based on other people's mail addresses, so we need the last one on my own domain. Time for another quick function, taking advantage of the wonderful perl built-in 'reverse' -
$ my $last_local = sub { foreach(reverse $_[0]->header('Delivered-To')) { return $_ if /\@trout.me.uk$/; } };
CODE(0x8548e90)
$ $last_local->($msg);
infrastructures 8 trout.me.uk
$ my @first_ten = map { $mk_hdr->($imap, $_) } 1..10;
Email::Simple=HASH(0x86a2730) Email::Simple=HASH(0x86a66d0) Email::Simple=HASH(0
x86aad00) Email::Simple=HASH(0x86ae078) Email::Simple=HASH(0x86ae72c) Email::Sim
ple=HASH(0x86b98dc) Email::Simple=HASH(0x86cb79c) Email::Simple=HASH(0x86d2660)
Email::Simple=HASH(0x86d2cb4) Email::Simple=HASH(0x86d3440)
$ join("\n", map { $last_local->($_) } @first_ten);
infrastructures 8 trout.me.uk
perl-stuff 8 trout.me.uk
dbix-class 8 trout.me.uk
dbix-class 8 trout.me.uk
utp 8 trout.me.uk
dbix-class 8 trout.me.uk
perl-stuff 8 trout.me.uk
dbix-class 8 trout.me.uk
wiqi 8 trout.me.uk
matt 8 trout.me.uk
Ok, that seems to be giving us the information I needed. So, now to build a simple classifier per-delivered-to. Declaring two hashes, one to map mail parts to targets and one to store unknown results, and build a classifier subroutine that pulls off the part before the @ and checks the target hash for an entry -
$ my %message_targets; my %unknown;
$ @message_targets{qw/infrastructures dbix-class perl-stuff matt utp wiqi/} = qw/lists lists default personal junk junk/;
$ my $classify = sub { $_[0] =~ /(.*?)\@/; return $_[1]->{$1}; };
CODE(0x86a1c5c)
$ $classify->('perl-stuff 8 trout.me.uk',\%message_targets);
default
$ join("\n", map { $classify->($last_local->($_), \%message_targets) } @first_ten);
lists
default
lists
lists
junk
lists
default
lists
junk
personal
Now we build a function that actually puts it all together (somewhat reformatted for clarity albeit I typed it straight in on a single line when I first did it) - this takes a range of message indexes, pulls the headers, finds the local address and does the classify run. If it gets a result it calls a $do_cl function to do something with the classification, if not increments the unknown count for the original address. The reason for keeping the $do_cl part separate is I can initially define it to be something for debugging only -
$ my $do_cl;
my $cl_range = sub {
foreach my $i ($_[0] .. $_[1]) {
my $m = $mk_hdr->($imap, $i);
my $l = $last_local->($m);
my $c = $classify->($l, \%message_targets);
if (defined $c) { $do_cl->($i, $c); } else { $unknown{$l}++; }
}
};
CODE(0x868a4a0)
$ $do_cl = sub { print join(' => ', @_)."\n"; };
CODE(0x86a6034)(check on 1-20 elided)
$ $cl_range->(21, 30);
21 => lists
23 => lists
24 => lists
25 => lists
26 => default
28 => lists
29 => lists
30 => lists
$ %unknown;
lxxyj 8 trout.me.uk 1 ide-hard-disk-compswap 8 trout.me.uk 1 aauzf 8 trout.me.uk 1
So, the classifier seems to work. So now all we need to do is create corresponding folders, modify $do_cl, and we're away -
$ $do_cl = sub {
print join(' => ', @_)."\n";
$imap->copy($_[0], "INBOX/".$_[1]) || warn $imap->errstr;
$imap->delete($_[0]) || warn $imap->errstr;
};
$ my $max_id = $imap->select('INBOX/scratch');
$ $cl_range->(1, $max_id); $imap->expunge_mailbox('INBOX/scratch');
Letting this process finish gets me about 12k messages classified and 500 or so addresses in %unknown, so at this point I resort to actually turning it into a real script - something I was able to do pretty much exactly by copy and pasting from the repl history. The only addition is a more compact form for specifying the targets -
my %target_conf = (
# MAILING LISTS
lists => [ qw(infrastructures dbix-class ...
) ],# PERSONAL ADDRESSES
personal => [ qw(matt matthew paypal ...
) ],# additional target lists elided
);
%message_targets = (
map { my $v = $_; (
map { ($_ => $v) } @{$target_conf{$_}}
) } keys %target_conf
);
where the outer map saves off the key as $v so the inner map can use the arrayref stashed in the conf hash to produce key: arrayref entry, value: $v pairs for the target list the code operates on. I considered externalising thse into a conf file but since the qw() format allows me to use arbitrary whitespace for formatting so the config's only about 60 lines, which for software that's basically single-user (and developed in a couple hours) it really isn't worth the hassle.
It processes about 2.5k messages per hour over my DSL connection and seems to keep both the client machine and the IMAP server box reasonably busy while it's doing it. I could probably improve this but again for my purposes, who cares. The big point for me was going from "mailbox is screwed" to "mailbox is usable and can be cleaed out easily" in the space of an afternoon :)
And with that, I'm off to bed. I hope this was of some vague interest to those of you who got all the way through it!
Comments
I really wanted to be able to dump the history to a file or at least to stdout, so I spent a 10 minutes and wrote a plugin: DumpHistory. You can find it here.
Please feel free to include it in the ultimate distribution.
-Mark
You may also want to have a look at the ReadLineHistory plugin which was contributed by Shadowcat's sysadmin and hooks into the same mechanism as the bash history file stuff - complementary to what you've done rather than an alternative but maybe also useful to you.