projects
Jump to:
- mail checker (summary of email subjects)
- basic email backup
- referrer spam ban
- simple incremental backups
A non-exhaustive list of projects I've embarked on and/or completed. If you are interested in the source for the computer-based ones, please ask.
I haven't done anything worthy of setting up a sourceforge project for, but these might be useful to some people in any case. Perhaps for the ideas more than the implementation.
mail checker
description
Overview: Creates a summary page of email subjects for multiple accounts
Note: if you want the source code for this, please contact me.
The number of email addresses I maintain / check is thankfully decreasing, but I still find it laborious to check all of them, especially the ones that tend to receive less important email.
I originally wrote a perl script around (IIRC) POP3lib, or similar. I retrieved mails and searched for the subject using regular expressions. I still use regexes, and still pull the entire email as it works (basically) for now, but my intention is to rewrite to only retrieve headers. I don't know if this functionality is present in the CPAN POP libraries, although it most likely is. I now use a python implementation, partly to help me familiarise myself with python which I now use more than perl (as of 2008 or so, still valid Q4 2009!), and partly because I wanted to switch to IMAP over POP.
My file imap.py processes arguments to check things like: server name, SSL / no SSL, username and password, number of messages to print subjects for, and debug options. I forget if there are others. It uses the imaplib IMAP4 python library.
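For anyone reading for ideas, here is a minimal sketch of how the options listed above could be parsed and turned into a connection. The flag names (`--server`, `--ssl`, `--count` and so on) and the helper names are assumptions for illustration, not the actual interface of imap.py. Only `imaplib.IMAP4`, `imaplib.IMAP4_SSL`, `login` and the `debug` attribute come from the standard library.

```python
import argparse
import imaplib


def build_parser():
    """Command-line options roughly matching those listed above (names are guesses)."""
    parser = argparse.ArgumentParser(description="Print recent email subjects")
    parser.add_argument("--server", required=True, help="IMAP server name")
    parser.add_argument("--ssl", action="store_true", help="connect over SSL")
    parser.add_argument("--user", required=True)
    parser.add_argument("--password", required=True)
    parser.add_argument("--count", type=int, default=10,
                        help="number of messages to print subjects for")
    parser.add_argument("--debug", action="store_true")
    return parser


def connect(args):
    """Open and log in to the server; this part needs a reachable IMAP host."""
    cls = imaplib.IMAP4_SSL if args.ssl else imaplib.IMAP4
    conn = cls(args.server)
    if args.debug:
        conn.debug = 4  # imaplib traces each command above level 3
    conn.login(args.user, args.password)
    return conn
```

A call might look like `python imap.py --server mail.example.org --ssl --user me --password secret --count 5`, where the host and credentials are placeholders.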
It also prints dashes to underline headings. I mention this only because I think it's important to pay attention to the little touches that enhance readability, for anyone that is reading for ideas.
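The underlining touch is small enough to show in full. This is a sketch only, and the function name `underline` is made up for illustration.

```python
def underline(heading, char="-"):
    """Return the heading followed by a line of dashes of the same width."""
    return heading + "\n" + char * len(heading)


if __name__ == "__main__":
    print(underline("Inbox"))
```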
I wrapped the python imap.py ... etc in a bash/sh script to handle calling up my various mailboxes. Since I am
usually ssh'ed into my server, it was enough to print a text file I could occasionally cat. However,
I quickly found this cumbersome for quick checking, and so I got it to spit out a file to a directory that is served on the
web. This is of course a privacy tradeoff, but even in the event that someone does find the file, they will not gain much.
Incidentally, the file also links to webserver stats, making it possibly my most frequently viewed page. It's either that or
the Google homepage (without widgets, ie not iGoogle for those that are interested).
Update: I have updated the code while bored and living in Barcelona. It now pulls only the header, with a resultant modest speed boost. I also added in fully-fledged character encoding support. It had some in before, but it was buggy, which when combined with my regex handling, was causing truncation in some cases. It wasn't a big deal before, but the code is more understandable and manageable now. It also handles esoteric charsets, like those for Arabic, Chinese, Cyrillic, etc. Of course, this support is dependent on python's support, but that is one of the nice things about languages with excellent supporting libraries, like python and perl. You can stand on the shoulders of giants painlessly and seamlessly.
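As an illustration of leaning on the standard library for charsets, a Subject header can be decoded with `email.header.decode_header`. This is a sketch under the assumption that the raw header value has already been fetched as a string. The function name `decode_subject` and the latin-1 fallback are choices made for this example, not a description of imap.py.

```python
from email.header import decode_header


def decode_subject(raw_subject):
    """Turn an RFC 2047 encoded Subject value into a plain str."""
    pieces = []
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            try:
                pieces.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Charset name Python does not know: fall back rather than crash.
                pieces.append(chunk.decode("latin-1"))
        else:
            # Headers with no encoded words come back as str already.
            pieces.append(chunk)
    return "".join(pieces)
```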
Update 2: This script now pulls 300 bytes of the email body to include
in a collapsed, expandable div. This has brought other problems, however. I've
seen the UTF8 output choke on one email, which is a problem. As it stands, all I do to work
around this is catch the exception and stop output for that particular case.
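One plausible (unconfirmed) cause of the UTF-8 problem is that a fixed cut at 300 bytes can land in the middle of a multi-byte character. The sketch below shows the IMAP partial-fetch syntax for asking for the first bytes of the body, plus a decode that substitutes a replacement character instead of raising. The helper names are hypothetical.

```python
def peek_spec(nbytes=300, section="TEXT"):
    """Build an IMAP fetch item asking for the first `nbytes` bytes of a section."""
    return "(BODY.PEEK[%s]<0.%d>)" % (section, nbytes)


def safe_preview(raw, charset="utf-8"):
    """Decode a byte fragment without raising on a character cut in half."""
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: latin-1 maps every byte, so this cannot fail.
        return raw.decode("latin-1")
```

With an `imaplib` connection, the string from `peek_spec()` would be passed as the second argument to `fetch`.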
The other problem is that there doesn't seem to be an easy way to pull the 'plain text' of a
multipart email in the IMAP spec. I read the RFCs till my eyes hurt, and I think it will come
down to writing a simple parser for the BODYSTRUCTURE command and then requesting
the correct part. Not too difficult in theory, but another hour or two of code and
debugging at least.
Update 3: Coding and debugging a very basic, limited BODYSTRUCTURE 'parser' that only
searches for the plain parts of the email actually only took about 45 minutes. The issues are less than before, although
it seems that now different emails make the previews (or 'peeks' as I call them) choke a little.
For example, a Paypal email peek only displays =3D=3D=3D over and over. Base64 encoded emails are just a block of letters and numbers. I have a feeling they can be decoded, although why they weren't decoded automatically by the decoder I put in is another matter. So there are still issues, but I'm progressively knocking off the low-hanging fruit.
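A very basic BODYSTRUCTURE search of the kind described above might look like the following. It is a sketch, not the parser from imap.py. It assumes the parenthesised BODYSTRUCTURE value has already been cut out of the server response as a string, it ignores IMAP literals, and it does not descend into attached message/rfc822 parts. The names `parse_sexp` and `find_plain_part` are invented here.

```python
import re

# Matches "(", ")", a quoted string, or a bare atom such as NIL or 1152.
TOKEN = re.compile(r'\(|\)|"((?:[^"\\]|\\.)*)"|([^\s()"]+)')


def parse_sexp(text):
    """Parse a parenthesised IMAP list into nested Python lists."""
    stack = [[]]
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token == "(":
            stack.append([])
        elif token == ")":
            done = stack.pop()
            stack[-1].append(done)
        elif match.group(1) is not None:
            stack[-1].append(match.group(1))  # quoted string, quotes removed
        else:
            stack[-1].append(token)
    return stack[0][0]


def find_plain_part(structure, prefix=""):
    """Return the section number (eg "1" or "1.2") of the first text/plain part."""
    if isinstance(structure[0], list):
        # Multipart: the leading elements are child parts, numbered from 1.
        index = 0
        for child in structure:
            if not isinstance(child, list):
                break  # reached the subtype string, eg "ALTERNATIVE"
            index += 1
            found = find_plain_part(child, "%s%d." % (prefix, index))
            if found:
                return found
        return None
    if structure[0].upper() == "TEXT" and structure[1].upper() == "PLAIN":
        return prefix.rstrip(".") or "1"
    return None
```

The returned section number can then be requested on its own, for example as `BODY.PEEK[1.2]`, so the preview no longer starts with MIME boundary lines.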
Update 4: imap.py now decodes character codes in emails. I solved this purely by chance. I still had a tab with
RFC 2045 open at section 6.7 (http://tools.ietf.org/html/rfc2045#section-6.7), entitled "Quoted-Printable
Content-Transfer-Encoding". Just what I needed. quopri.decodestring along with some logic to detect if it is in
fact quoted-printable did the job nicely.
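A sketch of what "decode if it looks quoted-printable" could mean in practice. `quopri.decodestring` is the real standard-library call; the detection rule (trust the declared Content-Transfer-Encoding if there is one, otherwise look for `=XX` escapes or soft line breaks) and the name `maybe_decode_qp` are assumptions made for this example.

```python
import quopri
import re

# "=XX" hex escapes or a trailing "=" soft line break suggest quoted-printable.
QP_HINT = re.compile(rb"=[0-9A-F]{2}|=\r?\n")


def maybe_decode_qp(raw, declared_encoding=None):
    """Decode `raw` bytes only when they are, or appear to be, quoted-printable."""
    if declared_encoding and declared_encoding.lower() == "quoted-printable":
        return quopri.decodestring(raw)
    if declared_encoding is None and QP_HINT.search(raw):
        return quopri.decodestring(raw)
    return raw
```

The guess in the second branch can misfire on ordinary text that happens to contain something like `=3D`, which is why the declared encoding is preferred when it is available.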
However, it still doesn't do base64 decoding robustly. There is a way to do it, but I've found it unreliable. Basically I can't decode a particular block of text because the padding is incorrect. I've read that the correct length is a multiple of either 3 or 4 bytes (sources disagree, and I'm not reading another technical doc at this stage or I might take my own life). The funny thing is that my block of base64-encoded text is 300 bytes, which is a multiple of both 3 and 4. Padding this particular source with "==" caused it to decode, so I may try that as a temporary workaround (who knows why it wants a length of 302?), and 'fix' it to consecutively append "=" if it fails in the future.
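Base64 turns every 3 bytes of input into 4 characters of output, so it is the encoded text whose length should be a multiple of 4. A possible explanation for the 302 puzzle, offered as a guess: the 300 bytes fetched include line breaks, which Python's decoder skips. Assuming 76-character lines ending in CRLF, 300 bytes hold 294 real base64 characters, which leaves 2 left over and would explain why "==" happened to work. The sketch below sidesteps padding by dropping the incomplete trailing group. The function name is hypothetical.

```python
import base64
import re


def decode_truncated_base64(chunk):
    """Decode a base64 fragment that may have been cut mid-stream."""
    # Line breaks count towards the bytes fetched but are not base64 data.
    data = re.sub(rb"\s+", b"", chunk)
    # Base64 works in groups of 4 characters; drop any incomplete trailing group.
    data = data[: len(data) - len(data) % 4]
    return base64.b64decode(data)
```

Dropping the tail loses at most 2 bytes of a preview, which seems a fair trade for never hitting a padding error.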
Update 5: There was a slight problem with the (web / HTML) output, which meant that any HTML email that didn't include a plain-text version would screw with displaying later subjects, or error out. I also fixed a couple of silly bugs, like assuming we knew how to decode when we actually didn't. That sort of thing.
drawbacks
- Currently pulls the whole email to check just the subject; pending modification (after I read the RFC and library docs... again).
- Occasionally truncates subjects. I suspect this is due to either my regular expression or 'funny codes' in subject lines. The latter I can live without, really.
- Based around cron, so pulls email even if not needed. An on-demand system would be more efficient, of course. However, this way I can break up my concentration into blocks - I know there is no need to check the page if it is, say, 20:28.
- Pulling parts of the BODY[TEXT] of the email results in en/decoding errors. More specifically:
  - Base64 is (seemingly) undecoded.
  - Some people lie about their emails being plain text, or their encoding isn't trivially decipherable, eg still with =20 or other codes therein (see http://tools.ietf.org/html/rfc2045#section-6.7).
  - Does not specify which part of the body to pull, which leads to results like: --_----------=_112358201558820 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable etc etc
future
Fix the problems. Specifically, change to check-on-demand, but more importantly, change the implementation to only pull the email headers.
I'd like to write support for RSS, although that may not be implementable in a meaningful way. I will investigate that.
It might be interesting to perhaps grab, say 140 bytes of the body of the email and put that in an expandable div element - collapsed by default, of course.
basic email backup
Overview: Simple plain text file based backup of an email account, 1 file per email.
This ties in with my email checker above: I modified imap.py to write the entire contents of each email (ie headers and body) to a text file numbered by the email's position in the mailbox. Thus, the first email is 1.txt, the 327th, 327.txt, and so forth.
My first implementation of this was basic, to dump an inbox when I migrated my email address. Now, it works through mailboxes on an IMAP server and writes out the emails for them all!
My rationale for this solution, rather than using a client to archive the emails is twofold. The first is I have had bad experiences with mail clients importing mails, and the second is it makes for an easier backup solution: run the script, tar the directories, copy to backup server[s] as needed.
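The one-file-per-email dump can be sketched in a few lines on top of `imaplib`. This is an illustration rather than the actual script. `backup_mailbox` and its arguments are invented names, `conn` is assumed to be an already logged-in `imaplib.IMAP4` or `IMAP4_SSL` object, and mailbox names containing spaces would need quoting before being passed to `select`.

```python
import os


def backup_mailbox(conn, mailbox, dest_root):
    """Write every message in `mailbox` to dest_root/mailbox/N.txt."""
    target = os.path.join(dest_root, mailbox)
    os.makedirs(target, exist_ok=True)
    conn.select(mailbox, readonly=True)  # read-only so nothing is marked as seen
    status, data = conn.search(None, "ALL")
    written = 0
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        raw = parts[0][1]  # headers and body as bytes
        with open(os.path.join(target, num.decode() + ".txt"), "wb") as handle:
            handle.write(raw)
        written += 1
    return written
```

Writing in binary mode keeps each message exactly as the server sent it, which avoids the charset questions that the checker has to deal with.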
referrer spam ban
Overview: Bans IPs from connecting via hosts.deny based on occurrence of a piece of text in the Apache log
Another very simple script, this takes the logs to check through as arguments (very useful with globbing) and looks in those file[s] for a set of strings to check for. If there is a match between one of the strings it is looking for and the line open in the log file (ie a piece of referrer spam) it prints out the IP that made the request, a tab, a hash and the string it matched for easy pasting into hosts.deny.
At the moment it's a completely naïve implementation, but it is still fairly quick. It took 60 seconds to search all my monolithic logs (ie millions of lines cumulatively) for 23 lines to match. Given that this is going to be completely I/O-bound, I'm not sure how it could be sped up. I will investigate as an academic exercise, but if it weren't for my curiosity I probably wouldn't bother.
Basically, it's a faster version of:
cat /var/log/apache2/example.org | grep -i 'mega-spam.com/my-cool-site/' | awk '{ print $1 }' | sort | uniq
This can also be used to ban bots / compromised machines searching for naughty things. For example, I get a lot of requests (a few hundred a month on one site alone) for something called mirserver.rar for some reason. Since the script searches the entire string this will also look for matches in the request string of course.
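For comparison with the shell pipeline, here is a naive sketch of the approach in Python: one pass over each log, checking every line against all the strings at once. It is not the original script. The name `find_spammers`, the example strings, the case-insensitive match and the de-duplication (mirroring `grep -i` and `sort | uniq`) are all choices made for this illustration, and it assumes the IP is the first space-separated field, as in Apache's common and combined log formats.

```python
import sys


def find_spammers(log_paths, needles):
    """Yield 'IP<tab># matched string' for log lines containing any needle."""
    seen = set()
    lowered = [needle.lower() for needle in needles]
    for path in log_paths:
        with open(path, errors="replace") as log:
            for line in log:
                haystack = line.lower()
                for needle in lowered:
                    if needle in haystack:
                        ip = line.split(" ", 1)[0]
                        if (ip, needle) not in seen:
                            seen.add((ip, needle))
                            yield "%s\t# %s" % (ip, needle)
                        break  # one match per line is enough


if __name__ == "__main__":
    # Hypothetical strings to look for; the real list is not given.
    NEEDLES = ["mega-spam.com/my-cool-site/", "mirserver.rar"]
    for entry in find_spammers(sys.argv[1:], NEEDLES):
        print(entry)
```

Because each log is read once regardless of how many strings are searched for, this avoids re-running the pipeline per string.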
simple incremental backups
Nothing special about this. I have a couple of scripts that (currently) do hourly and daily incremental backups using rsync. I retain 24 hour history and 7 day history. The scripts are generic enough to do remote backups via smb or over ssh.
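One common way to get this kind of rolling history from rsync is to name each snapshot after its hour or weekday and hard-link unchanged files against the previous snapshot with `--link-dest`. Whether the scripts here work that way is not stated, so treat the following as a sketch under that assumption. The directory naming (`hourly.00` to `hourly.23`, `daily.00` to `daily.06`) and the function name are invented.

```python
import datetime
import os


def rsync_command(source, dest_root, now, kind="hourly"):
    """Build an rsync command for the snapshot slot that `now` falls into."""
    slot = now.hour if kind == "hourly" else now.weekday()
    slots = 24 if kind == "hourly" else 7
    target = os.path.join(dest_root, "%s.%02d" % (kind, slot))
    # The previous slot wraps around, so only 24 (or 7) snapshots ever exist.
    previous = os.path.join(dest_root, "%s.%02d" % (kind, (slot - 1) % slots))
    return ["rsync", "-a", "--delete", "--link-dest=" + previous, source, target]


if __name__ == "__main__":
    # Print the command rather than run it; paths are placeholders.
    print(" ".join(rsync_command("/home/", "/backup", datetime.datetime.now())))
```

The list form can be handed to `subprocess.run` unchanged. For a remote copy over ssh, the source or target would be written in rsync's `user@host:path` form.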