projects

A non-exhaustive list of projects I've embarked on and/or completed. If you are interested in the source for the computer-based ones, please ask.

I haven't done anything worthy of setting up a sourceforge project for, but these might be useful to some people in any case. Perhaps for the ideas more than the implementation.

mail checker

description

Overview: Creates a summary page of email subjects for multiple accounts

Note: if you want the source code for this, please contact me.

The number of email addresses I maintain / check is thankfully decreasing, but I still find it laborious to check all of them, especially the ones that tend to receive less important email.

I originally wrote a perl script around (IIRC) POP3lib, or similar. I retrieved the mails and searched for the subjects using regular expressions. I still use regexes, and still pull the entire email, as it works (basically) for now, but my intention is to rewrite it to retrieve only the headers. I don't know if this functionality is present in the CPAN POP libraries, although it most likely is. I now use a python implementation, partly to help me familiarise myself with python, which I now use more than perl (as of 2008 or so, and still true in Q4 2009!), and partly because I wanted to switch from POP to IMAP.

My file imap.py processes arguments to set things like: server name, SSL / no SSL, username and password, number of messages to print subjects for, and debug options. I forget if there are others. It uses python's imaplib IMAP4 library.
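
For the curious, the core loop boils down to something like the following (a minimal sketch, not the actual imap.py; the function and argument names are illustrative):

    import imaplib, re

    def print_subjects(server, username, password, count=10, use_ssl=True):
        # Connect with or without SSL, as selected by the command-line options.
        conn = imaplib.IMAP4_SSL(server) if use_ssl else imaplib.IMAP4(server)
        conn.login(username, password)
        conn.select('INBOX', readonly=True)
        typ, data = conn.search(None, 'ALL')
        for msg_id in data[0].split()[-count:]:
            # Pull the whole message and regex out the subject line
            # (header-only fetching came later; see the updates below).
            typ, msg = conn.fetch(msg_id, '(RFC822)')
            match = re.search(br'^Subject: (.*)', msg[0][1], re.MULTILINE)
            if match:
                print(match.group(1).decode('utf-8', 'replace'))
        conn.logout()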

It also prints dashes to underline headings. I mention this only because I think it's important to pay attention to the little touches that enhance readability, for anyone who is reading for ideas.

I wrapped the python imap.py ... etc in a bash/sh script to handle calling up my various mailboxes. Since I am usually ssh'ed into my server, it was enough to print to a text file I could occasionally cat. However, I quickly found this cumbersome for quick checking, and so I got it to spit out a file to a directory that is served on the web. This is of course a privacy tradeoff, but even in the event that someone does find the file, they will not gain much. Incidentally, the file also links to my webserver stats, making it possibly my most frequently viewed page. It's either that or the Google homepage (without widgets, ie not iGoogle, for those that are interested).

Update: I have updated the code while bored and living in Barcelona. It now pulls only the header, with a resultant modest speed boost. I also added fully-fledged character encoding support. It had some before, but it was buggy, and combined with my regex handling it was causing truncation in some cases. It wasn't a big deal before, but the code is more understandable and manageable now. It also handles esoteric charsets, like those for Arabic, Chinese, Cyrillic, etc. Of course, this support is dependent on python's support, but that is one of the nice things about languages with excellent supporting libraries, like python and perl: you can stand on the shoulders of giants painlessly and seamlessly.
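
The charset handling is roughly what the standard email module gives you for free; a sketch of the idea (the helper name is mine):

    from email.header import decode_header

    def decode_subject(raw_subject):
        # decode_header splits a raw header into (text, charset) chunks,
        # transparently handling the =?charset?b?...?= 'encoded words'
        # used for Arabic, Chinese, Cyrillic and so on.
        parts = []
        for text, charset in decode_header(raw_subject):
            if isinstance(text, bytes):
                text = text.decode(charset or 'ascii', 'replace')
            parts.append(text)
        return ''.join(parts)

For example, decode_subject('=?utf-8?b?0J/RgNC40LLQtdGC?=') comes out as 'Привет'.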

Update 2: This script now pulls 300 bytes of the email body to include in a collapsed, expandable div. This has brought other problems, however. I've seen the UTF-8 output choke on one email, which is a problem. As it stands, all I do to work around this is catch the exception and suppress the preview for that particular message.
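
For what it's worth, IMAP's partial fetch syntax (RFC 3501) is what lets you ask for just those 300 bytes, and the workaround is a plain try/except; a sketch, assuming a conn and msg_id as in the earlier snippet:

    # Fetch only the first 300 bytes of the body text for the preview.
    typ, msg = conn.fetch(msg_id, '(BODY.PEEK[TEXT]<0.300>)')
    try:
        peek = msg[0][1].decode('utf-8')
    except UnicodeDecodeError:
        peek = ''  # give up on this message's preview, as described above

Incidentally, a 300-byte cut can land halfway through a multi-byte UTF-8 character, which alone is enough to make the decode throw, so that may well be what's choking.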

The other problem is that there doesn't seem to be an easy way to pull the 'plain text' part of a multipart email in the IMAP spec. I read the RFCs till my eyes hurt, and I think it will come down to writing a simple parser for the BODYSTRUCTURE response and then requesting the correct part. Not too difficult in theory, but another hour or two of coding and debugging at least.

Update 3: Coding and debugging a very basic, limited BODYSTRUCTURE 'parser' that only searches for the plain parts of the email actually only took about 45 minutes. There are fewer issues than before, although it seems that now certain emails make the previews (or 'peeks' as I call them) choke a little.
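
The 'parser' amounts to little more than walking the parentheses; a rough sketch of the approach (simplified, and assuming the raw BODYSTRUCTURE response is already in hand as a string):

    def plain_part_number(bs):
        # bs looks like: (("text" "plain" ...)("text" "html" ...) "alternative")
        # Track parenthesis depth; each depth-2 group is one top-level
        # part, so count them and stop at the first text/plain one.
        depth = 0
        part = 0
        start = 0
        for i, ch in enumerate(bs):
            if ch == '(':
                depth += 1
                if depth == 2:
                    part += 1
                    start = i
            elif ch == ')':
                if depth == 2 and bs[start:i].lower().startswith('("text" "plain"'):
                    return str(part)  # then fetch BODY.PEEK[<part>]<0.300>
                depth -= 1
        return None  # no plain part found

This only copes with simple multipart messages (nested multiparts would need real recursion), but that seems to cover most real mail.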

For example, a Paypal email peek only displays =3D=3D=3D over and over. Base64-encoded emails are just a block of letters and numbers. I have a feeling they can be decoded, although why they weren't decoded automatically by the decoder I put in is another matter. So there are still issues, but I'm progressively knocking off the low-hanging fruit.

Update 4: imap.py now decodes character codes in emails. I solved this purely by chance: I still had a tab open at section 6.7 of RFC 2045, entitled "Quoted-Printable Content-Transfer-Encoding". Just what I needed. quopri.decodestring, along with some logic to detect whether the text is in fact quoted-printable, did the job nicely.
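
The detection-plus-decode step only needs a few lines (the heuristic here is illustrative, not exactly what imap.py does):

    import quopri

    def maybe_unquote(text):
        # Only run the quoted-printable decoder when the text actually
        # looks quoted-printable, so ordinary '=' signs survive intact.
        if '=3D' in text or '=20' in text or text.rstrip().endswith('='):
            return quopri.decodestring(text.encode()).decode('utf-8', 'replace')
        return text

This presumably also explains the Paypal mystery above: =3D is quoted-printable for '=', so those peeks were quoted-printable all along.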

However, it still doesn't do base64 decoding robustly. There is a way to do it, but I've found it unreliable. Basically I can't decode a particular block of text because the padding is incorrect. The encoded length is supposed to be a multiple of 4 characters, with each 4 characters decoding to 3 bytes (which is presumably why sources appear to disagree on '3 or 4', and I'm not reading another technical doc at this stage or I might take my own life). The funny thing is that my block of base64-encoded text is 300 bytes, which is a multiple of both 3 and 4. Padding this particular source with "==" caused it to decode, so I may try that as a temporary workaround (who knows why it wants a length of 302?), and 'fix' it to consecutively append "=" if decoding fails in the future.
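
The consecutive-padding 'fix' would look something like this (a sketch of the workaround, not battle-tested):

    import base64, binascii

    def stubborn_b64decode(block):
        # Try the block as-is, then keep appending '=' until it either
        # decodes or clearly isn't going to.
        for pad in range(4):
            try:
                return base64.b64decode(block + '=' * pad)
            except (binascii.Error, TypeError):  # TypeError on older pythons
                continue
        return None  # not base64 after all, or too mangled to save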

Update 5: There was a slight problem with the (web / HTML) output, which meant that any HTML email that didn't include a plain-text version would screw with displaying later subjects, or error out. I also fixed a couple of silly bugs, like assuming we knew how to decode when we actually didn't. That sort of thing.

drawbacks

  • originally pulled the whole email to check just the subject; since fixed to fetch only the headers (see the first update above)
  • Occasionally truncates subjects. I suspect this is due to either my regular expression or 'funny codes' in subject lines. The latter I can live without, really.
  • Based around cron, so pulls email even if not needed. An on-demand system would be more efficient, of course. However, this way I can break up my concentration into blocks - I know there is no need to check the page if it is, say, 20:28.
  • Pulling parts of the BODY[TEXT] of the email results in en/decoding errors. More specifically:
    • Base64 is (seemingly) left undecoded.
    • Some people lie about their emails being plain text, or their encoding isn't trivially decipherable, eg subjects still arriving with =20 or other codes therein (see http://tools.ietf.org/html/rfc2045#section-6.7).
  • Does not specify which part of the body to pull, which leads to results like:
    --_----------=_112358201558820
    Content-Type: text/plain; charset="utf-8"
    Content-Transfer-Encoding: quoted-printable
    
    etc etc
    

future

Fix the remaining problems. Specifically, change to check-on-demand. The more important change - pulling only the email headers - has since been done (see the first update above).

I'd like to write support for RSS, although that may not be implementable in a meaningful way. I will investigate that.

It might be interesting to perhaps grab, say, 140 bytes of the body of the email and put that in an expandable div element - collapsed by default, of course. (Since done, with 300 bytes; see Update 2 above.)

basic email backup

Overview: Simple plain text file based backup of an email account, 1 file per email.

This ties in with my email checker above: I modified imap.py to write the entire contents of each email (ie headers and body) to a text file corresponding to the position of the email in the mailbox. Thus, the first email is 1.txt, the 327th, 327.txt, and so forth.

My first implementation of this was basic, written to dump an inbox when I migrated my email address. Now it works through all the mailboxes on an IMAP server and writes out the emails for every one of them!
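
The guts of it, give or take (a simplified sketch; the naive LIST parsing assumes mailbox names without spaces):

    import imaplib, os

    def dump_account(conn, root='backup'):
        # Walk every mailbox on the server and write each message out
        # as plain text, one file per email, numbered by position.
        typ, boxes = conn.list()
        for line in boxes:
            # The mailbox name is the last token of each LIST line.
            name = line.decode().rsplit(' ', 1)[-1].strip('"')
            conn.select(name, readonly=True)
            typ, data = conn.search(None, 'ALL')
            os.makedirs(os.path.join(root, name), exist_ok=True)
            for n, msg_id in enumerate(data[0].split(), 1):
                typ, msg = conn.fetch(msg_id, '(RFC822)')
                with open(os.path.join(root, name, '%d.txt' % n), 'wb') as f:
                    f.write(msg[0][1])

Selecting the mailboxes read-only means the dump doesn't mark everything as seen along the way.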

My rationale for this solution, rather than using a client to archive the emails, is twofold. The first is that I have had bad experiences with mail clients importing mail, and the second is that it makes for an easier backup solution: run the script, tar the directories, copy to backup server[s] as needed.

referrer spam ban

Overview: Bans IPs from connecting via hosts.deny based on occurrence of a piece of text in the Apache log

Another very simple script. This takes the logs to check as arguments (very useful with globbing) and looks through those file[s] for a set of strings. If one of the strings matches the line currently open in the log file (ie a piece of referrer spam), it prints the IP that made the request, a tab, a hash, and the string it matched, for easy pasting into hosts.deny.
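
Stripped down, the whole thing is about a dozen lines of python (the patterns here are illustrative):

    import sys

    # One entry per piece of referrer spam (or naughty request) to hunt for.
    BAD_STRINGS = ['mega-spam.com/my-cool-site/', 'mirserver.rar']

    seen = set()
    for path in sys.argv[1:]:  # the log files; handy with globbing
        with open(path, errors='replace') as log:
            for line in log:
                for bad in BAD_STRINGS:
                    if bad in line:
                        ip = line.split(None, 1)[0]  # first field is the IP
                        if (ip, bad) not in seen:
                            seen.add((ip, bad))
                            # IP, a tab, then a hash-comment for hosts.deny
                            print('%s\t# %s' % (ip, bad))
                        break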

At the moment it's a completely naïve implementation, but it is still fairly quick: it took 60 seconds to search all my monolithic logs (ie millions of lines cumulatively) for 23 strings. Given that this is going to be completely I/O-bound, I'm not sure how it could be sped up. I will investigate as an academic exercise, but if it weren't for my curiosity I probably wouldn't bother.

Basically, it's a faster version of:
grep -i 'mega-spam.com/my-cool-site/' /var/log/apache2/example.org | awk '{ print $1 }' | sort | uniq

This can also be used to ban bots / compromised machines searching for naughty things. For example, I get a lot of requests (a few hundred a month on one site alone) for something called mirserver.rar. Since the script searches the entire log line, it will of course also match against the request string.

simple incremental backups

Nothing special about this. I have a couple of scripts that (currently) do hourly and daily incremental backups using rsync. I retain 24 hours of hourly history and 7 days of daily history. The scripts are generic enough to do remote backups via smb or over ssh.

system information graphing

Overview: Use gnuplot to graph system data over time (for example)

I wanted to keep an eye on certain information relating to a media server I have set up at home, and since I have a weakness for graphs, they seemed like the obvious choice for watching data change over time.

Thanks to standard utilities and lm-sensors, I'm able to track temperatures (CPU, system, hard drives); voltages; disk usage, both of mounts and of particular folders of interest; and, for fun, the duration of recorded audio files.

I also combined my mail checker with graphing, so I can see the trends in mail volume over time.

Thanks to gnuplot, it's reasonably straightforward to graph any data you happen to have. Collecting the periodic data is easily done with cron and bash. For simplicity, the graph generation is also done with cron and bash.

What I intend to do in the near future is write some higher-level scripts in either python or Lisp to generate the bash scripts. The only reason for doing this is that updating the bash scripts is a manual process that a higher-level script could make marginally easier.
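
Something along these lines, perhaps (a hypothetical sketch in python; the data file layout - timestamp in column 1, one whitespace-separated value column per series - is an assumption):

    # Generate a gnuplot script for one data series, rather than
    # hand-editing a new bash wrapper for every graph.
    TEMPLATE = """set terminal png size 800,400
    set output '{out}'
    set xdata time
    set timefmt '%Y-%m-%d-%H:%M'
    set title '{title}'
    plot '{data}' using 1:{col} with lines title '{title}'
    """

    def write_plot_script(data, col, title, out):
        path = out.replace('.png', '.gp')
        with open(path, 'w') as f:
            f.write(TEMPLATE.format(data=data, col=col, title=title, out=out))
        return path  # cron then just runs: gnuplot <path>

    write_plot_script('cpu-temp.dat', 2, 'CPU temperature', 'cpu-temp.png')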