Kroon Information Systems
       
    
MailRetr

I've had a few run-ins with mail retrieval systems like fetchmail, and have finally decided that it may be worth investigating a change of system. It may be possible to modify fetchmail (or even getmail) to suit my requirements, or it may not. In the meantime this is all just concept; nothing concrete is in place yet.

The problem

Most people who read this will know that bandwidth in South Africa is rather expensive. Those who live here will know what I refer to when I mention "the CAP". It's something that the telecommunications providers here must enforce because the monopolistic provider apparently doesn't have the infrastructure to cope if everybody simply used as much bandwidth as they needed. Bottom line: bandwidth is precious, and in some cases EXPENSIVE.

Now, it sometimes happens that there are bugs in software (the particular case that sparked this round of thinking was clamav), and in a mail system this can result in temporary failures (4xx rejects). What happened was that fetchmail would download a 4MB email and pass it to exim; exim would scan it using clamav; clamav would fail with "Zip module failure"; exim would issue a 4xx temporary failure; and fetchmail would discard the message and repeat the whole process 5 minutes later. Keep this going for 5 days and you end up using quite a lot of bandwidth, not to mention keeping a pipe clogged up by re-downloading the same data over and over again.
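
To put a rough number on that loop, the arithmetic can be sketched as below. The figures are the ones from the example above (a 4MB message, a 5 minute poll interval, 5 days); protocol overhead is ignored and the variable names are only for illustration.

```python
# Rough cost of re-downloading one stuck message on every poll.
# Figures come from the example above; protocol overhead is ignored.
message_mb = 4      # size of the stuck message
poll_minutes = 5    # interval between fetchmail polls
days = 5            # how long the temporary failure persisted

attempts = days * 24 * 60 // poll_minutes
wasted_mb = attempts * message_mb

print(f"{attempts} downloads, {wasted_mb} MB transferred")
```

That works out to 1440 downloads and 5760MB for a single 4MB message.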

Possible solution

I've been having a bit of a discussion with some people on the glug-chat mailing list (GLUG == Gauteng Linux User Group), and a few ideas have been thrown around, including the observation that there are many other possible bandwidth savings in the whole fetchmail process. I'll attempt to discuss some of these in detail; others are so obvious that I won't bother expanding on them.

Requirements

Firstly, there are a few things that I would absolutely require should a system be written from scratch, and would really like to have seen in fetchmail/getmail:

  • Modularity/Pluggability:
    Various aspects of the system can (and should) be pluggable, for example an abstract mail retriever with several implementations that provide different mechanisms for retrieving email. Two such implementations would typically be POP3 and IMAP.
  • plain, SSL and STARTTLS:
    These are possibly POP3/IMAP specific, but those two at least should be able to use plaintext connections (as provided by just about all ISPs, google being the only exception afaik) and straight SSL off the bat, also referred to as pop3s and imaps (they are to pop and imap as https is to http). A newer mechanism that is also seen a lot now is STARTTLS, where the connection starts in plaintext and then gets "upgraded" to TLS/SSL at a later stage.
  • POP3 + IMAP:
    I personally don't have a need atm for ODMR, ETRN or UUCP and would thus currently only require implementation of POP3 and IMAP.
  • Delivery methods:
    I would like to see different delivery methods, including direct delivery into Maildir, mbox, IMAP, SMTP and command delivery, for example invoking /usr/sbin/sendmail, procmail or some other command. The only one I really require atm is SMTP, but these methods should be pluggable in any case.
  • Headers first, body later:
    As far as I can tell, fetchmail downloads the entire message (including headers) before initiating communications with SMTP (although it keeps the SMTP connection open once it has opened it). This means that even if a 5MB message gets rejected before we even get to the data phase of the SMTP connection, we have still downloaded the full 5MB, whilst we could have gotten away with only the headers of the email. Both POP3 and IMAP allow for retrieving only the headers of an email.
  • single-drop and multi-drop support:
    Some ISPs provide us with a catch-all mailbox for a domain (including all the collateral disadvantages that come with this). This is also known as a multi-drop. As in fetchmail, this should be supported in any mail fetching system.
  • multiple configuration backends:
    fetchmail requires its complete configuration to be passed in a fetchmailrc file or on the command line. This is rather annoying, and personally I'd like to have different modules that specify where to download email from. In my case I'd be particularly interested in getting this configuration from a MySQL database. Another almost obvious option would be LDAP.
  • multithreaded downloads:
    Downloading mail in a single stream is potentially inefficient, and is definitely unfriendly to other users. This is particularly true in the environments in which I mostly work, where a single server retrieves mail from an ISP for an entire domain and re-injects it into a local mail system. Consider the case where one user receives a 10MB email from an international POP3 server after your cap has been reached. This single 10MB email can easily take an hour to download in such conditions (it's fscked I know, not much I can do about it), and everybody else's mail waits behind it. It would be better if the system could fire up multiple mail retrieval sessions in parallel, so that it would take at least a few large emails to block up the entire process.
  • configurable domain/host names:
    fetchmail does not allow for specifying what the From: header should contain in the case of bounces. It just assumes that it can use the fqdn of the host. This, however, is simply not true. Take my setup for example: my local server is called xacatecas.lan. If it tries to send a bounce using return path <> (in accordance with the spec) and a From: header containing "Mailer Daemon" <mailer-daemon@xacatecas.lan>, then that bounce is in all probability going to get bounced itself, simply because xacatecas.lan isn't publicly resolvable on the internet. Instead it should be using mailer-daemon@kroon.co.za as the From: address for bounces (I've had to work around this braindeadness using exim rewrite rules).
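
The pluggability and "headers first, body later" requirements from the list above could fit together roughly as in the sketch below. None of this exists yet: the class names (Retriever, Delivery, InMemoryRetriever, RejectingDelivery) and the transfer function are hypothetical, and the in-memory classes merely stand in for real POP3/IMAP and SMTP plugins so that the example runs without a network.

```python
from abc import ABC, abstractmethod


class Retriever(ABC):
    """Abstract mail source; POP3 and IMAP would be two implementations."""

    @abstractmethod
    def message_ids(self):
        """Return identifiers of the messages waiting on the server."""

    @abstractmethod
    def fetch_headers(self, msg_id):
        """Return only the headers (cheap, e.g. POP3 TOP where supported)."""

    @abstractmethod
    def fetch_body(self, msg_id):
        """Return the message body; this is the expensive transfer."""


class Delivery(ABC):
    """Abstract destination; SMTP, Maildir, mbox etc. would be plugins."""

    @abstractmethod
    def accepts(self, headers):
        """Cheap pre-check, e.g. the SMTP envelope phase before DATA."""

    @abstractmethod
    def deliver(self, headers, body):
        """Hand the complete message over to the destination."""


class InMemoryRetriever(Retriever):
    """Stand-in source that counts how many body bytes were 'downloaded'."""

    def __init__(self, messages):
        self._messages = messages  # msg_id -> (headers, body)
        self.body_bytes_fetched = 0

    def message_ids(self):
        return list(self._messages)

    def fetch_headers(self, msg_id):
        return self._messages[msg_id][0]

    def fetch_body(self, msg_id):
        body = self._messages[msg_id][1]
        self.body_bytes_fetched += len(body)
        return body


class RejectingDelivery(Delivery):
    """Stand-in destination that refuses mail for one recipient."""

    def __init__(self, rejected_recipient):
        self.rejected_recipient = rejected_recipient
        self.delivered = []

    def accepts(self, headers):
        return self.rejected_recipient not in headers

    def deliver(self, headers, body):
        self.delivered.append((headers, body))


def transfer(retriever, delivery):
    """Fetch headers first and download a body only if it will be accepted."""
    for msg_id in retriever.message_ids():
        headers = retriever.fetch_headers(msg_id)
        if not delivery.accepts(headers):
            continue  # rejected before the data phase: body never downloaded
        delivery.deliver(headers, retriever.fetch_body(msg_id))


source = InMemoryRetriever({
    1: ("To: jaco@kroon.co.za\n", b"x" * 100),
    2: ("To: unknown@kroon.co.za\n", b"y" * 5_000_000),
})
sink = RejectingDelivery("unknown@kroon.co.za")
transfer(source, sink)
print(source.body_bytes_fetched)
```

Only the 100 byte body of the accepted message is fetched; the 5MB body of the rejected one never crosses the wire. A real SMTP plugin would presumably do the accepts() check by running MAIL FROM and RCPT TO before deciding whether to download the body.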

Further ideas

The above doesn't address the biggest issue I've experienced: temporary mail delivery failures after the body of the message has been transmitted. In this case fetchmail just discards the message and re-downloads it at a later stage, as explained earlier. It would be much better if the system had its own little queue where it could temporarily store emails, so that delivery can be retried later without having to download the message again. This "queue" could simply be treated as a type of multi-drop, and injection into this queue could be treated as a type of delivery, keeping to the abstractions of retrieval and delivery. Any message that is stuck in this store for an extended period of time should be bounced.
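
A minimal sketch of such a queue follows, assuming delivery reports an SMTP-style status code. The RetryQueue class and its method names are made up for illustration, and it keeps everything in memory for brevity; a real one would have to persist messages to disk to survive a restart.

```python
import time


class RetryQueue:
    """Holds already-downloaded messages whose delivery failed temporarily."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        self._items = {}  # msg_id -> (first_queued_at, message)

    def add(self, msg_id, message, now=None):
        now = time.time() if now is None else now
        # Keep the original timestamp if the message is already queued.
        self._items.setdefault(msg_id, (now, message))

    def retry(self, deliver, bounce, now=None):
        """Retry every queued message without downloading anything again."""
        now = time.time() if now is None else now
        for msg_id, (queued_at, message) in list(self._items.items()):
            if now - queued_at > self.max_age:
                bounce(message)  # stuck for too long: give up
                del self._items[msg_id]
                continue
            status = deliver(message)
            if 200 <= status < 300:
                del self._items[msg_id]  # delivered
            elif 500 <= status < 600:
                bounce(message)  # permanent failure
                del self._items[msg_id]
            # 4xx: leave the message queued for the next run

    def __len__(self):
        return len(self._items)


bounced = []
queue = RetryQueue(max_age_seconds=5 * 24 * 3600)
queue.add("msg-1", b"4MB of mail", now=0)

queue.retry(lambda m: 451, bounced.append, now=300)  # clamav still broken
still_queued = len(queue)
queue.retry(lambda m: 250, bounced.append, now=600)  # delivery succeeds
print(still_queued, len(queue), bounced)
```

The message is downloaded once, survives the 4xx failure, and leaves the queue when delivery finally succeeds.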

Alternative logging schemes. Personally I'd like to be able to report maildrop-specific errors back via some kind of user-accessible web interface. E.g., if retrieval from mail.kroon.co.za fails when trying to download mail for jaco@kroon.co.za and the error is "Authentication failed", then I'd like to store that in my database so that I can give that feedback to the user. Currently I need to jump through a hundred and one hoops in order to achieve this, including some pretty insane parsing of the output generated on stderr.
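
As a rough sketch of what such a logging plugin might look like, the example below records per-maildrop errors in a database table. It uses SQLite from the Python standard library purely so that it runs anywhere; my own setup would use MySQL. The table name, column names and function names are all invented for the example.

```python
import sqlite3
from datetime import datetime, timezone


def open_error_log(path=":memory:"):
    """Open (and if necessary create) the hypothetical error table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS retrieval_errors (
               logged_at TEXT NOT NULL,
               server    TEXT NOT NULL,
               account   TEXT NOT NULL,
               error     TEXT NOT NULL
           )"""
    )
    return db


def log_retrieval_error(db, server, account, error):
    """Record one failed retrieval so it can be shown to the user later."""
    db.execute(
        "INSERT INTO retrieval_errors VALUES (?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), server, account, error),
    )
    db.commit()


def errors_for(db, account):
    """Return (server, error) pairs for one account, oldest first."""
    rows = db.execute(
        "SELECT server, error FROM retrieval_errors"
        " WHERE account = ? ORDER BY logged_at",
        (account,),
    )
    return rows.fetchall()


db = open_error_log()
log_retrieval_error(
    db, "mail.kroon.co.za", "jaco@kroon.co.za", "Authentication failed"
)
print(errors_for(db, "jaco@kroon.co.za"))
```

A web interface could then query the table directly instead of anything having to parse stderr.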