MailRetr
I've had a few run-ins with mail retrieval systems like fetchmail, and have
finally decided that it may be worth investigating a change of system. It may
be possible to modify fetchmail (or even getmail) to suit my requirements, or
it may not. For now this is all just concept and nothing concrete is in place
yet.
The problem
Most people who read this will know that bandwidth in South Africa is rather
expensive. Those who live here will know what I mean when I mention "the
CAP". It's something that the telecommunications providers here must enforce
because the monopolistic provider apparently doesn't have the infrastructure
to cope if everybody just used their bandwidth as required. Bottom line:
bandwidth is precious, and in some cases EXPENSIVE.
Now, it sometimes happens that there are bugs in software (the particular
case that sparked this round of thinking was clamav). In a mail system this can
result in temporary failures (4xx rejects). What happened was that fetchmail
would download a 4MB email and pass it to exim, exim would scan it using
clamav, clamav would fail with "Zip module failure", exim would issue a 4xx
temporary failure, and fetchmail would discard the message and repeat the
whole process 5 minutes later. Keep this going for 5 days and you end up using
quite a lot of bandwidth, not to mention keeping a pipe clogged up by
re-downloading the same data over and over again.
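To put a rough number on that, here is a back-of-the-envelope calculation. It
assumes exactly one retry every 5 minutes, around the clock, for the full 5
days, and it ignores protocol overhead, so treat the result as an estimate
rather than a measurement.

```python
# Rough estimate of the bandwidth wasted by re-downloading one message.
# Assumptions (not measurements): one retry every 5 minutes, non-stop,
# for 5 days; POP3/IMAP protocol overhead is ignored.
MESSAGE_MB = 4
RETRY_INTERVAL_MIN = 5
DAYS = 5

attempts = DAYS * 24 * 60 // RETRY_INTERVAL_MIN
wasted_mb = attempts * MESSAGE_MB

print(f"{attempts} downloads, roughly {wasted_mb} MB in total")
```

Under those assumptions a single stuck 4MB message accounts for 1440 downloads
and 5760 MB of traffic.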
Possible solution
I've been having a bit of a discussion with some people on the glug-chat
mailing list (GLUG == Gauteng Linux User Group), and a few ideas have been
thrown around, including the fact that there are many other possible bandwidth
savings in the whole fetchmail process. I'll attempt to discuss some of these
in detail; others are so obvious that I really can't be bothered to expand on
them.
Requirements
Firstly, there are a few things that I would absolutely require should
a system be written from scratch, and would really like to have seen in
fetchmail/getmail:
- Modularity/Pluggability:
Various aspects of the system can (and should) be pluggable, for example an
abstract mail retriever that can be implemented in various ways to
provide different mechanisms for retrieving email. Two such implementations
would typically be POP3 and IMAP.
- plain, SSL and STARTTLS:
These are possibly POP3/IMAP specific, but the system should at least be able
to use plaintext connections (like those provided by just about all ISPs,
Google being the only exception afaik) and straight SSL off the bat, also
referred to as pop3s and imaps (they are to pop and imap as https is to http).
A newer mechanism that is also seen a lot now is STARTTLS, where the connection
starts in plaintext and then gets "upgraded" to TLS/SSL at a later stage.
- POP3 + IMAP:
I personally don't have a need atm for ODMR, ETRN or UUCP and would thus
currently only require implementation of POP3 and IMAP.
- Delivery methods:
I would like to see different delivery methods, including direct delivery
into Maildir, mbox, IMAP, SMTP and command delivery, for example invoking
/usr/sbin/sendmail, procmail or some other command. The only one I really
require atm is SMTP, but these methods should be pluggable in any case.
- Headers first, body later:
As far as I can tell fetchmail downloads the entire message (including headers)
before initiating communications with SMTP (although it keeps the SMTP
connection open once it opened it). This means that even if a 5MB message gets
rejected before we even get to the data phase of the SMTP connection, we still
downloaded the full 5MB when we could have gotten away with only the headers
of the email. Both POP3 and IMAP allow for retrieving only the headers of an
email.
- single-drop and multi-drop support:
Some ISPs provide us with a catch-all mailbox for a domain (including all the
collateral disadvantages that come with this). This is also known as a
multi-drop. As in fetchmail, this should be supported in any mail fetching
system.
- multiple configuration backends:
fetchmail requires its complete configuration to be passed in a fetchmailrc
file, or on the command line. This is rather annoying, and personally I'd like
to be able to have different modules that specify where to download email
from. In my case I'd be particularly interested in getting this configuration
from a MySQL database. Another almost obvious option would be LDAP.
- multithreaded downloads:
Downloading mail in a single stream is potentially inefficient, and is
definitely unfriendly to other users. This is particularly true in the
environments in which I'm working mostly, where a single server retrieves mail
from an ISP for an entire domain and re-injects into a local mail system.
Consider the case where one user receives a 10MB email from an international
POP3 server after your cap has been reached. This single 10MB email can easily
take an hour to download in such conditions (it's fscked I know, not much I can
do about it). It would be better if the system could fire up multiple mail
retrieval sessions in parallel, so that it would take at least a few large
emails to block up the entire process.
- configurable domain/host names:
fetchmail does not allow for specifying what the From: header should
contain in the case of bounces. It just assumes that it can use the fqdn of the
host. This, however, is simply not true. Take my setup for example: my local
server is called xacatecas.lan. If it tries to send a bounce using return path
<> (in accordance with the spec) and a From: header containing "Mailer
Daemon" <mailer-daemon@xacatecas.lan>, then that bounce is going to get
bounced in all probability, simply because xacatecas.lan isn't resolvable
publicly on the internet. Instead it should be using mailer-daemon@kroon.co.za
as the From: address for bounces (I've had to work around this braindeadness
using exim rewrite rules).
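The modularity and headers-first requirements in the list above could fit
together roughly as follows. This is only a sketch of the idea, not a design:
the names Retriever, Delivery, transfer, FakeRetriever and
KnownRecipientsDelivery are all made up for illustration, and the accepts()
check stands in for whatever the real delivery method can decide before the
body is needed (for SMTP, the conversation up to but excluding DATA). In a real
POP3 plugin fetch_headers() would typically map to the TOP command, and in an
IMAP plugin to a header-only FETCH. The address nobody@kroon.co.za is a
hypothetical unknown recipient.

```python
from __future__ import annotations

import email
from abc import ABC, abstractmethod


class Retriever(ABC):
    """Hypothetical retrieval plugin interface (POP3, IMAP, ...)."""

    @abstractmethod
    def list_messages(self) -> list[str]:
        """Return identifiers of the messages waiting on the server."""

    @abstractmethod
    def fetch_headers(self, msg_id: str) -> bytes:
        """Download only the headers of one message."""

    @abstractmethod
    def fetch_body(self, msg_id: str) -> bytes:
        """Download the (possibly large) body of one message."""


class Delivery(ABC):
    """Hypothetical delivery plugin interface (SMTP, Maildir, ...)."""

    @abstractmethod
    def accepts(self, headers: bytes) -> bool:
        """Cheap check made before the body is downloaded."""

    @abstractmethod
    def deliver(self, headers: bytes, body: bytes) -> None:
        """Hand the complete message over."""


def transfer(retriever: Retriever, delivery: Delivery) -> list[str]:
    """Fetch headers first; fetch a body only if delivery would accept it."""
    skipped = []
    for msg_id in retriever.list_messages():
        headers = retriever.fetch_headers(msg_id)
        if not delivery.accepts(headers):
            skipped.append(msg_id)  # the body is never downloaded
            continue
        delivery.deliver(headers, retriever.fetch_body(msg_id))
    return skipped


class FakeRetriever(Retriever):
    """In-memory stand-in for a POP3/IMAP plugin, for demonstration only."""

    def __init__(self, messages: dict[str, tuple[bytes, bytes]]):
        self.messages = messages  # msg_id -> (headers, body)
        self.bodies_fetched: list[str] = []

    def list_messages(self) -> list[str]:
        return list(self.messages)

    def fetch_headers(self, msg_id: str) -> bytes:
        return self.messages[msg_id][0]

    def fetch_body(self, msg_id: str) -> bytes:
        self.bodies_fetched.append(msg_id)
        return self.messages[msg_id][1]


class KnownRecipientsDelivery(Delivery):
    """Stand-in for SMTP delivery that rejects unknown recipients."""

    def __init__(self, known: set[str]):
        self.known = known
        self.delivered: list[bytes] = []

    def accepts(self, headers: bytes) -> bool:
        return email.message_from_bytes(headers)["To"] in self.known

    def deliver(self, headers: bytes, body: bytes) -> None:
        self.delivered.append(headers + body)


inbox = FakeRetriever({
    "1": (b"To: jaco@kroon.co.za\r\nSubject: hi\r\n\r\n", b"small body"),
    "2": (b"To: nobody@kroon.co.za\r\nSubject: big\r\n\r\n", b"x" * 1000),
})
smtp = KnownRecipientsDelivery({"jaco@kroon.co.za"})
skipped = transfer(inbox, smtp)
print("skipped:", skipped, "bodies fetched:", inbox.bodies_fetched)
```

Message "2" is rejected on its headers alone, so only the body of message "1"
is ever downloaded. A different configuration backend or delivery method would
slot in by implementing the same small interface.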
Further ideas
The above doesn't address the biggest issue I've experienced: temporary
mail delivery failures after transmitting the body of the message. In this
case fetchmail just discards the message and re-downloads it again at a later
stage, as explained earlier. It would be much better if the system had its own
little queue where it could temporarily store emails in order to retry
delivery later without having to re-download the message. This "queue" could
simply be treated as a type of multi-drop, and injection into this queue could
be treated as a type of delivery, keeping to the abstractions of retrieval and
delivery. Any message that is stuck in this store for an extended period of
time should be bounced.
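A minimal sketch of such a queue, assuming a plain directory on local disk is
good enough as the store. Everything here is hypothetical: the Spool class, the
TemporaryFailure exception, and the deliver and bounce callables that the
caller would supply. Message ids are assumed to be safe to use as file names.
store() plays the role of "delivery into the queue", and retry_all() plays the
role of "retrieval from the queue".

```python
import tempfile
import time
from pathlib import Path


class TemporaryFailure(Exception):
    """Raised by a delivery callable on a 4xx-style temporary failure."""


class Spool:
    """Directory-backed queue for messages awaiting another delivery attempt."""

    def __init__(self, directory, max_age_seconds):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_age = max_age_seconds

    def store(self, msg_id, raw):
        # Write to a temporary name first so a crash never leaves a
        # half-written file that looks like a queued message.
        tmp = self.directory / (msg_id + ".tmp")
        tmp.write_bytes(raw)
        tmp.replace(self.directory / (msg_id + ".msg"))

    def retry_all(self, deliver, bounce, now=None):
        now = time.time() if now is None else now
        for path in sorted(self.directory.glob("*.msg")):
            raw = path.read_bytes()
            try:
                deliver(raw)
            except TemporaryFailure:
                # Keep the message for the next run unless it is too old.
                if now - path.stat().st_mtime > self.max_age:
                    bounce(raw)
                    path.unlink()
                continue
            path.unlink()  # delivered, so drop it from the queue


with tempfile.TemporaryDirectory() as tmpdir:
    spool = Spool(tmpdir, max_age_seconds=5 * 24 * 3600)
    spool.store("msg-0001", b"Subject: test\r\n\r\nbody")

    def always_4xx(raw):
        raise TemporaryFailure("4xx: Zip module failure")

    bounced = []
    spool.retry_all(always_4xx, bounced.append)
    queued_after_retry = len(list(Path(tmpdir).glob("*.msg")))

    # Pretend six days have passed, which exceeds the five-day limit.
    spool.retry_all(always_4xx, bounced.append, now=time.time() + 6 * 24 * 3600)
    queued_after_expiry = len(list(Path(tmpdir).glob("*.msg")))

print(queued_after_retry, queued_after_expiry, len(bounced))
```

The point of the sketch is that a 4xx from the local MTA costs a read from
local disk on the next attempt rather than another download from the ISP.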
Alternative logging schemes. Personally I'd like to be able to report
maildrop-specific errors back via some kind of user-accessible web interface.
E.g., if retrieval from mail.kroon.co.za fails when trying to download mail for
jaco@kroon.co.za and the error is "Authentication failed", then I'd like to
store that in my database so that I can give that feedback to the user.
Currently I need to jump through a hundred and one hoops in order to achieve
this, including some pretty insane parsing of the output generated to stderr.
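As a sketch of what structured error reporting could look like: instead of
writing free text to stderr, the retriever would call a logging plugin with the
server, the account and the error as separate fields. The example below uses
SQLite purely so that it runs on its own; I'd be using MySQL in practice. The
table name maildrop_errors, its columns, and the helper functions are all
invented for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_error_log(path=":memory:"):
    """Open (and if needed create) the hypothetical error table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS maildrop_errors (
               server      TEXT NOT NULL,
               account     TEXT NOT NULL,
               error       TEXT NOT NULL,
               occurred_at TEXT NOT NULL
           )"""
    )
    return conn


def record_error(conn, server, account, error):
    """Store one retrieval error as structured fields."""
    conn.execute(
        "INSERT INTO maildrop_errors VALUES (?, ?, ?, ?)",
        (server, account, error, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def errors_for(conn, account):
    """Return (server, error) pairs for one account, oldest first."""
    rows = conn.execute(
        "SELECT server, error FROM maildrop_errors"
        " WHERE account = ? ORDER BY occurred_at",
        (account,),
    )
    return rows.fetchall()


conn = open_error_log()
record_error(conn, "mail.kroon.co.za", "jaco@kroon.co.za", "Authentication failed")
user_errors = errors_for(conn, "jaco@kroon.co.za")
print(user_errors)
```

A web interface would then only need a query like errors_for() to show a user
what went wrong with their maildrop, with no stderr parsing involved.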