Discussion:
Mail stuck in active queue for loooong time
Lars Oberg
2007-09-20 23:43:02 UTC
Hi,

Yesterday, emails started to get stuck in the active
queue for a long time (20 - 160 minutes) and I have
not been able to figure out why. I am new to postfix
so I am probably missing something obvious.

Postfix 2.4.5 on CentOS 4.5. Dual PIII 1.3 GHz, 2 GB
RAM, 10K RPM SCSI, RAID.

$ qshape active
                         T   5  10  20  40  80 160 320 640 1280 1280+
                 TOTAL 428  17  29 120  78 165  18   1   0    0     0
             samys.com 409  14  27 117  77 158  15   1   0    0     0
      mail02.samys.com   7   3   0   1   0   2   1   0   0    0     0
             yahoo.com   2   0   0   1   0   1   0   0   0    0     0
       dotlinecorp.com   2   0   0   0   0   2   0   0   0    0     0
               aol.com   1   0   1   0   0   0   0   0   0    0     0
              ucla.edu   1   0   0   0   0   1   0   0   0    0     0
          phaseone.com   1   0   0   0   0   0   1   0   0    0     0
          microage.com   1   0   1   0   0   0   0   0   0    0     0
         orangeusd.org   1   0   0   0   0   0   1   0   0    0     0
       countrywide.com   1   0   0   1   0   0   0   0   0    0     0
       ingrammicro.com   1   0   0   0   0   1   0   0   0    0     0
 canyondesigngroup.com   1   0   0   0   1   0   0   0   0    0     0

samys.com is our domain. A few weeks ago we switched
from an old server running sendmail to a new server
running postfix. All was well until yesterday when
this problem started.

Example of problem e-mail:
Sep 20 14:43:36 mail02 postfix/smtpd[30175]: C7BE5C8006: client=web30111.mail.mud.yahoo.com[209.191.69.43]
Sep 20 14:43:37 mail02 postfix/cleanup[30180]: C7BE5C8006: message-id=<***@web30111.mail.mud.yahoo.com>
Sep 20 14:43:37 mail02 postfix/qmgr[30049]: qmgr_active_feed: incoming/C7BE5C8006
Sep 20 14:43:37 mail02 postfix/qmgr[30049]: qmgr_message_alloc: active C7BE5C8006
Sep 20 14:43:37 mail02 postfix/qmgr[30049]: C7BE5C8006: recipient limit 5000
Sep 20 14:43:37 mail02 postfix/qmgr[30049]: C7BE5C8006: from=<***@yahoo.com>, size=1458, nrcpt=1 (queue active)
Sep 20 15:22:32 mail02 postfix/qmgr[5857]: qmgr_move: moved C7BE5C8006 from active to incoming
Sep 20 15:22:32 mail02 postfix/qmgr[5857]: qmgr_active_feed: incoming/C7BE5C8006
Sep 20 15:22:32 mail02 postfix/qmgr[5857]: qmgr_message_alloc: active C7BE5C8006
Sep 20 15:22:32 mail02 postfix/qmgr[5857]: C7BE5C8006: recipient limit 5000
Sep 20 15:22:32 mail02 postfix/qmgr[5857]: C7BE5C8006: from=<***@yahoo.com>, size=1458, nrcpt=1 (queue active)
Sep 20 15:31:20 mail02 postfix/qmgr[5857]: qmgr_peer_select: C7BE5C8006 amavisfeed [127.0.0.1]:10024 (2 of 7)
Sep 20 15:31:20 mail02 postfix/qmgr[5857]: qmgr_job_retire: C7BE5C8006
Sep 20 15:31:20 mail02 postfix/qmgr[5857]: send attr queue_id = C7BE5C8006
Sep 20 15:31:36 mail02 postfix/lmtp[7017]: C7BE5C8006: to=<***@samys.com>, relay=127.0.0.1[127.0.0.1]:10024, conn_use=14, delay=2880, delays=2336/528/0/16, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as DDD8EC7FC4)
Sep 20 15:31:36 mail02 postfix/qmgr[5857]: qmgr_active_done: C7BE5C8006
Sep 20 15:31:36 mail02 postfix/qmgr[5857]: C7BE5C8006: removed
Sep 20 15:31:36 mail02 postfix/qmgr[5857]: qmgr_job_free: C7BE5C8006 amavisfeed

Any help with this would be greatly appreciated!

Thanks,
Lars
Victor Duchovni
2007-09-21 00:01:16 UTC
Post by Lars Oberg
Hi,
Yesterday, emails started to get stuck in the active
queue for a long time (20 - 160 minutes) and I have
not been able to figure out why. I am new to postfix
so I am probably missing something obvious.
Broken content_filter?
Post by Lars Oberg
TOTAL 428 17 29 120 78 165 18 1
samys.com 409 14 27 117 77 158 15 1
samys.com is our domain.
qmgr_active_feed: incoming/C7BE5C8006
Woa, why is queue manager doing verbose logging? If your syslog is doing
synchronous writes, you are dead. Who turned on "-v" for master.cf in qmgr?
Or did you start postfix via "postfix -v start"?
Who restarted the queue manager? You won't get very far if the queue
manager is constantly restarting.
Post by Lars Oberg
moved C7BE5C8006 from active to incoming
relay=127.0.0.1[127.0.0.1]:10024, conn_use=14,
delay=2880, delays=2336/528/0/16, dsn=2.0.0,
status=sent (250 2.0.0 Ok: queued as DDD8EC7FC4)
The delivery latency via amavis is 16 seconds, fix this. Do so after
disabling all the overly verbose logging, and stop running "postfix reload"
except in emergencies.
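
For readers following along: the 16 seconds comes from the last field of delays=2336/528/0/16. Postfix documents the four fields as time before the queue manager, time in the queue manager, connection setup time, and message transmission time. A minimal Python sketch that splits the field out of the log line quoted above (the function and stage names are only illustrative, not part of Postfix):

```python
import re

# Meaning of the delays=a/b/c/d fields in Postfix delivery log lines.
STAGES = ("before_qmgr", "in_qmgr", "conn_setup", "transmission")

def parse_delays(log_line):
    """Return a dict mapping each stage name to its delay in seconds."""
    match = re.search(r"delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)", log_line)
    if match is None:
        raise ValueError("no delays= field found")
    return dict(zip(STAGES, (float(x) for x in match.groups())))

line = ("C7BE5C8006: to=<***@samys.com>, relay=127.0.0.1[127.0.0.1]:10024, "
        "conn_use=14, delay=2880, delays=2336/528/0/16, dsn=2.0.0, status=sent")
delays = parse_delays(line)
for stage, seconds in delays.items():
    print(f"{stage:>13}: {seconds:6.0f} s")
```

The four values add up to the total delay=2880, so almost all of the time was spent waiting before and inside the queue manager, with the filter itself taking 16 seconds per message.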
--
Viktor.

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the "Reply-To" header.

To unsubscribe from the postfix-users list, visit
http://www.postfix.org/lists.html or click the link below:
<mailto:***@postfix.org?body=unsubscribe%20postfix-users>

If my response solves your problem, the best way to thank me is to not
send an "it worked, thanks" follow-up. If you must respond, please put
"It worked, thanks" in the "Subject" so I can delete these quickly.
Victor Duchovni
2007-09-21 01:21:52 UTC
Post by Victor Duchovni
Woa, why is queue manager doing verbose logging? If
your syslog is doing
synchronous writes, you are dead. Who turned on "-v"
for master.cf in qmgr?
I did in order to debug this problem. How do I know
if syslog is doing synchronous writes?
You almost never need to enable verbose logging in the queue manager,
it is never the cause of congestion, rather the cause is always external
and you need to debug input and output processes, not the scheduler.

On Linux systems, log files whose names are not prefixed with a "-" in
syslog.conf are written synchronously; prefix the mail log file with a
"-" if needed. On Solaris, syslogd is always synchronous IIRC, ...
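
A rough way to check this, sketched in Python. The sample entries below are made up for illustration and are not taken from the poster's system; the helper names are hypothetical, and the parsing only covers the simple "selector action" form of syslog.conf:

```python
def selects_mail(selector):
    """True if any facility.priority part selects the mail facility."""
    for part in selector.split(";"):
        facilities, _, priority = part.partition(".")
        if "mail" in facilities.split(",") and priority != "none":
            return True
    return False

def sync_mail_targets(conf_text):
    """Return mail log files that lack the "-" prefix (synchronous writes)."""
    targets = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        selector, action = parts
        # "/path" is written synchronously, "-/path" is not.
        if selects_mail(selector) and action.startswith("/"):
            targets.append(action)
    return targets

sample = """\
# sample entries for illustration only
*.info;mail.none;authpriv.none    /var/log/messages
mail.*                            /var/log/maillog
"""
print(sync_mail_targets(sample))
```

An empty list would mean every mail log target already carries the "-" prefix.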
Post by Victor Duchovni
Who restarted the queue manager? You won't get very
I restarted postfix a few times after changing
settings in main.cf / master.cf. Again in order to
debug. What is the best way to make changes take
effect?
When you have a large active queue, restarting the queue manager just
adds more load. Most postfix processes don't live forever, so most
changes don't require a "reload".
Post by Victor Duchovni
The delivery latency via amavis is 16 seconds, fix
this.
Spot on! I disabled amavis by commenting out
"content_filter=amavisfeed:[127.0.0.1]:10024". Now
emails go through without any delay.
You may have DNS checks enabled in amavis, possibly via SpamAssassin
doing RBL lookups, ... If you don't have a local DNS cache, get one,
but likely you can disable the remote features for now, and still get
decent filtering.
Of course, now I am wide open to spam & viruses. Any
pointers on how to debug the sudden slow-down when
using content_filter (amavis)?
Ask on the Amavis list...
--
Viktor.
Victor Duchovni
2007-09-21 01:43:10 UTC
Post by Victor Duchovni
Post by Victor Duchovni
Woa, why is queue manager doing verbose logging? If
your syslog is doing
synchronous writes, you are dead. Who turned on "-v"
for master.cf in qmgr?
I did in order to debug this problem. How do I know
if syslog is doing synchronous writes?
You almost never need to enable verbose logging in the queue manager,
it is never the cause of congestion, rather the cause is always external
and you need to debug input and output processes, not the scheduler.
Twice within 24 hours, people freaked out because POSTFIX DID NOT
IMMEDIATELY TRY TO DELIVER ALL THE MAIL after they fixed a problem
and restarted Postfix.
Instead, Postfix spreads out attempts to deliver delayed mail so
that it won't interfere with mail that arrives after the problem
was fixed. The idea is that consistent performance means avoiding
unnecessary peaks. It seems to have been lost on some people.
This said, once problems *are* fixed, while delivery won't begin
simultaneously for every message, flushing the deferred queue may
be reasonable, provided conditions have changed.

"Never flush a blocked toilet, the mess only gets worse, first
clear the drain, then flush."

Frequent reloads during periods of congestion are counter-productive.
--
Viktor.
Ralf Hildebrandt
2007-09-21 07:24:30 UTC
Post by Victor Duchovni
"Never flush a blocked toilet, the mess only gets worse, first
clear the drain, then flush."
Yes, but that also means:

Full toilet
Flush
Empty toilet

but with Postfix they're seeing:

Full toilet
Flush
slowly emptying toilet

which is of course the right thing, but the expectation is an
immediate emptying of the queue.
--
Ralf Hildebrandt (***@charite.de) ***@charite.de
Postfix - Einrichtung, Betrieb und Wartung Tel. +49 (0)30-450 570-155
http://www.arschkrebs.de
"What is the point of a sig?" - "Well, I can't tell you that right
now. But a good question. May I use it as a signature?"
(Joachim Jäger/Nils Ketelsen in de.newusers.questions)
Victor Duchovni
2007-09-21 12:19:06 UTC
Post by Ralf Hildebrandt
Full toilet
Flush
slowly emptying toilet
No. When the user requests "flush" then Postfix will attempt
to deliver ALL mail immediately.
It will all enter the active queue (if the deferred queue is not too big),
but concurrency limits mean that draining will still be progressive, though
rapid under ideal conditions. My RAMdisk test rig drained 2400 msgs/sec
after a flush (SMTP via a GigE interface to a host 1 router away).
--
Viktor.
Ralf Hildebrandt
2007-09-21 07:25:35 UTC
Ok, I have now turned off RBL lookups
("skip_rbl_checks 1" in SA/local.cf) and done some
more tests, but I do not see a difference in
performance
That would mean that other, local checks are slow. Are you using a
bayesdb? If yes, are you using a journal for that? How large is the
bayesdb?
--
Ralf Hildebrandt (***@charite.de) ***@charite.de
Postfix - Einrichtung, Betrieb und Wartung Tel. +49 (0)30-450 570-155
http://www.arschkrebs.de
Remember, all software sucks. Some sucks more, and some sucks less. But it
sucks regardless. If I want to see something elegant I go look for a piece
of art.
Ralf Hildebrandt
2007-09-21 07:22:45 UTC
I did in order to debug this problem. How do I know
if syslog is doing synchronous writes?
Look in /etc/syslog.conf.
Of course, now I am wide open to spam & viruses. Any
pointers on how to debug the sudden slow-down when
using content_filter (amavis)?
Increase the logging level of amavisd. At level 2 it prints out timing
info.
--
Ralf Hildebrandt (***@charite.de) ***@charite.de
Postfix - Einrichtung, Betrieb und Wartung Tel. +49 (0)30-450 570-155
http://www.arschkrebs.de
I have never left my schooling interfere with my education. - Mark Twain
Tony Earnshaw
2007-09-21 04:32:10 UTC
Lars Oberg wrote, on 21-09-2007 06:16:

[...]
So it still takes about 12-15 seconds in amavis for
content checking even under very low load (and
external lookups disabled). Do you think that this
would be enough to cause 500-600 emails get stuck in
the active queue and take 1-2 hours to get processed?
No one's suggested looking at the amavisd log yet. My site's amavisd-new
blew up on me on an upgrade from RHAS4 to RHEL5: it didn't like the
Berkeley DB version, and the log file, normally a reasonable size, had
grown to 11GB.
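
On the question of whether 12-15 seconds per message could explain a backlog lasting 1-2 hours, some back-of-the-envelope arithmetic helps. The sketch below is only an estimate: the number of concurrent filter processes is not stated anywhere in this thread, so the concurrency values are assumptions, and new mail arriving while the backlog drains is ignored.

```python
def drain_minutes(queued, seconds_per_message, concurrency):
    """Minutes to push `queued` messages through a filter with fixed latency."""
    total_seconds = queued * seconds_per_message / concurrency
    return total_seconds / 60

# 600 queued messages at 15 s each; the concurrency values are assumptions.
for concurrency in (1, 2, 5, 10):
    minutes = drain_minutes(600, 15, concurrency)
    print(f"concurrency {concurrency:2d}: {minutes:5.1f} minutes")
```

With only one or two filter processes the estimate lands in the one-to-two-hour range described, so the per-message latency alone could plausibly account for the backlog.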

[...]

--Tonni
--
Tony Earnshaw
Email: tonni at hetnet dot nl
Victor Duchovni
2007-09-21 16:58:49 UTC
One question: For the purpose of speeding up e-mail
(with content filtering), I assume a caching DNS
server would be preferred over a forwarding one. Do
you agree?
A forwarding name server is a caching server, it just forwards cache
misses to a fixed set of upstreams, instead of reaching authoritative
servers directly. Whether it is better to use a forwarder or not
depends on whether:

- The forwarder is likely to have useful data in its cache
given the type of queries being sent.

- The network proximity of the forwarder.

- The load on the forwarder.

If you are running a large cluster of machines, using a small set of
forwarders is useful, because many machines in the cluster will be
making the same queries. Forwarding MX lookups to an ISP is likely OK
with many ISPs. The cache hit rate on RBL lookups via a 3rd
party forwarder is much less likely to be a good one.
--
Viktor.
Victor Duchovni
2007-09-21 18:13:45 UTC
If I understand correctly, a "caching only" DNS server
will do recursive lookups until it has the final
answer to hand back to the mail server, while a
"forwarding" DNS server will only do the first
look-up and hand back the result to the mail server's
resolver who then has to do further lookups as needed.
No. Both present the final answer to the resolver. A forwarding nameserver
forwards cache misses to another nameserver that does all the hard work,
while serving answers from its cache directly.
Since both types of servers do caching, maybe it does
not matter much which one I pick, since the RBLs will
be cached and do not change too often.
Actually RBLs are quite volatile, and have relatively low TTLs (to allow
removed entries to age out quickly). Also the cache hit rate for RBLs
tends to be poor (lots of hosts in a bot net, each makes only a few
connections and is not heard from for a while...)

Avoid forwarding queries to upstreams that ignore TTLs (some ISPs).
--
Viktor.
Victor Duchovni
2007-09-21 18:32:13 UTC
Post by Victor Duchovni
Actually RBLs are quite volatile, and have relatively low TTLs (to allow
removed entries to age out quickly). Also the cache hit rate for RBLs
tends to be poor (lots of hosts in a bot net, each makes only a few
connections and is not heard from for a while...)
So with poor cache hit rate for RBLs, wouldn't a
caching DNS slow things down (by cache-miss and then
having to re-look-up very frequently)?
You have to have a caching name server; resolvers never do all the work.
It can be close, or far away. On your machine, a single botnet client
will make a few connections and caching will be useful, even if
globally the cache hit rate is poor.
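
The effect can be illustrated with a toy TTL cache in Python. This is a simplified model, not how any particular name server is implemented; the RBL zone name, the TTL of 300 seconds and the connection times are all made-up values.

```python
class TTLCache:
    """Tiny DNS-style cache: answers expire `ttl` seconds after being stored."""

    def __init__(self):
        self._entries = {}  # name -> (answer, expiry time)
        self.hits = 0
        self.misses = 0

    def lookup(self, name, now, resolve, ttl):
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Cache miss or expired entry: ask upstream and remember the answer.
        self.misses += 1
        answer = resolve(name)
        self._entries[name] = (answer, now + ttl)
        return answer

def fake_rbl(name):
    # Stand-in for a real RBL query; always reports "listed".
    return "127.0.0.2"

cache = TTLCache()
query = "2.2.0.192.rbl.example"  # hypothetical reversed-IP RBL name
# The same client connects at t=0, 5 and 10 s, then again after the TTL ran out.
for now in (0, 5, 10, 400):
    cache.lookup(query, now, fake_rbl, ttl=300)
print(cache.hits, cache.misses)
```

The burst of connections from one client is served from the cache after the first lookup, while the late connection has to go upstream again because the short TTL has expired.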
--
Viktor.
Ralf Hildebrandt
2007-09-21 18:14:55 UTC
The mail queues have been healthy all morning, but now
we all of a sudden got over 3,000 e-mail in the active
queue! Looking at mailq, it appears we are being
203FAC814A* 4479 Fri Sep 21 10:43:16
Are these valid recipient addresses?
Is there a way to make postfix immediately bounce
emails like the ones above (where user does not exist
on our domain) without sending it through content
filtering, etc.?
Yes, it's the default.
Please show "postconf -n" output.
--
Ralf Hildebrandt (***@charite.de) ***@charite.de
Postfix - Einrichtung, Betrieb und Wartung Tel. +49 (0)30-450 570-155
http://www.arschkrebs.de
I'm locked in a maze of little projects, all of which suck.
Victor Duchovni
2007-09-21 18:16:24 UTC
You need recipient validation! Now!!!
Is there a way to make postfix immediately bounce
emails like the ones above (where user does not exist
on our domain) without sending it through content
filtering, etc.? It would alleviate the load on the
server dramatically in cases like this.
Not "bounce", "reject".

http://www.postfix.org/LOCAL_RECIPIENT_README.html
http://www.postfix.org/ADDRESS_CLASS_README.html
http://www.postfix.org/postconf.5.html#local_recipient_maps
http://www.postfix.org/postconf.5.html#relay_recipient_maps

Avoid catch-all aliases and wild-card (domain to domain) canonical or
virtual mappings.
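
The difference between "reject" and "bounce" can be shown with a conceptual Python sketch. It is not Postfix's implementation, and the address list is hypothetical; in Postfix the equivalent check is driven by the recipient map parameters linked above. The point is that an unknown recipient is refused during the SMTP dialogue, so the message never enters the queue or the content filter.

```python
VALID_RECIPIENTS = {"alice@example.com", "bob@example.com"}  # hypothetical list

def rcpt_reply(address, valid=VALID_RECIPIENTS):
    """Return the SMTP reply a server could give to RCPT TO:<address>."""
    if address.lower() in valid:
        return "250 2.1.5 Ok"
    # Refused before DATA: nothing is queued, filtered, or bounced later.
    return f"550 5.1.1 <{address}>: Recipient address rejected: User unknown"

for rcpt in ("alice@example.com", "nosuchuser@example.com"):
    print(rcpt_reply(rcpt))
```

With a rejection, the sending server is responsible for notifying its own user, so the receiving server does not generate bounce messages to possibly forged sender addresses.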
--
Viktor.