It is *amazing* how much more relaxed I feel now our mail server is working again!
- Monday
- Network has huge problems at weekend
- Mail server's actual mail delivery (exim) and imap location (dovecot) are on an external disk, mounted via iscsi
- External disk ends up mounted read-only so mail hangs horribly
- Reboot machine
- Machine is a virtual machine - normally these come up pretty quickly, but in this case there's a big timeout related to the iscsi disk and it takes ages
- We think it's hung so we go look at the virtual hosts
- The actual main virtual machine maintainer is busy with other things, so we try muddle through
- It doesn't seem to be running on any of the virtual hosts, so we pick one and start it
- All seems well briefly, then things start going horribly wrong
- Turns out we ran two copies of it at the same time and that has trashed /var - time to reinstall
- Reinstall base system using kickstart, and then I try get the actual mail servers running
- We have a copy of /etc from the old machine, so I rsync this across to the new one
- Start to get services back up, then reboot, machine reboots in emergency mode
- The network falling down again for a couple of hours in the middle of this did *not* help
- Give up and go home
- Tuesday
- Log in at emergency boot screen and look through the journal
- I've definitely upset it with that rsync - now it can't mount half its file systems at all
- Unpicking whatever damage I've just done is going to be difficult
- Start over and reinstall base system again
- Decide to mount disc over NFS this time rather than iscsi
- Mount disc, set up servers (yesterday's notes really help)
- Dovecot seems to be working OK now!
- Fix some symlinks so mail will actually deliver (oops I'd caused lots of bounces)
- Realise disc is mounted NFS 4
- Change it to mount the disk NFS 3 - it's all nearly sorted, but there's something weird going on, try a reboot
- Machine doesn't come back up. Oh.
- John helps get it partially up with some fiddling of boot options and chroot (this is magic to me)
- Fix fstab to not try mount the disk until the network is up - reboot now succeeds
- Both exim and dovecot now start up happily, mail is being delivered and people can log in
- Trouble is once you've logged in mail clients stall "waiting for server" if you try view a folder
- Network maintenance is due at 5pm so shut down and go home
- Wednesday
- Boot machine up, wonder why dovecot didn't start, til I remember we disabled it on purpose
- Fiddle with dovecot options, fiddle with NFS options
- Turn on debug logging, but it doesn't help much
- Bizarrely I can *sometimes* view folders and messages, but even then it's very very very slow
- We narrow it down to almost certainly being a locking issue
- Resort to strace - which reveals it's stalling on fcntl() calls
- Tell dovecot to use flock and mount the disk nolock - suddenly everything is working again!
- Hurray!
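
For anyone chasing a similar stall, here is a rough sketch of a probe that times the two kinds of lock on a file, so you can compare them on the NFS mount without reaching for strace. It is not something we ran at the time; the function name `time_lock` and the 5 second timeout are made up for illustration. In Python, `fcntl.lockf` goes through the fcntl() locking calls and `fcntl.flock` goes through flock().

```python
import fcntl
import os
import threading
import time


def time_lock(path, method, timeout=5.0):
    """Take and release an exclusive lock on path using "fcntl" or "flock".

    Returns the seconds it took, or None if the attempt had not finished
    within `timeout` seconds (i.e. it looks stalled).
    """
    if method not in ("fcntl", "flock"):
        raise ValueError("method must be 'fcntl' or 'flock'")
    # lockf() wraps fcntl()-style locks; flock() is the BSD-style call.
    lock = fcntl.lockf if method == "fcntl" else fcntl.flock
    result = {}

    def attempt():
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            start = time.monotonic()
            lock(fd, fcntl.LOCK_EX)   # this is the call that may hang
            lock(fd, fcntl.LOCK_UN)
            result["seconds"] = time.monotonic() - start
        finally:
            os.close(fd)

    # Run the attempt in a daemon thread so the caller gets an answer even
    # if the lock call never returns. A thread stuck inside the kernel may
    # still delay the process exiting.
    worker = threading.Thread(target=attempt, daemon=True)
    worker.start()
    worker.join(timeout)
    return result.get("seconds")
```

Calling `time_lock(path, "fcntl")` and then `time_lock(path, "flock")` on a scratch file in the mail store should show whether one kind of lock is hanging while the other returns straight away. A `None` result only means "no answer within the timeout", so treat it as a hint rather than a diagnosis.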
I still have a few bits of housekeeping to sort out. And this all needs to be properly documented and in some cases packaged, and added to kickstart post-install scripts. But it's good to be back. And boy have I learned a lot in 3.5 days!
no subject
Date: 2016-07-20 03:02 pm (UTC)
Definitely leave notes for future generations (especially if future generations might be you ;)
no subject
Date: 2016-07-20 03:22 pm (UTC)