sungate.co.uk

Ramblings about stuff

DNS, DHCP, Squid and Network syslog all on the same box …

Should have seen this coming I suppose, but today the server which (a) provides our DNS cache, (b) is our DHCP server, (c) is our web proxy server (Squid) and (d) is our network syslog host, fell over in a heap. The astute observers amongst you will appreciate that this is a Bad Thing.

To be honest, there were rumbles of unhappiness from this server a couple of weeks ago, so I was at least slightly prepared. In fact, I had prepared a failover server for the DNS and DHCP functionality. The box which failed today, let’s call it Box A, lists itself as the primary DNS server in the DHCP leases it hands out; the failover server, Box B, likewise lists itself as the primary DNS in its own leases. So, if Box A fails, Box B is still available to give out DHCP leases, and those leases have a working DNS server associated with them. This part of my redundant setup behaved as expected. However, there were a couple of annoyances associated with it:

  • Other servers on the network have Box A hardcoded as their primary DNS. And, as far as I can tell, even if you change their /etc/resolv.conf to list Box B first, that won’t take effect until all network apps, or possibly the network interface itself, are restarted. This is annoying, and it meant that once Box A went tits up, all these other servers on the network kept trying to reach the non-existent machine. Unless I took all these others offline (not really a viable option), they were going to be sluggish;
  • Those desktop users who were in early had got a DHCP lease from Box A – this meant that, when it went offline, they had a non-existent DNS server first in their configurations, resulting in sluggish behaviour.
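For the hardcoded servers, the fix amounts to pushing the dead resolver to the back of the nameserver list. A minimal Python sketch of that reordering follows; the function names and the 192.0.2.x addresses are made up for illustration, and it only works on text, so nothing here touches a real /etc/resolv.conf.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';' in resolv.conf).
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def demote(servers, dead):
    """Move every server listed in `dead` to the end, keeping the rest in order."""
    alive = [s for s in servers if s not in dead]
    down = [s for s in servers if s in dead]
    return alive + down


def render(servers):
    """Turn a list of addresses back into resolv.conf nameserver lines."""
    return "".join(f"nameserver {s}\n" for s in servers)


if __name__ == "__main__":
    # Hypothetical addresses: 192.0.2.1 stands in for Box A, 192.0.2.2 for Box B.
    sample = "nameserver 192.0.2.1\nnameserver 192.0.2.2\n"
    reordered = demote(parse_nameservers(sample), {"192.0.2.1"})
    print(render(reordered), end="")
```

Rewriting the file does not help processes that have already read it, which matches the behaviour described above. The `timeout:` and `attempts:` options documented in resolv.conf(5) may also be worth a look, since they limit how long a resolver waits on a dead first server.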

Well, I asked all desktop users to reboot their PCs if they were having difficulties. Prior to doing this, I also made sure that the web proxy (located on the defunct Box A) was bypassed. This helped.

When it came to troubleshooting Box A to actually find out what was wrong, it was just One Thing After Another. Bearing in mind that Box A is a fairly basic rack server with a single SCSI disk on a single SCSI controller, this is what happened:

  1. First, I ran a RAM check. I left this running for about two hours and there were no errors;
  2. Initial investigations suggested that either the SCSI controller or the SCSI disk might be faulty – there were kernel errors which pointed towards this possibility;
  3. I hoped that I might be able to keep the system ‘up’ for long enough to copy everything from the SCSI disk across to a spare IDE disk, then bring up an IDE-only server. This would certainly work if the SCSI setup was to blame;
  4. So, I connected a spare IDE disk and booted up with a Knoppix CD. I created the partitions on the IDE disk and began to format them. After kicking off one of the formats, I went back to my desk to read my email and, on returning, Box A had rebooted itself. Not Good;
  5. I tried this a couple of times, but the system kept hanging or rebooting;
  6. Eventually I thought “Bugger this …” and decided to cut my losses and perform a fresh OS installation on to the IDE disk. I took the SCSI controller and SCSI disk out of the box and began the install, my plan being to restore most of the system from backups afterwards;
  7. However, fate was against me here too. If my theory that the SCSI system was at fault was correct, then I should have had no problem with a non-SCSI setup. But It Was Not To Be. Kernel panic during installation;
  8. OK, so it’s not the SCSI system that’s at fault. I’m beginning to think that it’s probably the motherboard which is playing up, which is Bad News.

I’ve been troubleshooting for most of the day by this time and I’m getting fed up, so I change the IP of our development server so that it matches Box A’s and install ‘dnsmasq’ so that it can respond to all the DNS requests that are still coming in. That appears to work and, while far from perfect, will probably do for a few days until a replacement Box A can be obtained. The web proxy can wait: our connectivity is better now we have ADSL. And the network syslog is not exactly mission-critical, of course.
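A quick way to confirm that a stand-in like this really answers on the old address is to send it a single DNS query over UDP and see whether anything comes back. The sketch below does that with only the Python standard library; `build_query` and `answers_dns` are hypothetical helper names, and the query ID, hostname and timeout are arbitrary choices rather than anything from the setup described here.

```python
import socket
import struct


def build_query(hostname, query_id):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 (A record), QCLASS 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def answers_dns(server_ip, hostname="example.com", timeout=2.0, port=53):
    """Return True if the server sends back a DNS response to one query."""
    query_id = 0x1234  # arbitrary ID, used to match the reply to the query
    packet = build_query(hostname, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server_ip, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable or refused errors.
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    # The top flag bit (QR) is set on responses.
    return reply_id == query_id and bool(flags & 0x8000)
```

Calling `answers_dns` with Box A's old address should return `True` once the stand-in is up, and `False` after the timeout if nothing is listening. It only checks that some response arrives, not that the answer is correct.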

Still, you’ve got to laugh, don’t you?

5 Responses to DNS, DHCP, Squid and Network syslog all on the same box …

  1. Probably showing my ignorance here but…

    a) Why weren’t your DHCP servers defining primary and secondary DNS servers, and b) if they were why weren’t the clients resolving correctly?

    Cheers,

  2. Nasty 🙁 Still – at least you’re playing with a Linux box, and not faffing about with a failed WinXP install like me….

  3. why weren’t the clients resolving correctly?

    Each client gets three DNS servers via DHCP … however, under Win98 at least, it “keeps on trying” the first one, even if it is down. i.e. they don’t “get the hint”. This is most definitely a bug, but something I can’t easily fix.

  4. A dodgy PSU or CPU/system fan can do horrible things like that. Try changing them.

  5. Dodgy PSU

    This PSU is rather non-standard and doesn’t appear to match any of my spares.

    CPU/system fan

    Ditto. We’re getting a replacement box sourced …

