How to build a rescue VLAN and why you might want to

This article looks at why I needed a rescue VLAN inside my colocation facility, and how I set it up so that other rack occupants can use it too.

System upgrades gone bad.

This weekend I spent time updating a three-node Docker Swarm that I run in my rack at the colo. The update rolled Debian 10.5 out to these machines, and in the process grub ate itself. After rebooting two of the three servers, we got the dreaded:

error: symbol `grub_calloc' not found.
Entering rescue mode...
grub rescue>

This is a pretty simple fix if you can get into a Linux live environment or a rescue image from the installer. However, these are Supermicro servers, with a version of their IPMI that only lets you mount ISOs that sit on an SMB share local to the IPMI device.

At this point, I had three options: drive to the rack and fix it on location (not my first choice), set up a file server with SMB shares (also not my first choice), or create an environment that can PXE boot any device out there. I decided to go the PXE route.

Creating a PXE Environment in a remote location

There are a couple of different ways to PXE boot in this environment. One of the easier ones is netboot.xyz. They do the heavy lifting of keeping updated versions of operating systems listed and working for you. You can also self-host it, but for this I decided to just use their hosted system so we didn’t have to maintain it ourselves.

First, I selected an empty VLAN and assigned it an IP block behind the firewall that would allow NAT out. Once that was done, I set up a Debian VM to act as the DHCP server for this VLAN. If we already had a DHCP server somewhere in this environment, I would have just added the subnet and options there.

Once the Debian server was set up, I installed the packages I needed (isc-dhcp-server and tftpd-hpa) and downloaded the PXE boot file from netboot.xyz.
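The install and download steps can be sketched roughly as the script below. The download URL and the /srv/tftp directory (the usual tftpd-hpa root on Debian) are assumptions rather than details recorded in this article, so check both first. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: install the DHCP and TFTP servers, then fetch the netboot.xyz boot file.
# Assumptions: /srv/tftp is the tftpd-hpa root (the Debian default) and the URL
# below is where netboot.xyz publishes its legacy-BIOS .kpxe image.
TFTP_ROOT="${TFTP_ROOT:-/srv/tftp}"
BOOT_URL="${BOOT_URL:-https://boot.netboot.xyz/ipxe/netboot.xyz.kpxe}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

setup_rescue_server() {
    run apt-get update
    run apt-get install -y isc-dhcp-server tftpd-hpa
    # Place the boot file where the TFTP server can hand it out.
    run wget -O "$TFTP_ROOT/netboot.xyz.kpxe" "$BOOT_URL"
}

setup_rescue_server
```

Run it once as-is to review the commands, then again as root with DRY_RUN=0 to apply them.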

Then, I edited /etc/dhcp/dhcpd.conf with the following:

option domain-name "rescue.n3bbq.org";
option domain-name-servers 1.1.1.1, 8.8.8.8;
default-lease-time 600;
max-lease-time 7200;

ddns-update-style none;

subnet 172.17.30.0 netmask 255.255.255.0 {
    range 172.17.30.50 172.17.30.250;
    option routers 172.17.30.1;
    # Our DHCP Server address is also TFTP
    next-server 172.17.30.2;
    filename "/netboot.xyz.kpxe";
}

This makes 201 IP addresses available (.50 through .250, inclusive) and tells machines that are PXE booting to pull the netboot.xyz.kpxe file from the TFTP server.
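Before pointing real hardware at the new config, it can help to syntax-check it and restart both services. The following is a minimal sketch that assumes the stock Debian service names (isc-dhcp-server and tftpd-hpa); it prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: syntax-check dhcpd.conf, then restart the DHCP and TFTP services.
# Assumes the stock Debian service names isc-dhcp-server and tftpd-hpa.
CONF="${CONF:-/etc/dhcp/dhcpd.conf}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

apply_dhcp_config() {
    # -t parses the config file and exits without serving any leases.
    run dhcpd -t -cf "$CONF" || return 1
    run systemctl restart isc-dhcp-server
    run systemctl restart tftpd-hpa
}

apply_dhcp_config
```

Because the services are only restarted after the syntax check passes, a typo in the config does not take down a DHCP server that is already running.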

Setting a test VM to PXE boot was an easy way to confirm the configuration worked before reconfiguring the switch ports for the servers in question.

Rebooting the servers into netboot.xyz

Next, I changed the switch configuration to put the server's port on the rescue VLAN and rebooted the server via IPMI. I watched its console and told the machine to PXE boot for this boot only.
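The one-time PXE boot was chosen from the IPMI console here, but the same request can be made from the command line with ipmitool, where the BMC allows it. This is only a sketch: BMC_HOST and BMC_USER are hypothetical placeholders, and the commands are printed rather than run unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: ask the BMC for a one-time PXE boot, then reset the server.
# BMC_HOST and BMC_USER are hypothetical placeholders. With -E, ipmitool
# reads the password from the IPMI_PASSWORD environment variable.
BMC_HOST="${BMC_HOST:-bmc.example.net}"
BMC_USER="${BMC_USER:-ADMIN}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Wrap the connection options shared by every ipmitool call.
ipmi() {
    run ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E "$@"
}

pxe_boot_once() {
    # Without options=persistent, bootdev applies to the next boot only.
    ipmi chassis bootdev pxe
    ipmi chassis power reset
}

pxe_boot_once
```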

It booted into netboot.xyz, and I selected Linux Installers -> Debian -> Debian 10 -> Rescue Mode. Rescue mode reassembled the software RAID arrays automatically, and I mounted them (/ and /boot). Then I could install grub to both /dev/sda and /dev/sdb (since they are mirrors). A reboot from disk confirmed the fix.
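As a rough sketch, the grub reinstall from the rescue shell might look like the following. It assumes the shell is chrooted into the installed system with / and /boot mounted, and it uses the disk names mentioned above. The commands are printed rather than run unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: reinstall GRUB on both mirror members from the rescue-mode shell.
# Assumes a chroot into the installed system with / and /boot mounted.
# Disk names are the ones from this article; confirm them with lsblk first.
DISKS="${DISKS:-/dev/sda /dev/sdb}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

reinstall_grub() {
    # Check that the RAID mirrors are assembled before touching the boot code.
    run cat /proc/mdstat
    for disk in $DISKS; do
        # Write GRUB to each mirror member so either disk can boot alone.
        run grub-install "$disk" || return 1
    done
    # Regenerate /boot/grub/grub.cfg.
    run update-grub
}

reinstall_grub
```

Installing to both disks matters with mirrors: if only one member carries the boot code, the server stops booting as soon as that disk fails.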

With the rescue VLAN already in place, fixing the second system was fairly quick: all it took was moving its switch port onto the VLAN.

Rescue vlan - Sun, Sep 13, 2020