So, everyone would agree that a helping hand is nice to have now and then. Like the time I thought it would be a good idea to skateboard while holding onto my brother’s car as he drove down the street. It was his helping hand reaching down to pick me up off the road (bleeding) and sitting me in the car that I won’t forget (I still have that scar on my hip). It was brother helping brother – an understanding that when one is down, the other will help get him on his feet (hopefully before mom sees so that we could get our story coordinated as to how it happened). In the UCS world, the brothers in this scenario are the Fabric Interconnects (I’m not sure who the mother is).
There are times when a Fabric Interconnect might encounter a software failure – for whatever reason, and land at the “loader” prompt. It’s rare, but it can happen. The loader prompt is a lonely place and it’s not pleasant. The good news is that if you still have a single FI working, you can use it to resurrect the broken FI. First off, if you ever find yourself staring at the loader prompt, stop cursing and just try to unplug it and plug it back in. Don’t worry if “dir” shows no files – just try it. I’ve seen it work and the FI comes right back on the next boot. If that doesn’t work, you have some work to do…
The loader is just that: a loader. It “loads” an OS, like a bootstrap. You need 3 files to permanently get out of the loader: the kickstart, system, and UCSM images. Luckily, all of these live on your remaining FI. The bad news is you can’t get to them without bringing it down as well. So, if you’re in production and can’t afford to bring down the entire UCS pod, you should stop reading and call TAC. They can get you the 3 magic files you need and can get it all running without bringing anything additional offline. But if you’re in a situation where you can afford to take down the remaining FI, you can fix this problem yourself.
To make this work, you will need:
- Non-functional FI
- Functional FI
- FTP Server
- TFTP Server
Your basic recovery will include:
- Disconnect the L1/L2 cables between the FIs to avoid messing up the cluster data they share
- Boot FI-A to loader
- Force FI-B to loader
- Boot kickstart on FI-B
- Assign IP address to FI-B
- FTP kickstart, system, and ucsm images from FI-B to an FTP server
- Reboot FI-B back to its normal state
- Get kickstart image onto TFTP server (unless FTP/TFTP are the same server)
- Boot kickstart image on FI-A via TFTP server
- FTP kickstart, system, and ucsm images down to FI-A
- Copy ucsm image file to the root
- Load system image on FI-A
- “Activate” the firmware on FI-A
- Connect L1/L2 cables back and rejoin the cluster
Reboot the “good” FI (known in this document now as FI-B), and begin pressing CTRL+R to interrupt the boot process. You will find FI-B now stops at the loader prompt too. Now type
boot /installables/switch/ <tab>
which will show you all files in this folder. You are looking for the obvious kickstart file and you want the latest one. To make the display easier to read, I would type this:
boot /installables/switch/ucs-6100-k9-kickstart <tab>
Backspace is not an option so if you make an error, use the arrow keys and the “delete” key to fix typos.
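If the tab completion shows a pile of kickstart files, picking “the latest one” by eye is easy to get wrong, because 2.10 sorts before 2.4 when compared as plain text. Here is a minimal Python sketch that compares the numeric parts of the version instead. The file names in the listing are hypothetical (modeled on the naming used in this post), and treating the numbers left to right as the release order is my assumption, so sanity-check the result against what you know is installed.

```python
import re


def version_key(filename):
    """Turn the version part of an image name into a tuple of integers."""
    # Everything after the first dot is treated as the version,
    # e.g. "5.0.3.N2.2.10.418.bin" -> (5, 0, 3, 2, 2, 10, 418).
    _, _, version = filename.partition(".")
    return tuple(int(n) for n in re.findall(r"\d+", version))


def latest_image(filenames, prefix="ucs-6100-k9-kickstart"):
    """Return the highest-versioned file whose name starts with prefix."""
    candidates = [name for name in filenames if name.startswith(prefix)]
    if not candidates:
        raise ValueError(f"no file starting with {prefix!r}")
    return max(candidates, key=version_key)


# Hypothetical listing; your file names and versions will differ.
listing = [
    "ucs-6100-k9-kickstart.5.0.3.N2.2.4.26.bin",
    "ucs-6100-k9-kickstart.5.0.3.N2.2.10.418.bin",
    "ucs-6100-k9-system.5.0.3.N2.2.10.418.bin",
]
print(latest_image(listing))
```

You would run this on your laptop against names typed or pasted from the console, not on the FI itself.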
Select the latest image, hit enter, and FI-B now begins to boot the kickstart image. Give it a few minutes and you should find it stops at the “boot” prompt. This prompt is not as lonely as the loader prompt, but it’s still not a fun place to be (at least you can backspace now). You actually will have much more functionality than you did with the loader, but won’t need it for this exercise. At this point you need to assign an IP address to FI-B so that you can FTP the kickstart image to an FTP server. The commands will look like this:
#Config t
#int mgmt 0
#ip address X.X.X.X <mask>
#no shut
Wait 10-15 seconds
#<ctrl+z to return the shell to the top level>
# copy bootflash:installables/switch/ucs-6100-k9-kickstart <tab>
Select the latest version and copy it to the FTP server.
DO NOT USE THE FILES IN THE ROOT OF BOOTFLASH AT ANY TIME DURING THIS PROCESS. Nothing catastrophic will happen, but the FI will not boot in the end.
The shell will prompt you for the FTP server address and credentials.
Remember to allow about 10-15 additional seconds after you “no shut” the interface for the IP to become active and usable.
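If the copy just hangs, one thing worth ruling out is a typo in the address or mask. The steps above never configure a gateway, so I am assuming the FTP server sits on the same subnet as the address you gave mgmt 0; if your setup routes between them, this check does not apply. A minimal Python sketch, using made-up documentation addresses:

```python
import ipaddress


def same_subnet(fi_address, netmask, server_address):
    """Return True if server_address falls inside the FI's mgmt subnet."""
    # ip_interface accepts "address/netmask" as well as "address/prefixlen".
    interface = ipaddress.ip_interface(f"{fi_address}/{netmask}")
    return ipaddress.ip_address(server_address) in interface.network


# Hypothetical addresses; substitute the ones you typed on FI-B.
print(same_subnet("192.0.2.10", "255.255.255.0", "192.0.2.50"))
print(same_subnet("192.0.2.10", "255.255.255.0", "198.51.100.5"))
```

The first call reports that the server is reachable without a gateway, the second that it is not.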
You now need to copy the system and UCSM images as well, since you will need them soon enough. The other two files will look something like:
installables/switch/ucs-manager-k9.2.1.0.418.bin
installables/switch/ucs-6100-k9-system.5.0.3.N2.2.10.418.bin
Again, your versions will be different.
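Before you reboot FI-B, it is worth confirming that all 3 files really made it to the FTP server, because going back for a missing one means taking FI-B down again. Below is a minimal Python sketch that checks a directory listing for one file of each kind. The name fragments it looks for are assumptions based on the file names shown in this post, and the listing is hypothetical; in practice you could feed it the output of something like `ftplib.FTP.nlst()` or just a pasted `ls`.

```python
# Name fragments assumed from the file names used in this post.
REQUIRED = {
    "kickstart": "-kickstart.",
    "system": "-system.",
    "ucsm": "ucs-manager-",
}


def missing_images(filenames):
    """Return the kinds of image that do not appear in filenames."""
    return [
        kind
        for kind, marker in REQUIRED.items()
        if not any(marker in name for name in filenames)
    ]


# Hypothetical listing from the FTP server; the kickstart copy is absent.
on_server = [
    "ucs-manager-k9.2.1.0.418.bin",
    "ucs-6100-k9-system.5.0.3.N2.2.10.418.bin",
]
print(missing_images(on_server))
```

An empty list means you have one of each and can move on.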
Once all 3 files are copied to the FTP server and you are returned to the boot prompt, you should boot FI-B back into production. You now need to make the kickstart file available via TFTP, using whatever process you normally use to make that happen. One word of caution here: TFTP blows. It runs on UDP, it’s slow, and it has no error checking. If your first attempt at booting fails, try a different TFTP server program (trust me on this, I had bruises on my head from banging it on the wall). Once the file is available via TFTP, return to FI-A, which is at the loader prompt. You will now boot that kickstart image via TFTP using these commands:
Incidentally, you cannot ping this address from an outside station. Just FYI
Then it begins loading the image. It should take just a few seconds to actually start booting. FI-A will now land at the boot prompt like FI-B did earlier. You need to rebuild the filesystem on FI-A, so type:
#init system
This will take a few minutes. When it’s done, you can now use FTP to copy the 3 files down to FI-A. Use this command to retrieve each file:
#Copy ftp: bootflash:
The shell will prompt you for everything it needs to copy the files.
After all 3 are copied, one very important command needs to be run now and it won’t make sense, but you must do this. You need to run this command:
Copy bootflash:/ucs-manager-k9.2.1.0.418.bin bootflash:/nuova-sim-mgmt-nsg.0.1.0.001.bin
The nuova-sim-mgmt-nsg.0.1.0.001.bin is an exact name that is needed here.
Now that you have all 3 files local on the FI, you would be able to recover much more quickly if the FI were to lose power. At this moment, if that happened, you would be returned to the loader prompt, but you would be able to boot via bootflash instead of TFTP.
Anyway, you are now at the boot prompt and need to finish booting. Type load bootflash://ucs-6100-k9-system.5.0.3.N2.2.10.418.bin. This will start loading the system image and, when it’s done loading, it will look for the UCSM image that you also copied and the FI should come up. It will walk you through the setup menu and, since the L1/L2 cables are not connected, I would go ahead and set it up as standalone; we will join it to the cluster soon.
Once you are logged into the FI, you need to activate the current firmware to set the startup variables. The easiest way to do this is in the GUI. Just go into Firmware Management, select “Activate Firmware”, and select the FI. You will likely see that no version is in the startup column. Regardless, you need to activate the version that is already there. If it doesn’t let you, exit Firmware Management, navigate to the Fabric Interconnect in the left-side tree menu, and activate the firmware from there using the “force” option. This will also fix up the ucsm image file that we copied to the root (it turns it into a symbolic link).
That’s about it. You should be OK to erase the config on FI-A (#connect local-mgmt), hook up the L1/L2 cables, and rejoin the cluster on the next reboot. I really hope you don’t ever need to use this… I mainly wrote this blog for myself because in the lab we do a lot of crazy stuff and I often forget a step here and there. So I wanted it all written down to refer back to, and I’ve wanted to get this one done for quite some time.
Thanks for reading…
-Jeff