[This article was originally posted by me on March 24, 2016 on NebulousIT.com]
So, anyone familiar with computers and networking knows that disasters are not a question of IF…they are not even a question of WHEN…they are a statement of fact that WILL come to pass. To avoid an unrecoverable situation, you should always make backups. They are essential on most platforms, and Cisco ACI is no different. Thankfully, backups are made easy inside the Application Policy Infrastructure Controller (APIC). Not only are backups easy, but they can be automated and scheduled so you don’t even have to think about them. Pretty cool stuff. What we’re going to walk through in this article is:
- What is “Fabric Recovery” and can’t I just import a saved configuration?
Let’s start with Fabric Recovery and we’ll make our way to importing a saved config. To make sure we’re all on the same page, let’s discuss the basics of fabric discovery in ACI. When you initially install APIC and connect the cluster to the leaf switches, you are essentially having all APICs and all switches agree on a number of essentials such as the VTEP address pool, the admin password, the size of the APIC cluster (the number of nodes), the name of the fabric, and a few other things. Most of these values are supplied by the user during the APIC install. Once the switches have been adopted, they cannot become part of any other fabric without having their configs completely erased. Switches that are part of one fabric are essentially invisible to a foreign APIC trying to communicate with them.

All of the APICs share a common database that is synchronized between the cluster members in real time. A change made on any APIC is reflected on all nodes. This is a great thing – unless some sort of corruption is introduced, because that corruption may also get replicated. I should start by saying that we have never had an issue with database corruption – ever. We essentially created a solution to a problem that doesn’t exist – but we’re not foolish enough to think that it’s not possible. If this corruption actually happened and all nodes were unable to operate properly, you’d need to follow this procedure.

First, you have to be running APIC 1.2 (Brazos) or later; we didn’t implement the feature prior to that. I almost forgot to tell you the best part – we can recover the database to a known good state without affecting traffic flow on the fabric (like brain surgery on a patient who is awake)! This entire process is documented in painful detail in the following KB article (which you should follow if you are doing this): http://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/b_KB_Recovering_the_Fabric.html.
My intention in the steps below is to provide some background about the process. Here we go:
1) Backups are essential. As mentioned earlier, the ACI framework allows backups to a remote location and even has a scheduler to back up the config however often you desire. So, we assume you have a good backup for this process. MAKE CERTAIN YOU ENABLE ENCRYPTION on your backups (it’s a simple checkbox and you enter a password). This is not optional if you want to fully recover – you must do this if you intend to restore secure properties like VMM passwords or even the password to APIC itself.
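If you prefer to create the export policy programmatically instead of via the GUI, it corresponds to the configExportP object in the APIC REST API. Below is a minimal sketch of how such a payload might be built; the policy name and remote-path name are made-up placeholders, and you should verify the exact attribute names against the API reference for your APIC version:

```python
import json

def build_export_policy(name, remote_path_name, fmt="json"):
    """Build a configExportP payload for POST to
    /api/node/mo/uni/fabric/configexp-<name>.json.
    includeSecureFields='yes' keeps secure properties (VMM passwords,
    etc.) in the backup so they can be restored later -- this pairs
    with the fabric-wide AES encryption passphrase (the 'checkbox and
    password' mentioned above)."""
    return {
        "configExportP": {
            "attributes": {
                "name": name,
                "format": fmt,                  # 'json' or 'xml'
                "includeSecureFields": "yes",   # needed for full recovery
                "adminSt": "triggered",         # kick off the export
            },
            "children": [
                # reference a previously defined remote location (scp/ftp)
                {"configRsRemotePath": {
                    "attributes": {"tnFileRemotePathName": remote_path_name}}}
            ],
        }
    }

payload = build_export_policy("nightly-backup", "backup-server")
print(json.dumps(payload, indent=2))
```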
2) SSH to each APIC and run the command ‘acidiag touch clean’ and then ‘acidiag reboot’ (there is a simpler command called ‘eraseconfig setup’ that combines these two steps, but it may not be implemented in your version of APIC).
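If you are scripting the wipe, a tiny helper like the following keeps the command order straight (the APIC hostnames are hypothetical, and you would still run the commands over SSH yourself):

```python
def wipe_commands(use_eraseconfig=False):
    """Return the CLI sequence to wipe one APIC back to factory state.
    'eraseconfig setup' combines both steps on releases that have it;
    the two-command form works everywhere."""
    if use_eraseconfig:
        return ["eraseconfig setup"]
    return ["acidiag touch clean", "acidiag reboot"]

# run the same sequence on every cluster member
for apic in ["apic1", "apic2", "apic3"]:
    for cmd in wipe_commands():
        print(f"{apic}$ {cmd}")
```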
3) When APIC1 comes back online, you will be able to log in to it using your previous admin credentials. In addition to retaining your login credentials, it will also retain the original VTEP addresses, Fabric Name, OOB mgmt addresses, and multicast addresses. But it knows nothing about the fabric – not a single switch will show up in the topology or fabric membership. You will NOT be able to log in to APIC2 or APIC3 at this stage, which is totally normal – just like when you built the cluster but had not done fabric discovery. You see, the APICs use inband communication via the ‘infra’ tenant to see each other and exchange cluster credentials, etc. If the fabric has not been discovered (or at least the initial leaf), the ‘infra’ tenant doesn’t see any other nodes. Maybe you never noticed, but when setup runs on APIC2 and APIC3, it doesn’t ask for a password because the password gets synced from APIC1. Right now, APIC2 and APIC3 cannot see APIC1, so no password exchange can happen. At this stage of recovery, we cannot simply re-discover the fabric, because the switches will assume we are a foreign APIC trying to take over. The switches will invoke their ‘cloak of invisibility’ and we will not see them at all if we attempt to rediscover them. So we need a way around that.
4) In the APIC GUI, create an import job to bring our old config back. We will not run this import job manually. That’s very important – I’ll say it again – we are not starting this job, we are simply creating it. Part of creating the job is specifying a remote source where the backup file exists. Use your remote server address and credentials to access the backup file. The recovery process will use it as needed without us invoking it manually. On the off chance that you accidentally import the config manually, just run ‘acidiag touch clean’ again on that one APIC and start over. Technically speaking, you would probably be fine to proceed, because your import job will REPLACE the config, but I would erase again to be safe. It should look like this:
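For the REST-inclined, the import job corresponds to a configImportP object. A rough sketch of the payload follows – note that adminSt stays ‘untriggered’, mirroring the “create it but don’t run it” rule above. The job name, file name, and remote-path name are placeholders, and the attribute names should be checked against your APIC version’s API documentation:

```python
def build_import_job(name, file_name, remote_path_name):
    """configImportP payload for POST to
    /api/node/mo/uni/fabric/configimp-<name>.json.
    adminSt stays 'untriggered': the recovery procedure reads this job
    itself, so we create it but never trigger it by hand."""
    return {
        "configImportP": {
            "attributes": {
                "name": name,
                "fileName": file_name,
                "importType": "replace",    # full replace, not merge
                "importMode": "atomic",
                "adminSt": "untriggered",   # do NOT start manually
            },
            "children": [
                # reference the remote location holding the backup file
                {"configRsRemotePath": {
                    "attributes": {"tnFileRemotePathName": remote_path_name}}}
            ],
        }
    }

job = build_import_job("recovery-import", "fabric-backup.tar.gz", "backup-server")
```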
5) Now, we SSH to APIC1 and login. From the APIC1 command line, run the following command:
#trigger fabric-recovery 3 <import job name>
The ‘3’ is the cluster size (3 APICs in most cases, but it could be more or fewer). The process now begins and you get to watch APIC recover. During this process, APIC begins sending out modified LLDP frames with a special TLV. Essentially, they tell the switches to trust the APIC that is sending the frames as authoritative for the fabric. The recovery process finds the first APIC (the one you are running it on), and then the first leaf it is attached to. It then uses LLDP to interrogate that leaf and finds any other APICs attached to it, as well as any attached spines. It slowly builds a map of the fabric, and you can be logged into the GUI of APIC1 the whole time and watch this happen in real time (the CLI will also give you feedback about what is going on). Once the recovery process completes, it will gracefully exit and you’ll have a fully configured fabric back – as well as your entire configuration (EPG state, AAA settings, VMM integration, user accounts, and anything else you set up in APIC). This is such a cool process – and NO PACKET LOSS on the fabric the entire time. As each APIC is discovered, you will also be able to log in to that APIC. It should look something like this:
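If you want to watch discovery progress from a script rather than the GUI, polling the fabricNode class is one way to do it. The sketch below parses a canned response shaped like what APIC returns from GET /api/class/fabricNode.json (the node names and IDs here are made up):

```python
import json

# canned response in the shape returned by /api/class/fabricNode.json,
# trimmed to the attributes we care about
sample = """
{"totalCount": "3", "imdata": [
  {"fabricNode": {"attributes": {"id": "1",   "name": "apic1",  "role": "controller", "fabricSt": "unknown"}}},
  {"fabricNode": {"attributes": {"id": "101", "name": "leaf1",  "role": "leaf",  "fabricSt": "active"}}},
  {"fabricNode": {"attributes": {"id": "201", "name": "spine1", "role": "spine", "fabricSt": "active"}}}
]}
"""

def discovered_nodes(raw):
    """Return (id, name, role, fabricSt) for every node APIC has found,
    so repeated polls show the fabric map filling back in."""
    doc = json.loads(raw)
    return [
        (a["id"], a["name"], a["role"], a["fabricSt"])
        for item in doc["imdata"]
        for a in [item["fabricNode"]["attributes"]]
    ]

nodes = discovered_nodes(sample)
for node in nodes:
    print(node)
```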
One thing that might come up is that you fat-finger the ftp/scp credentials or the file name used to access your backup. It happens (I know). The good news is that you can go fix the authentication (or filename, or whatever you missed) and simply restart the recovery. It will know where it was and pick up at that point – we tried to think of everything.
This is something you can easily test in the lab – and I encourage you to do so. The process is very straightforward, but you should try it once in the lab before you need to do it for real. To set your mind at ease about this process…the fabric is live the whole time and you’re not dropping packets, but you really aren’t doing much to the fabric anyway. You are mostly just rebuilding APIC. APIC discovers the live fabric as it actually exists and then reconciles it against the configuration backup. There is a small possibility of packet loss – but it depends solely on you. If the backup you are restoring is old and doesn’t contain the same data that is live on the fabric, we trust the backup. For instance, suppose you created some EPGs after you took your backup. Those EPGs are live on the leaf switches. When the recovery tries to reconcile, it will trust the backup when making decisions, and you may lose the newly created EPGs (or whatever you have created). This is for good reason – if the last change you made corrupted the database, it likely isn’t in your backup, so we don’t want to ingest the same possibly bad data.
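The reconciliation rule is easy to reason about as a set difference: anything live on the fabric but missing from the backup is at risk of being dropped. A tiny sketch (with hypothetical EPG names):

```python
def at_risk_objects(backup_epgs, live_epgs):
    """Objects live on the fabric but absent from the backup are the
    ones recovery will drop, because reconciliation trusts the backup."""
    return sorted(set(live_epgs) - set(backup_epgs))

backup = {"web-epg", "app-epg"}
live   = {"web-epg", "app-epg", "db-epg"}   # db-epg created after the backup
print(at_risk_objects(backup, live))  # ['db-epg']
```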
Back to our original question about Fabric Recovery vs. importing a saved config…you could wipe the config off of each APIC (acidiag touch clean), then just import a saved config on APIC1 and reboot (acidiag reboot). Once you do this, APIC will know the switch IDs of each switch in the fabric, but will not know their topology. Remember, these switches don’t trust this APIC as being authoritative and will not communicate with it. Since you cannot communicate with the fabric, the infra tenant is unavailable to the cluster and the APICs cannot communicate with each other – so you cannot log in to APIC2 or APIC3. Importing your saved config does restore everything else about your APIC (AAA, VM Networking, tenants, L4-L7, etc.), but it’s not all usable. Anything that relies on inband communication through the fabric will fail. OOB stuff will likely all work. The bottom line here is that importing your config is not a recovery solution and is not intended as a disaster recovery option.
I hope you gained some insight from this post. Thanks for reading!